Data Management for the MADIVA project

November 30, 2023


Data Scientist


My role

In January 2022, I was co-opted to the Multimorbidity in Africa: Digital Innovation, Visualization and Application (MADIVA) study as a data scientist in the Data Management and Analysis Core (DMAC) team. The main goal of the project is to develop fundamental machine learning (ML) techniques that can have an impact on the problem of multimorbidity (MM) in different African settings. Implemented in Kenya and South Africa, the study is run by a consortium made up of the African Population and Health Research Center (APHRC), the University of the Witwatersrand, International Business Machines Corporation (IBM), the Sydney Brenner Institute for Molecular Bioscience (SBIMB) and the South African Population Research Infrastructure Network (SAPRIN). Within the consortium are smaller technical working groups charged with carrying out different components of the 5-year study. The diversity of the group makes for a fascinating experience, with a lot of learning and its share of ups and downs.

Navigating the technical and programmatic aspects of project implementation in such a setting requires a specific set of skills. Key among these are soft skills, which enable one to comprehend the wider picture of a task, an enterprise, or even a purpose. This includes communicating through various channels, such as meetings and virtual workshops, to keep each other informed on the status of the various work packages. Through MADIVA, I have learned how to synthesize technical information so that both technical and non-technical audiences can better understand the subject matter.

Effective data management requires close collaboration among team members. MADIVA is no different: I have continued to appreciate the benefits of teamwork with other data scientists, physicians, clinicians, and researchers from partner organizations.

The study has also been an opportunity to hone existing skills and learn new ones from colleagues, such as object-oriented programming (OOP), working with platforms such as Postgres, and working on the Wits cluster, a cloud platform provided by the University of the Witwatersrand. This was made possible through the numerous in-person and virtual networking events. Additionally, the meetings have helped foster a sense of belonging among team members and a culture of trust and mutual support.

These experiences brought to the forefront the significance of every stage in the data management pipeline, from an individual's role in data processing and integration through to data quality assurance, security and privacy, storage, and retrieval. When these activities are carried out as planned, working with data is much easier, and there is less back and forth spent seeking clarification.

Managing stakeholder expectations, scheduling work to meet deadlines with conflicting priorities, and organizing numerous data custodians to provide a consistent supply of the datasets required for the project were some of the unique challenges that this role presented.

Meeting these challenges is only possible if scientists and data managers work together from the outset, drafting protocols and creating tools with a global perspective in mind and in adherence to set standards. This makes data integration easier when it becomes necessary. Teams should also use platforms that facilitate interoperability and improve cooperation.


MADIVA is an acronym for “Multimorbidity in Africa: Digital Innovation, Visualization and Application”. Multimorbidity is the co-existence of two or more chronic health conditions, such as diabetes and hypertension or any other combination. The broad aim of the MADIVA project is to develop fundamental machine learning (ML) techniques that can have an impact on the problem of multimorbidity (MM) in different African settings, by working with an interdisciplinary team at different sites in Africa. The specific goals of the project are: understanding MM and how public health interventions affect it; pioneering the use of polygenic risk scores and public precision medicine; understanding how data science (DS) can support sub-county health management teams (SCHMTs); developing tools for visualizing and understanding our data; and developing novel ML techniques.