The ideal data scientist is a Frankenstein monster of many skillsets
The second UN World Data Forum was held last month in Dubai, the United Arab Emirates. The biennial forum provides an opportunity for various actors in the data ecosystem to drive the use of quality data for the attainment of the Sustainable Development Goals (SDGs) by 2030.
The African Population and Health Research Center (APHRC) had a delegation at the Forum to speak at a session titled “Data Revolution in Africa: advances and experiences at the regional, national and subnational levels”.
The session’s objective was to highlight the data value chain, from the community to research to government policymakers and other decision-makers. It highlighted three particular developments: online data-sharing portals that provide for the curation of population-based data, partnerships with government to improve data systems, and demographic surveillance in African slum communities.
In the very last plenary session on the last day of the Forum, one of the panelists, Nnenna Nwakanma, posed this question to all of us: what are your takeaways from the UN Data Forum?
One of my greatest takeaways from attending several of the parallel sessions and events was that we need to think carefully about how we define the term ‘data scientist’. This matters because of the increasing awareness of the importance of data in everyday life and the subsequent growth of a discipline known as “data science”.
This got me thinking a lot about data science and what it really means. O’Neil and Schutt (in their book Doing Data Science) define a data scientist as:
“Someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time in the process of collecting, cleaning, and munging data, because data is never clean. This process requires persistence, statistics, and software engineering skills—skills that are also necessary for understanding biases in the data, and for debugging logging output from code.”
Further digging revealed that the ideal data scientist must possess programming skills, statistical skills, machine learning abilities, multivariable calculus and linear algebra, data wrangling, data visualization and communication, software engineering and data intuition! Is it possible for one human being to possess all these skills and be good at all of them?
Yes, it is, if one has superhuman intelligence and time-defying abilities to acquire the requisite education and apprenticeship within their lifespan. Or if one were Frankenstein’s monster, composed of various parts of a data communicator, statistician, mathematician, web developer and product manager, among others, as demonstrated by Stephan Kolassa.
Clearly, the task of finding the ideal data scientist would require Herculean efforts. It is rare to find the desired skill set and experience in one individual. However, it is possible to build a data science team composed of experts in each of the individual skills listed above. And the team, importantly, must also be multicultural and gender-diverse.
Why is the development of a data science team important for APHRC? The Center has been at the forefront of research for evidence-based decision-making in Africa since 2002. The rich population and urban health data generated by various research programs are our most valuable asset, and they’re housed in our online data-sharing platform where they are available for use by various actors upon request.
Thus far, the data have been analyzed descriptively and have provided historical information on a variety of subjects. However, as shown in the next figure, in order for the value of the data to increase, they must mature analytically by undergoing diagnostic, predictive and then prescriptive analytics.
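To make the distinction between these stages concrete, here is a minimal sketch in Python. It is only an illustration: the yearly counts, the capacity figure and every function name are hypothetical and do not come from APHRC data or systems. The diagnostic stage is left out, and a simple threshold rule stands in for the ML-driven prescriptive stage.

```python
from statistics import mean

# Hypothetical yearly counts, invented purely for illustration (not APHRC data).
years = [2014, 2015, 2016, 2017, 2018]
clinic_visits = [100, 110, 120, 130, 140]

def describe(values):
    """Descriptive analytics: summarize what has already happened."""
    return {"mean": mean(values), "min": min(values), "max": max(values)}

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_bar - slope * x_bar
    return slope, intercept

def predict(xs, ys, new_x):
    """Predictive analytics: extrapolate the fitted trend to a new point."""
    slope, intercept = fit_line(xs, ys)
    return intercept + slope * new_x

def prescribe(forecast, capacity):
    """Prescriptive analytics (toy rule): turn a forecast into a recommendation."""
    return "expand capacity" if forecast > capacity else "maintain capacity"

summary = describe(clinic_visits)
forecast_2019 = predict(years, clinic_visits, 2019)
recommendation = prescribe(forecast_2019, capacity=145)  # capacity is an assumed figure
```

The same records feed all three functions; what changes is the question being asked of them, moving from "what happened?" to "what is likely to happen?" to "what should we do about it?".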
Prescriptive analysis requires the use of Machine Learning (ML) and Artificial Intelligence (AI), both of which “learn their intelligence” from a human being, not just from the data that they’re fed. For example, a team at the Massachusetts Institute of Technology found an AI’s income predictions “shockingly racist and sexist” because of the nature of the data that it had been fed.
Without the intervention of the team, who adjusted the algorithms, the AI’s predictive abilities would have failed disenfranchised groups such as women and minorities. We see then that a diverse data science team is able to curtail the production of AI that is biased, especially in cases where the data used to build the AI were already biased in the first instance.
Understandably, assembling such a team comes with its own financial implications. This is where reaching for the low-hanging fruit first, before scaling up to full-blown operations, would be ideal. This involves two things: first, taking advantage of ML platforms available from Google or Microsoft to perform basic operations and second, capitalizing on expertise currently available at the Center. This can be followed by a scaled-up restructuring into a team that has the requisite expertise, along with investment in infrastructure that supports ML and AI.
The roles in the data science team would include, among others, a Data Visualisation Engineer, an ML Engineer, and a Data Journalist.
The eventual data science team must consider the type of data science they want to pursue. That will be determined in part by the team’s makeup: it could concentrate on cleaning and analyzing data for use in forecasting, modeling and visualization (an analytical team) or on software engineering to develop products using existing data (a building team).
A diverse team would drive the production of high-quality and varied research products that provide informed prescriptions for inclusive, evidence-based decision-making. This, in turn, would steer the Center’s vision to transform lives in Africa through research as well as Kenya’s efforts to leave no one behind in the implementation of the SDGs.
*This article was originally published in Apolitical.