The Full Case Study
Estimating Population with Cell Phone Records
Let us look at an introductory example where we apply the four conceptual lenses (positionality, power, narratives, sociotechnical systems) to just one phase of the data science lifecycle. After that, you can try using these lenses to think about other lifecycle stages. For this example, we will consider the “Data Discovery” stage of work when using mobile phone records to estimate populations and economic status.
Enumerations of populations and their economic status are collected for many essential purposes: determining electoral representation, allocating tax dollars, or deploying public health interventions. A census is a complete count of all residents according to where they live, including their income status and other information. Many countries conduct a census at intervals, such as every ten years.
However, censuses are very expensive and time-consuming. In recent years, researchers have been experimenting with alternative data sources and analysis methods that could estimate populations and economic status more quickly and cheaply. In particular, mobile phone records are being explored as a data source that would allow governments to infer these measurements (Deville et al., 2014; Blumenstock et al., 2015). This approach is seen as particularly promising in countries where census taking is hindered by conflict or a lack of resources. While this offers a promising path forward, it is not without ethical challenges.
Now we will explore the ethical issues that surface when estimating population size and economic status from mobile phone records, applying the lenses of positionality, power, narratives, and sociotechnical systems to the Data Discovery stage. Recall that Data Discovery includes identifying potential data sources, filtering data, and cleaning, transforming, and integrating data into a usable dataset.
Consider using the lens of positionality to examine the Data Discovery stage of a project using mobile phone records to estimate population size and economic status. In this example, an essential part of data discovery is considering privacy concerns for people represented in the data.
With sensitive information like phone records, researchers typically perform a step during data discovery to prevent individuals from being identified – either by anonymizing the data, aggregating it, or introducing noise into the dataset.
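To make the aggregation and noise options concrete, here is a minimal sketch in Python. It assumes each record is a hypothetical (subscriber ID, region) pair; the function name `aggregate_with_noise` and the parameters `min_count` and `epsilon` are illustrative choices, not taken from any of the studies cited above. It counts distinct subscribers per region, withholds regions with very few subscribers, and perturbs the remaining counts with Laplace noise.

```python
import random


def aggregate_with_noise(records, min_count=5, epsilon=1.0, seed=None):
    """Release noisy per-region subscriber counts.

    records: iterable of (subscriber_id, region) pairs (hypothetical schema).
    min_count: regions with fewer distinct subscribers are withheld.
    epsilon: noise parameter; smaller values add more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)

    # Aggregate: collect the distinct subscribers seen in each region.
    subscribers_by_region = {}
    for subscriber_id, region in records:
        subscribers_by_region.setdefault(region, set()).add(subscriber_id)

    released = {}
    for region, subscribers in subscribers_by_region.items():
        count = len(subscribers)
        if count < min_count:
            continue  # suppress small cells that could single people out
        # The difference of two exponential draws with rate epsilon is a
        # Laplace draw with scale 1/epsilon.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        released[region] = max(0, round(count + noise))
    return released


if __name__ == "__main__":
    sample = [(f"subscriber-{i}", "region-A") for i in range(40)]
    sample += [("subscriber-x", "region-B")]  # too few to release
    print(aggregate_with_noise(sample, seed=0))
```

This is a sketch under the stated assumptions rather than a vetted privacy mechanism; in particular, thresholding on the true count is a simplification. Note also that everything here protects individuals: the released regional counts still show where groups of people are concentrated.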
What culturally derived assumptions are made when taking any of these typical approaches to privacy? Suppose, for instance, that you come from a culture that values individual liberties over collective well-being. You may assume that the most significant dimension of protecting data is preserving the anonymity of individuals, while overlooking the harm that can be done to segments of the population if their locations are revealed. Could a government take advantage of this method to monitor an oppressed, geographically concentrated minority, even if individuals within that population cannot be re-identified?
Having understood that their notions of ethics, privacy, and identity are culturally derived and situated, data scientists must decide how these questions affect their practice. Should they abandon the project if the state-of-the-art approaches to de-identification at their disposal can do nothing to protect the privacy of a persecuted ethnic minority in the aggregate? Or perhaps they have the opportunity to work with a dataset from a different country where that particular concern is less salient?
In the case of using mobile phone records to estimate population size, location, and economic status, it is essential to remember during the Data Discovery stage that for-profit businesses have the power to decide what data they will share, with whom, and under what conditions.
If a company has made mobile phone records available to you as a researcher, you will want to ask what motivates it to do so. Is the company acting altruistically, or does it have something to gain politically or financially by sharing its data from a particular country or city? Is it, perhaps, trying to win favor with a governmental body that is responsible for regulating it? Is it hoping to burnish its image by doing something “good” for the world?
We also want to consider who does not have power in this example.
Do mobile phone customers know that their data is being aggregated and analyzed this way? Do researchers in this space have an obligation to inform phone customers about their research or obtain consent from them?
Finally, the power lens draws attention to how power is not only gained and lost but also reconfigures the very playing field of human relationships. In this case, we can see how the capacity of data scientists to estimate the population without relying on the census mechanism further blurs the distinction between citizen and phone user/consumer.
When addressing these concerns in practice, the data science team has many options. They may decide they do not trust the data provider enough to use its data. Alternatively, they may take part in the project only after negotiating the terms of the data sharing agreement to ensure that it aligns with their values.
They may want to conceal the name of the data provider so as not to be complicit in the company’s public relations goals. Conversely, they may determine it is crucial to disclose the provider’s name for transparency and accountability. They may arrange to have the company contact customers and inform them of the research, or they may prominently post about their study methods on a public-facing website so that mobile phone customers or journalists can better understand the research.
And all the while, they must question whether and how these steps make their work more ethical. For example, if the data subjects in these mobile phone records come from rural communities with little Internet access in a country where few people speak English, then posting information in English on the website of a US-based institution probably does little to empower their data subjects.
When applying the lens of sociotechnical systems to the mobile phone records case, it is helpful to think about how the technical system that generated the mobile phone network data is entwined with the lives of real people who act in varied, inconsistent, and sometimes surprising ways. This means that as a researcher, one cannot necessarily make straightforward assumptions about what the data generated by the system represent.
For example, is your first inclination to assume that a mobile phone record represents one unique individual? In a place like the United States, it may be common for each person to have a phone of their own (or several). However, this is not the case in many parts of the world.
In some places, it is common for a whole household to share one mobile phone, for one person to use multiple phones, or for one phone to work with various SIM cards.
How can you account for this variety when deciding what the variables in your data represent during the data discovery phase?
Moreover, if you do not address this social complexity and variation, what are the repercussions in the world? If you over- or under-estimate a population, could that lead to people being deprived of critical resources or democratic representation?
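A small sensitivity sketch shows how much these usage patterns can move an estimate. The numbers below are hypothetical and chosen only for illustration, and `estimate_population` and `sims_per_person` are names introduced here, not part of any cited method. The idea is simply that a count of active SIM cards converts to a count of people only through an assumption about how many SIMs each person accounts for.

```python
def estimate_population(active_sims, sims_per_person):
    """Convert a count of active SIMs into an estimated number of people.

    sims_per_person is an assumption about local usage:
      below 1 -> phones are shared (0.25 = four people share one SIM)
      1       -> one SIM per person
      above 1 -> people carry several SIMs or phones
    """
    if sims_per_person <= 0:
        raise ValueError("sims_per_person must be positive")
    return round(active_sims / sims_per_person)


if __name__ == "__main__":
    active_sims = 10_000  # hypothetical count observed in one region
    scenarios = {
        "household of four shares one SIM": 0.25,
        "one SIM per person": 1.0,
        "two SIMs per person": 2.0,
    }
    for label, ratio in scenarios.items():
        print(f"{label}: {estimate_population(active_sims, ratio)} people")
```

With the same 10,000 SIMs, the three assumptions give 40,000, 10,000, and 5,000 people respectively. An eightfold spread like this is why the usage assumption needs local evidence rather than a default of one record per person.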
In response to realizing that such concerns exist, the data scientists on this project would likely take steps to gain more insights into how the data were produced and what they mean in the local context.
This might require ongoing access to officials at the mobile phone company who can provide further insights or metadata as the need arises. Alternatively, it might entail first consulting or conducting qualitative research that explores how people utilize mobile phones in the local context of relevance. Even after going to these lengths, the practitioner may determine that the data simply are not suitable for answering their questions, given the meaning of the data in their specific sociotechnical context.
A project to study populations using mobile phone data is likely animated by a narrative that celebrates the innovation of substituting readily available data for the painstaking data collection that a census involves. In applying the narrative lens to this project, it is crucial to define this narrative, which includes identifying what precisely makes this a compelling project, according to whom, and against what alternatives.
Then it is necessary to consider criticisms of this dominant narrative. For example, some scholars have expressed concerns about efforts to find alternative data sources in place of census data. Richard Shearmur (2015) argues that many researchers are “dazzled by data,” meaning that they have bought into a utopian narrative that claims newly available digital traces of human activity, better known as “Big Data,” will help to answer previously unanswerable questions about the world: “The marvels, infinite possibilities and sheer newness of Big Data are contrasted with the staid and limited information that – it is thought – can be gleaned from the census. For example, Facebook can combine available information to track the formation and dissolution of networks in real-time, and cell phone companies can map the movements of their customers: can the census do that?” (Shearmur 2015, p. 965).
Shearmur argues that this utopian narrative has contributed to decisions by governments, like his own in Canada, to disinvest from collecting census data. This concerns him because data collected during a census is “authoritative, open to scrutiny, representative of the entire population, and resting on slowly evolving and relatively consensual definitions” – all things that Big Data gathered from mobile phone records can never be because it is inherently about customers and markets, not citizens.
Upon recognizing the existence of the “dazzling” utopian data narrative, the data scientists involved in this hypothetical project might respond in many ways. They may reconsider whether they should embark on the project at all. They may find ways to complement their analysis with more traditional administrative data. Or they may take particular care when interpreting their findings and disclosing the limitations of their study.