Chapter Five: The Numbers Don’t Speak for Themselves
In April 2014, 276 young women were kidnapped from their high school in the town of Chibok in northern Nigeria. Boko Haram, a militant terrorist group, claimed responsibility for the attacks. The press coverage, both in Nigeria and around the world, was fast and furious. SaharaReporters.com challenged the government's ability to keep their students safe. CNN covered parents' anguish. The Japan Times connected the kidnappings to the increasing unrest in Nigeria's northern states. And the BBC told the story of a girl who had managed to evade the kidnappers. Several weeks after this initial reporting, the popular blog FiveThirtyEight published their own data-driven story about the event, titled "Kidnapping of Girls in Nigeria Is Part of a Worsening Problem." They reported skyrocketing rates of kidnappings. The story asserted that in 2013 alone there had been more than 3,608 kidnappings of young women. Charts and maps accompanied the story to visually make the case that abduction was at an all-time high.
Shortly thereafter, they had to issue an apologetic retraction because their numbers were just plain wrong. The outlet had used the Global Database of Events, Language and Tone (GDELT) as their data source. GDELT is a big data project by computational social scientist Kalev Leetaru, with previous development by Philip Schrodt and Patrick Brandt. It collects news reports about events around the world and parses the news reports for actors, events, and geography with the aim of providing a comprehensive set of data for researchers, government and civil society. GDELT particularly tries to focus on conflict – whether conflict is likely between two countries, whether unrest is sparking a civil war – all by analyzing media reports. However, as political scientist Erin Simpson pointed out to FiveThirtyEight in a widely cited Twitter rant, GDELT's primary data source is media reports and it's not at a stage where you can use it to make reliable claims about separate cases of kidnapping. The kidnapping of schoolgirls in Nigeria was a single event. There were thousands of global media stories about it. GDELT deduplicated some of those to a single event but still logged, erroneously, that hundreds of events happened that day. And the FiveThirtyEight report was counting each GDELT pseudo-event as a separate kidnapping incident.
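To make the counting problem concrete, here is a minimal sketch in Python. The records, the field names, and the matching rule are all made up for illustration and do not reflect GDELT's actual schema or its deduplication logic. The point is only that counting rows of media reports gives a very different answer than counting distinct events.

```python
# Hypothetical, hand-made records: each row is one media report, not one event.
# Field names and values are illustrative assumptions, not GDELT's real schema.
reports = [
    {"event_type": "kidnapping", "day": "day_1", "place": "town_a", "source": "outlet_1"},
    {"event_type": "kidnapping", "day": "day_1", "place": "town_a", "source": "outlet_2"},
    {"event_type": "kidnapping", "day": "day_1", "place": "town_a", "source": "outlet_3"},
    {"event_type": "kidnapping", "day": "day_9", "place": "town_b", "source": "outlet_1"},
]


def count_reports(rows):
    """Count every row, which treats each media report as a separate incident."""
    return len(rows)


def count_distinct_events(rows):
    """Collapse reports that share type, day, and place into a single event."""
    keys = {(r["event_type"], r["day"], r["place"]) for r in rows}
    return len(keys)


print("Reports counted as incidents:", count_reports(reports))
print("Distinct events after collapsing:", count_distinct_events(reports))
```

In this toy data, three outlets cover the same incident, so the naive count is inflated. Real deduplication is far harder than an exact match on three fields, because different outlets describe the same event with different dates, place names, and wording.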
One of Simpson's final admonishments in her long thread to FiveThirtyEight is to "never, ever use #GDELT for reporting of discrete events," because "that's not what it's for." This, combined with other commentary and critique from political scientists, statisticians and bloggers, was embarrassing for FiveThirtyEight – not to mention the reporter – but it also illustrates some larger problems about using data found "in the wild." First of all, the hype around "Big Data" leads projects like GDELT to wildly overstate the completeness and accuracy of their data and algorithms. On their website and in publications, the project leads have stated that GDELT is "an initiative to construct a catalog of human societal-scale behavior and beliefs across all countries of the world, connecting every person, organization, location, count, theme, news source, and event across the planet into a single massive network that captures what's happening around the world, what its context is and who's involved, and how the world is feeling about it, every single day." That giant mouthful is no small dick of Big Data. It is clearly Big Dick Data.
Yet even when you get past the marketing hype aimed at funders, the GDELT technical documentation is not quite forthright when it comes to whether it is counting media reports (as Simpson asserts) or events. The database FiveThirtyEight used is called the "GDELT Event Database", which makes it sound like it's counting events and not media reports. And the documentation for it states that "if an event has been seen before it will not be included again". This also makes it sound like it's counting events. And a research paper from 2013 states that it's counting events but only specific to publications. So still events, but with an asterisk. Both in documentation and in practice, it is unclear how the system actually works, since there are multiple assertions that events (derived from media reports) are what is being counted. But from a practical standpoint, when you use the system, the results of counting include many duplicate events. Moreover, there is no guidance in its documentation about what kinds of research questions are appropriate to ask the database and what the limitations are.
A cross-institutional research team published an analysis in Science in 2016 showing that only 21% of GDELT's events indicated an actual event (http://science.sciencemag.org/content/353/6307/1502.full; see also http://politicalviolenceataglance.org/2014/02/20/raining-on-the-parade-some-cautions-regarding-the-global-database-of-events-language-and-tone-dataset/). There is also concern that detecting events at the sentence level removes context.
The stakes are high for context and data. GDELT is not that different from many other data repositories out there. There are a growing number of portals, observatories, and websites where one can download all manner of open government, corporate and scientific data. There are APIs
API stands for Application Programming Interface. An API provides a way for a small program that you write to talk, over the Internet, to another program that is ready to receive data queries. Twitter, Zillow and MoMA are some examples of large entities that have APIs that you can use to programmatically download data.
While FiveThirtyEight did need a schooling on verifying their data, there's a larger problem at work here that has to do with context. One of the central tenets of feminist thinking, outlined by Donna Haraway, is that all knowledge is "situated." What this means is that context matters. When approaching new knowledge – any kind of knowledge – it's essential to ask about the social, cultural, historical and material conditions in which that knowledge was produced, as well as the identities of the humans who made (or are making) that knowledge. Rather than seeing knowledge artifacts – like datasets – as neutral and objective fodder to use for more knowledge making, a feminist perspective advocates for connecting them back to their context, to better understand their limitations and ethical obligations. And, ultimately, to better understand the ways in which power and privilege may be obscuring the truth.
The issue is that much of the data downloaded from web portals and APIs comes without context or metadata. If you are lucky you might get a paragraph about where the data are from or a data dictionary that describes what each column in a spreadsheet means. But more than likely you get something like this.
The data shown here – open budget data about government procurement in São Paulo – do not look very technically complicated. Rather, the complicated part is figuring out how the business process behind them works – how does the government run open bids? How do they decide who gets awarded a contract? Are all bids published here or just the ones that got the contract? What do terms like "competition," "cooperation agreement," and "terms of collaboration" mean to the data publisher? Why is there such variation in the publication numbering scheme? Without answers to some of these basic questions, it's hard to even begin a data exploration or analysis process.
This is not an uncommon situation. Most data arrives on our computational doorstep context-free and ripe for misinterpretation. And context becomes extra-complicated when poor data documentation is accompanied by the kind of marketing hype we see from GDELT or other Big Dick Data projects. Claims to totality and universal objectivity, such as those made by GDELT, are exactly what led Haraway to propose situated knowledge as a feminist antidote to these kinds of "unlocatable, and so irresponsible, knowledge claims." The goal of feminist objectivity, then, becomes to connect knowledge back to the bodies of its producers and institutions, with their particular histories, values, limitations, and oversights. In short, to consider context in relation to data.
Ironically, some of the admirable aims and actions of the Open Data Movement have worked against the urgency of providing context (often inadvertently). Open data is the idea that anyone can freely access, use, modify, and share data for any purpose. And the Open Data Movement – a loose network of organizations, governments, and individuals – has been active since the mid-2000s, when groups like the Open Knowledge Foundation were founded and campaigns like the Guardian's "Free Our Data"
But in practice, limited public funds for technological infrastructure mean that governments prioritize the "opening up" part of open data – publishing spreadsheets of things like licenses granted, arrest data, or flood zones – but cannot go further on developing context or making data usable in order to ensure access and use by broad groups of the public. Raw data dumps might be good for starting a conversation, notes scholar Tim Davies, but they cannot actually ensure engagement or accountability. And in fact, many published data sets sit idle on their portals, awaiting users to undertake the labor of deciphering their bureaucratic arcana. There is even a neologism, coined by Daniel Kaufmann, an economist with the Revenue Watch Institute, to describe this phenomenon: "zombie data" is data that has been published without any purpose or clear use case in mind.
So, open data has a context problem. Or, a better way to say this is that governments and data providers have not invested as much time and resources in providing context to end users as they have in providing data.
But do we need to invest in context? Wired magazine editor Chris Anderson would say "No." In 2008, in a now famous Wired article titled "The End of Theory," Anderson made the claim that "with enough data, the numbers speak for themselves." His assertion was that the age of Big Data will soon permit data scientists to do analysis at the scale of the population. Statistics is based on the idea that you can infer things about a population by taking a random and representative sample. For example, say you want to know which candidate all 323 million people in the US will vote for in a presidential election. You can't contact all of them, but you can call 3,000 of them and use those results to predict what the rest of the people will do. Of course, there are some statistical modeling and uncertainty calculations that need to take place here. And this is the point where Anderson is saying that the theory happens – bridging data collected from a sample with calculations to infer things about a population. At the point when we have data collected about an entire population, theory is no longer necessary. We also, he says, don't need models and theories to understand why something is happening, just to be able to see that one thing is correlated with another: "Correlation is enough." Anderson's main example is Google search. Google's systems don't need to understand why some pages are more linked to than others, only that it's happening, and they will then use that as an indicator of relevance in search.
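The "uncertainty calculations" mentioned in the polling example can be sketched with the textbook margin-of-error formula for a proportion. This is a simplified illustration, not how any particular pollster works: it assumes a simple random sample, a population much larger than the sample, a 95% confidence level (z = 1.96), and the worst-case split of 50/50. The function name and defaults are our own.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    Assumes a normal approximation and a population far larger than the sample.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


moe = margin_of_error(3000)
print(f"Margin of error for n=3000: +/- {moe * 100:.1f} percentage points")
```

Under those assumptions, a sample of 3,000 gives an estimate within roughly two percentage points, which is why a relatively small sample can stand in for a very large population. The formula says nothing about whether the sample is actually representative, which is a question of context rather than arithmetic.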
Now, you can't write an article claiming that the scientific method and all theory are obsolete and not expect some pushback. Anderson wrote the piece to be provocative, and there have been numerous responses and debates, including those that challenge the idea that this argument is a "new" way of thinking in the first place (for example, Francis Bacon argued for inductive reasoning in which the scientist gathers data, analyzes it, and only then forms a hypothesis). What has unfolded since 2008 in feminist thinking is also a more sophisticated understanding of the ways in which data-driven systems like Google Search do not just reflect back the racism and sexism embedded throughout society but also participate in reinforcing it. This is the central argument of Algorithms of Oppression, Safiya Noble's study of the harmful stereotypes about Black and Latina women perpetuated by search algorithms. In direct opposition to Anderson, Noble asserts that it is the corporation's responsibility to understand racism in page-linking. Correlation, without context, is not enough when it means that Google recirculates racism.
But there's another reason that the numbers don't speak for themselves when it comes to data about women and marginalized people: not all standpoints are valued equally by society. In writing about a Black woman's standpoint, sociologist Patricia Hill Collins explains that when a group's standpoint is consistently devalued, it becomes subjugated knowledge: "Traditionally, the suppression of Black women’s ideas within White male-controlled social institutions led African-American women to use music, literature, daily conversations, and everyday behavior as important locations for constructing a Black feminist consciousness." When groups of people are systematically taught that mainstream culture excludes their experience, stigmatizes their experience or completely neglects their experience, then their knowledge and cultural practices either go underground or are completely silenced. When mainstream institutions try to collect data in the context of subjugated knowledge, the results are uneven because the data setting has major imbalances of power. Nowhere is this more evident than in the case of violence against women and the data that tries (though, in most cases, does not try very hard) to capture the reality of this phenomenon.
In April 1986, Jeanne Clery was sexually assaulted and brutally murdered in her dorm room by an acquaintance at Lehigh University. Clery's parents were devastated. "Most Americans saw [the space shuttle] Challenger splinter into a billion pieces. That's what happened to our hearts," Connie Clery told People Magazine. The Clerys mounted a campaign to improve data collection about crimes on college campuses and it was successful – the Jeanne Clery Act was passed in 1990 and requires all US colleges and universities to make on-campus crime statistics available to the public. This now includes separate and specific numbers on sexual violence such as sexual assault, dating violence, domestic violence and stalking.
So we have an ostensibly comprehensive national data set about an important public topic. In 2016, three senior students – Patrick Torphy, Michaela Halnon, and Jillian Meehan – in Catherine's data journalism class at Emerson College decided this was a good starting point for their final project. Could the Clery Act data tell them something important about the rape culture
However, upon downloading and exploring the data for colleges in Massachusetts the students were puzzled. Williams College, a small, wealthy liberal arts college in rural Massachusetts, seemed to have an epidemic of sexual assault, while Boston University, a large research institution in downtown Boston, seemed to have strikingly few cases for its size and population. Not to mention that several high-profile sexual assault cases at BU had made the news in recent years so BU did not have a great reputation around Boston. The students were suspicious – and with good reason. Their further investigation revealed that the truth is likely closer to the reverse of the picture that the Clery Act numbers paint. But you cannot know that without understanding the context of the data.
Colleges and universities are required to report sexual assault data and other campus crimes annually per the Clery Act, and there are stiff financial penalties for not reporting. But it's important to note that the numbers are self-reported. There are only sixteen staff members at the US Department of Education devoted to monitoring the more than 7,000 higher education institutions in the country, so it is unlikely that underreporting by an institution would be discovered except in very high-profile cases like the Sandusky case at Penn State. Moreover, there are strong incentives not to file a Clery report with high numbers. First of all, no college wants to tell the government – let alone parents of prospective students – that it has a high rate of sexual assault on campus. High numbers of sexual assault are bad for the bottom line, so universities are actually financially incentivized to not encourage survivors to come forward. And survivors of sexual assault themselves often do not come forward because of social stigma, the trauma of reliving their experience, and the resulting lack of social and psychological support. This is subjugated knowledge – by normalizing male sexual violence, mainstream culture has taught survivors that their experience will not be treated with care and, in fact, they may face more harm, blame and trauma if they do come forward. The result is silence, and the effect on the data is that vast rows of survivors go unaccounted for.
As the students consulted with experts, compared Clery Act data with anonymous campus climate surveys, and interviewed survivors, they found that, paradoxically, many of the colleges with higher reported rates of sexual assault were actually places where more institutional resources were being devoted to support for survivors.
Do you remember the body issues we described in Chapter One? One of the key reasons that data science needs feminism is that bodies go uncounted, particularly bodies of women, non-binary and other gender non-conforming folks, and folks of color. This is true in the case of data about maternal health, human migration, police killings, health impacts of pollution, and more. And this is certainly the case with sexual assault data, where society systematically neglects and devalues the standpoint of survivors. Their experiences become subjugated knowledge – stigmatized and silenced. Thus, the collection environment has social, political and cultural incentives around reporting that are misaligned and work against collecting reliable, accurate data. Simply stated, there are imbalances of power in the data setting, so we cannot take the numbers in the data set at face value. Here one needs a sophisticated understanding of the context of the data and the actors in the data collection system in order to be able to work with it ethically and truthfully. Lacking this understanding of context and letting the numbers "speak for themselves" would tell a story that is not only patently false but could also be used to reward colleges that are systematically underreporting and creating hostile environments for survivors. Cathy O'Neil, the author of Weapons of Math Destruction, has a term for this: A "pernicious feedback loop" helps to reinforce the unfair environment from which it spawned. Deliberately undercounting cases of sexual assault leads to being rewarded for underreporting. And the silence around sexual assault continues: The administration is silent, the campus culture is silent, the data set is silent.
One of the key analytical missteps of work that "lets the numbers speak for themselves" is the assumption that data are a raw input rather than seeing them as artifacts that have emerged fully cooked into the world, birthed out of a complex set of social and political circumstances already existing in the data setting. It's important to note that there is an emerging class of "data creatives" whose very existence is premised on context-hopping by combining disparate data. This group includes data scientists, data journalists, data artists and designers, researchers, and entrepreneurs. In short, pretty much everyone who works with data right now. Data's new creative class is highly rewarded for producing work that creates new value and insight from mining and combining conceptually unrelated data sets.
But data is an output first. After that, it can become an input into a new process, but only with understanding of what the limitations of the collection environment were. “Raw Data” is an Oxymoron is the lovely title and primary assertion of a book edited by literary and information studies scholar Lisa Gitelman that traces the history of data and its connections to today’s data culture. Many data-driven projects aiming towards producing new, future insights forget to interrogate how the data got collected and cooked in the first place. FiveThirtyEight got the GDELT events data and jumped into the analysis without looking backwards into how the data was acquired and processed. Clery counts emerge out of a data setting that has an imbalance of power, subjugated knowledge and misaligned incentives and so does not measure what it appears to on first glimpse.
This kind of "data-is-raw-input" mentality happens in scholarly research as well. An academic paper about "the Baumgartner Reddit Corpus" authored by Devin Gaffney and J. Nathan Matias made waves in spring 2018. Three years prior, software developer Jason Baumgartner published a dataset that he claimed contained "every publicly available Reddit comment". Computational social scientists were thrilled. To date, at least fifteen peer-reviewed studies have used the dataset for research studies on topics like politics, online behavior, breaking news, and hate speech. But Gaffney and Matias found a big problem with this big data set: The supposedly complete corpus is missing at least 36 million comments and 28 million submissions. Depending on what the researchers used the corpus for, the missing data may affect the validity of their results. Some researchers have re-run their experiments and found no changes in their findings when they included the missing data
Gaffney and Matias' work represents an emerging feminist methodological approach to big data research. Instead of using large data sets as raw inputs to create other meaning, they are interrogating the context, limitations and validity of the data itself. Which is to say, they are examining the data to understand the cooking process. In a similar vein, computer scientists and historians at Stanford undertook a study called "Word embeddings quantify 100 years of gender and ethnic stereotypes." Using machine learning and a data set of 200 million words taken from US books, media and census data from the 20th century, they sought to analyze gender and ethnic stereotypes over time. Word embeddings are sets of numbers (technically, vectors) that quantify the relationships between words in a set of documents. They’re helpful for showing which words are most (or least) strongly associated with other words, according to the model. In this particular paper, the model showed that words like "intelligent", "logical", and "thoughtful," had masculine associations before the 1960s. But since then, according to the model, those words have increasingly been associated with women. However, other words, like those associated with physical appearance, did not show such comparative "progress." In the paper, the researchers assert that shifts in word embeddings can quantify the effects of social activism. They write, "The women’s movement in the 1960s and 1970s especially had a systemic and drastic effect in women’s portrayals in literature and culture."
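For readers unfamiliar with word embeddings, here is a toy sketch of how "association" between words can be computed from vectors. The three-dimensional vectors below are invented for illustration (real embeddings are learned from text and have hundreds of dimensions), and the score shown, a difference of cosine similarities, is a simplified stand-in rather than the measure used in the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only; "word_x" is a placeholder word.
embeddings = {
    "she": [0.9, 0.1, 0.2],
    "he": [0.1, 0.9, 0.2],
    "word_x": [0.8, 0.2, 0.3],
}


def gender_association(word):
    """Positive when the word sits closer to 'she', negative when closer to 'he'."""
    vector = embeddings[word]
    return (cosine_similarity(vector, embeddings["she"])
            - cosine_similarity(vector, embeddings["he"]))


score = gender_association("word_x")
print("Closer to 'she'" if score > 0 else "Closer to 'he'")
```

Training separate embeddings on text from different decades and comparing scores like this one over time is, in rough outline, how such a model can register a shift in a word's associations. What the score measures is how the word was used in the documents, not anything about women or men themselves.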
What makes this project feminist in both topic and method is its use of computation to situate gender and ethnic bias in a social and temporal context. Note that the researchers did not try to assert that the data represent "how women and men are." They also did not try to "remove the bias" so that they could study gender differences. They saw the data as what they are – cultural indicators of the changing face of patriarchy and racism – and interrogated them as such.
So, how do we collectively produce more work that situates data, interrogates bias and sensitively treats subjugated standpoints and knowledges?
Unfortunately for Chris Anderson, the answer is that we need more theory, context, and scientific method, not less. Why? Because, quite simply, the humans are always in the loop. Even when the algorithms are doing the heavy lifting. As we showed in What Gets Counted Counts, without theory, survey designers and data analysts are relying on their intuition and "common sense" theories of the things they are measuring and modeling and this leads directly down the path towards cognitive bias.
Deep context, subjugated standpoints and computation are not incompatible. Desmond Patton has a unique background – trained as a social worker, he now runs SAFElab, a research lab at Columbia that uses artificial intelligence to examine the ways that youth of color navigate violence on and offline. He and a team of social work students use social media, specifically Twitter data, to understand and prevent gang violence in Chicago. But when he started doing this work five years ago, he ran into a problem. Even though he is African American, grew up in Chicago, and worked in many of these neighborhoods for years in violence prevention, "I didn't know what young people were saying, period." At the same time, Patton and his team are acutely aware of the fact that many groups, such as law enforcement and corporate platforms, are already surveilling youth of color online. He continues, "it became really clear to me that we needed to take a deeper approach to social media data in particular, so that we could really grasp culture, context and nuance, for the primary reason of not misinterpreting what's being said."
The solution to context, in this case, came through direct contact with and centering the perspectives of the youth whom they sought to understand. Patton and doctoral student William Frey hired formerly gang-involved youth to work on the project as domain experts. These experts coded a subset of the millions of tweets, then a team of social work students was trained to code them. The process was long, and not without challenges. Patton and Frey actually created a new "deep listening" method, called Contextual Analysis of Social Media, in order to help human coders mitigate their own bias in the coding process and get closer to the intended meaning of a single tweet. Finally, they trained a machine learning algorithm to classify the youths' tweets. Says Patton, "We trained the algorithm to think like a young African American man on the south side of Chicago."
Here is feminist standpoint theory in action in artificial intelligence. The dominant culture approach would have done something naive or misinformed, like counting violent words in tweets. So a tweet like "aint kill yo mans & ion kno ya homie" would have been classified as aggressive or violent, fulfilling the dominant culture's stereotype of Black youth. Taking a situated view, from the standpoint of Black youth themselves, Frey and Patton were able to show that many tweets like this one are actually youth quoting music lyrics of local rap stars, in this case Chicago rapper Lil Durk. These tweets are about sharing culture, not communicating threats.
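The naive approach described above, counting violent words, is easy to sketch, and seeing it in code makes its failure plain. The word list and function name below are hypothetical and chosen only for illustration. This is the approach the paragraph warns against, not the method used by Patton and Frey.

```python
import re

# Hypothetical keyword list; any real list would be an equally arbitrary choice.
VIOLENT_WORDS = {"kill", "shoot", "gun"}


def naive_violence_flag(text):
    """Flag a text as 'violent' if it contains any word from the keyword list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in VIOLENT_WORDS for token in tokens)


tweet = "aint kill yo mans & ion kno ya homie"
print("Flagged as violent:", naive_violence_flag(tweet))
```

The keyword matcher flags the tweet because it contains the word "kill", with no way of knowing that the text quotes a song lyric. Nothing in the string itself carries that context; it has to come from people who know the music and the community.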
Note that in order to train an algorithm to understand the context of subjugated standpoints, significant human infrastructure and ethical navigation is required. Frey and Patton have built long-term, ongoing relationships with individuals and organizations in the community. Indeed, Frey lives and works in the community. Both are trained social workers, with years of experience working in violence prevention. According to Patton, they lead with the social worker's code of ethics, one of whose principles is "Social workers recognize the central importance of human relationships." Rather than using computation to replace human relations, SAFElab is using AI to broker new forms of understanding across power differentials. This kind of social innovation often goes underappreciated in technical disciplines. As Patton says, "we had a lot of challenges with publishing papers in data science communities about this work, because it is very clear to me that they're slow to care about context. Not that they don't care, but they don't see the innovation or the social justice impact that the work can have."
Note that it's not just in the stages of data acquisition and analysis that context matters. Context also comes into play in the framing and communication of numbers. Let's imagine a scenario. In this case you are a data journalist and your editor has assigned you to create a graphic and short story about a recent research study: "Disparities in Mental Health Referral and Diagnosis in the New York City Jail Mental Health Service". This study looked at the medical records of 45,189 first-time inmates and found that some groups are more likely to receive a treatment response and others are more likely to receive a punishment response. Older people and white people were more likely to receive a mental health diagnosis. Black and Hispanic inmates were more likely to enter solitary confinement. While the researchers explain some of this variation from differential diagnostic rates outside of jail, they also attribute some of the variation to discrimination within the jail system. Either way, the racial and ethnic disparities are a product of structural racism.
Consider the difference between these two graphics. The only variation is the title and framing of the chart.
Which one of these graphics would you (should you) choose to use? The first – "Mental Health in Jail" – represents the most typical way that data is communicated. The title appears to be neutral and free of bias. This is a graphic about rates of mental illness diagnosis of inmates broken down by race and ethnicity. The title does not mention race, ethnicity, racism, or health inequities. The title also does not point to what the data means. And remember from What Gets Counted Counts what happens in the case of "not enough meaning"? Our helpful, heuristic-loving brains will start to fill in the gaps. The particular subject matter of this data – combining mental illness and race and incarceration – contains three charged issues that are particularly prone to stigma and stereotypes. In the chart on the left, your viewers' brains will likely start to draw inferences based on stereotypes that use essentialist ideas about race/ethnicity (the main category of analysis depicted) to explain the variation in the data, as in, "Oh, this is because white people are x and Black people are y."
So, here is where an important context question comes in. Are you representing only the four numbers that we see in the chart? Or are you representing the context from which they emerged? Because the research study that produced these numbers presents convincing evidence that we should distrust the diagnosis numbers due to racial and ethnic disparities. So, if you publish a chart of those same numbers without questioning them in the title, you are actually undermining the main claim of the research. The scientists' results showed that white prisoners disproportionately receive treatment measures like mental health services and people of color disproportionately receive punitive measures like solitary confinement. So, the chart on the left is not only not providing enough meaning (and letting stereotypes flow in) but it is also not giving enough information about the main claim of the research.
Enter the chart on the right: "Racism in Jail: People of color less likely to get mental health diagnosis." There are a couple of important things this title is doing. First, the title is framing a way to interpret the numbers that is in line with the context and claims of the research study from which they emerged. The research study was about racial disparities, so the title and content of this chart are about racial disparities. Additionally, and crucially, this chart names the forces of oppression that are at work: "Racism in Jail." Rather than leave the door open for stereotypes and essentialist views of race and ethnicity and mental illness to flood your viewers' minds, this chart names the force at work that produces this inequality: it is racism.
"But," you may say (and our students say this a lot), "I don't want to tell people what to think. I want to let them interpret the numbers for themselves." This is an ostensibly noble sentiment, but it fails to acknowledge the power relationship between the author and the audience. As the data journalist in this scenario, you are in a position of power to be able to communicate something to your readers. Presumably, you have researched the topic and know more about it than your audience. Because of that, your audience is in a position of listening and paying attention. What this means is that you have a responsibility – precisely because of your position of privilege – to communicate both the data and the most accurate interpretation of the data. Letting the numbers speak for themselves is emphatically not more ethical or more democratic. Why? Because stereotypes and heuristics love a vacuum. Your audience is highly unlikely to go read all the research you read, do the calculations you did, and interview the people you interviewed. They are reading your story or analysis precisely because they do not have the time for that. So if your work fails to provide meaning and context to numbers, their minds will fill in the gaps with the path of least resistance – and that will probably include stereotypes in the case of issues like race, gender, mental illness, or incarceration, among others. The feminist imperative to consider context and situate numbers in relation to their social and political context is not just a recommendation but a responsibility of ethical data communication.
This counsel – to name forces of oppression when they are clearly present in the numbers – particularly applies to data scientists and designers from the dominant group. White people, including ourselves, have a hard time naming and talking about racism. Men have a hard time naming and talking about sexism and patriarchy. Straight people have a hard time seeing and talking about homophobia and heteronormativity. If you are concerned with truth and justice in data, we suggest that you practice recognizing, naming and talking about these structural forces of oppression because it is in aggregated data that they are most evident. We go into further detail about these forces in The Power Chapter.
Part of considering context is understanding that data collection always involves an investment of some combination of interest, money, and time. As we said in Bring Back the Bodies, counting is power, even if it shouldn’t be, and that power is not distributed equally across all social groups. There are many important issues, often related to women and other marginalized groups, about which we have little-to-no data. As artist Mimi Onuoha points out in relation to her project Missing Data Sets, this is primarily because institutional incentives do not exist to collect it. And the groups who are most affected by the problem often do not have the time, money, or expertise to do it on their own.
These structural issues can feel overwhelming. But honoring context responsibly in the course of one's work with data is also not that complicated. It merely involves a reconception of the role of the data scientist from a "raw data" massager to a "cooked data" investigative biographer. It involves looking backwards at the data setting – and reflecting on your own identity in relation to the data – before you look forwards to create new insights produced with the data set.
Educators, journalists, and civic data publishers are starting to develop more robust tools and methods for context, and we'll take you on a quick tour of several. The first and most important is simply to take an "equity pause" at the beginning of the project, and later at key strategic moments. An equity pause is a process step in EquityXDesign, a justice-focused design framework developed by Christine Ortiz, Caroline Hill and Michelle Molitor. Their framework asserts that research and design can proceed hand-in-hand with racial equity, but only with an additional set of checks around power and privilege. As applied to data science, an equity pause would involve questioning your research questions, questioning your categories and questioning your expectations, particularly as they relate to data about people. This is really difficult – especially for people who are members of the dominant social group (and thus more susceptible to the overconfidence bias and the illusory superiority bias and the status quo bias, among others). And the smaller and less diverse your team is, the more likely you are to fall prey to habits of thinking like the self-serving bias and the egocentric bias.
Remember the scenario in What Gets Counted Counts in which you were designing a data project about cell phone data usage? Planning time for an equity pause in the research and discovery phase of a project might lead you to entirely different design decisions as the survey designer. It would allow you the time and space to research contemporary ideas about gender and mobile technology and incorporate them into your work. For example, you might collect data on cell phone usage beyond the categories of only "women" and "men." Or you might collect gender on a spectrum, as a continuous variable. And an equity pause – especially together with a team – may lead you to make some of your implicit assumptions explicit, like "women talk more so let's collect minutes." To which your female colleagues might respond, diplomatically, "that's an over-generalizing essentialist assumption." Finally, an equity pause informed by deeper research may lead to highly innovative methods. For example, demographer Jill Williams suggests that quantitative work informed by feminist theory may need to treat gender as a dependent variable. Meaning, this work would look at gender as an outcome of other intersecting aspects of identity such as age, race, class, sexuality, and ethnicity. Doing that kind of intersectional analysis might lead you to discover the connection between millennials and expanded social networks that you had previously missed.
A related but slightly more technical proposal advocated by researchers at Microsoft is being called datasheets for datasets. Inspired by the datasheets that accompany hardware components, Timnit Gebru and colleagues advocate for data publishers to create a short, 3-5 page document that accompanies data sets and outlines how they were created and collected, what data is missing, whether preprocessing was done, how the dataset will be maintained, and legal and ethical considerations such as whether the data collection process complies with privacy laws in the EU.
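To make the idea concrete, here is a minimal sketch of what a machine-readable datasheet might look like. It is only an illustration: the class name, the field names, and the example dataset are our own hypothetical choices, one field per topic listed in the paragraph above, and they are not the template proposed by Gebru and colleagues.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    """Hypothetical datasheet: one free-text answer per topic named above."""
    dataset_name: str
    creation_and_collection: str = ""   # how the data was created and collected
    missing_data: str = ""              # what data is missing
    preprocessing: str = ""             # whether preprocessing was done
    maintenance_plan: str = ""          # how the dataset will be maintained
    legal_and_ethical: str = ""         # e.g. compliance with privacy laws

    def unanswered(self) -> list[str]:
        """Return the names of the sections that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self) -> str:
        """Render the datasheet as plain text, flagging undocumented sections."""
        lines = [f"Datasheet for: {self.dataset_name}"]
        for f in fields(self):
            if f.name == "dataset_name":
                continue
            answer = getattr(self, f.name).strip() or "NOT DOCUMENTED"
            lines.append(f"{f.name.replace('_', ' ').title()}: {answer}")
        return "\n".join(lines)


# Illustrative usage with an invented dataset (not a real one).
sheet = Datasheet(
    dataset_name="Example survey",
    missing_data="Respondents without phones are absent.",
)
print(sheet.render())
print("Still undocumented:", sheet.unanswered())
```

The design choice worth noticing is that blank sections are surfaced rather than hidden: a publisher (or a reader) can see at a glance which parts of the data's context have not been written down yet.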
Providing more context is in line with a feminist approach to data and it also helps move towards some of the unrealized ideals of the Open Data Movement around participation, transparency and civic empowerment. For example, Gisele Craveiro, a professor at the University of São Paulo, researches the dissemination and reuse of open government data. Brazil has a transparency law on the books that requires the government to publish data about every expenditure in 24 hours or less. Most of this gets published in impenetrable tables with little metadata or documentation, as shown earlier in the chapter with the example of the procurement table. In the project "Cuidando do Meu Bairro" (Caring for My Neighborhood), Craveiro and her team created a tool to make this spending data more accessible to citizens by adding context to the presentation of the information. Their results showed that people could engage with the data better once they could see expenditures that occurred in their neighborhood and what their funding status was (planned, committed or paid). Not only that, the research team was also able to communicate accessibility struggles around lack of context back to government officials and influence how the data was published in the first place.
So tools and methods for providing context are being developed and piloted, and there is still hope (we hope!) for the future of open data. But what remains murky is this: which actors in the data ecosystem are responsible for providing context?
Is it the end users? In the case of the reddit comments, we have seen how even the most highly educated among us failed to verify the basic claims of their data source. And datasheets for datasets are great, but can we expect individual people and small teams to conduct an in-depth background research project while on a deadline and a budget? This places unreasonable expectations and responsibility on newcomers and is likely to lead to further high-profile cases of errors and ethical breaches.
So, is it the data publishers? In the case of GDELT, we have seen how data publishers, in their quest for research funding, overstate their capabilities and don't document the limitations of the data. In the case of the reddit comments, the data was provided by an individual acting in good faith, but who did not verify – and probably did not have the resources to verify – his claim to completeness. In the case of the sexual assault data, the universities self-reporting cases are incentivized to underreport and the government is under-resourced to verify and document all the limitations of the data. And if one of the goals is transparency and accountability, the institutions in power often have strong incentives to not provide context.
So, is it data intermediaries? Intermediaries might include librarians, journalists, nonprofits, educators and other public information professionals. These folks are doing context-building work in some piecemeal but important ways. For example, ProPublica, the nonprofit news organization, has compiled the largest US database on school segregation from public data sources. They provide a 21-page document to give context on where the data comes from, the time period it covers and what kinds of questions are appropriate to ask of the data. The nonprofit Measures for Justice provides comprehensive and contextualized data on criminal justice and incarceration rates in the US. So, intermediaries who clean and contextualize the data for public use have potential (and have fewer conflicts of interest), but doing this at scale would require a funding mechanism, significant capacity building, and professional norm-setting.
Houston, we have a public information problem. Until we invest as much in providing (and maintaining) context as we do in publishing data, we will end up with public information resources that are subpar at best and dangerous at worst. The bottom line for numbers is that they cannot speak for themselves. In fact, those of us who work with data must actively prevent numbers from speaking for themselves because when those numbers come from a data setting with a power imbalance or misaligned collection incentives (read: pretty much all data settings!), and especially when the numbers have to do with human beings, then they run the risk of being not only discriminatory, not only empirically wrong, but actually dangerous in their reinforcement of an unjust status quo. Considering context should be a frontier for open data advocates, philanthropic foundations, researchers, news organizations, and – perhaps most importantly – regulators.