On the cultural ideology of Big Data.
“What science becomes in any historical era depends on what we make of it” —Sandra Harding, Whose Science? Whose Knowledge? (1991)
Modernity has long been obsessed with, perhaps even defined by, its epistemic insecurity, its grasping toward big truths that ultimately disappoint as our world grows only less knowable. New knowledge and new ways of understanding simultaneously produce new forms of nonknowledge, new uncertainties and mysteries. The scientific method, based in deduction and falsifiability, is better at proliferating questions than it is at answering them. For instance, Einstein’s theories about the curvature of space and motion at the quantum level provide new knowledge and generate new unknowns that previously could not be pondered.
Since every theory destabilizes as much as it solidifies in our view of the world, the collective frenzy to generate knowledge creates at the same time a mounting sense of futility, a tension looking for catharsis — a moment in which we could feel, if only for an instant, that we know something for sure. In contemporary culture, Big Data promises this relief.
As the name suggests, Big Data is about size. Many proponents of Big Data claim that massive databases can reveal a whole new set of truths because of the unprecedented quantity of information they contain. But the big in Big Data is also used to denote a qualitative difference — that aggregating a certain amount of information makes data pass over into Big Data, a “revolution in knowledge,” to use a phrase thrown around by startups and mass-market social-science books. Operating beyond normal science’s simple accumulation of more information, Big Data is touted as a different sort of knowledge altogether, an Enlightenment for social life reckoned at the scale of masses.
As with similarly inferential sciences like evolutionary psychology and pop-neuroscience, Big Data can be used to give any chosen hypothesis a veneer of science and the unearned authority of numbers. The data is big enough to entertain any story. Big Data has thus spawned an entire industry (“predictive analytics”) as well as reams of academic, corporate, and governmental research; it has also sparked the rise of “data journalism” like that of FiveThirtyEight, Vox, and the other multiplying explainer sites. It has shifted the center of gravity in these fields not merely because of its grand epistemological claims but also because it’s well-financed. Twitter, for example, recently announced that it is putting $10 million into a “social machines” Big Data laboratory.
The rationalist fantasy that enough data can be collected with the “right” methodology to provide an objective and disinterested picture of reality is an old and familiar one: positivism. This is the understanding that the social world can be known and explained from a value-neutral, transcendent view from nowhere in particular. The term comes from Positive Philosophy (1830-1842), by Auguste Comte, who also coined the term sociology in this image. As Western sociology began to congeal as a discipline (departments, paid jobs, journals, conferences), Émile Durkheim, another of the field's founders, believed it could function as a “social physics” capable of outlining “social facts” akin to the measurable facts that could be recorded about the physical properties of objects. It’s an arrogant view, in retrospect — one that aims for a grand, general theory that can explain social life, a view that became increasingly rooted as sociology became focused on empirical data collection.
A century later, that unwieldy aspiration has been largely abandoned by sociologists in favor of reorienting the discipline toward recognizing complexities rather than pursuing universal explanations for human sociality. But the advent of Big Data has resurrected the fantasy of a social physics, promising a new data-driven technique for ratifying social facts with sheer algorithmic processing power.
Positivism's intensity has waxed and waned over time, but it never entirely dies out, because its rewards are too seductive. The fantasy of a simple truth that can transcend the divisions that otherwise fragment a society riven by power and competing agendas is too powerful, and too profitable. To be able to assert convincingly that you have modeled the social world accurately is to know how to sell anything, from a political position to a product to one’s own authority. Big Data sells itself as a knowledge that equals power. But in fact, it relies on pre-existing power to equate data with knowledge.
Not all data science is Big Data. As with any research field, the practitioners of data science vary widely in ethics, intent, humility, and awareness of the limits of their methodologies. To critique the cultural deployment of Big Data as it filters into the mainstream is not to argue that all data research is worthless. (The new Data & Society Research Institute, for instance, takes a measured approach to research with large data sets.) But the positivist tendencies of data science — its myths of objectivity and political disinterestedness — loom larger than any study or any set of researchers, and they threaten to transform data science into an ideological tool for legitimizing the tech industry’s approach to product design and data collection.
Big Data research cannot be understood outside the powerful nexus of data science and social-media companies. It’s where the commanding view-from-nowhere ideology of Big Data is most transparent; it's where the algorithms, databases, and venture capital all meet. It was no accident that Facebook's research branch was behind the now-infamous emotional manipulation study, which was widely condemned for its lax ethical standards and intellectual hubris. (One of the authors of the study said Big Data’s potential was akin to the invention of the microscope.)
Equally steeped in the Big Data way of knowing is Dataclysm, a new book-length expansion of OkCupid president Christian Rudder’s earlier blog-posted observations about the anomalies of his dating service’s data set. “We are on the cusp of momentous change in the study of human communication,” Rudder proclaims, echoing the Facebook researchers’ hubris. Dataclysm’s subtitle sets the same tone: “Who we are (when we think no one is watching).” The smirking implication is that when enough data is gathered behind our backs, we can finally have access to the dirty hidden truth beyond the subjectivity of not only researchers but their subjects as well. Big Data will expose human sociality and desire in ways those experiencing it can’t.
Because digital data collection on platforms like OkCupid seems to happen almost automatically — the interfaces passively record all sorts of information about users’ behavior — it appears unbiased by messy a priori theories. The numbers, as Rudder states multiple times in the book, are right there for you to conclude what you wish. Indeed, because so many numbers are there, they speak for themselves. With all of OkCupid's data points on love and sex and beauty, Rudder claims he can “lay bare vanities and vulnerabilities that were perhaps until now just shades of truth.”
For Rudder and the other neo-positivists conducting research from tech-company campuses, Big Data always stands in the shadow of the bigger data to come. The assumption is that there is more data today and there will necessarily be even more tomorrow, an expansion that will bring us ever closer to the inevitable “pure” data totality: the entirety of our everyday actions captured in data form, lending themselves to the project of a total causal explanation for everything. Over and over again, Rudder points out the size, power, and limitless potential of his data only to impress upon readers how it could be even bigger. This long-held positivist fantasy — the complete account of the universe that is always just around the corner — thereby establishes a moral mandate for ever more intrusive data collection.
But what's most fundamental to Rudder’s belief in his data’s truth-telling capability — and his justification for ignoring established research-ethics norms — is his view that data sets built through passive data collection eliminate researcher bias. In Rudder's view, shared by other neo-positivists who have defended human digital experimentation without consent, the problem with polling and other established methods for large-scale data gathering is that these have well-known sources of measurement error. As any adequately trained social scientist would confirm, how you word a question and who poses it can corrupt what a questionnaire captures. Rudder believes Big Data can get much closer to the truth by removing the researcher from the data-collection process altogether. For instance, with data scraped from Google searches, there is no researcher prodding subjects to reveal what they wanted to know. “There is no ask. You just tell,” Rudder writes.
This is why Rudder believes he doesn’t need to ask for permission before experimenting on his site’s users — to, say, artificially manipulate users’ “match” percentage or systematically remove some users’ photos from interactions. To obtain the most uncontaminated data, users cannot be asked for consent. They cannot know they are in a lab.
While the field of survey research has oriented itself almost completely to understanding and articulating the limits of its methods, Rudder copes with Big Data’s potentially even more egregious opportunities for systematic measurement error by ignoring them. “Sometimes,” he argues, “it takes a blind algorithm to really see the data.” Significantly downplayed in this view is how the way OkCupid captures its data points is governed by the political choices and specific cultural understandings of the site's programmers. Big Data positivism myopically regards the data passively collected by computers as objective. But computers don’t remember anything on their own.
This naive perspective on how computers work echoes the early days of photography, when that new technology was sometimes represented as a vision that could go beyond vision, revealing truths previously impossible to capture. The most famous example is Eadweard Muybridge’s series of photographs that showed how a horse really galloped. But at the same time, as Shawn Michelle Smith explains in At the Edge of Sight: Photography and the Unseen, early photography often encoded specific and possibly unacknowledged understandings of race, gender, and sexuality as “real.” This vision beyond vision was in fact saturated with the cultural filter that photography was said to overcome.
Social-media platforms are similarly saturated. How these sites are designed, what data they collect, how it is captured, how the variables are arranged and stored, and how and why the data is queried are all full of messy politics, interests, and insecurities. Social-science researchers are taught to recognize this from the very beginning of their academic training and learn techniques to try to mitigate or at least articulate the resulting bias. Meanwhile, Rudder gives every first-year methods instructor heart palpitations by claiming that “there are times when a data set is so robust that if you set up your analysis right, you don’t need to ask it questions — it just tells you everything anyways.”
Evelyn Fox Keller, in Reflections on Gender in Science, describes how positivism is first enacted by distancing the researcher from the data. Big Data, as Rudder eagerly asserts, embraces this separation. This leads to perhaps the most dangerous consequence of Big Data ideology: that researchers whose work touches on the impact of race, gender, and sexuality in culture refuse to recognize how they invest their own unstated and perhaps unconscious theories, their specific social standpoint, into their entire research process. This replicates their existing bias and simultaneously hides that bias to the degree their findings are regarded as objectively truthful.
By moving the truth-telling ability from the researcher to data that supposedly speaks for itself, Big Data implicitly encourages researchers to ignore conceptual frameworks like intersectionality or debates about how social categories can be queered rather than reinforced. And there is no reason to suppose that those with access to Big Data — often tech companies and researchers affiliated with them — are immune to bias. They, like anyone, have specific orientations toward the social world, what sort of data could describe it, and how that data should be used. As danah boyd and Kate Crawford point out in "Critical Questions for Big Data,"
regardless of the size of a data, it is subject to limitation and bias. Without those biases and limitations being understood and outlined, misinterpretation is the result.
This kind of short-sightedness allows Rudder to write things like “The ideal source for analyzing gender difference is instead one where a user’s gender is nominally irrelevant, where it doesn’t matter if the person is a man or a woman. I chose Twitter to be that neutral ground” without pausing to consider how gender deeply informs the use of Twitter. Throughout Dataclysm, despite his posture of being separate from the data he works with, Rudder’s politics are continually intervening, not merely in his explanations, which often refer to brain science and evolutionary psychology, but also in how he chooses to measure variables and put them into his analyses.
In a society deeply stratified along the lines of race, class, sex, and many other vectors of domination, how can knowledge ever be said to be disinterested and objective? While former Wired editor-in-chief Chris Anderson was describing the supposed “end of theory” thanks to Big Data in a widely heralded article, Kate Crawford, Kate Miltner, and Mary Gray were correcting that view, pointing out simply that “Big Data is theory.” It's merely one that operates by failing to understand itself as one.
Positivism has been with us a long time, as have the critiques of it. Some research methodologists have addressed and incorporated these critiques: Sandra Harding’s Whose Science? Whose Knowledge? argues for a new, “strong” objectivity that sees including a researcher’s social standpoint as a feature instead of a flaw, permitting a diversity of perspectives instead of one false view from nowhere. Patricia Hill Collins, in Black Feminist Thought, argues that “partiality and not universality is the condition of being heard.”
Big Data takes a different approach. Rather than accept partiality, its apologists try a new trick to salvage the myth of universal objectivity. To evade questions of standpoint, they lionize the data at the expense of the researcher. Big Data's proponents downplay both the role of the measurer in measurement and the researcher's expertise — Rudder makes constant note of his mediocre statistical skills — to subtly shift the source of authority. The ability to tell the truth becomes no longer a matter of analytical approach and instead one of sheer access to data.
The positivist fiction has always relied on unequal access: science could sell itself as morally and politically disinterested for so long because the requisite skills were so unevenly distributed. As scientific practice is increasingly conducted from different cultural standpoints, the inherited political biases of previous science become more obvious. As access to education and advanced research methodologies became more widespread, that unevenness could no longer support the positivist myth.
The cultural ideology of Big Data attempts to reverse this by shifting authority away from (slightly more) democratized research expertise toward unequal access to proprietary, gated data. (Molly Osberg points out in her review of Dataclysm for The Verge that Rudder explains in the notes how he gathered most of his information through personal interactions with other tech company executives.) When data is said to be so good that it tells its own truths and researchers downplay their own methodological skills, that should be understood as an effort to make access to that data more valuable, more rarefied. And the same people positioning this data as so valuable and authoritative are typically the ones who own it and routinely sell access to it.
Data science need not be an elitist practice. We should pursue a popular approach to large data sets that better understands and comes to terms with Big Data's own smallness, emphasizing how many of the intricacies of fluid social life cannot be held still in a database. We shouldn’t let the positivist veneer on data science cause us to overlook its valuable research potential.
But for Big Data to really enhance what we know about the social world, researchers need to fight against the very cultural ideology that, in the short term, overfunds and overvalues it. The view from nowhere that informs books like Dataclysm and much of corporate and commercialized data science must be unmasked as a view from a very specific and familiar somewhere.