Digital Data Methods


Twitter API bias

Developed with my colleagues Andreas Storz (Leiden University) and Daniela Stockmann (Hertie School of Governance), this project examines the potential biases introduced into research findings when scholars collect their data via the opaque application programming interfaces (APIs) provided by digital platforms such as Google, Facebook, Instagram, and Twitter. In the first part of our research, we focus on Twitter--one of the most popular platforms for researchers to study precisely because of its relatively accessible data. When using keyword queries, Twitter's public Search and Streaming APIs rarely return the full population of tweets, and, because the sampling algorithms are proprietary, scholars do not know whether their data constitute a representative sample. Our manuscript (now under review) seeks to crack open this black box by examining data derived from four identical keyword queries to the Firehose (which provides the full population of tweets but is cost-prohibitive to access), Streaming, and Search APIs. We use Kendall's tau and logit regression analyses to understand the differences between the datasets, including which user and content characteristics make a tweet more or less likely to appear in sampled results. Ultimately, we find systematic differences in almost all of the datasets we examine--differences likely to bias scholars' findings--and we recommend that scholars exercise significant caution in future Twitter research.
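The intuition behind the rank comparison can be sketched in a few lines: tally the same items (here, hashtags) in the full-population dataset and in a sampled one, then compute Kendall's tau between the two sets of counts. All names and counts below are invented for illustration; they are not drawn from our actual query results.

```python
from collections import Counter
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n

# Hypothetical hashtag counts for the same keyword query: the full
# population (Firehose) versus a sampled result (e.g., the Search API).
firehose = Counter({"#a": 500, "#b": 320, "#c": 310, "#d": 90, "#e": 40})
search = Counter({"#a": 210, "#b": 100, "#c": 150, "#d": 60, "#e": 5})

items = sorted(firehose)
tau = kendall_tau([firehose[i] for i in items],
                  [search[i] for i in items])
print(f"Kendall's tau = {tau:.2f}")
```

A tau near 1 means the sample preserves the population's ranking of items; lower values flag rank distortions of the kind that can bias downstream findings.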

Digital data decay

In the second part of the project, Andreas Storz and I are collecting extensive data from various social media platforms that will allow us to explore both (1) to what extent and (2) how digital data decay over time. Though the internet is often viewed as a permanent archive, content--from individual social media posts to entire websites--does of course disappear. And yet few researchers consider whether and how the moment of their digital data capture matters. We are therefore collecting numerous social media datasets, first at or very near the moment of online publication, and then attempting to recapture the same data at regular intervals. By longitudinally analyzing how much and what type of data disappear, we hope to develop a better understanding of the biases that might be generated in digital research over time.
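The decay measurement itself is simple to sketch: record the set of item IDs retrievable at publication, re-request the same IDs at each interval, and track the share that still resolves. The IDs and wave labels below are invented placeholders, not our actual data.

```python
# Hypothetical baseline: IDs of posts captured at (or near) publication.
baseline = {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"}

# Hypothetical recapture waves: the IDs still retrievable at each interval.
waves = {
    "1 week": baseline - {"p3"},
    "1 month": baseline - {"p3", "p7"},
    "1 year": baseline - {"p3", "p7", "p8", "p10"},
}

for label, snapshot in waves.items():
    rate = len(baseline & snapshot) / len(baseline)
    print(f"{label}: {rate:.0%} of posts still retrievable")
```

Beyond the raw survival rate, comparing the characteristics of surviving versus vanished items (author type, topic, engagement) is what reveals whether the decay is biased rather than random.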

Computational social science

I have been working closely with computational researchers for the last several years and have developed some extremely fruitful partnerships with these colleagues. However, due to disciplinary norms and practices, it is often quite difficult for social scientists and computational scholars to collaborate. We frequently struggle to understand one another's priorities, desires, methodologies, approaches, and jargon. In an effort to help others bridge this divide, I am currently working on a paper with a number of other scholars from various humanities, social science, and computational backgrounds that explains our disciplinary approaches to operationalizing important social and cultural concepts. We believe this step in the research process is particularly important to unpack, because we have found it creates one of the biggest barriers to cross-disciplinary work. On the one hand, social scientists and humanities scholars want concepts to be theory-driven and rich, and we typically demand that the means of measuring our concepts be rigorous and sound. Indeed, we often criticize computational scholars for taking shortcuts that make measurement easy but lose the core, theoretically rooted meaning behind a concept. On the other hand, computational scholars can grow frustrated with the seemingly obstinate and nit-picky insistence they face from us. And because many computational scholars are especially interested in engineering goals (e.g., improving the performance of algorithms), theory is usually not their top priority. In this paper--born out of a workshop hosted by The Alan Turing Institute in September 2017--we aim to make these different perspectives more comprehensible to other scholars.

Digital Research Ethics


The "right to be forgotten" in digital research

I am also particularly concerned about the ethics of digital data and digital research. Within this theme, Daniela Stockmann and I have a book chapter (Ch. 5) in the just-published Internet Research for the Social Age: New Cases and Challenges (edited by Michael Zimmer and Katharina Kinder-Kurlanda) that asks how we balance the digital media user's right to privacy against the social scientist's need for data integrity and reproducibility. As various public and private actors have begun to recognize internet users' "right to be forgotten," finding an answer to this question has become ever more difficult. Should social scientists acknowledge that we have little right to maintain--and, in particular, to share--social media data obtained without formal consent once they are removed from the public domain, our research may become effectively impossible to reproduce.

Our study explores this possibility by taking Twitter activity during the 2014 "Umbrella Revolution" in Hong Kong as a case study. We compare two datasets collected from Twitter's historical archive, both of which comprise tweets issued between October 1 and 15, 2014, containing at least one of six popular hashtags. However, the two datasets were collected one year apart: in December 2014 (just as the protests were ebbing) and in December 2015. Because tweets that users delete are removed from Twitter's archive, analyzing these data allows us to better understand how our findings might be affected when researchers respect users' right to erase their public content from various platforms. We find that although only 9% of tweets disappeared after a year, statistical analyses performed across the two datasets produce significant discrepancies. We conclude that honoring the right to be forgotten in social media research could have substantial consequences for social scientific inference, as conclusions drawn from such decayed data are likely to be considerably biased, and we encourage a much more vigorous debate among researchers on this important topic.
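The basic comparison can be illustrated as follows, with invented tweet records standing in for the archived collections: measure what share of the earlier collection is absent from the later one, and observe how even a small deletion rate can shift a summary statistic when deletions are not random.

```python
# Hypothetical records: tweet_id -> retweet_count from the first collection.
dec_2014 = {1: 40, 2: 0, 3: 12, 4: 300, 5: 2, 6: 1, 7: 9, 8: 0, 9: 55, 10: 4}

# Hypothetical second collection: tweet 4 (a highly retweeted post) was
# deleted by its author in the intervening year, so it no longer appears.
dec_2015 = {tid: rt for tid, rt in dec_2014.items() if tid != 4}

deleted_share = 1 - len(dec_2015) / len(dec_2014)
mean_2014 = sum(dec_2014.values()) / len(dec_2014)
mean_2015 = sum(dec_2015.values()) / len(dec_2015)

print(f"deleted: {deleted_share:.0%}")
print(f"mean retweets, first collection:  {mean_2014:.1f}")
print(f"mean retweets, second collection: {mean_2015:.1f}")
```

In this toy example a single deleted tweet (10% of the data) cuts the mean retweet count by more than two thirds, illustrating how non-random deletion can distort inferences drawn from the decayed dataset.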