Replication Results for Computational Social Science Projects

Last week, I tweeted some results from the project presentations in my graduate seminar on Computational Social Science. The project involved attempting to replicate published work in the field, based on this fantastic article by Gary King. Some readers expressed interest in knowing more, so with the final reports in hand, this is a summary.

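From a list of roughly 20 candidate papers, the student teams (1-3 students each) chose the following:

  • Narrative Framing of Consumer Sentiment in Online Restaurant Reviews (Jurafsky et al, 2014)
  • A Computational Approach to Politeness with Application to Social Factors (Danescu-Niculescu-Mizil et al, 2013)
  • Phrases that Signal Workplace Hierarchy (Gilbert, 2012)
  • Echoes of Power: Language Effects and Power Differences in Social Interaction (Danescu-Niculescu-Mizil et al, 2012)
  • Political Ideology Detection Using Recursive Neural Networks (Iyyer et al, 2014)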

I think these five papers are exemplary, field-defining work in computational social science. They were selected in part because the authors already went out of their way to facilitate replication by releasing data and code. In several cases, the original authors went further, and helped the replication teams out with data or with key details. The process of replication inherently involves the potential for criticism, but please keep in mind that the students and I chose these papers precisely because we are so impressed by them.

In addition to replication, the students also did some extensions. I’m not going to talk about those, because that’s really their work, and I don’t want to scoop them in a blog post that doesn’t even mention them by name! Students, feel free to comment if you like, but remember to protect the privacy of your classmates, and don’t mention them by name.

Lessons Learned

  • Text, not networks. The readings in the course included analysis of both social networks and text, but none of the student groups selected papers on social networks for replication. This may be partially because my research focuses on text and not social networks, so I may have explained the text papers better, or subconsciously shown more excitement when discussing them. But another factor is that many of the social network papers involved proprietary data that was simply inaccessible to anyone not working with Facebook or whoever owns the data. The “immunity” of these papers to replication limits the trust one can place in their results. If I teach the class again, I’ll make an extra effort to find social network papers that include publicly available data; suggestions welcome!
  • Data processing. In my opinion, the selected papers did an exemplary job of detailing their data processing pipelines. Yet in nearly every case there were questions that couldn’t be resolved from the paper alone. Several students told me this was a major learning component of the project: realizing that a paper you thought you completely understood was actually totally underspecified.
  • Types of data. Different sorts of data have different degrees of difficulty, involving vastly different amounts of preprocessing: review data was relatively easy to work with; social media data from forums and Wikipedia was more difficult; email was really difficult.
  • Source code! In cases where the original source code was either publicly available, or readily accessible to the author, students were able to get quick answers about confusing points in the original papers. Release your code, especially pre-processing scripts, and freeze the code that generates the results in the paper; this is in your interest, as it is the best way to avoid the awkward situation of someone failing to replicate your paper.
  • Machine learning is easier than statistics. Counter-intuitively, the machine learning results were generally easier to replicate than results that involved computing and comparing counts for things. Machine learning may be less sensitive to small differences in preprocessing, because the learning algorithm can aggregate the signal robustly. Counts and frequencies can be dramatically affected by decisions like tokenization, thresholds, et cetera.
  • Lessons not learned. I had hoped that the frustrating experience of trying to divine someone else’s research pipeline would inspire the student teams to describe their own pipelines in as much detail as possible. This was true in some cases, but not in others.
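To make the point about counts and tokenization concrete, here is a minimal sketch. The word list, the review sentence, and both tokenizers are invented for illustration; none of them come from the papers or from the student projects. It computes the rate of first-person-plural words in the same text under two plausible tokenization choices.

```python
import re

# Hypothetical word list and review text, invented for illustration only.
FIRST_PERSON_PLURAL = {"we", "us", "our", "ours", "we're", "we've"}

REVIEW = "We waited an hour. We're never going back; our server ignored us!"


def tokenize_whitespace(text):
    """Split on whitespace only: case and punctuation stay attached."""
    return text.split()


def tokenize_normalized(text):
    """Lowercase, then keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def rate(tokens, lexicon):
    """Fraction of tokens that appear in the lexicon."""
    hits = sum(1 for tok in tokens if tok in lexicon)
    return hits / len(tokens)


rate_ws = rate(tokenize_whitespace(REVIEW), FIRST_PERSON_PLURAL)
rate_norm = rate(tokenize_normalized(REVIEW), FIRST_PERSON_PLURAL)
print(f"whitespace: {rate_ws:.2f}  normalized: {rate_norm:.2f}")
```

Both tokenizers produce twelve tokens, but whitespace splitting matches only "our" (the other pronouns are capitalized or have punctuation attached), while the normalized version matches four. A classifier trained on either representation would probably cope; a reported frequency would not.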

Narrative Framing of Consumer Sentiment in Online Restaurant Reviews

by Jurafsky et al, 2014. link

Data The dataset from the original paper is not publicly available, so the replication team (one student) used a dataset from the Yelp Dataset Challenge. The student focused on the city of Las Vegas, and so ended up with fewer reviews than in the original.

Features The independent variables included a number of word lists, and these were admirably detailed in the original, describing both the provenance and the motivation. In a few cases, the information wasn’t there (lists of service-staff words and addiction words), but Dan J responded to my email with a precise specification (the original source code!) in less than an hour. Lesson: package up all your source code as soon as you submit the paper, and freeze it so that your follow-up work doesn’t prevent you from getting at the pipeline from the published paper.

Results The replication team replicated nearly all of the reported results, with the possible exception of the relationship between 1st person plural and negative reviews — this worked out in a frequency analysis, but apparently it got reversed in the multivariate regression.

A Computational Approach to Politeness with Application to Social Factors

by Danescu-Niculescu-Mizil et al, 2013. link

Data The authors helpfully provide all the data from the original study. The replication team also explored some additional data from Yelp.

Features The independent variables in the original paper include detectors for twenty linguistic “strategies” for politeness. Some of these are detailed well enough in the paper, others perhaps not; but the authors fortunately provided source code, which enabled the replication team to extract them all.

Results The team was able to replicate the prediction accuracy on the Wikipedia corpus, but not on StackExchange; however, the pre-trained classifier provided by the original authors did reproduce the results reported in the paper. The replication team was not able to determine why their reimplementation didn’t do as well, but they suspect StackExchange-specific data preprocessing, since the same implementation does replicate the results on Wikipedia. The paper contains average politeness scores for the twenty strategies, but only for Wikipedia (preserving StackExchange as a test corpus); the replicators found that some of the strategies work similarly in both domains, but not all!

Phrases that Signal Workplace Hierarchy

by Gilbert, 2012. link

Two student groups worked on this paper.

Data and features Email data is difficult, and the preprocessing pipeline for this paper is very involved. Eric describes it nicely in the paper with a flowchart figure, but unfortunately this isn’t the whole story; the replicators were not able to reproduce the stated number of features from the paper. One team found a preprocessed version of the data which made their work somewhat easier, although this only corresponds to some of the preprocessing steps performed in the paper.

One issue is that to obtain generalizable features, the original author manually removed words and phrases that were specific to Enron. The replication teams did not have time to immerse themselves in the history of this company (as the original author reports doing, for example by reading the details of the bankruptcy proceeding), so they explored automatic solutions, such as removing proper nouns.

Results One team got very close to the classification accuracy reported in the paper; this was the team that found the partially preprocessed data, so perhaps that helped. The other team was roughly 5% below the numbers in the paper. A key result from the original paper was a list of features with the highest weights in each class (up or down the corporate hierarchy); neither team came close to replicating these feature lists successfully.

Echoes of Power: Language Effects and Power Differences in Social Interaction

by Danescu-Niculescu-Mizil et al, 2012. link

Data The data for this paper consists of Wikipedia discussions and Supreme Court transcripts, and is provided by the authors. However, some preprocessing is required to identify exactly which observations in each dialogue are counted, and the replication team found that the results were quite sensitive to this decision. In the end, they were able to identify a preprocessing pipeline that yielded results closely matching those in the paper.

Features The paper relied on several aggregated feature groups from LIWC. The team did not have access to LIWC at first, and I suggested that they could construct these feature groups on their own from labeled data, as they were mostly syntactic categories such as “quantifiers” and “prepositions.” In fact, this didn’t work at all; for example, while only a small handful of word types are ever labeled as quantifiers in the Brown corpus, a huge number of words are considered to be quantifiers in LIWC. This affected results dramatically.

Results The team replicated all of the main results in the paper, although as mentioned above, they were sensitive to seemingly irrelevant dataset construction details. The team noted that the degree of coordination was averaged differently in the Wikipedia and Supreme Court data, which is justified in the original paper in a footnote. In fact, this difference is crucial: if the Wikipedia averaging technique is used on the Supreme Court data, the effect is reversed and Justices appear to coordinate more with lawyers.
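For intuition about how an averaging choice alone can flip a comparison, here is a toy sketch. The numbers are made up, the group names are placeholders, and the two schemes (a per-speaker average versus a per-exchange average) are generic stand-ins, not necessarily the ones used in the paper or by the replication team.

```python
# Toy data, invented for illustration: each tuple is
# (number of exchanges, coordination score) for one speaker.
group_a = [(100, 0.02), (5, 0.10)]
group_b = [(10, 0.05), (10, 0.05)]


def macro_average(speakers):
    """Average the per-speaker scores, one vote per speaker."""
    return sum(score for _, score in speakers) / len(speakers)


def micro_average(speakers):
    """Average over exchanges, weighting each speaker by activity."""
    total = sum(n for n, _ in speakers)
    return sum(n * score for n, score in speakers) / total


for name, fn in [("macro", macro_average), ("micro", micro_average)]:
    a, b = fn(group_a), fn(group_b)
    print(f"{name}: A={a:.3f} B={b:.3f} -> {'A' if a > b else 'B'} higher")
```

With these numbers, group A looks more coordinated when every speaker counts equally, and less coordinated when every exchange counts equally, because one very active speaker with a low score dominates the pooled average. Nothing about the underlying data changes between the two lines of output.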

I think this team had the most success at replicating the original paper, even to the point of producing nearly identical figures. I often complain about reviewing WWW papers because they are very long, and we are often asked to review many at a time. But perhaps this length is what permitted the authors to describe their approach in sufficient detail for it to be replicated. A lot of the credit should go to the hard work of the replication team, too.

Political Ideology Detection Using Recursive Neural Networks

by Iyyer et al, 2014. link

This paper seemed quite ambitious to try to replicate in a few weeks, because the recursive neural net implementation is near the bleeding edge of contemporary NLP. But the team was undeterred, and made a strong effort.

Data The students were able to obtain the IBC dataset used in the paper, by emailing the authors. You can find a fragment of the dataset here. This eliminated preprocessing questions, since the annotations include the constituent parse trees needed to build the Recursive NN.

Method As predicted, building an efficient implementation of the RNN was a major challenge. The team reports five hours for training and testing on a single cross-validation fold; this seems pretty good to me, but it left them very little time to debug their approach. Baselines included logistic regression on bag-of-words and word2vec features, trained with either sentence-level or phrase-level labels.

Results The results for the baseline systems and the RNN were 5-10% lower than in the original paper. The RNN did outperform the linear classifiers, and the phrase-level labels seemed to help. I was not surprised that the team was not able to replicate the RNN results in such a short timeframe, but I had hoped to at least see similar numbers for the “black box” classifiers. I think the team might have simply run out of time to tune these classifiers properly.