Georgia Tech at EMNLP 2015

Georgia Tech is presenting four papers at EMNLP 2015 in Lisbon!

  • One Vector is not Enough: Entity-Augmented Distributed Semantics for Discourse Relations. Yangfeng Ji and Jacob Eisenstein (published in the Transactions of the Association for Computational Linguistics). This paper presents a distributed semantics approach to discourse relation classification, computing representations for discourse arguments and entities through recursive neural networks.
  • Confounds and Consequences in Geotagged Twitter Data. Umashanthi Pavalanathan and Jacob Eisenstein (full paper). GPS-tagged Twitter data is used widely in social media analysis, but relatively little is understood about the biases that it contains. This paper compares GPS-tagged tweets with messages that are located by user self-report, considering geographical location (urban core versus periphery), gender, age, and linguistic content. We find significant differences on all dimensions, and show that text-based author geolocation performs best on older, male authors.
  • Better Document-level Sentiment Analysis from RST Discourse Parsing. Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein (short paper). We present two models for using hierarchical discourse parses to improve document-level sentiment analysis, obtaining significant improvements over both lexicon-based and classification-based sentiment analyzers.
  • Closing the gap: Domain adaptation from explicit to implicit discourse relations. Yangfeng Ji, Gongbo Zhang, and Jacob Eisenstein (short paper). Discourse relations can be marked explicitly with connectors like “however” and “nonetheless”, but they are often implicit. We show how explicitly marked discourse relations can serve as a supervision signal towards automatically classifying implicitly-labeled discourse relations.

Replication Results for Computational Social Science Projects

Last week, I tweeted some results from the project presentations in my graduate seminar on Computational Social Science. The project involved attempting to replicate published work in the field, based on this fantastic article by Gary King. Some readers expressed interest in knowing more, so now that the final reports are in, here is a summary.

From a list of roughly 20 candidate papers, the student teams (1-3 students each) chose the five papers discussed in the sections below.

I think these five papers are exemplary, field-defining work in computational social science. They were selected in part because the authors already went out of their way to facilitate replication by releasing data and code. In several cases, the original authors went further, and helped the replication teams out with data or with key details. The process of replication inherently involves the potential for criticism, but please keep in mind that the students and I chose these papers precisely because we are so impressed by them.

In addition to replication, the students also did some extensions. I’m not going to talk about those, because that’s really their work, and I don’t want to scoop them in a blog post that doesn’t even mention them by name! Students, feel free to comment if you like, but remember to protect the privacy of your classmates, and don’t mention them by name.

Lessons Learned

  • Text, not networks. The readings in the course included analysis of both social networks and text, but none of the student groups selected papers on social networks for replication. This may be partially because my research focuses on text rather than social networks, so I may have explained these papers better, or subconsciously shown more excitement when discussing them. But another factor is that many of the social network papers involved proprietary data that is simply inaccessible to anyone not working at Facebook or whichever company owns the data. The “immunity” of these papers to replication limits the trust one can place in their results. If I teach the class again, I’ll make an extra effort to find social network papers that include publicly-available data — suggestions welcome!
  • Data processing. In my opinion, the selected papers did an exemplary job of detailing their data processing pipelines. Yet in nearly every case there were questions that couldn’t be resolved from the paper alone. Several students told me this was a major learning component of the project — realizing that a paper you thought you completely understood was actually totally underspecified.
  • Types of data. Different sorts of data have different degrees of difficulty, involving vastly different amounts of preprocessing: review data was relatively easy to work with; social media data from forums and Wikipedia was more difficult; email was really difficult.
  • Source code! In cases where the original source code was either publicly available, or readily accessible to the author, students were able to get quick answers about confusing points in the original papers. Release your code, especially pre-processing scripts, and freeze the code that generates the results in the paper; this is in your interest, as it is the best way to avoid the awkward situation of someone failing to replicate your paper.
  • Machine learning is easier than statistics. Counter-intuitively, the machine learning results were generally easier to replicate than results that involved computing and comparing counts of things. Machine learning may be less sensitive to small differences in preprocessing, because the learning algorithm can aggregate the signal robustly. Counts and frequencies can be dramatically affected by decisions like tokenization, thresholds, et cetera (see the sketch after this list).
  • Lessons not learned. I had hoped that the frustrating experience of trying to divine someone else’s research pipeline would inspire the student teams to describe their own pipelines in as much detail as possible. This was true in some cases, but not in others.
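To make the tokenization point in the list above concrete, here is a minimal sketch in Python (the example text and both tokenizers are invented for illustration) of how two reasonable tokenization choices produce different counts for the same word:

    import re
    from collections import Counter

    text = "Can't wait!! :-) can't even... #excited"

    # Tokenizer A: split on whitespace only.
    tokens_a = text.lower().split()

    # Tokenizer B: keep alphabetic strings, splitting on everything else.
    tokens_b = re.findall(r"[a-z]+", text.lower())

    print(Counter(tokens_a))  # "can't" appears twice as a single token
    print(Counter(tokens_b))  # "can't" becomes "can" + "t"; the emoticon and hashtag vanish

    # A count-based analysis keyed on "can't" sees a frequency of 2 under
    # tokenizer A and 0 under tokenizer B, from the exact same text.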

Narrative Framing of Consumer Sentiment in Online Restaurant Reviews

By Jurafsky et al, 2014. link

Data The dataset from the original paper is not publicly available, so the replication team (one student) used a dataset from the Yelp Dataset Challenge. The student focused on the city of Las Vegas, and so ended up with fewer reviews than in the original.

Features The independent variables included a number of word lists, and these were admirably detailed in the original, describing both the provenance and the motivation. In a few cases, the information wasn’t there (lists of service-staff words and addiction words), but Dan J responded to my email with a precise specification (the original source code!) in less than an hour. Lesson: package up all your source code as soon as you submit the paper, and freeze it so that your follow-up work doesn’t prevent you from getting at the pipeline from the published paper.

Results The replication team reproduced nearly all of the reported results, with the possible exception of the relationship between 1st person plural pronouns and negative reviews — the expected effect showed up in a frequency analysis, but apparently reversed sign in the multivariate regression.
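This kind of reversal is not unusual: a predictor can be positively associated with the outcome on its own and still receive a negative coefficient once a correlated covariate enters the regression. A tiny synthetic illustration (invented data, nothing to do with the actual review corpus):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000

    # x2 drives y; x1 is strongly correlated with x2 but has a small
    # negative direct effect on y.
    x2 = rng.normal(size=n)
    x1 = x2 + 0.1 * rng.normal(size=n)
    y = 2.0 * x2 - 1.0 * x1 + rng.normal(size=n)

    # Univariate view: x1 and y are positively correlated.
    print(np.corrcoef(x1, y)[0, 1])          # roughly +0.7

    # Multivariate view: the coefficient on x1 is negative.
    X = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coef)                               # roughly [0, -1, 2]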

A Computational Approach to Politeness with Application to Social Factors

by Danescu-Niculescu-Mizil et al, 2013. link

Data The authors helpfully provide all the data from the original study. The replication team also explored some additional data from Yelp.

Features The independent variables in the original paper include detectors for twenty linguistic “strategies” for politeness. Some of these are detailed well enough in the paper, others perhaps not; but the authors fortunately provided source code, which enabled the replication team to extract them all.

Results The team was able to replicate the prediction accuracy on the Wikipedia corpus, but not on StackExchange; however, the pre-trained classifier provided by the original authors did reproduce the results reported in the paper. The replication team was not able to determine why their reimplementation didn’t do as well, but they suspect data preprocessing, since they did replicate the results on Wikipedia. The paper reports average politeness scores for the twenty strategies, but only for Wikipedia (preserving StackExchange as a test corpus); the replicators found that some of the strategies work similarly in both domains, but not all!
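For readers who have not seen the paper, the basic recipe is to turn each request into a vector of binary strategy indicators and hand those to a standard classifier. The sketch below is only a toy version with two made-up strategies and invented labels, assuming scikit-learn; the real strategy definitions live in the authors' released code.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def strategy_features(request):
        """Two toy politeness-strategy indicators (illustrative, not the paper's)."""
        text = request.lower()
        return [
            int("please" in text.split()),                     # bare 'please'
            int(any(w in text for w in ("thanks", "thank"))),  # gratitude
        ]

    requests = [
        "Could you please take a look at this edit? Thanks!",
        "Fix this now.",
        "Thanks for the pointer, would you mind clarifying?",
        "Why did you revert my change?",
    ]
    labels = [1, 0, 1, 0]  # 1 = polite, 0 = impolite (toy annotations)

    X = np.array([strategy_features(r) for r in requests])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict([strategy_features("Please see the talk page, thanks")]))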

Phrases that Signal Workplace Hierarchy

by Gilbert, 2012. link

Two student groups worked on this paper.

Data and features Email data is difficult, and the preprocessing pipeline for this paper is very involved. Eric describes it nicely in the paper with a flowchart figure, but unfortunately this isn’t the whole story; the replicators were not able to reproduce the stated number of features from the paper. One team found a preprocessed version of the data which made their work somewhat easier, although this only corresponds to some of the preprocessing steps performed in the paper.

One issue is that to obtain generalizable features, the original author manually removed words and phrases that were specific to Enron. The replication teams did not have time to immerse themselves in the history of this company (as the original author reports doing, for example by reading the details of the bankruptcy proceeding), so they explored automatic solutions, such as removing proper nouns.

Results One team got very close to the classification accuracy reported in the paper; this was the team that found the partially preprocessed data, so perhaps that helped. The other team was roughly 5% below the numbers in the paper. A key result from the original paper was a list of features with the highest weights in each class (up or down the corporate hierarchy); neither team came close to replicating these feature lists successfully.

Echoes of Power: Language Effects and Power Differences in Social Interaction

by Danescu-Niculescu-Mizil et al, 2012. link

Data The data for this paper consists of Wikipedia discussions and Supreme Court transcripts, and is provided by the authors. However, some preprocessing is required to identify exactly which observations in each dialogue are counted, and the replication team found that the results were quite sensitive to this decision. Nonetheless, they were able to identify a preprocessing pipeline that yielded results closely matching those in the paper.

Features The paper relied on several aggregated feature groups from LIWC. The team did not have access to LIWC at first, and I suggested that they could construct these feature groups on their own from labeled data, as they were mostly syntactic categories such as “quantifiers” and “prepositions.” In fact, this didn’t work at all; for example, while only a small handful of word types are ever labeled as quantifiers in the Brown corpus, a huge number of words are considered to be quantifiers in LIWC. This affected results dramatically.

Results The team replicated all of the main results in the paper, although as mentioned above, they were sensitive to seemingly irrelevant dataset construction details. The team noted that the degree of coordination was averaged differently in the Wikipedia and Supreme Court data, which is justified in the original paper in a footnote. In fact, this difference is crucial: if the Wikipedia averaging technique is used on the Supreme Court data, the effect is reversed and Justices appear to coordinate more with lawyers.
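For context, the coordination measure in question is, roughly, the probability that a reply exhibits a linguistic marker given that the message it responds to did, minus the baseline probability that the reply exhibits the marker; how those per-marker, per-pair quantities are then averaged is exactly the detail that mattered above. A minimal sketch of the per-marker computation (toy data, and the aggregation across pairs and markers is omitted):

    def coordination(exchanges):
        """exchanges: (target_has_marker, reply_has_marker) booleans for one
        speaker pair and one marker category.

        Returns P(reply has marker | target had it) - P(reply has marker),
        or None when the conditional probability is undefined."""
        triggered = [reply for target, reply in exchanges if target]
        if not triggered:
            return None
        p_conditional = sum(triggered) / len(triggered)
        p_baseline = sum(reply for _, reply in exchanges) / len(exchanges)
        return p_conditional - p_baseline

    exchanges = [(True, True), (True, False), (False, False),
                 (False, True), (True, True), (False, False)]
    print(coordination(exchanges))  # 2/3 - 3/6 = 0.1666...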

I think this team had the most success at replicating the original paper, even to the point of producing nearly identical figures. I often complain about reviewing WWW papers because they are very long, and we are often asked to review many at a time. But perhaps that length is what permitted the authors to describe their approach in sufficient detail for it to be replicated. A lot of the credit should go to the hard work of the replication team, too.

Political Ideology Detection Using Recursive Neural Networks

by Iyyer et al, 2014. link

This paper seemed quite ambitious to replicate in a few weeks, because the recursive neural net implementation is near the bleeding edge of contemporary NLP. But the team was undeterred, and made a strong effort.

Data The students were able to obtain the IBC dataset used in the paper, by emailing the authors. You can find a fragment of the dataset here. This eliminated preprocessing questions, since the annotations include the constituent parse trees needed to build the Recursive NN.

Method As predicted, building an efficient implementation of the RNN was a major challenge. The team reports five hours for training and testing on a single cross-validation fold; this seems pretty good to me, but it left them very little time to debug their approach. Baselines included logistic regression on bag-of-words and word2vec features, using either sentence-level or phrase-level labels.
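For reference, the bag-of-words baseline is the kind of model that can be stood up in a few lines; a minimal sketch with scikit-learn (invented sentences and labels, and this says nothing about the word2vec or RNN variants):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    sentences = [
        "we must cut taxes and shrink the federal government",
        "health care is a right the government should guarantee",
        "strong borders keep our communities safe",
        "we need to expand access to public education",
    ]
    labels = ["conservative", "liberal", "conservative", "liberal"]

    # Bag-of-words features fed to a logistic regression classifier.
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    print(cross_val_score(model, sentences, labels, cv=2).mean())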

Results The results for the baseline systems and the RNN were 5-10% lower than in the original paper. The RNN did outperform the linear classifiers, and the phrase-level labels seemed to help. I was not surprised that the team was not able to replicate the RNN results in such a short timeframe, but I had hoped to at least see similar numbers for the “black box” classifiers. I think the team might simply have run out of time to tune these classifiers properly.

How to replicate Google News?

For starters: I definitely don’t know how to replicate Google News! But some GT undergrads want to try, and they asked me for pointers. Here’s what I said:

“As far as I know, the original effort in this space is Newsblaster. It’s still up and running, looking a lot like Google News (which came much later). A key difference is that they try to summarize the stories, not just cluster them.


A few other things to read:

I was part of a team that did some work on news story clustering in 2012, but honestly it’s probably overly complicated for your purposes:

If these links don’t work for you, just google the titles; all of these papers can be found online. The papers they cite might be even more useful. There’s a lot to read in this area, and I suspect you’ll find that technology from the late 90s and early 2000s will work pretty well.

There’s an NLP startup called Prismatic that is doing something similar, you might find their blog useful. (Full disclosure: I am friends with one of the co-founders and their “Chief Software Wench.”)

There’s a good information retrieval textbook available online, and it has chapters that will be relevant to many of the core technologies that you’ll need, like clustering.

Good luck!”
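To make the clustering suggestion in those pointers concrete, here is a minimal sketch of grouping headlines with TF-IDF and k-means, assuming scikit-learn (the headlines are invented; a real system would cluster full articles and choose the number of clusters more carefully):

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    headlines = [
        "Central bank raises interest rates again",
        "Rate hike surprises markets as central bank acts",
        "Local team wins championship in overtime thriller",
        "Fans celebrate dramatic overtime championship win",
    ]

    # Represent each headline as a TF-IDF vector, then cluster.
    X = TfidfVectorizer(stop_words="english").fit_transform(headlines)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # the two rate-hike stories should land in the same cluster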


Georgia Tech @ ACL 2014

Georgia Tech had a lot to say at this year’s annual meeting of the Association for Computational Linguistics (ACL).

  • Representation Learning for Text-level Discourse Parsing. Ji and Eisenstein. Main conference.
  • Fast Easy Unsupervised Domain Adaptation with Marginalized Structured Dropout. Yang and Eisenstein. Main conference.
  • Modeling Factuality Judgments in Social Media Text. Soni, Mitra, Gilbert, and Eisenstein. Main conference.
  • POS induction with distributional and morphological information using a distance-dependent Chinese restaurant process. Sirts, Eisenstein, Elsner, and Goldwater. Main Conference.
  • Linguistic Style-Shifting in Online Social Media. Pavalanathan and Eisenstein, Workshop on Social Dynamics and Personal Attributes.
  • Mining Themes and Interests in the Asperger’s and Autism Community. Ji, Hong, Arriaga, Rozga, Abowd, and Eisenstein. Workshop on Computational Linguistics and Clinical Psychology.

I was especially excited to draw in some of my social computing colleagues, including Catherine Grevet, Tanushree Mitra, and Eric Gilbert. Here’s most of us:

From left: Sandeep Soni, Yangfeng Ji, Yi Yang, Catherine Grevet, Uma Pavalanathan, Tanushree Mitra, and Jacob Eisenstein


Computational Social Science Hack Day 2014

GT and Emory Computational Social Science enthusiasts again joined forces — this time for a hack day at Georgia Tech — following up on a fun and successful workshop at Emory in November. The 20 participants from the two universities pursued a variety of projects, including:

  • Structural balance theory in comic book narratives (Amish, Sandeep, and Vinay, with help from Vinodh)
  • Identifying noteworthy items in text from State-of-the-Union addresses (Tanushree and Yangfeng, with help from Jeff and Tom)
  • Combating the trafficking of minors, using text analysis and computer vision (Eric and Parisa)
  • Connecting spelling variation with sentiment analysis (Uma, Yi, and Yu)
  • Predicting Grammy award winners from Tweet volume (Jayita and Spurthi)
  • Mining OpenSecrets to find latent clusters of campaign donors and recipients (Jacob, Jon, and Munmun)

Since this is my blog post, I’ll take a little more time to talk about my project. Essentially, we just read in OpenSecrets data and built a matrix of donors and recipients. The sparse SVD of this matrix revealed some interesting factors. Here are the top 7 candidates and donors for each of the most interesting factors, along with my own names for the factors; a minimal sketch of the pipeline appears after the list.

—– Factor 2 (unions) —–
Julia Brownley (D) American Fedn of St/Cnty/Munic Employees
Ed Markey (D) American Assn for Justice
Ann Kirkpatrick (D) Intl Brotherhood of Electrical Workers
Ann Mclane Kuster (D) International Assn of Fire Fighters
Timothy H. Bishop (D) Operating Engineers Union
Ron Barber (D) American Federation of Teachers
Cheri Bustos (D) National Assn of Letter Carriers
—– Factor 3 (insurers) —–
Kay R. Hagan (D) Metlife Inc
Max Baucus (D) American Council of Life Insurers
Ron Kind (D) Principal Life Insurance
Dave Camp (R) Massachusetts Mutual Life Insurance
Joseph Crowley (D) Morgan Stanley
Richard E. Neal (D) TIAA-CREF
Pat Toomey (R) UBS Americas
—– Factor 4 (finance) —–
Jeb Hensarling (R) Investment Co Institute
Randy Neugebauer (R) American Land Title Assn
Scott Garrett (R) Chicago Mercantile Exchange
Michael Grimm (R) Indep Insurance Agents & Brokers/America
Bill Huizenga (R) Bank of America
Sean P. Duffy (R) Securities Industry & Financial Mkt Assn
Steve Stivers (R) PricewaterhouseCoopers
—– Factor 6 (construction and industry) —–
Bill Shuster (R) American Council of Engineering Cos
Frank A. LoBiondo (R) Owner-Operator Independent Drivers Assn
Patrick Meehan (R) CSX Corp
David P Joyce (R) Carpenters & Joiners Union
Jim Gerlach (R) American Road & Transport Builders Assn
Nick Rahall (D) Norfolk Southern
Tom Petri (R) NiSource Inc
—– Factor 7 (arms manufacturers) —–
Adam Smith (D) Raytheon Co
Buck Mckeon (R) Northrop Grumman
John Cornyn (R) Lockheed Martin
Joe Wilson (R) National Assn of Realtors
John Carter (R) AT&T Inc
Michael McCaul (R) BAE Systems
Lamar Smith (R) Honeywell International
—– Factor 8 (technology) —–
Jeanne Shaheen (D) Microsoft Corp
Ed Markey (D) Every Republican is Crucial PAC
Kay R. Hagan (D) Verizon Communications
George Holding (R) National Cable & Telecommunications Assn
Chris Coons (D) Google Inc
Joe Heck (R) National Assn of Broadcasters
Ron DeSantis (R) Viacom International
—– Factor 9 (energy + Halliburton) —–
Pete Olson (R) Halliburton Co
Ed Markey (D) Koch Industries
Joe Barton (R) Independent Petroleum Assn of America
Bill Johnson (R) National Cable & Telecommunications Assn
Michael G. Fitzpatrick (R) Occidental Petroleum
Mary L. Landrieu (D) Cellular Telecom & Internet Assn
Steve Scalise (R) DTE Energy
—– Factor 11 (communications) —–
Bob Goodlatte (R) Google Inc
Kelly Ayotte (R) Clear Channel Communications
Lindsey Graham (R) Sprint Corp
Tim Scott (R) Association of American Railroads
Susan Collins (R) Union Pacific Corp
Eric Swalwell (D) Norfolk Southern
John Thune (R) Microsoft Corp
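For the curious, the pipeline behind these factor lists is only a few lines; here is a minimal sketch with scipy (the contribution records are invented stand-ins for the OpenSecrets data):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import svds

    # (recipient, donor, amount) records, invented for illustration.
    records = [
        ("Candidate A", "Union X", 10000.0),
        ("Candidate A", "Union Y", 7500.0),
        ("Candidate B", "Union X", 5000.0),
        ("Candidate C", "Bank P", 12000.0),
        ("Candidate D", "Bank P", 9000.0),
        ("Candidate D", "Bank Q", 4000.0),
    ]

    recipients = sorted({r for r, _, _ in records})
    donors = sorted({d for _, d, _ in records})
    r_idx = {r: i for i, r in enumerate(recipients)}
    d_idx = {d: i for i, d in enumerate(donors)}

    rows = [r_idx[r] for r, _, _ in records]
    cols = [d_idx[d] for _, d, _ in records]
    vals = [a for _, _, a in records]
    M = csr_matrix((vals, (rows, cols)), shape=(len(recipients), len(donors)))

    # Truncated SVD of the recipient-by-donor matrix; each singular vector
    # pair is a "factor" linking candidates to donors.
    U, s, Vt = svds(M, k=2)
    for k in range(2):
        top_recipients = [recipients[i] for i in np.argsort(-np.abs(U[:, k]))[:3]]
        top_donors = [donors[j] for j in np.argsort(-np.abs(Vt[k]))[:3]]
        print(f"factor {k}: {top_recipients} | {top_donors}")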

Our plan was to learn more about these factors by connecting them with text from the Wikipedia pages of the companies, and with the NOMINATE scores and committee memberships of the legislators. Maybe that would have been possible in a 24-hour hackathon, but we ran out of time just as I was starting to get reasonable topics for the Wikipedia pages.

We went into the day with the goal of building and strengthening connections across disciplines and institutions, and by that metric I think the day was a success. In any case, I had a blast working with new people and trying out some new ideas, and I’m confident this will impact my research in the long run. It was also a lot of fun to work with my own students and colleagues in a more collaborative setting. Taking a day off from endless paper and proposal deadlines (and non-stop email distractions) to hack on a new project felt like a mini-vacation.

Georgia Tech at EMNLP 2013

EMNLP is one of my favorite conferences, so I’m very pleased that Georgia Tech’s Computational Linguistics Lab will have two papers to present.

Yi Yang and I have written a paper that formalizes unsupervised text normalization in a log-linear model, which allows arbitrary (local) features to be combined with a target language model. The model is trained in a maximum-likelihood framework, marginalizing over possible normalizations using a novel sequential Monte Carlo training scheme. Longtime readers may find some irony in me writing a paper about social media normalization, but if we want to understand systematic orthographic variation — such as (TD)-deletion — then an accurate normalization system is a very useful tool to have in the shed.
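To give a flavor of the setup (a toy decoder with hand-set weights and a made-up candidate list and bigram language model, not the model or the sequential Monte Carlo training from the paper): each candidate normalization is scored by combining a local string-similarity feature with a target language model, and the highest-scoring sequence wins.

    import math
    from difflib import SequenceMatcher
    from itertools import product

    # Hypothetical candidate sets and bigram probabilities, for illustration only.
    candidates = {"u": ["you", "u"], "goin": ["going", "goin"], "2nite": ["tonight", "2nite"]}
    bigram_p = {("<s>", "you"): 0.2, ("you", "going"): 0.1, ("going", "tonight"): 0.1}

    def lm_logp(prev, word):
        return math.log(bigram_p.get((prev, word), 1e-4))  # crude smoothing

    def score(noisy, normalized, w_sim=1.0, w_lm=1.0):
        """Log-linear combination of a local similarity feature and the LM."""
        total, prev = 0.0, "<s>"
        for x, y in zip(noisy, normalized):
            similarity = SequenceMatcher(None, x, y).ratio()
            total += w_sim * similarity + w_lm * lm_logp(prev, y)
            prev = y
        return total

    noisy = ["u", "goin", "2nite"]
    best = max(product(*(candidates[x] for x in noisy)),
               key=lambda norm: score(noisy, norm))
    print(best)  # ('you', 'going', 'tonight') under these toy weights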

Yangfeng Ji and I have obtained very strong results on paraphrase detection, beating the prior state-of-the-art on the well-studied MSR Paraphrase Corpus by 3% raw accuracy. We build a distributional representation for sentence semantics, which we combine with traditional fine-grained features. Yangfeng’s key insight in this paper is to also use supervised information to compute the distributional representation itself, by reweighting the words according to their discriminability.
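One crude way to picture the reweighting idea (a sketch of the general intuition, not necessarily the paper's actual weighting scheme): score each word by how strongly it discriminates between the classes in a supervised model, and use those scores as weights when averaging word vectors into a sentence representation.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    sentences = ["great plot and acting", "boring plot and weak acting",
                 "great soundtrack", "weak dialogue"]
    labels = [1, 0, 1, 0]  # toy supervised signal

    vec = CountVectorizer()
    clf = LogisticRegression().fit(vec.fit_transform(sentences), labels)

    vocab = vec.get_feature_names_out()
    weights = dict(zip(vocab, np.abs(clf.coef_[0])))         # discriminability proxy

    rng = np.random.default_rng(0)
    word_vectors = {w: rng.normal(size=50) for w in vocab}   # stand-in embeddings

    def sentence_vector(sentence):
        """Weighted average of word vectors, weighted by discriminability."""
        words = [w for w in sentence.lower().split() if w in word_vectors]
        w = np.array([weights[x] for x in words])
        V = np.array([word_vectors[x] for x in words])
        return (w[:, None] * V).sum(axis=0) / (w.sum() + 1e-8)

    print(sentence_vector("great acting").shape)  # (50,)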

How noisy is social media text?

In my NAACL 2013 paper/rant, I expressed concern that a lot of the NLP work targeting social media is based on folk linguistics rather than either solid theory or empirical data about how social media language actually works. In my paper I tried to provide a little of both: citations to some of my favorite papers from the CMC and sociolinguistics literatures (which seem to be nearly totally unknown in NLP circles), and an empirical analysis of social media language differences using n-gram counts.

This recent paper by Baldwin, Cook, Lui, Mackinlay, and Wang — basically contemporaneous with mine, but they were kind enough to cite me — takes the empirical analysis a good way further. I was particularly interested to see that they applied a generative HPSG grammar of English to corpora from Twitter, YouTube comments (the worst place on the whole internet?), web forums, blog posts, Wikipedia, and the BNC. They found that if you want strict parsing of full sentences, Twitter is quite difficult — only 14% of tweets are parseable this way, as compared to 25% for blogs and 49% for Wikipedia. Relaxing punctuation and capitalization reduces these differences considerably, yielding 36% parseability for tweets, 44% for blogs, and 68% for Wikipedia. Another 25% of tweets are viewed as grammatical fragments (e.g., “very funny”), leaving only 37% of tweets as “unparseable”, compared to 35% for blogs and 26% for Wikipedia. This coheres with arguments from linguists like Thurlow and Squires (sadly, I can find no publicly available PDF for her cool 2010 Language in Society paper) that claims of a radically unreadable netspeak dialect are greatly exaggerated.

The paper also provides a lexical analysis, using a chi-squared score to measure differences among the 500 most frequent words in each corpus. But if, as I argued in my 2013 NAACL paper, social media is an amalgam of writing styles rather than a single genre or dialect, few of these stylistic markers will attain enough universality to reach the top 500 words, besides the usual suspects: lol, you/u, gonna, and the most popular emoticons. Baldwin et al also measure the perplexity of a trigram language model, which may capture this “long tail”, but personally I find this a little harder to interpret than simple n-gram out-of-vocabulary counts, as it depends on modeling decisions such as smoothing.
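For concreteness, the lexical comparison being described amounts to a 2x2 contingency test per word; a minimal sketch with scipy (the counts are invented):

    from scipy.stats import chi2_contingency

    # Toy counts of the word "lol" versus all other tokens in two corpora.
    twitter_lol, twitter_other = 5000, 995000
    bnc_lol, bnc_other = 10, 999990

    table = [[twitter_lol, twitter_other],
             [bnc_lol, bnc_other]]
    chi2, p, dof, expected = chi2_contingency(table)
    print(chi2, p)  # a huge chi-squared score: "lol" is heavily Twitter-skewed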

Overall, I’m very happy to see NLP technology used to empirically measure the similarities and differences between social media and other forms of writing, and I’m particularly intrigued by the use of automated generative parsing. As we try to make language technology robust to language variation, papers like this will help us move forward on a solid empirical footing.

(h/t Brendan O’Connor for pointing me to this paper)

