How noisy is social media text?

In my NAACL 2013 paper/rant, I expressed concern that a lot of the NLP work targeting social media is based on folk linguistics rather than either solid theory or empirical data about how social media language actually works. In my paper I tried to provide a little of both: citations to some of my favorite papers from the CMC and sociolinguistics literatures (which seem to be nearly totally unknown in NLP circles), and an empirical analysis of social media language differences using n-gram counts.

This recent paper by Baldwin, Cook, Lui, Mackinlay, and Wang (basically contemporaneous with mine, but they were kind enough to cite me) takes the empirical analysis a good way further. I was particularly interested to see that they applied a generative HPSG grammar of English to corpora from Twitter, YouTube comments (the worst place on the whole internet?), web forums, blog posts, Wikipedia, and the BNC. They found that if you want strict parsing of full sentences, Twitter is quite difficult: only 14% of tweets are parseable this way, as compared to 25% for blogs and 49% for Wikipedia. Relaxing punctuation and capitalization reduces these differences considerably, yielding 36% parseability for tweets, 44% for blogs, and 68% for Wikipedia. Another 25% of tweets are viewed as grammatical fragments (e.g., “very funny”), leaving only 37% of tweets as “unparseable”, compared to 35% for blogs and 26% for Wikipedia. This coheres with arguments from linguists like Thurlow and Squires (sadly, I find no publicly available PDF for her cool 2010 Language in Society paper) that claims of a radically unreadable netspeak dialect are greatly exaggerated.
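To make the strict-versus-relaxed comparison concrete, here is a minimal sketch of the kind of measurement involved. The tiny NLTK context-free grammar below is only a stand-in for the broad-coverage HPSG grammar Baldwin et al. actually used, and approximating the relaxed condition by lowercasing and stripping punctuation before parsing is my own simplification, not necessarily how they implemented it.

```python
# Toy illustration of "parseability" under strict vs. relaxed conditions.
# The tiny CFG stands in for a broad-coverage HPSG grammar; the relaxation
# (lowercasing + stripping punctuation before parsing) is an approximation.
import string
import nltk

TOY_GRAMMAR = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | PRP
VP -> V NP | V
Det -> 'the' | 'a'
N -> 'cat' | 'tweet'
PRP -> 'i'
V -> 'saw' | 'posted'
""")
PARSER = nltk.ChartParser(TOY_GRAMMAR)

def parses(sentence):
    """True if the toy grammar yields at least one full parse."""
    try:
        return any(True for _ in PARSER.parse(sentence.split()))
    except ValueError:  # a token the grammar doesn't cover
        return False

def relax(sentence):
    """Crude relaxation: lowercase and strip punctuation."""
    return sentence.lower().translate(str.maketrans("", "", string.punctuation))

def parseability(corpus, relaxed=False):
    """Fraction of sentences the grammar accepts."""
    sents = [relax(s) if relaxed else s for s in corpus]
    return sum(parses(s) for s in sents) / len(sents)

tweets = ["I posted a tweet!", "the cat saw a tweet", "very funny"]
print(parseability(tweets))                # strict: 0.33
print(parseability(tweets, relaxed=True))  # relaxed: 0.67
```

Note that the fragment “very funny” still fails here; in the analysis described above, such fragments get their own category rather than being lumped in with the unparseable remainder.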

The paper also provides a lexical analysis, using chi-squared scores to measure differences in how the 500 most frequent words are used in each corpus. But if, as I argued in my 2013 NAACL paper, social media is an amalgam of writing styles rather than a single genre or dialect, few of these stylistic markers will attain enough universality to reach the top 500 words, besides the usual suspects: lol, you/u, gonna, and the most popular emoticons. Baldwin et al. also measure the perplexity of a trigram language model, which may capture this “long tail”, but personally I find this a little harder to interpret than simple n-gram out-of-vocabulary counts, as it depends on modeling decisions such as smoothing.
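For concreteness, here is a minimal sketch of both lexical measures on toy data: a per-word chi-squared score computed from a 2x2 contingency table (this word versus all other tokens, in one corpus versus another), and the simple out-of-vocabulary rate I find easier to interpret. The toy corpora, function names, and vocabulary cutoff are my own illustration, not the setup from Baldwin et al.

```python
# Sketch of two lexical comparisons: a per-word chi-squared score and a
# simple out-of-vocabulary (OOV) rate. Toy data only.
from collections import Counter

def chi_squared(word, counts_a, counts_b):
    """Chi-squared statistic for a 2x2 table: occurrences of `word`
    vs. all other tokens, in corpus A vs. corpus B."""
    a, b = counts_a[word], counts_b[word]
    rest_a = sum(counts_a.values()) - a
    rest_b = sum(counts_b.values()) - b
    table = [[a, b], [rest_a, rest_b]]
    total = a + b + rest_a + rest_b
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = sum(table[i]) * sum(r[j] for r in table) / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def oov_rate(reference_counts, target_tokens, vocab_size=500):
    """Fraction of target tokens not among the `vocab_size` most
    frequent words of the reference corpus."""
    vocab = {w for w, _ in reference_counts.most_common(vocab_size)}
    return sum(t not in vocab for t in target_tokens) / len(target_tokens)

# Toy tokens standing in for Wikipedia-like vs. Twitter-like text.
wiki_tokens = "the cat sat on the mat and the dog sat too".split()
tweet_tokens = "lol the cat is gonna sit on u".split()
wiki_counts, tweet_counts = Counter(wiki_tokens), Counter(tweet_tokens)

print(chi_squared("lol", wiki_counts, tweet_counts))
print(oov_rate(wiki_counts, tweet_tokens, vocab_size=500))
```

Unlike perplexity, the OOV rate involves no smoothing decisions; the only free parameter is the vocabulary cutoff.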

Overall, I’m very happy to see NLP technology used to empirically measure the similarities and differences between social media and other forms of writing, and I’m particularly intrigued by the use of automated generative parsing. As we try to make language technology robust to language variation, papers like this will help us move forward on a solid empirical footing.

(h/t Brendan O’Connor for pointing me to this paper)