How noisy is social media text?

In my NAACL 2013 paper/rant, I expressed concern that a lot of the NLP work targeting social media is based on folk linguistics rather than either solid theory or empirical data about how social media language actually works. In my paper I tried to provide a little of both: citations to some of my favorite papers from the CMC and sociolinguistics literatures (which seem to be nearly totally unknown in NLP circles), and an empirical analysis of social media language differences using n-gram counts.

This recent paper by Baldwin, Cook, Lui, Mackinlay, and Wang (basically contemporaneous with mine, but they were kind enough to cite me) takes the empirical analysis a good way further. I was particularly interested to see that they applied a generative HPSG grammar of English to corpora from Twitter, YouTube comments (the worst place on the whole internet?), web forums, blog posts, Wikipedia, and the BNC. They found that if you want strict parsing of full sentences, Twitter is quite difficult: only 14% of tweets are parseable this way, compared to 25% for blogs and 49% for Wikipedia. Relaxing punctuation and capitalization requirements reduces these differences considerably, yielding 36% parseability for tweets, 44% for blogs, and 68% for Wikipedia. Another 25% of tweets are viewed as grammatical fragments (e.g., “very funny”), leaving only 37% of tweets as “unparseable”, compared to 35% for blogs and 26% for Wikipedia. This coheres with arguments from linguists like Thurlow and Squires (sadly, I can find no publicly available PDF for her cool 2010 Language in Society paper) that claims of a radically unreadable netspeak dialect are greatly exaggerated.

The paper also provides a lexical analysis, using chi-squared scores to measure differences between the 500 most frequent words in each corpus. But if, as I argued in my 2013 NAACL paper, social media is an amalgam of writing styles rather than a single genre or dialect, few of these stylistic markers will attain enough universality to reach the top 500 words, besides the usual suspects: lol, you/u, gonna, and the most popular emoticons. Baldwin et al. also measure the perplexity of a trigram language model, which may capture this “long tail”, but personally I find perplexity a little harder to interpret than simple n-gram out-of-vocabulary counts, as it depends on modeling decisions such as smoothing.
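Both metrics are easy to state concretely. As a rough illustration only (this is not the paper's actual implementation, and the two-sentence "corpora" below are toy data I made up for the example), a per-word 2x2 chi-squared association score and a simple token-level OOV rate might be sketched as:

```python
from collections import Counter

def chi_squared(word, counts_a, counts_b):
    """2x2 chi-squared statistic for one word's frequency in two corpora.

    Contingency cells: (count of word, count of all other tokens),
    in corpus A vs. corpus B. A larger score means the word's rate of
    use differs more sharply between the two corpora.
    """
    a = counts_a[word]                  # word occurrences in A
    b = counts_b[word]                  # word occurrences in B
    c = sum(counts_a.values()) - a      # other tokens in A
    d = sum(counts_b.values()) - b      # other tokens in B
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

def oov_rate(tokens, vocabulary):
    """Fraction of tokens not found in a reference vocabulary."""
    oov = sum(1 for t in tokens if t not in vocabulary)
    return oov / len(tokens)

# Toy example: a "tweet" corpus vs. an "edited text" corpus.
tweets = "lol u gonna love this lol".split()
edited = "you are going to love this report".split()
counts_tw, counts_ed = Counter(tweets), Counter(edited)

print(chi_squared("lol", counts_tw, counts_ed))   # strongly tweet-associated
print(oov_rate(tweets, set(edited)))              # share of tweet tokens unseen in edited text
```

The appeal of the OOV rate is exactly that it has no free parameters: unlike perplexity, there is no smoothing method or model order to choose, so the number is directly comparable across studies.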

Overall, I’m very happy to see NLP technology used to empirically measure the similarities and differences between social media and other forms of writing, and I’m particularly intrigued by the use of automated generative parsing. As we try to make language technology robust to language variation, papers like this will help us move forward on a solid empirical footing.

(h/t Brendan O’Connor for pointing me to this paper)


3 Responses to How noisy is social media text?

  1. I remember reading your NAACL paper on bad language and had a couple of thoughts.

    One thought was: how many of the OOV words that you are coming across are new words/terms? I would expect that over time you would encounter more OOV hits that are previously unencountered misspellings of known words, and I wonder how many of the OOV terms are of this kind.

    Another thought: couldn’t it be that [some] people anticipate the short character limit and develop their statuses 1) based on a vocabulary of shorter terms and/or 2) in such a way that they express what they want in fewer words? If so, it would be fair to say that such people would rarely run near the Twitter character limit, since they can express more words in fewer characters.

    And the last thought I can recall is that the purpose of normalization/domain adaptation techniques is not to correct or account for ‘bad’ language per se. At least the way I think of it, it is more about condensing widely varied input for applications in which the variation does not play a significant role. For instance, in building a system that summarizes text samples, it might not be important to differentiate between ‘finna’ and ‘fixing to’, as they mean the same thing [but this would probably matter more if the system were also trying to attribute the writing to a particular author]. The fact that we choose to condense to ‘fixing to’ is because standard linguistic forms such as this change considerably less over time [because, as you noted, slang terms evolve incredibly rapidly], without necessarily implying that they are/should be the norm.

    Just to be clear, I don’t disagree with the conclusions reached in the paper. These were just thoughts that came to mind while reading.

    • nlpjacob says:

      Thanks for the comment and interesting questions.
      1) I don’t know how many OOV words are “new,” and I think that’s a pretty hard question to answer. The purpose of the NER analysis was to distinguish names from other tokens. Distinguishing alternative spellings is more difficult, and involves slippery-slope questions like whether you think “finna” and “fitna” are new words or just new spellings of “fixing to.”
      2) Yes, people could have developed substitutions like to/2 because of the character limit. That doesn’t explain why they use the shortened forms in some cases and not in others — and the character limit itself doesn’t seem to explain this either.
      3) I do think there are applications where normalization makes sense. Generally these are applications where paraphrase might also make sense, like summarization. My concern is that the original text contains important social meaning, and some people seem to be proposing to normalize before doing even the most basic downstream processing (POS tagging, parsing). To me this seems like a recipe for NLP that is both brittle and sterile.
      4) The idea that the “normalized” form is more permanent is interesting. I guess if longevity is the criterion, then perhaps we Americans should switch back to British spelling? In any case, it will be interesting to see whether CMC language continues to evolve rapidly, or whether it stabilizes around some form (or set of forms) that are better suited for informal communication.

  2. Thanks for the response. I can say I agree with what you said more or less.
    >”I guess if longevity is the criterion, then perhaps we Americans should switch back to British spelling?”
    I had originally thought to say that the standard forms are better defined, but then I thought: well, are they really? I’m not so sure. Anyway, I wasn’t suggesting that longevity is important from the perspective of people communicating with each other, but rather from the perspective of someone attempting to build and learn from a model of language data. I hope I did not come across as suggesting that we should all speak or write in these “standard forms”. With all this in mind, maybe the question should be: “…perhaps we should normalize to British spellings rather than American ones?” And to that question, I don’t have an answer.
