## Georgia Tech at EMNLP 2013

EMNLP is one of my favorite conferences, so I’m very pleased that Georgia Tech’s Computational Linguistics Lab will have two papers to present.

Yi Yang and I have written a paper that formalizes unsupervised text normalization in a log-linear model, which allows arbitrary (local) features to be combined with a target language model. The model is trained in a maximum-likelihood framework, marginalizing over possible normalizations using a novel sequential Monte Carlo training scheme. Longtime readers may find some irony in me writing a paper about social media normalization, but if we want to understand systematic orthographic variation — such as (TD)-deletion — then an accurate normalization system is a very useful tool to have in the shed. http://www.cc.gatech.edu/~jeisenst/papers/yang-emnlp-2013.pdf
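To give a flavor of the setup, here is a toy sketch of a log-linear normalization score, not the paper's actual model: the feature templates, weights, and candidate language model below are all invented for illustration. The general shape is that local features of the (noisy word, candidate) pair are weighted and combined with a target language model score.

```python
import math

# Toy sketch of a log-linear normalization score. All feature names, weights,
# and probabilities here are hypothetical, for illustration only.

def local_features(source_word, candidate):
    """Toy local features relating a noisy token to a candidate normalization."""
    return {
        "same_first_char": float(source_word[:1] == candidate[:1]),
        "length_diff": -abs(len(source_word) - len(candidate)),
        "identity": float(source_word == candidate),
    }

def score(source_word, candidate, weights, lm_logprob):
    """Log-linear score: weighted local features plus a language model term."""
    feats = local_features(source_word, candidate)
    return sum(weights.get(k, 0.0) * v for k, v in feats.items()) + lm_logprob(candidate)

# Example: pick the best normalization of "gonna" from a candidate list,
# using a made-up unigram "language model".
weights = {"same_first_char": 1.0, "length_diff": 0.5, "identity": 0.2}
lm = lambda w: math.log({"going": 0.6, "gonna": 0.3, "gone": 0.1}.get(w, 1e-6))
candidates = ["going", "gonna", "gone"]
best = max(candidates, key=lambda c: score("gonna", c, weights, lm))  # "going"
```

In the real model the weights are learned by maximum likelihood, marginalizing over the (unobserved) normalizations; the sketch only shows how local features and a language model combine in one score.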

Yangfeng Ji and I have obtained very strong results on paraphrase detection, beating the prior state-of-the-art on the well-studied MSR Paraphrase Corpus by 3% raw accuracy. We build a distributional representation for sentence semantics, which we combine with traditional fine-grained features. Yangfeng’s key insight in this paper is to also use supervised information to compute the distributional representation itself, by reweighting the words according to their discriminability. http://www.cc.gatech.edu/~jeisenst/papers/ji-emnlp-2013.pdf
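The reweighting idea can be sketched roughly as follows; the vocabulary, the discriminability weights, and the similarity measure below are all invented for illustration, and the paper's actual representation is considerably richer.

```python
import numpy as np

# Toy sketch of discriminability reweighting: build a bag-of-words sentence
# vector in which each word count is scaled by a per-word weight that would,
# in the real model, be derived from supervised paraphrase labels.
# The vocabulary and weights below are made up.

vocab = {"cat": 0, "sat": 1, "the": 2, "mat": 3}
discrim = np.array([1.0, 0.9, 0.1, 0.8])  # hypothetical: content words weigh more

def sentence_vector(tokens):
    v = np.zeros(len(vocab))
    for t in tokens:
        if t in vocab:
            v[vocab[t]] += 1.0
    return v * discrim  # reweight counts by discriminability

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_vector("the cat sat on the mat".split())
s2 = sentence_vector("the cat is on the mat".split())
sim = cosine(s1, s2)  # high similarity, driven by the content words
```

The point of the reweighting is visible even in this toy: the shared stopword "the" contributes almost nothing to the similarity, while the shared content words dominate.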

## How noisy is social media text?

In my NAACL 2013 paper/rant, I expressed concern that a lot of the NLP work targeting social media is based on folk linguistics rather than either solid theory or empirical data about how social media language actually works. In my paper I tried to provide a little of both: citations to some of my favorite papers from the CMC and sociolinguistics literatures (which seem to be nearly totally unknown in NLP circles), and an empirical analysis of social media language differences using n-gram counts.

This recent paper by Baldwin, Cook, Lui, Mackinlay, and Wang — basically contemporaneous with mine, but they were kind enough to cite me — takes the empirical analysis a good way further. I was particularly interested to see that they applied a generative HPSG grammar of English to corpora from Twitter, YouTube comments (the worst place on the whole internet?), web forums, blog posts, Wikipedia, and the BNC. They found that if you want strict parsing of full sentences, Twitter is quite difficult — only 14% of tweets are parseable this way, as compared to 25% for blogs and 49% for Wikipedia. Relaxing punctuation and capitalization reduces these differences considerably, yielding 36% parseability for tweets, 44% for blogs, and 68% for Wikipedia. Another 25% of tweets are viewed as grammatical fragments (e.g., “very funny”), leaving only 37% of tweets as “unparseable”, compared to 35% for blogs and 26% for Wikipedia. This coheres with arguments from linguists like Thurlow and Squires (sadly, I can find no publicly available PDF for her cool 2010 Language in Society paper) that claims of a radically unreadable netspeak dialect are greatly exaggerated.

The paper also provides a lexical analysis, using a chi-squared score to measure differences among the 500 most frequent words in each corpus. But if, as I argued in my NAACL 2013 paper, social media is an amalgam of writing styles rather than a single genre or dialect, then few of these stylistic markers will attain enough universality to reach the top 500 words, besides the usual suspects: lol, you/u, gonna, and the most popular emoticons. Baldwin et al. also measure perplexity under a trigram language model, which may capture this “long tail”, but personally I find this a little harder to interpret than simple n-gram out-of-vocabulary counts, as it depends on modeling decisions such as smoothing.
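For what it's worth, the out-of-vocabulary comparison I have in mind is trivial to compute; here is a minimal sketch (the reference vocabulary and the tweet below are invented for illustration).

```python
# Minimal sketch of an out-of-vocabulary (OOV) comparison between corpora:
# what fraction of one corpus's tokens are missing from another corpus's
# vocabulary. The toy "BNC" vocabulary and tweet are made up.

def oov_rate(target_tokens, reference_vocab):
    """Fraction of target tokens that do not appear in the reference vocabulary."""
    missing = sum(1 for t in target_tokens if t not in reference_vocab)
    return missing / len(target_tokens)

bnc_vocab = {"the", "cat", "sat", "on", "mat", "very", "funny"}
tweet = "lol the cat gonna sit on u".split()
rate = oov_rate(tweet, bnc_vocab)  # 4 of 7 tokens ("lol", "gonna", "sit", "u") are OOV
```

Unlike perplexity, this number has no smoothing or other modeling decisions baked in, which is why I find it easier to interpret.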

Overall, I’m very happy to see NLP technology used to empirically measure the similarities and differences between social media and other forms of writing, and I’m particularly intrigued by the use of automated generative parsing. As we try to make language technology robust to language variation, papers like this will help us move forward on a solid empirical footing.

(h/t Brendan O’Connor for pointing me to this paper)

## adventures in cross-disciplinary collaboration, part 27: typesetting

One challenging thing about building bridges to the sociolinguistics community from the computer science world is that publication methods in sociolinguistics are… traditional. The open-access movement hasn’t made many inroads (Language@Internet is a great exception — please comment if I’m missing others), and many journals require Microsoft Word format for submissions. I find this pretty surprising in a discipline that involves a considerable amount of formal notation, both linguistic and mathematical.

Anyway, I convinced my co-authors to take the path of writing the document in LaTeX and then converting right before submission — by promising that I would manage the conversion. And now the chickens have come home to roost. We’ve got a fairly complicated 45-page document, with all the usual stuff: equations, figures, tables, references, etc. I’ve spent the morning tracking down various forum posts about how to get as many of these features to survive the conversion as possible, with mixed success. Here’s what I’ve figured out so far:

**latex2rtf** is the current winner. It did a good job with the citations, got some of the references, and messed up all the math. Make sure to update to version 2.3.3, not the 1.9.19 that is the default with Ubuntu.

My command line: `latex2rtf main`

**pandoc** lost all document-level formatting, citations, and references. But it did a nice job on the equations. I may create the main document with latex2rtf and then copy in the equations from pandoc.

My command line: `pandoc -f latex -t odt -o main.odt main.tex`

**tex4ht** gets a good recommendation here, but for me it generates blank output.

My command line: `mk4ht oolatex main.tex`

**latex2html** was advertised here, but I can’t get its output into odt, and anyway it doesn’t get any of the equations for me.

My command line: `latex2html main.tex -split 0 -no_navigation -info "" -address "" -html_version 4.0,unicode`

## (When) do we need Viterbi for POS tagging?

For my NLP class’s assignment on sequence labeling, I’m having them work on the Twitter POS data that I helped annotate with Noah Smith’s CMU group in 2011. The idea for the assignment is to first apply classifiers (Naive Bayes and perceptron), which can look at each word and its neighbors, but cannot make a structured prediction for the entire sentence. Then we move to hidden Markov models and finally the structured perceptron, which should reveal the importance of both joint inference (Viterbi) and discriminative learning.

But a funny thing happened on my way to the perfect problem set. I used features similar to the “base features” in our ACL paper (Gimpel et al 2011), but I also added features for the left and right neighbors of each word. In an averaged perceptron, this resulted in a development set accuracy of 84.8%. The base-feature CRF in our paper gets 82.7%. At this point, I started to get excited — by adding the magic of structured prediction, I might be on my way to state-of-the-art results! Sadly no: when I turn my averaged perceptron into a structured perceptron, accuracy is barely changed, coming in at 85.1%.

Now, when I had a simpler feature set (omitting the left and right neighbor features), the averaged perceptron got 81%, and the structured perceptron again got around 85%. So it seems that for this data, you can incorporate context either through your features or through structured prediction, but there’s hardly any advantage to combining the two.
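Concretely, the "context through features" option just means handing the per-token classifier the identities of the neighboring words, something like this sketch (the feature templates are my own toy versions, not the assignment's actual feature set):

```python
# Toy sketch of neighbor-word features for a per-token (non-structured)
# classifier. Feature template names are invented for illustration.

def local_features(tokens, i, use_neighbors=True):
    """Features for tagging tokens[i] with a per-token classifier."""
    feats = {"word=" + tokens[i]: 1.0}
    if use_neighbors:
        left = tokens[i - 1] if i > 0 else "<s>"
        right = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
        feats["left=" + left] = 1.0
        feats["right=" + right] = 1.0
    return feats

toks = "ikr smh he asked fir yo number".split()
f = local_features(toks, 5)  # features for tagging "yo": word, left, and right
```

With `use_neighbors=False` the classifier sees each word in isolation, which is the setting where structured prediction made a real difference in my experiments.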

I assume that this same experiment has been tried for more traditional POS datasets and that structured prediction has been found to help (although this is just an assumption; I don’t know of any specifics). So it’s interesting to think of why it doesn’t help here. One possibility is that the Twitter POS tagset is pretty coarse — only 23 tags. Maybe the sequence information would be more valuable if the tagset were more fine-grained.

## Implementing spectral LDA

I spent some time this afternoon trying to implement this cool NIPS paper, which promises a new inference technique for LDA topic models using spectral decomposition. Unfortunately, I just couldn’t get it to work — the resulting “topics” include negative numbers. I think maybe I am not correctly computing the empirical estimates of the moments. It seems obvious, but the papers never quite explain how to do it, and it’s hard for me to think what else might be wrong. You may try using this code as a starting point (sorry for the clunky numpy and scipy, I’m still learning); if you figure it out, please comment!

```python
from scipy import sparse, linalg
from numpy import random, array, sqrt, power

# Jacob Eisenstein
# attempt at implementing spectral LDA (Anandkumar et al, NIPS 2012)
# Dec 23, 2012

def getNormalizer(counts):
    l = counts.shape[0]
    return(sparse.spdiags(1/array(counts.sum(1)).flatten(),0,l,l))

def getDiagSum(counts):
    l = counts.shape[0]
    return(sparse.spdiags(array(counts.sum(1)).flatten(),0,l,l))

def normalizeRows(x):
    #this is not the best way to do this
    return(getNormalizer(x) * x)

def computeMu(counts):
    return(counts.sum(axis=0).T / counts.sum())

def computeExxt(counts):
    freqs = normalizeRows(counts)
    return (freqs.T * getDiagSum(counts) * freqs).todense() / counts.sum()

def computePairs(counts,mu,a0):
    pairs = computeExxt(counts) - (a0 / (a0+1)) * mu * mu.T
    return(pairs)

def computeTriples(counts,eta,a0,mu):
    l = counts.shape[0]
    #computing E[x_1 x_2^T <eta, x_3>]
    #this is the part that i'm not 100% confident about
    #counts * eta is a vector of length = #documents
    freqs = normalizeRows(counts)
    exxt = computeExxt(counts)
    eta_mu_inner = (eta.T*mu).sum() #better way to do this?

    part1 = freqs.T * sparse.spdiags(array(counts * eta).flatten(),0,l,l) * freqs / counts.sum()
    part2 = (a0 / (a0+2)) * (exxt*eta*mu.T + mu*eta.T*exxt + eta_mu_inner*exxt)
    part3 = (2 * a0 * a0 / ((a0+2)*(a0+1))) * eta_mu_inner*mu*mu.T
    return(part1 - part2 + part3)

def eca(counts,k,a0,nips_version=True):
    # counts should be an l x d sparse matrix,
    # where l = number of documents and d = number of words
    # k is number of desired topics
    # a0 is sum of Dirichlet prior

    #the paper says \theta is drawn uniformly from S^{k-1}, which I assume is the simplex
    #it doesn't work if \theta is drawn from a zero-mean Gaussian either
    theta = random.rand(k,1)
    theta /= theta.sum()

    mu = computeMu(counts)
    pairs = computePairs(counts,mu,a0)

    #step 1: random projection to d x k matrix
    d = counts.shape[1]
    U = pairs * random.randn(d,k)

    #step 2: whiten (NIPS version)
    if nips_version:
        u,s,v = linalg.svd(U.T * pairs * U)
        V = u / sqrt(s)
        W = U * V
    else:
        #step 2b: whiten (ArXiv version, p 16)
        u,s,v = linalg.svd(pairs)
        W = u[:,0:k] / sqrt(s[0:k])

    #step 3: svd
    eta = W.dot(theta)
    trips = computeTriples(counts,eta,a0,mu)
    xi,singvals,ignore = linalg.svd(W.T * trips * W)

    #step 4: reconstruct
    O = linalg.pinv(W).T.dot(xi)
    return(O,singvals)
```


## update

(August 6, 2013)

1. Thanks to Peter Lacey-Bordeaux for encouraging me to learn to typeset this code properly.
2. I was able to successfully implement the RECOVER algorithm from this ICML 2013 paper by Arora et al., in MATLAB. I mean to do it in Python, but haven’t had time. The topics look good and it’s really fast.

## Gender and identity on Twitter

Our work on the relationship between language and gender in social media gets a nice writeup by Ben Zimmer in both Language Log and the Boston Globe!

The paper uses clustering analysis, social networks, and classification in search of a deeper understanding of the ways language varies with gender. We find a range of gendered styles and interests among Twitter users; some of these styles mirror the aggregate language-gender statistics, while others contradict them. Next, we investigate individuals whose language defies gender expectations. We find that such individuals have social networks that include significantly more individuals from the other gender, and that in general, ego-network gender homophily is correlated with the use of same-gender language markers.

Also, David has wrapped up a nice release of the data, if you want to play.