Adventures in cross-disciplinary collaboration, part 27: typesetting

One challenging thing about building bridges to the sociolinguistics community from the computer science world is that publication methods in sociolinguistics are… traditional. The open-access movement hasn’t made many inroads (language@internet is a great exception; please comment if I’m missing others), and many journals require Microsoft Word format for submissions. I find this pretty surprising in a discipline that involves a considerable amount of formal notation, both linguistic and mathematical.

Anyway, I convinced my co-authors to write the document in LaTeX and convert it right before submission, by promising that I would manage the conversion. And now the chickens have come home to roost. We’ve got a fairly complicated 45-page document, with all the usual stuff: equations, figures, tables, references, etc. I’ve spent the morning tracking down various forum posts about how to get as many of these features as possible to survive the conversion, with mixed success. Here’s what I’ve figured out so far:

latex2rtf is the current winner. It did a good job with citations, got some of the references, and messed up all the math. Make sure to update to version 2.3.3, not the 1.9.19 that is the default with Ubuntu.

my command line: latex2rtf main

pandoc lost all document-level formatting, citations, and references. But it did a nice job on equations. I may create the main document from latex2rtf and then copy in the equations from pandoc.

my command line: pandoc -f latex -t odt -o main.odt main.tex

tex4ht gets a good recommendation here, but for me it generates blank output.

my command line: mk4ht oolatex main.tex

latex2html was advertised here, but I can’t get its output into odt, and anyway it doesn’t get any of the equations for me.

my command line: latex2html main.tex -split 0 -no_navigation -info "" -address "" -html_version 4.0,unicode
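To compare the converters side by side, the four command lines above can be collected in one small script. This is only a sketch: the argument lists are copied from the commands in this post, while the helper names `available` and `run_all` are made up for illustration, and nothing here changes what any of the tools actually does with the document.

```python
import shutil
import subprocess

# Command lines copied from the post; "main" / "main.tex" is the document name used there.
COMMANDS = {
    "latex2rtf": ["latex2rtf", "main"],
    "pandoc": ["pandoc", "-f", "latex", "-t", "odt", "-o", "main.odt", "main.tex"],
    "tex4ht": ["mk4ht", "oolatex", "main.tex"],
    # The empty strings are passed as real (empty) arguments, like "" in the shell.
    "latex2html": ["latex2html", "main.tex", "-split", "0", "-no_navigation",
                   "-info", "", "-address", "", "-html_version", "4.0,unicode"],
}


def available(commands):
    """Keep only the entries whose executable is found on PATH (hypothetical helper)."""
    return {name: cmd for name, cmd in commands.items() if shutil.which(cmd[0])}


def run_all(commands):
    """Run each available converter and return its exit status (hypothetical helper)."""
    return {name: subprocess.run(cmd).returncode
            for name, cmd in available(commands).items()}
```

Calling `run_all(COMMANDS)` in the directory that holds main.tex runs whichever converters are installed and returns their exit codes; the outputs still have to be inspected by hand.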

(When) do we need Viterbi for POS tagging?

For my NLP class’s assignment on sequence labeling, I’m having them work on the Twitter POS data that I helped annotate with Noah Smith’s CMU group in 2011. The idea for the assignment is to first apply classifiers (Naive Bayes and Perceptron), which can look at each word and its neighbors, but cannot make a structured prediction for the entire sentence. Then we move to hidden Markov models and finally structured perceptron, which should reveal the importance of both joint inference (Viterbi) and discriminative learning.

But a funny thing happened on my way to the perfect problem set. I used features similar to the “base features” in our ACL paper (Gimpel et al 2011), but I also added features for the left and right neighbors of each word. In an averaged perceptron, this resulted in a development set accuracy of 84.8%. The base feature CRF in our paper gets 82.7%. At this point, I started to get excited: by adding the magic of structured prediction, I might be on my way to state-of-the-art results! Sadly no: when I turn my averaged perceptron into a structured perceptron, accuracy is barely changed, coming in at 85.1%.

Now, when I had a simpler feature set (omitting the left and right neighbor features), averaged perceptron got 81%, and structured perceptron again got around 85%. So it seems that for this data, you can incorporate context through either your features or through structured prediction, but there’s hardly any advantage to combining the two.
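To make the "context through features" option concrete, here is a minimal sketch of per-token features with and without the neighbor words. The function name, the feature strings, and the boundary symbols are assumptions made for illustration; the actual base features in Gimpel et al. are richer than this.

```python
def token_features(words, i, use_neighbors=True):
    """Sparse binary features for position i of a tokenized tweet (illustrative only)."""
    feats = {"word=" + words[i].lower(): 1.0}
    if use_neighbors:
        # Boundary symbols "<s>" and "</s>" are an assumption of this sketch.
        left = words[i - 1].lower() if i > 0 else "<s>"
        right = words[i + 1].lower() if i + 1 < len(words) else "</s>"
        feats["left=" + left] = 1.0
        feats["right=" + right] = 1.0
    return feats
```

With `use_neighbors=False` the classifier sees only the current word, which corresponds to the simpler feature set described above.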

I assume that this same experiment has been tried for more traditional POS datasets and that structured prediction has been found to help (although this is just an assumption; I don’t know of any specifics). So it’s interesting to think of why it doesn’t help here. One possibility is that the Twitter POS tagset is pretty coarse — only 23 tags. Maybe the sequence information would be more valuable if it were more fine-grained.
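For reference, the joint inference step that separates the structured perceptron from the plain classifier can be sketched as a first-order Viterbi decoder. This is a generic textbook version written for this post, not the code from the assignment; the `score(i, prev_tag, tag)` interface is a hypothetical one chosen so that any local scoring function (HMM log-probabilities or perceptron weights) can be plugged in.

```python
def viterbi(n, tags, score):
    """Return the highest-scoring tag sequence of length n.

    score(i, prev_tag, tag) is the local score of assigning `tag` at position i
    given `prev_tag` at position i - 1 (prev_tag is None when i == 0).
    """
    if n == 0:
        return []
    # best[i][t]: score of the best partial sequence ending in tag t at position i
    best = [{t: score(0, None, t) for t in tags}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + score(i, p, t))
            best[i][t] = best[i - 1][prev] + score(i, prev, t)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[n - 1][t])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]
```

When the local score ignores `prev_tag`, the decoder reduces to picking the best tag at each position independently, which is the classifier setting from the first part of the assignment.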

Implementing spectral LDA

I spent some time this afternoon trying to implement this cool NIPS paper, which promises a new inference technique for LDA topic models using spectral decomposition. Unfortunately, I just couldn’t get it to work — the resulting “topics” include negative numbers. I think maybe I am not correctly computing the empirical estimates of the moments. It seems obvious, but the papers never quite explain how to do it, and it’s hard for me to think what else might be wrong. You may try using this code as a starting point (sorry for the clunky numpy and scipy, I’m still learning); if you figure it out, please comment!


from scipy import sparse, linalg
from numpy import random, array, sqrt, power

# Jacob Eisenstein
# attempt at implementing spectral LDA (Anandkumar et al, NIPS 2012)
# Dec 23, 2012

def getNormalizer(counts):
    l = counts.shape[0]
    return(sparse.spdiags(1/array(counts.sum(1)).flatten(),0,l,l))

def getDiagSum(counts):
    l = counts.shape[0]
    return(sparse.spdiags(array(counts.sum(1)).flatten(),0,l,l))

def normalizeRows(x):
    #this is not the best way to do this
    return(getNormalizer(x) * x)

def computeMu(counts):
    return(counts.sum(axis=0).T / counts.sum())

def computeExxt(counts):
    freqs = normalizeRows(counts)
    return (freqs.T * getDiagSum(counts) * freqs).todense() / counts.sum()

def computePairs(counts,mu,a0):
    pairs = computeExxt(counts) - (a0 / (a0+1)) * mu * mu.T
    return(pairs)

def computeTriples(counts,eta,a0,mu):
    l = counts.shape[0]
    #computing E[x_1 x_2^T ]
    #this is the part that i'm not 100% confident about
    # counts * eta is a vector of length = #documents
    freqs = normalizeRows(counts)
    exxt = computeExxt(counts)
    eta_mu_inner = (eta.T*mu).sum() #better way to do this?

    part1 = freqs.T * sparse.spdiags(array(counts * eta).flatten(),0,l,l) * freqs / counts.sum()
    part2 = (a0 / (a0+2)) * (exxt*eta*mu.T + mu*eta.T*exxt + eta_mu_inner*exxt)
    part3 = (2 * a0 * a0 / ((a0+2)*(a0+1))) * eta_mu_inner*mu*mu.T
    return(part1 - part2 + part3)

def eca(counts,k,a0,nips_version=True):
    # counts should be an l x d sparse matrix, where l = number of documents and d = number of words
    # k is number of desired topics
    # a0 is sum of Dirichlet prior

    #the paper says \theta is drawn uniformly from S^{k-1}, which I assume is the simplex
    #it doesn't work if \theta is drawn from a zero-mean Gaussian either
    theta = random.rand(k,1)
    theta /= theta.sum()

    mu = computeMu(counts)
    pairs = computePairs(counts,mu,a0)

    #step 1: random projection to d x k matrix
    d = counts.shape[1]
    U = pairs * random.randn(d,k)

    #step 2: whiten (NIPS version)
    if nips_version:
        u,s,v = linalg.svd(U.T * pairs * U)
        V = u / sqrt(s)
        W = U * V
    else:
        #step 2b: whiten (ArXiv version, p 16)
        u,s,v = linalg.svd(pairs)
        W = u[:,0:k] / sqrt(s[0:k])

    #step 3: svd
    eta = W.dot(theta)
    trips = computeTriples(counts,eta,a0,mu)
    xi,singvals,ignore = linalg.svd(W.T * trips * W)

    #step 4: reconstruct
    O = linalg.pinv(W).T.dot(xi)
    return(O,singvals)
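On the question of how to compute the empirical moments: one standard way to estimate E[x_1 x_2^T] from word counts is to count only ordered pairs of distinct word positions within each document. For a document with count vector c and length n, that sum is c c^T - diag(c), and there are n(n-1) such pairs. The sketch below implements that estimate on a dense array. The function name and the choice to weight every document equally are assumptions of this sketch, and it is offered as something to compare against `computeExxt`, with no claim that it is what the paper intends or that it removes the negative entries.

```python
import numpy as np


def pair_moment(counts):
    """Estimate E[x_1 x_2^T] from an l x d array of word counts (illustrative sketch).

    Each document contributes (c c^T - diag(c)) / (n (n - 1)), the average of
    e_i e_j^T over ordered pairs of distinct word positions. Documents with
    fewer than two words carry no pair information and are skipped.
    """
    counts = np.asarray(counts, dtype=float)
    d = counts.shape[1]
    total = np.zeros((d, d))
    used = 0
    for c in counts:
        n = c.sum()
        if n < 2:
            continue
        total += (np.outer(c, c) - np.diag(c)) / (n * (n - 1))
        used += 1
    if used == 0:
        raise ValueError("need at least one document with two or more words")
    # Assumption: every document gets equal weight in the average.
    return total / used
```

For a sparse count matrix, `counts.toarray()` can be passed in for small vocabularies; the entries of the result sum to 1, which is a quick sanity check.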

Gender and identity on Twitter

Our work on the relationship between language and gender in social media gets a nice writeup by Ben Zimmer in both Language Log and the Boston Globe!

The paper uses clustering analysis, social networks, and classification in search of a deeper understanding of the ways language varies with gender. We find a range of gendered styles and interests among Twitter users; some of these styles mirror the aggregate language-gender statistics, while others contradict them. Next, we investigate individuals whose language defies gender expectations. We find that such individuals have social networks that include significantly more individuals from the other gender, and that in general, ego-network gender homophily is correlated with the use of same-gender language markers.

Also, David has wrapped up a nice release of the data, if you want to play.

This blog

This blog will keep updates from the work of the Computational Linguistics Lab at Georgia Tech’s School of Interactive Computing.

The lab is led by Jacob Eisenstein, whose old-school webpage is here.

Here is a Dinosaur Comic about natural language processing:
