(When) do we need Viterbi for POS tagging?

For my NLP class’s assignment on sequence labeling, I’m having them work on the Twitter POS data that I helped annotate with Noah Smith’s CMU group in 2011. The idea for the assignment is to first apply classifiers (Naive Bayes and Perceptron), which can look at each word and its neighbors, but cannot make a structured prediction for the entire sentence. Then we move to hidden Markov models and finally the structured perceptron, which should reveal the importance of both joint inference (Viterbi) and discriminative learning.

But a funny thing happened on my way to the perfect problem set. I used features similar to the “base features” in our ACL paper (Gimpel et al 2011), but I also added features for the left and right neighbors of each word. In an averaged perceptron, this resulted in a development set accuracy of 84.8%. The base feature CRF in our paper gets 82.7%. At this point, I started to get excited: by adding the magic of structured prediction, I might be on my way to state-of-the-art results! Sadly no: when I turned my averaged perceptron into a structured perceptron, accuracy barely changed, coming in at 85.1%.
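To make "features for the left and right neighbors" concrete, here is a minimal sketch of a per-token feature function of the kind an unstructured classifier could use. The function name `token_features`, the feature templates, and the boundary symbols `<s>` and `</s>` are all my own assumptions for illustration; they are not the actual base features from Gimpel et al 2011.

```python
def token_features(words, i):
    """Return a sparse feature dict for the word at position i.

    The templates here are illustrative assumptions, not the feature
    set from the paper.
    """
    w = words[i]
    feats = {
        "bias": 1.0,
        "word=" + w.lower(): 1.0,
        "suffix3=" + w.lower()[-3:]: 1.0,
        "is_capitalized": float(w[:1].isupper()),
        "has_digit": float(any(c.isdigit() for c in w)),
    }
    # Neighbor features: context enters through the input words only,
    # never through the predicted tags of the neighbors.
    left = words[i - 1].lower() if i > 0 else "<s>"
    right = words[i + 1].lower() if i < len(words) - 1 else "</s>"
    feats["left_word=" + left] = 1.0
    feats["right_word=" + right] = 1.0
    return feats
```

Dropping the last four lines before the `return` gives the "simpler feature set" variant, where the classifier sees nothing but the current word.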

Now, when I had a simpler feature set (omitting the left and right neighbor features), the averaged perceptron got 81%, and the structured perceptron again got around 85%. So it seems that for this data, you can incorporate context through either your features or through structured prediction, but there’s hardly any advantage to combining the two.
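For reference, the difference between the two decoders can be sketched as follows. This is a toy first-order Viterbi over score tables, not the code from the assignment; the names `emit`, `trans`, `viterbi`, and `greedy` are assumptions for illustration. In a structured perceptron, `emit[t][y]` would come from the local features of word t and `trans[p][y]` from tag-bigram features.

```python
def viterbi(emit, trans):
    """Highest-scoring tag sequence under a first-order model.

    emit[t][y]  : local score of tag y at position t
    trans[p][y] : score of tag p being followed by tag y
    Returns a list of tag indices, one per position.
    """
    n, k = len(emit), len(emit[0])
    best = [list(emit[0])]   # best[t][y]: best score of a path ending in y at t
    back = []                # back[t-1][y]: best previous tag for y at t
    for t in range(1, n):
        row, ptr = [], []
        for y in range(k):
            p = max(range(k), key=lambda q: best[-1][q] + trans[q][y])
            row.append(best[-1][p] + trans[p][y] + emit[t][y])
            ptr.append(p)
        best.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final tag.
    y = max(range(k), key=lambda q: best[-1][q])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]


def greedy(emit):
    """Independent per-token argmax, as an unstructured classifier would do."""
    return [max(range(len(row)), key=lambda y: row[y]) for row in emit]
```

With `emit = [[2.0, 1.0], [0.0, 0.5]]` and a transition table that penalizes switching tags, `[[0.0, -3.0], [-3.0, 0.0]]`, `greedy` returns `[0, 1]` while `viterbi` returns `[0, 0]`. When the transition scores are all zero, the two agree. The result above suggests that once neighbor words are in the features, the learned transition scores rarely overturn the local decision.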

I assume that this same experiment has been tried for more traditional POS datasets and that structured prediction has been found to help (although this is just an assumption; I don’t know of any specifics). So it’s interesting to think about why it doesn’t help here. One possibility is that the Twitter POS tagset is pretty coarse: only 23 tags. Maybe the sequence information would be more valuable if the tagset were more fine-grained.