(When) do we need Viterbi for POS tagging?

For my NLP class’s assignment on sequence labeling, I’m having them work on the Twitter POS data that I helped annotate with Noah Smith’s CMU group in 2011. The idea for the assignment is to first apply classifiers (Naive Bayes and perceptron), which can look at each word and its neighbors but cannot make a structured prediction for the entire sentence. Then we move to hidden Markov models and finally structured perceptron, which should reveal the importance of both joint inference (Viterbi) and discriminative learning.

But a funny thing happened on my way to the perfect problem set. I used features similar to the “base features” in our ACL paper (Gimpel et al., 2011), but I also added features for the left and right neighbors of each word. In an averaged perceptron, this resulted in a development set accuracy of 84.8%. The base-feature CRF in our paper gets 82.7%. At this point, I started to get excited — by adding the magic of structured prediction, I might be on my way to state-of-the-art results! Sadly no: when I turned my averaged perceptron into a structured perceptron, accuracy barely changed, coming in at 85.1%.

Now, when I had a simpler feature set (omitting the left and right neighbor features), the averaged perceptron got 81%, and the structured perceptron again got around 85%. So it seems that for this data, you can incorporate context either through your features or through structured prediction, but there’s hardly any advantage to combining the two.
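To make the comparison concrete, here is a minimal sketch (not the assignment code; feature names are made up) of the two ways context enters: neighbor-word features in the local model, and Viterbi decoding plus the structured perceptron update in the structured model.

```python
from collections import defaultdict

def local_features(words, i):
    """Emission-style features: the word plus its left/right neighbors."""
    left = words[i - 1] if i > 0 else "<s>"
    right = words[i + 1] if i < len(words) - 1 else "</s>"
    return [f"w={words[i]}", f"w-1={left}", f"w+1={right}"]

def viterbi(words, tags, w):
    """Highest-scoring tag sequence under emission + transition weights."""
    n = len(words)
    best = [{} for _ in range(n)]  # best[i][t] = (score, backpointer)
    for t in tags:
        emit = sum(w[f"{f}_{t}"] for f in local_features(words, 0))
        best[0][t] = (emit + w[f"trans_<s>_{t}"], None)
    for i in range(1, n):
        for t in tags:
            emit = sum(w[f"{f}_{t}"] for f in local_features(words, i))
            score, prev = max(
                (best[i - 1][p][0] + w[f"trans_{p}_{t}"] + emit, p)
                for p in tags
            )
            best[i][t] = (score, prev)
    tag = max(tags, key=lambda t: best[n - 1][t][0])
    out = [tag]
    for i in range(n - 1, 0, -1):  # follow backpointers
        tag = best[i][tag][1]
        out.append(tag)
    return out[::-1]

def structured_perceptron_update(words, gold, w, tags):
    """One structured-perceptron step: decode, then push weights toward gold."""
    pred = viterbi(words, tags, w)
    if pred == gold:
        return
    for seq, sign in ((gold, 1.0), (pred, -1.0)):
        prev = "<s>"
        for i, t in enumerate(seq):
            for f in local_features(words, i):
                w[f"{f}_{t}"] += sign
            w[f"trans_{prev}_{t}"] += sign
            prev = t
```

Note that the local perceptron already sees the neighbor words through `local_features`; the structured version adds only the `trans_*` weights and joint decoding, which is exactly the part that bought so little here.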

I assume that this same experiment has been tried for more traditional POS datasets and that structured prediction has been found to help (although this is just an assumption; I don’t know of any specifics). So it’s interesting to think of why it doesn’t help here. One possibility is that the Twitter POS tagset is pretty coarse — only 23 tags. Maybe the sequence information would be more valuable if it were more fine-grained.


10 Responses to (When) do we need Viterbi for POS tagging?

  1. A few ideas

    – The Twitter POS set is much, much smaller than WSJ PTB. Therefore the relative gain of a more sophisticated model will be smaller.
    – My reading of the literature is that POS tagging requires less latent context awareness than NER or other shallow parsing tasks. For example, the Stanford POS tagger is a cyclic dependency network, which is somewhere between an MEMM and a CRF in terms of latent context awareness.

  2. For standard WSJ POS tagging, I remember Chris Manning presenting a similar result a number of years ago: if you make the feature set really rich, then an MEMM or a simple classifier is not that much worse than a CRF. However, I think this result has some caveats in the local versus global learning context:

    1. As Brendan mentions, other sequential tasks like NER are much more dependent on the transition information and often benefit greatly from further structure such as semi-Markov topologies.

    2. There is a speed trade-off. Though Viterbi is O(T^2) in the number of tags, you can really reduce this by keeping only transitions seen during training, or seen during training above a frequency threshold. This largely eliminates the quadratic term without any major change in accuracy. If you add a bunch of features, then the dot product and feature extraction part of the computation gets larger, which is sometimes harder to optimize and could result in a really slow tagger.

    3. Accuracy might also not be the right metric. There is something to be said about consistency or plausibility of an analysis in total, which global models are better at. For example, in dependency parsing, arc-factored models have accuracy only slightly lower than higher-order models in terms of LAS, but if you plug the two into a MT system, the higher order system is way better. That is because the arc-factored model does weird things like predict two subjects because it doesn’t really model local consistency.

    4. With respect to 3, I think Dan Roth would say this is the perfect case where we should train locally yet infer globally (with ILP or something else). I.e., the model is powerful enough to learn the needed classifications, but we have global constraints that will guide inference.
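The transition-pruning idea in point 2 can be sketched as follows (a toy illustration, not anyone’s actual tagger; the scores here fold emission and transition weights into one per-position table for brevity):

```python
from collections import defaultdict

def collect_transitions(tagged_corpus):
    """Map each tag to the set of tags observed to follow it in training."""
    allowed = defaultdict(set)
    for tags in tagged_corpus:
        for prev, nxt in zip(tags, tags[1:]):
            allowed[prev].add(nxt)
    return allowed

def pruned_viterbi(scores, allowed):
    """scores[i][t] is the local score of tag t at position i; only tag
    bigrams present in `allowed` are explored, so the inner loop costs
    O(#observed bigrams) per position instead of O(T^2)."""
    n = len(scores)
    best = [dict(scores[0])]  # position 0: no transition constraint
    back = [{}]
    for i in range(1, n):
        cur, bp = {}, {}
        for prev, prev_score in best[-1].items():
            for t in allowed[prev]:  # skip unseen tag bigrams entirely
                if t not in scores[i]:
                    continue
                s = prev_score + scores[i][t]
                if t not in cur or s > cur[t]:
                    cur[t], bp[t] = s, prev
        best.append(cur)
        back.append(bp)
    tag = max(best[-1], key=best[-1].get)
    path = [tag]
    for i in range(n - 1, 0, -1):  # follow backpointers
        tag = back[i][tag]
        path.append(tag)
    return path[::-1]
```

The pruned decoder simply never scores a tag bigram that training data rules out, which is why accuracy is usually unaffected: those transitions would have received very low weight anyway.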

  3. nlpjacob says:

    Thanks guys, really interesting comments. A lot of this stuff is easily testable for this dataset; that would make a nice class project for somebody.

  4. Bo says:

    It is interesting that a similar phenomenon happened in some of the experiments I did on chord recognition. Chord recognition is pretty similar to POS tagging: each chord label is a tag, and the features are MFCCs or other audio features. A CRF barely improves on Quadratic Discriminant Analysis plus simple smoothing over the chord labels.

  5. yoav goldberg says:

    Some additional datapoints:

    You can get pretty high accuracy (97.16%) on WSJ tagging as well by using rich features and doing greedy sequential prediction with local models. Check out the 2004 paper “SVMTool: A general POS tagger generator based on Support Vector Machines”.

    Similarly for NER: with good features you get very competitive results with greedy decoding; see “Design Challenges and Misconceptions in Named Entity Recognition” (Ratinov and Roth, 2009).

    I guess the take-home from this is that sequence labeling tasks aren’t really a good testbed for structured-prediction algorithms, because they are too simple.

  6. Another datapoint, we got nice accuracy on Twitter POS from an MEMM: http://www.ark.cs.cmu.edu/TweetNLP/owoputi+etal.naacl13.pdf

  7. nlpjacob says:

    @Brendan: that would be a more useful data point if you had tried the CRF too 🙂

  8. future work reserved for class projects

  9. Very interesting indeed!
     Percy Liang’s paper on structure compilation is interesting, showing that structured prediction is less necessary for POS than for NER.

    Also, we found structured prediction only helps a little for Japanese POS tagging:

  10. On CRF vs MEMM: here’s a datapoint. I switched the MEMM in our Twitter POS tagger to a first-order CRF with identical features (so, literally, there are an identical number of parameters). I can’t remember which MEMM decoder I used (it was either greedy or Viterbi), but CRF was Viterbi. 0.2% Daily547 test set accuracy improvement. The important things according to our 2013 paper are word cluster features, which are more like a 2% accuracy boost.
