How to replicate Google News?

For starters: I definitely don’t know how to replicate Google News! But some GT undergrads want to try, and they asked me for pointers. Here’s what I said:

“As far as I know, the original effort in this space is Newsblaster. It’s still up and running, looking a lot like Google News (which came much later). A key difference is that they try to summarize the stories, not just cluster them.

demo: http://newsblaster.cs.columbia.edu/
paper: http://www.cs.columbia.edu/~sable/research/hlt-blaster.pdf

A few other things to read:

http://dl.acm.org/citation.cfm?id=2133826

http://dl.acm.org/citation.cfm?id=1557077

http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/jgc/publication/Learning_Approaches_Detecting_Tracking_IEEE_1999.pdf

I was part of a team that did some work on news story clustering in 2012, but honestly it’s probably overly complicated for your purposes:

http://dl.acm.org/citation.cfm?id=1963445

If these links don’t work for you, just Google the titles; you can find all of these papers online. The papers cited by these papers may be just as useful, or even more so. There’s a lot to read in this area, and I suspect you’ll find that the technology from the late 90s and early 2000s will work pretty well.

There’s an NLP startup called Prismatic that is doing something similar, you might find their blog useful. (Full disclosure: I am friends with one of the co-founders and their “Chief Software Wench.”)

There’s a good information retrieval textbook available online, and it has chapters that will be relevant to many of the core technologies that you’ll need, like clustering. http://nlp.stanford.edu/IR-book/

Good luck!

Jacob”
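
(A postscript for anyone wondering what the clustering core of such a system might look like: below is a minimal sketch using TF-IDF vectors, cosine similarity, and a greedy single-pass grouping, in the spirit of the old topic detection and tracking systems. The headlines and the similarity threshold are placeholders; a real system would use full article text, tune the threshold, and handle a streaming feed.)

```python
# Hypothetical sketch, not a working news aggregator: group near-duplicate
# stories by TF-IDF cosine similarity with a greedy single-pass assignment.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [  # placeholder headlines
    "Fed raises interest rates by a quarter point",
    "Federal Reserve hikes rates again",
    "New exoplanet discovered by space telescope",
]

X = TfidfVectorizer(stop_words="english").fit_transform(articles)
sim = cosine_similarity(X)

threshold = 0.3  # would need tuning on real data
clusters, seeds = [], []  # seeds[c] = index of the first story in cluster c
for i in range(len(articles)):
    for c, s in enumerate(seeds):
        if sim[i, s] >= threshold:
            clusters.append(c)
            break
    else:
        clusters.append(len(seeds))
        seeds.append(i)

print(clusters)  # cluster id assigned to each article
```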

Georgia Tech @ ACL 2014

Georgia Tech had a lot to say at this year’s annual meeting of the Association for Computational Linguistics (ACL).

  • Representation Learning for Text-level Discourse Parsing. Ji and Eisenstein. Main conference.
  • Fast Easy Unsupervised Domain Adaptation with Marginalized Structured Dropout. Yang and Eisenstein. Main conference.
  • Modeling Factuality Judgments in Social Media Text. Soni, Mitra, Gilbert, and Eisenstein. Main conference.
  • POS induction with distributional and morphological information using a distance-dependent Chinese restaurant process. Sirts, Eisenstein, Elsner, and Goldwater. Main conference.
  • Linguistic Style-Shifting in Online Social Media. Pavalanathan and Eisenstein. Workshop on Social Dynamics and Personal Attributes.
  • Mining Themes and Interests in the Asperger’s and Autism Community. Ji, Hong, Arriaga, Rozga, Abowd, and Eisenstein. Workshop on Computational Linguistics and Clinical Psychology.

I was especially excited to draw in some of my social computing colleagues, including Catherine Grevet, Tanushree Mitra, and Eric Gilbert. Here’s most of us:

From left: Sandeep Soni, Yangfeng Ji, Yi Yang, Catherine Grevet, Uma Pavalanathan, Tanushree Mitra, and Jacob Eisenstein


Computational Social Science Hack Day 2014

GT and Emory Computational Social Science enthusiasts again joined forces — this time for a hack day at Georgia Tech — following up on a fun and successful workshop at Emory in November. The 20 participants from the two universities pursued a variety of projects, including:

  • Structural balance theory in comic book narratives (Amish, Sandeep, and Vinay, with help from Vinodh)
  • Identifying noteworthy items in text from State-of-the-Union addresses (Tanushree and Yangfeng, with help from Jeff and Tom)
  • Combating the trafficking of minors, using text analysis and computer vision (Eric and Parisa)
  • Connecting spelling variation with sentiment analysis (Uma, Yi, and Yu)
  • Predicting Grammy award winners from Tweet volume (Jayita and Spurthi)
  • Mining OpenSecrets to find latent clusters of campaign donors and recipients (Jacob, Jon, and Munmun)

Since this is my blog post, I’ll take a little more time to talk about my own project. Essentially, we read in OpenSecrets data and built a matrix of donors and recipients; the sparse SVD of this matrix revealed some interesting factors. Here are the top 7 candidates and donors for each of the most interesting factors, along with my own names for the factors (a rough code sketch of the pipeline follows the listing).

—– Factor 2 (unions) —–
Julia Brownley (D) American Fedn of St/Cnty/Munic Employees
Ed Markey (D) American Assn for Justice
Ann Kirkpatrick (D) Intl Brotherhood of Electrical Workers
Ann Mclane Kuster (D) International Assn of Fire Fighters
Timothy H. Bishop (D) Operating Engineers Union
Ron Barber (D) American Federation of Teachers
Cheri Bustos (D) National Assn of Letter Carriers
—– Factor 3 (insurers) —–
Kay R. Hagan (D) Metlife Inc
Max Baucus (D) American Council of Life Insurers
Ron Kind (D) Principal Life Insurance
Dave Camp (R) Massachusetts Mutual Life Insurance
Joseph Crowley (D) Morgan Stanley
Richard E. Neal (D) TIAA-CREF
Pat Toomey (R) UBS Americas
—– Factor 4 (finance) —–
Jeb Hensarling (R) Investment Co Institute
Randy Neugebauer (R) American Land Title Assn
Scott Garrett (R) Chicago Mercantile Exchange
Michael Grimm (R) Indep Insurance Agents & Brokers/America
Bill Huizenga (R) Bank of America
Sean P. Duffy (R) Securities Industry & Financial Mkt Assn
Steve Stivers (R) PricewaterhouseCoopers
—– Factor 6 (construction and industry) —–
Bill Shuster (R) American Council of Engineering Cos
Frank A. LoBiondo (R) Owner-Operator Independent Drivers Assn
Patrick Meehan (R) CSX Corp
David P Joyce (R) Carpenters & Joiners Union
Jim Gerlach (R) American Road & Transport Builders Assn
Nick Rahall (D) Norfolk Southern
Tom Petri (R) NiSource Inc
—– Factor 7 (arms manufacturers) —–
Adam Smith (D) Raytheon Co
Buck Mckeon (R) Northrop Grumman
John Cornyn (R) Lockheed Martin
Joe Wilson (R) National Assn of Realtors
John Carter (R) AT&T Inc
Michael McCaul (R) BAE Systems
Lamar Smith (R) Honeywell International
—– Factor 8 (technology) —–
Jeanne Shaheen (D) Microsoft Corp
Ed Markey (D) Every Republican is Crucial PAC
Kay R. Hagan (D) Verizon Communications
George Holding (R) National Cable & Telecommunications Assn
Chris Coons (D) Google Inc
Joe Heck (R) National Assn of Broadcasters
Ron DeSantis (R) Viacom International
—– Factor 9 (energy + Halliburton) —–
Pete Olson (R) Halliburton Co
Ed Markey (D) Koch Industries
Joe Barton (R) Independent Petroleum Assn of America
Bill Johnson (R) National Cable & Telecommunications Assn
Michael G. Fitzpatrick (R) Occidental Petroleum
Mary L. Landrieu (D) Cellular Telecom & Internet Assn
Steve Scalise (R) DTE Energy
—– Factor 11 (communications) —–
Bob Goodlatte (R) Google Inc
Kelly Ayotte (R) Clear Channel Communications
Lindsey Graham (R) Sprint Corp
Tim Scott (R) Association of American Railroads
Susan Collins (R) Union Pacific Corp
Eric Swalwell (D) Norfolk Southern
John Thune (R) Microsoft Corp
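
For reference, here is a rough sketch of the pipeline described above: read contributions, build a sparse recipient-by-donor matrix, and take its truncated SVD. The file name and column names are hypothetical stand-ins for the OpenSecrets bulk data, and our actual hack-day code differed in the details.

```python
# Rough sketch of the donor-recipient factorization (hypothetical input file).
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

df = pd.read_csv("contributions.csv")  # assumed columns: donor, recipient, amount
donors = {d: j for j, d in enumerate(df["donor"].unique())}
recips = {r: i for i, r in enumerate(df["recipient"].unique())}

# rows = recipients (candidates), columns = donors, entries = log dollar amounts
X = csr_matrix(
    (np.log1p(df["amount"]),
     (df["recipient"].map(recips), df["donor"].map(donors))),
    shape=(len(recips), len(donors)),
)

U, S, Vt = svds(X, k=20)  # truncated SVD of the sparse matrix

recip_names, donor_names = list(recips), list(donors)
for k in np.argsort(-S):  # strongest factors first
    print("----- Factor", k, "-----")
    for i, j in zip(np.argsort(-np.abs(U[:, k]))[:7],
                    np.argsort(-np.abs(Vt[k]))[:7]):
        print(recip_names[i], "|", donor_names[j])
```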

Our plan was to learn more about these factors by connecting them with text from the Wikipedia pages of the companies and with the NOMINATE scores and committee memberships of the legislators. Maybe that would have been possible in a 24-hour hackathon, but we ran out of time just as I was starting to get reasonable topics for the Wikipedia pages.

Wrapup
We went into the day with the goal of building and strengthening connections across disciplines and institutions, and by that metric I think the day was a success. In any case, I had a blast working with new people and trying out some new ideas, and I’m confident this will impact my research in the long run. It was also a lot of fun to work with my own students and colleagues in a more collaborative setting. Taking a day off from endless paper and proposal deadlines (and non-stop email distractions) to hack on a new project felt like a mini-vacation.

Georgia Tech at EMNLP 2013

EMNLP is one of my favorite conferences, so I’m very pleased that Georgia Tech’s Computational Linguistics Lab will have two papers to present.

Yi Yang and I have written a paper that formalizes unsupervised text normalization in a log-linear model, which allows arbitrary (local) features to be combined with a target language model. The model is trained in a maximum-likelihood framework, marginalizing over possible normalizations using a novel sequential Monte Carlo training scheme. Longtime readers may find some irony in me writing a paper about social media normalization, but if we want to understand systematic orthographic variation — such as (TD)-deletion — then an accurate normalization system is a very useful tool to have in the shed. http://www.cc.gatech.edu/~jeisenst/papers/yang-emnlp-2013.pdf
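
Very roughly, and in my own notation rather than the paper’s, the flavor of the model is a log-linear channel score between the observed (noisy) sequence s and a candidate normalization t, combined with a target language model over t, trained by maximizing the marginal likelihood of the observed text:

```latex
% sketch only (my notation): s = observed noisy sequence, t = candidate normalization
p_\theta(s \mid t) \propto \exp\!\big(\theta^\top f(s, t)\big), \qquad
\ell(\theta) = \sum_i \log \sum_{t} p_{\mathrm{LM}}(t)\, p_\theta(s_i \mid t)
```

The inner sum over candidate normalizations is what the sequential Monte Carlo scheme approximates.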

Yangfeng Ji and I have obtained very strong results on paraphrase detection, beating the prior state-of-the-art on the well-studied MSR Paraphrase Corpus by 3% raw accuracy. We build a distributional representation for sentence semantics, which we combine with traditional fine-grained features. Yangfeng’s key insight in this paper is to also use supervised information to compute the distributional representation itself, by reweighting the words according to their discriminability. http://www.cc.gatech.edu/~jeisenst/papers/ji-emnlp-2013.pdf
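
To give a feel for the reweighting idea, here is a toy illustration (my own sketch, not the model or features from the paper): score each word’s discriminability against the supervised labels, scale its counts accordingly, and then learn a low-dimensional representation from the reweighted counts.

```python
# Toy sketch of supervised reweighting before building a distributional
# representation; the data, labels, and scoring function are all stand-ins.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

sentences = ["the cat sat on the mat", "a cat was sitting on a mat",
             "stocks fell sharply today", "the stock market dropped today"]
labels = [0, 0, 1, 1]  # made-up supervision, purely illustrative

X = CountVectorizer().fit_transform(sentences)

scores, _ = chi2(X, labels)            # per-word discriminability
weights = 1.0 + scores / scores.max()  # upweight discriminative words

Z = TruncatedSVD(n_components=2).fit_transform(X.toarray() * weights)
print(Z)  # low-dimensional representations of the reweighted sentences
```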

How noisy is social media text?

In my NAACL 2013 paper/rant, I expressed concern that a lot of the NLP work targeting social media is based on folk linguistics rather than either solid theory or empirical data about how social media language actually works. In my paper I tried to provide a little of both: citations to some of my favorite papers from the CMC and sociolinguistics literatures (which seem to be nearly totally unknown in NLP circles), and an empirical analysis of social media language differences using n-gram counts.

This recent paper by Baldwin, Cook, Lui, Mackinlay, and Wang — basically contemporaneous with mine, but they were kind enough to cite me — takes the empirical analysis a good way further. I was particularly interested to see that they applied a generative HPSG grammar of English to corpora from Twitter, YouTube comments (the worst place on the whole internet?), web forums, blog posts, Wikipedia, and the BNC. They found that if you want strict parsing of full sentences, Twitter is quite difficult — only 14% of tweets are parseable this way, as compared to 25% for blogs and 49% for Wikipedia. Relaxing punctuation and capitalization reduces these differences considerably, yielding 36% parseability for tweets, 44% for blogs, and 68% for Wikipedia. Another 25% of tweets are viewed as grammatical fragments (e.g., “very funny”), leaving only 37% of tweets as “unparseable”, compared to 35% for blogs and 26% for Wikipedia. This coheres with arguments from linguists like Thurlow and Squires (sadly, I can find no publicly available PDF for her cool 2010 Language in Society paper) that claims of a radically unreadable netspeak dialect are greatly exaggerated.

The paper also provides a lexical analysis, using a chi-squared score to measure differences between the 500 most frequent words in each corpus. But if, as I argued in my 2013 NAACL paper, social media is an amalgam of writing styles rather than a single genre or dialect, few of these stylistic markers will attain enough universality to reach the top 500 words, besides the usual suspects: lol, you/u, gonna, and the most popular emoticons. Baldwin et al. also measure the perplexity of a trigram language model, which may capture this “long tail”, but personally I find this a little harder to interpret than simple n-gram out-of-vocabulary counts, as it depends on modeling decisions such as smoothing.
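
For what it’s worth, the out-of-vocabulary comparison I have in mind is very simple; here is a toy sketch (the example text is made up, and real corpora would be tokenized consistently first):

```python
# Toy sketch: what fraction of tokens in a target corpus are unseen in a
# reference corpus's vocabulary?
from collections import Counter

def oov_rate(target_tokens, reference_tokens, min_count=1):
    """Fraction of target tokens absent from the reference vocabulary."""
    ref_vocab = {w for w, c in Counter(reference_tokens).items() if c >= min_count}
    return sum(w not in ref_vocab for w in target_tokens) / len(target_tokens)

tweets = "omg lol that was soooo funny".split()
wiki = "the meeting was funny according to several reports".split()
print(oov_rate(tweets, wiki))  # high OOV rate for the toy tweet
```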

Overall, I’m very happy to see NLP technology used to empirically measure the similarities and differences between social media and other forms of writing, and I’m particularly intrigued by the use of automated generative parsing. As we try to make language technology robust to language variation, papers like this will help us move forward on a solid empirical footing.

(h/t Brendan O’Connor for pointing me to this paper)

adventures in cross-disciplinary collaboration, part 27: typesetting

One challenging thing about building bridges to the sociolinguistics community from the computer science world is that publication methods in sociolinguistics are… traditional. The open-access movement hasn’t made many inroads (language@internet is a great exception — please comment if I’m missing others), and many journals require Microsoft Word format for submissions. I find this pretty surprising in a discipline that involves a considerable amount of formal notation, both linguistic and mathematical.

Anyway, I convinced my co-authors to take the path of writing the document in LaTeX and then converting right before submission — by promising that I would manage the conversion. And now the chickens have come home to roost. We’ve got a fairly complicated 45-page document, with all the usual stuff: equations, figures, tables, references, etc. I’ve spent the morning tracking down various forum posts about how to get as many of these features to survive the conversion as possible, with mixed success. Here’s what I’ve figured out so far:

latex2rtf is the current winner. It did a good job with the citations, got some of the references, and messed up all of the math. Make sure to update to version 2.3.3, not the 1.9.19 that is the default on Ubuntu.

my command line: latex2rtf main

pandoc lost all document-level formatting, citations, and references. But it did a nice job on the equations. I may create the main document with latex2rtf and then copy in the equations from pandoc.

my command line: pandoc -f latex -t odt -o main.odt main.tex

tex4ht gets a good recommendation here, but for me it generates blank output.

my command line: mk4ht oolatex main.tex

latex2html was advertised here, but I can’t get its output into ODT, and anyway it doesn’t handle any of the equations for me.

my command line: latex2html main.tex -split 0 -no_navigation -info "" -address "" -html_version 4.0,unicode

(When) do we need Viterbi for POS tagging?

For my NLP class’s assignment on sequence labeling, I’m having them work on the Twitter POS data that I helped annotate with Noah Smith’s group at CMU in 2011. The idea for the assignment is to first apply classifiers (Naive Bayes and Perceptron), which can look at each word and its neighbors, but cannot make a structured prediction for the entire sentence. Then we move to hidden Markov models and finally the structured perceptron, which should reveal the importance of both joint inference (Viterbi) and discriminative learning.

But a funny thing happened on my way to the perfect problem set. I used features similar to the “base features” in our ACL paper (Gimpel et al. 2011), but I also added features for the left and right neighbors of each word. In an averaged perceptron, this resulted in a development set accuracy of 84.8%. The base-feature CRF in our paper gets 82.7%. At this point, I started to get excited — by adding the magic of structured prediction, I might be on my way to state-of-the-art results! Sadly, no: when I turn my averaged perceptron into a structured perceptron, accuracy is barely changed, coming in at 85.1%.
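
For concreteness, here is the kind of per-token feature function I mean, with the left and right neighbor words included (an illustrative sketch, not the exact assignment features):

```python
# Sketch of per-token features for a (non-structured) classifier such as
# Naive Bayes or an averaged perceptron; the feature names are arbitrary.
def token_features(tokens, i):
    word = tokens[i]
    return {
        "word=" + word.lower(): 1,
        "suffix3=" + word[-3:].lower(): 1,
        "is_capitalized": int(word[0].isupper()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"): 1,
        "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"): 1,
    }

print(token_features("ikr smh he asked fir yo phone".split(), 4))
```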

Now, when I had a simpler feature set (omitting the left and right neighbor features), averaged perceptron got 81%, and structured perceptron again got around 85%. So it seems that for this data, you can incorporate context through either your features or through structured prediction, but there’s hardly any advantage to combining the two.

I assume that this same experiment has been tried for more traditional POS datasets and that structured prediction has been found to help (although this is just an assumption; I don’t know of any specifics). So it’s interesting to think of why it doesn’t help here. One possibility is that the Twitter POS tagset is pretty coarse — only 23 tags. Maybe the sequence information would be more valuable if it were more fine-grained.
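
For readers who want to see what the “structure” buys you at inference time, here is a minimal Viterbi decoder of the kind used by the HMM and the structured perceptron (a generic sketch, not the assignment solution): emission[t, k] is the local score of tag k at position t, and transition[i, j] is the score of moving from tag i to tag j.

```python
# Minimal Viterbi decoder for additive emission + transition scores.
import numpy as np

def viterbi(emission, transition):
    """emission: (T, K) local scores; transition: (K, K) tag-to-tag scores."""
    T, K = emission.shape
    delta = np.zeros((T, K))            # best score of any path ending in tag k at t
    back = np.zeros((T, K), dtype=int)  # backpointers
    delta[0] = emission[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + transition + emission[t][None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    tags = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]

# toy usage: 3 tokens, 2 tags
em = np.array([[2.0, 0.0], [0.0, 1.0], [1.5, 0.5]])
tr = np.array([[0.5, -1.0], [-1.0, 0.5]])
print(viterbi(em, tr))  # [0, 0, 0] for this toy input
```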
