September 12, 2014
For starters: I definitely don’t know how to replicate Google News! But some GT undergrads want to try, and they asked me for pointers. Here’s what I said:
“As far as I know, the original effort in this space is Newsblaster. It’s still up and running, looking a lot like Google News (which came much later). A key difference is that they try to summarize the stories, not just cluster them.
A few other things to read:
I was part of a team that did some work on news story clustering in 2012, but honestly it’s probably overly complicated for your purposes:
If these links don’t work for you, just google the titles; you can find all these papers online. The papers cited in them might also be useful, maybe more useful. There’s a lot to read in this area, and I suspect you’ll find that the technology from the late 90s and early 2000s will work pretty well.
There’s an NLP startup called Prismatic that is doing something similar; you might find their blog useful. (Full disclosure: I am friends with one of the co-founders and their “Chief Software Wench.”)
There’s a good information retrieval textbook available online, and it has chapters that will be relevant to many of the core technologies that you’ll need, like clustering. http://nlp.stanford.edu/IR-book/