What to read
I often meet people who want to get involved in computational linguistics, text mining, natural language processing, etc., and I find myself sending lots of emails with pointers to different things to read to get started. So I decided to put all of the things on one webpage. Sorry if I left out your favorite thing; feel free to post it in the comments.
Natural Language Processing
Jurafsky and Martin is the standard textbook for NLP, and I use it in my class. As far as I know there is no real competitor as a general introduction to the subject. It does a pretty good job surveying the linguistics that NLP researchers need to know (real linguists may disagree), and covers the rudiments of the machine learning you will need to do NLP research.
Linguistic Structure Prediction by Noah Smith is a good, rigorous text for machine learning approaches to analyzing linguistic structures like parse trees. It’s not quite broad enough to use in a general NLP class. It’s much more technical than the J&M book, but the appendices make it fairly self-contained, meaning you can read it without a lot of additional background if you are diligent. At a lot of universities it is possible to get a free PDF.
Natural Language Processing with Python covers the Python NLTK library. It’s quite accessible and will get you up and running with NLP applications quickly. It doesn’t cover the machine learning side in any depth, and wouldn’t really suffice to train an NLP researcher.
I haven’t watched the lectures from Stanford’s Coursera offering, but the syllabus looks good and the lecturers are great.
Linguistic Fundamentals for Natural Language Processing by Emily Bender is an accessible and readable introduction to linguistics for people working on language tech.
Machine Learning and Statistics
Machine Learning: A Probabilistic Perspective is my new favorite. It covers a very large proportion of the machine learning that we use in statistical NLP, and is very current and clear.
Elements of Statistical Learning takes more of a stats perspective. It’s quite good. The PDF is free.
Statistics in a Nutshell covers more basic elements of statistics that you might like to refresh before diving into machine learning (recommended by Rick Rutledge).
The Analysis of Data: Probability is an overview of probability theory, emphasizing elements that are crucial for machine learning and text analysis. An HTML version is free.
Bayesian Reasoning and Machine Learning covers Bayesian methods and graphical models (recommended by Yangfeng Ji).
Pattern Recognition and Machine Learning is a broad introduction to machine learning, with a nice emphasis on graphical models.
Information Theory, Inference, and Learning Algorithms is getting a little old (published 2003), but it offers a really interesting take on the relationships between machine learning and fields like information theory, statistical physics, and cryptography. It probably shouldn’t be your first ML text, but some readers will find it uniquely enjoyable (I did).
Surveys and tutorials
Here is a nice survey of variational Bayesian inference, covering the fiddly details of continuous random variables.
There’s a Coursera offering from one of the top researchers in the field.
Rob Zinkov overviews several machine learning textbooks.
More specific machine learning subjects
Meta-optimize is a question-answer community for machine learning. If you are having trouble getting a published algorithm to work, chances are good that someone there knows the answer.
I work at the intersection of computational methods and sociolinguistics. Some of my favorite sociolinguistic things:
Principles of Language Change, Volume 2 by William Labov. This describes a long-running and comprehensive research program into the social factors that drive language change. If you’re a computer scientist, it’s great for understanding more about social science research methodology; the stories are also fascinating.
American English describes the major dialects in the US and, perhaps more importantly, gives a good introduction to how to think about dialect.
Style and Sociolinguistic Variation is an edited volume containing many of the leading theoretical ideas on individual stylistic choices in language (circa 2002).
Trudgill’s Sociolinguistic Typology focuses on the social factors driving long-term language variation and change. I found it very thought-provoking.
Finally, this lecture by William Labov requires no linguistic background, and is both fascinating and touching. Recommended listening material for any road trip.
Getting Twitter data
I get asked about how to get data from Twitter. There is a free e-book on this topic.
If you’re working on language, you should read Language Log. The breakfast experiments are my favorite part.
Mark Dredze and Hannah Wallach wrote a nice guide about how to be a PhD student. If you are one, are considering becoming one, or have a grad student in your life, then you should read it.