A sprint through Python's Natural Language Toolkit, presented at SFPython on 9/14/2011. Covers tokenization, part-of-speech tagging, chunking & NER, text classification, and training text classifiers with nltk-trainer.
4. Some NLTK Features
sentence & word tokenization
part-of-speech tagging
chunking & named entity recognition
text classification
many included corpora
5. Sentence Tokenization
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hello SF Python. This is NLTK.")
['Hello SF Python.', 'This is NLTK.']
>>> sent_tokenize("Hello, Mr. Anderson. We missed you!")
['Hello, Mr. Anderson.', 'We missed you!']
6. Word Tokenization
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('This is NLTK.')
['This', 'is', 'NLTK', '.']
15. Train a Sentiment Classifier
$ ./train_classifier.py movie_reviews --instances paras
loading movie_reviews
2 labels: ['neg', 'pos']
2000 training feats, 2000 testing feats
training NaiveBayes classifier
accuracy: 0.967000
neg precision: 1.000000
neg recall: 0.934000
neg f-measure: 0.965874
pos precision: 0.938086
pos recall: 1.000000
pos f-measure: 0.968054
dumping NaiveBayesClassifier to ~/nltk_data/classifiers/movie_reviews_NaiveBayes.pickle
16. Notable Included Corpora
movie_reviews: pos & neg categorized IMDb reviews
treebank: tagged and parsed WSJ text
treebank_chunk: tagged and chunked WSJ text
brown: tagged & categorized English text
60 other corpora in many languages
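All of these share a common corpus reader API; as a minimal sketch (assuming the movie_reviews corpus has been installed via nltk.download()):
>>> from nltk.corpus import movie_reviews
>>> movie_reviews.categories()
['neg', 'pos']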
20. NLTK Tutorial @ PyCon
What would you want to learn in 3 hours?
What kinds of NLP problems do you face at work?
What do you want to do with text?
Editor's Notes
Text processing is very useful in a number of areas. There's a ton of unstructured text flooding the internet nowadays, and NLP/ML is one of the best ways to deal with it.
This is what I'll cover today, but there's a lot more I won't be covering.
sent_tokenize() loads a trained sentence tokenizer, then calls its tokenize() method. NLTK has sentence tokenizers for 16 languages, and they are smarter than just splitting on punctuation.
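One of those language-specific tokenizers can also be loaded directly; a minimal sketch, assuming the punkt models have been installed with nltk.download():
>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
>>> tokenizer.tokenize('Hola amigo. Estoy bien.')
['Hola amigo.', 'Estoy bien.']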
word_tokenize() loads a word tokenizer trained on treebank, then calls its tokenize() method.
Non-ASCII characters are also a problem for word_tokenize(). wordpunct_tokenize() can often be better, but you first need to decide what a word is for your specific case: do contractions matter? Can you replace them with two words? The demo shows the results from 4 different tokenizers.
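To illustrate the difference, here's how the two tokenizers handle a contraction (outputs are typical for NLTK's default tokenizers):
>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize
>>> word_tokenize("Can't is a contraction.")
['Ca', "n't", 'is', 'a', 'contraction', '.']
>>> wordpunct_tokenize("Can't is a contraction.")
['Can', "'", 't', 'is', 'a', 'contraction', '.']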
pos_tag() loads a POS tagger trained on treebank. The first call will take a few seconds to load the pickle file off disk; every subsequent call will use the in-memory tagger. You can find tables of POS tag definitions online.
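A minimal sketch of tagging a tokenized sentence (exact tags may vary with the tagger version):
>>> from nltk import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> pos_tag(word_tokenize('This is NLTK.'))
[('This', 'DT'), ('is', 'VBZ'), ('NLTK', 'NNP'), ('.', '.')]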
POS tags might not be useful by themselves, but they are useful metadata for other NLP tasks like dictionary lookup and POS-specific keyword analysis, and they are essential for chunking & NER.
Every Tree has a draw() method that uses Tkinter.
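For example, ne_chunk() returns a Tree that can be drawn; a minimal sketch:
>>> from nltk import ne_chunk, pos_tag
>>> from nltk.tokenize import word_tokenize
>>> tree = ne_chunk(pos_tag(word_tokenize('Hello, Mr. Anderson.')))
>>> tree.draw()  # opens a Tkinter window showing the tree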
Bag-of-words is the simplest model, but it ignores frequency. It's good for small texts, but frequency can be very important for larger documents. Other algorithms, like SVM, create sparse arrays of 1 or 0 depending on word presence, but require knowing the full vocabulary beforehand. This classifier is one I trained with nltk-trainer, and it can be used for sentiment analysis because its categories are "pos" and "neg".
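A minimal sketch of a bag-of-words feature extractor, in the dict form NLTK classifiers expect:
>>> def bag_of_words(words):
...     # presence-only features: frequency is ignored
...     return dict((word, True) for word in words)
...
>>> feats = bag_of_words(['great', 'movie'])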
nltk-trainer can train taggers, chunkers, and text classifiers, and it's great for analyzing corpora and how a model performs against a labeled corpus. I use nltk-trainer to train all my models nowadays.
This trains a very basic sentiment analysis classifier on the movie_reviews corpus, which has reviews categorized into pos or neg.
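Once dumped, the pickled classifier can be loaded back through nltk.data and used directly; a minimal sketch:
>>> import nltk.data
>>> from nltk.tokenize import word_tokenize
>>> classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')
>>> feats = dict((word, True) for word in word_tokenize('What a great movie!'))
>>> classifier.classify(feats)  # returns 'pos' or 'neg'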
treebank is a very standard corpus for testing taggers and chunkers.
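A quick look at treebank's tagged words (assuming the corpus is installed):
>>> from nltk.corpus import treebank
>>> treebank.tagged_words()[:3]
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ',')]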
NLP isn't black magic, but you can treat it as a black box until the defaults aren't good enough. Then you need to dig in and learn how it works so you can make it do what you want. At that point, the best thing you can do is find or make good data, then use existing algorithms to learn from it.
The original NLTK book is very good and available for free online, but it takes a "textbook" approach. I tried to be a lot more practical in my cookbook. The nltk-users mailing list is pretty active, and you can also try Stack Overflow.