A sprint through Python's Natural Language Toolkit, presented at SFPython on 9/14/2011. Covers tokenization, part-of-speech tagging, chunking & NER, text classification, and training text classifiers with nltk-trainer.
4. Some NLTK Features
sentence & word tokenization
part-of-speech tagging
chunking & named entity recognition
text classification
many included corpora
5. Sentence Tokenization
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hello SF Python. This is NLTK.")
['Hello SF Python.', 'This is NLTK.']
>>> sent_tokenize("Hello, Mr. Anderson. We missed you!")
['Hello, Mr. Anderson.', 'We missed you!']
6. Word Tokenization
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('This is NLTK.')
['This', 'is', 'NLTK', '.']
15. Train a Sentiment Classifier
$ ./train_classifier.py movie_reviews --instances paras
loading movie_reviews
2 labels: ['neg', 'pos']
2000 training feats, 2000 testing feats
training NaiveBayes classifier
accuracy: 0.967000
neg precision: 1.000000
neg recall: 0.934000
neg f-measure: 0.965874
pos precision: 0.938086
pos recall: 1.000000
pos f-measure: 0.968054
dumping NaiveBayesClassifier to ~/nltk_data/classifiers/movie_reviews_NaiveBayes.pickle
16. Notable Included Corpora
movie_reviews: pos & neg categorized IMDb reviews
treebank: tagged and parsed WSJ text
treebank_chunk: tagged and chunked WSJ text
brown: tagged & categorized English text
60 other corpora in many languages
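All of these share a common corpus reader API; as a minimal sketch (assuming the movie_reviews corpus has been installed via nltk.download()):
>>> from nltk.corpus import movie_reviews
>>> movie_reviews.categories()
['neg', 'pos']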
20. NLTK Tutorial @ PyCon
What would you want to learn in 3 hours?
What kinds of NLP problems do you face at work?
What do you want to do with text?
Editor's Notes
Text processing is very useful in a number of areas. There's a ton of unstructured text flooding the internet nowadays, and NLP/ML is one of the best ways to deal with it.
This is what I'll cover today, but there's a lot more I won't be covering.
sent_tokenize() loads a trained sentence tokenizer, then calls its tokenize() method. NLTK has sentence tokenizers for 16 languages, and they are smarter than just splitting on punctuation.
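One of those language-specific tokenizers can also be loaded directly; a minimal sketch, assuming the punkt models have been installed with nltk.download():
>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/spanish.pickle')
>>> tokenizer.tokenize('Hola amigo. Estoy bien.')
['Hola amigo.', 'Estoy bien.']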
word_tokenize() loads a word tokenizer trained on treebank, then calls its tokenize() method.
Non-ASCII characters are also a problem for word_tokenize(). wordpunct_tokenize() can often be better, but you first need to decide what a word is for your specific case: do contractions matter? Can you replace them with two words? The demo shows the results from 4 different tokenizers.
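To illustrate the difference, here's how the two tokenizers handle a contraction (outputs are typical for NLTK's default tokenizers):
>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize
>>> word_tokenize("Can't is a contraction.")
['Ca', "n't", 'is', 'a', 'contraction', '.']
>>> wordpunct_tokenize("Can't is a contraction.")
['Can', "'", 't', 'is', 'a', 'contraction', '.']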
pos_tag() loads a POS tagger trained on treebank. The first call will take a few seconds to load the pickle file off disk; every subsequent call will use the in-memory tagger. You can find tables of POS tag definitions online.
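A minimal sketch of tagging a tokenized sentence (exact tags may vary with the tagger version):
>>> from nltk import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> pos_tag(word_tokenize('This is NLTK.'))
[('This', 'DT'), ('is', 'VBZ'), ('NLTK', 'NNP'), ('.', '.')]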
POS tags might not be useful by themselves, but they are useful metadata for other NLP tasks like dictionary lookup and POS-specific keyword analysis, and they are essential for chunking & NER.
Every Tree has a draw() method that uses Tkinter.
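For example, ne_chunk() returns a Tree that can be drawn; a minimal sketch:
>>> from nltk import ne_chunk, pos_tag
>>> from nltk.tokenize import word_tokenize
>>> tree = ne_chunk(pos_tag(word_tokenize('Hello, Mr. Anderson.')))
>>> tree.draw()  # opens a Tkinter window showing the tree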
Bag-of-words is the simplest model, but it ignores frequency. It's good for small texts, but frequency can be very important for larger documents. Other algorithms, like SVM, create sparse arrays of 1 or 0 depending on word presence, but require knowing the full vocabulary beforehand. This classifier is one I trained with nltk-trainer, and it can be used for sentiment analysis because its categories are "pos" and "neg".
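A minimal sketch of a bag-of-words feature extractor, in the dict form NLTK classifiers expect:
>>> def bag_of_words(words):
...     # presence-only features: frequency is ignored
...     return dict((word, True) for word in words)
...
>>> feats = bag_of_words(['great', 'movie'])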
nltk-trainer can train taggers, chunkers, and text classifiers, and it's great for analyzing corpora and how a model performs against a labeled corpus. I use nltk-trainer to train all my models nowadays.
This trains a very basic sentiment analysis classifier on the movie_reviews corpus, which has reviews categorized into pos or neg.
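Once dumped, the pickled classifier can be loaded back through nltk.data and used directly; a minimal sketch:
>>> import nltk.data
>>> from nltk.tokenize import word_tokenize
>>> classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')
>>> feats = dict((word, True) for word in word_tokenize('What a great movie!'))
>>> classifier.classify(feats)  # returns 'pos' or 'neg'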
treebank is a very standard corpus for testing taggers and chunkers.
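A quick look at treebank's tagged words (assuming the corpus is installed):
>>> from nltk.corpus import treebank
>>> treebank.tagged_words()[:3]
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ',')]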
NLP isn't black magic, but you can treat it as a black box until the defaults aren't good enough. Then you need to dig in and learn how it works so you can make it do what you want. At that point, the best thing you can do is find or make good data, then use existing algorithms to learn from it.
The original NLTK book is very good and available for free online, but it takes a "textbook" approach. I tried to be a lot more practical in my cookbook. The nltk-users mailing list is pretty active, and you can also try Stack Overflow.