SlideShare a Scribd company logo
1 of 20
NLTK in 20 minutes
A sprint thru Python's Natural Language ToolKit
Jacob Perkins

Co-founder/CTO @ Weotta (we're hiring :)
"Python Text Processing with NLTK 2.0 Cookbook"
NLTK Contributor
Blog: http://streamhacker.com
NLTK Demos & APIs: http://text-processing.com
@japerk
Why Text Processing?
sentiment analysis
spam filtering
plagariasm detection / document similarity
document categorization / topic detection
phrase extraction, summarization
smarter search
simple keyword frequency analysis
Some NLTK Features

sentence & word tokenization
part-of-speech tagging
chunking & named entity recognition
text classification
many included corpora
Sentence Tokenization
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hello SF Python. This is NLTK.")
['Hello SF Python.', 'This is NLTK.']

>>> sent_tokenize("Hello, Mr. Anderson. We missed you!")
['Hello, Mr. Anderson.', 'We missed you!']
Word Tokenization
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('This is NLTK.')
['This', 'is', 'NLTK', '.']
What's a Word?
>>> word_tokenize("What's up?")
['What', "'s", 'up', '?']
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize("What's up?")
['What', "'", 's', 'up', '?']




Learn More: http://text-processing.com/demo/tokenize/
Part-of-Speech Tagging
>>> words = word_tokenize("And now for something
completely different")
>>> from nltk.tag import pos_tag
>>> pos_tag(words)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
('something', 'NN'), ('completely', 'RB'), ('different',
'JJ')]

Tags List: http://www.ling.upenn.edu/courses/
Fall_2003/ling001/penn_treebank_pos.html
Why Part-of-Speech Tag?

word definition lookup (WordNet, WordNik)
fine-grained text analytics
part-of-speech specific keyword analysis
chunking & named entity recognition (NER)
Chunking & NER
>>> from nltk.chunk import ne_chunk
>>> ne_chunk(pos_tag(word_tokenize('My name is Jacob
Perkins.')))
Tree('S', [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'),
Tree('PERSON', [('Jacob', 'NNP'), ('Perkins', 'NNP')]),
('.', '.')])
NER not perfect
>>> ne_chunk(pos_tag(word_tokenize('San Francisco is
foggy.')))
Tree('S', [Tree('GPE', [('San', 'NNP')]), Tree('PERSON',
[('Francisco', 'NNP')]), ('is', 'VBZ'), ('foggy', 'NN'),
('.', '.')])
Text Classification
def bag_of_words(words):
    return dict([(word, True) for word in words])

>>> feats = bag_of_words(word_tokenize("great movie"))
>>> import nltk.data
>>> classifier = nltk.data.load('classifiers/
movie_reviews_NaiveBayes.pickle')
>>> classifier.classify(feats)
'pos'
Classification Algos in NLTK


 Naive Bayes
 Maximum Entropy / Logistic Regression
 Decision Tree
 SVM (coming soon)
NLTK-Trainer

https://github.com/japerk/nltk-trainer
command line scripts
train custom models
analyze corpora
analyze models against corpora
Train a Sentiment Classifier
$ ./train_classifier.py movie_reviews --instances paras
loading movie_reviews
2 labels: ['neg', 'pos']
2000 training feats, 2000 testing feats
training NaiveBayes classifier
accuracy: 0.967000
neg precision: 1.000000
neg recall: 0.934000
neg f-measure: 0.965874
pos precision: 0.938086
pos recall: 1.000000
pos f-measure: 0.968054
dumping NaiveBayesClassifier to ~/nltk_data/classifiers/
movie_reviews_NaiveBayes.pickle
Notable Included Corpora

movie_reviews: pos & neg categorized IMDb reviews
treebank: tagged and parsed WSJ text
treebank_chunk: tagged and chunked WSJ text
brown: tagged & categorized english text
60 other corpora in many languages
Other NLTK Features

clustering
metrics
parsing
stemming
WordNet
... and a lot more
Other Python NLP Libraries


pattern: http://www.clips.ua.ac.be/pages/pattern
scikits.learn: http://scikit-learn.sourceforge.net/stable/
fuzzywuzzy: https://github.com/seatgeek/fuzzywuzzy
Learn More
http://www.nltk.org/
http://streamhacker.com
http://text-processing.com
nltk-users mailing list
NLTK Tutorial @ PyCon


What would you want to learn in 3 hours?
What kinds of NLP problems do you face at work?
What do you want to do with text?

More Related Content

What's hot

Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLPBill Liu
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)VenkateshMurugadas
 
Natural language processing
Natural language processingNatural language processing
Natural language processingKarenVacca
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesankit_ppt
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Jinpyo Lee
 
Text classification presentation
Text classification presentationText classification presentation
Text classification presentationMarijn van Zelst
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingVeenaSKumar2
 
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기CONNECT FOUNDATION
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - IntroductionChristian Perone
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Alia Hamwi
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
 
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Edureka!
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data AnalysisAndrew Henshaw
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Databricks
 

What's hot (20)

Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 
Text analysis using python
Text analysis using pythonText analysis using python
Text analysis using python
 
Word2 vec
Word2 vecWord2 vec
Word2 vec
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)
 
NLP PPT.pptx
NLP PPT.pptxNLP PPT.pptx
NLP PPT.pptx
 
Text classification presentation
Text classification presentationText classification presentation
Text classification presentation
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기
[부스트캠프 Tech Talk] 진명훈_datasets로 협업하기
 
Numpy Talk at SIAM
Numpy Talk at SIAMNumpy Talk at SIAM
Numpy Talk at SIAM
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
Natural Language Processing (NLP) & Text Mining Tutorial Using NLTK | NLP Tra...
 
pandas - Python Data Analysis
pandas - Python Data Analysispandas - Python Data Analysis
pandas - Python Data Analysis
 
Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0Introduction to TensorFlow 2.0
Introduction to TensorFlow 2.0
 

Similar to NLTK in 20 minutes

pa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text Processingpa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text ProcessingRodrigo Senra
 
Casting for not so strange actors
Casting for not so strange actorsCasting for not so strange actors
Casting for not so strange actorszucaritask
 
appengine java night #1
appengine java night #1appengine java night #1
appengine java night #1Shinichi Ogawa
 
支撐英雄聯盟戰績網的那條巨蟒
支撐英雄聯盟戰績網的那條巨蟒支撐英雄聯盟戰績網的那條巨蟒
支撐英雄聯盟戰績網的那條巨蟒Toki Kanno
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Fasihul Kabir
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Rebecca Bilbro
 
Categorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonCategorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonJanu Jahnavi
 
Beyond Breakpoints: Advanced Debugging with XCode
Beyond Breakpoints: Advanced Debugging with XCodeBeyond Breakpoints: Advanced Debugging with XCode
Beyond Breakpoints: Advanced Debugging with XCodeAijaz Ansari
 
Clojure for Java developers - Stockholm
Clojure for Java developers - StockholmClojure for Java developers - Stockholm
Clojure for Java developers - StockholmJan Kronquist
 
Pyconie 2012
Pyconie 2012Pyconie 2012
Pyconie 2012Yaqi Zhao
 
Natural Language Processing and Python
Natural Language Processing and PythonNatural Language Processing and Python
Natural Language Processing and Pythonanntp
 
Categorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonCategorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonJanu Jahnavi
 
Building and Distributing PostgreSQL Extensions Without Learning C
Building and Distributing PostgreSQL Extensions Without Learning CBuilding and Distributing PostgreSQL Extensions Without Learning C
Building and Distributing PostgreSQL Extensions Without Learning CDavid Wheeler
 
The (unknown) collections module
The (unknown) collections moduleThe (unknown) collections module
The (unknown) collections modulePablo Enfedaque
 
Thinking Inside the Container: A Continuous Delivery Story by Maxfield Stewart
Thinking Inside the Container: A Continuous Delivery Story by Maxfield Stewart Thinking Inside the Container: A Continuous Delivery Story by Maxfield Stewart
Thinking Inside the Container: A Continuous Delivery Story by Maxfield Stewart Docker, Inc.
 
Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...
Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...
Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...Eelco Visser
 
DBA だってもっと効率化したい!〜最近の自動化事情とOracle Database〜
DBA だってもっと効率化したい!〜最近の自動化事情とOracle Database〜DBA だってもっと効率化したい!〜最近の自動化事情とOracle Database〜
DBA だってもっと効率化したい!〜最近の自動化事情とOracle Database〜Michitoshi Yoshida
 

Similar to NLTK in 20 minutes (20)

Procesamiento del lenguaje natural con python
Procesamiento del lenguaje natural con pythonProcesamiento del lenguaje natural con python
Procesamiento del lenguaje natural con python
 
pa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text Processingpa-pe-pi-po-pure Python Text Processing
pa-pe-pi-po-pure Python Text Processing
 
Casting for not so strange actors
Casting for not so strange actorsCasting for not so strange actors
Casting for not so strange actors
 
appengine java night #1
appengine java night #1appengine java night #1
appengine java night #1
 
支撐英雄聯盟戰績網的那條巨蟒
支撐英雄聯盟戰績網的那條巨蟒支撐英雄聯盟戰績網的那條巨蟒
支撐英雄聯盟戰績網的那條巨蟒
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
 
Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)Building a Gigaword Corpus (PyCon 2017)
Building a Gigaword Corpus (PyCon 2017)
 
Categorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonCategorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk python
 
Poetic APIs
Poetic APIsPoetic APIs
Poetic APIs
 
Beyond Breakpoints: Advanced Debugging with XCode
Beyond Breakpoints: Advanced Debugging with XCodeBeyond Breakpoints: Advanced Debugging with XCode
Beyond Breakpoints: Advanced Debugging with XCode
 
Clojure for Java developers - Stockholm
Clojure for Java developers - StockholmClojure for Java developers - Stockholm
Clojure for Java developers - Stockholm
 
Pyconie 2012
Pyconie 2012Pyconie 2012
Pyconie 2012
 
Natural Language Processing and Python
Natural Language Processing and PythonNatural Language Processing and Python
Natural Language Processing and Python
 
Categorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk pythonCategorizing and pos tagging with nltk python
Categorizing and pos tagging with nltk python
 
NLTK introduction
NLTK introductionNLTK introduction
NLTK introduction
 
Building and Distributing PostgreSQL Extensions Without Learning C
Building and Distributing PostgreSQL Extensions Without Learning CBuilding and Distributing PostgreSQL Extensions Without Learning C
Building and Distributing PostgreSQL Extensions Without Learning C
 
The (unknown) collections module
The (unknown) collections moduleThe (unknown) collections module
The (unknown) collections module
 
Thinking Inside the Container: A Continuous Delivery Story by Maxfield Stewart
Thinking Inside the Container: A Continuous Delivery Story by Maxfield Stewart Thinking Inside the Container: A Continuous Delivery Story by Maxfield Stewart
Thinking Inside the Container: A Continuous Delivery Story by Maxfield Stewart
 
Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...
Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...
Model-Driven Software Development - Pretty-Printing, Editor Services, Term Re...
 
DBA だってもっと効率化したい!〜最近の自動化事情とOracle Database〜
DBA だってもっと効率化したい!〜最近の自動化事情とOracle Database〜DBA だってもっと効率化したい!〜最近の自動化事情とOracle Database〜
DBA だってもっと効率化したい!〜最近の自動化事情とOracle Database〜
 

Recently uploaded

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Recently uploaded (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

NLTK in 20 minutes

  • 1. NLTK in 20 minutes A sprint thru Python's Natural Language ToolKit
  • 2. Jacob Perkins Co-founder/CTO @ Weotta (we're hiring :) "Python Text Processing with NLTK 2.0 Cookbook" NLTK Contributor Blog: http://streamhacker.com NLTK Demos & APIs: http://text-processing.com @japerk
  • 3. Why Text Processing? sentiment analysis spam filtering plagariasm detection / document similarity document categorization / topic detection phrase extraction, summarization smarter search simple keyword frequency analysis
  • 4. Some NLTK Features sentence & word tokenization part-of-speech tagging chunking & named entity recognition text classification many included corpora
  • 5. Sentence Tokenization >>> from nltk.tokenize import sent_tokenize >>> sent_tokenize("Hello SF Python. This is NLTK.") ['Hello SF Python.', 'This is NLTK.'] >>> sent_tokenize("Hello, Mr. Anderson. We missed you!") ['Hello, Mr. Anderson.', 'We missed you!']
  • 6. Word Tokenization >>> from nltk.tokenize import word_tokenize >>> word_tokenize('This is NLTK.') ['This', 'is', 'NLTK', '.']
  • 7. What's a Word? >>> word_tokenize("What's up?") ['What', "'s", 'up', '?'] >>> from nltk.tokenize import wordpunct_tokenize >>> wordpunct_tokenize("What's up?") ['What', "'", 's', 'up', '?'] Learn More: http://text-processing.com/demo/tokenize/
  • 8. Part-of-Speech Tagging >>> words = word_tokenize("And now for something completely different") >>> from nltk.tag import pos_tag >>> pos_tag(words) [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')] Tags List: http://www.ling.upenn.edu/courses/ Fall_2003/ling001/penn_treebank_pos.html
  • 9. Why Part-of-Speech Tag? word definition lookup (WordNet, WordNik) fine-grained text analytics part-of-speech specific keyword analysis chunking & named entity recognition (NER)
  • 10. Chunking & NER >>> from nltk.chunk import ne_chunk >>> ne_chunk(pos_tag(word_tokenize('My name is Jacob Perkins.'))) Tree('S', [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), Tree('PERSON', [('Jacob', 'NNP'), ('Perkins', 'NNP')]), ('.', '.')])
  • 11. NER not perfect >>> ne_chunk(pos_tag(word_tokenize('San Francisco is foggy.'))) Tree('S', [Tree('GPE', [('San', 'NNP')]), Tree('PERSON', [('Francisco', 'NNP')]), ('is', 'VBZ'), ('foggy', 'NN'), ('.', '.')])
  • 12. Text Classification def bag_of_words(words): return dict([(word, True) for word in words]) >>> feats = bag_of_words(word_tokenize("great movie")) >>> import nltk.data >>> classifier = nltk.data.load('classifiers/ movie_reviews_NaiveBayes.pickle') >>> classifier.classify(feats) 'pos'
  • 13. Classification Algos in NLTK Naive Bayes Maximum Entropy / Logistic Regression Decision Tree SVM (coming soon)
  • 14. NLTK-Trainer https://github.com/japerk/nltk-trainer command line scripts train custom models analyze corpora analyze models against corpora
  • 15. Train a Sentiment Classifier $ ./train_classifier.py movie_reviews --instances paras loading movie_reviews 2 labels: ['neg', 'pos'] 2000 training feats, 2000 testing feats training NaiveBayes classifier accuracy: 0.967000 neg precision: 1.000000 neg recall: 0.934000 neg f-measure: 0.965874 pos precision: 0.938086 pos recall: 1.000000 pos f-measure: 0.968054 dumping NaiveBayesClassifier to ~/nltk_data/classifiers/ movie_reviews_NaiveBayes.pickle
  • 16. Notable Included Corpora movie_reviews: pos & neg categorized IMDb reviews treebank: tagged and parsed WSJ text treebank_chunk: tagged and chunked WSJ text brown: tagged & categorized english text 60 other corpora in many languages
  • 18. Other Python NLP Libraries pattern: http://www.clips.ua.ac.be/pages/pattern scikits.learn: http://scikit-learn.sourceforge.net/stable/ fuzzywuzzy: https://github.com/seatgeek/fuzzywuzzy
  • 20. NLTK Tutorial @ PyCon What would you want to learn in 3 hours? What kinds of NLP problems do you face at work? What do you want to do with text?

Editor's Notes

  1. \n
  2. \n
  3. text processing is very useful in a number of areas, and there's tons of unstructured text flooding the internet nowadays, and NLP/ML is one of the best ways to deal with it\n
  4. this is what I'll cover today, but there's a lot more I won't be covering\n
  5. loads a trained sentence tokenizer, then calls its tokenize() method. has sentence tokenizers for 16 languages. Smarter than just splitting on punctuation.\n
  6. loads a word tokenizer trained on treebank, then calls the tokenize() method\n
  7. non-ascii characters are also a problem for word_tokenize(). wordpunct_tokenize() can often be better, but you need to first decide what a word is for your specific case. do contractions matter? can you replace them with two words? Demo shows the results from 4 different tokenizers\n
  8. loads a pos tagger trained on treebank - first call will take a few seconds to load the pickle file off disk, every subsequent call will use in-memory tagger. can find tables of pos tag definitions online.\n
  9. pos tags might not be useful by themselves, but they are useful metadata for other NLP tasks like dictionary lookup, pos specific keyword analysis, and they are essential for chunking & NER\n
  10. every Tree has a draw() method that uses TKinter\n
  11. \n
  12. bag-of-words is the simplest model, but ignores frequency. good for small text, but frequency can be very important for larger documents. other algorithms, like SVM, create sparse arrays of 1 or 0 depending on word presence, but require knowning full vocabulary beforehand. this classifier is one I trained with nltk-trainer, and can be used for sentiment analysis because it's categories are "pos" and "neg".\n
  13. \n
  14. can train taggers, chunkers, and text classifiers, and is great for analyzing corpora and how a model performs against a labeled corpus. I use nltk-trainer to train all my models nowadays.\n
  15. this trains a very basic sentiment analysis classifier on the movie_reviews corpus, which has reviews categorized into pos or neg\n
  16. treebank is a very standard corpus for testing taggers and chunkers\n
  17. NLP isn't black magic, but you can treat it as a black box until the defaults aren't good enough. Then you need to dig in and learn how it works so you can make it do what you want. At that point, the best thing you can do is find/make good data, then use existing algos to learn from it.\n
  18. \n
  19. the original NLTK is very good, available for free online, but takes "textbook" approach. I tried to be a lot more practical in my cookbook. nltk-users mailing list is pretty active, and you can also try stackoverflow\n
  20. \n