NLTK in 20 minutes
A sprint thru Python's Natural Language ToolKit
Jacob Perkins

Co-founder/CTO @ Weotta (we're hiring :)
"Python Text Processing with NLTK 2.0 Cookbook"
NLTK Contributor
Blog: http://streamhacker.com
NLTK Demos & APIs: http://text-processing.com
@japerk
Why Text Processing?
sentiment analysis
spam filtering
plagiarism detection / document similarity
document categorization / topic detection
phrase extraction, summarization
smarter search
simple keyword frequency analysis
Some NLTK Features

sentence & word tokenization
part-of-speech tagging
chunking & named entity recognition
text classification
many included corpora
Sentence Tokenization
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("Hello SF Python. This is NLTK.")
['Hello SF Python.', 'This is NLTK.']

>>> sent_tokenize("Hello, Mr. Anderson. We missed you!")
['Hello, Mr. Anderson.', 'We missed you!']
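
For contrast, here is a minimal standard-library sketch of what plain punctuation splitting does with the same input. The helper name `naive_sent_split` is made up for this illustration and is not part of NLTK; it only shows why a trained sentence tokenizer is worth having.

```python
import re

def naive_sent_split(text):
    """Hypothetical helper: split after '.', '!' or '?' followed by whitespace.

    It knows nothing about abbreviations, so "Mr." ends a sentence.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())

naive = naive_sent_split("Hello, Mr. Anderson. We missed you!")
print(naive)  # ['Hello, Mr.', 'Anderson.', 'We missed you!']
```

The naive version breaks after "Mr.", while sent_tokenize keeps 'Hello, Mr. Anderson.' together.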
Word Tokenization
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('This is NLTK.')
['This', 'is', 'NLTK', '.']
What's a Word?
>>> word_tokenize("What's up?")
['What', "'s", 'up', '?']
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize("What's up?")
['What', "'", 's', 'up', '?']




Learn More: http://text-processing.com/demo/tokenize/
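
The wordpunct_tokenize() result above can be approximated with a single regular expression. The sketch below uses only the standard library; `simple_wordpunct` is a hypothetical name, and the pattern is an assumption about a reasonable word/punctuation split rather than a statement about NLTK's internals.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def simple_wordpunct(text):
    """Hypothetical helper: split text into word and punctuation tokens."""
    return WORD_OR_PUNCT.findall(text)

tokens = simple_wordpunct("What's up?")
print(tokens)  # ['What', "'", 's', 'up', '?']
```

Whether the apostrophe should be its own token, stay attached as "'s", or disappear entirely depends on what counts as a word for your task.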
Part-of-Speech Tagging
>>> words = word_tokenize("And now for something completely different")
>>> from nltk.tag import pos_tag
>>> pos_tag(words)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
 ('completely', 'RB'), ('different', 'JJ')]

Tags List: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Why Part-of-Speech Tag?

word definition lookup (WordNet, WordNik)
fine-grained text analytics
part-of-speech specific keyword analysis
chunking & named entity recognition (NER)
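
As a rough illustration of part-of-speech specific keyword analysis, the sketch below filters already-tagged words by tag prefix. The tagged list is copied from the pos_tag() output on the earlier slide, and `words_with_tags` is a hypothetical helper, not an NLTK function.

```python
# (word, tag) pairs as returned by pos_tag() on the earlier slide.
tagged = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
          ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

def words_with_tags(tagged_words, prefixes):
    """Hypothetical helper: keep words whose tag starts with any given prefix."""
    return [word for word, tag in tagged_words if tag.startswith(prefixes)]

# Penn Treebank noun tags start with NN, adjective tags with JJ.
keywords = words_with_tags(tagged, ("NN", "JJ"))
print(keywords)  # ['something', 'different']
```

Matching on prefixes rather than exact tags picks up variants such as NNS or JJR without listing each one.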
Chunking & NER
>>> from nltk.chunk import ne_chunk
>>> ne_chunk(pos_tag(word_tokenize('My name is Jacob Perkins.')))
Tree('S', [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'),
Tree('PERSON', [('Jacob', 'NNP'), ('Perkins', 'NNP')]),
('.', '.')])
NER not perfect
>>> ne_chunk(pos_tag(word_tokenize('San Francisco is foggy.')))
Tree('S', [Tree('GPE', [('San', 'NNP')]), Tree('PERSON',
[('Francisco', 'NNP')]), ('is', 'VBZ'), ('foggy', 'NN'),
('.', '.')])
Text Classification
def bag_of_words(words):
    return dict([(word, True) for word in words])

>>> feats = bag_of_words(word_tokenize("great movie"))
>>> import nltk.data
>>> classifier = nltk.data.load('classifiers/movie_reviews_NaiveBayes.pickle')
>>> classifier.classify(feats)
'pos'
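
The bag_of_words() features above record only whether a word is present. For longer documents, where frequency may matter, one possible variant keeps counts instead. This is a standard-library sketch; `bag_of_word_counts` is a hypothetical name, and whether a particular classifier makes good use of count features is a separate question not covered here.

```python
from collections import Counter

def bag_of_words(words):
    """Presence features, as on the slide: every word maps to True."""
    return {word: True for word in words}

def bag_of_word_counts(words):
    """Hypothetical variant: every word maps to how often it occurs."""
    return dict(Counter(words))

words = "great movie with a great cast".split()
presence = bag_of_words(words)
counts = bag_of_word_counts(words)
print(presence["great"], counts["great"])  # True 2
```

Both functions return plain dicts keyed by word, so the rest of the pipeline sees the same feature-dict shape either way.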
Classification Algos in NLTK


 Naive Bayes
 Maximum Entropy / Logistic Regression
 Decision Tree
 SVM (coming soon)
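
Algorithms such as SVM typically work on fixed-length vectors of 1s and 0s marking word presence, which means the full vocabulary has to be known up front. The sketch below shows that encoding with the standard library only; `binary_vector` and the four-word vocabulary are made up for illustration.

```python
def binary_vector(words, vocabulary):
    """Hypothetical helper: 1 if the vocabulary term occurs in words, else 0."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]

# Assumed toy vocabulary; in practice it would be built from the training corpus.
vocabulary = sorted({"great", "movie", "terrible", "plot"})
vector = binary_vector(["great", "movie"], vocabulary)
print(vocabulary)  # ['great', 'movie', 'plot', 'terrible']
print(vector)      # [1, 1, 0, 0]
```

Words outside the vocabulary are silently dropped, which is the practical cost of fixing the vocabulary beforehand.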
NLTK-Trainer

https://github.com/japerk/nltk-trainer
command line scripts
train custom models
analyze corpora
analyze models against corpora
Train a Sentiment Classifier
$ ./train_classifier.py movie_reviews --instances paras
loading movie_reviews
2 labels: ['neg', 'pos']
2000 training feats, 2000 testing feats
training NaiveBayes classifier
accuracy: 0.967000
neg precision: 1.000000
neg recall: 0.934000
neg f-measure: 0.965874
pos precision: 0.938086
pos recall: 1.000000
pos f-measure: 0.968054
dumping NaiveBayesClassifier to ~/nltk_data/classifiers/movie_reviews_NaiveBayes.pickle
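
The reported metrics can be reproduced from a confusion matrix. The counts below are an assumption: the output does not print them, but 1000 test instances per label, with 66 neg reviews classified as pos and no pos reviews classified as neg, is consistent with every number shown. The helper `precision_recall_f` is a hypothetical name, not part of nltk-trainer.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure (harmonic mean) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Assumed counts, inferred from the reported metrics (not printed by the script).
neg_correct, neg_as_pos = 934, 66    # true neg reviews
pos_correct, pos_as_neg = 1000, 0    # true pos reviews

accuracy = (neg_correct + pos_correct) / 2000
neg = precision_recall_f(tp=neg_correct, fp=pos_as_neg, fn=neg_as_pos)
pos = precision_recall_f(tp=pos_correct, fp=neg_as_pos, fn=pos_as_neg)

print("accuracy: %.6f" % accuracy)
print("neg precision/recall/f-measure: %.6f %.6f %.6f" % neg)
print("pos precision/recall/f-measure: %.6f %.6f %.6f" % pos)
```

Under these assumed counts, every error is a neg review labeled pos, which is why neg precision and pos recall both come out at 1.0.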
Notable Included Corpora

movie_reviews: pos & neg categorized IMDb reviews
treebank: tagged and parsed WSJ text
treebank_chunk: tagged and chunked WSJ text
brown: tagged & categorized English text
60 other corpora in many languages
Other NLTK Features

clustering
metrics
parsing
stemming
WordNet
... and a lot more
Other Python NLP Libraries


pattern: http://www.clips.ua.ac.be/pages/pattern
scikits.learn: http://scikit-learn.sourceforge.net/stable/
fuzzywuzzy: https://github.com/seatgeek/fuzzywuzzy
Learn More
http://www.nltk.org/
http://streamhacker.com
http://text-processing.com
nltk-users mailing list
NLTK Tutorial @ PyCon


What would you want to learn in 3 hours?
What kinds of NLP problems do you face at work?
What do you want to do with text?


Editor's Notes

  3. text processing is very useful in a number of areas, and there's tons of unstructured text flooding the internet nowadays, and NLP/ML is one of the best ways to deal with it
  4. this is what I'll cover today, but there's a lot more I won't be covering
  5. loads a trained sentence tokenizer, then calls its tokenize() method. has sentence tokenizers for 16 languages. Smarter than just splitting on punctuation.
  6. loads a word tokenizer trained on treebank, then calls the tokenize() method
  7. non-ascii characters are also a problem for word_tokenize(). wordpunct_tokenize() can often be better, but you need to first decide what a word is for your specific case. do contractions matter? can you replace them with two words? Demo shows the results from 4 different tokenizers
  8. loads a pos tagger trained on treebank - first call will take a few seconds to load the pickle file off disk, every subsequent call will use in-memory tagger. can find tables of pos tag definitions online.
  9. pos tags might not be useful by themselves, but they are useful metadata for other NLP tasks like dictionary lookup, pos specific keyword analysis, and they are essential for chunking & NER
  10. every Tree has a draw() method that uses TKinter
  12. bag-of-words is the simplest model, but ignores frequency. good for small text, but frequency can be very important for larger documents. other algorithms, like SVM, create sparse arrays of 1 or 0 depending on word presence, but require knowing the full vocabulary beforehand. this classifier is one I trained with nltk-trainer, and can be used for sentiment analysis because its categories are "pos" and "neg".
  14. can train taggers, chunkers, and text classifiers, and is great for analyzing corpora and how a model performs against a labeled corpus. I use nltk-trainer to train all my models nowadays.
  15. this trains a very basic sentiment analysis classifier on the movie_reviews corpus, which has reviews categorized into pos or neg
  16. treebank is a very standard corpus for testing taggers and chunkers
  17. NLP isn't black magic, but you can treat it as a black box until the defaults aren't good enough. Then you need to dig in and learn how it works so you can make it do what you want. At that point, the best thing you can do is find/make good data, then use existing algos to learn from it.
  19. the original NLTK book is very good, available for free online, but takes a "textbook" approach. I tried to be a lot more practical in my cookbook. nltk-users mailing list is pretty active, and you can also try stackoverflow