Introduction to Natural Language Processing
Lecture 2. Tokenization and word counts
Ekaterina Chernyak, Dmitry Ilvovsky
echernyak@hse.ru, dilvovsky@hse.ru
National Research University
Higher School of Economics (Moscow)
July 13, 2015
How many words?
“The rain in Spain stays mainly in the plain.”
9 tokens: The, rain, in, Spain, stays, mainly, in, the, plain
7 (or 8) types: The/the, rain, in, Spain, stays, mainly, plain
Type and token
Type is an element of the vocabulary.
Token is an instance of that type in the text.
N = number of tokens
V = vocabulary (i.e. the set of all types)
|V| = size of the vocabulary (i.e. number of types)
How are N and |V| related?
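As a minimal sketch, the counts for the example sentence can be reproduced with Python's standard library. The letters-only pattern and the variable names are assumptions made for illustration; whether “The” and “the” count as one type depends on whether case is folded.

```python
import re

sentence = "The rain in Spain stays mainly in the plain."

# Assumption: a token is a maximal run of ASCII letters.
tokens = re.findall(r"[A-Za-z]+", sentence)

n_tokens = len(tokens)                      # N
types_cased = set(tokens)                   # "The" and "the" kept apart
types_lower = {t.lower() for t in tokens}   # "The" and "the" merged

print(n_tokens, len(types_cased), len(types_lower))
```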
Chernyak, Ilvovsky (HSE) Intro to NLP July 13, 2015 2 / 16
Zipf’s law
Zipf’s law [Gelbukh, Sidorov, 2001]
In any large enough text, the frequency of a type is inversely
proportional to its frequency rank (ranks start from the most frequent type):
$f \propto \frac{1}{r}$
f – frequency of a type
r – rank of a type (its position in the list of all types sorted by
decreasing frequency of occurrence)
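The rank and frequency defined above can be tabulated with a few lines of Python. This is only a sketch: the helper name `rank_frequency` and the toy token list are hypothetical, and a text this small is far too short to actually exhibit Zipf's law.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, type, frequency) triples, most frequent type first."""
    counts = Counter(tokens)
    return [(rank, typ, freq)
            for rank, (typ, freq) in enumerate(counts.most_common(), start=1)]

# Toy input; replace with the tokens of a large text.
tokens = "a rose is a rose is a rose".split()
for rank, typ, freq in rank_frequency(tokens):
    # Under Zipf's law, freq * rank stays roughly constant in large texts.
    print(rank, typ, freq, freq * rank)
```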
Zipf’s law: example
Heaps’ law
Heaps’ law [Gelbukh, Sidorov, 2001]
The number of different types in a text is roughly proportional to a
power of its size:
$|V| = K \cdot N^{b}$
N = number of tokens
|V| = size of vocabulary (i.e. number of types)
K, b – free parameters, K ∈ [10, 100], b ∈ [0.4, 0.6]
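A short sketch of how the formula might be used: one helper evaluates the Heaps estimate for given K and b, the other records how the observed vocabulary grows token by token so the two can be compared. Both function names and the parameter values in the example call are assumptions chosen for illustration, not fitted values.

```python
def heaps_estimate(n_tokens, k, b):
    """Vocabulary size predicted by Heaps' law: |V| = K * N**b."""
    return k * n_tokens ** b

def vocabulary_growth(tokens):
    """Observed |V| after reading each successive token."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

# Hypothetical parameters taken from the ranges given above.
print(heaps_estimate(10_000, k=10, b=0.5))
print(vocabulary_growth("to be or not to be".split()))
```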
Why is tokenization difficult?
Easy example: “Good muffins cost $3.88 in New York. Please buy me
two of them. Thanks.”
Is “.” a token?
Is “$3.88” a single token?
Is “New York” a single token?
Real data may contain noise: code, markup, URLs, faulty
punctuation
Real data contains misspellings: “an dthen she aksed”
A period “.” does not always mark the end of a sentence: m.p.h., PhD.
Nevertheless, tokenization is important for all subsequent text processing steps.
There are rule-based and machine learning-based approaches to
developing tokenizers.
Rule-based tokenization
For example, define a token as a sequence of upper- and lower-case letters:
[A-Za-z]+. Regular expressions are a convenient tool for programming such rules.
RE in Python
In[1]: import re
In[2]: prog = re.compile(r'[A-Za-z]+')
In[3]: prog.findall("Words, words, words.")
Out[3]: ['Words', 'words', 'words']
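To see where the letters-only rule falls short, it can be run on the muffins sentence from the previous slide and compared with a slightly richer rule. The second pattern is an assumption chosen for illustration (words, dollar amounts, or any other run of non-space characters); it is one possible refinement, not the only one.

```python
import re

s = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."

# Rule 1: a token is a run of letters (prices and punctuation are dropped).
letters_only = re.compile(r"[A-Za-z]+")

# Rule 2 (assumed refinement): a word, a dollar amount, or any other
# run of non-whitespace characters, tried in that order.
refined = re.compile(r"\w+|\$[\d.]+|\S+")

letter_tokens = letters_only.findall(s)
refined_tokens = refined.findall(s)

print(letter_tokens)
print(refined_tokens)
```

With the refined rule the price survives as a single token and each period becomes a token of its own, while “New York” is still split in two.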
Sentence segmentation (1)
What are the sentence boundaries?
“?” and “!” are usually unambiguous
The period “.” is an issue
Direct speech is also an issue: She said, “What time will you be
home?” and I said, “I don’t know!”. Even worse in Russian!
Let us learn a classifier for sentence segmentation.
Binary classifier
A binary classifier f : X → {0, 1} takes input data X (a set of sentences)
and decides EndOfSentence (0) or NotEndOfSentence (1).
Sentence segmentation (2)
What can be the features for classification? I am a period, am I
EndOfSentence?
Lots of blanks after me?
Lots of lower case letters and ? or ! after me?
Do I belong to an abbreviation?
etc.
We need a lot of hand-annotated data.
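The questions above can be turned into a feature dictionary for each period, which a classifier could then consume. The sketch below is only one way to do it: the function name, the feature names, and the tiny abbreviation list are all hypothetical, and no classifier is trained here.

```python
# Hypothetical toy abbreviation list; a real one would be much larger.
ABBREVIATIONS = {"m.p.h.", "ph.d.", "mr.", "dr."}

def period_features(text, i, abbreviations=ABBREVIATIONS):
    """Describe the context of the period located at index i of text."""
    if text[i] != ".":
        raise ValueError("text[i] must be a period")
    rest = text[i + 1:]
    stripped = rest.lstrip(" ")
    # The space-delimited word that contains this period.
    start = text.rfind(" ", 0, i) + 1
    end = text.find(" ", i)
    end = len(text) if end == -1 else end
    word = text[start:end]
    return {
        "blanks_after": len(rest) - len(stripped),
        "next_is_upper": stripped[:1].isupper(),
        "is_last_char": stripped == "",
        "in_abbreviation": word.lower() in abbreviations,
    }

text = "He drove at 60 m.p.h. today. Then he stopped."
print(period_features(text, text.index("today.") + 5))
```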
Do we need to program this?
No! There is the Natural Language Toolkit (NLTK) for everything.
NLTK tokenizers
In[1]: from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize
In[2]: s = 'Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.'
In[3]: tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
In[4]: tokenizer.tokenize(s)
In[5]: wordpunct_tokenize(s)
Learning to tokenize
nltk.tokenize.punkt is a tool for learning to split text into sentences from your data.
It includes a pre-trained Punkt tokenizer for English.
Punkt tokenizer
In[1]: import nltk.data
In[2]: sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
In[3]: sent_detector.tokenize(s)
Exercise 1.1 Word counts
Now it is time for some programming!
Exercise 1.1
Input: Alice in Wonderland (alice.txt) or your text
Output 1: number of tokens
Output 2: number of types
Use nltk.FreqDist() to count types. nltk.FreqDist() is a frequency
dictionary: {key: frequency(key)}.
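One possible starting point for the exercise, sketched with the standard library only: `collections.Counter` stands in for `nltk.FreqDist` here, and the letters-only tokenizer, the lower-casing, and the function name are assumptions. To work on a file, read its contents into `text` first.

```python
import re
from collections import Counter

def count_tokens_and_types(text):
    """Return (number of tokens, number of types, frequency dictionary)."""
    # Assumption: tokens are lower-cased runs of ASCII letters.
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    freq = Counter(tokens)                  # {type: frequency(type)}
    return sum(freq.values()), len(freq), freq

# Toy input; for the exercise, read alice.txt (or your own text) instead.
text = "The rain in Spain stays mainly in the plain."
n_tokens, n_types, freq = count_tokens_and_types(text)
print(n_tokens, n_types)
```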
Lemmatization (Normalization)
Each word has a base form:
has, had, have ⇒ have
cats, cat, cat’s ⇒ cat
Windows ⇒ window or Windows?
Lemmatization [Jurafsky, Martin, 1999]
Lemmatization (or normalization) is used to reduce inflections or variant
forms to base forms. A dictionary with headwords is required.
Lemmatization
In[1]: from nltk.stem import WordNetLemmatizer
In[2]: lemmatizer = WordNetLemmatizer()
In[3]: lemmatizer.lemmatize(t)  # t is a single token (a string)
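To make the role of the headword dictionary concrete, here is a minimal lookup-based sketch. The dictionary contents and the function name are hypothetical and cover only the examples above; this is not how WordNetLemmatizer works internally.

```python
# Hypothetical toy dictionary mapping word forms to headwords.
LEMMA_DICT = {
    "has": "have", "had": "have", "have": "have",
    "cats": "cat", "cat": "cat", "cat's": "cat",
}

def lemmatize(token, lemma_dict=LEMMA_DICT):
    """Look the token up in the dictionary; unknown tokens are returned unchanged."""
    return lemma_dict.get(token.lower(), token)

for word in ["has", "had", "cats", "cat's", "Windows"]:
    print(word, "->", lemmatize(word))
```

Leaving unknown words untouched sidesteps the “Windows” question rather than answering it: deciding between the product name and the plural noun needs context that a plain lookup does not have.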
Stemming
A word is built from morphemes: word = stem + affixes. Sometimes we do
not need the affixes.
translate, translation, translator ⇒ translat
Stemming [Jurafsky, Martin, 1999]
Stemming reduces terms to their stems; it is used in information retrieval and text classification.
Porter’s algorithm is the most common English stemmer.
Stemming
In[1]: from nltk.stem.porter import PorterStemmer
In[2]: stemmer = PorterStemmer()
In[3]: stemmer.stem(t)  # t is a single token (a string)
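The idea of stripping affixes can be illustrated with a deliberately naive suffix stripper. This is not Porter's algorithm: the suffix list, the minimum stem length, and the function name are assumptions picked so that the three example words above collapse to the same stem.

```python
# Hypothetical suffix list; far smaller than any real stemmer's rule set.
SUFFIXES = ("ion", "or", "e")

def toy_stem(word, suffixes=SUFFIXES, min_stem_len=3):
    """Strip the longest matching suffix if enough of the word remains."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["translate", "translation", "translator"]:
    print(word, "->", toy_stem(word))
```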
Exercise 1.2 Word counts (continued)
Exercise 1.2
Input: Alice in Wonderland (alice.txt) or your text
Output 1: 20 most common lemmata
Output 2: 20 most common stems
Use FreqDist() to count lemmata and stems. Use
FreqDist().most_common() to find the most common lemmata and stems.
Exercise 1.3 Do we need all words?
A stopword is a word that carries little meaning on its own: prepositions, conjunctions, pronouns,
articles, etc.
Stopwords
In[1]: from nltk.corpus import stopwords
In[2]: print(stopwords.words('english'))
Exercise 1.3
Input: Alice in Wonderland (alice.txt) or your text
Output 1: 20 most common lemmata without stop words
Use the not in operator to exclude stopwords in a loop.
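A sketch of the filtering step using only the standard library: `collections.Counter` stands in for `FreqDist`, and the small stopword set and the function name are hypothetical placeholders for the NLTK stopword list. Lemmatization is left out; the tokens passed in are assumed to be lemmata already.

```python
from collections import Counter

# Hypothetical toy stopword set; the NLTK list is much longer.
STOPWORDS = {"the", "in", "a", "of", "and"}

def most_common_content_words(tokens, stopwords=STOPWORDS, n=20):
    """Count the tokens that are not stopwords and return the n most common."""
    kept = []
    for token in tokens:
        if token.lower() not in stopwords:
            kept.append(token.lower())
    return Counter(kept).most_common(n)

tokens = "the cat and the hat and the cat".split()
print(most_common_content_words(tokens))
```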