1. QBRC Tech Talk
Thomas Nate Person
Clinical Sciences
PROSPR Center and QBRC
May 6, 2013
NLP: Natural Language Processing
3. 3
What is NLP?
Not:
Natural Language Programming (NLP)
Neuro-Linguistic Programming (NLP)
“Natural Language Processing (NLP) is a field of
computer science, artificial intelligence, and
linguistics concerned with the interactions between
computers and human (natural) languages.”
-Wikipedia
4. 4
What can’t it do?
Extract information not understandable or
discernible by “you”.
Extract deeper meaning.
Replace regular expression pattern matching.
5. 5
Basics of NLP
Large research field
From Speech Recognition to Optical Character Recognition
Examples:
Watson (Jeopardy)
Cleverbot
Siri/Dragon Speak
Captcha
I am only concerned with Information Extraction (IE):
Sentence detection
Part of Speech (POS) tagging
(nouns, verbs, adverbs)
Named-entity recognition (NER)
(names, organizations, locations)
Lemmatisation
(Walk, walked, walks, walking)
Relationship extraction
All possible word relationships
Parsing
Determining most probable word relationships
Coreference
Linking of references between multiple sentences
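Sentence detection, the first IE step listed above, is harder than splitting on periods, because abbreviations such as "Nov." also end in a period. The following is a minimal sketch in plain Python, not the method of any toolkit named in this talk; the function name `split_sentences` and the tiny abbreviation list are assumptions made for illustration.

```python
# Hypothetical, deliberately tiny abbreviation list (an assumption).
ABBREVIATIONS = {"Nov.", "Dr.", "Mr.", "Mrs."}

def split_sentences(text):
    """Split text into sentences at tokens ending in . ! or ?

    Tokens found in ABBREVIATIONS do not end a sentence.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences

TEXT = ("Pierre Vinken will join the board Nov. 29. "
        "The polyps were 30mm in size.")
sentences = split_sentences(TEXT)
for s in sentences:
    print(s)
```

Real sentence detectors are trained on annotated text rather than driven by a hand-written list, so this only shows the shape of the problem.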
6. 6
What’s the point of all that?
Help categorize unstructured text into a more
structured format so that discrete information can
more easily be extracted.
7. 7
NLP Information Extraction Example
“Pierre Vinken, 61 years old, will join the board as a
nonexecutive director Nov. 29.”
8. 8
NLP Information Extraction Example
POS (Part of Speech) Tagging
“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”
Pierre/NNP
Vinken/NNP
,/,
61/CD
years/NNS
old/JJ
,/,
will/MD
join/VB
the/DT
board/NN
as/IN
a/DT
nonexecutive/JJ
director/NN
Nov./NNP
29/CD
./.
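The word/TAG pairs above are easy to consume programmatically. As a minimal sketch in plain Python (not tied to any toolkit; the helper name `parse_tagged` is hypothetical), the tagged tokens can be split on the last slash and then filtered by tag:

```python
# Tagged output copied from the slide, one word/TAG pair per token.
TAGGED = ("Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD "
          "join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN "
          "Nov./NNP 29/CD ./.")

def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) tuples.

    rpartition splits on the LAST slash, so a word that itself
    contains a slash still keeps its tag intact.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(TAGGED)
proper_nouns = [word for word, tag in pairs if tag == "NNP"]
numbers = [word for word, tag in pairs if tag == "CD"]
print(proper_nouns)  # ['Pierre', 'Vinken', 'Nov.']
print(numbers)       # ['61', '29']
```

Filtering on CD (cardinal number) is the kind of step that later narrows a search for polyp counts and sizes.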
9. Penn Treebank Tagset
CC    Coordinating conjunction (e.g. and, but, or)
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal (e.g. can, could, might, may)
NN    Noun, singular or mass
NNP   Proper noun, singular
NNPS  Proper noun, plural
NNS   Noun, plural
PDT   Predeterminer (e.g. all, both, when they precede an article)
POS   Possessive ending (e.g. nouns ending in 's)
PRP   Personal pronoun (e.g. I, me, you, he)
PRP$  Possessive pronoun (e.g. my, your, mine, yours)
RB    Adverb (most words that end in -ly, as well as degree words like quite, too and very)
RBR   Adverb, comparative (adverbs with the comparative ending -er, with a strictly comparative meaning)
RBS   Adverb, superlative
RP    Particle
SYM   Symbol (used for mathematical, scientific or technical symbols)
TO    to
UH    Interjection (e.g. uh, well, yes, my)
VB    Verb, base form (subsumes imperatives, infinitives and subjunctives)
VBD   Verb, past tense (includes the conditional form of the verb to be)
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner (e.g. which, and that when it is used as a relative pronoun)
WP    Wh-pronoun (e.g. what, who, whom)
WP$   Possessive wh-pronoun (e.g. whose)
WRB   Wh-adverb (e.g. how, where, why)
11. 11
POS Parse Tree
“Pierre Vinken, 61 years old, will join the board as a nonexecutive director
Nov. 29.”
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken))
             (, ,)
             (ADJP (NML (CD 61) (NNS years))
                   (JJ old))
             (, ,))
     (VP (MD will)
         (VP (VB join)
             (NP (DT the) (NN board))
             (PP-CLR (IN as)
                     (NP (DT a) (JJ nonexecutive) (NN director)))
             (NP-TMP (NNP Nov.) (CD 29))))
     (. .)))
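A bracketed tree like the one above can be read back into a nested structure with a short recursive parser. This is a sketch in plain Python under the assumption that the tree is well formed; the names `parse_tree`, `leaves` and `phrases` are hypothetical helpers, not part of any toolkit.

```python
import re

TREE = """( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,)
 (ADJP (NML (CD 61) (NNS years)) (JJ old)) (, ,))
 (VP (MD will) (VP (VB join) (NP (DT the) (NN board))
 (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
 (NP-TMP (NNP Nov.) (CD 29)))) (. .)))"""

def parse_tree(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return items

    return node()

def leaves(tree):
    """Return the words of a (sub)tree in left-to-right order."""
    # A preterminal is [tag, word]: exactly two strings.
    if len(tree) == 2 and all(isinstance(x, str) for x in tree):
        return [tree[1]]
    words = []
    for child in tree:
        if isinstance(child, list):
            words.extend(leaves(child))
    return words

def phrases(tree, label):
    """Return the text of every constituent whose label equals `label`."""
    found = []
    if tree and tree[0] == label:
        found.append(" ".join(leaves(tree)))
    for child in tree:
        if isinstance(child, list):
            found.extend(phrases(child, label))
    return found

tree = parse_tree(TREE)
print(phrases(tree, "NP"))
# ['Pierre Vinken', 'the board', 'a nonexecutive director']
```

Only nodes labelled exactly NP are returned; NP-SBJ and NP-TMP carry function suffixes and would need a prefix match instead.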
13. 13
NLP Information Extraction Example
NER (Named Entity Recognition) Tagging
“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”
Pierre/NNP/PERSON
Vinken/NNP/PERSON
,/,/O
61/CD/DURATION
years/NNS/NUMBER
old/JJ/DURATION
,/,/O
will/MD/O
join/VB/O
the/DT/O
board/NN/O
as/IN/O
a/DT/O
nonexecutive/JJ/O
director/NN/O
Nov./NNP/DATE
29/CD/DATE
././O
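Token-level NER labels become more useful once adjacent tokens with the same label are merged into one entity. The sketch below does this in plain Python with `itertools.groupby`; the labels are copied from the slide as shown, and the function name `entities` is a hypothetical helper.

```python
from itertools import groupby

# (word, NER label) pairs copied from the slide; "O" means "no entity".
TOKENS = [
    ("Pierre", "PERSON"), ("Vinken", "PERSON"), (",", "O"),
    ("61", "DURATION"), ("years", "NUMBER"), ("old", "DURATION"),
    (",", "O"), ("will", "O"), ("join", "O"), ("the", "O"),
    ("board", "O"), ("as", "O"), ("a", "O"), ("nonexecutive", "O"),
    ("director", "O"), ("Nov.", "DATE"), ("29", "DATE"), (".", "O"),
]

def entities(tokens):
    """Merge runs of consecutive tokens that share a non-O label."""
    spans = []
    for label, group in groupby(tokens, key=lambda pair: pair[1]):
        if label != "O":
            text = " ".join(word for word, _ in group)
            spans.append((text, label))
    return spans

for text, label in entities(TOKENS):
    print(label, text)
```

With these labels, "Pierre Vinken" and "Nov. 29" each come out as a single entity, while "61", "years" and "old" stay separate because their labels alternate.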
14. 14
NLP Information Extraction Example
Lemmatisation
“Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.”
Pierre/NNP/PERSON [Pierre]
Vinken/NNP/PERSON [Vinken]
,/,/O [,]
61/CD/DURATION [61]
years/NNS/NUMBER [year]
old/JJ/DURATION [old]
,/,/O [,]
will/MD/O [will]
join/VB/O [join]
the/DT/O [the]
board/NN/O [board]
as/IN/O [as]
a/DT/O [a]
nonexecutive/JJ/O [nonexecutive]
director/NN/O [director]
Nov./NNP/DATE [Nov.]
29/CD/DATE [29]
././O [.]
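To illustrate why lemmatisation runs after POS tagging, here is a deliberately naive sketch in plain Python that strips a suffix chosen by the Penn Treebank tag. It is an assumption-laden toy, not how the toolkits in this talk work: real lemmatisers use dictionaries and handle irregular forms, and this one would turn "running" into "runn". The function name `naive_lemma` is hypothetical.

```python
def naive_lemma(word, tag):
    """Return a crude lemma for `word` given its Penn Treebank tag."""
    # Keep the capitalisation of proper nouns, lowercase everything else.
    w = word if tag.startswith("NNP") else word.lower()
    if tag == "NNS" and w.endswith("s"):            # plural noun
        return w[:-1]
    if tag in ("VBD", "VBN") and w.endswith("ed"):  # past tense / participle
        return w[:-2]
    if tag == "VBG" and w.endswith("ing"):          # gerund
        return w[:-3]
    if tag == "VBZ" and w.endswith("s"):            # 3rd person singular
        return w[:-1]
    return w

SAMPLES = [("Walk", "VB"), ("walked", "VBD"), ("walks", "VBZ"),
           ("walking", "VBG"), ("years", "NNS"), ("Pierre", "NNP")]
lemmas = [naive_lemma(word, tag) for word, tag in SAMPLES]
print(lemmas)  # ['walk', 'walk', 'walk', 'walk', 'year', 'Pierre']
```

The tag decides which suffix rule applies: "walks" tagged VBZ loses its "s", and a word ending in "s" with any other tag is left untouched.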
17. 17
NLP Toolkits
41 different toolkits listed in Wikipedia
Four of the more popular free and open source (FOSS) IE toolkits:
Name                                              Language  License             Creators
OpenNLP                                           Java      Apache License 2.0  Online community
General Architecture for Text Engineering (GATE)  Java      LGPL                GATE open source community
Natural Language Toolkit (NLTK)                   Python    Apache 2.0          Team NLTK
Stanford NLP                                      Java      GPL                 The Stanford Natural Language Processing Group
19. 19
NLP Toolkits
General Architecture for Text Engineering (GATE)
Extensive publications
Integrated Development Environment (IDE) to assist in
development
Java
Java Annotation Patterns Engine (JAPE)
20. 20
NLP Toolkits
Natural Language Toolkit (NLTK)
Extensive publications
Two published books, from O’Reilly and Packt
21. 21
NLP Toolkits
Stanford Core NLP
Extensive publications
Wrappers for Perl, Python, Ruby, and Scala languages
Plugins for GATE and NLTK
22. 22
Questions from PROSPR to answer
From the hand-typed colonoscopy report:
How many polyps?
Location of polyps?
Size of polyps?
23. 23
Sample Workflow
Report Definition
Report Sectionization
Formatting the Text
Process the Section
Further analysis
24. Report Example
Gastroenterology Laboratory
Patient Name: Susan Storm Richards
Procedure Date: 5/06/2013 15:00:15 PM
MRN: 123456789
Age: 60
Accession #: 123456
Gender: Female
Order #: 123456789
Ethnicity:
Attending MD: Victor Von Doom MD
Note Status: Finalized
Room: 666
Procedure: Colonoscopy
Referring MD: Reed Richards
Providers: Victor von Doom, MD (Doctor)
Attending Participation: I personally performed the entire procedure.
Medicines: SomeDrug 3 mg IV, OtherDrug 75 micrograms IV
Indications: Screening for colorectal malignant neoplasm
Complications: No immediate complications.
Patient Profile: Refer to note in patient chart for documentation of history and
physical.
Procedure: Pre-Anesthesia Assessment:
- Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum. ASA Grade Assessment: II - Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint
occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint
occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.
Findings: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Three pedunculated polyps were found in the mid sigmoid colon and in the proximal ascending colonThe polyps were 30 mm in size. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.
Estimated Blood Loss: Estimated blood loss: none.
Recommendation: - Discharge patient to home (ambulatory).
- High fiber diet indefinitely.
CPT(c) Code(s): --- Technical ---
G0121, Colorectal cancer screening; colonoscopy on individual
not meeting criteria for high risk
CPT Copyright 2010 American Medical Association. All Rights Reserved.
The codes documented in this report are preliminary and upon coder review may be revised
to meet current compliance requirements.
Victor von Doom
Victor von Doom, MD
5/6/2013 15:10
This report has been signed electronically.
Number of Addenda: 0
25. 25
Sectioned
Findings: Lorem ipsum dolor sit amet, consectetur
adipisicing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex
ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Three pedunculated polyps were
found in the mid sigmoid colon and in the
proximal ascending colonThe polyps were 30 mm
in size. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id
est laborum.
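Report sectionization, as shown above for the Findings section, can be sketched with a regular expression that looks for known section headers at the start of a line. This is a minimal plain-Python illustration, not the code used for the talk: the header list is taken from the sample report, the shortened report text is an abbreviation of that sample, and `split_sections` is a hypothetical helper.

```python
import re

# Section headers as they appear in the sample report.
HEADERS = ["Indications", "Complications", "Findings",
           "Estimated Blood Loss", "Recommendation"]

# Shortened version of the sample report (filler text omitted).
REPORT = """Indications: Screening for colorectal malignant neoplasm
Complications: No immediate complications.
Findings: Three pedunculated polyps were found in the mid sigmoid colon
and in the proximal ascending colonThe polyps were 30 mm in size.
Estimated Blood Loss: Estimated blood loss: none.
Recommendation: - Discharge patient to home (ambulatory)."""

def split_sections(report, headers):
    """Return {header: body} for each known header found in the report."""
    # A header only counts when it starts a line and is followed by ":".
    pattern = re.compile(
        r"^(%s):" % "|".join(re.escape(h) for h in headers),
        re.MULTILINE)
    matches = list(pattern.finditer(report))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        body = report[match.end():end]
        sections[match.group(1)] = " ".join(body.split())  # collapse whitespace
    return sections

sections = split_sections(REPORT, HEADERS)
print(sections["Findings"])
```

Anchoring headers to the start of a line keeps the lowercase "Estimated blood loss:" inside its own section body instead of starting a new section.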
26. 26
Sample
“ Three pedunculated polyps were found in the mid
sigmoid colon and in the proximal ascending
colonThe polyps were 30 mm in size. ”
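The sample still carries two typing problems that would confuse a parser: the fused "colonThe" hides a sentence boundary, and "30 mm" separates a number from its unit. A minimal sketch of the Formatting the Text step in plain Python follows; the function name `format_text` is hypothetical and the two rules are assumptions that cover only this sample.

```python
import re

SAMPLE = (" Three pedunculated polyps were found in the mid sigmoid colon "
          "and in the proximal ascending colonThe polyps were 30 mm in size. ")

def format_text(text):
    """Normalise hand-typed report text before NLP processing."""
    # Join a number and its unit: "30 mm" -> "30mm".
    text = re.sub(r"\b(\d+)\s+(mm)\b", r"\1\2", text)
    # Restore a missing sentence break: "colonThe" -> "colon. The".
    text = re.sub(r"\b([a-z]+)([A-Z][a-z]+)\b", r"\1. \2", text)
    # Trim leading and trailing whitespace.
    return text.strip()

formatted = format_text(SAMPLE)
print(formatted)
```

Joining "30mm" into one token means the size later appears as a single word in the dependency output, which makes it simple to match.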
29. NLP Information Extraction Example
Relationship Dependencies
Original sentence:
Three pedunculated polyps were found in the mid
sigmoid colon and in the proximal ascending colon.
Dependencies:
num(polyps-2, Three-0) [numeric modifier]
amod(polyps-2, pedunculated-1) [adjectival modifier]
nsubjpass(found-4, polyps-2) [nominal passive subject]
auxpass(found-4, were-3) [passive auxiliary]
det(colon-9, the-6) [determiner]
amod(colon-9, mid-7) [adjectival modifier]
nn(colon-9, sigmoid-8) [nn modifier]
prep_in(found-4, colon-9) [prep_collapsed]
det(proximal-13, the-12) [determiner]
prep_in(found-4, proximal-13) [prep_collapsed]
conj_and(colon-9, proximal-13) [conj_collapsed]
partmod(proximal-13, ascending-14) [participial modifier]
dobj(ascending-14, colon-15) [direct object]
Original sentence:
The polyps were 30mm in size.
Dependencies:
det(polyps-1, The-0) [determiner]
nsubj(30mm-3, polyps-1) [nominal subject]
cop(30mm-3, were-2) [copula]
prep_in(30mm-3, size-5) [prep_collapsed]
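The dependency lines above already contain the answers to the PROSPR questions. As a sketch in plain Python that works on the printed dependency strings rather than on a live parser, the relations can be parsed into tuples and matched. The names `parse_deps` and `extract` are hypothetical, `WORD_NUMS` is a tiny stand-in for a word-to-number library, and the location codes 1 and 4 are copied from the deck's output without their meaning being stated there.

```python
import re

# A subset of the dependencies listed above, without the descriptions.
DEPS = """num(polyps-2, Three-0)
amod(polyps-2, pedunculated-1)
nsubjpass(found-4, polyps-2)
nn(colon-9, sigmoid-8)
dobj(ascending-14, colon-15)
nsubj(30mm-3, polyps-1)"""

DEP_RE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")
# Hypothetical stand-in for a words-to-numbers library.
WORD_NUMS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def parse_deps(text):
    """Return (relation, governor, dependent) tuples, dropping indices."""
    return [(rel, gov, dep) for rel, gov, _, dep, _ in DEP_RE.findall(text)]

def extract(deps):
    """Pull polyp count, size in mm, and location codes from dependencies."""
    count, size, locations = None, None, []
    for rel, gov, dep in deps:
        if rel == "num" and re.fullmatch(r"polyps?", gov, re.I):
            count = WORD_NUMS.get(dep.lower())
        if rel == "nsubj" and re.fullmatch(r"polyps?", dep, re.I):
            match = re.fullmatch(r"(\d+)mm", gov)
            if match:
                size = int(match.group(1))
        if rel == "nn" and gov.lower() == "colon" and "sigmoid" in dep.lower():
            locations.append(1)   # code taken from the deck's output
        if rel == "dobj" and gov.lower() == "ascending" and dep.lower() == "colon":
            locations.append(4)   # code taken from the deck's output
    return count, size, locations

print(extract(parse_deps(DEPS)))  # (3, 30, [1, 4])
```

Matching on the relation name plus governor and dependent is what separates the polyp count ("Three") from any other number in the report.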
30. 30
Output
“Three pedunculated polyps were found in the mid
sigmoid colon and in the proximal ascending colon.
The polyps were 30 mm in size.”
Output
Number of Polyps: 3
Size of Polyps: 30,
Location of Polyps: 1,4,
31. Perl Example
use strict;
use warnings;
use Lingua::StanfordCoreNLP;
use Lingua::EN::Words2Nums;

my $pipeline = new Lingua::StanfordCoreNLP::Pipeline(1);
my $text = "Three pedunculated polyps were found in the mid sigmoid colon and in the proximal ascending colonThe polyps were 30 mm in size.";

# Join a number and its unit ("30 mm" -> "30mm").
$text =~ s/(\b\d+\b)(\s)(\bmm\b)/$1$3/g;
# Restore the missing sentence break ("colonThe" -> "colon. The").
$text =~ s/(\b[a-z]+)([A-Z])([a-z]+\b)/$1. $2$3/g;
# Trim leading and trailing whitespace.
$text =~ s/^\s+//;
$text =~ s/\s+$//;

my $result = $pipeline->process($text);
my $polypCount;
my $polypSize;
my $polypLocation = "";

for my $sentence (@{$result->toArray}) {
    for my $dep (@{$sentence->getDependencies->toArray}) {
        my $relation = $dep->getRelation;
        my $govern   = $dep->getGovernor->getWord;
        my $depend   = $dep->getDependent->getWord;
        # words2nums returns undef when the word is not a number.
        my $num = words2nums($depend);

        # num(polyps, Three): the count modifies "polyp(s)".
        if (($relation eq "num") && ($govern =~ /^polyps?$/i)) {
            $polypCount = $num;
        }
        # nsubj(30mm, polyps): the size is the governor.
        if (($relation eq "nsubj") && ($govern =~ /^\d+mm$/) && ($depend =~ /^polyps?$/i)) {
            $govern =~ s/mm$//;
            $polypSize = "$govern,";
        }
        # nn(colon, sigmoid): location code 1.
        if (($relation eq "nn") && ($govern =~ /^colon$/i) && ($depend =~ /sigmoid/i)) {
            $polypLocation = "1,";
        }
        # dobj(ascending, colon): location code 4.
        if (($relation eq "dobj") && ($govern =~ /^ascending$/i) && ($depend =~ /^colon$/i)) {
            $polypLocation .= "4,";
        }
    }
}

print "Number of Polyps:\t$polypCount\n";
print "Size of Polyps:\t\t$polypSize\n";
print "Location of Polyps:\t$polypLocation\n";
32. 32
F - Score
6/26/2013
Comparison against a manually curated “Gold
Standard”
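Precision = proportion of extracted items that are correct: TP / (TP + FP)
Recall = proportion of actual (gold standard) items that were extracted: TP / (TP + FN)
F-score (F1) = harmonic mean of precision and recall: 2 x Precision x Recall / (Precision + Recall)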
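As a worked example of scoring against a gold standard, the sketch below computes precision, recall and the F1 score in plain Python. The gold and predicted items are invented purely for illustration (they are not results from this project), and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted items against a manually curated gold standard."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # extracted and correct
    fp = len(predicted - gold)   # extracted but wrong
    fn = len(gold - predicted)   # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented (report id, field, value) triples, for illustration only.
GOLD = {("r1", "count", 3), ("r1", "size", 30),
        ("r2", "count", 1), ("r2", "size", 5)}
PREDICTED = {("r1", "count", 3), ("r1", "size", 30), ("r2", "count", 2)}

p, r, f1 = precision_recall_f1(GOLD, PREDICTED)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.500 f1=0.571
```

Here two of three extracted items are correct (precision 2/3) and two of four gold items were found (recall 1/2), giving F1 = 4/7.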
To make the extraction via regular expression pattern matching much easier.
Difference between the number 61 being a Duration and the number 29 being part of a Date.
e.g. you just want to extract all the characters in a book.
GATE = University of Sheffield
Team NLTK = 6 people from: UT Austin; University of Gothenburg, Sweden; University of Melbourne; University of Sydney; Oslo, Norway; Ekaterinburg, Russia