This study validated a natural language processing (NLP) protocol for detecting signs and symptoms of heart failure (HF) in electronic health record (EHR) text notes. The protocol extracted mentions of 15 diagnostic criteria from the Framingham HF study from 400 EHR notes with an overall F-score of 0.91. The protocol also labeled encounters with the criteria mentioned with an F-score of 0.93. While challenges remain around data quality and syntactic diversity, information extracted about HF criteria appears useful for early detection of HF in downstream applications.
Validation of a Natural Language Processing Protocol for Detecting Heart Failure Signs in Electronic Health Record Notes
1. Validation of a Natural Language Processing Protocol for Detecting Heart Failure Signs and Symptoms in Electronic Health Record Text Notes
Roy J. Byrd (2), Steven R. Steinhubl (1), Jimeng Sun (2), Shahram Ebadollahi (2), Zahra Daar (1), Walter F. Stewart (1)
(1) Geisinger Medical Center, Center for Health Research, Danville, PA
(2) IBM, T.J. Watson Research Center, Hawthorne, NY
3. Background and Objectives
• Background
– Framingham criteria for HF published in 1971
– Geisinger/IBM “PredMED” project on predictive modeling for
early detection of HF, using longitudinal EHRs
• Overall Project Objective
Better understand the presentation of HF in the primary care
setting, in order to facilitate its more rapid identification and
treatment
• Objective of this paper:
Build and validate NLP extractors for Framingham criteria
(signs and symptoms) from EHR clinical notes, so that they
may be suitable for downstream diagnostic applications
4. Framingham HF Diagnostic Criteria
MAJOR SYMPTOMS
1. Paroxysmal Nocturnal Dyspnea (PND) or Orthopnea
2. Neck Vein Distension (JVD)
3. Rales
4. Radiographic Cardiomegaly
5. Acute Pulmonary Edema
6. S3 Gallop
7. Increased Central Venous Pressure (> 16 cm H2O at RA)
8. Circulation Time of 25 seconds**
9. Hepatojugular Reflux (HJR)
10. Weight loss 4.5kg in 5 days in response to treatment
MINOR SYMPTOMS
1. Bilateral Ankle Edema
2. Nocturnal Cough
3. Dyspnea on ordinary exertion
4. Hepatomegaly
5. Pleural effusion
6. A decrease in vital capacity by 1/3 of the maximal value recorded**
7. Tachycardia (>120 BPM)
** Not extracted, since these criteria are not documented in routine clinical practice.
N Engl J Med. 1971;285:1441-1446.
5. (Sample downstream analysis)
Reports of Framingham HF criteria in the year prior to diagnosis
[Bar chart. Y-axis: Percent with Documented Criteria. X-axis: PND, Rales, JVD, Pulm Edema, CMegaly, Ankle Edema, DOE. Series: Cases (N=4,644) and Controls (N=45,981).]
6. Datasets
• Clinical notes from longitudinal (2001-2010) EHR
encounters for
– 6,355 case patients
• Meet operational criteria for HF**
– 26,052 control patients
• Clinic-, gender- and age-matched to cases
– The case-control distinction is exploited in downstream
applications; it’s not relevant for criteria extraction.
• Development dataset
– 65 encounter notes
• Selected for density of Framingham criteria
• Annotated by a clinical expert
• Validation dataset
– 400 encounter notes (200 cases & 200 controls)
• Randomly selected
• Annotated by consensus of 4 trained coders
• N = 1492 criteria
**Operational HF Criteria
– HF diagnosis on problem list,
– HF diagnosis in EHR for two outpatient encounters,
– Two or more medications with ICD-9 code for HF, or
– One HF diagnosis and one medication with ICD-9 code for HF
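The operational HF criteria on this slide amount to a disjunction of four tests, which can be sketched in Python as below. This is only an illustration: the `PatientRecord` type, its field names, and the reading of "one HF diagnosis" as "at least one outpatient encounter with an HF diagnosis" are assumptions, not part of the original project.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    """Hypothetical summary of one patient's structured EHR data."""
    hf_on_problem_list: bool
    outpatient_encounters_with_hf_dx: int
    hf_medications: int  # medications recorded with an ICD-9 code for HF


def meets_operational_hf_criteria(p: PatientRecord) -> bool:
    """Return True if any one of the four operational criteria holds."""
    return (
        p.hf_on_problem_list
        or p.outpatient_encounters_with_hf_dx >= 2
        or p.hf_medications >= 2
        # Assumed reading: one diagnosis plus one medication.
        or (p.outpatient_encounters_with_hf_dx >= 1 and p.hf_medications >= 1)
    )
```

For example, under these assumptions a patient with one outpatient HF diagnosis and one HF medication qualifies as a case, while a patient with a single HF medication and no diagnosis does not.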
7. Tools
• LRW (1) – LanguageWare Resource Workbench
– Basic Text Processing
– Dictionaries
– Grammars
• UIMA (2) - Unstructured Information Management Architecture
– Execution Pipeline, including I/O management
– Text Analysis Engines
• TextSTAT (3) – Simple Text Analysis Tool
– Concordance program, used for linguistic analysis
[Pipeline diagram: UIMA Collection Processing Engine. Encounter Documents → Basic Processing (for paragraphs, sentences, tokenization, etc.) → Dictionaries and Grammars (for recognizing criteria candidates) → Text Analysis Engines (for applying constraints and annotating criteria) → Extracted Criteria]
(1) http://www/alphaworks.ibm.com/tech/lrw
(2) http://uima.apache.org
(3) http://neon.niederlandistik.fu-berlin.de/en/textstat
8. Criteria Extraction Methods: Dictionaries
• Framingham Criteria vocabulary
– Words and phrases used to mention the 15 Framingham Criteria
– edema, leg edema, oedema; shortness of breath, SOB
– Size: ~75 “lemma forms” (main entries) and hundreds of variant forms
• Segment Header words and phrases
– Patient History, Examination, Plan, Instruction
• Negating words
– Used to deny criteria
• no, free of, ruled out
• Counterfactual triggers
– The criteria may not have occurred
• if, should, as needed for
• Miscellaneous Classes
– Weight loss phrases
• lose weight, diurese
– Time value words
• day, week, month
– Weight units
• pound, kilogram
– Diuretics
• Bumex, Furosemide
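The dictionary lookup described on this slide can be illustrated with a small Python sketch. The project itself used LanguageWare dictionaries. The regular-expression matcher here is a swapped-in stand-in, and the grouping of variants under each lemma form, as well as the names `CRITERIA_VOCAB` and `find_candidates`, are assumptions made for illustration.

```python
import re

# Hypothetical mini-dictionary: lemma form -> variant surface forms.
# The real dictionary has ~75 lemma forms and hundreds of variants.
CRITERIA_VOCAB = {
    "edema": ["edema", "leg edema", "oedema"],
    "shortness of breath": ["shortness of breath", "SOB"],
}


def build_matcher(vocab):
    """Compile one case-insensitive pattern over all variant forms."""
    pairs = [(v, lemma) for lemma, variants in vocab.items() for v in variants]
    # Longest variants first, so "leg edema" is preferred over "edema".
    pairs.sort(key=lambda p: len(p[0]), reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v, _ in pairs) + r")\b",
        re.IGNORECASE,
    )
    lookup = {v.lower(): lemma for v, lemma in pairs}
    return pattern, lookup


def find_candidates(text, vocab=CRITERIA_VOCAB):
    """Return (offset, matched text, lemma form) for each vocabulary hit."""
    pattern, lookup = build_matcher(vocab)
    return [
        (m.start(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```

Calling `find_candidates("Pt reports SOB and leg edema.")` yields one candidate for each of the two lemma forms. These are only candidates: whether each one is affirmed, denied, or ignored is decided later in the pipeline.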
9. Criteria Extraction Methods: Grammars
• Shallow English syntax
– Noun Phrases
• some moderate DOE
– Compound Noun Phrases
• chest pain, DOE, or night cough
– Prepositional Phrases
• No full-sentential parses
– Not needed for simple HF criteria
– Unreliable sentence boundaries and syntax in clinical notes
• Negated Scope
– regular rate and rhythm without murmurs, clicks, gallops, or rubs
• Counterfactual Scope
– Patient should call if she experiences shortness of breath
• Weight Loss
– 20 pound weight loss in a week with diuretics
• Tachycardia
– tachy at 120 (to 130)
– HR: 135
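As a rough illustration of what a shallow pattern for the Tachycardia examples above might capture, the following Python sketch uses a regular expression in place of a LanguageWare grammar. The pattern, the trigger list, and the function name `heart_rate_mentions` are assumptions for illustration only. The sketch extracts the numeric values and leaves the decision about whether they affirm Tachycardia to a later step.

```python
import re

# Assumed pattern: a trigger word, an optional connector, a rate,
# and an optional upper bound such as "(to 130)" or "-130".
HR_PATTERN = re.compile(
    r"\b(?P<trigger>tachy(?:cardia|cardic)?|HR|heart rate)\b"
    r"\s*(?:at|@|:|of)?\s*"
    r"(?P<low>\d{2,3})"
    r"(?:\s*\(?\s*(?:to|-)\s*(?P<high>\d{2,3})\s*\)?)?",
    re.IGNORECASE,
)


def heart_rate_mentions(text):
    """Return (trigger, low_bpm, high_bpm) for each heart-rate mention."""
    mentions = []
    for m in HR_PATTERN.finditer(text):
        low = int(m.group("low"))
        high = int(m.group("high")) if m.group("high") else low
        mentions.append((m.group("trigger"), low, high))
    return mentions
```

With this pattern, the slide's examples "tachy at 120 (to 130)" and "HR: 135" each produce one mention with its rate or rate range.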
10. Criteria Extraction Methods: Text Analysis Engines (TAEs)
• Rules to filter candidate criteria created from dictionaries and grammars.
• Deny criteria mentioned in negated contexts
– regular rate and rhythm without murmurs, clicks, gallops, or rubs → S3Neg
• Ignore criteria in counterfactual contexts
– Patient should call if she experiences shortness of breath
• Co-occurrence constraints
– exercise HR: 135 doesn’t affirm Tachycardia
• Disambiguation
– edema is recognized as APEdema, if near cxr, or in a “Radiology” note, or in a “Chest X-Ray” segment
• Numeric constraints
– she lost 5 pounds over a month doesn’t affirm WeightLoss
– tachy @ 115 doesn’t affirm Tachycardia
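The filtering rules on this slide can be sketched in Python as follows. This is a deliberately simplified stand-in for the UIMA Text Analysis Engines. It treats any negating word or counterfactual trigger that precedes the mention in the same sentence as being in scope, which is an assumption (the project used scope grammars). The word lists come from the dictionary slide, with "without" added from the example above, and the function names are hypothetical.

```python
import re

NEGATING_WORDS = ("no", "free of", "ruled out", "without")
COUNTERFACTUAL_TRIGGERS = ("if", "should", "as needed for")
TACHYCARDIA_MIN_BPM = 120  # Framingham minor criterion: > 120 BPM


def _contains_any(text, phrases):
    """True if any phrase occurs in text as whole words."""
    return any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases)


def classify_mention(sentence, mention, criterion):
    """Return the criterion, the criterion + 'Neg', or None if ignored."""
    lowered = sentence.lower()
    idx = lowered.find(mention.lower())
    if idx < 0:
        return None
    prefix = lowered[:idx]  # assumed scope: everything before the mention
    if _contains_any(prefix, COUNTERFACTUAL_TRIGGERS):
        return None  # counterfactual context: ignore the criterion
    if _contains_any(prefix, NEGATING_WORDS):
        return criterion + "Neg"  # negated context: deny the criterion
    return criterion


def affirms_tachycardia(rate_bpm, context=""):
    """Numeric and co-occurrence constraints for Tachycardia."""
    if re.search(r"\bexercise\b", context, re.IGNORECASE):
        return False  # exercise heart rates do not affirm Tachycardia
    return rate_bpm > TACHYCARDIA_MIN_BPM
```

Under these assumptions, the negated example yields `S3Neg`, the counterfactual example is ignored, and "tachy @ 115" fails the numeric constraint.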
11. Encounter Labeling Methods
• We can label an encounter note with labels showing the
criteria that the note mentions
– The labels can be used by downstream analyses to gather
information such as: “This patient exhibited those symptoms on
that date.”
• 2 Methods:
– Machine-learning
• Using candidate criteria and scope annotations, as features, …
• use a [CHAID decision tree] classifier to assign criteria as labels.
– Rule-based
• Run the full extractor pipeline, then …
• Assign labels consisting of all unique criteria that survive filtering.
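The rule-based method reduces to collecting the unique criteria that survive filtering. A minimal Python sketch, assuming the filtering step returns `None` for ignored candidates (the function name and that convention are hypothetical):

```python
def label_encounter(filtered_criteria):
    """Return the unique surviving criteria, in first-seen order."""
    survivors = [c for c in filtered_criteria if c is not None]
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(survivors))
```

A downstream analysis could then pair these labels with the encounter date to record which criteria a patient's notes mentioned on that date.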
15. Performance of Framingham Diagnostic Criteria Extraction
                    Precision   Recall     F-score    99% Confidence Interval (F-score)
Overall (exact)     0.925234    0.896864   0.910828   (0.891 - 0.929)
Overall (relaxed)   0.948239    0.919164   0.933475   (0.916 - 0.950)
Affirmed            0.747801    0.789474   0.768072   (0.711 - 0.824)
Denied              0.982857    0.928058   0.954672   (0.938 - 0.970)
Note: Performance on affirmed criteria is worse, possibly because of their
greater syntactic diversity. For example, we don’t find:
PleuralEffusion: blunting of the right costophrenic angle
DOExertion: she felt like she couldn’t get enough air in
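The F-scores in the table are consistent with the balanced F-score, the harmonic mean of precision and recall. A short Python sketch shows the arithmetic (the function name is illustrative):

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the table's "Overall (exact)" row.
overall_exact = f_score(0.925234, 0.896864)
```

Rounded to three decimals, `overall_exact` gives 0.911, in line with the reported 0.910828.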
17. Analysis of 1492 extracted criteria:
PredMED extractions vs.
Gold Standard annotations
[Confusion matrix. Rows: PredMED extractions, using the labels below, plus a final "False Negative" row. Columns: Gold Standard annotations over the same labels, plus a "False Positive" column. The rotated column headers did not survive extraction, so each row lists only its non-empty cells and column positions are not preserved.]
ANKED 90 6 16
ANKEDNeg 230 6
APED 8 5 2 1 22
APEDNeg 0
DOE 116 17 1 3
DOENeg 3 135 2 1
HEP 0 1
HEPNeg 125
HJR 2 1
HJRNeg 9
JVD 7 2
JVDNeg 91
NC 2
NCNeg 43 2
PLE 8
PLENeg 1
PND 1 7 2
PNDNeg 69
RALE 11 1
RALENeg 197
RC 6
RCNeg 1
S3G 0
S3GNeg 131
TACH 1 2
TACHNeg 0 4
WTL 0
False Negative 6 8 5 2 6 5 1 4 1 3 2 2 7 35 2 1 1 10
18. Discussion
• Challenges
– Data quality: EHR text data is messy.
• >10% (i.e., 26/237) of the errors are caused by misspellings & bad sentence boundaries
– Human anatomy
• We need a better solution than word co-occurrence constraints
– Syntactic diversity of affirmed criteria
• We need deeper syntactic and semantic analysis
– Contradictions and redundancy
• An issue for downstream analysis
• Opportunities
– We can apply similar techniques to other collections of criteria.
• NY Heart Association
• European Society of Cardiology
• MedicalCriteria.com
– Many specific criteria extractors can be re-used in other settings.
– For downstream applications, see posters and presentations from our project at this conference
20. Summary
• Extractors can identify affirmations and denials
of Framingham HF criteria in EHR clinical notes
with an overall F-Score of 0.91.
• Classifiers can label EHR encounters with the
Framingham criteria they mention with an
F-Score of 0.93.
• Information about HF criteria mentioned in EHR
notes appears to be useful for downstream
applications that seek to achieve early detection
of HF.
22. Iterative Annotation Refinement
• What are the problems solved?
– Annotations are required for training and evaluating
criteria extractors.
– Human annotators without guidelines have high
precision but lower recall.
– Domain experts’ intuitions (about the language for
expressing criteria) are initially imprecise.
• What is produced?
– Annotated dataset
– Annotation guidelines … that are consistent
– Criteria extractors
23. The Development Process:
Iterative Annotation Refinement
[Process diagram with three columns; participants are an Expert and a Linguist, and the input is Encounter Texts]
• Initialization
– Discuss the language of HF criteria
– Write initial guidelines (Expert)
– Build initial extractors (Linguist)
• Results
– Expert Annotations
– Annotation Guidelines
– Criteria Extractors
• Iteration
– Annotate texts with current extractors
– Perform error analysis
– Update the annotations and the guidelines
– Update the extractors
24. User interface for the annotation tool, which was
used to manage annotations during refinement.
25. Performance improvement during
development
[Scatter plot "Performance comparison": Precision (y-axis, 0.5 to 1) vs. Recall (x-axis, 0.5 to 1), with Initial and Final points for PredMED and for the Clinical Expert.]
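26. Iterative methods for creating annotations, guidelines, and extractors
• Iterative Annotation Refinement
– Extraction target: Framingham HF criteria
– Result of using the method: Annotations, Guidelines, Extractor
– Sources of annotations compared in each iteration: Expert and Extractor
– Arbiter for disagreements at each iteration: Expert
– Objective (and metric) for each iteration: Improve extractor performance (F-score)
• Annotation Induction (Chapman, et al. J Biom Inf 2006)
– Extraction target: Clinical conditions
– Result of using the method: Guidelines (in the form of an annotation schema)
– Sources of annotations compared in each iteration: Expert and Linguist
– Arbiter for disagreements at each iteration: Consensus
– Objective (and metric) for each iteration: Improve inter-annotator agreement (F-score)
• CDKRM (Coden, et al., J Biom Inf 2009)
– Extraction target: Classes in the cancer disease model
– Result of using the method: Annotations, Guidelines
– Sources of annotations compared in each iteration: 2 Experts
– Arbiter for disagreements at each iteration: Consensus
– Objective (and metric) for each iteration: Improve inter-annotator agreement (agreement %)
• TALLAL (Carrell, et al, GHRI-IT poster, 2010)
– Extraction target: PHI (protected health information) classes
– Result of using the method: Annotations, Extractor
– Sources of annotations compared in each iteration: Expert and Extractor
– Arbiter for disagreements at each iteration: Expert
– Objective (and metric) for each iteration: Annotate full dataset (to the expert’s satisfaction)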