Validation of a Natural Language Processing Protocol for Detecting Heart Failure Signs in Electronic Health Record Notes BYRD

Validation of a Natural Language
Processing Protocol for Detecting
Heart Failure Signs and
Symptoms in Electronic Health
Record Text Notes
Roy J. Byrd2, Steven R. Steinhubl1, Jimeng
Sun2, Shahram Ebadollahi2, Zahra Daar1, Walter F.
Stewart1
1Geisinger Medical Center, Center for Health
Research, Danville, PA
2 IBM, T.J. Watson Research Center, Hawthorne, NY

Outline
• Background and objectives
• Datasets
• Tools & Methods
• Results
• Discussion
– Challenges
– Opportunities
• Summary

• (Iterative annotation refinement)

Background and Objectives
• Background
– Framingham criteria for HF published in 1971
– Geisinger/IBM “PredMED” project on predictive modeling for
early detection of HF, using longitudinal EHRs

• Overall Project Objective
Better understand the presentation of HF in the primary care
setting, in order to facilitate its more rapid identification and
treatment

• Objective of this paper:
Build and validate NLP extractors for Framingham criteria
(signs and symptoms) from EHR clinical notes, so that they
may be suitable for downstream diagnostic applications

Framingham HF Diagnostic Criteria
MAJOR SYMPTOMS
1. Paroxysmal Nocturnal Dyspnea (PND) or Orthopnea
2. Neck Vein Distension (JVD)
3. Rales
4. Radiographic Cardiomegaly
5. Acute Pulmonary Edema
6. S3 Gallop
7. Increased Central Venous Pressure (> 16 cm H2O at RA)
8. Circulation Time of 25 seconds**
9. Hepatojugular Reflux (HJR)
10. Weight loss 4.5 kg in 5 days in response to treatment

MINOR SYMPTOMS
1. Bilateral Ankle Edema
2. Nocturnal Cough
3. Dyspnea on ordinary exertion
4. Hepatomegaly
5. Pleural effusion
6. A decrease in vital capacity by 1/3 of the maximal value recorded**
7. Tachycardia (>120 BPM)

** Not extracted, since these criteria are not documented in routine clinical practice.
N Engl J Med. 1971;285:1441-1446.

(Sample downstream analysis)

Reports of Framingham HF criteria
in the year prior to diagnosis
[Figure: bar chart of Percent with Documented Criteria for Cases (N=4,644) and Controls (N=45,981), by criterion: PND, Rales, JVD, Pulm Edema, CMegaly, Ankle Edema, DOE]

Datasets
• Clinical notes from longitudinal (2001-2010) EHR
encounters for
– 6,355 case patients
• Meet operational criteria for HF**
– 26,052 control patients
• Clinic-, gender- and age-matched to cases
– The case-control distinction is exploited in downstream
applications; it’s not relevant for criteria extraction.
• Development dataset
– 65 encounter notes
• Selected for density of Framingham criteria
• Annotated by a clinical expert
• Validation dataset
– 400 encounter notes (200 cases & 200 controls)
• Randomly selected
• Annotated by consensus of 4 trained coders
• N = 1492 criteria

**Operational HF Criteria
– HF diagnosis on problem list,
– HF diagnosis in EHR for two outpatient encounters,
– Two or more medications with ICD-9 code for HF, or
– One HF diagnosis and one medication with ICD-9 code for HF
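
Read as a rule, the operational HF criteria on this slide amount to a disjunction of four conditions. A minimal sketch of that reading follows. The `PatientRecord` fields are hypothetical names, not the project's schema, and treating "one HF diagnosis" as at least one encounter with an HF diagnosis is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    """Hypothetical per-patient summary; field names are illustrative only."""
    hf_on_problem_list: bool
    outpatient_encounters_with_hf_dx: int
    medications_with_hf_icd9: int


def meets_operational_hf_criteria(p: PatientRecord) -> bool:
    # Assumption: the four listed criteria are alternatives (any one suffices).
    return (
        p.hf_on_problem_list
        or p.outpatient_encounters_with_hf_dx >= 2
        or p.medications_with_hf_icd9 >= 2
        # Assumption: "one HF diagnosis" means at least one encounter diagnosis.
        or (p.outpatient_encounters_with_hf_dx >= 1 and p.medications_with_hf_icd9 >= 1)
    )
```

For example, a patient with one HF encounter diagnosis and one HF-coded medication would qualify as a case under this reading, while one encounter diagnosis alone would not.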

Tools

• LRW [1] – LanguageWare Resource Workbench
– Basic Text Processing
– Dictionaries
– Grammars
• UIMA [2] – Unstructured Information Management Architecture
– Execution Pipeline, including I/O management
– Text Analysis Engines
• TextSTAT [3] – Simple Text Analysis Tool
– Concordance program, used for linguistic analysis

[Diagram: UIMA Collection Processing Engine. Encounter Documents → Basic Processing (paragraphs, sentences, tokenization, etc.) → Dictionaries and Grammars (for recognizing criteria candidates) → Text Analysis Engines (for applying constraints and annotating criteria) → Extracted Criteria]

[1] http://www/alphaworks.ibm.com/tech/lrw
[2] http://uima.apache.org
[3] http://neon.niederlandistik.fu-berlin.de/en/textstat

Criteria Extraction Methods:
Dictionaries
• Framingham Criteria vocabulary
– Words and phrases used to mention the 15 Framingham Criteria
– edema, leg edema, oedema; shortness of breath, SOB
– Size: ~75 “lemma forms” (main entries) and hundreds of variant forms
• Segment Header words and phrases
– Patient History, Examination, Plan, Instruction
• Negating words
– Used to deny criteria
• no, free of, ruled out
• Counterfactual triggers
– The criteria may not have occurred
• if, should, as needed for
• Miscellaneous Classes
– Weight loss phrases
• lose weight, diurese
– Time value words
• day, week, month
– Weight units
• pound, kilogram
– Diuretics
• Bumex, Furosemide
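
A dictionary of lemma forms with variant surface forms can be applied to note text as a simple longest-match lookup. The sketch below is illustrative only: it uses plain Python regular expressions rather than LRW, the two entries are just the examples from this slide, and `find_candidates` is a hypothetical helper name.

```python
import re

# Illustrative entries only: lemma form -> variant surface forms.
# The project's dictionaries (~75 lemma forms) are not reproduced here.
VOCABULARY = {
    "edema": ["edema", "leg edema", "oedema"],
    "shortness of breath": ["shortness of breath", "SOB"],
}

_VARIANT_TO_LEMMA = {
    variant.lower(): lemma
    for lemma, variants in VOCABULARY.items()
    for variant in variants
}

# Longest variants first, so that "leg edema" wins over "edema".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(v) for v in sorted(_VARIANT_TO_LEMMA, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_candidates(text):
    """Return (lemma, matched_text, start_offset) for each dictionary hit."""
    return [
        (_VARIANT_TO_LEMMA[m.group(1).lower()], m.group(1), m.start())
        for m in _PATTERN.finditer(text)
    ]
```

Calling `find_candidates("Pt reports SOB and leg edema.")` yields one hit per mention, each mapped back to its lemma form, which is the kind of candidate that later filtering steps would affirm, deny, or discard.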

Grammars
• Shallow English syntax
– Noun Phrases
• some moderate DOE
– Compound Noun Phrases
• chest pain, DOE, or night cough
– Prepositional Phrases
• No full-sentential parses
– Not needed for simple HF criteria
– Unreliable sentence boundaries and syntax in clinical notes
• Negated Scope
– regular rate and rhythm without murmurs, clicks, gallops, or rubs
• Counterfactual Scope
– Patient should call if she experiences shortness of breath
• Weight Loss
– 20 pound weight loss in a week with diuretics
• Tachycardia
– tachy at 120 (to 130)
– HR: 135
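
The negated and counterfactual scopes on this slide can be approximated without a full parse by running from a trigger word to the end of the clause. The following is a rough sketch under that assumption, not the LRW grammars themselves. The trigger lists come from the dictionary examples plus "without" from the example phrase, and `classify_mention` is a hypothetical helper.

```python
import re

NEGATION_TRIGGERS = ["no", "without", "free of", "ruled out"]
COUNTERFACTUAL_TRIGGERS = ["if", "should", "as needed for"]


def _scopes(text, triggers):
    """Return (start, end) character spans running from each trigger to the clause end."""
    alternation = "|".join(
        re.escape(t) for t in sorted(triggers, key=len, reverse=True)
    )
    # Assumption: a scope ends at '.', ';', ':' or a newline, but not at commas,
    # so that lists such as "murmurs, clicks, gallops, or rubs" stay inside it.
    pattern = re.compile(r"\b(?:" + alternation + r")\b[^.;:\n]*", re.IGNORECASE)
    return [m.span() for m in pattern.finditer(text)]


def classify_mention(text, mention_start):
    """Label a candidate mention by the scope (if any) that contains its start offset."""
    if any(s <= mention_start < e for s, e in _scopes(text, COUNTERFACTUAL_TRIGGERS)):
        return "counterfactual"
    if any(s <= mention_start < e for s, e in _scopes(text, NEGATION_TRIGGERS)):
        return "negated"
    return "affirmed"
```

With this approximation, "gallops" in the negated-scope example is classified as negated, and "shortness of breath" in the counterfactual example is classified as counterfactual. Real notes would need more careful scope termination than clause punctuation.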

Text Analysis Engines (TAEs)
• Rules to filter candidate criteria created from dictionaries and grammars.
• Deny criteria mentioned in negated contexts
– regular rate and rhythm without murmurs, clicks, gallops, or rubs → S3Neg
• Ignore criteria in counterfactual contexts
– Patient should call if she experiences shortness of breath
• Co-occurrence constraints
– exercise HR: 135 doesn’t affirm Tachycardia
• Disambiguation
– edema is recognized as APEdema, if near cxr, or in a “Radiology” note, or in a “Chest X-Ray” segment
• Numeric constraints
– she lost 5 pounds over a month doesn’t affirm WeightLoss
– tachy @ 115 doesn’t affirm Tachycardia
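
The co-occurrence and numeric constraints can be illustrated with two small checks built on the Framingham thresholds (tachycardia above 120 BPM, weight loss of 4.5 kg in 5 days). This is a simplified sketch, not the project's TAE rules. The regular expressions, the 30-day month, and reading the weight-loss criterion as "at least 4.5 kg within at most 5 days" are assumptions.

```python
import re

KG_PER_POUND = 0.45359237
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}  # month length is an approximation


def affirms_tachycardia(text, threshold_bpm=120):
    """True if a heart rate above the threshold appears outside an exercise context."""
    # Co-occurrence constraint: a rate measured during exercise does not count.
    if re.search(r"\bexercise\b", text, re.IGNORECASE):
        return False
    m = re.search(
        r"\b(?:tachy\w*|HR)\b\s*(?:at|@|:)?\s*(\d{2,3})(?:\s*\(?to\s*(\d{2,3})\)?)?",
        text,
        re.IGNORECASE,
    )
    if m is None:
        return False
    # For a range such as "120 (to 130)", test the upper value.
    return max(int(g) for g in m.groups() if g) > threshold_bpm


def affirms_weight_loss(text, min_kg=4.5, max_days=5):
    """True if the stated loss is at least min_kg within at most max_days."""
    m = re.search(
        r"(\d+(?:\.\d+)?)\s*(pound|lb|kilogram|kg)s?\b.*?"
        r"\b(?:in|over)\s+(a|\d+)\s+(day|week|month)s?\b",
        text,
        re.IGNORECASE,
    )
    if m is None:
        return False
    amount, unit, count, period = m.groups()
    kg = float(amount) * (KG_PER_POUND if unit.lower() in ("pound", "lb") else 1.0)
    days = (1 if count.lower() == "a" else int(count)) * DAYS_PER_UNIT[period.lower()]
    return kg >= min_kg and days <= max_days
```

Under these assumptions, "tachy @ 115" and "exercise HR: 135" are rejected while "HR: 135" is accepted, and "she lost 5 pounds over a month" fails both the amount and the duration tests.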

Encounter Labeling Methods
• We can label an encounter note with labels showing the
criteria that the note mentions
– The labels can be used by downstream analyses to gather
information such as: “This patient exhibited those symptoms on
that date.”
• 2 Methods:
– Machine-learning
• Using candidate criteria and scope annotations, as features, …
• use a [CHAID decision tree] classifier to assign criteria as labels.
– Rule-based
• Run the full extractor pipeline, then …
• Assign labels consisting of all unique criteria that survive filtering.
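
The rule-based method reduces to collecting the unique criteria that survive filtering. A minimal sketch, assuming each surviving candidate arrives as a (criterion, status) pair and that denied criteria carry a "Neg" suffix as in labels such as S3Neg. The function name and status strings are hypothetical.

```python
def label_encounter(candidates):
    """Return the sorted, de-duplicated labels for one encounter note.

    candidates: iterable of (criterion, status) pairs, where status is one of
    "affirmed", "negated", or "counterfactual" (hypothetical status names).
    """
    labels = set()
    for criterion, status in candidates:
        if status == "affirmed":
            labels.add(criterion)
        elif status == "negated":
            labels.add(criterion + "Neg")
        # Counterfactual mentions are ignored and produce no label.
    return sorted(labels)
```

Repeated mentions of the same criterion within a note collapse into a single label, which is what a downstream analysis of "this patient exhibited those symptoms on that date" would consume.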

Evaluation Flow

[Diagram: Encounter Documents → Lexical Look-up & Scope → Lexical + Scope Annotations → Machine Learning or Rules → Encounter Labels → Encounter Label Evaluation]

Metrics:

Precision (Positive Predictive Value):
#TruePositive / (#TruePositive + #FalsePositive)

Recall (Sensitivity):
#TruePositive / (#TruePositive + #FalseNegative)

F-Score (the harmonic mean of Precision and Recall):
(2 x Precision x Recall) / (Precision + Recall)
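
The three metrics translate directly into code. The sketch below follows the formulas on this slide; the function names are illustrative, and returning 0.0 for an empty denominator is a convention assumed here.

```python
def precision(true_pos, false_pos):
    """Positive predictive value: TP / (TP + FP)."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def recall(true_pos, false_neg):
    """Sensitivity: TP / (TP + FN)."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return (2 * p * r) / (p + r) if (p + r) else 0.0
```

As a consistency check, a precision of 0.899441 and a recall of 0.738532 give an F-score of about 0.8111.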

Encounter Labeling Performance

                   Machine-learning method           Rule-based method
                   Recall    Precision  F-Score      Recall    Precision  F-Score
Affirmed           0.675000  0.754190   0.712401     0.738532  0.899441   0.811083
Denied             0.945556  0.905319   0.925000     0.987599  0.931915   0.958949
Overall            0.896364  0.881144   0.888689     0.938462  0.926720   0.932554
Overall 99%
Conf. Int.         (0.848-0.929)                     (0.900-0.964)

Conclusion: Machine-learning labeling does not significantly underperform
rule-based labeling.

Performance of Framingham
Diagnostic Criteria Extraction
                    Precision  Recall    F-score   99% Confidence Interval (F-score)
Overall (exact)     0.925234   0.896864  0.910828  (0.891 - 0.929)
Overall (relaxed)   0.948239   0.919164  0.933475  (0.916 - 0.950)
Affirmed            0.747801   0.789474  0.768072  (0.711 - 0.824)
Denied              0.982857   0.928058  0.954672  (0.938 - 0.970)

Note: Performance on affirmed criteria is worse, possibly because of their
greater syntactic diversity. For example, we don’t find:
PleuralEffusion: blunting of the right costrophrenic angle
DOExertion: she felt like she couldn’t get enough air in

Precision and Recall for Individual
Criteria

Analysis of 1492 extracted criteria:
PredMED extractions vs.
Gold Standard annotations

[Confusion matrix. Rows: PredMED extractions. Columns: Gold Standard annotations, using the same criterion labels as the rows, followed by False Positive. Each row lists its non-blank cells from left to right.]
ANKED 90 6 16
ANKEDNeg 230 6
APED 8 5 2 1 22
APEDNeg 0
DOE 116 17 1 3
DOENeg 3 135 2 1
HEP 0 1
HEPNeg 125
HJR 2 1
HJRNeg 9
JVD 7 2
JVDNeg 91
NC 2
NCNeg 43 2
PLE 8
PLENeg 1
PND 1 7 2
PNDNeg 69
RALE 11 1
RALENeg 197
RC 6
RCNeg 1
S3G 0
S3GNeg 131
TACH 1 2
TACHNeg 0 4
WTL 0
False Negative 6 8 5 2 6 5 1 4 1 3 2 2 7 35 2 1 1 10

Discussion
• Challenges
– Data quality: EHR text data is messy.
• >10% (i.e., 26/237) of the errors are caused by misspellings & bad sentence boundaries
– Human anatomy
• We need a better solution than word co-occurrence constraints
– Syntactic diversity of affirmed criteria
• We need deeper syntactic and semantic analysis
– Contradictions and redundancy
• An issue for downstream analysis
• Opportunities
– We can apply similar techniques to other collections of criteria.
• NY Heart Association
• European Society of Cardiology
• MedicalCriteria.com
– Many specific criteria extractors can be re-used in other settings.
– For downstream applications, see posters and presentations from our project at this conference

Summary
• Extractors can identify affirmations and denials
of Framingham HF criteria in EHR clinical notes
with an overall F-Score of 0.91.
• Classifiers can label EHR encounters with the
Framingham criteria they mention with an
F-Score of 0.93.
• Information about HF criteria mentioned in EHR
notes appears to be useful for downstream
applications that seek to achieve early detection
of HF.

Backup:
Iterative Annotation Refinement

• What are the problems solved?
– Annotations are required for training and evaluating
criteria extractors.
– Human annotators without guidelines have high
precision but lower recall.
– Domain experts’ intuitions (about the language for
expressing criteria) are initially imprecise.
• What is produced?
– Annotated dataset
– Annotation guidelines … that are consistent
– Criteria extractors

The Development Process:
• Initialization
– Expert and Linguist discuss the language of HF criteria
– Expert writes initial guidelines
– Linguist builds initial extractors
• Results
– Expert Annotations
– Annotation Guidelines
– Criteria Extractors
• Iteration
– Annotate Encounter Texts with current extractors
– Perform error analysis
– Expert updates the annotations and the guidelines
– Linguist updates the extractors

User interface for the annotation tool, which was
used to manage annotations during refinement.

Performance improvement during
development
[Figure: Performance comparison. Precision (y-axis, 0.5 to 1) plotted against Recall (x-axis, 0.5 to 1), with Initial and Final points for PredMED and for the Clinical Expert]

Iterative methods for creating
annotations, guidelines, and extractors
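• Iterative Annotation Refinement
– Extraction target: Framingham HF criteria
– Result of using the method: Annotations, Guidelines, Extractor
– Sources of annotations compared in each iteration: Expert and Extractor
– Arbiter for disagreements at each iteration: Expert
– Objective (and metric) for each iteration: Improve extractor performance (F-score)
• Annotation Induction (Chapman, et al. J Biom Inf 2006)
– Extraction target: Clinical conditions
– Result of using the method: Guidelines (in the form of an annotation schema)
– Sources of annotations compared in each iteration: Expert and Linguist
– Arbiter for disagreements at each iteration: Consensus
– Objective (and metric) for each iteration: Improve inter-annotator agreement (F-score)
• CDKRM (Coden, et al., J Biom Inf 2009)
– Extraction target: Classes in the cancer disease model
– Result of using the method: Annotations, Guidelines
– Sources of annotations compared in each iteration: 2 Experts
– Arbiter for disagreements at each iteration: Consensus
– Objective (and metric) for each iteration: Improve inter-annotator agreement (agreement %)
• TALLAL (Carrell, et al, GHRI-IT poster, 2010)
– Extraction target: PHI (protected health information) classes
– Result of using the method: Annotations, Extractor
– Sources of annotations compared in each iteration: Expert and Extractor
– Arbiter for disagreements at each iteration: Expert
– Objective (and metric) for each iteration: Annotate full dataset (to the expert’s satisfaction)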
