Semantic decomposition of ontological
resources for the creation of flexible, high-
performance biomedical concept recognisers
26 June 2012
Phil Gooch
Centre for Health Informatics
Overview
●
Why identify biomedical concepts in free text?
●
How ontologies can help
●
Problems with using ontologies for concept identification
●
Potential solutions
●
Application of method to two ontologies: Foundational Model of
Anatomy and Disease Ontology
●
Evaluation against a small corpus of 163 clinical discharge
summaries, surgical, pathology and radiology reports
Why identify biomedical concepts in free text?
●
Indexing MedLine abstracts for semantic search
– Identifying 'hypertension' as being of semantic type 'disease',
moreover being a cardiovascular disease
●
Literature based knowledge discovery
– Disease D associated with increase in physiological function F
– Substance S inhibits F
– => S might be a treatment for D
●
Decision support
– What treatment recommendations do clinical guideline
documents provide for hypertension in pregnancy?
– What were the findings of the pathology report?
– 50% of clinically important information resides in the free text of
the patient record, rather than in structured fields (Sittig 2007)
Ontologies
●
Define the concepts of a given domain, their properties and their
relationships
– Provide canonical names for terms
– Classification hierarchy, whole-part relations and synonyms
●
Can function as a dictionary, i.e. a lookup list of terms for concept
identification via string matching
●
Or defined properties can be used to infer concepts
– A Company issues Shares
– 'shares in Abc fell' => 'Abc' is a Company
Problems with biomedical ontologies for concept identification
●
Often very large
– Foundational Model of Anatomy > 200MB, 150K+ terms
– Even when expressed in a compact data structure (e.g. Trie),
potentially large RAM overhead when used to match strings
●
May not be complete: how to identify potentially new terms,
classes
●
May not contain all synonyms or other ways of expressing terms,
e.g. abbreviations
– Separate lists of word variations often compiled (e.g. NLM
SPECIALIST lexical variant generation tools)
Some solutions
●
Hearst patterns (Hearst 1992)
– Identify hypernymic (class-member) relations
– 'Bruises, cuts, and other injuries'
– 'Diseases such as atherosclerosis'
– High precision, but low recall
●
Bootstrapping
– 'scaphoid, lunate, triquetral and pisiform'
– If we know that the scaphoid and lunate are bones of the wrist,
we can infer that the others in this list are also
– Improves recall, but reduces precision (Maynard 2009)
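A minimal sketch of two Hearst-style patterns, applied to the two example phrases above. The regexes are simplified assumptions (single-word items only), and the function names are hypothetical rather than taken from Hearst (1992).

```python
import re

# Simplified versions of two lexico-syntactic patterns:
#   "CLASS such as X, Y and Z"   and   "X, Y, and other CLASS".
# Assumption: every list item is a single word.
SUCH_AS = re.compile(r"\b(\w+) such as ((?:\w+, )*\w+(?:,? (?:and|or) \w+)?)")
AND_OTHER = re.compile(r"\b((?:\w+, )*\w+),? (?:and|or) other (\w+)")


def split_items(text):
    """Split 'a, b and c' (or 'a, b, and c') into its items."""
    return [w for w in re.split(r",?\s+(?:and|or)\s+|,\s*", text) if w]


def hearst_pairs(sentence):
    """Return (member, class) pairs found by the two patterns."""
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        pairs += [(item.lower(), m.group(1).lower()) for item in split_items(m.group(2))]
    for m in AND_OTHER.finditer(sentence):
        pairs += [(item.lower(), m.group(2).lower()) for item in split_items(m.group(1))]
    return pairs


print(hearst_pairs("Bruises, cuts, and other injuries"))
print(hearst_pairs("Diseases such as atherosclerosis"))
```

Because the patterns only fire on these fixed constructions, they illustrate why precision tends to be high while recall stays low.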
Some solutions
●
Domain-specific linguistic features
– Neoclassical combining forms
– Biomedical and clinical terms often composed of or contain well-
defined Latin and Greek roots, suffixes and prefixes
– -osis, -itis, -opathy => disease
– cardi-, ileo- => anatomy
– High precision, but low recall (Gooch & Roudsari 2011)
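The combining-form idea can be sketched as a pair of affix regexes. The affix lists below contain only the examples from this slide, and the function name and the rule that suffixes are checked before prefixes are assumptions made for illustration.

```python
import re

# Affixes limited to the slide's examples; a real recogniser would need
# much larger, curated lists.
DISEASE_SUFFIX = re.compile(r"\w+(?:osis|itis|opathy)", re.IGNORECASE)
ANATOMY_PREFIX = re.compile(r"(?:cardi|ileo)\w+", re.IGNORECASE)


def guess_semantic_type(word):
    """Guess 'disease' or 'anatomy' from a word's affixes, else None."""
    # Assumption: a disease suffix outranks an anatomy prefix,
    # so 'cardiomyopathy' is typed as a disease.
    if DISEASE_SUFFIX.fullmatch(word):
        return "disease"
    if ANATOMY_PREFIX.fullmatch(word):
        return "anatomy"
    return None


for word in ["atherosclerosis", "arthritis", "cardiomyopathy", "cardiac", "ileocecal", "fracture"]:
    print(word, "=>", guess_semantic_type(word))
```

Words such as 'fracture' carry no listed affix and are missed, which is consistent with the low recall noted above.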
Some solutions
●
NLM MetaMap (Aronson 2010): uses neoclassical combining
forms + lexical variant generation + ontologies
– Comprehensive, but heavyweight (4GB+ RAM, 10GB+ install)
●
mGrep (Meng 2009) radix trie-based lookup over ontologies
– Fast, higher precision but lower recall than MetaMap (Shah
2009)
– Still requires the complete source ontologies
– Requires substantial preprocessing of input text via the NCBO
web service (NCBO Support 2011)
Semantic decomposition of ontologies
●
Provide a systematic method of reducing the size of large
ontologies to make their use for concept identification feasible
●
Reproducible method so that concept recognisers for new
ontologies can be quickly developed
●
Has spin-off benefits for ontology quality assurance
– E.g. identification of spelling errors and lexical inconsistencies in
biomedical ontologies (Gooch 2011)
Semantic decomposition of ontologies
●
Little published work in this area
●
Tong et al (2008) decomposed the Gene Ontology into individual
tokens (words) and calculated the positional entropy of each token
via the probability of token t appearing at position p in a given
ontology term
●
Could be applied to identifying potential ontology terms in free text,
but wasn't evaluated
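One plausible reading of positional entropy is the Shannon entropy of the distribution of positions at which a token occurs across ontology terms. The sketch below follows that reading; it is an assumption and may differ from the exact formulation in Tong et al (2008). The term list is a toy example, not extracted from any ontology.

```python
import math
from collections import Counter, defaultdict


def positional_entropy(terms):
    """Map each token to the entropy (in bits) of its position distribution."""
    positions = defaultdict(Counter)
    for term in terms:
        for index, token in enumerate(term.lower().split()):
            positions[token][index] += 1

    entropy = {}
    for token, counts in positions.items():
        total = sum(counts.values())
        # H(t) = -sum_p P(p | t) * log2 P(p | t)
        entropy[token] = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy


# Hypothetical toy terms.
toy_terms = ["left lung", "left kidney", "lobe of left lung", "lung"]
scores = positional_entropy(toy_terms)
for token in sorted(scores):
    print(f"{token}: {scores[token]:.3f}")
```

A token that always occupies the same position scores zero, while a token spread evenly over several positions scores highest.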
Semantic decomposition of ontologies
●
Initial focus on Foundational Model of Anatomy (FMA) (Rosse
2003) as anatomical terms are central to the identification of
– location of disease, morbidity
– location of symptoms
– location of procedures – surgery, pathology and radiology
reports
– administration route of medication
●
Apply the method to the Disease Ontology (Osborne et al 2009) to
see how well it generalises
Semantic decomposition of ontologies
●
Extend Tong et al's idea but classify each token according to its part of
speech (noun, adjective etc) and its semantic type
●
Reduce the set of tokens further by identifying words (free
morphemes) sharing common roots and suffixes (bound morphemes)
●
Morpheme – smallest linguistic unit that has meaning (cephalon,
-derm, -ium, -rrhea)
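The decomposition step can be sketched as follows: split terms into unique tokens, then group tokens that share a bound morpheme so each group can be stored as one compact alternation. The suffixes come from the slide; the term list, function names and the omission of part-of-speech and semantic-type tagging are simplifying assumptions.

```python
import re
from collections import defaultdict

# Bound morphemes taken from the slide's examples.
SUFFIXES = ("derm", "ium", "rrhea")


def decompose(terms):
    """Return the set of unique lowercase word tokens across all terms."""
    tokens = set()
    for term in terms:
        tokens.update(re.findall(r"[a-z]+", term.lower()))
    return tokens


def group_by_suffix(tokens):
    """Map each suffix to the set of roots that occur with it."""
    groups = defaultdict(set)
    for token in tokens:
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix):
                groups[suffix].add(token[: -len(suffix)])
    return groups


def suffix_regex(groups):
    """Build one alternation per suffix, e.g. (?:atr|myocard)ium."""
    return "|".join(
        f"(?:{'|'.join(sorted(roots))}){suffix}" for suffix, roots in sorted(groups.items())
    )


# Illustrative anatomical terms, not an extract from the FMA.
toy_terms = ["Left atrium", "Right atrium", "Myocardium", "Wall of left atrium", "Ectoderm", "Mesoderm"]
tokens = decompose(toy_terms)
print(len(toy_terms), "terms ->", len(tokens), "unique tokens")
print(suffix_regex(group_by_suffix(tokens)))
```

On a large ontology the number of unique tokens grows far more slowly than the number of terms, which is what makes the reduced resource small enough to hold in memory.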
Regular expressions
●
Used to match sequences of characters against some input
●
Written in a formal language that describes the patterns in the input
that we wish to match
●
For this task, we precompile sets of regular expressions (regex)
generated from the set of morphemes extracted from the ontology
●
We write recombination rules over the regexes which include stop-
words (determiners, prepositions) to identify candidate noun phrases
and prepositional phrases that look like ontology terms
Regular expression and pattern generation
●
Create regexes from the union of entries (with morphological variants)
in each set
– nounPattern = … macula | malleus | mandible |
manubri(um|a) | manus ...
●
Top and tail with word boundaries, with optional plurality
– noun = "\b(" + nounPattern + ")s?\b"
– adjective = "\b(" + adjPattern + ")\b"
●
Combine regex output with patterns
– NP = adjective{0,5} (noun | properNoun){1,5}
– PP = NP “of|on” NP
– Term = NP | PP
●
Test by running the patterns against the complete ontology – all terms
should be matched
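The generation and recombination steps above can be sketched as one composed regex. This is a simplification: the slide describes patterns applied over regex output, whereas here everything is folded into a single expression. The nouns are the slide's examples; the adjective list, the optional determiner and the function names are assumptions, and properNoun is omitted.

```python
import re

# Toy vocabularies: nouns from the slide, adjectives assumed for illustration.
noun_pattern = "macula|malleus|mandible|manubri(?:um|a)|manus"
adj_pattern = "left|right|lateral|medial"

# Top and tail with word boundaries, with optional plural "s" on nouns.
noun = r"\b(?:" + noun_pattern + r")s?\b"
adjective = r"\b(?:" + adj_pattern + r")\b"

# NP = adjective{0,5} noun{1,5}
NP = r"(?:" + adjective + r"\s+){0,5}" + noun + r"(?:\s+" + noun + r"){0,4}"
# PP = NP (of|on) NP, allowing a determiner as a stop-word (assumption).
PP = NP + r"\s+(?:of|on)\s+(?:the\s+)?" + NP
# Term = PP | NP, with PP tried first so the longer phrase wins.
TERM = re.compile(r"(?:" + PP + r")|(?:" + NP + r")", re.IGNORECASE)


def find_terms(text):
    """Return candidate ontology-like phrases found in free text."""
    return [m.group(0) for m in TERM.finditer(text)]


def all_terms_covered(ontology_terms):
    """Self-test from the slide: every source term should match in full."""
    return all(TERM.fullmatch(term) for term in ontology_terms)


print(find_terms("Fracture of the left mandible and lateral malleus of the manubrium."))
print(all_terms_covered(["left mandible", "manubria", "malleus of the manus"]))
```

Because the phrase 'lateral malleus of the manubrium' is assembled from parts, it can be recognised even if it never appears as a discrete term in the source ontology.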
Evaluation
●
Corpus of discharge summaries, progress notes, and surgical,
radiology and pathology reports (Savova et al 2011)
●
Manually annotated for mentions of anatomical and disease
concepts
●
Compare manually identified terms against system-generated
terms via semantic decomposition/recombination pattern approach
vs direct ontology lookup vs MetaMap
●
Calculate precision (tp / (tp + fp)), recall (tp / (tp + fn)), and F-measure
(2 * P * R / (P + R)), and Mann-Whitney U between approaches
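The three metrics can be written out as a short function. The counts in the usage example are hypothetical and are not taken from the evaluation corpus; the zero-denominator handling is an assumption.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw match counts."""
    # Guard against empty denominators (assumption: score 0.0 in that case).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts for one document.
p, r, f = precision_recall_f(tp=9, fp=1, fn=3)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

Per-document scores computed this way for two approaches could then be compared with a Mann-Whitney U test, for example via `scipy.stats.mannwhitneyu`.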
Results – Anatomical terms
Method         P            R     F            Time
Semantic       0.36 (0.89)  0.91  0.51 (0.90)  19s
Direct lookup  0.22 (0.54)  0.73  0.34 (0.62)  10s
MetaMap        0.30 (0.75)  0.86  0.44 (0.80)  2239s
Figures in parentheses denote results after corpus correction
Semantic vs direct lookup: significant increase in P and R (p < 0.01)
Semantic vs MetaMap: increase in P and R, but not significant (p > 0.05)
Error analysis – Anatomical terms
●
Many false positives (87.9%) were in fact correct terms – missing
from the manually annotated corpus
●
Adding these missing annotations increased precision from 0.36 to
0.89
●
Remaining FPs were partial matches, e.g. 'nonspecific bowel', 'a
haploidentical bone marrow', 'normal sinus', and non-specific
anatomical areas, e.g. 'multifocal areas', 'particular organ site',
'pruritic areas'.
●
Phrases not in the ontology as discrete terms picked up by
semantic method, e.g. 'angiolymphatic space', 'dentate line'
Results – Disease terms
Method         P     R     F     Time
Semantic       0.58  0.68  0.62  12s
Direct lookup  0.69  0.27  0.37  9s
MetaMap        0.46  0.83  0.59  1748s
Semantic vs direct lookup: significant increase in R (p << 0.01), significant
decrease in P (p < 0.01), overall significant increase in F (p < 0.01)
Semantic vs MetaMap: significant increase in P (p << 0.01), but significant
decrease in R (p < 0.01), overall increase in F but not significant (p > 0.05)
Error analysis – Disease terms
●
Factors affecting recall:
– Abbreviations (e.g. COPD)
– Definite descriptors ('the disease', 'her infirmity')
– Symptoms annotated as disease ('mood changes', 'double
vision')
●
Factors affecting precision
– Terms manually annotated as Symptoms being marked as
Disease e.g. 'difficulty walking'
– Some inconsistent manual annotation of negated terms, family
history etc
Conclusion
●
Semantic decomposition and regex/pattern-based recombination
of ontology terms is slightly slower than directly looking up terms
and synonyms extracted from the ontology, but leads to
significantly increased accuracy that balances precision and recall
●
Against MetaMap, the improvements are measurable but not
statistically significant for anatomical terms, while precision is
significantly improved for disease terms. Processing is also roughly
two orders of magnitude faster (19s vs 2239s for anatomical terms,
12s vs 1748s for disease terms).
●
Our findings are comparable to Shah et al (2009) for mGrep vs
MetaMap, but we now have a systematic method for creating new
concept recognisers from scratch
Further work
●
Calculate positional entropy of each morpheme and use these to
help generate patterns (e.g. some morphemes are more likely to
occur at the start or end of a pattern)
●
Improve lookup performance by using a radix trie (better for
morpheme sets that share long prefixes and suffixes) rather than
standard java.util.regex
●
Apply method to other biomedical ontologies
●
Evaluate against other corpora, e.g. annotated MedLine abstracts
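As a rough illustration of the trie idea in the further work above, the sketch below uses a plain character trie with longest-match lookup. It is not a radix trie (a radix trie would additionally merge single-child chains into one edge), and the class name and morpheme list are assumptions.

```python
class Trie:
    """Plain character trie; each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.terminal = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def longest_match(self, text, start=0):
        """Return the longest stored entry beginning at text[start], or None."""
        node, end = self, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.terminal:
                end = i + 1
        return text[start:end] if end is not None else None


# Hypothetical morpheme set sharing the prefix "cardi".
trie = Trie()
for morpheme in ["cardi", "cardio", "myo", "pathy"]:
    trie.insert(morpheme)

print(trie.longest_match("cardiomyopathy"))
print(trie.longest_match("cardiomyopathy", start=6))
```

Entries that share a prefix share nodes, so lookup cost depends on the length of the input rather than on the number of stored morphemes.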
