ACS National Meeting, Indianapolis, USA 8th September 2013
Tackling the difficult areas of
chemical entity extraction:
Misspelt chemical names and unconventional
entities
Daniel Lowe and Roger Sayle
NextMove Software
Cambridge, UK
Text mining is big business
2013 Bio-IT World Best Practices winner
Approaches to Entity recognition
• Dictionary based
• Grammar based
• Machine Learning
LeadMine
Approaches to Entity recognition
• Dictionary based approaches are ideal for
relating entities to concepts but only
recognise a finite number of terms
– Will not recognise novel compound names
• Hence for chemistry, dictionary approaches
need to be used in conjunction with another
method
Advantages of grammars
• Don’t require annotated corpora
• Encode knowledge about the domain
• Very fast recognition
• Allow spelling correction if an entity is a near
match to one recognised by the grammar
Simple grammar Example
Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Digit : Digit1to9 | '0'
Cid : 'CID:' Digit1to9 Digit*
[Figure: state machine for Cid: C, I, D, ':', then a digit 1..9, followed by a loop on digits 0..9]
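A grammar this small corresponds to a state machine that can be written by hand. The following is a minimal Python sketch of that idea, assuming the leading digit may be any of 1 to 9; the function name `matches_cid` is illustrative and is not part of LeadMine.

```python
def matches_cid(text: str) -> bool:
    """Hand-written state machine for: Cid : 'CID:' Digit1to9 Digit*"""
    prefix = "CID:"
    # States for 'C', 'I', 'D', ':' collapse into a literal prefix check.
    if not text.startswith(prefix):
        return False
    rest = text[len(prefix):]
    # Next state: exactly one leading digit from 1..9 is required.
    if not rest or rest[0] not in "123456789":
        return False
    # Final state: loop on any digit 0..9 until the input ends.
    return all(ch in "0123456789" for ch in rest[1:])
```

Because the first digit cannot be zero, `CID:2244` is accepted while `CID:0123` is rejected.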
Grammar for IUPAC names
• Grammar for complete molecules: 485 rules
– trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...
– ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...
• Generally aims to match a superset of the
nomenclature covered by IUPAC
• Specifically, this is the superset that can
theoretically be converted to structures
State machine size
[Figure: number of states required (0 to 14,000,000) plotted against recall on names from the Maybridge catalogue (0.82 to 1)]
Two Level State Machines
• Breaks problems into a state machine that
keeps track of when concepts have to be
matched and a state machine that matches
each concept e.g. an acyclic group
– Avoids duplication of states to match the same
concept in slightly different contexts
– Slower, as several candidate concepts may start
with the same characters
State machine revisited
[Figure: number of states required (0 to 14,000,000) plotted against recall on names from the Maybridge catalogue (0.82 to 1)]
Grammar inheritance
• Molecule grammar serves as a good starting
point for a substituent grammar or generic
chemical grammar
– Inherit rules rather than duplicate them
– Allow overriding of rules
pluralisedChemical : chemical 's'
elementaryMetalAtom : 'lanthanide' | 'lanthanoid' | 'transition metal' | 'transuranic element' | _elementaryMetalAtom
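One way to picture rule inheritance is as a dictionary merge. The sketch below is an assumption about the semantics rather than LeadMine's implementation: it treats each rule as a list of opaque alternatives and assumes that `_ruleName` inside an overriding rule stands for the parent's alternatives for that rule. The function `inherit` and the placeholder parent alternatives are hypothetical.

```python
def inherit(parent: dict[str, list[str]],
            overrides: dict[str, list[str]]) -> dict[str, list[str]]:
    """Build a child grammar from a parent grammar plus overriding rules."""
    # Start from a copy so the parent grammar is left untouched.
    child = {name: list(alts) for name, alts in parent.items()}
    for name, alts in overrides.items():
        expanded = []
        for alt in alts:
            if alt == "_" + name:
                # Assumed meaning: splice in the inherited alternatives.
                expanded.extend(parent.get(name, []))
            else:
                expanded.append(alt)
        child[name] = expanded
    return child


# Placeholder parent rule; the real molecule grammar is far larger.
molecule = {"elementaryMetalAtom": ["lithium", "sodium"]}
generic = inherit(molecule, {
    "elementaryMetalAtom": ["lanthanide", "lanthanoid", "transition metal",
                            "transuranic element", "_elementaryMetalAtom"],
    "pluralisedChemical": ["chemical 's'"],
})
```

The child grammar gains the new alternatives and the new `pluralisedChemical` rule without any rule text being duplicated.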
Unconventional entities #1
• Formulae:
– Sum formulae
• C20H25NO6
– Line formulae
• CH3CH2CH2Cl (complete molecule)
• CH2CH2 (linker)
• CH3CH2 (substituent)
– Salts
• MgSO4
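To make the formula examples concrete, here is a small Python sketch that counts atoms in an unbracketed formula. It is a regular-expression illustration, not the grammar described in this talk: it does not check that symbols are real elements and does not handle brackets or charges. The name `parse_sum_formula` is illustrative.

```python
import re
from collections import Counter

# An element-like symbol (capital plus optional lowercase) and an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_sum_formula(formula: str) -> dict[str, int]:
    """Count atoms in a formula such as 'C20H25NO6' or 'MgSO4'."""
    counts: Counter[str] = Counter()
    position = 0
    for match in TOKEN.finditer(formula):
        if match.start() != position:
            raise ValueError(f"unexpected text at offset {position}")
        symbol, number = match.groups()
        counts[symbol] += int(number) if number else 1
        position = match.end()
    if not formula or position != len(formula):
        raise ValueError(f"unexpected text at offset {position}")
    return dict(counts)
```

The same function also totals repeated symbols, so a line formula like `CH3CH2CH2Cl` yields three carbons, seven hydrogens and one chlorine.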
Unconventional entities #2
• Peptide formulae
– Cys-Tyr-Phe-Gln-Asn-Cys-Pro-Arg-Gly-NH2
• Oligosaccharides
– α-L-Fucp-(1→4)-[β-D-Galp-(1→3)]-β-D-GlcpNAc-(1→3)-β-D-Galp-(1→4)-D-Glc-ol
• Oligonucleotides
– 3'-AATG-5'
Unconventional entities #3
• Patent numbers
– U.S. Pat. No. 6,677,355
• Journal references
– (1974) J. Biol. Chem. 249, 4250-4256
• CAS numbers
– 90-13-1
• InChI and SMILES
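CAS numbers are a good example of an entity that can be validated as well as recognised, because the final digit is a checksum over the others. The sketch below shows that check in Python; the function name `is_valid_cas` is illustrative, and whether LeadMine applies this check is not stated in the slides.

```python
import re

# Two to seven digits, two digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(number: str) -> bool:
    """Return True if the string has CAS number form and a correct check digit."""
    match = CAS_PATTERN.match(number)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left and sum them.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))
```

For 90-13-1 the weighted sum is 3×1 + 1×2 + 0×3 + 9×4 = 41, and 41 mod 10 gives the check digit 1.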
Fast spelling correction
• Historically we have used Levenshtein-like distance measures
(all possible corrections)
• Only use spelling correction when recognition fails
• Allow a certain level of “look behind”
– 13 characters empirically found to yield identical results
– Speeds up spelling correction ~80%
• Dictionary of common English words can be used to prevent
attempting spelling correction
Words ignored for spelling correction (gray)
Exceptions to local errors
• Whether a space is allowed may only be
decidable once the suffix of a chemical name
is encountered
propyl bromochloromethanol → propylbromochloromethanol
propyl bromochloromethanoate
19 character look behind required!
BioCreative IV
• CHEMDNER (Chemical compound and drug
name recognition task)
• 10000 annotated PubMed abstracts (3500 for
training, 3500 for development and 3000 for
testing)
• Deadline for submission: This Thursday
Typical annotated Abstract
Dictionaries… bigger is better
• For high recall of trivial names dictionaries
with high coverage are required.
• The largest publicly available dictionary is
PubChem with over 94 million terms
• However most of these terms are either not
useful or actually detrimental to text mining
Aggressive filtering
• “what you don't see won't hurt you”
• Hence remove terms that are also English words or start with an
English word
– Accomplished using a large English dictionary with
chemistry terms removed
• Remove internal identifiers used by depositors
• Remove terms that are matched by our grammars
• Ultimate result: 94 million → less than 3 million
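The filtering steps above can be sketched as a single pass over the dictionary. The Python below is illustrative only: the function and parameter names are hypothetical, "starts with an English word" is assumed to mean the first whitespace-delimited token, and the depositor-identifier pattern and grammar check are toy stand-ins.

```python
import re


def filter_dictionary(terms, english_words, is_depositor_id, matched_by_grammar):
    """Drop dictionary terms likely to hurt text mining; keep the rest in order."""
    kept = []
    for term in terms:
        words = term.lower().split()
        if not words:
            continue  # skip empty entries
        # Remove terms that are English words or start with one.
        if words[0] in english_words:
            continue
        # Remove internal identifiers and anything the grammars already cover.
        if is_depositor_id(term) or matched_by_grammar(term):
            continue
        kept.append(term)
    return kept


ENGLISH = {"lead", "red", "dye"}  # toy English dictionary, chemistry terms removed
terms = ["lead", "aspirin", "red dye", "MLS000123456", "2-chloropropane"]
kept = filter_dictionary(
    terms,
    ENGLISH,
    is_depositor_id=lambda t: re.fullmatch(r"[A-Z]{3}\d{6,}", t) is not None,
    matched_by_grammar=lambda t: t == "2-chloropropane",
)
```

In this toy run only "aspirin" survives, which mirrors the large reduction in dictionary size reported on the slide.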
Structure Aware filtering
• “Do not tag proteins, polypeptides (> 15aa),
nucleic acid polymers, polysaccharides,
oligosaccharides [tetrasaccharide or longer] and other
biochemicals.”
• About 40,000 polypeptides and
oligosaccharides excluded from PubChem
using these criteria
Entity Extension
• Even PubChem is far from comprehensive, hence it can be
useful to extend the start and/or end of entities to avoid
partial hits
– α-santalol can be recognised from santalol in the
dictionary
• Extension is bracketing aware and blocked by English words
• Entity trimming also performed to comply with the
annotation guidelines
– ‘Allura Red AC dye’ → ‘Allura Red AC’
Entity Merging
• Adjacent entities may actually be a single
entity
– Ethyl ester → one entity
– (+)-limonene epoxide → one entity
BUT
– Hexane-benzene → two entities
Using an ontology to determine
when terms add information
• Genistein isoflavone → two entities
• Glycine ester → one entity
[Figure: genistein, showing the isoflavone core structure]
Abbreviation detection
• Based on the Schwartz and Hearst algorithm
• Detects abbreviations of the following forms:
– Tetrahydrofuran (THF)
– THF (tetrahydrofuran)
– Tetrahydrofuran (THF;
– (tetrahydrofuran, THF)
– THF = tetrahydrofuran
Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on
Biocomputing 2003.
Anti-abbreviation detection
• Finds entities detected as abbreviations of
unrecognised entities
– Can mean a common chemical abbreviation has
been redefined in the scope of the document
current good manufacturing practice (cGMP)
cGMP = Cyclic guanosine monophosphate [Figure: chemical structure of cGMP]
Grammars used
• Systematic molecule
• Systematic prefix
• Systematic generic name
• Registry number
• CAS number
• Chemical formulae
• Systematic polymer
• Semi systematic chemical name
– Systematic prefix + common trivial name/name from PubChem
Dictionaries used
• Noise words e.g. lead
• Trivial polymer
• Generic chemical terms (some from ChEBI)
• Common abbreviations
• Common trivial names
• Filtered PubChem
• Alloys
• Allotropes
• Minerals
Making the most of the knowledge
provided
• Use training data to identify terms that are
not currently recognised (a whitelist)
• Identify terms that are often false positives (a
blacklist)
• Each false positive and false negative is placed
into such a list if its inclusion increased F-score
(harmonic mean of precision/recall)
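The selection rule in the last bullet can be sketched as a greedy loop. This Python is an illustration under assumptions: `evaluate` is a hypothetical callback that returns (precision, recall) on the training data for a given list, and terms are considered one at a time in the order supplied, which the slides do not specify.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def select_terms(candidates, evaluate):
    """Keep a candidate term only if adding it raises the F-score."""
    selected = []
    best = f_score(*evaluate(selected))
    for term in candidates:
        score = f_score(*evaluate(selected + [term]))
        if score > best:
            selected.append(term)
            best = score
    return selected
```

With a precision of 0.87 and a recall of 0.82, `f_score` gives about 0.84, which matches the baseline reported for the development set in this talk.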
Results (on development set)
Configuration           Precision  Recall  F-score
Baseline                0.87       0.82    0.84
WhiteList               0.86       0.85    0.85
BlackList               0.88       0.80    0.84
WhiteList + BlackList   0.87       0.83    0.85
Future work
• Typically we are focused on generating
structures from the entities we recognise
– Line formula parsing
– Generic chemical name parsing (difficult to do in a
way that the results are not tied to a particular
toolkit)
• Grammars serve as an excellent starting
point for writing parsers
Conclusions
• Two level state machines allow many complicated
grammars to be represented by far fewer states
• Backtracking spelling correction can provide significant
speed improvements without affecting recall
• Check out our blog (nextmovesoftware.co.uk/blog) in a
couple of weeks to find out how we did in BioCreative!
daniel@nextmovesoftware.com
Tackling the difficult areas of chemical entity
extraction:
Misspelt chemical names and unconventional entities
Thank you for your attention

More Related Content

What's hot

CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
ICIC 2016: New Product Introduction CAS
ICIC 2016: New Product Introduction CASICIC 2016: New Product Introduction CAS
ICIC 2016: New Product Introduction CASDr. Haxel Consult
 
2020 scifinder-n manual (2020) english
2020 scifinder-n manual (2020) english2020 scifinder-n manual (2020) english
2020 scifinder-n manual (2020) englishPOSTECH Library
 
Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...NextMove Software
 
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?Dr. Haxel Consult
 
Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...NextMove Software
 
Ontologies neo4j-graph-workshop-berlin
Ontologies neo4j-graph-workshop-berlinOntologies neo4j-graph-workshop-berlin
Ontologies neo4j-graph-workshop-berlinSimon Jupp
 
How to search_free_crystallography_databases_benedictine_university final 111...
How to search_free_crystallography_databases_benedictine_university final 111...How to search_free_crystallography_databases_benedictine_university final 111...
How to search_free_crystallography_databases_benedictine_university final 111...Benedictine University Library
 
Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Valery Tkachenko
 
ICIC 2016: Mind the Gap: The novel benefits of human-curated substance locat...
ICIC 2016: Mind the Gap:  The novel benefits of human-curated substance locat...ICIC 2016: Mind the Gap:  The novel benefits of human-curated substance locat...
ICIC 2016: Mind the Gap: The novel benefits of human-curated substance locat...Dr. Haxel Consult
 
Importing life science at a into Neo4j
Importing life science at a into Neo4jImporting life science at a into Neo4j
Importing life science at a into Neo4jSimon Jupp
 
Building a repository of biomedical ontologies with Neo4j
Building a repository of biomedical ontologies with Neo4jBuilding a repository of biomedical ontologies with Neo4j
Building a repository of biomedical ontologies with Neo4jSimon Jupp
 
Building (and traveling) the data-brick road: A report from the front lines ...
Building (and traveling) the data-brick road:  A report from the front lines ...Building (and traveling) the data-brick road:  A report from the front lines ...
Building (and traveling) the data-brick road: A report from the front lines ...mhaendel
 
Equivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholderEquivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholdermhaendel
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2Daniel JACOB
 

What's hot (20)

CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
ICIC 2016: New Product Introduction CAS
ICIC 2016: New Product Introduction CASICIC 2016: New Product Introduction CAS
ICIC 2016: New Product Introduction CAS
 
2020 scifinder-n manual (2020) english
2020 scifinder-n manual (2020) english2020 scifinder-n manual (2020) english
2020 scifinder-n manual (2020) english
 
The importance of the InChI identifier as a foundation technology for eScienc...
The importance of the InChI identifier as a foundation technology for eScienc...The importance of the InChI identifier as a foundation technology for eScienc...
The importance of the InChI identifier as a foundation technology for eScienc...
 
Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...Classification, representation and analysis of cyclic peptides and peptide-li...
Classification, representation and analysis of cyclic peptides and peptide-li...
 
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?
ICIC 2017: Freeware and public databases: Towards a Wiki Drug Discovery?
 
Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...Standardized Representations of ELN Reactions for Categorization and Duplicat...
Standardized Representations of ELN Reactions for Categorization and Duplicat...
 
Why Chemistry and the Web Will Benefit from a ChemSpider
Why Chemistry and the Web Will Benefit from a ChemSpiderWhy Chemistry and the Web Will Benefit from a ChemSpider
Why Chemistry and the Web Will Benefit from a ChemSpider
 
Ontologies neo4j-graph-workshop-berlin
Ontologies neo4j-graph-workshop-berlinOntologies neo4j-graph-workshop-berlin
Ontologies neo4j-graph-workshop-berlin
 
Data Mining Dissertations and Adventures and Experiences in the World of Chem...
Data Mining Dissertations and Adventures and Experiences in the World of Chem...Data Mining Dissertations and Adventures and Experiences in the World of Chem...
Data Mining Dissertations and Adventures and Experiences in the World of Chem...
 
How to search_free_crystallography_databases_benedictine_university final 111...
How to search_free_crystallography_databases_benedictine_university final 111...How to search_free_crystallography_databases_benedictine_university final 111...
How to search_free_crystallography_databases_benedictine_university final 111...
 
Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...Building linked data large-scale chemistry platform - challenges, lessons and...
Building linked data large-scale chemistry platform - challenges, lessons and...
 
ChemSpider - Building a Foundation for the Semantic Web by Hosting a Crowd So...
ChemSpider - Building a Foundation for the Semantic Web by Hosting a Crowd So...ChemSpider - Building a Foundation for the Semantic Web by Hosting a Crowd So...
ChemSpider - Building a Foundation for the Semantic Web by Hosting a Crowd So...
 
ICIC 2016: Mind the Gap: The novel benefits of human-curated substance locat...
ICIC 2016: Mind the Gap:  The novel benefits of human-curated substance locat...ICIC 2016: Mind the Gap:  The novel benefits of human-curated substance locat...
ICIC 2016: Mind the Gap: The novel benefits of human-curated substance locat...
 
Importing life science at a into Neo4j
Importing life science at a into Neo4jImporting life science at a into Neo4j
Importing life science at a into Neo4j
 
Building a repository of biomedical ontologies with Neo4j
Building a repository of biomedical ontologies with Neo4jBuilding a repository of biomedical ontologies with Neo4j
Building a repository of biomedical ontologies with Neo4j
 
Building (and traveling) the data-brick road: A report from the front lines ...
Building (and traveling) the data-brick road:  A report from the front lines ...Building (and traveling) the data-brick road:  A report from the front lines ...
Building (and traveling) the data-brick road: A report from the front lines ...
 
Equivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholderEquivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholder
 
Value of the mediawiki platform for providing content to the chemistry community
Value of the mediawiki platform for providing content to the chemistry communityValue of the mediawiki platform for providing content to the chemistry community
Value of the mediawiki platform for providing content to the chemistry community
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2
 

Viewers also liked

Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...
Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...
Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...NextMove Software
 
作業管理をコトゴトで毎日の作業とキモチの変化をキロクに残そう。
作業管理をコトゴトで毎日の作業とキモチの変化をキロクに残そう。作業管理をコトゴトで毎日の作業とキモチの変化をキロクに残そう。
作業管理をコトゴトで毎日の作業とキモチの変化をキロクに残そう。青島 英和
 
Маркетингот на Чудаците
Маркетингот на ЧудацитеМаркетингот на Чудаците
Маркетингот на ЧудацитеBlaze Arizanov
 
레드9카지노 싸이트 『OX600』。『COM』마작룰 싸이트
레드9카지노 싸이트 『OX600』。『COM』마작룰 싸이트레드9카지노 싸이트 『OX600』。『COM』마작룰 싸이트
레드9카지노 싸이트 『OX600』。『COM』마작룰 싸이트gjsokdfjl
 
Music video analysis of Jessie J
Music video analysis of Jessie JMusic video analysis of Jessie J
Music video analysis of Jessie Jobennett92
 
Aasaanjobs: Why Should Aasaanjobs be your prime hiring partner?
Aasaanjobs: Why Should Aasaanjobs be your prime hiring partner?Aasaanjobs: Why Should Aasaanjobs be your prime hiring partner?
Aasaanjobs: Why Should Aasaanjobs be your prime hiring partner?Aditya Gupta
 
Presentación Creativa para Oksi
Presentación Creativa para Oksi Presentación Creativa para Oksi
Presentación Creativa para Oksi Huwen Arnone
 
Why It Pays To Be Likeable
Why It Pays To Be LikeableWhy It Pays To Be Likeable
Why It Pays To Be LikeableDave Kerpen
 
Acceptable Sweetness: How FDA Sets Acceptable Daily Intake Levels
Acceptable Sweetness: How FDA Sets Acceptable Daily Intake LevelsAcceptable Sweetness: How FDA Sets Acceptable Daily Intake Levels
Acceptable Sweetness: How FDA Sets Acceptable Daily Intake LevelsFood Insight
 
Hablemos del Romanticismo. VI: El Romanticismo al otro lado del Atlántico: Am...
Hablemos del Romanticismo. VI: El Romanticismo al otro lado del Atlántico: Am...Hablemos del Romanticismo. VI: El Romanticismo al otro lado del Atlántico: Am...
Hablemos del Romanticismo. VI: El Romanticismo al otro lado del Atlántico: Am...Museo del Romanticismo
 
Membangun Kalimantan Berbasis Sistem Inovasi Daerah
Membangun Kalimantan Berbasis Sistem Inovasi DaerahMembangun Kalimantan Berbasis Sistem Inovasi Daerah
Membangun Kalimantan Berbasis Sistem Inovasi DaerahTri Widodo W. UTOMO
 

Viewers also liked (14)

Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...
Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...
Reading and Writing Molecular File Formats for Data Exchange of Small Molecul...
 
作業管理をコトゴトで毎日の作業とキモチの変化をキロクに残そう。
作業管理をコトゴトで毎日の作業とキモチの変化をキロクに残そう。作業管理をコトゴトで毎日の作業とキモチの変化をキロクに残そう。
作業管理をコトゴトで毎日の作業とキモチの変化をキロクに残そう。
 
Маркетингот на Чудаците
Маркетингот на ЧудацитеМаркетингот на Чудаците
Маркетингот на Чудаците
 
Zaragoza turismo-107
Zaragoza turismo-107Zaragoza turismo-107
Zaragoza turismo-107
 
레드9카지노 싸이트 『OX600』。『COM』마작룰 싸이트
레드9카지노 싸이트 『OX600』。『COM』마작룰 싸이트레드9카지노 싸이트 『OX600』。『COM』마작룰 싸이트
레드9카지노 싸이트 『OX600』。『COM』마작룰 싸이트
 
PDAC 2016 is coming!
PDAC 2016 is coming!PDAC 2016 is coming!
PDAC 2016 is coming!
 
Responsive design by Batavianet
Responsive design by BatavianetResponsive design by Batavianet
Responsive design by Batavianet
 
Music video analysis of Jessie J
Music video analysis of Jessie JMusic video analysis of Jessie J
Music video analysis of Jessie J
 
Aasaanjobs: Why Should Aasaanjobs be your prime hiring partner?
Aasaanjobs: Why Should Aasaanjobs be your prime hiring partner?Aasaanjobs: Why Should Aasaanjobs be your prime hiring partner?
Aasaanjobs: Why Should Aasaanjobs be your prime hiring partner?
 
Presentación Creativa para Oksi
Presentación Creativa para Oksi Presentación Creativa para Oksi
Presentación Creativa para Oksi
 
Why It Pays To Be Likeable
Why It Pays To Be LikeableWhy It Pays To Be Likeable
Why It Pays To Be Likeable
 
Acceptable Sweetness: How FDA Sets Acceptable Daily Intake Levels
Acceptable Sweetness: How FDA Sets Acceptable Daily Intake LevelsAcceptable Sweetness: How FDA Sets Acceptable Daily Intake Levels
Acceptable Sweetness: How FDA Sets Acceptable Daily Intake Levels
 
Hablemos del Romanticismo. VI: El Romanticismo al otro lado del Atlántico: Am...
Hablemos del Romanticismo. VI: El Romanticismo al otro lado del Atlántico: Am...Hablemos del Romanticismo. VI: El Romanticismo al otro lado del Atlántico: Am...
Hablemos del Romanticismo. VI: El Romanticismo al otro lado del Atlántico: Am...
 
Membangun Kalimantan Berbasis Sistem Inovasi Daerah
Membangun Kalimantan Berbasis Sistem Inovasi DaerahMembangun Kalimantan Berbasis Sistem Inovasi Daerah
Membangun Kalimantan Berbasis Sistem Inovasi Daerah
 

Similar to Tackling the difficult areas of chemical entity extraction

ICIC 2013 New Product Introductions CAS
ICIC 2013 New Product Introductions CASICIC 2013 New Product Introductions CAS
ICIC 2013 New Product Introductions CASDr. Haxel Consult
 
Paramedicine slide share
Paramedicine slide shareParamedicine slide share
Paramedicine slide shareAndrea Francis
 
The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...greatsalvation813
 
CAS: Transforming Discovery
CAS: Transforming DiscoveryCAS: Transforming Discovery
CAS: Transforming DiscoveryCAS
 
Enabling Exploration Through Text Analytics
Enabling Exploration Through Text AnalyticsEnabling Exploration Through Text Analytics
Enabling Exploration Through Text AnalyticsDaniel Tunkelang
 
Chemical Text Mining for Current Awareness of Pharmaceutical Patents
Chemical Text Mining for Current Awareness of Pharmaceutical PatentsChemical Text Mining for Current Awareness of Pharmaceutical Patents
Chemical Text Mining for Current Awareness of Pharmaceutical Patentsdan2097
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 
Search Terms And Strategies
Search Terms And StrategiesSearch Terms And Strategies
Search Terms And Strategieskjurecki
 
BioSamples Database Linked Data, SWAT4LS Tutorial
BioSamples Database Linked Data, SWAT4LS TutorialBioSamples Database Linked Data, SWAT4LS Tutorial
BioSamples Database Linked Data, SWAT4LS TutorialRothamsted Research, UK
 
A Presentation At Nature Publishing Group Crowdsourcing, Collaborations And T...
A Presentation At Nature Publishing Group Crowdsourcing, Collaborations And T...A Presentation At Nature Publishing Group Crowdsourcing, Collaborations And T...
A Presentation At Nature Publishing Group Crowdsourcing, Collaborations And T...guest01a117
 

Similar to Tackling the difficult areas of chemical entity extraction (20)

ICIC 2013 New Product Introductions CAS
ICIC 2013 New Product Introductions CASICIC 2013 New Product Introductions CAS
ICIC 2013 New Product Introductions CAS
 
Serving the medicinal chemistry community with Royal Society of Chemistry che...
Serving the medicinal chemistry community with Royal Society of Chemistry che...Serving the medicinal chemistry community with Royal Society of Chemistry che...
Serving the medicinal chemistry community with Royal Society of Chemistry che...
 
Crowdsourcing, Collaborations And Text Mining In A World Of Open Chemistry
Crowdsourcing, Collaborations And Text Mining In A World Of Open ChemistryCrowdsourcing, Collaborations And Text Mining In A World Of Open Chemistry
Crowdsourcing, Collaborations And Text Mining In A World Of Open Chemistry
 
Searching for evidence - Paramedicine
Searching for evidence - ParamedicineSearching for evidence - Paramedicine
Searching for evidence - Paramedicine
 
Information retrieval guide
Information retrieval guideInformation retrieval guide
Information retrieval guide
 
ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemi...
ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemi...ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemi...
ChemSpider as a Foundation for Crowdsourcing and Collaborations in Open Chemi...
 
Pharmacy libguide slideshare
Pharmacy libguide slidesharePharmacy libguide slideshare
Pharmacy libguide slideshare
 
Paramedicine slide share
Paramedicine slide shareParamedicine slide share
Paramedicine slide share
 
Taming The Wild West Of Internet Based Chemistry You Can Help
Taming The Wild West Of Internet Based Chemistry You Can HelpTaming The Wild West Of Internet Based Chemistry You Can Help
Taming The Wild West Of Internet Based Chemistry You Can Help
 
The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...The impact of domain-specific stop-word lists on ecommerce website search per...
The impact of domain-specific stop-word lists on ecommerce website search per...
 
CAS: Transforming Discovery
CAS: Transforming DiscoveryCAS: Transforming Discovery
CAS: Transforming Discovery
 
IC-SDV 2019: Elsevier
IC-SDV 2019: ElsevierIC-SDV 2019: Elsevier
IC-SDV 2019: Elsevier
 
Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Comm...
Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Comm...Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Comm...
Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Comm...
 
RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build...
RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build...RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build...
RSC ChemSpider -- Managing and Integrating Chemistry on the Internet to Build...
 
Enabling Exploration Through Text Analytics
Enabling Exploration Through Text AnalyticsEnabling Exploration Through Text Analytics
Enabling Exploration Through Text Analytics
 
Chemical Text Mining for Current Awareness of Pharmaceutical Patents
Chemical Text Mining for Current Awareness of Pharmaceutical PatentsChemical Text Mining for Current Awareness of Pharmaceutical Patents
Chemical Text Mining for Current Awareness of Pharmaceutical Patents
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 
Search Terms And Strategies
Search Terms And StrategiesSearch Terms And Strategies
Search Terms And Strategies
 
BioSamples Database Linked Data, SWAT4LS Tutorial
BioSamples Database Linked Data, SWAT4LS TutorialBioSamples Database Linked Data, SWAT4LS Tutorial
BioSamples Database Linked Data, SWAT4LS Tutorial
 
A Presentation At Nature Publishing Group Crowdsourcing, Collaborations And T...
A Presentation At Nature Publishing Group Crowdsourcing, Collaborations And T...A Presentation At Nature Publishing Group Crowdsourcing, Collaborations And T...
A Presentation At Nature Publishing Group Crowdsourcing, Collaborations And T...
 

Tackling the difficult areas of chemical entity extraction

  • 1. ACS National Meeting, Indianapolis, USA 8th September 2013 Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities Daniel Lowe and Roger Sayle NextMove Software Cambridge, UK
  • 2. Text mining is big business 2013 Bio-IT World Best Practices winner
  • 3. Approaches to Entity recognition • Dictionary based • Grammar based • Machine Learning LeadMine
  • 4. Approaches to Entity recognition • Dictionary based approaches are ideal for relating entities to concepts but only recognise a finite number of terms – Will not recognise novel compound names • Hence for chemistry, dictionary approaches need to be used in conjunction with another method
  • 5. Advantages of grammars • Don’t require annotated corpora • Encode knowledge about the domain • Very fast recognition • Allow spelling correction if an entity is a near match to one recognised by the grammar
  • 6. Simple grammar example
        Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
        Digit : Digit1to9 | '0'
        Cid : 'CID:' Digit1to9 Digit*
        [Figure: state machine for Cid: C, I, D, ':', then 1..9, then 0..9 repeated]
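The Cid rule on slide 6 describes a regular language, so it can be checked with a small hand-written state machine. The following is only an illustrative sketch: the function name `matches_cid` and the state numbering are assumptions made here, not LeadMine's implementation.

```python
def matches_cid(text: str) -> bool:
    """Return True if text matches the rule: 'CID:' Digit1to9 Digit*"""
    prefix = "CID:"
    # States 0..3: number of prefix characters consumed so far.
    # State 4: prefix complete, a non-zero first digit is required.
    # State 5: at least one digit seen (the only accepting state).
    state = 0
    for ch in text:
        if state < len(prefix):
            if ch != prefix[state]:
                return False
            state += 1
        elif state == 4:
            if ch not in "123456789":
                return False
            state = 5
        else:
            if ch not in "0123456789":
                return False
    return state == 5


print(matches_cid("CID:2244"))  # leading digit is non-zero
print(matches_cid("CID:0123"))  # leading zero is rejected by Digit1to9
```

Because each character is examined once and the state is a single integer, recognition cost grows linearly with the input length, which is consistent with the "very fast recognition" point on slide 5.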
  • 7. Grammar for IUPAC names • Grammar for complete molecules: 485 rules – trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'... – ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ... • Generally aims to match a superset of the nomenclature covered by IUPAC • Specifically this is the superset that can theoretically be converted to structures
  • 8. State machine size [Figure: states required (axis 0 to 14,000,000) plotted against recall on names from the Maybridge catalogue (axis 0.82 to 1)]
  • 9. Two Level State Machines • Break the problem into one state machine that tracks when concepts have to be matched and a second state machine that matches each concept, e.g. an acyclic group – Avoids duplication of states to match the same concept in slightly different contexts – Slower, as several candidate concepts may start with the same characters
  • 10. State machine revisited [Figure: states required (axis 0 to 14,000,000) plotted against recall on names from the Maybridge catalogue (axis 0.82 to 1)]
  • 11. Grammar inheritance • Molecule grammar serves as a good starting point for a substituent grammar or generic chemical grammar – Inherit rules rather than duplicate them – Allow overriding of rules
        pluralisedChemical : chemical 's'
        elementaryMetalAtom : 'lanthanide' | 'lanthanoid' | 'transition metal' | 'transuranic element' | _elementaryMetalAtom
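The inheritance and overriding described on slide 11 can be sketched with a rule table that falls back to a parent grammar. This is a minimal illustration under stated assumptions: the `Grammar` class, the `alternatives` method, and the parent's terminals ('lithium', 'sodium', 'fluorine') are hypothetical placeholders, and only the leading-underscore convention for "the inherited definition" is taken from the slide.

```python
class Grammar:
    """A toy grammar: each rule maps to a list of terminal alternatives."""

    def __init__(self, rules, parent=None):
        self.rules = dict(rules)
        self.parent = parent

    def alternatives(self, name):
        """Resolve a rule, inheriting from the parent when it is not overridden."""
        if name not in self.rules:
            if self.parent is None:
                raise KeyError(name)
            return self.parent.alternatives(name)
        resolved = []
        for alt in self.rules[name]:
            if alt.startswith("_"):
                # '_rule' means "whatever the parent grammar defines for rule".
                # (Assumes a parent exists; a root grammar must not use it.)
                resolved.extend(self.parent.alternatives(alt[1:]))
            else:
                resolved.append(alt)
        return resolved


# Hypothetical parent grammar; the real molecule grammar has 485 rules.
molecule = Grammar({
    "elementaryMetalAtom": ["lithium", "sodium"],
    "halogen": ["fluorine"],
})

# Child grammar overrides one rule and extends the inherited definition.
generic = Grammar({
    "elementaryMetalAtom": ["lanthanide", "lanthanoid", "transition metal",
                            "transuranic element", "_elementaryMetalAtom"],
}, parent=molecule)

print(generic.alternatives("elementaryMetalAtom"))
print(generic.alternatives("halogen"))  # inherited unchanged
```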
  • 12. Unconventional entities #1 • Formulae: – Sum formulae • C20H25NO6 – Line formulae • CH3CH2CH2Cl (complete molecule) • CH2CH2 (linker) • CH3CH2 (substituent) – Salts • MgSO4
  • 13. Unconventional entities #2 • Peptide formulae – Cys-Tyr-Phe-Gln-Asn-Cys-Pro-Arg-Gly-NH2 • Oligosaccharides – α-L-Fucp-(1→4)-[β-D-Galp-(1→3)]-β-D-GlcpNAc-(1→3)-β-D-Galp-(1→4)-D-Glc-ol • Oligonucleotides – 3'-AATG-5'
  • 14. Unconventional entities #3 • Patent numbers – U.S. Pat. No. 6,677,355 • Journal references – (1974) J. Biol. Chem. 249, 4250-4256 • CAS numbers – 90-13-1 • InChI and SMILES
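CAS numbers such as the 90-13-1 example on slide 14 end in a check digit, which a recogniser could use to reject look-alike strings such as dates or page ranges. The slides do not say whether LeadMine does this; the sketch below only illustrates the standard check-digit rule, and the names `CAS_PATTERN` and `is_valid_cas` are assumptions.

```python
import re

# Two to seven digits, two digits, one check digit.
CAS_PATTERN = re.compile(r"([0-9]{2,7})-([0-9]{2})-([0-9])")


def is_valid_cas(text: str) -> bool:
    """Check the format and the check digit of a candidate CAS number."""
    match = CAS_PATTERN.fullmatch(text)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d)
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


print(is_valid_cas("90-13-1"))  # the example from the slide
print(is_valid_cas("90-13-2"))  # same digits, wrong check digit
```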
  • 15. navigating
  • 16. Fast spelling correction • Historically we have used Levenshtein-like distance measures (all possible corrections) • Only use spelling correction when recognition fails • Allow a certain level of “look behind” – 13 characters empirically found to yield identical results – Speeds up spelling correction by ~80% • A dictionary of common English words can be used to skip spelling correction for those words
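The policy on slide 16 (correct only when recognition fails, and never for common English words) can be sketched with a plain Levenshtein distance. This is a simplified illustration, not the authors' method: a small set of known names stands in for the grammar, the look-behind optimisation is not modelled, and `correct`, `known_names`, `english_words` and `max_distance` are hypothetical names.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]


def correct(token, known_names, english_words, max_distance=1):
    """Return the recognised or corrected name, or None if nothing applies."""
    if token in known_names:
        return token                      # recognition succeeded, no correction
    if token.lower() in english_words:
        return None                       # common English word, do not correct
    best, best_distance = None, max_distance + 1
    for name in sorted(known_names):
        distance = levenshtein(token, name)
        if distance < best_distance:
            best, best_distance = name, distance
    return best


names = {"ethanol", "methanol"}
english = {"lead", "the"}
print(correct("ethanoll", names, english))
print(correct("lead", names, english))
```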
  • 17. Words Ignored for spelling correction (gray)
  • 18. Exceptions to local errors • Whether a space is allowed may only be decidable once the suffix of a chemical name is encountered: propyl bromochloromethanol → propylbromochloromethanol, but propyl bromochloromethanoate keeps its space (it is an ester). 19 character look behind required!
  • 19. BioCreative IV • CHEMDNER (Chemical compound and drug name recognition task) • 10000 annotated PubMed abstracts (3500 for training, 3500 for development and 3000 for testing) • Deadline for submission: This Thursday
  • 20. Typical annotated Abstract
  • 21. Dictionaries… bigger is better • For high recall of trivial names, dictionaries with high coverage are required. • The largest publicly available dictionary is PubChem with over 94 million terms • However most of these terms are either not useful or actually detrimental to text mining
  • 22. Aggressive filtering • “what you don't see won't hurt you” • Hence remove terms that are also English words or start with an English word – Accomplished using a large English dictionary with chemistry terms removed • Remove internal identifiers used by depositors • Remove terms that are matched by our grammars • Ultimate result: 94 million → fewer than 3 million
  • 23. Structure Aware filtering • “Do not tag proteins, polypeptides (> 15aa), nucleic acid polymers, polysaccharides, oligosaccharides [tetrasaccharide or longer] and other biochemicals.” • About 40,000 polypeptides and oligosaccharides excluded from PubChem using these criteria
  • 24. Entity Extension • Even PubChem is far from comprehensive, hence it can be useful to extend the start and/or end of entities to avoid partial hits – α-santalol can be recognised from santalol in the dictionary • Extension is bracketing aware and blocked by English words • Entity trimming also performed to comply with the annotation guidelines – ‘Allura Red AC dye’ → ‘Allura Red AC’
  • 25. Entity Merging • Adjacent entities may actually be the same entity – Ethyl ester → one entity – (+)-limonene epoxide → one entity BUT – Hexane-benzene → two entities
  • 26. Using an ontology to determine when terms add information • Genistein isoflavone → two entities • Glycine ester → one entity [Figure: genistein showing isoflavone core structure]
  • 27. Abbreviation detection • Based on the Hearst and Schwartz algorithm • Detects abbreviations of the following forms: – Tetrahydrofuran (THF) – THF (tetrahydrofuran) – Tetrahydrofuran (THF; – (tetrahydrofuran, THF) – THF = tetrahydrofuran Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on Biocomputing 2003.
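The core of the cited Schwartz and Hearst approach is a right-to-left match of the short form's characters against the candidate long form, with the first character required to start a word. The sketch below is a Python adaptation of that matching step only; candidate extraction from the surrounding sentence and the extra forms listed on slide 27 are not covered, and the function name `find_best_long_form` is an assumption.

```python
def find_best_long_form(short_form: str, long_form: str):
    """Return the shortest suffix of long_form that explains short_form, or None."""
    s_index = len(short_form) - 1
    l_index = len(long_form) - 1
    while s_index >= 0:
        current = short_form[s_index].lower()
        if not current.isalnum():
            s_index -= 1              # skip punctuation in the short form
            continue
        # Walk left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while ((l_index >= 0 and long_form[l_index].lower() != current)
               or (s_index == 0 and l_index > 0
                   and long_form[l_index - 1].isalnum())):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    # Extend left to the beginning of the word that holds the first match.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


print(find_best_long_form("THF", "dissolved in tetrahydrofuran"))
print(find_best_long_form("cGMP", "the weather was nice"))
```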
  • 28. Anti-abbreviation detection • Finds entities detected as abbreviations of unrecognised entities – Can mean a common chemical abbreviation has been redefined in the scope of the document, e.g. current good manufacturing practice (cGMP), where normally cGMP = Cyclic guanosine monophosphate
  • 29. Grammars used • Systematic molecule • Systematic prefix • Systematic generic name • Registry number • CAS number • Chemical formulae • Systematic polymer • Semi systematic chemical name – Systematic prefix + common trivial name/name from PubChem
  • 30. Dictionaries used • Noise words e.g. lead • Trivial polymer • Generic chemical terms (some from ChEBI) • Common abbreviations • Common trivial names • Filtered PubChem • Alloys • Allotropes • Minerals
  • 31. Making the most of the knowledge provided • Use training data to identify terms that are not currently recognised (a whitelist) • Identify terms that are often false positives (a blacklist) • Each false positive and false negative is placed into such a list if its inclusion increases the F-score (harmonic mean of precision and recall)
  • 32. Results (on development set)
        Configuration          Precision  Recall  F-score
        Baseline               0.87       0.82    0.84
        WhiteList              0.86       0.85    0.85
        BlackList              0.88       0.80    0.84
        WhiteList + BlackList  0.87       0.83    0.85
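The F-score column on slide 32 follows from the precision and recall columns as their harmonic mean, the definition given on slide 31. The short check below recomputes it from the reported figures; the function name `f_score` and the `results` dictionary are illustrative names introduced here.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported on the slide.
results = {
    "Baseline": (0.87, 0.82),
    "WhiteList": (0.86, 0.85),
    "BlackList": (0.88, 0.80),
    "WhiteList + BlackList": (0.87, 0.83),
}

for name, (precision, recall) in results.items():
    print(f"{name}: {f_score(precision, recall):.2f}")
```

Since the inputs are already rounded to two decimals, the recomputed values are only expected to agree with the slide to that precision.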
  • 33. Future work • Typically we are focused on generating structures from the entities we recognise – Line formula parsing – Generic chemical name parsing (difficult to do in a way that the results are not tied to a particular toolkit) • Grammars serve as an excellent starting point for writing parsers
  • 34. Conclusions • Two level state machines allow many complicated grammars to be represented by far fewer states • Backtracking spelling correction can provide significant speed improvements without affecting recall • Check out our blog (nextmovesoftware.co.uk/blog) in a couple of weeks to find out how we did in BioCreative!
  • 35. daniel@nextmovesoftware.com Tackling the difficult areas of chemical entity extraction: Misspelt chemical names and unconventional entities Thank you for your attention