Tackling the difficult areas of chemical entity extraction

ACS National Meeting, Indianapolis, USA 8th September 2013
Tackling the difficult areas of
chemical entity extraction:
Misspelt chemical names and unconventional
entities
Daniel Lowe and Roger Sayle
NextMove Software
Cambridge, UK

Text mining is big business
2013 Bio-IT World Best Practices winner

Approaches to Entity recognition
• Dictionary based
• Grammar based
• Machine Learning
LeadMine

Approaches to Entity recognition
• Dictionary based approaches are ideal for
relating entities to concepts but only
recognise a finite number of terms
– Will not recognise novel compound names
• Hence for chemistry, dictionary approaches
need to be used in conjunction with another
method

Advantages of grammars
• Don’t require annotated corpora
• Encode knowledge about the domain
• Very fast recognition
• Allow spelling correction if an entity is a near
match to one recognised by the grammar

Simple grammar Example
Digit1to9 : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Digit : Digit1to9 | '0'
Cid : 'CID:' Digit1to9 Digit*
[Figure: state machine for Cid, matching C, I, D, ':' and 1..9 in sequence, then looping on 0..9]
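
The Cid rule is small enough to write out by hand. The sketch below is illustrative only: it shows the same language once as a regular expression and once as an explicit state machine, and the function names are assumptions, not part of LeadMine.

```python
import re

# 'CID:' followed by a non-zero digit and then any number of digits.
CID_PATTERN = re.compile(r"CID:[1-9][0-9]*")


def is_cid(token: str) -> bool:
    """Return True if the whole token is matched by the Cid rule."""
    return CID_PATTERN.fullmatch(token) is not None


def is_cid_state_machine(token: str) -> bool:
    """Same language as is_cid, written as an explicit state machine."""
    prefix = "CID:"
    # States 0..3 consume the prefix, state 4 expects 1..9,
    # state 5 is accepting and loops on 0..9.
    state = 0
    for ch in token:
        if state < 4:
            if ch != prefix[state]:
                return False
            state += 1
        elif state == 4:
            if ch not in "123456789":
                return False
            state = 5
        else:
            if ch not in "0123456789":
                return False
    return state == 5
```

Both functions accept "CID:2244" and reject "CID:0123", since the first digit may not be zero.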

Grammar for IUPAC names
• Grammar for complete molecules: 485 rules
– trivialRing : 'aceanthren'|'aceanthrylen'|'acenaphthen'...
– ringGroup : trivialRing | hantzschWidmanRing | vonBaeyerSystem ...
• Generally aims to match a superset of the
nomenclature covered by IUPAC
• Specifically this is the superset that can
theoretically be converted to structures

State machine size
[Figure: states required (0 to 14,000,000) plotted against recall on names from the Maybridge catalogue (0.82 to 1)]

Two Level State Machines
• Breaks problems into a state machine that
keeps track of when concepts have to be
matched and a state machine that matches
each concept e.g. an acyclic group
– Avoids duplication of states to match the same
concept in slightly different contexts
– Slower as multiple concepts may be possible that
are allowed to start with the same characters
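
A toy illustration of the two-level idea, under stated assumptions: the concept names, the tiny vocabularies and the transition table are all hypothetical, and regular expressions stand in for the lower-level state machines. The point is only that the "substituent" matcher is defined once and reused from two upper-level states.

```python
import re

# Lower level: one small matcher per concept (hypothetical vocabularies).
CONCEPTS = {
    "locant": re.compile(r"[0-9]+-"),
    "substituent": re.compile(r"methyl|ethyl|chloro|bromo"),
    "acyclicGroup": re.compile(r"methane|ethane|propane|butane"),
}

# Upper level: which concept must be matched to move between states.
TRANSITIONS = {
    "start": [
        ("locant", "afterLocant"),
        ("substituent", "start"),
        ("acyclicGroup", "end"),
    ],
    "afterLocant": [("substituent", "start")],
    "end": [],
}


def matches(name: str) -> bool:
    """Return True if the upper-level machine can consume the whole name."""
    stack = [("start", 0)]
    seen = set()
    while stack:
        state, pos = stack.pop()
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if state == "end" and pos == len(name):
            return True
        # Several concepts may be possible at the same position, so each
        # one is tried; this is the source of the slowdown noted above.
        for concept, next_state in TRANSITIONS[state]:
            m = CONCEPTS[concept].match(name, pos)
            if m:
                stack.append((next_state, m.end()))
    return False
```

With these toy tables, matches("2-chloroethane") and matches("methylpropane") succeed, while matches("chloro") fails because no acyclic group follows the substituent.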

State machine revisited
[Figure: states required (0 to 14,000,000) plotted against recall on names from the Maybridge catalogue (0.82 to 1)]

Grammar inheritance
• Molecule grammar serves as a good starting
point for a substituent grammar or generic
chemical grammar
– Inherit rules rather than duplicate them
– Allow overriding of rules
pluralisedChemical : chemical 's'
elementaryMetalAtom : 'lanthanide' | 'lanthanoid' | 'transition metal' | 'transuranic element' | _elementaryMetalAtom
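
One way to picture rule inheritance and overriding is as dictionaries of rules. This is a minimal sketch, assuming that `_elementaryMetalAtom` refers to the inherited version of the rule being overridden; the parent rule contents ("iron", "copper", "<molecule>") and the function name are hypothetical.

```python
def inherit(parent: dict, overrides: dict) -> dict:
    """Build a child grammar from a parent grammar plus overriding rules.

    In an overriding rule, the alternative '_<ruleName>' is assumed to
    stand for the parent's alternatives for that rule.
    """
    child = {name: list(alts) for name, alts in parent.items()}
    for name, alts in overrides.items():
        expanded = []
        for alt in alts:
            if alt == "_" + name:
                expanded.extend(parent.get(name, []))
            else:
                expanded.append(alt)
        child[name] = expanded
    return child


# Hypothetical parent (molecule) grammar with placeholder rule contents.
MOLECULE = {
    "elementaryMetalAtom": ["iron", "copper"],
    "chemical": ["<molecule>"],
}

# Generic grammar: inherits everything, overrides one rule, adds another.
GENERIC = inherit(
    MOLECULE,
    {
        "elementaryMetalAtom": [
            "lanthanide",
            "lanthanoid",
            "transition metal",
            "transuranic element",
            "_elementaryMetalAtom",
        ],
        "pluralisedChemical": ["chemical 's'"],
    },
)
```

The parent grammar is left untouched, so several child grammars can be derived from the same molecule grammar.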

Unconventional entities #1
• Formulae:
– Sum formulae
• C20H25NO6
– Line formulae
• CH3CH2CH2Cl (complete molecule)
• CH2CH2 (linker)
• CH3CH2 (substituent)
– Salts
• MgSO4

• Peptide formulae
– Cys-Tyr-Phe-Gln-Asn-Cys-Pro-Arg-Gly-NH2
• Oligosaccharides
– α-L-Fucp-(1→4)-[β-D-Galp-(1→3)]-β-D-GlcpNAc-
(1→3)-β-D-Galp-(1→4)-D-Glc-ol
• Oligonucleotides
– 3'-AATG-5'

• Patent numbers
– U.S. Pat. No. 6,677,355
• Journal references
– (1974) J. Biol. Chem. 249, 4250-4256
• CAS numbers
– 90-13-1
• InChI and SMILES
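
CAS numbers are a convenient example of an identifier that a pattern can both recognise and validate, because the final digit is a check digit. The sketch below assumes nothing about how LeadMine handles them; the function name is illustrative.

```python
import re

# Two to seven digits, two digits, then a single check digit.
CAS_PATTERN = re.compile(r"([0-9]{2,7})-([0-9]{2})-([0-9])")


def is_valid_cas(text: str) -> bool:
    """Return True if text has CAS number syntax and a correct check digit."""
    m = CAS_PATTERN.fullmatch(text)
    if m is None:
        return False
    body = m.group(1) + m.group(2)
    check = int(m.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit
    # before the check digit; the sum modulo 10 is the check digit.
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check
```

For the slide's example 90-13-1 the weighted sum is 3×1 + 1×2 + 0×3 + 9×4 = 41, and 41 mod 10 = 1, which matches the final digit.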


Fast spelling correction
• Historically we have used Levenshtein-like distance measures
(all possible corrections)
• Only use spelling correction when recognition fails
• Allow a certain level of “look behind”
– 13 characters empirically found to yield identical results
– Speeds up spelling correction ~80%
• Dictionary of common English words can be used to prevent
attempting spelling correction
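
The gating described above (correct only when recognition fails, and never for common English words) can be sketched with a plain Levenshtein distance. This is a simplification under stated assumptions: a brute-force comparison against a small name list stands in for grammar-based correction, the look-behind limit is not modelled, and all names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def correct(token, known_names, english_words, max_distance=1):
    """Return a recognised name for token, or None if there is none."""
    lowered = token.lower()
    if lowered in known_names:
        return lowered  # recognition succeeded, no correction needed
    if lowered in english_words:
        return None  # common English word, do not attempt correction
    best, best_distance = None, max_distance + 1
    for name in sorted(known_names):
        distance = levenshtein(lowered, name)
        if distance < best_distance:
            best, best_distance = name, distance
    return best
```

For example, with known names {"benzene", "toluene"} the misspelling "bezene" is corrected to "benzene", while an English word such as "lead" is skipped without computing any distances.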

Words Ignored for spelling
correction (gray)

Exceptions to local errors
• Whether a space is allowed may only be
decidable once the suffix of a chemical name
is encountered
propyl bromochloromethanol 
propylbromochloromethanol
propyl bromochloromethanoate
19 character look behind required!

BioCreative IV
• CHEMDNER (Chemical compound and drug
name recognition task)
• 10000 annotated PubMed abstracts (3500 for
training, 3500 for development and 3000 for
testing)
• Deadline for submission: This Thursday

Typical annotated Abstract

Dictionaries… bigger is better
• For high recall of trivial names dictionaries
with high coverage are required.
• The largest publicly available dictionary is
PubChem with over 94 million terms
• However most of these terms are either not
useful or actually detrimental to text mining

Aggressive filtering
• “what you don't see won't hurt you”
• Hence remove terms that are also English words or start with an
English word
– Accomplished using a large English dictionary with
chemistry terms removed
• Remove internal identifiers used by depositors
• Remove terms that are matched by our grammars
• Ultimate result: 94 million terms reduced to fewer than 3 million
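
The filtering steps above amount to a chain of rejection tests. A minimal sketch, assuming the English word list, the depositor-identifier test and the grammar test are supplied by the caller; all three are hypothetical stand-ins here.

```python
def filter_dictionary(terms, english_words, is_depositor_id, matched_by_grammar):
    """Keep only terms that survive every rejection test."""
    kept = []
    for term in terms:
        words = term.lower().split()
        if not words:
            continue  # empty or whitespace-only entry
        # One test covers both "is an English word" and "starts with one".
        if words[0] in english_words:
            continue
        if is_depositor_id(term):
            continue
        if matched_by_grammar(term):
            continue  # the grammars already recognise this name
        kept.append(term)
    return kept
```

Usage with toy predicates: given the terms "lead", "red dye", "aspirin", "MLS000123" and "chloroethane", an English word list containing "lead" and "red", an identifier test for the made-up prefix "MLS" and a grammar test for names ending in "ethane", only "aspirin" is kept.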

Structure Aware filtering
• “Do not tag proteins, polypeptides (> 15aa),
nucleic acid polymers, polysaccharides,
oligosaccharides [tetrasaccharide or longer] and other
biochemicals.”
• About 40,000 polypeptides and
oligosaccharides excluded from PubChem
using these criteria

Entity Extension
• Even PubChem is far from comprehensive hence it can be
useful to extend the start and/or end of entities to avoid
partial hits
– α-santalol can be recognised from santalol in the
dictionary
• Extension is bracketing aware and blocked by English words
• Entity trimming also performed to comply with the
annotation guidelines
– ‘Allura Red AC dye’ → ‘Allura Red AC’

Entity Merging
• Adjacent entities may actually be the same
entity
– Ethyl ester → one entity
– (+)-limonene epoxide → one entity
BUT
– Hexane-benzene → two entities

Using an ontology to determine
when terms add information
• Genistein isoflavone → two entities
• Glycine ester → one entity
Genistein showing isoflavone core structure

Abbreviation detection
• Based on the Hearst and Schwartz algorithm
• Detects abbreviations of the following forms:
– Tetrahydrofuran (THF)
– THF (tetrahydrofuran)
– Tetrahydrofuran (THF;
– (tetrahydrofuran, THF)
– THF = tetrahydrofuran
Schwartz, A.; Hearst, M. Proceedings of the Pacific Symposium on
Biocomputing 2003.
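
The core of the cited algorithm is a right-to-left match of the short form's characters against a candidate long form. The sketch below is a Python rendering of that matching step as described in the Schwartz and Hearst paper; it does not cover candidate extraction for the patterns listed above, and it makes no claim about how LeadMine adapts the algorithm.

```python
def find_best_long_form(short_form: str, long_form: str):
    """Return the shortest suffix of long_form that explains short_form.

    Characters of the short form are matched right to left; the first
    character of the short form must match the start of a word.
    Returns None if no match exists.
    """
    l_index = len(long_form) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            continue  # skip punctuation in the short form
        while (l_index >= 0 and long_form[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Extend the match back to the start of the word it begins in.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]
```

Given the short form "THF" and the preceding text "was dissolved in tetrahydrofuran", the function returns "tetrahydrofuran".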

Anti-abbreviation detection
• Finds entities detected as abbreviations of
unrecognised entities
– Can mean a common chemical abbreviation has
been redefined in the scope of the document
– e.g. current good manufacturing practice (cGMP), where cGMP
would otherwise be read as cyclic guanosine monophosphate

Grammars used
• Systematic molecule
• Systematic prefix
• Systematic generic name
• Registry number
• CAS number
• Chemical formulae
• Systematic polymer
• Semi systematic chemical name
– Systematic prefix + common trivial name/name from PubChem

Dictionaries used
• Noise words e.g. lead
• Trivial polymer
• Generic chemical terms (some from ChEBI)
• Common abbreviations
• Common trivial names
• Filtered PubChem
• Alloys
• Allotropes
• Minerals

Making the most of the knowledge
provided
• Use training data to identify terms that are
not currently recognised (a whitelist)
• Identify terms that are often false positives (a
blacklist)
• Each false positive and false negative is placed
into such a list if its inclusion increased F-score
(harmonic mean of precision/recall)
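
The acceptance test for a whitelist or blacklist entry can be written directly from the definition of F-score. A minimal sketch, assuming each candidate term is evaluated by recounting true positives, false positives and false negatives on the training data; the class and function names are illustrative.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def f_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_score(c: Counts) -> float:
    """F-score from raw counts; equivalent to f_from_pr(P, R)."""
    denominator = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denominator if denominator else 0.0


def keep_in_list(before: Counts, after: Counts) -> bool:
    """Keep a term in the whitelist or blacklist only if F-score rises."""
    return f_score(after) > f_score(before)
```

As a made-up example, starting from Counts(820, 120, 180), whitelisting a term that recovers five missed mentions at the cost of two new false positives gives Counts(825, 122, 175) and is kept, whereas blacklisting a term that removes one false positive but loses three true positives gives Counts(817, 119, 183) and is rejected.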

Results
(on development set)
Configuration          Precision  Recall  F-score
Baseline               0.87       0.82    0.84
WhiteList              0.86       0.85    0.85
BlackList              0.88       0.80    0.84
WhiteList + BlackList  0.87       0.83    0.85

Future work
• Typically we are focused on generating
structures from the entities we recognise
– Line formula parsing
– Generic chemical name parsing (difficult to do in a
way that the results are not tied to a particular
toolkit)
• Grammars serve as an excellent starting
point for writing parsers

Conclusions
• Two level state machines allow many complicated
grammars to be represented by far fewer states
• Backtracking spelling correction can provide significant
speed improvements without affecting recall
• Check out our blog (nextmovesoftware.co.uk/blog) in a
couple of weeks to find out how we did in BioCreative!

daniel@nextmovesoftware.com
Thank you for your attention
