Chemicals, Chemical Identifiers and Navigating Through Databases

Antony Williams
UNC Chapel Hill, October 2010

Chemistry on the Internet
• Where do you source chemistry information?
• What can you trust online?
• How can you recognize potential issues?
• Cross-referencing and curating data

What is the Structure of Vitamin K?

MeSH
• A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.

What is the Structure of Vitamin K1?

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
• Variants of systematic names on PubChem
  • 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl…
  • 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl…
  • 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl…
  • 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl…
  • 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl…
  • 2-methyl-3-[(E)-3,7,11,15-tetramethyl…
  • 2-methyl-3-(3,7,11,15-tetramethyl…
  • 2-methyl-3-[(E)-3,7,11,15-tetramethyl…

Bioassay Data are Associated…

Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)

Molfiles
 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END

Molfiles
• Molfiles are the primary exchange format between structure drawing packages
• Can differ between drawing packages
• Most commonly carry X,Y coordinates for layout
• Can support polymers, organometallics, etc.
• Can carry 3D coordinates
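
As a minimal sketch of what a Molfile carries, the snippet below reads the counts line, atom block and bond block from the record on the earlier slide. The function name `parse_v2000` is made up for illustration, and splitting on whitespace is a simplifying assumption: strict readers use fixed-width columns, and a complete Molfile also has three header lines that the slide does not show.

```python
from collections import Counter

# Counts line, atom block and bond block as shown on the slide.
MOLBLOCK = """\
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M  END
"""


def parse_v2000(block):
    """Return (atoms, bonds) from a header-less V2000 block (illustrative only)."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for ln in lines[1:1 + n_atoms]:
        x, y, z, symbol = ln.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in ln.split()[:3])
        bonds.append((first, second, order))
    return atoms, bonds


atoms, bonds = parse_v2000(MOLBLOCK)
heavy_atoms = Counter(symbol for symbol, *_ in atoms)
is_flat = all(z == 0.0 for *_, z in atoms)  # True when only X,Y layout is stored

print(dict(heavy_atoms), len(bonds), is_flat)
```

Every Z coordinate in this record is zero, so it is a 2D layout rather than a 3D structure.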

SMILES (http://en.wikipedia.org/wiki/SMILES)
• SMILES is a common line-notation format
• Can support polymers, organometallics, etc.
• Does NOT carry X, Y or Z coordinates, so depiction requires layout algorithms – can be problematic!
• Generally differs between drawing packages

SMILES
• ACD/Labs
  • CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O
• OpenEye
  • CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
• ChEMBL
  • CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C
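
The three strings above differ as text, so a plain string comparison cannot tell that they describe the same skeleton. The sketch below tallies heavy atoms with a regular expression as a crude sanity check. It is not canonicalization, and two different compounds can share a tally; a cheminformatics toolkit is needed for a real comparison. The helper name `heavy_atom_counts` is hypothetical.

```python
import re
from collections import Counter

# SMILES strings from the slide, with the slide's line wraps removed.
SMILES = {
    "ACD/Labs": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": "CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C",
}

# Bracket atoms such as [C@@H] first, then bare organic-subset atoms.
ATOM = re.compile(
    r"\[\d*(?P<bracket>[A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"
    r"|(?P<bare>Cl|Br|[BCNOPSFI]|[bcnops])"
)


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms by element; aromatic 'c' is counted as 'C'."""
    counts = Counter()
    for match in ATOM.finditer(smiles):
        symbol = match.group("bracket") or match.group("bare")
        if symbol != "H":
            counts[symbol.capitalize()] += 1
    return counts


tallies = {name: heavy_atom_counts(s) for name, s in SMILES.items()}
for name, tally in tallies.items():
    print(name, dict(tally))
```

All three strings give the same tally of 31 carbons and 2 oxygens, even though no two of them are identical as text.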

InChI
• SINGLE code base managed by IUPAC, integrated into drawing packages. No variability between packages, unlike SMILES
• InChI Strings can be reversed to structures, with the same problem as SMILES: no layout
• Well adopted by the community (databases, publishers, blogs, Wikipedia) – good for searching the internet

Tautomers – “Mobile H Perception”

Checking for Stereochemistry
Use your drawing package!

InChI Strings Hash to InChIKeys

PubChem InChIKeys
• MBWXNTAXLNYFJB-NKFFZRIASA-N
• MBWXNTAXLNYFJB-LKUDQCMESA-N
• MBWXNTAXLNYFJB-UHFFFAOYSA-N
• MBWXNTAXLNYFJB-FAKCLFGASA-N
• MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
• MBWXNTAXLNYFJB-ODDKJFTJSA-N
• MBWXNTAXLNYFJB-KSVLJPARSA-N
• MBWXNTAXLNYFJB-UDCSOKOMSA-N
• MBWXNTAXLNYFJB-JHBCSKSVSA-N
• MBWXNTAXLNYFJB-JXAKDHTRSA-N
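
An InChIKey has three hyphen-separated parts: a 14-letter hash of the connectivity, an 8-letter hash of the remaining layers (stereo, isotopes) followed by a standard/non-standard flag and a version letter, and a final protonation letter. The sketch below splits the keys listed above and groups them by the first block, which is one way to look at records that share a molecular skeleton. The function name `split_inchikey` is made up for illustration.

```python
import re
from collections import defaultdict

# InChIKeys listed on the slide.
KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]

PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")

# Second-block hash seen on standard InChIKeys with no stereo or isotope layers.
NO_DETAIL_HASH = "UHFFFAOY"


def split_inchikey(key):
    """Split an InChIKey into its parts; raise ValueError if it is malformed."""
    match = PATTERN.match(key)
    if match is None:
        raise ValueError(f"not an InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # connectivity hash
        "detail": detail,            # stereo / isotope hash
        "standard": flag == "S",     # S = standard InChI, N = non-standard
        "version": version,
        "protonation": protonation,  # N = neutral
    }


by_skeleton = defaultdict(list)
for key in KEYS:
    by_skeleton[split_inchikey(key)["skeleton"]].append(key)

flat_keys = [k for k in KEYS if split_inchikey(k)["detail"] == NO_DETAIL_HASH]

print({skeleton: len(keys) for skeleton, keys in by_skeleton.items()})
print(flat_keys)
```

All ten keys share the first block, so they are variants of one skeleton. Only the key with the `UHFFFAOY` second block carries no stereo or isotope information.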

InChI
• No support for polymers, organometallics
• Many option settings can lead to variability and make integration across databases difficult – FixedH option especially problematic
• “Slight” chance of collisions of InChIKeys
• VERY USEFUL FOR INTEGRATING THE WEB

Vancomycin
Search Molecular SKELETON
Search Full Molecule

Full Skeleton Search: 104 Hits

Where is chemistry online?
• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science

Linked Data on the Web
Taken from: Rafael Sidis’ Blog

Search for a Chemical…by name

Available Information…
• Linked to vendors, safety data, toxicity, metabolism

How do we build it?
• 25 million chemicals from 400 data sources
• We deal in Molfiles or SDF files, including coordinates
• We do rudimentary filtering (valence checking, charge imbalance) prior to deposition
• We have our own “business logic” to standardize structures
• We use InChI to “aggregate tautomers” to one record
• We link out to external sites where possible using their IDs
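
The aggregation and link-out steps can be sketched as grouping depositions by InChIKey. This is an illustration under stated assumptions, not the actual ChemSpider pipeline: the source names and source IDs are invented placeholders, and the function `aggregate` is hypothetical. Keying on a Standard InChIKey is what lets simple tautomers that share one Standard InChI collapse into a single record.

```python
from collections import defaultdict

# Hypothetical depositions: (source name, the source's own ID, Standard InChIKey).
# Source names and IDs are placeholders; the InChIKeys come from the earlier slide.
DEPOSITIONS = [
    ("vendor_a", "A-001", "MBWXNTAXLNYFJB-NKFFZRIASA-N"),
    ("database_b", "B-17", "MBWXNTAXLNYFJB-NKFFZRIASA-N"),
    ("database_c", "C-9", "MBWXNTAXLNYFJB-UHFFFAOYSA-N"),
    ("vendor_a", "A-001", "MBWXNTAXLNYFJB-NKFFZRIASA-N"),  # repeated deposition
]


def aggregate(depositions):
    """Merge depositions into one record per InChIKey, keeping link-out IDs."""
    links = defaultdict(set)
    for source, source_id, inchikey in depositions:
        links[inchikey].add((source, source_id))
    return {key: sorted(pairs) for key, pairs in links.items()}


records = aggregate(DEPOSITIONS)
for inchikey, link_outs in records.items():
    print(inchikey, link_outs)
```

Four depositions become two records. The stereo-defined and stereo-free keys stay separate because they hash to different second blocks.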

Inherited Errors
• We have inherited errors from every database… all public compound databases, including ours, have errors
  • “Incorrect” structures – assertions, timelines etc.
  • “Incorrect” names associated with structures
  • Properties
  • Links
  • Publications
• ENORMOUS CHALLENGE

Be careful searching by Name!
• Determining the correct structure by name searching is difficult online!
• Good, but not perfect:
  • Wikipedia
  • ChEBI/ChEMBL
  • ChemIDPlus
  • ChemSpider
• Be VERY careful with MOST databases

Validating structures
• Check for “full stereo”; stereo descriptors are especially useful for checking!
• Check the quality of associated data sources
• Check against reference literature when available – but it can be wrong
• Question EVERYTHING!
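
A first pass at the “full stereo” check can be done on the text of a SMILES string: count the chirality marks (`@`, `@@`) and look for double-bond geometry marks (`/`, `\`). The sketch below applies this to the three vitamin K1 strings from the SMILES slide. It only reports what a string states; it cannot tell whether a centre that should be defined was left out, which needs a toolkit that perceives stereocentres. The name `stereo_summary` is hypothetical.

```python
import re

# SMILES strings from the earlier slide, with the slide's line wraps removed.
SMILES = {
    "ACD/Labs": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": "CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C",
}

# A bracket atom that carries a chirality mark, for example [C@H] or [C@@H].
CHIRAL_ATOM = re.compile(r"\[[^\]]*?@{1,2}[^\]]*\]")


def stereo_summary(smiles):
    """Report the stereo information that is written in the string itself."""
    return {
        "chiral_centres": len(CHIRAL_ATOM.findall(smiles)),
        "double_bond_geometry": "/" in smiles or "\\" in smiles,
    }


summaries = {name: stereo_summary(s) for name, s in SMILES.items()}
for name, summary in summaries.items():
    print(name, summary)
```

All three strings mark two chiral centres, but only the OpenEye string states the geometry of the side-chain double bond.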

Online Curation
• Online databases generally do NOT allow curation or annotation
• If you find errors, they stay there!
• ChemSpider is unique…immediate curation
• ChemSpider live demo following this lecture
  • Searching
  • Deposition and Curation
  • ChemSpider SyntheticPages

Thank you
Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
