ChemSpider - Connecting and Curating Online Chemistry Resources
Antony Williams
EBI, November 30th, 2010
Chemistry on the Internet
• Hundreds of websites serve up chemistry data, including SDF files of structures and data
• Some primary resources: PubChem, ChEBI, DrugBank, ChemIDPlus, Wikipedia
• ChemSpider “links” chemistry on the internet
• Almost 25 million compounds, 400 data sources
• Allows community deposition, curation, annotation
• Integrates properties, publications, patents, media
• Text, structure, and substructure (in testing) searching
www.chemspider.com
Search for a Chemical
Available Information…
• Linked to vendors, safety data, toxicity, metabolism
We Have Delivered the Vision
“Build a Structure Centric Community to Serve Chemists”
• Integrate chemical structure data on the web
• Create a “structure-based hub” to information, data and algorithmic predictions
• Let chemists contribute their own data
• Allow the community to curate/correct data
How Did We Build It?
• We deal in Molfiles or SDF files – including coordinates
• We do rudimentary filtering – valence checking, charge imbalance – prior to deposition
• We have our own “business logic” to standardize structures
• We use InChI to “aggregate tautomers” to one record
• We link out to external sites where possible using IDs
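Two of the steps above, the charge-imbalance filter and the aggregation of depositions onto one record, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ChemSpider's actual pipeline: `net_charge` only reads V2000 `M  CHG` property lines, `aggregate_by_inchi` assumes the InChI strings were already generated elsewhere, and the source names are hypothetical.

```python
from collections import defaultdict


def net_charge(molfile: str) -> int:
    """Sum the formal charges declared on V2000 'M  CHG' property lines."""
    total = 0
    for line in molfile.splitlines():
        if line.startswith("M  CHG"):
            fields = line[6:].split()
            count = int(fields[0])
            # After the count, fields alternate: atom index, charge.
            pairs = fields[1:1 + 2 * count]
            total += sum(int(charge) for charge in pairs[1::2])
    return total


def aggregate_by_inchi(depositions):
    """Group deposited records that share an InChI string into one record."""
    merged = defaultdict(list)
    for source, inchi in depositions:
        merged[inchi].append(source)
    return dict(merged)


# Only the property lines matter for this check, so atom and bond blocks are omitted.
balanced_fragment = "M  CHG  2   1   1   4  -1\nM  END"
unbalanced_fragment = "M  CHG  1   3   1\nM  END"

# Hypothetical depositions: (source name, InChI string).
depositions = [
    ("source_a", "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"),  # ethanol
    ("source_b", "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"),  # ethanol, second depositor
    ("source_c", "InChI=1S/H2O/h1H2"),  # water
]
records = aggregate_by_inchi(depositions)

print(net_charge(balanced_fragment), net_charge(unbalanced_fragment))
print(len(records), "records from", len(depositions), "depositions")
```

A record whose net charge is non-zero would be flagged for review before deposition. Any tautomer merging happens inside InChI generation itself, which this sketch does not perform.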
Inherited Errors
• We have inherited errors from every database… all public compound databases, including ours, have errors
• “Incorrect” structures – assertions, timelines, etc.
• “Incorrect” names associated with structures
• Properties
• Links
• Publications
• ENORMOUS CHALLENGE
What is the Structure of Vitamin K?
MeSH
• A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
Wikipedia
ChEBI – Manual Curation
PubChem
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
• Variants of systematic names on PubChem
• 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
• 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
• 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
• 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
• 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
• 2-methyl-3-[(E)-3,7,11,15-tetramethyl
• 2-methyl-3-(3,7,11,15-tetramethyl
• 2-methyl-3-[(E)-3,7,11,15-tetramethyl
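The name variants listed above differ only in their stereodescriptors. A crude way to see this is to strip the bracketed descriptor prefix and check that every variant collapses to the same string. The regular expression below is an assumption made for this illustration and works on the truncated names exactly as listed; it is only a grouping key, not a way to produce valid names, and a real system would compare structures (for example via InChI) rather than names.

```python
import re

# Matches a bracketed stereodescriptor prefix such as "[(E,7R,11R)-" or "[(E)-".
STEREO_PREFIX = re.compile(r"\[\((?:[EZ]|\d+[RS])(?:,(?:[EZ]|\d+[RS]))*\)-")


def strip_stereo(name: str) -> str:
    """Replace the stereodescriptor prefix with a plain opening parenthesis."""
    return STEREO_PREFIX.sub("(", name)


# Truncated name variants, copied as they appear on the slide.
variants = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

collapsed = {strip_stereo(name) for name in variants}
print(len(variants), "variants collapse to", len(collapsed), "skeleton name")
```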
Public Domain Chemistry Databases
• Our databases are a mess…
• Non-curated databases are proliferating errors
• We source and deposit data between databases
• Original sources of errors are hard to determine
• Curation is time-consuming, challenging and exacting
• An examination of quality in databases – inter/intra lab comparison of processes for 150 drugs
Vytorin: Ezetimibe/Simvastatin
Symbicort: Budesonide + Formoterol
ChemIDPlus
Wikipedia
DrugBank: Search Symbicort…
Symbicort: Budesonide + Formoterol
• PubChem
• 8 structures called Budesonide. 1 “correct”
• 6 structures called Formoterol. 1 “correct”
• Search on “Symbicort” gives 1 structure.
Taxol: Paclitaxel 44 structures
Taxol: Paclitaxel Bioassay Data
• Most bioassay data is associated with a structure that has one ambiguous stereocenter
Data on the Web – Good or Bad??
Taken from: Rafael Sidi’s Blog
Data on the Registry
How are data handled in Pharma?
• Algorithms for “collapsing” data? Skeletons only?
• Processing structure-name pairs?
• Manual curation?
• Does it matter relative to the noise in the measurements?
• Do correct structure representations matter, and to whom?
NLM’s DailyMed
• Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
| Drug Name | Generic Name | ChEBI | ChemSpider | CAS Common Chemistry | ChemIDPlus | DailyMed | DrugBank | PubChem | Wikipedia |
|---|---|---|---|---|---|---|---|---|---|
| Spiriva | Tiotropium Bromide | No Hits | | No Hits | | | | 4/0 | |
| Depakote | Valproate semisodium | | | | | | | | No Structure |
| Basen | Voglibose | | | No Hits | | No Hits | | 2/1 | |
| Symbicort 1) | Budesonide | | | | | | | 8/1 | |
| Symbicort 2) | Formoterol | WRONG | | No Hits | | | | 6/1 | |
| Vytorin 1) | Ezetimibe | | | No Hits | | | | | |
| Vytorin 2) | Simvastatin | | | | | | | 2/1 | |
| Taxol | Paclitaxel | | | | | | | 44/1 | |
| Thalomid | Thalidomide | No Hits | | | | | | | |
| Zocor | Simvastatin | | | | | | | 2/1 | |
| Crestor | Rosuvastatin | | | No Hits | | | | 2/1 | |
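The PubChem entries in the table above can be turned into a rough share of correct structures per name search. This sketch assumes each entry reads "structures found / structures judged correct", which is the reading suggested by the Symbicort slide (8 structures called Budesonide, 1 "correct"); the function name is made up for the illustration.

```python
from fractions import Fraction

# PubChem column of the table, keyed by generic name.
pubchem_hits = {
    "Tiotropium Bromide": "4/0",
    "Voglibose": "2/1",
    "Budesonide": "8/1",
    "Formoterol": "6/1",
    "Simvastatin": "2/1",
    "Paclitaxel": "44/1",
    "Rosuvastatin": "2/1",
}


def correct_share(cell: str) -> Fraction:
    """Read 'found/correct' and return correct divided by found."""
    found, correct = (int(part) for part in cell.split("/"))
    return Fraction(correct, found)


shares = {name: correct_share(cell) for name, cell in pubchem_hits.items()}

# List the searches from least to most reliable.
for name, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{name}: {float(share):.1%} of returned structures judged correct")
```

On this reading, a name search for Paclitaxel returns the intended structure in only 1 of 44 hits, which is the practical cost of uncurated synonyms.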
Why Curated Dictionaries Matter
Success Depends on Dictionaries
Online Curation
• Online databases generally do NOT allow curation or annotation
• If you find errors, they stay there!
• ChemSpider allows immediate curation
Crowdsourcing Works
• Over 100 people have deposited data (structures, spectra, etc.) and participated in data curation
• Curators at different levels check each other’s work
• Wikipedia is the modern primary example
• Some curators are “madmen”…
• The Oxford English Dictionary
Collaborative Data Curation
• How can we COLLECTIVELY clean online data?
• ChemSpider has inherited junk from >400 data sources. Some of this has proliferated into PubChem. We should deprecate it.
• We need to develop a way to share curation actions back to original data sources
• A “bigger is better” mindset is problematic. How many “real chemicals” are in the public databases?
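One possible shape for sharing curation actions back to their sources is a small, serializable record per correction. Everything in this sketch is hypothetical: the field names, the database names and the record identifiers are invented for illustration, and no existing exchange format is implied.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CurationAction:
    """One correction made by a curator; every field name here is hypothetical."""
    source: str      # upstream database the record was inherited from
    record_id: str   # identifier of the record in that source
    attribute: str   # what was corrected, e.g. "name" or "structure"
    old_value: str
    new_value: str
    curator: str


def export_for_source(actions, source):
    """Serialize the actions that concern one upstream source as JSON."""
    relevant = [asdict(action) for action in actions if action.source == source]
    return json.dumps(relevant, indent=2, sort_keys=True)


# Invented example data.
actions = [
    CurationAction("ExampleDB", "EX-0001", "name", "old synonym", "corrected synonym", "curator_1"),
    CurationAction("OtherDB", "OT-0042", "structure", "old molfile", "corrected molfile", "curator_2"),
]

feed = export_for_source(actions, "ExampleDB")
print(feed)
```

Keeping the old and new values side by side lets the original source review each correction rather than accept it blindly.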
ChemSpider
• ChemSpider is free to use.
• Multiple web services are available.
• New data is added daily.
• Curation and data validation are ongoing every day.
• Provided by the RSC.
www.chemspider.com
Thank you
Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
