Improving the quality of chemical databases with community-developed tools (and vice versa)
Noel M. O’Boyle
Aug 2011
5th Meeting on U.S. Government Chemical Databases and Open Chemistry, NCI-Frederick, MD, U.S.
Slides at http://tinyurl.com/noel-nci
Open Babel

Improving the quality of chemical databases with community-developed tools (and vice versa)
  • Part 1: Using Databases to improve Open Babel
  • Part 2: Using Open Babel to improve Databases

  • Volunteer effort, an open source success story
  • Originally a fork from OpenEye’s OELib in 2001
  • Lead is Geoff Hutchison (Uni of Pittsburgh)
  • 4 or 5 active developers; I got involved in late 2005
  • http://openbabel.org
  • Paper coming out Real Soon Now

Improving Open Babel using Databases
  • Originally only had code to record stereochemistry in SMILES
  • In 2005, Nick England, as an undergraduate summer student with PMR (sponsored by Merck), added better support throughout the library
  • However, by early 2009, it was clear that Open Babel’s handling of stereochemistry needed to be overhauled
    • Bug reports: SMILES conversions were causing flipping of chirality, incorrect InChIs were being generated, …
  • Tim Vandermeersch took the lead in writing new classes and stereo perception code
  • I integrated the code into the various formats
  • Handling stereochemistry is tricky (really!)
  • Anticipating corner cases triggered by 1 in 10000 molecules is difficult…
  • …unless, of course, you have a dataset of 10000 molecules
    • (corollary also true: developers of large databases are the people most likely to find bugs in cheminformatics toolkits)
  • Solution: use PubChem and other databases to flush out bugs

The Read/Write SMILES test
  • Test set is first subset of PubChem 3D
    • 18053 molecules as SDF file
    • 3D structures nice to use because the stereochemistry is explicit (and easily visualised)
  • Test Open Babel’s ability to correctly read or write SMILES strings:
    (a) Convert SDF to SMILES; convert these to CanSMILES
    (b) Convert SDF to CanSMILES
    (c) Compare (a) and (b)
  • Differences will be principally due to errors in:
    • Reading SDF, reading/writing SMILES
    • Kekulisation or canonicalisation
  • 19/Mar/2009: 1424 (8%) had differences
    21/Mar/2009: 925 (5%)
    22/Mar/2009: 324 (2%)
    10/Oct/2009: 190 (1%)
    04/Oct/2010: 5 (out of 18084)
    31/May/2011: 2

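Steps (a) to (c) could be scripted along the following lines. This is only a sketch, not the script used for the talk: it assumes the Open Babel 3.x Python bindings (`from openbabel import pybel`), and the function names `find_differences` and `roundtrip_test` are made up for illustration.

```python
def find_differences(via_smiles, direct):
    """Return the indices at which two lists of canonical SMILES disagree."""
    return [i for i, (a, b) in enumerate(zip(via_smiles, direct)) if a != b]


def roundtrip_test(sdf_path):
    """Run steps (a)-(c) on every molecule in an SDF file."""
    # Imported here so that find_differences() works without Open Babel.
    # Assumption: Open Babel 3.x layout (2.x used a plain `import pybel`).
    from openbabel import pybel

    via_smiles, direct = [], []
    for mol in pybel.readfile("sdf", sdf_path):
        # (a) SDF -> SMILES, then SMILES -> canonical SMILES.
        # write() returns "SMILES<tab>title", so keep the first token only.
        smiles = mol.write("smi").split()[0]
        reread = pybel.readstring("smi", smiles)
        via_smiles.append(reread.write("can").split()[0])
        # (b) SDF -> canonical SMILES directly.
        direct.append(mol.write("can").split()[0])
    # (c) Compare the two routes.
    return find_differences(via_smiles, direct), len(direct)
```

Calling `roundtrip_test("subset.sdf")` (a hypothetical file name) returns the indices of the disagreeing molecules together with the total count, which is enough to reproduce a "differences out of N" figure like those listed above.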
Testing canonicalisation of SMILES
  • Canonicalisation useful for comparing identity and compound registry
  • Relatively simple to handle 95% of molecules (Morgan algorithm)
  • More complicated for the general case
    • Stereocenters related by symmetry, potential stereocenters whose configuration depends on other stereocenters
  • Test set: eMolecules dataset (5.2m)
  • Test canonicalisation by shuffling the atom order, and verifying that the same canonical SMILES is generated
    • Repeated 10 times
  • 23k (0.4%) failures for OB 2.2.3; 4 failures for OB 2.3.1 (dev)

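The shuffle test itself does not depend on any particular toolkit, so it can be sketched in plain Python. In the sketch below a molecule is just a list of element symbols plus a list of bonded index pairs, and `toy_canonical` is a deliberately simple stand-in for a real canonical SMILES writer; both are assumptions made for illustration, not Open Babel's data model or algorithm.

```python
import random


def shuffle_atoms(atoms, bonds, rng):
    """Return the same molecule with its atoms listed in a random order."""
    order = list(range(len(atoms)))
    rng.shuffle(order)  # order[new_index] == old_index
    new_index = {old: new for new, old in enumerate(order)}
    new_atoms = [atoms[old] for old in order]
    new_bonds = [(new_index[a], new_index[b]) for a, b in bonds]
    return new_atoms, new_bonds


def is_order_invariant(atoms, bonds, canonicalise, trials=10, seed=0):
    """True if `canonicalise` gives the same answer for every shuffled copy."""
    rng = random.Random(seed)
    reference = canonicalise(atoms, bonds)
    return all(
        canonicalise(*shuffle_atoms(atoms, bonds, rng)) == reference
        for _ in range(trials)
    )


def toy_canonical(atoms, bonds):
    """Stand-in canonical form: sorted (element, sorted neighbour elements)."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(atoms[b])
        neighbours[b].append(atoms[a])
    return sorted((atoms[i], sorted(neighbours[i])) for i in range(len(atoms)))


def order_dependent(atoms, bonds):
    """Deliberately broken 'canonical' form that just echoes the input order."""
    return "".join(atoms)
```

With a real toolkit, `canonicalise` would wrap the canonical SMILES writer and the shuffle would renumber the toolkit's own atoms; the harness stays the same. The `trials=10` default mirrors the "repeated 10 times" above.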
Independent test of SDF to SDF conversion
  • Recently, Róbert Kiss evaluated Open Babel for use by mcule.com
  • Selected all molecules from PubChem with at least one tetrahedral center, at least one cis/trans bond, and 350<MW<750
    • 478k molecules (2D SDF)
    • Excluded 356 where InChI->SDF->InChI had error
  • Test:
    (a) Converted to InChIs with InChI binary
    (b) Converted SDF->SDF with Open Babel, and then to InChIs with InChI binary
    (c) Compared (a) and (b)
  • 09/Aug/2011: 878 (0.2%) disagreement
    16/Aug/2011: 554
    21/Aug/2011: 146 (…work in progress)
  • 57 of these have the same substructure that exposes a Mol file corner case…

Mol file corner case
  • InChI binary regards these Mol files as different
  • Suggests useful rule for choosing location of wedge/hash when writing Mol file
    • Rule: if two bonds are at similar angles, choose one of these
  • Three non-stereo bonds at widely spaced angles (although one is hidden)
  • Two of the non-stereo bonds are very close => InChI decides that the stereochemistry is ambiguous

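To make the rule concrete, the sketch below finds the two bonds around a centre that are closest in angle, given 2D coordinates. It only illustrates the geometric test; the function name `closest_bond_pair` and the input layout are assumptions, and this is not how Open Babel or the InChI software implement it.

```python
import math


def closest_bond_pair(centre, neighbours):
    """Return (i, j, gap_in_degrees) for the two bonds closest in angle.

    centre is an (x, y) tuple and neighbours a list of (x, y) tuples,
    one per bond; at least two neighbours are expected.
    """
    # Direction of each bond, sorted so that adjacent entries are
    # angular neighbours around the centre.
    angles = sorted(
        (math.atan2(y - centre[1], x - centre[0]), i)
        for i, (x, y) in enumerate(neighbours)
    )
    best = None
    for k in range(len(angles)):
        a1, i = angles[k]
        a2, j = angles[(k + 1) % len(angles)]
        gap = (a2 - a1) % (2 * math.pi)  # wraps from the last bond to the first
        if best is None or gap < best[2]:
            best = (i, j, gap)
    return best[0], best[1], math.degrees(best[2])
```

Under the rule above, one of the two bonds returned would be drawn as the wedge or hash, so that the remaining plain bonds stay widely spaced.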
Summary of Part One
  • Open Babel has been considerably improved and tested by training on large databases
  • Large databases are essential as test cases for cheminformatics toolkits
    • Help find errors
    • Help ensure that the “fix” doesn’t generate more errors
  • Devising an appropriate test is half the work
    • Should focus on a particular aspect of the toolkit
    • If a problem is found, it should be easy to figure out its origin
    • Preferably should be a real use case

Part Two
  • Can we now use Open Babel to identify problems in the databases themselves?
  • Case studies:
    • Finding neutral 4-coordinate Ns in ZINC
    • Identifying ambiguous stereochemistry in PubChem and ChEMBL mol files
    • Verifying that chemical data presented is self-consistent (ChEMBL)

Identifying structure problems in ZINC
  • Back in 2007, I noticed something strange in ZINC’s 3D structures
    • Namely, structures with sp3 hybridised N, with four bonds, but where the N was uncharged
  • So…I wrote a script using Open Babel to find all examples of this problem, and reported the results to ZINC
    • About 5% of molecules had this problem (now fixed)

    import glob
    import pybel            # Open Babel 2.x layout; in 3.x: from openbabel import pybel
    import openbabel as ob  # in 3.x: from openbabel import openbabel as ob

    with open("dodgyNs.txt", "w") as outputfile:
        for filename in glob.glob("gzipfiles/*.mol2"):
            for mol in pybel.readfile("mol2", filename):
                for atom in mol:
                    if atom.type == "N3":  # Internal OB atom type (equivalent to N.3)
                        numbonds = len(list(ob.OBAtomBondIter(atom.OBAtom)))
                        if numbonds == 4:
                            # Report the molecule once, then move on to the next one
                            outputfile.write(mol.title + "\n")
                            break

2D MOL files with Ambiguous Stereocenters
  • Chirality specified at one stereocenter or two?
  • Need to know the convention used
    • Tip-only (useful to state or is this everywhere now?)
  • Avoid this problem by choosing wedge/hash bonds that do not link potential stereocenters
    • Almost always possible
  • OB recipe: terminal H is preferred; next, of the bonds that do not link stereocenters, an exo-cyclic bond is preferred; finally, any remaining bond
  • http://baoilleach.blogspot.com/2010/12/name-that-stereochemistry-when-mol.html

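The OB recipe amounts to a ranking of the candidate bonds around a stereocenter. A minimal sketch of that ranking follows; the `CandidateBond` class and its fields are hypothetical, and placing non-linking ring bonds ahead of bonds that link two stereocenters is one reading of "finally, any remaining bond", not something the slide states outright.

```python
from dataclasses import dataclass


@dataclass
class CandidateBond:
    """Hypothetical description of one bond around a stereocenter."""
    name: str
    to_terminal_h: bool = False        # other end is a terminal hydrogen
    links_stereocenters: bool = False  # other end is a potential stereocenter
    exocyclic: bool = False            # bond is not part of a ring


def preference(bond):
    """Lower is better; the ranks follow the order given in the recipe."""
    if bond.to_terminal_h:
        return 0
    if not bond.links_stereocenters:
        return 1 if bond.exocyclic else 2
    return 3


def choose_wedge_bond(bonds):
    """Pick the bond that should carry the wedge/hash (first one wins ties)."""
    return min(bonds, key=preference)
```

Because a bond linking two stereocenters gets the worst rank, it is only chosen when nothing else is available, which is the situation the slide describes as almost always avoidable.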
Easy to find?

    import pybel  # Open Babel 2.x layout; in 3.x: from openbabel import pybel

    def dodgywedge(sdffile):
        tot = probs = potential_probs = 0
        for mol in pybel.readfile("sdf", sdffile):
            tot += 1
            facade = pybel.ob.OBStereoFacade(mol.OBMol)
            # Atoms that Open Babel perceives as tetrahedral stereocenters
            tetcenters = [atom.OBAtom for atom in mol if
                          facade.HasTetrahedralStereo(atom.OBAtom.GetId())]
            for idx, atom_a in enumerate(tetcenters[:-1]):
                for atom_b in tetcenters[idx+1:]:
                    if atom_a.IsConnected(atom_b):
                        # A bond between two stereocenters: ambiguous if wedged/hashed
                        potential_probs += 1
                        bond = atom_a.GetBond(atom_b)
                        if bond.IsWedge() or bond.IsHash():
                            probs += 1
        print("Total number of molecules %d" % tot)
        print("Potential problems: %d" % potential_probs)
        print("Actual problems: %d" % probs)

    if __name__ == "__main__":
        dodgywedge("myfile.sdf")

How common?

    print("Total number of molecules %d" % tot)
    print("Potential problems: %d" % potential_probs)
    print("Actual problems: %d" % probs)

  (Dec 2010)
  • PubChem subset:
    • 23k molecules
    • 14k bonds connecting chiral centers
    • 21 marked as stereobonds (<0.1%)
  • ChEMBL:
    • 636k molecules
    • 483k bonds connecting chiral centers
    • 7k marked as stereobonds
    • => 1.4% are ambiguous stereobonds

Easy to fix? (OB 2.3.1)

    obabel my2Dmol.mol -O fixed2Dmol.mol

Self-consistency of chemical data
  • For a single molecule, a database will typically include several of the following:
    • a 2D molfile
    • a 2D depiction
    • a 3D molfile
    • a non-canonical SMILES string
    • a canonical SMILES string
    • an InChI
    • an InChIKey
  • But which one is the primary data, and which are derived?
    • Derived data may be inconsistent with primary data
    • Every transformation of the data can lead to information loss or corruption
    • Maintainers should highlight the primary data
  • Can Open Tools help identify inconsistencies?

Self-consistency of chemical data II
  • As an example, let’s look for disagreements between the MOL file and the SMILES string provided in a subset of ChEMBL
  • Using Open Babel’s canonical SMILES:

    obabel chembl.sdf -ocan -O sdf_to_can.txt
    obabel -ismi chembl_can.txt -ocan -O can_to_can.txt

  • Using Open Babel’s InChI interface:

    obabel chembl.sdf -oinchi -O sdf_to_inchi.txt
    obabel -ismi chembl_can.txt -oinchi -O can_to_inchi.txt

  • Write a Python script to go through the text files and find differences
  • Looking at the first 10000 entries in ChEMBL 10:
    • 249 disagreements according to derived canonical SMILES
    • 76 disagreements according to derived InChIs
    • 51 disagreements in common
    • 25 only InChI, 198 only canonical SMILES

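The comparison script is not shown on the slide; the following is one way it might look, using only the standard library. It assumes that line N of each output file corresponds to entry N of the input (a failed conversion that drops a line would break this alignment), and the function names are invented for this sketch.

```python
def read_first_tokens(path):
    """Return the first whitespace-separated token of each line ('' if blank)."""
    with open(path) as handle:
        return [line.split()[0] if line.strip() else "" for line in handle]


def disagreements(from_sdf, from_smiles):
    """Return the set of entry indices where the two derived strings differ."""
    return {i for i, (a, b) in enumerate(zip(from_sdf, from_smiles)) if a != b}


def summarise(smiles_diff, inchi_diff):
    """Split two sets of disagreeing entries into common and exclusive counts."""
    return {
        "canonical SMILES": len(smiles_diff),
        "InChI": len(inchi_diff),
        "in common": len(smiles_diff & inchi_diff),
        "only InChI": len(inchi_diff - smiles_diff),
        "only canonical SMILES": len(smiles_diff - inchi_diff),
    }
```

For example, `disagreements(read_first_tokens("sdf_to_can.txt"), read_first_tokens("can_to_can.txt"))` gives the canonical SMILES disagreements, and the same call on the two InChI files gives the other set. The set arithmetic in `summarise` is consistent with the figures above: 249 and 76 disagreements with 51 in common leave 198 and 25 that are exclusive to one method.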
Note to self: Graphical software that makes this comparison easier would be very useful

  • N=N can be cis or trans
  • SMILES string has unspecified stereochemistry
  • However, Molfile has trans geometry and does not mark the stereobond as unspecified
  • This source of disagreement accounts for 23 of the 51 cases.

Is the chirality specified?
  • Open Babel is a bit confused by this one too:

    > obabel -:"OC1CC[C@](CC1)(c1ccccc1)N1CCCCC1" -ocan
    OC1CC[C@](CC1)(N1CCCCC1)c1ccccc1
    > obabel -:"OC1CC[C@@](CC1)(c1ccccc1)N1CCCCC1" -ocan
    OC1CC[C@](CC1)(N1CCCCC1)c1ccccc1

Concluding Points, Ideas and Questions
  • Many classes of errors can be relatively easily identified using Open Toolkits
  • Could crowd-source some of this, “and the iPad goes to the student who writes a script that finds the largest number of errors in MyDB”
    • Must use toolkits to which we have access here at MyDB
    • FP rate must be less than X
  • Are these types of analyses useful to database maintainers?
    • I think the Blue Obelisk community would contribute here if it were welcome
  • Could provide sanity checkers or validation website using webservices, like checkCIF for molecules
  • Create a ValidateMyMolecule website
    • It accepts a single structure, and then sends it to N webservices that validate it
    • Each webservice is maintained by a cheminformatics toolkit or laboratory
    • Good PR for the toolkit or advertising for a lab
    • Encourages the development of validation tools
  • Create an AreWeTheSameMolecule website
    • It accepts a pair of structures, and then sends them to N webservices that check for identity

Improving the quality of chemical databases with community-developed tools (and vice versa)
http://baoilleach.blogspot.com
baoilleach@gmail.com
Acknowledgements

Improving the quality of chemical databases with community-developed tools (and vice versa)

  • 1. Improving the quality of chemical databases with community-developed tools (and vice versa)
      • Noel M. O’Boyle, Aug 2011
      • 5th Meeting on U.S. Government Chemical Databases and Open Chemistry, NCI-Frederick, MD, U.S.
      • Slides at http://tinyurl.com/noel-nci
      • Open Babel
  • 2. Improving the quality of chemical databases with community-developed tools (and vice versa)
      • Part 1: Using Databases to improve Open Babel
      • Part 2: Using Open Babel to improve Databases
  • 3. Volunteer effort, an open source success story
      • Originally a fork from OpenEye’s OELib in 2001
      • Lead is Geoff Hutchison (Uni of Pittsburgh)
      • 4 or 5 active developers – I got involved in late 2005
      • http://openbabel.org
      • Paper coming out Real Soon Now
  • 4. Improving Open Babel using Databases
      • Originally only had code to record stereochemistry in SMILES
      • In 2005, Nick England, as an undergraduate summer student with PMR (sponsored by Merck), added better support throughout the library
      • However, by early 2009, it was clear that Open Babel’s handling of stereochemistry needed to be overhauled
          • Bug reports: SMILES conversions were causing flipping of chirality, incorrect InChIs were being generated, …
      • Tim Vandermeersch took the lead in writing new classes and stereo perception code
          • I integrated the code into the various formats
      • Handling stereochemistry is tricky (really!)
          • Anticipating corner cases triggered by 1 in 10000 molecules is difficult…
          • …unless, of course, you have a dataset of 10000 molecules
          • (corollary also true: developers of large databases are the people most likely to find bugs in cheminformatics toolkits)
      • Solution: use PubChem and other databases to flush out bugs: Starting Material +
  • 5. The Read/Write SMILES test
      • Test set is first subset of PubChem 3D: 18053 molecules as SDF file
      • 3D structures nice to use because the stereochemistry is explicit (and easily visualised)
      • Test Open Babel’s ability to correctly read or write SMILES strings:
          • (a) Convert SDF to SMILES; convert these to CanSMILES
          • (b) Convert SDF to CanSMILES
          • (c) Compare (a) and (b)
      • Differences will be principally due to errors in:
          • Reading SDF, reading/writing SMILES
          • Kekulisation or canonicalisation
      • 19/Mar/2009: 1424 (8%) had differences
      • 21/Mar/2009: 925 (5%)
      • 22/Mar/2009: 324 (2%)
      • 10/Oct/2009: 190 (1%)
      • 04/Oct/2010: 5 (out of 18084)
      • 31/May/2011: 2
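Step (c) of the test above is a line-by-line comparison of two lists of canonical SMILES. The following is a minimal sketch of that step only, not the script used for the talk. It assumes both conversions have already been run and produced one entry per molecule in the same order; the function names and the sample strings are made up for illustration.

```python
def count_differences(route_a, route_b):
    """Return (number of differing entries, total entries).

    route_a: canonical SMILES from SDF -> SMILES -> CanSMILES
    route_b: canonical SMILES from SDF -> CanSMILES
    Both sequences are assumed to list the molecules in the same order.
    """
    if len(route_a) != len(route_b):
        raise ValueError("the two routes must cover the same molecules")

    # Keep only the first whitespace-separated field: the SMILES itself,
    # dropping any title that follows it on the line.
    def smiles_only(line):
        fields = line.split()
        return fields[0] if fields else ""

    ndiff = sum(1 for a, b in zip(route_a, route_b)
                if smiles_only(a) != smiles_only(b))
    return ndiff, len(route_a)


def percent(ndiff, total):
    """Differences as a percentage of the test set."""
    return 100.0 * ndiff / total


# Made-up example: the second molecule loses its chirality in route (a).
route_a = ["CCO mol1", "NC(C)C(=O)O mol2", "c1ccccc1 mol3"]
route_b = ["CCO mol1", "N[C@@H](C)C(=O)O mol2", "c1ccccc1 mol3"]
ndiff, total = count_differences(route_a, route_b)
```

Applied to the first figure on the slide, `percent(1424, 18053)` is just under 8, in line with the quoted 8%.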
  • 6. Testing canonicalisation of SMILES
      • Canonicalisation useful for comparing identity and compound registry
      • Relatively simple to handle 95% of molecules (Morgan algorithm)
      • More complicated for the general case
          • Stereocenters related by symmetry, potential stereocenters whose configuration depends on other stereocenters
      • Test set: eMolecules dataset (5.2m)
      • Test canonicalisation by shuffling the atom order, and verifying that the same canonical SMILES is generated
          • Repeated 10 times
      • 23k (0.4%) failures for OB 2.2.3
      • 4 failures for OB 2.3.1 (dev)
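The shuffle test can be sketched independently of any toolkit. In the illustration below a molecule is just a list of element symbols plus a list of bonded index pairs, and the canonicaliser is passed in as a function. Both example canonicalisers are toys written for this sketch (a brute-force one that is order-independent by construction, and a deliberately broken one); neither is Open Babel's algorithm, and all names are hypothetical.

```python
import itertools
import random


def relabel(mol, order):
    """Return a copy of mol with its atoms renumbered.

    mol is (elements, bonds): a list of element symbols and a list of
    (i, j) index pairs. order[k] is the old index of the atom that is
    placed at new position k.
    """
    elements, bonds = mol
    new_index = {old: new for new, old in enumerate(order)}
    new_elements = [elements[old] for old in order]
    new_bonds = [(new_index[i], new_index[j]) for i, j in bonds]
    return new_elements, new_bonds


def shuffle_test(mol, canonicalise, trials=10, seed=0):
    """Count the trials in which a shuffled copy gives a different result."""
    rng = random.Random(seed)
    reference = canonicalise(mol)
    failures = 0
    for _ in range(trials):
        order = list(range(len(mol[0])))
        rng.shuffle(order)
        if canonicalise(relabel(mol, order)) != reference:
            failures += 1
    return failures


def brute_force_canonical(mol):
    """Toy canonical form: smallest description over all atom orderings.

    Only feasible for very small graphs; real toolkits use far more
    efficient algorithms.
    """
    elements, _ = mol
    best = None
    for order in itertools.permutations(range(len(elements))):
        els, bds = relabel(mol, order)
        key = (tuple(els), tuple(sorted(tuple(sorted(b)) for b in bds)))
        if best is None or key < best:
            best = key
    return best


def order_dependent(mol):
    """Deliberately broken 'canonicaliser' that echoes the input order."""
    elements, bonds = mol
    return tuple(elements), tuple(bonds)


# Heavy atoms of ethanol: C-C-O
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
good_failures = shuffle_test(ethanol, brute_force_canonical)
bad_failures = shuffle_test(ethanol, order_dependent)
```

The design point is that the harness needs no reference answer: any output that changes with the atom order is a failure by definition.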
  • 7. Independent test of SDF to SDF conversion
      • Recently, Róbert Kiss evaluated Open Babel for use by mcule.com
      • Selected all molecules from PubChem with at least one tet center, at least one cis/trans bond, and 350<MW<750
          • 478k molecules (2D SDF)
          • Excluded 356 where InChI->SDF->InChI had error
      • (a) Converted to InChIs with InChI binary
      • (b) Converted SDF->SDF with Open Babel, and then to InChIs with InChI binary
      • (c) Compared (a) and (b)
      • 09/Aug/2011: 878 (0.2%) disagreement
      • 16/Aug/2011: 554
      • 21/Aug/2011: 146 (…work in progress)
      • 57 of these have the same substructure that exposes a Mol file corner case…
  • 8. Mol file corner case
      • InChI binary regards these Mol files as different
      • Suggests useful rule for choosing location of wedge/hash when writing Mol file
      • Rule: If two bonds are at similar angles, choose one of these
      • Three non-stereo bonds at widely spaced angles (although one is hidden)
      • Two of the non-stereo bonds are very close => InChI decides that the stereochemistry is ambiguous
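The rule on this slide can be illustrated with plain 2D geometry. The sketch below finds the two bonds around a centre that are closest in angle, so that one of them can be chosen for the wedge/hash. It is not Open Babel's implementation: the function names, the 30 degree threshold for "similar angles" and the coordinates are all assumptions made for this example.

```python
import math


def bond_angle(centre, neighbour):
    """Direction of the bond centre -> neighbour in degrees, 0 <= a < 360."""
    dx = neighbour[0] - centre[0]
    dy = neighbour[1] - centre[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def angular_gap(a, b):
    """Smallest separation between two directions, in degrees (0 to 180)."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def closest_bond_pair(centre, neighbours, threshold=30.0):
    """Return the indices of the two bonds closest in angle, or None.

    threshold (degrees) is an arbitrary illustrative cut-off for what
    counts as "similar angles".
    """
    best = None
    for i in range(len(neighbours)):
        for j in range(i + 1, len(neighbours)):
            gap = angular_gap(bond_angle(centre, neighbours[i]),
                              bond_angle(centre, neighbours[j]))
            if best is None or gap < best[0]:
                best = (gap, i, j)
    if best is not None and best[0] <= threshold:
        return best[1], best[2]
    return None


# Made-up 2D coordinates: neighbours 1 and 2 point in nearly the same
# direction, so one of them is the suggested place for the wedge/hash.
centre = (0.0, 0.0)
crowded = [(-1.0, 0.0), (1.0, 0.1), (1.0, -0.1), (0.0, 1.0)]
# Three bonds 120 degrees apart: no pair counts as "similar".
spread = [(1.0, 0.0), (-0.5, 0.866), (-0.5, -0.866)]
```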
  • 9. Summary of Part One
      • Open Babel has been considerably improved and tested by training on large databases
      • Large databases are essential as test cases for cheminformatics toolkits
          • Help find errors
          • Help ensure that the “fix” doesn’t generate more errors
      • Devising an appropriate test is half the work
          • Should focus on a particular aspect of the toolkit
          • If a problem is found, it should be easy to figure out its origin
          • Preferably should be a real use case
  • 10. Part Two
      • Can we now use Open Babel to identify problems in the databases themselves?
      • Case studies:
          • Finding neutral 4-coordinate Ns in ZINC
          • Identifying ambiguous stereochemistry in PubChem and ChEMBL mol files
          • Verifying that chemical data presented is self-consistent – ChEMBL
  • 11. Identifying structure problems in ZINC
      • Back in 2007, I noticed something strange in ZINC’s 3D structures
      • Namely, structures with sp3 hybridised N, with four bonds, but where the N was uncharged
      • So…I wrote a script using Open Babel to find all examples of this problem, and reported the results to ZINC
      • About 5% of molecules had this problem (now fixed)

        import glob
        import pybel
        import openbabel as ob

        outputfile = open("dodgyNs.txt", "w")
        for filename in glob.glob("gzipfiles/*.mol2"):
            for mol in pybel.readfile("mol2", filename):
                for atom in mol:
                    if atom.type == "N3": # Internal OB atom type (equivalent to N.3)
                        numbonds = len(list(ob.OBAtomBondIter(atom.OBAtom)))
                        if numbonds == 4:
                            print >> outputfile, mol.title
                            break
        outputfile.close()
  • 12. 2D MOL files with Ambiguous Stereocenters
      • Chirality specified at one stereocenter or two?
      • Need to know the convention used
          • Tip-only (useful to state or is this everywhere now?)
      • Avoid this problem by choosing wedge/hash bonds that do not link potential stereocenters
          • Almost always possible
      • OB recipe: terminal H is preferred; next, of the bonds that do not link stereocenters, an exo-cyclic bond is preferred; finally, any remaining bond
      • http://baoilleach.blogspot.com/2010/12/name-that-stereochemistry-when-mol.html
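The "OB recipe" above is an ordering of preferences, which can be sketched as a ranking function. This is one reading of the slide and not Open Babel's actual code: the `CandidateBond` record, its fields and the numeric priorities are hypothetical, and the slide does not say where ring bonds that avoid stereocenters rank, so they are assumed here to come before bonds that link stereocenters.

```python
from dataclasses import dataclass


@dataclass
class CandidateBond:
    """Hypothetical description of one bond around a stereocenter."""
    label: str
    to_terminal_h: bool       # other end is a hydrogen with no further bonds
    links_stereocenter: bool  # other end is itself a potential stereocenter
    in_ring: bool             # bond is part of a ring


def wedge_priority(bond):
    """Lower value = better place for the wedge/hash (assumed reading)."""
    if bond.to_terminal_h:
        return 0
    if not bond.links_stereocenter:
        # Among bonds that do not link stereocenters, prefer exocyclic ones.
        return 1 if not bond.in_ring else 2
    return 3  # any remaining bond


def choose_wedge_bond(bonds):
    """Pick the preferred bond; ties keep the first in input order."""
    return min(bonds, key=wedge_priority)


# Made-up ring stereocenter with no explicit H.
bonds = [
    CandidateBond("ring bond to neighbouring stereocenter", False, True, True),
    CandidateBond("ring bond to CH2", False, False, True),
    CandidateBond("exocyclic bond to OH", False, False, False),
]
chosen = choose_wedge_bond(bonds)
```

In this example the exocyclic bond is chosen, so the wedge does not sit between two potential stereocenters.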
  • 13. Easy to find?

        import pybel

        def dodgywedge(sdffile):
            tot = probs = potential_probs = 0
            for mol in pybel.readfile("sdf", sdffile):
                tot += 1
                facade = pybel.ob.OBStereoFacade(mol.OBMol)
                tetcenters = [atom.OBAtom for atom in mol
                              if facade.HasTetrahedralStereo(atom.OBAtom.GetId())]
                for idx, atom_a in enumerate(tetcenters[:-1]):
                    for atom_b in tetcenters[idx+1:]:
                        if atom_a.IsConnected(atom_b):
                            potential_probs += 1
                            bond = atom_a.GetBond(atom_b)
                            if bond.IsWedge() or bond.IsHash():
                                probs += 1
            print "Total number of molecules", tot
            print "Potential problems:", potential_probs
            print "Actual problems:", probs

        if __name__ == "__main__":
            dodgywedge("myfile.sdf")
  • 14. How common?

        print "Total number of molecules", tot
        print "Potential problems:", potential_probs
        print "Actual problems:", probs

      • (Dec 2010)
      • PubChem subset: 23k molecules
          • 14k bonds connecting chiral centers
          • 21 marked as stereobonds (<0.1%)
      • ChEMBL: 636k molecules
          • 483k bonds connecting chiral centers
          • 7k marked as stereobonds => 1.4% are ambiguous stereobonds
      • Easy to fix? (OB 2.3.1)

        obabel my2Dmol.mol -O fixed2Dmol.mol
  • 15. Self-consistency of chemical data
      • For a single molecule, a database will typically include several of the following:
          • a 2D molfile
          • a 2D depiction
          • a 3D molfile
          • a non-canonical SMILES string
          • a canonical SMILES string
          • an InChI
          • an InChIKey
      • But which one is the primary data, and which are derived?
          • Derived data may be inconsistent with primary data
          • Every transformation of the data can lead to information loss or corruption
          • Maintainers should highlight the primary data
      • Can Open Tools help identify inconsistencies?
  • 16. Self-consistency of chemical data II
      • As an example, let’s look for disagreements between the MOL file and the SMILES string provided in a subset of ChEMBL
      • Using Open Babel’s canonical SMILES:

        obabel chembl.sdf -ocan -O sdf_to_can.txt
        obabel chembl_can.txt -ocan -O can_to_can.txt

      • Using Open Babel’s InChI interface:

        obabel chembl.sdf -oinchi -O sdf_to_inchi.txt
        obabel chembl_can.txt -oinchi -O can_to_inchi.txt

      • Write a Python script to go through the text files and find differences
      • Looking at the first 10000 entries in ChEMBL 10:
          • 249 disagreements according to derived canonical SMILES
          • 76 disagreements according to derived InChIs
          • 51 disagreements in common
          • 25 only InChI, 198 only canonical SMILES
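The comparison script mentioned on this slide is not shown in the talk. A minimal sketch of what it might do follows, assuming each text file has been read into a list with one line per entry, in the same order in every file. The function names are hypothetical and the sample lines are made up.

```python
def disagreements(sdf_derived, smiles_derived):
    """Indices at which two parallel lists of identifiers differ."""
    return {i for i, (a, b) in enumerate(zip(sdf_derived, smiles_derived))
            if a.strip() != b.strip()}


def overlap_summary(by_smiles, by_inchi):
    """Break two sets of disagreeing entries into common and unique parts."""
    return {
        "in common": len(by_smiles & by_inchi),
        "only InChI": len(by_inchi - by_smiles),
        "only canonical SMILES": len(by_smiles - by_inchi),
    }


# Made-up stand-ins for the four output files (one line per entry).
sdf_to_can = ["CCO", "C/C=C/C", "c1ccccc1"]
can_to_can = ["CCO", "CC=CC", "c1ccccc1"]
sdf_to_inchi = ["InChI=1S/A", "InChI=1S/B", "InChI=1S/C"]
can_to_inchi = ["InChI=1S/A", "InChI=1S/B", "InChI=1S/X"]

by_smiles = disagreements(sdf_to_can, can_to_can)
by_inchi = disagreements(sdf_to_inchi, can_to_inchi)
summary = overlap_summary(by_smiles, by_inchi)
```

With real files, each list could be built with something like `open("sdf_to_can.txt").read().splitlines()`. The slide's own figures fit this breakdown: 249 and 76 disagreements with 51 in common leave 198 found only by canonical SMILES and 25 found only by InChI.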
  • 17. Note to self: Graphical software that makes this comparison easier would be very useful
  • 18.
  • 19. N=N can be cis or trans
      • SMILES string has unspecified stereochemistry
      • However, Molfile has trans geometry and does not mark the stereobond as unspecified
      • This source of disagreement accounts for 23 of the 51 cases.
  • 20. Is the chirality specified? √ X
      • Open Babel is a bit confused by this one too:

        > obabel -:"OC1CC[C@](CC1)(c1ccccc1)N1CCCCC1" -ocan
        OC1CC[C@](CC1)(N1CCCCC1)c1ccccc1
        > obabel -:"OC1CC[C@@](CC1)(c1ccccc1)N1CCCCC1" -ocan
        OC1CC[C@](CC1)(N1CCCCC1)c1ccccc1
  • 21. Concluding Points, Ideas and Questions
      • Many classes of errors can be relatively easily identified using Open Toolkits
      • Could crowd-source some of this, “and the iPad goes to the student who writes a script that finds the largest number of errors in MyDB”
          • Must use toolkits to which we have access here at MyDB
          • FP rate must be less than X
      • Are these types of analyses useful to database maintainers?
          • I think the Blue Obelisk community would contribute here if it were welcome
      • Could provide sanity checkers or validation website using webservices, like checkcif for molecules
      • Create a ValidateMyMolecule website
          • It accepts a single structure, and then sends it to N webservices that validate it
          • Each webservice is maintained by a cheminformatics toolkit or laboratory
          • Good PR for the toolkit or advertising for a lab
          • Encourages the development of validation tools
      • Create an AreWeTheSameMolecule website
          • It accepts a pair of structures, and then sends them to N webservices that check for identity
  • 22.
  • 23. All database maintainers everywhere! ChEMBL, eMolecules, PubChem, ZINC. Image: Tintin44 (Flickr)

Editor's notes

  1. I’ve been involved with OB since late 2005
  2. The first problem is kekulization; the second is canonicalisation.
  3. John Irwin, ZINC paper 2005
  4. CID 10280
  5. Total number of molecules 999881; Potential problems: 569170; Actual problems: 20248
  6. Does the toolkit recognise N=N as a source of stereoisomerism?