Capturing Chemistry in XML/CML. J. A. Townsend*, S. E. Adams*, J. M. Goodman*, P. Murray-Rust*, C. A. Waudby*. ACS March 2004. *Unilever Centre for Molecular Informatics, University of Cambridge
The Agony Of Publication - Loss: The World; Sad; The Scientist; The Lab; Journals; Web Pages
The Vision-1: Human-readable: mp 65-66 °C. Machine-readable: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" />
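The human-readable and machine-readable forms on this slide can be related with a short Python sketch. It assumes only the attribute names shown on the slide (dictRef, units, minValue, maxValue); the helper names melting_range_to_cml and cml_to_text are hypothetical, and the element is built without a namespace or schema validation, so this is an illustration rather than the authors' tooling.

```python
import xml.etree.ElementTree as ET


def melting_range_to_cml(min_c, max_c):
    """Build a <scalar> element shaped like the one on the slide (hypothetical helper)."""
    return ET.Element(
        "scalar",
        {
            "dictRef": "ccml:mp",   # dictionary reference, as on the slide
            "units": "units:c",     # units reference, as on the slide
            "minValue": str(min_c),
            "maxValue": str(max_c),
        },
    )


def cml_to_text(scalar):
    """Render the element back into the human-readable form (hypothetical helper)."""
    return f"mp {scalar.get('minValue')}-{scalar.get('maxValue')} °C"


elem = melting_range_to_cml(65, 66)
xml_text = ET.tostring(elem, encoding="unicode")  # serialise to a string
roundtrip = ET.fromstring(xml_text)               # parse it back, as a consumer would

print(xml_text)
print(cml_to_text(roundtrip))
```

The round trip through a string is there to show that nothing is lost between the two forms: the range and its units survive as separate, typed fields.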
The Vision-2 [bullet text not recovered]. But also [bullet text not recovered]
Our Approach [bullet text not recovered]
Machine Parsing of Chemistry: Structured (CompChem); Semi-Structured (Articles); Unstructured (Discussion); Structured documents and data in XML; MACHINE PARSING?
How? Article (semi-structured): Abstract; Discussion; Experimental. Add Structure; Parse with Regular Expressions; Legacy to CML converters
Regular Expressions [pattern text not recovered]. Melting point: two possible syntaxes: "m.p. > 23.5 °C" and "mp 23.5 – 25 °C". Pattern annotations: capital or lowercase 'm'; maybe '.'; lowercase 'p'; maybe whitespace; any punctuation; 0 or more digits; maybe degrees sign; capital 'C'
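The pattern itself did not survive on this slide, so the following Python sketch is a reconstruction from the annotations and the two example syntaxes, not the expression OSCAR used. The names MP_PATTERN and parse_melting_point are hypothetical, and the handling of the comparison sign and of the range separator is an assumption.

```python
import re

# Assumed pattern, pieced together from the slide's annotations.
MP_PATTERN = re.compile(
    r"""
    [Mm]\.?p\.?                               # 'm' or 'M', maybe '.', lowercase 'p', maybe '.'
    \s*                                       # maybe whitespace
    (?P<op>[<>])?                             # optional comparison sign, as in 'm.p. > 23.5'
    \s*
    (?P<low>\d+(?:\.\d+)?)                    # first (or only) temperature
    (?:\s*[-–]\s*(?P<high>\d+(?:\.\d+)?))?    # optional upper end of a range
    \s*°?\s*C                                 # maybe degrees sign, then capital 'C'
    """,
    re.VERBOSE,
)


def parse_melting_point(text):
    """Return operator, min and max for the first melting point found, or None."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    high = match.group("high")
    return {
        "operator": match.group("op"),
        "min": float(match.group("low")),
        "max": float(high) if high else None,
    }


print(parse_melting_point("m.p. > 23.5 °C"))
print(parse_melting_point("mp 23.5 – 25 °C"))
```

Named groups keep the extracted numbers separate from the surrounding text, which is what makes it straightforward to fill minValue and maxValue in a CML scalar afterwards.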
CML - XML For Chemistry [bullet text not recovered]. J. Chem. Inf. Comp. Sci., 2003, 43, 757
The CML Family: Controlled XML namespaces: CMLCore – compounds and properties; CMLReact – reactions; CMLSpect – spectra; *CMLComp – compChem; CMLCryst – crystallography and condensed matter. Interoperates with HTML, MathML, SVG, *AniML (+), *ThermoML ($), etc. (+) spectra: ANSI/JCAMP. ($) thermochemistry: NIST. J. Chem. Inf. Comp. Sci., 2003, 43, 757
Case Studies: Parsing output from 750,000 MOPAC jobs; High-throughput parsing of journals
CompChem Logs: Coordinates; Molecular Formula; Calculation Type; Point Group; Dipole; Total Energy
Loss From CompChem: Coordinates; Molecular Formula; Calculation Type; Dipole; Total Energy; Ionisation Potential
Parsing Data: CompChem Output (Coordinates; Energy Levels; Vibrations; Input/jobControl); General Parsers; CML File (Coordinates; Energy Level; Vibration; CMLCore; CMLCore; CMLComp; CMLSpect)
Display Process 1: CompChem Log; Xindice; CML; XSLT
Display Process 2: compChem Output (Coordinates; Energy Levels; Vibrations; Input/jobControl); CML File (CMLCore; CMLCore; CMLComp; CMLSpect); XSLT; Display: 3D structure, electronic properties; Normal modes; 2D structure, thermodynamic properties
Parsing Data: Dictionary Entry: The point group of a molecule ... The Schoenflies convention is normally used, but Hermann-Mauguin is also allowed. Units entry: D [debye]; ParentSI: c.m; Multiplier: 3.335641E-30; CGS units for electric dipole
Dictionaries: <scalar dictRef="ccml:mp" units="units:c" minValue="65" maxValue="66" /> Linked to CML schema; accesses CCML namespace. Units dictionary: id="celsius" name="Celsius" parentSI="k" multiplierToSI="1" constantToSI="273.15" abbreviation="C" unitType="temp". Dictionary entry: id="meltrange" term="Melting range" definition="Minimum and maximum values of melting range in degrees Celsius"
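A units dictionary entry like the ones on the Parsing Data and Dictionaries slides carries enough information to convert a value to its parent SI unit. The Python sketch below assumes the conversion is value × multiplierToSI + constantToSI, which matches the Celsius entry but is not stated on the slides; the debye entry gives no constant, so zero is assumed. UNITS and to_si are hypothetical names.

```python
# Entries copied from the slides; constantToSI for debye is an assumed 0.0.
UNITS = {
    "celsius": {"parentSI": "k", "multiplierToSI": 1.0, "constantToSI": 273.15},
    "debye": {"parentSI": "c.m", "multiplierToSI": 3.335641e-30, "constantToSI": 0.0},
}


def to_si(value, unit_id):
    """Convert a value to its parent SI unit (assumed linear rule)."""
    unit = UNITS[unit_id]
    si_value = value * unit["multiplierToSI"] + unit["constantToSI"]
    return si_value, unit["parentSI"]


# The melting range 65-66 degrees Celsius from the <scalar> example, in kelvin.
low_k, si_unit = to_si(65, "celsius")
high_k, _ = to_si(66, "celsius")
print(low_k, high_k, si_unit)

# A dipole moment of 1 D in coulomb metres.
print(to_si(1.0, "debye"))
```

Keeping the conversion data in the dictionary rather than in code means a parser only needs the units reference on each scalar to normalise values from different sources.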
OSCAR: Open Source Chemistry Analysis Routines. Sponsored by the Royal Society of Chemistry (Cambridge). Mounted on http://www.rsc.org/
Article Structure: Article: Front Matter; Abstract; Introduction; Discussion; Experimental; References; Results
Article Structure: Article: Front Matter; Abstract; Introduction; Discussion; Experimental; References; Results. Experimental: Synthesis; Set up; Analysis; Compound Name
Information Checked / Extracted [bullet text not recovered]
OSCAR Parsing Data: H NMR; Nature; HRMS
OSCAR Data Found: Results from one paper
OSCAR Error Checking: Serious Error; Warning Type 1; Warning Type 2
OSCAR Error Checking: ~30 errors / warnings searched for. This article has: 4 errors; 2 warnings (type 1); 30 warnings (type 2). Elemental analysis incorrect: calculations are for a different molecular formula
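The elemental analysis error mentioned on this slide is the kind of inconsistency that can be detected by recomputing the percentage composition from the stated molecular formula. The Python sketch below is one possible check, not OSCAR's own routine: the function names are hypothetical, only C, H, N and O are covered, the formula parser handles flat formulas without brackets, and the 0.4 percentage-point tolerance is an assumed threshold.

```python
import re

# Standard atomic masses for a small, assumed set of elements.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C6H12O6' (no brackets or charges)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def percent_composition(formula):
    """Mass percentage of each element in the formula."""
    counts = parse_formula(formula)
    total = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    return {el: 100.0 * ATOMIC_MASS[el] * n / total for el, n in counts.items()}


def check_elemental_analysis(formula, reported, tolerance=0.4):
    """List elements whose reported percentage disagrees with the formula."""
    expected = percent_composition(formula)
    return [
        f"{el}: reported {value:.2f}, expected {expected.get(el, 0.0):.2f}"
        for el, value in reported.items()
        if abs(value - expected.get(el, 0.0)) > tolerance
    ]


reported = {"C": 40.00, "H": 6.71}
print(check_elemental_analysis("C6H12O6", reported))  # consistent formula
print(check_elemental_analysis("C6H6", reported))     # values belong to another formula
```

When the reported values were calculated for a different formula, every element tends to fail at once, which distinguishes this case from a single mistyped digit.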
OSCAR Data Presentation
OSCAR Speed: A typical paper contains ca. 20 compounds. JOC (Feb 2004) contains ~600 compounds; OSCAR could extract and tabulate in under 5 minutes. OBC (Feb 2004) contains ~300 compounds; OSCAR could extract and tabulate in under 3 minutes. High throughput, high precision
OSCAR Accuracy: 92 % of data correctly identified; 3 % incorrect author entry; 5 % missed. 437 items, ~10,000 data fields in test set, working with current Regular Expressions. False positives: 3 %
XML-CML Databases: CML; Journals; Theses; CompChem; Xindice. XMLDb can support > 250,000 molecules. Millisecond retrieval on INChI, properties
Capturing Molecules: Encourage chemists to [bullet text not recovered]
NLP & Parsing Names: KEY: Locant; Characteristic Group; Monovalent parent hydride; Multiplier; Heterocyclic parent hydride
Thank You: Unilever; RSC; Jonathan Goodman; Sam Adams; Fraser Norton; Chris Waudby; Yong Zhang
