Delivering an online service for validating and
standardizing chemical structure files using
the ChemSpider platform
Overview
• Introduction
  – Why do we need to
    validate/standardise data
  – Examples of problems in general
  – Examples of problems in
    ChemSpider
  – Why InChI is not enough
  – FDA rules
What are we trying to achieve?
• Everyone wants high quality data
• The ChemSpider team is building a reputation on
  data quality
• Many data sources contain errors
• We need to identify:
  – Errors
  – Inconsistencies
  – Data duplication/Inappropriate separation of data
• Requires a process of validation and
  standardization
What do we mean by Validation and Standardisation?
• Validated
  – Check for hypervalency, charge balance, missing
    stereo
  – Name-Structure relationships, etc.
• Standardized
  – Use standard rules to “standardize” compounds:
    nitro groups, O-metal bonds, tautomers, etc.
Where will CVSP be useful?
• Currently, a standalone system

• In the future, validation/standardisation routines
  will be used:
   – Within our deposition system
   – At registration of new compounds
   – To improve existing data in ChemSpider by passing
     the ChemSpider backfile through them

• Potential to offer an optional checking service to
  authors
What we want to avoid
What do we do now?
• Currently, ChemSpider uses structures (as
  InChIs) as the database key
• Structures are needed for depositions
• Two steps:
  – Pre-processing prior to deposition
  – InChI algorithm: provides standardisation and
    mapping
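As a rough illustration of what using the structure identifier as the database key implies, here is a minimal Python sketch. The `deposit` function, the registry layout, and the source names are hypothetical and are not ChemSpider's implementation; the two strings are the standard InChIs for ethanol and methane. Depositions that resolve to the same InChI collapse onto a single record.

```python
from collections import defaultdict

ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANE = "InChI=1S/CH4/h1H4"

def deposit(registry, inchi, source):
    """Register a deposition under its InChI key (hypothetical registry layout).

    Returns the number of depositions now mapped to that key.
    """
    registry[inchi].append(source)
    return len(registry[inchi])

# The registry maps one InChI string to every source that deposited it.
registry = defaultdict(list)
deposit(registry, ETHANOL, "source_A")
deposit(registry, ETHANOL, "source_B")  # same structure, second source
deposit(registry, METHANE, "source_A")
```

The limitation follows directly: two drawings that the InChI algorithm maps to different strings are kept as separate records, even when they are meant to depict the same compound.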
What are the common errors?
• Records without a structure
• Incorrect valences
• Atom labels
What are the common errors? (continued)
• Unbalanced charge
   – Name-structure errors
• Salts
• Polymers/Organometallics
• Missing stereochemistry
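Some of the errors listed above can be caught with very simple checks. The Python sketch below is only an illustration under stated assumptions: the record layout (a dict with an `atoms` list of `(element, formal_charge, valence)` tuples) and the small valence table are hypothetical and are not the CVSP rule set. It flags records without a structure, neutral atoms whose valence exceeds a simple limit, and a non-zero net charge.

```python
# Simplified maximum valences for neutral atoms (illustrative assumption only).
MAX_NEUTRAL_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}

def validate(record):
    """Return a list of warning strings for one hypothetical record."""
    atoms = record.get("atoms") or []
    if not atoms:
        return ["record without a structure"]
    warnings = []
    for index, (element, charge, valence) in enumerate(atoms):
        limit = MAX_NEUTRAL_VALENCE.get(element)
        # Charged atoms and elements missing from the table are skipped here.
        if charge == 0 and limit is not None and valence > limit:
            warnings.append(
                f"atom {index} ({element}): valence {valence} exceeds {limit}"
            )
    net_charge = sum(charge for _, charge, _ in atoms)
    if net_charge != 0:
        warnings.append(f"unbalanced charge: net {net_charge:+d}")
    return warnings

empty_record = {"name": "no structure", "atoms": []}
pentavalent_carbon = {"name": "CH5", "atoms": [("C", 0, 5)] + [("H", 0, 1)] * 5}
lone_cation = {"name": "sodium ion only", "atoms": [("Na", 1, 0)]}
```

Checks for missing stereochemistry or name-structure mismatches need a real cheminformatics toolkit and are outside the scope of this sketch.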
Side Effects of InChI on ChemSpider: Sort of helpful
Side Effects of InChI on ChemSpider
• Advantages and disadvantages
  – The depictions are meant to represent the same molecule
  – Not easy to pick out “bad” representations
Substance Registry System
• How do you decide your standardisation rules?
• Avoid standards in isolation

http://www.fda.gov/downloads/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/ucm127743.pdf

• Note: This document is only a starting point
Salt and Ionic Bonds
Nitro groups
Ammonium salts
Validation rules
Rules are defined in XML.
Code is generated dynamically from the rule set.
The Indigo API is used behind the scenes.
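The XML shown on the original slide is not reproduced in this transcript, so the Python sketch below uses an invented rule format purely to illustrate the idea of building checks from a rule set at run time. The element and attribute names (`rule`, `id`, `element`, `max`, `equals`, `severity`) are assumptions, atoms are hypothetical `(element, formal_charge, valence)` tuples, and the sketch builds closures with the standard library instead of generating code or calling Indigo.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set; this is NOT the actual CVSP XML schema.
RULES_XML = """
<rules>
  <rule id="max_valence" element="C" max="4" severity="error"/>
  <rule id="max_valence" element="O" max="2" severity="error"/>
  <rule id="net_charge" equals="0" severity="warning"/>
</rules>
"""

def build_checks(xml_text):
    """Turn each <rule> element into a (severity, kind, function) triple."""
    checks = []
    for rule in ET.fromstring(xml_text.strip()).iter("rule"):
        kind = rule.get("id")
        if kind == "max_valence":
            element, limit = rule.get("element"), int(rule.get("max"))

            def check(atoms, element=element, limit=limit):
                return all(v <= limit for e, _, v in atoms if e == element)
        elif kind == "net_charge":
            target = int(rule.get("equals"))

            def check(atoms, target=target):
                return sum(c for _, c, _ in atoms) == target
        else:
            continue  # unknown rule kinds are ignored in this sketch
        checks.append((rule.get("severity"), kind, check))
    return checks

def run_checks(checks, atoms):
    """Return (severity, kind) for every rule the atom list fails."""
    return [(sev, kind) for sev, kind, check in checks if not check(atoms)]

checks = build_checks(RULES_XML)
```

Keeping the rules in data rather than in code is what would let a pre-defined rule set and a user-supplied one run through the same engine.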
Standardization rules
Corrections are stored in the database.
SMIRKS-based corrections are applied, along with
proximity-based metal–non-metal reconnection.
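The slides do not describe how the proximity-based reconnection works, so the following Python sketch shows only one plausible reading: add a bond between a metal and a non-metal atom whenever they lie within a distance cutoff. The metal list, the 2.5 Å cutoff, and the data layout are all illustrative assumptions, not CVSP parameters. SMIRKS-based corrections require a cheminformatics toolkit and are not sketched here.

```python
import math

METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn"}  # illustrative subset

def reconnect(atoms, bonds, cutoff=2.5):
    """Return the bonds plus new metal/non-metal bonds for close pairs.

    atoms: list of (element, (x, y, z)) with coordinates in angstroms.
    bonds: set of frozenset({i, j}) atom-index pairs.
    cutoff: assumed distance threshold in angstroms.
    """
    result = set(bonds)
    for i, (element_i, position_i) in enumerate(atoms):
        for j in range(i + 1, len(atoms)):
            element_j, position_j = atoms[j]
            if (element_i in METALS) == (element_j in METALS):
                continue  # the pair must contain exactly one metal
            if math.dist(position_i, position_j) <= cutoff:
                result.add(frozenset((i, j)))
    return result

# A sodium atom drawn next to, but not bonded to, an oxygen of a C-O fragment.
atoms = [("Na", (0.0, 0.0, 0.0)), ("O", (2.3, 0.0, 0.0)), ("C", (3.6, 0.0, 0.0))]
bonds = {frozenset((1, 2))}
reconnected = reconnect(atoms, bonds)
```

In this example the Na and O atoms are 2.3 Å apart and become bonded, while the Na and C atoms at 3.6 Å are left unconnected.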
Case study: DrugBank
• DrugBank (http://www.drugbank.ca/)
  maintained by David Wishart
• Database contains 6711 structures
• Widely regarded as a well-curated, high-quality
  dataset


DrugBank 3.0: a comprehensive resource for 'omics' research on drugs. Knox C, Law V,
Jewison T, Liu P, Ly S, Frolkis A, Pon A, Banco K, Mak C, Neveu V, Djoumbou Y, Eisner R,
Guo AC, Wishart DS. Nucleic Acids Res., 2011, 39, Jan, D1035-41.
ChemSpider Standardization
• The entire ChemSpider database will be
  standardized using a modified FDA rule set

• Original Molfiles will be standardized and all
  properties (predicted properties, SMILES, InChIs,
  names) will be regenerated

• Standardization procedures will be applied
  automatically to all future depositions
CVSP as a Flexible System
• There will be various rule sets
  – Rigid pre-defined rules: e.g. meeting FDA
    specifications as written, the Open PHACTS
    modified rule set, etc.

  – Flexible user-defined rules: users upload their
    rules in our custom format (XML)

  – The Open PHACTS rule set will be open to the
    community to reuse
Incorporating CVSP into data processing platforms: KNIME
• The workflow includes:
  – SDF reader
  – Indigo nodes
  – Calls to ChemSpider validation web services
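Outside KNIME, the first and last steps of such a workflow can be approximated in a few lines of Python. This is a sketch under stated assumptions: `split_sdf` relies only on the SDF convention that each record ends with a `$$$$` line, and `validate_record` is a hypothetical local stand-in for the web service call, whose real interface is not described in these slides.

```python
def split_sdf(text):
    """Split SDF text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):
        records.append("\n".join(current))  # tolerate a missing final delimiter
    return records

def validate_record(record):
    """Hypothetical stand-in for a validation web service call.

    Only checks that the record contains a molfile terminator.
    """
    if "M  END" in record:
        return []
    return ["no structure (missing 'M  END')"]

SAMPLE = (
    "mol1\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
    "mol2\n> <NAME>\nunknown\n\n$$$$\n"
)
results = [validate_record(record) for record in split_sdf(SAMPLE)]
```

Each record yields a list of warnings, which mirrors the way the workflow returns a warning as the result of processing.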
Incorporating CVSP into data processing platforms: KNIME (continued)

• A warning is returned as a result of processing
Summary
• We will release the DrugBank results back
• Alpha version of CVSP available:
  http://cv.beta.rsc-us.org/Batches.aspx
• Will be a resource for the community
• Will improve ChemSpider
• Still a long way to go…
Thank you

Email: chemspider@rsc.org
Twitter: ChemSpider
http://www.chemspider.com
http://cssp.chemspider.com/

ChemValidator – an online service for validating and standardizing chemical structure files


Editor's notes

  1. This chemical appears in many places (Wolfram Alpha, PubChem, etc.). Is there value in “duplicates/triplicates”? A record with two instances of the same molecule? What about 15 waters? Can you show an example in PubChem that we inherited?
  2. We believe our dataset may collapse significantly…