Standardization and Generation of Parents for Open PHACTS Chemical Registry System

Describes the workflow for validation and standardization in the Open PHACTS Chemical Registry System.

  1. Standardization and Generation of Parents for Open PHACTS Chemical Registry System. Karen Karapetyan, Valery Tkachenko, Colin Batchelor, Antony Williams
  2. Validation checks
     • Correct file format (SDF, MOL, CDX, etc.)
     • "Valid" chemical structure
     • Valid atoms (not query atoms)
     • Valid bonds
     • Valid valences
     • Valid charges
     • SP3 stereo
     • Synonyms
     • Names (name to structure)
     • SMILES, InChIs (SMILES/InChI to structure)
     • XRefs
  3. Severity assigned to every validation issue
  4. Filtering by severity and by issue
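The severity and filtering slides can be sketched roughly as follows. This is a minimal illustration, not the registry's implementation: the severity levels, the `Issue` fields, and the issue codes are all hypothetical names, since the slides do not list them.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; the slides do not name the levels."""
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    record_id: str       # identifier of the deposited record (assumed field)
    code: str            # short issue code (assumed naming)
    severity: Severity


def filter_issues(issues, min_severity=Severity.INFORMATION, codes=None):
    """Keep issues at or above min_severity, optionally restricted to given codes."""
    return [
        issue for issue in issues
        if issue.severity >= min_severity
        and (codes is None or issue.code in codes)
    ]


issues = [
    Issue("SID1", "query_atom", Severity.ERROR),
    Issue("SID1", "undefined_sp3_stereo", Severity.WARNING),
    Issue("SID2", "invalid_valence", Severity.ERROR),
    Issue("SID2", "synonym_mismatch", Severity.INFORMATION),
]

errors = filter_issues(issues, min_severity=Severity.ERROR)
valence_only = filter_issues(issues, codes={"invalid_valence"})
print([issue.code for issue in errors])
print([issue.record_id for issue in valence_only])
```

Using an ordered enum keeps "filter by severity" a plain comparison, while "filter by issue" is a set membership test on the code.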
  5. Standardization – Organometallics/Salts
     • Always disconnect N, O, and F from metals
     • Disconnect nonmetals (except N, O, F) from transition metals (except Hg)
     • Ionize free metal with carboxylic acid (metals of Groups I and II)
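One possible reading of the two disconnection rules, written as a predicate on a metal–nonmetal bond. The element sets are illustrative and incomplete, and the function name is hypothetical; the actual element tables used by the registry are not given in the slides.

```python
# Illustrative, incomplete element sets (assumptions, not the registry's tables).
TRANSITION_METALS = {
    "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
    "Y", "Zr", "Nb", "Mo", "Ru", "Rh", "Pd", "Ag", "Cd",
    "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg",
}
OTHER_METALS = {"Li", "Na", "K", "Rb", "Cs", "Mg", "Ca", "Sr", "Ba", "Al"}
METALS = TRANSITION_METALS | OTHER_METALS
NONMETALS = {"C", "N", "O", "F", "P", "S", "Cl", "Se", "Br", "I"}


def should_disconnect(metal, partner):
    """Return True if the metal-partner bond would be disconnected.

    Rule 1: N, O, and F are always disconnected from metals.
    Rule 2: other nonmetals are disconnected from transition metals, except Hg.
    """
    if metal not in METALS or partner not in NONMETALS:
        return False
    if partner in {"N", "O", "F"}:
        return True
    return metal in TRANSITION_METALS and metal != "Hg"


print(should_disconnect("Na", "O"))
print(should_disconnect("Fe", "Cl"))
print(should_disconnect("Hg", "C"))
```

Under this reading, an Hg–C bond is kept, while an Hg–O bond is still disconnected because the first rule applies to all metals.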
  6. Standardization SMIRKS (based on InChI normalization and on FDA SRS)
     Examples of InChI normalization
     • [*;H+:1]>>[*;H:1]
     • [O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
     • [N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
     Examples of FDA SRS rules
     • [n:1]=[O:2]>>[n+:1][O-:2]
     • [*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
     • [N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
     • Thiopurine: [H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
  7. Standardization
     • Dearomatize
     • Double bond with adjacent wiggly single bond
     • Fold hydrogen atoms with no up or down bonds
  8. Standardization
     • Remove symmetric stereocenters
     • Turn off chiral flag if no up or down bonds
     • Do layout
     • Chiral flag is set
  9. Standardization – partially ionized acids (move the proton from a stronger acid to a weaker one)
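The proton-moving rule for partially ionized acids can be illustrated with a toy model. This is a sketch under stated assumptions: the `AcidSite` type, the function name, and the pKa values (rough, illustrative numbers) are not from the slides, and a real implementation would operate on the structure rather than on a list of sites.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AcidSite:
    name: str
    pka: float      # approximate, illustrative acidity (lower = stronger acid)
    ionized: bool   # True if the site is currently deprotonated


def redistribute_protons(sites):
    """Keep the number of ionized sites, but ionize the strongest acids first.

    If a weak acid is ionized while a stronger one is still protonated,
    the proton ends up on the weaker acid.
    """
    n_ionized = sum(site.ionized for site in sites)
    strongest_first = sorted(range(len(sites)), key=lambda i: sites[i].pka)
    deprotonate = set(strongest_first[:n_ionized])
    return [replace(site, ionized=(i in deprotonate))
            for i, site in enumerate(sites)]


before = [
    AcidSite("carboxylic acid", 4.0, ionized=True),
    AcidSite("sulfonic acid", -2.0, ionized=False),
]
after = redistribute_protons(before)
for site in after:
    print(site.name, "ionized" if site.ionized else "protonated")
```

The total charge is unchanged; only the location of the proton moves.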
  10. For each compound, parent generation is attempted. Reference: "Tautomerism in large databases", Sitzmann et al., J. Comput. Aided Mol. Des. (2010). Parents, with description and RDF:
     • Charge-Unsensitive: an attempt is made to neutralize ionized acids and bases; envisioned as an ongoing improvement as new cases appear. RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000460
     • Isotope-Unsensitive: isotopes replaced by common weight. RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000459
     • Stereo-Unsensitive: SP3 and double bond stereo removed. RDF: void:linkPredicate skos:closeMatch cheminf:CHEMINF_000456
     • Tautomer-Unsensitive: tautomer canonicalization attempts to generate a canonical tautomer. RDF: void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000486
     • Super Parent: generated by applying all of the above modifications. RDF: void:linkPredicate skos:broadMatch; dul:expresses cheminf:CHEMINF_000458
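The RDF column of the parent slide can be captured as a small lookup table. The sketch below is only an illustration: the `ops:` prefix, the function names, and the output layout are hypothetical, and it assumes `dul:expresses` also applies to the stereo parent, although the slide does not show it for that row.

```python
# Parent type -> (link predicate, CHEMINF term), as listed on the slide.
PARENT_TYPES = {
    "charge": ("skos:closeMatch", "cheminf:CHEMINF_000460"),
    "isotope": ("skos:closeMatch", "cheminf:CHEMINF_000459"),
    "stereo": ("skos:closeMatch", "cheminf:CHEMINF_000456"),
    "tautomer": ("skos:closeMatch", "cheminf:CHEMINF_000486"),
    "super": ("skos:broadMatch", "cheminf:CHEMINF_000458"),
}


def linkset_description(parent_type):
    """Return the two linkset properties for a parent type as Turtle-like text."""
    predicate, relation = PARENT_TYPES[parent_type]
    return f"void:linkPredicate {predicate} ;\ndul:expresses {relation} ."


def link(compound_id, parent_id, parent_type):
    """Return one link from a compound to its parent (hypothetical 'ops:' prefix)."""
    predicate, _ = PARENT_TYPES[parent_type]
    return f"ops:OPS{compound_id} {predicate} ops:OPS{parent_id} ."


print(linkset_description("super"))
print(link(1, 7, "super"))
```

Only the super parent uses `skos:broadMatch`; the four single-aspect parents use `skos:closeMatch`.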
  11. Diagram: deposited substances, compounds, and parents
     • Deposited substances
       • SID 1: SDF1, DataSource1, Synonym1, Synonym2, XRef1
       • SID 2: SDF2, DataSource2, Synonym1, Synonym3, XRef2
     • Compounds
       • OPS_ID 1: standardized molecule; DataSource1, DataSource2; Synonym1, Synonym2, Synonym3; XRef1, XRef2
       • OPS_ID 2: standardized MOL; DataSource3, DataSource4; Synonym4, Synonym5, Synonym6; XRef3, XRef4
     • Parents
       • Fragment
       • Charge Parent (OPS_ID 6)
       • Isotope Parent (OPS_ID 4)
       • Stereo Parent (OPS_ID 3)
       • Tautomer Parent (OPS_ID 5)
       • Super Parent (OPS_ID 7)
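In the diagram, OPS_ID 1 carries the data sources, synonyms, and XRefs of both SID 1 and SID 2. A minimal sketch of that aggregation follows; the class names, the `structure_key` field, and the `register` function are hypothetical, and the key is assumed to be whatever identity the standardized structure is given.

```python
from dataclasses import dataclass, field


@dataclass
class Substance:
    sid: str
    structure_key: str   # identity of the standardized structure (assumed field)
    data_source: str
    synonyms: set
    xrefs: set


@dataclass
class Compound:
    ops_id: int
    structure_key: str
    sids: list = field(default_factory=list)
    data_sources: set = field(default_factory=set)
    synonyms: set = field(default_factory=set)
    xrefs: set = field(default_factory=set)


def register(substances):
    """Group substances by standardized structure and merge their metadata."""
    compounds = {}
    for sub in substances:
        compound = compounds.get(sub.structure_key)
        if compound is None:
            compound = Compound(len(compounds) + 1, sub.structure_key)
            compounds[sub.structure_key] = compound
        compound.sids.append(sub.sid)
        compound.data_sources.add(sub.data_source)
        compound.synonyms |= sub.synonyms
        compound.xrefs |= sub.xrefs
    return list(compounds.values())


deposited = [
    Substance("SID1", "KEY-A", "DataSource1", {"Synonym1", "Synonym2"}, {"XRef1"}),
    Substance("SID2", "KEY-A", "DataSource2", {"Synonym1", "Synonym3"}, {"XRef2"}),
]
compounds = register(deposited)
print(compounds[0].ops_id, sorted(compounds[0].synonyms))
```

Sets make the merge idempotent: the shared Synonym1 appears once on the compound.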
  12. What do we use as the chemical identity of the standardized records (primary compound key)?
     • Standard InChI/InChIKey (currently used by ChemSpider)
     • Absolute SMILES (isomeric canonical)
     Drawbacks
     • SMILES: can be too long; no accepted standard; needs to be hashed
     • Standard InChI
       • does not distinguish between undefined and unknown stereo
       • by default does some basic tautomer canonicalization (not needed in the new model)
       • by default assumes absolute stereo
     Proposed solution: non-standard InChI with options SUU, SLUUD, FixedH, SUCF
     • much more sensitive to stereo description
     • fixes mobile hydrogens (so tautomers can be distinguished)
     • handles "AND-ed" relative stereo
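The "needs to be hashed" drawback of a SMILES key can be illustrated with the standard library. This is a sketch, not the proposed solution: the function name and the choice of SHA-256 are assumptions, and canonicalization of the SMILES is assumed to happen upstream.

```python
import hashlib


def smiles_key(smiles):
    """Return a fixed-length key (64 hex characters) for a SMILES string."""
    return hashlib.sha256(smiles.encode("utf-8")).hexdigest()


# The key length does not depend on the length of the SMILES.
print(len(smiles_key("CCO")), len(smiles_key("C" * 500)))

# Two spellings of ethanol hash differently, so the input must already be
# a canonical SMILES for the key to identify a compound.
print(smiles_key("CCO") == smiles_key("OCC"))
```

The second check shows why "no accepted standard" matters: without an agreed canonical form, the same structure can end up under different keys.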
  13. Thanks. We would appreciate any comments. For comments or questions email