SlideShare a Scribd company logo
1 of 20
Download to read offline
Chemical mixtures: File format, open source tools,
example data, and mixtures InChI derivative
Alex M. Clark
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Premise
◈ Representing specific chemicals routine since 1980s
▷ e.g. the MDL Molfile CTAB for organics
◈ Most real world encounters are mixtures
◈ No industry standard format...
▷ ... even though it's rather easy
▷ interchange is done with text
2
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Goal
◈ Grant-supported project to:
▷ define a simple format (Mixfile)
▷ create open source tools for editing & manipulating
▷ bootstrap content via text mining
▷ work with IUPAC for interoperability (MInChI)
◈ Results pubished in Journal of Cheminformatics 11:33
3
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Common Lab Mixtures
4
impurity
isomers
simple

solution
multisolvent

solution
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Real World Mixtures
5
drug

tablet
toothpaste
COLLABORATIVE DRUG DISCOVERY - MIXTURES
File Format
◈ Mixfile is to mixtures as Molfile is to
molecules
◈ Components:
▷ structure & name
▷ concentration
▷ identifiers
◈ Hierarchy
▷ captures nuances of mixing
▷ relative concentrations/uncertainty
6
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Representation
◈ Serialised using JSON: natural datastructure, human
readable, concise, easy to code on any platform
7
{
"mixfileVersion": 0.01,
"name": "Lithium diisopropylamide solution",
"contents":
[
{
"name": "Lithium diisopropylamide",
"molfile": "nGenerated by WebMolKitnn 8 6 0 0 0 0 0 0 0 0999 V2000n 0.2500 1.5000 0.0000 N 0 5 0 0 0 0 0 0 0
0 0 0n -1.0490 0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n -2.3481 1.5000 0.0000 C 0 0 0 0 0
0 0 0 0 0 0 0n -1.0490 -0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 1.5490 0.7500 0.0000 C 0
0 0 0 0 0 0 0 0 0 0 0n 2.8481 1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 1.5490 -0.7500
0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 0.2500 3.0000 0.0000 Li 0 3 0 0 0 0 0 0 0 0 0 0n 1 2 1
0 0 0 0n 2 3 1 0 0 0 0n 2 4 1 0 0 0 0n 1 5 1 0 0 0 0n 5 6 1 0 0 0 0n 5 7 1 0 0 0 0nM
CHG 2 1 -1 8 1nM END",
"quantity": 1,
"units": "mol/L",
"inchi": "InChI=1S/C6H14N.Li/c1-5(2)7-6(3)4;/h5-6H,1-4H3;/q-1;+1",
"inchiKey": "InChIKey=ZCSHNCUQKCANBX-UHFFFAOYSA-N"
},
{
"contents":
[
{
"name": "THF",
"molfile": "nGenerated by WebMolKitnn 5 5 0 0 0 0 0 0 0 0999 V2000n -0.2500 3.7500 0.0000 O 0 0 0 0 0 0
0 0 0 0 0 0n -1.4600 2.8700 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n -1.0000 1.4400 0.0000 C 0 0
0 0 0 0 0 0 0 0 0 0n 0.9600 2.8700 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 0.5000 1.4400
0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 3 2 1 0 0 0 0n 2 1 1 0 0 0 0n 1 4 1 0 0 0 0n 4 5 1 0
0 0 0n 5 3 1 0 0 0 0nM END",
"ratio":
[
1,
8
],
"inchi": "InChI=1S/C4H8O/c1-2-4-5-3-1/h1-4H2",
"inchiKey": "InChIKey=WYURNTSHIVDZCO-UHFFFAOYSA-N"
},
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Tooling
◈ Editor & libraries written in TypeScript, cheminformatics
library: WebMolKit
◈ Create, modify, view & render Mixfiles
◈ Open source https://github.com/cdd/mixtures
8
(a) (b) (c)
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Mixture Data
◈ Lots of mixtures,
but mostly text
◈ Active ingredient
often separated
▷ purity
▷ solvent/conc.
◈ Semi-structured
databases also
9
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Text Extraction
◈ Many common patterns in
text descriptions
10
1-Aza-12-crown-4 ≥97.0%
Trimethyl(trifluoromethyl)silane solution 2 M in THF
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Brute Force
◈ Compose a set of regular expression rules
◈ Remove brand names:
11
{"effect": "remove", "regex": "(.*)A[cC][rR][oO][sS] Organics?™?(.*?)$"},
{"effect": "remove", "regex": "(.*)AcroSeal™?(.*?)$"},
{"effect": "remove", "regex": "(.*)Alfa Aesar™?(.*?)$"},
{"effect": "remove", "regex": "(.*)BioReagents™?(.*?)$"},
{"effect": "remove", "regex": "(.*)Burdick & Jackson™?(.*?)$"},
{"effect": "remove", "regex": "(.*)(Technical)$"},
{"effect": "remove", "regex": "(.*)(Certified)$"},
{"effect": "remove", "regex": "(.*), pure$"},
{"effect": "remove", "regex": "(.*), for analysis$"},
{"effect": "remove", "regex": "(.*), extra pure$"},
◈ Remove suffixes:
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Brute Force (ctd)
◈ Identify branches (with quantities):
12
{"effect": "branch", "regex":
"(.*) (over (molecular sieve .*))", "substance": "$2"},
{"effect": "branch", "regex": "(.*) (d[d.]*) ? mM in (.*)",
"quantity": "$2", "units": "mmol/L", "substance": "$3"},
{"effect": "branch", "regex":
"(.*) ~?d[d.]*s?% in (.*) ((~?)(d[d.]*) M)$",
"quantity": "$4", "units": "mol/L", "relation": "$3", "substance": "$2"},
{"effect": "branch", "regex":
"(.*), (d[d.]*)s?M [Ss]olution in (.*)",
"quantity": "$2", "units": "mol/L", "substance": "$3"},
{"effect": "conc", "regex": "(.*) (d[d.]*)%, ee d[d.]*.%$",
"quantity": "$2","units": "%", "relation": ""},
{"effect": "conc", "regex": "(.*) ≥ ?(d[d.]*)%$",
"quantity": "$2","units": "%", "relation": ">="},
{"effect": "conc", "regex": "(.*) [Ss]olution (d[d.]*) ?M$",
"quantity": "$2", "units": "mol/L"},
◈ Concentration suffixes:
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Implementation
◈ Apply regular expression rules greedily/recursively
▷ package up into hierarchical components
▷ substances referenced by name
◈ Use OPSIN for name-to-SMILES, RDKit to depict
◈ Use lookup database for:
▷ common exceptions (e.g. trivial names)
▷ named mixtures (e.g. hexanes)
13
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Results
◈ Aggregated thousands of single-line text entries,
mostly from online catalogs
◈ 5600 mixtures verified & included on GitHub using
the Mixfile format
◈ Success rate ~95% with:
▷ 250 regular expression based rules
▷ 280 name lookup mappings
◈ All of them with calculated MInChI strings...
14
COLLABORATIVE DRUG DISCOVERY - MIXTURES
MInChI
◈ Mixtures InChI is a composite notation...
▷ Mixfile encapsulates Molfiles + extra metadata
▷ MInChI encapsulates InChI + extra metadata
◈ Distilled down to the basics:
▷ canonical structures
▷ simplified concentration
▷ nesting hierarchy
◈ Useful to accompany primary originated data
15
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Dissection
◈ Mixfile to MInChI: designed for one way
conversion (verbose-to-concise)
◈ Reduced information content of MInChI
adds value for certain use cases
◈ Proof of concept implemented
16
(a) MInChI=0.00.1S/C7H8N4O2/c1-10-3-8-5-4(10)6(12)11(2)7(13)9-5/h3H,1-2H3,(H,9,13)/n1/g99pp0
header structure identifier indexing concentration
MInChI=0.00.1S/BBr3/c2-1(3)4&CH2Cl2/c2-1-3/h1H2/n{1&2}/g{1mr0&}(b)
(c) MInChI=0.00.1S/C4H8O/c1-2-4-5-3-1/h1-4H2&C6H12/c1-6-4-2-3-5-6/h6H,2-5H2,1H3
&C6H14/c1-3-5-6-4-2/h3-6H2,1-2H3&C6H14/c1-4-5-6(2)3/h6H,4-5H2,1-3H3
&C6H14/c1-4-6(3)5-2/h6H,4-5H2,1-3H3&C6H14N.Li/c1-5(2)7-6(3)4;/h5-6H,1-4H3;/q-1;+1
/n{6&{1&{3&2&4&5}}}/g{1mr0&{1vp0&{5:7vf-1&1:2vf-1&1:5vf-2&1:5vf-2}7vp0}}
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Future Work: ELNs
◈ Embed convenient UI within web ELN
▷ sketch out mixture definitions for procedure writeup
▷ quick lookup databases of known mixtures
▷ search content using precise mixture-aware
queries
◈ Vendor integration
▷ work with vendors to markup their products
▷ scientists can use product code to embed mixture
17
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Future Work: Text Extraction
◈ Initial proof of concept is effective but crude
◈ Planning a much more sophisticated iterative-learning
strategy: start by formulating input for recurrent deep
neural network
18
Trimethyl(trifluoromethyl)silane solution 2 M in THF
■ component ■ quantity ■ branch ■ superfluous
... -5 -4 -3 -2 -1 0 +1 +2 +3 +4 +5 ...
[... *, *, *, *, *, T, r, i, m, e, t ...] = ■ component
[... *, *, *, *, T, r, i, m, e, t, h ...] = ■ component
[... *, *, *, T, r, i, m, e, t, h, y ...] = ■ component
[... *, *, T, r, i, m, e, t, h, y, l ...] = ■ component
...
[... l, u, t, i, o, n, , 2, M, , i ...] = ■ superfluous
[... u, t, i, o, n, , 2, M, , i, n ...] = ■ superfluous
[... t, i, o, n, , 2, M, , i, n, ...] = ■ quantity
[... i, o, n, , 2, M, , i, n, , T ...] = ■ quantity
... etc. ...
◈ Learning from input/
output data rather than
curated rules: more
scalable
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Future Work: Molecules
◈ Structures currently limited to ≈ Molfile/InChI subset
◈ Want to extend to:
▷ inorganics (non-integral bond orders)
▷ polymers (repeat units & distributions)
▷ variations (Markush structures, partially definitions)
▷ large molecules (proteins, DNA)
▷ pseudomolecules (ceramics, alloys)
◈ More formal definition of ID codes (CAS, PubChem, etc.)
19
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Acknowledgments
20
◈ Leah McEwen
▷ and the rest of the IUPAC/InChI working group
◈ Hande Küçük McGinty (and iCorps)
▷ Collaborative Drug Discovery
Funding
NIH SBIR

More Related Content

More from Alex Clark

Mixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream productsMixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream productsAlex Clark
 
Mixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer productsMixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer productsAlex Clark
 
Coordination InChI (2019)
Coordination InChI (2019)Coordination InChI (2019)
Coordination InChI (2019)Alex Clark
 
Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...Alex Clark
 
ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)Alex Clark
 
Autonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocolsAutonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocolsAlex Clark
 
Representing molecules with minimalism: A solution to the entropy of informatics
Representing molecules with minimalism: A solution to the entropy of informaticsRepresenting molecules with minimalism: A solution to the entropy of informatics
Representing molecules with minimalism: A solution to the entropy of informaticsAlex Clark
 
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...Alex Clark
 
BioAssay Express
BioAssay ExpressBioAssay Express
BioAssay ExpressAlex Clark
 
SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?Alex Clark
 
The anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithmsThe anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithmsAlex Clark
 
Compact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile appsCompact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile appsAlex Clark
 
Green chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designGreen chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designAlex Clark
 
ICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab NotebookICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab NotebookAlex Clark
 
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)Alex Clark
 
Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)Alex Clark
 
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013Alex Clark
 
Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013Alex Clark
 
Open Drug Discovery Teams @ Hacking Health Montreal
Open Drug Discovery Teams @ Hacking Health MontrealOpen Drug Discovery Teams @ Hacking Health Montreal
Open Drug Discovery Teams @ Hacking Health MontrealAlex Clark
 
Pistoia Alliance App Strategy
Pistoia Alliance App StrategyPistoia Alliance App Strategy
Pistoia Alliance App StrategyAlex Clark
 

More from Alex Clark (20)

Mixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream productsMixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream products
 
Mixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer productsMixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer products
 
Coordination InChI (2019)
Coordination InChI (2019)Coordination InChI (2019)
Coordination InChI (2019)
 
Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...
 
ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)
 
Autonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocolsAutonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocols
 
Representing molecules with minimalism: A solution to the entropy of informatics
Representing molecules with minimalism: A solution to the entropy of informaticsRepresenting molecules with minimalism: A solution to the entropy of informatics
Representing molecules with minimalism: A solution to the entropy of informatics
 
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
 
BioAssay Express
BioAssay ExpressBioAssay Express
BioAssay Express
 
SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?
 
The anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithmsThe anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithms
 
Compact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile appsCompact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile apps
 
Green chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designGreen chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by design
 
ICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab NotebookICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab Notebook
 
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
 
Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)
 
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
 
Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013
 
Open Drug Discovery Teams @ Hacking Health Montreal
Open Drug Discovery Teams @ Hacking Health MontrealOpen Drug Discovery Teams @ Hacking Health Montreal
Open Drug Discovery Teams @ Hacking Health Montreal
 
Pistoia Alliance App Strategy
Pistoia Alliance App StrategyPistoia Alliance App Strategy
Pistoia Alliance App Strategy
 

Recently uploaded

Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 

Recently uploaded (20)

Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 

Representing Chemical Mixtures in an Open Format

  • 1. Chemical mixtures: File format, open source tools, example data, and mixtures InChI derivative Alex M. Clark
  • 2. COLLABORATIVE DRUG DISCOVERY - MIXTURES Premise ◈ Representing specific chemicals routine since 1980s ▷ e.g. the MDL Molfile CTAB for organics ◈ Most real world encounters are mixtures ◈ No industry standard format... ▷ ... even though it's rather easy ▷ interchange is done with text 2
  • 3. COLLABORATIVE DRUG DISCOVERY - MIXTURES Goal ◈ Grant-supported project to: ▷ define a simple format (Mixfile) ▷ create open source tools for editing & manipulating ▷ bootstrap content via text mining ▷ work with IUPAC for interoperability (MInChI) ◈ Results pubished in Journal of Cheminformatics 11:33 3
  • 4. COLLABORATIVE DRUG DISCOVERY - MIXTURES Common Lab Mixtures 4 impurity isomers simple solution multisolvent solution
  • 5. COLLABORATIVE DRUG DISCOVERY - MIXTURES Real World Mixtures 5 drug tablet toothpaste
  • 6. COLLABORATIVE DRUG DISCOVERY - MIXTURES File Format ◈ Mixfile is to mixtures as Molfile is to molecules ◈ Components: ▷ structure & name ▷ concentration ▷ identifiers ◈ Hierarchy ▷ captures nuances of mixing ▷ relative concentrations/uncertainty 6
  • 7. COLLABORATIVE DRUG DISCOVERY - MIXTURES Representation ◈ Serialised using JSON: natural datastructure, human readable, concise, easy to code on any platform 7 { "mixfileVersion": 0.01, "name": "Lithium diisopropylamide solution", "contents": [ { "name": "Lithium diisopropylamide", "molfile": "nGenerated by WebMolKitnn 8 6 0 0 0 0 0 0 0 0999 V2000n 0.2500 1.5000 0.0000 N 0 5 0 0 0 0 0 0 0 0 0 0n -1.0490 0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n -2.3481 1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n -1.0490 -0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 1.5490 0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 2.8481 1.5000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 1.5490 -0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 0.2500 3.0000 0.0000 Li 0 3 0 0 0 0 0 0 0 0 0 0n 1 2 1 0 0 0 0n 2 3 1 0 0 0 0n 2 4 1 0 0 0 0n 1 5 1 0 0 0 0n 5 6 1 0 0 0 0n 5 7 1 0 0 0 0nM CHG 2 1 -1 8 1nM END", "quantity": 1, "units": "mol/L", "inchi": "InChI=1S/C6H14N.Li/c1-5(2)7-6(3)4;/h5-6H,1-4H3;/q-1;+1", "inchiKey": "InChIKey=ZCSHNCUQKCANBX-UHFFFAOYSA-N" }, { "contents": [ { "name": "THF", "molfile": "nGenerated by WebMolKitnn 5 5 0 0 0 0 0 0 0 0999 V2000n -0.2500 3.7500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0n -1.4600 2.8700 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n -1.0000 1.4400 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 0.9600 2.8700 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 0.5000 1.4400 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0n 3 2 1 0 0 0 0n 2 1 1 0 0 0 0n 1 4 1 0 0 0 0n 4 5 1 0 0 0 0n 5 3 1 0 0 0 0nM END", "ratio": [ 1, 8 ], "inchi": "InChI=1S/C4H8O/c1-2-4-5-3-1/h1-4H2", "inchiKey": "InChIKey=WYURNTSHIVDZCO-UHFFFAOYSA-N" },
  • 8. COLLABORATIVE DRUG DISCOVERY - MIXTURES Tooling ◈ Editor & libraries written in TypeScript, cheminformatics library: WebMolKit ◈ Create, modify, view & render Mixfiles ◈ Open source https://github.com/cdd/mixtures 8 (a) (b) (c)
  • 9. COLLABORATIVE DRUG DISCOVERY - MIXTURES Mixture Data ◈ Lots of mixtures, but mostly text ◈ Active ingredient often separated ▷ purity ▷ solvent/conc. ◈ Semi-structured databases also 9
  • 10. COLLABORATIVE DRUG DISCOVERY - MIXTURES Text Extraction ◈ Many common patterns in text descriptions 10 1-Aza-12-crown-4 ≥97.0% Trimethyl(trifluoromethyl)silane solution 2 M in THF
  • 11. COLLABORATIVE DRUG DISCOVERY - MIXTURES Brute Force ◈ Compose a set of regular expression rules ◈ Remove brand names: 11 {"effect": "remove", "regex": "(.*)A[cC][rR][oO][sS] Organics?™?(.*?)$"}, {"effect": "remove", "regex": "(.*)AcroSeal™?(.*?)$"}, {"effect": "remove", "regex": "(.*)Alfa Aesar™?(.*?)$"}, {"effect": "remove", "regex": "(.*)BioReagents™?(.*?)$"}, {"effect": "remove", "regex": "(.*)Burdick & Jackson™?(.*?)$"}, {"effect": "remove", "regex": "(.*)(Technical)$"}, {"effect": "remove", "regex": "(.*)(Certified)$"}, {"effect": "remove", "regex": "(.*), pure$"}, {"effect": "remove", "regex": "(.*), for analysis$"}, {"effect": "remove", "regex": "(.*), extra pure$"}, ◈ Remove suffixes:
  • 12. COLLABORATIVE DRUG DISCOVERY - MIXTURES Brute Force (ctd) ◈ Identify branches (with quantities): 12 {"effect": "branch", "regex": "(.*) (over (molecular sieve .*))", "substance": "$2"}, {"effect": "branch", "regex": "(.*) (d[d.]*) ? mM in (.*)", "quantity": "$2", "units": "mmol/L", "substance": "$3"}, {"effect": "branch", "regex": "(.*) ~?d[d.]*s?% in (.*) ((~?)(d[d.]*) M)$", "quantity": "$4", "units": "mol/L", "relation": "$3", "substance": "$2"}, {"effect": "branch", "regex": "(.*), (d[d.]*)s?M [Ss]olution in (.*)", "quantity": "$2", "units": "mol/L", "substance": "$3"}, {"effect": "conc", "regex": "(.*) (d[d.]*)%, ee d[d.]*.%$", "quantity": "$2","units": "%", "relation": ""}, {"effect": "conc", "regex": "(.*) ≥ ?(d[d.]*)%$", "quantity": "$2","units": "%", "relation": ">="}, {"effect": "conc", "regex": "(.*) [Ss]olution (d[d.]*) ?M$", "quantity": "$2", "units": "mol/L"}, ◈ Concentration suffixes:
  • 13. COLLABORATIVE DRUG DISCOVERY - MIXTURES Implementation ◈ Apply regular expression rules greedily/recursively ▷ package up into hierarchical components ▷ substances referenced by name ◈ Use OPSIN for name-to-SMILES, RDKit to depict ◈ Use lookup database for: ▷ common exceptions (e.g. trivial names) ▷ named mixtures (e.g. hexanes) 13
  • 14. COLLABORATIVE DRUG DISCOVERY - MIXTURES Results ◈ Aggregated thousands of single-line text entries, mostly from online catalogs ◈ 5600 mixtures verified & included on GitHub using the Mixfile format ◈ Success rate ~95% with: ▷ 250 regular expression based rules ▷ 280 name lookup mappings ◈ All of them with calculated MInChI strings... 14
  • 15. COLLABORATIVE DRUG DISCOVERY - MIXTURES MInChI ◈ Mixtures InChI is a composite notation... ▷ Mixfile encapsulates Molfiles + extra metadata ▷ MInChI encapsulates InChI + extra metadata ◈ Distilled down to the basics: ▷ canonical structures ▷ simplified concentration ▷ nesting hierarchy ◈ Useful to accompany primary originated data 15
  • 16. COLLABORATIVE DRUG DISCOVERY - MIXTURES Dissection ◈ Mixfile to MInChI: designed for one way conversion (verbose-to-concise) ◈ Reduced information content of MInChI adds value for certain use cases ◈ Proof of concept implemented 16 (a) MInChI=0.00.1S/C7H8N4O2/c1-10-3-8-5-4(10)6(12)11(2)7(13)9-5/h3H,1-2H3,(H,9,13)/n1/g99pp0 header structure identifier indexing concentration MInChI=0.00.1S/BBr3/c2-1(3)4&CH2Cl2/c2-1-3/h1H2/n{1&2}/g{1mr0&}(b) (c) MInChI=0.00.1S/C4H8O/c1-2-4-5-3-1/h1-4H2&C6H12/c1-6-4-2-3-5-6/h6H,2-5H2,1H3 &C6H14/c1-3-5-6-4-2/h3-6H2,1-2H3&C6H14/c1-4-5-6(2)3/h6H,4-5H2,1-3H3 &C6H14/c1-4-6(3)5-2/h6H,4-5H2,1-3H3&C6H14N.Li/c1-5(2)7-6(3)4;/h5-6H,1-4H3;/q-1;+1 /n{6&{1&{3&2&4&5}}}/g{1mr0&{1vp0&{5:7vf-1&1:2vf-1&1:5vf-2&1:5vf-2}7vp0}}
  • 17. COLLABORATIVE DRUG DISCOVERY - MIXTURES Future Work: ELNs ◈ Embed convenient UI within web ELN ▷ sketch out mixture definitions for procedure writeup ▷ quick lookup databases of known mixtures ▷ search content using precise mixture-aware queries ◈ Vendor integration ▷ work with vendors to markup their products ▷ scientists can use product code to embed mixture 17
  • 18. COLLABORATIVE DRUG DISCOVERY - MIXTURES Future Work: Text Extraction ◈ Initial proof of concept is effective but crude ◈ Planning a much more sophisticated iterative-learning strategy: start by formulating input for recurrent deep neural network 18 Trimethyl(trifluoromethyl)silane solution 2 M in THF ■ component ■ quantity ■ branch ■ superfluous ... -5 -4 -3 -2 -1 0 +1 +2 +3 +4 +5 ... [... *, *, *, *, *, T, r, i, m, e, t ...] = ■ component [... *, *, *, *, T, r, i, m, e, t, h ...] = ■ component [... *, *, *, T, r, i, m, e, t, h, y ...] = ■ component [... *, *, T, r, i, m, e, t, h, y, l ...] = ■ component ... [... l, u, t, i, o, n, , 2, M, , i ...] = ■ superfluous [... u, t, i, o, n, , 2, M, , i, n ...] = ■ superfluous [... t, i, o, n, , 2, M, , i, n, ...] = ■ quantity [... i, o, n, , 2, M, , i, n, , T ...] = ■ quantity ... etc. ... ◈ Learning from input/ output data rather than curated rules: more scalable
  • 19. COLLABORATIVE DRUG DISCOVERY - MIXTURES Future Work: Molecules ◈ Structures currently limited to ≈ Molfile/InChI subset ◈ Want to extend to: ▷ inorganics (non-integral bond orders) ▷ polymers (repeat units & distributions) ▷ variations (Markush structures, partially definitions) ▷ large molecules (proteins, DNA) ▷ pseudomolecules (ceramics, alloys) ◈ More formal definition of ID codes (CAS, PubChem, etc.) 19
  • 20. COLLABORATIVE DRUG DISCOVERY - MIXTURES Acknowledgments 20 ◈ Leah McEwen ▷ and the rest of the IUPAC/InChI working group ◈ Hande Küçük McGinty (and iCorps) ▷ Collaborative Drug Discovery Funding NIH SBIR