Introduction to
Retrosynthesis Prediction
2020. 06
Wonjun Jeong
wonjun.jg@kaist.ac.kr
wonjun.email@gmail.com
Table of Contents
• Introduction
• Retrosynthesis prediction
• Dataset description
• Overview of general approaches: Template-based, Template-free, Selection-based
• Proposed methods
• Classical computer-aided methods
• Machine learning based methods
• Challenges
• Practice
• RDKit
• OpenNMT
• Related works
• Future directions
• Reference
• Appendix
Retrosynthesis prediction
• What is retrosynthesis prediction?
• Retrosynthesis or retrosynthetic pathway planning is the process of tracing back the
forward reaction, predicting which reactants are required to synthesize the target product.
• Retrosynthesis is a crucial step in discovering new materials and drugs.
[Figure: discovery pipeline: desired properties → candidate product → candidate reactants (via retrosynthesis prediction) → test by chemist]
• Every stage of the discovery pipeline introduces its own errors, so the results must be verified by a chemist.
• This verification is expensive.
• Retrosynthesis prediction has depended heavily on trial-and-error cycles by experienced researchers with chemical expertise.
• If retrosynthesis prediction can be done with high accuracy, it could unlock a fully automated material/drug discovery pipeline.
[Figure: automated pipeline: desired properties → candidate product → candidate reactants (via retrosynthesis prediction) → test by robot]
Dataset description
• SMILES (Simplified Molecular-Input Line-Entry System) [1]
• SMILES is a specification in the form of a line notation for describing the structure of
chemical species [2].
• Generation of SMILES.
• By printing symbol nodes encountered in a depth-first tree traversal of a chemical graph
[1] Weininger et al. [2] https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
Dataset description
• SMILES in detail
• Carbon (C) labels are omitted in the graph drawing.
• Hydrogen (H) atoms are omitted in SMILES.
• Ring structures are written by breaking each ring at an arbitrary point to make an acyclic structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms.
• Branches are described with parentheses.
• A bond is represented using one of the symbols: ., -, =, #, $, :, /, \
• "." indicates that two parts are not bonded together.
[1] Weininger et al. [2] https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
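The depth-first generation rule can be illustrated with a small sketch. It is a toy, not the full SMILES algorithm: it assumes an acyclic molecule with single bonds only and implicit hydrogens, and the function name `to_smiles` and the atom/bond encoding are made up for this example.

```python
def to_smiles(atoms, bonds, start=0):
    """Write a SMILES string for an acyclic, single-bonded molecule by DFS."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def visit(node, parent):
        children = [n for n in neighbors[node] if n != parent]
        out = atoms[node]
        # Every child except the last becomes a parenthesised branch.
        for child in children[:-1]:
            out += "(" + visit(child, node) + ")"
        # The last child continues the main chain without parentheses.
        if children:
            out += visit(children[-1], node)
        return out

    return visit(start, None)


# Isopropanol heavy atoms: C0-C1, C1-C2, C1-O3 (hydrogens are implicit).
atoms = ["C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (1, 3)]
smiles = to_smiles(atoms, bonds)
print(smiles)  # CC(C)O
```

Rings and bond orders would need the ring-closure labels and bond symbols listed above; they are left out to keep the traversal visible.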
Dataset description
• Benchmark:
1. USPTO (United States Patent and Trademark Office)
• The USPTO benchmark contains SMILES representations of a single target product (input) and its reactants (target).
• Variants
• USPTO-50k
• USPTO-500K
• USPTO-MIT
2. Pistachio [32]
3. Reaxys [25]
[25] reaxys.com [32] Mayfield et al.
Overview of general approaches: Template-based
• Template-based approaches [2, 3, 4, 5, 14, 15, 16, 17] use known chemical reactions, encoded as reaction templates.
• A reaction template contains subgraph reaction patterns that describe how the reaction occurs between reactants and product.
• Pros
• High interpretability
• Cons
• Low generalizability to unseen templates
• Requires domain knowledge to extract the reaction templates
Overview of general approaches: Template-free
• Template-free approaches [6, 7, 8, 9, 10, 12] learn a mapping function from a product to a set of reactants by extracting features directly from data.
• Seq2Seq framework
• [6, 7, 8, 12]
• Graph2Graph framework
• [9, 10]
• Pros
• Generalizability
• Does not require domain knowledge
• Cons
• Invalid/inaccessible predictions
• Low interpretability
Overview of general approaches: Selection-based
• Selection-based approaches [11] select reactants from a candidate set of purchasable compounds.
• The objective of [11] is to discover retrosynthetic routes from a given desired product to commercially available reactants.
• Pros
• Accessibility of the predictions
• Does not require domain knowledge
• Cons
• Limited novelty
• Rank := f(product; purchasable pool)
[11] Guo et al.
Classical computer-aided methods
• Before deep learning, computer-aided retrosynthesis was mainly conducted using reaction templates [2, 3, 4, 15, 16, 17].
• These methods mainly concern how to use known reactions and extract meaningful reaction context.
• Characteristics
• Requires chemical expertise
• Relies on heuristics
• Computationally expensive
• Chemical space is vast
• Subgraph isomorphism problem*1
• Not scalable
• Not generalizable
*1: Appendix-1
Classical computer-aided methods
• The first computer-aided retrosynthesis:
• [18] Corey et al., “Computer-assisted analysis in organic synthesis.”, Science, 1985
• The author won the Nobel Prize in Chemistry for his contributions to retrosynthetic analysis.
• [19] The Logic of Chemical Synthesis: Multistep Synthesis of Complex Carbogenic Molecules (Nobel lecture), 1991
[18, 19] Corey et al.
Classical computer-aided methods: Recent work [3] 2017 – Key Idea
• Key Idea
• It uses product similarity and reactant similarity to rank the templates of precedent reactions.
[3] Coley et al.
Classical computer-aided methods: Recent work [3] 2017 – Method (Similarity)
• How to measure molecular similarity*2?
• Molecular fingerprints encode the structure of a molecule. They can be computed with the RDKit library.
• The most common measure is Tanimoto similarity, but there is no canonical definition of molecular similarity (subgraph isomorphism problem*1).
• x, y: molecular fingerprints
*1: Appendix-1, *2: Appendix-2
Img from [20]
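As a minimal sketch of Tanimoto similarity, assume each fingerprint is given as a set of "on" bit indices (in practice these would come from a fingerprinting library such as RDKit; the sets below are invented for illustration). The similarity is the number of shared bits divided by the number of distinct bits.

```python
def tanimoto(x, y):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    x, y = set(x), set(y)
    union = len(x | y)
    if union == 0:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(x & y) / union


# Hypothetical fingerprints, not derived from real molecules.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
print(tanimoto(fp_a, fp_b))  # 2 shared bits / 5 distinct bits = 0.4
```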
Classical computer-aided methods: Recent work [3] 2017 – Method (Using similarity)
• Example of using similarity in [3]
• Total similarity := product similarity × reactant (precursor) similarity
• Precedents are ranked by total similarity.
[3] Coley et al.
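The ranking by total similarity can be sketched as follows. This is only an illustration of the product-of-similarities score, not the implementation of [3]: fingerprints are assumed to be sets of bit indices, and the dictionary keys (`product_fp`, `reactants_fp`, `proposed_precursors_fp`) are hypothetical names for values that a real pipeline would compute by applying each precedent's template to the target.

```python
def tanimoto(x, y):
    union = len(x | y)
    return len(x & y) / union if union else 1.0


def rank_precedents(target_fp, precedents):
    """Score each precedent by product similarity times reactant similarity."""
    scored = []
    for p in precedents:
        product_sim = tanimoto(target_fp, p["product_fp"])
        # Precursors proposed for the target versus the precedent's own reactants.
        reactant_sim = tanimoto(p["proposed_precursors_fp"], p["reactants_fp"])
        scored.append((product_sim * reactant_sim, p["name"]))
    return sorted(scored, reverse=True)


# Hypothetical fingerprints for two precedent reactions.
target_fp = {1, 2, 3, 4}
precedents = [
    {"name": "precedent_A", "product_fp": {1, 2, 3, 5},
     "proposed_precursors_fp": {7, 8}, "reactants_fp": {7, 8, 9}},
    {"name": "precedent_B", "product_fp": {1, 2, 3, 4},
     "proposed_precursors_fp": {6}, "reactants_fp": {6, 7, 8, 9}},
]
ranking = rank_precedents(target_fp, precedents)
print(ranking)
```

Here precedent_B has the identical product but poorly matching precursors, so the multiplicative score ranks precedent_A first.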
Classical computer-aided methods: Recent work [3] 2017 – Results
• Result of [3]
• [3] performs better than seq2seq. However, the seq2seq model in the table is template-free, whereas [3] is template-based.
• Contribution
• It mimics the retrosynthetic strategy by using molecular similarity, without the need to encode any chemical knowledge.
• Limitation
• It inherently disfavors creative retrosynthetic strategies because it relies on precedent reactions.
*3: Appendix-3
Machine learning based methods
• Data-driven methods using machine learning and deep learning have been actively studied since the mid-2010s.
• They reduce the need for expertise.
• They are more scalable and generalizable.
• Representative proposed methods
• Template-based
• NeuralSim [14], Graph Logic Network (GLN) [5]
• Template-free
• Seq2Seq [21], Molecular Transformer (MT) [6, 7], Latent variable Transformer (LV-MT) [8], Self-Corrected Transformer (SCROP) [22], Graph2Graph (G2G) [9], GraphRetro [10]
• Selection-based
• Bayesian-Retro [11]
Template-based: NeuralSim [14] 2017 – Key Idea
• Given a target product, it uses a neural network to predict the most suitable rule among the reaction templates.
Template-based: NeuralSim [14] 2017 – Method
• It uses simple models such as an MLP and a Highway network [23].
• It formulates rule selection as multiclass classification.
• The molecular descriptor [24] is defined as a sum of molecular fingerprints.
Template-based: NeuralSim [14] 2017 – Results
• Experiments
• Dataset: Reaxys database [25]
• Number of classes: 8720
• Contribution
• It shows that neural networks can learn the molecular contexts in which particular rules can be applied.
• Limitation
• The performance is affected by the cardinality of the rule set.
• The larger the rule set, the lower the performance.
[14] Segler et al. [23] Srivastava et al. [24] pdf file
Template-based: Graph Logic Network (GLN) [5] 2019 (NeurIPS) – Key Idea
• It models the joint distribution of reaction templates and reactants using logic variables.
• It learns when rules from reaction templates should be applied.
Template-based: Graph Logic Network [5] 2019 – Background
• Retrosynthesis template
• Applying a retrosynthesis template can be decomposed into two-step logic.
• Match template
• Match reactants
Template-based: Graph Logic Network [5] 2019 – Method
• Match template
• Match reactants
• Uncertainty
• Template score function
• Reactants score function
• Final joint probability
• Parameterizing by GNN (Graph Neural Network)*4
• MLE with efficient inference
• Gradient approximation
*4: Appendix-4
Template-based: Graph Logic Network [5] 2019 – Results
• Top-k results
• Contribution
• Interpretability: integration of probabilistic models and templates (chemical rules)
• Limitation
• It shares the limitations of template-based methods.
• Scalability
[5] Dai et al.
Template-free: Seq2Seq [21] 2017 – Method
• It tokenizes SMILES and treats retrosynthesis as machine translation.
• It uses a bidirectional LSTM for the encoder and an LSTM for the decoder.
• It uses beam search to produce a set of candidate reactants.
[21] Liu et al.
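To make "tokenizes SMILES" concrete, here is a small sketch of a regex tokenizer. The pattern is an assumption chosen for illustration (bracket atoms, the two-letter halogens Br and Cl, two-digit ring closures, then any single character); it is not necessarily the tokenization used in [21].

```python
import re

# Assumed token pattern: bracket atoms, Br/Cl, %nn ring closures, else one character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens for a sequence model."""
    tokens = TOKEN_RE.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles
    return tokens


source = "CC(=O)Oc1ccccc1C(=O)O.[Na+]"
tokens = tokenize(source)
print(" ".join(tokens))
```

The space-separated token string is the form a translation toolkit would consume as one "sentence".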
Template-free: Seq2Seq [21] 2017 – Results
• Results
• It performs comparably to the rule-based expert system baseline.
• Contribution
• It shows that a fully data-driven seq2seq model can learn retrosynthetic pathways.
• Limitations
• It produces grammatically invalid SMILES and chemically implausible predictions.
• It is a naive application of a seq2seq model.
• Predictions generated by a vanilla seq2seq model with beam search typically exhibit low diversity, with only minor differences in the suffix [8].
• Grammatically invalid SMILES
• Grammatically valid but chemically implausible
[21] Liu et al. [8] Chen et al.
Template-free: Molecular Transformer [6, 7] 2019 – Key Idea
• Like [21], it tokenizes SMILES and treats retrosynthesis as machine translation.
• It uses a Transformer instead of an LSTM.
• It performs better than Seq2Seq [21] but has the same limitations.
[6] Schwaller et al. [7] Lee et al. [21] Liu et al.
Template-free: Latent variable Transformer (LV-MT) [8] 2019 (arXiv) – Key Idea
• It extends the Molecular Transformer (MT) to generalize better to rare reactions and to produce diverse paths.
• Key Idea
• It proposes novel pretraining methods.
• Random bond cut
• Template-based bond cut
• It trains a mixture model with the online hard-EM algorithm.
[8] Chen et al.
Template-free: LV-MT [8] 2019 – Method (Pretrain)
• Pretraining methods
• Random bond cut
• For each input target product, it generates new examples by selecting a random bond to break.
• Template-based bond cut
• Instead of breaking bonds at random, it uses the templates to break bonds.
• The model is pretrained on these auxiliary examples and then used as the initialization for fine-tuning on the actual retrosynthesis data.
[8] Chen et al.
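The random bond cut can be sketched on a bare graph. This is an assumption-laden toy rather than the procedure of [8]: the molecule is reduced to atom indices and a bond list, and the helper names `connected_components` and `random_bond_cut` are invented here. Cutting a bond and collecting the connected components yields the fragments that would serve as pseudo-reactants for the original product.

```python
import random


def connected_components(n_atoms, bonds):
    """Return the atom-index groups that remain connected by the given bonds."""
    neighbors = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(comp))
    return components


def random_bond_cut(n_atoms, bonds, rng):
    """Break one randomly chosen bond and return it with the resulting fragments."""
    cut = rng.choice(bonds)
    remaining = [b for b in bonds if b != cut]
    return cut, connected_components(n_atoms, remaining)


# A four-atom chain 0-1-2-3; any cut splits it into two fragments.
bonds = [(0, 1), (1, 2), (2, 3)]
cut, fragments = random_bond_cut(4, bonds, random.Random(0))
print(cut, fragments)
```

Cutting a ring bond would leave a single fragment, so a real pipeline would need to decide how to treat that case.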
Template-free: LV-MT [8] 2019 – Method (Latent Var.)
• Why are latent variables introduced?
• They tackle the problem of generating diverse predictions.
• The outputs of beam search tend to be similar to each other.
• Given a target SMILES string x and a reactants SMILES string y, a mixture model introduces a multinomial latent variable z ∈ {1, ..., K} to capture different reaction types, and decomposes the marginal likelihood as:
$p(y \mid x; \theta) = \sum_{z=1}^{K} p(z \mid x; \theta)\, p(y \mid z, x; \theta)$
[8] Chen et al.
Template-free: LV-MT [8] 2019 – Method (Latent Var.)
• Hard-EM algorithm
1. Take a mini-batch of training examples.
2. Enumerate all K values of z and compute the loss for each.
• Dropout should be turned off [26].
3. For each example, select the value of z that yields the minimum loss.
• For p(y | z, x; θ), the encoder-decoder network is shared among mixture components, and the embedding of z is fed as an input to the decoder so that y is conditioned on it.
4. Back-propagate through the selected component, so that only one component receives gradients per example.
• Dropout should be turned back on [26].
[8] Chen et al. [26] Shen et al.
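The hard-EM steps above can be sketched on a deliberately tiny stand-in model. Instead of a Transformer, each mixture component z is assumed to be a single scalar mean and the loss is squared error, so there is no dropout to toggle; the point is only the select-the-best-z-then-update-it pattern. All names and numbers are illustrative.

```python
def hard_em_step(batch, means, lr=0.5):
    """One hard-EM pass on a toy mixture where each component z is a scalar mean."""
    for y in batch:
        # E-step: enumerate all K values of z and keep the one with minimum loss.
        losses = [(y - m) ** 2 for m in means]
        z = losses.index(min(losses))
        # M-step: move only the selected component toward the example
        # (a gradient-style step on the squared error).
        means[z] += lr * (y - means[z])
    return means


means = [1.0, 8.0]  # K = 2 components, arbitrary initial values
data = [0.0, 10.0, 0.2, 9.8, -0.2, 10.2]
for _ in range(20):
    hard_em_step(data, means)
print(means)
```

Each example updates exactly one component, which is what lets different components specialize on different modes of the data.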
Template-free: LV-MT [8] 2019 – Results
• Results*5
*5: We report better hyper-parameters and the results in Appendix-5
• Contributions
• It proposes novel pretraining methods for retrosynthesis.
• It uses a mixture-model Transformer for diverse predictions.
• Limitations
• The more latent variables are used, the worse the top-1 performance.
• The latent variable does not appear to contain information about the reaction class.
[8] Chen et al.
Template-free: Self-Corrected Transformer (SCROP) [22] 2020 – Key Idea
• It uses a Transformer to correct invalid predicted SMILES.
• It builds syntax-correction data from a trained Transformer by constructing pairs of invalid predictions and ground truths.
• It trains another Transformer as a syntax corrector on the syntax-correction data.
• At test time, it keeps the top-1 candidate produced by the syntax corrector and replaces the original prediction with it.
[22] Zheng et al.
Template-free: SCROP [22] 2020 – Results
• Results
• Compared to the Transformer without the syntax corrector (SCROP-noSC), performance improves by 0.4~1.7%.
• Invalid SMILES rates
• Limitations
• Why SCROP? Invalid SMILES can be removed with RDKit, without a learned model.
[22] Zheng et al.
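The idea of filtering a beam without a learned corrector can be sketched as below. To stay dependency-free, this uses a crude syntactic check (balanced parentheses and brackets, paired ring-closure digits) as a stand-in; it is an assumption of this example and far weaker than parsing with RDKit, which also checks valences and aromaticity. The function names are hypothetical.

```python
def looks_valid(smiles):
    """Crude syntax check: balanced parentheses and paired ring-closure digits."""
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue  # digits inside [...] are charges or isotopes, not ring labels
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            # A ring label opens on first sight and closes on the second.
            open_rings ^= {ch}
    return depth == 0 and not open_rings and not in_bracket


def clean_beam(candidates):
    """Drop candidates that fail the check and exact repeats, keeping rank order."""
    seen, kept = set(), []
    for s in candidates:
        if looks_valid(s) and s not in seen:
            seen.add(s)
            kept.append(s)
    return kept


beam = ["c1ccccc1O", "c1ccccc1O", "c1ccccc(O", "CC(=O)O"]
print(clean_beam(beam))  # ['c1ccccc1O', 'CC(=O)O']
```

String-level deduplication misses two different SMILES strings for the same molecule; canonicalizing each candidate first would be needed to catch those.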
Template-free: Graph2Graph (G2G) [9] 2020 (ICML) – Key Idea
• It decomposes retrosynthesis into a two-step procedure:
• Breaking the target product
• Transforming the broken target product
• It trains a Reaction Center Identification (RCI) module that produces synthons by breaking bonds in the product graph.
• It trains a Variational Graph Translation (VGT) module that produces reactants via a series of graph transformations.
[9] Shi et al.
Template-free: G2G [9] 2020 – Method (RCI)
• Reaction Center Identification (RCI)
• It uses an R-GCN [27] to learn graph representations.
• Overview
1. Given a chemical reaction, it derives a binary label matrix.
2. It computes node embeddings and a graph embedding.
3. To estimate the reactivity score of atom pair (i, j), the edge embedding is formed by concatenating several features.
4. The final reactivity score of atom pair (i, j) is computed from this edge embedding.
5. The RCI module is optimized by minimizing the cross-entropy loss on the binary labels.
[9] Shi et al. [27] Schlichtkrull et al.
Template-free: G2G [9] 2020 – Method (VGT)
• Reactants generation via Variational Graph Translation (VGT)
1. It receives synthons from the RCI module and transforms them into reactants.
2. It generates a sequence of graph transformation actions and applies them to the initial synthon graph.
• It models graph generation as a Markov Decision Process (MDP).
• Overview
1. Define the transformation trajectory as the sequence of actions; the graph transformation is deterministic once the trajectory is fixed.
2. Consider the intermediate graph obtained after applying a prefix of the action sequence to the initial synthon graph.
3. Under the MDP assumption, each action depends only on the current intermediate graph.
4. The graph transformation can therefore be factorized over the individual actions.
5. Each action is a tuple.
6. The action distribution is decomposed into three parts:
i. Termination prediction
ii. Node selection
iii. Edge labeling
7. It uses variational inference by introducing an approximate posterior.
[9] Shi et al.
Template-free: G2G [9] 2020 – Results
• Top-k results
[Table: top-k accuracy, with the reaction class given and with the reaction class unknown]
• Module performance
• Contribution
• It newly formulates retrosynthesis prediction as a graph-to-graphs translation task.
• Limitation
• A well-tuned Molecular Transformer performs better.
[9] Shi et al.
Template-free: GraphRetro [10] 2020 (arXiv) – Key Idea
• Like G2G [9], it breaks and then modifies graphs.
• G2G [9] modifies the graph at the level of atoms, whereas GraphRetro operates at the level of molecular fragments called leaving groups.
• G2G: sequential generation
• GraphRetro: leaving group selection
Template-free: GraphRetro [10] 2020 – Results
• Top-k results
• Module performance
• Contribution
• Selecting leaving groups is a good fit for retrosynthesis problems.
• Limitation
• Domain knowledge is required to create the leaving group vocabulary.
[10] Somnath et al.
Selection-based: Bayesian Retrosynthesis [11] – Key Idea
• It uses a pre-trained forward model as the likelihood in Bayes' theorem and approximates the posterior distribution over reactants.
• It uses Monte Carlo search to explore synthetic routes.
[11] Guo et al.
Selection-based: Bayesian Retrosynthesis [11] – Method
• Likelihood: a Boltzmann distribution with an inverse temperature.
• Energy function: Tanimoto distance between the target product and the product predicted by the forward model (Molecular Transformer).
• Approximate posterior
• Exact computation across all candidates is generally infeasible.
[11] Guo et al.
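A minimal sketch of such a Boltzmann likelihood is shown below, assuming fingerprints are sets of bit indices and taking the energy to be one minus the Tanimoto similarity. The inverse temperature `beta=10.0` and the example fingerprints are arbitrary illustrative choices, and in [11] the predicted product would come from the forward model rather than being supplied by hand.

```python
import math


def tanimoto_distance(x, y):
    union = len(x | y)
    return 1.0 - (len(x & y) / union if union else 1.0)


def boltzmann_likelihood(target_fp, predicted_fp, beta=10.0):
    """Unnormalised likelihood exp(-beta * energy), with energy = Tanimoto distance."""
    return math.exp(-beta * tanimoto_distance(target_fp, predicted_fp))


# Hypothetical fingerprints of the target and of three predicted products.
target = {1, 2, 3, 4}
exact = boltzmann_likelihood(target, {1, 2, 3, 4})
close = boltzmann_likelihood(target, {1, 2, 3, 5})
far = boltzmann_likelihood(target, {7, 8})
print(exact, close, far)
```

A larger beta concentrates the likelihood on reactant sets whose predicted product matches the target almost exactly.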
Selection-based: Bayesian Retrosynthesis [11] – Method (SMC)
• Sampling from the posterior
• Sequential Monte Carlo (SMC)
• Cons
• Particle impoverishment [38]
• Rapid loss of diversity
• Computational cost of running the forward model (Molecular Transformer)
• SMC accelerated by a surrogate likelihood
• It trains a gradient boosting regression tree that predicts the likelihood given by the Molecular Transformer.
Selection-based: Bayesian Retrosynthesis [11] – Results
• Results
[11] Guo et al. [38] Stavropoulos et al.
Challenges
Challenge 1. Balancing between template-free and template-based model
Challenge 2. Multi-Step retrosynthesis
Challenge 3. Extremely large space of synthesis routes
Challenge 4. Molecule decoding (Graph generation)
[3] Coley et al. [14] Segler et al.
Challenges: 1. Balancing between template-free and template-based model
• How about a hybrid model using uncertainty?
[Figure: template-based (pros: high interpretability; cons: low generalizability, requires domain knowledge) versus template-free (pros: generalizability; cons: invalid/inaccessible predictions, low interpretability)]
Challenges: 2. Multi-Step retrosynthesis
• Most real-world molecules cannot be synthesized in a single step.
• A synthesis can take up to 60 steps or even more.
• Error accumulation
• Extremely large search space
• The most recent work [13] uses neural-guided A* search.
[13] Chen et al.
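To show what multi-step planning has to do, here is a sketch of a plain depth-first route search. It is not the neural-guided A* of [13]: the one-step model is replaced by a hand-written lookup table, molecules are placeholder letters, and `find_route` is a name invented for this example. The search recurses on each reactant until everything is in the purchasable set.

```python
def find_route(target, one_step, purchasable, max_depth=5):
    """Depth-first search for a route from target down to purchasable molecules.

    Returns a list of (product, reactants) steps, or None if no route is found.
    """
    if target in purchasable:
        return []
    if max_depth == 0:
        return None
    for reactants in one_step.get(target, []):
        route = [(target, reactants)]
        for r in reactants:
            sub = find_route(r, one_step, purchasable, max_depth - 1)
            if sub is None:
                break  # this disconnection fails; try the next candidate
            route.extend(sub)
        else:
            return route  # every reactant was resolved
    return None


# Hypothetical one-step predictions: product -> candidate reactant sets.
one_step = {
    "D": [("X", "Y"), ("B", "C")],
    "C": [("A", "B")],
}
purchasable = {"A", "B"}
route = find_route("D", one_step, purchasable)
print(route)  # [('D', ('B', 'C')), ('C', ('A', 'B'))]
```

Because every step multiplies the number of candidates, exhaustive search like this becomes infeasible quickly, which is the motivation for guided search.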
Challenges: 3. Extremely large space of synthesis routes
• Each molecule could be synthesized from hundreds of different possible sets of reactants.
• How should the quality of a synthesis route be measured?
Challenges: 4. Molecule decoding (Graph generation)
• Modeling complex distributions over graphs and then sampling from them efficiently is challenging!
• Why is it challenging?
• Graph representations are non-unique.
• Graphs are high-dimensional.
• Complex, non-local dependencies between nodes and edges.
• Proposed methods
• Graph VAE [29] (ICANN 2018)
• Graph RNN [30] (ICML 2018)
• GRAN [31] (NeurIPS 2019)
• Junction tree VAE [35] (ICML 2018)
[29] Simonovsky et al. [30] You et al. [31] Liao et al. [35] Jin et al.
Practice: RDKit
• Data pre-processing (RDKit)
• RDKit [20] is an open-source library for cheminformatics.
• https://www.rdkit.org
• Why RDKit?
• Visualization
• Substructure searching
• Molecular similarity calculation
• Validity checking
• Various other functions for cheminformatics
• We have uploaded an RDKit tutorial notebook:
• https://github.com/wonjun-dev/contrastive-retro
Practice: OpenNMT
• OpenNMT
• OpenNMT [28] is an open-source library for neural machine translation.
• https://opennmt.net
• Why OpenNMT?
• It supports various models for the encoder-decoder framework.
• Built-in functions
• Easy to engineer
• Cons
• The codebase is very large.
• Limited flexibility
• Disconnected procedure (train, inference, performance check)*7
[28] Klein et al. *7: We made a fully automated script.
Practice: OpenNMT – Where you should change
• OpenNMT
• Primary files in OpenNMT
• Data loader
• preprocess.py
• inputter.py (./onmt/inputters)
• Options
• opts.py (./onmt) => Options for training, translation, preprocessing, etc. You can add your own options here.
• Train
• train.py => Entry point of training
• train_single.py (./onmt) => Second entry point of training
• trainer.py (./onmt) => Main training loop
• loss.py (./onmt/utils) => Classes for loss functions
• Model
• model_builder (./onmt)
• model.py (./onmt/models) => Model class
• model_saver (./onmt/models)
• Translation
• translate.py => Entry point of translation
• translator.py (./onmt/translate) => Translator class
• Performance check
• parse_output.py (./parse) => Parses the predicted output and calculates accuracy via RDKit.
Practice: OpenNMT – Automated script
• OpenNMT
• We provide a fully automated script (from training to parsing).
• https://github.com/wonjun-dev/contrastive-retro @master branch
• run_experiment_mt.sh
• Train – Inference (Translate) – Performance check (Parse) – Averaging
• arg[0]: GPU id
• arg[1]: seed
• run_average.py
• The performance of MT and LV-MT varies considerably with the seed.
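Because results vary with the seed, it helps to report a mean and a spread rather than a single run. The sketch below shows one way to do that; it is not the contents of run_average.py, and the accuracies are made-up placeholders rather than results from these slides.

```python
import statistics


def summarize(acc_by_seed):
    """Return the mean and sample standard deviation of per-seed accuracies."""
    values = list(acc_by_seed.values())
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical top-1 accuracies keyed by seed (placeholders, not reported results).
top1 = {0: 0.41, 1: 0.44, 2: 0.38}
mean, std = summarize(top1)
print(f"top-1: {mean:.3f} +/- {std:.3f}")
```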
Related works
• Forward synthesis
• Given reactants and reagents, predict the products.
• [7, 34, 36, 37]
• Reaction center prediction
• The task of identifying the reaction center is related to the step of deriving the synthons
(intermediate outcomes) in retrosynthesis.
• [9, 10, 33, 34]
• Graph generation
• Generative models for real-world graphs, including social, chemical, and knowledge graphs
• [29, 30, 31, 35]
Future directions
• Training chemical language models like BERT
• Learning better chemical representation
• Atomic or molecular embedding considering chemical properties
• Robust to SMILES augmentation
• Contrastive learning
• Template-Generative Hybrid model
• Graph encoding – SMILES decoding
• Graph decoding is challenging
• Predictive model for subgraph isomorphism
• Subgraph isomorphism is an NP-complete problem, so exact solutions do not scale.
References
[1] Weininger et al. “A chemical language and information system. 1. introduction to methodology and encoding
rules.” Journal of Chemical Information and Modeling, 1988.
[2] Christ et al. “Mining electronic laboratory notebooks: Analysis, retrosynthesis, and reaction based
enumeration.” Journal of Chemical Information and Modeling, 2012.
[3] Coley et al. “Computer-assisted retrosynthesis based on molecular similarity.” ACS Central Science, 2017.
[4] Klucznik et al. “Efficient syntheses of diverse, medicinally relevant targets planned by computer and executed
in the laboratory.” Chem, 2018.
[5] Dai et al. “Retrosynthesis prediction with conditional graph logic network”. NeurIPS, 2019.
[6] Schwaller et al. “Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction.” ACS
Central Science, 2019.
[7] Lee et al. “Molecular transformer unifies reaction prediction and retrosynthesis across pharma chemical space.”
Chemical Communications, 2019.
[8] Chen et al. “Learning to make generalizable and diverse predictions for retrosynthesis.” arXiv preprint 2019.
[9] Shi et al. “A graph to graphs framework for retrosynthesis prediction.”, ICML, 2020
[10] Somnath et al. “Learning graph models for template-free retrosynthesis.”, arXiv, 2020
[11] Guo et al. “A Bayesian algorithm for retrosynthesis.”, arXiv, 2020
[12] Lin et al. “Automatic retrosynthetic route planning using template-free models.”, Chem. Sci., 2020
[13] Chen et al. “Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search”, ICML, 2020
[14] Segler et al., “Neural-Symbolic machine learning for retrosynthesis and reaction prediction.”, Chemistry-A European
Journal, 2017
[15] Satoh et al., “A novel approach to retrosynthetic analysis using knowledge bases derived from reaction databases.”,
Chem. Inf. Comput. Sci., 1999
[16] Law et al., “Route designer: A retrosynthetic analysis tool utilizing automated retrosynthetic rule generation.”, Chem.
Inf., 2009
[17] Gasteiger et al., “A collection of computer methods for synthesis design and reaction prediction.”, Recl. Trav. Chim.
Pays-Bas, 1992
[18] Corey et al., “Computer-assisted analysis in organic synthesis.”, Science, 1985
[19] Corey et al., “The logic of chemical synthesis: Multistep synthesis of complex carbogenic molecules. (Nobel lecture)”,
1991
[20] http://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf
[21] Liu et al., “Retrosynthetic reaction prediction using neural sequence-to-sequence models.”, ACS Cent. Sci., 2017
[22] Zheng et al., “Predicting retrosynthetic reactions using self-corrected transformer neural networks.”, J. Chem. Inf.
Model., 2020
[23] Srivastava et al., “Highway networks”, NIPS, 2015
[24] https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fchem.201605499&file=chem201605499-sup-0001-misc_information.pdf
[25] http://www.reaxys.com, Reaxys is a registered trademark of RELX Intellectual Properties SA used under license.
[26] Shen et al., “Mixture models for diverse machine translation: Tricks of the trade.”, arXiv, 2019
[27] Schlichtkrull et al., “Modeling relational data with graph convolutional networks.”, In European
Semantic Web Conference, 2018
[28] Klein et al., “OpenNMT: Open-Source Toolkit for Neural Machine Translation.”, arXiv, 2017
[29] Simonovsky et al., “GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders.”,
ICANN, 2018
[30] You et al., “GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models.”, ICML, 2018
[31] Liao et al., “Efficient Graph Generation with Graph Recurrent Attention Networks.”, NeurIPS, 2019
[32] Mayfield et al., “Pistachio 2.0 edn software.”, 2018
[33] Coley et al., “A graph-convolutional neural network model for the prediction of chemical reactivity.”,
Chemical Science 2019
[34] Coley et al., “Predicting organic reaction outcomes with Weisfeiler-Lehman Network.”, NeurIPS, 2017
[35] Jin et al., “Junction Tree Variational Autoencoder for molecular graph generation.”, ICML, 2018
[36] Bradshaw et al., “A generative model for electron paths.”, ICLR, 2019
[37] Do et al., “Graph transformation policy network for chemical reaction prediction.”, KDD, 2019
[38] Stavropoulos et al., “Sequential Monte Carlo method in practice.”, Springer, 2001
Appendix
1. Subgraph isomorphism problem
• It is a computational task in which two graphs G and H are given as input, and one must det
ermine whether G contains a subgraph that is isomorphic to H
• NP-Complete
2. Molecular similarity metrics (x and y are molecular fingerprint)
90
Appendix
3. Reaction class
• Meta-information about the type of a chemical reaction.
• In USPTO, there are 10 reaction classes.
4. Parameterization by a GNN in [5]
• Graph embedding := average of the node embeddings
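The averaging step can be sketched as mean pooling over per-node vectors. This is a toy illustration that assumes the node embeddings are already computed and given as equal-length lists of floats. The function name `graph_embedding` is hypothetical and is not taken from [5].

```python
from typing import List, Sequence


def graph_embedding(node_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Return the element-wise mean of the node embedding vectors."""
    if not node_embeddings:
        raise ValueError("a graph needs at least one node embedding")
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    # Average each coordinate across all nodes.
    return [sum(vec[i] for vec in node_embeddings) / n for i in range(dim)]


# Two nodes with 2-dimensional embeddings.
pooled = graph_embedding([[1.0, 2.0], [3.0, 4.0]])
```

Mean pooling makes the graph embedding independent of node ordering and of the number of atoms in the molecule.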
5. Better hyperparameters for MT and the results
• Dropout p=0.25 is better than p=0.1.
• We can remove invalid and repeated SMILES via RDKit.
• Also, using 6 layers and increasing the dropout rate is better than using 4 layers.
                               Top 1   Top 3   Top 5   Top 10
MT [8]                         0.420   0.570   0.619   0.657
MT (p=0.25, w/o inval/repeat)  0.432   0.645   0.709   0.771
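The removal of invalid and repeated SMILES can be sketched as follows. This is a minimal sketch, not the exact post-processing behind the numbers above. The helper names `rdkit_canonical` and `filter_predictions` are hypothetical. The RDKit calls `Chem.MolFromSmiles` and `Chem.MolToSmiles` are imported lazily, so the filtering logic can also be run with any other canonicalizer.

```python
from typing import Callable, Iterable, List, Optional


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return the canonical SMILES, or None if RDKit cannot parse the string."""
    from rdkit import Chem  # lazy import; requires RDKit to be installed

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def filter_predictions(
    candidates: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> List[str]:
    """Drop unparsable candidates and duplicates, keeping the original rank order."""
    seen = set()
    kept = []
    for smiles in candidates:
        canonical = canonicalize(smiles)
        if canonical is None or canonical in seen:
            continue  # invalid, or the same molecule as an earlier candidate
        seen.add(canonical)
        kept.append(canonical)
    return kept
```

With RDKit installed, `filter_predictions(beam_outputs, rdkit_canonical)` would be applied to the ranked beam-search outputs (here the hypothetical list `beam_outputs`) before computing top-k accuracy, so that lower-ranked valid candidates move up into the top k.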
Thank you!
Any questions?

Retrosynthesis tutorial v2

  • 1. Introduction to Retrosynthesis Prediction 2020. 06 Wonjun Jeong wonjun.jg@kaist.ac.kr wonjun.email@gmail.com
  • 2. Table of Contents • Introduction • Retrosynthesis prediction • Dataset description • Overview of general approaches: Template-based, Template-free, Selection-based • Proposed methods • Classical computer-aided methods • Machine learning based methods • Challenges • Practice • RDKit • OpenNMT • Related works • Future directions • Reference • Appendix
  • 3. Table of Contents • Introduction • Retrosynthesis prediction • Dataset description • Overview of general approaches: Template-based, Template-free, Selection-based • Proposed methods • Classical computer-aided methods • Machine learning based methods • Challenges • Practice • RDKit • OpenNMT • Related works • Future directions • Reference • Appendix
  • 4. Retrosynthesis prediction • What is retrosynthesis prediction? • Retrosynthesis or retrosynthetic pathway planning is the process of tracing back the forward reaction, predicting which reactants are required to synthesize the target product. 4
  • 5. Retrosynthesis prediction • Retrosynthesis is crucial process of discovering new materials and drugs. 5 Desired properties Candidate Product Candidate Reactants Test by chemist Retrosynthesis prediction
  • 6. • Each process of discovering new materials and drug has own error, it should be verified by chemist. • Expensive 6 Desired properties Candidate Product Candidate Reactants Test by chemist Retrosynthesis prediction Retrosynthesis prediction
  • 7. Retrosynthesis prediction • Retrosynthesis prediction has highly depended on the trial-and-error cycles of experienced researchers of chemical expertise. 7
  • 8. Retrosynthesis prediction • If retrosynthesis prediction can be done with high accuracy … • Capable of unlocking future possibilities of a fully automated material/drug discovery pipeline. 8 Desired properties Candidate Product Candidate Reactants Test by robot Retrosynthesis prediction
  • 9. Dataset description • SMILES (Simplified Molecular-Input Line-Entry System) [1] • SMILES is a specification in the form of a line notation for describing the structure of chemical species [2]. • Generation of SMILES. • By printing symbol nodes encountered in a depth-first tree traversal of a chemical graph 9[1] Weininger et al .[2] https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
  • 10. Dataset description • SMILES in detail • Character of carbon(C) is omitted in the graph. • Hydrogen(H) is omitted in the SMILES. • Ring structures are written by breaking each ring at an arbitrary point to make an acyclic str ucture and adding numerical ring closure labels to show connectivity between non-adjacen t atoms. • Branches are described with parentheses. • A bond is represented using one of the symbols: ., -, =, #, $, :, /, • “.” indicates two parts are not bonded together 10[1] Weininger et al .[2] https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system
  • 11. Dataset description • Benchmark: 1. USPTO (United States Patent and Trademark Office) • USPTO benchmark contains SMIELS representation of single target product (input) and reactants (target) • Variants • USPTO-50k • USTPO-500K • USPTO-MIT 2. Pistachio [32] 3. Reaxys [25] 11[25] reaxys.com [32] Mayfield et al.
  • 12. Overview of general approaches: Template-based • Template-based approaches [2, 3, 4, 5, 14, 15, 16, 17] use the known chemical reaction which is called reaction template. • Reaction template contains sub-graph reaction patterns that describing how the reaction occur between reactants and product. • Pros • High interpretability • Cons • Low generalizability to unseen templates • Require domain knowledge to extract the reaction templates 12
  • 13. Overview of general approaches: Template-free • Template-free approaches [6, 7, 8, 9, 10, 12] learn mapping function product to a set of reactants by extracting features directly from data. • Seq2Seq framework • [6, 7, 8, 12] • Graph2Grpah framework • [9, 10] • Pros • Generalizability • Not require domain knowledge • Cons • Invalid/Inaccessible predictions • Low interpretability 13 f
  • 14. Overview of general approaches: Selection-based • Selection-based approaches [11] select a candidate set of purchasable reactants. • The objective of [11] is to discover retrosynthetic routes from a given desired product to co mmercially available reactants • Pros • Accessibility of the prediction • Not require domain knowledge • Cons • Novelty 14[11] Guo et al. Rank := f(product; ) Purchasable pool
  • 15. Table of Contents • Introduction • Retrosynthesis prediction • Dataset description • Overview of general approaches: Template-based, Template-free, Selection-based • Proposed methods • Classical computer-aided methods • Machine learning based methods • Challenges • Practice • RDKit • OpenNMT • Related works • Future directions • Reference • Appendix
  • 16. Classical computer-aided methods • Before deep learning, computer-aided retrosynthesis were mainly conducted using reaction template. [2, 3, 4, 15, 16, 17] • They are mainly about how to use known reactions and extract meaningful reaction context. • Characteristics • It needs chemical expertise. • Heuristics • Computationally expensive • Chemical space is vast • Subgraph isomorphism problem*1. • Not scalable • Not generalizable 16*1: Appendix-1
  • 17. Classical computer-aided methods • The first computer-aided retrosynthesis: • [18] Corey et al., “Computer-assisted analysis in organic synthesis.”, Science, 1985 • The author won the Nobel Prize in Chemistry for his contribution of retrosynthetic analysis. • [19] The Logic of Chemical Synthesis: Multistep Synthesis of Complex Carbogenic Mol ecules (Nobel lecture), 1991 17[18, 19] Corey et al.
  • 18. Classical computer-aided methods: Recent work [3] 2017 18[3] Coley et al.
  • 19. • Key Idea • It uses product similarity and reactants similarity to rank template of precedent reactions. 19[3] Coley et al. Classical computer-aided methods: Recent work [3] 2017 – Key Idea
  • 20. • How to measure molecular similarity*2? • Molecular fingerprints are a way of encoding the structure of molecule. We can use RDKit library to get it. • Most common way is Tanimoto similarity, but there is no canonical definition of molecule similarity (subgraph isomorphism problem*1). • , : Molecular fingerprint 20*1: Appendix-1, *2: Appendix-2 Img from [20] Classical computer-aided methods: Recent work [3] 2017 – Method (Similarity)
  • 21. • Example of using similarity in [3] • Total similarity := Product Sim * Reactants (Precursor) sim 21[3] Coley et al. Rank Classical computer-aided methods: Recent work [3] 2017 – Method (Using similarity)
  • 22. • Result of [3] • [3] performs better than seq2seq. However, the seq2seq in table is template-free and [3] is template-based. • Contribution • It mimics the retrosynthetic strategy by using molecular similarity without need to encode any chemical knowledge. • Limitation • It inherently disfavors making creative retrosynthetic strategy because it relies on precedent reactions. 22*3: Appendix-3 *3 Classical computer-aided methods: Recent work [3] 2017 - Results
  • 23. Table of Contents • Introduction • Retrosynthesis prediction • Dataset description • Overview of general approaches: Template-based, Template-free, Selection-based • Proposed methods • Classical computer-aided methods • Machine learning based methods • Challenges • Practice • RDKit • Open NMT • Related works • Future directions • Reference • Appendix • Library • Related works
  • 24. Machine learning based methods • Data-driven methods using machine learning and deep learning have been actively studied since the mid-2010s. • They reduce the need for expert knowledge. • They are more scalable and generalizable. • Representative proposed methods • Template-based • NeuralSym [14], Graph Logic Network (GLN) [5] • Template-free • Seq2Seq [21], Molecular Transformer (MT) [6, 7], Latent variable Transformer (LV-MT) [8], Self-Corrected Transformer (SCROP) [22], Graph2Graph (G2G) [9], GraphRetro [10] • Selection-based • Bayesian-Retro [11]
  • 25. Machine learning based methods Template-based: NeuralSym [14] 2017 [14] Segler et al.
  • 26. • Template-based: NeuralSym [14] (2017) • Key Idea • Given a target product, it uses a neural network to predict the most suitable rule among the reaction templates. [14] Segler et al. Machine learning based methods Template-based: NeuralSym [14] 2017 – Key Idea
  • 27. • Template-based: NeuralSym [14] • It uses simple models such as an MLP and a Highway network [23]. • It frames rule selection as multiclass classification. • The molecular descriptor [24] is defined as a sum of molecular fingerprints. [14] Segler et al. [23] Srivastava et al. [24] pdf file Machine learning based methods Template-based: NeuralSym [14] 2017 - Method
  • 28. • Template-based: NeuralSym [14] • Experiments • Dataset: Reaxys database [25] • Number of classes: 8720 • Contribution • It shows that neural networks can learn in which molecular contexts particular rules can be applied. • Limitation • The performance is affected by rule set cardinality. • The larger the rule set, the lower the performance. [14] Segler et al. Machine learning based methods Template-based: NeuralSym [14] 2017 - Results
  • 29. • Template-based: Graph Logic Network (GLN) [5] (NeurIPS 2019) [5] Dai et al. Machine learning based methods Template-based: Graph Logic Network [5] 2019
  • 30. • Key Idea • It models the joint distribution of reaction templates and reactants using logic variables. • It learns when rules from reaction templates should be applied. [5] Dai et al. Machine learning based methods Template-based: Graph Logic Network [5] 2019 – Key Idea
  • 31. • Retrosynthesis Template • Applying a retrosynthesis template can be decomposed into two logic steps: • Match template • Match reactants [5] Dai et al. Machine learning based methods Template-based: Graph Logic Network [5] 2019 - Background
  • 32. • Match template • Match reactants • Uncertainty • Template score function • Reactants score function [5] Dai et al. Machine learning based methods Template-based: Graph Logic Network [5] 2019 - Method
  • 33. • Final joint probability • Parameterized by a GNN (Graph Neural Network)*4 [5] Dai et al. *4: Appendix-4 Machine learning based methods Template-based: Graph Logic Network [5] 2019 - Method
  • 34. • MLE with Efficient Inference • Gradient approximation [5] Dai et al. Machine learning based methods Template-based: Graph Logic Network [5] 2019 - Method
  • 35. • Top-k results • Contribution • Interpretability: integration of probabilistic models and templates (chemical rules) • Limitation • It shares the limitations of template-based methods • Scalability [5] Dai et al. Machine learning based methods Template-based: Graph Logic Network [5] 2019 - Results
  • 36. Machine learning based methods Template-free: Seq2Seq [21] 2017 [21] Liu et al.
  • 37. • Template-free: Seq2Seq [21] (2017) • It tokenizes SMILES and treats retrosynthesis as machine translation. • It uses a bidirectional LSTM encoder and an LSTM decoder. • It uses beam search to produce a set of candidate reactants. [21] Liu et al. Machine learning based methods Template-free: Seq2Seq [21] 2017 - Method
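To make "tokenizes SMILES" concrete, here is a small regex-based tokenizer sketch. The pattern is a simplified assumption of the kind commonly used for SMILES translation models, not necessarily the tokenization used in [21]; it treats bracket atoms, `Br`, `Cl`, and two-digit ring closures as single tokens and performs no chemical validation.

```python
import re

# Order matters: multi-character tokens must be tried before single characters.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"                   # bracket atoms such as [NH3+] or [C@@H]
    r"|Br|Cl"                       # two-letter halogens
    r"|%\d{2}"                      # two-digit ring closures such as %12
    r"|[A-Za-z]"                    # any other single letter is treated as an atom symbol
    r"|\d"                          # single-digit ring closures
    r"|[=#\-\+\(\)\./\\@:~\*\$]"    # bonds, branches, and other symbols
)


def tokenize(smiles):
    """Split a SMILES string into tokens; raise if some character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"cannot tokenize SMILES: {smiles!r}")
    return tokens


print(" ".join(tokenize("CC(=O)Cl")))
print(" ".join(tokenize("c1ccc([NH3+])cc1")))
```

The space-separated token strings are the form a translation model would consume as source (product) and target (reactants) sentences.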
  • 38. • Results • It performs comparably to the rule-based expert system baseline. • Contribution • It shows that a fully data-driven seq2seq model can learn retrosynthetic pathways. • Limitations • It produces grammatically invalid SMILES and chemically implausible predictions. • It is a naïve application of a seq2seq model. • Predictions generated by a vanilla seq2seq model with beam search typically exhibit low diversity, with only minor differences in the suffix. [8] [21] Liu et al., [8] Chen et al. Machine learning based methods Template-free: Seq2Seq [21] 2017 – Results
  • 39. • Grammatically invalid SMILES • Grammatically valid but chemically implausible [21] Liu et al. Machine learning based methods Template-free: Seq2Seq [21] 2017 – Results
  • 40. Machine learning based methods Template-free: Molecular Transformer [6, 7] 2019 [6] Schwaller et al., [7] Lee et al.
  • 41. • Key Idea • Like [21], it tokenizes SMILES and treats retrosynthesis as machine translation. • It uses a Transformer instead of an LSTM. • It performs better than seq2seq [21] but has the same limitations. [6] Schwaller et al., [7] Lee et al. [21] Liu et al. Machine learning based methods Template-free: Molecular Transformer [6, 7] 2019 – Key Idea
  • 42. • Template-free: Latent variable Transformer (LV-MT) [8] (arXiv 2019) [8] Chen et al. Machine learning based methods Template-free: LV-MT [8] 2019
  • 43. • It extends the Molecular Transformer (MT) to generalize better to rare reactions and to produce diverse paths. • Key Idea • It proposes novel pretraining methods. • Random bond cut • Template-based bond cut • It trains a mixture model with the online hard-EM algorithm. [8] Chen et al. Machine learning based methods Template-free: LV-MT [8] 2019 – Key Idea
  • 44. • Pretraining methods • Random bond cut • For each input target product, it generates new examples by selecting a random bond to break. • Template-based bond cut • Instead of randomly breaking bonds, it uses the templates to break bonds. • The model is pre-trained on these auxiliary examples and then used as the initialization for fine-tuning on the actual retrosynthesis data. [8] Chen et al. Machine learning based methods Template-free: LV-MT [8] 2019 – Method (Pretrain)
  • 45. • Why are latent variables introduced? • They tackle the problem of generating diverse predictions. • The outputs of beam search tend to be similar to each other. • Given a target SMILES string x and a reactants SMILES string y, a mixture model introduces a multinomial latent variable z ∈ {1, …, K} to capture different reaction types, and decomposes the marginal likelihood as a sum over the K values of z. [8] Chen et al. Machine learning based methods Template-free: LV-MT [8] 2019 – Method (Latent Var.)
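Written out, the mixture decomposition described on this slide takes the standard form below. The symbols x, y, z, K, and θ are those used on the slide; the uniform prior in the second line is an assumption often made for such mixture models, not something stated on the slide.

```latex
\[
p(y \mid x; \theta)
  = \sum_{z=1}^{K} p(y, z \mid x; \theta)
  = \sum_{z=1}^{K} p(z \mid x; \theta)\, p(y \mid z, x; \theta)
\]
% Assumed special case: with a uniform prior p(z | x) = 1/K this becomes
\[
p(y \mid x; \theta) = \frac{1}{K} \sum_{z=1}^{K} p(y \mid z, x; \theta)
\]
```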
  • 46. • Hard-EM algorithm 1. Take a mini-batch of training examples. 2. Enumerate all K values of z and compute the loss for each. • Dropout should be turned off [26]. 3. For each example, select the value of z that yields the minimum loss. • For p(y | z, x; θ), the encoder-decoder network is shared among mixture components, and the embedding of z is fed as an input to the decoder so that y is conditioned on it. 4. Back-propagate through the selected component only, so only one component receives gradients per example. • Dropout should be turned back on [26]. [8] Chen et al., [26] Shen et al. Machine learning based methods Template-free: LV-MT [8] 2019 – Method (Latent Var.)
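The selection step of the hard-EM procedure above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the per-component losses are given as a precomputed table of made-up numbers, whereas in real training they would come from the shared encoder-decoder evaluated with dropout off. `hard_em_step` is a hypothetical helper name.

```python
def hard_em_step(loss_per_component):
    """Hard assignment for one mini-batch.

    loss_per_component[i][z] is the loss of example i under latent value z.
    Returns the chosen z per example and the mean loss of the chosen components.
    """
    # E-step (hard): pick the latent value with the minimum loss for each example.
    assignments = [
        min(range(len(row)), key=row.__getitem__) for row in loss_per_component
    ]
    # M-step target: only the selected component contributes to the training loss,
    # so only that component would receive gradients for the example.
    selected = [row[z] for row, z in zip(loss_per_component, assignments)]
    batch_loss = sum(selected) / len(selected)
    return assignments, batch_loss


# Toy mini-batch: 2 examples, K = 3 latent values (numbers are made up).
losses = [[2.0, 0.5, 1.0],
          [0.3, 0.9, 0.4]]
assignments, batch_loss = hard_em_step(losses)
print(assignments, batch_loss)
```

In an actual training loop the loss of the selected component would be recomputed with dropout turned back on before back-propagation, as the slide notes.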
  • 47. • Results*5 *5: We report better hyper-parameters and the results in Appendix-5 Machine learning based methods Template-free: LV-MT [8] 2019 – Results
  • 48. • Contributions • It proposes novel pretraining methods for retrosynthesis. • It uses a mixture-model Transformer for diverse predictions. • Limitations • The more latent variables are used, the worse the top-1 performance. • The latent variable does not appear to contain information about the reaction class. [8] Chen et al. Machine learning based methods Template-free: LV-MT [8] 2019 – Results
  • 49. • Template-free: Self-Corrected Transformer (SCROP) [22] (2020) [22] Zheng et al. Machine learning based methods Template-free: SCROP [22] 2020
  • 50. • Template-free: Self-Corrected Transformer (SCROP) [22] (2020) • Key Idea • It uses a Transformer to correct invalid predicted SMILES. • It builds syntax-correction data from a trained Transformer by constructing a set of (invalid prediction, ground truth) pairs. • It trains another Transformer as a syntax corrector on the syntax-correction data. • At test time, it keeps the top-1 candidate produced by the syntax corrector and replaces the original prediction with it. [22] Zheng et al. Machine learning based methods Template-free: SCROP [22] 2020 – Key Idea
  • 51. • Results • Compared to the Transformer without syntax correction (SCROP-noSC), the performance is improved by 0.4~1.7%. [22] Zheng et al. Machine learning based methods Template-free: SCROP [22] 2020 – Results
  • 52. • Invalid SMILES rates • Limitations • Why SCROP? Invalid SMILES can be removed with RDKit, without a learned model. [22] Zheng et al. Machine learning based methods Template-free: SCROP [22] 2020 – Results
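The point about filtering without a learned model can be sketched as below. To keep the example runnable without RDKit, the canonicalizer is passed in as a function: `filter_predictions` and `toy_canonicalize` are hypothetical names, and the toy lookup table stands in for a real validity check. The `rdkit_canonicalize` helper shows how RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles` would plug in, assuming RDKit is installed.

```python
def filter_predictions(predictions, canonicalize):
    """Drop invalid and repeated predictions, keeping the original ranking order.

    canonicalize(smiles) must return a canonical SMILES string,
    or None if the input cannot be parsed.
    """
    seen, kept = set(), []
    for smiles in predictions:
        canonical = canonicalize(smiles)
        if canonical is None or canonical in seen:
            continue  # invalid SMILES or a duplicate of an earlier candidate
        seen.add(canonical)
        kept.append(canonical)
    return kept


def rdkit_canonicalize(smiles):
    """RDKit-backed canonicalizer (requires RDKit; imported lazily)."""
    from rdkit import Chem
    mol = Chem.MolFromSmiles(smiles)  # returns None for unparsable SMILES
    return None if mol is None else Chem.MolToSmiles(mol)


# Stand-in canonicalizer for the demo: a hand-written table, not real chemistry software.
toy_canonicalize = {"CCO": "CCO", "OCC": "CCO", "C(": None, "CC": "CC"}.get

beam = ["CCO", "OCC", "C(", "CC", "CCO"]
print(filter_predictions(beam, toy_canonicalize))
```

Passing `rdkit_canonicalize` instead of the toy table gives the RDKit-based filter; top-k accuracy would then be computed on the filtered list.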
  • 53. • Template-free: Graph2Graph (G2G) [9] (ICML 2020) [9] Shi et al. Machine learning based methods Template-free: G2G [9] 2020
  • 54. • Key Idea • It decomposes retrosynthesis into a 2-step procedure: • Breaking the target product • Transforming the broken target product • It trains a Reaction Center Identification (RCI) module that produces synthon(s) by breaking bonds in the product graph. • It trains a Variational Graph Translation module that produces reactants via a series of graph transformations. [9] Shi et al. Machine learning based methods Template-free: G2G [9] 2020 – Key Idea
  • 55. • Reaction Center Identification (RCI) • It uses an R-GCN [27] to learn graph representations. • Overview 1. Given a chemical reaction, it derives a binary label matrix over atom pairs. 2. It computes node embeddings and a graph embedding. 3. To estimate the reactivity score of atom pair (i, j), the edge embedding is formed by concatenating several features. 4. The final reactivity score of the atom pair (i, j) is computed from this edge embedding. 5. The RCI module is optimized by minimizing the binary cross-entropy loss on the binary labels. [9] Shi et al. [27] Schlichtkrull et al. Machine learning based methods Template-free: G2G [9] 2020 – Method (RCI)
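One way to write steps 3 to 5 of the RCI overview is given below. The notation is assumed for this sketch and may differ from [9]: h_i and h_j are node embeddings, h_G the graph embedding, A_ij the bond features, m a feed-forward network, σ the sigmoid, and B_ij the binary label of atom pair (i, j).

```latex
% Step 3: edge embedding by concatenation (\Vert denotes concatenation)
\[
e_{ij} = h_i \,\Vert\, h_j \,\Vert\, A_{ij} \,\Vert\, h_G
\]
% Step 4: reactivity score of atom pair (i, j)
\[
s_{ij} = \sigma\bigl(m(e_{ij})\bigr)
\]
% Step 5: binary cross-entropy loss over atom pairs (to be minimized)
\[
\mathcal{L}_{\mathrm{RCI}}
  = -\sum_{i \neq j} \Bigl[ B_{ij} \log s_{ij} + (1 - B_{ij}) \log (1 - s_{ij}) \Bigr]
\]
```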
  • 56. • Reactants generation via Variational Graph Translation (VGT). 1. It receives synthons from the RCI module and transforms them into reactants. 2. It generates a sequence of graph transformation actions and applies them to the initial synthon graph. • It models graph generation as a Markov Decision Process (MDP). [9] Shi et al. Machine learning based methods Template-free: G2G [9] 2020 – Method (VGT)
  • 57. • Reactants generation via Variational Graph Translation (VGT). • Overview 1. Define the transformation trajectory as the sequence of actions; once the trajectory is fixed, the graph transformation is deterministic. 2. Each intermediate graph is the result of applying the actions so far to the initial synthon graph. 3. Under the MDP assumption, each action depends only on the current intermediate graph. 4. Finally, the graph transformation can be factorized over the steps of the trajectory. [9] Shi et al. Machine learning based methods Template-free: G2G [9] 2020 – Method (VGT)
  • 58. • Reactants generation via Variational Graph Translation (VGT). • Overview (cont’d) 5. Each action is a tuple. 6. The action distribution is decomposed into 3 parts: i. Termination prediction ii. Node selection iii. Edge labeling 7. It uses variational inference by introducing an approximate posterior. [9] Shi et al. Machine learning based methods Template-free: G2G [9] 2020 – Method (VGT)
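The factorization and the variational objective outlined in the VGT overview can be written as follows. This is a sketch in assumed notation that may differ from [9]: S is the synthon graph, G the reactant graph, a_1, …, a_T the actions, G_t the graph after t actions (with G_0 = S), z a latent variable, and q the approximate posterior.

```latex
% Deterministic transformation given the trajectory, then the Markov assumption
\[
p(G \mid S) = p(a_{1:T} \mid S)
  = \prod_{t=1}^{T} p(a_t \mid a_{<t}, S)
  = \prod_{t=1}^{T} p(a_t \mid G_{t-1})
\]
% Standard evidence lower bound for a conditional latent-variable model
\[
\log p(G \mid S) \;\ge\;
  \mathbb{E}_{q(z \mid G, S)}\bigl[\log p(G \mid z, S)\bigr]
  - \mathrm{KL}\bigl(q(z \mid G, S) \,\Vert\, p(z \mid S)\bigr)
\]
```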
  • 59. • Top-k result (two settings: reaction class given, reaction class unknown) [9] Shi et al. Machine learning based methods Template-free: G2G [9] 2020 – Results
  • 60. • Module performance • Contribution • It formulates retrosynthesis prediction as a graph-to-graphs translation task, which is a novel formulation. • Limitation • A well-tuned Molecular Transformer performs better. [9] Shi et al. Machine learning based methods Template-free: G2G [9] 2020 – Results
  • 61. • Template-free: GraphRetro [10] (arXiv 2020) [10] Somnath et al. Machine learning based methods Template-free: GraphRetro [10] 2020
  • 62. • Template-free: GraphRetro [10] (arXiv 2020) • Key Idea • Like G2G [9], it uses the idea of breaking and modifying graphs. • G2G [9] modifies the graph at the level of atoms, whereas GraphRetro operates at the level of molecular fragments called leaving groups. • G2G: Sequential generation • GraphRetro: Leaving group selection [10] Somnath et al. Machine learning based methods Template-free: GraphRetro [10] 2020 – Key Idea
  • 63. • Top-k result [10] Somnath et al. Machine learning based methods Template-free: GraphRetro [10] 2020 - Results
  • 64. • Module performance • Contribution • Choosing a leaving group is a good idea for retrosynthesis problems. • Limitation • Domain knowledge is required to create a leaving group vocabulary. [10] Somnath et al. Machine learning based methods Template-free: GraphRetro [10] 2020 - Results
  • 65. Machine learning based methods Selection-based: Bayesian Retrosynthesis [11] [11] Guo et al.
  • 66. Machine learning based methods Selection-based: Bayesian Retrosynthesis [11] (cont’d) [11] Guo et al.
  • 67. Machine learning based methods Selection-based: Bayesian Retrosynthesis [11] – Key Idea • Key Idea • It uses a pre-trained forward model as the likelihood in Bayes’ theorem and an approximate posterior distribution over reactants. • It uses Monte Carlo search to explore synthetic routes. [11] Guo et al.
  • 68. Machine learning based methods Selection-based: Bayesian Retrosynthesis [11] – Method • Method • The likelihood is a Boltzmann distribution with an inverse temperature. • Energy function: Tanimoto distance between the target product and the product predicted by the forward model (Molecular Transformer) • Approximate posterior • Exact computation across all candidates is generally infeasible. [11] Guo et al.
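The Bayesian formulation described on this slide can be written as below. The notation is assumed for this sketch: S is a candidate set of reactants, y* the target product, f(S) the product predicted by the forward model, d_T the Tanimoto distance (one minus the Tanimoto similarity), and β the inverse temperature.

```latex
% Posterior over reactants via Bayes' theorem
\[
p(S \mid Y = y^{*}) \;\propto\; p(Y = y^{*} \mid S)\, p(S)
\]
% Boltzmann likelihood with the Tanimoto distance as energy
\[
p(Y = y^{*} \mid S) \;\propto\; \exp\!\bigl(-\beta\, d_{T}\bigl(y^{*}, f(S)\bigr)\bigr)
\]
```

A larger β concentrates the likelihood on reactant sets whose predicted product is close to the target.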
  • 69. Machine learning based methods Selection-based: Bayesian Retrosynthesis [11] – Method (SMC) • Method (Cont’d) • Sampling from the posterior • Sequential Monte Carlo (SMC) • Cons • Particle impoverishment [38] • Rapid loss of diversity • Computation cost of using the forward model (Molecular Transformer) [11] Guo et al. [38] Stavropoulos et al.
  • 70. Machine learning based methods Selection-based: Bayesian Retrosynthesis [11] – Method • Method (Cont’d) • SMC accelerated by a surrogate likelihood. • It trains a gradient boosting regression tree that predicts the likelihood given by the Molecular Transformer. [11] Guo et al.
  • 71. Machine learning based methods Selection-based: Bayesian Retrosynthesis [11] – Results • Results [11] Guo et al.
  • 73. Challenges Challenge 1. Balancing between template-free and template-based models Challenge 2. Multi-step retrosynthesis Challenge 3. Extremely large space of synthesis routes Challenge 4. Molecule decoding (Graph generation) [3] Coley et al. [14] Segler et al.
  • 74. Challenges: 1. Balancing between template-free and template-based model • How about a hybrid model using uncertainty? • Template-based: Pros • High interpretability Cons • Low generalizability • Requires domain knowledge • Template-free: Pros • Generalizability Cons • Invalid/inaccessible predictions • Low interpretability
  • 75. • Most molecules in the real world cannot be synthesized in one step. • A route can take up to 60 steps or even more. • Error accumulation • Extremely large space • The most recent work [13] uses neural-guided A* search. [13] Chen et al. Challenges: 2. Multi-Step retrosynthesis
  • 76. • Each molecule could be synthesized from hundreds of different possible reactants. • How should the quality of a synthesis route be measured? Challenges: 3. Extremely large space of synthesis routes
  • 77. • Modeling complex distributions over graphs and then efficiently sampling from them is challenging! • Why is it challenging? • Graph representations are non-unique • High-dimensional nature of graphs • Complex, non-local dependencies between nodes and edges. • Proposed methods • Graph VAE [29] (ICANN 2018) • Graph RNN [30] (ICML 2018) • GRAN [31] (NeurIPS 2019) • Junction tree VAE [35] (ICML 2018) [29] Simonovsky et al. [30] You et al. [31] Liao et al. [35] Jin et al. Challenges: 4. Molecule decoding (Graph generation)
  • 79. Practice: RDKit • Data pre-processing (RDKit) • RDKit [20] is an open-source library for cheminformatics. • https://www.rdkit.org • Why RDKit? • Visualization • Substructure searching • Molecular similarity calculation • Validity check • Various functions for cheminformatics • We uploaded an RDKit tutorial notebook: • https://github.com/wonjun-dev/contrastive-retro
  • 80. Practice: OpenNMT • OpenNMT • OpenNMT [28] is an open-source library for neural machine translation. • https://opennmt.net • Why OpenNMT? • It supports various models for the encoder-decoder framework. • Built-in functions. • Easy to engineer. • Cons • Large codebase • Limited flexibility • Disconnected procedure (train, inference, and performance check are separate steps)*7 [28] Klein et al., *7: We made a fully-automated script.
  • 81. Practice: OpenNMT – Where you should change • OpenNMT • Primary files in OpenNMT • Data loader • preprocess.py • inputter.py (./onmt/inputters) • Options • opts.py (./onmt) => Options for training, translation, preprocessing, etc. You can add your own options here. • Train • train.py => Entry point of training • train_single.py (./onmt) => Second entry point of training • trainer.py (./onmt) => Main training loop • loss.py (./onmt/utils) => Several classes for loss functions • Model • model_builder (./onmt) • model.py (./onmt/models) => Model class • model_saver (./onmt/models) • Translation • translate.py => Entry point of translation • translator.py (./onmt/translate) => Translator class • Performance check • parse_output.py (./parse) => Parses the predicted output and calculates accuracy via RDKit.
  • 82. Practice: OpenNMT – Automated script • OpenNMT • We provide a fully-automated (training to parsing) script. • https://github.com/wonjun-dev/contrastive-retro @master branch • run_experiment_mt.sh • Train – Inference (Translate) – Performance check (Parse) – Averaging • arg[0]: GPU id • arg[1]: seed • run_average.py • The performance of MT and LV-MT varies considerably depending on the seed.
  • 84. Related works • Forward synthesis • Given reactants and reagents, predict the products. • [7, 34, 36, 37] • Reaction center prediction • The task of identifying the reaction center is related to the step of deriving the synthons (intermediate outcomes) in retrosynthesis. • [9, 10, 33, 34] • Graph generation • Generative models for real-world graphs, including social, chemical, and knowledge graphs • [29, 30, 31, 35]
  • 86. Future directions • Training chemical language models like BERT • Learning better chemical representations • Atomic or molecular embeddings that take chemical properties into account • Robustness to SMILES augmentation • Contrastive learning • Template-generative hybrid model • Graph encoding – SMILES decoding • Graph decoding is challenging • Predictive model for subgraph isomorphism • Subgraph isomorphism is an NP-complete problem, so exact matching is not scalable.
  • 87. References
  [1] Weininger et al. “A chemical language and information system. 1. Introduction to methodology and encoding rules.” Journal of Chemical Information and Modeling, 1988.
  [2] Christ et al. “Mining electronic laboratory notebooks: Analysis, retrosynthesis, and reaction based enumeration.” Journal of Chemical Information and Modeling, 2012.
  [3] Coley et al. “Computer-assisted retrosynthesis based on molecular similarity.” ACS Central Science, 2017.
  [4] Klucznik et al. “Efficient syntheses of diverse, medicinally relevant targets planned by computer and executed in the laboratory.” Chem, 2018.
  [5] Dai et al. “Retrosynthesis prediction with conditional graph logic network.” NeurIPS, 2019.
  [6] Schwaller et al. “Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction.” ACS Central Science, 2019.
  [7] Lee et al. “Molecular transformer unifies reaction prediction and retrosynthesis across pharma chemical space.” Chemical Communications, 2019.
  [8] Chen et al. “Learning to make generalizable and diverse predictions for retrosynthesis.” arXiv preprint, 2019.
  [9] Shi et al. “A graph to graphs framework for retrosynthesis prediction.” ICML, 2020.
  [10] Somnath et al. “Learning graph models for template-free retrosynthesis.” arXiv, 2020.
  [11] Guo et al. “A Bayesian algorithm for retrosynthesis.” arXiv, 2020.
  [12] Lin et al. “Automatic retrosynthetic route planning using template-free models.” Chem. Sci., 2020.
  [13] Chen et al. “Retro*: Learning Retrosynthetic Planning with Neural Guided A* Search.” ICML, 2020.
  • 88. References
  [14] Segler et al. “Neural-Symbolic machine learning for retrosynthesis and reaction prediction.” Chemistry-A European Journal, 2017.
  [15] Satoh et al. “A novel approach to retrosynthetic analysis using knowledge bases derived from reaction databases.” Chem. Inf. Comput. Sci., 1999.
  [16] Law et al. “Route designer: A retrosynthetic analysis tool utilizing automated retrosynthetic rule generation.” Chem. Inf., 2009.
  [17] Gasteiger et al. “A collection of computer methods for synthesis design and reaction prediction.” Recl. Trav. Chim. Pays-Bas, 1992.
  [18] Corey et al. “Computer-assisted analysis in organic synthesis.” Science, 1985.
  [19] Corey et al. “The logic of chemical synthesis: Multistep synthesis of complex carbogenic molecules (Nobel lecture).” 1991.
  [20] http://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf
  [21] Liu et al. “Retrosynthetic reaction prediction using neural sequence-to-sequence models.” ACS Cent. Sci., 2017.
  [22] Zheng et al. “Predicting retrosynthetic reactions using self-corrected transformer neural networks.” J. Chem. Inf. Model., 2020.
  [23] Srivastava et al. “Highway networks.” NIPS, 2015.
  [24] https://chemistry-europe.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1002%2Fchem.201605499&file=chem201605499-sup-0001-misc_information.pdf
  [25] http://www.reaxys.com, Reaxys is a registered trademark of RELX Intellectual Properties SA used under license.
  [26] Shen et al. “Mixture models for diverse machine translation: Tricks of the trade.” arXiv, 2019.
  • 89. References
  [27] Schlichtkrull et al. “Modeling relational data with graph convolutional networks.” European Semantic Web Conference, 2018.
  [28] Klein et al. “OpenNMT: Open-Source Toolkit for Neural Machine Translation.” arXiv, 2017.
  [29] Simonovsky et al. “GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders.” ICANN, 2018.
  [30] You et al. “GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models.” ICML, 2018.
  [31] Liao et al. “Efficient Graph Generation with Graph Recurrent Attention Networks.” NeurIPS, 2019.
  [32] Mayfield et al. “Pistachio 2.0 edn software.” 2018.
  [33] Coley et al. “A graph-convolutional neural network model for the prediction of chemical reactivity.” Chemical Science, 2019.
  [34] Coley et al. “Predicting organic reaction outcomes with Weisfeiler-Lehman Network.” NeurIPS, 2017.
  [35] Jin et al. “Junction Tree Variational Autoencoder for molecular graph generation.” ICML, 2018.
  [36] Bradshaw et al. “A generative model for electron paths.” ICLR, 2019.
  [37] Do et al. “Graph transformation policy network for chemical reaction prediction.” KDD, 2019.
  [38] Stavropoulos et al. “Sequential Monte Carlo method in practice.” Springer, 2001.
  • 90. Appendix 1. Subgraph isomorphism problem • It is a computational task in which two graphs G and H are given as input, and one must determine whether G contains a subgraph that is isomorphic to H. • NP-complete 2. Molecular similarity metrics (x and y are molecular fingerprints)
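Two widely used fingerprint similarity metrics are written out below for reference. Here x and y are binary fingerprints, |x| is the number of on-bits in x, and |x ∧ y| is the number of bits set in both. Which metrics the original slide listed is not recoverable from the transcript, so this selection is an assumption.

```latex
% Tanimoto (Jaccard) similarity
\[
\mathrm{Tanimoto}(x, y) = \frac{|x \wedge y|}{|x| + |y| - |x \wedge y|}
\]
% Dice similarity
\[
\mathrm{Dice}(x, y) = \frac{2\,|x \wedge y|}{|x| + |y|}
\]
```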
  • 91. Appendix 3. Reaction class • Meta-information about the type of chemical reaction. • In USPTO, there are 10 reaction classes.
  • 92. Appendix 4. Parameterizing by GNN in [5] • Graph embedding := average of the node embeddings
  • 93. Appendix 5. Better hyper-parameters of MT and the results. • Dropout p=0.25 is better than p=0.1. • Invalid and repeated SMILES can be removed via RDKit. • Also, using 6 layers and increasing the dropout rate is better than using 4 layers.
  Model                           Top 1   Top 3   Top 5   Top 10
  MT [8]                          0.420   0.570   0.619   0.657
  MT (p=0.25, w/o inval/repeat)   0.432   0.645   0.709   0.771
  • 94. Thank you ! Any Questions ?