Intro to deep reinforcement learning
and applications to molecular design
Dan Elton
UMD College Park Fuge group tea talk
delton@umd.edu
December 5, 2018
Overview
1 Intro to RL
The Bellman equation
TD learning
Value vs policy learning
2 Deep Q learning
3 RL for molecular optimization
Implementation details
Tricks
Results
Interpretation of Q-functions
Hillclimb-MLE
4 References
Basic concepts in RL
The goal of the RL agent is to maximize the expected return, which is
the sum of future rewards:

G_t = \sum_{k=1}^{\infty} r_{t+k}

Normally we want to include a discount factor 0 ≤ γ ≤ 1:

G_t = \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k}
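As a minimal illustration (not from the slides), the discounted return for a finite list of future rewards can be computed directly from this definition:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = sum_{k>=1} gamma^(k-1) * r_{t+k}, for rewards = [r_{t+1}, r_{t+2}, ...]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Example: three future rewards with gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```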
Basic concepts in RL, continued...
A policy π = {(si , ai )} assigns an action to every possible state and
completely specifies the behavior of the agent.
A state-value function V π(s) under policy π is the expected future
return obtained by starting in state s and following policy π.
An action-value function Qπ(s, a) under policy π is the total future
return expected by starting in state s, taking action a and following policy
π from there.
An action-value function can be related to a policy via a softmax:

\pi(s, a_i) = \frac{e^{\beta Q(s, a_i)}}{\sum_j e^{\beta Q(s, a_j)}}

In the limit β → ∞ this gives a “greedy” policy that always exploits the highest-value
action; a lower β gives a more “explorative” policy. Another option is to use an ε-greedy
policy.
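A minimal sketch of this softmax (Boltzmann) action selection, assuming the Q-values for the current state are available as an array:

```python
import numpy as np

def softmax_policy(q_values, beta=1.0):
    """pi(a|s) proportional to exp(beta * Q(s, a)); larger beta is greedier."""
    z = beta * np.asarray(q_values, dtype=float)
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample_action(q_values, beta=1.0):
    rng = np.random.default_rng()
    return rng.choice(len(q_values), p=softmax_policy(q_values, beta))

# Example: beta = 2 strongly favors the highest-value action
print(softmax_policy([1.0, 2.0, 0.5], beta=2.0))
```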
The Bellman equation
V^\pi(s) = E[G_t \mid S_t = s]
         = E[r_{t+1} + \gamma G_{t+1} \mid S_t = s]
         = \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma E(G_{t+1} \mid S_{t+1} = s')]

V^\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma V^\pi(s')]    (1)
If we know p(s', r|s, a), then we can calculate V^π(s) for all s, which may
be denoted as a vector V^π. The Bellman equation must be solved
iteratively, but it can be proven that the iterative solution converges to the
correct answer. However, normally we do not know p(s', r|s, a) in
advance. In that case, we can use some form of value-function learning such as
TD-learning. With value-function-based methods we still need to learn a good
policy, so we start from a random (equiprobable-action) policy,
run it forward, and alternate policy evaluation and policy improvement.
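A minimal sketch of this iterative policy evaluation, assuming tabular arrays for the dynamics and collapsing the reward distribution into its expectation R[s, a, s'] (the array layout is an assumption for illustration):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Fixed-point iteration on Eq. (1).
    P:  (S, A, S) transition probabilities p(s'|s, a)
    R:  (S, A, S) expected reward for each (s, a, s') transition
    pi: (S, A) action probabilities pi(a|s)
    """
    V = np.zeros(P.shape[0])
    while True:
        # Bellman backup for every state at once
        V_new = np.einsum('sa,sap,sap->s', pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```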
Other jargon
A model-free method does not require a model of the transition dynamics
p(s', r|s, a) to be learned. Instead, it learns from sampled episodes.
An off-policy method learns the optimal greedy policy while following a
different policy that ensures exploration (such as ε-greedy).
Temporal difference (TD) learning
A simple method for value function learning is TD-learning. The
algorithm is as follows:
Initialize V(s) arbitrarily for all states s. Choose a policy π(s, a) to
evaluate. Pick a random starting state s.
Repeat for each time step t:
1. Pick an action a in state s, according to the policy π(s, a).
2. Act with a and move from state s to state s', collect reward r, and
compute the TD-error: δ = r + γV(s') − V(s).
3. Update V(s) according to: V(s) ← V(s) + αδ
4. Move to the next state: s ← s'.
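A minimal tabular TD(0) sketch of this loop; the `env.reset()` / `env.step(a)` and `policy(s)` interfaces are illustrative assumptions, not from the slides:

```python
def td0_evaluation(env, policy, n_steps=10_000, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V = {}
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        # TD error, with V(s') = 0 when the episode terminates
        delta = r + gamma * V.get(s_next, 0.0) * (not done) - V.get(s, 0.0)
        V[s] = V.get(s, 0.0) + alpha * delta
        s = env.reset() if done else s_next
    return V
```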
TD-learning has the following properties:
It is an online method that “bootstraps”, i.e. it updates value estimates
using other value estimates.
It can be proven that the method converges to the exact V^π(s) for a
given policy.
Value function learning vs policy gradient methods
Broadly speaking, RL methods can be broken into two categories:
Value function learning
Learn a value function or action-value function, using a Bellman-equation or
TD-learning type approach. Techniques include Q-learning and
actor-critic methods.
When it works, this can be much more sample efficient. Empirically
these methods converge faster, although there is so far no
mathematical proof that they always converge faster.
Policy learning & policy gradient methods
The canonical policy gradient method is the REINFORCE algorithm,
which is used in ORGAN for molecule generation and in several
papers on molecular generation with RNNs.
Usually requires Monte Carlo rollouts to a final state (a complete
“episode”), which can be computationally demanding or impossible in
the case of continuing (non-episodic) tasks.
May suffer from high variance when estimating the gradient.
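A minimal sketch of the REINFORCE gradient estimate for a single episode; this is a generic illustration, not the specific setup used by ORGAN or the RNN papers, and it assumes the gradients of log π(a_t|s_t) were collected while acting:

```python
import numpy as np

def reinforce_gradient(logprob_grads, rewards, gamma=0.99, baseline=0.0):
    """Monte Carlo policy gradient: sum_t (G_t - b) * grad_theta log pi(a_t|s_t)."""
    G, grad = 0.0, 0.0
    for g_t, r_t in zip(reversed(logprob_grads), reversed(rewards)):
        G = r_t + gamma * G                          # return from step t onward
        grad = grad + (G - baseline) * np.asarray(g_t, dtype=float)
    return grad                                      # ascend: theta <- theta + lr * grad
```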
Value function learning vs policy gradient methods,
continued..
Olivecrona et al. train an RNN with MLE and then fine-tune it using RL.
They argue for policy based learning as follows:
“For the problem addressed in this study, we believe that policy based
methods are the natural choice for three reasons:
Policy based methods can learn explicitly an optimal stochastic policy,
which is our goal.
The method used starts with a prior sequence model. The goal is to
fine-tune this model according to some specified scoring function.
Since the prior model already constitutes a policy, learning a fine-tuned
policy might require only small changes to the prior model.
The episodes in this case are short and fast to sample, reducing the
impact of the variance in the estimate of the gradients.”
Deep Q learning
The goal is to learn the action-value function Q(s, a) using a neural-network
approximator Q(s, a; θ) with parameters θ, so as to approximate the
optimal action-value function Q*(s, a):

Q^*(s, a) = \max_\pi E[G_t \mid S_t = s, A_t = a, \pi]

The general Bellman equation for Q^π(s, a) is

Q^\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\left[ r + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s', a') \right]

The Bellman optimality equation for Q^* is

Q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\left[ r + \gamma \max_{a'} Q^*(s', a') \right]

This can be solved iteratively as

Q_{i+1}(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\left[ r + \gamma \max_{a'} Q_i(s', a') \right]    (2)
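When the dynamics are known and small enough to store, iteration (2) can be written directly; a tabular sketch, again assuming expected rewards R[s, a, s']:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, n_iter=500):
    """Iterate Eq. (2): Q_{i+1}(s,a) = sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * max_a' Q_i(s',a')).
    P, R: (S, A, S) arrays of transition probabilities and expected rewards."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(n_iter):
        Q = np.einsum('sap,sap->sa', P, R + gamma * Q.max(axis=1)[None, None, :])
    return Q
```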
Deep Q learning, continued..
In deep Q-learning, the neural-network model Q(s, a; θ) is retrained at the start of each
iteration of the Bellman-equation solution to reduce the mean squared error between the LHS
and the RHS of the Bellman equation.
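A sketch of that regression objective, assuming PyTorch and a separate frozen target network as in Mnih et al.; the batch layout and network interfaces are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared Bellman error for a minibatch of transitions (s, a, r, s', done)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a; theta)
    with torch.no_grad():                                         # target from frozen copy
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)
```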
This approach was popularized by the DeepMind work:
Mnih, et al. “Human-level control through deep reinforcement learning”. Nature 518, pp. 529–533, 2015
A single deep Q-network based agent achieved human-level performance on 49 Atari 2600
games, receiving only pixel values and the game score as inputs.
The input was 210×160 color video at 60 Hz.
Experience replay
Improves the stability and efficiency of deep Q-learning. The experiences of
the agent at each timestep, e_t = (s_t, a_t, r_t, s_{t+1}), are stored in a dataset
D_t = {e_1, · · · , e_t} which is assembled over many episodes (runs).
Then, each time Q is retrained, minibatch learning is performed using not
only the current transition but also a set of experiences drawn randomly from
D.
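A minimal replay-buffer sketch (uniform sampling; the prioritized replay mentioned later in the slides would weight the draws instead):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions e_t = (s, a, r, s_next, done)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random sampling breaks correlations between consecutive experiences
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```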
Deep Q Learning for Molecular Optimization
Zhou, et al. “Optimization of Molecules Via Deep Reinforcement Learning”. Oct. 2018,
arXiv:1810.08678v2.
We have a Markov decision process MDP(S, A, {Psa}, R)
S is the state space. s ∈ S is a tuple, (m, t) where m is the molecule and t is the
number of steps taken. The number of steps that can be taken is limited to T,
leading to a finite (but still very large) state space.
A is the action space. Possible actions are:
Atom addition - this is a replacement of implicit hydrogen(s) with some other atom
(ensuring valence rules are followed).
Bond addition - this can be performed with atoms with ”free valence” (which
doesn’t include implicit hydrogens).
Bond removal - this is either reducing the order of a bond (i.e. from double to
single), or removing a bond altogether. If removal of a bond results in a
disconnected atom, that atom is removed as well.
{Psa} are the state transition probabilities. They are all set to 1 here, meaning state
transitions are deterministic.
R denotes the reward function of the state (m, t). Rewards are calculated at each
step. However, to ensure that the final state is rewarded more than
intermediary states, a discount factor of γ^(T−t)
is applied. They used γ = 0.99 (a minimal sketch of this MDP appears below).
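A minimal sketch of this MDP; the action-enumeration and scoring functions are supplied by the caller (in the paper they are built with RDKit), and the default step limit here is an illustrative assumption:

```python
class MoleculeMDP:
    """State is the pair (molecule, step count); transitions are deterministic."""
    def __init__(self, init_mol, actions_fn, reward_fn, max_steps=40, gamma=0.99):
        self.mol, self.t = init_mol, 0
        self.actions_fn, self.reward_fn = actions_fn, reward_fn
        self.T, self.gamma = max_steps, gamma

    def actions(self):
        # candidate molecules reachable by one atom-addition / bond-addition / bond-removal edit
        return self.actions_fn(self.mol)

    def step(self, new_mol):
        self.mol, self.t = new_mol, self.t + 1
        # reward at every step, weighted by gamma^(T - t) so the final state counts most
        r = self.gamma ** (self.T - self.t) * self.reward_fn(self.mol)
        done = self.t >= self.T
        return (self.mol, self.t), r, done
```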
Implementation details
Molecules are converted to a vector using a Morgan fingerprint with radius
3 and length 2048. They used a 4-layer neural net with ReLU activations
and layer sizes of [1024, 512, 128, 32].
They used ε-greedy policy exploration with linear annealing of ε from 1 to
0.001.
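The annealing schedule is a simple linear interpolation, sketched below (the total number of annealing steps is an assumption):

```python
def epsilon_schedule(step, total_steps=100_000, eps_start=1.0, eps_end=0.001):
    """Linearly anneal the exploration rate from eps_start to eps_end."""
    frac = min(step / float(total_steps), 1.0)
    return eps_start + frac * (eps_end - eps_start)
```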
They used multiple-objective RL. This involves a vector of rewards
rt = [r1,t, · · · , rk,t]. Instead of just doing a linear weighted sum to get a
new scalar reward, Zhou et al. learn a separate Qi(s, a) for the expected
return from each reward, implemented as a multitask neural
network with a separate output for each Qi(s, a). The optimal action is chosen
via a scalarized Q:

a_t = \arg\max_a \mathbf{w}^T \mathbf{Q}(s, a)    (3)

where w ∈ R^k is a vector of weights. This method can have issues:
competition between the rewards can yield sub-optimal results.
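A sketch of the scalarized action selection in (3), assuming the multitask network's outputs for one state have been collected into an array with one column per objective:

```python
import numpy as np

def choose_action(multi_q, w):
    """multi_q: (n_actions, k) Q-values, one column per objective; w: (k,) weights.
    Returns argmax_a of w^T Q(s, a)."""
    scalarized = multi_q @ np.asarray(w, dtype=float)   # shape (n_actions,)
    return int(np.argmax(scalarized))
```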
A review of multiple objective RL methods can be found in Liu, et al. IEEE
Transactions on Systems, Man, and Cybernetics 2015, 45, 385-398.
Tricks
We talked about using a softmax or ε-greedy policy to allow for
exploration. Another approach to exploration is the following:
Osband et al. “Deep Exploration via Randomized Value Functions”.
arXiv:1703.07608 (2017)
They train H independent Q-functions, each on a different subset of the samples.
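A sketch of the bootstrapping idea: assign each transition to a random subset of the H Q-heads via a binary mask (the keep probability here is an illustrative choice, not from the paper):

```python
import numpy as np

def bootstrap_masks(n_transitions, n_heads, p_keep=0.5, seed=0):
    """mask[i, h] == True means transition i is used to train Q-head h."""
    rng = np.random.default_rng(seed)
    return rng.random((n_transitions, n_heads)) < p_keep
```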
Other tricks they used:
prioritized experience replay
Double Q-learning
Results
Results
Benefits of the “DQN” approach
Starts from scratch
No need to train a generative model (which can take significant GPU time, i.e. weeks).
Possible weaknesses of the “DQN” approach
Starts from scratch (Olivecrona et al. talk about “drift” being an issue with RL)
Needs carefully tuned reward function
Reward curve
Interpretation of Q-functions
Hillclimb-MLE
Neil et al. (2018) introduce “Hillclimb-MLE” for optimization with an MLE-trained RNN:
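A hedged sketch of the Hillclimb-MLE loop as described by Neil et al.: sample from the model, keep the top-scoring fraction, and fine-tune the model by maximum likelihood on that elite set. The `model.sample` / `model.fit_mle` interfaces and the hyperparameters are assumptions for illustration:

```python
def hillclimb_mle(model, score_fn, n_rounds=20, n_samples=1024, top_frac=0.25):
    """Iteratively re-fit a generative model on its own highest-scoring samples."""
    for _ in range(n_rounds):
        samples = model.sample(n_samples)                 # e.g. SMILES strings from the RNN
        ranked = sorted(samples, key=score_fn, reverse=True)
        elite = ranked[: int(top_frac * n_samples)]
        model.fit_mle(elite)                              # maximum-likelihood fine-tuning step
    return model
```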
References
Luca Mazzucato (2011)
Computational neuroscience: a physicist’s point of view
Richard S. Sutton and Andrew G. Barto (2018)
Reinforcement Learning: An Introduction, 2nd edition
Mnih, et al. (2015)
Human-level control through deep reinforcement learning
Nature 518, pp. 529–533
Zhou, Kearnes, Li, Zare, Riley (2018)
Optimization of Molecules via Deep Reinforcement Learning
arXiv:1810.08678v2
Olivecrona et al. (2017)
Molecular de-novo design through deep reinforcement learning
Journal of Cheminformatics, 9 (1)
The End