Computing near-optimal policies from trajectories
by solving a sequence of standard supervised
learning problems
Damien Ernst
Hybrid Systems Control Group - Supélec
TU Delft, November 16, 2006
Damien Ernst Computing .... (1/22)
From trajectories to optimal policies: control community
versus computer science community
Problem of inference of (near) optimal policies from trajectories
has been studied by the control community and the computer
science community.
Control community favors a two-stage approach: identification of
an analytical model and derivation from it of an optimal policy.
Computer science community favors approaches which compute
optimal policies directly from the trajectories, without relying on
the identification of an analytical model.
Learning optimal policies from trajectories is known in the
computer science community as reinforcement learning.
Reinforcement learning
Classical approach: to infer state-action value functions from trajectories.
Discrete (and not too large) state and action spaces: state-action value functions can be represented in tabular form.
Continuous or large state and action spaces: function
approximators need to be used. Also, learning must be done from
finite and generally very sparse sets of trajectories.
Parametric function approximators with gradient-descent-like methods: very popular techniques, but they do not generalize well enough to move from academic to real-world problems !!!
Using supervised learning to solve the generalization
problem in reinforcement learning
Problem of generalization over an information space. Occurs also
in supervised learning (SL).
What is SL ? To infer from a sample of input-output pairs (input = information state; output = class label or real number) a model which explains these input-output pairs as well as possible.
Supervised learning is highly successful: state-of-the-art SL algorithms have been successfully applied to problems where the information state has thousands of components.
What we want: we want to use the generalization capabilities
of supervised learning in reinforcement learning.
Our answer: an algorithm named fitted Q iteration which
infers (near) optimal policies from trajectories by solving a
sequence of standard supervised learning problems.
Learning from a sample of trajectories: the RL approach
Problem formulation (deterministic version)
Discrete-time dynamics: x_{t+1} = f(x_t, u_t), t = 0, 1, . . ., where x_t ∈ X and u_t ∈ U.
Cost function: c(x, u) : X × U → R, with c(x, u) bounded by B_c.
Instantaneous cost: c_t = c(x_t, u_t)
Discounted infinite horizon cost associated to stationary policy µ : X → U: J^µ(x) = lim_{N→∞} Σ_{t=0}^{N−1} γ^t c(x_t, µ(x_t)), where γ ∈ [0, 1[.
Optimal stationary policy µ*: policy that minimizes J^µ for all x.
Objective: Find an optimal policy µ∗.
We do not know: The discrete-time dynamics and the cost
function.
We know instead: A set of trajectories {(x_0, u_0, c_0, x_1, · · · , u_{T−1}, c_{T−1}, x_T)_i}_{i=1}^{nbTraj}.
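As a concrete reading of this setup, here is a minimal numerical sketch on an invented one-dimensional system; the dynamics f, the cost c and the policy below are illustrative only, not taken from the talk.

```python
# Toy deterministic control problem: x_{t+1} = f(x_t, u_t), cost c(x, u),
# and a truncated evaluation of the discounted infinite-horizon cost J^mu.
GAMMA = 0.9

def f(x, u):
    """Toy deterministic dynamics x_{t+1} = f(x_t, u_t) (invented)."""
    return 0.5 * x + u

def c(x, u):
    """Toy cost, bounded on the states visited here (invented)."""
    return x * x + u * u

def discounted_cost(mu, x0, n_steps=200):
    """Truncated estimate of J^mu(x0) = sum_t gamma^t c(x_t, mu(x_t))."""
    x, total = x0, 0.0
    for t in range(n_steps):
        u = mu(x)
        total += GAMMA ** t * c(x, u)
        x = f(x, u)
    return total

# the stationary policy mu(x) = -0.5 x drives this state to 0 in one step
print(discounted_cost(lambda x: -0.5 * x, 1.0))   # → 1.25
```

The learning problem of the slides is precisely that f and c are unknown: only sampled trajectories of (x_t, u_t, c_t) are available.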
Some dynamic programming results
Sequence of state-action value functions Q_N : X × U → R,
Q_N(x, u) = c(x, u) + γ min_{u′∈U} Q_{N−1}(f(x, u), u′), ∀N > 1,
with Q_1(x, u) ≡ c(x, u), converges to the Q-function, unique solution of the Bellman equation:
Q(x, u) = c(x, u) + γ min_{u′∈U} Q(f(x, u), u′).
Necessary and sufficient optimality condition: µ*(x) ∈ arg min_{u∈U} Q(x, u).
Suboptimal stationary policy µ*_N: µ*_N(x) ∈ arg min_{u∈U} Q_N(x, u).
Bound on µ*_N: J^{µ*_N} − J^{µ*} ≤ 2 γ^N B_c / (1 − γ)².
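These results can be checked numerically. Below, the Q_N recursion and the suboptimality bound are evaluated on an invented two-state, two-action deterministic problem (minimization, as in the slides); all numbers are illustrative.

```python
# Q_N recursion Q_N = c + gamma * min_u' Q_{N-1}(f(x,u), u') on a toy MDP,
# plus the analytical bound 2 * gamma^N * B_c / (1 - gamma)^2 for N = 10.
GAMMA = 0.8
X, U = [0, 1], [0, 1]

def f(x, u):                     # toy deterministic transition (invented)
    return (x + u) % 2

def c(x, u):                     # toy cost, bounded by B_c = 1 (invented)
    return 1.0 if x == 1 else 0.1 * u

B_C = 1.0

def q_iteration(n_iter):
    """Compute Q_N by iterating the recursion n_iter - 1 times from Q_1 = c."""
    Q = {(x, u): c(x, u) for x in X for u in U}          # Q_1
    for _ in range(n_iter - 1):
        Q = {(x, u): c(x, u) + GAMMA * min(Q[f(x, u), up] for up in U)
             for x in X for u in U}
    return Q

Q10, Q11 = q_iteration(10), q_iteration(11)
gap = max(abs(Q11[k] - Q10[k]) for k in Q10)             # contraction at work
bound = 2 * GAMMA ** 10 * B_C / (1 - GAMMA) ** 2
print(f"sup-norm gap Q_11 - Q_10: {gap:.4f}; bound on the loss at N=10: {bound:.3f}")
```

The gap between successive Q_N shrinks at least as fast as γ^N B_c (here it vanishes very quickly), while the bound on J^{µ*_N} − J^{µ*} decays like γ^N.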
Fitted Q iteration: the algorithm
Set of trajectories {(x_0, u_0, c_0, x_1, · · · , u_{T−1}, c_{T−1}, x_T)_i}_{i=1}^{nbTraj} transformed into a set of system transitions F = {(x^l_t, u^l_t, c^l_t, x^l_{t+1})}_{l=1}^{#F}.
Fitted Q iteration computes from F the functions Q̂_1, Q̂_2, . . ., Q̂_N, approximations of Q_1, Q_2, . . ., Q_N.
Computation done iteratively by solving a sequence of standard supervised learning (SL) problems. Training sample for the kth (k ≥ 1) problem is {((x^l_t, u^l_t), c^l_t + γ min_{u∈U} Q̂_{k−1}(x^l_{t+1}, u))}_{l=1}^{#F}, with Q̂_0(x, u) ≡ 0. From the kth training sample, the supervised learning algorithm outputs Q̂_k.
µ̂*_N(x) ∈ arg min_{u∈U} Q̂_N(x, u) is taken as an approximation of µ*(x).
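The iteration above can be sketched in a few lines. In this sketch the supervised learner is a trivial tabular averager standing in for the regression-tree ensembles used in the talk; the dynamics, costs and sample are invented for illustration.

```python
# Fitted Q iteration on an invented three-state toy problem: build F, then
# repeatedly fit Q-hat_k to targets c + gamma * min_u' Q-hat_{k-1}(x', u').
GAMMA = 0.8
U = [0, 1]

def f(x, u):                     # toy dynamics, used only to generate F
    return (x + u) % 3

def c(x, u):                     # cost 1 in the "bad" state 2, else 0
    return 1.0 if x == 2 else 0.0

# set F of system transitions (x_t, u_t, c_t, x_{t+1}); here one per pair
F = [(x, u, c(x, u), f(x, u)) for x in range(3) for u in U]

def fit(training_set):
    """SL step: average the targets observed for each (x, u) input."""
    sums, counts = {}, {}
    for (x, u), y in training_set:
        sums[x, u] = sums.get((x, u), 0.0) + y
        counts[x, u] = counts.get((x, u), 0) + 1
    table = {k: sums[k] / counts[k] for k in sums}
    return lambda x, u: table.get((x, u), 0.0)

Q = lambda x, u: 0.0             # Q-hat_0 ≡ 0
for _ in range(20):              # iterations k = 1 .. N
    training_set = [((x, u), r + GAMMA * min(Q(xn, up) for up in U))
                    for (x, u, r, xn) in F]
    Q = fit(training_set)

policy = lambda x: min(U, key=lambda u: Q(x, u))   # mu-hat*_N
print([policy(x) for x in range(3)])               # → [0, 0, 1]
```

The recovered policy avoids the costly state (it moves out of state 2 and never enters it), even though only sampled transitions, not f and c, were given to the learner.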
Fitted Q iteration: some remarks
The performance of the algorithm depends on the supervised learning (SL) method chosen.
Excellent performance has been observed when combined with supervised learning methods based on ensembles of regression trees.
Works also for stochastic systems.
Consistency can be ensured under appropriate assumptions on the
SL method, the sampling process, the system dynamics and the
cost function.
Illustration I: Bicycle balancing and riding
[Figure: diagram of the bicycle showing the angles ϕ, θ and ω, the torque T, the centrifugal force F_cen, the centre of mass CM, the gravity force Mg, the angle ψ_goal between the bicycle and the goal, the contact point of the back wheel with the ground (x_b, y_b), the frame of the bike, and the centre of the goal at coordinates (x_goal, y_goal).]
Stochastic problem
Seven state variables and two control actions
Time between t and t + 1 = 0.01 s
Cost function of the type:
c(x_t, u_t) = 1 if the bicycle has fallen, coeff · (|ψ_goal_{t+1}| − |ψ_goal_t|) otherwise
Illustration I: Results
Trajectory generation: actions taken at random, bicycle initially far from the goal, and a trajectory ends when the bicycle falls.
[Figure: (left) trajectories of the bicycle in the (x_b, y_b) plane for initial angles ψ_0 ∈ {−π, −3π/4, −π/2, −π/4, 0, π/4, π/2, 3π/4, π}, all reaching the goal; (right) success percentage (from zero to one hundred) versus number of trajectories (100 to 900).]
Q-learning with neural networks: 100,000 trajectories were needed to compute a policy that drives the bicycle to the goal !!!!
Illustration II: Navigation from visual percepts
[Figure: navigation problem on a 100 × 100 grid. The agent in position p_t = (p(0), p(1)) chooses among possible actions; e.g. u_t = "go up" gives p_{t+1}(1) = min(p_t(1) + 25, 100). The cost c(p, u) equals 0, −1 or −2 depending on the region of the grid. The agent observes a 30 × 30 pixel image centred on p_t, with grey levels in {0, 1, · · · , 255}; pixels(p_t) is a 30 ∗ 30 element vector such that the grey level of the pixel located at the ith line and the jth column of the observation image is the (30 ∗ i + j)th element of this vector.]
[Figure: policies µ̂*_10 computed from (a) 500, (b) 2000 and (c) 8000 system transitions, shown on the (p(0), p(1)) domain [0, 100] × [0, 100]; (d) score J^{µ̂*_10} versus number of system transitions #F (1000 to 9000), with p and with pixels(p) used as state input, compared to the optimal score J^{µ*}.]
Illustration III: Computation of Structured Treatment
Interruption (STI) for HIV+ patients
STI for HIV: to cycle the patient on and off drug therapy
In some remarkable cases, STI strategies have enabled the
patients to maintain immune control over the virus in the
absence of treatment
STIs offer patients periods of relief from treatment
We want to compute optimal STI strategies.
Illustration III: What can reinforcement learning techniques
offer ?
Have the potential to infer from clinical data good STI
strategies, without modeling the HIV infection dynamics.
Clinical data: time evolution of the patient's state (CD4+ T cell count, systemic costs of the drugs, etc.) recorded at discrete-time instants, and the sequence of drugs administered.
Clinical data can be seen as trajectories of the immune system
responding to treatment.
A pool of HIV-infected patients follow some (possibly suboptimal) STI protocols and are monitored at regular intervals. The monitoring of each patient generates a trajectory for the optimal STI problem, which typically contains the following information: state of the patient at time t_0; drugs taken by the patient between t_0 and t_1 = t_0 + n days; state of the patient at time t_1; drugs taken by the patient between t_1 and t_2 = t_1 + n days; state of the patient at time t_2; drugs taken by the patient between t_2 and t_3 = t_2 + n days; etc. The trajectories are processed by using reinforcement learning techniques. Processing of the trajectories gives some (near) optimal STI strategies, often under the form of a mapping between the state of the patient at a given time and the drugs he has to take till the next time his state is monitored.
Figure: Determination of optimal STI strategies from clinical data by using reinforcement learning algorithms: the overall principle.
Illustration III: state of the research
Promising results have been obtained by using “fitted Q
iteration” to analyze some artificially generated clinical data.
Next step: to analyze real-life clinical data generated by the
SMART study (Smart Management of Anti-Retroviral
Therapies).
Figure: Taken from
http://www.cpcra.org/docs/pubs/2006/croi2006-smart.pdf
Conclusions
The fitted Q iteration algorithm (combined with ensembles of regression trees) has been evaluated on several problems and consistently performed much better than other reinforcement learning algorithms.
Is its performance sufficient to lead to many successful real-life applications ? I think YES !!!
Why has this algorithm not been proposed before ?
Fitted Q iteration: future research
Computational burdens grow with the number of trajectories
⇒ problems with on-line applications. Possible solution: to
keep only the most informative trajectories.
Which supervised learning algorithm is really the best to use in the inner loop of the fitted Q iteration process ? Several criteria need to be considered: distribution of the data, computational burdens, numerical stability of the algorithm, etc.
Customization of SL methods (e.g. split criteria in trees other
than variance reduction, etc)
Supervised learning in dynamic programming: general view
Dynamic programming: resolution of optimal control problems
by extending iteratively the optimization horizon.
Two main algorithms: value iteration and policy iteration.
Fitted Q iteration: based on the value iteration algorithm.
Fitted Q iteration can be extended to the case where the dynamics and the cost function are known, becoming an Approximate Value Iteration (AVI) algorithm.
Approximate Policy Iteration algorithms (API) based on SL
have recently been proposed. Can also be adapted to the case
where only trajectories are available.
For both SL based AVI and API, problem of generation of the
right trajectories is extremely important.
How to put all these works into a unified framework ?
Beyond dynamic programming...
Standard Model Predictive Control (MPC) formulation: solve, in a receding-horizon manner, a sequence of open-loop deterministic optimal control problems by relying on classical optimization algorithms (sequential quadratic programming, interior point methods, etc.)
Could SL-based dynamic programming provide good optimizers for MPC ? They have at least the advantage of not being intrinsically suboptimal when the system is stochastic !!! But there is a problem with the computation of max_{u∈U} model(x, u) when dealing with large U.
How to extend MPC-like formulations to stochastic systems ?
Solutions have been proposed for problems with discrete disturbance spaces, but the dimension of the search space for the optimization algorithm is O((number of disturbances)^(optimization horizon)). Research direction: selection of a subset of relevant disturbances.
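To make the blow-up concrete, here is a tiny computation on invented numbers: with d possible disturbance values per stage and an optimization horizon H, the number of disturbance scenarios an open-loop formulation must account for grows as d^H.

```python
# Exponential growth of the scenario count with the optimization horizon.
def n_scenarios(d, horizon):
    """Number of disturbance scenarios: d values per stage over the horizon."""
    return d ** horizon

for H in (5, 10, 20):
    # d = 3 disturbance values per stage, a hypothetical figure
    print(H, n_scenarios(3, H))
```

Even for d = 3, the count passes three billion at horizon 20, which is why selecting a small subset of relevant disturbance scenarios matters.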
Planning under uncertainties: an example
Development of MPC-like algorithms for scheduling production
under uncertainties through selection of “interesting disturbance
scenarios”.
Figure: A typical hydro power plant production scheduling problem (hydraulic and thermal production; quantity produced u serving the customers' demand; clouds as the disturbance w acting on the state x; cost depending on demand − u).
Additional readings
“Tree-based batch mode reinforcement learning”. D. Ernst, P. Geurts
and L. Wehenkel. In Journal of Machine Learning Research. April 2005,
Volume 6, pages 503-556.
“Selecting concise sets of samples for a reinforcement learning agent”. D.
Ernst. In Proceedings of CIRAS 2005, 14-16 December 2005, Singapore.
(6 pages)
“Clinical data based optimal STI strategies for HIV: a reinforcement
learning approach”. D. Ernst, G.B. Stan, J. Goncalves and L. Wehenkel.
In Proceedings of Benelearn 2006, 11-12 May 2006, Ghent, Belgium. (8
pages)
“Reinforcement learning with raw image pixels as state input”. D. Ernst, R. Marée and L. Wehenkel. International Workshop on Intelligent Computing in Pattern Analysis/Synthesis (IWICPAS). Proceedings series: LNCS, Volume 4153, pages 446-454, August 2006.
“Reinforcement learning versus model predictive control: a comparison”. D. Ernst, M. Glavic, F. Capitanescu and L. Wehenkel. Submitted.
“Planning under uncertainties by solving non-linear optimization
problems: a complexity analysis”. D. Ernst, B. Defourny and L.
Wehenkel. In preparation.

Harvesting wind energy in Greenland: a project for Europe and a huge step tow...Harvesting wind energy in Greenland: a project for Europe and a huge step tow...
Harvesting wind energy in Greenland: a project for Europe and a huge step tow...Université de Liège (ULg)
 
Décret favorisant le développement des communautés d’énergie renouvelable
Décret favorisant le développement des communautés d’énergie renouvelableDécret favorisant le développement des communautés d’énergie renouvelable
Décret favorisant le développement des communautés d’énergie renouvelableUniversité de Liège (ULg)
 
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...Université de Liège (ULg)
 
Reinforcement learning, energy systems and deep neural nets
Reinforcement learning, energy systems and deep neural nets Reinforcement learning, energy systems and deep neural nets
Reinforcement learning, energy systems and deep neural nets Université de Liège (ULg)
 
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...Université de Liège (ULg)
 
Reinforcement learning for data-driven optimisation
Reinforcement learning for data-driven optimisationReinforcement learning for data-driven optimisation
Reinforcement learning for data-driven optimisationUniversité de Liège (ULg)
 
Electricity retailing in Europe: remarkable events (with a special focus on B...
Electricity retailing in Europe: remarkable events (with a special focus on B...Electricity retailing in Europe: remarkable events (with a special focus on B...
Electricity retailing in Europe: remarkable events (with a special focus on B...Université de Liège (ULg)
 
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNST
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNSTProjet de décret « GRD »: quelques remarques du Prof. Damien ERNST
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNSTUniversité de Liège (ULg)
 
Time to make a choice between a fully liberal or fully regulated model for th...
Time to make a choice between a fully liberal or fully regulated model for th...Time to make a choice between a fully liberal or fully regulated model for th...
Time to make a choice between a fully liberal or fully regulated model for th...Université de Liège (ULg)
 
Electrification and the Democratic Republic of the Congo
Electrification and the Democratic Republic of the CongoElectrification and the Democratic Republic of the Congo
Electrification and the Democratic Republic of the CongoUniversité de Liège (ULg)
 

More from Université de Liège (ULg) (20)

Reinforcement learning for electrical markets and the energy transition
Reinforcement learning for electrical markets and the energy transitionReinforcement learning for electrical markets and the energy transition
Reinforcement learning for electrical markets and the energy transition
 
Algorithms for the control and sizing of renewable energy communities
Algorithms for the control and sizing of renewable energy communitiesAlgorithms for the control and sizing of renewable energy communities
Algorithms for the control and sizing of renewable energy communities
 
AI for energy: a bright and uncertain future ahead
AI for energy: a bright and uncertain future aheadAI for energy: a bright and uncertain future ahead
AI for energy: a bright and uncertain future ahead
 
Extreme engineering for fighting climate change and the Katabata project
Extreme engineering for fighting climate change and the Katabata projectExtreme engineering for fighting climate change and the Katabata project
Extreme engineering for fighting climate change and the Katabata project
 
Ex-post allocation of electricity and real-time control strategy for renewabl...
Ex-post allocation of electricity and real-time control strategy for renewabl...Ex-post allocation of electricity and real-time control strategy for renewabl...
Ex-post allocation of electricity and real-time control strategy for renewabl...
 
Big infrastructures for fighting climate change
Big infrastructures for fighting climate changeBig infrastructures for fighting climate change
Big infrastructures for fighting climate change
 
Harvesting wind energy in Greenland: a project for Europe and a huge step tow...
Harvesting wind energy in Greenland: a project for Europe and a huge step tow...Harvesting wind energy in Greenland: a project for Europe and a huge step tow...
Harvesting wind energy in Greenland: a project for Europe and a huge step tow...
 
Décret favorisant le développement des communautés d’énergie renouvelable
Décret favorisant le développement des communautés d’énergie renouvelableDécret favorisant le développement des communautés d’énergie renouvelable
Décret favorisant le développement des communautés d’énergie renouvelable
 
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...
Harnessing the Potential of Power-to-Gas Technologies. Insights from a prelim...
 
Reinforcement learning, energy systems and deep neural nets
Reinforcement learning, energy systems and deep neural nets Reinforcement learning, energy systems and deep neural nets
Reinforcement learning, energy systems and deep neural nets
 
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...
Soirée des Grands Prix SEE - A glimpse at the research work of the laureate o...
 
Reinforcement learning for data-driven optimisation
Reinforcement learning for data-driven optimisationReinforcement learning for data-driven optimisation
Reinforcement learning for data-driven optimisation
 
Electricity retailing in Europe: remarkable events (with a special focus on B...
Electricity retailing in Europe: remarkable events (with a special focus on B...Electricity retailing in Europe: remarkable events (with a special focus on B...
Electricity retailing in Europe: remarkable events (with a special focus on B...
 
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNST
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNSTProjet de décret « GRD »: quelques remarques du Prof. Damien ERNST
Projet de décret « GRD »: quelques remarques du Prof. Damien ERNST
 
Belgian offshore wind potential
Belgian offshore wind potentialBelgian offshore wind potential
Belgian offshore wind potential
 
Time to make a choice between a fully liberal or fully regulated model for th...
Time to make a choice between a fully liberal or fully regulated model for th...Time to make a choice between a fully liberal or fully regulated model for th...
Time to make a choice between a fully liberal or fully regulated model for th...
 
Electrification and the Democratic Republic of the Congo
Electrification and the Democratic Republic of the CongoElectrification and the Democratic Republic of the Congo
Electrification and the Democratic Republic of the Congo
 
Energy: the clash of nations
Energy: the clash of nationsEnergy: the clash of nations
Energy: the clash of nations
 
Smart Grids Versus Microgrids
Smart Grids Versus MicrogridsSmart Grids Versus Microgrids
Smart Grids Versus Microgrids
 
Uber-like Models for the Electrical Industry
Uber-like Models for the Electrical IndustryUber-like Models for the Electrical Industry
Uber-like Models for the Electrical Industry
 

Recently uploaded

Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm Systemirfanmechengr
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptJasonTagapanGulla
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 

Recently uploaded (20)

Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.ppt
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 

Computing near-optimal policies from trajectories by solving a sequence of standard supervised learning problems

  • 1. Computing near-optimal policies from trajectories by solving a sequence of standard supervised learning problems
Damien Ernst
Hybrid Systems Control Group, Supélec
TU Delft, November 16, 2006
Damien Ernst Computing .... (1/22)
  • 2. From trajectories to optimal policies: control community versus computer science community
The problem of inferring (near-)optimal policies from trajectories has been studied by both the control community and the computer science community.
The control community favors a two-stage approach: identify an analytical model, then derive an optimal policy from it.
The computer science community favors approaches that compute optimal policies directly from the trajectories, without relying on the identification of an analytical model.
Learning optimal policies from trajectories is known in the computer science community as reinforcement learning.
  • 3. Reinforcement learning
Classical approach: infer state-action value functions from the trajectories.
Discrete (and not too large) state and action spaces: state-action value functions can be represented in tabular form.
Continuous or large state and action spaces: function approximators must be used. Moreover, learning must be done from finite and generally very sparse sets of trajectories.
Parametric function approximators trained with gradient-descent-like methods: very popular techniques, but they do not generalize well enough to move from academic problems to real-world ones!
  • 4. Using supervised learning to solve the generalization problem in reinforcement learning
The difficulty is one of generalization over an information space. It also occurs in supervised learning (SL).
What is SL? To infer, from a sample of input-output pairs (input = information state; output = class label or real number), a model which explains these input-output pairs "at best".
Supervised learning is highly successful: state-of-the-art SL algorithms have been applied successfully to problems where the information state has thousands of components.
What we want: to exploit the generalization capabilities of supervised learning in reinforcement learning.
Our answer: an algorithm named fitted Q iteration which infers (near-)optimal policies from trajectories by solving a sequence of standard supervised learning problems.
  • 5. Learning from a sample of trajectories: the RL approach. Problem formulation (deterministic version)
Discrete-time dynamics: $x_{t+1} = f(x_t, u_t)$, $t = 0, 1, \ldots$, where $x_t \in X$ and $u_t \in U$.
Cost function: $c(x, u) : X \times U \rightarrow \mathbb{R}$, bounded by $B_c$. Instantaneous cost: $c_t = c(x_t, u_t)$.
Discounted infinite-horizon cost associated with a stationary policy $\mu : X \rightarrow U$: $J^{\mu}(x) = \lim_{N \rightarrow \infty} \sum_{t=0}^{N-1} \gamma^t c(x_t, \mu(x_t))$, where $\gamma \in [0, 1)$.
Optimal stationary policy $\mu^*$: the policy that minimizes $J^{\mu}$ for all $x$.
Objective: find an optimal policy $\mu^*$.
We do not know: the discrete-time dynamics and the cost function.
We know instead: a set of trajectories $\{(x_0, u_0, c_0, x_1, \ldots, u_{T-1}, c_{T-1}, x_T)^i\}_{i=1}^{nbTraj}$.
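The data setting above can be sketched in code: a toy one-dimensional system whose trajectories are recorded and then flattened into one-step transitions. The dynamics `f`, cost `c`, action set and all numerical values below are illustrative assumptions, not the deck's benchmarks; the learner is assumed to see only the recorded tuples, never `f` and `c` themselves.

```python
import random

# Toy deterministic system (illustrative assumptions): the learner never
# sees f and c directly, only the recorded trajectories.
def f(x, u):
    return 0.9 * x + u            # assumed discrete-time dynamics

def c(x, u):
    return x * x + 0.1 * u * u    # assumed cost, bounded on the domain used

def generate_trajectories(nb_traj, T, seed=0):
    """Record nb_traj trajectories (x_0, u_0, c_0, x_1, ..., x_T)."""
    rng = random.Random(seed)
    trajectories = []
    for _ in range(nb_traj):
        x = rng.uniform(-1.0, 1.0)
        steps = []
        for _ in range(T):
            u = rng.choice([-0.5, 0.0, 0.5])   # actions drawn at random
            steps.append((x, u, c(x, u)))
            x = f(x, u)
        trajectories.append((steps, x))        # x is the terminal state x_T
    return trajectories

def to_transitions(trajectories):
    """Flatten trajectories into one-step system transitions
    (x_t, u_t, c_t, x_{t+1}): the set F used by fitted Q iteration."""
    F = []
    for steps, x_T in trajectories:
        for i, (x, u, cost) in enumerate(steps):
            x_next = steps[i + 1][0] if i + 1 < len(steps) else x_T
            F.append((x, u, cost, x_next))
    return F
```

Each trajectory of length $T$ yields exactly $T$ transitions, so the length of the learning set $F$ is $nbTraj \cdot T$.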
  • 6. Some dynamic programming results
The sequence of state-action value functions $Q_N : X \times U \rightarrow \mathbb{R}$,
$Q_N(x, u) = c(x, u) + \gamma \min_{u' \in U} Q_{N-1}(f(x, u), u'), \quad \forall N > 1,$
with $Q_1(x, u) \equiv c(x, u)$, converges to the Q-function, the unique solution of the Bellman equation:
$Q(x, u) = c(x, u) + \gamma \min_{u' \in U} Q(f(x, u), u').$
Necessary and sufficient optimality condition: $\mu^*(x) \in \arg\min_{u \in U} Q(x, u)$.
Suboptimal stationary policy $\mu^*_N$: $\mu^*_N(x) \in \arg\min_{u \in U} Q_N(x, u)$.
Bound on $\mu^*_N$: $\|J^{\mu^*_N} - J^{\mu^*}\|_\infty \leq \frac{2 \gamma^N B_c}{(1 - \gamma)^2}$.
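The bound above gives a practical way to choose the number of iterations: a minimal sketch (the numerical values used below are illustrative assumptions) that finds the smallest $N$ such that $2\gamma^N B_c/(1-\gamma)^2 \leq \varepsilon$.

```python
# Smallest horizon N guaranteeing ||J^{mu*_N} - J^{mu*}||_inf <= eps
# via the bound 2 * gamma**N * B_c / (1 - gamma)**2 from the slide.
def horizon_for_tolerance(gamma, B_c, eps):
    assert 0.0 <= gamma < 1.0 and B_c > 0.0 and eps > 0.0
    bound = 2.0 * B_c / (1.0 - gamma) ** 2   # value of the bound for N = 0
    N = 0
    while bound > eps:
        bound *= gamma                        # each iteration tightens by gamma
        N += 1
    return N
```

For instance, with $\gamma = 0.5$, $B_c = 1$ and $\varepsilon = 1$, three iterations already suffice, while a $\gamma$ close to 1 drives $N$ up sharply through both the $\gamma^N$ and the $(1-\gamma)^2$ terms.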
  • 7. Fitted Q iteration: the algorithm
The set of trajectories $\{(x_0, u_0, c_0, x_1, \ldots, c_{T-1}, u_{T-1}, x_T)^i\}_{i=1}^{nbTraj}$ is transformed into a set of system transitions $F = \{(x_t^l, u_t^l, c_t^l, x_{t+1}^l)\}_{l=1}^{\#F}$.
Fitted Q iteration computes from $F$ the functions $\hat{Q}_1, \hat{Q}_2, \ldots, \hat{Q}_N$, approximations of $Q_1, Q_2, \ldots, Q_N$.
The computation is done iteratively by solving a sequence of standard supervised learning (SL) problems. The training sample for the $k$th ($k \geq 1$) problem is
$\left\{ \left( (x_t^l, u_t^l),\; c_t^l + \gamma \min_{u \in U} \hat{Q}_{k-1}(x_{t+1}^l, u) \right) \right\}_{l=1}^{\#F}$
with $\hat{Q}_0(x, u) \equiv 0$. From the $k$th training sample, the supervised learning algorithm outputs $\hat{Q}_k$.
$\hat{\mu}^*_N(x) \in \arg\min_{u \in U} \hat{Q}_N(x, u)$ is taken as an approximation of $\mu^*(x)$.
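The iteration above can be sketched as follows. The deck pairs the algorithm with ensembles of regression trees; here a crude 1-nearest-neighbour regressor stands in (a deliberate simplification so the sketch stays dependency-free), and the toy grid world, action set and discount factor are illustrative assumptions.

```python
# Minimal fitted Q iteration sketch on a toy 1-D grid world.
U = [-0.5, 0.0, 0.5]      # finite action set (assumption)
GAMMA = 0.75              # discount factor (assumption)

def knn_fit(inputs, targets):
    """Return a 1-nearest-neighbour regressor over (x, u) input pairs;
    it stands in for the ensemble-of-trees SL method used in the deck."""
    def predict(x, u):
        best = min(range(len(inputs)),
                   key=lambda i: (inputs[i][0] - x) ** 2 + (inputs[i][1] - u) ** 2)
        return targets[best]
    return predict

def fitted_q_iteration(F, n_iterations):
    """F: one-step transitions (x, u, cost, x_next). Returns Q_hat_N."""
    inputs = [(x, u) for (x, u, _, _) in F]
    q_hat = lambda x, u: 0.0                  # Q_hat_0 = 0
    for _ in range(n_iterations):
        # k-th SL training set: inputs (x, u), targets
        # c + gamma * min_{u'} Q_hat_{k-1}(x_next, u')
        targets = [cost + GAMMA * min(q_hat(xn, up) for up in U)
                   for (_, _, cost, xn) in F]
        q_hat = knn_fit(inputs, targets)      # one standard SL problem
    return q_hat

def greedy_policy(q_hat):
    """mu_hat*_N(x) in argmin_u Q_hat_N(x, u)."""
    return lambda x: min(U, key=lambda u: q_hat(x, u))

def make_grid_transitions():
    """Toy system (assumption): x' = clamp(x + u), c(x, u) = x^2,
    with states sampled on a regular grid of [-1, 1]."""
    xs = [i * 0.25 for i in range(-4, 5)]
    clamp = lambda v: max(-1.0, min(1.0, v))
    return [(x, u, x * x, clamp(x + u)) for x in xs for u in U]
```

On this toy problem the greedy policy learned after a handful of iterations steers the state toward 0, where the cost vanishes.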
  • 8. Fitted Q iteration: some remarks
The performance of the algorithm depends on the supervised learning (SL) method chosen. Excellent performance has been observed when it is combined with SL methods based on ensembles of regression trees.
It also works for stochastic systems.
Consistency can be ensured under appropriate assumptions on the SL method, the sampling process, the system dynamics and the cost function.
  • 9. Illustration I: bicycle balancing and riding
[Figure: side and top views of the bicycle, showing the tilt angle $\omega$, steering angle $\theta$, the angle $\psi_{goal}$ between the frame of the bike and the direction of the goal, the back-wheel contact point $(x_b, y_b)$ and the centre of the goal (coordinates $(x_{goal}, y_{goal})$).]
Stochastic problem. Seven state variables and two control actions.
Time between $t$ and $t+1$: 0.01 s.
Cost function of the type: $c(x_t, u_t) = 1$ if the bicycle has fallen, and $\mathit{coeff} \cdot (|\psi_{goal,t+1}| - |\psi_{goal,t}|)$ otherwise.
  • 10. Illustration I: results
Trajectory generation: actions taken at random, bicycle initially far from the goal, and the trajectory ends when the bicycle falls.
[Figure: trajectories of the bicycle under the learned policy reaching the goal from initial angles $\psi_0 \in \{-\pi, -3\pi/4, -\pi/2, -\pi/4, 0, \pi/4, \pi/2, 3\pi/4, \pi\}$; success percentage as a function of the number of trajectories (100 to 900).]
For comparison, Q-learning with neural networks needed on the order of 100,000 trajectories to compute a policy that drives the bicycle to the goal!
  • 11. Illustration II: navigation from visual percepts
[Figure: the agent navigates a 100 × 100 position space $(p(0), p(1))$; the action "go up" gives $p_{t+1}(1) = \min(p_t(1) + 25, 100)$, and the cost takes the values $c(p, u) = 0$, $-1$ or $-2$ depending on the region the agent is in.]
The agent at position $p_t$ observes a 30 × 30-pixel image of its surroundings, with grey levels in $\{0, 1, \ldots, 255\}$. The state input $pixels(p_t)$ is the 30 * 30 = 900-element vector whose $(30i + j)$th element is the grey level of the pixel located at the $i$th line and the $j$th column of the observation image.
  • 12. [Figure: policies $\hat{\mu}^*_{10}$ learned from (a) 500, (b) 2000 and (c) 8000 system transitions, drawn over the position space $(p(0), p(1)) \in [0, 100]^2$; (d) score $J^{\hat{\mu}^*_{10}}$ versus the number of system transitions $\#F$ (1000 to 9000), for $p$ and for $pixels(p)$ used as state input, compared with the optimal score $J^{\mu^*}$.]
  • 13. Illustration III: computation of Structured Treatment Interruption (STI) strategies for HIV+ patients
STI for HIV: cycling the patient on and off drug therapy.
In some remarkable cases, STI strategies have enabled patients to maintain immune control over the virus in the absence of treatment.
STIs offer patients periods of relief from treatment.
We want to compute optimal STI strategies.
  • 14. Illustration III: what can reinforcement learning techniques offer?
They have the potential to infer good STI strategies from clinical data, without modeling the HIV infection dynamics.
Clinical data: the time evolution of the patient's state (CD4+ T cell count, systemic costs of the drugs, etc.) recorded at discrete time instants, together with the sequence of drugs administered.
Clinical data can be seen as trajectories of the immune system responding to treatment.
  • 15. [Figure: determination of optimal STI strategies from clinical data by using reinforcement learning algorithms: the overall principle. A pool of HIV-infected patients follow some (possibly suboptimal) STI protocols and are monitored at regular intervals. The monitoring of each patient generates a trajectory, which typically contains the following information: state of the patient at time $t_0$; drugs taken between $t_0$ and $t_1 = t_0 + n$ days; state at time $t_1$; drugs taken between $t_1$ and $t_2 = t_1 + n$ days; state at time $t_2$; and so on. Processing the trajectories with reinforcement learning techniques yields some (near-)optimal STI strategies, often in the form of a mapping between the state of a patient at a given time and the drugs he has to take until the next time his state is monitored.]
  • 16. Illustration III: state of the research
Promising results have been obtained by using fitted Q iteration to analyze some artificially generated clinical data.
Next step: analyze real-life clinical data generated by the SMART study (Smart Management of Anti-Retroviral Therapies).
Figure: taken from http://www.cpcra.org/docs/pubs/2006/croi2006-smart.pdf
  • 17. Conclusions
The fitted Q iteration algorithm (combined with ensembles of regression trees) has been evaluated on several problems and consistently performed much better than other reinforcement learning algorithms.
Is its performance sufficient to lead to many successful real-life applications? I think YES!
Why has this algorithm not been proposed before?
  • 18. Fitted Q iteration: future research
The computational burden grows with the number of trajectories, which is problematic for on-line applications. A possible solution: keep only the most informative trajectories.
What are really the best supervised learning algorithms to use in the inner loop of the fitted Q iteration process? Several criteria need to be considered: the distribution of the data, the computational burden, the numerical stability of the algorithm, etc.
Customization of SL methods (e.g., split criteria in trees other than variance reduction, etc.).
  • 19. Supervised learning in dynamic programming: a general view
Dynamic programming: resolution of optimal control problems by iteratively extending the optimization horizon. Two main algorithms: value iteration and policy iteration.
Fitted Q iteration is based on the value iteration algorithm. It can be extended to the case where the dynamics and cost function are known, becoming an Approximate Value Iteration (AVI) algorithm.
Approximate Policy Iteration (API) algorithms based on SL have recently been proposed. They can also be adapted to the case where only trajectories are available.
For both SL-based AVI and API, the problem of generating the right trajectories is extremely important.
How can all these works be put into a unified framework?
  • 20. Beyond dynamic programming...
Standard Model Predictive Control (MPC) formulation: solve, in a receding-horizon manner, a sequence of open-loop deterministic optimal control problems by relying on classical optimization algorithms (sequential quadratic programming, interior point methods, etc.).
Could SL-based dynamic programming provide good optimizers for MPC? It has at least the advantage of not being intrinsically suboptimal when the system is stochastic! But the computation of $\max_{u \in U} \mathit{model}(x, u)$ becomes problematic when dealing with a large $U$.
How can MPC-like formulations be extended to stochastic systems? Solutions have been proposed for problems with discrete disturbance spaces, but the dimension of the search space for the optimization algorithm is $O(\text{number of disturbances}^{\text{optimization horizon}})$.
Research direction: selection of a subset of relevant disturbances.
  • 21. Planning under uncertainties: an example
Development of MPC-like algorithms for scheduling production under uncertainties through the selection of "interesting disturbance scenarios".
[Figure: a typical hydro power plant production scheduling problem, with clouds as the disturbance $w$ driving the hydraulic production, a thermal production $u$ scheduled at a cost, and the customers' demand to be met.]
  • 22. Additional readings
"Tree-based batch mode reinforcement learning". D. Ernst, P. Geurts and L. Wehenkel. Journal of Machine Learning Research, Volume 6, pages 503-556, April 2005.
"Selecting concise sets of samples for a reinforcement learning agent". D. Ernst. In Proceedings of CIRAS 2005, 14-16 December 2005, Singapore. (6 pages)
"Clinical data based optimal STI strategies for HIV: a reinforcement learning approach". D. Ernst, G.B. Stan, J. Goncalves and L. Wehenkel. In Proceedings of Benelearn 2006, 11-12 May 2006, Ghent, Belgium. (8 pages)
"Reinforcement learning with raw image pixels as state input". D. Ernst, R. Marée and L. Wehenkel. International Workshop on Intelligent Computing in Pattern Analysis/Synthesis (IWICPAS), LNCS Volume 4153, pages 446-454, August 2006.
"Reinforcement learning versus model predictive control: a comparison". D. Ernst, M. Glavic, F. Capitanescu and L. Wehenkel. Submitted.
"Planning under uncertainties by solving non-linear optimization problems: a complexity analysis". D. Ernst, B. Defourny and L. Wehenkel. In preparation.