MACHINE LEARNING APPLICATIONS IN COMPLEX
CONTROL SYSTEMS
AE 8900: Special Problems
Final Report
Presented to
The Academic Faculty
by
Alexander L. Von Moll
In Partial Fulfillment
of the Requirements for the Degree
Master of Science in the
School of Aerospace Engineering
Georgia Institute of Technology
May 2016
Approved by:
Professor Evangelos Theodorou, Advisor
School of Aerospace Engineering
Georgia Institute of Technology
Date Approved: 18 December 2015
ACKNOWLEDGEMENTS
I want to thank Professor Evangelos Theodorou for advising me and guiding my
efforts. Additionally, I would like to thank Yunpeng Pan for guidance and for many
discussions on the details of these algorithms. Lastly, thanks to Manan Gandhi for
providing dynamics models for the cartpole, double cartpole, and quadrotor systems.
LIST OF TABLES
1 Quadrotor parameters and values
LIST OF FIGURES
1 A schematic of the cartpole system with all relevant parameters indicated
2 State trajectories for the cart-pole system over 2 seconds showing the goal state being met
3 Control signal for the cart-pole system over 2 seconds corresponding to the positive force applied to the cart
4 Cart-pole actual cost of the trajectory after each iteration
5 Schematic of the double cart-pole problem with all relevant features specified
6 State trajectories for the double cart-pole system over 2 seconds showing the goal state being met
7 Control signal for the double cart-pole system over 2 seconds corresponding to the positive force applied to the cart
8 Double cart-pole actual cost of the trajectory after each iteration
9 Schematic representation of the quadrotor dynamic model [1]
10 Quadrotor state trajectories from 0 to 3 seconds showing the vehicle meeting the goal state
11 Quadrotor control effort for each rotor
12 Quadrotor actual cost of the trajectory after each iteration
13 Schematic representation of a Gaussian Process Regression Network [12]
14 Comparison of ESS samples to normrnd on a standard Gaussian
15 Comparison of ESS samples to direct method for GP regression
16 Log-likelihood values of a chain of ESS samples for GP regression
17 Comparison of ESS samples to direct method for GP regression with burn in
CHAPTER I
INTRODUCTION
Control theory and machine learning are, in their own right, rich subjects with many
applications. Their evolution has been quite different: control theory developed as
a subdiscipline within electrical engineering and mathematics with roots going back
to the nineteenth century. Machine learning, on the other hand, has emerged much
more recently - though many of the foundational concepts are borrowed from older
disciplines such as linear algebra. Whereas control theory is centered around analyzing
and altering the properties of dynamical systems, machine learning is centered around
gaining useful insight from data. The intersection of these two research areas is
emerging rapidly.
Trajectory optimization is one particular topic that has been traditionally re-
searched under the umbrella of control theory, but can now benefit from concepts
from machine learning. Trajectory optimization is a process by which control inputs
for a dynamical system are chosen such that some measure of cost or performance
is minimized or maximized. Selection of control inputs affects how the state of the
dynamical system evolves, thereby determining its trajectory over time. For systems
that are described by a set of differential equations with a quadratic cost function,
several well known methods have been used to compute an optimal control law. These
methods include the linear quadratic regulator (LQR) and the linear quadratic Gaus-
sian (LQG) which extends the basic LQR method to incorporate uncertainty. For
LQG, the uncertainty shows up as additive white Gaussian noise in the differential
equations. Much of the optimal control research of the past several decades is based
on these approaches and has extended them for various purposes. One particular
extension, known as iterative linear quadratic Gaussian (iLQG) [11], is explored in
further detail in the following sections.
The key aspect of traditional trajectory optimization approaches is the require-
ment of a dynamical model specified by a set of differential equations. There are many
reasons why this might be a hindrance in some applications: a dynamical model of
this type may not be available, generating a dynamical model for a new system may
be difficult, or the dynamical model may be too inaccurate. In these cases it may
be necessary to create an alternative representation of the system dynamics. Several
algorithms have been proposed to solve the trajectory optimization problem using
alternative representations of the system dynamics; most notably, PILCO [2] and
probabilistic differential dynamic programming (PDDP) [9].
The research effort described in this report is focused on extending ideas from
iLQG and PDDP by using Gaussian Process Regression Networks (GPRN) [12] as
a means to represent unknown system dynamics. Chapter 2 contains a derivation
of iLQG along with demonstrations of the algorithm on several systems. Chapter 3
contains a description of PDDP, and a brief background on Gaussian processes (GPs)
which are used in PDDP to represent the unknown system dynamics. Chapter 4 con-
tains a description of GPRN, including a description and demonstration of elliptical
slice sampling (ESS) [7]. Finally, Chapter 5 summarizes the progress made in this
effort and outlines remaining work.
CHAPTER II
KNOWN DYNAMICS
2.1 Iterative Linear Quadratic Gaussian [11]
Iterative Linear Quadratic Gaussian (iLQG) is an algorithm used for computing lo-
cally optimal trajectories over a finite horizon for systems with known dynamics.
This section of the report describes its derivation and demonstrates results on several toy systems.
2.1.1 Derivation
2.1.1.1 Definitions
\[
dx = f(x, u)\,dt + F(x, u)\,d\omega \tag{2.1.1}
\]
\[
v(t, x) = E\!\left[ h(x(T)) + \int_t^T l(\tau, x(\tau), u(\tau))\,d\tau \right] \tag{2.1.2}
\]
Equations (2.1.1) and (2.1.2) specify the dynamics of the system and the expected
cost-to-go, respectively. The cost-to-go is defined as the total cost expected to accu-
mulate if the system is initialized in state x at time t and evolved until t = T. It
has already been implied by equation (2.1.2) that the expected cost-to-go is not a
function of u since we will assume going forward that the control law is a determin-
istic function of the state, x. The symbols that appear in these equations are defined
below:
f ≡ system dynamics
F ≡ noise process
ω ≡ Brownian motion noise, ω ∈ R^p
x ≡ state, x ∈ R^n
u ≡ control input, u ∈ R^m
v ≡ cost-to-go
l ≡ instantaneous cost rate
h ≡ penalty on the final state
The iLQG algorithm is based upon linearization of the dynamics and quadratization of the cost around a nominal trajectory. Thus we define a nominal state trajectory and set of control inputs as x̄ and ū, respectively, and small perturbations around these such that δx = x − x̄, δu = u − ū.
2.1.1.2 Linearization of the Dynamics
The first step in the derivation is to linearize the system dynamics around the nominal
trajectories. This is done using a first-order Taylor series expansion and is provided
in detail, below.
\[
\begin{aligned}
\frac{dx}{dt} &= f(x, u) + F(x, u)\,\frac{d\omega}{dt} \\
&= f(x - \bar{x} + \bar{x},\, u - \bar{u} + \bar{u}) + F(x - \bar{x} + \bar{x},\, u - \bar{u} + \bar{u})\,\frac{d\omega}{dt} \\
&= f(\delta x + \bar{x},\, \delta u + \bar{u}) + F(\delta x + \bar{x},\, \delta u + \bar{u})\,\frac{d\omega}{dt}
\end{aligned}
\]
Next, we apply the first-order Taylor series expansion around ¯x, ¯u.
\[
\frac{dx}{dt} = f(\bar{x}, \bar{u}) + F(\bar{x}, \bar{u})\,\frac{d\omega}{dt}
+ \nabla_x f(\bar{x}, \bar{u})\,\delta x + \nabla_u f(\bar{x}, \bar{u})\,\delta u
+ \nabla_u F(\bar{x}, \bar{u})\,\delta u\,\frac{d\omega}{dt}
\]
In this case, we are ignoring the variation of the noise w.r.t. the state (i.e. ∇ₓF(x̄, ū)) because we are only interested in multiplicative noise in the control input for this particular derivation. Substituting dx̄/dt = f(x̄, ū) and noting that dδx/dt = dx/dt − dx̄/dt gives:
\[
\frac{d\delta x}{dt} = \frac{dx}{dt} - \frac{d\bar{x}}{dt}
= \nabla_x f(\bar{x}, \bar{u})\,\delta x + \nabla_u f(\bar{x}, \bar{u})\,\delta u
+ F(\bar{x}, \bar{u})\,\frac{d\omega}{dt} + \nabla_u F(\bar{x}, \bar{u})\,\delta u\,\frac{d\omega}{dt}
\]
Now we can write the update rule in discrete time, setting dt = ∆t and using the
subscript k to denote the current time step.
\[
\delta x_{k+1} = \delta x_k + \Delta t \left[ \nabla_x f(\bar{x}_k, \bar{u}_k)\,\delta x_k + \nabla_u f(\bar{x}_k, \bar{u}_k)\,\delta u_k
+ F(\bar{x}_k, \bar{u}_k)\,\frac{d\omega}{dt} + \nabla_u F(\bar{x}_k, \bar{u}_k)\,\delta u_k\,\frac{d\omega}{dt} \right]
\]
For convenience, we make the following substitutions:
\[
A_k \equiv I + \Delta t\,\nabla_x f(\bar{x}, \bar{u}), \qquad
B_k \equiv \Delta t\,\nabla_u f(\bar{x}, \bar{u})
\]
\[
C_k(\delta u_k) \equiv \left[ c_{1,k} + C_{1,k}\,\delta u_k,\; \ldots,\; c_{p,k} + C_{p,k}\,\delta u_k \right]
\]
where $c_{i,k} = \sqrt{\Delta t}\,F^{[i]}(\bar{x}, \bar{u})$, $C_{i,k} = \sqrt{\Delta t}\,\nabla_u F^{[i]}(\bar{x}, \bar{u})$, and
\[
\xi_k \equiv \sqrt{\Delta t}\,\frac{d\omega}{dt} \sim N(0, I_p)
\]
\[
\Rightarrow\; \delta x_{k+1} = A_k\,\delta x_k + B_k\,\delta u_k + C_k(\delta u_k)\,\xi_k \tag{2.1.3}
\]
Equation (2.1.3) describes the evolution of the system linearized about the nominal
trajectory in discrete time. In the discrete case, K is used to denote the final time
step, where t = T.
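When analytic Jacobians of f are not available, the matrices in equation (2.1.3) can be built numerically. The sketch below (the function name, step size, and central-difference choice are illustrative assumptions, not taken from this report) forms A_k = I + ∆t ∇ₓf and B_k = ∆t ∇ᵤf by finite differences:

```python
import numpy as np

def linearize_discrete(f, x_bar, u_bar, dt, eps=1e-6):
    """Approximate A_k = I + dt * df/dx and B_k = dt * df/du around the
    nominal point (x_bar, u_bar) with central finite differences."""
    n, m = len(x_bar), len(u_bar)
    fx = np.zeros((n, n))
    fu = np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n)
        dx[i] = eps
        fx[:, i] = (f(x_bar + dx, u_bar) - f(x_bar - dx, u_bar)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m)
        du[j] = eps
        fu[:, j] = (f(x_bar, u_bar + du) - f(x_bar, u_bar - du)) / (2 * eps)
    return np.eye(n) + dt * fx, dt * fu
```

For a linear system f(x, u) = Mx + Nu this recovers A_k = I + ∆t·M and B_k = ∆t·N exactly, which makes a convenient sanity check.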
2.1.1.3 Linearization and Quadratization of the Cost
The instantaneous cost rate given by l may also be thought of as a state cost in the case where k < K (i.e. t < T). This will be useful for rewriting equation (2.1.2) in a discretized and recursive form. Linearization of the cost rate follows the same steps as shown for the dynamics; however, here a second-order Taylor series expansion is used instead. Note that terms involving cross derivatives (e.g. ∇ₓ∇ᵤl) are ignored.
\[
\begin{aligned}
l(t, x(t), u) &= l(t,\, x - \bar{x} + \bar{x},\, u - \bar{u} + \bar{u}) = l(t,\, \delta x + \bar{x},\, \delta u + \bar{u}) \\
&= l(t, \bar{x}, \bar{u}) + \nabla_x l(t, \bar{x}, \bar{u})\,\delta x + \tfrac{1}{2}\,\delta x^\top \nabla_{xx} l(t, \bar{x}, \bar{u})\,\delta x \\
&\quad + \nabla_u l(t, \bar{x}, \bar{u})\,\delta u + \tfrac{1}{2}\,\delta u^\top \nabla_{uu} l(t, \bar{x}, \bar{u})\,\delta u
\end{aligned}
\]
At k = K, t = T, the state cost is simply the penalty on the final state which
is given explicitly by h. In this case, the linearized cost function is given by the
following:
\[
\begin{aligned}
h(x(T)) &= h(x(T) - \bar{x}(T) + \bar{x}(T)) = h(\delta x(T) + \bar{x}(T)) \\
&= h(\bar{x}(T)) + \nabla_x h(\bar{x}(T))\,\delta x + \tfrac{1}{2}\,\delta x^\top \nabla_{xx} h(\bar{x}(T))\,\delta x
\end{aligned}
\]
For convenience, the following substitutions are made:
\[
q_k = l_k, \quad \mathbf{q}_k = \nabla_x l_k, \quad Q_k = \nabla_{xx} l_k, \quad r_k = \nabla_u l_k, \quad R_k = \nabla_{uu} l_k \quad \text{for } k < K
\]
\[
q_K = h, \quad \mathbf{q}_K = \nabla_x h, \quad Q_K = \nabla_{xx} h
\]
Note that r and R are undefined at the final time step. With the above substitutions, the state cost for k < K becomes:
\[
\text{cost}_k = q_k + \delta x_k^\top \mathbf{q}_k + \tfrac{1}{2}\,\delta x_k^\top Q_k\,\delta x_k
+ \delta u_k^\top r_k + \tfrac{1}{2}\,\delta u_k^\top R_k\,\delta u_k \tag{2.1.4}
\]
2.1.1.4 Backwards Pass - Computation of the Control Law
It is now useful to introduce the discrete time, recursive form of equation (2.1.2)
which relies on the current state cost described by equation (2.1.4).
\[
v_k(\delta x_k) = \text{cost}_k + E\left[ v_{k+1}(\delta x_{k+1}) \right] \tag{2.1.5}
\]
Since it has been assumed that the control law, δu, is linear in δx we now prove
via induction backwards in time that equation (2.1.5) also has the following form:
\[
v_k(\delta x_k) = s_k + \delta x_k^\top \mathbf{s}_k + \tfrac{1}{2}\,\delta x_k^\top S_k\,\delta x_k \tag{2.1.6}
\]
First, equation (2.1.4) is substituted into equation (2.1.5). Then the system is
advanced one time step using equation (2.1.3) for the next state’s cost-to-go.
\[
\begin{aligned}
v_k(\delta x_k) &= q_k + \delta x_k^\top \mathbf{q}_k + \tfrac{1}{2}\,\delta x_k^\top Q_k\,\delta x_k
+ \delta u_k^\top r_k + \tfrac{1}{2}\,\delta u_k^\top R_k\,\delta u_k \\
&\quad + E\left[ v_{k+1}\!\left( A_k\,\delta x_k + B_k\,\delta u_k + C_k(\delta u_k)\,\xi_k \right) \right]
\end{aligned}
\]
Expanding the expectation term gives:
\[
\begin{aligned}
&s_{k+1} + (A_k\,\delta x_k + B_k\,\delta u_k)^\top \mathbf{s}_{k+1}
+ \tfrac{1}{2}\,(A_k\,\delta x_k + B_k\,\delta u_k)^\top S_{k+1}\,(A_k\,\delta x_k + B_k\,\delta u_k) \\
&\quad + \tfrac{1}{2}\operatorname{tr}\!\left( \sum_{i=1}^{p} (c_{i,k} + C_{i,k}\,\delta u_k)(c_{i,k} + C_{i,k}\,\delta u_k)^\top S_{k+1} \right)
\end{aligned}
\]
For the tr(·) term, we make use of the fact that tr(V U) = tr(U V). Thus, expanding the tr term gives:
\[
\tfrac{1}{2}\operatorname{tr}(\cdot) =
\tfrac{1}{2}\,\delta u_k^\top \left( \sum_{i=1}^{p} C_{i,k}^\top S_{k+1} C_{i,k} \right) \delta u_k
+ \delta u_k^\top \sum_{i=1}^{p} C_{i,k}^\top S_{k+1}\, c_{i,k}
+ \tfrac{1}{2} \sum_{i=1}^{p} c_{i,k}^\top S_{k+1}\, c_{i,k}
\]
\[
\begin{aligned}
\Rightarrow\; v_k(\delta x_k) &= q_k + s_{k+1} + \tfrac{1}{2}\sum_{i=1}^{p} c_{i,k}^\top S_{k+1}\, c_{i,k} \\
&\quad + \delta x_k^\top \left( \mathbf{q}_k + A_k^\top \mathbf{s}_{k+1} \right)
+ \tfrac{1}{2}\,\delta x_k^\top \left( Q_k + A_k^\top S_{k+1} A_k \right) \delta x_k \\
&\quad + \delta u_k^\top \left( r_k + B_k^\top \mathbf{s}_{k+1} + \sum_{i=1}^{p} C_{i,k}^\top S_{k+1}\, c_{i,k} \right) \\
&\quad + \tfrac{1}{2}\,\delta u_k^\top \left( R_k + B_k^\top S_{k+1} B_k + \sum_{i=1}^{p} C_{i,k}^\top S_{k+1} C_{i,k} \right) \delta u_k
+ \delta u_k^\top B_k^\top S_{k+1} A_k\,\delta x_k
\end{aligned} \tag{2.1.7}
\]
Now we define a linear control law δu_k = l_k + L_k δx_k along with the following shortcuts:
\[
g \equiv r_k + B_k^\top \mathbf{s}_{k+1} + \sum_{i=1}^{p} C_{i,k}^\top S_{k+1}\, c_{i,k}, \qquad
G \equiv B_k^\top S_{k+1} A_k, \qquad
H \equiv R_k + B_k^\top S_{k+1} B_k + \sum_{i=1}^{p} C_{i,k}^\top S_{k+1} C_{i,k}
\]
Substituting these expressions back into equation (2.1.7), combining like terms,
and matching with equation (2.1.6) gives the following backwards-recursive update
formulas:
\[
\Rightarrow\; S_k = Q_k + A_k^\top S_{k+1} A_k + L_k^\top H L_k + L_k^\top G + G^\top L_k
\]
\[
\mathbf{s}_k = \mathbf{q}_k + A_k^\top \mathbf{s}_{k+1} + L_k^\top g + L_k^\top H l_k + G^\top l_k \tag{2.1.8}
\]
\[
s_k = q_k + s_{k+1} + \tfrac{1}{2}\sum_{i=1}^{p} c_{i,k}^\top S_{k+1}\, c_{i,k} + l_k^\top g + \tfrac{1}{2}\, l_k^\top H l_k \tag{2.1.9}
\]
The cost-to-go function is initialized with S_K = Q_K, 𝐬_K = 𝐪_K, and s_K = q_K.
Finally, the cost-to-go function can be written as:
\[
\begin{aligned}
v_k(\delta x_k) &= q_k + s_{k+1} + \tfrac{1}{2}\sum_{i=1}^{p} c_{i,k}^\top S_{k+1}\, c_{i,k}
+ \delta x_k^\top \left( \mathbf{q}_k + A_k^\top \mathbf{s}_{k+1} \right)
+ \tfrac{1}{2}\,\delta x_k^\top \left( Q_k + A_k^\top S_{k+1} A_k \right) \delta x_k \\
&\quad + \delta u_k^\top \left( g + G\,\delta x_k \right) + \tfrac{1}{2}\,\delta u_k^\top H\,\delta u_k
\end{aligned}
\]
The above expression can now be minimized w.r.t. the control signal δu_k.
\[
\frac{\partial v_k(\delta x_k)}{\partial\, \delta u_k} = g + G\,\delta x_k + H\,\delta u_k \equiv 0
\;\Rightarrow\; \delta u_k = -H^{-1}(g + G\,\delta x_k)
\]
When H has some negative eigenvalues, the above expression could potentially make v_k arbitrarily negative. However, since v_k is the cost-to-go function it must be strictly non-negative. The solution, in this case, involves augmenting H by forcing its eigenvalues to meet a minimum threshold. This is accomplished by diagonalization, H = V D V⊤, and subsequently replacing eigenvalues in D that are less than some λ_min with λ_min to form D̃. Thus the augmented matrix becomes H̃ = V D̃ V⊤. Now, the control law can be specified:
\[
\Rightarrow\; l_k = -V \tilde{D}^{-1} V^\top g, \qquad L_k = -V \tilde{D}^{-1} V^\top G
\]
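The eigenvalue-thresholding step just described can be sketched as follows (the function name and the default λ_min are assumptions of this sketch):

```python
import numpy as np

def regularize_H(H, lam_min=1e-6):
    """Clamp the eigenvalues of the symmetric matrix H to at least lam_min,
    returning the augmented matrix H~ = V D~ V^T and its inverse, which is
    what the control-law expressions -V D~^{-1} V^T g and -V D~^{-1} V^T G use."""
    w, V = np.linalg.eigh(H)        # H = V diag(w) V^T
    w = np.maximum(w, lam_min)      # replace eigenvalues below lam_min
    H_aug = V @ np.diag(w) @ V.T
    H_aug_inv = V @ np.diag(1.0 / w) @ V.T
    return H_aug, H_aug_inv
```

The clamped inverse keeps the quadratic model bounded below, so the minimization over δu_k remains well posed even when the local Hessian is indefinite.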
2.1.1.5 Forward Pass - Computation of the New Control Signal, Trajectory, and
Cost
At this point, we perform a forward pass through the linearized dynamics to compute
the new open-loop control signal. By definition δx1 = 0.
\[
\delta u_k = \alpha\, l_k + L_k\,\delta x_k, \qquad
\delta x_{k+1} = A_k\,\delta x_k + B_k\,\delta u_k
\]
Here, α is introduced as a backtracking linesearch parameter which is initialized
at α = 1 and reduced by some factor η < 1 if the cost of the new trajectory does not
give an improvement. Then we set u = ¯u + δu and compute a new state trajectory
using equation (2.1.1) in an Euler integration scheme:
\[
x_{k+1} = x_k + \Delta t\, f(x_k, u_k)
\]
The expected cost is computed using equation (2.1.2) in a similar setup. If there is
an improvement on the cost then we set ¯x = x and ¯u = u and start the process over,
beginning with linearization of the dynamics around the new trajectory. If the cost
improvement is smaller than some threshold value, then the algorithm is finished.
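The forward pass and backtracking line search just described can be sketched as follows; all names are assumptions of this sketch, and the backward-pass outputs (the gain sequences l, L and the linearization A, B) are taken as given:

```python
import numpy as np

def forward_pass(x_bar, u_bar, A, B, l, L, f, traj_cost, dt, eta=0.5, max_iter=10):
    """One iLQG forward pass: roll the perturbations through the linearized
    dynamics to get the new control, then re-simulate the true dynamics with
    Euler integration, backtracking on alpha until the cost improves."""
    K = len(u_bar)
    alpha, best = 1.0, traj_cost(x_bar, u_bar)
    for _ in range(max_iter):
        # propagate perturbations through the linearized dynamics; dx_1 = 0
        dx = np.zeros_like(x_bar[0])
        du = []
        for k in range(K):
            du_k = alpha * l[k] + L[k] @ dx
            du.append(du_k)
            dx = A[k] @ dx + B[k] @ du_k
        # new control signal and Euler rollout of the true dynamics
        u_new = [u_bar[k] + du[k] for k in range(K)]
        x_new = [x_bar[0]]
        for k in range(K):
            x_new.append(x_new[k] + dt * f(x_new[k], u_new[k]))
        c = traj_cost(x_new, u_new)
        if c < best:
            return x_new, u_new, c
        alpha *= eta  # backtracking line search: shrink the step and retry
    return x_bar, u_bar, best
```

If no step size improves the cost, the nominal trajectory is returned unchanged, which is one simple way to detect convergence.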
2.1.2 Cart-Pole Application
Cart-pole refers to a benchmark control problem in which a weighted pendulum is
attached to a cart which can be forced to roll along a single dimension. The dynamics
of the system are given by the following equations [3]:
\[
\ddot{x} = \frac{1}{m_c + m_p \sin^2\theta} \left[ f + m_p \sin\theta \left( l\,\dot{\theta}^2 + g \cos\theta \right) \right] \tag{2.1.10}
\]
\[
\ddot{\theta} = \frac{1}{l\,(m_c + m_p \sin^2\theta)} \left[ -f \cos\theta - m_p\, l\, \dot{\theta}^2 \cos\theta \sin\theta - (m_c + m_p)\, g \sin\theta \right] \tag{2.1.11}
\]
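Equations (2.1.10) and (2.1.11) translate directly into a state-derivative function. In this sketch the state packing [x, ẋ, θ, θ̇], the function name, and the default parameter values (taken from the demonstration below) are assumptions:

```python
import numpy as np

def cartpole_dynamics(state, f, mc=1.0, mp=0.01, l=0.25, g=9.81):
    """State derivative for the cart-pole equations (2.1.10)-(2.1.11).
    state = [x, x_dot, theta, theta_dot]; f is the force on the cart."""
    _, x_dot, th, th_dot = state
    s, c = np.sin(th), np.cos(th)
    denom = mc + mp * s ** 2
    x_ddot = (f + mp * s * (l * th_dot ** 2 + g * c)) / denom
    th_ddot = (-f * c - mp * l * th_dot ** 2 * c * s
               - (mc + mp) * g * s) / (l * denom)
    return np.array([x_dot, x_ddot, th_dot, th_ddot])
```

With θ = 0 (pendulum hanging down) and no force, the derivative is zero, confirming that the downward configuration is an equilibrium of these equations.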
Figure 1 shows a schematic of the cart-pole system and identifies each of the
parameters in the equations above. For this demonstration, mc = 1 kg, mp =
0.01 kg, l = 0.25 m, and g = 9.81 m/s². The initial configuration of the cart-pole
is specified as x = ˙x = θ = ˙θ = 0, meaning the cart is at rest with the pendulum
Figure 1: A schematic of the cartpole system with all relevant parameters indicated
hanging downwards. The goal state for this demonstration is x = ˙x = ˙θ = 0, θ = π,
corresponding to the cart returning to its initial x position after swinging the pendu-
lum upright. A time horizon of 2 seconds is chosen with a time step of 0.02 seconds.
Lastly, the control input f is constrained to be |f| ≤ 5. The performance index used
for this demonstration is given by:
\[
J_0 = \begin{bmatrix} x(T) \\ \theta(T) - \pi \\ \dot{x}(T) \\ \dot{\theta}(T) \end{bmatrix}^{\!\top}
\begin{bmatrix} 5 & 0 & 0 & 0 \\ 0 & 1000 & 0 & 0 \\ 0 & 0 & 50 & 0 \\ 0 & 0 & 0 & 5 \end{bmatrix}
\begin{bmatrix} x(T) \\ \theta(T) - \pi \\ \dot{x}(T) \\ \dot{\theta}(T) \end{bmatrix}
+ \int_0^T 10^{-4}\, u(t)^2\, dt
\]
Figures 2, 3, and 4 show the optimized state trajectories, control signal, and
cost evolution of the cart-pole system, respectively. In this case, the goal state was
achieved even though there was some saturation of the control signal.
2.1.3 Double Cart-Pole Application
The double cart-pole problem is similar to the cart-pole problem described in the
previous section. In this case, however, there is a second pendulum attached to the
end of the first pendulum. Thus there is one additional state for the second link
position, θ2 (relative to the first link) and an additional state for the second link
angular velocity, ˙θ2. The dynamics are given by the following equations [4]:
Figure 2: State trajectories for the cart-pole system over 2 seconds showing the goal state being met
\[
A \begin{bmatrix} \ddot{x} \\ \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix} = b \tag{2.1.12}
\]
where
Figure 3: Control signal for the cart-pole system over 2 seconds corresponding to the positive force applied to the cart
Figure 4: Cart-pole actual cost of the trajectory after each iteration
\[
A = \begin{bmatrix}
2(m_c + m_{p1} + m_{p2}) & -(m_{p1} + 2m_{p2})\,l_1 \cos\theta_1 & -m_{p2}\,l_2 \cos\theta_2 \\
-(3m_{p1} + 6m_{p2})\cos\theta_1 & (2m_{p1} + 6m_{p2})\,l_1 & 3m_{p2}\,l_2 \cos(\theta_1 - \theta_2) \\
-3\cos\theta_2 & 3l_1 \cos(\theta_1 - \theta_2) & 2l_2
\end{bmatrix}
\]
\[
b = \begin{bmatrix}
2u - (m_{p1} + 2m_{p2})\,l_1\,\dot{\theta}_1^2 \sin\theta_1 - m_{p2}\,l_2\,\dot{\theta}_2^2 \sin\theta_2 \\
(3m_{p1} + 6m_{p2})\,g \sin\theta_1 - 3m_{p2}\,l_2\,\dot{\theta}_2^2 \sin(\theta_1 - \theta_2) \\
3l_1\,\dot{\theta}_1^2 \sin(\theta_1 - \theta_2) + 3g \sin\theta_2
\end{bmatrix}
\]
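Since equation (2.1.12) is linear in the accelerations, evaluating the dynamics reduces to assembling A and b and performing one linear solve. In this sketch the state packing [x, ẋ, θ₁, θ̇₁, θ₂, θ̇₂], the function name, and the default parameters (from the demonstration below) are assumptions:

```python
import numpy as np

def double_cartpole_accels(state, u, mc=10.0, mp1=3.0, mp2=1.0,
                           l1=1.4, l2=1.0, g=9.81):
    """Solve A [x_dd, th1_dd, th2_dd]^T = b (equation 2.1.12) for the
    accelerations of the double cart-pole, given force u on the cart."""
    _, _, th1, th1_dot, th2, th2_dot = state
    c1, s1 = np.cos(th1), np.sin(th1)
    c2, s2 = np.cos(th2), np.sin(th2)
    c12, s12 = np.cos(th1 - th2), np.sin(th1 - th2)
    A = np.array([
        [2 * (mc + mp1 + mp2), -(mp1 + 2 * mp2) * l1 * c1, -mp2 * l2 * c2],
        [-(3 * mp1 + 6 * mp2) * c1, (2 * mp1 + 6 * mp2) * l1, 3 * mp2 * l2 * c12],
        [-3 * c2, 3 * l1 * c12, 2 * l2],
    ])
    b = np.array([
        2 * u - (mp1 + 2 * mp2) * l1 * th1_dot ** 2 * s1 - mp2 * l2 * th2_dot ** 2 * s2,
        (3 * mp1 + 6 * mp2) * g * s1 - 3 * mp2 * l2 * th2_dot ** 2 * s12,
        3 * l1 * th1_dot ** 2 * s12 + 3 * g * s2,
    ])
    return np.linalg.solve(A, b)  # [x_ddot, th1_ddot, th2_ddot]
```

At the upright configuration (all angles and rates zero) with no input, b vanishes and the accelerations are zero, consistent with it being an (unstable) equilibrium.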
Figure 5 shows the double cart-pole problem schematically, specifying all of the
salient features. Note, the definition for θ1 in this problem has changed in that
Figure 5: Schematic of the double cart-pole problem with all relevant features specified
θ1 = 0 corresponds to the first link pointing upright. The objective of the double
cart-pole application will be the same as in the single cart-pole: to start in the state
x = ẋ = θ̇₁ = θ̇₂ = 0, θ₁ = θ₂ = π, corresponding to the cart at rest with the links hanging downwards, and to swing the links up to the state x = ẋ = θ̇₁ = θ̇₂ = θ₁ = θ₂ = 0. For this demonstration the parameters are chosen to be mc = 10 kg, mp1 = 3 kg, mp2 = 1 kg, l1 = 1.4 m, l2 = 1 m, and g = 9.81 m/s². The horizon and time
discretization are chosen as 2 s and 0.02 s, respectively. The performance index used
for the demonstration is given by:
\[
J_0 = x(T)^\top Q_F\, x(T) + \int_0^T 10^{-4}\, u(t)^2\, dt
\]
\[
x = \begin{bmatrix} x & \dot{x} & \theta_1 & \dot{\theta}_1 & \theta_2 & \dot{\theta}_2 \end{bmatrix}^\top, \qquad
Q_F = \operatorname{diag}\!\begin{bmatrix} 5 & 5 & 1000 & 50 & 1000 & 50 \end{bmatrix}
\]
Figures 6, 7, and 8 show the optimized state trajectories, control signal, and cost
evolution of the double cart-pole system, respectively. In this case, the goal state was
achieved within the allotted time horizon. The control force limitations were widened
for this application to ensure the goal state could be reached.
2.1.4 Quadrotor Application
In this section, iLQG is applied to a 6DOF quadrotor model. The state, in this case,
is a combination of linear positions, linear velocities, angular positions, and angular
velocities, while the control is simply the 4 thrust forces of the rotors:
\[
x = \begin{bmatrix} x & y & z & \dot{x} & \dot{y} & \dot{z} & \phi & \theta & \psi & \dot{\phi} & \dot{\theta} & \dot{\psi} \end{bmatrix}^\top, \qquad
u = \begin{bmatrix} F_1 & F_2 & F_3 & F_4 \end{bmatrix}^\top
\]
The objective for this demonstration is to start at position (−1, 1, 0.5) and finish
at position (0.5, −1, 1.5) after 3 seconds. All velocities and angles begin at zero
and should end at zero - thus the quadrotor begins at rest at the start position and
should reach the goal position and stop there. Figure 9 shows the quadrotor problem
schematically, identifying all of the angles and their directions [1].
For the dynamics model, it is useful to define the following quantities:
u1 = F1 + F2 + F3 + F4 (thrust force)
u2 = F4 − F2 (roll force)
u3 = F1 − F3 (pitch force)
u4 = 0.05(F2 + F4 − F1 − F3) (yaw moment)
Then the linear accelerations are given by [5]:
\[
\ddot{x} = \frac{1}{m}\left( \cos\phi \sin\theta \cos\psi + \sin\phi \sin\psi \right) u_1
\]
\[
\ddot{y} = \frac{1}{m}\left( \cos\phi \sin\theta \sin\psi - \sin\phi \cos\psi \right) u_1
\]
\[
\ddot{z} = \frac{1}{m}\cos\phi \cos\theta\; u_1
\]
Table 1: Quadrotor parameters and values

Symbol | Description | Value | Units
Jx | Moment of inertia about x-axis | 0.0081 | kg·m²
Jy | Moment of inertia about y-axis | 0.0081 | kg·m²
Jz | Moment of inertia about z-axis | 0.0142 | kg·m²
L | Distance of rotor to the center of mass | 0.24 | m
m | Quadrotor mass | 1 | kg
g | Gravitational acceleration | 9.81 | m/s²
And the angular accelerations are given by [5]:
\[
\ddot{\phi} = \frac{1}{J_x}\left( \dot{\theta}\dot{\psi}\,(J_y - J_z) + L u_2 \right)
\]
\[
\ddot{\theta} = \frac{1}{J_y}\left( \dot{\phi}\dot{\psi}\,(J_z - J_x) + L u_3 \right)
\]
\[
\ddot{\psi} = \frac{1}{J_z}\left( \dot{\phi}\dot{\theta}\,(J_x - J_y) + u_4 \right)
\]
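The acceleration equations above, with the Table 1 values as defaults, can be sketched as a single function. The state/control packing is an assumption of this sketch, and only the acceleration terms shown in the text are computed (the remaining bookkeeping of the full 6DOF model is omitted):

```python
import numpy as np

def quadrotor_accels(state, F, Jx=0.0081, Jy=0.0081, Jz=0.0142, L=0.24, m=1.0):
    """Linear and angular accelerations of the quadrotor model.
    state = [phi, theta, psi, phi_dot, theta_dot, psi_dot]; F = [F1..F4]."""
    phi, th, psi, phi_dot, th_dot, psi_dot = state
    F1, F2, F3, F4 = F
    u1 = F1 + F2 + F3 + F4              # thrust force
    u2 = F4 - F2                        # roll force
    u3 = F1 - F3                        # pitch force
    u4 = 0.05 * (F2 + F4 - F1 - F3)     # yaw moment
    x_dd = (np.cos(phi) * np.sin(th) * np.cos(psi) + np.sin(phi) * np.sin(psi)) * u1 / m
    y_dd = (np.cos(phi) * np.sin(th) * np.sin(psi) - np.sin(phi) * np.cos(psi)) * u1 / m
    z_dd = np.cos(phi) * np.cos(th) * u1 / m
    phi_dd = (th_dot * psi_dot * (Jy - Jz) + L * u2) / Jx
    th_dd = (phi_dot * psi_dot * (Jz - Jx) + L * u3) / Jy
    psi_dd = (phi_dot * th_dot * (Jx - Jy) + u4) / Jz
    return np.array([x_dd, y_dd, z_dd, phi_dd, th_dd, psi_dd])
```

With level attitude, zero rates, and equal rotor thrusts, all lateral and angular accelerations vanish and the vertical term reduces to u₁/m, as expected from the equations.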
Table 1 enumerates the parameters used in the quadrotor model along with the
values chosen for this demonstration. The performance index for this demonstration
is given by:
\[
J_0 = (x(T) - x_G)^\top Q_F\, (x(T) - x_G) + \int_0^T \left[ x(t)^\top Q\, x(t) + u(t)^\top u(t) \right] dt
\]
\[
Q_F = \operatorname{diag}\!\begin{bmatrix} 100 & 100 & 100 & 10 & 10 & 10 & 50 & 50 & 0 & 5 & 5 & 5 \end{bmatrix}
\]
\[
Q = \operatorname{diag}\!\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 10 & 10 & 10 & 0 & 0 & 0 \end{bmatrix}
\]
\[
x_G = \begin{bmatrix} 0.5 & -1 & 1.5 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^\top
\]
Figures 10-12 show the results of running iLQG on the quadrotor problem. Each
of the state goals is met within the allotted time horizon of 3 seconds.
Figure 6: State trajectories for the double cart-pole system over 2 seconds showing the goal state being met
Figure 7: Control signal for the double cart-pole system over 2 seconds corresponding to the positive force applied to the cart
Figure 8: Double cart-pole actual cost of the trajectory after each iteration
Figure 9: Schematic representation of the quadrotor dynamic model [1].
Figure 10: Quadrotor state trajectories from 0 to 3 seconds showing the vehicle meeting the goal state (panels: (a) position, (b) angle, (c) velocity, (d) angular rate)
Figure 11: Quadrotor control effort for each rotor
Figure 12: Quadrotor actual cost of the trajectory after each iteration
CHAPTER III
UNKNOWN DYNAMICS
In the last chapter, iLQG was introduced and derived as an algorithm useful for
situations when the system dynamics are known. In this chapter, PDDP will be
introduced as an algorithm useful for situations when the system dynamics are not
known. Although there are other algorithms designed to solve the same problems,
these two are discussed in this report to provide a foundation for the extensions
proposed in the following chapter.
3.1 Probabilistic Differential Dynamic Programming
Probabilistic Differential Dynamic Programming (PDDP) is an algorithm proposed by
Yunpeng Pan and Evangelos Theodorou as an extension to the well known differential
dynamic programming (DDP) algorithm [9]. DDP and iLQG are extremely similar in
their derivation; for this reason, PDDP is treated essentially as an extension of iLQG
in this context. The problem formulation of PDDP matches that of iLQG exactly.
3.1.1 Gaussian Processes
The innovation of PDDP lies in its use of a Gaussian Process to represent the system
dynamics instead of a collection of differential equations. This relieves the need to
have an explicit model of the system dynamics, as was the case in iLQG. Instead, the
GP-based representation relies on observations of the dynamics for given state and
control inputs. More precisely, the GP assumes a joint Gaussian distribution over a
sequence of state-control pairs, X̃ = {(x0, u0), ..., (xH, uH)}, and corresponding state
transitions dX = {dx0, ..., dxH} measured from the system itself [9]. Thus the state
transition is found through the process of GP regression, which can be interpreted as
regression over functions. In the case of PDDP, the function of interest is the system
dynamics (see Equation 2.1.1, for example), or rather, the state transition function.
A complete treatment of the general subject of Gaussian Processes can be found in
Gaussian Processes for Machine Learning by Rasmussen and Williams [10]. For
the remainder of the report, a ”sample from a GP” refers to the set of values of the
sampled function evaluated at the inputs.
GPs offer a nonparametric representation of the state transition function which
allows them to be extremely flexible in representing highly nonlinear systems. There
is freedom, however, in choosing the covariance function to specify the joint Gaus-
sian distribution over the observed data. The covariance function provides a way to
quantify the similarity of one input point (i.e. one state-control pair, in this case)
to another input point. In a conceptual sense, covariance functions are analogous
to kernel functions in general machine learning. The idea is that the more similar
two points are, as measured using the covariance function, the more correlated their
outputs will be. Thus in the regression setting, a query point’s output will be heavily
informed by similar points with known outputs.
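A minimal sketch of standard GP regression makes the role of the covariance function concrete; the squared-exponential kernel and the hyperparameter values here are illustrative choices for a 1-D toy case, not those used in PDDP:

```python
import numpy as np

def se_cov(X1, X2, ell=1.0, sf=1.0):
    """Squared-exponential covariance: the closer two inputs are,
    the more strongly correlated their outputs."""
    d = X1[:, None] - X2[None, :]
    return sf ** 2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_predict(X, y, Xq, ell=1.0, sf=1.0, sn=0.1):
    """GP regression: predictive mean and variance at query points Xq,
    conditioned on noisy observations (X, y)."""
    K = se_cov(X, X, ell, sf) + sn ** 2 * np.eye(len(X))
    Ks = se_cov(Xq, X, ell, sf)
    Kss = se_cov(Xq, Xq, ell, sf)
    mean = Ks @ np.linalg.solve(K, y)
    var = np.diag(Kss - Ks @ np.linalg.solve(K, Ks.T))
    return mean, var
```

Querying near a training point yields a prediction close to its observed output with small variance, while a query far from all data reverts to the prior (zero mean, variance sf²), which is exactly the "similar inputs inform the prediction" behavior described above.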
3.1.2 Implications of the GP Representation
The use of a GP representation for the system dynamics presents a couple of chal-
lenges. By its very nature, GP regression incorporates uncertainty; thus, in this
case, the state transition value that is returned is, itself, a random variable. In a
discrete-time setting, the next state is usually found as xk+1 = xk + dxk∆t. The un-
certainty in state transition, dxk induces uncertainty in the state, xk+1. Meanwhile,
the state is treated as an input into the GP regression. This input uncertainty causes
the predictive distribution for the state transition to be analytically intractable as it
is non-Gaussian [9]. However, PDDP approximates the predictive distribution by a
Gaussian, wherein the predictive mean and covariances are computed by employing
the moment matching approach [9].
The uncertainty in the state values also necessitates augmentation to the state
vector seen in regular DDP (and iLQG). The augmented state vector contains the
means of each state along with a flattened version of the covariance matrix. This
allows the uncertainty to be carried through every part of the algorithm, allowing the
algorithm to balance the exploration strategy between closeness to the goal state and
uncertainty (variance) of the state [9]. Incidentally, the Jacobian matrices described in the iLQG section (i.e. ∇ₓf(x̄, ū) and ∇ᵤf(x̄, ū) in Equation 2.1.3) can be computed analytically as a result of the definition of the augmented state, relieving the need for numerical differentiation.
CHAPTER IV
EXTENDING PDDP AND ILQG
PDDP extends DDP/iLQG by representing the system dynamics as a Gaussian Pro-
cess, which is data driven and uncertain, as opposed to a set of differential equations.
The noise model considered in PDDP is, however, quite simple as it only contains
standard Brownian motion noise (dω) in the dynamics [9]:
\[
dx = f(x, u)\,dt + C\,d\omega, \qquad x(t_0) = x_0, \qquad d\omega \sim N(0, \Sigma_\omega)
\]
The motivation for extending PDDP is thus to incorporate a more sophisticated noise model and a more adaptive regression framework.
4.1 Gaussian Process Regression Networks
The main extension proposed to PDDP is the use of a Gaussian Process Regression
Network (GPRN) representation of the system dynamics, in lieu of the regular GP
representation. GPRNs were originally proposed by Andrew Wilson, David Knowles,
and Zoubin Ghahramani [12]. GPRNs maintain desirable aspects of the GP rep-
resentation such as modeling the dynamics using data and allowing uncertainty to
be considered. Additionally, GPRNs offer input (predictor, or "state", in this case) dependent signal and noise correlations between multiple output variables (state transitions), and input-dependent length-scales and amplitudes. Inclusion of signal noise
allows the separation of noise effects of the physical process from measurement noise,
for example. The dependence of the noise and length scales on the input allow the
regression process to adapt to the input, making the framework more flexible. Fi-
nally, the ability to account for signal and noise correlations between multiple output
variables strengthens the predictive capability of the GPRN over the standard GP.
Imagine that we are trying to predict the angular acceleration of a vehicle about
a particular axis (i.e. a state transition variable) at a point that was not used to
train the GPRN. In standard GP regression, the prediction is only really informed by
the measurements of that angular acceleration used to train the model. Accounting
for correlations between multiple outputs enables us to use all of the measurements,
which enhances the prediction since angular acceleration may be correlated with lin-
ear velocity for a particular vehicle. This correlation between angular acceleration
and linear velocity may also depend on the state of the vehicle (e.g. altitude) - thus
GPRN’s input dependent correlations further enhance prediction. In general, this is
referred to as a multi-task method [12]. The case for using a GPRN to represent the
dynamics is clear as it satisfies the desire to incorporate a more sophisticated noise
model and a more adaptive regression framework.
Conceptually, a GPRN may be described simply as a mixture of GPs wherein the
mixture weights are also GPs. Figure 13, below, gives a schematic representation
of a GPRN [12]. It is important to note that the input, x, is in fact a vector, as it represents the state. The functions f̂i(x) are GPs, each with its own noise, and are referred to as nodes. The number of nodes is a choice for the designer of the GPRN and is usually determined by experimenting to see which number produces the best results. Each output (i.e. state transition) element, yj(x), is a linear combination of the outputs of each node. The weights Wij(x) applied to the node outputs are also input-dependent GPs.
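Combining sampled node and weight values at one input into an output follows the network structure just described: the output is the weighted combination of (noisy) node values. The argument names and noise placement below follow Wilson et al.'s formulation [12] but are assumptions of this sketch:

```python
import numpy as np

def gprn_output(W_x, f_x, sigma_f=0.0, sigma_y=0.0, rng=None):
    """GPRN output at one input x: y(x) = W(x) (f(x) + sigma_f * eps)
    + sigma_y * z, where W_x is the (outputs x nodes) weight matrix,
    f_x the vector of node values, and eps, z are standard normal."""
    if rng is None:
        rng = np.random.default_rng(0)
    noisy_nodes = f_x + sigma_f * rng.standard_normal(len(f_x))
    return W_x @ noisy_nodes + sigma_y * rng.standard_normal(W_x.shape[0])
```

With the noise scales set to zero this reduces to the plain linear combination of node outputs described above; in the full model the same combination is averaged over many posterior samples of W and f.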
Regression in the GPRN framework is quite different from a standard GP. In
standard GP regression, we predict the output given the input and data comprised
of input/output training pairs. In a GPRN, we must predict the values of the node
and weight GPs whose values are latent and confounded in the output values; thus
the standard regression must be altered.
Figure 13: Schematic representation of a Gaussian Process Regression Network [12]
4.1.1 Elliptical Slice Sampling
Wilson et al. [12] solve this problem using a technique known as elliptical slice sampling (ESS). Elliptical slice sampling [7], proposed by Iain Murray, Ryan Prescott Adams, and David J. C. MacKay, is designed to be a generic, user-friendly (it avoids parameter tuning) sampling technique that may be viewed as an extension of slice sampling [8]. ESS excels at sampling from posteriors with strongly correlated Gaussian priors, which is the case in GPRNs. For a GPRN, ESS can be used to sample from the posterior distribution of the node and weight GP functions evaluated at all of the training points. This distribution is conditioned on the input/output pairs and the hyperparameters of each GP. The ESS algorithm takes the current sample and a draw from the prior distribution and parameterizes the proposal as f′ = f cos θ + ν sin θ, where ν is the prior sample, f is the current sample, f′ is the new sample, and θ is the free parameter [7]. Viewed in this light, the current sample and the prior sample specify an ellipse centered at the origin. Slice sampling is then performed over θ: a value is selected uniformly from the region in which the likelihood of the new sample lies above a randomly chosen threshold.
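The update just described can be sketched as follows. This is a minimal illustrative Python version (the thesis relies on the MATLAB implementation provided by Murray [7]); the function names and the toy likelihood in the usage example are my own assumptions, and a zero-mean Gaussian prior is assumed throughout.

```python
import numpy as np

def elliptical_slice_step(f, log_lik, prior_sample, rng):
    """One ESS transition for f under a zero-mean Gaussian prior.
    f: current sample (1-D array); log_lik: maps a sample to its log-likelihood;
    prior_sample: function returning one draw nu from the prior."""
    nu = prior_sample()                          # defines the ellipse together with f
    log_u = log_lik(f) + np.log(rng.uniform())   # random likelihood threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)        # initial proposal angle
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)   # point on the ellipse
        if log_lik(f_new) > log_u:
            return f_new                          # accept: likelihood above threshold
        # Reject: shrink the angle bracket toward theta = 0 (the current sample)
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

# Toy usage: prior N(0, 1), likelihood from one observation y = 1.5 with unit noise,
# so the posterior mean is 1.5 / 2 = 0.75.
rng = np.random.default_rng(0)
log_lik = lambda f: -0.5 * (1.5 - f[0])**2
chain = [np.zeros(1)]
for _ in range(2000):
    chain.append(elliptical_slice_step(chain[-1], log_lik,
                                       lambda: rng.standard_normal(1), rng))
print(np.mean(chain[100:]))  # should be near 0.75
```

The shrinking bracket guarantees termination: as θ approaches zero the proposal approaches the current sample, whose likelihood exceeds the threshold by construction.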
Sampling the node and weight GPs at all of the training points allows us to draw samples from the predictive distribution of the node and weight values at the query point. The output is simply a function of these values. This process is repeated a number of times and the resulting output values are averaged to marginalize out the effect of the node and weight values.
4.1.1.1 Demonstration: Sampling from a Standard Gaussian
This section and the following demonstrate the usage of the ESS algorithm on two
toy problems: sampling from a standard Gaussian and sampling from a GP posterior.
They rely on the MATLAB implementation made available by Iain Murray as a
supplement [7].
The first demonstration compares samples drawn using a standard method for Gaussian distributions (in MATLAB, normrnd) with samples drawn using ESS. The two approaches are fundamentally different because ESS requires samples from a prior. In order to emulate the behavior of normrnd, a uniform distribution is chosen to keep the prior as uninformative as possible. Figure 14, below, shows normalized histograms of the samples from each method, overlaid with the true probability density function, for a variety of population sizes and choices of prior.
For this application, the histograms in Figure 14 indicate that the ESS samples
are sensitive to the range of the uniform prior. If the bounds on the prior are too far
spread apart, ESS tends to produce large outliers more frequently than normrnd. On
the other hand, reducing the range of the uniform prior tends to bias the ESS samples
thus skewing the distribution. Clearly, the U(−0.5, 0.5) prior would be insufficient
for sampling from N(0, 1).
Figure 14: Comparison of ESS samples to normrnd on a standard Gaussian
4.1.1.2 Demonstration: Sampling from a GP Posterior
ESS is more fitting for this application since, in the Gaussian process case, a prior is
specified. This demonstration roughly follows an example from Murray et al. [7] in
which ESS is used to generate samples from a GP posterior. For visualization pur-
poses, this demonstration is for a GP with only one input dimension and one output
dimension. Note that the predictive distribution of a GP can be computed analytically, so this provides a baseline against which to compare the ESS samples.
First, a synthetic dataset is generated with a sample size of N = 200, with input values {x_n}, n = 1, …, N, sampled from U(0, 1). Then the covariance matrix, Σ, is computed using the following definition:

Σ_ij = σ_f² exp(−(x_i − x_j)² / (2l²))    (4.1.1)
This particular covariance function is the commonly used squared-exponential covariance. Here, we set l = 1 and σ_f² = 1. The function values are then drawn from the GP prior: f ∼ N(0, Σ). In practice, the true function values are not known; thus, Gaussian noise with variance σ_n² = 0.3² is added to the function values to create the observations, y. ESS requires the log-likelihood function (or the probability density function of the distribution being sampled). Taking the posterior samples evaluated at all of the input values simplifies the process, since the log-likelihood function takes a simple form:

L(f) = ∏_n N(y_n; f_n, σ_n²)    (4.1.2)
The ability to sample from the GP prior and to evaluate the log-likelihood function is all that is needed to sample from the posterior using ESS.
Rasmussen’s book [10] gives the process for sampling from the GP posterior directly. The posterior has a joint Gaussian distribution whose mean and covariance are given by the following:

f̄_* ≜ E[f_* | y, Σ] = Σ(Σ + σ_n² I)⁻¹ y    (4.1.3)

cov(f_*) = Σ − Σ(Σ + σ_n² I)⁻¹ Σ    (4.1.4)
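A minimal sketch of this direct sampling procedure, in Python rather than the MATLAB used for the demonstrations, with the constants taken from the text above; the jitter terms are assumptions added for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data mirroring the text: N inputs from U(0, 1), squared-exponential kernel
N, l, sf2, sn2 = 200, 1.0, 1.0, 0.3**2
x = rng.uniform(0.0, 1.0, N)
Sigma = sf2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / l**2)
f_true = rng.multivariate_normal(np.zeros(N), Sigma + 1e-8 * np.eye(N))
y = f_true + np.sqrt(sn2) * rng.standard_normal(N)

# Posterior over f at the training inputs (equations 4.1.3 and 4.1.4)
A = Sigma + sn2 * np.eye(N)
f_bar = Sigma @ np.linalg.solve(A, y)
cov_f = Sigma - Sigma @ np.linalg.solve(A, Sigma)

# Direct posterior samples, against which ESS samples can be compared
samples = rng.multivariate_normal(f_bar, cov_f + 1e-8 * np.eye(N),
                                  size=100, check_valid='ignore')
print(samples.shape)  # (100, 200)
```

Note that the posterior variance on the diagonal of cov(f_*) never exceeds the prior variance σ_f², reflecting the information gained from the observations.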
Figure 15, below, contains the results of taking 100 samples from the GP posterior using both ESS and the direct method. Note that there is a vertical spread in the samples at each location because the function values at each location come from their own Gaussian distribution. Several ESS samples stand out as being very unlikely, lying much further from the true function values plotted on the figure.
Figure 15: Comparison of ESS samples to direct method for GP regression
This issue can be attributed to the fact that ESS produces a chain of samples. The chain must start somewhere, and, in this case, it begins with a sample from the prior distribution. A trace of the sample log-likelihood values in Figure 16 reveals that the likelihood starts out quite low before converging to a stable value.
Employing a burn-in strategy, similar to those used with many other sampling techniques, wherein a certain number of the initial samples are discarded, yields the results shown in Figure 17. In this case, the first 20 ESS samples were discarded. The remaining samples are indistinguishable from the direct samples, demonstrating the viability of this technique for GP regression.
Figure 16: Log-likelihood values of a chain of ESS samples for GP regression
Figure 17: Comparison of ESS samples to direct method for GP regression with burn in
4.2 Implications of GPRN on PDDP
Uncertainty in the inputs is incorporated into PDDP using a moment matching approach. Incorporating input uncertainty into the GPRN scheme being considered is much more complex, because the inputs and outputs in a GPRN are separated by a thick layer of latent variables: the node and weight GPs. In this initial work, it is advantageous, from the standpoint of conceptual complexity, to drop the uncertainty in the input. The most obvious way to do so is simply to take the mean of the distribution describing the input. Since the next state value is a function of the current state and the state transition, we need only take the mean of the predictive state transition distribution (specified by the GPRN). As this distribution is an average of a large number of intermediate Gaussian samples, incorporating different node and weight samples, its mean is simply the mean of the Gaussian means.
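A small numerical check of this reduction (illustrative only; the sample count, state dimension, and noise level are made up): the mean of an equally weighted mixture of Gaussians is just the average of the component means.

```python
import numpy as np

rng = np.random.default_rng(0)

S, p = 500, 4                              # S node/weight samples, p state dimensions
mus = rng.standard_normal((S, p))          # mean of each intermediate Gaussian
sigma = 0.1                                # shared (made-up) component noise level

# Predictive mean used when input uncertainty is dropped: the mean of the means
pred_mean = mus.mean(axis=0)

# Monte Carlo check: one draw from each component, then average
draws = mus + sigma * rng.standard_normal((S, p))
mc_mean = draws.mean(axis=0)
print(np.max(np.abs(mc_mean - pred_mean)) < 0.05)  # → True
```

The component noise averages away at rate σ/√S, so for large S the Monte Carlo average and the mean of the means coincide.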
In terms of computational complexity, GPRN generally scales linearly with the
dimension of the output and with the number of latent nodes. The computational
complexity is dominated by the number of data points considered, just as it is for
standard GP regression. It is likely that the GPRN representation will be more bur-
densome in practice due to the fact that it is sample-based, as opposed to analytical
as was the case in PDDP. The biggest difference between a GPRN representation
and a GP representation is how the predictive distribution mean and covariance are
calculated and how the model’s hyperparameters are learned.
4.3 Remaining Work
Much of the conceptual framework for incorporating a GPRN representation into
PDDP and iLQG has been developed and discussed throughout this report. Work remains, however, to complete the effort. The biggest missing piece at this point is a process for learning the hyperparameters of the GPRN. Each individual
GP, for each node and for each weight, comes with its own set of hyperparameters,
which may vary depending on the choice of covariance function. For a system with p
outputs and q nodes, there are q node GPs and pq weight GPs, giving q(p + 1) sets of hyperparameters to set. One of the traditional approaches to learning the parameters of a distribution is maximum likelihood (ML). ML cannot be used to learn the GPRN hyperparameters because the hyperparameters are associated with latent variables (i.e., variables which we cannot observe directly). Other categories of methods, including gradient-based and sampling-based (e.g., cross-entropy) methods, are being considered for suitability for this problem. Wilson et al. [12] utilize a message passing implementation of variational
Bayes (VB) called variational message passing (VMP). Their implementation is built on top of Microsoft Research’s Infer.NET framework, which is a general platform for probabilistic programming [6]. Much of the remaining work on this effort can be summarized as gaining a better understanding of VB and VMP and then producing an implementation in MATLAB.
CHAPTER V
CONCLUSION
The intersection between machine learning and control theory is emerging rapidly;
many rich applications are currently being developed. This report has explored one
particular area of interest: trajectory optimization. LQR and LQG were mentioned
as predecessors to iLQG, which was re-derived in this report and demonstrated on three systems of interest: cartpole, double cartpole, and quadrotor. The iLQG algorithm
is robust and widely applicable to a vast number of different systems. PDDP was
introduced as an important extension to iLQG which incorporates uncertainty into
the trajectory optimization process. The conceptual advances presented in this report
were centered around the use of a GPRN to represent the unknown system dynamics
in lieu of the standard GP representation employed by PDDP. GPRNs offer a more
sophisticated noise model and a more adaptive regression framework. More specifi-
cally, they incorporate input-dependent noise and signal correlation between multiple
outputs. This should enhance the predictive capability of the algorithm to represent
the unknown system dynamics. Work remains on fully incorporating GPRNs into a
PDDP-like algorithm: effort continues in the areas of understanding VB and VMP
for learning GPRN hyperparameters, and in creating a MATLAB implementation.
APPENDIX A
MATLAB CODE LISTINGS
The links below point to files embedded within the document.

iLQG.m: my implementation of the iLQG algorithm for the purposes of this research

demo cartpole.m: specification of dynamics, start, and goal states for the cartpole system

demo doublecartpole.m: specification of dynamics, start, and goal states for the double cartpole system

demo quadrotor.m: specification of dynamics, start, and goal states for the quadrotor

elliptical slice.m: implements elliptical slice sampling [7], provided by the authors at http://homepages.inf.ed.ac.uk/imurray2/pub/10ess/

elliptical slice demo.m: demonstrates elliptical slice.m on two examples: sampling from a standard Gaussian, and sampling from the posterior distribution for a simple Gaussian Process
REFERENCES
[1] Bolandi, H., Rezaei, M., Mohsenipour, R., Nemati, H., and
Smailzadeh, S., “Attitude control of a quadrotor with optimized pid con-
troller,” Intelligent Control and Automation, vol. 4, no. 3, pp. 335–342, 2013.
[2] Deisenroth, M. and Rasmussen, C. E., “Pilco: A model-based and data-
efficient approach to policy search,” in Proceedings of the 28th International
Conference on machine learning (ICML-11), pp. 465–472, 2011.
[3] Gandhi, M., “Cart-pole dynamics.” Private Communication, 2015. Georgia
Institute of Technology Student.
[4] Gandhi, M., “Double cart-pole dynamics.” Private Communication, 2015.
Georgia Institute of Technology Student.
[5] Gandhi, M., “Quadrotor dynamics.” Private Communication, 2015. Georgia
Institute of Technology Student.
[6] Minka, T., Winn, J., Guiver, J., Webster, S., Zaykov, Y., Yangel, B.,
Spengler, A., and Bronskill, J., “Infer.NET 2.6,” 2014. Microsoft Research
Cambridge. http://research.microsoft.com/infernet.
[7] Murray, I., Adams, R. P., and MacKay, D. J., “Elliptical slice sampling,”
arXiv preprint arXiv:1001.0175, 2009.
[8] Neal, R. M., “Slice sampling,” Annals of statistics, pp. 705–741, 2003.
[9] Pan, Y. and Theodorou, E., “Probabilistic differential dynamic program-
ming,” in Advances in Neural Information Processing Systems, pp. 1907–1915,
2014.
[10] Rasmussen, C. E. and Williams, C. K. I., Gaussian Processes for Machine Learning. MIT Press, 2006. http://www.gaussianprocess.org/gpml/chapters/.
[11] Todorov, E. and Li, W., “A generalized iterative LQG method for locally-
optimal feedback control of constrained nonlinear stochastic systems,” (Atlantis,
Bahamas), 43rd IEEE Conference on Decision and Control, December 2004.
[12] Wilson, A. G., Knowles, D. A., and Ghahramani, Z., “Gaussian process
regression networks,” arXiv preprint arXiv:1110.4411, 2011.
Machine Learning Applications in Complex Control Systems
Alexander L. Von Moll
36 Pages
Directed by Professor Evangelos Theodorou
The intersection between machine learning and control theory is emerging
rapidly; many rich applications are currently being developed. This report explores
one particular area of interest: trajectory optimization. The Linear Quadratic Reg-
ulator and Linear Quadratic Gaussian algorithms are introduced as predecessors to
the Iterative Linear Quadratic Gaussian (iLQG) algorithm, which is re-derived in
this report and demonstrated on three systems of interest: cartpole, double cartpole,
and quadrotor. iLQG is robust and widely applicable to a vast number of different
systems. Probabilistic Differential Dynamic Programming (PDDP) is introduced as
an important extension to iLQG which incorporates uncertainty into the trajectory
optimization process. The conceptual advances presented in this report were centered
around the use of a Gaussian Process Regression Network (GPRN) to represent the
unknown system dynamics in lieu of the standard Gaussian Process representation
employed by PDDP. GPRNs offer a more sophisticated noise model and a more adap-
tive regression framework. More specifically, they incorporate input-dependent noise
and signal correlation between multiple outputs. This should enhance the predic-
tive capability of the algorithm to represent the unknown system dynamics. Work
remains on fully incorporating GPRN into a PDDP-like algorithm: effort continues
in the areas of understanding variational Bayes and variational message passing for
learning GPRN hyperparameters, and in creating a MATLAB implementation.