MACHINE LEARNING APPLICATIONS IN COMPLEX
CONTROL SYSTEMS
AE 8900: Special Problems
Final Report
Presented to
The Academic Faculty
by
Alexander L. Von Moll
In Partial Fulfillment
of the Requirements for the Degree
Master of Science in the
School of Aerospace Engineering
Georgia Institute of Technology
May 2016
MACHINE LEARNING APPLICATIONS IN COMPLEX
CONTROL SYSTEMS
Approved by:
Professor Evangelos Theodorou, Advisor
School of Aerospace Engineering
Georgia Institute of Technology
Date Approved: 18 December 2015
To my wife,
Christina
ACKNOWLEDGEMENTS
I want to thank Professor Evangelos Theodorou for advising me and guiding my
efforts. Additionally, I would like to thank Yunpeng Pan for guidance and for many
discussions on the details of these algorithms. Lastly, thanks to Manan Gandhi for
providing dynamics models for the cartpole, double cartpole, and quadrotor systems.
TABLE OF CONTENTS
DEDICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
I INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
II KNOWN DYNAMICS . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Iterative Linear Quadratic Gaussian [11] . . . . . . . . . . . . . . . . 3
2.1.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Cart-Pole Application . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Double Cart-Pole Application . . . . . . . . . . . . . . . . . 11
2.1.4 Quadrotor Application . . . . . . . . . . . . . . . . . . . . . 15
III UNKNOWN DYNAMICS . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Probabilistic Differential Dynamic Programming . . . . . . . . . . . 21
3.1.1 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Implications of the GP Representation . . . . . . . . . . . . . 22
IV EXTENDING PDDP AND ILQG . . . . . . . . . . . . . . . . . . . 24
4.1 Gaussian Process Regression Networks . . . . . . . . . . . . . . . . . 24
4.1.1 Elliptical Slice Sampling . . . . . . . . . . . . . . . . . . . . 26
4.2 Implications of GPRN on PDDP . . . . . . . . . . . . . . . . . . . . 32
4.3 Remaining Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
V CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
APPENDIX A — MATLAB CODE LISTINGS . . . . . . . . . . . . 35
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
LIST OF TABLES
1 Quadrotor parameters and values . . . . . . . . . . . . . . . . . . . . 16
LIST OF FIGURES
1  A schematic of the cartpole system with all relevant parameters indicated . . 11
2  State trajectories for the cart-pole system over 2 seconds showing the goal state being met . . 12
3  Control signal for the cart-pole system over 2 seconds corresponding to the positive force applied to the cart . . 13
4  Cart-pole actual cost of the trajectory after each iteration . . 13
5  Schematic of the double cart-pole problem with all relevant features specified . . 14
6  State trajectories for the double cart-pole system over 2 seconds showing the goal state being met . . 17
7  Control signal for the double cart-pole system over 2 seconds corresponding to the positive force applied to the cart . . 18
8  Double cart-pole actual cost of the trajectory after each iteration . . 18
9  Schematic representation of the quadrotor dynamic model [1] . . 18
10 Quadrotor state trajectories from 0 to 3 seconds showing the vehicle meeting the goal state . . 19
11 Quadrotor control effort for each rotor . . 19
12 Quadrotor actual cost of the trajectory after each iteration . . 20
13 Schematic representation of a Gaussian Process Regression Network [12] . . 26
14 Comparison of ESS samples to normrnd on a standard Gaussian . . 28
15 Comparison of ESS samples to direct method for GP regression . . 30
16 Log-likelihood values of a chain of ESS samples for GP regression . . 31
17 Comparison of ESS samples to direct method for GP regression with burn-in . . 31
CHAPTER I
INTRODUCTION
Control theory and machine learning are, in their own right, rich subjects with many
applications. Their evolution has been quite different: control theory developed as
a subdiscipline within electrical engineering and mathematics with roots going back
to the nineteenth century. Machine learning, on the other hand, has emerged much
more recently, though many of the foundational concepts are borrowed from older
disciplines such as linear algebra. Whereas control theory is centered around analyzing
and altering the properties of dynamical systems, machine learning is centered around
gaining useful insight from data. The intersection of these two research areas is
emerging rapidly.
Trajectory optimization is one particular topic that has been traditionally re-
searched under the umbrella of control theory, but can now benefit from concepts
from machine learning. Trajectory optimization is a process by which control inputs
for a dynamical system are chosen such that some measure of cost or performance
is minimized or maximized. Selection of control inputs affects how the state of the
dynamical system evolves, thereby determining its trajectory over time. For systems
that are described by a set of differential equations with a quadratic cost function,
several well known methods have been used to compute an optimal control law. These
methods include the linear quadratic regulator (LQR) and the linear quadratic Gaus-
sian (LQG) which extends the basic LQR method to incorporate uncertainty. For
LQG, the uncertainty shows up as additive white Gaussian noise in the differential
equations. Much of the optimal control research of the past several decades is based
on these approaches and has extended them for various purposes. One particular
extension, known as iterative linear quadratic Gaussian (iLQG) [11] is explored in
further detail in the following sections.
The key aspect of traditional trajectory optimization approaches is the require-
ment of a dynamical model specified by a set of differential equations. There are many
reasons why this might be a hindrance in some applications: a dynamical model of
this type may not be available, generating a dynamical model for a new system may
be difficult, or the dynamical model may be too inaccurate. In these cases it may
be necessary to create an alternative representation of the system dynamics. Several
algorithms have been proposed to solve the trajectory optimization problem using
alternative representations of the system dynamics; most notably, PILCO [2] and
probabilistic differential dynamic programming (PDDP) [9].
The research effort described in this report is focused on extending ideas from
iLQG and PDDP by using Gaussian Process Regression Networks (GPRN) [12] as
a means to represent unknown system dynamics. Chapter 2 contains a derivation
of iLQG along with demonstrations of the algorithm on several systems. Chapter 3
contains a description of PDDP, and a brief background on Gaussian processes (GPs)
which are used in PDDP to represent the unknown system dynamics. Chapter 4 con-
tains a description of GPRN, including a description and demonstration of elliptical
slice sampling (ESS) [7]. Finally, Chapter 5 summarizes the progress made in this
effort and outlines remaining work.
CHAPTER II
KNOWN DYNAMICS
2.1 Iterative Linear Quadratic Gaussian [11]
Iterative Linear Quadratic Gaussian (iLQG) is an algorithm used for computing lo-
cally optimal trajectories over a finite horizon for systems with known dynamics.
This section of the report describes its derivation and demonstrates results on
several toy systems.
2.1.1 Derivation
2.1.1.1 Definitions
dx = f(x, u)dt + F(x, u)dω        (2.1.1)

v(t, x) = E[ h(x(T)) + ∫_t^T l(τ, x(τ), u(τ)) dτ ]        (2.1.2)
Equations (2.1.1) and (2.1.2) specify the dynamics of the system and the expected
cost-to-go, respectively. The cost-to-go is defined as the total cost expected to accu-
mulate if the system is initialized in state x at time t and evolved until t = T. It
has already been implied by equation (2.1.2) that the expected cost-to-go is not a
function of u since we will assume going forward that the control law is a determin-
istic function of the state, x. The symbols that appear in these equations are defined
below:
f ≡ system dynamics
F ≡ noise process
ω ≡ Brownian motion noise, ω ∈ R^p
x ≡ state, x ∈ R^n
u ≡ control input, u ∈ R^m
v ≡ cost-to-go
l ≡ instantaneous cost rate
h ≡ penalty on the final state
The iLQG algorithm is based upon linearization of the dynamics and a quadratic
expansion of the cost around a nominal trajectory. Thus we define a nominal state
trajectory and control sequence, x̄ and ū, respectively, and small perturbations
about them: δx = x − x̄, δu = u − ū.
2.1.1.2 Linearization of the Dynamics
The first step in the derivation is to linearize the system dynamics around the nominal
trajectories. This is done using a first-order Taylor series expansion and is provided
in detail, below.
dx/dt = f(x, u) + F(x, u) dω/dt
      = f(x − x̄ + x̄, u − ū + ū) + F(x − x̄ + x̄, u − ū + ū) dω/dt
      = f(δx + x̄, δu + ū) + F(δx + x̄, δu + ū) dω/dt
Next, we apply the first-order Taylor series expansion around ¯x, ¯u.
dx/dt = f(x̄, ū) + ∇_x f(x̄, ū)δx + ∇_u f(x̄, ū)δu + [F(x̄, ū) + ∇_u F(x̄, ū)δu] dω/dt
In this case, we are ignoring the variation of the noise w.r.t. the state (i.e.
∇_x F(x̄, ū)) because we are only interested in multiplicative noise in the control
input for this particular derivation. Substituting dx̄/dt = f(x̄, ū) and noting that
dδx/dt = dx/dt − dx̄/dt gives:

dδx/dt = ∇_x f(x̄, ū)δx + ∇_u f(x̄, ū)δu + [F(x̄, ū) + ∇_u F(x̄, ū)δu] dω/dt
Now we can write the update rule in discrete time, setting dt = ∆t and using the
subscript k to denote the current time step:

δx_{k+1} = δx_k + ∆t[∇_x f(x̄_k, ū_k)δx_k + ∇_u f(x̄_k, ū_k)δu_k]
           + ∆t[F(x̄_k, ū_k) + ∇_u F(x̄_k, ū_k)δu_k] dω/dt

For convenience, we make the following substitutions:

A_k ≜ I + ∆t ∇_x f(x̄_k, ū_k)
B_k ≜ ∆t ∇_u f(x̄_k, ū_k)
C_k(δu_k) ≜ [c_{1,k} + C_{1,k}δu_k, ..., c_{p,k} + C_{p,k}δu_k],
    where c_{i,k} = √∆t F^[i](x̄_k, ū_k) and C_{i,k} = √∆t ∇_u F^[i](x̄_k, ū_k)
ξ_k ≜ √∆t (dω/dt) ∼ N(0, I_p)

⇒ δx_{k+1} = A_k δx_k + B_k δu_k + C_k(δu_k) ξ_k        (2.1.3)
Equation (2.1.3) describes the evolution of the system linearized about the nominal
trajectory in discrete time. In the discrete case, K is used to denote the final time
step, where t = T.
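In practice the Jacobians ∇_x f and ∇_u f in Equation (2.1.3) are often obtained numerically when analytic derivatives are inconvenient. The following is a minimal Python sketch of forming A_k and B_k by central finite differences; the report's own listings are in MATLAB, and the function name and tolerance are illustrative choices, not part of the original implementation.

```python
import numpy as np

def discrete_linearization(f, x_bar, u_bar, dt, eps=1e-6):
    """Form A_k = I + dt*df/dx and B_k = dt*df/du about (x_bar, u_bar)
    using central finite differences of the continuous dynamics f(x, u)."""
    n, m = x_bar.size, u_bar.size
    fx = np.zeros((n, n))
    fu = np.zeros((n, m))
    for i in range(n):
        xp, xm = x_bar.copy(), x_bar.copy()
        xp[i] += eps
        xm[i] -= eps
        fx[:, i] = (f(xp, u_bar) - f(xm, u_bar)) / (2 * eps)
    for j in range(m):
        up, um = u_bar.copy(), u_bar.copy()
        up[j] += eps
        um[j] -= eps
        fu[:, j] = (f(x_bar, up) - f(x_bar, um)) / (2 * eps)
    return np.eye(n) + dt * fx, dt * fu
```

For a linear system the finite-difference result matches the exact discretization up to floating-point error.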
2.1.1.3 Linearization and Quadratization of the Cost
The instantaneous cost rate given by l may also be thought of as a state cost in the
case where k < K (i.e. t < T). This will be useful for rewriting equation (2.1.2) in a
discretized and recursive form. Linearization of the cost rate follows the same steps
as for the dynamics; here, however, a second-order Taylor series expansion is used.
Note that terms involving cross derivatives (e.g. ∇_xu l) are ignored.
l(t, x(t), u) = l(t, x − x̄ + x̄, u − ū + ū)
             = l(t, δx + x̄, δu + ū)
             = l(t, x̄, ū) + ∇_x l(t, x̄, ū)δx + ½ δxᵀ ∇_xx l(t, x̄, ū) δx
               + ∇_u l(t, x̄, ū)δu + ½ δuᵀ ∇_uu l(t, x̄, ū) δu
At k = K, t = T, the state cost is simply the penalty on the final state which
is given explicitly by h. In this case, the linearized cost function is given by the
following:
h(x(T)) = h(x(T) − x̄(T) + x̄(T))
        = h(δx(T) + x̄(T))
        = h(x̄(T)) + ∇_x h(x̄(T))δx + ½ δxᵀ ∇_xx h(x̄(T)) δx
For convenience, the following substitutions are made:
q_k = l_k,  q_k = ∇_x l_k,  Q_k = ∇_xx l_k,  r_k = ∇_u l_k,  R_k = ∇_uu l_k   for k < K
q_K = h,  q_K = ∇_x h,  Q_K = ∇_xx h
Note that r and R are undefined at the final time step. With the above substitu-
tions, the state cost for k < K becomes:
cost_k = q_k + δx_kᵀ q_k + ½ δx_kᵀ Q_k δx_k + δu_kᵀ r_k + ½ δu_kᵀ R_k δu_k        (2.1.4)
2.1.1.4 Backwards Pass - Computation of the Control Law
It is now useful to introduce the discrete time, recursive form of equation (2.1.2)
which relies on the current state cost described by equation (2.1.4).
vk(δxk) = costk + E [vk+1(δxk+1)] (2.1.5)
Since it has been assumed that the control law, δu, is linear in δx we now prove
via induction backwards in time that equation (2.1.5) also has the following form:
v_k = s_k + δx_kᵀ s_k + ½ δx_kᵀ S_k δx_k        (2.1.6)
First, equation (2.1.4) is substituted into equation (2.1.5). Then the system is
advanced one time step using equation (2.1.3) for the next state’s cost-to-go.
v_k(δx_k) = q_k + δx_kᵀ q_k + ½ δx_kᵀ Q_k δx_k + δu_kᵀ r_k + ½ δu_kᵀ R_k δu_k
            + E[ v_{k+1}(A_k δx_k + B_k δu_k + noise_k) ]

Expanding the expectation term gives:

s_{k+1} + (A_k δx_k + B_k δu_k)ᵀ s_{k+1} + ½ (A_k δx_k + B_k δu_k)ᵀ S_{k+1} (A_k δx_k + B_k δu_k)
        + ½ tr( Σ_{i=1}^p (c_{i,k} + C_{i,k}δu_k)(c_{i,k} + C_{i,k}δu_k)ᵀ S_{k+1} )
For the tr(·) term, we make use of the fact that tr(V U) = tr(UV). Thus, expanding
the tr term gives:

½ tr(·) = ½ δu_kᵀ ( Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} C_{i,k} ) δu_k
          + δu_kᵀ Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} c_{i,k}
          + ½ Σ_{i=1}^p c_{i,k}ᵀ S_{k+1} c_{i,k}
⇒ v_k(δx_k) = q_k + s_{k+1} + ½ Σ_{i=1}^p c_{i,k}ᵀ S_{k+1} c_{i,k}
              + δx_kᵀ (q_k + A_kᵀ s_{k+1}) + ½ δx_kᵀ (Q_k + A_kᵀ S_{k+1} A_k) δx_k
              + δu_kᵀ ( r_k + B_kᵀ s_{k+1} + Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} c_{i,k} )
              + ½ δu_kᵀ ( R_k + B_kᵀ S_{k+1} B_k + Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} C_{i,k} ) δu_k
              + δu_kᵀ B_kᵀ S_{k+1} A_k δx_k        (2.1.7)
Now we define a linear control law δu_k = l_k + L_k δx_k along with the following
shortcuts:

g ≜ r_k + B_kᵀ s_{k+1} + Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} c_{i,k}
G ≜ B_kᵀ S_{k+1} A_k
H ≜ R_k + B_kᵀ S_{k+1} B_k + Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} C_{i,k}
Substituting these expressions back into equation (2.1.7), combining like terms,
and matching with equation (2.1.6) gives the following backwards-recursive update
formulas:
⇒ S_k = Q_k + A_kᵀ S_{k+1} A_k + L_kᵀ H L_k + L_kᵀ G + Gᵀ L_k
   s_k = q_k + A_kᵀ s_{k+1} + L_kᵀ g + L_kᵀ H l_k + Gᵀ l_k        (2.1.8)
   s_k = q_k + s_{k+1} + ½ Σ_{i=1}^p c_{i,k}ᵀ S_{k+1} c_{i,k} + l_kᵀ g + ½ l_kᵀ H l_k        (2.1.9)
The cost-to-go function is initialized with SK = QK, sK = qK, and sK = qK.
Finally, the cost-to-go function can be written as:
v_k(δx_k) = q_k + s_{k+1} + ½ Σ_{i=1}^p c_{i,k}ᵀ S_{k+1} c_{i,k}
            + δx_kᵀ (q_k + A_kᵀ s_{k+1}) + ½ δx_kᵀ (Q_k + A_kᵀ S_{k+1} A_k) δx_k
            + δu_kᵀ (g + G δx_k) + ½ δu_kᵀ H δu_k
The above expression can now be minimized w.r.t. the control perturbation δu_k:

∂v_k(δx_k)/∂δu_k = g + G δx_k + H δu_k ≡ 0
⇒ δu_k = −H⁻¹(g + G δx_k)

When H has some negative eigenvalues the above expression could potentially
make v_k arbitrarily negative. However, since v_k is the cost-to-go function it must
be strictly non-negative. The solution, in this case, involves augmenting H by
forcing its eigenvalues to meet a minimum threshold. This is accomplished by
diagonalization, H = V D Vᵀ, and subsequently replacing eigenvalues in D that are
less than some λ_min with λ_min to form D̃. Thus the augmented matrix becomes
H̃ = V D̃ Vᵀ. Now the control law can be specified:

⇒ l_k = −V D̃⁻¹ Vᵀ g,   L_k = −V D̃⁻¹ Vᵀ G
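The eigenvalue-flooring step above is straightforward to implement. Below is a minimal Python sketch (the report's own listings are in MATLAB); the function name, argument order, and default λ_min are illustrative, and H is assumed symmetric so that an eigendecomposition of the required form exists.

```python
import numpy as np

def regularized_gains(H, g, G, lam_min=1e-3):
    """Compute l_k = -V D~^-1 V' g and L_k = -V D~^-1 V' G, where D~ is D
    with every eigenvalue floored at lam_min (H assumed symmetric)."""
    d, V = np.linalg.eigh(H)            # H = V diag(d) V'
    d = np.maximum(d, lam_min)          # floor the eigenvalues
    H_inv = V @ np.diag(1.0 / d) @ V.T  # inverse of the augmented H
    return -H_inv @ g, -H_inv @ G
```

When H is already positive definite with eigenvalues above the floor, the result coincides with the unregularized gains −H⁻¹g and −H⁻¹G.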
2.1.1.5 Forward Pass - Computation of the New Control Signal, Trajectory, and
Cost
At this point, we perform a forward pass through the linearized dynamics to compute
the new open-loop control signal. By definition δx1 = 0.
δuk = αlk + Lkδxk
δxk+1 = Akδxk + Bkδuk
Here, α is introduced as a backtracking linesearch parameter which is initialized
at α = 1 and reduced by some factor η < 1 if the cost of the new trajectory does not
give an improvement. Then we set u = ¯u + δu and compute a new state trajectory
using equation (2.1.1) in an Euler integration scheme:
xk+1 = xk + ∆tf(xk, uk)
The expected cost is computed using equation (2.1.2) in a similar setup. If there is
an improvement on the cost then we set ¯x = x and ¯u = u and start the process over,
beginning with linearization of the dynamics around the new trajectory. If the cost
improvement is smaller than some threshold value, then the algorithm is finished.
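The forward pass and line search described above can be sketched end to end. This is a hedged Python sketch (the report's own listings are in MATLAB); the function and argument names are illustrative, and `cost` is assumed to evaluate a whole trajectory.

```python
import numpy as np

def forward_pass(f, cost, x0, x_bar, u_bar, l_ff, L_fb, A, B, dt,
                 eta=0.5, max_backtracks=10):
    """Roll the linearized perturbation dynamics forward with
    du_k = alpha*l_k + L_k*dx_k, backtracking on alpha until the new
    trajectory improves on the nominal cost."""
    K = len(u_bar)
    J_bar = cost(x_bar, u_bar)
    alpha = 1.0
    for _ in range(max_backtracks):
        x, u = [x0], []
        dx = np.zeros_like(x0)               # by definition dx_1 = 0
        for k in range(K):
            du = alpha * l_ff[k] + L_fb[k] @ dx
            u.append(u_bar[k] + du)
            dx = A[k] @ dx + B[k] @ du       # linearized propagation
            x.append(x[k] + dt * f(x[k], u[k]))  # Euler step of true dynamics
        if cost(x, u) < J_bar:
            return x, u                      # improvement found: accept
        alpha *= eta                         # otherwise shrink the step
    return x_bar, u_bar                      # no improvement: keep nominal
```

A trivial scalar integrator illustrates the acceptance logic: a feedforward correction that cancels the nominal control drives the terminal cost to zero and is accepted on the first backtrack.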
2.1.2 Cart-Pole Application
Cart-pole refers to a benchmark control problem in which a weighted pendulum is
attached to a cart which can be forced to roll along a single dimension. The dynamics
of the system are given by the following equations [3]:
ẍ = [ f + m_p sin θ (l θ̇² + g cos θ) ] / ( m_c + m_p sin²θ )        (2.1.10)

θ̈ = [ −f cos θ − m_p l θ̇² cos θ sin θ − (m_c + m_p) g sin θ ] / [ l (m_c + m_p sin²θ) ]        (2.1.11)
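As a concrete reference, the dynamics above can be collected into a single state-derivative function. This is a minimal Python sketch (the report's own listings are in MATLAB); the function name and state ordering are illustrative choices.

```python
import numpy as np

def cartpole_dynamics(state, f, mc=1.0, mp=0.01, l=0.25, g=9.81):
    """State derivative [x_dot, th_dot, x_ddot, th_ddot] per Eqs.
    (2.1.10)-(2.1.11); theta = 0 is the pendulum hanging straight down."""
    x, th, xd, thd = state
    s, c = np.sin(th), np.cos(th)
    denom = mc + mp * s**2
    xdd = (f + mp * s * (l * thd**2 + g * c)) / denom
    thdd = (-f * c - mp * l * thd**2 * c * s - (mc + mp) * g * s) / (l * denom)
    return np.array([xd, thd, xdd, thdd])
```

Sanity checks: the hanging rest state with zero force is an equilibrium, and a unit force at rest accelerates the cart by f/m_c while pushing the pendulum opposite the motion.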
Figure 1 shows a schematic of the cart-pole system and identifies each of the
parameters in the equations above. For this demonstration, m_c = 1 kg, m_p =
0.01 kg, l = 0.25 m, and g = 9.81 m/s². The initial configuration of the cart-pole
is specified as x = ẋ = θ = θ̇ = 0, meaning the cart is at rest with the pendulum
hanging downwards.

Figure 1: A schematic of the cartpole system with all relevant parameters indicated

The goal state for this demonstration is x = ẋ = θ̇ = 0, θ = π,
corresponding to the cart returning to its initial x position after swinging the pendu-
lum upright. A time horizon of 2 seconds is chosen with a time step of 0.02 seconds.
Lastly, the control input f is constrained to be |f| ≤ 5. The performance index used
for this demonstration is given by:
J₀ = [x(T), θ(T) − π, ẋ(T), θ̇(T)] diag(5, 1000, 50, 5) [x(T), θ(T) − π, ẋ(T), θ̇(T)]ᵀ
     + ∫₀ᵀ 10⁻⁴ u(t)² dt
Figures 2, 3, and 4 show the optimized state trajectories, control signal, and
cost evolution of the cart-pole system, respectively. In this case, the goal state was
achieved even though there was some saturation of the control signal.
2.1.3 Double Cart-Pole Application
The double cart-pole problem is similar to the cart-pole problem described in the
previous section. In this case, however, there is a second pendulum attached to the
end of the first pendulum. Thus there is one additional state for the second link
position, θ2 (relative to the first link) and an additional state for the second link
angular velocity, ˙θ2. The dynamics are given by the following equations [4]:
Figure 2: State trajectories for the cart-pole system over 2 seconds showing the goal
state being met
A [ẍ, θ̈₁, θ̈₂]ᵀ = b        (2.1.12)
Where
Figure 3: Control signal for the cart-pole system over 2 seconds corresponding to the
positive force applied to the cart
Figure 4: Cart-pole actual cost of the trajectory after each iteration
A = ⎡ 2(m_c + m_p1 + m_p2)        −(m_p1 + 2m_p2) l₁ cos θ₁    −m_p2 l₂ cos θ₂        ⎤
    ⎢ −(3m_p1 + 6m_p2) cos θ₁     (2m_p1 + 6m_p2) l₁           3m_p2 l₂ cos(θ₁ − θ₂) ⎥
    ⎣ −3 cos θ₂                   3 l₁ cos(θ₁ − θ₂)            2 l₂                  ⎦

b = ⎡ 2u − (m_p1 + 2m_p2) l₁ θ̇₁² sin θ₁ − m_p2 l₂ θ̇₂² sin θ₂ ⎤
    ⎢ (3m_p1 + 6m_p2) g sin θ₁ − 3m_p2 l₂ θ̇₂² sin(θ₁ − θ₂)   ⎥
    ⎣ 3 l₁ θ̇₁² sin(θ₁ − θ₂) + 3g sin θ₂                       ⎦
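Equation (2.1.12) is solved for the accelerations at each time step. A Python sketch of that solve under the parameter values used later in this section follows (the report's own listings are in MATLAB; the function name and defaults are illustrative).

```python
import numpy as np

def double_cartpole_accel(th1, th2, th1d, th2d, u,
                          mc=10.0, mp1=3.0, mp2=1.0, l1=1.4, l2=1.0, g=9.81):
    """Solve A [xdd, th1dd, th2dd]^T = b (Eq. 2.1.12) for the accelerations."""
    c1, s1 = np.cos(th1), np.sin(th1)
    c2, s2 = np.cos(th2), np.sin(th2)
    c12, s12 = np.cos(th1 - th2), np.sin(th1 - th2)
    A = np.array([
        [2*(mc + mp1 + mp2), -(mp1 + 2*mp2)*l1*c1, -mp2*l2*c2],
        [-(3*mp1 + 6*mp2)*c1, (2*mp1 + 6*mp2)*l1, 3*mp2*l2*c12],
        [-3*c2, 3*l1*c12, 2*l2],
    ])
    b = np.array([
        2*u - (mp1 + 2*mp2)*l1*th1d**2*s1 - mp2*l2*th2d**2*s2,
        (3*mp1 + 6*mp2)*g*s1 - 3*mp2*l2*th2d**2*s12,
        3*l1*th1d**2*s12 + 3*g*s2,
    ])
    return np.linalg.solve(A, b)
```

A quick check: with the links hanging down (θ₁ = θ₂ = π in this section's convention), zero rates, and zero force, b vanishes and the accelerations are zero, i.e. the hanging configuration is an equilibrium.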
Figure 5 shows the double cart-pole problem schematically, specifying all of the
salient features. Note that the definition of θ₁ in this problem has changed:
θ₁ = 0 now corresponds to the first link pointing upright.

Figure 5: Schematic of the double cart-pole problem with all relevant features specified

The objective of the double
cart-pole application will be the same as in the single cart-pole: to start in the state
x = ẋ = θ̇₁ = θ̇₂ = 0, θ₁ = θ₂ = π, corresponding to the cart at rest with the links
hanging downwards, and swing the links up to the state x = ẋ = θ̇₁ = θ̇₂ = θ₁ =
θ₂ = 0. For this demonstration the parameters are chosen to be m_c = 10 kg, m_p1 =
3 kg, m_p2 = 1 kg, l₁ = 1.4 m, l₂ = 1 m, g = 9.81 m/s². The horizon and time
discretization are chosen as 2 s and 0.02 s, respectively. The performance index used
for the demonstration is given by:
J₀ = x(T)ᵀ Q_F x(T) + ∫₀ᵀ 10⁻⁴ u(t)² dt

x = [x, ẋ, θ₁, θ̇₁, θ₂, θ̇₂]ᵀ
Q_F = diag(5, 5, 1000, 50, 1000, 50)
Figures 6, 7, and 8 show the optimized state trajectories, control signal, and cost
evolution of the double cart-pole system, respectively. In this case, the goal state was
achieved within the allotted time horizon. The control force limitations were widened
for this application to ensure the goal state could be reached.
2.1.4 Quadrotor Application
In this section, iLQG is applied to a 6DOF quadrotor model. The state, in this case,
is a combination of linear positions, linear velocities, angular positions, and angular
velocities, while the control is simply the 4 thrust forces of the rotors:
x = [x, y, z, ẋ, ẏ, ż, φ, θ, ψ, φ̇, θ̇, ψ̇]ᵀ
u = [F₁, F₂, F₃, F₄]ᵀ
The objective for this demonstration is to start at position (−1, 1, 0.5) and finish
at position (0.5, − 1, 1.5) after 3 seconds. All velocities and angles begin at zero
and should end at zero - thus the quadrotor begins at rest at the start position and
should reach the goal position and stop there. Figure 9 shows the quadrotor problem
schematically, identifying all of the angles and their directions [1].
For the dynamics model, it is useful to define the following quantities:
u1 = F1 + F2 + F3 + F4 thrust force
u2 = F4 − F2 roll force
u3 = F1 − F3 pitch force
u4 = 0.05(F2 + F4 − F1 − F3) yaw moment
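The mapping above from rotor thrusts to generalized forces can be sketched directly. This is a small Python sketch (the report's own listings are in MATLAB; the function name is illustrative).

```python
def mix_rotor_forces(F1, F2, F3, F4):
    """Map individual rotor thrusts to the generalized forces of the model."""
    u1 = F1 + F2 + F3 + F4             # total thrust force
    u2 = F4 - F2                       # roll force
    u3 = F1 - F3                       # pitch force
    u4 = 0.05 * (F2 + F4 - F1 - F3)    # yaw moment
    return u1, u2, u3, u4
```

Equal thrusts on all four rotors produce pure collective thrust with no roll, pitch, or yaw component.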
Then the linear accelerations are given by [5]:
ẍ = (1/m)(cos φ sin θ cos ψ + sin φ sin ψ) u₁
ÿ = (1/m)(cos φ sin θ sin ψ − sin φ cos ψ) u₁
z̈ = (1/m)(cos φ cos θ) u₁
Table 1: Quadrotor parameters and values

Symbol   Description                                Value    Units
Jx       Moment of inertia about x-axis             0.0081   kg m²
Jy       Moment of inertia about y-axis             0.0081   kg m²
Jz       Moment of inertia about z-axis             0.0142   kg m²
L        Distance of rotor to the center of mass    0.24     m
m        Quadrotor mass                             1        kg
g        Gravitational acceleration                 9.81     m/s²
And the angular accelerations are given by [5]:
φ̈ = (1/J_x)( θ̇ψ̇ (J_y − J_z) + L u₂ )
θ̈ = (1/J_y)( φ̇ψ̇ (J_z − J_x) + L u₃ )
ψ̈ = (1/J_z)( φ̇θ̇ (J_x − J_y) + u₄ )
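The angular-acceleration equations above, with the Table 1 parameter values as defaults, can be sketched as follows (a Python sketch; the report's own listings are in MATLAB, and the function name is illustrative).

```python
def quadrotor_angular_accel(phid, thetad, psid, u2, u3, u4,
                            Jx=0.0081, Jy=0.0081, Jz=0.0142, L=0.24):
    """Euler-angle angular accelerations with gyroscopic coupling terms."""
    phidd   = (thetad * psid * (Jy - Jz) + L * u2) / Jx
    thetadd = (phid * psid * (Jz - Jx) + L * u3) / Jy
    psidd   = (phid * thetad * (Jx - Jy) + u4) / Jz
    return phidd, thetadd, psidd
```

With zero body rates the coupling terms vanish, so each acceleration reduces to its control term divided by the corresponding inertia.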
Table 1 enumerates the parameters used in the quadrotor model along with the
values chosen for this demonstration. The performance index for this demonstration
is given by:
J₀ = (x(T) − x_G)ᵀ Q_F (x(T) − x_G) + ∫₀ᵀ [ x(t)ᵀ Q x(t) + u(t)ᵀ u(t) ] dt

Q_F = diag(100, 100, 100, 10, 10, 10, 50, 50, 0, 5, 5, 5)
Q = diag(0, 0, 0, 0, 0, 0, 10, 10, 10, 0, 0, 0)
x_G = [0.5, −1, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0]ᵀ
Figures 10-12 show the results of running iLQG on the quadrotor problem. Each
of the state goals is met within the allotted time horizon of 3 seconds.
Figure 6: State trajectories for the double cart-pole system over 2 seconds showing
the goal state being met
Figure 7: Control signal for the double cart-pole system over 2 seconds corresponding
to the positive force applied to the cart
Iteration
0 20 40 60 80 100 120 140 160 180 200
Cost
×104
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Double Cart-Pole Cost Evolution
Figure 8: Double cart-pole actual cost of the trajectory after each iteration
Figure 9: Schematic representation of the quadrotor dynamic model [1].
[Figure 10 panels: (a) Position Trajectory, (b) Angle Trajectory, (c) Velocity Trajectory, (d) Angular Rate Trajectory; all plotted against time (s)]
Figure 10: Quadrotor state trajectories from 0 to 3 seconds showing the vehicle
meeting the goal state
Figure 11: Quadrotor control effort for each rotor
Figure 12: Quadrotor actual cost of the trajectory after each iteration
CHAPTER III
UNKNOWN DYNAMICS
In the last chapter, iLQG was introduced and derived as an algorithm useful for
situations when the system dynamics are known. In this chapter, PDDP will be
introduced as an algorithm useful for situations when the system dynamics are not
known. Although there are other algorithms designed to solve the same problems,
these two are discussed in this report to provide a foundation for the extensions
proposed in the following chapter.
3.1 Probabilistic Differential Dynamic Programming
Probabilistic Differential Dynamic Programming (PDDP) is an algorithm proposed by
Yunpeng Pan and Evangelos Theodorou as an extension to the well known differential
dynamic programming (DDP) algorithm [9]. DDP and iLQG are extremely similar in
their derivation; for this reason, PDDP is treated essentially as an extension of iLQG
in this context. The problem formulation of PDDP matches that of iLQG exactly.
3.1.1 Gaussian Processes
The innovation of PDDP lies in its use of a Gaussian Process to represent the system
dynamics instead of a collection of differential equations. This relieves the need to
have an explicit model of the system dynamics, as was the case in iLQG. Instead, the
GP-based representation relies on observations of the dynamics for given state and
control inputs. More precisely, the GP assumes a joint Gaussian distribution over a
sequence of state-control pairs, ˜X = {(x0, u0), ..., (xH, uH)}, and corresponding state
transitions dX = {dx0, ..., dxH} measured from the system itself [9]. Thus the state
transition is found through the process of GP regression, which can be interpreted as
regression over functions. In the case of PDDP, the function of interest is the system
dynamics (see Equation 2.1.1, for example), or rather, the state transition function.
A complete treatment of the general subject of Gaussian Processes can be found in
Gaussian Processes for Machine Learning by Rasmussen and Williams [10]. For
the remainder of the report, a "sample from a GP" refers to the set of values of the
sampled function evaluated at the inputs.
GPs offer a nonparametric representation of the state transition function which
allows them to be extremely flexible in representing highly nonlinear systems. There
is freedom, however, in choosing the covariance function to specify the joint Gaus-
sian distribution over the observed data. The covariance function provides a way to
quantify the similarity of one input point (i.e. one state-control pair, in this case)
to another input point. In a conceptual sense, covariance functions are analogous
to kernel functions in general machine learning. The idea is that the more similar
two points are, as measured using the covariance function, the more correlated their
outputs will be. Thus in the regression setting, a query point’s output will be heavily
informed by similar points with known outputs.
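Standard GP regression of the kind PDDP builds on can be sketched in a few lines. Below is a minimal Python sketch with a squared-exponential covariance; the function names, hyperparameters (length-scale `ell`, signal amplitude `sf`, noise `sn`), and defaults are illustrative choices, not PDDP's actual implementation.

```python
import numpy as np

def se_kernel(A, B, ell=1.0, sf=1.0):
    """Squared-exponential covariance between input sets A (n,d) and B (m,d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X, y, Xq, ell=1.0, sf=1.0, sn=1e-3):
    """Posterior mean and covariance of a zero-mean GP at query points Xq,
    given noisy observations y at training inputs X."""
    K = se_kernel(X, X, ell, sf) + sn**2 * np.eye(len(X))
    Ks = se_kernel(Xq, X, ell, sf)
    Kss = se_kernel(Xq, Xq, ell, sf)
    mean = Ks @ np.linalg.solve(K, y)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov
```

With a small noise level, querying at the training inputs nearly interpolates the observed targets, and the predictive variance there is close to zero, which is exactly the behavior that lets similar known points inform a new query.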
3.1.2 Implications of the GP Representation
The use of a GP representation for the system dynamics presents a couple of chal-
lenges. By its very nature, GP regression incorporates uncertainty; thus, in this
case, the state transition value that is returned is, itself, a random variable. In a
discrete-time setting, the next state is usually found as xk+1 = xk + dxk∆t. The un-
certainty in state transition, dxk induces uncertainty in the state, xk+1. Meanwhile,
the state is treated as an input into the GP regression. This input uncertainty causes
the predictive distribution for the state transition to be analytically intractable as it
is non-Gaussian [9]. However, PDDP approximates the predictive distribution by a
Gaussian, wherein the predictive mean and covariances are computed by employing
the moment matching approach [9].
The uncertainty in the state values also necessitates augmentation to the state
vector seen in regular DDP (and iLQG). The augmented state vector contains the
means of each state along with a flattened version of the covariance matrix. This
allows the uncertainty to be carried through every part of the algorithm, allowing the
algorithm to balance the exploration strategy between closeness to the goal state and
uncertainty (variance) of the state [9]. Incidentally, the Jacobian matrices described
in the iLQG section (i.e. ∇_x f(x̄, ū) and ∇_u f(x̄, ū) in Equation 2.1.3) can be com-
puted analytically as a result of the definition of the augmented state, relieving the
need for numerical differentiation.
CHAPTER IV
EXTENDING PDDP AND ILQG
PDDP extends DDP/iLQG by representing the system dynamics as a Gaussian Pro-
cess, which is data driven and uncertain, as opposed to a set of differential equations.
The noise model considered in PDDP is, however, quite simple as it only contains
standard Brownian motion noise (dω) in the dynamics [9]:
dx = f(x, u)dt + Cdω, x(t0) = x0, dω ∼ N(0, Σω)
Thus the desire to extend PDDP comes from the desire to incorporate a more
sophisticated noise model and a more adaptive regression framework.
4.1 Gaussian Process Regression Networks
The main extension proposed to PDDP is the use of a Gaussian Process Regression
Network (GPRN) representation of the system dynamics, in lieu of the regular GP
representation. GPRNs were originally proposed by Andrew Wilson, David Knowles,
and Zoubin Ghahramani [12]. GPRNs maintain desirable aspects of the GP rep-
resentation such as modeling the dynamics using data and allowing uncertainty to
be considered. Additionally, GPRNs offer input (predictor, or "state", in this case)
dependent signal and noise correlations between multiple output variables (state tran-
sitions), and input dependent length-scales and amplitudes. Inclusion of signal noise
allows the separation of noise effects of the physical process from measurement noise,
for example. The dependence of the noise and length scales on the input allow the
regression process to adapt to the input, making the framework more flexible. Fi-
nally, the ability to account for signal and noise correlations between multiple output
variables strengthens the predictive capability of the GPRN over the standard GP.
Imagine that we are trying to predict the angular acceleration of a vehicle about
a particular axis (i.e. a state transition variable) at a point that was not used to
train the GPRN. In standard GP regression, the prediction is only really informed by
the measurements of that angular acceleration used to train the model. Accounting
for correlations between multiple outputs enables us to use all of the measurements,
which enhances the prediction since angular acceleration may be correlated with lin-
ear velocity for a particular vehicle. This correlation between angular acceleration
and linear velocity may also depend on the state of the vehicle (e.g. altitude); thus
GPRN’s input dependent correlations further enhance prediction. In general, this is
referred to as a multi-task method [12]. The case for using a GPRN to represent the
dynamics is clear as it satisfies the desire to incorporate a more sophisticated noise
model and a more adaptive regression framework.
Conceptually, a GPRN may be described simply as a mixture of GPs wherein the
mixture weights are also GPs. Figure 13, below, gives a schematic representation
of a GPRN [12]. It is important to note that the input, x, is in fact a vector, as it
represents the state. The functions, ˆfi(x), are GPs, each with their own noise, and are
referred to as nodes. The number of nodes is a choice for the designer of the GPRN
and is usually determined by experimenting to see which number produces the best
results. Each output (i.e. state transition) element, yj(x), is a linear combination of
the outputs of each node. The weights, Wij(x) applied to the node outputs are also
input-dependent GPs.
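To make this construction concrete, the observation model of Wilson et al. [12] can be written as y(x) = W(x)[f(x) + σf ε] + σy z, with ε and z standard-normal noise vectors. The sketch below (Python/NumPy here, rather than the MATLAB used elsewhere in this report; the function name and signature are illustrative) assembles the outputs from node values and weights already evaluated at a single input:

```python
import numpy as np

def gprn_output(W, f, sigma_f=0.0, sigma_y=0.0, rng=None):
    """Combine GPRN node values f (length q) with weights W (p x q) into
    outputs y (length p) via y = W (f + sigma_f * eps) + sigma_y * z."""
    rng = rng if rng is not None else np.random.default_rng(0)
    W, f = np.asarray(W, dtype=float), np.asarray(f, dtype=float)
    eps = rng.standard_normal(f.shape)   # node (signal) noise
    z = rng.standard_normal(W.shape[0])  # output (observation) noise
    return W @ (f + sigma_f * eps) + sigma_y * z
```

With both noise scales set to zero, each output is just the weighted sum of the node values, matching the description above.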
Regression in the GPRN framework is quite different from a standard GP. In
standard GP regression, we predict the output given the input and data comprised
of input/output training pairs. In a GPRN, we must instead predict the node
and weight GPs, whose values are latent and confounded in the outputs; thus
the standard regression procedure must be altered.
Figure 13: Schematic representation of a Gaussian Process Regression Network [12]
4.1.1 Elliptical Slice Sampling
Wilson et al. [12] solve this problem using a technique known as elliptical slice sam-
pling (ESS). Elliptical slice sampling [7], proposed by Iain Murray, Ryan Prescott
Adams, and David J.C. MacKay, is designed to be a generic, user-friendly (avoids
parameter tuning) sampling technique that may be viewed as an extension of slice
sampling [8]. ESS excels in sampling from posteriors with strongly correlated Gaus-
sian priors, which is the case in GPRNs. For a GPRN, ESS can be used to sample
from the posterior distribution of the node and weight GP functions evaluated at all
of the training points. This distribution is conditioned on the input/output pairs
and the hyperparameters for each GP. The ESS algorithm takes the current sample, f,
and a new sample, ν, drawn from the prior distribution, and parameterizes the proposal
as f′ = f cos θ + ν sin θ, where θ is the free parameter [7]. The current sample and the
prior sample specify an ellipse that is centered at the origin when viewed in this light.
Slice sampling is then performed, in which a θ value is selected uniformly over the
support of the region in which the likelihood of the proposal f′ is above a randomly
chosen threshold.
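The transition just described can be transcribed almost line for line from the pseudocode of Murray et al. [7]. The sketch below is a minimal Python version (the experiments in this report use the authors' MATLAB implementation); it assumes a zero-mean Gaussian prior and performs a single ESS update:

```python
import numpy as np

def ess_step(f, log_lik, nu, rng):
    """One elliptical slice sampling transition.

    f: current sample (zero-mean Gaussian prior assumed),
    nu: fresh draw from that prior, log_lik: log-likelihood function.
    """
    log_y = log_lik(f) + np.log(rng.uniform())   # random slice threshold
    theta = rng.uniform(0.0, 2.0 * np.pi)        # initial angle on the ellipse
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)  # point on the ellipse
        if log_lik(f_new) > log_y:
            return f_new                         # accept this proposal
        # shrink the bracket toward theta = 0 (the current sample) and retry
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)
```

Because θ = 0 recovers the current sample, whose likelihood exceeds the threshold by construction, the shrinking loop is guaranteed to terminate.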
Samples from the node and weight GPs evaluated at all of the training points
allow us to draw samples from the predictive distribution for node and weight values
at the query point. The output is simply a function of these values. This process is
repeated a number of times and the output value is averaged to marginalize out the
effect of the node and weight values.
4.1.1.1 Demonstration: Sampling from a Standard Gaussian
This section and the following demonstrate the usage of the ESS algorithm on two
toy problems: sampling from a standard Gaussian and sampling from a GP posterior.
They rely on the MATLAB implementation made available by Iain Murray as a
supplement [7].
The first demonstration compares samples drawn from a standard method for
Gaussian distributions (normrnd in MATLAB) with samples drawn using ESS. These
two approaches are fundamentally different because ESS requires samples
from a prior. In order to emulate the behavior of normrnd, a uniform distribution is
chosen to keep the prior as uninformative as possible. Figure 14, below, shows normal-
ized histograms of the samples from each method, overlaid with the true probability
distribution function, for a variety of population sizes and choices of prior.
For this application, the histograms in Figure 14 indicate that the ESS samples
are sensitive to the range of the uniform prior. If the bounds on the prior are too far
spread apart, ESS tends to produce large outliers more frequently than normrnd. On
the other hand, reducing the range of the uniform prior tends to bias the ESS samples,
thus skewing the distribution. Clearly, the U(−0.5, 0.5) prior would be insufficient
for sampling from N(0, 1).
Figure 14: Comparison of ESS samples to normrnd on a standard Gaussian
4.1.1.2 Demonstration: Sampling from a GP Posterior
ESS is more fitting for this application since, in the Gaussian process case, a prior is
specified. This demonstration roughly follows an example from Murray et al. [7] in
which ESS is used to generate samples from a GP posterior. For visualization pur-
poses, this demonstration is for a GP with only one input dimension and one output
dimension. Note that the predictive distribution of a GP can be computed analytically,
so it provides a baseline against which to compare the ESS samples.
First, a synthetic dataset of size N = 200 is generated, with input values {xn},
n = 1, . . . , N, sampled from U(0, 1). Then the covariance matrix, Σ, is computed
using the following definition:

Σij = σf² exp(−(xi − xj)²/(2l²)) (4.1.1)

This particular covariance function is the commonly used squared-exponential
covariance. Here, we set l = 1 and σf² = 1. The function values are then drawn
from the GP prior: f ∼ N(0, Σ). In practice, the true function values are not known,
so Gaussian noise with variance σn² = 0.3² is added to the function values to create
the observations, y. ESS requires the log-likelihood function (or the probability
density function of the distribution being sampled). Taking the posterior samples
evaluated at all of the input values simplifies the process, since the likelihood takes
the simple form

L(f) = ∏n N(yn; fn, σn²) (4.1.2)
The ability to sample from the GP prior and to evaluate the log-likelihood function
is all that is needed to sample from the posterior using ESS.
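As an illustrative sketch (in Python/NumPy; the demonstration itself used MATLAB, and these helper names are hypothetical), the synthetic dataset and the likelihood of equation (4.1.2) can be set up as follows:

```python
import numpy as np

def make_dataset(N=200, ell=1.0, sf2=1.0, sn=0.3, seed=0):
    """x ~ U(0,1); squared-exponential covariance (eq. 4.1.1); f drawn from
    the GP prior; observations y = f + Gaussian noise with std. dev. sn."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, N)
    Sigma = sf2 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2)
    # small jitter guards against numerical loss of positive definiteness
    f = rng.multivariate_normal(np.zeros(N), Sigma + 1e-10 * np.eye(N))
    y = f + sn * rng.standard_normal(N)
    return x, Sigma, f, y

def log_lik(f, y, sn=0.3):
    """Log of the likelihood in eq. (4.1.2): a product of Gaussian densities."""
    return -0.5 * np.sum((y - f) ** 2 / sn ** 2 + np.log(2.0 * np.pi * sn ** 2))
```

The log-likelihood is maximized when the latent values f coincide with the observations y, as expected for a Gaussian noise model.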
Rasmussen’s book [10] gives the process for sampling from the GP posterior di-
rectly. The posterior has a joint Gaussian distribution whose mean and covariance
are given by the following
f̄∗ ≜ E[f∗ | y, Σ] = Σ (Σ + σn² I)⁻¹ y (4.1.3)

cov(f∗) = Σ − Σ (Σ + σn² I)⁻¹ Σ (4.1.4)
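These expressions reduce to a few lines of linear algebra. The sketch below (Python/NumPy rather than the report's MATLAB; the function name is illustrative) evaluates the posterior mean and covariance at the training inputs:

```python
import numpy as np

def gp_posterior(Sigma, y, sn2):
    """Posterior mean and covariance of the latent function values at the
    training inputs, per eqs. (4.1.3) and (4.1.4)."""
    K = Sigma + sn2 * np.eye(len(y))        # Sigma + sigma_n^2 I
    mean = Sigma @ np.linalg.solve(K, y)    # Sigma K^{-1} y
    cov = Sigma - Sigma @ np.linalg.solve(K, Sigma)
    return mean, cov
```

As a sanity check, with Σ = I and σn² = 1 the posterior mean is y/2 and the posterior covariance is I/2.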
Figure 15, below, contains the results of taking 100 samples from the GP posterior
using both ESS and the direct method. Note that there is a vertical spread in the
samples at each location because the function values at each location come from
their own Gaussian distributions. Several ESS samples stand out as being very
unlikely, meaning they lie much farther from the true function values plotted
on the figure.
Figure 15: Comparison of ESS samples to direct method for GP regression
This issue can be attributed to the fact that ESS produces a chain of samples.
The chain must start somewhere, and, in this case, it begins with a sample from the
prior distribution. A trace of the sample log-likelihood values in Figure 16
reveals that the likelihood starts out quite low before converging to a stable
value.
Employing a burn-in strategy, similar to those used with many other sampling
techniques, wherein a certain number of the initial samples are discarded, yields the
results shown in Figure 17. In this case, the first 20 ESS samples were discarded. The
remaining samples are indistinguishable from the direct samples, demonstrating the
viability of this technique for GP regression.
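The burn-in bookkeeping itself is trivial: run the chain, then slice off the initial samples. A minimal sketch (the step argument stands in for whatever transition kernel is used, e.g. an ESS update):

```python
import numpy as np

def run_chain(step, f0, n_samples=100, n_burn=20):
    """Run a Markov chain from f0 and discard the first n_burn samples."""
    chain, f = [], f0
    for _ in range(n_samples):
        f = step(f)           # one MCMC transition
        chain.append(f)
    return np.asarray(chain[n_burn:])   # keep only post-burn-in samples
```

With n_samples = 100 and n_burn = 20, as in the demonstration, 80 samples survive.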
Figure 16: Log-likelihood values of a chain of ESS samples for GP regression
Figure 17: Comparison of ESS samples to direct method for GP regression with burn
in
4.2 Implications of GPRN on PDDP
Uncertainty in the inputs is incorporated into PDDP using a moment matching ap-
proach. Inclusion of input uncertainty into the GPRN scheme being considered is
much more complex because the inputs and outputs in a GPRN are separated by a
thick layer of latent variables: the node and weight GPs. In this initial work, it is
advantageous, from the standpoint of conceptual complexity, to drop this uncertainty
in the input. The most obvious way to do so is simply to take the mean of the
distribution describing the input. Since the next state value
is a function of the current state and the state transition, we need only take the
mean of the predictive state transition distribution (specified by the GPRN). Because
this distribution is an average of a large number of intermediate Gaussian samples,
each incorporating different node and weight draws, its mean is simply the mean of
all the Gaussian means.
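Under this simplification, the planner's predictive mean is just the average of the per-sample Gaussian means. A trivial sketch (the array of sample means is hypothetical, one row per node/weight draw):

```python
import numpy as np

def predictive_mean(sample_means):
    """Mean of an equally weighted Gaussian mixture: the average of the
    component means, one per GPRN node/weight sample."""
    return np.mean(np.asarray(sample_means, dtype=float), axis=0)
```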
In terms of computational complexity, a GPRN generally scales linearly with the
output dimension and with the number of latent nodes. The cost is dominated
by the number of data points considered, just as it is for
standard GP regression. It is likely that the GPRN representation will be more
burdensome in practice because it is sample-based rather than analytical, as was
the case in PDDP. The biggest difference between a GPRN representation
and a GP representation is how the predictive distribution mean and covariance are
calculated and how the model’s hyperparameters are learned.
4.3 Remaining Work
Much of the conceptual framework for incorporating a GPRN representation into
PDDP and iLQG has been developed and discussed throughout this report. Work
remains, however, to complete the effort. The biggest aspect that is missing at
this point is a process for learning the hyperparameters for the GPRN. Each individual
GP, for each node and for each weight, comes with its own set of hyperparameters,
which may vary depending on the choice of covariance function. For a system with p
outputs and q nodes, there are p(q + 1) sets of hyperparameters to set. One
traditional approach to learning the parameters of a distribution is maximum
likelihood (ML). ML cannot be used to learn the GPRN hyperparameters because the
hyperparameters are associated with latent variables (i.e. variables which we cannot
observe directly). Other categories of methods, including gradient-based and
sampling-based (e.g. cross-entropy) approaches, are being considered for this
problem. Wilson et al. [12] utilize a message-passing implementation of variational
Bayes (VB) called variational message passing (VMP). Their implementation is built
on top of Microsoft Research’s Infer.NET framework which is a general platform for
probabilistic programming [6]. Much of the remaining work on this effort consists of
gaining a better understanding of VB and VMP, and then producing an
implementation in MATLAB.
CHAPTER V
CONCLUSION
The intersection between machine learning and control theory is emerging rapidly;
many rich applications are currently being developed. This report has explored one
particular area of interest: trajectory optimization. LQR and LQG were mentioned
as predecessors to iLQG, which was re-derived in this report and demonstrated on
three systems of interest: cartpole, double cartpole, and quadrotor. The iLQG algorithm
is robust and widely applicable to a vast number of different systems. PDDP was
introduced as an important extension to iLQG which incorporates uncertainty into
the trajectory optimization process. The conceptual advances presented in this report
were centered around the use of a GPRN to represent the unknown system dynamics
in lieu of the standard GP representation employed by PDDP. GPRNs offer a more
sophisticated noise model and a more adaptive regression framework. More specifi-
cally, they incorporate input-dependent noise and signal correlation between multiple
outputs. This should enhance the predictive capability of the algorithm to represent
the unknown system dynamics. Work remains on fully incorporating GPRNs into a
PDDP-like algorithm: effort continues in the areas of understanding VB and VMP
for learning GPRN hyperparameters, and in creating a MATLAB implementation.
APPENDIX A
MATLAB CODE LISTINGS
The links below point to files embedded within the document.

iLQG.m: my implementation of the iLQG algorithm for the purposes of this research

demo cartpole.m: specification of dynamics, start, and goal states for the cartpole
system

demo doublecartpole.m: specification of dynamics, start, and goal states for the
double cartpole system

demo quadrotor.m: specification of dynamics, start, and goal states for the quadrotor

elliptical slice.m: implements elliptical slice sampling [7], provided by the authors
at http://homepages.inf.ed.ac.uk/imurray2/pub/10ess/

elliptical slice demo.m: demonstrates elliptical slice.m on two examples: sampling
from a standard Gaussian, and sampling from the posterior distribution for a simple
Gaussian Process
REFERENCES
[1] Bolandi, H., Rezaei, M., Mohsenipour, R., Nemati, H., and
Smailzadeh, S., “Attitude control of a quadrotor with optimized pid con-
troller,” Intelligent Control and Automation, vol. 4, no. 3, pp. 335–342, 2013.
[2] Deisenroth, M. and Rasmussen, C. E., “Pilco: A model-based and data-
efficient approach to policy search,” in Proceedings of the 28th International
Conference on machine learning (ICML-11), pp. 465–472, 2011.
[3] Gandhi, M., “Cart-pole dynamics.” Private Communication, 2015. Georgia
Institute of Technology Student.
[4] Gandhi, M., “Double cart-pole dynamics.” Private Communication, 2015.
Georgia Institute of Technology Student.
[5] Gandhi, M., “Quadrotor dynamics.” Private Communication, 2015. Georgia
Institute of Technology Student.
[6] Minka, T., Winn, J., Guiver, J., Webster, S., Zaykov, Y., Yangel, B.,
Spengler, A., and Bronskill, J., “Infer.NET 2.6,” 2014. Microsoft Research
Cambridge. http://research.microsoft.com/infernet.
[7] Murray, I., Adams, R. P., and MacKay, D. J., “Elliptical slice sampling,”
arXiv preprint arXiv:1001.0175, 2009.
[8] Neal, R. M., “Slice sampling,” Annals of statistics, pp. 705–741, 2003.
[9] Pan, Y. and Theodorou, E., “Probabilistic differential dynamic program-
ming,” in Advances in Neural Information Processing Systems, pp. 1907–1915,
2014.
[10] Rasmussen, C. E., Gaussian processes for machine learning. Citeseer, 2006.
http://www.gaussianprocess.org/gpml/chapters/.
[11] Todorov, E. and Li, W., “A generalized iterative LQG method for locally-
optimal feedback control of constrained nonlinear stochastic systems,” (Atlantis,
Bahamas), 43rd IEEE Conference on Decision and Control, December 2004.
[12] Wilson, A. G., Knowles, D. A., and Ghahramani, Z., “Gaussian process
regression networks,” arXiv preprint arXiv:1110.4411, 2011.
Machine Learning Applications in Complex Control Systems
Alexander L. Von Moll
36 Pages
Directed by Professor Evangelos Theodorou
The intersection between machine learning and control theory is emerging
rapidly; many rich applications are currently being developed. This report explores
one particular area of interest: trajectory optimization. The Linear Quadratic Reg-
ulator and Linear Quadratic Gaussian algorithms are introduced as predecessors to
the Iterative Linear Quadratic Gaussian (iLQG) algorithm, which is re-derived in
this report and demonstrated on three systems of interest: cartpole, double cartpole,
and quadrotor. iLQG is robust and widely applicable to a vast number of different
systems. Probabilistic Differential Dynamic Programming (PDDP) is introduced as
an important extension to iLQG which incorporates uncertainty into the trajectory
optimization process. The conceptual advances presented in this report were centered
around the use of a Gaussian Process Regression Network (GPRN) to represent the
unknown system dynamics in lieu of the standard Gaussian Process representation
employed by PDDP. GPRNs offer a more sophisticated noise model and a more adap-
tive regression framework. More specifically, they incorporate input-dependent noise
and signal correlation between multiple outputs. This should enhance the predic-
tive capability of the algorithm to represent the unknown system dynamics. Work
remains on fully incorporating GPRN into a PDDP-like algorithm: effort continues
in the areas of understanding variational Bayes and variational message passing for
learning GPRN hyperparameters, and in creating a MATLAB implementation.
Más contenido relacionado

La actualidad más candente

12EE62R11_Final Presentation
12EE62R11_Final Presentation12EE62R11_Final Presentation
12EE62R11_Final Presentation
Amritesh Maitra
 
RedisDay London 2018 - CRDTs and Redis From sequential to concurrent executions
RedisDay London 2018 - CRDTs and Redis From sequential to concurrent executionsRedisDay London 2018 - CRDTs and Redis From sequential to concurrent executions
RedisDay London 2018 - CRDTs and Redis From sequential to concurrent executions
Redis Labs
 
UrbaWind, a Computational Fluid Dynamics tool to predict wind resource in urb...
UrbaWind, a Computational Fluid Dynamics tool to predict wind resource in urb...UrbaWind, a Computational Fluid Dynamics tool to predict wind resource in urb...
UrbaWind, a Computational Fluid Dynamics tool to predict wind resource in urb...
Stephane Meteodyn
 
Chp%3 a10.1007%2f978 3-642-55753-8-3
Chp%3 a10.1007%2f978 3-642-55753-8-3Chp%3 a10.1007%2f978 3-642-55753-8-3
Chp%3 a10.1007%2f978 3-642-55753-8-3
Sabina Czyż
 
1103072R_Ramsay_ENG5041P_Final_Report_15-16
1103072R_Ramsay_ENG5041P_Final_Report_15-161103072R_Ramsay_ENG5041P_Final_Report_15-16
1103072R_Ramsay_ENG5041P_Final_Report_15-16
Lee Ramsay
 

La actualidad más candente (20)

12EE62R11_Final Presentation
12EE62R11_Final Presentation12EE62R11_Final Presentation
12EE62R11_Final Presentation
 
Article 1
Article 1Article 1
Article 1
 
Discrete Time Optimal Tracking Control of BLDC Motor
Discrete Time Optimal Tracking Control of BLDC MotorDiscrete Time Optimal Tracking Control of BLDC Motor
Discrete Time Optimal Tracking Control of BLDC Motor
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Metal cutting tool position control using static output feedback and full sta...
Metal cutting tool position control using static output feedback and full sta...Metal cutting tool position control using static output feedback and full sta...
Metal cutting tool position control using static output feedback and full sta...
 
RedisDay London 2018 - CRDTs and Redis From sequential to concurrent executions
RedisDay London 2018 - CRDTs and Redis From sequential to concurrent executionsRedisDay London 2018 - CRDTs and Redis From sequential to concurrent executions
RedisDay London 2018 - CRDTs and Redis From sequential to concurrent executions
 
SIMMECHANICS VISUALIZATION OF EXPERIMENTAL MODEL OVERHEAD CRANE, ITS LINEARIZ...
SIMMECHANICS VISUALIZATION OF EXPERIMENTAL MODEL OVERHEAD CRANE, ITS LINEARIZ...SIMMECHANICS VISUALIZATION OF EXPERIMENTAL MODEL OVERHEAD CRANE, ITS LINEARIZ...
SIMMECHANICS VISUALIZATION OF EXPERIMENTAL MODEL OVERHEAD CRANE, ITS LINEARIZ...
 
Tuning of PID, SVFB and LQ Controllers Using Genetic Algorithms
Tuning of PID, SVFB and LQ Controllers Using Genetic AlgorithmsTuning of PID, SVFB and LQ Controllers Using Genetic Algorithms
Tuning of PID, SVFB and LQ Controllers Using Genetic Algorithms
 
Computational fluid dynamics
Computational fluid dynamicsComputational fluid dynamics
Computational fluid dynamics
 
call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...call for papers, research paper publishing, where to publish research paper, ...
call for papers, research paper publishing, where to publish research paper, ...
 
UrbaWind, a Computational Fluid Dynamics tool to predict wind resource in urb...
UrbaWind, a Computational Fluid Dynamics tool to predict wind resource in urb...UrbaWind, a Computational Fluid Dynamics tool to predict wind resource in urb...
UrbaWind, a Computational Fluid Dynamics tool to predict wind resource in urb...
 
wux
wuxwux
wux
 
Design and Control of a Hydraulic Servo System and Simulation Analysis
Design and Control of a Hydraulic Servo System and Simulation AnalysisDesign and Control of a Hydraulic Servo System and Simulation Analysis
Design and Control of a Hydraulic Servo System and Simulation Analysis
 
Chp%3 a10.1007%2f978 3-642-55753-8-3
Chp%3 a10.1007%2f978 3-642-55753-8-3Chp%3 a10.1007%2f978 3-642-55753-8-3
Chp%3 a10.1007%2f978 3-642-55753-8-3
 
Cfd 0
Cfd 0Cfd 0
Cfd 0
 
Implementation of low power divider techniques using
Implementation of low power divider techniques usingImplementation of low power divider techniques using
Implementation of low power divider techniques using
 
On selection of periodic kernels parameters in time series prediction
On selection of periodic kernels parameters in time series predictionOn selection of periodic kernels parameters in time series prediction
On selection of periodic kernels parameters in time series prediction
 
1103072R_Ramsay_ENG5041P_Final_Report_15-16
1103072R_Ramsay_ENG5041P_Final_Report_15-161103072R_Ramsay_ENG5041P_Final_Report_15-16
1103072R_Ramsay_ENG5041P_Final_Report_15-16
 
Feedback control of_dynamic_systems
Feedback control of_dynamic_systemsFeedback control of_dynamic_systems
Feedback control of_dynamic_systems
 
Alfonso Senatore
Alfonso SenatoreAlfonso Senatore
Alfonso Senatore
 

Destacado

Ad words campaign management
Ad words campaign managementAd words campaign management
Ad words campaign management
Seer Interactive
 
Ventana de dreamweaver
Ventana de dreamweaverVentana de dreamweaver
Ventana de dreamweaver
brillik
 
Otra Presentacion
Otra PresentacionOtra Presentacion
Otra Presentacion
Ana Rosa
 
DELITOS INFORMATICOS
DELITOS INFORMATICOSDELITOS INFORMATICOS
DELITOS INFORMATICOS
18061984
 

Destacado (20)

El agua
El aguaEl agua
El agua
 
Ad words campaign management
Ad words campaign managementAd words campaign management
Ad words campaign management
 
Startups unable to honour offers
Startups unable to honour offersStartups unable to honour offers
Startups unable to honour offers
 
English
EnglishEnglish
English
 
5 ways to Deal with Manipulative Colleagues
5 ways to Deal with Manipulative Colleagues5 ways to Deal with Manipulative Colleagues
5 ways to Deal with Manipulative Colleagues
 
Ventana de dreamweaver
Ventana de dreamweaverVentana de dreamweaver
Ventana de dreamweaver
 
製茶過程
製茶過程製茶過程
製茶過程
 
Taiwan - Genetic Screening and Prenatal Diagnosis of Thalassemias
Taiwan - Genetic Screening and Prenatal Diagnosis of ThalassemiasTaiwan - Genetic Screening and Prenatal Diagnosis of Thalassemias
Taiwan - Genetic Screening and Prenatal Diagnosis of Thalassemias
 
10 cosas sobre mi
10 cosas sobre mi10 cosas sobre mi
10 cosas sobre mi
 
Forkdiary
ForkdiaryForkdiary
Forkdiary
 
SEO in 2013
SEO in 2013SEO in 2013
SEO in 2013
 
Report on hiring entry level analytics report
Report on hiring entry level analytics reportReport on hiring entry level analytics report
Report on hiring entry level analytics report
 
Pubcon Content Presentation 2012
Pubcon Content Presentation 2012Pubcon Content Presentation 2012
Pubcon Content Presentation 2012
 
P@is@jes d melissit@
P@is@jes d melissit@P@is@jes d melissit@
P@is@jes d melissit@
 
CoCubes shares its take on Budget 2016
CoCubes shares its take on Budget 2016CoCubes shares its take on Budget 2016
CoCubes shares its take on Budget 2016
 
The dysfunctional church opener
The dysfunctional church openerThe dysfunctional church opener
The dysfunctional church opener
 
Otra Presentacion
Otra PresentacionOtra Presentacion
Otra Presentacion
 
CoCubes DNA: Facebook, Linkedin Post Challenge To Recruitment Firms
CoCubes DNA: Facebook, Linkedin Post Challenge To Recruitment FirmsCoCubes DNA: Facebook, Linkedin Post Challenge To Recruitment Firms
CoCubes DNA: Facebook, Linkedin Post Challenge To Recruitment Firms
 
DELITOS INFORMATICOS
DELITOS INFORMATICOSDELITOS INFORMATICOS
DELITOS INFORMATICOS
 
Financial Products for the Family
Financial Products for the FamilyFinancial Products for the Family
Financial Products for the Family
 

Similar a vonmoll-paper

Thesis-MitchellColgan_LongTerm_PowerSystem_Planning
Thesis-MitchellColgan_LongTerm_PowerSystem_PlanningThesis-MitchellColgan_LongTerm_PowerSystem_Planning
Thesis-MitchellColgan_LongTerm_PowerSystem_Planning
Elliott Mitchell-Colgan
 
Master's_Thesis_XuejiaoHAN
Master's_Thesis_XuejiaoHANMaster's_Thesis_XuejiaoHAN
Master's_Thesis_XuejiaoHAN
Xuejiao Han
 
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensorASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
Adrià Serra Moral
 
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensorASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
Adrià Serra Moral
 
TR-CIS-0420-09 BobZigon
TR-CIS-0420-09 BobZigonTR-CIS-0420-09 BobZigon
TR-CIS-0420-09 BobZigon
Bob Zigon
 
Thesis_Eddie_Zisser_final_submission
Thesis_Eddie_Zisser_final_submissionThesis_Eddie_Zisser_final_submission
Thesis_Eddie_Zisser_final_submission
Eddie Zisser
 
Machine-Type-Communication in 5G Cellular System-Li_Yue_PhD_2018.pdf
Machine-Type-Communication in 5G Cellular System-Li_Yue_PhD_2018.pdfMachine-Type-Communication in 5G Cellular System-Li_Yue_PhD_2018.pdf
Machine-Type-Communication in 5G Cellular System-Li_Yue_PhD_2018.pdf
YAAKOVSOLOMON1
 
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
Øystein Øihusom
 
Condition monitoring for track circuits: A multiple-model approach (MSc. Thesis)
Condition monitoring for track circuits: A multiple-model approach (MSc. Thesis)Condition monitoring for track circuits: A multiple-model approach (MSc. Thesis)
Condition monitoring for track circuits: A multiple-model approach (MSc. Thesis)
Wouter Verbeek
 

Similar a vonmoll-paper (20)

Thesis-MitchellColgan_LongTerm_PowerSystem_Planning
Thesis-MitchellColgan_LongTerm_PowerSystem_PlanningThesis-MitchellColgan_LongTerm_PowerSystem_Planning
Thesis-MitchellColgan_LongTerm_PowerSystem_Planning
 
Master's_Thesis_XuejiaoHAN
Master's_Thesis_XuejiaoHANMaster's_Thesis_XuejiaoHAN
Master's_Thesis_XuejiaoHAN
 
esnq_control
esnq_controlesnq_control
esnq_control
 
Distributed Traffic management framework
Distributed Traffic management frameworkDistributed Traffic management framework
Distributed Traffic management framework
 
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensorASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
 
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensorASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
ASerraMoral_16MS_FlutterSuppressionFlexUAVUsingLESPSensor
 
TR-CIS-0420-09 BobZigon
TR-CIS-0420-09 BobZigonTR-CIS-0420-09 BobZigon
TR-CIS-0420-09 BobZigon
 
UROP MPC Report
UROP MPC ReportUROP MPC Report
UROP MPC Report
 
Dmitriy Rivkin Thesis
Dmitriy Rivkin ThesisDmitriy Rivkin Thesis
Dmitriy Rivkin Thesis
 
Thesis_Eddie_Zisser_final_submission
Thesis_Eddie_Zisser_final_submissionThesis_Eddie_Zisser_final_submission
Thesis_Eddie_Zisser_final_submission
 
main
mainmain
main
 
Sarda_uta_2502M_12076
Sarda_uta_2502M_12076Sarda_uta_2502M_12076
Sarda_uta_2502M_12076
 
Machine-Type-Communication in 5G Cellular System-Li_Yue_PhD_2018.pdf
Machine-Type-Communication in 5G Cellular System-Li_Yue_PhD_2018.pdfMachine-Type-Communication in 5G Cellular System-Li_Yue_PhD_2018.pdf
Machine-Type-Communication in 5G Cellular System-Li_Yue_PhD_2018.pdf
 
Hoifodt
HoifodtHoifodt
Hoifodt
 
bachelors-thesis
bachelors-thesisbachelors-thesis
bachelors-thesis
 
• Sensorless speed and position estimation of a PMSM (Master´s Thesis)
•	Sensorless speed and position estimation of a PMSM (Master´s Thesis)•	Sensorless speed and position estimation of a PMSM (Master´s Thesis)
• Sensorless speed and position estimation of a PMSM (Master´s Thesis)
 
P10 project
P10 projectP10 project
P10 project
 
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
UiA Slam (Øystein Øihusom & Ørjan l. Olsen)
 
Feedback control methods_for_a_single_machine_infinite_bus_system
Feedback control methods_for_a_single_machine_infinite_bus_systemFeedback control methods_for_a_single_machine_infinite_bus_system
Feedback control methods_for_a_single_machine_infinite_bus_system
 
Condition monitoring for track circuits: A multiple-model approach (MSc. Thesis)
Condition monitoring for track circuits: A multiple-model approach (MSc. Thesis)Condition monitoring for track circuits: A multiple-model approach (MSc. Thesis)
Condition monitoring for track circuits: A multiple-model approach (MSc. Thesis)
 

vonmoll-paper

  • 1. MACHINE LEARNING APPLICATIONS IN COMPLEX CONTROL SYSTEMS AE 8900: Special Problems Final Report Presented to The Academic Faculty by Alexander L. Von Moll In Partial Fulfillment of the Requirements for the Degree Master of Science in the School of Aerospace Engineering Georgia Institute of Technology May 2016
  • 2. MACHINE LEARNING APPLICATIONS IN COMPLEX CONTROL SYSTEMS Approved by: Professor Evangelos Theodorou, Advisor School of Aerospace Engineering Georgia Institute of Technology Date Approved: 18 December 2015
  • 4. ACKNOWLEDGEMENTS I want to thank Professor Evangelos Theodorou for advising me and guiding my efforts. Additionally, I would like to thank Yunpeng Pan for guidance and for many discussions on the details of these algorithms. Lastly, thanks to Manan Gandhi for providing dynamics models for the cartpole, double cartpole, and quadrotor systems. iv
  • 5. TABLE OF CONTENTS DEDICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . iv LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii I INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 II KNOWN DYNAMICS . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1 Iterative Linear Quadratic Gaussian [11] . . . . . . . . . . . . . . . . 3 2.1.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1.2 Cart-Pole Application . . . . . . . . . . . . . . . . . . . . . . 10 2.1.3 Double Cart-Pole Application . . . . . . . . . . . . . . . . . 11 2.1.4 Quadrotor Application . . . . . . . . . . . . . . . . . . . . . 15 III UNKNOWN DYNAMICS . . . . . . . . . . . . . . . . . . . . . . . . 21 3.1 Probabilistic Differential Dynamic Programming . . . . . . . . . . . 21 3.1.1 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . 21 3.1.2 Implications of the GP Representation . . . . . . . . . . . . . 22 IV EXTENDING PDDP AND ILQG . . . . . . . . . . . . . . . . . . . 24 4.1 Gaussian Process Regression Networks . . . . . . . . . . . . . . . . . 24 4.1.1 Elliptical Slice Sampling . . . . . . . . . . . . . . . . . . . . 26 4.2 Implications of GPRN on PDDP . . . . . . . . . . . . . . . . . . . . 32 4.3 Remaining Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 V CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 APPENDIX A — MATLAB CODE LISTINGS . . . . . . . . . . . . 35 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 v
  • 6. LIST OF TABLES 1 Quadrotor parameters and values . . . . . . . . . . . . . . . . . . . . 16 vi
  • 7. LIST OF FIGURES 1 A schematic of the cartpole system with all relevant parameters indicated 11 2 State trajectories for the cart-pole system over 2 seconds showing the goal state being met . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3 Control signal for the cart-pole system over 2 seconds corresponding to the positive force applied to the cart . . . . . . . . . . . . . . . . . 13 4 Cart-pole actual cost of the trajectory after each iteration . . . . . . 13 5 Schematic of the double cart-pole problem with all relevant features specified . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 6 State trajectories for the double cart-pole system over 2 seconds show- ing the goal state being met . . . . . . . . . . . . . . . . . . . . . . . 17 7 Control signal for the double cart-pole system over 2 seconds corre- sponding to the positive force applied to the cart . . . . . . . . . . . 18 8 Double cart-pole actual cost of the trajectory after each iteration . . 18 9 Schematic representation of the quadrotor dynamic model [1]. . . . . 18 10 Quadrotor state trajectories from 0 to 3 seconds showing the vehicle meeting the goal state . . . . . . . . . . . . . . . . . . . . . . . . . . 19 11 Quadrotor control effort for each rotor . . . . . . . . . . . . . . . . . 19 12 Quadrotor actual cost of the trajectory after each iteration . . . . . . 20 13 Schematic representation of a Gaussian Process Regression Network [12] 26 14 Comparison of ESS samples to normrnd on a standard Gaussian . . . 28 15 Comparison of ESS samples to direct method for GP regression . . . 30 16 Log-likelihood values of a chain of ESS samples for GP regression . . 31 17 Comparison of ESS samples to direct method for GP regression with burn in . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 vii
CHAPTER I

INTRODUCTION

Control theory and machine learning are, in their own right, rich subjects with many applications. Their evolution has been quite different: control theory developed as a subdiscipline within electrical engineering and mathematics, with roots going back to the nineteenth century. Machine learning, on the other hand, has emerged much more recently, though many of its foundational concepts are borrowed from older disciplines such as linear algebra. Whereas control theory is centered around analyzing and altering the properties of dynamical systems, machine learning is centered around gaining useful insight from data. The intersection of these two research areas is emerging rapidly.

Trajectory optimization is one particular topic that has traditionally been researched under the umbrella of control theory, but can now benefit from concepts from machine learning. Trajectory optimization is a process by which control inputs for a dynamical system are chosen such that some measure of cost or performance is minimized or maximized. Selection of control inputs affects how the state of the dynamical system evolves, thereby determining its trajectory over time. For systems that are described by a set of differential equations with a quadratic cost function, several well-known methods have been used to compute an optimal control law. These methods include the linear quadratic regulator (LQR) and the linear quadratic Gaussian (LQG), which extends the basic LQR method to incorporate uncertainty. For LQG, the uncertainty shows up as additive white Gaussian noise in the differential equations. Much of the optimal control research of the past several decades is based on these approaches and has extended them for various purposes. One particular
extension, known as iterative linear quadratic Gaussian (iLQG) [11], is explored in further detail in the following sections.

The key aspect of traditional trajectory optimization approaches is the requirement of a dynamical model specified by a set of differential equations. There are many reasons why this might be a hindrance in some applications: a dynamical model of this type may not be available, generating a dynamical model for a new system may be difficult, or the dynamical model may be too inaccurate. In these cases it may be necessary to create an alternative representation of the system dynamics. Several algorithms have been proposed to solve the trajectory optimization problem using alternative representations of the system dynamics; most notably, PILCO [2] and probabilistic differential dynamic programming (PDDP) [9].

The research effort described in this report is focused on extending ideas from iLQG and PDDP by using Gaussian Process Regression Networks (GPRN) [12] as a means to represent unknown system dynamics. Chapter 2 contains a derivation of iLQG along with demonstrations of the algorithm on several systems. Chapter 3 contains a description of PDDP and a brief background on Gaussian processes (GPs), which are used in PDDP to represent the unknown system dynamics. Chapter 4 contains a description of GPRN, including a description and demonstration of elliptical slice sampling (ESS) [7]. Finally, Chapter 5 summarizes the progress made in this effort and outlines remaining work.
CHAPTER II

KNOWN DYNAMICS

2.1 Iterative Linear Quadratic Gaussian [11]

Iterative Linear Quadratic Gaussian (iLQG) is an algorithm used for computing locally optimal trajectories over a finite horizon for systems with known dynamics. This section of the report describes its derivation and demonstrates results on several toy systems.

2.1.1 Derivation

2.1.1.1 Definitions

dx = f(x, u) dt + F(x, u) dω    (2.1.1)

v(t, x) ≜ E[ h(x(T)) + ∫_t^T l(τ, x(τ), u(τ)) dτ ]    (2.1.2)

Equations (2.1.1) and (2.1.2) specify the dynamics of the system and the expected cost-to-go, respectively. The cost-to-go is defined as the total cost expected to accumulate if the system is initialized in state x at time t and evolved until t = T. It has already been implied by equation (2.1.2) that the expected cost-to-go is not a function of u, since we will assume going forward that the control law is a deterministic function of the state, x. The symbols that appear in these equations are defined below:
f ≡ system dynamics
F ≡ noise process
ω ≡ Brownian motion noise, ω ∈ R^p
x ≡ state, x ∈ R^n
u ≡ control input, u ∈ R^m
v ≡ cost-to-go
l ≡ instantaneous cost rate
h ≡ penalty on the final state

The iLQG algorithm is based upon linearization of the dynamics and a quadratic expansion of the cost around a nominal trajectory. Thus we define a nominal state trajectory and set of control inputs as x̄ and ū, respectively, and small perturbations around these series such that δx = x − x̄, δu = u − ū.

2.1.1.2 Linearization of the Dynamics

The first step in the derivation is to linearize the system dynamics around the nominal trajectories. This is done using a first-order Taylor series expansion and is provided in detail below.

dx/dt = f(x, u) + F(x, u) dω/dt
      = f(x − x̄ + x̄, u − ū + ū) + F(x − x̄ + x̄, u − ū + ū) dω/dt
      = f(δx + x̄, δu + ū) + F(δx + x̄, δu + ū) dω/dt

Next, we apply the first-order Taylor series expansion around x̄, ū.
dx/dt = f(x̄, ū) + F(x̄, ū) dω/dt + ∇_x f(x̄, ū) δx + ∇_u f(x̄, ū) δu + ∇_u F(x̄, ū) δu dω/dt

In this case, we ignore the variation of the noise with respect to the state (i.e. ∇_x F(x̄, ū)) because we are only interested in multiplicative noise in the control input for this particular derivation. Substituting dx̄/dt = f(x̄, ū) and noting that dδx/dt = dx/dt − dx̄/dt gives:

dδx/dt = ∇_x f(x̄, ū) δx + ∇_u f(x̄, ū) δu + F(x̄, ū) dω/dt + ∇_u F(x̄, ū) δu dω/dt

Now we can write the update rule in discrete time, setting dt = ∆t and using the subscript k to denote the current time step.

δx_{k+1} = δx_k + ∆t [ ∇_x f(x̄_k, ū_k) δx_k + ∇_u f(x̄_k, ū_k) δu_k ] + ∆t [ F(x̄_k, ū_k) + ∇_u F(x̄_k, ū_k) δu_k ] dω/dt

For convenience, we make the following substitutions:

A_k ≜ I + ∆t ∇_x f(x̄, ū)
B_k ≜ ∆t ∇_u f(x̄, ū)
C_k(δu_k) ≜ [ c_{1,k} + C_{1,k} δu_k, ..., c_{p,k} + C_{p,k} δu_k ]
  where c_{i,k} = √∆t F^[i](x̄, ū),  C_{i,k} = √∆t ∇_u F^[i](x̄, ū)
ξ_k ≜ √∆t dω/dt ∼ N(0, I_p)

⇒ δx_{k+1} = A_k δx_k + B_k δu_k + C_k(δu_k) ξ_k    (2.1.3)

Equation (2.1.3) describes the evolution of the system linearized about the nominal trajectory in discrete time. In the discrete case, K is used to denote the final time step, where t = T.
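When analytic gradients of f are unavailable, the Jacobians ∇_x f and ∇_u f that define A_k and B_k are often approximated numerically. A minimal pure-Python sketch of building A_k and B_k by central differences (the function names are illustrative; the report's own implementations are in MATLAB):

```python
def jacobian_fd(f, x, u, wrt="x", eps=1e-6):
    """Central-difference Jacobian of f(x, u) w.r.t. x or u.
    f maps (list, list) -> list; returns a list-of-rows matrix."""
    base = x if wrt == "x" else u
    n, m = len(f(x, u)), len(base)
    J = [[0.0] * m for _ in range(n)]
    for j in range(m):
        hi, lo = list(base), list(base)
        hi[j] += eps
        lo[j] -= eps
        f_hi = f(hi, u) if wrt == "x" else f(x, hi)
        f_lo = f(lo, u) if wrt == "x" else f(x, lo)
        for i in range(n):
            J[i][j] = (f_hi[i] - f_lo[i]) / (2.0 * eps)
    return J

def discretize(f, x_bar, u_bar, dt):
    """A_k = I + dt * df/dx and B_k = dt * df/du, as in equation (2.1.3)."""
    dfdx = jacobian_fd(f, x_bar, u_bar, "x")
    dfdu = jacobian_fd(f, x_bar, u_bar, "u")
    n, m = len(x_bar), len(u_bar)
    A = [[(1.0 if i == j else 0.0) + dt * dfdx[i][j] for j in range(n)]
         for i in range(n)]
    B = [[dt * dfdu[i][j] for j in range(m)] for i in range(n)]
    return A, B
```

For a double integrator f(x, u) = [x₂, u], this recovers the familiar A = [[1, ∆t], [0, 1]] and B = [[0], [∆t]].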
2.1.1.3 Linearization and Quadratization of the Cost

The instantaneous cost rate given by l may also be thought of as a state cost in the case where k < K (i.e. t < T). This will be useful for rewriting equation (2.1.2) in a discretized and recursive form. Expansion of the cost rate follows the same steps as shown for the dynamics; however, here a second-order Taylor series expansion is used instead. Note that terms involving cross derivatives (e.g. ∇_x ∇_u) are ignored.

l(t, x(t), u) = l(t, x − x̄ + x̄, u − ū + ū)
             = l(t, δx + x̄, δu + ū)
             = l(t, x̄, ū) + ∇_x l(t, x̄, ū)ᵀ δx + ½ δxᵀ ∇_xx l(t, x̄, ū) δx + ∇_u l(t, x̄, ū)ᵀ δu + ½ δuᵀ ∇_uu l(t, x̄, ū) δu

At k = K (t = T), the state cost is simply the penalty on the final state, which is given explicitly by h. In this case, the expanded cost function is given by the following:

h(x(T)) = h(x(T) − x̄(T) + x̄(T))
        = h(δx(T) + x̄(T))
        = h(x̄(T)) + ∇_x h(x̄(T))ᵀ δx + ½ δxᵀ ∇_xx h(x̄(T)) δx

For convenience, the following substitutions are made (scalar quantities are in regular type, vectors in bold):

q_k = l_k,  𝐪_k = ∇_x l_k,  Q_k = ∇_xx l_k,  𝐫_k = ∇_u l_k,  R_k = ∇_uu l_k  for k < K
q_K = h,  𝐪_K = ∇_x h,  Q_K = ∇_xx h

Note that 𝐫 and R are undefined at the final time step. With the above substitutions, the state cost for k < K becomes:
cost_k = q_k + δx_kᵀ 𝐪_k + ½ δx_kᵀ Q_k δx_k + δu_kᵀ 𝐫_k + ½ δu_kᵀ R_k δu_k    (2.1.4)

2.1.1.4 Backwards Pass - Computation of the Control Law

It is now useful to introduce the discrete-time, recursive form of equation (2.1.2), which relies on the current state cost described by equation (2.1.4).

v_k(δx_k) = cost_k + E[ v_{k+1}(δx_{k+1}) ]    (2.1.5)

Since it has been assumed that the control law, δu, is linear in δx, we now show via induction backwards in time that equation (2.1.5) also has the following form:

v_k(δx_k) = s_k + δx_kᵀ 𝐬_k + ½ δx_kᵀ S_k δx_k    (2.1.6)

First, equation (2.1.4) is substituted into equation (2.1.5). Then the system is advanced one time step using equation (2.1.3) for the next state's cost-to-go.

v_k(δx_k) = q_k + δx_kᵀ 𝐪_k + ½ δx_kᵀ Q_k δx_k + δu_kᵀ 𝐫_k + ½ δu_kᵀ R_k δu_k + E[ v_{k+1}(A_k δx_k + B_k δu_k + C_k(δu_k) ξ_k) ]

Expanding the expectation term gives:

s_{k+1} + (A_k δx_k + B_k δu_k)ᵀ 𝐬_{k+1} + ½ (A_k δx_k + B_k δu_k)ᵀ S_{k+1} (A_k δx_k + B_k δu_k) + ½ tr( Σ_{i=1}^p (c_{i,k} + C_{i,k} δu_k)(c_{i,k} + C_{i,k} δu_k)ᵀ S_{k+1} )

For the tr(·) term, we make use of the fact that tr(V U) = tr(U V). Thus, expanding the tr term gives:

½ tr(·) = ½ δu_kᵀ ( Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} C_{i,k} ) δu_k + δu_kᵀ Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} c_{i,k} + ½ Σ_{i=1}^p c_{i,k}ᵀ S_{k+1} c_{i,k}
⇒ v_k(δx_k) = q_k + s_{k+1} + ½ Σ_{i=1}^p c_{i,k}ᵀ S_{k+1} c_{i,k}
  + δx_kᵀ ( 𝐪_k + A_kᵀ 𝐬_{k+1} )
  + ½ δx_kᵀ ( Q_k + A_kᵀ S_{k+1} A_k ) δx_k
  + δu_kᵀ ( 𝐫_k + B_kᵀ 𝐬_{k+1} + Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} c_{i,k} )
  + ½ δu_kᵀ ( R_k + B_kᵀ S_{k+1} B_k + Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} C_{i,k} ) δu_k
  + δu_kᵀ B_kᵀ S_{k+1} A_k δx_k    (2.1.7)

Now we define a linear control law δu_k = l_k + L_k δx_k along with the following shortcuts:

g ≜ 𝐫_k + B_kᵀ 𝐬_{k+1} + Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} c_{i,k}
G ≜ B_kᵀ S_{k+1} A_k
H ≜ R_k + B_kᵀ S_{k+1} B_k + Σ_{i=1}^p C_{i,k}ᵀ S_{k+1} C_{i,k}

Substituting these expressions back into equation (2.1.7), combining like terms, and matching with equation (2.1.6) gives the following backwards-recursive update formulas:

⇒ S_k = Q_k + A_kᵀ S_{k+1} A_k + L_kᵀ H L_k + L_kᵀ G + Gᵀ L_k
𝐬_k = 𝐪_k + A_kᵀ 𝐬_{k+1} + L_kᵀ g + L_kᵀ H l_k + Gᵀ l_k    (2.1.8)
s_k = q_k + s_{k+1} + ½ Σ_{i=1}^p c_{i,k}ᵀ S_{k+1} c_{i,k} + l_kᵀ g + ½ l_kᵀ H l_k    (2.1.9)

The cost-to-go function is initialized with S_K = Q_K, 𝐬_K = 𝐪_K, and s_K = q_K. Finally, the cost-to-go function can be written as:
v_k(δx_k) = q_k + s_{k+1} + ½ Σ_{i=1}^p c_{i,k}ᵀ S_{k+1} c_{i,k} + δx_kᵀ ( 𝐪_k + A_kᵀ 𝐬_{k+1} ) + ½ δx_kᵀ ( Q_k + A_kᵀ S_{k+1} A_k ) δx_k + δu_kᵀ ( g + G δx_k ) + ½ δu_kᵀ H δu_k

The above expression can now be minimized with respect to the control signal δu_k.

∂v_k(δx_k)/∂δu_k = g + G δx_k + H δu_k ≡ 0   ⇒   δu_k = −H⁻¹ ( g + G δx_k )

When H has some negative eigenvalues, the above expression could potentially make v_k arbitrarily negative. However, since v_k is the cost-to-go function, it must be non-negative. The solution, in this case, involves augmenting H by forcing its eigenvalues to meet a minimum threshold. This is accomplished by diagonalization, H = V D Vᵀ, and subsequently replacing eigenvalues in D that are less than some λ_min with λ_min to form D̃. Thus the augmented matrix becomes H̃ = V D̃ Vᵀ. Now, the control law can be specified:

⇒ l_k = −V D̃⁻¹ Vᵀ g,   L_k = −V D̃⁻¹ Vᵀ G

2.1.1.5 Forward Pass - Computation of the New Control Signal, Trajectory, and Cost

At this point, we perform a forward pass through the linearized dynamics to compute the new open-loop control signal. By definition, δx_1 = 0.
δu_k = α l_k + L_k δx_k
δx_{k+1} = A_k δx_k + B_k δu_k

Here, α is introduced as a backtracking line search parameter which is initialized at α = 1 and reduced by some factor η < 1 if the cost of the new trajectory does not give an improvement. Then we set u = ū + δu and compute a new state trajectory using equation (2.1.1) in an Euler integration scheme:

x_{k+1} = x_k + ∆t f(x_k, u_k)

The expected cost is computed using equation (2.1.2) in a similar setup. If there is an improvement on the cost, then we set x̄ = x and ū = u and start the process over, beginning with linearization of the dynamics around the new trajectory. If the cost improvement is smaller than some threshold value, then the algorithm is finished.

2.1.2 Cart-Pole Application

Cart-pole refers to a benchmark control problem in which a weighted pendulum is attached to a cart which can be forced to roll along a single dimension. The dynamics of the system are given by the following equations [3]:

ẍ = [ f + m_p sin θ ( l θ̇² + g cos θ ) ] / ( m_c + m_p sin² θ )    (2.1.10)

θ̈ = [ −f cos θ − m_p l θ̇² cos θ sin θ − (m_c + m_p) g sin θ ] / [ l ( m_c + m_p sin² θ ) ]    (2.1.11)

Figure 1 shows a schematic of the cart-pole system and identifies each of the parameters in the equations above. For this demonstration, m_c = 1 kg, m_p = 0.01 kg, l = 0.25 m, and g = 9.81 m/s². The initial configuration of the cart-pole is specified as x = ẋ = θ = θ̇ = 0, meaning the cart is at rest with the pendulum
Figure 1: A schematic of the cart-pole system with all relevant parameters indicated

hanging downwards. The goal state for this demonstration is x = ẋ = θ̇ = 0, θ = π, corresponding to the cart returning to its initial x position after swinging the pendulum upright. A time horizon of 2 seconds is chosen with a time step of 0.02 seconds. Lastly, the control input f is constrained to be |f| ≤ 5. The performance index used for this demonstration is given by:

J_0 = [ x(T), θ(T) − π, ẋ(T), θ̇(T) ] diag(5, 1000, 50, 5) [ x(T), θ(T) − π, ẋ(T), θ̇(T) ]ᵀ + ∫_0^T 10⁻⁴ u(t)² dt

Figures 2, 3, and 4 show the optimized state trajectories, control signal, and cost evolution of the cart-pole system, respectively. In this case, the goal state was achieved even though there was some saturation of the control signal.

2.1.3 Double Cart-Pole Application

The double cart-pole problem is similar to the cart-pole problem described in the previous section. In this case, however, there is a second pendulum attached to the end of the first pendulum. Thus there is one additional state for the second link position, θ_2 (relative to the first link), and an additional state for the second link angular velocity, θ̇_2. The dynamics are given by the following equations [4]:
Figure 2: State trajectories for the cart-pole system over 2 seconds showing the goal state being met

A [ ẍ, θ̈_1, θ̈_2 ]ᵀ = b    (2.1.12)

where
Figure 3: Control signal for the cart-pole system over 2 seconds corresponding to the positive force applied to the cart

Figure 4: Cart-pole actual cost of the trajectory after each iteration

A = [ 2(m_c + m_p1 + m_p2),   −(m_p1 + 2m_p2) l_1 cos θ_1,   −m_p2 l_2 cos θ_2 ;
      −(3m_p1 + 6m_p2) cos θ_1,   (2m_p1 + 6m_p2) l_1,   3m_p2 l_2 cos(θ_1 − θ_2) ;
      −3 cos θ_2,   3l_1 cos(θ_1 − θ_2),   2l_2 ]

b = [ 2u − (m_p1 + 2m_p2) l_1 θ̇_1² sin θ_1 − m_p2 l_2 θ̇_2² sin θ_2 ;
      (3m_p1 + 6m_p2) g sin θ_1 − 3m_p2 l_2 θ̇_2² sin(θ_1 − θ_2) ;
      3l_1 θ̇_1² sin(θ_1 − θ_2) + 3g sin θ_2 ]

Figure 5 shows the double cart-pole problem schematically, specifying all of the salient features. Note, the definition for θ_1 in this problem has changed in that
Figure 5: Schematic of the double cart-pole problem with all relevant features specified

θ_1 = 0 corresponds to the first link pointing upright. The objective of the double cart-pole application will be the same as in the single cart-pole: to start in the state x = ẋ = θ̇_1 = θ̇_2 = 0, θ_1 = θ_2 = π, corresponding to the cart at rest with the links hanging downwards, and to swing the links up to the state x = ẋ = θ̇_1 = θ̇_2 = θ_1 = θ_2 = 0. For this demonstration the parameters are chosen to be m_c = 10 kg, m_p1 = 3 kg, m_p2 = 1 kg, l_1 = 1.4 m, l_2 = 1 m, g = 9.81 m/s². The horizon and time discretization are chosen as 2 s and 0.02 s, respectively. The performance index used for the demonstration is given by:

J_0 = x(T)ᵀ Q_F x(T) + ∫_0^T 10⁻⁴ u(t)² dt

x = [ x, ẋ, θ_1, θ̇_1, θ_2, θ̇_2 ]ᵀ
Q_F = diag(5, 5, 1000, 50, 1000, 50)

Figures 6, 7, and 8 show the optimized state trajectories, control signal, and cost evolution of the double cart-pole system, respectively. In this case, the goal state was achieved within the allotted time horizon. The control force limitations were widened for this application to ensure the goal state could be reached.
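For concreteness, the single cart-pole dynamics (2.1.10)-(2.1.11) and the Euler rollout with backtracking line search of Section 2.1.1.5 can be sketched together in pure Python. This is an illustrative translation only: the terminal cost here is a toy penalty on the pole angle, the feedback term L_k δx_k is omitted for brevity, and the report's actual implementations are in MATLAB.

```python
import math

MC, MP, LEN, GRAV = 1.0, 0.01, 0.25, 9.81   # single cart-pole parameters

def cartpole(state, f):
    """Equations (2.1.10)-(2.1.11); state = (x, xdot, theta, thetadot)."""
    _, xdot, th, thdot = state
    s, c = math.sin(th), math.cos(th)
    denom = MC + MP * s * s
    xdd = (f + MP * s * (LEN * thdot ** 2 + GRAV * c)) / denom
    thdd = (-f * c - MP * LEN * thdot ** 2 * c * s
            - (MC + MP) * GRAV * s) / (LEN * denom)
    return (xdot, xdd, thdot, thdd)

def rollout(x0, u, dt=0.02):
    """Euler integration x_{k+1} = x_k + dt * f(x_k, u_k), with the
    10^-4 u^2 running cost and a toy terminal penalty on theta only."""
    x, cost = list(x0), 0.0
    for uk in u:
        dx = cartpole(x, uk)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
        cost += 1e-4 * uk ** 2 * dt
    cost += 1000.0 * (x[2] - math.pi) ** 2   # illustrative terminal cost
    return x, cost

def line_search(x0, u_bar, l_ff, eta=0.5, max_iter=10):
    """Backtracking on alpha: accept u = u_bar + alpha * l if cost improves,
    otherwise shrink alpha by the factor eta."""
    _, base_cost = rollout(x0, u_bar)
    alpha = 1.0
    for _ in range(max_iter):
        u_new = [ub + alpha * lk for ub, lk in zip(u_bar, l_ff)]
        _, cost = rollout(x0, u_new)
        if cost < base_cost:
            return u_new, cost, alpha
        alpha *= eta
    return u_bar, base_cost, 0.0
```

Starting at rest with zero force, the rollout stays at the downward equilibrium, and a feedforward push in the right direction is accepted by the line search at some α > 0.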
2.1.4 Quadrotor Application

In this section, iLQG is applied to a 6DOF quadrotor model. The state, in this case, is a combination of linear positions, linear velocities, angular positions, and angular velocities, while the control is simply the 4 thrust forces of the rotors:

x = [ x, y, z, ẋ, ẏ, ż, φ, θ, ψ, φ̇, θ̇, ψ̇ ]ᵀ
u = [ F_1, F_2, F_3, F_4 ]ᵀ

The objective for this demonstration is to start at position (−1, 1, 0.5) and finish at position (0.5, −1, 1.5) after 3 seconds. All velocities and angles begin at zero and should end at zero; thus the quadrotor begins at rest at the start position and should reach the goal position and stop there. Figure 9 shows the quadrotor problem schematically, identifying all of the angles and their directions [1]. For the dynamics model, it is useful to define the following quantities:

u_1 = F_1 + F_2 + F_3 + F_4    (thrust force)
u_2 = F_4 − F_2    (roll force)
u_3 = F_1 − F_3    (pitch force)
u_4 = 0.05 (F_2 + F_4 − F_1 − F_3)    (yaw moment)

Then the linear accelerations are given by [5]:

ẍ = (1/m)(cos φ sin θ cos ψ + sin φ sin ψ) u_1
ÿ = (1/m)(cos φ sin θ sin ψ − sin φ cos ψ) u_1
z̈ = (1/m) cos φ cos θ u_1
Table 1: Quadrotor parameters and values

Symbol   Description                                 Value    Units
J_x      Moment of inertia about x-axis              0.0081   kg·m²
J_y      Moment of inertia about y-axis              0.0081   kg·m²
J_z      Moment of inertia about z-axis              0.0142   kg·m²
L        Distance of rotor to the center of mass     0.24     m
m        Quadrotor mass                              1        kg
g        Gravitational acceleration                  9.81     m/s²

And the angular accelerations are given by [5]:

φ̈ = (1/J_x)( θ̇ ψ̇ (J_y − J_z) + L u_2 )
θ̈ = (1/J_y)( φ̇ ψ̇ (J_z − J_x) + L u_3 )
ψ̈ = (1/J_z)( φ̇ θ̇ (J_x − J_y) + u_4 )

Table 1 enumerates the parameters used in the quadrotor model along with the values chosen for this demonstration. The performance index for this demonstration is given by:

J_0 = (x(T) − x_G)ᵀ Q_F (x(T) − x_G) + ∫_0^T [ x(t)ᵀ Q x(t) + u(t)ᵀ u(t) ] dt

Q_F = diag(100, 100, 100, 10, 10, 10, 50, 50, 0, 5, 5, 5)
Q = diag(0, 0, 0, 0, 0, 0, 10, 10, 10, 0, 0, 0)
x_G = [ 0.5, −1, 1.5, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]ᵀ

Figures 10-12 show the results of running iLQG on the quadrotor problem. Each of the state goals is met within the allotted time horizon of 3 seconds.
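The thrust-to-acceleration map above can be transcribed directly as a sanity check. A hypothetical Python sketch using the values from Table 1 (note that it implements the acceleration equations exactly as written above, with no additional terms):

```python
import math

JX, JY, JZ, L_ARM, MASS = 0.0081, 0.0081, 0.0142, 0.24, 1.0  # Table 1 values

def quad_accels(F, angles, rates):
    """Linear and angular accelerations from rotor thrusts F = [F1..F4],
    Euler angles (phi, theta, psi), and body rates (dphi, dtheta, dpsi)."""
    F1, F2, F3, F4 = F
    phi, th, psi = angles
    dphi, dth, dpsi = rates
    u1 = F1 + F2 + F3 + F4                 # thrust force
    u2 = F4 - F2                           # roll force
    u3 = F1 - F3                           # pitch force
    u4 = 0.05 * (F2 + F4 - F1 - F3)        # yaw moment
    xdd = (math.cos(phi) * math.sin(th) * math.cos(psi)
           + math.sin(phi) * math.sin(psi)) * u1 / MASS
    ydd = (math.cos(phi) * math.sin(th) * math.sin(psi)
           - math.sin(phi) * math.cos(psi)) * u1 / MASS
    zdd = math.cos(phi) * math.cos(th) * u1 / MASS
    phidd = (dth * dpsi * (JY - JZ) + L_ARM * u2) / JX
    thdd = (dphi * dpsi * (JZ - JX) + L_ARM * u3) / JY
    psidd = (dphi * dth * (JX - JY) + u4) / JZ
    return (xdd, ydd, zdd), (phidd, thdd, psidd)
```

With equal rotor thrusts and zero angles, the lateral accelerations and all angular accelerations vanish, while an F₂/F₄ vs. F₁/F₃ imbalance produces a pure yaw acceleration u₄/J_z.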
Figure 6: State trajectories for the double cart-pole system over 2 seconds showing the goal state being met
Figure 7: Control signal for the double cart-pole system over 2 seconds corresponding to the positive force applied to the cart

Figure 8: Double cart-pole actual cost of the trajectory after each iteration

Figure 9: Schematic representation of the quadrotor dynamic model [1].
Figure 10: Quadrotor state trajectories from 0 to 3 seconds showing the vehicle meeting the goal state ((a) position, (b) angle, (c) velocity, and (d) angular rate trajectories)

Figure 11: Quadrotor control effort for each rotor
Figure 12: Quadrotor actual cost of the trajectory after each iteration
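Before moving to unknown dynamics, the backward recursion of Section 2.1.1.4 can be illustrated in the deterministic scalar case, where the noise terms c_{i,k} and C_{i,k} vanish, so g = 𝐫 + B𝐬′, G = BS′A, and H = R + B²S′. A hypothetical pure-Python sketch (names illustrative), using max(H, λ_min) as the scalar analogue of the eigenvalue clamp H̃ = V D̃ Vᵀ:

```python
def backward_pass_scalar(A, B, Q, R, r, q_lin, horizon, Q_f, q_f,
                         lam_min=1e-6):
    """Scalar, noise-free backward pass implementing equations
    (2.1.8)-(2.1.9). Returns feedforward l_k, feedback L_k, and S_k."""
    S, s_vec = Q_f, q_f                     # terminal conditions S_K, s_K
    l = [0.0] * horizon
    L = [0.0] * horizon
    S_traj = [0.0] * (horizon + 1)
    S_traj[horizon] = S
    for k in range(horizon - 1, -1, -1):
        g = r + B * s_vec
        G = B * S * A
        H = max(R + B * B * S, lam_min)     # scalar eigenvalue clamp
        l[k] = -g / H
        L[k] = -G / H
        # S_k = Q + A^2 S' + L H L + 2 L G  (equation for S_k)
        S = Q + A * A * S + L[k] * H * L[k] + 2.0 * L[k] * G
        # s_k = q + A s' + L g + L H l + G l  (equation 2.1.8)
        s_vec = q_lin + A * s_vec + L[k] * g + L[k] * H * l[k] + G * l[k]
        S_traj[k] = S
    return l, L, S_traj
```

With L_k = −G/H substituted, the S update reduces to the familiar scalar Riccati iteration S_k = Q + A²S' − (ABS')²/(R + B²S'), so over a long horizon S_0 approaches the Riccati fixed point.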
CHAPTER III

UNKNOWN DYNAMICS

In the last chapter, iLQG was introduced and derived as an algorithm useful for situations when the system dynamics are known. In this chapter, PDDP will be introduced as an algorithm useful for situations when the system dynamics are not known. Although there are other algorithms designed to solve the same problems, these two are discussed in this report to provide a foundation for the extensions proposed in the following chapter.

3.1 Probabilistic Differential Dynamic Programming

Probabilistic Differential Dynamic Programming (PDDP) is an algorithm proposed by Yunpeng Pan and Evangelos Theodorou as an extension to the well-known differential dynamic programming (DDP) algorithm [9]. DDP and iLQG are extremely similar in their derivation; for this reason, PDDP is treated essentially as an extension of iLQG in this context. The problem formulation of PDDP matches that of iLQG exactly.

3.1.1 Gaussian Processes

The innovation of PDDP lies in its use of a Gaussian Process to represent the system dynamics instead of a collection of differential equations. This relieves the need to have an explicit model of the system dynamics, as was the case in iLQG. Instead, the GP-based representation relies on observations of the dynamics for given state and control inputs. More precisely, the GP assumes a joint Gaussian distribution over a sequence of state-control pairs, X̃ = {(x_0, u_0), ..., (x_H, u_H)}, and corresponding state transitions dX = {dx_0, ..., dx_H} measured from the system itself [9]. Thus the state transition is found through the process of GP regression, which can be interpreted as
regression over functions. In the case of PDDP, the function of interest is the system dynamics (see equation (2.1.1), for example), or rather, the state transition function. A complete treatment of the general subject of Gaussian Processes can be found in Gaussian Processes for Machine Learning by Rasmussen and Williams [10]. For the remainder of the report, a "sample from a GP" refers to the set of values of the sampled function evaluated at the inputs.

GPs offer a nonparametric representation of the state transition function, which allows them to be extremely flexible in representing highly nonlinear systems. There is freedom, however, in choosing the covariance function to specify the joint Gaussian distribution over the observed data. The covariance function provides a way to quantify the similarity of one input point (i.e. one state-control pair, in this case) to another input point. In a conceptual sense, covariance functions are analogous to kernel functions in general machine learning. The idea is that the more similar two points are, as measured using the covariance function, the more correlated their outputs will be. Thus in the regression setting, a query point's output will be heavily informed by similar points with known outputs.

3.1.2 Implications of the GP Representation

The use of a GP representation for the system dynamics presents a couple of challenges. By its very nature, GP regression incorporates uncertainty; thus, in this case, the state transition value that is returned is, itself, a random variable. In a discrete-time setting, the next state is usually found as x_{k+1} = x_k + dx_k ∆t. The uncertainty in the state transition, dx_k, induces uncertainty in the state, x_{k+1}. Meanwhile, the state is treated as an input into the GP regression. This input uncertainty causes the predictive distribution for the state transition to be analytically intractable, as it is non-Gaussian [9].
However, PDDP approximates the predictive distribution by a Gaussian, wherein the predictive mean and covariances are computed by employing
the moment matching approach [9]. The uncertainty in the state values also necessitates augmentation to the state vector seen in regular DDP (and iLQG). The augmented state vector contains the means of each state along with a flattened version of the covariance matrix. This allows the uncertainty to be carried through every part of the algorithm, allowing the algorithm to balance the exploration strategy between closeness to the goal state and uncertainty (variance) of the state [9]. Incidentally, the Jacobian matrices described in the iLQG section (i.e. ∇_x f(x̄, ū) and ∇_u f(x̄, ū) in equation (2.1.3)) can be computed analytically as a result of the definition of the augmented state, relieving the need for numerical differentiation.
CHAPTER IV

EXTENDING PDDP AND ILQG

PDDP extends DDP/iLQG by representing the system dynamics as a Gaussian Process, which is data driven and uncertain, as opposed to a set of differential equations. The noise model considered in PDDP is, however, quite simple, as it only contains standard Brownian motion noise (dω) in the dynamics [9]:

dx = f(x, u) dt + C dω,   x(t_0) = x_0,   dω ∼ N(0, Σ_ω)

Thus the desire to extend PDDP comes from the desire to incorporate a more sophisticated noise model and a more adaptive regression framework.

4.1 Gaussian Process Regression Networks

The main extension proposed to PDDP is the use of a Gaussian Process Regression Network (GPRN) representation of the system dynamics, in lieu of the regular GP representation. GPRNs were originally proposed by Andrew Wilson, David Knowles, and Zoubin Ghahramani [12]. GPRNs maintain desirable aspects of the GP representation, such as modeling the dynamics using data and allowing uncertainty to be considered. Additionally, GPRNs offer input (predictor, or "state", in this case) dependent signal and noise correlations between multiple output variables (state transitions), and input dependent length-scales and amplitudes. Inclusion of signal noise allows the separation of noise effects of the physical process from measurement noise, for example. The dependence of the noise and length scales on the input allows the regression process to adapt to the input, making the framework more flexible. Finally, the ability to account for signal and noise correlations between multiple output
variables strengthens the predictive capability of the GPRN over the standard GP. Imagine that we are trying to predict the angular acceleration of a vehicle about a particular axis (i.e. a state transition variable) at a point that was not used to train the GPRN. In standard GP regression, the prediction is only really informed by the measurements of that angular acceleration used to train the model. Accounting for correlations between multiple outputs enables us to use all of the measurements, which enhances the prediction, since angular acceleration may be correlated with linear velocity for a particular vehicle. This correlation between angular acceleration and linear velocity may also depend on the state of the vehicle (e.g. altitude); thus GPRN's input dependent correlations further enhance prediction. In general, this is referred to as a multi-task method [12]. The case for using a GPRN to represent the dynamics is clear, as it satisfies the desire to incorporate a more sophisticated noise model and a more adaptive regression framework.

Conceptually, a GPRN may be described simply as a mixture of GPs wherein the mixture weights are also GPs. Figure 13, below, gives a schematic representation of a GPRN [12]. It is important to note that the input, x, is in fact a vector, as it represents the state. The functions f̂_i(x) are GPs, each with their own noise, and are referred to as nodes. The number of nodes is a choice for the designer of the GPRN and is usually determined by experimenting to see which number produces the best results. Each output (i.e. state transition) element, y_j(x), is a linear combination of the outputs of each node. The weights, W_ij(x), applied to the node outputs are also input-dependent GPs. Regression in the GPRN framework is quite different from a standard GP. In standard GP regression, we predict the output given the input and data comprised of input/output training pairs.
In a GPRN, we must predict the values of the node and weight GPs, whose values are latent and confounded in the output values; thus the standard regression must be altered.
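At a single input x, the output construction just described reduces to mixing sampled node values with sampled weight values. A small hypothetical sketch (the function and its arguments are illustrative, not from [12]):

```python
def gprn_output(W, f_hat, noise=None):
    """y_j(x) = sum_i W_ij(x) * f_hat_i(x) for one input x, given sampled
    weight values W (rows: outputs, cols: nodes) and node values f_hat.
    An optional per-output additive noise sample may be supplied."""
    y = [sum(w_ji * fi for w_ji, fi in zip(row, f_hat)) for row in W]
    if noise is not None:
        y = [yj + nj for yj, nj in zip(y, noise)]
    return y
```

Averaging this quantity over many sampled (W, f̂) pairs is what marginalizes out the latent node and weight values in the prediction step described below.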
Figure 13: Schematic representation of a Gaussian Process Regression Network [12]

4.1.1 Elliptical Slice Sampling

Wilson et al. [12] solve this problem using a technique known as elliptical slice sampling (ESS). Elliptical slice sampling [7], proposed by Iain Murray, Ryan Prescott Adams, and David J.C. MacKay, is designed to be a generic, user-friendly (it avoids parameter tuning) sampling technique that may be viewed as an extension of slice sampling [8]. ESS excels in sampling from posteriors with strongly correlated Gaussian priors, which is the case in GPRNs. For a GPRN, ESS can be used to sample from the posterior distribution of the node and weight GP functions evaluated at all of the training points. This distribution is conditioned on the input/output pairs and the hyperparameters for each GP. The ESS algorithm takes the current sample and a new sample from the prior distribution and parameterizes the new sample as f′ = f cos θ + ν sin θ, where ν is the prior sample, f is the current sample, f′ is the new sample, and θ is the free parameter [7]. The current sample and prior sample specify an ellipse that is centered at the origin when viewed in this light. Slice
sampling is then performed, in which a θ value is selected uniformly over the support of the region in which the likelihood of the new sample is above a randomly chosen threshold. Samples from the node and weight GPs evaluated at all of the training points allow us to draw samples from the predictive distribution for node and weight values at the query point. The output is simply a function of these values. This process is repeated a number of times and the output value is averaged to marginalize out the effect of the node and weight values.

4.1.1.1 Demonstration: Sampling from a Standard Gaussian

This section and the following demonstrate the usage of the ESS algorithm on two toy problems: sampling from a standard Gaussian and sampling from a GP posterior. They rely on the MATLAB implementation made available by Iain Murray as a supplement [7]. The first demonstration compares samples drawn from standard methods for Gaussian distributions (in MATLAB, normrnd) with samples drawn using ESS. These two approaches are fundamentally different due to the fact that ESS requires samples from a prior. In order to emulate the behavior of normrnd, a uniform distribution is chosen to keep the prior as uninformative as possible. Figure 14, below, shows normalized histograms of the samples from each method, overlaid with the true probability distribution function, for a variety of population sizes and choices of prior.

For this application, the histograms in Figure 14 indicate that the ESS samples are sensitive to the range of the uniform prior. If the bounds on the prior are too far spread apart, ESS tends to produce large outliers more frequently than normrnd. On the other hand, reducing the range of the uniform prior tends to bias the ESS samples, thus skewing the distribution. Clearly, the U(−0.5, 0.5) prior would be insufficient for sampling from N(0, 1).
Figure 14: Comparison of ESS samples to normrnd on a standard Gaussian

4.1.1.2 Demonstration: Sampling from a GP Posterior

ESS is more fitting for this application since, in the Gaussian process case, a prior is specified. This demonstration roughly follows an example from Murray et al. [7] in which ESS is used to generate samples from a GP posterior. For visualization purposes, this demonstration is for a GP with only one input dimension and one output dimension. Note, the predictive distribution of a GP can be computed analytically, so this provides a baseline against which to compare the ESS samples.
First, a synthetic dataset is generated with a sample size of N = 200, with input values {x_n}_{n=1}^N sampled from U(0, 1). Then the covariance matrix, Σ, is computed using the following definition:

Σ_ij = σ_f² exp( −½ (x_i − x_j)² / l² )    (4.1.1)

This particular covariance function is the commonly used squared-exponential covariance. Here, we set l = 1 and σ_f² = 1. The function values are then drawn from the GP prior: f ∼ N(0, Σ). In practice, the true function values are not known; thus Gaussian noise is added to the function values with variance σ_n² = 0.3² to create the observations, y.

ESS requires the log-likelihood function (or the probability distribution function of the distribution being sampled). Taking the posterior samples evaluated at all of the input values simplifies the process, since the likelihood function takes a simple form:

L(f) = ∏_n N(y_n; f_n, σ_n²)    (4.1.2)

The ability to sample from the GP prior and to evaluate the log-likelihood function is all that is needed to sample from the posterior using ESS.

Rasmussen's book [10] gives the process for sampling from the GP posterior directly. The posterior has a joint Gaussian distribution whose mean and covariance are given by the following:

f̄_* ≜ E[ f_* | y, Σ ] = Σ (Σ + σ_n² I)⁻¹ y    (4.1.3)

cov(f_*) = Σ − Σ (Σ + σ_n² I)⁻¹ Σ    (4.1.4)

Figure 15, below, contains the results of taking 100 samples from the GP posterior using both ESS and the direct method. Note, there is a vertical spread in the samples at each location due to the fact that the function values at each location come from
its own Gaussian distribution. Several ESS samples stand out as being very unlikely, in that they lie much further from the true function values plotted in the figure.

Figure 15: Comparison of ESS samples to the direct method for GP regression

This issue can be attributed to the fact that ESS produces chains of samples. The chain must start somewhere, and in this case it begins with a sample from the prior distribution. A trace of the sample log-likelihood values in Figure 16 reveals that the likelihood starts out quite low before converging to a stable value. Employing a burn-in strategy, similar to those used with many other sampling techniques, wherein a certain number of the initial samples are discarded, yields the results shown in Figure 17. In this case the first 20 ESS samples were discarded. The remaining samples are indistinguishable from the direct samples, demonstrating the viability of this technique for GP regression.
Figure 16: Log-likelihood values of a chain of ESS samples for GP regression

Figure 17: Comparison of ESS samples to the direct method for GP regression with burn-in
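The ESS update and the burn-in strategy described above can be sketched as follows. This is a Python illustration of the algorithm of Murray et al. [7], not the authors' reference implementation; the toy scalar prior and likelihood are chosen here (hypothetically) so that the posterior is known in closed form and the chain can be checked against it.

```python
import numpy as np

def ess_step(f, nu_sample, log_lik, rng):
    """One elliptical slice sampling update (Murray et al. [7])."""
    nu = nu_sample()                              # auxiliary draw from the prior
    log_u = log_lik(f) + np.log(rng.uniform())    # slice height under current state
    theta = rng.uniform(0.0, 2.0 * np.pi)         # initial proposal angle
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_u:
            return f_new                          # proposal lies on the slice
        # Shrink the angle bracket toward theta = 0 and propose again;
        # theta = 0 recovers the current state, so this always terminates.
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

# Toy check: prior f ~ N(0, 1), likelihood y ~ N(f, s2), observed y = 1.
# The posterior is then N(y / (1 + s2), s2 / (1 + s2)) in closed form.
rng = np.random.default_rng(1)
y, s2 = 1.0, 0.25
log_lik = lambda f: -0.5 * (y - f)**2 / s2
nu_sample = lambda: rng.standard_normal()

burn_in, n_keep = 20, 5000
f = nu_sample()                                   # chain starts at a prior draw
kept = []
for i in range(burn_in + n_keep):
    f = ess_step(f, nu_sample, log_lik, rng)
    if i >= burn_in:                              # discard early low-likelihood samples
        kept.append(f)
chain_mean = np.mean(kept)
```

Discarding the first 20 samples mirrors the burn-in used for Figure 17: the early part of the chain still reflects the prior draw it started from, while the retained samples concentrate around the posterior mean.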
4.2 Implications of GPRN on PDDP

Uncertainty in the inputs is incorporated into PDDP using a moment-matching approach. Incorporating input uncertainty into the GPRN scheme being considered is much more complex, because the inputs and outputs in a GPRN are separated by a thick layer of latent variables: the node and weight GPs. In this initial work, it is advantageous, from the standpoint of conceptual complexity, to drop the uncertainty in the input. The most obvious way to do so is simply to take the mean of the distribution describing the input. Since the next state is simply a function of the current state and the state transition, we need only take the mean of the predictive state-transition distribution (specified by the GPRN). As this distribution is an average over a large number of intermediate Gaussian samples, each incorporating different node and weight samples, its mean is simply the mean of all the Gaussian means. In terms of computational complexity, GPRN generally scales linearly with the dimension of the output and with the number of latent nodes. The computational complexity is dominated by the number of data points considered, just as it is for standard GP regression. The GPRN representation will likely be more burdensome in practice because it is sample-based, as opposed to the analytical representation used in PDDP. The biggest differences between a GPRN representation and a GP representation are how the predictive distribution's mean and covariance are calculated and how the model's hyperparameters are learned.
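The "mean of the Gaussian means" computation above can be made concrete with a short sketch. This is a Python/NumPy illustration on synthetic per-sample moments; the names and numbers are hypothetical stand-ins for the moments produced by joint draws of the node and weight GPs.

```python
import numpy as np

# Hypothetical per-sample predictive moments: entry k holds the mean and
# variance of the Gaussian produced by one joint draw of node/weight GPs.
rng = np.random.default_rng(2)
K = 500
mu = rng.normal(0.3, 0.05, K)      # per-sample predictive means
var = rng.uniform(0.01, 0.02, K)   # per-sample predictive variances

# The predictive mean of the averaged distribution is simply the mean
# of the Gaussian means, as used when input uncertainty is dropped.
mean_pred = mu.mean()

# If a single Gaussian summary is wanted, moment matching also gives its
# variance: the average within-sample variance plus the spread of the means.
var_pred = var.mean() + mu.var()
```

Only `mean_pred` is needed for the simplified propagation described above; `var_pred` shows how the same samples would support a moment-matched Gaussian if the uncertainty were retained later.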
4.3 Remaining Work

Much of the conceptual framework for incorporating a GPRN representation into PDDP and iLQG has been developed and discussed throughout this report. Work remains, however, to complete this effort. The biggest missing piece at this point is a process for learning the hyperparameters of the GPRN. Each individual GP, for each node and for each weight, comes with its own set of hyperparameters, which may vary depending on the choice of covariance function. For a system with p outputs and q nodes, there are p(q + 1) sets of hyperparameters to learn. One traditional approach to learning the parameters of a distribution is maximum likelihood (ML). ML cannot be used to learn the GPRN hyperparameters because the hyperparameters are associated with latent variables (i.e., variables which cannot be observed directly). Other categories of methods are being considered for suitability for this problem, including gradient-based and sampling-based (e.g., cross-entropy) methods. Wilson et al. [12] utilize a message-passing implementation of variational Bayes (VB) called variational message passing (VMP). Their implementation is built on top of Microsoft Research's Infer.NET framework, a general platform for probabilistic programming [6]. Much of the remaining work on this effort can be summarized as gaining a better understanding of VB and VMP, and then producing an implementation in MATLAB.
CHAPTER V

CONCLUSION

The intersection between machine learning and control theory is emerging rapidly; many rich applications are currently being developed. This report has explored one particular area of interest: trajectory optimization. LQR and LQG were mentioned as predecessors to iLQG, which was re-derived in this report and demonstrated on three systems of interest: the cartpole, double cartpole, and quadrotor. The iLQG algorithm is robust and applicable to a vast number of different systems. PDDP was introduced as an important extension to iLQG which incorporates uncertainty into the trajectory optimization process. The conceptual advances presented in this report centered on the use of a GPRN to represent the unknown system dynamics in lieu of the standard GP representation employed by PDDP. GPRNs offer a more sophisticated noise model and a more adaptive regression framework; specifically, they incorporate input-dependent noise and signal correlation between multiple outputs. This should enhance the algorithm's ability to represent the unknown system dynamics. Work remains to fully incorporate GPRNs into a PDDP-like algorithm: effort continues in the areas of understanding VB and VMP for learning GPRN hyperparameters, and in creating a MATLAB implementation.
APPENDIX A

MATLAB CODE LISTINGS

The links below point to files embedded within the document.

iLQG.m: my implementation of the iLQG algorithm for the purposes of this research
demo_cartpole.m: specification of dynamics, start, and goal states for the cartpole system
demo_doublecartpole.m: specification of dynamics, start, and goal states for the double cartpole system
demo_quadrotor.m: specification of dynamics, start, and goal states for the quadrotor
elliptical_slice.m: implements elliptical slice sampling [7], provided by the authors at http://homepages.inf.ed.ac.uk/imurray2/pub/10ess/
elliptical_slice_demo.m: demonstrates elliptical_slice.m on two examples: sampling from a standard Gaussian, and sampling from the posterior distribution of a simple Gaussian process
REFERENCES

[1] Bolandi, H., Rezaei, M., Mohsenipour, R., Nemati, H., and Smailzadeh, S., "Attitude control of a quadrotor with optimized PID controller," Intelligent Control and Automation, vol. 4, no. 3, pp. 335–342, 2013.

[2] Deisenroth, M. and Rasmussen, C. E., "PILCO: A model-based and data-efficient approach to policy search," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011.

[3] Gandhi, M., "Cart-pole dynamics." Private communication, 2015. Georgia Institute of Technology student.

[4] Gandhi, M., "Double cart-pole dynamics." Private communication, 2015. Georgia Institute of Technology student.

[5] Gandhi, M., "Quadrotor dynamics." Private communication, 2015. Georgia Institute of Technology student.

[6] Minka, T., Winn, J., Guiver, J., Webster, S., Zaykov, Y., Yangel, B., Spengler, A., and Bronskill, J., "Infer.NET 2.6," 2014. Microsoft Research Cambridge. http://research.microsoft.com/infernet.

[7] Murray, I., Adams, R. P., and MacKay, D. J., "Elliptical slice sampling," arXiv preprint arXiv:1001.0175, 2009.

[8] Neal, R. M., "Slice sampling," Annals of Statistics, pp. 705–741, 2003.

[9] Pan, Y. and Theodorou, E., "Probabilistic differential dynamic programming," in Advances in Neural Information Processing Systems, pp. 1907–1915, 2014.

[10] Rasmussen, C. E., Gaussian Processes for Machine Learning. MIT Press, 2006. http://www.gaussianprocess.org/gpml/chapters/.

[11] Todorov, E. and Li, W., "A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems," in 43rd IEEE Conference on Decision and Control, (Atlantis, Bahamas), December 2004.

[12] Wilson, A. G., Knowles, D. A., and Ghahramani, Z., "Gaussian process regression networks," arXiv preprint arXiv:1110.4411, 2011.
Machine Learning Applications in Complex Control Systems
Alexander L. Von Moll
36 Pages
Directed by Professor Evangelos Theodorou

The intersection between machine learning and control theory is emerging rapidly; many rich applications are currently being developed. This report explores one particular area of interest: trajectory optimization. The Linear Quadratic Regulator and Linear Quadratic Gaussian algorithms are introduced as predecessors to the Iterative Linear Quadratic Gaussian (iLQG) algorithm, which is re-derived in this report and demonstrated on three systems of interest: the cartpole, double cartpole, and quadrotor. iLQG is robust and applicable to a vast number of different systems. Probabilistic Differential Dynamic Programming (PDDP) is introduced as an important extension to iLQG which incorporates uncertainty into the trajectory optimization process. The conceptual advances presented in this report centered on the use of a Gaussian Process Regression Network (GPRN) to represent the unknown system dynamics in lieu of the standard Gaussian Process representation employed by PDDP. GPRNs offer a more sophisticated noise model and a more adaptive regression framework; specifically, they incorporate input-dependent noise and signal correlation between multiple outputs. This should enhance the algorithm's ability to represent the unknown system dynamics. Work remains to fully incorporate GPRNs into a PDDP-like algorithm: effort continues in the areas of understanding variational Bayes and variational message passing for learning GPRN hyperparameters, and in creating a MATLAB implementation.