A Deep Journey of Playing Games with RL
NSE Seminar
Kim Hammar
kimham@kth.se
January 31, 2020
Why Games
[Figure: montage of game-playing AI settings: a feed-forward neural network, the RL agent–environment loop, a ConvNet, a chess position, a game tree of depth 3 and branching factor 3, shogi boards, and game artwork]
AI & Machine Learning + Games: why combine the two?
▸ AI & games have a long history (Turing '50 & Minsky '60)
▸ Games are simple to evaluate, reproducible, and controllable, with a quick feedback loop
▸ Games are a common benchmark for the research community
1997: Deep Blue [1] vs. Kasparov
[1] Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu. "Deep Blue". In: Artificial Intelligence 134.1–2 (Jan. 2002), pp. 57–83. doi: 10.1016/S0004-3702(01)00129-1.
1992: Tesauro's TD-Gammon [2]
[2] Gerald Tesauro. "TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play". In: Neural Computation 6.2 (Mar. 1994), pp. 215–219. doi: 10.1162/neco.1994.6.2.215.
1959: Arthur Samuel's Checkers Player [3]
[3] A. L. Samuel. "Some Studies in Machine Learning Using the Game of Checkers". In: IBM Journal of Research and Development 3.3 (July 1959), pp. 210–229. doi: 10.1147/rd.33.0210.
Papers in Focus Today
▸ AlphaGo [4] (Nature 2016, 6.5k citations)
▸ AlphaGo Zero [5] (Nature 2017, 2.5k citations)
▸ AlphaZero [6] (Science 2018, 400 citations)
[4] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
[5] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–359. url: http://dx.doi.org/10.1038/nature24270.
[6] David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". In: Science 362.6419 (2018), pp. 1140–1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
The Reinforcement Learning Problem
▸ Notation: policy π, state s, reward r, action a
▸ Agent's goal: maximize the discounted return R_t = ∑_{k=0}^∞ γ^k r_{t+k+1}, where 0 ≤ γ ≤ 1
▸ RL's goal: find the optimal policy π* = arg max_π E[R | π]
[Figure: the agent–environment loop: the agent takes action a_t; the environment returns reward r_{t+1} and next state s_{t+1}]
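To make the return concrete, here is a minimal Python sketch (not from the slides) that folds a finite reward sequence into the discounted return; the reward list is a made-up example:

# Discounted return R_t = sum_k gamma^k * r_{t+k+1}, folded from the end.
def discounted_return(rewards, gamma=0.99):
    R = 0.0
    for r in reversed(rewards):  # R <- r + gamma * R
        R = r + gamma * R
    return R

print(discounted_return([0.0, 0.0, 1.0]))  # 0 + 0.99*0 + 0.99**2 * 1 = 0.9801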
RL Examples: Elevator (Crites & Barto '95 [7])
▸ The elevator agent observes the elevator positions, receives a reward r_{t+1} ∈ R, and selects an action a_{t+1} ∈ {up, down, wait, stop at floor 1, ⋯, n}
[7] Robert H. Crites and Andrew G. Barto. "Improving Elevator Performance Using Reinforcement Learning". In: Proceedings of the 8th International Conference on Neural Information Processing Systems (NIPS'95). Denver, Colorado: MIT Press, 1995, pp. 1017–1023.
RL Examples: Atari (Mnih '15 [8])
▸ The DQN agent maps screen frames ∈ R^{4×84×84} to action values Q(s, a_1), ⋯, Q(s, a_18), receives a reward r_{t+1} ∈ R, and selects an action a_{t+1}
[8] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (Feb. 2015), pp. 529–533. url: http://dx.doi.org/10.1038/nature14236.
How to Act Optimally? (Bellman '57 [9])

optimal(s_t) = max_π E[ ∑_{k=1}^∞ γ^{k−1} r_{t+k} | s_t ]

[9] Richard Bellman. Dynamic Programming. Dover Publications, 1957. isbn: 9780486428093.
How to Act Optimally? (Bellman '57 [10])

optimal(s_t) = max_π E[ ∑_{k=1}^∞ γ^{k−1} r_{t+k} | s_t ]
            = max_π E[ r_{t+1} + ∑_{k=2}^∞ γ^{k−1} r_{t+k} | s_t ]
            = max_{a_t} E[ r_{t+1} + max_π E[ ∑_{k=2}^∞ γ^{k−1} r_{t+k} | s_{t+1} ] | s_t ]
            = max_{a_t} E[ r_{t+1} + γ max_π E[ ∑_{k=2}^∞ γ^{k−2} r_{t+k} | s_{t+1} ] | s_t ]
            = max_{a_t} E[ r_{t+1} + γ optimal(s_{t+1}) | s_t ]
[10] Richard Bellman. Dynamic Programming. Dover Publications, 1957. isbn: 9780486428093.
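The last line is the Bellman recursion, and applying it repeatedly is exactly value iteration. A minimal sketch on a made-up two-state tabular MDP (the transition table and rewards below are invented for illustration):

# Value iteration on a toy tabular MDP (hypothetical transitions/rewards).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 0, 2.0)]},
}
gamma = 0.9
V = {s: 0.0 for s in P}
for _ in range(100):  # repeat the Bellman backup until (nearly) converged
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}
print(V)  # approximately optimal state values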
Reinforcement Learning: An Overview
▸ Deep reinforcement learning: a neural network model θ maps features (x_1, …, x_n) to a prediction ŷ, trained by following the gradient ∇_θ L(y, ŷ) of a loss L(y, ŷ). Algorithms: DQN, DDPG, Double-DQN
▸ Direct policy search: search the space of policies (π_1, π_2, …) directly in the parameters w. Algorithms: REINFORCE, Evolutionary Search, DPG, Actor-Critic
▸ Value-based methods: value iteration, Monte-Carlo, TD(0), Q-Learning, Sarsa, and dynamic programming (a Q-learning sketch follows this list)
▸ Heuristic search: selection, expansion, simulation, and backpropagation over a search tree. Algorithms: MCTS, Minimax Search
▸ Mathematical foundations: Markov decision processes, Markov chains, and Markov reward processes
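As a concrete instance of the value-based family, here is a minimal tabular Q-learning sketch; the Gym-style environment interface (reset/step) is an assumption, not something defined in the slides:

import random
from collections import defaultdict

# Tabular Q-learning sketch; `env` is an assumed environment with
# env.reset() -> s and env.step(a) -> (s2, r, done).
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda a2: Q[(s, a2)]))  # eps-greedy
            s2, r, done = env.step(a)
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])           # TD update
            s = s2
    return Q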
AlphaGo (2016) [11]
[11] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
The Game of Go
▸ One of the world's oldest board games: ≈3,000 years old, over 40M players worldwide
▸ To win: capture the most territory on the board
▸ Surrounded stones/areas are captured and removed
▸ Why is it so hard for computers? ≈10^170 unique states and an average branching factor of ≈250
▸ Large board (19 × 19), positions that are hard to evaluate, etc.
[Figure: a capture in Go]
Game Trees
▸ How do you program a computer to play a board game?
▸ Simplest approach: (1) program the game tree; (2) assume the opponent thinks like you; (3) look ahead and evaluate each move (a minimal sketch follows the figure below)
▸ Requires knowledge of the game rules and an evaluation function
[Figure: a game tree of depth 3 and branching factor 3, one ply per level]
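A minimal sketch of this look-ahead idea for a generic two-player zero-sum game; legal_moves, apply, and evaluate are hypothetical game-specific helpers, not from the slides:

# Depth-limited minimax. Assumes game-specific helpers legal_moves(state),
# apply(state, move), and evaluate(state) -> score from the maximizing
# player's perspective (all hypothetical).
def minimax(state, depth, maximizing):
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state)  # leaf: fall back to the evaluation function
    values = (minimax(apply(state, m), depth - 1, not maximizing)
              for m in moves)
    return max(values) if maximizing else min(values)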
Search + Go =
[Figure: image slide]
Some Numbers
▸ Atoms in the universe: ≈ 10^80
▸ Unique states: Go 10^170, Chess 10^47
▸ Game-tree complexity: Go 10^360, Chess 10^123
▸ Average branching factor: Go 250, Chess 35
▸ Board size (positions): Go 361, Chess 64
[Plot: the number of game-tree states, 250^depth, explodes with search depth]
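A quick way to feel the blow-up is to evaluate 250^depth directly (illustrative only):

# Game-tree size per depth for Go's average branching factor of 250.
for depth in range(1, 9):
    print(depth, 250 ** depth)  # depth 8 is already ~1.5e19 positions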
Monte-Carlo Tree Search
▸ Selection: the selection function is applied recursively until a leaf node is reached
▸ Expansion: one or more nodes are created in the search tree
▸ Simulation: a rollout game is played to a terminal state, using a model of the environment M and a rollout policy π
▸ Backpropagation: the result of the rollout is backpropagated up the tree
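A skeletal Python sketch of the four phases, with UCB1 as the selection function; legal_moves, apply, is_terminal, and result are hypothetical game helpers, and per-player sign flips are omitted for brevity:

import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb1(node, c=1.4):  # selection score: exploitation + exploration
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts(root, iterations=1000):
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while every child has been visited.
        while node.children and all(ch.visits for ch in node.children):
            node = max(node.children, key=ucb1)
        # 2. Expansion: create the children of the reached leaf.
        if not node.children and not is_terminal(node.state):
            node.children = [Node(apply(node.state, m), node)
                             for m in legal_moves(node.state)]
        if node.children:
            unvisited = [ch for ch in node.children if ch.visits == 0]
            node = random.choice(unvisited or node.children)
        # 3. Simulation: random rollout to a terminal state.
        state = node.state
        while not is_terminal(state):
            state = apply(state, random.choice(legal_moves(state)))
        z = result(state)  # e.g. +1 for a win, -1 for a loss
        # 4. Backpropagation: update statistics along the path.
        while node is not None:
            node.visits += 1
            node.value += z
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits)  # most-visited move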
AlphaGo’s Approach
▸ Brute-force search does not work, at least not until hardware has improved a lot
▸ Human Go professionals rely on a small amount of search guided by intuition and experience
▸ AlphaGo's approach: complement MCTS with "artificial intuition"
▸ The artificial intuition is provided by two neural networks: a value network and a policy network
Computer Go as an RL Problem
▸ The AlphaGo agent observes board features ∈ R^{K×19×19}, receives a reward r_{t+1} ∈ [−1, 1], and selects the next move a_{t+1}
AlphaGo Training Pipeline (1/2)
▸ Stage 1 input: a dataset D1 of human expert moves
▸ Supervised rollout policy network p_π(a|s): a smaller, faster network trained by classification, min_{p_π} L(predicted move, expert move)
▸ Supervised policy network p_σ(a|s): a deeper ConvNet trained on the same classification objective, min_{p_σ} L(predicted move, expert move)
AlphaGo Training Pipeline (2/2)
▸ Reinforcement-learning policy network p_ρ(a|s): initialized with p_σ's weights and improved through self-play with policy gradients: J(p_ρ) = E_{p_ρ}[∑_{t=0}^∞ r_t], updated by ρ ← ρ + α∇_ρ J(p_ρ)
▸ Games of p_ρ against itself produce a self-play dataset D2
▸ Value network v_θ(s′): trained on D2 to predict the game outcome, min_{v_θ} L(predicted outcome, actual outcome)
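The policy-gradient step ρ ← ρ + α∇_ρ J(p_ρ) can be sketched in REINFORCE form with PyTorch autograd; the policy module and episode format are assumptions, and this simplification omits details of AlphaGo's actual training setup (such as playing against a pool of earlier policies):

import torch

# REINFORCE-style update sketch (not AlphaGo's production code).
# `policy` is any torch.nn.Module mapping a state tensor to action logits;
# `episode` is a list of (state, action, reward) tuples from one game.
def reinforce_step(policy, optimizer, episode, gamma=1.0):
    loss, R = torch.tensor(0.0), 0.0
    for state, action, reward in reversed(episode):
        R = reward + gamma * R                         # return from step t on
        logp = torch.log_softmax(policy(state), dim=-1)[action]
        loss = loss - logp * R                         # descent on -J = ascent on J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()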
Guided Search Using the Policy Network
[Figure: a search tree where the policy network p_{σ/ρ}(a|s) is queried at each expanded node to propose promising moves]
Depth-Limited Search Using the Value Network
[Figure: a depth-limited search tree whose leaf positions are evaluated by the value network v_θ(s′) rather than rolled out to the end of the game]
AlphaGo Prediction Pipeline
1. Input position: the current game state s
2. ML predictions ("artificial intuition" by NNs): the policy network p_{σ/ρ}(a|s) outputs an action distribution and the value network v_θ(s′) outputs value estimates
3. Guided look-ahead search (self-play MCTS): selection, expansion, simulation, and backpropagation, steered toward high-value actions
4. Output: the selected move
[Figure: the prediction pipeline, from the input position through the two networks to the guided search and the selected action]
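Inside the guided search, the policy network's priors and the accumulated value estimates are typically combined in a PUCT-style selection rule, roughly a* = argmax_a (Q(s,a) + c · P(a|s) · √N(s) / (1 + N(s,a))). A sketch with an illustrative data layout (not AlphaGo's code):

import math

# PUCT-style action selection sketch. Q[a] is the mean value of action a
# at this node, N[a] its visit count, prior[a] the policy network's p(a|s).
def select_action(Q, N, prior, c_puct=1.0):
    total = sum(N.values()) or 1
    def score(a):
        u = c_puct * prior[a] * math.sqrt(total) / (1 + N[a])  # exploration bonus
        return Q[a] + u
    return max(Q, key=score)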
AlphaGo vs. Lee Sedol
▸ In March 2016, AlphaGo won against Lee Sedol 4–1
▸ Lee Sedol was an 18-time world champion prior to the match
▸ Two famous moves: move 37 by AlphaGo and move 78 by Sedol
AlphaGo: Key Takeaways
1. Different AI techniques can be complementary:
▸ Supervised learning
▸ Reinforcement learning
▸ Search
▸ Rules/domain knowledge
▸ Whatever it takes to win!
2. Self-play
3. Vast computation is still required for training and inference
▸ AlphaGo used 1200 CPUs and 176 GPUs
AlphaGo Zero (2017) [12]
[12] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–359. url: http://dx.doi.org/10.1038/nature24270.
AlphaGo Zero
▸ AlphaGo Zero is the successor to AlphaGo
▸ AlphaGo Zero is both simpler and stronger than AlphaGo: it beat AlphaGo 100–0 in evaluation matches
▸ AlphaGo Zero starts from zero domain knowledge
▸ It uses a single neural network (compared to four NNs in AlphaGo)
▸ It learns by self-play only (no supervised learning from human games, unlike AlphaGo)
AlphaGo Zero Neural Network Architecture
▸ A single multi-headed ResNet f_θ(s) = (p, v) over the input state features (x × y convolution kernels)
▸ Policy head: p = (P[a_1|s], …, P[a_n|s])
▸ Value head: v(s) ≈ P[win|s]
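A minimal PyTorch sketch of the two-headed idea (shared trunk, separate policy and value heads); the layer sizes and the 17-plane input encoding are placeholders, not the paper's architecture, which stacks many residual blocks:

import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, in_planes=17, board=19, n_moves=19 * 19 + 1):
        super().__init__()
        self.trunk = nn.Sequential(            # shared convolutional trunk
            nn.Conv2d(in_planes, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Linear(64 * board * board, n_moves)
        self.value_head = nn.Linear(64 * board * board, 1)

    def forward(self, s):                      # f_theta(s) = (p, v)
        h = self.trunk(s).flatten(1)
        p = torch.softmax(self.policy_head(h), dim=-1)
        v = torch.tanh(self.value_head(h))     # value estimate in [-1, 1]
        return p, v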
AlphaGo Zero Self-Play Training Algorithm
▸ Self-play a game: from each state s_t, run MCTS to obtain a search policy π_t and sample the move a_t ∼ π_t; the finished game yields an outcome z
▸ The tuples (s_t, π_t, z) become the self-play training data
▸ The network f_θ(s_t) = (p_t, v_t) is trained by θ′ = θ − α∇_θ L((p_t, v_t), (π_t, z))
▸ L(f_θ(s_t), (π_t, z)) = (z − v_t)² − π_t^⊤ log p_t + c‖θ‖², i.e. MSE on the value plus cross-entropy on the policy plus L2 regularization (a sketch of this loss follows)
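A sketch of this loss in PyTorch; the tensor shapes are assumptions:

import torch

# AlphaGo Zero loss sketch: (z - v)^2 - pi^T log p + c * ||theta||^2.
# p: (batch, n_moves) move probabilities, v: (batch,) value predictions,
# pi: (batch, n_moves) MCTS visit distributions, z: (batch,) game outcomes.
def agz_loss(p, v, pi, z, params, c=1e-4):
    mse = ((z - v) ** 2).mean()                           # value head error
    xent = -(pi * torch.log(p + 1e-8)).sum(dim=1).mean()  # policy head error
    l2 = sum((w ** 2).sum() for w in params)              # L2 regularizer
    return mse + xent + c * l2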
AlphaGo Zero Self-Play Training Pipeline
1. Self-play with the current network f_θ = (p, v), producing data D(π, z)
2. Learning: θ_{t+1} = θ_t − α∇_θ L(f_θ, D)
3. Repeat
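Put together as pseudocode-level Python; self_play_game is a hypothetical helper returning (state, π, z) examples from one game of MCTS self-play, and agz_loss is the loss sketched above:

# Self-play training loop sketch (steps 1-3 of the pipeline).
def train(net, optimizer, iterations=1000, games_per_iter=25):
    for _ in range(iterations):
        data = []                              # step 1: self-play
        for _ in range(games_per_iter):
            data.extend(self_play_game(net))
        for s, pi, z in data:                  # step 2: learning
            p, v = net(s)
            loss = agz_loss(p, v, pi, z, list(net.parameters()))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # step 3: repeat with the improved network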
AlphaGo Zero: Key Takeaways
1. A simpler system can be more powerful than a complex one (AlphaGo Zero vs. AlphaGo)
2. Neural networks can be combined like LEGO blocks
3. ResNets outperformed traditional ConvNets here
4. Self-play
AlphaZero (2018) [13]
[13] David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". In: Science 362.6419 (2018), pp. 1140–1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
AlphaZero
▸ AlphaGo Zero is able to reach superhuman level at Go without any domain knowledge...
▸ Since AlphaGo Zero is not dependent on Go-specific knowledge, can the same algorithm play other games?
▸ AlphaZero extends AlphaGo Zero to play not only Go but also chess and shogi
▸ The same algorithm achieves superhuman performance on all three games
AlphaZero
▸ One multi-headed ResNet f_θ(s) = (p, v), with a policy head and a value head, takes Go, shogi, or chess input features
[Figure: Go, shogi, and chess board encodings feeding the same network architecture]
AlphaZero is Much Simpler than AlphaGo, Yet More Powerful
[Figure: AlphaGo, AlphaGo Zero, and AlphaZero plotted on generality vs. performance axes]
AlphaZero: Key Takeaways
1. By playing three different games at a superhuman level, AlphaZero is one step closer to general AI
2. Massive compute power is still required for training AlphaZero: 5,000 TPUs
3. Self-play
Next Challenge..
Presentation Summary
▸ Sometimes a simpler system can be more powerful than a complex one
▸ Universal research principle: strive for generality and simplicity (Occam's Razor)
▸ Self-play: no human bias, learning from first principles
▸ Deep RL is still in its infancy; a lot more is to be expected in the next few years
▸ Open challenges: sample efficiency and data efficiency
▸ Yes, AlphaGo can learn to play Go after hundreds of game-years of experience, but a human can reach a decent level of play in only a couple of hours
▸ How can we make reinforcement learning more efficient? Model-based learning is a research area receiving increasing attention
References
▸ DQN [14]
▸ AlphaGo [15]
▸ AlphaGo Zero [16]
▸ AlphaZero [17]
▸ AlphaStar [18]
[14] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (Feb. 2015), pp. 529–533. url: http://dx.doi.org/10.1038/nature14236.
[15] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
[16] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–359. url: http://dx.doi.org/10.1038/nature24270.
[17] David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". In: Science 362.6419 (2018), pp. 1140–1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
[18] Oriol Vinyals et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning". In: Nature 575 (Nov. 2019). doi: 10.1038/s41586-019-1724-z.
[19] Thanks to Rolf Stadler for reviewing and discussing drafts of this presentation.
More Related Content

Similar to A Deep Journey into Playing Games with Reinforcement Learning - Kim Hammar

pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"
pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"
pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"YeChan(Paul) Kim
 
Frege - consequently functional programming for the JVM
Frege - consequently functional programming for the JVMFrege - consequently functional programming for the JVM
Frege - consequently functional programming for the JVMDierk König
 
The Ring programming language version 1.5.3 book - Part 69 of 184
The Ring programming language version 1.5.3 book - Part 69 of 184The Ring programming language version 1.5.3 book - Part 69 of 184
The Ring programming language version 1.5.3 book - Part 69 of 184Mahmoud Samir Fayed
 
The Ring programming language version 1.9 book - Part 69 of 210
The Ring programming language version 1.9 book - Part 69 of 210The Ring programming language version 1.9 book - Part 69 of 210
The Ring programming language version 1.9 book - Part 69 of 210Mahmoud Samir Fayed
 
Learning Deep Learning
Learning Deep LearningLearning Deep Learning
Learning Deep Learningsimaokasonse
 
The Ring programming language version 1.5.4 book - Part 53 of 185
The Ring programming language version 1.5.4 book - Part 53 of 185The Ring programming language version 1.5.4 book - Part 53 of 185
The Ring programming language version 1.5.4 book - Part 53 of 185Mahmoud Samir Fayed
 
Lenses for the masses - introducing Goggles
Lenses for the masses - introducing GogglesLenses for the masses - introducing Goggles
Lenses for the masses - introducing Goggleskenbot
 
Python于Web 2.0网站的应用 - QCon Beijing 2010
Python于Web 2.0网站的应用 - QCon Beijing 2010Python于Web 2.0网站的应用 - QCon Beijing 2010
Python于Web 2.0网站的应用 - QCon Beijing 2010Qiangning Hong
 
Feature engineering for diverse data types
Feature engineering for diverse data typesFeature engineering for diverse data types
Feature engineering for diverse data typesAlice Zheng
 
Tensorflow + Keras & Open AI Gym
Tensorflow + Keras & Open AI GymTensorflow + Keras & Open AI Gym
Tensorflow + Keras & Open AI GymHO-HSUN LIN
 
Juegos minimax AlfaBeta
Juegos minimax AlfaBetaJuegos minimax AlfaBeta
Juegos minimax AlfaBetanelsonbc20
 
Practical and Worst-Case Efficient Apportionment
Practical and Worst-Case Efficient ApportionmentPractical and Worst-Case Efficient Apportionment
Practical and Worst-Case Efficient ApportionmentRaphael Reitzig
 
Python Fundamentals - Basic
Python Fundamentals - BasicPython Fundamentals - Basic
Python Fundamentals - BasicWei-Yuan Chang
 
Scala kansai summit-2016
Scala kansai summit-2016Scala kansai summit-2016
Scala kansai summit-2016Naoki Kitora
 
Hindsight experience replay paper review
Hindsight experience replay paper reviewHindsight experience replay paper review
Hindsight experience replay paper reviewEuijin Jeong
 

Similar to A Deep Journey into Playing Games with Reinforcement Learning - Kim Hammar (20)

pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"
pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"
pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"
 
Understanding AlphaGo
Understanding AlphaGoUnderstanding AlphaGo
Understanding AlphaGo
 
Frege - consequently functional programming for the JVM
Frege - consequently functional programming for the JVMFrege - consequently functional programming for the JVM
Frege - consequently functional programming for the JVM
 
The Ring programming language version 1.5.3 book - Part 69 of 184
The Ring programming language version 1.5.3 book - Part 69 of 184The Ring programming language version 1.5.3 book - Part 69 of 184
The Ring programming language version 1.5.3 book - Part 69 of 184
 
The Ring programming language version 1.9 book - Part 69 of 210
The Ring programming language version 1.9 book - Part 69 of 210The Ring programming language version 1.9 book - Part 69 of 210
The Ring programming language version 1.9 book - Part 69 of 210
 
Learning Deep Learning
Learning Deep LearningLearning Deep Learning
Learning Deep Learning
 
The Ring programming language version 1.5.4 book - Part 53 of 185
The Ring programming language version 1.5.4 book - Part 53 of 185The Ring programming language version 1.5.4 book - Part 53 of 185
The Ring programming language version 1.5.4 book - Part 53 of 185
 
Lenses for the masses - introducing Goggles
Lenses for the masses - introducing GogglesLenses for the masses - introducing Goggles
Lenses for the masses - introducing Goggles
 
Python于Web 2.0网站的应用 - QCon Beijing 2010
Python于Web 2.0网站的应用 - QCon Beijing 2010Python于Web 2.0网站的应用 - QCon Beijing 2010
Python于Web 2.0网站的应用 - QCon Beijing 2010
 
Provenance Games
Provenance GamesProvenance Games
Provenance Games
 
Feature engineering for diverse data types
Feature engineering for diverse data typesFeature engineering for diverse data types
Feature engineering for diverse data types
 
Games.4
Games.4Games.4
Games.4
 
Tensorflow + Keras & Open AI Gym
Tensorflow + Keras & Open AI GymTensorflow + Keras & Open AI Gym
Tensorflow + Keras & Open AI Gym
 
AlphaZero
AlphaZeroAlphaZero
AlphaZero
 
Juegos minimax AlfaBeta
Juegos minimax AlfaBetaJuegos minimax AlfaBeta
Juegos minimax AlfaBeta
 
Music as data
Music as dataMusic as data
Music as data
 
Practical and Worst-Case Efficient Apportionment
Practical and Worst-Case Efficient ApportionmentPractical and Worst-Case Efficient Apportionment
Practical and Worst-Case Efficient Apportionment
 
Python Fundamentals - Basic
Python Fundamentals - BasicPython Fundamentals - Basic
Python Fundamentals - Basic
 
Scala kansai summit-2016
Scala kansai summit-2016Scala kansai summit-2016
Scala kansai summit-2016
 
Hindsight experience replay paper review
Hindsight experience replay paper reviewHindsight experience replay paper review
Hindsight experience replay paper review
 

More from Kim Hammar

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesKim Hammar
 
Självlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHSjälvlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHKim Hammar
 
Learning Automated Intrusion Response
Learning Automated Intrusion ResponseLearning Automated Intrusion Response
Learning Automated Intrusion ResponseKim Hammar
 
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlIntrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlKim Hammar
 
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Kim Hammar
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Kim Hammar
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Kim Hammar
 
Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionKim Hammar
 
Digital Twins for Security Automation
Digital Twins for Security AutomationDigital Twins for Security Automation
Digital Twins for Security AutomationKim Hammar
 
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Kim Hammar
 
Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Kim Hammar
 
Intrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingIntrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingKim Hammar
 
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...Kim Hammar
 
Self-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseSelf-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseKim Hammar
 
Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Kim Hammar
 
Learning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingLearning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingKim Hammar
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingKim Hammar
 
Intrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayIntrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayKim Hammar
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Kim Hammar
 
Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Kim Hammar
 

More from Kim Hammar (20)

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jectures
 
Självlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTHSjälvlärande System för Cybersäkerhet. KTH
Självlärande System för Cybersäkerhet. KTH
 
Learning Automated Intrusion Response
Learning Automated Intrusion ResponseLearning Automated Intrusion Response
Learning Automated Intrusion Response
 
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback ControlIntrusion Tolerance for Networked Systems through Two-level Feedback Control
Intrusion Tolerance for Networked Systems through Two-level Feedback Control
 
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
Gamesec23 - Scalable Learning of Intrusion Response through Recursive Decompo...
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
 
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
Learning Near-Optimal Intrusion Responses for IT Infrastructures via Decompos...
 
Learning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via DecompositionLearning Optimal Intrusion Responses via Decomposition
Learning Optimal Intrusion Responses via Decomposition
 
Digital Twins for Security Automation
Digital Twins for Security AutomationDigital Twins for Security Automation
Digital Twins for Security Automation
 
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
Learning Near-Optimal Intrusion Response for Large-Scale IT Infrastructures v...
 
Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.Självlärande system för cyberförsvar.
Självlärande system för cyberförsvar.
 
Intrusion Response through Optimal Stopping
Intrusion Response through Optimal StoppingIntrusion Response through Optimal Stopping
Intrusion Response through Optimal Stopping
 
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
CNSM 2022 - An Online Framework for Adapting Security Policies in Dynamic IT ...
 
Self-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber DefenseSelf-Learning Systems for Cyber Defense
Self-Learning Systems for Cyber Defense
 
Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.Self-learning Intrusion Prevention Systems.
Self-learning Intrusion Prevention Systems.
 
Learning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal StoppingLearning Security Strategies through Game Play and Optimal Stopping
Learning Security Strategies through Game Play and Optimal Stopping
 
Intrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal StoppingIntrusion Prevention through Optimal Stopping
Intrusion Prevention through Optimal Stopping
 
Intrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-PlayIntrusion Prevention through Optimal Stopping and Self-Play
Intrusion Prevention through Optimal Stopping and Self-Play
 
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
Introduktion till försvar mot nätverksintrång. 22 Feb 2022. EP1200 KTH.
 
Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.Intrusion Prevention through Optimal Stopping.
Intrusion Prevention through Optimal Stopping.
 

Recently uploaded

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 

Recently uploaded (20)

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

A Deep Journey into Playing Games with Reinforcement Learning - Kim Hammar

  • 1. Intro AlphaGo AlphaGo Zero AlphaZero Summary A Deep Journey of Playing Games with RL NSE Seminar Kim Hammar kimham@kth.se January 31, 2020 1 / 41
  • 2. Intro AlphaGo AlphaGo Zero AlphaZero Summary Why Games b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Agent Environment Actionat rtst st+1 rt+1 C O N V C O N V C O N V C O N V C O N V rNQ p p b K k P P PrNQ p p b K k P P PrNQ p p b K k P P PrNQ p p b K k P P PrNQ p p b K k P P PrNQ p p b K k P P PrNQ p p b K k P P P Ply 1 Ply 2 Ply 3 Depth = 3 Branching factor = 3 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 VA LV E R AI & Machine Learning Games Why Combine the two? 2 / 41
  • 3. Intro AlphaGo AlphaGo Zero AlphaZero Summary Why Games b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Agent Environment Actionat rtst st+1 rt+1 C O N V C O N V C O N V C O N V C O N V rNQ p p b K k P P PrNQ p p b K k P P PrNQ p p b K k P P PrNQ p p b K k P P PrNQ p p b K k P P PrNQ p p b K k P P PrNQ p p b K k P P P Ply 1 Ply 2 Ply 3 Depth = 3 Branching factor = 3 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩 角角 角 香 香 桂 桂 銀 銀 金 金 玉 歩 歩 歩 歩 歩 歩 歩 歩 歩 歩角角 角 香 香 桂 桂 銀 銀 金 金 玉 VA LV E R AI & Machine Learning Games Why Combine the two? ▸ AI & Games have a long history (Turing ’50& Minsky 60’) ▸ Simple to evaluate, reproducible, controllable, quick feedback loop ▸ Common benchmark for the research community 2 / 41
  • 4. Intro AlphaGo AlphaGo Zero AlphaZero Summary 1997: DeepBlue1 vs Kasparov 2em11 Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu. “Deep Blue”. In: Artif. Intell. 134.1–2 (Jan. 2002), 57–83. issn: 0004-3702. doi: 10.1016/S0004- 3702(01)00129- 1. url: https://doi.org/10.1016/S0004-3702(01)00129-1. 3 / 41
  • 5. Intro AlphaGo AlphaGo Zero AlphaZero Summary 1992: Tesauro’s TD-Gammon2 2em12 Gerald Tesauro. “TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play”. In: Neural Comput. 6.2 (Mar. 1994), 215–219. issn: 0899-7667. doi: 10.1162/neco.1994.6.2.215. url: https://doi.org/10.1162/neco.1994.6.2.215. 4 / 41
  • 6. Intro AlphaGo AlphaGo Zero AlphaZero Summary 1959: Arthur Samuel’s Checkers Player3 2em13 A. L. Samuel. “Some Studies in Machine Learning Using the Game of Checkers”. In: IBM J. Res. Dev. 3.3 (July 1959), 210–229. issn: 0018-8646. doi: 10.1147/rd.33.0210. url: https://doi.org/10.1147/rd.33.0210, A. L. Samuel. “Some Studies in Machine Learning Using the Game of Checkers”. In: IBM J. Res. Dev. 3.3 (July 1959), 210–229. issn: 0018-8646. doi: 10.1147/rd.33.0210. url: https://doi.org/10.1147/rd.33.0210. 5 / 41
  • 7. Intro AlphaGo AlphaGo Zero AlphaZero Summary 6 / 41
  • 8. Intro AlphaGo AlphaGo Zero AlphaZero Summary 7 / 41
  • 9. Intro AlphaGo AlphaGo Zero AlphaZero Summary Papers in Focus Today ▸ AlphaGo4 ▸ AlphaGo Zero5 ▸ AlphaZero6 AlphaGo Nature, 6.5k citations 2016 AlphaGo Zero Nature, 2.5k citations 2017 Alpha Zero Science, 400 citations 2018 2em14 David Silver et al. “Mastering the Game of Go with Deep Neural Networks and Tree Search”. In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961. 2em15 David Silver et al. “Mastering the game of Go without human knowledge”. In: Nature 550 (Oct. 2017), pp. 354–. url: http://dx.doi.org/10.1038/nature24270. 2em16 David Silver et al. “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”. In: Science 362.6419 (2018), pp. 1140–1144. url: http : //science.sciencemag.org/content/362/6419/1140/tab-pdf. 8 / 41
  • 10. Intro AlphaGo AlphaGo Zero AlphaZero Summary The Reinforcement Learning Problem ▸ Notation; policy: π, state: s, reward: r, action: a ▸ Agent’s goal: maximize reward, Rt = ∞ ∑ k=0 γk rt+k+1 0 ≤ γ ≤ 1 ▸ RL’s goal, find optimal policy π∗ = maxπ E[R π] Agent Environment Actionat rtst st+1 rt+1 9 / 41
  • 11. Intro AlphaGo AlphaGo Zero AlphaZero Summary RL Examples: Elevator (Crites & Barto ’957) Elevator Agent b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Observations ElevatorPosition rt+1 Reward ∈ R select(up,down,wait,stop at floor 1,⋯,n) at+1 2em17 Robert H. Crites and Andrew G. Barto. “Improving Elevator Performance Using Re- inforcement Learning”. In: Proceedings of the 8th International Conference on Neural Information Processing Systems. NIPS’95. Denver, Colorado: MIT Press, 1995, 1017–1023. 10 / 41
  • 12. Intro AlphaGo AlphaGo Zero AlphaZero Summary RL Examples: Atari (Mnih ’15)8 DQN Agent Q(s,a1) Q(s,a18)⋯ Observations rt+1 Screen frames ∈ R4×84×84 Reward ∈ R at+1 2em18 Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (Feb. 2015), pp. 529–533. issn: 00280836. url: http://dx.doi.org/10.1038/ nature14236. 11 / 41
  • 13. Intro AlphaGo AlphaGo Zero AlphaZero Summary How to Act Optimally? (Bellman 57’9) optimal(st) = max π E[ ∞ ∑ k=1 γk−1 rt+k st] 2em19 Richard Bellman. Dynamic Programming. Dover Publications, 1957. isbn: 9780486428093. 12 / 41
  • 14. Intro AlphaGo AlphaGo Zero AlphaZero Summary How to Act Optimally? (Bellman 57’10) optimal(st) = max π E[ ∞ ∑ k=1 γk−1 rt+k st] = max π E[rt+1 ∞ ∑ k=2 γk−1 rt+k st] = max at E[rt+1 + max π E[ ∞ ∑ k=2 γk−1 rt+k st+1] st] = max at E[rt+1 + γ max π E[ ∞ ∑ k=2 γk−2 rt+k st+1] st] = max at E[rt+1 + γ max π E[ ∞ ∑ k=2 γk−2 rt+k st+1] st] = max at E[rt+1 + γoptimal(st+1) st] 2em110 Richard Bellman. Dynamic Programming. Dover Publications, 1957. isbn:12 / 41
  • 15. Intro AlphaGo AlphaGo Zero AlphaZero Summary Reinforcement Learning: An Overview Deep Reinforcement Learning C O N V C O N V C O N V C O N V C O N V ⎛ ⎜ ⎜ ⎝ x1 xn ⎞ ⎟ ⎟ ⎠ Features b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Model θ ˆy Prediction L(y, ˆy) Loss Gradient ∇θL(y, ˆy) Algorithms: DQN, DDPG, Double-DQN 13 / 41
  • 16. Intro AlphaGo AlphaGo Zero AlphaZero Summary Reinforcement Learning: An Overview Deep Reinforcement Learning C O N V C O N V C O N V C O N V C O N V ⎛ ⎜ ⎜ ⎝ x1 xn ⎞ ⎟ ⎟ ⎠ Features b0 x0,1 x0,2 x0,3 b1 x1,1 x1,2 x1,3 ˆy Model θ ˆy Prediction L(y, ˆy) Loss Gradient ∇θL(y, ˆy) Algorithms: DQN, DDPG, Double-DQN Direct Policy Search w[0] w [1 ] w[2 ] Policy Spaceπ1 π2 π3 π4 Algorithms: REINFORCE, Evolutionary Search, DPG, Actor-Critic Value Based Methods s a r s′ max Value iteration Monte-Carlo s a s′ r TD(0) s,a s′ a′ r max Q-Learning s,a s′ a′ r Sarsa Dynamic Programming V (s1) V (s2) V (s4) V (s5) V (s3) V (s4) V (s6) s0 s1 s2 a0,r1 a1,r2 Heuristic Search Expansion One or more nodes in the search tree are created. Selection The selection function is applied recursively until a leaf node is reached. Simulation terminal state rollout game using model of environment M and policy π Backpropagation ∆ ∆ ∆ ∆ The result of the rollout is backpropagated in the tree Algorithms: MCTS, Minimax Search Mathematical Foundations s0 s1 s2 a0,r1 a1,r2 Markov Decision Process YZ 0.80.6 0.2 0.4 Markov Chain S0 S1 S2 1 R = −2 0.5 0.5 R = +1 R = +0 R = +0 0.5 0.5 Markov Reward Process 13 / 41
  • 17. Intro AlphaGo AlphaGo Zero AlphaZero Summary AlphaGo ’201611 2em111 David Silver et al. “Mastering the Game of Go with Deep Neural Networks and Tree Search”. In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961. 14 / 41
  • 18. Intro AlphaGo AlphaGo Zero AlphaZero Summary 15 / 41
  • 19. Intro AlphaGo AlphaGo Zero AlphaZero Summary The game of Go ▸ The world’s oldest game: 3000 years old, over 40M players world wide ▸ To win: capture the most territory on the board ▸ Surrounded stones/areas are captured and removed ▸ Why is it so hard for computers? 10170 unique states!!, ≈ 250 branching factor ▸ High branching factor, large board (19 × 19), hard to evaluate etc.. Capture 16 / 41
  • 20. Intro AlphaGo AlphaGo Zero AlphaZero Summary The game of Go ▸ The world’s oldest game: 3000 years old, over 40M players world wide ▸ To win: capture the most territory on the board ▸ Surrounded stones are captured and removed ▸ Why is it so hard for computers? 10170 unique states!! ≈ 250 branching factor ▸ High branching factor, large board (19 × 19), hard to evaluate etc.. (1) (2) 16 / 41
  • 21. Intro AlphaGo AlphaGo Zero AlphaZero Summary Game Trees ▸ How do you program a computer to play a board game? ▸ Simplest approach: ▸ (1) Program a game tree; (2) Assume opponent think like you; (3) Look-ahead and evaluate each move ▸ Requires Knowledge of game rules and evaluation function Ply 1 Ply 2 Ply 3 Depth = 3 Branching factor = 3 17 / 41
  • 22. Search + Go = (full-slide image)
  • 23. Some Numbers
    ▸ Atoms in the universe: ≈ 10^80
    ▸ Unique states: Go 10^170, Chess 10^47
    ▸ Game tree complexity: Go 10^360, Chess 10^123
    ▸ Average branching factor: Go 250, Chess 35
    ▸ Board size (positions): Go 361, Chess 64
    (Plot: number of states ≈ 250^depth against depth — already in the billions within a few plies.)
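The plot's point is easy to reproduce with two lines of arithmetic (a sketch; the chosen depths are illustrative):

```python
# Why brute force fails: ~b^d leaf positions after d plies with branching factor b.
for b, game in [(35, "chess"), (250, "Go")]:
    for d in (2, 4, 8):
        print(f"{game}: ~{b**d:.1e} positions at depth {d}")
# Go already exceeds 10^9 positions at depth 4, and ~1.5e19 at depth 8.
```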
  • 24. Monte-Carlo Tree Search
    ▸ Selection: the selection function is applied recursively until a leaf node is reached
    ▸ Expansion: one or more nodes in the search tree are created
    ▸ Simulation: a rollout game is played to a terminal state using a model of the environment M and policy π
    ▸ Backpropagation: the result of the rollout is backpropagated in the tree
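A compact UCT-flavoured sketch of the four phases follows. The `env` object is a hypothetical deterministic game model (`legal_moves`, `apply`, `is_terminal`, `rollout_result`), and for brevity the sketch scores everything from a single player's perspective; a real two-player search would negate values per ply:

```python
# UCT-style MCTS sketch following the four phases above (single-perspective simplification).
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def ucb(parent, child, c=1.4):
    # mean value + exploration bonus
    return (child.value / (child.visits + 1e-9)
            + c * math.sqrt(math.log(parent.visits + 1) / (child.visits + 1e-9)))

def mcts(env, root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes via UCB
        while node.children and len(node.children) == len(env.legal_moves(node.state)):
            node = max(node.children.values(), key=lambda ch: ucb(node, ch))
        # 2. Expansion: create one new child for an untried move
        untried = [m for m in env.legal_moves(node.state) if m not in node.children]
        if untried and not env.is_terminal(node.state):
            m = random.choice(untried)
            node.children[m] = Node(env.apply(node.state, m), parent=node)
            node = node.children[m]
        # 3. Simulation: random rollout to a terminal state using the model
        state = node.state
        while not env.is_terminal(state):
            state = env.apply(state, random.choice(env.legal_moves(state)))
        result = env.rollout_result(state)  # e.g. +1 win / -1 loss
        # 4. Backpropagation: push the rollout result up the tree
        while node is not None:
            node.visits += 1
            node.value += result
            node = node.parent
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]  # most-visited move
```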
  • 25. AlphaGo's Approach
    ▸ Brute-force search does not work (at least not until hardware improves by many orders of magnitude)
    ▸ Human Go professionals rely on a small search guided by intuition and experience
    ▸ AlphaGo's approach: complement MCTS with "artificial intuition"
    ▸ Artificial intuition is provided by two neural networks: a value network and a policy network
  • 26. Computer Go as an RL Problem
    (Figure: agent-environment loop — the AlphaGo agent observes board features ∈ R^{K×19×19} and a reward r_{t+1} ∈ [−1, 1], and emits an action a_{t+1}; a sketch of the loop follows below.)
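As code, the loop in the figure might look like the following sketch. Both `env` (returning a `(obs, reward, done)` triple) and `agent.act` are hypothetical stand-ins, not APIs from the slides:

```python
# The RL formulation above as a sketch: a hypothetical Go environment emitting
# K x 19 x 19 board feature planes and a reward in [-1, 1] at game end.
def play_episode(env, agent):
    obs = env.reset()                  # obs: array of shape (K, 19, 19)
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(obs)        # a move (board point or pass)
        obs, reward, done = env.step(action)
        total_reward += reward         # 0 until the end; then +1 win / -1 loss
    return total_reward
```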
  • 27–29. AlphaGo Training Pipeline (1/2)
    ▸ Start from a dataset D1 of human expert moves
    ▸ Supervised rollout policy network p_π(a|s): a small, fast network used during rollouts, trained by classification: min_{p_π} L(predicted move, expert move)
    ▸ Supervised policy network p_σ(a|s): a deep ConvNet, also trained by classification: min_{p_σ} L(predicted move, expert move)
    (a training-step sketch follows below)
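A hedged sketch of one such supervised training step, assuming PyTorch; `policy_net` and the `(states, expert_moves)` batch format are hypothetical:

```python
# Supervised stage sketch: train the policy by classification on expert moves.
import torch
import torch.nn.functional as F

def supervised_step(policy_net, optimizer, states, expert_moves):
    # states: (B, K, 19, 19) feature planes; expert_moves: (B,) move indices in [0, 361)
    logits = policy_net(states)                   # (B, 361) scores over board points
    loss = F.cross_entropy(logits, expert_moves)  # min L(predicted move, expert move)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```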
  • 30–32. AlphaGo Training Pipeline (2/2)
    ▸ Reinforcement-learning policy network p_ρ(a|s), initialized with the p_σ weights
    ▸ Improved by policy gradient on self-play games: J(p_ρ) = E_{p_ρ}[ Σ_{t=0}^∞ r_t ], ρ ← ρ + α ∇_ρ J(p_ρ)
    ▸ Self-play generates a dataset D2
    ▸ Supervised value network v_θ(s′) trained on D2: min_{v_θ} L(predicted outcome, actual outcome)
    (a policy-gradient sketch follows below)
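A REINFORCE-style sketch of the self-play update: nudge p_ρ toward the moves played in won games and away from those played in lost games. Assumes PyTorch; the batch layout is hypothetical:

```python
# Policy-gradient sketch: gradient ascent on J(p_rho) = E[sum_t r_t] over one self-play game.
import torch

def policy_gradient_step(policy_net, optimizer, states, actions, outcome):
    # states: (T, K, 19, 19); actions: (T,) move indices; outcome z = +1 (win) or -1 (loss)
    logits = policy_net(states)                               # (T, 361)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs[torch.arange(len(actions)), actions]   # log p_rho(a_t | s_t)
    loss = -(outcome * chosen).sum()  # minimizing this ascends J(p_rho)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```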
  • 33. Guided Search Using the Policy Network
    (Figure: the policy network p_{σ/ρ}(a|s) is evaluated at each tree node to prioritize which moves to expand.)
  • 34. Depth-Limited Search Using the Value Network
    (Figure: the value network v_θ(s′) evaluates leaf positions, so the search can be truncated instead of rolled out to the end of the game.)
  • 35–39. AlphaGo Prediction Pipeline
    1. Input position: the current game state s
    2. ML predictions: the policy network p_{σ/ρ}(a|s) and the value network v_θ(s′)
    3. Artificial intuition by NNs: the action distribution p_{σ/ρ}(a|s) and value estimates v_θ(s′) highlight high-value actions
    4. Guided look-ahead search (self-play MCTS): selection, expansion, simulation, backpropagation
    5. Output: the selected move
  • 40. AlphaGo vs Lee Sedol
    ▸ In March 2016, AlphaGo won against Lee Sedol 4–1
    ▸ Lee Sedol was an 18-time world champion prior to the match
    ▸ Two famous moves: move 37 by AlphaGo and move 78 by Sedol
  • 41. AlphaGo: Key Takeaways
    1. Different AI techniques can be complementary:
       ▸ Supervised learning
       ▸ Reinforcement learning
       ▸ Search
       ▸ Rules/domain knowledge
       ▸ Whatever it takes to win!
    2. Self-play
    3. Vast computation is still required for training and inference
       ▸ AlphaGo used 1200 CPUs and 176 GPUs
  • 42. AlphaGo Zero '2017 [12]
    [12] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–359. url: http://dx.doi.org/10.1038/nature24270.
  • 43. AlphaGo Zero
    ▸ AlphaGo Zero is the successor to AlphaGo
    ▸ It is simpler and stronger: it beat AlphaGo 100–0 in matches
    ▸ It starts from zero domain knowledge
    ▸ It uses a single neural network (compared to four in AlphaGo)
    ▸ It learns by self-play only (no supervised learning on human games, unlike AlphaGo)
  • 44. AlphaGo Zero Neural Network Architecture
    (Figure: a multi-headed ResNet f_θ(s) = (p, v) over input state features — the policy head outputs P[a_1|s], …, P[a_n|s] and the value head outputs v(s) ≈ P[win|s]; a sketch follows below.)
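A hedged sketch of a dual-headed network in the spirit of f_θ(s) = (p, v). The layer sizes, plane count, and the plain-conv trunk are illustrative stand-ins, not the published ResNet architecture:

```python
# Dual-headed network sketch: one trunk, a policy head and a value head.
import torch
import torch.nn as nn

class DualHeadNet(nn.Module):
    def __init__(self, in_planes=17, channels=64, board=19):
        super().__init__()
        self.trunk = nn.Sequential(  # stand-in for the residual trunk
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Linear(channels * board * board, board * board + 1)  # moves + pass
        self.value_head = nn.Sequential(
            nn.Linear(channels * board * board, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),  # v(s) in [-1, 1], interpreted as ~P[win | s]
        )

    def forward(self, s):
        h = self.trunk(s).flatten(1)
        return self.policy_head(h), self.value_head(h).squeeze(-1)  # (p logits, v)
```

Sharing one trunk between the two heads is the design point of the slide: both predictions reuse the same learned board representation.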
  • 45–51. AlphaGo Zero Self-Play Training Algorithm
    ▸ Play a game against yourself: at each state s_t, run MCTS to obtain a search policy π_t and sample a move a_t ∼ π_t; at the end of the game, record the outcome z
    ▸ Each game yields self-play training data (s_t, π_t, z)
    ▸ The network f_θ(s_t) = (p_t, v_t) is updated by θ′ = θ − α ∇_θ L((p_t, v_t), (π_t, z))
    ▸ Loss: L(f_θ(s_t), (π_t, z)) = (z − v_t)² − π_t^T log p_t + c‖θ‖²
      (MSE on the value, cross-entropy on the policy, plus L2 regularization)
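The combined loss translates almost directly into code. A sketch, assuming PyTorch and batched tensors (shapes are assumptions for illustration):

```python
# AlphaGo Zero loss sketch: (z - v)^2 - pi^T log p + c * ||theta||^2
import torch

def alphago_zero_loss(p_logits, v, pi, z, model, c=1e-4):
    # p_logits: (B, n_moves); v, z: (B,); pi: (B, n_moves) MCTS search policies
    value_loss = (z - v).pow(2).mean()                                        # (z - v_t)^2
    policy_loss = -(pi * torch.log_softmax(p_logits, dim=-1)).sum(-1).mean()  # -pi^T log p_t
    l2 = sum(w.pow(2).sum() for w in model.parameters())                      # ||theta||^2
    return value_loss + policy_loss + c * l2
```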
  • 52–54. AlphaGo Zero Self-Play Training Pipeline
    1. Self-play with the current network f_θ = (p, v) generates data D(π, z)
    2. Learning: θ_{t+1} = θ_t − α ∇_θ L(f_θ, D)
    3. Repeat
    (a sketch of this loop follows below)
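The three-step pipeline as a sketch. The `self_play_fn` and `train_fn` callables, and the iteration counts, are hypothetical stand-ins for the self-play and learning stages:

```python
# Outer training loop sketch: self-play -> learning -> repeat.
def training_pipeline(net, self_play_fn, train_fn, iterations=1000, games_per_iter=25):
    data = []
    for _ in range(iterations):
        # 1. self-play: the current f_theta plays games, emitting (s_t, pi_t, z) triples
        for _ in range(games_per_iter):
            data.extend(self_play_fn(net))
        # 2. learning: theta_{t+1} = theta_t - alpha * grad L(f_theta, D)
        train_fn(net, data)
        # 3. repeat: the improved network generates the next round of games
    return net
```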
  • 55. AlphaGo Zero: Key Takeaways
    1. A simpler system can be more powerful than a complex one (AlphaGo Zero vs AlphaGo)
    2. Neural networks can be combined like LEGO blocks
    3. ResNets outperform traditional ConvNets here
    4. Self-play
  • 56. AlphaZero '2018 [13]
    [13] David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". In: Science 362.6419 (2018), pp. 1140–1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
  • 57. AlphaZero
    ▸ AlphaGo Zero reaches superhuman level at Go without any domain knowledge...
    ▸ Since the algorithm is not specific to Go, can it play other games?
    ▸ AlphaZero extends AlphaGo Zero to play not only Go but also chess and shogi
    ▸ The same algorithm achieves superhuman performance on all three games
  • 58. AlphaZero
    (Figure: chess, shogi, or Go input features feed the same multi-headed ResNet f_θ(s) = (p, v) with a policy head and a value head.)
  • 59. AlphaZero is Much Simpler than AlphaGo, Yet More Powerful
    (Chart: generality vs performance — each successive system, AlphaGo → AlphaGo Zero → AlphaZero, gains on both axes.)
  • 60. AlphaZero: Key Takeaways
    1. Being able to play three games at a superhuman level, AlphaZero is one step closer to general AI
    2. Massive compute power is still required for training AlphaZero
       ▸ 5000 TPUs
    3. Self-play
  • 61. Next Challenge... (full-slide image)
  • 62. Presentation Summary
    ▸ Sometimes a simpler system can be more powerful than a complex one
    ▸ Universal research principle: strive for generality and simplicity (Occam's razor)
    ▸ Self-play: no human bias; learn from first principles
    ▸ Deep RL is still in its infancy; a lot more is to be expected in the next few years
    ▸ Open challenges: sample efficiency and data efficiency
      ▸ Yes, AlphaGo can learn to play Go, but only after hundreds of game-years, whereas a human can reach a decent level of play in only a couple of hours
    ▸ How can we make reinforcement learning more efficient? Model-based learning is a research area receiving increasing attention
  • 63. References [19]
    ▸ DQN [14]
    ▸ AlphaGo [15]
    ▸ AlphaGo Zero [16]
    ▸ AlphaZero [17]
    ▸ AlphaStar [18]
    [14] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (Feb. 2015), pp. 529–533. issn: 00280836. url: http://dx.doi.org/10.1038/nature14236.
    [15] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484–489. doi: 10.1038/nature16961.
    [16] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–359. url: http://dx.doi.org/10.1038/nature24270.
    [17] David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". In: Science 362.6419 (2018), pp. 1140–1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
    [18] Oriol Vinyals et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning". In: Nature 575 (Nov. 2019). doi: 10.1038/s41586-019-1724-z.
    [19] Thanks to Rolf Stadler for reviewing and discussing drafts of this presentation.