A Deep Journey into Playing Games with Reinforcement Learning - Kim Hammar
1. Intro AlphaGo AlphaGo Zero AlphaZero Summary
A Deep Journey into Playing Games with RL
NSE Seminar
Kim Hammar
kimham@kth.se
January 31, 2020
1 / 41
3. Intro AlphaGo AlphaGo Zero AlphaZero Summary
Why Games
[Figure: AI & machine learning illustrations (neural network, agent-environment loop, convolutional network, game tree of depth 3 and branching factor 3) alongside game illustrations (chess, shogi, video games)]
AI & Machine Learning | Games
Why Combine the two?
▸ AI & games have a long history (Turing '50 & Minsky '60)
▸ Simple to evaluate, reproducible, controllable, quick
feedback loop
▸ Common benchmark for the research community
2 / 41
4. Intro AlphaGo AlphaGo Zero AlphaZero Summary
1997: Deep Blue [1] vs Kasparov
[1] Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu. "Deep Blue". In: Artif. Intell. 134.1-2 (Jan. 2002), pp. 57-83. issn: 0004-3702. doi: 10.1016/S0004-3702(01)00129-1. url: https://doi.org/10.1016/S0004-3702(01)00129-1.
3 / 41
6. Intro AlphaGo AlphaGo Zero AlphaZero Summary
1959: Arthur Samuel's Checkers Player [3]
[3] A. L. Samuel. "Some Studies in Machine Learning Using the Game of Checkers". In: IBM J. Res. Dev. 3.3 (July 1959), pp. 210-229. issn: 0018-8646. doi: 10.1147/rd.33.0210. url: https://doi.org/10.1147/rd.33.0210.
5 / 41
9. Intro AlphaGo AlphaGo Zero AlphaZero Summary
Papers in Focus Today
▸ AlphaGo [4]
▸ AlphaGo Zero [5]
▸ AlphaZero [6]
Timeline: AlphaGo (Nature 2016, 6.5k citations) → AlphaGo Zero (Nature 2017, 2.5k citations) → AlphaZero (Science 2018, 400 citations)
[4] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484-489. doi: 10.1038/nature16961.
[5] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–. url: http://dx.doi.org/10.1038/nature24270.
[6] David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". In: Science 362.6419 (2018), pp. 1140-1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
8 / 41
11. Intro AlphaGo AlphaGo Zero AlphaZero Summary
RL Examples: Elevator (Crites & Barto '95) [7]
Elevator Agent
[Figure: elevator agent interacting with its environment]
Observations: elevator position; Reward r_{t+1} ∈ R; Actions a_{t+1}: select(up, down, wait, stop at floor 1, …, n)
[7] Robert H. Crites and Andrew G. Barto. "Improving Elevator Performance Using Reinforcement Learning". In: Proceedings of the 8th International Conference on Neural Information Processing Systems. NIPS'95. Denver, Colorado: MIT Press, 1995, pp. 1017-1023.
10 / 41
12. Intro AlphaGo AlphaGo Zero AlphaZero Summary
RL Examples: Atari (Mnih '15) [8]
DQN Agent: outputs Q(s, a_1), …, Q(s, a_18)
Observations: screen frames ∈ R^{4×84×84}; Reward r_{t+1} ∈ R; Action a_{t+1}
[8] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (Feb. 2015), pp. 529-533. issn: 0028-0836. url: http://dx.doi.org/10.1038/nature14236.
11 / 41
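To make the Q(s, a_1), …, Q(s, a_18) mapping concrete, here is a hedged sketch of a DQN-style network in PyTorch. It is not DeepMind's exact model; the layer sizes, the DQN class name, and the 18-action assumption are illustrative.

    import torch
    import torch.nn as nn

    class DQN(nn.Module):
        """Sketch: 4 stacked 84x84 grayscale frames -> one Q-value per Atari action."""
        def __init__(self, n_actions: int = 18):
            super().__init__()
            self.conv = nn.Sequential(  # convolutional feature extractor
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            )
            self.head = nn.Sequential(  # fully connected Q-value head
                nn.Flatten(), nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                nn.Linear(512, n_actions),
            )

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            return self.head(self.conv(frames))  # shape (batch, n_actions)

    q_net = DQN()
    q_values = q_net(torch.zeros(1, 4, 84, 84))  # Q(s, a_1), ..., Q(s, a_18)
    action = q_values.argmax(dim=1)              # greedy action selection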
13. Intro AlphaGo AlphaGo Zero AlphaZero Summary
How to Act Optimally? (Bellman '57) [9]
optimal(s_t) = max_π E[ ∑_{k=1}^∞ γ^{k−1} r_{t+k} | s_t ]
[9] Richard Bellman. Dynamic Programming. Dover Publications, 1957. isbn: 9780486428093.
12 / 41
14. Intro AlphaGo AlphaGo Zero AlphaZero Summary
How to Act Optimally? (Bellman '57) [10]
optimal(s_t) = max_π E[ ∑_{k=1}^∞ γ^{k−1} r_{t+k} | s_t ]
            = max_π E[ r_{t+1} + ∑_{k=2}^∞ γ^{k−1} r_{t+k} | s_t ]
            = max_{a_t} E[ r_{t+1} + max_π E[ ∑_{k=2}^∞ γ^{k−1} r_{t+k} | s_{t+1} ] | s_t ]
            = max_{a_t} E[ r_{t+1} + γ max_π E[ ∑_{k=2}^∞ γ^{k−2} r_{t+k} | s_{t+1} ] | s_t ]
            = max_{a_t} E[ r_{t+1} + γ optimal(s_{t+1}) | s_t ]
[10] Richard Bellman. Dynamic Programming. Dover Publications, 1957. isbn: 9780486428093.
12 / 41
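The recursion optimal(s_t) = max_{a_t} E[r_{t+1} + γ optimal(s_{t+1}) | s_t] is exactly what value iteration exploits. A minimal sketch on a made-up two-state MDP (the transition probabilities and rewards below are purely illustrative):

    # Toy MDP (illustrative numbers): P[s][a] = list of (prob, next_state, reward)
    P = {
        0: {"stay": [(1.0, 0, 0.0)], "go": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
        1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
    }
    gamma = 0.9
    V = {0: 0.0, 1: 0.0}

    for _ in range(100):  # value iteration: repeated Bellman backups
        V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
             for s in P}

    print(V)  # V[s] approximates optimal(s) = max_a E[r + gamma * optimal(s')]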
16. Intro AlphaGo AlphaGo Zero AlphaZero Summary
Reinforcement Learning: An Overview
Deep Reinforcement Learning
▸ [Figure: features x → model θ → prediction ŷ → loss L(y, ŷ) → gradient ∇_θ L(y, ŷ)]
▸ Algorithms: DQN, DDPG, Double-DQN
Direct Policy Search
▸ [Figure: search directly in policy space (π_1, π_2, π_3, π_4) over weight vectors w]
▸ Algorithms: REINFORCE, Evolutionary Search, DPG, Actor-Critic
Value-Based Methods
▸ [Figure: backup diagrams for value iteration, Monte-Carlo, TD(0), Q-Learning, and Sarsa]
Dynamic Programming
▸ [Figure: recursive backup of state values V(s) over successor states]
Heuristic Search
▸ Four MCTS phases: selection, expansion, simulation, backpropagation (detailed later)
▸ Algorithms: MCTS, Minimax Search
Mathematical Foundations
▸ Markov Chain, Markov Reward Process, Markov Decision Process
(A minimal sketch of the agent-environment loop that all of these share follows below.)
13 / 41
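Whichever family an algorithm belongs to, it is built around the same agent-environment loop from the overview. A minimal sketch, assuming a hypothetical env object with reset() and step() methods (the exact signatures are an assumption, not a particular library's API):

    def run_episode(env, policy, gamma: float = 0.99) -> float:
        """Generic RL loop: observe s_t, pick a_t, receive r_{t+1} and s_{t+1}."""
        state = env.reset()
        done, ret, discount = False, 0.0, 1.0
        while not done:
            action = policy(state)                   # agent chooses a_t from s_t
            state, reward, done = env.step(action)   # environment returns r_{t+1}, s_{t+1}
            ret += discount * reward                 # accumulate the discounted return
            discount *= gamma
        return ret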
17. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo (2016) [11]
[11] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484-489. doi: 10.1038/nature16961.
14 / 41
19. Intro AlphaGo AlphaGo Zero AlphaZero Summary
The game of Go
▸ The world's oldest board game: about 3,000 years old, over 40M players worldwide
▸ To win: capture the most territory on the board
▸ Surrounded stones/areas are captured and removed
▸ Why is it so hard for computers? Roughly 10^170 unique states and a branching factor of ≈ 250
▸ High branching factor, large board (19 × 19), positions that are hard to evaluate, etc.
[Figure: capturing surrounded stones]
16 / 41
21. Intro AlphaGo AlphaGo Zero AlphaZero Summary
Game Trees
▸ How do you program a computer to play a board game?
▸ Simplest approach: (1) program a game tree; (2) assume the opponent thinks like you; (3) look ahead and evaluate each move
▸ Requires knowledge of the game rules and an evaluation function (a minimal sketch follows below)
[Figure: game tree with depth 3 and branching factor 3, plies 1-3]
17 / 41
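The look-ahead idea in code, as a minimal sketch: the game object (legal_moves, apply, is_terminal) and the evaluate function are hypothetical placeholders for the game rules and evaluation function mentioned above.

    def minimax(state, depth: int, maximizing: bool, game, evaluate) -> float:
        """Depth-limited minimax: assume the opponent searches exactly like you do."""
        if depth == 0 or game.is_terminal(state):
            return evaluate(state)  # hand-crafted evaluation of the position
        values = [minimax(game.apply(state, m), depth - 1, not maximizing, game, evaluate)
                  for m in game.legal_moves(state)]
        return max(values) if maximizing else min(values)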
23. Intro AlphaGo AlphaGo Zero AlphaZero Summary
Some Numbers
▸ Atoms in the universe: ≈ 10^80
▸ States: Go ≈ 10^170, Chess ≈ 10^47
▸ Game tree complexity: Go ≈ 10^360, Chess ≈ 10^123
▸ Average branching factor: Go 250, Chess 35
▸ Board size (positions): Go 361, Chess 64
[Figure: number of states 250^depth as a function of search depth]
19 / 41
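A quick back-of-the-envelope check of why exhaustive search is hopeless for Go: with an average branching factor of 250, the tree size 250^depth passes the ≈ 10^80 atoms in the universe after only a few dozen plies.

    import math

    branching = 250                      # average branching factor of Go
    for depth in (4, 8, 16, 34):
        leaves = branching ** depth      # positions reachable at this look-ahead depth
        print(depth, f"~10^{int(math.log10(leaves))}")
    # around depth 34, 250^depth already exceeds the ~10^80 atoms in the universe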
24. Intro AlphaGo AlphaGo Zero AlphaZero Summary
Monte-Carlo Tree Search repeats four phases (a code sketch follows below):
▸ Selection: the selection function is applied recursively until a leaf node is reached
▸ Expansion: one or more nodes in the search tree are created
▸ Simulation: a rollout game is played to a terminal state using a model of the environment M and a policy π
▸ Backpropagation: the result of the rollout is backpropagated in the tree
20 / 41
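A compact sketch of the four phases in Python. The game interface (legal_moves, apply, is_terminal, winner), the UCB-style selection rule, and the random rollout policy are assumptions for illustration; it also ignores the alternating-player perspective that a real two-player implementation must handle.

    import math, random

    class Node:
        def __init__(self, state, parent=None):
            self.state, self.parent = state, parent
            self.children, self.visits, self.value = [], 0, 0.0

    def mcts(root_state, game, n_simulations: int = 1000, c: float = 1.4):
        root = Node(root_state)
        for _ in range(n_simulations):
            node = root
            # 1. Selection: apply the selection rule (here UCB) until a leaf is reached
            while node.children:
                node = max(node.children,
                           key=lambda ch: ch.value / (ch.visits + 1e-9)
                           + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))
            # 2. Expansion: create child nodes for the leaf's legal moves
            if not game.is_terminal(node.state):
                node.children = [Node(game.apply(node.state, m), parent=node)
                                 for m in game.legal_moves(node.state)]
                node = random.choice(node.children)
            # 3. Simulation: roll out to a terminal state with the rollout policy
            state = node.state
            while not game.is_terminal(state):
                state = game.apply(state, random.choice(game.legal_moves(state)))
            result = game.winner(state)
            # 4. Backpropagation: propagate the rollout result up to the root
            while node is not None:
                node.visits += 1
                node.value += result
                node = node.parent
        return max(root.children, key=lambda ch: ch.visits)  # child of the most-visited move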
25. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo’s Approach
▸ Brute-force search does not work
▸ At least not until hardware has improved a lot
▸ Human Go professionals rely on a limited amount of search guided by intuition/experience
▸ AlphaGo's approach: complement MCTS with "artificial intuition"
▸ Artificial intuition is provided by two neural networks: a value network and a policy network
21 / 41
26. Intro AlphaGo AlphaGo Zero AlphaZero Summary
Computer Go as an RL Problem
AlphaGo Agent
Observations: board features ∈ R^{K×19×19}; Reward r_{t+1} ∈ [−1, 1]; Action a_{t+1}
22 / 41
29. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo Training Pipeline (1/2)
Human expert moves (dataset D1) are used for supervised training of two policy networks:
▸ Supervised rollout policy network p_π(a|s), trained by classification: min_{p_π} L(predicted move, expert move)
▸ Supervised policy network p_σ(a|s) (a deeper convolutional network), trained by classification: min_{p_σ} L(predicted move, expert move)
A hedged code sketch of this supervised classification step follows below.
23 / 41
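A hedged sketch of the supervised classification step in PyTorch: predict the expert move from board features. The number of feature planes K, the network depth, and the optimizer settings are illustrative assumptions, not AlphaGo's actual architecture.

    import torch
    import torch.nn as nn

    K = 48                                  # assumed number of input feature planes
    policy_net = nn.Sequential(             # small stand-in for the SL policy network
        nn.Conv2d(K, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 19 * 19, 19 * 19),   # one logit per board point
    )
    loss_fn = nn.CrossEntropyLoss()         # min L(predicted move, expert move)
    optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-2)

    def sl_step(board_features: torch.Tensor, expert_moves: torch.Tensor) -> float:
        """One supervised step: board_features (B, K, 19, 19), expert_moves in [0, 360]."""
        logits = policy_net(board_features)
        loss = loss_fn(logits, expert_moves)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss.item()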
32. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo Training Pipeline (2/2)
▸ Reinforcement learning policy network p_ρ(a|s): initialized with the weights of p_σ and improved by self-play using policy-gradient updates
  J(p_ρ) = E_{p_ρ}[ ∑_{t=0}^∞ r_t ],   ρ ← ρ + α ∇_ρ J(p_ρ)
▸ The self-play games are stored in a dataset D2
▸ Supervised value network v_θ(s′): trained on D2 to minimize L(predicted outcome, actual outcome)
A hedged sketch of the policy-gradient step follows below.
24 / 41
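A hedged sketch of the policy-gradient update: after a self-play game with outcome z ∈ {−1, +1}, increase the log-probability of the moves that were played, weighted by z. It reuses the hypothetical policy_net from the previous sketch and is a simplification of AlphaGo's actual training against a pool of earlier policies.

    import torch

    def reinforce_step(policy_net, optimizer, game_states, game_actions, z: float) -> float:
        """One REINFORCE-style update: rho <- rho + alpha * grad J(p_rho)."""
        logits = policy_net(torch.stack(game_states))          # (T, 361) move logits
        log_probs = torch.log_softmax(logits, dim=1)
        idx = torch.arange(len(game_actions))
        chosen = log_probs[idx, torch.tensor(game_actions)]    # log p_rho(a_t | s_t)
        loss = -(z * chosen).sum()                             # gradient ascent on z * log-prob
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss.item()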
33. Intro AlphaGo AlphaGo Zero AlphaZero Summary
Guided Search Using the Policy Network
[Figure: MCTS tree in which each expansion is guided by the policy network p_{σ/ρ}(a|s)]
25 / 41
34. Intro AlphaGo AlphaGo Zero AlphaZero Summary
Depth-Limited Search Using the Value Network
[Figure: search tree truncated at a fixed depth; leaf positions s′ are evaluated with the value network v_θ(s′)]
26 / 41
39. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo Prediction Pipeline
1. Input position: the current game state s
2. ML predictions ("artificial intuition" by the neural networks): the policy network p_{σ/ρ}(a|s) outputs an action distribution and the value network v_θ(s′) outputs value estimates, highlighting high-value actions
3. Guided look-ahead search (self-play simulations): MCTS (selection, expansion, simulation, backpropagation) guided by the two networks
4. Selected move: the action chosen after the search
27 / 41
40. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo vs Lee Sedol
▸ In March 2016, AlphaGo won against Lee Sedol 4-1
▸ Lee Sedol was an 18-time world champion prior to the match
▸ Two famous moves: move 37 by AlphaGo and move 78 by Sedol
28 / 41
41. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo: Key Takeaways
1. Different AI techniques can be complementary
▸ Supervised learning
▸ Reinforcement learning
▸ Search
▸ Rules/Domain Knowledge
▸ Whatever it takes to win!
2. Self-play
3. Vast computation still required for training and inference
▸ AlphaGo used 1200 CPUs and 176 GPUs
29 / 41
42. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo Zero (2017) [12]
[12] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–. url: http://dx.doi.org/10.1038/nature24270.
30 / 41
43. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo Zero
▸ AlphaGo Zero is the successor to AlphaGo
▸ AlphaGo Zero is simpler and stronger than AlphaGo
▸ AlphaGo Zero beats AlphaGo 100-0 in matches
▸ AlphaGo Zero starts from zero domain knowledge
▸ Uses a single neural network (compared to 4 NNs in AlphaGo)
▸ Learns by self-play only (no supervised learning as in AlphaGo)
31 / 41
44. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo Zero Neural Network Architecture
A single multi-headed ResNet f_θ(s) = (p, v):
▸ Input: feature planes of the state s, processed by x × y convolutional kernels
▸ Policy head: p = (P[a_1|s], …, P[a_n|s])
▸ Value head: v(s) ≈ P[win|s]
A hedged code sketch of such a two-headed network follows below.
32 / 41
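A hedged sketch of the two-headed idea: a shared ResNet-style trunk feeding a policy head and a value head. The number of residual blocks, channel widths, and the 17 input planes are illustrative assumptions, not the paper's exact configuration; the value head uses tanh, so v ∈ [−1, 1] rather than a literal probability.

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, ch: int = 64):
            super().__init__()
            self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

        def forward(self, x):
            return torch.relu(x + self.conv2(torch.relu(self.conv1(x))))  # skip connection

    class TwoHeadedNet(nn.Module):
        """f_theta(s) = (p, v): shared trunk, policy head, value head."""
        def __init__(self, in_planes: int = 17, ch: int = 64, n_moves: int = 19 * 19 + 1):
            super().__init__()
            self.trunk = nn.Sequential(nn.Conv2d(in_planes, ch, 3, padding=1), nn.ReLU(),
                                       *[ResBlock(ch) for _ in range(4)])
            self.policy_head = nn.Sequential(nn.Flatten(), nn.Linear(ch * 19 * 19, n_moves))
            self.value_head = nn.Sequential(nn.Flatten(), nn.Linear(ch * 19 * 19, 1), nn.Tanh())

        def forward(self, s):
            h = self.trunk(s)
            return self.policy_head(h), self.value_head(h)  # (move logits, value in [-1, 1])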
51. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo Zero Self-Play Training Algorithm
▸ Play a self-play game: from each state s_t, compute a search policy π_t and sample a_t ∼ π_t; record the final outcome z
▸ The tuples (s_t, π_t, z) form the self-play training data
▸ Update the network f_θ(s_t) = (p_t, v_t) by gradient descent: θ′ = θ − α ∇_θ L((p_t, v_t), (π_t, z))
▸ Loss: L(f_θ(s_t), (π_t, z)) = (z − v_t)² − π_t^T log p_t + c‖θ‖² (MSE on the value, cross-entropy on the policy, plus L2 regularization)
A code sketch of this loss follows below.
32 / 41
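The combined loss written out in code, as a small sketch that assumes the TwoHeadedNet above: MSE between the outcome z and the value head, cross-entropy between the MCTS policy π and the policy head, and an explicit L2 penalty (in practice weight decay in the optimizer plays this role).

    import torch

    def alphago_zero_loss(net, states, pi_targets, z_targets, c: float = 1e-4):
        """L = (z - v)^2 - pi^T log p + c * ||theta||^2, averaged over the batch."""
        logits, v = net(states)
        value_loss = ((z_targets - v.squeeze(-1)) ** 2).mean()                       # MSE term
        policy_loss = -(pi_targets * torch.log_softmax(logits, dim=1)).sum(1).mean()  # cross-entropy
        l2 = sum((p ** 2).sum() for p in net.parameters())                           # ||theta||^2
        return value_loss + policy_loss + c * l2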
54. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo Zero Self-Play Training Pipeline
The network f_θ = (p, v), with policy head P[a_1|s], …, P[a_n|s] and value head v(s) ≈ P[win|s], is trained in a loop (sketched below):
1. Self-play: generate games with the current network, producing data D = (π, z)
2. Learning: θ_{t+1} = θ_t − α ∇_θ L(f_θ, D)
3. Repeat
33 / 41
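Putting the boxes together, the training loop just alternates self-play and learning. The play_self_play_game helper (returning (s_t, π_t, z) triples from one MCTS-guided game) is a hypothetical stand-in; alphago_zero_loss and the network are from the sketches above.

    import torch

    def train(net, optimizer, play_self_play_game, alphago_zero_loss,
              n_iterations: int = 1000, games_per_iter: int = 25):
        """Alternate (1) self-play data generation, (2) learning, (3) repeat."""
        for _ in range(n_iterations):
            data = []                                          # (s_t, pi_t, z) triples
            for _ in range(games_per_iter):
                data.extend(play_self_play_game(net))          # 1. self-play with MCTS
            states = torch.stack([s for s, _, _ in data])
            pis = torch.stack([pi for _, pi, _ in data])
            zs = torch.tensor([z for _, _, z in data], dtype=torch.float32)
            loss = alphago_zero_loss(net, states, pis, zs)     # 2. learning
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        return net                                             # 3. repeat until done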
55. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaGo Zero: Key Takeaways
1. A simpler system can be more powerful than a complex
one (AlphaGo Zero vs AlphaGo)
2. Neural networks can be combined like LEGO blocks
3. ResNet is better than traditional ConvNets
4. Self-play
34 / 41
56. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaZero (2018) [13]
[13] David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". In: Science 362.6419 (2018), pp. 1140-1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
35 / 41
57. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaZero
▸ AlphaGo Zero reaches superhuman level at Go without any domain knowledge...
▸ Since AlphaGo Zero is not specific to Go, can the same algorithm play other games?
▸ AlphaZero extends AlphaGo Zero to play not only Go but also chess and shogi
▸ The same algorithm achieves superhuman performance in all three games
36 / 41
58. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaZero
The same multi-headed ResNet f_θ(s) = (p, v), with policy head P[a_1|s], …, P[a_n|s] and value head v(s) ≈ P[win|s], takes Go, chess, or shogi input features depending on the game.
[Figure: Go, chess, and shogi positions encoded as input feature planes for the shared network]
37 / 41
59. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaZero is Much Simpler than AlphaGo, Yet More Powerful
[Figure: AlphaGo, AlphaGo Zero, and AlphaZero plotted on generality vs performance axes]
38 / 41
60. Intro AlphaGo AlphaGo Zero AlphaZero Summary
AlphaZero: Key Takeaways
1. Being able to play three games at a superhuman level,
AlphaZero is one step closer to general AI
2. Massive compute power still required for training
AlphaZero
▸ 5000 TPUs
3. Self-play
39 / 41
62. Intro AlphaGo AlphaGo Zero AlphaZero Summary
Presentation Summary
▸ Sometimes a simpler system can be more powerful than a
complex one
▸ Universal research principle: strive for generality,
simplicity, Occam’s Razor
▸ Self-play: no human bias, learn from first principles
▸ Deep RL is still in its infancy; a lot more is to be expected in the next few years
▸ Open challenges: Sample efficiency, data efficiency
▸ Yes, AlphaGo can learn to play Go after hundreds of game
years, but a human can reach a decent level of play in only
a couple of hours
▸ How can we make reinforcement learning more efficient?
Model-based learning is a research area with increasing
attention
41 / 41
63. Intro AlphaGo AlphaGo Zero AlphaZero Summary
References [19]
▸ DQN [14]
▸ AlphaGo [15]
▸ AlphaGo Zero [16]
▸ AlphaZero [17]
▸ AlphaStar [18]
[14] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". In: Nature 518.7540 (Feb. 2015), pp. 529-533. issn: 0028-0836. url: http://dx.doi.org/10.1038/nature14236.
[15] David Silver et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search". In: Nature 529.7587 (Jan. 2016), pp. 484-489. doi: 10.1038/nature16961.
[16] David Silver et al. "Mastering the game of Go without human knowledge". In: Nature 550 (Oct. 2017), pp. 354–. url: http://dx.doi.org/10.1038/nature24270.
[17] David Silver et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play". In: Science 362.6419 (2018), pp. 1140-1144. url: http://science.sciencemag.org/content/362/6419/1140/tab-pdf.
[18] Oriol Vinyals et al. "Grandmaster level in StarCraft II using multi-agent reinforcement learning". In: Nature 575 (Nov. 2019). doi: 10.1038/s41586-019-1724-z.
[19] Thanks to Rolf Stadler for reviewing and discussing drafts of this presentation.
41 / 41