AlphaGo and AlphaGo Zero

AlphaGo/AlphaGo
Zero
Keita Watanabe

Motivation
• Tree based decision-making framework is common
across robotics, AV, and etc…
• Monte Carlo Tree Search (MCTS) is one of the most
successful method among other tree search
algorithms.
• Recent MCTS based decesion-making framework
for AV (Cai 2019) signiﬁcantly inﬂuenced by
AlphaGo

Overview of this
presentation
• Introduction to Go
• Alpha Go
• SL Policy Network
• RL Policy Network
• Value Network
• MCTS (Monte Carlo Tree Search)
• Alpha Go Zero
• Improvements from Alpha Go

Rule of Go I
Retrieved from Wikipedia 
https://en.wikipedia.org/wiki/Go_(game)
Go is an adversarial game with the objective of
surrounding a larger total area of the board with one's
stones than the opponent. As the game progresses, the
players position stones on the board to map out formations
and potential territories. Contests between opposing
formations are often extremely complex and may result in the
expansion, reduction, or wholesale capture and loss of
formation stones.
The four liberties (adjacent empty points) of a single black
stone (A), as White reduces those liberties by one (B, C, and D).
When Black has only one liberty left (D), that stone is "in atari .
White may capture that stone (remove from board) with a play
on its last liberty (at D-1).
A basic principle of Go is that a group of stones must have at
least one "liberty" to remain on the board. A "liberty" is an open
"point" (intersection) bordering the group. An enclosed liberty
(or liberties) is called an eye (眼), and a group of stones with
two or more eyes is said to be unconditionally "alive". Such
groups cannot be captured, even if surrounded.

Rule of Go II
Points where
Black can capture White
Points where
White cannot place stone
Fig. 1.1 of (Otsuki 2017)

Rule of Go IV:  
Victory judgment
If you want to know more, just ask Ivo or Erik
Fig. 1.2 of (Otsuki 2017)
* Score: # of stones + # of
eyes
* Komi: Black (Moves ﬁrst)
takes a handicap. Typically
7.5 points
* Black territory 45, White
territory 36 
45 > 36 + 7.5 => Black wins

Why Go is so diﬃcult?
Approx Size of the Search Space
Othello 10**60
Chess 10**120
Shogi 10**220
Go 10**360
Table 1.1 of (Otuki)
Size of the search space is enormous!

Abstract
The game of Go has long been viewed as the most challenging of classic
games for artificial intelligence owing to its enormous search space and the
difficulty of evaluating board positions and moves. Here we introduce a new
approach to computer Go that uses 'value networks' to evaluate board
positions and 'policy networks' to select moves. These deep neural networks are
trained by a novel combination of supervised learning from human expert games,
and reinforcement learning from games of self-play. Without any lookahead
search, the neural networks play Go at the level of state-of-the-art Monte
Carlo tree search programs that simulate thousands of random games of
self-play. We also introduce a new search algorithm that combines Monte
Carlo simulation with value and policy networks. Using this search algorithm, our
program AlphaGo achieved a 99.8% winning rate against other Go programs,
and defeated the human European Go champion by 5 games to 0. This is the
first time that a computer program has defeated a human professional player in the
full-sized game of Go, a feat previously thought to be at least a decade away.
1
2
3
4

Overview of Alpha Go
Policy
Network
Value
Network
Rollout
Policy
Prediction of Move
Prediction of Move
Prediction of Win Rate
* Used for playout
* Logistic Regression
* Fast
* Used for Node Selection & Expansion
* CNN
* Fast
* CNN
Record of strong
players
Self play (RL)
MCTS

Rollout Policy
• Logistic Regression with well known features (see the table below) used in
this ﬁeld.
• Trained with 30 million positions from the KGS Go Server (https://
www.gokgs.com/).
• This model used for Rollout (details will be explained later).
In total: 109747 Features. Extended Table 4 of (Silver 2016)

Logistic Regression
.
.
.
x1
x2
x109747
Σ
u =
109747
∑
k=1
wkxk
˜
p =
1
1 + e−u
• Logistic Regression with well known features used
in this ﬁeld
• Trained with 30 million positions from the KGS Go
Server (https://www.gokgs.com/)

Tree Policy
• It is a logistic regression model with additional features.
• Improved performance with extra computational time.
• Used for Expansion step of Monte Carlo Tree search
•
In total: 141989 Features. Extended Table 4 of (Silver 2016)

Policy Network: Overview
Fig1. of (Silver 2016)
• Convolutional Neural Network
• The network ﬁrst is trained
by supervised learning
algorithm and later reﬁned by
reinforcement learning
• Trained with KGS dataset.
29.4 million positions from
160000 games played by
KGS 6 to 9 dan

SL policy network
Output is percentage Fig. 2.18 (Otsuki 2017)

SL Policy Network
• Convolutional Neural Network
• Trained with KGS dataset. 29.4 million positions from 160000 games played by KGS
6 to 9 dan
• 48 Channels (Features) is prepared (Next slide explains details).
https://senseis.xmp.net/?Go
19 x 19
48 Channel
19
19
5
5
5
5
3
3
3
3
19
19
....
....
3
3
3
3
19 19
19
19
Output: Prob. of the next move

Input features
(Silver 2016)
Note: Most of the hand-made features here are not new, but commonly used
in this ﬁeld.

RL Policy Network
• They further trained the policy network by policy gradient reinforcement
learning.
• Training is done by self-play
• The win rate of the RL policy network over the original SL policy network
was 80%

Value Network
• Alpha Go uses the RL policy network
to generate training data for the
Value Network, which predict win
rate.
• Training data (Position, Win/Lose)
30 million
• It took 1 week with 50 GPU
• Training also took 1 week with 50
GPU
• The network provides Evaluation
function for Go (that considered to be
hard previously).
Fig1. of (Silver 2016)

Rollout Policy Policy Network Value Network
Model
Logistic
Regression
CNN (13 Layers) CNN (15 Layers)
Time for
evaluation of a
state
2μs 5ms 5ms
Time for playout
(200 moves)
0.4ms 1.0 s -
# of playouts
per sec
About 2500 About 1 -
Accuracy 24% 57% -

MCTS Example: Nim
• You can take more than one
stone from ether left or right
• You will win when you take
the last stone
• This example from http://
blog.brainpad.co.jp/entry/
2018/04/05/163000

Game Tree
Green: Player who moves ﬁrst wins
Yellow: Player who moves second wins
Retrieved from http://blog.brainpad.co.jp/entry/2018/04/05/163000

Monte Carlo Simulation
Retrieved from http://blog.brainpad.co.jp/entry/2018/04/05/163000
You can ﬁnd out Q value of each state by simulation.
-> MCTS is a heuristic that enable us eﬃciently investigate promising states

Monte Carlo Tree Search
Monte Carlo tree search (MCTS) is a heuristic
search algorithm for decision processes.
The focus of Monte Carlo tree search is on
the analysis of the most promising moves,
expanding the search tree based on random
sampling of the search space. The application
of Monte Carlo tree search in games is based
on many playouts. In each playout, the game
is played out to the very end by selecting
moves at random. The ﬁnal game result of
each playout is then used to weight the
nodes in the game tree so that better nodes
are more likely to be chosen in future
playouts.
(Browne 2016)

MCTS Example
N: 0, Q: 0
Initial State
N: # of visits to the state
Q: Expected reward
Selection
Select node that maximizes
Q(s, a) + Cp
2 log ns
ns,a
N: 1, Q: 0
N: 1, Q: 0 N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
First term: Estimated reward
Second term: Bias term 
It balances Exploration vs. Exploitation
(Auer, P, 2002)  
(In this case, it s random)
①
②

N: 1, Q: 0
N: 1, Q: 0 N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
Win
Rollout
Randomly play game and  
ﬁnd out win/lose
N: 1, Q: 0
N: 1, Q: 1 N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
Win
Backup
Renew Q of the state
③
④

N: 1, Q: 0
N: 1, Q: 1 N: 1, Q: 1 N: 1, Q: -1 N: 1, Q: 1
N: 5, Q: 0.25
N: 2, Q: 1 N: 1, Q: 1 N: 1, Q: -1 N: 1, Q: 1
N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
N: 5, Q: 0.25
N: 2, Q: 1 N: 1, Q: 1 N: 1, Q: -1 N: 1, Q: 1
Expansion
Expand tree when a node is visited  
certain pre-deﬁned times  
(in this case 2)
⑤
⑥ ⑦

• Bias is evaluated by the original Bias + Output of the
SL policy network 
• Evaluation of win rate => playout + Output of the
value Network
• Massive parallel computation using both GPUs (176)
and CPUs (1202)
Q(s, a) = (1 − λ)
Wv(s, a)
Nv(s, a)
+ λ
Wr(s, a)
Nr(s, a)
u(s, a) = cpuctP(s, a)
∑b
Nr(s, b)
1 + Nr(s, a)
MCTS in Alpha Go
Value Network MCTS
P(s, a)

Performance
Figure 4 of (Silver 2016)

Abstract
A long-standing goal of artificial intelligence is an algorithm that learns,
tabula rasa, superhuman proficiency in challenging domains. Recently,
AlphaGo became the first program to defeat a world champion in the game of
Go. The tree search in AlphaGo evaluated positions and selected moves using
deep neural networks. These neural networks were trained by supervised
learning from human expert moves, and by reinforcement learning from self-
play. Here we introduce an algorithm based solely on reinforcement
learning, without human data, guidance or domain knowledge beyond
game rules. AlphaGo becomes its own teacher: a neural network is trained to
predict AlphaGo s own move selections and also the winner of AlphaGo s
games. This neural network improves the strength of the tree search, resulting
in higher quality move selection and stronger self-play in the next iteration.
Starting tabula rasa, our new program AlphaGo Zero achieved superhuman
performance, winning 100‒0 against the previously published, champion-
defeating AlphaGo.
Tabula rasa is a Latin phrase often translated as "clean slate" .
1
2

Point1: Dual Network
https://senseis.xmp.net/?Go
19 x 19
48 Channel
19
19
5
5
5
5
3
3
3
3
19
19
....
....
3
3
3
3
19 19
19
19
Output1: Prediction
of the next move
19
19
Output Layer
Output2: Win Rate
• 40 Layers+ Convolutional Neural Network
• Each layer 3x3 convolution layer + Batch normalization + Relu
• Layer 2 39 are ResNet
• Trained by self-play (details are described later)
• 17 Channels (Features) is prepared (the next slide shows details).
• Learning method of this network discussed later. (For now, let s assume we have trained it nicely).
p
v

AlphaGo Zero is less depends
on hand crafted features
48 Features of Alpha Go (Silver 2016)
Feature # of planes
Position of black
stones
1
Position of black
stones
1
Position of black
stones k (1 7) steps
before
7
Position of white
stones k (1 7) steps
before
7
Turn 1
17 Features of Alpha Go Zero (Silver 2017)

Point 2: Improvement of
MCTS
• MCTS algorithm uses the following value for state
selection.
• No playout, it just relies on value.
Q(s, a) + u(s, a)
Q(s, a) =
W(s, a)
N(s, a)
u(s, a) = cpuct p(s, a)
∑b
N(s, b)
1 + N(s, a)
Win Rate
Bias
Prediction
of the move a

MCTS 1: Selection
25%
48%
35%
Select the no de which has max Q(s, a) + u(s, a)

MCTS 2: Selection
25%
48%
35%
Expand the node
30% 42%

MCTS 2: Selection
25%
48%
35%
Evaluate p and v using the dual network.
* p will be used for the calculation of Q+U
* the win rate on the state is updated by v
30%
42% -> 70%
p
v = 70 %

MCTS 3: Backup
50% 40%
Update win rate of each state and propagate until the root node.
60% 70%
p
v = 70 %
60% -> 65%
50% -> 55%

Point3 Improvements
on RL
(p, v) = fθ(s) and l = (z − v)2
− πT
log p + c∥θ∥2
• The dual network (parameter ) accumulate data
by self play (step 1: repeated 25 thousand times).
• Based on the result, update the parameter of the
network (step 2), and get new parameter .
• Let two network instantiations compete, update the
network parameter if the new parameter set wins.
• Repeat step 1 and step 2
θ′
θ

Step 1: Data Accumulation
• Do a self play. Store the outcome z.
• Store all (s, π, z) tuples in the game.
• The policy π is calculated as
• Repeat the above processes 250000 times
πa =
N(s, a)1/γ
∑b
N(s, b)1/γ

Step 2: Parameter update
• Calculate the loss function using (s, π, z) evaluated
in the previous step.
(p, v) = fθ(s) and l = (z − v)2
− πT
log p + c∥θ∥2
• Update parameter using gradient descent method.
θ′ ← θ − α ⋅ Δθ

Empirical evaluation of
AlphaGo Zero
Fig 3 of (Silver 2017)

Performance of AlphaGo
Zero
Fig 6 of (Silver 2017)

https://research.fb.com/facebook-open-sources-elf-opengo/

References
1. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., … Hassabis, D. (2016). Mastering the
game of Go with deep neural networks and tree search. Nature, 529(7587), 484‒489. https://doi.org/10.1038/
nature16961 
Alpha Go
2. Otsuki, T., & Miyake. (2017). Saikyo igo eai arufago kaitai shinsho : Shinso gakushu montekaruro kitansaku kyoka
gakushu kara mita sono shikumi. Shoeisha.
3. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., … Hassabis, D. (2017). Mastering the game
of Go without human knowledge. Nature, 550(7676), 354‒359. https://doi.org/10.1038/nature24270 
Alpha Go Zero
4. Browne, C., Powley, E., Whitehouse, D., Lucas, S., Member, S., Cowling, P. I., … Colton, S. (2012). A Survey of Monte
Carlo Tree Search Methods. IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, 4(1).
https://doi.org/10.1109/TCIAIG.2012.2186810
5. Auer, P. (2002). Finite-time analysis of the multi-armed bandit problem with known trend. IEEE Congress on
Evolutionary Computation, CEC 2016, 47(1), 235‒256. https://doi.org/10.1109/CEC.2016.7744106
6. Cai, P., Luo, Y., Saxena, A., Hsu, D., & Lee, W. S. (2019). LeTS-Drive: Driving in a Crowd by Learning from Tree Search.
Retrieved from https://arxiv.org/pdf/1905.12197.pdf

AlphaGo and AlphaGo Zero

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Similar a AlphaGo and AlphaGo Zero

Similar a AlphaGo and AlphaGo Zero (20)

Último

Último (20)

AlphaGo and AlphaGo Zero