AlphaGo / AlphaGo Zero
Keita Watanabe
Motivation
• Tree-based decision-making frameworks are common
across robotics, autonomous vehicles (AV), and more.
• Monte Carlo Tree Search (MCTS) is one of the most
successful of these tree-search algorithms.
• A recent MCTS-based decision-making framework
for AVs (Cai 2019) was significantly influenced by
AlphaGo.
Overview of this presentation
• Introduction to Go
• Alpha Go
• SL Policy Network
• RL Policy Network
• Value Network
• MCTS (Monte Carlo Tree Search)
• Alpha Go Zero
• Improvements from Alpha Go
Rule of Go I
Retrieved from Wikipedia

https://en.wikipedia.org/wiki/Go_(game)
Go is an adversarial game with the objective of
surrounding a larger total area of the board with one's
stones than the opponent. As the game progresses, the
players position stones on the board to map out formations
and potential territories. Contests between opposing
formations are often extremely complex and may result in the
expansion, reduction, or wholesale capture and loss of
formation stones.
The four liberties (adjacent empty points) of a single black
stone (A), as White reduces those liberties by one (B, C, and D).
When Black has only one liberty left (D), that stone is "in atari".
White may capture that stone (remove from board) with a play
on its last liberty (at D-1).
A basic principle of Go is that a group of stones must have at
least one "liberty" to remain on the board. A "liberty" is an open
"point" (intersection) bordering the group. An enclosed liberty
(or liberties) is called an eye (眼), and a group of stones with
two or more eyes is said to be unconditionally "alive". Such
groups cannot be captured, even if surrounded.
Rule of Go II
Points where Black can capture White
Points where White cannot place a stone
Fig. 1.1 of (Otsuki 2017)
Rule of Go IV: Victory judgment
If you want to know more, just ask Ivo or Erik
Fig. 1.2 of (Otsuki 2017)
* Score: # of stones + # of eyes
* Komi: Black (who moves first) is handicapped, typically by 7.5 points
* Black territory 45, White territory 36
45 > 36 + 7.5 => Black wins
Why is Go so difficult?
Approximate size of the search space:
Othello: 10^60
Chess: 10^120
Shogi: 10^220
Go: 10^360
Table 1.1 of (Otsuki 2017)
The size of the search space is enormous!
Alpha Go
Abstract
The game of Go has long been viewed as the most challenging of classic
games for artificial intelligence owing to its enormous search space and the
difficulty of evaluating board positions and moves. Here we introduce a new
approach to computer Go that uses 'value networks' to evaluate board
positions and 'policy networks' to select moves. These deep neural networks are
trained by a novel combination of supervised learning from human expert games,
and reinforcement learning from games of self-play. Without any lookahead
search, the neural networks play Go at the level of state-of-the-art Monte
Carlo tree search programs that simulate thousands of random games of
self-play. We also introduce a new search algorithm that combines Monte
Carlo simulation with value and policy networks. Using this search algorithm, our
program AlphaGo achieved a 99.8% winning rate against other Go programs,
and defeated the human European Go champion by 5 games to 0. This is the
first time that a computer program has defeated a human professional player in the
full-sized game of Go, a feat previously thought to be at least a decade away.
Overview of Alpha Go
[Pipeline diagram:]
* Rollout Policy: prediction of the move; logistic regression; fast; used for playouts
* Policy Network: prediction of the move; CNN; used for node selection & expansion
* Value Network: prediction of the win rate; CNN
Trained from records of strong players and from self-play (RL); all components are combined in MCTS.
Rollout Policy
• Logistic regression using features that are well known in this field (see the table below).
• Trained with 30 million positions from the KGS Go Server (https://www.gokgs.com/).
• This model is used for rollouts (details will be explained later).
In total: 109,747 features. Extended Table 4 of (Silver 2016)
Logistic Regression
Inputs x1, x2, …, x109747 (binary features):
u = Σ_{k=1}^{109747} w_k x_k
p̃ = 1 / (1 + e^(−u))
• Logistic regression with well-known features used in this field
• Trained with 30 million positions from the KGS Go Server (https://www.gokgs.com/)
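As a concrete illustration, here is a minimal sketch (not the actual AlphaGo implementation) of how a logistic-regression rollout policy of this form can be evaluated: the features are sparse and binary, so u is just the sum of the weights of the active features. The function name and the sparse-index representation are assumptions made for this example.

```python
import numpy as np

def rollout_move_probability(active_feature_indices, weights):
    """Evaluate a logistic-regression rollout policy for one candidate move.

    active_feature_indices: indices k where the binary feature x_k = 1
    weights: weight vector w (one weight per feature, 109,747 in total)
    Returns p~ = sigmoid(u), with u = sum_k w_k * x_k.
    """
    u = weights[active_feature_indices].sum()   # only active features contribute to u
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical usage: score every legal move during a playout and pick the best.
# probs = np.array([rollout_move_probability(feats(m), w) for m in legal_moves])
# move = legal_moves[np.argmax(probs)]
```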
Tree Policy
• A logistic regression model with additional features.
• Better accuracy than the rollout policy, at the cost of extra computation time.
• Used in the expansion step of Monte Carlo Tree Search.
In total: 141,989 features. Extended Table 4 of (Silver 2016)
Overview of Alpha Go (the same pipeline diagram as above)
Policy Network: Overview
Fig. 1 of (Silver 2016)
• Convolutional Neural Network
• The network is first trained by supervised learning and later refined by reinforcement learning
• Trained on the KGS dataset: 29.4 million positions from 160,000 games played by KGS 6 to 9 dan players
SL policy network
The output is a probability (percentage) for each move. Fig. 2.18 of (Otsuki 2017)
SL Policy Network
• Convolutional Neural Network
• Trained on the KGS dataset: 29.4 million positions from 160,000 games played by KGS 6 to 9 dan players
• 48 input channels (features) are prepared (the next slide explains the details).
[Architecture diagram: a 19 x 19 board with 48 input channels; a 5 x 5 convolution in the first layer followed by 3 x 3 convolutions, all preserving the 19 x 19 spatial size; the output is the probability of the next move over the 19 x 19 board. Board image from https://senseis.xmp.net/?Go]
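To make the architecture above concrete, here is a rough PyTorch sketch of an SL-policy-style network: 48 input planes, a 5x5 convolution followed by 3x3 convolutions that keep the 19x19 size, and a softmax over the board. The channel width and layer count are illustrative defaults, not claims about the exact published configuration.

```python
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    """Sketch of an SL-policy-style CNN: 48 input planes -> move probabilities."""

    def __init__(self, channels=192, n_layers=13):
        super().__init__()
        layers = [nn.Conv2d(48, channels, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(n_layers - 2):  # middle 3x3 layers, spatial size kept at 19x19
            layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU()]
        layers.append(nn.Conv2d(channels, 1, kernel_size=1))  # one logit per board point
        self.body = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, 48, 19, 19)
        logits = self.body(x).flatten(1)     # (batch, 361)
        return torch.softmax(logits, dim=1)  # probability of the next move per point

# Hypothetical usage:
# net = SLPolicyNet()
# probs = net(torch.zeros(1, 48, 19, 19))   # probs.shape == (1, 361)
```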
Input features
(Silver 2016)
Note: most of the hand-crafted features here are not new; they are commonly used in this field.
RL Policy Network
• They further trained the policy network by policy-gradient reinforcement learning.
• Training is done by self-play.
• The win rate of the RL policy network against the original SL policy network was 80%.
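A minimal sketch of one policy-gradient (REINFORCE-style) update from a finished self-play game, assuming an SLPolicyNet-like model that returns move probabilities; tensor shapes and the helper name are illustrative, not the paper's exact training code.

```python
import torch

def policy_gradient_step(policy_net, optimizer, states, moves, z):
    """One self-play update.
    states: (T, 48, 19, 19) positions seen by this player
    moves:  (T,) LongTensor of move indices the player chose
    z:      +1 if this player won the game, -1 otherwise."""
    probs = policy_net(states)                               # (T, 361)
    chosen = probs.gather(1, moves.unsqueeze(1)).squeeze(1)  # pi(a_t | s_t)
    loss = -(z * torch.log(chosen + 1e-12)).sum()            # gradient ascent on z * log pi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```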
Overview of Alpha Go (the same pipeline diagram as above)
Value Network
• Alpha Go uses the RL policy network to generate training data for the Value Network, which predicts the win rate.
• Training data: 30 million (position, win/lose) pairs
• Generating the data took 1 week on 50 GPUs
• Training also took 1 week on 50 GPUs
• The network provides an evaluation function for Go (something previously considered very hard to build).
Fig. 1 of (Silver 2016)
Rollout Policy vs. Policy Network vs. Value Network
* Model: logistic regression / CNN (13 layers) / CNN (15 layers)
* Time to evaluate one state: 2 μs / 5 ms / 5 ms
* Time per playout (200 moves): 0.4 ms / 1.0 s / -
* Playouts per second: about 2,500 / about 1 / -
* Move-prediction accuracy: 24% / 57% / -
Overview of Alpha Go (the same pipeline diagram as above)
MCTS Example: Nim
• You can take one or more stones from either the left or the right pile
• You win when you take the last stone
• This example is from http://blog.brainpad.co.jp/entry/2018/04/05/163000
Game Tree
Green: Player who moves first wins
Yellow: Player who moves second wins
Retrieved from http://blog.brainpad.co.jp/entry/2018/04/05/163000
Monte Carlo Simulation
Retrieved from http://blog.brainpad.co.jp/entry/2018/04/05/163000
You can estimate the Q value of each state by simulation.
-> MCTS is a heuristic that lets us efficiently investigate promising states
Monte Carlo Tree Search
Monte Carlo tree search (MCTS) is a heuristic
search algorithm for decision processes.
The focus of Monte Carlo tree search is on
the analysis of the most promising moves,
expanding the search tree based on random
sampling of the search space. The application
of Monte Carlo tree search in games is based
on many playouts. In each playout, the game
is played out to the very end by selecting
moves at random. The final game result of
each playout is then used to weight the
nodes in the game tree so that better nodes
are more likely to be chosen in future
playouts.
(Browne 2012)
MCTS Example
N: 0, Q: 0
Initial State
N: # of visits to the state
Q: Expected reward
Selection
Select node that maximizes
Q(s, a) + C_p √(2 ln n_s / n_{s,a})
N: 1, Q: 0
N: 1, Q: 0 N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
First term: estimated reward
Second term: bias term
It balances exploration vs. exploitation (Auer 2002)
(In this case the selection is random)
N: 1, Q: 0
N: 1, Q: 0 N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
Win
Rollout
Randomly play the game to the end and find out win/lose
N: 1, Q: 0
N: 1, Q: 1 N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
Win
Backup
Update Q of the visited states
N: 1, Q: 0
N: 1, Q: 1 N: 1, Q: 1 N: 1, Q: -1 N: 1, Q: 1
N: 5, Q: 0.25
N: 2, Q: 1 N: 1, Q: 1 N: 1, Q: -1 N: 1, Q: 1
N: 0, Q: 0 N: 0, Q: 0 N: 0, Q: 0
N: 5, Q: 0.25
N: 2, Q: 1 N: 1, Q: 1 N: 1, Q: -1 N: 1, Q: 1
Expansion
Expand the tree when a node has been visited a pre-defined number of times (here, 2)
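Putting the four steps (selection, expansion, rollout, backup) together, here is a small self-contained MCTS sketch for the two-pile take-away game above. It assumes rewards of ±1, an expansion threshold of one prior visit, and UCB1 with C_p ≈ 1.4; all names are illustrative.

```python
import math, random

class Node:
    def __init__(self, state, to_move, parent=None, move=None):
        self.state, self.to_move = state, to_move    # state = (left, right); to_move = +1 / -1
        self.parent, self.move = parent, move
        self.children, self.N, self.W = [], 0, 0.0   # W: reward for the player who moved into this node

def legal_moves(state):
    left, right = state
    return [("L", k) for k in range(1, left + 1)] + [("R", k) for k in range(1, right + 1)]

def apply_move(state, move):
    side, k = move
    left, right = state
    return (left - k, right) if side == "L" else (left, right - k)

def ucb1(child, cp=1.4):
    if child.N == 0:
        return float("inf")
    return child.W / child.N + cp * math.sqrt(2 * math.log(child.parent.N) / child.N)

def rollout(state, to_move):
    while True:                                      # random playout; the last stone wins
        state = apply_move(state, random.choice(legal_moves(state)))
        if state == (0, 0):
            return to_move
        to_move = -to_move

def mcts(root_state, iterations=3000):
    root = Node(root_state, to_move=+1)
    for _ in range(iterations):
        node = root
        while node.children:                         # Selection
            node = max(node.children, key=ucb1)
        if node.state != (0, 0) and node.N > 0:      # Expansion (after one prior visit)
            for m in legal_moves(node.state):
                node.children.append(Node(apply_move(node.state, m), -node.to_move, node, m))
            node = random.choice(node.children)
        winner = -node.to_move if node.state == (0, 0) else rollout(node.state, node.to_move)
        while node is not None:                      # Backup
            node.N += 1
            node.W += 1 if winner == -node.to_move else -1
            node = node.parent
    return max(root.children, key=lambda c: c.N).move   # most-visited move at the root

# Example: with piles (3, 4), MCTS should prefer a move that equalises the piles.
print(mcts((3, 4)))
```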
MCTS in Alpha Go
• The bias term is the original bias weighted by the output of the SL policy network, P(s, a).
• The win rate of a leaf is evaluated by combining playouts (rollout policy) with the output of the value network.
• Massively parallel computation using both GPUs (176) and CPUs (1,202).

Q(s, a) = (1 − λ) · W_v(s, a) / N_v(s, a) + λ · W_r(s, a) / N_r(s, a)
u(s, a) = c_puct · P(s, a) · √(Σ_b N_r(s, b)) / (1 + N_r(s, a))
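A sketch of the two quantities above as plain functions, with λ and c_puct left as free parameters (no claim is made here about the concrete values used in the paper); W_v/N_v are value-network statistics and W_r/N_r are rollout statistics accumulated on the edge.

```python
import math

def q_value(W_v, N_v, W_r, N_r, lam=0.5):
    # Mixed action value: value-network estimate blended with the rollout estimate by lambda.
    value_part = W_v / N_v if N_v > 0 else 0.0
    rollout_part = W_r / N_r if N_r > 0 else 0.0
    return (1 - lam) * value_part + lam * rollout_part

def u_value(P, N_r_sa, N_r_parent_total, c_puct=5.0):
    # Exploration bonus: proportional to the SL-policy prior P(s, a),
    # decaying as the edge's visit count N_r(s, a) grows.
    return c_puct * P * math.sqrt(N_r_parent_total) / (1 + N_r_sa)

# During selection, AlphaGo-style MCTS picks the action maximising q_value(...) + u_value(...).
```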
Performance
Figure 4 of (Silver 2016)
Alpha Go Zero
Abstract
A long-standing goal of artificial intelligence is an algorithm that learns,
tabula rasa, superhuman proficiency in challenging domains. Recently,
AlphaGo became the first program to defeat a world champion in the game of
Go. The tree search in AlphaGo evaluated positions and selected moves using
deep neural networks. These neural networks were trained by supervised
learning from human expert moves, and by reinforcement learning from self-
play. Here we introduce an algorithm based solely on reinforcement
learning, without human data, guidance or domain knowledge beyond
game rules. AlphaGo becomes its own teacher: a neural network is trained to
predict AlphaGo's own move selections and also the winner of AlphaGo's
games. This neural network improves the strength of the tree search, resulting
in higher quality move selection and stronger self-play in the next iteration.
Starting tabula rasa, our new program AlphaGo Zero achieved superhuman
performance, winning 100‒0 against the previously published, champion-
defeating AlphaGo.
Tabula rasa is a Latin phrase often translated as "clean slate".
Point 1: Dual Network
[Architecture diagram: 19 x 19 input planes feed a shared convolutional body; the network has two heads. Output 1: prediction of the next move (p, a 19 x 19 probability map). Output 2: the win rate (v). Board image from https://senseis.xmp.net/?Go]
• 40+ layer Convolutional Neural Network
• Each layer: 3x3 convolution + batch normalization + ReLU
• Layers 2 to 39 are residual (ResNet) blocks
• Trained by self-play (details are described later)
• 17 input channels (features) are prepared (the next slide shows the details).
• The training method for this network is discussed later (for now, assume it has been trained well).
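A rough PyTorch sketch of a dual (policy + value) network in this spirit: a shared convolutional/ResNet body over 17 input planes and two heads, p over the 19x19 + 1 moves (including pass) and v in [-1, 1]. Block count, channel width, and head sizes are illustrative choices, not the published configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1, self.bn1 = nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c)
        self.conv2, self.bn2 = nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c)

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return torch.relu(x + h)                     # residual (skip) connection

class DualNet(nn.Module):
    """Shared body with a policy head (p) and a value head (v)."""

    def __init__(self, channels=64, n_blocks=5):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(17, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU())
        self.body = nn.Sequential(*[ResBlock(channels) for _ in range(n_blocks)])
        self.policy_head = nn.Sequential(nn.Conv2d(channels, 2, 1), nn.Flatten(),
                                         nn.Linear(2 * 19 * 19, 19 * 19 + 1))
        self.value_head = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Flatten(),
                                        nn.Linear(19 * 19, 64), nn.ReLU(),
                                        nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):                            # x: (batch, 17, 19, 19)
        h = self.body(self.stem(x))
        return self.policy_head(h), self.value_head(h)   # policy logits p, win-rate estimate v
```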
AlphaGo Zero depends less on hand-crafted features.
[48 features of Alpha Go (Silver 2016): see the earlier input-features slide.]
17 features of Alpha Go Zero (Silver 2017), as feature (# of planes):
* Position of black stones (1)
* Position of white stones (1)
* Position of black stones k (1 to 7) steps before (7)
* Position of white stones k (1 to 7) steps before (7)
* Turn (1)
Point 2: Improvement of MCTS
• The MCTS algorithm uses the following value for node selection.
• No playouts; it relies only on the value estimate from the network.
Q(s, a) + u(s, a), where
Q(s, a) = W(s, a) / N(s, a)   (win rate)
u(s, a) = c_puct · p(s, a) · √(Σ_b N(s, b)) / (1 + N(s, a))   (bias term; p(s, a) is the network's prediction of move a)
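A sketch of the selection rule above, assuming tree edges store a visit count N, a total value W, and the network prior p(s, a); the attribute names are illustrative.

```python
import math

def select_action(children, c_puct=1.0):
    """children: edge objects with attributes N (visits), W (total value),
    and prior (the dual network's p(s, a)). Returns the edge maximising Q + u."""
    total_visits = sum(child.N for child in children)

    def score(child):
        q = child.W / child.N if child.N > 0 else 0.0                  # Q(s, a) = W / N
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.N)
        return q + u

    return max(children, key=score)
```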
MCTS 1: Selection
Select the node which has the maximum Q(s, a) + u(s, a).
[Tree diagram with example win rates 25%, 48%, 35%]
MCTS 2: Expansion
Expand the selected node.
[New children with example win rates 30%, 42% are added below the 25% / 48% / 35% nodes]
MCTS 2: Evaluation
Evaluate p and v using the dual network.
* p will be used for the calculation of Q + u
* the win rate of the state is updated by v
[Example: the leaf's estimate 42% is updated to 70% using v = 70%]
MCTS 3: Backup
Update the win rate of each state along the path and propagate up to the root node.
[Example: with v = 70% at the leaf, 60% -> 65% and 50% -> 55% on the path to the root]
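A sketch of the backup step, under the assumption that values are stored from the perspective of the player to move at each node, so the sign flips at every ply; if win rates are instead stored from a single fixed player's perspective, the flip is dropped.

```python
def backup(leaf, v):
    """Propagate the dual network's value estimate v from the evaluated leaf
    back to the root, updating visit counts and accumulated values."""
    node, value = leaf, v
    while node is not None:
        node.N += 1
        node.W += value          # running total; the node's win-rate estimate is W / N
        node = node.parent
        value = -value           # what is good for one player is bad for the other
```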
Point 3: Improvements on RL
(p, v) = f_θ(s),  l = (z − v)² − πᵀ log p + c‖θ‖²
• The dual network (parameters θ) accumulates data by self-play (step 1, repeated 25 thousand times).
• Based on the result, the network parameters are updated (step 2), giving new parameters θ′.
• The two network instances then compete; the network parameters are replaced if the new parameter set wins.
• Repeat steps 1 and 2.
Step 1: Data Accumulation
• Play a game of self-play and store the outcome z.
• Store all (s, π, z) tuples from the game.
• The policy π is calculated from the visit counts as
π_a = N(s, a)^(1/γ) / Σ_b N(s, b)^(1/γ)
• Repeat the above process 250,000 times.
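A sketch of turning root visit counts into the stored policy π with the temperature-like exponent 1/γ from the formula above; a small γ sharpens the distribution toward the most-visited move. The function name is illustrative.

```python
import numpy as np

def search_policy(visit_counts, gamma=1.0):
    """pi_a proportional to N(s, a)^(1/gamma), normalised over all actions b."""
    counts = np.asarray(visit_counts, dtype=float)
    powered = counts ** (1.0 / gamma)
    return powered / powered.sum()

# Example: visit counts [10, 30, 60] with gamma = 1 give pi = [0.1, 0.3, 0.6].
```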
Step 2: Parameter Update
• Calculate the loss function using the (s, π, z) tuples collected in the previous step:
(p, v) = f_θ(s),  l = (z − v)² − πᵀ log p + c‖θ‖²
• Update the parameters by gradient descent:
θ′ ← θ − α ⋅ ∇_θ l
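A sketch of the loss and the update in PyTorch, assuming the dual network returns policy logits and a value in [-1, 1]; the L2 term c‖θ‖² is handled here via the optimiser's weight_decay rather than written out explicitly. Names and hyperparameter values are illustrative.

```python
import torch
import torch.nn.functional as F

def alphazero_loss(p_logits, v, pi, z):
    """l = (z - v)^2 - pi^T log p, averaged over the batch.
    p_logits: (B, 362) policy logits, v: (B, 1) value,
    pi: (B, 362) search policy targets, z: (B,) game outcomes."""
    value_loss = F.mse_loss(v.squeeze(-1), z)
    policy_loss = -(pi * F.log_softmax(p_logits, dim=-1)).sum(dim=-1).mean()
    return value_loss + policy_loss

# Hypothetical training step (weight_decay plays the role of c * ||theta||^2):
# optimizer = torch.optim.SGD(net.parameters(), lr=alpha, momentum=0.9, weight_decay=1e-4)
# p_logits, v = net(states)
# loss = alphazero_loss(p_logits, v, pi_targets, z_targets)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```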
Empirical evaluation of
AlphaGo Zero
Fig 3 of (Silver 2017)
Performance of AlphaGo
Zero
Fig 6 of (Silver 2017)
https://research.fb.com/facebook-open-sources-elf-opengo/
References
1. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., … Hassabis, D. (2016). Mastering the
game of Go with deep neural networks and tree search. Nature, 529(7587), 484‒489. https://doi.org/10.1038/
nature16961

Alpha Go
2. Otsuki, T., & Miyake. (2017). Saikyo igo eai arufago kaitai shinsho : Shinso gakushu montekaruro kitansaku kyoka
gakushu kara mita sono shikumi. Shoeisha.
3. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., … Hassabis, D. (2017). Mastering the game
of Go without human knowledge. Nature, 550(7676), 354‒359. https://doi.org/10.1038/nature24270

Alpha Go Zero
4. Browne, C., Powley, E., Whitehouse, D., Lucas, S., Member, S., Cowling, P. I., … Colton, S. (2012). A Survey of Monte
Carlo Tree Search Methods. IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, 4(1).
https://doi.org/10.1109/TCIAIG.2012.2186810
5. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256.
6. Cai, P., Luo, Y., Saxena, A., Hsu, D., & Lee, W. S. (2019). LeTS-Drive: Driving in a Crowd by Learning from Tree Search.
Retrieved from https://arxiv.org/pdf/1905.12197.pdf
