
# AlphaGo Zero


An internal talk on AlphaGo Zero, given at Didi (滴滴) in late 2017.


### AlphaGo Zero

1. AlphaGo Zero (guodong)
2. Go • What kind of problem is it, essentially? Search and scoring over a finite space • How would you implement a Go program yourself? • Look a few moves ahead (as in a maze) -> complexity blows up • A library of set patterns (rules) + learning the moves of strong players • Evaluating the board position
3. Three related papers • Beat humans at some games without human knowledge, 2015 • Human-level control through deep reinforcement learning • Beat humans at Go, partially relying on human knowledge, 2016 • Mastering the Game of Go with Deep Neural Networks and Tree Search • Dominated humans at Go without human knowledge, 2017 • Mastering the Game of Go without Human Knowledge
4. AlphaGo
5. MCTS • Most Go programs are based on MCTS • Shrink the search space via a policy network and a value network • Exploration tradeoff: prefer actions with high prior probability, low visit count, and high action value
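The exploration tradeoff on this slide is the PUCT-style selection rule. A minimal sketch in Python, assuming per-child statistics (action value Q, prior P, visit count N) and an exploration constant `c_puct`; the names and the constant are illustrative, not from the slides:

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.0):
    """Score = Q + U: prefers high action value, high prior probability,
    and low visit count. c_puct is a hypothetical exploration constant."""
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

def select_child(children, parent_visits, c_puct=1.0):
    """Pick the child maximizing Q + U, i.e. the exploration tradeoff."""
    return max(
        children,
        key=lambda ch: puct_score(ch["Q"], ch["P"], parent_visits, ch["N"], c_puct),
    )
```

Note how an unvisited child with a high prior can outscore an already-visited child with a higher Q, which is exactly the tradeoff the slide describes.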
6. MCTS steps • Selection: from the current state, walk down one path of the tree, based on Q value, prior probability (from the policy network), and visit count • Expansion: when it is worth extending the path deeper (visit count above a threshold), sample with the default policy • Evaluation: combine the value-network prediction with one random rollout result; update the state's value from the samples • the value network approximates game outcomes under a strong policy, while rollouts evaluate game outcomes under a weaker policy • Backup: update the tree (visit counts and Q values)
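The four steps above can be sketched as a single simulation function. This is a toy sketch, not the paper's implementation: `evaluate` and `legal_moves` are hypothetical callbacks standing in for the value network and the game rules, and the rollout mixing is only noted in a comment:

```python
import math

class Node:
    def __init__(self, prior=1.0):
        self.N = 0          # visit count
        self.W = 0.0        # total value backed up through this node
        self.prior = prior  # move prior from the policy network
        self.children = {}  # move -> Node

    @property
    def Q(self):
        return self.W / self.N if self.N else 0.0

def simulate(root, evaluate, legal_moves, expand_threshold=1, c_puct=1.0):
    """One MCTS simulation: selection -> expansion -> evaluation -> backup."""
    path, node = [root], root
    # Selection: walk down the tree by Q + U until reaching a leaf.
    while node.children:
        parent = node
        _, node = max(
            parent.children.items(),
            key=lambda kv: kv[1].Q
            + c_puct * kv[1].prior * math.sqrt(parent.N + 1) / (1 + kv[1].N),
        )
        path.append(node)
    # Expansion: only once the leaf has been visited often enough.
    if node.N >= expand_threshold:
        for move, prior in legal_moves(path):
            node.children[move] = Node(prior)
    # Evaluation: value-network estimate of the leaf
    # (AlphaGo also mixed in one random rollout here).
    value = evaluate(path)
    # Backup: update visit counts and values along the path.
    for n in path:
        n.N += 1
        n.W += value
    return value
```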
7. AlphaGo • Power MCTS with a policy network and a value network • Policy network (SL + policy gradient on a DNN) • Value network (value-function approximation + DNN + MC targets)
8. AlphaGo: Policy Network • Purpose: narrow the search down to high-probability moves • First trained by SL to predict human expert moves, then refined by policy-gradient RL
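A toy illustration of the policy-gradient refinement stage, assuming a softmax policy over raw move logits and a game outcome z in {+1, -1}; the names, scale, and single-move granularity are illustrative, not the paper's:

```python
import math

def softmax(logits):
    """Convert raw logits into move probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_update(logits, chosen, z, lr=0.1):
    """One REINFORCE step: raise the log-probability of the chosen move
    when the game outcome z was a win (+1), lower it on a loss (-1)."""
    probs = softmax(logits)
    grad = [(1.0 if i == chosen else 0.0) - p for i, p in enumerate(probs)]
    return [l + lr * z * g for l, g in zip(logits, grad)]
```

After a win, the chosen move's probability goes up relative to the alternatives; repeated over many self-play games, this is what shifts the SL policy toward winning rather than merely imitating experts.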
9. Value Network • Purpose: evaluate positions in the tree • Value-based reinforcement learning (function approximation with a DNN) • MSE between predicted values and observed rewards • the "labels" come from MC (episodes from self-play; a whole episode shares a single reward) • prone to overfitting, because successive positions are similar
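A minimal sketch of this training setup, assuming a linear value head fitted by gradient descent on the MSE (the paper used a DNN); sampling only one position per self-play episode, as the original AlphaGo work did, to reduce the correlation between successive positions that makes overfitting easy:

```python
import random

def train_value_net(episodes, epochs=100, lr=0.1):
    """Fit v(s) = w·s + b by MSE against the episode outcome z.
    `episodes` is [(list_of_state_vectors, z)]; every position in an
    episode shares the same z, so we de-correlate by drawing one random
    (state, outcome) pair per episode."""
    dim = len(episodes[0][0][0])
    w, b = [0.0] * dim, 0.0
    data = [(random.choice(states), z) for states, z in episodes]
    for _ in range(epochs):
        for s, z in data:
            v = sum(wi * si for wi, si in zip(w, s)) + b
            err = v - z                      # d(MSE)/dv
            for i, si in enumerate(s):
                w[i] -= lr * err * si
            b -= lr * err
    return w, b
```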
10. AlphaGo Zero
11. AlphaGo Zero • 1. Trained solely by self-play RL, without any supervision from human data • 2. End-to-end: the raw board image is the input feature • 3. A single neural network, using residual blocks • 4. MCTS relies on the network only, without performing any MC rollouts • NB: all knowledge is learned by the network • lookahead search is incorporated inside the training loop
12. AlphaGo Zero Algorithm • The earlier AlphaGo approach: fix the network first, then build MCTS on top of it • Here: randomly initialize the network parameters and MCTS • Iterate the following steps • improve MCTS using the network's predictions • feed the moves chosen by the MCTS-improved policy back as training input for the next network iteration (self-play) • Minimize the loss:
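The loss formula on the original slide did not survive the transcript. In the paper the slides cite (Mastering the Game of Go without Human Knowledge), the network $f_\theta(s) = (\mathbf{p}, v)$ is trained by minimizing a value MSE plus a policy cross-entropy against the MCTS search probabilities $\pi$, with L2 regularization:

```latex
l = (z - v)^2 - \pi^{\top} \log \mathbf{p} + c \lVert \theta \rVert^2
```

where $z$ is the self-play game outcome and $c$ weights the regularizer.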
13. Comparison • AlphaGo Zero reaches a higher rating; training time drops sharply; it learned strategies different from those of human experts
14. Performance
15. Lessons learned • Not relying on human experience vs. unsupervised learning • Data quality bounds the policy from above: can a model trained on human experience beat humans? • Deterministic vs. stochastic problems • Are there well-defined rewards? • Are there well-defined game rules? • Marketing value of the technology brand • Sampling is valuable: MC in value-network inference, MC in tree search • RL, combined with MC, is powerful for solving dynamic problems • Human knowledge is probably only locally optimal • Engineering (tricks) is critical