AlphaGo Zero

An internal talk on AlphaGo Zero given at DiDi in late 2017

  1. AlphaGo Zero guodong
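  2. Go • What kind of problem is it, fundamentally? Search over a finite space plus scoring • How would you implement a Go-playing program yourself? • Look a few moves ahead (like solving a maze) -> complexity problem • A playbook of standard patterns (rules) + learning moves from expert players • Evaluate the board position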
  3. 3 related papers • Beat humans in some games without human knowledge (2015) • Human-level control through deep reinforcement learning • Beat humans at Go, partially relying on human knowledge (2016) • Mastering the Game of Go with Deep Neural Networks and Tree Search • Dominate humans at Go without human knowledge (2017) • Mastering the Game of Go without Human Knowledge
  4. AlphaGo
  5. MCTS • Most Go programs are based on MCTS • Shrink the search space via a policy network and a value network • Exploration/exploitation tradeoff: prefers actions with high prior probability and low visit count, as well as high action value
  6. MCTS steps • Selection: starting from the current state, follow one path down the tree, choosing moves by Q value, prior probability (from the policy network), and visit count • Expansion: when a path is judged worth extending in depth (its visit count exceeds a threshold), expand it, sampling with the default policy • Evaluation: combine the value network prediction with the result of one random rollout; update the value of the state from the sampled outcome • The value network approximates the outcome of games played by a strong policy, while rollouts evaluate the outcome of games played by a weaker policy • Backup: improve the tree (update visit counts and Q values)
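The selection and backup steps above can be sketched in a few lines. This is a minimal illustration, not the AlphaGo implementation: the `Edge` class, the `select_action` and `backup` helpers, and the default `c_puct = 1.0` are hypothetical names and values chosen for the example. The exploration bonus follows the PUCT form described in the AlphaGo papers, and the backup ignores the sign flip needed when players alternate.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    """Statistics stored for one (state, action) pair in the search tree."""
    prior: float              # P(s, a), supplied by the policy network
    visits: int = 0           # N(s, a)
    total_value: float = 0.0  # W(s, a), sum of backed-up evaluations

    @property
    def q(self) -> float:
        # Mean action value Q(s, a); zero for an unvisited edge.
        return self.total_value / self.visits if self.visits else 0.0


def select_action(edges, c_puct=1.0):
    """Pick the action maximising Q + U, where U favours high prior and low visit count."""
    total_visits = sum(e.visits for e in edges.values())

    def score(edge):
        u = c_puct * edge.prior * math.sqrt(total_visits) / (1 + edge.visits)
        return edge.q + u

    return max(edges, key=lambda action: score(edges[action]))


def backup(path, value):
    """Update visit counts and value sums along the traversed path.

    Simplification (assumption): the value is taken from a single player's
    perspective, so no sign flip is applied between plies.
    """
    for edge in path:
        edge.visits += 1
        edge.total_value += value


# A well-explored move versus an unvisited one: the bonus term wins.
edges = {"a": Edge(prior=0.6, visits=10, total_value=5.0), "b": Edge(prior=0.4)}
chosen = select_action(edges)
backup([edges[chosen]], value=1.0)
```

As visit counts grow, the bonus term shrinks and the choice is driven mostly by Q, which is the tradeoff the slide describes.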
  7. AlphaGo • MCTS powered by a policy network and a value network • Policy network (SL + policy gradient on a DNN) • Value network (value function approximation + DNN + MC target)
  8. AlphaGo: Policy Network • Purpose: narrow the search down to high-probability moves • First trained by SL to predict human expert moves, then refined by policy gradient RL
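To make the "refined by policy gradient RL" step concrete, here is a minimal REINFORCE-style update. It is only a sketch under strong assumptions: the policy is a bare softmax over three hypothetical move logits rather than a deep network, and `reinforce_step` and its learning rate are names and values invented for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, outcome, lr=0.1):
    """One gradient-ascent step on outcome * log pi(action).

    For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - pi_i.
    `outcome` is +1 for a won game and -1 for a lost one.
    """
    probs = softmax(logits)
    return [
        x + lr * outcome * ((1.0 if i == action else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


# Three candidate moves, initially equally likely.
logits = [0.0, 0.0, 0.0]
after_win = reinforce_step(logits, action=1, outcome=+1.0)
after_loss = reinforce_step(logits, action=1, outcome=-1.0)
```

A move played in a won game becomes more likely and a move played in a lost game becomes less likely, which is the whole mechanism at this level of detail.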
  9. Value Network • Purpose: evaluate positions in the tree • Value-based reinforcement learning (function approximation using a DNN) • Loss: MSE between the predicted values and the observed rewards • "Labels" come from MC (episodes from self-play; a whole episode shares a single reward) • Overfits easily because successive positions are very similar
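The labelling scheme above can be illustrated as follows. The sketch assumes each self-play game is stored as a list of positions plus one final outcome; `sample_training_pairs`, `mse`, and the placeholder position names are hypothetical. Drawing a single position per game is one way to counter the overfitting caused by near-identical successive positions, and is the approach the AlphaGo paper describes for its value-network data set.

```python
import random


def sample_training_pairs(games, rng):
    """Return one (position, outcome) pair per game.

    Every position in a game shares that game's final outcome as its label,
    so sampling a single position per game avoids feeding the network many
    highly correlated examples with the same target.
    """
    return [(rng.choice(positions), outcome) for positions, outcome in games]


def mse(predictions, targets):
    """Mean squared error between predicted values and observed outcomes."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Placeholder positions; outcome is +1 for a win and -1 for a loss.
games = [
    (["g1_move1", "g1_move2", "g1_move3"], 1.0),
    (["g2_move1", "g2_move2"], -1.0),
]
pairs = sample_training_pairs(games, random.Random(0))
targets = [outcome for _, outcome in pairs]
loss = mse([0.5, -0.5], targets)  # hypothetical network predictions
```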
  10. AlphaGo Zero
  11. AlphaGo Zero • 1. Trained solely by self-play RL, without any supervision from human data • 2. End-to-end: the raw board position is the only input feature • 3. A single neural network, built as a residual network • 4. MCTS relies on the network only, without performing any MC rollouts • NB: all knowledge is learned by the network • Incorporates lookahead search inside the training loop
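  12. AlphaGo Zero Algorithm • Earlier AlphaGo approach: fix the network first, then build MCTS on top of it • Randomly initialize the network parameters and the MCTS • Iterate the following steps • Run MCTS guided by the network's estimates • Use the moves chosen by the MCTS-improved policy as the input for the network's next training iteration (self-play) • Minimize Loss: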
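The loss formula is not reproduced in this transcript. For reference, the loss given in the AlphaGo Zero paper combines a value error, a policy cross-entropy, and L2 regularization. The symbols below follow the paper rather than the slide: $f_\theta$ is the network, $p$ and $v$ are its policy and value outputs for position $s$, $\pi$ is the vector of MCTS search probabilities, $z$ is the final game outcome, and $c$ is the regularization weight.

```latex
(p, v) = f_\theta(s), \qquad
l = (z - v)^2 \;-\; \pi^{\top} \log p \;+\; c \,\lVert \theta \rVert^2
```

The first term pulls the predicted value $v$ toward the self-play outcome $z$, and the second pulls the network policy $p$ toward the search-improved policy $\pi$, which is how lookahead search enters the training loop.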
  13. Comparison • AlphaGo Zero reaches a higher playing strength; training time drops substantially; it learned strategies that differ from those of human experts
  14. Performance
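  15. Lessons learned • Not relying on human experience vs. unsupervised learning • Data quality sets the upper bound of the strategy: can a model trained on human experience beat humans? • Deterministic vs. non-deterministic problems • Are there explicit rewards? • Are there explicit game rules? • Marketing the technology brand • Sampling is valuable: MC in value network inference, MC in tree search • RL is powerful for solving dynamic problems when combined with MC • Human knowledge is probably a local optimum • Engineering (tricks) is critical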