
Study on Evaluation Function Design of Mahjong using Supervised Learning


Division of Computer Science and Information Technology, Harmonious Systems Engineering Laboratory, Yeqin Zheng



  1. 1. Study on Evaluation Function Design of Mahjong using Supervised Learning. Hokkaido University, Graduate School of Information Science and Technology, Harmonious Systems Engineering Laboratory, Yeqin Zheng
  2. 2. Background • Perfect information games – 1997: Deep Blue vs. the world champion in chess – 2007: Quackle vs. the world champion in Scrabble – 2016: AlphaGo vs. the world champion in Go • Monte Carlo tree search theory • Deep learning for pre-training networks – AlphaGo Zero vs. AlphaGo in Go • Deep learning • Reinforcement learning • Imperfect information games – Uncertainty – Randomness – Complex rules – Difficult to simulate
  3. 3. Previous Research's Model • Naoki Mizukami and Yoshimasa Tsuruoka. Building a Computer Mahjong Player Based on Monte Carlo Simulation and Opponent Models, Proceedings of the 2015 IEEE Conference on Computational Intelligence and Games (CIG 2015), pp. 275-283, Aug. 2015. • Monte Carlo tree search to simulate opponents' movements • Prediction of game states
  4. 4. Purpose • This study applies supervised learning theory and deep learning methods to an imperfect information game: Mahjong. • Improvements: – New feature engineering • Improves the training results of the networks – New discard method • Improves aggressiveness during games
  5. 5. Introduction of Mahjong Rules • Mahjong tiles consist of 4 types and 34 different kinds; each kind has 4 copies, 136 tiles in total. Hand: the player's own tiles. River: discarded tiles. Dora tile: bonus indicator. Mountain: face-down tiles for drawing. Meld: open hands.
  6. 6. Goal of Mahjong • The goal of mahjong is to form a winning hand in a special format. • There are two ways to earn points: Tsumo: draw the last tile from the mountain and earn points from all other players. Ron: take another player's last discarded tile and earn points from that player.
  7. 7. Difficulty & Approach • Difficulty – An imperfect information game has far more states than a perfect information game. • It is almost impossible to encounter the same game state twice. – Randomness and uncertainty pervade the entire game. • Approach – Divide the moves during games into several types. – Use multiple networks and methods to make different moves in different states.
  8. 8. Introduction of Tenhou • Tenhou is one of the most popular online mahjong services in Japan. • 4,870,311 users in total. • About 5,000 players online at the same time. • Our training data all come from the "houou" table. (Diagram: game states flow from Tenhou into the model, which returns a decision.)
  9. 9. Introduction of Tenhou's API
     Data from Tenhou:
     • T/U/V/W (+ ID): a player draws a tile; T/U/V/W is the player's seat and ID is the tile ID from 0 to 135. Examples: T123 (the dealer draws a North); V (the player in the west seat draws a tile).
     • D/E/F/G + ID: a player discards a tile; D/E/F/G is the player's seat and ID is the tile ID from 0 to 135. Example: E123 (the player in the south seat discards a North).
     • Reach who="position": the given player calls riichi. Example: Reach who="2" (the player in the west seat calls riichi).
     • N who="position" m="meld": the given player calls a meld. Example: N who="3" m="34567" (the player in the north seat calls a meld).
     • Agari: a player calls a win; includes his hand, the point changes, the waiting tile, the yaku, and who loses points.
     • Ryuukyoku: the round ends without a winner; includes the point changes.
     Data to Tenhou:
     • T + ID: discard the tile with this ID. Example: T123 (you discard a North).
     • Reach who="0": call riichi.
     • N who="0" m="meld": call a meld. Example: N who="0" m="34567".
     • Agari: call a win.
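The draw/discard message formats above can be decoded with a few lines of code. This is a minimal sketch assuming only the formats shown on the slide; `parse_message` is an illustrative name, not part of Tenhou's actual client library.

```python
import re

# Seat order as used by the message prefixes on the slide:
# T/U/V/W mark draws and D/E/F/G mark discards, in the same seat order.
DRAW_PREFIXES = "TUVW"
DISCARD_PREFIXES = "DEFG"

def parse_message(msg: str):
    """Return (action, seat_index, tile_id) for a draw/discard message, else None."""
    m = re.fullmatch(r"([TUVWDEFG])(\d{1,3})", msg)
    if not m:
        return None  # bare "V" (hidden draw) and other messages are not handled here
    prefix, tile_id = m.group(1), int(m.group(2))
    if not 0 <= tile_id <= 135:
        return None  # tile IDs run from 0 to 135 (34 kinds * 4 copies)
    if prefix in DRAW_PREFIXES:
        return ("draw", DRAW_PREFIXES.index(prefix), tile_id)
    return ("discard", DISCARD_PREFIXES.index(prefix), tile_id)
```

Dividing a tile ID by 4 gives one of the 34 tile kinds, which is how the 0-135 IDs map onto the 34-way network outputs described later.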
  10. 10. Process of Decision Making (flowchart): the player draws a tile; the system checks for a win; if yes, the player calls a win; otherwise the player decides a tile to discard; the system checks whether riichi is possible; if yes, the player calls riichi and discards, otherwise the player simply discards; the system then checks whether an opponent can win on the discard; if yes, the opponent calls a win; otherwise play passes to the next player.
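The turn flow above can be sketched as a small function. This is a hedged sketch of one player's turn only; the `check_win` and `check_riichi_ready` callables are placeholders standing in for the actual rule checks, which are not spelled out on the slide.

```python
def take_turn(player, draw_tile, check_win, check_riichi_ready):
    """One player's turn: draw, win check, riichi check, then discard."""
    hand = player["hand"] + [draw_tile]   # Player: draw a tile
    if check_win(hand):                   # System: win check
        return ("tsumo", None)            # Player: call win
    discard = choose_discard(hand)        # Player: decide a tile to discard
    if check_riichi_ready(hand, discard) and not player["riichi"]:
        player["riichi"] = True           # Player: call riichi & discard
        return ("riichi", discard)
    return ("discard", discard)           # Player: discard; next player's turn

def choose_discard(hand):
    # Placeholder policy: discard the just-drawn tile (tsumogiri);
    # the deep model replaces this with the discard network.
    return hand[-1]
```

After the returned discard, the surrounding game loop would run the opponents' win check before advancing to the next player, as in the flowchart.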
  11. 11. Introduction of Related Terminology • Waiting/Tenpai: a player's hand needs only one more tile to become a winning hand; the player is waiting for that last tile to earn points. • N-shanten: the hand needs n effective tile draws to become a winning hand and enter the waiting state.
  12. 12. Aggressive Move • Two types of game states – No one is in waiting (attack route): discard a tile that brings the hand closer to a winning hand and to earning points, which may decrease the shanten number. – Someone may be in waiting (defense route): • Aggressive move: the player chooses a tile that may decrease the shanten number but is unsafe in the current game state; it may also cost points, because other players may have entered the waiting state and be waiting for this tile. • Safe move: discard a tile with less danger of losing points and give up on winning, which may move the hand away from a winning hand and increase the shanten number. Example: player D has called riichi (someone is in waiting): - Aggressive move: discard a tile to enter the waiting state, which may lead to losing points. - Safe move: discard a tile that player D has already discarded, which will increase the shanten number. Without aggressive moves the model always folds and rarely completes a winning hand.
  13. 13. Model Details -- Networks (diagram): the 6*6*107(108) feature map is fed to the waiting-or-not network (WR), which outputs the opponents' waiting probability. If WR ≤ threshold (attack route), the discard network (DR) outputs the probabilities of the 34 tiles being discarded, and the discard is chosen from them. If WR > threshold (defense/fold route), the outputs of the discard network (DR), the waiting-tile network (WTR: the probability of each of the 34 tiles being waited for) and the lose-point network (LP: the points that may be lost for each tile in hand) are combined to choose the tile to discard.
  14. 14. Model Details -- Networks • No one is in waiting – choose the maximum of the output from the discard network. • Someone may be in waiting – choose the minimum of the lose-point expert (LPE): LPE_i = WR * (DR_i *) WTR_i * LP_i, where i is a tile ID in the hand. To increase aggressive moves, the output of the discard network (the DR_i factor) is multiplied into the LPE. • The threshold to switch modes – collect game states in which a player is in waiting – use the waiting-or-not network to compute the probability for these game states – the average of the outputs is 0.245.
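The route selection and the lose-point-expert rule above can be sketched as follows. This is a sketch, not the authors' implementation: the network outputs are assumed to arrive as plain length-34 lists indexed by tile kind (tile ID // 4), and the 0.245 threshold is the value reported on the slide.

```python
THRESHOLD = 0.245  # average WR output over in-waiting states (from the slide)

def choose_tile(hand_ids, wr, dr, wtr, lp):
    """Pick a tile ID from hand_ids.

    wr: scalar waiting-or-not probability; dr, wtr, lp: length-34 lists of
    discard probability, waited-for probability and expected point loss.
    """
    if wr <= THRESHOLD:
        # Attack route: maximum of the discard network output.
        return max(hand_ids, key=lambda t: dr[t // 4])
    # Defense route: minimum of LPE_i = WR * DR_i * WTR_i * LP_i,
    # with the DR_i factor included to keep the model aggressive.
    lpe = {t: wr * dr[t // 4] * wtr[t // 4] * lp[t // 4] for t in hand_ids}
    return min(lpe, key=lpe.get)
```

Multiplying by DR_i lowers the LPE of tiles an expert would likely discard anyway, so the defense route still permits aggressive discards instead of always folding.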
  15. 15. Model Details -- Feature Engineering • A matrix with strong connections between adjacent nodes performs better for a convolutional neural network (CNN). • Model each non-repeating tile in a vector space. • Turn the vector space into a 6 * 6 matrix base.
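The slide does not spell out the exact 6 * 6 layout, so the sketch below simply places the 34 tile kinds row-major into a 6 * 6 grid, leaving the last 2 cells as padding; the real mapping may order the kinds differently to strengthen adjacency.

```python
ROWS, COLS = 6, 6  # 36 cells for 34 tile kinds plus 2 padding cells

def kind_to_cell(kind: int):
    """Map a tile kind (0-33) to a (row, col) cell in the 6*6 base matrix."""
    assert 0 <= kind < ROWS * COLS - 2
    return divmod(kind, COLS)

def cell_to_kind(row: int, col: int):
    """Inverse mapping; returns None for the 2 padding cells."""
    k = row * COLS + col
    return k if k < 34 else None
```

With this layout, consecutive number tiles of a suit land in adjacent cells, which is the kind of local structure the slide argues a CNN exploits.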
  16. 16. Features in the Feature Map: hands, 4 layers; river, 4 layers; each turn's movements, 24 * 4 layers; dora tiles, 1 layer; invisible tiles, 1 layer; closed hand, 1 layer; (discard tile, 1 layer). The 107-layer feature map does not include the discard-tile feature.
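The layer budget above adds up as 4 + 4 + 24 * 4 + 1 + 1 + 1 = 107, plus one discard-tile plane for the 108-layer map. A small sketch makes the accounting explicit (how each plane is filled is not specified on the slide, so only the depths are encoded here):

```python
# Layer counts taken directly from the slide.
LAYERS = {
    "hands": 4,            # one plane per copy of each tile kind
    "river": 4,
    "turn_moves": 24 * 4,  # up to 24 turns for each of the 4 players
    "dora": 1,
    "invisible": 1,
    "closed_hand": 1,
}

def feature_map_depth(include_discard_plane: bool = False) -> int:
    """Depth of the 6*6*depth feature map fed to the networks."""
    depth = sum(LAYERS.values())
    return depth + 1 if include_discard_plane else depth
```

The 107-layer map feeds the waiting-or-not, waiting-tiles and discard networks; only the lose-point network, which scores a specific discard, uses the 108-layer variant.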
  17. 17. Networks Details
     • WR (waiting-or-not network): predicts the probability that another player is waiting; output: a probability from 0 to 1; data amount: 300,000.
     • WTR (waiting-tiles network): predicts which tiles others may be waiting for; output: a list of 34 probabilities of how dangerous each tile is; data amount: 4,000*34.
     • DR (discard network): predicts which tile in hand a high-level mahjong player would discard; output: a list of 34 probabilities of being discarded; data amount: 100,000*34.
     Training data: waiting-or-not network: input: 107-layer feature map; output: 1 if someone is in waiting, 0 otherwise. Waiting-tiles network: input: 107-layer feature map; output: 1 for tiles being waited for, 0 for other tiles. (Example in the figure: a hand in waiting, waiting for 1s and 4s.)
  18. 18. Networks Details • LP (lose-point network): predicts how many points will be lost if a given tile is discarded; output: a list of 6 probabilities over the number of han in the opponent's hand if he wins this round; data amount: 16,500*6. Training data: lose-point network: input: 108-layer feature map; output: the loss for this discarded tile.
  19. 19. Networks Details: the 6*6*107(108) feature map is the input layer, followed by 7 hidden convolutional layers and a fully connected output layer.
     Kernels | Kernel size | Padding | Activation
     512 | 4*4 | same | relu
     512 | 3*3 | same | none
     512 | 2*2 | same | relu (dropout)
     256 | 2*2 | same | none
     256 | 3*3 | same | relu (dropout)
     128 | 3*3 | same | none
     128 | 2*2 | same | relu (dropout)
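The architecture table can be encoded compactly. Since every layer uses 'same' padding, the spatial size stays 6 * 6 throughout and only the channel count changes; the sketch below just tracks the shapes rather than building the model in a specific framework (the slide does not name one).

```python
# (filters, kernel_size, activation) per hidden layer, from the table above;
# dropout follows the 3rd, 5th and 7th layers.
CONV_LAYERS = [
    (512, 4, "relu"), (512, 3, None), (512, 2, "relu"),
    (256, 2, None), (256, 3, "relu"),
    (128, 3, None), (128, 2, "relu"),
]

def trace_shapes(in_channels: int):
    """Return the (H, W, C) shape after the input and each conv layer."""
    shapes = [(6, 6, in_channels)]
    for filters, _kernel, _activation in CONV_LAYERS:
        shapes.append((6, 6, filters))  # 'same' padding preserves H and W
    return shapes
```

The final 6*6*128 volume is flattened into the fully connected output layer, whose width depends on the network (1 for WR, 34 for WTR and DR, 6 for LP).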
  20. 20. Final Accuracy of Each Network: waiting-or-not network 82.7%; waiting-tiles network 40.2%; lose-point network 88.7%; discard network 88.4%. The waiting-tiles network's accuracy is only 40.2% because a prediction counts as correct only when the maximum of the output is actually a waited tile.
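The 40.2% figure is thus a top-1 metric over a multi-label target. A minimal sketch of that scoring rule, assuming the network output arrives as a length-34 list and the label as the set of truly waited tile kinds:

```python
def top1_accuracy(predictions, waited_sets):
    """Fraction of cases where the argmax tile kind is among the waited tiles.

    predictions: list of length-34 probability lists (one per game state).
    waited_sets: list of sets of tile kinds actually being waited for.
    """
    hits = sum(
        1
        for probs, waited in zip(predictions, waited_sets)
        if max(range(34), key=probs.__getitem__) in waited
    )
    return hits / len(predictions)
```

Because a waiting hand often waits on several tiles at once, this metric understates how useful the full 34-probability output is inside the LPE computation.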
  21. 21. Experiment and Result • Comparison of the three models in our experiment:
     • Best choice algorithm (BCA): defense trigger: a riichi call or an open hand with more than three melds; attack route: choose the tile that brings the hand closer to a winning hand; defense route: choose a tile that the in-waiting player has already discarded.
     • BCA's attack mode combined with the deep model for defense: defense trigger: a prediction that someone may be in waiting; attack route: choose the tile that brings the hand closer to a winning hand; defense route: choose the tile that leads to the least loss.
     • Deep model: defense trigger: a prediction that someone may be in waiting; attack route: imitate expert players' discards based on the current game state; defense route: choose the tile that leads to the least loss.
  22. 22. Experiment and Result • We played 60 games with each model on the "Ippan" table, in which every player can participate.
     Ippan table (avg. lv. 1.5): Top / 2nd / 3rd / 4th / Win rate / Feed rate / Aggressive move
     BCA: 27% / 30% / 25% / 18% / 24% / 11% / 14%
     BCA + defense model: 17% / 28% / 45% / 10% / 18% / 8% / 0%
     Deep model: 22% / 27% / 33% / 18% / 20% / 9% / 8%
     Players' average (Tenhou): 20% / 23% / 27% / 30% / 20% / 19% / -
     Green: worst performance; red: best performance.
  23. 23. Experiment and Result • We played 100 games with each model on the "Joukyuu" table.
     Joukyuu table (avg. lv. 11.75): Top / 2nd / 3rd / 4th / Win rate / Feed rate / Aggressive move
     BCA: 19% / 23% / 30% / 28% / 16% / 18% / 12%
     BCA + defense model: 22% / 28% / 33% / 17% / 17% / 8% / 1%
     Deep model: 24% / 29% / 27% / 20% / 21% / 11% / 7%
     Players' average (Tenhou): 25% / 25% / 25% / 25% / 23% / 15% / 17%
     Green: worst performance; red: best performance.
  24. 24. Competition Between Each Model • We played 20 games per pairing in a 1-vs-3 setting (one copy of one model against three copies of another). Placements (1st/2nd/3rd/4th) of the single model:
     1 BCA + defense model vs. 3 BCA: 2/6/9/3; 1 deep model vs. 3 BCA: 3/7/6/4
     1 BCA vs. 3 BCA + defense model: 6/4/5/5; 1 deep model vs. 3 BCA + defense model: 5/6/5/4
     1 BCA vs. 3 deep model: 4/5/5/6; 1 BCA + defense model vs. 3 deep model: 4/7/7/2
     The result table shows: BCA is good at attack but easy to defend against; the BCA + defense model is great at defense but makes few aggressive moves; the deep model is good at defense and balanced between defense and offense.
  25. 25. Comparison Between Discard Methods • The two discard methods showed different performance during the experiment, so we compare them directly. • It is easier to speculate on the non-deep-learning AI's state and which tiles it is waiting for. • In attack, the deep model plays more like a human player than the non-deep-learning AI, as seen from the top rate and win rate.
     Discard method: Waiting count / Waiting rate / Waiting prediction / Waiting-tiles prediction
     BCA: 438 / 53.94% / 91.32% / 57.53%
     Discard model: 411 / 49.58% / 83.43% / 39.90%
  26. 26. Conclusion • The deep model in this study performs well in Mahjong games. – High 2nd-place rate. – Aggressive moves. • The new feature engineering performs well. • When the model predicts that someone is in waiting, its performance is better than the human players' average. • It is possible to build a better multi-network model based on this experiment. Thank you for listening.
  27. 27. Research Performance • Information Processing Society of Japan 1) Yeqin Zheng, Soichiro Yokoyama, Tomohisa Yamashita, Hidenori Kawamura: Study on Evaluation Function Design of Mahjong using Supervised Learning, IPSJ Special Interest Group (SIG) Technical Report, Vol. 194, Hokkaido (2019)