ADAPTIVE LEARNING IN GAMES
Suvarup Saha
EECS 463 Course Project, 3/11/2010

Outline

     Motivation
     Games
     Learning in Games
     Adaptive Learning
       Example
     Gradient Techniques
     Conclusion

Motivation

     Adaptive filtering techniques generalize to many applications beyond filtering:
       Gradient-based iterative search
       Stochastic gradient methods
       Least squares
     Applying game theory to less-than-rational multi-agent scenarios demands self-learning mechanisms
     Adaptive techniques can be applied in such settings to help the agents learn the game and play intelligently

Games

     A game is an interaction between two or more self-interested agents
     Each agent chooses a strategy si from a set of strategies Si
     A (joint) strategy profile s is the set of chosen strategies, also called an outcome of the game in a single play
     Each agent has a utility function ui(s) specifying its preference for each outcome in terms of a payoff
     An agent's best response is the strategy with the highest payoff, given its opponents' choice of strategies
     A Nash equilibrium is a strategy profile in which every agent's strategy is a best response to the others' choices

A Normal Form Game

                      B
                b1        b2
      A    a1   4,4       5,2
           a2   0,1       4,3

     This is a 2-player game with SA = {a1, a2}, SB = {b1, b2}
     The ui(s) are given explicitly in matrix form; for example, uA(a1, b2) = 5 and uB(a1, b2) = 2
     The best response of A to B playing b2 is a1
     In this game, (a1, b1) is the unique Nash equilibrium (checked mechanically in the sketch below)

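The statements above can be verified by enumeration. The following Python sketch (not part of the original slides; the helper names are illustrative) encodes the two payoff matrices and computes best responses and pure-strategy Nash equilibria:

```python
import itertools

# Payoff matrices of the game above: entry [i][j] is the payoff when
# A plays a_{i+1} and B plays b_{j+1}.
U_A = [[4, 5], [0, 4]]
U_B = [[4, 2], [1, 3]]

def best_response_A(j):
    """Row maximizing A's payoff when B plays column j."""
    return max(range(2), key=lambda i: U_A[i][j])

def best_response_B(i):
    """Column maximizing B's payoff when A plays row i."""
    return max(range(2), key=lambda j: U_B[i][j])

def pure_nash_equilibria():
    """All pure strategy profiles in which both players best-respond."""
    return [(i, j) for i, j in itertools.product(range(2), repeat=2)
            if best_response_A(j) == i and best_response_B(i) == j]

print(best_response_A(1))      # 0, i.e. a1 is A's best response to b2
print(pure_nash_equilibria())  # [(0, 0)], i.e. (a1, b1)
```

Running it confirms that a1 is A's best response to b2 and that (a1, b1) is the only profile in which both players are best-responding.
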
Learning in Games

     Classical approach: compute an optimal/equilibrium strategy
     Some criticisms of this approach:
       Other agents' utilities might be unknown to the agent computing an equilibrium strategy
       Other agents might not be playing an equilibrium strategy
       Computing an equilibrium strategy might be hard
     Another approach: learn how to play a game 'optimally' by
       playing it many times
       updating the strategy based on experience

Learning Dynamics

     Learning dynamics can be ordered by the rationality/sophistication assumed of the agents:

           Evolutionary Dynamics  <  Adaptive Learning  <  Bayesian Learning

     Adaptive learning is the focus of our discussion

Evolutionary Dynamics

     Inspired by evolutionary biology, with no appeal to the rationality of the agents
     An entire population of agents, each programmed to use some strategy
       Players are randomly matched to play against each other
     Strategies with high payoff spread within the population by
       learning
       copying or inheriting strategies – replicator dynamics
       infection
     Stability analysis – Evolutionarily Stable Strategies (ESS)
       Players playing an ESS must earn strictly higher payoffs than a small group of invaders playing a different strategy

Bayesian Learning

     Assumes 'informed agents' playing repeated games with a finite action space
     Payoffs depend on characteristics of the agents represented by types – each agent's type is private information
     The agents' initial beliefs are given by a common prior distribution over agent types
     This belief is updated to a posterior distribution according to Bayes' rule at each stage of the game
     Every finite Bayesian game has at least one Bayesian Nash equilibrium, possibly in mixed strategies

Adaptive Learning

     Agents are not fully rational, but can learn through experience and adapt their strategies
     Agents do not know the reward structure of the game
     Agents are only able to take actions and observe their own rewards (and possibly their opponents' rewards as well)
     Popular examples:
       Best Response Update
       Fictitious Play
       Regret Matching
       Infinitesimal Gradient Ascent (IGA)
       Dynamic Gradient Play
       Adaptive Play Q-learning

Fictitious Play

     The learning process is used to develop a 'historical distribution' of the other agents' play
     In fictitious play, agent i has an exogenous initial weight function k_i^0 : S_-i → R+
     The weight is updated by adding 1 to the weight of each opponent strategy each time that strategy is played
     The probability that player i assigns to player -i playing s_-i at date t is

           q_i^t(s_-i) = k_i^t(s_-i) / Σ_{s'_-i} k_i^t(s'_-i)

     The 'best response' of agent i in fictitious play is

           s_i^{t+1} = arg max_{s_i} Σ_{s_-i} q_i^t(s_-i) u_i(s_i, s_-i)

An Example

     Consider the same 2x2 game as before (payoffs repeated below)

                      B
                b1        b2
      A    a1   4,4       5,2
           a2   0,1       4,3

     Suppose we assign k_A^0(b1) = k_A^0(b2) = k_B^0(a1) = k_B^0(a2) = 1
     Then q_A^0(b1) = q_A^0(b2) = q_B^0(a1) = q_B^0(a2) = 0.5
     For A, if A chooses a1:
           q_A^0(b1) uA(a1, b1) + q_A^0(b2) uA(a1, b2) = 0.5*4 + 0.5*5 = 4.5
     while if A chooses a2:
           q_A^0(b1) uA(a2, b1) + q_A^0(b2) uA(a2, b2) = 0.5*0 + 0.5*4 = 2
     For B, if B chooses b1:
           q_B^0(a1) uB(a1, b1) + q_B^0(a2) uB(a2, b1) = 0.5*4 + 0.5*1 = 2.5
     while if B chooses b2:
           q_B^0(a1) uB(a1, b2) + q_B^0(a2) uB(a2, b2) = 0.5*2 + 0.5*3 = 2.5
     Clearly A plays a1; B can choose either b1 or b2 – assume B plays b2

Game proceeds (stage 0)

     stage                    0
     A's selection            a1
     B's selection            b2
     A's payoff               5
     B's payoff               2

     k_A^t(b1), q_A^t(b1)   1, 0.5   1, 0.33
     k_A^t(b2), q_A^t(b2)   1, 0.5   2, 0.67
     k_B^t(a1), q_B^t(a1)   1, 0.5   2, 0.67
     k_B^t(a2), q_B^t(a2)   1, 0.5   1, 0.33

     (The weight/belief rows show the initial values at t = 0 followed by the values after each completed stage.)

Game proceeds (stages 0–1)

     stage                    0         1
     A's selection            a1        a1
     B's selection            b2        b1
     A's payoff               5         4
     B's payoff               2         4

     k_A^t(b1), q_A^t(b1)   1, 0.5   1, 0.33   2, 0.5
     k_A^t(b2), q_A^t(b2)   1, 0.5   2, 0.67   2, 0.5
     k_B^t(a1), q_B^t(a1)   1, 0.5   2, 0.67   3, 0.75
     k_B^t(a2), q_B^t(a2)   1, 0.5   1, 0.33   1, 0.25

Game proceeds (stages 0–2)

     stage                    0         1         2
     A's selection            a1        a1        a1
     B's selection            b2        b1        b1
     A's payoff               5         4         4
     B's payoff               2         4         4

     k_A^t(b1), q_A^t(b1)   1, 0.5   1, 0.33   2, 0.5    3, 0.6
     k_A^t(b2), q_A^t(b2)   1, 0.5   2, 0.67   2, 0.5    2, 0.4
     k_B^t(a1), q_B^t(a1)   1, 0.5   2, 0.67   3, 0.75   4, 0.8
     k_B^t(a2), q_B^t(a2)   1, 0.5   1, 0.33   1, 0.25   1, 0.2

Game proceeds (stages 0–3)

     stage                    0         1         2         3
     A's selection            a1        a1        a1        a1
     B's selection            b2        b1        b1        b1
     A's payoff               5         4         4         4
     B's payoff               2         4         4         4

     k_A^t(b1), q_A^t(b1)   1, 0.5   1, 0.33   2, 0.5    3, 0.6    4, 0.67
     k_A^t(b2), q_A^t(b2)   1, 0.5   2, 0.67   2, 0.5    2, 0.4    2, 0.33
     k_B^t(a1), q_B^t(a1)   1, 0.5   2, 0.67   3, 0.75   4, 0.8    5, 0.83
     k_B^t(a2), q_B^t(a2)   1, 0.5   1, 0.33   1, 0.25   1, 0.2    1, 0.17

     From stage 1 onward the play settles at the Nash equilibrium (a1, b1)

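The trace above can be reproduced with a short fictitious-play simulation. This is a minimal Python sketch (not from the slides); it assumes ties in the best-response calculation are broken in favour of the later strategy, which matches B's choice of b2 at stage 0.

```python
U_A = [[4, 5], [0, 4]]   # A's payoffs: rows a1, a2; columns b1, b2
U_B = [[4, 2], [1, 3]]   # B's payoffs

def fictitious_play(stages=4):
    k_A = [1.0, 1.0]          # A's weights on B's strategies b1, b2
    k_B = [1.0, 1.0]          # B's weights on A's strategies a1, a2
    for t in range(stages):
        q_A = [w / sum(k_A) for w in k_A]   # A's beliefs about B
        q_B = [w / sum(k_B) for w in k_B]   # B's beliefs about A
        # Best responses to the fictitious (empirical) distributions.
        # Iterating in reverse breaks ties toward the later strategy,
        # matching B's choice of b2 at stage 0 in the example.
        a = max(reversed(range(2)),
                key=lambda i: sum(q_A[j] * U_A[i][j] for j in range(2)))
        b = max(reversed(range(2)),
                key=lambda j: sum(q_B[i] * U_B[i][j] for i in range(2)))
        print(f"stage {t}: A plays a{a+1}, B plays b{b+1}, "
              f"payoffs ({U_A[a][b]}, {U_B[a][b]})")
        k_A[b] += 1           # A observed B's action
        k_B[a] += 1           # B observed A's action

fictitious_play()
```

The printed actions and payoffs match the table: (a1, b2) at stage 0 and (a1, b1) at every stage thereafter.
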
Gradient Based Learning

     Fictitious play assumes unbounded computation is allowed at every step – the arg max calculation
     An alternative is to perform gradient ascent on some objective function – here, the expected payoff
     Two players – row and column – have payoff matrices

           R = [ r11  r12 ]        C = [ c11  c12 ]
               [ r21  r22 ]            [ c21  c22 ]

     The row player chooses action 1 with probability α, while the column player chooses action 1 with probability β
     Expected payoffs are

           Vr(α, β) = r11 αβ + r12 α(1 − β) + r21 (1 − α)β + r22 (1 − α)(1 − β)
           Vc(α, β) = c11 αβ + c12 α(1 − β) + c21 (1 − α)β + c22 (1 − α)(1 − β)

Gradient Ascent

     Each player repeatedly adjusts her half of the current strategy pair in the direction of the current gradient, with some step size η:

           α_{k+1} = α_k + η ∂Vr(α_k, β_k)/∂α
           β_{k+1} = β_k + η ∂Vc(α_k, β_k)/∂β

     If an update takes a strategy outside the probability simplex, it is projected back onto the boundary
     Gradient ascent assumes a full-information game – both players know the game matrices and can observe the mixed strategy their opponent played in the previous step
     With

           u = (r11 + r22) − (r21 + r12)        u' = (c11 + c22) − (c21 + c12)

     the gradients are

           ∂Vr(α, β)/∂α = βu − (r22 − r12)       ∂Vc(α, β)/∂β = αu' − (c22 − c21)

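As an illustration, here is a small Python sketch of this projected gradient-ascent update for a general 2x2 game, using the closed-form gradients above. The function name, step size, and iteration count are illustrative choices, not part of the slides.

```python
def gradient_ascent(R, C, alpha=0.5, beta=0.5, eta=0.01, steps=1000):
    """Both players ascend their expected payoff; alpha and beta are the
    probabilities of the row and column players' first actions."""
    (r11, r12), (r21, r22) = R
    (c11, c12), (c21, c22) = C
    u  = (r11 + r22) - (r21 + r12)
    up = (c11 + c22) - (c21 + c12)
    clip = lambda x: min(1.0, max(0.0, x))      # project back onto [0, 1]
    for _ in range(steps):
        dVr_dalpha = beta * u - (r22 - r12)     # ∂Vr/∂α
        dVc_dbeta = alpha * up - (c22 - c21)    # ∂Vc/∂β
        alpha = clip(alpha + eta * dVr_dalpha)
        beta = clip(beta + eta * dVc_dbeta)
    return alpha, beta

# For the game of slide 5 the iterates approach (1, 1), i.e. both players
# put all mass on their first action: the pure Nash equilibrium (a1, b1).
R = [[4, 5], [0, 4]]
C = [[4, 2], [1, 3]]
print(gradient_ascent(R, C))
```
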
Infinitesimal Gradient Ascent

     It is interesting to see what happens to the strategy pair and to the expected payoffs over time
     The strategy-pair sequence produced by following a gradient ascent algorithm may never converge
     The average payoff of both players, however, always converges to that of some Nash pair
     Under a small-step-size assumption (η → 0), the update equations become the differential system

           [ ∂α/∂t ]   [ 0    u ] [ α ]   [ −(r22 − r12) ]
           [ ∂β/∂t ] = [ u'   0 ] [ β ] + [ −(c22 − c21) ]

     The point where the gradient is zero – a Nash equilibrium – is

           (α*, β*) = ( (c22 − c21)/u' , (r22 − r12)/u )

     This point might even lie outside the probability simplex

IGA Dynamics

     Denote by U the off-diagonal matrix containing u and u'
     Depending on the nature of U (non-invertible, real or purely imaginary eigenvalues), the convergence dynamics vary

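A small numpy sketch of this classification (illustrative, not from the slides): it builds U from the payoff matrices and reports whether its eigenvalues are real, purely imaginary, or whether U is not invertible.

```python
import numpy as np

def classify_U(R, C):
    """Build the off-diagonal matrix U and classify its eigenvalues."""
    (r11, r12), (r21, r22) = R
    (c11, c12), (c21, c22) = C
    u  = (r11 + r22) - (r21 + r12)
    up = (c11 + c22) - (c21 + c12)
    U = np.array([[0.0, u], [up, 0.0]])
    eig = np.linalg.eigvals(U)          # eigenvalues are ±sqrt(u * u')
    if u == 0 or up == 0:
        return "U is not invertible", eig
    if u * up > 0:
        return "real eigenvalues", eig
    return "purely imaginary eigenvalues", eig

# The game of slide 5 falls in the real-eigenvalue case.
print(classify_U([[4, 5], [0, 4]], [[4, 2], [1, 3]]))
```
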
WoLF – Win or Learn Fast

     Introduces a variable learning rate in place of the fixed η:

           α_{k+1} = α_k + η l_k^r ∂Vr(α_k, β_k)/∂α
           β_{k+1} = β_k + η l_k^c ∂Vc(α_k, β_k)/∂β

     Let α_e be the equilibrium strategy selected by the row player and β_e the equilibrium strategy selected by the column player; then

           l_k^r = l_min   if Vr(α_k, β_k) > Vr(α_e, β_k)   (winning)
                   l_max   otherwise                        (losing)

           l_k^c = l_min   if Vc(α_k, β_k) > Vc(α_k, β_e)   (winning)
                   l_max   otherwise                        (losing)

     If, in a two-person, two-action, iterated general-sum game, both players follow the WoLF-IGA algorithm (with l_max > l_min), then their strategies converge to a Nash equilibrium

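Below is a minimal Python sketch of WoLF-IGA (not from the slides). It assumes the equilibrium pair (α_e, β_e) is known to the players, as in the rule above; the values of l_min, l_max, the step size, and the use of matching pennies as the test game are illustrative choices.

```python
def wolf_iga(R, C, alpha_e, beta_e, alpha=0.1, beta=0.9,
             eta=0.001, l_min=1.0, l_max=4.0, steps=20000):
    """WoLF-IGA: gradient ascent with a learning rate that is small
    while winning and large while losing."""
    (r11, r12), (r21, r22) = R
    (c11, c12), (c21, c22) = C
    u  = (r11 + r22) - (r21 + r12)
    up = (c11 + c22) - (c21 + c12)
    Vr = lambda a, b: r11*a*b + r12*a*(1-b) + r21*(1-a)*b + r22*(1-a)*(1-b)
    Vc = lambda a, b: c11*a*b + c12*a*(1-b) + c21*(1-a)*b + c22*(1-a)*(1-b)
    clip = lambda x: min(1.0, max(0.0, x))
    for _ in range(steps):
        l_r = l_min if Vr(alpha, beta) > Vr(alpha_e, beta) else l_max
        l_c = l_min if Vc(alpha, beta) > Vc(alpha, beta_e) else l_max
        d_alpha = beta * u - (r22 - r12)       # ∂Vr/∂α
        d_beta = alpha * up - (c22 - c21)      # ∂Vc/∂β
        alpha = clip(alpha + eta * l_r * d_alpha)
        beta = clip(beta + eta * l_c * d_beta)
    return alpha, beta

# Matching pennies: its only Nash equilibrium is the mixed pair (0.5, 0.5).
R = [[1, -1], [-1, 1]]
C = [[-1, 1], [1, -1]]
print(wolf_iga(R, C, alpha_e=0.5, beta_e=0.5))   # approaches (0.5, 0.5)
```

In matching pennies (u·u' < 0, the purely imaginary case), plain IGA orbits the mixed equilibrium; with l_max > l_min the WoLF rule shrinks the orbit so the strategy pair spirals in toward it.
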
WoLF-IGA Convergence

     (Figure from the original slide not reproduced here.)

To Conclude

     Learning in games is popular in anticipation of a future in which less-than-rational agents play a game repeatedly and arrive at a stable and efficient equilibrium
     The algorithmic structure and adaptive techniques involved in such learning are largely motivated by machine learning and adaptive filtering
     A gradient-based approach relieves the computational burden of fictitious play's best-response calculation, but might suffer from convergence issues
     A stochastic gradient method (not discussed in this presentation) makes use of the minimal information available and still performs near-optimally

