Adaptive Learning in Games
1. EECS 463 Course Project
Adaptive Learning in Games
3/11/2010, Suvarup Saha
2. Outline
Motivation
Games
Learning in Games
Adaptive Learning
Example
Gradient Techniques
Conclusion
3. Motivation
Adaptive filtering techniques generalize to many applications beyond filtering:
Gradient-based iterative search
Stochastic gradient
Least squares
Applying game theory to less-than-rational multi-agent scenarios demands self-learning mechanisms
Adaptive techniques can be applied in such instances to help the agents learn the game and play intelligently
4. Games
A game is an interaction between two or more self-interested
agents
Each agent chooses a strategy s_i from a set of strategies, S_i
A (joint) strategy profile, s, is the set of chosen strategies, also
called an outcome of the game in a single play
Each agent has a utility function, u_i(s), specifying their
preference for each outcome in terms of a payoff
An agent’s best response is the strategy with the highest
payoff, given its opponents’ choice of strategies
A Nash equilibrium is a strategy profile such that every
agent’s strategy is a best response to others’ choice of strategy
5. A Normal Form Game
          B
          b1    b2
A   a1   4,4   5,2
    a2   0,1   4,3
This is a 2-player game with S_A = {a1, a2}, S_B = {b1, b2}
The u_i(s) are given explicitly in matrix form; for example,
u_A(a1, b2) = 5 and u_B(a1, b2) = 2
The best response of A to B playing b2 is a1
In this game, (a1, b1) is the unique Nash Equilibrium
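
A minimal Python sketch, not from the slides, that enumerates best responses and pure-strategy Nash equilibria for this 2x2 game; the dictionary layout and helper names are illustrative assumptions.

# Payoffs indexed by (A's strategy, B's strategy) -> (uA, uB)
payoffs = {
    ("a1", "b1"): (4, 4), ("a1", "b2"): (5, 2),
    ("a2", "b1"): (0, 1), ("a2", "b2"): (4, 3),
}
SA, SB = ("a1", "a2"), ("b1", "b2")

def best_response_A(b):
    """A's strategy maximizing uA, given B plays b."""
    return max(SA, key=lambda a: payoffs[(a, b)][0])

def best_response_B(a):
    """B's strategy maximizing uB, given A plays a."""
    return max(SB, key=lambda b: payoffs[(a, b)][1])

# A profile is a pure Nash equilibrium iff each side is a best response.
nash = [(a, b) for a in SA for b in SB
        if best_response_A(b) == a and best_response_B(a) == b]
print(nash)  # [('a1', 'b1')]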
6. Learning in Games
Classical Approach: Compute an optimal/equilibrium
strategy
Some criticisms of this approach:
Other agents’ utilities might be unknown to an agent for
computing an equilibrium strategy
Other agents might not be playing an equilibrium strategy
Computing an equilibrium strategy might be hard
Another Approach: Learn how to ‘optimally’ play a game
by
playing it many times
updating strategy based on experience
8. Evolutionary Dynamics
Inspired by evolutionary biology, with no appeal to the
rationality of the agents
An entire population of agents is programmed to use some strategy
Players are randomly matched to play against each other
Strategies with high payoff spread within the population by
Learning
Copying or inheriting strategies (replicator dynamics; a sketch follows below)
Infection
Stability analysis: evolutionarily stable strategies (ESS)
Players playing an ESS must earn strictly higher payoffs than a
small group of invaders playing a different strategy
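
A minimal sketch of discrete-time replicator dynamics, assuming a symmetric game whose payoff matrix is borrowed from the 2x2 example above; the matrix choice, step size, and initial shares are illustrative.

# payoff[i][j]: payoff to a player using strategy i against strategy j
payoff = [[4, 5],
          [0, 4]]

def replicator_step(x, dt=0.1):
    """x[i] is the population share of strategy i; shares grow in
    proportion to fitness above the population average."""
    fitness = [sum(payoff[i][j] * x[j] for j in range(2)) for i in range(2)]
    avg = sum(x[i] * fitness[i] for i in range(2))
    return [x[i] + dt * x[i] * (fitness[i] - avg) for i in range(2)]

x = [0.1, 0.9]                # start mostly on strategy 2
for _ in range(200):
    x = replicator_step(x)
print(x)                      # the share of strategy 1 grows toward 1.0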
9. Bayesian Learning
Assumes ‘informed agents’ playing repeated games
with a finite action space
Payoffs depend on some characteristics of agents
represented by types – each agent’s type is private
information
The agents’ initial beliefs are given by a common prior
distribution over agent types
This belief is updated by Bayes’ rule to a posterior
distribution at each stage of the game (a sketch follows below)
In every finite Bayesian game, there is at least one
Bayesian Nash equilibrium, possibly in mixed strategies
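
A minimal sketch of the stage-by-stage Bayes update over opponent types; the two types, their action likelihoods, and the observed actions are invented for illustration and are not from the slides.

prior = {"aggressive": 0.5, "passive": 0.5}          # common prior over types
likelihood = {                                        # P(action | type)
    "aggressive": {"raise": 0.8, "fold": 0.2},
    "passive":    {"raise": 0.3, "fold": 0.7},
}

def bayes_update(belief, action):
    """Posterior over types after observing one stage-game action."""
    unnorm = {t: belief[t] * likelihood[t][action] for t in belief}
    z = sum(unnorm.values())
    return {t: p / z for t, p in unnorm.items()}

belief = prior
for action in ["raise", "raise", "fold"]:            # one observation per stage
    belief = bayes_update(belief, action)
print(belief)   # belief shifts toward the type that explains the observations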
10. Adaptive Learning
Agents are not fully rational, but can learn through
experience and adapt their strategies
Agents do not know the reward structure of the game
Agents are only able to take actions and observe their own
rewards (or opponents’ rewards as well)
Popular Examples
Best Response Update
Fictitious Play
Regret Matching
Infinitesimal Gradient Ascent (IGA)
Dynamic Gradient Play
Adaptive Play Q-learning
11. Fictitious Play
The learning process is used to develop a ‘historical
distribution’ of the other agents’ play
In fictitious play, agent i has an exogenous initial weight
function k_i^0 : S_{-i} → R_+
The weight is updated by adding 1 to the weight of each
opponent strategy each time that strategy is played
The probability that player i assigns to player -i
playing s_{-i} at date t is given by
q_i^t(s_{-i}) = k_i^t(s_{-i}) / Σ_{s'_{-i}} k_i^t(s'_{-i})
The ‘best response’ of agent i in this fictitious play is
given by (a sketch follows below)
s_i^{t+1} = arg max_{s_i} Σ_{s_{-i}} q_i^t(s_{-i}) u_i(s_i, s_{-i})
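
A minimal sketch of the two formulas above, assuming generic dictionary-based weight and utility tables; the example numbers at the end are A's side of the game on the next slide.

def beliefs(weights):
    """q_i^t(s_-i): each opponent strategy's weight over the total."""
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def fictitious_best_response(weights, utility, own_strategies):
    """arg max over own strategies of the expected payoff under q_i^t."""
    q = beliefs(weights)
    return max(own_strategies,
               key=lambda si: sum(q[s] * utility[(si, s)] for s in q))

# Example: A's beliefs over B's play, with unit initial weights.
weights = {"b1": 1, "b2": 1}
uA = {("a1", "b1"): 4, ("a1", "b2"): 5, ("a2", "b1"): 0, ("a2", "b2"): 4}
print(fictitious_best_response(weights, uA, ["a1", "a2"]))  # -> a1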
12. An Example
Consider the same 2x2 game example as before:
          B
          b1    b2
A   a1   4,4   5,2
    a2   0,1   4,3
Suppose we assign k_A^0(b1) = k_A^0(b2) = k_B^0(a1) = k_B^0(a2) = 1
Then, q_A^0(b1) = q_A^0(b2) = q_B^0(a1) = q_B^0(a2) = 0.5
For A, if A chooses a1,
q_A^0(b1) u_A(a1, b1) + q_A^0(b2) u_A(a1, b2) = 0.5*4 + 0.5*5 = 4.5
while if A chooses a2,
q_A^0(b1) u_A(a2, b1) + q_A^0(b2) u_A(a2, b2) = 0.5*0 + 0.5*4 = 2
For B, if B chooses b1,
q_B^0(a1) u_B(a1, b1) + q_B^0(a2) u_B(a2, b1) = 0.5*4 + 0.5*1 = 2.5
while if B chooses b2,
q_B^0(a1) u_B(a1, b2) + q_B^0(a2) u_B(a2, b2) = 0.5*2 + 0.5*3 = 2.5
Clearly, A plays a1; B can choose either b1 or b2, so assume B plays b2 (a simulation sketch follows below)
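
A hedged simulation of this example: both players run fictitious play for 50 stages and play settles at the Nash equilibrium (a1, b1). The tie-break that makes B open with b2 is an assumption, encoded by listing b2 first.

uA = {("a1", "b1"): 4, ("a1", "b2"): 5, ("a2", "b1"): 0, ("a2", "b2"): 4}
uB = {("a1", "b1"): 4, ("a1", "b2"): 2, ("a2", "b1"): 1, ("a2", "b2"): 3}
kA = {"b1": 1.0, "b2": 1.0}   # A's weights on B's strategies
kB = {"a1": 1.0, "a2": 1.0}   # B's weights on A's strategies

def expected(u, own, opp_weights, own_first):
    """Expected payoff of one own strategy under the empirical beliefs."""
    total = sum(opp_weights.values())
    return sum(w / total * (u[(own, o)] if own_first else u[(o, own)])
               for o, w in opp_weights.items())

for t in range(50):
    a = max(["a1", "a2"], key=lambda s: expected(uA, s, kA, True))
    # Listing b2 first makes max() break the initial 2.5 vs 2.5 tie as b2.
    b = max(["b2", "b1"], key=lambda s: expected(uB, s, kB, False))
    kA[b] += 1                # A observed B's play
    kB[a] += 1                # B observed A's play
print(a, b)                   # settles at the Nash equilibrium: a1 b1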
17. Gradient Based Learning
Fictitious play assumes unbounded computation is allowed in
every step (the arg max calculation)
An alternative is to proceed by gradient ascent on an
objective function, the expected payoff
Two players, row and column, have payoff matrices
R = [ r11  r12        C = [ c11  c12
      r21  r22 ]            c21  c22 ]
The row player chooses action 1 with probability α, while the
column player chooses action 1 with probability β
Expected payoffs are
V_r(α, β) = r11 αβ + r12 α(1-β) + r21 (1-α)β + r22 (1-α)(1-β)
V_c(α, β) = c11 αβ + c12 α(1-β) + c21 (1-α)β + c22 (1-α)(1-β)
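
A small numeric check of these formulas; instantiating R and C with the u_A and u_B values of the earlier 2x2 game is an assumption made here just to have concrete numbers.

R = [[4, 5], [0, 4]]   # row player's payoffs r_ij
C = [[4, 2], [1, 3]]   # column player's payoffs c_ij

def V(M, alpha, beta):
    """Expected payoff when row plays action 1 w.p. alpha and column
    plays action 1 w.p. beta."""
    return (M[0][0] * alpha * beta + M[0][1] * alpha * (1 - beta)
            + M[1][0] * (1 - alpha) * beta
            + M[1][1] * (1 - alpha) * (1 - beta))

print(V(R, 1.0, 1.0), V(C, 1.0, 1.0))  # 4 4: both always play action 1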
18. Gradient Ascent
Each player repeatedly adjusts her half of the current strategy
pair in the direction of the current gradient with some step size η
α_{k+1} = α_k + η ∂V_r(α_k, β_k)/∂α
β_{k+1} = β_k + η ∂V_c(α_k, β_k)/∂β
If an update takes a strategy outside the probability
simplex, it is projected back onto the boundary
The gradient ascent algorithm assumes a full-information game:
both players know the game matrices and can see the mixed
strategy of their opponent in the previous step
With u = (r11 + r22) - (r21 + r12) and u' = (c11 + c22) - (c21 + c12),
∂V_r(α, β)/∂α = βu - (r22 - r12)
∂V_c(α, β)/∂β = αu' - (c22 - c21)
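
A minimal sketch of the projected updates using these closed-form derivatives; the matrices, step size, and starting point are the illustrative values assumed earlier.

R = [[4, 5], [0, 4]]
C = [[4, 2], [1, 3]]
u  = (R[0][0] + R[1][1]) - (R[1][0] + R[0][1])    # u  = 3
up = (C[0][0] + C[1][1]) - (C[1][0] + C[0][1])    # u' = 4

clip = lambda p: min(1.0, max(0.0, p))            # project back onto [0, 1]

alpha, beta, eta = 0.5, 0.5, 0.01
for k in range(2000):
    dVr = beta * u - (R[1][1] - R[0][1])          # dV_r/dalpha
    dVc = alpha * up - (C[1][1] - C[1][0])        # dV_c/dbeta
    alpha, beta = clip(alpha + eta * dVr), clip(beta + eta * dVc)
print(alpha, beta)                                # 1.0 1.0, i.e. (a1, b1)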
19. Infinitesimal Gradient Ascent
Interesting to see what happens to the strategy pair and to the
expected payoffs over time
Strategy pair sequence produced by following a gradient ascent
algorithm may never converge
Average payoff of both the players always converges to that of some
Nash pair
Consider a small step size, η → 0, so that the update
equations become
[ ∂α/∂t ]   [ 0   u ] [ α ]   [ -(r22 - r12) ]
[ ∂β/∂t ] = [ u'  0 ] [ β ] + [ -(c22 - c21) ]
The point where the gradient is zero corresponds to a Nash equilibrium:
(α*, β*) = ( (c22 - c21)/u' , (r22 - r12)/u )
This point might even lie outside the probability simplex.
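
A quick numeric check of this fixed point, reusing the R and C assumed earlier; for those numbers β* falls outside [0, 1], illustrating the last remark.

R = [[4, 5], [0, 4]]
C = [[4, 2], [1, 3]]
u  = (R[0][0] + R[1][1]) - (R[1][0] + R[0][1])    # 3
up = (C[0][0] + C[1][1]) - (C[1][0] + C[0][1])    # 4

alpha_star = (C[1][1] - C[1][0]) / up             # (c22 - c21) / u'
beta_star  = (R[1][1] - R[0][1]) / u              # (r22 - r12) / u
print(alpha_star, beta_star)   # 0.5 -0.333...: beta* lies outside [0, 1]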
20. IGA dynamics
Denote by U the off-diagonal matrix containing u and u'
Depending on the nature of U (non-invertible, real or imaginary
eigenvalues), the convergence dynamics will vary
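
A small check of U's eigenvalues for the running example: the characteristic polynomial of U = [[0, u], [u', 0]] is λ² - uu', so the eigenvalues are real when uu' > 0 and imaginary when uu' < 0. The values u = 3 and u' = 4 come from the matrices assumed earlier.

import math

u, up = 3, 4                      # from the R and C assumed earlier
disc = u * up                     # eigenvalues of U are +/- sqrt(u * u')
if disc >= 0:
    eigs = (math.sqrt(disc), -math.sqrt(disc))      # real eigenvalues
else:
    eigs = (complex(0, math.sqrt(-disc)),           # imaginary eigenvalues
            complex(0, -math.sqrt(-disc)))
print(eigs)                       # (3.464..., -3.464...): real for this game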
21. WoLF: Win or Learn Fast
Introduces a variable learning rate l_k that scales the fixed step size η:
α_{k+1} = α_k + η l_k^r ∂V_r(α_k, β_k)/∂α
β_{k+1} = β_k + η l_k^c ∂V_c(α_k, β_k)/∂β
Let α^e be the equilibrium strategy selected by the row player
and β^e be the equilibrium strategy selected by the column player
l_k^r = l_min if V_r(α_k, β_k) > V_r(α^e, β_k) (winning), l_max otherwise (losing)
l_k^c = l_min if V_c(α_k, β_k) > V_c(α_k, β^e) (winning), l_max otherwise (losing)
If, in a two-person, two-action, iterated general-sum game, both
players follow the WoLF-IGA algorithm (with l_max > l_min), then their
strategies will converge to a Nash equilibrium (sketched below)
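
A hedged sketch of WoLF-IGA on the same assumed game; the equilibrium pair (α^e, β^e) = (1, 1) comes from the pure Nash profile (a1, b1), and η, l_min, l_max are illustrative choices.

R = [[4, 5], [0, 4]]
C = [[4, 2], [1, 3]]
u  = (R[0][0] + R[1][1]) - (R[1][0] + R[0][1])
up = (C[0][0] + C[1][1]) - (C[1][0] + C[0][1])

def V(M, a, b):
    """Expected payoff at the mixed-strategy pair (a, b)."""
    return (M[0][0]*a*b + M[0][1]*a*(1-b)
            + M[1][0]*(1-a)*b + M[1][1]*(1-a)*(1-b))

clip = lambda p: min(1.0, max(0.0, p))
alpha_e, beta_e = 1.0, 1.0            # equilibrium strategies (a1, b1)
eta, l_min, l_max = 0.01, 1.0, 4.0    # l_max > l_min, as the theorem requires

alpha = beta = 0.5
for k in range(2000):
    # Winning -> learn cautiously (l_min); losing -> learn fast (l_max).
    lr = l_min if V(R, alpha, beta) > V(R, alpha_e, beta) else l_max
    lc = l_min if V(C, alpha, beta) > V(C, alpha, beta_e) else l_max
    dVr = beta * u - (R[1][1] - R[0][1])
    dVc = alpha * up - (C[1][1] - C[1][0])
    alpha = clip(alpha + eta * lr * dVr)
    beta  = clip(beta + eta * lc * dVc)
print(alpha, beta)                    # converges to the Nash point 1.0 1.0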
23. To Conclude
Learning in games is popular in anticipation of a future in
which less-than-rational agents play a game repeatedly to
arrive at a stable and efficient equilibrium
The algorithmic structure and adaptive techniques involved in
such learning are largely motivated by Machine Learning and
Adaptive Filtering
A gradient-based approach relieves the computational burden of
fictitious play's best-response calculation, but might suffer
from convergence issues
A stochastic gradient method (not discussed in the presentation)
makes use of minimal information available and still performs
near-optimally