3. Reinforcement learning
3.1 The Q-learning algorithm
If we move from the regime of full rationality to the regime of exploring rationality, where
players are willing to take some risks to explore their opponents and the game structure, there is
more than one way to model the players. Here we are drawn to the class of reinforcement
learning models, which seem to describe human behavior well (see Roth and Erev 1995).
Reinforcement learning is inspired by learning theory in psychology, which holds that the
likelihood of choosing an action is strengthened if that action leads to a favorable outcome. This
characteristic has been observed to be quite robust in the learning behavior of humans and
animals. At the same time, reinforcement learning has been widely used in the machine learning
community to solve individual decision problems (see Sutton and Barto 1998), but it has rarely
been studied in a strategic environment (such as repeated games), where the learning behavior of
each player has an important impact on that of the others and on the outcome. It is therefore also
interesting, from a machine learning perspective, to see how reinforcement learning performs in
such a strategic environment. In the following, we describe the specific reinforcement learning
algorithm used in our experiments: Q-learning. More detailed information can be found in
Watkins (1989), Watkins and Dayan (1992), and Sutton and Barto (1998).
The Q-learning algorithm works by estimating the values of state-action pairs. The value Q(s,a)
is defined as the expected discounted sum of future payoffs obtained by taking action a from
state s and following an optimal policy thereafter. Once these values have been learned, the
optimal action from any state is the one with the highest Q-value. The standard procedure for
Q-learning is as follows. Assume that Q(s,a) is represented by a lookup table containing a value
for every possible state-action pair, and that the table entries are initialized to arbitrary values.
The procedure for estimating the correct Q(s,a) is then to repeat the following loop until a
termination criterion is met:
1. In the current state s, choose an action a; this yields an immediate reward r and a transition
to the next state s'.
2. Update Q(s,a) according to the following equation:

   ΔQ(s,a) = α [r + γ max_b Q(s',b) − Q(s,a)]                                  (1)

where α is the learning rate parameter and γ is the discount factor.
In the context of repeated games, the player explores the environment (its opponent and the
game structure) by taking some risk in step 1 to choose an action that might not be the currently
optimal one. In step 2, an action that leads to a higher reward strengthens the Q value for that
state-action pair. The above procedure is guaranteed to converge to the correct Q values for
stationary MDPs.
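As a concrete illustration, the update rule in equation (1) can be sketched in a few lines of Python; the table layout and names below are our own, not code from the original study.

```python
def q_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.95):
    """Apply equation (1): Q(s,a) += alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    best_next = max(Q[s_next].values())  # max over actions b of Q(s', b)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]

# A tiny lookup table: two states, two actions, all entries initialized to zero.
Q = {s: {a: 0.0 for a in ("C", "D")} for s in ("s0", "s1")}
q_update(Q, "s0", "C", 3.0, "s1")  # reward 3 for choosing C in s0, landing in s1
```

With all entries initialized to zero, this single step moves Q(s0, C) to 0.2 * 3 = 0.6.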
In practice, the exploration strategy in step 1 is usually chosen to ensure sufficient exploration
while still favoring actions with higher value estimates in the given state. A variety of
Draft.doc 4/26/2010 1/14
methods may be used. A simple one is to behave greedily most of the time but, with a small
probability ε, to choose an action at random from those that do not have the highest Q value.
This action selection method is called ε-greedy in Sutton and Barto 1998. There is another one
called the Softmax action selection method, in which an action with a higher value is more
likely to be chosen in a given state. The most common form for the probability of choosing
action a is

   e^(Q_t(a)/τ) / Σ_{b=1}^{n} e^(Q_t(b)/τ)                                     (2)

where τ is a positive parameter that decreases over time. In the limit as τ → 0, Softmax action
selection becomes greedy action selection. In our experiments we tried both ε-greedy and
Softmax action selection.
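The two selection rules can be sketched as follows; this is a minimal illustration of ε-greedy and of equation (2), with function and parameter names of our own choosing.

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.01):
    """With probability epsilon, pick at random among the non-greedy actions;
    otherwise take the action with the highest Q value."""
    greedy = max(q_values, key=q_values.get)
    others = [a for a in q_values if a != greedy]
    if others and random.random() < epsilon:
        return random.choice(others)
    return greedy

def softmax_select(q_values, tau):
    """Equation (2): choose a with probability e^(Q(a)/tau) / sum_b e^(Q(b)/tau)."""
    actions = list(q_values)
    weights = [math.exp(q_values[a] / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

As τ shrinks, the Softmax weights concentrate on the highest-valued action, which is the greedy limit mentioned above.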
3.2 Implementation of Q-learning for 2 by 2 games
Q-learning does not need a model of its environment and can be used online. It is therefore
well suited to repeated games against an unknown opponent (especially one with the same
adaptive behavior). Here we focus on repeated 2 by 2 games. Considering the iterated
prisoner's dilemma, where "tit for tat" has often been discussed, it is natural to represent the
state as the outcome of the previous game played. We say in this case that the player has a
memory length of one. The number of states for a 2 by 2 game is then 4, and in each state there
are two actions (the pure strategies) from which the player can choose for the current game. We
also conducted experiments for the case in which players have a memory length of two (so the
number of states is 16). The immediate reward a player receives is its payoff in the payoff
matrix.
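The state encoding just described (the state is the joint outcome of the last one or two games) can be sketched as:

```python
from itertools import product

ACTIONS = ("C", "D")

def state_space(memory_length):
    """Enumerate all states: each state is the sequence of the last
    `memory_length` joint outcomes (CC, CD, DC or DD)."""
    outcomes = [a + b for a, b in product(ACTIONS, ACTIONS)]  # CC, CD, DC, DD
    return ["".join(seq) for seq in product(outcomes, repeat=memory_length)]
```

With memory length one this gives the 4 states CC, CD, DC and DD; with memory length two it gives the 16 states used in the second set of experiments.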
For the Softmax action selection method, we let the parameter τ decrease as follows:

   τ = T · θ^n                                                                 (3)

where T is a constant, n is the number of games played so far, and θ, called the annealing
factor, is a positive constant less than one. In the implementation, when n gets large enough, τ
is close to zero and the player stops exploring. We switch to ε-greedy action selection after that
point to keep the player exploring.
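The annealing schedule in equation (3) and the switch to ε-greedy can be sketched as follows; the default values match those reported in Section 4.3, and the function names are ours:

```python
def temperature(n, T=5.0, theta=0.9999):
    """Equation (3): tau = T * theta**n, decreasing toward zero as n grows."""
    return T * theta ** n

def use_softmax(n, floor=0.01):
    """Use Softmax selection while tau is above the floor; switch to
    epsilon-greedy afterwards so the player keeps exploring."""
    return temperature(n) >= floor
```

With these defaults, τ falls below the 0.01 floor after roughly 60,000 games, well within the 200,000 iterations each repeated game runs.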
4. Experiments
4.1 The motivation
Repeated 2 by 2 games are the simplest settings for strategic interaction and are a good starting
point for investigating how different the outcome would be under exploring rationality rather
than full rationality. Take the iterated prisoner's dilemma as an example: if a player takes the
risk of cooperating in some round, hoping to induce later cooperation from its opponent, who
may be thinking the same way, it is possible that both players discover that they can earn more
through mutual cooperation. Even if a player loses some payoff in the early stage, sustained
mutual cooperation later is a good reason to explore early on.
Motivated by this intuition, we deliberately selected 8 games and parameterized their payoff
matrices. The players are modeled as using Q-learning in each repeated game. For 5 of the
games, the Pareto optimal solution does not coincide with a Nash equilibrium. The rest are
games with two Nash equilibria, which we included to address the multi-equilibrium selection
issue.
4.2 The games and the parameterization
The following is the list of the games and how we parameterized their payoff matrices. The
parameter is δ. In each payoff matrix, the first number in a cell is the payoff of the row player
and the second is the payoff of the column player. We mark a Nash equilibrium (NE) with #
and a Pareto optimal solution with *. C and D are the actions, or pure strategies, that the players
can take. The row player always comes first: when we say the outcome of one play is CD, the
row player chose pure strategy C and the column player chose pure strategy D. So there are four
outcomes of one play: CC, CD, DC, and DD.
The first two games are versions of the prisoner's dilemma. The value of δ ranges from 0 to 3.
When δ is 2 in Table 4.1, the game corresponds to the most common payoff matrix for the
prisoner's dilemma.
      C            D
C     (3, 3)*      (0, 3+δ)
D     (3+δ, 0)     (3−δ, 3−δ)#
Table 4.1: Prisoner's dilemma Pattern 1
      C            D
C     (3, 3)*      (0, 3+δ)
D     (3+δ, 0)     (δ, δ)#
Table 4.2: Prisoner's dilemma Pattern 2
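The δ-parameterized payoff matrices of Tables 4.1 and 4.2 can be written as simple functions; the dictionary encoding (keys are (row action, column action) pairs, values are (row payoff, column payoff) pairs) is our own convention:

```python
def pd_pattern1(delta):
    """Table 4.1: mutual defection pays (3 - delta, 3 - delta)."""
    return {("C", "C"): (3, 3),          ("C", "D"): (0, 3 + delta),
            ("D", "C"): (3 + delta, 0),  ("D", "D"): (3 - delta, 3 - delta)}

def pd_pattern2(delta):
    """Table 4.2: identical except mutual defection pays (delta, delta)."""
    return {("C", "C"): (3, 3),          ("C", "D"): (0, 3 + delta),
            ("D", "C"): (3 + delta, 0),  ("D", "D"): (delta, delta)}
```

With delta = 2, pd_pattern1 reproduces the most common prisoner's dilemma payoffs mentioned above.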
While the above two are symmetric games, the following three are asymmetric games adapted
from Rapoport and Guyer (1976). Their payoff values differ from those of the original games in
Rapoport and Guyer. The value of δ again ranges from 0 to 3.
      C              D
C     (0.2, 0.3)#    (0.3+δ, 0.1)
D     (0.1, 0.2)     (0.2+δ, 0.3+δ)*
Table 4.3: game #47
      C              D
C     (0.2, 0.2)#    (0.3+δ, 0.1)
D     (0.1, 0.3)     (0.2+δ, 0.3+δ)*
Table 4.4: game #48
      C              D
C     (0.2, 0.3)#    (0.3+δ, 0.2)
D     (0.1, 0.1)     (0.2+δ, 0.3+δ)*
Table 4.5: game #57
For games with two Nash equilibria, one challenging question is which equilibrium is more
likely to be selected as the outcome. We chose three games from this class. In the game of Stag
Hunt, the Pareto optimal solution is one of the Nash equilibria. The game of Chicken and the
game of Battle of sexes are coordination games. The value of δ ranges from 0 to 3 for Stag
Hunt and Battle of sexes; for Chicken, δ ranges from 0 to 2. Note that there is no Pareto
optimal solution for the last two coordination games.
      C           D
C     (5, 5)*#    (0, 3)
D     (3, 0)      (δ, δ)#
Table 4.6: Stag Hunt
      C             D
C     (δ, 3−δ)#     (0, 0)
D     (0, 0)        (3−δ, δ)#
Table 4.7: Battle of sexes
      C             D
C     (2, 2)        (δ, 2+δ)#
D     (2+δ, δ)#     (0, 0)
Table 4.8: Chicken
4.3 The setting for the experiments
The parameters for Q-learning are set as following: Learning rate is set to 0.2 and discount factor
as 0.95. We ran the experiment with both softmax action selection and ε-greed action selection.
For softmax action selection, T is set to 5 and annealing factor as 0.9999. When τ is less than
0.01, we began usingε-greed. We set εto 0.01. We have chosen these parameter values after
those used by other studies (Sandholm and Crites 1995)
Each repeated game runs for 200,000 iterations, so the players are given enough time to
explore and learn. For each setting of the payoff parameter δ, we ran the repeated game for 100
trials. We recorded the frequencies of the four outcomes (CC, CD, DC and DD) every 100
iterations. The numbers usually become stable within 50,000 iterations, so unless noted
otherwise we report the frequencies of the outcomes in the last 100 iterations over the 100
trials.
The tables in the appendix share a similar layout. The middle column is the payoff parameter δ.
To its left are the results for ε-greedy action selection; the results for Softmax action selection
are on the right. Again, the numbers are the frequencies of the four outcomes (CC, CD, DC and
DD) in the last 100 iterations over the 100 runs.
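The experimental procedure above can be sketched as a driver loop. This is a simplified, self-contained reconstruction (two memory-length-1 Q-learners with ε-greedy selection only, and far fewer iterations), not the authors' exact code:

```python
import random
from collections import Counter

ACTIONS = ("C", "D")

def run_repeated_game(payoff, iterations=20000, alpha=0.2, gamma=0.95, eps=0.01):
    """Two independent Q-learners; the state is the previous joint outcome.
    `payoff` maps (row action, column action) to (row payoff, column payoff)."""
    states = [a + b for a in ACTIONS for b in ACTIONS]  # CC, CD, DC, DD
    Q = [{s: {a: 0.0 for a in ACTIONS} for s in states} for _ in range(2)]

    def choose(i, s):
        greedy = max(ACTIONS, key=lambda a: Q[i][s][a])
        if random.random() < eps:  # explore: pick the non-greedy action
            return "D" if greedy == "C" else "C"
        return greedy

    state, counts = "CC", Counter()
    for _ in range(iterations):
        a_row, a_col = choose(0, state), choose(1, state)
        r_row, r_col = payoff[(a_row, a_col)]
        nxt = a_row + a_col
        for i, (a, r) in enumerate(((a_row, r_row), (a_col, r_col))):
            best = max(Q[i][nxt].values())  # equation (1) update for player i
            Q[i][state][a] += alpha * (r + gamma * best - Q[i][state][a])
        counts[nxt] += 1
        state = nxt
    return counts
```

Calling this with one of the δ-parameterized payoff matrices yields outcome counts analogous to the frequencies reported in the appendix tables.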
4.4 The results
It is disturbing when classical game theory tells you that only the inferior Nash equilibrium
will be the outcome, not the Pareto optimal solution (which is not necessarily a Nash
equilibrium). In our first five games, the subgame perfect Nash equilibrium is never the Pareto
outcome. So the question is: will the outcome be different if we use models (such as
reinforcement learning) that better describe human behavior? More specifically, can the players
learn to play a Pareto optimal solution that is not a Nash equilibrium? Our experiments show
that the answer is positive.
Consider Table 1 for prisoner's dilemma Pattern 1. If δ is close to zero, the two players choose
to defect most of the time. The reason is that there is not much difference between the outcomes
of mutual defection and mutual cooperation: the Pareto outcome does not provide enough
incentive for the players to take the risk of inducing cooperation. To avoid being exploited and
receiving zero payoff, they are better off defecting all the time. But as δ gets larger, more
mutual cooperation is observed, which suggests that both players are trying to settle on the
Pareto optimal outcome. The last row of Table 1 shows an interesting scenario: each player
wants to induce the other's cooperation so that it can take advantage by defecting, because the
temptation to defect is very large when the other player is cooperating. That is why we see
many CDs and DCs but less mutual cooperation (CC). The comparison with the results for
prisoner's dilemma Pattern 2 in Table 2 is illustrative. In Pattern 2, the players lose almost
nothing by trying to cooperate when δ is close to zero; exploration helps them reach the much
superior Pareto outcome (CC), and as Table 2 shows, mutual cooperation happens 94% of the
time. Now consider the scenario where δ is close to 3: first, there is not much incentive to shift
from the Nash equilibrium (DD) to the Pareto outcome (CC), since there is little difference in
payoffs; second, the danger of being exploited by the other player and receiving zero payoff is
much higher. In the end, the players learn to defect most of the time (98%).
Now let us turn to a different class of games. Games #47, #48 and #57 are asymmetric games
with a common feature: the row player has a dominant strategy C, because this strategy always
gives a higher payoff than strategy D no matter what the other player does. Thus a fully rational
row player will never choose D. What happens if the players are able to explore and learn?
Tables 3-5 tell us that it depends on the payoffs. If δ is close to zero, the outcome is the Nash
equilibrium (CC) almost 97% of the time, since it does not pay to steer the other player toward
the Pareto optimal solution, and a player attempting to do so is more likely to be exploited. But
as long as the incentive from the Pareto outcome is large enough, the Pareto outcome (DD) is
observed a considerable fraction of the time (above 94%).
The Stag Hunt game is interesting because its Pareto optimal solution is also one of its Nash
equilibria, but which equilibrium is more likely to be sustained remains a challenging problem
for classical game theory. A mixed strategy (i.e., choosing each pure strategy with some fixed
probability) seems a natural prediction for this repeated game under classical game theory.
Table 6 shows that the outcomes of the repeated game with reinforcement learning players are
quite different from the mixed strategy prediction. For example, when δ equals 1, the mixed
strategy for both players is to choose action C with probability 1/3 and D with probability 2/3,
so we should expect to see CC less than 33% of the time, whereas Table 6 shows CC happening
88% of the time. We can also see that when the Pareto outcome is far superior to the other
Nash equilibrium, it is chosen almost 94% of the time.
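The mixed-strategy probability quoted above follows from the indifference condition under Table 4.6's payoffs: a player is indifferent between C and D when the opponent plays C with probability p satisfying 5p = 3p + δ(1 − p). A small sketch (the function name is ours):

```python
from fractions import Fraction

def stag_hunt_mixed_p(delta):
    """Probability of playing C in the mixed Nash equilibrium of Table 4.6.
    Indifference: 5p = 3p + delta*(1 - p)  =>  p = delta / (2 + delta)."""
    delta = Fraction(delta)
    return delta / (2 + delta)
```

With δ = 1 this gives p = 1/3, matching the figure used in the text.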
The remaining two games are coordination games, and we are concerned not only with which
Nash equilibrium is selected but also with a further question: is the Nash equilibrium concept
sufficient to describe what happens in these games? The latter concern arises from behaviors
observed in human experiments: Rapoport et al. (1976) reported that a majority of subjects
playing the game of Chicken quickly settled into an alternating strategy, with the outcome
switching back and forth between the two Nash equilibria.
From Table 7 we can see that the two Nash equilibria in Battle of sexes are equally likely to be
the outcome in most cases, since the game is symmetric and these two outcomes are superior to
the other two, which give both players zero payoff. For the game of Chicken, Table 8 shows
that if the incentive for coordinating on a Nash equilibrium is too small (i.e., δ is close to zero),
the players learn to be conservative at the same time (CC), since they cannot afford the loss in
the DD situation (zero payoff). As δ increases, the game ends up more and more often at a
Nash equilibrium (CD or DC).
To see whether the players can learn the alternating strategy observed in human subject
experiments, we conducted another 100 trials for these two games, with δ set to 1 and Softmax
action selection. In most trials the outcome converges to one of the Nash equilibria, but we did
observe patterns showing alternating strategies in both games. These patterns are quite stable
and recover quickly from small random disturbances. For Battle of sexes, there is only one such
pattern, in which the players play the two Nash equilibria alternately; it occurred in 11 out of
100 trials. For the game of Chicken, there are other kinds of patterns as well, summarized with
their frequencies in Table 4.9.
The outcomes                                    Frequency in 100 trials
Alternating between CD and DC                   10
Cycle through CD-DC-CC or CD-CC-DC              13
Converge to one of the three: CC, CD or DC      76
No obvious patterns                             1
Table 4.9: Frequencies of different kinds of outcomes in the game of Chicken
The frequencies of these patterns are not large, but consider: first, we use payoff numbers
different from those of Rapoport et al. (1976), which may affect the incentive to form such
strategies; second, our players do not explicitly know the payoff matrix and can only learn the
opponent's payoff structure implicitly through the opponent's behavior (not an easy task); and
finally, there may be features of human behavior that are not captured in our current Q-learning
model but are important for human subjects learning such an alternating strategy. Our main
point, however, is clear: the Nash equilibrium concept seems insufficient for describing the
outcomes of repeated coordination games such as Chicken and Battle of sexes.
*Note: To save space, there are additional results that we do not discuss here but add to the
appendix for completeness.
1. We repeated all the experiments with the players' memory length set to 2. The results are
   shown in Tables 9-16.
2. We set ε in ε-greedy for the row player to 0.03 (the column player's ε remains at 0.01) and
   repeated the experiment with ε-greedy action selection and memory length 1 on game #47.
   The results are summarized in Table 17.
3. We set ε in ε-greedy for the row player to 0.03 (the column player's ε remains at 0.01) and
   repeated the experiment with ε-greedy action selection on the games of Chicken and Battle
   of sexes. The frequencies of patterns are reported in Tables 18-21.
5. Discussion
Appendix
ε-greedy action selection             Softmax action selection
CC    CD    DC    DD      δ      CC    CD    DC    DD
3     87    82    9828    0.05   0     106   101   9793
0     92    105   9803    0.5    0     90    94    9816
52    110   111   9727    1      1     111   111   9777
δ      CC     CD    DC    DD
0.15   3193   318   340   6149
0.2    1934   530   325   7211
0.5    314    365   177   9144
1      254    550   149   9047
1.5    253    529   184   9034
2      206    379   151   9264
3      246    471   177   9106
Table 17: Game #47 with ε-greedy action selection and memory length 1
The outcomes                           Frequency in 100 trials
Alternating between CC and DD          8
Converge to CC or DD                   92
Table 18: Battle of sexes with ε-greedy action selection and memory length 1
The outcomes                                    Frequency in 100 trials
Alternating between CD and DC                   9
Converge to one of the three: CC, CD or DC      70
Other patterns                                  11
No obvious patterns                             10
Table 19: Chicken with ε-greedy action selection and memory length 1
The outcomes                           Frequency in 100 trials
Alternating between CC and DD          29
Converge to CC or DD                   71
Table 20: Battle of sexes with ε-greedy action selection and memory length 2
The outcomes                                    Frequency in 100 trials
Alternating between CD and DC                   26
Converge to one of the three: CC, CD or DC      46
Other patterns                                  18
No obvious patterns                             10
Table 21: Chicken with ε-greedy action selection and memory length 2
References
1. Rapoport, Anatol; Guyer, Melvin J. and Gordon, David G. (1976). The 2×2 Game. Ann
   Arbor, MI: University of Michigan Press.
2. Roth, Alvin E. and Erev, Ido (1995). "Learning in Extensive-Form Games: Experimental
   Data and Simple Dynamic Models in the Intermediate Term." Games and Economic
   Behavior 8, 164-212.
3. Sandholm, Thomas W. and Crites, Robert H. (1995). "Multiagent Reinforcement Learning
   in the Iterated Prisoner's Dilemma." Biosystems 37, 147-166.
4. Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.
5. Watkins, C. J. C. H. and Dayan, P. (1992). "Q-learning." Machine Learning 8, 279-292.
6. Watkins, C. (1989). Learning from Delayed Rewards. PhD thesis, King's College,
   Cambridge.