3. Reinforcement learning
3.1 The Q-learning algorithm
If we move from the regime of full rationality to a regime of exploring rationality, where
players are willing to take some risks to explore their opponents and the game structure, there is
more than one way to model the players. Here we are drawn to the class of reinforcement
learning models, which seem to describe human behavior well (see Roth and Erev 1995).
Reinforcement learning is inspired by learning theory in psychology, which holds that the
likelihood of choosing an action is strengthened if that action leads to a favorable outcome. This
characteristic has been observed to be quite robust in the learning behavior of humans and
animals. At the same time, reinforcement learning has been widely used in the machine learning
community to solve individual decision problems (see Sutton and Barto 1998), but it has rarely
been studied in a strategic environment (such as repeated games), where the learning behavior of
each player has an important impact on that of the others and on the outcome. It is therefore also
interesting, from a machine learning perspective, to see how reinforcement learning performs in
such a strategic environment. In the following part we describe the specific reinforcement
learning algorithm used in our experiments, Q-learning. More detailed information can be found
in Watkins 1989, Watkins and Dayan 1992, and Sutton and Barto 1998.

The Q-learning algorithm works by estimating the values of state-action pairs. The value Q(s,a) is
defined as the expected discounted sum of future payoffs obtained by taking action a from state
s and following an optimal policy thereafter. Once these values have been learned, the optimal
action in any state is the one with the highest Q-value. The standard procedure for Q-learning is
as follows. Assume that Q(s,a) is represented by a lookup table containing a value for every
possible state-action pair, and that the table entries are initialized to arbitrary values. Then the
procedure for estimating the correct Q(s,a) is to repeat the following loop until a termination
criterion is met:

1. In the current state s, choose an action a; this yields an immediate reward r and a transition to
the next state s'.

2. Update Q(s, a) according to the following equation:

                        ∆Q(s, a) = α [ r + γ max_b Q(s', b) − Q(s, a) ]                (1)

where α is the learning rate parameter and γ is the discount factor.

In the context of repeated games, the player explores the environment (its opponent and the
game structure) by taking some risk in step 1 and choosing an action that may not be the current
optimal one. In step 2, an action that leads to a higher reward strengthens the Q-value for that
state-action pair. This procedure is guaranteed to converge to the correct Q-values for stationary
MDPs.
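
A minimal sketch of this tabular update in Python; the class layout, names, and defaults are our own illustration (the defaults anticipate the parameter values in Section 4.3), not the original experiment code:

```python
import random

# Hypothetical tabular Q-learner illustrating update (1).
class QLearner:
    def __init__(self, n_states, n_actions, alpha=0.2, gamma=0.95):
        self.alpha, self.gamma = alpha, gamma
        # Q(s, a) lookup table, initialized arbitrarily (here: zeros).
        self.q = [[0.0] * n_actions for _ in range(n_states)]

    def update(self, s, a, r, s_next):
        """Apply Q(s,a) += alpha * [r + gamma * max_b Q(s',b) - Q(s,a)]."""
        target = r + self.gamma * max(self.q[s_next])
        self.q[s][a] += self.alpha * (target - self.q[s][a])

    def greedy_action(self, s):
        # Break ties randomly among actions with the maximal Q-value.
        best = max(self.q[s])
        return random.choice([a for a, v in enumerate(self.q[s]) if v == best])
```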

In practice, the exploration strategy in step 1 is usually chosen so that it ensures sufficient
exploration while still favoring actions with higher value estimates in a given state. A variety of
methods may be used. A simple one is to behave greedily most of the time but, with a small
probability ε, choose an action at random from those that do not have the highest Q-value.
This action selection method is called ε-greedy in Sutton and Barto (1998). Another is the
softmax action selection method, in which actions with higher value estimates are more likely to
be chosen in a given state. The most common form for the probability of choosing action a is
                              e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}        (2)

where τ is a positive parameter that decreases over time and n is the number of available actions.
In the limit τ → 0, softmax action selection becomes greedy action selection. In our experiments
we tried both ε-greedy and softmax action selection.
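
Both selection rules are straightforward to implement. The following sketch is our own illustration (it assumes a list q_row of Q-values for the current state), not the code used in the experiments:

```python
import math
import random

def epsilon_greedy(q_row, epsilon):
    """With probability epsilon pick a non-greedy action at random, else act greedily."""
    greedy = max(range(len(q_row)), key=lambda a: q_row[a])
    others = [a for a in range(len(q_row)) if a != greedy]
    if others and random.random() < epsilon:
        return random.choice(others)
    return greedy

def softmax(q_row, tau):
    """Sample an action with probability proportional to exp(Q(a)/tau), as in (2)."""
    m = max(q_row)  # subtracting the max is for numerical stability only
    weights = [math.exp((q - m) / tau) for q in q_row]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for a, w in enumerate(weights):
        acc += w
        if r <= acc:
            return a
    return len(q_row) - 1
```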

3.2 Implementation of Q-learning for 2 by 2 games

Q-learning does not need a model of its environment and can be used on-line, so it is well suited
to repeated games against an unknown opponent (especially one with the same adaptive
behavior). Here we focus on repeated 2 by 2 games. Considering the iterated prisoner's dilemma,
where "tit for tat" is often discussed, it is natural to represent the state as the outcome of the
previous game played; we say in this case that the player has a memory length of one. The
number of states for a 2 by 2 game is then 4, and in each state there are two actions (the pure
strategies) from which the player can choose for the current game. We also conducted
experiments for the case where players have a memory length of two (so the number of states is
16). The immediate reward a player receives is its payoff from the payoff matrix.
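
One way to encode such states, sketched under our own conventions (action indices 0 = C, 1 = D; the function name is hypothetical), is to treat the most recent joint outcome(s) as digits of a base-4 number:

```python
def encode_state(history, memory_length=1):
    """Map the last `memory_length` joint outcomes to a state index.

    Each outcome is a pair (my_action, opp_action) with actions in {0, 1}
    (0 = C, 1 = D), so one remembered outcome gives 4 states and two give 16.
    """
    state = 0
    for my_a, opp_a in history[-memory_length:]:
        state = state * 4 + (my_a * 2 + opp_a)
    return state

# Example: with memory length one, the previous outcome (C, D) maps to state 1.
assert encode_state([(0, 1)], memory_length=1) == 1
```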

For the softmax action selection method, we let the parameter τ decrease as follows:

                                   τ = T · θ^n                  (3)

where T is a constant, n is the number of games played so far, and θ, called the annealing factor,
is a positive constant less than one. In the implementation, once n is large enough, τ is close to
zero and the player stops exploring; we switch to ε-greedy selection after that point to keep the
player exploring.
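
A sketch of this annealing schedule, reusing the selection functions from the previous snippet; the threshold name tau_min and the default values (which anticipate Section 4.3) are our assumptions:

```python
def temperature(T, theta, n):
    """Annealed temperature tau = T * theta**n, with theta in (0, 1), as in (3)."""
    return T * theta ** n

def select_action(q_row, n, T=5.0, theta=0.9999, tau_min=0.01, epsilon=0.01):
    """Use softmax while tau stays above tau_min, then fall back to epsilon-greedy."""
    tau = temperature(T, theta, n)
    if tau > tau_min:
        return softmax(q_row, tau)
    return epsilon_greedy(q_row, epsilon)
```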



4. Experiments
4.1 The motivation
Repeated 2 by 2 games are the simplest setting for strategic interaction and are a good starting
point for investigating how different the outcome is under exploring rationality compared with
full rationality. Take the iterated prisoner's dilemma as an example: if a player takes the risk of
cooperating in some round, hoping to induce later cooperation from an opponent who may think
in the same way, it is possible that both players eventually find out that they can get more
through mutual cooperation. Even if a player loses a little in the early stage, later sustained
mutual cooperation is a good reason to explore early on.




Motivated by this intuition, we deliberately selected 8 games and parameterized their payoff
matrices. In each repeated game, both players are modeled as Q-learners. For 5 of the games, the
Pareto optimal solution does not coincide with a Nash equilibrium. The rest are games with two
Nash equilibria, which we included to address the equilibrium selection issue.

4.2 The games and the parameterization

The following is the list of the games and how we parameterized their payoff matrices. The
parameter is δ. In each payoff matrix, the first number is the payoff of the row player and the
second is the payoff of the column player. We mark Nash equilibria (NE) with # and Pareto
optimal solutions with *. C and D are the actions, or pure strategies, that players can take. The
row player always comes first: when we say the outcome of one play is CD, the row player chose
pure strategy C and the column player chose pure strategy D. So there are four outcomes of one
play: CC, CD, DC, and DD.

The first two games are versions of the prisoner's dilemma. The value of δ ranges from 0 to 3.
When δ = 2 in Table 4.1, the matrix corresponds to the most common payoff matrix for the
prisoner's dilemma.

                                         C                   D
                       C                 (3,3)*              (0, 3+δ)
                       D                 (3+δ,0)             (3-δ, 3-δ)#

                             Table 4.1: Prisoner's dilemma Pattern 1

                                         C                   D
                       C                 (3,3)*              (0, 3+δ)
                       D                 (3+δ,0)             (δ, δ)#

                             Table 4.2: Prisoner's dilemma Pattern 2
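
For illustration, the two patterns above can be written as payoff functions of δ. This is a sketch under our own conventions (the dictionary layout and function names are ours): payoffs are keyed by (row action, column action) and returned as (row payoff, column payoff).

```python
def pd_pattern1(delta):
    """Prisoner's dilemma Pattern 1: mutual defection pays (3 - delta, 3 - delta)."""
    return {
        ('C', 'C'): (3, 3),
        ('C', 'D'): (0, 3 + delta),
        ('D', 'C'): (3 + delta, 0),
        ('D', 'D'): (3 - delta, 3 - delta),
    }

def pd_pattern2(delta):
    """Prisoner's dilemma Pattern 2: mutual defection pays (delta, delta)."""
    return {
        ('C', 'C'): (3, 3),
        ('C', 'D'): (0, 3 + delta),
        ('D', 'C'): (3 + delta, 0),
        ('D', 'D'): (delta, delta),
    }
```

With δ = 2, pd_pattern1 gives the familiar 5/3/1/0 prisoner's dilemma payoffs.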

While the above two are symmetric games, the following three are asymmetric games adapted
from Rapoport et al. (1976). Their payoff numbers differ from the original ones given by
Rapoport et al. The value of δ ranges from 0 to 3.

                                   C                     D
                      C            (0.2, 0.3)#            (0.3+δ, 0.1)
                      D            (0.1, 0.2)             (0.2+δ, 0.3+δ)*

                                       Table 4.3: game #47

                                   C                     D
                       C           (0.2 ,0.2)#           (0.3+δ,0.1)
                       D           (0.1, 0.3)            (0.2+δ, 0.3+δ)*

                                       Table 4.4: game #48

                                   C                     D
                       C           (0.2, 0.3)#            (0.3+δ, 0.2)
                       D           (0.1, 0.1)             (0.2+δ, 0.3+δ)*

                                       Table 4.5: game #57

For a game with two Nash equilibria, one challenging question is which equilibrium is more
likely to be selected as the outcome. We chose three games from this class. The Stag Hunt has a
Pareto optimal solution that is also one of its Nash equilibria. The game of Chicken and the
Battle of sexes are coordination games. The value of δ ranges from 0 to 3 for Stag Hunt and
Battle of sexes; for Chicken, δ ranges from 0 to 2. Note that no Pareto optimal solution is marked
for the last two coordination games.

                                            C                   D
                       C                    (5,5)*              (0,3)
                       D                    (3,0)               (δ,δ)

                                         Table 4.6: Stag Hunt

                                            C                   D
                       C                    (δ,3-δ)#            (0,0)
                       D                    (0,0)               (3-δ,δ)#

                                       Table 4.7: Battle of sexes

                                            C                   D
                       C                    (2,2)               (δ, 2+δ)#
                       D                    (2+δ, δ)#           (0, 0)

                                          Table 4.8: Chicken

4.3 The setting for the experiments

The parameters for Q-learning are set as follows: the learning rate is 0.2 and the discount factor
is 0.95. We ran the experiments with both softmax and ε-greedy action selection. For softmax
action selection, T is set to 5 and the annealing factor to 0.9999; when τ falls below 0.01, we
switch to ε-greedy selection with ε set to 0.01. We chose these parameter values following those
used in other studies (Sandholm and Crites 1995).

Each repeated game runs for 200,000 iterations, so the players have enough time to explore and
learn. For each setting of the payoff parameter δ, we ran the repeated game for 100 trials and
recorded the frequencies of the four outcomes (CC, CD, DC and DD) every 100 iterations. The
numbers usually become stable within 50,000 iterations, so unless noted otherwise we report the
frequencies of the outcomes in the last 100 iterations, aggregated over the 100 trials.
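
Putting the earlier sketches together, the overall experiment loop, as we read this description, looks roughly like the following. The interfaces QLearner, select_action, encode_state and pd_pattern1 come from the sketches above and are our own assumptions, not the original code:

```python
from collections import Counter

def run_trial(payoffs, iterations=200_000, window=100):
    """Play one repeated 2x2 game between two independent Q-learners and
    return the outcome counts over the final `window` iterations."""
    actions = ['C', 'D']
    row, col = QLearner(4, 2), QLearner(4, 2)   # memory length 1: 4 states
    s_row = s_col = 0                            # arbitrary initial state
    counts = Counter()
    for n in range(iterations):
        a_r = select_action(row.q[s_row], n)
        a_c = select_action(col.q[s_col], n)
        r_row, r_col = payoffs[(actions[a_r], actions[a_c])]
        s_row_next = encode_state([(a_r, a_c)])
        s_col_next = encode_state([(a_c, a_r)])
        row.update(s_row, a_r, r_row, s_row_next)
        col.update(s_col, a_c, r_col, s_col_next)
        s_row, s_col = s_row_next, s_col_next
        if n >= iterations - window:
            counts[actions[a_r] + actions[a_c]] += 1
    return counts

# Aggregate over 100 trials, e.g. for Prisoner's dilemma Pattern 1 with delta = 2:
# totals = sum((run_trial(pd_pattern1(2.0)) for _ in range(100)), Counter())
```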

The tables in the appendix share a similar layout. The middle column is the payoff parameter δ.
On its left are the results for ε-greedy action selection; the results for softmax action selection
are on the right. Again, the numbers are the frequencies of the four outcomes (CC, CD, DC and
DD) in the last 100 iterations over 100 runs.


4.4 The results

It is disturbing when classical game theory tells you that only an inferior Nash equilibrium will
be the outcome, rather than the Pareto optimal solution (which is not necessarily a Nash
equilibrium). In our first five games, the subgame perfect Nash equilibrium is never the Pareto
outcome. So the question is: will the outcome be different, and perhaps more reasonable, if we
use models (such as reinforcement learning) that better describe human behavior? More
specifically, can the players learn to play a Pareto optimal solution that is not a Nash
equilibrium? Our experiments show that the answer is positive.

Looking at Table 1 for Prisoner's dilemma Pattern 1: if δ is close to zero, the two players defect
most of the time. The reason is that there is little difference between the outcomes of mutual
defection and mutual cooperation, so the Pareto outcome does not provide enough incentive for
the players to take the risk of inducing cooperation. To avoid being exploited and receiving a
zero payoff, they are better off defecting all the time. But as δ gets larger, more mutual
cooperation is observed, which suggests that both players are trying to settle on the Pareto
optimal outcome. The last row of Table 1 shows an interesting scenario: each player wants to
induce the other's cooperation in order to take advantage of it by defecting, because the
temptation to defect is very large when the other player cooperates. That is why we see many CD
and DC outcomes but less mutual cooperation (CC). The comparison with the results for
Prisoner's dilemma Pattern 2 in Table 2 is illustrative. In Pattern 2 the players lose almost nothing
by trying to cooperate when δ is close to zero; exploration helps them reach the much superior
Pareto outcome (CC), and as Table 2 shows, mutual cooperation happens 94% of the time. When
δ is close to 3, on the other hand, there is little incentive to shift from the Nash equilibrium (DD)
to the Pareto outcome (CC) since the payoffs differ little, and the danger of being exploited by
the other player and getting a zero payoff is much higher, so the players learn to defect most of
the time (98%).

Now let us turn to a different class of game. Game #47, game #48 and game #57 are asymmetric
games with a common feature: the row player has a dominant strategy C, because this strategy
always gives a higher payoff than strategy D no matter what the other player does. Thus a fully
rational row player will never choose D. What happens when players are allowed to explore and
learn? Tables 3-5 tell us that it depends on the payoffs. If δ is close to zero, the outcome is the
Nash equilibrium (CC) almost 97% of the time, since it does not pay to try to induce the other
player toward the Pareto optimal solution, and a player who tries is more likely to be exploited.
But once the incentive from the Pareto outcome is large enough, the Pareto outcome (DD) is
observed a considerable amount of the time (above 94%).

The Stag Hunt is interesting because its Pareto optimal solution is also one of its Nash equilibria,
but which equilibrium is more likely to be sustained remains a challenging problem for classical
game theory. A mixed strategy (i.e., choosing each pure strategy with some fixed probability)
seems a natural prediction for this repeated game under classical game theory. Table 6 shows
that the outcomes of the repeated game between reinforcement learning players are very
different from the mixed-strategy prediction. For example, when δ equals 1, the mixed strategy
for both players is to choose action C with probability 1/3 and D with probability 2/3, so we
should expect to see CC well under 33% of the time, while Table 6 shows CC happening 88% of
the time. We also see that when the Pareto outcome is far superior to the other Nash
equilibrium, it is chosen almost 94% of the time.
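
As a quick check of the 1/3 figure (our own derivation from the payoffs in Table 4.6): if the opponent plays C with probability p, the row player's expected payoffs are

                        E[C] = 5p,        E[D] = 3p + δ(1 − p),

and indifference requires 5p = 3p + δ(1 − p), i.e. p = δ/(2 + δ). For δ = 1 this gives p = 1/3, so under the mixed equilibrium CC would occur only about p² = 1/9 of the time, far below the observed 88%.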

The remaining two games are coordination games, and we are concerned not only with which
Nash equilibrium is selected but also with a further question: is the Nash equilibrium concept
sufficient to describe what happens in these games? The latter concern arises because different
behavior is observed in human experiments: Rapoport et al. (1976) reported that a majority of
subjects playing the game of Chicken quickly settled into an alternating strategy, with the
outcome changing back and forth between the two Nash equilibria.

From Table 7 we can see that the two Nash equilibria of Battle of sexes are equally likely to be
the outcome in most cases, since the game is symmetric and these two outcomes are superior to
the other two, which give both players zero payoff. As for the game of Chicken, Table 8 shows
that if the incentive for coordinating on a Nash equilibrium is too small (i.e., δ is close to zero),
the players both learn to be conservative (CC), since they cannot afford the loss from DD
(getting zero). As δ increases, the game ends up more and more often at a Nash equilibrium
(CD or DC).

To see whether players can learn the alternating strategy observed in human-subject
experiments, we conducted another 100 trials for these two games with δ set to 1 and softmax
action selection. In most trials the outcome converges to one of the Nash equilibria, but we did
observe patterns showing alternating strategies in both games. These patterns are quite stable
and recover quickly from small random disturbances. For Battle of sexes, the only such pattern
is the two players playing the two Nash equilibria alternately; it occurred in 11 out of 100 trials.
For the game of Chicken, there are other kinds of patterns as well, summarized with their
frequencies in Table 4.9.

The outcomes                                              Frequency in 100 trials
Alternating between CD and DC                             10
Cycle through CD-DC-CC or CD-CC-DC                        13
Converge to one of the three: CC, CD or DC                76
No obvious patterns                                       1

            Table 4.9: Frequencies of different kinds of outcome in the game of Chicken

The frequencies of these patterns are not large, but there are several possible reasons. First, our
payoff numbers differ from those of Rapoport et al. (1976), which may weaken the incentive to
form such strategies. Second, our players do not explicitly know the payoff matrix and can only
learn the opponent's payoff structure implicitly through the opponent's behavior, which is not an
easy task. Finally, there may be features of human behavior that are important for learning such
alternating strategies but are not captured in our current Q-learning model. Our main point,
however, is clear: the Nash equilibrium concept does not seem sufficient for
describing the outcomes of repeated coordination games such as Chicken and Battle of sexes.
*Note: To save space, there are additional results that we do not discuss here but include in the
appendix for completeness.
    1. We conducted all the experiments with the memory length of the players set to 2. The
         results are shown in Tables 9-16.
    2. We set ε in ε-greedy for the row player to 0.03 (the column player's ε remains 0.01) and
         repeated the experiment with ε-greedy action selection and memory length 1 on game
         #47; the results are summarized in Table 17.
    3. We set ε in ε-greedy for the row player to 0.03 (the column player's ε remains 0.01) and
         repeated the experiment with ε-greedy action selection on the game of Chicken and
         Battle of sexes. The frequencies of the patterns are reported in Tables 18-21.




5. Discussion

Appendix

     ε-greedy action selection                             Softmax action selection
    CC      CD      DC      DD           δ            CC      CD      DC      DD
     3      87      82    9828         0.05            0     106     101    9793
     0      92     105    9803          0.5            0      90      94    9816
    52     110     111    9727            1            1     111     111    9777
    51     110      93    9746         1.25         2475     338     358    6829
  1136     160     198    8506          1.5         3119     526     483    5872
  1776     245     381    7598         1.75         4252     653     666    4429
  3526     547     413    5514            2          789     883     869    7549
   848     766     779    7607          2.5          496    2276    2368    4860
   544    2313    2306    4837         2.95          539    2821    2112    4528


                      Table 1: Prisoner's dilemma Pattern 1

     ε-greedy action selection                           Softmax action selection
    CC      CD       DC       DD           δ           CC       CD      DC       DD
   9422    218      183      177          0.05        9334     302     285           79
   9036    399      388      150           0.5        9346     294     220           140
   5691    738      678     2693               1      7537     954     1267          242
   3506    179      275     6040          1.25        8203     542     994           261
   1181    184      116     8519           1.5        7818     767     775           640
     2      98      103     9797          1.75        4685     270     422           4623
    97     114       91     9698               2      1820     217     220           7743
     0     100       92     9808           2.5          0      77      117           9806
     2      96       94     9808          2.95          0      90      114           9796


                      Table 2: Prisoner’s dilemma Pattern2

      ε-greedy action selection                          Softmax action selection
     CC      CD       DC      DD           δ           CC       CD      DC       DD
   9790     101      101       8               0      9808     94       98            0
   4147     137      156     5560          0.1        9812     94       93            1
   3019     123      165     6693         0.15        9799     95      104            2
   2188     141      132     7539          0.2        8934     85      109           872
    185     355      130     9330          0.5        730      284     208           8778
    131     309      135     9425              1      120      532     138           9210
    138     288       99     9475          1.5         77      471     103           9349
     99     321      131     9449              2       88      441     126           9345
    126     172       88     9614              3       64      366      92           9478


                              Table 3: Game #47

     ε-greedy action selection                             Softmax action selection
    CC      CD      DC      DD           δ            CC      CD      DC      DD
  9789     102     107       2            0          9787     106     105       2
  3173     515     173    6139          0.1          9811      86     101       2
  2832     457     207    6504         0.15          8127     256     137    1480
  1227     348     141    8284          0.2          2986     755     230    6029
   109     627     143    9121          0.5           143     631     146    9080
    90     492     139    9279            1            79    1320     126    8475
    88     318     134    9460          1.5           117    1076     128    8679
   241     236     119    9404            2            62     473     126    9339
    76     284     139    9501            3            64     277     128    9531


                              Table 4: Game #48
      ε-greedy action selection                     Softmax action selection
     CC      CD       DC      DD         δ        CC       CD      DC       DD
   9767     119      107       7             0    9764    131     105            0
   1684     587      175     7554        0.1      9794    106      98            2
    531     518      191     8760       0.15      9550    105     105           240
    238     543      159     9060        0.2      1048    497     257           8198
    126     307      121     9446        0.5      224     852     152           8772
    118     520      114     9248            1    113     753     119           9015
    104     526      125     9245        1.5      74      538     117           9271
     66     225      102     9607            2    57      569     123           9251
    123     296      116     9465            3    61      302     125           9512


                             Table 5: Game #57

     ε-greedy action selection                      Softmax action selection
    CC      CD       DC       DD         δ        CC       CD      DC       DD
   9390    126      122      362          0       9715    108     109           68
   9546     91      108      255         0.5      9681    120     121           78
   9211    112      125      552        0.75      9669    111     101           119
   8864    119      110      907          1       9666    98      102           134
   8634    115      132     1119        1.25      9598    139     134           129
   7914    122      130     1834         1.5      9465    99      109           327
   7822    122      104     1952          2       9452    126     126           296
   5936     87      101     3876         2.5      8592    116      89           1203
   5266    121      106     4507             3    3524    111     115           6250


                             Table 6: Stag Hunt

     ε-greedy action selection                             Softmax action selection
    CC      CD      DC      DD           δ            CC      CD      DC      DD
  2641      63    4571    2725            0          2872      73    4477    2578
  3842     135    1626    4397          0.1          4615     101    1732    3552
  5140     102      90    4668          0.5          4772     102     162    4964
  4828     107      94    4971            1          4862      88      89    4961
  4122     101     109    5668          1.5          4642      85     102    5171
  4983     100      97    4820            2          4623      97      87    5193
  3814     111      96    5979          2.5          5139     102      99    4660
  4015    1388     107    4490          2.9          4303    1794     118    3785
  2653    4921      70    2356            3          2593    4776      58    2573


                       Table 7: Battle of sexes

      ε-greedy action selection                                   Softmax action selection
     CC       CD      DC      DD                    δ           CC       CD      DC       DD
   9276      227     347     150                     0          9509     165     222            104
   9587      143     135     135                   0.25         9119     428     320            133
   9346      209     223     222                    0.5         9375     220     225            180
   6485     1491    1858     166                   0.75         8759     424     632            185
   1663     3532    4706      99                     1          1339     4903    3662           96
    385     4161    5342     112                   1.25         158      5416    4323           103
    113     4488    5274     125                    1.5         115      4700    5099           86
    111     4301    5504      84                   1.75         100      4704    5083           113
    100     4853    4953      94                        2       94       4772    5044           90


                                         Table 8: Chicken
* Memory length setting is 1 for Tables 1-8.

     ε-greedy action selection                                    Softmax action selection
    CC      CD      DC       DD                     δ           CC       CD      DC       DD
       253        122         129        9496      0.05             2       85     106           9807
       860        137         133        8870       0.5             0       98     103           9799
       300        124         112        9464        1              4       95     100           9801
       227         88         121        9564      1.25            15      160     154           9671
      1615        316         325        7744       1.5           309      304     365           9022
      2900       1112        1085        4903      1.75           590      682     746           7982
      2748       1681        1652        3919        2            281     1522    1476           6721
      1919       2927        2988        2166       2.5           389     4235    4111           1265
       905       4384        4199         512      2.95           578     4170    4018           1234

                              Table 9: Prisoner’s dilemma Pattern1

     ε-greedy action selection                             Softmax action selection
    CC      CD      DC      DD           δ            CC      CD      DC      DD
  9453     267     258      22         0.05          0.05    9441     277     206
  9230     312     314     144          0.5          0.25    9379     260     296
  7591     674     633    1102            1           0.5    8940     251     245
  4360     483     542    4615         1.25          0.75    8100     444     376
  1297     456     443    7804          1.5             1    4311     867     813
     2     112     100    9786         1.75          1.25     816     416     406
     3      81      99    9817            2           1.5       3     129     117
     4      87     101    9808          2.5             2       0     100      95
     0     108      91    9801         2.95           2.5       1      93     122

                       Table 10: Prisoner's dilemma Pattern 2

     ε-greedy action selection                            Softmax action selection
    CC      CD      DC       DD              δ          CC       CD      DC       DD
    9772     110      115        3               0       9784        93    123          0
    8812     275      163      750          0.1          9810       102     88          0
    5200     561      227    4012           0.15         9812        97     88          3
    3704     817      210    5269           0.2          9702        111    89         98
     290     558      185    8967           0.5          1458       656    224      7662
     105     680      159    9056                1             99   646    216      9039
       99    404      154    9343           1.5                55   270    182      9493
       71    463      124    9342                2             16   224    147      9613
       77    271      135    9517                3             34   245    184      9537


                               Table 11: Game #47

     ε-greedy action selection                            Softmax action selection
    CC      CD      DC       DD              δ          CC       CD      DC       DD
    9766     126      108       0                0       9803       100     95          2
    7772     179      153    1896           0.1          9821        90     87          2
    3775     802      193    5230           0.15         9598       104    104      194
    2848    1011      203    5938           0.2          7877       308    204      1611
     491     539      202    8768           0.5          1083       947    199      7771
     113     758      170    8959                1             93   791    158      8958
     114     677      187    9022           1.5                58   567    150      9225
       66    477      131    9326                2             33   286    161      9520
       66    411      168    9355                3             43   236    152      9569


                               Table 12: Game #48
     ε-greedy action selection                            Softmax action selection
    CC      CD      DC       DD              δ          CC       CD      DC       DD
      8757       715   111         417        0          9765        146    85         4
      3084      2534   124        4258       0.1         9634        211    88        67
      1553      2235   241        5971      0.15         4634       2717   176      2476
       491      1408   192        7909       0.2         2377       2280   207      5136
       131       623   170        9076       0.5          152       1331   196      8321
       106      1005   201        8688        1           130        791   127      8952
        51       582   131        9236       1.5           43        396   135      9426
        60       490   151        9299        2            47        471   161      9321
        83       570   192        9155        3            62        392   189      9357

                               Table 13: Game #57




ε-greedy action selection                                      Softmax action selection
    CC      CD      DC       DD                       δ           CC       CD      DC       DD
      9557        122         109           212        0          9742         138   109         11
      9052        161         134           653      0.5          9742         123   128          7
      8703        151         135          1011      0.75         9613         177   185         25
      8493        190         148          1169        1          9670         120   174         36
      8197        193         147          1463      1.25         9458         140   110        292
      7745        133         122          2000      1.5          9396         166   127        311
      5900        148         129          3823        2          8119         158   122       1601
      4111        114         142          5633      2.5          4015         156   163       5666
      2772        102         126          7000        3          1462         107   115       8316

                                        Table 14: Stag Hunt

     ε-greedy action selection                                      Softmax action selection
    CC      CD      DC       DD                       δ           CC       CD      DC       DD
      4124        125        1728          4023       0           3929       93      2132      3846
      4835        141         746          4278      0.1          4556       97       580      4767
      4092         94         117          5697      0.5          6178       92        97      3633
      5788         89          95          4028       1           4592       79        99      5230
      5308         94         108          4490      1.5          5012      107       110      4771
      4620        138         105          5137       2           4619      121        94      5166
      5119        146          93          4642      2.5          4357      121        93      5429
      4797        679         144          4308      2.9          4250     1286       143      4321
      3797       2112         125          3966       3           3727     2157       106      4010

                                      Table 15: Battle of sexes

     ε-greedy action selection                                      Softmax action selection
    CC      CD      DC       DD                       δ           CC       CD      DC       DD
      9342        297         280            81        0              0    9538       206       215
      9205        316         405            74      0.25          0.25    9465       256       236
      8529        727         684            60       0.5           0.5    9225       372       332
      7300       1178        1413           109      0.75          0.75    8012       871       967
      1401       3819        4666           114        1              1    1919      4323      3659
       223       4728        4937           112      1.25          1.25     255      4957      4661
       162       4073        5667            98       1.5           1.5     171      4825      4898
       188       5105        4574           133      1.75          1.75     130      4986      4787
       149       4693        5064            94        2              2     107      3816      5971

                                         Table 16: Chicken
* Memory length setting is 2 for Tables 9-16.

                        δ                 CC          CD          DC        DD
                        0               9594         106         298         2
                      0.1               7295         162         341      2202
                     0.15               3193         318         340      6149
                      0.2               1934         530         325      7211
                      0.5                314         365         177      9144
                        1                254         550         149      9047
                      1.5                253         529         184      9034
                        2                206         379         151      9264
                        3                246         471         177      9106

              Table 17: Game #47 with ε-greedy action selection and memory length 1

The outcomes                                         Frequency in 100 trials
Alternating between CC and DD                        8
Converge to CC or DD                                 92

            Table 18: Battle of sexes with ε –greedy action selection and memory length as 1

The outcomes                                                 Frequency in 100 trials
Alternating between CD and DC                                9
Converge to one of the three: CC, CD or DC                   70
Other patterns                                               11
No obvious patterns                                          10

               Table 19: Chicken with ε –greedy action selection and memory length as 1

The outcomes                                         Frequency in 100 trials
Alternating between CC and DD                        29
Converge to CC or DD                                 71

            Table 20: Battle of sexes with ε –greedy action selection and memory length as 2

The outcomes                                                 Frequency in 100 trials
Alternating between CD and DC                                26
Converge to one of the three: CC, CD or DC                   46
Other patterns                                               18
No obvious patterns                                          10

               Table 21: Chicken with ε –greedy action selection and memory length as 2



References
1. Rapoport, Anatol; Guyer, Melvin J. and Gordon, David G. (1976). The 2x2 Game. Ann Arbor,
    MI: University of Michigan Press.
2. Roth, Alvin E. and Erev, Ido (1995). "Learning in Extensive-Form Games: Experimental Data
    and Simple Dynamic Models in the Intermediate Term." Games and Economic Behavior 8,
    164-212.
3. Sandholm, Thomas W. and Crites, Robert H. (1995). "Multiagent Reinforcement Learning in
    the Iterated Prisoner's Dilemma." Biosystems 37, 147-166.
4. Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.
5. Watkins, C. J. C. H. and Dayan, P. (1992). "Q-learning." Machine Learning 8, 279-292.
6. Watkins, C. (1989). Learning from Delayed Rewards. PhD thesis, King's College, Cambridge.




Draft.doc                               4/26/2010                                   14/14

Más contenido relacionado

La actualidad más candente

2 6 complex fractions
2 6 complex fractions2 6 complex fractions
2 6 complex fractionsmath123b
 
Int Math 2 Section 2-5 1011
Int Math 2 Section 2-5 1011Int Math 2 Section 2-5 1011
Int Math 2 Section 2-5 1011Jimbo Lamb
 
F4 04 Mathematical Reasoning
F4 04 Mathematical  ReasoningF4 04 Mathematical  Reasoning
F4 04 Mathematical Reasoningguestcc333c
 
Systems word problems notes
Systems word problems notesSystems word problems notes
Systems word problems notesRon Eick
 
cvpr2011: game theory in CVPR part 1
cvpr2011: game theory in CVPR part 1cvpr2011: game theory in CVPR part 1
cvpr2011: game theory in CVPR part 1zukun
 
2 7 summary of usage of lcm
2 7 summary of usage of lcm2 7 summary of usage of lcm
2 7 summary of usage of lcmmath123b
 
Combinatorial Problems2
Combinatorial Problems2Combinatorial Problems2
Combinatorial Problems23ashmawy
 
0.5 Rational Expressions
0.5 Rational Expressions0.5 Rational Expressions
0.5 Rational Expressionssmiller5
 
Lesson 2.2 division
Lesson 2.2 divisionLesson 2.2 division
Lesson 2.2 divisionA V Prakasam
 
AA Section 7-1
AA Section 7-1AA Section 7-1
AA Section 7-1Jimbo Lamb
 
Simplification of switching functions
Simplification of switching functionsSimplification of switching functions
Simplification of switching functionssaanavi
 
Logisticregression
LogisticregressionLogisticregression
Logisticregressionsujimaa
 
2 2 addition and subtraction ii
2 2 addition and subtraction ii2 2 addition and subtraction ii
2 2 addition and subtraction iimath123b
 
UPSEE - Mathematics -2004 Unsolved Paper
UPSEE - Mathematics -2004 Unsolved PaperUPSEE - Mathematics -2004 Unsolved Paper
UPSEE - Mathematics -2004 Unsolved PaperVasista Vinuthan
 

La actualidad más candente (18)

2 6 complex fractions
2 6 complex fractions2 6 complex fractions
2 6 complex fractions
 
Int Math 2 Section 2-5 1011
Int Math 2 Section 2-5 1011Int Math 2 Section 2-5 1011
Int Math 2 Section 2-5 1011
 
F4 04 Mathematical Reasoning
F4 04 Mathematical  ReasoningF4 04 Mathematical  Reasoning
F4 04 Mathematical Reasoning
 
Systems word problems notes
Systems word problems notesSystems word problems notes
Systems word problems notes
 
cvpr2011: game theory in CVPR part 1
cvpr2011: game theory in CVPR part 1cvpr2011: game theory in CVPR part 1
cvpr2011: game theory in CVPR part 1
 
2 7 summary of usage of lcm
2 7 summary of usage of lcm2 7 summary of usage of lcm
2 7 summary of usage of lcm
 
Powerpoint
PowerpointPowerpoint
Powerpoint
 
Combinatorial Problems2
Combinatorial Problems2Combinatorial Problems2
Combinatorial Problems2
 
0.5 Rational Expressions
0.5 Rational Expressions0.5 Rational Expressions
0.5 Rational Expressions
 
Lesson 2.2 division
Lesson 2.2 divisionLesson 2.2 division
Lesson 2.2 division
 
AA Section 7-1
AA Section 7-1AA Section 7-1
AA Section 7-1
 
AMU - Mathematics - 1998
AMU - Mathematics  - 1998AMU - Mathematics  - 1998
AMU - Mathematics - 1998
 
Simplification of switching functions
Simplification of switching functionsSimplification of switching functions
Simplification of switching functions
 
Logisticregression
LogisticregressionLogisticregression
Logisticregression
 
2 2 addition and subtraction ii
2 2 addition and subtraction ii2 2 addition and subtraction ii
2 2 addition and subtraction ii
 
Introduction to root finding
Introduction to root findingIntroduction to root finding
Introduction to root finding
 
Gj3611551159
Gj3611551159Gj3611551159
Gj3611551159
 
UPSEE - Mathematics -2004 Unsolved Paper
UPSEE - Mathematics -2004 Unsolved PaperUPSEE - Mathematics -2004 Unsolved Paper
UPSEE - Mathematics -2004 Unsolved Paper
 

Similar a mingdraft2.doc

Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy Numbers
Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy NumbersSolving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy Numbers
Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy NumbersIJSRD
 
Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy Numbers
Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy NumbersSolving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy Numbers
Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy NumbersIJSRD
 
ATT00001ATT00002ATT00003ATT00004ATT00005CARD.docx
ATT00001ATT00002ATT00003ATT00004ATT00005CARD.docxATT00001ATT00002ATT00003ATT00004ATT00005CARD.docx
ATT00001ATT00002ATT00003ATT00004ATT00005CARD.docxikirkton
 
Optimization of Fuzzy Matrix Games of Order 4 X 3
Optimization of Fuzzy Matrix Games of Order 4 X 3Optimization of Fuzzy Matrix Games of Order 4 X 3
Optimization of Fuzzy Matrix Games of Order 4 X 3IJERA Editor
 
Ee693 sept2014midsem
Ee693 sept2014midsemEe693 sept2014midsem
Ee693 sept2014midsemGopi Saiteja
 
Adversarial search
Adversarial searchAdversarial search
Adversarial searchDheerendra k
 
Principal Components Analysis, Calculation and Visualization
Principal Components Analysis, Calculation and VisualizationPrincipal Components Analysis, Calculation and Visualization
Principal Components Analysis, Calculation and VisualizationMarjan Sterjev
 
Problem Set 4 Due in class on Tuesday July 28. Solutions.docx
Problem Set 4   Due in class on Tuesday July 28.  Solutions.docxProblem Set 4   Due in class on Tuesday July 28.  Solutions.docx
Problem Set 4 Due in class on Tuesday July 28. Solutions.docxwkyra78
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big DataChristian Robert
 
Nbhm m. a. and m.sc. scholarship test september 20, 2014 with answer key
Nbhm m. a. and m.sc. scholarship test september 20, 2014 with answer keyNbhm m. a. and m.sc. scholarship test september 20, 2014 with answer key
Nbhm m. a. and m.sc. scholarship test september 20, 2014 with answer keyMD Kutubuddin Sardar
 
Paper computer
Paper computerPaper computer
Paper computerbikram ...
 
Paper computer
Paper computerPaper computer
Paper computerbikram ...
 

Similar a mingdraft2.doc (20)

Computer Network Assignment Help.pptx
Computer Network Assignment Help.pptxComputer Network Assignment Help.pptx
Computer Network Assignment Help.pptx
 
Computer Network Assignment Help
Computer Network Assignment HelpComputer Network Assignment Help
Computer Network Assignment Help
 
Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy Numbers
Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy NumbersSolving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy Numbers
Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy Numbers
 
Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy Numbers
Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy NumbersSolving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy Numbers
Solving Fuzzy Matrix Games Defuzzificated by Trapezoidal Parabolic Fuzzy Numbers
 
ATT00001ATT00002ATT00003ATT00004ATT00005CARD.docx
ATT00001ATT00002ATT00003ATT00004ATT00005CARD.docxATT00001ATT00002ATT00003ATT00004ATT00005CARD.docx
ATT00001ATT00002ATT00003ATT00004ATT00005CARD.docx
 
Optimization of Fuzzy Matrix Games of Order 4 X 3
Optimization of Fuzzy Matrix Games of Order 4 X 3Optimization of Fuzzy Matrix Games of Order 4 X 3
Optimization of Fuzzy Matrix Games of Order 4 X 3
 
Ee693 sept2014midsem
Ee693 sept2014midsemEe693 sept2014midsem
Ee693 sept2014midsem
 
Adversarial search
Adversarial searchAdversarial search
Adversarial search
 
Equivariance
EquivarianceEquivariance
Equivariance
 
Principal Components Analysis, Calculation and Visualization
Principal Components Analysis, Calculation and VisualizationPrincipal Components Analysis, Calculation and Visualization
Principal Components Analysis, Calculation and Visualization
 
presentation
presentationpresentation
presentation
 
Problem Set 4 Due in class on Tuesday July 28. Solutions.docx
Problem Set 4   Due in class on Tuesday July 28.  Solutions.docxProblem Set 4   Due in class on Tuesday July 28.  Solutions.docx
Problem Set 4 Due in class on Tuesday July 28. Solutions.docx
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big Data
 
Nbhm m. a. and m.sc. scholarship test september 20, 2014 with answer key
Nbhm m. a. and m.sc. scholarship test september 20, 2014 with answer keyNbhm m. a. and m.sc. scholarship test september 20, 2014 with answer key
Nbhm m. a. and m.sc. scholarship test september 20, 2014 with answer key
 
Engg math's practice question by satya prakash 1
Engg math's practice question by satya prakash 1Engg math's practice question by satya prakash 1
Engg math's practice question by satya prakash 1
 
Student t t est
Student t t estStudent t t est
Student t t est
 
Quantum games
Quantum gamesQuantum games
Quantum games
 
Paper computer
Paper computerPaper computer
Paper computer
 
Paper computer
Paper computerPaper computer
Paper computer
 
Goprez sg
Goprez  sgGoprez  sg
Goprez sg
 

Más de butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Más de butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

mingdraft2.doc

  • 1. 3. Reinforcement learning 3.1 The Q-learning algorithm If we move from the strategy regime of full rationality to regime of exploring rationality where players are willing to take some risks to explore their opponents and the game structure, we will have more than one way to model the players. Here we are attracted by the class of reinforcement learning which seems to describe human behavior better (see Roth and Erev 1995). It is inspired by learning theory in psychology which says the likelihood to choose one action is strengthened if this action leads to favorable outcome. This characteristic is observed to be quite robust in learning behavior of human and animals. At the same time, reinforcement learning has been widely used in machine learning community to solve individual decision problem (see Sutton and Barto 1998). But it has rarely been treated as in a strategic environment (like in repeated games) where the learning behaviors of others have an important impact on that of each other and the outcome. So it is also interesting to see how reinforcement learning performs in such strategic environment from machine learning’s perspective. In the following part, we will describe the specific reinforcement learning algorithm we used for our experiment-Q-learning. The more detailed information can be found at Watkins 1989, Watkins and Dayan 1992 and Sutton and Barto 1998. Q-learning algorithm works by estimating the values of state-action pairs. The value Q(s,a) is defined to be the expected discounted sum of future payoffs obtained by taking action a from state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. The standard procedure for Q-learning is as follows. Assume that Q(s,a) is represented by a lookup table containing a value for every possible state-action pair, and that the table entries are initialized to arbitrary values. Then the procedure for estimating correct Q(s,a) is to repeat the following loop until termination criterion is met: 1. In current state s choose an action a, this will cause a receipt of an immediate reward r, and arrival at a next state s'. 2. Update Q(s, a) according to the following equation: ∆Q( s, a ) = α [r + γ max Q( s' , b) − Q( s, a )] b (1) where αis the learning rate parameter. In the context of repeated games, the player is exploring the environment (its opponent and the game structure) by taking some risk to choose the action that might not be the current optimal one in step 1. In step 2 the action that leads to higher reward will strengthen the Q value for that state- action pair. The above procedure is guaranteed to converge to the correct Q values for stationary MDPs. In practice the exploration strategy in step 1 is usually chosen so that it will ensure sufficient exploration while still favoring actions with higher value estimates in given state. A variety of Draft.doc 4/26/2010 1/14
chosen in a given state. The most common form for the probability of choosing action a is

        P(a) = e^(Qt(a)/τ) / Σ_{b=1}^{n} e^(Qt(b)/τ)                                   (2)

where τ is a positive temperature parameter that decreases over time. In the limit as τ → 0, Softmax action selection becomes greedy action selection. In our experiments we tried both ε-greedy and Softmax action selection.

3.2 Implementation of Q-learning for 2 by 2 games

Q-learning does not need a model of its environment and can be used on-line. It is therefore well suited to repeated games against an unknown opponent (especially one with the same kind of adaptive behavior). Here we focus on repeated 2 by 2 games. Considering the iterated prisoner's dilemma, where "tit for tat" has often been discussed, it is natural to represent the state as the outcome of the previous play of the game. We say in this case that the player has a memory length of one. A 2 by 2 game then has 4 states, and in each state there are two actions (the pure strategies) from which the player can choose in the current play. We also conducted experiments in which the players have a memory length of two (so the number of states is 16). The immediate reward a player receives is its payoff in the payoff matrix. For the Softmax action selection method, we let the temperature τ decrease as

        τ = T · θ^n                                                                    (3)

where T is a constant, n is the number of games played so far, and θ, called the annealing factor, is a positive constant less than one. In the implementation, once n is large enough, τ is close to zero and the player stops exploring. From that point on we switch to ε-greedy selection to keep the player exploring. (An illustrative code sketch of this player model is given at the end of Section 4.1.)

4. Experiments

4.1 The motivation

Repeated 2 by 2 games are the simplest settings for strategic interaction and are therefore a good starting point for investigating how different the outcome can be under exploring rationality compared with full rationality. Take the iterated prisoner's dilemma as an example: if a player takes the risk of cooperating in some round, hoping to induce later cooperation from an opponent who may reason the same way, the two players may discover that they can both earn more through mutual cooperation. Even if exploration loses a little in the early stage, the prospect of sustained mutual cooperation later is a good reason to explore early on.
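Before turning to the specific games, the following minimal sketch in Python illustrates the player model just described: a lookup-table Q-learner over the four previous-outcome states, with ε-greedy and Softmax action selection and the annealing schedule of equation (3). It is an illustration rather than a definitive implementation; the class and parameter names (QPlayer, temp_T, anneal, min_temp), the zero initialization of the table, and the numerically stabilized form of equation (2) are choices made here for concreteness.

    import math
    import random

    class QPlayer:
        """Q-learning player for a repeated 2x2 game with memory length one.
        A state is the joint outcome of the previous play, e.g. ('C', 'D');
        the actions are the two pure strategies 'C' and 'D'."""

        ACTIONS = ('C', 'D')

        def __init__(self, alpha=0.2, gamma=0.95, epsilon=0.01,
                     temp_T=5.0, anneal=0.9999, min_temp=0.01):
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
            self.temp_T, self.anneal, self.min_temp = temp_T, anneal, min_temp
            self.n = 0          # number of games played so far (used for annealing)
            self.Q = {}         # lookup table: (state, action) -> estimated value

        def _q(self, state, action):
            return self.Q.get((state, action), 0.0)   # arbitrary (zero) initial values

        def choose(self, state, method="softmax"):
            """Pick an action by Softmax (equation (2)) while the temperature is
            above min_temp, then fall back to epsilon-greedy selection."""
            tau = self.temp_T * (self.anneal ** self.n)       # equation (3)
            if method == "softmax" and tau > self.min_temp:
                qs = [self._q(state, a) / tau for a in self.ACTIONS]
                m = max(qs)                                   # subtract max for numerical stability
                weights = [math.exp(q - m) for q in qs]
                r = random.uniform(0.0, sum(weights))
                return self.ACTIONS[0] if r <= weights[0] else self.ACTIONS[1]
            # epsilon-greedy: mostly greedy, explore with small probability epsilon
            if random.random() < self.epsilon:
                return random.choice(self.ACTIONS)
            return max(self.ACTIONS, key=lambda a: self._q(state, a))

        def update(self, state, action, reward, next_state):
            """Standard one-step Q-learning update for one observed play."""
            best_next = max(self._q(next_state, b) for b in self.ACTIONS)
            delta = reward + self.gamma * best_next - self._q(state, action)
            self.Q[(state, action)] = self._q(state, action) + self.alpha * delta
            self.n += 1

In the experiments two such players are matched against each other; each one observes only the previous joint outcome and receives its own payoff as the reward.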
Motivated by this intuition, we deliberately selected 8 games and parameterized their payoff matrices. In each repeated game, both players are modeled as Q-learners. For 5 of the games, the Pareto optimal solution does not coincide with the Nash equilibrium. The remaining 3 are games with two Nash equilibria, which we included to address the equilibrium-selection issue.

4.2 The games and the parameterization

The following is the list of the games and how we parameterized their payoff matrices. The parameter is δ. In each payoff matrix, the first number is the payoff of the row player and the second is the payoff of the column player. We mark Nash equilibria with # and the Pareto optimal solution with *. C and D are the actions, or pure strategies, that the players can take. The row player always comes first: when we say the outcome of one play is CD, we mean that the row player chose pure strategy C and the column player chose pure strategy D. So there are four possible outcomes of one play: CC, CD, DC, and DD.

The first two games are versions of the prisoner's dilemma. The value of δ ranges from 0 to 3. When δ = 2 in Table 4.1, the matrix corresponds to the most common payoff matrix for the prisoner's dilemma.

             C               D
    C    (3, 3)*         (0, 3+δ)
    D    (3+δ, 0)        (3-δ, 3-δ)#
    Table 4.1: Prisoner's dilemma Pattern 1

             C               D
    C    (3, 3)*         (0, 3+δ)
    D    (3+δ, 0)        (δ, δ)#
    Table 4.2: Prisoner's dilemma Pattern 2

While the above two are symmetric games, the following three are asymmetric games adapted from Rapoport, Guyer, and Gordon (1976). They do not use the same payoff numbers as the original games of Rapoport and Guyer. The value of δ ranges from 0 to 3.

             C                    D
    C    (0.2, 0.3)#         (0.3+δ, 0.1)
    D    (0.1, 0.2)          (0.2+δ, 0.3+δ)*
    Table 4.3: Game #47

             C                    D
    C    (0.2, 0.2)#         (0.3+δ, 0.1)
    D    (0.1, 0.3)          (0.2+δ, 0.3+δ)*
    Table 4.4: Game #48
             C                    D
    C    (0.2, 0.3)#         (0.3+δ, 0.2)
    D    (0.1, 0.1)          (0.2+δ, 0.3+δ)*
    Table 4.5: Game #57

For the games with two Nash equilibria, one challenging question is which equilibrium is more likely to be selected as the outcome. We chose three games from this class. The game of Stag Hunt has a Pareto optimal solution that is also one of its Nash equilibria. The game of Chicken and the game of Battle of Sexes are coordination games. The value of δ ranges from 0 to 3 for Stag Hunt and Battle of Sexes; for Chicken, δ ranges from 0 to 2. Note that for the last two coordination games no single outcome is marked as the Pareto optimal solution.

             C               D
    C    (5, 5)*         (0, 3)
    D    (3, 0)          (δ, δ)
    Table 4.6: Stag Hunt

             C               D
    C    (δ, 3-δ)#       (0, 0)
    D    (0, 0)          (3-δ, δ)#
    Table 4.7: Battle of Sexes

             C               D
    C    (2, 2)          (δ, 2+δ)#
    D    (2+δ, δ)#       (0, 0)
    Table 4.8: Chicken

4.3 The setting for the experiments

The parameters for Q-learning are set as follows. The learning rate is set to 0.2 and the discount factor to 0.95. We ran the experiments with both Softmax and ε-greedy action selection. For Softmax action selection, T is set to 5 and the annealing factor to 0.9999; when τ drops below 0.01, we switch to ε-greedy selection. We set ε to 0.01. These parameter values were chosen after those used in other studies (Sandholm and Crites 1995).

Each repeated game has 200,000 iterations, so the players are given enough time to explore and learn. For each setting of the payoff parameter δ, we ran the repeated game for 100 trials. We recorded the frequencies of the four outcomes (CC, CD, DC and DD) every 100 iterations. The numbers usually become stable within 50,000 iterations, so unless noted otherwise we report the frequencies of the outcomes in the last 100 iterations, summed over the 100 trials.

The tables in the appendix share a similar layout. The middle column is the payoff parameter δ. To its left are the results for ε-greedy action selection; the results for Softmax action selection are on the right. Again, the numbers are the frequencies of the four outcomes (CC, CD, DC and DD) in the last 100 iterations over 100 runs.
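To make the setting of Sections 4.2 and 4.3 concrete, the sketch below shows how one trial of a repeated game could be run and how the outcome frequencies reported in the appendix could be collected. It assumes the QPlayer class from the sketch in Section 4.1 is in scope and is again only illustrative: the helper names (pd_pattern1, run_trial, run_setting), the decision to give both players the same encoding of the previous joint outcome, and the arbitrary starting state are assumptions made for concreteness, and only one of the eight payoff tables (Table 4.1) is written out.

    from collections import Counter

    def pd_pattern1(delta):
        # (row payoff, column payoff) for each joint outcome, as in Table 4.1.
        return {('C', 'C'): (3, 3),
                ('C', 'D'): (0, 3 + delta),
                ('D', 'C'): (3 + delta, 0),
                ('D', 'D'): (3 - delta, 3 - delta)}

    def run_trial(delta, iterations=200000, tail=100, method="softmax"):
        """Play one repeated game; count the four outcomes over the last `tail` plays."""
        payoff = pd_pattern1(delta)
        row, col = QPlayer(), QPlayer()     # defaults: alpha=0.2, gamma=0.95, T=5, ...
        state = ('C', 'C')                  # arbitrary starting "previous outcome"
        counts = Counter()
        for t in range(iterations):
            a = row.choose(state, method)
            b = col.choose(state, method)
            r_row, r_col = payoff[(a, b)]
            row.update(state, a, r_row, (a, b))
            col.update(state, b, r_col, (a, b))
            if t >= iterations - tail:
                counts[a + b] += 1          # keys 'CC', 'CD', 'DC', 'DD'
            state = (a, b)
        return counts

    def run_setting(delta, trials=100, method="softmax"):
        # Sum the last-100-iteration counts over 100 independent trials, so each
        # table cell in the appendix corresponds to roughly 10,000 observations.
        total = Counter()
        for _ in range(trials):
            total += run_trial(delta, method=method)
        return total

Running run_setting over a grid of δ values for each game and each action selection method yields counts in the same format as the appendix tables.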
4.4 The results

It is disturbing when classical game theory tells us that only an inferior Nash equilibrium will be the outcome, rather than the Pareto optimal solution (which is not necessarily a Nash equilibrium). In our first five games, the subgame perfect Nash equilibrium is never the Pareto outcome. So the question is: will the outcome be different if we use models, such as reinforcement learning, that better describe human behavior? How reasonable is that outcome? More specifically, can the players learn to play a Pareto optimal solution that is not a Nash equilibrium? Our experiments show that the answer is positive.

Take a look at Table 1 for Prisoner's dilemma Pattern 1. If δ is close to zero, the two players choose to defect most of the time. The reason is that there is not much difference between the outcomes of mutual defection and mutual cooperation; the Pareto outcome does not provide enough incentive for the players to take the risk of inducing cooperation. To avoid being exploited and receiving a zero payoff, they are better off defecting all the time. But as δ gets larger, more mutual cooperation is observed, which suggests that both players are trying to settle on the Pareto optimal outcome. The last row of Table 1 shows an interesting scenario: each player wants to induce the other's cooperation so that it can take advantage of it by defecting, because the temptation to defect is very large when the other player is cooperating. That is why we see many CD and DC outcomes but less mutual cooperation (CC).

The comparison with Prisoner's dilemma Pattern 2 in Table 2 is illustrative. In Pattern 2 the players lose almost nothing by trying to cooperate when δ is close to zero. Exploration helps the players reach the far superior Pareto outcome (CC), and as we can see from Table 2, mutual cooperation happens 94% of the time. Consider instead the scenario where δ is close to 3: first, there is not much incentive to shift from the Nash equilibrium (DD) to the Pareto outcome (CC), since the payoffs differ little; second, the danger of being exploited by the other player and receiving a zero payoff is much higher. In the end the players learn to defect most of the time (98%).

Now let us turn to a different class of games. Game #47, game #48 and game #57 are asymmetric games, and they share a common feature: the row player has a dominant strategy C, because this strategy always gives a higher payoff than strategy D no matter what the other player does. Thus a fully rational player would never choose D. What happens if the players are allowed to explore and learn? Tables 3-5 tell us that it depends on the payoffs. If δ is close to zero, the outcome is the Nash equilibrium (CC) roughly 97-98% of the time, since it does not pay to try to induce the other player toward the Pareto optimal solution, and a player who tries is more likely to be exploited. But as long as the incentive from the Pareto outcome is large enough, the Pareto outcome (DD) is observed a considerable proportion of the time (above 94%).

The Stag Hunt problem is interesting because its Pareto optimal solution is also one of its Nash equilibria, but which equilibrium is more likely to be sustained remains a challenging problem for classical game theory. A mixed strategy (i.e., choosing each pure strategy with some fixed probability) seems a natural prediction of classical game theory for this repeated game. Table 6 shows that the outcomes of this repeated game, when both players use the reinforcement learning model, are quite different from the prediction of the mixed strategy.
Say, for example, when δ equals 1: the mixed-strategy equilibrium has both players choosing action C with probability 1/3 and D with probability 2/3, so we should expect to see CC only about 11% of the time (1/3 × 1/3), and certainly less than a third of the time. Table 6 instead shows CC happening 88% of the time. We can also see that when the Pareto outcome is far superior to the other Nash equilibrium, it is chosen almost 94% of the time.

The remaining two games are coordination games, and we are concerned not only with which Nash equilibrium is selected but also with a further question: is the Nash equilibrium concept sufficient to describe what happens in these games? The latter concern arises because different behavior has been observed in human experiments. Rapoport et al. (1976) reported that a majority of subjects playing the game of Chicken quickly settled into an alternating strategy, with the outcome changing back and forth between the two Nash equilibria.

From Table 7 we can see that the two Nash equilibria in Battle of Sexes are equally likely to be the outcome in most cases, since the game is symmetric and these two outcomes are superior to the other two, which give both players zero payoff. As for the game of Chicken, Table 8 shows that if the incentive for coordinating on a Nash equilibrium is too small (i.e., δ is close to zero), the players learn to be conservative at the same time (CC), since they cannot afford the loss in the DD situation (a zero payoff). As δ increases, the game ends up more and more often at a Nash equilibrium (CD or DC).

To see whether the players can learn the alternating strategy observed in human-subject experiments, we conducted another 100 trials for these two games with δ set to 1 and Softmax action selection. In most trials the outcome converges to one of the Nash equilibria, but we did observe patterns showing alternating strategies in both games. These patterns are quite stable and recover quickly from small random disturbances. For Battle of Sexes there is only one such pattern, in which the players play the two Nash equilibria alternately; it occurred in 11 of the 100 trials. For the game of Chicken there are other kinds of patterns as well, summarized with their frequencies in Table 4.9.

    The outcomes                                    Frequency in 100 trials
    Alternating between CD and DC                   10
    Cycle through CD-DC-CC or CD-CC-DC              13
    Converge to one of the three: CC, CD or DC      76
    No obvious patterns                             1
    Table 4.9: Frequencies of different kinds of outcome in the game of Chicken

The frequencies of these patterns cannot be called large, but, first, we use payoff numbers different from those of Rapoport et al. (1976), which may affect the incentive to form such a strategy; second, our players do not explicitly know the payoff matrix and can only learn the payoff structure of the opponent implicitly through the opponent's behavior (not an easy task); and finally, there may be features of human behavior that are not captured in our current Q-learning model but that are important for human subjects to learn such an alternating strategy. Our main point, however, is clear: the Nash equilibrium concept seems insufficient for
describing the outcomes of repeated coordination games such as Chicken and Battle of Sexes.

*Note: To save space, there are additional results that we do not discuss here but add to the appendix for completeness.
1. We repeated all the experiments with the memory length of the players set to 2. The results are shown in Tables 9-16.
2. We set ε in ε-greedy for the row player to 0.03 (the column player's ε remains at 0.01) and repeated the experiment with ε-greedy action selection and memory length 1 on game #47. The results are summarized in Table 17.
3. We set ε in ε-greedy for the row player to 0.03 (the column player's ε remains at 0.01) and repeated the experiments with ε-greedy action selection on the game of Chicken and Battle of Sexes. The frequencies of the patterns are reported in Tables 18-21.

5. Discussion

Appendix

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    3       87      82      9828      0.05      0       106     101     9793
    0       92      105     9803      0.5       0       90      94      9816
    52      110     111     9727      1         1       111     111     9777
    51      110     93      9746      1.25      2475    338     358     6829
    1136    160     198     8506      1.5       3119    526     483     5872
    1776    245     381     7598      1.75      4252    653     666     4429
    3526    547     413     5514      2         789     883     869     7549
    848     766     779     7607      2.5       496     2276    2368    4860
    544     2313    2306    4837      2.95      539     2821    2112    4528
    Table 1: Prisoner's dilemma Pattern 1

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    9422    218     183     177       0.05      9334    302     285     79
    9036    399     388     150       0.5       9346    294     220     140
    5691    738     678     2693      1         7537    954     1267    242
    3506    179     275     6040      1.25      8203    542     994     261
    1181    184     116     8519      1.5       7818    767     775     640
    2       98      103     9797      1.75      4685    270     422     4623
    97      114     91      9698      2         1820    217     220     7743
    0       100     92      9808      2.5       0       77      117     9806
    2       96      94      9808      2.95      0       90      114     9796
    Table 2: Prisoner's dilemma Pattern 2

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    9790    101     101     8         0         9808    94      98      0
    4147    137     156     5560      0.1       9812    94      93      1
    3019    123     165     6693      0.15      9799    95      104     2
    2188    141     132     7539      0.2       8934    85      109     872
    185     355     130     9330      0.5       730     284     208     8778
    131     309     135     9425      1         120     532     138     9210
    138     288     99      9475      1.5       77      471     103     9349
    99      321     131     9449      2         88      441     126     9345
    126     172     88      9614      3         64      366     92      9478
    Table 3: Game #47

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    9789    102     107     2         0         9787    106     105     2
    3173    515     173     6139      0.1       9811    86      101     2
    2832    457     207     6504      0.15      8127    256     137     1480
    1227    348     141     8284      0.2       2986    755     230     6029
    109     627     143     9121      0.5       143     631     146     9080
    90      492     139     9279      1         79      1320    126     8475
    88      318     134     9460      1.5       117     1076    128     8679
    241     236     119     9404      2         62      473     126     9339
    76      284     139     9501      3         64      277     128     9531
    Table 4: Game #48

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    9767    119     107     7         0         9764    131     105     0
    1684    587     175     7554      0.1       9794    106     98      2
    531     518     191     8760      0.15      9550    105     105     240
    238     543     159     9060      0.2       1048    497     257     8198
    126     307     121     9446      0.5       224     852     152     8772
    118     520     114     9248      1         113     753     119     9015
    104     526     125     9245      1.5       74      538     117     9271
    66      225     102     9607      2         57      569     123     9251
    123     296     116     9465      3         61      302     125     9512
    Table 5: Game #57

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    9390    126     122     362       0         9715    108     109     68
    9546    91      108     255       0.5       9681    120     121     78
    9211    112     125     552       0.75      9669    111     101     119
    8864    119     110     907       1         9666    98      102     134
    8634    115     132     1119      1.25      9598    139     134     129
    7914    122     130     1834      1.5       9465    99      109     327
    7822    122     104     1952      2         9452    126     126     296
    5936    87      101     3876      2.5       8592    116     89      1203
    5266    121     106     4507      3         3524    111     115     6250
    Table 6: Stag Hunt

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    2641    63      4571    2725      0         2872    73      4477    2578
    3842    135     1626    4397      0.1       4615    101     1732    3552
    5140    102     90      4668      0.5       4772    102     162     4964
    4828    107     94      4971      1         4862    88      89      4961
    4122    101     109     5668      1.5       4642    85      102     5171
    4983    100     97      4820      2         4623    97      87      5193
    3814    111     96      5979      2.5       5139    102     99      4660
    4015    1388    107     4490      2.9       4303    1794    118     3785
    2653    4921    70      2356      3         2593    4776    58      2573
    Table 7: Battle of Sexes

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    9276    227     347     150       0         9509    165     222     104
    9587    143     135     135       0.25      9119    428     320     133
    9346    209     223     222       0.5       9375    220     225     180
    6485    1491    1858    166       0.75      8759    424     632     185
    1663    3532    4706    99        1         1339    4903    3662    96
    385     4161    5342    112       1.25      158     5416    4323    103
    113     4488    5274    125       1.5       115     4700    5099    86
    111     4301    5504    84        1.75      100     4704    5083    113
    100     4853    4953    94        2         94      4772    5044    90
    Table 8: Chicken

    * Memory length setting is 1 for Tables 1-8.

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    253     122     129     9496      0.05      2       85      106     9807
    860     137     133     8870      0.5       0       98      103     9799
    300     124     112     9464      1         4       95      100     9801
    227     88      121     9564      1.25      15      160     154     9671
    1615    316     325     7744      1.5       309     304     365     9022
    2900    1112    1085    4903      1.75      590     682     746     7982
    2748    1681    1652    3919      2         281     1522    1476    6721
    1919    2927    2988    2166      2.5       389     4235    4111    1265
    905     4384    4199    512       2.95      578     4170    4018    1234
    Table 9: Prisoner's dilemma Pattern 1

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ       δ       CC      CD      DC      DD
    9453    267     258     22        0.05    0.05    9441    277     206
    9230    312     314     144       0.5     0.25    9379    260     296
    7591    674     633     1102      1       0.5     8940    251     245
    4360    483     542     4615      1.25    0.75    8100    444     376
    1297    456     443     7804      1.5     1       4311    867     813
    2       112     100     9786      1.75    1.25    816     416     406
    3       81      99      9817      2       1.5     3       129     117
    4       87      101     9808      2.5     2       0       100     95
    0       108     91      9801      2.95    2.5     1       93      122
    Table 10: Prisoner's dilemma Pattern 2

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    9772    110     115     3         0         9784    93      123     0
    8812    275     163     750       0.1       9810    102     88      0
    5200    561     227     4012      0.15      9812    97      88      3
    3704    817     210     5269      0.2       9702    111     89      98
    290     558     185     8967      0.5       1458    656     224     7662
    105     680     159     9056      1         99      646     216     9039
    99      404     154     9343      1.5       55      270     182     9493
    71      463     124     9342      2         16      224     147     9613
    77      271     135     9517      3         34      245     184     9537
    Table 11: Game #47

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    9766    126     108     0         0         9803    100     95      2
    7772    179     153     1896      0.1       9821    90      87      2
    3775    802     193     5230      0.15      9598    104     104     194
    2848    1011    203     5938      0.2       7877    308     204     1611
    491     539     202     8768      0.5       1083    947     199     7771
    113     758     170     8959      1         93      791     158     8958
    114     677     187     9022      1.5       58      567     150     9225
    66      477     131     9326      2         33      286     161     9520
    66      411     168     9355      3         43      236     152     9569
    Table 12: Game #48

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    8757    715     111     417       0         9765    146     85      4
    3084    2534    124     4258      0.1       9634    211     88      67
    1553    2235    241     5971      0.15      4634    2717    176     2476
    491     1408    192     7909      0.2       2377    2280    207     5136
    131     623     170     9076      0.5       152     1331    196     8321
    106     1005    201     8688      1         130     791     127     8952
    51      582     131     9236      1.5       43      396     135     9426
    60      490     151     9299      2         47      471     161     9321
    83      570     192     9155      3         62      392     189     9357
    Table 13: Game #57
    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    9557    122     109     212       0         9742    138     109     11
    9052    161     134     653       0.5       9742    123     128     7
    8703    151     135     1011      0.75      9613    177     185     25
    8493    190     148     1169      1         9670    120     174     36
    8197    193     147     1463      1.25      9458    140     110     292
    7745    133     122     2000      1.5       9396    166     127     311
    5900    148     129     3823      2         8119    158     122     1601
    4111    114     142     5633      2.5       4015    156     163     5666
    2772    102     126     7000      3         1462    107     115     8316
    Table 14: Stag Hunt

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    4124    125     1728    4023      0         3929    93      2132    3846
    4835    141     746     4278      0.1       4556    97      580     4767
    4092    94      117     5697      0.5       6178    92      97      3633
    5788    89      95      4028      1         4592    79      99      5230
    5308    94      108     4490      1.5       5012    107     110     4771
    4620    138     105     5137      2         4619    121     94      5166
    5119    146     93      4642      2.5       4357    121     93      5429
    4797    679     144     4308      2.9       4250    1286    143     4321
    3797    2112    125     3966      3         3727    2157    106     4010
    Table 15: Battle of Sexes

    ε-greedy action selection                    Softmax action selection
    CC      CD      DC      DD        δ         CC      CD      DC      DD
    9342    297     280     81        0         9538    206     215
    9205    316     405     74        0.25      9465    256     236
    8529    727     684     60        0.5       9225    372     332
    7300    1178    1413    109       0.75      8012    871     967
    1401    3819    4666    114       1         1919    4323    3659
    223     4728    4937    112       1.25      255     4957    4661
    162     4073    5667    98        1.5       171     4825    4898
    188     5105    4574    133       1.75      130     4986    4787
    149     4693    5064    94        2         107     3816    5971
    Table 16: Chicken

    * Memory length setting is 2 for Tables 9-16.

    δ       CC      CD      DC      DD
    0       9594    106     298     2
    0.1     7295    162     341     2202
    0.15    3193    318     340     6149
    0.2     1934    530     325     7211
    0.5     314     365     177     9144
    1       254     550     149     9047
    1.5     253     529     184     9034
    2       206     379     151     9264
    3       246     471     177     9106
    Table 17: Game #47 with ε-greedy action selection and memory length 1

    The outcomes                                    Frequency in 100 trials
    Alternating between CC and DD                   8
    Converge to CC or DD                            92
    Table 18: Battle of Sexes with ε-greedy action selection and memory length 1

    The outcomes                                    Frequency in 100 trials
    Alternating between CD and DC                   9
    Converge to one of the three: CC, CD or DC      70
    Other patterns                                  11
    No obvious patterns                             10
    Table 19: Chicken with ε-greedy action selection and memory length 1

    The outcomes                                    Frequency in 100 trials
    Alternating between CC and DD                   29
    Converge to CC or DD                            71
    Table 20: Battle of Sexes with ε-greedy action selection and memory length 2

    The outcomes                                    Frequency in 100 trials
    Alternating between CD and DC                   26
    Converge to one of the three: CC, CD or DC      46
    Other patterns                                  18
    No obvious patterns                             10
    Table 21: Chicken with ε-greedy action selection and memory length 2

References

1. Rapoport, Anatol; Guyer, Melvin J. and Gordon, David G. (1976). "The 2x2 Game." Ann Arbor, MI: University of Michigan Press.
2. Roth, Alvin E. and Erev, Ido (1995). "Learning in Extensive-Form Games: Experimental Data and Simple Dynamic Models in the Intermediate Term." Games and Economic Behavior 8, 164-212.
3. Sandholm, Thomas W. and Crites, Robert H. (1995). "Multi-agent Reinforcement Learning in the Iterated Prisoner's Dilemma." Biosystems 37, 147-166.
4. Sutton, R. and Barto, A. (1998). "Reinforcement Learning: An Introduction." MIT Press.
5. Watkins, C. J. C. H. and Dayan, P. (1992). "Q-learning." Machine Learning 8, 279-292.
6. Watkins, C. J. C. H. (1989). "Learning from Delayed Rewards." PhD thesis, King's College, Cambridge.