3. Reinforcement learning
3.1 The Q-learning algorithm
If we move from the regime of full rationality to the regime of exploring rationality, where
players are willing to take some risks to explore their opponents and the game structure, there is
more than one way to model the players. Here we are drawn to the class of reinforcement
learning models, which seem to describe human behavior well (see Roth and Erev 1995).
Reinforcement learning is inspired by learning theory in psychology, which holds that the
likelihood of choosing an action is strengthened if that action leads to a favorable outcome. This
characteristic has been observed to be quite robust in the learning behavior of humans and
animals. At the same time, reinforcement learning has been widely used in the machine learning
community to solve individual decision problems (see Sutton and Barto 1998), but it has rarely
been studied in a strategic environment (such as repeated games), where the learning behavior of
each player has an important impact on that of the others and on the outcome. It is therefore also
interesting, from a machine learning perspective, to see how reinforcement learning performs in
such a strategic environment. In the following, we describe the specific reinforcement learning
algorithm used in our experiments: Q-learning. More detailed information can be found in
Watkins (1989), Watkins and Dayan (1992), and Sutton and Barto (1998).
The Q-learning algorithm works by estimating the values of state-action pairs. The value Q(s,a)
is defined as the expected discounted sum of future payoffs obtained by taking action a from
state s and following an optimal policy thereafter. Once these values have been learned, the
optimal action from any state is the one with the highest Q-value. The standard procedure for
Q-learning is as follows. Assume that Q(s,a) is represented by a lookup table containing a value
for every possible state-action pair, and that the table entries are initialized to arbitrary values.
The procedure for estimating the correct Q(s,a) is then to repeat the following loop until a
termination criterion is met:
1. In the current state s, choose an action a; this yields an immediate reward r and a transition
to the next state s'.
2. Update Q(s,a) according to the following equation:

   ΔQ(s,a) = α [r + γ max_b Q(s',b) − Q(s,a)]                                  (1)

where α is the learning rate parameter and γ is the discount factor.
In the context of repeated games, the player explores the environment (its opponent and the
game structure) by taking some risk in step 1 to choose an action that might not be the currently
optimal one. In step 2, an action that leads to a higher reward strengthens the Q value for that
state-action pair. The above procedure is guaranteed to converge to the correct Q values for
stationary MDPs.
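As a concrete illustration, the update rule in equation (1) can be sketched in a few lines of Python; the table layout and names below are our own, not code from the original study.

```python
def q_update(Q, s, a, r, s_next, alpha=0.2, gamma=0.95):
    """Apply equation (1): Q(s,a) += alpha * (r + gamma * max_b Q(s',b) - Q(s,a))."""
    best_next = max(Q[s_next].values())  # max over actions b of Q(s', b)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]

# A tiny lookup table: two states, two actions, all entries initialized to zero.
Q = {s: {a: 0.0 for a in ("C", "D")} for s in ("s0", "s1")}
q_update(Q, "s0", "C", 3.0, "s1")  # reward 3 for choosing C in s0, landing in s1
```

With all entries initialized to zero, this single step moves Q(s0, C) to 0.2 * 3 = 0.6.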
In practice, the exploration strategy in step 1 is usually chosen to ensure sufficient exploration
while still favoring actions with higher value estimates in the given state. A variety of
Draft.doc 4/26/2010 1/14
methods may be used. A simple one is to behave greedily most of the time but, with a small
probability ε, to choose an action at random from those that do not have the highest Q value.
This action selection method is called ε-greedy in Sutton and Barto 1998. There is another one
called the Softmax action selection method, in which an action with a higher value is more
likely to be chosen in a given state. The most common form for the probability of choosing
action a is

   e^(Q_t(a)/τ) / Σ_{b=1}^{n} e^(Q_t(b)/τ)                                     (2)

where τ is a positive parameter that decreases over time. In the limit as τ → 0, Softmax action
selection becomes greedy action selection. In our experiments we tried both ε-greedy and
Softmax action selection.
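The two selection rules can be sketched as follows; this is a minimal illustration of ε-greedy and of equation (2), with function and parameter names of our own choosing.

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.01):
    """With probability epsilon, pick at random among the non-greedy actions;
    otherwise take the action with the highest Q value."""
    greedy = max(q_values, key=q_values.get)
    others = [a for a in q_values if a != greedy]
    if others and random.random() < epsilon:
        return random.choice(others)
    return greedy

def softmax_select(q_values, tau):
    """Equation (2): choose a with probability e^(Q(a)/tau) / sum_b e^(Q(b)/tau)."""
    actions = list(q_values)
    weights = [math.exp(q_values[a] / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

As τ shrinks, the Softmax weights concentrate on the highest-valued action, which is the greedy limit mentioned above.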
3.2 Implementation of Q-learning for 2 by 2 games
Q-learning does not need a model of its environment and can be used online. It is therefore
well suited to repeated games against an unknown opponent (especially one with the same
adaptive behavior). Here we focus on repeated 2 by 2 games. Considering the iterated
prisoner's dilemma, where "tit for tat" has often been discussed, it is natural to represent the
state as the outcome of the previous game played. We say in this case that the player has a
memory length of one. The number of states for a 2 by 2 game is then 4, and in each state there
are two actions (the pure strategies) from which the player can choose for the current game. We
also conducted experiments for the case in which players have a memory length of two (so the
number of states is 16). The immediate reward a player receives is its payoff in the payoff
matrix.
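The state encoding just described (the state is the joint outcome of the last one or two games) can be sketched as:

```python
from itertools import product

ACTIONS = ("C", "D")

def state_space(memory_length):
    """Enumerate all states: each state is the sequence of the last
    `memory_length` joint outcomes (CC, CD, DC or DD)."""
    outcomes = [a + b for a, b in product(ACTIONS, ACTIONS)]  # CC, CD, DC, DD
    return ["".join(seq) for seq in product(outcomes, repeat=memory_length)]
```

With memory length one this gives the 4 states CC, CD, DC and DD; with memory length two it gives the 16 states used in the second set of experiments.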
For the Softmax action selection method, we let the parameter τ decrease as follows:

   τ = T · θ^n                                                                 (3)

where T is a constant, n is the number of games played so far, and θ, called the annealing
factor, is a positive constant less than one. In the implementation, when n gets large enough, τ
is close to zero and the player stops exploring. We switch to ε-greedy action selection after that
point to keep the player exploring.
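The annealing schedule in equation (3) and the switch to ε-greedy can be sketched as follows; the default values match those reported in Section 4.3, and the function names are ours:

```python
def temperature(n, T=5.0, theta=0.9999):
    """Equation (3): tau = T * theta**n, decreasing toward zero as n grows."""
    return T * theta ** n

def use_softmax(n, floor=0.01):
    """Use Softmax selection while tau is above the floor; switch to
    epsilon-greedy afterwards so the player keeps exploring."""
    return temperature(n) >= floor
```

With these defaults, τ falls below the 0.01 floor after roughly 60,000 games, well within the 200,000 iterations each repeated game runs.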
4. Experiments
4.1 The motivation
Repeated 2 by 2 games are the simplest settings for strategic interaction and are a good starting
point for investigating how different the outcome would be under exploring rationality rather
than full rationality. Take the iterated prisoner's dilemma as an example: if a player takes the
risk of cooperating in some round, hoping to induce later cooperation from its opponent, who
may be thinking the same way, it is possible that both players discover that they can earn more
through mutual cooperation. Even if a player loses some payoff in the early stage, sustained
mutual cooperation later is a good reason to explore early on.
Motivated by this intuition, we deliberately selected 8 games and parameterized their payoff
matrices. The players are modeled as using Q-learning in each repeated game. For 5 of the
games, the Pareto optimal solution does not coincide with a Nash equilibrium. The rest are
games with two Nash equilibria, which we included to address the multi-equilibrium selection
issue.
4.2 The games and the parameterization
The following is the list of the games and how we parameterized their payoff matrices. The
parameter is δ. In each payoff matrix, the first number in a cell is the payoff of the row player
and the second is the payoff of the column player. We mark a Nash equilibrium (NE) with #
and a Pareto optimal solution with *. C and D are the actions, or pure strategies, that the players
can take. The row player always comes first: when we say the outcome of one play is CD, the
row player chose pure strategy C and the column player chose pure strategy D. So there are four
outcomes of one play: CC, CD, DC, and DD.
The first two games are versions of the prisoner's dilemma. The value of δ ranges from 0 to 3.
When δ is 2 in Table 4.1, the game corresponds to the most common payoff matrix for the
prisoner's dilemma.
      C            D
C     (3, 3)*      (0, 3+δ)
D     (3+δ, 0)     (3−δ, 3−δ)#
Table 4.1: Prisoner's dilemma Pattern 1
      C            D
C     (3, 3)*      (0, 3+δ)
D     (3+δ, 0)     (δ, δ)#
Table 4.2: Prisoner's dilemma Pattern 2
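The δ-parameterized payoff matrices of Tables 4.1 and 4.2 can be written as simple functions; the dictionary encoding (keys are (row action, column action) pairs, values are (row payoff, column payoff) pairs) is our own convention:

```python
def pd_pattern1(delta):
    """Table 4.1: mutual defection pays (3 - delta, 3 - delta)."""
    return {("C", "C"): (3, 3),          ("C", "D"): (0, 3 + delta),
            ("D", "C"): (3 + delta, 0),  ("D", "D"): (3 - delta, 3 - delta)}

def pd_pattern2(delta):
    """Table 4.2: identical except mutual defection pays (delta, delta)."""
    return {("C", "C"): (3, 3),          ("C", "D"): (0, 3 + delta),
            ("D", "C"): (3 + delta, 0),  ("D", "D"): (delta, delta)}
```

With delta = 2, pd_pattern1 reproduces the most common prisoner's dilemma payoffs mentioned above.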
While the above two are symmetric games, the following three are asymmetric games adapted
from Rapoport and Guyer (1976). Their payoff values differ from those of the original games in
Rapoport and Guyer. The value of δ again ranges from 0 to 3.
      C              D
C     (0.2, 0.3)#    (0.3+δ, 0.1)
D     (0.1, 0.2)     (0.2+δ, 0.3+δ)*
Table 4.3: game #47
      C              D
C     (0.2, 0.2)#    (0.3+δ, 0.1)
D     (0.1, 0.3)     (0.2+δ, 0.3+δ)*
Table 4.4: game #48
      C              D
C     (0.2, 0.3)#    (0.3+δ, 0.2)
D     (0.1, 0.1)     (0.2+δ, 0.3+δ)*
Table 4.5: game #57
For games with two Nash equilibria, one challenging question is which equilibrium is more
likely to be selected as the outcome. We chose three games from this class. In the game of Stag
Hunt, the Pareto optimal solution is one of the Nash equilibria. The game of Chicken and the
game of Battle of sexes are coordination games. The value of δ ranges from 0 to 3 for Stag
Hunt and Battle of sexes; for Chicken, δ ranges from 0 to 2. Note that there is no Pareto
optimal solution for the last two coordination games.
      C           D
C     (5, 5)*#    (0, 3)
D     (3, 0)      (δ, δ)#
Table 4.6: Stag Hunt
      C             D
C     (δ, 3−δ)#     (0, 0)
D     (0, 0)        (3−δ, δ)#
Table 4.7: Battle of sexes
      C             D
C     (2, 2)        (δ, 2+δ)#
D     (2+δ, δ)#     (0, 0)
Table 4.8: Chicken
4.3 The setting for the experiments
The parameters for Q-learning are set as following: Learning rate is set to 0.2 and discount factor
as 0.95. We ran the experiment with both softmax action selection and ε-greed action selection.
For softmax action selection, T is set to 5 and annealing factor as 0.9999. When τ is less than
0.01, we began usingε-greed. We set εto 0.01. We have chosen these parameter values after
those used by other studies (Sandholm and Crites 1995)
Each repeated game runs for 200,000 iterations, so the players are given enough time to
explore and learn. For each setting of the payoff parameter δ, we ran the repeated game for 100
trials. We recorded the frequencies of the four outcomes (CC, CD, DC and DD) every 100
iterations. The numbers usually become stable within 50,000 iterations, so unless noted
otherwise we report the frequencies of the outcomes in the last 100 iterations over the 100
trials.
The tables in the appendix share a similar layout. The middle column is the payoff parameter δ.
To its left are the results for ε-greedy action selection; the results for Softmax action selection
are on the right. Again, the numbers are the frequencies of the four outcomes (CC, CD, DC and
DD) in the last 100 iterations over the 100 runs.
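The experimental procedure above can be sketched as a driver loop. This is a simplified, self-contained reconstruction (two memory-length-1 Q-learners with ε-greedy selection only, and far fewer iterations), not the authors' exact code:

```python
import random
from collections import Counter

ACTIONS = ("C", "D")

def run_repeated_game(payoff, iterations=20000, alpha=0.2, gamma=0.95, eps=0.01):
    """Two independent Q-learners; the state is the previous joint outcome.
    `payoff` maps (row action, column action) to (row payoff, column payoff)."""
    states = [a + b for a in ACTIONS for b in ACTIONS]  # CC, CD, DC, DD
    Q = [{s: {a: 0.0 for a in ACTIONS} for s in states} for _ in range(2)]

    def choose(i, s):
        greedy = max(ACTIONS, key=lambda a: Q[i][s][a])
        if random.random() < eps:  # explore: pick the non-greedy action
            return "D" if greedy == "C" else "C"
        return greedy

    state, counts = "CC", Counter()
    for _ in range(iterations):
        a_row, a_col = choose(0, state), choose(1, state)
        r_row, r_col = payoff[(a_row, a_col)]
        nxt = a_row + a_col
        for i, (a, r) in enumerate(((a_row, r_row), (a_col, r_col))):
            best = max(Q[i][nxt].values())  # equation (1) update for player i
            Q[i][state][a] += alpha * (r + gamma * best - Q[i][state][a])
        counts[nxt] += 1
        state = nxt
    return counts
```

Calling this with one of the δ-parameterized payoff matrices yields outcome counts analogous to the frequencies reported in the appendix tables.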
4.4 The results
It is disturbing when classical game theory tells you that only the inferior Nash equilibrium
will be the outcome, not the Pareto optimal solution (which is not necessarily a Nash
equilibrium). In our first five games, the subgame perfect Nash equilibrium is never the Pareto
outcome. So the question is: will the outcome be different if we use models (such as
reinforcement learning) that better describe human behavior? More specifically, can the players
learn to play a Pareto optimal solution that is not a Nash equilibrium? Our experiments show
that the answer is positive.
Consider Table 1 for prisoner's dilemma Pattern 1. If δ is close to zero, the two players choose
to defect most of the time. The reason is that there is not much difference between the outcomes
of mutual defection and mutual cooperation: the Pareto outcome does not provide enough
incentive for the players to take the risk of inducing cooperation. To avoid being exploited and
receiving zero payoff, they are better off defecting all the time. But as δ gets larger, more
mutual cooperation is observed, which suggests that both players are trying to settle on the
Pareto optimal outcome. The last row of Table 1 shows an interesting scenario: each player
wants to induce the other's cooperation so that it can take advantage by defecting, because the
temptation to defect is very large when the other player is cooperating. That is why we see
many CDs and DCs but less mutual cooperation (CC). The comparison with the results for
prisoner's dilemma Pattern 2 in Table 2 is illustrative. In Pattern 2, the players lose almost
nothing by trying to cooperate when δ is close to zero; exploration helps them reach the much
superior Pareto outcome (CC), and as Table 2 shows, mutual cooperation happens 94% of the
time. Now consider the scenario where δ is close to 3: first, there is not much incentive to shift
from the Nash equilibrium (DD) to the Pareto outcome (CC), since there is little difference in
payoffs; second, the danger of being exploited by the other player and receiving zero payoff is
much higher. In the end, the players learn to defect most of the time (98%).
Now let us turn to a different class of games. Games #47, #48 and #57 are asymmetric games
with a common feature: the row player has a dominant strategy C, because this strategy always
gives a higher payoff than strategy D no matter what the other player does. Thus a fully rational
row player will never choose D. What happens if the players are able to explore and learn?
Tables 3-5 tell us that it depends on the payoffs. If δ is close to zero, the outcome is the Nash
equilibrium (CC) almost 97% of the time, since it does not pay to steer the other player toward
the Pareto optimal solution, and a player attempting to do so is more likely to be exploited. But
as long as the incentive from the Pareto outcome is large enough, the Pareto outcome (DD) is
observed a considerable fraction of the time (above 94%).
The Stag Hunt game is interesting because its Pareto optimal solution is also one of its Nash
equilibria, but which equilibrium is more likely to be sustained remains a challenging problem
for classical game theory. A mixed strategy (i.e., choosing each pure strategy with some fixed
probability) seems a natural prediction for this repeated game under classical game theory.
Table 6 shows that the outcomes of the repeated game with reinforcement learning players are
quite different from the mixed strategy prediction. For example, when δ equals 1, the mixed
strategy for both players is to choose action C with probability 1/3 and D with probability 2/3,
so we should expect to see CC less than 33% of the time, whereas Table 6 shows CC happening
88% of the time. We can also see that when the Pareto outcome is far superior to the other
Nash equilibrium, it is chosen almost 94% of the time.
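The mixed-strategy probability quoted above follows from the indifference condition under Table 4.6's payoffs: a player is indifferent between C and D when the opponent plays C with probability p satisfying 5p = 3p + δ(1 − p). A small sketch (the function name is ours):

```python
from fractions import Fraction

def stag_hunt_mixed_p(delta):
    """Probability of playing C in the mixed Nash equilibrium of Table 4.6.
    Indifference: 5p = 3p + delta*(1 - p)  =>  p = delta / (2 + delta)."""
    delta = Fraction(delta)
    return delta / (2 + delta)
```

With δ = 1 this gives p = 1/3, matching the figure used in the text.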
The remaining two games are coordination games, and we are concerned not only with which
Nash equilibrium is selected but also with a further question: is the Nash equilibrium concept
sufficient to describe what happens in these games? The latter concern arises from behaviors
observed in human experiments: Rapoport et al. (1976) reported that a majority of subjects
playing the game of Chicken quickly settled into an alternating strategy, with the outcome
switching back and forth between the two Nash equilibria.
From Table 7 we can see that the two Nash equilibria in Battle of sexes are equally likely to be
the outcome in most cases, since the game is symmetric and these two outcomes are superior to
the other two, which give both players zero payoff. For the game of Chicken, Table 8 shows
that if the incentive for coordinating on a Nash equilibrium is too small (i.e., δ is close to zero),
the players learn to be conservative at the same time (CC), since they cannot afford the loss in
the DD situation (zero payoff). As δ increases, the game ends up more and more often at a
Nash equilibrium (CD or DC).
To see whether the players can learn the alternating strategy observed in human subject
experiments, we conducted another 100 trials for these two games, with δ set to 1 and Softmax
action selection. In most trials the outcome converges to one of the Nash equilibria, but we did
observe patterns showing alternating strategies in both games. These patterns are quite stable
and recover quickly from small random disturbances. For Battle of sexes, there is only one such
pattern, in which the players play the two Nash equilibria alternately; it occurred in 11 out of
100 trials. For the game of Chicken, there are other kinds of patterns as well, summarized with
their frequencies in Table 4.9.
The outcomes                                    Frequency in 100 trials
Alternating between CD and DC                   10
Cycle through CD-DC-CC or CD-CC-DC              13
Converge to one of the three: CC, CD or DC      76
No obvious patterns                             1
Table 4.9: Frequencies of different kinds of outcomes in the game of Chicken
The frequencies of these patterns are not large, but consider: first, we use payoff numbers
different from those of Rapoport et al. (1976), which may affect the incentive to form such
strategies; second, our players do not explicitly know the payoff matrix and can only learn the
opponent's payoff structure implicitly through the opponent's behavior (not an easy task); and
finally, there may be features of human behavior that are not captured in our current Q-learning
model but are important for human subjects learning such an alternating strategy. Our main
point, however, is clear: the Nash equilibrium concept seems insufficient for describing the
outcomes of repeated coordination games such as Chicken and Battle of sexes.
*Note: To save space, there are additional results that we do not discuss here but add to the
appendix for completeness.
1. We repeated all the experiments with the players' memory length set to 2. The results are
   shown in Tables 9-16.
2. We set ε in ε-greedy for the row player to 0.03 (the column player's ε remains at 0.01) and
   repeated the experiment with ε-greedy action selection and memory length 1 on game #47.
   The results are summarized in Table 17.
3. We set ε in ε-greedy for the row player to 0.03 (the column player's ε remains at 0.01) and
   repeated the experiment with ε-greedy action selection on the games of Chicken and Battle
   of sexes. The frequencies of patterns are reported in Tables 18-21.
5. Discussion
Appendix
ε-greedy action selection             Softmax action selection
CC    CD    DC    DD      δ      CC    CD    DC    DD
3     87    82    9828    0.05   0     106   101   9793
0     92    105   9803    0.5    0     90    94    9816
52    110   111   9727    1      1     111   111   9777
δ      CC     CD    DC    DD
0.15   3193   318   340   6149
0.2    1934   530   325   7211
0.5    314    365   177   9144
1      254    550   149   9047
1.5    253    529   184   9034
2      206    379   151   9264
3      246    471   177   9106
Table 17: Game #47 with ε-greedy action selection and memory length 1
The outcomes                           Frequency in 100 trials
Alternating between CC and DD          8
Converge to CC or DD                   92
Table 18: Battle of sexes with ε-greedy action selection and memory length 1
The outcomes                                    Frequency in 100 trials
Alternating between CD and DC                   9
Converge to one of the three: CC, CD or DC      70
Other patterns                                  11
No obvious patterns                             10
Table 19: Chicken with ε-greedy action selection and memory length 1
The outcomes                           Frequency in 100 trials
Alternating between CC and DD          29
Converge to CC or DD                   71
Table 20: Battle of sexes with ε-greedy action selection and memory length 2
The outcomes                                    Frequency in 100 trials
Alternating between CD and DC                   26
Converge to one of the three: CC, CD or DC      46
Other patterns                                  18
No obvious patterns                             10
Table 21: Chicken with ε-greedy action selection and memory length 2
References
1. Rapoport, Anatol; Guyer, Melvin J. and Gordon, David G. (1976). The 2×2 Game. Ann
   Arbor, MI: University of Michigan Press.
2. Roth, Alvin E. and Erev, Ido (1995). "Learning in Extensive-Form Games: Experimental
   Data and Simple Dynamic Models in the Intermediate Term." Games and Economic
   Behavior 8, 164-212.
3. Sandholm, Thomas W. and Crites, Robert H. (1995). "Multiagent Reinforcement Learning
   in the Iterated Prisoner's Dilemma." Biosystems 37, 147-166.
4. Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.
5. Watkins, C. J. C. H. and Dayan, P. (1992). "Q-learning." Machine Learning 8, 279-292.
6. Watkins, C. (1989). Learning from Delayed Rewards. PhD thesis, King's College,
   Cambridge.