2. MARL in SSD
• Multi Agent Reinforcement Learning
• Sequential Social Dilemmas
=> Understanding Agent Cooperation
=> Learn policies in sequential situations with the mixed incentive structure of a matrix game social dilemma
4. Social Dilemma
• A social dilemma is a situation in which an individual profits from selfishness unless everyone chooses the selfish alternative, in which case the whole group loses => Represented with a matrix game
5. Matrix Game – prisoner’s dilemma
• Mutual cooperation is the best choice from a global perspective
• A rational agent still chooses betrayal over cooperation ( thinking the relative reward of cooperating is negative ) => Nash Equilibrium
• Matrix Game Social Dilemma == MGSD
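The defect/defect Nash equilibrium can be checked directly. A minimal sketch, using the canonical prisoner's dilemma payoffs (T=5, R=3, P=1, S=0 — assumed values, not taken from the slides):

```python
# payoff[row_action][col_action] = (row player's reward, column player's reward)
# action 0 = cooperate, action 1 = defect (betrayal)
T, R, P, S = 5, 3, 1, 0
payoff = [
    [(R, R), (S, T)],  # row player cooperates
    [(T, S), (P, P)],  # row player defects
]

def is_nash(a1, a2):
    """Pure-strategy Nash equilibrium: no player gains by deviating alone."""
    r1, r2 = payoff[a1][a2]
    best_dev_1 = max(payoff[a][a2][0] for a in (0, 1))
    best_dev_2 = max(payoff[a1][a][1] for a in (0, 1))
    return r1 >= best_dev_1 and r2 >= best_dev_2

nash = [(a1, a2) for a1 in (0, 1) for a2 in (0, 1) if is_nash(a1, a2)]
best_total = max(((a1, a2) for a1 in (0, 1) for a2 in (0, 1)),
                 key=lambda p: sum(payoff[p[0]][p[1]]))
print(nash)        # [(1, 1)] — mutual defection is the unique equilibrium
print(best_total)  # (0, 0) — mutual cooperation maximizes total reward
```

This is the dilemma in one picture: the only equilibrium (defect, defect) differs from the globally best outcome (cooperate, cooperate).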
6. MGSD ignores…
1. Real-world social dilemmas are temporally extended
2. Cooperation and defection are labels that apply to policies implementing
strategic decisions
3. Cooperativeness may be a graded quantity
4. Decisions to cooperate or defect occur only quasi-simultaneously, since some
information about what player 2 is starting to do can inform player 1's decision
and vice versa
5. Decisions must be made despite only having partial information about the
state of the world and the activities of the other players
8. SSD – Markov Games
two-player partially observable Markov game M
Observation Function O : S x {1,2} -> R^d ( o_i = O(s, i) )
Transition Function T : S x A_1 x A_2 -> Δ(S) ( discrete probability distributions over states )
Reward Function r_i : S x A_1 x A_2 -> R
Policy π_i : O_i -> Δ(A_i)
=> Find the MGSD structure with reinforcement learning
State-value function
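The formal pieces above fit together as a simple interface. A minimal sketch with a toy one-state game (the state, actions, and payoffs here are placeholders, not the Gathering game):

```python
import random

class TwoPlayerMarkovGame:
    """Two-player Markov game: states S, actions A_i, transition T,
    rewards r_i, and an observation function O."""
    def __init__(self, states, actions, transition, rewards, observe):
        self.states = states          # S
        self.actions = actions        # A_1 = A_2
        self.transition = transition  # T : S x A_1 x A_2 -> distribution over S
        self.rewards = rewards        # (r_1, r_2) : S x A_1 x A_2 -> R^2
        self.observe = observe        # O : S x {1,2} -> observation o_i

    def step(self, s, a1, a2):
        dist = self.transition(s, a1, a2)  # dict: next state -> probability
        next_s = random.choices(list(dist), weights=list(dist.values()))[0]
        r1, r2 = self.rewards(s, a1, a2)
        return next_s, r1, r2, (self.observe(next_s, 1), self.observe(next_s, 2))

# Toy instance: a single self-looping state, fixed rewards, full observability.
game = TwoPlayerMarkovGame(
    states={"s"}, actions={"noop"},
    transition=lambda s, a1, a2: {"s": 1.0},
    rewards=lambda s, a1, a2: (1, 0),
    observe=lambda s, i: s,
)
next_s, r1, r2, obs = game.step("s", "noop", "noop")
```

Each policy π_i then maps the observation o_i (not the full state s) to a distribution over A_i, which is what makes the game partially observable.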
9. SSD – Definition of SSD
Sequential Social Dilemma
Empirical payoff matrix
In a Markov game, the policy changes as the observation changes
11. Simulation Method
Game : 2D grid-world
Observation : 3 ( RGB ) x 15 ( ahead ) x 10 ( side )
Action : 8 ( arrow keys + rotate left + rotate right + use beam + stand still )
Episode : 1000 steps
NN : two hidden layers – 32 units + ReLU activation, 8 outputs
Policy : ε-greedy ( decrease ε from 1.0 to 0.1 )
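The architecture and exploration scheme on this slide can be sketched directly: two hidden layers of 32 ReLU units mapping the flattened 3x15x10 observation to 8 action values, with ε annealed linearly from 1.0 to 0.1. The weight initialization and annealing horizon are assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
IN, H, OUT = 3 * 15 * 10, 32, 8            # flattened RGB window -> 8 actions

# Two hidden layers of 32 units each (initialization scheme assumed).
W1, b1 = rng.normal(0, 0.1, (H, IN)), np.zeros(H)
W2, b2 = rng.normal(0, 0.1, (H, H)), np.zeros(H)
W3, b3 = rng.normal(0, 0.1, (OUT, H)), np.zeros(OUT)

def q_values(obs):
    h1 = np.maximum(0, W1 @ obs.ravel() + b1)  # ReLU
    h2 = np.maximum(0, W2 @ h1 + b2)           # ReLU
    return W3 @ h2 + b3                        # one value per action

def epsilon(step, horizon=100_000):
    """Linear anneal from 1.0 down to 0.1 (horizon is an assumption)."""
    return max(0.1, 1.0 - 0.9 * step / horizon)

def act(obs, step):
    if rng.random() < epsilon(step):
        return int(rng.integers(OUT))          # explore: uniform random action
    return int(np.argmax(q_values(obs)))       # exploit: greedy action

obs = rng.random((3, 15, 10))
print(q_values(obs).shape)  # (8,)
```

In the paper's setup each agent trains its own such network with deep Q-learning; this sketch only shows the forward pass and action selection.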
12. Result – Gathering
The laser beam gives no reward, but briefly removes the other agent
When food ( green apples ) is abundant, the agents coexist and collect reward;
when it is scarce, they start attacking each other
13. Result – Gathering
Touch Green : reward +1 ( green apple removed temporarily )
Beam to other player ( tagging ) :
when hit twice, the opponent is removed from the game for N_tagged frames
Apple respawns after N_apple frames
=>
Defecting Policy == aggressive ( uses the beam )
Cooperative Policy == does not seek to tag the other player
https://www.youtube.com/watch?v=F97lqqpcqsM
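The reward and tagging rules above can be sketched as a small state machine. The values of N_apple and N_tagged are free parameters of the game; the numbers below are illustrative, not the paper's:

```python
N_APPLE, N_TAGGED = 10, 25  # respawn/removal durations (assumed values)

class GatheringRules:
    """Per-apple and per-opponent bookkeeping for the Gathering rules."""
    def __init__(self):
        self.apple_timer = 0   # frames until the eaten apple respawns
        self.hits = 0          # consecutive beam hits on the opponent
        self.removed = 0       # frames the opponent remains removed

    def collect_apple(self):
        """Touching a present green apple gives +1 and removes it temporarily."""
        if self.apple_timer == 0:
            self.apple_timer = N_APPLE
            return 1
        return 0               # apple not yet respawned: no reward

    def beam_hit(self):
        """A second beam hit removes the opponent for N_TAGGED frames."""
        self.hits += 1
        if self.hits >= 2:
            self.hits = 0
            self.removed = N_TAGGED

    def tick(self):
        """Advance one frame: count down respawn and removal timers."""
        self.apple_timer = max(0, self.apple_timer - 1)
        self.removed = max(0, self.removed - 1)
```

Note the incentive structure this creates: tagging yields no direct reward, but removing the opponent for N_tagged frames leaves more apples for the tagger — which is exactly why beam use reads as defection.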
14. Result – Gathering
*After training for 4 million steps for each option
[Figure: aggressiveness ( beam use ) as a function of conflict cost and abundance, ranging from highly aggressive to low aggressive]
15. RL to SSD
1. Train policies in different game settings
2. Extract the trained policies from step 1
3. Calculate the MGSD ( empirical payoff matrix )
4. Repeat 2-3 until convergence
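Step 3 can be sketched as follows: pair trained cooperator/defector policies, estimate the empirical payoffs R, P, S, T, and test the MGSD inequalities. `play_episode` is a hypothetical stand-in for rolling the two policies out in the game; here it returns canned per-episode returns for illustration:

```python
from statistics import mean

def play_episode(policy_1, policy_2):
    """Stand-in for a rollout of two policies; returns (return_1, return_2)."""
    canned = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
              ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
    return canned[(policy_1, policy_2)]

def empirical_payoffs(coop, defect, episodes=10):
    """Average player-1 returns over the four policy pairings."""
    R = mean(play_episode(coop, coop)[0] for _ in range(episodes))
    P = mean(play_episode(defect, defect)[0] for _ in range(episodes))
    S = mean(play_episode(coop, defect)[0] for _ in range(episodes))
    T = mean(play_episode(defect, coop)[0] for _ in range(episodes))
    return R, P, S, T

def is_social_dilemma(R, P, S, T):
    # MGSD conditions: mutual cooperation beats mutual defection (R > P)
    # and being exploited (R > S), joint reward is maximized by mutual
    # cooperation (2R > T + S), and there is an incentive to defect
    # (greed: T > R, or fear: P > S).
    return R > P and R > S and 2 * R > T + S and (T > R or P > S)

R, P, S, T = empirical_payoffs("C", "D")
print(is_social_dilemma(R, P, S, T))  # True for these payoffs
```

When the inequalities fail (e.g. the Nash equilibrium already maximizes joint reward), the induced matrix game is not a social dilemma — the Non-SSD outcome on the next slide.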
16. Gathering : DRL to SSD
Prisoner's Dilemma
or
Non-SSD : ( the Nash equilibrium is the global optimum )