https://thoughtworks-bangalore.github.io/data-science/
Abstract: In this talk we will cover a specific kind of Markov decision process called the multi-armed bandit, which is used to model many common use cases in reinforcement learning such as A/B testing, clinical trials, ad serving, resource allocation, and recommendations. Bandit algorithms come in stochastic and adversarial flavors with different variations, all seeking to maximize some form of expected payout while keeping the total regret low.
We will discuss some common strategies for solving different kinds of bandit problems, their optimality with respect to the exploration-exploitation tradeoff, and their empirical performance on real datasets.
3. OVERVIEW
EXPLORATION VS. EXPLOITATION
▸ Tradeoff between the need to try out all arms and the need to minimize the total regret suffered from pulling sub-optimal arms.
▸ The agent can gain knowledge about the environment only by pulling an arm.
▸ But by pulling a bad arm it suffers some regret.
▸ If an algorithm explores forever or exploits forever, it will have linear total regret.
▸ Usage Scenarios
▸ Clinical Trials
▸ A/B Testing online ads*
▸ Restaurant selection
▸ Feynman’s restaurant problem
▸ Bellman optimality equation for the value of a state:
$V^*(s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V^*(s')$
* https://support.google.com/analytics/answer/2844870?hl=en
4. OVERVIEW
MARKOV DECISION PROCESSES
▸ Sequential decision-making process $(S, A, Pr, R, \gamma)$ with the stationary Markov property, where:
▸ States: $S = \{s_1, s_2, \ldots, s_n\}$
▸ Actions: $A = \{a_1, a_2, \ldots, a_n\}$
▸ Transition model: $P^a_{ss'} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
▸ Reward: $R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
▸ Discount factor: $\gamma \in [0, 1]$
* Reinforcement Learning: An Introduction, Sutton and Barto [1998]
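To make these definitions concrete, here is a minimal sketch in Python of a toy MDP together with value iteration, which repeatedly applies the Bellman optimality equation shown on the previous slide. The two states, two actions, transition probabilities and rewards are invented for illustration and are not from the talk.

```python
# Hypothetical two-state, two-action MDP (S, A, Pr, R, gamma).
S = ["s1", "s2"]
A = ["a1", "a2"]
gamma = 0.9

# Transition model: Pr[s][a][s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
Pr = {
    "s1": {"a1": {"s1": 0.8, "s2": 0.2}, "a2": {"s1": 0.1, "s2": 0.9}},
    "s2": {"a1": {"s1": 0.5, "s2": 0.5}, "a2": {"s1": 0.0, "s2": 1.0}},
}

# Reward for being in each state.
R = {"s1": 0.0, "s2": 1.0}

# Value iteration: V*(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) V*(s')
V = {s: 0.0 for s in S}
for _ in range(100):
    V = {s: R[s] + gamma * max(sum(Pr[s][a][s2] * V[s2] for s2 in S) for a in A)
         for s in S}

print(V)  # approximate optimal state values
```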
5. OVERVIEW
WHAT ARE BANDITS?
▸ Originally described by Robbins [1952].
▸ A gambler is faced with K slot machines, each with an unknown distribution of rewards. The goal is to maximize the cumulative reward over a finite number of trials (horizon $T$).
▸ A Bernoulli bandit is a special case of the MAB in which the reward is Bernoulli distributed.
▸ Stochastic MABs: each arm $k$ is associated with an unknown probability $v_k \in [0, 1]$, and rewards are drawn i.i.d. from the corresponding distribution.
▸ Adversarial bandits: rewards are generated by an adversary.
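As a running environment for the strategies discussed later, here is a minimal sketch of a stochastic Bernoulli bandit in Python. The class name and the arm probabilities are invented for illustration.

```python
import random

class BernoulliBandit:
    """K-armed stochastic bandit with Bernoulli-distributed rewards."""

    def __init__(self, probs):
        self.probs = probs          # unknown to the agent: v_k in [0, 1] per arm
        self.k = len(probs)

    def pull(self, arm):
        # Reward is 1 with probability v_arm and 0 otherwise, drawn i.i.d. per trial.
        return 1 if random.random() < self.probs[arm] else 0

bandit = BernoulliBandit([0.1, 0.5, 0.7])   # made-up arm probabilities
print(bandit.pull(2))                        # sample one reward from arm 2
```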
6. OVERVIEW
BACKGROUND
▸ Notation:
▸ Trials: $t = 1, 2, \ldots, T$; choice at trial $t$: $i_t \in \{1, 2, \ldots, K\}$; reward of the chosen arm: $r_{i_t} \in \mathbb{R}$
▸ Optimal arm: $i^* = \arg\max_{i=1,\ldots,K} \mu_i$ with mean $\mu^* = \max_{i=1,\ldots,K} \mu_i$; gap of arm $i$: $\Delta_i = \mu^* - \mu_i$
▸ Goal: maximize the total reward $\sum_{t=1}^{T} r_{i_t}$
▸ Or minimize the total expected regret (optimal minus obtained reward): $\text{Regret}(T) = T\mu^* - \sum_{t=1}^{T} \mathbb{E}[\mu_{i_t}] = \sum_{j=1}^{K} \Delta_j\, \mathbb{E}[T_j(T)]$, where $T_j(T)$ is the number of times arm $j$ is selected
▸ Lai and Robbins [1985] showed that the optimal regret is $O(\log T)$
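The regret definition above is easy to evaluate empirically when the true arm means are known, as in a simulation. The helper below is a hypothetical sketch, not code from the talk.

```python
def expected_regret(mu, choices):
    """Total expected regret T*mu_star - sum_t mu_{i_t} for a sequence of pulls.

    mu:      true mean reward of each arm (unknown to the agent)
    choices: the arm index i_t chosen at each trial t
    """
    mu_star = max(mu)
    return sum(mu_star - mu[i] for i in choices)

# Always pulling arm 0 of a 3-armed bandit: regret grows linearly with T.
print(expected_regret([0.1, 0.5, 0.7], [0] * 100))   # 60.0
```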
7. STRATEGIES
EPSILON GREEDY
▸ Setup: select initial empirical means $\hat\mu_i(0)$ for each arm $i$.
▸ At time $t$, with probability $1 - \epsilon_t$ play the arm with the highest empirical mean, and with probability $\epsilon_t$ play a random arm.
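A minimal sketch of the epsilon-greedy choice rule, assuming the agent keeps a running empirical mean and pull count per arm; the function and variable names are my own.

```python
import random

def epsilon_greedy(means, epsilon):
    """Exploit the best empirical mean w.p. 1 - epsilon, otherwise explore."""
    if random.random() < epsilon:
        return random.randrange(len(means))                 # explore: random arm
    return max(range(len(means)), key=lambda i: means[i])   # exploit: best arm so far

def update(means, counts, arm, reward):
    """Incrementally update the empirical mean of the pulled arm."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```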
BOLTZMANN EXPLORATION
▸ At trial $t$, arm $k$ is selected with probability given by the Gibbs (Boltzmann) distribution:
$p_k = \dfrac{e^{\hat\mu_k(t)/\tau}}{\sum_{j=1}^{K} e^{\hat\mu_j(t)/\tau}}, \quad k = 1, \ldots, K$
▸ $\tau$ is a temperature parameter controlling the randomness of the choice.
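A sketch of Boltzmann (softmax) arm selection over the same empirical means; a low temperature is nearly greedy, a high temperature is nearly uniform.

```python
import math
import random

def boltzmann(means, tau):
    """Sample an arm from the Gibbs distribution over empirical means."""
    weights = [math.exp(m / tau) for m in means]
    total = sum(weights)
    return random.choices(range(len(means)),
                          weights=[w / total for w in weights], k=1)[0]

print(boltzmann([0.2, 0.5, 0.4], tau=0.1))   # usually picks arm 1
```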
8. STRATEGIES
UPPER CONFIDENCE BOUND
▸ ‘Optimism in the face of uncertainty’.
▸ Chernoff-Hoeffding bound on the deviation of the empirical mean $\bar{Y}$ of $n$ samples from the true mean: $P(\bar{Y} \geq \mu + a) \leq e^{-2na^2}$
▸ Algorithm:
▸ Setup: compute the empirical mean payoff $\hat\mu_i$ for each arm $i$.
▸ At each round $t$, pick the arm $j(t) = \arg\max_i \left( \hat\mu_i + \sqrt{\frac{2 \ln t}{n_i}} \right)$, where $n_i$ is the number of plays of arm $i$; the first term captures knowledge (exploitation) and the second uncertainty (exploration).
▸ Matches the optimal $O(\log n)$ lower bound on regret.
* Using Confidence Bounds for Exploitation-Exploration Trade-offs, Auer, Cesa-Bianchi & Fischer [2002]
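A sketch of the UCB1 selection rule from the formula above; the bookkeeping (empirical means, pull counts, 1-based round index) follows my own naming.

```python
import math

def ucb1(means, counts, t):
    """Pick the arm maximizing empirical mean + sqrt(2 ln t / n_i)."""
    for i, n in enumerate(counts):
        if n == 0:
            return i              # play each arm once before trusting the bound
    return max(range(len(means)),
               key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))

# t is the 1-based index of the current round.
print(ucb1([0.4, 0.6, 0.5], [10, 10, 1], t=21))
```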
9. STRATEGIES
BAYESIAN BANDITS
▸ Assume a prior distribution $P(\theta)$ on the parameters.
▸ The likelihood of a reward is given by $P(r \mid a, \theta)$.
▸ Sample from the posterior distribution and update the prior after each observed reward.
▸ For bandits with Bernoulli rewards, start with the standard conjugate prior, the Beta distribution; the posterior is then also a Beta distribution.
▸ pdf of a Beta distribution with parameters $\alpha > 0$, $\beta > 0$: $f(x; \alpha, \beta) = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}$
[Figure: Beta pdfs for $\alpha = \beta = 2$ (red), $\alpha = \beta = 12$ (green), $\alpha = \beta = 102$ (blue)]
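A sketch of the conjugate Beta-Bernoulli update the slide refers to: a success increments alpha, a failure increments beta, and the posterior stays a Beta distribution.

```python
def beta_update(alpha, beta, reward):
    """Posterior update for a Bernoulli reward under a Beta(alpha, beta) prior."""
    return (alpha + 1, beta) if reward == 1 else (alpha, beta + 1)

# Starting from the uniform prior Beta(1, 1), after 7 successes and 3 failures
# the posterior over the arm's success probability is Beta(8, 4).
alpha, beta = 1, 1
for r in [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]:
    alpha, beta = beta_update(alpha, beta, r)
print(alpha, beta)   # 8 4
```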
10. STRATEGIES
GITTINS INDEX (INFORMATION STATE SEARCH)
▸ Goal: maximize the total expected discounted reward.
▸ Reduces to solving a stopping problem.
▸ Bayesian-adaptive MDP: assume a prior on the reward distribution and geometric discounting. Each state transition is a Bayes model update; for Bernoulli bandits this means a Beta prior:
$\pi(r \mid \alpha, \beta) = \dfrac{r^{\alpha - 1} (1 - r)^{\beta - 1}}{B(\alpha, \beta)}$, where $B$ is the Beta function.
▸ Optimal policy: select the arm that maximizes the Gittins dynamic allocation index, a normalized sum of time-discounted rewards. For arm $i$,
$v_i = \max_{\tau > 0} \dfrac{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \gamma^t\, r_i(x_i(t)) \right]}{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \gamma^t \right]}$
where $\gamma$ is the reward discount parameter and $\tau$ is a stopping time.
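In practice the ratio above is usually computed indirectly. The rough sketch below (my own, not from the talk) approximates the Gittins index of a Beta-Bernoulli arm by calibration: it searches for the constant per-step retirement reward lambda at which retiring and continuing to pull are equally attractive, using depth-truncated value iteration over the Beta posterior states. The truncation depth and tolerance are arbitrary choices, so the result is only an approximation.

```python
from functools import lru_cache

def gittins_index_bernoulli(alpha, beta, gamma=0.9, depth=40, tol=1e-4):
    """Approximate Gittins index of a Beta(alpha, beta) Bernoulli arm."""

    def value(a0, b0, lam):
        @lru_cache(maxsize=None)
        def v(a, b, d):
            p = a / (a + b)                      # posterior mean reward
            if d == depth:                       # truncate: act myopically forever
                return max(lam, p) / (1.0 - gamma)
            retire = lam / (1.0 - gamma)         # stop and take lam every step
            pull = p + gamma * (p * v(a + 1, b, d + 1) + (1 - p) * v(a, b + 1, d + 1))
            return max(retire, pull)
        return v(a0, b0, 0)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = (lo + hi) / 2.0
        # If pulling still beats retiring at the root, the index exceeds lam.
        if value(alpha, beta, lam) > lam / (1.0 - gamma) + 1e-9:
            lo = lam
        else:
            hi = lam
    return (lo + hi) / 2.0

print(gittins_index_bernoulli(1, 1))   # index of an uninformed Beta(1, 1) arm
```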
11. STRATEGIES
THOMPSON SAMPLING (PROBABILITY MATCHING)
▸ Start with a prior belief on the parameters of the reward distribution.
▸ Play each arm according to the probability that it is optimal: sample parameters $\theta_t$ from the posterior and choose $a_t = \arg\max_a \mathbb{E}(r \mid a, \theta_t)$.
▸ After every trial, observe the reward and do a Bayesian update.
▸ Shown to have logarithmic expected regret [Agrawal 2012].
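A sketch of Thompson sampling for Bernoulli rewards with Beta posteriors, tying together the Bayesian update shown earlier; the names are illustrative.

```python
import random

def thompson_sample(posteriors):
    """posteriors: (alpha, beta) per arm. Sample a mean from each Beta posterior
    and play the arm whose sampled mean is largest (probability matching)."""
    samples = [random.betavariate(a, b) for a, b in posteriors]
    return max(range(len(samples)), key=lambda i: samples[i])

# One trial: choose an arm, observe a Bernoulli reward, update that arm's posterior.
posteriors = [(1, 1), (1, 1), (1, 1)]         # uniform priors for 3 arms
arm = thompson_sample(posteriors)
reward = 1                                     # e.g. observed from the environment
a, b = posteriors[arm]
posteriors[arm] = (a + reward, b + 1 - reward)
```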
13. CONCLUSION
REFERENCES
▸ D. Berry and B. Fristedt. Bandit Problems. Chapman and Hall, 1985.
▸ J. Gittins. Multi-armed Bandit Allocation Indices. Wiley, 1989.
▸ T. L. Lai and H. Robbins. Asymptotically Efficient Adaptive Allocation Rules. 1985.
▸ Shipra Agrawal and Navin Goyal. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. 2012.
▸ Volodymyr Kuleshov and Doina Precup. Algorithms for the Multi-armed Bandit Problem.
▸ P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multi-armed Bandit Problem. 2002.