https://thoughtworks-bangalore.github.io/data-science/
Abstract: In this talk we will cover a specific kind of Markov decision process called the multi-armed bandit, which is used to model many common use cases in reinforcement learning such as A/B testing, clinical trials, ad serving, resource allocation, and recommendations. Bandit algorithms come in stochastic and adversarial flavors with different variations, all seeking to maximize some form of expected payout while keeping the total regret low.
We will discuss some common strategies for solving different kinds of bandit problems, their optimality with respect to the exploration-exploitation tradeoff, and their empirical performance on real datasets.
3. OVERVIEW
EXPLORATION VS. EXPLOITATION
▸ Tradeoff between the need to try out all arms and the need to minimize the total regret suffered from pulling sub-optimal arms.
▸ The agent can gain knowledge about the environment only by pulling an arm.
▸ But by pulling a bad arm it suffers some regret.
▸ If an algorithm explores forever or exploits forever, it will have linear total regret.
▸ Usage Scenarios
▸ Clinical Trials
▸ A/B Testing online ads*
▸ Restaurant selection
▸ Feynman’s restaurant problem
▸ Bellman optimality equation for the value of a state:
$V^*(s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V^*(s')$
* https://support.google.com/analytics/answer/2844870?hl=en
4. OVERVIEW
MARKOV DECISION PROCESSES
▸ Sequential decision-making process $(S, A, Pr, R, \gamma)$ with the stationary Markov property, where:
▸ States: $S = \{s_1, s_2, \ldots, s_n\}$
▸ Actions: $A = \{a_1, a_2, \ldots, a_n\}$
▸ Transition model: $P^a_{ss'} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
▸ Reward: $R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
▸ Discount factor: $\gamma \in [0, 1]$
* Reinforcement Learning: An Introduction, Sutton and Barto [1998]
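To make these definitions concrete, here is a minimal sketch in Python of a toy MDP together with value iteration, which repeatedly applies the Bellman optimality equation shown on the previous slide. The two states, two actions, transition probabilities and rewards are invented for illustration and are not from the talk.

```python
# Hypothetical two-state, two-action MDP (S, A, Pr, R, gamma).
S = ["s1", "s2"]
A = ["a1", "a2"]
gamma = 0.9

# Transition model: Pr[s][a][s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
Pr = {
    "s1": {"a1": {"s1": 0.8, "s2": 0.2}, "a2": {"s1": 0.1, "s2": 0.9}},
    "s2": {"a1": {"s1": 0.5, "s2": 0.5}, "a2": {"s1": 0.0, "s2": 1.0}},
}

# Reward for being in each state.
R = {"s1": 0.0, "s2": 1.0}

# Value iteration: V*(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) V*(s')
V = {s: 0.0 for s in S}
for _ in range(100):
    V = {s: R[s] + gamma * max(sum(Pr[s][a][s2] * V[s2] for s2 in S) for a in A)
         for s in S}

print(V)  # approximate optimal state values
```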
5. OVERVIEW
WHAT ARE BANDITS?
▸ Originally described by Robbins [1952].
▸ A gambler is faced with K slot machines, each with an unknown distribution of rewards. The goal is to maximize the cumulative reward over a finite number of trials (horizon $T$).
▸ A Bernoulli bandit is a special case of the MAB in which the reward is Bernoulli distributed.
▸ Stochastic MABs: each arm $k$ is associated with an unknown probability $v_k \in [0, 1]$, and rewards are drawn i.i.d. from the corresponding distribution.
▸ Adversarial bandits: rewards are generated by an adversary.
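As a running environment for the strategies discussed later, here is a minimal sketch of a stochastic Bernoulli bandit in Python. The class name and the arm probabilities are invented for illustration.

```python
import random

class BernoulliBandit:
    """K-armed stochastic bandit with Bernoulli-distributed rewards."""

    def __init__(self, probs):
        self.probs = probs          # unknown to the agent: v_k in [0, 1] per arm
        self.k = len(probs)

    def pull(self, arm):
        # Reward is 1 with probability v_arm and 0 otherwise, drawn i.i.d. per trial.
        return 1 if random.random() < self.probs[arm] else 0

bandit = BernoulliBandit([0.1, 0.5, 0.7])   # made-up arm probabilities
print(bandit.pull(2))                        # sample one reward from arm 2
```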
6. OVERVIEW
BACKGROUND
▸ Notation:
▸ Trials: $t = 1, 2, \ldots, T$; choice at trial $t$: $i_t \in \{1, 2, \ldots, K\}$; reward of the chosen arm: $r_{i_t} \in \mathbb{R}$
▸ Optimal arm: $i^* = \arg\max_{i=1,\ldots,K} \mu_i$ with mean $\mu^* = \max_{i=1,\ldots,K} \mu_i$; gap of arm $i$: $\Delta_i = \mu^* - \mu_i$
▸ Goal: maximize the total reward $\sum_{t=1}^{T} r_{i_t}$
▸ Or minimize the total expected regret (optimal minus obtained reward): $\text{Regret}(T) = T\mu^* - \sum_{t=1}^{T} \mathbb{E}[\mu_{i_t}] = \sum_{j=1}^{K} \Delta_j\, \mathbb{E}[T_j(T)]$, where $T_j(T)$ is the number of times arm $j$ is selected
▸ Lai and Robbins [1985] showed that the optimal regret is $O(\log T)$
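The regret definition above is easy to evaluate empirically when the true arm means are known, as in a simulation. The helper below is a hypothetical sketch, not code from the talk.

```python
def expected_regret(mu, choices):
    """Total expected regret T*mu_star - sum_t mu_{i_t} for a sequence of pulls.

    mu:      true mean reward of each arm (unknown to the agent)
    choices: the arm index i_t chosen at each trial t
    """
    mu_star = max(mu)
    return sum(mu_star - mu[i] for i in choices)

# Always pulling arm 0 of a 3-armed bandit: regret grows linearly with T.
print(expected_regret([0.1, 0.5, 0.7], [0] * 100))   # 60.0
```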
7. STRATEGIES
EPSILON GREEDY
▸ Setup: select initial empirical means $\hat\mu_i(0)$ for each arm $i$.
▸ At time $t$, with probability $1 - \epsilon_t$ play the arm with the highest empirical mean, and with probability $\epsilon_t$ play a random arm.
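A minimal sketch of the epsilon-greedy choice rule, assuming the agent keeps a running empirical mean and pull count per arm; the function and variable names are my own.

```python
import random

def epsilon_greedy(means, epsilon):
    """Exploit the best empirical mean w.p. 1 - epsilon, otherwise explore."""
    if random.random() < epsilon:
        return random.randrange(len(means))                 # explore: random arm
    return max(range(len(means)), key=lambda i: means[i])   # exploit: best arm so far

def update(means, counts, arm, reward):
    """Incrementally update the empirical mean of the pulled arm."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```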
BOLTZMANN EXPLORATION
▸ At trial $t$, arm $k$ is selected with probability given by the Gibbs (Boltzmann) distribution:
$p_k = \dfrac{e^{\hat\mu_k(t)/\tau}}{\sum_{j=1}^{K} e^{\hat\mu_j(t)/\tau}}, \quad k = 1, \ldots, K$
▸ $\tau$ is a temperature parameter controlling the randomness of the choice.
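A sketch of Boltzmann (softmax) arm selection over the same empirical means; a low temperature is nearly greedy, a high temperature is nearly uniform.

```python
import math
import random

def boltzmann(means, tau):
    """Sample an arm from the Gibbs distribution over empirical means."""
    weights = [math.exp(m / tau) for m in means]
    total = sum(weights)
    return random.choices(range(len(means)),
                          weights=[w / total for w in weights], k=1)[0]

print(boltzmann([0.2, 0.5, 0.4], tau=0.1))   # usually picks arm 1
```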
8. STRATEGIES
UPPER CONFIDENCE BOUND
▸ ‘Optimism in the face of uncertainty’.
▸ Chernoff-Hoeffding bound on the deviation of the empirical mean $\bar{Y}$ of $n$ samples from the true mean: $P(\bar{Y} \geq \mu + a) \leq e^{-2na^2}$
▸ Algorithm:
▸ Setup: compute the empirical mean payoff $\hat\mu_i$ for each arm $i$.
▸ At each round $t$, pick the arm $j(t) = \arg\max_i \left( \hat\mu_i + \sqrt{\frac{2 \ln t}{n_i}} \right)$, where $n_i$ is the number of plays of arm $i$; the first term captures knowledge (exploitation) and the second uncertainty (exploration).
▸ Matches the optimal $O(\log n)$ lower bound on regret.
* Using Confidence Bounds for Exploitation-Exploration Trade-offs, Auer, Cesa-Bianchi & Fischer [2002]
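A sketch of the UCB1 selection rule from the formula above; the bookkeeping (empirical means, pull counts, 1-based round index) follows my own naming.

```python
import math

def ucb1(means, counts, t):
    """Pick the arm maximizing empirical mean + sqrt(2 ln t / n_i)."""
    for i, n in enumerate(counts):
        if n == 0:
            return i              # play each arm once before trusting the bound
    return max(range(len(means)),
               key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))

# t is the 1-based index of the current round.
print(ucb1([0.4, 0.6, 0.5], [10, 10, 1], t=21))
```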
9. STRATEGIES
BAYESIAN BANDITS
▸ Assume a prior distribution $P(\theta)$ on the parameters.
▸ The likelihood of a reward is given by $P(r \mid a, \theta)$.
▸ Sample from the posterior distribution and update the prior after each observed reward.
▸ For bandits with Bernoulli rewards, start with the standard conjugate prior, the Beta distribution; the posterior is then also a Beta distribution.
▸ pdf of a Beta distribution with parameters $\alpha > 0$, $\beta > 0$: $f(x; \alpha, \beta) = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}$
[Figure: Beta pdfs for $\alpha = \beta = 2$ (red), $\alpha = \beta = 12$ (green), $\alpha = \beta = 102$ (blue)]
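A sketch of the conjugate Beta-Bernoulli update the slide refers to: a success increments alpha, a failure increments beta, and the posterior stays a Beta distribution.

```python
def beta_update(alpha, beta, reward):
    """Posterior update for a Bernoulli reward under a Beta(alpha, beta) prior."""
    return (alpha + 1, beta) if reward == 1 else (alpha, beta + 1)

# Starting from the uniform prior Beta(1, 1), after 7 successes and 3 failures
# the posterior over the arm's success probability is Beta(8, 4).
alpha, beta = 1, 1
for r in [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]:
    alpha, beta = beta_update(alpha, beta, r)
print(alpha, beta)   # 8 4
```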
10. STRATEGIES
GITTINS INDEX (INFORMATION STATE SEARCH)
▸ Goal: maximize the total expected discounted reward.
▸ Reduces to solving a stopping problem.
▸ Bayesian-adaptive MDP: assume a prior on the reward distribution and geometric discounting. Each state transition is a Bayes model update; for Bernoulli bandits this means a Beta prior:
$\pi(r \mid \alpha, \beta) = \dfrac{r^{\alpha - 1} (1 - r)^{\beta - 1}}{B(\alpha, \beta)}$, where $B$ is the Beta function.
▸ Optimal policy: select the arm that maximizes the Gittins dynamic allocation index, a normalized sum of time-discounted rewards. For arm $i$,
$v_i = \max_{\tau > 0} \dfrac{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \gamma^t\, r_i(x_i(t)) \right]}{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \gamma^t \right]}$
where $\gamma$ is the reward discount parameter and $\tau$ is a stopping time.
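In practice the ratio above is usually computed indirectly. The rough sketch below (my own, not from the talk) approximates the Gittins index of a Beta-Bernoulli arm by calibration: it searches for the constant per-step retirement reward lambda at which retiring and continuing to pull are equally attractive, using depth-truncated value iteration over the Beta posterior states. The truncation depth and tolerance are arbitrary choices, so the result is only an approximation.

```python
from functools import lru_cache

def gittins_index_bernoulli(alpha, beta, gamma=0.9, depth=40, tol=1e-4):
    """Approximate Gittins index of a Beta(alpha, beta) Bernoulli arm."""

    def value(a0, b0, lam):
        @lru_cache(maxsize=None)
        def v(a, b, d):
            p = a / (a + b)                      # posterior mean reward
            if d == depth:                       # truncate: act myopically forever
                return max(lam, p) / (1.0 - gamma)
            retire = lam / (1.0 - gamma)         # stop and take lam every step
            pull = p + gamma * (p * v(a + 1, b, d + 1) + (1 - p) * v(a, b + 1, d + 1))
            return max(retire, pull)
        return v(a0, b0, 0)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = (lo + hi) / 2.0
        # If pulling still beats retiring at the root, the index exceeds lam.
        if value(alpha, beta, lam) > lam / (1.0 - gamma) + 1e-9:
            lo = lam
        else:
            hi = lam
    return (lo + hi) / 2.0

print(gittins_index_bernoulli(1, 1))   # index of an uninformed Beta(1, 1) arm
```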
11. STRATEGIES
THOMPSON SAMPLING (PROBABILITY MATCHING)
▸ Start with a prior belief on the parameters of the reward distribution.
▸ Play each arm according to the probability that it is optimal: sample parameters $\theta_t$ from the posterior and choose $a_t = \arg\max_a \mathbb{E}(r \mid a, \theta_t)$.
▸ After every trial, observe the reward and do a Bayesian update.
▸ Shown to have logarithmic expected regret [Agrawal 2012].
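A sketch of Thompson sampling for Bernoulli rewards with Beta posteriors, tying together the Bayesian update shown earlier; the names are illustrative.

```python
import random

def thompson_sample(posteriors):
    """posteriors: (alpha, beta) per arm. Sample a mean from each Beta posterior
    and play the arm whose sampled mean is largest (probability matching)."""
    samples = [random.betavariate(a, b) for a, b in posteriors]
    return max(range(len(samples)), key=lambda i: samples[i])

# One trial: choose an arm, observe a Bernoulli reward, update that arm's posterior.
posteriors = [(1, 1), (1, 1), (1, 1)]         # uniform priors for 3 arms
arm = thompson_sample(posteriors)
reward = 1                                     # e.g. observed from the environment
a, b = posteriors[arm]
posteriors[arm] = (a + reward, b + 1 - reward)
```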
13. CONCLUSION
REFERENCES
▸ D. Berry and B. Fristedt. Bandit Problems. Chapman and Hall, 1985.
▸ J. Gittins. Multi-armed Bandit Allocation Indices. Wiley, 1989.
▸ T. L. Lai and H. Robbins. Asymptotically Efficient Adaptive Allocation Rules. 1985.
▸ Shipra Agrawal and Navin Goyal. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. 2012.
▸ Volodymyr Kuleshov and Doina Precup. Algorithms for the Multi-armed Bandit Problem.
▸ P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multi-armed Bandit Problem. 2002.