Contextual Bandit Survey

Lab Seminar: Contextual Bandit Survey
Sangwoo Mo
KAIST
swmo@kaist.ac.kr
August 4, 2016

Overview
1 Problem Setting
2 Naïve Approach: Reduce to MAB
3 Stochastic Contextual Bandit
    UCB & Thompson Sampling
    Arbitrary Set of Policies
4 Adversarial Contextual Bandit
5 Supervised Learning to Contextual Bandit

Problem Setting

Multi-Armed Bandit
At each time $t$, the agent selects an arm $a_t \in \{1, \ldots, K\}$
Then, the agent receives a reward $r_t$ ($= r_{a_t,t}$) from the environment
If $r_{i,t}$ is drawn i.i.d. from some distribution, we call it a stochastic bandit, and if $r_{i,t}$ is selected by the environment, we call it an adversarial bandit
The goal of MAB is to find the policy $\pi \in \Pi$ s.t.
$$\pi(a_1, r_1, \ldots, a_{t-1}, r_{t-1}) = a_t$$
which minimizes the regret [1]
$$R_T := \max_{i=1,\ldots,K} \mathbb{E}\left[ \sum_{t=1}^{T} r_{i,t} - \sum_{t=1}^{T} r_{a_t,t} \right]$$
[1] Properly speaking, cumulative pseudo-regret.
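As a small illustration of the regret definition above, the sketch below evaluates the pseudo-regret of one fixed action sequence in a stochastic bandit. The function name `pseudo_regret` and the assumption that the arm means are known are choices made for this example only; the expectation over the agent's own randomness is left out.

```python
def pseudo_regret(means, actions):
    """Pseudo-regret of an action sequence in a stochastic bandit.

    means[i] is the (assumed known) expected reward of arm i;
    actions[t] is the arm selected at round t.
    """
    best = max(means)  # expected reward of the best fixed arm
    # Each round contributes the gap between the best arm and the chosen arm.
    return sum(best - means[a] for a in actions)
```

For example, `pseudo_regret([0.2, 0.5, 0.9], [0, 2, 1, 2])` adds the gaps 0.7, 0, 0.4 and 0.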

Contextual Bandit
In contextual bandit, the agent receives additional information (= context) $c_t \in \mathcal{C}$ [1] at the beginning of time $t$
In stochastic contextual bandit, the reward $r_{i,t}$ can be represented as a function of the context $c_{i,t}$ and noise $\epsilon_{i,t}$
$$r_{i,t} = f(c_{i,t}) + \epsilon_{i,t}$$
or simply $r_{i,t} = f_i(c_t) + \epsilon_{i,t}$ if $c_t$ does not depend on $i$
In adversarial contextual bandit, the reward $r_{i,t}$ is selected by the environment, as in the non-contextual MAB
[1] Many papers write $c_{i,t}$ to emphasize that each arm $i$ has a corresponding context $c_{i,t}$. However, both notations are equivalent since we can construct a single vector $c_t$ by concatenating the $c_{i,t}$.

Optimal Regret Bound
Stochastic Bandit: $\Omega(\log T)$ [1]
Adversarial Bandit: $\Omega(\sqrt{KT})$ [2]
Contextual Bandit: $\Omega(d\sqrt{T})$ [3] ($d$ = context dimension)
[1] Lai & Robbins. Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 1985.
[2] Auer et al. Gambling in a Rigged Casino: The Adversarial Multi-armed Bandit Problem. FOCS, 1995. Obtained by a minimax argument; note that the adversarial bandit can be viewed as a 2-player game between the agent and the environment.
[3] Dani et al. Stochastic Linear Optimization under Bandit Feedback. COLT, 2008. Note that the lower bound is $\Omega(\sqrt{T})$ even for the stochastic contextual bandit, since contexts may arrive adversarially.

Naïve Approach: Reduce to MAB

Naïve Approach: Reduce to MAB
Approach 1: assume the context set is finite ($|\mathcal{C}| = N$)
Run a MAB algorithm (e.g. EXP3) for each context independently
The regret bound is $O(\sqrt{TNK \log K})$ [1] (w/ EXP3)
Approach 2: assume the policy space is finite ($|H| = M$)
Run a MAB algorithm (e.g. EXP3) on policies, instead of arms
The regret bound is $O(\sqrt{TM \log M})$ (w/ EXP3)
[1] $\sum_{c=1}^{N} O(\sqrt{n_c K \log K}) \le O(\sqrt{TNK \log K})$, where $n_c$ is the number of times context $c$ is observed (by the Cauchy-Schwarz inequality)

Stochastic Contextual Bandit
UCB & Thompson Sampling

Review: Index Policy and Greedy Algorithm
Since the Gittins index [1], index policies have become one of the most popular strategies for MAB problems
Idea: at each time $t$, define a score $s_{i,t}$ (= index) for each arm $i$. Select the arm with the highest score
Question: how to define a proper $s_{i,t}$?
Naïve approach: use the empirical mean [2]! (greedy algorithm)
However, the naïve greedy algorithm may incur linear, $O(T)$, regret
[1] Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society, 1979.
[2] Note that MAB becomes trivial if we know the true means. The general goal of MAB algorithms is to estimate the means correctly and rapidly (explore-exploit dilemma)

Review: UCB1
Assume $r_{i,t} \sim P_i$ with support $[0, 1]$ and mean $\mu_i$
Idea: favor rarely selected arms over frequently selected ones. In other words, give a confidence bonus [1]!
UCB1 [2]: define the score as
$$s_{i,t} = \hat\mu_{i,t} + \sqrt{\frac{2 \log t}{n_{i,t}}}$$
where $\hat\mu_{i,t}$ is the empirical mean and $n_{i,t}$ is the number of times arm $i$ has been selected
The UCB1 policy guarantees the optimal regret $O(\log T)$
Also, there are other choices for UCB (e.g. KL-UCB [3], Bayes-UCB [4])
[1] We call this bonus UCB (upper confidence bound). Thus, score = estimated mean + UCB.
[2] Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002.
[3] Garivier & Cappé. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. COLT, 2011.
[4] Kaufmann et al. On Bayesian Upper Confidence Bounds for Bandit Problems. AISTATS, 2012.
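The UCB1 score above can be turned into a short simulation. The sketch below is a minimal illustration, assuming Bernoulli arms and an initial pass that plays every arm once so that each index is defined; the function name `ucb1` and its arguments are hypothetical and not taken from the cited paper.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms with the given means; return pull counts."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # n_{i,t}: number of times arm i was selected
    sums = [0.0] * k   # total reward collected from arm i
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play each arm once so every empirical mean exists
        else:
            # score = empirical mean + sqrt(2 log t / n_{i,t})
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

Calling `ucb1([0.1, 0.9], 2000)` returns the number of pulls per arm; with such a large gap the second arm should receive the large majority of pulls.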

LinUCB
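Assume $r_{i,t} \sim P(r_{i,t} \mid c_{i,t}, \theta^*)$ where $\mathbb{E}[r_{i,t}] = c_{i,t}^\top \theta^*$ ($c_{i,t}, \theta^* \in \mathbb{R}^d$)
Like UCB1, we want to define the score as
$$s_{i,t} = c_{i,t}^\top \hat\theta_t + \mathrm{UCB}_{i,t}$$
Question: how to choose a proper $\mathrm{UCB}_{i,t}$?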

LinUCB
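Idea: let $\hat\theta_t$ be the ridge regression estimator of $\theta^*$
$$\hat\theta_t = (C_t^\top C_t + \lambda I_d)^{-1} C_t^\top R_t$$
where $C_t = \{c_1, \ldots, c_{t-1}\}$ and $R_t = \{r_1, \ldots, r_{t-1}\}$
Then, the inequality below holds with probability $1 - \delta/T$
$$\left| c_{i,t}^\top \hat\theta_t - c_{i,t}^\top \theta^* \right| \le (\epsilon + 1) \sqrt{c_{i,t}^\top A_t^{-1} c_{i,t}}$$
where $A_t = C_t^\top C_t + I_d$ (i.e. $\lambda = 1$) and $\epsilon = \sqrt{\frac{1}{2} \log \frac{2TK}{\delta}}$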

LinUCB
LinUCB [1]: define the score as
$$s_{i,t} = c_{i,t}^\top \hat\theta_t + \alpha \sqrt{c_{i,t}^\top A_t^{-1} c_{i,t}}$$
Regret bound (with probability $1 - \delta$) is
$$O\left( d \sqrt{T \log \frac{1 + T}{\delta}} \right)$$
The LinUCB policy guarantees the optimal regret $\tilde{O}(d\sqrt{T})$ (optimal up to logarithmic factors)
Also, there are other choices for UCB (e.g. LinREL [2], CoFineUCB [3])
[1] Li et al. A Contextual-Bandit Approach to Personalized News Article Recommendation. WWW, 2010.
[2] Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. JMLR, 2002.
[3] Yue et al. Hierarchical Exploration for Accelerating Contextual Bandits. ICML, 2012.
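A minimal sketch of the LinUCB score with a single shared parameter vector is given below. It assumes a ridge term of $I_d$ and keeps $A_t^{-1}$ up to date with the Sherman-Morrison rank-one formula, which is a convenience chosen for this example rather than something the slides prescribe. The class name `LinUCB` and its method names are hypothetical.

```python
import math


class LinUCB:
    """LinUCB sketch with one shared parameter, E[r] = c^T theta*."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        # A_1 = I_d, so its inverse also starts as the identity.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of context * reward (C^T R)

    @staticmethod
    def _dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def _matvec(self, v):
        return [self._dot(row, v) for row in self.a_inv]

    def theta(self):
        """Ridge estimate theta_hat = A^{-1} b."""
        return self._matvec(self.b)

    def select(self, contexts):
        """Return the index of the arm with the highest LinUCB score."""
        theta = self.theta()

        def score(c):
            width = math.sqrt(max(self._dot(c, self._matvec(c)), 0.0))
            return self._dot(c, theta) + self.alpha * width

        return max(range(len(contexts)), key=lambda i: score(contexts[i]))

    def update(self, context, reward):
        """Add context * context^T to A via Sherman-Morrison; update b."""
        av = self._matvec(context)  # A^{-1} c (A^{-1} is symmetric)
        denom = 1.0 + self._dot(context, av)
        for i in range(len(context)):
            for j in range(len(context)):
                self.a_inv[i][j] -= av[i] * av[j] / denom
        self.b = [bi + reward * ci for bi, ci in zip(self.b, context)]
```

In use, one would call `select` with the list of per-arm context vectors at each round and then `update` with the chosen arm's context and the observed reward.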

Review: Thompson Sampling
Another popular strategy for MAB is Thompson Sampling [1]
It can be applied to both contextual and non-contextual bandits
Assume $r_{i,t} \sim P(r_{i,t} \mid c_{i,t}, \theta^*)$ with prior $\theta^* \sim P(\theta)$
Idea: sample the estimator $\hat\theta_t$ from the posterior distribution
step 1. draw $\hat\theta_t$ from the posterior $P(\theta \mid D)$, where $D = \{(c_\tau, a_\tau, r_\tau)\}_{\tau < t}$ is the history
step 2. select arm $a_t = \arg\max_i \mathbb{E}[r_{i,t} \mid c_{i,t}, \hat\theta_t]$
The idea is simple, but it works well both in theory [2] and in practice [3]
[1] Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 1933.
[2] Agrawal & Goyal. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. COLT, 2012.
[3] Scott. A Modern Bayesian Look at the Multi-armed Bandit. Applied Stochastic Models in Business and Industry, 2010.
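The two steps above can be illustrated in the simplest non-contextual case. The sketch below assumes Bernoulli rewards with independent Beta(1, 1) priors, for which the posterior is again a Beta distribution; the function name `thompson_bernoulli` and this conjugate model are choices made for the example, not part of the slides.

```python
import random


def thompson_bernoulli(means, horizon, seed=0):
    """Thompson Sampling for Bernoulli arms with Beta(1, 1) priors.

    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    k = len(means)
    wins = [0] * k    # observed rewards equal to 1, per arm
    losses = [0] * k  # observed rewards equal to 0, per arm
    for _ in range(horizon):
        # Step 1: draw one parameter sample per arm from its Beta posterior.
        samples = [rng.betavariate(1 + wins[i], 1 + losses[i])
                   for i in range(k)]
        # Step 2: play the arm that looks best under the sampled parameters.
        arm = max(range(k), key=lambda i: samples[i])
        if rng.random() < means[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [wins[i] + losses[i] for i in range(k)]
```

Exploration comes entirely from the posterior randomness: arms with few observations have wide posteriors and are therefore sampled as "best" more often.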

LinTS
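Assume $r_{i,t} \sim N(c_{i,t}^\top \theta^*, v^2)$ and $\theta^* \sim N(\hat\theta_t, v^2 B_t^{-1})$ where
$$B_t = I_d + \sum_{\tau=1}^{t-1} c_{a_\tau,\tau} c_{a_\tau,\tau}^\top, \qquad \hat\theta_t = B_t^{-1} \sum_{\tau=1}^{t-1} c_{a_\tau,\tau} r_{a_\tau,\tau}$$
$$r_{i,t} \in [\bar r_{i,t} - R, \bar r_{i,t} + R], \qquad v = R \sqrt{\frac{24}{\epsilon} d \log \frac{t}{\delta}}$$
Here $\bar r_{i,t}$ is the mean reward and $\epsilon \in (0, 1)$ is a parameter of the algorithm
Then, the posterior of $\theta^*$ is $N(\hat\theta_{t+1}, v^2 B_{t+1}^{-1})$
LinTS [1]: run Thompson Sampling under this assumption
Regret bound (with probability $1 - \delta$) is
$$O\left( \frac{d^2}{\epsilon} \sqrt{T^{1+\epsilon}} \, \log(Td) \, \log \frac{1}{\delta} \right)$$
[1] Agrawal & Goyal. Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML, 2013.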
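The sampling step of LinTS can be sketched in pure Python as below. The example treats $v$ as a fixed constant instead of the theoretical value, and draws from $N(\hat\theta_t, v^2 B_t^{-1})$ through a Cholesky factor of $B_t$; both are simplifications assumed for this illustration. The names `LinTS`, `cholesky`, `forward_sub` and `backward_sub` are hypothetical helpers.

```python
import math
import random


def cholesky(m):
    """Lower-triangular L with L L^T = m (m symmetric positive definite)."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def forward_sub(low, y):
    """Solve L x = y for lower-triangular L."""
    x = []
    for i, row in enumerate(low):
        x.append((y[i] - sum(row[k] * x[k] for k in range(i))) / row[i])
    return x


def backward_sub(low, y):
    """Solve L^T x = y for lower-triangular L."""
    d = len(low)
    x = [0.0] * d
    for i in reversed(range(d)):
        s = sum(low[k][i] * x[k] for k in range(i + 1, d))
        x[i] = (y[i] - s) / low[i][i]
    return x


class LinTS:
    """Linear Thompson Sampling sketch with a fixed scale v (an assumption)."""

    def __init__(self, dim, v=0.5, seed=0):
        self.v = v
        self.rng = random.Random(seed)
        # B_1 = I_d
        self.b_mat = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.f = [0.0] * dim  # running sum of context * reward

    def sample_theta(self):
        low = cholesky(self.b_mat)
        mean = backward_sub(low, forward_sub(low, self.f))  # B^{-1} f
        z = [self.rng.gauss(0.0, 1.0) for _ in self.f]
        noise = backward_sub(low, z)  # L^{-T} z has covariance B^{-1}
        return [m + self.v * n for m, n in zip(mean, noise)]

    def select(self, contexts):
        theta = self.sample_theta()
        return max(range(len(contexts)),
                   key=lambda i: sum(c * t for c, t in zip(contexts[i], theta)))

    def update(self, context, reward):
        for i, ci in enumerate(context):
            for j, cj in enumerate(context):
                self.b_mat[i][j] += ci * cj
        self.f = [fi + reward * ci for fi, ci in zip(self.f, context)]
```

Recomputing the Cholesky factor every round costs $O(d^3)$; that is acceptable for a sketch but would usually be replaced by an incremental update.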

UCB & TS: Nonlinear Case
Assume $\mathbb{E}[r_{i,t}] = f(c_{i,t})$ is a general nonlinear function
If we assume the reward follows a generalized linear model (exponential family), we can use GLM-UCB [1]
If we assume $f$ is sampled from a Gaussian Process, we can use GP-UCB [2] / CGP-UCB [3]
If we assume $f$ is an element of a Reproducing Kernel Hilbert Space, we can use KernelUCB [4]
Also, we can use Thompson Sampling if we know the form of the probability distribution
[1] Filippi et al. Parametric Bandits: The Generalized Linear Case. NIPS, 2010.
[2] Srinivas et al. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML, 2010.
[3] Krause & Ong. Contextual Gaussian Process Bandit Optimization. NIPS, 2011.
[4] Valko et al. Finite-Time Analysis of Kernelised Contextual Bandits. UAI, 2013.

Stochastic Contextual Bandit
Arbitrary Set of Policies

Epoch-Greedy
Assume the policy space $H$ is finite [1]
Idea: explore for $\epsilon T$ steps and exploit for $T - \epsilon T$ steps (epsilon-first)
issue 1. how to get an unbiased estimator of the best policy?
issue 2. how to balance exploration and exploitation if we don't know $T$?
trick 1: use $D = \{(c_t, a_t, r_t)\}$ observed in the explore steps (arms chosen uniformly at random)
$$\hat\pi = \arg\max_{\pi \in H} \sum_{(c_t, a_t, r_t) \in D} \frac{r_{a_t} \, \mathbb{I}(\pi(c_t) = a_t)}{1/K}$$
trick 2: run epsilon-first in mini-batches (a partition of $T$)
[1] The infinite case w/ finite VC-dimension can be derived in a similar way
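Trick 1 can be sketched as a small policy-selection routine. The example assumes that the logged data were collected by choosing arms uniformly at random, so every logged action has probability $1/K$; the function name `best_policy` and the representation of policies as Python callables are assumptions made for illustration.

```python
def best_policy(policies, data, num_arms):
    """Pick the policy with the highest importance-weighted reward estimate.

    data holds (context, action, reward) triples logged under uniform
    exploration, so each logged action had probability 1 / num_arms.
    """
    def value(policy):
        # Only rounds where the policy agrees with the logged action count,
        # and each is reweighted by the inverse selection probability.
        return sum(r / (1.0 / num_arms) for c, a, r in data if policy(c) == a)

    return max(policies, key=value)
```

Because the policy space is finite, the maximization is a plain loop over `policies`; with a large policy class this step would instead be delegated to a supervised learning oracle.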

Epoch-Greedy
Epoch-Greedy [1]: combine trick 1 & trick 2
Regret bound is $\tilde{O}(T^{2/3})$ (not optimal!)
[1] Langford & Zhang. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. NIPS, 2007.

RandomizedUCB
Idea: estimate the distribution $P_t$ over the policy space $H$
RandomizedUCB [1]: regret bound is $\tilde{O}(\sqrt{T})$, but time complexity is $O(T^6)$
[1] Dudik et al. Efficient Optimal Learning for Contextual Bandits. UAI, 2011.

ILOVECONBANDITS
Idea: similar to RandomizedUCB, but with improved time complexity
ILOVECONBANDITS [1] (Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS): regret bound is $\tilde{O}(\sqrt{T})$, and time complexity is $O(T^{1.5})$
[1] Agarwal et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. ICML, 2014.

Adversarial Contextual Bandit

Review: EXP3
Assume $r_{i,t} \in [0, 1]$ is selected by the environment
In the adversarial setting, the agent must select arms randomly
Idea: put more probability on arms with higher observed rewards
EXP3 [1] (EXPonential-weight algorithm for EXPloration and EXPloitation): regret bound is $O(\sqrt{TK \log K})$
[1] Auer et al. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, 2002.
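A minimal sketch of the exponential-weight idea is shown below. It follows the commonly presented reward-based form of EXP3 with a fixed mixing parameter `gamma`; the function name `exp3`, the `reward_fn(t, arm)` callback and the weight renormalization (added only to avoid floating-point overflow) are assumptions of this example.

```python
import math
import random


def exp3(reward_fn, num_arms, horizon, gamma=0.1, seed=0):
    """EXP3 sketch; reward_fn(t, arm) must return a reward in [0, 1].

    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    counts = [0] * num_arms
    for t in range(horizon):
        total = sum(weights)
        # Mix the exponential weights with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / num_arms for w in weights]
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        # Importance-weighted estimate: unbiased for the chosen arm's reward.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale so the largest weight is 1; probabilities are unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
        counts[arm] += 1
    return counts
```

Only the pulled arm's weight changes in a round, since the rewards of the other arms are never observed.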

EXP4
Idea: run EXP3 on policies, instead of arms
EXP4 [1] (EXPonential-weight algorithm for EXPloration and EXPloitation using EXPert advice): regret bound is $O(\sqrt{TK \log N})$, where $N$ is the number of policies (experts), but variance is high
[1] Auer et al. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, 2002.
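The sketch below illustrates running exponential weights over policies instead of arms. It assumes deterministic policies given as callables from a context to an arm, a fixed `gamma`, and a hypothetical `reward_fn(t, context, arm)` callback; it is a simplified illustration of the EXP4 idea and not a transcription of the published pseudocode.

```python
import math
import random


def exp4(policies, contexts, reward_fn, num_arms, gamma=0.1, seed=0):
    """EXP4 sketch with deterministic policies; returns final policy weights."""
    rng = random.Random(seed)
    weights = [1.0] * len(policies)
    for t, c in enumerate(contexts):
        advice = [pi(c) for pi in policies]  # arm recommended by each policy
        total = sum(weights)
        # Arm probabilities: weighted vote of the policies plus uniform mixing.
        probs = [gamma / num_arms] * num_arms
        for w, a in zip(weights, advice):
            probs[a] += (1 - gamma) * w / total
        arm = rng.choices(range(num_arms), weights=probs)[0]
        estimate = reward_fn(t, c, arm) / probs[arm]
        # Policies that recommended the pulled arm share its estimated reward.
        weights = [
            w * math.exp(gamma * estimate / num_arms) if a == arm else w
            for w, a in zip(weights, advice)
        ]
        top = max(weights)
        weights = [w / top for w in weights]  # rescale to avoid overflow
    return weights
```

The per-round cost is linear in the number of policies, which is why this direct approach is only practical for small policy classes.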

EXP4.P
Idea: run EXP4 with better weights, to make the algorithm stable
EXP4.P [1] (EXP4 with Probability): regret bound is $O(\sqrt{TK \log N})$, with high probability
[1] Beygelzimer et al. Contextual Bandit Algorithms with Supervised Learning Guarantees. AISTATS, 2011.

Supervised Learning to Contextual Bandit

Supervised Learning to Contextual Bandit
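Idea: note that contextual bandit can be viewed as a supervised learning problem with partially observed labels (only the reward of the selected arm is revealed)
Trick: use a randomized algorithm (e.g. epsilon-greedy) and the unbiased estimator of the true reward
$$\hat r_{a_t,t} = \frac{r_{a_t,t}}{p_{a_t}} \qquad (\hat r_{i,t} = 0 \text{ for } i \neq a_t)$$
instead of the observed reward $r_{a_t,t}$, where $p_i$ is the probability of selecting arm $i$. Then,
$$\mathbb{E}[\hat r_{i,t}] = p_i \cdot \frac{r_{i,t}}{p_i} + (1 - p_i) \cdot 0 = r_{i,t}$$
Using this trick, any supervised learning algorithm can be converted to a contextual bandit algorithm
Banditron and NeuralBandit are examples built on perceptron / neural network models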
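The unbiasedness of the importance-weighted estimator can be checked numerically. The sketch below fixes the rewards and the selection probabilities for a single round and averages the estimator over many sampled actions; the function name `ips_estimates` and the fixed-round setup are assumptions made for this illustration.

```python
import random


def ips_estimates(rewards, probs, num_samples, seed=0):
    """Average the estimator r_hat_i = r_i * 1[a = i] / p_i over sampled actions.

    rewards[i] is the true reward of arm i and probs[i] is the probability
    of selecting it; only the selected arm's reward is used in each sample.
    """
    rng = random.Random(seed)
    k = len(rewards)
    totals = [0.0] * k
    for _ in range(num_samples):
        a = rng.choices(range(k), weights=probs)[0]
        totals[a] += rewards[a] / probs[a]  # unselected arms contribute 0
    return [s / num_samples for s in totals]
```

The averages approach the true rewards for every arm, but arms with a small selection probability converge more slowly because their estimates have higher variance.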

Banditron and NeuralBandit
Banditron [1] uses a linear multiclass perceptron and NeuralBandit [2] uses multi-layer perceptrons; both use an epsilon-greedy algorithm w/ an unbiased reward estimator
However, Banditron uses 0-1 loss (classification) while NeuralBandit uses L2 loss (regression)
Regret bound of the original Banditron is $O(T^{2/3})$, and a 2nd-order variant [3] reduces it to $\tilde{O}(\sqrt{T})$
No theoretical guarantee has been proved for NeuralBandit yet
[1] Kakade et al. Efficient Bandit Algorithms for Online Multiclass Prediction. ICML, 2008.
[2] Allesiardo et al. A Neural Networks Committee for the Contextual Bandit Problem. ICONIP, 2014.
[3] Crammer & Gentile. Multiclass Classification with Bandit Feedback using Adaptive Regularization. ICML, 2011.

Summary & Reference

Summary

Reference
[Zhou 2015] A Survey on Contextual Multi-armed Bandits. arXiv, 2015.
[Burtini et al. 2015] A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit. arXiv, 2015.
[Bubeck & Cesa-Bianchi 2012] Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. arXiv, 2012.
