6. TYPES OF BANDIT PROBLEMS
- Stationary
  - Fixed horizon
  - Infinite horizon
- Dynamic (restless)
  - Fixed horizon
  - Infinite horizon
7. TYPES OF BANDIT PROBLEMS
One-armed bandit problems refer to a choice between an option with a known payout and a different option with an unknown payout.
8. TYPES OF BANDIT PROBLEMS
Multi-armed bandit problems refer to situations where there are multiple alternatives with unknown payouts.
9. APPLICATIONS
Managing research projects
The stock market
Sports coaches tracking changes in team performance
Drivers choosing among a number of possible routes
10. EXPLORATION AND EXPLOITATION
Exploration: selecting arms in order to gain information about their hidden payoffs.
Exploitation: focusing on a single arm in order to obtain rewards from an option believed to be sufficiently good compared to the other competing options.
Expected behavior: exploration first, shifting toward exploitation later (a simple heuristic illustrating this shift is sketched below).
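As a rough illustration of the trade-off (this is a generic epsilon-greedy heuristic, not the model from the paper, and the parameter values are arbitrary assumptions):

```python
import random

def epsilon_greedy_choice(estimated_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best current estimate."""
    if random.random() < epsilon:
        return random.randrange(len(estimated_values))        # explore: pick a random arm
    return max(range(len(estimated_values)),
               key=lambda arm: estimated_values[arm])         # exploit: pick the best arm

def update_estimate(estimated_values, counts, arm, reward):
    """Running-mean update of the chosen arm's value estimate."""
    counts[arm] += 1
    estimated_values[arm] += (reward - estimated_values[arm]) / counts[arm]
```

Small epsilon means mostly exploiting; larger epsilon means more exploring.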
11. SHORT SUMMARY
Stationary bandit problems: the reward rate for each alternative is kept constant over all of the trials.
The number of trials in each game may be known, creating a finite-horizon problem, or unknown, creating an infinite-horizon problem.
Optimal solutions can be found for all cases in
finite horizon environments by using a dynamic
programming approach, where optimal decisions
are computed for all potential cases starting from
the final trial and solving for each trial toward the
first (Kaelbling et al., 1996).
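As a minimal sketch of this idea for a two-armed Bernoulli bandit with uniform Beta(1, 1) priors (an illustration of the general backward computation, not the exact procedure from Kaelbling et al.):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(s1, f1, s2, f2, trials_left):
    """Expected total future reward from this state under optimal play.
    The recursion bottoms out at the final trial, so values are effectively
    computed from the last trial backward via memoization."""
    if trials_left == 0:
        return 0.0
    # Posterior mean reward probability of each arm under a Beta(1, 1) prior.
    p1 = (s1 + 1) / (s1 + f1 + 2)
    p2 = (s2 + 1) / (s2 + f2 + 2)
    # Value of pulling each arm: immediate expected reward plus expected future value.
    v1 = p1 * (1 + value(s1 + 1, f1, s2, f2, trials_left - 1)) \
         + (1 - p1) * value(s1, f1 + 1, s2, f2, trials_left - 1)
    v2 = p2 * (1 + value(s1, f1, s2 + 1, f2, trials_left - 1)) \
         + (1 - p2) * value(s1, f1, s2, f2 + 1, trials_left - 1)
    return max(v1, v2)

print(value(0, 0, 0, 0, trials_left=10))   # value of a fresh 10-trial game
```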
12. SHORT SUMMARY
As the length of a game increases or the number of
alternatives increases, the computation necessary
to create a complete decision tree increases
exponentially.
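A back-of-the-envelope way to see this (the notation here is an assumption, not from the slides): with $K$ arms, binary payoffs, and a horizon of $T$ trials, each trial multiplies the number of choice-outcome paths by $K$ possible choices times $2$ possible outcomes, so the complete tree contains

```latex
\text{number of choice--outcome paths} = (2K)^{T}
```

paths, which is exponential in the horizon $T$.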
Restless bandit problems: the reward rates for the alternatives may change over time, rather than remaining stationary through each trial of the game.
This requires change detection, which forces repeated switches between exploration and exploitation (a small simulation of such an environment is sketched below).
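As a minimal sketch of what a restless environment can look like (the random-walk drift and all parameter values here are illustrative assumptions, not the paper's generative model):

```python
import random

def simulate_restless_bandit(n_arms=4, n_trials=50, drift_sd=0.05, seed=1):
    """Each arm's reward probability takes a small, clipped random-walk step per trial."""
    rng = random.Random(seed)
    probs = [rng.random() for _ in range(n_arms)]
    history = []
    for _ in range(n_trials):
        probs = [min(1.0, max(0.0, p + rng.gauss(0.0, drift_sd))) for p in probs]
        rewards = [1 if rng.random() < p else 0 for p in probs]   # payoff of each arm
        history.append((list(probs), rewards))
    return history
```

Because the probabilities drift, an arm that was best early in the game may no longer be best later, which is why a learner must keep exploring.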
13. OPTIMAL SOLUTIONS VERSUS HEURISTICS
Optimal solutions tend to be fairly ponderous in terms of computational cost and often can only be applied in limited situations.
Heuristics are geared toward obtaining performance that, while not optimal, is still good, with comparatively much less work.
Of course, there are also models that fall between the two extremes in complexity; the particle filter model used in the paper cannot really be counted in either of the two groups.
14. GITTINS INDEX?
A Gittins index gives each alternative a utility value that takes into account the alternative's current estimated value and the information that can be gained from choosing it; the optimal decision is to choose the arm with the largest index value.
Gittins indices are only applicable to a limited number
of bandit problems, and can be difficult to compute
even in those cases (Berry & Fristedt, 1985).
15. GITTINS INDEX?
The Gittins index is a measure of the reward that can be achieved by a process evolving from its present state onward, given the probability that it will be terminated in the future.
It is a real scalar value associated with the state of a stochastic process that has a reward function and a probability of termination.
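In the standard discounted formulation (the notation below is an assumption, not taken from the slides), the index of arm $i$ in state $s$ is

```latex
\nu_i(s) \;=\; \sup_{\tau \ge 1}
\frac{\mathbb{E}\!\left[\sum_{t=0}^{\tau-1} \beta^{t} R_i(t) \,\middle|\, s_i(0)=s\right]}
     {\mathbb{E}\!\left[\sum_{t=0}^{\tau-1} \beta^{t} \,\middle|\, s_i(0)=s\right]}
```

where $\tau$ ranges over stopping times and $\beta \in (0,1)$ is the discount factor; discounting plays the role of the termination probability mentioned above.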
16. TSP AND BANDIT PROBLEMS: THE SAME?
Bandit problems are highly sequential, where
information you gain on each trial can be used to
inform your decisions on subsequent trials.
TSPs are spatial tasks where, generally, all information is available at the outset of the task.
The connections you make between nodes on each
step are really only sequential in the sense that
they aren't made simultaneously.
17. AUTHORS’ MOTIVATION?
When optimal solutions are available, bandit problems
provide an opportunity to examine whether or how people
make the best possible decisions.
For this reason, many previous empirical studies have been motivated by economic theories, with a focus on deviations from rationality in human decision-making (e.g., Banks, Olson, & Porter, 1997; Meyer & Shi, 1995).
More recently, human performance on the bandit problem has been studied within cognitive neuroscience (e.g., Cohen, McClure, & Yu, 2007; Daw, O'Doherty, Dayan, Seymour, & Dolan, 2006) and probabilistic models of human cognition (e.g., Steyvers, Lee, & Wagenmakers, 2009).
18. PARTICLE FILTERS
http://www.youtube.com/watch?v=O-lAJVra1PU
Particle filter:
- Depending on the design, needs less computation time.
- A sophisticated model-estimation technique based on simulation; particle filters are usually used to estimate Bayesian models in which the latent variables are connected in a Markov chain.
- Estimates the distribution of only one of the latent variables at a time, rather than attempting to estimate them all at once, and produces a set of weighted samples, rather than a (usually much larger) set of unweighted samples.

MCMC:
- More computation time with increasing information.
- A class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution.
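As a minimal sketch of a bootstrap particle filter for tracking one arm's drifting reward probability from its 0/1 payoffs (a generic illustration, not the paper's exact model; the drift size, particle count, and clipping bounds are assumed values):

```python
import random

def particle_filter(rewards, n_particles=500, drift_sd=0.05, seed=0):
    """Track a single arm's drifting reward probability from a sequence of 0/1 payoffs."""
    rng = random.Random(seed)
    particles = [rng.random() for _ in range(n_particles)]    # initial guesses for p
    estimates = []
    for r in rewards:                                          # r is 0 or 1
        # 1. Propagate: each particle drifts like the latent reward rate
        #    (clipped away from 0 and 1 so the weights never all collapse to zero).
        particles = [min(0.99, max(0.01, p + rng.gauss(0.0, drift_sd)))
                     for p in particles]
        # 2. Weight: Bernoulli likelihood of the observed payoff.
        weights = [p if r == 1 else 1.0 - p for p in particles]
        # 3. Resample: draw a new, unweighted particle set.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        # Point estimate of the current reward rate.
        estimates.append(sum(particles) / n_particles)
    return estimates

print(particle_filter([1, 1, 0, 1, 0, 0, 0])[-1])              # toy usage
```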
19. THE PAPER
Modeling Human Performance in Restless Bandits with Particle Filters
(Fall 2009)
20. EXPERIMENT 1
The restless bandit problem is an extension of the sequential stationary infinite-horizon problem.
The behavior of human participants in a restless bandit environment is observed and compared to two different particle filter solution methods: one optimal, the other suboptimal.
27 participants from UCI took part for course credit.
24. OVERALL CONCLUSIONS
Many potential applications:
Clinical trials
Advertising: which ad to put on a web page?
Labor markets: which job should a worker choose?
Optimization of noisy functions
Numerical resource allocation
25. OVERALL CONCLUSIONS
How to solve: Monte Carlo methods, Markov chain methods, particle filters, or the Gittins index.
The paper focuses on human performance rather than on the optimal solution, and does not use the Gittins index.
27. REFERENCES
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5), 527-535.
Berry, D. A., & Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments. Monographs on Statistics and Applied Probability. London: Chapman & Hall. ISBN 0-412-24810-7.
Gittins, J. C. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. Chichester: John Wiley & Sons. ISBN 0-471-92059-2.
Doucet, A., De Freitas, N., & Gordon, N. J. (2001). Sequential Monte Carlo Methods in Practice. Springer.
28. QUESTIONS?
That’s All Folks!
How do we make money?
If we understand this model well, Vegas is waiting!