2. Outline
● K-armed bandit problem
● Action-value function
● Exploration & Exploitation
● Example: 10-armed bandit problem
● Incremental method for value estimation
3. Why introduce multi-armed bandit problem?
The multi-armed bandit problem is a simplified decision problem, reduced from the full sequential decision process.
We often use such simplified decision-making problems to discuss key issues in reinforcement learning, e.g., the exploration-exploitation dilemma.
4. One-armed bandit
● A slot machine
● The reward given by the slot machine is generated from some probability distribution.
image source:
https://i.ebayimg.com/images/g/rg0AAOSwwC5aLCsQ/s-l300.jpg
5. K-armed Bandit Problem
Imagine that you are in a casino on Friday night. The casino has many slot machines.
Tonight, your objective is to play those slot machines and win as much money as possible.
How do you choose which slot machine to play?
6. Applications of k-armed bandits problem
● The k-armed bandit problem has been used to model many decision problems that are non-associative. In this problem, each bandit provides a random reward drawn from a probability distribution specific to that bandit.
● Non-associative here means that the decision made at each time step does not need to take the current situation (state, observation) into account.
7. Examples of k-armed bandits problem
● Recommendation systems
● What to eat tonight
● Choosing experimental treatments for a series of seriously ill patients
source: http://hangyebk.baike.com/article-421947.html
source: http://www.cdns.com.tw/news.php?n_id=31&nc_id=51809
8. Action-value function
In our k-armed bandit problem, each of the k actions has an expected or mean reward given that the action is selected; we call this the value of that action.
We denote the action selected on time step t as $A_t$, and the corresponding reward as $R_t$. The value of an arbitrary action $a$ is denoted $q_*(a)$:
$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$
The action value is the expected reward of a specific action; the * here means the "true" action value.
9. Action-value function
If we knew the value of each action, it would be trivial to solve the k-armed bandit problem: always select the action with the highest value.
In practice, we don't know the true action values, but we can use some method to estimate them.
We denote the estimated value of action $a$ at time step $t$ as $Q_t(a)$. We would like $Q_t(a)$ to be close to $q_*(a)$.
10. Exploration & Exploitation
Exploitation
If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions. When you select one of these actions, we say that you are exploiting your current knowledge of the values of the actions.
12. Exploration & Exploitation
$q_1 = 0.5 \qquad q_2 = 1.3 \qquad q_3 = 0.9 \qquad q_4 = 1.1$
In exploitation, we only choose the bandit with the highest action value (here, bandit 2)!
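To make this concrete, here is a minimal sketch of greedy selection (the array `Q` holds the four estimates above; the code is an illustration, not from the slides):
```python
import numpy as np

# Current action-value estimates for the four bandits shown above.
Q = np.array([0.5, 1.3, 0.9, 1.1])

# Exploitation: always select the action with the highest estimated value.
greedy_action = int(np.argmax(Q))
print(greedy_action)  # 1 (0-indexed), i.e., the second bandit
```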
13. Exploration & Exploitation
Exploration
Instead, if you select one of the non-greedy actions, we say that you are exploring, because this enables you to improve your estimates of the non-greedy actions' values.
14. Exploration & Exploitation
Without exploration, the agent's decisions may be suboptimal because of inaccurate action-value estimates.
(figure: an example reward probability distribution; image source: http://philschatz.com/statistics-book/resources/fig-ch06_07_02.jpg)
16. Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation
17. Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation
We'll introduce the sample-average method first, with an example.
19. Sample-average method
The true action value is the expected reward of a specific action:
$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$
One natural way to estimate the action value is to average the rewards actually received:
$$Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$
We call this the sample-average method.
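A minimal sketch of the sample-average method (the names `rewards`, `update`, and `Q` are my own, for illustration):
```python
import numpy as np

k = 10
rewards = [[] for _ in range(k)]  # every reward observed, per action

def update(action, reward):
    """Record the reward received after selecting `action`."""
    rewards[action].append(reward)

def Q(action):
    """Sample-average estimate: mean of all rewards received for `action`."""
    history = rewards[action]
    return np.mean(history) if history else 0.0
```
Note that this stores every reward ever received; the incremental implementation introduced later removes that need.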
20. Greedy policy
The simplest action-selection rule is to select one of the actions with the highest estimated action value, that is, one of the greedy actions. This selection rule is called the greedy policy.
21. Greedy policy
● Always exploits current knowledge
● Without ever sampling apparently inferior actions, it often converges to a suboptimal action
22. Ɛ-greedy policy
Sometimes we need more exploration while maintaining the action-value estimates. A simple alternative is to behave greedily most of the time, but with small probability Ɛ select among all the actions randomly, with equal probability. We call this method the Ɛ-greedy policy.
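A minimal Ɛ-greedy sketch under this definition (the function and variable names are assumptions for illustration):
```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, epsilon):
    """With probability epsilon, explore: select an action uniformly at random.
    Otherwise, exploit: select the action with the highest estimated value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))
```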
23. Ɛ-greedy policy
● Provides better exploration
● Since every action will be sampled an infinite number of times, $Q_t(a)$ will converge to $q_*(a)$
● Needs more time for training (more time to converge)
24. Example: The 10-armed testbed
We take the 10-armed bandit as an example. Each arm has its own reward distribution:
● The actual reward $R_t$ is drawn from a normal distribution with mean $q_*(A_t)$ and variance 1.
● The true action values $q_*(a)$ were themselves drawn from a normal distribution with mean 0 and variance 1.
source: Microsoft research
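A minimal sketch of this testbed (the class name is mine, and the seed is fixed only for reproducibility):
```python
import numpy as np

class TenArmedTestbed:
    """q*(a) ~ N(0, 1) for each of the 10 arms; pulling arm a returns a
    reward drawn from N(q*(a), 1)."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values

    def pull(self, action):
        return self.rng.normal(self.q_star[action], 1.0)  # noisy reward
```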
26. The 10-armed testbed
The data are averaged over 2000 runs (each run 1000 steps).
source: Sutton's textbook
27. The 10-armed bandit
● Ɛ-greedy reaches higher performance than pure greedy
● The smaller Ɛ is, the more steps it needs to converge
● In the long run, the smaller Ɛ achieves better performance
28. How to choose Ɛ ?
In practice, the choice of Ɛ depends on your task, your computational resources, and your deadline.
● If your reward signal is generated by a non-stationary distribution, it is better to start with a larger Ɛ.
● If you have more computational resources, you can run your experiments faster, so they will converge sooner.
29. Ɛ decay
In practice, there is another way to choose Ɛ. At the start of the task, we can use a larger Ɛ to encourage exploration; later, we decrease Ɛ by some scalar at each step until it reaches a minimum value (e.g., 0.005). This method is called Ɛ decay.
● The most common schedule is linear decay, but many other decay schedules exist.
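A minimal linear-decay sketch; the start, end, and decay values are illustrative assumptions, not prescriptions from the slides:
```python
def decayed_epsilon(step, start=1.0, end=0.005, decay=1e-4):
    """Decrease epsilon linearly by `decay` per step, clipped at `end`."""
    return max(end, start - decay * step)

# epsilon shrinks from 1.0 and bottoms out at 0.005 after ~10,000 steps
for t in (0, 5_000, 20_000):
    print(t, decayed_epsilon(t))
```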
30. Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation
Now we return to introduce the incremental implementation.
31. Estimate action-value: Incremental implementation
Previously, we introduced the sample-average method to estimate the action value. However, in practice we don't want to store the reward from every step for each action; an incremental implementation is desired.
33. Estimate action-value: Incremental implementation
Let $Q_n$ denote the estimated value of a specific action after it has been selected $n-1$ times:
$$Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$$
36. Estimate action-value: Incremental implementation
The action value of a specific action can be computed incrementally:
$$Q_{n+1} = Q_n + \frac{1}{n}\bigl[R_n - Q_n\bigr]$$
Its general form is:
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\bigl[\text{Target} - \text{OldEstimate}\bigr]$$
The term $\bigl[R_n - Q_n\bigr]$ is the error in the estimate.
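A minimal sketch of the incremental update (the variable names `Q` and `N` are my own):
```python
import numpy as np

k = 10
Q = np.zeros(k)             # action-value estimates
N = np.zeros(k, dtype=int)  # times each action has been selected

def update(action, reward):
    """Incremental sample average: no reward history is stored."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]  # Q_{n+1} = Q_n + (1/n)[R_n - Q_n]
```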
37. StepSize in stochastic approximation theory
● A step-size sequence $\alpha_n(a)$ guarantees convergence with probability 1 if
$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty,$$
where $n$ is the number of iterations.
● In practice, step sizes that satisfy these conditions tend to learn very slowly, so we often do not adopt them.
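As a quick check (an added note, not on the slide), the sample-average step size $\alpha_n = 1/n$ satisfies both conditions:
$$\sum_{n=1}^{\infty} \frac{1}{n} = \infty \ \text{(harmonic series diverges)}, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty.$$
A constant step size $\alpha_n = \alpha$ violates the second condition, so the estimates never fully converge, which can actually be desirable for non-stationary problems.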
39. Content not covered here
In addition to Ɛ-greedy, there are many other exploration methods:
● Upper confidence bound
● Thompson sampling
Besides, there is also an associative version of the multi-armed bandit problem:
● The contextual bandits problem
Next, we will step into the core concept of reinforcement learning: the Markov Decision Process (MDP).