2. Outline
● K-armed bandit problem
● Action-value function
● Exploration & Exploitation
● Example: 10-armed bandit problem
● Incremental method for value estimation
3. Why introduce multi-armed bandit problem?
The multi-armed bandit problem is a simplified decision problem, reduced from the full sequential decision process.
We often use such simplified decision-making problems to discuss key issues in reinforcement learning, e.g., the exploration-exploitation dilemma.
4. One-armed bandit
● A slot machine
● The reward given by the slot machine is generated from some probability distribution.
image source:
https://i.ebayimg.com/images/g/rg0AAOSwwC5aLCsQ/s-l300.jpg
5. K-armed Bandit Problem
Imagine that you are in a casino on Friday night. The casino has many slot machines.
Tonight, your objective is to play those slot machines and win as much money as possible.
How do you choose which slot machine to play?
6. Applications of k-armed bandits problem
● The k-armed bandit problem has been used to model many decision problems that are non-associative. In this problem, each bandit provides a random reward drawn from a probability distribution specific to that bandit.
● Non-associative here means that the decision made at each time step does not need to take the current situation (state, observation) into account.
7. Examples of k-armed bandits problem
● Recommendation systems
● What to eat tonight
● Choosing experimental treatments for a series of seriously ill patients
source: http://hangyebk.baike.com/article-421947.html
source: http://www.cdns.com.tw/news.php?n_id=31&nc_id=51809
8. Action-value function
In our k-armed bandit problem, each of the k actions has an expected or mean reward given that the action is selected; we call this the value of that action.
We denote the action selected on time step t as $A_t$, and the corresponding reward as $R_t$. The value of an arbitrary action $a$ is denoted $q_*(a)$:
$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$
The action value is the expected reward of a specific action; the * here means the "true" action value.
9. Action-value function
If we knew the value of each action, it would be trivial to solve the k-armed bandit problem: always select the action with the highest value.
In practice, we don't know the true action values, but we can use some method to estimate them.
We denote the estimated value of action $a$ at time step $t$ as $Q_t(a)$. We would like $Q_t(a)$ to be close to $q_*(a)$.
10. Exploration & Exploitation
Exploitation
If you maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions. When you select one of these actions, we say that you are exploiting your current knowledge of the values of the actions.
12. Exploration & Exploitation
$q_1 = 0.5 \qquad q_2 = 1.3 \qquad q_3 = 0.9 \qquad q_4 = 1.1$
In exploitation, we only choose the bandit with the highest action value (here, bandit 2)!
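To make this concrete, here is a minimal sketch of greedy selection (the array `Q` holds the four estimates above; the code is an illustration, not from the slides):
```python
import numpy as np

# Current action-value estimates for the four bandits shown above.
Q = np.array([0.5, 1.3, 0.9, 1.1])

# Exploitation: always select the action with the highest estimated value.
greedy_action = int(np.argmax(Q))
print(greedy_action)  # 1 (0-indexed), i.e., the second bandit
```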
13. Exploration & Exploitation
Exploration
Instead, if you select one of the non-greedy actions, we say that you are exploring, because this enables you to improve your estimates of the non-greedy actions' values.
14. Exploration & Exploitation
Without exploration, the agent's decisions may be suboptimal because of inaccurate action-value estimates.
(figure: an example reward probability distribution; image source: http://philschatz.com/statistics-book/resources/fig-ch06_07_02.jpg)
16. Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation
17. Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation
We'll introduce the sample-average method first, with an example.
19. Sample-average method
The true action value is the expected reward of a specific action:
$$q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a]$$
One natural way to estimate the action value is to average the rewards actually received:
$$Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$$
We call this the sample-average method.
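A minimal sketch of the sample-average method (the names `rewards`, `update`, and `Q` are my own, for illustration):
```python
import numpy as np

k = 10
rewards = [[] for _ in range(k)]  # every reward observed, per action

def update(action, reward):
    """Record the reward received after selecting `action`."""
    rewards[action].append(reward)

def Q(action):
    """Sample-average estimate: mean of all rewards received for `action`."""
    history = rewards[action]
    return np.mean(history) if history else 0.0
```
Note that this stores every reward ever received; the incremental implementation introduced later removes that need.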
20. Greedy policy
The simplest action-selection rule is to select one of the actions with the highest estimated action value, that is, one of the greedy actions. This selection rule is called the greedy policy.
21. Greedy policy
● Always exploits current knowledge
● Without ever sampling apparently inferior actions, it often converges to a suboptimal action
22. Ɛ-greedy policy
Sometimes we need more exploration while maintaining the action-value estimates. A simple alternative is to behave greedily most of the time, but with small probability Ɛ select among all the actions randomly, with equal probability. We call this method the Ɛ-greedy policy.
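A minimal Ɛ-greedy sketch under this definition (the function and variable names are assumptions for illustration):
```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, epsilon):
    """With probability epsilon, explore: select an action uniformly at random.
    Otherwise, exploit: select the action with the highest estimated value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))
```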
23. Ɛ-greedy policy
● Provides better exploration
● Since every action will be sampled an infinite number of times, $Q_t(a)$ will converge to $q_*(a)$
● Needs more time for training (more time to converge)
24. Example: The 10-armed testbed
We take the 10-armed bandit as an example. Each arm has its own reward distribution:
● The actual reward $R_t$ is drawn from a normal distribution with mean $q_*(A_t)$ and variance 1.
● The true action values $q_*(a)$ were themselves drawn from a normal distribution with mean 0 and variance 1.
source: Microsoft research
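A minimal sketch of this testbed (the class name is mine, and the seed is fixed only for reproducibility):
```python
import numpy as np

class TenArmedTestbed:
    """q*(a) ~ N(0, 1) for each of the 10 arms; pulling arm a returns a
    reward drawn from N(q*(a), 1)."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values

    def pull(self, action):
        return self.rng.normal(self.q_star[action], 1.0)  # noisy reward
```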
26. The 10-armed testbed
The data are averaged over 2000 runs (each run 1000 steps).
source: Sutton's textbook
27. The 10-armed bandit
● Ɛ-greedy reaches higher performance than pure greedy
● The smaller Ɛ is, the more steps it needs to converge
● In the long run, the smaller Ɛ achieves better performance
28. How to choose Ɛ ?
In practice, the choice of Ɛ depends on your task, your computational resources, and your deadline.
● If your reward signal is generated by a non-stationary distribution, it is better to start with a larger Ɛ.
● If you have more computational resources, you can run your experiments faster, so they will converge sooner.
29. Ɛ decay
In practice, there is another way to choose Ɛ. At the start of the task, we can use a larger Ɛ to encourage exploration; later, we decrease Ɛ by some scalar at each step until it reaches a minimum value (e.g., 0.005). This method is called Ɛ decay.
● The most common schedule is linear decay, but many other decay schedules exist.
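A minimal linear-decay sketch; the start, end, and decay values are illustrative assumptions, not prescriptions from the slides:
```python
def decayed_epsilon(step, start=1.0, end=0.005, decay=1e-4):
    """Decrease epsilon linearly by `decay` per step, clipped at `end`."""
    return max(end, start - decay * step)

# epsilon shrinks from 1.0 and bottoms out at 0.005 after ~10,000 steps
for t in (0, 5_000, 20_000):
    print(t, decayed_epsilon(t))
```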
30. Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation
Now we return to introduce the incremental implementation.
31. Estimate action-value: Incremental implementation
Previously, we introduced the sample-average method to estimate the action value. However, in practice we don't want to store the reward from every step for each action; an incremental implementation is desired.
33. Estimate action-value: Incremental implementation
Let $Q_n$ denote the estimated value of a specific action after it has been selected $n-1$ times:
$$Q_n \doteq \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$$
36. Estimate action-value: Incremental implementation
The action value of a specific action can be computed incrementally:
$$Q_{n+1} = Q_n + \frac{1}{n}\bigl[R_n - Q_n\bigr]$$
Its general form is:
$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\bigl[\text{Target} - \text{OldEstimate}\bigr]$$
The term $\bigl[R_n - Q_n\bigr]$ is the error in the estimate.
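A minimal sketch of the incremental update (the variable names `Q` and `N` are my own):
```python
import numpy as np

k = 10
Q = np.zeros(k)             # action-value estimates
N = np.zeros(k, dtype=int)  # times each action has been selected

def update(action, reward):
    """Incremental sample average: no reward history is stored."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]  # Q_{n+1} = Q_n + (1/n)[R_n - Q_n]
```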
37. StepSize in stochastic approximation theory
● A step-size sequence $\alpha_n(a)$ guarantees convergence with probability 1 if
$$\sum_{n=1}^{\infty} \alpha_n(a) = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty,$$
where $n$ is the number of iterations.
● In practice, step sizes that satisfy these conditions tend to learn very slowly, so we often do not adopt them.
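As a quick check (an added note, not on the slide), the sample-average step size $\alpha_n = 1/n$ satisfies both conditions:
$$\sum_{n=1}^{\infty} \frac{1}{n} = \infty \ \text{(harmonic series diverges)}, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty.$$
A constant step size $\alpha_n = \alpha$ violates the second condition, so the estimates never fully converge, which can actually be desirable for non-stationary problems.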
39. Content not covered here
In addition to Ɛ-greedy, there are many other exploration methods:
● Upper confidence bound
● Thompson sampling
Besides, there is also an associative version of the multi-armed bandit problem:
● The contextual bandits problem
Next, we will step into the core concept of reinforcement learning: the Markov Decision Process (MDP).