Multi-armed bandit
Jie-Han Chen
NetDB, National Cheng Kung University
3/27, 2018 @ National Cheng Kung University, Taiwan
Outline
● K-armed bandit problem
● Action-value function
● Exploration & Exploitation
● Example: 10-armed bandit problem
● Incremental method for value estimation
2
Why introduce multi-armed bandit problem?
The multi-armed bandit problem is a simplified form of the sequential decision
process.
We often use this simplified decision-making problem to discuss issues in
reinforcement learning, e.g., the exploration-exploitation dilemma.
3
One-armed bandit
● A slot machine (吃角子老虎機)
● The reward given by the slot machine is
generated from some probability
distribution.
4
image source:
https://i.ebayimg.com/images/g/rg0AAOSwwC5aLCsQ/s-l300.jpg
K-armed Bandit Problem
Imagine that you are in a casino on
Friday. In the casino, there are many slot
machines.
Tonight, your objective is to play these
slot machines and earn as much money
as possible.
How do you choose which slot machine to play?
5
Applications of the k-armed bandit problem
● The k-armed bandit problem has been used to model many decision problems
that are non-associative. In this problem, each bandit provides a random
reward drawn from a probability distribution specific to that bandit.
● Non-associative here means: the decision made at each time step does not need
to consider the situation (state, observation)
6
Examples of the k-armed bandit problem
● Recommendation system
● What do we eat tonight
● Choose the experimental treatments for a series of seriously ill patients
7
source:
http://hangyebk.baike.com/article-421947.html
source:
http://www.cdns.com.tw/news.php?n_id=31&nc
_id=51809
Action-value function
In our k-armed bandit problem, each of the k actions has an expected or mean
reward given that that action is selected; we call this the value of that action.
We denote the action selected on time step t as At, and the corresponding
reward as Rt. The value of an arbitrary action a is denoted q*(a).
8
The action-value q*(a) is the expected reward of a
specific action; the * here means the "true"
action-value.
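The defining equation on the original slide is an image; written out in standard notation consistent with the slide's wording, it reads:

```latex
% True action-value: the expected reward given that action a is selected
q_*(a) \doteq \mathbb{E}\left[ R_t \mid A_t = a \right]
```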
Action-value function
If we knew the value of each action, it would be trivial to solve the k-armed
bandit problem: always select the action with the highest value.
In practice, we don't know the true action-value q*(a), but we can use some
method to estimate it.
We denote the estimated value of action a at time step t as Qt(a). We would
like Qt(a) to be close to q*(a).
9
Exploration & Exploitation
Exploitation
If you maintain estimates of the action values, then at any time step there is at least
one action whose estimated value is greatest. We call these the greedy actions. When
you select one of these actions, we say that you are exploiting your current
knowledge of the values of the actions.
10
Exploration & Exploitation
11
q1 = 0.5 q2 = 1.3 q3 = 0.9 q4 = 1.1
The expected action values are our
knowledge
Exploration & Exploitation
12
q1 = 0.5 q2 = 1.3 q3 = 0.9 q4 = 1.1
In exploitation, we only choose the
bandit with the highest action value!
Exploration & Exploitation
Exploration
Instead, if you select one of the non-greedy actions, then we say you are exploring,
because this enables you to improve the estimated action-values of the non-greedy
actions.
13
Exploration & Exploitation
Without exploration, the agent’s decision may be suboptimal because of inaccurate
action-value estimation.
14
(figure: a reward probability distribution)
image source:
http://philschatz.com/statistics-book/resources/fig-ch06_07_02.jpg
Exploration & Exploitation
● What to eat tonight
○ Exploitation: 水工坊 has always been good; let's go there again tonight!
○ Exploration: A new place called 香香麵 just opened next door and I've never tried it; let's go check it out.
● On the way home
○ Exploitation: Taking the usual route is fine.
○ Exploration: The usual route always involves a long wait; other routes might be faster.
● Buying skate shoes
○ Exploitation: My comfortable skate shoes are worn out; buy the same pair again.
○ Exploration: I heard someone bought a pair of shoes at a place called 魅力之都 and even wrote a song about it; I'll go look there too.
15
Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation
16
Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation
17
We'll introduce this one first, with
an example.
Sample-average method
The true action-value is the expected reward of a specific action:
18
Sample-average method
The true action-value is the expected reward of a specific action:
One natural way to estimate the action-value is by averaging the rewards actually
received. We call this the sample-average method.
19
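The formula on the original slide is an image; using the notation above, the sample-average estimate can be written as:

```latex
% Sample-average estimate: mean of the rewards received when action a was selected
Q_t(a) \doteq \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}
```

By the law of large numbers, Qt(a) converges to q*(a) as action a is selected infinitely often.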
Greedy policy
The simplest action selection rule is to select one of the actions with the highest
estimated action-value, that is, one of the greedy actions. This action selection
policy is called the greedy policy.
20
Greedy policy
● Always exploits current knowledge
● Without sampling apparently inferior actions, it will often converge to a
suboptimal action
21
Ɛ-greedy policy
Sometimes we need more exploration while maintaining the action-value estimates. A simple
alternative is to behave greedily most of the time, but with a small probability Ɛ
select an action at random, with all actions equally likely. We call this method the Ɛ-greedy
policy.
22
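As a concrete illustration (not part of the original slides), here is a minimal Python sketch of Ɛ-greedy selection over a list of estimated action values Q; the function name and structure are illustrative:

```python
import random

def epsilon_greedy(Q, epsilon):
    """Select an arm: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        # Exploration: pick an arm uniformly at random
        return random.randrange(len(Q))
    # Exploitation: pick a greedy arm (highest estimated action value)
    return max(range(len(Q)), key=lambda a: Q[a])
```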
Ɛ-greedy policy
● Better exploration
● Since every action will be sampled an infinite number of times, Qt(a) will
converge to q*(a)
● Needs more training time (more steps to converge)
23
We take the 10-armed bandit as an example.
Each arm has its own reward distribution.
● The actual reward, Rt, was drawn from a
normal distribution with mean q*(At)
and variance 1.
● The action values q*(a) were
drawn from a normal distribution with
mean 0 and variance 1.
Example: The 10-armed testbed
source: Microsoft research
24
source: Sutton’s textbook 25
The 10-armed testbed
The data are averaged over 2000 runs (each run is 1000 steps)
source: Sutton’s textbook
26
The 10-armed bandit
● Ɛ-greedy can reach higher
performance than pure greedy
● The smaller Ɛ is, the more
steps it needs to converge
● In the long run, the smaller-Ɛ agent
gets better performance
27
How to choose Ɛ ?
In practice, the choice of Ɛ depends on your task, your computational resources,
and the deadline for your task.
● If your reward signal is generated by a non-stationary distribution, you had
better use a larger Ɛ at first.
● If you have more computational resources, you can run your experiments faster so
that training converges sooner.
28
Ɛ decay
In practice, there is another way to choose Ɛ. At the start of the task, we can
use a larger Ɛ to encourage exploration. Later, we decrease Ɛ by some amount
each step until it reaches a minimum value (e.g., 0.005). This method is called Ɛ
decay.
● The most common schedule is linear decay, but there are also many other decay
scheduling methods.
29
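A minimal Python sketch of a linear Ɛ-decay schedule, assuming the 0.005 floor mentioned above; the starting value and per-step decrement are illustrative:

```python
def decayed_epsilon(step, eps_start=1.0, eps_min=0.005, decay_per_step=1e-4):
    """Linearly decrease epsilon each step until it reaches its minimum value."""
    return max(eps_min, eps_start - decay_per_step * step)
```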
Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation
30
Now we return to introduce this
one.
Estimate action-value: Incremental implementation
Previously, we introduced the sample-average method to estimate the action
value. However, in practice, we don't want to store every reward received for each
action. An incremental implementation is desired.
31
Estimate action-value: Incremental implementation
Previously, we introduced the sample-average method to estimate the action
value. However, in practice, we don't want to store every reward received for each
action. An incremental implementation is desired.
32
Estimate action-value: Incremental implementation
Let Qn denote the estimated action value of a
specific action after it has been
selected n-1 times.
33
Estimate action-value: Incremental implementation
Let Qn denote the estimated action value of a
specific action after it has been
selected n-1 times.
34
Estimate action-value: Incremental implementation
The action value of a specific action:
Its general form is:
35
Estimate action-value: Incremental implementation
The action value of a specific action:
Its general form is:
This is the error in the estimate
36
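The update equations on these slides are images; reconstructed in the notation above (a standard derivation consistent with the slide text), they read:

```latex
% Sample average rewritten as an incremental update
Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = Q_n + \frac{1}{n}\bigl(R_n - Q_n\bigr)

% General form; (R_n - Q_n) is the error in the estimate
Q_{n+1} = Q_n + \alpha \bigl(R_n - Q_n\bigr)
```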
StepSize in stochastic approximation theory
● The standard conditions are Σn αn(a) = ∞ and Σn αn²(a) < ∞, where n is the iteration number
● In practice, a step size that satisfies the above conditions learns very
slowly, so we may not adopt these conditions.
37
A simple bandit algorithm
38
source: Sutton's book
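The pseudocode on the original slide is an image from Sutton's book; below is a minimal Python sketch of an Ɛ-greedy bandit loop with incremental sample-average updates, assuming Gaussian arm rewards as in the 10-armed testbed (all names and default values are illustrative):

```python
import random

def run_bandit(k=10, steps=1000, epsilon=0.1):
    """Ɛ-greedy k-armed bandit with incremental sample-average updates."""
    true_values = [random.gauss(0, 1) for _ in range(k)]  # q*(a), as in the testbed
    Q = [0.0] * k   # estimated action values
    N = [0] * k     # selection counts per action
    total_reward = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                # explore: random arm
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit: greedy arm
        reward = random.gauss(true_values[a], 1)   # Rt ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]             # incremental update
        total_reward += reward
    return Q, total_reward
```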
The content not covered here
In addition to Ɛ-greedy, there are many other exploration methods:
● Upper confidence bound
● Thompson sampling
Besides, there is also an associative version of the multi-armed bandit problem:
● The contextual bandit problem
Next, we will step into the core concept of reinforcement learning: the Markov Decision
Process (MDP).
39
Questions?
40
