If you have at least two ways of doing the same thing and want to decide which one is better in a data-driven way, bandit algorithms help you do that. They are an alternative to the traditional hypothesis-testing approach used to measure A/B testing systems, and they often do better: they make decisions in real time and continuously learn about the system. This explains how to handle the trade-off between exploration and exploitation.
2. Overview
- K Slot Machines
- Multi Armed Bandit Problems
- A/B Testing
- MAB Algorithms
- Summary
3. K Slot Machines
- Choose a machine and receive a reward
- T turns (chances)
- What is your goal?
- Maximize the cumulative reward
- How do you choose the machines (arms)?
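The setup above can be sketched as a tiny simulation. The arm means [0.2, 0.5, 0.8] and the uniform-random player are illustrative assumptions, not values from the slides:

```python
import random

class KArmedBandit:
    """K simulated slot machines; the true mean rewards are hidden from the player."""
    def __init__(self, means, seed=None):
        self.means = means                  # hidden mean reward per arm
        self.rng = random.Random(seed)

    def pull(self, arm):
        # Bernoulli reward: 1 with probability means[arm], else 0
        return 1 if self.rng.random() < self.means[arm] else 0

# Naive baseline: T turns, choosing an arm uniformly at random each turn
bandit = KArmedBandit([0.2, 0.5, 0.8], seed=42)
chooser = random.Random(1)
T = 1000
total = sum(bandit.pull(chooser.randrange(3)) for _ in range(T))
```

A smarter player than this uniform-random baseline is exactly what the MAB algorithms on the next slides provide.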
4. Multi Armed Bandit Problem (MAB)
- Goal: two-fold
- Try different arms (Exploration)
- Play the seemingly most rewarding arm (Exploitation)
- Explore-Exploit Trade-Off
- Multi Armed Bandit Algorithms
- Reward distribution (unknown)
- Mean rewards: <µ1, . . . , µK>
- Reward standard deviations: <σ1, . . . , σK>
- Regret: Regret(T) = T·µ* − Σ_t µ(a_t), where µ* = max_i µ_i and a_t is the arm played at turn t
- Maximize cumulative reward = Minimize regret
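The regret definition above can be written out directly; the arm means used in the example calls are illustrative, not from the slides:

```python
def regret(mu, pulls):
    """Cumulative regret after T pulls.
    mu:    hidden mean rewards <mu_1, ..., mu_K>
    pulls: sequence of arm indices chosen over the T turns
    Regret(T) = T * mu_star - sum_t mu[a_t], where mu_star = max_i mu_i
    """
    mu_star = max(mu)
    return len(pulls) * mu_star - sum(mu[a] for a in pulls)

# Always playing the best arm gives zero regret;
# each pull of a worse arm i adds (mu_star - mu_i)
print(regret([0.25, 0.5, 0.75], [2, 2, 2, 2]))  # -> 0.0
print(regret([0.25, 0.5, 0.75], [0, 1, 2, 2]))  # -> 0.75
```

Note the player can never compute this quantity online, since the true means are unknown; it is the yardstick used to analyze and compare algorithms.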
5. A/B Testing
5
- Advertisement selection for a request from a pool of advertisements
- Rewards: CTR/AR or CPM
- Recommendation of news articles to users
- Product pricing and promotional offers
- MAB is used to measure the performance of A/B Testing experiments
7. Epsilon-greedy Algorithm
- Choose epsilon (Ɛ): the exploration factor
- Play the best arm with probability (1 − Ɛ): Exploitation
- Play a random arm with probability Ɛ: Exploration
Note:
- Typical value of Ɛ = 0.10 (10%)
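The three bullets above map directly onto a short loop. This is a minimal sketch: the Bernoulli arms, their means, and the seeds are illustrative assumptions:

```python
import random

def epsilon_greedy(pull, K, T, eps=0.10, seed=0):
    """Epsilon-greedy over T turns; pull(arm) returns an observed reward."""
    rng = random.Random(seed)
    counts = [0] * K                 # number of pulls per arm
    values = [0.0] * K               # running mean reward per arm
    total = 0.0
    for _ in range(T):
        if rng.random() < eps:       # explore: random arm with probability eps
            arm = rng.randrange(K)
        else:                        # exploit: current best-looking arm
            arm = max(range(K), key=lambda a: values[a])
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]   # incremental mean update
        total += r
    return total, values

# Bernoulli arms with hidden means; arm 2 is truly the best
env_rng = random.Random(1)
means = [0.2, 0.5, 0.8]
total, est = epsilon_greedy(lambda a: 1 if env_rng.random() < means[a] else 0,
                            K=3, T=5000)
```

After enough turns the running means `est` concentrate near the hidden means, so the greedy step settles on the best arm while the Ɛ fraction of turns keeps checking the others.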
13. Summary
- Each algorithm has an upper bound on its regret
- The bound is a function of the reward distribution
- Each algorithm has a tuning parameter
- Parameter tuning depends on the reward distribution
- Choose the right MAB algorithm based on simulations/historical data
- All these algorithms keep learning automatically over their lifetime
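One way to act on the "choose based on simulations" advice is to replay candidate tuning parameters against simulated (or logged) traffic and compare cumulative reward. The arm means and epsilon values below are illustrative assumptions:

```python
import random

def run_eps_greedy(eps, means, T, seed):
    """Simulate epsilon-greedy on Bernoulli arms; return cumulative reward."""
    rng = random.Random(seed)
    K = len(means)
    counts, values, total = [0] * K, [0.0] * K, 0
    for _ in range(T):
        arm = (rng.randrange(K) if rng.random() < eps
               else max(range(K), key=lambda a: values[a]))
        r = 1 if rng.random() < means[arm] else 0   # simulated click/no-click
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
        total += r
    return total

# Compare tuning parameters on simulated traffic before deploying
means = [0.2, 0.5, 0.8]
scores = {eps: run_eps_greedy(eps, means, T=5000, seed=7)
          for eps in (0.01, 0.10, 0.50)}
```

Over-exploring (Ɛ = 0.50) wastes half the turns on random arms, so its cumulative reward trails the moderate setting; this kind of offline sweep is how the tuning parameter gets picked in practice.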