1. The document discusses personalized news article recommendation using a contextual bandit approach to balance exploration and exploitation when suggesting articles to users.
2. It provides examples of contextual bandits in web services and clinical decision making.
3. The key challenge is how to quickly identify relevant news stories on a personal level for both new and existing users given changing article relevance over time.
4. Two linear contextual bandit algorithms, LinUCB with disjoint and hybrid models, are proposed to learn the best policy for selecting news articles to maximize click-through rates based on user and article features.
3. Example of Learning through Exploration
Repeatedly:
1. A user comes to Yahoo! (with a history of previous visits, IP address, and data related to their Yahoo! account)
2. Yahoo! chooses information to present (from URLs, Ads, news stories)
3. The user reacts to the presented information (clicks on something, comes back and clicks again, etc.)
Yahoo! wants to interactively choose content and use the observed feedback to improve future content choices.
4. Another Example: Clinical Decision Making
Repeatedly:
1. A patient comes to a doctor with symptoms, medical history, test results
2. The doctor chooses and suggests a treatment
3. The patient responds to it
The doctor wants a policy for choosing targeted treatments for individual patients.
5. Current Scenario
Which article to feature?
Challenges:
● A large number of new users and articles.
● Incorporating content (user and article) information.
● Article relevance changes over time.
Goal:
"Quickly" identify relevant news stories on
personal level.
6. The Contextual Bandit Setting
For t = 1, ..., T:
1. The world produces some context x_t ∈ X
2. The learner chooses an action a_t ∈ {1, ..., K}
3. The world reacts with reward r_t(a_t) ∈ [0, 1]
Goal: Learn a good policy for choosing actions given context.
What does learning mean?
7. The Contextual Bandit Setting (Contd.)
What does learning mean?
Efficiently competing with a large reference class of possible policies Π = { π : X → {1, ..., K} }
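A minimal sketch of this protocol in Python (the world object and all names here are hypothetical stand-ins for a real source of contexts and rewards):

```python
import random

def contextual_bandit_loop(policy, world, T, K):
    """Run the contextual bandit protocol for T rounds."""
    total_reward = 0.0
    for t in range(T):
        x_t = world.context()         # 1. the world produces a context x_t
        a_t = policy.choose(x_t, K)   # 2. the learner picks an action in {0, ..., K-1}
        r_t = world.reward(x_t, a_t)  # 3. a reward in [0, 1], seen only for a_t
        policy.update(x_t, a_t, r_t)  # learn from the observed feedback
        total_reward += r_t
    return total_reward

class RandomPolicy:
    """A fixed policy from the reference class Pi that ignores the context."""
    def choose(self, x, K):
        return random.randrange(K)
    def update(self, x, a, r):
        pass  # a fixed policy learns nothing
```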
8. Some Remarks
This is not a supervised learning problem.
● We don’t know the reward of actions not taken,
○ so the loss function is unknown even at training time.
● Exploration is needed to succeed.
● Simpler than reinforcement learning,
○ We know which action is responsible for each reward.
9. Some Remarks (Contd.)
This is not a bandit problem.
● In the bandit setting, there is no x, and the goal is to compete with the set of constant actions.
○ Too weak in practice.
● Generalization across x is required to succeed.
10. Mapping to our current problem
For each time t = 1, 2, 3, ..., T, the news page is loaded:
1. Arms or actions are the articles that can be shown to the user. The context consists of user and article information.
2. If article a is clicked, r_{t,a} = 1; otherwise r_{t,a} = 0.
3. Use the observed clicks to improve article selection.
Goal: Maximize the expected click-through rate (CTR), i.e., the expected total reward Σ_t r_{t,a_t} over all page views.
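Since the reward is the click indicator, maximizing the expected total reward is the same as maximizing CTR. A tiny illustration with hypothetical click data:

```python
def empirical_ctr(rewards):
    """CTR = clicks / impressions, since r_{t,a} is 1 on a click and 0 otherwise."""
    return sum(rewards) / len(rewards) if rewards else 0.0

# e.g. 3 clicks out of 10 displayed articles -> CTR = 0.3
print(empirical_ctr([1, 0, 0, 1, 0, 0, 1, 0, 0, 0]))
```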
12. LinUCB (Disjoint Linear Model)
Assumption: The expected reward for action a is a linear function of the features of the context, i.e.:

E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ_a*

1. In each trial t, for each a ∈ A_t, estimate θ_a via regularized (ridge) linear regression using the feature matrix D_a.
2. Choose a_t = argmax_{a ∈ A_t} ( x_{t,a}^T θ_a + α √( x_{t,a}^T A_a^{-1} x_{t,a} ) ), where A_a = D_a^T D_a + I_d and the constant α > 0 scales the exploration bonus.
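A minimal sketch of the disjoint model in Python, following the ridge-regression form above (the feature dimension d, the arm indexing, and α are assumptions of this sketch, not values from the slides):

```python
import numpy as np

class LinUCBDisjoint:
    """Disjoint-model LinUCB: one ridge-regression estimate theta_a per arm."""

    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha                             # scales the exploration bonus
        self.A = [np.eye(d) for _ in range(n_arms)]    # A_a = D_a^T D_a + I_d
        self.b = [np.zeros(d) for _ in range(n_arms)]  # b_a = D_a^T c_a (clicks)

    def choose(self, contexts):
        """contexts: one d-dimensional feature vector x_{t,a} per arm."""
        scores = []
        for a, x_a in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                  # ridge estimate of theta_a
            scores.append(theta @ x_a + self.alpha * np.sqrt(x_a @ A_inv @ x_a))
        return int(np.argmax(scores))                  # upper-confidence-bound rule

    def update(self, a, x_a, r):
        """Fold the observed reward for the chosen arm into its statistics."""
        self.A[a] += np.outer(x_a, x_a)
        self.b[a] += r * x_a
```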
13. LinUCB (Hybrid Model)
Assumption: The expected reward for action a is the sum of two linear terms, one that is shared by all actions and one that is specific to each action, i.e.:

E[r_{t,a} | x_{t,a}] = z_{t,a}^T β* + x_{t,a}^T θ_a*

The algorithm works similarly to the disjoint LinUCB algorithm above, except that the coefficients β* are shared across all arms.
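The hybrid model needs both feature vectors per event. In the underlying paper the shared features z_{t,a} are built as the flattened outer product of user and article features; a small sketch of that construction, with hypothetical example values:

```python
import numpy as np

def hybrid_features(user, article):
    """Build (z, x) for the hybrid model from raw user/article features.

    z: shared features weighted by the common beta* (here the flattened
       outer product of the two vectors, 36-dim for two 6-dim inputs).
    x: arm-specific features weighted by the per-article theta_a*.
    """
    z = np.outer(user, article).ravel()
    x = np.asarray(user)
    return z, x

user = np.array([0.2, 0.1, 0.4, 0.1, 0.2, 1.0])     # 5 cluster weights + constant
article = np.array([0.3, 0.3, 0.1, 0.2, 0.1, 1.0])  # hypothetical values
z, x = hybrid_features(user, article)
print(z.shape, x.shape)  # (36,) (6,)
```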
14. Evaluation
● Testing on live data?
○ TOO EXPENSIVE.
● Then, testing offline on logged data?
○ DIFFERENT LOGGING POLICY: the logged actions were chosen by another policy, so naively replaying the logs is not representative.
● Then, a simulator-based approach?
○ BIASED by the simulator's modeling assumptions.
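The underlying paper sidesteps all three problems with an unbiased offline "replay" evaluator: it replays logs in which the displayed article was chosen uniformly at random and keeps only the events where the evaluated policy agrees with the log. A minimal sketch, assuming a log of (per-arm contexts, logged action, reward) tuples and the choose/update interface from the LinUCB sketch above:

```python
def replay_evaluate(policy, log):
    """Offline replay evaluation on uniformly random logged traffic.

    log: iterable of (contexts, logged_action, reward) tuples, where
    logged_action was chosen uniformly at random at logging time.
    Events where the policy disagrees with the log are discarded, so the
    retained events are distributed as if the policy had run live.
    """
    clicks, matched = 0, 0
    for contexts, logged_action, reward in log:
        if policy.choose(contexts) == logged_action:
            matched += 1
            clicks += reward
            policy.update(logged_action, contexts[logged_action], reward)
    return clicks / matched if matched else 0.0  # estimated CTR of the policy
```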
15. Results
● Training Set: 4.7 million events
● Test Set: 36 million events
● Articles and users each clustered into 5 clusters:
○ Two 6-dimensional feature vectors (5 cluster-membership features plus one constant feature)