6. But first...
About Telefonica and Telefonica R&D
7. Telefónica is a fast-growing Telecom

                 1989                 2000                    2008
Clients          About 12 million     About 68 million        About 260 million
                 subscribers          customers               customers
Services         Basic telephone      Wireline and mobile     Integrated ICT
                 and data services    voice, data and         solutions for all
                                      Internet services       customers
Geographies      Operations in        Operations in           Operations in
                 Spain                16 countries            25 countries
Staff            About 71,000         About 149,000           About 257,000
                 professionals        professionals           professionals
Finances         Rev: 4,273 M€        Rev: 28,485 M€          Rev: 57,946 M€
                 EPS(1): 0.45 €       EPS(1): 0.67 €          EPS: 1.63 €

(1) EPS: Earnings per share
8. Currently among the largest in the world
Telco sector worldwide ranking by market cap (US$ bn)
Source: Bloomberg, 06/12/09
9. Telefónica R&D (TID) is the Research and Development Unit of the Telefónica Group

MISSION: "To contribute to the improvement of the Telefónica Group's competitiveness through technological innovation"

- Founded in 1988
- Largest private R&D center in Spain
- More than 1,100 professionals
- Five centers in Spain and two in Latin America

In 2008 Telefónica was the first Spanish company by R&D investment and the third in the EU:
Applied research: 61 M€, within R&D: 594 M€, within Technological Innovation: 4,384 M€
10. Internet Scientific Areas

Content Distribution and P2P:
- Next generation P2P-TV
- Future Internet: Content Networking
- Delay Tolerant Bulk Distribution

Wireless and Mobile Systems:
- Managed Wireless bundling
- Device2Device Content Distribution
- Infrastructure for large-scale mobile-data-based cloud computing
- Network Transparency

Social Networks:
- Information Propagation
- Social Search Engines
- Large-scale social analysis
11. Multimedia Scientific Areas

Multimedia Core:
- Multimedia Data Analysis, Search & Retrieval
- Video, Audio, Image, Music, Text, Sensor Data
- Understanding, Summarization, Visualization

Mobile and Ubicomp:
- Context Awareness
- Urban Computing
- Mobile Multimedia & Search
- Wearable Systems
- Physiological Monitoring

HCC:
- Multimodal User Interfaces
- Expression, Gesture, Emotion Recognition
- Personalization & Recommendation
- Super Telepresence
12. Data Mining & User Modeling Areas

SOCIAL NETWORK ANALYSIS & BUSINESS INTELLIGENCE
- Analytical CRM
- Trend-spotting, service propagation & churn
- Social Graph Analysis (construction, dynamics)

USER MODELING
- Application to new services (technology for development)
- Cognitive, socio-cultural, and contextual modeling
- Behavioral user modeling (service-use patterns)

DATA MINING
- Integration of statistical & knowledge-based techniques
- Stream mining
- Large-scale & distributed machine learning
13. Index
Now seriously,
this is where the index should go!
15. The Age of Search has come to an end
... long live the Age of Recommendation!

Chris Anderson in "The Long Tail":
"We are leaving the age of information and entering the age of recommendation"

CNN Money, "The race to create a 'smart' Google":
"The Web, they say, is leaving the era of search and entering one of discovery. What's the difference? Search is what you do when you're looking for something. Discovery is when something wonderful that you didn't know existed, or didn't know how to ask for, finds you."
17. The value of recommendations

- Netflix: 2/3 of the movies rented are recommended
- Google News: recommendations generate 38% more clickthrough
- Amazon: 35% of sales come from recommendations
- Choicestream: 28% of people would buy more music if they found what they liked
18. The "Recommender problem"

Estimate a utility function that can automatically predict how much a user will like an item that is unknown to her, based on:
- Past behavior
- Relations to other users
- Item similarity
- Context
- ...
19. The "Recommender problem"

Let C be the set of all users and let S be the set of all possible items that can be recommended (e.g. books, movies, or restaurants).

Let u be a utility function that measures the usefulness of item s to user c, i.e., u : C × S → R, where R is a totally ordered set. Then, for each user c ∈ C, we want to choose the item s' ∈ S that maximizes u.

The utility of an item is usually represented by a rating, but it can also be an arbitrary function, including a profit function.
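In code, this formulation reduces to an argmax of u over S for each user. A toy sketch (the names `best_item` and the sample ratings are illustrative, not from the talk):

```python
# Toy utility values u(c, s) for known (user, item) pairs.
ratings = {
    ("alice", "book"): 4.0,
    ("alice", "movie"): 2.0,
    ("bob", "movie"): 5.0,
}

def best_item(user, items, u):
    """For user c, return the item s' in S that maximizes u(c, s)."""
    return max(items, key=lambda s: u(user, s))

# A trivial utility that falls back to 0 for unknown pairs:
u = lambda c, s: ratings.get((c, s), 0.0)
print(best_item("alice", ["book", "movie"], u))  # -> book
```

The hard part, of course, is estimating u for the unknown (c, s) pairs; that is what the approaches on the next slides do.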
20. Approaches to Recommendation

Collaborative Filtering: recommend items based only on the users' past behavior
- User-based: find users similar to me and recommend what they liked
- Item-based: find items similar to those that I have previously liked

Content-based: recommend based on features inherent to the items

Social recommendations (trust-based)
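A minimal sketch of the user-based variant (toy data and function names are mine, not the talk's): predict a user's rating for an item as a similarity-weighted average of other users' ratings.

```python
import math

# Toy user-item ratings (dict-of-dicts for clarity).
ratings = {
    "ana":  {"m1": 5, "m2": 3, "m3": 4},
    "ben":  {"m1": 4, "m2": 2, "m3": 5},
    "cris": {"m1": 1, "m2": 5, "m3": 2},
    "dan":  {"m2": 4, "m3": 5},   # has not rated m1 yet
}

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def predict_user_based(user, item):
    """Similarity-weighted average of neighbors' ratings for `item`."""
    pairs = [(cosine(ratings[user], r), r[item])
             for u, r in ratings.items() if u != user and item in r]
    norm = sum(abs(s) for s, _ in pairs)
    return sum(s * v for s, v in pairs) / norm if norm else 0.0
```

Item-based CF transposes the same idea: similarities are computed between item columns instead of user rows, which tends to be more stable because each item usually has more ratings than each user.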
22. The Netflix Prize

- 500K users x 17K movie titles = 100M ratings = $1M (if you "only" improve the existing system by 10%: from 0.95 to 0.85 RMSE)
- 49K contestants on 40K teams from 184 countries
- 41K valid submissions from 5K teams; 64 submissions per day
- Winning approach uses hundreds of predictors from several teams
- Is this general?
- Why did it take so long?
23. What works

It depends on the domain and the particular problem. However, in the general case it has been demonstrated that (currently) the best isolated approach is CF.
- Item-based CF is in general more efficient and more accurate, but mixing CF approaches can improve results
- Other approaches can be hybridized to improve results in specific cases (cold-start problem...)

What matters:
- Data preprocessing: outlier removal, denoising, removal of global effects (e.g. each user's average)
- "Smart" dimensionality reduction using matrix factorization such as SVD
- Combining classifiers
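One way to picture the dimensionality-reduction step: a rank-k truncated SVD of the rating matrix. This is only a sketch with invented numbers; real Netflix Prize solutions factorized over observed entries only, rather than mean-filling as done here.

```python
import numpy as np

# Toy user-item matrix; 0 marks an unknown rating (mean-filled below,
# an assumption for this sketch).
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [1., 0., 0., 4.]])
R_filled = np.where(R == 0, R[R > 0].mean(), R)

k = 2  # number of latent factors to keep
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation
```

The entries of R_hat serve as denoised rating predictions: the low-rank constraint discards per-user idiosyncrasies while keeping the dominant taste factors.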
24. I like it... I like it not
Evaluating User Ratings Noise in
Recommender Systems
Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver
Telefonica Research
28. Natural Noise Limits our User Model
DID YOU HEAR WHAT I LIKE??!!
...and Our Prediction Accuracy
29. The Magic Barrier

- Magic Barrier = limit on prediction accuracy due to noise in the original data
- Natural Noise = involuntary noise introduced by users when giving feedback
- Due to (a) mistakes, and (b) lack of resolution in the personal rating scale (e.g. on a 1-to-5 scale, a 2 may mean the same as a 3 for some users and some items)
- Magic Barrier >= Natural Noise Threshold
- We cannot predict with less error than the resolution of the original data
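A quick simulation of the argument (all numbers invented): if observed ratings are true opinions plus natural noise, even an oracle that knows each true opinion cannot score a lower RMSE than the noise level.

```python
import math
import random

random.seed(0)
true_opinions = [random.uniform(1, 5) for _ in range(10000)]
noise_std = 0.5  # assumed natural-noise level
observed = [t + random.gauss(0, noise_std) for t in true_opinions]

# Oracle predictor: predicts the true opinion exactly...
rmse = math.sqrt(sum((o - t) ** 2
                     for o, t in zip(observed, true_opinions)) / len(observed))
# ...yet its RMSE against the observed ratings stays near noise_std.
```

No algorithm, however clever, can do better than this oracle, so the noise level is a hard floor on measured accuracy.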
30. Our related research questions

- Q1. Are users inconsistent when providing explicit feedback to Recommender Systems via the common rating procedure?
- Q2. How large is the prediction error due to these inconsistencies?
- Q3. What factors affect user inconsistencies?
31. Experimental Setup (I)

- Test-retest procedure: you need at least 3 trials to separate reliability from stability
- Reliability: how much you can trust the instrument you are using (i.e. ratings)
  r = r12 * r23 / r13
- Stability: drift in user opinion
  s12 = r13 / r23;  s23 = r13 / r12;  s13 = r13² / (r12 * r23)
- Users rated movies in 3 trials:
  Trial 1 <-> 24 h <-> Trial 2 <-> 15 days <-> Trial 3
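The formulas above can be wrapped in small helpers (hypothetical names; r12, r23, r13 are the pairwise correlations between the three rating trials):

```python
def reliability(r12, r23, r13):
    """Test-retest reliability: r = r12 * r23 / r13."""
    return r12 * r23 / r13

def stabilities(r12, r23, r13):
    """Per-interval stability estimates from the three trial correlations."""
    return {"s12": r13 / r23,
            "s23": r13 / r12,
            "s13": r13 ** 2 / (r12 * r23)}
```

For example, reliability(0.8, 0.9, 0.72) returns 1.0: all the disagreement between trials is attributed to opinion drift rather than an unreliable rating instrument.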
32. Experimental Setup (II)

- 100 movies selected from the Netflix dataset using stratified random sampling on popularity
- Ratings on a 1-to-5 star scale
- Special "not seen" symbol
- Trials 1 and 3 in random order; trial 2 ordered by popularity
- 118 participants
34. Comparison to Netflix Data

- The distribution of the number of ratings per movie is very similar to Netflix, but the average rating is lower (users are not voluntarily choosing what to rate)
35. Test-retest Reliability and Stability

- Overall reliability = 0.924 (good reliabilities are expected to be > 0.9)
- Removing mild ratings yields higher reliabilities, while removing extreme ratings yields lower ones
- Stabilities: s12 = 0.973, s23 = 0.977, and s13 = 0.951
- Stabilities might also be accounting for a "learning effect" (note s12 < s23)
36. Users are Inconsistent

- What is the probability of making an inconsistency given an original rating?

37. Users are Inconsistent

Mild ratings are noisier
- What is the percentage of inconsistencies given an original rating?

38. Users are Inconsistent

Negative ratings are noisier
- What is the percentage of inconsistencies given an original rating?
46. Let's recap

- Users are inconsistent
- Inconsistencies can depend on many things, including how the items are presented
- Inconsistencies produce natural noise
- Natural noise reduces our prediction accuracy, independently of the algorithm
47. Item order effect

- R1 is the trial with the most inconsistencies
- R3 has fewer, but not when excluding "not seen" (the learning effect improves "not seen" discrimination)
- R2 minimizes inconsistencies because of its ordering (reducing the "contrast effect")
48. User Rating Speed Effect

- Evaluation time decreases as the survey progresses in R1 and R3 (users losing attention, but also learning)
- In R2, evaluation time decreases until users reach the segment of "popular" movies
- Rating speed is not correlated with inconsistencies
50. Different proposals

In order to deal with noise in user feedback, we have so far proposed 3 different approaches:
1. Denoise user feedback by using a re-rating approach (Recsys09)
2. Instead of regular users, take feedback from experts, whom we expect to be less noisy (SIGIR09)
3. Combine ensembles of datasets to identify which works better for each user (IJCAI09)
51. Rate it Again
Increasing Recommendation Accuracy by User re-Rating
Xavier Amatriain (with J.M. Pujol, N. Tintarev, N. Oliver)
Telefonica Research
52. Rate it again

- By asking users to rate items again we can remove noise in the dataset
- Improvements of up to 14% in accuracy!
- Because we don't want all users to re-rate all items, we design ways to do partial denoising:
  - Data-dependent: only denoise extreme ratings
  - User-dependent: detect "noisy" users
53. Algorithm

Given a rating dataset where (some) items have been re-rated, two fairness conditions apply:
1. The algorithm should remove as few ratings as possible (i.e. only when there is some certainty that the rating is only adding noise)
2. The algorithm should not make up new ratings, but decide which of the existing ones are valid
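A minimal rule consistent with the two fairness conditions (my sketch, not the paper's actual algorithm): keep a rating only when its re-rating confirms it, and drop it otherwise rather than inventing a new value.

```python
def denoise(rating, re_rating, max_gap=1):
    """Keep `rating` if confirmed by `re_rating` (within max_gap stars);
    drop it (None) on contradiction -- never fabricate a new rating."""
    if re_rating is None:
        return rating        # no re-rating available: keep as-is
    if abs(rating - re_rating) <= max_gap:
        return rating        # confirmed: the rating stands
    return None              # contradictory: treat as noise and remove
```

Both conditions hold by construction: ratings are only removed on clear contradiction, and the output is always one of the user's own existing ratings.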
54. Algorithm

- One-source re-rating case
- Given the following milding function:
57. The Wisdom of the Few
A Collaborative Filtering Approach Based on Expert Opinions from the Web
Xavier Amatriain (@xamat), Josep M. Pujol, Nuria Oliver (Telefonica Research, Barcelona)
Neal Lathia (UCL, London)
58. Crowds are not always wise

- Collaborative filtering is the preferred approach for Recommender Systems
- Recommendations are drawn from your past behavior and that of similar users in the system
- Standard CF approach:
  - Find your neighbors from the set of other users
  - Recommend things that your neighbors liked and you have not "seen"
- Problem: predictions are based on a large dataset that is sparse and noisy
59. Overview of the Approach

- Expert = individual that we can trust to have produced thoughtful, consistent and reliable evaluations (ratings) of items in a given domain
- Expert-based Collaborative Filtering: find neighbors from a reduced set of experts instead of regular users
  1. Identify domain experts with reliable ratings
  2. For each user, compute "expert neighbors"
  3. Compute recommendations similar to standard kNN CF
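Step 2 might look like this in code (the toy similarity and sample data are assumptions for illustration):

```python
def expert_neighbors(user_ratings, experts, similarity, k=10):
    """Rank the (small) expert set by similarity to the user; keep top k."""
    ranked = sorted(experts.items(),
                    key=lambda kv: similarity(user_ratings, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

def overlap(a, b):
    """Toy similarity: number of co-rated items with identical rating."""
    return sum(1 for i in a if i in b and a[i] == b[i])

experts = {"critic1": {"m1": 5, "m2": 2},
           "critic2": {"m1": 1, "m2": 2}}
user = {"m1": 5, "m2": 2}
# expert_neighbors(user, experts, overlap, k=1) -> ["critic1"]
```

The key difference from standard kNN CF is the candidate pool: a few hundred experts instead of millions of users, which is what buys the scalability and privacy advantages on the next slide.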
60. Advantages of the Approach

- Noise: experts introduce less natural noise
- Malicious ratings: the dataset can be monitored to avoid shilling
- Data sparsity: a reduced set of domain experts can be motivated to rate items
- Cold-start problem: experts rate items as soon as they are available
- Scalability: the dataset is several orders of magnitude smaller
- Privacy: recommendations can be computed locally
61. Mining the Web for Expert Ratings

- Collections of expert ratings can be obtained almost directly on the web: we crawled the Rotten Tomatoes movie-critics mash-up
- Only those (169) with more than 250 ratings in the Netflix dataset were used
62. Dataset Analysis: Summary

Experts...
- are much less sparse
- rate movies all over the rating scale instead of being biased towards rating only "good" movies (different incentives)
- but they seem to consistently agree on the good movies
- have a lower overall standard deviation per movie: they tend to agree more than regular users
- tend to deviate less from their personal average rating
63. Evaluation Procedure

- Use the 169 experts to predict ratings for 10,000 users sampled from the Netflix dataset
- Prediction MAE using an 80-20 holdout procedure (5-fold cross-validation)
- Top-N precision by classifying items as "recommendable" given a threshold
- Results show Expert CF behaves similarly to standard CF
- But... we have a user study backing up the approach
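For reference, MAE is simply the average absolute gap between predicted and true ratings:

```python
def mae(predicted, actual):
    """Mean Absolute Error over paired (prediction, truth) ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# mae([4, 3, 5], [5, 3, 3]) -> (1 + 0 + 2) / 3 = 1.0
```

Unlike the RMSE used in the Netflix Prize, MAE does not square the errors, so it penalizes large mistakes less heavily.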
64. User Study

- 57 participants, only 14.5 ratings/participant
- 50% of the users consider Expert-based CF to be good or very good
- Expert-based CF: the only algorithm with an average rating over 3 (on a 0-4 scale)
65. Current Work

- Music recommendations (using metacritics.com), mobile geo-located recommendations...
66. Adaptive Data Sources
Collaborative Filtering With Adaptive Information Sources
(ITWP @ IJCAI)
With Neal Lathia (UCL, London)
67. Adaptive data sources

[Diagram: user modeling matches each user to an information source: similarity (like-minded?), trust (friends?), reputation (experts?)]
68. Adaptive Data Sources

- Given: a simple, un-tuned kNN predictor and multiple information sources
- A problem: users are subjective; accuracy varies with the source
- A promise: optimal classification of users to their best source produces incredibly accurate predictions
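The "promise" amounts to a per-user argmin over sources (the source names and error values below are illustrative): route each user to whichever information source scored the lowest error on their held-out ratings.

```python
def best_source(errors_by_source):
    """Pick the information source with the lowest held-out error for
    one user, e.g. {"similarity": mae, "trust": mae, "reputation": mae}."""
    return min(errors_by_source, key=errors_by_source.get)

# best_source({"similarity": 0.81, "trust": 0.74, "reputation": 0.90})
# -> "trust"
```

The open problem is doing this classification without an oracle, i.e. predicting each user's best source from observable features rather than from held-out error.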
70. Conclusions

- For many applications such as Recommender Systems (but also Search, Advertising, and even Networks), understanding data and users is vital
- Algorithms can only be as good as the data they use as input
- The importance of User/Data Mining is going to be a growing trend in many areas in the coming years
71. Thanks!
Questions?

Xavier Amatriain
xar@tid.es
xavier.amatriain.net
technocalifornia.blogspot.com
twitter.com/xamat