1. Rus M. Mesas, Alejandro Bellogín
Universidad Autónoma de Madrid
Spain
RecSys, August 2017
Evaluating Decision-Aware
Recommender Systems
2.
Alejandro Bellogín – RecSys, August 2017
Main idea
▪ How to balance coverage and precision
  Method   Precision   Coverage   Best?
  R1       0.093       100%
  R2       0.094       97.8%
3.
Main idea
▪ How to balance coverage and precision
  Method   Precision   Coverage   Best?
  R1       0.093       100%
  R2       0.094       97.8%

  Method   Precision   Coverage   Best?
  R1       0.037       100%
  R2       0.133       100%
  R3       0.245       99.7%
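The trade-off shown in the tables above can be made concrete with a small sketch. This is illustrative code, not from the paper: precision is computed only over the recommendations a system actually delivers, while user-space coverage is the fraction of users who receive any list at all, so a system that abstains can gain precision at the cost of coverage.

```python
# Illustrative sketch (function and variable names are invented, not the
# paper's): precision over delivered recommendations vs. user coverage.

def precision_and_coverage(recommendations, relevant):
    """recommendations: dict user -> list of items (empty list = abstained).
    relevant: dict user -> set of relevant items."""
    hits = total = served = 0
    for user, items in recommendations.items():
        if not items:            # the recommender declined for this user
            continue
        served += 1
        total += len(items)
        hits += sum(1 for i in items if i in relevant.get(user, set()))
    precision = hits / total if total else 0.0
    coverage = served / len(recommendations) if recommendations else 0.0
    return precision, coverage

# A selective recommender serves only u1 and hits one of two items:
recs = {"u1": ["a", "b"], "u2": []}
rel = {"u1": {"a"}, "u2": {"c"}}
print(precision_and_coverage(recs, rel))  # (0.5, 0.5)
```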
5.
Main idea
▪ How to balance coverage and precision
▪ To force different coverage levels, we allow
recommenders to decide whether a recommendation is
worth presenting to the user
6.
Balancing coverage and precision
▪ [Herlocker et al 2004]: “there is no general
coverage metric that, at the same time, gives more
weight to relevant items when accounting for
coverage, and combines coverage and accuracy
measures”
▪ [Gunawardana & Shani 2015] leave the problem of
balancing coverage and precision as an open issue
in the area
8.
Our proposal: Correctness metric
▪ Adapted from Question Answering:
• Several questions to be answered by a system
• Each question has several options
• Only one option is correct
• If an answer is not given, it should not be counted as
an incorrect answer
• Hence, if two systems have the same number of correct
answers but one has failed on fewer questions (because it
decided not to respond), it should rank above the other one
A. Peñas & Á. Rodrigo. 2011. A Simple Measure to Assess Non-response. ACL.
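The Question Answering measure referred to above is c@1 from Peñas & Rodrigo (2011): unanswered questions are credited in proportion to the system's accuracy on the questions it did answer, so abstaining beats answering wrongly. A minimal sketch:

```python
# c@1 from Peñas & Rodrigo (2011): unanswered questions earn partial
# credit equal to the system's observed accuracy over all questions.

def c_at_1(n_correct, n_unanswered, n_total):
    return (n_correct + n_unanswered * n_correct / n_total) / n_total

# Two systems with 50 correct answers out of 100 questions: the one
# that abstained on 20 questions (instead of failing them) scores higher.
print(c_at_1(50, 0, 100))   # 0.5
print(c_at_1(50, 20, 100))  # 0.6
```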
9.
Correctness metric for recommendation
▪ Each recommendation algorithm is a system
▪ Each candidate item to be ranked is a question
▪ If an item is recommended, it could be relevant or
not
▪ The same set of items is presented to each system
[Example table: recommended lists compared on Precision@5 vs. Correctness]
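The mapping on this slide can be sketched in code. This is a hedged illustration of the analogy only (the paper defines four concrete instantiations; names here are invented): each candidate item plays the role of a question, a recommended relevant item is a correct answer, and an item the system declines to recommend is an unanswered question.

```python
# Hedged sketch of adapting a c@1-style score to one recommendation
# list; not the paper's exact UC/RUC/IC/RIC definitions.

def correctness(candidates, recommended, relevant):
    """candidates: all items to be ranked; recommended: items the system
    chose to show; relevant: set of truly relevant items."""
    n = len(candidates)
    n_correct = sum(1 for i in recommended if i in relevant)
    n_unanswered = sum(1 for i in candidates if i not in recommended)
    # unanswered "questions" get partial credit proportional to accuracy
    return (n_correct + n_unanswered * n_correct / n) / n

# 4 candidates, 2 recommended (1 relevant), 2 withheld:
print(correctness(["a", "b", "c", "d"], ["a", "b"], {"a"}))  # 0.375
```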
10.
Correctness metrics for recommendation
▪ Four instantiations:
• Based on users: UC and RUC
• Based on items: IC and RIC
11.
What about the decision-aware
recommenders?
12.
Decision-aware recommender systems
▪ Exploiting the confidence a system has on its own
recommendations
▪ Not completely new
• Significance weighting
• Support and confidence in case-based recommenders
▪ Focus on Collaborative Filtering algorithms
• Support of the prediction score in nearest-neighbour
methods
• Uncertainty in the prediction score of a probabilistic
matrix factorisation algorithm
13.
Estimating confidence in
decision-aware recommendation
▪ For user-based KNN: recommend only if at least n (out
of k) neighbours have participated in the rating
estimation
▪ For probabilistic MF: recommend only if the uncertainty
of the predicted score is small enough
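The KNN support criterion above can be sketched as follows. This is an illustrative implementation under assumptions (parameter names and the weighted-average predictor are not taken from the paper): the recommender abstains whenever fewer than n neighbours actually rated the target item.

```python
# Illustrative decision-aware user-based KNN prediction: abstain
# (return None) when support is below the n_min threshold.

def predict_with_support(neighbour_ratings, n_min):
    """neighbour_ratings: (similarity, rating) pairs from the neighbours
    who rated the item; returns None to signal 'do not recommend'."""
    if len(neighbour_ratings) < n_min:
        return None  # not enough support: better no recommendation
    num = sum(sim * r for sim, r in neighbour_ratings)
    den = sum(abs(sim) for sim, _ in neighbour_ratings)
    return num / den if den else None

print(predict_with_support([(0.9, 4), (0.5, 2)], n_min=3))           # None
print(predict_with_support([(0.9, 4), (0.5, 2), (0.2, 5)], n_min=3))  # 3.5
```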
14.
Experimental setup
▪ Datasets
• MovieLens 100K, MovieLens 1M, Jester
• Random 5-fold training/test split
▪ Evaluation
• Generate a ranking with every item in the test set
• Metrics at cutoff 10: precision (P), user space coverage
(USC), item space coverage (ISC), correctness (UC, RUC,
IC, RIC), novelty (EPC), diversity (AggrDiv)
▪ Frameworks
• RankSys: evaluation metrics, KNN recommenders
• RiVal: data splitting
16.
Impact on novelty and diversity
▪ Prediction uncertainty
• Stricter constraints (smaller allowed uncertainty)
decrease novelty and diversity
17.
Conclusions
▪ We have proposed a family of metrics based on
the assumption that it is better to avoid a
recommendation rather than providing a bad
recommendation
▪ We have shown that a balance between precision,
coverage, diversity, and novelty is critical
▪ We have proposed two strategies to decide if an
item should be presented to the user
18.
Future work
▪ Extend the correctness metrics to combine other
evaluation dimensions
▪ Objective way to discriminate between systems:
which one is really the best one?
▪ Consider the psychological aspect of the
recommendation: the user is expecting to receive
N recommendations (better bad than none?)
19.
Thank you
Evaluating Decision-Aware
Recommender Systems
Rus M. Mesas, Alejandro Bellogín
Universidad Autónoma de Madrid
Spain
RecSys, August 2017
21.
Impact on novelty and diversity
▪ Prediction support
• Larger n decreases the
diversity and novelty of
the lists
• More popular items are
being recommended
22.
Motivation
▪ Typical evaluation: it is better to fail than to avoid
a recommendation
• Assumption: not returning an item is taken as evidence
that the item is not relevant
▪ In this work: a recommender system may decide
not to recommend a specific item
• We need a metric where “no recommendation” means
neither relevant nor not relevant; if possible, it should
mean “better than not relevant”
23.
Definition of uncertainty for PMF
▪ PMF: probabilistic matrix factorisation using a
Bayesian approximation proposed in [Lim & Teh
2007]
▪ The standard deviation is derived using mean-field
variational inference:
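The equation on this slide was lost in extraction. As a hedged reconstruction (symbols assumed, not taken from the slides): if the mean-field posterior factorises into independent Gaussians, with user factors u_uk ~ N(ū_uk, σ²_{u,uk}) and item factors v_ik ~ N(v̄_ik, σ²_{v,ik}), then the standard identity for the variance of a product of independent Gaussians gives, for the predicted rating r̂_ui = Σ_k u_uk v_ik:

```latex
\operatorname{Var}[\hat{r}_{ui}]
  = \sum_{k} \left( \bar{u}_{uk}^{2}\,\sigma^{2}_{v,ik}
  + \bar{v}_{ik}^{2}\,\sigma^{2}_{u,uk}
  + \sigma^{2}_{u,uk}\,\sigma^{2}_{v,ik} \right),
\qquad
\sigma_{ui} = \sqrt{\operatorname{Var}[\hat{r}_{ui}]}
```

The standard deviation σ_ui would then serve as the per-prediction uncertainty that the decision rule thresholds; the exact form used by [Lim & Teh 2007] may differ.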