RecSys Boston,	Sept	17,	2016 1
Contrasting Offline and Online
Results when Evaluating
Recommendation Algorithms
Marco Rossetti
Trainline Ltd., London
(previously University of Milan-Bicocca)
Fabio Stella
Department of Informatics, Systems and Communication
University of Milano-Bicocca
Markus Zanker
Faculty of Computer Science
Free University of Bozen-Bolzano
Research Goal
• Given the dominance of offline evaluation, reflecting on its validity
becomes important.
• Said and Bellogin (RecSys 2014) identified serious problems with
internal validity: results are not reproducible across different
open-source frameworks.
• Differences between offline and online evaluation results have also been
reported, putting question marks on external validity (e.g.
Cremonesi et al. 2012, Beel et al. 2013, Garcin et al. 2014, Ekstrand et
al. 2014, Maksai et al. 2015).
• Proposition:
  • Compare the results of an offline experiment with those of an online
  evaluation.
  • Use a within-users experimental design, which allows testing for
  differences in paired samples.
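A within-users design yields one score per user and algorithm, so two algorithms can be compared on paired samples. The slides do not name the statistical test that was used; the sketch below assumes a two-sided sign-flip permutation test, and the per-user scores and the function name are hypothetical.

```python
import random
from statistics import mean

def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-user scores."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each user's difference is equally
        # likely to have either sign, so flip the signs at random.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed - 1e-12:
            extreme += 1
    # The add-one correction keeps the estimated p-value above zero.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical per-user precision of two algorithms shown to the same users.
algo_a = [0.8, 0.6, 0.6, 0.4, 0.8, 0.6, 1.0, 0.6]
algo_b = [0.6, 0.4, 0.6, 0.2, 0.6, 0.4, 0.8, 0.4]
p_value = paired_permutation_test(algo_a, algo_b)
print(f"p = {p_value:.4f}")
```

Because every user sees every algorithm, differences between users cancel out in the per-user differences, which is what makes the paired comparison possible.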
Research Questions
1. Does the relative ranking of algorithms based on offline accuracy
measurements predict the relative ranking according to an accuracy
measurement in a user-centric evaluation?
2. Does the relative ranking of algorithms based on offline measurements of
the predictive accuracy for long-tail items produce comparable results to
a user-centric evaluation?
3. Do offline accuracy measurements allow us to predict the utility of
recommendations in a user-centric evaluation?
Study Design
Step 1
• Collected likes on MovieLens (ML) movies from 241 users
• On average 137 ratings per user
Step 2
• The same users evaluated 4 algorithms, 5 recommendations each
• On average 17.4 + 2 recommendations
• 122 users returned, 100 after cleaning
Offline and Online Evaluations
• Data: ML1M (train)
• Algorithms:
  • POP: Popularity
  • MF80: Matrix Factorization with 80 factors
  • MF400: Matrix Factorization with 400 factors
  • I2I: Item To Item K-Nearest Neighbors
• Offline evaluation: all-but-1 validation
• Online evaluation: users' answers
• Metrics:
  • Precision on all items (offline and online)
  • Precision on long tail (offline and online)
  • Useful recommendations (online only)
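The offline side of the comparison can be illustrated with a minimal all-but-1 sketch: hide one liked item per user, train on the rest, and check whether the hidden item appears in the top-k list. The toy data, the function names and the use of a hit rate as the score are assumptions for illustration; the slides do not spell out how the reported precision was computed.

```python
from collections import Counter

def top_k_popular(train, user, k):
    """Recommend the k most-liked items the user has not liked yet."""
    counts = Counter(item for items in train.values() for item in items)
    seen = train[user]
    # Sort by descending count, then by item id for a deterministic order.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return [item for item in ranked if item not in seen][:k]

def all_but_one_hit_rate(likes, held_out, k=5):
    """Hide one liked item per user and check whether it is recommended."""
    train = {u: items - {held_out[u]} for u, items in likes.items()}
    hits = sum(held_out[u] in top_k_popular(train, u, k) for u in likes)
    return hits / len(likes)

# Hypothetical toy data: user -> set of liked movie ids.
likes = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"a", "c", "e"},
    "u4": {"b", "f"},
}
# Hypothetical choice of the single hidden item per user.
held_out = {"u1": "c", "u2": "d", "u3": "a", "u4": "f"}
rate = all_but_one_hit_rate(likes, held_out, k=2)
print(rate)
```

Swapping `top_k_popular` for another recommender with the same signature would let the same loop score MF or I2I style models.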
Precision All Items
Algorithm   Offline   Online
I2I         0.438     0.546
MF80        0.504     0.598
MF400       0.454     0.604
POP         0.340     0.516
[Figures: pairwise significance diagrams among POP, I2I, MF80 and MF400 for offline and for online precision on all items; edges labelled p = 0.05, with one edge labelled p = 0.1]
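The first research question asks whether the offline ranking of algorithms predicts the online one. One way to check this on the reported all-items precision values is a rank correlation; the sketch below uses Kendall's tau (tau-a, no tie handling), which is a choice made for this illustration and not a measure taken from the slides.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two score dicts that share the same keys."""
    concordant = discordant = 0
    for a, b in combinations(sorted(x), 2):
        sign = (x[a] - x[b]) * (y[a] - y[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

def ranking(scores):
    """Algorithm names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)

# Precision on all items as reported in the table above.
offline = {"I2I": 0.438, "MF80": 0.504, "MF400": 0.454, "POP": 0.340}
online = {"I2I": 0.546, "MF80": 0.598, "MF400": 0.604, "POP": 0.516}

tau = kendall_tau(offline, online)
print(ranking(offline), ranking(online), round(tau, 3))
```

On these numbers only the MF80 and MF400 pair changes order between the two evaluations; the rank correlation says nothing about whether that swap is statistically significant.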
Precision on Long Tail Items
Algorithm   Offline   Online
I2I         0.280     0.356
MF80        0.018     0.054
MF400       0.360     0.628
POP         0.000     0.000
[Figure: pairwise significance diagram among MF80, MF400, POP and I2I, captioned "Offline = Online precision long tail items"; six edges labelled p = 0.05]
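The slides do not state how the long tail was delimited. The sketch below assumes one common convention: the "head" is the smallest set of most popular items covering a given share of all interactions, and everything else is the long tail. The `head_share` threshold, the toy data and the precision definition (relevant long-tail hits divided by list length) are all assumptions for illustration.

```python
from collections import Counter

def long_tail_items(interactions, head_share=0.33):
    """Items outside the popular head that covers head_share of interactions."""
    counts = Counter(item for _, item in interactions)
    total = sum(counts.values())
    covered = 0
    head = set()
    for item, count in counts.most_common():
        if covered >= head_share * total:
            break
        head.add(item)
        covered += count
    return set(counts) - head

def long_tail_precision(recommended, relevant, tail):
    """Share of recommended items that are both relevant and in the long tail."""
    hits = [i for i in recommended if i in relevant and i in tail]
    return len(hits) / len(recommended)

# Hypothetical (user, item) interactions: "a" is very popular, "c" and "d" are rare.
interactions = (
    [(f"u{n}", "a") for n in range(5)]
    + [(f"u{n}", "b") for n in range(3)]
    + [("u0", "c"), ("u1", "d")]
)
tail = long_tail_items(interactions, head_share=0.5)
precision = long_tail_precision(["a", "b", "c", "d"], {"a", "c"}, tail)
print(sorted(tail), precision)
```

Under this definition a list made only of head items scores zero, which is consistent with the POP row in the table above.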
Useful Recommendations
Algorithm   Online
I2I         0.126
MF80        0.082
MF400       0.116
POP         0.026
[Figure: pairwise significance diagram for useful recommendations among MF400, I2I, POP and MF80; five edges labelled p = 0.05]
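The deck describes useful recommendations as relevant and novel. A minimal sketch of how such a rate could be derived from user answers follows; the answer format (algorithm, liked, already known) and the sample records are hypothetical and not taken from the study's questionnaire.

```python
from collections import defaultdict

def useful_rate(answers):
    """Share of shown recommendations per algorithm that are liked and not yet known."""
    shown = defaultdict(int)
    useful = defaultdict(int)
    for algorithm, liked, already_known in answers:
        shown[algorithm] += 1
        # Assumption: "useful" means relevant (liked) and novel (not known before).
        if liked and not already_known:
            useful[algorithm] += 1
    return {algo: useful[algo] / shown[algo] for algo in shown}

# Hypothetical answers: (algorithm, liked, already_known).
answers = [
    ("POP", True, True),
    ("POP", True, True),
    ("POP", False, False),
    ("POP", True, False),
    ("I2I", True, False),
    ("I2I", True, True),
    ("I2I", False, False),
    ("I2I", True, False),
]
rates = useful_rate(answers)
print(rates)
```

In this toy data the popularity list is liked often but mostly already known, so its useful rate ends up lower than its precision would suggest.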
Conclusions
• Comparison of different algorithms online and offline based on
a within-users experimental design.
• The algorithm performing best according to a traditional offline
accuracy measurement (MF80) was significantly worse when it comes
to useful (i.e. relevant and novel) recommendations measured
online.
• Academia and industry should keep investigating this topic in
order to find the best possible way to validate offline
evaluations.
Thank you!
Marco Rossetti
Trainline Ltd., London
@ross85
