NEXT.BUY
Aline Pichon – Insight Data Science
Consulting project
• Physical stores: post-sale communication with customers
• Product: recommend the next purchase
• Inputs: purchase history, demographics
Data
• Remote MySQL database
• 260 M products, 9 M users, 6000 stores
• Product features
Who buys where?
[Bar charts omitted: users per retailer (A–J, log scale) and retailer A products per state (MI, SC, TN, VA, NY, GA, NC, CA, TX, FL)]
• Retailers A, C: 88% of users
• States FL + TX: 34% of products (A)
Model: collaborative filtering
• Users: 5000 (FL)
• Products: 2000 (retailer C)
• Signal: quantity purchased
• Sparse data → product category used as an intermediate step
[Animation slides collapsed: worked example of recommendations for Francis (FL) and Ted (TX)]
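The slides above walk through user-based collaborative filtering: a user's category scores come from users whose purchase vectors resemble theirs. A minimal sketch of that idea in NumPy (the data, names, and `recommend` helper are illustrative, not the project's actual code):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b / denom)

def recommend(ratings, user, k=1):
    """Rank items the user has not purchased by similarity-weighted scores."""
    n_users = ratings.shape[0]
    sims = np.array([cosine_sim(ratings[user], ratings[u]) if u != user else 0.0
                     for u in range(n_users)])
    scores = sims @ ratings              # weighted sum of other users' quantities
    scores[ratings[user] > 0] = -np.inf  # do not re-recommend past purchases
    return np.argsort(scores)[::-1][:k]

# Toy user x category quantity matrix (rows: users, cols: categories)
R = np.array([[3, 0, 1, 0],   # "Francis"
              [4, 2, 1, 0],   # "Felipe" (similar to Francis)
              [0, 5, 0, 4]])  # a dissimilar user
print(recommend(R, user=0, k=1))  # → [1] (the category Felipe buys that Francis hasn't)
```

The quantity purchased stands in for an explicit rating, as on the slide; masking already-purchased entries mirrors the repurchase-score discussion later in the deck.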
Validation

Metric       Category step   Product step
Precision    .8              .3
Recall       .6              .7
ROC-AUC      .9              .8
NDCG @4-32   > .9            > .9
Sparsity     14%             < 0.8%
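The NDCG @4-32 figures above weight relevant items by their position in the ranked list; a sketch of the standard NDCG@k computation (the textbook formula, not necessarily the exact code used in the project):

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: exponentially weighted relevance, log-discounted by rank."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum((2 ** rel - 1) / discounts))

def ndcg_at_k(true_rel, predicted_order, k):
    """DCG of the predicted ranking, normalized by the ideal (sorted) ranking."""
    ranked = [true_rel[i] for i in predicted_order]
    ideal = sorted(true_rel, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return 0.0 if ideal_dcg == 0 else dcg_at_k(ranked, k) / ideal_dcg

# A perfect ranking scores 1.0; swapping the top two items lowers it
rels = [3, 2, 1, 0]
print(ndcg_at_k(rels, [0, 1, 2, 3], k=4))  # → 1.0
print(ndcg_at_k(rels, [1, 0, 2, 3], k=4))  # < 1.0
```

Because early ranks are discounted least, mistakes at the top of the list cost the most, which is exactly why the metric suits a "next buy" recommendation.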
Summary
• 2-step recommender system
• Local shopping behaviors → geographic adjustment
• Retailer specific
• https://alinepichon.github.io
Thank you
Aline Pichon
Ph.D. Cognitive Neuroscience
M.S. Medical Biology
Aline Pichon: NextBuy
Additional slides

Scalability
- Query the full DB
- Optimize code complexity for 9 M users
- Sufficient computational power
Validation
[ROC curves omitted: TPR vs. FPR for the category step (left) and product step (right)]
Category step: Precision .8, Recall .6, ROC-AUC .9, NDCG @4-32 > .9, Sparsity 14%
Product step: Precision .3, Recall .7, ROC-AUC .8, NDCG @4-32 > .9, Sparsity < 0.8%
Improvements
- Compare Pearson coefficients vs. Euclidean distances
- Cluster products differently than by category
- Direct User x Item model weighted by the probability that a
user returns to a given category (repurchase score)
- Repurchase score: combine how popular a category or product is
(how many items were purchased by real users, not just the COO
matrix size) with whether purchases were spread over several days
- If the repurchase score is low, remove already-purchased items
(category and product) from the recommendation, to avoid
recommending the same thing twice when it was not liked
(e.g., furniture shopping)
- Not enough history in my subsample to assess this
Deliverable
• Python notebook - user- and item-based
recommendation
• User x Category recommendation matrix
• User x User similarity matrix, within state
• User x Product / Category matrices (18)
- User x User similarity matrices for different states, to
compare predictions for similar users, predict different
local shopping behaviors, and adjust the offer locally
Additional validation
• Train-test split by leave-k-out
• Train the model, compare predictions to the test set
• MAE (quantity vs. binarized data)
[Bar chart omitted: MAE for quantity-based (Qty_based) vs. all-ones (Ones) ratings, 0–1 scale]
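The leave-k-out split and MAE comparison above can be sketched as follows (toy data and a naive per-user-mean baseline, both invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def leave_k_out(ratings, test_frac=0.2):
    """Hide a random fraction of the non-zero user-item entries for testing."""
    train = ratings.astype(float).copy()
    rows, cols = np.nonzero(ratings)
    n_test = max(1, int(test_frac * rows.size))
    picked = rng.choice(rows.size, size=n_test, replace=False)
    test = [(rows[i], cols[i], ratings[rows[i], cols[i]]) for i in picked]
    for r, c, _ in test:
        train[r, c] = 0.0                          # entry hidden from training
    return train, test

def mae(predictions, test):
    """Mean absolute error over the held-out entries."""
    return float(np.mean([abs(predictions[r, c] - v) for r, c, v in test]))

R = np.array([[3, 0, 1, 2],
              [4, 2, 1, 0],
              [0, 5, 0, 4]])
train, test = leave_k_out(R, test_frac=0.25)       # hides 2 of the 8 non-zero entries

# Naive baseline: predict each user's mean non-zero training quantity everywhere
row_means = np.array([row[row > 0].mean() if (row > 0).any() else 0.0 for row in train])
preds = np.tile(row_means[:, None], (1, R.shape[1]))
print(round(mae(preds, test), 3))
```

Running the same evaluation on binarized ratings (all non-zero entries set to 1) gives the "Ones" comparison from the chart.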
Definitions
• MAE: mean absolute error
• MSE: mean squared error
• Recall: the ability of the classifier to find all positive samples.
• Precision: the ability of the classifier not to label as positive a sample that is negative.
• F1: the harmonic mean of recall and precision.
• ROC-AUC: for non-binary data, a threshold must be chosen such that all
ratings above the threshold are good and labeled "1", while the rest are bad
with label "0". To summarize classification performance generally, we need
metrics that aggregate over this threshold. One tool for generating such a
metric is the Receiver Operating Characteristic (ROC) curve, a plot of TPR
vs. FPR; the AUC is the area under it.
• Cosine similarity: this metric can be understood geometrically by treating a given
user's (item's) row (column) of the ratings matrix as a vector. For user-based
collaborative filtering, two users' similarity is measured as the cosine of the
angle between the two users' vectors.
• Leave-k-out: a split percentage is chosen (e.g., 80% train, 20% test) and the
test percentage is selected randomly from the user-item pairs with non-zero
entries.
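The precision, recall, and F1 definitions above translate directly into code; a dependency-free sketch (the example labels are invented):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN); F1 = their harmonic mean."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 1]  # which held-out items were actually purchased
y_pred = [1, 0, 1, 1, 0, 1]  # which items the recommender flagged
print(precision_recall_f1(y_true, y_pred))  # → (0.75, 0.75, 0.75)
```

In practice one would reach for `sklearn.metrics`, but spelling the counts out shows exactly what the .8 vs. .3 precision gap between the two steps means.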
Who buys what?
[Histogram omitted: number of users (log scale) vs. products purchased (0, 1, 2, 3, 4, >5)]
Information sources
- Merchant C
- Product category (18)
- Quantity purchased
Data organization

User  Categ.  Product  Qty  Retailer  Location
1     Fruit   Apple    1    A         FL
2     Fruit   Orange   3    A         TX
2     Dairy   Milk     4    B         TX
3     Dairy   Egg      10   B         NY
4     Spices  Pepper   1    C         NJ
4     Meat    Beef     2    C         NJ
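The table above becomes the model's input once pivoted into a user x category quantity matrix; a small sketch (dense NumPy here for clarity; at the project's 9 M-user scale a `scipy.sparse.coo_matrix` would be the natural choice):

```python
import numpy as np

# Rows mirror the table above (User, Category, Qty); retailer/location dropped for brevity
rows = [(1, "Fruit", 1), (2, "Fruit", 3), (2, "Dairy", 4),
        (3, "Dairy", 10), (4, "Spices", 1), (4, "Meat", 2)]

users = sorted({u for u, _, _ in rows})
cats = sorted({c for _, c, _ in rows})
u_idx = {u: i for i, u in enumerate(users)}   # user id  -> matrix row
c_idx = {c: j for j, c in enumerate(cats)}    # category -> matrix column

M = np.zeros((len(users), len(cats)))
for u, c, q in rows:
    M[u_idx[u], c_idx[c]] += q                # accumulate quantities per cell

print(M[u_idx[2], c_idx["Dairy"]])  # → 4.0
```

The same pivot, with products as columns instead of categories, yields the second-step matrix; most cells stay zero, which is the sparsity the validation slide quantifies.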
Model: collaborative filtering
[Diagram omitted: items Felipe (FL) bought are recommended to the similar user Francis (FL)]

Editor's notes

  1. Hi, my name is XXX. Today I will present the consulting project I worked on during my stay at Insight.
  2. I took on a consulting project for a company that enables retailers to communicate with their customers AFTER the sale. This way, the retailers can make personalized offers to their customers. My aim was to give some guidance on the kind of offers they could make. The company was not interested in an app; they wanted to be provided with a piece of code for a model that could predict the next buy a given user would be most likely to make, based on what he or she had already bought and on demographic specificities. LEAVE OUT: # Compared to online stores, physical stores are at a disadvantage in terms of customer access. # Once a purchase is made, there is little chance to contact the customer again if no active step has been taken. "Lose the paper / go beyond the sale" / took it off
  3. The data was stored in a MySQL database on a remote server. It contains information about more than 260 M products purchased by 9 M users, in 6000 stores from 40 retailers across the US and Canada. The products are characterized by several features, including their name (e.g., banana), their category (e.g., fruit), and the quantity purchased.
  4. I started by evaluating who buys where. The NDA for this project prohibits me from naming the retailers and the kinds of articles they sell, so I will refer to them as retailers A, B, C. As you can see in the upper graph, the two retailers A and C account for most of the users. In the lower graph, you can see that about a third of the purchases for retailer A were made in FL + TX. I thus focused on these subsets to estimate my model.
  5. I wanted a model optimal for a large number of users that could still be estimated for the limited number of users I managed to pull, since the data was not easily accessible because of the server distance. I chose collaborative filtering, which assesses user or item similarity to predict a user's preference. For example, if two users tend to watch and like similar kinds of movies, the likelihood that user 1 will like a movie that user 2 likes is high. I used the quantity of items purchased as a proxy for users' preference. Finally, the data was too sparse to come up with direct product recommendations, so I used the category information provided in the DB as an intermediate step. Again, the NDA prevents me from revealing true information about the products, so I replaced the real categories with grocery items.
  6. So the way my model works is the following: Francis, a user from FL,
  7. will first be recommended categories according to his purchase history and his similarity with other users from FL. These categories are ranked from most likely (fruits) to least likely (drinks), as a function of similarity scores. In this case, Francis was recommended fruits.
  8. Then, in a similar way, the user is recommended products within one category. In this case, Francis was recommended bananas.
  9. But instead of bananas, he was recommended an apple. The algorithm could be run on all possible users, and by comparing predictions between states, it could provide insight into local shopping behaviors and thus suggest to retailers how to adjust their offer according to location. In this model, items/categories that were already bought will not appear at the top of the recommendation, but this can be modified according to the type of merchant, e.g., a grocery retailer with a high repurchase score vs. a furniture retailer with a low repurchase score.
  10. These are the validation metrics for both steps of the recommender system. On the left, category choice: both classification metrics and ranking metrics (NDCG) suggest a good prediction and an accurate ranking of relevant items. On the right, product choice: metrics are good except for precision (the ability of the classifier not to label as positive a sample that is negative), possibly due to the small user sample vs. the large number of products, resulting in important sparsity. It would be interesting to run the model on more users. Sparsity: 14% of the user-item ratings have a value. Precision: the ability of the classifier not to label as positive a sample that is negative; the fraction of true positives out of all samples labeled positive (TP / (TP + FP)). Recall: the ability of the classifier to find all the positive samples; the fraction of true positives found out of all actual positives. F1: a combination of the two; perfect = 1. ROC-AUC: the choice of the threshold is left to the user and can be varied depending on the desired trade-offs. Therefore, to summarize classification performance generally, we need metrics that aggregate over this threshold. One tool for generating such a metric is the Receiver Operating Characteristic (ROC) curve, a plot of the true positive rate (TPR) versus the false positive rate (FPR). Ranking metrics: the final rank-based metric we will discuss is the Normalized Discounted Cumulative Gain (NDCG). NDCG is a very popular metric which strongly emphasizes that items with high relevance should be placed early in the ranked list; order is important. NDCG is based on the Discounted Cumulative Gain (DCG). Simply put, for each user we take the rank-ordered list (from our predictions) and look up how relevant the items are (from the true ratings). Relevances are simply the values of the entries in the test set, e.g., (0, 1) or (1, 2, 3, 4, 5) in the MovieLens case. We then exponentially weight the relevances but discount them based on their place in the list. NDCG @4: 0.757; NDCG @8: 0.803; NDCG @16: 0.846; NDCG @32: 0.878. When we look at these numbers, we see that (on average) the first 4 items shown to each user achieved only 0.757 of the DCG of a perfect set of recommendations. As k increases, NDCG increases because the cumulative effect of the rest of the list washes out some errors made early. The nice thing about NDCG is that the actual score matters: we didn't binarize our predicted ratings.
  11. We have a 2-step recommender system based on purchase history, user similarity, and user location. Scalability: to deploy this algorithm on their full database, the company would need to query the entire DB and have sufficient computational power for 9 M users. You can find out more about the project, the strategy, and the Python code online. It's not an app!
  12. About me: I have a Ph.D. in cognitive neuroscience. I did both my undergraduate and graduate studies in Switzerland and, as a consequence, I'm very fond of skiing. Thank you very much. If time: smells => emotions => effects on cognitive performance (e.g. attention) => mapping these effects in the brain, during a collaboration with a perfume company in which I mapped the effects of fragrances on the human brain.
  13.–14. (Repeated summary and validation slides; the notes are identical to those for slides 11 and 10 above.)
  15. Repurchase. Cosine similarity vs. dot product: the cosine similarity of two vectors is just their dot product scaled by the product of their magnitudes. Geometrically, cosine similarity only cares about the angle between the vectors, while the dot product cares about both angle and magnitude; if the data are normalized to the same magnitude, the two are indistinguishable. Sometimes it is desirable to ignore magnitude, hence cosine similarity is nice, but if magnitude plays a role, the dot product is the better similarity measure. Cosine similarity is a judgment of orientation, not magnitude: two vectors with the same orientation have a similarity of 1, two vectors at 90° have a similarity of 0, and two diametrically opposed vectors have a similarity of -1, independent of their magnitude. It is particularly used in positive space, where the outcome is neatly bounded in [0, 1]. Note that neither of these is a proper distance metric. So maybe also try Euclidean distances? Memory-based collaborative filtering uses user rating data to compute the similarity between users or items, which is then used for making recommendations. This was an early approach used in many commercial systems; it is effective and easy to implement. Typical examples are neighbourhood-based CF and item-based/user-based top-N recommendations. In user-based approaches, the rating user u gives to item i is calculated as an aggregation of some similar users' ratings of the item: $r_{u,i} = \operatorname{aggr}_{u' \in U} r_{u',i}$, where $U$ denotes the set of the top N users most similar to u who rated item i. Example aggregation functions include $r_{u,i} = \frac{1}{N} \sum_{u' \in U} r_{u',i}$, $r_{u,i} = k \sum_{u' \in U} \operatorname{simil}(u, u')\, r_{u',i}$, and $r_{u,i} = \bar{r}_u + k \sum_{u' \in U} \operatorname{simil}(u, u')\,(r_{u',i} - \bar{r}_{u'})$, where $k = 1 / \sum_{u' \in U} |\operatorname{simil}(u, u')|$ is a normalizing factor and $\bar{r}_u$ is the average rating of user u over all items rated by u. The neighborhood-based algorithm computes the similarity between two users or items and produces a prediction for the user by taking the weighted average of all the ratings. Similarity computation is an important part of this approach; common measures include the Pearson correlation and vector-cosine similarity. The Pearson correlation between users x and y is $\operatorname{simil}(x, y) = \frac{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)(r_{y,i} - \bar{r}_y)}{\sqrt{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r}_x)^2} \sqrt{\sum_{i \in I_{xy}} (r_{y,i} - \bar{r}_y)^2}}$, where $I_{xy}$ is the set of items rated by both x and y. The cosine-based approach defines the similarity between x and y as $\operatorname{simil}(x, y) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|} = \frac{\sum_{i \in I_{xy}} r_{x,i} r_{y,i}}{\sqrt{\sum_{i \in I_x} r_{x,i}^2} \sqrt{\sum_{i \in I_y} r_{y,i}^2}}$. The user-based top-N recommendation algorithm uses a similarity-based vector model to identify the k users most similar to an active user; their user-item matrices are then aggregated to identify the set of items to recommend. A popular way to find similar users is locality-sensitive hashing, which implements the nearest-neighbor mechanism in linear time. Advantages of this approach include the explainability of the results (an important aspect of recommendation systems), easy creation and use, easy incorporation of new data, content-independence of the recommended items, and good scaling with co-rated items. Disadvantages: performance decreases when the data get sparse, which occurs frequently with web-related items; this hinders scalability and creates problems with large datasets. Although new users are handled efficiently because the approach relies on a data structure, adding new items is more complicated, since the representation usually relies on a specific vector space, and adding a new item requires re-inserting all the elements in the structure.
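The mean-centered weighted-average aggregation described above can be sketched in a few lines of NumPy. A toy ratings matrix stands in for the real user-item data, and `cosine_sim` / `predict` are illustrative names, not the project's code:

```python
import numpy as np

# Toy user-item rating matrix (rows = users, cols = items); 0 = unrated.
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [1.0, 0.0, 4.0, 4.0]])

def cosine_sim(a, b):
    """Cosine similarity: dot product scaled by the product of magnitudes."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(R, u, i, k=2):
    """Mean-centered weighted average over the k most similar users who rated i."""
    raters = [v for v in range(R.shape[0]) if v != u and R[v, i] > 0]
    sims = {v: cosine_sim(R[u], R[v]) for v in raters}
    top = sorted(raters, key=lambda v: -sims[v])[:k]
    mean_u = R[u][R[u] > 0].mean()
    norm = sum(abs(sims[v]) for v in top)  # the normalizing factor k above
    if norm == 0:
        return mean_u
    centered = sum(sims[v] * (R[v, i] - R[v][R[v] > 0].mean()) for v in top)
    return mean_u + centered / norm

print(predict(R, u=0, i=2))
```

This implements the third aggregation formula (baseline rating of the active user plus a similarity-weighted sum of the neighbors' mean-centered ratings). For simplicity the sketch computes cosine similarity over full rows, zeros included; a production version would restrict to co-rated items.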
  16. Focus on the deliverable: rather than mentioning all the things you tried that didn’t work, focus on what you accomplished, the final product.
  17. Leave-k-out: a common strategy for splitting recommendation data into training and test sets. A split percentage is chosen (e.g., 80% train, 20% test) and the test fraction is selected randomly from the user-item pairs with non-zero entries. Note: in this implementation of leave-k-out, the train and test data have the same shape, but the test entries are zeroed out in the training set; in addition, values are imputed where the matrix has no entries. Error metrics: the Mean Squared Error (MSE) averages the squared difference between predicted value $p_i$ and actual value $a_i$ over all $N$ test examples, $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(p_i - a_i)^2$, and the RMSE is simply its square root. The RMSE is perhaps more interpretable because it is on the same scale as the data, but it contains the same information as the MSE. While the MSE is easy to compute, it can suffer from very large error contributions from outliers, since squaring the error emphasizes large deviations. A more robust error metric is the Mean Absolute Error (MAE), which averages the absolute values of the differences.
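A minimal sketch of the leave-k-out split and the error metrics on a toy matrix; the mean-rating "predictor" used here is only a placeholder, not the project's model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ratings matrix; 0 means "no rating".
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])

# Leave-k-out: hide 20% of the non-zero entries for testing.
rows, cols = np.nonzero(R)
n_test = max(1, int(0.2 * rows.size))
pick = rng.choice(rows.size, size=n_test, replace=False)
train = R.copy()
train[rows[pick], cols[pick]] = 0.0  # same shape, test entries zeroed out

# Placeholder predictions: the mean of each user's remaining training ratings.
preds, actuals = [], []
for r, c in zip(rows[pick], cols[pick]):
    seen = train[r][train[r] > 0]
    preds.append(seen.mean() if seen.size else 0.0)
    actuals.append(R[r, c])
preds, actuals = np.array(preds), np.array(actuals)

mse = np.mean((preds - actuals) ** 2)   # emphasizes large deviations
rmse = np.sqrt(mse)                     # same scale as the ratings
mae = np.mean(np.abs(preds - actuals))  # more robust to outliers
```

Because squaring magnifies large errors, MAE ≤ RMSE always holds, which is the robustness argument made above.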
  18. Information sources. Product count as a function of the quantity purchased (fruits).
  19. Then, in a similar way, the user is recommended products within the chosen category. In this case, Francis was recommended bananas.
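The two-step flow (pick a category first, then a product within it) can be sketched as below. The scoring here is a simple popularity-over-other-users placeholder, not the actual collaborative-filtering model, and the `history`/`recommend` names are hypothetical:

```python
from collections import Counter

# Toy purchase history: user -> list of (category, product) pairs.
history = {
    "francis": [("fruit", "apple"), ("fruit", "banana"), ("dairy", "milk")],
    "ted":     [("fruit", "banana"), ("fruit", "orange")],
    "ana":     [("dairy", "yogurt"), ("fruit", "banana")],
}

def recommend(user, history):
    """Step 1: pick the user's most frequent category.
    Step 2: recommend the product most popular in that category
    among the *other* users."""
    cats = Counter(cat for cat, _ in history[user])
    top_cat = cats.most_common(1)[0][0]
    pool = Counter(prod for other, buys in history.items() if other != user
                   for cat, prod in buys if cat == top_cat)
    return top_cat, pool.most_common(1)[0][0]

print(recommend("francis", history))
```

On this toy data the output mirrors the slide: Francis's dominant category is fruit, and the fruit most popular among the other shoppers is bananas.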