17. Improvements
- Compare Pearson coefficients / Euclidean distances
- Different clustering of products than by category
- Direct User x Item model weighted by the probability that a user will buy AGAIN from a given category (repurchase score)
- Repurchase score: combination of how popular a category or product is (how many items were purchased by real users, not just the COO matrix size) AND whether these were purchased over several days
- If the repurchase score is low: remove already-purchased items (category and product) from the recommendation to avoid recommending the same thing twice if it was not liked, e.g., furniture shopping
- Not enough history in my subsample to assess this
18. Deliverable
• Python notebook - User- and Item-based recommendation
• User x Category recommendation matrix
• User x User similarity matrix, within state
• User x Product / Category matrices (18)
- User x User similarity matrix, for different states to
compare predictions for similar users, predict different
local shopping behaviors & adjust offer locally
19. Additional validation
• Train-test split by leave-k-out
• Train model, compare predictions to test
• MAE (Quantity vs. binarized data)
[Bar chart: MAE (0-1) for quantity-based ("Qty_based") vs. binarized ("Ones") predictions]
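The MAE comparison on this slide can be sketched as follows; the held-out quantities and both models' predictions are made-up illustrative numbers, not project results.

```python
import numpy as np

def mae(predicted, actual):
    """Mean absolute error over held-out (test) entries."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.abs(predicted - actual).mean())

# Made-up held-out quantities and two prediction sets:
actual_qty = [1, 3, 4, 10, 1, 2]
pred_qty = [2, 2, 5, 8, 1, 3]      # model trained on raw quantities
pred_ones = [1, 1, 1, 1, 1, 1]     # "Ones": binarized baseline
print(mae(pred_qty, actual_qty))   # 1.0
print(mae(pred_ones, actual_qty))  # 2.5
```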
20. Definitions
• MAE: Mean absolute error
• MSE: Mean squared error
• Recall: ability of classifier to find all positive samples.
• Precision: ability of classifier not to label as positive a sample that is negative.
• F1: combination of recall and precision
• ROC AUC: For non-binary data, a threshold must be chosen such that all
ratings above the threshold are good and labeled "1", while the rest are bad
and labeled "0". To summarize classification performance generally, we need
metrics that can provide summaries over this threshold. One tool for
generating such a metric is the Receiver Operating Characteristic (ROC)
curve, a plot of the true positive rate (TPR) vs. the false positive rate (FPR).
• Cosine similarity: this metric can be thought of geometrically if one treats a given
user's (item's) row (column) of the ratings matrix as a vector. For user-based
collaborative filtering, two users' similarity is measured as the cosine of the
angle between the two users' vectors.
• Leave-k-out: a split percentage is chosen (e.g., 80% train, 20% test) and the
test percentage is selected randomly from the user-item pairs with non-zero
entries.
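The cosine similarity definition above can be sketched directly; the two user rating vectors below are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two users' (or items') rating vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two hypothetical users' quantities over the same five items:
user1 = [1, 3, 0, 4, 0]
user2 = [2, 6, 0, 8, 0]  # same proportions, twice the quantities
print(cosine_similarity(user1, user2))  # ≈ 1.0: identical orientation
```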
22. Data organization
User  Categ.  Product  Qty  Retailer  Location
1     Fruit   Apple    1    A         FL
2     Fruit   Orange   3    A         TX
2     Dairy   Milk     4    B         TX
3     Dairy   Egg      10   B         NY
4     Spices  Pepper   1    C         NJ
4     Meat    Beef     2    C         NJ
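A minimal sketch of turning rows like the table above into a User x Category matrix. The project used a sparse COO matrix; a small dense array is used here for readability, and the `u_idx`/`c_idx` index helpers are illustrative.

```python
import numpy as np

# Hypothetical rows mirroring the table above: (user, category, quantity)
rows = [(1, "Fruit", 1), (2, "Fruit", 3), (2, "Dairy", 4),
        (3, "Dairy", 10), (4, "Spices", 1), (4, "Meat", 2)]

users = sorted({u for u, c, q in rows})
cats = sorted({c for u, c, q in rows})
u_idx = {u: i for i, u in enumerate(users)}
c_idx = {c: j for j, c in enumerate(cats)}

# User x Category matrix with purchased quantity as a preference proxy
M = np.zeros((len(users), len(cats)))
for u, c, q in rows:
    M[u_idx[u], c_idx[c]] += q

print(M)
```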
Hi,
My name is XXX
Today I will present the consulting project I worked on during my stay at Insight.
I took a consulting project for a company that enables retailers to communicate with their customers AFTER the sale. This way, the retailers can make personalized offers to their customers.
My aim was to give some insight into the kind of offers they could make.
The company was not interested in an app. Rather, they wanted to be provided with a piece of code for a model that could predict the next purchase a given user would be most likely to make, based on what he or she had already bought and on demographic specificities.
LEAVE OUT
#Compared to online stores, physical stores are at a disadvantage in terms of customer access.
#Once a purchase is made, there is little chance to contact the customer back if no active step has been taken.
“Lose the paper / go beyond the sale” / took it off
The data was stored in a MySQL database on a remote server.
It contains information about more than 260 M products purchased by 9 M users, in 6,000 stores from 40 retailers across the US and Canada.
The products are characterized by several features, including their name (e.g., banana), their category (e.g., fruit), and the quantity purchased.
I started by evaluating: who buys where?
The NDA for this project prohibits me from naming the retailers and the kinds of articles they sell, so I will refer to them as retailers A, B, and C.
As you can see in the upper graph, the two retailers A and C account for most of the users. In the lower graph, you can see that about a third of the purchases for retailer A were made in FL and TX.
I thus focused on these subsets to estimate my model
I wanted a model that would be optimal for a large number of users but could still be estimated from the limited number of users I managed to pull, since the data was not easily accessible because of server distance.
I chose collaborative filtering, which assesses user or item similarity to predict a user's preference. For example, if two users tend to watch and like similar kinds of movies, the likelihood that user 1 will like a movie that user 2 likes is high.
I used the quantity of items purchased as a proxy for users’ preference
Finally, the data was too sparse to come up with direct product recommendations, so I used the category information provided in the DB as an intermediate step. Again, the NDA prevents me from revealing true information about the products, so I replaced real categories with grocery items.
So my model works as follows: Francis, a user from FL,
will first be recommended categories according to his purchase history and his similarity with other users from FL.
These categories will be ranked from most likely (fruits) to least likely (drinks), as a function of similarity scores. In this case, Francis was recommended fruits.
Then, in a similar way, the user will be recommended products within one category. In this case, Francis was recommended bananas (in the slide, he is shown an apple instead).
The algorithm could be run on all possible users, and by comparing predictions between states, it could yield insight into local shopping behaviors and thus suggest to retailers how to adjust their offer according to location.
In this model, items/categories that were already bought will not appear at the top of the recommendation, but this can be modified according to the type of merchant, e.g., a grocery retailer with a high repurchase score vs. a furniture retailer with a low repurchase score.
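The two recommendation steps just described can be sketched as follows. `recommend` and all names and scores below are made up for illustration; in the real model the scores come from the user-user similarity aggregation.

```python
def recommend(user_cat_scores, cat_product_scores, purchased=(), exclude=True):
    """Step 1: pick the top category by the user's predicted preference.
    Step 2: rank products within that category.
    Optionally drop already-purchased products (low repurchase score)."""
    top_cat = max(user_cat_scores, key=user_cat_scores.get)
    products = cat_product_scores[top_cat]
    ranked = sorted(products, key=products.get, reverse=True)
    if exclude:
        ranked = [p for p in ranked if p not in purchased]
    return top_cat, ranked

# Hypothetical similarity-derived scores for "Francis"
user_cats = {"fruits": 0.9, "dairy": 0.5, "drinks": 0.1}
cat_prods = {"fruits": {"banana": 0.8, "apple": 0.6, "orange": 0.3},
             "dairy": {"milk": 0.7},
             "drinks": {"soda": 0.2}}
print(recommend(user_cats, cat_prods))  # ('fruits', ['banana', 'apple', 'orange'])
```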
These are the validation metrics for both steps of the recommender system.
On the left, category choice: both classification metrics and ranking metrics (NDCG) suggest good prediction and accurate ranking of relevant items.
On the right, product choice: metrics are good except for precision (the ability of the classifier not to label as positive a sample that is negative), possibly due to the small user sample versus the large number of products, resulting in significant sparsity. It would be interesting to run the model on more users.
Sparsity: 14% of the user-item ratings have a value.
Precision: The precision is intuitively the ability of the classifier not to label as positive a sample that is negative
Recall: The recall is intuitively the ability of the classifier to find all the positive samples.
Classifier metric
Precision: fraction of positive labels you got correct, out of all the samples you labeled positive: TP / (TP + FP).
Recall: fraction of positive labels you got correct, out of all the actual positives: TP / (TP + FN).
F1 = harmonic mean of the two. Perfect = 1.
ROC-AUC: The choice of the threshold is left to the user, and can be varied depending on desired tradeoffs (for a more in-depth discussion of this threshold, see this blog). Therefore, to summarize classification performance generally, we need metrics that can provide summaries over this threshold. One tool for generating such a metric is the Receiver Operating Characteristic (ROC) curve. It is a plot of the true positive rate (TPR) versus the false positive rate (FPR).
Ranking metrics
The final rank-based metric we will discuss is the Normalized Discounted Cumulative Gain (NDCG). NDCG is a very popular metric, which emphasizes - strongly - that items with high relevance should be placed early in the ranked list. In addition, order is important.
NDCG is based on the Discounted Cumulative Gain (DCG). Simply put, for each user we take the rank-ordered list (from our predictions) and look up how relevant the items are (from the true rating). Relevances are simply the values of the entries in the test set, so (0, 1), or (1, 2, 3, 4, 5) in the MovieLens case. We then exponentially weight the relevances, but discount them based on their place in the list.
NDCG @4 : 0.757
NDCG @8 : 0.803
NDCG @16 : 0.846
NDCG @32 : 0.878
When we look at these numbers, we see that (on average) the first 4 items we showed to each user achieved only 0.757 of the DCG from a perfect set of recommendations. As k increases, NDCG increases because the cumulative effect of the rest of the list washes out some errors we made early. The cool thing about NDCG is that the actual score matters: we didn't binarize our predicted ratings.
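The DCG/NDCG computation described above can be sketched as follows, assuming the common exponential-gain, log2-discount form; the relevance list is hypothetical, not the project's data.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG: exponentially weighted relevances, discounted by log2 of position."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg_at_k(relevances_in_predicted_order, k):
    """NDCG: DCG of the predicted ranking over the DCG of the ideal ranking."""
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    return dcg_at_k(relevances_in_predicted_order, k) / dcg_at_k(ideal, k)

# Hypothetical true relevances, listed in the order our model ranked the items:
rels = [3, 2, 3, 0, 1, 2]
print(round(ndcg_at_k(rels, 4), 3))  # ≈ 0.872
```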
We have a 2 step Recommender system based on purchase history, user similarity & user location.
Scalability: what would the company need to deploy this algorithm on their database? Query the entire DB, and sufficient computational power for 9 M users.
You can find out more about the project, the strategy, and the Python code used online.
It’s not an app!!
About me: I have a PhD in cognitive neuroscience.
I did both my undergraduate and graduate studies in Switzerland and, as a consequence, I'm very fond of skiing.
Thank you very much
If time: smells => emotions => effects on cognitive performance (i.e., attention) => map these effects in the brain.
Another picture??
during which I mapped the effects of fragrances on the human brain, in collaboration with a perfume company
It looks like the cosine similarity of two features is just their dot product scaled by the product of their magnitudes. When does cosine similarity make a better distance metric than the dot product? I.e., do the dot product and cosine similarity have different strengths or weaknesses in different situations?
Think geometrically. Cosine similarity only cares about angle difference, while the dot product cares about angle and magnitude. If you normalize your data to have the same magnitude, the two are indistinguishable. Sometimes it is desirable to ignore the magnitude, hence cosine similarity is nice; but if magnitude plays a role, the dot product would be better as a similarity measure. Note that neither of them is a "distance metric".
https://www.quora.com/Why-should-I-use-Cosine-Similarity-for-a-movie-recommendation-engine
Cosine similarity is a measure of similarity between two non zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1].
So maybe try Euclidean distances?
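A quick numeric check of the point above: after normalizing to unit magnitude, the dot product and cosine similarity coincide. The vectors are made up.

```python
import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([3.0, 6.0, 6.0])  # same direction as u, three times the magnitude

dot = u @ v                                          # 27.0: magnitude-sensitive
cos = dot / (np.linalg.norm(u) * np.linalg.norm(v))  # 1.0: orientation only

# After L2-normalization the dot product equals the cosine similarity:
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
print(dot, cos, u_hat @ v_hat)
```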
Memory-based
This approach uses user rating data to compute the similarity between users or items, which is then used for making recommendations. This was an early approach used in many commercial systems; it is effective and easy to implement. Typical examples of this approach are neighbourhood-based CF and item-based/user-based top-N recommendations. For example, in user-based approaches, the rating user u gives to item i is calculated as an aggregation of some similar users' ratings of the item:
r_{u,i} = \operatorname{aggr}_{u' \in U} r_{u',i}
where 'U' denotes the set of top 'N' users that are most similar to user 'u' who rated item 'i'. Some examples of the aggregation function includes:
r_{u,i} = \frac{1}{N} \sum_{u' \in U} r_{u',i}

r_{u,i} = k \sum_{u' \in U} \operatorname{simil}(u,u') \, r_{u',i}

r_{u,i} = \bar{r_u} + k \sum_{u' \in U} \operatorname{simil}(u,u') \, (r_{u',i} - \bar{r_{u'}})
where k is a normalizing factor defined as k = 1 / \sum_{u' \in U} |\operatorname{simil}(u,u')|, and \bar{r_u} is the average rating of user u over all the items rated by u.
The neighborhood-based algorithm calculates the similarity between two users or items and produces a prediction for the user by taking a weighted average of all the ratings. Similarity computation between items or users is an important part of this approach; multiple measures, such as Pearson correlation and vector cosine similarity, are used for this.
The Pearson correlation similarity of two users x, y is defined as

\operatorname{simil}(x,y) = \frac{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r_x})(r_{y,i} - \bar{r_y})}{\sqrt{\sum_{i \in I_{xy}} (r_{x,i} - \bar{r_x})^2} \sqrt{\sum_{i \in I_{xy}} (r_{y,i} - \bar{r_y})^2}}
where Ixy is the set of items rated by both user x and user y.
The cosine-based approach defines the cosine-similarity between two users x and y as:[4]
\operatorname{simil}(x,y) = \cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \, \|\vec{y}\|} = \frac{\sum_{i \in I_{xy}} r_{x,i} \, r_{y,i}}{\sqrt{\sum_{i \in I_x} r_{x,i}^2} \sqrt{\sum_{i \in I_y} r_{y,i}^2}}
The user based top-N recommendation algorithm uses a similarity-based vector model to identify the k most similar users to an active user. After the k most similar users are found, their corresponding user-item matrices are aggregated to identify the set of items to be recommended. A popular method to find the similar users is the Locality-sensitive hashing, which implements the nearest neighbor mechanism in linear time.
The advantages with this approach include: the explainability of the results, which is an important aspect of recommendation systems; easy creation and use; easy facilitation of new data; content-independence of the items being recommended; good scaling with co-rated items.
There are also several disadvantages with this approach. Its performance decreases when data gets sparse, which occurs frequently with web-related items. This hinders the scalability of this approach and creates problems with large datasets. Although it can efficiently handle new users because it relies on a data structure, adding new items becomes more complicated since that representation usually relies on a specific vector space. Adding new items requires inclusion of the new item and the re-insertion of all the elements in the structure.
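The mean-centered, similarity-weighted aggregation above (the last of the three formulas) can be sketched as follows; the ratings and user-user similarity matrices are hypothetical, and zeros are treated as "not purchased".

```python
import numpy as np

def predict_rating(ratings, sims, u, i):
    """Predict r[u, i] as user u's mean rating plus a normalized,
    similarity-weighted sum of the neighbors' mean-centered
    ratings for item i."""
    mask = ratings > 0                                 # observed entries only
    means = ratings.sum(axis=1) / np.maximum(mask.sum(axis=1), 1)
    neighbors = [v for v in range(ratings.shape[0]) if v != u and mask[v, i]]
    if not neighbors:
        return means[u]
    k = 1.0 / sum(abs(sims[u, v]) for v in neighbors)  # normalizing factor
    return means[u] + k * sum(sims[u, v] * (ratings[v, i] - means[v])
                              for v in neighbors)

# Hypothetical 3-user x 3-item quantity matrix (0 = not purchased)
# and a hypothetical user-user similarity matrix.
R = np.array([[4.0, 0.0, 3.0],
              [5.0, 2.0, 4.0],
              [1.0, 1.0, 0.0]])
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
print(predict_rating(R, S, u=0, i=1))  # user 0's predicted rating for item 1
```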
Focus on the deliverable: rather than mentioning all the things you tried that didn’t work, focus on what you accomplished, the final product.
Leave-k-out
A common strategy for splitting recommendation data into training and test sets is leave-k-out. Here, a split percentage is chosen (e.g., 80% train, 20% test) and the test percentage is selected randomly from the user-item pairs with non-zero entries.
Choosing an 80%/20% split, we can see the test data highlighted in our example below:
# note - in this implementation of leave-k-out, the train and test data
# have the same shape, but test data is zeroed out in the training set.
# In addition, we have imputed values where there are no entries in the
# matrix.
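A minimal sketch of such a leave-k-out split on a dense numpy matrix; the ratings values are made up, and the fixed seed is only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

def leave_k_out(ratings, test_frac=0.2):
    """Move test_frac of the non-zero user-item pairs into a test matrix
    of the same shape, zeroing those entries in the training copy."""
    train = ratings.copy()
    test = np.zeros_like(ratings)
    nonzero = np.argwhere(ratings > 0)
    n_test = int(round(test_frac * len(nonzero)))
    picks = nonzero[rng.choice(len(nonzero), size=n_test, replace=False)]
    for u, i in picks:
        test[u, i] = ratings[u, i]
        train[u, i] = 0.0
    return train, test

# Made-up quantity matrix (0 = no purchase)
R = np.array([[1.0, 3.0, 0.0, 4.0],
              [0.0, 2.0, 5.0, 0.0],
              [2.0, 0.0, 1.0, 3.0]])
train, test = leave_k_out(R, test_frac=0.2)
assert np.array_equal(train + test, R)  # each rating is in exactly one split
```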
A common error metric is the Mean Squared Error (MSE) or, similarly, the Root Mean Squared Error (RMSE). Let's define the MSE:

MSE = \frac{1}{N} \sum_{i=1}^{N} (p_i - a_i)^2

Here we sum the squared difference between predicted value p_i and actual value a_i over all N test examples. The RMSE is simply the square root of this value. The RMSE is perhaps more interpretable because it is on the same scale as the data, but nevertheless contains the same information as the MSE.

While the MSE is easy to compute, it can suffer from very large error contributions from outliers: squaring the error puts emphasis on large deviations. A more robust error metric is the Mean Absolute Error (MAE), which sums the absolute values of the differences instead:

MAE = \frac{1}{N} \sum_{i=1}^{N} |p_i - a_i|
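A quick illustration of why the MAE is more robust than the MSE to a single outlier; the predictions and actuals are made up.

```python
import numpy as np

def mse(p, a):
    return float(np.mean((np.asarray(p, float) - np.asarray(a, float)) ** 2))

def rmse(p, a):
    return float(np.sqrt(mse(p, a)))

def mae(p, a):
    return float(np.mean(np.abs(np.asarray(p, float) - np.asarray(a, float))))

pred = [2.0, 3.0, 4.0, 1.0]
actual = [2.0, 3.0, 4.0, 9.0]  # one large (outlier) error of 8

print(mse(pred, actual))   # 16.0 -- the squared outlier dominates
print(rmse(pred, actual))  # 4.0  -- back on the data's scale
print(mae(pred, actual))   # 2.0  -- grows only linearly with the outlier
```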
Information sources
Product count as a function of the quantity purchased (fruits)