2. Georgian Micsa
● Software engineer with 6+ years of experience, mainly Java but also
JavaScript and .NET
● Interested in OOD, architecture and agile software development
methodologies
● Currently working as Senior Java Developer @ Cegeka
● georgian.micsa@gmail.com
● http://ro.linkedin.com/in/georgianmicsa
3. What is it?
● Recommender/recommendation system/engine/platform
● A subclass of information filtering system
● Predict the 'rating' or 'preference' that a user would give to a new item
(music, books, movies, people or groups etc)
● Can use a model built from the characteristics of an item (content-based
approaches)
● Can use the user's social environment (collaborative filtering approaches)
4. Examples
● Amazon.com
○ Recommend additional books
○ Books frequently bought together
○ Implemented using a sparse matrix of book co-occurrences
● Pandora Radio
○ Plays music with similar characteristics
○ Content-based filtering on properties of the song/artist
○ Based also on user's feedback
○ Users emphasize or deemphasize certain characteristics
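A sparse co-occurrence matrix like the one mentioned for Amazon.com can be sketched in plain Java. This is only an illustration under stated assumptions, not Amazon's actual implementation: the class name `CooccurrenceSketch`, the `count` method and the sample baskets are hypothetical, and each basket is assumed to list an item at most once.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts how often two items share a basket. */
class CooccurrenceSketch {

    /** Returns item -> (other item -> number of baskets that contain both). */
    static Map<String, Map<String, Integer>> count(List<List<String>> baskets) {
        Map<String, Map<String, Integer>> matrix = new HashMap<>();
        for (List<String> basket : baskets) {
            for (String a : basket) {
                for (String b : basket) {
                    if (!a.equals(b)) {
                        // Only pairs that actually occur are stored, so the matrix stays sparse.
                        matrix.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                    }
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<List<String>> baskets = Arrays.asList(
                Arrays.asList("A", "B"),
                Arrays.asList("A", "B", "C"),
                Arrays.asList("B", "C"));
        // Items bought together with "A", with their counts
        System.out.println(count(baskets).get("A"));
    }
}
```

Sorting the inner map of an item by descending count gives a simple "frequently bought together" list.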
5. Examples 2
● Last.fm
○ Collaborative filtering
○ Recommends songs by observing the tracks played by user and
comparing to behaviour of other users
○ Suggests songs played by users with similar interests
● Netflix
○ Predictions of movies
○ Hybrid approach
○ Collaborative filtering based on the user's previous ratings and watching
behaviours (compared to other users)
○ Content-based filtering based on characteristics of movies
6. Collaborative filtering
● Collect and analyze a large amount of information on users’ behaviors,
activities or preferences
● Predict what users will like based on their similarity to other users
● It does not rely on the content of the items
● Measures user similarity or item similarity
● Many algorithms:
○ k-nearest neighbors (k-NN)
○ Pearson correlation
○ etc.
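As an illustration of measuring user similarity, the Pearson correlation of two users' ratings can be computed as below. This is a minimal sketch, assuming both arrays hold ratings for the same co-rated items in the same order; the class and method names are hypothetical.

```java
/** Hypothetical helper: Pearson correlation between two users' ratings. */
class PearsonSketch {

    /** x[i] and y[i] must be the two users' ratings of the same item i. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equally long arrays with at least 2 ratings");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        double denom = Math.sqrt(varX * varY);
        // Undefined when one user gave every item the same rating.
        return denom == 0 ? Double.NaN : cov / denom;
    }

    public static void main(String[] args) {
        double[] alice = {1, 2, 3};
        double[] bob = {2, 4, 6};
        System.out.println(pearson(alice, bob));
    }
}
```

Values near 1 mean the two users rate alike, values near -1 mean opposite tastes, and values near 0 mean no linear relationship.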
7. Collaborative filtering 2
● Build a model of the user's profile by collecting explicit and implicit data
● Explicit data:
○ Asking a user to rate an item on a sliding scale.
○ Asking a user to rank a collection of items from favorite to least favorite.
○ Presenting two items to a user and asking him/her to choose the
better one of them.
○ Asking a user to create a list of items that he/she likes.
● Implicit data:
○ Observing the items that a user views in an online store.
○ Analyzing item/user viewing times
○ Keeping a record of the items that a user purchases online.
○ Obtaining a list of items that a user has listened to or watched
○ Analyzing the user's social network and discovering similar likes and
dislikes
8. Collaborative filtering 3
● Collaborative filtering approaches often suffer from three problems:
○ Cold Start: needs a large amount of existing data on a user in order
to make accurate recommendations
○ Scalability: a large amount of computation power is often necessary
to calculate recommendations.
○ Sparsity: the number of items sold on major e-commerce sites is
extremely large. Even the most active users will have rated only a small
subset of the overall database, so even the most popular items
have very few ratings.
9. Content-based filtering
● Based on information about and characteristics of the items
● Try to recommend items that are similar to those that a user liked in the
past (or is examining in the present)
● Use an item profile (a set of discrete attributes and features)
● Content-based profile of users based on a weighted vector of item
features
● The weights denote the importance of each feature to the user
● To compute the weights:
○ average the feature vectors of the rated items
○ Bayesian Classifiers, cluster analysis, decision trees, and artificial
neural networks
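The averaging approach above can be sketched as follows: the user profile is a rating-weighted average of the feature vectors of rated items, and a candidate item is scored by cosine similarity to that profile. The class name, method names and the choice of cosine similarity are assumptions made for illustration.

```java
/** Hypothetical helper: content-based user profile and item scoring. */
class ContentProfileSketch {

    /** Rating-weighted average of the feature vectors of the items the user rated. */
    static double[] buildProfile(double[][] itemFeatures, double[] ratings) {
        int dims = itemFeatures[0].length;
        double[] profile = new double[dims];
        double ratingSum = 0;
        for (int i = 0; i < itemFeatures.length; i++) {
            for (int d = 0; d < dims; d++) {
                profile[d] += ratings[i] * itemFeatures[i][d];
            }
            ratingSum += ratings[i];
        }
        if (ratingSum != 0) {
            for (int d = 0; d < dims; d++) {
                profile[d] /= ratingSum;
            }
        }
        return profile;
    }

    /** Cosine similarity; returns 0 when either vector has zero length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int d = 0; d < a.length; d++) {
            dot += a[d] * b[d];
            normA += a[d] * a[d];
            normB += b[d] * b[d];
        }
        double denom = Math.sqrt(normA) * Math.sqrt(normB);
        return denom == 0 ? 0 : dot / denom;
    }

    public static void main(String[] args) {
        // Two features per item, e.g. {isRock, isJazz}
        double[][] rated = {{1, 0}, {0, 1}};
        double[] ratings = {5, 1};
        double[] profile = buildProfile(rated, ratings);
        System.out.println(cosine(profile, new double[]{1, 0})); // score of a rock item
        System.out.println(cosine(profile, new double[]{0, 1})); // score of a jazz item
    }
}
```

Each profile component plays the role of a weight: the larger it is, the more that feature matters to the user.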
10. Content-based filtering 2
● Can collect feedback from user to assign higher or lower weights on the
importance of certain attributes
● Cross-content recommendation: music, videos, products, discussions etc.
from different services can be recommended based on news browsing.
● Popular for movie recommendations: Internet Movie Database, See This
Next etc.
11. Hybrid Recommender Systems
● Combines collaborative filtering and content-based filtering
● Implemented in several ways:
○ by making content-based and collaborative-based predictions
separately and then combining them
○ by adding content-based capabilities to a collaborative-based
approach (and vice versa)
○ by unifying the approaches into one model
● Studies have shown that hybrid methods can provide more accurate
recommendations than pure approaches
● Can overcome the cold start and sparsity problems
● Examples: Netflix and See This Next
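The first option, predicting separately and then combining, might look like the sketch below. The weighting scheme, the `cfWeight` parameter and the use of NaN to signal "no collaborative prediction available" are assumptions for illustration only.

```java
/** Hypothetical helper: weighted combination of two independent predictions. */
class HybridSketch {

    /**
     * Blends a collaborative filtering score with a content-based score.
     * cfWeight is expected to lie between 0 and 1.
     */
    static double combine(double cfScore, double contentScore, double cfWeight) {
        if (Double.isNaN(cfScore)) {
            // Cold start: no usable collaborative data yet, fall back to content only.
            return contentScore;
        }
        return cfWeight * cfScore + (1 - cfWeight) * contentScore;
    }

    public static void main(String[] args) {
        System.out.println(combine(4.0, 2.0, 0.75));
        System.out.println(combine(Double.NaN, 2.0, 0.75));
    }
}
```

The fallback branch is one simple way a hybrid can keep producing recommendations for new users or new items.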
12. What is Apache Mahout?
● A scalable Machine Learning library
● Apache License
● Scalable to reasonably large datasets (core algorithms implemented in
Map/Reduce, runnable on Hadoop)
● Distributed and non-distributed algorithms
● Community
● Use cases
○ Clustering (group items that are topically related)
○ Classification (learn to assign categories to documents)
○ Frequent Itemset Mining (find items that appear together)
○ Recommendation Mining (find items a user might like)
13. Non-distributed recommenders
● Non-distributed, non-Hadoop, collaborative recommender algorithms
● Embedded in a Java application, or run as an external server that exposes
recommendation logic to your application via web services and HTTP
● Key interfaces:
○ DataModel: CSV files or database
○ UserSimilarity: computes similarity between users
○ ItemSimilarity: computes similarity between items
○ UserNeighborhood: defines the group of users most similar to a given user
○ Recommender: produces recommendations
● Different implementations based on your needs
● Input in this format: UserId,ItemId,[Preference or Rating]
● Preference is not needed in case of associations (pages viewed by users)
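To make the input format concrete, here is a small plain-Java parser for such lines. It is not Mahout's `FileDataModel`, only a sketch: the class name is hypothetical, and treating a missing preference as 1.0 is an assumption chosen for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: parses "userId,itemId[,preference]" lines. */
class PreferenceParserSketch {

    /** Returns userId -> (itemId -> preference); a missing preference becomes 1.0. */
    static Map<Long, Map<Long, Float>> parse(List<String> lines) {
        Map<Long, Map<Long, Float>> prefs = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split(",");
            if (parts.length < 2) {
                throw new IllegalArgumentException("Bad line: " + line);
            }
            long userId = Long.parseLong(parts[0].trim());
            long itemId = Long.parseLong(parts[1].trim());
            boolean hasValue = parts.length > 2 && !parts[2].trim().isEmpty();
            float value = hasValue ? Float.parseFloat(parts[2].trim()) : 1.0f;
            prefs.computeIfAbsent(userId, k -> new HashMap<>()).put(itemId, value);
        }
        return prefs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,10,4.5", "1,11", "2,10,3");
        System.out.println(parse(lines));
    }
}
```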
14. User-based recommender example
// Load "userID,itemID,preference" lines from a file
DataModel model = new FileDataModel(new File("data.txt"));
UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
// Optional: infer missing preferences from the user's average rating
userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(model));
// Neighborhood of the 3 most similar users
UserNeighborhood neighborhood =
    new NearestNUserNeighborhood(3, userSimilarity, model);
Recommender recommender =
    new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
Recommender cachingRecommender = new CachingRecommender(recommender);
// Top 10 recommendations for user 1234
List<RecommendedItem> recommendations =
    cachingRecommender.recommend(1234, 10);
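Conceptually, a user-based recommender estimates a rating from the ratings of the nearest neighbors. The plain-Java sketch below shows one common formulation, a similarity-weighted average over the n most similar users. It is not Mahout's implementation; the class name, the parameters and the choice to ignore non-positive similarities are assumptions.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper: similarity-weighted rating estimate from the n nearest users. */
class UserKnnSketch {

    /**
     * similarities[u] is the similarity of user u to the target user,
     * neighborRatings[u] is that user's rating of the item being estimated.
     */
    static double estimate(double[] similarities, double[] neighborRatings, int n) {
        Integer[] order = new Integer[similarities.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Most similar users first
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -similarities[i]));

        double weighted = 0, total = 0;
        for (int k = 0; k < Math.min(n, order.length); k++) {
            int u = order[k];
            if (similarities[u] <= 0) {
                break; // remaining users are not similar at all
            }
            weighted += similarities[u] * neighborRatings[u];
            total += similarities[u];
        }
        return total == 0 ? Double.NaN : weighted / total;
    }

    public static void main(String[] args) {
        double[] similarities = {0.9, 0.1, 0.6};
        double[] ratings = {5, 1, 3};
        System.out.println(estimate(similarities, ratings, 2));
    }
}
```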
15. Item-based recommender example
DataModel model = new FileDataModel(new File("data.txt"));
// Construct the list of pre-computed correlations
Collection<GenericItemSimilarity.ItemItemSimilarity> correlations = ...;
ItemSimilarity itemSimilarity = new GenericItemSimilarity(correlations);
Recommender recommender =
    new GenericItemBasedRecommender(model, itemSimilarity);
Recommender cachingRecommender = new CachingRecommender(recommender);
List<RecommendedItem> recommendations =
    cachingRecommender.recommend(1234, 10);
16. Recommender evaluation
For preference data models:
DataModel myModel = ...;
RecommenderBuilder builder = new RecommenderBuilder() {
    public Recommender buildRecommender(DataModel model) {
        // build and return the Recommender to evaluate here
    }
};
RecommenderEvaluator evaluator =
    new AverageAbsoluteDifferenceRecommenderEvaluator();
double evaluation = evaluator.evaluate(builder, myModel, 0.9, 1.0);
For boolean data models, precision and recall can be computed.
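The metrics behind these evaluations are simple to state. The sketch below computes the average absolute difference between predicted and actual ratings, plus precision and recall for a list of recommended item IDs. It is a stand-alone illustration with hypothetical names, not the Mahout evaluator classes, and it assumes the recommended list contains no duplicates.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: basic recommender evaluation metrics. */
class EvaluationSketch {

    /** Average absolute difference between predicted and actual ratings (lower is better). */
    static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double sum = 0;
        for (int i = 0; i < predicted.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]);
        }
        return sum / predicted.length;
    }

    /** Fraction of recommended items that are relevant. */
    static double precision(List<Long> recommended, Set<Long> relevant) {
        return recommended.isEmpty() ? 0.0 : (double) hits(recommended, relevant) / recommended.size();
    }

    /** Fraction of relevant items that were recommended. */
    static double recall(List<Long> recommended, Set<Long> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) hits(recommended, relevant) / relevant.size();
    }

    private static int hits(List<Long> recommended, Set<Long> relevant) {
        int count = 0;
        for (Long item : recommended) {
            if (relevant.contains(item)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(meanAbsoluteError(new double[]{4, 3, 5}, new double[]{5, 3, 3}));
        List<Long> recommended = Arrays.asList(1L, 2L, 3L, 4L);
        Set<Long> relevant = new HashSet<>(Arrays.asList(2L, 4L, 9L));
        System.out.println(precision(recommended, relevant));
        System.out.println(recall(recommended, relevant));
    }
}
```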
17. Distributed Item Based
● Mahout offers 2 Hadoop Map/Reduce jobs that support item-based
Collaborative Filtering
● org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
○ computes all similar items
○ input is a CSV file with the format userID,itemID,value
○ output is a file of itemIDs with their associated similarity value
○ different configuration options: e.g. similarity measure to use
(co-occurrence, Euclidean distance, Pearson correlation, etc.)
● org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
○ Completely distributed item-based recommender
○ input is a CSV file with the format userID,itemID,value
○ output is a file of userIDs with associated recommended itemIDs and
their scores
○ also configuration options
18. Mahout tips
● Start with non-distributed recommenders
● A modern server with 4 GB of heap can serve 100M user-item associations
as a real-time recommender
● Beyond this scale distributed algorithms make sense
● Data can be sampled; noisy and old data can be pruned
● Ratings: GenericItemBasedRecommender and
PearsonCorrelationSimilarity
● Preferences: GenericBooleanPrefItemBasedRecommender and
LogLikelihoodSimilarity
● Content-based item-item similarity => your own ItemSimilarity
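For boolean preferences, a log-likelihood based similarity scores two items by how surprising their overlap is. The sketch below computes the log-likelihood ratio from a 2x2 contingency table, using the entropy formulation usually attributed to Ted Dunning. It is a stand-alone illustration with hypothetical names, and how Mahout turns this ratio into a final similarity value is not shown here.

```java
/** Hypothetical helper: log-likelihood ratio for a 2x2 contingency table. */
class LogLikelihoodSketch {

    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized Shannon entropy of a list of counts. */
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    /**
     * k11: users who prefer both items, k12: only item A,
     * k21: only item B, k22: users who prefer neither.
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Guard against tiny negative values caused by rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        System.out.println(logLikelihoodRatio(10, 0, 0, 10)); // strongly associated items
        System.out.println(logLikelihoodRatio(1, 1, 1, 1));   // independent items
    }
}
```

A ratio near zero means the overlap is what chance would predict; larger values indicate a stronger association between the two items.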
19. Mahout tips 2
● CSV files
○ FileDataModel
○ push new files periodically
● Database
○ XXXJDBCDataModel
○ ReloadFromJDBCDataModel
● Offline or live recommendations?
○ Distributed algorithms => offline periodic computations
○ Data is pushed periodically as CSV files or in DB
○ SlopeOneRecommender deals with updates quickly
○ Real-time update of the DataModel and refresh of the recommender after
some events (user rates an item etc.)
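The idea behind Slope One is simple enough to sketch in plain Java: predict a rating from the average rating difference between item pairs, weighted by how many users rated both. The code below is a naive, unoptimized illustration of weighted Slope One with hypothetical names, not Mahout's `SlopeOneRecommender`; a real implementation would keep the pairwise differences precomputed, which is what makes incremental updates cheap.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: naive weighted Slope One prediction. */
class SlopeOneSketch {

    /** ratings: user -> (item -> rating). Returns NaN when no prediction is possible. */
    static double predict(Map<String, Map<String, Double>> ratings, String user, String target) {
        Map<String, Double> own = ratings.get(user);
        if (own == null) {
            return Double.NaN;
        }
        double numerator = 0;
        int denominator = 0;
        for (Map.Entry<String, Double> e : own.entrySet()) {
            String j = e.getKey();
            if (j.equals(target)) {
                continue;
            }
            // Average difference (target - j) over all users who rated both items
            double diffSum = 0;
            int count = 0;
            for (Map<String, Double> other : ratings.values()) {
                Double rt = other.get(target);
                Double rj = other.get(j);
                if (rt != null && rj != null) {
                    diffSum += rt - rj;
                    count++;
                }
            }
            if (count > 0) {
                numerator += (diffSum / count + e.getValue()) * count;
                denominator += count;
            }
        }
        return denominator == 0 ? Double.NaN : numerator / denominator;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("John", new HashMap<>());
        ratings.get("John").put("A", 5.0);
        ratings.get("John").put("B", 3.0);
        ratings.get("John").put("C", 2.0);
        ratings.put("Mark", new HashMap<>());
        ratings.get("Mark").put("A", 3.0);
        ratings.get("Mark").put("B", 4.0);
        ratings.put("Lucy", new HashMap<>());
        ratings.get("Lucy").put("B", 2.0);
        ratings.get("Lucy").put("C", 5.0);
        // Estimate Lucy's rating for item A
        System.out.println(predict(ratings, "Lucy", "A"));
    }
}
```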