Serving tens of billions of personalized recommendations a day at a latency under 30 milliseconds is a challenge. In this talk I’ll share our algorithmic architecture, including its Spark-based offline layer and its Elasticsearch-based serving layer, which enables running complex models under difficult scale constraints and shortens the cycle between research and production.
Recommender Systems @ Scale, Big Data Europe Conference 2019
1. CONTENT-BASED PERSONALIZATION
From Spark to Elasticsearch and Back
Learning Large Scale Models for Content Recommendation
Sonya Liberman
Personalization Team Lead
Outbrain Recommendations Group
11. Outbrain’s NLP Engine
- Crawling articles where our recommendations are displayed
- Crawling the content we recommend
- Over 3 million new articles a week
12. What is a Document About?
- Categories: Entertainment/Television
- Topics: sitcom
- Entities: Jim Parsons, Mayim Bialik, CBS
17. Predictive Models
1. Content Based Models
Recommends content based on semantic similarity with user interests
(diagram: user interests such as Music, Tech, Travel)
18. Predictive Models
1. Content Based Models
2. Collaborative Models
If my taste matches someone else’s taste, what they liked will probably interest me as well
- Matrix Factorization
- Factorization Machines
- Feature Embedding with Deep Neural Networks
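The collaborative techniques listed above share one core idea: learn latent vectors for users and items so their dot product predicts preference. A minimal matrix-factorization sketch (plain SGD over a toy explicit-feedback matrix; function names and hyperparameters are illustrative, not Outbrain’s implementation):

```python
import random

def factorize(ratings, k=2, lr=0.02, reg=0.02, epochs=1000, seed=7):
    """Factor a {(user, item): rating} dict into latent vectors so that
    dot(P[u], Q[i]) approximates the observed rating."""
    rnd = random.Random(seed)
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    P = {u: [rnd.uniform(0.1, 0.5) for _ in range(k)] for u in users}
    Q = {i: [rnd.uniform(0.1, 0.5) for _ in range(k)] for i in items}
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)  # SGD step with L2 regularization
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, user, item):
    """Score an unobserved (user, item) pair with the learned factors."""
    return sum(pu * qi for pu, qi in zip(P[user], Q[item]))
```

Unobserved cells are then scored with `predict`, which is how “what a similar user liked will probably interest me” becomes a number.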
19. The Machine Learning Challenge
Predicting the Probability of a Click
- Feature Vector: User Profile, Current Context
- Supervision: Click / No Click
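The setup above — a feature vector built from the user profile and current context, with click/no-click supervision — is binary classification. A minimal logistic-regression scoring sketch, with made-up feature names and weights:

```python
import math

def click_probability(weights, features, bias=0.0):
    """Logistic model: P(click) = sigmoid(w . x + b)."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights over user-profile and context features.
weights = {"interest:tech": 1.2, "interest:travel": -0.3, "context:mobile": 0.4}
# One example: a tech-leaning user browsing on mobile.
features = {"interest:tech": 0.9, "context:mobile": 1.0}
```

Candidates would then be ranked by this probability at serving time.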
24. What Do Data Scientists Spend the Most Time Doing?
Data Science Report (CrowdFlower), https://visit.figure-eight.com/data-science-report.html, 2016
- Cleaning and Organizing Data
- Collecting Datasets
- Refining Algorithms
26. What Makes Preparing Datasets Hard Work?
- Multiple data stores: Hive, MySQL, Cassandra
- Multiple data sources: Recommendations, Clicks, User Profiles
- Logic: Deduping, filtering, siloing
- Scale: Huge tables; generating a large dataset takes a long time
28. The Need for a Machine Learning Framework
Goal # 1 - Agile Research
Focus on Modeling, enable quick POC cycles
29. The Need for a Machine Learning Framework
Goal # 1 - Agile Research
Focus on Modeling, enable quick POC cycles
Goal # 2 - Robust Productization
Easy periodic model re-training
Automatic pre-production evaluation and decision making
31. Distributed Machine Learning Framework
1. Data Collection
2. Feature Engineering
3. Model Training
4. Offline Evaluation Metrics
5. Model Deployment
32. Automatic Data Collection
Ongoing hourly jobs that:
1. Aggregate data from multiple sources
2. Apply data pre-processing logic
3. Make datasets available for model training and evaluation
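The three steps above might be sketched roughly as follows — a pure-Python stand-in for what would be a Spark job; the field names, the impression/click join, and the dedupe rule are illustrative assumptions:

```python
def build_dataset(recommendations, clicks):
    """Join served recommendations with click events into labeled examples.
    Dedupes repeated (user, doc) impressions and labels each click/no-click."""
    clicked = {(c["user"], c["doc"]) for c in clicks}
    seen = set()
    dataset = []
    for rec in recommendations:
        key = (rec["user"], rec["doc"])
        if key in seen:        # pre-processing logic: dedupe repeated impressions
            continue
        seen.add(key)
        dataset.append({**rec, "label": 1 if key in clicked else 0})
    return dataset             # made available for training and evaluation
```

At Outbrain’s scale the same join/dedupe/label logic would run distributed, but the shape of the output dataset is the same.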
33. Offline Model Evaluation
1. Collects model predictions for our labeled test datasets
2. Applies a large set of evaluation metrics
3. Reports metrics to a centralized datastore
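As one concrete example of step 2, here is a small AUC implementation over (score, label) pairs — a standard ranking metric for click prediction, though the talk does not name the exact metric set — plus a hypothetical shape for the report sent to a centralized store:

```python
def auc(scored):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, label in scored if label == 1]
    neg = [s for s, label in scored if label == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical record written to the centralized metrics datastore.
report = {"model": "ctr_model_v2", "metric": "auc",
          "value": auc([(0.9, 1), (0.8, 0), (0.7, 1), (0.2, 0)])}
```

Storing every run’s metrics centrally is what later makes automatic deployment decisions possible.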
34. Goal # 1 - Agile Research
1. Data Collection
2. Feature Engineering
3. Model Training
4. Offline Evaluation Metrics
5. Model Deployment
35. Simple Model Interface
- Data scientists and algorithm engineers only need to implement their model’s logic
- Use a variety of algorithms: Spark.ML, open source implementations, and “home made” ones
- Everything else comes out-of-the-box
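A sketch of what such a model interface could look like — a hypothetical API, since the talk does not show the framework’s actual signatures:

```python
from abc import ABC, abstractmethod

class Model(ABC):
    """Data scientists implement only these two methods; the framework supplies
    data collection, evaluation, and deployment around them out-of-the-box."""

    @abstractmethod
    def train(self, dataset):
        """Fit the model on a labeled dataset."""

    @abstractmethod
    def predict(self, example):
        """Return a click-probability score for one example."""

class PopularityBaseline(Model):
    """Trivial example implementation: score everything by the global click rate."""
    def train(self, dataset):
        clicks = sum(ex["label"] for ex in dataset)
        self.rate = clicks / len(dataset) if dataset else 0.0
    def predict(self, example):
        return self.rate
```

The same two-method contract lets the pipeline swap in Spark.ML, open source, or home-made models without changing anything else.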
37. The Need for a Machine Learning Framework
Goal # 1 - Agile Research
Focus on Modeling, enable quick POC cycles
Goal # 2 - Robust Productization
Easy periodic model re-training
Automatic pre-production evaluation and decision making
39. ML Framework for both Research and Production
Daily retraining of production models to refresh the models as new data becomes available
40. ML Framework for both Research and Production
1. Data Collection
2. Feature Engineering
3. Model Training
4. Offline Evaluation Metrics
5. Model Deployment
41. Goal # 2 - Robust Productization
- Models are evaluated right before deployment
- Deployment is conditional based on evaluation results
(diagram: stages 4 Offline Evaluation Metrics and 5 Model Deployment)
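The conditional-deployment idea can be sketched as a simple gate between stages 4 and 5; the thresholds and metric names here are invented for illustration:

```python
def should_deploy(new_metrics, prod_metrics, min_auc=0.6, max_regression=0.01):
    """Gate deployment on pre-production evaluation: require an absolute
    quality floor and no meaningful regression vs. the production model."""
    if new_metrics["auc"] < min_auc:            # absolute floor
        return False
    return new_metrics["auc"] >= prod_metrics["auc"] - max_regression
```

A daily retraining job would call this after offline evaluation and only push the model if it returns True, which is what makes unattended retraining safe.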
45. Distributed Search Engines - What Can They Do?
1. Efficiently apply filters over a large inventory
2. Apply (custom) scoring functions
3. Scale for inventory and number of requests
50. Reducing Recommendations to Search
- User and context are translated into a query of interests and meta-data (user interests, user geo, context info)
- Content is translated into a searchable index
52. Reducing Recommendations to Search
- Indexing the semantics of an article (e.g. Tech, Music, Sports, Celebrities)
- Generating a query from User Interests (e.g. Music, Tech, Travel)
- Getting relevant recommendations
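Reducing recommendation to search means expressing a user’s interests as a retrieval query over the semantic index. A sketch of what such an Elasticsearch query body could look like — the field names `topics` and `geo` and the boost-per-interest weighting are assumptions, not Outbrain’s real schema:

```python
def interests_to_query(interests, geo, size=10):
    """Build an illustrative Elasticsearch query body: weighted interest
    matches for scoring, plus a hard geo filter over the inventory."""
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [
                    {"term": {"topics": {"value": topic, "boost": weight}}}
                    for topic, weight in interests.items()
                ],
                "filter": [{"term": {"geo": geo}}],
            }
        },
    }
```

Documents matching more (and more heavily weighted) interests score higher, so the search engine’s ranking does the recommendation work.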
53. Beyond Content Based
Elasticsearch Plugins:
- Writing custom scoring functions in native Java
- Deploying them on the Elasticsearch cluster
- Instructing Elasticsearch to use them for its ranking
58. Recommending with Embeddings
Search-Based Serving Architecture of Embeddings-Based Recommendations
S. Liberman, S. Bar, R. Vannerom, D. Rosenstein, R. Lempel
RecSys 2019, Workshop on Online Recommender Systems and User Modeling
Challenges:
1. Incorporating embeddings into the scoring function
2. Synchronization and versioning
3. Handling indirect, feature-based embeddings
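Challenges 1 and 2 above can be illustrated together: the custom scorer compares user and document embeddings, but only when both were produced by the same model version. A pure-Python sketch of logic that would live inside the native Java scoring plugin (cosine similarity is one plausible choice; the talk does not specify the actual function):

```python
import math

def embedding_score(user_vec, doc_vec, user_version, doc_version):
    """Cosine similarity between user and document embeddings, guarded by a
    version check so vectors from different model trainings never mix."""
    if user_version != doc_version:  # synchronization/versioning guard
        return 0.0
    dot = sum(u * d for u, d in zip(user_vec, doc_vec))
    norm = (math.sqrt(sum(u * u for u in user_vec))
            * math.sqrt(sum(d * d for d in doc_vec)))
    return dot / norm if norm else 0.0
```

Daily retraining makes the version guard essential: a freshly trained user vector scored against a stale document index would otherwise produce meaningless similarities.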
59. Key Takeaways
1. Invest in your ML pipeline to perform agile research
2. Automate periodic model deployment with control mechanisms
3. Reduce your task to a problem someone already solved