Netflix is the world’s leading Internet television network with over 48 million members in more than 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series. Netflix uses machine learning to deliver a personalized experience to each one of our 48 million users.
In this talk you will hear about the machine learning algorithms that power almost every part of the Netflix experience, including some of our recent work on distributed Neural Networks on AWS GPUs. You will also get insight into our innovation approach, which includes offline experimentation and online AB testing. Finally, you will learn about the system architectures that enable all of this at Netflix scale.
13. Proxy question:
▪ Accuracy in predicted rating
▪ Improve by 10% = $1 million!
What we were interested in:
▪ High quality recommendations
[Plot: predicted vs. actual ratings]
19. ▪ > 40M subscribers
▪ Ratings: ~5M/day
▪ Searches: >3M/day
▪ Plays: > 50M/day
▪ Streamed hours:
o 5B hours in Q3 2013
[Diagram: data sources feeding personalization — Member Behavior, Plays, Ratings, Impressions, Metadata, Social, Demographics, Geo Info, Device Info, Time]
20. [Diagram: matrix factorization — a user ("Aish") and an item ("House of Cards"), each represented by a Latent User Vector and a Latent Item Vector]
32. ▪ App Logs
▪ User Actions
▪ Ratings
▪ Plays
▪ Queue Adds
▪ Algo Actions
▪ Impressions (Presentation Bias)
▪ Context
▪ Device Info
▪ User Demographics
▪ Social
▪ Time
▪ …
Many different types of data…
- Who in the audience has an ML background?
Who has a big data background?
Who’s an engineer?
Going to cover:
Bit of everything. A few models, our approach to architecture of ML systems, and how it all comes together
Feel free to ask questions as we go along.
- We use Machine Learning in many places at Netflix, but perhaps the place we’re best known for ML is in our recommender systems, and our personalization
- So wanted to start with quick overview of what is personalization in Netflix
If you’ve logged into Netflix before this should look familiar. This is what it looks like when you login to our website
What you might not realize however is that almost every element on this page is driven by a ML algorithm
- There’s the obvious recommendations. We have a row of explicit recommendations, where we pull together everything we know about you, and present our “top picks” for you
You’ll also see “Genre” rows, that provide shows around a particular theme.
Movies are tagged in our system based on a number of different aspects
The tags are editorially added by our team of content experts
Which genres we pick, however, is personalized. So “Movies based on books” is shown for me based on my predicted likelihood of wanting to watch this genre
There’s also a level of personalization within the row itself. So a genre like “Movies based on books” spans a lot of different tastes.
For example, movies about Wall Street, documentaries on the GFC, and Young Adult fiction are all types of “Movies based on Books”, but they serve different tastes.
But based on what we know about you, we can construct a set of “Movies based on books” tailored to your particular view of what that means.
We also do “Similar” rows. So as the title says, because I last watched Bob’s Burgers, here are some choices that are similar to it.
Even our marketing images are personalized. Much of the hero images and marketing you see within Netflix is personalized to your taste.
I see OITNB here because it fits with my tastes
Finally we put it all together.
Unsurprisingly, most of what people play is from the top left hand corner, and if they are forced to scroll further down, or right, then that means we failed to predict what they want to watch
So we also rank the entire page. I’ve already shown how we rank the titles within each row left-to-right. We also rank the rows themselves top-to-bottom, so that the most relevant (for you) rows are pushed to the top of the page.
The net result of this personalization is that 75% of what our users watch is selected from the homepage, and the rows I’ve just shown you.
Which means that we’ve been able to provide a very personalized experience for our users, where what they see on the homepage, when they login to Netflix, matches pretty well with what they want to watch.
- Okay, I’m going to take a minute now to provide some back story.
Who’s heard of the Netflix prize?
It ran from 2006->2009.
The challenge was:
We give you 100M anonymized ratings from users, to build a “rating prediction” model with.
We then get you to predict 2.8M ratings for users whose actual ratings we already know, but held back.
If you can improve on our predicted ratings by 10%, then we give you 1 million dollars.
We measure this as the root mean square error between your predicted rating and the real rating that we held back.
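The metric just described can be sketched in a few lines. This is a toy illustration (not the Prize’s official scoring code), with made-up example ratings:

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and held-back ratings."""
    assert len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

# Hypothetical predictions vs. held-back ratings
print(rmse([3.5, 4.0, 2.0], [4, 4, 1]))  # → ~0.645
```

Lower is better; a 10% improvement meant reducing this number by 10% relative to Netflix’s own Cinematch baseline.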
- Team KorBell (AT&T) won it in 2009.
- They improved the predictions by 8.43%
http://mathurl.com/osuomvj
Two significant algorithms came out of the Netflix Prize.
SVD - Prize RMSE: 0.8914
RBM - Prize RMSE: 0.8990
They were known in academia already, but hadn’t made their way out into industry recommender systems.
I talk through how SVD works at a high level in later slides
These two algorithms are still used in parts of the Netflix Recommender System to this day.
- There are limitations though.
Ratings != Plays. People’s ratings are somewhat “aspirational”. People may rate Citizen Kane 5 stars, but what they watch is Sharknado.
For our use case, we’re interested in predicting what people actually want to watch, not predicting what they think are critically worthy movies.
Also Netflix has changed a lot since the start of the Netflix Prize.
In 2006 we were mailing out DVDs. Now we’re more about streaming to devices.
This also changed people’s viewing habits.
The investment in selecting a great DVD that the entire family could watch was higher. Everyone had to agree on it, and getting it wrong might ruin your night.
With streaming, people want content that is more personalized, and more context-sensitive to what they want to watch NOW.
Also Netflix has grown. A lot.
What algorithms worked in 2006 don’t necessarily work with the volume we now have
- Okay, so let’s dive a little into the models and data we use to do our personalization
On the data side we have a lot to work with.
There’s a lot of signal that we get beyond straight plays/ratings.
If you think about it, the context in which someone chooses what to watch tells you a lot too.
So I want to give you a quick overview of how SVD (aka Matrix Factorization) works. This is one of the classic algorithms used in the NF prize, and was a big breakthrough at the time. This should give you a flavor of how these systems work.
The basic model is:
http://mathurl.com/pgux65w
- To make that more visual
http://mathurl.com/pgux65w
http://mathurl.com/l4w5yd6
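The factorization idea can be sketched in a few lines of Python. This is a toy illustration, not Netflix’s production code: a synthetic low-rank rating matrix R is approximated as U @ M.T (latent user vectors times latent item vectors) by plain SGD; all sizes and hyper-parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 6, 8, 2

# Synthetic low-rank "true" ratings so the demo has recoverable structure
U_true = rng.standard_normal((n_users, k))
M_true = rng.standard_normal((n_items, k))
R = U_true @ M_true.T

# Learn latent user vectors U and latent item vectors M so U @ M.T ≈ R
U = 0.1 * rng.standard_normal((n_users, k))
M = 0.1 * rng.standard_normal((n_items, k))
lr, reg = 0.02, 0.001  # learning rate, L2 regularization
for _ in range(2000):
    for u in range(n_users):
        for i in range(n_items):
            err = R[u, i] - U[u] @ M[i]
            u_old = U[u].copy()           # use pre-update value for M's step
            U[u] += lr * (err * M[i] - reg * U[u])
            M[i] += lr * (err * u_old - reg * M[i])

print(float(np.sqrt(np.mean((R - U @ M.T) ** 2))))  # training RMSE, small
```

In the Prize setting R is of course sparse (you only sum the error over observed ratings), but the structure of the update is the same.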
So that’s one of the foundational algorithms used in recommender systems. But things have moved on a lot since then too.
These days we’re mostly focused on ranking rather than rating prediction. This allows us to balance things like diversity, freshness, and global popularity against our prediction of how well this fits your tastes
We have AB tested many of these. Which algorithm to use really depends on your application, and what you’re trying to achieve. All have pros and cons.
You’ll likely end up with a few different algorithms for different parts of the problem
The important thing is to test them in your production system
Over time we’ve been able to improve on the results we got from the Netflix prize.
It’s been a combination of adding more data, and adding in more sophisticated models
As you can see here, we’ve moved things on a lot. These are improvements to Netflix’s core business metrics. So even a 1% improvement equates to real benefits to the business
One quick note: Always make sure you select a realistic baseline to test against. Just straight global popularity is usually pretty tough to beat. So you can fool yourself if you’re not testing against that, or your equivalent of that.
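The popularity baseline mentioned above is trivial to build, which is exactly why you should always have it on hand. A minimal sketch, with a made-up play log:

```python
from collections import Counter

# Toy popularity baseline: recommend the globally most-played titles.
# Any personalized model worth shipping should beat this.
plays = ["A", "B", "A", "C", "A", "B"]  # hypothetical play log

def popularity_baseline(play_log, n=2):
    """Return the n most-played titles, most popular first."""
    return [title for title, _ in Counter(play_log).most_common(n)]

print(popularity_baseline(plays))  # → ['A', 'B']
```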
- So now you have an idea of what a recommender system algorithm looks like, let’s see how you can productionize it
So here’s the core workflow you’ll need to support. Whatever decisions you make about your architecture, you’ll need to make the above process seamless.
Machine Learning Approach
Define problem (what you think needs solving, or a hypothesis of what can be improved)
Gather data on which to train model
Experiment offline to see if you can improve over baseline
Produce Model/Algorithm and deploy
Track key metrics in production to see if hypothesis is proven
- Here’s a blueprint for different layers you’ll need.
- We’ll step through each area next.
Okay, let’s start with the front-end (aka online). I won’t cover much here, except to point out that you’ll need an extremely good data pipeline.
You’ll spend 90% of your time building this.
Often it needs to be built by an engineering team in collaboration with your researchers.
There’s many different types of data you’ll want to capture
Incl. what your algorithms are doing. You’ll need to correct for presentation bias
And the context and behavior in which users interact with you
- Need backend service that can accept and aggregate all these disparate data sources
Want to look at technologies like Suro, Kafka, etc
Stream to longer term (cheap) storage (S3, HDFS)
Need common framework that makes it easier to instrument your code for events.
Adopt early and get into every app as “standard”
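A common event-instrumentation framework like the one described might look like this minimal sketch. The Queue stands in for the real transport (Suro/Kafka streaming to S3/HDFS in the talk); the event schema and field names are purely illustrative:

```python
import json
import time
from queue import Queue

# Stand-in for the real event transport (e.g. a Kafka producer)
event_bus = Queue()

def log_event(event_type, user_id, payload):
    """Emit one event in the shared format every app agrees on."""
    event_bus.put(json.dumps({
        "type": event_type,        # e.g. "play", "rating", "impression"
        "user": user_id,
        "ts": time.time(),         # context: when it happened
        "payload": payload,        # event-specific details
    }))

# Capturing both what the user did AND what the algorithm showed them
log_event("impression", 42, {"row": "Top Picks", "position": 3})
log_event("play", 42, {"title": "House of Cards"})
print(event_bus.qsize())  # → 2
```

The point of a shared helper like this is the last line of the notes: adopt it early and get it into every app as the standard, so impressions, plays, and context all land in one pipeline.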
Okay, let’s talk about where you (typically) define and train your models
Most of your models will be produced offline & embedded in production
You’ll need a platform that allows easy experimentation across diverse tools: R, iPython, in-house
Common Format (can be code) that allows you to embed models once learned
Common confusion: models change less than you think
The values you’ll be plugging into them can still be real-time
http://mathurl.com/kuxa5hw
Let’s walk through an example of a model we train: Neural Networks
These days we use GPUs (CUDA) to do the training of the network.
Thousands of cores
Massively parallel
Computing power is what’s changed. ANNs are really an old idea
But still need to explore hyper-parameter space.
Parameters:
▪ Learning rate theta
▪ …
Architecture parameters:
▪ How many layers, and how deep
AWS offers GPU compute instances.
Approach: conduct a search over many different architectures / parameters
- Distribute different architecture to each instance
- Train model
- Evaluate
Can get smarter with how you explore this space. So rather than doing grid search, you search in areas most likely to have improvement
60 cents an hour. A comparative fortune compared to other instances, but it only takes a few hours to train a model that is used in production for weeks (or months)
Perfect for experimental work
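The search loop above can be sketched as a simple random search. This is a toy stand-in, not Netflix’s tooling: `train_and_evaluate` is a hypothetical placeholder for the real job you would farm out to one GPU instance per trial, and the parameter ranges are illustrative:

```python
import random

def train_and_evaluate(layers, units, lr):
    """Placeholder score; in practice this trains and evaluates a network
    on one GPU instance and returns validation performance."""
    return -abs(layers - 3) - abs(units - 128) / 64 - abs(lr - 0.01) * 100

random.seed(0)
best = None
for _ in range(20):  # one trial per instance
    cfg = {
        "layers": random.randint(1, 6),
        "units": random.choice([32, 64, 128, 256]),
        "lr": 10 ** random.uniform(-4, -1),  # log-uniform learning rate
    }
    score = train_and_evaluate(**cfg)
    if best is None or score > best[0]:
        best = (score, cfg)

print(best[1])  # best architecture / hyper-parameters found
```

Random search is the naive version; as the notes say, you can get smarter by concentrating trials in the regions of the space most likely to improve (e.g. Bayesian optimization).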
Your offline models won’t reflect sudden changes in behavior that they haven’t seen before.
Here’s OITNB and House of Cards (as searched for in Google). These can represent massive shifts in global user behavior, which can throw the model off
Also some models degrade faster than others. You see this especially with tree models.
Another problem: The models themselves still run in production (even though they’re trained offline). This limits how sophisticated you can make your models. They still need to return results within your SLAs.
One Solution. Near-line computing.
Re-train models based on events from the system
Pre-compute results where you can
Now you don’t always have to pre-compute the final results. The beauty of the near-line approach is that it lets you half-bake the model. So that the parts that are more static are pre-generated, and the parts that are more sensitive to changes get worked on the fly.
Remember our SVD model. U is users, M is movies, and R are ratings
- Turns out that solving for U, if you know M and R, is a simple least squares solution. With modern linear algebra libraries we can compute that in milliseconds.
Recomputes are event driven. No need to re-compute if nothing has changed
So in this example, we re-compute the latent vectors representing my tastes, whenever there’s more information available about me to re-train that vector with.
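The near-line recompute just described can be sketched with a single least-squares solve. This is a toy illustration with synthetic, noise-free data (all names and sizes are illustrative): with the item matrix M fixed from offline training, one user’s latent vector u solves r ≈ M u over the items that user has rated:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
M = rng.standard_normal((50, k))           # latent item vectors (fixed offline)
u_true = rng.standard_normal(k)            # the taste vector we want to recover
rated = rng.choice(50, size=10, replace=False)
r = M[rated] @ u_true                      # this user's known ratings

# Event-driven recompute: when new ratings arrive, re-solve for u.
# A 10x3 least-squares problem takes well under a millisecond.
u, *_ = np.linalg.lstsq(M[rated], r, rcond=None)
print(np.allclose(u, u_true))  # → True on this noise-free toy data
```

In practice the solve would include regularization and real (noisy) ratings, but the shape of the computation is the same: cheap enough to rerun per-user whenever an event says something changed.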