At Netflix, we use machine learning (ML) algorithms extensively to recommend relevant titles to our 100+ million members based on their tastes. Everything on the member home page is an evidence-driven, A/B-tested experience backed by ML models. These models are trained using Meson, our workflow orchestration system. Meson distinguishes itself from other workflow engines by handling more sophisticated execution graphs, such as loops and parameterized fan-outs. Meson can schedule Spark jobs, Docker containers, bash scripts, gists of Scala code, and more. Meson also provides a rich visual interface for monitoring active workflows and inspecting execution logs. It has a powerful Scala DSL for authoring workflows, as well as a REST API. In this session, we focus on how Meson trains recommendation ML models in production, and how we have re-architected it to scale to the growing need for broad ETL applications within Netflix. This change required us to evolve Meson's persistence layer, and we talk about how we migrated from Cassandra to Amazon RDS backed by Amazon Aurora.
13. Training Pipelines
[Diagram: pipeline stages with their labels, grouped in slide order]
● Data Preparation (Spark): Extract from Hive, Stratified Sampling
● Feature Generation (Spark): Online Snapshots, Feature Encoders
● Validation: Offline Metrics, Alerts, Model Metrics
● Model Training (Spark/TensorFlow): Proprietary Algos, Parameter Search
● Model Selection: Test dataset, Hyper parameters
● Model Publish: S3, Online Caches
● Scoring/Inference (Spark/Online Caches): Precompute, Live Compute
37. Lesson: One Abstraction Doesn’t Fit All
Evidenced by the many names:
● Workflow
● ProcessFlow
● Pipeline
● DAG
● DataFlow
Overspecialization will inevitably weaken other use cases
38. Lesson: One Abstraction Doesn’t Fit All
Meson provides “workflows as a service” on top of which many domain-specialized abstractions can be built:
● A/B test orchestration
● ML orchestration
● ETL pipelines
● Notebook Automation
● And more…
[Diagram: ETL DSL, ML DSL, and Automation DSL layered on top of Meson]
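The layering described on this slide can be sketched as a small generic workflow core with thin domain helpers on top. This is only an illustration in Python: Meson's real DSL is written in Scala and is not shown in the deck, so `Workflow`, `etl_pipeline`, and `ml_pipeline` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Workflow:
    """Generic core (hypothetical): named steps plus dependency edges."""
    name: str
    steps: Dict[str, Callable[[], object]] = field(default_factory=dict)
    deps: Dict[str, List[str]] = field(default_factory=dict)

    def add_step(self, name, action, after=()):
        self.steps[name] = action
        self.deps[name] = list(after)
        return self

    def run(self):
        """Run steps in dependency order and return the execution order."""
        done: List[str] = []
        pending = dict(self.deps)
        while pending:
            ready = [s for s, d in pending.items() if all(x in done for x in d)]
            if not ready:
                raise ValueError("cycle or missing dependency")
            for step in ready:
                self.steps[step]()
                done.append(step)
                del pending[step]
        return done


def etl_pipeline(name, extract, transform, load):
    """ETL-flavoured helper: a fixed extract -> transform -> load chain."""
    return (Workflow(name)
            .add_step("extract", extract)
            .add_step("transform", transform, after=["extract"])
            .add_step("load", load, after=["transform"]))


def ml_pipeline(name, prepare, train_fns, select):
    """ML-flavoured helper: prepare, fan out one training step per candidate, then select."""
    wf = Workflow(name).add_step("prepare", prepare)
    train_names = []
    for i, fn in enumerate(train_fns):
        step = f"train_{i}"
        wf.add_step(step, fn, after=["prepare"])
        train_names.append(step)
    return wf.add_step("select", select, after=train_names)
```

Both helpers produce the same `Workflow` type, so the core stays unaware of ETL or ML concerns while each domain gets its own vocabulary.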
39. Lesson: Prepare for the Future, for It Is Unknown
The influx of new ML tech is massive
We had invested heavily in Spark, and that has been useful, but technology is still moving
The ability for users to extend the system for new tech has enabled us to keep up
40. Lesson: Prepare for the Future, for It Is Unknown
The Custom Step interface enables platform-specific integrations like Spark and Titus (Netflix internal Docker service)
The DSL can be extended to further specialize for particular technologies (TensorFlow on Docker, model training in Spark)
[Screenshot callouts: Spark Submit options; links to Spark UI & History Server; Titus Docker Milestones]
41. Lesson: Prepare for the Future, for It Is Unknown
[Diagram: available step types]
● Execute Command
● REST Job
● Spark Submit
● Run Docker
● TensorFlow Train
● Run Notebook
● Run Pig Job
● ...
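One way to picture an extensible step interface is a small base class plus a registry that users add new step types to. The sketch below is an assumption-laden illustration, not Meson's Custom Step API (which the deck does not show): `CustomStep`, `register`, `run_step`, and the parameter names are all hypothetical. The only external detail relied on is that `spark-submit` accepts `--class` and `--conf key=value`.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Type


class CustomStep(ABC):
    """Hypothetical extension point: each step type knows how to run itself."""

    @abstractmethod
    def run(self, params: Dict[str, Any]) -> int:
        """Execute the step and return an exit code (0 means success)."""


STEP_TYPES: Dict[str, Type[CustomStep]] = {}


def register(name: str):
    """Class decorator that makes a step type available under a name."""
    def decorator(cls):
        STEP_TYPES[name] = cls
        return cls
    return decorator


@register("execute_command")
class ExecuteCommand(CustomStep):
    def run(self, params):
        # params["argv"] is the command and its arguments as a list.
        return subprocess.run(params["argv"]).returncode


@register("spark_submit")
class SparkSubmit(CustomStep):
    def build_argv(self, params) -> List[str]:
        """Translate step parameters into a spark-submit command line."""
        argv = ["spark-submit", "--class", params["main_class"]]
        for key, value in sorted(params.get("conf", {}).items()):
            argv += ["--conf", f"{key}={value}"]
        return argv + [params["jar"]]

    def run(self, params):
        return subprocess.run(self.build_argv(params)).returncode


def run_step(step_type: str, params: Dict[str, Any]) -> int:
    """Look up a registered step type and run it."""
    return STEP_TYPES[step_type]().run(params)
```

Adding support for a new technology then means registering one more class, without touching the code that runs workflows.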
43. Lesson: Embrace How the Sausage Is Made
Meson as a Mesos framework
Mesos offers resources and runs the steps
Fenzo (Netflix OSS) makes scheduling decisions
[Diagram: Mesos Framework (Scheduler, Fenzo), Mesos Master, and Mesos Agents each running a Meson executor]
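The division of labor on this slide (Mesos offers resources, a scheduler decides where steps go) can be illustrated with a toy first-fit placement. This is not Fenzo's algorithm or API; `Offer`, `Task`, and `assign` are hypothetical names, and first-fit is swapped in only because it is the simplest placement policy to read.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Offer:
    """Resources an agent makes available (hypothetical shape)."""
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    """Resources a workflow step asks for (hypothetical shape)."""
    name: str
    cpus: float
    mem_mb: int


def assign(tasks: List[Task], offers: List[Offer]) -> Dict[str, str]:
    """First-fit: place each task on the first agent with enough remaining resources.

    Tasks that fit nowhere are left out of the result and would wait for
    the next round of offers.
    """
    remaining = {o.agent: [o.cpus, o.mem_mb] for o in offers}
    placement: Dict[str, str] = {}
    for task in tasks:
        for agent, (cpus, mem) in remaining.items():
            if task.cpus <= cpus and task.mem_mb <= mem:
                remaining[agent][0] -= task.cpus
                remaining[agent][1] -= task.mem_mb
                placement[task.name] = agent
                break
    return placement
```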
44. Lesson: Embrace How the Sausage Is Made
Meson executors run the actual steps
They publish runtime debug information (logs, metrics, configurations) and task status updates
The Meson executor survives Meson scheduler failures
[Diagram labels: Mesos Master, Mesos Agents, Meson executors, Docker container, Service, Spark driver, Spark Executors]
45. Lesson: When to Get a New Pair of Jeans*
*Hint: before the first sign of a tear!
Cassandra cluster provided as a service and maintained by a dedicated team
Everything stored as Protobuf blobs
Custom secondary indexes to support various query patterns
46. Lesson: When to Get a New Pair of Jeans
Need to support complex query patterns, aggregations, and joins
Creating and maintaining secondary indexes is cumbersome
Debugging the stored data is not trivial: it requires application code to deserialize
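The kind of query that motivates a relational store is sketched below: a join plus an aggregation that would need custom indexes and deserialization code over opaque blobs. The schema and data are invented for illustration, and SQLite stands in for the production database only so the example runs on its own.

```python
import sqlite3

# In-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workflow_runs (
    id INTEGER PRIMARY KEY,
    workflow TEXT,
    status TEXT
);
CREATE TABLE step_runs (
    run_id INTEGER REFERENCES workflow_runs(id),
    step TEXT,
    duration_s REAL
);
""")
conn.executemany(
    "INSERT INTO workflow_runs VALUES (?, ?, ?)",
    [(1, "ranker", "SUCCEEDED"), (2, "ranker", "FAILED"), (3, "etl_daily", "SUCCEEDED")],
)
conn.executemany(
    "INSERT INTO step_runs VALUES (?, ?, ?)",
    [(1, "train", 120.0), (1, "validate", 30.0), (2, "train", 45.0), (3, "load", 60.0)],
)

# Join runs to their steps and aggregate per workflow in a single statement.
rows = conn.execute("""
SELECT r.workflow, COUNT(DISTINCT r.id) AS runs, SUM(s.duration_s) AS total_s
FROM workflow_runs r
JOIN step_runs s ON s.run_id = r.id
GROUP BY r.workflow
ORDER BY r.workflow
""").fetchall()
```

The stored rows are also directly inspectable with any SQL client, which addresses the debugging point above.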
47. Lesson: When to Get a New Pair of Jeans
Amazon Relational Database Service
On-demand, production-ready relational database in the cloud
Takes care of the administrative work for you: backups, replication, software updates, failover
Easy to scale the database and add read replicas as needed
Supports most common database engines
Amazon Aurora
A high-performance, reliable managed database
Fully compatible with MySQL
Can serve a high number of concurrent requests
48. Lesson: When to Get a New Pair of Jeans
A single Amazon RDS cluster per region
Multiple Meson instances running different versions
An Amazon RDS cluster will have a logical database per Meson instance
49. Lesson: When to Get a New Pair of Jeans
How do we apply schema changes or data migrations?
Automated database migrations through SQL scripts or code with Flyway
Leadership acquisition in ZooKeeper for red/black deployments
Migrations applied lazily when updating a specific Meson instance
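The general idea behind versioned migrations can be sketched as follows: keep an ordered list of migrations, record applied versions in a history table, and apply only what is pending. This imitates the concept only. Flyway is a separate tool with its own conventions, and the `migrate` function, the `schema_history` table, and the example schema here are assumptions made for illustration, with SQLite standing in for the real database.

```python
import sqlite3

# Hypothetical ordered migrations: (version, SQL statement).
MIGRATIONS = [
    (1, "CREATE TABLE workflows (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE workflows ADD COLUMN owner TEXT"),
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply migrations newer than the recorded version; return versions applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_history (version INTEGER PRIMARY KEY)"
    )
    current = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM schema_history"
    ).fetchone()[0]
    applied = []
    for version, sql in sorted(migrations):
        if version > current:
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_history (version) VALUES (?)", (version,)
            )
            applied.append(version)
    conn.commit()
    return applied
```

Calling `migrate` when a given instance is updated, rather than for every database at once, corresponds to the lazy approach on the slide. Running it a second time is a no-op because the history table already records the applied versions.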
50. Lesson: When to Get a New Pair of Jeans
Next steps/takeaways
Obtain the desired querying flexibility without additional operational burden
The database will influence the design of your application
Migrating application code to leverage relational DB capabilities is tedious
Consider an ORM to reduce the amount of code and improve query composability
Lazy migrations can be preferable, but come with the tradeoff of maintaining old code
51. Lesson: Know Thy User
User interactions with Meson
Defining the workflow (Scala DSL)
Operating/monitoring a running workflow (Web UI)
52. Lesson: Know Thy User
We were improving those, but our users suffered from a different problem:
How to deploy workflows and ship binaries to the cluster
We talked to our users and iterated. A lot.
Current solution
A Gradle plugin integrated with the build system for automation
Automated workflow releases…
53. Lesson: Know Thy User
[Diagram: release flow through Git and Jenkins]
(1) PR merged
(2) Deploy & run canary workflows
(3) Deploy production workflows
54. Lesson: Know Thy User
Interact with Meson from the running job to leverage advanced features
Loops, foreach, parameters that can be passed around
Artifacts to expose debugging information
Progress Milestones, Links, Counters, Images, etc.
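To make the idea of a running job talking back to the orchestrator concrete, here is a toy context object that a step could use to read parameters and report milestones, counters, and outputs. It is purely hypothetical: the deck does not show Meson's interface, so `StepContext`, its methods, and `train_step` are invented names, and everything is kept in memory rather than sent to a service.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StepContext:
    """Hypothetical handle a running step uses to report back to the orchestrator."""
    params: Dict[str, str] = field(default_factory=dict)
    milestones: List[str] = field(default_factory=list)
    counters: Dict[str, int] = field(default_factory=dict)
    outputs: Dict[str, str] = field(default_factory=dict)

    def milestone(self, label: str) -> None:
        """Record a progress milestone."""
        self.milestones.append(label)

    def increment(self, name: str, by: int = 1) -> None:
        """Bump a named counter."""
        self.counters[name] = self.counters.get(name, 0) + by

    def set_output(self, key: str, value: str) -> None:
        """Expose a value that a downstream step could read as a parameter."""
        self.outputs[key] = value


def train_step(ctx: StepContext) -> None:
    """Example job body that reports progress while it works."""
    ctx.milestone("loading data")
    for _ in range(3):
        ctx.increment("batches")
    ctx.milestone("training done")
    ctx.set_output("model_path", f"models/{ctx.params['model_name']}")
```

Passing `outputs` from one step into the `params` of the next is one simple way to model parameters that are passed around between steps.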
55. What’s Ahead
Scaling to tens of thousands of daily ETL jobs for broader Netflix Data needs
Tighter integration with application code using MesonContext
Support for more sophisticated pipelines
56. Netflix Talks at re:Invent 2017
Monday
10:45am ARC208: Walking the tightrope: Balancing Innovation, Reliability, Security, and Efficiency (Venetian)
12:15pm SID206: Best Practices for Managing Security on AWS (MGM)
Tuesday
10:45am ARC209: A Day in the Life of a Netflix Engineer (Venetian)
11:30am CMP204: How Netflix Tunes EC2 Instances for Performance (Venetian)
Wednesday
11:30am MCL317: Orchestrating ML Training for Netflix Recommendations (Venetian)
12:15pm NET303: A day in the life of a Cloud Network Engineer at Netflix (Venetian)
1:00pm ARC312: Why Regional Reservations are a Game Changer for Netflix (Venetian)
1:00pm SID304: SecOps 2021 Today: Using AWS Services to Deliver SecOps (MGM)
1:45pm DEV334: Performing Chaos at Netflix Scale (Venetian)
4:45pm SID316: Using Access Advisor to Strike the Balance Between Security and Usability (MGM)
Thursday
12:15pm CMP311: Auto Scaling Made Easy: How Target Tracking Scaling Policies Hit the Bullseye (Palazzo)
12:15pm DAT308: A story of Netflix and AB Testing in the User Interface using DynamoDB (Venetian)
12:55pm CMP309: How Netflix Encodes at Scale (Venetian)
5:00pm ABD401: How Netflix Monitors Applications Real Time with Kinesis (Aria)
Friday
8:30am ABD319: Tooling Up For Efficiency: DIY Solutions @ Netflix (Aria)
10:00am ABD401: Netflix Keystone SPaaS - Real-time Stream Processing as a Service (Aria)