© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:Invent
Orchestrating Machine Learning Training
for Netflix Recommendations
Davis Shepherd, Eugen Cepoi, Faisal Siddiqi
MCL317
November 29, 2017
Goal
Create personalized recommendations to
help members find content to enjoy,
maximizing their satisfaction
Context
Agenda
Recommendation context
Overview of Meson
ML training pipelines using Meson
Lessons learned while building Meson
Recommendation Context
[Diagram, built up over several slides. Labels: Member, Streaming, Data, Training pipelines, Models, Precompute system, Caches, AB test Allocation]
Training Pipelines
Data Preparation: Spark; Extract from Hive; Stratified Sampling
Feature Generation: Spark; Online Snapshots; Feature Encoders
Validation: Offline Metrics, Alerts; Model Metrics
Model Training: Proprietary Algos; Spark/TensorFlow; Parameter Search
Model Selection: Test dataset; Hyper parameters
Model Publish, Scoring/Inference: S3; Online Caches; Precompute; Live Compute; Spark/Online Caches
Pipeline Challenges
Heterogeneous systems
Failure handling
Reproducibility
Multi-tenancy
External triggers
Meson: Workflow Orchestration
Generic, extensible workflow scheduling engine
Attached to an Apache Mesos cluster
Supports Spark compute natively and can schedule Docker containers at scale
DSLs, REST API, and UI for definition and visualization
DAGs, Loops, Conditionals, DataArtifacts, SubWorkflows
Flexible MVEL Java expressions for extensibility
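As a rough illustration of the DAG idea listed above, the training stages named earlier in the talk can be ordered with Python's standard graphlib module. This is not Meson's DSL, which the slides do not show; the stage identifiers and the dependency edges are assumptions made only for this sketch.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each stage maps to the stages it waits on.
# Stage names follow the talk; the edges are assumed for illustration.
PIPELINE = {
    "data_preparation": set(),
    "feature_generation": {"data_preparation"},
    "model_training": {"feature_generation"},
    "model_selection": {"model_training"},
    "validation": {"model_selection"},
    "model_publish": {"validation"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(PIPELINE)
print(order)
```

A scheduler built on this idea would launch a stage only once everything it depends on has finished.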
Meson: Stats
2+ years in production
10+ managed and self-service deployed clusters
1000+ daily production and A/B Test ML pipelines
2000+ Amazon EC2 instances in largest Spark/Mesos compute pool
20,000+ Daily step runs
Meson in Action
Web interface
Workflow in motion
ML Pipelines
Anatomy of an ML Pipeline
A Page construction model
MVEL for datetime (training start date = FeedDate minus 1209600000 ms, i.e. 14 days)
{"trainingDataStartDateint":
"new java.text.SimpleDateFormat(\"yyyyMMdd\")
.format(new java.text.SimpleDateFormat
(\"yyyyMMdd\").parse(FeedDate)
.getTime() - 1209600000).toString()" }
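For readers less familiar with MVEL, the same date arithmetic can be sketched in Python. The function name and the lookback_ms parameter are hypothetical; only the yyyyMMdd format and the 1209600000 ms offset come from the slide. The sketch uses naive dates, so unlike millisecond arithmetic on a local-time timestamp it is not affected by daylight-saving shifts.

```python
from datetime import datetime, timedelta


def training_start_date(feed_date: str, lookback_ms: int = 1_209_600_000) -> str:
    """Return feed_date (yyyyMMdd) shifted back by lookback_ms, as yyyyMMdd."""
    parsed = datetime.strptime(feed_date, "%Y%m%d")
    start = parsed - timedelta(milliseconds=lookback_ms)
    return start.strftime("%Y%m%d")


print(training_start_date("20171129"))  # 20171115
```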
Heterogeneous compute tied together by Meson
Data Prep @ Spark
Training @ Docker
Training @ Docker
Per A/B Cell Metric computations
Per Cell Metrics @ Spark
Per Cell Metrics @ Spark
More ML Pipelines
A Boxart Personalization pipeline
Data prep, labels, etc.
Feature Generation
Model Publish
Custom Step
Feature Importance computation
Model Scoring and Selection
Training multiple Models
Metric Distribution
Notification
Clean up features, models...
A Video Ranking pipeline
Base DSL class defines the standard pipeline
Every new A/B test extends the base class
Branching for multiple A/B test cells/models
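The base-class pattern described above can be sketched as follows. Meson's DSL is Scala and is not shown in the slides, so this Python version only illustrates the inheritance idea; every class, method, step, and parameter name here is hypothetical.

```python
class RankingPipeline:
    """Hypothetical base class: defines the standard steps once."""

    def cells(self):
        # The standard pipeline trains a single default model.
        return ["production"]

    def training_params(self, cell):
        return {"learning_rate": 0.1}

    def build(self):
        # Shared prefix, then one training branch per A/B cell, then publish.
        steps = ["extract_data", "generate_features"]
        for cell in self.cells():
            steps.append(("train", cell, self.training_params(cell)))
        steps.append("publish_models")
        return steps


class NewLossAbTest(RankingPipeline):
    """Hypothetical A/B test: overrides only what differs from the base."""

    def cells(self):
        return ["cell_1", "cell_2"]

    def training_params(self, cell):
        params = super().training_params(cell)
        if cell == "cell_2":
            params["learning_rate"] = 0.05
        return params


plan = NewLossAbTest().build()
print(plan)
```

The subclass adds branches and tweaks parameters without restating the shared steps, which is the point of extending a common base.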
Another Ranking pipeline
Uses Scatter/Gather pattern for data-parallel training
Data chunks sent to trainer framework in parallel
Model validation and publish delegated to a Docker script
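A minimal sketch of the scatter/gather pattern mentioned above, using Python's standard thread pool. The "trainer" is a stand-in that only computes a partial sum and count, and all function names are hypothetical; the real trainer framework is not described in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Scatter: deal rows round-robin into n_chunks lists."""
    return [rows[i::n_chunks] for i in range(n_chunks)]


def train_on_chunk(chunk):
    """Stand-in for a trainer call; returns a partial result (sum, count)."""
    return sum(chunk), len(chunk)


def scatter_gather(rows, n_chunks=4):
    chunks = split_into_chunks(rows, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(train_on_chunk, chunks))
    # Gather: combine the partial results into one "model" (here, a mean).
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


model = scatter_gather([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(model)
```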
Continue Watching pipeline
Parallel pipelines for the old and new approaches share common steps
Member Value Modeling
Uses Meson foreach for parameter sweeps across Docker runs
Uses a Custom step to run a Pig query
Triggers downstream ETL on completion
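The foreach-style parameter sweep above can be pictured roughly like this. The search space, the scoring function, and run_docker_step are all hypothetical placeholders; in the setup the slides describe, each iteration would presumably launch a container rather than call a local function.

```python
from itertools import product


def parameter_grid(space):
    """Expand {name: [values]} into one dict per combination."""
    names = sorted(space)
    return [dict(zip(names, combo)) for combo in product(*(space[n] for n in names))]


def run_docker_step(params):
    """Placeholder for one container run; returns a made-up score."""
    return {"params": params, "score": params["depth"] * params["rate"]}


space = {"depth": [2, 4], "rate": [0.1, 0.5]}
results = [run_docker_step(p) for p in parameter_grid(space)]  # the "foreach"
best = max(results, key=lambda r: r["score"])
print(best)
```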
Lessons Learned
Lesson: Data is at the core of every job
Data artifacts defined by a name and partitions
Cross workflow dependencies
External triggers
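One way to picture an artifact "defined by a name and partitions" that drives cross-workflow dependencies is the sketch below. It is an assumption-laden illustration, not Meson's data model: DataArtifact, ArtifactBus, and the workflow and artifact names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataArtifact:
    """An artifact is identified by a name plus its partition values."""
    name: str
    partitions: tuple  # e.g. (("date", "20171129"),)


class ArtifactBus:
    """Hypothetical registry: workflows subscribe to artifact names."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.triggered = []  # (workflow, artifact) pairs, in trigger order

    def subscribe(self, artifact_name, workflow):
        self._subscribers[artifact_name].append(workflow)

    def publish(self, artifact):
        # A publish from another workflow or an external system starts dependents.
        for workflow in self._subscribers[artifact.name]:
            self.triggered.append((workflow, artifact))


bus = ArtifactBus()
bus.subscribe("training_features", "ranking_model_training")
bus.publish(DataArtifact("training_features", (("date", "20171129"),)))
print(bus.triggered)
```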
Lesson: Immutability (with versions) provides sanity
Workflows have immutable versions
Enabling
• Better collaboration
• Rollbacks
• Reproducibility
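The benefit of immutable versions can be sketched with an append-only store: saving a workflow never overwrites an earlier version, so rolling back is just selecting an older one. WorkflowVersion and WorkflowStore are hypothetical names for illustration, not Meson classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowVersion:
    name: str
    version: int
    steps: tuple


class WorkflowStore:
    """Hypothetical append-only store: saving never overwrites a version."""

    def __init__(self):
        self._versions = {}

    def save(self, name, steps):
        history = self._versions.setdefault(name, [])
        workflow = WorkflowVersion(name, len(history) + 1, tuple(steps))
        history.append(workflow)
        return workflow

    def get(self, name, version=None):
        # Latest version by default; an explicit version supports rollbacks.
        history = self._versions[name]
        return history[-1] if version is None else history[version - 1]


store = WorkflowStore()
store.save("boxart", ["prep", "train"])
store.save("boxart", ["prep", "train", "publish"])
rollback_target = store.get("boxart", version=1)
print(rollback_target)
```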
Lesson: One Abstraction Doesn’t Fit All
Evidenced by the many names:
● Workflow
● ProcessFlow
● Pipeline
● DAG
● DataFlow
Overspecialization will inevitably
weaken other use cases
Meson provides “workflows as a
service” on top of which many
domain-specialized abstractions can
be built:
● A/B test orchestration
● ML orchestration
● ETL pipelines
● Notebook Automation
● And more…
[Diagram: ETL DSL, ML DSL, and Automation DSL built on top of Meson]
Lesson: Prepare for the Future, for It Is Unknown
The influx of new ML tech is massive
We had invested heavily in Spark, and that has been useful, but technology is still moving
The ability for users to extend the system for new tech has enabled us to keep up
The Custom Step interface enables platform-specific integrations like Spark and Titus (Netflix internal Docker service)
The DSL can be extended to further specialize for particular technologies (TensorFlow on Docker, model training in Spark)
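A pluggable step interface of the kind described above might look roughly like the following. This is not Meson's actual Custom Step interface, which the slides do not show; the registry, class names, and parameter keys are hypothetical, and the steps only format command strings instead of contacting a cluster.

```python
from abc import ABC, abstractmethod

STEP_TYPES = {}  # hypothetical registry: step type name -> class


def register(name):
    """Class decorator that makes a step type available by name."""
    def decorator(cls):
        STEP_TYPES[name] = cls
        return cls
    return decorator


class CustomStep(ABC):
    """Hypothetical extension point; not Meson's actual interface."""

    @abstractmethod
    def run(self, params: dict) -> str:
        """Execute the step and return a description of what was run."""


@register("spark_submit")
class SparkSubmitStep(CustomStep):
    def run(self, params):
        # A real integration would submit to the cluster; this only builds the command.
        return f"spark-submit --class {params['main_class']} {params['jar']}"


@register("run_docker")
class RunDockerStep(CustomStep):
    def run(self, params):
        return f"docker run {params['image']}"


def execute(step_type, params):
    return STEP_TYPES[step_type]().run(params)


command = execute("run_docker", {"image": "trainer:latest"})
print(command)
```

Supporting a new technology then means registering one more class, without touching the code that executes steps.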
Step-specific features: Spark Submit options; links to Spark UI & History Server; Titus Docker Milestones
Step types: Execute Command, REST Job, Spark Submit, Run Docker, TensorFlow Train, Run Notebook, Run Pig Job, ...
Lesson: Embrace How the Sausage Is Made
[Architecture diagram labels: REST API, Orchestration layer, Scheduler, Persistence layer, Zookeeper]
Meson as a Mesos framework
Mesos offers resources and runs the steps
Fenzo (Netflix OSS) makes scheduling decisions
[Diagram labels: Mesos Framework, Scheduler, Fenzo, Mesos Master, Mesos Agent, Meson executor]
Meson executor
Runs the actual steps
Publishes runtime debug information (logs, metrics, configurations) and task status updates
Survives Meson scheduler failures
[Diagram labels: Mesos Master, Mesos Agent, Meson executor, Docker container, Service, Spark driver, Spark Executors]
Lesson: When to Get a New Pair of Jeans*
*Hint: before the first sign of a tear!
Cassandra cluster provided as a service and maintained by a dedicated team
Everything stored as Protobuf blobs
Custom secondary indexes to support various query patterns
Need to support complex query patterns, aggregations, and joins
Creating and maintaining secondary indexes is cumbersome
Debugging the stored data is not trivial: it requires application code to deserialize
Amazon Relational Database Service
On-demand production ready relational database in the cloud
Takes care of the administrative work for you: backups, replication, software updates, failover
Easy to scale the database and add read replicas as needed
Supports most common database engines
Amazon Aurora
A high performance and reliably managed database
Fully compatible with MySQL
Can serve a high number of concurrent requests
A single Amazon RDS cluster per region
Multiple Meson instances running different versions
An Amazon RDS cluster will have a logical database per Meson instance
How do we apply schema changes or data migration?
Automated database migrations through SQL scripts or code with Flyway
Leadership acquisition in Zookeeper for red/black deployments
Migrations applied lazily when updating a specific Meson instance
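The idea of versioned migrations applied lazily per instance can be sketched as below. This imitates the general approach only: it does not use Flyway's API, sqlite3 stands in for the Amazon RDS database, and the table names, SQL statements, and schema_history bookkeeping table are assumptions made for the example.

```python
import sqlite3

# Ordered (version, SQL) pairs, in the spirit of versioned migration scripts.
MIGRATIONS = [
    (1, "CREATE TABLE workflow (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE workflow ADD COLUMN version INTEGER DEFAULT 1"),
]


def migrate(conn):
    """Apply migrations not yet recorded for this database; return versions applied."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_history (version INTEGER PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT version FROM schema_history")}
    applied = []
    for version, sql in MIGRATIONS:
        if version not in done:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_history (version) VALUES (?)", (version,))
            applied.append(version)
    conn.commit()
    return applied


# One logical database per instance; it is migrated only when that instance updates.
instance_db = sqlite3.connect(":memory:")
first_run = migrate(instance_db)
second_run = migrate(instance_db)
print(first_run, second_run)
```

Because each logical database records its own history, instances running different versions can share a cluster and catch up independently.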
Next steps/takeaways
Obtain desired querying flexibility without additional operations burden
The database will influence the design of your application
Migrating application code to leverage relational DB capabilities is tedious
Consider an ORM to reduce the code and improve query composability
Lazy migrations can be preferable, but at the cost of maintaining old code
Lesson: Know Thy User
User interactions with Meson
Defining the workflow (Scala DSL)
Operating/monitoring a running workflow (Web UI)
We were improving those, but our users suffered from a different problem:
How to deploy workflows and ship binaries to the cluster
We talked to our users and iterated. A lot.
Current solution
A Gradle plugin integrated with the build system for automation
Automated workflow releases…
Release flow (Git, Jenkins):
1. PR merged
2. Deploy & run canary workflows
3. Deploy production workflows
Interact with Meson from the running job to leverage advanced features
Loops, foreach, parameters that can be passed around
Artifacts to expose debugging information
Progress Milestones, Links, Counters, Images, etc.
What’s Ahead
Scaling to tens of thousands of daily ETL jobs for broader Netflix Data needs
Tighter integration with application code using MesonContext
Support for more sophisticated pipelines
Netflix Talks at re:Invent 2017
Monday
10:45am ARC208: Walking the tightrope: Balancing Innovation, Reliability, Security, and Efficiency (Venetian)
12:15pm SID206: Best Practices for Managing Security on AWS (MGM)
Tuesday
10:45am ARC209: A Day in the Life of a Netflix Engineer (Venetian)
11:30am CMP204: How Netflix Tunes EC2 Instances for Performance (Venetian)
Wednesday
11:30am MCL317: Orchestrating ML Training for Netflix Recommendations (Venetian)
12:15pm NET303: A day in the life of a Cloud Network Engineer at Netflix (Venetian)
1:00pm ARC312: Why Regional Reservations are a Game Changer for Netflix (Venetian)
1:00pm SID304: SecOps 2021 Today: Using AWS Services to Deliver SecOps (MGM)
1:45pm DEV334: Performing Chaos at Netflix Scale (Venetian)
4:45pm SID316: Using Access Advisor to Strike the Balance Between Security and Usability (MGM)
Thursday
12:15pm CMP311: Auto Scaling Made Easy: How Target Tracking Scaling Policies Hit the Bullseye (Palazzo)
12:15pm DAT308: A story of Netflix and AB Testing in the User Interface using DynamoDB (Venetian)
12:55pm CMP309: How Netflix Encodes at Scale (Venetian)
5:00pm ABD401: How Netflix Monitors Applications Real Time with Kinesis (Aria)
Friday
8:30am ABD319: Tooling Up For Efficiency: DIY Solutions @ Netflix (Aria)
10:00am ABD401: Netflix Keystone SPaaS - Real-time Stream Processing as a Service (Aria)
Questions?
Thank you!

Más contenido relacionado

La actualidad más candente

Deep Learning for Data Scientists: Using Apache MXNet and R on AWS - June 201...
Deep Learning for Data Scientists: Using Apache MXNet and R on AWS - June 201...Deep Learning for Data Scientists: Using Apache MXNet and R on AWS - June 201...
Deep Learning for Data Scientists: Using Apache MXNet and R on AWS - June 201...Amazon Web Services
 
FPGA Accelerated Computing Using Amazon EC2 F1 Instances - CMP308 - re:Invent...
FPGA Accelerated Computing Using Amazon EC2 F1 Instances - CMP308 - re:Invent...FPGA Accelerated Computing Using Amazon EC2 F1 Instances - CMP308 - re:Invent...
FPGA Accelerated Computing Using Amazon EC2 F1 Instances - CMP308 - re:Invent...Amazon Web Services
 
Build, train, and deploy Machine Learning models at scale (May 2018)
Build, train, and deploy Machine Learning models at scale (May 2018)Build, train, and deploy Machine Learning models at scale (May 2018)
Build, train, and deploy Machine Learning models at scale (May 2018)Julien SIMON
 
Machine Learning Models with Apache MXNet and AWS Fargate
Machine Learning Models with Apache MXNet and AWS FargateMachine Learning Models with Apache MXNet and AWS Fargate
Machine Learning Models with Apache MXNet and AWS FargateAmazon Web Services
 
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Amazon Web Services
 
How Netflix Encodes at Scale - CMP309 - re:Invent 2017
How Netflix Encodes at Scale - CMP309 - re:Invent 2017How Netflix Encodes at Scale - CMP309 - re:Invent 2017
How Netflix Encodes at Scale - CMP309 - re:Invent 2017Amazon Web Services
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapChris Fregly
 
Deep Dive: AWS X-Ray London Summit 2017
Deep Dive: AWS X-Ray London Summit 2017Deep Dive: AWS X-Ray London Summit 2017
Deep Dive: AWS X-Ray London Summit 2017Randall Hunt
 
Amazon AI/ML Overview
Amazon AI/ML OverviewAmazon AI/ML Overview
Amazon AI/ML OverviewBESPIN GLOBAL
 
Deep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated ComputingDeep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated ComputingAmazon Web Services
 
Amazon SageMaker 內建機器學習演算法 (Level 400)
Amazon SageMaker 內建機器學習演算法 (Level 400)Amazon SageMaker 內建機器學習演算法 (Level 400)
Amazon SageMaker 內建機器學習演算法 (Level 400)Amazon Web Services
 
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot - CMP317 ...
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot - CMP317 ...Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot - CMP317 ...
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot - CMP317 ...Amazon Web Services
 
How to Get the HPC Best-in-class Performance via Intel Xeon Skylake Processor...
How to Get the HPC Best-in-class Performance via Intel Xeon Skylake Processor...How to Get the HPC Best-in-class Performance via Intel Xeon Skylake Processor...
How to Get the HPC Best-in-class Performance via Intel Xeon Skylake Processor...Amazon Web Services
 
High Performance Computing (HPC) on AWS 101
High Performance Computing (HPC) on AWS 101High Performance Computing (HPC) on AWS 101
High Performance Computing (HPC) on AWS 101Amazon Web Services
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...Chris Fregly
 
High Performance Computing on AWS
High Performance Computing on AWSHigh Performance Computing on AWS
High Performance Computing on AWSAmazon Web Services
 
High Performance Computing on AWS: Accelerating Innovation with virtually unl...
High Performance Computing on AWS: Accelerating Innovation with virtually unl...High Performance Computing on AWS: Accelerating Innovation with virtually unl...
High Performance Computing on AWS: Accelerating Innovation with virtually unl...Amazon Web Services
 
Optimizing EC2 for Fun and Profit #bigsavings #newfeatures - CMP202 - re:Inve...
Optimizing EC2 for Fun and Profit #bigsavings #newfeatures - CMP202 - re:Inve...Optimizing EC2 for Fun and Profit #bigsavings #newfeatures - CMP202 - re:Inve...
Optimizing EC2 for Fun and Profit #bigsavings #newfeatures - CMP202 - re:Inve...Amazon Web Services
 
Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기 - 윤석찬 (AWS 테크에반젤리스트)
Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기  - 윤석찬 (AWS 테크에반젤리스트)Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기  - 윤석찬 (AWS 테크에반젤리스트)
Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기 - 윤석찬 (AWS 테크에반젤리스트)Amazon Web Services Korea
 

La actualidad más candente (20)

Deep Learning for Data Scientists: Using Apache MXNet and R on AWS - June 201...
Deep Learning for Data Scientists: Using Apache MXNet and R on AWS - June 201...Deep Learning for Data Scientists: Using Apache MXNet and R on AWS - June 201...
Deep Learning for Data Scientists: Using Apache MXNet and R on AWS - June 201...
 
FPGA Accelerated Computing Using Amazon EC2 F1 Instances - CMP308 - re:Invent...
FPGA Accelerated Computing Using Amazon EC2 F1 Instances - CMP308 - re:Invent...FPGA Accelerated Computing Using Amazon EC2 F1 Instances - CMP308 - re:Invent...
FPGA Accelerated Computing Using Amazon EC2 F1 Instances - CMP308 - re:Invent...
 
Build, train, and deploy Machine Learning models at scale (May 2018)
Build, train, and deploy Machine Learning models at scale (May 2018)Build, train, and deploy Machine Learning models at scale (May 2018)
Build, train, and deploy Machine Learning models at scale (May 2018)
 
Machine Learning Models with Apache MXNet and AWS Fargate
Machine Learning Models with Apache MXNet and AWS FargateMachine Learning Models with Apache MXNet and AWS Fargate
Machine Learning Models with Apache MXNet and AWS Fargate
 
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
 
How Netflix Encodes at Scale - CMP309 - re:Invent 2017
How Netflix Encodes at Scale - CMP309 - re:Invent 2017How Netflix Encodes at Scale - CMP309 - re:Invent 2017
How Netflix Encodes at Scale - CMP309 - re:Invent 2017
 
AWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:CapAWS Re:Invent 2019 Re:Cap
AWS Re:Invent 2019 Re:Cap
 
Deep Dive: AWS X-Ray London Summit 2017
Deep Dive: AWS X-Ray London Summit 2017Deep Dive: AWS X-Ray London Summit 2017
Deep Dive: AWS X-Ray London Summit 2017
 
Amazon AI/ML Overview
Amazon AI/ML OverviewAmazon AI/ML Overview
Amazon AI/ML Overview
 
Deep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated ComputingDeep Dive on Amazon EC2 Accelerated Computing
Deep Dive on Amazon EC2 Accelerated Computing
 
Amazon SageMaker 內建機器學習演算法 (Level 400)
Amazon SageMaker 內建機器學習演算法 (Level 400)Amazon SageMaker 內建機器學習演算法 (Level 400)
Amazon SageMaker 內建機器學習演算法 (Level 400)
 
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot - CMP317 ...
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot - CMP317 ...Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot - CMP317 ...
Run Your CI/CD and Test Workloads for 90% Less with Amazon EC2 Spot - CMP317 ...
 
How to Get the HPC Best-in-class Performance via Intel Xeon Skylake Processor...
How to Get the HPC Best-in-class Performance via Intel Xeon Skylake Processor...How to Get the HPC Best-in-class Performance via Intel Xeon Skylake Processor...
How to Get the HPC Best-in-class Performance via Intel Xeon Skylake Processor...
 
High Performance Computing (HPC) on AWS 101
High Performance Computing (HPC) on AWS 101High Performance Computing (HPC) on AWS 101
High Performance Computing (HPC) on AWS 101
 
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...Waking the Data Scientist at 2am:  Detect Model Degradation on Production Mod...
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod...
 
High Performance Computing on AWS
High Performance Computing on AWSHigh Performance Computing on AWS
High Performance Computing on AWS
 
High Performance Computing on AWS: Accelerating Innovation with virtually unl...
High Performance Computing on AWS: Accelerating Innovation with virtually unl...High Performance Computing on AWS: Accelerating Innovation with virtually unl...
High Performance Computing on AWS: Accelerating Innovation with virtually unl...
 
Optimizing EC2 for Fun and Profit #bigsavings #newfeatures - CMP202 - re:Inve...
Optimizing EC2 for Fun and Profit #bigsavings #newfeatures - CMP202 - re:Inve...Optimizing EC2 for Fun and Profit #bigsavings #newfeatures - CMP202 - re:Inve...
Optimizing EC2 for Fun and Profit #bigsavings #newfeatures - CMP202 - re:Inve...
 
Amazon EC2 Foundations
Amazon EC2 FoundationsAmazon EC2 Foundations
Amazon EC2 Foundations
 
Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기 - 윤석찬 (AWS 테크에반젤리스트)
Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기  - 윤석찬 (AWS 테크에반젤리스트)Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기  - 윤석찬 (AWS 테크에반젤리스트)
Amazon SageMaker을 통한 손쉬운 Jupyter Notebook 활용하기 - 윤석찬 (AWS 테크에반젤리스트)
 

Similar a Orchestrating Machine Learning Training for Netflix Recommendations - MCL317 - re:Invent 2017

Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseAmazon Web Services
 
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAmazon Web Services
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseAmazon Web Services
 
Integrating Deep Learning Into Your Enterprise
Integrating Deep Learning Into Your EnterpriseIntegrating Deep Learning Into Your Enterprise
Integrating Deep Learning Into Your EnterpriseAmazon Web Services
 
Artificial Intelligence (Machine Learning) on AWS: How to Start
Artificial Intelligence (Machine Learning) on AWS: How to StartArtificial Intelligence (Machine Learning) on AWS: How to Start
Artificial Intelligence (Machine Learning) on AWS: How to StartVladimir Simek
 
Integrating Deep Learning In the Enterprise
Integrating Deep Learning In the EnterpriseIntegrating Deep Learning In the Enterprise
Integrating Deep Learning In the EnterpriseAmazon Web Services
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueAmazon Web Services
 
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Amazon Web Services
 
Machine Learning State of the Union - MCL210 - re:Invent 2017
Machine Learning State of the Union - MCL210 - re:Invent 2017Machine Learning State of the Union - MCL210 - re:Invent 2017
Machine Learning State of the Union - MCL210 - re:Invent 2017Amazon Web Services
 
Migrating Your Databases to AWS – Tools and Services (Level 100)
Migrating Your Databases to AWS – Tools and Services (Level 100)Migrating Your Databases to AWS – Tools and Services (Level 100)
Migrating Your Databases to AWS – Tools and Services (Level 100)Amazon Web Services
 
Model Serving for Deep Learning with MXNet Model Server
Model Serving for Deep Learning with MXNet Model ServerModel Serving for Deep Learning with MXNet Model Server
Model Serving for Deep Learning with MXNet Model ServerAmazon Web Services
 
在 AWS 上運行任務關鍵工作負載
在 AWS 上運行任務關鍵工作負載在 AWS 上運行任務關鍵工作負載
在 AWS 上運行任務關鍵工作負載Amazon Web Services
 
Build a Java Spring Application on Amazon ECS - CON332 - re:Invent 2017
Build a Java Spring Application on Amazon ECS - CON332 - re:Invent 2017Build a Java Spring Application on Amazon ECS - CON332 - re:Invent 2017
Build a Java Spring Application on Amazon ECS - CON332 - re:Invent 2017Amazon Web Services
 
Time series modeling workd AMLD 2018 Lausanne
Time series modeling workd AMLD 2018 LausanneTime series modeling workd AMLD 2018 Lausanne
Time series modeling workd AMLD 2018 LausanneSunil Mallya
 
CON203_Driving Innovation with Containers
CON203_Driving Innovation with ContainersCON203_Driving Innovation with Containers
CON203_Driving Innovation with ContainersAmazon Web Services
 
Driving Innovation with Containers - CON203 - re:Invent 2017
Driving Innovation with Containers - CON203 - re:Invent 2017Driving Innovation with Containers - CON203 - re:Invent 2017
Driving Innovation with Containers - CON203 - re:Invent 2017Amazon Web Services
 
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)Amazon Web Services
 
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Amazon Web Services
 
Maschinelles Lernen auf AWS für Entwickler, Data Scientists und Experten
Maschinelles Lernen auf AWS für Entwickler, Data Scientists und ExpertenMaschinelles Lernen auf AWS für Entwickler, Data Scientists und Experten
Maschinelles Lernen auf AWS für Entwickler, Data Scientists und ExpertenAWS Germany
 

Similar a Orchestrating Machine Learning Training for Netflix Recommendations - MCL317 - re:Invent 2017 (20)

Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your Enterprise
 
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your EnterpriseAWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
AWS Machine Learning Week SF: Integrating Deep Learning into Your Enterprise
 
Integrating Deep Learning into your Enterprise
Integrating Deep Learning into your EnterpriseIntegrating Deep Learning into your Enterprise
Integrating Deep Learning into your Enterprise
 
Integrating Deep Learning Into Your Enterprise
Integrating Deep Learning Into Your EnterpriseIntegrating Deep Learning Into Your Enterprise
Integrating Deep Learning Into Your Enterprise
 
Artificial Intelligence (Machine Learning) on AWS: How to Start
Artificial Intelligence (Machine Learning) on AWS: How to StartArtificial Intelligence (Machine Learning) on AWS: How to Start
Artificial Intelligence (Machine Learning) on AWS: How to Start
 
Integrating Deep Learning In the Enterprise
Integrating Deep Learning In the EnterpriseIntegrating Deep Learning In the Enterprise
Integrating Deep Learning In the Enterprise
 
ARC205_Born in the Cloud
ARC205_Born in the CloudARC205_Born in the Cloud
ARC205_Born in the Cloud
 
ABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS GlueABD315_Serverless ETL with AWS Glue
ABD315_Serverless ETL with AWS Glue
 
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
Stack Mastery: Create and Optimize Advanced AWS CloudFormation Templates - DE...
 
Machine Learning State of the Union - MCL210 - re:Invent 2017
Machine Learning State of the Union - MCL210 - re:Invent 2017Machine Learning State of the Union - MCL210 - re:Invent 2017
Machine Learning State of the Union - MCL210 - re:Invent 2017
 
Migrating Your Databases to AWS – Tools and Services (Level 100)
Migrating Your Databases to AWS – Tools and Services (Level 100)Migrating Your Databases to AWS – Tools and Services (Level 100)
Migrating Your Databases to AWS – Tools and Services (Level 100)
 
Model Serving for Deep Learning with MXNet Model Server
Model Serving for Deep Learning with MXNet Model ServerModel Serving for Deep Learning with MXNet Model Server
Model Serving for Deep Learning with MXNet Model Server
 
在 AWS 上運行任務關鍵工作負載
在 AWS 上運行任務關鍵工作負載在 AWS 上運行任務關鍵工作負載
在 AWS 上運行任務關鍵工作負載
 
Build a Java Spring Application on Amazon ECS - CON332 - re:Invent 2017
Build a Java Spring Application on Amazon ECS - CON332 - re:Invent 2017Build a Java Spring Application on Amazon ECS - CON332 - re:Invent 2017
Build a Java Spring Application on Amazon ECS - CON332 - re:Invent 2017
 
Time series modeling workd AMLD 2018 Lausanne
Time series modeling workd AMLD 2018 LausanneTime series modeling workd AMLD 2018 Lausanne
Time series modeling workd AMLD 2018 Lausanne
 
CON203_Driving Innovation with Containers
CON203_Driving Innovation with ContainersCON203_Driving Innovation with Containers
CON203_Driving Innovation with Containers
 
Driving Innovation with Containers - CON203 - re:Invent 2017
Driving Innovation with Containers - CON203 - re:Invent 2017Driving Innovation with Containers - CON203 - re:Invent 2017
Driving Innovation with Containers - CON203 - re:Invent 2017
 
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
透過最新的 AWS 服務在 2019 年為您的業務轉型 (Level 200)
 
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
Deep Learning Using Caffe2 on AWS - MCL313 - re:Invent 2017
 
Maschinelles Lernen auf AWS für Entwickler, Data Scientists und Experten
Maschinelles Lernen auf AWS für Entwickler, Data Scientists und ExpertenMaschinelles Lernen auf AWS für Entwickler, Data Scientists und Experten
Maschinelles Lernen auf AWS für Entwickler, Data Scientists und Experten
 


Orchestrating Machine Learning Training for Netflix Recommendations - MCL317 - re:Invent 2017

  • 1. AWS re:Invent: Orchestrating Machine Learning Training for Netflix Recommendations. Davis Shepherd, Eugen Cepoi, Faisal Siddiqi. MCL317, November 29, 2017. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 2. ?
  • 3.
  • 4. Goal: Create personalized recommendations to help members find content to enjoy, maximizing their satisfaction
  • 5. Context
  • 6. Agenda: Recommendation context; Overview of Meson; ML training pipelines using Meson; Lessons learned while building Meson
  • 7. Recommendation Context (diagram): Member Streaming Data
  • 8. Recommendation Context (diagram): Member Streaming Data; Training pipeline; Models
  • 9. Recommendation Context (diagram): Member Streaming Data; Training pipeline; Models; Caches
  • 10. Recommendation Context (diagram): Member Streaming Data; Training pipeline; Models; Precompute system; Caches
  • 11. Recommendation Context (diagram): Member Streaming Data; Training pipelines (three); Models (three); Precompute system; Caches; AB test Allocation
  • 12. Recommendation Context (diagram): Member Streaming Data; Training pipelines (three); Models (three); Precompute system; Caches; AB test Allocation
  • 13. Training Pipelines (diagram): Data Preparation (Spark, Extract from Hive, Stratified Sampling); Feature Generation (Spark, Online Snapshots, Feature Encoders); Model Metrics (Validation, Offline Metrics, Alerts); Model Training (Proprietary Algos, Spark/TensorFlow, Parameter Search); Model Selection (Test dataset, Hyper parameters); Model Publish (S3, Online Caches); Scoring/Inference (Precompute, Live Compute, Spark/Online Caches)
  • 14. Pipeline Challenges: Heterogeneous systems; Failure handling; Reproducibility; Multi-tenancy; External triggers
  • 15. Meson: Workflow Orchestration. Generic, extensible workflow scheduling engine; Attached to an Apache Mesos cluster; Supports Spark compute natively and can schedule Docker containers at scale; DSLs, REST API, and UI for definition and visualization; DAGs, Loops, Conditionals, DataArtifacts, SubWorkflows; Flexible MVEL Java expressions for extensibility
  • 16. Meson Stats: 2+ years in production; 10+ managed and self-service deployed clusters; 1000+ daily production and A/B Test ML pipelines; 2000+ Amazon EC2 instances in largest Spark/Mesos compute pool; 20,000+ daily step runs
  • 17. Meson in Action: Web interface; Workflow in motion
  • 18. ML Pipelines
  • 19. Anatomy of an ML Pipeline
  • 20. Anatomy of an ML Pipeline: A Page construction model
  • 21. Anatomy of an ML Pipeline: A Page construction model. MVEL for datetime: {"trainingDataStartDateint": "new java.text.SimpleDateFormat("yyyyMMdd").format(new java.text.SimpleDateFormat("yyyyMMdd").parse(FeedDate).getTime() - 1209600000).toString()"}
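The MVEL expression on slide 21 subtracts 1209600000 milliseconds from FeedDate, which is exactly 14 days (14 × 24 × 60 × 60 × 1000). The Python sketch below reproduces only that arithmetic for illustration. The function name `training_start_date` and the sample date are assumptions, not part of Meson, and naive datetimes are used, so no time zone or daylight-saving effects are modeled.

```python
from datetime import datetime, timedelta

# 1209600000 ms, the offset used in the slide's MVEL expression, is 14 days.
OFFSET_MS = 1_209_600_000


def training_start_date(feed_date: str) -> str:
    """Return the yyyyMMdd date that lies OFFSET_MS before feed_date."""
    parsed = datetime.strptime(feed_date, "%Y%m%d")
    return (parsed - timedelta(milliseconds=OFFSET_MS)).strftime("%Y%m%d")


print(training_start_date("20171129"))  # 20171115
```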
  • 22. Anatomy of an ML Pipeline: A Page construction model. Heterogeneous computes tied together by Meson: Data Prep @ Spark; Training @ Docker (two steps)
  • 23. Anatomy of an ML Pipeline: A Page construction model. Per A/B cell metric computations: Per Cell Metrics @ Spark (two steps)
  • 24. More ML Pipelines: A Boxart Personalization pipeline
  • 25. More ML Pipelines: A Boxart Personalization pipeline. Steps: Data prep, labels, etc.; Feature Generation; Model Publish; Custom Step; Feature Importance computation; Model Scoring and Selection; Training multiple Models; Metric Distribution; Notification; Clean up features, models...
  • 26. More ML Pipelines: A Video Ranking pipeline
  • 27. More ML Pipelines: A Video Ranking pipeline. Base DSL class defines the standard pipeline; every new A/B test extends the base class; branching for multiple A/B test cells/models
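The pattern on slide 27 (a base class defines the standard pipeline, each A/B test extends it, and the pipeline branches per cell) can be sketched as follows. This is a Python illustration only: the Meson DSL is Scala, and the class names, step names, and cell names here are all hypothetical.

```python
class BasePipeline:
    """Hypothetical base class holding the standard step sequence."""

    def cells(self):
        # The base pipeline trains a single default model.
        return ["default"]

    def steps(self):
        plan = ["data_prep", "feature_generation"]
        # Branch: one training step and one metrics step per A/B cell.
        for cell in self.cells():
            plan.append(f"train[{cell}]")
            plan.append(f"metrics[{cell}]")
        plan.append("publish")
        return plan


class NewRankerTest(BasePipeline):
    """Hypothetical A/B test: overrides only the list of cells."""

    def cells(self):
        return ["cell1", "cell2"]


print(NewRankerTest().steps())
```

The subclass changes only what differs for the test, so the shared steps stay defined in one place.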
  • 28. More ML Pipelines: Another Ranking pipeline
  • 29. More ML Pipelines: Another Ranking pipeline. Uses Scatter/Gather pattern for data-parallel training; data chunks sent to trainer framework in parallel; model validation and publish delegated to Docker script
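A minimal sketch of the scatter/gather pattern named on slide 29: the data is split into chunks, each chunk is processed in parallel, and the partial results are combined. The "trainer" here is a stand-in that computes partial sums so the example stays self-contained; all function names are assumptions and do not reflect the trainer framework used at Netflix.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(rows, n_chunks):
    """Split rows into n_chunks contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def train_on_chunk(chunk):
    """Stand-in trainer: returns (sum, count) partial statistics."""
    return sum(chunk), len(chunk)


def gather(partials):
    """Combine partial statistics into one 'model' (here, a mean)."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def scatter_gather_train(rows, n_chunks=4):
    chunks = scatter(rows, n_chunks)
    # Each chunk is handed to a worker in parallel.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(train_on_chunk, chunks))
    return gather(partials)


print(scatter_gather_train(list(range(10))))  # 4.5
```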
  • 30. More ML Pipelines: Continue Watching pipeline
  • 31. More ML Pipelines: Continue Watching pipeline. Parallel pipelines for the old and new approaches that share common steps
  • 32. More ML Pipelines: Member Value Modeling
  • 33. More ML Pipelines: Member Value Modeling. Uses Meson foreach for parameter sweeps across Docker runs; uses Custom step to run Pig query; triggers downstream ETL on completion
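The foreach-style parameter sweep on slide 33 amounts to expanding a parameter grid into one run per combination. The sketch below shows that expansion in plain Python. The grid values, the `run_step` helper, and the command format are hypothetical; they are not Meson's foreach API.

```python
from itertools import product


def expand_sweep(grid):
    """Yield one parameter dict per combination in the grid."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def run_step(params):
    """Stand-in for launching one container run with these parameters."""
    return f"train --lr={params['lr']} --depth={params['depth']}"


grid = {"lr": [0.01, 0.1], "depth": [4, 8]}
commands = [run_step(p) for p in expand_sweep(grid)]
for command in commands:
    print(command)
```

With two values per parameter, the sweep produces four independent runs, which is what makes it a natural fit for parallel container execution.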
  • 34. Lessons Learned
  • 35. Lesson: Data is at the core of every job. Data artifacts defined by a name and partitions; Cross workflow dependencies; External triggers
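One way to picture slide 35: an artifact is identified by a name plus partition values, and publishing it can trigger workflows that were waiting on exactly that partition. The sketch below is a hypothetical in-memory model of that idea; `DataArtifact`, `ArtifactRegistry`, and the workflow names are assumptions, not Meson classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataArtifact:
    name: str
    partitions: tuple  # sorted (key, value) pairs, e.g. (("date", "20171129"),)


def artifact(name, **partitions):
    """Build an artifact identity from a name and partition key/values."""
    return DataArtifact(name, tuple(sorted(partitions.items())))


class ArtifactRegistry:
    def __init__(self):
        self._published = set()
        self._waiting = {}  # artifact -> names of workflows waiting on it

    def depends_on(self, workflow, art):
        self._waiting.setdefault(art, []).append(workflow)

    def publish(self, art):
        """Record the artifact and return the workflows it unblocks."""
        self._published.add(art)
        return self._waiting.pop(art, [])


registry = ArtifactRegistry()
registry.depends_on("train_ranker", artifact("features", date="20171129"))
print(registry.publish(artifact("features", date="20171129")))  # ['train_ranker']
```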
  • 36. Lesson: Immutability (with versions) provides sanity. Workflows have immutable versions, enabling: better collaboration; rollbacks; reproducibility
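A small sketch of why immutable versions make rollbacks cheap: every publish appends a new version, and a rollback only moves the "active" pointer back to an earlier one. The `WorkflowStore` class and its methods are hypothetical and only illustrate the idea on slide 36.

```python
class WorkflowStore:
    """Hypothetical append-only store of workflow definitions."""

    def __init__(self):
        self._versions = {}  # workflow name -> list of frozen definitions
        self._active = {}    # workflow name -> active version number (1-based)

    def publish(self, name, steps):
        versions = self._versions.setdefault(name, [])
        versions.append(tuple(steps))  # tuples cannot be modified afterwards
        self._active[name] = len(versions)
        return len(versions)

    def rollback(self, name, version):
        if not 1 <= version <= len(self._versions[name]):
            raise ValueError(f"unknown version {version} for {name}")
        self._active[name] = version

    def active(self, name):
        return self._versions[name][self._active[name] - 1]


store = WorkflowStore()
store.publish("ranker", ["prep", "train"])
store.publish("ranker", ["prep", "train", "publish"])
store.rollback("ranker", 1)
print(store.active("ranker"))  # ('prep', 'train')
```

Because old versions are never overwritten, rerunning version 1 later reproduces the same step list.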
  • 37. Lesson: One Abstraction Doesn’t Fit All. Evidenced by the many names: Workflow, ProcessFlow, Pipeline, DAG, DataFlow. Overspecialization will inevitably weaken other use cases
  • 38. Lesson: One Abstraction Doesn’t Fit All. Meson provides “workflows as a service” on top of which many domain-specialized abstractions can be built: A/B test orchestration; ML orchestration; ETL pipelines; Notebook Automation; and more. Diagram: ETL DSL, ML DSL, and Automation DSL on top of Meson
  • 39. Lesson: Prepare for the Future, for It Is Unknown. The influx of new ML tech is massive; we had invested heavily in Spark, and that has been useful, but technology is still moving; the ability for users to extend the system for new tech has enabled us to keep up
  • 40. Lesson: Prepare for the Future, for It Is Unknown. The Custom Step interface enables platform-specific integrations like Spark and Titus (Netflix internal Docker service); the DSL can be extended to further specialize for particular technologies (TensorFlow on Docker, model training in Spark). Shown: Spark Submit options; Links to Spark UI & History Server; Titus Docker Milestones
  • 41. Lesson: Prepare for the Future, for It Is Unknown. Step types: Execute Command; REST Job; Spark Submit; Run Docker; TensorFlow Train; Run Notebook; Run Pig Job; ...
  • 42. Lesson: Embrace How the Sausage Is Made. Diagram: REST API; Orchestration layer; Scheduler; Persistence layer; Zookeeper
  • 43. Lesson: Embrace How the Sausage Is Made. Meson as a Mesos framework; Mesos offers resources and runs the steps; Fenzo (Netflix OSS) makes scheduling decisions. Diagram: Mesos Framework (Scheduler, Fenzo); Mesos Master; Mesos Agents, each with a Meson executor
  • 44. Lesson: Embrace How the Sausage Is Made. The Meson executor runs the actual steps; publishes runtime debug information (logs, metrics, configurations) and task status updates; and survives Meson scheduler failures. Diagram: Mesos Master; Mesos Agents with Meson executors; Docker container; Service; Spark driver; Spark Executors
  • 45. Lesson: When to Get a New Pair of Jeans* (*Hint: before the first sign of tear!). Cassandra cluster provided as a service and maintained by a dedicated team; everything stored as Protobuf blobs; custom secondary indexes to support various query patterns
  • 46. Lesson: When to Get a New Pair of Jeans. Need to support complex query patterns, aggregations, and joins; creating and maintaining secondary indexes is cumbersome; debugging the stored data is not trivial, since it requires application code to deserialize
  • 47. Lesson: When to Get a New Pair of Jeans. Amazon Relational Database Service: on-demand, production-ready relational database in the cloud; takes care of the administrative work for you (backups, replication, software updates, failover); easy to scale the database, with the possibility to add read replicas as needed; supports most common database engines. Amazon Aurora: a high-performance and reliably managed database; fully compatible with MySQL; can serve a high number of concurrent requests
  • 48. Lesson: When to Get a New Pair of Jeans. A single Amazon RDS cluster per region; multiple Meson instances running different versions; an Amazon RDS cluster will have a logical database per Meson instance
  • 49. Lesson: When to Get a New Pair of Jeans. How do we apply schema changes or data migration? Automated database migrations through SQL scripts or code with Flyway; leadership acquisition in Zookeeper for red/black deployments; migrations applied lazily when updating a specific Meson instance
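The versioned-migration idea on slide 49 can be illustrated without Flyway itself: pending scripts are applied in version order, and a history table records which versions a given database already has, so rerunning is a no-op. The sketch below uses Python's built-in `sqlite3` as a stand-in for the relational database; the table names, the example migrations, and the `migrate` function are assumptions, not Flyway's API or Meson's schema.

```python
import sqlite3

# Hypothetical migrations: (version, SQL) pairs applied in version order.
MIGRATIONS = [
    (1, "CREATE TABLE workflow (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE workflow ADD COLUMN version INTEGER NOT NULL DEFAULT 1"),
]


def migrate(conn):
    """Apply pending migrations in order; return the versions applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_history (version INTEGER PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT version FROM schema_history")}
    applied = []
    for version, sql in sorted(MIGRATIONS):
        if version in done:
            continue  # this database already has the change
        conn.execute(sql)
        conn.execute("INSERT INTO schema_history (version) VALUES (?)", (version,))
        applied.append(version)
    conn.commit()
    return applied


# One connection stands in for the logical database of one Meson instance.
conn = sqlite3.connect(":memory:")
print(migrate(conn))  # [1, 2]
print(migrate(conn))  # []
```

Running `migrate` only when a particular instance is upgraded corresponds to the lazy approach on the slide: each logical database moves forward independently.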
  • 50. Lesson: When to Get a New Pair of Jeans. Next steps/takeaways: Obtain desired querying flexibility without additional operations burden; the database will influence the design of your application; migrating application code to leverage relational DB capabilities is tedious; consider an ORM to reduce the code and improve query composability; lazy migrations can be preferable, but with the tradeoff of maintaining old code
  • 51. Lesson: Know Thy User. User interactions with Meson: defining the workflow (Scala DSL); operating/monitoring a running workflow (Web UI)
  • 52. Lesson: Know Thy User. We were improving those, but our users suffered from a different problem: how to deploy workflows and ship binaries to the cluster. We talked to our users and iterated. A lot. Current solution: a Gradle plugin integrated with the build system for automation; automated workflow releases…
  • 53. Lesson: Know Thy User. Diagram (Git, Jenkins): 1. PR merged; 2. Deploy & run canary workflows; 3. Deploy production workflows
  • 54. Lesson: Know Thy User. Interact with Meson from the running job to leverage advanced features: loops, foreach, parameters that can be passed around; artifacts to expose debugging information; progress milestones, links, counters, images, etc.
  • 55. What’s Ahead: Scaling to tens of thousands of daily ETL jobs for broader Netflix Data needs; tighter integration with application code using MesonContext; support for more sophisticated pipelines
  • 56. Netflix Talks at re:Invent 2017. Monday: 10:45am ARC208: Walking the tightrope: Balancing Innovation, Reliability, Security, and Efficiency (Venetian); 12:15pm SID206: Best Practices for Managing Security on AWS (MGM). Tuesday: 10:45am ARC209: A Day in the Life of a Netflix Engineer (Venetian); 11:30am CMP204: How Netflix Tunes EC2 Instances for Performance (Venetian). Wednesday: 11:30am MCL317: Orchestrating ML Training for Netflix Recommendations (Venetian); 12:15pm NET303: A day in the life of a Cloud Network Engineer at Netflix (Venetian); 1:00pm ARC312: Why Regional Reservations are a Game Changer for Netflix (Venetian); 1:00pm SID304: SecOps 2021 Today: Using AWS Services to Deliver SecOps (MGM); 1:45pm DEV334: Performing Chaos at Netflix Scale (Venetian); 4:45pm SID316: Using Access Advisor to Strike the Balance Between Security and Usability (MGM). Thursday: 12:15pm CMP311: Auto Scaling Made Easy: How Target Tracking Scaling Policies Hit the Bullseye (Palazzo); 12:15pm DAT308: A story of Netflix and AB Testing in the User Interface using DynamoDB (Venetian); 12:55pm CMP309: How Netflix Encodes at Scale (Venetian); 5:00pm ABD401: How Netflix Monitors Applications Real Time with Kinesis (Aria). Friday: 8:30am ABD319: Tooling Up For Efficiency: DIY Solutions @ Netflix (Aria); 10:00am ABD401: Netflix Keystone SPaaS - Real-time Stream Processing as a Service (Aria)
  • 57. Questions? Thank you!