Deep learning beyond the learning
@joerg_schad @dcos
Jörg Schad
Technical Community Lead / Developer
Deep Learning
● Core Mesos developer at Mesosphere
● Twitter: @joerg_schad
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Promise
Deep Learning: The Process
Step 1: Training (In Data Center - Over Hours/Days/Weeks)
Input: Lots of Labeled Data (label: Dog)
Deep neural network model
Output: Trained Model
Step 2: Inference (Endpoint or Data Center - Instantaneous)
Input: New Input from Camera or Sensor
Trained Model
Output: Classification (97% Dog, 3% Panda)
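The classification output above (97% Dog, 3% Panda) is a probability distribution over labels. A minimal sketch of how raw model scores are commonly turned into such percentages with a softmax. The labels and score values below are made up for illustration and are not taken from the talk.

```python
import math


def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtract the maximum score first for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical labels and scores, chosen only for illustration.
labels = ["Dog", "Panda"]
probabilities = softmax([4.0, 0.5])

# The predicted class is the label with the highest probability.
prediction = max(zip(labels, probabilities), key=lambda pair: pair[1])
```

A trained network produces the scores; the softmax step only rescales them so they can be read as confidence values.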
Deep Learning: Some insight
Deep Learning: The Challenges
1. Explore data using Jupyter notebook
2. Train the model using TensorFlow
3. Monitor training progress using TensorBoard
4. Debug Model using tfdbg
5. Serve Model using TensorFlow Serving
Cloud Pipeline
1. Data Preparation using Spark
2. Explore data using Jupyter notebook
3. Train the model using TensorFlow
4. Monitor training progress using TensorBoard
5. Debug Model using tfdbg
6. Serve Model using TensorFlow Serving
7. Streaming of requests
...
Open Source Pipeline
1. Data Preparation using Spark
2. Explore data using Jupyter notebook
3. Train the model using TensorFlow
4. Monitor training progress using TensorBoard
5. Debug Model using tfdbg
6. Serve Model using TensorFlow Serving
7. Kafka stream of requests
Kubeflow
Deep Learning Pipeline
Data & Streaming: Distributed Data Storage and Streaming
Users: Data Preparation and Analysis
Frameworks & Cluster: Deep Learning Tools and Distributed Hosting
Models: Building Machine Learning Model
Model Serving: Sending Model to Clients
Monitoring & Operations
© 2017 Mesosphere, Inc. All Rights Reserved.
Training Challenges
Step 1: Training (In Data Center - Over Hours/Days/Weeks)
Input: Lots of Labeled Data (label: Dog)
Deep neural network model
Output: Trained Model
● Compute Intensive
○ (Hopefully) Large Datasets
■ Train
■ Dev
■ Test
○ Hyperparameters
■ Number of layers
■ Number of units per layer
■ Learning rate
■ …
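Why training is compute intensive can be sketched in a few lines: the data is split into train/dev/test sets, and every combination of hyperparameters is a separate training run. The split fractions and the search space below are placeholder assumptions, not values from the talk.

```python
import itertools
import random


def split_dataset(examples, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle examples and split them into train, dev, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


def hyperparameter_grid(space):
    """Yield one dict per combination of hyperparameter values."""
    names = sorted(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical search space; the values are placeholders.
search_space = {
    "num_layers": [2, 4],
    "units_per_layer": [64, 128, 256],
    "learning_rate": [0.1, 0.01],
}

train, dev, test = split_dataset(range(100))
configurations = list(hyperparameter_grid(search_space))
```

Even this small grid yields one training job per configuration, which is why a cluster scheduler becomes attractive quickly.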
Data Management
Input Data Management
Input: Lots of Labeled Data
Challenges
● Training/Dev/Test + New Data
● Large amounts
● Quality
● Availability (for cluster)
● Velocity
● Streaming
Solutions
● GFS
● Apache Kafka
● Apache Cassandra
Data Preparation
Challenges
● Data is typically not ready to be consumed by ML job*
● Data Cleaning
● Missing/incorrect labels
● Data Preparation
● Same Format
● Same Distribution
Solutions
* Demo datasets are a fortunate exception :)
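A minimal sketch of the cleaning step described above: drop records with missing or unknown labels and bring the rest into one format. The record layout (`features`, `label`) and the set of allowed labels are assumptions made for illustration only.

```python
def clean_records(records, allowed_labels):
    """Drop records with missing or unknown labels and normalize the rest."""
    cleaned = []
    for record in records:
        label = record.get("label")
        if label is None:
            continue  # missing label
        label = label.strip().lower()
        if label not in allowed_labels:
            continue  # incorrect or unknown label
        # Same format: every feature becomes a float.
        features = [float(value) for value in record["features"]]
        cleaned.append({"features": features, "label": label})
    return cleaned


# Hypothetical raw input mixing strings, numbers, and bad labels.
raw = [
    {"features": ["1", "2.5"], "label": " Dog "},
    {"features": [3, 4], "label": None},
    {"features": [5, 6], "label": "cat?"},
    {"features": [7, 8]},
]
cleaned = clean_records(raw, allowed_labels={"dog", "panda"})
```

In a real pipeline this kind of step would typically run as a distributed job (the deck names Spark for data preparation); the logic per record stays the same.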
Users
Challenges
● Different Users/Use cases
● Data Analyst/Exploring
● Production Workloads
● Highly Optimized
● How to spawn Environments?
Solutions
Frameworks
What is TensorFlow?
“An open-source software library for Machine Intelligence” - tensorflow.org
● Machine Intelligence is the broad term used to describe techniques allowing computers to “learn” by analyzing very large data sets using artificial neural networks
● TensorFlow is a software library that makes it easy for developers to construct artificial neural networks to analyze their data of interest
TensorFlow Library
Python
Dataflow Executor, Compute Kernel Implementations, Networking, etc.
GPUs
CPUs
Alternatives
tf.enable_eager_execution()
https://www.tensorflow.org/get_started/eager
Data Analytics Ecosystem
APIs
Challenges
● Different Frameworks
● No one rules them all
Solutions
● Pick the right tool
● PMML if needed
Deep Learning Frameworks
Cluster
Typical Developer Workflow for TensorFlow (Single-Node)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for single-node performance
● Train on your data on a single node → Output Trained Model
Input Data Set → Trained Model
Typical Developer Workflow for TensorFlow (Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed
● Deploy your code on every machine
● Train on your data on the cluster → Output Trained Model
Input Data Set → Trained Model
Typical Developer Workflow for TensorFlow (Distributed)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for distributed computation
● …
Resource Isolation and Allocation
TPU
TPUs
Datacenter
Typical Datacenter: siloed, over-provisioned servers, low utilization
Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines
TensorFlow
Jenkins
Kafka
Spark
Datacenter and Cloud as a Single Computing Resource
Powered by Apache Mesos
PHYSICAL INFRASTRUCTURE
VIRTUAL MACHINES
PUBLIC CLOUDS
MICROSERVICES, CONTAINERS, & DEV TOOLS
DATA SERVICES, MACHINE LEARNING, & AI
Security & Compliance
Application-Aware Automation
Multitenancy
Hybrid Cloud Management
100+ MORE
20+ MORE
Datacenter
Edge
Challenges running distributed TensorFlow*
● Dealing with failures is not graceful
○ Users need to stop training, change their hard-coded ClusterSpec, and manually restart their jobs
* Any Distributed System
Deploy
Scale
Configure
Recover
3 AM
...
Typical Datacenter: siloed, over-provisioned servers, low utilization
HDFS
Kafka
Kubernetes
Flink
TensorFlow
Two-level Scheduling
1. Agents advertise resources to Master
2. Master offers resources to Framework
3. Framework rejects / uses resources
4. Agent reports task status to Master
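The four steps above can be sketched as a toy simulation. This is an illustrative model only, not the Mesos API: the `Master` and `Framework` classes, their methods, and the CPU-only resource model are assumptions made for the example.

```python
class Framework:
    """Holds pending tasks and decides whether to use an offer (step 3)."""

    def __init__(self, pending):
        self.pending = list(pending)

    def consider(self, agent_id, cpus):
        # Use the offer if the next pending task fits, otherwise reject it.
        if self.pending and self.pending[0]["cpus"] <= cpus:
            return self.pending.pop(0)
        return None


class Master:
    """Tracks resources advertised by agents and offers them to frameworks."""

    def __init__(self):
        self.available = {}    # agent id -> free CPUs
        self.task_status = {}  # task id -> last reported status

    def advertise(self, agent_id, cpus):
        """Step 1: an agent advertises its resources to the master."""
        self.available[agent_id] = cpus

    def report(self, task_id, status):
        """Step 4: an agent reports a task status to the master."""
        self.task_status[task_id] = status

    def offer_to(self, framework):
        """Step 2: offer each agent's free resources to the framework."""
        for agent_id, cpus in list(self.available.items()):
            task = framework.consider(agent_id, cpus)
            if task is not None:
                self.available[agent_id] = cpus - task["cpus"]
                # In this toy model the agent starts the task immediately.
                self.report(task["id"], "TASK_RUNNING")


master = Master()
master.advertise("agent-1", cpus=4)
master.advertise("agent-2", cpus=1)
framework = Framework([{"id": "t1", "cpus": 2}, {"id": "t2", "cpus": 2}])
master.offer_to(framework)
```

The point of the split is that the master only decides who gets offered resources, while each framework decides what to run on them.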
MESOS ARCHITECTURE
(Diagram: three Mesos Masters; Mesos Agents running Cassandra, Spark, Docker, and CDB executors with their tasks; framework schedulers for Flink, Spark, and Kafka)
Challenges running distributed TensorFlow
● Hard-coding a “ClusterSpec” is incredibly tedious
○ Users need to rewrite code for every job they want to run in a distributed setting
○ True even for code they “inherit” from standard models
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
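A small sketch of the obvious workaround: generate the dictionary instead of typing it out. The helper name `build_cluster_spec` and the host naming pattern (taken from the example.com placeholders on the slide) are assumptions. It still presumes that host names are predictable, which is the part a scheduler has to provide.

```python
def build_cluster_spec(num_workers, num_ps, domain="example.com", port=2222):
    """Return a dict in the shape accepted by tf.train.ClusterSpec."""
    return {
        "worker": ["worker%d.%s:%d" % (i, domain, port) for i in range(num_workers)],
        "ps": ["ps%d.%s:%d" % (i, domain, port) for i in range(num_ps)],
    }


cluster = build_cluster_spec(num_workers=6, num_ps=4)

# With TensorFlow 1.x installed, the dict could be passed on as:
#   tf.train.ClusterSpec(cluster)
```

This removes the typing but not the coupling: the job code still has to know how many machines exist and what they are called.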
Challenges running distributed TensorFlow
● Manually configuring each node in a cluster takes a long time and is error-prone
○ Setting up access to a shared file system (for checkpoint and summary files) requires authenticating on each node
○ Tweaking hyper-parameters requires re-uploading code to every node
Typical Developer Workflow for TensorFlow (Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed
● Deploy your code on every machine
● Train on your data on the cluster → Output Trained Model
Input Data Set → Trained Model
Running distributed TensorFlow on DC/OS
● We use the dcos-commons SDK to dynamically create the ClusterSpec
{
    "service": {
        "name": "mnist",
        "job_url": "...",
        "job_context": "..."
    },
    "gpu_worker": {...},
    "worker": {...},
    "ps": {...}
}
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
Running distributed TensorFlow on DC/OS
● Wrapper script to abstract away distributed TensorFlow configuration
○ Separates “deployer” responsibilities from “developer” responsibilities
{
    "service": {
        "name": "mnist",
        "job_url": "...",
        "job_context": "..."
    },
    "gpu_worker": {...},
    "worker": {...},
    "ps": {...}
}
User Code
Wrapper Script
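A rough sketch of the deployer/developer split. This is not the actual DC/OS wrapper script: the function `run_with_cluster` and the setting names `CLUSTER_SPEC`, `JOB_NAME`, and `TASK_INDEX` are hypothetical, chosen only to show the idea that the platform supplies the cluster layout and the user code merely receives it.

```python
import json


def run_with_cluster(user_main, environ):
    """Parse deployer-provided settings and hand them to the developer's code."""
    cluster = json.loads(environ["CLUSTER_SPEC"])
    job_name = environ["JOB_NAME"]
    task_index = int(environ["TASK_INDEX"])
    if job_name not in cluster or not 0 <= task_index < len(cluster[job_name]):
        raise ValueError("task %s:%d is not part of the cluster" % (job_name, task_index))
    return user_main(cluster, job_name, task_index)


def user_main(cluster, job_name, task_index):
    """Developer code: sees the cluster layout but never hard-codes it."""
    return "%s %d of %d" % (job_name, task_index, len(cluster[job_name]))


# Hypothetical settings; a real wrapper would read os.environ instead.
example_env = {
    "CLUSTER_SPEC": json.dumps({
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
        "ps": ["ps0.example.com:2222"],
    }),
    "JOB_NAME": "worker",
    "TASK_INDEX": "1",
}
result = run_with_cluster(user_main, example_env)
```

With this shape, changing the number of workers is a deployment setting and requires no change to `user_main`.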
Running distributed TensorFlow on DC/OS
● The dcos-commons SDK cleanly restarts failed tasks and reconnects them to the cluster
Model Management
Recall
Step 1: Training (In Data Center - Over Hours/Days/Weeks)
Input: Lots of Labeled Data (label: Dog)
Deep neural network model
Output: Trained Model
Step 2: Inference (Endpoint or Data Center - Instantaneous)
Input: New Input from Camera or Sensor
Trained Model
Output: Classification (97% Dog, 3% Panda)
Many Models
Step 1: Training (In Data Center - Over Hours/Days/Weeks)
Input: Lots of Labeled Data (label: Dog)
Deep neural network model
Output: Trained Model
Challenges
● Many Models
● Different Hyperparameters
● Different Models
● New Training Data
● ...
Solutions
● Persistent Storage + Metadata
Model Management
GFS
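"Persistent Storage + Metadata" can be pictured as one record per trained model. A minimal sketch with an in-memory list standing in for persistent storage; the field names and the `register_model` helper are assumptions for illustration.

```python
import hashlib
import json


def register_model(registry, name, hyperparameters, training_data_id, model_bytes):
    """Append a metadata record for a trained model and return it."""
    # Versions count up per model name.
    version = 1 + sum(1 for entry in registry if entry["name"] == name)
    record = {
        "name": name,
        "version": version,
        "hyperparameters": hyperparameters,
        "training_data_id": training_data_id,
        # The checksum ties the metadata to the exact stored artifact.
        "checksum": hashlib.sha256(model_bytes).hexdigest(),
    }
    registry.append(record)
    return record


registry = []
first = register_model(registry, "mnist", {"learning_rate": 0.1}, "data-2018-04", b"weights-a")
second = register_model(registry, "mnist", {"learning_rate": 0.01}, "data-2018-05", b"weights-b")

# JSON-serializable, so the registry can be written next to the model files.
serialized = json.dumps(registry, sort_keys=True)
```

Recording hyperparameters and the training data identifier alongside each version makes it possible to tell later which of the many models produced a given result.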
TensorFlow Hub
https://www.tensorflow.org/hub/
Serving
Challenges
● How to Deploy Models?
● Zero Downtime
● Canary
Solutions
● TensorFlow Serving
Model Serving
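A canary rollout sends a small, stable share of requests to the new model version while the rest stay on the current one. A minimal sketch of such routing, independent of any serving system; the version names and the hash-based bucketing are assumptions made for the example, not a TensorFlow Serving feature.

```python
import hashlib


def choose_model_version(request_id, canary_fraction, stable="v1", canary="v2"):
    """Route a stable share of requests to the canary model version."""
    # Hash the request id into a bucket in [0, 1) so routing is deterministic.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2 ** 32
    return canary if bucket < canary_fraction else stable


request_ids = ["request-%d" % i for i in range(1000)]
routed = [choose_model_version(rid, canary_fraction=0.5) for rid in request_ids]
```

Raising `canary_fraction` step by step to 1.0 moves all traffic to the new version without taking the service down; setting it back to 0.0 is the rollback.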
TensorFlow Lite
https://www.tensorflow.org/mobile/tflite/
Challenges
● Small/Fast model without losing too much performance
● 500 KB models…
Rendezvous Architecture
https://mapr.com/ebooks/machine-learning-logistics/
Monitoring
Challenges
● Understand {...}
● Debug
● Model Quality
● Accuracy
● Training Time
● …
● Overall Architecture
● Availability
● Latencies
● ...
Solutions
● TensorBoard
● Traditional Cluster Monitoring Tool
Monitoring
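Two of the quantities listed above, accuracy and latency, reduce to simple calculations once predictions and timings are collected. A small sketch with made-up sample values; the nearest-rank percentile method is one common choice, not something prescribed by the talk.

```python
import math


def percentile(values, fraction):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


def accuracy(predictions, labels):
    """Share of predictions that match the reference labels."""
    matches = sum(1 for p, y in zip(predictions, labels) if p == y)
    return matches / len(labels)


# Hypothetical measurements, for illustration only.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 18, 13]
p50 = percentile(latencies_ms, 0.50)
p99 = percentile(latencies_ms, 0.99)
model_accuracy = accuracy(["dog", "dog", "panda", "dog"],
                          ["dog", "panda", "panda", "dog"])
```

Tail percentiles matter for serving because a single slow request is invisible in the median but dominates the high percentile.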
Debugging
tfdbg
https://www.tensorflow.org/programmers_guide/debugger
Debugging
tfdbg
- GUI currently alpha
https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md
Profiling
Performance optimization for different devices
- Keep device occupied
Profiling! + Experience!
https://www.tensorflow.org/performance/performance_guide
Platforms
● AWS SageMaker
+ Spark, MXNet, TF
+ Serving/AB
- Cloud Only
● Google Datalab/ML-Engine
+ TF, Keras, Scikit, XGBoost
+ Serving/AB
- Cloud Only
- No control of Docker images
● Kubeflow
+ TF Everywhere
- TF only
● DC/OS
+ Flexibility (all of the above)
+ GPU support
- More manual setup
Demo
1. Explore data using Jupyter notebook
2. Train the model using TensorFlow
3. Monitor training progress using TensorBoard
4. Debug Model using tfdbg
5. Serve Model using TensorFlow Serving
Related Work
● DC/OS TensorFlow
https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
● DC/OS PyTorch
https://mesosphere.com/blog/deep-learning-pytorch-gpus/
● Ted Dunning’s Machine Learning Logistics
https://thenewstack.io/maprs-ted-dunning-intersection-machine-learning-containers/
● Kubeflow
https://github.com/kubeflow/kubeflow
● TensorFlow (+ TensorBoard and Serving)
https://www.tensorflow.org/
Special Thanks to All Collaborators
Ben Wood, Robin Oh, Evan Lezar, Art Rand, Gabriel Hartmann, Chris Lambert, Bo Hu, Sam Pringle, Kevin Klues
Questions and Links
● DC/OS TensorFlow Package (currently closed source)
○ https://github.com/mesosphere/dcos-tensorflow
● DC/OS TensorFlow Tools
○ https://github.com/dcos-labs/dcos-tensorflow-tools/
● Tutorial for deploying TensorFlow on DC/OS
○ https://github.com/dcos/examples/tree/master/tensorflow
● Contact:
○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos
○ Slack: chat.dcos.io #tensorflow

More Related Content

What's hot

Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Databricks
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016MLconf
 
Separating Hype from Reality in Deep Learning with Sameer Farooqui
 Separating Hype from Reality in Deep Learning with Sameer Farooqui Separating Hype from Reality in Deep Learning with Sameer Farooqui
Separating Hype from Reality in Deep Learning with Sameer FarooquiDatabricks
 
Webinar: Déployez facilement Kubernetes & vos containers
Webinar: Déployez facilement Kubernetes & vos containersWebinar: Déployez facilement Kubernetes & vos containers
Webinar: Déployez facilement Kubernetes & vos containersMesosphere Inc.
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersSaliya Ekanayake
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームMasayuki Matsushita
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learningMehdi Shibahara
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Maurice Nsabimana
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...Databricks
 
Operationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer NoriOperationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer NoriDatabricks
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJim Dowling
 
Lessons Learned on Benchmarking Big Data Platforms
Lessons Learned on Benchmarking  Big Data PlatformsLessons Learned on Benchmarking  Big Data Platforms
Lessons Learned on Benchmarking Big Data Platformst_ivanov
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataNicolas Poggi
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableStefan Kupstaitis-Dunkler
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on DockerDataWorks Summit
 

What's hot (20)

Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
Bringing an AI Ecosystem to the Domain Expert and Enterprise AI Developer wit...
 
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
 
Separating Hype from Reality in Deep Learning with Sameer Farooqui
 Separating Hype from Reality in Deep Learning with Sameer Farooqui Separating Hype from Reality in Deep Learning with Sameer Farooqui
Separating Hype from Reality in Deep Learning with Sameer Farooqui
 
Webinar: Déployez facilement Kubernetes & vos containers
Webinar: Déployez facilement Kubernetes & vos containersWebinar: Déployez facilement Kubernetes & vos containers
Webinar: Déployez facilement Kubernetes & vos containers
 
Big Data Benchmarking
Big Data BenchmarkingBig Data Benchmarking
Big Data Benchmarking
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
 
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォームPivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
Pivotal Greenplum 次世代マルチクラウド・データ分析プラットフォーム
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learning
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
 
Operationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer NoriOperationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer Nori
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocks
 
Lessons Learned on Benchmarking Big Data Platforms
Lessons Learned on Benchmarking  Big Data PlatformsLessons Learned on Benchmarking  Big Data Platforms
Lessons Learned on Benchmarking Big Data Platforms
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the ImprobableDisaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
Disaster Recovery in the Hadoop Ecosystem: Preparing for the Improbable
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
 

Similar to Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018

TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform Seldon
 
Building ML Pipelines with DCOS
Building ML Pipelines with DCOSBuilding ML Pipelines with DCOS
Building ML Pipelines with DCOSQAware GmbH
 
Distributed deep learning reference architecture v3.2l
Distributed deep learning reference architecture v3.2lDistributed deep learning reference architecture v3.2l
Distributed deep learning reference architecture v3.2lGanesan Narayanasamy
 
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde..."Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...Edge AI and Vision Alliance
 
Running Distributed TensorFlow with GPUs on Mesos with DC/OS
Running Distributed TensorFlow with GPUs on Mesos with DC/OS Running Distributed TensorFlow with GPUs on Mesos with DC/OS
Running Distributed TensorFlow with GPUs on Mesos with DC/OS Mesosphere Inc.
 
Practical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Practical Artificial Intelligence: Deep Learning Beyond Cats and CarsPractical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Practical Artificial Intelligence: Deep Learning Beyond Cats and CarsAlexey Rybakov
 
Machine Learning Models: From Research to Production 6.13.18
Machine Learning Models: From Research to Production 6.13.18Machine Learning Models: From Research to Production 6.13.18
Machine Learning Models: From Research to Production 6.13.18Cloudera, Inc.
 
Deep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika SinghDeep Learning Frameworks Using Spark on YARN by Vartika Singh
Deep Learning Frameworks Using Spark on YARN by Vartika SinghData Con LA
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity ManagementEDB
 
Data Con LA 2018 - Towards Data Science Engineering Principles by Joerg Schad
Data Con LA 2018 - Towards Data Science Engineering Principles by Joerg SchadData Con LA 2018 - Towards Data Science Engineering Principles by Joerg Schad
Data Con LA 2018 - Towards Data Science Engineering Principles by Joerg SchadData Con LA
 
Downtime is not an option - day 2 operations - Jörg Schad
Downtime is not an option - day 2 operations -  Jörg SchadDowntime is not an option - day 2 operations -  Jörg Schad
Downtime is not an option - day 2 operations - Jörg SchadCodemotion
 
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...Codemotion
 
Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017Clarisse Hedglin
 
Curated "Cloud Design Patterns" for Call Center Platforms
Curated "Cloud Design Patterns" for Call Center PlatformsCurated "Cloud Design Patterns" for Call Center Platforms
Curated "Cloud Design Patterns" for Call Center PlatformsAlejandro Rios Peña
 
Operating Flink on Mesos at Scale
Operating Flink on Mesos at ScaleOperating Flink on Mesos at Scale
Operating Flink on Mesos at ScaleBiswajit Das
 
Large Model support and Distribute deep learning
Large Model support and Distribute deep learningLarge Model support and Distribute deep learning
Large Model support and Distribute deep learningGanesan Narayanasamy
 
Scaling Data Science on Big Data
Scaling Data Science on Big DataScaling Data Science on Big Data
Scaling Data Science on Big DataDataWorks Summit
 
Enterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaEnterprise Metadata Integration, Cloudera
Enterprise Metadata Integration, ClouderaNeo4j
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSWJason Hubbard
 

Similar to Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018 (20)

TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
Building ML Pipelines with DCOS
Building ML Pipelines with DCOSBuilding ML Pipelines with DCOS
Building ML Pipelines with DCOS
 
Distributed deep learning reference architecture v3.2l
Distributed deep learning reference architecture v3.2lDistributed deep learning reference architecture v3.2l
Distributed deep learning reference architecture v3.2l
 
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde..."Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
 
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018

  • 1. Deep learning beyond the learning @joerg_schad @dcos
  • 2. Jörg Schad, Technical Community Lead / Developer, Deep Learning
       ● Core Mesos developer at Mesosphere
       ● Twitter: @joerg_schad
  • 3. © 2018 Mesosphere, Inc. All Rights Reserved. Deep Learning: The Promise
  • 4. Deep Learning: The Process
       ● Step 1: Training (in data center, over hours/days/weeks)
         ○ Input: lots of labeled data (for example, images labeled "Dog")
         ○ Deep neural network model
         ○ Output: trained model
       ● Step 2: Inference (endpoint or data center, instantaneous)
         ○ Input: new input from camera or sensor
         ○ Trained model
         ○ Output: classification (97% Dog, 3% Panda)
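The "97% Dog, 3% Panda" output on this slide is a probability distribution over classes. The slide does not say how the scores are produced; a minimal sketch, assuming the common softmax output layer and hypothetical raw scores chosen to reproduce the slide's numbers:

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores (logits); chosen only so the result matches the slide.
labels = ["Dog", "Panda"]
logits = [math.log(97.0), math.log(3.0)]

probs = softmax(logits)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.0%}")
```

Inference is cheap compared with training because it is a single forward pass followed by this kind of normalization.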
  • 5. Deep Learning: Some insight
  • 6. Deep Learning: The Challenges
  • 7.
       1. Explore data using Jupyter notebook
       2. Train the model using TensorFlow
       3. Monitor training progress using TensorBoard
       4. Debug model using tfdbg
       5. Serve model using TensorFlow Serving
  • 8. Cloud Pipeline
       1. Data preparation using Spark
       2. Explore data using Jupyter notebook
       3. Train the model using TensorFlow
       4. Monitor training progress using TensorBoard
       5. Debug model using tfdbg
       6. Serve model using TensorFlow Serving
       7. Streaming of requests
       ...
  • 9. Open Source Pipeline
       1. Data preparation using Spark
       2. Explore data using Jupyter notebook
       3. Train the model using TensorFlow
       4. Monitor training progress using TensorBoard
       5. Debug model using tfdbg
       6. Serve model using TensorFlow Serving
       7. Kafka stream of requests
       Kubeflow
  • 10. Deep Learning Pipeline (diagram)
        ● Stages: Data & Streaming; Users; Frameworks & Cluster; Models
        ● Labels: Distributed Data Storage and Streaming; Data Preparation and Analysis; Deep Learning Tools and Distributed Hosting; Model Serving; Building Machine Learning Model; Sending Model to Clients
        ● Across all stages: Monitoring & Operations
  • 11. © 2017 Mesosphere, Inc. All Rights Reserved. Training Challenges
        Step 1: Training (in data center, over hours/days/weeks). Input: lots of labeled data → deep neural network model → Output: trained model
        ● Compute intensive
          ○ (Hopefully) large datasets
            ■ Train
            ■ Dev
            ■ Test
          ○ Hyperparameters
            ■ Number of layers
            ■ Number of units per layer
            ■ Learning rate
            ■ …
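To see why hyperparameters make training compute intensive, it helps to count the combinations. The following is a minimal sketch of an exhaustive grid over the three hyperparameters named on the slide; the candidate values are hypothetical and not taken from the talk.

```python
from itertools import product

# Hypothetical search space; the values are illustrative only.
search_space = {
    "num_layers": [2, 4, 8],
    "units_per_layer": [64, 128, 256],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def grid(space):
    """Yield one configuration dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid(search_space))
# Each configuration is a full training run on the training set.
print(len(configs), "training runs")
```

Three values for each of three hyperparameters already means 27 full training runs, and every additional hyperparameter multiplies that number.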
  • 12. Data Management (pipeline diagram repeated from slide 10)
  • 13. Input Data Management (input: lots of labeled data)
        Challenges
        ● Training/Dev/Test + new data
        ● Large amounts
        ● Quality
        ● Availability (for cluster)
        ● Velocity
        ● Streaming
        Solutions
        ● GFS
        ● Apache Kafka
        ● Apache Cassandra
  • 14. Data Preparation
        Challenges
        ● Data is typically not ready to be consumed by an ML job*
        ● Data cleaning
        ● Missing/incorrect labels
        ● Data preparation
        ● Same format
        ● Same distribution
        Solutions
        * Demo datasets are a fortunate exception :)
  • 15. Users (pipeline diagram repeated from slide 10)
  • 16-17. Users
        Challenges
        ● Different users/use cases
        ● Data analyst/exploring
        ● Production workloads
        ● Highly optimized
        ● How to spawn environments?
        Solutions
  • 18. Frameworks (pipeline diagram repeated from slide 10)
  • 20. What is TensorFlow?
        “An open-source software library for Machine Intelligence” - tensorflow.org
        ● Machine Intelligence is the broad term used to describe techniques allowing computers to “learn” by analyzing very large data sets using artificial neural networks
  • 21. What is TensorFlow?
        “An open-source software library for Machine Intelligence” - tensorflow.org
        ● TensorFlow is a software library that makes it easy for developers to construct artificial neural networks to analyze their data of interest
        ● Diagram labels: TensorFlow Library; Python; Dataflow Executor, Compute Kernel Implementations, Networking, etc.; GPUs; CPUs
  • 23. Alternatives
  • 24. Alternatives: eager execution, enabled with tf.enable_eager_execution() (https://www.tensorflow.org/get_started/eager)
  • 25. Data Analytics Ecosystem
  • 26. APIs
  • 27. Deep Learning Frameworks
        Challenges
        ● Different frameworks
        ● No one rules them all
        Solutions
        ● Pick the right tool
        ● PMML if needed
  • 28. Cluster (pipeline diagram repeated from slide 10)
  • 29. Typical Developer Workflow for TensorFlow (Single-Node)
        ● Download and install the Python TensorFlow library
        ● Design your model in terms of TensorFlow’s basic machine learning primitives
        ● Write your code, optimized for single-node performance
        ● Train on your data on a single node → Output: trained model
        Diagram: input data set → trained model
  • 30. Typical Developer Workflow for TensorFlow (Distributed)
        ● …
        ● Provision a set of machines to run your computation
        ● Install TensorFlow on them
        ● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed
        ● Deploy your code on every machine
        ● Train on your data on the cluster → Output: trained model
        Diagram: input data set → trained model
  • 31. Typical Developer Workflow for TensorFlow (Distributed)
        ● Download and install the Python TensorFlow library
        ● Design your model in terms of TensorFlow’s basic machine learning primitives
        ● Write your code, optimized for distributed computation
        ● …
  • 32. Resource Isolation and Allocation
  • 33. TPU
  • 34. TPUs
  • 35. Datacenter
        ● Typical datacenter: siloed, over-provisioned servers, low utilization
        ● Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines
        ● Workloads shown: TensorFlow, Jenkins, Kafka, Spark
  • 36. Datacenter and Cloud as a Single Computing Resource, Powered by Apache Mesos
        ● Workloads: microservices, containers, & dev tools; data services, machine learning, & AI (100+ more, 20+ more)
        ● Capabilities: Security & Compliance; Application-Aware Automation; Multitenancy; Hybrid Cloud Management
        ● Infrastructure: physical infrastructure; virtual machines; public clouds
        ● Locations: datacenter, edge
  • 37. Challenges running distributed TensorFlow*
        ● Dealing with failures is not graceful
          ○ Users need to stop training, change their hard-coded ClusterSpec, and manually restart their jobs
        * Any distributed system
  • 38. Deploy, Scale, Configure, Recover, 3 AM ...
        ● Typical datacenter: siloed, over-provisioned servers, low utilization
        ● Services shown: HDFS, Kafka, Kubernetes, Flink, TensorFlow
  • 39. Mesos Architecture: Two-level Scheduling
        1. Agents advertise resources to Master
        2. Master offers resources to Framework
        3. Framework rejects / uses resources
        4. Agent reports task status to Master
        Diagram labels: three Mesos Masters; Mesos Agents running Cassandra, Spark, Docker, and CDB executors with their tasks; Flink, Spark, and Kafka schedulers
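The four steps of two-level scheduling can be sketched as a toy simulation. This is not the Mesos API: the `Agent`, `Framework`, and `Master` classes, the CPU-only resource model, and the one-task-per-offer rule are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    cpus: float

@dataclass
class Framework:
    name: str
    cpus_per_task: float
    pending_tasks: int
    launched: list = field(default_factory=list)

    def consider(self, agent_name, cpus):
        """Step 3: use the offer if one task fits, otherwise reject it."""
        if self.pending_tasks > 0 and cpus >= self.cpus_per_task:
            self.pending_tasks -= 1
            self.launched.append(agent_name)
            return self.cpus_per_task  # amount of the offer that was used
        return 0.0                     # offer rejected

class Master:
    def __init__(self):
        self.available = {}   # agent name -> free CPUs
        self.statuses = []    # (agent, framework, state) reports

    def advertise(self, agent):
        """Step 1: an agent advertises its resources to the master."""
        self.available[agent.name] = agent.cpus

    def offer_round(self, frameworks):
        """Step 2: offer each agent's free resources to each framework."""
        for fw in frameworks:
            for agent_name, cpus in list(self.available.items()):
                used = fw.consider(agent_name, cpus)
                if used:
                    self.available[agent_name] = cpus - used
                    # Step 4: the agent reports task status back to the master.
                    self.statuses.append((agent_name, fw.name, "TASK_RUNNING"))

master = Master()
master.advertise(Agent("agent-1", cpus=4.0))
master.advertise(Agent("agent-2", cpus=2.0))

tensorflow = Framework("tensorflow", cpus_per_task=2.0, pending_tasks=2)
spark = Framework("spark", cpus_per_task=1.0, pending_tasks=3)
master.offer_round([tensorflow, spark])

print(tensorflow.launched, spark.launched, master.available)
```

The point of the split is that the master only decides who is offered which resources, while each framework's scheduler decides what to run on them. That is how TensorFlow, Spark, and Kafka can share the same machines.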
  • 40. Challenges running distributed TensorFlow
        ● Hard-coding a “ClusterSpec” is incredibly tedious
          ○ Users need to rewrite code for every job they want to run in a distributed setting
          ○ True even for code they “inherit” from standard models
        tf.train.ClusterSpec({
            "worker": [
                "worker0.example.com:2222",
                "worker1.example.com:2222",
                "worker2.example.com:2222",
                "worker3.example.com:2222",
                "worker4.example.com:2222",
                "worker5.example.com:2222",
                ...
            ],
            "ps": [
                "ps0.example.com:2222",
                "ps1.example.com:2222",
                "ps2.example.com:2222",
                "ps3.example.com:2222",
                ...
            ]})
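One way to reduce the tedium, short of a cluster manager, is to generate the host lists instead of typing them. This is a minimal sketch: `build_cluster` is a hypothetical helper, and the host naming scheme simply mirrors the example addresses on the slide. It still assumes that machines with exactly these names exist, which is the part a scheduler would otherwise have to solve.

```python
def build_cluster(num_workers, num_ps, domain="example.com", port=2222):
    """Return a job-name -> address-list mapping in the shape ClusterSpec expects."""
    return {
        "worker": [f"worker{i}.{domain}:{port}" for i in range(num_workers)],
        "ps": [f"ps{i}.{domain}:{port}" for i in range(num_ps)],
    }

cluster = build_cluster(num_workers=6, num_ps=4)
print(cluster["worker"][0], cluster["ps"][-1])

# With TensorFlow 1.x installed, the dict can be passed on unchanged:
#   import tensorflow as tf
#   spec = tf.train.ClusterSpec(cluster)
```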
  • 41. Challenges running distributed TensorFlow ● Manually configuring each node in a cluster takes a long time and is error-prone ○ Setting up access to a shared file system (for checkpoint and summary files) requires authenticating on each node ○ Tweaking hyper-parameters requires re-uploading code to every node
  • 42. Typical Developer Workflow for TensorFlow (Distributed) ● … ● Provision a set of machines to run your computation ● Install TensorFlow on them ● Write code to map distributed computations to the exact IP of the machine where those computations will be performed ● Deploy your code on every machine ● Train your model on the cluster: Input Data Set → Output Trained Model
  • 43. Running distributed TensorFlow on DC/OS ● We use the dcos-commons SDK to dynamically create the ClusterSpec { "service": { "name": "mnist", "job_url": "...", "job_context": "..." }, "gpu_worker": {... }, "worker": {... }, "ps": {... } } → tf.train.ClusterSpec({ "worker": [ "worker0.example.com:2222", "worker1.example.com:2222", "worker2.example.com:2222", "worker3.example.com:2222", "worker4.example.com:2222", "worker5.example.com:2222", ... ], "ps": [ "ps0.example.com:2222", "ps1.example.com:2222", "ps2.example.com:2222", "ps3.example.com:2222", ... ]})
  • 44. Running distributed TensorFlow on DC/OS ● Wrapper script to abstract away distributed TensorFlow configuration ○ Separates “deployer” responsibilities from “developer” responsibilities { "service": { "name": "mnist", "job_url": "...", "job_context": "..." }, "gpu_worker": {... }, "worker": {... }, "ps": {... } } [Diagram: User Code, Wrapper Script]
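A minimal sketch of the wrapper idea, assuming a simplified options format. It is not the DC/OS TensorFlow package's wrapper: the `count` field, the `cluster_from_options` and `run_user_code` helpers, and the `example.com` host naming are all hypothetical. The deployer supplies the options; the developer's function only ever receives a ready-made cluster dictionary of the shape that `tf.train.ClusterSpec` accepts.

```python
def cluster_from_options(options, domain="example.com", port=2222):
    """Build a ClusterSpec-style dict from deployer-supplied options.

    Assumes each role section carries a hypothetical "count" field.
    """
    cluster = {}
    for role in ("gpu_worker", "worker", "ps"):
        count = options.get(role, {}).get("count", 0)
        if count:
            host = role.replace("_", "-")  # keep generated host names DNS-friendly
            cluster[role] = [f"{host}{i}.{domain}:{port}" for i in range(count)]
    return cluster


def run_user_code(user_main, options, job_name, task_index):
    """Wrapper: resolve the cluster, then hand control to unmodified user code."""
    cluster = cluster_from_options(options)
    return user_main(cluster, job_name, task_index)


# Deployer side: options only, no host names.
options = {"service": {"name": "mnist"}, "worker": {"count": 2}, "ps": {"count": 1}}


# Developer side: receives the cluster dict and would pass it to
# tf.train.ClusterSpec(cluster) in a real TensorFlow 1.x job.
def user_main(cluster, job_name, task_index):
    return sorted(cluster), job_name, task_index


result = run_user_code(user_main, options, job_name="worker", task_index=0)
```

Changing the number of workers then means editing the options, not the training code.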
  • 45. Running distributed TensorFlow on DC/OS ● The dcos-commons SDK cleanly restarts failed tasks and reconnects them to the cluster
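The effect of automatic restarts can be illustrated with a small supervisor loop. This is a generic sketch, not how the dcos-commons SDK is implemented: `train`, `supervise`, the in-memory `state` dictionary standing in for checkpoint files, and the injected failures are all hypothetical.

```python
def train(state, fail_at=None, total_steps=10):
    """Hypothetical training loop that checkpoints its step counter in `state`."""
    for step in range(state["step"], total_steps):
        if step == fail_at:
            raise RuntimeError(f"worker crashed at step {step}")
        state["step"] = step + 1  # checkpoint after each completed step
    return state["step"]


def supervise(state, failures):
    """Restart the task after every failure; it resumes from the last checkpoint."""
    restarts = 0
    pending = list(failures)  # steps at which a crash is injected, one per run
    while True:
        fail_at = pending.pop(0) if pending else None
        try:
            return train(state, fail_at), restarts
        except RuntimeError:
            restarts += 1  # no manual stop / edit / restart by the user


final_step, restarts = supervise({"step": 0}, failures=[3, 7])
```

Because progress is checkpointed, each restart continues where the failed run stopped instead of starting over.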
  • 46. Model Management [Deep Learning Pipeline diagram: Data & Streaming (Distributed Data Storage and Streaming); Users (Data Preparation and Analysis); Frameworks & Cluster (Deep Learning Tools and Distributed Hosting); Models (Building Machine Learning Model); Model Serving (Sending Model to Clients); Monitoring & Operations]
  • 47. Recall ● Step 1: Training (In Data Center - Over Hours/Days/Weeks): Input: Lots of Labeled Data (e.g. Dog) → Deep neural network model → Output: Trained Model ● Step 2: Inference (Endpoint or Data Center - Instantaneous): New Input from Camera or Sensor → Trained Model → Output: Classification (97% Dog, 3% Panda)
  • 48. Many Models ● Step 1: Training (In Data Center - Over Hours/Days/Weeks): Input: Lots of Labeled Data (e.g. Dog) → Deep neural network model → Output: Trained Model
  • 49. Model Management Challenges ● Many Models ● Different Hyperparameters ● Different Models ● New Training Data ● ... Solutions ● Persistent Storage + Metadata (GFS)
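"Persistent Storage + Metadata" can be as simple as writing a metadata record next to every stored model. The sketch below is one possible layout, not a tool from the talk: the directory structure, the `register_model` and `find_models` helpers, and the metadata fields are assumptions chosen for illustration, with a local directory standing in for a distributed file system.

```python
import hashlib
import json
from pathlib import Path


def register_model(store_dir, name, version, weights, hyperparameters, training_data):
    """Store model bytes under <store>/<name>/<version>/ with a metadata file."""
    model_dir = Path(store_dir) / name / str(version)
    model_dir.mkdir(parents=True, exist_ok=False)  # never overwrite a version
    (model_dir / "model.bin").write_bytes(weights)
    metadata = {
        "name": name,
        "version": version,
        "hyperparameters": hyperparameters,
        "training_data": training_data,
        "sha256": hashlib.sha256(weights).hexdigest(),
    }
    (model_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return metadata


def find_models(store_dir, **wanted):
    """Return metadata of stored models whose hyperparameters match `wanted`."""
    matches = []
    for path in sorted(Path(store_dir).glob("*/*/metadata.json")):
        meta = json.loads(path.read_text())
        if all(meta["hyperparameters"].get(k) == v for k, v in wanted.items()):
            matches.append(meta)
    return matches
```

With the hyperparameters and training data recorded per version, questions such as "which model was trained with this learning rate?" become a lookup rather than guesswork.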
  • 50. TensorFlow Hub https://www.tensorflow.org/hub/
  • 51. Serving [Deep Learning Pipeline diagram: Data & Streaming (Distributed Data Storage and Streaming); Users (Data Preparation and Analysis); Frameworks & Cluster (Deep Learning Tools and Distributed Hosting); Models (Building Machine Learning Model); Model Serving (Sending Model to Clients); Monitoring & Operations]
  • 52. Model Serving Challenges ● How to deploy models? ● Zero downtime ● Canary deployments Solutions ● TensorFlow Serving
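A canary rollout sends a small share of requests to the new model while the stable model keeps serving the rest. The following is a minimal routing sketch under stated assumptions, not TensorFlow Serving functionality: `make_canary_router` is a hypothetical helper and the two "models" are plain Python callables.

```python
import random


def make_canary_router(stable, canary, canary_fraction, rng=None):
    """Return a function that routes roughly `canary_fraction` of requests to `canary`."""
    rng = rng or random.Random()

    def route(request):
        # rng.random() is in [0.0, 1.0): fraction 0.0 never picks the canary,
        # fraction 1.0 always does.
        model = canary if rng.random() < canary_fraction else stable
        return model(request)

    return route


# Hypothetical models: each returns its version together with the request.
def stable_model(request):
    return ("v1", request)


def canary_model(request):
    return ("v2", request)


route = make_canary_router(stable_model, canary_model, canary_fraction=0.1,
                           rng=random.Random(0))
versions = [route(i)[0] for i in range(1000)]
```

Raising `canary_fraction` step by step, and dropping it back to 0.0 if the new model misbehaves, gives a rollout without downtime.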
  • 53. TensorFlow Lite https://www.tensorflow.org/mobile/tflite/ Challenges ● Small/fast model without losing too much performance ● 500 KB models ...
  • 54. Rendezvous Architecture https://mapr.com/ebooks/machine-learning-logistics/
  • 55. Monitoring [Deep Learning Pipeline diagram: Data & Streaming (Distributed Data Storage and Streaming); Users (Data Preparation and Analysis); Frameworks & Cluster (Deep Learning Tools and Distributed Hosting); Models (Building Machine Learning Model); Model Serving (Sending Model to Clients); Monitoring & Operations]
  • 56. Monitoring Challenges ● Understand {...} ● Debug ● Model Quality ○ Accuracy ○ Training Time ○ … ● Overall Architecture ○ Availability ○ Latencies ○ ... Solutions ● TensorBoard ● Traditional Cluster Monitoring Tool
  • 57. Debugging: tfdbg https://www.tensorflow.org/programmers_guide/debugger
  • 58. Debugging: tfdbg GUI (currently alpha) https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md
  • 59. Profiling ● Performance optimization for different devices ○ Keep device occupied ● Profiling! + Experience! https://www.tensorflow.org/performance/performance_guide
  • 60. Platforms ● AWS SageMaker + Spark, MXNet, TF + Serving/AB - Cloud Only ● Google Datalab/ML-Engine + TF, Keras, Scikit, XGBoost + Serving/AB - Cloud Only - No control of Docker images ● Kubeflow + TF Everywhere - TF only ● DC/OS + Flexibility (all of the above) + GPU support - More manual setup
  • 61. Demo 1. Explore data using Jupyter notebook 2. Train the model using TensorFlow 3. Monitor training progress using TensorBoard 4. Debug Model using tfdbg 5. Serve Model using TensorFlow Serving
  • 62. Related Work ● DC/OS TensorFlow https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/ ● DC/OS PyTorch https://mesosphere.com/blog/deep-learning-pytorch-gpus/ ● Ted Dunning’s Machine Learning Logistics https://thenewstack.io/maprs-ted-dunning-intersection-machine-learning-containers/ ● Kubeflow https://github.com/kubeflow/kubeflow ● TensorFlow (+ TensorBoard and Serving) https://www.tensorflow.org/
  • 63. Special Thanks to All Collaborators: Ben Wood, Robin Oh, Evan Lezar, Art Rand, Gabriel Hartmann, Chris Lambert, Bo Hu, Sam Pringle, Kevin Klues
  • 64. Questions and Links ● DC/OS TensorFlow Package (currently closed source) ○ https://github.com/mesosphere/dcos-tensorflow ● DC/OS TensorFlow Tools ○ https://github.com/dcos-labs/dcos-tensorflow-tools/ ● Tutorial for deploying TensorFlow on DC/OS ○ https://github.com/dcos/examples/tree/master/tensorflow ● Contact: ○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos ○ Slack: chat.dcos.io #tensorflow