SlideShare una empresa de Scribd logo
1 de 40
Descargar para leer sin conexión
TensorFlow 深度学习平台
AI
Machine Learning
Deep Learning
Hadoop? Spark?
MapReduce
Hadoop & Spark
Data
Executor Executor Executor Executor Executor
Result
map
reduce
MapReduce
Hadoop & Spark
“In parallel computing, an embarrassingly parallel workload or problem
(also called perfectly parallel or pleasingly parallel) is one where little or
no effort is needed to separate the problem into a number of parallel
tasks.”
—— Wikipedia
https://en.wikipedia.org/wiki/Embarrassingly_parallel
Spark MLlib
What about
TensorFlow?
TensorFlow Graph
Training Data Validation Data Test Data
Model Serving request
train
(cpu/gpu)
✔
✘
Graph
(DAG)
How does TensorFlow work
Python
Golang
Java
…
Graph
(DAG)
C++ Core
compile run
Preprocess
Data
Executor Executor Executor Executor Executor
Result
map
reduce
Hive, Spark, Storm, …
Distributed Storage
NFS, HDFS, S3, …
Training
Training Data Validation Data Test Data
Model Serving request
train
(cpu/gpu)
✔
✘
TensorFlow
(Distributed Training)
• TaaS (TensorFlow as a Service)
• 开始于 2016 年年 8 ⽉月底
• 受到 Google Cloud 的 CloudML 产品启发
• 让算法⼯工程师可以专注于算法,其它的事情交给 elearn
分布式存储
CPU 弹性需求 service 的 IP、Port 管理理
⼤大量量 container 的⽣生命周期 API
GPU
Distributed
Storage
Datastore
• Abstraction of data
• Compatible with many types of distributed storage
• Decoupling computing and storage
• Bring your own storage
Datastore
Client 缺点 应⽤用场景
NFS nfs-utils 性能问题、权限管理理
存放训练数据
模型训练
S3
compatible
s3fs-fuse
⽆无法创建空⽬目录
mkdir
存放训练数据
MinFS ⽆无法创建 symlink 存放训练数据
GlusterFS - - -
Datastore Example
S3
NFS
GPU
in container
host GPU
container GPU
GPU with Docker
• GPU 内存使⽤用有讲究

https://www.tensorflow.org/tutorials/using_gpu
• GPU Docker images

https://hub.docker.com/r/nvidia/cuda/
• nvidia-docker 不不是必须

CUDA_SO, NVIDIA_SO, DEVICES, …
GPU with Kubernetes
--feature-gates="Accelerators=true"
Distributed
Training
TensorFlow 的上⼀一代产品
DistBelief
• 解决了了数据量量 > GPU 最⼤大内存的问题
• paper 中提到最⼤大的 model,达到了了

1.7 billion parameters, utilizing 81 machines, delivering a 12x speedup.
• 它的缺点:
• 擅⻓长图像识别,但对其它机器器学习 model 适⽤用性差,⽽而且不不⽀支持 mobile
• 维护⼤大规模系统成为负担,缺乏抽象。
https://research.google.com/pubs/pub40565.html
TensorFlow Ecosystem: Integrating TensorFlow with Your Infrastructure — Derek Murray, Jonathan Hseu
TensorFlow Ecosystem: Integrating TensorFlow with Your Infrastructure — Derek Murray, Jonathan Hseu
👋⼿手动管理理的资源
• IP or DNS
• port
• task ID
• CPU/GPU ID
😫抓狂的节奏
登录 10 台服务器器
mount 10 遍共享存储
⼩小⼼心翼翼安排 GPU 资源
仔细设计 10 条启动命令
敲 10 遍这不不⼀一样的启动命令
⼿手动启动⼀一个 TensorBoard
⼿手动创建⽬目录
mkdir + cp ⼿手动管理理模型版本
……
天呐!这才训练一次
以后可怎么每周训练、天天训练啊?!
TensorFlow (GPU/CPU) cluster benchmark
Total: 6 GPUs
master worker
2 GPUs x 1 1 GPU x 4
Total: 9 GPUs
master worker
2 GPUs x 1 1 GPU x 7
Total: 3 GPUs
master worker
2 GPUs x 1 1 GPU x 1
11
global steps/s
28
global steps/s
45
global steps/s
Total: 9 CPUs
master worker
2 CPUs x 1 1 CPU x 7
5
global steps/s
Powered by
Model
Management
Serving+
Cluster
(Monthly Training)
model - 2017-07-23
model - 2017-07-23
model - 2017-07-16
model - 2017-07-09
model - 2017-07-02
Recommendation Model
Copy
GRPC
Online Serving
Use TensorFlow
Golang/Java Binding to
Load the Model
Dow
nload
System Datastore
User Defined
Datastore
Demo
Thinking of
为 TensorFlow 量量身打造
如果让 elearn ⽀支持其它 deep learning 框架,太容易易了了,

只要写 driver 就可以了了,但是我不不会去这样做的。
尽量量发挥 Kubernetes 的特⾊色功能和编排能⼒力力

Deployment, Job, StatefulSet, ConfigMap, Nginx Ingress, …
如果让 elearn ⽀支持其它 Cloud,太容易易了了,只要写 interface 够通⽤用,
然后写 driver 就可以了了,但会失去 Kubernetes 本身的意义,
所以 elearn cloud interfaces 的设计是⾯面向功能的 。
让 Kubernetes 变成 OS

所有⼩小的操作都以 Job 形式下发
⽐比如训练完成后的 post run、保存 modelVersion,这些任务都是 Kubernetes Job,
并不不在 elearn server 上运⾏行行。
I’m 江骏 / ohmystack
@饿了了么
https://github.com/ohmystack
http://weibo.com/jiangjun1990
http://ohmystack.com/
推荐项⽬目(点击链接前往)
dt (docker-tool)
最爽功能:dt ssh
gotool
管理理 Golang 开发环境 GOPATH 的利利器器
欢迎关注饿了了么技术社区

Más contenido relacionado

La actualidad más candente

A One-Stop Solution for Puppet and OpenStack
A One-Stop Solution for Puppet and OpenStackA One-Stop Solution for Puppet and OpenStack
A One-Stop Solution for Puppet and OpenStack
Puppet
 
Open stack china_201109_sjtu_jinyh
Open stack china_201109_sjtu_jinyhOpen stack china_201109_sjtu_jinyh
Open stack china_201109_sjtu_jinyh
OpenCity Community
 
CMPS 494 Presentation [Cloud Computing]
CMPS 494 Presentation [Cloud Computing]CMPS 494 Presentation [Cloud Computing]
CMPS 494 Presentation [Cloud Computing]
Travis McAdams
 

La actualidad más candente (20)

How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUsHow to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
Cloud computing and bioinformatics
Cloud computing and bioinformaticsCloud computing and bioinformatics
Cloud computing and bioinformatics
 
Bioinformatics Analysis Environment for Your Laboratory Use
Bioinformatics Analysis Environment for Your Laboratory UseBioinformatics Analysis Environment for Your Laboratory Use
Bioinformatics Analysis Environment for Your Laboratory Use
 
themidgame-tube-slides
themidgame-tube-slidesthemidgame-tube-slides
themidgame-tube-slides
 
A One-Stop Solution for Puppet and OpenStack
A One-Stop Solution for Puppet and OpenStackA One-Stop Solution for Puppet and OpenStack
A One-Stop Solution for Puppet and OpenStack
 
Open stack china_201109_sjtu_jinyh
Open stack china_201109_sjtu_jinyhOpen stack china_201109_sjtu_jinyh
Open stack china_201109_sjtu_jinyh
 
Local testing of Containerized Distributed Systems
Local testing of Containerized Distributed SystemsLocal testing of Containerized Distributed Systems
Local testing of Containerized Distributed Systems
 
Deploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and KubernetesDeploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and Kubernetes
 
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
 
Hadoop on OpenStack
Hadoop on OpenStackHadoop on OpenStack
Hadoop on OpenStack
 
State of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigDataState of Containers and the Convergence of HPC and BigData
State of Containers and the Convergence of HPC and BigData
 
HTrace: Tracing in HBase and HDFS (HBase Meetup)
HTrace: Tracing in HBase and HDFS (HBase Meetup)HTrace: Tracing in HBase and HDFS (HBase Meetup)
HTrace: Tracing in HBase and HDFS (HBase Meetup)
 
OpenNebula TechDay Boston 2015 - HA HPC with OpenNebula
OpenNebula TechDay Boston 2015 - HA HPC with OpenNebulaOpenNebula TechDay Boston 2015 - HA HPC with OpenNebula
OpenNebula TechDay Boston 2015 - HA HPC with OpenNebula
 
やっとでた! OpenStack Manila
やっとでた! OpenStack Manilaやっとでた! OpenStack Manila
やっとでた! OpenStack Manila
 
CMPS 494 Presentation [Cloud Computing]
CMPS 494 Presentation [Cloud Computing]CMPS 494 Presentation [Cloud Computing]
CMPS 494 Presentation [Cloud Computing]
 
k8sjp#9 KubeCon - Service Mesh, ML/DL on k8s
k8sjp#9 KubeCon - Service Mesh, ML/DL on k8sk8sjp#9 KubeCon - Service Mesh, ML/DL on k8s
k8sjp#9 KubeCon - Service Mesh, ML/DL on k8s
 
Thoughts on Cybersecurity
Thoughts on CybersecurityThoughts on Cybersecurity
Thoughts on Cybersecurity
 
HPC in the Cloud
HPC in the CloudHPC in the Cloud
HPC in the Cloud
 
Apps software development with Vert.X
Apps software development with Vert.XApps software development with Vert.X
Apps software development with Vert.X
 

Similar a 饿了么 TensorFlow 深度学习平台:elearn

High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
Chris Fregly
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
Chris Fregly
 

Similar a 饿了么 TensorFlow 深度学习平台:elearn (20)

TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
 
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
Optimizing, Profiling, and Deploying TensorFlow AI Models in Production with ...
 
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
Building Google Cloud ML Engine From Scratch on AWS with PipelineAI - ODSC Lo...
 
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on G...
 
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
High Performance Distributed TensorFlow with GPUs - NYC Workshop - July 9 2017
 
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
High Performance Distributed TensorFlow with GPUs - Nvidia GPU Tech Conferenc...
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...
 
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
High Performance TensorFlow in Production -- Sydney ML / AI Train Workshop @ ...
 
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...Nvidia GPU Tech Conference -  Optimizing, Profiling, and Deploying TensorFlow...
Nvidia GPU Tech Conference - Optimizing, Profiling, and Deploying TensorFlow...
 
Apache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning PlatformApache Submarine: Unified Machine Learning Platform
Apache Submarine: Unified Machine Learning Platform
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUsOptimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
 
Beat the devil: towards a Drupal performance benchmark
Beat the devil: towards a Drupal performance benchmarkBeat the devil: towards a Drupal performance benchmark
Beat the devil: towards a Drupal performance benchmark
 
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020GPU Accelerated Data Science with RAPIDS - ODSC West 2020
GPU Accelerated Data Science with RAPIDS - ODSC West 2020
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
The Flow of TensorFlow
The Flow of TensorFlowThe Flow of TensorFlow
The Flow of TensorFlow
 
End-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache SparkEnd-to-End Deep Learning with Horovod on Apache Spark
End-to-End Deep Learning with Horovod on Apache Spark
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

饿了么 TensorFlow 深度学习平台:elearn