Distributed TensorFlow on Hops (PAPIs London, April 2018)
1. Techniques for Distributed TensorFlow on Hops
Jim Dowling
CEO, Logical Clocks AB
Assoc Prof, KTH Stockholm
Senior Researcher, RISE SICS
@jim_dowling
PAPIs Europe 2018
9. Parallel Experiments to Find Better Models
• Model Architecture Search*
- Explore on smaller datasets, then scale to larger datasets => enables more searches
• SOTA on CIFAR-10 (2.13% top-1 error)
• SOTA on ImageNet (3.8% top-5 error)
- 450 GPUs / 7 days
- 900 TPUs / 5 days
*https://arxiv.org/abs/1802.01548
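The paper behind these numbers uses regularized evolution; as a toy illustration only (the (layers, width) encoding and the stand-in fitness function below are invented for this sketch, not taken from the paper or from Hops), a parallel experiment amounts to scoring many candidate architectures concurrently on a small proxy dataset and keeping the best:

    import random
    from concurrent.futures import ProcessPoolExecutor

    def fitness(candidate):
        """Stand-in for training `candidate` on a small proxy dataset
        (e.g. a CIFAR-10 subset) and returning its validation score."""
        layers, width = candidate
        return -abs(layers - 12) - abs(width - 64) / 16.0 + random.random()

    def random_candidate():
        # A candidate architecture, encoded here as (num_layers, layer_width).
        return (random.randint(2, 20), random.choice([16, 32, 64, 128]))

    if __name__ == '__main__':
        population = [random_candidate() for _ in range(32)]
        # Each evaluation is independent, so candidates can be scored in
        # parallel -- one process here, one GPU/worker on a real cluster.
        with ProcessPoolExecutor() as pool:
            scores = list(pool.map(fitness, population))
        best_score, best_arch = max(zip(scores, population))
        print('best architecture:', best_arch, 'score:', best_score)

Exploring on the small proxy dataset first is what makes many such searches affordable; only the winning candidates are then retrained at full scale.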
26. Why not Kubeflow?
•Operational Reasons
-No Integrated Enterprise Security Framework
• Encryption-in-Transit, Encryption-at-Rest
-Stateful services not designed for Kubernetes
• Distributed Storage, Kafka, Databases
•Usability Reasons
-Not a Fully Managed Platform
• Write YAML files and restart just to install a new Python library
-Slow startup times for applications/notebooks
33. Hops API
•Python (also Java/Scala)
-Manage TensorBoard, load/save models in HDFS
-Horovod, TensorFlowOnSpark
-Parallel experiments (see the sketch below)
• Grid search
• Model Architecture Search with Genetic Algorithms
-Secure Streaming Analytics with Kafka/Spark/Flink
• SSL/TLS certs, Avro schemas, endpoints for Kafka/ZooKeeper/etc.
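A minimal sketch of the parallel-experiments part of this API, modeled on the hops-util-py examples Logical Clocks published around this time (the module names, the launch() signature, the 'Models/mnist' save path, and the notebook-provided spark session are assumptions and may differ between library versions):

    from hops import experiment, tensorboard, hdfs

    def train_fn(learning_rate, dropout):
        # Runs once per hyperparameter combination, each on its own Spark
        # executor (with a GPU if one was requested for the job).
        import tensorflow as tf
        logdir = tensorboard.logdir()      # this run's TensorBoard dir in HopsFS
        model_dir = hdfs.project_path() + 'Models/mnist'  # hypothetical save path
        # ... build, train, and evaluate a TensorFlow model here, writing
        # summaries to logdir and checkpoints to model_dir ...
        print('trained with lr=%s, dropout=%s' % (learning_rate, dropout))

    # Grid search: every combination from the Cartesian product of these
    # lists becomes one parallel run (6 runs here).
    args_dict = {'learning_rate': [0.001, 0.005, 0.01],
                 'dropout': [0.5, 0.6]}
    experiment.launch(spark, train_fn, args_dict)  # spark = notebook's SparkSession

Because each run writes to its own log directory under the experiment, the parallel runs can be compared side by side in the TensorBoard that the platform manages.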
[Pipeline figure: Data Acquisition → Clean/Transform Data → Feature Extraction → Experimentation → Training → Test + Serve]
44. Summary
•The future of Deep Learning is Distributed
https://www.oreilly.com/ideas/distributed-tensorflow
•Hops is a new Data Platform with first-class support for
Python / Deep Learning / ML / Data Governance / GPUs
“It is starting to look like deep learning workflows of the future feature autotuned architectures running with autotuned compute schedules across arbitrary backends.”*
- Andrej Karpathy, Head of AI @ Tesla
*https://twitter.com/karpathy/status/972701240017633281