
How to build containerized architectures for deep learning - Data Festival 2019 Munich

When it comes to AI, data scientists and engineers tend to focus on tools. The data platform that enables these tools is equally important, yet it is often overlooked. In fact, 90% of the effort required for success in ML is not the algorithm; it is the data logistics. In this workshop we will talk about common architecture blueprints for integrating AI into your data centers and how the right data platform choice can make all the difference in launching your AI use case into production! Presented at Data Festival Munich, 2019.

Published in: Software

  1. DATAfestival 2019, Munich: How to Build a Containerized Architecture for Deep Learning
  2. © 2018 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
     When it comes to AI, data scientists and engineers tend to focus on tools. The data platform that enables these tools is equally important, yet it is often overlooked. In fact, 90% of the effort required for success in ML is not the algorithm; it is the data logistics. In this workshop we will talk about common architecture blueprints for integrating AI into your data centers and how the right data platform choice can make all the difference in launching your AI use case into production!
  3. Democratization of Artificial Intelligence
     • Improved Data Collection
     • Increased Computing Power
     • Advancement in ML Frameworks
  4. Announcement: You may see Artificial Intelligence, Machine Learning, and Deep Learning used interchangeably within this presentation. Please feel free to mentally substitute the phrase of your choice if it is more applicable to you :)
  5. Artificial Intelligence, Machine Learning, and Deep Learning (diagram: F(x); Deep Learning nested within Machine Learning, nested within Artificial Intelligence)
     • Artificial Intelligence: technique where a computer can mimic human behavior
     • Machine Learning: subset of AI techniques which use algorithms to enable machines to learn from data
     • Deep Learning: subset of ML techniques which use multi-layer neural networks to learn
  6. AI Projects and Inquiries Across All Industries
     • Media and Entertainment: Video Captioning; Content Based Search; NLP, VR and AR
     • Healthcare: Cancer Cell Detection; Drug Discovery; Medical Research
     • Finance: Fraud Detection; Cryptocurrencies; Algorithmic Trading
     • Security and Defense: Face Recognition; Crowd Analytics; Cyber Security
     • Retail: Theft Detection; Auto Checkout; Targeted Marketing
     • Manufacturing: Reduce Product Defects; Increase Production Speed; Shorten Downtime
  7. Expectation vs. Reality
  8. Effort for AI & Machine Learning Has Some Surprises
     • Source: “The Surprising Truth About What it Takes to Build a Machine Learning Product” by Josh Cogan, Tech Lead and Manager in the Cloud AI group at Google, Jan 2019, https://medium.com/thelaunchpad/the-ml-surprise-f54706361a6c
  9. 90% of the effort in successful machine learning isn’t in the training or model development… It’s the logistics!
  10. What You Want To Be Doing
      • Get Data
      • Write intelligent machine learning code for your app
      • Train Model
      • Run Model
      • Repeat
  11. What You End Up Doing
      • “Only a small fraction of real-world ML systems is composed of the ML code. The required surrounding infrastructure is vast and complex.” Source: Sculley, D., Holt, G., Golovin, D. et al., Hidden Technical Debt in Machine Learning Systems
      • 90+% of Machine Learning Success Depends On Data Logistics! https://mapr.com/ebook/machine-learning-logistics
  12. Why?
      • Just getting the training data is hard:
          • Which data? How to make it accessible? Multiple sources!
          • New kinds of observations force restarts
          • Requires a ton of domain knowledge
      • The myth of a single model:
          • You cannot train just one
          • You will have dozens of models, likely hundreds or more
          • Handoff to new versions is tricky
          • You need runtime experience to be sure which model is better
  13. Key Traits of a Successful AI Strategy (chart label: Adopters 20%)
      • Seamless Access to All Data
      • Technical Capabilities of the Platform
      • Leadership from the Top
      • Source: McKinsey Global Institute, Artificial Intelligence: The Next Digital Frontier? (2017)
  14. Improving Machine Learning Logistics
      • Stream-first architecture is a powerful approach with surprisingly widespread advantages
          • Innovative technologies are emerging for streaming data
      • Microservices approach provides flexibility
          • Streaming supports microservices (if done right)
      • Containers remove surprises
          • Predictable environment for running models
  15. Demo
  16. © 2018 MapR Technologies, Inc. // MapR Confidential
  17. “Slack After Dark” demo: demonstrate an end-to-end containerized and integrated ML workflow, showcasing Online Model Predictions and Online Model Training!
      • Backend service: “AI Monkey” Slack app to help us perform the “DataOps” tasks: test, train and deploy the model.
      • Frontend service: “Slack After Dark” Slack app, an AI-powered mobile dating app representing the end-user application.
  18. Implementing Rendezvous Architecture for Online Prediction
      • Models: Model 1, Model 2, Decoy, Canary
      • Traffic: Live Traffic, Mirrored Traffic
      • Rendezvous / ensemble (between request and response): select the prediction with the highest confidence (via a customizable objective function)
      • Notes on the diagram: Archive; Compare canary to live models; Replay for future use
      • Storage: Streams, Distributed Filesystem
      • Commands: /predict, /predict-rendezvous
      • For more details on the Rendezvous Architecture see: https://mapr.com/ebook/machine-learning-logistics/
  19. Implementing Online Model Training off Streaming Data: from batch to real-time!
      • Diagram labels: Streaming Data, Training Stream, Model Training, Build New Model, Deploy, Model 1, Model 2, Model 3 (Canary), Distributed Filesystem
      • Commands: /label, /train, /deploy, /fix
      • /fix labels, trains and deploys the new model all together
  20. Slack After Dark App
      • Commands: /login, /match
      • User and Scoring Database
      • Industry's unique AI-powered scoring and matching algorithms
  21. Containers & Kubernetes
  22. Data Science Phases
      • Exploration: the executable code that is used to train models is developed and some prototyping is done. Typically uses data science notebooks. Output is code.
      • Training: the executable training code is run on very large datasets. This is the phase where compute power matters. Output is a model.
      • Deployment: models are deployed into a framework that allows for the scoring of data. Can be done in batch or real time. Output is a microservices framework.
      • Production: models are monitored and updated in production. Requires CI/CD pipeline capability. Output is “insights”.
  23. Getting to Know Kubernetes: containers and Kubernetes (k8s) address major ML/DL challenges
      • What's the diff? VM vs. container
          • Virtual Machines (diagram labels): App #1, Bins/Libs, Guest OS, vCenter/HyperV, Host OS, Infrastructure, vRealize/...
          • Containers (diagram labels): App #1, Bins/Libs, Kubernetes (k8s), Host OS, Infrastructure, Kubeflow
      • Containerization is good for ML
          • For Exploration: containerization enables isolated, personalized development environments
          • For Training: containerization provides compute agility and the ability to iterate with varying parameters
          • For Deployment: containerization provides the ability to create a robust microservices architecture
  24. Kubernetes is an API and agents
      • The Kubernetes API provides containers with scheduling, configuration, network, and storage
      • The Kubernetes runtime manages the containers
  25. Diagram: App 1, App 2, App 3 on Kubernetes
  26. Diagram: App 1, App 2, App 3 on Kubernetes, with rpc, stream, LogFile
  27. Diagram: App 1, App 2, App 3 on Kubernetes, with rpc, stream, LogFile. But what about the data??
  28. Diagram: Data platform; App 1, App 2, App 3; Kubernetes; rpc
  29. The Data Platform needs to be like Kubernetes. For Data.
  30. The concept of "Dataware"
  31. The Next Era Of Abstraction: DATA
      • Timeline from 1970 to 2018+ (axis labels: Integrated Systems, Lock-in, Specialization, Flexibility, Agility, Abstraction), with labels in slide order:
          • Bare metal: specialized HW with open industry software standards (TCP/IP, X86, NFS)
          • Containers: resources entirely managed in software
          • Datacenter virtualization: software replaces specialized HW
          • Virtual machines: software used to abstract HW from OS; freedom to run multiple OS on the same HW
          • Data
      • Software has increasingly abstracted underlying resources from applications to improve flexibility, agility, and costs.
      • Data is growing exponentially and getting highly fragmented and distributed within the enterprise IT stack.
      • Data abstraction is about an enterprise data layer that turns data into a more powerful resource.
  32. Dataware: Managing Data As A Resource
      • Current enterprise IT stack: Applications, Middleware, Hardware
      • With dataware, the new layer that manages data as a resource: Applications, Middleware, Dataware, Hardware
      • Key attributes for dataware:
          • Universal Access to Data
          • Data Workload Independence
          • Global Data Multi-Tenancy
          • Data Processing Isolation
          • Data Security
          • Data Performance and Temperature Management
          • Data Portability
          • Global Data Deployments
  33. MapR Is the Most Advanced AI and Analytics Dataware
      • MapR Data Platform accelerates data-driven innovation:
          • Full spectrum of workloads from analytics to ML and AI
          • Edge first, cloud, container, and data native
          • Open and adaptive
          • Single security model
          • Mission-critical reliability at scale
      • MapR’s Data Platform allows data to be managed as a resource regardless of deployment or location.
  34. Data Center Integration
  35. Typical Data Pipeline for AI (Logical View)
      • Sources: Streaming Data Sources, IOT Data Sources, Web Data Sources
      • Message Bus / Kafka
      • Data Retention: HDFS
      • Data Processing and storage: Historical Data, Structured / Data Warehouse data
      • ETL: Extract data, Process data, Create dataset
      • Training
  36. Limitations With This Approach
      • Data must always be moved to the compute
          • No ability to optimize SLA per use case, and no true edge support
      • Distributed compute, HPC and GPU workloads cannot be co-located in a heterogeneous environment
      • Data-at-rest and data-in-motion live in two different locations
          • More complex software and hardware architectures
      • Does not support a data operations strategy
          • At-rest and in-motion data cannot be versioned simultaneously (input data, models and outputs)
          • Complex synchronization and security models
      • Does not work across both on-premises and cloud providers
  37. Traditional Storage Vendor Solution
      • Diagram labels: Edge, Core, Cloud; Ingest; Unified Data Lake; Data Prep; Training + Testing; Production; Training Cluster; Deployment; Storage Appliance (twice); Servers; Servers w/ GPU; Copy (four times)
      • Lineage is lost between environments
      • Data and GPUs cannot be co-located
  38. PLEASE, PLEASE, PLEASE… tell me you are not copying all your data between these systems
  39. Hadoop Based Solutions
      • Diagram labels: Edge, Core, Cloud; Ingest; Unified Data Lake; Data Prep; Training + Testing; Production; Training Cluster; Deployment; HDFS Cluster; Servers; Servers w/ GPU; Kafka (in-motion, several times); Copy (many times)
      • Minimum of seven non-homogeneous environments to administer and secure
      • Full data copies without versioning, lineage control or multi-master support
      • Where does the master copy of the data live?
  40. One Data Fabric
      • Diagram labels: Global Namespace; Core, Cloud, Edge; Data Prep; Training + Testing; Deployment
      • One homogeneous environment to manage and secure
      • Supports real-time processing with data protection, lineage, and versioning
      • Runs directly on GPU-based servers to create a unified GPU-based cluster
  41. Data Centric Approach: Expanding to AI/ML/DL
      • Cisco Validated Designs: cisco.com/go/bigdata, cisco.com/go/ai-compute
      • NGC TensorFlow on OpenShift: NGC on OpenShift for data scientists for interactive and batch workloads
      • Portable, Scalable ML Stack Enabling Rapid Development and Deployment
      • Kubeflow on premise and Google Cloud
      • Scale CPU and GPU on Kubernetes with Enterprise support
      • Mix and Match Different Infra
          • Up to 2 PCIe GPUs
          • Up to 6 PCIe GPUs
          • 8 NVLink GPUs
      • Run NGC
          • TensorFlow, Pytorch, Caffe,…
      • Kubeflow
          • Integrating TensorFlow and Kubernetes
      • Kubeflow Pipelines:
          • Reusable software components to build complete data pipeline
      • Kubeflow Pipelines on UCS and Google Cloud
          • Hybrid cloud architecture for data pipeline and machine learning
  42. Cisco UCS Infrastructure Choices
      • Use cases: Test & Dev and Model Training; Deep Learning / Training; Inferencing
      • Platforms: C240, HyperFlex 240, C480, C480 ML, C/HX 220, C/HX 240; option of GPU-only nodes
      • GPU configurations shown: 2 x P4; 6 x P4; 2 x P100/V100; 2 x P100/V100 per node; 6 x PCIe P100/V100; 8 x SXM2 V100 with NVLink
      • Better Together, Customer Choice, Cisco Validated Design with Eco-system
      • UCSM and Intersight Managed
      • Validated AI/ML SW For Turnkey (Working with Partners)
  43. Cisco UCS C480 ML M5 Rack Server for Deep Learning: A No-Compromise Purpose Built Server for Deep Learning
      • GPUs: 8 x V100 32GB, NVLink interconnect
      • CPUs: 2 x Intel® Xeon® Processor Scalable Family (up to 28 cores per socket)
      • Memory: 24 DDR4 DIMMs, up to 3 TB
      • Storage: up to 24 SAS/SATA SSD/HDD; up to 6 NVMe drives
      • Network: choice of 10/25 or 40/100G
      • Four PCIe slots
      • RAID controller
      • Redundant fans
  44. Advantages for AI
      • Simplified administration and security models
          • One and done: no need for a different model in each location
          • GDPR “compliant”!
      • Scales linearly with customer needs
          • No reason to create a bunch of separate clusters
      • Sustainability: all data, files, database and event streaming
          • Both at-rest and in-motion
      • An enabling and flexible architecture
          • Only way to bring distributed data and GPUs together
          • Easy to meet customers' needs
          • Supports both Kubernetes and Containers
      • Low cost of entry and linear cost of scaling
  45. Simplifying Model Development and Deployment
      • Complex data pipelines, large data volumes serving GPUs
          • Mixed workloads: distributed data prep plus real-time
      • Simultaneous data and model versioning
          • Data at-rest and in-motion
      • Model output lands in a stream
          • Creates pluggable model flow
      • Works across on-premises and cloud infrastructures, simultaneously
  46. Summary
  47. Key Traits of a Successful AI Strategy (chart label: Adopters 20%)
      • Seamless Access to All Data
      • Technical Capabilities of the Platform
      • Leadership from the Top
      • Source: McKinsey Global Institute, Artificial Intelligence: The Next Digital Frontier? (2017)
  48. Summary recommendations
      • Use Containers/Kubernetes to leverage NVIDIA GPU computing power when building deep learning models.
      • Use a converged data platform ("dataware") to serve as data infrastructure, providing Distributed File System, NoSQL Database and Event Streams.
      • Leverage the ability to publish and subscribe to streams on the platform to build next generation applications with deep learning models.
      • Use Cisco Validated Designs as a reference for your architecture choices.
  49. Containerized Architecture for Deep Learning (diagram labels: DC1, DC2, Orchestration in each)
  50. More information
  51. More Information
      • cisco.com/go/bigdata
      • cisco.com/go/ai-compute
      • www.cisco-ai.com
  52. O’Reilly (e)books!
      • Download the e-book here: https://mapr.com/ebook/machine-learning-logistics/ by Ted Dunning and Ellen Friedman
      • Download the e-book here: https://mapr.com/ebook/ai-and-analytics-in-production/
  53. Need Help Solving Your Data Logistics Problems?
      • Over 35 FREE on-demand training courses for AI and analytic development, data engineering and administration
      • Certification tracks for developers, administrators, and data scientists
      • Expanded support portal and knowledge base
      • Containerized clusters for free download, solution templates, and code examples for hands-on experience
      • https://mapr.com/training/
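
Slide 18 describes the rendezvous architecture only through diagram labels. As a minimal sketch of the selection step, the code below assumes that plain Python callables stand in for the containerized models, that a list stands in for the archive stream, and that Model 1 and Model 2 serve live traffic while the canary only sees mirrored traffic (one reading of the diagram labels, not something the slide states outright). The names `Prediction`, `rendezvous`, `highest_confidence` and `constant_model` are illustrative and do not come from the demo.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Prediction:
    model: str         # name of the model that produced the result
    label: str         # predicted class
    confidence: float  # model-reported confidence in [0, 1]


Model = Callable[[str], Prediction]


def highest_confidence(candidates: List[Prediction]) -> Prediction:
    """Default objective function: keep the most confident prediction."""
    return max(candidates, key=lambda p: p.confidence)


def rendezvous(
    request: str,
    live_models: Dict[str, Model],
    mirrored_models: Dict[str, Model],
    archive: List[dict],
    objective: Callable[[List[Prediction]], Prediction] = highest_confidence,
) -> Prediction:
    """Score a request with every model, answer from the live ones only."""
    live = [predict(request) for predict in live_models.values()]
    mirrored = [predict(request) for predict in mirrored_models.values()]
    winner = objective(live)
    # Keep everything so the canary can be compared to live models later
    # and the request can be replayed against future models.
    archive.append(
        {"request": request, "live": live, "mirrored": mirrored, "winner": winner.model}
    )
    return winner


def constant_model(name: str, label: str, confidence: float) -> Model:
    """Hypothetical stand-in for a real model service."""
    def predict(request: str) -> Prediction:
        return Prediction(name, label, confidence)
    return predict


archive: List[dict] = []
live_models = {
    "model-1": constant_model("model-1", "match", 0.72),
    "model-2": constant_model("model-2", "no-match", 0.91),
}
mirrored_models = {"canary": constant_model("canary", "match", 0.99)}

result = rendezvous("user-42", live_models, mirrored_models, archive)
print(result.model, result.label, result.confidence)
```

In this sketch the canary reports the highest confidence but is never returned to the caller; its output is only archived so it can be compared with the live models before it is promoted.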

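Slide 19 lists the demo commands /label, /train, /deploy and /fix, where /fix labels, trains and deploys in one step. The sketch below shows one way such a chain could be wired together. It assumes a Python list as a stand-in for the training stream and a trivial majority-vote lookup as a stand-in for the real deep learning model; every function name and the `deployed` dictionary are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Stand-in for the training stream: (feature, correct label) events.
training_stream: List[Tuple[str, str]] = []
# Stand-in for the serving layer: slot name -> deployed model.
deployed: Dict[str, Dict[str, str]] = {}


def label(feature: str, correct_label: str) -> None:
    """Publish a corrected example to the training stream."""
    training_stream.append((feature, correct_label))


def train() -> Dict[str, str]:
    """Build a new model from everything seen on the stream so far.

    The 'model' is just the most frequent label per feature, which keeps
    the sketch runnable without any ML framework.
    """
    votes: Dict[str, Counter] = defaultdict(Counter)
    for feature, observed in training_stream:
        votes[feature][observed] += 1
    return {feature: counts.most_common(1)[0][0] for feature, counts in votes.items()}


def deploy(model: Dict[str, str], slot: str = "canary") -> None:
    """Put the new model into a slot; new models start as the canary."""
    deployed[slot] = model


def fix(feature: str, correct_label: str) -> Dict[str, str]:
    """Label, train and deploy in one step."""
    label(feature, correct_label)
    model = train()
    deploy(model)
    return model


fix("likes-hiking", "match")
fix("likes-hiking", "match")
fix("likes-hiking", "no-match")
print(deployed["canary"])
```

Because every correction is appended to the stream before retraining, the history of labels is preserved and any earlier model can be rebuilt by replaying a prefix of the stream.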
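
Slide 24 says the Kubernetes API provides containers with scheduling, configuration, network, and storage, and slide 48 recommends containers on Kubernetes for GPU training. As a hedged illustration, the function below builds a Pod manifest that requests GPUs through the `nvidia.com/gpu` resource, which is the resource name exposed by the NVIDIA device plugin. The pod name, image, and command are placeholders, and the sketch assumes a cluster where that plugin is installed.

```python
import json
from typing import Any, Dict, List


def gpu_training_pod(
    name: str, image: str, command: List[str], gpus: int = 1
) -> Dict[str, Any]:
    """Return a Pod manifest (as a dict) that requests `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"phase": "training"}},
        "spec": {
            # A training job should run once and stop, not restart forever.
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    "command": command,
                    # GPUs are requested under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


manifest = gpu_training_pod(
    name="train-model-3",                          # placeholder name
    image="registry.example.com/dl/train:latest",  # placeholder image
    command=["python", "train.py"],                # placeholder entry point
    gpus=2,
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Since kubectl accepts JSON as well as YAML, the printed document could be saved to a file and submitted with `kubectl apply -f`.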