Imagine a system where you collect real-time data, develop a machine learning model, run analysis and training on powerful GPUs, then click a magic button to deploy code and ML models to production, all without any heavy lifting from data and DevOps engineers. Today, data scientists work on laptops with just a subset of the data, and time is wasted waiting for data and compute.
It’s about efficient use of time! Join Iguazio and NVIDIA so that you can get home early today! Learn how to speed up data science from development to production:
- Access to large scale, real-time and operational data without waiting for ETL
- Run high-performance analytics and ML on NVIDIA GPUs (RAPIDS)
- Work on a shared, pre-integrated Kubernetes cluster with Jupyter notebooks and leading data science tools
- One-click (really!) deployment to production
Speakers: Yaron Haviv, CTO at Iguazio; Or Zilberman, Data Scientist at Iguazio; and Jacci Cenci, Sr. Technical Marketing Engineer at NVIDIA
2. Agenda
Data science challenges
Iguazio data science PaaS over Kubernetes
NVIDIA solutions to accelerate data science with Kubernetes
o GPU integration, TensorRT, RAPIDS
Hands-on tutorial
o End-to-end application: real-time predictive infrastructure monitoring (ingest, explore, hyperparameter training, deploy to production)
o Serverless and scale-out data science
o NVIDIA RAPIDS
Summary
Q&A
3. Today: ML Lifecycle is Complex and Siloed
Data Prep & Analytics
Data Engineers
Model Building
Data Scientists
Model Deployment
ETL
Data Lakes/Warehouses
CSVs
Model
Need more fresh data
Tune model
Active Data (CSV/in-mem)
GPU
Data Engineers and App Developers
ML Model Serving
App Deployment
Interactive App
Stream Processing
Triggers and Interactions
Database
4. ML Challenges in Real Life
Re-coding & instrumenting
AI Model “Depth” & Accuracy vs Performance & Costs
Observability & Reproducibility
Infrastructure and Software Complexity
Can we gather (and prep) model features in production?
5. Solution: Fast & Continuous Data Science Pipeline
Collect: Constantly Ingest, Clean & Tag Data via “Collectors”
Develop: “Serverless” Functions & Notebooks
Deploy to Production: Triggers and Interactions, Intelligent Serverless Run-Time, In Cloud, On-prem or Edge
Build & Test: CI/CD for Code & Models
ML Model Training: CPU, GPU
Monitor & Reiterate
Deploy in Any Cloud or Edge
Deliver Accurate Results in Real-time
Develop and Iterate Faster
6. Iguazio: Open & High-Performance Data-Science PaaS
Real-time Structured & Unstructured Data Fabric
External Data
Managed & hardened open-source plus 3rd party services and apps
Secure real-time data sharing enabling collaboration & parallelism
Self-service experience from A to Z
CPU, GPU
Built on a cloud-native architecture
Compute
7. Develop Faster, Run Faster, Use Less Resources
Managed Jupyter
Data science notebooks and online IDE
Serverless notebooks: self-service, scale to zero on idle
Simplify, secure and accelerate data access and processing
Accelerate applications and training using shared GPUs and ML services
One-click deployment to production (as jobs, real-time functions and dashboards)
Time Series, Stream, Table, Object
GPU
Historical and real-time data from a variety of sources
Integrated, 3rd party or cloud ML services on-demand
8. Deploy Faster to Production with Serverless
Nuclio: the leading open-source serverless for real-time intelligence
Minimize software development and maintenance overhead
Extreme performance (up to 370K events/sec per process, 0.1 ms latency, fast data access)
Open, supports many event/data sources: HTTP, streaming, messaging, jobs
One-click deployment from many sources (code, containers, notebooks, git, templates)
Cloud, On-prem or Edge
One-Click Deployment
9. Kubernetes Helps Simplify the Use of Clusters and GPUs
Think of Kubernetes as an operating system for a cluster.
Kubernetes manages nodes, administers access, launches containers and jobs, and more.
[Diagram: a master server (API server, replication controller, scheduler) and four workers, each running a daemon and containers]
Infrastructure as code:
e.g. PyTorch Training Job
pytorch-job.yml
---
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-example
spec:
  backoffLimit: 5
  template:
    spec:
      # Jobs require restartPolicy Never or OnFailure (not shown on the original slide)
      restartPolicy: Never
      imagePullSecrets:
        - name: nvcr.dgxkey
      containers:
        - name: pytorch-container
          image: nvcr.io/nvidia/pytorch:18.06-py3
          command: ["/bin/sh"]
          args: ["-c", "python /examples/mnist/main.py"]
          resources:
            limits:
              # request one NVIDIA GPU for the container
              nvidia.com/gpu: 1
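The same "infrastructure as code" idea can be driven from a notebook. The sketch below builds an equivalent Job manifest as a plain Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper name `gpu_job_manifest` and its parameters are hypothetical, the image pull secret is left out for brevity, and a `restartPolicy` is included because Kubernetes Jobs need one set to `Never` or `OnFailure`.

```python
import json


def gpu_job_manifest(name, image, command, gpus=1):
    """Build a Kubernetes batch/v1 Job manifest that requests NVIDIA GPUs."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 5,
            "template": {
                "spec": {
                    # Jobs must use Never or OnFailure.
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": name + "-container",
                            "image": image,
                            "command": ["/bin/sh"],
                            "args": ["-c", command],
                            # GPUs are requested through resource limits.
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = gpu_job_manifest(
        "pytorch-example",
        "nvcr.io/nvidia/pytorch:18.06-py3",
        "python /examples/mnist/main.py",
    )
    print(json.dumps(manifest, indent=2))
```

Generating the manifest from code makes it easy to vary the GPU count or the training command per experiment while keeping one template.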
10. Re-Imagining Data Science Workflow
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA
Data preparation / wrangling: cuDF
Optimized ML model training: cuML
Visualization: data visualization libraries, data insights
11. RAPIDS – GPU Accelerated Data Science
Software Stack (Python)
Workflow stages: Data Preparation, Model Training, Visualization
Stack components: Python, Dask, RAPIDS (cuDF, cuML, cuGRAPH), deep learning frameworks (cuDNN), CUDA, Apache Arrow on GPU memory
Read/write RAPIDS dataframes directly into the Iguazio database & FS
12. Faster Speeds, Real World Benefits
[Figure: three bar charts of time in seconds (shorter is better) for 20, 30, 50 and 100 CPU nodes, DGX-2 and 5x DGX-1. Panels: cuIO/cuDF (Load and Data Preparation), cuML (XGBoost), End-to-End. Legend: cuIO/cuDF (Load and Data Preparation), Data Conversion, XGBoost.]
Bar values, in the configuration order above (the panel each series belongs to is not recoverable from the extracted text):
2,290 / 1,956 / 1,999 / 1,948 / 169 / 157
2,741 / 1,675 / 715 / 379 / 42 / 19
Benchmark: 200GB CSV dataset; data preparation includes joins and variable transformations.
CPU cluster configuration: CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark
DGX cluster configuration: 5x DGX-1 on InfiniBand network
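To put the chart numbers in perspective, the sketch below computes the speedup of each configuration relative to the 20-node CPU cluster for the two value series printed on the slide. The series names `SERIES_A` and `SERIES_B` are placeholders chosen for this example, since the extracted text does not make clear which panel each series belongs to.

```python
# Configurations in the order they appear on the chart axes.
LABELS = ["20 CPU nodes", "30 CPU nodes", "50 CPU nodes",
          "100 CPU nodes", "DGX-2", "5x DGX-1"]

# Times in seconds as printed on the slide (shorter is better).
SERIES_A = [2290, 1956, 1999, 1948, 169, 157]
SERIES_B = [2741, 1675, 715, 379, 42, 19]


def speedups(times, baseline_index=0):
    """Return each configuration's speedup relative to the baseline time."""
    base = times[baseline_index]
    return [round(base / t, 1) for t in times]


if __name__ == "__main__":
    for name, series in (("series A", SERIES_A), ("series B", SERIES_B)):
        for label, factor in zip(LABELS, speedups(series)):
            print(f"{name}: {label} is {factor}x the 20-node baseline")
```

One series barely improves as CPU nodes are added, while the other scales with node count; in both, the DGX configurations finish in a fraction of the baseline time.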
13. TensorRT – GPU Powered Inference Server
Available with Monthly Updates
Models supported
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT Plans
● Caffe2 NetDef (ONNX import)
Multi-GPU support
Concurrent model execution
Server HTTP REST API/gRPC
Python/C++ client libraries
Python/C++ Client Library
16. Summary
Focus on the end goal: Build and Deploy Intelligent Apps Faster
Eliminate complexity through pre-integrated managed services
Leverage parallelism and hardware acceleration to improve ROI
Consolidate data engineering, science and app dev platforms
Production Deployment of Intelligent Applications
19. Iguazio Smart Unified Real-time DB & File-System
Many APIs and models on the same data
o SQL, NoSQL, time series, stream, files
o Custom APIs, streaming, sync and ETLs
Minimize CPU, mem, and ops overhead
100TB NVMe Flash (direct attached)
High-Speed Fabric
Real-time Firewall
Smart Real-time DB
Many standard & open APIs on a unified DB Engine
Use NVMe Flash as an extension of memory
Granular security
S3
ETL Streams
In-memory performance, at 1/30 of the cost and 30x the density (on Flash)
Real-time time series & data analytics
Fine-grained security
Apps & Users Backup
20. Real-time Intelligent Infrastructure Management
Auto-Healing Network Operations
Replaced a complex Hadoop-based data pipeline that was never productized
Cross-correlating real-time data from multiple sources with historical data
AI-based predictions trigger pre-programmed actions that fix evolving problems in the network
Implemented within weeks of initial deployment
Singtel uses Iguazio to predict network outages and avoid them in real-time
Singtel’s self-healing network is the perfect example of a client shifting from reactive to proactive with Iguazio
21. Real-time Intelligent Infrastructure Management
Maintaining Continuous Fast Response for 2nd Tier Cloud Services
Analyzing and predicting cloud service response time for optimal results
Real-time Data Ingestion
From multiple monitoring tools including Jennifer and Zabbix
Anomaly Detection
Accurate anomaly detection with an order of magnitude fewer false positives than the previous Elasticsearch-based platform
Root Cause Analysis
Real-time root cause analysis from multiple factors. For example, correlating changes in server CPU usage and application response time that occur simultaneously
Predictive Analytics
Predicting response times and sending real-time alerts indicating which factors need to be adjusted to avoid malfunctions
From deployment to completion in less than two weeks!
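The slides do not say which anomaly detection method was used. Purely as an illustration of the idea, the sketch below flags response-time samples with a rolling z-score, a simple technique substituted here for whatever the production system does. The function name, window size and threshold are assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, pstdev


def zscore_anomalies(samples, window=5, threshold=3.0):
    """Return indexes of samples that deviate strongly from the recent window."""
    history = deque(maxlen=window)
    flagged = []
    for index, value in enumerate(samples):
        # Only score once a full window of earlier samples is available.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append(index)
        history.append(value)
    return flagged


if __name__ == "__main__":
    # Hypothetical response times in milliseconds with one spike.
    response_ms = [100, 102, 98, 101, 99, 100, 400, 101]
    print(zscore_anomalies(response_ms))
```

A fixed threshold on raw values tends to raise false positives when the baseline drifts; comparing each sample to its own recent history is one common way to reduce them.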
22. Evolve Into an Agile Cloud-Native Architecture
YARN
HBase
HDFS
MapReduce
Pig, Hive, ..
DBaaS
S3 (object)
From a Legacy & Resource Intensive Architecture
To a Simpler & Modern Approach
Data
Orchestration
Middleware
Your Business Logic
Consume
Innovate
Serverless Data-Science BigData