Simplifying AI Infrastructure: Lessons in Scaling on DGX Systems, the world's most powerful AI systems. This is a presentation I gave at GTC Israel in 2018.
4. END-TO-END PRODUCT FAMILY
FULLY INTEGRATED AI SYSTEMS
DESKTOP
TITAN
WORKSTATION
DGX Station
DATA CENTER
Tesla V100
DATA CENTER
Tesla V100
AUTOMOTIVE EMBEDDED
Tesla P4/T4
Drive AGX Pegasus Jetson AGX Xavier
VIRTUAL
WS
Virtual GPU
SERVER
PLATFORM
HGX1/ HGX2
HPC / TRAINING INFERENCE
DGX-1 DGX-2
6. 6
AI PLATFORM CONSIDERATIONS
Factors impacting deep learning deployment decisions
COST
“I have limited budget, need lowest up-front cost possible”
PERFORMANCE
“I want the most GPU bang for the buck”
PRODUCTIVITY
“Must get started now, line of business wants to deliver results yesterday”
7. 7
PLATFORM IMPACT ON AI TCO
Study & exploration
Platform Design
Productive Experimentation
HW & SW Integration
Troubleshooting
Software eng’g
Software optimization
Design and Build for Scale
Software re-optimization
Insights
Training at Scale
1. Designing and Building an AI Compute Platform – from Scratch
OPEX
CAPEX
Day
1
Month 3
Time and budget spent
on things other than
data science
“DIY”
TCO
8. 8
TAKING A FULL-STACK MINDSET TO
AI SYSTEM DEPLOYMENT
Support Model
Accelerate Time-to-Resolution for
AI/DL Issues
AI/DL Software Stack
Maximize GPU Performance
Operating System Image
Maintain stability while keeping pace
with the latest
Hardware Architecture
Look beyond the spec sheet
10. 10
PURPOSE-BUILT, NOT RE-PURPOSED
NVIDIA DGX STATION
AI WORKSTATION
NVIDIA DGX-1, DGX-2
AI DATA CENTER
• Universal SW for Deep Learning
• Predictable execution across
platforms
• Pervasive reach
NVIDIA AI SOFTWARE STACK
The Essential
Instrument for AI
Research
DGX-1
The Personal
AI Supercomputer
DGX Station
The World’s Most Powerful
AI System for the Most
Complex AI Challenges
DGX-2
11. 11
SOFTWARE PERFORMANCE PROOF-POINTS
FROM THE FIELD
Global Technology Firm Specializing in Digital Media (ResNet50 Training)
Home-grown “optimized” TensorFlow stack: 1680 images/sec
DGX TensorFlow stack: 2600 images/sec
1.5X FASTER
World-Leading Medical Research Center (ResNet50 Training)
Home-grown TensorFlow stack: 1238 images/sec
DGX TensorFlow stack: 2600 images/sec
2.1X FASTER
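The two field comparisons on this slide reduce to simple throughput ratios. The short Python sketch below recomputes them from the quoted images/sec figures; the dictionary keys and the `speedup` helper are illustrative names, not part of any NVIDIA tooling.

```python
# Throughput figures quoted on the slide (images/sec, ResNet50 training).
results = {
    "digital_media_firm": {"home_grown": 1680, "dgx": 2600},
    "medical_research_center": {"home_grown": 1238, "dgx": 2600},
}


def speedup(baseline: float, improved: float) -> float:
    """Return how many times faster the improved throughput is."""
    return improved / baseline


speedups = {
    name: speedup(r["home_grown"], r["dgx"]) for name, r in results.items()
}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

The ratios come out at roughly 1.5x and 2.1x, which matches the headline numbers on the slide.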
12. 12
OpenAI
NYU
NVIDIA DGX SYSTEMS MOMENTUM BUILDING
Barriers Toppled, the Unsolvable Solved – a Sampling of DGX Systems Impact
April
2016
August
2018
UC Berkeley CSIRO MIT CMU Fidelity RIKEN CloudSight
Mass. General DFKI
IDSIA
Ford SAP NVIDIA
SATURNV launch
Hologic
Avitas Systems
(A GE Venture)
SK Telecom
PayPal
Chinese Academy
of Sciences
FAIR
Microsoft
University
of Michigan
Comcast
Nimbix
Noodle.ai
Oak Ridge
National
Laboratory
BHGE
Zenuity
Swiss Federal
Railway
14. 14
NVIDIA DGX STATION
Groundbreaking AI – at your desk
The Fastest Personal Supercomputer
for Researchers and Data Scientists
Revolutionary form factor -
designed for the desk, whisper-quiet
Start experimenting in hours,
not weeks, powered by DGX Stack
Productivity that goes from desk
to data center to cloud
Breakthrough performance and
precision – powered by Volta
15. 15
The Personal AI Supercomputer
for Researchers and Data Scientists
Key Features
1. 4 x NVIDIA Tesla V100 GPU (NOW 32 GB)
2. 2nd-gen NVLink (4-way)
3. Water-cooled design
4. 3 x DisplayPort (4K resolution)
5. Intel Xeon E5-2698 20-core
6. 256GB DDR4 RAM
NVIDIA DGX STATION
Groundbreaking AI – at your desk
16. 16
DGX STATION: 72X FASTER THAN CPU
Dual Socket CPU Server: 711 hours (1X)
4-way GPU Workstation: 36.4 hours (20X)
DGX Station: 9.9 hours (72X)
Workload: ResNet50, 90 epochs to solution | CPU Server: Dual Xeon E5-2699 v4, 2.6GHz
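The speedup labels on this slide follow from the time-to-solution figures. Here is a minimal Python check, assuming the hours quoted above are the only inputs; the variable names are illustrative.

```python
# Time to solution in hours (ResNet50, 90 epochs), as quoted on the slide.
hours_to_solution = {
    "dual_socket_cpu_server": 711.0,
    "four_way_gpu_workstation": 36.4,
    "dgx_station": 9.9,
}

baseline = hours_to_solution["dual_socket_cpu_server"]

# Speedup relative to the CPU server: baseline time divided by system time.
speedup_vs_cpu = {name: baseline / h for name, h in hours_to_solution.items()}

# DGX Station relative to the generic 4-way GPU workstation.
station_vs_workstation = (
    hours_to_solution["four_way_gpu_workstation"]
    / hours_to_solution["dgx_station"]
)

for name, s in speedup_vs_cpu.items():
    print(f"{name}: {s:.1f}x vs CPU")
print(f"DGX Station vs 4-way workstation: {station_vs_workstation:.1f}x")
```

Rounded to whole numbers, the ratios against the CPU server are 20x and 72x. The same data implies the DGX Station is a little under 4x faster than the 4-way GPU workstation.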
17. 17
JUMP START YOUR AI JOURNEY
Training AI on DGX Station
4x Speedup
8 Months Payback
Healthcare
10x Speedup
Self-Driving Cars
3x Speedup
2 Months Payback
Retail
8x Speedup
Smart City
18. 18
Most mutations found when sequencing tumours are unknown
and often ignored. Uncovering the significance of each
mutation and its interaction with drugs requires a powerful
machine that learns from large, multi-scale image data of
cells expressing the mutations.
NovellusDx leverages DGX Station and a containerized deep
learning framework to obtain very accurate readings of
intracellular signalling pathway activity that are stable
over time and across other biological perturbations, with 10x
better accuracy, $70k annual savings, and 4x faster training.
In a clinical trial, progression-free survival (PFS), the
number of months during which the disease does not progress,
increased 3x.
AI-Powered Tumor
Mutation Induced
Signaling Activity
Monitoring
19. 19
For a self-driving car to reach the same level of accuracy as
a human, it will need to have travelled 11 billion miles of
test-drives, taking many years to complete.
Cognata is shaving years off this training process by enabling
virtual cars to experience the world of driving in a strikingly
realistic simulated environment.
With the NVIDIA DGX Station, Cognata
• Accelerates training of DNN-based generative models by a
factor of 10
• Simultaneously runs dozens of AV simulations to
accumulate millions of virtual miles and improve and
identify edge cases
The AI-powered solution enabled Cognata to increase
performance, save money, improve productivity and make
the world a safer place.
NVIDIA DGX
ACCELERATES
AUTONOMOUS VEHICLE
READINESS
20. 20
The in-store retail experience is not as easy or streamlined as
the online retail experience.
Tracxpoint created a shopping cart, called AIC, that fully
integrates hardware with AI-powered, GPU-accelerated
software.
Trained on DGX Station and running inference with TensorRT and
DeepStream 2.0 on Jetson TX2, AIC can recognize 100,000
individual products in under a second, with high accuracy,
a 3x performance gain, and a 2-month ROI compared to cloud
solutions.
Customers now simply place products in their cart (no need
to search for barcodes), communicate in real-time with
suppliers to get personalized offers while shopping, navigate
inside the supermarket, and then pay digitally on cart.
AI-Powered Shopping
Cart for Seamless
Online Retail
Experience in Store
Usability
Speed
Trust
21. 21
Providing the sensor with the ability to analyze the data it
picks up has been top of mind for governments, police,
security agencies, banking, smart cities, retail, and
transportation industry. Collecting, analyzing, and storing
data can be difficult, costly, and error-prone.
AnyVision, the world’s leading designer and developer of
recognition platforms, offers a wide range of capabilities,
including face recognition, human body recognition and
object identification.
Powered by a cutting-edge, deep neural network on NVIDIA
DGX Systems, Tesla, and Jetson, AnyVision is the first to
provide 1:1 and 1:N face recognition and can detect 115
million individuals in 0.2 seconds per database, with an 8x
performance increase in training.
Revolutionize Security
in Smart City with AI-
Powered Facial
Recognition Platform
23. 23
NVIDIA DGX-1: THE ESSENTIAL TOOL OF AI
Highest Performance, Fully Integrated System
1 PFLOPS | 8x Tesla V100 32GB | 300 GB/s NVLink Hybrid Cube Mesh
2x Xeon | 8 TB RAID 0 | Quad 100Gbps, Dual 10GbE | 3U — 3500W
8 TB SSD | 8 x Tesla V100 16GB/32GB
24. 24
DGX-1: 140X FASTER THAN CPU
Dual Socket CPU: 711 hours (1X)
8-way GPU Server: 15.5 hours (46X)
DGX-1: 5.1 hours (140X)
Workload: ResNet50, 90 epochs to solution | CPU Server: Dual Xeon E5-2699 v4, 2.6GHz
25. 26
WORLD-CLASS
RAILWAY LOGISTICS
10,671 trains
1.26 million riders
3,232 km of track
300 tunnels
6,000 bridges
30,000 switches
1 train, 11 switches: 30 possible ways
2 trains: 900 ways
80 trains: 10^80 possibilities
> # of observed atoms in the universe
One full day of experiments | 17 seconds
One day of whole train traffic in Switzerland | 0.3 seconds
86,000 steps in 0.3 seconds
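The jump from 30 ways for one train to 900 ways for two suggests that the routing choices multiply. The sketch below assumes each train independently has 30 possible routes; that is an interpretation of the slide rather than a stated model, and `joint_routings` is a hypothetical helper. It shows how quickly the joint search space grows.

```python
# Assumption: every train independently chooses among 30 possible routes,
# so the number of joint routings is 30 ** num_trains.
WAYS_PER_TRAIN = 30


def joint_routings(num_trains: int) -> int:
    """Number of joint routings under the independence assumption."""
    return WAYS_PER_TRAIN ** num_trains


two_trains = joint_routings(2)
eighty_trains = joint_routings(80)

# Order of magnitude often quoted for atoms in the observable universe.
ATOMS_ORDER_OF_MAGNITUDE = 10 ** 80

print(two_trains)                                # 900
print(len(str(eighty_trains)))                   # number of decimal digits
print(eighty_trains > ATOMS_ORDER_OF_MAGNITUDE)  # True
```

Under this assumption the count for 80 trains is well above 10^80, so exhaustive enumeration is out of reach. That is why fast simulation and learned policies are attractive for this kind of scheduling problem.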
27. 2828
NVIDIA DGX-2
LIMITLESS DEEP LEARNING FOR EXPLORATION
WITHOUT BOUNDARIES
The World’s Most Powerful Deep Learning System
for the Most Complex Deep Learning Challenges
• Performance to Train the Previously Impossible
• Revolutionary AI Network Fabric
• Fastest Path to AI Scale
• Powered by NVIDIA GPU Cloud
For More Information: nvidia.com/dgx-2
28. 29
DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE
1. NVIDIA Tesla V100 32GB
2. Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB Total HBM2 Memory, interconnected by Plane Card
3. Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
4. Eight EDR InfiniBand/100 GigE: 1600 Gb/sec Total Bi-directional Bandwidth
5. PCIe Switch Complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB System Memory
8. 30 TB NVME SSDs Internal Storage
9. Dual 10/25 Gb/sec Ethernet
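Several of the DGX-2 totals on this slide can be derived from the per-component figures. The Python sketch below recomputes them, using only numbers quoted on the slide; the constant names are illustrative.

```python
# Per-component figures quoted on the slide.
GPU_BOARDS = 2
GPUS_PER_BOARD = 8
HBM2_PER_GPU_GB = 32
NVSWITCHES_PER_BOARD = 6
EDR_PORTS = 8
EDR_GBPS_PER_DIRECTION = 100  # EDR InfiniBand / 100 GigE line rate

total_gpus = GPU_BOARDS * GPUS_PER_BOARD
total_hbm2_gb = total_gpus * HBM2_PER_GPU_GB
total_nvswitches = GPU_BOARDS * NVSWITCHES_PER_BOARD
# Bi-directional bandwidth counts both send and receive directions.
total_bidirectional_gbps = EDR_PORTS * EDR_GBPS_PER_DIRECTION * 2

print(total_gpus, total_hbm2_gb, total_nvswitches, total_bidirectional_gbps)
```

The results (16 GPUs, 512 GB of HBM2, twelve NVSwitches, 1600 Gb/sec) agree with the totals listed on the slide.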
29. 30
10X PERFORMANCE GAIN IN LESS THAN A YEAR
DGX-1, SEP’17 DGX-2, Q3‘18
software improvements across the stack including NCCL, cuDNN, etc.
Workload: FairSeq, 55 epochs to solution. PyTorch training performance.
Time to Train (days)
DGX-1 with V100: 15 days
DGX-2: 1.5 days
10 Times Faster
30. 31
300 Skylake Gold CPU Servers
THE PERFORMANCE OF 300 SKYLAKE SERVERS
One DGX-2: SAME performance | 1/8 THE COST | 60X LESS SPACE | 18X LESS POWER
300 Skylake Gold CPU servers: 15 racks | $2.7M in servers
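The ratios on this slide can be turned into rough absolute figures. The sketch below is back-of-the-envelope arithmetic only: the "1/8" and "60X" ratios are rounded marketing numbers, so the derived values are implied estimates, not list prices or measured footprints.

```python
# Figures quoted on the slide for the CPU-only baseline.
SKYLAKE_SERVERS = 300
SKYLAKE_SERVER_COST_USD = 2_700_000
SKYLAKE_RACKS = 15

# Ratios quoted on the slide for one DGX-2.
COST_RATIO = 8     # "1/8 the cost"
SPACE_RATIO = 60   # "60X less space"

cost_per_cpu_server_usd = SKYLAKE_SERVER_COST_USD / SKYLAKE_SERVERS
implied_dgx2_cost_usd = SKYLAKE_SERVER_COST_USD / COST_RATIO
implied_dgx2_racks = SKYLAKE_RACKS / SPACE_RATIO

print(f"CPU server cost: ${cost_per_cpu_server_usd:,.0f} each")
print(f"Implied DGX-2 cost: ${implied_dgx2_cost_usd:,.0f}")
print(f"Implied DGX-2 footprint: {implied_dgx2_racks} rack")
```

Read this way, the comparison implies about a quarter of a rack and a few hundred thousand dollars for the DGX-2 side.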
31. 32
2X HIGHER PERFORMANCE WITH NVSWITCH
2 DGX-1V servers have dual socket Xeon E5 2698v4 Processor. 8 x V100 GPUs. Servers connected via 4X 100Gb IB ports |
DGX-2 server has dual-socket Xeon Platinum 8168 Processor. 16 V100 GPUs
Weather Simulation (ECMWF benchmark): 2.4X FASTER
Language Processing (Mixture of Experts): 2.7X FASTER
DGX-2 with NVSwitch vs. 2x DGX-1 (Volta)
32. 33
CRISIS MANAGEMENT
SOLUTION
Natural disasters are increasingly causing major destruction
to life, property and economies. DFKI is using the NVIDIA
DGX-2 to evolve DeepEye, which uses satellite images
enriched with social media content to identify natural
disasters, into a crisis management solution. With
the increased GPU memory and fully connected
GPUs based on the NVSwitch architecture, DFKI
can build bigger models and process more
data to aid rescuers in their decision-making
for faster, more efficient dispatching of resources.
33. 34
“Fujifilm applies AI in a wide range of fields. In
healthcare, multiple NVIDIA GPUs will deliver
high-speed computation to develop AI supporting
image diagnostics. The introduction of this
supercomputer will massively increase our
processing power. We expect that AI learning that
once took days to complete can now be
completed within hours.”
Akira Yoda
chief digital officer of FUJIFILM Corporation
- Pharmaceuticals
- BioCDMO
- Regenerative medicine
- Analyzing and
recognizing medical
images
- Simulations display
materials and fine
chemicals
34. 35
DL FROM DEVELOPMENT TO PRODUCTION
Accelerate Deep Learning Value
Experiment
Refine
Model
Deploy
Train at
Scale
Insights
Procure
DGX
Station
Install,
Build, Test
Training
Productive
Experimentation
Fast Bring-up
To Data Center
To Desk
From
Idea
installed iterate
Inference
To
Results
refine, re-train
scale
To Edge
35. 36
THE CHALLENGE OF SCALING AI
Addressing design, deployment and operations bottlenecks
DESIGN
GUESSWORK
DEPLOYMENT
COMPLEXITY
MULTIPLE POINTS
OF SUPPORT
36.
BUILDING AI FOR SDC IS HARD
Every neural net in our DRIVE
Software stack needs to
handle 1000s of conditions
and geolocations
Hazards, Animals, Bicycles, Pedestrians
Backlit
Snow
Vehicles
Day
Clear, Fog, Rain, Cloudy
Street Lamps, Night, Twilight
37. SDC SCALE TODAY AT NVIDIA
1PB collected/week
12-camera+Radar+Lidar
RIG mounted on 30 cars
4,000 GPUs in cluster
= 500 PFLOPs
1,500 labelers
20M objects labeled/mo
15PB active training
+ test dataset
20 unique models
50 labeling tasks
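The scale figures on this slide can be converted into rates that are easier to plan around. The sketch below assumes decimal units (1 PB = 10^15 bytes) and a perfectly even spread over time and people. Both are simplifying assumptions, and the variable names are illustrative.

```python
# Figures quoted on the slide.
INGEST_PB_PER_WEEK = 1
CARS = 30
CLUSTER_GPUS = 4000
CLUSTER_PFLOPS = 500
LABELERS = 1500
OBJECTS_LABELED_PER_MONTH = 20_000_000

SECONDS_PER_WEEK = 7 * 24 * 3600
BYTES_PER_PB = 10 ** 15  # assumption: decimal petabytes

# Sustained ingest rate needed to absorb 1 PB every week.
ingest_gb_per_s = INGEST_PB_PER_WEEK * BYTES_PER_PB / SECONDS_PER_WEEK / 10 ** 9

# Average data produced per car per week, in terabytes.
tb_per_car_per_week = INGEST_PB_PER_WEEK * 1000 / CARS

# Average compute per GPU implied by the cluster total, in TFLOPS.
tflops_per_gpu = CLUSTER_PFLOPS * 1000 / CLUSTER_GPUS

# Average labeling throughput per person.
objects_per_labeler_per_month = OBJECTS_LABELED_PER_MONTH / LABELERS

print(f"{ingest_gb_per_s:.2f} GB/s sustained ingest")
print(f"{tb_per_car_per_week:.1f} TB per car per week")
print(f"{tflops_per_gpu:.0f} TFLOPS per GPU")
print(f"{objects_per_labeler_per_month:.0f} objects per labeler per month")
```

Even the averaged ingest rate is well over 1 GB/s around the clock, which helps explain why storage and networking get their own sections later in the deck.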
38. 40
“A-HA” MOMENTS IN AI DATA CENTERS
Best Practices from Building the World’s Largest Deep Learning Environments
Rack Design Networking Storage Facilities Software
• DL drives close to operational limits
• Similarities to HPC best practices
• 100GbE preferred
• High-bandwidth, ultra-low latency
• Datasets range from 10k’s to millions of objects
• Terabyte levels of storage and up
• Assume higher watts per rack
• Higher FLOPS/watt = less DC floorspace required
• Scale requires “cluster-aware” software
Example:
• Autonomous vehicle = 1TB / hr
• 100’s PB of raw data
• Billions of total images
• Millions of images for AI training
• 10+ neural nets and 10+ parallel experiments
• 1 DGX-1 runs 1 deep neural net in 1 day
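The example at the end of this slide implies a simple capacity estimate. The sketch below assumes that each experiment is one full training run occupying one DGX-1 for one day, and that the "10+ parallel experiments" apply to each neural net. Both are readings of the slide rather than stated facts, and `dgx1_needed` is a hypothetical helper.

```python
import math


def dgx1_needed(
    nets: int,
    experiments_per_net: int,
    days_per_run: float = 1.0,
    deadline_days: float = 1.0,
) -> int:
    """DGX-1 systems needed to finish all runs within the deadline."""
    total_dgx_days = nets * experiments_per_net * days_per_run
    return math.ceil(total_dgx_days / deadline_days)


# All experiments finished in one day versus spread over a week.
same_day = dgx1_needed(nets=10, experiments_per_net=10)
within_week = dgx1_needed(nets=10, experiments_per_net=10, deadline_days=7)

print(same_day, within_week)
```

With the slide's lower bounds (10 nets, 10 experiments each), a one-day turnaround needs on the order of a hundred DGX-1 systems, while accepting a one-week turnaround cuts that to about fifteen.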
40. 42
GPU READY VS. CPU ONLY DATA CENTER
1/40th the footprint | 1/20th the power
41. 43
1) POWER & COOLING
Cooling Techniques:
● Hot or cold aisle containment
● Rear water door heat exchangers
● Component-level water cooling
Minimize power & floor space needs for better performance efficiency
Improved performance per watt | Improved performance per $$ | Higher Density
42. 44
SAMPLE GPU SERVER
CONFIGURATIONS
DNN, Analytics,
and HPC workloads
Things to consider:
- Characterize peak power loads to avoid unnecessary downtime
- Liquid cool – 3500 times more heat than 10 air-cooled systems
- Component cool – capture 60-80% of server heat, reduce cost by 50%, 2-5x increase in density
- Rule of Thumb: 100 cfm/kW of server load + a 5% overhead for air leakage
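The airflow rule of thumb above is easy to apply directly. Here is a minimal sketch; the 3.5 kW example load is the DGX-1 power figure quoted earlier in the deck, used here as a worst-case illustration, and `required_airflow_cfm` is an illustrative helper name.

```python
CFM_PER_KW = 100.0       # rule of thumb from the slide
LEAKAGE_OVERHEAD = 0.05  # 5% extra for air leakage


def required_airflow_cfm(server_load_kw: float) -> float:
    """Airflow in cubic feet per minute for a given server load in kW."""
    return server_load_kw * CFM_PER_KW * (1.0 + LEAKAGE_OVERHEAD)


# One 3.5 kW server, and nine of them in a rack.
per_server_cfm = required_airflow_cfm(3.5)
nine_servers_cfm = required_airflow_cfm(9 * 3.5)

print(f"{per_server_cfm:.1f} cfm per server")
print(f"{nine_servers_cfm:.1f} cfm for nine servers")
```

The rule scales linearly with load, so the same function serves for a single server or a whole rack.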
MaxQ
43. 45
2) COMPUTE NETWORK
100 Gb Ethernet
Minimize Ethernet Adaptor Load on CPU
Support Cut-Through Communication
Support RDMA
Layer two network - Spine-leaf topology for high bisection bandwidth
Fewer layer three networks – minimize routing bottlenecks
Design localized traffic for scalable apps
EDR (100 Gb) or HDR (200 Gb) InfiniBand
Option 1 - fat-tree networks to maximize the total cluster bandwidth
Option 2 - multiple InfiniBand connections per node for dense GPU nodes
High-bandwidth, low-latency, and highly efficient
44. 46
EXAMPLE: MULTI-NODE SERVER COMPARISON
with different High-Speed Interconnects
EDR InfiniBand is 20X the performance of the 10Gb Ethernet based solution > 2x app performance
45. 47
3) STORAGE REQUIREMENT
Workload dependent
- Support multiple processes accessing the same files simultaneously
- Support many threads and quick access to small pieces of data
- Dominated by reads
- Requires high streaming bandwidth
- Fast random access
- Fast memory mapped (mmap) performance
- Require any combination of fast bandwidth with random and small files
Parallel HPC Applications
Accelerated Analytics
Applications
Vision Based DL Apps Recurrent NN Apps
46. 48
STORAGE ARCHITECTURE CONSIDERATIONS
Use Case | Adequate Read Cache? | Network Type Recommendation | Network File System Options
Data Analytics | N/A | 10 GbE | Object storage, NFS, or other system with good multi-threaded read and small file performance
HPC | N/A | 10/40/100 GbE, InfiniBand | NFS, or HPC-targeted file system with support for a large # of clients and fast single-node performance, support for multi-threaded writes
DL, 256x256 images | Yes | 10 GbE | NFS or storage with good small file support
DL, 1080p images | Yes | 10/40 GbE, InfiniBand | High-end NFS, HPC file system or storage with fast streaming performance
DL, 4k images | Yes | 40 GbE, InfiniBand | HPC file system, high-end NFS or storage with fast streaming performance capable of 3+ GB/s per node
DL, uncompressed images | Yes | InfiniBand, 40/100 GbE | HPC file system, high-end NFS or storage with fast streaming performance capable of 3+ GB/s per node
DL, datasets that are not cached | No | InfiniBand, 10/40/100 GbE | Same as above; aggregate storage performance must scale to meet the needs of all applications simultaneously
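The "3+ GB/s per node" figure helps explain why the larger-image rows move away from 10 GbE. The sketch below converts a per-node storage rate into the smallest Ethernet speed that can carry it, ignoring protocol overhead (a simplifying assumption). The helper names and the nine-node example are illustrative.

```python
from typing import Optional

# Ethernet speeds mentioned in the table, in Gb/s.
LINK_OPTIONS_GBPS = [10, 40, 100]


def smallest_sufficient_link(per_node_gb_per_s: float) -> Optional[int]:
    """Smallest listed link (Gb/s) that covers the per-node rate (GB/s)."""
    needed_gbps = per_node_gb_per_s * 8  # bytes to bits
    for link in LINK_OPTIONS_GBPS:
        if link >= needed_gbps:
            return link
    return None  # none of the listed links is fast enough


def aggregate_read_gb_per_s(nodes: int, per_node_gb_per_s: float) -> float:
    """Total read bandwidth the storage must sustain for all nodes at once."""
    return nodes * per_node_gb_per_s


link_for_3gb = smallest_sufficient_link(3.0)
aggregate_for_nine_nodes = aggregate_read_gb_per_s(9, 3.0)

print(link_for_3gb, aggregate_for_nine_nodes)
```

At 3 GB/s a node needs 24 Gb/s, which exceeds 10 GbE and fits within 40 GbE. The aggregate figure grows linearly with node count, which is the point of the last table row.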
47. Each rack:
9 DGX-1 = 72 TESLA V100 GPUs = 9 PFLOPs
12 CPU nodes for services & data management
1.2PB per rack of cache can front object storage
MAGLEV DATA CENTER ARCHITECTURE
[Diagram: Kubernetes spanning Cloud Provider Object Storage and On Premise Object Storage; MagLev Platform running on four 35kW racks, each with 12 CPU nodes and 9 DGX-1 servers; ingest of 1PB per week, 15PB today]
49. 51
DATA SCIENCE WORKFLOW
All
Data
ETL
Manage Data
Structured
Data Store
Data
Preparation
Training
Model
Training
Visualization
Evaluate
Inference
Deploy
Slow Training Times for
Data Scientists
50. 52
RAPIDS OPEN SOURCE SOFTWARE
Breakthrough performance for data science and machine learning workflows
51. Cloud
Kubernetes over 4000 GPU Cluster (= 480 PFLOPs)
Data
Lake
Selected
Datasets
Data selection
Job #1
Data selection
Job #N
…
Labeled
Datasets
Metrics
& Logs
MAGLEV
End to End Platform to Enable Industry-Grade AI Dev
“Collect ⇨ Select ⇨ Label ⇨ Train ⇨ Test”
as programmatic workflows
Ingest
1PB per
week
15PB
Today
Labeling
UI
Data selection
Job #2
Trained
Models
Training
Job #1
Training
Job #N
Training
Job #2
Testing
Job #1
Testing
Job #N
…
Testing
Job #1
…
ML/Metrics
UI
Run
Multi-Step
Workflow
(workflow =
sequence of
map jobs)
1,500
Labelers
Large
AI Dev team
20M
objects
labeled
per month
20 models
actively
developed
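The slide describes a workflow as a sequence of map jobs. The sketch below illustrates that idea in plain Python. It is not the MagLev API: `run_workflow`, `make_stage`, and the record layout are hypothetical names chosen for this example, and each stage simply records that it ran.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
MapJob = Callable[[Record], Record]


def run_workflow(records: List[Record], jobs: List[MapJob]) -> List[Record]:
    """Apply each map job, in order, to every record."""
    for job in jobs:
        records = [job(record) for record in records]
    return records


def make_stage(name: str) -> MapJob:
    """Build a toy map job that appends its name to the record history."""
    def stage(record: Record) -> Record:
        history = list(record["history"]) + [name]
        return {**record, "history": history}  # new record, input untouched
    return stage


STAGES = ["collect", "select", "label", "train", "test"]
workflow = [make_stage(name) for name in STAGES]

inputs = [{"id": i, "history": []} for i in range(3)]
outputs = run_workflow(inputs, workflow)

print(outputs[0])
```

Because every stage is a pure function of one record, each step can be distributed across a cluster and re-run independently. That is one plausible reason to model workflows this way, and the history field doubles as a simple trace of what was applied.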
52. Code
Repository
App #1
App #2
App #N
Git+CI based
Workflow
Launcher
Traced Asset
Repository
Models
Datasets
Metrics
Code Version
NVIDIA DRIVE Car
4000-GPU Cluster
MAGLEV: AUTOMATION & TRACEABILITY
ML
Developer
Production
Engineer
Empower Prod engineers to run or schedule
complete workflows & version everything
Optimize
app perf
Deploy prod
applications
Publish
Develop
Applications
Run/Debug
Applications
Manual
Workflow
Launcher
Analyze
Experiments/results
54. 57
HIGH-DENSITY COMPUTE REFERENCE ARCH.
Nine DGX-1 Servers
• Eight Tesla V100 GPUs
• NVIDIA GPUDirect™ over RDMA support
• Run at MaxQ
• 100 GbE networking (up to 4 x 100 GbE)
Twelve Storage Nodes
• 192 GB RAM
• 3.8 TB SSD
• 100 TB HDD (1.2 PB Total HDD)
• 50 GbE networking
Network
• In-rack: 100 GbE to DGX-1 servers
• In-rack: 50 GbE to storage nodes
• Out-of-rack: 4 x 100 GbE (up to 8)
Rack
• 35 kW Power
• 42U x 1200 mm x 700 mm (minimum)
• Rear Door Cooler
4 POD design with cooling
DGX-1 POD
• NVIDIA DGX POD™
• Support scalability to hundreds of nodes
• Based on proven SATURNV architecture
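The per-rack totals in this reference architecture can be checked with a few lines of arithmetic. The sketch below uses the rack figures on this slide plus the DGX-1 figures quoted earlier in the deck (1 PFLOPS and 3500 W per system). Treating 3500 W as the draw per DGX-1 is a worst-case assumption, since the slide notes the servers run at MaxQ.

```python
# Rack composition from the reference architecture slide.
DGX1_PER_RACK = 9
STORAGE_NODES_PER_RACK = 12
HDD_TB_PER_STORAGE_NODE = 100
RACK_POWER_KW = 35.0

# DGX-1 figures quoted earlier in the deck.
GPUS_PER_DGX1 = 8
PFLOPS_PER_DGX1 = 1
DGX1_MAX_POWER_KW = 3.5  # assumption: worst case, not the MaxQ setting

gpus_per_rack = DGX1_PER_RACK * GPUS_PER_DGX1
pflops_per_rack = DGX1_PER_RACK * PFLOPS_PER_DGX1
hdd_pb_per_rack = STORAGE_NODES_PER_RACK * HDD_TB_PER_STORAGE_NODE / 1000
dgx_power_kw = DGX1_PER_RACK * DGX1_MAX_POWER_KW
power_left_for_other_nodes_kw = RACK_POWER_KW - dgx_power_kw

print(gpus_per_rack, pflops_per_rack, hdd_pb_per_rack)
print(dgx_power_kw, power_left_for_other_nodes_kw)
```

The GPU, PFLOPS, and HDD totals match the "72 GPUs, 9 PFLOPs, 1.2 PB" figures given for each rack. At worst-case draw the nine DGX-1 systems would use most of the 35 kW budget, which is consistent with the note that they run at MaxQ.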
55. 58
DGX REFERENCE ARCHITECTURE SOLUTIONS
Growing ecosystem of offers for enterprise IT - more to come!
Benefits:
• No more design
guesswork
• Faster, simpler
deployment
• Predictable
performance at
scale
• Simplified, single-
point of support
58. 61
Installed/
running
Problem!
Open source / forum
Open source / forum
Framework?
Libraries?
O/S?
GPU?
Drivers?
Server?
Network?
Storage?
Multiple paths to
problem resolution
Server, Storage & Network
Solution Providers
SUPPORTING AI:
ALTERNATIVE APPROACHES
59. 62
SUPPORTING AI WITH DGX REFERENCE
ARCHITECTURE SOLUTIONS
“Update to PyTorch
container XX.XX”
AI Expertise
DGX
VARs
Running!
Problem!
DGX RA
Solution
Storage
DGX RA
Solution
Storage
“My PyTorch CNN model
is running 30% slower
than yesterday!”
IT Admin
60. 63
THE VALUE OF AI INFRASTRUCTURE
REFERENCE ARCHITECTURES
Reference architectures from
NVIDIA and leading storage partners
SCALABLE
PERFORMANCE
Simplified, validated, converged
infrastructure offers
FASTER, SIMPLIFIED
DEPLOYMENT
TRUSTED EXPERTISE
AND SUPPORT
Available through select partners
as a turnkey solution
DGX RA
Solution
Storage
Effortless Productivity, Best Performance, Lowest TCO
61. 64
PLATFORM IMPACT ON AI/DL TCO
Study & exploration
Platform Design
Productive Experimentation
HW & SW Integration
Troubleshooting
Software eng’g
Software optimization
Design and Build for Scale
Software re-optimization
Insights
Training at Scale
1. Designing and Building an AI Compute Platform – from Scratch
OPEX
CAPEX
Day
1
Month 3
Time and budget spent
on things other than
data science
“DIY”
TCO
Study & exploration
Platform Design
Productive Experimentation
Install and Deploy DGX
Troubleshooting
Software eng’g
Software optimization
Design and Build for Scale
Software re-optimization
Insights
Training at Scale
2. Deploying an Integrated, Full-Stack AI Solution
Day
1
Month 3
“DIY” TCO
CAPEX
DGX
TCO
deployment cycle shortened
Study &
exploration
Insights
2. Deploying an Integrated, Full-Stack AI Solution
Day
1
Week 1
Install and
Deploy
DGX
CAPEX
Productive Experimentation
Training at Scale
“DIY” TCO
DGX
TCO
62. 65
NVIDIA DGX
SYSTEMS
Faster AI Innovation
and Insight
The World’s First Portfolio of
Purpose-Built AI Supercomputers
Powered by NVIDIA GPU Cloud
Get Started in AI – Faster
Effortless Productivity
Performance Without Compromise
For More Information: nvidia.com/dgx