In this deck from the 2019 Stanford HPC Conference, Brett Newman from Microway presents: Architecting the Right System for Your AI Application—without the Vendor Fluff.
"Figuring out how to map your dataset or algorithm to the optimal hardware design is one of the hardest tasks in HPC. We’ll review what helps steer the selection of one system architecture over another for AI applications. Plus the right questions to ask of your collaborators—and a hardware vendor. Honest technical advice, no fluff."
Brett Newman is the VP of Marketing and Customer Engagement at Microway, Inc., a leading systems integrator at the intersection of AI & HPC.
Since 1982, customers have trusted Microway to design and deliver solutions that keep them at the bleeding edge of supercomputing. Brett is part of a broad Microway team with proven technical ability that architects & builds unique hardware configurations tuned for users’ applications.
Brett has served in many roles in HPC—as a cluster architect, as part of the IBM HPC group, and in product marketing focused solely on materials and resources with serious technical “street cred.”
Watch the video: https://youtu.be/H4HrAskDwno
Learn more: https://www.microway.com/hpc-tech-tips/designing-a-production-class-ai-cluster/
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
1. Architecting the Right System for Your AI Application—without the Vendor Fluff
Brett Newman
VP Marketing & Customer Engagement
Microway, Inc.
wespeakhpc@microway.com
2. Where We’re Headed
1. Before You Start
• What do you know: Datasets, Algorithms, Collaborators
2. How to Select A System
• Common training, mixed workloads, datasets too large, don’t know
3. Collaborating with Vendors
• Who, where, and what to look for
3. Who is This For?
End Users Who:
1. Don’t know where to start
2. Need a “checklist”
3. Afraid of/ hate working with vendors
4. Hate being sold to
Not for:
1. AI Framework Writers
2. 10+ year ninja GPU coders
5. What Do You Know?
About Your Dataset:
○ Size – overall
○ Chunkable? (batch size)
○ Size – individual datum
[Figure: oversimplified dataset-sizing example, illustrated with the Mona Lisa. Overall dataset: 128GB; example chunk/datum sizes: 16GB; 32GB + 32GB + 32GB + 32GB; 8GB.]
Image Credit: By Leonardo da Vinci - Cropped and relevelled from File:Mona Lisa, by Leonardo da Vinci, from C2RMF.jpg. Originally C2RMF: Galerie de tableaux en très haute définition: image page, Public Domain, https://commons.wikimedia.org/w/index.php?curid=15442524
Visual Idea Inspiration Credit: Scott Soutter, IBM
Oversimplified example: Overall: 128GB → 1 multi-GPU server; various Tesla V100 systems; or POWER9 w/NVLink (or pre-process).
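The arithmetic behind the oversimplified example can be sketched as follows. This is an illustration only: real GPU memory needs also include model weights, activations, and framework overhead, which this ignores.

```python
# Sketch: does a dataset (or a batch of it) fit a given GPU's memory?
# All figures are from the slide's oversimplified 128GB example.

def batches_needed(dataset_gb, batch_gb):
    """Number of chunks when the dataset is split into fixed-size batches."""
    return -(-dataset_gb // batch_gb)  # ceiling division

def fits_single_gpu(chunk_gb, gpu_mem_gb):
    """Can one chunk be staged entirely in one GPU's memory?"""
    return chunk_gb <= gpu_mem_gb

dataset_gb = 128  # overall dataset from the slide

# Chunkable into 32GB batches -> four batches, each fitting a 32GB V100.
assert batches_needed(dataset_gb, 32) == 4
assert fits_single_gpu(32, 32)

# A 16GB datum fits a single 32GB GPU, but the whole 128GB dataset does
# not -- which is why the example spreads across a multi-GPU server.
assert fits_single_gpu(16, 32)
assert not fits_single_gpu(dataset_gb, 32)
```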
6. What Do You Know? About Your Algorithm
○ Standard Framework vs. Custom Algorithm
○ Have You Run Any Profilers/Tools?
[Diagram: PCI-E Switching OR CPU:GPU NVLink; denser, NVLink-interconnected systems (+10-20% on training); mixed workload, ex: Molecular Dynamics + AI Simulation Refinement]
Tool Examples: NVProf, Allinea Perf Tools, Intel Visual Profiler
7. What Do You Know? About Your Collaborators
○ Running on what HW?
○ Using larger facilities? Ex: Summit @ ORNL
9. Algorithm: Solely AI Training, Common Frameworks
• Primary: NVLink-connected systems, with GPU count matched to dataset scale/budget
• Secondary: PCI-E systems (switched), with GPU count matched to dataset scale/budget
[Diagram: dataset size (w/ batches <32GB) mapped to 4, 8, or 16 GPUs with NVLink; NVLink: 10-20% training perf. increase]
10. Greatest Ease of Use with Perf., AI Training
DGX-Station (4 GPUs), DGX-1 (8 GPUs), DGX-2 (16 GPUs)
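The sizing heuristic on the previous two slides can be written down as a small sketch. The thresholds below are illustrative assumptions (aggregate memory of 32GB GPUs), not vendor guidance, and the `SYSTEM_FOR_COUNT` mapping simply mirrors the slide's 4/8/16-GPU tiers.

```python
# Hedged sketch: pick an NVLink GPU count by dataset scale, then map it
# to the matching DGX class from the slide. Assumes 32GB GPUs and that
# the dataset should fit in aggregate GPU memory.

def gpu_count_for(dataset_gb, gpu_mem_gb=32):
    """Smallest standard NVLink GPU count (4, 8, or 16) whose aggregate
    memory covers the dataset; anything larger still returns 16, where
    multi-node or POWER9/NVLink designs should be considered instead."""
    for count in (4, 8, 16):
        if dataset_gb <= count * gpu_mem_gb:
            return count
    return 16

SYSTEM_FOR_COUNT = {4: "DGX-Station", 8: "DGX-1", 16: "DGX-2"}

assert gpu_count_for(100) == 4   # 100GB fits in 4 x 32GB
assert gpu_count_for(200) == 8   # needs 8 x 32GB
assert SYSTEM_FOR_COUNT[gpu_count_for(500)] == "DGX-2"
```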
11. Mixed Workloads or Small Datasets
• Balanced systems (2 sockets, full/half populated 2-4 GPUs)
• Greatest flexibility & expandability
12. Dataset: Too Large/Non “Chunkable”
• POWER9 systems with coherency + CPU:GPU NVLink (5X BW)
• Switched PCI-E tree + custom algorithms with Unified Memory
[Diagram: POWER9 with NVLink; 8 GPUs with switches]
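When the data *is* chunkable, a plain streaming loop sidesteps the need for coherency or Unified Memory: each batch is copied to the device, processed, and discarded. A minimal host-side sketch follows; `process` is a placeholder standing in for "copy to GPU, run kernel, copy result back."

```python
# Sketch of chunked streaming -- the alternative to relying on CUDA
# Unified Memory or CPU:GPU coherency for datasets too large for one GPU.

def stream_in_chunks(data, chunk_size):
    """Yield successive fixed-size slices of a dataset."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]

def process(chunk):
    # Placeholder for the real per-batch GPU work.
    return sum(chunk)

totals = [process(c) for c in stream_in_chunks(list(range(10)), chunk_size=4)]
# Three chunks: [0..3], [4..7], [8..9]
```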
13. Don’t Know, Can’t Find Out
1. Test it! If at all possible
Upgrading from Fermi, Kepler > most system architecture choices
2. No Matter Your Choice…
GPU acceleration > CPU systems (5X-50X)
Good, Better, Best
15. Vendors: Who to Look For?
People & Titles
○ Technical Sales
○ Solution Engineer
○ Anyone who proves they know something
○ Anyone with proven access to hardware
16. Vendors: Who to Look For?
In Tier 1 Vendors
○ Find: HPC or AI Groups, exclusively (hard)
○ Avoid: general sellers, laptop/networking guy
In Tier 2 Vendors
○ Find: Established AI/HPC Vendors
○ Avoid: parts resellers/limited integration shops
○ Find: NVIDIA NPN Elite Deep Learning Partners
17. Vendors: What to Look For/Signals
Signals:
○ Ask for testing/benchmarking
○ Ask to see HW architecture of solution (back of napkin OK)
○ Spending time on phone, email, or in person?
Don’t work with someone who doesn’t understand what you’re talking about!
18. Vendors: Strategies For a Better Engagement
Overshare
○ Every piece of data: about data, algorithm/code, your goals
○ About what is working/isn’t working today
○ About what you own
Discuss Collaborators
○ What do they own?
○ Need to plan to run together?
State Realistic Plans for Flexibility/Expansion
20. What We Talked About
1. Before You Start
• What do you know: Datasets, Algorithms, Collaborators
2. How to Select A System
• Datasets too large, common training, mixed workloads, don’t know
3. Collaborating with Vendors
• Who, where, and what to look for
22. So, Less Confused?
Gain confidence to Solve the AI HW Puzzle
The Best Vendors are Partners & Here to Help!
microway.com/gpu-test-drive/
microway.com/configure-your-solution
calendly.com/microway/schedule-a-consulation
GPU Solutions Guide
What’s the overall size of your whole dataset? Does it fit into a single GPU, or is it definitely a number of GPUs? Is it multi-system?
Chunkable—the professional term is whether you can set a reasonable batch size. Does your data fit into chunks the size of a GPU (or a portion of one)?
Individual datum—sometimes a single datum is so large it won’t fit at all. That’s a case for specialized code or specialized HW to compensate: writing your code to manage data with CUDA Unified Memory, or better yet purchasing a POWER9 with NVLink system.
Similarly, if you are using image data of fairly large size (or, more likely, a batch of many smaller images), it’s likely a case for a 32GB Tesla GPU.
PCI-E switching.
Why CPU:GPU NVLink? If you can’t write efficiently chunked code, the coherent, high-bandwidth CPU:GPU link compensates.
End users underweight this.
They are so focused on the concrete hardware value (how much it costs, what’s my complicated price/performance calculation) that they miss the efficacy metric.
If you and a primary collaborator need to dramatically change your ETL steps or even your runtime instructions to perform similar runs, then you’re getting far less time out of your expensive hardware. Matching each other is hugely important.
Similarly, if you have opportunity for larger runs or dedicated time on a larger machine, matching this is critical.