In this deck from the 2019 Stanford HPC Conference, Brett Newman from Microway presents: Architecting the Right System for Your AI Application—without the Vendor Fluff.
"Figuring out how to map your dataset or algorithm to the optimal hardware design is one of the hardest tasks in HPC. We’ll review what helps steer the selection of one system architecture over another for AI applications. Plus the right questions to ask of your collaborators—and a hardware vendor. Honest technical advice, no fluff."
Brett Newman is the VP of Marketing and Customer Engagement at Microway, Inc., a leading systems integrator at the intersection of AI & HPC.
Since 1982, customers have trusted Microway to design and deliver solutions that keep them at the bleeding edge of supercomputing. Brett is part of a broad Microway team with proven technical ability that architects & builds unique hardware configurations tuned for users’ applications.
Brett has served in many roles in HPC—as a cluster architect, as part of the IBM HPC group, and in product marketing focused solely on materials and resources with serious technical “street cred.”
Watch the video: https://youtu.be/H4HrAskDwno
Learn more: https://www.microway.com/hpc-tech-tips/designing-a-production-class-ai-cluster/
and
http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
1. Architecting the Right System for Your AI Application—without the Vendor Fluff
Brett Newman
VP Marketing & Customer Engagement
Microway, Inc.
wespeakhpc@microway.com
2. Where We’re Headed
1. Before You Start
• What do you know: Datasets, Algorithms, Collaborators
2. How to Select A System
• Common training, mixed workloads, datasets too large, don’t know
3. Collaborating with Vendors
• Who, where, and what to look for
3. Who is This For?
End Users Who:
1. Don’t know where to start
2. Need a “checklist”
3. Afraid of/ hate working with vendors
4. Hate being sold to
Not for:
1. AI Framework Writers
2. 10+ year ninja GPU coders
5. What Do You Know?
About Your Dataset:
○ Size – overall
○ Chunkable? (batch size)
○ Size – individual datum
[Figure: oversimplified dataset-sizing example, illustrated with the Mona Lisa. Overall dataset: 128GB; example chunk/datum sizes: 16GB; 32GB + 32GB + 32GB + 32GB; 8GB.]
Image Credit: By Leonardo da Vinci - Cropped and relevelled from File:Mona Lisa, by Leonardo da Vinci, from C2RMF.jpg. Originally C2RMF: Galerie de tableaux en très haute définition: image page, Public Domain, https://commons.wikimedia.org/w/index.php?curid=15442524
Visual Idea Inspiration Credit: Scott Soutter, IBM
Oversimplified example: Overall: 128GB → 1 multi-GPU server; various Tesla V100 systems; or POWER9 w/NVLink (or pre-process).
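The arithmetic behind the oversimplified example can be sketched as follows. This is an illustration only: real GPU memory needs also include model weights, activations, and framework overhead, which this ignores.

```python
# Sketch: does a dataset (or a batch of it) fit a given GPU's memory?
# All figures are from the slide's oversimplified 128GB example.

def batches_needed(dataset_gb, batch_gb):
    """Number of chunks when the dataset is split into fixed-size batches."""
    return -(-dataset_gb // batch_gb)  # ceiling division

def fits_single_gpu(chunk_gb, gpu_mem_gb):
    """Can one chunk be staged entirely in one GPU's memory?"""
    return chunk_gb <= gpu_mem_gb

dataset_gb = 128  # overall dataset from the slide

# Chunkable into 32GB batches -> four batches, each fitting a 32GB V100.
assert batches_needed(dataset_gb, 32) == 4
assert fits_single_gpu(32, 32)

# A 16GB datum fits a single 32GB GPU, but the whole 128GB dataset does
# not -- which is why the example spreads across a multi-GPU server.
assert fits_single_gpu(16, 32)
assert not fits_single_gpu(dataset_gb, 32)
```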
6. What Do You Know? About Your Algorithm
○ Standard Framework vs. Custom Algorithm
○ Have You Run Any Profilers/Tools?
[Diagram: PCI-E Switching OR CPU:GPU NVLink; denser, NVLink-interconnected systems (+10-20% on training); mixed workload, ex: Molecular Dynamics + AI Simulation Refinement]
Tool Examples: NVProf, Allinea Perf Tools, Intel Visual Profiler
7. What Do You Know? About Your Collaborators
○ Running on what HW?
○ Using larger facilities? Ex: Summit @ ORNL
9. Algorithm: Solely AI Training, Common Frameworks
• Primary: NVLink-connected systems, with GPU count matched to dataset scale/budget
• Secondary: PCI-E systems (switched), with GPU count matched to dataset scale/budget
[Diagram: dataset size (w/ batches <32GB) mapped to 4, 8, or 16 GPUs with NVLink; NVLink: 10-20% training perf. increase]
10. Greatest Ease of Use with Perf., AI Training
DGX-Station (4 GPUs), DGX-1 (8 GPUs), DGX-2 (16 GPUs)
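The sizing heuristic on the previous two slides can be written down as a small sketch. The thresholds below are illustrative assumptions (aggregate memory of 32GB GPUs), not vendor guidance, and the `SYSTEM_FOR_COUNT` mapping simply mirrors the slide's 4/8/16-GPU tiers.

```python
# Hedged sketch: pick an NVLink GPU count by dataset scale, then map it
# to the matching DGX class from the slide. Assumes 32GB GPUs and that
# the dataset should fit in aggregate GPU memory.

def gpu_count_for(dataset_gb, gpu_mem_gb=32):
    """Smallest standard NVLink GPU count (4, 8, or 16) whose aggregate
    memory covers the dataset; anything larger still returns 16, where
    multi-node or POWER9/NVLink designs should be considered instead."""
    for count in (4, 8, 16):
        if dataset_gb <= count * gpu_mem_gb:
            return count
    return 16

SYSTEM_FOR_COUNT = {4: "DGX-Station", 8: "DGX-1", 16: "DGX-2"}

assert gpu_count_for(100) == 4   # 100GB fits in 4 x 32GB
assert gpu_count_for(200) == 8   # needs 8 x 32GB
assert SYSTEM_FOR_COUNT[gpu_count_for(500)] == "DGX-2"
```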
11. Mixed Workloads or Small Datasets
• Balanced systems (2 sockets, full/half populated 2-4 GPUs)
• Greatest flexibility & expandability
12. Dataset: Too Large/Non “Chunkable”
• POWER9 systems with coherency + CPU:GPU NVLink (5X BW)
• Switched PCI-E tree + custom algorithms with Unified Memory
[Diagram: POWER9 with NVLink; 8 GPUs with switches]
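When the data *is* chunkable, a plain streaming loop sidesteps the need for coherency or Unified Memory: each batch is copied to the device, processed, and discarded. A minimal host-side sketch follows; `process` is a placeholder standing in for "copy to GPU, run kernel, copy result back."

```python
# Sketch of chunked streaming -- the alternative to relying on CUDA
# Unified Memory or CPU:GPU coherency for datasets too large for one GPU.

def stream_in_chunks(data, chunk_size):
    """Yield successive fixed-size slices of a dataset."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]

def process(chunk):
    # Placeholder for the real per-batch GPU work.
    return sum(chunk)

totals = [process(c) for c in stream_in_chunks(list(range(10)), chunk_size=4)]
# Three chunks: [0..3], [4..7], [8..9]
```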
13. Don’t Know, Can’t Find Out
1. Test it! If at all possible
Upgrading from Fermi, Kepler > most system architecture choices
2. No Matter Your Choice…
GPU acceleration > CPU systems (5X-50X)
Good, Better, Best
15. Vendors: Who to Look For?
People & Titles
○ Technical Sales
○ Solution Engineer
○ Anyone who proves they know something
○ Anyone with proven access to hardware
16. Vendors: Who to Look For?
In Tier 1 Vendors
○ Find: HPC or AI Groups, exclusively (hard)
○ Avoid: general sellers, laptop/networking guy
In Tier 2 Vendors
○ Find: Established AI/HPC Vendors
○ Avoid: parts resellers/limited integration shops
○ Find: NVIDIA NPN Elite Deep Learning Partners
17. Vendors: What to Look For/Signals
Signals:
○ Ask for testing/benchmarking
○ Ask to see HW architecture of solution (back of napkin OK)
○ Spending time on phone, email, or in person?
Don’t work with someone who doesn’t understand what you’re talking about!
18. Vendors: Strategies For a Better Engagement
Overshare
○ Every piece of data: about data, algorithm/code, your goals
○ About what is working/isn’t working today
○ About what you own
Discuss Collaborators
○ What do they own?
○ Need to plan to run together?
State Realistic Plans for Flexibility/Expansion
20. What We Talked About
1. Before You Start
• What do you know: Datasets, Algorithms, Collaborators
2. How to Select A System
• Datasets too large, common training, mixed workloads, don’t know
3. Collaborating with Vendors
• Who, where, and what to look for
22. So, Less Confused?
Gain confidence to Solve the AI HW Puzzle
The Best Vendors are Partners & Here to Help!
microway.com/gpu-test-drive/
microway.com/configure-your-solution
calendly.com/microway/schedule-a-consulation
GPU Solutions Guide
What’s the overall size of your whole dataset? Does it fit into a single GPU, or is it definitely a number of GPUs? Is it multi-system?
Chunkable—the professional term is whether you can set a reasonable batch size. Does your data fit into chunks the size of a GPU (or a portion of one)?
Individual datum—sometimes a single datum is so large it won’t fit at all. That’s a case for specialized code or specialized HW to compensate: writing your code to manage data with CUDA Unified Memory, or better yet purchasing a POWER9 with NVLink system.
Similarly, if you are using image data of fairly large size (or, more likely, a batch of many smaller images), it’s likely a case for a 32GB Tesla GPU.
PCI-E switching.
Why CPU:GPU NVLink? If you can’t write efficiently chunked code, the coherent, high-bandwidth CPU:GPU link compensates.
End users underweight this.
They are so focused on the concrete hardware value (how much it costs, what’s my complicated price/performance calculation) that they miss the efficacy metric.
If you and a primary collaborator need to dramatically change your ETL steps or even your runtime instructions to perform similar runs, then you’re getting far less time out of your expensive hardware. Matching each other is hugely important.
Similarly, if you have opportunity for larger runs or dedicated time on a larger machine, matching this is critical.