Architecting the Right
System for Your AI
Application—without the Vendor Fluff
Brett Newman
VP Marketing & Customer Engagement
Microway, Inc.
wespeakhpc@microway.com
Where We’re Headed
1. Before You Start
• What do you know: Datasets, Algorithms, Collaborators
2. How to Select A System
• Common training, mixed workloads, datasets too large, don’t know
3. Collaborating with Vendors
• Who, where, and what to look for
Who is This For?
End Users Who:
1. Don’t know where to start
2. Need a “checklist”
3. Are afraid of, or hate, working with vendors
4. Hate being sold to
Not for:
1. AI Framework Writers
2. 10+ year ninja GPU coders
Before You Start
What Do You Know?
About Your Dataset:
○ Size – overall
○ Chunkable? (batch size)
○ Size – individual datum
Oversimplified Example (figure): an image standing in for a dataset, labeled Overall: 128GB, shown split as 32GB + 32GB + 32GB + 32GB, with 16GB and 8GB pieces. System labels: 1 multi-GPU server; POWER9 w/NVLink or pre-process; various Tesla V100 systems.
Image Credit: By Leonardo da Vinci - Cropped and relevelled from File:Mona Lisa, by Leonardo da Vinci, from C2RMF.jpg. Originally C2RMF: Galerie de tableaux en très haute définition: image page, Public Domain, https://commons.wikimedia.org/w/index.php?curid=15442524
Visual Idea Inspiration Credit: Scott Soutter, IBM
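The dataset questions above (overall size, chunkable or not, chunk size) can be sketched as a rough fit check. This is a minimal sketch, not a sizing tool: the function name `plan_batches` is hypothetical, the 32GB figure is borrowed from the slide's 32GB pieces, and the check ignores memory needed for model weights and activations.

```python
import math


def plan_batches(dataset_gb: float, gpu_memory_gb: float, chunkable: bool) -> dict:
    """Rough fit check: how many GPU-memory-sized chunks does a dataset need?

    Assumption: only the raw data size is counted; model weights and
    activations are ignored, so real headroom will be smaller.
    """
    if dataset_gb <= gpu_memory_gb:
        return {"fits_on_one_gpu": True, "chunks": 1, "chunk_gb": dataset_gb}
    if not chunkable:
        # Cannot be split: this is the "too large / non-chunkable" case.
        return {"fits_on_one_gpu": False, "chunks": None, "chunk_gb": None}
    chunks = math.ceil(dataset_gb / gpu_memory_gb)
    return {"fits_on_one_gpu": False, "chunks": chunks, "chunk_gb": dataset_gb / chunks}


# The slide's example: 128GB overall, split into 32GB pieces.
plan = plan_batches(dataset_gb=128, gpu_memory_gb=32, chunkable=True)
print(plan)
```

With the slide's numbers, the 128GB dataset splits into four 32GB chunks. A non-chunkable dataset returns no chunk plan, which is the cue to look at the "too large" guidance later in the deck.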
What Do You Know?
About Your Algorithm
○ Standard Framework vs. Custom Algorithm
○ Have You Run Any Profilers/Tools?
Architecture options shown:
○ PCI-E Switching OR CPU:GPU NVLink
○ Denser, NVLink Interconnected (+10-20% on training)
○ Mixed Workload (Ex: Molecular Dynamics + AI Simulation Refinement)
Tool Examples:
○ NVProf
○ Allinea Perf Tools
○ Intel Visual Profiler
What Do You Know?
About Your Collaborators
○ Running on what HW?
○ Using Larger facilities?
Ex: Summit @ ORNL
Basic Guidance to Architecting Your AI System
Algorithm: Solely AI Training, Common Frameworks
• Primary: NVLink-connected systems, with GPU count matched to dataset scale/budget
• Secondary: PCI-E systems (switched), with GPU count matched to dataset scale/budget
Scaling with Dataset Size (w/ batches <32GB): 4 GPUs with NVLink, 8 GPUs with NVLink, 16 GPUs with NVLink
NVLink: 10-20% training perf. increase
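One way to read "GPU count to dataset scale" is as a lookup from dataset size to the 4, 8, and 16 GPU configurations named above. The sketch below assumes one batch of up to 32GB per GPU and that the whole dataset should be resident across the GPUs at once. Both are simplifying assumptions for illustration, `pick_gpu_count` is a hypothetical helper, and budget is not modeled.

```python
import math

# GPU counts named on the slide for NVLink-connected systems.
NVLINK_CONFIGS = (4, 8, 16)


def pick_gpu_count(dataset_gb, batch_gb_per_gpu=32, configs=NVLINK_CONFIGS):
    """Return the smallest listed GPU count that holds the dataset.

    Assumption: one batch of at most `batch_gb_per_gpu` per GPU, and the
    full dataset spread across the GPUs. Returns None when even the
    largest listed configuration is too small.
    """
    needed = math.ceil(dataset_gb / batch_gb_per_gpu)
    for count in configs:
        if count >= needed:
            return count
    return None


for size_gb in (128, 200, 500, 1000):
    print(size_gb, "GB ->", pick_gpu_count(size_gb))
```

A result of `None` means the dataset outgrows a single listed system under these assumptions, which points to multi-node designs or to the "too large" guidance.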
Greatest Ease of Use with Perf., AI Training
• DGX-Station (4 GPUs)
• DGX-1 (8 GPUs)
• DGX-2 (16 GPUs)
Mixed Workloads or Small Datasets
• Balanced systems (2 sockets, full/half populated 2-4 GPUs)
• Greatest flexibility & expandability
Dataset: Too Large/Non “Chunkable”
• POWER9 Systems with Coherency + CPU:GPU NVLink (5X BW)
• Switched PCI-E Tree + Custom Algorithms with Unified Memory
Systems pictured: POWER9 with NVLink; 8 GPUs with Switches
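To see why a "5X BW" CPU:GPU link matters when the data cannot stay in GPU memory, a back-of-envelope transfer-time calculation helps. The 5X factor comes from the slide. The 16 GB/s baseline is an illustrative assumption, not a measured figure, and real throughput depends on the system.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move `size_gb` of data over a link, ignoring latency and overlap."""
    return size_gb / bandwidth_gb_per_s


BASELINE_GB_PER_S = 16.0  # assumption: illustrative baseline CPU:GPU link
NVLINK_FACTOR = 5.0       # the "5X BW" figure quoted on the slide

dataset_gb = 128.0
baseline = transfer_seconds(dataset_gb, BASELINE_GB_PER_S)
nvlink = transfer_seconds(dataset_gb, BASELINE_GB_PER_S * NVLINK_FACTOR)
print(f"baseline link: {baseline:.1f} s, 5X link: {nvlink:.1f} s per full pass")
```

The absolute numbers are only as good as the assumed baseline. The point is that every pass over data living in host memory pays this cost, so the ratio carries through to each epoch.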
Don’t Know, Can’t Find Out
1. Test it! If at all possible
Upgrading from Fermi or Kepler GPUs matters more than most system architecture choices
2. No Matter Your Choice…
GPU acceleration > CPU systems (5X-50X)
Good, Better, Best
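The guidance in this section can be condensed into a small checklist function. This is a sketch of one possible reading of the slides, not a definitive selection procedure: `recommend_system` and its parameters are hypothetical, and the order of the checks (non-chunkable data first) is an assumption.

```python
from typing import Optional


def recommend_system(workload: Optional[str],
                     chunkable: Optional[bool],
                     small_dataset: bool = False) -> str:
    """Map what you know onto the deck's four guidance cases.

    workload: "training" (solely AI training, common frameworks),
              "mixed", or None if unknown.
    chunkable: False if the dataset is too large and cannot be batched;
               None if unknown.
    """
    if chunkable is False:
        return "POWER9 with CPU:GPU NVLink, or switched PCI-E tree with Unified Memory"
    if workload == "mixed" or small_dataset:
        return "Balanced 2-socket system with 2-4 GPUs"
    if workload == "training":
        return "NVLink-connected system (switched PCI-E as secondary)"
    return "Don't know: test it if at all possible"


print(recommend_system("training", chunkable=True))
print(recommend_system("mixed", chunkable=True))
print(recommend_system(None, chunkable=None))
```

Treat the output as a conversation starter with a vendor, not a purchase decision. The slides themselves recommend testing whenever possible.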
Collaborating with Vendors
Vendors: Who to Look For?
People & Titles
○ Technical Sales
○ Solution Engineer
○ Anyone who proves they know something
○ Anyone with proven access to hardware
Vendors: Who to Look For?
In Tier 1 Vendors
○ Find: HPC or AI Groups, exclusively (hard)
○ Avoid: general sellers, laptop/networking guy
In Tier 2 Vendors
○ Find: Established AI/HPC Vendors
○ Avoid: parts resellers/limited integration shops
○ Find: NVIDIA NPN Elite Deep Learning Partners
Vendors: What to Look For/Signals
Signals:
○ Ask for testing/benchmarking
○ Ask to see HW architecture of solution
(back of napkin OK)
○ Spending time on phone, email, or in person?
Don’t work with someone who doesn’t understand what you’re talking about!
Vendors: Strategies For a Better Engagement
Overshare
○ Every piece of data: about data, algorithm/code, your goals
○ About what is working/isn’t working today
○ About what you own
Discuss Collaborators
○ What do they own?
○ Need to plan to run together?
State Realistic Plans for Flexibility/Expansion
Review
What we Talked About
1. Before You Start
• What do you know: Datasets, Algorithms, Collaborators
2. How to Select A System
• Datasets too large, common training, mixed workloads, don’t know
3. Collaborating with Vendors
• Who, where, and what to look for
Real Experts, Real Deliveries
So, Less Confused?
Gain confidence to Solve the AI HW Puzzle
The Best Vendors are Partners & Here to Help!
microway.com/gpu-test-drive/
microway.com/configure-your-solution
calendly.com/microway/schedule-a-consulation
GPU Solutions Guide
Microway designs and builds fully-integrated clusters, servers, and
workstations. For 35 years, we have delivered high-performance
systems for data analytics, cognitive systems, research, and AI.
Leverage our expertise – We Speak HPC & AI
© Copyright 2019 Microway. All Rights Reserved.
Experts in High Performance Computing
http://www.microway.com
508-746-7341


Architecting the Right System for Your AI Application—without the Vendor Fluff

  • 1. Architecting the Right System for Your AI Application—without the Vendor Fluff Brett Newman VP Marketing & Customer Engagement Microway, Inc. wespeakhpc@microway.com
  • 2. Where We’re Headed 1. Before You Start • What do you know: Datasets, Algorithms, Collaborators 2. How to Select A System • Common training, mixed workloads, datasets too large, don’t know 3. Collaborating with Vendors • Who, where, and what to look for
  • 3. Who is This For? End Users Who: 1. Don’t know where to start 2. Need a “checklist” 3. Afraid of/ hate working with vendors 4. Hate being sold to Not for: 1. AI Framework Writers 2. 10+ year ninja GPU coders
  • 5. What Do You Know? About Your Dataset: ○ Size – overall ○ Chunkable? (batch size) ○ Size – individual datum. Oversimplified Example: Overall: 128GB; 16GB; 32GB + 32GB + 32GB + 32GB; 8GB; 1 multi GPU server; POWER9 w/NVLink or pre-process; Various Tesla V100 systems. Image Credit: By Leonardo da Vinci - Cropped and relevelled from File:Mona Lisa, by Leonardo da Vinci, from C2RMF.jpg. Originally C2RMF: Galerie de tableaux en très haute définition: image page, Public Domain, https://commons.wikimedia.org/w/index.php?curid=15442524 Visual Idea Inspiration Credit: Scott Soutter, IBM
  • 6. About Your Algorithm ○ Standard Framework vs. Custom Algorithm ○ Have You Run Any Profilers/Tools? PCI-E Switching OR CPU:GPU NVLink Denser, NVLink Interconnected (+10-20% on training) Mixed Workload Ex: Molecular Dynamics + AI Simulation Refinement NVProf Allinea Perf Tools Intel Visual Profiler What Do You Know? Tool Examples
  • 7. What Do You Know? About Your Collaborators ○ Running on what HW? ○ Using Larger facilities? Ex: Summit @ ORNL
  • 9. Algorithm: Solely AI Training, Common Frameworks • Primary: NVLink connected systems, with GPU count to dataset scale/ budget • Secondary: PCI-E systems (switched) with GPU count to dataset scale/ budget 4 GPUs with NVLink 8 GPUs with NVLink 16 GPUs with NVLink Dataset Size (w/ batches <32GB) NVLink: 10-20% training perf. increase
  • 10. Greatest Ease of Use with Perf., AI Training DGX-Station (4 GPUs) DGX-1 (8 GPUs) DGX-2 (16 GPUs)
  • 11. Mixed Workloads or Small Datasets • Balanced systems (2 sockets, full/half populated 2-4 GPUs) • Greatest flexibility & expandability
  • 12. Dataset: Too Large/Non “Chunkable” • POWER9 Systems with Coherency + CPU:GPU NVLink (5X BW) • Switched PCI-E Tree + Custom Algorithms with Unified Memory. Pictured: POWER9 with NVLink; 8 GPUs with Switches
  • 13. Don’t Know, Can’t Find Out 1. Test it! If at all possible Upgrading from Fermi, Kepler > most system architecture choices 2. No Matter Your Choice… GPU acceleration > CPU systems (5X-50X) Good, Better, Best
  • 15. Vendors: Who to Look For? People & Titles ○ Technical Sales ○ Solution Engineer ○ Anyone who proves they know something ○ Anyone with proven access to hardware
  • 16. Vendors: Who to Look For? In Tier 1 Vendors ○ Find: HPC or AI Groups, exclusively (hard) ○ Avoid: general sellers, laptop/networking guy In Tier 2 Vendors ○ Find: Established AI/HPC Vendors ○ Avoid: parts resellers/limited integration shops ○ Find: NVIDIA NPN Elite Deep Learning Partners
  • 17. Vendors: What to Look For/Signals Signals: ○ Ask for testing/benchmarking ○ Ask to see HW architecture of solution (back of napkin OK) ○ Spending time on phone, email, or in person? Don’t work with someone who doesn’t understand what you’re talking about!
  • 18. Vendors: Strategies For a Better Engagement Overshare ○ Every piece of data: about data, algorithm/code, your goals ○ About what is working/isn’t working today ○ About what you own Discuss Collaborators ○ What do they own? ○ Need to plan to run together? State Realistic Plans for Flexibility/Expansion
  • 20. What we Talked About 1. Before You Start • What do you know: Datasets, Algorithms, Collaborators 2. How to Select A System • Datasets too large, common training, mixed workloads, don’t know 3. Collaborating with Vendors • Who, where, and what to look for
  • 21. Real Experts, Real Deliveries
  • 22. So, Less Confused? Gain confidence to Solve the AI HW Puzzle. The Best Vendors are Partners & Here to Help! microway.com/gpu-test-drive/ microway.com/configure-your-solution calendly.com/microway/schedule-a-consulation GPU Solutions Guide
  • 23. Microway designs and builds fully-integrated clusters, servers, and workstations. For 35 years, we have delivered high-performance systems for data analytics, cognitive systems, research, and AI. Leverage our expertise – We Speak HPC & AI © Copyright 2019 Microway. All Rights Reserved. Experts in High Performance Computing http://www.microway.com 508-746-7341
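
The selection guidance on slides 9 through 13 can be read as a small decision procedure. The sketch below is only an illustration of that reading: the function name `suggest_system`, its parameters, and the returned strings are hypothetical and do not come from the talk, and the branch order (non-chunkable data is checked first) is an assumption.

```python
from typing import Optional


def suggest_system(solely_training: Optional[bool],
                   common_framework: Optional[bool],
                   chunkable: Optional[bool]) -> str:
    """Map answers to the 'What Do You Know?' questions onto the slide guidance.

    Pass None for anything you don't know. All names here are hypothetical.
    """
    # Slide 13: if you don't know and can't find out, test before choosing.
    if None in (solely_training, common_framework, chunkable):
        return "Don't know: test it if at all possible"
    # Slide 12: dataset too large or not chunkable into GPU-sized batches.
    if not chunkable:
        return ("POWER9 with CPU:GPU NVLink, or switched PCI-E "
                "with custom Unified Memory code")
    # Slide 9: solely AI training with common frameworks.
    if solely_training and common_framework:
        return "NVLink-connected multi-GPU system (switched PCI-E as secondary)"
    # Slide 11: mixed workloads or small datasets.
    return "Balanced 2-socket system with 2-4 GPUs"


print(suggest_system(True, True, True))
print(suggest_system(False, True, True))
print(suggest_system(True, True, None))
```

A real choice also weighs budget and GPU count against dataset scale, which this sketch leaves out.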

Editor's Notes

  1. What’s the overall size of your whole dataset? Does it fit into a single GPU, or does it definitely need a number of GPUs? Is it multi-system? Chunkable: the professional term is whether you can set a reasonable batch size. Does your data fit into chunks the size of a GPU (or a portion of one)? Individual datum: sometimes your data is so large it won’t fit at all. That’s a case for specialized code or specialized HW to compensate: writing your code to manage data with CUDA unified memory or, better yet, purchasing a POWER9 with NVLink system. Similarly, if you are using image data of fairly large size (or, more likely, a batch size of many smaller images), it’s likely a case for a 32GB Tesla GPU.
  2. PCI-E switching Why CPU: GPU NVLink? If you can’t write efficiently
  3. End users underweight this. They are so focused on the concrete hardware value (how much, what’s my complicated price/performance calculation) that they miss the efficacy metric. If you and a primary collaborator need to dramatically change your ETL steps or even your runtime instructions to perform similar runs, then you are getting far less time out of your expensive hardware. Matching each other is hugely important. Similarly, if you have the opportunity for larger runs or dedicated time on a larger machine, matching this is critical.
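
The dataset questions in note 1 come down to simple arithmetic, using the slide's own oversimplified example of a 128GB dataset. The sketch below is a rough lower bound only: it assumes the whole dataset is spread across GPU memory and ignores model weights, activations, and framework overhead. The function names `gpus_needed` and `datum_fits` are hypothetical.

```python
import math


def gpus_needed(dataset_gb: float, gpu_mem_gb: float) -> int:
    """Smallest number of GPUs whose combined memory holds the dataset."""
    return math.ceil(dataset_gb / gpu_mem_gb)


def datum_fits(datum_gb: float, gpu_mem_gb: float) -> bool:
    """True if a single datum fits in one GPU's memory."""
    return datum_gb <= gpu_mem_gb


# The slide's example: 128GB overall.
print(gpus_needed(128, 32))  # 4 (32GB + 32GB + 32GB + 32GB)
print(gpus_needed(128, 16))  # 8
# A datum larger than any single GPU points to the "too large" case.
print(datum_fits(40, 32))    # False
```

If `datum_fits` returns False even for the largest GPU available, that is the situation the note describes as needing specialized code or specialized hardware.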