SlideShare una empresa de Scribd logo
1 de 38
Descargar para leer sin conexión
ML at Facebook:
An Infrastructure View
Yangqing Jia
Director, Facebook AI Infra
(* This is not
Facebook AI Infra)
The Machine Learning Moore’s Law?
0
500
1000
1500
2000
2500
2001 2003 2005 2007 2009 2011 2013 2015 2017
NumberofCitations
Machine Learning Execution Flow
Data Features Training Evaluation Inference
Offline Online
Machine Learning Execution Flow
Data Features Training Eval Inference PredictionsModel
It’s an infrastructure challenge
Data
Offline Training Online Inference
Storage
Challenges!
Network
Challenges!
Compute
Challenges!
Let’s Answer Some Pressing Questions
• Does Facebook leverage machine learning?
• Does Facebook design hardware?
• Does Facebook design hardware for machine learning?
• What platforms and frameworks exist; can the community use them?
• What assumptions break when supporting 2B people?
Does Facebook Use Machine Learning?
News Feed Ads Search
Language Translation
Sigma Facer Lumos
Speech Recognition
Content
Understanding
Classification
& Ranking
Services
What ML Models Do We Leverage?
GBDTSVM MLP CNN RNN
Support
Vector
Machines
Gradient-
Boosted
Decision Trees
Multi-Layer
Perceptron
Convolutional
Neural Nets
Recurrent
Neural Nets
Facer Sigma News Feed
Ads
Search
Sigma
Facer
Lumos
Language
Translation
Content
Understanding
Speech Rec
How Often Do We Train Models?
minutes hours days months
How Long Does Training Take?
seconds minutes hours days
How Much Compute Does Inference Consume?
100X 10x 1x
Does Facebook Design Hardware?
• Yes! Since 2010! All designs released through open compute!
• Facebook Server Design Philosophy
• Identify a small number of major services with unique resource requirements
• Design servers for those major services
One major
server design
versus
A B
D
C
Customized, dedicated hardwareGlobal shared pool
Does Facebook Design Hardware?
Yosemite/Twin Lakes:
For the web tier and
other “stateless services”
Open Compute “Sleds”
are 2U x 3 Across in an
Open Compute Rack
Server Card
Chassis
4-way
Shared NIC
SSD Boot
Rack
Does Facebook Design Hardware?
Tioga Pass:
For compute or memory-
intensive workloads:
Bryce Canyon:
For storage-heavy workloads:
Does Facebook Design Hardware for AI/ML?
• HP SL270s (2013): learning serviceability, thermal, perf, reliability, cluster mgmt.
• Big Sur (M40) -> Big Basin (P100) -> Big Basin Volta (V100)
Big Sur
Integrated Compute
8 Nvidia M40 GPUs
Big Basin
JBOG Design (CPU headnode)
8 Nvidia P100 / V100 GPUs
Putting it Together
Data Features Training Evaluation Inference
Bryce Canyon Big Basin, Tioga Pass
Tioga Pass
Twin Lakes
Let’s Answer Some Pressing Questions
• Does Facebook leverage machine learning?
• Does Facebook design hardware?
• Does Facebook design hardware for machine learning?
• What platforms and frameworks exist; can the community use them?
• What assumptions break when supporting 2B people?
Facebook AI Frameworks
Infra Efficiency for Production
• Stability
• Scale & Speed
• Data Integration
• Relatively Fixed
Developer Efficiency for Research
• Flexible
• Fast Iteration
• Highly Debuggable
• Less Robust
Facebook AI Frameworks
Infra Efficiency for Production
• Stability
• Scale & Speed
• Data Integration
• Relatively Fixed
Developer Efficiency for Research
• Flexible
• Fast Iteration
• Highly Debuggable
• Less Robust
OpenGL ESNNPACK
Metal™/
MPSCNN
Qualcomm
Snapdragon
NPE
CUDA/cuDNN
Deep Learning Frameworks
Framework
backends
Vendor and numeric libraries
Apple CoreML Nvidia TensorRT
Intel/Nervana
ngraph
Qualcom
SNPE
…
O(n2) pairs
Shared model and operator representation
Open Neural Network Exchange
Framework
backends
Vendor and numeric libraries
Apple CoreML Nvidia TensorRT
Intel/Nervana
ngraph
Qualcom
SNPE
…
From O(n2) to O(n) pairs
Putting it Together
Data Features Training Evaluation Inference
Bryce Canyon Big Basin, Tioga Pass
Tioga Pass
Twin Lakes
Data APIs Caffe2, PyTorch, ONNX C2 Predictor
Facebook AI Ecosystem
Frameworks: Core ML Software
Caffe2 / PyTorch / ONNX
Platforms: Workflow Management, Deployment
FB Learner
Infrastructure: Servers, Storage, Network Strategy
Open Compute Project
FB Learner Platform
FB Learner
Feature Store
FB Learner
Flow
FB Learner
Predictor
• AI Workflow
• Model Management and Deployment
FBLearner in ML
Data Features Training Eval Inference PredictionsModel
FBLearner
Flow
FBLearner
Predictor
FBLearner
Feature
Store
CPU+GPUCPU CPU
Putting it All Together
Data Features Training Evaluation Inference
Bryce Canyon Big Basin, Tioga Pass
Tioga Pass
Twin Lakes
Feature Store FBLearner Flow FBL Predictor
Data APIs Caffe2, PyTorch, ONNX C2 Predictor
2 Billion People
What changes when you scale to over
Scaling Challenges / Opportunities
Lots of Data Lots of Compute
Scaling Opportunity: Free Compute!
Santa Clara, California
Ashburn, Virginia
Prineville, Oregon
Forest City, North Carolina
Lulea, Sweden
Altoona, Iowa
Fort Worth, Texas
Clonee, Ireland
Los Lunas, New Mexico
Odense, Denmark
New Albany, Ohio
Papillion, Nebraska
Key Takeaways
Facebook AI
Lots of
Data
Wide variety
of models
Full stack
challenges Global scale
Kim Hazelwood Sarah Bird David Brooks Soumith Chintala Utku Diril Dmytro Dzhulgakov
Mohamed Fawzy Bill Jia Yangqing Jia Aditya Kalro James Law
Kevin Lee Jason Lu Pieter Noordhuis Misha Smelyanskiy Liang Xiong Xiaodong Wang
Facebook ML Infrastructure - 2018 slides

Más contenido relacionado

La actualidad más candente

MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ... MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
Databricks
 
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Ed Fernandez
 
Using MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsUsing MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOps
Weaveworks
 

La actualidad más candente (20)

Data Augmentation
Data AugmentationData Augmentation
Data Augmentation
 
Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...Standardizing +113 million Merchant Names in Financial Services with Greenplu...
Standardizing +113 million Merchant Names in Financial Services with Greenplu...
 
Introducing MLOps.pdf
Introducing MLOps.pdfIntroducing MLOps.pdf
Introducing MLOps.pdf
 
Explainability and bias in AI
Explainability and bias in AIExplainability and bias in AI
Explainability and bias in AI
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ... MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 
추천시스템 이제는 돈이 되어야 한다.
추천시스템 이제는 돈이 되어야 한다.추천시스템 이제는 돈이 되어야 한다.
추천시스템 이제는 돈이 되어야 한다.
 
MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.
 
From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOps
 
Deep Learning: Application & Opportunity
Deep Learning: Application & OpportunityDeep Learning: Application & Opportunity
Deep Learning: Application & Opportunity
 
introduction to deep Learning with full detail
introduction to deep Learning with full detailintroduction to deep Learning with full detail
introduction to deep Learning with full detail
 
Power BI Overview
Power BI OverviewPower BI Overview
Power BI Overview
 
하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018
하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018
하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018
 
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
Navigating the ML Pipeline Jungle with MLflow: Notes from the Field with Thun...
 
Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)Automated Machine Learning (Auto ML)
Automated Machine Learning (Auto ML)
 
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
 
Using MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsUsing MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOps
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of ML
 
Introduction to Deep Learning
Introduction to Deep Learning Introduction to Deep Learning
Introduction to Deep Learning
 
Tensorflow presentation
Tensorflow presentationTensorflow presentation
Tensorflow presentation
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at Scale
 

Similar a Facebook ML Infrastructure - 2018 slides

2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
Alan Tsai
 
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Ian Gomez
 

Similar a Facebook ML Infrastructure - 2018 slides (20)

2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
2018 .NET Conf - 利用Machine Learning .NET整合機器學習至應用程式
 
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning Start Getting Your Feet Wet in Open Source Machine and Deep Learning
Start Getting Your Feet Wet in Open Source Machine and Deep Learning
 
Tour de France Azure PaaS 6/7 Ajouter de l'intelligence
Tour de France Azure PaaS 6/7 Ajouter de l'intelligenceTour de France Azure PaaS 6/7 Ajouter de l'intelligence
Tour de France Azure PaaS 6/7 Ajouter de l'intelligence
 
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleData Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
 
Introduction to ML.NET
Introduction to ML.NETIntroduction to ML.NET
Introduction to ML.NET
 
Technology and AI sharing - From 2016 to Y2017 and Beyond
Technology and AI sharing - From 2016 to Y2017 and BeyondTechnology and AI sharing - From 2016 to Y2017 and Beyond
Technology and AI sharing - From 2016 to Y2017 and Beyond
 
Paige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine LearningPaige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine Learning
 
Session 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data BenchmarksSession 1 - The Current Landscape of Big Data Benchmarks
Session 1 - The Current Landscape of Big Data Benchmarks
 
Microsoft power platform
Microsoft power platformMicrosoft power platform
Microsoft power platform
 
Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
 
Introduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI PlatformIntroduction to PowerAI - The Enterprise AI Platform
Introduction to PowerAI - The Enterprise AI Platform
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
 
Maintainable Machine Learning Products
Maintainable Machine Learning ProductsMaintainable Machine Learning Products
Maintainable Machine Learning Products
 
Microsoft AI Platform Overview
Microsoft AI Platform OverviewMicrosoft AI Platform Overview
Microsoft AI Platform Overview
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...
 
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...
 
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
 
AzureML Welcome to the future of Predictive Analytics
AzureML Welcome to the future of Predictive Analytics AzureML Welcome to the future of Predictive Analytics
AzureML Welcome to the future of Predictive Analytics
 
Cookbook for Building An App
Cookbook for Building An AppCookbook for Building An App
Cookbook for Building An App
 

Más de Karthik Murugesan

BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
Karthik Murugesan
 

Más de Karthik Murugesan (20)

Rakuten - Recommendation Platform
Rakuten - Recommendation PlatformRakuten - Recommendation Platform
Rakuten - Recommendation Platform
 
Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slides
 
Free servers to build Big Data Systems on: Bing's Approach
Free servers to build Big Data Systems on: Bing's  Approach Free servers to build Big Data Systems on: Bing's  Approach
Free servers to build Big Data Systems on: Bing's Approach
 
Microsoft cosmos
Microsoft cosmosMicrosoft cosmos
Microsoft cosmos
 
Microsoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER IntroductionMicrosoft AI Platform - AETHER Introduction
Microsoft AI Platform - AETHER Introduction
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019The Evolution of Spotify Home Architecture - Qcon 2019
The Evolution of Spotify Home Architecture - Qcon 2019
 
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
Unifying Twitter around a single ML platform  - Twitter AI Platform 2019Unifying Twitter around a single ML platform  - Twitter AI Platform 2019
Unifying Twitter around a single ML platform - Twitter AI Platform 2019
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...
 
The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019The journey toward a self-service data platform at Netflix - sf 2019
The journey toward a self-service data platform at Netflix - sf 2019
 
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
2019 Slides - Michelangelo Palette: A Feature Engineering Platform at Uber
 
Developing a ML model using TF Estimator
Developing a ML model using TF EstimatorDeveloping a ML model using TF Estimator
Developing a ML model using TF Estimator
 
Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018Production Model Deployment - StitchFix - 2018
Production Model Deployment - StitchFix - 2018
 
Netflix factstore for recommendations - 2018
Netflix factstore  for recommendations - 2018Netflix factstore  for recommendations - 2018
Netflix factstore for recommendations - 2018
 
Trends in Music Recommendations 2018
Trends in Music Recommendations 2018Trends in Music Recommendations 2018
Trends in Music Recommendations 2018
 
Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017Netflix Ads Personalization Solution - 2017
Netflix Ads Personalization Solution - 2017
 
State Of AI 2018
State Of AI 2018State Of AI 2018
State Of AI 2018
 
Spotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music DiscoverySpotify Machine Learning Solution for Music Discovery
Spotify Machine Learning Solution for Music Discovery
 
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
AirBNB - Zipline: Airbnb’s Machine Learning Data Management Platform
 

Último

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Último (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 

Facebook ML Infrastructure - 2018 slides

  • 1. ML at Facebook: An Infrastructure View Yangqing Jia Director, Facebook AI Infra
  • 2. (* This is not Facebook AI Infra)
  • 3.
  • 4. The Machine Learning Moore’s Law? 0 500 1000 1500 2000 2500 2001 2003 2005 2007 2009 2011 2013 2015 2017 NumberofCitations
  • 5. Machine Learning Execution Flow Data Features Training Evaluation Inference Offline Online
  • 6. Machine Learning Execution Flow Data Features Training Eval Inference PredictionsModel
  • 7. It’s an infrastructure challenge Data Offline Training Online Inference Storage Challenges! Network Challenges! Compute Challenges!
  • 8. Let’s Answer Some Pressing Questions • Does Facebook leverage machine learning? • Does Facebook design hardware? • Does Facebook design hardware for machine learning? • What platforms and frameworks exist; can the community use them? • What assumptions break when supporting 2B people?
  • 9. Does Facebook Use Machine Learning? News Feed Ads Search Language Translation Sigma Facer Lumos Speech Recognition Content Understanding Classification & Ranking Services
  • 10. What ML Models Do We Leverage? GBDTSVM MLP CNN RNN Support Vector Machines Gradient- Boosted Decision Trees Multi-Layer Perceptron Convolutional Neural Nets Recurrent Neural Nets Facer Sigma News Feed Ads Search Sigma Facer Lumos Language Translation Content Understanding Speech Rec
  • 11. How Often Do We Train Models? minutes hours days months
  • 12. How Long Does Training Take? seconds minutes hours days
  • 13. How Much Compute Does Inference Consume? 100X 10x 1x
  • 14. Does Facebook Design Hardware? • Yes! Since 2010! All designs released through open compute! • Facebook Server Design Philosophy • Identify a small number of major services with unique resource requirements • Design servers for those major services One major server design versus A B D C Customized, dedicated hardwareGlobal shared pool
  • 15. Does Facebook Design Hardware? Yosemite/Twin Lakes: For the web tier and other “stateless services” Open Compute “Sleds” are 2U x 3 Across in an Open Compute Rack Server Card Chassis 4-way Shared NIC SSD Boot Rack
  • 16. Does Facebook Design Hardware? Tioga Pass: For compute or memory- intensive workloads: Bryce Canyon: For storage-heavy workloads:
  • 17. Does Facebook Design Hardware for AI/ML? • HP SL270s (2013): learning serviceability, thermal, perf, reliability, cluster mgmt. • Big Sur (M40) -> Big Basin (P100) -> Big Basin Volta (V100) Big Sur Integrated Compute 8 Nvidia M40 GPUs Big Basin JBOG Design (CPU headnode) 8 Nvidia P100 / V100 GPUs
  • 18. Putting it Together Data Features Training Evaluation Inference Bryce Canyon Big Basin, Tioga Pass Tioga Pass Twin Lakes
  • 19. Let’s Answer Some Pressing Questions • Does Facebook leverage machine learning? • Does Facebook design hardware? • Does Facebook design hardware for machine learning? • What platforms and frameworks exist; can the community use them? • What assumptions break when supporting 2B people?
  • 20. Facebook AI Frameworks Infra Efficiency for Production • Stability • Scale & Speed • Data Integration • Relatively Fixed Developer Efficiency for Research • Flexible • Fast Iteration • Highly Debuggable • Less Robust
  • 21. Facebook AI Frameworks Infra Efficiency for Production • Stability • Scale & Speed • Data Integration • Relatively Fixed Developer Efficiency for Research • Flexible • Fast Iteration • Highly Debuggable • Less Robust OpenGL ESNNPACK Metal™/ MPSCNN Qualcomm Snapdragon NPE CUDA/cuDNN
  • 22.
  • 23.
  • 24.
  • 25. Deep Learning Frameworks Framework backends Vendor and numeric libraries Apple CoreML Nvidia TensorRT Intel/Nervana ngraph Qualcom SNPE … O(n2) pairs
  • 26. Shared model and operator representation Open Neural Network Exchange Framework backends Vendor and numeric libraries Apple CoreML Nvidia TensorRT Intel/Nervana ngraph Qualcom SNPE … From O(n2) to O(n) pairs
  • 27. Putting it Together Data Features Training Evaluation Inference Bryce Canyon Big Basin, Tioga Pass Tioga Pass Twin Lakes Data APIs Caffe2, PyTorch, ONNX C2 Predictor
  • 28. Facebook AI Ecosystem Frameworks: Core ML Software Caffe2 / PyTorch / ONNX Platforms: Workflow Management, Deployment FB Learner Infrastructure: Servers, Storage, Network Strategy Open Compute Project
  • 29. FB Learner Platform FB Learner Feature Store FB Learner Flow FB Learner Predictor • AI Workflow • Model Management and Deployment
  • 30. FBLearner in ML Data Features Training Eval Inference PredictionsModel FBLearner Flow FBLearner Predictor FBLearner Feature Store CPU+GPUCPU CPU
  • 31. Putting it All Together Data Features Training Evaluation Inference Bryce Canyon Big Basin, Tioga Pass Tioga Pass Twin Lakes Feature Store FBLearner Flow FBL Predictor Data APIs Caffe2, PyTorch, ONNX C2 Predictor
  • 32. 2 Billion People What changes when you scale to over
  • 33. Scaling Challenges / Opportunities Lots of Data Lots of Compute
  • 35. Santa Clara, California Ashburn, Virginia Prineville, Oregon Forest City, North Carolina Lulea, Sweden Altoona, Iowa Fort Worth, Texas Clonee, Ireland Los Lunas, New Mexico Odense, Denmark New Albany, Ohio Papillion, Nebraska
  • 36. Key Takeaways Facebook AI Lots of Data Wide variety of models Full stack challenges Global scale
  • 37. Kim Hazelwood Sarah Bird David Brooks Soumith Chintala Utku Diril Dmytro Dzhulgakov Mohamed Fawzy Bill Jia Yangqing Jia Aditya Kalro James Law Kevin Lee Jason Lu Pieter Noordhuis Misha Smelyanskiy Liang Xiong Xiaodong Wang