SlideShare una empresa de Scribd logo
1 de 21
Read paper “In-Datacenter
Performance Analysis of a
Tensor Processing Unit”2009-8-22
Authors
• Norman P. Jouppi (first
author)
– Distinguished Engineer at Google
– Lead designer of several
microprocessors and graphics
accelerator
• David Patterson (fourth
author)
– Father of “RISC”
Ref: https://www.computer.org/web/awards/goode-norman-jouppi
Neural Networks
• Application
– MLP, CNN, RNN represent 95% of NN inference workload
in Google datacenters
– Each model needs 5M ~ 100M weights
• Hardware
– TPU has 25 time as many MACs and 3.5 times as much on-chip
memory as the K80 GPU
Neural Networks (Cont.)
Origin
• Requirement
– DNNs might double computation demands
– Quickly produce a custom ASIC for inference
• Definition
– Coprocessor on the PCIE, plug into existing servers
– More like FPU (floating-point unit) than GPU
TPU Block Diagram
Architecture
• Matrix Multiply Unit
– Contains 256 x 256 MACs, can perform 8-bit multiply-and-
adds
– Designed for dense matrices
• Off-chip 8GiB DRAM (Weight Memory)
– Read-only (different from Global Memory of GPU)
– Supports many simultaneously active models
• Instruction Set
– Traditional CISC
– Read_Host_Memory/Read_Weights/MatrixMultiply/Convol
ve/Activate etc.
– 4-stage pipeline
Architecture (Cont.)
Architecture(Cont.)
Implementation
• Flows
– Data flows from the left (Unified Buffer)
– Weights are loaded from the top (Weight FIFO, 8GiB
DDR3 DRAM)
• Systolic System
– A network of processors which rhythmically compute and
pass data through the system
• Software Stack
– User Space Library and Kernel Driver (like Nvidia-GPU)
Performance
Performance (Cont.)
Performance (Cont.)
Alternative TPU Design
Discussion
• Fallacy: K80 GPU is a good match to inference
“GPUs have traditionally been seen as high-throughput
architectures that reply on high-bandwidth DRAM and thousands of
threads to achieve their goals”
Conclusion
• Advantage
– K80 GPU: 2496 32-bit, 8Mib on-chip memory
TPU: 65536 8-bit, 28Mib on-chip memory
– TPU leverages its advantage in MACs and on-chip
memory
– TPU succeeded because of the large matrix multiply
unit
Q1: Why don’t use TPU for training
• TPU’s on-chip 8GiB DRAM is read-only
– CPU paid a lot for synchronous operations on RAM
– Large mount of GPUs will lower the cost for single
chip
• GPU have more “parallel” performance
– Could train two small-model or a large mount of
samples at the same time
Q2: Why TPU faster?
• Application Specific Instruction Set
– Intel CPU (CISC) need decoding, out-of-order,
branch-prediction, SMT etc.
– GPU was optimized for “Parallel” rather than “Matrix”
• Read-only on-chip memory
• TensorRT makes GPU-inference much faster
GPU grows faster and faster
https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/
Q3: TPU or FPGA?
• They looks like the same
– By programming, FPGA could have similar
Matrix-Multiply-Unit
– FPGA could also have “read-only” on-chip memory
• Making a utterly new chip is a high-risk task
– AMD
– Calxeda
– Fusionio
Thank you

Más contenido relacionado

La actualidad más candente

XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...The Linux Foundation
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitJinwon Lee
 
Graphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhGraphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhSaurabh Kumar
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architectureDhaval Kaneria
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An IntroductionDhan V Sagar
 
Graphic Processing Unit (GPU)
Graphic Processing Unit (GPU)Graphic Processing Unit (GPU)
Graphic Processing Unit (GPU)Jafar Khan
 
An AI accelerator ASIC architecture
An AI accelerator ASIC architectureAn AI accelerator ASIC architecture
An AI accelerator ASIC architectureKhanh Le
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...CloudxLab
 
Deep learning: Hardware Landscape
Deep learning: Hardware LandscapeDeep learning: Hardware Landscape
Deep learning: Hardware LandscapeGrigory Sapunov
 
Chiplets in Data Centers
Chiplets in Data CentersChiplets in Data Centers
Chiplets in Data CentersODSA Workgroup
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxScyllaDB
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentationtestSri1
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with GpuRohit Khatana
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecturePiyush Mittal
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the DatabusAmy W. Tang
 

La actualidad más candente (20)

GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
XPDDS17: Shared Virtual Memory Virtualization Implementation on Xen - Yi Liu,...
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
Graphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhGraphics Processing Unit by Saurabh
Graphics Processing Unit by Saurabh
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
GPU - An Introduction
GPU - An IntroductionGPU - An Introduction
GPU - An Introduction
 
Graphic Processing Unit (GPU)
Graphic Processing Unit (GPU)Graphic Processing Unit (GPU)
Graphic Processing Unit (GPU)
 
An AI accelerator ASIC architecture
An AI accelerator ASIC architectureAn AI accelerator ASIC architecture
An AI accelerator ASIC architecture
 
GPU
GPUGPU
GPU
 
Gpu
GpuGpu
Gpu
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
It's Time to ROCm!
It's Time to ROCm!It's Time to ROCm!
It's Time to ROCm!
 
Transformer Zoo
Transformer ZooTransformer Zoo
Transformer Zoo
 
Deep learning: Hardware Landscape
Deep learning: Hardware LandscapeDeep learning: Hardware Landscape
Deep learning: Hardware Landscape
 
Chiplets in Data Centers
Chiplets in Data CentersChiplets in Data Centers
Chiplets in Data Centers
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentation
 
Parallel computing with Gpu
Parallel computing with GpuParallel computing with Gpu
Parallel computing with Gpu
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecture
 
All Aboard the Databus
All Aboard the DatabusAll Aboard the Databus
All Aboard the Databus
 

Similar a Google TPU

AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersCastLabKAIST
 
Spark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsSpark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsS N
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computingbakers84
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxssuser413a98
 
Distributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedDistributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedWee Hyong Tok
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-isctembreternitz
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuAlan Sill
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetupGanesan Narayanasamy
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDKKernel TLV
 
High Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisHigh Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisMike Pittaro
 
Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis PyData
 
Tesla personal super computer
Tesla personal super computerTesla personal super computer
Tesla personal super computerPriya Manik
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
 
Modern processor art
Modern processor artModern processor art
Modern processor artwaqasjadoon11
 

Similar a Google TPU (20)

AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
 
Spark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloadsSpark and Deep Learning frameworks with distributed workloads
Spark and Deep Learning frameworks with distributed workloads
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computing
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
Distributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learnedDistributed DNN training: Infrastructure, challenges, and lessons learned
Distributed DNN training: Infrastructure, challenges, and lessons learned
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscte
 
19-7960-01.pptx
19-7960-01.pptx19-7960-01.pptx
19-7960-01.pptx
 
19-7960-01.pptx
19-7960-01.pptx19-7960-01.pptx
19-7960-01.pptx
 
Design installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttuDesign installation-commissioning-red raider-cluster-ttu
Design installation-commissioning-red raider-cluster-ttu
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup2018 03 25 system ml ai and openpower meetup
2018 03 25 system ml ai and openpower meetup
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
High Performance Hardware for Data Analysis
High Performance Hardware for Data AnalysisHigh Performance Hardware for Data Analysis
High Performance Hardware for Data Analysis
 
Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis Mike Pittaro - High Performance Hardware for Data Analysis
Mike Pittaro - High Performance Hardware for Data Analysis
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
 
Tesla personal super computer
Tesla personal super computerTesla personal super computer
Tesla personal super computer
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
Modern processor art
Modern processor artModern processor art
Modern processor art
 

Más de Hao(Robin) Dong

flashcache原理及改造
flashcache原理及改造flashcache原理及改造
flashcache原理及改造Hao(Robin) Dong
 
ext2-110628041727-phpapp02
ext2-110628041727-phpapp02ext2-110628041727-phpapp02
ext2-110628041727-phpapp02Hao(Robin) Dong
 
Ext4 Bigalloc report public
Ext4 Bigalloc report publicExt4 Bigalloc report public
Ext4 Bigalloc report publicHao(Robin) Dong
 
Ext4 new feature - bigalloc
Ext4 new feature - bigallocExt4 new feature - bigalloc
Ext4 new feature - bigallocHao(Robin) Dong
 
Kernel在多核机器上的负载均衡机制
Kernel在多核机器上的负载均衡机制Kernel在多核机器上的负载均衡机制
Kernel在多核机器上的负载均衡机制Hao(Robin) Dong
 
Linux下Poll和Epoll内核源码剖析
Linux下Poll和Epoll内核源码剖析Linux下Poll和Epoll内核源码剖析
Linux下Poll和Epoll内核源码剖析Hao(Robin) Dong
 

Más de Hao(Robin) Dong (9)

Transformer and BERT
Transformer and BERTTransformer and BERT
Transformer and BERT
 
flashcache原理及改造
flashcache原理及改造flashcache原理及改造
flashcache原理及改造
 
ext2-110628041727-phpapp02
ext2-110628041727-phpapp02ext2-110628041727-phpapp02
ext2-110628041727-phpapp02
 
Ext4 Bigalloc report public
Ext4 Bigalloc report publicExt4 Bigalloc report public
Ext4 Bigalloc report public
 
Overlayfs and VFS
Overlayfs and VFSOverlayfs and VFS
Overlayfs and VFS
 
Ext4 new feature - bigalloc
Ext4 new feature - bigallocExt4 new feature - bigalloc
Ext4 new feature - bigalloc
 
why we need ext4
why we need ext4why we need ext4
why we need ext4
 
Kernel在多核机器上的负载均衡机制
Kernel在多核机器上的负载均衡机制Kernel在多核机器上的负载均衡机制
Kernel在多核机器上的负载均衡机制
 
Linux下Poll和Epoll内核源码剖析
Linux下Poll和Epoll内核源码剖析Linux下Poll和Epoll内核源码剖析
Linux下Poll和Epoll内核源码剖析
 

Último

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 

Último (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 

Google TPU

  • 1. Read paper “In-Datacenter Performance Analysis of a Tensor Processing Unit”2009-8-22
  • 2. Authors • Norman P. Jouppi (first author) – Distinguished Engineer at Google – Lead designer of several microprocessors and graphics accelerator • David Patterson (fourth author) – Father of “RISC” Ref: https://www.computer.org/web/awards/goode-norman-jouppi
  • 3. Neural Networks • Application – MLP, CNN, RNN represent 95% of NN inference workload in Google datacenters – Each model needs 5M ~ 100M weights • Hardware – TPU has 25 time as many MACs and 3.5 times as much on-chip memory as the K80 GPU
  • 5. Origin • Requirement – DNNs might double computation demands – Quickly produce a custom ASIC for inference • Definition – Coprocessor on the PCIE, plug into existing servers – More like FPU (floating-point unit) than GPU
  • 7. Architecture • Matrix Multiply Unit – Contains 256 x 256 MACs, can perform 8-bit multiply-and- adds – Designed for dense matrices • Off-chip 8GiB DRAM (Weight Memory) – Read-only (different from Global Memory of GPU) – Supports many simultaneously active models • Instruction Set – Traditional CISC – Read_Host_Memory/Read_Weights/MatrixMultiply/Convol ve/Activate etc. – 4-stage pipeline
  • 10. Implementation • Flows – Data flows from the left (Unified Buffer) – Weights are loaded from the top (Weight FIFO, 8GiB DDR3 DRAM) • Systolic System – A network of processors which rhythmically compute and pass data through the system • Software Stack – User Space Library and Kernel Driver (like Nvidia-GPU)
  • 15. Discussion • Fallacy: K80 GPU is a good match to inference “GPUs have traditionally been seen as high-throughput architectures that reply on high-bandwidth DRAM and thousands of threads to achieve their goals”
  • 16. Conclusion • Advantage – K80 GPU: 2496 32-bit, 8Mib on-chip memory TPU: 65536 8-bit, 28Mib on-chip memory – TPU leverages its advantage in MACs and on-chip memory – TPU succeeded because of the large matrix multiply unit
  • 17. Q1: Why don’t use TPU for training • TPU’s on-chip 8GiB DRAM is read-only – CPU paid a lot for synchronous operations on RAM – Large mount of GPUs will lower the cost for single chip • GPU have more “parallel” performance – Could train two small-model or a large mount of samples at the same time
  • 18. Q2: Why TPU faster? • Application Specific Instruction Set – Intel CPU (CISC) need decoding, out-of-order, branch-prediction, SMT etc. – GPU was optimized for “Parallel” rather than “Matrix” • Read-only on-chip memory • TensorRT makes GPU-inference much faster
  • 19. GPU grows faster and faster https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/
  • 20. Q3: TPU or FPGA? • They looks like the same – By programming, FPGA could have similar Matrix-Multiply-Unit – FPGA could also have “read-only” on-chip memory • Making a utterly new chip is a high-risk task – AMD – Calxeda – Fusionio