Read paper “In-Datacenter Performance Analysis of a Tensor Processing Unit”
2009-8-22
Authors
• Norman P. Jouppi (first author)
– Distinguished Engineer at Google
– Lead designer of several microprocessors and graphics accelerators
• David Patterson (fourth author)
– Father of “RISC”
Ref: https://www.computer.org/web/awards/goode-norman-jouppi
Neural Networks
• Application
– MLPs, CNNs, and RNNs represent 95% of the NN inference workload in Google datacenters
– Each model needs 5M to 100M weights
• Hardware
– TPU has 25 times as many MACs and 3.5 times as much on-chip memory as the K80 GPU
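To put the 5M to 100M weight counts in perspective, the sketch below estimates how many such models could sit in weight memory at once. It assumes one byte per weight (an assumption, based on the 8-bit arithmetic described on the Architecture slide) and uses the 8 GiB weight memory size from that same slide. The function names are hypothetical.

```python
# Rough sketch: how many models fit in the TPU weight memory.
# Assumption: each weight is stored in 1 byte (8-bit weights).

GIB = 1024 ** 3
WEIGHT_MEMORY_BYTES = 8 * GIB   # 8 GiB weight memory (Architecture slide)
BYTES_PER_WEIGHT = 1            # assumption, not stated on this slide


def model_bytes(num_weights: int) -> int:
    """Bytes needed to store one model's weights."""
    return num_weights * BYTES_PER_WEIGHT


def models_that_fit(num_weights: int) -> int:
    """Whole number of equally sized models that fit in weight memory."""
    return WEIGHT_MEMORY_BYTES // model_bytes(num_weights)


if __name__ == "__main__":
    for n in (5_000_000, 100_000_000):
        print(f"{n:>11,} weights -> {models_that_fit(n)} models fit")
```

Under these assumptions, even the largest models in the stated range leave room for dozens of resident models, which fits the later claim that the weight memory supports many simultaneously active models.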
Neural Networks (Cont.)
Origin
• Requirement
– DNNs might double datacenter computation demands
– Quickly produce a custom ASIC for inference
• Definition
– Coprocessor on the PCIe bus that plugs into existing servers
– More like an FPU (floating-point unit) than a GPU
TPU Block Diagram
Architecture
• Matrix Multiply Unit
– Contains 256 x 256 MACs that perform 8-bit multiply-and-adds
– Designed for dense matrices
• Off-chip 8GiB DRAM (Weight Memory)
– Read-only (different from Global Memory of GPU)
– Supports many simultaneously active models
• Instruction Set
– Follows the CISC tradition
– Read_Host_Memory/Read_Weights/MatrixMultiply/Convolve/Activate etc.
– 4-stage pipeline
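The 8-bit multiply-and-add at the heart of the Matrix Multiply Unit can be illustrated with a small reference implementation. This is a minimal sketch, not the TPU's actual datapath: it assumes signed 8-bit operands and a 32-bit accumulator range, and the helper names (saturate_int8, mac_matmul) are hypothetical.

```python
# Sketch of 8-bit multiply-accumulate (MAC) matrix multiplication.
# Assumptions: signed 8-bit inputs, sums kept within a 32-bit range.

INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def saturate_int8(value: int) -> int:
    """Clamp an integer into the signed 8-bit range."""
    return max(INT8_MIN, min(INT8_MAX, value))


def mac_matmul(activations, weights):
    """Multiply activations (rows x inner) by weights (inner x cols).

    Every output element is a chain of multiply-and-add steps on
    8-bit operands, accumulated in a wider integer.
    """
    rows, inner, cols = len(activations), len(weights), len(weights[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                a, w = activations[i][k], weights[k][j]
                if not (INT8_MIN <= a <= INT8_MAX and INT8_MIN <= w <= INT8_MAX):
                    raise ValueError("operands must fit in signed 8 bits")
                acc += a * w  # one MAC operation
            if not (INT32_MIN <= acc <= INT32_MAX):
                raise OverflowError("accumulator exceeded 32 bits")
            out[i][j] = acc
    return out


if __name__ == "__main__":
    print(mac_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The wide accumulator matters because a single 8-bit by 8-bit product already needs up to 16 bits, and a 256-deep dot product adds several more.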
Architecture (Cont.)
Architecture (Cont.)
Implementation
• Flows
– Data flows in from the left (Unified Buffer)
– Weights are loaded from the top (Weight FIFO, 8 GiB DDR3 DRAM)
• Systolic System
– A network of processors that rhythmically compute and pass data through the system
• Software Stack
– User Space Library and Kernel Driver (as with NVIDIA GPUs)
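The systolic data flow described above can be simulated cycle by cycle. The sketch below is a simplified, hypothetical model rather than the TPU's real microarchitecture: weights are assumed to be already resident in the grid (the loading from the top is not modeled), activations enter from the left with a one-cycle skew per row, and partial sums move downward one cell per cycle.

```python
# Cycle-level sketch of a weight-stationary systolic array.
# Assumption: weights w[k][j] are preloaded into cell (k, j).

def systolic_matmul(x, w):
    """Compute x (M x K) times w (K x N) on a K x N grid of cells.

    Returns the result matrix and the number of cycles simulated.
    """
    m_rows, k_dim, n_cols = len(x), len(w), len(w[0])
    a_reg = [[0] * n_cols for _ in range(k_dim)]  # activations moving right
    p_reg = [[0] * n_cols for _ in range(k_dim)]  # partial sums moving down
    y = [[0] * n_cols for _ in range(m_rows)]
    total_cycles = m_rows + k_dim + n_cols - 2

    for t in range(total_cycles):
        new_a = [[0] * n_cols for _ in range(k_dim)]
        new_p = [[0] * n_cols for _ in range(k_dim)]
        for k in range(k_dim):
            for j in range(n_cols):
                if j == 0:
                    # Row k is fed k cycles late (skewed input).
                    m = t - k
                    a_in = x[m][k] if 0 <= m < m_rows else 0
                else:
                    a_in = a_reg[k][j - 1]      # from the left neighbor
                p_in = p_reg[k - 1][j] if k > 0 else 0  # from above
                new_a[k][j] = a_in
                new_p[k][j] = p_in + a_in * w[k][j]
        a_reg, p_reg = new_a, new_p

        # Finished sums leave the bottom row, one column at a time.
        for j in range(n_cols):
            m = t - j - (k_dim - 1)
            if 0 <= m < m_rows:
                y[m][j] = p_reg[k_dim - 1][j]
    return y, total_cycles


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each cell only talks to its neighbors, so no value is re-read from a shared buffer between steps; that is the property that makes a large grid of MACs practical.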
Performance
Performance (Cont.)
Performance (Cont.)
Alternative TPU Design
Discussion
• Fallacy: The K80 GPU is a good match to inference
“GPUs have traditionally been seen as high-throughput architectures that rely on high-bandwidth DRAM and thousands of threads to achieve their goals”
Conclusion
• Advantage
– K80 GPU: 2496 32-bit MACs, 8 MiB on-chip memory
TPU: 65536 8-bit MACs, 28 MiB on-chip memory
– TPU leverages its advantage in MACs and on-chip memory
– TPU succeeded because of the large matrix multiply unit
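The ratios quoted on the Neural Networks slide can be checked against the figures listed here. This is a quick arithmetic sketch using only numbers from the deck; the variable names are illustrative.

```python
# Arithmetic check of the TPU vs. K80 comparison, using the deck's figures.

K80_MACS = 2496          # 32-bit MACs
TPU_MACS = 256 * 256     # 8-bit MACs in the 256 x 256 matrix unit
K80_ON_CHIP_MIB = 8
TPU_ON_CHIP_MIB = 28

mac_ratio = TPU_MACS / K80_MACS
mem_ratio = TPU_ON_CHIP_MIB / K80_ON_CHIP_MIB

if __name__ == "__main__":
    print(f"TPU MACs: {TPU_MACS}")
    print(f"MAC ratio: {mac_ratio:.1f}x")
    print(f"On-chip memory ratio: {mem_ratio:.1f}x")
```

The MAC ratio works out to a little over 26, so the "25 times" figure is a rounded-down approximation, while the memory ratio is exactly 3.5. Note that the MAC counts compare units of different widths (8-bit versus 32-bit).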
Q1: Why not use the TPU for training?
• The TPU’s off-chip 8 GiB DRAM is read-only
– CPUs pay a lot for synchronous operations on RAM
– A large number of GPUs will lower the cost per chip
• GPUs have more “parallel” performance
– Could train two small models or a large number of samples at the same time
Q2: Why is the TPU faster?
• Application-Specific Instruction Set
– Intel CPUs (CISC) need decoding, out-of-order execution, branch prediction, SMT, etc.
– GPUs were optimized for “Parallel” rather than “Matrix” workloads
• Read-only on-chip memory
• TensorRT makes GPU-inference much faster
GPU grows faster and faster
https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/
Q3: TPU or FPGA?
• They look much the same
– By programming, an FPGA could have a similar Matrix Multiply Unit
– An FPGA could also have “read-only” on-chip memory
• Making an entirely new chip is a high-risk task
– AMD
– Calxeda
– Fusion-io
Thank you
