2. Authors
• Norman P. Jouppi (first author)
– Distinguished Engineer at Google
– Lead designer of several microprocessors and graphics accelerators
• David Patterson (fourth author)
– Father of “RISC”
Ref: https://www.computer.org/web/awards/goode-norman-jouppi
3. Neural Networks
• Application
– MLP, CNN, and RNN models represent 95% of the NN inference workload in Google datacenters
– Each model needs 5M to 100M weights
• Hardware
– The TPU has 25 times as many MACs and 3.5 times as much on-chip memory as the K80 GPU
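To make the MAC count concrete, here is a minimal NumPy sketch (not from the paper; the layer shape and batch size are illustrative assumptions): one MLP layer's inference is essentially a single matrix multiply, so multiply-accumulate operations dominate the work and MAC throughput is the natural hardware metric.

```python
import numpy as np

# Hypothetical layer shape, chosen only for illustration.
batch, in_dim, out_dim = 8, 2048, 2048
x = np.random.randn(batch, in_dim).astype(np.float32)    # input activations
w = np.random.randn(in_dim, out_dim).astype(np.float32)  # layer weights
b = np.zeros(out_dim, dtype=np.float32)

y = np.maximum(x @ w + b, 0.0)        # matrix multiply + bias + ReLU

macs = batch * in_dim * out_dim       # one multiply-accumulate per (batch, in, out) triple
print(f"MACs for this single layer: {macs:,}")   # ~33.5 million
```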
5. Origin
• Requirement
– Growing DNN use might double datacenter computation demands
– Quickly produce a custom ASIC for inference
• Definition
– A coprocessor on the PCIe I/O bus, plugged into existing servers
– More like an FPU (floating-point unit) than a GPU
7. Architecture
• Matrix Multiply Unit
– Contains 256 x 256 MACs that perform 8-bit multiply-and-adds (see the quantized MAC sketch after this slide)
– Designed for dense matrices
• Off-chip 8GiB DRAM (Weight Memory)
– Read-only (unlike the global memory of a GPU)
– Supports many simultaneously active models
• Instruction Set
– Traditional CISC
– Read_Host_Memory / Read_Weights / MatrixMultiply / Convolve / Activate, etc.
– 4-stage pipeline
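A minimal sketch of what one tile of the Matrix Multiply Unit does, assuming simple symmetric per-tensor quantization (the quantization scheme here is an illustrative assumption, not the paper's exact recipe): 8-bit multiplies are accumulated into 32-bit integers and then rescaled back to floating point.

```python
import numpy as np

def quantize(x, scale):
    """Map float32 values to int8 with a per-tensor scale (illustrative only)."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

a = np.random.randn(256, 256).astype(np.float32)   # activations from the Unified Buffer
w = np.random.randn(256, 256).astype(np.float32)   # weights from Weight Memory
sa, sw = np.abs(a).max() / 127, np.abs(w).max() / 127
a_q, w_q = quantize(a, sa), quantize(w, sw)

# 8-bit multiplies with 32-bit accumulation, as a 256 x 256 MAC tile would do.
acc = a_q.astype(np.int32) @ w_q.astype(np.int32)

# Rescale back to float and compare against the full-precision result.
approx = acc.astype(np.float32) * (sa * sw)
print("max abs error vs float32 matmul:", np.abs(approx - a @ w).max())
```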
10. Implementation
• Flows
– Data flows from the left (Unified Buffer)
– Weights are loaded from the top (Weight FIFO, 8GiB DDR3 DRAM)
• Systolic System
– A network of processors that rhythmically compute and pass data through the system (see the simulation sketch after this slide)
• Software Stack
– User Space Library and Kernel Driver (like an NVIDIA GPU)
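The following toy simulation sketches what "systolic" means here (a simplified weight-stationary array, assumed for illustration; it is not a cycle-accurate model of the TPU): weights stay put in each processing element, activations march in from the left one step per cycle, and partial sums march down and drop out the bottom as finished results.

```python
import numpy as np

def systolic_matmul(x, w):
    """Toy weight-stationary systolic array computing y = x @ w (a sketch only).

    PE (k, n) holds w[k, n]. Activations enter from the left, skewed in time,
    and move one PE to the right each cycle; partial sums move one PE down
    each cycle and leave the bottom row as finished dot products.
    """
    m_dim, k_dim = x.shape
    _, n_dim = w.shape
    a_reg = np.zeros((k_dim, n_dim))   # activation register in each PE
    p_reg = np.zeros((k_dim, n_dim))   # partial-sum register in each PE
    y = np.zeros((m_dim, n_dim))

    for t in range(m_dim + k_dim + n_dim - 2):      # cycles until the array drains
        new_a = np.zeros_like(a_reg)
        new_p = np.zeros_like(p_reg)
        for k in range(k_dim):
            for n in range(n_dim):
                # Left boundary injects x[t - k, k]; the skew keeps data "in rhythm".
                a_in = a_reg[k, n - 1] if n > 0 else (x[t - k, k] if 0 <= t - k < m_dim else 0.0)
                p_in = p_reg[k - 1, n] if k > 0 else 0.0
                new_a[k, n] = a_in
                new_p[k, n] = p_in + a_in * w[k, n]  # the multiply-accumulate
        a_reg, p_reg = new_a, new_p

        # The bottom row of PEs emits one finished output element per column.
        for n in range(n_dim):
            m = t - (k_dim - 1) - n
            if 0 <= m < m_dim:
                y[m, n] = p_reg[k_dim - 1, n]
    return y

x = np.random.randn(4, 3)
w = np.random.randn(3, 5)
assert np.allclose(systolic_matmul(x, w), x @ w)
```

The point of the rhythm is reuse: each value is read from the buffer once and then reused as it passes through the array, letting the chip spend its area on MACs rather than on caches and control logic.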
15. Discussion
• Fallacy: The K80 GPU is a good match to inference
“GPUs have traditionally been seen as high-throughput architectures that rely on high-bandwidth DRAM and thousands of threads to achieve their goals”
16. Conclusion
• Advantage
– K80 GPU: 2,496 32-bit ALUs, 8 MiB on-chip memory
– TPU: 65,536 8-bit MACs, 28 MiB on-chip memory
– The TPU leverages its advantage in MACs and on-chip memory (see the peak-throughput estimate after this slide)
– The TPU succeeded because of its large matrix multiply unit
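As a rough cross-check of these numbers (a back-of-the-envelope estimate using the clock rates reported in the paper: 700 MHz for the TPU and 560 MHz base clock for the K80; one MAC counts as two operations):

```python
# Peak throughput from unit counts and clock rates (approximate, per die).
tpu_peak = 65_536 * 2 * 700e6   # 8-bit ops/s  -> ~92 TOPS
k80_peak = 2_496 * 2 * 560e6    # 32-bit FLOP/s -> ~2.8 TFLOPS
print(f"TPU peak: {tpu_peak / 1e12:.1f} TOPS")
print(f"K80 peak: {k80_peak / 1e12:.1f} TFLOPS")
print(f"ratio:   {tpu_peak / k80_peak:.0f}x")
```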
17. Q1: Why not use the TPU for training?
• The TPU’s off-chip 8GiB weight DRAM is read-only
– CPUs already pay a high price for synchronized operations on RAM
– Buying GPUs in large quantities lowers the per-chip cost
• GPUs have more “parallel” performance
– They could train two small models, or a large batch of samples, at the same time
18. Q2: Why is the TPU faster?
• Application-Specific Instruction Set
– Intel CPUs (CISC) need instruction decoding, out-of-order execution, branch prediction, SMT, etc.
– GPUs were optimized for “parallel” rather than “matrix” workloads
• Read-only weight memory
• TensorRT makes GPU inference much faster
19. GPUs are getting faster and faster
https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/
20. Q3: TPU or FPGA?
• They look similar
– With suitable programming, an FPGA could implement a similar Matrix Multiply Unit
– An FPGA could also have “read-only” on-chip memory
• Making an utterly new chip is a high-risk undertaking
– AMD
– Calxeda
– Fusion-io