High Performance DSP for Vision, Imaging and
Neural Networks
Greg Efland, Sandip Parikh, Himanshu Sanghavi, Aamir Farooqui
2016 Hot Chips
Flint Center for the Performing Arts, Cupertino, California
August 21-23, 2016
2 © 2016 Cadence Design Systems, Inc. All rights reserved.
Vision Processors
Emerging Applications
Wearables and IoT
Security
Automotive
Gesture Control
Traffic Sign Recognition
Mobile
People Detection
HDR Video Stabilizer Face Detection
Drone 3D Vision
Vision Processors
Opportunity
• Ever increasing demand for vision and image processing
– Cameras everywhere
– Dramatic complexity expansion of vision in cars, IoT, mobile, consumer
– Emergence of compute-hungry neural networks
• Vision Processors are designed for the complex algorithms in
computer vision and imaging
– Programmable architectures for rapidly evolving and diverse applications
– Memory architectures designed for vision and imaging application needs
– Power-efficient solutions for vision and imaging applications
Cadence Tensilica Vision Processors
Evolution
• Architecture evolved over 4 generations
– IVP32 (2013) → IVP-EP (2014) → Vision P5 (2015) → Vision P6 (2016)
– Member of large family of Cadence DSPs
• Key characteristics
– Wide SIMD processing: 512-bit width, 64-/32-/16-way processing
– Local memory based: multiple banks, multiple 512-bit load/store
– FLIX™ instructions (VLIW): 5 slots, multiple formats, scalar / vector mix
• Delivered as soft IP
– Customer extensible with new operations and interfaces
– Customer configured according to requirements
Real Time Vision with Cadence
• Xtensa™ Foundation, Xtensa™ Processor Generator: hundreds of customers, thousands of designs, 5 billion cores
• Vision Libraries and Application Examples: OpenCV, OpenVX, CNN Framework
• Unified Software Development Environment: one GUI for configuration, compilation, debug, modeling, RTOS
• Custom Xtensa™: 1B cores in imaging/video/vision
• Vision P6: fourth-generation vision DSP; fastest DSP for vision, both classic and CNN
• Cadence Development Platforms: virtual platforms, FPGA development systems, at-speed silicon
• Third-Party Imaging and Vision Applications: automotive, mobile, surveillance, IoT, CNN
Cadence Tensilica Vision Processors
Evolution
• Vision P6
– Significantly improved performance on a wide range of imaging and vision
applications compared to predecessors
– Performance improvements achieved through incremental increases in area
and power while maintaining efficiency and software compatibility
• Key areas of improvement
– Data type diversity
– Enhanced memory parallelism and data movement
– CNN enhancements
Tensilica Vision P6 DSP
• Enhanced Data Movement
– Aligning vector load/store
– Integrated 2-D DMA
• Enhanced Memory Parallelism
– SuperGather™
• CNN Enhancements
– 256 MACs
• Diverse Data Type Support
– 8-/16-/32-bit fixed-point
– Single-/half-precision FP
Cadence Tensilica Vision Processors
Evolution
• Vision P6 status
– Sampled to early engagement customers
– General release October 2016
This slide contains forward-looking statements regarding Cadence's business or
products. Actual results may differ materially from the information presented here.
Data Type Diversity
• Data type requirements vary by application
– A wide range of supported data types enables a wide range of applications
– Configurable data type support enables efficient targeted solutions
• Vision P6 adds vector floating-point support
– Leverages existing resources – vector register files and load/store operations
– Incremental cost – floating-point data paths
– Optional – no cost for applications that don’t require floating-point
• Vision P6 computation data types
– Fixed-point: 8-/16-/32-bit
– Floating-point: single-/half-precision
Data Type Diversity
Fixed-Point Data Types
• Fixed-point
– Primary data type for many applications
– 8-bit, 16-bit most common computation types
– 8-bit increasingly important, e.g. for CNN inference
– 32-bit for precision – e.g. perspective warp mapping functions
– Multiple 8x8/8x16/16x16 MACs per SIMD way
Data Type            SIMD     MACs per Cycle
8-bit fixed-point    64-way   256 (8x8) / 128 (8x16)
16-bit fixed-point   32-way   64 (16x16)
32-bit fixed-point   16-way   16 (16x32) / 8 (32x32)
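The bullet "multiple MACs per SIMD way" can be checked against the table with a little arithmetic. The sketch below only divides the listed peak MAC counts by the listed SIMD widths; the dictionary layout and the function name `macs_per_way` are illustrative choices, not part of the presentation.

```python
# (SIMD ways, peak MACs per cycle) per multiplier shape, copied from the table above.
PEAK = {
    "8x8":   (64, 256),
    "8x16":  (64, 128),
    "16x16": (32, 64),
    "16x32": (16, 16),
    "32x32": (16, 8),
}

def macs_per_way(kind):
    """Peak MACs per cycle divided by the number of SIMD ways."""
    ways, macs = PEAK[kind]
    return macs / ways

for kind in PEAK:
    print(f"{kind:>5}: {macs_per_way(kind):g} MAC(s) per SIMD way per cycle")
```

Under this reading, the 8x8 case works out to four MACs per way, which lines up with the quad 8x8 MAC operations described later in the deck.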
Data Type Diversity
Floating-Point Data Types
• Floating-point
– For dynamic range – e.g. RANSAC
– Ease porting of applications from other platforms
– Support both scalar and N-way vector processing
– Single-precision and half-precision independently configurable
Data Type                 SIMD     MADDs per Cycle
32-bit single precision   16-way   16
16-bit half precision     32-way   32
Enhanced Memory Parallelism and Data Movement
Gather-Scatter
Opportunity
• A number of kernels access data in disparate memory locations
– Not simple fixed-stride accesses: data in arbitrary rows and/or columns
– Hard to efficiently vectorize using wide loads/stores since arbitrary movement
of data across vector registers, particularly across rows, can be expensive
– If data is sparse, wide loads/stores also waste memory bandwidth and energy
[Figure: memory access patterns for Canny Edge Detection (in/out) and Image Warping (in, out)]
Gather-Scatter
Opportunity
• Vision P6 supports multiple banks per local data RAM
– Up to 4 banks: multiple load/store operations and DMA transfers per cycle
– Banks are 512 bits wide, interleaved on low-order address bits
– Typically implemented using multiple narrow memories: up to 32 sub-banks
• Enable software to exploit independently addressable sub-banks
– Incremental hardware increase leverages existing memory structure
– Significant performance increase for a number of applications
[Figure: local data RAM organized as Bank 0 (SB0 … SB7) through Bank 3 (SB24 … SB31)]
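To make the bank and sub-bank organization concrete, here is a minimal sketch of how a byte address might map to a bank and sub-bank. The 4 banks, 64-byte (512-bit) bank width, and 32 sub-banks come from the slide; the even split into 8 sub-banks of 8 bytes per bank and the exact address bits used for interleaving are assumptions made for illustration, and `locate` is a hypothetical helper.

```python
BANK_WIDTH_BYTES = 64        # 512-bit banks (from the slide)
NUM_BANKS = 4                # "up to 4 banks" (from the slide)
SUB_BANKS_PER_BANK = 8       # assumption: 32 sub-banks split evenly over 4 banks
SUB_BANK_WIDTH_BYTES = BANK_WIDTH_BYTES // SUB_BANKS_PER_BANK  # 8 bytes, assumption

def locate(addr):
    """Return (bank, sub_bank, row) for a byte address under the assumed interleaving."""
    bank = (addr // BANK_WIDTH_BYTES) % NUM_BANKS            # banks interleave every 64 bytes
    lane = (addr % BANK_WIDTH_BYTES) // SUB_BANK_WIDTH_BYTES # sub-bank within the bank
    sub_bank = bank * SUB_BANKS_PER_BANK + lane              # global index SB0..SB31
    row = addr // (BANK_WIDTH_BYTES * NUM_BANKS)             # word line inside the sub-bank
    return bank, sub_bank, row

print(locate(200))   # byte 200 falls in bank 3, sub-bank 25, row 0 under these assumptions
```

With this mapping, consecutive 64-byte vectors land in different banks, while elements spaced 256 bytes apart land in the same sub-bank on different rows.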
SuperGather
Hardware Operations
• Gathers: vector of addresses → vector register
– Non-blocking operations – stall on data use
– Up to 4 outstanding gather operations in flight
• Scatters: vector register → vector of addresses
– Posted operations
– Up to 2 outstanding scatter operations in flight
• Up to 32 8-/16-bit elements or 16 32-bit elements read/written per cycle
– Reads and writes to different sub-banks performed in parallel
– Multiple reads or writes to same sub-bank address combined into single access
– Reads overlapped across different gathers; writes overlapped across different
scatters
[Figure: vector register elements 0–7 gathered from sub-banks SB0–SB3]
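The conflict rules above (parallel access to different sub-banks, combining of accesses to the same sub-bank address) suggest a simple cost model. The following is a rough sketch, not the hardware's actual scheduling: it assumes word-interleaved sub-banks, counts one cycle per distinct row touched in the busiest sub-bank, and ignores the per-cycle element limit and overlap between gathers. `gather_cycles` and its parameters are hypothetical.

```python
from collections import defaultdict

def gather_cycles(word_addrs, num_sub_banks=4):
    """Estimate cycles for one gather under a simplified conflict model.

    Different sub-banks are read in parallel; reads of the same address in a
    sub-bank are combined; distinct rows in one sub-bank are read serially.
    """
    rows_per_sub_bank = defaultdict(set)
    for addr in word_addrs:
        rows_per_sub_bank[addr % num_sub_banks].add(addr // num_sub_banks)
    return max((len(rows) for rows in rows_per_sub_bank.values()), default=0)

print(gather_cycles([0, 1, 2, 3]))    # one element per sub-bank
print(gather_cycles([0, 4, 8, 12]))   # four different rows in the same sub-bank
print(gather_cycles([5, 5, 5, 5]))    # identical addresses are combined
```

The three calls cover the best case, the worst case for four elements, and the combining case.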
Gather-Scatter
Programming Model
• Gather and scatter operations are invoked via C intrinsics
– C compiler schedules all operations including gather and scatter
– Loop unrolling and software pipelining handle gather and scatter
– C type qualifier ‘restrict’ supported for gather and scatter base pointers
• Example: update N independent histograms in parallel (HoG)
– Placement of each histogram in a single sub-bank bounds conflicts
LOOP for vBins, vWeights in vInput:
    vOffsets = vBins * pitch + vBase
    vCounts = GATHER(base, vOffsets)
    SCATTER(vCounts + vWeights, base, vOffsets)
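The histogram pseudocode can be simulated in plain Python to see why it is safe. This is a behavioral sketch only: `gather`, `scatter`, and `update_histograms` are hypothetical stand-ins for the C intrinsics, and the memory layout (bin b of histogram k at offset `b * pitch + k`, so each histogram occupies its own column) is an assumption consistent with the offset formula on the slide.

```python
def gather(mem, offsets):
    """Read one element per lane from arbitrary offsets."""
    return [mem[o] for o in offsets]

def scatter(values, mem, offsets):
    """Write one element per lane to arbitrary offsets."""
    for v, o in zip(values, offsets):
        mem[o] = v

def update_histograms(mem, v_input, pitch, v_base):
    """Lane k updates histogram k; v_base holds each lane's column offset."""
    for v_bins, v_weights in v_input:
        v_offsets = [b * pitch + base for b, base in zip(v_bins, v_base)]
        v_counts = gather(mem, v_offsets)
        scatter([c + w for c, w in zip(v_counts, v_weights)], mem, v_offsets)
    return mem

# 4 lanes (histograms), 3 bins each, stored with pitch 4.
mem = update_histograms([0] * 12,
                        [([0, 1, 2, 0], [1, 1, 1, 1]),
                         ([0, 1, 0, 0], [2, 2, 2, 2])],
                        pitch=4, v_base=[0, 1, 2, 3])
print(mem)
```

Because each lane only ever touches its own histogram, the offsets within one vector are always distinct, so the read-modify-write never loses an update within a single gather/scatter pair.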
Gather-Scatter
Performance
• Overall benchmark speedup
relative to no gather-scatter
– 2X to 10X over various kernels
• Gather cycle count reduction
due to overlap
– 0.5X to 1X image warp cases
[Figure: Cycles / Gather, Image Warping: three series over cases 1–12]
[Figure: Speedup relative to wide vector memory, axis from 0 to 12]
Vector Load/Store, Shuffle Operations and DMA
• Efficient data reorganization to support computation
– Aligning load/store operations access vectors of arbitrary
length and/or alignment in memory
– Shuffle operations perform arbitrary shuffles/routing of
vector register data across SIMD lanes
• Integrated DMA
– 1-D and 2-D transfers between local data RAM and
external memory or within local data RAM
– Arbitrary alignment of source, destination
– Independent source and destination pitch
– Up to 64 outstanding external memory requests to
deal with large system memory latencies
[Figure: 2-D DMA transfer parameters: number of rows, row size, source pitch, destination pitch]
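The four transfer parameters in the figure are enough to describe a 2-D tile copy. The sketch below models such a transfer in software, assuming byte-addressed flat buffers; `dma_2d` and its argument names are illustrative and do not correspond to any actual DMA descriptor format.

```python
def dma_2d(src, src_start, src_pitch, dst, dst_start, dst_pitch, row_size, num_rows):
    """Copy num_rows rows of row_size bytes, with independent source and destination pitch."""
    for r in range(num_rows):
        s = src_start + r * src_pitch   # start of row r in the source
        d = dst_start + r * dst_pitch   # start of row r in the destination
        dst[d:d + row_size] = src[s:s + row_size]
    return dst

# Extract a 3-byte-wide, 2-row tile at (row 1, column 2) from an 8-byte-wide image
# and pack it densely into a local buffer.
image = bytes(range(32))                # 4 rows x 8 bytes
tile = dma_2d(image, 1 * 8 + 2, 8, bytearray(6), 0, 3, row_size=3, num_rows=2)
print(list(tile))
```

A destination pitch smaller than the source pitch packs a tile from a wide frame into compact local memory, which is the typical use when staging image tiles for processing.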
Enhanced Memory Parallelism and Data Movement
Summary
• Memory parallelism and flexible data movement key to high
performance across a wide range of application kernels
• Gather-scatter provides significant performance increase for modest
incremental hardware increase by exploiting existing memory structure
• Memory bandwidth balanced with compute requirements
Type                         Peak Bandwidth
Aligned / Unaligned Loads    2 x 64 bytes/cycle
Aligned / Unaligned Stores   1 x 64 bytes/cycle
Gathers                      64 bytes/cycle
Scatters                     64 bytes/cycle
DMA                          64 bytes/cycle
CNN Enhancements
Neural Network Applications
• Neural networks need lots of
compute – especially multiply-add
• Two key compute engine metrics:
– Scaling to high total compute
– High multiply-add per watt
• Vision DSPs target emerging
deployment opportunities
– High total compute at good efficiency
– Flexibility and programmability for
related processing (e.g. ROI
extraction) and evolving network
structures
[Figure: Throughput and Efficiency for CNN. Log-log plot of billions of multiplies per W (10 to 10,000) versus billions of multiplies per second (10 to 100,000), with regions for Server GPUs, FPGAs, Embedded GPUs, Vision DSPs, and CNN DSPs]
3D-Filter Interpretation of CNN Layer
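Input feature maps: $x_1, x_2, \dots, x_j, \dots, x_D$
Output feature maps: $y_1, y_2, \dots, y_i, \dots, y_N$
$$y_i = \sum_{j=1}^{D} F_{ij} \circ x_j$$
• One 3-D filter per layer output $i$
• Each filter has $L = H \cdot W \cdot D$ coefficients
• There are $N$ such filters
[Figure: convolutional layer; each 3-D filter of height H, width W, and depth D is drawn as a stack of 2-D slices labeled F_{i1}, F_{i2}, F_{i3}, …, F_{iM}]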
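The 3-D filter view can be written out as a small reference implementation. This is a sketch in plain Python, assuming that the operator in the formula is a stride-1 sliding-window multiply-accumulate without padding (cross-correlation, as is common in CNN code); `conv_layer` and the nested-list layout are illustrative choices.

```python
def conv_layer(x, F):
    """x[j][r][c]: D input maps; F[i][j][u][v]: N filters, each D slices of H x W.

    Returns y[i][r][c] = sum over j, u, v of F[i][j][u][v] * x[j][r + u][c + v].
    """
    D, rows, cols = len(x), len(x[0]), len(x[0][0])
    H, W = len(F[0][0]), len(F[0][0][0])
    out_r, out_c = rows - H + 1, cols - W + 1
    y = []
    for Fi in F:                                   # one 3-D filter per output map
        y_i = [[0] * out_c for _ in range(out_r)]
        for j in range(D):                         # accumulate over input maps
            for r in range(out_r):
                for c in range(out_c):
                    y_i[r][c] += sum(Fi[j][u][v] * x[j][r + u][c + v]
                                     for u in range(H) for v in range(W))
        y.append(y_i)
    return y

x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],            # x_1
     [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]            # x_2
F = [[[[1, 0], [0, 1]],                            # F_11
      [[1, 1], [1, 1]]]]                           # F_12
print(conv_layer(x, F))
```

In this toy case the single filter has L = 2 · 2 · 2 = 8 coefficients, and every output element costs L multiply-accumulates, which is the data reuse the following slides exploit.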
CNN Enhancements
Opportunity
• 3-D convolution: significant data reuse possible
– Load bandwidth supports > 1 MAC per element
• Fixed-point types appropriate for computation
– Moving to 8-bit types for inference
• Additional MACs enable significantly improved
CNN kernel performance
– Incremental hardware increase for additional MACs
– 2X – 4X peak MAC performance of Vision P5
[Figure: Traffic Sign Recognition, Accuracy vs. Data Type: accuracy for FP, 16x8, and 8x8, on an axis from 99.00% to 100.00%]
[Figure: sliding kernel window]
CNN Enhancements
Operations
• Paired 16x16, 8x16 and 8x8 MAC operations
• Quad 8x8 MAC operations
• Variants support vectorizing in different dimensions
– Different networks and layers within a given network
often need different approaches for best performance
Paired, scalar x vector: acc[i] += c0 x v0[i] + c1 x v1[i]
Paired, vector x vector: acc[i] += v0[i] x v1[i] + v2[i] x v3[i]
Quad, scalar x vector: acc[i] += c0 x v0[i] + c1 x v1[i] + c2 x v2[i] + c3 x v3[i]
[Figure: MUL4TA2N8XR8 datapath: four multipliers take byte-shifted views of the input vector pair (vr[511:0], {vs[7:0],vr[511:8]}, {vs[15:0],vr[511:16]}, {vs[23:0],vr[511:24]}) and the coefficient bytes arr[7:0], arr[15:8], arr[23:16], arr[31:24]; the products are summed and accumulated into wvt[1535:0]]
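The three MAC forms above can be expressed as per-lane reference functions. This is a behavioral sketch only: the function names are hypothetical, Python integers stand in for the hardware accumulators (no overflow or saturation is modeled), and lists stand in for vector registers.

```python
def mac_pair_scalar_vector(acc, c0, v0, c1, v1):
    """acc[i] += c0 * v0[i] + c1 * v1[i]"""
    return [a + c0 * x0 + c1 * x1 for a, x0, x1 in zip(acc, v0, v1)]

def mac_pair_vector_vector(acc, v0, v1, v2, v3):
    """acc[i] += v0[i] * v1[i] + v2[i] * v3[i]"""
    return [a + p * q + r * s for a, p, q, r, s in zip(acc, v0, v1, v2, v3)]

def mac_quad_scalar_vector(acc, coeffs, vectors):
    """acc[i] += sum of coeffs[k] * vectors[k][i] for k = 0..3"""
    return [a + sum(c * v[i] for c, v in zip(coeffs, vectors))
            for i, a in enumerate(acc)]

print(mac_pair_scalar_vector([0, 0], 2, [1, 2], 3, [4, 5]))
print(mac_quad_scalar_vector([0, 0], [1, 2, 3, 4],
                             [[1, 1], [2, 2], [3, 3], [4, 4]]))
```

With 64 lanes of 8-bit data, the quad form performs 64 x 4 = 256 multiply-accumulates per operation, matching the peak 8x8 figure quoted earlier in the deck.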
CNN Performance
• 3-D convolution kernel
– Input: 14x14x64 (8-bit); conv: 5x5x64, stride 1; output: 10x10x64 (8-bit)
             Multiplier Utilization   Relative Performance
Vision P5    95%                      1.0
Vision P6    80%                      3.3
• AlexNet layer 1 convolution
– Input: 227x227x3 (8-bit); conv: 11x11x3, stride 4; output: 55x55x96 (8-bit)
             Multiplier Utilization
Vision P6    57%
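The layer shapes and utilization figures above can be turned into a back-of-the-envelope cycle estimate. The sketch below checks the output sizes with the usual no-padding formula and then divides the total MAC count by peak throughput times utilization. It assumes the utilization percentages are measured against the 256 MACs/cycle 8x8 peak, which the slides do not state explicitly; all function names are illustrative.

```python
def conv_out(size, kernel, stride):
    """Output width/height of a convolution without padding."""
    return (size - kernel) // stride + 1

def total_macs(out_h, out_w, out_d, k_h, k_w, k_d):
    """One MAC per filter coefficient per output element."""
    return out_h * out_w * out_d * k_h * k_w * k_d

def estimated_cycles(macs, peak_macs_per_cycle, utilization):
    return macs / (peak_macs_per_cycle * utilization)

# 3-D convolution kernel: 14x14x64 input, 5x5x64 filters, stride 1, 64 outputs.
side = conv_out(14, 5, 1)
macs = total_macs(side, side, 64, 5, 5, 64)
print(side, macs, round(estimated_cycles(macs, 256, 0.80)))

# AlexNet layer 1: 227x227x3 input, 11x11x3 filters, stride 4, 96 outputs.
side = conv_out(227, 11, 4)
macs = total_macs(side, side, 96, 11, 11, 3)
print(side, macs, round(estimated_cycles(macs, 256, 0.57)))
```

The computed output sizes (10 and 55) agree with the shapes listed on the slide; the cycle counts are estimates under the stated assumption, not measured results.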
Vision P6 Demo
Traffic Sign Recognition: CNN
Tensilica® Vision DSP running a Convolutional Neural Network (CNN) prediction algorithm
Compares the input signs with 43 trained signs
[Figure: processing pipeline with stages labeled Input Image, Localization, Cropped Proposal, CNN Prediction, System Proposal Probability, and Image with box 'detected']
[Figure: CNN structure: 32x32 input, 5x5 filter, 28x28x108, 2x2, 14x14x108, 5x5 filter, 10x10x200, 2x2, 5x5x200; a 7x7x108 feature map (2x2) is also shown; 2-stage FC with 100 neurons and 43 neurons; output is one of the 43 signs]
People Detection and Face Detection on the Tensilica® Vision P6 DSP
People detection: an example of computer vision using the histogram of oriented gradients (HoG) algorithm
• 64x128 detection window, 1.2 scale factor, L1 histogram normalization
• Market: automotive, drone, and robot
Applications: collision avoidance, object detection, passenger detection
• Market: surveillance
Applications: people counting, people detection/tracking
Face detection: an example of computer vision using the OpenCV Haar cascade algorithm
• Full detection every frame
• 22 levels
• Market: mobile, automotive, wearable, tablets
Applications: selective auto exposure, white balance, focus, face tracking, face authentication
© 2016 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo, and the other Cadence marks
found at www.cadence.com/go/trademarks are trademarks or registered trademarks of Cadence Design Systems, Inc. All other
trademarks are the property of their respective holders.

Más contenido relacionado

Destacado

Electiva iv, act1, 3 corte
Electiva iv, act1, 3 corteElectiva iv, act1, 3 corte
Electiva iv, act1, 3 corteRAFAEL QUIARAGUA
 
English i text set (1)
English i text set (1)English i text set (1)
English i text set (1)Jillian Lyles
 
презентація Данте
презентація Дантепрезентація Данте
презентація Дантеdtamara123
 
چطور به فرزندان مان در اولین تجربه مدرسه کمک کنیم؟
چطور به فرزندان مان در اولین تجربه مدرسه کمک کنیم؟چطور به فرزندان مان در اولین تجربه مدرسه کمک کنیم؟
چطور به فرزندان مان در اولین تجربه مدرسه کمک کنیم؟مشاور کودک
 
Сучасна література Гавальда Анна " 35 кіло надії"
Сучасна література Гавальда Анна " 35 кіло надії"Сучасна література Гавальда Анна " 35 кіло надії"
Сучасна література Гавальда Анна " 35 кіло надії"dtamara123
 
Gastric sleeve surgery diet
Gastric sleeve surgery dietGastric sleeve surgery diet
Gastric sleeve surgery dietarnob mahmud
 
11 клас Профільний рівень
11 клас Профільний рівень11 клас Профільний рівень
11 клас Профільний рівеньdtamara123
 
Презентация до уроку "Характеристика образу титана Прометея"
Презентация до уроку "Характеристика образу титана Прометея"Презентация до уроку "Характеристика образу титана Прометея"
Презентация до уроку "Характеристика образу титана Прометея"dtamara123
 
تک فرزندی و گرایش به رسانه
تک فرزندی و گرایش به رسانهتک فرزندی و گرایش به رسانه
تک فرزندی و گرایش به رسانهمشاور کودک
 
انتخاب رشته تحصیلی مناسب
انتخاب رشته تحصیلی مناسبانتخاب رشته تحصیلی مناسب
انتخاب رشته تحصیلی مناسبمشاور کودک
 
fashion jewelry trend ( http://pwoshop.com )
fashion jewelry trend  ( http://pwoshop.com )fashion jewelry trend  ( http://pwoshop.com )
fashion jewelry trend ( http://pwoshop.com )Pwoshop
 

Destacado (18)

Electiva iv, act1, 3 corte
Electiva iv, act1, 3 corteElectiva iv, act1, 3 corte
Electiva iv, act1, 3 corte
 
English i text set (1)
English i text set (1)English i text set (1)
English i text set (1)
 
презентація Данте
презентація Дантепрезентація Данте
презентація Данте
 
slid1
slid1slid1
slid1
 
Geoeconomia
GeoeconomiaGeoeconomia
Geoeconomia
 
چطور به فرزندان مان در اولین تجربه مدرسه کمک کنیم؟
چطور به فرزندان مان در اولین تجربه مدرسه کمک کنیم؟چطور به فرزندان مان در اولین تجربه مدرسه کمک کنیم؟
چطور به فرزندان مان در اولین تجربه مدرسه کمک کنیم؟
 
Introducción
IntroducciónIntroducción
Introducción
 
new
newnew
new
 
انگشت مکیدن کودک
انگشت مکیدن کودکانگشت مکیدن کودک
انگشت مکیدن کودک
 
Exam2
Exam2Exam2
Exam2
 
Seattle chiopractic ppt
Seattle chiopractic pptSeattle chiopractic ppt
Seattle chiopractic ppt
 
Сучасна література Гавальда Анна " 35 кіло надії"
Сучасна література Гавальда Анна " 35 кіло надії"Сучасна література Гавальда Анна " 35 кіло надії"
Сучасна література Гавальда Анна " 35 кіло надії"
 
Gastric sleeve surgery diet
Gastric sleeve surgery dietGastric sleeve surgery diet
Gastric sleeve surgery diet
 
11 клас Профільний рівень
11 клас Профільний рівень11 клас Профільний рівень
11 клас Профільний рівень
 
Презентация до уроку "Характеристика образу титана Прометея"
Презентация до уроку "Характеристика образу титана Прометея"Презентация до уроку "Характеристика образу титана Прометея"
Презентация до уроку "Характеристика образу титана Прометея"
 
تک فرزندی و گرایش به رسانه
تک فرزندی و گرایش به رسانهتک فرزندی و گرایش به رسانه
تک فرزندی و گرایش به رسانه
 
انتخاب رشته تحصیلی مناسب
انتخاب رشته تحصیلی مناسبانتخاب رشته تحصیلی مناسب
انتخاب رشته تحصیلی مناسب
 
fashion jewelry trend ( http://pwoshop.com )
fashion jewelry trend  ( http://pwoshop.com )fashion jewelry trend  ( http://pwoshop.com )
fashion jewelry trend ( http://pwoshop.com )
 

Similar a HC28.22.430-Vision-Neural-Net-GregEfland-Cadence-v02-57

Cisco connect montreal 2018 compute v final
Cisco connect montreal 2018   compute v finalCisco connect montreal 2018   compute v final
Cisco connect montreal 2018 compute v finalCisco Canada
 
From Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the UnexpectedFrom Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the UnexpectedDataCore Software
 
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta..."The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...Edge AI and Vision Alliance
 
The Smarter Car for Autonomous Driving
 The Smarter Car for Autonomous Driving The Smarter Car for Autonomous Driving
The Smarter Car for Autonomous DrivingHeiko Joerg Schick
 
ONF & iSDX Webinar
ONF & iSDX WebinarONF & iSDX Webinar
ONF & iSDX WebinarKatie Hyman
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark DataWorks Summit/Hadoop Summit
 
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...StampedeCon
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016Mathieu Dumoulin
 
SnappyData @ Seattle Spark Meetup
SnappyData @ Seattle Spark MeetupSnappyData @ Seattle Spark Meetup
SnappyData @ Seattle Spark MeetupSnappyData
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainMapR Technologies
 
TidalScale Overview
TidalScale OverviewTidalScale Overview
TidalScale OverviewPete Jarvis
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR Technologies
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networksinside-BigData.com
 
Applying Deep Learning Vision Technology to low-cost/power Embedded Systems
Applying Deep Learning Vision Technology to low-cost/power Embedded SystemsApplying Deep Learning Vision Technology to low-cost/power Embedded Systems
Applying Deep Learning Vision Technology to low-cost/power Embedded SystemsJenny Midwinter
 
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision System
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision SystemHai Tao at AI Frontiers: Deep Learning For Embedded Vision System
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision SystemAI Frontiers
 
Webinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-ServiceWebinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-ServiceMongoDB
 
Walmart & IBM Revisit the Linear Road Benchmark- Roger Rea, IBM
Walmart & IBM Revisit the Linear Road Benchmark- Roger Rea, IBMWalmart & IBM Revisit the Linear Road Benchmark- Roger Rea, IBM
Walmart & IBM Revisit the Linear Road Benchmark- Roger Rea, IBMRedis Labs
 
Informix - The Ideal Database for IoT
Informix - The Ideal Database for IoTInformix - The Ideal Database for IoT
Informix - The Ideal Database for IoTPradeep Natarajan
 
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...Big Data Value Association
 

Similar a HC28.22.430-Vision-Neural-Net-GregEfland-Cadence-v02-57 (20)

Cisco connect montreal 2018 compute v final
Cisco connect montreal 2018   compute v finalCisco connect montreal 2018   compute v final
Cisco connect montreal 2018 compute v final
 
From Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the UnexpectedFrom Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the Unexpected
 
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta..."The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
"The Xilinx AI Engine: High Performance with Future-proof Architecture Adapta...
 
The Smarter Car for Autonomous Driving
 The Smarter Car for Autonomous Driving The Smarter Car for Autonomous Driving
The Smarter Car for Autonomous Driving
 
ONF & iSDX Webinar
ONF & iSDX WebinarONF & iSDX Webinar
ONF & iSDX Webinar
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
 
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 
SnappyData @ Seattle Spark Meetup
SnappyData @ Seattle Spark MeetupSnappyData @ Seattle Spark Meetup
SnappyData @ Seattle Spark Meetup
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and Rain
 
TidalScale Overview
TidalScale OverviewTidalScale Overview
TidalScale Overview
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community Edition
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
Applying Deep Learning Vision Technology to low-cost/power Embedded Systems
Applying Deep Learning Vision Technology to low-cost/power Embedded SystemsApplying Deep Learning Vision Technology to low-cost/power Embedded Systems
Applying Deep Learning Vision Technology to low-cost/power Embedded Systems
 
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision System
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision SystemHai Tao at AI Frontiers: Deep Learning For Embedded Vision System
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision System
 
Webinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-ServiceWebinar: Enterprise Trends for Database-as-a-Service
Webinar: Enterprise Trends for Database-as-a-Service
 
Walmart & IBM Revisit the Linear Road Benchmark- Roger Rea, IBM
Walmart & IBM Revisit the Linear Road Benchmark- Roger Rea, IBMWalmart & IBM Revisit the Linear Road Benchmark- Roger Rea, IBM
Walmart & IBM Revisit the Linear Road Benchmark- Roger Rea, IBM
 
updatedElectrical
updatedElectricalupdatedElectrical
updatedElectrical
 
Informix - The Ideal Database for IoT
Informix - The Ideal Database for IoTInformix - The Ideal Database for IoT
Informix - The Ideal Database for IoT
 
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
 

HC28.22.430-Vision-Neural-Net-GregEfland-Cadence-v02-57

  • 1. High Performance DSP for Vision, Imaging and Neural Networks Greg Efland, Sandip Parikh, Himanshu Sanghavi, Aamir Farooqui 2016 Hot Chips Flint Center for the Performing Arts, Cupertino, California August 21-23, 2016
  • 2. 2 © 2016 Cadence Design Systems, Inc. All rights reserved. Vision Processors Emerging Applications Wearables and IoTSecurity Automotive Gesture ControlTraffic Sign Recognition Mobile People Detection HDR Video Stabilizer Face Detection Drone 3D Vision
  • 3. 3 © 2016 Cadence Design Systems, Inc. All rights reserved. Vision Processors Opportunity • Ever increasing demand for vision and image processing – Cameras everywhere – Dramatic complexity expansion of vision in cars, IoT, mobile, consumer – Emergence of compute-hungry neural networks • Vision Processors are designed for the complex algorithms in computer vision and imaging – Programmable architectures for rapidly evolving and diverse applications – Memory architectures designed for vision and imaging application needs – Power-efficient solutions for vision and imaging applications
  • 4. 4 © 2016 Cadence Design Systems, Inc. All rights reserved. Cadence Tensilica Vision Processors Evolution • Architecture evolved over 4 generations – IVP32 (2013)  IVP-EP (2014)  Vision P5 (2015)  Vision P6 (2016) – Member of large family of Cadence DSPs • Key characteristics – Wide SIMD processing: 512-bit width, 64-/32-/16-way processing – Local memory based: multiple banks, multiple 512-bit load/store – FLIX™ instructions (VLIW): 5 slots, multiple formats, scalar / vector mix • Delivered as soft IP – Customer extensible with new operations and interfaces – Customer configured according to requirements
  • 5. 5 © 2016 Cadence Design Systems, Inc. All rights reserved. Real Time Vision with Cadence XtensaTM Foundation, XtensaTM Processor Generator Hundreds of customers, thousands of designs, 5 billion cores Vision Libraries and Application Examples OpenCV, OpenVx, CNN Framework Unified Software Development Environment One GUI for configuration, compilation, debug, modeling, RTOS Custom XtensaTM 1B cores in imaging/video/vision Vision P6: Fourth generation vision DSP Fastest DSP for vision – classic and CNN Cadence Development Platforms Virtual Platforms, FPGA development systems, at-speed silicon Third-Party Imaging and Vision Applications Automotive, mobile, surveillance, IOT, CNN
  • 6. 6 © 2016 Cadence Design Systems, Inc. All rights reserved. Cadence Tensilica Vision Processors Evolution • Vision P6 – Significantly improved performance on a wide range of imaging and vision applications compared to predecessors – Performance improvements achieved through incremental increases in area and power while maintaining efficiency and software compatibility • Key areas of improvement – Data type diversity – Enhanced memory parallelism and data movement – CNN enhancements
  • 7. 7 © 2016 Cadence Design Systems, Inc. All rights reserved. Tensilica Vision P6 DSP Enhanced Data Movement  Aligning vector load/store  Integrated 2-D DMA Enhanced Memory Parallelism  SuperGather™ CNN Enhancements  256 MACs Diverse Data Type Support  8-/16-/32-bit fixed-point  Single-/half-precision FP
  • 8. 8 © 2016 Cadence Design Systems, Inc. All rights reserved. Cadence Tensilica Vision Processors Evolution • Vision P6 status – Sampled to early engagement customers – General release October 2016 This slide contains forward-looking statements regarding Cadence's business or products. Actual results may differ materially from the information presented here.
  • 9. 9 © 2016 Cadence Design Systems, Inc. All rights reserved. Data Type Diversity
  • 10. 10 © 2016 Cadence Design Systems, Inc. All rights reserved. Data Type Diversity • Data type requirements vary by application – A wide range of supported data types enables a wide range of applications – Configurable data type support enables efficient targeted solutions • Vision P6 adds vector floating-point support – Leverages existing resources – vector register files and load/store operations – Incremental cost – floating-point data paths – Optional – no cost for applications that don’t require floating-point • Vision P6 computation data types – Fixed-point: 8-/16-/32-bit – Floating-point: single-/half-precision
  • 11. 11 © 2016 Cadence Design Systems, Inc. All rights reserved. Data Type Diversity Fixed-Point Data Types • Fixed-point – Primary data type for many applications – 8-bit, 16-bit most common computation types – 8-bit increasingly important – e.g. CNN network inference – 32-bit for precision – e.g. perspective warp mapping functions – Multiple 8x8/8x16/16x16 MACs per SIMD way Data Type SIMD MACs per Cycle 8-bit fixed-point 64-way 256 (8x8) / 128 (8x16) 16-bit fixed-point 32-way 64 (16x16) 32-bit fixed-point 16-way 16 (16x32) / 8 (32x32)
  • 12. 12 © 2016 Cadence Design Systems, Inc. All rights reserved. Data Type Diversity Floating-Point Data Types • Floating-point – For dynamic range – e.g. RANSAC – Ease porting of applications from other platforms – Support both scalar and N-way vector processing – Single-precision and half-precision independently configurable Data Type SIMD MADDs per Cycle 32-bit single precision 16-way 16 16-bit half precision 32-way 32
  • 13. 13 © 2016 Cadence Design Systems, Inc. All rights reserved. Enhanced Memory Parallelism and Data Movement
  • 14. Gather-Scatter: Opportunity
      • A number of kernels access data in disparate memory locations
          – Not simple fixed-stride accesses: data in arbitrary rows and/or columns
          – Hard to vectorize efficiently using wide loads/stores, since arbitrary movement of data across vector registers, particularly across rows, can be expensive
          – If data is sparse, wide loads/stores also waste memory bandwidth and energy
      [Figure: input and output access patterns for Canny Edge Detection and Image Warping]
  • 15. Gather-Scatter: Opportunity
      • Vision P6 supports multiple banks per local data RAM
          – Up to 4 banks: multiple load/store operations and DMA transfers per cycle
          – Banks are 512 bits wide, interleaved on low-order address bits
          – Typically implemented using multiple narrow memories: up to 32 sub-banks
      • Enable software to exploit independently addressable sub-banks
          – Incremental hardware increase leverages existing memory structure
          – Significant performance increase for a number of applications
      [Figure: Bank 0 through Bank 3, divided into sub-banks SB0 to SB7 … SB24 to SB31]
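To make the bank and sub-bank organization concrete, here is a rough sketch of how a byte address might map onto it. The 4 banks, 512-bit bank width and 32 sub-banks come from the slide; the exact bit assignment (consecutive 64-byte lines rotating across banks, 8-byte sub-bank words) is an assumption for illustration, since the slides only say "interleaved on low-order address bits". The function name `locate` is hypothetical.

```python
BANKS = 4                  # from the slide: up to 4 banks
SUB_BANKS_PER_BANK = 8     # 32 sub-banks total / 4 banks
BANK_WIDTH_BYTES = 64      # 512-bit bank width
# Assumption: each sub-bank supplies an equal slice of the 512-bit bank word.
SUB_BANK_WIDTH_BYTES = BANK_WIDTH_BYTES // SUB_BANKS_PER_BANK  # 8 bytes


def locate(byte_addr):
    """Return (bank, global sub-bank index, row) for a byte address.

    Assumed layout: consecutive 64-byte lines rotate across the banks,
    and each line is split across the 8 sub-banks of its bank.
    """
    line = byte_addr // BANK_WIDTH_BYTES
    bank = line % BANKS
    row = line // BANKS
    sub_bank_in_bank = (byte_addr % BANK_WIDTH_BYTES) // SUB_BANK_WIDTH_BYTES
    return bank, bank * SUB_BANKS_PER_BANK + sub_bank_in_bank, row


if __name__ == "__main__":
    for addr in (0, 8, 64, 255, 256):
        print(addr, locate(addr))
```

With this layout, a 256-byte aligned block touches every sub-bank exactly once, which is what lets wide accesses and element-wise accesses share the same memories.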
  • 16. SuperGather: Hardware Operations
      • Gathers: vector of addrs → vector register
          – Non-blocking operations: stall on data use
          – Up to 4 outstanding gather operations in flight
      • Scatters: vector register → vector of addrs
          – Posted operations
          – Up to 2 outstanding scatter operations in flight
      • Up to 32 8/16-bit elements or 16 32-bit elements read / written per cycle
          – Reads and writes to different sub-banks performed in parallel
          – Multiple reads or writes to the same sub-bank address combined into a single access
          – Reads overlapped across different gathers; writes overlapped across different scatters
      [Figure: vector register elements 0 to 7 mapped to sub-banks SB0 to SB3]
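The access rules above suggest a simple way to estimate how many memory cycles one gather needs. The following is a toy model, not the hardware's actual arbitration: it assumes 32 sub-banks of 8-byte words interleaved on low-order address bits, one access per sub-bank per cycle, and that every element fits inside one sub-bank word. The name `gather_cycles` is hypothetical.

```python
from collections import defaultdict

SUB_BANKS = 32             # from the slides: up to 32 sub-banks
SUB_BANK_WIDTH_BYTES = 8   # assumption: 512-bit bank split across 8 sub-banks


def gather_cycles(byte_addrs):
    """Estimate the memory cycles needed to service one gather.

    Accesses to different sub-banks proceed in parallel, and accesses to
    the same word of the same sub-bank are combined, so the estimate is
    the largest number of distinct words requested from any one sub-bank.
    """
    words_per_sub_bank = defaultdict(set)
    for addr in byte_addrs:
        word = addr // SUB_BANK_WIDTH_BYTES
        words_per_sub_bank[word % SUB_BANKS].add(word // SUB_BANKS)
    return max((len(words) for words in words_per_sub_bank.values()), default=0)


if __name__ == "__main__":
    spread = [i * 8 for i in range(32)]      # one element per sub-bank
    same = [40] * 32                         # every lane reads one address
    clash = [i * 256 for i in range(32)]     # every lane hits sub-bank 0
    print(gather_cycles(spread), gather_cycles(same), gather_cycles(clash))
```

The three cases show the best case (fully spread), the combining case (identical addresses) and the worst case (all lanes on different words of one sub-bank).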
  • 17. Gather-Scatter: Programming Model
      • Gather and scatter operations are invoked via C intrinsics
          – C compiler schedules all operations, including gather and scatter
          – Loop unrolling and software pipelining handle gather and scatter
          – C type qualifier ‘restrict’ supported for gather and scatter base pointers
      • Example: update N independent histograms in parallel (HoG)
          – Placement of each histogram in a single sub-bank bounds conflicts
      LOOP for vBins, vWeights in vInput:
          vOffsets = vBins * pitch + vBase
          vCounts = GATHER(base, vOffsets)
          SCATTER(vCounts + vWeights, base, vOffsets)
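The histogram loop on this slide can be modelled in plain Python to show why per-lane histograms avoid collisions. This is a behavioural sketch only: `gather`, `scatter` and `parallel_histograms` are hypothetical stand-ins for the C intrinsics, and the interleaved layout (bin b of lane l stored at offset b * pitch + l, with pitch equal to the lane count) is an assumed placement, chosen so that each lane's histogram stays in its own column of memory.

```python
def gather(memory, offsets):
    """Read one element per lane from arbitrary offsets."""
    return [memory[o] for o in offsets]


def scatter(values, memory, offsets):
    """Write one element per lane to arbitrary offsets."""
    for value, offset in zip(values, offsets):
        memory[offset] = value


def parallel_histograms(bins_per_step, weights_per_step, lanes, num_bins):
    """Update `lanes` independent histograms, one per SIMD lane."""
    pitch = lanes                      # assumed layout: bins interleaved by lane
    memory = [0] * (num_bins * pitch)
    v_base = list(range(lanes))        # lane l owns offsets congruent to l mod pitch
    for v_bins, v_weights in zip(bins_per_step, weights_per_step):
        v_offsets = [b * pitch + base for b, base in zip(v_bins, v_base)]
        v_counts = gather(memory, v_offsets)
        updated = [c + w for c, w in zip(v_counts, v_weights)]
        scatter(updated, memory, v_offsets)
    # Unpack into one list of bins per lane for inspection.
    return [[memory[b * pitch + lane] for b in range(num_bins)]
            for lane in range(lanes)]


if __name__ == "__main__":
    bins = [[0, 2], [0, 1], [2, 2]]
    weights = [[1, 1], [2, 3], [1, 1]]
    print(parallel_histograms(bins, weights, lanes=2, num_bins=3))
```

Because each lane only ever touches its own offsets, two lanes never update the same location in one step, so the gather, add, scatter sequence stays correct without any conflict handling.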
  • 18. Gather-Scatter: Performance
      • Overall benchmark speedup relative to no gather-scatter
          – 2X to 10X over various kernels
      • Gather cycle count reduction due to overlap
          – 0.5X to 1X in image warp cases
      [Figure: Cycles / Gather, Image Warping (12 cases)]
      [Figure: Speedup relative to wide vector memory]
  • 19. Vector Load/Store, Shuffle Operations and DMA
      • Efficient data reorganization to support computation
          – Aligning load/store operations access vectors of arbitrary length and/or alignment in memory
          – Shuffle operations perform arbitrary shuffles/routing of vector register data across SIMD lanes
      • Integrated DMA
          – 1-D and 2-D transfers between local data RAM and external memory or within local data RAM
          – Arbitrary alignment of source and destination
          – Independent source and destination pitch
          – Up to 64 outstanding external memory requests to deal with large system memory latencies
      [Figure: 2-D transfer parameters: number of rows, row size, source pitch, destination pitch]
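The four 2-D transfer parameters named in the figure (number of rows, row size, source pitch, destination pitch) can be illustrated with a small software model. This is only a sketch of the addressing, not of the DMA engine or its descriptor format; the function name `dma_2d` and its argument order are hypothetical.

```python
def dma_2d(src, src_start, src_pitch, dst, dst_start, dst_pitch,
           row_size, num_rows):
    """Copy num_rows rows of row_size bytes between two byte buffers.

    Row r is read at src_start + r * src_pitch and written at
    dst_start + r * dst_pitch, so the two pitches are independent.
    """
    for r in range(num_rows):
        s = src_start + r * src_pitch
        d = dst_start + r * dst_pitch
        if s + row_size > len(src) or d + row_size > len(dst):
            raise ValueError("transfer runs past the end of a buffer")
        dst[d:d + row_size] = src[s:s + row_size]


if __name__ == "__main__":
    # An 8x8 "image" in external memory with a pitch of 8 bytes.
    image = bytearray(range(64))
    # Pull a 3-wide, 2-high tile starting at row 1, column 2 into a
    # tightly packed local buffer (destination pitch equals row size).
    tile = bytearray(6)
    dma_2d(image, 1 * 8 + 2, 8, tile, 0, 3, row_size=3, num_rows=2)
    print(list(tile))
```

Using a destination pitch smaller than the source pitch, as here, is the usual way to pack an image tile densely into local memory.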
  • 20. Enhanced Memory Parallelism and Data Movement: Summary
      • Memory parallelism and flexible data movement are key to high performance across a wide range of application kernels
      • Gather-scatter provides a significant performance increase for a modest incremental hardware increase by exploiting existing memory structure
      • Memory bandwidth balanced with compute requirements
      Type                       | Peak Bandwidth
      Aligned / Unaligned Loads  | 2 x 64 bytes/cycle
      Aligned / Unaligned Stores | 1 x 64 bytes/cycle
      Gathers                    | 64 bytes/cycle
      Scatters                   | 64 bytes/cycle
      DMA                        | 64 bytes/cycle
  • 21. CNN Enhancements
  • 22. Neural Network Applications
      • Neural networks need lots of compute, especially multiply-add
      • Two key compute engine metrics:
          – Scaling to high total compute
          – High multiply-add per watt
      • Vision DSPs target emerging deployment opportunities
          – High total compute at good efficiency
          – Flexibility and programmability for related processing (e.g. ROI extraction) and evolving network structures
      [Figure: Throughput and Efficiency for CNN. x-axis: billions of multiplies per second (10 to 100,000); y-axis: billions of multiplies per W (10 to 10,000); series: Server GPUs, FPGAs, Embedded GPUs, Vision DSPs, CNN DSPs]
  • 23. 3D-Filter Interpretation of CNN Layer
      Convolutional Layer
      • Input feature maps: $x_1, x_2, \ldots, x_j, \ldots, x_D$
      • Output feature maps: $y_1, y_2, \ldots, y_i, \ldots, y_N$
      • One 3D filter per layer output $i$: $y_i = \sum_{j=1}^{D} F_{ij} \circ x_j$
      • Each filter has $L = H \cdot W \cdot D$ coefficients
      • There are $N$ such filters
      [Figure: 3D filter $i$ drawn as a stack of $H \times W$ slices of depth $D$, labeled $F_{i1}, F_{i2}, F_{i3}, \ldots, F_{iM}$ on the slide]
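The 3D-filter formula can be written out directly in code. The sketch below assumes that the operator ∘ is a "valid" 2-D sliding-window correlation with stride 1 and no padding, which is the common CNN convention but is not spelled out on the slide; the function names and the nested-list data layout are illustrative only.

```python
def correlate2d_valid(x, f):
    """Slide the H x W filter slice f over the 2-D map x (no padding, stride 1)."""
    fh, fw = len(f), len(f[0])
    out_h = len(x) - fh + 1
    out_w = len(x[0]) - fw + 1
    return [[sum(f[u][v] * x[r + u][c + v]
                 for u in range(fh) for v in range(fw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv_layer(x_maps, filters):
    """Compute y_i = sum over j of F_ij applied to x_j, for every filter i.

    x_maps is a list of D input maps; filters[i][j] is the slice F_ij.
    """
    outputs = []
    for filter_i in filters:
        partials = [correlate2d_valid(x_j, f_ij)
                    for x_j, f_ij in zip(x_maps, filter_i)]
        rows, cols = len(partials[0]), len(partials[0][0])
        outputs.append([[sum(p[r][c] for p in partials) for c in range(cols)]
                        for r in range(rows)])
    return outputs


def coefficients_per_filter(h, w, d):
    """L = H * W * D coefficients in one 3D filter."""
    return h * w * d


if __name__ == "__main__":
    x1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    x2 = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    f11 = [[1, 0], [0, 1]]
    f12 = [[1, 1], [1, 1]]
    print(conv_layer([x1, x2], [[f11, f12]]))
    print(coefficients_per_filter(5, 5, 64))
```

Each output element reuses every coefficient of its filter across all output positions, which is the data reuse the later slides rely on.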
  • 24. CNN Enhancements: Opportunity
      • 3-D convolution: significant data reuse possible
          – Load bandwidth supports > 1 MAC per element
      • Fixed-point types appropriate for computation
          – Moving to 8-bit types for inference
      • Additional MACs enable significantly improved CNN kernel performance
          – Incremental hardware increase for additional MACs
          – 2X to 4X the peak MAC performance of Vision P5
      [Figure: Traffic Sign Recognition Accuracy vs. Data Type. Data types: FP, 16x8, 8x8; accuracy axis from 99.00% to 100.00%]
      [Figure: Sliding Kernel Window]
  • 25. CNN Enhancements: Operations
      • Paired 16x16, 8x16 and 8x8 MAC operations
          – scalar x vector: acc[i] += c0 x v0[i] + c1 x v1[i]
          – vector x vector: acc[i] += v0[i] x v1[i] + v2[i] x v3[i]
      • Quad 8x8 MAC operations
          – scalar x vector: acc[i] += c0 x v0[i] + c1 x v1[i] + c2 x v2[i] + c3 x v3[i]
      • Variants support vectorizing in different dimensions
          – Different networks, and layers within a given network, often need different approaches for best performance
      [Figure: MUL4TA2N8XR8 data flow. Coefficient bytes arr[7:0], arr[15:8], arr[23:16], arr[31:24]; data vectors vr[511:0], {vs[7:0],vr[511:8]}, {vs[15:0],vr[511:16]}, {vs[23:0],vr[511:24]}; four multipliers feeding an adder; accumulator wvt[1535:0]]
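The three accumulate formulas on this slide can be modelled lane by lane. The sketch below is a behavioural illustration only: it uses Python integers, so it does not model operand widths, accumulator width, or saturation, and the function names are hypothetical rather than the names of any intrinsic.

```python
def paired_scalar_vector_mac(acc, c0, v0, c1, v1):
    """acc[i] += c0 * v0[i] + c1 * v1[i]"""
    return [a + c0 * x0 + c1 * x1 for a, x0, x1 in zip(acc, v0, v1)]


def paired_vector_vector_mac(acc, v0, v1, v2, v3):
    """acc[i] += v0[i] * v1[i] + v2[i] * v3[i]"""
    return [a + x0 * x1 + x2 * x3
            for a, x0, x1, x2, x3 in zip(acc, v0, v1, v2, v3)]


def quad_scalar_vector_mac(acc, coeffs, vectors):
    """acc[i] += c0*v0[i] + c1*v1[i] + c2*v2[i] + c3*v3[i]"""
    if len(coeffs) != 4 or len(vectors) != 4:
        raise ValueError("the quad form takes exactly four coefficient/vector pairs")
    return [a + sum(c * v[i] for c, v in zip(coeffs, vectors))
            for i, a in enumerate(acc)]


if __name__ == "__main__":
    acc = [0, 0, 0]
    print(paired_scalar_vector_mac(acc, 2, [1, 2, 3], 3, [1, 1, 1]))
    print(paired_vector_vector_mac(acc, [1, 2, 3], [4, 5, 6], [1, 1, 1], [2, 2, 2]))
    print(quad_scalar_vector_mac(acc, [1, 2, 3, 4],
                                 [[1, 1, 1], [1, 2, 3], [0, 0, 0], [1, 0, 1]]))
```

The scalar x vector forms suit loops where one coefficient is shared by many output positions, while the vector x vector form suits loops where both operands vary per lane.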
  • 26. CNN Performance
      • 3-D convolution kernel
          – Input: 14x14x64 (8-bit); conv: 5x5x64, stride 1; output: 10x10x64 (8-bit)
      Processor | Multiplier Utilization | Relative Performance
      Vision P5 | 95%                    | 1.0
      Vision P6 | 80%                    | 3.3
      • AlexNet layer 1 convolution
          – Input: 227x227x3 (8-bit); conv: 11x11x3, stride 4; output: 55x55x96 (8-bit)
      Processor | Multiplier Utilization
      Vision P6 | 57%
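The layer shapes quoted on this slide follow from the usual output-size formula, and they give a rough feel for the work involved. The sketch below assumes no padding and square inputs and kernels, and it interprets "multiplier utilization" as the fraction of peak MACs per cycle actually used; that interpretation, the 256 MACs/cycle figure borrowed from the 8x8 row of the fixed-point table, and all function names are assumptions for illustration.

```python
def conv_output_size(in_size, kernel, stride):
    """Output width/height of an unpadded convolution."""
    return (in_size - kernel) // stride + 1


def conv_macs(in_hw, in_depth, kernel_hw, stride, num_filters):
    """Total multiply-accumulates for one square, unpadded conv layer."""
    out_hw = conv_output_size(in_hw, kernel_hw, stride)
    return out_hw * out_hw * num_filters * kernel_hw * kernel_hw * in_depth


def estimated_cycles(macs, peak_macs_per_cycle, utilization):
    """Cycle estimate if `utilization` of the peak MAC rate is sustained."""
    return macs / (peak_macs_per_cycle * utilization)


if __name__ == "__main__":
    kernel_macs = conv_macs(14, 64, 5, 1, 64)
    alexnet_macs = conv_macs(227, 3, 11, 4, 96)
    print("3-D kernel output size:", conv_output_size(14, 5, 1))
    print("AlexNet layer 1 output size:", conv_output_size(227, 11, 4))
    print("3-D kernel MACs:", kernel_macs)
    print("AlexNet layer 1 MACs:", alexnet_macs)
    print("3-D kernel cycles at 256 MACs/cycle, 80%:",
          estimated_cycles(kernel_macs, 256, 0.80))
```

The same helpers can be applied to other layers to see how filter size and stride change the balance between compute and data movement.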
  • 27. Vision P6 Demo
  • 28. Traffic Sign Recognition: CNN
      Tensilica® Vision DSP Running Convolutional Neural Network (CNN) Prediction Algorithm
      Compare the input signs with 43 trained signs
      [Figure: processing pipeline. Labels: Input Image; Localization; Cropped Proposal; CNN Prediction; System; Proposal; Probability; Image with BOX ‘detected’]
      [Figure: CNN structure. Labels: 32x32 input; 5x5 Filter; 28x28x108; 2x2; 14x14x108; 5x5 Filter; 10x10x200; 2x2; 5x5x200; 7x7x108; 2-Stage FC (100 neurons, 43 neurons); output: one of the 43 signs]
  • 29. Tensilica® Vision P6 DSP: People Detection and Face Detection
      People detection: an example of the computer vision histogram of oriented gradients (HOG) algorithm
      • 64x128 detection window, 1.2 scale factor, L1 histogram normalization
      • Market: Automotive, drone, and robot
        Applications: Collision avoidance, object detection, passenger detection
      • Market: Surveillance
        Applications: People counting, people detection/tracking
      Face detection: an example of the computer vision OpenCV Haar cascade algorithm
      • Full detection every frame
      • 22 levels
      • Market: Mobile, automotive, wearable, tablets
        Applications: Selective auto exposure, white balance, focus, face tracking, face authentication
  • 30. © 2016 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo, and the other Cadence marks found at www.cadence.com/go/trademarks are trademarks or registered trademarks of Cadence Design Systems, Inc. All other trademarks are the property of their respective holders.