Understanding and Integrating Intel Deep Learning Boost

OVERVIEW DEEP DIVE STEPS TO
Strategy
Portfolio
compute
hardware
software
performance
usecases
develop
partner
© 2019 Intel Corporation

#datacentric
Multi-cloudINTELLIGENTEdgeDevice

Deliverthebest
AIPlatforms
Shape&winindustry
OpenSoftwareStacks
Seed&drive
TheEcosystem
ExtendCPU
Mostcompleteportfolio
Bestintegratedplatforms
Optimizecustomersoftware
Build“OneAPI”unifiedplatform
Evangelizetodevelopers
Seedemergingusecases
Attract&developtoptalent
PioneerleadingedgeAI

Faster More Everything
SILICONPHOTONICS
OMNI-PATHFABRIC
ETHERNET
INTELATOM®CPROCESSORS
INTEL®NERVANA™NNP
INTEL®MOVIDIUS™TECHNOLOGY
INTEL®XEON®DPROCESSORS
INTEL®XEON®SCALABLEPROCESSORS
INTEL®FPGASDCSERIESSSD
QLC3DNANDDRIVE
OPTANEDC<>>>
SOLIDSTATEDRIVE
OPTANEDC<>>>
PERSISTENTMEMORY

50+ YEARS OF INTEL INNOVATION – DEVICE TO INTELLIGENT EDGE & MULTI-CLOUD
MULTI PURPOSE, FORMS
FOUNDATION FOR AI
ACCELERATION OF GRAPHICS,
MEDIA, ANALYTICS AND AI+
REAL TIME AI+ FOR OPTIMIZED
SYSTEM PERFORMANCE
HIGHLY EFFICIENT MULTI-MODAL
INFERENCE & TRAINING
PerformanceefficiencyWorkloadbreadth
Multi models – high performance across broad set of existing and emerging DL models for applications like Natural Language Processing, Speech & Text, Recommendation Engines, and Video & Image search & filtering
A+ - Combined workflows where use cases span AI plus additional workloads such as- Immersive Media - Media + Graphics

Intelligentedge
Multi-cloud
NNP-I
NNP-L+ACCELERATION
DEPLOY AI ANYWHERE FROM DEVICE TO INTELLIGENT EDGE & MULTI-CLOUD
device

2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
Network
Database
Analytics
Multi-cloud&Orchestration
AI
Security
Virtualization
HPC
COMPUTE DEMAND

PERFORMANCETOPROPELINSIGHTS
BUSINESSRESILIENCETHROUGHHardware-enhancedsecurity
AGILESERVICEDELIVERY
FASTER TIME
TO VALUE
GLOBAL
ECOSYSTEM
WORKLOAD
OPTIMIZED
CUSTOMER
OBSESSED
.
BUILT-IN
VALUE

JUL'17 DEC'18 APR'19 COLUMN1
5.7
1.0
INTEL OPTIMIZATION FOR CAFFE RESNET-50
INFERENCETHROUGHPUT(IMAGES/SEC)
intel®xeon®platinum
INTEL AVX-512
INTEL DLBOOST
OPTIMIZING AI INFERENCE
8100processor
BASE VS BASE VS BASE

vpmadd231ps
OUTPUT
FP32
INPUT
FP32
INPUT
FP32 Port0 Port5
…. ….
FMA FMA
Microarchitecture view
In a given clock cycle
vpmaddubsw
OUTPUT
INT16
CONSTANT
INT16
vpmaddwd
OUTPUT
INT32
Accumulate
INT32
vpaddd OUTPUT
INT32
INPUT
INT8
INPUT
INT8
1st gen Intel® Xeon®
Scalable processor
without Intel® DL Boost
1st gen Intel® Xeon®
Scalable processor
without Intel® DL Boost
2nd gen Intel® Xeon®
Scalable processor
with Intel® DL Boost
VNNI VNNI
FP32
INT8
INT8
VNNI
NEW
vpdpbusd
OUTPUT
INT32
INPUT
INT8
INPUT
INT8
Accumulate
INT32
OPTIMIZING AI INFERENCE

Vector Neural Network Instruction
https://software.intel.com/sites/default/files/managed/c5/15/architecture-
instruction-set-extensions-programming-reference.pdf
2nd gen Intel® Xeon®
Scalable processor
with Intel® DL Boost
INT8
VNNI
NEW
vpdpbusd
Accumulator
INT32
INPUT
INT8
INPUT
INT8
Accumulator
INT32

Precisionisameasureofthedetailusedfornumericalvaluesprimarilymeasuredinbits Signed Range
±1.18×10−38 to ±3.4×1038
±1.18×10−38 to ±3.4×1038
±6.10 × 10−5 To ± 65504
-32,768 to +32,767
-128 to 127
DlComputeinHighprecision-Highprocessingtime&highaccuracyinresults
DlComputeinLowerprecision-Fasterresultsbutpotentiallylessaccurate
Speedaccuracytradeoff

BF16bit
training
8bit
inference
Better cache usage
Avoids bandwidth bottlenecks
Maximizes compute resources
Lesser silicon area & power
FP32
TRAINING
FP32
inference
Need-FP32Weights INT8WeightsFORINFERENCE
Benefit-Reducemodelsizeandenergyconsumption
Challenge-Possibledegradationinpredictiveperformance
Solution-Quantizationwithminimallossofinformation
Deployment-OpenVinotoolkitordirectframework(tf)

8bit
inference
FP32weights
training
Reduceprecision& keepmodelaccuracy
quantizationawaretraining
Simulate the effect of quantization in the
forward and backward passes using FAKE
quantization
Posttrainingquantization
Train normally, capture FP32 weights; convert
to low precision before running inference and
calibrate to improve accuracy
FP32weights
training
8bit
inference
Collect statistics for the activation to find
quantization factor
Calibrate using subset data to improve
accuracy, Convert FP32 weights to INT8
8bit
inference
FP32weights
training
captured FP32 weights are
quantized to int8 at each
iteration after the weight
updates
Convert
FP32
weights
to INT8

Directframework[tf] Inference
with
standard TF
https://github.com/IntelAI/models/tree/ma
ster/benchmarks
Freeze
graph
Calibrate
Convert FP32
weights to INT8
Training
with
standard TF
Quantized
graph
OpenVinotool
https://docs.openvinotoolkit.org/
Trained
model - DL
Framework
Posttrainingquantization
Train normally, capture FP32 weights; convert
to low precision before running inference and
calibrate to improve accuracy
FP32weights
training
8bit
inference
Collect statistics for the activation to find
quantization factor
Calibrate using subset data to improve
accuracy, Convert FP32 weights to INT8

Image
Recognition
Object
Detection
Rec.
Systems
Relative Inference Throughput Performance [Higher Is Better]
See Config #46 & #47
Performance results are based as of date shown in configurations and may not reflect all publically available security updates. No product can be absolutely secure. See configuration disclosure for details. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel
microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not
manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for
more information regarding the specific instruction sets covered by this notice. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when
combined with other products. For more complete information visit: http://www.intel.com/performance Other names and brands may be claimed as the property of others.
2.1x @ 0.239 % accuracy loss
2.2x @ 0.009 mAP accuracy loss
0x 1x 2x 3x 4x
Wide and Deep
Wide and Deep
SSD-MobileNet
SSD-VGG16
RetinaNet
ResNet-50
ResNet-50
ResNet-101
ResNet-50
ResNet-50
ResNet-101
DL BOOST (VNNI) ON 2ND GEN XEON VS FP32 ON 1ST GEN XEON
Baseline :
FP32 performance on Intel®
Platinum 8180 Processor

#datacentric
white paper:
https://www.intel.ai/papers/siemens
-healthineers-ai-cardiac-imaging/

#datacentric
Client: Siemens Healthineers is the
global market leader in diagnostic
imaging. Artificial intelligence (AI) is
transforming care delivery and
expanding precision medicine.
Siemens Healthineers utilizes AI to
help automate and standardize
complex diagnostics to improve
patient outcomes.
Challenge: Segmentation of anatomical regions for medical
imaging requires hardware that delivers increased throughput
for Deep Learning inferencing. This is used to automate and
standardize clinical measurements, such as measurement of
heart chambers, ejection fraction, and strain. Due to increased
data generation and workflow demands, high performance
Deep Learning inference is required to help reduce the burden
on radiologists and cardiologists, enabling improved clinical
decisions in near real-time.
Solution: The next generation Intel® Xeon® Scalable processors
with Intel® Deep Learning Boost, in combination with the Intel®
Distribution of OpenVINO™ toolkit, automate anatomical
measurements in near real-time and improve workflow
efficiency, while maintaining accuracy of the model without the
need for discrete GPU investment.
*Other names and brands may be claimed as the property of others.
Configuration: Based on Siemens Healthineers and Intel Analysis on 2nd Gen Intel® Xeon® Platinum 8280 Processor (28 Cores) with 192GB, DDR4-2933, using Intel® OpenVino™ 2019 R1. HT ON, Turbo ON. CentOS Linux release 7.6.1810, kernel 4.19.5-
1.el7.elrepo.x86_64.Custom topology and dataset (image resolution 288x288). Comparing FP32 vs Int8 w/ Intel® DL Boost performance on the system.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and
functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products. For more complete information visit http://www.intel.com/performance.
Result
5.5xfaster(now201fPS)
This Siemens Healthineers’ feature is currently under development and is not currently for sale.
Using 2nd Generation Intel® Xeon
Scalable with Intel® DL Boost (INT8
precision) compared to FP32
precisionLeft Ventricle, Right Ventricle (short-axis), Left Ventricle, Right Atrium and Left Atrium (long-axis)

Image
Resnet-50, Inception V3,
MobileNet, SqueezeNet
R-FCN, Faster-RCNN ,
Yolo V2, SSD-VGG16,
SSD-MobileNet
Mask R-CNN
Imagerecognition
Objectdetection
Imagesegmentation
GNMT
Wide & Deep, NCF
Languagetranslation
Texttospeech
WaveNet
MiniGo
© 2019 Intel CorporationOther names and brands may be claimed as the property of others.

Image
Config 1: Nvidia data source: https://developer.nvidia.com/deep-learning-performance-training-inference (Updated March 5th 2019). See Config 24 for detailed Xeon Configuration
Config 2: See Config.slide 22 for Xeon and Nvidia configuration
Config 3: See Config slide 23 for Xeon and Nvidia configuration
Config 4: See Config slide 12 for Xeon and MLPerf configuration ; MLPerf v0.5 Training Closed - Reinforcement benchmark (MiniGo); system employed TensorFlow 1.10.1 with MKLDNN v0.14. Retrieved from www.mlperf.org. MLPerf v0.5 Results December 12th,
2018. MLPerf name and logo are trademarks. See www.mlperf.org for more information. Estimated score of 11.81 on the MLPerf v0.5 Training Closed Division – Reinforcement benchmark (MiniGo). Result not verified by MLPerf. MLPerf name and logo are trademarks. See www.mlperf.org for more
information. See Config 12 for detailed Xeon configuration.; 1.87x performance improvement in MLPerf MiniGo performance on 2S Intel® Xeon® Platinum 9282 processor compared to published MLPerf v0.5 MiniGo results on 2S Intel® Xeon® Platinum 8180 processor
Performance results are based on testing as of 1/31/2019 and may not reflect all publically available security updates. No product can be absolutely secure. See configuration disclosure for details. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel
microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured
by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components,
software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
For more complete information visit: http://www.intel.com/performance Other names and brands may be claimed as the property of others.
Config 1 Config 2 Config 3 Config 4

Image
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance
tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Other names and brands may be claimed as the property of others.
https://www.intel.com/content/www/us/en/process
ors/xeon/scalable/software-solutions/taboola-
cloud.html
https://www.intel.ai/papers/iflytek-xeonscalable/ https://www.intel.ai/rl-coach-aws-sagemaker-
accelerated-deep-rl/
Billions of real time
content
recommendations
Advanced RL
models on C5
instances
Leading
advancements in
speech recognition
https://newsroom.intel.com/news/new-intel-based-
artificial-intelligence-imaging-solution-accelerate-
critical-patient-diagnoses/
CT scan radiology
medical image
classification
GE Healthcare

OPTIMIZED SOFTWARE FOR THE BEST AI PERFORMANCE
github.com/IntelAI/models/tree/m
aster/benchmarks
nGraph
software.intel.com/OpenVINO/toolkit
github.com/NervanaSystems/ngraph
MKL® DNN
DAAL
PYTHON
software.intel.com/en-
us/distribution-for-python
software.intel.com/en-us/intel-daal
github.com/intel/mkl-dnn
© 2019 Intel CorporationOther names and brands may be claimed as the property of others.
data types

INTEL® DL BOOST ECOSYSTEM SUPPORT
OPTIMIZED SW &
FRAMEWORKS
SOFTWARE
VENDORS
CLOUD SERVICE
PROVIDERS
ENTERPRISES
VIDEO ANALYSIS
TEXT DETECTION
MEDICAL IMAGE
CLASSIFICATION
FACIAL RECOGNITION
ML INFERENCING

PARTNERS
DELL* EMC* READY SOLUTIONS FOR AI
INSPUR FPGA SOLUTION FOR AI
LENOVO AI OFFERINGS
AND MORE
DEPLOY AI SOLUTIONS FASTER THROUGH INTEL’S STELLAR ECOSYSTEM
LEADING AI SYSTEMS AND SOFTWARE
SOLUTIONS >80 AVAILABLE FROM
200+ PARTNERS
builders.intel.com/ai
ACCELERATING AI INNOVATION
ACROSS CRITICAL
BUSINESS WORKLOADS
software.intel.com/ai

Poweringtheaitransformation
aiinfusedintohardware,software&ecosystem
SpanningdevicetointelligentEdge&multi-Cloud
Unmatchedportfoliotomove,Store,Processdata
CPU,GPU,FPGA&AcceleratorOPTIMIZEDforprocessingyourai
Aicomputeportfolioforthedata-centricera
2nd GenerationIntel®Xeon®ScalableProcessorswithIntel®DeepLearningBoost
TheonlyCPUwithintegratedAIacceleration

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system
configuration.
No product or component can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete
information about performance and benchmark results, visit http://www.intel.com/benchmarks .
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured
using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and
performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information
visit http://www.intel.com/benchmarks .
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2,
SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost
savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Cost reduction scenarios described are intended as examples of how a given Intel- based product, in the specified circumstances and configurations, may affect future costs and provide cost
savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are
accurate.
Intel, the Intel logo, Intel Optane, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as property of others.
Intel and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
© 2019 Intel Corporation.

Understanding and Integrating Intel Deep Learning Boost

Recomendados

Recomendados

Más contenido relacionado

Más de inside-BigData.com

Más de inside-BigData.com (20)

Último

Último (20)

Understanding and Integrating Intel Deep Learning Boost