This document summarizes Nick Pentreath's presentation on scaling deep learning through model compression. It notes that the compute required to train state-of-the-art AI models has been doubling every few months. Several techniques for improving model efficiency are described, including specialized architectures such as MobileNet that reduce parameter counts and operations, along with model pruning, quantization, and distillation, which compress trained models with minimal accuracy loss for deployment on edge devices with limited resources.
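To make one of the mentioned techniques concrete, the following is a minimal sketch of post-training quantization: symmetric, per-tensor 8-bit quantization of a weight matrix. The function names and the NumPy-only setup are illustrative assumptions, not taken from the presentation; real toolchains (e.g. TensorFlow Lite or PyTorch quantization) are considerably more involved.

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 weights to int8 values plus a single scale factor
    (symmetric, per-tensor quantization -- an illustrative sketch)."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Storage shrinks 4x (int8 vs. float32); the reconstruction error is
# bounded by half the quantization step, which is typically negligible
# relative to the weight magnitudes.
print(q.nbytes / w.nbytes)                # 0.25
print(float(np.max(np.abs(w - w_hat))))   # at most ~scale / 2
```

The same round-then-clip idea underlies more sophisticated schemes (per-channel scales, asymmetric zero-points), which trade a little extra bookkeeping for lower error.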