Talk (to be given) June 8, 2018 at UC Berkeley / NERSC
In Collaboration with Michael Mahoney, UC Berkeley
Empirical results, using the machinery of Random Matrix Theory (RMT), are presented that are aimed at clarifying and resolving some of the puzzling and seemingly contradictory aspects of deep neural networks (DNNs). We apply RMT to several well-known pre-trained models: LeNet5, AlexNet, and Inception V3, as well as 2 small, toy models. We show that the DNN training process itself implicitly implements a form of self-regularization associated with entropy collapse / the information bottleneck. We find that the self-regularization in small models like LeNet5 resembles the familiar Tikhonov regularization, whereas large, modern deep networks display a new kind of heavy-tailed self-regularization. We characterize self-regularization using RMT by identifying a taxonomy of the 5+1 phases of training. Then, with our toy models, we show that even in the absence of any explicit regularization mechanism, the DNN training process itself leads to more and more capacity-controlled models. Importantly, this phenomenon is strongly affected by the many knobs that are used to optimize DNN training. In particular, we can induce heavy-tailed self-regularization by adjusting the batch size in training, thereby exploiting the generalization gap phenomenon unique to DNNs. We argue that this heavy-tailed self-regularization has practical implications for designing better DNNs and deep theoretical implications for understanding the complex DNN energy landscape / optimization problem.
Why Deep Learning Works: Self-Regularization in Deep Neural Networks
1. calculation | consulting
why deep learning works:
self-regularization in deep neural networks
charles@calculationconsulting.com
2. calculation|consulting
UC Berkeley / NERSC 2018
why deep learning works:
self-regularization in deep neural networks
charles@calculationconsulting.com
3. Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry
Over 15 years experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Anthropocene Institute
www.calculationconsulting.com
charles@calculationconsulting.com
4. Motivations: towards a Theory of Deep Learning
Theoretical: deeper insight into Why Deep Learning Works ?
non-convex optimization ?
regularization ?
why is deep better ?
VC vs Stat Mech vs ?
…
Practical: useful insight to improve the engineering of DNNs
when is a network fully optimized ?
large batch sizes ?
better ensembles ?
…
5. Set up: the Energy Landscape
minimize the Loss: but how to avoid overtraining ?
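A rough sketch of the setup, in my own notation (not transcribed from the slide): training minimizes an empirical loss over the layer weight matrices, with any explicit regularizer entering as a penalty term

\[ \min_{\{W_L\}} \; \sum_i \mathcal{L}\big(f(x_i; W_1, \dots, W_L),\, y_i\big) \;+\; \alpha \sum_L \|W_L\|^2 \]

when the explicit penalty is turned off (α = 0), the question is what, if anything, still controls the capacity of the trained model.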
6. Problem: How can this possibly work ?
highly non-convex ? apparently not
expected vs. observed ?
it has been suspected for a long time that local minima are not the issue
7. Problem: Local Minima ?
Duda, Hart and Stork, 2000
solution: add more capacity and regularize
8. Motivations: what is Regularization ?
every adjustable knob and switch is called regularization
https://arxiv.org/pdf/1710.10686.pdf
Dropout, Batch Size, Noisify Data, …
9. Problem: What is Regularization in DNNs ?
Understanding deep learning requires rethinking generalization
ICLR 2017 Best Paper
Large models overfit on randomly labeled data
Regularization cannot prevent this
10. Motivations: what is Regularization ?
Moore-Penrose pseudoinverse (1955)
regularize (Phillips, 1962)
familiar optimization problem
Soften the rank of X, focus on large eigenvalues
Ridge Regression / Tikhonov-Phillips Regularization
https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
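For reference, the textbook Tikhonov-Phillips / Ridge solution (standard form, not transcribed from the slide):

\[ \hat{x} \;=\; \big(X^{T}X + \alpha I\big)^{-1} X^{T} y \]

the regularization parameter α sets a scale on the eigenvalues of X^T X: directions with eigenvalues well above α are kept, those well below are damped, which is the sense in which the rank of X is "softened".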
11. Motivations: how we study Regularization
turn off regularization, turn it back on systematically, study W_L
and traditional regularization is applied to W_L
the Energy Landscape is determined by the layer weight matrices W_L
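A minimal sketch of this kind of experiment in Keras; the model, layer sizes, and regularization values below are hypothetical placeholders, chosen only to illustrate toggling the knobs and pulling out W_L:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def make_mlp(weight_decay=0.0, dropout=0.0):
    # small MLP; explicit regularization is turned on/off via the arguments
    reg = regularizers.l2(weight_decay) if weight_decay > 0 else None
    model = keras.Sequential([
        layers.Dense(512, activation="relu", kernel_regularizer=reg, input_shape=(784,)),
        layers.Dropout(dropout),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="categorical_crossentropy")
    return model

# train once with and once without explicit regularization, then study W_L
for wd, p in [(0.0, 0.0), (1e-4, 0.5)]:
    model = make_mlp(weight_decay=wd, dropout=p)
    # model.fit(x_train, y_train, ...)    # training data / epochs omitted here
    W = model.layers[0].get_weights()[0]  # the FC weight matrix W_L to analyze
    print(wd, p, W.shape)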
12. Energy Landscape: and Information flow
what happens to the layer weight matrices W_L ?
[Figure: Energy vs. Information / Entropy, annotated with the Information bottleneck, Entropy collapse, local minima, k=1 and k=2 saddle points, and the floor / ground state]
13. Self-Regularization: Experiments
Retrained LeNet5 on MNIST using Keras
Two (2) other small models: 3-Layer MLP and a Mini AlexNet
And examine pre-trained models (AlexNet, Inception, …)
Conv2D → MaxPool → Conv2D → MaxPool → FC1 → FC2 → FC
15. Random Matrix Theory: detailed insight into W_L
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
import keras
import numpy as np
import matplotlib.pyplot as plt
…
# layer weight matrix W_L for layer i
W = model.layers[i].get_weights()[0]
…
# correlation matrix X = W^T W and its eigenvalues (the ESD)
X = np.dot(W.T, W)
evals, evecs = np.linalg.eigh(X)
plt.hist(evals, bins=100, density=True)
16. Random Matrix Theory: detailed insight into W_L
Entropy decrease corresponds to breakdown of random structure
and the onset of a new kind of self-regularization
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
Random Matrix → Random + Spikes
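One concrete way to track the entropy decrease is a spectral (matrix) entropy computed from the normalized eigenvalues of X; the normalization below is one common choice, not necessarily the exact definition used in the talk:

def matrix_entropy(evals):
    # spectral entropy of X: close to 1 for an MP-like random spectrum,
    # smaller as weight correlations concentrate into a few directions
    p = evals / evals.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum() / np.log(len(evals))

tracking matrix_entropy(evals) per layer and per epoch makes the "breakdown of random structure" quantitative.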
17. Random Matrix Theory: Marchenko-Pastur
the Empirical Spectral Density (ESD) converges to a deterministic function with well-defined edges (which depend on Q, the aspect ratio)
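For reference, the standard Marchenko-Pastur density for X = W^T W / N, with W of size N x M, aspect ratio Q = N/M ≥ 1, and element variance σ² (standard RMT result, not transcribed from the slide):

\[ \rho(\lambda) \;=\; \frac{Q}{2\pi\sigma^{2}} \, \frac{\sqrt{(\lambda^{+}-\lambda)(\lambda-\lambda^{-})}}{\lambda}, \qquad \lambda^{\pm} = \sigma^{2}\Big(1 \pm \tfrac{1}{\sqrt{Q}}\Big)^{2} \]

the bulk edges λ± are the "well defined edges": eigenvalues falling outside [λ−, λ+] cannot be explained by a purely random W.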
19. Experiments: just apply to pre-trained Models
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
20. Experiments: just apply to pre-trained Models
LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
Conv2D → MaxPool → Conv2D → MaxPool → FC → FC
26. Bulk+Spikes: Small Models
Rank-1 perturbation: perturbative correction
[Figure: ESD with the MP Bulk and a few Spikes labeled]
Smaller, older models can be described perturbatively w/ RMT
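A minimal sketch of how spikes can be counted in practice: estimate the bulk variance, compute the MP edge λ+, and flag eigenvalues above it (the median-based variance estimate is a crude placeholder; a careful analysis would fit the bulk properly):

def mp_upper_edge(Q, sigma2):
    # Marchenko-Pastur bulk edge lambda_plus for aspect ratio Q and variance sigma2
    return sigma2 * (1.0 + 1.0 / np.sqrt(Q)) ** 2

def count_spikes(evals, Q, sigma2=None):
    if sigma2 is None:
        sigma2 = np.median(evals)          # crude stand-in for the bulk variance
    lam_plus = mp_upper_edge(Q, sigma2)
    return int((evals > lam_plus).sum()), lam_plus

for LeNet5-sized FC layers one expects a handful of spikes above λ+; for heavy-tailed layers (later slides) the notion of a bulk edge itself breaks down.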
27. Spikes: carry more information
Information begins to concentrate in the spikes
the vector entropy S(v): spikes have less entropy, are more localized than bulk
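Assuming S(v) denotes a discrete entropy of the squared components of an eigenvector v (one standard localization measure; the exact definition used in the talk may differ), a small sketch:

def vector_entropy(v):
    # S(v): close to 1 for a delocalized (bulk-like) eigenvector,
    # noticeably smaller for a localized (spike-like) one
    p = np.abs(v) ** 2
    p = p / p.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum() / np.log(len(v))

comparing vector_entropy on the eigenvectors of spike eigenvalues vs. eigenvectors drawn from the bulk (e.g. columns of evecs from the earlier snippet) shows the concentration of information in the spikes.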
28. Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization
softer rank, eigenvalues above a simple scale threshold, spikes carry most information
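The "simple scale threshold" can be made explicit with the Tikhonov filter factors: each eigen-direction of the correlation matrix, with eigenvalue λ_i, is shrunk by (standard ridge-regression algebra, not transcribed from the slide)

\[ f_i \;=\; \frac{\lambda_i}{\lambda_i + \alpha} \;\approx\; \begin{cases} 1, & \lambda_i \gg \alpha \\ 0, & \lambda_i \ll \alpha \end{cases} \]

so α acts exactly as a scale threshold: the spike eigenvalues pass through, the bulk is suppressed.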
29. Heavy Tailed: Self-Regularization
W strongly correlated / highly non-random
Can be modeled as if drawn from a heavy tailed distribution
Then RMT/MP ESD will also have heavy tails
Known results from RMT / polymer theory (Bouchaud, Potters, etc)
AlexNet
ResNet50
InceptionV3
DenseNet201
…
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
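A minimal sketch of how "heavy tailed" can be checked empirically: fit a power law ρ(λ) ~ λ^(−α) to the tail of the ESD. This sketch assumes the third-party powerlaw package (Alstott et al.); the talk's exact fitting procedure may differ.

import powerlaw

def fit_esd_tail(evals):
    # maximum-likelihood power-law fit to the upper tail of the ESD
    fit = powerlaw.Fit(evals)
    return fit.power_law.alpha, fit.xmin

smaller fitted exponents mean heavier tails; an MP-like bulk with a few spikes, by contrast, is not well described by a power law.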
30. Heavy Tailed: Self-Regularization
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
Salient ideas: what we ‘suspect’ today
No single scale threshold
No simple low-rank approximation for W_L
Contributions from correlations at all scales
Cannot be treated perturbatively
31. Self-Regularization: Batch size experiments
We can cause small models to exhibit strong correlations / heavy tails
By exploiting the Generalization Gap Phenomenon
Large batch sizes => decreased generalization accuracy
Tuning the batch size from very large to very small
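A sketch of the batch-size sweep, reusing the hypothetical make_mlp from the earlier sketch; x_train / y_train stand in for the training data and the batch sizes are placeholders, the point being that batch size is the only knob varied:

esds = {}
for batch_size in [1024, 256, 64, 16, 4]:          # large -> small
    model = make_mlp()                              # same small model, no explicit regularization
    model.fit(x_train, y_train, epochs=20, batch_size=batch_size, verbose=0)
    W = model.layers[0].get_weights()[0]
    X = np.dot(W.T, W)
    esds[batch_size] = np.linalg.eigvalsh(X)        # ESD to classify: Random-like ... Heavy-tailed

the expectation from the talk: as batch size shrinks, the ESD moves through the 5+1 phases toward Bulk-decay and Heavy-tailed.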
32. Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Random-Like Bleeding-out Random-Like
34. Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk+Spikes Bulk+Spikes Bulk-decay
35. Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk-decay Bulk-decay Heavy-tailed
36. Summary
self-regularization ~ entropy / information decrease
modern DNNs have heavy-tailed self-regularization
5+1 phases of learning
applied Random Matrix Theory (RMT)
small models ~ Tikhonov regularization
37. Implications: RMT and Deep Learning
How can RMT be used to understand the Energy Landscape ?
tradeoff between Energy and Entropy minimization
Where are the local minima ?
How does the Hessian behave ?
Are simpler models misleading ?
Can we design better learning strategies ?
38. Energy Funnels: Minimizing Frustration
http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
Energy Landscape Theory for polymer / protein folding
39. the Spin Glass of Minimal Frustration
Conjectured 2015 on my blog (15 min fame on Hacker News)
https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/
Bulk+Spikes, flipped
low lying Energy state in Spin Glass ~ spikes in RMT
40. RMT w/Heavy Tails: Energy Landscape ?
Compare to LeCun’s Spin Glass model (2015)
Spin Glass w/ Heavy Tails ?
Local minima do not concentrate
near the ground state
(Cizeau P and Bouchaud J-P 1993)
the Landscape is more funneled, no ‘problems’ with local minima ?