Talk given on Dec 13, 2018 at ICSI, UC Berkeley
http://www.icsi.berkeley.edu/icsi/events/2018/12/regularization-neural-networks
Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a ``size scale'' separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. Moreover, we can use these heavy tailed results to form a VC-like average case complexity metric that resembles the product norm used in analyzing toy NNs, and we can use this to predict the test accuracy of pretrained DNNs without peeking at the test data.
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
1. calculation | consulting
why deep learning works:
self-regularization in deep neural networks
charles@calculationconsulting.com
2. calculation|consulting
UC Berkeley / ICSI 2018
why deep learning works:
self-regularization in deep neural networks
charles@calculationconsulting.com
3. calculation | consulting why deep learning works
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry, UIUC
Over 15 years experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Anthropocene Institute
www.calculationconsulting.com
charles@calculationconsulting.com
4. calculation | consulting why deep learning works
Michael W. Mahoney
ICSI, RISELab, Dept. of Statistics UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and Fall 2018 programs on the Foundations of Data Science
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
Who Are We?
5. c|c
Motivations: towards a Theory of Deep Learning
NNs as spin glasses
LeCun et al. 2015
Looks exactly like old protein folding results (late 90s)
Energy Landscape Theory
broad questions about Why Deep Learning Works ?
MMDS talk 2016; Blog post 2015
a completely different picture of DNNs
6. c|c
Motivations: towards a Theory of Deep Learning
Theoretical: deeper insight into Why Deep Learning Works ?
non-convex optimization ?
regularization ?
why is deep better ?
VC vs Stat Mech vs ?
…
Practical: useful insight to improve engineering DNNs
when is a network fully optimized ?
large batch sizes ?
weakly supervised deep learning ?
…
7. c|c
Set up: the Energy Landscape
minimize the Loss: but how to avoid overtraining ?
8. c|c
Problem: How can this possibly work ?
highly non-convex ? apparently not
expected vs. observed
it has been suspected for a long time that local minima are not the issue
9. c|c
Problem: Local Minima ?
Duda, Hart and Stork, 2000
solution: add more capacity and regularize
10. c|c
Understanding deep learning requires rethinking generalization
Problem: What is Regularization in DNNs ?
ICLR 2017 Best paper
Large models overfit on randomly labeled data
Regularization cannot prevent this
11. c|c
Motivations: what is Regularization ?
every adjustable knob and switch is called regularization
https://arxiv.org/pdf/1710.10686.pdf
Dropout Batch Size Noisify Data
…
12. Moore-Penrose pseudoinverse (1955)
regularize (Phillips, 1962)
familiar optimization problem
Motivations: what is Regularization ?
Soften the rank of X, focus on the large eigenvalues
Ridge Regression / Tikhonov-Phillips Regularization
https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
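As a concrete illustration of "softening the rank": below is a minimal numpy sketch (the matrix names, sizes, and the value of the ridge parameter alpha are illustrative assumptions, not from the talk) showing how the Tikhonov-Phillips / Ridge solution filters out directions with small eigenvalues of AᵀA.

import numpy as np

# illustrative sizes and ridge parameter (assumptions, not from the talk)
n, p, alpha = 200, 50, 0.1
A = np.random.randn(n, p)      # design / data matrix
y = np.random.randn(n)         # targets

# SVD view of Ridge / Tikhonov-Phillips regularization:
# the filter factor s/(s^2 + alpha) suppresses directions with small
# singular values s (i.e. small eigenvalues of A^T A), softening the rank
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_ridge = Vt.T @ ((s / (s**2 + alpha)) * (U.T @ y))

# equivalent closed form: (A^T A + alpha I)^{-1} A^T y
x_check = np.linalg.solve(A.T @ A + alpha * np.eye(p), A.T @ y)
assert np.allclose(x_ridge, x_check)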
13. c|c
Motivations: how we study Regularization
we can characterize the learning process by studying W_L
the Energy Landscape is determined by the layer weights W_L
and the eigenvalues of the correlation matrix X = (1/N) W_L^T W_L
15. c|c
Random Matrix Theory: Marchenko-Pastur
converges to a deterministic function
Empirical Spectral Density (ESD)
with well defined edges (depends on Q, aspect ratio)
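For reference, the deterministic limit referred to here is the Marchenko-Pastur density; writing σ² for the variance of the entries of W and Q = N/M ≥ 1 for the aspect ratio (notation assumed to match the slides):

ρ_MP(λ) = ( Q / (2π σ²) ) · sqrt( (λ⁺ − λ)(λ − λ⁻) ) / λ ,   for λ⁻ ≤ λ ≤ λ⁺
λ± = σ² ( 1 ± 1/√Q )²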
17. c|c
Random Matrix Theory: detailed insight into W
Empirical Spectral Density (ESD: eigenvalues of X)
import keras
import numpy as np
import matplotlib.pyplot as plt
…
# extract the N x M weight matrix W of layer i
W = model.layers[i].get_weights()[0]
N, M = W.shape
…
# form the correlation matrix X and compute its eigenvalues (the ESD)
X = np.dot(W.T, W) / N
evals = np.linalg.eigvals(X)
plt.hist(evals, bins=100, density=True)
18. c|c
Random Matrix Theory: detailed insight into W_L
DNN training induces breakdown of Gaussian random structure
and the onset of a new kind of heavy tailed self-regularization
Gaussian random matrix → Bulk+Spikes → Heavy Tailed
(Small, older NNs → Large, modern DNNs)
19. c|c
Random Matrix Theory: Heavy Tailed
A heavy tailed random matrix has a heavy tailed empirical spectral density
Given W with entries W_ij drawn from a heavy tailed distribution, Pr(W_ij) ~ |W_ij|^-(1+μ)
Form X = (1/N) W^T W (but really with a μ-dependent normalization for the heaviest tails)
Then the ESD also has a power law (heavy) tail, ρ(λ) ~ λ^-(aμ+b)
where the tail exponent aμ+b is linear in μ
20. c|c
Random Matrix Theory: Heavy Tailed
RMT says if W is heavy tailed, the ESD will also have heavy tails
If W is strongly correlated, then the ESD can be modeled as if W is drawn
from a heavy tailed distribution
Known results from early 90s, developed in Finance (Bouchaud, Potters, etc)
21. c|c
Heavy Tailed RMT: Universality Classes
The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
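As a rough sketch of the universality classes being referred to (boundaries and tail behaviors recalled from the Bouchaud-Potters line of work, so treat the details as assumptions): writing μ for the tail exponent of the entries of W,

0 < μ < 2 : Very Heavy Tailed (Lévy regime), the ESD itself is heavy tailed, ρ(λ) ~ λ^-(1+μ/2), no finite bulk edge
2 < μ < 4 : Heavy / Fat Tailed, finite-size ESDs still show power law tails beyond an MP-like bulk
4 < μ     : Weakly Heavy Tailed, bulk converges to Marchenko-Pastur, heavy tails only in the largest eigenvalues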
27. c|c
Eigenvalue Analysis: Rank Collapse ?
Modern DNNs: soft rank collapses; do not lose hard rank
λ_min = 0 : (hard) rank collapse (Q > 1) signifies over-regularization
λ_min > 0 : all of the smallest eigenvalues > 0, within a numerical (Recipes) threshold (Q > 1)
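A minimal sketch of how "hard rank" vs. "soft rank" can be checked numerically; the stable-rank proxy and the numerical threshold below are my own illustrative choices, not necessarily the exact metrics used in the talk.

import numpy as np

def rank_metrics(W, eps=1e-10):
    """Hard rank vs. a soft (stable) rank for an N x M layer weight matrix W."""
    N, M = W.shape
    X = np.dot(W.T, W) / N
    evals = np.linalg.eigvalsh(X)                       # real, non-negative eigenvalues
    hard_rank = int(np.sum(evals > eps * evals.max()))  # eigenvalues above a numerical threshold
    soft_rank = evals.sum() / evals.max()               # stable rank = ||W||_F^2 / ||W||_2^2
    return hard_rank, soft_rank

# hard rank collapse (over-regularization) would show up as hard_rank < M, with Q = N/M > 1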
28. c|c
Bulk+Spikes: Small Models
Rank-1 perturbation → perturbative correction: Bulk + Spikes
Smaller, older models can be described perturbatively with RMT
29. c|c
Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization
softer rank, eigenvalues above the bulk edge λ+ (the spikes) carry most of the information
a simple scale threshold
30. c|c
AlexNet,
VGG,
ResNet,
Inception,
DenseNet,
…
Heavy Tailed RMT: Scale Free ESD
All large, well trained, modern DNNs exhibit heavy tailed self-regularization
scale free
33. c|c
Power Law Universality: ImageNet
All ImageNet models display remarkable Heavy Tailed Universality
500 matrices
~50 architectures
Linear layers & Conv2D feature maps
80-90% of the fitted exponents α < 4
34. c|c
Rank Collapse: ImageNet and AllenNLP
The pretrained ImageNet and AllenNLP models show (almost) no rank collapse
37. c|c
Universality: Capacity metrics
Universality suggests the power law exponent α
would make a good, Universal DNN capacity metric
Imagine a weighted average  α̂ = Σ_L b_L α_L
where the weights b_L ~ |W_L|, the scale of each layer weight matrix
An Unsupervised, VC-like, data-dependent complexity metric for
predicting trends in average case generalization accuracy in DNNs
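A minimal sketch of such a weighted-average-alpha metric, assuming the per-layer weight b_L is taken to be the log of the largest eigenvalue of that layer's correlation matrix (one plausible reading of "b ~ |W|"; the exact weighting used in the talk may differ):

import numpy as np
import powerlaw

def layer_alpha_and_scale(W):
    """Power-law exponent alpha of the layer ESD, plus a log-scale weight for that layer."""
    N, M = W.shape
    X = np.dot(W.T, W) / N
    evals = np.linalg.eigvalsh(X)
    alpha = powerlaw.Fit(evals).power_law.alpha
    return alpha, np.log10(evals.max())

def weighted_alpha(weight_matrices):
    """Weighted average of per-layer alphas, weighted by each layer's scale."""
    alphas, scales = zip(*(layer_alpha_and_scale(W) for W in weight_matrices))
    return float(np.average(alphas, weights=scales))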
38. c|c
DNN Capacity metrics: Product norms
The product norm is a data-dependent, VC-like capacity metric for DNNs
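For concreteness, one standard way to write such a product norm (which norm is used, Frobenius or spectral, and any exponents vary across papers, so this is just one common form):

C ∝ Π_L ||W_L||       ⇒       log C ∝ Σ_L log ||W_L||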
39. c|c
Predicting test accuracies: Product norms
We can predict trends in the test accuracy without peeking at the test data !
40. c|c
Universality: Capacity metrics
We can do even better using the weighted average alpha
But first, we need a relation between the Frobenius norm and the Power Law
And to solve… what are the weights b ?
41. c|c
Heavy Tailed matrices: norm-powerlaw relations
create a random Heavy Tailed (Pareto) matrix
form the Correlation matrix, select a Normalization
Frobenius Norm-Power Law relation depends on the Normalization
42. c|c
Heavy Tailed matrices: norm-powerlaw relations
Frobenius Norm-Power Law relation depends on the Normalization
compute the ‘eigenvalues’ of X, fit them to a Power Law
examine the norm-power-law relation:  2 log ||W||_F / log λ_max
43. c|c
Heavy Tailed matrices: norm-powerlaw relations
import numpy as np
import powerlaw
…
N, M = …
mu = …
# random Heavy Tailed (Pareto) matrix
W = np.random.pareto(a=mu, size=(N, M))
# correlation matrix and its eigenvalues (the ESD)
X = np.dot(W.T, W) / N
evals = np.linalg.eigvals(X)
# fit the ESD to a power law; compare the Frobenius norm to the max eigenvalue
alpha = powerlaw.Fit(evals).power_law.alpha
logNorm = np.log10(np.linalg.norm(W))
logMaxEig = np.log10(np.max(evals))
ratio = 2 * logNorm / logMaxEig
44. c|c
Scale-free Normalization
the Frobenius Norm is dominated by a single scale-free eigenvalue
Heavy Tailed matrices: norm-powerlaw relations
Relation is weakly scale dependent
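Spelling out the "dominated by a single eigenvalue" statement, under the assumption that λ_max carries almost all of the spectral mass (my paraphrase of the slide, with X = (1/N) WᵀW as before):

||W||_F² = Trace(WᵀW) = N Σ_i λ_i ≈ N λ_max
⇒ 2 log ||W||_F ≈ log λ_max + log N , so the ratio 2 log ||W||_F / log λ_max ≈ 1, up to a weakly scale-dependent log N term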
45. c|c
large Pareto matrices have a simple, limiting norm-power-law relation
Heavy Tailed matrices: norm-power law relations
Standard Normalization
Relation is only linear for
very Heavy Tailed matrices
46. c|c
Finite size Pareto matrices have a universal, linear norm-power-law relation
Heavy Tailed matrices: norm-power law relations
Relation is nearly linear for
very Heavy and Fat Tailed
finite size matrices
47. c|c
Predicting test accuracies: weighted alpha
Here we treat both Linear layers and Conv2D feature maps
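One way to turn Conv2D layers into 2D matrices for this analysis, assuming the usual (k1, k2, c_in, c_out) kernel layout and slicing each spatial position into its own (c_in x c_out) matrix; this slicing convention is an assumption for illustration, not necessarily exactly what was done in the talk.

import numpy as np

def conv2d_feature_matrices(kernel):
    """Slice a Conv2D kernel of shape (k1, k2, c_in, c_out) into k1*k2 matrices of shape (c_in, c_out)."""
    k1, k2, c_in, c_out = kernel.shape
    return [kernel[i, j, :, :] for i in range(k1) for j in range(k2)]

# each returned matrix can then be treated exactly like a Linear layer:
# form X = W^T W / N, compute the ESD, and fit its power law exponent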
48. c|c
Predicting test accuracies: weighted alpha
Associate the log product norm with the weighted alpha metric
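One way to read this association, combining the finite-size norm-power-law relation from the previous slides (2 log ||W||_F ≈ α log λ_max) with the weights b_L taken as log λ_max of each layer; this chain is my reconstruction, not a formula taken verbatim from the slides:

log Π_L ||W_L||_F² = Σ_L 2 log ||W_L||_F ≈ Σ_L α_L log λ_max,L
i.e. the log product norm behaves like a sum of the layer exponents α_L weighted by b_L = log λ_max,L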
49. c|c
Predicting test accuracies: VGG Series
The weighted alpha capacity metric predicts trends in test accuracy
51. c|c
Open source tool: weightwatcher
pip install weightwatcher
…
import weightwatcher as ww
# analyze every layer weight matrix of the model: ESDs, power-law fits, etc.
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
# per-model summary metrics and per-layer details
watcher.get_summary()
watcher.print_results()
All results can be reproduced using the python weightwatcher tool
python tool to analyze Fat Tails in Deep Neural Networks
https://github.com/CalculatedContent/WeightWatcher
55. c|c
Self-Regularization: Batch size experiments
Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC
We can induce heavy tails in small models by decreasing the batch size
Mini-AlexNet: retrained to exploit Generalization Gap
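A minimal Keras sketch of a Mini-AlexNet-style network matching the layer sequence listed above (Conv2D, MaxPool, Conv2D, MaxPool, FC1, FC2, FC); the filter counts, layer widths, and CIFAR-10-sized input are illustrative assumptions, not the exact architecture used in these experiments.

from keras import layers, models

def mini_alexnet(input_shape=(32, 32, 3), n_classes=10):
    """Conv2D-MaxPool-Conv2D-MaxPool-FC1-FC2-FC, with assumed sizes."""
    return models.Sequential([
        layers.Conv2D(96, (5, 5), activation='relu', padding='same', input_shape=input_shape),
        layers.MaxPooling2D(pool_size=(3, 3), strides=2),
        layers.Conv2D(256, (5, 5), activation='relu', padding='same'),
        layers.MaxPooling2D(pool_size=(3, 3), strides=2),
        layers.Flatten(),
        layers.Dense(384, activation='relu'),            # FC1
        layers.Dense(192, activation='relu'),            # FC2
        layers.Dense(n_classes, activation='softmax'),   # FC (output)
    ])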
57. c|c
Batch Size Tuning: Generalization Gap
Random-Like, Bleeding-out, Random-Like
Decreasing the batch size induces strong correlations in W
58. c|c
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Gaussian random matrix → Bulk+Spikes → Heavy Tailed
(Large batch sizes → Small batch sizes)
59. c|c
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk+Spikes Bulk+Spikes Bulk-decay
60. c|c
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk-decay Bulk-decay Heavy-tailed
61. c|c
Summary
•reviewed Random Matrix Theory (RMT)
•small NNs ~ Tikhonov regularization
•modern DNNs are Heavy-Tailed
•Universality of Power Law exponent alpha
•can predict generalization accuracies using new capacity
metric, weighted average alpha
•can induce heavy tails by exploiting the Generalization
Gap phenomenon, by decreasing the batch size
62. c|c
Energy Landscape and Information flow
[figure: energy landscape annotated with Information bottleneck, Entropy collapse, local minima, k=1 and k=2 saddle points, floor / ground state, Information / Entropy]
Is this right ? Based on a Gaussian Spin Glass model
63. c|c
Implications: RMT and Deep Learning
How can we characterize the Energy Landscape ?
tradeoff between
Energy and Entropy minimization
Where are the local minima ?
How does the Hessian behave ?
Are simpler models misleading ?
Can we design better learning strategies ?
64. c|c
Energy Funnels: Minimizing Frustration
http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
Energy Landscape Theory for polymer / protein folding
65. c|c
the Spin Glass of Minimal Frustration
Conjectured 2015 on my blog (15 min fame on Hacker News)
https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/
Bulk+Spikes, flipped
low lying Energy state in Spin Glass ~ spikes in RMT
66. c|c
RMT w/Heavy Tails: Energy Landscape ?
Compare to LeCun’s Spin Glass model (2015)
Spin Glass with Heavy Tails ?
Local minima do not concentrate
near the ground state
(Cizeau P and Bouchaud J-P 1993)
Is the Landscape more funneled, with no ‘problems’ from local minima ?