Talk given on Dec 13, 2018 at ICSI, UC Berkeley
http://www.icsi.berkeley.edu/icsi/events/2018/12/regularization-neural-networks
Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a ``size scale'' separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. Moreover, we can use these heavy tailed results to form a VC-like average case complexity metric that resembles the product norm used in analyzing toy NNs, and we can use this to predict the test accuracy of pretrained DNNs without peeking at the test data.
Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley
1. calculation | consulting
why deep learning works:
self-regularization in deep neural networks
charles@calculationconsulting.com
2. calculation|consulting
UC Berkeley / ICSI 2018
why deep learning works:
self-regularization in deep neural networks
charles@calculationconsulting.com
3. calculation | consulting why deep learning works
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry, UIUC
Over 15 years experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Anthropocene Institute
www.calculationconsulting.com
charles@calculationconsulting.com
4. calculation | consulting why deep learning works
Michael W. Mahoney
ICSI, RISELab, Dept. of Statistics UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and Fall 2018 programs on the Foundations of Data Science
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
Who Are We?
5. c|c
Motivations: towards a Theory of Deep Learning
NNs as spin glasses
LeCun et al. 2015
Looks exactly like old protein folding results (late 90s)
Energy Landscape Theory
broad questions about Why Deep Learning Works ?
MMDS talk 2016; Blog post 2015
a completely different picture of DNNs
6. c|c
Motivations: towards a Theory of Deep Learning
Theoretical: deeper insight into Why Deep Learning Works ?
non-convex optimization ?
regularization ?
why is deep better ?
VC vs Stat Mech vs ?
…
Practical: useful insight to improve engineering DNNs
when is a network fully optimized ?
large batch sizes ?
weakly supervised deep learning ?
…
7. c|c
Set up: the Energy Landscape
minimize the Loss: but how to avoid overtraining ?
8. c|c
Problem: How can this possibly work ?
highly non-convex ? apparently not
expected vs. observed
it has been suspected for a long time that local minima are not the issue
9. c|c
Problem: Local Minima ?
Duda, Hart and Stork, 2000
solution: add more capacity and regularize
10. c|c
Understanding deep learning requires rethinking generalization
Problem: What is Regularization in DNNs ?
ICLR 2017 Best paper
Large models overfit on randomly labeled data
Regularization cannot prevent this
11. c|c
Motivations: what is Regularization ?
every adjustable knob and switch is called regularization
https://arxiv.org/pdf/1710.10686.pdf
Dropout Batch Size Noisify Data
…
12. Moore-Penrose pseudoinverse (1955)
regularize (Phillips, 1962)
familiar optimization problem
Motivations: what is Regularization ?
Soften the rank of X, focus on the large eigenvalues
Ridge Regression / Tikhonov-Phillips Regularization
https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
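As a concrete illustration of "softening the rank": below is a minimal numpy sketch (the matrix names, sizes, and the value of the ridge parameter alpha are illustrative assumptions, not from the talk) showing how the Tikhonov-Phillips / Ridge solution filters out directions with small eigenvalues of AᵀA.

import numpy as np

# illustrative sizes and ridge parameter (assumptions, not from the talk)
n, p, alpha = 200, 50, 0.1
A = np.random.randn(n, p)      # design / data matrix
y = np.random.randn(n)         # targets

# SVD view of Ridge / Tikhonov-Phillips regularization:
# the filter factor s/(s^2 + alpha) suppresses directions with small
# singular values s (i.e. small eigenvalues of A^T A), softening the rank
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_ridge = Vt.T @ ((s / (s**2 + alpha)) * (U.T @ y))

# equivalent closed form: (A^T A + alpha I)^{-1} A^T y
x_check = np.linalg.solve(A.T @ A + alpha * np.eye(p), A.T @ y)
assert np.allclose(x_ridge, x_check)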
13. c|c
Motivations: how we study Regularization
we can characterize the learning process by studying W_L
the Energy Landscape is determined by the layer weights W_L
and the eigenvalues of the correlation matrix X = (1/N) W_L^T W_L
15. c|c
Random Matrix Theory: Marchenko-Pastur
converges to a deterministic function
Empirical Spectral Density (ESD)
with well defined edges (depends on Q, aspect ratio)
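For reference, the deterministic limit referred to here is the Marchenko-Pastur density; writing σ² for the variance of the entries of W and Q = N/M ≥ 1 for the aspect ratio (notation assumed to match the slides):

ρ_MP(λ) = ( Q / (2π σ²) ) · sqrt( (λ⁺ − λ)(λ − λ⁻) ) / λ ,   for λ⁻ ≤ λ ≤ λ⁺
λ± = σ² ( 1 ± 1/√Q )²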
17. c|c
Random Matrix Theory: detailed insight into W
Empirical Spectral Density (ESD: eigenvalues of X)
import keras
import numpy as np
import matplotlib.pyplot as plt
…
# extract the N x M weight matrix W of layer i
W = model.layers[i].get_weights()[0]
N, M = W.shape
…
# form the correlation matrix X and compute its eigenvalues (the ESD)
X = np.dot(W.T, W) / N
evals = np.linalg.eigvals(X)
plt.hist(evals, bins=100, density=True)
18. c|c
Random Matrix Theory: detailed insight into W_L
DNN training induces breakdown of Gaussian random structure
and the onset of a new kind of heavy tailed self-regularization
Gaussian random matrix → Bulk+Spikes → Heavy Tailed
(Small, older NNs → Large, modern DNNs)
19. c|c
Random Matrix Theory: Heavy Tailed
A heavy tailed random matrix has a heavy tailed empirical spectral density
Given W with entries W_ij drawn from a heavy tailed distribution, Pr(W_ij) ~ |W_ij|^-(1+μ)
Form X = (1/N) W^T W (but really with a μ-dependent normalization for the heaviest tails)
Then the ESD also has a power law (heavy) tail, ρ(λ) ~ λ^-(aμ+b)
where the tail exponent aμ+b is linear in μ
20. c|c
Random Matrix Theory: Heavy Tailed
RMT says if W is heavy tailed, the ESD will also have heavy tails
If W is strongly correlated, then the ESD can be modeled as if W is drawn
from a heavy tailed distribution
Known results from early 90s, developed in Finance (Bouchaud, Potters, etc)
21. c|c
Heavy Tailed RMT: Universality Classes
The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
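As a rough sketch of the universality classes being referred to (boundaries and tail behaviors recalled from the Bouchaud-Potters line of work, so treat the details as assumptions): writing μ for the tail exponent of the entries of W,

0 < μ < 2 : Very Heavy Tailed (Lévy regime), the ESD itself is heavy tailed, ρ(λ) ~ λ^-(1+μ/2), no finite bulk edge
2 < μ < 4 : Heavy / Fat Tailed, finite-size ESDs still show power law tails beyond an MP-like bulk
4 < μ     : Weakly Heavy Tailed, bulk converges to Marchenko-Pastur, heavy tails only in the largest eigenvalues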
27. c|c
Eigenvalue Analysis: Rank Collapse ?
Modern DNNs: soft rank collapses; do not lose hard rank
λ_min = 0 : (hard) rank collapse (Q > 1) signifies over-regularization
λ_min > 0 : all of the smallest eigenvalues > 0, within a numerical (Recipes) threshold (Q > 1)
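A minimal sketch of how "hard rank" vs. "soft rank" can be checked numerically; the stable-rank proxy and the numerical threshold below are my own illustrative choices, not necessarily the exact metrics used in the talk.

import numpy as np

def rank_metrics(W, eps=1e-10):
    """Hard rank vs. a soft (stable) rank for an N x M layer weight matrix W."""
    N, M = W.shape
    X = np.dot(W.T, W) / N
    evals = np.linalg.eigvalsh(X)                       # real, non-negative eigenvalues
    hard_rank = int(np.sum(evals > eps * evals.max()))  # eigenvalues above a numerical threshold
    soft_rank = evals.sum() / evals.max()               # stable rank = ||W||_F^2 / ||W||_2^2
    return hard_rank, soft_rank

# hard rank collapse (over-regularization) would show up as hard_rank < M, with Q = N/M > 1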
28. c|c
Bulk+Spikes: Small Models
Rank-1 perturbation → perturbative correction: Bulk + Spikes
Smaller, older models can be described perturbatively with RMT
29. c|c
Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization
softer rank, eigenvalues above the bulk edge λ+ (the spikes) carry most of the information
a simple scale threshold
30. c|c
AlexNet,
VGG,
ResNet,
Inception,
DenseNet,
…
Heavy Tailed RMT: Scale Free ESD
All large, well trained, modern DNNs exhibit heavy tailed self-regularization
scale free
33. c|c
Power Law Universality: ImageNet
All ImageNet models display remarkable Heavy Tailed Universality
500 matrices
~50 architectures
Linear layers & Conv2D feature maps
80-90% of the fitted exponents α < 4
34. c|c
Rank Collapse: ImageNet and AllenNLP
The pretrained ImageNet and AllenNLP models show (almost) no rank collapse
37. c|c
Universality: Capacity metrics
Universality suggests the power law exponent α
would make a good, Universal DNN capacity metric
Imagine a weighted average  α̂ = Σ_L b_L α_L
where the weights b_L ~ |W_L|, the scale of each layer weight matrix
An Unsupervised, VC-like, data-dependent complexity metric for
predicting trends in average case generalization accuracy in DNNs
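A minimal sketch of such a weighted-average-alpha metric, assuming the per-layer weight b_L is taken to be the log of the largest eigenvalue of that layer's correlation matrix (one plausible reading of "b ~ |W|"; the exact weighting used in the talk may differ):

import numpy as np
import powerlaw

def layer_alpha_and_scale(W):
    """Power-law exponent alpha of the layer ESD, plus a log-scale weight for that layer."""
    N, M = W.shape
    X = np.dot(W.T, W) / N
    evals = np.linalg.eigvalsh(X)
    alpha = powerlaw.Fit(evals).power_law.alpha
    return alpha, np.log10(evals.max())

def weighted_alpha(weight_matrices):
    """Weighted average of per-layer alphas, weighted by each layer's scale."""
    alphas, scales = zip(*(layer_alpha_and_scale(W) for W in weight_matrices))
    return float(np.average(alphas, weights=scales))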
38. c|c
DNN Capacity metrics: Product norms
The product norm is a data-dependent, VC-like capacity metric for DNNs
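For concreteness, one standard way to write such a product norm (which norm is used, Frobenius or spectral, and any exponents vary across papers, so this is just one common form):

C ∝ Π_L ||W_L||       ⇒       log C ∝ Σ_L log ||W_L||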
39. c|c
Predicting test accuracies: Product norms
We can predict trends in the test accuracy without peeking at the test data !
40. c|c
Universality: Capacity metrics
We can do even better using the weighted average alpha
But first, we need a relation between the Frobenius norm and the Power Law
And to solve… what are the weights b ?
41. c|c
Heavy Tailed matrices: norm-powerlaw relations
create a random Heavy Tailed (Pareto) matrix
form the Correlation matrix, select a Normalization
Frobenius Norm-Power Law relation depends on the Normalization
42. c|c
Heavy Tailed matrices: norm-powerlaw relations
Frobenius Norm-Power Law relation depends on the Normalization
compute the ‘eigenvalues’ of X, fit them to a Power Law
examine the norm-power-law relation:  2 log ||W||_F / log λ_max
43. c|c
Heavy Tailed matrices: norm-powerlaw relations
import numpy as np
import powerlaw
…
N, M = …
mu = …
# random Heavy Tailed (Pareto) matrix
W = np.random.pareto(a=mu, size=(N, M))
# correlation matrix and its eigenvalues (the ESD)
X = np.dot(W.T, W) / N
evals = np.linalg.eigvals(X)
# fit the ESD to a power law; compare the Frobenius norm to the max eigenvalue
alpha = powerlaw.Fit(evals).power_law.alpha
logNorm = np.log10(np.linalg.norm(W))
logMaxEig = np.log10(np.max(evals))
ratio = 2 * logNorm / logMaxEig
44. c|c
Scale-free Normalization
the Frobenius Norm is dominated by a single scale-free eigenvalue
Heavy Tailed matrices: norm-powerlaw relations
Relation is weakly scale dependent
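Spelling out the "dominated by a single eigenvalue" statement, under the assumption that λ_max carries almost all of the spectral mass (my paraphrase of the slide, with X = (1/N) WᵀW as before):

||W||_F² = Trace(WᵀW) = N Σ_i λ_i ≈ N λ_max
⇒ 2 log ||W||_F ≈ log λ_max + log N , so the ratio 2 log ||W||_F / log λ_max ≈ 1, up to a weakly scale-dependent log N term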
45. c|c
large Pareto matrices have a simple, limiting norm-power-law relation
Heavy Tailed matrices: norm-power law relations
Standard Normalization
Relation is only linear for
very Heavy Tailed matrices
46. c|c
Finite size Pareto matrices have a universal, linear norm-power-law relation
Heavy Tailed matrices: norm-power law relations
Relation is nearly linear for
very Heavy and Fat Tailed
finite size matrices
47. c|c
Predicting test accuracies: weighted alpha
Here we treat both Linear layers and Conv2D feature maps
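One way to turn Conv2D layers into 2D matrices for this analysis, assuming the usual (k1, k2, c_in, c_out) kernel layout and slicing each spatial position into its own (c_in x c_out) matrix; this slicing convention is an assumption for illustration, not necessarily exactly what was done in the talk.

import numpy as np

def conv2d_feature_matrices(kernel):
    """Slice a Conv2D kernel of shape (k1, k2, c_in, c_out) into k1*k2 matrices of shape (c_in, c_out)."""
    k1, k2, c_in, c_out = kernel.shape
    return [kernel[i, j, :, :] for i in range(k1) for j in range(k2)]

# each returned matrix can then be treated exactly like a Linear layer:
# form X = W^T W / N, compute the ESD, and fit its power law exponent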
48. c|c
Predicting test accuracies: weighted alpha
Associate the log product norm with the weighted alpha metric
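One way to read this association, combining the finite-size norm-power-law relation from the previous slides (2 log ||W||_F ≈ α log λ_max) with the weights b_L taken as log λ_max of each layer; this chain is my reconstruction, not a formula taken verbatim from the slides:

log Π_L ||W_L||_F² = Σ_L 2 log ||W_L||_F ≈ Σ_L α_L log λ_max,L
i.e. the log product norm behaves like a sum of the layer exponents α_L weighted by b_L = log λ_max,L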
49. c|c
Predicting test accuracies: VGG Series
The weighted alpha capacity metric predicts trends in test accuracy
51. c|c
Open source tool: weightwatcher
pip install weightwatcher
…
import weightwatcher as ww
# analyze every layer weight matrix of the model: ESDs, power-law fits, etc.
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
# per-model summary metrics and per-layer details
watcher.get_summary()
watcher.print_results()
All results can be reproduced using the python weightwatcher tool
python tool to analyze Fat Tails in Deep Neural Networks
https://github.com/CalculatedContent/WeightWatcher
55. c|c
Self-Regularization: Batch size experiments
Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC
We can induce heavy tails in small models by decreasing the batch size
Mini-AlexNet: retrained to exploit Generalization Gap
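A minimal Keras sketch of a Mini-AlexNet-style network matching the layer sequence listed above (Conv2D, MaxPool, Conv2D, MaxPool, FC1, FC2, FC); the filter counts, layer widths, and CIFAR-10-sized input are illustrative assumptions, not the exact architecture used in these experiments.

from keras import layers, models

def mini_alexnet(input_shape=(32, 32, 3), n_classes=10):
    """Conv2D-MaxPool-Conv2D-MaxPool-FC1-FC2-FC, with assumed sizes."""
    return models.Sequential([
        layers.Conv2D(96, (5, 5), activation='relu', padding='same', input_shape=input_shape),
        layers.MaxPooling2D(pool_size=(3, 3), strides=2),
        layers.Conv2D(256, (5, 5), activation='relu', padding='same'),
        layers.MaxPooling2D(pool_size=(3, 3), strides=2),
        layers.Flatten(),
        layers.Dense(384, activation='relu'),            # FC1
        layers.Dense(192, activation='relu'),            # FC2
        layers.Dense(n_classes, activation='softmax'),   # FC (output)
    ])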
57. c|c
Batch Size Tuning: Generalization Gap
Random-Like, Bleeding-out, Random-Like
Decreasing the batch size induces strong correlations in W
58. c|c
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Gaussian random matrix → Bulk+Spikes → Heavy Tailed
(Large batch sizes → Small batch sizes)
59. c|c
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk+Spikes Bulk+Spikes Bulk-decay
60. c|c
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk-decay Bulk-decay Heavy-tailed
61. c|c
Summary
•reviewed Random Matrix Theory (RMT)
•small NNs ~ Tikhonov regularization
•modern DNNs are Heavy-Tailed
•Universality of Power Law exponent alpha
•can predict generalization accuracies using new capacity
metric, weighted average alpha
•can induce heavy tails by exploiting the Generalization
Gap phenomenon, by decreasing the batch size
62. c|c
Energy Landscape and Information flow
[figure: energy landscape annotated with Information bottleneck, Entropy collapse, local minima, k=1 and k=2 saddle points, floor / ground state, Information / Entropy]
Is this right ? Based on a Gaussian Spin Glass model
63. c|c
Implications: RMT and Deep Learning
How can we characterize the Energy Landscape ?
tradeoff between
Energy and Entropy minimization
Where are the local minima ?
How does the Hessian behave ?
Are simpler models misleading ?
Can we design better learning strategies ?
64. c|c
Energy Funnels: Minimizing Frustration
http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
Energy Landscape Theory for polymer / protein folding
65. c|c
the Spin Glass of Minimal Frustration
Conjectured 2015 on my blog (15 min fame on Hacker News)
https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/
Bulk+Spikes, flipped
low lying Energy state in Spin Glass ~ spikes in RMT
66. c|c
RMT w/Heavy Tails: Energy Landscape ?
Compare to LeCun’s Spin Glass model (2015)
Spin Glass with Heavy Tails ?
Local minima do not concentrate
near the ground state
(Cizeau P and Bouchaud J-P 1993)
Is the Landscape more funneled, with no ‘problems’ from local minima ?