This document summarizes research into implicit self-regularization in deep neural networks. It discusses how analyzing the eigenvalue spectrum of weight matrices can provide insights into the learning dynamics. Large, well-trained modern networks exhibit heavy-tailed eigenvalue distributions rather than Gaussian distributions. This heavy-tailed behavior acts as a form of self-regularization and may explain why large networks generalize well despite having many parameters. The document presents analysis of various networks showing this heavy-tailed behavior is universal across different architectures and datasets. It proposes that metrics based on the heavy-tailed behavior could predict a network's generalization performance without access to test data.
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry, UIUC
Over 15 years experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Griffin Advisors
Alt. Energy: Anthropocene Institute (Page Family)
www.calculationconsulting.com
charles@calculationconsulting.com
Who Are We?
Michael W. Mahoney
ICSI, RISELab, and Dept. of Statistics, UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and Fall 2018 programs on the Foundations of Data
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
Research: Implicit Self-Regularization in Deep Learning
• Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning (arXiv, long version)
• Traditional and Heavy-Tailed Self-Regularization in Neural Network Models (arXiv, short version, submitted to ICML 2019)
• Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior (arXiv, long version, submitted to JMLR 2019)
• Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks (arXiv, submitted to ICML 2019)
Motivations: What is Regularization in DNNs?
Understanding deep learning requires rethinking generalization (ICLR 2017 Best Paper):
Large models overfit on randomly labeled data
Regularization cannot prevent this
Motivations: Remembering Regularization
Statistical Mechanics (1990s): overtraining leads to a Spin Glass Phase
Binary classifier with N random labelings:
2^N over-trained solutions: locally convex, with very high barriers, all unable to generalize
See: Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior
Motivations: still, what is Regularization?
every adjustable knob and switch is called regularization: Dropout, Batch Size, Noisify Data, …
https://arxiv.org/pdf/1710.10686.pdf
Motivations: what is Regularization?
A familiar optimization problem: the Moore-Penrose pseudoinverse (1955), regularized (Phillips, 1962)
Soften the rank of X, focus on large eigenvalues
Ridge Regression / Tikhonov-Phillips Regularization
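A reconstruction of the standard Tikhonov-Phillips form (an assumption about the exact equation the slide showed, not taken verbatim from the deck):

$$ \min_x \; \|Ax - b\|_2^2 + \alpha^2 \|x\|_2^2, \qquad \hat{x} = (A^T A + \alpha^2 I)^{-1} A^T b $$

In the eigenbasis of the correlation matrix X = A^T A, each eigenvalue λ_i is shifted to λ_i + α², damping the small eigenvalues while the large ones dominate: the rank of X is softened.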
Set up: the Energy Landscape
minimize the Loss: but how do we avoid overtraining?
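A generic form of this objective (an assumption, stated here only to fix notation; the deck's own formula is not reproduced):

$$ \min_W \; E_{\mathrm{DNN}}(W) = \sum_{i} \mathcal{L}\big(f(x_i; W),\, y_i\big) $$

Minimizing the training loss alone says nothing about overtraining; that is what regularization, implicit or explicit, must control.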
Motivations: how we study Regularization
the Energy Landscape is determined by the layer weight matrices W_L
we can characterize the learning process by studying the eigenvalues of the correlation matrix
X = (1/N) W_L^T W_L
as in traditional regularization
Example: Latent Semantic Analysis
take the Term-Document TF-IDF matrix A
form the SVD of A
use TruncatedSVD, keeping the top k = 400 singular values
LSA is a soft rank approximation of A
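A minimal sketch of this pipeline in Python with scikit-learn (the corpus docs is a hypothetical stand-in; k = 400 follows the slide):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["…"]  # hypothetical corpus: one string per document

# TF-IDF matrix A (documents x terms; the slide's Term-Document matrix is its transpose)
A = TfidfVectorizer().fit_transform(docs)

# LSA: keep the top k = 400 singular values, a soft rank approximation of A
svd = TruncatedSVD(n_components=400)
A_k = svd.fit_transform(A)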
Example: Latent Semantic Analysis
Equivalently, we can form the correlation matrix X = A^T A
and compute the Empirical Spectral Density (ESD), i.e. a histogram of the eigenvalues of X
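In formulas, the ESD is the density of eigenvalues (a standard definition, consistent with how the ESD is used throughout the deck):

$$ \rho(\lambda) \;=\; \frac{1}{M} \sum_{i=1}^{M} \delta\big(\lambda - \lambda_i\big) $$

estimated in practice by a histogram of the λ_i.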
Example: Latent Semantic Analysis
[Figure: ESD, a histogram of eigenvalues, with a fitted power-law exponent. The bulk is thrown away; the spikes and the heavy tail are kept. The ESD is not at all random Gaussian; it could be spiked or heavy tailed.]
Heavy Tails in DNNs: ESD of Toy MLP
Trained a toy 3-layer MLP on CIFAR10 and monitored the ESD of the layer weight matrices (Q = 1)
[Figure: the ESD shows a bulk plus spikes.]
ESD of DNNs: detailed insight into W
Empirical Spectral Density (ESD: eigenvalues of X)
import keras
import numpy as np
import matplotlib.pyplot as plt
…
# layer weight matrix (Keras stores [weights, biases])
W = model.layers[i].get_weights()[0]
N, M = W.shape
…
# correlation matrix X and its eigenvalues, i.e. the ESD
X = np.dot(W.T, W) / N
evals = np.linalg.eigvals(X)
# plot the ESD as a normalized histogram of the eigenvalues
plt.hist(evals, bins=100, density=True)
plt.show()
Random Matrix Theory: detailed insight into W_L
DNN training induces a breakdown of Gaussian random structure and the onset of a new kind of heavy-tailed self-regularization:
Gaussian random matrix (small, older NNs) → Bulk+Spikes → Heavy Tailed (large, modern DNNs, and/or small batch sizes)
Random Matrix Theory: Marchenko-Pastur
RMT says that if W is a simple random Gaussian matrix, then the ESD has a very simple, known form:
the shape depends on Q = N/M (and the variance σ² ~ 1)
eigenvalues are tightly bounded, with very crisp edges, plus Tracy-Widom fluctuations
a few spikes may appear
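The Marchenko-Pastur density itself, as a reconstruction (the standard form; an assumption about exactly what the slide displayed):

$$ \rho_{\mathrm{MP}}(\lambda) = \frac{Q}{2\pi\sigma^2}\,\frac{\sqrt{(\lambda_{+}-\lambda)(\lambda-\lambda_{-})}}{\lambda}, \qquad \lambda_{\pm} = \sigma^2\Big(1 \pm \frac{1}{\sqrt{Q}}\Big)^2 $$

for λ ∈ [λ₋, λ₊]; the crisp edges are λ±, and Tracy-Widom describes the fluctuations of the largest eigenvalue around λ₊.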
Random Matrix Theory: Heavy Tailed
But if W is heavy tailed, the ESD will also have heavy tails (i.e. it is all spikes, and the bulk vanishes)
If W is strongly correlated, then the ESD can be modeled as if W were drawn from a heavy-tailed distribution
Nearly all pre-trained DNNs display heavy tails… as we shall soon see
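Roughly (hedged: the precise exponent relations are regime-dependent and are worked out in the papers listed above), if the entries of W follow a power law, the ESD inherits one:

$$ \Pr\big(|W_{ij}| > w\big) \sim w^{-\mu} \;\;\Rightarrow\;\; \rho(\lambda) \sim \lambda^{-(1 + \mu/2)} $$

in the heavier-tailed regimes of μ.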
Heavy Tailed RMT: Universality Classes
The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
Eigenvalue Analysis: Rank Collapse?
Modern DNNs: soft rank collapses, but they do not lose hard rank (Q > 1)
λ_min > 0: all of the smallest eigenvalues stay above zero, within a numerical threshold (cf. Numerical Recipes)
λ_min = 0: (hard) rank collapse, which signifies over-regularization
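A minimal numpy sketch of such a check (the tolerance convention is an assumption, mirroring the usual numerical-rank rule rather than anything specified in the deck):

import numpy as np

def hard_rank(W):
    # count the numerically nonzero eigenvalues of X = W^T W / N
    N, M = W.shape
    X = np.dot(W.T, W) / N
    evals = np.linalg.eigvalsh(X)  # X is symmetric, so its eigenvalues are real
    tol = evals.max() * max(N, M) * np.finfo(W.dtype).eps  # assumed threshold
    return int(np.sum(evals > tol))

# hypothetical layer: Gaussian random weights with Q = N/M = 2
W = np.random.randn(1000, 500) / np.sqrt(1000)
print(hard_rank(W), W.shape[1])  # expect full hard rank (500 of 500): no collapse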
Bulk+Spikes: Small Models
Smaller, older models can be described perturbatively with RMT:
a Bulk, plus Spikes that arise as rank-1 perturbations with perturbative corrections
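The standard spiked-model picture behind this (a sketch, assuming the usual rank-1 signal-plus-noise form rather than the deck's exact equation):

$$ W \;\approx\; W^{\mathrm{rand}} + \theta\, u v^{T} $$

so that X = (1/N) W^T W shows an MP bulk plus, when the signal strength θ is large enough, an isolated spike above the bulk edge λ₊.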
Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization
softer rank: the eigenvalues above a simple scale threshold, the spikes, carry most of the information
Heavy Tailed RMT: Scale-Free ESD
All large, well-trained, modern DNNs exhibit heavy-tailed self-regularization: the ESD is scale free
AlexNet, VGG, ResNet, Inception, DenseNet, …
Power Law Universality: ImageNet
All ImageNet models display remarkable Heavy-Tailed Universality:
500 matrices, from ~50 architectures
Linear layers & Conv2D feature maps
80-90% of the fitted power-law exponents are < 4
Rank Collapse: ImageNet and AllenNLP
The pretrained ImageNet and AllenNLP models show (almost) no rank collapse
Universality: Capacity metrics
Universality suggests the power-law exponent α would make a good, universal DNN capacity metric
Imagine a weighted average of the per-layer exponents, where the weights b_l ~ |W_l| reflect the scale of each matrix (sketched below)
An unsupervised, VC-like, data-dependent complexity metric for predicting trends in average-case generalization accuracy in DNNs
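Written out (a sketch; the choice b_l = log λ^max_l matches what the later slides adopt, and is taken here as an assumption):

$$ \hat{\alpha} \;=\; \frac{1}{L} \sum_{l=1}^{L} b_l\, \alpha_l, \qquad b_l = \log \lambda^{\mathrm{max}}_l $$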
DNN Capacity metrics: Product norms
The product norm is a data-dependent, VC-like capacity metric for DNNs
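In symbols (the standard form from this line of work; the specific choice of norm, e.g. Frobenius, is an assumption here):

$$ \mathcal{C} \;\sim\; \prod_{l=1}^{L} \|W_l\|, \qquad \log \mathcal{C} \;\sim\; \sum_{l=1}^{L} \log \|W_l\| $$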
Predicting test accuracies: Product norms
We can predict trends in the test accuracy without peeking at the test data!
Universality: Capacity metrics
We can do even better using the weighted-average alpha
But first, we need a relation between the Frobenius norm and the power law
and we need to solve for the weights b_l
Heavy Tailed matrices: norm-power-law relations
create a random Heavy-Tailed (Pareto) matrix W
compute the eigenvalues of X = (1/N) W^T W and fit them to a power law
examine the norm-power-law relation (code on the next slide)
Heavy Tailed matrices: norm-power-law relations
import numpy as np
import powerlaw
…
N,M = …            # matrix dimensions
mu = …             # Pareto tail exponent
# random Heavy-Tailed (Pareto) matrix
W = np.random.pareto(a=mu, size=(N,M))
# correlation matrix and its eigenvalues (the ESD)
X = np.dot(W.T, W)/N
evals = np.linalg.eigvals(X)
# fit the ESD to a power law
alpha = powerlaw.Fit(evals).alpha
# norm-power-law relation: (twice the) log Frobenius norm vs. log max eigenvalue
logNorm = np.log10(np.linalg.norm(W))
logMaxEig = np.log10(np.max(evals))
ratio = 2*logNorm/logMaxEig
Heavy Tailed matrices: norm-power-law relations
Finite-size Pareto matrices have a universal, linear norm-power-law relation
The relation is nearly linear for very Heavy and Fat Tailed finite-size matrices
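In symbols, with the ratio computed in the code above (a and b are placeholder constants for the empirical linear fit, not values from the deck):

$$ \frac{2 \log_{10} \|W\|_F}{\log_{10} \lambda_{\mathrm{max}}} \;\approx\; a\,\alpha + b $$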
Predicting test accuracies: weighted alpha
Here we treat both Linear layers and Conv2D feature maps
We can relate the log product norm to our weighted-alpha metric (sketched below)
The weights compensate for the different sizes & scales of the weight matrices & feature maps
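A sketch of the chain of reasoning, assuming the finite-size norm-power-law relation above holds layer by layer (up to the constants of the linear fit; the papers' exact derivation may differ):

$$ \log \prod_{l} \|W_l\|_F^{2} \;=\; \sum_{l} 2 \log \|W_l\|_F \;\approx\; \sum_{l} \alpha_l \log \lambda^{\mathrm{max}}_l $$

which is a weighted average of the per-layer exponents α_l with weights b_l = log λ^max_l.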
Predicting test accuracies: VGG Series
The weighted alpha capacity metric predicts trends in test accuracy
Open source tool: weightwatcher
pip install weightwatcher
…
import weightwatcher as ww
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
watcher.get_summary()
watcher.print_results()

A python tool to analyze Fat Tails in Deep Neural Networks; all results can be reproduced with it:
https://github.com/CalculatedContent/WeightWatcher