calculation | consulting
Implicit Self-Regularization in Deep Neural Networks
TWiML, Feb 20, 2019
charles@calculationconsulting.com
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry, UIUC
Over 15 years of experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Griffin Advisors
Alt. Energy: Anthropocene Institute (Page Family)
www.calculationconsulting.com
charles@calculationconsulting.com
Michael W. Mahoney
ICSI, RISELab, and Dept. of Statistics, UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and 2018 program on the Foundations of Data
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
Research: Implicit Self-Regularization in Deep Learning
• Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning (arXiv, long version)
• Traditional and Heavy Tailed Self Regularization in Neural Network Models (arXiv, short version, submitted ICML 2019)
• Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior (arXiv, long version, submitted JMLR 2019)
• Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks (arXiv, submitted ICML 2019)
Motivations: What is Regularization in DNNs?
"Understanding deep learning requires rethinking generalization" (ICLR 2017 Best Paper)
Large models overfit on randomly labeled data
Regularization cannot prevent this
Motivations: Remembering Regularization
Statistical Mechanics (1990s): Overtraining -> Spin Glass Phase
Binary Classifier with N Random Labelings:
2^N over-trained solutions: locally convex, very high barriers, all unable to generalize
See: Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior
Motivations: still, what is Regularization?
every adjustable knob and switch is called regularization: Dropout, Batch Size, Noisify Data, …
https://arxiv.org/pdf/1710.10686.pdf
Motivations: what is Regularization?
Ridge Regression / Tikhonov-Phillips Regularization
Moore-Penrose pseudoinverse (1955); regularize (Phillips, 1962); the familiar optimization problem:
minimize ||Ax - b||^2 + lam ||x||^2
Soften the rank of X, focus on the large eigenvalues
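A minimal sketch (illustrative sizes and names, not from the slides) of how Tikhonov regularization softens the rank: the SVD filter factors damp the small singular values rather than hard-truncating them.

import numpy as np

# Ridge / Tikhonov solution via the SVD: x = sum_i (s_i/(s_i^2 + lam)) (u_i . b) v_i
N, M = 100, 50                      # illustrative sizes
A = np.random.normal(size=(N, M))
b = np.random.normal(size=N)
lam = 0.1                           # regularization strength

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_ridge = Vt.T @ ((s / (s**2 + lam)) * (U.T @ b))

# Effective (soft) rank: the sum of the filter factors f_i = s_i^2/(s_i^2 + lam)
soft_rank = np.sum(s**2 / (s**2 + lam))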
Set up: the Energy Landscape
minimize the Loss: but how to avoid overtraining?
Motivations: how we study Regularization
we can characterize the learning process by studying W_L
the Energy Landscape is determined by the layer weights W_L,
i.e. the eigenvalues of the correlation matrix X = (1/N) W_L^T W_L,
as in traditional regularization
Random Matrix Theory: Universality Classes for DNN Weight Matrices
Example: Latent Semantic Analysis
take the Term-Document TF-IDF matrix A
form its SVD decomposition
use TruncatedSVD, keeping the top k=400 singular values
LSA is a soft-rank approximation of A
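A minimal sketch of this step with scikit-learn; load_corpus is a hypothetical placeholder for whatever supplies the documents, and k=400 follows the slide.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = load_corpus()                       # hypothetical helper: a list of document strings
A = TfidfVectorizer().fit_transform(docs)  # Term-Document TF-IDF matrix
svd = TruncatedSVD(n_components=400)       # keep the top k=400 singular values
docs_lsa = svd.fit_transform(A)            # LSA: soft-rank representation of A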
Example: Latent Semantic Analysis
Equivalently, we can form the correlation matrix X = A^T A
and compute the Empirical Spectral Density (ESD), i.e. the histogram of its eigenvalues
Example: Latent Semantic Analysis
[ESD: histogram of eigenvalues, with a Bulk (throw away), Spikes (keep), and a possible Heavy Tail (keep) with its power law exponent]
Not at all random Gaussian: could be spiked or heavy tailed?
Heavy Tails in DNNs: ESD of Toy MLP
Trained a toy 3-layer MLP on CIFAR10 and monitored the ESD of the layer weight matrices (Q=1)
[ESD plots: Bulk plus Spikes]
ESD of DNNs: detailed insight into W
Empirical Spectral Density (ESD: eigenvalues of X)
import keras
import numpy as np
import matplotlib.pyplot as plt
…
W = model.layers[i].get_weights()[0]     # layer weight matrix
N, M = W.shape                           # .shape is an attribute, not a method
…
X = np.dot(W.T, W)/N                     # correlation matrix
evals = np.linalg.eigvals(X)             # its eigenvalues
plt.hist(evals, bins=100, density=True)  # the ESD is their histogram
Random Matrix Theory: detailed insight into W_L
DNN training induces breakdown of Gaussian random structure
and the onset of a new kind of heavy tailed self-regularization
[Diagram: Gaussian random matrix -> Bulk + Spikes -> Heavy Tailed, moving from small, older NNs to large, modern DNNs and/or small batch sizes]
Random Matrix Theory: Marchenko-Pastur
RMT says that if W is a simple random Gaussian matrix, then the ESD has a very simple, known form:
the Marchenko-Pastur distribution, plus Tracy-Widom fluctuations at its very crisp edges
Shape depends on Q = N/M (and variance ~ 1); eigenvalues are tightly bounded; a few spikes may appear
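A minimal sketch (illustrative sizes, not from the slides) comparing the ESD of a Gaussian random matrix to the Marchenko-Pastur density:

import numpy as np
import matplotlib.pyplot as plt

N, M = 2000, 1000                        # illustrative sizes, Q = N/M = 2
Q = N / M
W = np.random.normal(0, 1, size=(N, M))
X = np.dot(W.T, W) / N                   # correlation matrix, variance ~ 1
evals = np.linalg.eigvalsh(X)

# Marchenko-Pastur density, supported on the interval [lam_minus, lam_plus]
lam_plus = (1 + 1/np.sqrt(Q))**2
lam_minus = (1 - 1/np.sqrt(Q))**2
lam = np.linspace(lam_minus, lam_plus, 500)
rho = Q * np.sqrt((lam_plus - lam) * (lam - lam_minus)) / (2 * np.pi * lam)

plt.hist(evals, bins=100, density=True)  # empirical ESD
plt.plot(lam, rho)                       # theoretical MP curve with crisp edges
plt.show()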
Random Matrix Theory: Heavy Tailed
But if W is heavy tailed, the ESD will also have heavy tails
(i.e. it's all spikes; the bulk vanishes)
If W is strongly correlated, then the ESD can be modeled as if W were drawn from a heavy tailed distribution
Nearly all pre-trained DNNs display heavy tails…as we shall soon see
Self-Regularization: 5+1 Phases of Training
(the phases: Random-like, Bleeding-out, Bulk+Spikes, Bulk-decay, Heavy-Tailed, plus Rank-collapse)
Heavy Tailed RMT: Universality Classes
The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
Self-Regularization: 5+1 Phases of Training
Experiments on pre-trained DNNs: Traditional and Heavy Tailed Implicit Self-Regularization
Experiments: just apply to pre-trained Models
LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
RMT: LeNet5
LeNet5 resembles the MP Bulk + Spikes
[Architecture: Conv2D -> MaxPool -> Conv2D -> MaxPool -> FC -> FC; softrank = 10%]
RMT: AlexNet
Marchenko-Pastur Bulk-decay | Heavy Tailed
[ESD plots for layers FC1 and FC2, full view and zoomed in]
Random Matrix Theory: InceptionV3
Marchenko-Pastur bulk decay, onset of Heavy Tails
[ESD plot for layer weight matrix W226]
Eigenvalue Analysis: Rank Collapse?
Modern DNNs: the soft rank collapses, but they do not lose hard rank (Q > 1)
lambda_min > 0: all the smallest eigenvalues stay > 0, within a numerical threshold (cf. Numerical Recipes)
lambda_min = 0: (hard) rank collapse (for Q > 1) signifies over-regularization
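A minimal sketch (illustrative; "soft rank" is computed here as the stable rank, one common choice) of checking a layer matrix for rank collapse:

import numpy as np

def rank_metrics(W):
    """Hard rank vs. a soft-rank proxy for a layer weight matrix W with N >= M (Q > 1)."""
    N, M = W.shape
    X = np.dot(W.T, W) / N
    evals = np.linalg.eigvalsh(X)
    hard_rank = np.sum(evals > 1e-10)          # eigenvalues above a numerical threshold
    soft_rank = np.sum(evals) / np.max(evals)  # stable rank: sum(evals) / max(evals)
    return hard_rank, soft_rank

# hard_rank < M (i.e. lambda_min = 0) would signal (hard) rank collapse / over-regularization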
Bulk+Spikes: Small Models
Smaller, older models can be described perturbatively with RMT
[Diagram: MP Bulk plus Spikes; a Rank-1 perturbation with a perturbative correction]
Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization:
softer rank; eigenvalues above a simple scale threshold; spikes carry most of the information
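A minimal sketch (illustrative) of that simple scale threshold: keep the spike eigenvalues above the Marchenko-Pastur bulk edge lam_plus.

import numpy as np

def spikes_above_bulk(W, sigma2=1.0):
    """Eigenvalues above the MP bulk edge lam_plus = sigma^2 * (1 + 1/sqrt(Q))^2."""
    N, M = W.shape                     # assumes N >= M, so Q = N/M >= 1
    Q = N / M
    X = np.dot(W.T, W) / N
    evals = np.linalg.eigvalsh(X)
    lam_plus = sigma2 * (1 + 1/np.sqrt(Q))**2
    return evals[evals > lam_plus]     # the spikes, which carry most of the information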
Heavy Tailed RMT: Scale Free ESD
All large, well trained, modern DNNs exhibit heavy tailed self-regularization:
AlexNet, VGG, ResNet, Inception, DenseNet, …
[Log-log ESD plots: scale free]
Universality of Power Law Exponents: 10,000 Weight Matrices
Power Law Universality: ImageNet
All ImageNet models display remarkable Heavy Tailed Universality
500 matrices across ~50 architectures, covering Linear layers & Conv2D feature maps
80-90% of the power law exponents fall below 4
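A minimal sketch (illustrative; assumes a Keras-style model and the powerlaw package) of collecting a power law exponent alpha for each 2D layer weight matrix:

import numpy as np
import powerlaw

def layer_alphas(model):
    """Fit a power law to the ESD of each 2D layer weight matrix; return the exponents."""
    alphas = []
    for layer in model.layers:
        weights = layer.get_weights()
        if not weights or weights[0].ndim != 2:
            continue                   # this sketch skips biases and Conv2D tensors
        W = weights[0]
        if W.shape[0] < W.shape[1]:
            W = W.T                    # ensure N >= M, so Q = N/M >= 1
        N, M = W.shape
        X = np.dot(W.T, W) / N
        evals = np.linalg.eigvalsh(X)
        alphas.append(powerlaw.Fit(evals).power_law.alpha)
    return alphas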
Rank Collapse: ImageNet and AllenNLP
The pretrained ImageNet and AllenNLP models show (almost) no rank collapse
Power Law Universality: BERT
The pretrained BERT model is not optimal: it displays rank collapse
Predicting Test / Generalization Accuracies: Mechanistic Universality
Universality: Capacity metrics
Universality suggests the power law exponent alpha would make a good, universal DNN capacity metric
Imagine a weighted average of the per-layer exponents, where the weights b ~ |W| reflect the scale of each matrix:
an unsupervised, VC-like, data dependent complexity metric for predicting trends in average-case generalization accuracy in DNNs
DNN Capacity metrics: Product norms
The product norm is a data-dependent, VC-like capacity metric for DNNs
Predicting test accuracies: Product norms
We can predict trends in the test accuracy without peeking at the test data!
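A minimal sketch (illustrative; the Frobenius norm is one common choice here) of a log product norm over the layer weight matrices; trends in this number across a model series can then be compared against the reported test accuracies.

import numpy as np

def log_product_norm(weight_matrices):
    """log prod_l ||W_l||^2 = sum_l 2*log10 ||W_l||; larger suggests more capacity."""
    return sum(2 * np.log10(np.linalg.norm(W)) for W in weight_matrices)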
Universality: Capacity metrics
We can do even better using the weighted average alpha
But first, we need a relation between the Frobenius norm and the Power Law
and we need to solve: what are the weights b?
Heavy Tailed matrices: norm-powerlaw relations
create a random Heavy Tailed (Pareto) matrix W
compute the 'eigenvalues' of X = (1/N) W^T W and fit them to a power law
examine the norm-powerlaw relation: 2 log||W|| / log(lambda_max)
Heavy Tailed matrices: norm-powerlaw relations
import numpy as np
import powerlaw
…
N, M = …                                       # matrix dimensions
mu = …                                         # Pareto tail exponent
W = np.random.pareto(a=mu, size=(N, M))
X = np.dot(W.T, W)/N
evals = np.linalg.eigvals(X)
alpha = powerlaw.Fit(evals).power_law.alpha    # fit the ESD to a power law
logNorm = np.log10(np.linalg.norm(W))          # log Frobenius norm
logMaxEig = np.log10(np.max(evals))            # log of the largest eigenvalue
ratio = 2*logNorm/logMaxEig
Heavy Tailed matrices: norm-power law relations
Finite size Pareto matrices have a universal, linear norm-power-law relation:
the relation is nearly linear for very Heavy and Fat Tailed finite size matrices
Predicting test accuracies: weighted alpha
Here we treat both Linear layers and Conv2D feature maps
We can relate the log product norm to our weighted alpha metric
The weights compensate for the different sizes & scales of the weight matrices & feature maps
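A minimal sketch of one plausible form of the weighted average alpha (an assumption based on the talk's description, not necessarily the paper's exact definition), with the weights b_i set by the scale log(lambda_max) of each matrix:

import numpy as np

def weighted_alpha(alphas, max_eigs):
    """Weighted average of per-layer exponents; b_i ~ log10(lambda_max_i) sets the scale."""
    b = np.log10(np.asarray(max_eigs))   # assumed weighting; compensates for matrix scale
    a = np.asarray(alphas)
    return np.sum(b * a) / np.sum(b)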
Predicting test accuracies: VGG Series
The weighted alpha capacity metric predicts trends in test accuracy
Predicting test accuracies: ResNet Series
Open source tool: weightwatcher
A python tool to analyze Fat Tails in Deep Neural Networks; all results can be reproduced with it.
https://github.com/CalculatedContent/WeightWatcher

pip install weightwatcher
…
import weightwatcher as ww
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
watcher.get_summary()
watcher.print_results()
c|c
charles@calculationconsulting.com