calculation | consulting
why deep learning works:
self-regularization in deep neural networks
charles@calculationconsulting.com
calculation|consulting
UC Berkeley / ICSI 2018
calculation | consulting why deep learning works
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry, UIUC
Over 15 years experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Anthropocene Institute
www.calculationconsulting.com
charles@calculationconsulting.com
Michael W. Mahoney
ICSI, RISELab, Dept. of Statistics UC Berkeley
Algorithmic and statistical aspects of modern large-scale data analysis.
large-scale machine learning | randomized linear algebra
geometric network analysis | scalable implicit regularization
PhD, Yale University, computational chemical physics
SAMSI National Advisory Committee
NRC Committee on the Analysis of Massive Data
Simons Institute Fall 2013 and 2018 program on the Foundations of Data
Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets
NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley
https://www.stat.berkeley.edu/~mmahoney/
mmahoney@stat.berkeley.edu
Motivations: towards a Theory of Deep Learning
NNs as spin glasses
LeCun et al. 2015
Looks exactly like old protein folding results (late 90s): Energy Landscape Theory
broad questions about Why Deep Learning Works?
MMDS talk 2016, Blog post 2015: a completely different picture of DNNs
Motivations: towards a Theory of Deep Learning
Theoretical: deeper insight into Why Deep Learning Works?
non-convex optimization ?
regularization ?
why is deep better ?
VC vs Stat Mech vs ?
…
Practical: useful insight to improve engineering DNNs
when is a network fully optimized ?
large batch sizes ?
weakly supervised deep learning ?
…
Set up: the Energy Landscape
minimize the Loss: but how do we avoid overtraining?
Problem: How can this possibly work ?
highly non-convex? apparently not
expected vs. observed
it has been suspected for a long time that local minima are not the issue
Problem: Local Minima ?
Duda, Hart and Stork, 2000
solution: add more capacity and regularize
Problem: What is Regularization in DNNs?
Understanding deep learning requires rethinking generalization
ICLR 2017 Best Paper
Large models overfit on randomly labeled data
Regularization cannot prevent this
Motivations: what is Regularization ?
every adjustable knob and switch is called regularization
https://arxiv.org/pdf/1710.10686.pdf
Dropout Batch Size Noisify Data
…
Moore-Penrose pseudoinverse (1955)
regularize (Phillips, 1962)
familiar optimization problem
Motivations: what is Regularization ?
Soften the rank of X, focus on the large eigenvalues
Ridge Regression / Tikhonov-Phillips Regularization
https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
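The Tikhonov-Phillips idea can be sketched numerically. A minimal illustration (function and variable names here are mine, not from the slides): the ridge solution damps each singular direction by the filter factor s^2/(s^2 + alpha), which softens the rank by suppressing the small eigenvalues while keeping the large ones.

```python
import numpy as np

def ridge_solve(A, y, alpha):
    # minimize ||A x - y||^2 + alpha ||x||^2 via the SVD;
    # filter factors s/(s^2 + alpha): ~1/s for large s, -> 0 for small s
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    filt = s / (s**2 + alpha)
    return Vt.T @ (filt * (U.T @ y))

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
y = rng.normal(size=100)
x0 = ridge_solve(A, y, alpha=0.0)    # alpha = 0: the Moore-Penrose pseudoinverse solution
x1 = ridge_solve(A, y, alpha=10.0)   # alpha > 0: regularized, smaller-norm solution
```

With alpha = 0 this reproduces the pseudoinverse solution exactly; increasing alpha shrinks the solution norm.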
Motivations: how we study Regularization
we can characterize the learning process by studying the layer weights W_L
the Energy Landscape is determined by the layer weights W_L
and the eigenvalues of X = (1/N) W_L^T W_L
Random Matrix Theory: Universality Classes for DNN Weight Matrices
Random Matrix Theory: Marchenko-Pastur
the Empirical Spectral Density (ESD) converges to a deterministic function
with well-defined edges (depending on Q, the aspect ratio)
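A minimal numerical sketch of this claim (sizes and names are illustrative): sample a Gaussian W, form X = W^T W / N, and check the ESD against the deterministic Marchenko-Pastur edges, which depend only on Q = N/M and the element variance sigma^2.

```python
import numpy as np

def mp_edges(Q, sigma=1.0):
    # bulk edges of the Marchenko-Pastur density for aspect ratio Q = N/M >= 1
    lam_minus = sigma**2 * (1.0 - 1.0/np.sqrt(Q))**2
    lam_plus  = sigma**2 * (1.0 + 1.0/np.sqrt(Q))**2
    return lam_minus, lam_plus

N, M = 4000, 1000                      # Q = N/M = 4
Q = N / M
rng = np.random.default_rng(0)
W = rng.normal(size=(N, M))
X = W.T @ W / N                        # empirical correlation matrix
evals = np.linalg.eigvalsh(X)          # its ESD is these eigenvalues

lam_minus, lam_plus = mp_edges(Q)      # (0.25, 2.25) for Q = 4, sigma = 1
```

For Gaussian W, essentially all eigenvalues fall inside [lam_minus, lam_plus], up to small finite-size (Tracy-Widom) fluctuations at the edges.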
Random Matrix Theory: Marchenko-Pastur
plus Tracy-Widom fluctuations
very crisp edges
Random Matrix Theory: detailed insight into W
Empirical Spectral Density (ESD: eigenvalues of X)
import keras
import numpy as np
import matplotlib.pyplot as plt
…
W = model.layers[i].get_weights()[0]
N, M = W.shape
…
X = np.dot(W.T, W)/N
evals = np.linalg.eigvals(X)
plt.hist(evals, bins=100, density=True)
Random Matrix Theory: detailed insight into W_L
DNN training induces breakdown of Gaussian random structure
and the onset of a new kind of heavy tailed self-regularization
Gaussian random matrix -> Bulk+Spikes -> Heavy-Tailed
small, older NNs -> large, modern DNNs
Random Matrix Theory: Heavy Tailed
A heavy-tailed random matrix has a heavy-tailed empirical spectral density.
Given W with elements drawn from a heavy-tailed distribution,
form X = (1/N) W^T W (but really the empirical correlation matrix).
Then the ESD has a power-law form, with a tail exponent that is linear in the exponent of W.
Random Matrix Theory: Heavy Tailed
RMT says if W is heavy tailed, the ESD will also have heavy tails
If W is strongly correlated, then the ESD can be modeled as if W were drawn from a heavy-tailed distribution
Known results from the early 90s, developed in Finance (Bouchaud, Potters, etc.)
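A small sketch of that contrast, under the assumption that a Pareto matrix stands in for strongly correlated W: the Gaussian ESD stays (essentially) inside the MP bulk edge, while the heavy-tailed ESD puts substantial weight far beyond it. Sizes and the tail parameter are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 2000, 500
Q = N / M
bulk_edge = (1 + 1/np.sqrt(Q))**2      # MP lambda_plus for sigma = 1

def esd(W):
    # eigenvalues of the empirical correlation matrix X = W^T W / N
    X = W.T @ W / W.shape[0]
    return np.linalg.eigvalsh(X)

evals_gauss  = esd(rng.normal(size=(N, M)))
evals_pareto = esd(rng.pareto(a=2.0, size=(N, M)))   # heavy-tailed entries

frac_gauss  = np.mean(evals_gauss  > bulk_edge)      # ~0: edge is crisp
frac_pareto = np.mean(evals_pareto > bulk_edge)      # substantial tail mass
```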
Heavy Tailed RMT: Universality Classes
The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
Experiments on pre-trained DNNs:
Traditional and Heavy Tailed
Implicit Self-Regularization
Experiments: just apply to pre-trained Models
LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
Conv2D MaxPool Conv2D MaxPool FC FC
RMT: LeNet5
LeNet5 resembles the MP Bulk + Spikes
Conv2D MaxPool Conv2D MaxPool FC FC
soft rank = 10%
RMT: AlexNet
Marchenko-Pastur Bulk-decay | Heavy Tailed
FC1 and FC2 (zoomed in)
Random Matrix Theory: InceptionV3
Marchenko-Pastur bulk decay, onset of Heavy Tails
W226
Eigenvalue Analysis: Rank Collapse ?
Modern DNNs: soft rank collapses; they do not lose hard rank (Q > 1)
λ_min = 0: (hard) rank collapse signifies over-regularization
λ_min > 0: all the smallest eigenvalues are positive, within a numerical (Recipes) threshold
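These two checks can be sketched in a few lines (the 1e-10 relative threshold is an illustrative stand-in for the numerical tolerance; names are mine). Hard rank collapse means the smallest eigenvalue of X hits zero; the soft (stable) rank tr(X)/λ_max can shrink without any hard collapse.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 250                       # Q = N/M = 4 > 1
W = rng.normal(size=(N, M))

X = W.T @ W / N
evals = np.linalg.eigvalsh(X)

# hard rank: smallest eigenvalue stays away from zero (within a tolerance)
hard_rank_ok = evals.min() > 1e-10 * evals.max()

# soft / stable rank: tr(X) / lambda_max = ||W||_F^2 / ||W||_2^2, at most M
soft_rank = evals.sum() / evals.max()
```

For a random Gaussian W there is no collapse and the soft rank is a sizable fraction of M; in trained DNN layers the soft rank shrinks while the hard rank is retained.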
Bulk+Spikes: Small Models
Rank-1 perturbation -> perturbative correction: Bulk + Spikes
Smaller, older models can be described perturbatively with RMT
Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization:
softer rank; eigenvalues above a simple scale threshold (the spikes) carry most of the information
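A sketch of the Bulk+Spikes picture with a synthetic rank-1 signal (the signal strength 100 is an arbitrary choice, large enough to push a spike past the MP bulk edge; all names are illustrative): eigenvalues above λ_plus act as the informative spikes, a simple scale threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 2000, 500
Q = N / M

# rank-1 "signal" added to Gaussian noise
u = rng.normal(size=N); u /= np.linalg.norm(u)
v = rng.normal(size=M); v /= np.linalg.norm(v)
W = rng.normal(size=(N, M)) + 100.0 * np.outer(u, v)

X = W.T @ W / N
evals = np.linalg.eigvalsh(X)

lam_plus = (1 + 1/np.sqrt(Q))**2       # MP bulk edge, sigma = 1
n_spikes = int(np.sum(evals > lam_plus))   # eigenvalues above the threshold
```

The bulk stays below lam_plus; only the few spike eigenvalues, carrying the planted signal, cross it.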
Heavy Tailed RMT: Scale-Free ESD
All large, well-trained, modern DNNs exhibit heavy-tailed self-regularization:
AlexNet, VGG, ResNet, Inception, DenseNet, …
the ESD is scale-free
Universality of Power Law Exponents:
10,000 Weight Matrices
Power Law Universality: ImageNet
All ImageNet models display remarkable Heavy Tailed Universality
500 matrices
~50 architectures
Linear layers & Conv2D feature maps
80-90% have exponent α < 4
Rank Collapse: ImageNet and AllenNLP
The pretrained ImageNet and AllenNLP show (almost) no rank collapse
Power Law Universality: BERT
The pretrained BERT model is not optimal; it displays rank collapse
Predicting Test / Generalization Accuracies:
Universality and Complexity Metrics
Universality: Capacity metrics
Universality suggests the power-law exponent
would make a good, universal DNN capacity metric
Imagine a weighted average of the per-layer exponents,
where the weights b ~ |W|, the scale of the matrix
An Unsupervised, VC-like data dependent complexity metric for
predicting trends in average case generalization accuracy in DNNs
DNN Capacity metrics: Product norms
The product norm is a data-dependent, VC-like capacity metric for DNNs
Predicting test accuracies: Product norms
We can predict trends in the test accuracy without peeking at the test data!
Universality: Capacity metrics
We can do even better using the weighted average alpha
But first, we need a relation between the Frobenius norm and the Power Law
And to solve… what are the weights b?
Heavy Tailed matrices: norm-powerlaw relations
create a random Heavy-Tailed (Pareto) matrix
form the correlation matrix, select a Normalization
the Frobenius Norm-Power Law relation depends on the Normalization
Heavy Tailed matrices: norm-powerlaw relations
Frobenius Norm-Power Law relation depends on the Normalization
compute the ‘eigenvalues’ of X, fit them to a power law
examine the norm-powerlaw relation:
Heavy Tailed matrices: norm-powerlaw relations
import numpy as np
import powerlaw
…
N, M = …
mu = …
W = np.random.pareto(a=mu, size=(N, M))
X = np.dot(W.T, W)/N
evals = np.linalg.eigvals(X)
alpha = powerlaw.Fit(evals).power_law.alpha
logNorm = np.log10(np.linalg.norm(W))
logMaxEig = np.log10(np.max(evals))
ratio = 2*logNorm/logMaxEig
Heavy Tailed matrices: norm-powerlaw relations
Scale-free Normalization
the Frobenius Norm is dominated by a single scale-free eigenvalue
the relation is weakly scale dependent
Heavy Tailed matrices: norm-power law relations
Standard Normalization
large Pareto matrices have a simple, limiting norm-power-law relation
the relation is only linear for very Heavy-Tailed matrices
Heavy Tailed matrices: norm-power law relations
finite-size Pareto matrices have a universal, linear norm-power-law relation
the relation is nearly linear for very Heavy- and Fat-Tailed finite-size matrices
Predicting test accuracies: weighted alpha
Here we treat both Linear layers and Conv2D feature maps
Predicting test accuracies: weighted alpha
Associate the log product norm with the weighted alpha metric
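One way to sketch the weighted average alpha. Hedged assumptions: the slides leave the weights b open, so b_i = log10(lambda_max,i) is used here as a stand-in for "b ~ |W|, the scale of the matrix", and a simple continuous-MLE (Hill-type) fit replaces the powerlaw package so the sketch is self-contained. All names are illustrative.

```python
import numpy as np

def pl_alpha(evals, xmin=None):
    # continuous power-law MLE: alpha = 1 + n / sum(log(x / xmin));
    # xmin defaults to the median of the positive eigenvalues (a crude choice)
    x = np.asarray(evals, dtype=float)
    if xmin is None:
        xmin = np.quantile(x[x > 0], 0.5)
    tail = x[x >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

def weighted_alpha(weight_matrices):
    # weighted average of per-layer exponents, weights b_i = log10(lambda_max,i)
    alphas, weights = [], []
    for W in weight_matrices:
        N, M = W.shape
        evals = np.linalg.eigvalsh(W.T @ W / N)
        alphas.append(pl_alpha(evals))
        weights.append(np.log10(evals.max()))
    return float(np.average(alphas, weights=weights))
```

Applied to the layer weight matrices of a trained model, this yields a single scalar whose trend can be compared against test accuracy across a model series.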
Predicting test accuracies: VGG Series
The weighted alpha capacity metric predicts trends in test accuracy
Predicting test accuracies: ResNet Series
Open source tool: weightwatcher
pip install weightwatcher
…
import weightwatcher as ww
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
watcher.get_summary()
watcher.print_results()
All results can be reproduced using the python weightwatcher tool
python tool to analyze Fat Tails in Deep Neural Networks
https://github.com/CalculatedContent/WeightWatcher
Implicit Self-Regularization: 5+1 Phases of Training
Self-Regularization: Batch size experiments
Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC
We can induce heavy tails in small models by decreasing the batch size
Mini-AlexNet: retrained to exploit Generalization Gap
Self-Regularization: Batch size experiments
Large batch sizes => decreased generalization accuracy
Batch Size Tuning: Generalization Gap
Random-Like | Bleeding-out | Random-Like
Decreasing the batch size induces strong correlations in W
Gaussian random matrix -> Bulk+Spikes -> Heavy-Tailed
large batch sizes -> small batch sizes
Bulk+Spikes | Bulk+Spikes | Bulk-decay
Bulk-decay | Bulk-decay | Heavy-Tailed
Summary

•reviewed Random Matrix Theory (RMT)
•small NNs ~ Tikhonov regularization
•modern DNNs are Heavy-Tailed
•Universality of the Power Law exponent alpha
•can predict generalization accuracies using a new capacity metric, the weighted average alpha
•can induce heavy tails by exploiting the Generalization Gap phenomenon, decreasing the batch size
Energy Landscape: and Information flow
(landscape diagram labels: Information bottleneck, Entropy collapse, local minima, k=1 saddle points, floor / ground state, k=2 saddle points, Information / Entropy)
Is this right? Based on a Gaussian Spin Glass model
Implications: RMT and Deep Learning
How can we characterize the Energy Landscape ?
tradeoff between
Energy and Entropy minimization
Where are the local minima ?
How does the Hessian behave?
Are simpler models misleading ?
Can we design better learning strategies ?
Energy Funnels: Minimizing Frustration

http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
Energy Landscape Theory for polymer / protein folding
the Spin Glass of Minimal Frustration 

Conjectured 2015 on my blog (15 min fame on Hacker News)
https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/
Bulk+Spikes, flipped:
a low-lying Energy state in the Spin Glass ~ spikes in RMT
RMT w/Heavy Tails: Energy Landscape ?
Compare to LeCun’s Spin Glass model (2015)
Spin Glass w/Heavy Tails?
Local minima do not concentrate near the ground state (Cizeau P and Bouchaud J-P 1993)
the Landscape is more funneled, with no ‘problems’ from local minima?
charles@calculationconsulting.com

Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Último (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Why Deep Learning Works: Dec 13, 2018 at ICSI, UC Berkeley

  • 1. calculation | consulting why deep learning works: self-regularization in deep neural networks (TM) c|c (TM) charles@calculationconsulting.com
  • 2. calculation|consulting UC Berkeley / ICSI 2018 why deep learning works: self-regularization in deep neural networks (TM) charles@calculationconsulting.com
  • 3. calculation | consulting why deep learning works Who Are We? c|c (TM) Dr. Charles H. Martin, PhD University of Chicago, Chemical Physics NSF Fellow in Theoretical Chemistry, UIUC Over 15 years experience in applied Machine Learning and AI ML algos for: Aardvark, acquired by Google (2010) Demand Media (eHow); first $1B IPO since Google Wall Street: BlackRock Fortune 500: Roche, France Telecom BigTech: eBay, Aardvark (Google), GoDaddy Private Equity: Anthropocene Institute www.calculationconsulting.com charles@calculationconsulting.com (TM) 3
  • 4. calculation | consulting why deep learning works c|c (TM) (TM) 4 Michael W. Mahoney ICSI, RISELab, Dept. of Statistics UC Berkeley Algorithmic and statistical aspects of modern large-scale data analysis. large-scale machine learning | randomized linear algebra geometric network analysis | scalable implicit regularization PhD, Yale University, computational chemical physics SAMSI National Advisory Committee NRC Committee on the Analysis of Massive Data Simons Institute Fall 2013 and 2018 program on the Foundations of Data Biennial MMDS Workshops on Algorithms for Modern Massive Data Sets NSF/TRIPODS-funded Foundations of Data Analysis Institute at UC Berkeley https://www.stat.berkeley.edu/~mmahoney/ mmahoney@stat.berkeley.edu Who Are We?
  • 5. c|c (TM) Motivations: towards a Theory of Deep Learning (TM) 5 calculation | consulting why deep learning works NNs as spin glasses LeCun et al. 2015 Looks exactly like old protein folding results (late 90s) Energy Landscape Theory broad questions about Why Deep Learning Works ? MDDS talk 2016 Blog post 2015 completely different picture of DNNs
  • 6. c|c (TM) Motivations: towards a Theory of Deep Learning (TM) 6 calculation | consulting why deep learning works Theoretical: deeper insight into Why Deep LearningWorks ? non-convex optimization ? regularization ? why is deep better ? VC vs Stat Mech vs ? … Practical: useful insight to improve engineering DNNs when is a network fully optimized ? large batch sizes ? weakly supervised deep learning ? …
  • 7. c|c (TM) Set up: the Energy Landscape (TM) 7 calculation | consulting why deep learning works minimize the Loss: but how to avoid overtraining ?
  • 8. c|c (TM) Problem: How can this possibly work ? (TM) 8 calculation | consulting why deep learning works highly non-convex ? apparently not (expected vs. observed) it has been suspected for a long time that local minima are not the issue
  • 9. c|c (TM) Problem: Local Minima ? (TM) 9 calculation | consulting why deep learning works Duda, Hart and Stork, 2000 solution: add more capacity and regularize
  • 10. c|c (TM) (TM) 10 calculation | consulting why deep learning works Understanding deep learning requires rethinking generalization Problem: What is Regularization in DNNs ? ICLR 2017 Best paper Large models overfit on randomly labeled data. Regularization cannot prevent this
  • 11. c|c (TM) Motivations: what is Regularization ? (TM) 11 calculation | consulting why deep learning works every adjustable knob and switch is called regularization https://arxiv.org/pdf/1710.10686.pdf Dropout Batch Size Noisify Data …
  • 12. Moore-Penrose pseudoinverse (1955) regularize (Phillips, 1962) familiar optimization problem c|c (TM) Motivations: what is Regularization ? (TM) 12 calculation | consulting why deep learning works Soften the rank of X, focus on large eigenvalues ( ) Ridge Regression / Tikhonov-Phillips Regularization https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
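The Tikhonov-Phillips idea on this slide can be sketched numerically. A minimal example, assuming a toy near-collinear design matrix and a hypothetical regularization parameter alpha: the ridge solution (X^T X + alpha*I)^{-1} X^T y shrinks every eigen-direction of the pseudoinverse solution, softening the effect of the small eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ill-conditioned least-squares problem: y = X beta + noise
N, M = 100, 10
X = rng.standard_normal((N, M))
X[:, -1] = X[:, 0] + 1e-6 * rng.standard_normal(N)  # near-collinear column
beta_true = rng.standard_normal(M)
y = X @ beta_true + 0.1 * rng.standard_normal(N)

alpha = 1.0  # Tikhonov regularization parameter (hypothetical value)

# Ridge / Tikhonov-Phillips solution: (X^T X + alpha*I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(M), X.T @ y)

# alpha -> 0 limit: the Moore-Penrose pseudoinverse solution
beta_pinv = np.linalg.pinv(X) @ y
```

Because ridge rescales each spectral component by s/(s^2 + alpha) instead of 1/s, the regularized solution always has smaller norm than the pseudoinverse solution.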
  • 13. c|c (TM) Motivations: how we study Regularization (TM) 13 we can characterize the learning process by studying the layer weight matrices W_L; the Energy Landscape is determined by W_L and the eigenvalues of X = (1/N) W_L^T W_L
  • 14. c|c (TM) (TM) 14 calculation | consulting why deep learning works Random Matrix Theory: Universality Classes for DNN Weight Matrices
  • 15. c|c (TM) (TM) 15 calculation | consulting why deep learning works Random Matrix Theory: Marchenko-Pastur converges to a deterministic function Empirical Spectral Density (ESD) with well defined edges (depends on Q, aspect ratio)
  • 16. c|c (TM) (TM) 16 calculation | consulting why deep learning works Random Matrix Theory: Marchenko-Pastur plus Tracy-Widom fluctuations very crisp edges
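The crisp Marchenko-Pastur edges can be checked directly. A small sketch, assuming unit-variance Gaussian entries, where the deterministic bulk edges for X = (1/N) W^T W are lam_pm = (1 ± 1/sqrt(Q))^2 with Q = N/M:

```python
import numpy as np

rng = np.random.default_rng(42)
N, M = 2000, 1000           # aspect ratio Q = N/M = 2
Q = N / M

W = rng.standard_normal((N, M))   # Gaussian random "weight matrix"
X = W.T @ W / N                   # correlation matrix
evals = np.linalg.eigvalsh(X)     # Empirical Spectral Density support

# Deterministic Marchenko-Pastur bulk edges (sigma^2 = 1):
lam_plus = (1 + 1 / np.sqrt(Q)) ** 2
lam_minus = (1 - 1 / np.sqrt(Q)) ** 2

# The empirical extremes sit very close to the MP edges,
# up to small Tracy-Widom scale fluctuations
print(evals.min(), lam_minus)
print(evals.max(), lam_plus)
```

At these matrix sizes the edge fluctuations are tiny, which is why the edges look "very crisp" in the slide's plots.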
  • 17. c|c (TM) (TM) 17 calculation | consulting why deep learning works Random Matrix Theory: detailed insight into W Empirical Spectral Density (ESD: eigenvalues of X) import keras import numpy as np import matplotlib.pyplot as plt … W = model.layers[i].get_weights()[0] N,M = W.shape … X = np.dot(W.T, W)/N evals = np.linalg.eigvals(X) plt.hist(evals, bins=100, density=True)
  • 18. c|c (TM) (TM) 18 calculation | consulting why deep learning works Random Matrix Theory: detailed insight into WL DNN training induces breakdown of Gaussian random structure and the onset of a new kind of heavy tailed self-regularization Gaussian random matrix Bulk+ Spikes Heavy Tailed Small, older NNs Large, modern DNNs
  • 19. c|c (TM) (TM) 19 calculation | consulting why deep learning works Random Matrix Theory: Heavy Tailed A heavy tailed random matrix has a heavy tailed empirical spectral density. Given W with heavy tailed (Pareto) entries, form X = (1/N) W^T W; then the ESD also takes a power law form, with a tail exponent that is linear in the exponent of W
  • 20. c|c (TM) (TM) 20 calculation | consulting why deep learning works Random Matrix Theory: Heavy Tailed RMT says if W is heavy tailed, the ESD will also have heavy tails. If W is strongly correlated, then the ESD can be modeled as if W is drawn from a heavy tailed distribution. Known results from the early 90s, developed in Finance (Bouchaud, Potters, etc.)
  • 21. c|c (TM) (TM) 21 calculation | consulting why deep learning works Heavy Tailed RMT: Universality Classes The familiar Wigner/MP Gaussian class is not the only Universality class in RMT
  • 22. c|c (TM) (TM) 22 calculation | consulting why deep learning works Experiments on pre-trained DNNs: Traditional and Heavy Tailed Implicit Self-Regularization
  • 23. c|c (TM) (TM) 23 calculation | consulting why deep learning works Experiments: just apply to pre-trained Models LeNet5 (1998) AlexNet (2012) InceptionV3 (2014) ResNet (2015) … DenseNet201 (2018) https://medium.com/@siddharthdas_32104/ cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5 Conv2D MaxPool Conv2D MaxPool FC FC
  • 24. c|c (TM) (TM) 24 calculation | consulting why deep learning works LeNet 5 resembles the MP Bulk + Spikes Conv2D MaxPool Conv2D MaxPool FC FC softrank = 10% RMT: LeNet5
  • 25. c|c (TM) (TM) 25 calculation | consulting why deep learning works RMT: AlexNet Marchenko-Pastur Bulk-decay | Heavy Tailed FC1 zoomed in FC2 zoomed in
  • 26. c|c (TM) (TM) 26 calculation | consulting why deep learning works Random Matrix Theory: InceptionV3 Marchenko-Pastur bulk decay, onset of Heavy Tails W226
  • 27. c|c (TM) (TM) 27 calculation | consulting why deep learning works Eigenvalue Analysis: Rank Collapse ? Modern DNNs: soft rank collapses; they do not lose hard rank. Hard rank collapse (Q > 1), i.e. smallest eigenvalues = 0, signifies over-regularization; in practice all smallest eigenvalues stay > 0, within a numerical (Numerical Recipes) threshold
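A minimal sketch of the two rank checks, where the "stable rank" (trace over largest eigenvalue) stands in for the slide's soft rank — an assumption, since the deck's exact softrank definition is not shown here:

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_rank(evals, tol=1e-10):
    # number of eigenvalues above a numerical threshold;
    # hard rank collapse would mean some eigenvalues are exactly 0
    return int(np.sum(evals > tol))

def stable_rank(evals):
    # one common "soft rank" proxy: ||W||_F^2 / lambda_max
    return float(np.sum(evals) / np.max(evals))

N, M = 500, 100
W = rng.standard_normal((N, M))   # stand-in for a trained layer matrix
X = W.T @ W / N
evals = np.linalg.eigvalsh(X)

print(hard_rank(evals))    # full hard rank: M eigenvalues above threshold
print(stable_rank(evals))  # well below M, and it shrinks further as the ESD skews
```

For a well-trained layer the slide's claim is that the first number stays at M while the second (soft) number collapses.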
  • 28. c|c (TM) (TM) 28 calculation | consulting why deep learning works Bulk+Spikes: Small Models Rank 1 perturbation Perturbative correction Bulk Spikes Smaller, older models can be described perturbatively w/RMT
  • 29. c|c (TM) (TM) 29 calculation | consulting why deep learning works Bulk+Spikes: ~ Tikhonov regularization Small models like LeNet5 exhibit traditional regularization: softer rank, eigenvalues above a simple scale threshold, spikes carry most of the information
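The Bulk+Spikes picture can be simulated as a rank-1 perturbation of a Gaussian matrix. A sketch, where theta is a hypothetical spike strength chosen large enough to escape the Marchenko-Pastur bulk:

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 2000, 500
Q = N / M

# Gaussian bulk plus a rank-1 perturbation (the "spike")
W = rng.standard_normal((N, M))
u = rng.standard_normal(N)
u /= np.linalg.norm(u)
v = rng.standard_normal(M)
v /= np.linalg.norm(v)
theta = 100.0                        # hypothetical spike strength
W_spiked = W + theta * np.outer(u, v)

X = W_spiked.T @ W_spiked / N
evals = np.linalg.eigvalsh(X)

lam_plus = (1 + 1 / np.sqrt(Q)) ** 2     # MP bulk edge
n_spikes = int(np.sum(evals > lam_plus + 0.1))
print(n_spikes)   # outlier eigenvalue(s) above the bulk edge
```

The strongly correlated rank-1 direction shows up as an eigenvalue outside the bulk, which is exactly the "spike carries the information" picture of the slide.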
  • 30. c|c (TM) (TM) 30 calculation | consulting why deep learning works AlexNet, 
 VGG, ResNet, Inception, DenseNet, … Heavy Tailed RMT: Scale Free ESD All large, well trained, modern DNNs exhibit heavy tailed self-regularization scale free
  • 31. c|c (TM) (TM) 31 calculation | consulting why deep learning works Universality of Power Law Exponents: 10,000 Weight Matrices
  • 32. c|c (TM) (TM) 32 calculation | consulting why deep learning works Power Law Universality: ImageNet All ImageNet models display remarkable Heavy Tailed Universality
  • 33. c|c (TM) (TM) 33 calculation | consulting why deep learning works Power Law Universality: ImageNet All ImageNet models display remarkable Heavy Tailed Universality: 500 matrices, ~50 architectures, Linear layers & Conv2D feature maps; 80-90% of exponents < 4
  • 34. c|c (TM) (TM) 34 calculation | consulting why deep learning works Rank Collapse: ImageNet and AllenNLP The pretrained ImageNet and AllenNLP show (almost) no rank collapse
  • 35. c|c (TM) (TM) 35 calculation | consulting why deep learning works Power Law Universality: BERT The pretrained BERT model is not optimal, displays rank collapse
  • 36. c|c (TM) (TM) 36 calculation | consulting why deep learning works Predicting Test / Generalization Accuracies: Universality and Complexity Metrics
  • 37. c|c (TM) (TM) 37 calculation | consulting why deep learning works Universality: Capacity metrics Universality suggests the power law exponent would make a good, universal DNN capacity metric. Imagine a weighted average where the weights b ~ |W|, the scale of the matrix. An unsupervised, VC-like, data-dependent complexity metric for predicting trends in average-case generalization accuracy in DNNs
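A sketch of that weighted average, using hypothetical per-layer exponents and taking log(lambda_max) as the weights b — one plausible reading of "b ~ the scale of the matrix", assumed here rather than taken from the slides:

```python
import numpy as np

# Hypothetical per-layer power-law exponents and largest eigenvalues
# (in practice these come from fitting each layer's ESD, e.g. with the
#  `powerlaw` package as in the later slide)
layer_alphas = np.array([2.5, 3.1, 4.0, 3.4])
layer_lambda_max = np.array([12.0, 9.5, 20.0, 15.0])

# Weight each exponent by the log of the layer's largest eigenvalue,
# so larger-scale matrices contribute more to the capacity metric
b = np.log10(layer_lambda_max)
weighted_alpha = np.sum(b * layer_alphas) / np.sum(b)
print(weighted_alpha)
```

By construction the weighted average stays between the smallest and largest per-layer exponents; smaller values indicate heavier tails and, per the slides, better implicit regularization.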
  • 38. c|c (TM) (TM) 38 calculation | consulting why deep learning works DNN Capacity metrics: Product norms The product norm is a data-dependent,VC-like capacity metric for DNNs
  • 39. c|c (TM) (TM) 39 calculation | consulting why deep learning works Predicting test accuracies: Product norms We can predict trends in the test accuracy without peeking at the test data !
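A minimal sketch of the product-norm capacity metric from these two slides, with random stand-in layer matrices in place of trained weights; the product of Frobenius norms is accumulated in log space for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the per-layer weight matrices of a small network
# (in practice: model.layers[i].get_weights()[0], as in the earlier slide)
layers = [rng.standard_normal((256, 128)),
          rng.standard_normal((128, 64)),
          rng.standard_normal((64, 10))]

# Product norm: prod_l ||W_l||_F, computed as a sum of logs
log_product_norm = sum(np.log10(np.linalg.norm(W)) for W in layers)
print(log_product_norm)
```

Comparing this single number across trained models of the same architecture family is how the slides predict trends in test accuracy without peeking at the test data.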
  • 40. c|c (TM) (TM) 40 calculation | consulting why deep learning works Universality: Capacity metrics We can do even better using the weighted average alpha. But first, we need a relation between the Frobenius norm and the Power Law, and to solve: what are the weights b ?
  • 41. c|c (TM) (TM) 41 calculation | consulting why deep learning works Heavy Tailed matrices: norm-powerlaw relations form Correlation matrix, select Normalization create a random Heavy Tailed (Pareto) matrix Frobenius Norm-Power Law relation depends on the Normalization
  • 42. c|c (TM) (TM) 42 calculation | consulting why deep learning works Heavy Tailed matrices: norm-powerlaw relations Frobenius Norm-Power Law relation depends on the Normalization compute ‘eigenvalues’ of , fit to power Law examine the norm-powerlaw relation:
  • 43. c|c (TM) (TM) 43 calculation | consulting why deep learning works Heavy Tailed matrices: norm-powerlaw relations import numpy as np import powerlaw … N,M = … mu = … W = np.random.pareto(a=mu,size=(N,M)) X = np.dot(W.T, W)/N evals = np.linalg.eigvals(X) alpha = powerlaw.Fit(evals).power_law.alpha logNorm = np.log10(np.linalg.norm(W)) logMaxEig = np.log10(np.max(evals)) ratio = 2*logNorm/logMaxEig
  • 44. c|c (TM) (TM) 44 calculation | consulting why deep learning works Scale-free Normalization the Frobenius Norm is dominated by a single scale-free eigenvalue Heavy Tailed matrices: norm-powerlaw relations Relation is weakly scale dependent
  • 45. c|c (TM) (TM) 45 calculation | consulting why deep learning works large Pareto matrices have a simple, limiting norm-power-law relation Heavy Tailed matrices: norm-power law relations Standard Normalization Relation is only linear for very Heavy Tailed matrices
  • 46. c|c (TM) (TM) 46 calculation | consulting why deep learning works Finite size Pareto matrices have a universal, linear norm-power-law relation Heavy Tailed matrices: norm-power law relations Relation is nearly linear for very Heavy and Fat Tailed finite size matrices
  • 47. c|c (TM) (TM) 47 calculation | consulting why deep learning works Predicting test accuracies: weighted alpha Here we treat both Linear layers and Conv2D feature maps
  • 48. c|c (TM) (TM) 48 calculation | consulting why deep learning works Predicting test accuracies: weighted alpha Associate the log product norm with the weighted alpha metric
  • 49. c|c (TM) (TM) 49 calculation | consulting why deep learning works Predicting test accuracies: VGG Series The weighted alpha capacity metric predicts trends in test accuracy
  • 50. c|c (TM) (TM) 50 calculation | consulting why deep learning works Predicting test accuracies: ResNet Series
  • 51. c|c (TM) (TM) 51 calculation | consulting why deep learning works Open source tool: weightwatcher pip install weightwatcher … import weightwatcher as ww watcher = ww.WeightWatcher(model=model) results = watcher.analyze() watcher.get_summary() watcher.print_results() All results can be reproduced using the python weightwatcher tool python tool to analyze Fat Tails in Deep Neural Networks https://github.com/CalculatedContent/WeightWatcher
  • 52. c|c (TM) (TM) 52 calculation | consulting why deep learning works Implicit Self-Regularization: 5+1 Phases of Training
  • 53. c|c (TM) (TM) 53 calculation | consulting why deep learning works Self-Regularization: 5+1 Phases of Training
  • 54. c|c (TM) (TM) 54 calculation | consulting why deep learning works Self-Regularization: 5+1 Phases of Training
  • 55. c|c (TM) (TM) 55 calculation | consulting why deep learning works Self-Regularization: Batch size experiments Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC We can induce heavy tails in small models by decreasing the batch size Mini-AlexNet: retrained to exploit Generalization Gap
  • 56. c|c (TM) (TM) 56 calculation | consulting why deep learning works Large batch sizes => decrease generalization accuracy Self-Regularization: Batch size experiments
  • 57. c|c (TM) (TM) 57 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Random-Like, Bleeding-out Decreasing the batch size induces strong correlations in W
  • 58. c|c (TM) (TM) 58 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Gaussian random matrix Bulk+ Spikes Heavy Tailed Large batch sizes Small batch sizes
  • 59. c|c (TM) (TM) 59 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Bulk+Spikes Bulk+Spikes Bulk-decay
  • 60. c|c (TM) (TM) 60 calculation | consulting why deep learning works Batch Size Tuning: Generalization Gap Decreasing the batch size induces strong correlations in W Bulk-decay Bulk-decay Heavy-tailed
  • 61. c|c (TM) (TM) 61 calculation | consulting why deep learning works Summary
 •reviewed Random Matrix Theory (RMT) •small NNs ~ Tikhonov regularization •modern DNNs are Heavy-Tailed •Universality of Power Law exponent alpha •can predict generalization accuracies using new capacity metric, weighted average alpha •can induce heavy tails by exploiting the Generalization Gap phenomena, decreasing batch size
  • 62. c|c (TM) (TM) 62 calculation | consulting why deep learning works Information bottleneck Entropy collapse local minima k=1 saddle points floor / ground state k = 2 saddle points Information / Entropy Energy Landscape: and Information flow Is this right ? Based on a Gaussian Spin Glass model
  • 63. c|c (TM) (TM) 63 calculation | consulting why deep learning works Implications: RMT and Deep Learning How can we characterize the Energy Landscape ? tradeoff between Energy and Entropy minimization Where are the local minima ? How is the Hessian behaved ? Are simpler models misleading ? Can we design better learning strategies ?
  • 64. c|c (TM) Energy Funnels: Minimizing Frustration
 (TM) 64 calculation | consulting why deep learning works http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf Energy Landscape Theory for polymer / protein folding
  • 65. c|c (TM) the Spin Glass of Minimal Frustration 
 (TM) 65 calculation | consulting why deep learning works Conjectured 2015 on my blog (15 min fame on Hacker News) https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/ Bulk+Spikes, flipped low lying Energy state in Spin Glass ~ spikes in RMT
  • 66. c|c (TM) RMT w/Heavy Tails: Energy Landscape ? (TM) 66 calculation | consulting why deep learning works Compare to LeCun’s Spin Glass model (2015) Spin Glass with Heavy Tails ? Local minima do not concentrate near the ground state (Cizeau P and Bouchaud J-P 1993) Landscape is more funneled, no ‘problems’ with local minima ?