calculation | consulting
why deep learning works:
self-regularization in deep neural networks
charles@calculationconsulting.com
UC Berkeley / NERSC 2018
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry
Over 15 years experience in applied Machine Learning and AI
ML algos for: Aardvark, acquired by Google (2010)
Demand Media (eHow); first $1B IPO since Google
Wall Street: BlackRock
Fortune 500: Roche, France Telecom
BigTech: eBay, Aardvark (Google), GoDaddy
Private Equity: Anthropocene Institute
www.calculationconsulting.com
charles@calculationconsulting.com
Motivations: towards a Theory of Deep Learning
Theoretical: deeper insight into Why Deep Learning Works ?
non-convex optimization ?
regularization ?
why is deep better ?
VC vs Stat Mech vs ?
…
Practical: useful insight to improve engineering DNNs
when is a network fully optimized ?
large batch sizes ?
better ensembles ?
…
Set up: the Energy Landscape
minimize the Loss: but how to avoid overtraining ?
Problem: How can this possibly work ?
highly non-convex ? apparently not
[figure: expected vs. observed loss landscapes]
it has long been suspected that local minima are not the issue
Problem: Local Minima ?
Duda, Hart and Stork, 2000
solution: add more capacity and regularize
Motivations: what is Regularization ?
every adjustable knob and switch is called regularization
https://arxiv.org/pdf/1710.10686.pdf
Dropout, Batch Size, Noisify Data, …
Problem: What is Regularization in DNNs ?
Understanding deep learning requires rethinking generalization (ICLR 2017 Best Paper)
Large models overfit on randomly labeled data
Regularization cannot prevent this
Moore-Penrose pseudoinverse (1955)
regularize (Phillips, 1962)
familiar optimization problem
Motivations: what is Regularization ?
Soften the rank of X, focus on the large eigenvalues
Ridge Regression / Tikhonov-Phillips Regularization
https://calculatedcontent.com/2012/09/28/kernels-greens-functions-and-resolvent-operators/
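A minimal numpy sketch of the idea, with hypothetical data X, y and a regularization parameter alpha of our choosing: the Tikhonov solution shrinks each eigendirection of X^T X by lam/(lam + alpha), passing the large eigenvalues through and suppressing the small ones.

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 20)    # hypothetical data matrix
y = np.random.randn(100)        # hypothetical targets
alpha = 1.0                     # Tikhonov / ridge parameter

# regularized solution: w = (X^T X + alpha I)^(-1) X^T y
w = np.linalg.solve(X.T @ X + alpha * np.eye(20), X.T @ y)

# shrinkage factor per eigendirection of X^T X
lam = np.linalg.eigvalsh(X.T @ X)
shrinkage = lam / (lam + alpha)  # ~1 for large lam, ~0 for small lam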
Motivations: how we study Regularization
turn off regularization, turn it back on systematically, study W_L
traditional regularization is applied to the W_L
the Energy Landscape is determined by the layer weights W_L
Energy Landscape: and Information flow
what happens to the layer weight matrices W_L ?
[diagram: Information bottleneck, Entropy collapse, local minima, k=1 and k=2 saddle points, floor / ground state, Information / Entropy axes]
Self-Regularization: Experiments
Retrained LeNet5 on MNIST using Keras
Two (2) other small models: 3-Layer MLP and a Mini AlexNet
And examine pre-trained models (AlexNet, Inception, …)
Conv2D MaxPool Conv2D MaxPool FC1 FC2 FC
Matrix Complexity: Entropy and Stable Rank
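The slide's two complexity metrics can be sketched in a few lines of numpy (one common convention; the log-normalization of the entropy is our assumption):

import numpy as np

def matrix_entropy(W):
    # spectral entropy of the normalized singular-value distribution
    sv = np.linalg.svd(W, compute_uv=False)
    p = sv**2 / np.sum(sv**2)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(len(sv))  # in [0, 1]

def stable_rank(W):
    # ||W||_F^2 / ||W||_2^2: a smooth, threshold-free surrogate for rank
    sv = np.linalg.svd(W, compute_uv=False)
    return np.sum(sv**2) / np.max(sv)**2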
Random Matrix Theory: detailed insight into W_L
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
import keras
import numpy as np
import matplotlib.pyplot as plt
…
W = model.layers[i].get_weights()[0]
…
X = np.dot(W.T, W)                       # correlation matrix X = W^T W
evals = np.linalg.eigvalsh(X)            # eigenvalues of the symmetric X
plt.hist(evals, bins=100, density=True)  # plot the ESD
Random Matrix Theory: detailed insight into W_L
Entropy decrease corresponds to the breakdown of random structure
and the onset of a new kind of self-regularization
Empirical Spectral Density (ESD): eigenvalues of X = W_L^T W_L
[figure panels: Random Matrix → Random + Spikes]
Random Matrix Theory: Marchenko-Pastur
the Empirical Spectral Density (ESD) converges to a deterministic function
with well-defined edges (which depend on Q, the aspect ratio)
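A sketch of the Marchenko-Pastur prediction against a sampled ESD (sigma and the shapes N, M are our choices; Q = N/M):

import numpy as np
import matplotlib.pyplot as plt

N, M, sigma = 1000, 250, 1.0
Q = N / M                                    # aspect ratio >= 1
W = np.random.normal(0, sigma, size=(N, M))
evals = np.linalg.eigvalsh(W.T @ W / N)      # sampled ESD

# deterministic MP density with edges lam_minus, lam_plus
lam_minus = sigma**2 * (1 - 1/np.sqrt(Q))**2
lam_plus = sigma**2 * (1 + 1/np.sqrt(Q))**2
lam = np.linspace(lam_minus, lam_plus, 500)
rho = Q * np.sqrt((lam_plus - lam) * (lam - lam_minus)) / (2 * np.pi * sigma**2 * lam)

plt.hist(evals, bins=60, density=True)
plt.plot(lam, rho)
plt.show()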
Random Matrix Theory: Marchenko-Pastur
plus Tracy-Widom fluctuations
very crisp edges, parameterized by Q
Experiments: just apply to pre-trained Models
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
Experiments: just apply to pre-trained Models
LeNet5 (1998)
AlexNet (2012)
InceptionV3 (2014)
ResNet (2015)
…
DenseNet201 (2018)
https://medium.com/@siddharthdas_32104/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
Conv2D MaxPool Conv2D MaxPool FC FC
RMT: LeNet5
Marchenko-Pastur Bulk + Spikes
Conv2D MaxPool Conv2D MaxPool FC FC
softrank = 10%
RMT: AlexNet
Marchenko-Pastur Bulk-decay | Heavy Tailed
[ESD figures: FC1 and FC2, each with a zoomed-in view]
Random Matrix Theory: InceptionV3
Marchenko-Pastur bulk decay, onset of Heavy Tails
layer W_226
Eigenvalue Analysis: Rank Collapse ?
Modern DNNs: soft rank collapses; they do not lose hard rank
λ_min = 0: (hard) rank collapse (Q > 1) signifies over-regularization
λ_min > 0: all smallest eigenvalues > 0, within the numerical (Recipes) threshold
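A sketch of this check, using the Numerical-Recipes-style tolerance that numpy.linalg.matrix_rank also uses (the function name is ours):

import numpy as np

def rank_report(W):
    sv = np.linalg.svd(W, compute_uv=False)
    # numerical threshold: eps * max(N, M) * sigma_max
    tol = np.finfo(W.dtype).eps * max(W.shape) * sv.max()
    hard_rank = int(np.sum(sv > tol))        # hard rank
    soft_rank = np.sum(sv**2) / sv.max()**2  # stable (soft) rank
    collapsed = hard_rank < min(W.shape)     # hard rank collapse?
    return hard_rank, soft_rank, collapsed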
RMT: 5+1 Phases of Training
Bulk+Spikes: Small Models
Rank-1 perturbation: a perturbative correction to the bulk
[figure: MP Bulk plus separated Spikes]
Smaller, older models can be described perturbatively w/ RMT
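A sketch of the rank-1 perturbation picture: an MP bulk plus one strong rank-1 correlation yields a separated spike (all scales here are our choices):

import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 250
W_rand = np.random.randn(N, M) / np.sqrt(N)  # pure random bulk
u = np.random.randn(N, 1) / np.sqrt(N)
v = np.random.randn(M, 1) / np.sqrt(M)
W = W_rand + 5.0 * u @ v.T                   # + rank-1 'signal'

evals = np.linalg.eigvalsh(W.T @ W)
plt.hist(evals, bins=100, density=True)      # MP bulk + one spike outlier
plt.show()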
Spikes: carry more information
Information begins to concentrate in the spikes
vector entropy S(v): spikes have less entropy, are more localized than the bulk
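A sketch of a discrete localization entropy S(v) for an eigenvector (one common definition; the normalization is our choice):

import numpy as np

def vector_entropy(v):
    # S(v): entropy of the squared components of an eigenvector;
    # ~1 for a delocalized (bulk) vector, smaller for a localized spike
    p = np.abs(v)**2
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(len(v))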
Bulk+Spikes: ~ Tikhonov regularization
Small models like LeNet5 exhibit traditional regularization
softer rank; eigenvalues above a simple scale threshold; the spikes carry most of the information
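A sketch of that single scale threshold: keep only eigenvalues above the MP bulk edge lambda_plus (the bulk variance sigma2 is assumed known here; in practice it must be fit):

import numpy as np

def spikes_above_bulk(W, sigma2=1.0):
    # Tikhonov-like picture: discard the MP bulk, keep the spikes
    N, M = W.shape
    Q = N / M
    evals = np.linalg.eigvalsh(W.T @ W / N)
    lam_plus = sigma2 * (1 + 1/np.sqrt(Q))**2  # MP bulk edge
    return evals[evals > lam_plus]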
Heavy Tailed: Self-Regularization
W strongly correlated / highly non-random
Can be modeled as if drawn from a heavy tailed distribution
Then RMT/MP ESD will also have heavy tails
Known results from RMT / polymer theory (Bouchaud, Potters, etc)
AlexNet
ResNet50
InceptionV3
DenseNet201
…
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
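A sketch of the heavy-tailed model: drawing the entries of W from a (signed) Pareto distribution instead of a Gaussian makes the ESD itself heavy tailed (the tail exponent mu is our choice):

import numpy as np
import matplotlib.pyplot as plt

N, M, mu = 1000, 250, 2.5
signs = np.random.choice([-1, 1], size=(N, M))
W = signs * np.random.pareto(mu, size=(N, M))  # heavy-tailed entries
evals = np.linalg.eigvalsh(W.T @ W / N)

plt.hist(np.log10(evals + 1e-12), bins=100, density=True)  # tail visible on log scale
plt.show()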
Heavy Tailed: Self-Regularization
Large, well trained, modern DNNs exhibit heavy tailed self-regularization
Salient ideas: what we ‘suspect’ today
No single scale threshold
No simple low-rank approximation for W_L
Contributions from correlations at all scales
Cannot be treated perturbatively
Self-Regularization: Batch size experiments
We can cause small models to exhibit strong correlations / heavy tails
By exploiting the Generalization Gap phenomenon:
large batch sizes => decreased generalization accuracy
Tuning the batch size from very large to very small
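A sketch of the experiment loop (Keras-style; build_model, x_train, and y_train are placeholders we assume exist):

import numpy as np

for batch_size in [1024, 512, 256, 64, 16, 4]:
    model = build_model()                  # assumed: returns a fresh small Keras model
    model.fit(x_train, y_train, epochs=20,
              batch_size=batch_size, verbose=0)
    W = model.layers[-2].get_weights()[0]  # weight matrix of an FC layer
    evals = np.linalg.eigvalsh(np.dot(W.T, W))
    # as batch size shrinks, the ESD moves through the 5+1 phases:
    # Random-Like -> Bleeding-out -> Bulk+Spikes -> Bulk-decay -> Heavy-Tailed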
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Random-Like | Random-Like | Bleeding-out
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk+Spikes | Bulk+Spikes | Bulk-decay
Batch Size Tuning: Generalization Gap
Decreasing the batch size induces strong correlations in W
Bulk-decay | Bulk-decay | Heavy-tailed
Summary
self-regularization ~ entropy / information decrease
modern DNNs have heavy-tailed self-regularization
5+1 phases of learning
applied Random Matrix Theory (RMT)
small models ~ Tikhonov regularization
Implications: RMT and Deep Learning
Implications: RMT and Deep Learning
How can RMT be used to understand the Energy Landscape ?
tradeoff between Energy and Entropy minimization
Where are the local minima ?
How does the Hessian behave ?
Are simpler models misleading ?
Can we design better learning strategies ?
Energy Funnels: Minimizing Frustration
http://www.nature.com/nsmb/journal/v4/n11/pdf/nsb1197-871.pdf
Energy Landscape Theory for polymer / protein folding
the Spin Glass of Minimal Frustration
Conjectured in 2015 on my blog (15 minutes of fame on Hacker News)
https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/
Bulk+Spikes, flipped
low-lying Energy state in a Spin Glass ~ spikes in RMT
RMT w/ Heavy Tails: Energy Landscape ?
Compare to LeCun’s Spin Glass model (2015)
Spin Glass w/ Heavy Tails ?
Local minima do not concentrate near the ground state (Cizeau P and Bouchaud J-P 1993)
the Landscape is more funneled, no 'problems' with local minima ?
c|c
charles@calculationconsulting.com