Cc stat phys draft

calculation | consulting
This is an early draft of some notes
on the relationship between
statistical physics and deep learning
(TM)
c|c
(TM)
charles@calculationconsulting.com

calculation | consulting stat phys of deep learning
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry
Over 10 years experience in applied Machine Learning
Developed ML algos for Demand Media, the first $1B IPO since Google
Tech: Aardvark (now Google), eHow, GoDaddy, …
Wall Street: BlackRock
Fortune 500: Big Pharma, Telecom, eBay
www.calculationconsulting.com

Data Scientists are Different
theoretical physics
machine learning specialist
experimental physics
data scientist
engineer
software, browser tech, dev ops, …
not all techies are the same

Statistical Physics of Information Theory
not my ideas just a summary
the book : Merhav (2009)
http://webee.technion.ac.il/people/merhav/papers/p138f.pdf
"If I have seen further than others,
it is by standing on the shoulders of
giants" (Isaac Newton)
notes from the web &

Energies: unnormalized probabilities 
in stat phys and ML, energies
give unnormalized probabilities
$x_j = e^{-\beta E_j}$, i.e. $\beta E_j = -\ln x_j$
in ML, $\beta$ is an (optional) scale / smoothing parameter
in stat phys, $\beta = 1/T$ is the inverse Temperature

Energy normalization: Partition Function (Z) 
the normalization factor Z is $Z = \sum_j e^{-\beta E_j}$
to get probabilities, we do a soft-max transform, $p_j = e^{-\beta E_j} / Z$
but we also include the inverse Temperature $\beta$
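
A minimal sketch of the temperature-scaled soft-max described on this slide, assuming NumPy. The function name `boltzmann_softmax` and the example energies are illustrative choices, not taken from the slides.

```python
import numpy as np

def boltzmann_softmax(energies, beta=1.0):
    """Return p_j = exp(-beta * E_j) / Z for a 1-D array of energies."""
    energies = np.asarray(energies, dtype=float)
    # Shift by the minimum energy for numerical stability; the shift cancels in the ratio.
    weights = np.exp(-beta * (energies - energies.min()))
    Z = weights.sum()  # partition function of the shifted energies
    return weights / Z

energies = np.array([0.0, 1.0, 2.0])               # hypothetical energies
p_cold = boltzmann_softmax(energies, beta=10.0)    # low temperature: sharply peaked
p_hot = boltzmann_softmax(energies, beta=0.1)      # high temperature: nearly uniform
```

Large beta (low T) concentrates the probability on the lowest energy; small beta (high T) smooths the distribution toward uniform.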

Old School Nets: from Z to sigmoid activations 
modern nets are layers of nodes and activation functions
What happened to E and Z?
They are easy to recover in simple cases…

consider 1 layer of an RBM

let's compute p(h|x) directly from the energy function
we expect the conditional probabilities to factor
and to have sigmoid activations

http://www.youtube.com/watch?v=lekCh_i32iE&t=18m31s

we find that the conditional probabilities do factor
and we can recover the local sigmoid activations
but we don’t include Temperature…although old models did
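
The factorization claim can be checked by brute force on a tiny RBM. This is a sketch under an assumption: the slides do not show the energy function, so the usual binary RBM form E(x, h) = -b.x - c.h - h^T W x is used here, with random hypothetical weights and T = 1.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def energy(x, h, W, b, c):
    # Assumed binary RBM energy: E(x, h) = -b.x - c.h - h^T W x
    return -(b @ x) - (c @ h) - (h @ W @ x)

rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3
W = rng.normal(size=(n_hidden, n_visible))   # hypothetical weights
b = rng.normal(size=n_visible)               # hypothetical visible biases
c = rng.normal(size=n_hidden)                # hypothetical hidden biases
x = np.array([1.0, 0.0, 1.0, 1.0])

# Brute force: p(h|x) = exp(-E(x, h)) / sum over h' of exp(-E(x, h'))
hidden_states = [np.array(h, dtype=float)
                 for h in itertools.product([0, 1], repeat=n_hidden)]
weights = np.array([np.exp(-energy(x, h, W, b, c)) for h in hidden_states])
p_h_given_x = weights / weights.sum()

# Marginals p(h_j = 1 | x) read off the brute-force table
p_on_brute = np.array([
    sum(p for p, h in zip(p_h_given_x, hidden_states) if h[j] == 1)
    for j in range(n_hidden)
])

# Local sigmoid activations, and the product of per-unit factors
p_on_sigmoid = sigmoid(c + W @ x)
p_factored = np.array([
    np.prod(np.where(h == 1, p_on_sigmoid, 1.0 - p_on_sigmoid))
    for h in hidden_states
])
```

Under this energy, `p_on_brute` matches `p_on_sigmoid` and `p_factored` matches `p_h_given_x`, which is the factorization into local sigmoids.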

Scaled Energies: w/ Temperature 
we do see T in some simple reinforcement learning methods

Scaled Energies: Temperature smoothing 
and T arises as a smoothing parameter in Dark Knowledge

Scaled Energies: Max Norm Regularization 
http://www.deeplearningbook.org/slides/dls_2016.pdf
We frequently have to rescale the weights in the deep net
I simply observe that this is, effectively, energy rescaling
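
A minimal sketch of a max-norm constraint, assuming the common form in which each unit's incoming weight vector is rescaled whenever its L2 norm exceeds a cap c. The row-per-unit layout, the helper name `max_norm`, and the value of c are assumptions for illustration.

```python
import numpy as np

def max_norm(W, c=3.0):
    """Rescale each row of W so that its L2 norm is at most c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Rows already inside the ball get scale 1; longer rows are shrunk onto it.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],    # norm 5.0, will be shrunk
              [0.3, 0.4]])   # norm 0.5, left unchanged
W_clipped = max_norm(W, c=1.0)
```

Shrinking a weight vector shrinks the pre-activations it produces by the same factor, which is the sense in which this acts like an energy rescaling.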

Scaled Energies: Batch Norm Regularization 
a recent idea out of Google
(figure: layer inputs Z ~ E (energy) are normalized to mean = 0, variance = 1 before the ReLU)
local layer energies must be rescaled explicitly on each batch step
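
A minimal sketch of the per-batch rescaling, assuming NumPy and omitting the learnable scale and shift that full batch normalization also carries. The batch shape and the helper name `batch_norm` are illustrative.

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """Normalize each feature of a batch to mean 0 and variance 1."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=3.0, size=(64, 8))   # hypothetical pre-activations
z_hat = batch_norm(z)            # rescaled on this batch
a = np.maximum(z_hat, 0.0)       # ReLU applied after the rescaling
```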

Recap: energies and temperatures 
Neural Networks define energies at each layer
Sigmoid activations result from normalization and factorization
Local energies / weights must be rescaled carefully
Lots of hacks to get good convergence
Let's turn to some stat mech / stats to see how T arises

Boltzmann Distribution: classic argument (Hill) 
https://charlesmartin14.wordpress.com/2013/11/14/metric-learning-some-quantum-statistical-mechanics/
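given many discrete states, the distribution is the multinomial $W(n_1, n_2, \ldots) = N! / \prod_j n_j!$, where $n_j$ counts the systems in state $j$
given the constraints (constant N, E): $\sum_j n_j = N$, $\sum_j n_j E_j = E$
what is the most probable distribution?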

Boltzmann Distribution: the most likely distribution ? 
we expect the most likely distribution of states
and the most likely energy distribution
to both be highly peaked
i.e. to concentrate to the means very fast

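Boltzmann Distribution: Lagrange multiplier problem 
the distribution is so peaked that we can optimize its log, $\ln W$, instead:
maximize $\ln W$ (i.e. minimize $-\ln W$) s.t. $\sum_j n_j = N$, $\sum_j n_j E_j = E$
giving the Lagrangian $\ln W - \alpha (\sum_j n_j - N) - \beta (\sum_j n_j E_j - E)$
where $\alpha$ and $\beta$ are Lagrange multipliers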

Boltzmann Distribution: Stirling’s Approximation 
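see Art of Computer Programming by Knuth
we apply Stirling's asymptotic expansion, $\ln n! \approx n \ln n - n + \tfrac{1}{2} \ln (2 \pi n)$,
to the terms in the multinomial distribution
when taking the large-$N$ limit, note that the $\tfrac{1}{2} \ln (2 \pi n)$ term vanishes relative to the leading terms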

Boltzmann Distribution: Lagrange multiplier problem 
after applying Stirling's approximation and taking partials w.r.t. each $n_j$,
we get a stationarity condition for every state,
giving the mean number of events $n_j \propto e^{-\beta E_j}$
this leads to the final most likely distribution …
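
The steps above can be written out as a short worked derivation. This is a sketch of the standard textbook route, not necessarily the exact equations on the slides; the symbols $W$, $n_j$, $\alpha$, $\beta$ and $Z$ are assumed notation.

```latex
\begin{align*}
\ln W &= \ln N! - \sum_j \ln n_j!
       \;\approx\; N \ln N - \sum_j n_j \ln n_j
       && \text{(Stirling, using } \textstyle\sum_j n_j = N \text{)} \\
\mathcal{L} &= \ln W - \alpha \Big( \sum_j n_j - N \Big)
               - \beta \Big( \sum_j n_j E_j - E \Big) \\
\frac{\partial \mathcal{L}}{\partial n_j} &= -\ln n_j - 1 - \alpha - \beta E_j = 0
       && \text{($N$ held fixed)} \\
n_j &= e^{-(1+\alpha)} \, e^{-\beta E_j} \\
\sum_j n_j = N \;&\Rightarrow\; e^{-(1+\alpha)} = \frac{N}{Z},
       \qquad Z = \sum_j e^{-\beta E_j} \\
p_j &= \frac{n_j}{N} = \frac{e^{-\beta E_j}}{Z}
\end{align*}
```

The multiplier $\alpha$ is fixed by normalization and drops out of the final answer; $\beta$ is fixed by the energy constraint and survives as the inverse temperature.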

Boltzmann Distribution: and Partition Function
optimal probability: $p_j = e^{-\beta E_j} / Z$
average energy: $\langle E \rangle = \sum_j p_j E_j$
partition function: $Z = \sum_j e^{-\beta E_j}$
central result of Gibbs statistical mechanics

Partition Function: a generating function
we get all sorts of useful stuff out of it
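
Two of the quantities that come out of $\ln Z$ are the mean energy, $-\partial \ln Z / \partial \beta$, and the energy variance, $\partial^2 \ln Z / \partial \beta^2$. A minimal numerical check, assuming NumPy, a hypothetical set of energy levels, and finite differences in place of analytic derivatives:

```python
import numpy as np

energies = np.array([0.0, 0.5, 1.3, 2.0])   # hypothetical energy levels

def log_Z(beta):
    return np.log(np.sum(np.exp(-beta * energies)))

def boltzmann(beta):
    w = np.exp(-beta * energies)
    return w / w.sum()

beta, d = 1.2, 1e-3
p = boltzmann(beta)
mean_E = np.sum(p * energies)                    # direct average
var_E = np.sum(p * energies**2) - mean_E**2      # direct variance

# Central finite differences of ln Z with respect to beta
mean_E_from_Z = -(log_Z(beta + d) - log_Z(beta - d)) / (2 * d)
var_E_from_Z = (log_Z(beta + d) - 2 * log_Z(beta) + log_Z(beta - d)) / d**2
```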

Ground State Energy: the low Temp limit 
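
A small numerical illustration of the low-temperature limit, assuming the free energy $F = -\frac{1}{\beta} \ln Z$ and a hypothetical list of energy levels: as $\beta \to \infty$, $F$ approaches the lowest energy.

```python
import numpy as np

energies = np.array([-1.5, -0.2, 0.3, 2.0])   # hypothetical levels; ground state is -1.5

def free_energy(beta):
    # F = -(1/beta) ln Z, with ln Z computed stably (log-sum-exp)
    a = -beta * energies
    log_Z = a.max() + np.log(np.sum(np.exp(a - a.max())))
    return -log_Z / beta

F_values = {beta: free_energy(beta) for beta in (1.0, 10.0, 100.0)}
ground_state = energies.min()
```

`F_values` increases toward `ground_state` as beta grows, since the sum in Z becomes dominated by its largest term.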

Statistical Physics: an ML viewpoint 
we can derive and describe these results
using language familiar to the ML community
• max entropy principle
• KL divergence
• Chernoff bounds
• sums of random numbers
• concentration to the mean
• extreme value statistics
some results may be familiar; others surprising

Canonical Ensemble: from states to energies 
microcanonical: maximum entropy
canonical: minimum free energy at constant T
Boltzmann-Gibbs distribution minimizes the free energy

Canonical Ensemble: from states to energies 
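sum over states: $Z = \sum_x e^{-\beta E(x)}$
sum over energy levels: $Z = \sum_E \Omega(E) e^{-\beta E}$
many states ($x$) can have the same energy level E
we count them w/ density of states $\Omega(E)$
entropy $S(E) = \ln \Omega(E)$
free energy $F = -\frac{1}{\beta} \ln Z$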
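
A toy check that the two sums agree, assuming a hypothetical system of N binary units whose energy is simply the number of units that are on, so that the density of states is a binomial coefficient.

```python
import itertools
import math

N, beta = 10, 0.7

# Sum over all 2^N states; a state's energy is its number of 1s (a toy choice)
Z_states = sum(math.exp(-beta * sum(s))
               for s in itertools.product([0, 1], repeat=N))

# Sum over energy levels, weighting each level by its density of states Omega(E)
omega = {E: math.comb(N, E) for E in range(N + 1)}
Z_levels = sum(omega[E] * math.exp(-beta * E) for E in omega)

entropy = {E: math.log(omega[E]) for E in omega}   # S(E) = ln Omega(E)
free_energy = -math.log(Z_levels) / beta           # F = -(1/beta) ln Z
```

Grouping by energy level replaces 2^N terms with N + 1 terms, at the price of knowing Omega(E).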

Free Energy: back to probabilities 

Free Energy: KL Divergence 
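
One way to see the connection, sketched under the assumption that the free energy of a trial distribution q is defined as $F[q] = \langle E \rangle_q - \frac{1}{\beta} S(q)$: the gap between $F[q]$ and the equilibrium free energy equals $\frac{1}{\beta} \mathrm{KL}(q \,\|\, p)$, where p is the Boltzmann distribution. The energies and q below are hypothetical.

```python
import numpy as np

energies = np.array([0.0, 1.0, 1.5, 3.0])   # hypothetical energy levels
beta = 0.8

w = np.exp(-beta * energies)
Z = w.sum()
p = w / Z                   # Boltzmann distribution
F = -np.log(Z) / beta       # equilibrium free energy

def variational_free_energy(q):
    # F[q] = <E>_q - (1/beta) * S(q), with S the Shannon entropy in nats
    return np.sum(q * energies) + np.sum(q * np.log(q)) / beta

def kl(q, p):
    return np.sum(q * np.log(q / p))

q = np.array([0.4, 0.3, 0.2, 0.1])          # an arbitrary trial distribution
gap = variational_free_energy(q) - F
```

Because the KL divergence is non-negative and zero only at q = p, the Boltzmann distribution is the minimizer of the free energy.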

Temperature: a Chernoff parameter 
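given i.i.d. vars $X_1, X_2, \ldots, X_N$ and a function $E(x)$
how fast does the probability of the event $\sum_i E(X_i) \le N E_0$ decay,
where $E_0 < \langle E(X) \rangle$?
apply the Chernoff bound
w/ an exponential bound on the indicator: $1\{\sum_i E(X_i) \le N E_0\} \le \exp\{\beta [N E_0 - \sum_i E(X_i)]\}$
minimize over $\beta \ge 0$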
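
A numerical sketch of the bound, assuming Bernoulli variables with E(x) = x so that the exact tail is available for comparison. Taking expectations of the exponential bound gives $\Pr \le \exp\{N [\beta E_0 + \ln Z(\beta)]\}$ with $Z(\beta) = \langle e^{-\beta E(X)} \rangle$; the grid search over beta stands in for an analytic minimization, and all parameter values are hypothetical.

```python
import math
import numpy as np

N, p, k_max = 100, 0.5, 30
E0 = k_max / N   # threshold per variable, below the mean p

def log_bound(beta):
    # ln of the Chernoff bound: N * [beta * E0 + ln Z(beta)], Z(beta) = E[exp(-beta * X)]
    return N * (beta * E0 + np.log(1.0 - p + p * np.exp(-beta)))

betas = np.linspace(0.0, 5.0, 5001)
values = log_bound(betas)
beta_star = betas[np.argmin(values)]   # the optimizing inverse temperature
chernoff = math.exp(values.min())

# Exact tail probability Pr{sum X_i <= k_max} for comparison
exact = sum(math.comb(N, k) for k in range(k_max + 1)) * p**N
```

For this example the minimizing beta is ln(7/3), and the exact tail sits below the bound, as it must.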

principle of minimum free energy
the minimizing $\beta$ is the equilibrium inverse temperature
see book for details & caveats
S is really a rate function,
as in large deviations theory

Free Energy: thermodynamic limit 
free energy density
these may differ: the order of the limits matters
annealed (w/ moments)

Free Energy: indicates Phase Transitions (PT) 
thermodynamic functions change abruptly with external changes
should be analytic
first order PT
second order PT
discontinuous

Random Energies: sum of exponentials of random numbers 
say we have $e^{NA}$ i.i.d. events,
each w/ probability $e^{-NB}$
what is the probability that at least one event occurs?

sums of exp(rand(x)): concentration result 
# successes = sum of i.i.d. binary random vars, w/ expectation $e^{N(A-B)}$
A < B: vanishes completely
A > B: concentrates to mean very fast

sums of exp(rand(x)): proof of concentrations
either at least 1 event or 0 events are seen, depending on whether A > B or A < B
$\ln(1 - x) = -x + \ldots$
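
A minimal sketch of the two regimes, assuming $e^{NA}$ i.i.d. events each with probability $e^{-NB}$, so that the probability of at least one event is $1 - (1 - e^{-NB})^{e^{NA}}$. The helper name and the values of N, A, B are illustrative.

```python
import math

def prob_at_least_one(N, A, B):
    """Pr{at least one of e^{NA} i.i.d. events occurs}, each with probability e^{-NB}."""
    n_events = math.exp(N * A)
    p_event = math.exp(-N * B)
    # 1 - (1 - p)^n computed stably as -expm1(n * log1p(-p))
    return -math.expm1(n_events * math.log1p(-p_event))

N = 50
p_rare = prob_at_least_one(N, A=0.2, B=0.5)   # A < B: essentially never
p_sure = prob_at_least_one(N, A=0.5, B=0.2)   # A > B: essentially always
```

For A < B the result is close to $e^{-N(B-A)}$, which is where the expansion of $\ln(1 - x)$ enters; for A > B it saturates at 1.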

Random Energy Model (REM): setup 
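
The setup itself did not survive on this slide, so the following is a sketch under the usual textbook assumptions for the REM: $2^N$ configurations with i.i.d. Gaussian energies of variance $N J^2 / 2$, for which the large-N limit of $\ln Z / N$ is $\ln 2 + \beta^2 J^2 / 4$ below $\beta_c = \frac{2}{J} \sqrt{\ln 2}$ and $\beta J \sqrt{\ln 2}$ above it. The function names, N, J and the seed are illustrative.

```python
import numpy as np

def rem_log_Z_per_spin(N, beta, J=1.0, seed=0):
    """Sample one REM instance and return ln Z / N."""
    rng = np.random.default_rng(seed)
    # Assumed setup: 2^N configurations with i.i.d. energies ~ Normal(0, N * J^2 / 2)
    energies = rng.normal(0.0, J * np.sqrt(N / 2.0), size=2**N)
    a = -beta * energies
    log_Z = a.max() + np.log(np.sum(np.exp(a - a.max())))   # stable log-sum-exp
    return log_Z / N

def rem_prediction(beta, J=1.0):
    """Large-N limit of ln Z / N, with the freezing transition at beta_c."""
    beta_c = 2.0 * np.sqrt(np.log(2.0)) / J
    if beta < beta_c:
        return np.log(2.0) + (beta * J) ** 2 / 4.0
    return beta * J * np.sqrt(np.log(2.0))

N = 16
simulated = rem_log_Z_per_spin(N, beta=0.5)   # well inside the high-temperature phase
predicted = rem_prediction(beta=0.5)
```

Agreement is expected to be close at small beta even for modest N; near and above beta_c, finite-size effects are much larger.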

Random Energy Model (REM): … 

Replica Method: an old trick to eval Z 
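expected value of ln Z, expressed in moments of Z: $\langle \ln Z \rangle = \lim_{m \to 0} \frac{1}{m} \ln \langle Z^m \rangle$
evaluate $\langle Z^m \rangle$ w/ integer m
analytic continuation to real m as m -> 0
bad branch cut? deal w/ later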
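
A tiny numerical check of the identity behind the trick, assuming a hypothetical discrete random Z so that every average is exact. It only illustrates the m -> 0 limit; it does not perform the integer-m calculation or the analytic continuation.

```python
import numpy as np

# Hypothetical discrete random Z: values and their probabilities
z_values = np.array([0.5, 2.0, 8.0])
z_probs = np.array([0.2, 0.5, 0.3])

mean_log_Z = np.sum(z_probs * np.log(z_values))   # <ln Z>, computed directly

def replica_estimate(m):
    # (1/m) * ln <Z^m>, which tends to <ln Z> as m -> 0
    return np.log(np.sum(z_probs * z_values**m)) / m

estimates = {m: replica_estimate(m) for m in (1.0, 0.1, 0.001)}
```

At m = 1 the estimate is ln of the mean of Z (the annealed value), which by Jensen's inequality is at least the mean of ln Z; the gap closes as m shrinks.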

Summary 
