1. calculation | consulting
This is an early draft of some notes
on the relationship between
statistical physics and deep learning
charles@calculationconsulting.com
3. calculation | consulting stat phys of deep learning
Who Are We?
Dr. Charles H. Martin, PhD
University of Chicago, Chemical Physics
NSF Fellow in Theoretical Chemistry
Over 10 years experience in applied Machine Learning
Developed ML algos for Demand Media; the first $1B IPO since Google
Tech: Aardvark (now Google), eHow, GoDaddy, …
Wall Street: BlackRock
Fortune 500: Big Pharma, Telecom, eBay
www.calculationconsulting.com
charles@calculationconsulting.com
4. Data Scientists are Different
theoretical physics
machine learning specialist
experimental physics
data scientist
engineer
software, browser tech, dev ops, …
not all techies are the same
5. Statistical Physics of Information Theory
not my ideas, just a summary: notes from the web & the book, Merhav (2009)
http://webee.technion.ac.il/people/merhav/papers/p138f.pdf
"If I have seen further than others, it is by standing on the shoulders of giants" (Isaac Newton)
7. Energies: unnormalized probabilities
in stat phys and ML, energies give unnormalized probabilities
$x_j = e^{-\beta E_j}$, i.e. $\beta E_j = -\ln x_j$
in ML, $\beta$ is an (optional) scale / smoothing parameter
in stat phys, $\beta$ is the inverse temperature
8. Energy normalization: Partition Function (Z)
the normalization factor Z is
$Z = \sum_j e^{-\beta E_j}$
to get probabilities, we do a soft-max transform, but we also include the inverse temperature $\beta$
$p_j = e^{-\beta E_j} / Z$
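A minimal sketch of the soft-max with an inverse temperature, assuming the energies are given as a plain list of floats. The function name `boltzmann_probabilities` and the example energies are illustrative only.

```python
import math

def boltzmann_probabilities(energies, beta=1.0):
    """Soft-max of the scaled negative energies: p_j = exp(-beta * E_j) / Z."""
    # Shift by the minimum energy for numerical stability; the shift cancels in p_j.
    e_min = min(energies)
    weights = [math.exp(-beta * (e - e_min)) for e in energies]  # unnormalized probabilities
    z = sum(weights)  # partition function, up to the constant shift factor
    return [w / z for w in weights]

energies = [0.0, 1.0, 2.0]                                 # illustrative values
probs_cold = boltzmann_probabilities(energies, beta=5.0)   # low temperature: sharply peaked
probs_hot = boltzmann_probabilities(energies, beta=0.1)    # high temperature: nearly uniform
```

Large beta (low T) puts nearly all the mass on the lowest energy; small beta (high T) flattens the distribution.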
9. Old School Nets: from Z to sigmoid activations
modern nets are layers of nodes and activation functions
What happened to E and Z ?
They are easy to recover in simple cases…
10. Old School Nets: from Z to sigmoid activations
consider 1 layer of an RBM
11. Old School Nets: from Z to sigmoid activations
let's compute p(h|x) directly from the energy function
we expect the conditional probabilities to factor
and to have sigmoid activations
12. Old School Nets: from Z to sigmoid activations
http://www.youtube.com/watch?v=lekCh_i32iE&t=18m31s
13. Old School Nets: from Z to sigmoid activations
http://www.youtube.com/watch?v=lekCh_i32iE&t=18m31s
we find that the conditional probabilities do factor
and we can recover the local sigmoid activations
but we don’t include Temperature…although old models did
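A small numerical check of the factorization, as a sketch only. It assumes the standard binary RBM energy E(x,h) = -b.x - c.h - sum_jk h_j W_jk x_k (this form is not written out in these notes) and compares a brute-force p(h|x) against the product of sigmoids. All weights and biases are made-up values.

```python
import itertools
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def energy(x, h, W, b, c):
    """Assumed RBM energy: E(x,h) = -b.x - c.h - sum_jk h_j W[j][k] x_k."""
    e = -sum(bk * xk for bk, xk in zip(b, x))
    e -= sum(cj * hj for cj, hj in zip(c, h))
    e -= sum(h[j] * W[j][k] * x[k] for j in range(len(h)) for k in range(len(x)))
    return e

def p_h_given_x_bruteforce(x, W, b, c):
    """p(h|x) = exp(-E(x,h)) / sum_h' exp(-E(x,h')), enumerating every hidden state."""
    states = list(itertools.product([0, 1], repeat=len(c)))
    weights = [math.exp(-energy(x, h, W, b, c)) for h in states]
    z_x = sum(weights)  # partition function restricted to the given x
    return {h: w / z_x for h, w in zip(states, weights)}

def p_h_given_x_factored(x, h, W, c):
    """Product of independent sigmoid units, one per hidden node."""
    p = 1.0
    for j, hj in enumerate(h):
        on = sigmoid(c[j] + sum(W[j][k] * x[k] for k in range(len(x))))
        p *= on if hj == 1 else 1.0 - on
    return p

W = [[0.5, -1.0, 0.2], [1.5, 0.3, -0.7]]  # 2 hidden x 3 visible, illustrative
b = [0.1, -0.2, 0.3]                      # visible biases
c = [-0.4, 0.6]                           # hidden biases
x = [1, 0, 1]
exact = p_h_given_x_bruteforce(x, W, b, c)
max_gap = max(abs(p - p_h_given_x_factored(x, h, W, c)) for h, p in exact.items())
```

The visible-bias term cancels between numerator and denominator, so only c and W appear in the sigmoid arguments.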
14. Scaled Energies: w/ Temperature
we do see T in some simple reinforcement learning methods
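One common place T shows up is Boltzmann (soft-max) exploration. The sketch below assumes action values Q play the role of negative energies; the function name and the Q values are hypothetical.

```python
import math
import random

def boltzmann_action(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(Q / T)."""
    q_max = max(q_values)  # shift for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights, k=1)[0]

rng = random.Random(0)
q_values = [1.0, 2.0, 0.5]  # illustrative action values; action 1 is the greedy choice
cold = [boltzmann_action(q_values, 0.05, rng) for _ in range(1000)]
hot = [boltzmann_action(q_values, 50.0, rng) for _ in range(1000)]
cold_greedy_fraction = cold.count(1) / len(cold)  # low T: almost always greedy
hot_greedy_fraction = hot.count(1) / len(hot)     # high T: close to uniform
```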
16. Scaled Energies: Max Norm Regularization
http://www.deeplearningbook.org/slides/dls_2016.pdf
We frequently have to rescale the weights in the deep net
I simply observe that this is, effectively, energy rescaling
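A sketch of one usual form of max-norm rescaling, assuming each row holds the incoming weights of one unit and the cap is on its L2 norm. The function name and numbers are illustrative.

```python
import math

def max_norm_rescale(weights, max_norm):
    """Rescale each unit's incoming weight vector so its L2 norm is at most max_norm."""
    rescaled = []
    for w in weights:
        norm = math.sqrt(sum(v * v for v in w))
        # Only shrink vectors that exceed the cap; leave the others unchanged.
        scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
        rescaled.append([v * scale for v in w])
    return rescaled

weights = [[3.0, 4.0], [0.3, 0.4]]  # norms 5.0 and 0.5
clipped = max_norm_rescale(weights, max_norm=1.0)
norms = [math.sqrt(sum(v * v for v in w)) for w in clipped]
```

Since the local energy is linear in the weights, shrinking a weight vector by a factor s scales that unit's energy by s, which is the sense in which this looks like a change of temperature.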
17. Scaled Energies: Batch Norm Regularization
more recent ideas out of Google
http://www.deeplearningbook.org/slides/dls_2016.pdf
[figure labels: ReLU; mean = 0, variance = 1; Z ~ E (energy)]
local layer energies must be rescaled explicitly on each batch step
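A sketch of the per-batch rescaling for a single unit, assuming the plain normalization step only (the learnable scale and shift of full batch norm are omitted). Names and values are illustrative.

```python
import math

def batch_normalize(values, eps=1e-5):
    """Shift and scale one unit's values over a batch to mean 0, variance ~1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(v):
    return max(0.0, v)

batch = [2.0, 4.0, 6.0, 8.0]               # one unit's local energies over a batch
normalized = batch_normalize(batch)
activated = [relu(v) for v in normalized]  # ReLU applied after the rescaling
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
```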
19. Recap: energies and temperatures
http://www.deeplearningbook.org/slides/dls_2016.pdf
Neural Networks define energies at each layer
Sigmoid activations result from normalization and factorization
Local energies / weights must be rescaled carefully
Lots of hacks to get good convergence
Let's turn to some stat mech / stats to see how T arises
20. Boltzmann Distribution: classic argument (Hill)
https://charlesmartin14.wordpress.com/2013/11/14/metric-learning-some-quantum-statistical-mechanics/
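given many discrete states j with occupation numbers $n_j$, the distribution is multinomial
$\Omega(n) = \frac{N!}{\prod_j n_j!}$
given the constraints (constant N, E)
$\sum_j n_j = N, \quad \sum_j n_j E_j = E$
what is the most probable distribution?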
21. Boltzmann Distribution: the most likely distribution ?
https://charlesmartin14.wordpress.com/2013/11/14/metric-learning-some-quantum-statistical-mechanics/
we expect the most likely distribution of states
and the most likely energy distribution
to both be highly peaked,
i.e. to concentrate to the means very fast
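22. Boltzmann Distribution: Lagrange multiplier problem
https://charlesmartin14.wordpress.com/2013/11/14/metric-learning-some-quantum-statistical-mechanics/
so peaked that we can minimize the negative log of the distribution $\Omega(n)$ over the occupation numbers $n_j$
$\min_{n} \; -\ln \Omega(n) \quad \text{s.t.} \quad \sum_j n_j = N, \;\; \sum_j n_j E_j = E$
giving
$\frac{\partial}{\partial n_j} \Big[ -\ln \Omega(n) + \alpha \sum_k n_k + \beta \sum_k n_k E_k \Big] = 0$
where $\alpha$, $\beta$ are Lagrange multipliers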
23. Boltzmann Distribution: Stirling’s Approximation
see Art of Computer Programming by Knuth
we apply an asymptotic expansion, $\ln n! = n \ln n - n + O(\ln n)$,
to the terms in the multinomial distribution
when taking ; note that term vanishes
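A quick numerical look at how good the leading Stirling terms are, as a sketch. It compares n ln n - n against the exact ln n! from `math.lgamma`; the choice of n values is arbitrary.

```python
import math

def ln_factorial_exact(n):
    return math.lgamma(n + 1)  # ln(n!) via the log-gamma function

def ln_factorial_stirling(n):
    return n * math.log(n) - n  # leading terms only

relative_errors = {
    n: abs(ln_factorial_exact(n) - ln_factorial_stirling(n)) / ln_factorial_exact(n)
    for n in (10, 100, 1000, 10000)
}
```

The relative error shrinks steadily with n, which is why the approximation is safe when the occupation numbers are large.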
24. Boltzmann Distribution: Lagrange multiplier problem
https://charlesmartin14.wordpress.com/2013/11/14/metric-learning-some-quantum-statistical-mechanics/
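after applying Stirling’s approximation, and taking partials, we get ($\alpha$, $\beta$: Lagrange multipliers)
$\ln n_j + \alpha + \beta E_j = 0$
giving the mean number of events
$n_j = e^{-\alpha} e^{-\beta E_j}$
this leads to the final most likely distribution …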
25. Boltzmann Distribution: and Partition Function
https://charlesmartin14.wordpress.com/2013/11/14/metric-learning-some-quantum-statistical-mechanics/
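optimal probability: $p_j = n_j / N = e^{-\beta E_j} / Z$
partition function: $Z = \sum_j e^{-\beta E_j}$
average energy: $\langle E \rangle = \sum_j p_j E_j$
central result of Gibbs statistical mechanics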
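A brute-force sanity check of the argument on a toy system, as a sketch. It assumes three levels with energies 0, 1, 2, fixed N = 60 and total energy 40 (all made-up numbers), enumerates every allowed set of occupation numbers, and picks the one with the largest multinomial count.

```python
import math

def ln_multiplicity(ns):
    """ln of the multinomial count N! / prod_j n_j!."""
    n_total = sum(ns)
    return math.lgamma(n_total + 1) - sum(math.lgamma(n + 1) for n in ns)

N, E_TOTAL = 60, 40  # toy constraints; levels sit at energies 0, 1, 2
candidates = []
for n2 in range(0, E_TOTAL // 2 + 1):
    n1 = E_TOTAL - 2 * n2  # energy constraint: n1 + 2 * n2 = E_TOTAL
    n0 = N - n1 - n2       # particle-number constraint
    if n0 >= 0 and n1 >= 0:
        candidates.append((n0, n1, n2))

best = max(candidates, key=ln_multiplicity)
# For a Boltzmann form with equally spaced levels, both ratios estimate the same beta.
beta_01 = math.log(best[0] / best[1])
beta_12 = math.log(best[1] / best[2])
```

The two beta estimates agree only up to integer rounding of the occupation numbers; the agreement tightens as N grows.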
26. Partition Function: a generating function
we get all sorts of useful stuff out of it
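Two standard examples of the "useful stuff": the mean energy is -d ln Z / d beta and the energy variance is the second derivative of ln Z. The sketch below checks both by finite differences on a made-up set of energy levels.

```python
import math

def ln_z(energies, beta):
    return math.log(sum(math.exp(-beta * e) for e in energies))

def boltzmann_moments(energies, beta):
    """Mean and variance of the energy under p_j = exp(-beta * E_j) / Z."""
    weights = [math.exp(-beta * e) for e in energies]
    z = sum(weights)
    mean = sum(w * e for w, e in zip(weights, energies)) / z
    second = sum(w * e * e for w, e in zip(weights, energies)) / z
    return mean, second - mean ** 2

energies = [0.0, 0.5, 1.3, 2.0]  # illustrative levels
beta, step = 0.7, 1e-4
mean_energy, var_energy = boltzmann_moments(energies, beta)

# Central finite differences of ln Z with respect to beta.
d1 = (ln_z(energies, beta + step) - ln_z(energies, beta - step)) / (2 * step)
d2 = (ln_z(energies, beta + step) - 2 * ln_z(energies, beta)
      + ln_z(energies, beta - step)) / step ** 2
```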
28. Statistical Physics: an ML viewpoint
we can derive and describe these results
using language familiar to the ML community
• max entropy principle
• KL divergence
• Chernoff bounds
• sums of random numbers
• concentration to the mean
• extreme value statistics
some results may be familiar; others surprising
29. Canonical Ensemble: from states to energies
microcanonical: maximum entropy
canonical: minimum free energy at constant T
Boltzmann-Gibbs distribution minimizes the free energy
30. Canonical Ensemble: from states to energies
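sum over states: $Z = \sum_{x} e^{-\beta E(x)}$
sum over energy levels: $Z = \sum_{E} \Omega(E) \, e^{-\beta E}$
many states can have the same energy level E
we count them w/ the density of states $\Omega(E)$
entropy: $S(E) = \ln \Omega(E)$
free energy: $F = -\frac{1}{\beta} \ln Z$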
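A sketch showing the two ways of summing Z agree, on a hypothetical toy model: 10 binary spins, with the energy of a state equal to the number of spins set to 1. The model and the value of beta are assumptions for illustration.

```python
import itertools
import math
from collections import Counter

N_SPINS = 10  # toy model: E(state) = number of spins equal to 1
beta = 0.8

states = list(itertools.product([0, 1], repeat=N_SPINS))

# Sum over all 2^N states.
z_states = sum(math.exp(-beta * sum(s)) for s in states)

# Sum over energy levels, weighted by the density of states Omega(E).
density = Counter(sum(s) for s in states)
z_levels = sum(omega * math.exp(-beta * e) for e, omega in density.items())

entropy = {e: math.log(omega) for e, omega in density.items()}  # S(E) = ln Omega(E)
free_energy = -math.log(z_states) / beta                        # F = -(1/beta) ln Z
```

For this toy model Omega(E) is a binomial coefficient and Z has the closed form (1 + e^{-beta})^N, which makes the result easy to cross-check.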
33. Temperature: a Chernoff parameter
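given $X_1, X_2, \ldots$ i.i.d. vars, and a function $E(x)$
how fast does the probability of the event (sum) decay?
$\Pr\{ \sum_{i=1}^{N} E(X_i) \le N E_0 \}$, where $E_0 < \langle E(X) \rangle$
apply the Chernoff bound, w/ an exponential bound on the indicator
$\Pr\{ \sum_i E(X_i) \le N E_0 \} \le e^{\beta N E_0} \, \langle e^{-\beta E(X)} \rangle^N$
minimize over $\beta \ge 0$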
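A sketch of the Chernoff bound on a concrete case, assuming X ~ Bernoulli(1/2), E(x) = x, and the lower-tail event sum_i X_i <= N * E0 with E0 = 0.25. These choices, the grid search over beta, and all names are illustrative assumptions.

```python
import math

def chernoff_exponent(beta, e0, p=0.5):
    """Per-sample exponent beta * E0 + ln <exp(-beta * X)> for X ~ Bernoulli(p)."""
    return beta * e0 + math.log(1.0 - p + p * math.exp(-beta))

def exact_tail(n, e0, p=0.5):
    """Exact Pr{sum_i X_i <= n * e0} from the binomial distribution."""
    k_max = math.floor(n * e0)
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k_max + 1))

n, e0 = 40, 0.25
betas = [i * 0.001 for i in range(0, 5001)]  # coarse grid over beta >= 0
beta_star = min(betas, key=lambda b: chernoff_exponent(b, e0))
bound = math.exp(n * chernoff_exponent(beta_star, e0))
exact = exact_tail(n, e0)
```

For this case the optimum can be worked out by hand as beta = ln 3, so the minimizing beta acts like an inverse temperature fixed by the target mean E0.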
35. Temperature: a Chernoff parameter
principle of minimum free energy: the minimizing $\beta$ is the equilibrium inverse temperature
S is really a rate function, as in large deviations theory
see book for details & caveats
36. Free Energy: thermodynamic limit
free energy density
these may differ: the order of the limits matters
annealed (w/ moments)
37. Free Energy: indicates Phase Transitions (PT)
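thermodynamic functions change abruptly with external changes
the free energy should be analytic; at a PT it is not
first order PT: first derivative of the free energy is discontinuous
second order PT: second derivative of the free energy is discontinuous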
38. Random Energies: sum of exponentials of random numbers
say we have $N = e^{nA}$ i.i.d. events,
each w/ probability $p = e^{-nB}$
what is the probability that at least one event occurs?
39. sums of exp(rand(x)): concentration result
# successes = sum of i.i.d. binary random vars, w/ expectation $e^{n(A-B)}$
A < B: vanishes completely
A > B: concentrates to its mean very fast
40. sums of exp(rand(x)): proof of concentrations
either 1 event or 0 events are seen, depending on A/B
$\ln(1 - x) = -x + \ldots$
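A numerical sketch of the two regimes, assuming exp(n*A) i.i.d. events each with probability exp(-n*B), so that the probability of at least one event is 1 - (1 - e^{-nB})^{e^{nA}}. The values of A, B and n are illustrative.

```python
import math

def prob_at_least_one(n, a, b):
    """1 - (1 - exp(-n*b)) ** exp(n*a), evaluated in log space for stability."""
    num_events = math.exp(n * a)
    p = math.exp(-n * b)
    # log1p(-p) = ln(1 - p), which is about -p for small p.
    return -math.expm1(num_events * math.log1p(-p))

few = [prob_at_least_one(n, a=0.5, b=1.0) for n in (5, 20, 40)]   # A < B
many = [prob_at_least_one(n, a=1.0, b=0.5) for n in (5, 20, 40)]  # A > B
```

With A < B the probability decays roughly like e^{-n(B-A)}; with A > B it saturates at 1 almost immediately.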
43. Replica Method: an old trick to eval Z
expected value of ln Z, expressed in moments of Z
$\langle \ln Z \rangle = \lim_{m \to 0} \frac{1}{m} \ln \langle Z^m \rangle$
evaluate w/ integer m
analytic continuation to real m as m -> 0
bad branch cut? deal w/ later
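A toy check of the m -> 0 limit, as a sketch. It assumes Z takes four equiprobable made-up values, so every average is an exact finite sum, and uses the form (1/m) ln <Z^m>.

```python
import math

z_samples = [0.5, 1.0, 2.0, 8.0]  # hypothetical equiprobable values of Z

def mean(values):
    return sum(values) / len(values)

quenched = mean([math.log(z) for z in z_samples])  # <ln Z>

def replica_estimate(m):
    """(1/m) * ln <Z^m>, which should approach <ln Z> as m -> 0."""
    return math.log(mean([z ** m for z in z_samples])) / m

annealed = replica_estimate(1.0)  # ln <Z>, the annealed value
estimates = {m: replica_estimate(m) for m in (2.0, 1.0, 0.1, 0.001)}
```

Here m is simply taken small and real; the actual method only has the moments at integer m and must continue them to m -> 0, which is where the caveats arise.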