Deep Learning Structure Discovery
1. Deep Learning
Russ Salakhutdinov
Department of Statistics and Computer Science
University of Toronto
2. Mining for Structure
Massive increase in both computational power and the amount of data available from the web, video cameras, and laboratory measurements.
Images & Video · Text & Language · Speech & Audio · Gene Expression · Relational Data / Social Networks · Product Recommendation · Climate Change · Geological Data — mostly unlabeled.
• Develop statistical models that can discover underlying structure, cause, or statistical correlation from data in an unsupervised or semi-supervised way.
• Multiple application domains.
3. Talk Roadmap
• Unsupervised Feature Learning
  – Restricted Boltzmann Machines (RBM)
  – Deep Belief Networks
  – Deep Boltzmann Machines (DBM)
• Transfer Learning with Deep Models
• Multimodal Learning
4. Restricted Boltzmann Machines
Bipartite structure: stochastic binary visible variables v (e.g. image pixels) are connected to stochastic binary hidden variables h.
The energy of the joint configuration:
E(v, h; θ) = −v⊤Wh − b⊤v − c⊤h, where θ = {W, b, c} are the model parameters.
Probability of the joint configuration is given by the Boltzmann distribution:
P(v, h; θ) = exp(−E(v, h; θ)) / Z(θ), where Z(θ) is the partition function and the exponentiated energy terms act as potential functions.
Related models: Markov random fields, Boltzmann machines, log-linear models.
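The energy and Boltzmann distribution above can be checked numerically for a tiny model, where the partition function Z can still be computed by brute-force enumeration. This is a minimal sketch, not code from the talk; the model sizes and variable names are illustrative:

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -v^T W h - b^T v - c^T h for binary v and h."""
    return -(v @ W @ h + b @ v + c @ h)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))              # 3 visible units, 2 hidden units
b, c = rng.normal(size=3), rng.normal(size=2)

# Partition function Z by brute-force enumeration (feasible only for tiny models;
# in general Z is intractable -- the point of the learning algorithms later on).
configs = [(np.array(v), np.array(h))
           for v in itertools.product([0, 1], repeat=3)
           for h in itertools.product([0, 1], repeat=2)]
Z = sum(np.exp(-energy(v, h, W, b, c)) for v, h in configs)

def joint_prob(v, h):
    """Boltzmann distribution: P(v, h) = exp(-E(v, h)) / Z."""
    return np.exp(-energy(v, h, W, b, c)) / Z

total = sum(joint_prob(v, h) for v, h in configs)   # probabilities sum to 1
```

For realistic model sizes the enumeration over 2^D × 2^F configurations is impossible, which motivates the approximate learning methods on the following slides.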
5. Restricted Boltzmann Machines
Bipartite structure. Restricted: no interactions between hidden variables.
Inferring the distribution over the hidden variables given the visible variables (the image) is easy — it factorizes:
P(h | v) = ∏_j P(h_j | v), with P(h_j = 1 | v) = σ(c_j + Σ_i W_ij v_i); easy to compute.
Similarly, P(v | h) = ∏_i P(v_i | h).
Related models: Markov random fields, Boltzmann machines, log-linear models.
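The factorized conditionals above reduce to a sigmoid per unit; a minimal sketch (names and sizes illustrative, not the talk's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """Factorized posterior: P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W_ij)."""
    return sigmoid(v @ W + c)

def p_v_given_h(h, W, b):
    """By symmetry: P(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j)."""
    return sigmoid(h @ W.T + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b, c = np.zeros(4), np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 0.0])
ph = p_h_given_v(v, W, c)   # one independent Bernoulli probability per hidden unit
```

Because the conditionals factorize, a full Gibbs sweep over all hidden (or all visible) units is a single matrix-vector product, which is what makes RBM inference cheap.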
6. Model Learning
Given a set of i.i.d. training examples (images over the visible variables), we want to learn the model parameters θ = {W, b, c}.
Maximize the (penalized) log-likelihood objective, with a regularization term.
Derivative of the log-likelihood:
∂ log P(v; θ)/∂W = E_{P(h|v)}[v h⊤] − E_{P(v,h)}[v h⊤].
The second (model) expectation is difficult to compute: exponentially many configurations.
7. Model Learning
Given a set of i.i.d. training examples, we want to learn the model parameters θ. Maximize the (penalized) log-likelihood objective; the derivative of the log-likelihood again involves an intractable model expectation.
Approximate maximum likelihood learning:
• Contrastive Divergence (Hinton 2000)
• Pseudo-Likelihood (Besag 1977)
• MCMC-MLE estimator (Geyer 1991)
• Composite Likelihoods (Lindsay 1988; Varin 2008)
• Tempered MCMC (Salakhutdinov, NIPS 2009)
• Adaptive MCMC (Salakhutdinov, ICML 2010)
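Of the approximate learning schemes listed above, Contrastive Divergence is the simplest to sketch: replace the intractable model expectation with statistics from a one-step reconstruction. A minimal CD-1 sketch on toy binary data (hyperparameters and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.05):
    """One Contrastive Divergence (CD-1) parameter update."""
    ph0 = sigmoid(v0 @ W + c)                  # data-dependent statistics
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b)                # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + c)                 # reconstruction-side statistics
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return ((v0 - pv1) ** 2).mean()            # reconstruction error (monitoring)

# Toy data: two repeated binary patterns.
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)
W = 0.01 * rng.normal(size=(4, 3))
b, c = np.zeros(4), np.zeros(3)
errs = [cd1_step(data, W, b, c) for _ in range(200)]
```

Reconstruction error is only a rough progress proxy (CD does not follow the true likelihood gradient), but it should fall as the model captures the two patterns.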
8. RBMs for Images
Define energy functions for various data modalities.
Gaussian-Bernoulli RBM: Gaussian visible variables (real-valued image intensities) with Bernoulli hidden variables.
9. RBMs for Images
Gaussian-Bernoulli RBM. Interpretation: a mixture of an exponential number of Gaussians,
P(v) = Σ_h P(h) N(v; b + σWh, σ²I),
where P(h) is an implicit prior over the hidden states and each component is a Gaussian.
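The Gaussian-Bernoulli conditionals can be sketched as follows: hiddens remain sigmoid-Bernoulli, while each visible unit becomes conditionally Gaussian whose mean selects one of the exponentially many mixture components. A sketch under the standard parameterization (σ fixed to 1 here; all names illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gb_p_h_given_v(v, W, c, sigma=1.0):
    """Gaussian-Bernoulli RBM: P(h_j = 1 | v) = sigmoid(c_j + sum_i (v_i/sigma) W_ij)."""
    return sigmoid((v / sigma) @ W + c)

def gb_v_mean_given_h(h, W, b, sigma=1.0):
    """Given h, each visible unit is Gaussian with mean b_i + sigma * sum_j W_ij h_j."""
    return b + sigma * (h @ W.T)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))
b, c = np.zeros(5), np.zeros(2)
v = rng.normal(size=5)                  # real-valued visibles, e.g. pixel intensities
ph = gb_p_h_given_v(v, W, c)
h = (rng.random(2) < ph) * 1.0
mu = gb_v_mean_given_h(h, W, b)         # mean of one of the 2^2 mixture components
```

Each of the 2^F binary hidden configurations h picks out a Gaussian with mean b + σWh, which is exactly the mixture interpretation on this slide.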
10. RBMs for Images and Text
Images: Gaussian-Bernoulli RBM trained on 4 million unlabelled images. Learned features (out of 10,000).
Text: Multinomial-Bernoulli RBM on bag-of-words vectors from the Reuters dataset: 804,414 unlabeled newswire stories. Learned features — "topics":
  russian:  russia, moscow, yeltsin, soviet
  clinton:  house, president, bill, congress
  computer: system, product, software, develop
  trade:    country, import, world, economy
  stock:    wall, street, point, dow
11. Collaborative Filtering
Multinomial visible variables v: user ratings. Bernoulli hidden variables h: user preferences.
Netflix dataset: 480,189 users, 17,770 movies, over 100 million ratings.
Learned features — "genre" (movies grouped by feature W1):
  Fahrenheit 9/11, Bowling for Columbine, The People vs. Larry Flynt, Canadian Bacon, La Dolce Vita
  Independence Day, The Day After Tomorrow, Con Air, Men in Black II, Men in Black
  Friday the 13th, The Texas Chainsaw Massacre, Children of the Corn, Child's Play, The Return of Michael Myers
  Scary Movie, Naked Gun, Hot Shots!, American Pie, Police Academy
State-of-the-art performance on the Netflix dataset.
Relates to Probabilistic Matrix Factorization (Salakhutdinov & Mnih, ICML 2007).
12. Multiple Application Domains
• Natural Images
• Text / Documents
• Collaborative Filtering / Matrix Factorization
• Video (Langford et al., ICML 2009; Lee et al.)
• Motion Capture (Taylor et al., NIPS 2007)
• Speech Perception (Dahl et al., NIPS 2010; Lee et al., NIPS 2010)
Same learning algorithm — multiple input domains.
But there are limitations on the types of structure that can be represented by a single layer of low-level features!
13. Talk Roadmap
• Unsupervised Feature Learning
  – Restricted Boltzmann Machines (RBM)
  – Deep Belief Networks
  – Deep Boltzmann Machines (DBM)
• Transfer Learning with Deep Models
• Multimodal Learning
14. Deep Belief Network
Input: image pixels.
Low-level features: edges. Built from unlabeled inputs.
15. Deep Belief Network
Unsupervised feature learning: internal representations capture higher-order statistical structure.
Input: image pixels.
Low-level features: edges. Built from unlabeled inputs.
Higher-level features: combinations of edges.
(Hinton et al., Neural Computation 2006)
16. Deep Belief Network
The Deep Belief Network's joint probability distribution factorizes:
P(v, h¹, h², h³) = P(v | h¹) P(h¹ | h²) P(h², h³),
where the top two layers (h², h³ with weights W³) form an RBM and the lower layers (W¹, W²) form a directed sigmoid belief network over h¹ and v.
17. DBNs for Classification
(Diagram: a 500–500–2000-unit stack of RBMs is pretrained layer by layer with weights W1–W3, unrolled, and fine-tuned with a 10-way softmax output: Pretraining → Unrolling → Fine-tuning.)
• After layer-by-layer unsupervised pretraining, discriminative fine-tuning by backpropagation achieves an error rate of 1.2% on MNIST. SVMs get 1.4% and randomly initialized backprop gets 1.6%.
• Clearly unsupervised learning helps generalization. It ensures that most of the information in the weights comes from modeling the input data.
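The greedy layer-by-layer pretraining described above can be sketched by stacking RBMs, each trained (here with CD-1) on the hidden activations of the layer below. A toy sketch under illustrative sizes, not the talk's 500–500–2000 architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=50, lr=0.05):
    """Minimal full-batch CD-1 trainer; returns weights and hidden biases."""
    W = 0.01 * rng.normal(size=(data.shape[1], n_hidden))
    b, c = np.zeros(data.shape[1]), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + c)
        h0 = (rng.random(ph0.shape) < ph0) * 1.0
        pv1 = sigmoid(h0 @ W.T + b)
        ph1 = sigmoid(pv1 @ W + c)
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / len(data)
        b += lr * (data - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, c

def pretrain_stack(data, layer_sizes):
    """Greedy pretraining: each new RBM models the previous layer's features."""
    layers, x = [], data
    for n in layer_sizes:
        W, c = train_rbm(x, n)
        layers.append((W, c))
        x = sigmoid(x @ W + c)      # deterministic up-pass feeds the next layer
    return layers

data = (rng.random((20, 8)) < 0.5) * 1.0
stack = pretrain_stack(data, [6, 4, 2])
```

Fine-tuning would then unroll this stack into a feed-forward network, attach a softmax output, and backpropagate; the pretrained weights serve as the initialization.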
19. Deep Generative Model
Model P(document). Trained unsupervised on bag-of-words vectors from the Reuters dataset: 804,414 newswire stories.
The learned code space separates topics such as: European Community Monetary/Economic, Interbank Markets, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Legal/Judicial, Government Borrowings, Accounts/Earnings.
(Hinton & Salakhutdinov, Science 2006)
20. Information Retrieval
(Figure: documents in a 2-D LSA space, with clusters for Interbank Markets, European Community Monetary/Economic, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Legal/Judicial, Government Borrowings, Accounts/Earnings.)
• The Reuters Corpus Volume II contains 804,414 newswire stories (randomly split into 402,207 training and 402,207 test).
• "Bag-of-words": each article is represented as a vector containing the counts of the 2000 most frequently used words in the training set.
21. Semantic Hashing
(Figure: a semantic hashing function maps documents — e.g. European Community Monetary/Economic, Disasters and Accidents, Government Borrowing, Energy Markets, Accounts/Earnings — into an address space where semantically similar documents land at nearby addresses.)
• Learn to map documents into semantic 20-D binary codes.
• Retrieve similar documents stored at the nearby addresses with no search at all.
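The "no search at all" retrieval above amounts to looking up addresses within a small Hamming ball of the query's binary code. A minimal sketch with hypothetical 20-bit codes (the codes here are random, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)

def hamming_ball_retrieval(query_code, codes, radius=1):
    """Return indices of documents whose binary code lies within a small
    Hamming ball of the query code -- no exhaustive similarity search."""
    dists = (codes != query_code).sum(axis=1)
    return np.flatnonzero(dists <= radius)

# Hypothetical 20-bit codes for 5 stored documents.
codes = rng.integers(0, 2, size=(5, 20))
query = codes[2].copy()
query[0] ^= 1                            # query differs from document 2 by one bit
hits = hamming_ball_retrieval(query, codes, radius=1)
```

In a real system the codes index directly into memory addresses, so retrieval enumerates the 2^0 + 20 + ... addresses in the Hamming ball rather than scanning all documents, as the slide describes.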
22. Searching Large Image Databases Using Binary Codes
• Map images into binary codes for fast retrieval.
• Small Codes: Torralba, Fergus, Weiss, CVPR 2008
• Spectral Hashing: Weiss, Torralba, Fergus, NIPS 2008
• Kulis and Darrell, NIPS 2009; Gong and Lazebnik, CVPR 2011
• Norouzi and Fleet, ICML 2011
23. Talk Roadmap
• Unsupervised Feature Learning
  – Restricted Boltzmann Machines (RBM)
  – Deep Belief Networks
  – Deep Boltzmann Machines (DBM)
• Transfer Learning with Deep Models
• Multimodal Learning
24. DBNs vs. DBMs
Deep Belief Network (directed lower layers) vs. Deep Boltzmann Machine (fully undirected), each with layers v, h¹, h², h³ and weights W¹, W², W³.
DBNs are hybrid models:
• Inference in DBNs is problematic due to explaining away.
• Only greedy pretraining, no joint optimization over all layers.
• Approximate inference is feed-forward: no bottom-up and top-down interaction.
We introduce a new class of models called Deep Boltzmann Machines.
25. Mathematical Formulation
Deep Boltzmann Machine with model parameters θ = {W¹, W², W³} over layers v, h¹, h², h³:
• Dependencies between hidden variables.
• All connections are undirected.
• Bottom-up and top-down: the conditional for each hidden layer combines bottom-up input from below with top-down input from above, e.g.
P(h¹_j = 1 | v, h²) = σ(Σ_i W¹_ij v_i + Σ_k W²_jk h²_k).
Unlike many existing feed-forward models: ConvNet (LeCun), HMAX (Poggio et al.), Deep Belief Nets (Hinton et al.).
26. Mathematical Formulation
Side by side: Neural Network (input → output), Deep Boltzmann Machine, and Deep Belief Network, each with layers v, h¹, h², h³ and weights W¹, W², W³.
Unlike many existing feed-forward models: ConvNet (LeCun), HMAX (Poggio), Deep Belief Nets (Hinton).
27. Mathematical Formulation
The same comparison — Neural Network, Deep Boltzmann Machine, Deep Belief Network — with the inference pass marked on each architecture.
Unlike many existing feed-forward models: ConvNet (LeCun), HMAX (Poggio), Deep Belief Nets (Hinton).
28. Mathematical Formulation
Deep Boltzmann Machine with model parameters θ; dependencies between hidden variables.
Maximum likelihood learning:
∂ log P(v; θ)/∂W = E_{data}[v h⊤] − E_{model}[v h⊤].
Problem: both expectations are intractable!
This is the general learning rule for undirected graphical models: MRFs, CRFs, factor graphs.
29. Previous Work
Many approaches for learning Boltzmann machines have been proposed over the last 20 years:
• Hinton and Sejnowski (1983)
• Peterson and Anderson (1987)
• Galland (1991)
• Kappen and Rodriguez (1998)
• Lawrence, Bishop, and Jordan (1998)
• Tanaka (1998)
• Welling and Hinton (2002)
• Zhu and Liu (2002)
• Welling and Teh (2003)
• Yasuda and Tanaka (2009)
Real-world applications involve thousands of hidden and observed variables with millions of parameters.
Many of the previous approaches were not successful for learning general Boltzmann machines with hidden variables. Algorithms based on Contrastive Divergence, Score Matching, Pseudo-Likelihood, Composite Likelihood, MCMC-MLE, and Piecewise Learning cannot handle multiple layers of hidden variables.
30. New Learning Algorithm
Posterior inference (conditional): approximate the conditional distribution.
Simulate from the model (unconditional): approximate the joint distribution.
(Salakhutdinov, 2008; NIPS 2009)
31. New Learning Algorithm
Posterior inference (conditional): approximate the conditional distribution — the data-dependent statistics.
Simulate from the model (unconditional): approximate the joint distribution — the data-independent statistics.
Learning matches the data-dependent and data-independent densities.
32. New Learning Algorithm
Key idea of our approach:
• Data-dependent (posterior inference over the conditional): Variational Inference, mean-field theory.
• Data-independent (simulating the joint distribution): Stochastic Approximation, Markov chain Monte Carlo.
Learning matches the two sets of statistics.
33. Sampling from DBMs
Sampling from a two-hidden-layer DBM by running a Markov chain: randomly initialize the states, alternately Gibbs-sample each layer given its neighbors, … and finally take a sample.
34. Stochastic Approximation
Run persistent Markov chains over time steps t = 1, 2, 3, …, updating the chain states x = {v, h¹, h²} and the parameters θ sequentially:
• Generate x^(t+1) by simulating from a Markov chain that leaves P(·; θ^t) invariant (e.g. a Gibbs or M-H sampler).
• Update θ^(t+1) by replacing the intractable model expectation with a point estimate at x^(t+1).
In practice we simulate several Markov chains in parallel.
Robbins and Monro, Ann. Math. Stats, 1957; L. Younes, Probability Theory 1989; Tieleman, ICML 2008.
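The update rule above can be sketched in the single-layer (RBM) case, where it coincides with Tieleman's persistent contrastive divergence: the negative statistics come from Markov chains that persist across parameter updates instead of restarting at the data. A minimal sketch; sizes, learning rate, and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sa_step(data, chains, W, b, c, lr=0.02):
    """One stochastic-approximation update with persistent chains."""
    ph_data = sigmoid(data @ W + c)                 # data-dependent statistics
    # Advance each persistent chain by one Gibbs sweep (leaves P(.; theta) invariant).
    ph = sigmoid(chains @ W + c)
    h = (rng.random(ph.shape) < ph) * 1.0
    pv = sigmoid(h @ W.T + b)
    chains[:] = (rng.random(pv.shape) < pv) * 1.0
    ph_model = sigmoid(chains @ W + c)
    # Replace the intractable model expectation with the chains' point estimate.
    W += lr * (data.T @ ph_data / len(data) - chains.T @ ph_model / len(chains))
    b += lr * (data.mean(axis=0) - chains.mean(axis=0))
    c += lr * (ph_data.mean(axis=0) - ph_model.mean(axis=0))

data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 5, dtype=float)
W = 0.01 * rng.normal(size=(4, 3))
b, c = np.zeros(4), np.zeros(3)
chains = (rng.random((10, 4)) < 0.5) * 1.0          # several chains in parallel
for _ in range(100):
    sa_step(data, chains, W, b, c)
```

For a full DBM the Gibbs sweep would alternate over v, h¹, h² as on the previous slide, and the learning rate would decay to satisfy the Robbins-Monro conditions.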
35. Stochastic Approximation
The update rule decomposes into the true gradient plus a noise term, with almost sure convergence guarantees as the learning rate decays.
Problem: for high-dimensional data the energy landscape is highly multimodal, so the Markov chain Monte Carlo sampler can get stuck.
Key insight: the transition operator can be any valid transition operator — Tempered Transitions, Parallel/Simulated Tempering.
Connections to the theory of stochastic approximation and adaptive MCMC.
36. Variational Inference
Posterior inference: approximate the intractable distribution P(h | v; θ) with a simpler, tractable distribution Q(h; µ).
Mean-field variational lower bound:
log P(v; θ) ≥ Σ_h Q(h; µ) log P(v, h; θ) + H(Q).
Minimize the KL divergence between the approximating and true distributions with respect to the variational parameters µ.
(Salakhutdinov & Larochelle, AI & Statistics 2010)
37. Variational Inference
Posterior inference: approximate the intractable distribution P(h | v; θ) with a simpler, tractable distribution Q(h; µ).
Mean-field: choose a fully factorized distribution Q(h; µ) = ∏_j q(h_j), with q(h_j = 1) = µ_j.
Variational inference: maximize the lower bound with respect to the variational parameters µ. This yields nonlinear fixed-point equations, e.g. for a two-hidden-layer DBM:
µ¹_j = σ(Σ_i W¹_ij v_i + Σ_k W²_jk µ²_k),   µ²_k = σ(Σ_j W²_jk µ¹_j).
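The fixed-point equations above are solved by simple iteration. A sketch for a two-hidden-layer DBM (biases omitted for brevity; sizes and names illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iter=30):
    """Iterate the mean-field fixed-point equations for a 2-layer DBM:
       mu1 = sigmoid(v W1 + mu2 W2^T),  mu2 = sigmoid(mu1 W2)."""
    mu1 = np.full(W1.shape[1], 0.5)     # initialize all marginals at 0.5
    mu2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iter):
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)   # bottom-up input plus top-down feedback
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 4))
W2 = rng.normal(size=(4, 3))
v = (rng.random(6) < 0.5) * 1.0
mu1, mu2 = mean_field(v, W1, W2)
```

Each sweep monotonically increases the variational lower bound for fixed θ; note that µ¹ sees both bottom-up (v) and top-down (µ²) terms, which is the bottom-up/top-down interaction the DBM slides emphasize.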
38. Variational Inference
Approximate the intractable distribution with a simpler, tractable distribution Q; alternate:
1. Variational inference (posterior, mean-field): maximize the lower bound with respect to the variational parameters.
2. Unconditional simulation (Markov chain Monte Carlo): apply stochastic approximation to update the model parameters.
Fast inference; learning can scale to millions of examples.
Almost sure convergence guarantees to an asymptotically stable point.
45. Deep Boltzmann Machine
Model P(image): a Bernoulli Markov random field trained on handwritten characters (e.g. Sanskrit) — 25,000 characters from 50 alphabets around the world.
• 3,000 hidden variables
• 784 observed variables (28 by 28 images)
• Over 2 million parameters
46. Deep Boltzmann Machine
Conditional simulation: P(image | partial image), Bernoulli Markov random field.
47. Handwriting Recognition
MNIST dataset: 60,000 examples of 10 digits.
  Logistic regression                        12.0%
  K-NN                                        3.09%
  Neural Net (Platt 2005)                     1.53%
  SVM (Decoste et al. 2002)                   1.40%
  Deep Autoencoder (Bengio et al. 2007)       1.40%
  Deep Belief Net (Hinton et al. 2006)        1.20%
  DBM                                         0.95%
Optical character recognition: 42,152 examples of 26 English letters.
  Logistic regression                        22.14%
  K-NN                                       18.92%
  Neural Net                                 14.62%
  SVM (Larochelle et al. 2009)                9.70%
  Deep Autoencoder (Bengio et al. 2007)      10.05%
  Deep Belief Net (Larochelle et al. 2009)    9.68%
  DBM                                         8.40%
Permutation-invariant version.
48. Deep Boltzmann Machine
Gaussian-Bernoulli Markov random field / Deep Boltzmann Machine with 12,000 latent variables.
Model P(image): planes. 96 by 96 images (stereo pairs), 24,000 training images.
49. Generative Model of 3-D Objects
24,000 examples: 5 object categories, 5 different objects within each category, 6 lighting conditions, 9 elevations, 18 azimuths.
51. Learning Part-based Hierarchy
Object parts: combinations of edges. Trained from multiple classes (cars, faces, motorbikes, airplanes).
Lee et al., ICML 2009
52. Robust Boltzmann Machines
• Build more complex models that can deal with occlusions or structured noise.
Components: a Gaussian RBM modeling clean faces, a binary RBM modeling occlusions, an inferred binary mask, and pixel-wise Gaussian noise relating them to the observed image.
Relates to: Le Roux, Heess, Shotton, and Winn, Neural Computation, 2011; Eslami, Heess, Winn, CVPR 2012; Tang et al., CVPR 2012
53. Robust Boltzmann Machines
Internal states of the RoBM during learning; inference on the test subjects; comparison to other denoising algorithms.
54. Spoken Query Detection
• 630-speaker TIMIT corpus: 3,696 training and 944 test utterances.
• 10 query keywords were randomly selected, and 10 examples of each keyword were extracted from the training set.
• Goal: for each keyword, rank all 944 utterances based on the utterance's probability of containing that keyword.
• Performance measure: the average equal error rate (EER).
  Learning Algorithm      AVG EER
  GMM Unsupervised        16.4%
  DBM Unsupervised        14.7%
  DBM (1% labels)         13.3%
  DBM (30% labels)        10.5%
  DBM (100% labels)        9.7%
(Figure: average EER vs. training ratio.)
(Yaodong Zhang et al., ICASSP 2012)
55. Learning Hierarchical Representations
Deep Boltzmann Machines learn hierarchical structure in features: edges, combinations of edges.
• Perform well in many application domains
• Combine bottom-up and top-down inference
• Fast inference: a fraction of a second
• Learning scales to millions of examples
So far: many examples, few categories. Next: few examples, many categories — transfer learning.
56. Talk Roadmap
• Unsupervised Feature Learning
  – Restricted Boltzmann Machines (RBM)
  – Deep Belief Networks
  – Deep Boltzmann Machines (DBM)
• Transfer Learning with Deep Models
• Multimodal Learning
(Example hierarchy: "animal" — horse, cow; "vehicle" — car, van, truck.)
57. One-shot Learning
"zarc"  "segway"
How can we learn a novel concept — a high-dimensional statistical object — from few examples?
58. Learning from Few Examples
SUN database: car, van, truck, bus, … — classes sorted by frequency.
Rare objects are similar to frequent objects.
60. Learning to Transfer Background Knowledge
Millions of unlabeled images → learn to transfer knowledge → some labeled images (Bicycle, Dolphin, Elephant, Tractor) → learn a novel concept from one example.
Test: what is this?
61. Learning to Transfer Background Knowledge
Millions of unlabeled images → learn to transfer knowledge → some labeled images → learn a novel concept from one example. Test: what is this?
This is a key problem in computer vision, speech perception, natural language processing, and many other domains.
62. One-Shot Learning
Hierarchical Bayesian models with a hierarchical prior:
  Level 3 (global): {τ⁰, α⁰}
  Level 2 (super-categories Animal, Vehicle, …): {µᵏ, τᵏ, αᵏ}
  Level 1 (basic classes Cow, Horse, Sheep, Truck, Car): {µᶜ, τᶜ}
The posterior probability of parameters given the training data D combines the probability of the observed data given the parameters with the prior probability of the weight vector W.
• Fei-Fei, Fergus, and Perona, TPAMI 2006
• E. Bart, I. Porteous, P. Perona, and M. Welling, CVPR 2007
• Miller, Matsakis, and Viola, CVPR 2000
• Sivic, Russell, Zisserman, Freeman, and Efros, CVPR 2008
63. Hierarchical-Deep Models
HD Models: compose hierarchical Bayesian models with deep networks — two influential approaches from unsupervised learning.
Deep Networks (part-based hierarchy; cf. Marr and Nishihara, 1978):
• learn multiple layers of nonlinearities.
• are trained in unsupervised fashion — unsupervised feature learning — with no need to rely on human-crafted input representations.
• labeled data is used only to slightly adjust the model for a specific task.
Hierarchical Bayes (category-based hierarchy; cf. Collins & Quillian, 1969):
• explicitly represents category hierarchies for sharing abstract knowledge.
• explicitly identifies only a small number of parameters that are relevant to the new concept being learned.
64. Motivation
Learning to transfer knowledge (example: Segway):
• Super-category: "A segway looks like a funny kind of vehicle."
• Higher-level features, or parts, shared with other classes: wheel, handle, post.
• Lower-level features: edges, compositions of edges.
Hierarchy: super-class → parts → edges (deep).
65. Hierarchical Generative Model
A Hierarchical Latent Dirichlet Allocation model over the class hierarchy ("animal": horse, cow; "vehicle": car, van, truck; K topics), on top of a DBM model of images.
Lower-level generic features: edges, combinations of edges.
(Salakhutdinov, Tenenbaum, Torralba, 2011)
66. Hierarchical Generative Model
A Hierarchical Latent Dirichlet Allocation model over the class hierarchy ("animal": horse, cow; "vehicle": car, van, truck).
Hierarchical organization of categories:
• expresses priors on the features that are typical of different kinds of concepts
• modular data-parameter relations
Higher-level class-sensitive features (K topics):
• capture the distinctive perceptual structure of a specific concept
DBM model — lower-level generic features:
• edges, combinations of edges
(Salakhutdinov, Tenenbaum, Torralba, 2011)
67. Intuition
LDA generative process: α → π (Pr(topic | doc)) → z → θ → w (Pr(word | topic)).
Words ⇔ activations of the DBM's top-level units.
Topics ⇔ distributions over top-level units, or higher-level parts.
DBM generic features play the role of words; LDA high-level features play the role of topics; images play the role of documents.
Each topic is made up of words; each document is made up of topics.
69. Hierarchical Deep Model
The tree hierarchy of classes ("animal": horse, cow; "vehicle": car, van, truck) is learned.
Nested Chinese Restaurant Process prior: a nonparametric prior over tree structures.
K topics.
70. Hierarchical Deep Model
The tree hierarchy of classes is learned.
Nested Chinese Restaurant Process prior: a nonparametric prior over tree structures.
Hierarchical Dirichlet Process prior: a nonparametric prior allowing categories to share higher-level features, or parts (K topics).
71. Hierarchical Deep Model
The tree hierarchy of classes is learned. Unlike standard statistical models, in addition to inferring parameters, we also infer the hierarchy for sharing those parameters.
Nested Chinese Restaurant Process prior: a nonparametric prior over tree structures.
Hierarchical Dirichlet Process prior: a nonparametric prior allowing categories to share higher-level features, or parts (K topics).
Conditional Deep Boltzmann Machine: enforce (approximate) global consistency through many local constraints.
72. CIFAR Object Recognition
50,000 images of 100 classes (32 x 32 pixels x 3 RGB), plus 4 million unlabeled images.
The tree hierarchy of classes ("animal": horse, cow; "vehicle": car, van, truck) is learned: higher-level class-sensitive features on top of lower-level generic features.
Inference: Markov chain Monte Carlo — later!
73. Learning to Learn
The model learns how to share the knowledge across many visual categories.
Learned super-class hierarchy ("global" at the root), e.g. "aquatic animal" (dolphin, turtle, shark, ray), "fruit" (apple, orange, sunflower), "human" (girl, baby, man, woman, …) over the basic-level classes.
Learned higher-level class-sensitive features; learned low-level generic features.
74. Learning to Learn
The model learns how to share the knowledge across many visual categories.
(Figure: the full learned super-class hierarchy over the 100 CIFAR classes — groupings such as reptiles (crocodile, snake, lizard, turtle), buildings (castle, bridge, skyscraper, house), vehicles (bus, truck, train, tank, tractor, streetcar), large carnivores (leopard, fox, tiger, lion, wolf), aquatic animals (dolphin, ray, whale, shark), trees (pine, oak, maple, willow), containers (bottle, can, lamp, bowl, cup), rodents and small mammals (squirrel, otter, skunk, shrew, porcupine, chimpanzee, beaver, mouse, raccoon, hamster, rabbit, possum), fruit and flowers (apple, pear, pepper, orange, sunflower), and humans (man, boy, girl, woman) — with learned higher-level class-sensitive features and low-level generic features.)
75. Sharing Features
Learning to learn: learning a hierarchy for sharing parameters enables rapid learning of a novel concept.
Within the "fruit" super-class (apple, orange, sunflower), the model shares shape and color features; reconstructions of real images (Dolphin, Sunflower, Orange, Apple) show the learned factors.
(Figure: ROC curves — detection rate vs. false alarm rate — for "Sunflower" learned from 1, 3, and 5 examples, compared against pixel-space distance.)
76. Object Recognition
Area under the ROC curve for same/different (1 new class vs. 99 distractor classes), as a function of the number of training examples (1, 3, 5, 10, 50), averaged over 40 test classes.
Models compared: HDP-DBM, HDP-DBM (no super-classes), DBM, LDA (class conditional), GIST.
Our model outperforms standard computer vision features (e.g. GIST).
77. Handwritten Character Recognition
25,000 characters. Learned higher-level features: strokes (shown for "alphabet 1" and "alphabet 2", characters 1–5). Learned lower-level features: edges.
78. Handwritten Character Recognition
Area under the ROC curve for same/different (1 new class vs. 1000 distractor classes), as a function of the number of training examples (1, 3, 5, 10), averaged over 40 test classes.
Models compared: HDP-DBM, HDP-DBM (no super-classes), DBM, LDA (class conditional), Pixels.
79. Simulating New Characters
Real data within a super class (global → super class 1, super class 2 → class 1, class 2, new class), alongside simulated new characters.
80.–85. Simulating New Characters
(The same layout as slide 79, repeated with further examples of simulated new characters.)
86. Learning from Very Few Examples
Given 3 examples of a new class, the model infers the super-class and generates conditional samples in the same class.
95. Hierarchical-Deep Models
So far we have considered models combining directed + undirected components (e.g. the undirected DBM beneath the directed "animal"/"vehicle" topic hierarchy with K topics).
Deep Lambertian Networks combine the elegant properties of the Lambertian model of image formation with Gaussian RBMs (and Deep Belief Nets, Deep Boltzmann Machines). The learned low-level features replace hand-crafted ones such as GIST and SIFT.
Tang et al., ICML 2012
96. Deep Lambertian Networks
Model specifics: the observed image is generated from inferred surface normals, albedo, and light source via a Deep Lambertian Net.
Inference: Gibbs sampler. Learning: stochastic approximation.
97. Deep Lambertian Networks
Yale B Extended database: face relighting from one test image and from two test images.
98. Deep Lambertian Networks
Recognition as a function of the number of training images for 10 test subjects; one-shot recognition.
99. Recursive Neural Networks
Recursive structure learning: local recursive networks predict whether to merge the two inputs as well as predicting the label. Uses max-margin estimation.
Socher et al., ICML 2011
101. Learning from Few Examples
SUN database: car, van, truck, bus, … — classes sorted by frequency.
Rare objects are similar to frequent objects.
102. Learning from Few Examples
chair, armchair, swivel chair, deck chair, … — classes sorted by frequency.
103. Generative Model of Classifier Parameters
Many state-of-the-art object detection systems use sophisticated models based on multiple parts with separate appearance and shape components.
Detect objects by testing sub-windows and scoring the corresponding test patches with a linear function. Image-specific features: concatenation of the HOG feature pyramid at multiple scales. (Felzenszwalb, McAllester & Ramanan, 2008)
Define a hierarchical prior over the parameters of the discriminative model and learn the hierarchy.
104. Generative Model of Classifier Parameters
Hierarchical Bayes: by learning hierarchical structure, we can improve the current state-of-the-art.
SUN dataset: 32,855 examples of 200 categories (with class sizes ranging from 185 down to 27 and 12 examples).
The hierarchical model outperforms the single-class baseline.
105. Talk Roadmap
• Unsupervised Feature Learning
  – Restricted Boltzmann Machines (RBM)
  – Deep Belief Networks
  – Deep Boltzmann Machines (DBM)
• Transfer Learning with Deep Models
• Multimodal Learning
106. Multi-Modal Input
Learning systems that combine multiple input domains: images, text & language, video, laser scans, speech & audio, time-series data.
One of the key challenges: inference.
Develop learning systems that come closer to displaying human-like intelligence.
107. Multi-Modal Input
Learning systems that combine multiple input domains (image + text) give more robust perception.
• Ngiam et al., ICML 2011, used deep autoencoders (video + speech)
• Guillaumin, Verbeek, and Schmid, CVPR 2011
• Huiskes, Thomee, and Lew, Multimedia Information Retrieval, 2010
• Xing, Yan, and Hauptmann, UAI 2005
108. Training Data
Samples from the MIR Flickr dataset (Creative Commons License): images paired with user tags, e.g. pentax, k10d, camera, jahdakine, kangarooisland, lightpainting, southaustralia, sa, reflection, australia, doublepaneglass, australiansealion, 300mm, wowiekazowie, sandbanks, lake, lakeontario, sunset, top20butterflies, walking, beach, purple, sky, water, clouds, overtheexcellence, mickikrimmel, mickipedia, headshot — and some images with no text at all.
109. Multi-Modal Input
• Improve classification (e.g. SEA / NOT SEA, given tags such as pentax, k10d, kangarooisland, southaustralia, sa, australia, australiansealion, 300mm).
• Fill in missing modalities (e.g. generate tags: beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves).
• Retrieve data from one modality when queried using data from another modality.