Deep Learning Structure Discovery
1. Deep Learning
Russ Salakhutdinov
Department of Statistics and Computer Science
University of Toronto
2. Mining for Structure
Massive increase in both computational power and the amount of data available from the web, video cameras, and laboratory measurements.
Images & Video · Text & Language · Speech & Audio · Gene Expression · Relational Data / Social Networks · Product Recommendation · Climate Change · Geological Data — mostly unlabeled.
• Develop statistical models that can discover underlying structure, cause, or statistical correlation from data in an unsupervised or semi-supervised way.
• Multiple application domains.
3. Talk Roadmap
• Unsupervised Feature Learning
  – Restricted Boltzmann Machines (RBM)
  – Deep Belief Networks
  – Deep Boltzmann Machines (DBM)
• Transfer Learning with Deep Models
• Multimodal Learning
4. Restricted Boltzmann Machines
Bipartite structure: stochastic binary visible variables v (e.g. image pixels) are connected to stochastic binary hidden variables h.
The energy of the joint configuration:
E(v, h; θ) = −v⊤Wh − b⊤v − c⊤h, where θ = {W, b, c} are the model parameters.
Probability of the joint configuration is given by the Boltzmann distribution:
P(v, h; θ) = exp(−E(v, h; θ)) / Z(θ), where Z(θ) is the partition function and the exponentiated energy terms act as potential functions.
Related models: Markov random fields, Boltzmann machines, log-linear models.
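The energy and Boltzmann distribution above can be checked numerically for a tiny model, where the partition function Z can still be computed by brute-force enumeration. This is a minimal sketch, not code from the talk; the model sizes and variable names are illustrative:

```python
import itertools
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -v^T W h - b^T v - c^T h for binary v and h."""
    return -(v @ W @ h + b @ v + c @ h)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))              # 3 visible units, 2 hidden units
b, c = rng.normal(size=3), rng.normal(size=2)

# Partition function Z by brute-force enumeration (feasible only for tiny models;
# in general Z is intractable -- the point of the learning algorithms later on).
configs = [(np.array(v), np.array(h))
           for v in itertools.product([0, 1], repeat=3)
           for h in itertools.product([0, 1], repeat=2)]
Z = sum(np.exp(-energy(v, h, W, b, c)) for v, h in configs)

def joint_prob(v, h):
    """Boltzmann distribution: P(v, h) = exp(-E(v, h)) / Z."""
    return np.exp(-energy(v, h, W, b, c)) / Z

total = sum(joint_prob(v, h) for v, h in configs)   # probabilities sum to 1
```

For realistic model sizes the enumeration over 2^D × 2^F configurations is impossible, which motivates the approximate learning methods on the following slides.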
5. Restricted Boltzmann Machines
Bipartite structure. Restricted: no interactions between hidden variables.
Inferring the distribution over the hidden variables given the visible variables (the image) is easy — it factorizes:
P(h | v) = ∏_j P(h_j | v), with P(h_j = 1 | v) = σ(c_j + Σ_i W_ij v_i); easy to compute.
Similarly, P(v | h) = ∏_i P(v_i | h).
Related models: Markov random fields, Boltzmann machines, log-linear models.
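The factorized conditionals above reduce to a sigmoid per unit; a minimal sketch (names and sizes illustrative, not the talk's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """Factorized posterior: P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W_ij)."""
    return sigmoid(v @ W + c)

def p_v_given_h(h, W, b):
    """By symmetry: P(v_i = 1 | h) = sigmoid(b_i + sum_j W_ij h_j)."""
    return sigmoid(h @ W.T + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b, c = np.zeros(4), np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 0.0])
ph = p_h_given_v(v, W, c)   # one independent Bernoulli probability per hidden unit
```

Because the conditionals factorize, a full Gibbs sweep over all hidden (or all visible) units is a single matrix-vector product, which is what makes RBM inference cheap.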
6. Model Learning
Given a set of i.i.d. training examples (images over the visible variables), we want to learn the model parameters θ = {W, b, c}.
Maximize the (penalized) log-likelihood objective, with a regularization term.
Derivative of the log-likelihood:
∂ log P(v; θ)/∂W = E_{P(h|v)}[v h⊤] − E_{P(v,h)}[v h⊤].
The second (model) expectation is difficult to compute: exponentially many configurations.
7. Model Learning
Given a set of i.i.d. training examples, we want to learn the model parameters θ. Maximize the (penalized) log-likelihood objective; the derivative of the log-likelihood again involves an intractable model expectation.
Approximate maximum likelihood learning:
• Contrastive Divergence (Hinton 2000)
• Pseudo-Likelihood (Besag 1977)
• MCMC-MLE estimator (Geyer 1991)
• Composite Likelihoods (Lindsay 1988; Varin 2008)
• Tempered MCMC (Salakhutdinov, NIPS 2009)
• Adaptive MCMC (Salakhutdinov, ICML 2010)
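Of the approximate learning schemes listed above, Contrastive Divergence is the simplest to sketch: replace the intractable model expectation with statistics from a one-step reconstruction. A minimal CD-1 sketch on toy binary data (hyperparameters and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.05):
    """One Contrastive Divergence (CD-1) parameter update."""
    ph0 = sigmoid(v0 @ W + c)                  # data-dependent statistics
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b)                # one-step reconstruction
    ph1 = sigmoid(pv1 @ W + c)                 # reconstruction-side statistics
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return ((v0 - pv1) ** 2).mean()            # reconstruction error (monitoring)

# Toy data: two repeated binary patterns.
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)
W = 0.01 * rng.normal(size=(4, 3))
b, c = np.zeros(4), np.zeros(3)
errs = [cd1_step(data, W, b, c) for _ in range(200)]
```

Reconstruction error is only a rough progress proxy (CD does not follow the true likelihood gradient), but it should fall as the model captures the two patterns.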
8. RBMs for Images
Define energy functions for various data modalities.
Gaussian-Bernoulli RBM: Gaussian visible variables (real-valued image intensities) with Bernoulli hidden variables.
9. RBMs for Images
Gaussian-Bernoulli RBM. Interpretation: a mixture of an exponential number of Gaussians,
P(v) = Σ_h P(h) N(v; b + σWh, σ²I),
where P(h) is an implicit prior over the hidden states and each component is a Gaussian.
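The Gaussian-Bernoulli conditionals can be sketched as follows: hiddens remain sigmoid-Bernoulli, while each visible unit becomes conditionally Gaussian whose mean selects one of the exponentially many mixture components. A sketch under the standard parameterization (σ fixed to 1 here; all names illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gb_p_h_given_v(v, W, c, sigma=1.0):
    """Gaussian-Bernoulli RBM: P(h_j = 1 | v) = sigmoid(c_j + sum_i (v_i/sigma) W_ij)."""
    return sigmoid((v / sigma) @ W + c)

def gb_v_mean_given_h(h, W, b, sigma=1.0):
    """Given h, each visible unit is Gaussian with mean b_i + sigma * sum_j W_ij h_j."""
    return b + sigma * (h @ W.T)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))
b, c = np.zeros(5), np.zeros(2)
v = rng.normal(size=5)                  # real-valued visibles, e.g. pixel intensities
ph = gb_p_h_given_v(v, W, c)
h = (rng.random(2) < ph) * 1.0
mu = gb_v_mean_given_h(h, W, b)         # mean of one of the 2^2 mixture components
```

Each of the 2^F binary hidden configurations h picks out a Gaussian with mean b + σWh, which is exactly the mixture interpretation on this slide.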
10. RBMs for Images and Text
Images: Gaussian-Bernoulli RBM trained on 4 million unlabelled images. Learned features (out of 10,000).
Text: Multinomial-Bernoulli RBM on bag-of-words vectors from the Reuters dataset: 804,414 unlabeled newswire stories. Learned features — "topics":
  russian:  russia, moscow, yeltsin, soviet
  clinton:  house, president, bill, congress
  computer: system, product, software, develop
  trade:    country, import, world, economy
  stock:    wall, street, point, dow
11. Collaborative Filtering
Multinomial visible variables v: user ratings. Bernoulli hidden variables h: user preferences.
Netflix dataset: 480,189 users, 17,770 movies, over 100 million ratings.
Learned features — "genre" (movies grouped by feature W1):
  Fahrenheit 9/11, Bowling for Columbine, The People vs. Larry Flynt, Canadian Bacon, La Dolce Vita
  Independence Day, The Day After Tomorrow, Con Air, Men in Black II, Men in Black
  Friday the 13th, The Texas Chainsaw Massacre, Children of the Corn, Child's Play, The Return of Michael Myers
  Scary Movie, Naked Gun, Hot Shots!, American Pie, Police Academy
State-of-the-art performance on the Netflix dataset.
Relates to Probabilistic Matrix Factorization (Salakhutdinov & Mnih, ICML 2007).
12. Multiple Application Domains
• Natural Images
• Text / Documents
• Collaborative Filtering / Matrix Factorization
• Video (Langford et al., ICML 2009; Lee et al.)
• Motion Capture (Taylor et al., NIPS 2007)
• Speech Perception (Dahl et al., NIPS 2010; Lee et al., NIPS 2010)
Same learning algorithm — multiple input domains.
But there are limitations on the types of structure that can be represented by a single layer of low-level features!
13. Talk Roadmap
• Unsupervised Feature Learning
  – Restricted Boltzmann Machines (RBM)
  – Deep Belief Networks
  – Deep Boltzmann Machines (DBM)
• Transfer Learning with Deep Models
• Multimodal Learning
14. Deep Belief Network
Input: image pixels.
Low-level features: edges. Built from unlabeled inputs.
15. Deep Belief Network
Unsupervised feature learning: internal representations capture higher-order statistical structure.
Input: image pixels.
Low-level features: edges. Built from unlabeled inputs.
Higher-level features: combinations of edges.
(Hinton et al., Neural Computation 2006)
16. Deep Belief Network
The Deep Belief Network's joint probability distribution factorizes:
P(v, h¹, h², h³) = P(v | h¹) P(h¹ | h²) P(h², h³),
where the top two layers (h², h³ with weights W³) form an RBM and the lower layers (W¹, W²) form a directed sigmoid belief network over h¹ and v.
17. DBNs for Classification
(Diagram: a 500–500–2000-unit stack of RBMs is pretrained layer by layer with weights W1–W3, unrolled, and fine-tuned with a 10-way softmax output: Pretraining → Unrolling → Fine-tuning.)
• After layer-by-layer unsupervised pretraining, discriminative fine-tuning by backpropagation achieves an error rate of 1.2% on MNIST. SVMs get 1.4% and randomly initialized backprop gets 1.6%.
• Clearly unsupervised learning helps generalization. It ensures that most of the information in the weights comes from modeling the input data.
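The greedy layer-by-layer pretraining described above can be sketched by stacking RBMs, each trained (here with CD-1) on the hidden activations of the layer below. A toy sketch under illustrative sizes, not the talk's 500–500–2000 architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=50, lr=0.05):
    """Minimal full-batch CD-1 trainer; returns weights and hidden biases."""
    W = 0.01 * rng.normal(size=(data.shape[1], n_hidden))
    b, c = np.zeros(data.shape[1]), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + c)
        h0 = (rng.random(ph0.shape) < ph0) * 1.0
        pv1 = sigmoid(h0 @ W.T + b)
        ph1 = sigmoid(pv1 @ W + c)
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / len(data)
        b += lr * (data - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, c

def pretrain_stack(data, layer_sizes):
    """Greedy pretraining: each new RBM models the previous layer's features."""
    layers, x = [], data
    for n in layer_sizes:
        W, c = train_rbm(x, n)
        layers.append((W, c))
        x = sigmoid(x @ W + c)      # deterministic up-pass feeds the next layer
    return layers

data = (rng.random((20, 8)) < 0.5) * 1.0
stack = pretrain_stack(data, [6, 4, 2])
```

Fine-tuning would then unroll this stack into a feed-forward network, attach a softmax output, and backpropagate; the pretrained weights serve as the initialization.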
19. Deep Generative Model
Model P(document). Trained unsupervised on bag-of-words vectors from the Reuters dataset: 804,414 newswire stories.
The learned code space separates topics such as: European Community Monetary/Economic, Interbank Markets, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Legal/Judicial, Government Borrowings, Accounts/Earnings.
(Hinton & Salakhutdinov, Science 2006)
20. Information Retrieval
(Figure: documents in a 2-D LSA space, with clusters for Interbank Markets, European Community Monetary/Economic, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Legal/Judicial, Government Borrowings, Accounts/Earnings.)
• The Reuters Corpus Volume II contains 804,414 newswire stories (randomly split into 402,207 training and 402,207 test).
• "Bag-of-words": each article is represented as a vector containing the counts of the 2000 most frequently used words in the training set.
21. Semantic Hashing
(Figure: a semantic hashing function maps documents — e.g. European Community Monetary/Economic, Disasters and Accidents, Government Borrowing, Energy Markets, Accounts/Earnings — into an address space where semantically similar documents land at nearby addresses.)
• Learn to map documents into semantic 20-D binary codes.
• Retrieve similar documents stored at the nearby addresses with no search at all.
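The "no search at all" retrieval above amounts to looking up addresses within a small Hamming ball of the query's binary code. A minimal sketch with hypothetical 20-bit codes (the codes here are random, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)

def hamming_ball_retrieval(query_code, codes, radius=1):
    """Return indices of documents whose binary code lies within a small
    Hamming ball of the query code -- no exhaustive similarity search."""
    dists = (codes != query_code).sum(axis=1)
    return np.flatnonzero(dists <= radius)

# Hypothetical 20-bit codes for 5 stored documents.
codes = rng.integers(0, 2, size=(5, 20))
query = codes[2].copy()
query[0] ^= 1                            # query differs from document 2 by one bit
hits = hamming_ball_retrieval(query, codes, radius=1)
```

In a real system the codes index directly into memory addresses, so retrieval enumerates the 2^0 + 20 + ... addresses in the Hamming ball rather than scanning all documents, as the slide describes.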
22. Searching Large Image Databases Using Binary Codes
• Map images into binary codes for fast retrieval.
• Small Codes: Torralba, Fergus, Weiss, CVPR 2008
• Spectral Hashing: Weiss, Torralba, Fergus, NIPS 2008
• Kulis and Darrell, NIPS 2009; Gong and Lazebnik, CVPR 2011
• Norouzi and Fleet, ICML 2011
23. Talk Roadmap
• Unsupervised Feature Learning
  – Restricted Boltzmann Machines (RBM)
  – Deep Belief Networks
  – Deep Boltzmann Machines (DBM)
• Transfer Learning with Deep Models
• Multimodal Learning
24. DBNs vs. DBMs
Deep Belief Network (directed lower layers) vs. Deep Boltzmann Machine (fully undirected), each with layers v, h¹, h², h³ and weights W¹, W², W³.
DBNs are hybrid models:
• Inference in DBNs is problematic due to explaining away.
• Only greedy pretraining, no joint optimization over all layers.
• Approximate inference is feed-forward: no bottom-up and top-down interaction.
We introduce a new class of models called Deep Boltzmann Machines.
25. Mathematical Formulation
Deep Boltzmann Machine with model parameters θ = {W¹, W², W³} over layers v, h¹, h², h³:
• Dependencies between hidden variables.
• All connections are undirected.
• Bottom-up and top-down: the conditional for each hidden layer combines bottom-up input from below with top-down input from above, e.g.
P(h¹_j = 1 | v, h²) = σ(Σ_i W¹_ij v_i + Σ_k W²_jk h²_k).
Unlike many existing feed-forward models: ConvNet (LeCun), HMAX (Poggio et al.), Deep Belief Nets (Hinton et al.).
26. Mathematical Formulation
Side by side: Neural Network (input → output), Deep Boltzmann Machine, and Deep Belief Network, each with layers v, h¹, h², h³ and weights W¹, W², W³.
Unlike many existing feed-forward models: ConvNet (LeCun), HMAX (Poggio), Deep Belief Nets (Hinton).
27. Mathematical Formulation
The same comparison — Neural Network, Deep Boltzmann Machine, Deep Belief Network — with the inference pass marked on each architecture.
Unlike many existing feed-forward models: ConvNet (LeCun), HMAX (Poggio), Deep Belief Nets (Hinton).
28. Mathematical Formulation
Deep Boltzmann Machine with model parameters θ; dependencies between hidden variables.
Maximum likelihood learning:
∂ log P(v; θ)/∂W = E_{data}[v h⊤] − E_{model}[v h⊤].
Problem: both expectations are intractable!
This is the general learning rule for undirected graphical models: MRFs, CRFs, factor graphs.
29. Previous Work
Many approaches for learning Boltzmann machines have been proposed over the last 20 years:
• Hinton and Sejnowski (1983)
• Peterson and Anderson (1987)
• Galland (1991)
• Kappen and Rodriguez (1998)
• Lawrence, Bishop, and Jordan (1998)
• Tanaka (1998)
• Welling and Hinton (2002)
• Zhu and Liu (2002)
• Welling and Teh (2003)
• Yasuda and Tanaka (2009)
Real-world applications involve thousands of hidden and observed variables with millions of parameters.
Many of the previous approaches were not successful for learning general Boltzmann machines with hidden variables. Algorithms based on Contrastive Divergence, Score Matching, Pseudo-Likelihood, Composite Likelihood, MCMC-MLE, and Piecewise Learning cannot handle multiple layers of hidden variables.
30. New Learning Algorithm
Posterior inference (conditional): approximate the conditional distribution.
Simulate from the model (unconditional): approximate the joint distribution.
(Salakhutdinov, 2008; NIPS 2009)
31. New Learning Algorithm
Posterior inference (conditional): approximate the conditional distribution — the data-dependent statistics.
Simulate from the model (unconditional): approximate the joint distribution — the data-independent statistics.
Learning matches the data-dependent and data-independent densities.
32. New Learning Algorithm
Key idea of our approach:
• Data-dependent (posterior inference over the conditional): Variational Inference, mean-field theory.
• Data-independent (simulating the joint distribution): Stochastic Approximation, Markov chain Monte Carlo.
Learning matches the two sets of statistics.
33. Sampling from DBMs
Sampling from a two-hidden-layer DBM by running a Markov chain: randomly initialize the states, alternately Gibbs-sample each layer given its neighbors, … and finally take a sample.
34. Stochastic Approximation
Run persistent Markov chains over time steps t = 1, 2, 3, …, updating the chain states x = {v, h¹, h²} and the parameters θ sequentially:
• Generate x^(t+1) by simulating from a Markov chain that leaves P(·; θ^t) invariant (e.g. a Gibbs or M-H sampler).
• Update θ^(t+1) by replacing the intractable model expectation with a point estimate at x^(t+1).
In practice we simulate several Markov chains in parallel.
Robbins and Monro, Ann. Math. Stats, 1957; L. Younes, Probability Theory 1989; Tieleman, ICML 2008.
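The update rule above can be sketched in the single-layer (RBM) case, where it coincides with Tieleman's persistent contrastive divergence: the negative statistics come from Markov chains that persist across parameter updates instead of restarting at the data. A minimal sketch; sizes, learning rate, and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sa_step(data, chains, W, b, c, lr=0.02):
    """One stochastic-approximation update with persistent chains."""
    ph_data = sigmoid(data @ W + c)                 # data-dependent statistics
    # Advance each persistent chain by one Gibbs sweep (leaves P(.; theta) invariant).
    ph = sigmoid(chains @ W + c)
    h = (rng.random(ph.shape) < ph) * 1.0
    pv = sigmoid(h @ W.T + b)
    chains[:] = (rng.random(pv.shape) < pv) * 1.0
    ph_model = sigmoid(chains @ W + c)
    # Replace the intractable model expectation with the chains' point estimate.
    W += lr * (data.T @ ph_data / len(data) - chains.T @ ph_model / len(chains))
    b += lr * (data.mean(axis=0) - chains.mean(axis=0))
    c += lr * (ph_data.mean(axis=0) - ph_model.mean(axis=0))

data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 5, dtype=float)
W = 0.01 * rng.normal(size=(4, 3))
b, c = np.zeros(4), np.zeros(3)
chains = (rng.random((10, 4)) < 0.5) * 1.0          # several chains in parallel
for _ in range(100):
    sa_step(data, chains, W, b, c)
```

For a full DBM the Gibbs sweep would alternate over v, h¹, h² as on the previous slide, and the learning rate would decay to satisfy the Robbins-Monro conditions.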
35. Stochastic Approximation
The update rule decomposes into the true gradient plus a noise term, with almost sure convergence guarantees as the learning rate decays.
Problem: for high-dimensional data the energy landscape is highly multimodal, so the Markov chain Monte Carlo sampler can get stuck.
Key insight: the transition operator can be any valid transition operator — Tempered Transitions, Parallel/Simulated Tempering.
Connections to the theory of stochastic approximation and adaptive MCMC.
36. Variational Inference
Posterior inference: approximate the intractable distribution P(h | v; θ) with a simpler, tractable distribution Q(h; µ).
Mean-field variational lower bound:
log P(v; θ) ≥ Σ_h Q(h; µ) log P(v, h; θ) + H(Q).
Minimize the KL divergence between the approximating and true distributions with respect to the variational parameters µ.
(Salakhutdinov & Larochelle, AI & Statistics 2010)
37. Variational Inference
Posterior inference: approximate the intractable distribution P(h | v; θ) with a simpler, tractable distribution Q(h; µ).
Mean-field: choose a fully factorized distribution Q(h; µ) = ∏_j q(h_j), with q(h_j = 1) = µ_j.
Variational inference: maximize the lower bound with respect to the variational parameters µ. This yields nonlinear fixed-point equations, e.g. for a two-hidden-layer DBM:
µ¹_j = σ(Σ_i W¹_ij v_i + Σ_k W²_jk µ²_k),   µ²_k = σ(Σ_j W²_jk µ¹_j).
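The fixed-point equations above are solved by simple iteration. A sketch for a two-hidden-layer DBM (biases omitted for brevity; sizes and names illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iter=30):
    """Iterate the mean-field fixed-point equations for a 2-layer DBM:
       mu1 = sigmoid(v W1 + mu2 W2^T),  mu2 = sigmoid(mu1 W2)."""
    mu1 = np.full(W1.shape[1], 0.5)     # initialize all marginals at 0.5
    mu2 = np.full(W2.shape[1], 0.5)
    for _ in range(n_iter):
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)   # bottom-up input plus top-down feedback
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 4))
W2 = rng.normal(size=(4, 3))
v = (rng.random(6) < 0.5) * 1.0
mu1, mu2 = mean_field(v, W1, W2)
```

Each sweep monotonically increases the variational lower bound for fixed θ; note that µ¹ sees both bottom-up (v) and top-down (µ²) terms, which is the bottom-up/top-down interaction the DBM slides emphasize.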
38. Variational Inference
Approximate the intractable distribution with a simpler, tractable distribution Q; alternate:
1. Variational inference (posterior, mean-field): maximize the lower bound with respect to the variational parameters.
2. Unconditional simulation (Markov chain Monte Carlo): apply stochastic approximation to update the model parameters.
Fast inference; learning can scale to millions of examples.
Almost sure convergence guarantees to an asymptotically stable point.
45. Deep Boltzmann Machine
Model P(image): a Bernoulli Markov random field trained on handwritten characters (e.g. Sanskrit) — 25,000 characters from 50 alphabets around the world.
• 3,000 hidden variables
• 784 observed variables (28 by 28 images)
• Over 2 million parameters
46. Deep Boltzmann Machine
Conditional simulation: P(image | partial image), Bernoulli Markov random field.
47. Handwriting Recognition
MNIST dataset: 60,000 examples of 10 digits.
  Logistic regression                        12.0%
  K-NN                                        3.09%
  Neural Net (Platt 2005)                     1.53%
  SVM (Decoste et al. 2002)                   1.40%
  Deep Autoencoder (Bengio et al. 2007)       1.40%
  Deep Belief Net (Hinton et al. 2006)        1.20%
  DBM                                         0.95%
Optical character recognition: 42,152 examples of 26 English letters.
  Logistic regression                        22.14%
  K-NN                                       18.92%
  Neural Net                                 14.62%
  SVM (Larochelle et al. 2009)                9.70%
  Deep Autoencoder (Bengio et al. 2007)      10.05%
  Deep Belief Net (Larochelle et al. 2009)    9.68%
  DBM                                         8.40%
Permutation-invariant version.
48. Deep Boltzmann Machine
Gaussian-Bernoulli Markov random field / Deep Boltzmann Machine with 12,000 latent variables.
Model P(image): planes. 96 by 96 images (stereo pairs), 24,000 training images.
49. Generative Model of 3-D Objects
24,000 examples: 5 object categories, 5 different objects within each category, 6 lighting conditions, 9 elevations, 18 azimuths.
51. Learning Part-based Hierarchy
Object parts: combinations of edges. Trained from multiple classes (cars, faces, motorbikes, airplanes).
Lee et al., ICML 2009
52. Robust Boltzmann Machines
• Build more complex models that can deal with occlusions or structured noise.
Components: a Gaussian RBM modeling clean faces, a binary RBM modeling occlusions, an inferred binary mask, and pixel-wise Gaussian noise relating them to the observed image.
Relates to: Le Roux, Heess, Shotton, and Winn, Neural Computation, 2011; Eslami, Heess, Winn, CVPR 2012; Tang et al., CVPR 2012
53. Robust Boltzmann Machines
Internal states of the RoBM during learning; inference on the test subjects; comparison to other denoising algorithms.
54. Spoken Query Detection
• 630-speaker TIMIT corpus: 3,696 training and 944 test utterances.
• 10 query keywords were randomly selected, and 10 examples of each keyword were extracted from the training set.
• Goal: for each keyword, rank all 944 utterances based on the utterance's probability of containing that keyword.
• Performance measure: the average equal error rate (EER).
  Learning Algorithm      AVG EER
  GMM Unsupervised        16.4%
  DBM Unsupervised        14.7%
  DBM (1% labels)         13.3%
  DBM (30% labels)        10.5%
  DBM (100% labels)        9.7%
(Figure: average EER vs. training ratio.)
(Yaodong Zhang et al., ICASSP 2012)
55. Learning Hierarchical Representations
Deep Boltzmann Machines learn hierarchical structure in features: edges, combinations of edges.
• Perform well in many application domains
• Combine bottom-up and top-down inference
• Fast inference: a fraction of a second
• Learning scales to millions of examples
So far: many examples, few categories. Next: few examples, many categories — transfer learning.
56. Talk Roadmap
• Unsupervised Feature Learning
  – Restricted Boltzmann Machines (RBM)
  – Deep Belief Networks
  – Deep Boltzmann Machines (DBM)
• Transfer Learning with Deep Models
• Multimodal Learning
(Example hierarchy: "animal" — horse, cow; "vehicle" — car, van, truck.)
57. One-shot Learning
"zarc"  "segway"
How can we learn a novel concept — a high-dimensional statistical object — from few examples?
58. Learning from Few Examples
SUN database: car, van, truck, bus, … — classes sorted by frequency.
Rare objects are similar to frequent objects.
60. Learning to Transfer Background Knowledge
Millions of unlabeled images → learn to transfer knowledge → some labeled images (Bicycle, Dolphin, Elephant, Tractor) → learn a novel concept from one example.
Test: what is this?
61. Learning to Transfer Background Knowledge
Millions of unlabeled images → learn to transfer knowledge → some labeled images → learn a novel concept from one example. Test: what is this?
This is a key problem in computer vision, speech perception, natural language processing, and many other domains.
62. One-Shot Learning
Hierarchical Bayesian models with a hierarchical prior:
  Level 3 (global): {τ⁰, α⁰}
  Level 2 (super-categories Animal, Vehicle, …): {µᵏ, τᵏ, αᵏ}
  Level 1 (basic classes Cow, Horse, Sheep, Truck, Car): {µᶜ, τᶜ}
The posterior probability of parameters given the training data D combines the probability of the observed data given the parameters with the prior probability of the weight vector W.
• Fei-Fei, Fergus, and Perona, TPAMI 2006
• E. Bart, I. Porteous, P. Perona, and M. Welling, CVPR 2007
• Miller, Matsakis, and Viola, CVPR 2000
• Sivic, Russell, Zisserman, Freeman, and Efros, CVPR 2008
63. Hierarchical-Deep Models
HD Models: compose hierarchical Bayesian models with deep networks — two influential approaches from unsupervised learning.
Deep Networks (part-based hierarchy; cf. Marr and Nishihara, 1978):
• learn multiple layers of nonlinearities.
• are trained in unsupervised fashion — unsupervised feature learning — with no need to rely on human-crafted input representations.
• labeled data is used only to slightly adjust the model for a specific task.
Hierarchical Bayes (category-based hierarchy; cf. Collins & Quillian, 1969):
• explicitly represents category hierarchies for sharing abstract knowledge.
• explicitly identifies only a small number of parameters that are relevant to the new concept being learned.
64. Motivation
Learning to transfer knowledge (example: Segway):
• Super-category: "A segway looks like a funny kind of vehicle."
• Higher-level features, or parts, shared with other classes: wheel, handle, post.
• Lower-level features: edges, compositions of edges.
Hierarchy: super-class → parts → edges (deep).
65. Hierarchical Generative Model
A Hierarchical Latent Dirichlet Allocation model over the class hierarchy ("animal": horse, cow; "vehicle": car, van, truck; K topics), on top of a DBM model of images.
Lower-level generic features: edges, combinations of edges.
(Salakhutdinov, Tenenbaum, Torralba, 2011)
66. Hierarchical Generative Model
A Hierarchical Latent Dirichlet Allocation model over the class hierarchy ("animal": horse, cow; "vehicle": car, van, truck).
Hierarchical organization of categories:
• expresses priors on the features that are typical of different kinds of concepts
• modular data-parameter relations
Higher-level class-sensitive features (K topics):
• capture the distinctive perceptual structure of a specific concept
DBM model — lower-level generic features:
• edges, combinations of edges
(Salakhutdinov, Tenenbaum, Torralba, 2011)
67. Intuition
LDA generative process: α → π (Pr(topic | doc)) → z → θ → w (Pr(word | topic)).
Words ⇔ activations of the DBM's top-level units.
Topics ⇔ distributions over top-level units, or higher-level parts.
DBM generic features play the role of words; LDA high-level features play the role of topics; images play the role of documents.
Each topic is made up of words; each document is made up of topics.
69. Hierarchical Deep Model
The tree hierarchy of classes ("animal": horse, cow; "vehicle": car, van, truck) is learned.
Nested Chinese Restaurant Process prior: a nonparametric prior over tree structures.
K topics.
70. Hierarchical Deep Model
The tree hierarchy of classes is learned.
Nested Chinese Restaurant Process prior: a nonparametric prior over tree structures.
Hierarchical Dirichlet Process prior: a nonparametric prior allowing categories to share higher-level features, or parts (K topics).
71. Hierarchical Deep Model
The tree hierarchy of classes is learned. Unlike standard statistical models, in addition to inferring parameters, we also infer the hierarchy for sharing those parameters.
Nested Chinese Restaurant Process prior: a nonparametric prior over tree structures.
Hierarchical Dirichlet Process prior: a nonparametric prior allowing categories to share higher-level features, or parts (K topics).
Conditional Deep Boltzmann Machine: enforce (approximate) global consistency through many local constraints.
72. CIFAR Object Recognition
50,000 images of 100 classes (32 x 32 pixels x 3 RGB), plus 4 million unlabeled images.
The tree hierarchy of classes ("animal": horse, cow; "vehicle": car, van, truck) is learned: higher-level class-sensitive features on top of lower-level generic features.
Inference: Markov chain Monte Carlo — later!
73. Learning to Learn
The model learns how to share the knowledge across many visual categories.
Learned super-class hierarchy ("global" at the root), e.g. "aquatic animal" (dolphin, turtle, shark, ray), "fruit" (apple, orange, sunflower), "human" (girl, baby, man, woman, …) over the basic-level classes.
Learned higher-level class-sensitive features; learned low-level generic features.
74. Learning to Learn
The model learns how to share the knowledge across many visual categories.
(Figure: the full learned super-class hierarchy over the 100 CIFAR classes — groupings such as reptiles (crocodile, snake, lizard, turtle), buildings (castle, bridge, skyscraper, house), vehicles (bus, truck, train, tank, tractor, streetcar), large carnivores (leopard, fox, tiger, lion, wolf), aquatic animals (dolphin, ray, whale, shark), trees (pine, oak, maple, willow), containers (bottle, can, lamp, bowl, cup), rodents and small mammals (squirrel, otter, skunk, shrew, porcupine, chimpanzee, beaver, mouse, raccoon, hamster, rabbit, possum), fruit and flowers (apple, pear, pepper, orange, sunflower), and humans (man, boy, girl, woman) — with learned higher-level class-sensitive features and low-level generic features.)
75. Sharing Features
Learning to learn: learning a hierarchy for sharing parameters enables rapid learning of a novel concept.
Within the "fruit" super-class (apple, orange, sunflower), the model shares shape and color features; reconstructions of real images (Dolphin, Sunflower, Orange, Apple) show the learned factors.
(Figure: ROC curves — detection rate vs. false alarm rate — for "Sunflower" learned from 1, 3, and 5 examples, compared against pixel-space distance.)
76. Object Recognition
Area under the ROC curve for same/different (1 new class vs. 99 distractor classes), as a function of the number of training examples (1, 3, 5, 10, 50), averaged over 40 test classes.
Models compared: HDP-DBM, HDP-DBM (no super-classes), DBM, LDA (class conditional), GIST.
Our model outperforms standard computer vision features (e.g. GIST).
77. Handwritten Character Recognition
25,000 characters. Learned higher-level features: strokes (shown for "alphabet 1" and "alphabet 2", characters 1–5). Learned lower-level features: edges.
78. Handwritten Character Recognition
Area under the ROC curve for same/different (1 new class vs. 1000 distractor classes), as a function of the number of training examples (1, 3, 5, 10), averaged over 40 test classes.
Models compared: HDP-DBM, HDP-DBM (no super-classes), DBM, LDA (class conditional), Pixels.
79. Simulating New Characters
Real data within a super class (global → super class 1, super class 2 → class 1, class 2, new class), alongside simulated new characters.
80.–85. Simulating New Characters
(The same layout as slide 79, repeated with further examples of simulated new characters.)
86. Learning from Very Few Examples
Given 3 examples of a new class, the model infers the super-class and generates conditional samples in the same class.
95. Hierarchical-Deep Models
So far we have considered models combining directed + undirected components (e.g. the undirected DBM beneath the directed "animal"/"vehicle" topic hierarchy with K topics).
Deep Lambertian Networks combine the elegant properties of the Lambertian model of image formation with Gaussian RBMs (and Deep Belief Nets, Deep Boltzmann Machines). The learned low-level features replace hand-crafted ones such as GIST and SIFT.
Tang et al., ICML 2012
96. Deep Lambertian Networks
Model specifics: the observed image is generated from inferred surface normals, albedo, and light source via a Deep Lambertian Net.
Inference: Gibbs sampler. Learning: stochastic approximation.
97. Deep Lambertian Networks
Yale B Extended database: face relighting from one test image and from two test images.
98. Deep Lambertian Networks
Recognition as a function of the number of training images for 10 test subjects; one-shot recognition.
99. Recursive Neural Networks
Recursive structure learning: local recursive networks predict whether to merge the two inputs as well as predicting the label. Uses max-margin estimation.
Socher et al., ICML 2011
101. Learning from Few Examples
SUN database: car, van, truck, bus, … — classes sorted by frequency.
Rare objects are similar to frequent objects.
102. Learning from Few Examples
chair, armchair, swivel chair, deck chair, … — classes sorted by frequency.
103. Generative Model of Classifier Parameters
Many state-of-the-art object detection systems use sophisticated models based on multiple parts with separate appearance and shape components.
Detect objects by testing sub-windows and scoring the corresponding test patches with a linear function. Image-specific features: concatenation of the HOG feature pyramid at multiple scales. (Felzenszwalb, McAllester & Ramanan, 2008)
Define a hierarchical prior over the parameters of the discriminative model and learn the hierarchy.
104. Generative Model of Classifier Parameters
Hierarchical Bayes: by learning hierarchical structure, we can improve the current state-of-the-art.
SUN dataset: 32,855 examples of 200 categories (with class sizes ranging from 185 down to 27 and 12 examples).
The hierarchical model outperforms the single-class baseline.
105. Talk Roadmap
• Unsupervised Feature Learning
  – Restricted Boltzmann Machines (RBM)
  – Deep Belief Networks
  – Deep Boltzmann Machines (DBM)
• Transfer Learning with Deep Models
• Multimodal Learning
106. Multi-Modal Input
Learning systems that combine multiple input domains: images, text & language, video, laser scans, speech & audio, time-series data.
One of the key challenges: inference.
Develop learning systems that come closer to displaying human-like intelligence.
107. Multi-Modal Input
Learning systems that combine multiple input domains (image + text) give more robust perception.
• Ngiam et al., ICML 2011, used deep autoencoders (video + speech)
• Guillaumin, Verbeek, and Schmid, CVPR 2011
• Huiskes, Thomee, and Lew, Multimedia Information Retrieval, 2010
• Xing, Yan, and Hauptmann, UAI 2005
108. Training Data
Samples from the MIR Flickr dataset (Creative Commons License): images paired with user tags, e.g. pentax, k10d, camera, jahdakine, kangarooisland, lightpainting, southaustralia, sa, reflection, australia, doublepaneglass, australiansealion, 300mm, wowiekazowie, sandbanks, lake, lakeontario, sunset, top20butterflies, walking, beach, purple, sky, water, clouds, overtheexcellence, mickikrimmel, mickipedia, headshot — and some images with no text at all.
109. Multi-Modal Input
• Improve classification (e.g. SEA / NOT SEA, given tags such as pentax, k10d, kangarooisland, southaustralia, sa, australia, australiansealion, 300mm).
• Fill in missing modalities (e.g. generate tags: beach, sea, surf, strand, shore, wave, seascape, sand, ocean, waves).
• Retrieve data from one modality when queried using data from another modality.