NIPS2007: sensory coding and hierarchical representations

Sensory Coding &
Hierarchical
Representations
Michael S. Lewicki

Computer Science Department &
Center for the Neural Basis of Cognition
Carnegie Mellon University

“IT”

V4

V2
retina LGN V1

NIPS 2007 Tutorial 2 Michael S. Lewicki ! Carnegie Mellon

Fig. 1. Anatomical and functional hierarchies in the macaque visual system. The human Hedgé and Felleman, 2007
from visual system (not shown) is
believed to be roughly similar. A, A schematic summary of the laminar patterns of feed-forward (or ascending) and feed-
back (or descending) connections for visual area V1. The laminar patterns vary somewhat from one visual area to the next.
But in general, the connections are complementary, so that the ascending connections terminate in the granular layer (layer
4) and the descending connections avoid it. The connections are generally reciprocal, in that an area that sends feed-
forward connections to another area also receives feedback connections from it. The visual anatomical hierarchy is defined
based on, among other things, the laminar patterns of these interconnections among the various areas. See text for details.
B, A simplified version of the visual anatomical hierarchy in the macaque monkey. For the complete version, see Felleman
and Van Essen (1991). See text for additional details. AIT = anterior inferotemporal; LGN = lateral geniculate nucleus;
LIP = lateral intraparietal; MT = middle temporal; MST = medial superior temporal; PIT = posterior inferotemporal; V1 =
visual area 1; V2 = visual area 2; V4 = visual area 4; VIP = ventral intraparietal. C, A model of hierarchical processing of
visual information proposed by David Marr (1982). D, A schematic illustration of the presumed parallels between the
anatomical and functional hierarchies. It is widely presumed 3 only that visual processingMichael S. Lewicki ! Carnegie Mellon the
NIPS 2007 Tutorial not is hierarchical but also that

Palmer and Rock, 1994


V1 simple cells
!"#"$%&'"()&"*+(,)(-(./(0&1$*"(#"

2"345"*&06 789-:

Hubel and Wiesel, 1959 DeAngelis, et al, 1995

Out of the retina
• > 23 distinct neural pathways, no simple function
division
• Suprachiasmatic nucleus: circadian rhythm
• Accessory optic system: stabilize retinal image
• Superior colliculus: integrate visual and auditory
information with head movements, direct eyes
• Pretectum: plays adjusting pupil size, track large
moving objects
• Pregeniculate: cells responsive to ambient light
• Lateral geniculate (LGN): main “relay” to visual
cortex; contains 6 distinct layers, each with 2
sublayers. Organization very complex


V1 simple cell integration of LGN cells
!"#"$%&"'(()'*+,-+)".*)(/"*'"&0.1/("2(//&

!"#$%&'(")&* +,%"-

345
2(//&

60.1/(
2(//

Hubel and Wiesel, 1963
!78(/"#"$0(&(/9":;<=


Hyper-column model

Hubel and Wiesel, 1978


Anatomical circuitry in V1

54 CALLAWAY

Figure 1 Contributions of individual neurons to local excitatory connections between cortical
layers. From left to right, typical spiny from Callaway, 1998 4C, 2–4B, 5, and 6 are shown. Dendritic
neurons in layers
arbors are illustrated with thick lines and axonal arbors with finer lines. Each cell projects axons
specifically to only a subset of the layers. This simplified diagram focuses on connections between
layers 2–4B, 4C, 5, and 6 and does not incorporate details of circuitry specific for subdivisions of
NIPS 2007 Tutorial Michael S. Lewicki ! Carnegie
these layers. A model of the interactions between these neurons is shown in Figure 2. The neurons Mellon
11

Where is this headed?

Ramon y Cajal


A wing would be a most mystifying structure
if one did not know that birds ﬂew.

Horace Barlow, 1961

An algorithm is likely to be understood more
readily by understanding the nature of the
problem being solved than by examining the
mechanism in which it is embodied.

David Marr, 1982

A theoretical approach

• Look at the system from a functional perspective:
What problems does it need to solve?

• Abstract from the details:
Make predictions from theoretical principles.

• Models are bottom-up; theories are top-down.

What are the relevant computational principles?


Representing structure in natural images

image from Field (1994)



Need to describe all image structure in the scene.
What representation is best?



V1 simple cells
!"#"$%&'"()&"*+(,)(-(./(0&1$*"(#"

2"345"*&06 789-:

Hubel and Wiesel, 1959 DeAngelis, et al, 1995

Oriented Gabor models of individual simple cells

ﬁgure from Daugman, 1990; data from Jones and Palmer, 1987


2D Gabor wavelet captures spatial structure

• 2D Gabor functions

• Wavelet basis generated by
dilations, translations, and
rotations of a single basis function

• Can also control phase and
aspect ratio

• (drifting) Gabor functions are what
the eye “sees best”


Recoding with Gabor functions

Pixel entropy = 7.57 bits Recoding with 2D Gabor functions
Coefﬁcient entropy = 2.55 bits


How is the V1 population organized?


A general approach to coding: redundancy reduction

Correlation of adjacent pixels
1

0.8

0.6

0.4

0.2

0
0 0.2 0.4 0.6 0.8 1

Redundancy reduction is equivalent to efﬁcient coding.


Describing signals with a simple statistical model
Principle
Good codes capture the statistical distribution of sensory patterns.

How do we describe the distribution?

• Goal is to encode the data to desired precision
x = a1 s1 + a2 s2 + · · · + aL sL +
= As +

• Can solve for the coefﬁcients in the no noise case

ˆ=A
s −1
x


An algorithm for deriving efficient linear codes: ICA
Learning objective: Using independent component analysis
(ICA) to optimize A:
maximize coding efficiency
maximize P(x|A) over A. ∂
∆A ∝ AA log P (x|A)
T
∂A
= −A(zsT − I)
Probability of the pattern ensemble is:
where z = (log P(s))’.
P (x1 , x2 , ..., xN |A) = P (xk |A)
k

This learning rule:
To obtain P(x|A) marginalize over s:
• learns the features that capture the
most structure
P (x|A) = ds P (x|A, s)P (s)
• optimizes the efficiency of the code
P (s)
=
| det A|
What should we use for P(s)?


Modeling Non-Gaussian distributions

• Typical coeff. distributions of natural
signals are non-Gaussian.
!=2 !=4

!6 !4 !2 0 2 4 6 !6 !4 !2 0 2 4 6
s s
1 2


The generalized Gaussian distribution
1 q
P (x|q) ∝ exp(− |x| )
2

• Or equivalently, and exponential power distribution (Box and Tiao, 1973):
2/(1+β)
ω(β) x−µ
P (x|µ, σ, β) = exp −c(β)
σ σ

• " varies monotonically with the kurtosis, #2:
!=!0.75 " =!1.08 !=!0.25 " =!0.45 !=+0.00 (Normal) " =+0.00
2 2 2

!=+0.50 (ICA tanh) "2=+1.21 !=+1.00 (Laplacian) "2=+3.00 !=+2.00 "2=+9.26


Modeling Gaussian distributions with PCA

•
!=0 !=0
Principal component
analysis (PCA) describes
the principal axes of
variation in the data
distribution.
• This is equivalent to ﬁtting
the data with a multivariate
Gaussian. !6 !4 !2 0 2 4 6 !6 !4 !2 0 2 4 6
s1 s2


Modeling non-Gaussian distributions

•
!=2 !=4
What about non-Gaussian
marginals?

!6 !4 !2 0 2 4 6 !6 !4 !2 0 2 4 6
s1 s2

• How would this distribution
be modeled by PCA?


Modeling non-Gaussian distributions

•
!=2 !=4
What about non-Gaussian
marginals?

!6 !4 !2 0 2 4 6 !6 !4 !2 0 2 4 6
s1 s2

• How would this distribution
be modeled by PCA?

• How should the distribution
be described?
The non-orthogonal ICA
solution captures the non-
Gaussian structure


Efﬁcient coding of natural images: Olshausen and Field, 1996

naturalscene
nature scene visual input
visual input units image basis functions
receptive fields

...

...
...
before after
learning learning

Network weights are adapted to maximize coding efﬁciency:
minimizes redundancy and maximizes the independence of the outputs

Model predicts local and global receptive ﬁeld properties

Learned basis for natural images Overlaid basis function properties

from Lewicki and Olshausen, 1999


Algorithm selects best of many possible sensory codes
Haar
Learned Wavelet

Gabor Fourier PCA

Theoretical perspective: Not edge “detectors.” from Lewicki and Olshausen, 1999

An efﬁcient code for natural images.

Comparing coding efﬁciency on natural images
Comparing codes on natural images
8

7

6
estimated bits per pixel

5

4

3

2

1

0
Gabor wavelet Haar Fourier PCA learned


Responses in primary visual cortex to visual motion

from Wandell, 1995


Sparse coding of time-varying images (Olshausen, 2002)
"#$!%

!( !

'
"#$&)')!!!(%
&
!

'
*$&)')!% ...
&

!

I(x, y, t) = ai (t )φi (x, y, t − t ) + (x, y, t)
i t

= ai (t) ∗ φi (x, y, t) + (x, y, t)
i

Sparse decomposition of image sequences
convolution posterior maximum
%#$
!
"% "%
%#5
"#$
!% !% %#7
"
7% 7% %#!

%#$ %#"
5% 5%
34),,'3')/&

34),,'3')/&
$% % $% %

!%#$ !%#"
6% 6%
!%#!
8% !" 8%
!%#7
9% !"#$ 9%
!%#5
:% :%
!!
!%#$
!% 5% 6% !% 5% 6%
&'()*+,-.()*/0(1)-2 &'()*+,-.()*/0(1)-2

reconstruction

input sequence

from Olshausen, 2002

Learned spatio-temporal basis functions

time

basis
function



Animated spatial-temporal basis functions



$ explains data from principles
Theory
$ requires idealization and abstraction

$ code signals accurately and efﬁciently
Principle
$ adapted to natural sensory environment

Idealization $ cell response is linear

Methodology $ information theory, natural images

$ explains individual receptive ﬁelds
Prediction
$ explains population organization


How general is the efﬁcient coding principle?

Can it explain auditory coding?


Limitations of the linear model

• linear
• only optimal for block, not whole signal
• no phase-locking
• representation depends on block alignment

NIPS 2007 Tutorial Michael S. Lewicki ! Carnegie Mellon

A continuous filterbank does not form an efficient code

⊗ ←

→

→

→

Goal:
find a representation that is both time-relative and efficient


Efﬁcient signal representation using time-shiftable kernels (spikes)
Smith and Lewicki (2005) Neural Comp. 17:19-45

M nm
x(t) = sm,i φm (t − τm,i ) + (t)
m=1 i=1

• Each spike encodes the precise time and magnitude of an acoustic feature
• Figure 3: Smith, NC ms 2956
Two important theoretical abstractions for “spikes”
- not binary
- not probabilistic
• Can convert to a population of stochastic, binary spikes

Coding audio signals with spikes

5000
Kernel CF (Hz)

2000

1000
K

500
200
100
0 5 10 15 20 25 ms
Input

Reconstruction

Residual


Kernel CF (Hz)
2000

Comparing a spike code to a spectrogram
1000

500
a
200
100

b
c
5000
5000
FrequencyCF (Hz)

2000
Kernel (Hz)

4000

1000
3000
500
2000
200
a
1000
100

c
b 100 200 300 400 500 600 700 800 ms

5000
Frequency (Hz)

4000
Kernel CF (Hz)

2000
3000
1000
2000
500
1000
200
100
100 200 300 400 500 600 700 800 ms
c How do we compute the spikes?
NIPS 2007 Tutorial
5000 47 Michael S. Lewicki ! Carnegie Mellon

“can” with ﬁlter-threshold

5000
Kernel CF (Hz)

2000

1000

500

200
100
0 5 10 15 20 ms

Input

Reconstruction

Residual


Spike Coding with Matching Pursuit
1. convolve signal with kernels


2. ﬁnd largest peak over convolution set


3. ﬁt signal with kernel


4. subtract kernel from signal, record
spike, and adjust convolutions


5. repeat


5. repeat . . .


5. repeat . . .
6. halt when desired ﬁdelity is reached


“can” 5 dB SNR, 36 spikes, 145 sp/sec
5000
Kernel CF (Hz)

2000

1000

500

200
100
0 50 100 150 200 ms

Input

Reconstruction

Residual


5000
Kernel CF (Hz)

2000

1000

500

200
100
0 50 100 150 200 ms

Input

Reconstruction

Residual


Efﬁcient auditory coding with optimized kernel shapes
Smith and Lewicki (2006) Nature 439:978-982

What are the optimal kernel shapes?

M nm
x(t) = sm,i φm (t − τm,i ) + (t)
m=1 i=1

Adapt algorithm of Olshausen (2002)


Optimizing the probabilistic model
M nm
x(t) = sm φm(t − τim) + (t),
i
m=1 i=1

p(x|Φ) = p(x|Φ, s, τ )p(s)p(τ )dsdτ

≈ p(x|Φ, s, τ )p(ˆ)p(ˆ)
ˆ ˆ s τ

(t) ∼ N (0, σ )
Learning (Olshausen, 2002):

∂ ∂
log p(x|Φ) = log p(x|Φ, s, τ ) + log p(ˆ)p(ˆ)
ˆ ˆ s τ
∂φm ∂φm
M n
1 ∂ m
= [x − sm φm(t − τim)]2
ˆi
2σε ∂φm m=1 i=1
1
= [x − x]
ˆ ˆi
sm
σε i

Also adapt kernel lengths

Adapting the optimal kernel shapes


Kernel functions optimized for coding speech


Quantifying coding efficiency
5000

1. fit signal
Kernel CF (Hz)
2000
2. quantize time and amplitude values
1000 spike code
3. prune zero values
4. measure coding efficiency using the
500

entropy of quantized values
200

5. reconstruct signal using quantized
100
0 50 150
values
6. measure fidelity using Input
signal-to-noise
ratio (SNR) of residual error original

• identical procedure for other codes
(e.g. Fourier and wavelet)
Reconstruction
reconstruction

M nm
x(t) = sm,i φm (tResidual) + (t)
− τm,i
m=1 i=1 residual
error

NIPS 2007 Tutorial Michael S. Lewicki ! Carnegie Mellon
67

Coding efﬁciency curves

40

35

30

25
SNR (dB)

+14 dB
20

15 4x more efﬁcient

10 Spike Code: adapted
Spike Code: gammatone
5 Block Code: wavelet
Block Code: Fourier
0
0 10 20 30 40 50 60 70 80 90
Rate (Kbps)


Using efficient coding theoryComparing atheoretical a spectrogram
to make spike code to predictions

a

b
5000 Natural Sound Environment

Kernel CF (Hz)
2000
evolution theory
1000

500
A simple model of auditory coding
200

physiological data: 100
optimal kernels:
How do we compute the spikes?
sound
• auditory nerve non-linearities
waveform
filterbank filter shapes
static c stochastic
spiking
population
spike code
• properties
• population trends 5000 • coding efficiency
Frequency (Hz)

4000
auditory revcor filters: gammatones
5 3000
Speech Prediction

?
Auditory Nerve Filters

1 2 3 4 5 6 7 8 9 10
2000
time (ms) 2

1000
Bandwidth (kHz)

1 More theoretical questions:
• Why gammatones?
1 2 3 4 5
time (ms)
0.5
100 200 300
only compare 500 the data after optimizing
400
to 600 700 800 ms

• Why spikes?
Michael S. Lewicki ! Carnegie Mellon
• How is sound coded by
we do not fit theBad Zwischenahn ! Aug 21, 2004
data
0.2

1 2 3 4 5
the spike population?
time (ms) 0.1
0.1 0.2 0.5 1 2 5
Center Frequency (kHz)
How do we develop a theory? Prediction depends on sound ensemble.
deBoer and deJongh, 1978 Carney and Yin, 1988
Michael S. Lewicki ! Carnegie Mellon Bad Zwischenahn ! Aug 21, 2004


Learned kernels share features of auditory nerve ﬁlters

Auditory nerve ﬁlters Optimized kernels
from Carney, McDuffy, and Shekhter, 1999 scale bar = 1 msec

Learned kernels closely match individual auditory nerve filters

for each kernel find closet matching auditory nerve filter
in Laurel Carney’s database of ~100 filters.


Learned kernels overlaid on selected auditory nerve ﬁlters

For almost all learned kernels there is a closely matching auditory nerve ﬁlter.


Coding of a speech consonant

a Input

Reconstruction

Residual

b
5000
Kernel CF (Hz)

2000
4800
1000 4000
500 3400

200 38 39 40
100
c
5000
Frequency (Hz)

4000
3000
2000
1000

10 20 30 40 50 60ms


How is this achieving an efﬁcient, time-relative code?

Time-relative coding of glottal pulses

a

b
5000
Kernel CF (Hz)

2000

1000

500

200
100
0 20 40 60 80 100 ms


Theory

$ code signals accurately and efﬁciently
Principle

Idealization $ analog spikes

$ information theory, natural sounds
Methodology
$ optimization

Prediction $ explains individual receptive ﬁelds


A gap in the theory?
-
- + -
-

Learned

from Hubel, 1995


Redundancy reduction for noisy channels (Atick, 1992)

s input x recoding Ax output y
channel A channel
y = x +" y = Ax +!

Mutual information

P (x, s)
I(x, s) = P (x, s) log2
s,x
P (s)P (x)

I(x, s) = 0 iﬀ P (x, s) = P (x)P (s), i.e. x and s are independent.


Profiles of optimal filters

• high SNR
- - reduce redundancy
- + -
-
- center-surround

-

- + -

-

+ • low SNR
- average
- low-pass filter

• matches behavior of
retinal ganglion cells

An observation: Contrast sensitivity of ganglion cells

Luminance level decreases one log unit each time we go to lower curve.


Natural images have a 1/f amplitude spectrum

Field, 1987


Components of predicted filters
whitening filter

low-pass filter

optimal filter

from Atick, 1992


Predicted contrast sensitivity functions match neural data

from Atick, 1992


Robust coding of natural images

• Theory reﬁned:
- image is noisy and blurred
- neural population size changes
- neurons are noisy

from Hubel, 1995

(a) Undistorted image (b) Fovea retinal image (c) 40 degrees eccentricity
Intensity

-4 0 4 -4 0 4
Visual angle [arc min]
from Doi and Lewicki, 2006


Problem 1: Real neurons are “noisy”

Estimates of neural information capacity

system (area) stimulus bits / sec bits / spike

ﬂy visual (H1) motion 64 ~1

monkey visual (MT) motion 5.5 - 12 0.6 - 1.5

frog auditory (auditory nerve) noise & call 46 & 133 1.4 & 7.8

Salamander visual (ganglinon cells) rand. spots 3.2 1.6

cricket cercal (sensory afferent) mech. motion 294 3.2

cricket cercal (sensory afferent) wind noise 75 - 220 0.6 - 3.1

cricket cercal (10-2 and 10-3) wind noise 8 - 80 avg. = 1

Electric ﬁsh (P-afferent) amp. modulation 0 - 200 0 - 1.2

Limited capacity neural codes needBorst and Theunissen, 1999
After to be robust.


Traditional codes are not robust

encoding neurons

sensory input

Original


Traditional codes are not robust

encoding neurons
Add
noise equivalent
to 1 bit precision
sensory input

Original 1x efﬁcient coding reconstruction (34% error)

1 bit precision


al., to pairwise correla- al.,
st 2001; Schneidman et
e correlations that can
not just to pairwise correla-
Redundancy plays an important role in neural coding
ental Procedures). By can
to all correlations that
xperimental Procedures). By
cy can be measured
model ofcan be measured
undancy
the light re-
g any model of the light re-
ancy was defined as
edundancy was defined as
Response of salamander retinal ganglion cells to natural movies
al information that that the
mutual information the
ne conveyed aboutstim-
veyed about the the stim-
information conveyed
d the information conveyed
a As information rates
).,Rb;S). As information rates
n of ganglion cells, we
pulation of ganglion cells, we
as a fraction of the minimum
action of the minimum
al cells cells (Reichal., al.,
dividual (Reich et et

min{I(Ra;S), I(Rb;S)}, is the
ancy I(Rb;S)}, is the
Ra;S),between two cells, so
between be nocells, sothan
ncy can two greater
fractional redundancy

an be 0 when the two cells
cy is no greater than
0 when the two cells
mation about the stimulus; its
lls encode exactly the same
about the stimulus; its
ell’s information is a subset
ode exactly the same
ues of the redundancy mean
formation is a subset
c.
the redundancy mean

s of Ganglion Cells
rocessing under realistic vi-
anglion Cells
ted retinas with a set of nat-
sing underarealistic of envi-
represent variety vi- From Puchalla et al, 2005
cell pair distance (%m)
important a set of nat- of
inas with characteristic
sent a NIPS 2007 Tutorial by the
variety of envi-
field motion caused 87 Michael S. Lewicki ! Carnegie Mellon

tance, the code. great numbersee, coding elementsintroduces redundancy into the cod
ty of what if a As we will of robust coding are available while their coding pre
o be Robust codingthe assumed noise in 2005,2006)
case, separated from (Doi and Lewicki, the representation, unlike PCA or ICA the
the apparent dimensionality could be increased (instead of reduced) while tha
limited. Robust coding should make optimal use of such an arbitrary number of cod
e present the optimal linear encoder andcoding introduces redundancy into the code
lity of the code. As we will Encoder see, robust decoder when the coefﬁcients are noisy, an
Decoder
to be separated from the assumedthe different representation, unlike PCA or ICApaper r
to the number of units and to noise in the data and noise conditions. This that
x W u A ^
x
, we formulate the problem. Then, in section III, we analyze the solutions for the g
we present theand two-dimensional case. Next, in when theIV, we applyare noisy, and
ons for one- optimal Data encoder and decoder section coefﬁcients the proposed
linear Noiseless Reconstruction
Representation
y to theits considerable robustness different data and noise conditions. This codes. is
nstrate number of units and to the compared against conventional image paper Fi
udies. formulate the problem. Then, in section III, we analyze the solutions for the gen
II, we
ions for one- and two-dimensional case. Next, in section IV, we apply the proposed ro
Model limited precision using
onstrate its channel noise: robustnessPcompared FORMULATION
additive
considerable II. Channel Noise against conventional image codes. Fina
ROBLEM
n
udies. (Fig. 1), we assume that the data is N -dimensional with zero mean and cov
model Encoder Decoder
rices W ∈ RM ×N and A ∈ RN ×M . For each data point x, its representation r in
r
FORMULATION A ^
x perturbed byII.uP ROBLEM noise (i.e., channel noise) n ∼ N (0, σ 2
W x
rough matrix W, the additive n
Data Noiseless Noisy Reconstruction
model (Fig. 1), we assume that the data is Representation
Representation
N -dimensional with zero mean and covari
trices W ∈ RM ×N and A ∈ r N ×M . For each= u + n. x, its representation r in th
R = Wx + n data point
channel noise) n ∼ N (0, σ 2 I
hrough matrix W, perturbed by the additive noise (i.e., vectors. The reconstructionnofM
the encoding matrix and its row vectors as encoding
noisy representation using matrix A: + n = u + n.
r = Wx Caveat: linear, but can
evaluate for different
s the encoding matrix and its row= Ar = encoding vectors.
x vectors as AWx + An.
ˆ Thenoise levels.
reconstruction of ad
noisy representation using matrix A:
NIPS 2007 Tutorial 88 Michael The term AWx i
the decoding matrix and its column vectors as decoding vectors. S. Lewicki ! Carnegie Mellon

Generalizing the model: sensory noise and optical blur

(a) Undistorted image (b) Fovea retinal image (c) 40 degrees eccentricity

Intensity

-4 0 4 -4 0 4
Visual angle [arc min]

sensory noise channel noise
ν δ only implicit
optical blur encoder decoder
s H x W r A s
ˆ
image observation representation reconstruction

Can also add sparseness and resource constraints


REVIEWS
Robust coding is distinct from “reading” noisy populations
a b In EQN 2, si is the direction (th
100 100 triggers the strongest respo
width of the tuning curve, s
80 80
(so if s = 359° and si = 2°, the

Activity (spikes s–1)

60 60
ing factor. In this case, all t
share a common tuning cur
40 40 preferred directions, si (FIG. 1
involve bell-shaped tuning cu
20 20 The inclusion of the no
because neurons are known
0 0
–100 0 100 –100 0 100 neuron that has an average
Direction (deg) Preferred direction (deg) stimulus moving at 90° mig
c d occasion for a particular 90°
100 100
another occasion for exactly
factors contribute to this va
80 80 trolled or uncontrollable as

presented to the monkey, a
60 60 neuronal responses. In the
collectively considered as ran
40 40
this noise causes important
20 20 transmission and processing
which are solved by populati
0 0 we should be concerned no
–100 0 100 –100 0 100
computes with population co
Preferred direction (deg) Preferred direction (deg) From Pouget et al, 2001
reliably in the presence of suc
Figure 1 | The standard population coding model. a | Bell-shaped tuning curves to direction for
16 neurons. b | A population pattern of activity across 64 neurons with bell-shaped tuning curves in Decoding population cod
response to an object moving at –40°. The activity of each cell was generated using EQN 1, and In this section, we shall addr
Here, we want to learn an optimal image code using noisy neurons.
plotted at the location of the preferred direction of the cell. The overall activity looks like a ‘noisy’ hill what information about t
centred around the stimulus direction. c | Population vector decoding fits a cosine function to the
object is available from the r
observed activity, and uses the peak of the cosine function, s as an estimate of the encoded
ˆ,
direction. d | Maximum likelihood fits a template derived from the tuning curves of the cells. More neurons? Let us take a hypoth
NIPS 2007 Tutorial precisely, the template is obtained from the noiseless (or average) population activity in response to S. Lewicki we record Mellon
90 Michael that ! Carnegie the activity of
and that these neurons have

∆W ∝ − (Cost) = 2 A (IN − AW)Σx − λ diag
(Cost) = (Error) + λ (Var.∂W
Const.) M (83)
How do we learn robust codes?
M sensory noise 2 channel noise
u2 A = Σx WT (σn IM + WΣx WT
2
(Var. Const.) = ln ν 2 i
δ (84)
i=1
σu
optical blur encoder decoder
∂
⇔ (Cost) = O
s H x W r A ∂A s
ˆ
image observation representation reconstruction
4 ln{diag(WΣx WT )/c} u2
2 A (IObjective: x − λ
T
N − AW)Σ diag T)
WΣxi = i
SNR (85)
M diag(WΣx W 2
σn
! Find W and A that minimize reconstruction error.

A •= Channel capacity of+ WΣneuron:−1
Σx WT (σn IM the ith x WT )
2
u 2 = σu
i
2 (86)
1
Ci = ln(SNRi + 1)
∂
⇔ (Cost) = O
2 (87)
∂A
• To limit capacity, ﬁx the coefﬁcient signal to noise ratio:√
λ 1 + λ2
√ √
√λ1
u2 2(1 + M SNR) λ2
SNRi = 2i 2 (88)
σn
Now robust coding is formulated as a 2
constrained optimization problem. N
1 1
u 2 = σu
2 E= M λi
i
N · SNR + 1 N i=1 (89)


encoding neurons
Add
noise equivalent
to 1 bit precision
sensory input

Original 1x efﬁcient coding reconstruction (34% error)

1 bit precision



encoding neurons
Weights adapted for optimal
robustness
sensory input

Original 1x robust coding reconstruction (3.8% error)

1 bit precision


Reconstruction improves by adding neurons

encoding neurons
Weights adapted for optimal
robustness
sensory input

Original 8x robust coding reconstruction error: 0.6%

1 bit precision


Adding precision to 1x robust coding

original images 1 bit: 12.5% error 2 bit: 3.1% error


1 M ·2 (1, ·√ ,1 T ∈ RM . It takes a convenient fo
where 1M SNR + 1· · 1)
=
Can 2(1 + M SNR) 2 average error bound
derive minimum theoretical
2
λ2
2-D 2σx
Isotropic
E= M diag(V
2
· SNR + 1
when we deﬁne V (byλ1 + λ2 ) √ 22
N a linear transform of W,
√
2-D 1= 11
E = M E M · SNR + 1
Anisotropic λi if SNR ≥ SNRc
N · SNR + 1 N i=1 2
2 V≡
λ1
E= = EDET √
where Σx· SNR + + λ2 the 
is if SNR ≤ SNRc
eigenvalue decomposition of
M 1
√ λ1
λi -≡ Dii are theof the data covariance. To summarize, the c
N ith eigenvalue eigenvalues of Σx
its λi 1 
thatinputlinear transform. V should have row vectors of un

E = MN- i=1 dimensionality .
 . 
· SNR + 1usN (neurons)
N M Next, coding units √ a necessary condition for the min
- # of
let consider
λN
all A PPLICATION Regarding A, ∂E/∂A = O yields (se
IV. free parameters.TO IMAGE CODING
ata. we characterized the optimal solutions for 1-D and 2-D data. For1the high
ection
A = high-dim
γ 2 ES(I
ata, or theto be investigated. Here, we data. lower bound solutions for u
s remains eigenvalues of isotoropictheoretical numerical
Algorithm achieves present σ
bustness to channel noise. To derive an optimal solution we can employ a grad
ons) (Note that Results Bound
the necessary condition with respect to W (or e
function (its details are given in [3], [4]).
W (or V) 19.9%
is constrained to a certain subspace by the cha
performance of our proposed code when applied 20.3%test image. The data consists
0.5x to a
sampled T
) = 1x 512×512 pixel image. We set the number andcodin
Using the channel T 12.5% constraint (eq. 9) of the n
e randomlydiag( uufrom the diag(WΣx 12.4% W capacity σ 2 1
) = u M
be8xsimpliﬁed as (see Appendix B) Balcan, Doi, and Lewicki, 2007;
2.0% 2.0% Balcan and Lewicki, 2007

NIPS 2007 Tutorial 96
E = tr{D · (
Michael S. Lewicki ! Carnegie Mellon

Sparseness localizes the vectors and increases coding efﬁciency
normalized histogram
encoding vectors (W) decoding vectors (A)
of coeff. kurtosis
Kurtosis of RC
100 Kurtosis of RC
100
avg. = 4.9
80
80
robust coding

% % frequency
60

frequency
60
%
40
40

20
20

0
0 4 6 8 10
4 6
kurtosis 8 10
average error is unchanged kurtosis
kurtosis

12.5% & 12.5% Kurtosis of RSC
100 Kurtosis of RSC
100

avg. = 7.3
robust sparse coding

80
80

% % frequency
60

frequency
60

%40
40

20
20

0
0 4 6 8 10
4 6
kurtosis 8 10
kurtosis
kurtosis


Non-zero resource constraints localize weights
(a) 20dB 10dB 0dB -10dB
fovea (d)
20dB
(b)

Magnitude
-10dB

(c)
Spatial freq.

average error is also unchanged
40 degrees
20dB -10dB


Theory

$ code signals accurately, efﬁciently, robustly
Principle

$ cell response is linear, noise is additive
Idealization
$ simple, additive resource constraints

$ information theory, natural images
Methodology
$ constrained optimization

$ explains individual receptive ﬁelds
Prediction $ explains non-linear response


What happens next?

-
- + -
-

Learned

from Hubel, 1995


What happens next?


of filters with strongly matic gain control known as ‘divisive normalization’ that has been
is larger; for a pair that used to account for many nonlinear steady-state behaviors of neu-
Joint statisticsrons in primary visual cortex10,20,21. Normalization models have
rtion is zero. Because L2 of ﬁlter outputs show magnitude dependence
ber of other filters with- been motivated by several basic properties. First, gain control
ralization of this condi-
portional to a weighted
neighborhood and an histo{ L | L ! 0.1 } histo{ L2 | L1 ! 0.9 }
2 1
optimal weights and an 1 1
ihood of the conditional
es or sounds (Methods,
0.6 0.6
er for pairs of filters that
0.2 0.2
ge as seen through two lin-
ical filter (L2), conditioned -1 0 1 -1 0 1
diagonal spatially shifted fil-
r all image positions, and a 1
he frequency of occurence
nsional histograms are ver-
widths of these histograms
e not statistically indepen- L2 0 histo{ L2| L1 }
ull two-dimensional condi-
tional to the bin counts,
escaled to fill the range of
hly decorrelated (expected -1
of L1) but not statistically -1 0 1
stribution of L2 increases L
tive) of L1. 1
nature neuroscience • volume 4 no 8 • Schwartz and Simoncelli (2001)
from august 2001


Using sensory gain control to reduce redundancy
Schwartz and Simoncelli (2001)

1
4
Fig. 4. Generic no
response is divided
A divisive boring filters and a
normalization 0 Maximum Likeliho
model The conditional his
that the variance o
0
-1
-1 0 1 0 4 gram is a represen
Other squared a particular mecha
filter responses
2
!

F1
that is primarily e
In Fig. 7b, the hig
Stimulus quency bands wi
F2
2 Adapt weights to
DISCUSSION
!
Other squared We have describ
minimize higher-
filter responses order dependencies
processing, in w
from natural images
divided by a gain
the squared linea
masking tone. As in the visual data, the rate–level curves of the stant. The form

Model accounts for several non-linear response properties
muli. Schwartz and Simoncelli (2001)
audi-
a Cell Model
muli.
grat- a
Mask Signal
(Cavanaugh et al., 2000)
udi- Cell Model
of a Mask Signal
grat-
MeanMean firing rate
(Cavanaugh et al., 2000)
80
e fits
of a Mask contrast:
firing rate

Mean 80 No mask
e fits 40 0.13
gonal Mask contrast:
Mean 0.5 mask
No
of an 40 0.13
onal 0
of a 0.03 0.3 1 0.03 0.3 1 0.5
of an
Mean 0 Signal contrast Signal contrast
of a
non-
0.03 0.3 1 0.03 0.3 1
Mask Signal
Signal contrast
Mean
maxi-
Signal contrast
non- b
Mean Mean rate rate

Mask Signal
80
maxi-
b Mask contrast:
firing firing

80 No mask
40 0.13
Mask contrast:
0.5
No mask
udi- 40
0 0.13
0.03 0.3 1 0.03 0.3 1 0.5
cha- Signal contrast Signal contrast
udi-
have
0
0.03 0.3 1 0.03 0.3 1
ha-
a V1 Signal contrast Signal contrast
have NIPS 2007 Tutorial
odel 104 Michael S. Lewicki ! Carnegie Mellon

What about image structure?

• Bottom-up approaches focus on the non-linearity.

• Our aim here is to focus on the computational problem:

How do we learn the intrinsic dimensions of natural image structure?

• Idea: characterize how the local image distribution changes.


Perceptual organization in natural scenes

image of Kyoto, Japan from E. Doi

Palmer and Rock’s theory of perceptual organization

Palmer and Rock, 1994


Gestalt grouping principles

from Palmer, 1999

Malik and Perona’s (1991) model of texture segregation

input image

output of dark
bar ﬁlters

output of light
bar ﬁlters

Visual Features

Malik and Perona algorithm can ﬁnd texture boundaries

Painting by Gustav Klimt texture boundaries from Malik and
Perona algorithm


NIPS2007: sensory coding and hierarchical representations

NIPS2007: sensory coding and hierarchical representations

Recommended

Recommended

More Related Content

More from zukun

More from zukun (20)

Recently uploaded

Recently uploaded (20)

NIPS2007: sensory coding and hierarchical representations