This presentation shares experiences and selected talks from the International Computer Vision Summer School (ICVSS 2011), attended by Angel Cruz and Andrea Rueda of the Bioingenium Research Group, Universidad Nacional de Colombia.
1. ICVSS 2011 Steven Seitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg
ICVSS 2011: Selected Presentations
Angel Cruz and Andrea Rueda
BioIngenium Research Group, Universidad Nacional de Colombia
August 25, 2011
Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
2. Outline
1 ICVSS 2011
2 A Trillion Photos - Steven Seitz
3 Efficient Novel Class Recognition and Search - Lorenzo Torresani
4 The Life of Structured Learned Dictionaries - Guillermo Sapiro
5 Image Rearrangement & Video Synopsis - Shmuel Peleg
4. ICVSS 2011
International Computer Vision Summer School
15 speakers, from the USA, France, the UK, Italy, the Czech Republic, and Israel
8. A Trillion Photos
Steve Seitz
University of Washington
Google
Sicily Computer Vision Summer School
July 11, 2011
9. Facebook
>3 billion photos uploaded each month
~1 trillion photos taken each year
10. What do you do with a trillion photos?
Digital Shoebox
(hard drives, iPhoto, Facebook, ...)
29. Reconstructing Rome
In a day...
From ~1M images
Using ~1000 cores
Sameer Agarwal, Noah Snavely, Rick Szeliski, Steve Seitz
http://grail.cs.washington.edu/rome
49. Problem statement: novel object-class search
• Given: an image database (e.g., 1 million photos) with no text/tags available, plus user-provided images of an object class
• Want: the database images of this class; the query images may represent a novel class
50. Application: Web-powered visual search in unlabeled personal photos
Goal: find "soccer camp" pictures on my computer
1 Search the Web for images of "soccer camp"
2 Find images of this visual class on my computer
52. Relation to other tasks
[Screenshots of prior retrieval work: query images with ground-truth neighbors and RBM-predicted labels (47%, 56%, 63%), from [Nister and Stewenius, '07], [Philbin et al., '07], and [Torralba et al., '08]]
53. Relation to other tasks
Novel class search relates to both image retrieval and object classification.
Image retrieval
- analogies: large databases, efficient indexing, compact representation
- differences: simple notions of visual relevancy (e.g., near-duplicate, same object instance, same spatial layout)
Object classification
- analogies: recognition of object classes from a few examples
- differences: classes to recognize are defined a priori; training and recognition time is unimportant; storage of features is not an issue
54. Technical requirements of novel-class search
• The object classifier must be learned on the fly from
few examples
• Recognition in the database must have low
computational cost
• Image descriptors must be compact to allow
storage in memory
55. State-of-the-art in object classification
Winning recipe: many features + non-linear classifiers (e.g., [Gehler and Nowozin, CVPR'09])
[Diagram: multiple feature channels combined through kernels to produce a non-linear decision boundary]
56. Model evaluation on Caltech256
[Plot: accuracy (%) vs. number of training examples (0-30) for linear models on individual features: gist, phog, phog2pi, ssim, bow5000]
57. Model evaluation on Caltech256
[Plot: same axes, adding a linear model trained on the linear combination of features, above the linear models on individual features]
58. Model evaluation on Caltech256
[Plot: same axes, adding the non-linear feature combination (LP-β multiple kernel learning, [Gehler and Nowozin, CVPR'09]), which obtains the best accuracy]
59. Multiple kernel combiners
Classification output is obtained by combining many features via non-linear kernels:
h(x) = Σ_{f=1}^F β_f ( Σ_{n=1}^N k_f(x, x_n) α_n ) + b
where the outer sum runs over features and the inner sum over training examples.
60. Methods: Multiple Kernel Learning (MKL)
[Bach et al., 2004; Sonnenburg et al., 2006; Varma and Ray, 2007]
One approach to kernel selection is to learn a kernel combination during the training phase of the algorithm. One prominent instance of this class is MKL: learning a non-linear SVM by jointly optimizing over a linear combination of kernels
k*(x, x') = Σ_{f=1}^F β_f k_f(x, x')
and the SVM parameters α ∈ R^N and b ∈ R. The objective is
min_{α,β,b} (1/2) Σ_{f=1}^F β_f α^T K_f α + C Σ_{n=1}^N L(y_n, b + Σ_{f=1}^F β_f K_f(x_n)^T α)
subject to Σ_{f=1}^F β_f = 1, β_f ≥ 0, f = 1, ..., F,
where L(y, t) = max(0, 1 - yt) is the hinge loss and K_f(x) = [k_f(x, x_1), k_f(x, x_2), ..., k_f(x, x_N)]^T. The constraints β_f ≥ 0, Σ_f β_f = 1 yield sparse, interpretable coefficients.
61. LP-β: a two-stage approach to MKL [Gehler and Nowozin, 2009]
• Classification output of traditional MKL:
h_MKL(x) = Σ_{f=1}^F β_f ( Σ_{n=1}^N k_f(x, x_n) α_n ) + b
• Classification function of LP-β:
h(x) = Σ_{f=1}^F β_f h_f(x), where h_f(x) = Σ_{n=1}^N k_f(x, x_n) α_{f,n} + b_f
Two-stage training procedure:
1. train each h_f(x) independently → traditional SVM learning
2. optimize over β → a simple linear program
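The scoring rule above can be sketched in a few lines (a minimal numpy sketch with precomputed kernel rows; the function and variable names are illustrative, not from the talk):

```python
import numpy as np

def lp_beta_score(kernel_rows, alphas, biases, betas):
    # kernel_rows[f][n] = k_f(x, x_n): kernel value between the test
    # image x and training example n under feature channel f.
    # Stage one, per-feature SVM: h_f(x) = sum_n alpha_{f,n} k_f(x, x_n) + b_f
    per_feature = [float(np.dot(a, k)) + b
                   for k, a, b in zip(kernel_rows, alphas, biases)]
    # Stage two, linear-program weights: h(x) = sum_f beta_f h_f(x)
    return sum(beta * hf for beta, hf in zip(betas, per_feature))
```

Because stage two only mixes the already-trained per-feature SVM outputs, each h_f can be learned independently before β is fit.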
62. LP-β for novel-class search?
The LP-β classifier:
h(x) = Σ_{f=1}^F β_f ( Σ_{n=1}^N k_f(x, x_n) α_{f,n} + b_f )
(sum over features, sum over training examples)
Unsuitable for our needs due to:
• large storage requirements (typically over 20K bytes/image)
• costly evaluation (requires query-time kernel distance computation for each test image)
• costly training (1+ minute for O(10) training examples)
63. Classemes: a compact descriptor for efficient recognition [Torresani et al., 2010]
Key-idea: represent each image x in terms of its "closeness" to a set of basis classes ("classemes"):
Φ(x) = [φ_1(x), ..., φ_C(x)]^T
φ_c(x) = h_classeme_c(x) = Σ_{f=1}^F β_f^c Σ_{n=1}^N k_f(x, x_n^c) α_n^c + b^c
i.e., the output of a pre-learned LP-β classifier for the c-th basis class.
Query-time learning: given training examples Φ(x_1), ..., Φ(x_N) of the novel class, train a linear classifier on Φ(x):
g_duck(Φ(x); w^duck) = Φ(x)^T w^duck = Σ_{c=1}^C w_c^duck φ_c(x)
The classeme LP-β classifiers are trained before the creation of the database; only the linear weights w^duck are trained at query time.
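The split between offline and query-time work can be sketched as follows (a hedged sketch: the classeme outputs are taken as given numbers, standing in for the pre-learned LP-β classifiers):

```python
import numpy as np

def novel_class_score(classeme_scores, w):
    # Phi(x) = [phi_1(x), ..., phi_C(x)]: outputs of the C pre-learned
    # classeme classifiers, computed once per image offline.
    phi = np.asarray(classeme_scores, dtype=float)
    # Query-time scoring is a single dot product with the linear
    # weights w learned from the few novel-class examples:
    # g(Phi(x); w) = Phi(x)^T w
    return float(phi @ w)
```

All kernel evaluations are folded into the offline classeme computation, so ranking a database image costs one C-dimensional dot product.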
64. How this works...
Classeme labels are not required to make semantic sense: the goal is simply a useful feature vector, not semantic labels, and detectors may just capture specific patterns of texture, color, shape, etc. The somewhat peculiar classeme labels reflect the ontology used as a source of base categories.
[Table 1 of [Torresani et al., 2010]: the five classemes with the highest LP-β weights, used for the retrieval experiment, for a selection of Caltech 256 categories]
Large-scale recognition benefits from a compact descriptor for each image, for example allowing databases to be stored in memory rather than on disk.
65. Related work
• Attribute-based recognition: [Lampert et al., CVPR'09], [Farhadi et al., CVPR'09]
[Figure from [Lampert et al., CVPR'09]: animal classes (otter, polar bear, zebra) described by high-level attributes (black, white, brown, stripes, water, eats fish); after learning the visual appearance of attributes from classes with training examples, object classes without any training images can be detected based on which attribute description a test image fits best]
• requires hand-specified attribute-class associations
• attribute classifiers must be trained with human-labeled examples
66. Method overview
1. Classeme learning: train one classifier per basis class, e.g. φ_"body of water"(x), ..., φ_"walking"(x)
2. Using the classemes for recognition and retrieval: given training examples Φ(x_1), ..., Φ(x_N) of a novel class, learn
g_duck(Φ(x)) = Σ_{c=1}^C w_c^duck φ_c(x)
67. Classeme learning: choosing the basis classes
• Classeme labels desiderata:
- must be visual concepts
- should span the entire space of visual classes
• Our selection:
concepts defined in the Large Scale Ontology for Multimedia
[LSCOM] to be “useful, observable and feasible for automatic
detection”.
2659 classeme labels, after manual elimination of
plurals, near-duplicates, and inappropriate concepts
68. Classeme learning: gathering the training data
• We downloaded the top 150 images returned by Bing Images for each classeme label
• For each of the 2659 classemes, a one-versus-the-rest training set was formed to learn a binary classifier
[Example: images labeled yes/no as training data for φ_"walking"(x)]
69. Classeme learning: training the classifiers
• Each classeme classifier is an LP-β kernel combiner [Gehler and Nowozin, 2009]:
φ(x) = Σ_{f=1}^F β_f ( Σ_{n=1}^N k_f(x, x_n) α_{f,n} + b_f )
a linear combination of feature-specific SVMs.
• We use 13 kernels based on spatial pyramid histograms computed from the following features:
- color GIST [Oliva and Torralba, 2001]
- oriented gradients [Dalal and Triggs, 2005]
- self-similarity descriptors [Shechtman and Irani, 2007]
- SIFT [Lowe, 2004]
70. A dimensionality reduction view of classemes
Φ maps the raw features of an image x (GIST, self-similarity descriptors, oriented gradients, SIFT) to [φ_1(x), ..., φ_2659(x)].
Raw features: non-linear kernels are needed for good classification; 23K bytes/image.
Classemes: near state-of-the-art accuracy with linear classifiers; can be quantized down to 200 bytes/image with almost no recognition loss.
71. Experiment 1: multiclass recognition on Caltech256
[Plot: accuracy (%) vs. number of training examples (0-50) for:
- LPbeta: LP-β in [Gehler and Nowozin, 2009] using 39 kernels
- LPbeta13: LP-β with our 13 kernels on raw features x
- MKL
- Csvm (our approach): linear SVM with classemes Φ(x)
- Cq1svm: linear SVM with binarized classemes, i.e. (Φ(x) > 0)
- Xsvm: linear SVM with raw features x]
72. Computational cost comparison
[Bar charts: training time (minutes), LPbeta 23 hours vs. Csvm 9 minutes; testing time (ms) for LPbeta vs. Csvm]
73. Accuracy vs. compactness
[Plot: compactness (images per MB, log scale) vs. accuracy (%); lines link performance at 15 and 30 training examples:
- Cq1svm: 188 bytes/image
- Csvm: 2.5K bytes/image
- LPbeta13: 23K bytes/image
- nbnn [Boiman et al., 2008]: 128K bytes/image
- emk [Bo and Sminchisescu, 2008]
- Xsvm]
74. Experiment 2: object class retrieval
[Fig. 4 of [Torresani et al., 2010]: Precision @ 25 (%) vs. number of training images (0-50) for Csvm, Cq1Rocchio (β=1, γ=0), Cq1Rocchio (β=0.75, γ=0.15), Bowsvm, BowRocchio (β=1, γ=0), BowRocchio (β=0.75, γ=0.15); percentage of the top 25 in a 6400-document set which match the query class]
• Random performance is 0.4%
• Training Csvm takes 0.6 sec with 5×256 training examples
75. Analogies with text retrieval
• Classeme representation of an image:
presence/absence of visual attributes
• Bag-of-words representation of a text document:
presence/absence of words
76. Related work
• Prior work (e.g., [Sivic and Zisserman, 2003; Nister and Stewenius, 2006; Philbin et al., 2007]) has exploited a similar analogy for object-instance retrieval by representing images as bags of visual words:
detect interest patches → compute SIFT descriptors [Lowe, 2004] → quantize descriptors → represent the image as a sparse histogram of visual-word frequencies over the codewords
• To extend this methodology to object-class retrieval we need:
- a representation more suited to object-class recognition (e.g., classemes as opposed to bags of visual words)
- to train the ranking/retrieval function for every new query class
84. Efficient retrieval via inverted index
Inverted index: for each classeme f, store the list of database images whose f-th entry is non-zero (e.g., f0: I0 I2 I3 I4 I6 I8; f1: I2 I7 I8; ...).
Given query weights w = [1.5 -2 0 -5 0 3 -2 0] over f0...f7, only the lists of the non-zero-weight classemes are scanned.
Cost of scoring is linear in the sum of the lengths of the inverted lists associated to non-zero weights.
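The scan described above can be sketched as follows (a minimal sketch; the posting-list layout with (image_id, value) pairs is an illustrative assumption, not the talk's exact data structure):

```python
def score_with_inverted_index(w, inverted_lists, num_images):
    # inverted_lists[f] holds (image_id, value) pairs for the images
    # whose f-th classeme entry is non-zero. Lists attached to zero
    # weights are never touched, so the cost is linear in the total
    # length of the lists of the non-zero weights.
    scores = [0.0] * num_images
    for weight, postings in zip(w, inverted_lists):
        if weight == 0.0:
            continue
        for image_id, value in postings:
            scores[image_id] += weight * value
    return scores
```

The sparser w is, the fewer lists are scanned, which is exactly what the next slide exploits.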
85. Improve efficiency via sparse weight vectors
Key-idea: force w to contain as many zeros as possible.
Learning objective:
E(w) = R(w) + (C/N) Σ_{n=1}^N L(w; Φ_n, y_n)
where R is a regularizer, L a loss function, Φ_n the classeme vector of example n and y_n its label.
• L2-SVM: R(w) = w^T w, L(w; Φ_n, y_n) = max(0, 1 - y_n (w^T Φ_n))
• Since |w_i| > w_i^2 for small w_i and |w_i| < w_i^2 for large w_i, choosing R(w) = Σ_i |w_i| will tend to produce a small number of larger weights and more zero weights
(ℓ2-ball: w_1^2 + w_2^2 = constant; ℓ1-ball: |w_1| + |w_2| = constant)
86. Improve efficiency via sparse weight vectors
Key-idea: force w to contain as many zeros as possible.
Learning objective (Φ_n = classeme vector of example n, y_n = its label):
E(w) = R(w) + (C/N) Σ_{n=1}^N L(w; Φ_n, y_n)
with regularizer R and loss function L:
• L2-SVM: R(w) = w^T w, L(w; Φ_n, y_n) = max(0, 1 - y_n (w^T Φ_n))
• L1-LR: R(w) = Σ_i |w_i|, L(w; Φ_n, y_n) = log(1 + exp(-y_n w^T Φ_n))
• FGM (Feature Generating Machine) [Tan et al., 2010]: R(w) = w^T w, L(w; Φ_n, y_n) = max(0, 1 - y_n (w ⊙ d)^T Φ_n), s.t. 1^T d ≤ B, d ∈ {0,1}^D (⊙ = elementwise product)
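For concreteness, the L1-LR objective above can be evaluated as follows (a pure-Python sketch of the formula only, not a trainer; names are illustrative):

```python
import math

def l1_lr_objective(w, Phi, y, C):
    # E(w) = sum_i |w_i| + (C/N) * sum_n log(1 + exp(-y_n w^T Phi_n))
    N = len(Phi)
    reg = sum(abs(wi) for wi in w)  # L1 regularizer, promotes zeros
    loss = sum(
        math.log1p(math.exp(-yn * sum(wi * xi for wi, xi in zip(w, phi))))
        for phi, yn in zip(Phi, y))
    return reg + C * loss / N
```

Minimizing this with any convex solver drives most w_i exactly to zero, which shortens the inverted-index scan of the previous slides.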
87. Performance evaluation on ImageNet (10M images) [Rastegari et al., 2011]
[Plot: Precision @ 10 (%) vs. search time per query (seconds) for full inner product evaluation vs. inverted index, each with L2-SVM and L1-LR; each curve is obtained by varying sparsity through C in the training objective E(w) = R(w) + (C/N) Σ_n L(w; Φ_n, y_n)]
• Performance averaged over 400 object classes used as queries
• 10 training examples per query class
• Database includes 450 images of the query class and 9.7M images of other classes
• Prec@10 of a random classifier is 0.005%
88. Top-k ranking
• Do we need to rank the entire database?
- users only care about the top-ranked images
• Key idea:
- for each image iteratively update an upper-bound and
a lower-bound on the score
- gradually prune images that cannot rank in the top-k
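The pruning test at the heart of this idea can be sketched as follows (a simplified sketch of one pruning step under assumed per-image bounds; the talk's TkP method maintains these bounds incrementally):

```python
import heapq

def prune_candidates(lower_bounds, max_remaining_gain, k):
    # lower_bounds[i]: score accumulated so far for image i.
    # max_remaining_gain[i]: the most image i could still gain from the
    # features not yet accumulated (upper bound = lower + gain).
    # An image can only reach the top-k if its upper bound is at least
    # the current k-th best lower bound; everything else is pruned.
    kth_best = heapq.nlargest(k, lower_bounds)[-1]
    return [i for i, (lo, gain) in
            enumerate(zip(lower_bounds, max_remaining_gain))
            if lo + gain >= kth_best]
```

A smaller k raises the bar earlier, which matches the observation on the later slides that pruning is most aggressive for small k.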
95. Distribution of weights and pruning rate
[Figure 2 of [Rastegari et al., 2011]: (a) Distribution of absolute weight values (normalized) vs. dimension for L1-LR, L2-SVM, and FGM, after sorting the weight magnitudes; TkP runs faster with sparse, highly skewed weight values. (b) Pruning rate of TkP (% of images pruned vs. number of iterations d) for the various classification models and different values of k (k = 10, 3000); a smaller value of k allows the method to eliminate more images from consideration at a very early stage. Features are considered in descending order of |w_i|.]
96. Performance evaluation on ImageNet (10M images) [Rastegari et al., 2011]
[Plot: Precision @ 10 (%) vs. search time per query (seconds) for TkP L1-LR, TkP L2-SVM, inverted index L1-LR, inverted index L2-SVM; each curve is obtained by varying sparsity through C in the training objective E(w) = R(w) + (C/N) Σ_n L(w; Φ_n, y_n)]
• k = 10
• Performance averaged over 400 object classes used as queries
• 10 training examples per query class
• Database includes 450 images of the query class and 9.7M images of other classes
• Prec@10 of a random classifier is 0.005%
97. Alternative search strategy: approximate ranking
• Key-idea: approximate the score function with a measure that can be computed (more) efficiently (related to approximate NN search: [Shakhnarovich et al., 2006; Grauman and Darrell, 2007; Chum et al., 2008])
• Approximate ranking via vector quantization:
w^T Φ ≈ w^T q(Φ)
where q(.) is a quantizer returning the cluster centroid nearest to Φ
• Problem:
- to approximate the score well we need a fine quantization
- the dimensionality of our space is D = 2659: too large to enable a fine quantization using k-means clustering
98. Product quantization
Product quantization for nearest neighbor search [Jegou et al., 2011]
• Split the feature vector Φ into v subvectors: Φ = [Φ_1 | Φ_2 | ... | Φ_v]
• Subvectors are quantized separately:
q(Φ) = [q_1(Φ_1) | q_2(Φ_2) | ... | q_v(Φ_v)]
where each q_i(.) is learned by k-means in a space of dimensionality D/v, with a limited number of centroids
• Example from [Jegou et al., 2011]: a 128-dimensional vector split into 8 subvectors of dimension 16; each subvector is quantized with 2^8 = 256 centroids (8 bits), yielding a 64-bit quantization index
99. Efficient approximate scoring
w^T Φ ≈ w^T q(Φ) = Σ_{j=1}^v w_j^T q_j(Φ_j)
which can be precomputed and stored in a look-up table.
1. Fill the look-up table: split w into sub-blocks w_1, ..., w_v; for each sub-block j, precompute the partial inner products s_{j1}, s_{j2}, ..., s_{jr} of w_j with each of the r centroids of sub-quantizer q_j.
2. Score each quantized vector q(Φ) in the database using the look-up table:
w^T q(Φ) = w_1^T q_1(Φ_1) + w_2^T q_2(Φ_2) + ... + w_v^T q_v(Φ_v)
Only v additions per image!
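The two steps above can be sketched as follows (a minimal numpy sketch; the centroid layout and function names are illustrative assumptions):

```python
import numpy as np

def build_lookup_table(w_blocks, centroids):
    # Step 1: s[j][k] = w_j^T c_{j,k}, the partial inner product of the
    # j-th weight sub-block with the k-th centroid of sub-quantizer j.
    # Done once per query, before touching the database.
    return [[float(np.dot(wj, c)) for c in cents]
            for wj, cents in zip(w_blocks, centroids)]

def approx_score(codes, table):
    # Step 2: a database image is stored only as its v centroid codes;
    # w^T q(Phi) = sum_j s[j][codes[j]] -- v look-ups and additions
    # per image, regardless of the original dimensionality D.
    return sum(table[j][c] for j, c in enumerate(codes))
```

With v = 16 sub-blocks and 256 centroids each, an image costs 16 bytes of storage and 16 additions to score, which is what makes the 10M-image experiments on the surrounding slides feasible.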
105. Choice of parameters [Rastegari et al., 2011]
• Dimensionality is first reduced with PCA from D = 2659 to D' << D
• How do we choose D', v (number of sub-blocks), and r (number of centroids per sub-block)?
• [Plot: effect of parameter choices on a database of 150K images — Precision @ 10 (%) vs. search time per query (seconds) for D' ∈ {128, 256, 512} and (v, r) ∈ {(16, 2^8), (32, 2^8), (64, 2^6), (64, 2^8), (128, 2^8), (256, 2^6), (256, 2^8)}]