This presentation shares experiences and selected talks from the International Computer Vision Summer School (ICVSS 2011), attended by Angel Cruz and Andrea Rueda of the Bioingenium Research Group, Universidad Nacional de Colombia.
1. ICVSS 2011 Steven Seitz Lorenzo Torresani Guillermo Sapiro Shmuel Peleg
ICVSS 2011: Selected Presentations
Angel Cruz and Andrea Rueda
BioIngenium Research Group, Universidad Nacional de Colombia
August 25, 2011
Angel Cruz and Andrea Rueda — ICVSS 2011: Selected Presentations
2. Outline
1 ICVSS 2011
2 A Trillion Photos - Steven Seitz
3 Efficient Novel Class Recognition and Search - Lorenzo Torresani
4 The Life of Structured Learned Dictionaries - Guillermo Sapiro
5 Image Rearrangement & Video Synopsis - Shmuel Peleg
4. ICVSS 2011
International Computer Vision Summer School
15 speakers, from the USA, France, the UK, Italy, the Czech Republic, and Israel
8. A Trillion Photos
Steve Seitz
University of Washington
Google
Sicily Computer Vision Summer School
July 11, 2011
9. Facebook
>3 billion photos uploaded each month
~1 trillion photos taken each year
10. What do you do with a trillion photos?
Digital Shoebox
(hard drives, iPhoto, Facebook, ...)
29. Reconstructing Rome
In a day...
From ~1M images
Using ~1000 cores
Sameer Agarwal, Noah Snavely, Rick Szeliski, Steve Seitz
http://grail.cs.washington.edu/rome
49. Problem statement: novel object-class search
• Given: an image database (e.g., 1 million photos) with no text/tags available, plus user-provided images of an object class
• Want: the database images of this class; the query images may represent a novel class
50. Application: Web-powered visual search in unlabeled personal photos
Goal: find "soccer camp" pictures on my computer
1 Search the Web for images of "soccer camp"
2 Find images of this visual class on my computer
52. Relation to other tasks
[Screenshots of prior retrieval work: query images with ground-truth neighbors and RBM-predicted labels (47%, 56%, 63%), from [Nister and Stewenius, '07], [Philbin et al., '07], and [Torralba et al., '08]]
53. Relation to other tasks
Novel class search relates to both image retrieval and object classification.
Image retrieval
- analogies: large databases, efficient indexing, compact representation
- differences: simple notions of visual relevancy (e.g., near-duplicate, same object instance, same spatial layout)
Object classification
- analogies: recognition of object classes from a few examples
- differences: classes to recognize are defined a priori; training and recognition time is unimportant; storage of features is not an issue
54. Technical requirements of novel-class search
• The object classifier must be learned on the fly from
few examples
• Recognition in the database must have low
computational cost
• Image descriptors must be compact to allow
storage in memory
55. State-of-the-art in object classification
Winning recipe: many features + non-linear classifiers (e.g., [Gehler and Nowozin, CVPR'09])
[Diagram: multiple feature channels combined through kernels to produce a non-linear decision boundary]
56. Model evaluation on Caltech256
[Plot: accuracy (%) vs. number of training examples (0-30) for linear models on individual features: gist, phog, phog2pi, ssim, bow5000]
57. Model evaluation on Caltech256
[Plot: same axes, adding a linear model trained on the linear combination of features, above the linear models on individual features]
58. Model evaluation on Caltech256
[Plot: same axes, adding the non-linear feature combination (LP-β multiple kernel learning, [Gehler and Nowozin, CVPR'09]), which obtains the best accuracy]
59. Multiple kernel combiners
Classification output is obtained by combining many features via non-linear kernels:
h(x) = Σ_{f=1}^F β_f ( Σ_{n=1}^N k_f(x, x_n) α_n ) + b
where the outer sum runs over features and the inner sum over training examples.
60. Methods: Multiple Kernel Learning (MKL)
[Bach et al., 2004; Sonnenburg et al., 2006; Varma and Ray, 2007]
One approach to kernel selection is to learn a kernel combination during the training phase of the algorithm. One prominent instance of this class is MKL: learning a non-linear SVM by jointly optimizing over a linear combination of kernels
k*(x, x') = Σ_{f=1}^F β_f k_f(x, x')
and the SVM parameters α ∈ R^N and b ∈ R. The objective is
min_{α,β,b} (1/2) Σ_{f=1}^F β_f α^T K_f α + C Σ_{n=1}^N L(y_n, b + Σ_{f=1}^F β_f K_f(x_n)^T α)
subject to Σ_{f=1}^F β_f = 1, β_f ≥ 0, f = 1, ..., F,
where L(y, t) = max(0, 1 - yt) is the hinge loss and K_f(x) = [k_f(x, x_1), k_f(x, x_2), ..., k_f(x, x_N)]^T. The constraints β_f ≥ 0, Σ_f β_f = 1 yield sparse, interpretable coefficients.
61. LP-β: a two-stage approach to MKL [Gehler and Nowozin, 2009]
• Classification output of traditional MKL:
h_MKL(x) = Σ_{f=1}^F β_f ( Σ_{n=1}^N k_f(x, x_n) α_n ) + b
• Classification function of LP-β:
h(x) = Σ_{f=1}^F β_f h_f(x), where h_f(x) = Σ_{n=1}^N k_f(x, x_n) α_{f,n} + b_f
Two-stage training procedure:
1. train each h_f(x) independently → traditional SVM learning
2. optimize over β → a simple linear program
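The scoring rule above can be sketched in a few lines (a minimal numpy sketch with precomputed kernel rows; the function and variable names are illustrative, not from the talk):

```python
import numpy as np

def lp_beta_score(kernel_rows, alphas, biases, betas):
    # kernel_rows[f][n] = k_f(x, x_n): kernel value between the test
    # image x and training example n under feature channel f.
    # Stage one, per-feature SVM: h_f(x) = sum_n alpha_{f,n} k_f(x, x_n) + b_f
    per_feature = [float(np.dot(a, k)) + b
                   for k, a, b in zip(kernel_rows, alphas, biases)]
    # Stage two, linear-program weights: h(x) = sum_f beta_f h_f(x)
    return sum(beta * hf for beta, hf in zip(betas, per_feature))
```

Because stage two only mixes the already-trained per-feature SVM outputs, each h_f can be learned independently before β is fit.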
62. LP-β for novel-class search?
The LP-β classifier:
h(x) = Σ_{f=1}^F β_f ( Σ_{n=1}^N k_f(x, x_n) α_{f,n} + b_f )
(sum over features, sum over training examples)
Unsuitable for our needs due to:
• large storage requirements (typically over 20K bytes/image)
• costly evaluation (requires query-time kernel distance computation for each test image)
• costly training (1+ minute for O(10) training examples)
63. Classemes: a compact descriptor for efficient recognition [Torresani et al., 2010]
Key-idea: represent each image x in terms of its "closeness" to a set of basis classes ("classemes"):
Φ(x) = [φ_1(x), ..., φ_C(x)]^T
φ_c(x) = h_classeme_c(x) = Σ_{f=1}^F β_f^c Σ_{n=1}^N k_f(x, x_n^c) α_n^c + b^c
i.e., the output of a pre-learned LP-β classifier for the c-th basis class.
Query-time learning: given training examples Φ(x_1), ..., Φ(x_N) of the novel class, train a linear classifier on Φ(x):
g_duck(Φ(x); w^duck) = Φ(x)^T w^duck = Σ_{c=1}^C w_c^duck φ_c(x)
The classeme LP-β classifiers are trained before the creation of the database; only the linear weights w^duck are trained at query time.
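The split between offline and query-time work can be sketched as follows (a hedged sketch: the classeme outputs are taken as given numbers, standing in for the pre-learned LP-β classifiers):

```python
import numpy as np

def novel_class_score(classeme_scores, w):
    # Phi(x) = [phi_1(x), ..., phi_C(x)]: outputs of the C pre-learned
    # classeme classifiers, computed once per image offline.
    phi = np.asarray(classeme_scores, dtype=float)
    # Query-time scoring is a single dot product with the linear
    # weights w learned from the few novel-class examples:
    # g(Phi(x); w) = Phi(x)^T w
    return float(phi @ w)
```

All kernel evaluations are folded into the offline classeme computation, so ranking a database image costs one C-dimensional dot product.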
64. How this works...
Classeme labels are not required to make semantic sense: the goal is simply a useful feature vector, not semantic labels, and detectors may just capture specific patterns of texture, color, shape, etc. The somewhat peculiar classeme labels reflect the ontology used as a source of base categories.
[Table 1 of [Torresani et al., 2010]: the five classemes with the highest LP-β weights, used for the retrieval experiment, for a selection of Caltech 256 categories]
Large-scale recognition benefits from a compact descriptor for each image, for example allowing databases to be stored in memory rather than on disk.
65. Related work
• Attribute-based recognition: [Lampert et al., CVPR'09], [Farhadi et al., CVPR'09]
[Figure from [Lampert et al., CVPR'09]: animal classes (otter, polar bear, zebra) described by high-level attributes (black, white, brown, stripes, water, eats fish); after learning the visual appearance of attributes from classes with training examples, object classes without any training images can be detected based on which attribute description a test image fits best]
• requires hand-specified attribute-class associations
• attribute classifiers must be trained with human-labeled examples
66. Method overview
1. Classeme learning: train one classifier per basis class, e.g. φ_"body of water"(x), ..., φ_"walking"(x)
2. Using the classemes for recognition and retrieval: given training examples Φ(x_1), ..., Φ(x_N) of a novel class, learn
g_duck(Φ(x)) = Σ_{c=1}^C w_c^duck φ_c(x)
67. Classeme learning: choosing the basis classes
• Classeme labels desiderata:
- must be visual concepts
- should span the entire space of visual classes
• Our selection:
concepts defined in the Large Scale Ontology for Multimedia
[LSCOM] to be “useful, observable and feasible for automatic
detection”.
2659 classeme labels, after manual elimination of
plurals, near-duplicates, and inappropriate concepts
68. Classeme learning: gathering the training data
• We downloaded the top 150 images returned by Bing Images for each classeme label
• For each of the 2659 classemes, a one-versus-the-rest training set was formed to learn a binary classifier
[Example: images labeled yes/no as training data for φ_"walking"(x)]
69. Classeme learning: training the classifiers
• Each classeme classifier is an LP-β kernel combiner [Gehler and Nowozin, 2009]:
φ(x) = Σ_{f=1}^F β_f ( Σ_{n=1}^N k_f(x, x_n) α_{f,n} + b_f )
a linear combination of feature-specific SVMs.
• We use 13 kernels based on spatial pyramid histograms computed from the following features:
- color GIST [Oliva and Torralba, 2001]
- oriented gradients [Dalal and Triggs, 2005]
- self-similarity descriptors [Shechtman and Irani, 2007]
- SIFT [Lowe, 2004]
70. A dimensionality reduction view of classemes
Φ maps the raw features of an image x (GIST, self-similarity descriptors, oriented gradients, SIFT) to [φ_1(x), ..., φ_2659(x)].
Raw features: non-linear kernels are needed for good classification; 23K bytes/image.
Classemes: near state-of-the-art accuracy with linear classifiers; can be quantized down to 200 bytes/image with almost no recognition loss.
71. Experiment 1: multiclass recognition on Caltech256
[Plot: accuracy (%) vs. number of training examples (0-50) for:
- LPbeta: LP-β in [Gehler and Nowozin, 2009] using 39 kernels
- LPbeta13: LP-β with our 13 kernels on raw features x
- MKL
- Csvm (our approach): linear SVM with classemes Φ(x)
- Cq1svm: linear SVM with binarized classemes, i.e. (Φ(x) > 0)
- Xsvm: linear SVM with raw features x]
72. Computational cost comparison
[Bar charts: training time (minutes), LPbeta 23 hours vs. Csvm 9 minutes; testing time (ms) for LPbeta vs. Csvm]
73. Accuracy vs. compactness
[Plot: compactness (images per MB, log scale) vs. accuracy (%); lines link performance at 15 and 30 training examples:
- Cq1svm: 188 bytes/image
- Csvm: 2.5K bytes/image
- LPbeta13: 23K bytes/image
- nbnn [Boiman et al., 2008]: 128K bytes/image
- emk [Bo and Sminchisescu, 2008]
- Xsvm]
74. Experiment 2: object class retrieval
[Fig. 4 of [Torresani et al., 2010]: Precision @ 25 (%) vs. number of training images (0-50) for Csvm, Cq1Rocchio (β=1, γ=0), Cq1Rocchio (β=0.75, γ=0.15), Bowsvm, BowRocchio (β=1, γ=0), BowRocchio (β=0.75, γ=0.15); percentage of the top 25 in a 6400-document set which match the query class]
• Random performance is 0.4%
• Training Csvm takes 0.6 sec with 5×256 training examples
75. Analogies with text retrieval
• Classeme representation of an image:
presence/absence of visual attributes
• Bag-of-words representation of a text document:
presence/absence of words
76. Related work
• Prior work (e.g., [Sivic and Zisserman, 2003; Nister and Stewenius, 2006; Philbin et al., 2007]) has exploited a similar analogy for object-instance retrieval by representing images as bags of visual words:
detect interest patches → compute SIFT descriptors [Lowe, 2004] → quantize descriptors → represent the image as a sparse histogram of visual-word frequencies over the codewords
• To extend this methodology to object-class retrieval we need:
- a representation more suited to object-class recognition (e.g., classemes as opposed to bags of visual words)
- to train the ranking/retrieval function for every new query class
84. Efficient retrieval via inverted index
Inverted index: for each classeme f, store the list of database images whose f-th entry is non-zero (e.g., f0: I0 I2 I3 I4 I6 I8; f1: I2 I7 I8; ...).
Given query weights w = [1.5 -2 0 -5 0 3 -2 0] over f0...f7, only the lists of the non-zero-weight classemes are scanned.
Cost of scoring is linear in the sum of the lengths of the inverted lists associated to non-zero weights.
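The scan described above can be sketched as follows (a minimal sketch; the posting-list layout with (image_id, value) pairs is an illustrative assumption, not the talk's exact data structure):

```python
def score_with_inverted_index(w, inverted_lists, num_images):
    # inverted_lists[f] holds (image_id, value) pairs for the images
    # whose f-th classeme entry is non-zero. Lists attached to zero
    # weights are never touched, so the cost is linear in the total
    # length of the lists of the non-zero weights.
    scores = [0.0] * num_images
    for weight, postings in zip(w, inverted_lists):
        if weight == 0.0:
            continue
        for image_id, value in postings:
            scores[image_id] += weight * value
    return scores
```

The sparser w is, the fewer lists are scanned, which is exactly what the next slide exploits.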
85. Improve efficiency via sparse weight vectors
Key-idea: force w to contain as many zeros as possible.
Learning objective:
E(w) = R(w) + (C/N) Σ_{n=1}^N L(w; Φ_n, y_n)
where R is a regularizer, L a loss function, Φ_n the classeme vector of example n and y_n its label.
• L2-SVM: R(w) = w^T w, L(w; Φ_n, y_n) = max(0, 1 - y_n (w^T Φ_n))
• Since |w_i| > w_i^2 for small w_i and |w_i| < w_i^2 for large w_i, choosing R(w) = Σ_i |w_i| will tend to produce a small number of larger weights and more zero weights
(ℓ2-ball: w_1^2 + w_2^2 = constant; ℓ1-ball: |w_1| + |w_2| = constant)
86. Improve efficiency via sparse weight vectors
Key-idea: force w to contain as many zeros as possible.
Learning objective (Φ_n = classeme vector of example n, y_n = its label):
E(w) = R(w) + (C/N) Σ_{n=1}^N L(w; Φ_n, y_n)
with regularizer R and loss function L:
• L2-SVM: R(w) = w^T w, L(w; Φ_n, y_n) = max(0, 1 - y_n (w^T Φ_n))
• L1-LR: R(w) = Σ_i |w_i|, L(w; Φ_n, y_n) = log(1 + exp(-y_n w^T Φ_n))
• FGM (Feature Generating Machine) [Tan et al., 2010]: R(w) = w^T w, L(w; Φ_n, y_n) = max(0, 1 - y_n (w ⊙ d)^T Φ_n), s.t. 1^T d ≤ B, d ∈ {0,1}^D (⊙ = elementwise product)
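For concreteness, the L1-LR objective above can be evaluated as follows (a pure-Python sketch of the formula only, not a trainer; names are illustrative):

```python
import math

def l1_lr_objective(w, Phi, y, C):
    # E(w) = sum_i |w_i| + (C/N) * sum_n log(1 + exp(-y_n w^T Phi_n))
    N = len(Phi)
    reg = sum(abs(wi) for wi in w)  # L1 regularizer, promotes zeros
    loss = sum(
        math.log1p(math.exp(-yn * sum(wi * xi for wi, xi in zip(w, phi))))
        for phi, yn in zip(Phi, y))
    return reg + C * loss / N
```

Minimizing this with any convex solver drives most w_i exactly to zero, which shortens the inverted-index scan of the previous slides.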
87. Performance evaluation on ImageNet (10M images) [Rastegari et al., 2011]
[Plot: Precision @ 10 (%) vs. search time per query (seconds) for full inner product evaluation vs. inverted index, each with L2-SVM and L1-LR; each curve is obtained by varying sparsity through C in the training objective E(w) = R(w) + (C/N) Σ_n L(w; Φ_n, y_n)]
• Performance averaged over 400 object classes used as queries
• 10 training examples per query class
• Database includes 450 images of the query class and 9.7M images of other classes
• Prec@10 of a random classifier is 0.005%
88. Top-k ranking
• Do we need to rank the entire database?
- users only care about the top-ranked images
• Key idea:
- for each image iteratively update an upper-bound and
a lower-bound on the score
- gradually prune images that cannot rank in the top-k
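The pruning test at the heart of this idea can be sketched as follows (a simplified sketch of one pruning step under assumed per-image bounds; the talk's TkP method maintains these bounds incrementally):

```python
import heapq

def prune_candidates(lower_bounds, max_remaining_gain, k):
    # lower_bounds[i]: score accumulated so far for image i.
    # max_remaining_gain[i]: the most image i could still gain from the
    # features not yet accumulated (upper bound = lower + gain).
    # An image can only reach the top-k if its upper bound is at least
    # the current k-th best lower bound; everything else is pruned.
    kth_best = heapq.nlargest(k, lower_bounds)[-1]
    return [i for i, (lo, gain) in
            enumerate(zip(lower_bounds, max_remaining_gain))
            if lo + gain >= kth_best]
```

A smaller k raises the bar earlier, which matches the observation on the later slides that pruning is most aggressive for small k.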
95. Distribution of weights and pruning rate
[Figure 2 of [Rastegari et al., 2011]: (a) Distribution of absolute weight values (normalized) vs. dimension for L1-LR, L2-SVM, and FGM, after sorting the weight magnitudes; TkP runs faster with sparse, highly skewed weight values. (b) Pruning rate of TkP (% of images pruned vs. number of iterations d) for the various classification models and different values of k (k = 10, 3000); a smaller value of k allows the method to eliminate more images from consideration at a very early stage. Features are considered in descending order of |w_i|.]
96. Performance evaluation on ImageNet (10M images) [Rastegari et al., 2011]
[Plot: Precision @ 10 (%) vs. search time per query (seconds) for TkP L1-LR, TkP L2-SVM, inverted index L1-LR, inverted index L2-SVM; each curve is obtained by varying sparsity through C in the training objective E(w) = R(w) + (C/N) Σ_n L(w; Φ_n, y_n)]
• k = 10
• Performance averaged over 400 object classes used as queries
• 10 training examples per query class
• Database includes 450 images of the query class and 9.7M images of other classes
• Prec@10 of a random classifier is 0.005%
97. Alternative search strategy: approximate ranking
• Key-idea: approximate the score function with a measure that can be computed (more) efficiently (related to approximate NN search: [Shakhnarovich et al., 2006; Grauman and Darrell, 2007; Chum et al., 2008])
• Approximate ranking via vector quantization:
w^T Φ ≈ w^T q(Φ)
where q(.) is a quantizer returning the cluster centroid nearest to Φ
• Problem:
- to approximate the score well we need a fine quantization
- the dimensionality of our space is D = 2659: too large to enable a fine quantization using k-means clustering
98. Product quantization
Product quantization for nearest neighbor search [Jegou et al., 2011]
• Split the feature vector Φ into v subvectors: Φ = [Φ_1 | Φ_2 | ... | Φ_v]
• Subvectors are quantized separately:
q(Φ) = [q_1(Φ_1) | q_2(Φ_2) | ... | q_v(Φ_v)]
where each q_i(.) is learned by k-means in a space of dimensionality D/v, with a limited number of centroids
• Example from [Jegou et al., 2011]: a 128-dimensional vector split into 8 subvectors of dimension 16; each subvector is quantized with 2^8 = 256 centroids (8 bits), yielding a 64-bit quantization index
99. Efficient approximate scoring
w^T Φ ≈ w^T q(Φ) = Σ_{j=1}^v w_j^T q_j(Φ_j)
which can be precomputed and stored in a look-up table.
1. Fill the look-up table: split w into sub-blocks w_1, ..., w_v; for each sub-block j, precompute the partial inner products s_{j1}, s_{j2}, ..., s_{jr} of w_j with each of the r centroids of sub-quantizer q_j.
2. Score each quantized vector q(Φ) in the database using the look-up table:
w^T q(Φ) = w_1^T q_1(Φ_1) + w_2^T q_2(Φ_2) + ... + w_v^T q_v(Φ_v)
Only v additions per image!
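The two steps above can be sketched as follows (a minimal numpy sketch; the centroid layout and function names are illustrative assumptions):

```python
import numpy as np

def build_lookup_table(w_blocks, centroids):
    # Step 1: s[j][k] = w_j^T c_{j,k}, the partial inner product of the
    # j-th weight sub-block with the k-th centroid of sub-quantizer j.
    # Done once per query, before touching the database.
    return [[float(np.dot(wj, c)) for c in cents]
            for wj, cents in zip(w_blocks, centroids)]

def approx_score(codes, table):
    # Step 2: a database image is stored only as its v centroid codes;
    # w^T q(Phi) = sum_j s[j][codes[j]] -- v look-ups and additions
    # per image, regardless of the original dimensionality D.
    return sum(table[j][c] for j, c in enumerate(codes))
```

With v = 16 sub-blocks and 256 centroids each, an image costs 16 bytes of storage and 16 additions to score, which is what makes the 10M-image experiments on the surrounding slides feasible.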
105. Choice of parameters [Rastegari et al., 2011]
• Dimensionality is first reduced with PCA from D = 2659 to D' << D
• How do we choose D', v (number of sub-blocks), and r (number of centroids per sub-block)?
• [Plot: effect of parameter choices on a database of 150K images — Precision @ 10 (%) vs. search time per query (seconds) for D' ∈ {128, 256, 512} and (v, r) ∈ {(16, 2^8), (32, 2^8), (64, 2^6), (64, 2^8), (128, 2^8), (256, 2^6), (256, 2^8)}]