This document provides an overview of visual object recognition. It begins with an introduction explaining why object recognition is a challenging problem and discusses the importance of recognizing objects from different viewpoints, scales, textures, etc. It then describes how recognition can be achieved using local image features rather than analyzing the whole object. The document focuses on the Scale Invariant Feature Transform (SIFT) approach, outlining the key stages of detecting local features, generating invariant representations of those features, and verifying matches between images based on geometric configuration. Overall, the summary provides a high-level view of object recognition techniques with a focus on the seminal SIFT method.
1. Visual Object Recognition
Perceptual Computing Seminar
Sergio Escalera, Xavier Baró, Jordi Vitrià, Petia Radeva, Oriol Pujol
BCN Perceptual Computing Lab
2. Index
1. Introduction
2. Recognition with Local Features: Basics
3. Invariant Representations: SIFT
4. Recognition as a Classification Problem: FERNS
5. Very Large Databases: Hashing
Visual Object Recognition Perceptual Computing Seminar Page 2
3. Introduction
The recognition of object categories in images is one of the most challenging problems in computer vision, especially when the number of categories is large.

Humans are able to recognize thousands of object types, whereas most existing object recognition systems are trained to recognize only a few.
4. Introduction
Invariance to viewpoint, illumination, “shape”, color, scale, texture, etc.
5. Introduction
Why do we care about recognition? (theoretical question)
Perception of function: we can perceive the 3D shape, texture, and material properties without knowing about objects. But the concept of a category also encapsulates information about what we can do with those objects.

Li Fei‐Fei, Stanford; Rob Fergus, NYU; Antonio Torralba, MIT. Recognizing and Learning Object Categories: ICCV 2009 Kyoto, Short Course, September 24, 2009.
6. Introduction
Why is it hard?

[Figure: “Find the chair in this image” — a chair template (“This is a chair”) and the output of correlation.]

Li Fei‐Fei, Stanford; Rob Fergus, NYU; Antonio Torralba, MIT. Recognizing and Learning Object Categories: ICCV 2009 Kyoto, Short Course, September 24, 2009.
7. Introduction
Why is it hard?

[Figure: “Find the chair in this image” — the correlation output is pretty much garbage; simple template matching is not going to make it.]

Li Fei‐Fei, Stanford; Rob Fergus, NYU; Antonio Torralba, MIT. Recognizing and Learning Object Categories: ICCV 2009 Kyoto, Short Course, September 24, 2009.
8. Introduction
Why do we care about recognition? (practical question)
9. Introduction
Why do we care about recognition? (practical question)
10. Introduction
Why do we care about recognition (practical question)?
Query Results from 5k Flickr images (demo available for 100k set)
James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, Andrew Zisserman: Object retrieval with large vocabularies and fast spatial matching. CVPR 2007.
11. Recognition with Local Features
It is known that the visual system can use local, informative image «fragments» of a given object, rather than the whole object, to classify it into a familiar category.

This approach has some advantages over holistic methods...
13. Recognition with Local Features
Jay Hegde, Evgeniy Bart, and Daniel Kersten, "Fragment‐based learning of visual object categories", Current Biology, 2008.
14. Recognition with Local Features
The most basic approach is called the “bag of words” approach (it was inspired by techniques used by the natural language processing community).
15. Recognition with Local Features
Assumptions:
• Independent features.
• Histogram representation (generic/class‐based vocabulary, etc.)

[Figure: an image is represented as a histogram over a vocabulary of fragments.]

Li Fei‐Fei, Stanford; Rob Fergus, NYU; Antonio Torralba, MIT. Recognizing and Learning Object Categories: ICCV 2009 Kyoto, Short Course, September 24, 2009.
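The histogram representation above can be sketched in a few lines (the toy descriptors and the precomputed vocabulary of fragment centers are assumptions for illustration):

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    # Quantize each local fragment descriptor to its nearest visual word,
    # then build a normalized histogram over the vocabulary.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
vocab = rng.normal(size=(5, 8))    # 5 visual words, toy 8-D descriptors
desc = rng.normal(size=(20, 8))    # 20 local fragments from one image
h = bow_histogram(desc, vocab)
print(h.shape, round(float(h.sum()), 6))   # → (5,) 1.0
```

The independence assumption on the slide is what later lets a classifier treat the bins of this histogram separately.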
16. Recognition with Local Features
A more advanced approach involves several steps:
• Stage 0: Find image locations where we can reliably find correspondences with other images.
• Stage 1: Image content is transformed into local features (that are invariant to translation, rotation, and scale).
• Stage 2: Verify whether the matches belong to a consistent configuration.

Slide credit: David Lowe
17. SIFT
A wonderful example of these stages can be found in David Lowe’s (2004) “Distinctive image features from scale‐invariant keypoints” paper, which describes the development and refinement of his Scale Invariant Feature Transform (SIFT).

[Figure: local features, e.g. SIFT.]
18. Recognition with Local Features
Which local features?

Slide credit: A. Efros
19. SIFT
Stage 0: How can we find image locations where we can reliably find
correspondences with other images?
A “good” location has one stable, sharp extremum.

[Figure: three 1D profiles f(x) — one with a single sharp extremum (good), two that are flat or have broad extrema (bad).]
21. SIFT
Stage 0: How can we find image locations where we can reliably find
correspondences with other images?
How to compute extrema at a given scale:
1) We apply a Gaussian filter:
2) We compute a difference‐of‐Gaussians
3) We look for 3D extrema in the resulting structure.
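The three steps might look as follows (a sketch: the filter scales and the contrast threshold are assumptions, not Lowe's exact parameters):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_extrema(image, sigmas, contrast=0.01):
    # 1) Apply a Gaussian filter at several scales.
    blurred = np.stack([gaussian_filter(image, s) for s in sigmas])
    # 2) Compute differences of Gaussians between adjacent scales.
    dog = blurred[1:] - blurred[:-1]
    # 3) Look for 3D extrema in the (scale, y, x) stack, dropping
    #    low-contrast responses so flat regions do not qualify.
    is_max = dog == maximum_filter(dog, size=3)
    is_min = dog == minimum_filter(dog, size=3)
    keep = np.abs(dog) > contrast
    return np.argwhere((is_max | is_min) & keep)

y, x = np.mgrid[0:32, 0:32]
blob = np.exp(-((x - 16) ** 2 + (y - 16) ** 2) / 8.0)   # one bright blob
pts = dog_extrema(blob, sigmas=[1, 2, 4])
print(len(pts) > 0)   # → True (an extremum is found near the blob center)
```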
24. SIFT
Stage 1: Image content is transformed into local features (that are invariant
to translation, rotation, and scale).
In addition to dealing with scale changes, we need to
deal with (at least) in‐plane image rotation.
One way to deal with this problem is to design
descriptors that are rotationally invariant, but such
descriptors have poor discriminability, i.e. they map different‐looking patches to the same descriptor.
25. SIFT
A better method is to estimate a dominant
orientation at each detected keypoint.
1. Calculate a histogram of local gradients in the window.
2. Take the dominant gradient orientation as “up”.
3. Rotate the local area before computing the descriptor.
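The first two steps can be sketched as follows (simplified: a single coarse histogram, with no Gaussian weighting, no 80% secondary peaks, and no parabolic refinement, all of which Lowe adds):

```python
import numpy as np

def dominant_orientation(patch, nbins=36):
    # 1) Histogram of local gradient orientations, weighted by magnitude.
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)
    bins = (ang / (2 * np.pi) * nbins).astype(int) % nbins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=nbins)
    # 2) The peak bin gives the dominant orientation ("up").
    return hist.argmax() * 2 * np.pi / nbins

ramp = np.tile(np.arange(16.0), (16, 1))   # intensity increases left to right
print(dominant_orientation(ramp))          # → 0.0 (gradient points along +x)
```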
26. SIFT
Lowe:
• computes a 36‐bin histogram of edge orientations
weighted by both gradient magnitude and Gaussian
distance to the center,
• finds all peaks within 80% of the global maximum,
and then
• computes a more accurate orientation estimate
using a 3‐bin parabolic fit.
31. SIFT
Even after compensating for translation, rotation, and scale changes, the local appearance of image patches will usually still vary from image to image.

How can we make the descriptor that we match more invariant to such changes, while still preserving discriminability between different (non‐corresponding) patches?
32. SIFT
SIFT features are formed by computing the gradient at each pixel in a 16x16 window around the detected keypoint, using the appropriate level of the Gaussian pyramid at which the keypoint was detected.

The gradient magnitudes are downweighted by a Gaussian fall‐off function in order to reduce the influence of gradients far from the center, as these are more affected by small misregistrations.
33. SIFT
In each 4x4 quadrant, a gradient orientation histogram is formed by (conceptually) adding the weighted gradient value to one of 8 orientation histogram bins.
34. SIFT
The resulting 128 non‐negative values form a raw version of the SIFT descriptor vector.

To reduce the effects of contrast/gain (additive variations are already removed by the gradient), the 128‐D vector is normalized to unit length.
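The descriptor construction on the last three slides can be condensed into a simplified sketch (it omits the Gaussian weighting, trilinear interpolation, and value clipping that full SIFT adds):

```python
import numpy as np

def sift_descriptor(window):
    # Gradients over a 16x16 window around the keypoint.
    assert window.shape == (16, 16)
    gy, gx = np.gradient(window.astype(float))
    mag = np.hypot(gx, gy)
    bins = ((np.arctan2(gy, gx) % (2 * np.pi)) / (2 * np.pi) * 8).astype(int) % 8
    # 4x4 cells x 8 orientation bins = 128 non-negative values.
    desc = np.zeros((4, 4, 8))
    for i in range(16):
        for j in range(16):
            desc[i // 4, j // 4, bins[i, j]] += mag[i, j]
    v = desc.ravel()
    return v / (np.linalg.norm(v) + 1e-12)   # normalize to unit length

rng = np.random.default_rng(0)
d = sift_descriptor(rng.normal(size=(16, 16)))
print(d.shape, round(float(np.linalg.norm(d)), 6))   # → (128,) 1.0
```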
35. SIFT
Once we have extracted features and their descriptors from two or more images, the next step is to establish some preliminary feature matches between these images.
36. SIFT
Once we have extracted features and their descriptors from two or more images, the next step is to establish some preliminary feature matches between these images.

SIFT uses a nearest neighbor classifier with a distance ratio matching criterion. We can define this nearest neighbor distance ratio as

    NNDR = d1 / d2 = ‖DA − DB‖ / ‖DA − DC‖,

where d1 and d2 are the nearest and second nearest neighbor distances, and DA, DB, DC are the target descriptor along with its closest two neighbors.
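A sketch of the ratio test with brute-force distances (the 0.8 threshold is in the spirit of Lowe's paper but should be read as an assumed parameter here):

```python
import numpy as np

def ratio_test_matches(desc1, desc2, max_ratio=0.8):
    # For each descriptor, accept its nearest neighbor only when it is
    # clearly closer than the second nearest (small d1/d2 ratio).
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        if dists[j1] < max_ratio * dists[j2]:
            matches.append((i, j1))
    return matches

a = np.array([[0.0, 0.0], [4.0, 4.0]])
b = np.array([[0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(ratio_test_matches(a, b))   # → [(0, 0)]
```

The second descriptor in `a` is rejected: its two candidate neighbors in `b` are nearly equidistant, so the match is ambiguous.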
38. SIFT
Linear method:
The simplest way to find all corresponding
feature points is to compare all features
against all other features in each pair of
potentially matching images.
Unfortunately, this is quadratic in the number of extracted features, which makes it impractical for some applications.
39. SIFT
Nearest‐neighbor matching is the major
computational bottleneck:
• Linear search performs d·n² operations for n feature points and d dimensions.
• No exact NN methods are faster than linear search for d > 10.
• Approximate methods can be much faster, but at the cost of missing some correct matches. The failure rate gets worse for large datasets.
40. SIFT
A better approach is to devise an indexing structure
such as a multi‐dimensional search tree or a hash
table to rapidly search for features near a given
feature.
For extremely large databases (millions of images or
more), even more efficient structures based on
ideas from document retrieval (e.g., vocabulary
trees) can be used.
41. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
The first step is to establish a set of putative
correspondences.
43. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
Once we have some hypothetical (putative) matches, we can use geometric alignment to verify which matches are inliers and which ones are outliers.
44. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
• Extract features
• Compute putative matches
45. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
• Loop:
– Hypothesize a transformation T (using a small group of putative matches that are related by T)
46. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
• Loop:
– Hypothesize transformation T (small group of putative matches that
are related by T)
– Verify transformation (search for other matches consistent with T)
47. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
48. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
2D transformation models:
• Similarity (translation, scale, rotation)
• Affine
• Projective (homography)
49. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
Fitting an affine transformation (given the point
correspondences):
[Figure: corresponding points (xi, yi) and (xi′, yi′) in two images.]

Slide credit: S. Lazebnik
50. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
Fitting an affine transformation (given the point
correspondences):
    [xi′]   [m1 m2] [xi]   [t1]
    [yi′] = [m3 m4] [yi] + [t2]

which can be rewritten as a linear system in the unknowns:

    [xi  yi  0   0   1  0] [m1 m2 m3 m4 t1 t2]ᵀ = [xi′]
    [0   0   xi  yi  0  1]                        [yi′]

Slide credit: S. Lazebnik
51. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
Fitting an affine transformation (given the point
correspondences):
• Linear system with six unknowns.
• Each match gives us two linearly independent equations: we need at least three matches to solve for the transformation parameters.
• We can solve Ax = b using the pseudo‐inverse: x = (AᵀA)⁻¹Aᵀb.

Slide credit: S. Lazebnik
52. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
Fitting an affine transformation (given the point
correspondences):
• Linear system with six unknowns.
• Each match gives us two linearly independent equations: we need at least three matches to solve for the transformation parameters.
• We can solve Ax = b using the pseudo‐inverse: x = (AᵀA)⁻¹Aᵀb.

Slide credit: S. Lazebnik
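The linear system above can be assembled and solved directly (a sketch; `numpy.linalg.lstsq` computes the same least-squares solution as the pseudo-inverse):

```python
import numpy as np

def fit_affine(src, dst):
    # Each correspondence (x, y) -> (xp, yp) contributes two rows of A x = b;
    # solve for x = (m1, m2, m3, m4, t1, t2) in the least-squares sense.
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, y, 0, 0, 1, 0]); b.append(xp)
        A.append([0, 0, x, y, 0, 1]); b.append(yp)
    params, *_ = np.linalg.lstsq(np.array(A, float), np.array(b, float), rcond=None)
    m1, m2, m3, m4, t1, t2 = params
    return np.array([[m1, m2], [m3, m4]]), np.array([t1, t2])

# three exact correspondences under a known affine map
src = [(0, 0), (1, 0), (0, 1)]
M_true = np.array([[2.0, 0.0], [0.0, 3.0]]); t_true = np.array([1.0, -1.0])
dst = [M_true @ np.array(p) + t_true for p in src]
M, t = fit_affine(src, dst)
print(np.allclose(M, M_true), np.allclose(t, t_true))   # → True True
```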
53. SIFT
Stage 2: Verify if they belong to a consistent
configuration.
The process of selecting a small set of seed matches and then verifying a larger set is often called random sampling, or RANSAC.
54. RANSAC
RANSAC was originally formulated in Martin A. Fischler and Robert C. Bolles (June 1981). "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography". Comm. of the ACM 24: 381–395.
55. RANSAC
“We approached the fitting problem in the opposite way from most previous techniques. Instead of averaging all the measurements and then trying to throw out bad ones, we used the smallest number of measurements to compute a model’s unknown parameters and then evaluated the instantiated model by counting the number of consistent samples.”

From “RANSAC: An Historical Perspective”, Bob Bolles & Marty Fischler, 2006.
56. RANSAC
It’s easy to understand and it’s effective
• It helps solve a common problem (i.e., filter out gross errors
introduced by automatic techniques)
• The number of trials needed to “guarantee” a high level of success (e.g., 99.99% probability) is surprisingly small.
• The dramatic increase in computation speed made it possible to run a large number of trials (100s or 1000s).
• The algorithm can stop as soon as a good match is computed (unlike Hough techniques, which typically compute a large number of examples and then identify matches).

From “RANSAC: An Historical Perspective”, Bob Bolles & Marty Fischler, 2006.
57. RANSAC
The basic idea is to repeat the following process M times:
1. A model is fitted to the hypothetical inliers, i.e. all free parameters of the model are reconstructed from the data set.
2. All other data are then tested against the fitted model and, if a point fits the estimated model well, it is also considered a hypothetical inlier.
3. The estimated model is reasonably good if sufficiently many points have been classified as hypothetical inliers.
4. The model is re‐estimated from all hypothetical inliers, because it has only been estimated from the initial set of hypothetical inliers.
5. Finally, the model is evaluated by estimating the error of the inliers relative to the model.

This procedure is repeated a fixed number of times, each time producing either a model which is rejected because too few points are classified as inliers, or a refined model together with a corresponding error measure. In the latter case, we keep the refined model if its error is lower than that of the last saved model.

From “RANSAC: An Historical Perspective”, Bob Bolles & Marty Fischler, 2006.
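The five steps can be sketched for a toy line-fitting problem (the model, thresholds, and trial count are all assumptions for illustration, not part of the original formulation):

```python
import numpy as np

def ransac_line(points, n_trials=100, thresh=0.1, min_inliers=10, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_trials):
        # 1) Fit a model to a minimal hypothetical-inlier set (two points).
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # 2) Test all other data against the model.
        resid = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = resid < thresh
        # 3) The model is reasonable only if enough points are inliers.
        if inliers.sum() >= min_inliers:
            # 4) Re-estimate the model from all hypothetical inliers.
            a, b = np.polyfit(points[inliers, 0], points[inliers, 1], 1)
            # 5) Evaluate the refined model by its mean inlier error.
            err = np.abs(points[inliers, 1] - (a * points[inliers, 0] + b)).mean()
            if best is None or err < best[2]:
                best = (a, b, err)
    return best

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
pts = np.column_stack([x, 2 * x + 1 + rng.normal(0, 0.02, 50)])
pts[:10] = rng.uniform(0, 5, size=(10, 2))      # gross outliers
a, b, _ = ransac_line(pts)
print(abs(a - 2) < 0.2, abs(b - 1) < 0.2)       # slope ≈ 2, intercept ≈ 1
```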
74. Matching and Classification
SIFT allows reliable real‐time recognition, but at a computational cost that severely limits the number of points that can be handled.

A standard implementation requires 1 ms per feature point, which limits the number of feature points to 50 per frame if one requires frame‐rate performance.
75. Matching and Classification
An alternative is to rely on statistical learning techniques to model the set of possible appearances of a patch.

The major challenge is to use simple models to allow for real‐time, efficient recognition.
76. Matching and Classification
Can we match keypoints using simpler features, without intensive preprocessing?

We will assume that we have the possibility to train a classifier for each keypoint class.
77. Matching and Classification
Simple binary features: each test compares the intensities of two pixels m_i,1 and m_i,2 around the keypoint:

    f_i = 1 if I(m_i,1) < I(m_i,2), and 0 otherwise.
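The binary test is tiny in code (the pixel coordinates below are arbitrary illustrative choices):

```python
import numpy as np

def binary_feature(patch, p1, p2):
    # f_i = 1 if I(p1) < I(p2), 0 otherwise.
    return 1 if patch[p1] < patch[p2] else 0

patch = np.array([[10, 20],
                  [30, 40]])
print(binary_feature(patch, (0, 0), (0, 1)))   # 10 < 20 → 1
print(binary_feature(patch, (1, 1), (0, 0)))   # 40 < 10 → 0
```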
78. Matching and Classification
Without intensive preprocessing: we can synthetically generate the set of a keypoint’s possible appearances under various perspective, lighting, and noise conditions.
79. Matching and Classification
FERN Formulation

We model the class conditional probabilities of a large number of binary features, which are estimated in a training phase.

At run time, these probabilities are used to select the best match for a given image patch.
80. Matching and Classification
FERN Formulation

f_i: binary feature.
N_f: total number of features in the model.
C_k: class representing all views of an image patch around a keypoint.

Given f_1, ..., f_Nf, select the class k such that

    k̂ = argmax_k P(C_k | f_1, f_2, ..., f_Nf) = argmax_k P(f_1, f_2, ..., f_Nf | C_k)

(the two are equivalent under a uniform prior over classes).

Mustafa Ozuysal, Michael Calonder, Vincent Lepetit, Pascal Fua, "Fast Keypoint Recognition Using Random Ferns," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.
81. Matching and Classification
FERN Formulation

However, it is not practical to model the joint distribution of all features. We group features into small sets (ferns) and assume independence between these sets (a semi‐naïve Bayesian classifier):

F_j: a fern is defined to be a set of S binary features {f_r, ..., f_(r+S−1)}.
M is the number of ferns, so N_f = S × M.
82. Matching and Classification
FERN Formulation

    P(f_1, f_2, ..., f_Nf | C_k)                              → 2^Nf parameters!

    P(f_1, f_2, ..., f_Nf | C_k) = ∏_(i=1..Nf) P(f_i | C_k)   → Nf parameters, but too simple.

    P(f_1, f_2, ..., f_Nf | C_k) = ∏_(j=1..M) P(F_j | C_k)    → M · 2^S parameters.
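Putting the third factorization to work, here is a sketch of a fern classifier in semi-naive Bayes form (the toy training data, test layout, and +1 smoothing are all assumptions made for illustration):

```python
import numpy as np

def fern_index(patch, tests):
    # Pack a fern's S binary tests (f_i = [I(p1) < I(p2)]) into an
    # integer in [0, 2^S).
    idx = 0
    for p1, p2 in tests:
        idx = (idx << 1) | (1 if patch[p1] < patch[p2] else 0)
    return idx

def train_ferns(samples_per_class, ferns, S):
    # Estimate P(F_j | C_k) as smoothed multinomial counts of fern outputs.
    counts = np.ones((len(ferns), len(samples_per_class), 2 ** S))
    for k, patches in enumerate(samples_per_class):
        for patch in patches:
            for j, tests in enumerate(ferns):
                counts[j, k, fern_index(patch, tests)] += 1
    return counts / counts.sum(axis=2, keepdims=True)

def classify(patch, ferns, probs):
    # Semi-naive Bayes: product over ferns, computed as a sum of logs.
    logp = sum(np.log(probs[j, :, fern_index(patch, tests)])
               for j, tests in enumerate(ferns))
    return int(np.argmax(logp))

rng = np.random.default_rng(0)
S, M = 4, 5
ferns = []
for _ in range(M):
    tests = []
    for _ in range(S):
        c1, c2 = rng.choice(8, size=2, replace=False)   # distinct columns
        tests.append(((int(rng.integers(8)), int(c1)),
                      (int(rng.integers(8)), int(c2))))
    ferns.append(tests)

base0 = np.tile(np.arange(8.0), (8, 1))   # class 0 brightens to the right
base1 = base0[:, ::-1]                    # class 1 brightens to the left
train = [[b + rng.normal(0, 0.1, b.shape) for _ in range(30)] for b in (base0, base1)]
probs = train_ferns(train, ferns, S)
print(classify(base0, ferns, probs), classify(base1, ferns, probs))   # → 0 1
```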
83. Matching and Classification
FERN Implementation

We generate a random set of binary features. A single binary feature outputs a binary number (2 possibilities); with S = 3 features there are 8 possibilities.

A fern with S nodes outputs a number between 0 and 2^S − 1.
84. Matching and Classification
FERN Implementation

When we have multiple patches of the same class, we can model the output of a fern with a multinomial distribution (one probability for each possible output).
85. Matching and Classification
Slide Credit: V. Lepetit
86. Matching and Classification
[Figure: a fern’s binary tests evaluate to a bit string (e.g. 0, 1, 1) that indexes one histogram bin.]

Slide Credit: V. Lepetit
89. Matching and Classification
Slide Credit: V. Lepetit
90. Matching and Classification
Normalize so that the probabilities over all possible fern outputs (000, 001, ..., 111) sum to one:

    Σ P(f_1, f_2, ..., f_n | C = c_i) = 1

Slide Credit: V. Lepetit
91. Matching and Classification
FERN Implementation

At the end of the training we have distributions over possible fern outputs for each class.
92. Matching and Classification
FERN Implementation

To recognize a new patch, its fern outputs select rows of the stored distributions, and these are then combined assuming independence between the distributions.
99. Matching and Classification
The FERN technique speeds up keypoint matching, but the training is slow and performed offline.

Hence, it is not suited for applications that require real‐time online learning or incremental addition of arbitrary numbers of keypoints (e.g. SLAM).
100. Matching and Classification
This limitation can be removed if we train a FERN classifier to recognize a number of keypoints extracted from a reference database; all other keypoints are then characterized in terms of their response to these classification ferns (their signature).
101. Matching and Classification
M. Calonder, V. Lepetit, and P. Fua, Keypoint Signatures for Fast Learning and Recognition. In Proceedings of European Conference on Computer Vision, 2008.
102. Matching and Classification
It can be shown empirically that these signatures are stable under changes in viewing conditions.

Signatures are sparse in nature if we apply a threshold function.

Signatures do not need a training phase and scale well with the number of classes (nearest neighbor).
103. Matching and Classification
However, matching signatures still involves many more elementary operations than absolutely necessary.

Moreover, evaluating the signatures requires storing many distributions of the same size as the signatures themselves and, therefore, large amounts of memory.
104. Matching and Classification
The full response vector r(p) over all J ferns stacks the vectors storing the probability that p is one of the N reference points, where Z is a normalizer such that the elements of r(p) sum to one.

In practice, when p truly corresponds to one of the reference keypoints, r(p) contains one element that is close to one while all others are close to zero. Otherwise, it contains a few relatively large values, corresponding to reference keypoints that are similar in appearance, and small values elsewhere.
105. Matching and Classification
We can compute a sparse signature by applying a point‐wise threshold function with a value θ.

The result is an N‐dimensional vector with only a few non‐zero elements that is mostly invariant to different imaging conditions, and therefore makes a useful descriptor for matching purposes.
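The thresholding itself is a one-liner (θ below is an arbitrary illustrative value, not one from the paper):

```python
import numpy as np

def sparse_signature(response, theta=0.1):
    # Zero out every response not exceeding the threshold θ.
    return np.where(response > theta, response, 0.0)

r = np.array([0.02, 0.85, 0.01, 0.12, 0.00])
s = sparse_signature(r)
print(s)   # keeps only the 0.85 and 0.12 entries
```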
106. Matching and Classification
[Figure: the patch is passed through J ferns, producing vectors that store the probability that p is one of the N reference points.]

Typical parameters: J = 50; d = 10; N = 500.
107. Matching and Classification
Typical parameters: J = 50; d = 10; N = 500.

We need, for each of the 2^d leaves in each of the J ferns, an N‐dimensional vector of floats.

The total memory requirement is M = b·J·2^d·N bytes, where b is the number of bytes needed to store a float (4). In practice, about 100 MB!
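The arithmetic behind the ~100 MB figure, assuming 4-byte single-precision floats:

```python
# Memory for storing an N-dimensional float vector at every fern leaf.
J, d, N, b = 50, 10, 500, 4        # ferns, fern depth, reference points, bytes/float
M_bytes = b * J * 2**d * N
print(M_bytes / 1e6)               # → 102.4 (about 100 MB)
```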
108. Matching and Classification
Compressive Sensing literature:
• High‐dimensional sparse vectors can be reconstructed from their linear projections into much lower‐dimensional spaces.
• The Johnson–Lindenstrauss lemma states that a small set of points in a high‐dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.
109. Matching and Classification
Many kinds of matrices can be used for this purpose.

Random Ortho‐Projection (ROP) matrices are a good choice and can be easily constructed by applying a Gram–Schmidt orthonormalization process to a random matrix.
110. Matching and Classification
In mathematics, the Gram–Schmidt process is a method for orthonormalizing a set of vectors in an inner product space, most commonly the Euclidean space Rn.

The Gram–Schmidt process takes a finite, linearly independent set S = {v1, …, vk} for k ≤ n and generates an orthogonal set S' = {u1, …, uk} that spans the same k‐dimensional subspace of Rn as S.
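A sketch of building such a ROP matrix (QR factorization performs the Gram–Schmidt orthonormalization numerically; the sizes are illustrative):

```python
import numpy as np

def rop_matrix(m, n, seed=0):
    # Orthonormalize a random n x m matrix; the transposed Q gives
    # m orthonormal rows of length n (requires m <= n).
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(n, m)))
    return Q.T

P = rop_matrix(16, 512)
print(P.shape, np.allclose(P @ P.T, np.eye(16)))   # → (16, 512) True
```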
111. Matching and Classification
M. Calonder, V. Lepetit, P. Fua, K. Konolige, J. Bowman, and P. Mihelich, Compact Signatures for High‐speed Interest Point Description and Matching. In Proceedings of International Conference on Computer Vision, 2009.
112. Matching and Classification
M. Calonder, V. Lepetit, P. Fua, K. Konolige, J. Bowman, and P. Mihelich, Compact Signatures for High‐speed Interest Point Description and Matching. In Proceedings of International Conference on Computer Vision, 2009.
113. Matching and Classification
M. Calonder, V. Lepetit, P. Fua, K. Konolige, J. Bowman, and P. Mihelich, Compact Signatures for High‐speed Interest Point Description and Matching. In Proceedings of International Conference on Computer Vision, 2009.
114. Matching and Classification
This approach reduces the memory requirement for storing the models: for N=512, M=176, the requirements change from 93.75 MB to 175 B!

The CPU time is 6.3 ms for an exhaustive NN matching of 256 points (256x256).
117. Min HASH
Let’s suppose that we choose a LARGE bag‐of‐words representation of our images and that we use a binary histogram.
118. Min HASH
Given two different images, we can
compute their histogram intersection:
120. Min HASH
Then we can define a set similarity measure in the following way: the number of keypoints both images have in common divided by the total number of keypoints that are present in either image.
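For binary histograms this measure is the Jaccard similarity; a direct sketch:

```python
def set_similarity(h1, h2):
    # Words present in both images over words present in either image.
    inter = sum(1 for a, b in zip(h1, h2) if a and b)
    union = sum(1 for a, b in zip(h1, h2) if a or b)
    return inter / union if union else 0.0

print(set_similarity([1, 0, 1, 0], [1, 1, 0, 0]))   # 1 shared / 3 present ≈ 0.333
```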
122. Min HASH
We can perform clustering or matching of an unordered set of images with this measure, but this can be used only with a limited amount of data!

The method requires

    Σ_(i=1..w) (d_i choose 2)

similarity evaluations, where w is the size of the vocabulary and d_i is the number of regions assigned to the i‐th visual word. A commonly used vocabulary size is w = 1,000,000.
123. Min HASH
We can perform clustering or matching of an unordered set of images with this measure, but this can be used only with a limited amount of data!

Observation: the histograms for an image are highly sparse!
124. Min HASH
The key idea of min‐hash is to map (“hash”) each row/histogram to a small amount of data Sig(A) (the signature) such that:
• Sig(A) is small enough.
• Rows A1 and A2 are highly similar if Sig(A1) is highly similar to Sig(A2).
125. Min HASH
Useful convention: we will refer to columns as being of four types:

    A1:   1 0 1 0
    A2:   1 1 0 0
    Type: a b c d

We will also use “a” for the number of columns of type a.

Notes:
• Sim(A1, A2) = a / (a + b + c)
• Most columns are of type d.
126. Min HASH
• Imagine the columns permuted randomly in order.
• Hash each row A to h(A), the number of the first column in which row A has a 1.

    A1: 1 0 0 1 0  →π→  0 1 0 0 1   h(A1) = 2
    A2: 1 0 0 0 0  →π→  0 1 0 0 0   h(A2) = 2

The probability that h(A1) = h(A2) is a / (a + b + c) = Sim(A1, A2): the hashes agree if the first column with a 1 is of type a, and disagree if it is of type b or c.
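The agreement probability can be checked empirically (the toy rows and trial count are illustrative assumptions):

```python
import random

def min_hash(row, perm):
    # h(A): smallest permuted index among the columns where the row has a 1.
    return min(perm[c] for c, bit in enumerate(row) if bit)

A1 = [1, 0, 0, 1, 0, 1, 0, 0]
A2 = [1, 1, 0, 0, 0, 1, 0, 0]
a = sum(1 for u, v in zip(A1, A2) if u and v)     # type-a columns
bc = sum(1 for u, v in zip(A1, A2) if u != v)     # type-b and type-c columns
sim = a / (a + bc)

random.seed(0)
trials, agree = 20000, 0
for _ in range(trials):
    perm = list(range(len(A1)))
    random.shuffle(perm)
    agree += min_hash(A1, perm) == min_hash(A2, perm)
print(sim, round(agree / trials, 2))   # empirical agreement ≈ Sim(A1, A2) = 0.5
```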
127. Min HASH
If we repeat the experiment with a new permutation of columns a large number of times, say 512, we get a signature consisting of 512 column numbers for each row.

The “similarity” of these lists (the fraction of positions in which they agree) will be very close to the similarity of the rows (similar signatures mean similar rows!).
128. Min HASH
In fact, it is not necessary to permute the columns: we can hash each column index with 512 different hash functions and keep, for each row and each hash function, the lowest hash value among the columns in which that row has a 1. Then we look for coincidences.

    row: 1 0 0 1 0
    h1:  5 1 3 2 4   h1(row) = 2
    h2:  1 2 5 3 4   h2(row) = 1
    h3:  3 4 1 5 2   h3(row) = 3
    h4:  2 5 4 1 3   h4(row) = 1

The four values (2, 1, 3, 1) form the row’s signature.
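The slide's example in code (the four hash tables are copied from the slide):

```python
def minhash_signature(row, hash_tables):
    # For each hash function, keep the lowest hash value among the
    # columns in which the row has a 1.
    return [min(h[c] for c, bit in enumerate(row) if bit) for h in hash_tables]

row = [1, 0, 0, 1, 0]
hashes = [[5, 1, 3, 2, 4],
          [1, 2, 5, 3, 4],
          [3, 4, 1, 5, 2],
          [2, 5, 4, 1, 3]]
print(minhash_signature(row, hashes))   # → [2, 1, 3, 1]
```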
130. Min Hash
For efficient retrieval, the min hashes are grouped into n‐tuples. In this example, we can form the following 2‐tuples:

    h1(row) = 1, 2, 1
    h2(row) = 2, 1, 2
    h3(row) = 1, 2, 1
    h4(row) = 3, 2, 3

The retrieval procedure then estimates the full similarity only for those image pairs that have at least h identical tuples out of k tuples.
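A sketch of the tuple-based retrieval step (the toy signatures are illustrative; `min_hits` plays the role of h):

```python
from collections import defaultdict

def candidate_pairs(signatures, tuple_size=2, min_hits=1):
    # Group each signature into n-tuples, bucket images by tuple value,
    # and keep pairs sharing at least min_hits tuples.
    hits = defaultdict(int)
    k = len(next(iter(signatures.values()))) // tuple_size
    for t in range(k):
        buckets = defaultdict(list)
        for img, sig in signatures.items():
            key = tuple(sig[t * tuple_size:(t + 1) * tuple_size])
            buckets[key].append(img)
        for imgs in buckets.values():
            for i in range(len(imgs)):
                for j in range(i + 1, len(imgs)):
                    hits[(imgs[i], imgs[j])] += 1
    return {pair for pair, n in hits.items() if n >= min_hits}

sigs = {"A": [1, 2, 1, 3], "B": [2, 1, 2, 2], "C": [1, 2, 3, 3]}
print(candidate_pairs(sigs))   # A and C share the 2-tuple (1, 2)
```

Only the surviving candidate pairs need a full set-similarity evaluation, which is what makes retrieval over very large databases tractable.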
131. Min Hash
From 100k images...
132. Min Hash
From 100k images...
133. Min Hash
From 100k images...

Representatives of the largest clusters