2. Benchmarking: why and how
• Dozens of feature detectors and descriptors have been proposed
• Benchmarks
- compare methods empirically
- select the best method for a task
• Public benchmarks
- reproducible research
- simplify your life!
• Ingredients of a benchmark: theory, data, and software
5. Indirect feature evaluation
• Intuition
Test how well features persist and can be matched across image
transformations.
• Data
- must be representative of the transformations
(viewpoint, illumination, noise, etc.)
• Performance measures
- repeatability
persistence of features
- matching score
matchability of features
K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1–2):43–72, 2005.
16. Detector repeatability
Intuition
• two pairs of features correspond
• another pair of features does not
• repeatability = 2/3
17. Region overlap
Formal definition
Let $H$ be the homography relating the two images, $R_a$ a region detected in the first image, and $R_b$ the corresponding region detected in the second. Then

  $\mathrm{overlap}(a, b) = \frac{|R_a \cap H R_b|}{|R_a \cup H R_b|} = \frac{\text{area of intersection}}{\text{area of union}}$
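A minimal MATLAB sketch of this measure, rasterising the two regions as binary masks on a common pixel grid (the benchmark itself works with warped ellipse parameters; the masks and grid size below are chosen purely for illustration):

  % Rasterise two elliptical regions on a common grid and measure overlap.
  [x, y] = meshgrid(1:100, 1:100);
  maskA = ((x - 45)/30).^2 + ((y - 50)/20).^2 <= 1;  % region R_a
  maskB = ((x - 55)/30).^2 + ((y - 50)/20).^2 <= 1;  % region H R_b, already warped
  overlap = nnz(maskA & maskB) / nnz(maskA | maskB)  % intersection over union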
18. Region overlap 53
A Comparison of Affine Region Detec
Intuition
gure 12. Overlap error O . Examples of ellipses projected on the corresponding ellipse with the ground truth transformation. (bot
erlap error for above displayed ellipses. Note that the overlap error comes from different size, orientation and position of the ellipses.
1 - overlap
metimes specific to detectors and scene types of correspondences. The results for images conta
iscussed below), and sometimes general—the trans- ing repeated texture motifs (Fig. 9(b)) are displa
rmation is outside the range for which the detector is in Fig. 14. The best results are obtained with
• Examples of ellipses overlapping by different amounts
signed, e.g. discretization errors, noise, non-linear MSER detector for both scene types. This is due
umination changes, projective deformations etc. the high detection accuracy especially on the hom
•
lso the limited features are tested at 40% overlap geneous regions overlap)
Usually, ‘range’ of the regions shape (size, error (= 60% with distinctive boundaries. The
ewness, . . . ) can partially explain this effect. For peatability score for a viewpoint change of 20 degr
stance, in case of a zoomed out test image, only the varies between 40% and 78% and decreases for la
rge regions in the reference image will survive the viewpoint angles to 10% − 46%. The largest num
23. Normalised region overlap
Intuition
• larger scale = better overlap
• Fix: rescale so that the reference region has an area of $30^2$ pixels
24. Normalised region overlap
Formal definition ($H$: homography)
1. Detect: $R_a$, $R_b$
2. Warp: $R_a$, $H R_b$
3. Normalise: $s R_a$, $s H R_b$, with $s = 30 / \sqrt{|R_a|}$ so that $|s R_a| = 30^2$
4. Intersection over union:
   $\mathrm{overlap}(a, b) = \frac{|s R_a \cap s H R_b|}{|s R_a \cup s H R_b|}$
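For illustration (numbers assumed, and taking $s$ to scale linear dimensions as reconstructed above): if $|R_a| = 3600$ pixels, then $s = 30/\sqrt{3600} = 0.5$ and $|s R_a| = 0.5^2 \cdot 3600 = 900 = 30^2$ pixels. The same $s$ is applied to $H R_b$, so the relative geometry of the two regions is preserved while the overlap is always measured at a canonical scale.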
25. Detector repeatability
Formal definition
1. Find the features in the common area of the two images: $\{R_a : a \in A\}$ and $\{R_b : b \in B\}$, related by the homography $H$.
2. Threshold the overlap score:
   $s_{ab} = \begin{cases} \mathrm{overlap}(a, b) & \text{if } \mathrm{overlap}(a, b) \geq 1 - \epsilon_o \\ -\infty & \text{otherwise} \end{cases}$
3. Find the geometric matches, using a greedy approximation to maximum-weight bipartite matching:
   $M^* = \arg\max_{M \text{ bipartite}} \sum_{(a,b) \in M} s_{ab}$

   $\mathrm{repeatability}(A, B) = \frac{|M^*|}{\min\{|A|, |B|\}}$
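The greedy approximation is straightforward to implement; here is a minimal MATLAB sketch (function and variable names are my own) that accepts matches in decreasing order of overlap score:

  % Greedy approximation of maximum-weight bipartite matching.
  % S(a, b) holds overlap(a, b) where the overlap error is within the
  % threshold, and -inf otherwise.
  function M = greedyMatch(S)
    M = zeros(0, 2);                        % matched index pairs (a, b)
    [vals, order] = sort(S(:), 'descend');  % best overlaps first
    usedA = false(size(S, 1), 1);
    usedB = false(size(S, 2), 1);
    for k = 1:numel(vals)
      if vals(k) == -inf, break; end        % no admissible pairs left
      [a, b] = ind2sub(size(S), order(k));
      if ~usedA(a) && ~usedB(b)             % each feature matched at most once
        M(end+1, :) = [a, b];               %#ok<AGROW>
        usedA(a) = true; usedB(b) = true;
      end
    end
  end
  % repeatability = size(M, 1) / min(size(S));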
26. Descriptor matching score
Intuition
• In addition to being stable, features must be visually distinctive
• Descriptor matching score
- similar to repeatability
- but matches are constructed by comparing descriptors
27. Descriptor matching score
Formal definition
1. Find the features in the common area: $\{R_a : a \in A\}$ and $\{R_b : b \in B\}$, related by the homography $H$.
2. Compute the descriptor distances: for descriptors $\{d_a : a \in A\}$ and $\{d_b : b \in B\}$, $d_{ab} = \|d_a - d_b\|_2$.
3. Descriptor matches (greedy bipartite, as for repeatability):
   $M_d^* = \arg\min_{M \text{ bipartite}} \sum_{(a,b) \in M} d_{ab}$
4. Geometric matches (as before):
   $M^* = \arg\max_{M \text{ bipartite}} \sum_{(a,b) \in M} s_{ab}$

   $\text{match-score}(A, B) = \frac{|M^* \cap M_d^*|}{\min\{|A|, |B|\}}$
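A sketch of the computation, reusing greedyMatch from above; descrA ($D \times N_a$), descrB ($D \times N_b$), and the thresholded overlap-score matrix S are assumed precomputed, and pdist2 requires the Statistics Toolbox:

  Dmat = pdist2(descrA', descrB');      % d_ab = ||d_a - d_b||_2
  Md = greedyMatch(-Dmat);              % minimising d_ab = maximising -d_ab
  Mg = greedyMatch(S);                  % geometric matches, as before
  common = intersect(Md, Mg, 'rows');   % pairs selected by both matchings
  matchScore = size(common, 1) / min(size(S));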
28. Example of a repeatability graph
[Plot: repeatability on the graffiti (graf) sequence as a function of viewpoint angle (20 to 60 degrees), for VL DoG, VL Hessian, VL HessianLaplace, and VL HarrisLaplace (each also in a "double" variant), plus VLFeat SIFT, CMP Hessian, VGG hes, and VGG har.]
30. Indirect evaluation
• Indirect evaluation
  - a "synthetic" performance measure in a "synthetic" setting
• The good
  - independent of any specific application or implementation
  - allows evaluating single components, e.g.
    ▪ repeatability of a detector
    ▪ matching score of a descriptor
• The bad
  - difficult to design well
  - unclear how well it correlates with performance in real applications
31. Direct evaluation
• Direct evaluation
  - measures the performance of a real system built on the feature, e.g. object instance retrieval, object category recognition, object detection, text recognition, semantic segmentation, ...
• The good
  - tied to the "real" performance of the feature
• The bad
  - tied to one application
  - worse, tied to one implementation
  - difficult to evaluate single aspects of a feature
• In what follows we focus on object instance retrieval
34. Image retrieval pipeline
Step 1: find the neighbours of each query descriptor
[Figure: a query image and its descriptors' nearest neighbours in the database, sorted by increasing descriptor distance.]
H. Jégou, M. Douze, and C. Schmid. Exploiting descriptor distances for precise image search. Technical Report 7656, INRIA, 2011.
35. Image retrieval pipeline
Step 2: each query descriptor casts a vote for each DB image
• For a query descriptor $d$, let $d_1 \leq d_2 \leq \dots \leq d_k$ be the distances to its $k$ nearest database descriptors.
• The database image containing the $i$-th neighbour receives a vote of strength $\max\{d_k - d_i, 0\}$, so closer neighbours cast stronger votes.
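A minimal MATLAB sketch of this voting scheme (all variable names are assumed for illustration; pdist2 requires the Statistics Toolbox):

  % queryDescr is D x Nq, dbDescr is D x Nd; dbImageOfDescr(j) gives the
  % database image owning descriptor j; numImages is the number of DB images.
  k = 10;                                      % number of neighbours
  votes = zeros(1, numImages);
  for q = 1:size(queryDescr, 2)
    d = pdist2(queryDescr(:, q)', dbDescr');   % distances to all DB descriptors
    [ds, nn] = sort(d, 'ascend');              % nearest neighbours first
    for i = 1:k
      img = dbImageOfDescr(nn(i));
      votes(img) = votes(img) + max(ds(k) - ds(i), 0);  % vote strength
    end
  end
  [~, ranking] = sort(votes, 'descend');       % step 3: rank the DB images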
36. Image retrieval pipeline
Step 3: sort the DB images by decreasing total votes
[Figure: the ranked list for a query image, with correct (✔) and incorrect (✗) results marked; retrieval quality is summarised by Average Precision (AP), the area under the precision-recall curve, 35% in this example.]
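AP is simple to compute from the ranked relevance labels; here is a minimal MATLAB sketch (it assumes every relevant image of the query appears somewhere in the ranked list):

  % Average precision of a ranked list. 'relevant' is a logical vector in
  % rank order, true where the retrieved image is a correct match.
  function ap = averagePrecision(relevant)
    r = logical(relevant(:));
    precision = cumsum(r) ./ (1:numel(r))';    % precision at each rank
    ap = sum(precision(r)) / nnz(r);           % mean precision at the hits
  end
  % averagePrecision([true false true false false]) % = (1 + 2/3)/2 ≈ 0.83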
38. Oxford 5K data
A retrieval benchmark dataset
[Figure: a query and its retrieved images, correct (✔) and incorrect (✗).]
• ~5K images of Oxford
  - For each of 58 queries
    ▪ about XX matching images
    ▪ about XX confounder images
• Larger datasets are possible, but slow for extensive evaluation
• The relative ranking of features appears to be representative
40. VLBenchmarks
A new easy-to-use benchmarking suite
http://www.vlfeat.org/benchmarks/index.html
• A novel MATLAB framework for feature evaluation
- Repeatability and matching scores
▪ VGG affine testbed
- Image retrieval
▪ Oxford 5K
• Goodies
  - Simple-to-use MATLAB code
  - Automatically downloads datasets and runs evaluations
  - Backward compatible with published results
41. VLBenchmarks
Obtaining and installing the code
• Installation
  - Download the latest version
  - Unpack the archive
  - Launch MATLAB and type
    >> install
• Requirements
  - MATLAB R2008a (7.6) or later
  - A C compiler (e.g. Visual Studio, GCC, or Xcode)
  - Do not forget to set up MATLAB to use your C compiler:
    >> mex -setup
42. Example usage
% choose a detector (Harris Affine, VGG version)
detector = localFeatures.VggAffine('Detector', 'haraff');

% choose a dataset (graffiti sequence)
dataset = datasets.VggAffineDataset('Category', 'graf');

% choose a test (detector repeatability)
benchmark = benchmarks.RepeatabilityBenchmark('Mode', 'Repeatability');

% run the evaluation (repeatability = 0.66)
repeatability = benchmark.testFeatureExtractor(detector, ...
  dataset.getTransformation(2), dataset.getImagePath(1), dataset.getImagePath(2))
43. Testing on a sequence of images 78
.ls . ls . ls
r *; s
r *o aoto o; s
r a *;s
r ceu
a* ; r ...
. d * t. ..
. * ;t ...
. *c;t ...
. * ;; s
s * at o " otUse parfor on a cluster!
F; s
d *o o; s
a *o ao; s
s
47. Example
• Compare the following features
  - SIFT, MSER, and features on a grid
  - on the Graffiti sequence
  - for repeatability and number of correspondences
  (a sketch of this comparison follows the plots below)
[Plots: repeatability and number of correspondences as a function of viewpoint angle (20 to 60 degrees) for SIFT, MSER, and features on a grid.]
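A sketch of how this comparison could be run in VLBenchmarks; VlFeatSift and VlFeatMser are the wrapper classes as I recall them from the VLBenchmarks tutorial, while GridFeatures is a hypothetical stand-in for a grid-based extractor:

  featExtractors = {localFeatures.VlFeatSift(), ...   % SIFT
                    localFeatures.VlFeatMser(), ...   % MSER
                    GridFeatures()};                  % hypothetical grid extractor
  dataset = datasets.VggAffineDataset('Category', 'graf');
  benchmark = benchmarks.RepeatabilityBenchmark('Mode', 'Repeatability');
  for f = 1:numel(featExtractors)
    for i = 2:dataset.NumImages
      [rep(f, i), numCorresp(f, i)] = benchmark.testFeatureExtractor( ...
        featExtractors{f}, dataset.getTransformation(i), ...
        dataset.getImagePath(1), dataset.getImagePath(i));
    end
  end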
48. Backward compatible
• Previously published results can be easily reproduced
  - if interested, try the script reproduceIjcv05.m
[Slide shows the first page of Mikolajczyk et al., "A Comparison of Affine Region Detectors", IJCV 2005.]
49. Other useful tricks
• Compare different parameter settings
  detectors{1} = VggAffine('Detector', 'haraff', 'Threshold', 500);
  detectors{2} = VggAffine('Detector', 'haraff', 'Threshold', 1000);
• Visualising matches
  [~, ~, matches, reprojFrames] = benchmark.testFeatureExtractor(...)
  ...
  benchmarks.helpers.plotFrameMatches(matches, reprojFrames)
[Figures: matched and unmatched frames in the reference and test images, for SIFT and for a mean-variance-median descriptor, on image 4 of the VggAffineDataset graf sequence.]
51. Summary
http://www.vlfeat.org/benchmarks/index.html
• Benchmarks
- Indirect: repeatability and matching score
- Direct: image retrieval
• VLBenchmarks
- a simple-to-use, convenient MATLAB framework
• The future
- Existing measures have many shortcomings
- Hopefully better benchmarks will be available soon
- And they will be added to VLBenchmarks for your convenience
52. Credits
Karel Lenc, Varun Gulshan
HARVEST Programme
Krystian Mikolajczyk, Tinne Tuytelaars, Jiri Matas, Cordelia Schmid, Andrew Zisserman
53. Thank you for coming!
VLFeat
http://www.vlfeat.org/
VLBenchmarks
http://www.vlfeat.org/benchmarks/