2. Benchmarking: why and how
• Dozens of feature detectors and descriptors have been proposed
• Benchmarks
- compare methods empirically
- select the best method for a task
• Public benchmarks
- reproducible research
- simplify your life!
• Ingredients of a benchmark: theory, data, and software
5. Indirect feature evaluation
• Intuition
Test how well features persist and can be matched across image
transformations.
• Data
- must be representative of the transformations
(viewpoint, illumination, noise, etc.)
• Performance measures
- repeatability
persistence of features
- matching score
matchability of features
K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1–2):43–72, 2005.
16. Detector repeatability
Intuition
• two pairs of features correspond
• another pair of features does not
• repeatability = 2/3
17. Region overlap
Formal definition
Let $H$ be the homography relating the two images, $R_a$ a region detected in the first image, and $R_b$ the corresponding region detected in the second. Then

  $\mathrm{overlap}(a, b) = \frac{|R_a \cap H R_b|}{|R_a \cup H R_b|} = \frac{\text{area of intersection}}{\text{area of union}}$
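A minimal MATLAB sketch of this measure, rasterising the two regions as binary masks on a common pixel grid (the benchmark itself works with warped ellipse parameters; the masks and grid size below are chosen purely for illustration):

  % Rasterise two elliptical regions on a common grid and measure overlap.
  [x, y] = meshgrid(1:100, 1:100);
  maskA = ((x - 45)/30).^2 + ((y - 50)/20).^2 <= 1;  % region R_a
  maskB = ((x - 55)/30).^2 + ((y - 50)/20).^2 <= 1;  % region H R_b, already warped
  overlap = nnz(maskA & maskB) / nnz(maskA | maskB)  % intersection over union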
18. Region overlap 53
A Comparison of Affine Region Detec
Intuition
gure 12. Overlap error O . Examples of ellipses projected on the corresponding ellipse with the ground truth transformation. (bot
erlap error for above displayed ellipses. Note that the overlap error comes from different size, orientation and position of the ellipses.
1 - overlap
metimes specific to detectors and scene types of correspondences. The results for images conta
iscussed below), and sometimes general—the trans- ing repeated texture motifs (Fig. 9(b)) are displa
rmation is outside the range for which the detector is in Fig. 14. The best results are obtained with
• Examples of ellipses overlapping by different amounts
signed, e.g. discretization errors, noise, non-linear MSER detector for both scene types. This is due
umination changes, projective deformations etc. the high detection accuracy especially on the hom
•
lso the limited features are tested at 40% overlap geneous regions overlap)
Usually, ‘range’ of the regions shape (size, error (= 60% with distinctive boundaries. The
ewness, . . . ) can partially explain this effect. For peatability score for a viewpoint change of 20 degr
stance, in case of a zoomed out test image, only the varies between 40% and 78% and decreases for la
rge regions in the reference image will survive the viewpoint angles to 10% − 46%. The largest num
23. Normalised region overlap
Intuition
• larger scale = better overlap
• Fix: rescale so that the reference region has an area of $30^2$ pixels
24. Normalised region overlap
Formal definition ($H$: homography)
1. Detect: $R_a$, $R_b$
2. Warp: $R_a$, $H R_b$
3. Normalise: $s R_a$, $s H R_b$, with $s = 30 / \sqrt{|R_a|}$ so that $|s R_a| = 30^2$
4. Intersection over union:
   $\mathrm{overlap}(a, b) = \frac{|s R_a \cap s H R_b|}{|s R_a \cup s H R_b|}$
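For illustration (numbers assumed, and taking $s$ to scale linear dimensions as reconstructed above): if $|R_a| = 3600$ pixels, then $s = 30/\sqrt{3600} = 0.5$ and $|s R_a| = 0.5^2 \cdot 3600 = 900 = 30^2$ pixels. The same $s$ is applied to $H R_b$, so the relative geometry of the two regions is preserved while the overlap is always measured at a canonical scale.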
25. Detector repeatability
Formal definition
1. Find the features in the common area of the two images: $\{R_a : a \in A\}$ and $\{R_b : b \in B\}$, related by the homography $H$.
2. Threshold the overlap score:
   $s_{ab} = \begin{cases} \mathrm{overlap}(a, b) & \text{if } \mathrm{overlap}(a, b) \geq 1 - \epsilon_o \\ -\infty & \text{otherwise} \end{cases}$
3. Find the geometric matches, using a greedy approximation to maximum-weight bipartite matching:
   $M^* = \arg\max_{M \text{ bipartite}} \sum_{(a,b) \in M} s_{ab}$

   $\mathrm{repeatability}(A, B) = \frac{|M^*|}{\min\{|A|, |B|\}}$
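The greedy approximation is straightforward to implement; here is a minimal MATLAB sketch (function and variable names are my own) that accepts matches in decreasing order of overlap score:

  % Greedy approximation of maximum-weight bipartite matching.
  % S(a, b) holds overlap(a, b) where the overlap error is within the
  % threshold, and -inf otherwise.
  function M = greedyMatch(S)
    M = zeros(0, 2);                        % matched index pairs (a, b)
    [vals, order] = sort(S(:), 'descend');  % best overlaps first
    usedA = false(size(S, 1), 1);
    usedB = false(size(S, 2), 1);
    for k = 1:numel(vals)
      if vals(k) == -inf, break; end        % no admissible pairs left
      [a, b] = ind2sub(size(S), order(k));
      if ~usedA(a) && ~usedB(b)             % each feature matched at most once
        M(end+1, :) = [a, b];               %#ok<AGROW>
        usedA(a) = true; usedB(b) = true;
      end
    end
  end
  % repeatability = size(M, 1) / min(size(S));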
26. Descriptor matching score
Intuition
• In addition to being stable, features must be visually distinctive
• Descriptor matching score
- similar to repeatability
- but matches are constructed by comparing descriptors
27. Descriptor matching score
Formal definition
1. Find the features in the common area: $\{R_a : a \in A\}$ and $\{R_b : b \in B\}$, related by the homography $H$.
2. Compute the descriptor distances: for descriptors $\{d_a : a \in A\}$ and $\{d_b : b \in B\}$, $d_{ab} = \|d_a - d_b\|_2$.
3. Descriptor matches (greedy bipartite, as for repeatability):
   $M_d^* = \arg\min_{M \text{ bipartite}} \sum_{(a,b) \in M} d_{ab}$
4. Geometric matches (as before):
   $M^* = \arg\max_{M \text{ bipartite}} \sum_{(a,b) \in M} s_{ab}$

   $\text{match-score}(A, B) = \frac{|M^* \cap M_d^*|}{\min\{|A|, |B|\}}$
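A sketch of the computation, reusing greedyMatch from above; descrA ($D \times N_a$), descrB ($D \times N_b$), and the thresholded overlap-score matrix S are assumed precomputed, and pdist2 requires the Statistics Toolbox:

  Dmat = pdist2(descrA', descrB');      % d_ab = ||d_a - d_b||_2
  Md = greedyMatch(-Dmat);              % minimising d_ab = maximising -d_ab
  Mg = greedyMatch(S);                  % geometric matches, as before
  common = intersect(Md, Mg, 'rows');   % pairs selected by both matchings
  matchScore = size(common, 1) / min(size(S));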
28. Example of a repeatability graph
[Plot: repeatability on the graffiti (graf) sequence as a function of viewpoint angle (20 to 60 degrees), for VL DoG, VL Hessian, VL HessianLaplace, and VL HarrisLaplace (each also in a "double" variant), plus VLFeat SIFT, CMP Hessian, VGG hes, and VGG har.]
30. Indirect evaluation
• Indirect evaluation
  - a "synthetic" performance measure in a "synthetic" setting
• The good
  - independent of any specific application or implementation
  - allows evaluating single components, e.g.
    ▪ repeatability of a detector
    ▪ matching score of a descriptor
• The bad
  - difficult to design well
  - unclear how well it correlates with performance in real applications
31. Direct evaluation
• Direct evaluation
  - measures the performance of a real system built on the feature, e.g. object instance retrieval, object category recognition, object detection, text recognition, semantic segmentation, ...
• The good
  - tied to the "real" performance of the feature
• The bad
  - tied to one application
  - worse, tied to one implementation
  - difficult to evaluate single aspects of a feature
• In what follows we focus on object instance retrieval
34. Image retrieval pipeline
Step 1: find the neighbours of each query descriptor
[Figure: a query image and its descriptors' nearest neighbours in the database, sorted by increasing descriptor distance.]
H. Jégou, M. Douze, and C. Schmid. Exploiting descriptor distances for precise image search. Technical Report 7656, INRIA, 2011.
35. Image retrieval pipeline
Step 2: each query descriptor casts a vote for each DB image
• For a query descriptor $d$, let $d_1 \leq d_2 \leq \dots \leq d_k$ be the distances to its $k$ nearest database descriptors.
• The database image containing the $i$-th neighbour receives a vote of strength $\max\{d_k - d_i, 0\}$, so closer neighbours cast stronger votes.
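A minimal MATLAB sketch of this voting scheme (all variable names are assumed for illustration; pdist2 requires the Statistics Toolbox):

  % queryDescr is D x Nq, dbDescr is D x Nd; dbImageOfDescr(j) gives the
  % database image owning descriptor j; numImages is the number of DB images.
  k = 10;                                      % number of neighbours
  votes = zeros(1, numImages);
  for q = 1:size(queryDescr, 2)
    d = pdist2(queryDescr(:, q)', dbDescr');   % distances to all DB descriptors
    [ds, nn] = sort(d, 'ascend');              % nearest neighbours first
    for i = 1:k
      img = dbImageOfDescr(nn(i));
      votes(img) = votes(img) + max(ds(k) - ds(i), 0);  % vote strength
    end
  end
  [~, ranking] = sort(votes, 'descend');       % step 3: rank the DB images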
36. Image retrieval pipeline
Step 3: sort the DB images by decreasing total votes
[Figure: the ranked list for a query image, with correct (✔) and incorrect (✗) results marked; retrieval quality is summarised by Average Precision (AP), the area under the precision-recall curve, 35% in this example.]
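AP is simple to compute from the ranked relevance labels; here is a minimal MATLAB sketch (it assumes every relevant image of the query appears somewhere in the ranked list):

  % Average precision of a ranked list. 'relevant' is a logical vector in
  % rank order, true where the retrieved image is a correct match.
  function ap = averagePrecision(relevant)
    r = logical(relevant(:));
    precision = cumsum(r) ./ (1:numel(r))';    % precision at each rank
    ap = sum(precision(r)) / nnz(r);           % mean precision at the hits
  end
  % averagePrecision([true false true false false]) % = (1 + 2/3)/2 ≈ 0.83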
38. Oxford 5K data
A retrieval benchmark dataset
[Figure: a query and its retrieved images, correct (✔) and incorrect (✗).]
• ~5K images of Oxford
  - For each of 58 queries
    ▪ about XX matching images
    ▪ about XX confounder images
• Larger datasets are possible, but slow for extensive evaluation
• The relative ranking of features appears to be representative
40. VLBenchmarks
A new easy-to-use benchmarking suite
http://www.vlfeat.org/benchmarks/index.html
• A novel MATLAB framework for feature evaluation
- Repeatability and matching scores
▪ VGG affine testbed
- Image retrieval
▪ Oxford 5K
• Goodies
  - Simple-to-use MATLAB code
  - Automatically downloads datasets and runs evaluations
  - Backward compatible with published results
41. VLBenchmarks
Obtaining and installing the code
• Installation
  - Download the latest version
  - Unpack the archive
  - Launch MATLAB and type
    >> install
• Requirements
  - MATLAB R2008a (7.6) or later
  - A C compiler (e.g. Visual Studio, GCC, or Xcode)
  - Do not forget to set up MATLAB to use your C compiler:
    >> mex -setup
42. Example usage
% choose a detector (Harris Affine, VGG version)
detector = localFeatures.VggAffine('Detector', 'haraff');

% choose a dataset (graffiti sequence)
dataset = datasets.VggAffineDataset('Category', 'graf');

% choose a test (detector repeatability)
benchmark = benchmarks.RepeatabilityBenchmark('Mode', 'Repeatability');

% run the evaluation (repeatability = 0.66)
repeatability = benchmark.testFeatureExtractor(detector, ...
  dataset.getTransformation(2), dataset.getImagePath(1), dataset.getImagePath(2))
43. Testing on a sequence of images 78
.ls . ls . ls
r *; s
r *o aoto o; s
r a *;s
r ceu
a* ; r ...
. d * t. ..
. * ;t ...
. *c;t ...
. * ;; s
s * at o " otUse parfor on a cluster!
F; s
d *o o; s
a *o ao; s
s
47. Example
• Compare the following features
  - SIFT, MSER, and features on a grid
  - on the Graffiti sequence
  - for repeatability and number of correspondences
  (a sketch of this comparison follows the plots below)
[Plots: repeatability and number of correspondences as a function of viewpoint angle (20 to 60 degrees) for SIFT, MSER, and features on a grid.]
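A sketch of how this comparison could be run in VLBenchmarks; VlFeatSift and VlFeatMser are the wrapper classes as I recall them from the VLBenchmarks tutorial, while GridFeatures is a hypothetical stand-in for a grid-based extractor:

  featExtractors = {localFeatures.VlFeatSift(), ...   % SIFT
                    localFeatures.VlFeatMser(), ...   % MSER
                    GridFeatures()};                  % hypothetical grid extractor
  dataset = datasets.VggAffineDataset('Category', 'graf');
  benchmark = benchmarks.RepeatabilityBenchmark('Mode', 'Repeatability');
  for f = 1:numel(featExtractors)
    for i = 2:dataset.NumImages
      [rep(f, i), numCorresp(f, i)] = benchmark.testFeatureExtractor( ...
        featExtractors{f}, dataset.getTransformation(i), ...
        dataset.getImagePath(1), dataset.getImagePath(i));
    end
  end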
48. Backward compatible
• Previously published results can be easily reproduced
  - if interested, try the script reproduceIjcv05.m
[Slide shows the first page of Mikolajczyk et al., "A Comparison of Affine Region Detectors", IJCV 2005.]
49. Other useful tricks
• Compare different parameter settings
  detectors{1} = VggAffine('Detector', 'haraff', 'Threshold', 500);
  detectors{2} = VggAffine('Detector', 'haraff', 'Threshold', 1000);
• Visualising matches
  [~, ~, matches, reprojFrames] = benchmark.testFeatureExtractor(...)
  ...
  benchmarks.helpers.plotFrameMatches(matches, reprojFrames)
[Figures: matched and unmatched frames in the reference and test images, for SIFT and for a mean-variance-median descriptor, on image 4 of the VggAffineDataset graf sequence.]
51. Summary
http://www.vlfeat.org/benchmarks/index.html
• Benchmarks
- Indirect: repeatability and matching score
- Direct: image retrieval
• VLBenchmarks
- a simple-to-use, convenient MATLAB framework
• The future
- Existing measures have many shortcomings
- Hopefully better benchmarks will be available soon
- And they will be added to VLBenchmarks for your convenience
52. Credits
Karel Lenc, Varun Gulshan
HARVEST Programme
Krystian Mikolajczyk, Tinne Tuytelaars, Jiri Matas, Cordelia Schmid, Andrew Zisserman
53. Thank you for coming!
VLFeat
http://www.vlfeat.org/
VLBenchmarks
http://www.vlfeat.org/benchmarks/