3. Comparing Apples to Apples
• This work compares the latest ConvNet-based feature representations on common ground
• We compare both different pre-trained network architectures and different learning heuristics
[Diagram: a fixed input dataset and fixed learning settings feed competing feature extractors (CNN Arch 1, CNN Arch 2, IFV, …), all scored under a fixed evaluation protocol]
4. Performance Evolution over VOC2007
[Bar chart reconstructed as a table: mAP on PASCAL VOC2007, methods ordered chronologically, 2008–2015; DeCAF onward are CNN-based methods]

Method         Dim.   Aug.    mAP
BOW            32K    –       54.48
IFV-BL         327K   –       61.69
IFV            84K    –       64.36
IFV            84K    (f s)   68.02
DeCAF          4K     (t t)   73.41
CNN-F          4K     (f s)   77.15
CNN-M 2048     2K     (f s)   80.13
CNN-S (TN)     4K     (f s)   82.42
VGG-D+E        4K     (S s)   89.70
5. Evaluation Setup
Pipeline: input image → pre-trained net (on 1,000 ImageNet classes) → CNN feature extractor (4096-D feature vector out) → SVM classifier (trained on the training set, applied to the test set) → evaluate the classifier output using mAP, accuracy, etc.
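A minimal sketch of this protocol, assuming NumPy only: the pre-trained CNN is stubbed out with a fixed random projection (an illustration, not the real 1,000-class network), and the SVM is a bare-bones hinge-loss subgradient trainer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pre-trained CNN: maps a raw image to a 4096-D
# feature vector (here a fixed random projection + ReLU, illustration only).
D_FEAT = 4096
W_cnn = rng.standard_normal((D_FEAT, 32 * 32 * 3)) / np.sqrt(32 * 32 * 3)

def extract_features(images):
    flat = images.reshape(len(images), -1)
    return np.maximum(flat @ W_cnn.T, 0.0)

def train_linear_svm(X, y, epochs=50, lr=0.01, C=1.0):
    """Minimal linear SVM via hinge-loss subgradient descent (y in {-1, +1})."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                     # margin violators
        w -= lr * (w / C - (y[viol, None] * X[viol]).sum(axis=0) / len(X))
        b += lr * y[viol].sum() / len(X)
    return w, b

# Toy two-class data: class label encoded as a mean shift in the pixels.
n = 200
labels = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
images = rng.standard_normal((n, 32, 32, 3)) + labels[:, None, None, None] * 0.5

X = extract_features(images)
X_tr, y_tr, X_te, y_te = X[:150], labels[:150], X[150:], labels[150:]

w, b = train_linear_svm(X_tr, y_tr)
acc = np.mean(np.sign(X_te @ w + b) == y_te)
print(f"test accuracy: {acc:.2f}")
```

In the real setup the SVM is trained per class on the 4096-D CNN features and scored with mAP; accuracy on a toy binary task stands in for that here.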
14. Data Augmentation
Given a pre-trained ConvNet, augmentation is applied at test time:
a. Extract crops and pass each through the CNN feature extractor (pre-trained network)
b. Pool the per-crop features (average or max)
15. Data Augmentation
a. No augmentation (= 1 image): a single 224×224 input
b. Flip augmentation (= 2 images): the 224×224 input + its horizontal flip
c. Crop+Flip augmentation (= 10 images): 224×224 crops + their flips
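The 10-view scheme in (c), five crops plus their horizontal flips followed by feature pooling, can be sketched as follows (a hedged illustration; the extractor here is a stand-in for the pre-trained CNN):

```python
import numpy as np

def ten_crop(image, size=224):
    """Crop+flip augmentation: 4 corner crops + centre crop, plus a
    horizontal flip of each, giving 10 views of one (H, W, C) image."""
    H, W, _ = image.shape
    ys = [0, 0, H - size, H - size, (H - size) // 2]
    xs = [0, W - size, 0, W - size, (W - size) // 2]
    crops = [image[y:y + size, x:x + size] for y, x in zip(ys, xs)]
    crops += [c[:, ::-1] for c in crops]       # horizontal flips
    return np.stack(crops)                      # (10, size, size, C)

def pooled_feature(views, extractor, pool="average"):
    """Extract one feature per view, then pool across the views."""
    feats = np.stack([extractor(v) for v in views])
    return feats.mean(axis=0) if pool == "average" else feats.max(axis=0)

# Example with a stand-in extractor (the real pipeline would output
# a 4096-D CNN feature per view).
rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))
views = ten_crop(img)
feat = pooled_feature(views, extractor=lambda v: v.mean(axis=(0, 1)))
print(views.shape, feat.shape)  # (10, 224, 224, 3) (3,)
```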
18. Fully Convolutional Net
Sermanet et al. 2014 (OverFeat)
• Convert the final fully-connected (fc) layers to convolutional layers
• The output is then an activation map, which can be pooled
• 8.8% → 7.5% top-5 val. error (ILSVRC-2014)
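The fc→conv conversion can be sketched in NumPy (layer sizes below are made up for illustration, not OverFeat's actual architecture): reshaping the fc weight matrix into K×K filters lets the layer slide over inputs of any spatial size, while reproducing the fc output exactly on the original input size.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, F = 8, 3, 16            # input channels, fc "window" size, output units

# A fully-connected layer expecting a (C, K, K) feature map.
W_fc = rng.standard_normal((F, C * K * K))

def fc_layer(fmap):                        # fmap: (C, K, K)
    return W_fc @ fmap.reshape(-1)         # -> (F,)

def conv_from_fc(fmap):
    """The same fc layer re-interpreted as F convolutional filters of
    size K x K (valid padding), applicable to feature maps of any size."""
    W_conv = W_fc.reshape(F, C, K, K)
    _, H, Wd = fmap.shape
    out = np.empty((F, H - K + 1, Wd - K + 1))
    for i in range(H - K + 1):
        for j in range(Wd - K + 1):
            patch = fmap[:, i:i + K, j:j + K]
            out[:, i, j] = np.tensordot(W_conv, patch, axes=3)
    return out

small = rng.standard_normal((C, K, K))
big = rng.standard_normal((C, 7, 7))

# On a K x K input, the conv output equals the original fc output...
assert np.allclose(conv_from_fc(small)[:, 0, 0], fc_layer(small))
# ...and on a larger input it yields an activation map that can be pooled.
act_map = conv_from_fc(big)                # (F, 5, 5)
score = act_map.max(axis=(1, 2))           # e.g. max-pool over positions
print(act_map.shape, score.shape)
```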
22. Fine Tuning
[Bar chart, mAP (VOC07): No TN 79.7, TN-CLS 82.2, TN-RNK 82.4]
• TN-CLS – classification loss: max{ 0, 1 − y·wᵀφ(I) }
• TN-RNK – ranking loss: max{ 0, 1 − wᵀ( φ(I_POS) − φ(I_NEG) ) }
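The two losses above can be written out directly (a minimal NumPy sketch; the weight vector and features below are toy values, not from the paper):

```python
import numpy as np

def cls_loss(w, phi, y):
    """TN-CLS: per-image hinge loss with label y in {-1, +1}:
       max{ 0, 1 - y * w^T phi(I) }"""
    return max(0.0, 1.0 - y * (w @ phi))

def rnk_loss(w, phi_pos, phi_neg):
    """TN-RNK: ranking hinge loss, asking the positive image to score
       at least 1 higher than the negative:
       max{ 0, 1 - w^T (phi(I_pos) - phi(I_neg)) }"""
    return max(0.0, 1.0 - w @ (phi_pos - phi_neg))

w = np.array([1.0, -0.5])
pos, neg = np.array([2.0, 0.0]), np.array([0.0, 2.0])

print(cls_loss(w, pos, +1))    # score 2.0, margin satisfied -> 0.0
print(rnk_loss(w, pos, neg))   # score gap 2.0 - (-1.0) = 3 -> 0.0
print(rnk_loss(w, neg, pos))   # reversed pair violates the margin -> 4.0
```

The ranking loss only constrains score *differences* between positive and negative pairs, which matches the retrieval-style mAP metric better than per-image classification.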
23. Comparison with State of the Art

Method                  ILSVRC-2012 (top-5 err.)  VOC2007 (mAP)  VOC2012 (mAP)
CNN-M 2048              13.5                      80.1           82.4
CNN-S                   13.1                      79.7           82.9
CNN-S TUNE-RNK          13.1                      82.4           83.2
Zeiler & Fergus         16.1                      –              79.0
Oquab et al.            18.0                      77.7           78.7 (82.8*)
Wei et al.              –                         81.5 (85.2*)   81.7 (90.3*)
Clarifai (1 net)        12.5                      –              –
GoogLeNet (1 net)       7.9                       –              –
VGG Very Deep (1 net)   7.0                       89.3           89.0
24. Take-home Messages
• If you get the details right, a relatively simple ConvNet-based pipeline can outperform much more complex architectures
• Data augmentation helps a lot, both for deep and shallow features
• Fine-tuning makes a difference, and should use a ranking loss where appropriate
• Smaller filters and deeper networks help, although feature computation is slower
25. There’s more…
• Presented here was just a subset of the full results from the paper
• Check out the paper for full results on:
  • VOC 2007
  • VOC 2012
  • Caltech-101
  • Caltech-256
  • ILSVRC-2012
26. Source Code
• Caffe-compatible CNN models can be downloaded from the Caffe Model Zoo: https://github.com/BVLC/caffe/wiki/Model-Zoo
• Matlab feature computation code is also available from the project website: http://www.robots.ox.ac.uk/~vgg/software/deep_eval
27. Related Publications
“Return of the Devil in the Details: Delving Deep into Convolutional Nets”
BMVC 2014, Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
(Best Paper Prize)
“The devil is in the details: an evaluation of recent feature encoding methods”
BMVC 2011, Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Victor Lempitsky, Andrew Zisserman
(Best Poster Prize Honourable Mention, 300+ citations)
http://www.robots.ox.ac.uk/~ken