1. The PASCAL Visual Object Classes
(VOC) Dataset and Challenge
Andrew Zisserman
Visual Geometry Group
University of Oxford
2. The PASCAL VOC Challenge
• Challenge in visual object
recognition funded by
PASCAL network of excellence
• Publicly available dataset of
annotated images
• Main competitions in classification (is there an X in this image?), detection
(where are the Xs?), and segmentation (which pixels belong to X?) – the
expected outputs are sketched after this list
• Competition has run each year since 2006.
• Organized by Mark Everingham, Luc Van Gool, Chris Williams, John Winn
and Andrew Zisserman
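For concreteness, here is a minimal sketch (in Python, not the official devkit
interface) of what each of the three main competitions asks a method to
produce for a single class X:

```python
# Illustrative only: the kinds of output scored in each main competition.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    xmin: float           # bounding box corners in image pixel coordinates
    ymin: float
    xmax: float
    ymax: float
    confidence: float     # score used to rank detections when building the PR curve

def classify(image) -> float:
    """Classification: a confidence that an instance of X appears somewhere in the image."""
    raise NotImplementedError

def detect(image) -> List[Detection]:
    """Detection: a scored bounding box for every instance of X in the image."""
    raise NotImplementedError

def segment(image) -> List[List[int]]:
    """Segmentation: a per-pixel label map assigning each pixel to a class index (or background)."""
    raise NotImplementedError
```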
3. Dataset Content
• 20 classes: aeroplane, bicycle, bird, boat, bottle, bus, car, cat,
chair, cow, dining table, dog, horse, motorbike, person,
potted plant, sheep, sofa, train, TV/monitor
• Real images downloaded from Flickr, not filtered for “quality”
• Complex scenes, scale, pose, lighting, occlusion, truncation ...
4. Annotation
• Complete annotation of all objects
• Annotated in one session with written guidelines
– Occluded: object is significantly occluded within the bounding box
– Truncated: object extends beyond the bounding box
– Difficult: not scored in evaluation
– Pose: viewpoint label, e.g. facing left
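These flags are stored per object in the XML annotation files distributed with
the dataset. A minimal sketch of reading them (element names follow the usual
devkit layout, though the exact fields vary slightly between challenge years):

```python
# Minimal sketch, assuming a VOC-style XML annotation file.
import xml.etree.ElementTree as ET

def load_objects(xml_path):
    """Return the annotated objects in one image, with bounding box and flags."""
    objects = []
    for obj in ET.parse(xml_path).getroot().iter("object"):
        bb = obj.find("bndbox")
        objects.append({
            "class":     obj.findtext("name"),
            "pose":      obj.findtext("pose"),              # e.g. "Left", "Frontal"
            "truncated": obj.findtext("truncated") == "1",  # extends beyond the BB
            "occluded":  obj.findtext("occluded") == "1",   # significantly occluded within the BB
            "difficult": obj.findtext("difficult") == "1",  # not scored in evaluation
            "bbox": tuple(int(bb.findtext(t)) for t in ("xmin", "ymin", "xmax", "ymax")),
        })
    return objects

# Objects marked "difficult" are typically filtered out before matching
# predictions to ground truth, so they neither help nor hurt a method's score.
```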
6. Examples
[Example images: Dining Table, Dog, Horse, Motorbike, Person,
Potted Plant, Sheep, Sofa, Train, TV/Monitor]
7. Dataset Statistics – VOC 2010
         Training   Testing
Images   10,103     9,637
Objects  23,374     22,992
• Minimum ~500 training objects per category
– ~1700 cars, 1500 dogs, 7000 people
• Approximately equal distribution across training and test sets
• Data augmented each year
8. Design choices: Things we got right …
1. Standard method of assessment
• Train/validation/test splits given
• Standard evaluation protocol – AP per class (a sketch follows this list)
• Software supplied
– Includes baseline classifier/detector/segmenter
– Runs from training to generating PR curve and AP on validation or
test data out of the box
• Has meant that results on VOC can be consistently
compared in publications
• “Best Practice” webpage on how the training/validation/test
data should be used for the challenge
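As an illustration, a minimal sketch of a per-class AP computation in the
spirit of the VOC protocol. The exact interpolation rule changed over the
years (11-point interpolation early on, the full area under the monotonic
precision/recall envelope later), so this is a sketch rather than the devkit
code; the function and argument names are illustrative:

```python
# Minimal sketch: average precision for one class from ranked predictions.
import numpy as np

def average_precision(confidences, is_true_positive, n_ground_truth):
    """confidences: score per prediction; is_true_positive: 1/0 flag per
    prediction; n_ground_truth: number of (non-difficult) positives for the class."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    tp_flags = np.asarray(is_true_positive, dtype=float)[order]
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    recall = tp / max(n_ground_truth, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)

    # Make precision monotonically non-increasing (right to left), then take
    # the area under the precision/recall curve.
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))
```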
9. Design choices: Things we got right …
2. Evaluation of test data
Three possibilities
1. Release test data and annotation (most liberal) and participants can
assess performance
– Cons: open to abuse
2. Release test data, but test annotation withheld - participants submit
results and organizers assess performance (use an evaluation
server)
3. No release of test data - participants have to submit software and
organizers run this and assess performance
10. Design choices: Things we got right …
3. Augmentation of dataset each year
• VOC2010 around 40% increase in size over VOC2009
         Training           Testing
Images   10,103 (7,054)     9,637 (6,650)
Objects  23,374 (17,218)    22,992 (16,829)
VOC2009 counts shown in brackets
• Has prevented overfitting on the data
• 2008/9 datasets retained as subset of 2010
– Assignments to training/test sets maintained
– So can measure progress from 2008 to 2010
11. Detection Challenge: Progress 2008-2010
[Bar chart: best (maximum) AP (%) per class, from 0 to ~60%, for the top 2008,
2009 and 2010 methods, across all 20 classes]
• Results on 2008 data improve for best 2009 and 2010 methods
for all categories, by over 100% for some categories
– Caveat: Better methods or more training data?
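For reference, the detection challenge counts a predicted box as correct when
its area of overlap with a ground-truth box of the same class (intersection
over union) exceeds 0.5, and each ground-truth box may be matched at most
once. A minimal sketch of that overlap test (illustrative, not the devkit
code):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)
    in pixel coordinates (inclusive, hence the +1 in widths and heights)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    iw, ih = max(0.0, ix2 - ix1 + 1), max(0.0, iy2 - iy1 + 1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0] + 1) * (box_a[3] - box_a[1] + 1)
    area_b = (box_b[2] - box_b[0] + 1) * (box_b[3] - box_b[1] + 1)
    return inter / (area_a + area_b - inter)

# A detection is a true positive if iou(...) > 0.5 for a not-yet-matched
# ground-truth box; duplicate detections of an already-matched object count
# as false positives.
```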
12. Design choices: Things we got right …
4. Equivalence bands
• Uses Friedman/Nemenyi ANOVA significance test
• Makes explicit that many methods are just slight
perturbations of each other
• Need more research on this area for obtaining equivalence
classes for ranking (rather than classification)
• Plan to include banner/header results on evaluation server
(to aid comparisons for reviewing etc)
13. Statistical Significance
• Friedman/Nemenyi analysis
– Compare differences in mean rank of methods over classes using non-
parametric version of ANOVA
– Mean rank must differ by at least 5.4 to be considered significant (p = 0.05)
[Plot: mean rank (over the 20 classes) of each submitted classification
method, from the best-ranked methods (e.g. NUSPSL_KERNELREGFUSING) down to
chance level]
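A minimal sketch of this analysis: methods are ranked per class by AP, mean
ranks are compared, and two methods fall in the same equivalence band when
their mean ranks differ by less than the Nemenyi critical difference. The
q_alpha value comes from a table of the Studentized range statistic; the
function names here are illustrative:

```python
# Minimal sketch of the Friedman/Nemenyi mean-rank analysis; ties are ignored
# for simplicity (a full implementation would assign average ranks to ties).
import numpy as np

def mean_ranks(ap_table):
    """ap_table: array of shape (n_classes, n_methods) of per-class AP scores.
    Returns each method's mean rank over classes (rank 1 = best)."""
    ranks = np.argsort(np.argsort(-ap_table, axis=1), axis=1) + 1
    return ranks.mean(axis=0)

def nemenyi_critical_difference(n_methods, n_classes, q_alpha):
    """CD = q_alpha * sqrt(k (k + 1) / (6 N)), with k methods over N classes."""
    return q_alpha * np.sqrt(n_methods * (n_methods + 1) / (6.0 * n_classes))

# Methods whose mean ranks differ by less than the critical difference are
# placed in the same equivalence band (not significantly different).
```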
14. The future …
Action Classification Taster Challenge – 2010 on
• Given the bounding box of a person, predict whether they are performing
a given action
[Example images: Playing Instrument? Reading?]
• Developed with Ivan Laptev
• Strong participation from the start (11 methods, 8 groups)
15. Ten Action Classes
Phoning Playing Instrument Reading Riding Bike Riding Horse
Running Taking Photo Using Computer Walking Jumping
• 2011: added jumping + ‘other’ class
16. Person Layout Taster Challenge – 2009 on
• Given the bounding box of a person, predict the positions of head, hands
and feet.
• Aim: to encourage research on more detailed image interpretation
[Example images with annotated head, hand and foot positions]
• 2009: no participation
• 2010: relaxed success criteria, 2 participants
• Assessment method still not satisfactory?
17. Futures
2012
• Last official year of PASCAL2
• May introduce a new taster competition, e.g.
– Materials
– Actions in video
• Open to suggestions