1. Features and Learning Methods for
Large-scale Image Annotation and Categorization
Hideki Nakayama
The University of Tokyo
Department of Creative Informatics
2013/1/15
2. My research interest
Generic image (object) recognition
Whole-image level recognition
Weakly supervised training samples
Image annotation: attaching multiple words to a whole single image
(without region correspondence)
3. The era of big data
We can use gigantic weakly-labeled web data now!
Tags:
Nikon D200 DSLR Nikkor 60mmf28dmicro Nature
Landscape
Lake Idaho Ice Sunset Sun Mountain
Sky Frozen AnAwesomeShot
ImpressedBeauty isawyoufirst
ABigFave Ljomi ljspot4 ColorPhotoAward
http://www.flickr.com/
Flickr: 6 billion images (2011)
Facebook: 3 billion images every year
YouTube: 8 years of video uploaded every day
4. More data helps recognition
Simple k-NN using Flickr images & tags
[Figure: a query image and its nearest neighbors, with annotation results from the 100K, 1.6M, and 12M datasets; larger datasets yield more relevant tags (e.g. football, soccer, school) and fewer spurious ones]
6. Challenge: scaling to large training data
Traditional methods are not scalable in training
Bag-of-visual-words + kernel SVM (chi-square, etc.)
complexity: O(N²)–O(N³), memory: O(N²) ☹
cf. [Yang et al., CVPR’09]
Recent methods exploit linear models
With carefully designed image features, the dot product approximates the similarity between instances ☺
complexity: O(N), memory: O(1)
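A minimal numpy sketch of this idea, using the Hellinger (square-root) map as the explicit embedding (the function name is illustrative): embedding each histogram once makes the Bhattacharyya histogram kernel an ordinary dot product, so O(N) linear solvers apply.

```python
import numpy as np

def hellinger_embed(hist):
    # element-wise square root of an L1-normalized histogram
    h = hist / hist.sum()
    return np.sqrt(h)

a = np.array([4., 1., 5.])
b = np.array([2., 3., 5.])

# Bhattacharyya kernel computed directly ...
k_direct = np.sum(np.sqrt(a / a.sum()) * np.sqrt(b / b.sum()))
# ... equals a plain dot product in the embedded space
k_linear = hellinger_embed(a) @ hellinger_embed(b)
```

The same pattern (embed once, then use linear machinery) underlies the more elaborate codings discussed later in the talk.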
8. Example-based image annotation
Standard approach for the image annotation problem:
Similar-image search over the image and label data (training samples), then k-NN / kernel density estimation transfers labels to the query (e.g. tiger, grass, water, people)
MBRM [Feng et al., 2004], JEC [Makadia et al., 2008], TagProp [Guillaumin et al., 2009]
Problem: how to define the similarity between images?
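The example-based scheme above can be sketched as a toy k-NN tag transfer (data, feature dimensions, and the voting rule here are illustrative, not the exact MBRM/JEC/TagProp formulations):

```python
import numpy as np

# toy training set: image features and their tag sets
train_feats = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
train_tags = [{"tiger", "grass"}, {"tiger", "water"},
              {"city", "street"}, {"city", "people"}]

def annotate(query, K=2, n_words=2):
    # labels of the K nearest training images vote for the annotation
    dists = np.linalg.norm(train_feats - query, axis=1)
    votes = {}
    for i in np.argsort(dists)[:K]:
        for tag in train_tags[i]:
            votes[tag] = votes.get(tag, 0) + 1
    ranked = sorted(votes, key=lambda t: (-votes[t], t))
    return ranked[:n_words]

result = annotate(np.array([0.85, 0.15]))
```

Everything hinges on the distance used in `np.linalg.norm`, which is exactly the "how to define similarity?" problem the slide raises.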
9. Fundamental problem: Semantic gap
Visually similar ≠ semantically similar
“I Look Like My Dog” contest:
http://www.hemmy.net/2006/06/25/i-look-like-my-dog-contest/
Solution: Distance metric learning
10. Canonical Contextual Distance [Nakayama+, BMVC’10]
Canonical Correlation Analysis (CCA)
x: image features (e.g. BoVW), y: binary label vector
CCA finds linear transformations s = A^T(x − μ_x), t = B^T(y − μ_y) that maximize the correlation between s and t:
  Σ_XY Σ_YY^{−1} Σ_YX A = Σ_XX A Λ²,  A^T Σ_XX A = I
  Σ_YX Σ_XX^{−1} Σ_XY B = Σ_YY B Λ²,  B^T Σ_YY B = I
Σ: covariance matrices, Λ: canonical correlations
Image feature → canonical space ← label feature
Similarity measure in the latent subspace using the probabilistic structure:
  latent variable z ~ N(0, I_d),  min{p, q} ≥ d ≥ 1
  x|z ~ N(W_x z + μ_x, Ψ_x),  W_x ∈ R^{p×d}  (image feature)
  y|z ~ N(W_y z + μ_y, Ψ_y),  W_y ∈ R^{q×d}  (label feature)
Probabilistic interpretation of CCA [Bach and Jordan, 2005]
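A minimal numpy sketch of CCA as the generalized eigenproblem above, on synthetic two-view data sharing one latent signal (data generation and the small ridge term are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))                 # shared latent signal
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 2))])    # view 1 (p = 3)
Y = np.hstack([z + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 2))])    # view 2 (q = 3)

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sxx = Xc.T @ Xc / len(X) + 1e-6 * np.eye(3)   # regularized covariances
Syy = Yc.T @ Yc / len(Y) + 1e-6 * np.eye(3)
Sxy = Xc.T @ Yc / len(X)

# Sxx^-1 Sxy Syy^-1 Syx a = rho^2 a : eigenvalues are squared
# canonical correlations
M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
eigvals, eigvecs = np.linalg.eig(M)
rho = np.sqrt(np.max(eigvals.real))           # top canonical correlation
```

Since the first coordinate of each view is the latent z plus 10% noise, the top canonical correlation recovered here is close to 1.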
11. CCD for image auto-annotation
Posterior statistics in the latent space (Λ: canonical correlations; M_x, M_y with M_x M_y^T = Λ):
  E[z|x] = M_x^T A^T (x − μ_x)
  var(z|x) = I − M_x M_x^T
  E[z|x_i, y_i] = [M_x; M_y]^T [I, Λ; Λ, I]^{−1} [A^T(x_i − μ_x); B^T(y_i − μ_y)]
  var(z|x_i, y_i) = I − [M_x; M_y]^T [I, Λ; Λ, I]^{−1} [M_x; M_y]
Annotation score for word w given a query image x_s:
  P(w|x_s) = Σ_{i=1}^N P(w|l_i) P(l_i|x_s)
  P(l_i|x_s) = ∫ p(z|x_i, y_i) p(z|x_s) dz / Σ_{j=1}^N ∫ p(z|x_j, y_j) p(z|x_s) dz
  P(w|l_i) ∝ 1[w ∈ l_i] · IDF(w)
w, l_i: annotations of training samples
12. Features
Image features
BoVW, GIST, etc… (off-the-shelf ones)
Needs to be encoded in a Euclidean space
Label features
Binary occurrence vector cf. [Guillaumin et al., CVPR’10]
When the dictionary contains {plane, sea, sky, clouds, mountain}:
  I_j: plane, sky, clouds → y_j = (1, 0, 1, 1, 0)
  I_k: sky, clouds, mountain → y_k = (0, 0, 1, 1, 1)
The dot product ⟨y_j, y_k⟩ = 2 counts the number of common labels.
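The label-vector encoding is tiny enough to verify directly (dictionary and vectors taken from the slide's example):

```python
import numpy as np

# dictionary order: plane, sea, sky, clouds, mountain
y_j = np.array([1, 0, 1, 1, 0])   # I_j: plane, sky, clouds
y_k = np.array([0, 0, 1, 1, 1])   # I_k: sky, clouds, mountain

# dot product = number of labels the two images share (sky, clouds)
common = int(y_j @ y_k)
```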
13. Evaluation
Benchmark datasets
                       Corel5K   IAPR-TC12   ESP Game
# of words               260        291        268
# of training images    4,500     17,665     18,689
# of testing images       499      1,962      2,081
# of words per image
  (avg./max)            3.4/5      5.7/23     4.7/15
16. Basic pipeline
1. Local feature extraction
  1-1. feature detector (operator, grid)
  1-2. descriptor (SIFT, SURF, …)
2. Coding: image-level feature vector (e.g. (0.5, 1.2, 0.1, …))
How to encode similarity between distributions of local features?
17. Bag-of-Visual-Words (traditional) [Csurka et al. 2004]
Vector quantization → histogram
○ computationally efficient
× large reconstruction error
× non-linear property (must be used with non-linear kernel)
[Figure: local features from training images are quantized into visual words; a query image becomes a histogram over those words] Credit: K. Yanai
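The vector-quantization step can be sketched in a few lines of numpy (codebook and descriptors are toy values; real pipelines learn the codebook with k-means over SIFT-like descriptors):

```python
import numpy as np

codebook = np.array([[0., 0.], [1., 1.], [2., 0.]])   # 3 visual words
descriptors = np.array([[0.1, 0.0], [0.9, 1.1],
                        [1.0, 0.9], [2.1, 0.1]])      # local features

# assign each descriptor to its nearest visual word ...
dists = np.linalg.norm(descriptors[:, None] - codebook[None], axis=2)
assignments = dists.argmin(axis=1)

# ... and count assignments into a histogram (the image signature)
hist = np.bincount(assignments, minlength=len(codebook))
bovw = hist / hist.sum()   # L1-normalized BoVW histogram
```

This hard counting is exactly what the next slides improve on: quantization loses information, and the raw histogram needs a non-linear kernel.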
18. New BoVW① sparse coding + max pooling
Reduce reconstruction error using multiple bases (words)
Max pooling leads to linearly-separable image signatures
(taking max response for each visual word) cf. [Boureau et al., ICML’10]
[Yang+, CVPR’09] [Wang+, CVPR’10]
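Max pooling itself is a one-liner; a toy sketch contrasting it with the traditional sum pooling (the coding responses below are made-up values, not an actual sparse-coding solve):

```python
import numpy as np

# rows = local features, cols = coding responses to 4 visual words
codes = np.array([[0.9, 0.0, 0.1, 0.0],
                  [0.2, 0.7, 0.0, 0.0],
                  [0.0, 0.6, 0.0, 0.3]])

# max pooling: strongest response per word -> linearly separable signature
max_pooled = codes.max(axis=0)
# sum pooling: the BoVW-style count/average for comparison
sum_pooled = codes.sum(axis=0)
```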
19. New BoVW② encode higher-level statistics
N: # of visual words (10^3~10^4), d: dimension of descriptor (10~100)

Method                                Statistics       Dim. of image signature
BoVW                                  count (ratio)    N
VLAD [Jegou+, CVPR’10]                mean             Nd
Super vector [Zhou+, ECCV’10]         ratio+mean       N(d+1)
Fisher vector [Perronnin+, ECCV’10]   mean+variance    2Nd
Global Gaussian [Nakayama+, CVPR’10]  mean+covariance  d(d+1)/2  (N=1)
VLAT [Picard+, ICIP’11]               mean+covariance  Nd(d+1)/2

Encoded in a feature vector so that the dot product approximates the distance between distributions
20. Global Gaussian Coding [Nakayama+, CVPR’10]
Exploit the Riemannian manifold of Gaussians using the information geometry framework
  p(x; θ) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp(−(1/2)(x − μ)^T Σ^{−1} (x − μ))
  x: local descriptor
Affine coordinates:
  η = (μ_1, …, μ_d, σ_11 + μ_1², σ_12 + μ_1 μ_2, …, σ_1d + μ_1 μ_d, σ_22 + μ_2², …, σ_dd + μ_d²)^T
  ⟨Δη, Δη⟩ = Δη^T G(η) Δη,  G(η): inverse of the Fisher information matrix
We use G(η̄) (the metric at the center of the samples) for the entire space:
  ⟨η_i, η_j⟩ = η_i^T G(η̄) η_j
Somewhat approximates the KL-divergence…
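Building the affine coordinates η is straightforward: fit one Gaussian to all local descriptors of an image and flatten the mean together with the second moments E[x x^T] = Σ + μ μ^T (this sketch stops at η and omits the metric G(η̄); descriptor data is random for illustration):

```python
import numpy as np

def global_gaussian_eta(descriptors):
    # mean of all local descriptors of the image
    mu = descriptors.mean(axis=0)
    # second moments E[x x^T] = Sigma + mu mu^T, matching the eta entries
    second = descriptors.T @ descriptors / len(descriptors)
    iu = np.triu_indices(descriptors.shape[1])   # symmetric: keep upper
    # d mean entries + d(d+1)/2 second-moment entries
    return np.concatenate([mu, second[iu]])

X = np.random.default_rng(0).normal(size=(100, 3))   # toy descriptors
eta = global_gaussian_eta(X)
```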
21. Competition
Large-scale visual recognition challenge 2010
1000-class categorization
1.2M training images, 150K testing images
Evaluated by top-5 classification accuracy
Part of ImageNet dataset [Deng et al.]
Labeled with Amazon Mechanical Turk
14M images, 22K categories (as of 2011)
Semantic structure according to WordNet
Credit: Fei-Fei Li
22. Result (2010)
11 teams participated
1. NEC+UIUC (72%): 80,000~260,000 dim ×6
2. Xerox Research (64%): 260,000 dim ×2
3. ISI (55%): 12,000 dim
4. UC Irvine (53%)
5. MIT (46%)
Examples
http://www.isi.imi.i.u-tokyo.ac.jp/pattern/ilsvrc/index.html
23. 2010 Winner: NEC-UIUC
LCC + super vector coding
Ensemble of six classifiers using different features
Parallelized feature extraction (Hadoop)
Linear SVM (Averaging SGD)
LCC → 2 days, super vector → 7 days (with an 8-core machine)
24. 2011 Winner: XRCE
Fisher vector
520K dim ×2 (SIFT, color)
2 days with a 16-core machine
Linear SVM (SGD)
1.5 days with a 16-core machine
25. 2012 Winner: Univ. Toronto
Deep learning
Huge convolutional neural network from raw images
Two GPUs, one week
Outperformed the runner-up by roughly 10 percentage points in top-5 error
26. Summary
Large-scale image recognition is now a hot topic
Millions of training images, tens of thousands of categories
Scalability is the key issue
Linear training methods + compatibly-designed features
If we can approximate the sample similarity with a dot product, we can simply apply linear methods!
Explicit embedding
Fisher kernel
KPCA + Nyström method
Personal interest: Can we do this with graph kernels?
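Of the three embedding routes listed above, the Nyström method can be sketched in numpy: sample m landmark points, and use the landmark kernel matrix to build an explicit map whose dot products approximate the full kernel (data, γ, and the ridge term are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
landmarks = X[rng.choice(len(X), size=50, replace=False)]

def rbf(A, B, gamma=0.05):
    # RBF kernel matrix between two point sets
    d2 = ((A[:, None] - B[None]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# m x m landmark kernel (small ridge for numerical stability)
W = rbf(landmarks, landmarks) + 1e-8 * np.eye(50)
U, s, _ = np.linalg.svd(W)                    # W = U diag(s) U^T (PSD)

# explicit embedding: phi = K_nm W^{-1/2}, so phi phi^T = K_nm W^-1 K_mn
phi = rbf(X, landmarks) @ U / np.sqrt(s)

K_approx = phi @ phi.T
K_exact = rbf(X, X)
err = np.abs(K_approx - K_exact).max()
```

After this embedding, training reduces to a linear model on `phi`, recovering the O(N) time / O(1)-per-sample memory regime of the earlier slides.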