This document discusses unsupervised and supervised approaches to object retrieval.
It begins by covering unsupervised approaches, describing common local and global features used for object retrieval like SIFT, HOG, and deep features. It also discusses feature aggregation methods like bag-of-features and Fisher vectors.
The document then reviews state-of-the-art results, noting methods that achieved mean average precision scores over 0.8 on standard datasets using techniques like selective match kernels and sum-pooled convolutional features.
It concludes by proposing that future work could explore improved features, better distance metrics, and the incorporation of supervision, suggesting that object retrieval may benefit from a dual supervised/unsupervised learning approach.
12. Global and deep features
• GIST features [Oliva et al., 2001]
Ø Describe an image globally by its spectral (frequency-domain) information
• Deep features [Krizhevsky et al., 2012]
Ø Extracted from neural networks
13. Aggregated Features
• BoF [Sivic et al., 2003]
• Hamming Embedding [Jégou et al., 2008]
• Fisher Vector [Perronnin et al., 2007]
• VLAD [Jégou et al., 2012]
14. Bag of Features (BoF)
• Cluster local descriptors to build a dictionary.
• Compute the BoF vector as a histogram of
visual words.
[Figure: local descriptors from images are quantized to dictionary centroids (c1, c2, c3, …) and counted into a Bag of Features histogram]
[Sivic et al., 2003]
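The two steps above (quantize each local descriptor to its nearest visual word, then count) can be sketched in a few lines of pure Python; the 2-D descriptors and the three-word dictionary below are made up for illustration:

```python
from collections import Counter

def nearest(desc, centroids):
    # Index of the dictionary centroid closest to one local descriptor.
    return min(range(len(centroids)),
               key=lambda k: sum((d - c) ** 2 for d, c in zip(desc, centroids[k])))

def bof_histogram(descriptors, centroids):
    # Quantize every local descriptor to its visual word and count occurrences.
    counts = Counter(nearest(d, centroids) for d in descriptors)
    return [counts.get(k, 0) for k in range(len(centroids))]

# Toy 2-D example: three visual words, four local descriptors.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
descs = [(0.1, 0.2), (9.8, 0.1), (0.2, 9.9), (0.0, 0.1)]
print(bof_histogram(descs, centroids))  # → [2, 1, 1]
```

In practice the dictionary is built by k-means over millions of SIFT descriptors, and the histogram is tf-idf weighted.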
15. Hamming Embedding
• Each local descriptor of an image is encoded by a short binary signature that refines its position inside its quantization cell.
[Jégou et al., 2008]
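A toy sketch of the signature idea: threshold each descriptor component against a per-cell learned median to get one bit per dimension, then compare signatures by Hamming distance. (The actual method additionally applies a random orthogonal projection before thresholding; the 4-D values here are illustrative.)

```python
def he_signature(desc, medians):
    # One bit per dimension: is the component above the cell's learned median?
    return tuple(int(d > m) for d, m in zip(desc, medians))

def hamming(sig_a, sig_b):
    # Number of differing bits between two binary signatures.
    return sum(a != b for a, b in zip(sig_a, sig_b))

# Toy cell with per-dimension medians learned from training descriptors.
medians = (0.5, 0.5, 0.5, 0.5)
q = he_signature((0.9, 0.1, 0.7, 0.2), medians)   # → (1, 0, 1, 0)
x = he_signature((0.8, 0.2, 0.1, 0.3), medians)   # → (1, 0, 0, 0)
print(hamming(q, x))  # → 1
```

At query time, descriptors in the same cell are accepted as matches only if their Hamming distance is below a threshold.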
16. Fisher Vector (FV)
• Cluster the local descriptors by GMM
• Fisher Kernel
• Fisher Vector
[Figure: images → local descriptors → GMM → Fisher vector]
[Perronnin et al., 2007]
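As a simplified sketch of the pipeline, the snippet below computes only the mean-gradient part of the Fisher vector for a fixed diagonal GMM, following the standard form G_k = (1 / (N·√w_k)) · Σ_x γ_k(x) · (x − μ_k) / σ_k; the two-component GMM and the descriptors are toy values, and the Gaussian normalizing constants are omitted (they cancel in the posterior when all components share the same σ):

```python
import math

def fv_mean_part(descriptors, weights, means, sigmas):
    # Fisher vector restricted to the gradient w.r.t. the GMM means
    # (diagonal covariances); normalizing constants omitted, see lead-in.
    K, D, N = len(weights), len(means[0]), len(descriptors)
    fv = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        # Unnormalized component likelihoods, then posteriors gamma_k(x).
        lik = []
        for k in range(K):
            e = sum(((x[d] - means[k][d]) / sigmas[k][d]) ** 2 for d in range(D))
            lik.append(weights[k] * math.exp(-0.5 * e))
        z = sum(lik) or 1.0
        for k in range(K):
            g = lik[k] / z
            for d in range(D):
                fv[k][d] += g * (x[d] - means[k][d]) / sigmas[k][d]
    return [fv[k][d] / (N * math.sqrt(weights[k]))
            for k in range(K) for d in range(D)]

weights = [0.5, 0.5]
means = [(0.0, 0.0), (5.0, 5.0)]
sigmas = [(1.0, 1.0), (1.0, 1.0)]
print(fv_mean_part([(0.5, -0.5), (5.5, 5.0)], weights, means, sigmas))
```

The full Fisher vector also includes gradients w.r.t. the variances, followed by power and L2 normalization.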
17. VLAD
• Replace the GMM in the FV by k-means clustering.
• Approximate the FV by accumulating per-centroid residuals:
Ø v_k = Σ_{x : NN(x) = c_k} (x − c_k)
[Figure: images → local descriptors → k-means → VLAD vector]
[Jégou et al., 2012]
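The residual-accumulation formula above translates almost directly into code; this pure-Python sketch uses toy 2-D centroids and L2-normalizes the final vector, as is standard:

```python
import math

def vlad(descriptors, centroids):
    # Accumulate residuals (x - c_k) per nearest centroid, then L2-normalize.
    dim = len(centroids[0])
    v = [[0.0] * dim for _ in centroids]
    for x in descriptors:
        k = min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
        for i in range(dim):
            v[k][i] += x[i] - centroids[k][i]
    flat = [a for row in v for a in row]
    norm = math.sqrt(sum(a * a for a in flat)) or 1.0
    return [a / norm for a in flat]

centroids = [(0.0, 0.0), (10.0, 10.0)]
descs = [(1.0, 0.0), (0.0, 1.0), (9.0, 10.0)]
print(vlad(descs, centroids))
```

With K centroids and D-dimensional descriptors, the output is a single K·D-dimensional image vector.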
22. Indexing and compressing data
• Coarse-to-fine strategy
Ø Use quantization techniques to build an inverted
file (IVF)
[Figure: the inverted file maps each coarse centroid (c1, c2, c3) to a list of (id, m-byte code) entries; compressed vectors give faster search and a better memory footprint]
[Jégou et al., 2011]
23. Quantization techniques
• Compress the data for a better memory footprint
• Search accuracy is acceptable with appropriate parameters
Ø Recall = 95% with a 64-bit code
[Jégou et al., 2011]
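The cited technique is product quantization: split each vector into m subvectors and quantize each subvector with its own small codebook, so the whole vector is stored as m codebook indices (m bytes when each codebook has 256 entries). A toy sketch with 4-D vectors, m = 2, and two codewords per codebook:

```python
def pq_encode(x, codebooks):
    # Split x into m subvectors; quantize each with its own codebook.
    m = len(codebooks)
    d_sub = len(x) // m
    code = []
    for j in range(m):
        sub = x[j * d_sub:(j + 1) * d_sub]
        code.append(min(range(len(codebooks[j])),
                        key=lambda k: sum((a - b) ** 2
                                          for a, b in zip(sub, codebooks[j][k]))))
    return code  # m indices -> m bytes when each codebook has 256 entries

def pq_decode(code, codebooks):
    # Reconstruction: concatenate the chosen codewords.
    out = []
    for j, k in enumerate(code):
        out.extend(codebooks[j][k])
    return out

# Toy setup: 4-D vectors, m = 2 subquantizers with 2 codewords each.
codebooks = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 0.0)]]
x = (0.9, 1.1, 0.1, 0.8)
code = pq_encode(x, codebooks)
print(code, pq_decode(code, codebooks))  # → [1, 0] [1.0, 1.0, 0.0, 1.0]
```

At search time, distances to compressed vectors are computed per subvector from a lookup table, without decoding.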
24. Feature processing
• Square rooting [Arandjelovic & Zisserman, 2012]
• L2-normalization [Jain et al., 2012]
• Centralization [Tolias et al., 2013]
• Down-weight highly populated cells in aggregation [Jégou et al., 2009]
• Whitening [Jégou et al., 2010]
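The first two items combine into the well-known RootSIFT transform: L1-normalize, take the elementwise square root, then L2-normalize, so that dot products between descriptors behave like a Hellinger kernel. A minimal sketch:

```python
import math

def square_root_and_l2(v):
    # Square rooting (RootSIFT-style): L1-normalize, take elementwise sqrt,
    # then L2-normalize the result.
    s = sum(abs(a) for a in v) or 1.0
    v = [math.sqrt(abs(a) / s) for a in v]
    n = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / n for a in v]

print(square_root_and_l2([4.0, 1.0, 0.0, 0.0]))
```

Note that after L1-normalization and the square root, the vector is already unit-L2; the final step matters when the input has negative components.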
25. Image processing: re-ranking
• Estimate a geometric transformation between the query region and each target image.
• Target images are re-ranked based on the discriminability of the spatially verified visual words.
mAP with BoF on the Oxford Buildings dataset: 0.618 → 0.645
[Philbin et al., 2007]
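The cited method estimates an affine transformation with a RANSAC-style procedure; as a much simpler illustration of the same "spatial consistency" idea, the sketch below verifies matches under a translation-only model, counting how many tentative feature matches agree on one shift (all coordinates are made up):

```python
def count_inliers(matches, tol=1.0):
    # matches: [((qx, qy), (tx, ty)), ...] tentative feature correspondences.
    # Each match votes for the translation it implies; the best translation's
    # support is the number of spatially consistent matches.
    best = 0
    for (qx, qy), (tx, ty) in matches:
        dx, dy = tx - qx, ty - qy
        support = sum(1 for (ax, ay), (bx, by) in matches
                      if abs((bx - ax) - dx) <= tol and abs((by - ay) - dy) <= tol)
        best = max(best, support)
    return best

matches = [((0, 0), (5, 5)), ((1, 0), (6, 5)), ((2, 2), (7, 7)), ((3, 1), (0, 9))]
print(count_inliers(matches))  # → 3
```

Target images with more verified inliers move up in the ranking.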
26. Image processing: query expansion
• Re-query after reconstructing the original query.
• The new query is built from the spatially verified results of the first-round retrieval.
mAP with BoF on the Oxford Buildings dataset: 0.645 → 0.696
[Chum et al., 2007]
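In vector form, the simplest variant is average query expansion: the new query is the mean of the original query and the verified top-ranked result vectors, re-normalized. A sketch with made-up 2-D vectors (the cited work operates on BoF vectors of verified regions):

```python
import math

def average_query_expansion(query, verified_results):
    # New query = mean of the original query and the spatially verified
    # top-ranked result vectors, re-normalized to unit length.
    vecs = [query] + verified_results
    mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(len(query))]
    n = math.sqrt(sum(a * a for a in mean)) or 1.0
    return [a / n for a in mean]

expanded = average_query_expansion([1.0, 0.0], [[0.8, 0.6], [0.6, 0.8]])
print(expanded)
```

Expansion only helps when the first-round results are spatially verified; otherwise it amplifies false positives.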
28. Nearest neighbor search
• Datasets: 1M–1B vectors with ground-truth data
Ø BIGANN dataset: http://corpus-texmex.irisa.fr/
• Evaluation
Ø recall@R = the proportion of queries whose true nearest neighbor is ranked in the top-R results.
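The recall@R definition above is a one-liner to compute; the rankings and ground-truth ids below are toy values:

```python
def recall_at_r(rankings, true_nn, r):
    # Fraction of queries whose true nearest neighbor appears in the top-R list.
    hits = sum(1 for ranked, nn in zip(rankings, true_nn) if nn in ranked[:r])
    return hits / len(rankings)

# Three queries; each ranking is a list of returned database ids.
rankings = [[4, 7, 1], [2, 9, 5], [8, 3, 6]]
true_nn = [7, 5, 0]
print(recall_at_r(rankings, true_nn, 2))
```

recall@R is monotonically non-decreasing in R, which is why results are usually reported as a curve over R.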
29. Quantization techniques
• Additive Quantization [Babenko et al., 2014]
Ø Approximate a vector by the sum of codewords, one from each codebook.
Ø Learn the codewords by iterative optimization.
• Composite Quantization [Zhang et al., 2014]
Ø Constrain the inner products between codewords from different codebooks so that distances can still be computed efficiently.
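To make the "sum of codewords" idea concrete, here is a greedy encoding sketch: pick from each codebook the codeword that best reduces the current residual. (The actual method uses beam search over codeword combinations; greedy encoding and the toy 2-D codebooks are simplifications for illustration.)

```python
def aq_encode_greedy(x, codebooks):
    # Greedy sketch: choose from each codebook the codeword that best reduces
    # the current residual; the vector is approximated by the SUM of codewords.
    residual = list(x)
    code = []
    for cb in codebooks:
        k = min(range(len(cb)),
                key=lambda j: sum((r - c) ** 2 for r, c in zip(residual, cb[j])))
        code.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return code, residual

codebooks = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.5), (0.5, 0.0)]]
x = (1.4, 1.1)
code, res = aq_encode_greedy(x, codebooks)
print(code)  # → [1, 1]
```

Unlike product quantization, the codebooks here span the full vector dimension, which gives a lower approximation error for the same code size.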
30. Indexing techniques
• Multi-indexing [Babenko et al., 2012, 2015]
• Performance on a dataset of one billion SIFT vectors:
Ø Memory: 12 GB
Ø Search time: 2 ms/query
Ø recall@100 = 70%
31. Image search
• Dataset: Oxford Buildings dataset [Philbin et al., 2007]
• Evaluation
Ø mAP: mean average precision, i.e. the mean of the average precision scores over a set of queries.
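The mAP definition above, computed directly: average precision is the mean of the precision values at each rank where a relevant item appears, and mAP averages that over queries. The result lists and ground-truth sets are toy values:

```python
def average_precision(ranked, relevant):
    # AP: mean of precision values at each rank where a relevant item appears.
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(results, ground_truth):
    aps = [average_precision(r, g) for r, g in zip(results, ground_truth)]
    return sum(aps) / len(aps)

# Two toy queries over a small database.
results = [["a", "b", "c"], ["x", "y", "z"]]
ground_truth = [{"a", "c"}, {"y"}]
print(mean_average_precision(results, ground_truth))
```

Dividing by the number of relevant items (not the number retrieved) penalizes relevant images that are never returned.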
32. Selective Match Kernel
• [Tolias et al., 2013]
• Apply power normalization to each VLAD component to improve accuracy.
• Use hashing to reduce the memory footprint.
• mAP = 0.817 on the Oxford5K dataset [Philbin et al., 2007]
33. Neural Codes
• [Babenko et al., 2014]
• Use features extracted from a neural network for object retrieval.
• The features are fine-tuned.
• mAP = 0.435 with fc6 features on the Oxford5K dataset.
34. Sum-pooled convolutional features
• [Babenko et al., 2015]
• Deep convolutional features are sum-pooled with a centered Gaussian weighting to improve accuracy.
• mAP = 0.657 on the Oxford5K dataset.
35. Summary of image retrieval results
• The search framework for deep features in object retrieval still needs to be improved.

Method | Feature | Framework | mAP
ASMK [Tolias et al., 2013] | SIFT | VLAD | 0.817
Neural codes [Babenko et al., 2014] | Deep features | – | 0.435
SPoC [Babenko et al., 2015] | Deep features | SPoC | 0.657
37. Attempts on current topics
• Improve the features:
Ø Feature fusion
Ø Find new match kernels
Ø Improve the system with deep features?
• Improve the distance metrics and NN search.
38. Dual-process system
• [Stanovich et al., 1999, 2004]
• System 1: fast, high capacity, implicit knowledge and basic emotions only.
• System 2: slow, limited capacity, explicit knowledge and complicated emotions.
39. Supervised Object Retrieval?
• More than just applying deep features to retrieval.
• Learning while searching?
• Learning with feedback?
40. The Duality of Object Retrieval
• The collaboration between unsupervised learning and supervised learning in object retrieval.
[Stanovich et al., 1999, 2004]
41. Conclusion
• Basic Object Retrieval
Ø Features: SIFT, HOG, GIST, deep features
Ø Distance metrics and NN search
Ø Hamming Embedding and Aggregation
Ø Pre-processing and post-processing
• State-of-the-art results
• Future attempts: the duality of supervised & unsupervised learning?