https://telecombcn-dl.github.io/2018-dlcv/
Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.
4. Where do we get training data?
Structure-from-Motion gives us plenty of data
TRAINING TESTING
5. Related work: PoseNet
Kendall et al. PoseNet:. ICCV 2015
Non-Linear Embedding
Pretrained
y R2048
Linear Regression
p R3
FC
FC
q R4
6. Outdoor localization: SIFT wins
• Experiments on Cambridge Landmarks
SIFT CNN
Walch, Hazirbas, Leal-Taixé, Sattler, Hilsenbeck, Cremers, Image-based localization using LSTMs for structured feature correlation. ICCV, 2017
7. Indoor localization: SIFT suffers
• Experiments on 7Scenes
Number of images that it failed to localize
SIFT CNN
8. SIFT-based methods do not work at all!
Our new dataset TUM-LSI: SIFT dies
https://github.com/NavVisResearch/NavVis-Indoor-Dataset
9. • One separate model is needed per scene
Limitations of current methods
Pretrained
y R2048
p R3
FC
FC
q R4
Kendall et al. PoseNet:. ICCV 2015 Brachmann et al. DSAC:. CVPR 2017
10. • One separate model is needed per scene
• The most used loss function requires the setting of a
hyperparameter which is again scene dependent
Limitations of current methods
Pretrained
y R2048
p R3
FC
FC
q R4
DOES NOT SCALE
11. • L2 loss
pose label pose
prediction
camera center
position
camera rotation
in quaternion
scaling factor to balance
the weighting between
the rotation error and the
position error
PoseNet loss functions
Kendall et al. PoseNet:. ICCV 2015
Need to find its value per scene à can range from 200 to 2500
No geometry information
12. • Geometric loss function: Minimize re-projection error of 3D points
visible in image
PoseNet loss functions
Kendall et al. Geometric Loss Functions for Camera Pose Regression with Deep Learning. CVPR 2017
Network needs to be pre-trained on L2 loss
One network per scene
More accurate results
13. Relative Pose Estimation
• Use a neural network to predict relative poses
• Loss function: PoseNet L2 loss
Laskar et al. Camera relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network. ICCVW 2017
Loss still depends on a scene-dependent
hyperparameter
One network for all scenes
14. • One separate model is needed per scene
• The most used loss function requires the setting of a
hyperparameter which is again scene dependent
Limitations of current methods
15. • One separate model is needed per scene
• The most used loss function requires the setting of a
hyperparameter which is again scene dependent
Proposed method
Relative Pose Estimation
with
Geometric Matching
Q. Zhou, T. Sattler, L. Leal-Taixé. Submitted to ECCV 2018
Essential Matrix
Regression
One network
for all scenes
No
hyperparameter
on the loss
16. • One separate model is needed per scene
• The most used loss function requires the setting of a
hyperparameter which is again scene dependent
Proposed method
Q. Zhou, T. Sattler, L. Leal-Taixé. Submitted to ECCV 2018
Essential Matrix
Regression
One network
for all scenes
No
hyperparameter
on the loss
Relative Pose Estimation
with
Geometric Matching
17. • Siamese architecture
– Shared weights on the input
image pair
– Feature concatenation
Regressing relative poses
18. I. Rocco et al. “Convolutional neural network architecture for geometric matching. CVPR 2017.
Geometric matching layer
• Mimic the geometrical matching process
Features
from the
siamese net
Fixed
operation,
no learnable
parameters
19. Relative Pose Estimation
with
Geometric Matching
• One separate model is needed per scene
• The most used loss function requires the setting of a
hyperparameter which is again scene dependent
Proposed method
Q. Zhou, T. Sattler, L. Leal-Taixé. Submitted to ECCV 2018
One network
for all scenes
No
hyperparameter
on the loss
Essential Matrix
Regression
20. • Essential matrix between query and training image
• L2 loss function on the entries of the essential matrix
The loss is
hyperparameter
free!
Essential Matrix loss
Fixing the scale of translations to
get consistent predictions
21. Are we really learning an essential matrix?
• Ratio between first two eigenvalues is close to 1?
22. Are we really learning an essential matrix?
• Ratio between first two eigenvalues is close to 1?
• Third eigenvalue close to 0?
23. Are we really learning an essential matrix?
• Ratio between first two eigenvalues is close to 1?
• Third eigenvalue close to 0?
• We learn a good approximation of a true essential matrix
• Feature-based methods also project the resulting matrix to a rank 2
matrix (due to noise)
• Forcing this projection does not have any effect on the results on
any of the tested datasets
24. A CNN model for Image
retrieval to identify 30
nearest neighbors for the
query image
F. Radenovic ́et al.. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples.. ECCV 2016
Full localization pipeline
25. EssNet to regress essential
matrices and estimate
relative poses
Full localization pipeline
28. Comparison to SOA
• Experiments on outdoor scenes, Cambridge Landmarks
• SOA positional error with better rotational error
29. Comparison to SOA
• Experiments on outdoor scenes, Cambridge Landmarks
• Only geometric loss is more accurate in rotation but 21% less
accurate in translation
30. Comparison to SOA
• Experiments on outdoor scenes, Cambridge Landmarks
One single
network
One network trained per scene
31. What can we do with more data?
• Training relative poses allows us to leverage more data
Trained on
Cambridge
Landmarks
Pre-trained on a new dataset and fine-
tuned on Cambridge Landmarks
32. Out with the Old?
• SIFT-based methods still outperform CNN-based methods but
there is potential in indoor scenes
• EssNet: relative pose estimation
– mimics the visual localization pipeline à takes advantage of multiple
view geometry
– One network to perform localization in any scene
• Generalization still has a long way to go: if no images of the test
scene are shown to the network, performance drops
40