1) Modern techniques use deep neural networks to jointly learn multimodal representations of videos, images, and text for tasks like video captioning and retrieval.
2) Convolutional neural networks (CNNs) are commonly used to extract visual features from videos and images, while recurrent neural networks (RNNs) like LSTMs model the temporal dynamics of video and the sequence of words in sentences.
3) Models are trained using tasks like predicting captions to learn joint representations in a shared semantic space, and evaluated on tasks like retrieving relevant videos for a given text query.
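The retrieval step in point 3 can be sketched as a nearest-neighbor search in the shared space. This is a minimal illustration, assuming the sentence and video embeddings have already been computed by some trained model; the embedding values, the clip names, and the helper `rank_videos` are hypothetical and not taken from the slides.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_videos(query_embedding, video_embeddings):
    """Return (video_id, score) pairs, most similar to the query first."""
    scored = [
        (video_id, cosine_similarity(query_embedding, embedding))
        for video_id, embedding in video_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hand-made toy embeddings (assumed values, for illustration only).
video_embeddings = {
    "guitar_clip": [0.9, 0.1, 0.0],
    "zebra_clip": [0.0, 0.2, 0.9],
    "piano_clip": [0.7, 0.6, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]  # stands in for an embedded text query

ranking = rank_videos(query_embedding, video_embeddings)
```

Cosine similarity is only one possible choice here; a Euclidean distance in the same space would give a similar ranking procedure.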
2. [Figure: sentence embedding, video embedding, and web images (obtained by image search) in a common embedding space. Labels: Deep Semantic Feature, Sentence, Video, Web Images, Embedding Space. Example sentence: “A baby is playing a guitar.”]
12. [Babaguchi 2004]
[Babaguchi 2004] N. Babaguchi, Y. Kawai, T. Ogura, and T. Kitahashi, “Personalized abstraction of broadcasted American football video by highlight selection,” TMM 2004.
13. [Gong 2014]
• Fisher vector / SIFT descriptors
• Coverage
[Gong 2014] B. Gong, W.-L. Chao, K. Grauman, and F. Sha, “Diverse sequential subset selection for supervised video summarization,” NIPS 2014.
14. [Gygli 2014]
• Importance
[Gygli 2014] M. Gygli, H. Grabner, H. Riemenschneider, and L. van Gool, “Creating summaries from user videos,” ECCV 2014.
23. [Figure: example captions “A man playing a guitar outside his house” and “A flock of zebras grazing.”, followed by a question mark.]
24. (e.g. [Li 2010])
[Figure: concept vocabulary (…, man, woman, piano, guitar, zebra, lion, grass, …) with two example binary vectors:]
{1, 0, …, 1, 0, …, 0, 0, …, 0}
{0, 0, …, 0, 0, …, 1, 0, …, 1}
[Li 2010] L.-J. Li, H. Su, E. P. Xing, F.-F. Li, “Object bank: A high level image representation for scene classification & semantic feature sparsification,” NIPS 2010.
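The binary vectors on this slide can be read as concept-presence indicators over a fixed vocabulary. The sketch below shows that encoding under simplifying assumptions: the vocabulary is only the seven words visible on the slide, the detections are given by hand, and `concept_vector` is a hypothetical helper. Object bank itself [Li 2010] works with detector responses rather than plain 0/1 flags.

```python
# Vocabulary limited to the concepts legible on the slide (an assumption;
# the real vocabulary is longer, as the ellipses indicate).
VOCABULARY = ["man", "woman", "piano", "guitar", "zebra", "lion", "grass"]


def concept_vector(detected, vocabulary=VOCABULARY):
    """Return a 0/1 list with a 1 wherever a vocabulary concept was detected."""
    present = set(detected)
    return [1 if concept in present else 0 for concept in vocabulary]


# Hand-specified detections standing in for real detector outputs.
guitar_scene = concept_vector(["man", "guitar"])
zebra_scene = concept_vector(["zebra", "grass"])

# Number of concepts the two scenes have in common.
shared_concepts = sum(a & b for a, b in zip(guitar_scene, zebra_scene))
```

Two scenes with no concepts in common get orthogonal vectors, which is the behavior the two example vectors on the slide suggest.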
25. Deep Neural Net
• (e.g., word2vec) + Recurrent Neural Net (RNN)
• Convolutional Neural Net (CNN) + Pooling
• 3D-CNN
• + RNN
29. CNN + Pooling (e.g. [Pan 2016])
• ILSVRC CNN
• AlexNet, VGG-16, GoogLeNet, ResNet
• Mean Pooling
• FC
[Pan 2016] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, “Jointly modeling embedding and translation to bridge video and language,” CVPR 2016.
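Mean pooling over frames, as listed on this slide, can be sketched as follows. The per-frame feature values are made up for illustration; in practice each row would be the activation of a CNN layer (for example an FC layer) for one frame. `mean_pool` is a hypothetical helper name.

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    if not frame_features:
        raise ValueError("need at least one frame")
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    return [
        sum(frame[d] for frame in frame_features) / num_frames
        for d in range(dim)
    ]


# Two frames with 3-dimensional features (toy values, not real CNN outputs).
frame_features = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 2.0],
]
video_feature = mean_pool(frame_features)
```

The pooled vector has the same dimensionality as a single frame feature and ignores frame order, so temporal structure has to be captured by other means.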
30. 3D-CNN (e.g. [Tran 2015])
• FC
[Tran 2015] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” ICCV 2015.
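What sets a 3D-CNN apart from a frame-wise CNN is that its kernels also extend along the time axis. The following is a minimal, unoptimized sketch of a single-channel 3D convolution (cross-correlation, no padding, stride 1) on nested lists. The toy video and the temporal-difference kernel are assumptions chosen for illustration and are not taken from [Tran 2015].

```python
def conv3d_valid(volume, kernel):
    """Single-channel 3D cross-correlation, no padding, stride 1.

    volume: nested list indexed [time][row][column]
    kernel: nested list indexed [time][row][column]
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    output = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        output.append(plane)
    return output


# Toy video: three 2x2 frames whose brightness changes over time.
video = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
# A 2x1x1 kernel that subtracts the previous frame from the current one.
temporal_difference = [[[-1.0]], [[1.0]]]

response = conv3d_valid(video, temporal_difference)
```

Because the kernel spans two frames, the response encodes change between frames, which a 2D kernel applied to each frame separately cannot see.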
34. RNN
Stacked convolutional GRU [Ballas 2016]
[Ballas 2016] N. Ballas, L. Yao, C. Pal, and A. Courville, “Delving deeper into convolutional networks for learning video representations,” ICLR 2016.
35. RNN
Hierarchical RNN [Pan 2015]
[Pan 2015] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for video representation with application to captioning,” CVPR 2016.
38. [Figure: example captions “A man playing a guitar outside his house” and “A flock of zebras grazing.”]
39. [Figure: example captions “A man playing a guitar outside his house” and “A flock of zebras grazing.”]
40. • Play the keyboard vs. Type the keyboard
Query: “A man is playing a keyboard.”
[Figure: images labeled “keyboard”]
41. • Sentence: LSTM; video: CNN + mean pooling
• Contrastive loss
• LSTM
[Figure: the sentence “A man is playing a keyboard” (LSTM) and a video (CNN + mean pooling) are mapped into a common semantic space.]
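The contrastive loss named on this slide can be sketched in its commonly used margin-based form: matching video-sentence pairs are pulled together and non-matching pairs are pushed at least a margin apart. The exact formulation used in the talk is not recoverable from the slide text, so the squared-distance form, the `margin` parameter, and the function names below are assumptions.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(video_embedding, sentence_embedding, is_match, margin=1.0):
    """Margin-based contrastive loss for one video-sentence pair.

    Matching pairs are penalized by their squared distance; non-matching
    pairs are penalized only when they are closer than `margin`.
    """
    distance = euclidean_distance(video_embedding, sentence_embedding)
    if is_match:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# Toy embeddings (assumed values, not outputs of a trained model).
loss_positive = contrastive_loss([0.0, 0.0], [3.0, 4.0], is_match=True)
loss_far_negative = contrastive_loss([0.0, 0.0], [3.0, 4.0], is_match=False)
loss_near_negative = contrastive_loss([0.0, 0.0], [0.5, 0.0], is_match=False)
```

A non-matching pair that is already farther apart than the margin contributes zero loss, so training effort goes to the pairs that are still confusable.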
42. • CNN, RNN
[Figure: network architecture with three branches: CNN for Videos, CNN for Web Images, and RNN for Sentences (one RNN step per token: “A” “dog” “is” “eating” “watermelon” “.”). Other labels: Video, Web images, Sentence, Pooling, Fully-connected Layers, +, Loss.]
43. • MS Video Description Corpus (# clips: 1,970; # texts: 85K)
Example captions: “A child dances to the TV”, “A man is playing a guitar”, “A cat is hitting the keys on a piano”
44. [Otani 2016]
Query GoogLeNet+VS GoogLeNet+ALL2
(1) A man is playing a keyboard.
(2) Kids are playing in a pool.
(3) A man is trimming fat from a roast.
Query GoogLeNet+VI GoogLeNet+ALL2
(4) A boy is singing into a microphone.
[Otani 2016] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, N. Yokoya, “Learning joint representations of videos and sentences with web image search,” ECCVW 2016.