Approaches to Video Summarization Using Deep Learning (7th STAIR Lab Artificial Intelligence Seminar)

2
Deep Semantic Feature
[Figure: Sentence (“A baby is playing a guitar.”), Sentence Embedding, Video, Video Embedding, Web Images (Image Search), Embedding Space]

7
[1] [2]
[1] https://www.ibm.com/blogs/think/2016/08/31/cognitive-movie-trailer/
[2] Uchihashi et al., “Video Manga: generating semantically meaningful video summaries,” ACM MM, 1999
From: https://www.youtube.com/watch?v=gJEzuYynaiw

• vs
• Coverage/Representative vs Importance/Interestingness
9

11
Coverage
Importance/Preference

: [Babaguchi 2004]
12
[Babaguchi 2004] N. Babaguchi, Y. Kawai, T. Ogura, and T. Kitahashi, “Personalized abstraction of broadcasted
American football video by highlight selection,” TMM 2004.

: [Gong 2014]
• Fisher vector/SIFT desc. /1
• Coverage
13
[Gong 2014] B. Gong, W.-L. Chao, K. Grauman, and F. Sha, “Diverse sequential subset selection for supervised
video summarization,” NIPS 2014.

: [Gygli 2014]
• Importance
14
etc.
Importance
[Gygli 2014] M. Gygli, H. Grabner, H. Riemenschneider, and L. van Gool, “Creating summaries from user
videos,” ECCV 2014.

19
… “A man playing a guitar outside his house”
“A flock of zebras grazing.”

23
… “A man playing a guitar outside his house”
“A flock of zebras grazing.”
?

(e.g. [Li 2010])
24
…
man, woman, …, piano, guitar, …, zebra, lion, …, grass
{1, 0, … 1, 0, …, 0, 0, …, 0}
{0, 0, … 0, 0, …, 1, 0, …, 1}
[Li 2010] L.-J. Li, H. Su, E. P. Xing, F.-F. Li, “Object bank: A high level image representation for scene classification
& semantic feature sparsification,” NIPS 2010.
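
The indicator vectors in the figure above can be sketched in a few lines. This is a minimal illustration only: the vocabulary order and the helper names `concept_vector` and `overlap` are assumptions, and [Li 2010] builds its representation from object detector responses rather than hand-set 0/1 flags.

```python
# Hypothetical concept vocabulary, taken from the labels in the figure.
VOCAB = ["man", "woman", "piano", "guitar", "zebra", "lion", "grass"]


def concept_vector(concepts, vocab=VOCAB):
    """Return a binary indicator vector: 1 if the concept is present, else 0."""
    present = set(concepts)
    return [1 if word in present else 0 for word in vocab]


def overlap(u, v):
    """Count the concepts shared by two indicator vectors (their dot product)."""
    return sum(a * b for a, b in zip(u, v))


guitar_clip = concept_vector(["man", "guitar"])
zebra_clip = concept_vector(["zebra", "grass"])
shared = overlap(guitar_clip, zebra_clip)
```

Under this representation the guitar clip and the zebra clip share no concepts, so `shared` is 0.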

•
• (e.g., word2vec) + Recurrent Neural Net (RNN)
•
• Convolutional Neural Net (CNN) + Pooling
• 3D-CNN
• + RNN
25
Deep Neural Net

DNN
26
…
“A man playing a guitar outside his house”
“A flock of zebras grazing.”
DNN DNN

( )
28
{“A”, “man”, “playing”, “a”, “guitar”, “outside”, “his”, “house”, “.”}
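
The token list above can be reproduced with a small regular-expression tokenizer. The function name `tokenize` and the splitting rule (runs of word characters, or single punctuation marks) are assumptions for illustration; the slide does not say which tokenizer was used.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


tokens = tokenize("A man playing a guitar outside his house.")
```

Each token would then be mapped to a word vector (e.g., word2vec) before being fed to the sentence encoder.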

CNN + Pooling (e.g. [Pan 2016])
• ILSVRC CNN
• AlexNet, VGG-16, GoogLeNet, ResNet
• Mean Pooling
• FC
29
… …
……
[Pan 2016] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, “Jointly modeling embedding and translation to bridge video and
language,” CVPR 2016.
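
A minimal sketch of the mean pooling step, assuming each frame has already been passed through a CNN to give a fixed-length feature vector. The function name `mean_pool` and the toy feature values are hypothetical; real features from the networks listed above would have hundreds or thousands of dimensions.

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    if not frame_features:
        raise ValueError("need at least one frame")
    n_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(frame[d] for frame in frame_features) / n_frames
            for d in range(dim)]


# Two toy 3-dimensional "CNN features", one per frame (made-up values).
video_vector = mean_pool([[1.0, 2.0, 3.0],
                          [3.0, 4.0, 5.0]])
```

Mean pooling discards frame order, which is one reason the later slides turn to 3D-CNN and RNN encoders.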

3D-CNN (e.g. [Tran 2015])
•
• FC
•
30
… …
[Tran 2015] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D
convolutional networks,” ICCV 2015.

GRU
• A simplified LSTM with two gates: reset and update
33
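
For reference, one GRU step can be written out as below. This is a sketch of the standard formulation with a reset gate and an update gate, in pure Python; the parameter names (`Wz`, `Uz`, `bz`, and so on) and the convention `h_new = (1 - z) * h + z * candidate` are assumptions, and some papers swap the roles of `z` and `1 - z`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gate(W, U, b, x, h, activation):
    """Compute activation(W x + U h + b) elementwise."""
    return [activation(wx + uh + bias)
            for wx, uh, bias in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h, p):
    """One GRU step: input x, previous state h, parameter dict p."""
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h, sigmoid)  # update gate
    r = gate(p["Wr"], p["Ur"], p["br"], x, h, sigmoid)  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    candidate = gate(p["Wh"], p["Uh"], p["bh"], x, reset_h, math.tanh)
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, candidate)]


def zero_params(input_dim, hidden_dim):
    """All-zero weights, only useful for checking shapes and gate arithmetic."""
    p = {}
    for g in "zrh":
        p["W" + g] = [[0.0] * input_dim for _ in range(hidden_dim)]
        p["U" + g] = [[0.0] * hidden_dim for _ in range(hidden_dim)]
        p["b" + g] = [0.0] * hidden_dim
    return p


h_next = gru_step([1.0, 2.0, 3.0], [2.0, -4.0], zero_params(3, 2))
```

With all-zero weights both gates equal 0.5 and the candidate state is 0, so the new state is simply half the old one.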

RNN
34
•
•
Stacked convolutional GRU [Ballas 2016]
[Ballas 2016] N. Ballas, L. Yao, C. Pal, and A. Courville, “Delving deeper into convolutional networks for
learning video representations,” ICLR 2016.

RNN
•
•
35
Hierarchical RNN [Pan 2015]
[Pan 2015] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for video
representation with application to captioning,” CVPR 2015.

•
37
“A man is playing a keyboard.”
DNN DNN
Loss

(
•
38
“A man playing a guitar outside his house”
“A flock of zebras grazing.”
( ), ),

•
39
(
“A man playing a guitar outside his house”
“A flock of zebras grazing.”
( ), ),

:
• Play the keyboard vs Type the keyboard
40
keyboard
Query: “A man is playing a keyboard.”
keyboard keyboard

• Sentence: LSTM; Video: CNN + mean pooling
• Contrastive loss /
• LSTM
41
[Figure: semantic space. Sentence “A man is playing a keyboard”: LSTM. Video: CNN + mean pooling]
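
A minimal sketch of a contrastive loss on one video/sentence pair, assuming Euclidean distance in the shared space. The squared form, the `margin` default, and the function names are assumptions for illustration; the slide only names the loss.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(video_vec, sentence_vec, is_match, margin=1.0):
    """Pull matching pairs together; push non-matching pairs beyond the margin."""
    d = euclidean(video_vec, sentence_vec)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy 2-dimensional embeddings (made-up values).
positive = contrastive_loss([0.0, 0.0], [3.0, 4.0], is_match=True)
negative_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], is_match=False)
negative_near = contrastive_loss([0.0, 0.0], [0.0, 1.0], is_match=False, margin=2.0)
```

A non-matching pair that is already farther apart than the margin contributes no loss, so training effort goes to pairs that are still confusable.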

•
• CNN RNN
42
[Figure: network overview. Inputs: Video, Web images, Sentence (“A” “dog” “is” “eating” “watermelon” “.”). Components: CNN for Videos, CNN for Web Images, RNN for Sentences, Pooling, Fully-connected Layers, Loss]

43
“A child dances to the TV”
“A man is playing a guitar”
“A cat is hitting the keys on a piano”
• MS Video Description Corpus (# Clips 1970, # Text 85K)

[Otani 2016]
44
Query GoogLeNet+VS GoogLeNet+ALL2
(1) A man is playing a keyboard.
(2) Kids are playing in a pool.
(3) A man is trimming fat from a roast.
Query GoogLeNet+VI GoogLeNet+ALL2
(4) A boy is singing into a microphone.
[Otani 2016] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, N. Yokoya, “Learning joint representations of videos
and sentences with web image search,” ECCVW 2016.
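
Sentence-to-video retrieval like the examples above amounts to ranking videos by distance to the query in the joint space. The sketch below assumes the embeddings are already computed; the vectors, clip names, and the `retrieve` helper are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retrieve(query_vec, video_vecs, top_k=1):
    """Return the names of the top_k videos closest to the query embedding."""
    ranked = sorted(video_vecs,
                    key=lambda name: distance(query_vec, video_vecs[name]))
    return ranked[:top_k]


# Made-up 2-dimensional embeddings standing in for encoder outputs.
videos = {
    "keyboard_clip": [0.9, 0.1],
    "pool_clip": [0.1, 0.9],
}
query = [1.0, 0.0]  # stand-in for the embedding of a query sentence
best = retrieve(query, videos, top_k=1)
```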

‣ Storytelling
46

Take-home message
47
