https://github.com/mcv-m6-video/deepvideo-2019
The synchronization of the visual and audio tracks recorded in videos can be used as a supervisory signal for machine learning. This presentation reviews recent research on this topic that exploits the capabilities of deep neural networks.
17.
Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary
conversational speech recognition." ICASSP 2016.
Speech Encoding
RNN
CNN
Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew
Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
19.
Audio Decoding
Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua
Bengio. "SampleRNN: An unconditional end-to-end neural audio generation model." ICLR 2017.
RNN
CNN
Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew
Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
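To make the CNN-based audio decoding concrete, here is a minimal sketch of a WaveNet-style stack of dilated causal 1D convolutions; the channel counts, depth, and number of quantization levels are illustrative assumptions, not the published configuration.

```python
# Minimal sketch of a WaveNet-style stack of dilated causal 1D convolutions.
# Hyperparameters (channels, depth, quantization levels) are illustrative
# assumptions, not the published WaveNet configuration.
import torch
import torch.nn as nn


class CausalConv1d(nn.Module):
    """1D convolution that only looks at past samples (left padding)."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = dilation  # (kernel_size - 1) * dilation with kernel_size = 2
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):
        return self.conv(nn.functional.pad(x, (self.pad, 0)))


class TinyWaveNet(nn.Module):
    def __init__(self, channels=32, num_layers=8, quant_levels=256):
        super().__init__()
        self.input = nn.Conv1d(1, channels, kernel_size=1)
        # Dilations 1, 2, 4, ... double the receptive field at every layer.
        self.layers = nn.ModuleList(
            [CausalConv1d(channels, dilation=2 ** i) for i in range(num_layers)]
        )
        self.output = nn.Conv1d(channels, quant_levels, kernel_size=1)

    def forward(self, waveform):
        h = self.input(waveform)           # (batch, channels, time)
        for layer in self.layers:
            h = h + torch.tanh(layer(h))   # residual connection
        return self.output(h)              # logits over quantized sample values


logits = TinyWaveNet()(torch.randn(1, 1, 16000))  # 1 second at 16 kHz
print(logits.shape)  # torch.Size([1, 256, 16000])
```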
25.
Audio Feature Learning (self-supervised)
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Based on the assumption that ambient sound in video is related to the visual
semantics.
26.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Use videos to train a CNN that predicts the audio statistics of a frame.
Audio Feature Learning (self-supervised)
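A minimal sketch of this pretext task, assuming a standard torchvision backbone and a simple per-band audio energy statistic as the regression target; the paper's actual statistics, targets (which are clustered, see the following slides), and loss differ.

```python
# Sketch of "predict audio statistics from a frame". The target here is a
# hand-crafted statistic (mean energy per mel band of the surrounding audio);
# the paper's exact statistics and training setup differ, so treat this as an
# illustrative assumption.
import torch
import torch.nn as nn
import torchvision.models as models

NUM_BANDS = 32  # assumed number of frequency bands summarized per clip

backbone = models.resnet18(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_BANDS)  # regress stats

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

# One dummy training step: frames and their per-band audio energies.
frames = torch.randn(8, 3, 224, 224)       # video frames
audio_stats = torch.randn(8, NUM_BANDS)    # precomputed from the soundtrack

pred = backbone(frames)
loss = criterion(pred, audio_stats)
loss.backward()
optimizer.step()
```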
27.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Task: use the predicted audio statistics to cluster images. Audio clusters are built with k-means over the training set.
Cluster assignments at test time (one row = one cluster).
Audio Feature Learning (self-supervised)
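A sketch of the clustering step with scikit-learn, assuming 30 clusters and 32-dimensional statistics (both illustrative choices); the random arrays stand in for real training-set statistics and CNN predictions.

```python
# Sketch of the clustering step: k-means over audio statistics of the
# training set, then test images are grouped by their predicted cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_audio_stats = rng.normal(size=(10000, 32))  # stats per training clip
test_pred_stats = rng.normal(size=(100, 32))      # CNN predictions for test frames

kmeans = KMeans(n_clusters=30, n_init=10, random_state=0)
kmeans.fit(train_audio_stats)                     # audio clusters on the training set

test_clusters = kmeans.predict(test_pred_stats)   # one cluster id per test frame
for c in range(3):
    print(f"cluster {c}: frames {np.where(test_clusters == c)[0][:5]}")
```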
28.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Although the CNN was not trained with class labels, local units with semantic
meaning emerge.
Audio Feature Learning (self-supervised)
29.
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Video sonorization: Retrieve matching sounds for videos of people hitting objects
with a drumstick.
Audio Feature Learning (self-supervised)
30.
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
The Greatest Hits Dataset
Audio Feature Learning (self-supervised)
31.
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Audio clip retrieval (not end-to-end).
Audio Feature Learning (self-supervised)
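A sketch of what "not end-to-end" means here: rather than synthesizing a waveform, the predicted sound feature is matched against a bank of real recordings and the nearest clip is played back. The feature dimensionality and distance metric are assumptions.

```python
# Sketch of the retrieval step: the network predicts a sound feature for a
# video, and the closest clip from a bank of real recordings is returned
# instead of synthesizing a waveform end-to-end.
import numpy as np

rng = np.random.default_rng(0)
sound_bank = rng.normal(size=(5000, 128))     # features of real audio clips
predicted = rng.normal(size=(128,))           # feature predicted from a video

# L2 nearest neighbour over the bank of exemplar sounds.
distances = np.linalg.norm(sound_bank - predicted, axis=1)
best_clip = int(np.argmin(distances))
print(f"play clip #{best_clip} as the soundtrack for this video")
```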
32.
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
33.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Teacher network: visual recognition (objects & scenes)
Audio Feature Learning (label transfer)
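The label-transfer idea can be sketched as teacher-student distillation: a pretrained visual network provides soft class posteriors for a frame, and an audio network on the raw waveform is trained to match them. The student below is a toy 1D CNN, not SoundNet's architecture, and the teacher is an untrained ResNet stand-in.

```python
# Minimal sketch of SoundNet-style label transfer: a visual "teacher" produces
# class posteriors for a frame, and an audio "student" on the raw waveform is
# trained to match them with a KL-divergence loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

NUM_CLASSES = 1000  # ImageNet object classes (the paper also uses scene classes)

teacher = models.resnet18(weights=None).eval()  # stands in for the visual teacher

student = nn.Sequential(            # toy 1D CNN over the raw waveform
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=32, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, NUM_CLASSES),
)

frames = torch.randn(4, 3, 224, 224)     # frames from unlabeled videos
waveforms = torch.randn(4, 1, 22050)     # the corresponding audio

with torch.no_grad():
    teacher_probs = F.softmax(teacher(frames), dim=1)

student_logprobs = F.log_softmax(student(waveforms), dim=1)
loss = F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
loss.backward()
print(float(loss))
```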
34.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS
2016.
35.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Learned audio features are good for environmental sound recognition.
Audio Feature Learning (label transfer)
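The evaluation protocol is a linear probe: freeze the audio network, extract features from an intermediate layer, and train a simple classifier on a labeled benchmark such as ESC-50. A sketch with scikit-learn, using random arrays in place of real extracted features.

```python
# Sketch of the linear-probe evaluation on environmental sound recognition.
# The random arrays stand in for features extracted from the frozen network.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1600, 1024))    # frozen audio-network features
train_labels = rng.integers(0, 50, size=1600)  # 50 environmental sound classes
test_feats = rng.normal(size=(400, 1024))
test_labels = rng.integers(0, 50, size=400)

clf = LinearSVC(C=0.01, max_iter=5000).fit(train_feats, train_labels)
print("accuracy:", accuracy_score(test_labels, clf.predict(test_feats)))
```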
36.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Learned audio features are good for environmental sound recognition.
Audio Feature Learning (label transfer)
37.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio Feature Learning (label transfer)
38.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio Feature Learning (label transfer)
39.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the video frames that most strongly activate a neuron in a late layer (conv7).
Audio Feature Learning (label transfer)
40.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the video frames that most strongly activate a neuron in a late layer (conv7).
Audio Feature Learning (label transfer)
41.
Albanie, Samuel, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. "Emotion Recognition in Speech using Cross-Modal Transfer in the Wild."
ACM Multimedia 2018.
Teacher network: Facial Emotion Recognition (visual)
Audio Feature Learning (label transfer)
48.
Speech to Pixels
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
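A minimal sketch of the speech-conditioned generator, assuming a 128-dimensional speech embedding coming from an encoder that is not shown; the layer sizes and 64x64 output resolution are illustrative, not the paper's configuration.

```python
# Minimal sketch of a speech-conditioned face generator in the spirit of
# Wav2Pix: a speech embedding (a random vector standing in for the output of
# a speech encoder) is decoded into a face image.
import torch
import torch.nn as nn

class SpeechToFaceGenerator(nn.Module):
    def __init__(self, speech_dim=128):
        super().__init__()
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(speech_dim, 256, 4, 1, 0), nn.ReLU(),  # 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),         # 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),          # 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),           # 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),            # 64x64 RGB
        )

    def forward(self, speech_embedding):
        z = speech_embedding.view(speech_embedding.size(0), -1, 1, 1)
        return self.decode(z)

speech_embedding = torch.randn(2, 128)          # from a (not shown) speech encoder
faces = SpeechToFaceGenerator()(speech_embedding)
print(faces.shape)                              # torch.Size([2, 3, 64, 64])
```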
49.
Speech to Pixels
Generated faces from known identities.
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
50.
Speech to Pixels
Faces generated from averaged speech.
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
51.
Speech to Pixels
Interpolated faces generated from interpolated speech.
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
53.
Multimodal Retrieval
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for
Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
Figure: retrieval of the best match given an audio feature.
54.
Multimodal Retrieval
Figure: retrieval of the best match across visual and audio features.
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for
Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
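A sketch of retrieval in a joint embedding: audio and visual features are projected into a shared space, matches are ranked by cosine similarity, and training pulls matching pairs together. The projection sizes and the margin-based loss are assumptions standing in for the paper's objective.

```python
# Sketch of cross-modal retrieval in a joint embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_proj = nn.Linear(128, 64)   # maps audio features into the shared space
visual_proj = nn.Linear(512, 64)  # maps visual features into the shared space

audio_feats = torch.randn(100, 128)
visual_feats = torch.randn(100, 512)

a = F.normalize(audio_proj(audio_feats), dim=1)
v = F.normalize(visual_proj(visual_feats), dim=1)

# Retrieval: for one audio query, rank all visual items by cosine similarity.
query = a[0]
best_match = torch.argmax(v @ query)
print("best visual match for audio #0:", int(best_match))

# Training signal: pull matching pairs together, push others apart.
sim = a @ v.t()                                      # (100, 100) similarity matrix
pos = sim.diag()                                     # similarities of true pairs
margin_terms = F.relu(0.2 + sim - pos.unsqueeze(1))  # margin over negatives
mask = torch.eye(100, dtype=torch.bool)              # ignore the positives themselves
loss = margin_terms.masked_fill(mask, 0).mean()
print(float(loss))
```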
55.
Joint Feature Learning
Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised
Synchronization." NIPS 2018. #AVTS #selfsupervision
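The pretext task can be sketched as a contrastive synchronization problem: video and audio from the same time window form positive pairs, while temporally shifted audio from the same clip provides negatives. The encoders, input shapes, and margin below are toy assumptions, not the paper's models.

```python
# Sketch of the self-supervised audio-visual synchronization pretext task.
import torch
import torch.nn as nn
import torch.nn.functional as F

video_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 128))
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(1 * 4000, 128))

video = torch.randn(16, 3, 8, 32, 32)      # short video clips
audio_sync = torch.randn(16, 1, 4000)      # audio from the same time window
audio_shift = torch.randn(16, 1, 4000)     # audio shifted in time (negative)

v = F.normalize(video_encoder(video), dim=1)
a_pos = F.normalize(audio_encoder(audio_sync), dim=1)
a_neg = F.normalize(audio_encoder(audio_shift), dim=1)

# Contrastive loss: small distance for in-sync pairs, at least `margin`
# for out-of-sync pairs.
margin = 0.99
d_pos = (v - a_pos).pow(2).sum(dim=1)
d_neg = (v - a_neg).pow(2).sum(dim=1)
loss = (d_pos + F.relu(margin - d_neg)).mean()
loss.backward()
print(float(loss))
```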
56.
Joint Feature Learning
Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised
Synchronization." NIPS 2018. #AVTS #selfsupervision
57.
Joint Feature Learning
Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised
Synchronization." NIPS 2018. #AVTS #selfsupervision
58. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017. #selfsupervision
Joint Feature Learning
59. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Joint Feature Learning
Most activated units in the pool4 layer of the visual network.
60. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Joint Feature Learning
Visual features used to train a linear classifier on ImageNet.
61. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Joint Feature Learning
Most activated units in the pool4 layer of the audio network.
62. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Joint Feature Learning
Audio features achieve state-of-the-art performance.
65.
Sound Source Localization
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual
Scenes." CVPR 2018.
66.
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual
Scenes." CVPR 2018.
68.
Speech Separation with Vision (lips)
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement."
Interspeech 2018.
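A sketch of the enhancement idea: given the mixture spectrogram and per-frame lip features of the target speaker, a network predicts a soft mask that keeps only that speaker's speech. The shapes and the tiny network are assumptions; the actual model is far larger and also handles phase.

```python
# Sketch of mask-based audio-visual speech enhancement.
import torch
import torch.nn as nn

T, F_BINS, LIP_DIM = 100, 257, 512

mask_net = nn.Sequential(
    nn.Linear(F_BINS + LIP_DIM, 512), nn.ReLU(),
    nn.Linear(512, F_BINS), nn.Sigmoid(),   # per-frequency mask in [0, 1]
)

mixture_mag = torch.rand(1, T, F_BINS)      # magnitude spectrogram of the mix
lip_feats = torch.randn(1, T, LIP_DIM)      # per-frame features of the lips

mask = mask_net(torch.cat([mixture_mag, lip_feats], dim=-1))
enhanced_mag = mask * mixture_mag           # estimate of the target speaker
print(enhanced_mag.shape)
```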
73.
Chen, Lele, Zhiheng Li, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. "Lip Movements Generation at a Glance." ECCV
2018.
Visual Re-dubbing (pixels)
77.
Visual Re-dubbing (lip keypoints)
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.
78.
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
79.
Visual Re-dubbing (3D meshes)
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end
learning of pose and emotion." SIGGRAPH 2017
80.
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
81.
Outline
1. Motivation
2. Deep Neural Topologies
3. Multimedia Encoding and Decoding
4. Multimodal Architectures
a. Cross-modal
b. Joint Representations (Embeddings)
c. Multimodal inputs
83.
Speech2Signs: Video Generation (MSc thesis)
Video generation for sign language
R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee, “Learning to generate long-term future via
hierarchical prediction,” ICML 2017
Cai, Haoye, Chunyan Bai, Yu-Wing Tai, and Chi-Keung Tang. "Deep video generation, prediction and
completion of human action sequences." ECCV 2018. [demos]
Balakrishnan, Guha, Amy Zhao, Adrian V. Dalca, Fredo Durand, and John Guttag. "Synthesizing images
of humans in unseen poses." CVPR 2018.
84.
Speech2Signs: Unsupervised MT (MSc thesis)
Unsupervised Machine Translation between Spoken and Sign Language
Artetxe, Mikel, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. "Unsupervised neural machine
translation." ICLR 2018.
Lample, Guillaume, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. "Unsupervised
machine translation using monolingual corpora only." ICLR 2018.