https://github.com/mcv-m6-video/deepvideo-2019
The synchronization of the visual and audio tracks recorded in videos can be used as a supervisory signal for machine learning. This presentation reviews recent research on this topic that exploits the capabilities of deep neural networks.
17.
Chan, William, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. "Listen, attend and spell: A neural network for large vocabulary
conversational speech recognition." ICASSP 2016.
Speech Encoding
RNN
CNN
Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew
Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
19.
Audio Decoding
Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua
Bengio. "SampleRNN: An unconditional end-to-end neural audio generation model." ICLR 2017.
RNN
CNN
Oord, Aaron van den, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew
Senior, and Koray Kavukcuoglu. "Wavenet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
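To make the CNN-based audio decoding concrete, here is a minimal sketch of a WaveNet-style stack of dilated causal 1D convolutions; the channel counts, depth, and number of quantization levels are illustrative assumptions, not the published configuration.

```python
# Minimal sketch of a WaveNet-style stack of dilated causal 1D convolutions.
# Hyperparameters (channels, depth, quantization levels) are illustrative
# assumptions, not the published WaveNet configuration.
import torch
import torch.nn as nn


class CausalConv1d(nn.Module):
    """1D convolution that only looks at past samples (left padding)."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.pad = dilation  # (kernel_size - 1) * dilation with kernel_size = 2
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):
        return self.conv(nn.functional.pad(x, (self.pad, 0)))


class TinyWaveNet(nn.Module):
    def __init__(self, channels=32, num_layers=8, quant_levels=256):
        super().__init__()
        self.input = nn.Conv1d(1, channels, kernel_size=1)
        # Dilations 1, 2, 4, ... double the receptive field at every layer.
        self.layers = nn.ModuleList(
            [CausalConv1d(channels, dilation=2 ** i) for i in range(num_layers)]
        )
        self.output = nn.Conv1d(channels, quant_levels, kernel_size=1)

    def forward(self, waveform):
        h = self.input(waveform)           # (batch, channels, time)
        for layer in self.layers:
            h = h + torch.tanh(layer(h))   # residual connection
        return self.output(h)              # logits over quantized sample values


logits = TinyWaveNet()(torch.randn(1, 1, 16000))  # 1 second at 16 kHz
print(logits.shape)  # torch.Size([1, 256, 16000])
```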
25.
Audio Feature Learning (self-supervised)
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Based on the assumption that ambient sound in video is related to the visual
semantics.
26.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Use videos to train a CNN that predicts the audio statistics of a frame.
Audio Feature Learning (self-supervised)
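A minimal sketch of this pretext task, assuming a standard torchvision backbone and a simple per-band audio energy statistic as the regression target; the paper's actual statistics, targets (which are clustered, see the following slides), and loss differ.

```python
# Sketch of "predict audio statistics from a frame". The target here is a
# hand-crafted statistic (mean energy per mel band of the surrounding audio);
# the paper's exact statistics and training setup differ, so treat this as an
# illustrative assumption.
import torch
import torch.nn as nn
import torchvision.models as models

NUM_BANDS = 32  # assumed number of frequency bands summarized per clip

backbone = models.resnet18(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_BANDS)  # regress stats

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

# One dummy training step: frames and their per-band audio energies.
frames = torch.randn(8, 3, 224, 224)       # video frames
audio_stats = torch.randn(8, NUM_BANDS)    # precomputed from the soundtrack

pred = backbone(frames)
loss = criterion(pred, audio_stats)
loss.backward()
optimizer.step()
```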
27.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Task: use the predicted audio statistics to cluster images. Audio clusters are built with k-means over the training set.
Cluster assignments at test time (one row = one cluster).
Audio Feature Learning (self-supervised)
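A sketch of the clustering step with scikit-learn, assuming 30 clusters and 32-dimensional statistics (both illustrative choices); the random arrays stand in for real training-set statistics and CNN predictions.

```python
# Sketch of the clustering step: k-means over audio statistics of the
# training set, then test images are grouped by their predicted cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_audio_stats = rng.normal(size=(10000, 32))  # stats per training clip
test_pred_stats = rng.normal(size=(100, 32))      # CNN predictions for test frames

kmeans = KMeans(n_clusters=30, n_init=10, random_state=0)
kmeans.fit(train_audio_stats)                     # audio clusters on the training set

test_clusters = kmeans.predict(test_pred_stats)   # one cluster id per test frame
for c in range(3):
    print(f"cluster {c}: frames {np.where(test_clusters == c)[0][:5]}")
```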
28.
Owens, Andrew, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. "Ambient sound
provides supervision for visual learning." ECCV 2016
Although the CNN was not trained with class labels, local units with semantic
meaning emerge.
Audio Feature Learning (self-supervised)
29.
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Video sonorization: Retrieve matching sounds for videos of people hitting objects
with a drumstick.
Audio Feature Learning (self-supervised)
30.
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
The Greatest Hits Dataset
Audio Feature Learning (self-supervised)
31.
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman.
"Visually indicated sounds." CVPR 2016.
Audio clip retrieval (not end-to-end).
Audio Feature Learning (self-supervised)
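A sketch of what "not end-to-end" means here: rather than synthesizing a waveform, the predicted sound feature is matched against a bank of real recordings and the nearest clip is played back. The feature dimensionality and distance metric are assumptions.

```python
# Sketch of the retrieval step: the network predicts a sound feature for a
# video, and the closest clip from a bank of real recordings is returned
# instead of synthesizing a waveform end-to-end.
import numpy as np

rng = np.random.default_rng(0)
sound_bank = rng.normal(size=(5000, 128))     # features of real audio clips
predicted = rng.normal(size=(128,))           # feature predicted from a video

# L2 nearest neighbour over the bank of exemplar sounds.
distances = np.linalg.norm(sound_bank - predicted, axis=1)
best_clip = int(np.argmin(distances))
print(f"play clip #{best_clip} as the soundtrack for this video")
```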
32.
Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T.
Freeman. "Visually indicated sounds." CVPR 2016.
33.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Teacher network: visual recognition (objects & scenes)
Audio Feature Learning (label transfer)
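The label-transfer idea can be sketched as teacher-student distillation: a pretrained visual network provides soft class posteriors for a frame, and an audio network on the raw waveform is trained to match them. The student below is a toy 1D CNN, not SoundNet's architecture, and the teacher is an untrained ResNet stand-in.

```python
# Minimal sketch of SoundNet-style label transfer: a visual "teacher" produces
# class posteriors for a frame, and an audio "student" on the raw waveform is
# trained to match them with a KL-divergence loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

NUM_CLASSES = 1000  # ImageNet object classes (the paper also uses scene classes)

teacher = models.resnet18(weights=None).eval()  # stands in for the visual teacher

student = nn.Sequential(            # toy 1D CNN over the raw waveform
    nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=32, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(32, NUM_CLASSES),
)

frames = torch.randn(4, 3, 224, 224)     # frames from unlabeled videos
waveforms = torch.randn(4, 1, 22050)     # the corresponding audio

with torch.no_grad():
    teacher_probs = F.softmax(teacher(frames), dim=1)

student_logprobs = F.log_softmax(student(waveforms), dim=1)
loss = F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
loss.backward()
print(float(loss))
```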
34.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." NIPS
2016.
35.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Learned audio features are good for environmental sound recognition.
Audio Feature Learning (label transfer)
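The evaluation protocol is a linear probe: freeze the audio network, extract features from an intermediate layer, and train a simple classifier on a labeled benchmark such as ESC-50. A sketch with scikit-learn, using random arrays in place of real extracted features.

```python
# Sketch of the linear-probe evaluation on environmental sound recognition.
# The random arrays stand in for features extracted from the frozen network.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1600, 1024))    # frozen audio-network features
train_labels = rng.integers(0, 50, size=1600)  # 50 environmental sound classes
test_feats = rng.normal(size=(400, 1024))
test_labels = rng.integers(0, 50, size=400)

clf = LinearSVC(C=0.01, max_iter=5000).fit(train_feats, train_labels)
print("accuracy:", accuracy_score(test_labels, clf.predict(test_feats)))
```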
36.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Learned audio features are good for environmental sound recognition.
Audio Feature Learning (label transfer)
37.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio Feature Learning (label transfer)
38.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the 1D filters over raw audio in conv1.
Audio Feature Learning (label transfer)
39.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the video frames that most strongly activate a neuron in a late layer (conv7).
Audio Feature Learning (label transfer)
40.
Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video."
NIPS 2016.
Visualization of the video frames that most strongly activate a neuron in a late layer (conv7).
Audio Feature Learning (label transfer)
41.
Albanie, Samuel, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. "Emotion Recognition in Speech using Cross-Modal Transfer in the Wild."
ACM Multimedia 2018.
Teacher network: Facial Emotion Recognition (visual)
Audio Feature Learning (label transfer)
48.
Speech to Pixels
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
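A minimal sketch of the speech-conditioned generator, assuming a 128-dimensional speech embedding coming from an encoder that is not shown; the layer sizes and 64x64 output resolution are illustrative, not the paper's configuration.

```python
# Minimal sketch of a speech-conditioned face generator in the spirit of
# Wav2Pix: a speech embedding (a random vector standing in for the output of
# a speech encoder) is decoded into a face image.
import torch
import torch.nn as nn

class SpeechToFaceGenerator(nn.Module):
    def __init__(self, speech_dim=128):
        super().__init__()
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(speech_dim, 256, 4, 1, 0), nn.ReLU(),  # 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),         # 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),          # 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),           # 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),            # 64x64 RGB
        )

    def forward(self, speech_embedding):
        z = speech_embedding.view(speech_embedding.size(0), -1, 1, 1)
        return self.decode(z)

speech_embedding = torch.randn(2, 128)          # from a (not shown) speech encoder
faces = SpeechToFaceGenerator()(speech_embedding)
print(faces.shape)                              # torch.Size([2, 3, 64, 64])
```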
49.
Speech to Pixels
Generated faces from known identities.
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
50.
Speech to Pixels
Faces generated from averaged speech.
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
51.
Speech to Pixels
Interpolated faces generated from interpolated speech.
Amanda Duarte, Francisco Roldan, Miquel Tubau, Janna Escur, Santiago Pascual, Amaia Salvador, Eva Mohedano et al.
“Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks” ICASSP 2019
53.
Multimodal Retrieval
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for
Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
Figure: retrieval of the best match given an audio feature.
54.
Multimodal Retrieval
Figure: retrieval of the best match across visual and audio features.
Amanda Duarte, Dídac Surís, Amaia Salvador, Jordi Torres, and Xavier Giró-i-Nieto. "Cross-modal Embeddings for
Video and Audio Retrieval." ECCV Women in Computer Vision Workshop 2018.
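A sketch of retrieval in a joint embedding: audio and visual features are projected into a shared space, matches are ranked by cosine similarity, and training pulls matching pairs together. The projection sizes and the margin-based loss are assumptions standing in for the paper's objective.

```python
# Sketch of cross-modal retrieval in a joint embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_proj = nn.Linear(128, 64)   # maps audio features into the shared space
visual_proj = nn.Linear(512, 64)  # maps visual features into the shared space

audio_feats = torch.randn(100, 128)
visual_feats = torch.randn(100, 512)

a = F.normalize(audio_proj(audio_feats), dim=1)
v = F.normalize(visual_proj(visual_feats), dim=1)

# Retrieval: for one audio query, rank all visual items by cosine similarity.
query = a[0]
best_match = torch.argmax(v @ query)
print("best visual match for audio #0:", int(best_match))

# Training signal: pull matching pairs together, push others apart.
sim = a @ v.t()                                      # (100, 100) similarity matrix
pos = sim.diag()                                     # similarities of true pairs
margin_terms = F.relu(0.2 + sim - pos.unsqueeze(1))  # margin over negatives
mask = torch.eye(100, dtype=torch.bool)              # ignore the positives themselves
loss = margin_terms.masked_fill(mask, 0).mean()
print(float(loss))
```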
55.
Joint Feature Learning
Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised
Synchronization." NIPS 2018. #AVTS #selfsupervision
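The pretext task can be sketched as a contrastive synchronization problem: video and audio from the same time window form positive pairs, while temporally shifted audio from the same clip provides negatives. The encoders, input shapes, and margin below are toy assumptions, not the paper's models.

```python
# Sketch of the self-supervised audio-visual synchronization pretext task.
import torch
import torch.nn as nn
import torch.nn.functional as F

video_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 128))
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(1 * 4000, 128))

video = torch.randn(16, 3, 8, 32, 32)      # short video clips
audio_sync = torch.randn(16, 1, 4000)      # audio from the same time window
audio_shift = torch.randn(16, 1, 4000)     # audio shifted in time (negative)

v = F.normalize(video_encoder(video), dim=1)
a_pos = F.normalize(audio_encoder(audio_sync), dim=1)
a_neg = F.normalize(audio_encoder(audio_shift), dim=1)

# Contrastive loss: small distance for in-sync pairs, at least `margin`
# for out-of-sync pairs.
margin = 0.99
d_pos = (v - a_pos).pow(2).sum(dim=1)
d_neg = (v - a_neg).pow(2).sum(dim=1)
loss = (d_pos + F.relu(margin - d_neg)).mean()
loss.backward()
print(float(loss))
```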
56.
Joint Feature Learning
Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised
Synchronization." NIPS 2018. #AVTS #selfsupervision
57.
Joint Feature Learning
Korbar, Bruno, Du Tran, and Lorenzo Torresani. "Cooperative Learning of Audio and Video Models from Self-Supervised
Synchronization." NIPS 2018. #AVTS #selfsupervision
58. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017. #selfsupervision
Joint Feature Learning
59. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Joint Feature Learning
Most activated units in the pool4 layer of the visual network.
60. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Joint Feature Learning
Visual features used to train a linear classifier on ImageNet.
61. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Joint Feature Learning
Most activated units in the pool4 layer of the audio network.
62. Arandjelović, Relja, and Andrew Zisserman. "Look, Listen and Learn." ICCV 2017.
Joint Feature Learning
Audio features achieve state-of-the-art performance.
65.
Sound Source Localization
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual
Scenes." CVPR 2018.
66.
Senocak, Arda, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. "Learning to Localize Sound Source in Visual
Scenes." CVPR 2018.
68.
Speech Separation with Vision (lips)
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "The Conversation: Deep Audio-Visual Speech Enhancement."
Interspeech 2018.
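A sketch of the enhancement idea: given the mixture spectrogram and per-frame lip features of the target speaker, a network predicts a soft mask that keeps only that speaker's speech. The shapes and the tiny network are assumptions; the actual model is far larger and also handles phase.

```python
# Sketch of mask-based audio-visual speech enhancement.
import torch
import torch.nn as nn

T, F_BINS, LIP_DIM = 100, 257, 512

mask_net = nn.Sequential(
    nn.Linear(F_BINS + LIP_DIM, 512), nn.ReLU(),
    nn.Linear(512, F_BINS), nn.Sigmoid(),   # per-frequency mask in [0, 1]
)

mixture_mag = torch.rand(1, T, F_BINS)      # magnitude spectrogram of the mix
lip_feats = torch.randn(1, T, LIP_DIM)      # per-frame features of the lips

mask = mask_net(torch.cat([mixture_mag, lip_feats], dim=-1))
enhanced_mag = mask * mixture_mag           # estimate of the target speaker
print(enhanced_mag.shape)
```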
73.
Chen, Lele, Zhiheng Li, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. "Lip Movements Generation at a Glance." ECCV
2018.
Visual Re-dubbing (pixels)
77.
Visual Re-dubbing (lip keypoints)
Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. "Synthesizing Obama: learning lip sync from
audio." SIGGRAPH 2017.
78.
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
79.
Visual Re-dubbing (3D meshes)
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by joint end-to-end
learning of pose and emotion." SIGGRAPH 2017
80.
Karras, Tero, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. "Audio-driven facial animation by
joint end-to-end learning of pose and emotion." SIGGRAPH 2017
81.
Outline
1. Motivation
2. Deep Neural Topologies
3. Multimedia Encoding and Decoding
4. Multimodal Architectures
a. Cross-modal
b. Joint Representations (Embeddings)
c. Multimodal inputs
83.
Speech2Signs: Video Generation (MSc thesis)
Video generation for sign language
R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee, “Learning to generate long-term future via
hierarchical prediction,” ICML 2017
Cai, Haoye, Chunyan Bai, Yu-Wing Tai, and Chi-Keung Tang. "Deep video generation, prediction and
completion of human action sequences." ECCV 2018. [demos]
Balakrishnan, Guha, Amy Zhao, Adrian V. Dalca, Fredo Durand, and John Guttag. "Synthesizing images
of humans in unseen poses." CVPR 2018.
84.
Speech2Signs: Unsupervised MT (MSc thesis)
Unsupervised Machine Translation between Spoken and Sign Language
Artetxe, Mikel, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. "Unsupervised neural machine
translation." ICLR 2018.
Lample, Guillaume, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. "Unsupervised
machine translation using monolingual corpora only." ICLR 2018.