Speech-conditioned Face Generation using
Generative Adversarial Networks
Wav2Pix
1
Wav2Pix
MOTIVATION
● Audio and visual signals are the most common modalities used by humans to
identify other humans and sense their emotional state
● Features extracted from these two signals are often highly correlated
● Roldán et al. address this correlation by proposing a face synthesis method that uses exclusively the raw audio waveform as input
[Network diagram]
Francisco Roldán Sánchez, “Speech-conditioned face generation with deep adversarial networks”. Master in Computer Vision, 2018 2
Wav2Pix
MOTIVATION
Francisco Roldán Sánchez, “Speech-conditioned face generation with deep adversarial networks”
● The Roldán et al. model does not generalize to unseen speech
● The Inception Score metric used by Roldán et al. evaluates the images in terms of
quality but not in terms of realism
[Network diagram]
In this Project
Enhancement
● Search for the optimal input audio length
● Add an audio classifier on top of the speech embedding
● Increase the network capacity to generate 128x128 resolution
Evaluation
● Fréchet Inception Distance computation
● Accuracy of a fine-tuned VGGFace classifier
● A novel metric based on a face detection system
● An online survey assessed by humans
3
RELATED WORK
4
Wav2Pix
GENERATIVE MODELS
Image taken from https://blog.openai.com/generative-models/ 5
Wav2Pix
GENERATIVE ADVERSARIAL NETWORKS
Goodfellow, Ian, et al. "Generative adversarial nets." NIPS. 2014. 6
Wav2Pix
GENERATIVE ADVERSARIAL NETWORKS
Slide credit: Santi Pascual 7
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to
detect whether the money is real or fake.
100
100
FAKE! It’s not
even green
Wav2Pix
GENERATIVE ADVERSARIAL NETWORKS
8
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to
detect whether the money is real or fake.
100
100
FAKE! There’s
no watermark
Slide credit: Santi Pascual
Wav2Pix
GENERATIVE ADVERSARIAL NETWORKS
9
Imagine we have a counterfeiter (G) trying to make fake money, and the police (D) has to
detect whether the money is real or fake.
100
100
FAKE! Watermark
should be rounded
Slide credit: Santi Pascual
Wav2Pix
GENERATIVE ADVERSARIAL NETWORKS
10
After enough iterations:
?
Slide credit: Santi Pascual
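To make the counterfeiter/police analogy concrete, here is a minimal adversarial training step in PyTorch. This is a generic sketch: the toy generator, discriminator, data shape and hyperparameters are assumptions for illustration, not the Wav2Pix setup.

```python
# Minimal GAN training step: D learns to separate real from fake,
# G learns to fool D. Toy fully-connected networks, 784-d "images".
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())    # counterfeiter
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())   # police
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                                 # real: [B, 784]
    b = real.size(0)
    fake = G(torch.randn(b, 100))

    # Discriminator: accept real money (label 1), reject fakes (label 0)
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    d_loss.backward()
    opt_d.step()

    # Generator: try to fool D into labelling its fakes as real
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(b, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

print(train_step(torch.rand(16, 784)))
```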
Wav2Pix
CONDITIONED GANs
Mirza, Mehdi and Simon Osindero. “Conditional generative adversarial nets.” NIPS (2014). 11
In that case: a one-hot vector with the corresponding MNIST class
In our case: speech embedded in a 128-dimensional vector (conditioning sketched below)
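A minimal sketch of how a generator can be conditioned on a 128-dimensional speech embedding instead of a one-hot class vector. Only the 128-d conditioning vector comes from the slides; the layer widths and the 64x64 output are illustrative assumptions.

```python
# Speech-conditioned generator sketch: the 128-d embedding replaces the
# class one-hot as the conditioning input and is upsampled to an image.
import torch
import torch.nn as nn

class SpeechConditionedG(nn.Module):
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            # project the embedding to a 4x4 feature map ...
            nn.ConvTranspose2d(emb_dim, 256, kernel_size=4, stride=1, padding=0),
            nn.BatchNorm2d(256), nn.ReLU(True),
            # ... then upsample to 8x8, 16x16, 32x32, 64x64
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, speech_emb):                  # speech_emb: [B, 128]
        x = speech_emb.view(speech_emb.size(0), -1, 1, 1)
        return self.net(x)                          # [B, 3, 64, 64]

img = SpeechConditionedG()(torch.randn(8, 128))     # torch.Size([8, 3, 64, 64])
```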
Wav2Pix
SPEECH-CONDITIONED IMAGE SYNTHESIS
• Chung et al. presented a method for generating a video of a talking face starting from audio features and
an image of the target person (providing the identity)
• Suwajanakorn et al. focused on animating a point-based lip model to later synthesize high-quality
videos of President Barack Obama
• Karras et al. propose a model for driving 3D facial animation by audio input in real time and with low
latency
[Network diagrams; inputs/outputs include MFCC features, autocorrelation coefficients and the 3D vertex coordinates of a face model]
Chung, Joon Son, Amir Jamaludin, and Andrew Zisserman. "You said that?." BMVC 2017.
Supasorn Suwajanakorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman, “Synthesizing Obama: learning lip sync from audio,” ACM TOG, 2017.
Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen, “Audio-driven facial animation by joint end-to-end learning of pose and emotion,” ACM TOG, 2017.
12
Wav2Pix
SPEECH-CONDITIONED IMAGE SYNTHESIS
• Roldán et al. propose a deep neural network that is trained from scratch in an end-to-end fashion,
generating a face directly from the raw speech waveform, without any additional identity
information (e.g. a reference image or one-hot encoding)
13
[Network diagram]
Francisco Roldán Sánchez, “Speech-conditioned face generation with deep adversarial networks”
Francisco Roldán Sánchez, “Speech-conditioned face generation with deep adversarial networks”
Santiago Pascual, Antonio Bonafonte, and Joan Serrà, “Segan: Speech enhancement generative adversarial network,” Interspeech, 2017
Wav2Pix
SPEECH-CONDITIONED FACE GENERATION WITH DEEP GANs
LSGAN (objective sketched below) | 64x64 resolution | Dropout instead of noise input
14
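The LSGAN objective mentioned above replaces the usual cross-entropy with a least-squares loss against the real/fake target labels. A minimal sketch; the 0/1 targets and the 0.5 scaling follow the standard LSGAN formulation and are not details given on the slide.

```python
# Least-squares GAN losses: D outputs an unbounded score (no sigmoid),
# pushed towards 1 for real samples and 0 for generated ones.
import torch

def lsgan_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # real scores -> 1, fake scores -> 0
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # generator tries to make D score its samples as real (target 1)
    return 0.5 * ((d_fake - 1.0) ** 2).mean()

print(lsgan_d_loss(torch.randn(4, 1), torch.randn(4, 1)).item())
```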
DATASET
15
Wav2Pix
PREVIOUS DATASET
Good recording hardware
Wide range of emotions
Large amount of frontal
faces
youtubers_v1
16
Wav2Pix
PREVIOUS DATASET
drawbacks
● Imbalanced dataset: among the 62 youtubers, the number of image/audio pairs varies
between 52 and 2669
● Notable amount of false positives
true identity | false positives
● Most of the speech frames were noisy
○ Background music added during post-production editing
○ Voice of a third person
17
Wav2Pix
DATASET
youtubers_v2 - new dataset collection
youtubers_v2_dataaugmented - data augmentation (augmentation step sketched below)
[Diagram: ~600 four-second clips for each of the 10 IDs (600 x 10); each 4 s clip is cropped into five 1 s chunks (x5), giving ~30k one-second speech chunks]
18
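A sketch of this augmentation step: each 4-second clip is cut into five 1-second chunks. The 16 kHz sampling rate and the evenly spaced, slightly overlapping crop positions are assumptions; the slide only fixes the 4 s to 5 x 1 s split.

```python
# Cut a 4 s waveform into five 1 s chunks (evenly spaced start points).
import numpy as np

def crop_chunks(wav: np.ndarray, sr: int = 16000, n_chunks: int = 5,
                chunk_s: float = 1.0) -> list:
    chunk = int(chunk_s * sr)
    # evenly spaced start points covering the whole clip (may overlap)
    starts = np.linspace(0, len(wav) - chunk, n_chunks).astype(int)
    return [wav[s:s + chunk] for s in starts]

clip = np.random.randn(4 * 16000)     # stand-in for a 4 s waveform
chunks = crop_chunks(clip)            # five arrays of 16000 samples each
print(len(chunks), chunks[0].shape)
```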
Wav2Pix
DATASET
19
[Sample image grid: Roldán | Ours]
ARCHITECTURE
20
Wav2Pix
ARCHITECTURE
[Architecture diagram labels: end-to-end; SEGAN; decoder + classifier; 128 (speech embedding dimension); 64 x 64 | 128 x 128 (output resolution)]
D loss: real image loss + fake image loss + wrong image loss
G loss: fake image loss + softmax loss + customized regularizers (discriminator terms sketched below)
21
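A sketch of the three discriminator terms listed above, in the matching-aware style common to conditional GANs: a real image paired with its matching speech embedding should be accepted, a generated image rejected, and a real image paired with the wrong speech rejected. Least-squares (LSGAN) targets are used for consistency with the earlier slide; the dummy discriminator is a placeholder.

```python
import torch

def d_loss(D, real_img, fake_img, wrong_img, speech_emb):
    # real pair -> 1, generated image -> 0, mismatched pair -> 0
    real_loss = ((D(real_img, speech_emb) - 1.0) ** 2).mean()
    fake_loss = (D(fake_img.detach(), speech_emb) ** 2).mean()
    wrong_loss = (D(wrong_img, speech_emb) ** 2).mean()
    return real_loss + fake_loss + wrong_loss

# toy usage with a dummy discriminator scoring (image, embedding) pairs
D = lambda img, emb: img.flatten(1).mean(1, keepdim=True) + emb.mean(1, keepdim=True)
loss = d_loss(D, torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64),
              torch.rand(4, 3, 64, 64), torch.randn(4, 128))
print(loss.item())
```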
Wav2Pix
ARCHITECTURE
contributions
● Audio segmentation module
● Speech classifier
○ 1-hidden-layer NN with 10 output units
● Additional convolutional and deconvolutional layers (see the sketch below)
○ Kernel size: 4
○ Stride: 2
○ Padding: 1
22
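A sketch of these contributions: a convolutional block with kernel size 4, stride 2 and padding 1 (which halves the spatial resolution) and a 1-hidden-layer speech classifier with 10 output units. The channel counts and hidden width are illustrative assumptions.

```python
# Extra conv block (k=4, s=2, p=1) and the 10-way speech classifier head.
import torch
import torch.nn as nn

extra_conv = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 128x128 -> 64x64
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2, inplace=True),
)

speech_classifier = nn.Sequential(
    nn.Linear(128, 64),   # single hidden layer (width assumed)
    nn.ReLU(inplace=True),
    nn.Linear(64, 10),    # 10 output units, one per identity
)

x = torch.randn(2, 64, 128, 128)
print(extra_conv(x).shape)                            # torch.Size([2, 128, 64, 64])
print(speech_classifier(torch.randn(2, 128)).shape)   # torch.Size([2, 10])
```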
EVALUATION
23
Wav2Pix
EVALUATION
Fréchet Inception Distance
● Inception-v3 network pre-trained on ImageNet
● Results are not consistent with human judgements
● Too little data is available to obtain reliable estimates
● The measure relies on an ImageNet-pretrained Inception network, which is far from
ideal for face datasets (computation sketched below)
24
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, “GANs trained by a two time-scale update rule converge to a
local Nash equilibrium,” in Advances in Neural Information Processing Systems, 2017, pp. 6626-6637.
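For reference, a minimal FID computation given two sets of Inception-v3 activations; extracting the activations (e.g. the 2048-d pooling features) is assumed to be done elsewhere.

```python
# FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2*(C_r C_f)^(1/2))
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)   # matrix square root
    covmean = covmean.real          # drop tiny imaginary parts from numerics
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

print(fid(np.random.randn(500, 64), np.random.randn(500, 64)))  # toy features
```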
Wav2Pix
EVALUATION
VGGFace fine-tuned classifier
● Network proposed by the Visual Geometry Group, Department of Engineering
Science, University of Oxford
● Our model improves on identity preservation
● Bearing in mind that the metric is sensitive to image quality and that the
chance-level probability of confusion is 90% (10 identities), the results are
promising (fine-tuning setup sketched below)
25
Roldán | Ours
Real data: 100 | 100
Generated data for seen speech: 56.34 | 76.81
Generated data for unseen speech: 16.69 | 50.08
O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conference, 2015.
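A sketch of the fine-tuning setup: freeze a VGG backbone and retrain only a new 10-way identity head. torchvision's ImageNet VGG-16 is used here as a stand-in; the actual experiment fine-tunes VGGFace weights, which would be loaded separately.

```python
# Replace the last layer of a frozen VGG backbone with a 10-class identity head.
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)  # stand-in for VGGFace
for p in vgg.parameters():
    p.requires_grad = False                 # freeze the backbone
vgg.classifier[6] = nn.Linear(4096, 10)     # 10 youtuber identities

# Train only vgg.classifier[6] with cross-entropy on real faces, then report
# classification accuracy on generated faces for seen / unseen speech.
```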
Wav2Pix
EVALUATION
Facial Landmark Detection ratio
● Robust to image quality
● About 90% of the images generated by our
model for unseen speech can be
considered faces (detection-ratio sketch below)
26
Roldán | Ours
Real data: 75.02 | 72.48
Generated data for seen speech: 61.76 | 84.45
Generated data for unseen speech: 60.81 | 90.25
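A sketch of the face-detection-based metric: the percentage of images in which an off-the-shelf detector finds a face. dlib's frontal face detector is an assumption here; the slides only state that a face detection system is used.

```python
# Ratio of images in which a face detector fires at least once.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()

def face_detection_ratio(images) -> float:
    # images: list of HxWx3 uint8 RGB arrays
    hits = sum(1 for img in images if len(detector(img, 1)) > 0)
    return 100.0 * hits / len(images)

blank = [np.zeros((128, 128, 3), dtype=np.uint8) for _ in range(4)]
print(face_detection_ratio(blank))   # 0.0 for blank images
```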
Wav2Pix
EVALUATION
Online survey
● 42 people were asked to answer 2 questions for 32 different pairs of images
● The results are not reliable
● This metric should be further improved
MOS: 2.09
% YES: 34 | % NO: 52 | % NOT SURE: 14
28
EXPERIMENTS
29
Wav2Pix
EXPERIMENTS
datasets comparison
Facial landmark detection ratio (%)
Best quality images manually selected
The following experiments have been
performed with this dataset
30
Wav2Pix
EXPERIMENTS
input audio length
Fine-tuned VGGFace classifier accuracy (%) w.r.t. the audio length (in seconds)
Best-quality images manually selected w.r.t. the audio length
The following experiments have been
performed with this length 31
Wav2Pix
EXPERIMENTS
input audio length
[Generated samples; true identity: Jaime Altozano | true identity: Mely]
The more voiced frames the audio contains, the easier it is
for the network to learn the identity
32
Wav2Pix
EXPERIMENTS
image resolution
The following experiments have been
performed with 128x128 image resolution
33
Wav2Pix
EXPERIMENTS
identity classifier
Fine-tuned VGGFace classifier accuracy (%) for unseen speech, w.r.t. the
model
Roldán: 16.69 | Ours: 50.08
The Roldán result is close to random chance (10%)!
34
Wav2Pix
EXPERIMENTS
image generation for unseen voice
35
[Generated samples for two unseen audio inputs: my voice | piano music (1st and 2nd attempts)]
The network does not generalize to unseen IDs!
Wav2Pix
EXPERIMENTS
audio interpolation
0.9×audio_mely+0.1×audio_jaime 0.4×audio_mely+0.6×audio_jaime
The network does not generate faces for audio that does
not contain a distinguishable voice: the model has
learned to identify speech in audio frames (mixing sketched below)
36
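A sketch of the interpolation experiment: two raw waveforms of equal length are mixed with weights summing to 1 and the mixture is fed to the generator. The variable names follow the slide; the waveforms and sampling rate are placeholders.

```python
# Weighted mix of two raw waveforms: w*mely + (1-w)*jaime.
import numpy as np

def mix(audio_mely: np.ndarray, audio_jaime: np.ndarray, w: float) -> np.ndarray:
    return w * audio_mely + (1.0 - w) * audio_jaime

audio_mely = np.random.randn(16000)    # stand-in 1 s waveforms
audio_jaime = np.random.randn(16000)
mostly_mely = mix(audio_mely, audio_jaime, 0.9)    # 0.9*mely + 0.1*jaime
mostly_jaime = mix(audio_mely, audio_jaime, 0.4)   # 0.4*mely + 0.6*jaime
```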
Wav2Pix
EXPERIMENTS
image generation for audio averages
The model generalizes well to unseen
speech from seen identities
37
CONCLUSIONS
38
Wav2Pix
CONCLUSIONS
Compared to the Roldán et al. network, our contributions allow the final model to:
• Generate images of higher quality due to the network’s capacity increase
• Generate more face-looking images for unseen speech
• Preserve the identity better for unseen speech
• Obtain better results with a smaller dataset (~70% smaller in terms of memory size)
• Obtain results that can be evaluated in terms of quality, face appearance and identity preservation with
three different metrics
However,
• No generalization is achieved for unseen IDs
• The dataset needs to be very clean in order to obtain notable results. The building process is very
time-consuming
39