Day 4 Lecture 3
Language and Vision
Xavier Giró-i-Nieto
2
Acknowledgments
Santi Pascual
3
In lecture D2L6 RNNs...
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for
statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
Language IN
Language OUT
4
Motivation
5
Much earlier than lecture D2L6 RNNs...
Neco, R.P. and Forcada, M.L., 1997, June. Asynchronous translations with recurrent neural nets. In
Neural Networks, 1997., International Conference on (Vol. 4, pp. 2535-2540). IEEE.
6
Encoder-Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Representation or
Embedding
For clarity, let’s study a Neural Machine Translation (NMT) case:
7
Encoder: One-hot encoding
One-hot encoding: binary representation of the words in a
vocabulary in which only codes with a single hot (1) bit
and all other bits cold (0) are allowed.
Word     Binary   One-hot encoding
zero     00       0001
one      01       0010
two      10       0100
three    11       1000
8
Encoder: One-hot encoding
Natural language words can also be one-hot encoded as a
vector whose dimensionality equals the size of the dictionary (K).
Word One-hot encoding
economic 000010...
growth 001000...
has 100000...
slowed 000001...
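A minimal sketch of one-hot encoding for words, assuming a toy four-word vocabulary. The helper name `one_hot` and the index assigned to each word are illustrative choices, not taken from the slides.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    vec = [0] * size
    vec[index] = 1
    return vec

# Toy vocabulary (assumption): each word's index is its position in this list.
vocab = ["economic", "growth", "has", "slowed"]
word_to_index = {word: i for i, word in enumerate(vocab)}

sentence = ["economic", "growth", "has", "slowed"]
encoded = [one_hot(word_to_index[w], len(vocab)) for w in sentence]

for word, vec in zip(sentence, encoded):
    print(word, vec)
```

With a real dictionary the vectors have K entries, so they are long and almost entirely zero.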
Encoder: One-hot encoding
One-hot is a very simple representation: every word is
equidistant from every other word.
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
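The equidistance claim can be checked numerically. The sketch below assumes Euclidean distance and a toy vocabulary size of 6; both are choices made here for illustration.

```python
import itertools
import math

def one_hot(index, size):
    """One-hot vector with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

K = 6  # toy vocabulary size (assumption)
vectors = [one_hot(i, K) for i in range(K)]

# Collect the distance of every pair of distinct words into a set.
distances = {euclidean(a, b) for a, b in itertools.combinations(vectors, 2)}
print(distances)
```

The set holds a single value, the square root of 2, so the encoding carries no notion of one word being closer in meaning to another.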
10
Encoder: Projection to continuous space
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
s_i = E w_i
The one-hot vector w_i (dimension K) is linearly projected to a space of
lower dimension M (typically 100-500) by a matrix E of learned weights.
11
Encoder: Projection to continuous space
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
s_i = E w_i
The projection matrix E corresponds to a fully connected layer, so
its parameters are learned during training.
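A minimal sketch of the linear projection, assuming the notation s = E w with E of shape M x K. The sizes and the random values of E are toy stand-ins for learned weights.

```python
import random

K = 5  # vocabulary size (toy value, assumption)
M = 3  # embedding dimension (toy value; the slides mention 100-500 in practice)

random.seed(0)
# E has M rows and K columns. In a real model it is learned; here it is random.
E = [[random.uniform(-1.0, 1.0) for _ in range(K)] for _ in range(M)]

def one_hot(index, size):
    """One-hot vector with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def project(matrix, w):
    """Compute s = E w as a plain matrix-vector product."""
    return [sum(e * x for e, x in zip(row, w)) for row in matrix]

word_index = 2
s = project(E, one_hot(word_index, K))
column = [row[word_index] for row in E]

print(s)
print(column)
```

Multiplying by a one-hot vector selects one column of E, which is why this layer is usually implemented as a table lookup rather than a full matrix product.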
12
Encoder: Projection to continuous space
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Sequence of continuous-space word representations
Sequence of words
13
Encoder: Recurrence
Sequence
Figure: Christopher Olah, “Understanding LSTM Networks” (2015)
14
Encoder: Recurrence
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
15
Encoder: Recurrence
[Figure: front view and side view of the recurrence over time, related by a 90° rotation]
16
Encoder: Recurrence
[Figure: front view and side view of the recurrence, related by a 90° rotation]
Representation or embedding of the sentence
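A minimal sketch of how a recurrent encoder turns a sequence of word vectors into one sentence embedding. It assumes a plain tanh RNN with random weights. The slides do not fix the cell type, and an LSTM or GRU would follow the same loop.

```python
import math
import random

random.seed(0)
M = 3  # word-embedding dimension (toy value)
H = 4  # hidden-state dimension (toy value)

def random_matrix(rows, cols):
    """Random weights standing in for learned parameters."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

W = random_matrix(H, M)  # input-to-hidden weights
U = random_matrix(H, H)  # hidden-to-hidden weights

def rnn_step(h_prev, s):
    """One recurrence step: h_t = tanh(W s_t + U h_{t-1})."""
    pre = [a + b for a, b in zip(matvec(W, s), matvec(U, h_prev))]
    return [math.tanh(p) for p in pre]

def encode(sequence):
    """Run the RNN over the sequence and return the final hidden state."""
    h = [0.0] * H
    for s in sequence:
        h = rnn_step(h, s)
    return h

# Four made-up word vectors standing in for an embedded sentence.
sentence = [[random.uniform(-1.0, 1.0) for _ in range(M)] for _ in range(4)]
h_T = encode(sentence)
print(h_T)
```

The final state has a fixed size regardless of sentence length, which is what lets it serve as the sentence representation handed to the decoder.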
17
Sentence Embedding
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014
Clusters by meaning appear in a 2-dimensional PCA projection of the
LSTM hidden states.
18
(Word Embeddings)
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and
phrases and their compositionality." In Advances in neural information processing systems, pp. 3111-3119. 2013.
19
Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
RNN’s internal state z_i depends on: sentence embedding h_T,
previous word u_{i-1} and previous internal state z_{i-1}.
20
Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
With z_i ready, we can score each word k in the vocabulary with
a dot product...
[Equation labels: RNN internal state; neuron weights for word k]
21
Decoder
Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum
Mutual Information Estimation of Parameters." NIPS 1989
...and finally normalize to word probabilities with a softmax.
[Equation labels: score for word k; probability that the i-th word is word k; previous words; hidden state]
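The scoring and normalization steps can be sketched as follows. The state `z_i` and the per-word weight vectors are made-up toy values, and the function names are illustrative.

```python
import math

def scores(z, weights):
    """Score of word k: dot product of its weight vector with the state z."""
    return [sum(w * x for w, x in zip(w_k, z)) for w_k in weights]

def softmax(values):
    """Normalize scores to probabilities; subtracting the max avoids overflow."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

z_i = [0.5, -1.0, 2.0]  # decoder internal state (toy values)
weights = [             # one weight vector per vocabulary word (toy values)
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

e = scores(z_i, weights)
p = softmax(e)
print(e)
print(p)
```

The word with the highest score also gets the highest probability; the softmax only rescales the scores so that they are positive and sum to one.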
22
Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
More words for the decoded sentence are generated until a
<EOS> (End Of Sentence) “word” is predicted.
EOS
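A minimal sketch of the generation loop. It assumes greedy selection (always taking the most probable word), which the slides do not specify, and it replaces the trained decoder with a hypothetical lookup table of made-up probabilities.

```python
EOS = "<EOS>"

def greedy_decode(next_word_probs, max_len=20):
    """Pick the most probable word at each step until <EOS> or max_len is reached."""
    words = []
    while len(words) < max_len:
        probs = next_word_probs(words)  # dict: word -> probability
        word = max(probs, key=probs.get)
        if word == EOS:
            break
        words.append(word)
    return words

# Hypothetical stand-in for a trained decoder: one distribution per position.
TABLE = [
    {"la": 0.7, "croissance": 0.2, EOS: 0.1},
    {"croissance": 0.8, "la": 0.1, EOS: 0.1},
    {"ralentit": 0.6, EOS: 0.4},
    {EOS: 0.9, "la": 0.1},
]

def toy_model(prefix):
    """Return the distribution for the next word given the words generated so far."""
    return TABLE[min(len(prefix), len(TABLE) - 1)]

print(greedy_decode(toy_model))
```

The `max_len` cap is a safeguard for the case where the model never predicts the end-of-sentence token.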
23
Encoder-Decoder
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
24
Encoder-Decoder: Training
Training uses a dataset of sentence pairs in the two languages to translate between.
Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for
statistical machine translation." EMNLP 2014.
25
Encoder-Decoder: Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks."
NIPS 2014.
26
Encoder-Decoder: Beyond text
27
Captioning: DeepImageSent
(Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating
image descriptions." CVPR 2015
28
Captioning: DeepImageSent
(Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for
generating image descriptions." CVPR 2015
Multimodal Recurrent Neural Network: only takes into account
image features in the first hidden state.
29
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
30
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
31
Captioning: LSTM for image & video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko,
Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015.
32
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for
dense captioning." CVPR 2016
Captioning (+ Detection): DenseCap
33
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for
dense captioning." CVPR 2016
34
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for
dense captioning." CVPR 2016
XAVI: “man has short hair”, “man with short hair”
AMAIA: “a woman wearing a black shirt”
BOTH: “two men wearing black glasses”
35
Captioning (+ Retrieval): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for
dense captioning." CVPR 2016
36
Captioning: HRNE
(Slides by Marc Bolaños): Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical
Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
[Figure labels: LSTM unit (2nd layer); time from t = 1 to t = T; image; hidden state at t = T; first chunk of data]
37
Visual Question Answering
[z_1, z_2, …, z_N] [y_1, y_2, …, y_M]
“Is economic growth decreasing?”
“Yes”
[Figure labels: Encode, Encode, Decode]
38
[Figure blocks: Extract visual features; Embedding; Merge; Predict answer]
Question: What object is flying?
Answer: Kite
Visual Question Answering
Slide credit: Issey Masuda
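The pipeline above can be sketched in a few lines. This toy version assumes the merge step is a concatenation and the prediction step is a linear layer followed by a softmax; all vectors, weights and candidate answers are made up for illustration.

```python
import math

def softmax(values):
    """Normalize scores to probabilities; subtracting the max avoids overflow."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict_answer(visual, question, weights, answers):
    """Merge by concatenation, score each answer with a dot product, return the best."""
    merged = visual + question  # list concatenation (the assumed merge step)
    raw = [sum(w * x for w, x in zip(row, merged)) for row in weights]
    probs = softmax(raw)
    best = max(range(len(answers)), key=probs.__getitem__)
    return answers[best], probs

visual = [0.9, 0.1]    # made-up visual features
question = [0.2, 0.8]  # made-up question embedding
answers = ["kite", "dog"]
weights = [            # one row of weights per candidate answer (toy values)
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]

answer, probs = predict_answer(visual, question, weights, answers)
print(answer)
```

Other merge operations, such as an element-wise product or sum, fit the same structure as long as both inputs are first projected to the same dimension.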
39
Visual Question Answering
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter
prediction. CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering: Dynamic
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks
for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).
41
Visual Question Answering: Dynamic
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks
for Visual and Textual Question Answering." ICML 2016.
Main idea: split the image into local regions and consider each region
equivalent to a sentence.
Local region feature extraction with a CNN (VGG-19):
(1) Rescale the input to 448x448.
(2) Take the output of the last pooling layer → D = 512x14x14 →
196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features to
the textual “q” space.
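The reshaping in step (2) can be sketched without any deep learning library. The feature map below is dummy data whose values only encode their own position; a real one would come from the CNN.

```python
C, H, W = 512, 14, 14  # channels, height and width of the last pooling layer output

def to_region_vectors(feature_map):
    """Turn a C x H x W nested list into H*W vectors of length C, one per spatial cell."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]

# Dummy feature map: value = 1000 * channel + flat spatial index (for checking only).
feature_map = [
    [[c * 1000 + y * W + x for x in range(W)] for y in range(H)]
    for c in range(C)
]

regions = to_region_vectors(feature_map)
print(len(regions), len(regions[0]))
```

Each of the 14 x 14 = 196 spatial cells becomes one 512-dimensional vector, which then plays the role that a sentence vector plays in the textual setting.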
42
Visual Question Answering: Grounded
(Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded
Question Answering in Images." CVPR 2016.
43
Datasets: Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen
et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations."
arXiv preprint arXiv:1602.07332 (2016).
44
Datasets: Microsoft SIND
Microsoft SIND
45
Challenge: Microsoft COCO
Captioning
46
Challenge: Storytelling
Storytelling
47
Challenge: Movie Description
Movie Description, Retrieval and Fill-in-the-blank
48
Challenges: Movie Question Answering
Movie Question Answering
49
Challenges: Visual Question Answering
Visual Question Answering
50
Method                              Accuracy
Humans                              83.30%
UC Berkeley & Sony                  66.47%
Baseline LSTM & CNN                 54.06%
Baseline nearest neighbor           42.85%
Baseline prior per question type    37.47%
Baseline all yes                    29.88%
(label not recovered)               53.62%
I. Masuda-Mora, “Open-Ended Visual Question-Answering”. Submitted as BSc ETSETB thesis.
[clean code in Keras, perfect for beginners !]
Challenges: Visual Question Answering
51
Summary
● Projecting language and vision into semantic embeddings
allows learning from their fusion.
● Very high interest among researchers. Great topic for your
thesis.
● Will the vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test? How to evaluate AI’s image
understanding?
Slide credit: Issey Masuda
53
Learn more
Julia Hockenmaier
54
Thanks! Q&A?
Follow me at
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi
/ProfessorXavi

Deep Learning for Computer Vision: Language and vision (UPC 2016)

  • 1. Day 4 Lecture 3 Language and Vision Xavier Giró-i-Nieto
  • 3. 3 In lecture D2L6 RNNs... Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014). Language IN Language OUT
  • 5. 5 Much earlier than lecture D2L6 RNNs... Neco, R.P. and Forcada, M.L., 1997, June. Asynchronous translations with recurrent neural nets. In Neural Networks, 1997., International Conference on (Vol. 4, pp. 2535-2540). IEEE.
  • 6. 6 Encoder-Decoder Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Representation or Embedding For clarity, let’s study a Neural Machine Translation (NMT) case:
  • 7. 7 Encoder: One-hot encoding One-hot encoding: Binary representation of the words in a vocabulary, where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed. Word Binary One-hot encoding zero 00 0000 one 01 0010 two 10 0100 three 11 1000
  • 8. 8 Encoder: One-hot encoding Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K). Word One-hot encoding economic 000010... growth 001000... has 100000... slowed 000001...
  • 9. Encoder: One-hot encoding One-hot is a very simple representation: every word is equidistant from every other word. Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
  • 10. 10 Encoder: Projection to continious space Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) siM Wi E The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights. K K
  • 11. 11 Encoder: Projection to continious space Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) siM Wi Projection matrix E corresponds to a fully connected layer, so its parameters will be learned with a training process. K
  • 12. 12 Encoder: Projection to continious space Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Sequence of continious-space word representations Sequence of words
  • 13. 13 Encoder: Recurrence Sequence Figure: Cristopher Olah, “Understanding LSTM Networks” (2015)
  • 14. 14 Encoder: Recurrence Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
  • 16. 16 Encoder: Recurrence Front View Rotation 90o Side View Representation or embedding of the sentence
  • 17. 17 Sentence Embedding Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014 Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
  • 18. 18 (Word Embeddings) Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in neural information processing systems, pp. 3111-3119. 2013.
  • 19. 19 Decoder Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) RNN’s internal state zi depends on: sentence embedding ht , previous word ui-1 and previous internal state zi-1 .
  • 20. 20 Decoder Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) With zi ready, we can score each word k in the vocabulary with a dot product... RNN internal state Neuron weights for word k
  • 21. 21 Decoder Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989 ...and finally normalize to word probabilities with a softmax. Score for word k Probability that the ith word is word k Previous words Hidden state
  • 22. 22 Decoder Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) More words for the decoded sentence are generated until a <EOS> (End Of Sentence) “word” is predicted. EOS
  • 23. 23 Encoder-Decoder Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
  • 24. 24 Encoder-Decoder: Training Dataset of pairs of sentences in the two languages to translate. Cho, Kyunghyun, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
  • 25. 25 Encoder-Decoder: Seq2Seq Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
  • 27. 27 Captioning: DeepImageSent (Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
  • 28. 28 Captioning: DeepImageSent (Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015. Multimodal Recurrent Neural Network: only takes into account image features in the first hidden state.
  • 29. 29 Captioning: Show & Tell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
  • 30. 30 Captioning: Show & Tell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
  • 31. 31 Captioning: LSTM for image & video Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code
  • 32. 32 Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016 Captioning (+ Detection): DenseCap
  • 33. 33 Captioning (+ Detection): DenseCap Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016
  • 34. 34 Captioning (+ Detection): DenseCap Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016. XAVI: “man has short hair”, “man with short hair”. AMAIA: “a woman wearing a black shirt”. BOTH: “two men wearing black glasses”.
  • 35. 35 Captioning (+ Retrieval): DenseCap Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016
  • 36. 36 Captioning: HRNE (Slides by Marc Bolaños): Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning, CVPR 2016. Figure labels: LSTM unit (2nd layer); Time; Image; t = 1; t = T; hidden state at t = T; first chunk of data.
  • 37. 37 Visual Question Answering [z1 , z2 , … zN ] [y1 , y2 , … yM ] “Is economic growth decreasing ?” “Yes” Encode Encode Decode
  • 38. 38 Visual Question Answering Pipeline stages: Extract visual features; Embedding; Merge; Predict answer. Question: “What object is flying?” Answer: “Kite”. Slide credit: Issey Masuda
  • 39. 39 Visual Question Answering Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016 Dynamic Parameter Prediction Network (DPPnet)
  • 40. 40 Visual Question Answering: Dynamic (Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).
  • 41. 41 Visual Question Answering: Dynamic (Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016. Main idea: split image into local regions. Consider each region equivalent to a sentence. Local Region Feature Extraction: CNN (VGG-19): (1) Rescale input to 448x448. (2) Take output from last pooling layer → D=512x14x14 → 196 local region vectors of 512 dimensions each. Visual feature embedding: W matrix to project image features to “q”-textual space.
  • 42. 42 Visual Question Answering: Grounded (Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
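The shape bookkeeping on this slide (512x14x14 feature map, then 196 region vectors of 512 dimensions, then a projection with W) can be checked with a short NumPy sketch. Everything numeric here is a stand-in: the feature map is random rather than a real VGG-19 output, the textual dimension `q_dim` is hypothetical, `W` and `b` are random instead of learned, and the `tanh` nonlinearity is an assumption about how the projection is applied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the last pooling layer output of VGG-19 on a 448x448 input.
feature_map = rng.normal(size=(512, 14, 14))

# Flatten the 14x14 spatial grid: one 512-d vector per local region.
regions = feature_map.reshape(512, -1).T  # shape (196, 512)

q_dim = 80                                 # hypothetical size of the textual ("q") space
W = rng.normal(size=(512, q_dim)) * 0.01   # projection matrix (learned in practice)
b = np.zeros(q_dim)

# Project every region into the textual space; tanh is an assumed nonlinearity.
projected = np.tanh(regions @ W + b)       # shape (196, q_dim)
```

Region `r * 14 + c` in `regions` corresponds to grid cell `(r, c)` of the feature map, so each projected vector can be treated like one "sentence" describing a local patch of the image.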
  • 43. 43 Datasets: Visual Genome Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).
  • 47. 47 Challenge: Movie Description Movie Description, Retrieval and Fill-in-the-blank
  • 48. 48 Challenges: Movie Question Answering Movie Question Answering
  • 49. 49 Challenges: Visual Question Answering Visual Question Answering
  • 50. 50 Challenges: Visual Question Answering Accuracy chart (0% to 100%): Humans 83.30%; UC Berkeley & Sony 66.47%; Baseline LSTM&CNN 54.06%; Baseline Nearest neighbor 42.85%; Baseline Prior per question type 37.47%; Baseline All yes 29.88%. 53.62% I. Masuda-Mora, “Open-Ended Visual Question-Answering”. Submitted as BSc ETSETB thesis. [clean code in Keras, perfect for beginners!]
  • 51. 51 Summary ● Representing language and vision as semantic embeddings allows fusion learning. ● Very high interest among researchers. Great topic for your thesis. ● Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
  • 52. 52 Conclusions New Turing test? How to evaluate AI’s image understanding? Slide credit: Issey Masuda
  • 54. 54 Thanks! Q&A? Follow me at https://imatge.upc.edu/web/people/xavier-giro @DocXavi /ProfessorXavi