SlideShare a Scribd company logo
1 of 53
Download to read offline
The Transformer
Lecture 20
Xavier Giro-i-Nieto
Associate Professor
Universitat Politecnica de Catalunya
@DocXavi
xavier.giro@upc.edu
2
Video-lecture
3
Acknowledgments
Marta R. Costa-jussà
Associate Professor
Universitat Politècnica de Catalunya
Carlos Escolano
PhD Candidate
Universitat Politècnica de Catalunya
Gerard I. Gállego
PhD Student
Universitat Politècnica de Catalunya
gerard.ion.gallego@upc.edu
@geiongallego
Outline
1. Reminders
4
5
Reminder
Nikhil Sha, “Attention ? An other Perspective!”. 2020.
6
Reminder
Attention is a mechanism to compute a context vector (c) for a query (Q) as a
weighted sum of values (V).
Figure: Nikhil Shah, “Attention? An Other Perspective! [Part 1]” (2020)
7
Reminder
Nikhil Sha, “Attention ? An other Perspective!”. 2020.
8
Reminder: Seq2Seq with Cross-Attention
Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020
In this case, cross-attention
refers to the attention
between the encoder and
decoder states.
9
Nikhil Sha, “Attention ? An other Perspective!”. 2020.
What may the term “self” refer to, as a contrast of “cross”-attention ?
Outline
1. Motivation
2. Self-attention
10
11
Self-Attention (or intra-Attention)
Lin, Z., Feng, M., Santos, C. N. D., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. A structured self-attentive sentence embedding.
ICLR 2017.
Figure:
Jay Alammar,
“The Illustrated Transformer”
Self-attention refers to attending to other elements from the SAME sequence.
12
Self-Attention (or intra-Attention)
Nikhil Sha, “Attention ? An other Perspective!”. 2020.
Query (Q)
g(x) = WQ
x
Key (K)
f(x) = WK
x
Value (V)
h(x) = WV
x
WQ
, WK
and WV
are projection
layers shared across all words.
13
Self-Attention (or intra-Attention)
Which steps are necessary to compute the contextual representation of a word
embedding e2
in a sequences of four words embeddings (e1
, e2
, e3
, e4
) ?
A (scaled) dot-product is computed between each pair of word embeddings
(eg. e1
and e2
)...
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
14
Self-Attention (or intra-Attention)
Which steps are necessary to compute the contextual representation of a word
embedding e2
in a sequences of four words embeddings (e1
, e2
, e3
, e4
) ?
… a softmax layer normalizes the attention scores to obtain the attention
distribution...
15
Self-Attention (or intra-Attention)
Which steps are necessary to compute the contextual representation of a word
embedding e2
in a sequences of four words embeddings (e1
, e2
, e3
, e4
) ?
...the same word
embeddings are combined to
obtain the contextual
representation e2
’.
16
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
17
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
18
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
19
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
20
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
21
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
22
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
23
Self-Attention (or intra-Attention)
Figure: Jay Alammar, “The illustrated Transformer” (2018)
24
Self-Attention (or intra-Attention) Scaled dot-product
attention
Figure: Jay Alammar, “The illustrated Transformer” (2018)
25
Study case: Self-Attention in for image generation
#SAGAN Zhang, Han, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. "Self-attention generative adversarial
networks." ICML 2019. [video]
Figure:
Frank Xu
Generator (G): Details can be generated using cues from all feature locations.
Discriminator: Can check consistency betweenn features in distant portions of the image.
26
Study case: Self-Attention in for image generation
#SAGAN Zhang, Han, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. "Self-attention generative adversarial
networks." ICML 2019. [video]
Query locations Attention maps for differet query locations
Outline
1. Motivation
2. Self-attention
3. Multi-head Self-Attention (MHSA)
27
28
Multi-Head Self-Attention (MHSA)
Nikhil Sha, “Attention ? An other Perspective!”. 2020.
In vanilla self-attention, a single set of projection matrices WQ
, WK
, WV
is used.
29
Nikhil Sha, “Attention ? An other Perspective!”. 2020.
In multi-head self-attention, multiple sets of projection matrices are used, and can
provide different contextual representations for the same input token.
Multi-Head Self-Attention (MHSA)
30
The multi-head self-attended E’i
matrixes are concatenated:
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Multi-Head Self-Attention (MHSA)
31
A fully connected layer on top combines everything in a new E’.
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Multi-Head Self-Attention (MHSA)
Multi-head Self-Attention: Visualization
32
#BertViz Vig, Jesse. "A multiscale visualization of attention in the transformer model." ACL 2019. [code] [tweet]
Each colour
corresponds
to a head.
Blue: First
head only.
Multi-color:
Multiple
heads.
33
Self-Attention and Convolutional Layers
Cordonnier, J. B., Loukas, A., & Jaggi, M. On the relationship between self-attention and convolutional layers. ICLR 2020.
[tweet] [code]
Outline
1. Motivation
2. Self-attention
3. Multi-head Attention
4. Positional Encoding
34
Positional Encoding
35
Given that the attention mechanism allows accessing all input (and output)
tokens, we no longer need a memory through recurrent layers.
Positional Encoding
36
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Where is the relative relation in the sequence encoded ?
Positional Encoding
37
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Where is the relative relation in the sequence encoded ?
Positional Encoding
38
Maria Ribalta, Pere-Pau Vàzquez, “Visualization is all you need”. UPC GCED 2020.
Sinusoidal functions are typically used to provide positional encodings.
Positional Encoding
39
Figure: Jay Alammar, “The illustrated Transformer” (2018)
Outline
1. Motivation
2. Self-attention
3. Multi-head Attention
4. Positional Encoding
5. The Transformer
40
The Transformer
41
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
The Transformer removed the recurrency mechanism thanks to self-attention.
The Transformer
42
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
Positional Encoding over the output
sequence.
Positional Encoding
over the input
sequence.
Auto-regressive (at test).
The Transformer
43
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
Cross-Attention (or inter-attention)
between input and output tokens
Self-attention for
the input tokens.
Self-attention for the output tokens.
The Transformer: Layers
44
#Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention
is all you need. NeurIPS 2017.
N decoder layers
N encoder
layers
The Transformer: Layers
45
#BertViz Vig, Jesse. "A multiscale visualization of attention in the transformer model." ACL 2019. [code] [tweet]
A birds-eye view of attention across all of the model’s layers and heads
The Transformer: Visualization
46
Maria Ribalta, Pere-Pau Vàzquez, “Visualization is all you need”. UPC GCED 2020.
47
Are Transformers for Language only ? NO !!
#ViT Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa
Dehghani et al. "An image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021. [blog] [code]
Outline
1. Motivation
2. Self-attention
3. Multi-head Attention
4. Positional Encoding
5. The Transformer
48
49
(extra) PyTorch Lab on Google Colab
DL resources from UPC Telecos:
● Lectures (with Slides & Videos)
● Labs
Gerard Gallego
gerard.ion.gallego@upc.edu
Student PhD
Universitat Politecnica de Catalunya
Technical University of Catalonia
50
Software
● Transformers in HuggingFace.
● GPT-Neo by EleutherAI
○ Similar results to GPT-3, but smaller and open source.
● Andrej Karpathy, minGPT (2020).
51
Learn more
Ashish Vaswani, Stanford CS224N 2019.
52
Learn more
● Tutorials
○ Sebastian Ruder, Deep Learning for NLP Best Practices # Attention (2017).
○ Chris Olah, Shan Carter, “Attention and Augmented Recurrent Neural Networks”. distill.pub 2016.
○ Lilian Weg, The Transformer Family. Lil’Log 2020
● Twitter threads
○ Christian Wolf (INSA Lyon)
● Scientific publications
○ #Perceiver Jaegle, Andrew, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira.
"Perceiver: General perception with iterative attention." arXiv preprint arXiv:2103.03206 (2021).
○ #Longformer Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." arXiv
preprint arXiv:2004.05150 (2020).
○ Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. Transformers are RNS: Fast autoregressive transformers with linear
attention. ICML 2020.
○ Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye
Teh, Tim Harley, Razvan Pascanu, “Multiplicative Interactions and Where to Find Them”. ICLR 2020. [tweet]
○ Self-attention in language
■ Cheng, J., Dong, L., & Lapata, M. (2016). Long short-term memory-networks for machine reading. arXiv preprint
arXiv:1601.06733.
○ Self-attention in images
■ Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., & Tran, D. (2018). Image transformer. ICML
2018.
■ Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. "Non-local neural networks." In CVPR 2018.
53
Questions ?

More Related Content

What's hot

What's hot (20)

Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
Vision Transformer(ViT) / An Image is Worth 16*16 Words: Transformers for Ima...
 
Transformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to HeroTransformers in Vision: From Zero to Hero
Transformers in Vision: From Zero to Hero
 
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
 
Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
 
Swin transformer
Swin transformerSwin transformer
Swin transformer
 
JPEG2000 in a nutshell
JPEG2000 in a nutshellJPEG2000 in a nutshell
JPEG2000 in a nutshell
 
Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)
Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)
Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)
 
Pr083 Non-local Neural Networks
Pr083 Non-local Neural NetworksPr083 Non-local Neural Networks
Pr083 Non-local Neural Networks
 
[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)[Paper] Multiscale Vision Transformers(MVit)
[Paper] Multiscale Vision Transformers(MVit)
 
A Beginner's Guide to Monocular Depth Estimation
A Beginner's Guide to Monocular Depth EstimationA Beginner's Guide to Monocular Depth Estimation
A Beginner's Guide to Monocular Depth Estimation
 
DiffusionMBIR_presentation_slide.pptx
DiffusionMBIR_presentation_slide.pptxDiffusionMBIR_presentation_slide.pptx
DiffusionMBIR_presentation_slide.pptx
 
Presentation - Model Efficiency for Edge AI
Presentation - Model Efficiency for Edge AIPresentation - Model Efficiency for Edge AI
Presentation - Model Efficiency for Edge AI
 
Neural Scene Representation & Rendering: Introduction to Novel View Synthesis
Neural Scene Representation & Rendering: Introduction to Novel View SynthesisNeural Scene Representation & Rendering: Introduction to Novel View Synthesis
Neural Scene Representation & Rendering: Introduction to Novel View Synthesis
 
Attention Is All You Need
Attention Is All You NeedAttention Is All You Need
Attention Is All You Need
 
Video Transformers.pptx
Video Transformers.pptxVideo Transformers.pptx
Video Transformers.pptx
 
Jpeg2000
Jpeg2000Jpeg2000
Jpeg2000
 
Image super-resolution via iterative refinement.pptx
Image super-resolution via iterative refinement.pptxImage super-resolution via iterative refinement.pptx
Image super-resolution via iterative refinement.pptx
 
"Attention Is All You Need" presented by Maroua Maachou (Veepee)
"Attention Is All You Need" presented by Maroua Maachou (Veepee)"Attention Is All You Need" presented by Maroua Maachou (Veepee)
"Attention Is All You Need" presented by Maroua Maachou (Veepee)
 
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
 
Few shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningFew shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learning
 

Similar to The Transformer - Xavier Giró - UPC Barcelona 2021

Natural Language Processing: Lecture 255
Natural Language Processing: Lecture 255Natural Language Processing: Lecture 255
Natural Language Processing: Lecture 255
deffa5
 

Similar to The Transformer - Xavier Giró - UPC Barcelona 2021 (20)

Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC BarcelonaSelf-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
Self-supervised Visual Learning 2020 - Xavier Giro-i-Nieto - UPC Barcelona
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
alexVAE_New.pdf
alexVAE_New.pdfalexVAE_New.pdf
alexVAE_New.pdf
 
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
Deep Language and Vision by Amaia Salvador (Insight DCU 2018)
 
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
Variational Autoencoders VAE - Santiago Pascual - UPC Barcelona 2018
 
Striving to Demystify Bayesian Computational Modelling
Striving to Demystify Bayesian Computational ModellingStriving to Demystify Bayesian Computational Modelling
Striving to Demystify Bayesian Computational Modelling
 
[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx[AIoTLab]attention mechanism.pptx
[AIoTLab]attention mechanism.pptx
 
Deep Learning Representations for All (a.ka. the AI hype)
Deep Learning Representations for All (a.ka. the AI hype)Deep Learning Representations for All (a.ka. the AI hype)
Deep Learning Representations for All (a.ka. the AI hype)
 
Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018
Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018
Deep Learning without Annotations - Xavier Giro - UPC Barcelona 2018
 
ALASI15 Writing Analytics Workshop
ALASI15 Writing Analytics WorkshopALASI15 Writing Analytics Workshop
ALASI15 Writing Analytics Workshop
 
Multimodal Residual Networks for Visual QA
Multimodal Residual Networks for Visual QAMultimodal Residual Networks for Visual QA
Multimodal Residual Networks for Visual QA
 
Transformer based approaches for visual representation learning
Transformer based approaches for visual representation learningTransformer based approaches for visual representation learning
Transformer based approaches for visual representation learning
 
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Lan...
 
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
Self-supervised Learning from Video Sequences - Xavier Giro - UPC Barcelona 2019
 
ICED 2013 A
ICED 2013 AICED 2013 A
ICED 2013 A
 
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018One Perceptron  to Rule them All: Deep Learning for Multimedia #A2IC2018
One Perceptron to Rule them All: Deep Learning for Multimedia #A2IC2018
 
Interpretability of machine learning
Interpretability of machine learningInterpretability of machine learning
Interpretability of machine learning
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
 
Deep Learning: concepts and use cases (October 2018)
Deep Learning: concepts and use cases (October 2018)Deep Learning: concepts and use cases (October 2018)
Deep Learning: concepts and use cases (October 2018)
 
Natural Language Processing: Lecture 255
Natural Language Processing: Lecture 255Natural Language Processing: Lecture 255
Natural Language Processing: Lecture 255
 

More from Universitat Politècnica de Catalunya

Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Universitat Politècnica de Catalunya
 

More from Universitat Politècnica de Catalunya (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Deep Generative Learning for All
Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and...
 
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
Object Detection with Deep Learning - Xavier Giro-i-Nieto - UPC School Barcel...
 
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
Self-supervised Audiovisual Learning 2020 - Xavier Giro-i-Nieto - UPC Telecom...
 

Recently uploaded

Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 

Recently uploaded (20)

Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 

The Transformer - Xavier Giró - UPC Barcelona 2021

  • 1. The Transformer Lecture 20 Xavier Giro-i-Nieto Associate Professor Universitat Politecnica de Catalunya @DocXavi xavier.giro@upc.edu
  • 3. 3 Acknowledgments Marta R. Costa-jussà Associate Professor Universitat Politècnica de Catalunya Carlos Escolano PhD Candidate Universitat Politècnica de Catalunya Gerard I. Gállego PhD Student Universitat Politècnica de Catalunya gerard.ion.gallego@upc.edu @geiongallego
  • 5. 5 Reminder Nikhil Sha, “Attention ? An other Perspective!”. 2020.
  • 6. 6 Reminder Attention is a mechanism to compute a context vector (c) for a query (Q) as a weighted sum of values (V). Figure: Nikhil Shah, “Attention? An Other Perspective! [Part 1]” (2020)
  • 7. 7 Reminder Nikhil Sha, “Attention ? An other Perspective!”. 2020.
  • 8. 8 Reminder: Seq2Seq with Cross-Attention Slide concept: Abigail See, Matthew Lamm (Stanford CS224N), 2020 In this case, cross-attention refers to the attention between the encoder and decoder states.
  • 9. 9 Nikhil Sha, “Attention ? An other Perspective!”. 2020. What may the term “self” refer to, as a contrast of “cross”-attention ?
  • 11. 11 Self-Attention (or intra-Attention) Lin, Z., Feng, M., Santos, C. N. D., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. A structured self-attentive sentence embedding. ICLR 2017. Figure: Jay Alammar, “The Illustrated Transformer” Self-attention refers to attending to other elements from the SAME sequence.
  • 12. 12 Self-Attention (or intra-Attention) Nikhil Sha, “Attention ? An other Perspective!”. 2020. Query (Q) g(x) = WQ x Key (K) f(x) = WK x Value (V) h(x) = WV x WQ , WK and WV are projection layers shared across all words.
  • 13. 13 Self-Attention (or intra-Attention) Which steps are necessary to compute the contextual representation of a word embedding e2 in a sequences of four words embeddings (e1 , e2 , e3 , e4 ) ? A (scaled) dot-product is computed between each pair of word embeddings (eg. e1 and e2 )... #Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention is all you need. NeurIPS 2017.
  • 14. 14 Self-Attention (or intra-Attention) Which steps are necessary to compute the contextual representation of a word embedding e2 in a sequences of four words embeddings (e1 , e2 , e3 , e4 ) ? … a softmax layer normalizes the attention scores to obtain the attention distribution...
  • 15. 15 Self-Attention (or intra-Attention) Which steps are necessary to compute the contextual representation of a word embedding e2 in a sequences of four words embeddings (e1 , e2 , e3 , e4 ) ? ...the same word embeddings are combined to obtain the contextual representation e2 ’.
  • 16. 16 Self-Attention (or intra-Attention) Figure: Jay Alammar, “The illustrated Transformer” (2018)
  • 17. 17 Self-Attention (or intra-Attention) Figure: Jay Alammar, “The illustrated Transformer” (2018)
  • 18. 18 Self-Attention (or intra-Attention) Figure: Jay Alammar, “The illustrated Transformer” (2018)
  • 19. 19 Self-Attention (or intra-Attention) Figure: Jay Alammar, “The illustrated Transformer” (2018)
  • 20. 20 Self-Attention (or intra-Attention) Figure: Jay Alammar, “The illustrated Transformer” (2018)
  • 21. 21 Self-Attention (or intra-Attention) Figure: Jay Alammar, “The illustrated Transformer” (2018)
  • 22. 22 Self-Attention (or intra-Attention) Figure: Jay Alammar, “The illustrated Transformer” (2018)
  • 23. 23 Self-Attention (or intra-Attention) Figure: Jay Alammar, “The illustrated Transformer” (2018)
  • 24. 24 Self-Attention (or intra-Attention) Scaled dot-product attention Figure: Jay Alammar, “The illustrated Transformer” (2018)
  • 25. 25 Study case: Self-Attention in for image generation #SAGAN Zhang, Han, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. "Self-attention generative adversarial networks." ICML 2019. [video] Figure: Frank Xu Generator (G): Details can be generated using cues from all feature locations. Discriminator: Can check consistency betweenn features in distant portions of the image.
  • 26. 26 Study case: Self-Attention in for image generation #SAGAN Zhang, Han, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. "Self-attention generative adversarial networks." ICML 2019. [video] Query locations Attention maps for differet query locations
  • 27. Outline 1. Motivation 2. Self-attention 3. Multi-head Self-Attention (MHSA) 27
  • 28. 28 Multi-Head Self-Attention (MHSA) Nikhil Sha, “Attention ? An other Perspective!”. 2020. In vanilla self-attention, a single set of projection matrices WQ , WK , WV is used.
  • 29. 29 Nikhil Sha, “Attention ? An other Perspective!”. 2020. In multi-head self-attention, multiple sets of projection matrices are used, and can provide different contextual representations for the same input token. Multi-Head Self-Attention (MHSA)
  • 30. 30 The multi-head self-attended E’i matrixes are concatenated: Figure: Jay Alammar, “The illustrated Transformer” (2018) Multi-Head Self-Attention (MHSA)
  • 31. 31 A fully connected layer on top combines everything in a new E’. Figure: Jay Alammar, “The illustrated Transformer” (2018) Multi-Head Self-Attention (MHSA)
  • 32. Multi-head Self-Attention: Visualization 32 #BertViz Vig, Jesse. "A multiscale visualization of attention in the transformer model." ACL 2019. [code] [tweet] Each colour corresponds to a head. Blue: First head only. Multi-color: Multiple heads.
  • 33. 33 Self-Attention and Convolutional Layers Cordonnier, J. B., Loukas, A., & Jaggi, M. On the relationship between self-attention and convolutional layers. ICLR 2020. [tweet] [code]
  • 34. Outline 1. Motivation 2. Self-attention 3. Multi-head Attention 4. Positional Encoding 34
  • 35. Positional Encoding 35 Given that the attention mechanism allows accessing all input (and output) tokens, we no longer need a memory through recurrent layers.
  • 36. Positional Encoding 36 Figure: Jay Alammar, “The illustrated Transformer” (2018) Where is the relative relation in the sequence encoded ?
  • 37. Positional Encoding 37 Figure: Jay Alammar, “The illustrated Transformer” (2018) Where is the relative relation in the sequence encoded ?
  • 38. Positional Encoding 38 Maria Ribalta, Pere-Pau Vàzquez, “Visualization is all you need”. UPC GCED 2020. Sinusoidal functions are typically used to provide positional encodings.
  • 39. Positional Encoding 39 Figure: Jay Alammar, “The illustrated Transformer” (2018)
  • 40. Outline 1. Motivation 2. Self-attention 3. Multi-head Attention 4. Positional Encoding 5. The Transformer 40
  • 41. The Transformer 41 #Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention is all you need. NeurIPS 2017. The Transformer removed the recurrency mechanism thanks to self-attention.
  • 42. The Transformer 42 #Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention is all you need. NeurIPS 2017. Positional Encoding over the output sequence. Positional Encoding over the input sequence. Auto-regressive (at test).
  • 43. The Transformer 43 #Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention is all you need. NeurIPS 2017. Cross-Attention (or inter-attention) between input and output tokens Self-attention for the input tokens. Self-attention for the output tokens.
  • 44. The Transformer: Layers 44 #Transformer Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I.. Attention is all you need. NeurIPS 2017. N decoder layers N encoder layers
  • 45. The Transformer: Layers 45 #BertViz Vig, Jesse. "A multiscale visualization of attention in the transformer model." ACL 2019. [code] [tweet] A birds-eye view of attention across all of the model’s layers and heads
  • 46. The Transformer: Visualization 46 Maria Ribalta, Pere-Pau Vàzquez, “Visualization is all you need”. UPC GCED 2020.
  • 47. 47 Are Transformers for Language only ? NO !! #ViT Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani et al. "An image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021. [blog] [code]
  • 48. Outline 1. Motivation 2. Self-attention 3. Multi-head Attention 4. Positional Encoding 5. The Transformer 48
  • 49. 49 (extra) PyTorch Lab on Google Colab DL resources from UPC Telecos: ● Lectures (with Slides & Videos) ● Labs Gerard Gallego gerard.ion.gallego@upc.edu Student PhD Universitat Politecnica de Catalunya Technical University of Catalonia
  • 50. 50 Software ● Transformers in HuggingFace. ● GPT-Neo by EleutherAI ○ Similar results to GPT-3, but smaller and open source. ● Andrej Karpathy, minGPT (2020).
  • 51. 51 Learn more Ashish Vaswani, Stanford CS224N 2019.
  • 52. 52 Learn more ● Tutorials ○ Sebastian Ruder, Deep Learning for NLP Best Practices # Attention (2017). ○ Chris Olah, Shan Carter, “Attention and Augmented Recurrent Neural Networks”. distill.pub 2016. ○ Lilian Weg, The Transformer Family. Lil’Log 2020 ● Twitter threads ○ Christian Wolf (INSA Lyon) ● Scientific publications ○ #Perceiver Jaegle, Andrew, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira. "Perceiver: General perception with iterative attention." arXiv preprint arXiv:2103.03206 (2021). ○ #Longformer Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." arXiv preprint arXiv:2004.05150 (2020). ○ Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. Transformers are RNS: Fast autoregressive transformers with linear attention. ICML 2020. ○ Siddhant M. Jayakumar, Wojciech M. Czarnecki, Jacob Menick, Jonathan Schwarz, Jack Rae, Simon Osindero, Yee Whye Teh, Tim Harley, Razvan Pascanu, “Multiplicative Interactions and Where to Find Them”. ICLR 2020. [tweet] ○ Self-attention in language ■ Cheng, J., Dong, L., & Lapata, M. (2016). Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733. ○ Self-attention in images ■ Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., & Tran, D. (2018). Image transformer. ICML 2018. ■ Wang, Xiaolong, Ross Girshick, Abhinav Gupta, and Kaiming He. "Non-local neural networks." In CVPR 2018.