Paper Review
Attention Is All You Need
(Vaswani et al. 2017) [arXiv pre-print link]
Strong reference: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Santiago Pascual de la Puente
June 07, 2018
TALP UPC, Barcelona
Table of contents
1. Introduction
2. The Transformer
A Myriad of Attentions
Point-Wise Feed Forward Networks
The Transformer Block
3. Interfacing Token Sequences
Embeddings
Positional Encoding
4. Results
5. Conclusions
Introduction
Recurrent neural networks (RNNs) and their cell variants are firmly
established as state of the art in sequence modeling and transduction
(e.g. machine translation).
In transduction we map a sequence $X = \{x_1, \dots, x_T\}$ to another one $Y = \{y_1, \dots, y_M\}$, where $T$ and $M$ can be different, $x_t \in \mathbb{R}^{d_e}$ and $y_m \in \mathbb{R}^{d_d}$.
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
1. The encoder RNN will encode the source symbols $X = \{x_1, \dots, x_T\}$ into useful abstractions that mix in contextual content → $H = \{h_1, \dots, h_T\}$, where $h_t = \tanh(W x_t + U h_{t-1} + b)$.
2. The last encoder state $h_T$ is typically taken as the summary of the input, and it is injected into the decoder as its initial state: $h^d_0 = h_T$.
3. The decoder RNN will generate the target sequence one symbol at a time (autoregressive) by feeding back its previous prediction $y_{m-1}$ as input, also conditioned on the encoder summary $h^d_0$.
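The encoder recurrence in steps 1 and 2 can be sketched in a few lines of NumPy. This is only an illustration: the toy sizes, the random weights and the zero initial state are assumptions, not values from the slides.

```python
import numpy as np

def encode(X, W, U, b):
    """Run h_t = tanh(W x_t + U h_{t-1} + b) over a sequence X of shape (T, d_e)."""
    h = np.zeros(U.shape[0])          # assumed zero initial state
    H = []
    for x_t in X:                     # strictly sequential: h_t needs h_{t-1}
        h = np.tanh(W @ x_t + U @ h + b)
        H.append(h)
    return np.stack(H)                # H = {h_1, ..., h_T}, shape (T, d_h)

rng = np.random.default_rng(0)
T, d_e, d_h = 5, 4, 3                 # toy sizes chosen for illustration
X = rng.normal(size=(T, d_e))
W = rng.normal(size=(d_h, d_e))       # random stand-ins for learned weights
U = rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)

H = encode(X, W, U, b)
decoder_init = H[-1]                  # h^d_0 = h_T, the summary handed to the decoder
```

The `for` loop is the point: every step waits for the previous one, so the T positions cannot be processed in parallel.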
Encoding a sentence into one vector would be super amazing, but it is unfeasible. In the real world we need a mechanism that gives the decoder hints on where to look in the encoder, weighting all the source vectors rather than just taking the last one → ATTENTION MECHANISM.
• $c_m = \sum_{t=0}^{T-1} \alpha^m_t \cdot h_t$
• Each $c_m$ is a row (and an additional input to the decoder), and each $\alpha^m_t$ is an orange square.
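A minimal sketch of how one context vector $c_m$ could be computed. The slide does not say how the weights $\alpha^m_t$ are obtained, so this assumes a dot-product score between a hypothetical decoder state `s` and each encoder state, followed by a softmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def context_vector(s, H):
    """c_m = sum_t alpha_t^m * h_t for decoder state s and encoder states H (T, d)."""
    scores = H @ s                    # assumed dot-product score between s and each h_t
    alpha = softmax(scores)           # alpha_t^m: one weight per source position
    return alpha @ H, alpha           # weighted sum of the encoder states

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))           # T = 6 encoder states of size 4 (toy values)
s = rng.normal(size=4)                # hypothetical decoder state at step m
c_m, alpha = context_vector(s, H)
```

Each decoder step m gets its own `alpha`, which corresponds to one row of the attention map.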
• RNNs factor computation along symbol time positions, generating $h_t$ out of $h_{t-1}$ → cannot parallelize in training:
$h_t = \tanh(W x_t + U h_{t-1} + b)$
• Attention is used with SOTA transduction RNNs → model
dependencies without regard to their distance in the input or output
sequences.
• Let’s get rid of recurrence and rely entirely on attentions to draw
global dependencies b/w input and output.
• The Transformer is born, significantly boosting parallelization and
reaching new SOTA in translation.
The Transformer
We will have a new encoder-decoder structure, without any recurrence:
only fully connected layers (independent at every time-step) and
self-attention to merge global info in the sequences.
• Encoder will map $X = \{x_1, \dots, x_T\}$ to a sequence of continuous representations $Z = \{z_1, \dots, z_T\}$.
• Given $Z$, the decoder will generate $Y = \{y_1, \dots, y_M\}$.
• Still auto-regressive! But no recurrent connections at all.
Attention Generic Formulation
• Attention function maps a query and a set of key-value pairs to an
output: query, keys, values, and output are all vectors:
o = f (q, k, v)
• Output is computed as a weighted sum of the values.
• Weight assigned to each value is computed by a compatibility
function of the query with the corresponding key.
$o = \sum_{t=0}^{T-1} g(q_i, k^i_t) \cdot v_t$
Scaled Dot-Product Attention
• Input: queries and keys of dimension $d_k$ and values of dimension $d_v$.
• Compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$ and apply Softmax → obtain weights on the values.
• FAST TRICK: compute the attention on a set of queries simultaneously, packing them into matrices $Q$, $K$, $V$.
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
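The packed formula translates almost literally into NumPy. A minimal sketch, with toy shapes and random inputs chosen only for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Softmax(Q K^T / sqrt(d_k)) V for Q (n_q, d_k), K (n_k, d_k), V (n_k, d_v)."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n_q, n_k), each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))                     # 3 queries with d_k = 8
K = rng.normal(size=(5, 8))                     # 5 keys
V = rng.normal(size=(5, 16))                    # 5 values with d_v = 16
out, weights = attention(Q, K, V)               # out has shape (3, 16)
```

Row `i` of `out` is exactly what a single call with only query `i` would return, which is why packing the queries into `Q` loses nothing.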
The Fault In Our Scale
Wait... why do we scale the output from the matching function between query and key by $\sqrt{d_k}$?
Two most commonly used attention methods (to merge k and q):
• Additive: MLP with one hidden layer where vectors are
concatenated at input of MLP.
• Multiplicative: dot-product seen here → MUCH faster and more
space-efficient.
For small values of $d_k$ both behave similarly, but additive outperforms unscaled dot-product for larger $d_k$.
Suspicion: for large values of $d_k$, dot-products grow large in magnitude, pushing Softmax into regions with extremely small gradients.
Assume the components of $q$ and $k$ are independent random variables with $\mu = 0$ and $\sigma = 1$ ⇒ $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has $\mu = 0$ and $\sigma = \sqrt{d_k}$.
We counteract this effect by scaling by $\frac{1}{\sqrt{d_k}}$.
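The claim about $\sigma$ is easy to check numerically. The sketch below assumes standard normal components (the slide only asks for zero mean and unit variance) and an arbitrary sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000                                        # number of sampled (q, k) pairs
raw_std, scaled_std = {}, {}
for d_k in (4, 64, 512):
    q = rng.standard_normal((n, d_k))            # components with mean 0, std 1
    k = rng.standard_normal((n, d_k))
    dots = np.einsum("nd,nd->n", q, k)           # q . k for every sampled pair
    raw_std[d_k] = dots.std()                    # expected to be close to sqrt(d_k)
    scaled_std[d_k] = (dots / np.sqrt(d_k)).std()  # expected to be close to 1
    print(f"d_k={d_k:4d}  std(q.k)={raw_std[d_k]:.2f}  std(scaled)={scaled_std[d_k]:.2f}")
```

The unscaled standard deviation should grow roughly like $\sqrt{d_k}$, while the scaled one should stay near 1 whatever the dimension.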
Multi-Head Attention
Multi-head attention allows the model to jointly attend to information
from different representation subspaces at different positions. With a
single attention head, averaging inhibits this.
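$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O$
$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
$W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, $W^O \in \mathbb{R}^{h d_v \times d_{model}}$
In this work $h = 8$ and $d_k = d_v = d_{model}/h = 64$.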
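A minimal sketch of the multi-head equations with the sizes quoted above. The projection matrices are random stand-ins for learned parameters, and the input length of 10 positions is an arbitrary choice.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K: (h, d_model, d_k); W_V: (h, d_model, d_v); W_O: (h*d_v, d_model)."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)        # head_i, shape (n, d_v)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O       # Concat(head_1..head_h) W^O

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_v = d_model // h                              # 64, as on the slide
W_Q = rng.normal(size=(h, d_model, d_k)) * 0.02       # random stand-ins for learned weights
W_K = rng.normal(size=(h, d_model, d_k)) * 0.02
W_V = rng.normal(size=(h, d_model, d_v)) * 0.02
W_O = rng.normal(size=(h * d_v, d_model)) * 0.02
x = rng.normal(size=(10, d_model))                    # 10 positions (arbitrary)
out = multi_head(x, x, x, W_Q, W_K, W_V, W_O)         # self-attention: Q = K = V = x
```

Because each head works in a 64-dimensional subspace, the eight heads together cost about as much as one full-width head.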
Transformer uses multi-head attention in three different ways:
1. Encoder-decoder attention layers: queries come from previous
decoder layer, and keys and values come from output of the encoder.
Every position in the decoder attends over all positions in the input
sequence. (Same type of attention as classical seq2seq).
2. Encoder contains self-attention layers: all keys, values and queries
come from same place, the previous encoder layer output. Thus
each position in the encoder can attend to all positions in the
encoder’s previous layer.
3. The decoder has the same self-attention mechanism. BUT!... it must prevent leftward information flow to remain autoregressive: each position can attend only to positions up to and including itself.
Decoder Attention Mask
Prevent leftward information flow inside of scaled dot-product attention, by masking out (setting to $-\infty$) all values in the input of the Softmax which correspond to "illegal" connections.
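A minimal sketch of the mask, assuming the "illegal" connections are those from a position to any later position, so each position attends only to itself and to earlier ones.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    illegal = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where key position > query position
    scores = np.where(illegal, -np.inf, scores)           # -inf before the Softmax gives weight 0
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                               # 4 decoder positions (toy values)
out, weights = masked_self_attention(x, x, x)
```

`weights` comes out lower triangular: the first position can only attend to itself, so `out[0]` equals `x[0]`.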
Point-Wise Feed Forward Networks
Simply an MLP applied to each time position with the same parameters:
$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$
These can be seen as two 1D convolutions with kernel width 1. The dimensionality of input and output is $d_{model} = 512$ and the inner layer has dimensionality $d_{ff} = 2048$.
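A minimal sketch of the point-wise FFN with the sizes above. The weights are random stand-ins for learned parameters, and the sequence length of 7 is arbitrary.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied along the last axis of x."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.normal(size=(d_model, d_ff)) * 0.02     # random stand-ins for learned weights
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.normal(size=(7, d_model))                # 7 time positions
y = ffn(x, W1, b1, W2, b2)                       # all positions in one matrix product
y_per_position = np.stack([ffn(x_t, W1, b1, W2, b2) for x_t in x])
```

`y` and `y_per_position` agree up to floating-point error: no information crosses time positions, which is what makes the layer equivalent to convolutions of kernel width 1.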
The Transformer Block
If we mix a spoon of Multi-Head Attention, another of Point-Wise FFN,
a pinch of res-connections and a spoon of Add&LayerNorm ops we obtain
the Transformer block:
We can see how a stack of N of these blocks forms the whole Transformer END-TO-END network. Note the extra enc-dec attention in the decoder blocks.
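The recipe for one encoder-side block can be sketched as follows. To stay short it makes several simplifications that are assumptions of this sketch, not of the paper: a single attention head stands in for multi-head attention, biases and the learned LayerNorm gain and offset are dropped, and the sizes are toy values.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalise each position over its features (learned gain and offset omitted)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x, Wq, Wk, Wv):
    # Single head standing in for the multi-head layer
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def ffn(x, W1, W2):
    return np.maximum(0.0, x @ W1) @ W2               # biases omitted for brevity

def encoder_block(x, p):
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))  # Add & LayerNorm
    x = layer_norm(x + ffn(x, p["W1"], p["W2"]))                      # Add & LayerNorm
    return x

def init_block(rng, d_model, d_ff):
    shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model),
              "Wv": (d_model, d_model), "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
    return {name: rng.normal(size=s) * 0.1 for name, s in shapes.items()}

rng = np.random.default_rng(0)
d_model, d_ff, N = 16, 64, 6                          # toy sizes; N blocks in the stack
blocks = [init_block(rng, d_model, d_ff) for _ in range(N)]
x = rng.normal(size=(5, d_model))                     # 5 positions
for p in blocks:                                      # output of one block feeds the next
    x = encoder_block(x, p)
```

A decoder block would add a masked self-attention and the enc-dec attention as further sub-layers, each wrapped in the same residual plus LayerNorm pattern.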
Interfacing Token Sequences
Embeddings
As in seq2seq models, we use learned embeddings to convert input tokens and output tokens to dense vectors of dimension $d_{model}$. There is also (of course) an output linear transformation to go from $d_{model}$ to the number of classes, followed by a Softmax.
In the Transformer, all 3 of these matrices are tied (same parameters apply), and in the embedding layers the weights are multiplied by $\sqrt{d_{model}}$.
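A minimal sketch of the weight tying, assuming source and target share one vocabulary (otherwise the three matrices could not be the same). The vocabulary size of 100 and the random matrix `E` are illustrative only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, d_model = 100, 512                     # toy shared vocabulary (assumption)
E = rng.normal(size=(vocab_size, d_model)) * 0.02  # the single tied matrix

def embed(token_ids):
    # Input and output embedding layers: look up rows of E, scaled by sqrt(d_model)
    return E[token_ids] * np.sqrt(d_model)

def output_probs(decoder_states):
    # Pre-softmax linear map from d_model to the number of classes, reusing E transposed
    return softmax(decoder_states @ E.T)

tokens = np.array([3, 17, 42])
x = embed(tokens)                                  # (3, d_model)
probs = output_probs(rng.normal(size=(3, d_model)))  # (3, vocab_size)
```

Only `E` holds parameters here: both embedding lookups and the output projection read from the same array.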
Positional Encoding
• Are we processing sequences? YES.
• Are we taking care of this fact? NO.
So let’s work it out.
• In order for the model to make use of the order of the sequence, we
must inject some information about the relative or absolute position
of the tokens in the sequence.
• Add positional encodings to the embeddings, summing them so that the positional info is merged into the input.
$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
Where $i$ indexes the dimension and $pos$ is the position (time-step). Each dimension corresponds to a sinusoid, with wavelengths forming a geometric progression. The frequency and offset of the wave are different for each dimension.
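The two formulas can be sketched as a small function. It assumes an even $d_{model}$; the maximum length of 50 and the all-zero `embeddings` placeholder are illustrative only.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]         # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model) # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

d_model = 512
pe = positional_encoding(50, d_model)
embeddings = np.zeros((50, d_model))                  # placeholder for token embeddings
x = embeddings + pe                                   # positional info summed into the input
```

Low dimension indices oscillate quickly with position and high ones slowly, so each position receives a distinct pattern of values in [-1, 1].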
At every time-step we will have a combination of sinusoids telling us where we are relative to the beginning (with a combination of phases).
Advantage of these codes: they may generalize to sequence lengths not seen in training (sinusoids are cyclic and bounded rather than growing indefinitely).
Results
• On the WMT 2014 English-to-German translation task, the big
transformer model (Transformer (big)) outperforms the best
previously reported models (including ensembles) by more than 2.0
BLEU! (new SOTA of 28.4).
• Training took 3.5 days on 8 P100 GPUs. Even their base model
surpasses all previously published models and ensembles, at a
fraction of the training cost of any of the competitive models.
• On the WMT 2014 English-to-French translation task, the big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost.
[Figures: Enc Layer2, Enc Layer6, Dec Layer2, Dec-SRC Layer2]
Conclusions
• The Transformer is the first sequence transduction model based entirely on attention (replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention).
• For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers.
• New SOTA on WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks.
• Code used to train and evaluate the original models is available at https://github.com/tensorflow/tensor2tensor.
Thanks!
@santty128
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 

Recently uploaded

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Recently uploaded (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Attention is all you need (UPC Reading Group 2018, by Santi Pascual)

  • 1. Paper Review: Attention Is All You Need (Vaswani et al., 2017) [arXiv pre-print link]. Strong reference: http://nlp.seas.harvard.edu/2018/04/03/attention.html Santiago Pascual de la Puente, June 07, 2018, TALP UPC, Barcelona
  • 2. Table of contents: 1. Introduction; 2. The Transformer (A Myriad of Attentions, Point-Wise Feed Forward Networks, The Transformer Block); 3. Interfacing Token Sequences (Embeddings, Positional Encoding); 4. Results; 5. Conclusions
  • 4. Introduction: Recurrent neural networks (RNNs) and their cell variants are firmly established as state of the art in sequence modeling and transduction (e.g. machine translation). In transduction we map a sequence $X = \{x_1, \dots, x_T\}$ to another one $Y = \{y_1, \dots, y_M\}$, where $T$ and $M$ can be different, $x_t \in \mathbb{R}^{d_e}$ and $y_m \in \mathbb{R}^{d_d}$. https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
  • 5. Introduction: 1. The encoder RNN encodes the source symbols $X = \{x_1, \dots, x_T\}$ into contextual representations $H = \{h_1, \dots, h_T\}$, where $h_t = \tanh(W x_t + U h_{t-1} + b)$. 2. The last encoder state $h_T$ is typically taken as the summary of the input and is injected as the decoder's initial state, $h^d_0 = h_T$. 3. The decoder RNN generates the target sequence one symbol at a time (autoregressively), feeding back its previous prediction $y_{m-1}$ as input while also being conditioned on the encoder summary $h^d_0$. https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
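The encoder recurrence on this slide can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `rnn_encode`, the zero initial state, and the toy weights are assumptions made here, not part of the slides or the linked notebook.

```python
import math

def rnn_encode(xs, W, U, b):
    """Apply h_t = tanh(W x_t + U h_{t-1} + b) along a sequence (h_0 = 0 is assumed)."""
    hidden = len(b)
    h = [0.0] * hidden
    states = []
    for x in xs:  # strictly sequential: step t needs the state from step t-1
        h = [
            math.tanh(
                sum(W[i][j] * x[j] for j in range(len(x)))
                + sum(U[i][j] * h[j] for j in range(hidden))
                + b[i]
            )
            for i in range(hidden)
        ]
        states.append(h)
    return states

# Hypothetical toy sizes: input dimension 2, hidden dimension 2, sequence length 3.
W = [[0.5, -0.3], [0.1, 0.8]]
U = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.1]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

H = rnn_encode(xs, W, U, b)
h_T = H[-1]  # last encoder state, the summary handed to the decoder
```

The loop over `xs` is the part that cannot run in parallel: each state depends on the one before it.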
  • 6. Introduction: Encoding a whole sentence into a single vector would be convenient, but it is infeasible in practice. We need a mechanism that gives the decoder hints on where to look in the encoder, weighting all the source vectors rather than taking only the last one → ATTENTION MECHANISM. • $c_m = \sum_{t=0}^{T-1} \alpha^m_t \cdot h_t$ • Each $c_m$ is a row in the figure (and an additional input to the decoder), and each $\alpha^m_t$ is an orange square. https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
  • 7. Introduction: • RNNs factor computation along symbol time positions, generating $h_t$ from $h_{t-1}$, so training cannot be parallelized across time steps: $h_t = \tanh(W x_t + U h_{t-1} + b)$. • Attention is used with SOTA transduction RNNs to model dependencies regardless of their distance in the input or output sequences.
  • 8. Introduction: • Let's get rid of recurrence and rely entirely on attention to draw global dependencies between input and output. • The Transformer is born, significantly boosting parallelization and reaching a new SOTA in translation.
  • 10. The Transformer: We will have a new encoder-decoder structure without any recurrence: only fully connected layers (applied independently at every time-step) and self-attention to merge global information across the sequences. • The encoder maps $X = \{x_1, \dots, x_T\}$ to a sequence of continuous representations $Z = \{z_1, \dots, z_T\}$. • Given $Z$, the decoder generates $Y = \{y_1, \dots, y_N\}$. • Still auto-regressive! But no recurrent connections at all.
  • 12. Attention, Generic Formulation: • An attention function maps a query and a set of key-value pairs to an output, where query, keys, values, and output are all vectors: $o = f(q, k, v)$. • The output is computed as a weighted sum of the values. • The weight assigned to each value is computed by a compatibility function $g$ of the query with the corresponding key: $o = \sum_{t=0}^{T-1} g(q^i, k^i_t) \cdot v_t$
  • 14. Scaled Dot-Product Attention: • Input: queries and keys of dimension $d_k$ and values of dimension $d_v$. • Compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$ and apply Softmax to obtain the weights on the values. • FAST TRICK: compute the attention for a set of queries simultaneously by packing them into matrices $Q$, $K$, $V$: $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$
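As a rough illustration of the formula above, here is a minimal pure-Python sketch of scaled dot-product attention. The helper names and the toy matrices are assumptions chosen for readability. A real implementation would use batched matrix products.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs

# Hypothetical toy example: 2 queries, 3 key-value pairs, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the rows of `V`, so it always lies between the smallest and largest value in each column.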
  • 15. The Fault In Our Scale: Wait... why do we divide the output of the matching function between query and key by $\sqrt{d_k}$?
  • 16. The Fault In Our Scale: The two most commonly used attention methods (to merge $k$ and $q$) are: • Additive: an MLP with one hidden layer, where the vectors are concatenated at the input of the MLP. • Multiplicative: the dot-product seen here → MUCH faster and more space-efficient. For small values of $d_k$ both behave similarly, but additive attention outperforms unscaled dot-product attention for larger $d_k$. Suspicion: for large values of $d_k$, the dot-products grow large in magnitude, pushing the Softmax into regions with extremely small gradients. Assume the components of $q$ and $k$ are independent random variables with $\mu = 0$ and $\sigma = 1$; then $q \cdot k = \sum_{i=1}^{d_k} q_i \cdot k_i$ has $\mu = 0$ and $\sigma = \sqrt{d_k}$. We counteract this effect by scaling the dot products by $\frac{1}{\sqrt{d_k}}$.
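The variance argument can be checked empirically. The sketch below is an illustration only: the function name `dot_product_std`, the sample count, and the seed are arbitrary assumptions. It draws standard-normal $q$ and $k$ and estimates the standard deviation of $q \cdot k$, which should come out near $\sqrt{d_k}$ before scaling and near 1 after.

```python
import math
import random

def dot_product_std(d_k, trials=2000, seed=0):
    """Empirical standard deviation of q . k for standard-normal q, k of size d_k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    var = sum((d - mean) ** 2 for d in dots) / trials
    return math.sqrt(var)

for d_k in (4, 64):
    raw = dot_product_std(d_k)
    # Columns: d_k, std of the raw dot product, std after dividing by sqrt(d_k).
    print(d_k, round(raw, 2), round(raw / math.sqrt(d_k), 2))
```

The last column staying close to 1 for both sizes is what the $\frac{1}{\sqrt{d_k}}$ factor is meant to achieve.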
  • 17. Multi-Head Attention: Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
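  • 18. Multi-Head Attention: $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O$, with $\mathrm{head}_i = \mathrm{Attention}(Q W^Q_i, K W^K_i, V W^V_i)$, where $W^Q_i \in \mathbb{R}^{d_{model} \times d_k}$, $W^K_i \in \mathbb{R}^{d_{model} \times d_k}$, $W^V_i \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$. In this work $h = 8$ and $d_k = d_v = d_{model} / h = 64$.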
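The projections and the concatenation can be sketched as follows. This is a minimal illustration, not the reference implementation: the toy sizes ($d_{model} = 8$, $h = 2$), the random weights, and every helper name are assumptions made for this sketch.

```python
import math
import random

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on packed matrices."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(Q, K, V, heads, W_O):
    """Concat(head_1, ..., head_h) W^O, where heads holds (W_Q, W_K, W_V) triples."""
    outs = [
        attention(matmul(Q, W_Q), matmul(K, W_K), matmul(V, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate the per-head outputs position by position.
    concat = [sum((o[t] for o in outs), []) for t in range(len(Q))]
    return matmul(concat, W_O)

# Hypothetical toy configuration (the slides use d_model = 512, h = 8).
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

d_model, h = 8, 2
d_k = d_v = d_model // h
heads = [
    (rand_matrix(d_model, d_k), rand_matrix(d_model, d_k), rand_matrix(d_model, d_v))
    for _ in range(h)
]
W_O = rand_matrix(h * d_v, d_model)
X = rand_matrix(3, d_model)        # 3 positions
Y = multi_head(X, X, X, heads, W_O)  # self-attention: Q = K = V = X
```

Because $d_k = d_v = d_{model} / h$, the concatenated heads have width $h d_v = d_{model}$, so the output keeps the input shape.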
  • 21. Multi-Head Attention: The Transformer uses multi-head attention in three different ways: 1. Encoder-decoder attention layers: queries come from the previous decoder layer, and keys and values come from the output of the encoder. Every position in the decoder attends over all positions in the input sequence (the same type of attention as in classical seq2seq). 2. Encoder self-attention layers: all keys, values and queries come from the same place, the output of the previous encoder layer. Thus each position in the encoder can attend to all positions in the encoder's previous layer. 3. Decoder self-attention layers: the same mechanism, but leftward information flow must be prevented so that each position attends only to positions up to and including itself (the decoder must stay autoregressive).
  • 22. Decoder Attention Mask: Prevent leftward information flow inside scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the Softmax that correspond to "illegal" connections.
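A minimal sketch of this masking step, assuming a lower-triangular "allowed" pattern in which position $i$ may attend to positions $j \le i$. The function names are illustrative assumptions.

```python
import math

def causal_mask(T):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    """Row-wise softmax after setting disallowed scores to -inf."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(v - m) for v in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# With all-equal scores, each row spreads its weight uniformly over allowed positions.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
```

Masked positions receive exactly zero weight, so no information from later target positions reaches the current one.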
  • 23. Point-Wise Feed Forward Networks: Simply an MLP applied to each time position with the same parameters: $\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$. This can be seen as two 1D convolutions with kernel width 1. The dimensionality of input and output is $d_{model} = 512$ and the inner layer has dimensionality $d_{ff} = 2048$.
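The formula can be written out directly. The sketch below uses tiny hypothetical dimensions ($d_{model} = 2$, $d_{ff} = 3$) and hand-picked weights so the result is easy to follow by hand. The function names are assumptions.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)) + bias)
        for col, bias in zip(zip(*W1), b1)
    ]
    return [
        sum(hi * w for hi, w in zip(hidden, col)) + bias
        for col, bias in zip(zip(*W2), b2)
    ]

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same FFN, with the same parameters, independently at every position."""
    return [ffn(x, W1, b1, W2, b2) for x in X]

# Hypothetical toy parameters: W1 is d_model x d_ff, W2 is d_ff x d_model.
W1 = [[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

X = [[1.0, 2.0], [2.0, 1.0]]
Y = position_wise_ffn(X, W1, b1, W2, b2)
```

Since positions never interact inside `position_wise_ffn`, all of them can be processed in parallel.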
  • 25. The Transformer Block: If we mix a spoon of Multi-Head Attention, another of Point-Wise FFN, a pinch of residual connections and a spoon of Add&LayerNorm ops, we obtain the Transformer block.
  • 26. The Transformer Block: A stack of N of these blocks forms the whole Transformer END-TO-END network. Note the extra encoder-decoder attention in the decoder blocks.
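The Add&LayerNorm step wraps each sub-layer as LayerNorm(x + Sublayer(x)). A minimal sketch for a single position vector follows. The learned gain and bias of layer normalization are omitted for brevity, and the names `layer_norm` and `add_and_norm` are assumptions.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

# Hypothetical stand-in sub-layer; in the block it would be attention or the FFN.
y = add_and_norm([1.0, 2.0, 3.0], lambda v: [2.0 * a for a in v])
```

In an encoder block this wrapper would be applied twice per position, once around self-attention and once around the feed-forward network.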
  • 28. Embeddings: As in seq2seq models, we use learned embeddings to convert input tokens and output tokens to dense vectors of dimension $d_{model}$. There is also (of course) an output linear transformation, followed by a Softmax, that maps from $d_{model}$ to the number of classes. In the Transformer all 3 of these matrices are tied (they share the same parameters), and in the embedding layers the weights are multiplied by $\sqrt{d_{model}}$.
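One way to picture the tying and the scaling is the sketch below, which assumes a single shared matrix `E` (vocabulary size by $d_{model}$) used both for the lookup and for the output projection. The names and the toy matrix are hypothetical.

```python
import math

def embed(token_ids, E):
    """Look up rows of the shared matrix E and scale them by sqrt(d_model)."""
    scale = math.sqrt(len(E[0]))
    return [[scale * v for v in E[t]] for t in token_ids]

def output_logits(h, E):
    """Tied output projection: logits = h E^T, reusing the embedding matrix."""
    return [sum(a * b for a, b in zip(h, row)) for row in E]

# Hypothetical toy vocabulary of 3 tokens with d_model = 4.
E = [
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0],
]
X = embed([2, 0], E)             # embeddings for the token sequence [2, 0]
logits = output_logits(X[0], E)  # scores over the vocabulary for one vector
```

A Softmax over `logits` would give the class probabilities.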
  • 32. Positional Encoding: • Are we processing sequences? YES. • Are we taking care of this fact? NO. So let's work it out.
  • 33. Positional Encoding: • In order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. • Add positional encodings to the embeddings, summing them so that the positional information is merged into the input: $PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)$, $PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)$, where $i$ indexes the dimension and $pos$ is the position (time-step). Each dimension corresponds to a sinusoid, with wavelengths forming a geometric progression. The frequency and offset of the wave are different for each dimension.
  • 34. Positional Encoding: At every time-step we have a combination of sinusoids telling us where we are relative to the beginning (through a combination of phases). Advantage of these codes: they can generalize to sequence lengths not seen in training, since sinusoids are cyclic rather than growing indefinitely.
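The sinusoidal encoding can be tabulated in a few lines. In this sketch the loop variable `two_i` plays the role of $2i$ in the formula. The function name and the toy sizes are assumptions.

```python
import math

def positional_encoding(max_len, d_model):
    """Table with PE[pos][2i] = sin(pos / 10000^(2i/d_model)) and PE[pos][2i+1] = cos(same)."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for two_i in range(0, d_model, 2):
            angle = pos / (10000 ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)
    return pe

# Hypothetical toy table: 5 positions, d_model = 4.
pe = positional_encoding(5, 4)

# The encoding is summed with the (scaled) token embedding at each position.
embedding = [0.5, -0.5, 0.25, 0.0]  # made-up embedding vector for position 1
x_1 = [e + p for e, p in zip(embedding, pe[1])]
```

Position 0 always encodes to alternating 0 and 1, and higher dimensions oscillate more slowly as the wavelength grows.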
  • 37. Results: • On the WMT 2014 English-to-German translation task, the big Transformer model (Transformer (big)) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, setting a new SOTA of 28.4. • Training took 3.5 days on 8 P100 GPUs. Even the base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models. • On the WMT 2014 English-to-French translation task, the big model achieves a BLEU score of 41.0, outperforming all of the previously published single models at less than 1/4 of the training cost.
  • 43. Conclusions: • The Transformer is the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. • For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. • New SOTA on the WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks. • The code used to train and evaluate the original models is available at https://github.com/tensorflow/tensor2tensor.