2. Points
• A neural discriminative constituency parser - F1 93.55
• Chart parser/decoder
• Encoder-decoder style discriminative constituency parser - the architecture
• The structural meaning of multi-headed self-attention for constituency parsing
• 8-layer, 8-head transformer + BiLSTM decoder
• Analysis by input ablation: word, POS and position
• position vs. content (POS ⟺ morphology, ELMo/CharConcat)
• Metric of tree-structure accuracy: PARSEVAL
3. Constituency Parsing
Grammar structure: Chomsky CFG and the CKY algorithm; transition-based vs. chart parsers (NLP tutorial 10: (11↑))
Chart parser:
• Probability as score
• Bottom-up combination (bracketing per se)
• Beam search
[Figure: "Godfather" architecture overview — a Transformer encoder over word+POS+position input, a decomposition step, and a BiLSTM over fence points (3 4 5 6 7) for the decoder.]
5. Incrementally build up
Score for a bracket: (decoder)
How to deal with non-phrase spans?
• CKY: assign a tiny probability (PCFG)
• Chen (me): a <nil> tag / vector
• This research: fix s(i, j, ∅) = 0
(i, j are fence points; l is a label. Contrast: the alternatives train with ∅ or <nil>; a fixed zero needs no training.)
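The s(i, j, ∅) = 0 convention can be sketched as a small CKY-style chart decoder. Everything here is illustrative (the scores, the `decode`/`spans` names); only the zero-scored null label comes from the slide:

```python
from functools import lru_cache

NULL = "<null>"  # the "no phrase here" label, fixed at score 0

def decode(T, span_scores):
    """Best tree over fence points 0..T; span_scores maps (i, j, label) -> float."""
    labels = {lab for (_, _, lab) in span_scores}

    def label_score(i, j):
        # s(i, j, NULL) = 0 by convention: a real label is only chosen
        # when its score beats zero, so <nil> never needs training.
        best_sc, best_lab = 0.0, NULL
        for lab in labels:
            sc = span_scores.get((i, j, lab), float("-inf"))
            if sc > best_sc:
                best_sc, best_lab = sc, lab
        return best_sc, best_lab

    @lru_cache(maxsize=None)
    def best(i, j):
        sc, lab = label_score(i, j)
        if j - i == 1:                        # single word: no split point
            return sc, (lab, i, j)
        # choose the split k maximizing the two subtree scores
        k = max(range(i + 1, j), key=lambda m: best(i, m)[0] + best(m, j)[0])
        (ls, lt), (rs, rt) = best(i, k), best(k, j)
        return sc + ls + rs, (lab, i, j, lt, rt)

    return best(0, T)
```

Spans whose best real label scores below zero simply come out labeled `<null>`, which is exactly how non-phrase spans are handled without a trained <nil> vector.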
6. Position Embedding
[Figure: the word embedding, POS embedding, and position embedding, each of size dmodel, are added component-wise into the input Z ∈ R^(T×dmodel):]

zt = wt + mt + pt

zt is then fed into the Transformer, and the dimensionality dmodel is kept constant throughout the encoder.
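A minimal sketch of this input stage, assuming random illustrative embedding tables and sizes (none of the names or numbers come from the paper):

```python
import numpy as np

# Word, POS-tag, and position embeddings share d_model and are summed.
rng = np.random.default_rng(0)
T, d_model = 6, 16
W_word = rng.normal(size=(100, d_model))   # word embedding table
W_tag = rng.normal(size=(20, d_model))     # POS-tag embedding table
W_pos = rng.normal(size=(T, d_model))      # learned position table

word_ids = np.array([3, 17, 42, 7, 99, 5])
tag_ids = np.array([1, 2, 1, 3, 4, 2])

# z_t = w_t + m_t + p_t ; Z keeps shape (T, d_model) through the encoder
Z = W_word[word_ids] + W_tag[tag_ids] + W_pos[np.arange(T)]
```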
8. Encoder: linguistic Information
qt = WQ^T xt,  kt = WK^T xt,  vt = WV^T xt

p(i → j) ∝ exp(qi ⋅ kj)        (attention from position i to j)
v̄i = Σj p(i → j) vj            (values gathered at position i)

[Figure: each position forms its q, k, v; the attention distribution p(i → j) mixes the vj into v̄i.]
“gather information from up to 8 remote locations”
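The per-head computation above can be sketched as plain single-head self-attention (illustrative sizes; the scaled dot-product softmax is assumed, as in standard Transformers):

```python
import numpy as np

def self_attention(x, W_Q, W_K, W_V):
    """One head: q_t = W_Q^T x_t, k_t = W_K^T x_t, v_t = W_V^T x_t (rows = positions)."""
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    logits = q @ k.T / np.sqrt(k.shape[1])          # q_i . k_j, scaled
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)               # p(i -> j): rows sum to 1
    return p @ v                                    # v_bar_i = sum_j p(i -> j) v_j

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
Ws = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(x, *Ws)
```

With 8 such heads per layer, each position can indeed "gather information from up to 8 remote locations" at once.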
9. Decoder again
[Figure: a span over words wi … wj.]
Run a BiRNN once; run a FFN several times — T*(T+1)/2 times, once per span.
"92.67 F1 on Penn Treebank WSJ dev set"
"We must be the 2018 champion!" — so my heart almost shouts.
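The T*(T+1)/2 figure is just the number of spans over T words; a quick check (the `spans` helper is ours, not from the paper):

```python
# Fence points 0..T delimit T*(T+1)/2 spans over T words: the BiRNN runs
# once per sentence, but the span-labeling FFN runs once per span.
def spans(T):
    return [(i, j) for i in range(T) for j in range(i + 1, T + 1)]

T = 6
n = len(spans(T))  # 6 + 5 + 4 + 3 + 2 + 1 spans
```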
10. Analysis by Input Ablation
zt = wt + mt + pt
Word, POS and position embeddings are added together, so their information overlaps in zt. The full model attends with:

qt = WQ^T zt,  kt = WK^T zt,  vt = WV^T zt  →  p(i → j), v̄t

Position-only ablation — queries and keys come from the position embedding alone, while values still use the full input:

qt = WQ^T pt,  kt = WK^T pt,  vt = WV^T zt
Attention inputs can also be disabled layer-wise.
“it seems strange that content-based attention
benefits our model to such a small degree.”
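The position-only ablation can be sketched directly (illustrative sizes; `p_emb` stands for the position embeddings pt):

```python
import numpy as np

# Position-only ablation sketch: queries and keys are built from the
# position embeddings p_t alone, values from the full input z_t, so the
# attention pattern depends only on positions, not on content.
rng = np.random.default_rng(0)
T, d = 5, 8
z = rng.normal(size=(T, d))        # z_t = w_t + m_t + p_t (full input)
p_emb = rng.normal(size=(T, d))    # position embeddings p_t
W_Q, W_K, W_V = [rng.normal(size=(d, d)) for _ in range(3)]

q, k, v = p_emb @ W_Q, p_emb @ W_K, z @ W_V
logits = q @ k.T / np.sqrt(d)
att = np.exp(logits - logits.max(axis=1, keepdims=True))
att /= att.sum(axis=1, keepdims=True)   # p(i -> j), content-independent
out = att @ v
```

Because `att` never sees `z`, every sentence of the same length gets the identical attention pattern; content flows only through the values.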
11. Decomposition on i/w
zt = wt + mt + pt → F1 92.67
zt = [wt + mt; pt] → F1 92.60
1. Decompose input
2. Decompose attention
With q = q(c) + q(p) and k = k(c) + k(p):

q ⋅ k = (q(c) + q(p)) ⋅ (k(c) + k(p))
      = q(c)⋅k(c) + q(c)⋅k(p) + q(p)⋅k(c) + q(p)⋅k(p)

Cross terms: q(c)⋅k(p) + q(p)⋅k(c)
All of these mix up. An example of a cross term: "the word the always attends to the 5th position in the sentence".
xt = [x(c); x(p)]
c = Wx = [c(c); c(p)] = [W(c) x(c); W(p) x(p)]   (W block-diagonal: no cross terms)
F1 93.15 (+0.5)
all on dev set
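The block-diagonal factorization can be checked numerically: with W block-diagonal, q ⋅ k splits exactly into a content part plus a position part, with no cross terms (illustrative sizes and random weights):

```python
import numpy as np

# Factored attention sketch: content and position live in disjoint halves,
# x_t = [x_c; x_p], and the projections are block-diagonal, so the logits
# are exactly q_c.k_c + q_p.k_p -- no q_c.k_p / q_p.k_c cross terms.
rng = np.random.default_rng(0)
T, dc, dp = 5, 6, 4
x_c = rng.normal(size=(T, dc))     # content half of the input
x_p = rng.normal(size=(T, dp))     # position half of the input

Wq_c, Wk_c = rng.normal(size=(dc, dc)), rng.normal(size=(dc, dc))
Wq_p, Wk_p = rng.normal(size=(dp, dp)), rng.normal(size=(dp, dp))

# q = [q_c; q_p], k = [k_c; k_p]
q = np.concatenate([x_c @ Wq_c, x_p @ Wq_p], axis=1)
k = np.concatenate([x_c @ Wk_c, x_p @ Wk_p], axis=1)
logits = q @ k.T

content_part = (x_c @ Wq_c) @ (x_c @ Wk_c).T
position_part = (x_p @ Wq_p) @ (x_p @ Wk_p).T
```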
12. Analysis by Constraints
“When we began to investigate how the model makes use
of long-distance attention, we found that there are
particular attention heads at some layers in our model
that almost always attend to the start token.”
RECALL: There are 8 heads in each of the transformer layers.
“This suggests that the start token
is being used as the location for
some sentence-wide pooling/
processing, or perhaps as a
dummy target location when a
head fails to find the particular
phenomenon that it’s learned to
search for.”
In short, it is a dustbin for redundant attention.
[Table: "WinA" vs. "WinA + some spec" — trained with an attention window, then tested on dev.]
8 layers :)