2. Points
• A neural discriminative constituency parser - F1 93.55
• Chart parser/decoder
• Encoder-decoder style discriminative constituency parser - the architecture
• The structural meaning of multi-headed self-attention for constituency parsing
• 8-layer, 8-head transformer + BiLSTM decoder
• Analysis by input ablation: word, POS and position
• position vs. content (POS ⟺ morphology, ELMo/CharConcat)
• Metric of tree-structure accuracy: PARSEVAL
3. Constituency Parsing
Grammar structure: Chomsky CFG and the CKY algorithm; transition-based vs. chart parsers (NLP tutorial 10: (11↑))
Chart parser:
• Probability as score
• Bottom-up combination (bracketing per se)
• Beam search
[Figure: "Godfather" architecture overview — a Transformer encoder over word+POS+position input, a decomposition step, and a BiLSTM over fence points (3 4 5 6 7) for the decoder.]
5. Incrementally build up
Score for a bracket: (decoder)
How to deal with non-phrase spans?
• CKY: assign a tiny probability (PCFG)
• Chen (me): a <nil> tag / vector
• This research: fix s(i, j, ∅) = 0
(i, j are fence points; l is a label. Contrast: the alternatives train with ∅ or <nil>; a fixed zero needs no training.)
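The s(i, j, ∅) = 0 convention can be sketched as a small CKY-style chart decoder. Everything here is illustrative (the scores, the `decode`/`spans` names); only the zero-scored null label comes from the slide:

```python
from functools import lru_cache

NULL = "<null>"  # the "no phrase here" label, fixed at score 0

def decode(T, span_scores):
    """Best tree over fence points 0..T; span_scores maps (i, j, label) -> float."""
    labels = {lab for (_, _, lab) in span_scores}

    def label_score(i, j):
        # s(i, j, NULL) = 0 by convention: a real label is only chosen
        # when its score beats zero, so <nil> never needs training.
        best_sc, best_lab = 0.0, NULL
        for lab in labels:
            sc = span_scores.get((i, j, lab), float("-inf"))
            if sc > best_sc:
                best_sc, best_lab = sc, lab
        return best_sc, best_lab

    @lru_cache(maxsize=None)
    def best(i, j):
        sc, lab = label_score(i, j)
        if j - i == 1:                        # single word: no split point
            return sc, (lab, i, j)
        # choose the split k maximizing the two subtree scores
        k = max(range(i + 1, j), key=lambda m: best(i, m)[0] + best(m, j)[0])
        (ls, lt), (rs, rt) = best(i, k), best(k, j)
        return sc + ls + rs, (lab, i, j, lt, rt)

    return best(0, T)
```

Spans whose best real label scores below zero simply come out labeled `<null>`, which is exactly how non-phrase spans are handled without a trained <nil> vector.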
6. Position Embedding
[Figure: the word embedding, POS embedding, and position embedding, each of size dmodel, are added component-wise into the input Z ∈ R^(T×dmodel):]

zt = wt + mt + pt

zt is then fed into the Transformer, and the dimensionality dmodel is kept constant throughout the encoder.
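A minimal sketch of this input stage, assuming random illustrative embedding tables and sizes (none of the names or numbers come from the paper):

```python
import numpy as np

# Word, POS-tag, and position embeddings share d_model and are summed.
rng = np.random.default_rng(0)
T, d_model = 6, 16
W_word = rng.normal(size=(100, d_model))   # word embedding table
W_tag = rng.normal(size=(20, d_model))     # POS-tag embedding table
W_pos = rng.normal(size=(T, d_model))      # learned position table

word_ids = np.array([3, 17, 42, 7, 99, 5])
tag_ids = np.array([1, 2, 1, 3, 4, 2])

# z_t = w_t + m_t + p_t ; Z keeps shape (T, d_model) through the encoder
Z = W_word[word_ids] + W_tag[tag_ids] + W_pos[np.arange(T)]
```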
8. Encoder: linguistic Information
qt = WQ^T xt,  kt = WK^T xt,  vt = WV^T xt

p(i → j) ∝ exp(qi ⋅ kj)        (attention from position i to j)
v̄i = Σj p(i → j) vj            (values gathered at position i)

[Figure: each position forms its q, k, v; the attention distribution p(i → j) mixes the vj into v̄i.]
“gather information from up to 8 remote locations”
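The per-head computation above can be sketched as plain single-head self-attention (illustrative sizes; the scaled dot-product softmax is assumed, as in standard Transformers):

```python
import numpy as np

def self_attention(x, W_Q, W_K, W_V):
    """One head: q_t = W_Q^T x_t, k_t = W_K^T x_t, v_t = W_V^T x_t (rows = positions)."""
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    logits = q @ k.T / np.sqrt(k.shape[1])          # q_i . k_j, scaled
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)               # p(i -> j): rows sum to 1
    return p @ v                                    # v_bar_i = sum_j p(i -> j) v_j

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
Ws = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(x, *Ws)
```

With 8 such heads per layer, each position can indeed "gather information from up to 8 remote locations" at once.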
9. Decoder again
[Figure: a span over words wi … wj.]
Run a BiRNN once; run a FFN several times — T*(T+1)/2 times, once per span.
"92.67 F1 on Penn Treebank WSJ dev set"
"We must be the 2018 champion!" — so my heart almost shouts.
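The T*(T+1)/2 figure is just the number of spans over T words; a quick check (the `spans` helper is ours, not from the paper):

```python
# Fence points 0..T delimit T*(T+1)/2 spans over T words: the BiRNN runs
# once per sentence, but the span-labeling FFN runs once per span.
def spans(T):
    return [(i, j) for i in range(T) for j in range(i + 1, T + 1)]

T = 6
n = len(spans(T))  # 6 + 5 + 4 + 3 + 2 + 1 spans
```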
10. Analysis by Input Ablation
zt = wt + mt + pt
Word, POS and position embeddings are added together, so their information overlaps in zt. The full model attends with:

qt = WQ^T zt,  kt = WK^T zt,  vt = WV^T zt  →  p(i → j), v̄t

Position-only ablation — queries and keys come from the position embedding alone, while values still use the full input:

qt = WQ^T pt,  kt = WK^T pt,  vt = WV^T zt
Attention inputs can also be disabled layer-wise.
“it seems strange that content-based attention
benefits our model to such a small degree.”
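The position-only ablation can be sketched directly (illustrative sizes; `p_emb` stands for the position embeddings pt):

```python
import numpy as np

# Position-only ablation sketch: queries and keys are built from the
# position embeddings p_t alone, values from the full input z_t, so the
# attention pattern depends only on positions, not on content.
rng = np.random.default_rng(0)
T, d = 5, 8
z = rng.normal(size=(T, d))        # z_t = w_t + m_t + p_t (full input)
p_emb = rng.normal(size=(T, d))    # position embeddings p_t
W_Q, W_K, W_V = [rng.normal(size=(d, d)) for _ in range(3)]

q, k, v = p_emb @ W_Q, p_emb @ W_K, z @ W_V
logits = q @ k.T / np.sqrt(d)
att = np.exp(logits - logits.max(axis=1, keepdims=True))
att /= att.sum(axis=1, keepdims=True)   # p(i -> j), content-independent
out = att @ v
```

Because `att` never sees `z`, every sentence of the same length gets the identical attention pattern; content flows only through the values.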
11. Decomposition on i/w
zt = wt + mt + pt → F1 92.67
zt = [wt + mt; pt] → F1 92.60
1. Decompose input
2. Decompose attention
With q = q(c) + q(p) and k = k(c) + k(p):

q ⋅ k = (q(c) + q(p)) ⋅ (k(c) + k(p))
      = q(c)⋅k(c) + q(c)⋅k(p) + q(p)⋅k(c) + q(p)⋅k(p)

Cross terms: q(c)⋅k(p) + q(p)⋅k(c)
All of these mix up. An example of a cross term: "the word the always attends to the 5th position in the sentence".
xt = [x(c); x(p)]
c = Wx = [c(c); c(p)] = [W(c) x(c); W(p) x(p)]   (W block-diagonal: no cross terms)
F1 93.15 (+0.5)
all on dev set
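The block-diagonal factorization can be checked numerically: with W block-diagonal, q ⋅ k splits exactly into a content part plus a position part, with no cross terms (illustrative sizes and random weights):

```python
import numpy as np

# Factored attention sketch: content and position live in disjoint halves,
# x_t = [x_c; x_p], and the projections are block-diagonal, so the logits
# are exactly q_c.k_c + q_p.k_p -- no q_c.k_p / q_p.k_c cross terms.
rng = np.random.default_rng(0)
T, dc, dp = 5, 6, 4
x_c = rng.normal(size=(T, dc))     # content half of the input
x_p = rng.normal(size=(T, dp))     # position half of the input

Wq_c, Wk_c = rng.normal(size=(dc, dc)), rng.normal(size=(dc, dc))
Wq_p, Wk_p = rng.normal(size=(dp, dp)), rng.normal(size=(dp, dp))

# q = [q_c; q_p], k = [k_c; k_p]
q = np.concatenate([x_c @ Wq_c, x_p @ Wq_p], axis=1)
k = np.concatenate([x_c @ Wk_c, x_p @ Wk_p], axis=1)
logits = q @ k.T

content_part = (x_c @ Wq_c) @ (x_c @ Wk_c).T
position_part = (x_p @ Wq_p) @ (x_p @ Wk_p).T
```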
12. Analysis by Constraints
“When we began to investigate how the model makes use
of long-distance attention, we found that there are
particular attention heads at some layers in our model
that almost always attend to the start token.”
RECALL: There are 8 heads in each of the transformer layers.
“This suggests that the start token
is being used as the location for
some sentence-wide pooling/
processing, or perhaps as a
dummy target location when a
head fails to find the particular
phenomenon that it’s learned to
search for.”
In short, it is a dustbin for redundant attention.
[Table: "WinA" vs. "WinA + some spec" — trained with an attention window, then tested on dev.]
8 layers :)