We combine transition-based dependency parsing with a high-performing but relatively underexplored machine learning technique, Robust Risk Minimization. During decoding, we judiciously prune the next parsing states using k-best ranking. Moreover, we apply a simple post-processing step to ensure robustness. We evaluate our approach on the CoNLL’09 shared task English data and improve transition-based dependency parsing accuracy, reaching a labeled attachment score of 89.28%. We also observe near-quadratic average running time in practice.
K-best, Locally Pruned, Transition-based Dependency Parsing Using Robust Risk Minimization
1. K-best, Locally-pruned, Transition-based Dependency Parsing Using Robust Risk Minimization
Jinho D. Choi
University of Colorado at Boulder
J. D. Power and Associates
September 9, 2009
2. Dependency Structure
• What is dependency?
- Syntactic or semantic relation between word-tokens
• Syntactic: NMOD (a beautiful woman)
• Semantic: LOC (places in this city), TMP (events in this year)
• Phrase structure vs. dependency structure
- Constituents vs. dependencies
[Figure: phrase-structure tree vs. dependency tree for "she bought a car" — constituents (NP, VP) vs. dependencies (SBJ: bought → she, OBJ: bought → car, DET: car → a)]
3. Dependency Graph
• For a sentence s = w_1 .. w_n, a dependency graph G_s = (V_s, E_s)
- V_s = {w_0 = root, w_1, ..., w_n}
- E_s = {(w_i, r, w_j) : w_i ≠ w_j, w_i ∈ V_s, w_j ∈ V_s − {w_0}, r ∈ R_s}
- R_s = the set of all dependency relations in s
• A well-formed dependency graph
- Unique root, single head, connected, acyclic → dependency tree (a minimal check is sketched below)
- Projective vs. non-projective
[Figure: projective dependency tree for "She bought a car" (O(n)) vs. non-projective tree for "She bought a car yesterday that was blue" (O(n²))]
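As a concrete companion to the definition above, here is a minimal Python sketch (my own, not from the slides) that checks the well-formedness conditions for a head assignment; the list encoding heads[j] = head of w_j is an assumption:

```python
def is_well_formed(heads):
    """heads[j] = index of the head of token j (1..n); heads[0] is unused (root).

    Checks the well-formedness conditions from the slide: unique root,
    single head per token, connectedness, and acyclicity.
    """
    n = len(heads) - 1
    # Single head: each token 1..n has exactly one head in 0..n and is not its own head.
    if any(not (0 <= heads[j] <= n) or heads[j] == j for j in range(1, n + 1)):
        return False
    # Unique root: exactly one token is attached directly to w0 (the artificial root).
    if sum(1 for j in range(1, n + 1) if heads[j] == 0) != 1:
        return False
    # Connected and acyclic: every token reaches w0 by following head links.
    for j in range(1, n + 1):
        seen, k = set(), j
        while k != 0:
            if k in seen:          # a cycle never reaches the root
                return False
            seen.add(k)
            k = heads[k]
    return True

# "She bought a car": heads[1]=2 (she<-bought), heads[2]=0 (root->bought),
#                     heads[3]=4 (a<-car),     heads[4]=2 (bought->car)
print(is_well_formed([0, 2, 0, 4, 2]))  # True
```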
4. Dependency Parsing Models
• Transition-based parsing model
- Transition: an operation that searches for a dependency
relation between each pair of words (e.g. Left-Arc, Shift, etc.)
- Greedy search that finds local optima (locally optimized transitions) → does better on short-distance dependencies
- Nivre’s algorithm (projective: O(n)), Covington’s algorithm (non-projective: O(n²))
• Graph-based parsing model
- Build a complete graph with directed/weighted edges and find
the tree with the highest score (sum of all weighted edges)
- Exhaustive search that finds the global optimum (maximum spanning tree) → does better on long-distance dependencies
- Eisner’s algorithm (projective: O(n²)), Edmonds’ algorithm (non-projective: O(n³))
5. Nivre’s List-based Algorithm
• Transition-based, non-projective dependency parsing algorithm
• λ1, λ2 : lists of partially processed tokens
• β : a list of remaining unprocessed tokens
• Initialization: (λ1, λ2, β, A) = ([0], [ ], [1, 2, ..., n], { })
  Termination: (λ1, λ2, β, A) = ([...], [...], [ ], {...})
• Deterministic shift vs. non-deterministic shift
(the four transitions are sketched in code below)
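Below is a minimal Python sketch of the four list-based transitions (Left-Arc, Right-Arc, No-Arc, Shift) acting on a configuration (λ1, λ2, β, A), following Nivre's published definitions as I understand them; the class and method names are illustrative, labels are omitted, and so is the classifier that chooses which transition to apply:

```python
class Configuration:
    """Parser state for Nivre's list-based algorithm: (lambda1, lambda2, beta, A)."""

    def __init__(self, n):
        self.l1 = [0]                       # lambda1: partially processed tokens (0 = root)
        self.l2 = []                        # lambda2: partially processed tokens
        self.beta = list(range(1, n + 1))   # remaining unprocessed tokens
        self.arcs = set()                   # A: set of (head, dependent) pairs

    def left_arc(self):
        # beta[0] becomes the head of the token on top of lambda1
        i, j = self.l1.pop(), self.beta[0]
        self.arcs.add((j, i))
        self.l2.insert(0, i)

    def right_arc(self):
        # the token on top of lambda1 becomes the head of beta[0]
        i, j = self.l1.pop(), self.beta[0]
        self.arcs.add((i, j))
        self.l2.insert(0, i)

    def no_arc(self):
        # move the top of lambda1 to lambda2 without adding an arc
        self.l2.insert(0, self.l1.pop())

    def shift(self):
        # move lambda2 and beta[0] back onto lambda1
        self.l1 += self.l2 + [self.beta.pop(0)]
        self.l2 = []

    def terminal(self):
        return not self.beta
```

The same object is reused after slide 25 to replay the walkthrough below.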
11. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root, she], λ2 = [ ], β = [bought, a, car], A = { }
• Initialize
• Shift : she
• Left-Arc : she ← bought
12. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root], λ2 = [she], β = [bought, a, car], A = { she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
13. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root], λ2 = [she], β = [bought, a, car], A = { she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
14. Nivre’s List-based Algorithm
root She bought a car
λ1 = [ ], λ2 = [she, root], β = [bought, a, car], A = { root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
15. Nivre’s List-based Algorithm
root She bought a car
λ1 = [ ], λ2 = [she, root], β = [bought, a, car], A = { root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
• Shift : root, she, bought
16. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root, she, bought], λ2 = [ ], β = [a, car], A = { root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
• Shift : root, she, bought
17. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root, she, bought], λ2 = [ ], β = [a, car], A = { root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
• Shift : root, she, bought
• Shift : a
18. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root, she, bought, a], λ2 = [ ], β = [car], A = { root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
• Shift : root, she, bought
• Shift : a
19. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root, she, bought, a], λ2 = [ ], β = [car], A = { root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
• Shift : root, she, bought
• Shift : a
• Left-Arc : a ← car
20. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root, she, bought], λ2 = [a], β = [car], A = { a ← car, root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
• Shift : root, she, bought
• Shift : a
• Left-Arc : a ← car
21. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root, she, bought], λ2 = [a], β = [car], A = { a ← car, root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
• Shift : root, she, bought
• Shift : a
• Left-Arc : a ← car
• Right-Arc : bought → car
22. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root, she], λ2 = [a, bought], β = [car], A = { bought → car, a ← car, root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
• Shift : root, she, bought
• Shift : a
• Left-Arc : a ← car
• Right-Arc : bought → car
23. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root, she], λ2 = [a, bought], β = [car], A = { bought → car, a ← car, root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
• Shift : root, she, bought
• Shift : a
• Left-Arc : a ← car
• Right-Arc : bought → car
• Shift : bought, a, car
24. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root, she, bought, a, car], λ2 = [ ], β = [ ], A = { bought → car, a ← car, root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
• Shift : root, she, bought
• Shift : a
• Left-Arc : a ← car
• Right-Arc : bought → car
• Shift : bought, a, car
25. Nivre’s List-based Algorithm
root She bought a car
λ1 = [root, she, bought, a, car], λ2 = [ ], β = [ ], A = { bought → car, a ← car, root → bought, she ← bought }
• Initialize
• Shift : she
• Left-Arc : she ← bought
• Right-Arc : root → bought
• Shift : root, she, bought
• Shift : a
• Left-Arc : a ← car
• Right-Arc : bought → car
• Shift : bought, a, car
• Terminate
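For reference, the transition sequence traced in slides 11-25 can be replayed with the Configuration sketch introduced after slide 5 (illustrative code, not the authors'):

```python
# Replay the trace for "She bought a car" (tokens 1..4, 0 = root).
c = Configuration(4)
c.shift()        # Shift : she
c.left_arc()     # Left-Arc : she <- bought
c.right_arc()    # Right-Arc : root -> bought
c.shift()        # Shift : root, she, bought
c.shift()        # Shift : a
c.left_arc()     # Left-Arc : a <- car
c.right_arc()    # Right-Arc : bought -> car
c.shift()        # Shift : bought, a, car
print(c.terminal(), sorted(c.arcs))
# True [(0, 2), (2, 1), (2, 4), (4, 3)]
# i.e. root->bought, bought->she, bought->car, car->a
```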
26. Robust Risk Minimization
• Linear binary classification algorithm
- Searches for a hyperplane h(x) = wᵀ·x − θ that separates two classes, -1 and 1, where class(x_i) = (h(x_i) < 0) ? -1 : 1.
- Finds ŵ and θ̂ that solve the RRM optimization problem (minimizing a robust estimate of the classification risk).
• Advantages
- Learns faster in the presence of irrelevant features (than the Perceptron).
- Deals with non-linearly separable data more flexibly.
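As a minimal, illustrative sketch of the decision rule above (prediction only; the RRM training objective is not reproduced here), assuming a sparse binary feature representation:

```python
def decision_value(w, theta, features):
    """h(x) = w·x - theta for a sparse binary feature set (w is a dict of weights)."""
    return sum(w.get(f, 0.0) for f in features) - theta

def classify(w, theta, features):
    # class(x) = -1 if h(x) < 0, else 1
    return -1 if decision_value(w, theta, features) < 0 else 1
```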
27. K-best, Locally-pruned Parsing
• RRM is a binary classification algorithm.
- One-against-all method using multiple binary classifiers, one per transition.
- What if more than one classifier predicts a transition?
• Pick the transition with the highest score.
• What if the highest scoring transition is not correct?
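A small sketch of this one-against-all scheme with k-best selection; the data layout (one weight vector and threshold per transition) and the names are my own assumptions, not the authors' implementation:

```python
def k_best_transitions(classifiers, features, k=2):
    """classifiers: {transition: (w, theta)}, one binary model per transition.

    Scores every candidate transition with its own classifier and keeps the
    k highest-scoring ones instead of committing to the single 1-best.
    """
    def score(w, theta):
        return sum(w.get(f, 0.0) for f in features) - theta

    ranked = sorted(((t, score(w, th)) for t, (w, th) in classifiers.items()),
                    key=lambda item: item[1], reverse=True)
    return ranked[:k]
```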
28. K-best, Locally-pruned Parsing
• Predicting a wrong transition at any state can generate a completely different tree from the gold-standard one.
• It is better to use the k-best transitions instead of the 1-best.
- Derive several trees and pick the one with the highest score.
- score(tree) = Σ score(transition), summed over the transitions used to derive the tree
- Problem with the above equation (addressed yesterday)
  • A tree derived by a longer sequence of transitions wins.
  • Normalize the score by the total number of transitions:
    score(tree) = 1/|T| · Σ_{t ∈ T} score(t), where T is the set of transitions used to derive the tree
29. Post-processing
• The output of the transition-based parser is not guaranteed to be a tree; it may be a forest.
- It is possible that some tokens have not found their heads.
- For each such token, compare it against all other tokens and pick the one that gives the highest score as the head.
- For each such w_j:
  • Compare it against all w_i (i < j) and see which w_i gives the highest-scoring Right-Arc transition.
  • Compare it against all w_k (j < k) and see which w_k gives the highest-scoring Left-Arc transition.
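A rough sketch of this post-processing pass; the scoring callbacks stand in for the parser's Right-Arc and Left-Arc classifiers and are assumptions on my part:

```python
def attach_headless(n, has_head, score_right_arc, score_left_arc):
    """Assign a head to every token j (1..n) that the parser left without one.

    score_right_arc(i, j): score of making w_i (i < j) the head of w_j
    score_left_arc(j, k):  score of making w_k (j < k) the head of w_j
    Returns {j: chosen_head}.
    """
    heads = {}
    for j in range(1, n + 1):
        if has_head[j]:
            continue
        candidates = [(score_right_arc(i, j), i) for i in range(0, j)]
        candidates += [(score_left_arc(j, k), k) for k in range(j + 1, n + 1)]
        heads[j] = max(candidates)[1]   # candidate head with the highest arc score
    return heads
```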
30. Feature Space
• About 14 million features
• f: form, m: lemma, p: pos-tag, d: dependency label
• lm(w): left-most dependent, ln(w): left-nearest dependent
  rm(w): right-most dependent, rn(w): right-nearest dependent
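As an illustration of how such templates turn into features, here is a hypothetical extraction function (the actual templates behind the ~14 million features are not listed on the slide):

```python
def basic_features(l1_top, b0):
    """Example feature strings for the top of lambda1 and the front of beta.

    Each token is a dict with 'f' (form), 'm' (lemma), 'p' (POS tag); real
    templates would also draw on dependents such as lm(w), rn(w), etc.
    """
    return [
        "l1.p=" + l1_top["p"],
        "b0.p=" + b0["p"],
        "l1.m=" + l1_top["m"],
        "b0.m=" + b0["m"],
        "l1.p|b0.p=" + l1_top["p"] + "_" + b0["p"],   # conjoined template
    ]
```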
31. Evaluation
• Models
I. Greedy search using the highest scoring transition
II. Best search using all predicted transitions
III. II + using the upper bound of 1
IV. III + using the lower bound of -0.1
V. III + using the lower bound of -0.2
VI. V + using top 2 scoring transitions
VII. VI + post-processing
32. Evaluation
• Parsing accuracies (%)
Model                        I      II     III    IV     V      VI     VII
Labeled Attachment Score     87.88  87.96  88.08  88.62  88.87  88.87  89.28
Unlabeled Attachment Score   89.21  89.34  89.42  90.12  90.47  90.47  90.97
33. Evaluation
• Average number of transitions per sentence
[Chart: models I, II-III, IV, V, and VI-VII compared across sentence-length bins 1-10, 11-20, 21-30, 31-40, 41-50, and > 50 tokens; y-axis 0 to 1,500 transitions]
34. Summary and Conclusions
• Summary
- Transition-based, non-projective dependency parsing
- k-best, locally pruned dependency parsing
- Post-processing
- Robust Risk Minimization
• Conclusions
- It is possible to achieve higher parsing accuracy by considering k-best, locally pruned trees, while keeping near-quadratic running time in practice.
35. Future Work
• Parsing Algorithm
- Search transitions on both the left and right sides of β[0].
- Beam search.
- Normalize scores and use priors for transitions.
• Feature
- Cut off features whose frequency falls below a threshold.
- Predicate-argument structure from frameset files.
• Machine learning algorithm
- Apply different values for learning parameters.
- Compare with Perceptron, Support Vector Machine.