Lexical-semantic resources play a key role in automatic text processing. In recent years, collaboratively constructed resources such as Wikipedia and Wiktionary have become an attractive alternative to classical expert-built resources such as WordNet, especially for under-resourced languages. Recent large-scale projects, for example YAGO, BabelNet, and UBY, aim to combine multiple lexical-semantic resources within a single system. In this talk, I will present word sense alignment as a task that is critical for combining lexical-semantic resources and exploiting their complementary strengths. In the word sense alignment task, a sense of a term (for example, Java as a programming language) has to be linked to its synonymous senses in multiple resources and separated from other senses of the same word (for example, Java as an island). The talk will cover two approaches to this task, one based on text similarity and one based on graphs, as well as their evaluation on pairs of lexical-semantic resources with different properties. Finally, I will give examples of how aligned lexical-semantic resources are used in automatic text processing.
Iryna Gurevych, "A programming language is not an island: word sense alignment in lexical-semantic resources"
1. Programming language is not an island: Word Sense
Alignment of Lexical-Semantic Resources
Iryna Gurevych
Joint work with: Judith Eckle-Kohler, Kostadin Cholakov, Silvana
Hartmann, Michael Matuschek, Christian M. Meyer
http://www.ukp.tu-darmstadt.de/data/uby
2. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
3. Text Analysis Needs Lexical-Semantic Knowledge
[Figure: an NLP application drawing on a lexical resource]
Which lexical resource to choose?
4. Resources are Largely Different
Different coverage of words/word senses
Different types of information
Encyclopedic vs. linguistic knowledge
Syntactic vs. semantic knowledge
…
Resource integration can significantly influence the performance
of your system! – Instead of choosing only one (best performing):
Why not combine multiple resources
and benefit from all their knowledge?
5. Overlap of Lexical Entries
[Venn diagram of lexical entries in Roget's Thesaurus (62,797), WordNet (149,502), and Wiktionary (364,663); the region sizes include 25,541, 56,240, 67,868, and 163,027 entries, with a common core of 28,650.]
Common vocabulary is rather small (28,650 entries).
Each resource contains a lot of "unique" words.
6. Overlap of Lexical Entries
[Word cloud: the overlap is surprisingly small; the vocabulary that is not shared covers slang, dialect, neologisms, named entities, natural sciences, computer science, math, biological taxonomy, social sciences, and the humanities.]
7. Word Sense Alignment
[Example: the senses of "to sing" in four different resources]
Resource 1: 1. To sing: To produce musical or harmonious sounds with one's voice. 2. To sing: To express audibly by means of a harmonious vocalization. 3. To sing: To confess under interrogation.
Resource 2 (German): 1. singen: Mit der Stimme harmonische Töne erzeugen. ("to produce harmonious tones with the voice")
Resource 3: 1. To sing: Produce tones with the voice. 2. To sing: divulge confidential information or secrets.
Resource 4: 1. To sing: To produce harmonious sounds with one's voice.
8. Prior Work on Linked Lexical Resources (LLR)
MEANING Multilingual Central Repository, Atserias et al. (2004)
YAGO, Suchanek et al. (2007)
SemLink, Palmer (2009)
Universal WordNet (UWN), de Melo and Weikum (2009)
eXtended WordFrameNet, Laparra and Rigau (2010)
BabelNet, Navigli and Ponzetto (2010)
NULEX, McFate and Forbus (2011)
UBY, Gurevych et al. (2012)
… many more, e.g. on the Semantic Web
9. Potential of Linked Lexical Resources
Increased coverage and enriched sense representations
Linking FrameNet, VerbNet, and WordNet for semantic parsing
(Shi and Mihalcea, 2005)
Linking VerbNet, FrameNet and PropBank for semantic role labeling
(Palmer, 2009)
Linking WordNet and Wikipedia for word sense disambiguation
(Navigli and Ponzetto, 2010)
Linking WordNet and Wiktionary for measuring verb similarity
(Meyer and Gurevych, 2012)
Linking OmegaWiki and Wiktionary for mining translations (McCrae
and Cimiano, 2013)
10. The Challenge: Heterogeneity of Resources
Different coverage: missing entities in one of the resources
Different granularity: entities are defined at different levels
Different perspectives: entities are defined for a different purpose
(Euzenat/Shvaiko, 2007)
11. Lemma Alignment
[Figure: lemma-level links between Wiktionary and WordNet]
Content integration at the lemma level is easy, but…
12. Word Sense Alignment
Content integration at the lemma level is easy, but…
[Figure: sense-level links between Wiktionary and WordNet]
…integration at the sense level is hard!
13. Word Sense Alignment
plant in Wiktionary:
(botany) An organism of the kingdom Plantae […]
(proscribed as biologically inaccurate) Any creature that grows on soil or similar surfaces, including plants and fungi.
A factory or other industrial or institutional building or facility.
(snooker) A play in which the cue ball knocks one (usually red) ball onto another […]
plant in WordNet:
buildings for carrying on industrial labor
(botany) a living organism lacking the power of locomotion
an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
Which Wiktionary sense corresponds to which WordNet sense?
14. The Alignment Process
[Figure, after Euzenat/Shvaiko (2007): a matching step takes resource 1 (r), resource 2 (r'), an initial alignment A (possibly empty), parameters p, and external knowledge k, and produces the output alignment A'.]
A' = f(r, r', A, p, k)
Can be generalized to multiple resources ("multi-alignment"):
A' = f(r1, …, rn, A, p, k)
(Euzenat/Shvaiko, 2007)
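For concreteness, a minimal Python sketch of this matching interface is given below; the type names and the function signature are illustrative assumptions that simply mirror the formula, not code from any of the cited systems.

```python
# Illustrative sketch of the generic matching interface A' = f(r, r', A, p, k).
# All names are hypothetical and only mirror the formula above.
from typing import Optional, Set, Tuple

SensePair = Tuple[str, str]   # (sense id in resource r, sense id in resource r')
Alignment = Set[SensePair]

def match(r, r_prime,
          initial_alignment: Optional[Alignment] = None,  # A, possibly empty
          parameters: Optional[dict] = None,               # p, e.g. thresholds
          knowledge=None) -> Alignment:                    # k, external knowledge
    """Return the output alignment A' for the two resources."""
    alignment: Alignment = set(initial_alignment or ())
    # ... matching logic: similarity-based, graph-based, or a combination ...
    return alignment
```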
15. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
16. Construction of aligned lexical resources
What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. Christian M. Meyer and Iryna Gurevych. In: Proceedings of IJCNLP, pp. 883-892, November 2011.
[Overview of sense-alignment publications: Niemann & Gurevych, IWCS 2011; Meyer & Gurevych, IJCNLP 2011; Matuschek & Gurevych, TACL 2013; Matuschek & Gurevych, COLING 2014; Miller & Gurevych, LREC 2014; Hartmann & Gurevych, ACL 2013. Colored markers (not recoverable from this transcript) classify each work as graph-based alignment, resource-independent alignment, text similarity-based alignment, and/or exploitation of existing LR alignments to produce new ones.]
17. Similarity-based Word Sense Alignment
Increased coverage
Enriched sense representations
18. Aligning Wiktionary and WordNet
A two-step approach:
1. Candidate extraction
2. Candidate disambiguation
[Figure: WordNet synsets such as {plant, works, industrial plant} on one side and Wiktionary senses such as plant (factory), plant (organism), plant (person), works (factory), works (machine), bird (animal), to fly (move), and reddish (color) on the other, with Wikipedia articles shown alongside.]
19. Aligning Wiktionary and WordNet
[Same figure: step 1, candidate extraction, connects each WordNet synset with the Wiktionary senses that share one of its lemmas, e.g. the synset {plant, works, industrial plant} with plant (factory), plant (organism), plant (person), works (factory), and works (machine).]
20. Aligning Wiktionary and WordNet
[Same figure: step 2, candidate disambiguation, rejects the incorrect candidate pairs (marked X) and keeps only the matching senses.]
21. Bag of Words Representation
A WordNet synset is represented by a bag of words built from its synonyms, gloss, and usage examples, optionally extended with its hypernyms, its hyponyms, or both.
A Wiktionary sense is represented by a bag of words built from the lemma, the sense definition, usage examples, and synonyms.
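The following sketch illustrates, under assumed and heavily simplified data structures, how such bag-of-words representations could be put together; the field names and example entries are hypothetical.

```python
# Hypothetical, heavily simplified sense records; real resources are accessed via their APIs.
wn_synset = {
    "synonyms": ["plant", "works", "industrial plant"],
    "gloss": "buildings for carrying on industrial labor",
    "examples": ["they built a large plant to manufacture automobiles"],
    "hypernyms": ["building complex"],
    "hyponyms": ["factory", "mill"],
}
wkt_sense = {
    "lemma": "plant",
    "definition": "A factory or other industrial or institutional building or facility.",
    "examples": [],
    "synonyms": ["factory"],
}

def tokens(text):
    return text.lower().split()

def synset_bow(synset, use_hypernyms=True, use_hyponyms=True):
    """Bag of words for a WordNet synset: synonyms, gloss, examples (+ hyper-/hyponyms)."""
    bow = list(synset["synonyms"]) + tokens(synset["gloss"])
    for example in synset["examples"]:
        bow += tokens(example)
    if use_hypernyms:
        bow += synset["hypernyms"]
    if use_hyponyms:
        bow += synset["hyponyms"]
    return bow

def sense_bow(sense):
    """Bag of words for a Wiktionary sense: lemma, definition, examples, synonyms."""
    bow = [sense["lemma"]] + tokens(sense["definition"]) + list(sense["synonyms"])
    for example in sense["examples"]:
        bow += tokens(example)
    return bow
```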
22. Candidate Disambiguation
A semantic relatedness measure is computed between the two bag-of-words representations:
COS: Cosine similarity
PPR: Personalized PageRank
If the resulting score s ≥ threshold, this pair of WordNet synset and Wiktionary sense is aligned; if s < threshold, no alignment is made.
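A toy sketch of the disambiguation step with the cosine measure is shown below (the Personalized PageRank variant is omitted); it reuses the bag-of-words helpers from the previous sketch, and the threshold value is an arbitrary assumption.

```python
from collections import Counter
from math import sqrt

def cosine(bow1, bow2):
    """Cosine similarity between two bags of words (term-frequency vectors)."""
    c1, c2 = Counter(bow1), Counter(bow2)
    dot = sum(c1[t] * c2[t] for t in c1)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def disambiguate(candidates, threshold=0.3):
    """Keep only candidate (synset, sense) pairs whose relatedness score reaches the threshold."""
    aligned = []
    for synset, sense in candidates:
        s = cosine(synset_bow(synset), sense_bow(sense))  # relatedness score s
        if s >= threshold:                                # s >= threshold -> align
            aligned.append((synset, sense, s))
    return aligned

print(disambiguate([(wn_synset, wkt_sense)]))
```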
23. Evaluation Dataset
Dataset creation:
No previous alignments existed, so no other evaluation datasets were available
We created a new dataset with 2,423 sense pairs
10 human raters (students/researchers from CS, math, linguistics)
Annotate each pair as “same meaning” or “different meaning”
Dataset reliability:
Inter-rater agreement: AO = .93, κ = .70
Removing two biased raters: AO = .94, κ = .74
Gold standard:
Majority vote of the remaining 8 raters, with an additional tie breaker
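For illustration only, the sketch below shows how observed agreement and Cohen's kappa can be computed for a pair of raters with scikit-learn; the slide does not state which multi-rater agreement variant was used, so this is merely a schematic analogue with made-up judgements.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical judgements of the same sense pairs by two raters:
# 1 = "same meaning", 0 = "different meaning".
rater_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]

observed = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)  # A_O
kappa = cohen_kappa_score(rater_a, rater_b)                              # chance-corrected
print(f"A_O = {observed:.2f}, kappa = {kappa:.2f}")
```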
24. Evaluation Results
RAND: Random baseline
MFS: Baseline always aligning the first sense (≈ most frequent sense)
Method    A (accuracy)  P     R     F1
RAND      .662          .212  .594  .313
MFS       .802          .329  .508  .399
COS only  .901          .598  .703  .646
PPR only  .915          .684  .636  .659
COS&PPR   .914          .674  .649  .661
Our approach significantly outperforms the baseline (at 1% level)
COS highest recall; PPR highest precision; COS&PPR highest F1
Significant difference of PPR, COS&PPR over COS (at 1% level)
No significant difference between PPR and COS&PPR
25. Error Analysis
110 false negatives:
“same meaning, but was not aligned”
Very different wording
"good discernment" vs. "ability to notice what others might miss"
Similar senses but slightly below threshold
“plants of the genus Centaurea” vs. “common weeds of the genus
Centaurea”
Pointing to another entry rather than a content-based gloss
pacification: “the process of pacifying”
26. Error Analysis
98 false positives:
“different meaning, but have been aligned”
Similar wording, but refer to different concepts
“a computer that provides client stations with access to files and
printers as shared resources to a computer network” vs. “any
computer attached to a network”
High relatedness, but generic- versus domain-specific vocabulary
“any computer attached to a network” vs. “any organization that
provides resources and facilities for a function or event”
27. Increased Coverage: Parts of Speech
Our alignment: 56,970 sense pairs
Final resource contains 488,988 word senses
Substantial increase in the coverage of senses
Wiktionary is not restricted to nouns/verbs/adjectives: proverbs,
idioms, collocations, particles, determiners, inflected forms, etc.
POS              Wiktionary AND WordNet  Additionally in Wiktionary  Additionally in WordNet
Nouns            34,464                  158,085                     47,651
Verbs            8,252                   29,119                      5,515
Adj./Adv.        14,236                  60,977                      7,541
Other POS        0                       16,778                      0
Inflected Forms  0                       106,328                     0
28. Increased Coverage: Domains
Domain           Wiktionary AND WordNet  Additionally in Wiktionary  Additionally in WordNet
Biology          4,465                   4,067                       12,869
Chemistry        2,561                   8,260                       2,268
Engineering      1,108                   940                         1,080
Geology          2,287                   2,898                       2,479
Humanities       4,949                   2,700                       5,060
IT               439                     3,032                       557
Linguistics      1,249                   1,011                       1,576
Math             615                     2,747                       483
Medicine         3,613                   3,728                       3,058
Military         574                     426                         585
Physics          1,246                   2,835                       1,252
Religion         733                     1,154                       781
Social Sciences  3,745                   2,907                       4,458
Sport            905                     2,821                       807
29. Enriched Sense Representation
From WordNet: synonyms, gloss, example sentence, subsumption hierarchy, synset organization, …
From Wiktionary: pronunciation, etymology, syntactic knowledge, quotations, related terms, translations, …
30. Selected Conclusions
Aligned Wiktionary – WordNet is characterized by:
(1) Increased coverage
Different parts of speech, not only nouns
e.g. humanities and social sciences from WordNet
e.g. technical domains and leisure from Wiktionary
(2) Enriched sense representation
Pronunciation, etymology, related terms, translations, etc.
Novel evaluation dataset annotated by 10 human raters
Better results with resource-structure-based and hybrid techniques in later work (Matuschek & Gurevych, TACL '13)
31. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
32. Construction of aligned lexical resources
Michael Matuschek and Iryna Gurevych: Dijkstra-WSA: A Graph-Based Approach to Word Sense Alignment, in: Transactions of the Association for Computational Linguistics (TACL), vol. 1, p. 151-164, May 2013.
[Same overview of sense-alignment publications as above, now highlighting Matuschek & Gurevych, TACL 2013.]
33. Similarity-Based Approaches Suffer From…
Different vocabulary employed by definitions
Example: English noun eye/discernment, e.g.,
she has an eye for fresh talent
he has an artist's eye
"good discernment (either visually or as if visually)"
vs. "ability to notice what others might miss"
→ low semantic relatedness score…
34. Solution: Use the Graph Topology
[Figure: the word senses of Java in one resource: Java1, Java2, Java3.]
35. Intuition of Graph Topology
[Figure: the senses Java1, Java2, Java3; the monosemous lexeme "programming language" (sense programming language1) is connected to the programming-language sense of Java.]
36. Intuition of Graph Topology
[Figure: the word senses of Java and the word senses of Ruby; the programming-language senses of both are connected to the monosemous lexeme "programming language" (programming language1).]
37. Intuition of Graph Topology
Related senses are in the same region of the graph.
[Same figure: the programming-language senses of Java and Ruby and the monosemous programming language1 lie in the same region of the graph.]
38. Dijkstra-WSA
Graph-based word sense alignment approach
Key ideas:
Represent lexical resources as graphs
Rely on trivial alignments as “reference nodes” and “bridges”
Use Dijkstra’s shortest path algorithm
to find alignments
Steps:
1. Graph construction
2. Computing sense alignments
(Matuschek/Gurevych, 2013)
39. Step 1: Graph Construction
Represent each lexical resource as an undirected graph L = (V, E) with
the set of nodes V representing senses or synsets
the set of edges E ⊆ V × V representing some kind of (semantic) similarity between a pair of nodes
An edge connects sense S1 and sense S2 if, for example…
There exists a semantic relation between S1 and S2
A monosemous lexeme W2 whose only sense is S2 occurs in the sense definition of S1
S1 and S2 share the same syntactic behavior
…
(Matuschek/Gurevych, 2013)
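A compact sketch of such a graph construction is given below, using networkx and assuming each resource is available as a simple mapping from sense ids to definitions and semantic relations; the data layout is an assumption for illustration.

```python
import networkx as nx

def build_graph(senses, lemma_senses):
    """senses: {sense_id: {"definition": str, "relations": [sense_id, ...]}}
    lemma_senses: {lemma: [sense_id, ...]}, used to spot monosemous lexemes."""
    g = nx.Graph()
    g.add_nodes_from(senses)
    for sense_id, data in senses.items():
        # Edges from explicit semantic relations (synonymy, hypernymy, ...).
        for target in data["relations"]:
            g.add_edge(sense_id, target)
        # Monosemous linking: edge to the only sense of a monosemous lexeme
        # occurring in this sense's definition.
        for token in data["definition"].lower().split():
            if len(lemma_senses.get(token, [])) == 1:
                g.add_edge(sense_id, lemma_senses[token][0])
    return g
```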
40. Step 1: Graph Construction
[Figure: the graph of resource 1 and the graph of resource 2; in each graph, sense nodes such as Java1, Java2, Java3, programming language1, and espresso1 are connected by edges representing some kind of (semantic) similarity between nodes.]
41. Step 2: Computing Sense Alignments
a) Create trivial alignments between the resources:
Trivial = lexeme is unique/monosemous in both resources
Example: programming language
Precision: >0.95
b) Identify alignment candidates
For example: nodes representing the same lemma
c) For all nodes still unaligned, find shortest paths to the
candidate nodes in the other graph
Trivial alignments serve as “bridges” between the graphs
Align the node pair with the shortest path
(Matuschek/Gurevych, 2013)
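The sketch below illustrates step 2 under simplifying assumptions: the two resource graphs from the previous sketch, a set of trivial alignments used as bridges, and a single path-length cutoff standing in for the full parameterisation (1:1 vs. 1:n alignment, search depth).

```python
import networkx as nx

def dijkstra_wsa(g1, g2, trivial_alignments, candidates, max_len=6):
    """trivial_alignments: iterable of (node in g1, node in g2) bridge pairs.
    candidates: {node in g1: [candidate nodes in g2]}, e.g. nodes sharing a lemma.
    Returns a 1:1 alignment as a dict (node in g1 -> node in g2)."""
    merged = nx.union(g1, g2, rename=("r1.", "r2."))
    for a, b in trivial_alignments:                      # bridges between the graphs
        merged.add_edge("r1." + a, "r2." + b)
    alignment = dict(trivial_alignments)
    for node, cands in candidates.items():
        if node in alignment:
            continue
        best, best_len = None, max_len + 1
        for cand in cands:
            try:
                d = nx.shortest_path_length(merged, "r1." + node, "r2." + cand)
            except nx.NetworkXNoPath:
                continue                                 # unreachable candidate (distance ∞)
            if d < best_len:
                best, best_len = cand, d
        if best is not None:                             # align the pair with the shortest path
            alignment[node] = best
    return alignment
```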
42. Step 2: Computing Sense Alignments
[Figure: trivial alignments, e.g. between the two programming language1 nodes and the two espresso1 nodes, act as bridges between the graph of resource 1 and the graph of resource 2.]
45. Step 2c: Shortest Paths to the Candidates
[Figure: for a still unaligned node, the shortest path lengths to its candidate nodes in the other graph are computed across the bridges; in the example the distances are 3, 5, and ∞.]
46. Step 2c: Align the Nodes
[Figure: the candidate pair with the shortest path (marked "!") is aligned.]
47. Parameter Choices
Restricting the number of alignments
Stop when the first candidate is found (1:1 alignment)
Keep going and align everything you can reach (1:n alignment)
Possibly with a restricted search depth
Graph construction
Use semantic relations, monosemous linking, or both
Get rid of relations to highly frequent monosemous lexemes (e.g., there is)
Limiting to rare lexemes avoids “explosion” of edges
Rare = only appearing in 1 / N of the definitions (e.g., N = 200)
Computing Sense Alignments
Path length L: unbounded L yields unmanageable runtime!
Best F1 score between 5 and 8, depending on the resource pair
48. Hybrid Approach
Main issue of Dijkstra-WSA
Low recall due to missing edges / sparse graph
Hybrid approach
Try to align using the graph first
Parameterized for high precision
Align those with no match using a similarity-based approach
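A schematic sketch of this two-step hybrid strategy is given below, reusing the Dijkstra-WSA and cosine helpers sketched earlier; the high-precision parameterisation of the graph pass is represented only by a small path-length cutoff, and the data layout is assumed.

```python
def hybrid_align(g1, g2, trivial, candidates, senses1, senses2, threshold=0.3):
    """senses1/senses2: {sense_id: {"definition": str, ...}} for the two resources."""
    # 1) Graph-based pass, parameterized for high precision (short paths only).
    alignment = dijkstra_wsa(g1, g2, trivial, candidates, max_len=3)
    # 2) Similarity-based backoff for everything the graph pass could not align.
    for node, cands in candidates.items():
        if node in alignment or not cands:
            continue
        scored = [(cosine(tokens(senses1[node]["definition"]),
                          tokens(senses2[c]["definition"])), c) for c in cands]
        best_score, best = max(scored)
        if best_score >= threshold:
            alignment[node] = best
    return alignment
```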
49. Evaluation Datasets
Sampled datasets:
WordNet – Wikipedia (1,815 sense pairs)
WordNet – Wiktionary (2,423 sense pairs)
FrameNet – Wiktionary (2,789 sense pairs)
WordNet – OmegaWiki (683 sense pairs)
Wiktionary – OmegaWiki (586 sense pairs)
Wiktionary – Wikipedia English (367 sense pairs)
Full datasets:
GermaNet – Wiktionary (45,636 sense pairs)
Wiktionary – Wikipedia German (31,808 sense pairs)
50. Datasets Display Different Properties
WordNet, OmegaWiki, Wikipedia: sense definitions and semantic
relations
Wiktionary: no disambiguated semantic relations => sparse graphs
GermaNet: very few sense definitions
51. Evaluation
[Bar chart comparing: random baseline, 1:1, 1st, similarity-based (SB), semantic relations (SR), linking monosemes (LM), SR + LM, SR + SB, LM + SB, SR + LM + SB, hybrid, and human performance (Matuschek/Gurevych, 2013).]
52. Evaluation
[Same chart, annotated: significant improvement in recall… (Matuschek/Gurevych, 2013)]
53. Evaluation
[Same chart, annotated: …and F-measure… (Matuschek/Gurevych, 2013)]
54. Evaluation
… also on all other
datasets!
55. Selected Conclusions
Dijkstra-WSA ≥ gloss similarity for densely linked LSRs
Generic alignment approach is valid
But: low recall for sparse LSRs (English Wiktionary, OmegaWiki)
Dijkstra-WSA + similarity-based backoff outperforms previous work on all datasets
The two notions of similarity are complementary
Could they be combined in a smarter way?
56. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
57. Construction of Aligned Lexical Resources
Michael Matuschek and Iryna Gurevych: High Performance Word Sense Alignment by Joint Modeling of Sense Distance and Gloss Similarity, in: Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), Dublin, Ireland.
[Same overview of sense-alignment publications as above, now highlighting Matuschek & Gurevych, COLING 2014.]
58. Joint Usage of Features
Similarity- and graph-based approaches both have weaknesses
Different formulation of glosses
Sparse / disconnected graphs
Two-step hybrid approach already helped improve recall
But: No real combination of both notions
Idea: Combine them using Machine Learning
Exploit the complementary strengths more effectively
59. Setup - Features
Features:
Gloss similarity (COS, PPR)
Dijkstra-WSA distances
Infinite distance if no target can be found
Other possible features:
Part of speech, sense index, translation overlap, example sentence
patterns
No significant improvement by using them!
Glosses and structure are sufficient
60. Setup - Classifiers
Classifiers used:
Naive Bayes
Bayesian Networks
Perceptrons
Support Vector Machines (SVMs)
Decision Trees
Evaluation using 10-fold cross validation
Same datasets as before
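The following scikit-learn sketch shows the joint modeling setup in miniature, assuming each candidate sense pair has already been reduced to the feature values named above (gloss similarities and Dijkstra-WSA distance) plus a gold label; the numbers are made up, and any of the listed classifiers can be swapped in.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# One row per candidate sense pair: [cosine gloss similarity, PPR similarity,
# Dijkstra-WSA path length] (a large constant stands in for "infinite" distance).
X = np.array([
    [0.72, 0.65, 2.0],
    [0.10, 0.08, 99.0],
    [0.55, 0.40, 4.0],
    [0.05, 0.02, 99.0],
])
y = np.array([1, 0, 1, 0])   # gold labels: 1 = aligned, 0 = not aligned

clf = GaussianNB()           # or SVC(), Perceptron(), DecisionTreeClassifier(), ...
scores = cross_val_score(clf, X, y, cv=2, scoring="f1")   # the talk uses 10-fold CV
print(scores.mean())
```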
61. Evaluation
[Bar chart comparing: random, 1:1, 1st, SB, DWSA, hybrid, SVM, Naive Bayes, Bayesian network, perceptron, decision tree, and human performance.]
62. Evaluation
[Same chart, annotated: general improvement in precision…]
63. Evaluation
[Same chart, annotated: …but in F-measure only for some of the datasets!]
64. Selected Conclusions
Better overall results on 4 out of 8 datasets
Machine learning helps most for sparse and incomplete LSRs like
OmegaWiki and Wiktionary
For "complete" LSRs like WordNet, we cannot gain much
Better precision on 7 out of 8
Most robust: Bayesian Networks
Complex classifiers (e.g. SVMs) challenged by skewed values
Main source of improvements:
Better classification of "borderline" examples
(high gloss similarity but large graph distance, or vice versa)
65. Borderline Example
Genome:
1. “The non-redundant genetic information stored in DNA sequences
that defines an individual organism”
2. “In the context of a genetic algorithm, the information that defines
an individual entity”
Very similar description
But: Far apart in the graph
=> No alignment!
66. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
67. Linked Lexical Resources
Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth: UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF, in: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), April 2012.
[Overview of LLR publications: Gurevych et al., EACL 2012; Eckle-Kohler et al., LREC 2012; Eckle-Kohler & Gurevych, EACL 2012; Eckle-Kohler et al., SWJ 2014; Eckle-Kohler et al., LMF 2013. Colored markers (not recoverable from this transcript) indicate: large-scale unified LR based on LMF; standardizing heterogeneous LRs; standardized format for subcat frames; language independence of lexicon models.]
68. UBY: Linking Lexical Resources
Two main characteristics of UBY:
- Word Sense Alignments
- Standardized Representation
[Figure: the resources linked in UBY, including expert-built and Web 2.0 resources and IMSLex-Subcat.]
69. Heterogeneity of Lexical Resources
Complementary information types
Different terminology
Incompatible data formats
70. Unified Lexical Resource UBY
Unified lexicon model
Preserves the variety of lexical information
Extensible
72. Structure Integration in UBY
(Eckle-Kohler et al. 2012)
73. Sense Alignments Enable Semantic Interoperability
Senses linked by SenseAxis class (over 1,000,000 instances)
English alignments, e.g. WordNet-Wikipedia
German alignments, e.g. GermaNet-Wiktionary
Cross-lingual alignments, e.g. WordNet-OmegaWiki DE
[Same "to sing" example as above: the synonymous senses across the four resources are linked via SenseAxis instances.]
74. Available Alignments
Wikipedia English—WordNet 83,192
Wiktionary English—WordNet 138,282
GermaNet—Wiktionary German 32,850
FrameNet—Wiktionary English 12,340
Wiktionary English—OmegaWiki English 34,509
WordNet—OmegaWiki German 27,529
Wiktionary German—Wikipedia German 21,872
Wiktionary English—Wikipedia English 66,050
WordNet—VerbNet 40,716
FrameNet—VerbNet 17,529
Wikipedia English—OmegaWiki English 3,960
Wikipedia German—OmegaWiki German 1,097
Wikipedia English—Wikipedia German 463,311
OmegaWiki English—OmegaWiki German 58,785
75. Resource Integration Workflow in UBY
[Figure: before integration, each resource is accessed through its own API (JWNL, FN API, JWPL, JWKTL) by human users and machines.]
76. Step 1. Structure Integration
[Figure: after structure integration, all resources reside in the UBY database and are accessed uniformly through the UBY API by human users and machines.]
77. Step 2. Content Integration
[Figure: content integration adds the sense links within UBY; human users and machines access the linked content through the UBY API.]
78. UBY Web UI – Textual View
Textual View: lets users list senses across all resources, display sense details, and compare senses.
79. UBY Web UI – Visual View
Visual View: lets users explore the sense alignments.
80. UBY Java API
The UBY API is open source at Google Code: http://code.google.com/p/uby/
Getting Started:
1. Download a UBY database dump
2. Import the dump into a MySQL database
3. Start using the UBY API
The UBY API is work in progress!
Many API methods need to be added – consider contributing!
81. UBY – Data and Tools
Web Interface: https://uby.ukp.informatik.tu-darmstadt.de/webui/
UBY Database Dumps: http://uby.ukp.informatik.tu-darmstadt.de/uby/
Open Source API (Java): http://code.google.com/p/uby/
82. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
83. Utilizing Linked Lexical Resources
Kostadin Cholakov, Judith Eckle-Kohler and Iryna Gurevych: Automated Verb Sense Labelling Based on Linked Lexical Resources, in: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), pp. 68-77, April 2014.
Michael Matuschek, Christian M. Meyer and Iryna Gurevych: Multilingual Knowledge in Aligned Wiktionary and OmegaWiki for Translation Applications, in: Translation: Corpora, Computation, Cognition (TC3), vol. 3, no. 1, p. 87-118, July 2013.
Michael Matuschek, Tristan Miller and Iryna Gurevych: A Language-independent Sense Clustering Approach for Enhanced WSD, in: Proceedings of the 12th Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2014), to appear.
[Overview of application publications: Cholakov et al., EACL 2014; Matuschek et al., KONVENS 2014; Matuschek et al., TC3 2013; Hartmann & Gurevych, ACL 2013; Hartmann et al., 2014 (in preparation). Colored markers (not recoverable from this transcript) indicate: sense annotation/disambiguation; machine/computer-assisted translation; semantic role labelling; cross-language transfer of lexical-semantic resources.]
84. Automatic Verb Sense Labelling of Corpora
Motivation
Automatically create verb sense-annotated corpora as training data for
supervised approaches
Approach
1. Create sense patterns from UBY (combining WordNet, FrameNet, VerbNet,
Wiktionary)
2. Compare these to patterns derived from corpus instances
3. Assign word sense in corpus if similarity is above a threshold
4. Use this data to train supervised systems (distant supervision)
Results
Significant improvement over MFS baseline for verb sense disambiguation on
MASC and Senseval-3
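A rough sketch of steps 2 and 3 is given below, assuming both the UBY-derived sense patterns and the corpus instances have been reduced to bags of features (e.g. subcategorization and lexical cues) and reusing the cosine helper from above; this representation and threshold are assumptions, not the paper's exact formulation.

```python
def label_instance(instance_features, sense_patterns, threshold=0.5):
    """instance_features: bag of features for one corpus occurrence of a verb.
    sense_patterns: {sense_id: bag of features derived from UBY}.
    Returns the best-matching sense id, or None if no pattern is similar enough."""
    best_sense, best_score = None, threshold
    for sense_id, pattern in sense_patterns.items():
        score = cosine(pattern, instance_features)
        if score > best_score:
            best_sense, best_score = sense_id, score
    return best_sense
```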
85. Using Alignments for Word Sense Clustering
Motivation
Cluster fine-grained word senses in expert-built resources to improve WSD
performance
Approach
1. Create alignments between resources using Dijkstra-WSA, allowing 1:n
alignments
Source: GermaNet, WordNet
Target: Wiktionary, Wikipedia, OmegaWiki
2. If two or more senses are aligned to the same sense in the other resource,
merge them into one coarse sense
3. Rescore state-of-the-art WSD algorithms on clustered sense inventory
Results
Significant improvement over random clusters of same granularity on
WebCAGe (GermaNet) and Senseval-3 (WordNet)
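A minimal sketch of the merging step: source senses that are aligned to the same target sense in the other resource are collapsed into one coarse sense; the input format is an assumed simplification.

```python
from collections import defaultdict

def cluster_senses(alignment_pairs):
    """alignment_pairs: iterable of (source sense, target sense) 1:n alignments,
    e.g. (WordNet sense, Wikipedia article). Returns the resulting sense clusters."""
    by_target = defaultdict(set)
    for source, target in alignment_pairs:
        by_target[target].add(source)
    # Two or more source senses aligned to the same target form one coarse sense.
    return [senses for senses in by_target.values() if len(senses) > 1]

# Example: two fine-grained senses aligned to the same article are merged.
print(cluster_senses([("plant%1", "Factory"), ("plant%2", "Factory"), ("plant%3", "Plant")]))
# -> [{'plant%1', 'plant%2'}]
```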
86. Using Aligned Resources for Computer-aided
Translation
Motivation
SMT systems help, but are not smart enough to replace manual translation
Approach
1. Create sense alignments between multilingual resources
2. Display information from all resources for a particular meaning
Results
Substantially more available translations and other information types
Example: “bass” in Wiktionary and OmegaWiki
87. Programming language is not an island!
Word Sense Alignment is vital for increasing coverage and
richness of sense representations
But: It is a hard problem!
Various approaches
Similarity-based, graph-based, combined
Performance depends on resources
Sparsity, availability of glosses,…
Machine learning shows most robust results
Aligned resources help improve performance for various
applications
Verb sense disambiguation, coarse-grained WSD, computer-aided translation
88. Future Work
1. Linked lexical resources (LLRs)
Integrating and aligning further resources in UBY
Special focus: cross-lingual alignment
2. Construction of aligned lexical resources
Investigating more elaborate similarity measures for glosses
Using different graph algorithms to better express similarity
Aligning several resources at once (n-way alignment)
3. Utilizing LLR for language processing
Unified deep learning framework utilizing linked resources
Distant supervision applied to semantic role labeling
Word sense disambiguation and lexical substitution for German
90. Sense Alignment of Lexical Resources
(References)
Elisabeth Niemann and Iryna Gurevych. The People’s Web Meets Linguistic Knowledge: Automatic Sense
Alignment of Wikipedia and WordNet. In: Proceedings of the 9th International Conference on Computational
Semantics (IWCS), p. 205-214, January 2011.
Christian M. Meyer and Iryna Gurevych. What Psycholinguists Know About Chemistry: Aligning Wiktionary and
WordNet for Increased Domain Coverage. In: Proceedings of the 5th International Joint Conference on Natural
Language Processing (IJCNLP), p. 883–892, November 2011.
Michael Matuschek and Iryna Gurevych. Dijkstra-WSA: A Graph-Based Approach to Word Sense Alignment.
Transactions of the Association for Computational Linguistics (TACL), vol. 1, p. 151-164, May 2013.
Silvana Hartmann and Iryna Gurevych. FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using
Wiktionary as Interlingual Connection. In: Proceedings of the 51st Annual Meeting of the Association for
Computational Linguistics (ACL), vol. 1, p. 1363-1373, August 2013.
Tristan Miller and Iryna Gurevych. WordNet-Wikipedia-Wiktionary: Construction of a Three-way Alignment. In:
Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), May 2014. (to
appear)
91. Linked Lexical Resources @ UKP
(References)
Judith Eckle-Kohler and Iryna Gurevych. Subcat-LMF – Fleshing out a Standardized Format for Subcategorization
Frame Interoperability. In: Proceedings of the 13th Conference of the European Chapter of the Association for
Computational Linguistics (EACL), p. 550-560, April 2012.
Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Michael Matuschek and Christian M. Meyer. UBY-LMF - A
Uniform Model for Standardizing Heterogeneous Lexical-Semantic Resources in ISO-LMF. In: Proceedings of the
8th International Conference on Language Resources and Evaluation (LREC), p. 275-282, May 2012.
Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Michael Matuschek and Christian M. Meyer. UBY-LMF -
Exploring the Boundaries of Language-Independent Lexicon Models. In: LMF Lexical Markup Framework, chap. 10,
p. 145-156, ISTE - HERMES - Wiley, 2013. ISBN 978 184 821 4309.
Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian
Wirth. UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF. In: Proceedings of the 13th
Conference of the European Chapter of the Association for Computational Linguistics (EACL), p. 580-590, April
2012.
Judith Eckle-Kohler, John Philip McCrae, and Christian Chiarcos. lemonUby - A Large, Interlinked, Syntactically-rich
Lexical Resource for Ontologies. Semantic Web Journal, March 2014.
92. Utilizing Linked Lexical Resources
(References)
Kostadin Cholakov, Judith Eckle-Kohler, and Iryna Gurevych. Automated Verb Sense Labelling Based on Linked
Lexical Resources. In: Proceedings of the 14th Conference of the European Chapter of the Association for
Computational Linguistics (EACL), p. 68-77, April 2014.
Silvana Hartmann and Iryna Gurevych. FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using
Wiktionary as Interlingual Connection. In: Proceedings of the 51st Annual Meeting of the Association for
Computational Linguistics (ACL), vol. 1, p. 1363-1373, August 2013.
Michael Matuschek, Tristan Miller, and Iryna Gurevych. A Language-independent Sense Clustering Approach for
Enhanced WSD. In: Proceedings of the 19th Conference on Empirical Methods in Natural Language Processing,
October 2014. (in submission)
Michael Matuschek, Christian M. Meyer, and Iryna Gurevych. Multilingual Knowledge in Aligned Wiktionary and
OmegaWiki for Translation Applications. Translation: Corpora, Computation, Cognition (TC3), vol. 3, no. 1, p. 87-
118, July 2013.