Lexical-semantic resources play a key role in automatic text processing. In recent years, collaboratively constructed resources such as Wikipedia and Wiktionary have become an attractive alternative to classical expert-built resources such as WordNet, especially for under-resourced languages. Recent large-scale projects, for example YAGO, BabelNet, and UBY, aim to combine multiple lexical-semantic resources within a single system. In this talk, I will present word sense alignment as a task that is critical for combining lexical-semantic resources and exploiting their complementary strengths. In the word sense alignment task, a sense of a term (for example, Java as a programming language) has to be linked to its synonymous senses in multiple resources and separated from other senses of the same word (for example, Java as an island). The talk will cover two approaches to this task, one based on text similarity and one based on graphs, as well as their evaluation on pairs of lexical-semantic resources with different properties. Finally, I will give examples of how aligned lexical-semantic resources are used in automatic text processing.
Iryna Gurevych, "A programming language is not an island: word sense alignment in lexical-semantic resources"
1. Programming language is not an island: Word Sense
Alignment of Lexical-Semantic Resources
Iryna Gurevych
Joint work with: Judith Eckle-Kohler, Kostadin Cholakov, Silvana
Hartmann, Michael Matuschek, Christian M. Meyer
http://www.ukp.tu-darmstadt.de/data/uby
2. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
3. Text Analysis Needs Lexical-Semantic Knowledge
[Figure: an NLP application drawing on a lexical resource]
Which lexical resource to choose?
4. Resources are Largely Different
Different coverage of words/word senses
Different types of information
Encyclopedic vs. linguistic knowledge
Syntactic vs. semantic knowledge
…
Resource integration can significantly influence the performance
of your system! – Instead of choosing only one (best performing):
Why not combine multiple resources
and benefit from all their knowledge?
5. Overlap of Lexical Entries
[Venn diagram of lexical entries in Roget's Thesaurus (62,797), WordNet (149,502), and Wiktionary (364,663); the region sizes include 25,541, 56,240, 67,868, and 163,027 entries, with a common core of 28,650.]
Common vocabulary is rather small (28,650 entries).
Each resource contains a lot of "unique" words.
6. Overlap of Lexical Entries
[Word cloud: the overlap is surprisingly small; the vocabulary that is not shared covers slang, dialect, neologisms, named entities, natural sciences, computer science, math, biological taxonomy, social sciences, and the humanities.]
7. Word Sense Alignment
[Example: the senses of "to sing" in four different resources]
Resource 1: 1. To sing: To produce musical or harmonious sounds with one's voice. 2. To sing: To express audibly by means of a harmonious vocalization. 3. To sing: To confess under interrogation.
Resource 2 (German): 1. singen: Mit der Stimme harmonische Töne erzeugen. ("to produce harmonious tones with the voice")
Resource 3: 1. To sing: Produce tones with the voice. 2. To sing: divulge confidential information or secrets.
Resource 4: 1. To sing: To produce harmonious sounds with one's voice.
8. Prior Work on Linked Lexical Resources (LLR)
MEANING Multilingual Central Repository, Atserias et al. (2004)
YAGO, Suchanek et al. (2007)
SemLink, Palmer (2009)
Universal WordNet (UWN), de Melo and Weikum (2009)
eXtended WordFrameNet, Laparra and Rigau (2010)
BabelNet, Navigli and Ponzetto (2010)
NULEX, McFate and Forbus (2011)
UBY, Gurevych et al. (2012)
… many more, e.g. on the Semantic Web
9. Potential of Linked Lexical Resources
Increased coverage and enriched sense representations
Linking FrameNet, VerbNet, and WordNet for semantic parsing
(Shi and Mihalcea, 2005)
Linking VerbNet, FrameNet and PropBank for semantic role labeling
(Palmer, 2009)
Linking WordNet and Wikipedia for word sense disambiguation
(Navigli and Ponzetto, 2010)
Linking WordNet and Wiktionary for measuring verb similarity
(Meyer and Gurevych, 2012)
Linking OmegaWiki and Wiktionary for mining translations (McCrae
and Cimiano, 2013)
10. The Challenge: Heterogeneity of Resources
Different coverage: missing entities in one of the resources
Different granularity: entities are defined at different levels
Different perspectives: entities are defined for a different purpose
(Euzenat/Shvaiko, 2007)
11. Lemma Alignment
[Figure: lemma-level links between Wiktionary and WordNet]
Content integration at the lemma level is easy, but…
12. Word Sense Alignment
Content integration at the lemma level is easy, but…
[Figure: sense-level links between Wiktionary and WordNet]
…integration at the sense level is hard!
13. Word Sense Alignment
plant in Wiktionary:
(botany) An organism of the kingdom Plantae […]
(proscribed as biologically inaccurate) Any creature that grows on soil or similar surfaces, including plants and fungi.
A factory or other industrial or institutional building or facility.
(snooker) A play in which the cue ball knocks one (usually red) ball onto another […]
plant in WordNet:
buildings for carrying on industrial labor
(botany) a living organism lacking the power of locomotion
an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
Which Wiktionary sense corresponds to which WordNet sense?
14. The Alignment Process
[Figure, after Euzenat/Shvaiko (2007): a matching step takes resource 1 (r), resource 2 (r'), an initial alignment A (possibly empty), parameters p, and external knowledge k, and produces the output alignment A'.]
A' = f(r, r', A, p, k)
Can be generalized to multiple resources ("multi-alignment"):
A' = f(r1, …, rn, A, p, k)
(Euzenat/Shvaiko, 2007)
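For concreteness, a minimal Python sketch of this matching interface is given below; the type names and the function signature are illustrative assumptions that simply mirror the formula, not code from any of the cited systems.

```python
# Illustrative sketch of the generic matching interface A' = f(r, r', A, p, k).
# All names are hypothetical and only mirror the formula above.
from typing import Optional, Set, Tuple

SensePair = Tuple[str, str]   # (sense id in resource r, sense id in resource r')
Alignment = Set[SensePair]

def match(r, r_prime,
          initial_alignment: Optional[Alignment] = None,  # A, possibly empty
          parameters: Optional[dict] = None,               # p, e.g. thresholds
          knowledge=None) -> Alignment:                    # k, external knowledge
    """Return the output alignment A' for the two resources."""
    alignment: Alignment = set(initial_alignment or ())
    # ... matching logic: similarity-based, graph-based, or a combination ...
    return alignment
```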
15. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
16. Construction of aligned lexical resources
What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. Christian M. Meyer and Iryna Gurevych. In: Proceedings of IJCNLP, pp. 883-892, November 2011.
[Overview of sense-alignment publications: Niemann & Gurevych, IWCS 2011; Meyer & Gurevych, IJCNLP 2011; Matuschek & Gurevych, TACL 2013; Matuschek & Gurevych, COLING 2014; Miller & Gurevych, LREC 2014; Hartmann & Gurevych, ACL 2013. Colored markers (not recoverable from this transcript) classify each work as graph-based alignment, resource-independent alignment, text similarity-based alignment, and/or exploitation of existing LR alignments to produce new ones.]
17. Similarity-based Word Sense Alignment
Increased coverage
Enriched sense representations
18. Aligning Wiktionary and WordNet
A two-step approach:
1. Candidate extraction
2. Candidate disambiguation
[Figure: WordNet synsets such as {plant, works, industrial plant} on one side and Wiktionary senses such as plant (factory), plant (organism), plant (person), works (factory), works (machine), bird (animal), to fly (move), and reddish (color) on the other, with Wikipedia articles shown alongside.]
19. Aligning Wiktionary and WordNet
[Same figure: step 1, candidate extraction, connects each WordNet synset with the Wiktionary senses that share one of its lemmas, e.g. the synset {plant, works, industrial plant} with plant (factory), plant (organism), plant (person), works (factory), and works (machine).]
20. Aligning Wiktionary and WordNet
[Same figure: step 2, candidate disambiguation, rejects the incorrect candidate pairs (marked X) and keeps only the matching senses.]
21. Bag of Words Representation
A WordNet synset is represented by a bag of words built from its synonyms, gloss, and usage examples, optionally extended with its hypernyms, its hyponyms, or both.
A Wiktionary sense is represented by a bag of words built from the lemma, the sense definition, usage examples, and synonyms.
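The following sketch illustrates, under assumed and heavily simplified data structures, how such bag-of-words representations could be put together; the field names and example entries are hypothetical.

```python
# Hypothetical, heavily simplified sense records; real resources are accessed via their APIs.
wn_synset = {
    "synonyms": ["plant", "works", "industrial plant"],
    "gloss": "buildings for carrying on industrial labor",
    "examples": ["they built a large plant to manufacture automobiles"],
    "hypernyms": ["building complex"],
    "hyponyms": ["factory", "mill"],
}
wkt_sense = {
    "lemma": "plant",
    "definition": "A factory or other industrial or institutional building or facility.",
    "examples": [],
    "synonyms": ["factory"],
}

def tokens(text):
    return text.lower().split()

def synset_bow(synset, use_hypernyms=True, use_hyponyms=True):
    """Bag of words for a WordNet synset: synonyms, gloss, examples (+ hyper-/hyponyms)."""
    bow = list(synset["synonyms"]) + tokens(synset["gloss"])
    for example in synset["examples"]:
        bow += tokens(example)
    if use_hypernyms:
        bow += synset["hypernyms"]
    if use_hyponyms:
        bow += synset["hyponyms"]
    return bow

def sense_bow(sense):
    """Bag of words for a Wiktionary sense: lemma, definition, examples, synonyms."""
    bow = [sense["lemma"]] + tokens(sense["definition"]) + list(sense["synonyms"])
    for example in sense["examples"]:
        bow += tokens(example)
    return bow
```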
22. Candidate Disambiguation
A semantic relatedness measure is computed between the two bag-of-words representations:
COS: Cosine similarity
PPR: Personalized PageRank
If the resulting score s ≥ threshold, this pair of WordNet synset and Wiktionary sense is aligned; if s < threshold, no alignment is made.
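A toy sketch of the disambiguation step with the cosine measure is shown below (the Personalized PageRank variant is omitted); it reuses the bag-of-words helpers from the previous sketch, and the threshold value is an arbitrary assumption.

```python
from collections import Counter
from math import sqrt

def cosine(bow1, bow2):
    """Cosine similarity between two bags of words (term-frequency vectors)."""
    c1, c2 = Counter(bow1), Counter(bow2)
    dot = sum(c1[t] * c2[t] for t in c1)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def disambiguate(candidates, threshold=0.3):
    """Keep only candidate (synset, sense) pairs whose relatedness score reaches the threshold."""
    aligned = []
    for synset, sense in candidates:
        s = cosine(synset_bow(synset), sense_bow(sense))  # relatedness score s
        if s >= threshold:                                # s >= threshold -> align
            aligned.append((synset, sense, s))
    return aligned

print(disambiguate([(wn_synset, wkt_sense)]))
```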
23. Evaluation Dataset
Dataset creation:
No previous alignments existed, so no other evaluation datasets were available
We created a new dataset with 2,423 sense pairs
10 human raters (students/researchers from CS, math, linguistics)
Annotate each pair as “same meaning” or “different meaning”
Dataset reliability:
Inter-rater agreement: AO = .93, κ = .70
Removing two biased raters: AO = .94, κ = .74
Gold standard:
Majority vote of the remaining 8 raters, with an additional tie breaker
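For illustration only, the sketch below shows how observed agreement and Cohen's kappa can be computed for a pair of raters with scikit-learn; the slide does not state which multi-rater agreement variant was used, so this is merely a schematic analogue with made-up judgements.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical judgements of the same sense pairs by two raters:
# 1 = "same meaning", 0 = "different meaning".
rater_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]

observed = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)  # A_O
kappa = cohen_kappa_score(rater_a, rater_b)                              # chance-corrected
print(f"A_O = {observed:.2f}, kappa = {kappa:.2f}")
```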
24. Evaluation Results
RAND: Random baseline
MFS: Baseline always aligning the first sense (≈ most frequent sense)
Method    A (accuracy)  P     R     F1
RAND      .662          .212  .594  .313
MFS       .802          .329  .508  .399
COS only  .901          .598  .703  .646
PPR only  .915          .684  .636  .659
COS&PPR   .914          .674  .649  .661
Our approach significantly outperforms the baseline (at 1% level)
COS highest recall; PPR highest precision; COS&PPR highest F1
Significant difference of PPR, COS&PPR over COS (at 1% level)
No significant difference between PPR and COS&PPR
25. Error Analysis
110 false negatives:
“same meaning, but was not aligned”
Very different wording
"good discernment" vs. "ability to notice what others might miss"
Similar senses but slightly below threshold
“plants of the genus Centaurea” vs. “common weeds of the genus
Centaurea”
Pointing to another entry rather than a content-based gloss
pacification: “the process of pacifying”
26. Error Analysis
98 false positives:
“different meaning, but have been aligned”
Similar wording, but refer to different concepts
“a computer that provides client stations with access to files and
printers as shared resources to a computer network” vs. “any
computer attached to a network”
High relatedness, but generic- versus domain-specific vocabulary
“any computer attached to a network” vs. “any organization that
provides resources and facilities for a function or event”
27. Increased Coverage: Parts of Speech
Our alignment: 56,970 sense pairs
Final resource contains 488,988 word senses
Substantial increase in the coverage of senses
Wiktionary is not restricted to nouns/verbs/adjectives: proverbs,
idioms, collocations, particles, determiners, inflected forms, etc.
POS              Wiktionary AND WordNet  Additionally in Wiktionary  Additionally in WordNet
Nouns            34,464                  158,085                     47,651
Verbs            8,252                   29,119                      5,515
Adj./Adv.        14,236                  60,977                      7,541
Other POS        0                       16,778                      0
Inflected Forms  0                       106,328                     0
28. Increased Coverage: Domains
Domain           Wiktionary AND WordNet  Additionally in Wiktionary  Additionally in WordNet
Biology          4,465                   4,067                       12,869
Chemistry        2,561                   8,260                       2,268
Engineering      1,108                   940                         1,080
Geology          2,287                   2,898                       2,479
Humanities       4,949                   2,700                       5,060
IT               439                     3,032                       557
Linguistics      1,249                   1,011                       1,576
Math             615                     2,747                       483
Medicine         3,613                   3,728                       3,058
Military         574                     426                         585
Physics          1,246                   2,835                       1,252
Religion         733                     1,154                       781
Social Sciences  3,745                   2,907                       4,458
Sport            905                     2,821                       807
29. Enriched Sense Representation
From WordNet: synonyms, gloss, example sentence, subsumption hierarchy, synset organization, …
From Wiktionary: pronunciation, etymology, syntactic knowledge, quotations, related terms, translations, …
30. Selected Conclusions
Aligned Wiktionary – WordNet is characterized by:
(1) Increased coverage
Different parts of speech, not only nouns
e.g. humanities and social sciences from WordNet
e.g. technical domains and leisure from Wiktionary
(2) Enriched sense representation
Pronunciation, etymology, related terms, translations, etc.
Novel evaluation dataset annotated by 10 human raters
Better results with resource-structure-based and hybrid techniques in later work (Matuschek & Gurevych, TACL '13)
31. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
32. Construction of aligned lexical resources
Michael Matuschek and Iryna Gurevych: Dijkstra-WSA: A Graph-Based Approach to Word Sense Alignment, in: Transactions of the Association for Computational Linguistics (TACL), vol. 1, p. 151-164, May 2013.
[Same overview of sense-alignment publications as above, now highlighting Matuschek & Gurevych, TACL 2013.]
33. Similarity-Based Approaches Suffer From…
Different vocabulary employed by definitions
Example: English noun eye/discernment, e.g.,
she has an eye for fresh talent
he has an artist's eye
"good discernment (either visually or as if visually)"
vs. "ability to notice what others might miss"
→ low semantic relatedness score…
34. Solution: Use the Graph Topology
[Figure: the word senses of Java in one resource: Java1, Java2, Java3.]
35. Intuition of Graph Topology
[Figure: the senses Java1, Java2, Java3; the monosemous lexeme "programming language" (sense programming language1) is connected to the programming-language sense of Java.]
36. Intuition of Graph Topology
[Figure: the word senses of Java and the word senses of Ruby; the programming-language senses of both are connected to the monosemous lexeme "programming language" (programming language1).]
37. Intuition of Graph Topology
Related senses are in the same region of the graph.
[Same figure: the programming-language senses of Java and Ruby and the monosemous programming language1 lie in the same region of the graph.]
38. Dijkstra-WSA
Graph-based word sense alignment approach
Key ideas:
Represent lexical resources as graphs
Rely on trivial alignments as “reference nodes” and “bridges”
Use Dijkstra’s shortest path algorithm
to find alignments
Steps:
1. Graph construction
2. Computing sense alignments
(Matuschek/Gurevych, 2013)
39. Step 1: Graph Construction
Represent each lexical resource as an undirected graph L = (V, E) with
the set of nodes V representing senses or synsets
the set of edges E ⊆ V × V representing some kind of (semantic) similarity between a pair of nodes
An edge connects sense S1 and sense S2 if, for example…
There exists a semantic relation between S1 and S2
A monosemous lexeme W2 whose only sense is S2 occurs in the sense definition of S1
S1 and S2 share the same syntactic behavior
…
(Matuschek/Gurevych, 2013)
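A compact sketch of such a graph construction is given below, using networkx and assuming each resource is available as a simple mapping from sense ids to definitions and semantic relations; the data layout is an assumption for illustration.

```python
import networkx as nx

def build_graph(senses, lemma_senses):
    """senses: {sense_id: {"definition": str, "relations": [sense_id, ...]}}
    lemma_senses: {lemma: [sense_id, ...]}, used to spot monosemous lexemes."""
    g = nx.Graph()
    g.add_nodes_from(senses)
    for sense_id, data in senses.items():
        # Edges from explicit semantic relations (synonymy, hypernymy, ...).
        for target in data["relations"]:
            g.add_edge(sense_id, target)
        # Monosemous linking: edge to the only sense of a monosemous lexeme
        # occurring in this sense's definition.
        for token in data["definition"].lower().split():
            if len(lemma_senses.get(token, [])) == 1:
                g.add_edge(sense_id, lemma_senses[token][0])
    return g
```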
40. Step 1: Graph Construction
[Figure: the graph of resource 1 and the graph of resource 2; in each graph, sense nodes such as Java1, Java2, Java3, programming language1, and espresso1 are connected by edges representing some kind of (semantic) similarity between nodes.]
41. Step 2: Computing Sense Alignments
a) Create trivial alignments between the resources:
Trivial = lexeme is unique/monosemous in both resources
Example: programming language
Precision: >0.95
b) Identify alignment candidates
For example: nodes representing the same lemma
c) For all nodes still unaligned, find shortest paths to the
candidate nodes in the other graph
Trivial alignments serve as “bridges” between the graphs
Align the node pair with the shortest path
(Matuschek/Gurevych, 2013)
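The sketch below illustrates step 2 under simplifying assumptions: the two resource graphs from the previous sketch, a set of trivial alignments used as bridges, and a single path-length cutoff standing in for the full parameterisation (1:1 vs. 1:n alignment, search depth).

```python
import networkx as nx

def dijkstra_wsa(g1, g2, trivial_alignments, candidates, max_len=6):
    """trivial_alignments: iterable of (node in g1, node in g2) bridge pairs.
    candidates: {node in g1: [candidate nodes in g2]}, e.g. nodes sharing a lemma.
    Returns a 1:1 alignment as a dict (node in g1 -> node in g2)."""
    merged = nx.union(g1, g2, rename=("r1.", "r2."))
    for a, b in trivial_alignments:                      # bridges between the graphs
        merged.add_edge("r1." + a, "r2." + b)
    alignment = dict(trivial_alignments)
    for node, cands in candidates.items():
        if node in alignment:
            continue
        best, best_len = None, max_len + 1
        for cand in cands:
            try:
                d = nx.shortest_path_length(merged, "r1." + node, "r2." + cand)
            except nx.NetworkXNoPath:
                continue                                 # unreachable candidate (distance ∞)
            if d < best_len:
                best, best_len = cand, d
        if best is not None:                             # align the pair with the shortest path
            alignment[node] = best
    return alignment
```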
42. Step 2: Computing Sense Alignments
[Figure: trivial alignments, e.g. between the two programming language1 nodes and the two espresso1 nodes, act as bridges between the graph of resource 1 and the graph of resource 2.]
45. Step 2c: Shortest Paths to the Candidates
[Figure: for a still unaligned node, the shortest path lengths to its candidate nodes in the other graph are computed across the bridges; in the example the distances are 3, 5, and ∞.]
46. Step 2c: Align the Nodes
[Figure: the candidate pair with the shortest path (marked "!") is aligned.]
47. Parameter Choices
Restricting the number of alignments
Stop when the first candidate is found (1:1 alignment)
Keep going and align everything you can reach (1:n alignment)
Possibly with a restricted search depth
Graph construction
Use semantic relations, monosemous linking, or both
Get rid of relations to highly frequent monosemous lexemes (e.g., there is)
Limiting to rare lexemes avoids “explosion” of edges
Rare = only appearing in 1 / N of the definitions (e.g., N = 200)
Computing Sense Alignments
Path length L: unbounded L yields unmanageable runtime!
Best F1 score between 5 and 8, depending on the resource pair
48. Hybrid Approach
Main issue of Dijkstra-WSA
Low recall due to missing edges / sparse graph
Hybrid approach
Try to align using the graph first
Parameterized for high precision
Align those with no match using a similarity-based approach
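A schematic sketch of this two-step hybrid strategy is given below, reusing the Dijkstra-WSA and cosine helpers sketched earlier; the high-precision parameterisation of the graph pass is represented only by a small path-length cutoff, and the data layout is assumed.

```python
def hybrid_align(g1, g2, trivial, candidates, senses1, senses2, threshold=0.3):
    """senses1/senses2: {sense_id: {"definition": str, ...}} for the two resources."""
    # 1) Graph-based pass, parameterized for high precision (short paths only).
    alignment = dijkstra_wsa(g1, g2, trivial, candidates, max_len=3)
    # 2) Similarity-based backoff for everything the graph pass could not align.
    for node, cands in candidates.items():
        if node in alignment or not cands:
            continue
        scored = [(cosine(tokens(senses1[node]["definition"]),
                          tokens(senses2[c]["definition"])), c) for c in cands]
        best_score, best = max(scored)
        if best_score >= threshold:
            alignment[node] = best
    return alignment
```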
49. Evaluation Datasets
Sampled datasets:
WordNet – Wikipedia (1,815 sense pairs)
WordNet – Wiktionary (2,423 sense pairs)
FrameNet – Wiktionary (2,789 sense pairs)
WordNet – OmegaWiki (683 sense pairs)
Wiktionary – OmegaWiki (586 sense pairs)
Wiktionary – Wikipedia English (367 sense pairs)
Full datasets:
GermaNet – Wiktionary (45,636 sense pairs)
Wiktionary – Wikipedia German (31,808 sense pairs)
50. Datasets Display Different Properties
WordNet, OmegaWiki, Wikipedia: sense definitions and semantic
relations
Wiktionary: no disambiguated semantic relations => sparse graphs
GermaNet: very few sense definitions
51. Evaluation
[Bar chart comparing: random baseline, 1:1, 1st, similarity-based (SB), semantic relations (SR), linking monosemes (LM), SR + LM, SR + SB, LM + SB, SR + LM + SB, hybrid, and human performance (Matuschek/Gurevych, 2013).]
52. Evaluation
[Same chart, annotated: significant improvement in recall… (Matuschek/Gurevych, 2013)]
53. Evaluation
[Same chart, annotated: …and F-measure… (Matuschek/Gurevych, 2013)]
54. Evaluation
… also on all other
datasets!
55. Selected Conclusions
Dijkstra-WSA ≥ gloss similarity for densely linked LSRs
Generic alignment approach is valid
But: low recall for sparse LSRs (English Wiktionary, OmegaWiki)
Dijkstra-WSA + similarity-based backoff outperforms previous work on all datasets
The two notions of similarity are complementary
Could they be combined in a smarter way?
56. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
57. Construction of Aligned Lexical Resources
Michael Matuschek and Iryna Gurevych: High Performance Word Sense Alignment by Joint Modeling of Sense Distance and Gloss Similarity, in: Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), Dublin, Ireland.
[Same overview of sense-alignment publications as above, now highlighting Matuschek & Gurevych, COLING 2014.]
58. Joint Usage of Features
Similarity- and graph-based approaches both have weaknesses
Different formulation of glosses
Sparse / disconnected graphs
Two-step hybrid approach already helped improve recall
But: No real combination of both notions
Idea: Combine them using Machine Learning
Exploit the complementary strengths more effectively
59. Setup - Features
Features:
Gloss similarity (COS, PPR)
Dijkstra-WSA distances
Infinite distance if no target can be found
Other possible features:
Part of speech, sense index, translation overlap, example sentence
patterns
No significant improvement by using them!
Glosses and structure are sufficient
60. Setup - Classifiers
Classifiers used:
Naive Bayes
Bayesian Networks
Perceptrons
Support Vector Machines (SVMs)
Decision Trees
Evaluation using 10-fold cross validation
Same datasets as before
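The following scikit-learn sketch shows the joint modeling setup in miniature, assuming each candidate sense pair has already been reduced to the feature values named above (gloss similarities and Dijkstra-WSA distance) plus a gold label; the numbers are made up, and any of the listed classifiers can be swapped in.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# One row per candidate sense pair: [cosine gloss similarity, PPR similarity,
# Dijkstra-WSA path length] (a large constant stands in for "infinite" distance).
X = np.array([
    [0.72, 0.65, 2.0],
    [0.10, 0.08, 99.0],
    [0.55, 0.40, 4.0],
    [0.05, 0.02, 99.0],
])
y = np.array([1, 0, 1, 0])   # gold labels: 1 = aligned, 0 = not aligned

clf = GaussianNB()           # or SVC(), Perceptron(), DecisionTreeClassifier(), ...
scores = cross_val_score(clf, X, y, cv=2, scoring="f1")   # the talk uses 10-fold CV
print(scores.mean())
```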
61. Evaluation
[Bar chart comparing: random, 1:1, 1st, SB, DWSA, hybrid, SVM, Naive Bayes, Bayesian network, perceptron, decision tree, and human performance.]
62. Evaluation
[Same chart, annotated: general improvement in precision…]
63. Evaluation
[Same chart, annotated: …but in F-measure only for some of the datasets!]
64. Selected Conclusions
Better overall results on 4 out of 8 datasets
Machine learning helps most for sparse and incomplete LSRs like
OmegaWiki and Wiktionary
For "complete" LSRs like WordNet, we cannot gain much
Better precision on 7 out of 8
Most robust: Bayesian Networks
Complex classifiers (e.g. SVMs) challenged by skewed values
Main source of improvements:
Better classification of "borderline" examples
(high gloss similarity but large graph distance, or vice versa)
65. Borderline Example
Genome:
1. “The non-redundant genetic information stored in DNA sequences
that defines an individual organism”
2. “In the context of a genetic algorithm, the information that defines
an individual entity”
Very similar description
But: Far apart in the graph
=> No alignment!
66. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
67. Linked Lexical Resources
Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth: UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF, in: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), April 2012.
[Overview of LLR publications: Gurevych et al., EACL 2012; Eckle-Kohler et al., LREC 2012; Eckle-Kohler & Gurevych, EACL 2012; Eckle-Kohler et al., SWJ 2014; Eckle-Kohler et al., LMF 2013. Colored markers (not recoverable from this transcript) indicate: large-scale unified LR based on LMF; standardizing heterogeneous LRs; standardized format for subcat frames; language independence of lexicon models.]
68. UBY: Linking Lexical Resources
Two main characteristics of UBY:
- Word Sense Alignments
- Standardized Representation
[Figure: the resources linked in UBY, including expert-built and Web 2.0 resources and IMSLex-Subcat.]
69. Heterogeneity of Lexical Resources
Complementary information types
Different terminology
Incompatible data formats
70. Unified Lexical Resource UBY
Unified lexicon model
Preserves the variety of lexical information
Extensible
72. Structure Integration in UBY
(Eckle-Kohler et al. 2012)
73. Sense Alignments Enable Semantic Interoperability
Senses linked by SenseAxis class (over 1,000,000 instances)
English alignments, e.g. WordNet-Wikipedia
German alignments, e.g. GermaNet-Wiktionary
Cross-lingual alignments, e.g. WordNet-OmegaWiki DE
[Same "to sing" example as above: the synonymous senses across the four resources are linked via SenseAxis instances.]
74. Available Alignments
Wikipedia English—WordNet 83,192
Wiktionary English—WordNet 138,282
GermaNet—Wiktionary German 32,850
FrameNet—Wiktionary English 12,340
Wiktionary English—OmegaWiki English 34,509
WordNet—OmegaWiki German 27,529
Wiktionary German—Wikipedia German 21,872
Wiktionary English—Wikipedia English 66,050
WordNet—VerbNet 40,716
FrameNet—VerbNet 17,529
Wikipedia English—OmegaWiki English 3,960
Wikipedia German—OmegaWiki German 1,097
Wikipedia English—Wikipedia German 463,311
OmegaWiki English—OmegaWiki German 58,785
75. Resource Integration Workflow in UBY
[Figure: before integration, each resource is accessed through its own API (JWNL, FN API, JWPL, JWKTL) by human users and machines.]
76. Step 1. Structure Integration
[Figure: after structure integration, all resources reside in the UBY database and are accessed uniformly through the UBY API by human users and machines.]
77. Step 2. Content Integration
[Figure: content integration adds the sense links within UBY; human users and machines access the linked content through the UBY API.]
78. UBY Web UI – Textual View
Textual View: lets users list senses across all resources, display sense details, and compare senses.
79. UBY Web UI – Visual View
Visual View: lets users explore the sense alignments.
80. UBY Java API
The UBY API is open source at Google Code: http://code.google.com/p/uby/
Getting Started:
1. Download a UBY database dump
2. Import the dump into a MySQL database
3. Start using the UBY API
The UBY API is work in progress!
Many API methods need to be added – consider contributing!
81. UBY – Data and Tools
Web Interface: https://uby.ukp.informatik.tu-darmstadt.de/webui/
UBY Database Dumps: http://uby.ukp.informatik.tu-darmstadt.de/uby/
Open Source API (Java): http://code.google.com/p/uby/
82. Outline
Motivation
Similarity-based Word Sense Alignment
Graph-based Word Sense Alignment
Joint Modeling of Features
Putting the Pieces Together: UBY
Applications of Linked Lexical Resources
83. Utilizing Linked Lexical Resources
Kostadin Cholakov, Judith Eckle-Kohler and Iryna Gurevych: Automated Verb Sense Labelling Based on Linked Lexical Resources, in: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2014), pp. 68-77, April 2014.
Michael Matuschek, Christian M. Meyer and Iryna Gurevych: Multilingual Knowledge in Aligned Wiktionary and OmegaWiki for Translation Applications, in: Translation: Corpora, Computation, Cognition (TC3), vol. 3, no. 1, p. 87-118, July 2013.
Michael Matuschek, Tristan Miller and Iryna Gurevych: A Language-independent Sense Clustering Approach for Enhanced WSD, in: Proceedings of the 12th Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2014), to appear.
[Overview of application publications: Cholakov et al., EACL 2014; Matuschek et al., KONVENS 2014; Matuschek et al., TC3 2013; Hartmann & Gurevych, ACL 2013; Hartmann et al., 2014 (in preparation). Colored markers (not recoverable from this transcript) indicate: sense annotation/disambiguation; machine/computer-assisted translation; semantic role labelling; cross-language transfer of lexical-semantic resources.]
84. Automatic Verb Sense Labelling of Corpora
Motivation
Automatically create verb sense-annotated corpora as training data for
supervised approaches
Approach
1. Create sense patterns from UBY (combining WordNet, FrameNet, VerbNet,
Wiktionary)
2. Compare these to patterns derived from corpus instances
3. Assign word sense in corpus if similarity is above a threshold
4. Use this data to train supervised systems (distant supervision)
Results
Significant improvement over MFS baseline for verb sense disambiguation on
MASC and Senseval-3
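A rough sketch of steps 2 and 3 is given below, assuming both the UBY-derived sense patterns and the corpus instances have been reduced to bags of features (e.g. subcategorization and lexical cues) and reusing the cosine helper from above; this representation and threshold are assumptions, not the paper's exact formulation.

```python
def label_instance(instance_features, sense_patterns, threshold=0.5):
    """instance_features: bag of features for one corpus occurrence of a verb.
    sense_patterns: {sense_id: bag of features derived from UBY}.
    Returns the best-matching sense id, or None if no pattern is similar enough."""
    best_sense, best_score = None, threshold
    for sense_id, pattern in sense_patterns.items():
        score = cosine(pattern, instance_features)
        if score > best_score:
            best_sense, best_score = sense_id, score
    return best_sense
```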
85. Using Alignments for Word Sense Clustering
Motivation
Cluster fine-grained word senses in expert-built resources to improve WSD
performance
Approach
1. Create alignments between resources using Dijkstra-WSA, allowing 1:n
alignments
Source: GermaNet, WordNet
Target: Wiktionary, Wikipedia, OmegaWiki
2. If two or more senses are aligned to the same sense in the other resource,
merge them into one coarse sense
3. Rescore state-of-the-art WSD algorithms on clustered sense inventory
Results
Significant improvement over random clusters of same granularity on
WebCAGe (GermaNet) and Senseval-3 (WordNet)
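A minimal sketch of the merging step: source senses that are aligned to the same target sense in the other resource are collapsed into one coarse sense; the input format is an assumed simplification.

```python
from collections import defaultdict

def cluster_senses(alignment_pairs):
    """alignment_pairs: iterable of (source sense, target sense) 1:n alignments,
    e.g. (WordNet sense, Wikipedia article). Returns the resulting sense clusters."""
    by_target = defaultdict(set)
    for source, target in alignment_pairs:
        by_target[target].add(source)
    # Two or more source senses aligned to the same target form one coarse sense.
    return [senses for senses in by_target.values() if len(senses) > 1]

# Example: two fine-grained senses aligned to the same article are merged.
print(cluster_senses([("plant%1", "Factory"), ("plant%2", "Factory"), ("plant%3", "Plant")]))
# -> [{'plant%1', 'plant%2'}]
```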
86. Using Aligned Resources for Computer-aided
Translation
Motivation
SMT systems help, but are not smart enough to replace manual translation
Approach
1. Create sense alignments between multilingual resources
2. Display information from all resources for a particular meaning
Results
Substantially more available translations and other information types
Example: “bass” in Wiktionary and OmegaWiki
87. Programming language is not an island!
Word Sense Alignment is vital for increasing coverage and
richness of sense representations
But: It is a hard problem!
Various approaches
Similarity-based, graph-based, combined
Performance depends on resources
Sparsity, availability of glosses,…
Machine learning shows most robust results
Aligned resources help improve performance for various
applications
Verb sense disambiguation, coarse-grained WSD, computer-aided translation
88. Future Work
1. Linked lexical resources (LLRs)
Integrating and aligning further resources in UBY
Special focus: cross-lingual alignment
2. Construction of aligned lexical resources
Investigating more elaborate similarity measures for glosses
Using different graph algorithms to better express similarity
Aligning several resources at once (n-way alignment)
3. Utilizing LLR for language processing
Unified deep learning framework utilizing linked resources
Distant supervision applied to semantic role labeling
Word sense disambiguation and lexical substitution for German
90. Sense Alignment of Lexical Resources
(References)
Elisabeth Niemann and Iryna Gurevych. The People’s Web Meets Linguistic Knowledge: Automatic Sense
Alignment of Wikipedia and WordNet. In: Proceedings of the 9th International Conference on Computational
Semantics (IWCS), p. 205-214, January 2011.
Christian M. Meyer and Iryna Gurevych. What Psycholinguists Know About Chemistry: Aligning Wiktionary and
WordNet for Increased Domain Coverage. In: Proceedings of the 5th International Joint Conference on Natural
Language Processing (IJCNLP), p. 883–892, November 2011.
Michael Matuschek and Iryna Gurevych. Dijkstra-WSA: A Graph-Based Approach to Word Sense Alignment.
Transactions of the Association for Computational Linguistics (TACL), vol. 1, p. 151-164, May 2013.
Silvana Hartmann and Iryna Gurevych. FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using
Wiktionary as Interlingual Connection. In: Proceedings of the 51st Annual Meeting of the Association for
Computational Linguistics (ACL), vol. 1, p. 1363-1373, August 2013.
Tristan Miller and Iryna Gurevych. WordNet-Wikipedia-Wiktionary: Construction of a Three-way Alignment. In:
Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), May 2014. (to
appear)
91. Linked Lexical Resources @ UKP
(References)
Judith Eckle-Kohler and Iryna Gurevych. Subcat-LMF – Fleshing out a Standardized Format for Subcategorization
Frame Interoperability. In: Proceedings of the 13th Conference of the European Chapter of the Association for
Computational Linguistics (EACL), p. 550-560, April 2012.
Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Michael Matuschek and Christian M. Meyer. UBY-LMF - A
Uniform Model for Standardizing Heterogeneous Lexical-Semantic Resources in ISO-LMF. In: Proceedings of the
8th International Conference on Language Resources and Evaluation (LREC), p. 275-282, May 2012.
Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Michael Matuschek and Christian M. Meyer. UBY-LMF -
Exploring the Boundaries of Language-Independent Lexicon Models. In: LMF Lexical Markup Framework, chap. 10,
p. 145-156, ISTE - HERMES - Wiley, 2013. ISBN 978 184 821 4309.
Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian
Wirth. UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF. In: Proceedings of the 13th
Conference of the European Chapter of the Association for Computational Linguistics (EACL), p. 580-590, April
2012.
Judith Eckle-Kohler, John Philip McCrae, and Christian Chiarcos. lemonUby - A Large, Interlinked, Syntactically-rich
Lexical Resource for Ontologies. Semantic Web Journal, March 2014.
92. Utilizing Linked Lexical Resources
(References)
Kostadin Cholakov, Judith Eckle-Kohler, and Iryna Gurevych. Automated Verb Sense Labelling Based on Linked
Lexical Resources. In: Proceedings of the 14th Conference of the European Chapter of the Association for
Computational Linguistics (EACL), p. 68-77, April 2014.
Silvana Hartmann and Iryna Gurevych. FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using
Wiktionary as Interlingual Connection. In: Proceedings of the 51st Annual Meeting of the Association for
Computational Linguistics (ACL), vol. 1, p. 1363-1373, August 2013.
Michael Matuschek, Tristan Miller, and Iryna Gurevych. A Language-independent Sense Clustering Approach for
Enhanced WSD. In: Proceedings of the 19th Conference on Empirical Methods in Natural Language Processing,
October 2014. (in submission)
Michael Matuschek, Christian M. Meyer, and Iryna Gurevych. Multilingual Knowledge in Aligned Wiktionary and
OmegaWiki for Translation Applications. Translation: Corpora, Computation, Cognition (TC3), vol. 3, no. 1, p. 87-
118, July 2013.