ULM-1
Understanding Language
by Machines
The Borders of Ambiguity
Ruben Izquierdo
ruben.izquierdobevia@vu.nl
http://rubenizquierdobevia.com
Structure
 Part I
 The ULM-1 project
 Part II
 Error analysis on WSD
 Part III
 Using Background Information to Perform WSD
 Part IV
 What is next?
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 2
Who am I?
 Ruben Izquierdo Bevia
 Computer Science, Alicante, Spain 2004
 2004-2011 researcher at the University of Alicante
 September 2010, Alicante
 PhD Thesis: An approach to Word Sense Disambiguation based on Supervised Machine Learning and Semantic Classes
 Sept 2011 → Sept 2012
 DutchSemCor project (Tilburg and VU universities, NL)
 Sept 2012 → Sept 2014
 OpeNER project (VU University, NL)
 Sept 2014 → present
 ULM-1 Spinoza project
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity”
3
Part I
Understanding Language by
Machines
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 4
Understanding Language by Machines
 NWO (Netherlands Organization for Scientific Research)
 Spinoza Prize
 The highest Dutch award in science, for top researchers with an international reputation
 Piek Vossen was one of the three winners in 2013
 Some money for research → 4 ULM projects
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 5
Understanding Language by Machines
 Develop computer models that assign deeper meaning to language and approximate human understanding
 Use the models to automatically read and understand texts
 Words and texts are highly ambiguous
 Get a better understanding of the scope and complexity of this ambiguity
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 6
Understanding Language by Machines
 ULM-1: The borders of ambiguity
 Word relations and ambiguity
 Define the problem and find an optimal solution
 ULM-2: Word, Concept, Perception and Brain
 Relate words and meanings to perceptual data and brain activation patterns
 ULM-3: From timelines to storylines
 Interpretation of words and our way of interacting with the changing world
 Structure these changes as stories along explanatory motivations
 ULM-4: A quantum model of text understanding
 Technical model
 Move from pipeline approaches that take early decisions to a model where the final interpretation is carried out by high-order semantic and contextual models
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 7
Understanding Language by Machines
 ULM-1: The borders of ambiguity
 Word relations and ambiguity
 Define the problem and find an optimal solution
 ULM-2: Word, Concept, Perception and Brain
 Relate words and meanings to perceptual data and brain activation patterns
 ULM-3: From timelines to storylines
 Interpretation of words and our way of interacting with the changing world
 Structure these changes as stories along explanatory motivations
 ULM-4: A quantum model of text understanding
 Technical model
 Move from pipeline approaches that take early decisions to a model where the final interpretation is carried out by high-order semantic and contextual models
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 8
ULM-1: The Borders of
Ambiguity
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 9
Piek Vossen Marten Postma Ruben Izquierdo
Word Sense Disambiguation
WSD → “The problem of computationally determining which ‘sense’ of a word is activated by the use of that word in a particular context” (Agirre & Edmonds, 2006)
Our(1) project(14) looks(14) into(1) breaking(60) the(1) borders(10) of(1) ambiguity(1), for(1) which(1) the(1) queen(12) piece(18) is(13) an(1) example(1)
1,981,324,800 interpretations !!!
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 10
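The small numbers attached to each word on the slide are presumably per-word sense counts, and the figure above is their product. A minimal sketch of that arithmetic, assuming NLTK's WordNet interface (not necessarily the resource used for the slide):

```python
# Minimal sketch: multiply per-word WordNet sense counts to estimate how many
# candidate interpretations a sentence has. Assumes NLTK with WordNet data.
from math import prod
from nltk.corpus import wordnet as wn

sentence = ("Our project looks into breaking the borders of ambiguity "
            "for which the queen piece is an example").split()

sense_counts = []
for token in sentence:
    senses = wn.synsets(token)                 # all senses, any part of speech
    sense_counts.append(max(1, len(senses)))   # monosemous/unknown words count as 1

print(list(zip(sentence, sense_counts)))
print("candidate interpretations:", prod(sense_counts))
```

The printed product will not match the slide's figure exactly, since it depends on the WordNet version and on how multiwords and function words are counted.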
Classical Approaches
 Supervised approaches
 Require annotated data
 Problems with domain adaptation
 Knowledge based
 Dependent on the resources
 Unsupervised approaches
 Low performance
 Require large amounts of data
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 11
Still Unsolved
WSD is still considered to be “unsolved”
Competition | Year | Type | Baseline | Best F1
SensEval2 | 2001 | All-words | 57.0 | 69.0 (Sup)
SensEval3 | 2004 | All-words | 60.9 | 65.1 (Sup)
SemEval1 | 2007 | All-words (task 17) | 51.4 | 59.1 (Sup)
SemEval2 | 2010 | All-words on specific domain | 50.5 | 56.2 (Kb)
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 12
General Trends
 Look at WSD purely as a classification problem
 Focus more on the low-level algorithm than on the WSD problem itself
 Poor representation of the context
 Following the idea: “the more features, the better the performance”
 Usually bag-of-words features
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 13
… but … what about the
discourse and background
information?
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 14
Discourse and Background
Knowledge
The winner will walk away with $1.5 million
source: http://www.southafrica.info/news/sport/golf-nedbank-210613.htm#.VEAWkYusVW8
Creation time: 21 June 2013
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 15
Discourse and Background
Knowledge
The winner will walk away with $1.5 million
source: http://www.southafrica.info/news/sport/golf-nedbank-210613.htm#.VEAWkYusVW8
Creation time: 21 June 2013
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 16
Winner → the contestant who wins the contest (WordNet synset ENG30-10782940-n)
Discourse and Background
Knowledge
The winner will walk away with $1.5 million
source: http://www.southafrica.info/news/sport/golf-nedbank-210613.htm#.VEAWkYusVW8
Creation time: 21 June 2013
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 17
The winner won the Nedbank Golf Challenge
Discourse and Background
Knowledge
The winner will walk away with $1.5 million
source: http://www.southafrica.info/news/sport/golf-nedbank-210613.htm#.VEAWkYusVW8
Creation time: 21 June 2013
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 18
The winner was → Thomas Bjørn
Borders of Ambiguity
Lexical WSD: WordNet sense of winner
Discourse information: “winner” is the winner of the
Nedbank Golf Challenge
Referential WSD: the “winner” is Thomas Bjørn
WordNet
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 19
The Role of Background
knowledge
“One of the best moves by Gary Kasparov which includes a queen sacrifice…”
Source: http://www.chess.com/forum/view/chess-players/kasparov-queen-sacrifice
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 20
The Role of Background
knowledge
“One of the best moves by Gary Kasparov which includes a queen sacrifice…”
Source: http://www.chess.com/forum/view/chess-players/kasparov-queen-sacrifice
STATE-OF-THE-ART SYSTEM
It Makes Sense (IMS) WSD system (Zhong and Ng, 2010)
• 36% queen.n.1: the only fertile female in a colony of social insects such
as bees, ants or termites.
• 34% queen.n.2: a female sovereign ruler
• 30% queen.n.3: the wife or widow of a king
• …..
• 0% queen.n.6: the most powerful chess piece
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 21
The Role of Background
knowledge
 A very naïve approach
 Find “Gary Kasparov” as an entity and link it to Wikipedia
 Compare the textual overlap of:
 Wikipage Queen_chess vs. Wikipage Gary_Kasparov → 170 overlapping types
 Wikipage Queen_regnant vs. Wikipage Gary_Kasparov → 88 overlapping types
Examples of matching words Queen_chess – G. Kasparov
board opening matches game press championship rules
chess player king queen
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 22
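A minimal sketch of the naive overlap test described above, assuming the pages are fetched through the public MediaWiki API (the slide does not say how the pages were obtained) and using illustrative page titles:

```python
# Sketch of the "very naive" overlap test: count the word types two Wikipedia
# pages share. Fetching via the public MediaWiki API and the page titles below
# are assumptions made for illustration.
import re
import requests

def wiki_plaintext(title):
    """Fetch the plain-text extract of an English Wikipedia page."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "prop": "extracts", "explaintext": 1,
                "format": "json", "titles": title},
        timeout=30,
    )
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def type_overlap(text_a, text_b):
    """Word types (lowercased alphabetic tokens) occurring in both texts."""
    types_a = set(re.findall(r"[a-z]+", text_a.lower()))
    types_b = set(re.findall(r"[a-z]+", text_b.lower()))
    return types_a & types_b

kasparov = wiki_plaintext("Garry Kasparov")
chess_queen = wiki_plaintext("Queen (chess)")
queen_regnant = wiki_plaintext("Queen regnant")

print(len(type_overlap(chess_queen, kasparov)))    # cf. 170 on the slide
print(len(type_overlap(queen_regnant, kasparov)))  # cf. 88 on the slide
```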
Our ideal system
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 23
Part II
Error Analysis of WSD
systems
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 24
Piek Vossen Marten Postma Ruben Izquierdo
Motivation
Word Sense Disambiguation is still an unsolved problem
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 25
Hypothesis
 Little attention has been paid to the problem itself
 WSD is treated as just one problem
 The context is not being exploited properly
 Systems rely too much on the Most Frequent Sense (MFS)
 It is indeed the baseline, and very hard to overcome
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 26
Goal of the Analysis
 Perform an error analysis of the participating systems in previous WSD evaluations to test our hypotheses
 Senseval-2: all-words task
 Senseval-3: all-words task
 Semeval2007: all-words task (#17)
 Semeval2010: all-words on specific domain (#17)
 Semeval2013: multilingual all-words WSD and entity
linking (#12)
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 27
Analysis
 Calculate the performance of the systems according to
different criteria of the gold data
 Monosemous / polysemous
 Part-of-speech
 Most Frequent Sense vs. Non MFS
 Polysemy class
 Frequency class
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 28
Monosemous errors
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 29
Monosemous Errors
Competition | Monosemous | Wrong | Examples
Senseval2 | 499 (20.9%) | 37.5% | gene.n (suppressor_gene.n), chance.a (chance.n), next.r (next.a)
Senseval3 | 334 (16.6%) | 44.1% | datum.n (data.n), making.n (make.v), out_of_sight (sight)
Semeval2007 | 25 (5.5%) | 11.1% | get_stuck.v, lack.v, write_about.v
Semeval2010 | 31 (2.2%) | 97.9% | tidal_zone.n, pine_marten.n, roe_deer.n, cordgrass.n
Semeval2013 (lemmas) | 348 (21.1%) | 1.9% | private_enterprise, developing_country, narrow_margin
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 30
Most Frequent Sense
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 31
Most Frequent Sense
 When the correct sense is NOT the most frequent sense
 Systems still mostly assign the MFS
 Senseval2
 799 tokens are not the MFS
 In 84% of them, systems still assign the MFS
 Most “failed” words are due to MFS bias
 Senseval2, Senseval3
 say.v find.v take.v have.v cell.n church.n
 Semeval2010
 area.n nature.n connection.n water.n population.n
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 32
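A hypothetical sketch of the measurement behind these numbers: given gold senses, a system's answers and an MFS lookup, count how often the system still outputs the MFS on instances whose gold sense is not the MFS. The identifiers and toy data are illustrative, not the evaluation scripts used in the analysis.

```python
# Hypothetical sketch of the MFS-bias measurement. `gold` and `system` map
# instance ids to sense keys, `mfs` maps lemmas to the first (most frequent)
# sense; formats and sense keys are assumptions made for illustration.
def mfs_bias(gold, system, mfs, lemma_of):
    """Share of non-MFS gold instances that a system still tags with the MFS."""
    non_mfs = [i for i, sense in gold.items() if sense != mfs[lemma_of[i]]]
    if not non_mfs:
        return 0.0
    hits = sum(1 for i in non_mfs if system.get(i) == mfs[lemma_of[i]])
    return hits / len(non_mfs)

# toy example: 2 of the 3 non-MFS instances are (wrongly) given the MFS
gold     = {"d1.t1": "cell%2", "d1.t2": "say%3", "d1.t3": "find%4"}
system   = {"d1.t1": "cell%1", "d1.t2": "say%1", "d1.t3": "find%4"}
mfs      = {"cell": "cell%1", "say": "say%1", "find": "find%1"}
lemma_of = {"d1.t1": "cell", "d1.t2": "say", "d1.t3": "find"}
print(mfs_bias(gold, system, mfs, lemma_of))  # -> 0.666...
```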
Analysis per PoS-tag
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 33
Polysemy Profile
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 34
Frequency Class
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 35
Expected vs. Observed
difficulty
 Calculate per sentence
 The “expected” difficulty
 Average polysemy, sentence length, average word length
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 36
Expected vs. Observed
difficulty
 Calculate per sentence
 The “expected” difficulty
 Average polysemy, sentence length, average word length
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 37
Expected vs. Observed
difficulty
 Calculate per sentence
 The “expected” difficulty
 Average polysemy, sentence length, average word length
 The “observed” difficulty
 From the real participant outputs, the average error rate
 We could expect:
harder sentences → higher error rate
easier sentences → lower error rate
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 38
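A small sketch of how the expected and observed difficulty could be computed and compared per sentence; the data layout and the use of Pearson correlation are assumptions made for illustration, not the exact procedure of the analysis.

```python
# Sketch: expected difficulty (avg. polysemy, length, avg. word length) vs.
# observed difficulty (avg. system error rate) per sentence, plus their
# correlation. The (token, n_senses, n_systems_wrong, n_systems) layout and
# the toy data are assumptions.
from statistics import mean
from math import sqrt

def expected_difficulty(sentence):
    """Average polysemy, sentence length, average word length."""
    return (mean(n_senses for _, n_senses, _, _ in sentence),
            len(sentence),
            mean(len(tok) for tok, _, _, _ in sentence))

def observed_difficulty(sentence):
    """Average error rate of the participating systems on this sentence."""
    return mean(wrong / total for _, _, wrong, total in sentence)

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

sentences = [
    [("the", 1, 0, 10), ("cell", 7, 3, 10), ("divides", 5, 2, 10)],
    [("area", 6, 7, 10), ("connection", 9, 6, 10), ("nature", 7, 5, 10)],
    [("water", 6, 4, 10), ("population", 4, 3, 10), ("grows", 5, 2, 10)],
]
avg_polysemy = [expected_difficulty(s)[0] for s in sentences]
error_rate = [observed_difficulty(s) for s in sentences]
print("correlation avg. polysemy vs. error rate:", pearson(avg_polysemy, error_rate))
```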
Expected vs. Observed
difficulty
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 39
Expected vs. Observed
difficulty
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 40
Expected vs. Observed
difficulty
• The context is probably not exploited properly
• Expected “easy” sentences SHOULD show low error rates
• Occurrences of the same word in different contexts show similar error rates
• The difficulty of a word depends more on its polysemy than on the context where it appears
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 41
WSD Corpora
http://github.com/rubenIzquierdo/wsd_corpora
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 42
WSD Corpora
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 43
System Outputs
https://github.com/rubenIzquierdo/sval_systems
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 44
System Outputs
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 45
Part III
When to Use Background
Information to Perform WSD
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 46
Piek Vossen Marten Postma Ruben Izquierdo
SemEval-2015 Task #13
 Multilingual All-Words Sense Disambiguation and Entity
Linking
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 47
SemEval-2015 Task #13
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 48
Motivation
 From the previous error analysis
 MFS bias is a big problem
 For both supervised and unsupervised approaches
 Especially when there is a domain shift
 Our approach
1. Determine the predominant sense for every lemma in the specific domain (unsupervised)
2. Apply a state-of-the-art WSD system
3. Define a heuristic to determine when to apply 1) or 2)
4. We focused on WSD in English only
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 49
Architecture
 IMS route: favors the MFS in the general domain and local features
 Background route: favors the predominant sense in the domain
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 50
ROUTE 1
ROUTE 2
Architecture
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 51
Architecture
 Two different approaches
 Online approach
 The SemEval test documents (4 documents)
 Offline approach
 Precompiled documents for the target domain
 Documents from biomedical domain
 Converted to NAF
 Tokens, Lemmas and PoS tags
Seed documents SD
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 52
Architecture
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 53
Architecture
 DBpedia Spotlight is applied to the seed documents
 Entities and links to DBpedia are extracted
 Wikipedia pages are retrieved from the DBpedia links
 Filter:
 Consider only DBpedia links whose ontological type is a leaf of the ontology
 Better results without the filter
 All the Wikipedia pages make up the EAC corpus
Entity Article Corpus EAC
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 54
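A sketch of this step, assuming the public DBpedia Spotlight web service; the original pipeline may have used a local Spotlight installation with different parameters, and the seed text below is illustrative.

```python
# Sketch of the EAC construction step: run DBpedia Spotlight over a seed
# document and collect the DBpedia resources it links to. The endpoint,
# parameters and response fields follow the public web service (assumed here).
import requests

SPOTLIGHT = "https://api.dbpedia-spotlight.org/en/annotate"

def spotlight_entities(text, confidence=0.5):
    """Return the set of DBpedia resource URIs found in `text`."""
    resp = requests.get(
        SPOTLIGHT,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=60,
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])
    return {r["@URI"] for r in resources}

seed_text = "CCDC11 is a protein that in humans is encoded by the CCDC11 gene."
for uri in sorted(spotlight_entities(seed_text)):
    # each DBpedia URI maps to a Wikipedia article that goes into the EAC
    print(uri, "->", uri.replace("dbpedia.org/resource", "en.wikipedia.org/wiki"))
```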
Architecture
Entity Article Corpus EAC
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 55
Architecture
Entity Article Corpus EAC
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 56
Architecture
Entity Article Corpus EAC
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 57
Architecture
Entity Article Corpus EAC
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 58
Architecture
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 59
Architecture
 Targets high recall and low precision/quality
 Entity Article Corpus EAC → LDA → Domain Model DM
 For every document DEAC in EAC
 Obtain the DBpedia type T
 Obtain the set of DBpedia entities S from DBpedia which belong to T
 For every document DS in S:
 Compute the similarity of DS against the model DM
 If similarity >= THRESHOLD → select the document for the Entity Expanded corpus
LDA Expansion
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 60
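A rough gensim-based sketch of the LDA expansion described above (the editor's notes mention gensim); the corpus contents, number of topics and threshold are illustrative.

```python
# Sketch of the LDA expansion: build a Domain Model over the EAC with LDA,
# then keep only candidate DBpedia documents similar enough to it.
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

THRESHOLD = 0.5  # illustrative value; the real cut-off is not given on the slide

eac_docs = [  # plain text of the EAC documents (toy examples here)
    "CCDC11 is a protein that in humans is encoded by the CCDC11 gene.",
    "Phosphorylation attaches a phosphate group to a protein or small molecule.",
]
tokenised = [simple_preprocess(doc) for doc in eac_docs]
dictionary = corpora.Dictionary(tokenised)
bow_corpus = [dictionary.doc2bow(toks) for toks in tokenised]

# Domain Model DM: an LDA topic model over the whole EAC
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=10)
index = similarities.MatrixSimilarity(lda[bow_corpus], num_features=lda.num_topics)

def similar_to_domain(candidate_text):
    """True if a candidate DBpedia/Wikipedia document is close to the EAC topics."""
    bow = dictionary.doc2bow(simple_preprocess(candidate_text))
    sims = index[lda[bow]]            # similarity against every EAC document
    return float(sims.max()) >= THRESHOLD

# candidates: every DBpedia entry sharing the ontological type of an EAC document
# (e.g. all entries of type HumanGene); each is kept only if similar_to_domain(...)
```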
Architecture
LDA Expansion
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 61
Architecture
LDA Expansion
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 62
Architecture
LDA Expansion
http://dbpedia.org/ontology/HumanGene
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 63
Architecture
LDA Expansion
[Diagram: Entity Article Corpus EAC → LDA → Domain Model → similarity scoring]
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 64
Architecture
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 65
Architecture
Entity Overlapping Expansion
 Targets high quality and medium recall
 Entity Article Corpus EAC
 Extract the set of all entities: SE
 For every entity E in SE:
 Obtain all the wikilinks in E: W
 For every Ew in W:
 Obtain all the wikilinks Wew in Ew → SW
 Compute the overlap between SE and SW
 Filter by threshold
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 66
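A minimal sketch of the selection logic of the Entity Overlapping expansion; the wiki-link lookups are faked with toy data, since in the real pipeline they come from Wikipedia/DBpedia.

```python
# Sketch of the Entity Overlapping expansion: a page linked from the EAC is
# kept if its own wiki-links overlap enough with the domain reference
# entities SE. The LINKS dict and threshold are toy assumptions.
THRESHOLD = 2  # minimum number of shared entities; illustrative value

LINKS = {  # page -> set of wiki-links on that page (toy data)
    "CCDC11": {"Phosphorylation", "Protein", "Gene"},
    "Phosphorylation": {"Phosphate", "Enzyme", "Protein", "Gene", "CCDC11"},
    "Phosphate": {"Chemistry", "Salt"},
}

def wikilinks(page):
    return LINKS.get(page, set())

def expand(reference_entities):
    """Pages linked from SE whose own links overlap enough with SE."""
    selected = set()
    for entity in reference_entities:            # E in SE
        for linked_page in wikilinks(entity):    # Ew in W
            sw = wikilinks(linked_page)          # SW
            if len(reference_entities & sw) >= THRESHOLD:
                selected.add(linked_page)
    return selected

SE = {"CCDC11", "Protein", "Gene"}               # domain reference entities
print(expand(SE))                                # -> {'Phosphorylation'}
```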
Architecture
Entity Overlapping Expansion
[Diagram: from http://dbpedia.org/resource/CCDC11 in SE, get the wikilinks of the CCDC11 WikiPage (e.g. Phosphorylation); then get the wikilinks of the Phosphorylation WikiPage (Phosphate, Enzymes, Biochemistry, Prokaryotic, CCDC11, …)]
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 67
Architecture
Entity Overlapping Expansion
[Diagram: the wikilinks of Phosphorylation (Phosphate, Enzymes, Biochemistry, Prokaryotic, …) are compared against SE; if the overlap is > THRESHOLD the page is selected, otherwise rejected]
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 68
Architecture
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 69
Architecture
Predominant Sense Algorithm
 Background corpus BC: EAC + EE
 For every lemma L in BC:
 Extract all sentences containing L
 If there are more than 100 sentences
 Word sense induction with Hierarchical Dirichlet Processes
(Lau et al., 2012)
 Induce senses using Topic Modeling
 Output: list of senses with confidences per lemma
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 70
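A rough sketch of the predominant-sense induction step, using gensim's HdpModel as a stand-in for the HDP-based word sense induction of Lau et al. (2012); this is not their implementation, only an illustration of the idea.

```python
# Sketch: induce "senses" for a lemma by topic-modelling the sentences that
# contain it with a Hierarchical Dirichlet Process, then rank the induced
# topics by how many sentences they dominate.
from collections import Counter
from gensim import corpora, models
from gensim.utils import simple_preprocess

MIN_SENTENCES = 100   # as on the slide: only induce senses above this count

def induce_senses(sentences):
    """Return (topic_id, relative_frequency) pairs, most frequent topic first."""
    tokenised = [simple_preprocess(s) for s in sentences]
    dictionary = corpora.Dictionary(tokenised)
    corpus = [dictionary.doc2bow(toks) for toks in tokenised]
    hdp = models.HdpModel(corpus, id2word=dictionary)   # non-parametric: no fixed k
    # assign every sentence (one usage of the lemma) to its dominant topic
    assignments = Counter()
    for bow in corpus:
        topics = hdp[bow]
        if topics:
            assignments[max(topics, key=lambda t: t[1])[0]] += 1
    total = sum(assignments.values()) or 1
    return [(topic, count / total) for topic, count in assignments.most_common()]

# for every lemma L in the background corpus BC (EAC + EE):
#   sentences = all sentences of BC containing L
#   if len(sentences) > MIN_SENTENCES:
#       sense_ranking[L] = induce_senses(sentences)
```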
Architecture
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 71
Architecture
Voting
 For a new instance of a given lemma
 Obtain the sense ranking of the Predominant Sense (PS) module
 Only if the first 2 senses accumulate 85% of the confidence (to avoid skewness)
 Mix both sense rankings
 PS and ItMakesSense
 Select the sense with the highest confidence
 If there is no Predominant Sense information
 Use the ItMakesSense best sense
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 72
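A minimal sketch of the voting heuristic, with illustrative sense keys and confidence values; data shapes are assumptions.

```python
# Sketch of the voting step: the PS ranking is only trusted when its two top
# senses jointly account for 85% of the confidence mass; otherwise (or when
# there is no PS information) ItMakesSense decides.
SKEW_THRESHOLD = 0.85

def vote(ps_ranking, ims_ranking):
    """ps_ranking / ims_ranking: lists of (sense, confidence), best first."""
    if ps_ranking:
        top2 = sum(conf for _, conf in ps_ranking[:2])
        if top2 >= SKEW_THRESHOLD:
            # mix both rankings and take the overall most confident sense
            merged = ps_ranking + ims_ranking
            return max(merged, key=lambda pair: pair[1])[0]
    # fall back to the ItMakesSense best sense
    return ims_ranking[0][0]

ps  = [("queen%1:06:02::", 0.70), ("queen%1:18:01::", 0.20)]   # illustrative keys
ims = [("queen%1:05:01::", 0.40), ("queen%1:18:01::", 0.35)]
print(vote(ps, ims))   # PS is trusted here: 0.70 + 0.20 >= 0.85
print(vote([], ims))   # no PS information -> ItMakesSense best sense
```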
Results
All domains
Measure | All | N | V
Precision | 67.5 (2) | 64.7 | 56.6
Recall | 51.4 (5) | 42.9 | 53.9
F1 | 58.4 (4) | 51.6 | 55.2

Social Issues domain
Measure | All | N | V
F1 | 61.2 (2) | 54.8 (7) | 70.6 (1)

Math Computer domain
Measure | All | N | V
F1 | 47.7 (5) | 30.5 (13) | 49.7 (7)

Biomedical domain
Measure | All | N | V
F1 | 66.4 (4) | 62.7 (9) | 53.8 (2)
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 73
Discussion
 The domain was not just biomedical, but mixed
 We couldn't use the offline approach
 Online approach: small size of the seed documents
 We used WN 1.7.1 while the gold standard was WN 3.0
 Some test instances were not annotated
 Using only the predominant sense output:
 Precision on nouns improved 64.7% → 69.1%
 Precision on verbs improved 56.6% → 64.6%
 … but …
 Recall on nouns dropped 42.9% → 20.1%
 Recall on verbs dropped 53.9% → 17.7%
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 74
GitHub Code
https://github.com/cltl/vua-wsd-sem2015
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 75
Part IV
What is next?
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 76
Current and Future
 Most Frequent Sense Classifier
 Decide when to apply the MFS or not
 Based on the output of 2 WSD systems
 UKB
 IMS
 Random Forest algorithm
 Features
 Confidence of the MFS by systems
 Sense ranking entropy
 WordNet Domains / SuperSense for the MFS
 …
 Voting for selecting the MFS
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 77
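A sketch of what such an MFS classifier could look like with scikit-learn; the feature vector and training data are invented for illustration and are not the features actually extracted from UKB and IMS.

```python
# Sketch of the planned MFS classifier: a Random Forest deciding, per
# instance, whether the MFS should be applied, from features derived from
# the UKB and IMS outputs. Feature layout and data are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# one row per instance:
# [ukb_conf_of_mfs, ims_conf_of_mfs, sense_ranking_entropy, domain_matches_mfs]
X_train = np.array([
    [0.90, 0.85, 0.40, 1],
    [0.35, 0.30, 1.90, 0],
    [0.70, 0.75, 0.90, 1],
    [0.20, 0.25, 2.10, 0],
])
y_train = np.array([1, 0, 1, 0])   # 1 = apply the MFS, 0 = do not

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

x_new = np.array([[0.80, 0.82, 0.55, 1]])
print("apply MFS?", bool(clf.predict(x_new)[0]))
```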
Current and Future
 Unsupervised learning for MFS / LFS
 Distributional semantics and word2vec for detecting the
MFS
 Vectors for representing MFS cases
 Vectors for representing LFS cases
 Operate with vectors
 V(‘Paris’) – V(‘France’) + V(‘Italy’) => V(‘Rome’)
 V(‘king’) – V(‘man’) + V(‘woman’) => V(‘queen’)
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 78
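A sketch of the word2vec arithmetic shown above, using a pretrained model from gensim's downloader as a stand-in for vectors trained on the project's own corpora (an assumption; any word2vec model could be plugged in).

```python
# Sketch of the vector arithmetic behind the MFS/LFS idea, using a pretrained
# word2vec model loaded through gensim's downloader (assumed here).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # large download on first use

print(vectors.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# the same arithmetic could contrast vectors built from MFS contexts with
# vectors built from LFS contexts, to decide which sense a new context favours
```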
ULM-1
Understanding Language
by Machines
The Borders of Ambiguity
THANKS
Ruben Izquierdo
ruben.izquierdobevia@vu.nl
http://rubenizquierdobevia.com
SemEval2013 datasets
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 80
SemEval2013 results
Ruben Izquierdo, Nov 2015 “The Borders of Ambiguity” 81


Editor's notes

  1. More purely a WSD task; SemEval2013 was multilingual and used BabelNet, Wikipedia and WordNet
  2. Subsets of the test data
  3. Shorter words → more polysemous
  4. Di ter min
  5. The LDA technique first obtains a topic model using Latent Dirichlet Allocation on the whole background corpus (we have used the Python library gensim for this purpose, http://radimrehurek.com/gensim/). Then, for every background document in our initial set, the DBpedia ontology class of this document is obtained (for instance HumanGene) and, using our dbpediaEnquirerPy module, all the DBpedia entries that belong to that specific class are retrieved (in our example we would download all the possible entries in DBpedia for human genes). This process can be quite time consuming (there are a total of 15 entries in DBpedia for HumanGene, but there are 1.65 million entries for Person). Every document is then compared against the background LDA model, and only those reaching a certain similarity are selected to be included in the expanded corpus. The whole process is highly time consuming and the result in terms of quality is not as good as expected, probably because the number of documents retrieved is very large and the domains are very diverse and in many cases different from our reference domain. The EO expansion follows a different approach. We collect all the DBpedia links from the first background corpus, which makes up our list of domain reference entities. Then, for each of the background documents, we obtain all the wiki-links contained in the Wikipedia text. For each of these wiki-links, we retrieve the Wikipedia page, and again all the entities (wiki-links) contained in this Wikipedia page. We then compute the overlap of these lists of entities with our original list of domain reference entities. The higher the overlap, the more similar and domain-related the new Wikipedia page and our original background corpus are. Only Wikipedia pages reaching a minimum overlap are selected to be part of the expanded corpus. For instance, starting from the document for http://en.wikipedia.org/wiki/CCDC11 (a protein), we could extract this wiki-link: http://en.wikipedia.org/wiki/Phosphorylation. Then we would extract the list of wiki-links (entities) found in the Wikipedia page of Phosphorylation (Wikipedia pages for phosphate, protein, post-translational modification, …). The last step would be to obtain the overlap between this set of entities and the original domain entities of the background corpus. This process is much faster in terms of computation than the LDA approach, and leads to a smaller corpus but with higher quality and more coherence with the original domain.
  6. Lesk algorithm to map from topics to WordNet senses (a gloss-overlap sketch also follows these notes)
  7. Winner: P = 68.7, R = 63.1, F1 = 65.8
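To make the EO expansion in note 5 concrete, here is a minimal sketch under assumptions: get_dbpedia_links and get_wiki_links are hypothetical helpers standing in for the DBpedia and Wikipedia lookups (e.g. via dbpediaEnquirerPy and a Wikipedia dump), and the overlap threshold is illustrative, not the value used in the project.

```python
# Minimal sketch (not the project's code) of the entity-overlap expansion:
# keep a candidate Wikipedia page only if enough of its wiki-links are already
# domain reference entities. Both helper functions are hypothetical.
def expand_corpus_by_entity_overlap(background_docs, get_dbpedia_links,
                                    get_wiki_links, min_overlap=0.3):
    # 1) Domain reference entities: DBpedia links collected from the background corpus.
    domain_entities = set()
    for doc in background_docs:
        domain_entities |= set(get_dbpedia_links(doc))

    # 2) Candidate pages: every wiki-link found in the background documents.
    candidates = set()
    for doc in background_docs:
        candidates |= set(get_wiki_links(doc))

    # 3) Keep candidates whose own wiki-links overlap enough with the domain entities.
    expanded = []
    for page in candidates:
        page_entities = set(get_wiki_links(page))
        if not page_entities:
            continue
        overlap = len(page_entities & domain_entities) / len(page_entities)
        if overlap >= min_overlap:
            expanded.append(page)
    return expanded

# Toy usage with hard-coded link maps (the real pipeline queries DBpedia/Wikipedia):
links = {
    "CCDC11": {"Phosphorylation", "Protein", "Cilium"},
    "Phosphorylation": {"Phosphate", "Protein", "Post-translational_modification"},
}
print(expand_corpus_by_entity_overlap(
    ["CCDC11"],
    get_dbpedia_links=lambda d: links.get(d, set()),
    get_wiki_links=lambda p: links.get(p, set()),
))  # -> ['Phosphorylation']: one of its links is already a domain entity; the others have no links here
```

For the Lesk mapping mentioned in note 6, a simplified gloss-overlap sketch, assuming NLTK with the WordNet data installed (the real mapping may score overlaps differently):

```python
# Simplified Lesk-style mapping from an LDA topic to a WordNet sense:
# choose the synset whose gloss shares the most words with the topic's top words.
from nltk.corpus import wordnet as wn

def topic_to_synset(word, topic_words):
    topic = {w.lower() for w in topic_words}
    best, best_overlap = None, -1
    for synset in wn.synsets(word):
        gloss = set(synset.definition().lower().split())
        overlap = len(gloss & topic)
        if overlap > best_overlap:
            best, best_overlap = synset, overlap
    return best

# A topic about monarchy should pull "queen" towards one of its royal senses.
print(topic_to_synset("queen", ["king", "royal", "monarch", "throne", "crown"]))
```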