3. Current approaches performing the NEL task
Linguistic approach: the text is parsed by a NER classifier. Entity labels are used to look up resources in a referent KB, and a ranking function selects the best match (relatedness, semantic similarity)
End-to-End approach: a dictionary of mentions and links is built from a referent KB. The text is split into n-grams that are used to look up candidate links in the dictionary, and a selection function picks the best match (relatedness, semantic similarity, relevance)
Hybrid approach: a combination of both
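As a rough illustration, the end-to-end lookup step can be sketched as follows; the function names and the toy mention dictionary are invented for illustration and stand in for a real KB-derived index:

```python
# Sketch of the end-to-end strategy: split the text into n-grams and look
# each one up in a mention dictionary built from a referent KB.

def ngrams(tokens, max_n=3):
    """Yield all word n-grams up to max_n tokens, longest first."""
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def lookup_candidates(text, mention_dict, max_n=3):
    """Return candidate KB links for every n-gram found in the dictionary."""
    tokens = text.split()
    return {g: mention_dict[g] for g in ngrams(tokens, max_n) if g in mention_dict}

# Toy dictionary, not from a real KB
mention_dict = {
    "Lincoln Motor Company": ["db:Lincoln_Motor_Company"],
    "Lincoln": ["db:Abraham_Lincoln", "db:Lincoln_Motor_Company"],
}
print(lookup_candidates("Joe drove a Lincoln", mention_dict))
# {'Lincoln': ['db:Abraham_Lincoln', 'db:Lincoln_Motor_Company']}
```

A real annotator would then pass these candidates to the selection function; here the lookup alone is shown.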
4. Ranking and selection of the candidate links are driven by the relatedness of the entities in the knowledge base
Example mappings: Henry Leland → db:Henry_M._Leland; Lincoln Motor Company → db:Lincoln_Motor_Company; Joe → ?; Lincoln → db:Abraham_Lincoln; Cadillac → db:Cadillac
When the context is poor, head entities are favoured
5. “Henry Leland … formed the Lincoln Motor Company... Joe drove a Lincoln for the first time in his life”
Expected annotations: Joe → PER, NIL; Lincoln → PRO, db:Lincoln_Motor_Company
6. General-purpose Hybrid Annotator
Pipeline: Text as input → Mention Extraction → Resolution and Classification → Candidate Selection → Reranking with Context
Output example: Joe → PER, NIL; Lincoln → PRO, db:Lincoln_Motor_Company
8. General-purpose Hybrid Annotator (I)
Mention extraction: proper nouns as classified by the Stanford POS Tagger (trained with the english-bidirectional-distsim model) and named entities as classified by the Stanford NERClassifierCombiner (trained on the CoNLL 2003, MUC 6 and MUC 7 corpora)
Resolution and typing:
o When one mention is a substring of another, we keep the longest:
POS: (United States, NNPS) + NER: (United States of America, PLACE) → (United States of America, PLACE)
o When a part of one mention is a substring of another, we merge the two to create a new one:
POS: (United States, NNPS) + NER: (States of America, PLACE) → (United States of America, PLACE)
Plu et al., Revealing Entities from Textual Documents Using a Hybrid Approach, (ISWC'15) NLP & DBpedia 2015
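Both resolution rules can be sketched over character-offset spans: a contained span is absorbed by the longer one, and partially overlapping spans are fused. The span representation and function names are assumptions for illustration, not from the paper:

```python
# Sketch of the two resolution rules, over (start, end) character spans.

def resolve(spans, text):
    """Merge overlapping mention spans: a span contained in another is
    absorbed (keep the longest), and partially overlapping spans are
    fused into a single new span."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:                 # overlap or containment
            prev_start, prev_end = merged.pop()
            merged.append((prev_start, max(prev_end, end)))   # fuse / keep longest
        else:
            merged.append((start, end))
    return [text[s:e] for s, e in merged]

text = "United States of America"
# POS found "United States" (chars 0-13), NER found "States of America" (7-24)
print(resolve([(0, 13), (7, 24)], text))   # ['United States of America']
```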
9. General-purpose Hybrid Annotator (II)
Candidate selection: fuzzy string match over an index built from DBpedia 2015-04
o NIL clustering when no candidates are found: exact match of labels within the boundaries of a sentence
o Candidate ranking if multiple candidates are found, scored by:

r(l) = (a · L(m, title) + b · max L(m, R) + c · max L(m, D)) · PR

where:
r(l): the score of the label l
L: the Levenshtein distance
m: the extracted mention
title: the title of the label l
R: the set of redirect pages associated with the label l
D: the set of disambiguation pages associated with the label l
PR: the PageRank associated with the label l
a, b and c: weights satisfying a > b > c and a + b + c = 1
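A literal sketch of this scoring function, assuming L is the plain Levenshtein edit distance (the slide does not say whether a normalized similarity is used instead); the weight values and any data fed to it are illustrative:

```python
# Sketch of the candidate-ranking score r(l), with a simple
# dynamic-programming Levenshtein distance.

def levenshtein(s, t):
    """Classic edit distance between strings s and t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def score(mention, title, redirects, disamb, pagerank, a=0.5, b=0.3, c=0.2):
    """r(l) = (a*L(m,title) + b*max L(m,R) + c*max L(m,D)) * PR(l).
    Weights a > b > c, a + b + c = 1; values here are illustrative."""
    lr = max((levenshtein(mention, r) for r in redirects), default=0)
    ld = max((levenshtein(mention, d) for d in disamb), default=0)
    return (a * levenshtein(mention, title) + b * lr + c * ld) * pagerank
```

For example, `score("Lincoln", "Abraham Lincoln", ["Lincoln"], [], 0.8)` combines the edit distance to the title with a perfect redirect match, weighted by the PageRank value.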
11. Reranking with Context
Aim: adapt the linking task to the textual content that is being analysed
Approach: leverage the genre and topic domain information about the text
Apply: 4 heuristics (H1, H2, H3, H4) in cascade; they take the form of binary rules
12. H1: Order of processing
Process the running text sequentially, starting from the first sentence; process the title at the end
Reasoning: the title is typically ambiguous/catchy, while the first sentences of an article are written most explicitly
13. H2: Coherence
Detect whether an entity is co-referential (an abbreviation or a substring) with an entity that occurs earlier in the same news article
Reasoning: once the writer has clearly introduced an entity, she can use abbreviations or more ambiguous ways to refer to it later in the text
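A minimal sketch of the H2 test, assuming co-reference is approximated by substring containment or initials matching; the exact matching rules are not given on the slide:

```python
# Sketch of H2: a new mention corefers with an earlier entity if it is a
# substring of that entity or matches its initials (an abbreviation).

def initials(name):
    """Initial letters of a multi-word name, e.g. 'LMC'."""
    return "".join(w[0] for w in name.split()).upper()

def coreferent(mention, prior_entities):
    """Return the first previously seen entity the mention plausibly refers to."""
    for entity in prior_entities:
        if mention in entity or mention.upper() == initials(entity):
            return entity
    return None

seen = ["Lincoln Motor Company"]
print(coreferent("Lincoln", seen))   # Lincoln Motor Company
print(coreferent("LMC", seen))       # Lincoln Motor Company
```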
14. H3: Domain relevance
Use a contextual knowledge base to examine whether a mention has been frequently and dominantly associated with a certain entity within a domain
Reasoning: it is customary that the entities mentioned in domain-specific text stem from the same domain; also, within a domain, a mention is typically associated with one dominant entity
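One way to sketch the H3 lookup, assuming the contextual knowledge base stores per-domain mention-to-entity counts and a dominance threshold; both the counts and the threshold are invented for illustration:

```python
# Sketch of H3: resolve a mention within a domain only when one entity
# clearly dominates its observed uses in the contextual KB.

from collections import Counter

# Toy contextual KB: (domain, mention) -> entity frequency counts (invented)
context_kb = {
    ("automotive", "Lincoln"): Counter(
        {"db:Lincoln_Motor_Company": 90, "db:Abraham_Lincoln": 10}),
}

def dominant_entity(domain, mention, threshold=0.8):
    """Return the entity if it covers at least `threshold` of the mention's
    uses in this domain, else None (no dominant association)."""
    counts = context_kb.get((domain, mention))
    if not counts:
        return None
    entity, freq = counts.most_common(1)[0]
    return entity if freq / sum(counts.values()) >= threshold else None

print(dominant_entity("automotive", "Lincoln"))  # db:Lincoln_Motor_Company
```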
15. H4: Semantic Typing
Check whether the semantic type of the entity resolved by H2 or H3 fits the textual context
Reasoning: the entity should fit the textual context and fulfil a certain role in the text
16. Benchmark corpora
MEANTIME*: benchmark the approach with a corpus composed of 4 topic-specific gold standards
AIDA-YAGO2**: test the generalizability of the approach
*Minard et al., MEANTIME, the NewsReader Multilingual Event and Time Corpus. LREC 2016
**Hoffart et al., Robust Disambiguation of Named Entities in Text. EMNLP 2011
17. Corpora statistics

Corpus            | Articles | Tokens | Entities | Links | NILs  | Entity Types
MEANTIME* airbus  | 30       | 3,620  | 614      | 414   | 200   | 5
MEANTIME* apple   | 30       | 3,452  | 812      | 525   | 287   | 5
MEANTIME* gm      | 30       | 3,641  | 760      | 526   | 234   | 5
MEANTIME* stock   | 30       | 3,362  | 449      | 331   | 118   | 4
AIDA-YAGO2**      | 231      | 46,435 | 5,616    | 4,485 | 1,131 | 4
19. Discussion
Reranking with context is effective and brings improvement over the baseline for all corpora
Improvement also on AIDA-YAGO2, even though it stems from a neutral topic domain. This is because MEANTIME and AIDA-YAGO2 share the genre domain, and many of the entities in MEANTIME stem from the neutral domain as well
H1 (Order of Processing) and H3 (Domain Relevance) are, with these settings, the most effective heuristics
H4 (Semantic Typing) requires further investigation
20. Future Work
Model the genre and topic domains to further contextualize the entity linking, i.e. add more features to improve our adaptive contextual model
Investigate dynamic adaptability in different contexts using knowledge bases as inputs