This document describes a project to mine named entities from Wikipedia. It discusses using Wikipedia's internal links, redirect links, external links, and categories to identify named entities and their synonyms with high accuracy. It presents an algorithm for generic named entity recognition that classifies Wikipedia entries based on capitalization, title formatting, and other features. The project aims to build a search system that matches queries to candidates using vector space modeling and considers contextual windows around search terms.
MINING NAMED ENTITIES FROM WIKIPEDIA
1. MINING NAMED ENTITIES FROM WIKIPEDIA
GROUP MEMBERS
- NIKHIL BAROTE
- KUNJ THAKKAR
- SHIVANI PODDAR
- ANKIT SHARMA
2. In many search domains, both content and queries are
frequently tied to named entities (NEs) such as a person or a
company.
One challenge from an information retrieval point of view is
that a single entity can be referred to in more than one way.
In this project we describe how to use Wikipedia content
to automatically generate a dictionary of named entities
and synonyms that all refer to the same entity.
With this approach we can find named entities and their synonyms
with a high degree of accuracy.
3. Four Wikipedia features are particularly
attractive as a mining source when building a large
collection of NEs:
1. INTERNAL LINKS
2. REDIRECT LINKS
3. EXTERNAL LINKS
4. CATEGORIES
4. Generic Named Entity Recognition
Generic named entity recognition only classifies a Wikipedia entry
as an entity or not. It starts by looking at the title of the entry: most
article titles are nouns, and the only nouns
we are interested in are proper nouns.
Category-Based Named Entity Recognition
This is a subtask of information extraction that seeks to locate and classify
elements in text into predefined categories such as the names of persons,
organizations, and locations, expressions of time, quantities, monetary values,
percentages, etc.
Synonym Extraction
After a set of NEs has been identified, we want to find their synonyms.
We intend to use the internal links, redirects, and disambiguation pages
for this, and we can easily extract all of these once we have the NEs.
This gives us a list of captions, all used on links to a particular entity.
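As an illustration of the caption list described above, here is a minimal sketch that collects link captions from raw wikitext. It relies only on the standard MediaWiki internal link syntax, [[Target]] and [[Target|caption]]. The function name collect_captions and the entities argument are assumptions made for this example, and redirects and disambiguation pages are not handled.

```python
import re
from collections import defaultdict

# Matches [[Target]] and [[Target|caption]]; a section anchor (#...) is dropped.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")


def collect_captions(wikitext, entities):
    """Map each known entity title to the set of captions used on links to it.

    `entities` is a hypothetical set of titles already classified as NEs.
    """
    captions = defaultdict(set)
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # A link without an explicit caption displays the target title itself.
        caption = (match.group(2) or target).strip()
        if target in entities and caption:
            captions[target].add(caption)
    return captions
```

Each caption set can then serve as the synonym list for its entity, for example {"IBM", "Big Blue"} for "International Business Machines".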
5. Generic Named Entity Recognition Algorithm
To classify the entries we implemented an algorithm using the
following steps when given a title, T, and the text of an entry:
1. Remove any domain suffix from T
2. Tokenize T into n units, W = w1, w2, ..., wn
3. Remove any wi from W where wi is included in S
4. Classify as an entity if any of these conditions holds
true:
• ∑ C(wi) = n and n >= 2
• ∑ D(wi) >= 2
• ∑ E(T)/N(T) >= α
A domain suffix is the text enclosed in parentheses that follows
the title of entries with multiple senses.
6. Domain suffixes are used to disambiguate between the senses, but
since they are not part of the entity name, we
must first strip them from the title. Next we strip all wi
that are found in S, which is a list of stop words.
1. C(w) = 1 if any letter li of w ∊ [A..Z], 0 otherwise
2. D(w) = 1 if Q >= 2 where Q = ∑ C(li), 0 otherwise
C is a function that returns 1 if the
parameter is capitalized, and 0 otherwise, while D is a
function that returns 1 if the parameter has
multiple capital letters, and 0 otherwise. α is a threshold
used for the third condition.
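The classification steps can be sketched in Python as follows. This is a rough illustration under several assumptions, not the project's actual code: the stop-word list is a small placeholder, stop words are matched case-insensitively, C follows the literal "any letter in A..Z" reading (a first-letter check would be a one-line change), and the default α of 0.75 is an arbitrary value. Because the slides do not define E(T) and N(T), the ratio for the third condition is passed in precomputed as the hypothetical cap_ratio argument.

```python
import re

# Hypothetical stop-word list S; the slides do not give the actual list.
STOP_WORDS = {"of", "the", "and", "in", "for", "a", "an"}


def strip_domain_suffix(title):
    """Step 1: drop a trailing parenthesized suffix, e.g. 'Mercury (planet)'."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", title)


def C(word):
    """Return 1 if any letter of the word is in A..Z, 0 otherwise."""
    return 1 if any("A" <= ch <= "Z" for ch in word) else 0


def D(word):
    """Return 1 if the word has two or more capital letters, 0 otherwise."""
    return 1 if sum(1 for ch in word if "A" <= ch <= "Z") >= 2 else 0


def is_entity(title, cap_ratio=None, alpha=0.75):
    """Classify a title as an entity using the three conditions of step 4.

    cap_ratio stands in for E(T)/N(T), which is assumed to be computed
    elsewhere from the entry text; None means the condition is skipped.
    """
    words = strip_domain_suffix(title).split()                   # steps 1-2
    words = [w for w in words if w.lower() not in STOP_WORDS]    # step 3
    n = len(words)
    if n >= 2 and sum(C(w) for w in words) == n:                 # condition 1
        return True
    if sum(D(w) for w in words) >= 2:                            # condition 2
        return True
    if cap_ratio is not None and cap_ratio >= alpha:             # condition 3
        return True
    return False
```

With these assumptions, "Bill Gates" is accepted by the first condition, while a single-word title such as "Mercury (planet)" is accepted only if the ratio for the third condition reaches the threshold.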
7. Search System
First we take unigrams, bigrams, and trigrams from our query
document.
We look them up in our synonym database and get a
list of doc_titles and corresponding doc_ids.
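A minimal sketch of this lookup step is shown below. The synonym database is assumed here to be an in-memory dict that maps a lowercased synonym to a list of (doc_title, doc_id) pairs; the real storage format is not described in the slides, and the names ngrams and lookup_candidates are illustrative.

```python
def ngrams(tokens, max_n=3):
    """Yield (start_index, phrase) for every n-gram with 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, " ".join(tokens[i:i + n])


def lookup_candidates(text, synonym_db):
    """Return (start, phrase, doc_title, doc_id) for every n-gram found.

    synonym_db is a hypothetical dict: lowercased synonym -> [(title, id)].
    """
    tokens = text.split()
    hits = []
    for start, phrase in ngrams(tokens):
        for doc_title, doc_id in synonym_db.get(phrase.lower(), []):
            hits.append((start, phrase, doc_title, doc_id))
    return hits
```

Keeping the start index of each match makes it possible to center a context window on the matched term later.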
Now we look at the words in a window centered on the current
word, and at the candidate documents and their doc_ids
(the window size is set beforehand).
We use the vector space model to match our query document
to these candidates.
We pick candidates whose score is greater than a preset
threshold. Then we look up the category of these entities in our
database.
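The window and scoring steps might look like the sketch below. It assumes raw term-frequency vectors and cosine similarity, since the slides do not say which term weighting is used (TF-IDF is a common alternative). The window size, the threshold value, and the candidates dict (doc_id mapped to that document's tokens) are all placeholders for illustration.

```python
import math
from collections import Counter


def context_window(tokens, center, size):
    """Return up to `size` tokens centered on index `center`."""
    half = size // 2
    return tokens[max(0, center - half): center + half + 1]


def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def select_candidates(tokens, center, candidates, size=5, threshold=0.2):
    """Keep candidates whose similarity to the context window exceeds threshold.

    candidates is a hypothetical dict: doc_id -> list of document tokens.
    """
    window = context_window(tokens, center, size)
    scored = ((doc_id, cosine(window, doc_tokens))
              for doc_id, doc_tokens in candidates.items())
    return [(doc_id, score) for doc_id, score in scored if score > threshold]
```

The surviving doc_ids are the ones whose categories would then be looked up in the database.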
9. Zesch et al. evaluate the usefulness of Wikipedia as a lexical
semantic resource and compare it to more traditional
resources such as dictionaries, thesauri, and semantic wordnets.
Bunescu and Paşca study how to use Wikipedia for detecting
and disambiguating NEs in open-domain text.
10. R. C. Bunescu and M. Paşca. Using encyclopedic knowledge for
named entity disambiguation. In Proceedings of
EACL’2006, 2006.
R. Schenkel, F. M. Suchanek, and G. Kasneci. YAWN: A semantically
annotated Wikipedia XML corpus. In Proceedings of
BTW’2007, 2007.
T. Zesch, I. Gurevych, and M. Mühlhäuser. Analyzing and
accessing Wikipedia as a lexical semantic resource. In
Proceedings of Biannual Conference of the Society for
Computational Linguistics and Language Technology, 2007.
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information
Retrieval. Addison Wesley, 1999.