Wikipedia for Semantic Search

Wikitology
Wikipedia as an Ontology
Zareen Syed, Tim Finin
and Anupam Joshi

University of Maryland
Baltimore County

zarsyed1@umbc.edu, finin@cs.umbc.edu, joshi@cs.umbc.edu

Outline
• Introduction and motivation
• Wikipedia
• Methodology and Experiments
• Evaluation
• Future Work Directions
• Conclusion


Introduction
• Identifying the topics and concepts
associated with a document or collection
of documents is a common task for
many applications and can help in:
– Annotation and categorization of documents
in a corpus.
– Modelling user interests
– Business intelligence
– Selecting Advertisements


Motivation
• Problem: describe what an analyst has
been working on to support collaboration

• Idea:
– track documents she reads
– map these to terms in an ontology
– aggregate to produce a short list of topics


Approach
• Use Wikipedia articles and categories as
ontology terms
• Categories as Generalized Concepts
• Articles as Specialized Concepts
• How to map the documents she reads to the
ontology terms?
– Use document to Wiki-article similarity for the
mapping
• How to aggregate to get a shorter list?
– Use spreading activation algorithm for aggregation


What’s a document about?
• Two common approaches:
(1) Statistical Approach
Select words and phrases using TF-IDF that
characterize the document
(2) Controlled Vocabulary or Ontology
Map document to a list of terms from a
controlled vocabulary or ontology
• First approach is flexible and does not require
creating and maintaining an ontology
• Second approach can tie documents to a rich
knowledge base
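
A minimal sketch of the statistical approach, assuming a crude regex tokenizer and a smoothed IDF. The function names, the toy corpus and the smoothing formula are illustrative assumptions, not taken from the slides.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphabetic runs (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_keywords(doc, corpus, top_k=3):
    """Return the top_k terms of `doc` ranked by TF-IDF against `corpus`."""
    corpus_tokens = [set(tokenize(d)) for d in corpus]
    term_counts = Counter(tokenize(doc))
    n_docs = len(corpus)
    scores = {}
    for term, count in term_counts.items():
        df = sum(1 for tokens in corpus_tokens if term in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumption)
        scores[term] = count * idf
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

# Hypothetical toy corpus.
corpus = [
    "storm warning issued for the coast",
    "the market closed higher today",
    "storm winds and heavy rain follow the storm",
]
keywords = tfidf_keywords(corpus[2], corpus)
print(keywords)
```

A word such as "the" that occurs in every document gets the lowest IDF, so it drops out of the keyword list, while a repeated, less common word such as "storm" rises to the top.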

Wikitology!
• Using Wikipedia as an ontology offers the best of both
approaches
• Each article is a concept in the ontology
• Terms linked via Wikipedia’s category system and
inter-article links
• It’s a consensus ontology created, kept current and
maintained by a diverse community
• Overall content quality is high
• Terms have unique IDs (URLs) and are “self
describing” for people
• Underlying graphs provide structure: categories,
article links

Wikipedia Graph Structures
• Wikipedia Category graph is a thesaurus
• Wikipedia Page links graph is similar to the WWW network


Methods
• Goal: given one or more documents, compute
a ranked list of the top N Wikipedia articles
and/or categories that describe it.
• Basic metric: document similarity between
Wikipedia article and document(s)
• Variations:
– role of categories
– eliminating uninteresting articles
– use of spreading activation
– using similarity scores for weighting links
– number of spreading activation pulses
– individual or set of query documents, etc.
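
The basic metric can be sketched as follows, assuming raw term-count vectors rather than any particular weighting scheme. The article texts and the helper names `term_vector`, `cosine` and `top_n_articles` are hypothetical.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words vector with raw term counts (an assumption, not TF-IDF)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_n_articles(query_text, articles, n=5):
    """Rank article titles by cosine similarity to the query document."""
    q = term_vector(query_text)
    scored = [(title, cosine(q, term_vector(body))) for title, body in articles.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]

# Hypothetical stand-ins for Wikipedia article text.
articles = {
    "Thunderstorm": "a thunderstorm is a storm with lightning and thunder",
    "Stock_market": "a stock market is where shares are traded",
    "Wind": "wind is the flow of air and a storm brings strong wind",
}
query = "forecasters predict a severe storm with thunder and lightning"
ranked = top_n_articles(query, articles, n=2)
print(ranked)
```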

Spreading Activation
• In associative retrieval the idea is that it is
possible to retrieve relevant documents if
they are associated with other documents
that have been considered relevant by the
user.
• The documents can be represented as
nodes and their associations as links in a
network.


• Start with an initial set of activated nodes
• At each pulse/iteration, spread activation to adjacent nodes
• Some nodes will have higher activation than others
• Constraints:
– Distance
– Fan out
– Path constraints
– Activation threshold
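
A generic sketch of pulse-based spreading activation with three of the listed constraints (distance, fan out, activation threshold). Path constraints are left out for brevity. The `decay` factor, the even split among neighbours and all parameter names are assumptions made for illustration.

```python
def spread(graph, seeds, pulses=2, threshold=0.1, max_fan_out=10, decay=0.5):
    """Spread activation from `seeds` over `graph` for a fixed number of pulses.

    graph: dict mapping node -> list of adjacent nodes
    seeds: dict mapping node -> initial activation
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(pulses):  # distance constraint: at most `pulses` hops
        incoming = {}
        for node, value in frontier.items():
            neighbours = graph.get(node, [])
            if value < threshold:  # activation threshold
                continue
            if not neighbours or len(neighbours) > max_fan_out:  # fan-out constraint
                continue
            share = decay * value / len(neighbours)
            for neighbour in neighbours:
                incoming[neighbour] = incoming.get(neighbour, 0.0) + share
        for node, value in incoming.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = incoming
    return activation

# Hypothetical toy network.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
two_pulses = spread(graph, {"A": 1.0}, pulses=2)
one_pulse = spread(graph, {"A": 1.0}, pulses=1)
blocked = spread(graph, {"A": 1.0}, pulses=2, max_fan_out=1)
print(two_pulses)
```

Node D is only reached on the second pulse, and it collects activation from both B and C, which is how nodes with several activated neighbours end up with higher scores.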

Method 1
Using Wikipedia Article Text and Categories to Predict Concepts
• Input: query doc(s)
• Find similar Wikipedia articles using cosine similarity (example scores in the figure: 0.8, 0.3, 0.2, 0.2, 0.1)
• Follow the similar articles into the Wikipedia Category Graph
• Output: rank categories by
– Links
– Cosine similarity
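
One way to read Method 1 is sketched below, assuming that "Links" means the number of similar articles linked to a category and "Cosine similarity" means the summed similarity of those articles. Both readings, the function name and the toy data are assumptions.

```python
from collections import defaultdict

def rank_categories(similar_articles, article_categories):
    """Rank categories two ways: by link count and by summed similarity.

    similar_articles: dict article -> cosine similarity to the query doc(s)
    article_categories: dict article -> list of its categories
    """
    link_count = defaultdict(int)
    similarity_sum = defaultdict(float)
    for article, similarity in similar_articles.items():
        for category in article_categories.get(article, []):
            link_count[category] += 1
            similarity_sum[category] += similarity
    by_links = sorted(link_count, key=lambda c: (-link_count[c], c))
    by_similarity = sorted(similarity_sum, key=lambda c: (-similarity_sum[c], c))
    return by_links, by_similarity

# Hypothetical similar articles and category assignments.
similar = {"Thunderstorm": 0.8, "Tornado": 0.3, "Wind": 0.2}
categories = {
    "Thunderstorm": ["Weather_hazards", "Severe_weather"],
    "Tornado": ["Weather_hazards", "Winds"],
    "Wind": ["Winds"],
}
by_links, by_similarity = rank_categories(similar, categories)
print(by_links)
print(by_similarity)
```

The two rankings can disagree: a category linked from several weakly similar articles may outrank, by link count, a category linked from a single strongly similar article.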

Method 2
Using Spreading Activation on Category Links Graph to get Aggregated Concepts
• Input: query doc(s) and similar Wikipedia articles (cosine similarity)
• Spread activation over the Wikipedia Category Graph
• Input Function: $I_j = \sum_i O_i$
• Output Function: $O_j = \frac{A_j}{D_j \cdot k}$
• Output: Ranked Concepts based on Final Activation Score
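
A sketch of the input and output functions on this slide. The slide does not define the symbols, so the code assumes that A_j is the accumulated activation of node j, D_j its out-degree and k the pulse number, and that only nodes activated in the previous pulse fire. The category names and seed scores are hypothetical.

```python
def spread_categories(parents, seeds, pulses=2):
    """Spreading activation over a category graph.

    parents: dict category -> list of parent categories (out-links)
    seeds:   dict category -> initial activation
    Uses I_j = sum_i O_i and O_j = A_j / (D_j * k).
    """
    activation = dict(seeds)
    active = set(seeds)
    for k in range(1, pulses + 1):
        inputs = {}
        for j in active:
            out_links = parents.get(j, [])
            if not out_links:
                continue
            output = activation[j] / (len(out_links) * k)  # O_j = A_j / (D_j * k)
            for parent in out_links:
                inputs[parent] = inputs.get(parent, 0.0) + output  # I_j = sum_i O_i
        for j, value in inputs.items():
            activation[j] = activation.get(j, 0.0) + value  # assumed update: A_j += I_j
        active = set(inputs)
    return sorted(activation.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical slice of a category graph.
parents = {
    "Weather_hazards": ["Weather"],
    "Winds": ["Weather"],
    "Weather": ["Meteorology", "Nature"],
}
ranked = spread_categories(parents, {"Weather_hazards": 0.8, "Winds": 0.3}, pulses=2)
print(ranked)
```

Dividing by the out-degree and the pulse number damps nodes with many out-links and later pulses, so activation fades as it moves toward very general categories.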

• Can we predict concepts that are NOT
present in the category hierarchy?
• Use the article concepts!
• But How?

Method 3
Using Spreading Activation on Article Links Graph
• Input: query doc(s) and similar Wikipedia articles
• Spread activation over the Wikipedia Article Links Graph
• Threshold: ignore spreading activation to articles with less than 0.4 cosine similarity score
• Edge Weights: cosine similarity between linked articles
• Node Input Function: $I_j = \sum_i O_i w_{ij}$
• Node Output Function: $O_j = \frac{A_j}{k}$
• Output: Ranked Concepts based on Final Activation Score
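
A sketch of the weighted variant, assuming the 0.4 threshold applies to the edge weight (the cosine similarity between the two linked articles), that A_j accumulates the inputs and that k is the pulse number. The link weights below are made up for illustration.

```python
def spread_articles(links, seeds, pulses=1, min_weight=0.4):
    """Spreading activation over a weighted article link graph.

    links: dict article -> dict of linked article -> edge weight w_ij
    seeds: dict article -> initial activation
    Uses I_j = sum_i O_i * w_ij and O_j = A_j / k.
    """
    activation = dict(seeds)
    active = set(seeds)
    for k in range(1, pulses + 1):
        inputs = {}
        for i in active:
            output = activation[i] / k  # O_i = A_i / k
            for j, weight in links.get(i, {}).items():
                if weight < min_weight:  # threshold on similarity (assumed to be w_ij)
                    continue
                inputs[j] = inputs.get(j, 0.0) + output * weight  # I_j = sum_i O_i w_ij
        for j, value in inputs.items():
            activation[j] = activation.get(j, 0.0) + value
        active = set(inputs)
    return sorted(activation.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical link weights between articles.
links = {
    "Crop_rotation": {"Organic_farming": 0.6, "Country": 0.1},
    "Green_manure": {"Organic_farming": 0.5, "Agriculture": 0.45},
}
ranked = spread_articles(links, {"Crop_rotation": 1.0, "Green_manure": 1.0})
print(ranked)
```

The low-weight link to "Country" is never followed, which mirrors the goal of keeping activation away from loosely related articles.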

Preliminary Experiments
• An initial informal evaluation compared results against
our own judgments
• Downloaded articles from internet and predicted
concepts
• Using Single Document and Group of Related
Documents

Prediction for Single Test Document
Test Document Title: Weather Prediction of thunder storms (CNN)

Method 1: Ranking Categories Directly | Method 2: Spreading Activation, Pulses=2 | Method 2: Spreading Activation, Pulses=3
“Weather_Hazards” | “Weather_Hazards” | “Meteorology”
“Winds” | “Current_events” | “Nature”
“Severe_weather_and_convection” | “Types_of_cyclone” | “Weather”

More pulses -> More Generalized Concepts


Preliminary Experiments
Prediction for Set of Test Documents
Test Document Titles in the Set: (Wikipedia Articles)
Crop_rotation
Permaculture
Beneficial_insects
Neem
Lady_Bird
Principles_of_Organic_Agriculture
Rhizobia
Biointensive
Inter-cropping
Green_manure

Method 1: Ranking Categories Directly | Method 2 (2 pulses): Spreading Activation on Category Links Graph | Method 3 (2 pulses): Spreading Activation on Article Links Graph
Agriculture | Skills | Organic_farming
Sustainable_technologies | Applied_sciences | Sustainable_agriculture
Crops | Land_management | Organic_gardening
Agronomy | Food_industry | Agriculture
Permaculture | Agriculture | Companion_planting

Slide callout: “Concept not in the Category Hierarchy”


Evaluation
• Select Wikipedia articles randomly and predict their
categories and links
• Sort the results based on Average Similarity

Average Similarity
• Average of the cosine similarity scores between the query doc(s) and the similar articles
• Example: $\frac{0.5 + 0.9 + 0.7 + 0.2 + 0.8}{5} = 0.62$
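
The arithmetic on this slide, and the sorting step from the previous one, can be checked with a few lines. The document names and the second score list are hypothetical. Only the five scores 0.5, 0.9, 0.7, 0.2 and 0.8 come from the slide.

```python
from statistics import fmean

def average_similarity(scores):
    """Mean cosine similarity between a query doc and its similar articles."""
    return fmean(scores)

# "doc_a" uses the scores shown on the slide; "doc_b" is made up.
results = {
    "doc_a": [0.5, 0.9, 0.7, 0.2, 0.8],
    "doc_b": [0.3, 0.2, 0.1, 0.2, 0.2],
}
averages = {doc: average_similarity(scores) for doc, scores in results.items()}
# Sort test documents from best to worst represented.
ranked = sorted(averages, key=averages.get, reverse=True)
print(averages, ranked)
```

For the slide's scores the sum is 3.1, so the average is 3.1 / 5 = 0.62.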

Evaluation
• Observation: articles are often linked with both super and sub categories
• Example category hierarchy from the figure (top to bottom): Medicines, Medical Treatments, Antibiotics, Tetracyclin, Oxytetracyclin
• If our system predicts a category up to three levels higher in the hierarchy than the original category, we consider our prediction to be correct
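
The correctness rule can be sketched as a bounded upward search, assuming it means "the original category itself or an ancestor at most three levels above it". The parent map uses the names from the figure, but treating them as a single chain is an assumption. Real Wikipedia categories can have several parents, which the set-based frontier allows for.

```python
def within_levels(parents, original, predicted, max_levels=3):
    """True if `predicted` is `original` or an ancestor at most max_levels above it."""
    frontier = {original}
    for _ in range(max_levels + 1):  # level 0 is the original category itself
        if predicted in frontier:
            return True
        # Move one level up the hierarchy.
        frontier = {p for node in frontier for p in parents.get(node, [])}
        if not frontier:
            return False
    return False

# Hierarchy names taken from the figure; the chain structure is assumed.
parents = {
    "Oxytetracyclin": ["Tetracyclin"],
    "Tetracyclin": ["Antibiotics"],
    "Antibiotics": ["Medical Treatments"],
    "Medical Treatments": ["Medicines"],
}
three_up = within_levels(parents, "Oxytetracyclin", "Medical Treatments")
four_up = within_levels(parents, "Oxytetracyclin", "Medicines")
print(three_up, four_up)
```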

Category Prediction Evaluation
Legend:
– M1: Method 1
– SA1: Spreading Activation, pulse(s)=1
– SA2: Spreading Activation, pulse(s)=2

• Spreading activation with two pulses worked best
• Only considering articles with similarity > 0.5 was a good threshold


Article Links Prediction
Evaluation
• Spreading activation with one pulse worked best
• Only considering articles with similarity > 0.5 was a
good threshold

Similar Documents, N = 5
Spreading Activation pulses=1


Prediction Accuracy
• Issues:
– The extent to which the concept is represented in
Wikipedia, e.g., there is a category related to the
fruit apple but not one for mango
– Presence of links between semantically related
concepts
– Presence of links between irrelevant articles (term
definitions, country names)
• Possible Solutions:
– Use Average Similarity Score to measure the extent
of concept representation within Wikipedia
– Use existing semantic relatedness measures to
handle presence or absence of semantically related
links

Potential Applications
• Recommending categories and links
for new Wikipedia articles
• Introducing new Wikipedia categories
• Automating the process of building a
Wiki from a corpus

Future Work
• Classifying links in Wikipedia using
Machine learning techniques
– To Predict semantic type of article
– To control flow of spreading activation
• Exploit parallel execution on cluster
• Refining Wikipedia ontology
• Bridging the gap between Wikipedia and
formal ontologies


Document Expansion with Wikipedia Derived Ontology Terms*
• Expansion of each TREC document using Wikitology
terms
• We are still working on refining the methodology
Doc: FT921-4598 (3/9/92)
... Alan Turing, described as a brilliant
mathematician and a key figure in the breaking of the
Nazis' Enigma codes. Prof IJ Good says it is as well
that British security was unaware of Turing's
homosexuality, otherwise he might have been fired
'and we might have lost the war'. In 1950 Turing
wrote the seminal paper 'Computing Machinery And
Intelligence', but in 1954 killed himself ...

Turing_machine, Turing_test, Church_Turing_thesis,
Halting_problem, Computable_number, Bombe,
Alan_Turing, Recursion_theory, Formal_methods,
Computational_models, Theory_of_computation,
Theoretical_computer_science,
Artificial_Intelligence

* In collaboration with Paul McNamee, Johns Hopkins University Applied Physics Laboratory

Conclusion
• We tested the idea of using Wikitology
for describing documents and proposed
different methods using the Wikipedia
article text, category links and article
links
• Suggested improvements
• Using average similarity to judge the
accuracy of prediction
• Easily extendable to other wikis and
collaborative KBs, e.g., Intellipedia,
Freebase


Thank you

Questions and Suggestions?
