Wikipedia for Semantic Search
1. Wikitology
Wikipedia as an Ontology
Zareen Syed, Tim Finin
and Anupam Joshi
University of Maryland
Baltimore County
zarsyed1@umbc.edu, finin@cs.umbc.edu, joshi@cs.umbc.edu
2. Outline
• Introduction and motivation
• Wikipedia
• Methodology and Experiments
• Evaluation
• Future Work Directions
• Conclusion
• intro • wikipedia • experiments • evaluation • next • conclusion •
3. Introduction
• Identifying the topics and concepts
associated with a document or collection
of documents is a common task for
many applications and can help in:
– Annotation and categorization of documents
in a corpus.
– Modelling user interests
– Business intelligence
– Selecting Advertisements
4. Motivation
• Problem: describe what an analyst has
been working on to support collaboration
• Idea:
– track documents she reads
– map these to terms in an ontology
– aggregate to produce a short list of topics
5. Approach
• Use Wikipedia articles and categories as
ontology terms
• Categories as Generalized Concepts
• Articles as Specialized Concepts
• How to map the documents she reads to the
ontology terms?
– Use document to Wiki-article similarity for the
mapping
• How to aggregate to get a shorter list?
– Use spreading activation algorithm for aggregation
6. What’s a document about?
• Two common approaches:
(1) Statistical Approach
Select words and phrases using TF-IDF that
characterize the document
(2) Controlled Vocabulary or Ontology
Map document to a list of terms from a
controlled vocabulary or ontology
• First approach is flexible and does not require
creating and maintaining an ontology
• Second approach can tie documents to a rich
knowledge base
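A minimal sketch of the statistical approach, assuming a toy three-document corpus. The helper name `top_terms`, the whitespace tokenization and the IDF variant log(N/df) are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def top_terms(docs, index, n=3):
    """Return the n highest TF-IDF terms of docs[index] (hypothetical helper)."""
    tokenized = [d.lower().split() for d in docs]
    num_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    tf = Counter(tokenized[index])
    length = len(tokenized[index])
    scores = {t: (c / length) * math.log(num_docs / df[t]) for t, c in tf.items()}
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]

# Toy corpus (made up for illustration).
docs = [
    "storm warning storm winds",
    "crop rotation improves soil",
    "soil and winds shape the crop",
]
keywords = top_terms(docs, 0, n=2)
print(keywords)
```

Terms that are frequent in one document but rare in the rest of the corpus rank highest, which is why this approach needs no ontology.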
7. Wikitology !
• Using Wikipedia as an ontology offers the best of both
approaches
• Each article is a concept in the ontology
• Terms linked via Wikipedia’s category system and
inter-article links
• It’s a consensus ontology created, kept current and
maintained by a diverse community
• Overall content quality is high
• Terms have unique IDs (URLs) and are “self
describing” for people
• Underlying graphs provide structure: categories,
article links
8. Wikipedia Graph Structures
• Wikipedia Category graph is a thesaurus
• Wikipedia Page links graph is similar to WWW Network
[Figure: network of linked page icons]
9. Methods
• Goal: given one or more documents, compute
a ranked list of the top N Wikipedia articles
and/or categories that describe it.
• Basic metric: document similarity between
Wikipedia article and document(s)
• Variations:
– role of categories
– eliminating uninteresting articles
– use of spreading activation
– using similarity scores for weighing links
– number of spreading activation pulses
– individual or set of query documents, etc.
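A minimal sketch of the basic metric, assuming plain bag-of-words vectors and a tiny made-up article collection. The names `cosine` and `top_n_articles` are hypothetical, and a real index would normally use TF-IDF weights rather than raw counts.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_n_articles(query, articles, n=2):
    """Rank articles by similarity to the query text and keep the top n."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(text.lower().split())), title)
              for title, text in articles.items()]
    scored.sort(key=lambda x: (-x[0], x[1]))
    return [(title, round(score, 3)) for score, title in scored[:n]]

# Toy stand-ins for Wikipedia article text (made up for illustration).
articles = {
    "Thunderstorm": "storm lightning thunder rain",
    "Crop_rotation": "crop soil rotation farming",
    "Wind": "wind storm air pressure",
}
ranked = top_n_articles("thunder storm warning", articles)
print(ranked)
```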
10. Spreading Activation
• In associative retrieval the idea is that it is
possible to retrieve relevant documents if
they are associated with other documents
that have been considered relevant by the
user.
• The documents can be represented as
nodes and their associations as links in a
network.
13. Spreading Activation
• Some nodes will have higher activation than others
• Constraints:
– Distance
– Fan out
– Path constraints
– Activation threshold
14. Method 1
Using Wikipedia Article Text and Categories to Predict
Concepts
[Figure: the input query doc(s) are matched to similar Wikipedia articles by cosine similarity; example scores 0.8, 0.3, 0.2, 0.2, 0.1]
15. Method 1
Using Wikipedia Article Text and Categories to Predict
Concepts
[Figure: the input query doc(s) are matched to similar Wikipedia articles by cosine similarity (example scores 0.8, 0.3, 0.2, 0.2, 0.1); the articles connect to the Wikipedia Category Graph]
16. Method 1
Using Wikipedia Article Text and Categories to Predict Concepts
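Output: Rank Categories by
• Links
• Cosine similarity
[Figure: the input query doc(s) are matched to similar Wikipedia articles by cosine similarity (example scores 0.8, 0.3, 0.2, 0.2, 0.1); categories in the Wikipedia Category Graph are ranked (example node values 0.9 and 3)]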
17. Method 2
Using Spreading Activation on Category Links Graph to get
Aggregated Concepts
• Input: query doc(s); similar Wikipedia articles found by cosine similarity (example scores 0.8, 0.3, 0.2, 0.2, 0.1)
• Spreading Activation on the Wikipedia Category Graph
• Input Function: $I_j = \sum_i O_i$
• Output Function: $O_j = \frac{A_j}{D_j \cdot k}$
• Output: Ranked Concepts based on Final Activation Score
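A minimal sketch of the input and output functions on this slide. The slide does not define the symbols, so this assumes A_j is the activation of node j, D_j its out-degree and k the pulse number, and that a node adds the input it receives to its activation. The function name `spread_category_activation` and the toy graph are hypothetical.

```python
def spread_category_activation(graph, initial, pulses=2):
    """graph: node -> list of nodes it passes activation to
    (assumed here: article or category -> parent categories).
    initial: node -> starting activation, e.g. a cosine similarity."""
    activation = dict(initial)
    for k in range(1, pulses + 1):
        # Output function: O_j = A_j / (D_j * k), D_j assumed to be the out-degree.
        outputs = {}
        for node, act in activation.items():
            degree = len(graph.get(node, []))
            if degree:
                outputs[node] = act / (degree * k)
        # Input function: I_j = sum of O_i over all nodes i linking to j.
        incoming = {}
        for node, out in outputs.items():
            for target in graph[node]:
                incoming[target] = incoming.get(target, 0.0) + out
        # Assumed update rule: received input is added to the activation.
        for node, value in incoming.items():
            activation[node] = activation.get(node, 0.0) + value
    return activation

# Toy graph and scores (made up for illustration).
graph = {
    "Thunderstorm": ["Weather_hazards", "Severe_weather"],
    "Wind": ["Weather_hazards"],
    "Weather_hazards": ["Weather"],
    "Severe_weather": ["Weather"],
}
result = spread_category_activation(graph, {"Thunderstorm": 0.8, "Wind": 0.2}, pulses=2)
print(sorted(result.items(), key=lambda kv: -kv[1]))
```

With one pulse only the direct parent categories are activated; the second pulse reaches the more general "Weather" node, matching the observation that more pulses give more generalized concepts.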
18. • Can we predict concepts that are NOT
present in the category hierarchy?
• Use the article concepts!
• But How?
19. Method 3
Using Spreading Activation on Article Links Graph
• Input: query doc(s); similar Wikipedia articles found by cosine similarity
• Threshold: ignore spreading activation to articles with less than 0.4 cosine similarity score
• Edge weights: cosine similarity between linked articles
• Spreading Activation on the Wikipedia Article Links Graph
• Node Input Function: $I_j = \sum_i O_i w_{ij}$
• Node Output Function: $O_j = \frac{A_j}{k}$
• Output: Ranked Concepts based on Final Activation Score
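A minimal sketch of the weighted variant on this slide. It assumes A_j is the activation of node j, k is the pulse number, and that the 0.4 threshold means links whose weight is below 0.4 are not followed; the additive update rule, the function name `spread_article_activation` and the toy edges are likewise assumptions.

```python
def spread_article_activation(edges, initial, pulses=2, threshold=0.4):
    """edges: (source, target) -> weight w_ij, assumed to be the cosine
    similarity between the two linked articles.
    initial: node -> starting activation."""
    activation = dict(initial)
    for k in range(1, pulses + 1):
        incoming = {}
        for (src, dst), weight in edges.items():
            # Assumed reading of the threshold: skip weakly similar links.
            if weight < threshold or src not in activation:
                continue
            output = activation[src] / k  # O_i = A_i / k
            # I_j = sum over i of O_i * w_ij
            incoming[dst] = incoming.get(dst, 0.0) + output * weight
        # Assumed update rule: received input is added to the activation.
        for node, value in incoming.items():
            activation[node] = activation.get(node, 0.0) + value
    return activation

# Toy article links with made-up similarity weights.
edges = {("A", "B"): 0.5, ("A", "C"): 0.3, ("B", "D"): 0.8}
scores = spread_article_activation(edges, {"A": 1.0}, pulses=2)
print(scores)
```

Node C is never activated because its only incoming link falls below the threshold, while D is reached in the second pulse through B.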
20. Preliminary Experiments
• An initial informal evaluation compared results against
our own judgments
• Downloaded articles from internet and predicted
concepts
• Using Single Document and Group of Related
Documents
Prediction for Single Test Document
Test Document Title: Weather Prediction of thunder storms (CNN)
– Method 1 (Ranking Categories Directly): “Weather_Hazards”, “Winds”, “Severe_weather_and_convection”
– Method 2 (Spreading Activation, Pulses=2): “Weather_Hazards”, “Current_events”, “Types_of_cyclone”
– Method 2 (Spreading Activation, Pulses=3): “Meterology”, “Nature”, “Weather”
More pulses -> More Generalized Concepts
21. Preliminary Experiments
Prediction for Set of Test Documents
Test Document Titles in the Set (Wikipedia Articles):
Crop_rotation, Permaculture, Beneficial_insects, Neem, Lady_Bird, Principles_of_Organic_Agriculture, Rhizobia, Biointensive, Inter-cropping, Green_manure
Predicted concepts:
– Method 1 (Ranking Categories Directly): Agriculture, Sustainable_technologies, Crops, Agronomy, Permaculture
– Method 2 (Spreading Activation on Category Links Graph, 2 pulses): Skills, Applied_sciences, Land_management, Food_industry, Agriculture
– Method 3 (Spreading Activation on Article Links Graph, 2 pulses): Organic_farming, Sustainable_agriculture, Organic_gardening, Agriculture, Companion_planting
Figure callout: Concept not in the Category Hierarchy
22. Evaluation
• Select wikipedia articles randomly and predict their
categories and links
• Sort the results based on Average Similarity
[Figure: query doc(s) matched to five similar articles with cosine similarities 0.5, 0.9, 0.7, 0.2, 0.8]
Average Similarity = $\frac{0.5 + 0.9 + 0.7 + 0.2 + 0.8}{5} = 0.62$
23. Evaluation
• Observation: Articles are often linked with both super and sub categories
[Figure: example hierarchy with nodes Medicines, Medical Treatments, Antibiotics, Tetracycline, Oxytetracycline]
• If our system predicts a category three levels higher in the hierarchy than the original
category, we consider our prediction to be correct
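A minimal sketch of this correctness rule, assuming "three levels higher" means the predicted category is the original category itself or an ancestor at most three levels up. The `parents` mapping, the function name `is_correct_prediction` and the "Health" node are hypothetical.

```python
from collections import deque

def is_correct_prediction(predicted, original_categories, parents, max_levels=3):
    """True if `predicted` is an original category or lies at most
    `max_levels` above one of them in the category hierarchy."""
    queue = deque((cat, 0) for cat in original_categories)
    seen = set(original_categories)
    while queue:
        cat, depth = queue.popleft()
        if cat == predicted:
            return True
        if depth == max_levels:
            continue  # do not climb beyond the allowed number of levels
        for parent in parents.get(cat, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return False

# Toy hierarchy loosely following the slide's example ("Health" is made up).
parents = {
    "Tetracycline_antibiotics": ["Antibiotics"],
    "Antibiotics": ["Medical_treatments"],
    "Medical_treatments": ["Medicines"],
    "Medicines": ["Health"],
}
print(is_correct_prediction("Medicines", ["Tetracycline_antibiotics"], parents))
print(is_correct_prediction("Health", ["Tetracycline_antibiotics"], parents))
```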
24. Category Prediction Evaluation
[Figure legend: M1 = Method 1; SA1 = Spreading Activation, pulse(s) = 1; SA2 = Spreading Activation, pulse(s) = 2]
• Spreading activation with two pulses worked best
• Only considering articles with similarity > 0.5 was a good threshold
25. Article Links Prediction Evaluation
• Spreading activation with one pulse worked best
• Only considering articles with similarity > 0.5 was a
good threshold
[Figure: Similar Documents, N = 5; Spreading Activation pulses = 1]
26. Prediction Accuracy
• Issues:
– The extent to which the concept is represented in
Wikipedia, e.g., there is a category related to the
fruit apple but not for mango
– Presence of links between semantically related
concepts
– Presence of links between irrelevant articles (term
definitions, country names)
• Possible Solutions:
– Use Average Similarity Score to measure the extent
of concept representation within Wikipedia
– Use existing semantic relatedness measures to
handle presence or absence of semantically related
links
27. Potential Applications
• Recommending categories and links
for new Wikipedia articles
• Introducing new Wikipedia categories
• Automating the process of building a
Wiki from a corpus
28. Future Work
• Classifying links in Wikipedia using
machine learning techniques
– To predict semantic type of article
– To control flow of spreading activation
• Exploit parallel execution on cluster
• Refining Wikipedia ontology
• Bridging the gap between Wikipedia and
formal ontologies
29. Document Expansion with Wikipedia Derived Ontology Terms*
• Expansion of each TREC document using Wikitology
terms
• We are still working on refining the methodology
Doc: FT921-4598 (3/9/92)
... Alan Turing, described as a brilliant
mathematician and a key figure in the breaking of the
Nazis' Enigma codes. Prof IJ Good says it is as well
that British security was unaware of Turing's
homosexuality, otherwise he might have been fired
'and we might have lost the war'. In 1950 Turing
wrote the seminal paper 'Computing Machinery And
Intelligence', but in 1954 killed himself ...
Turing_machine, Turing_test, Church_Turing_thesis,
Halting_problem, Computable_number, Bombe,
Alan_Turing, Recursion_theory, Formal_methods,
Computational_models, Theory_of_computation,
Theoretical_computer_science,
Artificial_Intelligence
* In collaboration with Paul McNamee, Johns Hopkins University Applied Physics Laboratory
30. Conclusion
• We tested the idea of using Wikitology
for describing documents and proposed
different methods using the Wikipedia
article text, category links and article
links
• Suggested improvements
• Using average similarity to judge the
accuracy of prediction
• Easily extendable to other wikis and
collaborative KBs, e.g., Intellipedia,
Freebase
My talk today is about our research on using Wikipedia as an ontology, which we refer to as Wikitology, for describing documents.
This is the outline of the talk. I will start with the introduction and motivation, then give some details about Wikipedia. After that I will discuss the methods we proposed and the experiments we conducted, then how we evaluated our methods, then our future work directions, and finally the conclusion.
Identifying the topics and concepts associated with a document or collection of documents is a common task for many applications. It helps primarily with the annotation and categorization of the documents themselves. If we know what documents the user is looking at or reading, we can use them to model the user’s interests. We can also use it for business intelligence and even to aid in advertising.
Our motivation was the following problem: a team of analysts is working on a set of common tasks. They work sometimes independently and sometimes in a tightly coordinated group. They can collaborate more effectively if they are aware of what their colleagues are working on.
Use Wikipedia articles and categories as ontology terms. Since Wikipedia is encyclopedic in nature, we consider each category and each article title as a concept in Wikitology. Categories can be considered broad or more generalized concepts, whereas article titles can be considered specialized concepts. For the rest, just read the slides.
How to describe what the document is about? Two common approaches.
A famous algorithm an
• Distance constraint: first-order, second-order or third-order nodes
• Fan-out constraint: high-degree nodes should be less activated
• Path constraint: certain paths may allow more flow than others
• Activation constraint: threshold depending on the overall activation of the network
Utilizing Wikipedia Article Text
• Input 'm' test documents as search queries to the Wikipedia index
• Retrieve the top 'n' matching documents for each test document
• Extract the Wikipedia categories of the matching documents and score the categories using two scoring schemes
– Scoring Scheme 1: Category score = number of occurrences
– Scoring Scheme 2: Category score = sum of the tf.idf scores of the matching documents for each occurrence of the category
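The two scoring schemes can be sketched as follows, assuming each retrieved article comes with its retrieval score and category list. The names `score_categories` and `rank`, and the toy matches, are hypothetical.

```python
from collections import defaultdict

def score_categories(matches):
    """matches: list of (retrieval_score, categories) for the top-n
    articles retrieved for a test document.
    Returns (scheme1, scheme2) as category -> score dictionaries."""
    by_count = defaultdict(int)
    by_score = defaultdict(float)
    for score, categories in matches:
        for cat in categories:
            by_count[cat] += 1      # Scheme 1: number of occurrences
            by_score[cat] += score  # Scheme 2: sum of matching-document scores
    return dict(by_count), dict(by_score)

def rank(scores):
    """Categories sorted by score, highest first; ties broken alphabetically."""
    return sorted(scores, key=lambda c: (-scores[c], c))

# Toy retrieval results (made up for illustration).
matches = [
    (0.8, ["Weather_hazards", "Storms"]),
    (0.3, ["Winds"]),
    (0.2, ["Weather_hazards", "Winds"]),
]
counts, sums = score_categories(matches)
print(rank(counts))
print(rank(sums))
```

The two schemes can disagree: a category attached to a single highly similar article ("Storms") outranks one attached to two weakly similar articles ("Winds") under Scheme 2 but not under Scheme 1.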
Utilizing Wikipedia Article Text and Article Links Graph with Spreading Activation
• Input 'm' test documents as search queries to the Wikipedia index
• Retrieve the top 'n' matching documents for each test document
• Select them as the initial set of activated nodes in the article links graph
Can we predict concepts not in the category hierarchy?
An initial informal evaluation compared results against our own judgments. It was used to select promising combinations of ideas and parameter settings. The formal evaluation selected Wikipedia articles randomly and predicted their categories and links.
• Experiment Set No. 1. Objective: to observe how well Wikipedia categories represent concepts in individual documents. Methods: Method 1 & 2
• Experiment Set No. 2. Objective: repeat Experiment 1 but with a set of test documents. Methods: Method 1 & 2
• Experiment Set No. 3. Objective: to observe whether we can get additional information about documents using the article links graph as well, using Wikipedia articles themselves as the test set. Methods: Method 1, 2 & 3
Test document: Weather Prediction of thunder storms (taken from CNN Weather Prediction). Predicted concepts, in the order listed: “Weather_Hazards”, “Winds”, “Severe_weather_and_convection”; “Weather_Hazards”, “Current_events”, “Types_of_cyclone”; “Meterology”, “Nature”, “Weather”; “Main_topic_classifications”, “Fundamental”, “Atmospheric_sciences”
Antibiotics -> medicines -> medical treatment -> medicines
• Select 100 Wikipedia articles randomly for testing; remove them from the Wikipedia index and graphs
• For each, use the methods to predict categories and linked articles
• Compare results using precision and recall to known categories and linked articles
• If our system predicts a category three levels higher in the hierarchy than the original category, we consider our prediction correct
• Compare predicted links with actual links
Tetracycline_antibiotics
These graphs show the precision, average precision, recall and f-measure metrics as the average similarity threshold varies from 0.1 to 0.8. Legend label M1 is Method 1, M1(1) and M1(2) are Method 1 with scoring schemes 1 and 2, respectively, and SA1 and SA2 represent the use of spreading activation with one and two pulses, respectively.
• Considering top three level categories
• Spreading activation with two pulses worked best
• Only considering articles with similarity > 0.5 was a good threshold
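For reference, precision, recall and f-measure for one test article can be computed as in this minimal sketch. It uses the standard set-based definitions with the balanced F1 variant, which is an assumption since the notes do not say which f-measure was used; the function name and category sets are illustrative.

```python
def precision_recall_f1(predicted, actual):
    """Set-based precision, recall and balanced F-measure."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Toy prediction against toy ground truth (made up for illustration).
p, r, f = precision_recall_f1(
    ["Agriculture", "Crops", "Skills"],
    ["Agriculture", "Crops", "Agronomy", "Permaculture"],
)
print(round(p, 3), round(r, 3), round(f, 3))
```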
• Single document in many categories: George W. Bush is included in about 29 categories
• Links between articles belonging to very different categories: John F. Kennedy has a link to “coincidence theory”, which belongs to Mathematical Analysis/Topology/Fixed Points
• Number of articles within a category: some categories are underrepresented whereas others have many articles
• Administrative categories, e.g., “Clean up from Sep 2006”, “Articles with unsourced statements”
• Links to words in an article, e.g., if the word “United States” appears in the document, that word might be linked to the page on “United States”
In Table 4 we report retrieval performance using mean average precision (MAP) and precision at ten documents (P@10). When normal relevance feedback is used, a 19% relative improvement is seen over the base run. The concepts+rf condition is about as effective as base+rf. Mean average precision suffers a 2.8% relative drop and P@10 gains 1.6%; neither difference is statistically significant.
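A minimal sketch of the two retrieval metrics named above, using their standard definitions. The function names, document IDs and relevance judgments are made up for illustration and are not the TREC data.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy ranking for a single query.
ranked_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = {"d1", "d3"}
ap = average_precision(ranked_docs, relevant_docs)
p_at_2 = precision_at_k(ranked_docs, relevant_docs, k=2)
print(round(ap, 4), p_at_2)
```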