SlideShare una empresa de Scribd logo
1 de 33
Wikitology
Wikipedia as an Ontology
Zareen Syed, Tim Finin
  and Anupam Joshi



 University of Maryland
   Baltimore County

            zarsyed1@umbc.edu, finin@cs.umbc.edu, joshi@cs.umbc.edu
Outline
• Introduction and motivation
• Wikipedia
• Methodology and Experiments
• Evaluation
• Future Work Directions
• Conclusion


 • intro • wikipedia • experiments • evaluation • next • conclusion •
Introduction
• Identifying the topics and concepts
  associated with a document or collection
  of documents is a common task for
  many applications and can help in:
 – Annotation and categorization of documents
   in a corpus.
 – Modelling user interests
 – Business intelligence
 – Selecting Advertisements


• intro • wikipedia • experiments • evaluation • next • conclusion •
Motivation
• Problem: describe what an analyst has
  been working on to support collaboration

• Idea:
 – track documents she reads
 – map these to terms in an ontology
 – aggregate to produce a short list of topics




• intro • wikipedia • experiments • evaluation • next • conclusion •
Approach
• Use Wikipedia articles and categories as
  ontology terms
• Categories as Generalized Concepts
• Articles as Specialized Concepts
• How to map the documents she reads to the
  ontology terms?
  – Use document to Wiki-article similarity for the
    mapping
• How to aggregate to get a shorter list?
  – Use spreading activation algorithm for aggregation

• intro • wikipedia • experiments • evaluation • next • conclusion •
What’s a document
            about?
• Two common approaches:
 •   Statistical Approach
     Select words and phrases using TF-IDF that
     characterize the document
 (2) Controlled Vocabulary or Ontology
     Map document to a list of terms from a
     controlled vocabulary or ontology
• First approach is flexible and does not require
  creating and maintaining an ontology
• Second approach can tie documents to a rich
  knowledge base
• intro • wikipedia • experiments • evaluation • next • conclusion •
Wikitology !
• Using Wikipedia as an ontology offers the best of both
  approaches
• Each article is a concept in the ontology
• Terms linked via Wikipedia’s category system and
  inter-article links
• It’s a consensus ontology created, kept current and
  maintained by a diverse community
• Overall content quality is high
• Terms have unique IDs (URLs) and are “self
  describing” for people
• Underlying graphs provide structure: categories,
  article links
• intro • wikipedia • experiments • evaluation • next • conclusion •
Wikipedia Graph
                Structures
• Wikipedia Category graph is a
 thesaurus




• Wikipedia Page links
                                                       Ou r p lan s a re going ah ead , He ylig hen i s
                                                       get ting ticke ts, so let's put
                                                       th at in in ha rd pencil, but keep yo ur
                                                       erase r h and y a ll the sam e. M y lif e is                                                                                        Ou r p la n s a re going ahead , Heylighen is
                                                                                                                                                                                           getting t icke ts, so let 's put
                                                       re ally hectic, and I won 't be in a sane
                                                       pla ce ti ll m id -Apr il . We 'll p in                                                                                             th at in in ha rd pen cil, but keep your
                                                       dow n th e det ails later , but a t this po int                                                                                     erase r h and y a ll the s ame . My life is
                                                                                                                                                                                           re ally hect ic, and I won't be i n a sane
                                                       we we re consi dering Val onl y t o
                                                       ta lk on "T he M etasystem Tra nsition as                                                                                           pla ce ti ll m id -A pril . We 'll p in                       Ou r p la n s ar e going ah ead , He ylig hen is
                                                                                                                                                                                           dow n th e det ai ls lat er, b ut at t his point              getting t ic ke ts, so le t 's put
                                                       th e Qu ant um of E volu tion". Th is is
                                                       th e th eore tical ba se to t he P C P, wh ich I                                                                                    we we re consi deri ng Va l only to                           th at in in har d pen cil, but keep yo ur
                                                                                                                                                                                           ta lk on "T he M etasyst em Tr ansitio n as                   erase r hand y all the same . My lif e is
                                                       described t he for m of in my ta lk
                                                                                                                                                                                           th e Qu antum of E volu tion" . Th is is                      re ally hect ic , and I won 't be i n a sane
                                                       to W ES S. I t 's b asically be tw een F ran cis
                                                       and Val ho w would like to talk, or                                                                                                 th e th eore ti cal ba se to the P C P, wh ic h I             pla ce till mid -A pril. We 'll p in
                                                                                                                                                                                           described the f orm of in m y talk                            dow n th e detai ls la t er, b ut a t t his po int
                                                       bot h, or wha t.
                                                                                                                                                                                           to W ES S . It 's b asica lly be tw een F ran cis             we we re consideri ng Va l onl y t o
                                                                                                                                                                                           and V al ho w wou ld li k e to tal k, or                      ta lk on " The Metasyst em Tra nsitio n as




  graph is similar to WWW
                                                                                                                                                                                           both, or wha t.                                               th e Qu antum of E voluti on". Th is is
                                                                                                                                                                                                                                                         th e th eore ti cal base to the P C P, wh ic h I
                                                                                                                                                                                                                                                         described the f orm of in m y ta lk
                                                                                                                                                                                                                                                         to W ES S . It's basica l y be tw een F ran cis
                                                                                                                                                                                                                                                         and V al how wou ld li ke to t al k, or
                                                                                                                                                                                                                                                         both, or wha t.




                                                                                                                                                               Ou r p lan s a re going ah ead , He ylig hen i s
                                                                                                                                                               get t ing ticke ts , so let's put
                                                                                                                                                               th at i n in ha rd pencil, but keep yo ur
                                                   Ou r p lan s a re goi ng                                                                                    erase r h and y a ll the sam e. M y lif e is
                                                   ahe ad, Heyl igh en is                                                                                      re al ly hectic, and I won 't be in a sane
                                                   get ting t ick e ts, so                                                                                     pla ce ti ll m id- Apr il . We 'll p in
                                                   let's put                                                                                                   dow n th e details later , but a t this po int
                                                   th at in in ha rd                                                                                           we we re consi dering Val onl y t o




  Network
                                                   pen cil, but keep                                                                                           ta lk on "T he M etasystem Tra nsitio n as
                                                   your era ser han dy                                                                                         th e Qu ant um of E volu tion". Th is is                                         Ou r p lan s ar e going ahead , He ylig hen i s
                                                   all the sa m e. M y life                                                                                    th e th eore tical ba se to t he P C P, wh ich I                                 gett ing t icke ts, so le t 's put
                                                   Ou r p lan s a re goi ng ah ead , He ylig hen i s                                                           descri bed t he for m of in my ta lk                                             th at in in har d pen cil, but keep yo ur
                                                   get ting t ick e ts, so let's put                                                                           to W E SS. I t's b asically be tw een F ran cis                                  eras e r h and y all the same . My life is
                                                   th at in in ha rd pen cil, but keep yo ur                                                                   and Val ho w would like to talk, or                                              re al ly hect ic , and I won't be i n a sane
                                                   er ase r h and y a ll t he sam e. M y lif e is                                                              bot h, or what.                                                                  pla c e till mid -A pril. We 'll p in
                                                   re ally hect ic, and I won 't be in a sane                                                                                                                                                   dow n th e detai ls la t er, b ut at t his po int
                                                   pla ce till m id -April . We 'll p in                                                                                                                                                        we we re c onsideri ng Va l only to
                                                   dow n th e det ails later , but a t this po int                                                                                                                                              ta lk on "T he Metasyst em Tr ans it io n as
                                                   we we re consi dering Val onl y                                                                                                                                                              th e Qu ant um of E volution" . Th is is
                                                                                                                                                                                                                                                th e th eore ti cal base to the P C P, wh ich I
                                                                                                                                                                                                                                                described the f orm of in m y talk
                                                                                                                                                                                                                                                to W ES S. It's basica l y be tw een F ran c is
                                                                                                                                                                                                                                                and Val ho w wou ld li ke to tal k, or
                                                                                                                                                                                                                                                both, or wha t.
                                                                                                          Ou r p lan s ar e going ahead , He ylig hen i s
                                                                                                          get ting t ic ke ts, so le t 's put                                                O ur pla ns are go ing a head, H eyli ghen is
                                                                                                          th at in in har d pen cil, but keep yo ur                                          ge t in g ti cket s, so le t' s p ut
                                                                                                          erase r hand y all the same . My life is                                           that in in h ard pe ncil, bu t keep your
                                                                                                          re ally hect ic , and I won't be i n a sane                                        era ser han dy all t he sa me. My li fe is
                                                                                                          pla ce till mid -A pril. We 'll p in                                               rea lly he cti c, an d I wo n't be in a sane
                                                                                                          dow n th e detai ls la t er, b ut a t t his po int                                 place till m id-A pri l. W e'll pin
                                                                                                          we we re consideri ng Va l only to                                                 do wn the de ta ils la te r, but at this p oin t
                                                                                                          ta lk on " The Metasyst em Tr a nsit io n as                                       we w ere con side ring V al o nly to
                                                                                                          th e Qu antum of E volution" . Th is is                                            talk on "T he Me ta syste m Tr ansition a s
                                                                                                          th e th eore ti cal base to the P C P, wh ich I                                    the Q uan tu m o f Evol ution" . This is
                                                                                                          described the f orm of in m y ta lk                                                the theo ret ic al b ase to th e P CP, which I
                                                                                                          to W ES S . It's basica l y be tw een F ran cis                                    de scribe d th e fo rm o f in my t alk
                                                                                                          and Val how wou ld li ke to tal k, or                                              to WE SS. It 's basi cally bet we en F ranci s
                                                                                                          bot h, or wha t.                                                                   an d Va l how w ould like t o talk, or
                                                                                                                                                                                             bo th , or wh at.




  • intro • wikipedia • experiments • evaluation • next • conclusion •
Methods
• Goal: given one or more documents, compute
  a ranked list of the top N Wikipedia articles
  and/or categories that describe it.
• Basic metric: document similarity between
  Wikipedia article and document(s)
• Variations:
  –   role of categories
  –   eliminating uninteresting articles
  –   use of spreading activation
  –   using similarity scores for weighing links
  –   number of spreading activation pulses
  –   individual or set of query documents, etc, etc.
Spreading Activation
• In associative retrieval the idea is that it is
  possible to retrieve relevant documents if
  they are associated with other documents
  that have been considered relevant by the
  user.
• The documents can be represented as
  nodes and their associations as links in a
  network.



  • intro • wikipedia • experiments • evaluation • next • conclusion •
Spreading Activation
Start with an initial set of activated nodes
Spreading Activation
At each pulse/iteration, spread activation to adjacent nodes
Spreading Activation
Some nodes will have higher activation       Constraints
than others                              • Distance
                                         • Fan out
                                         • Path constraints
                                         • Activation threshold
Method 1
Using Wikipedia Article Text and Categories to Predict
Concepts




     Input
     Query     similar to
     doc(s)                        0.2         0.1
                                                           0.8   Similar Wikipedia Articles

               Cosine similarity                     0.2
                                         0.3
Method 1
Using Wikipedia Article Text and Categories to Predict
Concepts


                                                                                Wikipedia
                                                                                Category
                                                                                Graph




     Input
     Query     similar to
     doc(s)                        0.2         0.1
                                                           0.8   Similar Wikipedia Articles

               Cosine similarity                     0.2
                                         0.3
Method 1
Using Wikipedia Article Text and Categories to Predict Concepts


  Output
 Rank Categories
                                                                                  Wikipedia
 • Links                                                                          Category
 • Cosine similarity                                                              Graph

                                                                                   0.9
                                                  3


       Input
      Query      similar to
      doc(s)                         0.2         0.1
                                                             0.8   Similar Wikipedia Articles

                 Cosine similarity                     0.2
                                           0.3
Method 2
Using Spreading Activation on Category Links Graph to get
Aggregated Concepts
                                                Spreading Activation
Output
Ranked Concepts based                                                                   Wikipedia
                                                                                        Category
on Final Activation Score                                                               Graph




        Input
       Query         Similar to
       doc(s)                            0.2         0.1
                                                                 0.8


                     Cosine similarity                     0.2
                                               0.3                     Input Function    Ij = ∑ Oi
                                                                                               i
                                                                                               Aj
                                                                       Output Function Oj =
                                                                                              Dj * k
• Can we predict concepts that are NOT
  present in the category hierarchy?
• Use the article concepts!
• But How?
Method 3
         Using Spreading Activation on Article Links Graph
  Input
                                                          Threshold: Ignore Spreading
 Query Similar To                                         Activation to articles with less
 doc(s)                                                   than 0.4 Cosine similarity
                                                          score




Edge Weights: Cosine
similarity between linked
articles                                                                  Wikipedia
                                                                          Article Links
                                                                          Graph

                                                  Spreading Activation
Node Input Function   Ij = ∑ Oiwij
                              i
                                  Aj
                                         Output
Node Output Function Oj   =
                                        Ranked Concepts based on Final Activation Score
                                  k
Preliminary Experiments
   • An initial informal evaluation compared results against
     our own judgments
   • Downloaded articles from internet and predicted
     concepts
   • Using Single Document and Group of Related
     Documents

Prediction for Single Test Document
 Test Document                Method 1                      Method 2             Method 2
       Title         Ranking Categories Directly           Spreading        Spreading Activation
                                                             Activation          Pulses=3
                                                            Pulses=2
Weather Prediction   “Weather_Hazards”                 “Weather_Hazards”    “Meterology”
   of thunder
                     “Winds”                           “Current_events”     “Nature”
   storms (CNN)
                     “Severe_weather_and_convection”   “Types_of_cyclone”   “Weather”


                                More pulses -> More Generalized Concepts

        • intro • wikipedia • experiments • evaluation • next • conclusion •
Preliminary Experiments
Prediction for Set of Test Documents
Test Document Titles in the Set: (Wikipedia Articles)
Crop_rotation
Permaculture
Beneficial_insects
Neem
Lady_Bird
Principles_of_Organic_Agriculture
Rhizobia                                      Concept not in the
Biointensive                                  Category Hierarchy
Inter-cropping
Green_manure

         Method 1               Method 2 (2 pulses)         Method 3 (2 pulses)
  Ranking Categories Directly   Spreading Activation on   Spreading Activation on Article
                                 Category links Graph             Links Graph

         Agriculture                    Skills                 Organic_farming
  Sustainable_technologies        Applied_sciences          Sustainable_agriculture
           Crops                 Land_management              Organic_gardening
         Agronomy                  Food_industry                  Agriculture
        Permaculture                 Agriculture             Companion_planting
     • intro • wikipedia • experiments • evaluation • next • conclusion •
Evaluation
•   Select wikipedia articles randomly and predict their
    categories and links
•   Sort the results based on Average Similarity




       Average Similarity
                                                                0.8
                                        0.5
                                                    0.7
        Query       similar to
                                                          0.2
                                              0.9
        doc(s)
                    Cosine similarity
                                   0.5 + 0.9 + 0.7 + 0.2 + 0.8
                                                            5
     • intro • wikipedia • experiments • evaluation • next • conclusion •
Evaluation
Medicines                           Observation
                                    Articles are linked
                                    often with super
             Medical Treatments     and sub
                                    categories both

                                              1st
        Antibiotics

                          • If our system predicts a
        Tetracyclin         category three levels higher in
                            hierarchy than the original
                            category we consider our
                            prediction to be correct
       Oxytetracyclin
Category Prediction Evaluation
                                                                           M1 Method 1
                                                                           SA1 Spreading
                                                                           Activation pulse(s)= 1
                                                                           SA2 Spreading
                                                                           Activation pulse(s)=2




  • Spreading activation with two pulses worked best
  • Only considering articles with similarity > 0.5 was a good threshold

    • intro • wikipedia • experiments • evaluation • next • conclusion •
Article Links Prediction
              Evaluation
• Spreading activation with one pulse worked best
• Only considering articles with similarity > 0.5 was a
  good threshold


                                     Similar Documents, N = 5
                                     Spreading Activation pulses=1




     • intro • wikipedia • experiments • evaluation • next • conclusion •
Prediction Accuracy
• Issues:
  – To what extent the concept is represented in
    Wikipedia For eg. we have a category related to the
    fruit apple but not for mango
  – Presence of links between semantically related
    concepts
  – Presence of links between irrelevant articles (term
    definitions, country names)
• Possible Solutions:
  – Use Average Similarity Score to measure the extent
    of concept representation with in Wikipedia
  – Use existing semantic relatedness measures to
    handle presence or absence of semantically related
    links
 • intro • wikipedia • experiments • evaluation • next • conclusion •
Potential Applications
•   Recommending categories and links
    for new Wikipedia articles
•   Introducing new Wikipedia categories
•   Automating the process of building a
    Wiki from a corpus
Future Work
• Classifying links in Wikipedia using
  Machine learning techniques
 – To Predict semantic type of article
 – To control flow of spreading activation
• Exploit parallel execution on cluster
• Refining Wikipedia ontology
• Bridging the gap between Wikipedia and
  formal ontologies


• intro • wikipedia • experiments • evaluation • next • conclusion •
Document Expansion with Wikipedia
                                                                                *
                 Derived Ontology Terms
    • Expansion of each TREC document using Wikitology
      terms
    • We are still working on refining the methodology
                                                   Doc: FT921-4598 (3/9/92)
                                                   ... Alan Turing, described as a brilliant
                                                   mathematician and a key figure in the breaking of the
                                                   Nazis' Enigma codes. Prof IJ Good says it is as well
                                                   that British security was unaware of Turing's
                                                   homosexuality, otherwise he might have been fired
                                                   'and we might have lost the war'. In 1950 Turing
                                                   wrote the seminal paper 'Computing Machinery And
                                                   Intelligence', but in 1954 killed himself ...

                                                   Turing_machine, Turing_test, Church_Turing_thesis,
                                                   Halting_problem, Computable_number, Bombe,
                                                   Alan_Turing, Recusion_theory, Formal_methods,
                                                   Computational_models,     Theory_of_computation,
                                                   Theoretical_computer_science,
                                                   Artificial_Intelligence

*
    In Collaboration with Paul McNamee, John Hopkins University Applied Physics Laboratory
Conclusion
• We tested the idea of using Wikitology
  for describing documents and proposed
  different methods using the Wikipedia
  article text, category links and article
  links
• Suggested improvements
• Using average similarity to judge the
  accuracy of prediction
• Easily extendable to other wikis and
  collaborative KBs, e.g., Intellipedia,
  Freebase
• intro • wikipedia • experiments • evaluation • next • conclusion •
References
• Crestani, F. 1997. Application of Spreading Activation
  Techniques in Information Retrieval. Artificial Intelligence
  Review, 1997, vol 11; No. 6, 453-482.
• Gabrilovich, E., and Markovitch, S. 2006. Overcoming the
  brittleness bottleneck using Wikipedia: Enhancing text
  categorization with encyclopedic knowledge. Proceedings of the
  Twenty-First National Conference on Artificial Intelligence.
  AAAI’06. Boston, MA.
• Schonhofen, P. 2006. Identifying Document Topics Using the
  Wikipedia Category Network. Proc. 2006 IEEE/WIC/ACM
  International Conference on Web Intelligence. 456-462, 2006.
  IEEE Computer Society, Washington, DC, USA.
• Strube,M., and Ponzetto, S.P. 2006. Exploiting semantic role
  labeling, WordNet and Wikipedia for coreference resolution.
  Proceedings of the main conference on Human Language
  Technology Conference of the North American Chapter of the
  Association of Computational Linguistics (2006). Asso-ciation for
  Computational Linguistics Morristown, NJ, USA.
References
• Gabrilovich, E., and Markovitch, S. 2007. Computing Semantic
  Relatedness using Wikipedia-based Explicit Semantic Analysis,
  Proc. of the 20th International Joint Con-ference on Artificial
  Intelligence (IJCAI’07), 6-12.
• Krizhanovsky, A. 2006. Synonym search in Wikipedia:
  Synarcher.
• URL:http://arxiv.org/abs/cs/0606097v1
• Mihalcea, R. 2007. Using Wikipedia for Automatic Word Sense
  Disambiguation. Proc NAACL HLT. 196-203.
• Strube,M., and Ponzetto, S.P. 2006. WikiRelate! Computing
  semantic relatedness using Wikipedia. American Association for
  Artificial Intelligence, 2006, Boston, MA.
• Voss, J. 2006. Collaborative thesaurus tagging the Wikipedia
  way. Collaborative Web Tagging Workshop. Arxiv Computer
  Science e prints . URL http://arxiv.org/abs/cs/0604036
• Milne, D. 2007. Computing Semantic Relatedness using
  Wikipedia Link Structure. Proceedings of the New Zealand
  Computer Science Research Student conference
  (NZCSRSC’07), Hamilton, New Zealand.
Thank you

Questions and Suggestions?

Más contenido relacionado

Similar a Wikipedia for Semantic Search

June 27 Log (CUWP2012)
June 27 Log (CUWP2012)June 27 Log (CUWP2012)
June 27 Log (CUWP2012)Jon Ostenson
 
TED Talk Evaluation Presentation
TED Talk Evaluation PresentationTED Talk Evaluation Presentation
TED Talk Evaluation PresentationSylvester Alfonso
 
Ted talk evaluation presentation
Ted talk evaluation presentationTed talk evaluation presentation
Ted talk evaluation presentationSylvester Alfonso
 
Devore poetry anthology
Devore poetry anthologyDevore poetry anthology
Devore poetry anthologychelsid21
 
Global Flash Master Slides
Global Flash Master SlidesGlobal Flash Master Slides
Global Flash Master SlidesShane
 
Nur 300 Reflective Analysis
Nur 300 Reflective AnalysisNur 300 Reflective Analysis
Nur 300 Reflective Analysisjackiegolsch
 
August 6, 1945
August 6, 1945August 6, 1945
August 6, 1945englishcgs
 
Better answers by fa
Better answers by faBetter answers by fa
Better answers by fafallynjade
 
Nazim evaluation attract audience
Nazim evaluation attract audienceNazim evaluation attract audience
Nazim evaluation attract audienceRajaProdcutions
 
Inspirational Quotes On Planning
Inspirational Quotes On PlanningInspirational Quotes On Planning
Inspirational Quotes On Planningxtrm nurse
 
IBS Corporate Brochure
IBS Corporate BrochureIBS Corporate Brochure
IBS Corporate BrochureIBS Mauritius
 
TED Talk Evaluation Presentation
TED Talk Evaluation PresentationTED Talk Evaluation Presentation
TED Talk Evaluation PresentationSylvester Alfonso
 
Autism and inclusion
Autism and inclusionAutism and inclusion
Autism and inclusionJennaSavage
 
Huff post prom 2012 heels steps
Huff post prom 2012 heels   stepsHuff post prom 2012 heels   steps
Huff post prom 2012 heels stepsrunwaydiy
 
Affairs of the heart part a
Affairs of the heart part aAffairs of the heart part a
Affairs of the heart part aTeddy Jones
 
Articles- Darko Stevanović- Irena Matić Todorović
Articles- Darko Stevanović- Irena Matić Todorović Articles- Darko Stevanović- Irena Matić Todorović
Articles- Darko Stevanović- Irena Matić Todorović nasaskolatakmicenja1
 

Similar a Wikipedia for Semantic Search (20)

June 27 Log (CUWP2012)
June 27 Log (CUWP2012)June 27 Log (CUWP2012)
June 27 Log (CUWP2012)
 
TED Talk Evaluation Presentation
TED Talk Evaluation PresentationTED Talk Evaluation Presentation
TED Talk Evaluation Presentation
 
Ted talk evaluation presentation
Ted talk evaluation presentationTed talk evaluation presentation
Ted talk evaluation presentation
 
Theory
TheoryTheory
Theory
 
Devore poetry anthology
Devore poetry anthologyDevore poetry anthology
Devore poetry anthology
 
Global Flash Master Slides
Global Flash Master SlidesGlobal Flash Master Slides
Global Flash Master Slides
 
Publication2
Publication2Publication2
Publication2
 
Nur 300 Reflective Analysis
Nur 300 Reflective AnalysisNur 300 Reflective Analysis
Nur 300 Reflective Analysis
 
August 6, 1945
August 6, 1945August 6, 1945
August 6, 1945
 
HLABC Forum: January 2008
HLABC Forum: January 2008HLABC Forum: January 2008
HLABC Forum: January 2008
 
Better answers by fa
Better answers by faBetter answers by fa
Better answers by fa
 
Nazim evaluation attract audience
Nazim evaluation attract audienceNazim evaluation attract audience
Nazim evaluation attract audience
 
Inspirational Quotes On Planning
Inspirational Quotes On PlanningInspirational Quotes On Planning
Inspirational Quotes On Planning
 
IBS Corporate Brochure
IBS Corporate BrochureIBS Corporate Brochure
IBS Corporate Brochure
 
Audience reaserch
Audience reaserchAudience reaserch
Audience reaserch
 
TED Talk Evaluation Presentation
TED Talk Evaluation PresentationTED Talk Evaluation Presentation
TED Talk Evaluation Presentation
 
Autism and inclusion
Autism and inclusionAutism and inclusion
Autism and inclusion
 
Huff post prom 2012 heels steps
Huff post prom 2012 heels   stepsHuff post prom 2012 heels   steps
Huff post prom 2012 heels steps
 
Affairs of the heart part a
Affairs of the heart part aAffairs of the heart part a
Affairs of the heart part a
 
Articles- Darko Stevanović- Irena Matić Todorović
Articles- Darko Stevanović- Irena Matić Todorović Articles- Darko Stevanović- Irena Matić Todorović
Articles- Darko Stevanović- Irena Matić Todorović
 

Último

Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Association for Project Management
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1GloryAnnCastre1
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvRicaMaeCastro1
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptxmary850239
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfPrerana Jadhav
 
CHEST Proprioceptive neuromuscular facilitation.pptx
CHEST Proprioceptive neuromuscular facilitation.pptxCHEST Proprioceptive neuromuscular facilitation.pptx
CHEST Proprioceptive neuromuscular facilitation.pptxAneriPatwari
 
ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6Vanessa Camilleri
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4JOYLYNSAMANIEGO
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...DhatriParmar
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSMae Pangan
 
ARTERIAL BLOOD GAS ANALYSIS........pptx
ARTERIAL BLOOD  GAS ANALYSIS........pptxARTERIAL BLOOD  GAS ANALYSIS........pptx
ARTERIAL BLOOD GAS ANALYSIS........pptxAneriPatwari
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...DhatriParmar
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdfMr Bounab Samir
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Developmentchesterberbo7
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDhatriParmar
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptxmary850239
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptxDhatriParmar
 

Último (20)

Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
 
Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1Reading and Writing Skills 11 quarter 4 melc 1
Reading and Writing Skills 11 quarter 4 melc 1
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
CHEST Proprioceptive neuromuscular facilitation.pptx
CHEST Proprioceptive neuromuscular facilitation.pptxCHEST Proprioceptive neuromuscular facilitation.pptx
CHEST Proprioceptive neuromuscular facilitation.pptx
 
ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6
 
Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4Daily Lesson Plan in Mathematics Quarter 4
Daily Lesson Plan in Mathematics Quarter 4
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
 
Textual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHSTextual Evidence in Reading and Writing of SHS
Textual Evidence in Reading and Writing of SHS
 
ARTERIAL BLOOD GAS ANALYSIS........pptx
ARTERIAL BLOOD  GAS ANALYSIS........pptxARTERIAL BLOOD  GAS ANALYSIS........pptx
ARTERIAL BLOOD GAS ANALYSIS........pptx
 
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
Beauty Amidst the Bytes_ Unearthing Unexpected Advantages of the Digital Wast...
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdf
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Development
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
 

Wikipedia for Semantic Search

  • 1. Wikitology Wikipedia as an Ontology Zareen Syed, Tim Finin and Anupam Joshi University of Maryland Baltimore County zarsyed1@umbc.edu, finin@cs.umbc.edu, joshi@cs.umbc.edu
  • 2. Outline • Introduction and motivation • Wikipedia • Methodology and Experiments • Evaluation • Future Work Directions • Conclusion • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 3. Introduction • Identifying the topics and concepts associated with a document or collection of documents is a common task for many applications and can help in: – Annotation and categorization of documents in a corpus. – Modelling user interests – Business intelligence – Selecting Advertisements • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 4. Motivation • Problem: describe what an analyst has been working on to support collaboration • Idea: – track documents she reads – map these to terms in an ontology – aggregate to produce a short list of topics • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 5. Approach • Use Wikipedia articles and categories as ontology terms • Categories as Generalized Concepts • Articles as Specialized Concepts • How to map the documents she reads to the ontology terms? – Use document to Wiki-article similarity for the mapping • How to aggregate to get a shorter list? – Use spreading activation algorithm for aggregation • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 6. What’s a document about? • Two common approaches: • Statistical Approach Select words and phrases using TF-IDF that characterize the document (2) Controlled Vocabulary or Ontology Map document to a list of terms from a controlled vocabulary or ontology • First approach is flexible and does not require creating and maintaining an ontology • Second approach can tie documents to a rich knowledge base • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 7. Wikitology ! • Using Wikipedia as an ontology offers the best of both approaches • Each article is a concept in the ontology • Terms linked via Wikipedia’s category system and inter-article links • It’s a consensus ontology created, kept current and maintained by a diverse community • Overall content quality is high • Terms have unique IDs (URLs) and are “self describing” for people • Underlying graphs provide structure: categories, article links • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 8. Wikipedia Graph Structures • Wikipedia Category graph is a thesaurus • Wikipedia Page links Ou r p lan s a re going ah ead , He ylig hen i s get ting ticke ts, so let's put th at in in ha rd pencil, but keep yo ur erase r h and y a ll the sam e. M y lif e is Ou r p la n s a re going ahead , Heylighen is getting t icke ts, so let 's put re ally hectic, and I won 't be in a sane pla ce ti ll m id -Apr il . We 'll p in th at in in ha rd pen cil, but keep your dow n th e det ails later , but a t this po int erase r h and y a ll the s ame . My life is re ally hect ic, and I won't be i n a sane we we re consi dering Val onl y t o ta lk on "T he M etasystem Tra nsition as pla ce ti ll m id -A pril . We 'll p in Ou r p la n s ar e going ah ead , He ylig hen is dow n th e det ai ls lat er, b ut at t his point getting t ic ke ts, so le t 's put th e Qu ant um of E volu tion". Th is is th e th eore tical ba se to t he P C P, wh ich I we we re consi deri ng Va l only to th at in in har d pen cil, but keep yo ur ta lk on "T he M etasyst em Tr ansitio n as erase r hand y all the same . My lif e is described t he for m of in my ta lk th e Qu antum of E volu tion" . Th is is re ally hect ic , and I won 't be i n a sane to W ES S. I t 's b asically be tw een F ran cis and Val ho w would like to talk, or th e th eore ti cal ba se to the P C P, wh ic h I pla ce till mid -A pril. We 'll p in described the f orm of in m y talk dow n th e detai ls la t er, b ut a t t his po int bot h, or wha t. to W ES S . It 's b asica lly be tw een F ran cis we we re consideri ng Va l onl y t o and V al ho w wou ld li k e to tal k, or ta lk on " The Metasyst em Tra nsitio n as graph is similar to WWW both, or wha t. th e Qu antum of E voluti on". Th is is th e th eore ti cal base to the P C P, wh ic h I described the f orm of in m y ta lk to W ES S . It's basica l y be tw een F ran cis and V al how wou ld li ke to t al k, or both, or wha t. Ou r p lan s a re going ah ead , He ylig hen i s get t ing ticke ts , so let's put th at i n in ha rd pencil, but keep yo ur Ou r p lan s a re goi ng erase r h and y a ll the sam e. M y lif e is ahe ad, Heyl igh en is re al ly hectic, and I won 't be in a sane get ting t ick e ts, so pla ce ti ll m id- Apr il . We 'll p in let's put dow n th e details later , but a t this po int th at in in ha rd we we re consi dering Val onl y t o Network pen cil, but keep ta lk on "T he M etasystem Tra nsitio n as your era ser han dy th e Qu ant um of E volu tion". Th is is Ou r p lan s ar e going ahead , He ylig hen i s all the sa m e. M y life th e th eore tical ba se to t he P C P, wh ich I gett ing t icke ts, so le t 's put Ou r p lan s a re goi ng ah ead , He ylig hen i s descri bed t he for m of in my ta lk th at in in har d pen cil, but keep yo ur get ting t ick e ts, so let's put to W E SS. I t's b asically be tw een F ran cis eras e r h and y all the same . My life is th at in in ha rd pen cil, but keep yo ur and Val ho w would like to talk, or re al ly hect ic , and I won't be i n a sane er ase r h and y a ll t he sam e. M y lif e is bot h, or what. pla c e till mid -A pril. We 'll p in re ally hect ic, and I won 't be in a sane dow n th e detai ls la t er, b ut at t his po int pla ce till m id -April . We 'll p in we we re c onsideri ng Va l only to dow n th e det ails later , but a t this po int ta lk on "T he Metasyst em Tr ans it io n as we we re consi dering Val onl y th e Qu ant um of E volution" . Th is is th e th eore ti cal base to the P C P, wh ich I described the f orm of in m y talk to W ES S. It's basica l y be tw een F ran c is and Val ho w wou ld li ke to tal k, or both, or wha t. Ou r p lan s ar e going ahead , He ylig hen i s get ting t ic ke ts, so le t 's put O ur pla ns are go ing a head, H eyli ghen is th at in in har d pen cil, but keep yo ur ge t in g ti cket s, so le t' s p ut erase r hand y all the same . My life is that in in h ard pe ncil, bu t keep your re ally hect ic , and I won't be i n a sane era ser han dy all t he sa me. My li fe is pla ce till mid -A pril. We 'll p in rea lly he cti c, an d I wo n't be in a sane dow n th e detai ls la t er, b ut a t t his po int place till m id-A pri l. W e'll pin we we re consideri ng Va l only to do wn the de ta ils la te r, but at this p oin t ta lk on " The Metasyst em Tr a nsit io n as we w ere con side ring V al o nly to th e Qu antum of E volution" . Th is is talk on "T he Me ta syste m Tr ansition a s th e th eore ti cal base to the P C P, wh ich I the Q uan tu m o f Evol ution" . This is described the f orm of in m y ta lk the theo ret ic al b ase to th e P CP, which I to W ES S . It's basica l y be tw een F ran cis de scribe d th e fo rm o f in my t alk and Val how wou ld li ke to tal k, or to WE SS. It 's basi cally bet we en F ranci s bot h, or wha t. an d Va l how w ould like t o talk, or bo th , or wh at. • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 9. Methods • Goal: given one or more documents, compute a ranked list of the top N Wikipedia articles and/or categories that describe it. • Basic metric: document similarity between Wikipedia article and document(s) • Variations: – role of categories – eliminating uninteresting articles – use of spreading activation – using similarity scores for weighing links – number of spreading activation pulses – individual or set of query documents, etc, etc.
  • 10. Spreading Activation • In associative retrieval the idea is that it is possible to retrieve relevant documents if they are associated with other documents that have been considered relevant by the user. • The documents can be represented as nodes and their associations as links in a network. • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 11. Spreading Activation Start with an initial set of activated nodes
  • 12. Spreading Activation At each pulse/iteration, spread activation to adjacent nodes
  • 13. Spreading Activation Some nodes will have higher activation Constraints than others • Distance • Fan out • Path constraints • Activation threshold
  • 14. Method 1 Using Wikipedia Article Text and Categories to Predict Concepts Input Query similar to doc(s) 0.2 0.1 0.8 Similar Wikipedia Articles Cosine similarity 0.2 0.3
  • 15. Method 1 Using Wikipedia Article Text and Categories to Predict Concepts Wikipedia Category Graph Input Query similar to doc(s) 0.2 0.1 0.8 Similar Wikipedia Articles Cosine similarity 0.2 0.3
  • 16. Method 1 Using Wikipedia Article Text and Categories to Predict Concepts Output Rank Categories Wikipedia • Links Category • Cosine similarity Graph 0.9 3 Input Query similar to doc(s) 0.2 0.1 0.8 Similar Wikipedia Articles Cosine similarity 0.2 0.3
  • 17. Method 2 Using Spreading Activation on Category Links Graph to get Aggregated Concepts Spreading Activation Output Ranked Concepts based Wikipedia Category on Final Activation Score Graph Input Query Similar to doc(s) 0.2 0.1 0.8 Cosine similarity 0.2 0.3 Input Function Ij = ∑ Oi i Aj Output Function Oj = Dj * k
  • 18. • Can we predict concepts that are NOT present in the category hierarchy? • Use the article concepts! • But How?
  • 19. Method 3 Using Spreading Activation on Article Links Graph Input Threshold: Ignore Spreading Query Similar To Activation to articles with less doc(s) than 0.4 Cosine similarity score Edge Weights: Cosine similarity between linked articles Wikipedia Article Links Graph Spreading Activation Node Input Function Ij = ∑ Oiwij i Aj Output Node Output Function Oj = Ranked Concepts based on Final Activation Score k
  • 20. Preliminary Experiments • An initial informal evaluation compared results against our own judgments • Downloaded articles from internet and predicted concepts • Using Single Document and Group of Related Documents Prediction for Single Test Document Test Document Method 1 Method 2 Method 2 Title Ranking Categories Directly Spreading Spreading Activation Activation Pulses=3 Pulses=2 Weather Prediction “Weather_Hazards” “Weather_Hazards” “Meterology” of thunder “Winds” “Current_events” “Nature” storms (CNN) “Severe_weather_and_convection” “Types_of_cyclone” “Weather” More pulses -> More Generalized Concepts • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 21. Preliminary Experiments Prediction for Set of Test Documents Test Document Titles in the Set: (Wikipedia Articles) Crop_rotation Permaculture Beneficial_insects Neem Lady_Bird Principles_of_Organic_Agriculture Rhizobia Concept not in the Biointensive Category Hierarchy Inter-cropping Green_manure Method 1 Method 2 (2 pulses) Method 3 (2 pulses) Ranking Categories Directly Spreading Activation on Spreading Activation on Article Category links Graph Links Graph Agriculture Skills Organic_farming Sustainable_technologies Applied_sciences Sustainable_agriculture Crops Land_management Organic_gardening Agronomy Food_industry Agriculture Permaculture Agriculture Companion_planting • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 22. Evaluation • Select wikipedia articles randomly and predict their categories and links • Sort the results based on Average Similarity Average Similarity 0.8 0.5 0.7 Query similar to 0.2 0.9 doc(s) Cosine similarity 0.5 + 0.9 + 0.7 + 0.2 + 0.8 5 • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 23. Evaluation Medicines Observation Articles are linked often with super Medical Treatments and sub categories both 1st Antibiotics • If our system predicts a Tetracyclin category three levels higher in hierarchy than the original category we consider our prediction to be correct Oxytetracyclin
  • 24. Category Prediction Evaluation M1 Method 1 SA1 Spreading Activation pulse(s)= 1 SA2 Spreading Activation pulse(s)=2 • Spreading activation with two pulses worked best • Only considering articles with similarity > 0.5 was a good threshold • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 25. Article Links Prediction Evaluation • Spreading activation with one pulse worked best • Only considering articles with similarity > 0.5 was a good threshold Similar Documents, N = 5 Spreading Activation pulses=1 • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 26. Prediction Accuracy • Issues: – To what extent the concept is represented in Wikipedia For eg. we have a category related to the fruit apple but not for mango – Presence of links between semantically related concepts – Presence of links between irrelevant articles (term definitions, country names) • Possible Solutions: – Use Average Similarity Score to measure the extent of concept representation with in Wikipedia – Use existing semantic relatedness measures to handle presence or absence of semantically related links • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 27. Potential Applications • Recommending categories and links for new Wikipedia articles • Introducing new Wikipedia categories • Automating the process of building a Wiki from a corpus
  • 28. Future Work • Classifying links in Wikipedia using Machine learning techniques – To Predict semantic type of article – To control flow of spreading activation • Exploit parallel execution on cluster • Refining Wikipedia ontology • Bridging the gap between Wikipedia and formal ontologies • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 29. Document Expansion with Wikipedia * Derived Ontology Terms • Expansion of each TREC document using Wikitology terms • We are still working on refining the methodology Doc: FT921-4598 (3/9/92) ... Alan Turing, described as a brilliant mathematician and a key figure in the breaking of the Nazis' Enigma codes. Prof IJ Good says it is as well that British security was unaware of Turing's homosexuality, otherwise he might have been fired 'and we might have lost the war'. In 1950 Turing wrote the seminal paper 'Computing Machinery And Intelligence', but in 1954 killed himself ... Turing_machine, Turing_test, Church_Turing_thesis, Halting_problem, Computable_number, Bombe, Alan_Turing, Recusion_theory, Formal_methods, Computational_models, Theory_of_computation, Theoretical_computer_science, Artificial_Intelligence * In Collaboration with Paul McNamee, John Hopkins University Applied Physics Laboratory
  • 30. Conclusion • We tested the idea of using Wikitology for describing documents and proposed different methods using the Wikipedia article text, category links and article links • Suggested improvements • Using average similarity to judge the accuracy of prediction • Easily extendable to other wikis and collaborative KBs, e.g., Intellipedia, Freebase • intro • wikipedia • experiments • evaluation • next • conclusion •
  • 31. References • Crestani, F. 1997. Application of Spreading Activation Techniques in Information Retrieval. Artificial Intelligence Review, 1997, vol 11; No. 6, 453-482. • Gabrilovich, E., and Markovitch, S. 2006. Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. Proceedings of the Twenty-First National Conference on Artificial Intelligence. AAAI’06. Boston, MA. • Schonhofen, P. 2006. Identifying Document Topics Using the Wikipedia Category Network. Proc. 2006 IEEE/WIC/ACM International Conference on Web Intelligence. 456-462, 2006. IEEE Computer Society, Washington, DC, USA. • Strube,M., and Ponzetto, S.P. 2006. Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (2006). Asso-ciation for Computational Linguistics Morristown, NJ, USA.
  • 32. References • Gabrilovich, E., and Markovitch, S. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, Proc. of the 20th International Joint Con-ference on Artificial Intelligence (IJCAI’07), 6-12. • Krizhanovsky, A. 2006. Synonym search in Wikipedia: Synarcher. • URL:http://arxiv.org/abs/cs/0606097v1 • Mihalcea, R. 2007. Using Wikipedia for Automatic Word Sense Disambiguation. Proc NAACL HLT. 196-203. • Strube,M., and Ponzetto, S.P. 2006. WikiRelate! Computing semantic relatedness using Wikipedia. American Association for Artificial Intelligence, 2006, Boston, MA. • Voss, J. 2006. Collaborative thesaurus tagging the Wikipedia way. Collaborative Web Tagging Workshop. Arxiv Computer Science e prints . URL http://arxiv.org/abs/cs/0604036 • Milne, D. 2007. Computing Semantic Relatedness using Wikipedia Link Structure. Proceedings of the New Zealand Computer Science Research Student conference (NZCSRSC’07), Hamilton, New Zealand.
  • 33. Thank you Questions and Suggestions?

Notas del editor

  1. My talk today is about our research work related to using the Wikipedia’s Ontology which we refer to as Wikitology for describing documents
  2. So this is the outline of the talk, I will be starting with the introduction and motivation Then I’ll give some details about Wikipedia After that I will be discussing the methods we proposed and the experiments that we conducted Then how we evaluated our methods Then what are our future work directions and finally the conclusion
  3. Identifying the topics and concepts associated with a document or collection of documents is a common task for many applications and can help in: Primarily the annotation and categorization of documents themselves If we know what documents the user is looking at or reading, we can use them to model the user’s interests We can use it for business intelligence and even aid in advertising
  4. Our motivation was the following problem: A team of analysts is working on a set of common tasks, They work sometimes independently and sometimes in a tightly coordinated group, They can collaborate more effectively if they are aware of what their colleagues are working on
  5. Use Wikipedia articles and categories as ontology terms Since Wikipedia is encyclopedic in nature, we consider each category and the article title as a concept in Wikitology Categories can be considered as broad or more generalized concepts whereas the article titles can be considered as specialized concepts Rest just read the slides
  6. How to describe what the document is about? Two common approaches
  7. A famous algorithm an
  8. Distance Constraints: first order,second order or third order nodes Fan out constraint: high degree node should be less activated Path Constraint: certain paths may allow more flow than others Activation Constraint: threshold depending on the overall activation of the network
  9. Utilizing Wikipedia Article Text Input 'm' test documents as search query to wikipedia index Retrieve top 'n' matching documents to each test document Extract wikipedia categories for matching documents and score categories based on two scoring schemes Scoring Scheme 1: Category score = Number of occurrences Scoring Scheme 2: Category score= Sum of tf.idf score for matching document for each occurrence of category
  10. Utilizing Wikipedia Article Text Input 'm' test documents as search query to wikipedia index Retrieve top 'n' matching documents to each test document Extract wikipedia categories for matching documents and score categories based on two scoring schemes Scoring Scheme 1: Category score = Number of occurrences Scoring Scheme 2: Category score= Sum of tf.idf score for matching document for each occurrence of category
  11. Utilizing Wikipedia Article Text Input 'm' test documents as search query to wikipedia index Retrieve top 'n' matching documents to each test document Extract wikipedia categories for matching documents and score categories based on two scoring schemes Scoring Scheme 1: Category score = Number of occurrences Scoring Scheme 2: Category score= Sum of tf.idf score for matching document for each occurrence of category
  12. Utilizing Wikipedia Article Text Input 'm' test documents as search query to wikipedia index Retrieve top 'n' matching documents to each test document Extract wikipedia categories for matching documents and score categories based on two scoring schemes Scoring Scheme 1: Category score = Number of occurrences Scoring Scheme 2: Category score= Sum of tf.idf score for matching document for each occurrence of category
  13. Utilizing Wikipedia Article Text and Article Links Graph with Spreading Activation Input 'm' test documents as search query to wikipedia index Retrieve top 'n' matching documents to each test document Select them as initial set of activated nodes in the article links graph. Can We predict concepts not in the category hierarchy?
  14. An initial informal evaluation compared results against our own judgments Used to select promising combinations of ideas and parameter settings Formal evaluation by selecting wikipedia articles randomly and predicting their categories and links Experiment Set No. 1: Objective : To observe how well do wikipedia categories represent concepts in individual documents Methods: Method 1 & 2 Experiment Set No 2: Objective : Repeat Experiment 1 but with a set of test documents Methods: Method 1 & 2 Experiment Set No 3: Objective : To observe if we can get additional information about documents using article links graph as well using wikipedia articles themselves as test set Methods: Method 1,2 & 3 3Weather Prediction of thunder storms (taken from CNN Weather Prediction)“Weather_Hazards”“Winds”“Severe_weather_and_convection”“Weather_Hazards”“Current_events”“Types_of_cyclone”“Meterology”“Nature”“Weather”“Main_topic_classifications”“Fundamental”“Atmospheric_sciences”
  15. Antibiotics -> medicines -> medical treatment -> medicines Select 100 Wikipedia articles randomly for testing; remove from Wikipedia index and graphs For each, use methods to predict categories and linked articles Compare results using precision and recall to known categories and linked articles If our system predicts a category three levels higher in hierarchy than the original category, we consider our prediction correct Compare predicted links with actual links Select 100 Wikipedia articles randomly for testing; remove from Wikipedia index and graphs For each, use methods to predict categories and linked articles Compare results using precision and recall to known categories and linked articles If our system predicts a category three levels higher in hierarchy than the original category, we consider our prediction correct Compare predicted links with actual links
  16. Tetracycline_antibiotics
  17. These graphs show the precision, average precision, recall and f-measure metrics as the average similar- ity threshold varies from 0.1 to 0.8. Legend label M1 is method 1, M1(1) and M(2) are method 1 with scoring schemes 1 and 2, respectively, and SA1 and S2 represent the use of spreading activation with one and two pulses, respectively. Considering top three level categories Spreading activation with two pulses worked best Only considering articles with similarity > 0.5 was a good threshold
  18. Single document in many categories: George W. Bush is included in about 29 categories Links between articles belonging to very different categories John F. Kennedy has a link for “coincidence theory” which belongs to the Mathematical Analysis/ Topology/Fixed Points. Number of articles with in a category: Some categories are under represented where as others have many articles Administrative Categories For eg: Clean up from Sep 2006 Articles with unsourced statements Links to words in an article For eg. If the word United States appears in the document then that word might be linked to the page on “United States”
  19. In Table 4 we report retrieval performance using mean average precision (MAP) and precision at ten documents (P@10). When normal relevance feedback is used a 19% relative improvement is seen over the base run. The concepts+rf condition is about as effective as base+rf . Mean average precision suffers a 2.8% relative drop and P@10 gains 1.6%; neither difference is statistically significant.