KNOWLEDGE EXTRACTION FROM WIKIPEDIA
Ofer Egozi
Doug Lenat
“Intelligence is 10 million rules…”
Cyc, 1984
(#$genls #$Tree-ThePlant #$Plant)
(#$implies (#$and
(#$isa ?OBJ ?SUBSET)
(#$genls ?SUBSET
?SUPERSET))
(#$isa ?OBJ ?SUPERSET))
…an oak is a plant
Predicted to complete in 10 years.
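The rule above says that membership propagates up the generalization hierarchy. A minimal Python sketch of that single inference follows; it is not CycL and does not claim to reflect how Cyc's engine works. Only Tree-ThePlant and Plant come from the slide; the `Oak` collection and the individual `MyOak` are hypothetical additions for illustration.

```python
# Toy forward inference for: (isa OBJ SUB) and (genls SUB SUPER) => (isa OBJ SUPER).
# Oak and MyOak are made-up names; Tree-ThePlant and Plant are from the slide.
genls = {("Oak", "Tree-ThePlant"), ("Tree-ThePlant", "Plant")}
isa = {("MyOak", "Oak")}

def infer_isa(isa_facts, genls_facts):
    """Apply the rule repeatedly until no new isa facts appear."""
    derived = set(isa_facts)
    changed = True
    while changed:
        changed = False
        for obj, subset in list(derived):
            for sub, superset in genls_facts:
                if sub == subset and (obj, superset) not in derived:
                    derived.add((obj, superset))
                    changed = True
    return derived

facts = infer_isa(isa, genls)
print(("MyOak", "Plant") in facts)  # True
```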
Cyc Today
Can make impressive inferences, such as:
• You have to be awake to eat
• You cannot remember events that have not happened yet
• If you cut a lump of peanut butter in half, each half is also a
lump of peanut butter; if you cut a table in half, neither half
is a table
• When people die, they stay dead
But after 30 years and 700 man-years, only 2M+
rules…
What went wrong?
Knowledge Acquisition
Machine Translation
Rule-Based Machine Translation (1970s):
• Dictionary for both languages
• Rules representing language structure
• Parsing sentences to find structure
• Mapping between structures
Built by human experts, accumulating rules over
time.
Rules end up conflicting and ambiguous
‫תפוח‬ ‫אוכל‬ ‫ילד‬
Object-verb-subject
Boy eats apple
Subject-verb-object
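The dictionary-plus-structure-mapping recipe can be sketched in a few lines of Python. This is a toy under stated assumptions: the transliterated source words (`tapuach`, `okhel`, `yeled`) and the single reordering rule are chosen for this example, following the slide's object-verb-subject labeling, and a real rule-based system would need a parser and many more rules.

```python
# Toy rule-based translation: dictionary lookup plus one reordering rule.
# The transliterated source words are assumptions made for this example.
DICTIONARY = {"tapuach": "apple", "okhel": "eats", "yeled": "boy"}

# Structure mapping: source order (object, verb, subject) -> target order (subject, verb, object).
OVS_TO_SVO = (2, 1, 0)

def translate(tokens):
    """Translate a three-token object-verb-subject sentence into subject-verb-object English."""
    if len(tokens) != 3:
        raise ValueError("this toy rule only covers three-token sentences")
    words = [DICTIONARY[token] for token in tokens]
    reordered = [words[i] for i in OVS_TO_SVO]
    return " ".join(reordered).capitalize()

print(translate(["tapuach", "okhel", "yeled"]))  # Boy eats apple
```

Every new sentence shape needs another hand-written rule, which is where the conflicts and ambiguity creep in.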
Machine Translation
Statistical Translation (1990s):
• Massive bilingual corpora
• Corpus alignment
• Calculate probability for word in 1st language
to match word in 2nd language
• Use n-gram to build models that take context into account
Franz Och
Built by data scientists, no linguists needed
Improves as more data gets added
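To make "calculate probability for word in 1st language to match word in 2nd language" concrete, here is a minimal sketch that uses IBM Model 1 with EM as a stand-in technique; the slide does not name a specific model. The three-sentence German-English corpus is made up, and the sketch omits the NULL word and any n-gram language model.

```python
from collections import defaultdict

# Tiny made-up sentence-aligned corpus (source words, target words).
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

def train_model1(pairs, iterations=10):
    """Estimate t[(e, f)], the probability that source word f translates to e, with EM."""
    target_vocab = {e for _, es in pairs for e in es}
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in pairs:
            for e in es:
                # E-step: share each target word among the source words of its sentence.
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # M-step: renormalise the expected counts per source word.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

t = train_model1(corpus)
print(max(["the", "house"], key=lambda e: t[(e, "haus")]))  # house
```

No rule says that "haus" means "house"; the estimate falls out of co-occurrence, and it sharpens as more sentence pairs are added.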
Encyclopedia?
Asymptotic goal: Enter “the world’s most general knowledge,” down to ever more detailed levels. A preliminary milestone would be to finish encoding a one-volume desk encyclopedia...
…There are approximately 30,000 articles in a typical one-volume desk encyclopedia… For comparison, the Encyclopedia Britannica has nine times as many articles... A conservative estimate for the data enterers’ rate is one paragraph per day; this would make their total effort about 150 man-years.
Doug Lenat, 1985
Wikipedia
Un+Structured Data
YAGO
• “Yet Another Great Ontology”, 2007, MPI
• 10M entities, 120M facts
• http://en.wikipedia.org/wiki/Albert_Einstein
  • (AlbertEinstein, bornInYear, 1879)
  • (AlbertEinstein, hasWonPrize, NobelPrize)
  • (AlbertEinstein, isA, Physicist)
• Uses the WordNet curated ontology, and expands it into Wikipedia entities
  • E.g. Albert Einstein is a Person
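Facts of this shape are easy to hold and query as plain triples. The sketch below is only an illustration in Python: the three triples are the ones on the slide, while the `match` helper is a hypothetical name and not part of YAGO or any of its tools.

```python
# Facts as (subject, predicate, object) triples, copied from the slide.
triples = [
    ("AlbertEinstein", "bornInYear", 1879),
    ("AlbertEinstein", "hasWonPrize", "NobelPrize"),
    ("AlbertEinstein", "isA", "Physicist"),
]

def match(facts, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        (s, p, o)
        for s, p, o in facts
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

print(match(triples, predicate="isA"))  # [('AlbertEinstein', 'isA', 'Physicist')]
```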
YAGO
• Knowledge acquisition:
  • Work started in 2006
  • 2007: 1M entities, 5M facts
  • 2012: 10M entities, 120M facts
  • Now adding places
• Data export
• Query over SPARQL
DBpedia
• Created an ontology from scratch
• Crowdsourced the rule definition and mining
• More coverage, but less coherent model and structure
• 2.3M entities, 400M facts
• Uses YAGO ontology as part of resources
• Data export, and SPARQL queries
ESA
• Explicit Semantic Analysis
• Prof. Shaul Markovitch, Dr. Evgeniy Gabrilovich and yours truly
• The name is a pun on Latent Semantic Analysis (LSA) – a quick context recap follows…
Latent Semantic Analysis
• Technique to find “hidden” semantic relations between groups of terms in documents
ESA
• Wikipedia articles are clear, coherent and universal semantic concepts
• Article words are associated with the concept (TF.IDF)
Panthera: Cat [0.92], Leopard [0.84], Roar [0.77]
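As a rough illustration of how TF.IDF ties article words to a concept, the following Python sketch scores words in three made-up miniature "articles". The texts, the raw-count term frequency and the `log(N / df)` weighting are assumptions for the example; the published ESA work may use a different TF.IDF variant and normalisation.

```python
import math
from collections import Counter

# Made-up miniature "articles"; real ESA would use full Wikipedia article text.
articles = {
    "Panthera": "cat leopard roar cat leopard lion",
    "Cat": "cat pet purr cat",
    "Jane Fonda": "actress film workout",
}

def tfidf(docs):
    """Return {concept: {word: weight}} using raw term frequency times log(N / df)."""
    tokenized = {name: text.split() for name, text in docs.items()}
    # Document frequency: in how many articles each word occurs.
    df = Counter(word for words in tokenized.values() for word in set(words))
    n_docs = len(docs)
    weights = {}
    for name, words in tokenized.items():
        tf = Counter(words)
        weights[name] = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return weights

weights = tfidf(articles)
# "leopard" occurs only in the Panthera article, so it outweighs the more common "cat".
print(weights["Panthera"]["leopard"] > weights["Panthera"]["cat"])  # True
```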
ESA
Cat: Panthera [0.92], Cat [0.95], Jane Fonda [0.07]
The semantics of a word is the vector of its associations with Wikipedia concepts
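Going from "words in a concept's article" to "concepts for a word" is just an inversion of the weight table. A small Python sketch, assuming the table is held as nested dictionaries: the "cat" weights are the ones on the slide, while the "film" entry is invented so that a second word exists.

```python
# Concept -> {word: weight}. "cat" weights are from the slide; "film" is invented.
concept_words = {
    "Panthera": {"cat": 0.92},
    "Cat": {"cat": 0.95},
    "Jane Fonda": {"cat": 0.07, "film": 0.80},
}

def invert(table):
    """Turn concept -> word weights into word -> concept vectors."""
    word_vectors = {}
    for concept, words in table.items():
        for word, weight in words.items():
            word_vectors.setdefault(word, {})[concept] = weight
    return word_vectors

vectors = invert(concept_words)
print(vectors["cat"])  # {'Panthera': 0.92, 'Cat': 0.95, 'Jane Fonda': 0.07}
```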
ESA
button: Dick Button [0.84], Button [0.93], Game Controller [0.32], Mouse (computing) [0.81]
mouse: Mouse (computing) [0.84], Mouse (rodent) [0.91], John Steinbeck [0.17], Mickey Mouse [0.81]
mouse button: Drag-and-drop [0.91], Mouse (computing) [0.95], Mouse (rodent) [0.56], Game Controller [0.64]
The semantics of a text fragment is the average vector (centroid) of the semantics of its words
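The centroid step can be sketched directly on the slide's two word vectors. Note that the slide's numbers for "mouse button" are illustrative: a plain average of the two listed vectors does not reproduce them exactly, but it does put the same concept on top. The sparse-dictionary representation is an assumption of this sketch.

```python
# Word vectors copied from the slide (concept -> association weight).
button = {"Dick Button": 0.84, "Button": 0.93, "Game Controller": 0.32, "Mouse (computing)": 0.81}
mouse = {"Mouse (computing)": 0.84, "Mouse (rodent)": 0.91, "John Steinbeck": 0.17, "Mickey Mouse": 0.81}

def centroid(vectors):
    """Average sparse vectors; a concept missing from a vector counts as 0."""
    total = {}
    for vec in vectors:
        for concept, weight in vec.items():
            total[concept] = total.get(concept, 0.0) + weight
    return {concept: weight / len(vectors) for concept, weight in total.items()}

fragment = centroid([button, mouse])
top = max(fragment, key=fragment.get)
print(top)  # Mouse (computing)
```

The only concept the two words share rises to the top, which is how the combination disambiguates each word.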
Uses of ESA
• Text Categorization
• Semantic Relatedness
• Information Retrieval
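For semantic relatedness, one common way to compare two concept vectors is cosine similarity. The Python sketch below assumes sparse dictionary vectors; all weights and the concept names other than the two Mouse articles are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors stored as dicts."""
    dot = sum(weight * v[concept] for concept, weight in u.items() if concept in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors with invented weights, for illustration only.
mouse = {"Mouse (computing)": 0.8, "Mouse (rodent)": 0.9}
keyboard = {"Mouse (computing)": 0.7, "Computer keyboard": 0.9}
hamster = {"Mouse (rodent)": 0.6, "Hamster": 0.9}

# "mouse" overlaps with both neighbours; "keyboard" and "hamster" share no concept.
print(cosine(mouse, keyboard) > 0, cosine(keyboard, hamster) == 0)  # True True
```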
More semantic projects
• Word-sense disambiguation
• Multi-lingual dictionary from language links
• Cross-lingual search (Cross-Lingual-ESA)
• WikiData
Questions?
References
• Cyc:
  • Lenat et al., CYC: Using Common Sense Knowledge to Overcome Brittleness and Knowledge Acquisition Bottlenecks, AI Magazine Vol. 6 No. 4, 1985
  • Cycorp: http://www.cyc.com/
• YAGO:
  • Suchanek et al., YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW 2007
  • YAGO at the Max Planck Institute: http://www.mpi-inf.mpg.de/yago-naga/yago/
• ESA:
  • E. Gabrilovich and S. Markovitch, Enhancing Text Categorization with Encyclopedic Knowledge, AAAI 2006
  • E. Gabrilovich and S. Markovitch, Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, IJCAI 2007
  • Egozi et al., Concept-Based Information Retrieval using Explicit Semantic Analysis, TOIS, 2011
• Others:
  • Rada Mihalcea, Using Wikipedia for Automatic Word Sense Disambiguation, Proceedings of NAACL HLT, 2007
  • Erdmann et al., An Approach for Extracting Bilingual Terminology from Wikipedia, LNCS Vol. 4947, 2008
  • Potthast et al., A Wikipedia-Based Multilingual Retrieval Model, Advances in Information Retrieval, 2008
Editor's Notes

  1. Lenat actually explained that Cyc would address the bottleneck by moving it from the entry process itself to the decision of what data to enter. Compared to entering specific rules, entering facts and generalized rules is certainly better, but it is still manual.
  2. Fast forward 20 years…
  3. Fast forward 20 years…
  4. There were quite a few efforts to use this wealth of information. I’ll speak about one that was quite impressive in its breadth and comparable to Cyc.