We all love Wikipedia!
Wikipedia has lots of data...
Lots of semi-structured data!
At Freebase, we use Wikipedia as a source for extracting facts and relationships
Some Interesting Data in Wikipedia
•Infoboxes
•Categories
•Text
Problems With Wikipedia Data

•The data is dirty
•Wiki markup is hard to parse..
  •.. and often dirty
  •.. and not well defined
•Properties and relations are in Wiki markup
•Page redirects have to be resolved
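Redirect resolution, for instance, amounts to following a title through a lookup table until it stops moving. A minimal Python 3 sketch, assuming the redirects are available as a plain dict; the function name and the example titles are made up for illustration and are not taken from WEX:

```python
def resolve_redirect(title, redirects, max_hops=10):
    """Follow redirects from title until a title with no redirect is reached."""
    seen = {title}
    for _ in range(max_hops):
        target = redirects.get(title)
        if target is None:
            return title  # not a redirect: this is the final article
        if target in seen:
            raise ValueError("redirect loop at %r" % target)
        seen.add(target)
        title = target
    raise ValueError("too many redirect hops")


# Hypothetical redirect table (illustrative titles only).
redirects = {"SONY": "Sony Corp", "Sony Corp": "Sony Corporation"}
print(resolve_redirect("SONY", redirects))
```

The loop guard matters because redirect chains in a live wiki can point back at themselves.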
{{Infobox Company
 | company_name = Sony Corporation <br>
 | company_logo = [[Image:Sony logo.svg|220px|]]
 | slogan = like.no.other
 | company_type = [[Public company|Public]]<br>({{Tyo|6758}})<br>({{nyse|SNE}})
 | foundation     = [[May 7]] [[1946]] (adopted current name in 1958)<ref
name=sonycorpinfo>{{citeweb|url=http://www.sony.net/SonyInfo/
CorporateInfo/|title=Sony Global - Corporate Information|
accessdate=2007-07-24}}</ref>
 | founder       = [[Masaru Ibuka]]<br>[[Akio Morita]]
 | location     = {{flagicon|Japan}} [[Minato, Tokyo]], [[Japan]]<ref
name="sonycorpinfo"/>
 | area_served     = [[Worldwide]]
 | key_people      = [[Sir Howard Stringer]]<br><small>([[Chairman]]) & ([[CEO]])</
small><ref name="sonycorpinfo"/><br />[[Ryoji Chubachi]]<br><small>([[President]]) &
([[CEO|Electronics CEO]])<ref name="sonycorpinfo"/>
 | industry      = [[Consumer electronics]]<br>[[Entertainment]]
 | products       = [[Audio]]<br>[[Video]]<br>[[Televisions]]<br>[[Information Technology|
Communications and Information Technology]]<br>[[Semiconductors]]<br>[[Electronic
components]]<br>[[Motion Picture]]<br>[[Music]]<br>[[Online|Online Business]]<br>[[Sony
Playstation]]
 | services      = [[Financial services]]
 | market cap      = [[United States Dollar|US$]] 40.56 Billion (''2008'')
 | revenue       = {{profit}} [[United States Dollar|US$]] 88.714 Billion
(''2008'')<ref name="2007 Q4">{{citeweb|url=http://www.sony.net/SonyInfo/
IR/financial/fr/07q4_sony.pdf|title=Sony Corporation Earnings release for the
fiscal year ended March 31, 2008|format=PDF}}</ref>
 | operating_income = {{profit}} [[United States Dollar|US$]] 3.745 Billion (''2008'')
 | net_income       = {{profit}} [[United States Dollar|US$]] 3.694 Billion (''2008'')
 | assets       = {{increase}} [[United States Dollar|US$]] 117.603 Billion (''2008'')
 | equity       = {{increase}} [[United States Dollar|US$]] 32.465 Billion (''2008'')
 | num_employees = 180,500 (as of [[March 31]] [[2008]]) <ref name="sonycorpinfo"/>
 | parent       =
 | subsid       = [[Sony Corporation shareholders and subsidiaries|List of the
subsidiaries]]
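To give a feel for why this markup is hard to use directly, here is a naive Python 3 sketch that pulls name-value pairs out of a few of the infobox lines above. It is an illustration only, not the Wiki2XML parser: the function names are hypothetical, and the regular expressions handle just links, `<br>` separators, and `<ref>` tags. Nested templates, values wrapped across lines, and image links would still come out wrong.

```python
import re

# A few single-line fields copied from the infobox above.
INFOBOX_SNIPPET = """\
 | slogan = like.no.other
 | founder       = [[Masaru Ibuka]]<br>[[Akio Morita]]
 | industry      = [[Consumer electronics]]<br>[[Entertainment]]
 | num_employees = 180,500 (as of [[March 31]] [[2008]]) <ref name="sonycorpinfo"/>
"""


def clean_value(raw):
    """Strip references and link brackets, then split the value on <br>."""
    text = re.sub(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", "", raw)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    parts = re.split(r"<br\s*/?>", text)
    return [p.strip() for p in parts if p.strip()]


def parse_infobox(markup):
    """Map each '| name = value' line to a list of cleaned values."""
    fields = {}
    for line in markup.splitlines():
        match = re.match(r"\s*\|\s*([^=]+?)\s*=\s*(.*)", line)
        if match:
            fields[match.group(1)] = clean_value(match.group(2))
    return fields


fields = parse_infobox(INFOBOX_SNIPPET)
print(fields["founder"])
print(fields["num_employees"])
```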
Problems With Wikipedia Data
•Wikipedia is huge!
   •2,150,000 articles
   •7,100,000 category references
     •found in 280,000 categories
   •54,029 non-trivial templates (>= 5 uses)
     •50,671,533 Template name-value properties
•Wikipedia is growing!
   •Grows 2% a week
    •25,170 new articles
    •39,571 new redirects
    •8,000 deletes
    •5,000 name changes
    •1,000 article splits
    •1,000 id changes
•Before this talk is over, there will be 150 NEW articles!
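The 150-article figure can be checked against the weekly number above. A small Python 3 sketch of the arithmetic, assuming new articles arrive at a constant rate through the week:

```python
NEW_ARTICLES_PER_WEEK = 25170      # from the slide
MINUTES_PER_WEEK = 7 * 24 * 60     # 10,080 minutes

articles_per_minute = NEW_ARTICLES_PER_WEEK / MINUTES_PER_WEEK
minutes_for_150 = 150 / articles_per_minute

print(round(articles_per_minute, 2))  # about 2.5 new articles per minute
print(round(minutes_for_150))         # about 60 minutes to reach 150
```

So 150 new articles corresponds to roughly an hour at that rate.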
The Freebase Wikipedia Extraction (WEX)


[Diagram: current Wikipedia dump (wiki markup articles) -> Wiki2XML parser (Magnus Manske) -> current Wikipedia dump (XML articles) -> big Postgres database holding sections, templates, text, articles, redirects, categories, and Freebase mappings]
WEX Article XML
<template name="Infobox_President">
 <param name="name">Abraham Lincoln</param>
 <param name="nationality">American</param>
 <param name="image">Abraham Lincoln head on shoulders photo portrait.jpg</param>
 <param name="order">16th<space/><link>
       <target>President of the United States</target></link></param>
 <param name="term_start"><link><target>March 4</target></link>,
       <space/><link><target>1861</target></link></param>
 <param name="term_end"><link><target>April 15</target></link>,
       <space/><link><target>1865</target></link></param>
 <param name="predecessor"><link><target>James Buchanan</target></link></param>
 <param name="successor"><link><target>Andrew Johnson</target></link></param>
....
</template>
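Once the markup is XML, ordinary XML tooling is enough to read it. A Python 3 sketch using the standard library's `xml.etree.ElementTree` on a trimmed copy of the template above; the helper names (`flatten`, `template_params`, `link_targets`) are hypothetical, and rendering `<space/>` as a single space is an assumption about what that element means:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's example, with each param kept on one line.
WEX_TEMPLATE = """
<template name="Infobox_President">
 <param name="name">Abraham Lincoln</param>
 <param name="order">16th<space/><link><target>President of the United States</target></link></param>
 <param name="term_start"><link><target>March 4</target></link>,<space/><link><target>1861</target></link></param>
 <param name="predecessor"><link><target>James Buchanan</target></link></param>
</template>
"""


def flatten(elem):
    """Concatenate all text under elem, treating <space/> as one space."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(" " if child.tag == "space" else flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)


def template_params(xml_text):
    """Return {param name: flattened text} for one template element."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): flatten(p).strip() for p in root.findall("param")}


def link_targets(xml_text, param_name):
    """Return the link targets that appear inside the named param."""
    root = ET.fromstring(xml_text)
    for p in root.findall("param"):
        if p.get("name") == param_name:
            return [t.text for t in p.iter("target")]
    return []


params = template_params(WEX_TEMPLATE)
print(params["order"])
print(link_targets(WEX_TEMPLATE, "term_start"))
```

Keeping the link targets separate from the flattened text is what makes it possible to turn a value like the predecessor into a relationship between two articles rather than a string.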
WEX Schema

•articles
•category_members
•sections
•redirects
•template_calls
•template_values
•freebase_wpid
•freebase_types
•freebase_names
SELECT xpath('/param/text()', template_values.xml)
FROM template_values
INNER JOIN template_calls ON call_id = template_calls.id
INNER JOIN articles ON articles.wpid = article_wpid
WHERE template_article_name = 'Template:Infobox Bridge'
 AND template_values.name = 'mainspan'
 AND articles.name = 'Fremont Bridge (Portland)'

Result: "{"1,255 ft (382.5 m)","longest in Oregon"}"
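The shape of that join can be tried without a WEX database. The Python 3 sketch below rebuilds three of the tables in an in-memory SQLite database, with only the columns the query mentions. Everything about the data is an assumption: the single row is made up to reproduce the result on the slide, `article_wpid` is assumed to live on `template_calls`, and since SQLite has no `xpath`, a small helper stands in for `xpath('/param/text()', ...)`.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (wpid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE template_calls (
        id INTEGER PRIMARY KEY, article_wpid INTEGER, template_article_name TEXT);
    CREATE TABLE template_values (call_id INTEGER, name TEXT, xml TEXT);
""")

# Made-up rows, shaped to reproduce the result shown on the slide.
conn.execute("INSERT INTO articles VALUES (1, 'Fremont Bridge (Portland)')")
conn.execute("INSERT INTO template_calls VALUES (10, 1, 'Template:Infobox Bridge')")
conn.execute(
    "INSERT INTO template_values VALUES (?, ?, ?)",
    (10, "mainspan",
     '<param name="mainspan">1,255 ft (382.5 m)<br/>longest in Oregon</param>'),
)


def text_nodes(xml_text):
    """Direct text children of the root element, like /param/text()."""
    root = ET.fromstring(xml_text)
    nodes = [root.text] + [child.tail for child in root]
    return [n.strip() for n in nodes if n and n.strip()]


rows = conn.execute("""
    SELECT template_values.xml
    FROM template_values
    INNER JOIN template_calls ON call_id = template_calls.id
    INNER JOIN articles ON articles.wpid = article_wpid
    WHERE template_article_name = 'Template:Infobox Bridge'
      AND template_values.name = 'mainspan'
      AND articles.name = 'Fremont Bridge (Portland)'
""").fetchall()

mainspan = [text_nodes(xml) for (xml,) in rows]
print(mainspan)
```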
words become “features”
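The code on the following slides calls `getwords` without showing it. A plausible Python 3 sketch, assuming the simplest bag-of-words treatment: lowercase the text, split on anything that is not a letter, and keep each distinct word once. The length limits are an arbitrary choice for this sketch, not something the slides specify.

```python
import re


def getwords(text):
    """Return the set of distinct lowercase words of 3 to 19 letters."""
    words = (w.lower() for w in re.split(r"[^A-Za-z]+", text))
    return {w for w in words if 2 < len(w) < 20}


print(sorted(getwords("The ninja used kanji in Japan.")))
```

Returning a set means a word counts once per article, however often it appears.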
category_list=['Category:Ninjas',
               'Category:Pirates',
               'Category:Assassins',
               'Category:Apple Inc. employees',
               'Category:Microsoft employees',
               'Category:Google employees',
               'Category:Free software programmers',
               'Category:Computer programmers']
print "Training classifiers..."

# Get the members of every category
name_classes={}
for category in category_list:
    members = get_category_members(cur, category)
    print str(len(members)) + " examples for " + category
    for name in members:
        name_classes.setdefault(name,set()).add(category)
def get_category_members(cur, category):
    records = []
    queue = [category]
    recordsSeen = set()
    while len(queue) > 0 and len(records) < 500:
        currentCategory = queue.pop(0)
        cur.execute("select articles.wpid, articles.name " +
          "from wikipedia.category_members, wikipedia.articles " +
          "where category_members.category_name like %s and " +
          "articles.wpid=category_members.article_wpid",
          (currentCategory,))
        result = cur.fetchall()
        for wpid, name in result:
            if wpid not in recordsSeen:
                 recordsSeen.add(wpid)
                 if name.startswith("Category:"):
                     queue.append(name)
                 else: records.append(name)
    return records
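The traversal above is a breadth-first walk of the category tree with a soft cap on the number of articles. The same idea in Python 3 without a database, using a dict as a stand-in for the `category_members` table; the toy graph is made up, and `collections.deque` replaces `queue.pop(0)` so that taking from the front of the queue stays cheap:

```python
from collections import deque


def category_members(graph, category, limit=500):
    """Collect article names under category, descending into subcategories."""
    records = []
    seen = {category}
    queue = deque([category])
    # As in the original, the cap is only checked between categories,
    # so the result can run slightly over the limit.
    while queue and len(records) < limit:
        current = queue.popleft()
        for name in graph.get(current, []):
            if name in seen:
                continue
            seen.add(name)
            if name.startswith("Category:"):
                queue.append(name)   # subcategory: visit it later
            else:
                records.append(name)
    return records


# Hypothetical category graph, including a cycle back to the root.
toy_graph = {
    "Category:Pirates": ["Henri Caesar", "Category:Fictional pirates"],
    "Category:Fictional pirates": ["Long John Silver", "Jack Sparrow",
                                   "Category:Pirates"],
}
print(category_members(toy_graph, "Category:Pirates"))
```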
for name in name_classes:
    name, text = get_article_text(cur, name)

    words=getwords(text[0:1024])

    for cat,cl in classifiers:
        if cat in name_classes[name]:
            cl.train(words,1)
        else:
            cl.train(words,0)
def get_article_text(cur, name):
    cur.execute("select name, text from " +
                "wikipedia.articles where name=%s", (name,))
    return cur.fetchone()
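The slides use `cl.train(words, 1)`, `cl.prob(words, 1)` and `cl.classify(words)` but never show the classifier itself. The Python 3 class below is one way such an object could look, offered as a sketch rather than the actual wexbayes code: a two-class naive Bayes variant that scores a document as the class prior times the smoothed fraction of training documents containing each word. The class name and the smoothing constants are assumptions.

```python
from collections import defaultdict


class BinaryNaiveBayes:
    """Two-class (0/1) naive Bayes over sets of words."""

    def __init__(self):
        self.doc_counts = {0: 0, 1: 0}
        self.word_counts = {0: defaultdict(int), 1: defaultdict(int)}

    def train(self, words, label):
        """Record one training document with the given label."""
        self.doc_counts[label] += 1
        for word in set(words):
            self.word_counts[label][word] += 1

    def word_prob(self, word, label):
        """Smoothed share of documents with this label that contain word."""
        count = self.word_counts[label].get(word, 0)
        return (count + 1.0) / (self.doc_counts[label] + 2.0)

    def prob(self, words, label):
        """Unnormalized score: prior times the product of word probabilities."""
        total = sum(self.doc_counts.values())
        score = self.doc_counts[label] / total if total else 0.5
        for word in set(words):
            score *= self.word_prob(word, label)
        return score

    def classify(self, words):
        return 1 if self.prob(words, 1) > self.prob(words, 0) else 0


cl = BinaryNaiveBayes()
cl.train({"ship", "sea", "crew"}, 1)
cl.train({"ship", "captain"}, 1)
cl.train({"ninja", "japan"}, 0)
cl.train({"kanji", "japan"}, 0)

words = {"ship", "sea"}
print(cl.classify(words), cl.prob(words, 1) / cl.prob(words, 0))
```

Because the score is a product over every word in the text, the ratio between the two classes can become extremely large or extremely small, as in the result slides.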
# Test set:
test_set=["Henri Caesar", "Long John Silver", "Jack Sparrow",
          "Storm Shadow (G.I. Joe)", "Leonardo (TMNT)",
          "Bill Gates", "Steve Jobs", "Richard Stallman",
          "Larry Page", "Guido van Rossum", "Larry Wall",
          "Jerry Yang"]

# Run tests:
for testName in test_set:
    name, text = get_article_text(cur, testName)

    print name
    words=getwords(text[0:1024])

    for cat,cl in classifiers:
        py,pn=cl.prob(words,1),cl.prob(words,0)
        print '%s\t%s\t%f' % (cat, cl.classify(words),
                              py/pn if pn>0 else 100)
Category:Ninjas                     0    0.000000
Category:Pirates                    1    155082805066493952.000000
Category:Assassins                  0    0.000001
Category:Apple Inc. employees       0    0.000000
Category:Microsoft employees        0    0.000000
Category:Google employees           0    0.000000
Category:Free software programmers  0    0.000000
Category:Computer programmers       0    0.000000

Category:Ninjas                     1    166627867883323968.000000
Category:Pirates                    1    19.751727
Category:Assassins                  1    413388475722811.625000
Category:Apple Inc. employees       0    0.000000
Category:Microsoft employees        0    0.000000
Category:Google employees           0    0.000000
Category:Free software programmers  0    0.000000
Category:Computer programmers       0    0.000000
[Word list: ninja, japan, characters, period, sengoku, service, they, use, means, term, based, different, japanese, era, heroes, appearance, kanji]
[Word list: pirate, coast, ship, pirates, crew, off, sea, captured, island, century, ships, piracy, north, merchant, captain, according]
Can we construct a sentence that fits into both categories? (without using the word “pirate” or “ninja”)
“Toby Segaran lived during the sengoku period in Japan. He spent many years at sea capturing Japanese ships.”
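A quick way to see why this sentence works is to check which of its words appear in the two word lists shown earlier. A Python 3 sketch; the word sets are copied from those slides, while the tokenization (lowercase, letters only) is an assumption:

```python
import re

NINJA_WORDS = {"ninja", "japan", "characters", "period", "sengoku", "service",
               "they", "use", "means", "term", "based", "different",
               "japanese", "era", "heroes", "appearance", "kanji"}
PIRATE_WORDS = {"pirate", "coast", "ship", "pirates", "crew", "off", "sea",
                "captured", "island", "century", "ships", "piracy", "north",
                "merchant", "captain", "according"}

sentence = ("Toby Segaran lived during the sengoku period in Japan. "
            "He spent many years at sea capturing Japanese ships.")
sentence_words = set(re.findall(r"[a-z]+", sentence.lower()))

ninja_hits = sorted(sentence_words & NINJA_WORDS)
pirate_hits = sorted(sentence_words & PIRATE_WORDS)
print(ninja_hits)   # ['japan', 'japanese', 'period', 'sengoku']
print(pirate_hits)  # ['sea', 'ships']
```

Note that "capturing" does not match "captured": without stemming, each word form is a separate feature.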
http://code.google.com/p/wexbayes/
Category:Ninjas 0.008158
Category:Pirates 21125425312885.750000
Category:Assassins 237.408562
Category:Apple Inc. employees 0.000000
Category:Microsoft employees 0.000000
Category:Google employees 0.000000
Category:Free software programmers 0.000000
Category:Computer programmers 0.000000

Category:Ninjas 2924533519139.380859
Category:Pirates 0.003120
Category:Assassins 67800277337186.242188

Category:Ninjas 300.861827
Category:Pirates 8392781192.375138
Category:Assassins 3817.111709

Category:Ninjas 0.000000
Category:Pirates 0.000000
Category:Assassins 0.000000
Category:Apple Inc. employees 63.870863
Category:Microsoft employees 186751.882154
Category:Google employees 0.012458
Category:Free software programmers 0.000197
Category:Computer programmers 1222.542202

Category:Apple Inc. employees 66414979293154.234375
Category:Microsoft employees 694373.180082
Category:Google employees 2381.361809
Category:Free software programmers 0.014712
Category:Computer programmers 871493.654163

Category:Apple Inc. employees 11.530269
Category:Microsoft employees 1.616829
Category:Google employees 2703.744581
Category:Free software programmers 12521439594.583622
Category:Computer programmers 141466542964.903381

Category:Apple Inc. employees 23162.014385
Category:Microsoft employees 99.417180
Category:Google employees 291981001482.833679
Category:Free software programmers 0.026512
Category:Computer programmers 258.589845

Category:Apple Inc. employees 2.018667
Category:Microsoft employees 0.310716
Category:Google employees 84.447472
Category:Free software programmers 21693.027739
Category:Computer programmers 7538656551.050776

Category:Apple Inc. employees 518.061964
Category:Microsoft employees 16855.582495
Category:Google employees 940060750.923012
Category:Free software programmers 259957538.360797
Category:Computer programmers 462842056873.530640

Category:Apple Inc. employees 0.063467
Category:Microsoft employees 0.004360
Category:Google employees 79.474061
Category:Free software programmers 0.000001
Category:Computer programmers 0.000108

Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008

  • 1. We all love Wikipedia!
  • 2. Wikipedia has lots of data...
  • 4. At Freebase, we use Wikipedia as a source for extracting facts and relationships
  • 5. Some Interesting Data in Wikipedia Infoboxes Categories Text
  • 6. Problems With Wikipedia Data •The data is dirty •Wiki markup is hard to parse.. •.. and often dirty •.. and not well defined •Properties and relations are in Wiki markup •Page redirects have to be resolved
  • 7. {{Infobox Company | company_name = Sony Corporation <br> | company_logo = [[Image:Sony logo.svg|220px|]] | slogan = like.no.other | company_type = [[Public company|Public]]<br>({{Tyo|6758}})<br>({{nyse|SNE}}) | foundation = [[May 7]] [[1946]] (adopted current name in 1958)<ref name=sonycorpinfo>{{citeweb|url=http://www.sony.net/SonyInfo/ CorporateInfo/|title=Sony Global - Corporate Information| accessdate=2007-07-24}}</ref> | founder = [[Masaru Ibuka]]<br>[[Akio Morita]] | location = {{flagicon|Japan}} [[Minato, Tokyo]], [[Japan]]<ref name="sonycorpinfo"/> | area_served = [[Worldwide]] | key_people = [[Sir Howard Stringer]]<br><small>([[Chairman]]) & ([[CEO]])</ small><ref name="sonycorpinfo"/><br />[[Ryoji Chubachi]]<br><small>([[President]]) & ([[CEO|Electronics CEO]])<ref name="sonycorpinfo"/> | industry = [[Consumer electronics]]<br>[[Entertainment]] | products = [[Audio]]<br>[[Video]]<br>[[Televisions]]<br>[[Information Technology| Communications and Information Technology]]<br>[[Semiconductors]]<br>[[Electronic components]]<br>[[Motion Picture]]<br>[[Music]]<br>[[Online|Online Business]]<br>[[Sony Playstation]] | services = [[Financial services]] | market cap = [[United States Dollar|US$]] 40.56 Billion (''2008'') | revenue = {{profit}} [[United States Dollar|US$]] 88.714 Billion (''2008'')<ref name="2007 Q4">{{citeweb|url=http://www.sony.net/SonyInfo/ IR/financial/fr/07q4_sony.pdf|title=Sony Corporation Earnings release for the fiscal year ended March 31, 2008|format=PDF}}</ref> | operating_income = {{profit}} [[United States Dollar|US$]] 3.745 Billion (''2008'') | net_income = {{profit}} [[United States Dollar|US$]] 3.694 Billion (''2008'') | assets = {{increase}} [[United States Dollar|US$]] 117.603 Billion (''2008'') | equity = {{increase}} [[United States Dollar|US$]] 32.465 Billion (''2008'') | num_employees = 180,500 (as of [[March 31]] [[2008]]) <ref name="sonycorpinfo"/> | parent = | subsid = [[Sony Corporation shareholders and 
subsidiaries|List of the subsidiaries]]
  • 8. Problems With Wikipedia Data •Wikipedia is huge! •2,150,000 articles •7,100,000 category references •found in 280,000 categories •54,029 non-trivial templates (>= 5 uses) •50,671,533 Template name-value properties •Wikipedia is growing! •Grows 2% a week •25,170 new articles •39,571 new redirects •8,000 deletes •5,000 name changes •1,000 article splits •1,000 id changes •Before this talk is over, there will be 150 NEW articles!
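As a rough check of the last bullet, the weekly figure of 25,170 new articles can be converted to a per-minute rate. The 60-minute talk length below is an assumption; the deck does not state it.

```python
# Weekly new-article count taken from the slide.
NEW_ARTICLES_PER_WEEK = 25170
MINUTES_PER_WEEK = 7 * 24 * 60

# Assumed talk length in minutes (hypothetical; not given in the deck).
TALK_MINUTES = 60

articles_per_minute = NEW_ARTICLES_PER_WEEK / MINUTES_PER_WEEK
articles_during_talk = articles_per_minute * TALK_MINUTES

print(f"{articles_per_minute:.2f} new articles per minute")
print(f"about {articles_during_talk:.0f} new articles in {TALK_MINUTES} minutes")
```

Under that assumption the result lands close to the 150 articles quoted on the slide.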
  • 9. The Freebase Wikipedia Extraction (WEX) [Diagram: current Wikipedia dump articles (wiki markup) → Wiki2XML parser (Magnus Manske) → markup XML → one big Postgres database holding articles, sections, templates, text, redirects, categories, and Freebase mappings]
  • 10. WEX Article XML <template name="Infobox_President"> <param name="name">Abraham Lincoln</param> <param name="nationality">American</param> <param name="image">Abraham Lincoln head on shoulders photo portrait.jpg</param> <param name="order">16th<space/><link> <target>President of the United States</target></link></param> <param name="term_start"><link><target>March 4</target></link>, <space/><link><target>1861</target></link></param> <param name="term_end"><link><target>April 15</target></link>, <space/><link><target>1865</target></link></param> <param name="predecessor"><link><target>James Buchanan</target></link></param> <param name="successor"><link><target>Andrew Johnson</target></link></param> .... </template>
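Once the markup is XML, template parameters can be read with an ordinary XML parser. The following is a minimal sketch using Python's standard library on an abbreviated copy of the element above; the helper names `flatten` and `template_params` are hypothetical, and treating `<space/>` as a single space is an assumption about the WEX format.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the template element shown on the slide.
WEX_TEMPLATE = """
<template name="Infobox_President">
  <param name="name">Abraham Lincoln</param>
  <param name="nationality">American</param>
  <param name="order">16th<space/><link><target>President of the United States</target></link></param>
  <param name="predecessor"><link><target>James Buchanan</target></link></param>
  <param name="successor"><link><target>Andrew Johnson</target></link></param>
</template>
"""


def flatten(elem):
    """Concatenate all text under elem in document order.

    Assumption: <space/> stands for a single space character.
    """
    parts = [" "] if elem.tag == "space" else []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.append(flatten(child))
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)


def template_params(xml_text):
    """Map each param name to (plain text, list of link targets)."""
    root = ET.fromstring(xml_text.strip())
    params = {}
    for param in root.findall("param"):
        targets = [t.text for t in param.iter("target")]
        params[param.get("name")] = (flatten(param), targets)
    return params


params = template_params(WEX_TEMPLATE)
for name, (text, targets) in params.items():
    print(name, "=", text, targets)
```

Keeping the link targets separate from the flattened text is a design choice for this sketch: the targets are the part that can be resolved to other articles.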
  • 11. WEX Schema [Diagram of tables: articles, sections, redirects, category_members, template_calls, template_values, freebase_wpid, freebase_types, freebase_names]
  • 12.
      SELECT xpath('/param/text()', template_values.xml)
      FROM template_values
        INNER JOIN template_calls ON call_id = template_calls.id
        INNER JOIN articles ON articles.wpid = article_wpid
      WHERE template_article_name = 'Template:Infobox Bridge'
        AND template_values.name = 'mainspan'
        AND articles.name = 'Fremont Bridge (Portland)'

      Result: "{"1,255 ft (382.5 m)","longest in Oregon"}"
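The result is an array because `/param/text()` selects every text node directly under `<param>`, and child elements split the text into several nodes. A minimal sketch of that behaviour in Python follows; the stored XML value (including the `<br/>` separator) is an assumption made for illustration, not the actual WEX row.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored value; the real WEX row may use different child markup.
PARAM_XML = '<param name="mainspan">1,255 ft (382.5 m)<br/>longest in Oregon</param>'


def direct_text_nodes(xml_text):
    """Return the text nodes that are direct children of the root element,
    which is what the XPath /param/text() selects."""
    root = ET.fromstring(xml_text)
    nodes = []
    if root.text:
        nodes.append(root.text)
    # Text that follows a child element is stored as that child's tail.
    for child in root:
        if child.tail:
            nodes.append(child.tail)
    return nodes


text_nodes = direct_text_nodes(PARAM_XML)
print(text_nodes)
```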
  • 19.
      category_list = ['Category:Ninjas', 'Category:Pirates', 'Category:Assassins',
                       'Category:Apple Inc. employees', 'Category:Microsoft employees',
                       'Category:Google employees', 'Category:Free software programmers',
                       'Category:Computer programmers']
      print("Training classifiers...")
      # Get the members of every category
      name_classes = {}
      for category in category_list:
          members = get_category_members(cur, category)
          print(str(len(members)) + " examples for " + category)
          # An article can belong to several categories, so keep a set per name
          for name in members:
              name_classes.setdefault(name, set()).add(category)
  • 20.
      def get_category_members(cur, category):
          records = []
          queue = [category]
          recordsSeen = set()
          # Breadth-first walk through subcategories, up to roughly 500 articles
          while len(queue) > 0 and len(records) < 500:
              currentCategory = queue.pop(0)
              cur.execute("select articles.wpid, articles.name " +
                          "from wikipedia.category_members, wikipedia.articles " +
                          "where category_members.category_name like %s and " +
                          "articles.wpid=category_members.article_wpid",
                          (currentCategory,))
              result = cur.fetchall()
              for wpid, name in result:
                  if wpid not in recordsSeen:
                      recordsSeen.add(wpid)
                      if name.startswith("Category:"):
                          queue.append(name)
                      else:
                          records.append(name)
          return records
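The traversal in `get_category_members` is a breadth-first walk over subcategories with a seen-set keyed on page id. The sketch below shows the same logic against a small in-memory dictionary instead of the database, so it can be run without Postgres. The category contents, the ids, and the name `collect_members` are all made up for illustration.

```python
from collections import deque

# Hypothetical category graph: category name -> list of (wpid, page name).
CATEGORY_MEMBERS = {
    "Category:Pirates": [
        (1, "Henri Caesar"),
        (2, "Category:Fictional pirates"),
        (4, "Jack Sparrow"),
    ],
    "Category:Fictional pirates": [
        (3, "Long John Silver"),
        (4, "Jack Sparrow"),  # duplicate membership, skipped via the seen-set
    ],
}


def collect_members(graph, category, limit=500):
    """Breadth-first collection of article names under a category."""
    records = []
    seen = set()
    queue = deque([category])
    # As in the slide's code, the limit is only checked between categories,
    # so the result can exceed it by the size of the last category visited.
    while queue and len(records) < limit:
        current = queue.popleft()
        for wpid, name in graph.get(current, []):
            if wpid in seen:
                continue
            seen.add(wpid)
            if name.startswith("Category:"):
                queue.append(name)  # descend into the subcategory later
            else:
                records.append(name)
    return records


members = collect_members(CATEGORY_MEMBERS, "Category:Pirates")
print(members)
```

A `deque` is used here because popping from the front of a list is linear in its length; for queues of this size either choice works.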
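  • 21.
      for name in name_classes:
          name, text = get_article_text(cur, name)
          # Train on the first 1024 characters of each article
          words = getwords(text[0:1024])
          # One classifier per category: positive if the article is a member
          for cat, cl in classifiers:
              if cat in name_classes[name]:
                  cl.train(words, 1)
              else:
                  cl.train(words, 0)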
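The transcript does not include the definitions of `classifiers` or `getwords`, so the following is only a sketch of one way the one-classifier-per-category training loop could work. The `BinaryNaiveBayes` class, the tokenizer, and the toy training texts are all assumptions; the `train`, `prob`, and `classify` method names are chosen to match the calls on the slides, not taken from any particular library.

```python
import re
from collections import Counter


def getwords(text):
    """Hypothetical tokenizer: the set of lowercase words of 3+ letters."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))


class BinaryNaiveBayes:
    """Hypothetical two-class naive Bayes over word presence."""

    def __init__(self):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = {0: 0, 1: 0}

    def train(self, words, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)

    def prob(self, words, label):
        """Unnormalized score: P(label) times the smoothed P(word | label)."""
        total_docs = self.doc_counts[0] + self.doc_counts[1]
        p = self.doc_counts[label] / total_docs
        for word in words:
            # Laplace smoothing so unseen words do not zero the product.
            p *= (self.word_counts[label][word] + 1) / (self.doc_counts[label] + 2)
        return p

    def classify(self, words):
        return 1 if self.prob(words, 1) > self.prob(words, 0) else 0


categories = ["Category:Ninjas", "Category:Pirates"]
classifiers = [(cat, BinaryNaiveBayes()) for cat in categories]

# Made-up training texts standing in for article openings.
examples = [
    ("Category:Ninjas", "ninja of the sengoku period in japan"),
    ("Category:Pirates", "pirate captain whose ship and crew sailed the sea"),
]
for true_cat, text in examples:
    words = getwords(text)
    for cat, cl in classifiers:
        cl.train(words, 1 if cat == true_cat else 0)

test_words = getwords("a ship captured off the coast by its crew")
results = {cat: cl.classify(test_words) for cat, cl in classifiers}
print(results)
```

Training each category as its own yes/no classifier means an article can score positive for several categories at once, which fits the overlapping results shown on the later slides.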
  • 22.
      def get_article_text(cur, name):
          cur.execute("select name, text from " +
                      "wikipedia.articles where name=%s", (name,))
          return cur.fetchone()
  • 23.
      # Test set:
      test_set = ["Henri Caesar", "Long John Silver", "Jack Sparrow",
                  "Storm Shadow (G.I. Joe)", "Leonardo (TMNT)", "Bill Gates",
                  "Steve Jobs", "Richard Stallman", "Larry Page",
                  "Guido van Rossum", "Larry Wall", "Jerry Yang"]
      # Run tests:
      for testName in test_set:
          name, text = get_article_text(cur, testName)
          print(name)
          words = getwords(text[0:1024])
          for cat, cl in classifiers:
              # Ratio of the positive score to the negative score
              py, pn = cl.prob(words, 1), cl.prob(words, 0)
              print('%s\t%s\t%f' % (cat, cl.classify(words), py/pn if pn > 0 else 100))
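The printed `py/pn` ratio can span many orders of magnitude, which makes rows hard to compare by eye. One possible alternative, shown here only as a sketch and not as what the talk used, is to report the base-10 logarithm of the ratio. The function name `log10_ratio` and the cap value are hypothetical.

```python
import math


def log10_ratio(py, pn, cap=100.0):
    """Return log10(py / pn), falling back to +/-cap when one side is zero."""
    if py > 0 and pn > 0:
        return math.log10(py) - math.log10(pn)
    if py == 0 and pn == 0:
        return 0.0  # no evidence either way
    return cap if pn == 0 else -cap


# Two ratios taken from the result slides, expressed as py with pn = 1.
for ratio in (155082805066493952.0, 19.751727):
    print("%f\t%.2f" % (ratio, log10_ratio(ratio, 1.0)))
```

On this scale, positive values favour membership and negative values favour non-membership, and the size of the number reads directly as orders of magnitude.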
  • 24.
      category                            classify  py/pn
      Category:Ninjas                     0         0.000000
      Category:Pirates                    1         155082805066493952.000000
      Category:Assassins                  0         0.000001
      Category:Apple Inc. employees       0         0.000000
      Category:Microsoft employees        0         0.000000
      Category:Google employees           0         0.000000
      Category:Free software programmers  0         0.000000
      Category:Computer programmers       0         0.000000
  • 25.
      category                            classify  py/pn
      Category:Ninjas                     1         166627867883323968.000000
      Category:Pirates                    1         19.751727
      Category:Assassins                  1         413388475722811.625000
      Category:Apple Inc. employees       0         0.000000
      Category:Microsoft employees        0         0.000000
      Category:Google employees           0         0.000000
      Category:Free software programmers  0         0.000000
      Category:Computer programmers       0         0.000000
  • 26. ninja japan characters period sengoku service they use means term based different japanese era heroes appearance kanji
  • 27. pirate coast ship pirates crew off sea captured island century ships piracy north merchant captain according
  • 28. Can we construct a sentence that fits into both categories? (without using the word “pirate” or “ninja”)
  • 29. “Toby Segaran lived during the sengoku period in Japan. He spent many years at sea capturing Japanese ships.”
  • 31.
      Category:Ninjas                     0.008158
      Category:Pirates                    21125425312885.750000
      Category:Assassins                  237.408562
      Category:Apple Inc. employees       0.000000
      Category:Microsoft employees        0.000000
      Category:Google employees           0.000000
      Category:Free software programmers  0.000000
      Category:Computer programmers       0.000000
  • 34.
      Category:Ninjas                     0.000000
      Category:Pirates                    0.000000
      Category:Assassins                  0.000000
      Category:Apple Inc. employees       63.870863
      Category:Microsoft employees        186751.882154
      Category:Google employees           0.012458
      Category:Free software programmers  0.000197
      Category:Computer programmers       1222.542202
  • 35.
      Category:Apple Inc. employees       66414979293154.234375
      Category:Microsoft employees        694373.180082
      Category:Google employees           2381.361809
      Category:Free software programmers  0.014712
      Category:Computer programmers       871493.654163
  • 36.
      Category:Apple Inc. employees       11.530269
      Category:Microsoft employees        1.616829
      Category:Google employees           2703.744581
      Category:Free software programmers  12521439594.583622
      Category:Computer programmers       141466542964.903381
  • 37.
      Category:Apple Inc. employees       23162.014385
      Category:Microsoft employees        99.417180
      Category:Google employees           291981001482.833679
      Category:Free software programmers  0.026512
      Category:Computer programmers       258.589845
  • 38.
      Category:Apple Inc. employees       2.018667
      Category:Microsoft employees        0.310716
      Category:Google employees           84.447472
      Category:Free software programmers  21693.027739
      Category:Computer programmers       7538656551.050776
  • 39.
      Category:Apple Inc. employees       518.061964
      Category:Microsoft employees        16855.582495
      Category:Google employees           940060750.923012
      Category:Free software programmers  259957538.360797
      Category:Computer programmers       462842056873.530640
  • 40.
      Category:Apple Inc. employees       0.063467
      Category:Microsoft employees        0.004360
      Category:Google employees           79.474061
      Category:Free software programmers  0.000001
      Category:Computer programmers       0.000108