Machine Learning for Knowledge Extraction from Wikipedia & Other Semantically Weak Sources - OSCON 2008

At Freebase, we use Wikipedia as a source for
extracting facts and relationships

Some Interesting Data in Wikipedia
• Infoboxes
• Categories
• Text

Problems With Wikipedia Data

• The data is dirty
• Wiki markup is hard to parse...
• ...and often dirty
• ...and not well defined
• Properties and relations are in wiki markup
• Page redirects have to be resolved

{{Infobox Company
| company_name = Sony Corporation 
| company_logo = [[Image:Sony logo.svg|220px|]]
| slogan = like.no.other
| company_type = [[Public company|Public]] ({{Tyo|6758}}) ({{nyse|SNE}})
| foundation = [[May 7]] [[1946]] (adopted current name in 1958)<ref
name=sonycorpinfo>{{citeweb|url=http://www.sony.net/SonyInfo/
CorporateInfo/|title=Sony Global - Corporate Information|
accessdate=2007-07-24}}</ref>
| founder = [[Masaru Ibuka]] [[Akio Morita]]
| location = {{flagicon|Japan}} [[Minato, Tokyo]], [[Japan]]<ref
name="sonycorpinfo"/>
| area_served = [[Worldwide]]
| key_people = [[Sir Howard Stringer]] ([[Chairman]]) & ([[CEO]])<ref name="sonycorpinfo"/> [[Ryoji Chubachi]] ([[President]]) &
([[CEO|Electronics CEO]])<ref name="sonycorpinfo"/>
| industry = [[Consumer electronics]] [[Entertainment]]
| products = [[Audio]] [[Video]] [[Televisions]] [[Information Technology|
Communications and Information Technology]] [[Semiconductors]] [[Electronic
components]] [[Motion Picture]] [[Music]] [[Online|Online Business]] [[Sony
Playstation]]
| services = [[Financial services]]
| market cap = [[United States Dollar|US$]] 40.56 Billion (''2008'')
| revenue = {{profit}} [[United States Dollar|US$]] 88.714 Billion
(''2008'')<ref name="2007 Q4">{{citeweb|url=http://www.sony.net/SonyInfo/
IR/financial/fr/07q4_sony.pdf|title=Sony Corporation Earnings release for the
fiscal year ended March 31, 2008|format=PDF}}</ref>
| operating_income = {{profit}} [[United States Dollar|US$]] 3.745 Billion (''2008'')
| net_income = {{profit}} [[United States Dollar|US$]] 3.694 Billion (''2008'')
| assets = {{increase}} [[United States Dollar|US$]] 117.603 Billion (''2008'')
| equity = {{increase}} [[United States Dollar|US$]] 32.465 Billion (''2008'')
| num_employees = 180,500 (as of [[March 31]] [[2008]]) <ref name="sonycorpinfo"/>
| parent =
| subsid = [[Sony Corporation shareholders and subsidiaries|List of the
subsidiaries]]
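
To make the parsing problem concrete, here is a minimal sketch of a naive line-based infobox reader. It is not the parser used for WEX; the names `infobox_fields` and `link_targets` and the shortened `SAMPLE` are made up for illustration. It only handles values that fit on one line, which is exactly where real infoboxes like the one above break it: wrapped values, nested templates and `<ref>` tags all need a proper parser.

```python
import re

# Shortened, single-line-per-field version of the infobox above (illustrative only).
SAMPLE = """{{Infobox Company
| company_name = Sony Corporation
| founder = [[Masaru Ibuka]] [[Akio Morita]]
| company_type = [[Public company|Public]] ({{Tyo|6758}}) ({{nyse|SNE}})
| parent =
}}"""

# [[Target]] or [[Target|Label]]
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")

def infobox_fields(markup):
    """Map field name -> raw value for lines of the form '| name = value'."""
    fields = {}
    for line in markup.splitlines():
        m = re.match(r"\|\s*([^=|]+?)\s*=\s*(.*)$", line)
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return fields

def link_targets(value):
    """Return the page names that a raw field value links to."""
    return [m.group(1) for m in LINK.finditer(value)]

fields = infobox_fields(SAMPLE)
print(link_targets(fields["founder"]))
```

Even on this cleaned-up sample, the `company_type` value still carries `{{Tyo|6758}}` and `{{nyse|SNE}}` as raw text, so a second pass would be needed to make sense of nested templates.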

Problems With Wikipedia Data
• Wikipedia is huge!
  • 2,150,000 articles
  • 7,100,000 category references
  • found in 280,000 categories
  • 54,029 non-trivial templates (>= 5 uses)
  • 50,671,533 template name-value properties
• Wikipedia is growing!
  • Grows 2% a week
  • 25,170 new articles
  • 39,571 new redirects
  • 8,000 deletes
  • 5,000 name changes
  • 1,000 article splits
  • 1,000 id changes
• Before this talk is over, there will be 150 NEW articles!
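
The "150 new articles" figure can be checked with a quick calculation, assuming the 25,170 new articles are a weekly count and the talk lasts about an hour (neither is stated on the slide):

```python
# Assumption: 25,170 new articles is a per-week figure.
NEW_ARTICLES_PER_WEEK = 25170
HOURS_PER_WEEK = 7 * 24

per_hour = NEW_ARTICLES_PER_WEEK / float(HOURS_PER_WEEK)
print("%.1f new articles per hour" % per_hour)
```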

The Freebase Wikipedia Extraction (WEX)

[Diagram: current Wikipedia dump articles (wiki markup) → Wiki2XML parser (Magnus Manske) → Wikipedia XML → big Postgres database, holding articles, sections, templates, text, redirects, categories and Freebase mappings]

WEX Article XML
<template name="Infobox_President">
<param name="name">Abraham Lincoln</param>
<param name="nationality">American</param>
<param name="image">Abraham Lincoln head on shoulders photo portrait.jpg</param>
<param name="order">16th<space/><link>
<target>President of the United States</target></link></param>
<param name="term_start"><link><target>March 4</target></link>,
<space/><link><target>1861</target></link></param>
<param name="term_end"><link><target>April 15</target></link>,
<space/><link><target>1865</target></link></param>
<param name="predecessor"><link><target>James Buchanan</target></link></param>
<param name="successor"><link><target>Andrew Johnson</target></link></param>
....
</template>
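
Once the markup is XML, ordinary XML tooling is enough to pull values out. A minimal sketch using Python's standard `xml.etree.ElementTree`, assuming the element layout shown above (`<template>`, `<param>`, `<link>`, `<target>`, `<space/>`); the helper names `flatten`, `template_params` and `all_link_targets` are hypothetical, and `SAMPLE` is a shortened, well-formed copy of the slide's XML:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<template name="Infobox_President">
<param name="name">Abraham Lincoln</param>
<param name="order">16th<space/><link><target>President of the United States</target></link></param>
<param name="predecessor"><link><target>James Buchanan</target></link></param>
</template>"""

def flatten(param):
    """Return the visible text of a <param>, treating <space/> as one blank."""
    parts = [param.text or ""]
    for child in param:
        if child.tag == "space":
            parts.append(" ")
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts).strip()

def template_params(xml_text):
    """Map each param name to its flattened text."""
    root = ET.fromstring(xml_text)
    return dict((p.get("name"), flatten(p)) for p in root.findall("param"))

def all_link_targets(xml_text):
    """List every linked article name in the template call."""
    root = ET.fromstring(xml_text)
    return [t.text for t in root.iter("target")]

params = template_params(SAMPLE)
print(params["order"])
```

The link targets are the interesting part for relationship extraction: each one names another article, so `predecessor` points at an entity rather than a plain string.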

WEX Schema

[Diagram: tables in the WEX schema: articles, sections, category_members, redirects, template_calls, template_values, freebase_wpid, freebase_types, freebase_names]

SELECT xpath('/param/text()', template_values.xml)
FROM template_values
INNER JOIN template_calls ON call_id = template_calls.id
INNER JOIN articles ON articles.wpid = article_wpid
WHERE template_article_name = 'Template:Infobox Bridge'
AND template_values.name = 'mainspan'
AND articles.name = 'Fremont Bridge (Portland)'

Result: "{"1,255 ft (382.5 m)","longest in Oregon"}"
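
The result shows why extraction does not stop at the query: one value is a measurement in two units, the other is a free-text remark. A minimal sketch of one possible clean-up step, assuming values shaped like the one above; `parse_length` is a hypothetical helper, not part of WEX:

```python
import re

def parse_length(value):
    """Return (feet, metres) found in a string such as '1,255 ft (382.5 m)'.

    Either element is None when that unit does not appear.
    """
    def number(pattern):
        m = re.search(pattern, value)
        return float(m.group(1).replace(",", "")) if m else None

    feet = number(r"(\d[\d,]*(?:\.\d+)?)\s*ft\b")
    metres = number(r"(\d[\d,]*(?:\.\d+)?)\s*m\b")
    return feet, metres

print(parse_length("1,255 ft (382.5 m)"))
print(parse_length("longest in Oregon"))
```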

category_list = ['Category:Ninjas',
                 'Category:Pirates',
                 'Category:Assassins',
                 'Category:Apple Inc. employees',
                 'Category:Microsoft employees',
                 'Category:Google employees',
                 'Category:Free software programmers',
                 'Category:Computer programmers']
print("Training classifiers...")

# Get the members of every category
name_classes = {}
for category in category_list:
    members = get_category_members(cur, category)
    print(str(len(members)) + " examples for " + category)
    for name in members:
        # An article can belong to several of the categories
        name_classes.setdefault(name, set()).add(category)

def get_category_members(cur, category):
    # Breadth-first walk over the category and its subcategories,
    # stopping once about 500 articles have been collected
    records = []
    queue = [category]
    recordsSeen = set()
    while len(queue) > 0 and len(records) < 500:
        currentCategory = queue.pop(0)
        cur.execute("select articles.wpid, articles.name " +
                    "from wikipedia.category_members, wikipedia.articles " +
                    "where category_members.category_name like %s and " +
                    "articles.wpid=category_members.article_wpid",
                    (currentCategory,))
        result = cur.fetchall()
        for wpid, name in result:
            if wpid not in recordsSeen:
                recordsSeen.add(wpid)
                if name.startswith("Category:"):
                    # Subcategory: visit its members later
                    queue.append(name)
                else:
                    records.append(name)
    return records
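
The same traversal can be tried without a database. The sketch below swaps the SQL query for a plain dictionary; `collect_members` and the toy `graph` are invented for illustration, and the category contents are not real Wikipedia data. It also shows why the seen-set matters: category graphs can contain cycles.

```python
from collections import deque

def collect_members(graph, category, limit=500):
    """Breadth-first walk: gather article names under a category and its subcategories."""
    records = []
    seen = set()
    queue = deque([category])
    while queue and len(records) < limit:
        current = queue.popleft()
        for name in graph.get(current, []):
            if name in seen:
                continue
            seen.add(name)
            if name.startswith("Category:"):
                queue.append(name)      # subcategory: expand later
            else:
                records.append(name)    # ordinary article
    return records

# Toy data with a deliberate cycle between the two categories.
graph = {
    "Category:Pirates": ["Henri Caesar", "Category:Fictional pirates"],
    "Category:Fictional pirates": ["Long John Silver", "Jack Sparrow",
                                   "Category:Pirates"],
}
print(collect_members(graph, "Category:Pirates"))
```

As in the slide's version, the limit is only checked between categories, so a large category can push the result past the limit.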

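for name in name_classes:
    name, text = get_article_text(cur, name)

    # Use only the first 1024 characters of each article
    words = getwords(text[0:1024])

    # One binary classifier per category: 1 = member, 0 = non-member
    for cat, cl in classifiers:
        if cat in name_classes[name]:
            cl.train(words, 1)
        else:
            cl.train(words, 0)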

def get_article_text(cur, name):
    cur.execute("select name, text from " +
                "wikipedia.articles where name=%s", (name,))
    return cur.fetchone()
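
The slides use `getwords` and the classifier objects without showing them. The following is a minimal sketch of what they could look like, written to match the calls on the slides (`train(words, label)`, `prob(words, label)`, `classify(words)`). The class name `NaiveBayes`, the word-length filter and the add-one smoothing are assumptions, not the talk's actual implementation.

```python
import re
from collections import defaultdict

def getwords(text):
    """Lowercase the text and return its distinct words of 3 to 19 letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return set(w for w in words if 2 < len(w) < 20)

class NaiveBayes(object):
    """Two-class naive Bayes over word sets (1 = member, 0 = non-member)."""

    def __init__(self):
        self.doc_counts = defaultdict(int)                        # label -> documents seen
        self.word_counts = defaultdict(lambda: defaultdict(int))  # label -> word -> documents containing it

    def train(self, words, label):
        self.doc_counts[label] += 1
        for w in words:
            self.word_counts[label][w] += 1

    def word_prob(self, word, label):
        # Add-one smoothing so an unseen word never forces the product to zero
        return (self.word_counts[label][word] + 1.0) / (self.doc_counts[label] + 2.0)

    def prob(self, words, label):
        """Unnormalised P(label) * product of P(word | label)."""
        total = sum(self.doc_counts.values())
        if total == 0:
            return 0.0
        p = self.doc_counts[label] / float(total)
        for w in words:
            p *= self.word_prob(w, label)
        return p

    def classify(self, words):
        return 1 if self.prob(words, 1) > self.prob(words, 0) else 0

cl = NaiveBayes()
cl.train(getwords("the pirate ship sailed the sea"), 1)
cl.train(getwords("the ninja of the sengoku period"), 0)
sample = getwords("a pirate ship")
print(cl.classify(sample))
```

With a class like this, `classifiers` could be built as a list of `(category, NaiveBayes())` pairs, one per entry in `category_list`. Multiplying many small probabilities can underflow on long texts; summing logarithms is the usual remedy.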

# Test set:
test_set = ["Henri Caesar", "Long John Silver", "Jack Sparrow",
            "Storm Shadow (G.I. Joe)", "Leonardo (TMNT)",
            "Bill Gates", "Steve Jobs", "Richard Stallman",
            "Larry Page", "Guido van Rossum", "Larry Wall",
            "Jerry Yang"]

# Run tests:
for testName in test_set:
    name, text = get_article_text(cur, testName)

    print(name)
    words = getwords(text[0:1024])

    # Print category, predicted class, and the ratio P(member) / P(non-member)
    for cat, cl in classifiers:
        py, pn = cl.prob(words, 1), cl.prob(words, 0)
        print('%s\t%s\t%f' % (cat, cl.classify(words), py/pn if pn > 0 else 100))

Category:Ninjas 0 0.000000
Category:Pirates 1 155082805066493952.000000
Category:Assassins 0 0.000001
Category:Apple Inc. employees 0 0.000000
Category:Microsoft employees 0 0.000000
Category:Google employees 0 0.000000
Category:Free software programmers 0 0.000000
Category:Computer programmers 0 0.000000

Category:Ninjas 1 166627867883323968.000000
Category:Pirates 1 19.751727
Category:Assassins 1 413388475722811.625000
Category:Apple Inc. employees 0 0.000000
Category:Microsoft employees 0 0.000000
Category:Google employees 0 0.000000
Category:Free software programmers 0 0.000000
Category:Computer programmers 0 0.000000

ninja
japan
characters
period
sengoku
service
they
use
means
term
based
different
japanese
era
heroes
appearance
kanji

pirate
coast
ship
pirates
crew
off
sea
captured
island
century
ships
piracy
north
merchant
captain
according

Can we construct a sentence that fits into both categories?

(without using the word “pirate” or “ninja”)

“Toby Segaran lived during the sengoku
period in Japan. He spent many years at
sea capturing Japanese ships.”
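
Why this sentence works can be seen by intersecting its words with the two word lists above, assuming those lists are the words the classifiers found most indicative of each category (the slides do not label them). A small sketch:

```python
import re

NINJA_WORDS = set("""ninja japan characters period sengoku service they use
means term based different japanese era heroes appearance kanji""".split())

PIRATE_WORDS = set("""pirate coast ship pirates crew off sea captured island
century ships piracy north merchant captain according""".split())

SENTENCE = ("Toby Segaran lived during the sengoku period in Japan. "
            "He spent many years at sea capturing Japanese ships.")

# Lowercase and split into distinct words
words = set(re.findall(r"[a-z]+", SENTENCE.lower()))

ninja_hits = sorted(words & NINJA_WORDS)
pirate_hits = sorted(words & PIRATE_WORDS)
print(ninja_hits)
print(pirate_hits)
```

Four words overlap with the ninja list and two with the pirate list, without either category name appearing in the sentence.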

http://code.google.com/p/wexbayes/

Category:Ninjas 0.008158
Category:Pirates 21125425312885.750000
Category:Assassins 237.408562
Category:Apple Inc. employees 0.000000
Category:Microsoft employees 0.000000
Category:Google employees 0.000000
Category:Free software programmers 0.000000
Category:Computer programmers 0.000000

Category:Ninjas 2924533519139.380859
