Wikipedia contains a wealth of collective knowledge, but its semi-structured design and idiosyncratic markup make mining this resource a formidable challenge. This session will examine techniques for mining semantically weak data sources for explicit facts.
The session will use WEX, a preprocessed normalization of Wikipedia designed to make this corpus easily accessible to developers interested in machine learning, natural language processing, or knowledge extraction. We will discuss the process through which WEX is prepared, as a guide to creating mineable structures from semi-structured data, followed by approaches to machine extraction on structures of mixed data quality.
The session is targeted at intermediate developers with an interest in machine learning or knowledge extraction (though no experience with either is assumed).
The demonstrations use Postgres 8.3’s XPath capability to simplify the programming model, with examples presented in Python, but the data and principles are compatible with any modern data infrastructure.
6. Problems With Wikipedia Data
•The data is dirty
•Wiki markup is hard to parse..
•.. and often dirty
•.. and not well defined
•Properties and relations are in Wiki markup
•Page redirects have to be resolved
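As a rough illustration of the last bullet, resolving redirects amounts to following a chain of title-to-title links until a real article is reached. The sketch below is not WEX code: the function name `resolve_redirect`, the `max_hops` guard, and the redirect table are all hypothetical and exist only to show the idea.

```python
def resolve_redirect(title, redirects, max_hops=10):
    """Follow redirects until reaching a title that is not itself a redirect.

    `redirects` maps a redirecting title to its target (hypothetical layout).
    Returns None if the chain loops or is longer than `max_hops`.
    """
    seen = set()
    current = title
    while current in redirects:
        if current in seen or len(seen) >= max_hops:
            return None  # cycle or suspiciously long chain: dirty data
        seen.add(current)
        current = redirects[current]
    return current


# Made-up redirect table, for illustration only.
redirects = {
    "Sony Corp": "Sony Corporation",
    "Sony Corporation": "Sony",
    "Loop A": "Loop B",
    "Loop B": "Loop A",
}

print(resolve_redirect("Sony Corp", redirects))  # Sony
print(resolve_redirect("Sony", redirects))       # Sony
print(resolve_redirect("Loop A", redirects))     # None
```

Returning None for loops keeps one bad redirect pair from hanging the whole extraction run.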
7.
{{Infobox Company
| company_name = Sony Corporation <br>
| company_logo = [[Image:Sony logo.svg|220px|]]
| slogan = like.no.other
| company_type = [[Public company|Public]]<br>({{Tyo|6758}})<br>({{nyse|SNE}})
| foundation = [[May 7]] [[1946]] (adopted current name in 1958)<ref name=sonycorpinfo>{{citeweb|url=http://www.sony.net/SonyInfo/CorporateInfo/|title=Sony Global - Corporate Information|accessdate=2007-07-24}}</ref>
| founder = [[Masaru Ibuka]]<br>[[Akio Morita]]
| location = {{flagicon|Japan}} [[Minato, Tokyo]], [[Japan]]<ref name="sonycorpinfo"/>
| area_served = [[Worldwide]]
| key_people = [[Sir Howard Stringer]]<br><small>([[Chairman]]) & ([[CEO]])</small><ref name="sonycorpinfo"/><br />[[Ryoji Chubachi]]<br><small>([[President]]) & ([[CEO|Electronics CEO]])<ref name="sonycorpinfo"/>
| industry = [[Consumer electronics]]<br>[[Entertainment]]
| products = [[Audio]]<br>[[Video]]<br>[[Televisions]]<br>[[Information Technology|Communications and Information Technology]]<br>[[Semiconductors]]<br>[[Electronic components]]<br>[[Motion Picture]]<br>[[Music]]<br>[[Online|Online Business]]<br>[[Sony Playstation]]
| services = [[Financial services]]
| market cap = [[United States Dollar|US$]] 40.56 Billion (''2008'')
| revenue = {{profit}} [[United States Dollar|US$]] 88.714 Billion (''2008'')<ref name="2007 Q4">{{citeweb|url=http://www.sony.net/SonyInfo/IR/financial/fr/07q4_sony.pdf|title=Sony Corporation Earnings release for the fiscal year ended March 31, 2008|format=PDF}}</ref>
| operating_income = {{profit}} [[United States Dollar|US$]] 3.745 Billion (''2008'')
| net_income = {{profit}} [[United States Dollar|US$]] 3.694 Billion (''2008'')
| assets = {{increase}} [[United States Dollar|US$]] 117.603 Billion (''2008'')
| equity = {{increase}} [[United States Dollar|US$]] 32.465 Billion (''2008'')
| num_employees = 180,500 (as of [[March 31]] [[2008]]) <ref name="sonycorpinfo"/>
| parent =
| subsid = [[Sony Corporation shareholders and subsidiaries|List of the subsidiaries]]
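To make the difficulty concrete, here is a naive attempt at pulling relations out of infobox values like the ones above with a regular expression. This is only a sketch under simple assumptions (no nested links, no templates inside links); `LINK_RE` and `link_targets` are names made up for this example and are not part of WEX.

```python
import re

# Matches [[target]] or [[target|label]]; nested brackets are not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def link_targets(value):
    """Return the target page of every wiki link found in `value`."""
    return [m.group(1).strip() for m in LINK_RE.finditer(value)]


# Values copied from the Sony infobox above.
founder = "[[Masaru Ibuka]]<br>[[Akio Morita]]"
company_type = "[[Public company|Public]]<br>({{Tyo|6758}})<br>({{nyse|SNE}})"
logo = "[[Image:Sony logo.svg|220px|]]"

print(link_targets(founder))       # ['Masaru Ibuka', 'Akio Morita']
print(link_targets(company_type))  # ['Public company']
print(link_targets(logo))          # ['Image:Sony logo.svg']
```

The founders come out cleanly, but the stock tickers hidden in `{{Tyo|...}}` and `{{nyse|...}}` are missed entirely and the logo is reported as if it were a linked page. Every infobox field needs its own special cases, which is the motivation for normalizing the markup once, up front.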
8. Problems With Wikipedia Data
•Wikipedia is huge!
•2,150,000 articles
•7,100,000 category references
•found in 280,000 categories
•54,029 non-trivial templates (>= 5 uses)
•50,671,533 Template name-value properties
•Wikipedia is growing!
•Grows 2% a week
•25,170 new articles
•39,571 new redirects
•8,000 deletes
•5,000 name changes
•1,000 article splits
•1,000 id changes
•Before this talk is over, there will be 150 NEW articles!
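The closing bullet can be checked with quick arithmetic, assuming the 25,170 new articles are a per-week count and that the talk runs about an hour (neither is stated on the slide):

```python
HOURS_PER_WEEK = 7 * 24         # 168 hours
new_articles_per_week = 25170   # from the slide; assumed to be a weekly figure

per_hour = new_articles_per_week / HOURS_PER_WEEK
print(round(per_hour))  # 150
```

Under those assumptions the rate works out to roughly 150 new articles per hour, which matches the slide.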
9. The Freebase Wikipedia Extraction (WEX)
[Diagram: current Wikipedia dumps (wiki markup) pass through the Wiki2XML parser (Magnus Manske) to produce XML article dumps, which are loaded into a big Postgres database holding articles, sections, templates, text, redirects, categories, and Freebase mappings.]
10. WEX Article XML
<template name="Infobox_President">
<param name="name">Abraham Lincoln</param>
<param name="nationality">American</param>
<param name="image">Abraham Lincoln head on shoulders photo portrait.jpg</param>
<param name="order">16th<space/><link>
<target>President of the United States</target></link></param>
<param name="term_start"><link><target>March 4</target></link>,
<space/><link><target>1861</target></link></param>
<param name="term_end"><link><target>April 15</target></link>,
<space/><link><target>1865</target></link></param>
<param name="predecessor"><link><target>James Buchanan</target></link></param>
<param name="successor"><link><target>Andrew Johnson</target></link></param>
....
</template>
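Once the markup is normalized like this, simple path expressions are enough to get at individual properties. The talk runs its queries through Postgres XPath; the sketch below instead uses Python's standard `xml.etree.ElementTree`, which supports a small XPath subset, on a trimmed copy of the template above. The helpers `flatten` and `param_links` are hypothetical, and treating `<space/>` as a single space is an assumption based on how it appears in the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Infobox_President template shown above.
wex_xml = """
<template name="Infobox_President">
  <param name="name">Abraham Lincoln</param>
  <param name="order">16th<space/><link><target>President of the United States</target></link></param>
  <param name="predecessor"><link><target>James Buchanan</target></link></param>
  <param name="successor"><link><target>Andrew Johnson</target></link></param>
</template>
"""

template = ET.fromstring(wex_xml.strip())


def flatten(elem):
    """Concatenate the text under `elem`, rendering <space/> as one space (assumption)."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(" " if child.tag == "space" else flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)


def param_links(template, name):
    """Return the link targets inside the named <param>."""
    path = "param[@name='%s']/link/target" % name
    return [target.text for target in template.findall(path)]


print(template.get("name"))                               # Infobox_President
print(flatten(template.find("param[@name='order']")))     # 16th President of the United States
print(param_links(template, "successor"))                 # ['Andrew Johnson']
```

Compared with raw wiki markup, link targets are now explicit elements, so relations such as predecessor and successor can be read off without any regular expressions.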
19.
category_list = ['Category:Ninjas',
                 'Category:Pirates',
                 'Category:Assassins',
                 'Category:Apple Inc. employees',
                 'Category:Microsoft employees',
                 'Category:Google employees',
                 'Category:Free software programmers',
                 'Category:Computer programmers']

print("Training classifiers...")

# Get the members of every category.
# `cur` is a database cursor opened elsewhere in the talk's code.
name_classes = {}
for category in category_list:
    members = get_category_members(cur, category)
    print(str(len(members)) + " examples for " + category)
    for name in members:
        # Map each article name to the set of categories it belongs to
        name_classes.setdefault(name, set()).add(category)
20.
def get_category_members(cur, category):
    # Breadth-first walk over the category and its subcategories.
    # Stops once about 500 article names are collected; the limit is
    # only checked between categories, so the result may exceed 500.
    records = []
    queue = [category]
    recordsSeen = set()
    while len(queue) > 0 and len(records) < 500:
        currentCategory = queue.pop(0)
        cur.execute("select articles.wpid, articles.name " +
                    "from wikipedia.category_members, wikipedia.articles " +
                    "where category_members.category_name like %s and " +
                    "articles.wpid=category_members.article_wpid",
                    (currentCategory,))
        result = cur.fetchall()
        for wpid, name in result:
            if wpid not in recordsSeen:
                recordsSeen.add(wpid)
                if name.startswith("Category:"):
                    # Subcategory: visit it later
                    queue.append(name)
                else:
                    records.append(name)
    return records
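The traversal logic is easier to see without the database. Below is a minimal in-memory sketch of the same breadth-first walk; `collect_members` and the toy `children` mapping are hypothetical stand-ins for the `wikipedia.category_members` query, and the memberships are made up for illustration.

```python
from collections import deque


def collect_members(root, children, limit=500):
    """Breadth-first walk of a category graph, collecting article names.

    `children` maps a category name to its member names (toy stand-in for
    the SQL query). Like the database version, the limit is only checked
    between categories, so slightly more than `limit` names may come back.
    """
    records = []
    seen = {root}
    queue = deque([root])
    while queue and len(records) < limit:
        category = queue.popleft()
        for name in children.get(category, []):
            if name in seen:
                continue  # already visited; also guards against category cycles
            seen.add(name)
            if name.startswith("Category:"):
                queue.append(name)
            else:
                records.append(name)
    return records


# Made-up category graph containing a cycle back to the root.
children = {
    "Category:Pirates": ["Category:Fictional pirates", "Henri Caesar"],
    "Category:Fictional pirates": ["Long John Silver", "Jack Sparrow",
                                   "Category:Pirates"],
}

print(collect_members("Category:Pirates", children))
# ['Henri Caesar', 'Long John Silver', 'Jack Sparrow']
```

The seen-set matters on real data: Wikipedia categories are not a strict tree, so without it a cycle would keep the queue from ever emptying.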
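21.
# `classifiers` is a list of (category, classifier) pairs and `getwords`
# a tokenizer; both are defined on slides not reproduced here.
for name in name_classes:
    row = get_article_text(cur, name)
    if row is None:
        continue  # no article text found for this name
    name, text = row
    # Train on the first 1024 characters of each article only
    words = getwords(text[0:1024])
    for cat, cl in classifiers:
        # Positive example for the article's own categories, negative otherwise
        if cat in name_classes[name]:
            cl.train(words, 1)
        else:
            cl.train(words, 0)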
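The classifier and tokenizer used in this loop are not shown in this part of the deck. As a rough stand-in, here is a minimal naive Bayes sketch exposing the same `train`, `prob`, and `classify` calls. Everything in it is an assumption: the class name `NaiveBayes`, this version of `getwords`, and the add-one smoothing were chosen for illustration and may differ from what the talk actually used.

```python
import math
import re


def getwords(text):
    """Hypothetical tokenizer: distinct lowercase words of 3 to 19 letters."""
    return set(w for w in re.findall(r"[a-z]+", text.lower()) if 2 < len(w) < 20)


class NaiveBayes(object):
    """Two-class (0/1) naive Bayes over word presence, with add-one smoothing."""

    def __init__(self):
        self.word_counts = {0: {}, 1: {}}  # label -> word -> documents containing it
        self.doc_counts = {0: 0, 1: 0}     # label -> number of training documents

    def train(self, words, label):
        self.doc_counts[label] += 1
        for w in words:
            self.word_counts[label][w] = self.word_counts[label].get(w, 0) + 1

    def prob(self, words, label):
        """Unnormalized score: P(label) times the product of P(word | label)."""
        total = self.doc_counts[0] + self.doc_counts[1]
        if total == 0:
            return 0.0
        n = self.doc_counts[label]
        log_p = math.log((n + 1.0) / (total + 2.0))
        for w in words:
            log_p += math.log((self.word_counts[label].get(w, 0) + 1.0) / (n + 2.0))
        return math.exp(log_p)

    def classify(self, words):
        return 1 if self.prob(words, 1) > self.prob(words, 0) else 0


# One classifier per category, in the (category, classifier) shape the loop expects.
classifiers = [(cat, NaiveBayes()) for cat in ["Category:Pirates"]]

cat, cl = classifiers[0]
cl.train(getwords("pirate ship captain treasure"), 1)
cl.train(getwords("software programmer python compiler"), 0)
print(cl.classify(getwords("the pirate captain")))  # 1
print(cl.classify(getwords("python compiler")))     # 0
```

Training one binary classifier per category, rather than a single multi-class model, lets an article belong to several categories at once, which fits how `name_classes` stores a set of categories per name.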
22.
def get_article_text(cur, name):
    # Returns a (name, text) row, or None if no article has that name
    cur.execute("select name, text from " +
                "wikipedia.articles where name=%s", (name,))
    return cur.fetchone()
23.
# Test set:
test_set = ["Henri Caesar", "Long John Silver", "Jack Sparrow",
            "Storm Shadow (G.I. Joe)", "Leonardo (TMNT)",
            "Bill Gates", "Steve Jobs", "Richard Stallman",
            "Larry Page", "Guido van Rossum", "Larry Wall",
            "Jerry Yang"]

# Run tests:
for testName in test_set:
    name, text = get_article_text(cur, testName)
    print(name)
    words = getwords(text[0:1024])
    for cat, cl in classifiers:
        py, pn = cl.prob(words, 1), cl.prob(words, 0)
        # Print category, predicted label, and the ratio of the "in category"
        # score to the "not in category" score (100 when pn is zero)
        print('%s\t%s\t%f' % (cat, cl.classify(words),
                              py / pn if pn > 0 else 100))