Using Linked Data to Mine RDF from Wikipedia's Tables
1. Using Linked Data to Mine RDF from Wikipedia’s Tables
http://emunoz.org/wikitables
Emir Muñoz
Fujitsu (Ireland) Limited
National University of Ireland Galway
Joint work with A. Hogan and A. Mileo
WSDM 2014 @ New York City, February 24-28
2. Emir M. - WSDM, New York City, USA, 27th February, 2014 2
MOTIVATION
(1/10)
3. MOTIVATION
The tables embedded in Wikipedia articles contain rich,
semi-structured encyclopaedic content
… BUT we cannot query all that content…
A query example:
(2/10)
Wikipedia tables or tables in the body are ignored
[Borrowed from Entity Linking tutorial]
4. Results as of 25-02-2014
5. First result
6. Second result: 10 airlines
7. Third result: 19 airlines
8. • Same query in SPARQL over DBpedia
MOTIVATION
SELECT ?p ?o WHERE
{ <http://dbpedia.org/resource/Airbus_A380> ?p ?o . }
FAIL
(7/10)
9.
10. No evidence of A380
11. • We automatically extract facts (as RDF)
from Wikipedia tables using knowledge bases (KBs)
MOTIVATION
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
(10/10)
12. • As far as we know, DBpedia and YAGO
ignore tables in the article body
– Mainly focused on info-boxes
• Languages such as R2RML can express
custom mappings from relational database
tables to RDF
– Each row as a subject, each column as a
predicate and each cell as an object
– Needs a mapping definition
EXTRACTING RDF FROM TABLES (1/4)
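The row/column/cell convention above can be sketched in a few lines. This is a plain-Python illustration of the idea, not R2RML syntax: the `header` list stands in for the mapping definition, and the `ex:` names and the `rows_to_triples` helper are hypothetical.

```python
def rows_to_triples(header, rows, subject_column=0):
    """Map each row to a subject, each column to a predicate, each cell to an object."""
    triples = []
    for row in rows:
        subject = row[subject_column]
        for col_index, cell in enumerate(row):
            # Skip the subject column itself and empty cells.
            if col_index == subject_column or cell == "":
                continue
            triples.append((subject, header[col_index], cell))
    return triples


# Hypothetical mapping definition: one predicate per column.
header = ["ex:player", "ex:position", "ex:club"]
rows = [["ex:David_de_Gea", "ex:Goalkeeper", "ex:Manchester_United_F.C."]]
triples = rows_to_triples(header, rows)
```

The point of the sketch is the last bullet: without a hand-written `header` mapping, nothing tells the code which predicate each column stands for.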
13. • [Limaye et al. 2010; Mulwad et al. 2010, 2013]
presented approaches using an in-house KB and
small datasets for validation
– Entity recognition/disambiguation
– Determine types for each column
– Determine relationships between columns
• We focus on Wikipedia tables, running our
algorithms over the entire corpus with
“row-centric” features for Machine
Learning models
EXTRACTING RDF FROM TABLES (2/4)
14. EXTRACTING RDF FROM TABLES
• Extraction of two types of relationships
– Between the main entity and the cells in the same column,
e.g., “Manchester United F.C.” and “David de Gea”
– Between entities in different columns but same row
(3/4)
dbp:currentClub
dbp:position
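The two relationship types can be read as two ways of pairing entities. A minimal sketch, assuming each cell has already been resolved to at most one entity identifier (`None` when the cell has no link); the `entity_pairs` helper and the `dbr:` identifiers are illustrative, not taken from the authors' code.

```python
from itertools import combinations


def entity_pairs(protagonist, rows):
    """List entity pairs to check: (article entity, cell) and (cell, cell) in the same row."""
    pairs = []
    for row in rows:
        entities = [cell for cell in row if cell is not None]
        # Type 1: the article's main entity paired with every entity in the row.
        for entity in entities:
            pairs.append((protagonist, entity))
        # Type 2: entities from different columns of the same row.
        pairs.extend(combinations(entities, 2))
    return pairs


pairs = entity_pairs(
    "dbr:Manchester_United_F.C.",
    [["dbr:David_de_Gea", "dbr:Goalkeeper", None]],
)
```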
15. EXTRACTING RDF FROM TABLES (4/4)
16. • Wikipedia dump from February 13th 2013
• Table taxonomy
WIKITABLES SURVEY (1/2)
1.14 million tables
17. • Table model
– Input: a source of tables (a set of tables)
• E.g., a Wikipedia article
• Each table in that source is modeled as
a matrix
• We normalize the tables and convert
each HTML table into a matrix
WIKITABLES SURVEY (2/2)
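A sketch of what converting an HTML table into a matrix might look like. The slides do not say how normalization is done; this version assumes it means expanding `colspan` and `rowspan` so every row ends up with the same number of columns. The class and function names are hypothetical, and only the standard-library `html.parser` is used.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect (text, colspan, rowspan) for every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None      # text fragments of the cell being read
        self._spans = (1, 1)

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._spans = (int(a.get("colspan") or 1), int(a.get("rowspan") or 1))
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            self.rows[-1].append((text,) + self._spans)
            self._cell = None


def html_table_to_matrix(html):
    """Expand spanning cells so the table becomes a rectangular list of lists."""
    parser = TableCells()
    parser.feed(html)
    grid = {}
    for r, row in enumerate(parser.rows):
        c = 0
        for text, colspan, rowspan in row:
            while (r, c) in grid:          # skip slots filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


matrix = html_table_to_matrix(
    "<table><tr><th>Club</th><th>Player</th></tr>"
    "<tr><td rowspan='2'>A</td><td>x</td></tr>"
    "<tr><td>y</td></tr></table>"
)
```

Copying the spanned value into every slot it covers keeps each row self-contained, which suits row-centric processing; nested tables are not handled here.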
18. • To extract RDF from Wikitables we rely on
a reference knowledge base
– DBpedia, version 3.8
MINING RDF FROM WIKITABLES
Wikipedia table
→ Extract links in the cells
→ Mapping links to DBpedia
→ Lookups on DBpedia to find relationships between entities in the same row
→ Candidate relationships
(1/6)
19. • We aim to discover:
– Relations between entities on the same row
– Relations between entities in the table and the
protagonist of the article
• Map the links inside the cells to RDF
resources
• Get candidate relationships from the KB
MINING RDF FROM WIKITABLES
SELECT DISTINCT ?p1 ?p2
WHERE { { <e1> ?p1 <e2> } UNION { <e2> ?p2 <e1> } }
(2/6)
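The lookup step can be sketched as follows. `candidate_query` only fills the query template shown on this slide for a given entity pair; `candidate_predicates` does the same lookup over an in-memory list of triples so the logic can be tried without a SPARQL endpoint. Both helpers and the one-triple toy KB are assumptions for illustration.

```python
def candidate_query(e1, e2):
    """Build the SPARQL query asking for predicates linking e1 and e2 in either direction."""
    return (
        "SELECT DISTINCT ?p1 ?p2\n"
        f"WHERE {{ {{ <{e1}> ?p1 <{e2}> }} UNION {{ <{e2}> ?p2 <{e1}> }} }}"
    )


def candidate_predicates(kb, e1, e2):
    """Same lookup over an iterable of (subject, predicate, object) triples."""
    forward = {p for s, p, o in kb if s == e1 and o == e2}   # plays the role of ?p1
    backward = {p for s, p, o in kb if s == e2 and o == e1}  # plays the role of ?p2
    return forward, backward


# Toy KB for illustration only.
kb = [("dbr:David_de_Gea", "dbp:currentClub", "dbr:Manchester_United_F.C.")]
forward, backward = candidate_predicates(
    kb, "dbr:David_de_Gea", "dbr:Manchester_United_F.C."
)
query = candidate_query(
    "http://dbpedia.org/resource/David_de_Gea",
    "http://dbpedia.org/resource/Manchester_United_F.C.",
)
```

Keeping the two directions apart matters because a predicate that holds from e1 to e2 says nothing about the reverse direction.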
20. • We detected some weak relationships
• … We need more filtering for relationships
MINING RDF FROM WIKITABLES
dbp:currentClub
dbp:youthClubs
(3/6)
21. • Features at different levels used to train
Machine Learning models
• Article features (e.g., # of tables)
• Table features (e.g., #rows, #columns, ratios)
• Cell features (e.g., # of entities, string length, has
format)
• Column features (e.g., # of entities, # of unique
entities)
• Predicate/Column features (e.g., string similarity, # of
rows where relation holds)
• Predicate features (e.g., triple count, count unique)
• Triple features (e.g., is the table from article or body)
MINING RDF FROM WIKITABLES (4/6)
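Two of the predicate/column features can be illustrated concretely. The slides name the features but not their exact definitions, so the code below is one plausible reading: a count of rows in which a candidate relation already holds in the KB, and a string similarity between a column header and the predicate's local name. `difflib.SequenceMatcher` is a stand-in for whatever similarity measure was actually used, and both function names are hypothetical.

```python
from difflib import SequenceMatcher


def rows_where_relation_holds(kb, rows, col_a, col_b, predicate):
    """Count rows whose (col_a, predicate, col_b) triple is already in the KB."""
    return sum(1 for row in rows if (row[col_a], predicate, row[col_b]) in kb)


def header_predicate_similarity(header, predicate):
    """Similarity in [0, 1] between a column header and the predicate's local name."""
    local_name = predicate.split(":")[-1].lower()
    return SequenceMatcher(None, header.lower(), local_name).ratio()


# Toy data for illustration only.
kb = {("a", "dbp:position", "x"), ("c", "dbp:position", "z")}
rows = [["a", "x"], ["b", "y"], ["c", "z"]]
support = rows_where_relation_holds(kb, rows, 0, 1, "dbp:position")
similarity = header_predicate_similarity("Position", "dbp:position")
```

A relation that holds in many rows and whose name resembles the column header is a stronger candidate for the rows where the KB has no triple yet.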
22. • The experimentation set-up
– Wikipedia dump from February 2013
– DBpedia dump version 3.8
– 8 machines (ca. 2005) with 4GB of RAM,
2.2GHz single-core processors
• After 12 days we got 34.9 million unique
triples not in DBpedia
• We manually annotated a sample of 750
triples to train the ML models
MINING RDF FROM WIKITABLES (5/6)
23. MINING RDF FROM WIKITABLES (6/6)
            Bagging DT   Simple Logistic   SVM
accuracy    78.1%        78.53%            72.6%
precision   81.5%        79.62%            72.4%
recall      77.4%        79.01%            75.8%
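For reference, the three measures in the table are the standard ones for a binary "correct / incorrect triple" classifier. The sketch below computes them from gold and predicted labels; the label lists are invented for illustration and do not reproduce the reported numbers.

```python
def binary_metrics(gold, predicted):
    """Return (accuracy, precision, recall) for two equal-length lists of 0/1 labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Invented labels: 1 = triple judged correct, 0 = incorrect.
gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
accuracy, precision, recall = binary_metrics(gold, predicted)
```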
24. • In this work we aimed to
– Interpret the semantics of tables using KBs
– Enrich KBs with new facts mined from tables
• With the best model we got 7.9 million
unique novel triples
• We still don’t
– Consider literals/string values in the cells
– Exploit the domain/range of predicates
– Test other KBs like Freebase and YAGO
CONCLUSION
25. • Most of the related papers use some
knowledge base, such as DBpedia
– They can benefit from new RDF triples
extracted from Wikipedia tables
• We can use the similarity proposed in
Knowledge-based graph document modeling, by
Schuhmacher and Ponzetto, to improve the
relation extraction
• And use the paper Trust, but Verify: Predicting
Contribution Quality for Knowledge Base Construction
and Curation, by Chun How Tan et al., to assess the
quality of the output triples
CONTRAST WITH OTHER PAPERS