This document describes a Mid-Ontology learning approach for integrating ontology schemas from different linked data sources. It collects data from instances that are linked by owl:sameAs. Predicates are grouped by exact matching of their objects, and the resulting groups are pruned using string-based and knowledge-based similarity measures. The approach aims to automatically learn a simple ontology that can represent data from diverse domains in linked open data.
Mid-Ontology Learning from Linked Data @JIST2011
1. Inter-University Research Institute Corporation, Research Organization of Information and Systems
National Institute of Informatics
Mid-Ontology Learning from Linked Data
Lihua Zhao and Ryutaro Ichise
JIST2011, 12.05.2011, Hangzhou
2. Outline
Introduction
Mid-Ontology Learning Approach
Experimental Evaluation
Related Work
Conclusion and Future Work
3. Introduction
Linked Open Data
295 data sets, 31 billion RDF triples (as of Sep. 2011)
7 domains (cross-domain, geographic, media, life sciences,
government, user-generated content, and publications)
Interlinked Instances (owl:sameAs)
4. Introduction
Challenging Problem
Each data set has its own ontology schema
DBpedia: http://dbpedia.org/property/population
Geonames: http://www.geonames.org/ontology#population
Learning all of the ontology schemas is time-consuming
DBpedia: 320 classes and thousands of properties
Ontology schemas are heterogeneous
http://dbpedia.org/property/populationTotal
http://dbpedia.org/property/population
5. Introduction
Objective
Collected data based on “http://dbpedia.org/resource/Berlin”.
Predicate                                         Object
http://dbpedia.org/property/name                  Berlin
http://dbpedia.org/property/population            3439100
http://dbpedia.org/property/plz                   10001-14199
http://dbpedia.org/ontology/postalCode            10001-14199
http://dbpedia.org/ontology/populationTotal       3439100
......                                            ......
http://www.geonames.org/ontology#alternateName    Berlin
http://www.geonames.org/ontology#alternateName    Berlyn@af
http://www.geonames.org/ontology#population       3426354
......                                            ......
http://www.w3.org/2004/02/skos/core#prefLabel     Berlin (Germany)
http://data.nytimes.com/elements/first_use        2004-09-12
http://data.nytimes.com/elements/latest_use       2010-06-13
6. Introduction
Simple ontology for various data sets: Mid-Ontology
Investigation on linked instances
owl:sameAs links identical or related instances
Scale down the data set
Automatic ontology learning
Integrate ontologies from diverse domain data sets
Automate the ontology construction process
Adapt to linked open data sets
7. Mid-Ontology Learning Approach
8. Data Collection
We scale down the data sets by collecting only linked instances,
from which we can extract related information.
Extract data linked with owl:sameAs
Select a core data set (inward & outward links)
Collect all instances that have owl:sameAs
Remove noisy instances of the core data set
Noisy instances: without any meaningful triple
Collect predicates and objects
collect <predicate, object> (PO) pairs from collected instances
collect PO pairs from linked instances (other data sets)
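The collection steps above can be sketched as follows. This is a minimal illustration over an in-memory list of triples, not the authors' implementation (which queried a Virtuoso SPARQL endpoint). The prefixed names, the toy triples, and the function name `collect_po_pairs` are assumptions, and "noisy" is simplified here to mean an instance with no triple other than owl:sameAs.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"

# Toy triples for illustration; a real run would query a SPARQL endpoint.
TRIPLES = [
    ("dbpedia:Berlin", SAME_AS, "geo:2950159"),
    ("nyt:N50987186835223032381", SAME_AS, "dbpedia:Berlin"),
    ("dbpedia:Berlin", "db-prop:name", "Berlin"),
    ("dbpedia:Berlin", "db-prop:population", "3439100"),
    ("geo:2950159", "geo-onto:population", "3426354"),
    ("nyt:N50987186835223032381", "skos:prefLabel", "Berlin (Germany)"),
    ("dbpedia:Empty", SAME_AS, "geo:0"),              # noisy: only owl:sameAs
    ("dbpedia:Unlinked", "db-prop:name", "Nowhere"),  # not linked at all
]


def collect_po_pairs(triples, core_prefix):
    """Return {core instance: set of (predicate, object)} for linked instances."""
    po = defaultdict(set)      # instance -> PO pairs (owl:sameAs excluded)
    linked = defaultdict(set)  # core instance -> instances linked by owl:sameAs
    for s, p, o in triples:
        if p == SAME_AS:
            # Follow both outward and inward links of the core data set.
            if s.startswith(core_prefix):
                linked[s].add(o)
            if o.startswith(core_prefix):
                linked[o].add(s)
        else:
            po[s].add((p, o))

    collected = {}
    for core, others in linked.items():
        if not po[core]:
            continue  # simplified noise filter: nothing besides owl:sameAs
        pairs = set(po[core])
        for other in others:
            pairs |= po[other]  # add PO pairs from the other data sets
        collected[core] = pairs
    return collected


collected = collect_po_pairs(TRIPLES, "dbpedia:")
```

Only `dbpedia:Berlin` survives in this toy input: the unlinked instance is never collected, and the instance that has nothing but an owl:sameAs link is dropped as noise.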
9. An Example of Collected Data
dbpedia:Berlin owl:sameAs http://sws.geonames.org/2950159/
http://data.nytimes.com/N50987186835223032381 owl:sameAs dbpedia:Berlin
10. Mid-Ontology Learning Approach
11. Predicate Grouping
We group related predicates from different ontology schemas, because many similar or related predicates actually refer to the same thing.
Group predicates by exact matching
Prune groups by similarity matching
Refine groups using extracted relations
12. Predicate Grouping
Group predicates by exact matching
One predicate may have various objects
Different predicates may have the same object value
13. Group Predicates by Exact Matching
Create initial groups (Gi) of PO pairs
e.g. Gi.predicates = { db-prop:name, geo-onto:alternateName }
     Gi.objects = { Berlin, Berlyn@af }
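One way to read the exact-matching step is that two PO pairs land in the same initial group when they share a predicate or an object, directly or through a chain of such matches. The sketch below follows that reading for the PO pairs of a single instance. The function name and the prefixed predicate names are illustrative, and the slides do not confirm that grouping is transitive in this way.

```python
def group_by_exact_matching(po_pairs):
    """Group PO pairs that are connected by an identical predicate or object."""
    groups = []  # each group: {"predicates": set, "objects": set}
    for pred, obj in po_pairs:
        merged = {"predicates": {pred}, "objects": {obj}}
        rest = []
        for g in groups:
            if pred in g["predicates"] or obj in g["objects"]:
                # The new pair bridges into this group, so absorb it.
                merged["predicates"] |= g["predicates"]
                merged["objects"] |= g["objects"]
            else:
                rest.append(g)
        groups = rest + [merged]
    return groups


BERLIN_PAIRS = [
    ("db-prop:name", "Berlin"),
    ("db-prop:population", "3439100"),
    ("db-prop:plz", "10001-14199"),
    ("db-onto:postalCode", "10001-14199"),
    ("db-onto:populationTotal", "3439100"),
    ("geo-onto:alternateName", "Berlin"),
    ("geo-onto:alternateName", "Berlyn@af"),
    ("geo-onto:population", "3426354"),
]

groups = group_by_exact_matching(BERLIN_PAIRS)
```

With the Berlin data, the name group matches the example on the slide, while `geo-onto:population` stays in a group of its own because its value 3426354 matches no other object exactly.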
14. Predicate Grouping
Prune groups by similarity matching
Exact matching may ignore
Terms of predicates or objects written in different languages
Semantically identical or related predicates
15. Prune Groups by Similarity Matching
Ontology similarity matching at the concept level
String-based similarity measure: StrSim(O(Gi), O(Gj))
O(Gi): objects in Gi
Measures: prefix, suffix, Levenshtein distance, and n-gram
Knowledge-based similarity measure: WNSim(T(Gi), T(Gj))
T(Gi): pre-processed terms of predicates in Gi
Natural language processing: tokenizing terms, removing stop words, and stemming
WordNet-based similarity measures: LCH, RES, HSO, JCN, LESK, PATH, WUP, LIN, and VECTOR
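The pre-processing that produces T(Gi) might look roughly like the sketch below, which splits the local name of a predicate URI into lower-cased terms and drops stop words. The stop-word list and the function name are assumptions made for illustration, and stemming is left out because the slides do not name a stemmer.

```python
import re

# Illustrative stop-word list; the list actually used is not given in the slides.
STOP_WORDS = {"has", "is", "of", "the", "a", "an", "in"}


def predicate_terms(predicate_uri):
    """Split the local name of a predicate URI into lower-cased terms."""
    # Keep only what follows the last '/', '#', or ':'.
    local_name = re.split(r"[/#:]", predicate_uri)[-1]
    # Tokenize camelCase words, acronyms, and digit runs.
    tokens = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", local_name)
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]


terms = predicate_terms("http://dbpedia.org/ontology/populationTotal")
```

Under these assumptions, `populationTotal` yields the terms `population` and `total`, which are the terms compared in the WordNet-based example of this deck.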
16. Prune Groups by Similarity Matching
Similarity between initial groups {G1, G2, ..., Gk}
$$\mathrm{Sim}(G_i, G_j) = \frac{\mathrm{StrSim}(O(G_i), O(G_j)) + \mathrm{WNSim}(T(G_i), T(G_j))}{2}$$
Prune initial groups Gi
If Sim(Gi, Gj) is higher than the predefined similarity threshold, we merge Gi and Gj.
If an initial group Gi has not been merged and has only one PO pair, we remove Gi.
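The two pruning rules above can be sketched as a greedy single pass over the initial groups. The pass order, the function names, and the stand-in similarity function `toy_sim` are assumptions; the slides do not specify how candidate pairs are enumerated, and a real run would plug in the averaged StrSim and WNSim score.

```python
def prune_groups(groups, sim, threshold):
    """Merge groups whose similarity exceeds threshold; drop unmerged singletons.

    groups: list of sets of (predicate, object) pairs
    sim:    function taking two groups and returning a score in [0, 1]
    """
    merged_groups = []
    for group in groups:
        for entry in merged_groups:
            if sim(entry["pairs"], group) > threshold:
                entry["pairs"] |= group
                entry["merged"] = True
                break
        else:
            # No sufficiently similar group found: start a new one.
            merged_groups.append({"pairs": set(group), "merged": False})
    # Rule 2: remove groups that were never merged and hold a single PO pair.
    return [e["pairs"] for e in merged_groups
            if e["merged"] or len(e["pairs"]) > 1]


def toy_sim(a, b):
    """Stand-in for Sim(Gi, Gj): 1.0 if both groups mention 'population'."""
    def mentions(group):
        return any("population" in pred.lower() for pred, _ in group)
    return 1.0 if mentions(a) and mentions(b) else 0.0


initial = [
    {("db-prop:population", "3439100"), ("db-onto:populationTotal", "3439100")},
    {("geo-onto:population", "3426354")},
    {("nyt:first_use", "2004-09-12")},
]
pruned = prune_groups(initial, toy_sim, threshold=0.5)
```

Here the Geonames population group is merged into the DBpedia one, and the lone `nyt:first_use` group is removed because it was never merged and holds a single PO pair.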
17. An Example of Similarity Calculation
Group   Predicate                                      Object
Gi      http://dbpedia.org/property/population         3439100
Gi      http://dbpedia.org/ontology/populationTotal    3439100
Gj      http://www.geonames.org/ontology#population    3426354
Example of string-based similarity measures on pairwise objects.
Pairwise objects         prefix   suffix   Levenshtein distance   n-gram
“3439100”, “3426354”     0.29     0        0                      0.29
Example of WordNet-based similarity measures on pairwise terms.
Pairwise terms            LCH   RES   HSO   JCN    LESK   PATH   WUP    LIN   VECTOR
population, population    1     1     1     1      1      1      1      1     1
population, total         0.4   0     0     0.06   0.03   0.11   0.33   0     0.06
$$\mathrm{Sim}(G_i, G_j) = \frac{0.145 + 0.5825}{2} = 0.36375$$
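The slide does not say how the individual scores are aggregated into 0.145 and 0.5825. The figures are reproduced if the four string-based scores are averaged, the non-zero WordNet scores are averaged for each term pair, and the term-pair averages are then averaged. That aggregation is inferred from the numbers and is an assumption, not something the source states. A short check:

```python
def average(values):
    return sum(values) / len(values)


def average_nonzero(values):
    """Average only the non-zero scores (assumed aggregation, see lead-in)."""
    nonzero = [v for v in values if v != 0]
    return average(nonzero) if nonzero else 0.0


# prefix, suffix, Levenshtein distance, n-gram for "3439100" vs "3426354"
str_scores = [0.29, 0.0, 0.0, 0.29]
# LCH, RES, HSO, JCN, LESK, PATH, WUP, LIN, VECTOR
wn_population_population = [1.0] * 9
wn_population_total = [0.4, 0.0, 0.0, 0.06, 0.03, 0.11, 0.33, 0.0, 0.06]

str_sim = average(str_scores)
wn_sim = average([
    average_nonzero(wn_population_population),
    average_nonzero(wn_population_total),
])
sim = (str_sim + wn_sim) / 2

print(round(str_sim, 3), round(wn_sim, 4), round(sim, 5))
```

Under this reading, StrSim is 0.145, WNSim is 0.5825, and Sim(Gi, Gj) is 0.36375, matching the slide.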
18. Predicate Grouping
Refine groups using extracted relations
Divide pruned groups according to rdfs:domain and rdfs:range.
Keep groups with high frequency
19. Mid-Ontology Learning Approach
20. Mid-Ontology Construction
Select terms for Mid-Ontology
Collect all the terms of predicates in each refined group Gi .
Collect all the pre-processed terms of P(Gi ) (predicates in Gi ).
Choose one term per group: the most frequent and the longest one.
e.g. “area” and “areaCode” are totally different
Construct Relations
mo-prop:hasMembers to link Mid-Ontology classes and integrated
predicates
Construct Mid-Ontology
Automatically construct Mid-Ontology using selected terms and
mo-prop:hasMembers.
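Term selection and relation construction could be sketched as below. The tie-breaking order (frequency first, then length) is one interpretation of "highest frequency and longest term", and the function names are hypothetical. The `mo-onto:` and `mo-prop:hasMembers` names are taken from the slides.

```python
from collections import Counter


def select_term(terms):
    """Pick the most frequent term; break ties by length (longest first)."""
    counts = Counter(terms)
    return max(counts, key=lambda t: (counts[t], len(t)))


def has_members_triples(group_terms, predicates):
    """Build (class, mo-prop:hasMembers, predicate) triples for one group."""
    cls = "mo-onto:" + select_term(group_terms)
    return [(cls, "mo-prop:hasMembers", p) for p in predicates]


population_predicates = [
    "http://dbpedia.org/property/population",
    "http://dbpedia.org/ontology/populationTotal",
    "http://www.geonames.org/ontology#population",
]
# Pre-processed terms of the three predicates above.
population_terms = ["population", "population", "total", "population"]

triples = has_members_triples(population_terms, population_predicates)
```

For this group the selected term is `population`, so each predicate is attached to the class `mo-onto:population` through `mo-prop:hasMembers`.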
21. Experimental Evaluation
Evaluate the Mid-Ontology approach from four different aspects:
Evaluation of Data Reduction
Evaluation of Ontology Quality
Evaluation with a SPARQL Example
Analysis of Mid-Ontology Approach
22. Implementation
Environment
Linux Ubuntu 10.10, 16GB Memory, 1 TB Disk
Core i7 CPU 880 3.07GHz
Java, Netbeans 6.9
Virtuoso
High-performance server for RDF storage
SPARQL query endpoint
WordNet::Similarity
Implemented in Perl
Knowledge-based similarity measures
23. Experimental Data
DBpedia: cross-domain, 3.5 million things, 8.9 million URIs
Geonames: geographical domain, 7 million URIs
NYTimes: media domain, 10,467 subject news
DBpedia is chosen as the core data set because of its wealth of inward and outward links to other data sets.
24. Evaluation of Data Reduction
Evaluate the effectiveness of data reduction during the data
collection phase by comparing the number of instances.
Number of distinct instances during the data collection phase.
Data set    Before reduction   owl:sameAs retrieval   Noisy data removal
DBpedia     8,955,728          135,749 (1.52%)        88,506 (0.99%)
Geonames    7,479,714          128,961 (1.72%)        82,054 (1.10%)
NYTimes     10,467             9,226 (88.14%)         8,535 (81.54%)
Evaluation Analysis
The data sets are dramatically scaled down by keeping only
linked instances that share related information.
Successfully removed noisy instances, which may affect the
quality of the Mid-Ontology.
e.g. Removed instances with only db-prop:hasPhotosCollection
(broken link) and owl:sameAs link.
25. Evaluation of Ontology Quality
Evaluate the quality of Mid-Ontology by validating whether
predicates in each class share related information.
Accuracy of Mid-Ontology
$$\mathrm{ACC}(MO) = \frac{1}{n} \sum_{i=1}^{n} \frac{|\text{Correct predicates in } C_i|}{|C_i|}$$
n: the number of classes
|Ci|: the number of predicates in class Ci
Cardinality
$$\mathrm{Cardinality} = \frac{|\text{Number of predicates}|}{|\text{Number of classes}|}$$
26. Evaluation of Ontology Quality
Improvement achieved by our approach
MO_no_p_r: with exact matching only (without the pruning and refining processes)
MO: with both pruning and refining processes
Ontology     Number of Classes   Number of Predicates   Cardinality   Accuracy
MO_no_p_r    11                  300                    27.27         68.78%
MO           29                  180                    6.21          90.10%
Evaluation Analysis
Significantly improved the accuracy
Decreased the cardinality (fewer predicates and more classes)
Successfully removed unrelated predicates
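The cardinality column can be recomputed from the class and predicate counts, taking cardinality as the number of predicates divided by the number of classes and accuracy as the mean per-class fraction of correct predicates. In the sketch below the function names are illustrative, and the two-class input to `accuracy` is hypothetical, since per-class correctness counts are not given in the slides.

```python
def cardinality(num_predicates, num_classes):
    """Average number of predicates per Mid-Ontology class."""
    return num_predicates / num_classes


def accuracy(classes):
    """classes: list of (correct_predicates, total_predicates), one per class."""
    return sum(correct / total for correct, total in classes) / len(classes)


card_no_p_r = round(cardinality(300, 11), 2)  # exact matching only
card_mo = round(cardinality(180, 29), 2)      # with pruning and refining

# Hypothetical ontology with two classes: 6 of 6 and 3 of 4 predicates correct.
example_acc = accuracy([(6, 6), (3, 4)])
```

The two cardinalities come out as 27.27 and 6.21, in agreement with the table.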
27. Evaluation with a SPARQL Example
Evaluate the effectiveness of information retrieval with the
Mid-Ontology constructed with our approach.
Predicates grouped in mo-onto:population.
<rdf:Description rdf:about="mid-onto:population">
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/population"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/popLatest"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/populationTotal"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/ontology/populationTotal"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/einwohner"/>
  <mo-prop:hasMembers rdf:resource="http://www.geonames.org/ontology#population"/>
</rdf:Description>
28. Evaluation with a SPARQL Example
SPARQL: Find places with a population of more than 10 million.
SELECT DISTINCT ?places
WHERE {
  mid-onto:population mo-prop:hasMembers ?prop .
  ?places ?prop ?population .
  FILTER (xsd:integer(?population) > 10000000)
}
Single property for population                 Number of Results
http://dbpedia.org/property/population         177
http://dbpedia.org/property/popLatest          1
http://dbpedia.org/property/populationTotal    107
http://dbpedia.org/ontology/populationTotal    129
http://dbpedia.org/property/einwohner          1
http://www.geonames.org/ontology#population    244
Evaluation Analysis
Found 517 places with mid-onto:population.
Each single predicate returns fewer results under the same condition.
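The effect of querying through the Mid-Ontology class can be imitated with plain sets: the combined query returns the distinct union of what each member predicate returns. The triples and `ex:` names below are hypothetical and are not results from the experiment.

```python
def places_over(triples, predicates, minimum):
    """Subjects having any of `predicates` with an integer value above minimum."""
    return {s for s, p, o in triples if p in predicates and int(o) > minimum}


# A subset of the member predicates, written with illustrative prefixes.
MEMBERS = {"db-prop:population", "db-onto:populationTotal", "geo-onto:population"}

# Hypothetical triples, not data from the evaluation.
TRIPLES = [
    ("ex:CityA", "db-prop:population", "12000000"),
    ("ex:CityB", "db-onto:populationTotal", "15000000"),
    ("ex:CityB", "geo-onto:population", "14900000"),
    ("ex:CityC", "geo-onto:population", "11000000"),
    ("ex:TownD", "db-prop:population", "50000"),
]

per_predicate = {p: places_over(TRIPLES, {p}, 10_000_000) for p in MEMBERS}
combined = places_over(TRIPLES, MEMBERS, 10_000_000)
```

In this toy data the combined query finds three places while no single predicate finds more than two. The same pattern fits the table: the per-predicate counts add up to 659, more than the 517 distinct places, which is what one would expect when some places are reachable through several predicates.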
29. Analysis of Mid-Ontology Approach
Analyze whether we can successfully identify how data sets are
connected.
Sample classes in the Mid-Ontology
DBpedia              DBpedia & Geonames    DBpedia & Geonames & NYTimes
mo-onto:birthdate    mo-onto:population    mo-onto:name
mo-onto:deathdate    mo-onto:prominence    mo-onto:long
mo-onto:motto        mo-onto:postal
Evaluation Analysis
Predicates in DBpedia are heterogeneous.
Linked instances between DBpedia and Geonames are about
places.
Linked instances among DBpedia, Geonames, and NYTimes
are about events, persons, or places.
30. Possible Application
Find missing owl:sameAs links
e.g. Find missing owl:sameAs link with mo-onto:population
http://dbpedia.org/resource/Cyclades db-prop:population “119549”
http://dbpedia.org/resource/Cyclades db-prop:name “Cyclades”
http://sws.geonames.org/259819/ geo-onto:population “119549”
http://sws.geonames.org/259819/ geo-onto:alternateName “Cyclades”
Add owl:sameAs link
http://dbpedia.org/resource/Cyclades owl:sameAs http://sws.geonames.org/259819/
http://sws.geonames.org/259819/ owl:sameAs http://dbpedia.org/resource/Cyclades
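A rough sketch of how Mid-Ontology classes might be used to propose missing owl:sameAs links: two instances become a candidate pair when they agree on values in several classes. The `min_shared` threshold, the class membership shown in `MEMBERS`, the function name, and the decoy instance are assumptions made for illustration. The Cyclades triples are taken from the slide.

```python
from collections import defaultdict

# Illustrative class membership; the full member lists are not reproduced here.
MEMBERS = {
    "mo-onto:population": {"db-prop:population", "geo-onto:population"},
    "mo-onto:name": {"db-prop:name", "geo-onto:alternateName"},
}

TRIPLES = [
    ("http://dbpedia.org/resource/Cyclades", "db-prop:population", "119549"),
    ("http://dbpedia.org/resource/Cyclades", "db-prop:name", "Cyclades"),
    ("http://sws.geonames.org/259819/", "geo-onto:population", "119549"),
    ("http://sws.geonames.org/259819/", "geo-onto:alternateName", "Cyclades"),
    # Hypothetical decoy: same population value, no matching name.
    ("http://sws.geonames.org/0000000/", "geo-onto:population", "119549"),
]


def candidate_same_as(triples, members, min_shared=2):
    """Pair instances that agree on values in at least `min_shared` classes."""
    pred_to_class = {p: c for c, preds in members.items() for p in preds}
    by_value = defaultdict(set)  # (class, value) -> instances
    for s, p, o in triples:
        if p in pred_to_class:
            by_value[(pred_to_class[p], o)].add(s)

    shared = defaultdict(set)  # (instance, instance) -> classes they agree on
    for (cls, _value), instances in by_value.items():
        ordered = sorted(instances)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)].add(cls)
    return [pair for pair, classes in shared.items() if len(classes) >= min_shared]


candidates = candidate_same_as(TRIPLES, MEMBERS)
```

Requiring agreement in more than one class is a design choice of this sketch: a single shared population value would also pair the decoy with Cyclades, whereas matching on both population and name leaves only the pair shown on the slide.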
32. Related Work
Construct intermediate-layer ontology from geospatial, zoology, and genetics data resources. [Parundekar, et al., 2010]
Limited to a specific domain
Construct intermediate-level ontology by enriching upper ontology (by adding new classes and properties). [Damova, et al., 2010]
Still too large
Analysis of basic properties of SameAs network, Pay-Level-Domain network and Class-Level Similarity network. [Ding, et al., 2010]
Only frequent types are considered to analyze how data are connected
33. Conclusion and Future Work
Conclusion
Learning every heterogeneous ontology schema in the linked open data sets is not feasible.
An automatic Mid-Ontology learning approach can solve the heterogeneity problem by integrating related predicates.
The Mid-Ontology has high accuracy and is effective for searching across various data sets.
A simple Mid-Ontology can be constructed without learning the entire ontology schema.
Future Work
Billion Triple Challenge (BTC) data set
Crawl links at two or three depths without a core data set
34. Questions?
Lihua Zhao, lihua@nii.ac.jp
Ryutaro Ichise, ichise@nii.ac.jp