Mid-Ontology Learning from Linked Data @JIST2011

Research Organization of Information and Systems (Inter-University Research Institute Corporation)
National Institute of Informatics

Mid-Ontology Learning from Linked Data

Lihua Zhao and Ryutaro Ichise
JIST2011, 12.05.2011, Hangzhou


Outline

Introduction

Mid-Ontology Learning Approach

Experimental Evaluation

Related Work

Conclusion and Future Work



Introduction
Linked Open Data
295 data sets, 31 billion RDF triples (as of Sep. 2011)
7 domains (cross-domain, geographic, media, life sciences,
government, user-generated content, and publications)
Interlinked Instances (owl:sameAs)



Introduction

Challenging Problem
Each data set has its own ontology schema
DBpedia: http://dbpedia.org/property/population
Geonames: http://www.geonames.org/ontology#population
Learning all the ontology schemas is time-consuming
DBpedia: 320 classes and thousands of properties
Ontology schemas are heterogeneous, even within one data set
http://dbpedia.org/property/populationTotal
http://dbpedia.org/property/population



Introduction

Objective

Collected data based on “http://dbpedia.org/resource/Berlin”.
Predicate                                       Object
http://dbpedia.org/property/name                Berlin
http://dbpedia.org/property/population          3439100
http://dbpedia.org/property/plz                 10001-14199
http://dbpedia.org/ontology/postalCode          10001-14199
http://dbpedia.org/ontology/populationTotal     3439100
......                                          ......
http://www.geonames.org/ontology#alternateName  Berlin
http://www.geonames.org/ontology#alternateName  Berlyn@af
http://www.geonames.org/ontology#population     3426354
......                                          ......
http://www.w3.org/2004/02/skos/core#prefLabel   Berlin (Germany)
http://data.nytimes.com/elements/first_use      2004-09-12
http://data.nytimes.com/elements/latest_use     2010-06-13



Introduction

Simple ontology for various data sets: Mid-Ontology
Investigation on linked instances
owl:sameAs links identical or related instances
Scale down the data set
Automatic ontology learning
Integrate ontologies from diverse domain data sets
Automate the ontology construction process
Adapt to linked open data sets



Data Collection

We scale down the data sets by collecting only linked instances,
from which we can extract related information.
Extract data linked with owl:sameAs
Select a core data set (inward & outward links)
Collect all instances that have owl:sameAs
Remove noisy instances of the core data set
Noisy instances: instances without any meaningful triple
Collect predicates and objects
Collect <predicate, object> (PO) pairs from the collected instances
Collect PO pairs from the linked instances in other data sets
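
The collection steps above can be sketched over an in-memory list of triples. This is only an illustration, not the authors' implementation (which works against a SPARQL endpoint). The function name, the `NOISE_PREDICATES` set, the shortened identifiers such as `geo:2950159`, and the `dbpedia:Empty` instance are assumptions made for the example.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"
# Assumption: predicates regarded as carrying no meaningful information.
NOISE_PREDICATES = {SAME_AS, "db-prop:hasPhotosCollection"}


def collect_po_pairs(triples, core_prefix):
    """Return {core instance: set of (predicate, object)} for linked instances."""
    by_subject = defaultdict(set)
    for s, p, o in triples:
        by_subject[s].add((p, o))

    # Step 1: core instances with outward or inward owl:sameAs links.
    links = defaultdict(set)
    for s, p, o in triples:
        if p != SAME_AS:
            continue
        if s.startswith(core_prefix):
            links[s].add(o)   # outward link
        elif o.startswith(core_prefix):
            links[o].add(s)   # inward link

    collected = {}
    for core, linked in links.items():
        own = {(p, o) for p, o in by_subject[core] if p not in NOISE_PREDICATES}
        # Step 2: drop noisy core instances (no meaningful triple).
        if not own:
            continue
        # Step 3: add the PO pairs of linked instances from other data sets.
        pairs = set(own)
        for other in linked:
            pairs |= {(p, o) for p, o in by_subject[other] if p != SAME_AS}
        collected[core] = pairs
    return collected


triples = [
    ("dbpedia:Berlin", "owl:sameAs", "geo:2950159"),
    ("nyt:N50987186835223032381", "owl:sameAs", "dbpedia:Berlin"),
    ("dbpedia:Berlin", "db-prop:population", "3439100"),
    ("geo:2950159", "geo-onto:population", "3426354"),
    ("nyt:N50987186835223032381", "skos:prefLabel", "Berlin (Germany)"),
    ("dbpedia:Empty", "owl:sameAs", "geo:1"),
    ("dbpedia:Empty", "db-prop:hasPhotosCollection", "broken"),
]
result = collect_po_pairs(triples, "dbpedia:")
```

Here `dbpedia:Empty` is dropped as noisy, and `dbpedia:Berlin` keeps its own PO pair plus those of its Geonames and NYTimes counterparts.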



An Example of Collected Data
dbpedia:Berlin owl:sameAs http://sws.geonames.org/2950159/
http://data.nytimes.com/N50987186835223032381 owl:sameAs dbpedia:Berlin

Predicate Object
...... ......
...... ......



Predicate Grouping

Group related predicates from different ontology schemas,
because many similar or related predicates actually refer to the
same thing.
Group predicates by exact matching
Prune groups by similarity matching
Refine groups using extracted relations



Predicate Grouping

One predicate may have various objects
Different predicates may have the same object value



Group Predicates by Exact Matching
Create initial groups (G_i) of PO pairs
e.g. G_i.predicates = { db-prop:name, geo-onto:alternateName }
G_i.objects = { Berlin, Berlyn@af }
Predicate Object
...... ......
...... ......
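
A minimal sketch of how such initial groups could be built. The rule used here (a PO pair joins a group when it shares a predicate or an object with that group) is an assumption inferred from the example above, and the function name is hypothetical.

```python
def group_by_exact_matching(po_pairs):
    """Group PO pairs that share an identical predicate or an identical object."""
    groups = []
    for pred, obj in po_pairs:
        merged = {"predicates": {pred}, "objects": {obj}}
        rest = []
        for g in groups:
            if pred in g["predicates"] or obj in g["objects"]:
                # The new pair connects to this group: absorb it.
                merged["predicates"] |= g["predicates"]
                merged["objects"] |= g["objects"]
            else:
                rest.append(g)
        groups = rest + [merged]
    return groups


pairs = [
    ("db-prop:name", "Berlin"),
    ("geo-onto:alternateName", "Berlin"),
    ("geo-onto:alternateName", "Berlyn@af"),
    ("db-prop:population", "3439100"),
]
groups = group_by_exact_matching(pairs)
```

With these four pairs, the first group contains db-prop:name and geo-onto:alternateName with the objects Berlin and Berlyn@af, and db-prop:population stays in a group of its own.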



Predicate Grouping

Exact matching may miss
Terms of predicates or objects written in different languages
Semantically identical or related predicates



Prune Groups by Similarity Matching

Ontology similarity matching at the concept level
String-based similarity measure: StrSim(O(G_i), O(G_j))
O(G_i): objects in G_i
Prefix, suffix, Levenshtein distance, and n-gram
Knowledge-based similarity measure: WNSim(T(G_i), T(G_j))
T(G_i): pre-processed terms of predicates in G_i
Natural language processing: tokenizing terms, removing stop words,
and stemming
WordNet-based similarity measures: LCH, RES, HSO, JCN, LESK,
PATH, WUP, LIN, and VECTOR
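
The pre-processing of predicate terms can be illustrated as follows. This is a rough sketch: the stop-word list is a small assumed sample, stemming is left out, and the function names are hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word list.
STOP_WORDS = {"has", "is", "of", "the", "a", "an"}


def local_name(predicate_uri):
    """Return the part of a predicate URI after the last '/' or '#'."""
    return re.split(r"[/#]", predicate_uri)[-1]


def preprocess(predicate_uri):
    """Tokenize a predicate's local name and drop stop words (no stemming)."""
    name = local_name(predicate_uri)
    # Split camelCase and underscore-separated names into tokens.
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]
```

For example, `preprocess("http://dbpedia.org/property/populationTotal")` yields the terms `population` and `total`, which can then be compared with WordNet-based measures.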



Prune Groups by Similarity Matching

Similarity between initial groups {G_1, G_2, ..., G_k}

\[
\mathrm{Sim}(G_i, G_j) = \frac{\mathrm{StrSim}(O(G_i), O(G_j)) + \mathrm{WNSim}(T(G_i), T(G_j))}{2}
\]

Prune initial groups G_i
If Sim(G_i, G_j) is higher than the predefined similarity threshold, we
merge G_i and G_j.
If an initial group G_i has not been merged and has only one PO
pair, we remove G_i.
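
The two pruning rules can be sketched as below. The similarity function is passed in as a parameter, so nothing is assumed about how Sim is computed. Two points are assumptions of this sketch and are not stated on the slide: merging is transitive, and similarities are computed between the initial groups only. The threshold and `toy_sim` are illustrative.

```python
def prune_groups(groups, sim, threshold):
    """Merge groups whose similarity exceeds the threshold; drop lone singletons.

    groups: list of sets of (predicate, object) pairs
    sim: function(group_a, group_b) -> float
    """
    parent = list(range(len(groups)))
    was_merged = [False] * len(groups)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            if sim(groups[i], groups[j]) > threshold:
                parent[find(j)] = find(i)
                was_merged[i] = was_merged[j] = True

    result = {}
    for i, g in enumerate(groups):
        if not was_merged[i] and len(g) == 1:
            continue  # unmerged group with a single PO pair is removed
        result.setdefault(find(i), set()).update(g)
    return list(result.values())


def toy_sim(a, b):
    """Illustrative stand-in for Sim: 1.0 if the local names overlap."""
    names_a = {p.split(":")[1] for p, _ in a}
    names_b = {p.split(":")[1] for p, _ in b}
    return 1.0 if names_a & names_b else 0.0


initial = [
    {("db-prop:population", "3439100")},
    {("geo-onto:population", "3426354")},
    {("db-prop:plz", "10001-14199"), ("db-onto:postalCode", "10001-14199")},
    {("nyt:first_use", "2004-09-12")},
]
pruned = prune_groups(initial, toy_sim, 0.5)
```

The two population groups are merged, the postal group is kept because it has two PO pairs, and the single-pair group that merged with nothing is removed.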



An Example of Similarity Calculation

Group  Predicate                                    Object
G_i    http://dbpedia.org/property/population       3439100
G_j    http://www.geonames.org/ontology#population  3426354

Example of string-based similarity measures on pairwise objects.
Pairwise objects        prefix  suffix  Levenshtein distance  n-gram
“3439100”, “3426354”    0.29    0       0                     0.29

Example of WordNet-based similarity measures on pairwise terms.
Pairwise terms          LCH   RES  HSO  JCN   LESK  PATH  WUP   LIN  VECTOR
population, population  1     1    1    1     1     1     1     1    1
population, total       0.4   0    0    0.06  0.03  0.11  0.33  0    0.06

\[
\mathrm{Sim}(G_i, G_j) = \frac{0.145 + 0.5825}{2} = 0.36375
\]
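
The slide does not spell out how 0.145 and 0.5825 are obtained from the tables. The sketch below shows one reading that reproduces both values. StrSim is the mean of the four string scores. WNSim is the mean over the term pairs, where each pair's score averages only the measures that returned a non-zero value. The non-zero rule is an assumption made here; averaging all nine measures would give 0.555 instead of 0.5825.

```python
from statistics import mean

# Scores copied from the two tables above.
string_scores = [0.29, 0, 0, 0.29]  # prefix, suffix, Levenshtein, n-gram
wordnet_scores = {
    ("population", "population"): [1, 1, 1, 1, 1, 1, 1, 1, 1],
    ("population", "total"): [0.4, 0, 0, 0.06, 0.03, 0.11, 0.33, 0, 0.06],
}


def pair_score(scores):
    """Assumption: average only the measures with a non-zero score."""
    nonzero = [s for s in scores if s > 0]
    return mean(nonzero) if nonzero else 0.0


str_sim = mean(string_scores)
wn_sim = mean([pair_score(s) for s in wordnet_scores.values()])
sim = (str_sim + wn_sim) / 2
print(round(str_sim, 4), round(wn_sim, 4), round(sim, 5))
```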



Predicate Grouping

Divide pruned groups according to rdfs:domain and rdfs:range.
Keep groups with high frequency



Mid-Ontology Construction
Select terms for Mid-Ontology
Collect all the terms of the predicates in each refined group G_i.
Collect all the pre-processed terms of P(G_i) (the predicates in G_i).
Choose one term: the term with the highest frequency, preferring
the longest term.
e.g. “area” and “areaCode” are totally different
Construct Relations
mo-prop:hasMembers to link Mid-Ontology classes and integrated
predicates
Construct Mid-Ontology
Automatically construct Mid-Ontology using selected terms and
mo-prop:hasMembers.
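
Term selection and relation construction might look like the sketch below. Using the longest term as a tie-breaker among equally frequent terms is an assumption about how the two criteria combine. The function names and the `mo-onto:` namespace argument are illustrative.

```python
from collections import Counter


def select_term(term_lists):
    """Pick a class name from the pre-processed terms of a group's predicates.

    term_lists: one list of terms per predicate in the group.
    """
    counts = Counter(t for terms in term_lists for t in terms)
    # Highest frequency first; assumption: ties go to the longest term.
    return max(counts, key=lambda t: (counts[t], len(t)))


def build_class(namespace, predicates, term_lists):
    """Return (class name, hasMembers triples) for one refined group."""
    cls = namespace + select_term(term_lists)
    return cls, [(cls, "mo-prop:hasMembers", p) for p in predicates]


predicates = [
    "http://dbpedia.org/property/population",
    "http://dbpedia.org/property/populationTotal",
    "http://www.geonames.org/ontology#population",
]
term_lists = [["population"], ["population", "total"], ["population"]]
cls, members = build_class("mo-onto:", predicates, term_lists)
```

Here `population` occurs three times and `total` once, so the group becomes the class mo-onto:population with three mo-prop:hasMembers links.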



Experimental Evaluation

Evaluate the Mid-Ontology approach from four different aspects:
Evaluation of Data Reduction
Evaluation of Ontology Quality
Evaluation with a SPARQL Example
Analysis of Mid-Ontology Approach



Implementation

Environment
Linux Ubuntu 10.10, 16GB Memory, 1 TB Disk
Core i7 CPU 880 3.07GHz
Java, Netbeans 6.9
Virtuoso
High-performance server for RDF storage
SPARQL query endpoint
WordNet::Similarity
Implemented in Perl
Knowledge-based similarity measures



Experimental Data
DBpedia: cross-domain, 3.5 million things, 8.9 million URIs
Geonames: geographical domain, 7 million URIs
NYTimes: media domain, 10,467 subject news

Choose DBpedia as the core data set, because of its wealth of inward
and outward links to other data sets.


Evaluation of Data Reduction
Evaluate the effectiveness of data reduction during the data
collection phase by comparing the number of instances.
Number of distinct instances during the data collection phase.
Data set  Before reduction  owl:sameAs retrieval  Noisy data removal
DBpedia   8,955,728         135,749 (1.52%)       88,506 (0.99%)
Geonames  7,479,714         128,961 (1.72%)       82,054 (1.10%)
NYTimes   10,467            9,226 (88.14%)        8,535 (81.54%)

Evaluation Analysis
The data sets are dramatically scaled down by keeping only
linked instances that share related information.
Noisy instances, which may affect the quality of the
Mid-Ontology, are successfully removed.
e.g. instances with only a db-prop:hasPhotosCollection triple
(a broken link) and an owl:sameAs link are removed.
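
The percentages in the table are the remaining instances relative to the count before reduction. A short check, using only the counts from the table (the variable and function names are illustrative):

```python
# Instance counts copied from the table above:
# (before reduction, after owl:sameAs retrieval, after noisy data removal)
counts = {
    "DBpedia": (8_955_728, 135_749, 88_506),
    "Geonames": (7_479_714, 128_961, 82_054),
    "NYTimes": (10_467, 9_226, 8_535),
}


def percent(part, whole):
    """Share of `whole` that `part` represents, in percent, to two decimals."""
    return round(100 * part / whole, 2)


ratios = {
    name: (percent(linked, total), percent(kept, total))
    for name, (total, linked, kept) in counts.items()
}
```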


Improvement achieved by our approach
MO_no_p_r: with exact matching only (without the pruning and
refining processes)
MO: with both the pruning and refining processes

MO         Number of Classes  Number of Predicates  Cardinality  Accuracy
MO_no_p_r  11                 300                   27.27        68.78%
MO         29                 180                   6.21         90.10%

Evaluation Analysis
Significantly improved the accuracy
Decreased the cardinality (fewer predicates and more classes)
Successfully removed unrelated predicates
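
Cardinality is not defined on the slide. The reported values are consistent with reading it as the average number of predicates per class, which the following check illustrates (that reading is an inference from the numbers in the table):

```python
def cardinality(num_predicates, num_classes):
    """Average number of predicates per class, to two decimals."""
    return round(num_predicates / num_classes, 2)


without_pruning = cardinality(300, 11)  # exact matching only
with_pruning = cardinality(180, 29)     # with pruning and refining
```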



Evaluate the effectiveness of information retrieval with the
Mid-Ontology constructed with our approach.

Predicates grouped in mo-onto:population.
<rdf:Description rdf:about="mid-onto:population">
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/population"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/popLatest"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/populationTotal"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/ontology/populationTotal"/>
  <mo-prop:hasMembers rdf:resource="http://dbpedia.org/property/einwohner"/>
  <mo-prop:hasMembers rdf:resource="http://www.geonames.org/ontology#population"/>
</rdf:Description>



SPARQL: Find places with a population of more than 10 million.
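PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# The mid-onto: and mo-prop: prefixes must also be declared; their namespace URIs are not given on the slide.
SELECT DISTINCT ?places
WHERE {
  mid-onto:population mo-prop:hasMembers ?prop .
  ?places ?prop ?population .
  FILTER (xsd:integer(?population) > 10000000)
}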

Single property for population                 Number of Results
http://dbpedia.org/property/population         177
http://dbpedia.org/property/popLatest          1
http://dbpedia.org/property/populationTotal    107
http://dbpedia.org/ontology/populationTotal    129
http://dbpedia.org/property/einwohner          1
http://www.geonames.org/ontology#population    244

Evaluation Analysis
The query finds 517 places with mid-onto:population.
Each single predicate returns fewer results under the same
condition.
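
The effect of querying through the Mid-Ontology can be mimicked on a handful of in-memory triples. The data below is invented for illustration (`ex:CityA` and the others are not real instances), and the function is a plain-Python stand-in for the query, not a SPARQL engine. Because the query selects distinct places, a place found through several predicates is counted once, which is why the per-predicate counts in the table sum to more than 517.

```python
def places_over(triples, predicates, minimum):
    """Distinct subjects having any of `predicates` with an integer value above `minimum`."""
    found = set()
    for s, p, o in triples:
        if p not in predicates:
            continue
        try:
            value = int(o)
        except ValueError:
            continue  # non-integer literals do not pass the filter
        if value > minimum:
            found.add(s)
    return found


# Hypothetical sample data.
triples = [
    ("ex:CityA", "db-prop:population", "12000000"),
    ("ex:CityA", "geo-onto:population", "11900000"),
    ("ex:CityB", "geo-onto:population", "15000000"),
    ("ex:CityC", "db-prop:einwohner", "10500000"),
    ("ex:CityD", "db-prop:population", "500000"),
    ("ex:CityE", "db-prop:population", "about 11 million"),
]
members = {"db-prop:population", "geo-onto:population", "db-prop:einwohner"}

via_mid_ontology = places_over(triples, members, 10_000_000)
via_single = places_over(triples, {"db-prop:population"}, 10_000_000)
```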


Analysis of Mid-Ontology Approach
Analyze whether we can successfully identify how data sets are
connected.
Sample classes in the Mid-Ontology
DBpedia            DBpedia & Geonames  DBpedia & Geonames & NYTimes
mo-onto:birthdate  mo-onto:population  mo-onto:name
mo-onto:deathdate  mo-onto:prominence  mo-onto:long
mo-onto:motto      mo-onto:postal

Evaluation Analysis
Predicates in DBpedia are heterogeneous.
Linked instances between DBpedia and Geonames are about
places.
Linked instances among DBpedia, Geonames, and NYTimes
are about events, persons, or places.


Possible Application

Find missing owl:sameAs links
e.g. Find missing owl:sameAs link with mo-onto:population
http://dbpedia.org/resource/Cyclades db-prop:population “119549”
http://dbpedia.org/resource/Cyclades db-prop:name “Cyclades”
http://sws.geonames.org/259819/ geo-onto:population “119549”
http://sws.geonames.org/259819/ geo-onto:alternateName “Cyclades”



Add owl:sameAs link
http://dbpedia.org/resource/Cyclades owl:sameAs http://sws.geonames.org/259819/
http://sws.geonames.org/259819/ owl:sameAs http://dbpedia.org/resource/Cyclades
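
One way to surface such candidates is to look for instance pairs whose values agree across Mid-Ontology classes. The sketch below is an illustration, not the authors' procedure. The `min_shared` threshold and the assumed members of mo-onto:name are choices made for the example, and a real run would also restrict pairs to instances from different data sets.

```python
from collections import defaultdict
from itertools import combinations


def candidate_same_as(triples, mid_ontology, min_shared=2):
    """Suggest instance pairs that agree on values of several Mid-Ontology classes.

    mid_ontology: {class name: set of member predicates}
    """
    to_class = {p: c for c, members in mid_ontology.items() for p in members}
    # (class, value) -> instances carrying that value under a member predicate
    index = defaultdict(set)
    for s, p, o in triples:
        if p in to_class:
            index[(to_class[p], o)].add(s)

    shared = defaultdict(set)
    for (cls, _value), instances in index.items():
        for a, b in combinations(sorted(instances), 2):
            shared[(a, b)].add(cls)
    return [pair for pair, classes in shared.items() if len(classes) >= min_shared]


DB = "http://dbpedia.org/resource/Cyclades"
GEO = "http://sws.geonames.org/259819/"
triples = [
    (DB, "db-prop:population", "119549"),
    (DB, "db-prop:name", "Cyclades"),
    (GEO, "geo-onto:population", "119549"),
    (GEO, "geo-onto:alternateName", "Cyclades"),
]
# Assumption: member predicates of the two classes used in this example.
mid_ontology = {
    "mo-onto:population": {"db-prop:population", "geo-onto:population"},
    "mo-onto:name": {"db-prop:name", "geo-onto:alternateName"},
}
candidates = candidate_same_as(triples, mid_ontology)
```

The two Cyclades instances agree on both population and name, so the pair is returned as a candidate for an owl:sameAs link.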



Related Work

Construct an intermediate-layer ontology from geospatial, zoology,
and genetics data resources. [Parundekar et al., 2010]
Limited to a specific domain
Construct an intermediate-level ontology by enriching an upper
ontology (adding new classes and properties). [Damova et al., 2010]
Still too large
Analysis of basic properties of the SameAs network,
Pay-Level-Domain network, and Class-Level Similarity network.
[Ding et al., 2010]
Only frequent types are considered when analyzing how data are connected



Conclusion and Future Work
Conclusion
Learning all the heterogeneous ontology schemas in the linked open
data sets is not feasible.
An automatic Mid-Ontology learning approach can solve the
heterogeneity problem by integrating related predicates.
The Mid-Ontology has high accuracy and is effective for searching
across various data sets.
A simple Mid-Ontology can be constructed without learning
the entire ontology schema.
Future Work
Billion Triple Challenge (BTC) data set
Crawl links at two or three depths without a core data set

Questions?
Lihua Zhao, lihua@nii.ac.jp
Ryutaro Ichise, ichise@nii.ac.jp

