1. WWW2012 Tutorial
Practical Cross-Dataset Queries on the Web of Data
Instance Matching
Robert Isele
Freie Universität Berlin
WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
2. Outline
Motivation
Link Discovery Tools
Linking Workflow
Silk Workbench
3. Motivation
The Web of Data is a single global data space because data sources are
connected by links
Over 31 billion triples published as Linked Open Data and growing
But:
● Fewer than 500 million links
● Most publishers only link to one other dataset
4. Use Case 1: Publishing a New Dataset
A data provider wants to publish a new dataset
Wants to interlink with existing data sets from the same
domain
Example
● A data publisher wants to publish a new dataset about movies
● Interlink movies with LinkedMDB (Linked Movie Data Base)
● Interlink directors with DBpedia (Wikipedia)
5. Use Case 2: Linked Data Application
A Linked Data application integrates multiple data sources from
the same domain
In the decentralized Web of Data, many data sources use
different URIs for the same real world object.
Identifying these URI aliases is a central problem in Linked
Data.
6. Challenges for Link Discovery
The Web of Data is heterogeneous
● Many different vocabularies are in use
● Different data formats
● Many different ways to represent the same information
Figure: Distribution of the most widely used vocabularies
7. Challenges for Link Discovery
Large range of domains
● 256 data sources in the LOD cloud from a variety of domains
● Linkage Rules are different in each domain
● Writing a Linkage Rule for each of these domains is usually not
trivial
Figure: Distribution of triples by domain
8. Challenges for Link Discovery
Scalability
● The current LOD cloud contains 277 datasets (August 2011)
● 30 billion triples in total
● Infeasible to compare every possible entity pair
9. Link Discovery Tools
Tools enable data publishers to set links
Most tools generate links based on user-defined linkage rules
A linkage rule specifies the conditions data items must fulfill
in order to be interlinked
Popular Link Discovery Tools:
● Silk Link Discovery Framework
● LIMES
● Others: http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining
10. Silk Link Discovery Framework
Tool for discovering links between data items within different
Linked Data sources.
The Silk Link Specification Language (Silk-LSL) allows users to
express complex linkage rules
Can be used to generate owl:sameAs links as well as other
relationships
Scalability and high performance through efficient data
handling
11. Silk Versions
Silk Single Machine
● Generate links on a single machine
● Local or remote data sets
Silk MapReduce
● Generate RDF links using a cluster of multiple machines
● Based on Hadoop (can be run on Amazon Elastic MapReduce)
Silk Server
● Provides an HTTP API for matching instances from an incoming
stream of RDF data while keeping track of known entities
● Can be used as an identity resolution component within
applications that consume Linked Data from the Web
12. Silk Workbench
Silk Workbench is a web application which guides the user
through the process of interlinking different data sources.
Enables the user to manage different sets of data sources
and linking tasks.
Offers a graphical editor which enables the user to easily
create and edit linkage rules
Offers tools to evaluate the current linkage rule
Includes experimental support for learning linkage rules
13. Linking Workflow
14. Typical linkage rule
Select the values to be compared
● Example: Select labels and dates of a music record
Normalize the values
● Example: Transform dates to a common format
Compare different values using similarity measures
● Example: Compare labels and dates of a music record
Aggregate the results of multiple comparisons
● Example: Compute the average of the label and date similarity
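The four steps can be sketched in a few lines of Python. This is only an illustration, not how Silk implements linkage rules: the record layout, the list of accepted date formats, and the use of `difflib.SequenceMatcher` as the label measure are all assumptions made for the example.

```python
from datetime import datetime
from difflib import SequenceMatcher

def normalize_date(value):
    """Normalize: transform a date in one of a few assumed formats to ISO format."""
    for fmt in ("%Y-%m-%d", "%d.%m.%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return value  # unknown formats are left untouched

def label_similarity(a, b):
    """Compare: stand-in string measure from the standard library, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def date_similarity(a, b):
    """Compare: 1.0 if both dates normalize to the same day, else 0.0."""
    return 1.0 if normalize_date(a) == normalize_date(b) else 0.0

def linkage_rule(record_a, record_b):
    # Select the label and date of each record, compare them,
    # and aggregate the two similarities with a plain average.
    s_label = label_similarity(record_a["label"], record_b["label"])
    s_date = date_similarity(record_a["date"], record_b["date"])
    return (s_label + s_date) / 2

# Hypothetical music records from two data sources
record_a = {"label": "Abbey Road", "date": "1969-09-26"}
record_b = {"label": "abbey road", "date": "26.09.1969"}
score = linkage_rule(record_a, record_b)
```

Whether `score` is high enough to generate a link is decided by a threshold, which has to be chosen per linking task.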
15. Value selectors
Values in the graph around the entities can be used for comparison
Property path languages have been developed for that purpose
Examples (SPARQL 1.1 Property Paths Language):
● Entity label: rdfs:label
● Movie director name: dbpedia-owl:director/foaf:name
● All movies of a director: ^dbpedia-owl:director/rdfs:label
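To make the three example paths concrete, here is a minimal sketch of a path evaluator over an in-memory set of triples. It supports only the sequence operator `/` and the inverse operator `^`, not the full SPARQL 1.1 property path syntax, and the `ex:` resources and their values are invented for the example.

```python
# Toy graph as (subject, predicate, object) triples; all ex: names are made up.
TRIPLES = {
    ("ex:movie1", "rdfs:label", "Movie One"),
    ("ex:movie2", "rdfs:label", "Movie Two"),
    ("ex:movie1", "dbpedia-owl:director", "ex:director1"),
    ("ex:movie2", "dbpedia-owl:director", "ex:director1"),
    ("ex:director1", "foaf:name", "Jane Example"),
}

def evaluate_path(start, path):
    """Follow a path such as 'p1/p2' or '^p1/p2' starting at one entity.

    A step prefixed with '^' is followed from object to subject.
    Returns the set of values reached at the end of the path.
    """
    nodes = {start}
    for step in path.split("/"):
        inverse = step.startswith("^")
        predicate = step.lstrip("^")
        if inverse:
            nodes = {s for (s, p, o) in TRIPLES if p == predicate and o in nodes}
        else:
            nodes = {o for (s, p, o) in TRIPLES if p == predicate and s in nodes}
    return nodes

label = evaluate_path("ex:movie1", "rdfs:label")
director_name = evaluate_path("ex:movie1", "dbpedia-owl:director/foaf:name")
movies_of_director = evaluate_path("ex:director1", "^dbpedia-owl:director/rdfs:label")
```

Splitting on `/` works here only because the predicates are written as prefixed names; full URIs would need a real parser.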
16. Data Transformations
Different data sets may use different data formats
Data sets may be noisy
⇒ Values must be normalized prior to comparison
17. Common Transformations
Case normalization
Structural transformation
Extract values from URIs
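Each of these three transformations can be sketched as a small Python function. The function names and the specific rules (reordering "Last, First" names, treating underscores in URIs as spaces) are assumptions chosen for illustration, not the transformations shipped with any particular tool.

```python
import re

def lower_case(value):
    """Case normalization."""
    return value.lower()

def reorder_name(value):
    """Structural transformation: 'Doe, John' becomes 'John Doe'."""
    if "," in value:
        last, first = (part.strip() for part in value.split(",", 1))
        return f"{first} {last}"
    return value

def label_from_uri(uri):
    """Extract the last path or fragment segment of a URI as a readable label."""
    local_name = re.split(r"[/#]", uri.rstrip("/"))[-1]
    return local_name.replace("_", " ")

normalized = lower_case(reorder_name("Doe, John"))
extracted = label_from_uri("http://dbpedia.org/resource/Sean_Connery")
```

Transformations are typically chained, as in `normalized` above, so that both sides of a comparison end up in the same form.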
18. Similarity Measures
A similarity measure compares two values
It returns a value between 0 (no similarity) and 1 (equality)
Formally, a similarity measure is a function:
$\mathrm{sim} : \Sigma^* \times \Sigma^* \to [0, 1]$
Various similarity measures have been proposed
● Character-based measures
● Token-based measures
● Domain-specific measures
19. Character-Based Similarity Measures
Usually rely on character edit operations
Often used for catching typographical errors
Most popular
● Levenshtein
● Jaro/Jaro-Winkler
20. Levenshtein Distance
The minimum number of edits needed to transform one
string into the other
Allowed edit operations:
● insert a character into the string
● delete a character from the string
● replace one character with a different character
Examples:
● levenshtein('Table', 'Cable') = 1 (1 Substitution)
● levenshtein('Table', 'able') = 1 (1 Deletion)
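The distance can be computed with the standard dynamic-programming approach, sketched below. Converting the distance into a similarity in [0, 1] by dividing by the length of the longer string is one common choice and an assumption here; other normalizations exist.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Scale the distance to [0, 1], where 1 means the strings are equal."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

print(levenshtein("Table", "Cable"))  # 1
print(levenshtein("Table", "able"))   # 1
```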
21. Token-Based Similarity Measures
Character-based measures work well for typographical
errors, but fail when word arrangements differ
Example: 'John Doe', 'Doe, John', 'Mr. John Doe'
Token-based measures split the values into tokens before
computing the similarity
Example: tokenize('Mr. John Doe') = {'Mr.', 'John', 'Doe'}
Most popular: Jaccard, Dice
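As a sketch of a token-based measure, the following computes the Dice coefficient, 2·|A∩B| / (|A| + |B|), over token sets. The tokenizer is an assumption: it splits on whitespace and commas and, unlike the tokenize example above, also lower-cases the tokens.

```python
import re

def tokenize(value):
    """Split on whitespace and commas, lower-case, and drop empty tokens."""
    return {token for token in re.split(r"[\s,]+", value.lower()) if token}

def dice(a, b):
    """Dice coefficient over the token sets of two strings."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty values are treated as equal
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))

print(dice("John Doe", "Doe, John"))     # 1.0: same tokens, different order
print(dice("John Doe", "Mr. John Doe"))  # 0.8: 2 shared tokens, 2 + 3 in total
```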
22. Jaccard coefficient
Intuition: Measure the fraction of the tokens which are
shared by both strings
Defined as the number of matching words divided by the
total number of distinct words:
$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Example:
$\mathrm{Jaccard}(\{\text{Thomas}, \text{Sean}, \text{Connery}\}, \{\text{Sir}, \text{Sean}, \text{Connery}\}) = \frac{2}{4} = 0.5$
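The definition translates directly into set operations. A minimal sketch, assuming the values have already been tokenized and that two empty token sets count as equal:

```python
def jaccard(tokens_a, tokens_b):
    """Number of shared tokens divided by the number of distinct tokens."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # assumption: two empty values are treated as equal
    return len(a & b) / len(a | b)

print(jaccard({"Thomas", "Sean", "Connery"}, {"Sir", "Sean", "Connery"}))  # 0.5
```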
23. Domain-Specific Similarity Measures
Geographic distance
Date/Time
Numbers
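A domain-specific measure usually computes a distance in the natural unit of the domain (kilometres, days, a numeric difference) and then maps it to a similarity. The sketch below uses the haversine formula for geographic distance; the linear decay up to a maximum distance is an assumed mapping, chosen for simplicity.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def distance_similarity(distance, max_distance):
    """Linear decay: 1 at distance 0, 0 at or beyond max_distance."""
    return max(0.0, 1.0 - distance / max_distance)

def geo_similarity(point_a, point_b, max_km):
    """Similarity of two (latitude, longitude) pairs."""
    return distance_similarity(haversine_km(*point_a, *point_b), max_km)

def number_similarity(x, y, max_difference):
    """The same decay applied to the absolute difference of two numbers."""
    return distance_similarity(abs(x - y), max_difference)
```

The same pattern covers dates: compute the difference in days and pass it to `distance_similarity` with a maximum tolerated number of days.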
24. Aggregating Similarity Values
In order to determine if two entities are duplicates it is
usually not sufficient to compare a single property
Aggregation Functions aggregate the similarity of multiple
comparisons
Example: Interlinking geographical datasets
● Compare by label and geographic coordinates
● Aggregate similarity values
25. Popular Aggregation Functions
Minimum
● Choose the lowest value
● ⇒ All values must exceed the threshold
Maximum
● Choose the highest value
● ⇒ At least one value must exceed the threshold
Weighted Average
● Assign a weight to each comparison
● Compute the weighted mean
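The three functions are short enough to write out. A minimal sketch, with made-up similarity values and weights for a label comparison and a coordinate comparison:

```python
def aggregate_min(scores):
    """Minimum: the link is only as strong as its weakest comparison."""
    return min(scores)

def aggregate_max(scores):
    """Maximum: a single strong comparison is enough."""
    return max(scores)

def weighted_average(scores, weights):
    """Weighted mean of the similarity values."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Hypothetical similarities: label = 1.0, coordinates = 0.5
scores = [1.0, 0.5]
print(aggregate_min(scores))             # 0.5
print(aggregate_max(scores))             # 1.0
print(weighted_average(scores, [3, 1]))  # 0.875
```

With a threshold of 0.8, the minimum would reject this pair while the maximum and the weighted average would accept it, which shows how strongly the choice of aggregation affects the result.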
26. Putting it all together
27. Example
Interlink cities in different data sources:
28. Evaluating Linkage Rules
Gold standard in the form of reference links
● Positive links (definitive matches)
● Negative links (definitive non-matches)
Based on the reference links, we can determine the number
of correct and incorrect matches
We distinguish between 4 cases:
                        Positive link     Negative link
match(a,b) = link       True positive     False positive
match(a,b) = nonlink    False negative    True negative
29. Evaluating Linkage Rules
Recall: Ratio of correct links compared to all known links
$\mathrm{recall} = \frac{|\text{true positives}|}{|\text{true positives}| + |\text{false negatives}|}$
Precision: Ratio of correct links compared to all found links
$\mathrm{precision} = \frac{|\text{true positives}|}{|\text{true positives}| + |\text{false positives}|}$
F-measure: Harmonic mean of precision and recall
$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
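The three measures can be computed directly from a set of generated links and the reference links. This sketch uses the standard definitions, in which recall divides by true positives plus false negatives and precision divides by true positives plus false positives. The entity identifiers are invented, and generated links that appear in neither reference set are ignored, which is an assumption about how unknown pairs are handled.

```python
def evaluate(generated_links, positive_links, negative_links):
    """Return (precision, recall, f_measure) against a set of reference links."""
    generated = set(generated_links)
    positives, negatives = set(positive_links), set(negative_links)
    tp = len(generated & positives)   # correct links that were found
    fp = len(generated & negatives)   # known non-matches that were linked
    fn = len(positives - generated)   # correct links that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical reference links and rule output
positive = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
negative = {("a1", "b2"), ("a2", "b3")}
generated = {("a1", "b1"), ("a2", "b2"), ("a1", "b2")}

precision, recall, f_measure = evaluate(generated, positive, negative)
# tp = 2, fp = 1, fn = 2, so precision = 2/3, recall = 1/2, F = 4/7
```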
30. Recall-Precision diagram
A recall-precision diagram visualizes the trade-off between
maximizing the recall and maximizing the precision
From: Creating probabilistic databases from duplicated data, Oktie Hassanzadeh · Renée J. Miller (VLDBJ)
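The points of such a diagram can be obtained by sweeping the link threshold over a list of scored entity pairs. A minimal sketch with invented scores follows; reporting a precision of 1.0 when no pair is accepted is a convention assumed here.

```python
def precision_recall_points(scored_pairs, thresholds):
    """scored_pairs holds (similarity, is_match) tuples.

    Returns one (threshold, precision, recall) tuple per threshold.
    """
    total_matches = sum(1 for _, is_match in scored_pairs if is_match)
    points = []
    for threshold in thresholds:
        accepted = [is_match for score, is_match in scored_pairs if score >= threshold]
        tp = sum(accepted)  # True counts as 1
        precision = tp / len(accepted) if accepted else 1.0
        recall = tp / total_matches if total_matches else 0.0
        points.append((threshold, precision, recall))
    return points

# Hypothetical similarity scores with their reference labels
scored = [(0.95, True), (0.9, True), (0.7, False), (0.6, True), (0.3, False)]
points = precision_recall_points(scored, [0.5, 0.8])
# threshold 0.5: precision 0.75, recall 1.0
# threshold 0.8: precision 1.0,  recall 2/3
```

Raising the threshold from 0.5 to 0.8 removes the false positive but also loses one correct link, which is the trade-off the diagram visualizes.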
31. Outline
Motivation
Link Discovery Tools
Linking Workflow
Silk Workbench
32. Silk Workbench
Silk Workbench offers a GUI for:
● Managing different data sources and linkage rules
● Creating linkage rules
● Executing linkage rules
● Evaluating linkage rules
● Learning Linkage Rules
33. Workspace
The Workspace holds a set of projects
consisting of:
Data Sources
● Holds all information that is needed
by Silk to retrieve entities from it.
● Usually a file dump or a SPARQL
endpoint
Linking Tasks
● Interlinks a type of entity between
two data sources
● e.g. interlinking movies in DBpedia
and LinkedMDB
34. Linkage Rule Editor
Allows viewing and editing linkage rules
Linkage Rules are shown as a tree
Editing using drag & drop.
35. Generating Links
37. Conclusion
In order to publish a new data set or to consume an existing
dataset we need to generate links
A linkage rule specifies the conditions which must hold true
for two entities in order to be considered the same real-
world object.
The Silk Workbench provides a graphical user interface to
create and edit linking tasks
The hands-on session will cover a simple example interlinking
musical artists in Freebase and DBpedia