WWW2012 Tutorial
Practical Cross-Dataset Queries on the Web of Data



       Instance Matching




              Robert Isele
         Freie Universität Berlin



                   WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
Outline
   Motivation
   Link Discovery Tools
   Linking Workflow
   Silk Workbench




Motivation
   The Web of Data is a single global data space because data sources are
    connected by links
   Over 31 billion triples published as Linked Open Data and growing
   But:
    ●   Less than 500 million links
    ●   Most publishers only link to one other dataset




Use Case 1: Publishing a New Dataset
   A data provider wants to publish a new dataset
   The provider wants to interlink it with existing datasets from the
    same domain
   Example
    ●   A data publisher wants to publish a new dataset about movies
    ●   Interlink movies with LinkedMDB (Linked Movie Data Base)
    ●   Interlink directors with DBpedia (Wikipedia)




Use Case 2: Linked Data Application
   A Linked Data application integrates multiple data sources from
    the same domain
   In the decentralized Web of Data, many data sources use
    different URIs for the same real-world object.
   Identifying these URI aliases is a central problem in Linked
    Data.




Challenges for Link Discovery

   The Web of Data is heterogeneous
    ●   Many different vocabularies are in use
    ●   Different data formats
    ●   Many different ways to represent the same information




   [Figure: Distribution of the most widely used vocabularies]
Challenges for Link Discovery
   Large range of domains
    ●   256 data sources in the LOD cloud from a variety of domains
    ●   Linkage Rules are different in each domain
    ●   Writing a Linkage Rule for each of these domains is usually not
        trivial




   [Figure: Distribution of triples by domain]

Challenges for Link Discovery
   Scalability
    ●   The current LOD cloud contains 277 datasets (August 2011)
    ●   30 billion triples in total
    ●   Infeasible to compare every possible entity pair




Link Discovery Tools
   Tools enable data publishers to set links
   Most tools generate links based on user-defined linkage rules
   A linkage rule specifies the conditions data items must fulfill
    in order to be interlinked
   Popular Link Discovery Tools:
    ●   Silk Link Discovery Framework
    ●   LIMES
    ●   Others: http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining




Silk Link Discovery Framework
   Tool for discovering links between data items within different
    Linked Data sources.
   The Silk Link Specification Language (Silk-LSL) allows users to
    express complex linkage rules
   Can be used to generate owl:sameAs links as well as other
    relationships
   Scalability and high performance through efficient data
    handling




Silk Versions
   Silk Single Machine
    ●   Generate links on a single machine
    ●   Local or remote data sets
   Silk MapReduce
    ●   Generate RDF links using a cluster of multiple machines
    ●   Based on Hadoop (Can be run on Amazon Elastic MapReduce)
   Silk Server
    ●   Provides an HTTP API for matching instances from an incoming
        stream of RDF data while keeping track of known entities
    ●   Can be used as an identity resolution component within
        applications that consume Linked Data from the Web
Silk Workbench
   Silk Workbench is a web application which guides the user
    through the process of interlinking different data sources.
   Enables the user to manage different sets of data sources
    and linking tasks.
   Offers a graphical editor which enables the user to easily
    create and edit linkage rules
   Offers tools to evaluate the current linkage rule
   Includes experimental support for learning linkage rules




Linking Workflow




Typical linkage rule
   Select the values to be compared
    ●   Example: Select labels and dates of a music record
   Normalize the values
    ●   Example: Transform dates to a common format
   Compare different values using similarity measures
    ●   Example: Compare labels and dates of a music record
   Aggregate the results of multiple comparisons
    ●   Example: Compute the average of the label and date similarity
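
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Silk implements linkage rules: the record fields (label, date), the two accepted date formats, the sample values, and all function names are made up for the example.

```python
from datetime import datetime

# Hypothetical music records from two data sources (field names are assumptions).
record_a = {"label": "Abbey Road", "date": "1969-09-26"}
record_b = {"label": "abbey road", "date": "26/09/1969"}

def normalize_date(value):
    """Transform a date string to ISO format; the input formats are assumed."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return value  # leave unknown formats untouched

def equality(a, b):
    """A deliberately simple similarity measure: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0

def linkage_rule(a, b):
    # 1. Select the values to be compared.
    label_a, label_b = a["label"], b["label"]
    date_a, date_b = a["date"], b["date"]
    # 2. Normalize the values.
    label_a, label_b = label_a.lower(), label_b.lower()
    date_a, date_b = normalize_date(date_a), normalize_date(date_b)
    # 3. Compare the values using similarity measures.
    label_sim = equality(label_a, label_b)
    date_sim = equality(date_a, date_b)
    # 4. Aggregate the results (here: the plain average).
    return (label_sim + date_sim) / 2

score = linkage_rule(record_a, record_b)
print(score)
```

In practice the equality check would be swapped for one of the similarity measures discussed later, and the resulting score compared against a threshold.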




Value selectors
   Values in the graph around the entities can be used for comparison
   Property path languages have been developed for that purpose
   Examples (SPARQL 1.1 Property Paths Language):
    ●   Entity label: rdfs:label
    ●   Movie director name: dbpedia-owl:director/foaf:name
    ●   All movies of a director: ^dbpedia-owl:director/rdfs:label
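
To make the three example paths concrete, the following sketch evaluates them over a tiny in-memory graph. It supports only a small subset of property paths (steps separated by "/", with "^" marking an inverse step); the triples, the ex: identifiers, and the function name evaluate_path are assumptions for illustration, not part of any SPARQL library.

```python
# Toy RDF graph as (subject, predicate, object) triples; all data is made up.
TRIPLES = [
    ("ex:movie1", "rdfs:label", "Movie One"),
    ("ex:movie2", "rdfs:label", "Movie Two"),
    ("ex:movie1", "dbpedia-owl:director", "ex:director1"),
    ("ex:movie2", "dbpedia-owl:director", "ex:director1"),
    ("ex:director1", "foaf:name", "Jane Doe"),
]

def evaluate_path(start, path, triples=TRIPLES):
    """Return the set of values reached from `start` by following `path`.

    Only '/'-separated steps and the inverse operator '^' are supported.
    """
    nodes = {start}
    for step in path.split("/"):
        inverse = step.startswith("^")
        prop = step[1:] if inverse else step
        if inverse:
            # Follow the property backwards: from object to subject.
            nodes = {s for (s, p, o) in triples if p == prop and o in nodes}
        else:
            # Follow the property forwards: from subject to object.
            nodes = {o for (s, p, o) in triples if p == prop and s in nodes}
    return nodes

print(evaluate_path("ex:movie1", "dbpedia-owl:director/foaf:name"))
print(evaluate_path("ex:director1", "^dbpedia-owl:director/rdfs:label"))
```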




Data Transformations
   Different data sets may use different data formats
   Data sets may be noisy
⇒ Values must be normalized prior to comparison




Common Transformations
   Case normalization


   Structural transformation




   Extract values from URIs
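
As a rough illustration, each of the three transformation types could look like this in Python. The concrete choices are assumptions: lower-casing for case normalization, reordering "Last, First" names as one possible structural transformation, and taking the last path segment of a URI as its value. The function names are hypothetical.

```python
from urllib.parse import unquote, urlparse

def normalize_case(value):
    """Case normalization: compare values in lower case."""
    return value.lower()

def reorder_name(value):
    """Structural transformation: turn 'Last, First' into 'First Last'."""
    if "," in value:
        last, first = (part.strip() for part in value.split(",", 1))
        return f"{first} {last}"
    return value

def label_from_uri(uri):
    """Extract a value from a URI: take the last path segment as a label."""
    path = urlparse(uri).path
    local_name = path.rstrip("/").rsplit("/", 1)[-1]
    return unquote(local_name).replace("_", " ")

print(normalize_case("BERLIN"))
print(reorder_name("Doe, John"))
print(label_from_uri("http://dbpedia.org/resource/Sean_Connery"))
```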




Similarity Measures
   A similarity measure compares two values
   It returns a value between 0 (no similarity) and 1 (equality)
   Formally, a similarity measure is a function:
$$\mathrm{sim} : \Sigma^{*} \times \Sigma^{*} \to [0, 1]$$

   Various similarity measures have been proposed
    ●   Character-based measures
    ●   Token-based measures
    ●   Domain-specific measures


Character-Based Similarity Measures
   Usually rely on character edit operations
   Often used for catching typographical errors
   Most popular
    ●   Levenshtein
    ●   Jaro/Jaro-Winkler




Levenshtein Distance
   The minimum number of edits needed to transform one
    string into the other
   Allowed edit operations:
    ●   insert a character into the string
    ●   delete a character from the string
    ●   replace one character with a different character
   Examples:
    ●   levenshtein('Table', 'Cable') = 1 (1 substitution)
    ●   levenshtein('Table', 'able') = 1 (1 deletion)
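
The distance can be computed with the standard dynamic-programming approach, sketched below. The second function, which maps the distance to a similarity in [0, 1] by dividing by the longer string's length, is one common normalization and an assumption here; the slides do not say which normalization is used.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace (or keep) the character
            ))
        previous = current
    return previous[-1]

def levenshtein_similarity(a, b):
    """Assumed normalization: 1.0 for equal strings, 0.0 for maximal distance."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

print(levenshtein("Table", "Cable"))
print(levenshtein("Table", "able"))
```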



Token-Based Similarity Measures
   Character-based measures work well for typographical
    errors, but fail when word arrangements differ
   Example: 'John Doe', 'Doe, John', 'Mr. John Doe'


   Token-based measures split the values into tokens before
    computing the similarity
   Example: tokenize('Mr. John Doe') = {'Mr.', 'John', 'Doe'}


   Most popular: Jaccard, Dice


Jaccard coefficient
   Intuition: Measure the fraction of the tokens which are
    shared by both strings
   Defined as the number of matching words divided by the
    total number of distinct words:

$$\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

   Example:
$$\mathrm{Jaccard}(\{\text{Thomas}, \text{Sean}, \text{Connery}\}, \{\text{Sir}, \text{Sean}, \text{Connery}\}) = \frac{2}{4} = 0.5$$
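
A minimal sketch of the coefficient in Python, assuming whitespace tokenization. Returning 1.0 for two empty values is a convention chosen for the example, not something the definition prescribes.

```python
def tokenize(value):
    """Split a value into a set of whitespace-separated tokens."""
    return set(value.split())

def jaccard(a, b):
    """Number of shared tokens divided by the number of distinct tokens."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # assumed convention: two empty values count as equal
    return len(tokens_a & tokens_b) / len(union)

print(jaccard("Thomas Sean Connery", "Sir Sean Connery"))
print(jaccard("John Doe", "Doe John"))
```

Because tokens are compared as sets, word order no longer matters, which is exactly the case where character-based measures struggle.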




Domain-Specific Similarity Measures

   Geographic distance
   Date/Time
   Numbers
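
As one example of a domain-specific measure, geographic distance can be turned into a similarity value. The sketch below uses the haversine formula for the great-circle distance and a linear mapping to [0, 1]; both the mapping and the max_distance parameter are assumptions chosen for illustration. The same mapping could be applied to differences between numbers or dates.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 coordinates."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def distance_similarity(distance, max_distance):
    """Assumed mapping: 1.0 at distance 0, falling linearly to 0.0 at max_distance."""
    return max(0.0, 1.0 - distance / max_distance)

# Berlin and Paris city centres (approximate coordinates).
d = haversine_km(52.52, 13.405, 48.8566, 2.3522)
print(round(d), distance_similarity(d, max_distance=50.0))
```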




Aggregating Similarity Values
   In order to determine if two entities are duplicates it is
    usually not sufficient to compare a single property
   Aggregation Functions aggregate the similarity of multiple
    comparisons
   Example: Interlinking geographical datasets
    ●   Compare by label and geographic coordinates
    ●   Aggregate similarity values




Popular Aggregation Functions
   Minimum
    ●   Choose the lowest value
    ●   ⇒ All values must exceed the threshold
   Maximum
    ●   Choose the highest value
    ●   ⇒ At least one value must exceed the threshold
   Weighted Average
    ●   Assign a weight to each comparison
    ●   Compute the weighted mean
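
The three aggregation functions are straightforward to express in code. The sketch below assumes each comparison has already produced a similarity in [0, 1]; the sample values and weights are made up.

```python
def minimum(similarities):
    """The link is only as strong as the weakest comparison."""
    return min(similarities)

def maximum(similarities):
    """One strong comparison is enough."""
    return max(similarities)

def weighted_average(similarities, weights):
    """Weighted mean of the similarity values."""
    total = sum(weights)
    return sum(s * w for s, w in zip(similarities, weights)) / total

# Hypothetical label and date similarities, with the label weighted 3:1.
sims = [0.9, 0.5]
print(minimum(sims), maximum(sims), weighted_average(sims, [3, 1]))
```

With a threshold of 0.7, the minimum would reject this pair, while the maximum and the weighted average would accept it.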



Putting it all together




Example
   Interlink cities in different data sources:
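
One way such a linkage rule for cities could look is sketched below. Everything concrete is an assumption made for illustration: the example.org URIs, the record fields, case-insensitive label equality, a 50 km distance limit, minimum aggregation, and the 0.8 threshold. The output is a list of owl:sameAs statements in N-Triples syntax.

```python
import math

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Made-up city descriptions from two hypothetical data sources.
SOURCE_A = [
    {"uri": "http://example.org/a/Berlin", "label": "Berlin", "lat": 52.52, "lon": 13.405},
    {"uri": "http://example.org/a/Paris", "label": "Paris", "lat": 48.8566, "lon": 2.3522},
]
SOURCE_B = [
    {"uri": "http://example.org/b/berlin", "label": "berlin", "lat": 52.5167, "lon": 13.3833},
    {"uri": "http://example.org/b/paris_tx", "label": "Paris", "lat": 33.6609, "lon": -95.5555},
]

def distance_km(a, b):
    """Great-circle (haversine) distance between two city records."""
    phi1, phi2 = math.radians(a["lat"]), math.radians(b["lat"])
    d_phi = phi2 - phi1
    d_lambda = math.radians(b["lon"] - a["lon"])
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def city_similarity(a, b, max_distance_km=50.0):
    # Compare labels after case normalization.
    label_sim = 1.0 if a["label"].lower() == b["label"].lower() else 0.0
    # Compare coordinates: 1.0 at the same spot, 0.0 beyond max_distance_km.
    geo_sim = max(0.0, 1.0 - distance_km(a, b) / max_distance_km)
    # Aggregate with the minimum: both comparisons must agree.
    return min(label_sim, geo_sim)

def generate_links(source_a, source_b, threshold=0.8):
    """Compare every pair and emit owl:sameAs links as N-Triples lines."""
    links = []
    for a in source_a:
        for b in source_b:
            if city_similarity(a, b) >= threshold:
                links.append(f'<{a["uri"]}> <{OWL_SAME_AS}> <{b["uri"]}> .')
    return links

links = generate_links(SOURCE_A, SOURCE_B)
print("\n".join(links))
```

The two entries labelled Paris share a label but lie far apart, so the minimum aggregation keeps them from being linked.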




Evaluating Linkage Rules
   Gold standard in the form of reference links
    ●   Positive links (definitive matches)
    ●   Negative links (definitive non-matches)
   Based on the reference links, we can determine the number
    of correct and incorrect matches
   We distinguish between 4 cases:

                                 Positive link      Negative link
        match(a,b) = link        True positive      False positive
        match(a,b) = nonlink     False negative     True negative


Evaluating Linkage Rules
   Recall: Ratio of correct links compared to all known links
$$\mathrm{recall} = \frac{|\text{true positives}|}{|\text{true positives}| + |\text{false negatives}|}$$

   Precision: Ratio of correct links compared to all found links
$$\mathrm{precision} = \frac{|\text{true positives}|}{|\text{true positives}| + |\text{false positives}|}$$

   F-measure: Harmonic mean of precision and recall
$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
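
The three measures can be computed directly from a set of generated links and the reference links. In this sketch precision is taken over the links found (true positives plus false positives) and recall over the known positive links (true positives plus false negatives). The function name, the link representation as pairs, and the sample data are assumptions.

```python
def evaluate(generated_links, positive_links, negative_links):
    """Return (precision, recall, f_measure) against the reference links."""
    generated = set(generated_links)
    true_positives = len(generated & positive_links)
    false_positives = len(generated & negative_links)
    false_negatives = len(positive_links - generated)

    found = true_positives + false_positives
    known = true_positives + false_negatives
    precision = true_positives / found if found else 0.0
    recall = true_positives / known if known else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical reference links and rule output.
positive = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
negative = {("a3", "b9"), ("a5", "b7")}
generated = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}

print(evaluate(generated, positive, negative))
```

Generated links that appear in neither reference set are ignored here, since the gold standard says nothing about them.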


Recall-Precision diagram
    A recall-precision diagram visualizes the trade-off between
     maximizing the recall and maximizing the precision




From: Creating probabilistic databases from duplicated data, Oktie Hassanzadeh · Renée J. Miller (VLDBJ)
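
The points of such a diagram can be obtained by sweeping the acceptance threshold over the similarity scores, as in the sketch below. The scored pairs and the set of positive reference links are made-up inputs, and the function name is hypothetical.

```python
def precision_recall_points(scored_pairs, positive_links):
    """Return (threshold, precision, recall) for every distinct score.

    scored_pairs maps an entity pair to its similarity score;
    positive_links is the non-empty set of known correct links.
    """
    points = []
    for threshold in sorted(set(scored_pairs.values()), reverse=True):
        # All pairs at or above the threshold are accepted as links.
        found = {pair for pair, score in scored_pairs.items() if score >= threshold}
        true_positives = len(found & positive_links)
        precision = true_positives / len(found)
        recall = true_positives / len(positive_links)
        points.append((threshold, precision, recall))
    return points

scored = {
    ("a1", "b1"): 0.9,
    ("a2", "b2"): 0.7,
    ("a3", "b9"): 0.6,
    ("a4", "b4"): 0.4,
}
positives = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

for point in precision_recall_points(scored, positives):
    print(point)
```

Lowering the threshold accepts more pairs, which raises recall but tends to lower precision.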
Outline
   Motivation
   Link Discovery Tools
   Linking Workflow
   Silk Workbench




Silk Workbench

   Silk Workbench offers a GUI for:
    ●   Managing different data sources and linkage rules
    ●   Creating linkage rules
    ●   Executing linkage rules
    ●   Evaluating linkage rules
    ●   Learning Linkage Rules




Workspace
The Workspace holds a set of projects
consisting of:


   Data Sources
    ●   Holds all information that is needed
        by Silk to retrieve entities from it. 
    ●   Usually a file dump or a SPARQL
        endpoint
   Linking Tasks
    ●   Interlinks a type of entity between
        two data sources
    ●   e.g. interlinking movies in DBpedia
        and LinkedMDB


Linkage Rule Editor
   Allows viewing and editing linkage rules
   Linkage Rules are shown as a tree
   Editing using drag & drop.




Generating Links




Managing Reference Links




Conclusion
   In order to publish a new dataset or to consume an existing
    dataset, we need to generate links
   A linkage rule specifies the conditions which must hold true
    for two entities in order to be considered the same real-
    world object.
   The Silk Workbench provides a graphical user interface to
    create and edit linking tasks
   The hands-on session will cover a simple example interlinking
    musical artists in Freebase and DBpedia




Q&A




Designing IA for AI - Information Architecture Conference 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 

Instance Matching

  • 1. WWW2012 Tutorial: Practical Cross-Dataset Queries on the Web of Data
      Instance Matching
      Robert Isele, Freie Universität Berlin
  • 2. Outline
      ▪ Motivation
      ▪ Link Discovery Tools
      ▪ Linking Workflow
      ▪ Silk Workbench
  • 3. Motivation
      ▪ The Web of Data is a single global data space because data sources are connected by links
      ▪ Over 31 billion triples published as Linked Open Data and growing
      ▪ But:
          ● Less than 500 million links
          ● Most publishers only link to one other dataset
  • 4. Use Case 1: Publishing a New Dataset
      ▪ A data provider wants to publish a new dataset
      ▪ The provider wants to interlink it with existing datasets from the same domain
      ▪ Example
          ● A data publisher wants to publish a new dataset about movies
          ● Interlink movies with LinkedMDB (Linked Movie Data Base)
          ● Interlink directors with DBpedia (Wikipedia)
  • 5. Use Case 2: Linked Data Application
      ▪ A Linked Data application integrates multiple data sources from the same domain
      ▪ In the decentralized Web of Data, many data sources use different URIs for the same real-world object
      ▪ Identifying these URI aliases is a central problem in Linked Data
  • 6. Challenges for Link Discovery
      ▪ The Web of Data is heterogeneous
          ● Many different vocabularies are in use
          ● Different data formats
          ● Many different ways to represent the same information
      [Figure: Distribution of the most widely used vocabularies]
  • 7. Challenges for Link Discovery
      ▪ Large range of domains
          ● 256 data sources in the LOD cloud from a variety of domains
          ● Linkage rules are different in each domain
          ● Writing a linkage rule for each of these domains is usually not trivial
      [Figure: Distribution of triples by domain]
  • 8. Challenges for Link Discovery
      ▪ Scalability
          ● The current LOD cloud contains 277 datasets (August 2011)
          ● 30 billion triples in total
          ● Infeasible to compare every possible entity pair
  • 9. Link Discovery Tools
      ▪ Tools enable data publishers to set links
      ▪ Most tools generate links based on user-defined linkage rules
      ▪ A linkage rule specifies the conditions data items must fulfill in order to be interlinked
      ▪ Popular Link Discovery Tools:
          ● Silk Link Discovery Framework
          ● LIMES
          ● Others: http://esw.w3.org/TaskForces/CommunityProjects/LinkingOpenData/EquivalenceMining
  • 10. Silk Link Discovery Framework
      ▪ Tool for discovering links between data items within different Linked Data sources
      ▪ The Silk Link Specification Language (Silk-LSL) allows complex linkage rules to be expressed
      ▪ Can be used to generate owl:sameAs links as well as other relationships
      ▪ Scalability and high performance through efficient data handling
  • 11. Silk Versions
      ▪ Silk Single Machine
          ● Generate links on a single machine
          ● Local or remote data sets
      ▪ Silk MapReduce
          ● Generate RDF links using a cluster of multiple machines
          ● Based on Hadoop (can be run on Amazon Elastic MapReduce)
      ▪ Silk Server
          ● Provides an HTTP API for matching instances from an incoming stream of RDF data while keeping track of known entities
          ● Can be used as an identity resolution component within applications that consume Linked Data from the Web
  • 12. Silk Workbench
      ▪ Silk Workbench is a web application which guides the user through the process of interlinking different data sources
      ▪ Enables the user to manage different sets of data sources and linking tasks
      ▪ Offers a graphical editor which enables the user to easily create and edit linkage rules
      ▪ Offers tools to evaluate the current linkage rule
      ▪ Includes experimental support for learning linkage rules
  • 13. Linking Workflow
  • 14. Typical linkage rule
      ▪ Select the values to be compared
          ● Example: Select labels and dates of a music record
      ▪ Normalize the values
          ● Example: Transform dates to a common format
      ▪ Compare different values using similarity measures
          ● Example: Compare labels and dates of a music record
      ▪ Aggregate the results of multiple comparisons
          ● Example: Compute the average of the label and date similarity
  • 15. Value selectors
      ▪ Values in the graph around the entities can be used for comparison
      ▪ Property path languages have been developed for that purpose
      ▪ Examples (SPARQL 1.1 Property Paths Language):
          ● Entity label: rdfs:label
          ● Movie director name: dbpedia-owl:director/foaf:name
          ● All movies of a director: ^dbpedia-owl:director/rdfs:label
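As a minimal sketch of how such paths select values, the snippet below evaluates sequence (`/`) and inverse (`^`) steps over a small in-memory list of triples. The triples, the `ex:` identifiers, and the helper names `follow` and `select` are illustrative assumptions; the code covers only a small subset of SPARQL 1.1 property paths and is not meant to show how Silk evaluates paths internally.

```python
# Toy triple store: (subject, predicate, object). All data here is made up.
TRIPLES = [
    ("ex:Inception", "rdfs:label", "Inception"),
    ("ex:Inception", "dbpedia-owl:director", "ex:Nolan"),
    ("ex:Memento", "rdfs:label", "Memento"),
    ("ex:Memento", "dbpedia-owl:director", "ex:Nolan"),
    ("ex:Nolan", "foaf:name", "Christopher Nolan"),
]


def follow(nodes, step):
    """Follow one path step from a set of nodes; a '^' prefix inverts the direction."""
    inverse = step.startswith("^")
    prop = step[1:] if inverse else step
    result = set()
    for s, p, o in TRIPLES:
        if p != prop:
            continue
        if inverse and o in nodes:
            result.add(s)
        elif not inverse and s in nodes:
            result.add(o)
    return result


def select(entity, path):
    """Evaluate a '/'-separated sequence path starting at a single entity."""
    nodes = {entity}
    for step in path.split("/"):
        nodes = follow(nodes, step)
    return nodes


print(select("ex:Inception", "dbpedia-owl:director/foaf:name"))
print(sorted(select("ex:Nolan", "^dbpedia-owl:director/rdfs:label")))
```

The inverse step starts at the director and walks the `dbpedia-owl:director` edge backwards, which is why the last call returns the labels of all movies by that director.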
  • 16. Data Transformations
      ▪ Different data sets may use different data formats
      ▪ Data sets may be noisy
      ⇒ Values must be normalized prior to comparison
  • 17. Common Transformations
      ▪ Case normalization
      ▪ Structural transformation
      ▪ Extract values from URIs
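A few of these transformations can be sketched with the Python standard library. The function names, the example URI, and the list of accepted date formats are assumptions chosen for illustration; the date helper reflects the "transform dates to a common format" example from the typical linkage rule.

```python
from datetime import datetime
from urllib.parse import unquote, urlparse


def normalize_case(value):
    """Case normalization: trim surrounding whitespace and lower-case the value."""
    return value.strip().lower()


def label_from_uri(uri):
    """Extract a readable value from a URI: its fragment or last path segment."""
    parsed = urlparse(uri)
    local = parsed.fragment or parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return unquote(local).replace("_", " ")


def normalize_date(value, formats=("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")):
    """Try each assumed input format and return an ISO 8601 date, or None."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(normalize_case("  Sean CONNERY "))
print(label_from_uri("http://dbpedia.org/resource/Sean_Connery"))
print(normalize_date("25.08.1930"))
```

Returning `None` for unparseable dates leaves it to the caller to decide whether a missing value counts as a non-match or is simply skipped.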
  • 18. Similarity Measures
      ▪ A similarity measure compares two values
      ▪ It returns a value between 0 (no similarity) and 1 (equality)
      ▪ Formally, a similarity measure is a function:
          $\mathrm{sim} : \Sigma^* \times \Sigma^* \to [0,1]$
      ▪ Various similarity measures have been proposed
          ● Character-based measures
          ● Token-based measures
          ● Domain-specific measures
  • 19. Character-Based Similarity Measures
      ▪ Usually rely on character edit operations
      ▪ Often used for catching typographical errors
      ▪ Most popular
          ● Levenshtein
          ● Jaro/Jaro-Winkler
  • 20. Levenshtein Distance
      ▪ The minimum number of edits needed to transform one string into the other
      ▪ Allowed edit operations:
          ● insert a character into the string
          ● delete a character from the string
          ● replace one character with a different character
      ▪ Examples:
          ● levenshtein('Table', 'Cable') = 1 (1 substitution)
          ● levenshtein('Table', 'able') = 1 (1 deletion)
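The definition above can be sketched as a standard dynamic-programming routine that keeps only two rows of the edit table. The function names are illustrative, and turning the distance into a similarity by dividing by the longer string length is one common normalization, assumed here rather than taken from the slides.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 for equal strings, 0 when every character differs."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("Table", "Cable"))  # 1
print(levenshtein("Table", "able"))   # 1
```

The normalized form fits the requirement that a similarity measure returns a value between 0 and 1, so it can be aggregated with other measures.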
  • 21. Token-Based Similarity Measures
      ▪ Character-based measures work well for typographical errors, but fail when word arrangements differ
      ▪ Example: 'John Doe', 'Doe, John', 'Mr. John Doe'
      ▪ Token-based measures split the values into tokens before computing the similarity
      ▪ Example: tokenize('Mr. John Doe') = {'Mr.', 'John', 'Doe'}
      ▪ Most popular: Jaccard, Dice
  • 22. Jaccard coefficient
      ▪ Intuition: Measure the fraction of the tokens which are shared by both strings
      ▪ Defined as the number of matching words divided by the total number of distinct words:
          $\mathrm{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
      ▪ Example:
          $\mathrm{Jaccard}(\{\text{Thomas}, \text{Sean}, \text{Connery}\}, \{\text{Sir}, \text{Sean}, \text{Connery}\}) = \frac{2}{4} = 0.5$
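A short sketch of token-based comparison follows, using whitespace tokenization as a simplifying assumption. The function names are illustrative; the Dice coefficient is included in its usual form, 2|A∩B| / (|A| + |B|), since it is named alongside Jaccard as a popular measure.

```python
def tokenize(value):
    """Split a value on whitespace into a set of tokens (a simplifying assumption)."""
    return set(value.split())


def jaccard(a, b):
    """Shared tokens divided by all distinct tokens of both sets."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty token sets are equal
    return len(a & b) / len(a | b)


def dice(a, b):
    """Twice the shared tokens divided by the summed sizes of both sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


left = tokenize("Thomas Sean Connery")
right = tokenize("Sir Sean Connery")
print(jaccard(left, right))  # 0.5
print(dice(left, right))
```

Because the tokens are compared as sets, word order does not affect the score; punctuation attached to a token does, so values are usually normalized first.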
  • 23. Domain-Specific Similarity Measures
      ▪ Geographic distance
      ▪ Date/Time
      ▪ Numbers
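As one possible illustration of a domain-specific measure, the sketch below turns geographic distance into a similarity score. The haversine formula is a standard way to compute great-circle distance; the linear fall-off, the `max_km` cut-off of 50 km, and the function names are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_similarity(p, q, max_km=50.0):
    """Assumed mapping: 1 at distance 0, falling linearly to 0 at max_km and beyond."""
    distance = haversine_km(p[0], p[1], q[0], q[1])
    return max(0.0, 1.0 - distance / max_km)


berlin = (52.52, 13.405)
paris = (48.8566, 2.3522)
print(round(haversine_km(*berlin, *paris)))
print(geo_similarity(berlin, paris))
```

Date and numeric measures can follow the same pattern: compute a domain-appropriate difference and scale it into the interval from 0 to 1.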
  • 24. Aggregating Similarity Values
      ▪ In order to determine if two entities are duplicates it is usually not sufficient to compare a single property
      ▪ Aggregation functions aggregate the similarity of multiple comparisons
      ▪ Example: Interlinking geographical datasets
          ● Compare by label and geographic coordinates
          ● Aggregate similarity values
  • 25. Popular Aggregation Functions
      ▪ Minimum
          ● Choose the lowest value
          ● ⇒ All values must exceed the threshold
      ▪ Maximum
          ● Choose the highest value
          ● ⇒ At least one value must exceed the threshold
      ▪ Weighted Average
          ● Assign a weight to each comparison
          ● Compute the weighted mean
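The three aggregation functions can be sketched in a few lines. The function names, the example scores, and the weights are made up for illustration.

```python
def aggregate_min(scores):
    """Minimum: the result passes a threshold only if every comparison does."""
    return min(scores)


def aggregate_max(scores):
    """Maximum: the result passes a threshold if at least one comparison does."""
    return max(scores)


def weighted_average(scores, weights):
    """Weighted mean of the similarity scores."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical similarities for a label comparison and a date comparison.
scores = [0.9, 0.6]
print(aggregate_min(scores))             # 0.6
print(aggregate_max(scores))             # 0.9
print(weighted_average(scores, [2, 1]))  # label counts twice as much as date
```

With a threshold of 0.8, the minimum rejects this pair, the maximum accepts it, and the weighted average lands on the boundary, which shows how much the choice of aggregation matters.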
  • 26. Putting it all together
  • 27. Example
      ▪ Interlink cities in different data sources:
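A hedged end-to-end sketch of such a rule for cities: select the label and coordinates, normalize the label, compare both with a similarity measure, and aggregate with the minimum. The record layout, the city data, the 50 km cut-off, and the 0.8 threshold are all assumptions for illustration; this is plain Python, not Silk-LSL.

```python
import math


def levenshtein(a, b):
    """Edit distance via two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def label_similarity(a, b):
    """Normalize case and whitespace, then scale edit distance into [0, 1]."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def geo_similarity(p, q, max_km=50.0):
    """Haversine distance mapped linearly to [0, 1]; 0 at max_km and beyond."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
         * math.sin(math.radians(q[1] - p[1]) / 2) ** 2)
    return max(0.0, 1.0 - 2 * 6371.0 * math.asin(math.sqrt(a)) / max_km)


def is_match(city_a, city_b, threshold=0.8):
    """Linkage rule: both label and location must be similar (minimum aggregation)."""
    scores = [
        label_similarity(city_a["label"], city_b["label"]),
        geo_similarity(city_a["coord"], city_b["coord"]),
    ]
    return min(scores) >= threshold


# Hypothetical records from two data sources.
berlin_a = {"label": "Berlin", "coord": (52.52, 13.405)}
berlin_b = {"label": "berlin ", "coord": (52.5167, 13.3833)}
berlin_nh = {"label": "Berlin", "coord": (44.47, -71.18)}  # Berlin, New Hampshire

print(is_match(berlin_a, berlin_b))   # same city, slightly different data
print(is_match(berlin_a, berlin_nh))  # same label, different continent
```

Minimum aggregation is used because a matching label alone is not enough for cities: many places share a name, so the coordinates must agree as well.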
  • 28. Evaluating Linkage Rules
      ▪ Gold standard in the form of reference links
          ● Positive links (definitive matches)
          ● Negative links (definitive non-matches)
      ▪ Based on the reference links, we can determine the number of correct and incorrect matches
      ▪ We distinguish between 4 cases:

                                  Positive Link     Negative Link
          match(a,b) = link       True positive     False positive
          match(a,b) = nonlink    False negative    True negative
  • 29. Evaluating Linkage Rules
      ▪ Recall: Ratio of correctly found links to all known positive links
          $\mathrm{recall} = \frac{|\text{true positives}|}{|\text{true positives}| + |\text{false negatives}|}$
      ▪ Precision: Ratio of correctly found links to all found links
          $\mathrm{precision} = \frac{|\text{true positives}|}{|\text{true positives}| + |\text{false positives}|}$
      ▪ F-measure: Harmonic mean of precision and recall
          $F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
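These measures can be computed directly from a set of generated links and the reference links. The sketch uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN); the function name, the link identifiers, and the choice to return 0.0 for empty denominators are assumptions. Generated links that appear in neither reference set are ignored, because the gold standard says nothing about them.

```python
def evaluate(generated, positive, negative):
    """Return (precision, recall, f_measure) for a set of generated links."""
    true_positives = len(generated & positive)
    false_positives = len(generated & negative)
    false_negatives = len(positive - generated)

    found = true_positives + false_positives
    known = true_positives + false_negatives
    precision = true_positives / found if found else 0.0
    recall = true_positives / known if known else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical reference links and rule output, as (source, target) pairs.
positive = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
negative = {("a1", "b2"), ("a2", "b3")}
generated = {("a1", "b1"), ("a1", "b2")}

precision, recall, f_measure = evaluate(generated, positive, negative)
print(precision, recall, f_measure)
```

Here one of the two generated links is correct (precision 0.5) and one of the three known matches is found (recall 1/3), which gives an F-measure of 0.4.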
  • 30. Recall-Precision diagram
      ▪ A recall-precision diagram visualizes the trade-off between maximizing the recall and maximizing the precision
      [Figure from: Creating probabilistic databases from duplicated data, Oktie Hassanzadeh, Renée J. Miller (VLDBJ)]
  • 31. Outline
      ▪ Motivation
      ▪ Link Discovery Tools
      ▪ Linking Workflow
      ▪ Silk Workbench
  • 32. Silk Workbench
      ▪ Silk Workbench offers a GUI for:
          ● Managing different data sources and linkage rules
          ● Creating linkage rules
          ● Executing linkage rules
          ● Evaluating linkage rules
          ● Learning linkage rules
  • 33. Workspace
      The Workspace holds a set of projects consisting of:
      ▪ Data Sources
          ● A data source holds all information that Silk needs to retrieve entities from it
          ● Usually a file dump or a SPARQL endpoint
      ▪ Linking Tasks
          ● A linking task interlinks a type of entity between two data sources
          ● e.g. interlinking movies in DBpedia and LinkedMDB
  • 34. Linkage Rule Editor
      ▪ Allows viewing and editing linkage rules
      ▪ Linkage rules are shown as a tree
      ▪ Editing using drag & drop
  • 35. Generating Links
  • 36. Managing Reference Links
  • 37. Conclusion
      ▪ In order to publish a new dataset or to consume an existing dataset we need to generate links
      ▪ A linkage rule specifies the conditions which must hold true for two entities in order to be considered the same real-world object
      ▪ The Silk Workbench provides a graphical user interface to create and edit linking tasks
      ▪ The hands-on session will cover a simple example interlinking musical artists in Freebase and DBpedia
  • 38. Q&A