https://www.insight-centre.org/content/leveraging-matching-dependencies-guided-user-feedback-linked-data-applications
Presented at IIWeb2012
ABSTRACT
This paper presents a new approach for managing integration quality and user feedback for entity consolidation within applications consuming Linked Open Data. The quality of a dataspace containing multiple linked datasets is defined in terms of a utility measure based on domain-specific matching dependencies. Furthermore, the user is involved in the consolidation process by soliciting feedback about identity resolution links, where each candidate link is ranked according to its benefit to the dataspace, calculated by approximating the improvement in the utility of the dataspace. The approach, evaluated on real-world and synthetic datasets, demonstrates the effectiveness of the utility measure through dataspace integration quality improvement that requires fewer overall user feedback iterations.
Leveraging Matching Dependencies for Guided User Feedback in Linked Data Applications
1. Digital Enterprise Research Institute www.deri.ie
Leveraging Matching Dependencies for Guided
User Feedback in Linked Data Applications
Umair ul Hassan, Sean O’Riain, Edward Curry
Digital Enterprise Research Institute
National University of Ireland, Galway
Copyright 2011 Digital Enterprise Research Institute. All rights reserved.
2. Outline
Motivation & Problem Space
Identity Resolution on the Linked Open Data (LOD) Web
Proposed Approach
LOD Application Architecture
How it relates to existing works
Evaluation
Conclusion & Future Work
3. Overview
Identity Resolution in the Linked Open Data Web
Real-world entities have multiple identifiers in LOD
Identity resolution links have associated uncertainty
LOD Applications require user verification of links
Problem
Feedback for all links is infeasible for large datasets
LOD Applications have domain specific utility of links
Proposed Approach
Leverages matching dependencies to define domain specific
requirements of identity resolution
Ranks identity resolution links according to value of perfect information
4. Linked Open Data (LOD)
Expose and interlink datasets on the Web
Using URIs to identify “things” in your data
Using a graph representation (RDF) to describe URIs
Vision: The Web as a huge graph database
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
5. Linked Data Example
Identity resolution links
Multiple Identifiers
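As an illustration of how identity resolution links tie multiple identifiers to one real-world entity, the following is a minimal sketch that groups URIs into clusters with a union-find structure. The URIs and the function name `consolidate` are hypothetical and not taken from the slides; the links are treated as already accepted, which ignores the uncertainty discussed later.

```python
from collections import defaultdict


def consolidate(links):
    """Group URIs connected by identity links into entity clusters."""
    parent = {}

    def find(uri):
        # Register unseen URIs as their own root, then walk up to the root.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identifiers for two cities published by three sources.
links = [
    ("http://example.org/a/Galway", "http://example.org/b/galway_city"),
    ("http://example.org/b/galway_city", "http://example.org/c/Q129610"),
    ("http://example.org/a/Dublin", "http://example.org/b/dublin_city"),
]
clusters = consolidate(links)
for members in clusters:
    print(members)
```

Because the grouping is transitive, a single wrong link merges two unrelated entities, which is one reason link verification matters for consolidation.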
6. Identity Resolution in LOD
Identity resolution is required for consolidation of data in
applications consuming LOD
Three sources of identity resolution links
Provided by data publishers (e.g. dbpedia.org)
Generated by consumers through tools (e.g. SILK, SERIMI, RiMOM)
Maintained by third party web services (e.g. sameas.org)
Uncertainty associated with links
Due to multiple identity equivalence interpretations
Due to characteristics of link generation algorithms (similarity based)
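To make the similarity-based source of uncertainty concrete, here is a rough sketch of candidate link generation by label similarity. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; it is not how Silk or the other tools named above work internally, and the `ex:` identifiers, labels and threshold are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_links(source, target, threshold=0.7):
    """Return (source_uri, target_uri, score) for label pairs above threshold."""
    links = []
    for (s_uri, s_label), (t_uri, t_label) in product(source.items(), target.items()):
        score = similarity(s_label, t_label)
        if score >= threshold:
            links.append((s_uri, t_uri, round(score, 3)))
    # Highest-scoring candidates first.
    return sorted(links, key=lambda link: link[2], reverse=True)


# Hypothetical labels from two datasets.
source = {"ex:a1": "Aspirin", "ex:a2": "Ibuprofen"}
target = {"ex:b1": "aspirin", "ex:b2": "Ibuprofene", "ex:b3": "Paracetamol"}

links = candidate_links(source, target)
for link in links:
    print(link)
```

A score below 1.0 is evidence rather than proof of identity, so such links remain candidates until a user or expert confirms them.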
7. Identity Resolution Problem
User feedback for uncertain links
Verify uncertain identity resolution links with users/experts
Improve quality of entity consolidation
Challenges
Domain specific semantic requirements
– How to define domain specific requirements of quality for Linked
Data applications?
Limited user attention
– How to rank candidate links according to their benefit to maximize
utility of user feedback?
8. Identity Resolution Problem
User feedback for uncertain links
Verify uncertain identity resolution links with users/experts
Improve quality of entity consolidation
Proposed Approach
Domain specific semantic requirements
– Leverage Matching Dependencies
Limited user attention
– Employ value of perfect information theory
9. LOD Application Architecture
[Architecture diagram: Utility Module, Feedback Module and Consolidation Module. Flow labels: Candidate Links, Questions, Rules, Feedback, Matching Dependencies, Utility Improvement, Ranked Feedback Tasks]
Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition), 1-136. Morgan & Claypool.
10. Related Work
Jeffery et al., “Pay-as-you-go user feedback for dataspace
systems,” in Proceedings of the 2008 ACM SIGMOD
Conference, 2008, pp. 847-860.
Utility:
In terms of cardinality of query results on dataspace
General metric not suitable for application specific data quality
Assumption:
Availability of global query statistics
– Problematic for Linked Open Data
11. Proposed Approach
Domain Specific Utility
Define utility in terms of user-specified rules, i.e. matching dependencies
Rank candidate links for user feedback according to value of perfect
information
Assumptions
We assume matching dependencies are either provided by user or generated
through existing tools
Utility is based on satisfaction ratio of dependencies in dataspace
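The slides do not spell out how the satisfaction ratio is computed, so the following is only one plausible reading, sketched under stated assumptions: a matching dependency is modelled as a set of attribute similarity thresholds whose conclusion is that the two entities should be identified, and the ratio is the share of entity pairs meeting the thresholds that are actually linked. The rule encoding, the `difflib` similarity, and the `ex:` entities are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, threshold):
    """True if two values are at least `threshold` similar (assumed measure)."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio() >= threshold


def lhs_holds(e1, e2, rule):
    """Check the premise of the rule: every listed attribute is similar enough."""
    return all(
        attr in e1 and attr in e2 and similar(e1[attr], e2[attr], threshold)
        for attr, threshold in rule.items()
    )


def satisfaction_ratio(entities, links, rule):
    """Share of premise-matching entity pairs that are linked as identical."""
    linked = {frozenset(link) for link in links}
    matching = [
        (u, v)
        for u, v in combinations(sorted(entities), 2)
        if lhs_holds(entities[u], entities[v], rule)
    ]
    if not matching:
        return 1.0  # assumption: a rule with no applicable pairs is satisfied
    satisfied = sum(1 for pair in matching if frozenset(pair) in linked)
    return satisfied / len(matching)


# Hypothetical person records and rule: similar name and equal birth year.
entities = {
    "ex:p1": {"name": "John Smith", "birth": "1970"},
    "ex:p2": {"name": "Jon Smith", "birth": "1970"},
    "ex:p3": {"name": "Mary Jones", "birth": "1982"},
    "ex:p4": {"name": "Mary Jones", "birth": "1982"},
}
rule = {"name": 0.85, "birth": 1.0}

print(satisfaction_ratio(entities, [("ex:p1", "ex:p2")], rule))
```

With only one of the two rule-matching pairs linked, the ratio is 0.5; linking `ex:p3` and `ex:p4` as well raises it to 1.0.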
12. Proposed Approach
Matching Dependencies
Matching Rule
Example
Utility of rule
Value of Perfect Information
$g(m_k) = U(D^{m_k}, M \setminus \{m_k\})\, p_k + U(D^{\neg m_k}, M \setminus \{m_k\})\,(1 - p_k) - U(D, M)$
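The value of perfect information for a candidate link can be read as the expected utility after the user answers (confirmed with probability p_k, rejected otherwise) minus the current utility. The sketch below applies that arithmetic to rank candidates. The utility function here is a toy assumption, not the authors' dependency-based measure: it counts only confirmed links, weighted by hypothetical per-link weights, and the link names and probabilities are invented.

```python
def vpi(utility, confirmed, candidates, k):
    """Expected utility gain from asking the user about candidate k.

    gain = U(confirmed + m_k, M - {m_k}) * p_k
         + U(confirmed,       M - {m_k}) * (1 - p_k)
         - U(confirmed,       M)
    """
    link, p = candidates[k]
    rest = candidates[:k] + candidates[k + 1:]
    u_if_confirmed = utility(confirmed | {link}, rest)
    u_if_rejected = utility(confirmed, rest)
    u_current = utility(confirmed, candidates)
    return u_if_confirmed * p + u_if_rejected * (1 - p) - u_current


def rank_by_vpi(utility, confirmed, candidates):
    """Return (gain, link) pairs, most beneficial feedback question first."""
    scored = [
        (vpi(utility, confirmed, candidates, k), link)
        for k, (link, _) in enumerate(candidates)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical weights: how much each link contributes once confirmed.
WEIGHTS = {"l1": 3, "l2": 1, "l3": 2}


def toy_utility(confirmed, pending):
    # Assumption: uncertain (pending) links contribute nothing until verified.
    return sum(WEIGHTS[link] for link in confirmed) / sum(WEIGHTS.values())


# (link, probability that the link is correct)
candidates = [("l1", 0.4), ("l2", 0.9), ("l3", 0.8)]
ranking = rank_by_vpi(toy_utility, set(), candidates)
for gain, link in ranking:
    print(link, round(gain, 3))
```

Under this toy utility the gain reduces to weight times probability, so a likely but low-impact link (`l2`) ranks below a less certain but higher-impact one (`l1`).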
13. Evaluation
Measure change in utility of a dataspace according to
matching rules after a specific number of feedback iterations
Candidate links generated by the Silk framework
14. Evaluation
Datasets
| | IIMB 2009 Dataset | UCI-Adult Dataset | Drug Dataset |
|---|---|---|---|
| Data Source | Instance Matching Benchmark 2009 | UCI Machine Learning Repository | Instance Matching Benchmark 2010 |
| Data Collection | IIMB 2009: Reference Ontology; Ontology #16 with errors in data value attributes | US Census Dataset: manually created duplicates and errors | DrugBank and Sider Datasets: interlinking between two datasets of same domain |
| Entity Types | imdb:Movie | foaf:Person | drugbank:drugs, sider:drugs |
| Total Triples | 291 | 64000 | 14348 |
| Total Entity IDs | 44 | 4000 | 5696 |
| Total Attributes | 9 | 16 | 3 |
| Total Values | 130 | 10878 | 8473 |
| Candidate Links | 81 | 72 | 94 |
| Correct Links | 22 | 72 | 66 |
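The last two rows indicate how noisy the candidate links are in each dataset. As a small worked calculation, the share of candidate links that are correct can be derived directly from the reported counts; calling this share "precision" is a labelling choice made here, not a term used on the slide.

```python
# (candidate links, correct links) as reported in the dataset table.
datasets = {
    "IIMB 2009": (81, 22),
    "UCI-Adult": (72, 72),
    "Drug": (94, 66),
}

precision = {
    name: correct / candidates
    for name, (candidates, correct) in datasets.items()
}

for name, value in precision.items():
    print(f"{name}: {value:.1%}")
# IIMB 2009: 27.2%
# UCI-Adult: 100.0%
# Drug: 70.2%
```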
16. Conclusion
Matching dependencies provide an effective mechanism to:
Represent entity matching rules
Specify domain specific semantic requirements
Measure utility of dataspaces
Value of perfect information enables effective ranking strategy
for user feedback
In all three datasets, 100% utility improvement was reached
with under 40% of the user feedback
17. Future Work
Expand to other data quality problems
Expand on types of dependencies such as comparable
dependencies and order dependencies
Allow multi-user feedback for collaborative data cleaning
Editor's notes
Personal background
Executive summary vs. overview
The complete stack of semantic web technologies is based on open standards and protocols. The semantic web technologies focus on the application layer of the internet stack.
Go back to research question slides. Go back to workflow and highlight what's needed.