Experiments with
evolving RDF
Sławek Staworko
(joint work with Peter Buneman)
University of Edinburgh
Preservation of evolving data
Version 1: (Tom, has, cat), (cat, eats, tuna)
Version 2: (Tom, has, cat), (cat, dies, Apr 1)
Version 3: (Tom, has, dog), (dog, eats, dog food)
…
Archive
• Version retrieval
• Timeline queries
• Storage space efficiency
Approaches to data
preservation
• Store all versions
• Store the original databases and log the changes
• Hybrid approach of the above two
• store the initial and every 10th version
• store log changes for the intermediate versions
• Annotation-based approach
• never delete data but annotate its validity with
time intervals
Annotation of RDF
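Version 1: (Tom, has, cat), (cat, eats, tuna)
Version 2: (Tom, has, cat), (cat, dies, Apr 1)
Version 3: (Tom, has, dog), (dog, eats, dog food)
Archive (each triple annotated with the versions in which it is valid):
(Tom, has, cat) [1–2]
(cat, eats, tuna) [1–1]
(cat, dies, Apr 1) [2–2]
(Tom, has, dog) [3–]
(dog, eats, dog food) [3–]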
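A minimal sketch of the annotation idea, assuming triples are plain Python tuples compared by value and versions are numbered from 1. The function names `annotate` and `retrieve` are illustrative only and do not come from the talk. An end of `None` stands for an interval that is still open.

```python
from collections import defaultdict

# The three snapshots from the slide, as sets of (subject, predicate, object).
SNAPSHOTS = [
    {("Tom", "has", "cat"), ("cat", "eats", "tuna")},
    {("Tom", "has", "cat"), ("cat", "dies", "Apr 1")},
    {("Tom", "has", "dog"), ("dog", "eats", "dog food")},
]


def annotate(snapshots):
    """Map each triple to a list of [start, end] validity intervals.

    Versions are 1-based and intervals are inclusive; end is None while
    the triple is still present in the latest version.
    """
    archive = defaultdict(list)
    for version, snapshot in enumerate(snapshots, start=1):
        # Close open intervals of triples that are absent from this snapshot.
        for triple, intervals in archive.items():
            if intervals[-1][1] is None and triple not in snapshot:
                intervals[-1][1] = version - 1
        # Open a new interval for triples that were not valid just before.
        for triple in snapshot:
            intervals = archive[triple]
            if not intervals or intervals[-1][1] is not None:
                intervals.append([version, None])
    return dict(archive)


def retrieve(archive, version):
    """Version retrieval: all triples whose intervals cover `version`."""
    return {
        triple
        for triple, intervals in archive.items()
        if any(start <= version and (end is None or version <= end)
               for start, end in intervals)
    }


archive = annotate(SNAPSHOTS)
```

Nothing is ever deleted from `archive`; a triple that disappears only gets its interval closed, which is what makes both version retrieval and timeline queries possible from a single structure.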
What exactly is the input?
Delta = difference between two databases expressed with
two atomic operations: inserting a triple and deleting a triple
Snapshots = complete database instances
Snapshot 1: (Tom, has, cat), (cat, eats, tuna)
Snapshot 2: (Tom, has, cat), (cat, dies, Apr 1)
Snapshot 3: (Tom, has, dog), (dog, eats, dog food)
Deltas
Version 1 to 2:
delete (cat, eats, tuna)
insert (cat, dies, Apr 1)
Version 2 to 3:
delete (Tom, has, cat)
insert (Tom, has, dog)
insert (dog, eats, dog food)
delete (cat, dies, Apr 1)
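When every node has a stable name, a delta between two snapshots is just a pair of set differences. The sketch below assumes triples are tuples compared by value; `delta` and `apply_delta` are hypothetical helper names. This assumption is exactly what stops holding once blank nodes appear.

```python
V1 = {("Tom", "has", "cat"), ("cat", "eats", "tuna")}
V2 = {("Tom", "has", "cat"), ("cat", "dies", "Apr 1")}
V3 = {("Tom", "has", "dog"), ("dog", "eats", "dog food")}


def delta(old, new):
    """Return (deletes, inserts) that turn snapshot `old` into `new`."""
    return old - new, new - old


def apply_delta(snapshot, deletes, inserts):
    """Replay a delta on a snapshot to obtain the next version."""
    return (snapshot - deletes) | inserts


deletes_12, inserts_12 = delta(V1, V2)
deletes_23, inserts_23 = delta(V2, V3)
```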
Challenges in preserving
evolving data with annotations
1. The task is relatively simple if deltas are known:
• deleting a triple closes its interval
• adding a triple opens a new interval
2. It gets complicated when only snapshots are given:
• it boils down to computing deltas
• main challenge: identify objects that are the same across
versions of the database
Entity resolution problem:
which data objects represent the same entity across different versions;
a well-studied database problem in various settings
(from duplicate elimination to record matching)
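The simple case (deltas known) can be sketched as follows. The function name `apply_delta_to_archive` and the dictionary layout are assumptions made for illustration; the first version is treated as a delta that only inserts.

```python
def apply_delta_to_archive(archive, version, deletes, inserts):
    """Update `archive` in place for the delta that produces `version`.

    archive maps a triple to a list of [start, end] intervals, where
    end is None while the interval is open.
    """
    for triple in deletes:
        # Deleting a triple closes its open interval at the previous version.
        archive[triple][-1][1] = version - 1
    for triple in inserts:
        # Adding a triple opens a new interval starting at this version.
        archive.setdefault(triple, []).append([version, None])


archive = {}
apply_delta_to_archive(
    archive, 1,
    deletes=[],
    inserts=[("Tom", "has", "cat"), ("cat", "eats", "tuna")],
)
apply_delta_to_archive(
    archive, 2,
    deletes=[("cat", "eats", "tuna")],
    inserts=[("cat", "dies", "Apr 1")],
)
apply_delta_to_archive(
    archive, 3,
    deletes=[("Tom", "has", "cat"), ("cat", "dies", "Apr 1")],
    inserts=[("Tom", "has", "dog"), ("dog", "eats", "dog food")],
)
```

Each delta is processed in time proportional to its size, with no need to look at the full snapshots.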
Entity resolution and RDF
URI (Uniform resource identifier)
URIs are supposed to make things easy but…
• RDF also has blank nodes
• URIs don’t exactly solve the problem in the
context of evolving/merged ontologies…
Two different RDF nodes need not represent different objects
Blank nodes
• LOD initiative frowns upon them
• Blank nodes are commonplace (and misused?)
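[Figure: two uses of blank nodes. Reification: the statement (Tom, has, cat) is represented by a blank node _b with subject, pred and object edges, and Peter believes _b. Complex number: a blank node _b holding the two values 2.4 and -0.4.]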
Blank nodes (cont.)
1. Reification (Peter believes that Tom has a cat)
2. Data structures (complex types)
3. Anonymization (Tom has a pet)
Assumptions on reasonable use of blank nodes:
1. Represent concrete objects
2. The objects can be identified from the context
Deblanking
LISP-style encoding of the list of numbers [5,3,7]:
_b3: head 5, tail _b2
_b2: head 3, tail _b1
_b1: head 7, tail end
Step 1: _b1 is replaced by #(7,end)
Step 2: _b2 is replaced by #(3,7,end)
Step 3: _b3 is replaced by #(5,3,7,end)
Assumption: graph has no cycles consisting of blanks only
Assumption: identity of a blank node is determined by its contents
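A rough sketch of deblanking under the two assumptions above. It is not the authors' implementation: the labelling scheme, the `_b` prefix test and the names `is_blank` and `deblank` are all assumptions made for illustration. In each round, every blank node whose outgoing edges point only to non-blank nodes is replaced by a label built from its contents; the slide abbreviates such a label as #(7,end), while the sketch keeps the predicates to avoid confusing nodes with different edges.

```python
def is_blank(node):
    # Assumption for this sketch: blank node labels start with "_b".
    return node.startswith("_b")


def deblank(triples):
    """Replace blank nodes by content-derived labels, round by round.

    Returns (new_triples, rounds). Blank nodes on a blank-only cycle are
    never resolved and are left in place.
    """
    triples = set(triples)
    rounds = 0
    while True:
        # Collect the outgoing (predicate, object) pairs of each blank subject.
        contents = {}
        for s, p, o in triples:
            if is_blank(s):
                contents.setdefault(s, []).append((p, o))
        # A blank node is ready when none of its objects is still blank.
        rename = {
            b: "#(" + ",".join(f"{p}={o}" for p, o in sorted(pairs)) + ")"
            for b, pairs in contents.items()
            if not any(is_blank(o) for _, o in pairs)
        }
        if not rename:
            return triples, rounds
        triples = {(rename.get(s, s), p, rename.get(o, o))
                   for s, p, o in triples}
        rounds += 1


# The list [5,3,7] from the slide.
LIST_537 = {
    ("_b3", "head", "5"), ("_b3", "tail", "_b2"),
    ("_b2", "head", "3"), ("_b2", "tail", "_b1"),
    ("_b1", "head", "7"), ("_b1", "tail", "end"),
}
resolved, rounds = deblank(LIST_537)
```

Because the label depends only on contents, two blank nodes with identical contents (for example the same list in two versions) receive the same label and collapse into one node.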
Experiments
• 10 versions of Experimental Factor Ontology (EFO)
data expressed in OWL
• 200k triples in the 1st version, 290k in the last
• On average 20k blank nodes in each version
• 920k triples overall (blank nodes are independent)
• many triples do not last more than 1 version
Experiment
Deblanking and life expectancy of an object
Round  Triples  Blanks  Life expect.
0      921896   165935  2.55
1      358857   33253   6.39
2      348356   28150   6.57
3      339695   23502   6.88
4      330564   18862   7.10
5      318761   14763   7.24
6      311562   11021   7.39
7      304628   7299    7.54
8      297744   3622    7.83
9      285484   58      7.83
10     285334   2       7.83
11     285334   1       7.83
12     285334   0       7.83
Improving space efficiency
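Lift common intervals to subject
Before: (Peter, phone, +44 712 4567) [1–10], (Peter, lives, Edinburgh) [1–10]
After: Peter [1–10], with (Peter, phone, +44 712 4567) and (Peter, lives, Edinburgh) carrying no interval of their own
[Figure also shows: dog, has [1–5]]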
• Intervals moved from all but 33.7k triples (of total 285k)
• Number of subjects with histories is 34.3k
• Total number of intervals is reduced from 285k to 60k
• The size of the index reduced by almost 80%
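A simplified sketch of interval lifting. The slides do not spell out the exact rule, so this version makes a conservative assumption: an interval is moved to the subject only when all triples of that subject carry the same interval. The function name `lift_intervals` and the one-interval-per-triple layout are hypothetical.

```python
from collections import defaultdict


def lift_intervals(archive):
    """Split an annotated archive into subject-level and triple-level parts.

    archive maps (s, p, o) to a (start, end) interval. Returns
    (subject_intervals, triple_intervals).
    """
    by_subject = defaultdict(dict)
    for triple, interval in archive.items():
        by_subject[triple[0]][triple] = interval

    subject_intervals = {}
    triple_intervals = {}
    for subject, annotated in by_subject.items():
        distinct = set(annotated.values())
        if len(distinct) == 1:
            # All triples agree: store the interval once, on the subject.
            subject_intervals[subject] = distinct.pop()
        else:
            # Triples disagree: keep the per-triple intervals.
            triple_intervals.update(annotated)
    return subject_intervals, triple_intervals


ARCHIVE = {
    ("Peter", "phone", "+44 712 4567"): (1, 10),
    ("Peter", "lives", "Edinburgh"): (1, 10),
    ("Tom", "has", "cat"): (1, 2),
    ("Tom", "has", "dog"): (3, None),
}
subject_intervals, triple_intervals = lift_intervals(ARCHIVE)
```

In this toy archive the four per-triple intervals become one subject-level interval for Peter plus two triple-level intervals for Tom.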
Future:
• Bisimulation
• Nested RDF
Conclusions
• Annotation offers an attractive way of representing
an evolving RDF dataset (need for nested RDF?)
• Evolution of data may require more complex atomic
operations. For instance, vocabulary evolution:
adding, splitting, merging classes. (can
bisimulation help here?)
