By Sławek Staworko (joint work with Peter Buneman), University of Edinburgh. Presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop: prelida.eu
2. Preservation of evolving data
[Figure: successive versions of a small RDF graph.
Version 1: (Tom, has, cat), (cat, eats, tuna)
Version 2: (Tom, has, cat), (cat, dies, Apr 1)
Version 3: (Tom, has, dog), (dog, eats, dog food)
…]
Archive
• Version retrieval
• Timeline queries
• Storage space efficiency
3. Approaches to data preservation
• Store all versions
• Store the original database and log the changes
• Hybrid approach of the above two:
  • store the initial and every 10th version
  • store the change logs for the intermediate versions
• Annotation-based approach:
  • never delete data but annotate its validity with time intervals
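The hybrid approach can be sketched as follows. This is only an illustration under stated assumptions: the class name `HybridArchive`, the helper `apply_delta`, the `step` parameter, and the choice to checkpoint versions whose number is a multiple of `step` are all hypothetical and not taken from the talk.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
Op = Tuple[str, Triple]  # ("insert" | "delete", triple)


def apply_delta(snapshot: Set[Triple], delta: Iterable[Op]) -> Set[Triple]:
    """Return a new snapshot obtained by applying `delta` to `snapshot`."""
    result = set(snapshot)
    for op, triple in delta:
        if op == "insert":
            result.add(triple)
        elif op == "delete":
            result.discard(triple)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result


class HybridArchive:
    """Full snapshots for version 1 and every `step`-th version, deltas otherwise."""

    def __init__(self, initial: Iterable[Triple], step: int = 10) -> None:
        self.step = step
        self.latest = 1
        self._current: Set[Triple] = set(initial)
        self.checkpoints: Dict[int, Set[Triple]] = {1: set(self._current)}
        self.deltas: Dict[int, List[Op]] = {}  # deltas[v] turns version v-1 into v

    def append(self, delta: Iterable[Op]) -> int:
        """Record the next version and return its number."""
        delta = list(delta)
        self.latest += 1
        self._current = apply_delta(self._current, delta)
        if self.latest % self.step == 0:
            self.checkpoints[self.latest] = set(self._current)
        else:
            self.deltas[self.latest] = delta
        return self.latest

    def retrieve(self, version: int) -> Set[Triple]:
        """Rebuild a version from the nearest earlier checkpoint plus logged deltas."""
        if not 1 <= version <= self.latest:
            raise KeyError(version)
        base = max(v for v in self.checkpoints if v <= version)
        snapshot = set(self.checkpoints[base])
        for v in range(base + 1, version + 1):
            snapshot = apply_delta(snapshot, self.deltas[v])
        return snapshot
```

Retrieving a version replays at most `step - 1` deltas, which is the trade-off between storing every version and storing only a change log.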
5. What exactly is the input?
Delta = difference between two databases, expressed with two atomic operations: inserting a triple and deleting a triple
Snapshots = complete database instances
[Figure: the three snapshots of the Tom example (Version 1, Version 2, Version 3), with the delta between each pair of consecutive snapshots]
Deltas:
• Version 1 → Version 2: delete (cat, eats, tuna); insert (cat, dies, Apr 1)
• Version 2 → Version 3: delete (Tom, has, cat); insert (Tom, has, dog); insert (dog, eats, dog food); delete (cat, dies, Apr 1)
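When every node has a stable identifier, a delta between two snapshots is just a pair of set differences. The sketch below assumes exactly that (no blank nodes, no renamed entities); the function name `compute_delta` and the tuple encoding of operations are illustrative choices, not something from the slides.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
Op = Tuple[str, Triple]  # ("insert" | "delete", triple)


def compute_delta(old: Set[Triple], new: Set[Triple]) -> List[Op]:
    """Delta turning snapshot `old` into snapshot `new`.

    Assumes equal node labels denote the same object in both snapshots.
    """
    deletes = [("delete", t) for t in sorted(old - new)]
    inserts = [("insert", t) for t in sorted(new - old)]
    return deletes + inserts


# The three snapshots of the Tom example.
v1 = {("Tom", "has", "cat"), ("cat", "eats", "tuna")}
v2 = {("Tom", "has", "cat"), ("cat", "dies", "Apr 1")}
v3 = {("Tom", "has", "dog"), ("dog", "eats", "dog food")}

delta_1_2 = compute_delta(v1, v2)
delta_2_3 = compute_delta(v2, v3)
```

The order of operations inside a delta does not matter here because a triple never appears in both the deleted and the inserted set.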
6. Challenges in preserving evolving data with annotations
1. The task is relatively simple if the deltas are known:
  • deleting a triple closes its interval
  • adding a triple opens a new interval
2. It gets complicated when only snapshots are given:
  • it boils down to computing deltas
  • main challenge: identifying objects that are the same across versions of the database
Entity resolution problem:
• which data objects represent the same entity across different versions?
• a well-studied database problem in various settings (from duplicate elimination to record matching)
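The simple case, where deltas are known, can be sketched as below: an insert opens an interval and a delete closes it. The names `annotate` and `version_at`, the use of `None` for a still-open interval, and the inclusive `[start, end]` convention are assumptions made for this illustration.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Set, Tuple

Triple = Tuple[str, str, str]
Op = Tuple[str, Triple]  # ("insert" | "delete", triple)
Interval = List[Optional[int]]  # [start, end], inclusive; end is None while open


def annotate(initial: Iterable[Triple],
             deltas: Sequence[Iterable[Op]]) -> Dict[Triple, List[Interval]]:
    """Build interval annotations; deltas[i] turns version i+1 into version i+2."""
    history: Dict[Triple, List[Interval]] = {}
    for triple in initial:
        history[triple] = [[1, None]]
    for version, delta in enumerate(deltas, start=2):
        for op, triple in delta:
            intervals = history.setdefault(triple, [])
            is_open = bool(intervals) and intervals[-1][1] is None
            if op == "insert":
                if not is_open:  # adding a triple opens a new interval
                    intervals.append([version, None])
            elif op == "delete":
                if not is_open:
                    raise ValueError(f"deleting absent triple: {triple!r}")
                intervals[-1][1] = version - 1  # deleting closes the interval
            else:
                raise ValueError(f"unknown operation: {op!r}")
    return history


def version_at(history: Dict[Triple, List[Interval]], version: int) -> Set[Triple]:
    """Version retrieval: all triples valid at `version`."""
    return {
        triple
        for triple, intervals in history.items()
        if any(s <= version and (e is None or version <= e) for s, e in intervals)
    }


history = annotate(
    [("Tom", "has", "cat"), ("cat", "eats", "tuna")],
    [
        [("delete", ("cat", "eats", "tuna")), ("insert", ("cat", "dies", "Apr 1"))],
        [("delete", ("Tom", "has", "cat")), ("insert", ("Tom", "has", "dog")),
         ("insert", ("dog", "eats", "dog food")), ("delete", ("cat", "dies", "Apr 1"))],
    ],
)
```

A timeline query for a single triple is then a dictionary lookup, for example `history[("Tom", "has", "cat")]`, and nothing is ever deleted from the archive.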
7. Entity resolution and RDF
URI (Uniform Resource Identifier)
URIs are supposed to make things easy but…
• RDF also has blank nodes
• URIs don’t exactly solve the problem in the context of evolving/merged ontologies…
Two different RDF nodes need not represent different objects
8. Blank nodes
• LOD initiative frowns upon them
• Blank nodes are commonplace (and misused?)
[Figure: two uses of blank nodes.
Reification: a blank node _b with subject Tom, pred has, object cat; Peter believes _b.
Complex number: a blank node _b grouping the values 2.4 and -0.4.]
9. Blank nodes (cont.)
1. Reification (Peter believes that Tom has a cat)
2. Data structures (complex types)
3. Anonymization (Tom has a pet)
Assumptions on reasonable use of blank nodes:
1. They represent concrete objects
2. The objects can be identified from the context
10. Deblanking
[Figure: LISP-style encoding of the list of numbers [5,3,7] with blank nodes, deblanked bottom-up.
Initial graph: _b3 has head 5 and tail _b2; _b2 has head 3 and tail _b1; _b1 has head 7 and tail end.
Step 1: _b1 is replaced by #(7,end).
Step 2: _b2 is replaced by #(3,7,end).
Step 3: _b3 is replaced by #(5,3,7,end).]
Assumption: graph has no cycles consisting of blanks only
Assumption: identity of a blank node is determined by its contents
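Under these two assumptions, deblanking can be sketched as a bottom-up renaming in which each blank node receives a name computed from its outgoing edges. The sketch is not the authors' implementation: the function names, the convention that blank node labels start with "_", and the use of a truncated SHA-1 digest in place of the readable #(…) names are all assumptions made here.

```python
import hashlib
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def is_blank(node: str) -> bool:
    """Assumed convention: blank node labels start with an underscore."""
    return node.startswith("_")


def deblank(triples: Set[Triple]) -> Set[Triple]:
    """Replace every blank node by a name derived from its contents."""
    outgoing: Dict[str, List[Tuple[str, str]]] = {}
    for s, p, o in triples:
        outgoing.setdefault(s, []).append((p, o))

    names: Dict[str, str] = {}
    in_progress: Set[str] = set()

    def name_of(node: str) -> str:
        if not is_blank(node):
            return node
        if node in names:
            return names[node]
        if node in in_progress:
            # Only blank nodes are ever descended into, so this cycle is blank-only.
            raise ValueError("cycle consisting of blank nodes only")
        in_progress.add(node)
        contents = sorted((p, name_of(o)) for p, o in outgoing.get(node, []))
        in_progress.discard(node)
        digest = hashlib.sha1(repr(contents).encode("utf-8")).hexdigest()[:12]
        names[node] = "#" + digest
        return names[node]

    return {(name_of(s), p, name_of(o)) for s, p, o in triples}


# The list [5,3,7] encoded twice with different blank node labels.
list_a = {
    ("_b3", "head", "5"), ("_b3", "tail", "_b2"),
    ("_b2", "head", "3"), ("_b2", "tail", "_b1"),
    ("_b1", "head", "7"), ("_b1", "tail", "end"),
}
list_b = {
    ("_x", "head", "5"), ("_x", "tail", "_y"),
    ("_y", "head", "3"), ("_y", "tail", "_z"),
    ("_z", "head", "7"), ("_z", "tail", "end"),
}
```

Because the new names depend only on contents, `deblank(list_a)` and `deblank(list_b)` yield the same set of triples, so the same list in two versions is recognised as one object. The recursion depth grows with the length of the longest chain of blank nodes, which a production version would need to handle iteratively.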
11. Experiments
• 10 versions of the Experimental Factor Ontology (EFO) data expressed in OWL
• 200k triples in the 1st version, 290k in the last
• On average 20k blank nodes in each version
• 920k triples overall (blank nodes are independent)
• many triples do not last more than 1 version
13. Improving space efficiency
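[Figure: lifting common intervals to the subject.
Before: Peter has the edges lives → Edinburgh and phone → +44 712 4567, each annotated with [1–10].
After: the single interval [1–10] is attached to the subject Peter and the two edges carry no annotation.
In both graphs a "has" edge to dog keeps its own interval [1–5].]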
• Intervals are moved from all but 33.7k triples (of 285k in total)
• The number of subjects with histories is 34.3k
• The total number of intervals is reduced from 285k to 60k
• The size of the index is reduced by almost 80%
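One way to lift common intervals to the subject is sketched below. The slides do not say how the lifted interval is chosen; picking the annotation shared by the most triples of a subject is an assumption of this sketch, as are the names `lift_intervals` and `intervals_of` and the example triple (Peter, has, dog).

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Intervals = Tuple[Tuple[int, int], ...]  # e.g. ((1, 10),)


def lift_intervals(
    annotated: Dict[Triple, Intervals],
) -> Tuple[Dict[str, Intervals], Dict[Triple, Intervals]]:
    """Move each subject's most common annotation from its triples to the subject."""
    by_subject: Dict[str, List[Triple]] = defaultdict(list)
    for triple in annotated:
        by_subject[triple[0]].append(triple)

    subject_intervals: Dict[str, Intervals] = {}
    triple_intervals: Dict[Triple, Intervals] = {}
    for subject, triples in by_subject.items():
        counts = Counter(annotated[t] for t in triples)
        common, _ = counts.most_common(1)[0]
        subject_intervals[subject] = common
        for t in triples:
            if annotated[t] != common:  # only exceptions keep their own annotation
                triple_intervals[t] = annotated[t]
    return subject_intervals, triple_intervals


def intervals_of(
    triple: Triple,
    subject_intervals: Dict[str, Intervals],
    triple_intervals: Dict[Triple, Intervals],
) -> Intervals:
    """Look up a triple's validity: its own annotation, else its subject's."""
    return triple_intervals.get(triple, subject_intervals[triple[0]])


annotated = {
    ("Peter", "lives", "Edinburgh"): ((1, 10),),
    ("Peter", "phone", "+44 712 4567"): ((1, 10),),
    ("Peter", "has", "dog"): ((1, 5),),  # hypothetical triple for illustration
}
subject_intervals, triple_intervals = lift_intervals(annotated)
```

In this example three triple annotations shrink to one subject annotation plus one exception, and `intervals_of` still recovers the original validity of every triple.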
15. Conclusions
• Annotation offers an attractive way of representing an evolving RDF dataset (need for nested RDF?)
• Evolution of data may require more complex atomic operations. For instance, vocabulary evolution: adding, splitting, merging classes. (can bisimulation help here?)