The Extremely Annotational RDF Markup, a.k.a. EARMARK, is an OWL 2 DL ontology that defines document meta-markup. It is an ontologically precise definition of markup that instantiates the structure of a text document as an independent OWL document outside of the text string it annotates, and through appropriate OWL and SWRL characterizations it can define organizations such as trees or graphs and can be used to generate validity constraints. In this paper we present an extension of EARMARK that allows us to describe how markup documents evolve in time, which complies with concepts expressed in the Functional Requirements for Bibliographic Records (FRBR).
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
Tracking Changes through EARMARK: a Theoretical Perspective and an Implementation
1. http://creativecommons.org/licenses/by-sa/3.0
Tracking changes through EARMARK:
a theoretical perspective and an implementation
Silvio Peroni – essepuntato@cs.unibo.it
Francesco Poggi – fpoggi@cs.unibo.it
FabioVitali – fabio@cs.unibo.it
1st InternationalWorkshop on
Document Changes: Modeling, Detection, Storage andVisualization
http://diff.cs.unibo.it/dchanges2013/
@ DocEng 2013, Florence, Italy - September 10, 2013
2. Outline
• Documents and their changes in time through FRBR
• Change tracking and provenance data of XML documents
• Defining multi-hierarchy documents through EARMARK
• EARMARK Changes Ontology and its application
• Querying and byte-counting
• Conclusions
3. Documents do change in time
• Any creative act of a text
✦ starts from a particular draft made by someone at a certain time
✦ is then modified through consecutive revisions
✦ may end up being forked into different variants
✦ may be modified by additional editorial activities such as typo-fixing,
shortening, restructuring, etc.
• Importance of keeping tracks of changes
✦ Computer Science: to show how programming code or computational
models evolve throughout the natural lifecycle of software development
✦ Philology: to tell the way in which variant copies of a same book overlap in
time and content
✦ Scientific Publishing: to understand the entity and quality of the
modifications and driving the final acceptance or rejection of a paper
✦ etc.
4. About (textual) documents and their changes
A document is more
than the string it is
composed of
Alice
produces an hand-written
document on a piece of paper
Bob
produces a digital document
through a word-processor
“Hello world.”
composed by the string
“Hello world.”
composed by the string
different
documents
same
string
modifies Bob’s document
producing a new version
“Hey, hello world!”
composed by the string
Charles
used by
to certain extent
same
document
different
strings
A document refers
to different strings
in time
5. Introducing FRBR for change tracking
• “Functional Requirements for Bibliographic Records (FRBR) is a conceptual
entity-relationship model developed by the International Federation of Library
Associations and Institutions (IFLA) that relates user tasks of retrieval and access
in online library catalogues and bibliographic databases from a user’s perspective.”
• According to FRBR, document is an overloaded word, and is better substituted
by four different concepts called respectively
✦ Work, coupled with the concept of identity
✦ Expression, to record the evolution in time
✦ Manifestation, which specifies the form and format
✦ Item, identifying the concrete object
from http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records
string of characters
constituting the
content of a document
FRBR Expression layer
Expressions never change in time
: a distinct intellectual or artistic creation
: the intellectual or artistic realisation of a work in the form
of alpha-numeric, musical, or choreographic
notation, sound, image, object, movement, etc.,
or any combination of such forms
FRBR Work layer
Bob
“Hello world.”
Charles
“Hey, hello world!”
the abstract
conceptualisation of a
document
realisation of realisation of
revision of
Time
6. What happens to
XML-based markup documents?
• What changes we have to keep track of among those that occur
in XML documents, and what are the markup elements that, after
edits and document changes, are
✦ directly affected
✦ hierarchically affected
✦ completely unaffected
• Research questions:
✦ When a markup element E1 within a document version V1 changes in some
way, e.g. by adding something to the text it contains, thereby generating
document version V2, are the two instances of E1 in V1 and E2 in V2 to be
considered actually the same element?
✦ In the case the aforementioned instances of E are to be considered
different, is the difference meant to be propagated also to their ancestor
elements?
7. DEL
<section>
NEW
Applying FRBR to XML markup
Elements and text nodes as FRBR Expressions
NEW
<section>
<p> <p>
<em>
Some interesting content. It was written by me.
INS
<em>
very
NEW
<section>
<p>
<section>
<p>Some <em>interesting</em>
content.</p>
<p>It was written by me.</p>
</section>
Alice
<section>
<p>Some <em>very
interesting</em>
content.</p>
<p>It was written by me.</p>
</section>
Bob
revised by
<section>
<p>Some <em>very
interesting</em>
content.</p>
<p>It was written by me.</p>
</section>
Charles
revised by
has part
revision of
8. Who, What, How and When
• An important part of change
tracking operations involves
keeping track of provenance
information
✦ who made the modification
✦ what was modified
✦ how it was modified
✦ when it was modified
• How do we keep track of all
these data in practice?
✦ XML-based languages use
workarounds to implement
overlapping markup
✦ Other solutions?
<section>
<section>
<p> <p>
<em>
very
<section>
Version made by Bob
July 3, 2013, at 04:15
Text inserted by Bob
July 3, 2013, at 04:15
Version made by Alice
June 19, 2013, at 13:45
Markup deleted by Charles
July 5, 2013, at 03:33
Version made by Charles
July 5, 2013, at 03:33
9. From theory to practice through
The Extremely Annotational RDF
Markup (EARMARK) is at the
same time a markup meta-language
and an ontology of (document) markup
✦ More expressive than XML – it allows to
organise markup structures as graphs
✦ It makes easy to associate annotations to
document items such as change tracking
information – since an EARMARK
document is a set of OWL assertions, all
the markup items and text nodes are
individuals of particular classes identified
by an IRI
✦ Lot of tools available:
- a Java API
- frameworks to convert XML
documents into EARMARK ones and
vice versa
more information at http://palindrom.es/phd/research/earmark
10. Example of EARMARK document
Linearised using Turtle
Some interesting content. It was written by me.
# Textual content of the document
:content a earmark:StringDocuverse ;
earmark:hasContent "Some interesting content. It was written by me."^^xsd:string .
full Turtle source of the
document available at
http://www.essepuntato.it/2013/dchanges
# String ’Some ’
:r1 a earmark:PointerRange ;
earmark:refersTo :content ;
earmark:begins "0"^^xsd:nonNegativeInteger ;
earmark:ends "5"^^xsd:nonNegativeInteger .
# String ‘interesting’
:r2 a earmark:PointerRange ...
# String ’ content.’
:r3 a earmark:PointerRange ...
# String ’It was written by me.’
:r4 a earmark:PointerRange ...
<section>
<p>
# Element ‘section’
:section a earmark:Element ;
earmark:hasGeneralIdentifier "section"^^xsd:string ;
co:firstItem [ a co:ListItem ; co:itemContent :p1 ;
co:nextItem [ a co:ListItem ;
co:itemContent :p2 ] ] .
<p>
<em>
# First element ’p’
:p1 a earmark:Element ;
earmark:hasGeneralIdentifier "p"^^xsd:string ;
co:firstItem [ a co:ListItem ;
co:itemContent :r1 ;
co:nextItem [ a co:ListItem ;
co:itemContent :em ;
co:nextItem [ a co:ListItem ;
co:itemContent :r3 ] ] ] .
... and similarly for the other markup elements
11. EARMARK Changes Ontology
Extending EARMARK to manage change tracking information
• The EARMARK Changes Ontology (EChO) extends the EARMARK ontology
and includes the OWL 2 DL implementation of FRBR (http://purl.org/spar/frbr)
and the Provenance Ontology (http://www.w3.org/ns/prov#), so as to keep track
of all the changes and provenance data related to different versions of the same
document
✦ the EARMARK items (docuverses, ranges and markup items) to model the structure of the
different document versions and to store them all within a single EARMARK document
✦ frbr:revisionOf to indicate that a markup item is a revision of another
✦ prov:wasDerivedFrom to indicate that a range is actually derived from another one defined in a
previous version of the document
✦ prov:wasGeneratedBy (coupled with instances of echo:VersionCreation and echo:ItemInsertion) and
prov:generatedAtTime to indicate that a particular markup item, a range or a whole document
version has been created at a certain time
✦ prov:wasInvalidatedBy (coupled with instances of echo:VersionRemoval and echo:ItemDeletion) and
prov:invalidatedAtTime to indicate that a particular markup item, a range or a whole document
version has been deleted at a certain time
✦ prov:wasAssociatedWith to indicate the agent involved in the activity of generation/invalidation of a
certain item
12. Version creation
• Who:Alice
• What: document version (implicitly identified by the document
element of the markup document :section)
• How: creation
• When: June 19, 2013 at 13:45
:section
# Provenance information
prov:wasGeneratedBy :creation-by-alice ;
prov:generatedAtTime "2013-06-19T13:45:00Z"^^xsd:dateTime .
# Activity of creation of a new version
:creation-by-alice a echo:VersionCreation ; prov:wasAssociatedWith :alice .
13. Revision, insertion and deletion
• Bob’s revision of Alice’s version was made
on July 3, 2013, at 04:15, and concerns only
the insertion of the string “very ” as first
textual node of the element em
• Charles’ revision of Bob’s version was
made on July 5, 2013, at 03:33, and deletes
the Bob’s second p of section
# Element ’section ’ by Bob
:section-by-bob a earmark:Element ;
earmark:hasGeneralIdentifier
"section"^^xsd:string ;
co:firstItem [ a co:ListItem ;
co:itemContent :p1-by-bob ;
co:nextItem [ a co:ListItem ;
co:itemContent :p2 ] ] ;
# relation with previous version
frbr:revisionOf :section ;
# provenance information
prov:wasGeneratedBy :creation-by-bob ;
prov:generatedAtTime
"2013-07-03T04:15:00Z"^^xsd:dateTime .
:creation-by-bob a echo:VersionCreation ;
prov:wasAssociatedWith :bob .
revision
# New content of the document
:content-by-bob a earmark:StringDocuverse ;
earmark:hasContent "very "^^xsd:string .
# New string ’very ’
:r5 a earmark:PointerRange ;
earmark:refersTo :content-by-bob ;
earmark:begins "0"^^xsd:nonNegativeInteger ;
earmark:ends "5"^^xsd:nonNegativeInteger ;
# provenance information
prov:wasGeneratedBy :insertion-by-bob ;
prov:generatedAtTime
"2013-07-03T04:15:00Z"^^xsd:dateTime .
:insertion-by-bob a echo:ItemInsertion ;
prov:wasAssociatedWith :bob .
insertion
:p2 # Second element ’p’ of Bob’s ‘section‘
prov:wasInvalidatedBy :deletion-by-charles
prov:invalidatedAtTime
"2013-07-05T03:33:00Z"^^xsd:dateTime .
:deletion-by-charles a echo:ItemDeletion ;
prov:wasAssociatedWith :charles .
deletion
14. Splitting ranges up
• Daniel’s revision of the Alice’s
version, where Daniel decided
to substitute the string “me”
in the second p with its name
(i.e. the string “Daniel”)
• In EARMARK, this string
substitution (i.e. a deletion
plus an insertion) is possible
by defining four new ranges
• We use prov:wasDerivedFrom
statements between ranges to
describe (at an abstract level)
a more complex scenario of
overlapping markup between
the two versions
15. Querying the history of changes
• Since EARMARK is defined by means of Semantic Web technologies,
we can use already implemented standards such as SPARQL 1.1 to
query over the change tracking history of a certain EARMARK
document
✦ Return a new EARMARK document that contains only Bob’s version
CONSTRUCT { ?other ?p ?o . ?version ?pv ?ov . ?docuverse ?pd ?od } WHERE {
{ SELECT DISTINCT ?other ?version WHERE {
{ SELECT DISTINCT ?version WHERE {
?version a earmark:Element ; prov:wasGeneratedBy ?activity .
?activity a echo:VersionCreation ; prov:wasAssociatedWith :bob } }
?other (^co:itemContent?/^co:item)+ ?version } }
?version ?pv ?ov . ?other ?p ?o .
OPTIONAL { ?other a earmark:PointerRange ;
earmark:refersTo ?docuverse . ?docuverse ?pd ?od } }
✦ Select the textual content of all paragraphs removed by Charles
SELECT DISTINCT ?range WHERE {
?p a earmark:Element ; earmark:hasGeneralIdentifier "p"^^xsd:string ;
co:item/co:itemContent ?range ; prov:wasInvalidatedBy ?activity .
?range a earmark:PointerRange .
?activity a echo:ItemDeletion ; prov:wasAssociatedWith :charles }
16. Byte-counting EARMARK documents
• We used two documents
✦ The first document composed of seven different versions, named after the “Seven Dwarfs” for
recognizability and obtained by applying very common edits according to three authors
✦ The second document composed of seven different versions, named after the weekdays and
created by seven different authors when editing a very simple document
• We compared the size in bytes of consecutive versions of such documents
according to OpenDocument and OpenXML formats, and to EARMARK
linearised in six different formats:Turtle, RDF/XML, OWL/XML, N-Triples, HDT
and Manchester Syntax
Turtle seems the best
linearisation format
for EARMARK
HDT (compressed)
performances similar to
ODT and OOXML
17. Conclusions
• In this paper we presented a theoretical approach to track document changes
based on FRBR and provenance data
• We proposed one possible implementation of it through EARMARK, a
Semantic Web-aware meta-markup language that enables the definition of
multiple overlapping markup hierarchies representing different versions of the
same document
• We highlighted the main advantages and drawbacks in terms of querying and
storing such EARMARK documents
• In the future we plan to extend EChO so as to enable the description of
additional change tracking operations (e.g. swap, update)
• We also plan to experiment the effective use of translation mechanisms to
convert EARMARK documents with change tracking information into XML
formats, e.g. ODT and OOXML