SlideShare una empresa de Scribd logo
1 de 18
Descargar para leer sin conexión
http://creativecommons.org/licenses/by-sa/3.0
Tracking changes through EARMARK:
a theoretical perspective and an implementation
Silvio Peroni – essepuntato@cs.unibo.it
Francesco Poggi – fpoggi@cs.unibo.it
FabioVitali – fabio@cs.unibo.it
1st InternationalWorkshop on
Document Changes: Modeling, Detection, Storage andVisualization
http://diff.cs.unibo.it/dchanges2013/
@ DocEng 2013, Florence, Italy - September 10, 2013
Outline
• Documents and their changes in time through FRBR
• Change tracking and provenance data of XML documents
• Defining multi-hierarchy documents through EARMARK
• EARMARK Changes Ontology and its application
• Querying and byte-counting
• Conclusions
Documents do change in time
• Any creative act of a text
✦ starts from a particular draft made by someone at a certain time
✦ is then modified through consecutive revisions
✦ may end up being forked into different variants
✦ may be modified by additional editorial activities such as typo-fixing,
shortening, restructuring, etc.
• Importance of keeping tracks of changes
✦ Computer Science: to show how programming code or computational
models evolve throughout the natural lifecycle of software development
✦ Philology: to tell the way in which variant copies of a same book overlap in
time and content
✦ Scientific Publishing: to understand the entity and quality of the
modifications and driving the final acceptance or rejection of a paper
✦ etc.
About (textual) documents and their changes
A document is more
than the string it is
composed of
Alice
produces an hand-written
document on a piece of paper
Bob
produces a digital document
through a word-processor
“Hello world.”
composed by the string
“Hello world.”
composed by the string
different
documents
same
string
modifies Bob’s document
producing a new version
“Hey, hello world!”
composed by the string
Charles
used by
to certain extent
same
document
different
strings
A document refers
to different strings
in time
Introducing FRBR for change tracking
• “Functional Requirements for Bibliographic Records (FRBR) is a conceptual
entity-relationship model developed by the International Federation of Library
Associations and Institutions (IFLA) that relates user tasks of retrieval and access
in online library catalogues and bibliographic databases from a user’s perspective.”
• According to FRBR, document is an overloaded word, and is better substituted
by four different concepts called respectively
✦ Work, coupled with the concept of identity
✦ Expression, to record the evolution in time
✦ Manifestation, which specifies the form and format
✦ Item, identifying the concrete object
from http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records
string of characters
constituting the
content of a document
FRBR Expression layer
Expressions never change in time
: a distinct intellectual or artistic creation
: the intellectual or artistic realisation of a work in the form
of alpha-numeric, musical, or choreographic
notation, sound, image, object, movement, etc.,
or any combination of such forms
FRBR Work layer
Bob
“Hello world.”
Charles
“Hey, hello world!”
the abstract
conceptualisation of a
document
realisation of realisation of
revision of
Time
What happens to
XML-based markup documents?
• What changes we have to keep track of among those that occur
in XML documents, and what are the markup elements that, after
edits and document changes, are
✦ directly affected
✦ hierarchically affected
✦ completely unaffected
• Research questions:
✦ When a markup element E1 within a document version V1 changes in some
way, e.g. by adding something to the text it contains, thereby generating
document version V2, are the two instances of E1 in V1 and E2 in V2 to be
considered actually the same element?
✦ In the case the aforementioned instances of E are to be considered
different, is the difference meant to be propagated also to their ancestor
elements?
DEL
<section>
NEW
Applying FRBR to XML markup
Elements and text nodes as FRBR Expressions
NEW
<section>
<p> <p>
<em>
Some interesting content. It was written by me.
INS
<em>
very
NEW
<section>
<p>
<section>
<p>Some <em>interesting</em>
content.</p>
<p>It was written by me.</p>
</section>
Alice
<section>
<p>Some <em>very
interesting</em>
content.</p>
<p>It was written by me.</p>
</section>
Bob
revised by
<section>
<p>Some <em>very
interesting</em>
content.</p>
<p>It was written by me.</p>
</section>
Charles
revised by
has part
revision of
Who, What, How and When
• An important part of change
tracking operations involves
keeping track of provenance
information
✦ who made the modification
✦ what was modified
✦ how it was modified
✦ when it was modified
• How do we keep track of all
these data in practice?
✦ XML-based languages use
workarounds to implement
overlapping markup
✦ Other solutions?
<section>
<section>
<p> <p>
<em>
very
<section>
Version made by Bob
July 3, 2013, at 04:15
Text inserted by Bob
July 3, 2013, at 04:15
Version made by Alice
June 19, 2013, at 13:45
Markup deleted by Charles
July 5, 2013, at 03:33
Version made by Charles
July 5, 2013, at 03:33
From theory to practice through
The Extremely Annotational RDF
Markup (EARMARK) is at the
same time a markup meta-language
and an ontology of (document) markup
✦ More expressive than XML – it allows to
organise markup structures as graphs
✦ It makes easy to associate annotations to
document items such as change tracking
information – since an EARMARK
document is a set of OWL assertions, all
the markup items and text nodes are
individuals of particular classes identified
by an IRI
✦ Lot of tools available:
- a Java API
- frameworks to convert XML
documents into EARMARK ones and
vice versa
more information at http://palindrom.es/phd/research/earmark
Example of EARMARK document
Linearised using Turtle
Some interesting content. It was written by me.
# Textual content of the document
:content a earmark:StringDocuverse ;
earmark:hasContent "Some interesting content. It was written by me."^^xsd:string .
full Turtle source of the
document available at
http://www.essepuntato.it/2013/dchanges
# String ’Some ’
:r1 a earmark:PointerRange ;
earmark:refersTo :content ;
earmark:begins "0"^^xsd:nonNegativeInteger ;
earmark:ends "5"^^xsd:nonNegativeInteger .
# String ‘interesting’
:r2 a earmark:PointerRange ...
# String ’ content.’
:r3 a earmark:PointerRange ...
# String ’It was written by me.’
:r4 a earmark:PointerRange ...
<section>
<p>
# Element ‘section’
:section a earmark:Element ;
earmark:hasGeneralIdentifier "section"^^xsd:string ;
co:firstItem [ a co:ListItem ; co:itemContent :p1 ;
co:nextItem [ a co:ListItem ;
co:itemContent :p2 ] ] .
<p>
<em>
# First element ’p’
:p1 a earmark:Element ;
earmark:hasGeneralIdentifier "p"^^xsd:string ;
co:firstItem [ a co:ListItem ;
co:itemContent :r1 ;
co:nextItem [ a co:ListItem ;
co:itemContent :em ;
co:nextItem [ a co:ListItem ;
co:itemContent :r3 ] ] ] .
... and similarly for the other markup elements
EARMARK Changes Ontology
Extending EARMARK to manage change tracking information
• The EARMARK Changes Ontology (EChO) extends the EARMARK ontology
and includes the OWL 2 DL implementation of FRBR (http://purl.org/spar/frbr)
and the Provenance Ontology (http://www.w3.org/ns/prov#), so as to keep track
of all the changes and provenance data related to different versions of the same
document
✦ the EARMARK items (docuverses, ranges and markup items) to model the structure of the
different document versions and to store them all within a single EARMARK document
✦ frbr:revisionOf to indicate that a markup item is a revision of another
✦ prov:wasDerivedFrom to indicate that a range is actually derived from another one defined in a
previous version of the document
✦ prov:wasGeneratedBy (coupled with instances of echo:VersionCreation and echo:ItemInsertion) and
prov:generatedAtTime to indicate that a particular markup item, a range or a whole document
version has been created at a certain time
✦ prov:wasInvalidatedBy (coupled with instances of echo:VersionRemoval and echo:ItemDeletion) and
prov:invalidatedAtTime to indicate that a particular markup item, a range or a whole document
version has been deleted at a certain time
✦ prov:wasAssociatedWith to indicate the agent involved in the activity of generation/invalidation of a
certain item
Version creation
• Who:Alice
• What: document version (implicitly identified by the document
element of the markup document :section)
• How: creation
• When: June 19, 2013 at 13:45
:section
# Provenance information
prov:wasGeneratedBy :creation-by-alice ;
prov:generatedAtTime "2013-06-19T13:45:00Z"^^xsd:dateTime .
# Activity of creation of a new version
:creation-by-alice a echo:VersionCreation ; prov:wasAssociatedWith :alice .
Revision, insertion and deletion
• Bob’s revision of Alice’s version was made
on July 3, 2013, at 04:15, and concerns only
the insertion of the string “very ” as first
textual node of the element em
• Charles’ revision of Bob’s version was
made on July 5, 2013, at 03:33, and deletes
the Bob’s second p of section
# Element ’section ’ by Bob
:section-by-bob a earmark:Element ;
earmark:hasGeneralIdentifier
"section"^^xsd:string ;
co:firstItem [ a co:ListItem ;
co:itemContent :p1-by-bob ;
co:nextItem [ a co:ListItem ;
co:itemContent :p2 ] ] ;
# relation with previous version
frbr:revisionOf :section ;
# provenance information
prov:wasGeneratedBy :creation-by-bob ;
prov:generatedAtTime
"2013-07-03T04:15:00Z"^^xsd:dateTime .
:creation-by-bob a echo:VersionCreation ;
prov:wasAssociatedWith :bob .
revision
# New content of the document
:content-by-bob a earmark:StringDocuverse ;
earmark:hasContent "very "^^xsd:string .
# New string ’very ’
:r5 a earmark:PointerRange ;
earmark:refersTo :content-by-bob ;
earmark:begins "0"^^xsd:nonNegativeInteger ;
earmark:ends "5"^^xsd:nonNegativeInteger ;
# provenance information
prov:wasGeneratedBy :insertion-by-bob ;
prov:generatedAtTime
"2013-07-03T04:15:00Z"^^xsd:dateTime .
:insertion-by-bob a echo:ItemInsertion ;
prov:wasAssociatedWith :bob .
insertion
:p2 # Second element ’p’ of Bob’s ‘section‘
prov:wasInvalidatedBy :deletion-by-charles
prov:invalidatedAtTime
"2013-07-05T03:33:00Z"^^xsd:dateTime .
:deletion-by-charles a echo:ItemDeletion ;
prov:wasAssociatedWith :charles .
deletion
Splitting ranges up
• Daniel’s revision of the Alice’s
version, where Daniel decided
to substitute the string “me”
in the second p with its name
(i.e. the string “Daniel”)
• In EARMARK, this string
substitution (i.e. a deletion
plus an insertion) is possible
by defining four new ranges
• We use prov:wasDerivedFrom
statements between ranges to
describe (at an abstract level)
a more complex scenario of
overlapping markup between
the two versions
Querying the history of changes
• Since EARMARK is defined by means of Semantic Web technologies,
we can use already implemented standards such as SPARQL 1.1 to
query over the change tracking history of a certain EARMARK
document
✦ Return a new EARMARK document that contains only Bob’s version
CONSTRUCT { ?other ?p ?o . ?version ?pv ?ov . ?docuverse ?pd ?od } WHERE {
{ SELECT DISTINCT ?other ?version WHERE {
{ SELECT DISTINCT ?version WHERE {
?version a earmark:Element ; prov:wasGeneratedBy ?activity .
?activity a echo:VersionCreation ; prov:wasAssociatedWith :bob } }
?other (^co:itemContent?/^co:item)+ ?version } }
?version ?pv ?ov . ?other ?p ?o .
OPTIONAL { ?other a earmark:PointerRange ;
earmark:refersTo ?docuverse . ?docuverse ?pd ?od } }
✦ Select the textual content of all paragraphs removed by Charles
SELECT DISTINCT ?range WHERE {
?p a earmark:Element ; earmark:hasGeneralIdentifier "p"^^xsd:string ;
co:item/co:itemContent ?range ; prov:wasInvalidatedBy ?activity .
?range a earmark:PointerRange .
?activity a echo:ItemDeletion ; prov:wasAssociatedWith :charles }
Byte-counting EARMARK documents
• We used two documents
✦ The first document composed of seven different versions, named after the “Seven Dwarfs” for
recognizability and obtained by applying very common edits according to three authors
✦ The second document composed of seven different versions, named after the weekdays and
created by seven different authors when editing a very simple document
• We compared the size in bytes of consecutive versions of such documents
according to OpenDocument and OpenXML formats, and to EARMARK
linearised in six different formats:Turtle, RDF/XML, OWL/XML, N-Triples, HDT
and Manchester Syntax
Turtle seems the best
linearisation format
for EARMARK
HDT (compressed)
performances similar to
ODT and OOXML
Conclusions
• In this paper we presented a theoretical approach to track document changes
based on FRBR and provenance data
• We proposed one possible implementation of it through EARMARK, a
Semantic Web-aware meta-markup language that enables the definition of
multiple overlapping markup hierarchies representing different versions of the
same document
• We highlighted the main advantages and drawbacks in terms of querying and
storing such EARMARK documents
• In the future we plan to extend EChO so as to enable the description of
additional change tracking operations (e.g. swap, update)
• We also plan to experiment the effective use of translation mechanisms to
convert EARMARK documents with change tracking information into XML
formats, e.g. ODT and OOXML
Thanks for your attention
<end>
<end>revision of

Más contenido relacionado

Similar a Tracking Changes through EARMARK: a Theoretical Perspective and an Implementation

SWAP : A Dublin Core Application Profile for desribing scholarly works
SWAP : A Dublin Core Application Profile for desribing scholarly worksSWAP : A Dublin Core Application Profile for desribing scholarly works
SWAP : A Dublin Core Application Profile for desribing scholarly worksJulie Allinson
 
Building APIs in an easy way using API Platform
Building APIs in an easy way using API PlatformBuilding APIs in an easy way using API Platform
Building APIs in an easy way using API PlatformAntonio Peric-Mazar
 
Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?Oscar Corcho
 
BIBFRAME and OCLC Works: Defining Models and Discovering Evidence
BIBFRAME and OCLC Works: Defining Models and Discovering EvidenceBIBFRAME and OCLC Works: Defining Models and Discovering Evidence
BIBFRAME and OCLC Works: Defining Models and Discovering EvidenceOCLC
 
Repositories and the wider context
Repositories and the wider contextRepositories and the wider context
Repositories and the wider contextJulie Allinson
 
[Nuxeo World 2013] OPENING KEYNOTE - ERIC BARROCA, NUXEO CEO
[Nuxeo World 2013] OPENING KEYNOTE - ERIC BARROCA, NUXEO CEO[Nuxeo World 2013] OPENING KEYNOTE - ERIC BARROCA, NUXEO CEO
[Nuxeo World 2013] OPENING KEYNOTE - ERIC BARROCA, NUXEO CEONuxeo
 
ResearchSpace Platform in Use
ResearchSpace Platform in UseResearchSpace Platform in Use
ResearchSpace Platform in UseBarry Norton
 
Mind The Gap - Mapping a domain model to a RESTful API - OReilly SACon 2018, ...
Mind The Gap - Mapping a domain model to a RESTful API - OReilly SACon 2018, ...Mind The Gap - Mapping a domain model to a RESTful API - OReilly SACon 2018, ...
Mind The Gap - Mapping a domain model to a RESTful API - OReilly SACon 2018, ...Tom Hofte
 
Libraries and Linked Data: Looking to the Future (3)
Libraries and Linked Data: Looking to the Future (3)Libraries and Linked Data: Looking to the Future (3)
Libraries and Linked Data: Looking to the Future (3)ALATechSource
 
Data Designed for Discovery
Data Designed for DiscoveryData Designed for Discovery
Data Designed for DiscoveryOCLC
 
Dcap Ja Progmeet 2007 07 05
Dcap Ja Progmeet 2007 07 05Dcap Ja Progmeet 2007 07 05
Dcap Ja Progmeet 2007 07 05Julie Allinson
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Jane Stevenson
 
The Rhetoric of Research Objects
The Rhetoric of Research ObjectsThe Rhetoric of Research Objects
The Rhetoric of Research ObjectsCarole Goble
 
IFLA LIDASIG Open Session 2017: Introduction to Linked Data
IFLA LIDASIG Open Session 2017: Introduction to Linked DataIFLA LIDASIG Open Session 2017: Introduction to Linked Data
IFLA LIDASIG Open Session 2017: Introduction to Linked DataLars G. Svensson
 
Felix Sasaki - Value beyond content creation - Introducing ITS 2.0; soapconf ...
Felix Sasaki - Value beyond content creation - Introducing ITS 2.0; soapconf ...Felix Sasaki - Value beyond content creation - Introducing ITS 2.0; soapconf ...
Felix Sasaki - Value beyond content creation - Introducing ITS 2.0; soapconf ...soapconf
 
ontology.ppt
ontology.pptontology.ppt
ontology.pptPrerak10
 
A Dublin Core Application Profile for Scholarly Works (eprints)
A Dublin Core Application Profile for Scholarly Works (eprints)A Dublin Core Application Profile for Scholarly Works (eprints)
A Dublin Core Application Profile for Scholarly Works (eprints)Julie Allinson
 

Similar a Tracking Changes through EARMARK: a Theoretical Perspective and an Implementation (20)

Eprints Application Profile
Eprints Application ProfileEprints Application Profile
Eprints Application Profile
 
Timbuctoo 2 EASY
Timbuctoo 2 EASYTimbuctoo 2 EASY
Timbuctoo 2 EASY
 
Longwell final ppt
Longwell final pptLongwell final ppt
Longwell final ppt
 
SWAP : A Dublin Core Application Profile for desribing scholarly works
SWAP : A Dublin Core Application Profile for desribing scholarly worksSWAP : A Dublin Core Application Profile for desribing scholarly works
SWAP : A Dublin Core Application Profile for desribing scholarly works
 
Building APIs in an easy way using API Platform
Building APIs in an easy way using API PlatformBuilding APIs in an easy way using API Platform
Building APIs in an easy way using API Platform
 
Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?
 
BIBFRAME and OCLC Works: Defining Models and Discovering Evidence
BIBFRAME and OCLC Works: Defining Models and Discovering EvidenceBIBFRAME and OCLC Works: Defining Models and Discovering Evidence
BIBFRAME and OCLC Works: Defining Models and Discovering Evidence
 
Repositories and the wider context
Repositories and the wider contextRepositories and the wider context
Repositories and the wider context
 
[Nuxeo World 2013] OPENING KEYNOTE - ERIC BARROCA, NUXEO CEO
[Nuxeo World 2013] OPENING KEYNOTE - ERIC BARROCA, NUXEO CEO[Nuxeo World 2013] OPENING KEYNOTE - ERIC BARROCA, NUXEO CEO
[Nuxeo World 2013] OPENING KEYNOTE - ERIC BARROCA, NUXEO CEO
 
ResearchSpace Platform in Use
ResearchSpace Platform in UseResearchSpace Platform in Use
ResearchSpace Platform in Use
 
Mind The Gap - Mapping a domain model to a RESTful API - OReilly SACon 2018, ...
Mind The Gap - Mapping a domain model to a RESTful API - OReilly SACon 2018, ...Mind The Gap - Mapping a domain model to a RESTful API - OReilly SACon 2018, ...
Mind The Gap - Mapping a domain model to a RESTful API - OReilly SACon 2018, ...
 
Libraries and Linked Data: Looking to the Future (3)
Libraries and Linked Data: Looking to the Future (3)Libraries and Linked Data: Looking to the Future (3)
Libraries and Linked Data: Looking to the Future (3)
 
Data Designed for Discovery
Data Designed for DiscoveryData Designed for Discovery
Data Designed for Discovery
 
Dcap Ja Progmeet 2007 07 05
Dcap Ja Progmeet 2007 07 05Dcap Ja Progmeet 2007 07 05
Dcap Ja Progmeet 2007 07 05
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011
 
The Rhetoric of Research Objects
The Rhetoric of Research ObjectsThe Rhetoric of Research Objects
The Rhetoric of Research Objects
 
IFLA LIDASIG Open Session 2017: Introduction to Linked Data
IFLA LIDASIG Open Session 2017: Introduction to Linked DataIFLA LIDASIG Open Session 2017: Introduction to Linked Data
IFLA LIDASIG Open Session 2017: Introduction to Linked Data
 
Felix Sasaki - Value beyond content creation - Introducing ITS 2.0; soapconf ...
Felix Sasaki - Value beyond content creation - Introducing ITS 2.0; soapconf ...Felix Sasaki - Value beyond content creation - Introducing ITS 2.0; soapconf ...
Felix Sasaki - Value beyond content creation - Introducing ITS 2.0; soapconf ...
 
ontology.ppt
ontology.pptontology.ppt
ontology.ppt
 
A Dublin Core Application Profile for Scholarly Works (eprints)
A Dublin Core Application Profile for Scholarly Works (eprints)A Dublin Core Application Profile for Scholarly Works (eprints)
A Dublin Core Application Profile for Scholarly Works (eprints)
 

Más de University of Bologna

The Initiative for Open Citations and the OpenCitations Corpus
The Initiative for Open Citations and the OpenCitations CorpusThe Initiative for Open Citations and the OpenCitations Corpus
The Initiative for Open Citations and the OpenCitations CorpusUniversity of Bologna
 
A Simplified Agile Methodology for Ontology Development
A Simplified Agile Methodology for Ontology DevelopmentA Simplified Agile Methodology for Ontology Development
A Simplified Agile Methodology for Ontology DevelopmentUniversity of Bologna
 
Freedom for bibliographic references: OpenCitations arise
Freedom for bibliographic references: OpenCitations ariseFreedom for bibliographic references: OpenCitations arise
Freedom for bibliographic references: OpenCitations ariseUniversity of Bologna
 
A pattern-based ontology for describing publishing workflows
A pattern-based ontology for describing publishing workflowsA pattern-based ontology for describing publishing workflows
A pattern-based ontology for describing publishing workflowsUniversity of Bologna
 
Semantic lenses to bring digital and semantic publishing together
Semantic lenses to bring digital and semantic publishing togetherSemantic lenses to bring digital and semantic publishing together
Semantic lenses to bring digital and semantic publishing togetherUniversity of Bologna
 
Zeri e LODE
: Extracting the Zeri photo archive to Linked Open Data: formaliz...
Zeri e LODE
: Extracting the Zeri photo archive to Linked Open Data: formaliz...Zeri e LODE
: Extracting the Zeri photo archive to Linked Open Data: formaliz...
Zeri e LODE
: Extracting the Zeri photo archive to Linked Open Data: formaliz...University of Bologna
 
Characterising citations in scholarly articles: an experiment
Characterising citations in scholarly articles: an experimentCharacterising citations in scholarly articles: an experiment
Characterising citations in scholarly articles: an experimentUniversity of Bologna
 
Towards the automatic identification of the nature of citations
Towards the automatic identification of the nature of citationsTowards the automatic identification of the nature of citations
Towards the automatic identification of the nature of citationsUniversity of Bologna
 
The Live OWL Documentation Environment: a tool for the automatic generation o...
The Live OWL Documentation Environment: a tool for the automatic generation o...The Live OWL Documentation Environment: a tool for the automatic generation o...
The Live OWL Documentation Environment: a tool for the automatic generation o...University of Bologna
 
Scholarly publishing and Linked Data: describing roles, statuses, temporal an...
Scholarly publishing and Linked Data: describing roles, statuses, temporal an...Scholarly publishing and Linked Data: describing roles, statuses, temporal an...
Scholarly publishing and Linked Data: describing roles, statuses, temporal an...University of Bologna
 

Más de University of Bologna (14)

The Initiative for Open Citations and the OpenCitations Corpus
The Initiative for Open Citations and the OpenCitations CorpusThe Initiative for Open Citations and the OpenCitations Corpus
The Initiative for Open Citations and the OpenCitations Corpus
 
OpenCitations
OpenCitationsOpenCitations
OpenCitations
 
A Simplified Agile Methodology for Ontology Development
A Simplified Agile Methodology for Ontology DevelopmentA Simplified Agile Methodology for Ontology Development
A Simplified Agile Methodology for Ontology Development
 
FOOD: FOod in Open Data
FOOD: FOod in Open DataFOOD: FOod in Open Data
FOOD: FOod in Open Data
 
Freedom for bibliographic references: OpenCitations arise
Freedom for bibliographic references: OpenCitations ariseFreedom for bibliographic references: OpenCitations arise
Freedom for bibliographic references: OpenCitations arise
 
A pattern-based ontology for describing publishing workflows
A pattern-based ontology for describing publishing workflowsA pattern-based ontology for describing publishing workflows
A pattern-based ontology for describing publishing workflows
 
Semantic lenses to bring digital and semantic publishing together
Semantic lenses to bring digital and semantic publishing togetherSemantic lenses to bring digital and semantic publishing together
Semantic lenses to bring digital and semantic publishing together
 
Zeri e LODE
: Extracting the Zeri photo archive to Linked Open Data: formaliz...
Zeri e LODE
: Extracting the Zeri photo archive to Linked Open Data: formaliz...Zeri e LODE
: Extracting the Zeri photo archive to Linked Open Data: formaliz...
Zeri e LODE
: Extracting the Zeri photo archive to Linked Open Data: formaliz...
 
Characterising citations in scholarly articles: an experiment
Characterising citations in scholarly articles: an experimentCharacterising citations in scholarly articles: an experiment
Characterising citations in scholarly articles: an experiment
 
Towards the automatic identification of the nature of citations
Towards the automatic identification of the nature of citationsTowards the automatic identification of the nature of citations
Towards the automatic identification of the nature of citations
 
The Live OWL Documentation Environment: a tool for the automatic generation o...
The Live OWL Documentation Environment: a tool for the automatic generation o...The Live OWL Documentation Environment: a tool for the automatic generation o...
The Live OWL Documentation Environment: a tool for the automatic generation o...
 
Scholarly publishing and Linked Data: describing roles, statuses, temporal an...
Scholarly publishing and Linked Data: describing roles, statuses, temporal an...Scholarly publishing and Linked Data: describing roles, statuses, temporal an...
Scholarly publishing and Linked Data: describing roles, statuses, temporal an...
 
Dealing with Markup Semantics
Dealing with Markup SemanticsDealing with Markup Semantics
Dealing with Markup Semantics
 
Handling Markup Overlaps Using OWL
Handling Markup Overlaps Using OWLHandling Markup Overlaps Using OWL
Handling Markup Overlaps Using OWL
 

Último

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Último (20)

Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Tracking Changes through EARMARK: a Theoretical Perspective and an Implementation

  • 1. http://creativecommons.org/licenses/by-sa/3.0 Tracking changes through EARMARK: a theoretical perspective and an implementation Silvio Peroni – essepuntato@cs.unibo.it Francesco Poggi – fpoggi@cs.unibo.it FabioVitali – fabio@cs.unibo.it 1st InternationalWorkshop on Document Changes: Modeling, Detection, Storage andVisualization http://diff.cs.unibo.it/dchanges2013/ @ DocEng 2013, Florence, Italy - September 10, 2013
  • 2. Outline • Documents and their changes in time through FRBR • Change tracking and provenance data of XML documents • Defining multi-hierarchy documents through EARMARK • EARMARK Changes Ontology and its application • Querying and byte-counting • Conclusions
  • 3. Documents do change in time • Any creative act of a text ✦ starts from a particular draft made by someone at a certain time ✦ is then modified through consecutive revisions ✦ may end up being forked into different variants ✦ may be modified by additional editorial activities such as typo-fixing, shortening, restructuring, etc. • Importance of keeping tracks of changes ✦ Computer Science: to show how programming code or computational models evolve throughout the natural lifecycle of software development ✦ Philology: to tell the way in which variant copies of a same book overlap in time and content ✦ Scientific Publishing: to understand the entity and quality of the modifications and driving the final acceptance or rejection of a paper ✦ etc.
  • 4. About (textual) documents and their changes A document is more than the string it is composed of Alice produces an hand-written document on a piece of paper Bob produces a digital document through a word-processor “Hello world.” composed by the string “Hello world.” composed by the string different documents same string modifies Bob’s document producing a new version “Hey, hello world!” composed by the string Charles used by to certain extent same document different strings A document refers to different strings in time
  • 5. Introducing FRBR for change tracking • “Functional Requirements for Bibliographic Records (FRBR) is a conceptual entity-relationship model developed by the International Federation of Library Associations and Institutions (IFLA) that relates user tasks of retrieval and access in online library catalogues and bibliographic databases from a user’s perspective.” • According to FRBR, document is an overloaded word, and is better substituted by four different concepts called respectively ✦ Work, coupled with the concept of identity ✦ Expression, to record the evolution in time ✦ Manifestation, which specifies the form and format ✦ Item, identifying the concrete object from http://en.wikipedia.org/wiki/Functional_Requirements_for_Bibliographic_Records string of characters constituting the content of a document FRBR Expression layer Expressions never change in time : a distinct intellectual or artistic creation : the intellectual or artistic realisation of a work in the form of alpha-numeric, musical, or choreographic notation, sound, image, object, movement, etc., or any combination of such forms FRBR Work layer Bob “Hello world.” Charles “Hey, hello world!” the abstract conceptualisation of a document realisation of realisation of revision of Time
  • 6. What happens to XML-based markup documents? • What changes we have to keep track of among those that occur in XML documents, and what are the markup elements that, after edits and document changes, are ✦ directly affected ✦ hierarchically affected ✦ completely unaffected • Research questions: ✦ When a markup element E1 within a document version V1 changes in some way, e.g. by adding something to the text it contains, thereby generating document version V2, are the two instances of E1 in V1 and E2 in V2 to be considered actually the same element? ✦ In the case the aforementioned instances of E are to be considered different, is the difference meant to be propagated also to their ancestor elements?
  • 7. DEL <section> NEW Applying FRBR to XML markup Elements and text nodes as FRBR Expressions NEW <section> <p> <p> <em> Some interesting content. It was written by me. INS <em> very NEW <section> <p> <section> <p>Some <em>interesting</em> content.</p> <p>It was written by me.</p> </section> Alice <section> <p>Some <em>very interesting</em> content.</p> <p>It was written by me.</p> </section> Bob revised by <section> <p>Some <em>very interesting</em> content.</p> <p>It was written by me.</p> </section> Charles revised by has part revision of
  • 8. Who, What, How and When • An important part of change tracking operations involves keeping track of provenance information ✦ who made the modification ✦ what was modified ✦ how it was modified ✦ when it was modified • How do we keep track of all these data in practice? ✦ XML-based languages use workarounds to implement overlapping markup ✦ Other solutions? <section> <section> <p> <p> <em> very <section> Version made by Bob July 3, 2013, at 04:15 Text inserted by Bob July 3, 2013, at 04:15 Version made by Alice June 19, 2013, at 13:45 Markup deleted by Charles July 5, 2013, at 03:33 Version made by Charles July 5, 2013, at 03:33
  • 9. From theory to practice through The Extremely Annotational RDF Markup (EARMARK) is at the same time a markup meta-language and an ontology of (document) markup ✦ More expressive than XML – it allows to organise markup structures as graphs ✦ It makes easy to associate annotations to document items such as change tracking information – since an EARMARK document is a set of OWL assertions, all the markup items and text nodes are individuals of particular classes identified by an IRI ✦ Lot of tools available: - a Java API - frameworks to convert XML documents into EARMARK ones and vice versa more information at http://palindrom.es/phd/research/earmark
  • 10. Example of EARMARK document Linearised using Turtle Some interesting content. It was written by me. # Textual content of the document :content a earmark:StringDocuverse ; earmark:hasContent "Some interesting content. It was written by me."^^xsd:string . full Turtle source of the document available at http://www.essepuntato.it/2013/dchanges # String ’Some ’ :r1 a earmark:PointerRange ; earmark:refersTo :content ; earmark:begins "0"^^xsd:nonNegativeInteger ; earmark:ends "5"^^xsd:nonNegativeInteger . # String ‘interesting’ :r2 a earmark:PointerRange ... # String ’ content.’ :r3 a earmark:PointerRange ... # String ’It was written by me.’ :r4 a earmark:PointerRange ... <section> <p> # Element ‘section’ :section a earmark:Element ; earmark:hasGeneralIdentifier "section"^^xsd:string ; co:firstItem [ a co:ListItem ; co:itemContent :p1 ; co:nextItem [ a co:ListItem ; co:itemContent :p2 ] ] . <p> <em> # First element ’p’ :p1 a earmark:Element ; earmark:hasGeneralIdentifier "p"^^xsd:string ; co:firstItem [ a co:ListItem ; co:itemContent :r1 ; co:nextItem [ a co:ListItem ; co:itemContent :em ; co:nextItem [ a co:ListItem ; co:itemContent :r3 ] ] ] . ... and similarly for the other markup elements
  • 11. EARMARK Changes Ontology Extending EARMARK to manage change tracking information • The EARMARK Changes Ontology (EChO) extends the EARMARK ontology and includes the OWL 2 DL implementation of FRBR (http://purl.org/spar/frbr) and the Provenance Ontology (http://www.w3.org/ns/prov#), so as to keep track of all the changes and provenance data related to different versions of the same document ✦ the EARMARK items (docuverses, ranges and markup items) to model the structure of the different document versions and to store them all within a single EARMARK document ✦ frbr:revisionOf to indicate that a markup item is a revision of another ✦ prov:wasDerivedFrom to indicate that a range is actually derived from another one defined in a previous version of the document ✦ prov:wasGeneratedBy (coupled with instances of echo:VersionCreation and echo:ItemInsertion) and prov:generatedAtTime to indicate that a particular markup item, a range or a whole document version has been created at a certain time ✦ prov:wasInvalidatedBy (coupled with instances of echo:VersionRemoval and echo:ItemDeletion) and prov:invalidatedAtTime to indicate that a particular markup item, a range or a whole document version has been deleted at a certain time ✦ prov:wasAssociatedWith to indicate the agent involved in the activity of generation/invalidation of a certain item
  • 12. Version creation • Who:Alice • What: document version (implicitly identified by the document element of the markup document :section) • How: creation • When: June 19, 2013 at 13:45 :section # Provenance information prov:wasGeneratedBy :creation-by-alice ; prov:generatedAtTime "2013-06-19T13:45:00Z"^^xsd:dateTime . # Activity of creation of a new version :creation-by-alice a echo:VersionCreation ; prov:wasAssociatedWith :alice .
  • 13. Revision, insertion and deletion • Bob’s revision of Alice’s version was made on July 3, 2013, at 04:15, and concerns only the insertion of the string “very ” as first textual node of the element em • Charles’ revision of Bob’s version was made on July 5, 2013, at 03:33, and deletes the Bob’s second p of section # Element ’section ’ by Bob :section-by-bob a earmark:Element ; earmark:hasGeneralIdentifier "section"^^xsd:string ; co:firstItem [ a co:ListItem ; co:itemContent :p1-by-bob ; co:nextItem [ a co:ListItem ; co:itemContent :p2 ] ] ; # relation with previous version frbr:revisionOf :section ; # provenance information prov:wasGeneratedBy :creation-by-bob ; prov:generatedAtTime "2013-07-03T04:15:00Z"^^xsd:dateTime . :creation-by-bob a echo:VersionCreation ; prov:wasAssociatedWith :bob . revision # New content of the document :content-by-bob a earmark:StringDocuverse ; earmark:hasContent "very "^^xsd:string . # New string ’very ’ :r5 a earmark:PointerRange ; earmark:refersTo :content-by-bob ; earmark:begins "0"^^xsd:nonNegativeInteger ; earmark:ends "5"^^xsd:nonNegativeInteger ; # provenance information prov:wasGeneratedBy :insertion-by-bob ; prov:generatedAtTime "2013-07-03T04:15:00Z"^^xsd:dateTime . :insertion-by-bob a echo:ItemInsertion ; prov:wasAssociatedWith :bob . insertion :p2 # Second element ’p’ of Bob’s ‘section‘ prov:wasInvalidatedBy :deletion-by-charles prov:invalidatedAtTime "2013-07-05T03:33:00Z"^^xsd:dateTime . :deletion-by-charles a echo:ItemDeletion ; prov:wasAssociatedWith :charles . deletion
  • 14. Splitting ranges up • Daniel’s revision of the Alice’s version, where Daniel decided to substitute the string “me” in the second p with its name (i.e. the string “Daniel”) • In EARMARK, this string substitution (i.e. a deletion plus an insertion) is possible by defining four new ranges • We use prov:wasDerivedFrom statements between ranges to describe (at an abstract level) a more complex scenario of overlapping markup between the two versions
  • 15. Querying the history of changes • Since EARMARK is defined by means of Semantic Web technologies, we can use already implemented standards such as SPARQL 1.1 to query over the change tracking history of a certain EARMARK document ✦ Return a new EARMARK document that contains only Bob’s version CONSTRUCT { ?other ?p ?o . ?version ?pv ?ov . ?docuverse ?pd ?od } WHERE { { SELECT DISTINCT ?other ?version WHERE { { SELECT DISTINCT ?version WHERE { ?version a earmark:Element ; prov:wasGeneratedBy ?activity . ?activity a echo:VersionCreation ; prov:wasAssociatedWith :bob } } ?other (^co:itemContent?/^co:item)+ ?version } } ?version ?pv ?ov . ?other ?p ?o . OPTIONAL { ?other a earmark:PointerRange ; earmark:refersTo ?docuverse . ?docuverse ?pd ?od } } ✦ Select the textual content of all paragraphs removed by Charles SELECT DISTINCT ?range WHERE { ?p a earmark:Element ; earmark:hasGeneralIdentifier "p"^^xsd:string ; co:item/co:itemContent ?range ; prov:wasInvalidatedBy ?activity . ?range a earmark:PointerRange . ?activity a echo:ItemDeletion ; prov:wasAssociatedWith :charles }
  • 16. Byte-counting EARMARK documents • We used two documents ✦ The first document composed of seven different versions, named after the “Seven Dwarfs” for recognizability and obtained by applying very common edits according to three authors ✦ The second document composed of seven different versions, named after the weekdays and created by seven different authors when editing a very simple document • We compared the size in bytes of consecutive versions of such documents according to OpenDocument and OpenXML formats, and to EARMARK linearised in six different formats:Turtle, RDF/XML, OWL/XML, N-Triples, HDT and Manchester Syntax Turtle seems the best linearisation format for EARMARK HDT (compressed) performances similar to ODT and OOXML
  • 17. Conclusions • In this paper we presented a theoretical approach to track document changes based on FRBR and provenance data • We proposed one possible implementation of it through EARMARK, a Semantic Web-aware meta-markup language that enables the definition of multiple overlapping markup hierarchies representing different versions of the same document • We highlighted the main advantages and drawbacks in terms of querying and storing such EARMARK documents • In the future we plan to extend EChO so as to enable the description of additional change tracking operations (e.g. swap, update) • We also plan to experiment the effective use of translation mechanisms to convert EARMARK documents with change tracking information into XML formats, e.g. ODT and OOXML
  • 18. Thanks for your attention <end> <end>revision of