The ready availability of data is creating more opportunities to re-use them in new applications and analyses. Most of these data are not in the format users want, are usually heterogeneous and highly dynamic, and therefore need to be transformed before they can be re-purposed. Interactive data transformation (IDT) tools are becoming readily available and lower the barriers to such transformation efforts. This paper describes a principled way to capture the data lineage of interactive data transformation processes. We provide a formal model of IDT, its mapping to a provenance representation, and its implementation and validation on Google Refine. Providing the sequence of data transformation steps allows data quality to be assessed and ensures portability between IDT and other data transformation platforms. The proposed model showed a high level of coverage against a set of requirements used for evaluating systems that provide provenance management solutions.
Capturing Interactive Data Transformation Operations using Provenance Workflows
1. Digital Enterprise Research Institute www.deri.ie
Capturing interactive data transformation
operations using provenance workflows
Tope Omitola, Andre Freitas, Edward Curry, Sean
O'Riain, Nicholas Gibbins and Nigel Shadbolt
SWPM Workshop 28.05.2012, Herakleion, Crete
Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
2. Outline
Motivation
Interactive data transformations (IDTs)
IDT & Provenance
Modelling IDTs
Provenance Representation
Provenance Capture
Case Study
Conclusion
3. Motivation
Dataspaces:
High number of heterogeneous data sources
Complex data transformation environment
Need for both repeatable data transformations and once-
off transformations
Traditional ETL approaches for data
transformation/integration:
Based on scripting/programming
Focus on repeatable data transformation processes
4. Interactive Data Transformation (IDTs)
Based on user interaction paradigms for user
creation of data transformations
Explores GUI elements mapping to data
transformation operations
Instant feedback of each iteration
Complementary to existing ETL tools
Lowers the barriers to data transformation for
non-programmers (reduces programming effort)
Example platforms: Google Refine, Potter's Wheel,
Wrangler
6. Challenges
How to model IDTs?
Facilitating the reuse of previous IDTs
Representing IDTs
Provenance
Making IDT platforms provenance-aware
Enabling transportability across IDT and ETL
platforms
7. IDT & Provenance
Provenance supports representation of interactive
data transformations
Output: a provenance descriptor which shows the
relationship between the inputs, the outputs, and
the applied transformation operations
Both retrospective and prospective provenance
8. IDT
IDT model
Formal model (Algebra for IDT)
Provenance representation
Provenance capture of IDTs
9. IDT Model: Core Elements
Schema and instance data
Set of predefined operations
GUI elements mapping to predefined operations
User actions
Operation selection
Parameter selection
Operation composition (workflow)
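The core elements listed above can be pictured with a small Python sketch. It is illustrative only: the GUI identifiers (`menu.trim`, `menu.rename`), the two operations, and the `run_workflow` helper are assumptions made for this example and are not taken from Google Refine or from the authors' model.

```python
from functools import reduce
from typing import Callable, Dict, List, Tuple

Rows = List[Dict[str, str]]                      # instance data: one dict per row
Operation = Callable[[Rows, Dict[str, str]], Rows]


def rename_column(rows: Rows, params: Dict[str, str]) -> Rows:
    """Predefined operation (hypothetical): rename one column."""
    old, new = params["old"], params["new"]
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]


def trim_column(rows: Rows, params: Dict[str, str]) -> Rows:
    """Predefined operation (hypothetical): strip whitespace in one column."""
    col = params["column"]
    return [{**row, col: row[col].strip()} for row in rows]


# GUI elements mapped to predefined operations (identifiers are made up).
GUI_TO_OPERATION: Dict[str, Operation] = {
    "menu.rename": rename_column,
    "menu.trim": trim_column,
}

# User actions: an operation selection plus a parameter selection, in order.
user_actions: List[Tuple[str, Dict[str, str]]] = [
    ("menu.trim", {"column": "name"}),
    ("menu.rename", {"old": "name", "new": "winner"}),
]


def run_workflow(rows: Rows, actions: List[Tuple[str, Dict[str, str]]]) -> Rows:
    """Operation composition: apply each selected operation to the previous result."""
    return reduce(lambda data, a: GUI_TO_OPERATION[a[0]](data, a[1]), actions, rows)


result = run_workflow([{"name": " John Wayne "}], user_actions)
```

Keeping the user actions as plain data (element identifier plus parameters) is what makes the sequence replayable on another dataset.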
11. Formalizing the mapping from IDT to
Provenance
Definition 1: A provenance-based interactive data
transformation engine consists of a set of
transformations (or activities) on a set of datasets,
generating outputs in the form of other datasets or
events which may trigger further transformations
Definition 2: An interactive data transformation
event consists of the input dataset, the output
dataset(s), the applied transformation function,
and the time the transformation took place
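Definition 2 can be read as a small record type. The Python sketch below is one possible rendering, not the authors' formalization: the class name `TransformationEvent`, its field names, the `apply_transformation` helper, and the `trim_cells` transformation are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Tuple

Dataset = List[str]  # assumption: a dataset is modelled as a list of cell values


@dataclass(frozen=True)
class TransformationEvent:
    """One interactive data transformation event (after Definition 2)."""
    input_dataset: Tuple[str, ...]
    output_datasets: Tuple[Tuple[str, ...], ...]
    function_name: str
    timestamp: datetime


def apply_transformation(
    data: Dataset,
    func: Callable[[Dataset], Dataset],
    when: datetime,
) -> TransformationEvent:
    """Apply func to data and record input, output, function and time."""
    result = func(data)
    return TransformationEvent(
        input_dataset=tuple(data),
        output_datasets=(tuple(result),),
        function_name=func.__name__,
        timestamp=when,
    )


def trim_cells(data: Dataset) -> Dataset:
    """Hypothetical transformation: strip whitespace around each cell."""
    return [cell.strip() for cell in data]


event = apply_transformation(
    [" a ", "b "], trim_cells, datetime(2012, 5, 28, tzinfo=timezone.utc)
)
```

The record is frozen so that a captured event cannot be altered after the fact, which seems a natural property for provenance data.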
12. Formalizing the mapping from IDT to
Provenance
Definition 3: A run is a function from time to
dataset(s) and the transformation applied to those
dataset(s)
Definition 4: A trace is the sequence of pairs of a
run and the time the run was made
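One possible reading of Definitions 3 and 4 is sketched below in Python. It assumes time is a discrete step counter and that a trace pairs each time with the value of the run at that time; the names `log`, `run`, and `trace`, and the sample data, are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Dataset = Tuple[str, ...]
Step = Tuple[Dataset, str]  # (dataset, name of the transformation applied to it)

# Assumption: time is a discrete integer step counter.
log: Dict[int, Step] = {
    0: (("  a", "b  "), "trim"),
    1: (("a", "b"), "uppercase"),
}


def run(t: int) -> Step:
    """Definition 3: map a time to the dataset and the transformation applied to it."""
    return log[t]


def trace(run_fn: Callable[[int], Step], times: List[int]) -> List[Tuple[Step, int]]:
    """Definition 4: the sequence of (run, time) pairs, ordered by time."""
    return [(run_fn(t), t) for t in sorted(times)]


history = trace(run, [1, 0])
```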
13. Provenance Representation
Proposed in Representing Interoperable Provenance
Descriptions for ETL Workflows
Three-layered provenance model:
Open Provenance Model Vocabulary Layer
Cogs ETL Provenance Vocabulary
Domain-Specific Model Layer
Linked Data standards
16. Case study
Implementation over the Google Refine (GR) platform
Example descriptor
@prefix grf: <http://127.0.0.1:3333/project/1402144365904/> .
# Process (cogs:programUsed is the mapping to the actual program)
grf:MassCellChange-1092380975 rdf:type opmv:Process, cogs:ColumnOperation, cogs:Transformation ;
    cogs:operationName "MassCellChange"^^xsd:string ;
    cogs:programUsed "com.google.refine.operations.cell.MassEditOperation"^^xsd:string ;
    rdfs:label "Mass edit 1 cells in column ==List of winners=="^^xsd:string .
# Input artifact
grf:MassCellChange-1092380975/1_0 rdf:type opmv:Artifact ;
    rdfs:label "* '''1955 [[Meena Kumari]]'[[Parineeta (1953 film)|Parineeta]]''''' as '''Lolita'''"^^xsd:string .
# Output artifact
grf:MassCellChange-1092380975/1_1 rdf:type opmv:Artifact ;
    rdfs:label "* '''John Wayne'''"^^xsd:string .
# Workflow structure
grf:MassCellChange-1092380975/1_1 opmv:wasDerivedFrom grf:MassCellChange-1092380975/1_0 .
grf:MassCellChange-1092380975 opmv:used grf:MassCellChange-1092380975/1_0 .
grf:MassCellChange-1092380975/1_1 opmv:wasGeneratedBy grf:MassCellChange-1092380975 .
grf:MassCellChange-1092380975/1_1 opmv:wasGeneratedAt "2011-11-16T11:2:14"^^xsd:dateTime .
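The workflow-structure statements in the descriptor follow a fixed pattern: the output is derived from the input, the process used the input, and the output was generated by the process. The Python sketch below shows how such statements could be emitted for one operation. It is not the Google Refine extension's code: `workflow_triples` is a hypothetical helper, the identifiers are copied from the example descriptor, and the output is plain text that is not validated as RDF.

```python
from typing import List


def workflow_triples(process: str, input_artifact: str, output_artifact: str) -> List[str]:
    """Return the three OPMV statements linking a process to its input and output."""
    return [
        f"{output_artifact} opmv:wasDerivedFrom {input_artifact} .",
        f"{process} opmv:used {input_artifact} .",
        f"{output_artifact} opmv:wasGeneratedBy {process} .",
    ]


# Identifiers taken from the example descriptor.
process = "grf:MassCellChange-1092380975"
triples = workflow_triples(process, f"{process}/1_0", f"{process}/1_1")
for line in triples:
    print(line)
```

A real implementation would more likely build the graph with an RDF library so that identifiers and literals are escaped and serialized correctly.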
17. Conclusion
The proposed approach has low impact on the
existing IDT process
Provenance representation supports different data
models
Preliminary implementation of a Google Refine
provenance extension