Representing Interoperable Provenance Descriptions for ETL Workflows
  André Freitas, Benedikt Kämpgen, Joao Gabriel Oliveira, Seán O’Riain, Edward Curry
  The role of Semantic Web in Provenance Management, Extended Semantic Web Conference 2012
  28 May 2012

  Institute of Applied Informatics and Formal Description Methods (AIFB)
  KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
  www.kit.edu

Motivation

        Decision support in increasingly complex and heterogeneous data environments (dataspaces, Linked Open Data)
        Extract-Transform-Load (ETL) workflows are an inherent part of data analysis
        Challenges:
              Management of complex ETL workflows
              Information quality, trust




2   28 May 2012   B. Kämpgen – Representing Interoperable Provenance Descriptions for ETL Workflows    Institute of Applied Informatics and
                                                                                                      Formal Description Methods (AIFB)
Problem

    Sustainability report (carbon dioxide emissions in kg)

                               2009       2010
    printing emissions          600        503
    paper usage               4 165      3 968
    travel emissions        534 000    429 193
    commute emissions           456        391

    ETL workflow steps:
        1. Lookup printer log file (20 sec)
        2. Parse to RDF (30 sec)
        3. Filter for 2010 (1 sec)
        4. Aggregate over people (1 sec)

Problem

    Sustainability report (carbon dioxide emissions in kg)

                               2009       2010
    printing emissions          600        503
    paper usage               4 165      3 968
    travel emissions        534 000    429 193
    commute emissions           456        391

    ETL workflow steps (first workflow):
        1. Crawl from RDFa on website (1 h)
        2. Apply constant factor (1 sec)

    ETL workflow steps (second workflow):
        1. Extract from travel form DB (20 sec)
        2. Parse from CSV to RDF (30 sec)
        3. Aggregate over people (1 sec)
        4. Filter for 2010 (1 sec)

Solution: Provenance information about
    ETL workflows
        Prospective provenance: representation of ETL workflow
       at design time
        Retrospective provenance: representation of ETL workflow
       after execution

        Applications of provenance information for ETL workflows
              Documentation (reproducibility and reuse)
              Data quality assessment (trustworthiness)
              Management (consistency-checking, debugging and semantic
              reconciliation)
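
The difference between the two kinds of provenance, and the consistency-checking application, can be sketched in a few lines of Python. This is only an illustration: the class and function names are hypothetical and not part of the presented model, the step names and durations come from the Problem slide, and the missing fourth execution record is invented to give the check something to find.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlannedStep:
    """Prospective provenance: a step as specified at design time."""
    name: str
    operation: str


@dataclass
class ExecutedStep:
    """Retrospective provenance: what was recorded when a step ran."""
    plan: PlannedStep
    duration_sec: float


def undocumented_steps(plan: List[PlannedStep], trace: List[ExecutedStep]) -> List[str]:
    """Consistency check: planned steps that have no execution record."""
    executed = {step.plan.name for step in trace}
    return [p.name for p in plan if p.name not in executed]


# Design-time description of the workflow (prospective).
plan = [
    PlannedStep("lookup", "Lookup printer log file"),
    PlannedStep("parse", "Parse to RDF"),
    PlannedStep("filter", "Filter for 2010"),
    PlannedStep("aggregate", "Aggregate over people"),
]

# Execution records (retrospective); the last step is left out on purpose.
trace = [
    ExecutedStep(plan[0], 20),
    ExecutedStep(plan[1], 30),
    ExecutedStep(plan[2], 1),
]

missing = undocumented_steps(plan, trace)
print(missing)
```

Keeping the plan and the trace as separate structures is what makes such a comparison possible at all.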




Outline

        Motivation & Problem
        Gap of ETL Descriptions
        Interoperable ETL Provenance Model
        Case Study
        Conclusions




Gap of ETL Descriptions (1)

    Conceptual modelling of ETL workflows
        Provenance models
        Conceptual modelling using ontologies
        Becker and Ghedini (2005)
        CWM, PMML, BPMN, BPEL + ontologies
        Data Mining Ontology (2009)

    Formal models of ETL workflows
        Davidson and Buneman (1998)
        Galhardas et al. (2001)
        Cui and Widom (2003)
        Simmhan et al. (2005)

    Gaps
        Provenance representation from an ETL perspective
        Semantic interoperability across different ETL applications
        Usability and ontological commitment

    Proposed: Interoperable ETL provenance model

Gap of ETL Descriptions (2)

        Common ETL applications such as Kapow Software, Pentaho Data Integration,
        Google Refine and Yahoo Pipes either
              do not create and use provenance information, or
              do not support sharing and integrating such provenance information




Outline

         Motivation & Problem
         Gap of ETL Descriptions
         Interoperable ETL Provenance Model
               Requirements Analysis
               High-level approach
               Cogs: Linked Data vocabulary
               Requirements Coverage Analysis
         Case Study
         Conclusions


Requirements Analysis

    Gaps addressed by each requirement:
        (A) Provenance representation from an ETL perspective
        (B) Semantic interoperability across different ETL platforms
        (C) Usability and ontological commitment

    Requirement                                     (A)   (B)   (C)
    Prospective and retrospective descriptions      +     +
    Separation of concerns                                +
    Common terminology                              +     +
    Terminological completeness                     +     +
    Lightweight ontology structure                              +
    Availability of different abstraction levels          +     +
    Data representation independency                            +
    Accessibility                                         +     +
    Decentralization                                      +     +

Interoperable Provenance Model for ETL Workflows

    High-level approach
        reuse of the OPM Vocabulary (OPMV)
        workflow structure as abstract provenance model
        creation of Cogs, an RDF vocabulary for representing ETL provenance
        can be extended by domain-specific models
        use of the Linked Data principles for representing provenance descriptors

    [Figure: Three-layered Provenance Model]

Open Provenance Model Vocabulary (OPMV)
         Community-built provenance model
         Simple workflow structure (processes, artifacts, agents)
         Designed to be a minimal level of provenance interoperability
         Designed to be extensible
         ETL and provenance share workflow-level semantics
         http://open-biomed.sourceforge.net/opmv/ns.html
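
To make the process/artifact/agent structure concrete, here is a minimal sketch that stores an OPMV-style description as plain (subject, predicate, object) tuples and walks it back from an output to its inputs. The `http://example.org/etl/` resources are made up for illustration, and the OPMV namespace and property names (`used`, `wasGeneratedBy`, `wasControlledBy`) are written as I understand them from the OPMV specification, so they should be checked against the vocabulary before reuse. A real deployment would more likely use an RDF library than Python sets.

```python
OPMV = "http://purl.org/net/opmv/ns#"          # assumed OPMV namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/etl/"                 # hypothetical resources

# One ETL step described with the three OPMV building blocks.
triples = {
    (EX + "printerLog", RDF_TYPE, OPMV + "Artifact"),
    (EX + "printerLogRdf", RDF_TYPE, OPMV + "Artifact"),
    (EX + "parseToRdf", RDF_TYPE, OPMV + "Process"),
    (EX + "etlTool", RDF_TYPE, OPMV + "Agent"),
    (EX + "parseToRdf", OPMV + "used", EX + "printerLog"),
    (EX + "printerLogRdf", OPMV + "wasGeneratedBy", EX + "parseToRdf"),
    (EX + "parseToRdf", OPMV + "wasControlledBy", EX + "etlTool"),
}


def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def inputs_behind(artifact):
    """Artifacts used by the processes that generated `artifact`."""
    result = []
    for process in objects(artifact, OPMV + "wasGeneratedBy"):
        result.extend(objects(process, OPMV + "used"))
    return result


sources = inputs_behind(EX + "printerLogRdf")
print(sources)
```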




RDF vocabulary for representing ETL elements
        Complementary vocabulary for expressing the
        elements present in an ETL workflow based on
            ETL/data transformation tools (Pentaho Data
            Integration, Google Refine)
            Concepts and structures from the ETL literature.
         https://sites.google.com/site/cogsvocab/




Cogs – OPMV workflow extension
         The representation of
        nested workflows
        allows different
        abstraction levels
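
One way to picture nested workflows is as a tree in which a parent workflow stands for its sub-steps. The sketch below is an assumption-laden illustration rather than the Cogs representation itself: the `Workflow` class, its fields and the parent name "Printer log ETL" are hypothetical, while the four step names and durations are taken from the Problem slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Workflow:
    name: str
    duration_sec: float = 0.0                      # own duration, used by leaf steps
    children: List["Workflow"] = field(default_factory=list)

    def total_duration(self) -> float:
        """Duration of this node plus all nested steps."""
        return self.duration_sec + sum(c.total_duration() for c in self.children)

    def view(self, depth: int) -> List[str]:
        """Step names visible at a given abstraction level (0 = top only)."""
        if depth == 0 or not self.children:
            return [self.name]
        names: List[str] = []
        for child in self.children:
            names.extend(child.view(depth - 1))
        return names


etl = Workflow("Printer log ETL", children=[
    Workflow("Lookup printer log file", 20),
    Workflow("Parse to RDF", 30),
    Workflow("Filter for 2010", 1),
    Workflow("Aggregate over people", 1),
])

print(etl.view(0))           # coarse view: the workflow as a single process
print(etl.view(1))           # detailed view: the individual steps
print(etl.total_duration())  # duration rolled up from the nested steps
```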




Cogs – Structure
         Taxonomy of ETL elements mapping
        to provenance processes and artifacts
         High-level classes:
               cogs:Execution, e.g., ScheduledJob
               cogs:State, e.g., Running
               opmv:Process
                     cogs:Extraction, e.g., Parsing
                     cogs:Transformation, e.g., RegexFilter
                     cogs:Loading, e.g., IncrementalLoad
               opmv:Artifact
                     cogs:Object, e.g., CSV File
               cogs:Layer, e.g., StagingArea
         Size of Cogs: 151 classes, 17 properties
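
The taxonomy above can be read as a subclass hierarchy that ties each ETL element back to an OPMV process or artifact. The following sketch encodes only the classes named on this slide; the exact identifiers of the example classes (for instance `cogs:CSVFile`) are guesses at how the slide's examples would be spelled, and the helper functions are hypothetical.

```python
# child class -> parent class, following the slide's indentation
SUBCLASS_OF = {
    "cogs:Extraction": "opmv:Process",
    "cogs:Transformation": "opmv:Process",
    "cogs:Loading": "opmv:Process",
    "cogs:Object": "opmv:Artifact",
    # example classes from the slide; identifiers are assumed spellings
    "cogs:Parsing": "cogs:Extraction",
    "cogs:RegexFilter": "cogs:Transformation",
    "cogs:IncrementalLoad": "cogs:Loading",
    "cogs:CSVFile": "cogs:Object",
}


def ancestors(cls):
    """Chain of superclasses from the nearest parent up to the root."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(cls, target):
    """True if `cls` equals `target` or is a (transitive) subclass of it."""
    return cls == target or target in ancestors(cls)


print(ancestors("cogs:RegexFilter"))
print(is_a("cogs:CSVFile", "opmv:Artifact"))
```

Because every ETL-specific class resolves to an OPMV class, a consumer that only understands OPMV can still interpret a Cogs description at the generic level.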
Requirements Coverage Analysis

    Requirement                                     OPMV  Cogs  LD principles
    Prospective and retrospective descriptions      +     +
    Separation of concerns                          +     +
    Common terminology                              +     +
    Terminological completeness                     +     +     +
    Lightweight ontology structure                  +     +
    Availability of different abstraction levels          +
    Data representation independency                +     +     +
    Accessibility                                   +           +
    Decentralization                                            +

Case Study – Sustainability Reporting

         ETL over heterogeneous data sources (e.g., log files, survey results, travel request DB, RDF)

    Sustainability report (carbon dioxide emissions in kg)

                               2009       2010
    printing emissions          600        503
    paper usage               4 165      3 968
    travel emissions        534 000    429 193
    commute emissions           456        391

Case Study – Architecture with
     Provenance-aware ETL Applications




Case Study – Sustainability Report
     Values




Case Study – Provenance Descriptor
     Visualization




Case Study – Provenance Descriptor Visualization

    [Figure: provenance descriptor visualization; labelled steps include "1. Lookup" and "4. Aggregation"]

Case Study – Possible Queries
         OPMV
               What are the data artifacts, processes and agents behind this data
               value?
               When were the processes executed, and for how long?
         OPMV + Cogs
               How long did all lookups take?
               What scripts have been used to transform the data into RDF?
               To which values have constant factors been applied?
               Which aggregation functions were used to calculate this indicator?
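
The step from generic to ETL-specific questions can be illustrated with a small in-memory example. With only start and end times per process, one can ask how long everything took; once each process also carries an ETL-specific type, a question like "How long did all lookups take?" becomes a simple filter. The record layout, the `kind` labels and the start timestamp are invented for this sketch and are not Cogs class names; the durations follow the Problem slide.

```python
from datetime import datetime, timedelta

T0 = datetime(2012, 1, 1, 9, 0, 0)  # invented start time


def record(pid, kind, offset_sec, duration_sec):
    """Build one process record with start and end timestamps."""
    start = T0 + timedelta(seconds=offset_sec)
    return {
        "id": pid,
        "kind": kind,  # hypothetical ETL-specific type label
        "started": start,
        "ended": start + timedelta(seconds=duration_sec),
    }


processes = [
    record("p1", "Lookup", 0, 20),
    record("p2", "Parsing", 20, 30),
    record("p3", "Filter", 50, 1),
    record("p4", "Aggregation", 51, 1),
]


def duration_sec(process):
    return (process["ended"] - process["started"]).total_seconds()


def total_duration(kind=None):
    """Sum of durations, optionally restricted to one kind of process."""
    return sum(duration_sec(p) for p in processes
               if kind is None or p["kind"] == kind)


all_seconds = total_duration()             # answerable from timestamps alone
lookup_seconds = total_duration("Lookup")  # needs the ETL-specific type
print(all_seconds, lookup_seconds)
```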




Conclusions

    Gaps addressed:
        Provenance representation from an ETL perspective
        Semantic interoperability across different ETL applications
        Usability and ontological commitment

    Evaluation in a small case study
    For a full evaluation of the interoperability benefits, the model needs to be adopted in provenance-aware ETL applications.
    Starting point: provenance-aware Google Refine using Cogs.

                                                            Thanks!





Schema-agnositc queries over large-schema databases: a distributional semanti...
 
A Semantic Web Platform for Automating the Interpretation of Finite Element ...
A Semantic Web Platform for Automating the Interpretation of Finite Element ...A Semantic Web Platform for Automating the Interpretation of Finite Element ...
A Semantic Web Platform for Automating the Interpretation of Finite Element ...
 
How Semantic Technologies can help to cure Hearing Loss?
How Semantic Technologies can help to cure Hearing Loss?How Semantic Technologies can help to cure Hearing Loss?
How Semantic Technologies can help to cure Hearing Loss?
 
Towards a Distributional Semantic Web Stack
Towards a Distributional Semantic Web StackTowards a Distributional Semantic Web Stack
Towards a Distributional Semantic Web Stack
 
On the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study
On the Semantic Mapping of Schema-agnostic Queries: A Preliminary StudyOn the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study
On the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study
 
Talking to your Data: Natural Language Interfaces for a schema-less world (Ke...
Talking to your Data: Natural Language Interfaces for a schema-less world (Ke...Talking to your Data: Natural Language Interfaces for a schema-less world (Ke...
Talking to your Data: Natural Language Interfaces for a schema-less world (Ke...
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)
 

Representing Interoperable Provenance Descriptions for ETL Workflows

  • 1. Representing Interoperable Provenance Descriptions for ETL Workflows
      André Freitas, Benedikt Kämpgen, Joao Gabriel Oliveira, Seán O’Riain, Edward Curry
      The role of Semantic Web in Provenance Management, Extended Semantic Web Conference 2012
      28 May 2012
      Institute of Applied Informatics and Formal Description Methods (AIFB)
      KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
      www.kit.edu
  • 2. Motivation
      Decision support on more complex and heterogeneous data environments (dataspaces, Linked Open Data)
      Extract-Transform-Load (ETL) workflows are an inherent part of data analysis
      Challenges:
        Management of complex ETL workflows
        Information quality, trust
      2 28 May 2012 B. Kämpgen – Representing Interoperable Provenance Descriptions for ETL Workflows Institute of Applied Informatics and Formal Description Methods (AIFB)
  • 3. Problem
      ETL workflow steps:
        1. Lookup printer log file – 20sec
        2. Parse to RDF – 30sec
        3. Filter for 2010 – 1sec
        4. Aggregate over people – 1sec
      Sustainability report (carbon dioxide emission by kg):
                               2009       2010
        printing emissions      600        503
        paper usage           4 165      3 968
        travel emissions    534 000    429 193
        commute emissions       456        391
  • 4. Problem
      First ETL workflow:
        1. Extract from travel form DB – 20sec
        2. Parse from CSV to RDF – 30sec
        3. Aggregate over people – 1sec
        4. Filter for 2010 – 1sec
      Second ETL workflow:
        1. Crawl from RDFa on website – 1h
        2. Apply constant factor – 1sec
      Sustainability report (carbon dioxide emission by kg):
                               2009       2010
        printing emissions      600        503
        paper usage           4 165      3 968
        travel emissions    534 000    429 193
        commute emissions       456        391
  • 5. Solution: Provenance information about ETL workflows
      Prospective provenance: representation of ETL workflow at design time
      Retrospective provenance: representation of ETL workflow after execution
      Applications of provenance information for ETL workflows:
        Documentation (reproducibility and reuse)
        Data quality assessment (trustworthiness)
        Management (consistency-checking, debugging and semantic reconciliation)
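The distinction between prospective and retrospective provenance, and the consistency-checking application, can be illustrated with a minimal Python sketch. It is not taken from the talk: the step names, the `PLAN`/`EXECUTION` structures and the `check_consistency` helper are assumptions made for illustration, and only the durations follow the "Problem" slide.

```python
# Prospective provenance: the workflow as designed (step names are hypothetical).
PLAN = ["lookup_printer_log", "parse_to_rdf", "filter_2010", "aggregate_over_people"]

# Retrospective provenance: what one execution recorded, as (step, seconds) pairs.
# Durations follow the "Problem" slide; everything else is illustrative.
EXECUTION = [
    ("lookup_printer_log", 20),
    ("parse_to_rdf", 30),
    ("filter_2010", 1),
    ("aggregate_over_people", 1),
]


def check_consistency(plan, execution):
    """Compare an executed run against the designed workflow."""
    executed = [step for step, _ in execution]
    return {
        "missing": [s for s in plan if s not in executed],
        "unexpected": [s for s in executed if s not in plan],
        "same_order": executed == plan,
    }
```

Calling `check_consistency(PLAN, EXECUTION[:2])` would report the last two steps as missing, which is the kind of debugging question a retrospective description is meant to answer.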
  • 6. Outline
      Motivation & Problem
      Gap of ETL Descriptions
      Interoperable ETL Provenance Model
      Case Study
      Conclusions
  • 7. Gap of ETL Descriptions (1)
      Provenance models: Davidson and Buneman (1998); Galhardas et al. (2001); Cui and Widom (2003); Simmhan et al. (2005)
      Conceptual modelling using ontologies: Becker and Ghedini (2005); CWM, PMML, BPMN, BPEL + ontologies; Data Mining Ontology (2009)
      Provenance representation from an ETL perspective
      Semantic interoperability across different ETL applications
      Usability and ontological commitment
      Figure labels: Conceptual modelling of ETL workflows; Interoperable ETL provenance model; Formal models of ETL workflows
  • 8. Gap of ETL Descriptions (2)
      Common ETL applications such as Kapow Software, Pentaho Data Integration, Google Refine and Yahoo Pipes either do not create and use provenance information, or do not support sharing and integrating it.
  • 9. Outline
      Motivation & Problem
      Gap of ETL Descriptions
      Interoperable ETL Provenance Model
      Case Study
      Conclusions
  • 10. Outline
      Motivation & Problem
      Gap of ETL Descriptions
      Interoperable ETL Provenance Model
        Requirements Analysis
        High-level approach
        Cogs: Linked Data vocabulary
        Requirements Coverage Analysis
      Case Study
      Conclusions
  • 11. Requirements Analysis
      Columns: Provenance representation from an ETL perspective | Semantic interoperability across different ETL platforms | Usability and ontological commitment
      Prospective and retrospective descriptions: + +
      Separation of concerns: +
      Common terminology: + +
      Terminological completeness: + +
      Lightweight ontology structure: +
      Availability of different abstraction levels: + +
      Data representation independency: +
      Accessibility: + +
      Decentralization: + +
  • 12. Interoperable Provenance Model for ETL Workflows
      High-level approach:
        Reuse of the OPM Vocabulary (OPMV) workflow structure as abstract provenance model
        Creation of Cogs, an RDF vocabulary for representing ETL provenance
        Can be extended by domain specific models
        Use of the Linked Data principles for representing provenance descriptors
      Figure: Three-layered Provenance Model
  • 13. Open Provenance Model Vocabulary (OPMV)
      Community-built provenance model
      Simple workflow structure (processes, artifacts, agents)
      Designed to provide a minimal level of provenance interoperability
      Designed to be extensible
      ETL and provenance share workflow-level semantics
      http://open-biomed.sourceforge.net/opmv/ns.html
  • 14. RDF vocabulary for representing ETL elements
      Complementary vocabulary for expressing the elements present in an ETL workflow
      Based on:
        ETL/data transformation tools (Pentaho Data Integration, Google Refine)
        Concepts and structures from the ETL literature
      https://sites.google.com/site/cogsvocab/
  • 15. Cogs – OPMV workflow extension
      The representation of nested workflows allows different abstraction levels
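How nested workflows yield different abstraction levels can be sketched with a small tree in Python. This is an illustration under assumptions, not the Cogs representation itself: the `step` and `view` helpers and all workflow names are hypothetical, loosely modelled on the steps shown on the "Problem" slides.

```python
def step(name, *children):
    """Build one workflow node; a node without children is an atomic step."""
    return {"name": name, "children": list(children)}


# Hypothetical nesting: a report workflow made of two sub-workflows.
WORKFLOW = step(
    "sustainability_report",
    step(
        "printing_emissions",
        step("lookup_printer_log"),
        step("parse_to_rdf"),
        step("filter_2010"),
        step("aggregate_over_people"),
    ),
    step(
        "travel_emissions",
        step("extract_travel_form_db"),
        step("parse_csv_to_rdf"),
    ),
)


def view(node, depth):
    """Names visible when the workflow is expanded `depth` levels deep."""
    if depth == 0 or not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(view(child, depth - 1))
    return names
```

With this structure, `view(WORKFLOW, 0)` shows the whole report as a single process, while larger depths expose the sub-workflows and then the individual steps.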
  • 16. Cogs – Structure
      Taxonomy of ETL elements mapping to provenance processes and artifacts
      High-level classes:
        cogs:Execution, e.g., ScheduledJob
        cogs:State, e.g., Running
        opmv:Process
          cogs:Extraction, e.g., Parsing
          cogs:Transformation, e.g., RegexFilter
          cogs:Loading, e.g., IncrementalLoad
        opmv:Artifact
          cogs:Object, e.g., CSV File
          cogs:Layer, e.g., StagingArea
      Cogs: 151 classes, 17 properties
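The mapping of Cogs classes onto OPMV processes and artifacts can be sketched as a subclass lookup in Python. The links between the high-level classes are read off the slide; treating the "e.g." examples as subclasses, and the exact identifiers used for them (such as `cogs:CSVFile`), are assumptions made for illustration rather than names confirmed by the vocabulary.

```python
# Subclass links read off the slide. The example classes at the bottom and
# their identifiers are illustrative assumptions, not verified Cogs terms.
SUBCLASS_OF = {
    "cogs:Extraction": "opmv:Process",
    "cogs:Transformation": "opmv:Process",
    "cogs:Loading": "opmv:Process",
    "cogs:Object": "opmv:Artifact",
    "cogs:Layer": "opmv:Artifact",
    "cogs:Parsing": "cogs:Extraction",
    "cogs:RegexFilter": "cogs:Transformation",
    "cogs:IncrementalLoad": "cogs:Loading",
    "cogs:CSVFile": "cogs:Object",
    "cogs:StagingArea": "cogs:Layer",
}


def ancestors(cls):
    """Superclasses of `cls`, nearest first, following subclass links upwards."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_a(cls, target):
    """True if `cls` is `target` or one of its (transitive) subclasses."""
    return cls == target or target in ancestors(cls)
```

A tool that only understands OPMV can then still treat a `cogs:RegexFilter` step as a generic `opmv:Process`, which is the interoperability argument made on the slide.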
  • 17. Requirements Coverage Analysis
      Columns: OPMV | Cogs | LD principles
      Prospective and retrospective descriptions: + +
      Separation of concerns: + +
      Common terminology: + +
      Terminological completeness: + + +
      Lightweight ontology structure: + +
      Availability of different abstraction levels: +
      Data representation independency: + + +
      Accessibility: + +
      Decentralization: +
  • 18. Outline
      Motivation & Problem
      Gap of ETL Descriptions
      Interoperable ETL Provenance Model
      Case Study
      Conclusions
  • 19. Case Study – Sustainability Reporting
      ETL over heterogeneous data sources (e.g., log files, survey results, travel request DB, RDF)
      Sustainability report (carbon dioxide emission by kg):
                               2009       2010
        printing emissions      600        503
        paper usage           4 165      3 968
        travel emissions    534 000    429 193
        commute emissions       456        391
  • 20. Case Study – Sustainability Reporting
      ETL over heterogeneous data sources (e.g., log files, survey results, travel request DB, RDF)
      Figure labels: 1., 2., 3., 4.
      Sustainability report (carbon dioxide emission by kg):
                               2009       2010
        printing emissions      600        503
        paper usage           4 165      3 968
        travel emissions    534 000    429 193
        commute emissions       456        391
  • 21. Case Study – Architecture with Provenance-aware ETL Applications
  • 22. Case Study – Sustainability Report Values
  • 23. Case Study – Provenance Descriptor Visualization
  • 24. Case Study – Provenance Descriptor Visualization
      Figure labels: 1. Lookup; 4. Aggregation
  • 25. Case Study – Possible Queries
      OPMV:
        What are the data artifacts, processes and agents behind this data value?
        When were the processes executed, and for how long?
      OPMV + Cogs:
        How long did all lookups take?
        What scripts have been used to transform the data into RDF?
        To which values have constant factors been applied?
        Which aggregation functions were used to calculate this indicator?
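Two of these queries can be sketched over a toy provenance descriptor in plain Python. This is a sketch under stated assumptions, not the case study's implementation: the `Process` record, the process and artifact identifiers, and the type names such as `cogs:Lookup` are hypothetical; only the step durations follow the "Problem" slide. In practice such questions would be asked as queries over the RDF descriptors.

```python
from collections import namedtuple

# One executed process: what it used, what it generated, how long it ran.
Process = namedtuple("Process", "id type used generated seconds")

# Toy retrospective descriptor; identifiers and type names are hypothetical.
RUN = [
    Process("p1", "cogs:Lookup", [], "printer_log", 20),
    Process("p2", "cogs:Parsing", ["printer_log"], "log_rdf", 30),
    Process("p3", "cogs:Filter", ["log_rdf"], "log_2010", 1),
    Process("p4", "cogs:Aggregation", ["log_2010"], "printing_emissions_2010", 1),
]


def lineage(artifact, run=RUN):
    """Processes behind an artifact, walking generated/used links backwards."""
    by_output = {p.generated: p for p in run}
    found, todo = [], [artifact]
    while todo:
        process = by_output.get(todo.pop())
        if process is not None and process not in found:
            found.append(process)
            todo.extend(process.used)
    return found


def total_seconds(process_type, run=RUN):
    """Total run time of all processes of one type, e.g. all lookups."""
    return sum(p.seconds for p in run if p.type == process_type)
```

`lineage` corresponds to the workflow-level question that OPMV alone can answer, while `total_seconds("cogs:Lookup")` needs the ETL-specific typing that Cogs adds.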
  • 26. Outline
      Motivation & Problem
      Gap of ETL Descriptions
      Interoperable ETL Provenance Model
      Case Study
      Conclusions
  • 27. Conclusions
      Provenance representation from an ETL perspective
      Semantic interoperability across different ETL applications
      Usability and ontological commitment
      Evaluation in a small case study
      For a full evaluation of interoperability benefits, the model needs to be adopted in provenance-aware ETL applications.
      Starting point: Provenance-aware Google Refine using Cogs.
  • 28. Conclusions
      Provenance representation from an ETL perspective
      Semantic interoperability across different ETL applications
      Usability and ontological commitment
      Thanks!