1. dans.knaw.nl
DANS is an institute of KNAW and NWO
Provenance state-of-the-art
Vyacheslav Tykhonov
Senior Information Scientist
Research and Innovation department (DANS)
CLARIAH provenance workshop,
03.09.2018
2. Provenance introduction
• Why provenance information is important for reproducibility
(Reproducible Science)
• Why we need to record every change in provenance
• What standards exist for provenance
• How to keep provenance information up-to-date for data and
tools
3. Provenance and reproducibility in scientific workflows
• Where was the dataset found?
• How was it produced, and what was the workflow?
• What versions of software and OS were used?
• Were all facts considered?
• Is it up-to-date?
• Where to find all versions of software and data?
4. Current provenance standards
PROV (provenance vocabulary) is a W3C standard introduced in 2013.
• PROV-DM, data model for provenance
• PROV-N: The Provenance Notation
• PROV-O: The PROV Ontology
• PROV-JSON: a JSON Representation for the PROV data model
• GDPRov: ontology enrichment for describing the provenance of
data and consent lifecycles using GDPR terms
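As a sketch of what PROV-JSON looks like in practice, the snippet below builds a minimal document with one entity, one activity, and one software agent, linked by `wasGeneratedBy` and `wasAssociatedWith` relations. All identifiers (`ex:dataset-v2`, `ex:cleaning-run`, etc.) are illustrative, not taken from a real dataset:

```python
import json

# A minimal PROV-JSON document: one entity produced by one activity,
# which was performed by one software agent. All names are illustrative.
prov_doc = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {
        "ex:dataset-v2": {"prov:label": "Cleaned dataset, revision 2"}
    },
    "activity": {
        "ex:cleaning-run": {"prov:startTime": "2018-09-03T10:00:00"}
    },
    "agent": {
        "ex:cleaning-tool": {"prov:type": "prov:SoftwareAgent"}
    },
    "wasGeneratedBy": {
        "_:gen1": {"prov:entity": "ex:dataset-v2",
                   "prov:activity": "ex:cleaning-run"}
    },
    "wasAssociatedWith": {
        "_:assoc1": {"prov:activity": "ex:cleaning-run",
                     "prov:agent": "ex:cleaning-tool"}
    },
}

print(json.dumps(prov_doc, indent=2))
```

Because the relations refer to entities and activities by qualified name, the same document can be serialized to PROV-N or mapped onto PROV-O without structural changes.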
7. Provenance information in Software Agents
All KNAW institutes involved in CLARIAH use Docker to distribute their software. Docker is
currently the most popular tool for operating-system-level virtualization.
Some useful features:
• Docker keeps some provenance information in its YAML specification files
• Docker images can be distributed with OS and installed software components
• The state of a running Docker container can be saved as a snapshot, archived in a
repository, and shared with other groups of developers or researchers for further
collaboration
Docker orchestration tools allow building workflow pipelines that keep some provenance
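One common way image provenance is recorded in practice is through image labels following the OCI annotation convention (`org.opencontainers.image.*`). The sketch below parses a trimmed, made-up sample in the shape of `docker inspect` output and pulls out those labels; the label values are illustrative:

```python
import json

# Sample in the shape of `docker inspect <image>` output (trimmed);
# the label values here are made up for illustration.
inspect_output = json.loads("""
[{"Config": {"Labels": {
    "org.opencontainers.image.source": "https://github.com/example/tool",
    "org.opencontainers.image.revision": "abc1234",
    "org.opencontainers.image.created": "2018-09-03T10:00:00Z",
    "maintainer": "someone@example.org"
}}}]
""")

def provenance_labels(inspect_json):
    """Extract OCI provenance annotations from an image config."""
    labels = inspect_json[0]["Config"]["Labels"] or {}
    return {k: v for k, v in labels.items()
            if k.startswith("org.opencontainers.image.")}

for key, value in sorted(provenance_labels(inspect_output).items()):
    print(f"{key} = {value}")
```

Labels like these travel with the image itself, so the provenance survives pushes, pulls, and re-tagging across registries.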
9. Provenance of Docker images
The main challenge in Docker orchestration: how do we track provenance
information in distributed images? Can you trust Docker Hub?
Docker Content Trust provides strong cryptographic guarantees over what
code and what versions of software are being run in your infrastructure.
It integrates The Update Framework (TUF) into Docker using Notary, a
tool that provides trust over any content.
• Image provenance is critical for production
• Distributed content should be digitally signed
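The core idea underlying content trust can be sketched in a few lines: the consumer pins the expected sha256 digest of an artifact and refuses anything that does not match. This is only the content-addressing layer; the real TUF/Notary machinery adds signed, versioned metadata on top of it:

```python
import hashlib

# Conceptual sketch of content addressing, the idea underlying Docker
# Content Trust: pin a sha256 digest at publish time, verify at pull time.
def sha256_digest(content: bytes) -> str:
    return "sha256:" + hashlib.sha256(content).hexdigest()

def verify(content: bytes, pinned_digest: str) -> bool:
    return sha256_digest(content) == pinned_digest

artifact = b"FROM debian:stretch\nRUN apt-get update\n"
pinned = sha256_digest(artifact)            # recorded when publishing

assert verify(artifact, pinned)             # untampered content passes
assert not verify(artifact + b"x", pinned)  # any modification fails
```

Digest pinning alone cannot tell you whether the *pinned* digest is the right one; that is exactly the gap the signed TUF metadata closes.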
10. Provenance of data
• A data repository should be able to archive all revisions of a dataset
together with provenance information
• Ideally, every change to the data should have a provenance
record, for reproducibility purposes
• Archived datasets with provenance information can get a “second life” in
new research, keeping all references between original and forked datasets
• In combination with SoftwareAgent provenance, it can form an end-to-end
provenance chain
11. Provenance repositories
• Apache NiFi was designed to automate the flow of data
between software systems
• Dataverse has a specific API to keep provenance information
about Software Agents and workflows up-to-date (PROV-JSON)
• ProvStore is a web service that allows you to store, browse
and manage your provenance documents
The most intriguing possibility is to transmit provenance information
from the repository to Linked Open Data (LOD)!
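To make the Dataverse option concrete, the sketch below prepares a request that attaches a PROV-JSON document to a file through Dataverse's provenance API. The endpoint path, server URL, file id, and token are assumptions for illustration; check your Dataverse version's API guide before relying on them. The request is built but deliberately not sent, so the sketch runs without a live server:

```python
import json
import urllib.request

# Hypothetical server, file id, and token -- placeholders only.
BASE_URL = "https://demo.dataverse.org"
FILE_ID = 42
API_TOKEN = "xxxx-xxxx"

prov_json = {"entity": {"ex:dataset": {"prov:label": "example dataset"}}}

# Assumed endpoint shape: POST /api/files/{id}/prov-json with the
# API token in the X-Dataverse-key header.
request = urllib.request.Request(
    url=f"{BASE_URL}/api/files/{FILE_ID}/prov-json",
    data=json.dumps(prov_json).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "X-Dataverse-key": API_TOKEN},
    method="POST",
)

# urllib.request.urlopen(request) would actually send it.
print(request.get_method(), request.full_url)
```

Because the payload is plain PROV-JSON, the same document could also be pushed to ProvStore or exported to a triple store for the LOD scenario above.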
13. Linked Data provenance challenges
Provenance representation:
• Annotation method (provenance is pre-computed and stored as metadata,
can be archived as well)
• Inversion method (provenance is computed when needed)
Provenance storage approaches:
• Stored within the same storage system
• Distributed storage in different locations, VoID (Vocabulary of Interlinked
Datasets)
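The two representation methods above can be contrasted in a tiny sketch: the annotation method stores pre-computed provenance next to the result, while the inversion method recomputes the lineage on demand by inverting the transformation. All names and the transformation itself are illustrative:

```python
# Annotation method: provenance is pre-computed and stored with the data,
# so it can be archived alongside it.
def derive_annotated(value, source_id):
    result = value * 2  # some (invertible) transformation
    provenance = {"wasDerivedFrom": source_id, "operation": "double"}
    return {"value": result, "prov": provenance}

record = derive_annotated(21, "ex:raw-001")
assert record["value"] == 42
assert record["prov"]["wasDerivedFrom"] == "ex:raw-001"

# Inversion method: no provenance is stored; when asked, the input is
# reconstructed by inverting the transformation. This only works when
# the transformation is invertible, which is the method's key limitation.
def invert_double(result_value):
    return result_value / 2

assert invert_double(record["value"]) == 21.0
```

The trade-off mirrors the two bullets: annotation costs storage but survives archiving; inversion costs nothing up front but fails for lossy transformations.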