
SHARP: harmonizing cross-workflow provenance

The slides of the talk given at the SeWeBMeDA workshop (https://sites.google.com/insight-centre.org/sewebmeda/home), co-located with ESWC 2017.


  1. SHARP: Harmonizing cross-workflow provenance
     SeWeBMeDA’17, May 28, 2017
     Alban Gaignard (1), Khalid Belhajjame (2), Hala Skaf-Molli (3)
     (1) Nantes Academic Hospital, France
     (2) LAMSADE, Paris-Dauphine University, France
     (3) LS2N, Nantes University, France
  5. Multi-scale biological observations? Data silos!
     Context
       • poor interoperability
       • data diversity + volume
       • multi-site collaborations, decentralized data
     Workflows: modeling complex analyses, running at large scale.
     Provenance: informing on data sources, production and analysis chains → transparency, reproducibility.
  9. Multiple sites/scientists → multiple workflow engines!
     [Figure: a Galaxy workflow @sequencing-facility and a Taverna workflow @research-lab. Node labels: sample 1.a.R1, sample 1.a.R2, sample 1.b.R1, sample 1.b.R2, sample 2.R1, sample 2.R2, Alignment (×3), Sort (×2), Merge, Variant calling, GRCh37, VCF file, Exon filtering, Variant effect prediction, output.]
       • “Which alignment algorithm was used when predicting these effects?”
       • “A new version of a reference genome is available; which genome was used when predicting these phenotypes?”
     Need for overall provenance tracking across both the Galaxy and Taverna workflows!
  10. PROV-O ontology: https://www.w3.org/TR/prov-o
  11. Heterogeneity in provenance capture!
      Galaxy PROV predicates          counts
        prov:wasDerivedFrom           118
        rdf:type                      76
        rdfs:label                    62
        prov:used                     61
        prov:wasAttributedTo          34
        prov:wasGeneratedBy           33
        prov:endedAtTime              26
        prov:startedAtTime            26
        prov:wasAssociatedWith        26
        prov:generatedAtTime          1
      Taverna PROV predicates         counts
        rdf:type                      54
        rdfs:label                    13
        prov:atTime                   8
        wfprov:describedByParameter   6
        rdfs:comment                  6
        prov:hadRole                  6
        prov:activity                 5
        dcterms:hasPart               4
        prov:agent                    4
        prov:endedAtTime              4
        prov:hadPlan                  4
        prov:qualifiedAssociation     4
        prov:qualifiedEnd             4
        prov:qualifiedStart           4
        prov:startedAtTime            4
        prov:wasAssociatedWith        4
        tavernaprov:content           3
        wfprov:usedInput              3
        wfprov:wasEnactedBy           3
        wfprov:wasOutputFrom          3
      PROV-O: https://www.w3.org/TR/prov-o
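A minimal sketch of how per-predicate counts like the ones on this slide could be computed, assuming each trace is already available as (subject, predicate, object) tuples. The `predicate_counts` helper and both toy traces are hypothetical and are not taken from the Galaxy or Taverna exports.

```python
from collections import Counter

def predicate_counts(triples):
    """Count how often each predicate occurs in an iterable of (s, p, o) triples."""
    return Counter(predicate for _, predicate, _ in triples)

# Hypothetical toy traces, only meant to mimic the two vocabularies.
galaxy_like = [
    ("ex:vcf", "prov:wasDerivedFrom", "ex:bam"),
    ("ex:vcf", "rdf:type", "prov:Entity"),
    ("ex:bam", "rdf:type", "prov:Entity"),
    ("ex:calling", "prov:used", "ex:bam"),
]
taverna_like = [
    ("ex:run", "rdf:type", "wfprov:WorkflowRun"),
    ("ex:out", "wfprov:wasOutputFrom", "ex:run"),
    ("ex:run", "wfprov:usedInput", "ex:vcf"),
]

galaxy_counts = predicate_counts(galaxy_like)
taverna_counts = predicate_counts(taverna_like)

# Predicates used by both traces: a small overlap signals vocabulary heterogeneity.
shared = sorted(set(galaxy_counts) & set(taverna_counts))

for predicate, n in galaxy_counts.most_common():
    print(f"{predicate:24} {n}")
print("shared predicates:", shared)
```

In this toy case only `rdf:type` is shared, which mirrors the situation on the slide where the two engines describe the same kind of lineage with largely different predicates.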
  12. Problem statement
      • Heterogeneity in PROV predicates: how to reconcile different PROV traces?
      • Need for human-oriented interpretation: tractable size, domain concepts.
  13. Approach
      1. Apply graph saturation and PROV inferences to overcome vocabulary heterogeneity.
      2. Summarize harmonized provenance graphs as life-science experiment reports.
  14. SHARP: Harmonizing multiple PROV graphs
      [Figure: the SHARP pipeline, from several PROV traces to a nanopublication, through PROV interlinking (owl:sameAs), PROV harmonization (inferred PROV) and PROV summarization (nanopub). Step markers (1) to (4) label the pipeline stages.]
  15. Tackling multiple provenance heterogeneity
  17. (1) Multi-provenance linking: owl:sameAs
      Idea: state that two PROV entities associated with the same file content are the same.
      Producing owl:sameAs
        1. compute the SHA-512 fingerprint of the files
        2. annotate PROV entities with the SHA-512 digest
        3. produce owl:sameAs links → SPARQL CONSTRUCT-WHERE query
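The three linking steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slide produces the links with a SPARQL CONSTRUCT-WHERE query, whereas this sketch groups entities by digest in plain Python. The entity names, file contents and the `same_as_links` helper are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def sha512_digest(content: bytes) -> str:
    """Step 1: SHA-512 fingerprint of a file's content."""
    return hashlib.sha512(content).hexdigest()

def same_as_links(entity_contents):
    """Steps 2 and 3: annotate entities with their digest, then link equal ones.

    entity_contents maps an entity identifier to the bytes of the file it denotes.
    Returns a sorted list of (entity, "owl:sameAs", entity) triples.
    """
    by_digest = defaultdict(list)
    for entity, content in entity_contents.items():
        by_digest[sha512_digest(content)].append(entity)

    links = []
    for entities in by_digest.values():
        # Every pair of entities sharing a digest yields one owl:sameAs link.
        for left, right in combinations(sorted(entities), 2):
            links.append((left, "owl:sameAs", right))
    return sorted(links)

# Hypothetical entities: the VCF file written by one engine and read by the other.
contents = {
    "galaxy:output_vcf": b"##fileformat=VCFv4.2\n",
    "taverna:input_vcf": b"##fileformat=VCFv4.2\n",
    "taverna:exon_output": b"chr1\t12345\n",
}
links = same_as_links(contents)
print(links)
```

Only the two entities with identical content are linked, which is what lets a lineage query cross from one workflow trace into the other.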
  18. (2) Reasoning: PROV inference regime
  22. (2) Tuple-Generating Dependencies (TGDs)
      Idea: repeatedly infer new facts from existing ones until saturation (production rules / forward chaining).
      Example
        wasDerivedFrom(e2, e1) → ∃ a: used(a, e1), wasGeneratedBy(e2, a)
      Translation in SPARQL
        PREFIX prov: <http://www.w3.org/ns/prov#>
        CONSTRUCT {
          ?e_2 prov:wasGeneratedBy _:blank_node .
          _:blank_node prov:used ?e_1
        } WHERE {
          ?e_2 prov:wasDerivedFrom ?e_1
        }
      Issues
        • may produce many non-informative blank nodes for existential variables.
        • may increase the size of the PROV graph.
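A minimal sketch of applying this single TGD by forward chaining over a set of triples. It assumes triples are plain string tuples and that no existing node is already labelled `_:aN`. Unlike the SPARQL translation on the slide, this hypothetical `apply_derivation_tgd` helper skips a derivation when some activity already satisfies the rule head; that check is an assumption added here so that repeated application reaches a fixpoint.

```python
from itertools import count

def apply_derivation_tgd(triples):
    """One pass of: wasDerivedFrom(e2, e1) -> exists a: used(a, e1), wasGeneratedBy(e2, a)."""
    base = set(triples)
    inferred = set(base)
    fresh = count(1)  # source of fresh blank node labels

    for e2, predicate, e1 in sorted(base):
        if predicate != "prov:wasDerivedFrom":
            continue
        generators = {a for (s, p, a) in base if s == e2 and p == "prov:wasGeneratedBy"}
        users = {a for (a, p, o) in base if p == "prov:used" and o == e1}
        if generators & users:
            # The head is already satisfied by an existing activity: nothing to add.
            continue
        activity = f"_:a{next(fresh)}"  # existential variable becomes a blank node
        inferred.add((e2, "prov:wasGeneratedBy", activity))
        inferred.add((activity, "prov:used", e1))
    return inferred

trace = {("ex:vcf", "prov:wasDerivedFrom", "ex:bam")}
saturated = apply_derivation_tgd(trace)
for triple in sorted(saturated):
    print(triple)
```

Applying the function again to `saturated` adds nothing, which is the saturation condition mentioned in the idea above. Without the "already satisfied" check, every pass would mint a new blank node, which is the first issue listed on the slide.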
  24. (3) Equality-Generating Dependencies (EGDs)
      Idea
        • Two blank nodes have the same neighborhood → keep only one blank node
        • One blank node and a resource have the same neighborhood → keep only the resource
      Example
        [Figure: before, entity e has two prov:qualifiedGeneration blank nodes, :_gen1 and :_gen2, each with prov:activity a; after, only :_gen1 remains. Other labels in both panels: prov:used, a2.]
  25. (3) Equality-Generating Dependencies (EGDs)
      Input:  G′, the provenance graph resulting from the application of TGDs on G
      Output: G″, the provenance graph with substituted blank nodes, when possible
        begin
          G″ ← G′
          substitutions ← new List<Pair<Node, Node>>()
          repeat
            S ← findSubstitutions(G″)
            foreach (s ∈ S) do
              source ← s[0]
              target ← s[1]
              foreach (in ∈ G″.listStatements(∗, ∗, source)) do
                G″ ← G″.add(in.getSubject(), in.getPredicate(), target)
                G″ ← G″.del(in)
              foreach (out ∈ G″.listStatements(source, ∗, ∗)) do
                G″ ← G″.add(target, out.getPredicate(), out.getObject())
                G″ ← G″.del(out)
          until (S.size() = 0)
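The substitution loop above can be sketched in Python as follows. This is an illustration, not the authors' Java/Jena code. It assumes that "same neighborhood" means identical sets of incoming (subject, predicate) and outgoing (predicate, object) pairs, and that blank nodes are strings starting with `_:`. The helper names and toy graphs are hypothetical, loosely modelled on the example of the previous slide.

```python
def is_blank(node):
    return node.startswith("_:")

def neighborhood(graph, node):
    """Incoming (subject, predicate) pairs and outgoing (predicate, object) pairs."""
    incoming = frozenset((s, p) for s, p, o in graph if o == node)
    outgoing = frozenset((p, o) for s, p, o in graph if s == node)
    return incoming, outgoing

def find_substitutions(graph):
    """Return at most one (source, target) pair: a blank node and its replacement."""
    nodes = sorted({s for s, _, _ in graph} | {o for _, _, o in graph})
    for source in nodes:
        if not is_blank(source):
            continue  # only blank nodes are ever replaced
        for target in nodes:
            if target == source:
                continue
            if is_blank(target) and target > source:
                continue  # between two blank nodes, keep the smaller label
            if neighborhood(graph, source) == neighborhood(graph, target):
                return [(source, target)]
    return []

def substitute_blank_nodes(graph):
    """Repeat substitutions until none is found (the repeat/until loop above)."""
    graph = set(graph)
    while True:
        substitutions = find_substitutions(graph)
        if not substitutions:
            return graph
        for source, target in substitutions:
            # Rewrites both incoming and outgoing statements of the source node.
            graph = {
                (target if s == source else s, p, target if o == source else o)
                for s, p, o in graph
            }

two_blanks = {
    ("ex:e", "prov:qualifiedGeneration", "_:gen1"),
    ("ex:e", "prov:qualifiedGeneration", "_:gen2"),
    ("_:gen1", "prov:activity", "ex:a"),
    ("_:gen2", "prov:activity", "ex:a"),
    ("ex:a", "prov:used", "ex:a2"),
}
merged = substitute_blank_nodes(two_blanks)

blank_and_resource = {
    ("ex:e", "prov:wasGeneratedBy", "_:b"),
    ("ex:e", "prov:wasGeneratedBy", "ex:act"),
    ("_:b", "prov:used", "ex:in"),
    ("ex:act", "prov:used", "ex:in"),
}
resolved = substitute_blank_nodes(blank_and_resource)
print(sorted(merged))
print(sorted(resolved))
```

Returning one substitution per round keeps the sketch simple: each round removes one blank node from the graph, so the loop terminates.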
  26. Answering domain-specific questions
  28. (4) From harmonized PROV to Nanopublications
      Objectives: link data to their scientific context and publish them as citable, exchangeable experiment reports.
      → Semanticscience Integrated Ontology (SIO) + Nanopublications
      [Figure: node labels: scientific question, reference genome, sample_001, predicted phenotypes from exons. Edge labels: sio:is-supported-by (×4), sio:is-variant-of, sio:has-phenotype.]
        :head {
          ex:pub1 a np:Nanopublication .
          ex:pub1 np:hasAssertion :assertion1 ;
                  np:hasAssertion :assertion2 .
          ex:pub1 np:hasProvenance :provenance .
          ex:pub1 np:hasPublicationInfo :pubInfo .
        }
        :assertion1 {
          ex:question a sio:Question ;
            sio:has-value "What are the effects of SNPs located in exons for study-Y samples" ;
            sio:is-supported-by ex:referenceGenome ;
            sio:is-supported-by ex:sample_001 ;
            sio:is-supported-by ex:annotatedVariants .
        }
        :assertion2 {
          ex:referenceGenome a sio:Genome .
          ex:sample_001 a sio:Sample ;
            sio:is-variant-of ex:referenceGenome ;
            sio:has-phenotype ex:annotatedVariants .
        }
  30. (4) From harmonized PROV to Nanopublications
      How? By identifying a data lineage path across multiple, previously harmonized PROV graphs (inferred prov:wasInfluencedBy).
        CONSTRUCT {
          GRAPH :assertion {
            ?ref_genome a sio:Genome .
            ?sample a sio:Sample ;
              sio:is-variant-of ?ref_genome ;
              sio:has-phenotype ?out .
            ?out rdfs:label ?out_label .
            ?out sio:is-supported-by ?ref_genome .
          }
        } WHERE {
          ?sample rdfs:label ?sample_label .
          FILTER (contains(lcase(str(?sample_label)), lcase("fastq"))) .
          ?ref_genome rdfs:label ?ref_genome_label .
          FILTER (contains(lcase(str(?ref_genome_label)), lcase("GRCh"))) .
          ?out ( prov:wasInfluencedBy )+ ?sample .
          ?out tavernaprov:content ?out_label .
          FILTER (contains(lcase(str(?out_label)), lcase("exons"))) .
        }
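The property path `( prov:wasInfluencedBy )+` in the query above matches chains of one or more influence edges. A minimal sketch of that traversal over plain triples follows. The `influenced_by_closure` helper and the toy trace are hypothetical, and the sketch filters on node identifiers rather than on `rdfs:label` values as the query does.

```python
from collections import defaultdict

def influenced_by_closure(triples, start):
    """All nodes reachable from `start` through one or more prov:wasInfluencedBy edges."""
    edges = defaultdict(set)
    for s, p, o in triples:
        if p == "prov:wasInfluencedBy":
            edges[s].add(o)

    seen = set()
    stack = list(edges[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges[node])
    return seen

# Hypothetical harmonized trace spanning both workflows.
trace = [
    ("ex:exons_output", "prov:wasInfluencedBy", "ex:vcf"),
    ("ex:vcf", "prov:wasInfluencedBy", "ex:bam"),
    ("ex:bam", "prov:wasInfluencedBy", "ex:sample_1.fastq"),
    ("ex:bam", "prov:wasInfluencedBy", "ex:GRCh37"),
    ("ex:vcf", "rdf:type", "prov:Entity"),
]

lineage = influenced_by_closure(trace, "ex:exons_output")
samples = sorted(n for n in lineage if "fastq" in n.lower())
genomes = sorted(n for n in lineage if "grch" in n.lower())
print(samples, genomes)
```

Because every PROV relation specializes to `prov:wasInfluencedBy` after harmonization, a single closure like this can follow lineage regardless of which predicates each engine originally emitted.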
  31. Experimental results
  33. Time and space for multiple PROV harmonization?
      Material & methods
        • 3 ProvStore graphs
        • 10, 109 and 666 edges
        • Java + Jena rule engine
        • Laptop, 4 cores, 16 GB, 5 runs
      Results
        • about 5 s for the whole harmonization process
        • no new wasDerivedFrom relation (must be captured)
        • inferred wasInfluencedBy → “common denominator” for data lineage
             ∆ size    ∆ BNodes   ∆ wDerivedFrom   ∆ wInfluencedBy   mean time
        PA   + 2776    + 2        + 1              + 7               4835 ± 343 ms
        PB   + 3102    + 4        + 1              + 58              4759 ± 71 ms
        PC   + 5023    + 63       + 1              + 231             5304 ± 176 ms
  35. Can SHARP produce human-oriented experiment reports?
      Target question: “Which reference genome was used when annotating the genetic variants?”
      Material & methods
        • Galaxy PROV trace
        • Taverna PROV trace
        • Manually provided owl:sameAs
        • Harmonization process
        • Summarization query
  36. Can SHARP produce human-oriented experiment reports?
      Harmonization results
        Galaxy PROV                   Taverna PROV                  Harmonized PROV
        predicates           counts   predicates           counts   predicates           counts
        prov:wasDer.From     118      rdf:type             54       owl:differentFrom    3617
        rdf:type             76       rdfs:label           13       rdf:type             958
        rdfs:label           62       prov:atTime          8        prov:wasInflu.By     515
        prov:used            61       wfprov:descByParam.  6        prov:influenced      291
        prov:wasAttr.To      34       rdfs:comment         6        rdfs:seeAlso         268
        prov:wasGen.By       33       prov:hadRole         6        rdfs:subClassOf      223
        prov:endedAtTime     26       prov:activity        5        owl:disjointWith     218
        prov:startedAtTime   26       purl:hasPart         4        rdfs:range           208
        prov:wasAsso.With    26       prov:agent           4        rdfs:domain          199
        prov:gen.AtTime      1        prov:endedAtTime     4        prov:wasGen.By       172
        all                  463      all                  177      all                  8654
  37. Can SHARP produce human-oriented experiment reports?
      Nanopublication as summarization result: “Which reference genome was used when predicting the phenotypes?”
  38. Conclusion & future work
  40. Conclusion & future work
      Achievements
        • Reasoning to address PROV capture heterogeneity
        • Reconciliation of multi-site workflow provenance
        • Human-oriented nanopublications: domain-specific vocabulary, tractable size
        • Open-source implementation of all PROV inference rules
      Perspectives
        • size of inferred graphs: OWL/PROV entailment subsets?
        • summarization query not yet automated → work in progress
        • towards decentralized provenance harmonization (federated querying)
        • “user study” to assess improvements in interpretation, sharing and trust
  41. Questions? alban.gaignard@univ-nantes.fr
      Acknowledgments
