SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
From Scientific Workflow Patterns
to 5-star Linked Open Data
Alban Gaignard Hala Skaf-Molli Audrey Bihouée
Nantes Academic
Hospital (CHU de
Nantes), CNRS, France
LINA,
Nantes University,
CNRS,
France
Institut du Thorax,
Nantes University,
INSERM, CNRS,
France
8th USENIX workshop on Theory and Practice of Provenance (TaPP’16)
Washington DC
Needs for linked experiment reports
2
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
3
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
4
1 sample
Input data 2 x 17 Gb
1-core CPU 170 hours
32-cores CPU 32 hours
Output data 12 Gb
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
5
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
1 sample
Input data 2 x 17 Gb
1-core CPU 170 hours
32-cores CPU 32 hours
Output data 12 Gb
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
6
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
Challenges
Algorithmic performance, storage, preservation,
reuse (limit recompute) & share.
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
1 sample
Input data 2 x 17 Gb
1-core CPU 170 hours
32-cores CPU 32 hours
Output data 12 Gb
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing experiment results
Scientific experiment: RNA sequencing to quantify gene expression
levels under multiple biological conditions.
7
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing experiment results
Scientific experiment: RNA sequencing to quantify gene expression
levels under multiple biological conditions.
8
Need for scientific context : metadata
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Expected result: human+machine tractable reports
9
Annotated “Material & Methods”
Links to some workflow artifacts (algorithms, data)
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
5-star Linked Open Data
10
W3C standards for machine and human
readable data on the web.
⭑⭑⭑⭑⭑ : time and expertise !
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
5-star Linked Open Data
11
How to ease this process ?
● Workflow engines → automation
● PROV → workflow runs as linked data
W3C standards for machine and human
readable data on the web.
⭑⭑⭑⭑⭑ : time and expertise !
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PROV only
12
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PROV only
13
too fine-grained
no domain concepts
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Provenance as a Linked Experiment Report
14
few + meaningful statements
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Problem statement & objectives
15
Problem statement
Scientific workflows produce massive raw results. Their
publication into curated query-able linked data repositories
requires lot of time and expertise.
Can we exploit provenance traces to ease the publication of
scientific results as Linked Data ?
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Problem statement & objectives
16
Problem statement
Scientific workflows produce massive raw results. Their
publication into curated query-able linked data repositories
requires lot of time and expertise.
Can we exploit provenance traces to ease the publication of
scientific results as Linked Data ?
Objectives
(1) Leverage annotated workflow patterns to generate
provenance mining rules.
(2) Refine provenance traces into linked experiment reports.
Rules generation
17
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Approach
18
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Input domain-specific annotations (❶,❷)
Workflow patterns ❶
19
Sequence patterns, with possibly intermediate steps
● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar
● EDAM ontology: hasFunction, RNA sequence, Genome map
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Input domain-specific annotations (❶,❷)
Workflow patterns ❶
20
Sequence patterns, with possibly intermediate steps
● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar
● EDAM ontology: hasFunction, RNA sequence, Genome map
Experiment report template ❷
Link scientific claims, statements, material and methods
● MicroPublication ontology: Material, Method, Claims
● Experimental factor ontology: Transcriptome, Gene expression
● NCBI taxonomy: Homo Sapiens
● Open Annotation model: hasBody, hasTarget
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PoeM: generating PrOvEnance Mining rules ❸
21
(SPARQL Property path)
(SPARQL Basic graph pattern)
(SPARQL Construct query)
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PoeM: sample generated rule ❸
22
<If> part
<Then> part
First experiments &
results
23
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment context
24
SyMeTRIC: systems medicine project (2015-2017, call “Connect
Talent”), funded by the french region Pays de la Loire.
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment
25
Material & methods
● Real-life RNA-seq workflow to study 3 mice populations
● WF implemented in Galaxy, run on 2 biological samples
● PROV traces exported from Galaxy Histories (API)
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment
26
Material & methods
● Real-life RNA-seq workflow to study 3 mice populations
● WF implemented in Galaxy, run on 2 biological samples
● PROV traces exported from Galaxy Histories (API)
Results (for 1 biological sample)
● 60h CPU (12 cores for genome alignment), 21Gb storage
● 3s to export 81 PROV triples from the Galaxy history
● 2s to apply the rule and produce 35 Micropublication triples
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Conclusion & perspectives
27
Semi-automated approach
(1) PoeM generates semantic web rules
(2) PoeM rules applied on PROV traces to assemble
linked experiment reports (MicroPublication)
Limitations:
- Sequence workflow patterns only
- SPARQL property paths with complex WF patterns ?
- Syntactic matching between WF patterns and PROV labels
Usage scenarios:
→ Query workflow datasets with domain concepts
→ Populate RDF repositories with WF results
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Conclusion & perspectives
28
Future works
(1) WF patterns: split-merge, “common motifs”
(2) Genericity: other domains / other reports (RO, Nanopub.)
(3) PROV heterogeneity: multi-systems PROV
(4) Evaluation: involving biologists, at larger scale
Questions ?
Demo: http://poem.univ-nantes.fr
Contact: alban.gaignard@univ-nantes.fr
Acknowledgments
BiRD bioinformatics facility Connect Talent Call
TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 30
Larger scale experiments (PROV traces)
1232 edges
TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 31
Larger scale experiments (PROV traces)
49 edges

Más contenido relacionado

La actualidad más candente

Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录
insight-labs
 
Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013
Elsa von Licy
 

La actualidad más candente (20)

Getting the most from the reference assembly
Getting the most from the reference assemblyGetting the most from the reference assembly
Getting the most from the reference assembly
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
 
Mapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome CoordinatesMapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome Coordinates
 
Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...
 
Ashg grc workshop2015_tg
Ashg grc workshop2015_tgAshg grc workshop2015_tg
Ashg grc workshop2015_tg
 
agbt 2016 workshop lindsay
agbt 2016 workshop lindsayagbt 2016 workshop lindsay
agbt 2016 workshop lindsay
 
Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 
The Lily RowLog library
The Lily RowLog libraryThe Lily RowLog library
The Lily RowLog library
 
New methods hackathon mhc project
New methods   hackathon mhc projectNew methods   hackathon mhc project
New methods hackathon mhc project
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysis
 
Sept2016 sv nabsys
Sept2016 sv nabsysSept2016 sv nabsys
Sept2016 sv nabsys
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...
 
AGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: SchneiderAGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: Schneider
 
AGBT 2016 Workshop Magrini
AGBT 2016 Workshop MagriniAGBT 2016 Workshop Magrini
AGBT 2016 Workshop Magrini
 
Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013
 
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
 
Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...
 
Agbt2015 workshop schneider
Agbt2015 workshop schneiderAgbt2015 workshop schneider
Agbt2015 workshop schneider
 

Similar a PoemTapp16

Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
Ian Foster
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal
GenomeInABottle
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
GenomeInABottle
 

Similar a PoemTapp16 (20)

The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data Science
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
 
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
 
Thesis biobix
Thesis biobixThesis biobix
Thesis biobix
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
 
ICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick Provart
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualize
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasets
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
 
ISMB Workshop 2014
ISMB Workshop 2014ISMB Workshop 2014
ISMB Workshop 2014
 
2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngs2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngs
 
Bic bellec 2016_small
Bic bellec 2016_smallBic bellec 2016_small
Bic bellec 2016_small
 
SHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenanceSHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenance
 
SHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenanceSHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenance
 
BioSB meeting 2015
BioSB meeting 2015BioSB meeting 2015
BioSB meeting 2015
 
RAMP Data Challenge
RAMP Data Challenge RAMP Data Challenge
RAMP Data Challenge
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 

PoemTapp16

  • 1. From Scientific Workflow Patterns to 5-star Linked Open Data Alban Gaignard Hala Skaf-Molli Audrey Bihouée Nantes Academic Hospital (CHU de Nantes), CNRS, France LINA, Nantes University, CNRS, France Institut du Thorax, Nantes University, INSERM, CNRS, France 8th USENIX workshop on Theory and Practice of Provenance (TaPP’16) Washington DC
  • 2. Needs for linked experiment reports 2
  • 3. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 3
  • 4. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 4 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 5. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 5 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 6. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 6 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb Challenges Algorithmic performance, storage, preservation, reuse (limit recompute) & share. 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 7. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing experiment results Scientific experiment: RNA sequencing to quantify gene expression levels under multiple biological conditions. 7
  • 8. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing experiment results Scientific experiment: RNA sequencing to quantify gene expression levels under multiple biological conditions. 8 Need for scientific context : metadata
  • 9. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Expected result: human+machine tractable reports 9 Annotated “Material & Methods” Links to some workflow artifacts (algorithms, data)
  • 10. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée 5-star Linked Open Data 10 W3C standards for machine and human readable data on the web. ⭑⭑⭑⭑⭑ : time and expertise !
  • 11. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée 5-star Linked Open Data 11 How to ease this process ? ● Workflow engines → automation ● PROV → workflow runs as linked data W3C standards for machine and human readable data on the web. ⭑⭑⭑⭑⭑ : time and expertise !
  • 12. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PROV only 12
  • 13. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PROV only 13 too fine-grained no domain concepts
  • 14. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Provenance as a Linked Experiment Report 14 few + meaningful statements
  • 15. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Problem statement & objectives 15 Problem statement Scientific workflows produce massive raw results. Their publication into curated query-able linked data repositories requires lot of time and expertise. Can we exploit provenance traces to ease the publication of scientific results as Linked Data ?
  • 16. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Problem statement & objectives 16 Problem statement Scientific workflows produce massive raw results. Their publication into curated query-able linked data repositories requires lot of time and expertise. Can we exploit provenance traces to ease the publication of scientific results as Linked Data ? Objectives (1) Leverage annotated workflow patterns to generate provenance mining rules. (2) Refine provenance traces into linked experiment reports.
  • 18. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Approach 18
  • 19. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Input domain-specific annotations (❶,❷) Workflow patterns ❶ 19 Sequence patterns, with possibly intermediate steps ● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar ● EDAM ontology: hasFunction, RNA sequence, Genome map
  • 20. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Input domain-specific annotations (❶,❷) Workflow patterns ❶ 20 Sequence patterns, with possibly intermediate steps ● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar ● EDAM ontology: hasFunction, RNA sequence, Genome map Experiment report template ❷ Link scientific claims, statements, material and methods ● MicroPublication ontology: Material, Method, Claims ● Experimental factor ontology: Transcriptome, Gene expression ● NCBI taxonomy: Homo Sapiens ● Open Annotation model: hasBody, hasTarget
  • 21. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PoeM: generating PrOvEnance Mining rules ❸ 21 (SPARQL Property path) (SPARQL Basic graph pattern) (SPARQL Construct query)
  • 22. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PoeM: sample generated rule ❸ 22 <If> part <Then> part
  • 24. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment context 24 SyMeTRIC: systems medicine project (2015-2017, call “Connect Talent”), funded by the french region Pays de la Loire.
  • 25. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment 25 Material & methods ● Real-life RNA-seq workflow to study 3 mice populations ● WF implemented in Galaxy, run on 2 biological samples ● PROV traces exported from Galaxy Histories (API)
  • 26. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment 26 Material & methods ● Real-life RNA-seq workflow to study 3 mice populations ● WF implemented in Galaxy, run on 2 biological samples ● PROV traces exported from Galaxy Histories (API) Results (for 1 biological sample) ● 60h CPU (12 cores for genome alignment), 21Gb storage ● 3s to export 81 PROV triples from the Galaxy history ● 2s to apply the rule and produce 35 Micropublication triples
  • 27. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Conclusion & perspectives 27 Semi-automated approach (1) PoeM generates semantic web rules (2) PoeM rules applied on PROV traces to assemble linked experiment reports (MicroPublication) Limitations: - Sequence workflow patterns only - SPARQL property paths with complex WF patterns ? - Syntactic matching between WF patterns and PROV labels Usage scenarios: → Query workflow datasets with domain concepts → Populate RDF repositories with WF results
  • 28. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Conclusion & perspectives 28 Future works (1) WF patterns: split-merge, “common motifs” (2) Genericity: other domains / other reports (RO, Nanopub.) (3) PROV heterogeneity: multi-systems PROV (4) Evaluation: involving biologists, at larger scale
  • 29. Questions ? Demo: http://poem.univ-nantes.fr Contact: alban.gaignard@univ-nantes.fr Acknowledgments BiRD bioinformatics facility Connect Talent Call
  • 30. TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 30 Larger scale experiments (PROV traces) 1232 edges
  • 31. TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 31 Larger scale experiments (PROV traces) 49 edges