Workflow Provenance: From Modelling to Reporting
Rayhan Ferdous
Banani Roy
Chanchal K. Roy
Kevin A. Schneider
Provenance
Relates to any question about data lineage
Does it matter?
Big Data Analytics is NOT for FREE !!!
Taxonomy of Provenance
da Cruz et al. "Towards a taxonomy of provenance in scientific workflow management systems." Services, 2009
Scopes of R&D that were focused independently
[Figure: diagram of provenance research scopes. Labels: Provenance; Data Collection; Workflow design; Changes to system; Version control; Data usage feedback; Reporting and learning; Learning system; Recommendation; Data usage; Monitoring; Resource; Time series; Control; Smart re-run; Fault detection; Data analysis; Data Provenance; Process provenance; Visualization; Version comparison; User tracking.]
Crawl et al. "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop. 2008.
Provenance entered into Big Data
Standardization is necessary
No system is ever complete
Users have different levels of expertise and different goals
Fundamental research questions need to be identified
Data sources, formats and management vary
Users need a meaningful and flexible way to interact
It is not feasible to impose a steep learning curve
When multiple domains join together…
"I have my own style"
01 Data provenance and workflow provenance are both necessary
02 Logging is necessary
03 Workflows differ in modelling, architecture and implementation from domain to domain
04 Logging mechanisms and log structures also differ
We want to bring everything into one place… for Big Data Provenance
Programming Model + Automated Logging
External configurability of logs
Use with a Domain Specific Language (DSL)
Extensible with further technologies
Parse logs in Graph Database (GDB)
Proposed fundamental workflow provenance queries
Data visualization to answer queries
Primary complexity analysis
User study of visualizations
Scale the system
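The first two items above (automated logging, externally configurable logs) can be illustrated with a minimal Python sketch. This is not the authors' programming model: the `logged` decorator, the `LOG_CONFIG` dictionary, its field names and the toy `trim_reads` module are all hypothetical and exist only to show the idea of logging every module invocation without touching the module's own code.

```python
import functools
import time

# Hypothetical external configuration: which fields each log record keeps.
# In a real system this could be loaded from a file; the names are assumptions.
LOG_CONFIG = {"fields": ["name", "start", "duration", "status"]}

# In-memory stand-in for a log file.
LOG_RECORDS = []


def logged(func):
    """Append one log record per invocation of the wrapped workflow module."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        status = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "error"
            raise
        finally:
            record = {
                "name": func.__name__,
                "start": start,
                "duration": time.time() - start,
                "status": status,
            }
            # Keep only the fields requested by the external configuration.
            LOG_RECORDS.append({k: record[k] for k in LOG_CONFIG["fields"]})
    return wrapper


@logged
def trim_reads(reads):
    """Toy workflow module (hypothetical): drop empty reads."""
    return [r for r in reads if r]
```

Calling `trim_reads(["ACGT", "", "TTGA"])` returns the two non-empty reads and leaves one record in `LOG_RECORDS`; changing `LOG_CONFIG["fields"]` changes what is logged without editing the module.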
Avoiding Mathematical Modelling for this Session
Proposed System Architecture
[Figure: layered architecture diagram. Layers: OOP Layer; Modelling Layer; DSL Layer; Tool Layer; User Layer. Elements: Object Oriented Programming Model; Proposed Programming Model; DSL; Tools; Extension (Hadoop, Spark etc.); Logging Configuration. Roles, each attached with a "uses" label: Model Developer; Domain Expert; Workflow User.]
Proposed System Components Relation
[Figure: component diagram. Components: Workflow System (Tools, DSL, Proposed Model, OOP, Extension); Logs; Online Parser; Visualization Service (Reporting); User.]
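As a rough illustration of the parser component, the sketch below turns log lines into parameterized Cypher `CREATE` statements for a graph database. The log format (one JSON object per line with a `type` field), the `ALLOWED_LABELS` set and both function names are assumptions made for this example; the slides do not specify the actual log structure.

```python
import json

# Hypothetical node types; labels cannot be passed as Cypher parameters,
# so they are checked against a fixed set before being placed in the query.
ALLOWED_LABELS = {"Module", "Workflow", "Data"}


def parse_log_line(line):
    """Turn one JSON log line into a (query, parameters) pair."""
    entry = json.loads(line)
    label = entry.pop("type")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown node type: {label}")
    # The remaining fields become node properties via a parameter map.
    return f"CREATE (n:{label} $props)", {"props": entry}


def parse_stream(lines):
    """Parse an iterable of log lines, skipping blank ones."""
    for line in lines:
        line = line.strip()
        if line:
            yield parse_log_line(line)
```

Each yielded pair could then be handed to a graph database driver; keeping property values in parameters rather than in the query text avoids quoting problems.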
Proposed fundamental queries vs Cypher
Unit Map
MATCH (n:Type)
WHERE Condition
RETURN n
Time Sequence Map
MATCH (n:Type)
WHERE Condition
RETURN n.p
ORDER BY n.ptime
Data Sequence Map
MATCH (n1:Type1), (n2:Type2)
WHERE Condition AND n1.p = n2.p
RETURN n1.p1, n2.p2
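The three templates above can be filled in programmatically. The following Python sketch builds the query strings from their placeholders; the function and parameter names are hypothetical, and plain string formatting is used only for illustration (it should not be fed untrusted input).

```python
def unit_map(node_type, condition):
    """Unit Map: return the matching nodes themselves."""
    return f"MATCH (n:{node_type}) WHERE {condition} RETURN n"


def time_sequence_map(node_type, condition, prop, time_prop):
    """Time Sequence Map: one property, ordered by a time property."""
    return (f"MATCH (n:{node_type}) WHERE {condition} "
            f"RETURN n.{prop} ORDER BY n.{time_prop}")


def data_sequence_map(type1, type2, condition, join_prop, prop1, prop2):
    """Data Sequence Map: pair two node types on a shared property."""
    return (f"MATCH (n1:{type1}), (n2:{type2}) "
            f"WHERE {condition} AND n1.{join_prop} = n2.{join_prop} "
            f"RETURN n1.{prop1}, n2.{prop2}")
```

For instance, `time_sequence_map("Module", 'n.NAME = "FastQC"', "cpu_run", "time")` produces a query of the same shape as the CPU-load example in the deck.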
Examples of some queries
What are the frequencies of different workflow components?
match(n) return n.label as label, count(n) as freq
What are the frequencies of different modules?
match (n:Module) return n.NAME as tool, count(n) as count
What is the time series mapping of CPU load for FastQC module?
match(n:Module) where n.NAME="FastQC" and n.cpu_run >= "0"
return n.time as time, n.cpu_run as cpuload order by n.time
What is the CPU load to execution time mapping for all modules?
match(n:Module) where n.cpu_run >= "0" and n.duration_run >= "0"
return n.NAME as name, n.cpu_run as cpu, n.duration_run as duration
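To make the intent of these queries concrete without a running graph database, the sketch below emulates two of them (module frequencies and the FastQC CPU-load time series) over a small in-memory list of nodes. The sample data is made up, `Trimmer` is a hypothetical module name, and `cpu_run` is stored as a number here, whereas the queries above compare it against a string.

```python
from collections import Counter

# Made-up sample nodes standing in for graph database content.
NODES = [
    {"label": "Module", "NAME": "FastQC", "time": 3, "cpu_run": 40.0},
    {"label": "Module", "NAME": "FastQC", "time": 1, "cpu_run": 25.0},
    {"label": "Module", "NAME": "Trimmer", "time": 2, "cpu_run": 10.0},
    {"label": "Data", "NAME": "reads.fq", "time": 0},
]


def module_frequencies(nodes):
    """Count how often each module name occurs (frequency query)."""
    return Counter(n["NAME"] for n in nodes if n["label"] == "Module")


def cpu_time_series(nodes, module_name):
    """Return (time, cpu_run) pairs for one module, ordered by time."""
    rows = [
        (n["time"], n["cpu_run"])
        for n in nodes
        if n["label"] == "Module"
        and n["NAME"] == module_name
        and n.get("cpu_run", -1) >= 0
    ]
    return sorted(rows)
```

The first function corresponds to a frequency chart, the second to a time-series chart of CPU load.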
Classification of Workflow Provenance Queries
Is this really necessary ???!!!
Classification | WF Provenance Questions
Time Point (Unit Mapping) | Time Series Sequence Mapping | Statistical Sequence Mapping
Evaluate Evaluate Evaluate Compare Predict
Past Now Past Now Past Now Past Now Future
object invocation object invocation sequence
frequency of object invocation
(inter WF) object-object invocation correlation (inter WF) object invocation
(inner WF) object-object invocation correlation (inner WF) object invocation
histogram of object invocation
(inter WF) histogram comparison (inter WF) distribution
(inner WF) histogram comparison (inner WF) distribution
statistical measurements
(inter WF) measurements comparison
(inter WF) threshold
(inter WF) measurements correlation
(inner WF) measurements comparison
(inner WF) threshold
(inner WF) measurements correlation
object source (module)
object lineage (module)
sequence
measurements of DAG
(inter WF) lineage-lineage comparison
(inter WF) graph similarity
(inter WF) lineage-lineage correlation
object destination (module)
(inner WF) lineage-lineage comparison
(inner WF) graph similarity
(inner WF) lineage-lineage correlation
object property object property sequence
frequency of object property
(inter WF) property-property comparison
(inter WF) object property
(inter WF) property-object correlation
histogram of object property (inner WF) property-property comparison
(inner WF) object property
statistical measurements (inner WF) property-object correlation
400+ possible queries
The number grows with the GDB node properties and their combinations
Coverage of Existing Works
[Figure: the same classification table as above, marking the cells covered by existing works. Works shown: Ghoshal; Akidau; Anand; Buneman; Cheney.]
A comprehensive classification leads the way to storytelling with data
Data visualization research can be merged with the queries in a systematic way
Primary Visualization Suggestion
Chart (X, Y, Size, Color) | Frequency | Time series - ordinal | Time series - nominal | Mapping - ordinal vs ordinal | Mapping - nominal vs ordinal | Mapping - nominal vs nominal | Lineage
Bar chart X X X X
Grouped bar chart X X X X
Stacked bar chart X X X X
Line chart X X X
Step line chart X
Basis line chart X X X
Pie chart X X X X X
Ring chart X X X X X
Area chart X X
Stacked area chart X X
Scatter plot X X X X
Bubble chart X X X X X
Floating bar chart X X X X
Floating pie chart X X X X X
Floating ring chart X X X X X
Block matrix X X X X X X
Heatmap X
Histogram X X
Box plot X X
Strip chart X X X
Bee Swarm chart X X X
DAG X
Tree map X
Metric X
Tabular X X X X X X
Complexity of our approach
Selected modules & implemented workflows
System Configuration:
Intel Core i7-7700
16 GB DDR4 RAM
256GB SSD
Ubuntu 16.04 LTS
Next Step 1: scale the system with state-of-the-art technologies.
Next Step 2: find the best visualization for provenance queries through a user study.
So many angles to investigate
How could even a simple line chart be drawn in a better way?
Do we need interactivity?
How much interactivity is useful before it becomes excessive?
Scopes of R&D that were focused independently
[Figure: the same scope diagram as earlier, with each scope marked by status. Legend: Contributed; In Progress; Future work.]
Crawl et al. "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop. 2008.
References
1. Ghoshal et al., "Provenance from log files: a BigData problem." Proceedings of the Joint EDBT/ICDT 2013 Workshops.
2. Akidau et al., "The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing." Proceedings of the VLDB Endowment, 2015.
3. Anand et al., "Techniques for efficiently querying scientific workflow provenance graphs." EDBT 2010.
4. Buneman et al., "Why and where: A characterization of data provenance." International conference on database theory. 2001.
5. Cheney et al., "Provenance in databases: Why, how, and where." Foundations and Trends® in Databases, 2009.
6. da Cruz, Sérgio Manuel Serra, Maria Luiza M. Campos, and Marta Mattoso. "Towards a taxonomy of provenance in scientific workflow management systems." Services-I, 2009 World Conference on. IEEE, 2009.
7. Crawl, Daniel, and Ilkay Altintas. "A provenance-based fault tolerance mechanism for scientific workflows." Provenance and Annotation of Data and Processes (2008): 152-159.
8. Amsterdamer, Yael, et al. "Putting lipstick on pig: Enabling database-style workflow provenance." Proceedings of the VLDB Endowment 5.4 (2011): 346-357.
9. Hazel, Dan. "Using rational numbers to key nested sets." arXiv preprint arXiv:0806.3115 (2008).
10. Green, Todd J., Grigoris Karvounarakis, and Val Tannen. "Provenance semirings." Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2007.
11. Acar, Umut, et al. "A graph model of data and workflow provenance." 2010.
12. Dominguez-Sal, David, et al. "Survey of graph database performance on the hpc scalable graph analysis benchmark." International Conference on Web-Age Information Management. Springer, Berlin, Heidelberg, 2010.
Thanks !!! (Demo)

Más contenido relacionado

Similar a Workflow Provenance: From Modelling to Reporting

Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationIan Foster
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Rakebul Hasan
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesIan Foster
 
Knowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceKnowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceDavid De Roure
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석datasciencekorea
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodDuncan Hull
 
Managing textual data semantically in relational databases by wael yahfooz an...
Managing textual data semantically in relational databases by wael yahfooz an...Managing textual data semantically in relational databases by wael yahfooz an...
Managing textual data semantically in relational databases by wael yahfooz an...SK Ahammad Fahad
 
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data SetsHortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data SetsIJMER
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeWorkflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeCarole Goble
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
TUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsTUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsHong-Linh Truong
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-Systeminside-BigData.com
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudAmazon Web Services
 
CiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big DataCiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big DataJian Wu
 
Data legend dh_benelux_2017.key
Data legend dh_benelux_2017.keyData legend dh_benelux_2017.key
Data legend dh_benelux_2017.keyRichard Zijdeman
 

Similar a Workflow Provenance: From Modelling to Reporting (20)

Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
 
Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...Predicting query performance and explaining results to assist Linked Data con...
Predicting query performance and explaining results to assist Linked Data con...
 
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy SciencesDiscovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences
 
Knowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceKnowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems Science
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석Bayesian Network 을 활용한 예측 분석
Bayesian Network 을 활용한 예측 분석
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Managing textual data semantically in relational databases by wael yahfooz an...
Managing textual data semantically in relational databases by wael yahfooz an...Managing textual data semantically in relational databases by wael yahfooz an...
Managing textual data semantically in relational databases by wael yahfooz an...
 
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data SetsHortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeWorkflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
TUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflowsTUW - Quality of data-aware data analytics workflows
TUW - Quality of data-aware data analytics workflows
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
OLTP-Bench
OLTP-BenchOLTP-Bench
OLTP-Bench
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the Cloud
 
CiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big DataCiteSeerX: Mining Scholarly Big Data
CiteSeerX: Mining Scholarly Big Data
 
Data legend dh_benelux_2017.key
Data legend dh_benelux_2017.keyData legend dh_benelux_2017.key
Data legend dh_benelux_2017.key
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Resume
ResumeResume
Resume
 

Último

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 

Último (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 

Workflow Provenance: From Modelling to Reporting

  • 1. Workflow Provenance: From Modelling to Reporting Rayhan Ferdous Banani Roy Chanchal K. Roy Kevin A. Schneider
  • 2. Provenance Relates to any question about data lineage Does it matter? Big Data Analytics is NOT for FREE !!!
  • 3. Taxonomy of Provenance da Cruz et al. "Towards a taxonomy of provenance in scientific workflow management systems." Services, 2009
  • 4. Scopes of R&D that were focused independently Provenance Data Collection Workflow design Changes to system Version control Data usage feedback Reporting and learning Learning system Recommendation Data usage Monitoring Resource Time series Control Smart re-run Fault detection Data analysis Data Provenance Process provenance Visualization Version comparison User tracking Crawl et al. "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop. 2008.
  • 5. Provenance entered into Big Data Standardization is necessary Any system is never complete Users are from different levels of expertise and goals Fundamental research questions need to be identified Data source, format, management varies Users need a meaningful and flexible way to interact Its not feasible to offer a high learning curve
  • 6. When multiple domains join together… I have my own style Data provenance vs workflow provenance are necessary 01 Logging is necessary 02 Workflows differ by modelling, architecture and implementation from domain to domain 03 Logging mechanisms and log structures also differ 04
  • 7. We want to bring everything into one place… for Big Data Provenance Programming Model + Automated Logging External configurability of logs Use with a Domain Specific Language (DSL) Extensible with further technologies Parse logs in Graph Database (GDB) Proposed fundamental workflow provenance queries Data visualization to answer queries Primary complexity analysis User study of visualizations Scale the system
  • 9. Object Oriented Programming Model Proposed Programming Model Tools DSL Extension (Hadoop, Spark etc.) Logging Configuration Workflow User Domain Expert Model Developer uses uses uses OOP Layer Modelling Layer DSL Layer Tool Layer User Layer Proposed System Architecture
  • 10. Workflow System (Tools, DSL, Proposed Model, OOP, Extension) Logs Online Parser Visualization Service (Reporting) User Proposed System Components Relation
  • 11. Proposed fundamental queries vs Cypher Unit Map MATCH (n:Type) WHERE Condition RETURN n Time Sequence Map MATCH (n:Type) WHERE Condition RETURN n.p ORDER BY n.ptime Data Sequence Map MATCH (n1:Type1),(n2:Type2) WHERE Condition AND n1.p ==n2.p RETURN n1.p1, n2.p2
  • 12. Examples of some queries
  • 13. What are the frequencies of different workflow components? match(n) return n.label as label, count(n) as freq What are the frequencies of different modules? match (n:Module) return n.NAME as tool, count(n) as count What is the time series mapping of CPU load for FastQC module? match(n:Module) where n.NAME="FastQC" and n.cpu_run >= "0“ return n.time as time, n.cpu_run as cpuload order by n.time What is the cpu load to execution time mapping for all modules? match(n:Module) where n.cpu_run >= "0" and n.duration_run >= "0" return n.NAME as name, n.cpu_run as cpu, n.duration_run as duration
  • 14. Classification of Workflow Provenance Queries Is this really necessary ???!!!
  • 15. Classification WF Provenance Questions Time Point (Unit Mapping) Time Series Sequence Mapping Statistical Sequence Mapping Evaluate Evaluate Evaluate Compare Predict Past Now Past Now Past Now Past Now Future object invocation object invocation sequence frequency of object invocation (inter WF) object-object invocation correlation (inter WF) object invocation (inner WF) object-object invocation correlation (inner WF) object invocation histogram of object invocation (inter WF) histogram comparison (inter WF) distribution (inner WF) histogram comparison (inner WF) distribution statistical measurements (inter WF) measurements comparison (inter WF) threshold (inter WF) measurements correlation (inner WF) measurements comparison (inner WF) threshold (inner WF) measurements correlation object source (module) object lineage (module) sequence measurements of DAG (inter WF) lineage-lineage comparison (inter WF) graph similarity (inter WF) lineage-lineage correlation object destination (module) (inner WF) lineage-lineage comparison (inner WF) graph similarity (inner WF) lineage-lineage correlation object property object property sequence frequency of object property (inter WF) property-property comparison (inter WF) object property (inter WF) property-object correlation histogram of object property (inner WF) property-property comparison (inner WF) object property statistical measurements (inner WF) property-object correlation 400+ possible queries Increases according to the GDB Node properties and different combinations
  • 17. Classification of WF Provenance Questions (same table as slide 15), with references: Ghoshal, Akidau, Anand, Buneman, Cheney
  • 18. A comprehensive classification paves the way for storytelling with data. Data visualization research can be merged with the queries in a systematic way.
  • 20. Columns: Chart (X, Y, Size, Color) | Frequency | Time series - ordinal | Time series - nominal | Mapping - ordinal vs ordinal | Mapping - nominal vs ordinal | Mapping - nominal vs nominal | Lineage
    Bar chart: X X X X
    Grouped bar chart: X X X X
    Stacked bar chart: X X X X
    Line chart: X X X
    Step line chart: X
    Basis line chart: X X X
    Pie chart: X X X X X
    Ring chart: X X X X X
    Area chart: X X
    Stacked area chart: X X
    Scatter plot: X X X X
    Bubble chart: X X X X X
    Floating bar chart: X X X X
    Floating pie chart: X X X X X
    Floating ring chart: X X X X X
    Block matrix: X X X X X X
    Heatmap: X
    Histogram: X X
    Box plot: X X
    Strip chart: X X X
    Bee Swarm chart: X X X
    DAG: X
    Tree map: X
    Metric: X
    Tabular: X X X X X X
  • 21. Complexity of our approach
  • 23. System Configuration: Intel Core i7-7700, 16 GB DDR4 RAM, 256 GB SSD, Ubuntu 16.04 LTS
  • 24. Next Step 1: scale the system with state-of-the-art technologies.
  • 26. Next Step 2: find the best visualization for provenance queries through a user study.
  • 27. So many angles to investigate: How could even a simple line chart be drawn in a better way? Do we need interactivity? What kind of interactivity is not excessive?
  • 28. Scopes of R&D that were focused on independently: Provenance Data Collection Workflow design Changes to system Version control Data usage feedback Reporting and learning Learning system Recommendation Data usage Monitoring Resource Time series Control Smart re-run Fault detection Data analysis Data Provenance Process provenance Visualization Version comparison User tracking
    Crawl et al., "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop, 2008.
    Legend: Contributed, In Progress, Future work
  • 29. References
    1. Ghoshal et al., "Provenance from log files: a BigData problem." Proceedings of the Joint EDBT/ICDT 2013 Workshops.
    2. Akidau et al., "The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing." Proceedings of the VLDB Endowment, 2015.
    3. Anand et al., "Techniques for efficiently querying scientific workflow provenance graphs." EDBT 2010.
    4. Buneman et al., "Why and where: A characterization of data provenance." International Conference on Database Theory, 2001.
    5. Cheney et al., "Provenance in databases: Why, how, and where." Foundations and Trends in Databases, 2009.
    6. da Cruz, Sérgio Manuel Serra, Maria Luiza M. Campos, and Marta Mattoso. "Towards a taxonomy of provenance in scientific workflow management systems." Services-I, 2009 World Conference on. IEEE, 2009.
    7. Crawl, Daniel, and Ilkay Altintas. "A provenance-based fault tolerance mechanism for scientific workflows." Provenance and Annotation of Data and Processes (2008): 152-159.
    8. Amsterdamer, Yael, et al. "Putting lipstick on pig: Enabling database-style workflow provenance." Proceedings of the VLDB Endowment 5.4 (2011): 346-357.
    9. Hazel, Dan. "Using rational numbers to key nested sets." arXiv preprint arXiv:0806.3115 (2008).
    10. Green, Todd J., Grigoris Karvounarakis, and Val Tannen. "Provenance semirings." Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2007.
    11. Acar, Umut, et al. "A graph model of data and workflow provenance." 2010.
    12. Dominguez-Sal, David, et al. "Survey of graph database performance on the hpc scalable graph analysis benchmark." International Conference on Web-Age Information Management. Springer, Berlin, Heidelberg, 2010.