2. Provenance
Relates to any question about data lineage
Does it matter?
Big Data analytics is NOT free!
3. Taxonomy of Provenance
da Cruz et al. "Towards a taxonomy of provenance in scientific workflow management systems." Services, 2009
4. Scopes of R&D that were focused independently
Provenance
Data Collection
Workflow design
Changes to system
Version control
Data usage feedback
Reporting and learning
Learning system
Recommendation
Data usage
Monitoring
Resource
Time series
Control
Smart re-run
Fault detection
Data analysis
Data Provenance
Process provenance
Visualization
Version comparison
User tracking
Crawl et al. "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop. 2008.
5. Provenance entered into Big Data
Standardization is necessary
No system is ever complete
Users differ in expertise and goals
Fundamental research questions need to be identified
Data sources, formats, and management vary
Users need a meaningful and flexible way to interact
It is not feasible to impose a steep learning curve
6. When multiple domains join together…
I have my own style
01 Data provenance and workflow provenance are both necessary
02 Logging is necessary
03 Workflows differ by modelling, architecture and implementation from domain to domain
04 Logging mechanisms and log structures also differ
7. We want to bring everything into one place… for Big Data Provenance
Programming Model + Automated Logging
External configurability of logs
Use with a Domain Specific Language (DSL)
Extensible with further technologies
Parse logs into a Graph Database (GDB)
Proposed fundamental workflow provenance queries
Data visualization to answer queries
Primary complexity analysis
User study of visualizations
Scale the system
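The combination of automated logging and externally configurable logs listed above can be illustrated with a minimal sketch. It is not the presented system: the decorator name `logged`, the `LOG_CONFIG` structure, the in-memory `LOG_RECORDS` list and the `trim_reads` module are all hypothetical, chosen only to show a workflow step being logged without any logging code in the step itself.

```python
import functools
import json
import time

# Hypothetical external configuration: which fields each log record keeps.
# In practice this could be loaded from a file outside the workflow code.
LOG_CONFIG = json.loads('{"fields": ["name", "start", "duration", "status"]}')

LOG_RECORDS = []  # in-memory stand-in for a log file


def logged(func):
    """Record one provenance entry per call without changing the wrapped code."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        status = "ok"
        try:
            return func(*args, **kwargs)
        except Exception:
            status = "error"
            raise
        finally:
            record = {
                "name": func.__name__,
                "start": start,
                "duration": time.time() - start,
                "status": status,
                "args": repr(args),
            }
            # Keep only the fields the external configuration asks for.
            LOG_RECORDS.append({k: record[k] for k in LOG_CONFIG["fields"]})
    return wrapper


@logged
def trim_reads(reads, min_len):
    """Hypothetical workflow module: drop reads shorter than min_len."""
    return [r for r in reads if len(r) >= min_len]
```

Calling `trim_reads(["ACGT", "AC"], 3)` returns the filtered list and appends one record to `LOG_RECORDS`. Changing the `fields` entry in the configuration changes what is logged without touching the module.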
9. Object Oriented Programming Model
Proposed System Architecture
Layers: OOP Layer, Modelling Layer, DSL Layer, Tool Layer, User Layer
Components: Proposed Programming Model, Tools, DSL, Extension (Hadoop, Spark etc.), Logging Configuration
Roles: Workflow User, Domain Expert, Model Developer (each connected by a "uses" relation)
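One way to read the layering is sketched below, under the assumption that the OOP layer owns the logging, tools subclass it, and a small DSL composes tools into a workflow. The class names `Module`, `Uppercase` and `Deduplicate` and the `pipeline` function are hypothetical and do not come from the proposed system.

```python
class Module:
    """OOP layer: the base class owns logging, so subclasses need not."""

    log = []  # shared in-memory log, a hypothetical stand-in for log files

    def run(self, data):
        result = self.execute(data)
        Module.log.append({
            "module": type(self).__name__,
            "n_in": len(data),
            "n_out": len(result),
        })
        return result

    def execute(self, data):
        raise NotImplementedError


class Uppercase(Module):
    """Tool layer: a concrete tool only implements its own logic."""

    def execute(self, data):
        return [s.upper() for s in data]


class Deduplicate(Module):
    """Tool layer: removes duplicates while preserving order."""

    def execute(self, data):
        return list(dict.fromkeys(data))


def pipeline(*modules):
    """DSL layer: compose modules into a workflow callable."""
    def run(data):
        for module in modules:
            data = module.run(data)
        return data
    return run


# User layer: a workflow user only wires existing tools together.
workflow = pipeline(Uppercase(), Deduplicate())
```

In this reading, a model developer works on `Module`, a domain expert adds tools such as `Deduplicate`, and a workflow user only calls `pipeline`. Every step is logged by the base class.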
10. Proposed System Components Relation
Workflow System (Tools, DSL, Proposed Model, OOP, Extension) → Logs → Online Parser → Visualization Service (Reporting) → User
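The parser step between the logs and the graph database might look roughly like the sketch below. It assumes each log line is a JSON object with a `type` field, which is a hypothetical log format, and it only builds a Cypher statement with its parameters rather than talking to a database.

```python
import json


def parse_log_line(line):
    """Turn one JSON log line into a (cypher, parameters) pair.

    Assumes a hypothetical log format: a JSON object whose "type" field
    names the node label and whose other fields become node properties.
    """
    record = json.loads(line)
    label = record.pop("type")
    # The label is interpolated into the query text, so it is validated
    # first; the properties are passed separately as a parameter map.
    if not label.isidentifier():
        raise ValueError(f"unsafe label: {label!r}")
    return f"CREATE (n:{label} $props)", {"props": record}
```

For the line `{"type": "Module", "NAME": "FastQC", "cpu_run": "12"}` this yields the statement `CREATE (n:Module $props)` with the remaining fields as properties. A real parser would hand both to a database driver.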
11. Proposed fundamental queries vs Cypher
Unit Map
MATCH (n:Type)
WHERE Condition
RETURN n
Time Sequence Map
MATCH (n:Type)
WHERE Condition
RETURN n.p
ORDER BY n.ptime
Data Sequence Map
MATCH (n1:Type1),(n2:Type2)
WHERE Condition AND n1.p = n2.p
RETURN n1.p1, n2.p2
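To make the three query shapes concrete without a graph database, the sketch below expresses them over a plain list of dictionaries. The node representation (a `type` key for the label and a `time` key for ordering) and the function signatures are assumptions made for illustration only.

```python
def unit_map(nodes, node_type, condition):
    """Unit Map: MATCH (n:Type) WHERE Condition RETURN n"""
    return [n for n in nodes if n["type"] == node_type and condition(n)]


def time_sequence_map(nodes, node_type, condition, prop):
    """Time Sequence Map: return one property, ordered by the node's time."""
    hits = unit_map(nodes, node_type, condition)
    return [n[prop] for n in sorted(hits, key=lambda n: n["time"])]


def data_sequence_map(nodes, type1, type2, condition, key, p1, p2):
    """Data Sequence Map: pair nodes of two types that share a property."""
    left = [n for n in nodes if n["type"] == type1]
    right = [n for n in nodes if n["type"] == type2]
    return [
        (a[p1], b[p2])
        for a in left
        for b in right
        if a[key] == b[key] and condition(a, b)
    ]


# Hypothetical sample nodes.
nodes = [
    {"type": "Module", "name": "FastQC", "time": 2, "cpu": 30, "run": "r1"},
    {"type": "Module", "name": "FastQC", "time": 1, "cpu": 10, "run": "r1"},
    {"type": "Module", "name": "Trim", "time": 3, "cpu": 50, "run": "r2"},
    {"type": "Dataset", "name": "reads.fq", "run": "r1"},
]
```

The unit map is a plain filter, the time sequence map adds an ordering, and the data sequence map is a join on a shared property. This is also why its Cypher form matches two node patterns at once.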
13. What are the frequencies of different workflow components?
match(n) return n.label as label, count(n) as freq
What are the frequencies of different modules?
match (n:Module) return n.NAME as tool, count(n) as count
What is the time series mapping of CPU load for FastQC module?
match(n:Module) where n.NAME="FastQC" and n.cpu_run >= "0"
return n.time as time, n.cpu_run as cpuload order by n.time
What is the cpu load to execution time mapping for all modules?
match(n:Module) where n.cpu_run >= "0" and n.duration_run >= "0"
return n.NAME as name, n.cpu_run as cpu, n.duration_run as duration
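As a rough illustration of what the frequency and time-series queries compute, the sketch below reproduces them over hypothetical records. The queries compare against the string "0", so the sketch assumes property values are stored as strings and converts them to numbers before ordering; string ordering would otherwise place "10" before "9".

```python
from collections import Counter

# Hypothetical records mirroring the Module properties used in the queries.
modules = [
    {"NAME": "FastQC", "time": "2", "cpu_run": "35.0"},
    {"NAME": "FastQC", "time": "1", "cpu_run": "12.5"},
    {"NAME": "Trimmomatic", "time": "3", "cpu_run": "80.0"},
]


def module_frequencies(rows):
    """Count rows per module name, like grouping by n.NAME with count(n)."""
    return Counter(r["NAME"] for r in rows)


def cpu_time_series(rows, name):
    """Return (time, cpu load) pairs for one module, ordered by time."""
    hits = [r for r in rows if r["NAME"] == name]
    # Convert to numbers so the ordering is numeric rather than lexicographic.
    hits.sort(key=lambda r: float(r["time"]))
    return [(float(r["time"]), float(r["cpu_run"])) for r in hits]
```

In the Cypher versions, the non-aggregated return value (`n.NAME`) acts as the grouping key for `count(n)`, which is the role the `Counter` key plays here.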
17. Classification of WF Provenance Questions
Question types: Time Point (Unit Mapping), Time Series Sequence Mapping, Statistical Sequence Mapping
Tasks: Evaluate, Evaluate, Evaluate, Compare, Predict
Time frames: Past/Now, Past/Now, Past/Now, Past/Now, Future
object invocation object invocation sequence
frequency of object invocation
(inter WF) object-object invocation correlation (inter WF) object invocation
(inner WF) object-object invocation correlation (inner WF) object invocation
histogram of object invocation
(inter WF) histogram comparison (inter WF) distribution
(inner WF) histogram comparison (inner WF) distribution
statistical measurements
(inter WF) measurements comparison
(inter WF) threshold
(inter WF) measurements correlation
(inner WF) measurements comparison
(inner WF) threshold
(inner WF) measurements correlation
object source (module)
object lineage (module)
sequence
measurements of DAG
(inter WF) lineage-lineage comparison
(inter WF) graph similarity
(inter WF) lineage-lineage correlation
object destination (module)
(inner WF) lineage-lineage comparison
(inner WF) graph similarity
(inner WF) lineage-lineage correlation
object property object property sequence
frequency of object property
(inter WF) property-property comparison
(inter WF) object property
(inter WF) property-object correlation
histogram of object property (inner WF) property-property comparison
(inner WF) object property
statistical measurements (inner WF) property-object correlation
Ghoshal [1], Akidau [2], Anand [3], Buneman [4], Cheney [5]
18. A comprehensive classification leads the way to storytelling with data
Data visualization research can be merged with the queries in a systematic way
20. Chart (X, Y, Size, Color)
Columns: Frequency | Time series - ordinal | Time series - nominal | Mapping - ordinal vs ordinal | Mapping - nominal vs ordinal | Mapping - nominal vs nominal | Lineage
Bar chart X X X X
Grouped bar chart X X X X
Stacked bar chart X X X X
Line chart X X X
Step line chart X
Basis line chart X X X
Pie chart X X X X X
Ring chart X X X X X
Area chart X X
Stacked area chart X X
Scatter plot X X X X
Bubble chart X X X X X
Floating bar chart X X X X
Floating pie chart X X X X X
Floating ring chart X X X X X
Block matrix X X X X X X
Heatmap X
Histogram X X
Box plot X X
Strip chart X X X
Bee Swarm chart X X X
DAG X
Tree map X
Metric X
Tabular X X X X X X
24. Next Step 1: scale the system with state-of-the-art technologies
26. Next Step 2: find the best visualization for provenance queries through a user study
27. So many angles to investigate
How could even a single line chart be drawn in a better way?
Do we need interactivity?
What kind of interactivity is not excessive?
28. Scopes of R&D that were focused independently
[Diagram: the scopes from slide 4, each marked as Contributed, In Progress, or Future work]
Crawl et al. "A provenance-based fault tolerance mechanism for scientific workflows." International Provenance and Annotation Workshop. 2008.
29. References
1. Ghoshal et al., "Provenance from log files: a BigData problem." Proceedings of the Joint EDBT/ICDT 2013 Workshops.
2. Akidau et al., "The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing." Proceedings of the VLDB Endowment, 2015.
3. Anand et al., "Techniques for efficiently querying scientific workflow provenance graphs." EDBT 2010.
4. Buneman et al., "Why and where: A characterization of data provenance." International conference on database theory. 2001.
5. Cheney et al., "Provenance in databases: Why, how, and where." Foundations and Trends® in Databases, 2009.
6. da Cruz, Sérgio Manuel Serra, Maria Luiza M. Campos, and Marta Mattoso. "Towards a taxonomy of provenance in scientific workflow management systems." Services-I, 2009 World Conference on. IEEE, 2009.
7. Crawl, Daniel, and Ilkay Altintas. "A provenance-based fault tolerance mechanism for scientific workflows." Provenance and Annotation of Data and Processes (2008): 152-159.
8. Amsterdamer, Yael, et al. "Putting lipstick on pig: Enabling database-style workflow provenance." Proceedings of the VLDB Endowment 5.4 (2011): 346-357.
9. Hazel, Dan. "Using rational numbers to key nested sets." arXiv preprint arXiv:0806.3115 (2008).
10. Green, Todd J., Grigoris Karvounarakis, and Val Tannen. "Provenance semirings." Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2007.
11. Acar, Umut, et al. "A graph model of data and workflow provenance." 2010.
12. Dominguez-Sal, David, et al. "Survey of graph database performance on the hpc scalable graph analysis benchmark." International Conference on Web-Age Information Management. Springer, Berlin, Heidelberg, 2010.