Big Data Provenance - Challenges and Implications for Benchmarking
Boris Glavic
IIT DBGroup
December 17, 2012
Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
4 Conclusions
Disclaimer
• My previous benchmark experience
• Limited to being a user
• My main background is in provenance
• . . . mostly for RDBMS
• The main point of this talk (visions)
• Provenance as a benchmark use-case
• Provenance as a supporting technology for benchmarking
• I will make some outrageous claims to get my point across
Slide 1 of 23 Boris Glavic Big Data Provenance - Provenance
What is Provenance?
• Information about the creation process and origin of data
• Data items and collections
• Data item = atomic unit of data
• Transformation
• Abstraction for processing
• Inputs and outputs are data items (collections)
Types of Provenance
• Given a data item d
Data Provenance
• Which data items were used in the generation of d
• Representation: e.g., a set of data items
Transformation Provenance
• Which transformations generated d
• transitively
Additional Information
• Execution Environment
• User
• Time
• . . .
Data and Transformation Provenance
Example (Data Provenance)
[Figure: data provenance of an example data item]
Example (Transformation Provenance)
[Figure: transformation provenance of an example data item]
Coarse-grained vs. Fine-grained Data Provenance
Coarse-grained Provenance
• Transformations are handled as black-boxes
• ⇒ Each output of a transformation depends on all inputs
Fine-grained Provenance
• Consider the processing logic of a transformation to determine data-flow
• ⇒All data items that contributed to a result
• Sufficiency: sufficient to derive d through transformation
• Necessity: necessary in deriving d through transformation
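The difference between the two granularities can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical transformation that sums values per group; the function names and the record layout are invented for this example and do not come from the talk.

```python
from collections import defaultdict

def sum_by_group(records):
    """Transformation: sum the values of all (group, value) records per group."""
    totals = defaultdict(int)
    for group, value in records:
        totals[group] += value
    return dict(totals)

def coarse_grained_provenance(records, outputs):
    """Black-box view: every output depends on every input position."""
    all_inputs = set(range(len(records)))
    return {group: set(all_inputs) for group in outputs}

def fine_grained_provenance(records, outputs):
    """Use the transformation's logic: only inputs of the same group contribute."""
    prov = {group: set() for group in outputs}
    for position, (group, _value) in enumerate(records):
        prov[group].add(position)
    return prov

records = [("a", 1), ("b", 5), ("a", 2)]
outputs = sum_by_group(records)
coarse = coarse_grained_provenance(records, outputs)
fine = fine_grained_provenance(records, outputs)
```

Here the coarse-grained provenance of group "b" contains all three inputs, while its fine-grained provenance contains only the one record that is both sufficient and necessary to derive it.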
Granularity
Example (Coarse-grained)
[Figure: coarse-grained provenance of an example data item]
Example (Fine-grained)
[Figure: fine-grained provenance of an example data item]
Use-cases
• Data debugging
• E.g., tracing an erroneous data item back to erroneous inputs
• Trust and Quality
• E.g., computing trust based on trust in data items in provenance
• Probabilistic data
• Computing probability of result based on probabilities in provenance
• Security
• Enforce access-control on query results based on provenance
• Understanding misbehaviour of systems
• E.g., detecting security breaches
Annotations
Provenance as Annotations
• Model provenance as annotations on the data
• ⇒Provenance management =
• Efficient storage of annotations
• Propagating annotations through transformations
Example (Systems applying this approach)
• Perm
• Orchestra
• DBNotes
• Trio
• . . . and many more
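Propagating annotations through transformations can be illustrated with a toy relational example. This is only a sketch under simple assumptions (relations as Python lists of (tuple, annotation set) pairs, hypothetical helper names `select` and `join`); it is not how any of the systems listed above is implemented.

```python
def select(relation, predicate):
    """Selection keeps the annotation of each surviving tuple unchanged."""
    return [(row, ann) for row, ann in relation if predicate(row)]

def join(left, right, condition):
    """A join result is annotated with the union of both inputs' annotations."""
    return [
        (l_row + r_row, l_ann | r_ann)
        for l_row, l_ann in left
        for r_row, r_ann in right
        if condition(l_row, r_row)
    ]

# Each input tuple is annotated with its own (made-up) identifier.
orders = [
    (("o1", "alice"), frozenset({"orders:1"})),
    (("o2", "bob"), frozenset({"orders:2"})),
]
customers = [(("alice", "Chicago"), frozenset({"customers:1"}))]

result = join(
    select(orders, lambda row: row[1] == "alice"),
    customers,
    lambda order, customer: order[1] == customer[0],
)
```

The single result tuple carries the identifiers of exactly the order and customer tuples it was derived from.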
References
Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan, and Gaurav Vijayvargiya.
An Annotation Management System for Relational Databases.
VLDB Journal, 14(4):373–396, 2005.
Jennifer Widom.
Trio: A System for Managing Data, Uncertainty, and Lineage.
Managing and Mining Uncertain Data, 2008.
Boris Glavic and Gustavo Alonso.
Perm: Processing Provenance and Data on the same Data Model through Query Rewriting.
ICDE, 174-185, 2009.
T.J. Green, G. Karvounarakis, Z.G. Ives, and V. Tannen.
Provenance in ORCHESTRA.
IEEE Data Eng. Bull., 33(3):9-16, 2010.
Size Considerations
• Provenance models dependencies between inputs and outputs of a transformation
• ⇒A subset of Inputs × Outputs
• ⇒Can be quadratic in size
• DAG of transformations
• multiply by maximal path length
• Provenance of two data items often overlaps
• Reuse of intermediate results
• Coarse-grained provenance
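The quadratic bound on provenance size can be made concrete by counting (input, output) dependency pairs. A small sketch, assuming a hypothetical helper `provenance_pairs` and two made-up dependency patterns:

```python
def provenance_pairs(num_inputs, num_outputs, depends):
    """Enumerate the (input, output) pairs for which depends(i, o) holds."""
    return {
        (i, o)
        for i in range(num_inputs)
        for o in range(num_outputs)
        if depends(i, o)
    }

n = 100
# Worst case: every output depends on every input,
# e.g. each value is normalised by a global sum.
all_to_all = provenance_pairs(n, n, lambda i, o: True)
# Best case for n outputs: each output depends on exactly one input,
# e.g. a record-at-a-time map.
one_to_one = provenance_pairs(n, n, lambda i, o: i == o)
```

With 100 inputs and 100 outputs, the all-to-all case yields 10,000 pairs, against 100 pairs for the one-to-one case.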
Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
4 Conclusions
Big Provenance Challenges
• Big data analytics is all about agility
• ⇒ un/semi-structured data and no extensive meta-data available
• ⇒ hard to define a provenance model
• Transparency of distribution
• Provenance use-case may require location information
• Analytics that cross system boundaries
• Inter-operability of storage formats and systems
Example - Provenance for Map-Reduce Workflows
• Fine-grained provenance for workflows that are DAGs of map and reduce functions
• Add provenance to values: (key, value) → (key, (value, p))
• Provenance determined by map/reduce function semantics
• Wrap map and reduce functions to handle provenance
• Strip off and cache provenance
• Call original function
• Re-attach provenance according to function semantics
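The wrapping steps above can be sketched as follows. This is a toy, single-process illustration and not the implementation from the cited paper: the wrapper names, the tiny `run` driver, and the word-count functions are all assumptions made for this example.

```python
from collections import defaultdict

def wrap_map(map_fn):
    """Let a plain map function work on (key, (value, p)) pairs."""
    def wrapped(key, annotated):
        value, prov = annotated                        # strip off and cache provenance
        for out_key, out_value in map_fn(key, value):  # call the original function
            yield out_key, (out_value, prov)           # output inherits its input's provenance
    return wrapped

def wrap_reduce(reduce_fn):
    """Let a plain reduce function work on lists of (value, p) pairs."""
    def wrapped(key, annotated_values):
        values = [value for value, _ in annotated_values]
        # A reduce output depends on all inputs of its group.
        prov = frozenset().union(*(p for _, p in annotated_values))
        for out_key, out_value in reduce_fn(key, values):
            yield out_key, (out_value, prov)
    return wrapped

def run(map_fn, reduce_fn, inputs):
    """Toy driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, annotated in inputs:
        for out_key, out in map_fn(key, annotated):
            groups[out_key].append(out)
    results = {}
    for key, annotated_values in groups.items():
        for out_key, out in reduce_fn(key, annotated_values):
            results[out_key] = out
    return results

def tokenize(_doc_id, text):
    """Original, provenance-unaware map function."""
    for word in text.split():
        yield word, 1

def count(word, ones):
    """Original, provenance-unaware reduce function."""
    yield word, sum(ones)

inputs = [
    ("d1", ("big data", frozenset({"d1"}))),
    ("d2", ("big provenance", frozenset({"d2"}))),
]
results = run(wrap_map(tokenize), wrap_reduce(count), inputs)
```

The original `tokenize` and `count` functions never see the provenance; the wrappers alone decide how it is re-attached.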
References
R. Ikeda, H. Park, and J. Widom.
Provenance for generalized map and reduce workflows.
CIDR, 273-283, 2011.
Example - OS Provenance for the Cloud
PASS with Cloud Storage Backend
• File and process level provenance
• Each node runs a modified Linux kernel (PASS)
• Intercept system-calls for file and process creation
• Store provenance in Amazon S3, SimpleDB, and/or SQS
Instrumenting the Xen Hypervisor
• File and process level provenance for virtual machines
• Intercept “hyper-calls” for file and process creation
• Store provenance in DB
References
M.I. Seltzer, P. Macko, and M.A. Chiarini.
Collecting provenance via the Xen hypervisor.
TaPP, 2011.
K.K. Muniswamy-Reddy, P. Macko, and M. Seltzer.
Provenance for the cloud.
FAST, 15–14, 2010.
Example - Sketching Distributed Provenance
• File + process level provenance modelled as DAG
• intra- and inter-host dependencies
• inter = socket communication
• Intercept system calls
• Distributed provenance graph storage
• Each node stores the part of the provenance graph corresponding to its local processing
• Links to other hosts for inter-host dependencies
• Query provenance
• Nodes exchange summaries of provenance graphs using Bloom filters
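A Bloom filter summary of a node's local provenance graph might look like the sketch below. The class, its parameters (1024 bits, 3 hash functions), and the node identifiers are assumptions made for illustration; the cited system's actual summary format is not described in the slides.

```python
import hashlib

class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array stored in a Python integer

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

# Hypothetical identifiers of the provenance graph nodes stored on one host.
local_graph_nodes = ["file:/data/in.csv", "proc:1234", "file:/data/out.csv"]
summary = BloomFilter()
for node in local_graph_nodes:
    summary.add(node)
```

A remote host holding `summary` can skip this host whenever `might_contain` returns False, and only needs to contact it to resolve a possible match.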
References
T. Malik, L. Nistor, and A. Gehani.
Tracking and sketching distributed data provenance.
eScience, 190–197, 2010.
Take-away Message
• Challenging problems
• Approaches that address distribution directly
• Approaches that map relational techniques to Big Data
• ⇒more work needed!
Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
Provenance as Data and Workloads
Provenance-based Performance Metrics
Profiling and Monitoring with Provenance
4 Conclusions
Why Benchmark?
Impress the Customer - Competition
• Competition should be fair
• Stable benchmark
• Similar to real world workloads
• Precise definitions
• ⇒probably several iterations of design + test
Understand (and Improve) System Performance
• Stable benchmark
• Similar to real world workloads
• Profiling and monitoring
Benchmarking 101
A Comprehensive Benchmark . . .
• Define Parameters
• Data-set specifications
• Structure and interrelationships
• Value ranges and distributions
• Data type definitions
• Provide data generator and/or validator
• Workload specification
• Specify jobs
• Restrictions for running the jobs
• Provide workload generator and/or validator
• Benchmark metrics specification
• What to measure
• How to measure
A Comprehensive Benchmark . . . TPC-H
• Define Parameters: e.g., SF
• Data-set specifications: Standard Specification
• Structure and interrelationships: Schema provided
• Value ranges and distributions: e.g., L_DISCOUNT column values between 0.00 and 1.00
• Data type definitions: e.g., date is YYYY-MM-DD, at least 14 years range
• Provide data generator and/or validator: dbgen
• Workload specification: Standard Specification
• Specify jobs: Query and update workloads
• Restrictions for running the jobs
• Provide workload generator and/or validator: qgen
• Benchmark metrics specification: Standard Specification
• What to measure: e.g., Composite Query-per-Hour Metric
• How to measure: e.g., first query char to last output char
Big Data Benchmarking
• Data generation is a necessity
• Shipping pre-generated datasets of Big Data dimensions is infeasible
• Complex and mixed workloads better match real-world use of Big Data analytics
• Hard to understand why a system performs badly/well on a complex workload
• Robustness of performance and scalability is important ⇒ Measure it!
• What is a good metric for robustness and how to compute it?
Generating Large Datasets and Workloads
• Provenance data is large
• Compute provenance to generate large datasets
• Run a provenance-aware system to collect provenance for a small workload
• The resulting data-set can be used in the benchmark
• ⇒Generate large data-sets from simple “generators”
Example
• Simple task: sharpen an image
• Apply an approach that instruments the program to detect data-dependencies as provenance
• Huge and very fine-grained provenance
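To see how quickly such provenance grows, consider a hand-written version of the image example. The slide describes instrumenting the program automatically; in this sketch the dependency tracking is coded by hand instead, and the kernel, image size, and function name are assumptions chosen for illustration.

```python
# A common 3x3 sharpening kernel (assumed here; the talk does not specify one).
SHARPEN = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]

def sharpen_with_provenance(image):
    """Sharpen a 2D list of pixels and record which input pixels each output used."""
    height, width = len(image), len(image[0])
    output = [[0] * width for _ in range(height)]
    provenance = {}
    for y in range(height):
        for x in range(width):
            total = 0
            sources = set()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    sy, sx = y + dy, x + dx
                    weight = SHARPEN[dy + 1][dx + 1]
                    # Pixels outside the image or with zero weight do not contribute.
                    if weight != 0 and 0 <= sy < height and 0 <= sx < width:
                        total += weight * image[sy][sx]
                        sources.add((sy, sx))
            output[y][x] = total
            provenance[(y, x)] = sources
    return output, provenance

image = [[10] * 4 for _ in range(4)]
output, provenance = sharpen_with_provenance(image)
num_dependencies = sum(len(sources) for sources in provenance.values())
```

Even this 16-pixel image produces 64 pixel-level dependencies, so the provenance is several times larger than the data it describes.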
Stress-Testing Exploitation of Data Commonalities
• Provenance data is large, but with large overlap
• ⇒ Use provenance to test how well a system is able to exploit data commonalities
Example
• Streaming data with multi-step windowed aggregation
• ⇒Can predict overlap in provenance
• Generate provenance
• Queries over provenance as workload
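Why the overlap is predictable can be shown with count-based sliding windows. The sketch below assumes windows defined by a size and a slide over input positions; the helper names and the chosen parameters are hypothetical.

```python
def window_provenance(num_inputs, size, slide):
    """Provenance of each window result as the set of input positions it covers."""
    return [
        set(range(start, start + size))
        for start in range(0, num_inputs - size + 1, slide)
    ]

def compose(first_level, second_level):
    """Express second-step provenance in terms of the original inputs."""
    return [
        set().union(*(first_level[i] for i in window))
        for window in second_level
    ]

# Step 1: windows of 4 tuples sliding by 2 over 20 input tuples.
step1 = window_provenance(num_inputs=20, size=4, slide=2)
# Step 2: windows of 3 step-1 results sliding by 1.
step2 = compose(step1, window_provenance(len(step1), size=3, slide=1))

overlap_step1 = len(step1[0] & step1[1])
overlap_step2 = len(step2[0] & step2[1])
```

Consecutive step-1 results share size minus slide = 2 inputs, and after the second aggregation step consecutive results share 6 of their 8 inputs, so the overlap follows directly from the window parameters.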
Data-centric Performance Information as Provenance
• Assume the hypothetical existence of god
• Big provenance system
• Efficiently computes all types of provenance
• For any Big Data system
• Also evaluates any type of query
Data-centric performance information as provenance
• For each data item use god to record
• execution times of each transformation in provenance
• history of data movements for each data item
• . . .
Example (Measure Robustness)
• Benchmark that requires several runs of mixed job workloads
• Workloads between runs overlap
• Performance metric: Resource consumption variation per job
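The slides do not define the metric precisely. One possible reading, sketched here as an assumption, is the coefficient of variation (standard deviation divided by mean) of a job's resource consumption across the runs in which it appears; the job names and numbers are made up.

```python
from statistics import mean, pstdev

def variation_per_job(runs):
    """Coefficient of variation per job, over jobs that occur in every run.

    Each run maps a job name to its resource consumption (e.g. CPU seconds).
    Assumes every job has a non-zero mean consumption.
    """
    shared_jobs = set.intersection(*(set(run) for run in runs))
    variation = {}
    for job in shared_jobs:
        samples = [run[job] for run in runs]
        variation[job] = pstdev(samples) / mean(samples)
    return variation

# Two runs with overlapping mixed workloads.
runs = [
    {"join": 10.0, "aggregate": 4.0},
    {"join": 10.0, "aggregate": 8.0, "sort": 3.0},
]
variation = variation_per_job(runs)
```

A robust system would keep this value close to zero for each job regardless of what else runs in the same workload; here "join" is stable while "aggregate" varies.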
Using Provenance in Profiling
Profiling and Monitoring
• Identify causes for (benchmark) results
• Even more important for Big Data
• Diverse workloads
• Distribution
• . . .
Example (How can Provenance help?)
• How do I profile a program on one node
• e.g., instrument and collect statistics about execution
• Provenance provides data-centric view
• Identify repeated computations
• Overlaps in provenance (same data, different nodes/transformations)
• Identify unnecessary computations
• Computations that are not in the fine-grained provenance of anything
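Finding unnecessary computations amounts to a backward traversal of the provenance graph from the final outputs. A minimal sketch, assuming the provenance is available as a mapping from each data item to the computation that produced it and that computation's inputs (the mapping and all names are hypothetical):

```python
def unused_computations(produced_by, final_outputs):
    """Return computations that appear in the provenance of no final output.

    produced_by maps a data item to (computation, list of input items).
    Items missing from produced_by are treated as base inputs.
    """
    needed = set()
    seen = set()
    stack = list(final_outputs)
    while stack:
        item = stack.pop()
        if item in seen or item not in produced_by:
            continue
        seen.add(item)
        computation, inputs = produced_by[item]
        needed.add(computation)
        stack.extend(inputs)
    all_computations = {computation for computation, _ in produced_by.values()}
    return all_computations - needed

produced_by = {
    "clean": ("parse", ["raw"]),
    "stats": ("aggregate", ["clean"]),
    "thumbnail": ("resize", ["raw"]),
    "report": ("render", ["stats"]),
}
unused = unused_computations(produced_by, final_outputs=["report"])
```

Only "resize" is flagged, because its output never contributes to the report.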
Outline
1 Provenance
2 Big Data Provenance - Big Provenance
3 Implications for Benchmarking
4 Conclusions
Conclusions
• Big Provenance: Many interesting research problems
• Provenance as benchmark use-case
• Provenance for benchmark data generation
• Provenance-based performance metrics
• Supporting profiling and monitoring
Questions?
Info
• Boris Glavic: http://www.cs.iit.edu/~glavic/
• IIT DBGroup: http://www.cs.iit.edu/~dbgroup/
• Perm: http://permdbms.sourceforge.net/
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"
 
TaPP 2013 - Provenance for Data Mining
TaPP 2013 - Provenance for Data MiningTaPP 2013 - Provenance for Data Mining
TaPP 2013 - Provenance for Data Mining
 
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
TaPP 2014 Talk Boris - A Generic Provenance Middleware for Database Queries, ...
 
Ipaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, IanIpaw14 presentation Quan, Tanu, Ian
Ipaw14 presentation Quan, Tanu, Ian
 

WBDB 2012 - "Big Data Provenance: Challenges and Implications for Benchmarking"

  • 1. Big Data Provenance - Challenges and Implications for Benchmarking Boris Glavic IIT DBGroup December 17, 2012
  • 2. Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking 4 Conclusions
  • 3. Disclaimer • My previous benchmark experience • Limited to being a user • My main background is in provenance • . . . mostly for RDBMS • The main point of this talk (visions) • Provenance as a benchmark use-case • Provenance as a supporting technology for benchmarking • I will make some outrageous claims to get my point across Slide 1 of 23 Boris Glavic Big Data Provenance - Provenance
  • 4. What is Provenance? • Information about the creation process and origin of data • Data items and collections • Data item = atomic unit of data • Transformation • Abstraction for processing • Inputs and outputs are data items (collections) Example
  • 5. Types of Provenance • Given a data item d Data Provenance • Which data items were used in the generation of d • Representation: e.g., a set of data items Transformation Provenance • Which transformations generated d • transitively Additional Information • Execution Environment • User • Time • . . .
  • 6. Data and Transformation Provenance Example (Data Provenance)
  • 7. Data and Transformation Provenance Example (Transformation Provenance)
  • 8. Coarse-grained vs. Fine-grained Data Provenance Coarse-grained Provenance • Transformations are handled as black-boxes • ⇒ Each output of a transformation • . . . depends on all inputs Fine-grained Provenance • Consider processing logic of transformation to determine data-flow • ⇒ All data items that contributed to a result • Sufficiency: sufficient to derive d through transformation • Necessity: necessary in deriving d through transformation
  • 9. Granularity Example (Coarse-grained)
  • 10. Granularity Example (Fine-grained)
  • 11. Use-cases • Data debugging • E.g., tracing an erroneous data item back to erroneous inputs • Trust and Quality • E.g., computing trust based on trust in data items in provenance • Probabilistic data • Computing probability of result based on probabilities in provenance • Security • Enforce access-control on query results based on provenance • Understanding misbehaviour of systems • E.g., detecting security breaches
  • 12. Annotations Provenance as Annotations • Model provenance as annotations on the data • ⇒ Provenance management = • Efficient storage of annotations • Propagating annotations through transformations Example (Systems applying this approach) • Perm • Orchestra • DBNotes • Trio • . . . and many more
  • 13. Annotations References Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan, and Gaurav Vijayvargiya. An Annotation Management System for Relational Databases. VLDB Journal, 14(4):373–396, 2005. Jennifer Widom. Trio: A System for Managing Data, Uncertainty, and Lineage. Managing and Mining Uncertain Data, 2008. Boris Glavic and Gustavo Alonso. Perm: Processing Provenance and Data on the same Data Model through Query Rewriting. ICDE, 174–185, 2009. T.J. Green, G. Karvounarakis, Z.G. Ives, and V. Tannen. Provenance in ORCHESTRA. IEEE Data Eng. Bull., 33(3):9–16, 2010.
  • 14. Size Considerations • Provenance models dependencies between inputs and outputs of a transformation • ⇒ A subset of Inputs × Outputs • ⇒ Can be quadratic in size • DAG of transformations • multiply by maximal path length • Provenance of two data items often overlaps • Reuse of intermediate results • Coarse-grained provenance
  • 15. Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking 4 Conclusions
  • 16. Big Provenance Challenges • Big data analytics is all about agility • ⇒ un/semi-structured data and no extensive meta-data available • ⇒ hard to define provenance model • Transparency of distribution • Provenance use-case may require location information • Analytics that cross system boundaries • Inter-operability of storage formats and systems
  • 17. Example - Provenance for Map-Reduce Workflows • Fine-grained provenance for workflows that are DAGs of map and reduce functions • Add provenance to values: (key, value) → (key, (value, p)) • Provenance determined by map/reduce function semantics • Wrap map and reduce functions to handle provenance • Strip off and cache provenance • Call original function • Re-attach provenance according to function semantics References R. Ikeda, H. Park, and J. Widom. Provenance for generalized map and reduce workflows. CIDR, 273–283, 2011.
  • 18. Example - OS Provenance for the Cloud PASS with Cloud Storage Backend • File and process level provenance • Each node runs a modified Linux kernel (PASS) • Intercept system calls for file and process creation • Store provenance in Amazon S3, SimpleDB, and/or SQS Instrumenting the Xen Hypervisor • File and process level provenance for virtual machines • Intercept “hyper-calls” for file and process creation • Store provenance in DB
  • 19. Example - OS Provenance for the Cloud References M.I. Seltzer, P. Macko, and M.A. Chiarini. Collecting provenance via the Xen hypervisor. TaPP, 2011. K.K. Muniswamy-Reddy, P. Macko, and M. Seltzer. Provenance for the cloud. FAST, 15–14, 2010.
  • 20. Example - Sketching Distributed Provenance • File + process level provenance modelled as DAG • intra- and inter-host dependencies • inter = socket communication • Intercept system calls • Distributed provenance graph storage • Each node stores the part of the provenance graph corresponding to its local processing • Links to other hosts for inter-host dependencies • Query provenance • Nodes exchange summaries of provenance graphs using Bloom filters References T. Malik, L. Nistor, and A. Gehani. Tracking and sketching distributed data provenance. eScience, 190–197, 2010.
  • 21. Take-away Message • Challenging problems • Approaches that address distribution directly • Approaches that map relational techniques to Big Data • ⇒ more work needed!
  • 22. Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking Provenance as Data and Workloads Provenance-based Performance Metrics Profiling and Monitoring with Provenance 4 Conclusions
  • 23. Why Benchmark? Impress the Customer - Competition • Competition should be fair • Stable benchmark • Similar to real world workloads • Precise definitions • ⇒ probably several iterations of design + test Understand (and Improve) System Performance • Stable benchmark • Similar to real world workloads • Profiling and monitoring
  • 24. Benchmarking 101 A Comprehensive Benchmark . . . • Define Parameters • Data-set specifications • Structure and interrelationships • Value ranges and distributions • Data type definitions • Provide data generator and/or validator • Workload specification • Specify jobs • Restrictions for running the jobs • Provide workload generator and/or validator • Benchmark metrics specification • What to measure • How to measure
  • 25. Benchmarking 101 A Comprehensive Benchmark . . . (TPC-H counterpart after each colon) • Define Parameters: e.g., SF • Data-set specifications: Standard Specification • Structure and interrelationships: Schema provided • Value ranges and distributions: e.g., L_DISCOUNT column values between 0.00 and 1.00 • Data type definitions: e.g., date is YYYY-MM-DD, at least 14 years range • Provide data generator and/or validator: dbgen • Workload specification: Standard Specification • Specify jobs: Query and update workloads • Restrictions for running the jobs • Provide workload generator and/or validator: qgen • Benchmark metrics specification: Standard Specification • What to measure: e.g., Composite Query-per-Hour Metric • How to measure: e.g., first query char to last output char
  • 26. Big Data Benchmarking • Data generation is a necessity • Shipping pre-generated datasets of Big Data dimensions is infeasible • Complex and mixed workloads better match real-world use of Big Data analytics • Hard to understand why a system performs badly/well on a complex workload • Robustness of performance and scalability important ⇒ Measure it! • What is a good metric for robustness and how to compute it?
  • 27. Generating Large Datasets and Workloads • Provenance data is large • Compute provenance to generate large datasets • Run provenance-aware system to collect provenance for a small workload • The resulting data-set can be used in the benchmark • ⇒ Generate large data-sets from simple “generators” Example • Simple task: sharpen an image • Apply approach that instruments the program to detect data-dependencies as provenance • Huge and very fine-granular provenance
  • 28. Stress-Testing Exploitation of Data Commonalities • Provenance data large, but large overlap • ⇒ Use provenance to test how well a system is able to exploit data commonalities Example • Streaming data with multi-step windowed aggregation • ⇒ Can predict overlap in provenance • Generate provenance • Queries over provenance as workload
  • 29. Data-centric Performance Information as Provenance • Assume the hypothetical existence of god • Big provenance system • Efficiently computes all types of provenance • For any Big Data system • Also evaluates any type of query Data-centric performance information as provenance • For each data item use god to record • execution times of each transformation in provenance • history of data movements for each data item • . . . Example (Measure Robustness) • Benchmark that requires several runs of mixed job workloads • Workloads between runs overlap • Performance metric: Resource consumption variation per job
  • 30. Using Provenance in Profiling Profiling and Monitoring • Identify causes for (benchmark) results • Even more important for Big Data • Diverse workloads • Distribution • . . . Example (How can Provenance help?) • How do I profile a program on one node? • e.g., instrument and collect statistics about execution • Provenance provides data-centric view • Identify repeated computations • Overlaps in provenance (same data, different nodes/transformations) • Identify unnecessary computations • Computations that are not in the fine-grained provenance of anything
  • 31. Outline 1 Provenance 2 Big Data Provenance - Big Provenance 3 Implications for Benchmarking 4 Conclusions
  • 32. Conclusions • Big Provenance: Many interesting research problems • Provenance as benchmark use-case • Provenance for benchmark data generation • Provenance-based performance metrics • Supporting profiling and monitoring
  • 33. Questions? Info • Boris Glavic: http://www.cs.iit.edu/~glavic/ • IIT DBGroup: http://www.cs.iit.edu/~dbgroup/ • Perm: http://permdbms.sourceforge.net/
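
The wrapping scheme from slide 17 (provenance for map-reduce workflows) can be sketched roughly as follows. This is a minimal, single-process Python illustration and not the implementation described by Ikeda et al.: the names wrap_map, wrap_reduce, and run_job, the choice of sets of input identifiers as the provenance p, and the word-count job are all assumptions made for the example. A map output inherits the provenance of its input record; a reduce output gets the union of the provenance of all values in its group.

```python
from collections import defaultdict

def wrap_map(map_fn):
    """Wrap a map function so it accepts and emits (key, (value, p)) pairs."""
    def wrapped(key, annotated):
        value, prov = annotated                      # strip off and cache provenance
        for out_key, out_value in map_fn(key, value):  # call the original function
            yield out_key, (out_value, prov)         # re-attach: one input -> same p
    return wrapped

def wrap_reduce(reduce_fn):
    """Wrap a reduce function; output provenance is the union over the group."""
    def wrapped(key, annotated_values):
        annotated_values = list(annotated_values)
        values = [v for v, _ in annotated_values]    # strip off provenance
        prov = frozenset().union(*(p for _, p in annotated_values))
        for out_key, out_value in reduce_fn(key, values):
            yield out_key, (out_value, prov)         # re-attach the merged provenance
    return wrapped

def run_job(records, map_fn, reduce_fn):
    """Toy in-memory map -> shuffle -> reduce driver (hypothetical helper)."""
    groups = defaultdict(list)
    for key, annotated in records:
        for out_key, out_annotated in map_fn(key, annotated):
            groups[out_key].append(out_annotated)
    result = {}
    for key, annotated_values in groups.items():
        for out_key, out_annotated in reduce_fn(key, annotated_values):
            result[out_key] = out_annotated
    return result

# Unmodified user functions: a plain word count.
def tokenize(doc_id, text):
    for word in text.split():
        yield word, 1

def count(word, ones):
    yield word, sum(ones)

# Each input value is annotated with its own identifier as initial provenance.
docs = [
    ("d1", ("big data", frozenset({"d1"}))),
    ("d2", ("big provenance", frozenset({"d2"}))),
]
counts = run_job(docs, wrap_map(tokenize), wrap_reduce(count))
```

Here counts maps each word to a pair of its count and the set of documents that contributed to it, while tokenize and count themselves never see the provenance.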