A talk given at the DATAPLAT workshop, co-located with the IEEE ICDE conference (May 2024, Utrecht, NL).
Data Provenance for Data Science is our attempt to provide a foundation to add explainability to data-centric AI.
It is a prototype, with lots of work still to do.
Towards an Incremental Schema-level Index for Distributed Linked Open Data G... (Till Blume)
Semi-structured, schema-free data formats are used in many applications because their flexibility enables simple data exchange. Especially graph data formats like RDF have become well established in the Web of Data. For the Web of Data, it is known that data instances are not only added, changed, and removed regularly, but that their schemas are also subject to enormous changes over time. Unfortunately, the collection, indexing, and analysis of the evolution of data schemas on the web is still in its infancy. To enable a detailed analysis of the evolution of Linked Open Data, we lay the foundation for the implementation of incremental schema-level indices for the Web of Data. Unlike existing schema-level indices, incremental schema-level indices have an efficient update mechanism to avoid costly recomputations of the entire index. This enables us to monitor changes to data instances at schema-level, trace changes, and ultimately provide an always up-to-date schema-level index for the Web of Data. In this paper, we analyze in detail the challenges of updating arbitrary schema-level indices for the Web of Data. To this end, we extend our previously developed meta model FLuID. In addition, we outline an algorithm for performing the updates.
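The core mechanism is easy to sketch: a schema-level index groups instances by a schema signature (in the simplest case, the set of property labels an instance carries), and an incremental update moves an instance between schema elements rather than recomputing the whole index. The Python sketch below illustrates only this simplified idea; the signature function and update rules are assumptions for illustration, not the FLuID meta model itself.

```python
from collections import defaultdict

class SchemaLevelIndex:
    """Maps a schema signature (here: a frozenset of property labels)
    to the set of subjects exhibiting that schema."""

    def __init__(self):
        self.properties = defaultdict(set)   # subject -> its property labels
        self.index = defaultdict(set)        # signature -> subjects

    def _signature(self, subject):
        return frozenset(self.properties[subject])

    def _detach(self, subject):
        # Remove the subject from its current schema element,
        # dropping the element if it becomes empty.
        sig = self._signature(subject)
        self.index[sig].discard(subject)
        if not self.index[sig]:
            del self.index[sig]

    def add_triple(self, subject, predicate, obj):
        # Incremental update: move the subject to its new schema
        # element instead of rebuilding the index from scratch.
        self._detach(subject)
        self.properties[subject].add(predicate)
        self.index[self._signature(subject)].add(subject)

    def remove_triple(self, subject, predicate, obj):
        self._detach(subject)
        self.properties[subject].discard(predicate)
        self.index[self._signature(subject)].add(subject)

idx = SchemaLevelIndex()
idx.add_triple("ex:alice", "foaf:name", '"Alice"')
idx.add_triple("ex:alice", "foaf:knows", "ex:bob")
idx.add_triple("ex:bob", "foaf:name", '"Bob"')
print(dict(idx.index))   # two schema elements: {name, knows} and {name}
```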
Evaluating Machine Learning Algorithms for Materials Science using the Matben... (Anubhav Jain)
1) The document discusses evaluating machine learning algorithms for materials science using the Matbench protocol.
2) Matbench provides standardized datasets, testing procedures, and an online leaderboard to benchmark and compare machine learning performance.
3) This allows different groups to evaluate algorithms independently and identify best practices for materials science predictions.
The document describes the automated construction of a large semantic network called SemNet. It analyzes a large text corpus to extract terms and relations using n-gram analysis, part-of-speech tagging, and pattern matching. SemNet contains over 2.7 million terms and 37.5 million relations. The document evaluates SemNet by comparing it to WordNet and ConceptNet, finding that it contains over 77% of WordNet synsets and over 82% of ConceptNet nouns.
RAMSES: Robust Analytic Models for Science at Extreme Scales (Ian Foster)
This document discusses the RAMSES project, which aims to develop a new science of end-to-end analytical performance modeling of science workflows in extreme-scale science environments. The RAMSES research agenda involves developing component and end-to-end models, tools to provide performance advice, data-driven estimation methods, automated experiments, and a performance database. The models will be evaluated using five challenge workflows: high-performance file transfer, diffuse scattering experimental data analysis, data-intensive distributed analytics, exascale application kernels, and in-situ analysis placement.
1. The document discusses Discngine's Tibco Spotfire Pipeline Pilot connector, which allows graphs stored in Pipeline Pilot to be accessed and visualized in Spotfire.
2. It describes the architecture of the connector and how it executes Pipeline Pilot protocols to generate HTML pages for visualization in Spotfire.
3. Challenges in integrating the large Spotfire API and synchronizing client and server datasets are also discussed.
The document discusses accelerating science discovery with AI inference-as-a-service. It describes showcases using this approach for high energy physics and gravitational wave experiments. It outlines the vision of the A3D3 institute to unite domain scientists, computer scientists, and engineers to achieve real-time AI and transform science. Examples are provided of using AI inference-as-a-service to accelerate workflows for CMS, ProtoDUNE, LIGO, and other experiments.
Provenance for Data Munging Environments (Paul Groth)
Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e. its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
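To make "fine-grained capture" concrete, here is a minimal, hypothetical sketch of the pattern: every munging step is wrapped so that each invocation records its inputs, parameters, and output as a provenance record, loosely following PROV's entity/activity vocabulary. The wrapper, the record fields, and the example steps are invented for illustration; they are not the systems described in the talk.

```python
import hashlib
import json
import time

PROV_LOG = []   # in-memory provenance store; a real system would persist this

def digest(obj):
    """Content-based identifier for a data artifact."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:12]

def tracked(func):
    """Wrap a munging step so each call emits a PROV-like record:
    used (inputs), generated (output), plus activity metadata."""
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROV_LOG.append({
            "activity": func.__name__,
            "used": [digest(a) for a in args],
            "parameters": kwargs,
            "generated": digest(result),
            "endedAt": time.time(),
        })
        return result
    return wrapper

@tracked
def drop_nulls(rows, column=None):
    return [r for r in rows if r.get(column) is not None]

@tracked
def normalize_names(rows):
    return [{**r, "name": r["name"].strip().title()} for r in rows]

raw = [{"name": " alice "}, {"name": None}, {"name": "BOB"}]
clean = normalize_names(drop_nulls(raw, column="name"))
print(json.dumps(PROV_LOG, indent=2))   # how `clean` was actually produced
```

Chaining the content digests across records reconstructs the derivation graph of the cleaned dataset, which is exactly the question provenance is meant to answer.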
Presented at Information Sciences Institute - August 13, 2015
Presentation by Steffen Zeuch, Researcher at German Research Center for Artificial Intelligence (DFKI) and Post-Doc at TU Berlin (Germany), at the FogGuru Boot Camp training in September 2018.
Towards explanations for Data-Centric AI using provenance records (Paolo Missier)
In this presentation, given to graduate students at Università Roma Tre, Italy, we suggest that concepts well-known in Data Provenance can be exploited to provide explanations in the context of data-centric AI processes. Through use cases (incremental data cleaning, training set pruning), we build up increasingly complex provenance patterns, culminating in an open question:
how to describe "why" a specific data item has been manipulated as part of data processing, when such processing may consist of a complex data transformation algorithm.
Data Lineage, Property Based Testing & Neo4j (Neo4j)
Neo4j can be used to create data lineage graphs that track how data changes through transformations and processes. These graphs make it easy to identify errors, evaluate impacts, and improve communication by showing the relationships between data. More than 90% of algorithms can be modeled as graphs, and graphs are well-suited to hosting process chains with relationships that can be filtered. The document demonstrates how Neo4j can be used to build a data lineage graph from metadata and answer questions about where restricted data became exposed.
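For flavour, here is a small sketch of such a lineage graph using the official Neo4j Python driver (v5 API). The node labels, relationship types, and the restricted flag are assumptions made for illustration, not the schema from the talk; the final query asks where restricted data flows into a non-restricted dataset.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def load_lineage(tx):
    # Datasets and transformations are nodes; FEEDS/PRODUCES edges
    # say which transformation reads and writes which dataset.
    tx.run("""
        MERGE (src:Dataset {name: 'customers_raw', restricted: true})
        MERGE (t:Transformation {name: 'anonymize'})
        MERGE (out:Dataset {name: 'customers_public', restricted: false})
        MERGE (src)-[:FEEDS]->(t)-[:PRODUCES]->(out)
    """)

def exposed_restricted_data(tx):
    # Where does restricted data reach a non-restricted dataset?
    result = tx.run("""
        MATCH (src:Dataset {restricted: true})-[:FEEDS|PRODUCES*]->(d:Dataset)
        WHERE d.restricted = false
        RETURN src.name AS source, d.name AS exposed_at
    """)
    return [record.data() for record in result]

with driver.session() as session:
    session.execute_write(load_lineage)
    print(session.execute_read(exposed_restricted_data))
driver.close()
```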
This document discusses developing analytics applications using machine learning on Azure Databricks and Apache Spark. It begins with an introduction to Richard Garris and the agenda. It then covers the data science lifecycle including data ingestion, understanding, modeling, and integrating models into applications. Finally, it demonstrates end-to-end examples of predicting power output, scoring leads, and predicting ratings from reviews.
詹剑锋: BigDataBench—benchmarking big data systems (hdhappy001)
This document discusses BigDataBench, an open source project for big data benchmarking. BigDataBench includes six real-world data sets and 19 workloads that cover common big data applications and preserve the four V's of big data. The workloads were chosen to represent typical application domains like search engines, social networks, and e-commerce. BigDataBench aims to provide a standardized benchmark for evaluating big data systems, architectures, and software stacks. It has been used in several case studies for workload characterization and performance evaluation of different hardware platforms for big data workloads.
Visual data mining combines traditional data mining methods with information visualization techniques to explore large datasets. There are three levels of integration between visualization and automated mining methods: no/limited integration, loose integration where methods are applied sequentially, and full integration where methods are applied in parallel. Different visualization methods exist for univariate, bivariate and multivariate data, depending on the type and dimensions of the data. The document describes frameworks and algorithms for visual data mining, including developing new algorithms interactively through a visual interface, and covers the use of data mining and visualization techniques for selective visualization of large spatial datasets.
The document discusses Pentaho Data Integration (Kettle), an open-source ETL tool. It describes Kettle's components like Spoon for designing transformations and jobs, Pan and Kitchen for executing them. It covers extracting, transforming and loading data, transformation steps, job entries, and how Kettle is used to integrate data from various sources into data warehouses.
TensorFlow 16: Building a Data Science Platform (Seldon)
1. The document discusses building a data science platform on DC/OS to operationalize machine learning models. It outlines challenges at each stage of the ML pipeline and how DC/OS addresses them with distributed computing capabilities and services for data storage, processing, model training and deployment.
2. Key stages covered include data preparation, distributed training using frameworks like TensorFlow, model management with storage of trained models, and low-latency model serving for production with TensorFlow Serving.
3. DC/OS provides a full-stack platform to operationalize ML at scale through distributed computing resources, container orchestration, and integration of open source data and ML services.
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg (Spark Summit)
This document summarizes an online machine learning framework, built on Structured Streaming, that is being developed for Apache Spark. Some key points, with a plain-Python sketch of the update-and-merge idea after the list:
- It allows machine learning algorithms to be applied continuously to streaming data and update models incrementally in an online fashion.
- Models are updated every time interval (e.g. every second) based on new data within that interval. This provides an approximation of processing all data to date.
- It uses a stateful aggregation approach to allow models to be updated and merged across distributed partitions in a way that is deterministic but not necessarily commutative.
- APIs are provided for common online learning algorithms like online logistic regression and gradient descent to interface with streaming data sources and sinks.
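The update-and-merge idea can be sketched without any Spark machinery: each partition runs an online SGD pass over its slice of the interval's data, and the partial models are merged by a count-weighted average. This is a plain-Python illustration of the concept under simplifying assumptions, not the proposed Spark API.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1):
    """One online logistic-regression SGD pass over a partition's
    slice of the current interval's data."""
    w = weights.copy()
    for xi, yi in zip(X, y):
        pred = 1.0 / (1.0 + np.exp(-w @ xi))
        w += lr * (yi - pred) * xi
    return w, len(X)

def merge(partials):
    """Merge partial models from all partitions by a count-weighted
    average. The result depends on how the data was partitioned, so
    the scheme is deterministic for a given plan even though the
    per-example updates themselves are not commutative."""
    total = sum(n for _, n in partials)
    return sum(w * (n / total) for w, n in partials)

rng = np.random.default_rng(0)
w = np.zeros(3)
for interval in range(5):                 # one model update per interval
    X = rng.normal(size=(100, 3))
    y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
    partials = [local_update(w, X[i::4], y[i::4]) for i in range(4)]  # 4 "partitions"
    w = merge(partials)
print(w)   # weights drift toward the true separating direction
```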
4 TeraGrid sites have focal points:
- SDSC – The Data Place: large-scale, high-performance data analysis and handling; every cluster node is directly attached to the SAN
- NCSA – The Compute Place: large-scale, large-FLOPS computation
- Argonne – The Viz Place: scalable viz walls
- Caltech – The Applications Place: data and FLOPS for applications, especially some of the GriPhyN apps
Specific machine configurations reflect these focal points.
The document proposes JECB, a new approach to partitioning OLTP databases across multiple nodes. JECB uses join paths between tables defined by foreign key relationships to determine how to partition tables. It analyzes transaction SQL code to identify optimal join trees for partitioning. JECB partitions individual transaction types separately then combines results. Experiments show JECB partitions TPC-C and TPC-E databases with fewer distributed transactions than prior approaches, demonstrating it handles complex workloads better than previous statistics-based or schema-based partitioning methods.
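The join-path intuition behind such approaches is simple to sketch: tables connected by foreign keys can be co-partitioned on a shared root key, so a transaction that follows those join paths stays on one node. The sketch below is a generic illustration with invented table names, not JECB's actual algorithm.

```python
from collections import defaultdict

foreign_keys = [                  # (child_table, parent_table)
    ("orders", "customers"),
    ("order_lines", "orders"),
    ("payments", "customers"),
    ("items", "suppliers"),       # not reachable from customers
]

def co_partition_group(root):
    """All tables reachable from `root` by following FK edges downwards;
    they can all be partitioned on the root's key."""
    children = defaultdict(list)
    for child, parent in foreign_keys:
        children[parent].append(child)
    group, frontier = {root}, [root]
    while frontier:
        for child in children[frontier.pop()]:
            if child not in group:
                group.add(child)
                frontier.append(child)
    return group

# A transaction joining customers -> orders -> order_lines then touches
# a single partition, avoiding a distributed transaction.
print(co_partition_group("customers"))
# -> {'customers', 'orders', 'order_lines', 'payments'}
```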
Microservices, containers, and machine learning (Paco Nathan)
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. A natural language processing service in Python (based on NLTK, TextBlob, WordNet, etc.) is containerized and used to crawl and parse email archives. This produces JSON data sets, on which we run machine learning on a Spark cluster to find insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations of two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
Machine Learning (ML) models are often composed as pipelines of operators, from "classical" ML operators to pre-processing and featurization operators. Current systems deploy pipelines as "black boxes", where the same implementation of training is run for inference. This solution is convenient but leaves substantial room to improve performance and resource usage. This talk presents Pretzel, a framework for deploying ML pipelines that is inspired by database systems: Pretzel inspects and optimizes pipelines end-to-end much like queries, and manages resources common to multiple pipelines, such as operators' state. Pretzel is joint work with the University of Seoul and Microsoft Research and was recently presented at OSDI '18. After the overview, this talk also shows experimental results of Pretzel against state-of-the-art ML solutions and discusses limitations and extensions.
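A toy example conveys the white-box intuition: once deployment can see inside pipelines, two pipelines that reference the same featurizer can share one copy of its (potentially large) state, much as a database shares common sub-plans across queries. The classes and the vocabulary stand-in below are invented and simplify Pretzel's actual design considerably.

```python
class SharedVocabulary:
    """Operator state (e.g. a large vocabulary) loaded once and
    shared by every pipeline that references the same path."""
    _cache = {}

    @classmethod
    def load(cls, path):
        if path not in cls._cache:
            print(f"loading {path} once")
            cls._cache[path] = {"hello": 0, "world": 1}   # stand-in for real state
        return cls._cache[path]

class Pipeline:
    def __init__(self, vocab_path, weights):
        self.vocab = SharedVocabulary.load(vocab_path)    # shared, not copied
        self.weights = weights                            # pipeline-specific state

    def predict(self, text):
        return sum(self.weights[self.vocab[t]]
                   for t in text.split() if t in self.vocab)

p1 = Pipeline("vocab.bin", [0.5, -0.2])   # "vocab.bin" is a hypothetical path
p2 = Pipeline("vocab.bin", [0.1, 0.9])    # second pipeline: no reload happens
print(p1.predict("hello world"), p2.predict("hello world"))
```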
The document summarizes research done at the Barcelona Supercomputing Center on evaluating Hadoop platforms as a service (PaaS) compared to infrastructure as a service (IaaS). Key findings include:
- Provider (Azure HDInsight, Rackspace CBD, etc.) did not significantly impact performance of wordcount and terasort benchmarks.
- Data size and number of datanodes were more important factors, with diminishing returns on performance from adding more nodes.
- PaaS can save on maintenance costs compared to IaaS but may be more expensive depending on workload and VM size needed. Tuning may still be required with PaaS.
This thesis focuses on performance management techniques for cloud services. It presents work in three key areas: 1) Developing a scalable and generic resource allocation protocol for large cloud environments. 2) Building performance models to predict response times and capacity for a distributed key-value store. 3) Enabling real-time prediction of service metrics using analytics on low-level system statistics. The thesis contributes solutions for these challenging problems and identifies open questions around decentralized resource allocation, online performance management, and analytics-based forecasting at large scales.
Modelling Multi-Component Predictive Systems as Petri Nets (Manuel Martín)
Building reliable data-driven predictive systems requires a considerable amount of human effort, especially in the data preparation and cleaning phase. In many application domains, multiple preprocessing steps need to be applied in sequence, constituting a 'workflow' and facilitating reproducibility. The concatenation of such a workflow with a predictive model forms a Multi-Component Predictive System (MCPS). Automatic MCPS composition can speed up this process by taking the human out of the loop, at the cost of model transparency (i.e. not being comprehensible by human experts). In this paper, we adopt and suitably re-define the Well-handled with Regular Iterations Work Flow (WRI-WF) Petri nets to represent MCPSs. The use of such WRI-WF nets helps to increase the transparency of MCPSs required in industrial applications and makes it possible to automatically verify the composed workflows. We also present our experience and results of applying this representation to model soft sensors in chemical production plants.
Optimization of Continuous Queries in Federated Database and Stream Processin... (Zbigniew Jerzak)
The constantly increasing number of connected devices and sensors results in increasing volume and velocity of sensor-based streaming data. Traditional approaches for processing high-velocity sensor data rely on stream processing engines. However, the increasing complexity of continuous queries executed on top of high-velocity data has resulted in growing demand for federated systems composed of data stream processing engines and database engines. One of the major challenges for such systems is to devise the optimal query execution plan to maximize the throughput of continuous queries.
In this paper we present a general framework for federated database and stream processing systems, and introduce the design and implementation of a cost-based optimizer for optimizing relational continuous queries in such systems. Our optimizer uses characteristics of continuous queries and source data streams to devise an optimal placement for each operator of a continuous query. This fine level of optimization, combined with the estimation of the feasibility of query plans, allows our optimizer to devise query plans which result in 8 times higher throughput as compared to the baseline approach which uses only stream processing engines. Moreover, our experimental results showed that even for simple queries, a hybrid execution plan can result in 4 times and 1.6 times higher throughput than a pure stream processing engine plan and a pure database engine plan, respectively.
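The optimizer's core decision can be sketched as a small search problem: for each operator of a continuous query, choose either the stream engine or the database engine, penalize plans that ship tuples across the engine boundary, and discard infeasible placements. The operators and cost numbers below are invented; a real optimizer derives them from query characteristics and stream statistics.

```python
from itertools import product

operators = ["filter", "window_agg", "join_with_history"]

# cost[engine][op]: estimated per-tuple cost; None = unsupported there
cost = {
    "stream": {"filter": 1.0, "window_agg": 2.0, "join_with_history": 9.0},
    "db":     {"filter": 1.5, "window_agg": None, "join_with_history": 2.5},
}
TRANSFER = 1.2   # penalty for shipping tuples between the two engines

def plan_cost(plan):
    total = 0.0
    for i, (op, engine) in enumerate(zip(operators, plan)):
        c = cost[engine][op]
        if c is None:
            return float("inf")          # infeasible placement
        total += c
        if i > 0 and plan[i - 1] != engine:
            total += TRANSFER            # engine boundary crossed
    return total

best = min(product(["stream", "db"], repeat=len(operators)), key=plan_cost)
print(best, plan_cost(best))   # a hybrid plan beats both single-engine plans
```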
Data Science as a Service: Intersection of Cloud Computing and Data Science (Pouria Amirian)
Dr. Pouria Amirian explains data science and the steps in a data science workflow, and shows some experiments in AzureML. He also discusses big data issues in a data science project and solutions to them.
Data Science as a Service: Intersection of Cloud Computing and Data Science (Pouria Amirian)
Dr. Pouria Amirian from the University of Oxford explains Data Science and its relationship with Big Data and Cloud Computing. Then he illustrates using AzureML to perform a simple data science analysis.
ReComp, the complete story: an invited talk at Cardiff University (Paolo Missier)
The document describes the ReComp framework for efficiently recomputing analytics processes when changes occur. ReComp uses provenance data from past executions to estimate the impact of changes and selectively re-execute only affected parts of processes. It identifies changes, computes data differences, and estimates impacts on past outputs to determine the minimum re-executions needed. For genomic analysis workflows, ReComp reduced re-executions from 495 to 71 by caching intermediate data and re-running only impacted fragments. The framework is customizable via difference and impact functions tailored to specific applications and data types.
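The selective re-execution pattern is compact enough to sketch: a type-specific difference function identifies what changed in an input, and the provenance of the past execution tells us which process fragments used the changed items; only those fragments are re-run. The functions and the toy reference database below are illustrative assumptions, not ReComp's implementation.

```python
def diff_reference_db(old, new):
    """Type-specific difference function: which entries changed."""
    return {k for k in new if old.get(k) != new.get(k)}

def impacted_fragments(provenance, changed_keys):
    """Provenance maps each cached fragment to the input keys it used."""
    return [f for f, used in provenance.items() if used & changed_keys]

provenance = {                   # recorded during the original run
    "annotate_gene_A": {"geneA"},
    "annotate_gene_B": {"geneB"},
    "report": {"geneA", "geneB"},
}
old_db = {"geneA": "benign", "geneB": "pathogenic"}
new_db = {"geneA": "benign", "geneB": "VUS"}       # one entry changed

changed = diff_reference_db(old_db, new_db)
print(impacted_fragments(provenance, changed))
# -> ['annotate_gene_B', 'report']; annotate_gene_A is reused from cache
```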
Interpretable and robust hospital readmission predictions from Electronic Hea... (Paolo Missier)
A talk given at the BDA4HM workshop, IEEE BigData conference, Dec. 2023
Please see the paper here:
https://drive.google.com/file/d/1vN08G0FWxOSH1Yeak5AX6a0sr5-EBbAt/view
More related content
Data-centric AI and the convergence of data and model engineering: opportunit... (Paolo Missier)
A keynote talk given to the IDEAL 2023 conference (Evora, Portugal, Nov 23, 2023).
Abstract.
The past few years have seen the emergence of what the AI community calls "Data-centric AI", namely the recognition that some of the limiting factors in AI performance lie in the data used for training the models as much as in the expressiveness and complexity of the models themselves. One analogy is that of a powerful engine that will only run as fast as the quality of the fuel allows. A plethora of recent literature has begun exploring the connection between data and models in depth, alongside startups that offer "data engineering for AI" services. Some concepts are well-known to the data engineering community, including incremental data cleaning, multi-source integration, and data bias control; others are more specific to AI applications, for instance the realisation that some samples in the training space are "easier to learn from" than others. In this "position talk" I will suggest that, from an infrastructure perspective, there is an opportunity to efficiently support patterns of complex pipelines where data and model improvements are entangled in a series of iterations. I will focus in particular on end-to-end tracking of data and model versions, as a way to support MLDev and MLOps engineers as they navigate through a complex decision space.
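One way to picture the end-to-end tracking suggested here: give every dataset version and model version a content-derived identifier, and record for each training run which data version produced which model version. The sketch below is a minimal illustration of that pattern with invented names, not a description of an existing tool.

```python
import hashlib
import json

LINEAGE = []   # (data version, model version, step) records

def version_of(obj):
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:10]

def train(data, step):
    model = {"n_rows": len(data), "step": step}   # stand-in for real training
    LINEAGE.append({"data": version_of(data),
                    "model": version_of(model),
                    "step": step})
    return model

data = [{"x": 1, "y": 0}, {"x": 2, "y": None}]
model = train(data, step="initial")

data = [r for r in data if r["y"] is not None]    # a data-improvement iteration
model = train(data, step="after_cleaning")

# An MLOps engineer can now answer: which data version produced this model?
print(json.dumps(LINEAGE, indent=2))
```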
Realising the potential of Health Data Science: opportunities and challenges ... (Paolo Missier)
This document summarizes a presentation on opportunities and challenges for applying health data science and AI in healthcare. It discusses the potential of predictive, preventative, personalized and participatory (P4) approaches using large health datasets. However, it notes major challenges including data sparsity, imbalance, inconsistency and high costs. Case studies on liver disease and COVID datasets demonstrate issues requiring data engineering. Ensuring explanations and human oversight are also key to adopting AI in clinical practice. Overall, the document outlines a complex landscape and the need for better data science methods to realize the promise of data-driven healthcare.
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science) (Paolo Missier)
This document describes DP4DS, a tool to collect fine-grained provenance from data processing pipelines. Specifically, it can collect provenance from dataframe-based Python scripts. It demonstrates scalable provenance generation, storage, and querying. Current work includes improving provenance compression techniques and demonstrating the tool's generality for standard relational operators. Open questions remain around how useful fine-grained provenance is for explaining findings from real data science pipelines.
A Data-centric perspective on Data-driven healthcare: a short overview (Paolo Missier)
A brief intro to the data challenges associated with working with healthcare data, with a few examples, both from the literature and our own work, of traditional approaches (Latent Class Analysis, Topic Modelling) and a perspective on language-based modelling for Electronic Health Records (EHR).
probably more references than actual content in here!
Capturing and querying fine-grained provenance of preprocessing pipelines in ... (Paolo Missier)
This document describes a method for capturing and querying fine-grained provenance from data science preprocessing pipelines. It captures provenance at the dataframe level by comparing inputs and outputs to identify transformations. Templates are used to represent common transformations like joins and appends. The approach was evaluated on benchmark datasets and pipelines, showing overhead from provenance capture is low and queries are fast even for large datasets. Scalability was demonstrated on datasets up to 1TB in size. A tool called DPDS was also developed to assist with data science provenance.
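A minimal sketch of the capture-by-comparison idea: run a preprocessing step, compare the input and output dataframes, and record which input rows survive and which are invalidated. The real approach uses richer templates (joins, appends, and so on); the helper below, with its invented names, covers only row filtering.

```python
import pandas as pd

def capture(df_in, step):
    """Run `step` and derive row-level provenance by comparing indices."""
    df_out = step(df_in)
    surviving = df_out.index.intersection(df_in.index)
    records = [{"output_row": int(i), "derived_from": [int(i)], "op": step.__name__}
               for i in surviving]
    dropped = df_in.index.difference(df_out.index)
    records += [{"output_row": None, "derived_from": [int(i)],
                 "op": step.__name__ + ":invalidated"}
                for i in dropped]
    return df_out, records

def drop_missing_age(df):
    return df.dropna(subset=["age"])

df = pd.DataFrame({"age": [34, None, 51]})
df2, prov = capture(df, drop_missing_age)
print(prov)   # row 1 is recorded as invalidated by drop_missing_age
```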
Tracking trajectories of multiple long-term conditions using dynamic patient... (Paolo Missier)
The document proposes tracking trajectories of multiple long-term conditions using dynamic patient-cluster associations. It uses topic modeling to identify disease clusters from patient timelines and quantifies how patients associate with clusters over time. Preliminary results on 143,000 patients from UK Biobank show varying stability of patient associations with clusters. Further work aims to better define stability and identify causes of instability.
Digital biomarkers for preventive personalised healthcare (Paolo Missier)
A talk given to the Alan Turing Institute, UK, Oct 2021, reporting on the preliminary results and ongoing research in our lab, on self-monitoring using accelerometers for healthcare applications
The document discusses data provenance for data science applications. It proposes automatically generating and storing metadata that describes how data flows through a machine learning pipeline. This provenance information could help address questions about model predictions, data processing decisions, and regulatory requirements for high-risk AI systems. Capturing provenance at a fine-grained level incurs overhead but enables detailed queries. The approach was evaluated on performance and scalability. Provenance may help with transparency, explainability and oversight as required by new regulations.
Capturing and querying fine-grained provenance of preprocessing pipelines in ... (Paolo Missier)
A talk given at the VLDB 2021 conference, August 2021, presenting our paper:
Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. Chapman, A., Missier, P., Simonelli, G., & Torlone, R. PVLDB, 14(4):507–520, January, 2021.
http://doi.org/10.14778/3436905.3436911
Quo vadis, provenancer? Cui prodest? Our own trajectory: provenance of data... (Paolo Missier)
The document discusses provenance in the context of data science and artificial intelligence. It provides bibliometric data on publications related to data/workflow provenance from 2000 to the present. Recent trends include increased focus on applications in computing and engineering fields. Blockchain is discussed as a method for capturing fine-grained provenance. The document also outlines challenges around explainability, transparency and accountability for high-risk AI systems according to new EU regulations, and argues that provenance techniques may help address these challenges by providing traceability of system functioning and operation monitoring.
Analytics of analytics pipelines: from optimising re-execution to general Dat... (Paolo Missier)
This document discusses using data provenance to optimize re-execution of analytics pipelines and enable transparency in data science workflows. It proposes a framework called ReComp that selectively recomputes parts of expensive analytics workflows when inputs change based on provenance data. It also discusses applying provenance techniques to collect fine-grained data on data preparation steps in machine learning pipelines to help explain model decisions and data transformations. Early results suggest provenance can be collected with reasonable overhead and enables useful queries about pipeline execution.
ReComp: optimising the re-execution of analytics pipelines in response to cha... (Paolo Missier)
Paolo Missier presented on optimizing the re-execution of analytics pipelines in response to changes in input data. The talk discussed using provenance to selectively re-run parts of workflows impacted by changes. ProvONE combines process structure and runtime provenance to enable granular re-execution. The ReComp framework detects and quantifies data changes, estimates impact, and selectively re-executes relevant sub-processes to optimize re-running workflows in response to evolving data.
Efficient Re-computation of Big Data Analytics Processes in the Presence of C... (Paolo Missier)
This document discusses efficient re-computation of big data analytics processes when changes occur. It presents the ReComp framework which uses process execution history and provenance to selectively re-execute only the relevant parts of a process that are impacted by changes, rather than fully re-executing the entire process from scratch. This approach estimates the impact of changes using type-specific difference functions and impact estimation functions. It then identifies the minimal subset of process fragments that need to be re-executed based on change impact analysis and provenance traces. The framework is able to efficiently re-compute complex processes like genomics analytics workflows in response to changes in reference databases or other dependencies.
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo... (Paolo Missier)
A talk given at the 2nd IEEE Blockchain conference, Atlanta, US, July 2019.
Here is the paper: http://homepages.cs.ncl.ac.uk/paolo.missier/doc/Decentralised_Marketplace_USA_Conference___Accepted_Version_.pdf
Efficient Re-computation of Big Data Analytics Processes in the Presence of C... (Paolo Missier)
This document discusses an efficient framework called ReComp for re-computing big data analytics processes when inputs or algorithms change. ReComp uses fine-grained process provenance and execution history to estimate the impact of changes and selectively re-execute only affected parts. This can provide significant time savings over fully re-running processes from scratch. The framework was tested on two case studies: genomic variant analysis (SVI tool) and simulation modeling, demonstrating savings of 28-37% compared to complete re-execution. ReComp provides a generic approach but allows customization for specific processes and change types.
Project Management Semester Long Project - Acuity (jpupo2018)
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Taking AI to the Next Level in Manufacturing.pdf (ssuserfac0301)
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf (Chart Kalyan)
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
TrustArc Webinar - 2024 Global Privacy Survey (TrustArc)
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf (Malak Abu Hammad)
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications. A minimal query sketch follows this listing.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
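For readers who want a concrete starting point, here is a minimal pymongo sketch of an Atlas vector search query. It assumes a cluster with a vector search index named "vector_index" over an "embedding" field; the connection string, database, collection, and query vector are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster.example.mongodb.net")
collection = client["shop"]["products"]

query_vector = [0.12, -0.07, 0.33]   # embedding of the user's query text

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",     # Atlas vector search index name
            "path": "embedding",         # field holding document embeddings
            "queryVector": query_vector,
            "numCandidates": 100,        # ANN candidates to consider
            "limit": 5,                  # results to return
        }
    },
    {"$project": {"name": 1, "score": {"$meta": "vectorSearchScore"}}},
]

for doc in collection.aggregate(pipeline):
    print(doc["name"], doc["score"])
```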
Introduction of Cybersecurity with OSS at Code Europe 2024 (Hiroshi SHIBATA)
I work on the Ruby programming language and on RubyGems and Bundler, the package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS), with examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leverage this data for RAG and other GenAI use cases, and finally chart your course to production.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to part 6 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover test automation with generative AI and OpenAI.
This UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into integrating generative AI, as a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, improve testing accuracy, and speed up the software testing life cycle. Topics covered include the integration process, practical use cases, and the benefits of AI-driven automation for UiPath testing initiatives. Testers and automation professionals who attend will gain valuable insights into harnessing AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development.
What will you get from this session?
1. Insights into integrating generative AI
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI?
Test automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Webinar: Designing a schema for a Data WarehouseFederico Razzoli
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which include databases of any type backing the applications used by the company, data files exported by some applications, and APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which first requires gathering information about the business processes to be analysed. These processes must then be translated into so-called star schemas: denormalised schemas in which each table represents either a dimension or facts (a toy Pandas sketch follows the topic list below).
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
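To make the star-schema idea concrete, here is a toy sketch in Pandas (all table and column names are hypothetical, not taken from the webinar): a fact table at order-line granularity joined to two dimension tables for analysis.

```python
import pandas as pd

# Hypothetical star schema: one fact table plus two dimension tables.
dim_date = pd.DataFrame({
    "date_key": [20240501, 20240502],
    "month": ["2024-05", "2024-05"],
})
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "category": ["books", "games"],
})
fact_sales = pd.DataFrame({          # grain: one row per order line
    "date_key": [20240501, 20240501, 20240502],
    "product_key": [1, 2, 1],
    "amount": [10.0, 25.0, 12.5],
})

# A typical analysis joins the fact table to its dimensions, then aggregates.
report = (fact_sales
          .merge(dim_date, on="date_key")
          .merge(dim_product, on="product_key")
          .groupby(["month", "category"])["amount"]
          .sum())
print(report)
```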
Design and Development of a Provenance Capture Platform for Data Science
1. Luca Gregori∗, Paolo Missier†, Matthew Stidolph‡, Riccardo Torlone∗ and Alessandro Wood∗
∗Department of Engineering, Roma Tre University, Italy
†School of Computer Science, University of Birmingham, UK
‡Newcastle University, School of Computing
DATAPLAT@ICDE
May 2024
Utrecht, NL
2. Setting and questions
[Figure: Source datasets → Data processing → Training datasets → Training (M) → Inference/generation → Model outputs]
Data explanation questions:
• Which data transformations were applied to the raw input dataset(s) to generate the final training set used for modelling?
• Which of the individual data items were affected by each of the transformations?
• What was the effect?
3. Provenance basics
Abstract data transformation operator: $D \to (\mathrm{OP}) \to D'$
Provenance expression: [Figure: PROV graph: activity $A$ used $D$; $D'$ wasGeneratedBy $A$; $D'$ wasDerivedFrom $D$]
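For concreteness, this triple of PROV relations can be emitted with the Python prov package; a minimal sketch, where the ex: identifiers are illustrative:

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

# D --(OP)--> D': one activity, two entities, three PROV relations.
doc.entity("ex:D")
doc.entity("ex:D_prime")
doc.activity("ex:OP")
doc.used("ex:OP", "ex:D")                   # OP used D
doc.wasGeneratedBy("ex:D_prime", "ex:OP")   # D' wasGeneratedBy OP
doc.wasDerivedFrom("ex:D_prime", "ex:D")    # D' wasDerivedFrom D

print(doc.get_provn())                      # PROV-N serialization
```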
4. Extension to DAG topologies
Example: inputs $D^a_0$, $D^b_0$, $D^c_0$ are processed independently and eventually merged into $D_n$:
[Figure: example DAG and its provenance graph. $OP_1: D^a_0 \to D^a_1$; $OP_2: D^b_0 \to D^b_1$; $OP_3$ merges $D^b_1$ with $D^c_0$ into $D^{bc}_0$; $OP_4$ merges $D^a_1$ with $D^{bc}_0$ into $D^{abc}_3$. The provenance graph mirrors the DAG, with used and wasGeneratedBy (wgby) edges.]
5. The Big Provenance Dogma
Data provenance is an enabler for:
• Transparency
• Explainability
• Reproducibility
…for a variety of underlying process and source / target data combinations
[Figure: the generic Source → Process → Target pattern, instantiated as Source datasets → Data processing → Training datasets → Training (M) → Inference/generation → Model outputs]
6. Contributions
✓ Analysis of over 500 Data Science pipelines
  - "in the wild" → Kaggle
  - "controlled" → ML Bazaar
✓ Formal provenance semantics for a catalogue of commonly used Data Science operators
✓ Data Provenance for Data Science (DPDS):
  - automatically tracks granular provenance from Pandas
  - maximally transparent and minimally intrusive to the programmer
✓ Empirical evaluation against a grid of 3 benchmark datasets × 3 synthetic pipelines
7. Data processing pipelines analysis: ML Bazaar
✓ Facilitates developing ML and AutoML systems
✓ Workflow style: pipelines composed of pre-defined primitives
✓ Data + task pairs with benchmark results over multiple data types
✗ Only 5 types of operators
✗ Single location, controlled ecosystem
(DFS = Deep Feature Synthesis)
8. Data processing pipelines analysis: Kaggle
Scope: the top 200 most upvoted Python notebooks related to machine learning on Kaggle
➢ 29 unique pre-processing operations
➢ 12 of them appear in fewer than 10 pipelines, e.g.:
  - transposing
  - changing index values
➢ The most frequent are feature augmentation (58) and scaling operations (38)
13. Data fusion: join and append
Join: $D_L \bowtie^{t}_{C} D_R$
Append: $D_L \uplus D_R$
Example (inner join): $D_L \bowtie^{\mathit{inner}}_{D_L.CId = D_R.CId} D_R$
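In Pandas terms, the two fusion operators map onto merge and concat; a minimal sketch (the dataframes and column names are illustrative, not taken from the paper):

```python
import pandas as pd

DL = pd.DataFrame({"CId": [1, 2], "x": ["a", "b"]})
DR = pd.DataFrame({"CId": [2, 3], "y": ["c", "d"]})

# Join: inner join of D_L and D_R on the condition DL.CId = DR.CId
joined = DL.merge(DR, on="CId", how="inner")

# Append: row-wise union; columns are unioned, missing values become NaN
appended = pd.concat([DL, DR], ignore_index=True)
```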
14. Conceptual provenance capture model: templates
Example: the template associated with the column-transformation operator $\alpha^{\to}_{f_1(Age):ageRange}(D)$
A different provenance template $pt_\tau$ is associated with each type $\tau$ of operator.
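As a rough intuition (a sketch only, not the DPDS data structure), a template can be thought of as a parameterized fragment of a PROV graph, instantiated once per affected data item:

```python
# Hypothetical provenance template for a 1-1 column transformation:
# output value <out_col>@<row> is derived from input value <in_col>@<row>.
def transformation_template(in_col: str, out_col: str, row: int):
    activity = f"op:{in_col}->{out_col}@{row}"
    return [
        ("used", activity, f"{in_col}@{row}"),
        ("wasGeneratedBy", f"{out_col}@{row}", activity),
        ("wasDerivedFrom", f"{out_col}@{row}", f"{in_col}@{row}"),
    ]
```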
15. Capturing provenance: bindings
At runtime, when an operator $o$ of type $\tau$ is executed, the appropriate template $pt_\tau$ is selected.
Data items from the inputs and outputs of the operator are used to bind the variables in the template
op: {old values: F, I, V} → {new values: F′, J, V′}
+ binding rules:
For $i : 1 \ldots n$:
used entities: $[\langle F = X_m,\ I = i,\ V = D_{i,X_m}\rangle \mid X_m \in X]$
generated entities: $[\langle F' = Y_h,\ J = i,\ V = f(D_{i,X})\rangle \mid Y_h \in Y]$
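Operationally, the binding amounts to iterating over the rows of the operator's input and output dataframes; a minimal sketch for the 1-1 case (the function name and the provlet encoding are hypothetical, not the DPDS API):

```python
import pandas as pd

def bind_1to1(df_in: pd.DataFrame, df_out: pd.DataFrame,
              in_col: str, out_col: str) -> list:
    """Bind the used/generated entity variables of a 1-1 transformation template."""
    provlet = []
    for i in df_in.index:                                    # For i : 1 ... n
        used = {"F": in_col, "I": i, "V": df_in.at[i, in_col]}
        generated = {"F'": out_col, "J": i, "V": df_out.at[i, out_col]}
        provlet.append((used, generated))
    return provlet
```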
16. Implementation by shape and value diff
[Figure: decision tree over shape changes (Rows added? Rows removed? Columns added? Columns removed?) selecting one of the templates: horizontal augmentation, reduction by selection, reduction by projection, data transformation, data transformation (composite)]
[Figure: decision tree over value changes, per column (Nulls reduced? Values changed?) selecting one of the templates: data transformation (imputation), data transformation, 1-1 derivations]
For each input/output pair $D_{in}$, $D_{out}$ of dataframes:
1. Diff both the shapes and the values of $D_{in}$ and $D_{out}$
2. Use the diff to:
  • select the appropriate template;
  • bind the template variables using the relevant values in the two dataframes;
  • generate an instantiated provlet.
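A minimal sketch of step 1 and the template-selection part of step 2 (the actual DPDS logic is richer; the names below are illustrative):

```python
import pandas as pd

def select_template(d_in: pd.DataFrame, d_out: pd.DataFrame) -> str:
    """Pick a provenance template from a shape diff of two dataframes."""
    rows_added   = len(d_out) > len(d_in)
    rows_removed = len(d_out) < len(d_in)
    cols_added   = bool(set(d_out.columns) - set(d_in.columns))
    cols_removed = bool(set(d_in.columns) - set(d_out.columns))

    if cols_added and not (rows_added or rows_removed):
        return "horizontal augmentation"
    if rows_removed and not (cols_added or cols_removed):
        return "reduction by selection"
    if cols_removed and not (rows_added or rows_removed):
        return "reduction by projection"
    if not (rows_added or rows_removed or cols_added or cols_removed):
        return "data transformation"   # the value diff then decides, e.g., imputation
    return "data transformation (composite)"
```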
17. Running Example
[Figure: running example pipeline. $D_a$ and $D_b$ are left-joined on $(K_1, K_2)$ into $D_1$; imputing all missing values yields $D_2$; a left join with $D_c$ on $(K_1, K_2)$ yields $D_3$; imputing $E, F$ yields $D_4$; adding 'E4', 'Ex', 'E1' yields $D_5$; removing 'E' yields $D_6$]
$D_1 = D_a \bowtie^{\mathit{left}}_{K_1,K_2} D_b$
$D_2 = \tau_{f_1(*)}(D_1)$
$D_3 = D_2 \bowtie^{\mathit{left}}_{K_1,K_2} D_c$
$D_4 = \tau_{f_2(E,F)}(D_3)$
$D_5 = \alpha^{\to}_{h(E):\{E4,Ex,E1\}}(D_4)$
$D_6 = \pi_{\{Ax,B,Ay,D,C,F,E4,Ex,E1\}}(D_5)$
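The same pipeline written directly in Pandas; a sketch where the toy dataframes below are stand-ins for the ones on the slide:

```python
import numpy as np
import pandas as pd

Da = pd.DataFrame({"K1": [1, 2], "K2": ["a", "b"], "Ax": [1.0, np.nan]})
Db = pd.DataFrame({"K1": [1, 2], "K2": ["a", "b"], "B": [10, 20]})
Dc = pd.DataFrame({"K1": [1, 2], "K2": ["a", "b"],
                   "E": [np.nan, "e2"], "F": [np.nan, "f2"]})

D1 = Da.merge(Db, on=["K1", "K2"], how="left")   # left join on (K1, K2)
D2 = D1.fillna(0)                                # impute all missing values
D3 = D2.merge(Dc, on=["K1", "K2"], how="left")   # left join on (K1, K2)
D4 = D3.fillna({"E": "?", "F": "?"})             # impute only E and F
D5 = D4.assign(E4=D4["E"] + "4",                 # add derived columns
               Ex=D4["E"] + "x",
               E1=D4["E"] + "1")
D6 = D5.drop(columns=["E"])                      # remove 'E'
```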
19. Running Example
[Figure: the same pipeline as on slide 17]
Dataframes, diffs, and selected templates:
• D1 ← {Da, Db}: explicit join provenance pattern
• D2 ← D1 (diff: value change, reduced nulls → imputation): data transformation
• D3 ← {D2, Dc}: explicit join provenance pattern
• D4 ← D3 (diff: value change, reduced nulls → imputation): data transformation
• D5 ← D4 (diff: shape change, column(s) added): <wait!>
• D6 ← D5 (diff: shape change, column(s) removed): data transformation, composite
20. Program-level transparency with control
Approach:
- add an observer to monitor dataframe changes
- mostly transparent to the application
- some control over the Tracker is surfaced to the programmer
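One way such an observer can be realized (a sketch only, not the actual DPDS Tracker): wrap a dataframe and intercept the calls that derive new dataframes.

```python
import pandas as pd

class TrackedFrame:
    """Hypothetical wrapper: observe dataframe-producing calls transparently."""
    def __init__(self, df: pd.DataFrame, log: list):
        self._df, self._log = df, log

    def __getattr__(self, name):
        attr = getattr(self._df, name)
        if not callable(attr):
            return attr
        def wrapper(*args, **kwargs):
            result = attr(*args, **kwargs)
            if isinstance(result, pd.DataFrame):      # a new frame was derived
                self._log.append((name, self._df.shape, result.shape))
                return TrackedFrame(result, self._log)
            return result
        return wrapper

log = []
df = TrackedFrame(pd.DataFrame({"A": [1, None]}), log)
df = df.fillna(0)                 # intercepted, provenance hook fires here
print(log)                        # [('fillna', (2, 1), (2, 1))]
```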
21. Provenance traversals – example
Capture, store and query element-level provenance
- Derivation of each element of each intermediate dataframe (when possible)
- Efficiently, at scale
[Figure: provenance traversal example. df_A (df_-1) and df_B (df_0) are joined into df_1, followed by a fillna; element-level provenance links each cell of df_1 back through the join and the imputation]
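With the provenance graph stored in Neo4j (next slide), such a traversal can be phrased as a Cypher query. A sketch using the official neo4j Python driver, where the Entity label, the DERIVED_FROM relationship, and the id scheme are all assumptions for illustration:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

# Assumed schema: (:Entity {id}) nodes linked by [:DERIVED_FROM] edges.
QUERY = """
MATCH (e:Entity {id: $id})-[:DERIVED_FROM*]->(src:Entity)
RETURN src.id AS source
"""

with driver.session() as session:
    # Trace one element of df_1 back to its sources (the id format is made up).
    for record in session.run(QUERY, id="df_1@row3:colA"):
        print(record["source"])

driver.close()
```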
22. Benchmarking: data × pipelines
[Tables: the 3 benchmark datasets and 3 synthetic pipelines used in the evaluation; details not recoverable from the extracted text]
All provenance graphs are stored in a single Neo4j database.
23. Results
The PT/PO ratio provides a rough indication of scalability: the graphs for the complete pipelines are close in size to the sum of the sizes of the components' graphs.
(In the chart, 1, 2, 3 denote the pipeline number.)
24. Conclusions
✓ DPDS generates granular provenance graphs that accurately represent the underlying data processing
✓ A potentially useful building block towards explanations in a Data-Centric AI setting
Limitations:
✗ No granularity control → limited scalability
✗ Operates only on Pandas dataframes