SlideShare una empresa de Scribd logo
1 de 8
Panda: A System for
Provenance and Data
    Presented By Vladimir Bukhin
Contents

• Use of Provenance.
• Panda Goals.
• Example Workflow.
• Provenance Operations.
• Panda Implementation.
Use of Provenance

• Explanation: Examine sources and evolution
  of data elements.
• Verification: Auditing how data was
  produced.
• Re-computation: Having found error,
  propagate changes downstream.
Panda Goals
• Merge data-based and process-based
  provenance.
• Define provenance operators to query and
  analyze data mixed with provenance.
• Create open-source configurable system
  that can be used for wide variety of
  applications, having coupling capabilities
  with outside data/systems.
Example Workflow


•   De-duplicate 2 data sets.
•   Partition into Euro and USA, then take union.
•   Process predict Items most likely to be purchased from
    addition of 2 data sets.
•   Aggregation: Output of prediction table.
Provenance Operations
• Backward Tracing: If cowboy hats most sold
  item, where are people buying from?
• Forward Tracing: If we correct an error in
  the data, how would the outcome change?
• Forward Propagation: Rerun processes for
  concerned erroneous data and recalc end
  result.
• Refresh: Recalculate due to new data.
Panda Implementation
•   Query language to answer
    questions like: Which
    customer list contributes the
    most to the top 100 predicted
    items?
•   Use ‘predicates’ as references
    to refer back to data origins.
    (to trace back to info src.)
•   Python mapping/transformation
    nodes run per data point.
References

• R. Ikeda and J. Widom, Panda: A System for
  Provenance and Data, IEEE Data
  Engineering Bulletin,Vol. 33, No. 3.
  September 2010.

Más contenido relacionado

La actualidad más candente

IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf dataIEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf dataIEEEFINALYEARSTUDENTPROJECTS
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Simplilearn
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
 
Splunk and map_reduce
Splunk and map_reduceSplunk and map_reduce
Splunk and map_reduceGreg Hanchin
 
Himansu sahoo resume-ds
Himansu sahoo resume-dsHimansu sahoo resume-ds
Himansu sahoo resume-dsHimansu Sahoo
 
AllegroGraph - Cognitive Probability Graph webcast
AllegroGraph - Cognitive Probability Graph webcastAllegroGraph - Cognitive Probability Graph webcast
AllegroGraph - Cognitive Probability Graph webcastFranz Inc. - AllegroGraph
 
What are we doing with our data
What are we doing with our data What are we doing with our data
What are we doing with our data Spencer Keralis
 
Frequent Item Set Mining - A Review
Frequent Item Set Mining - A ReviewFrequent Item Set Mining - A Review
Frequent Item Set Mining - A Reviewijsrd.com
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceDr. Haxel Consult
 
Toronto OpenRefine MeetUp Nov 2015
Toronto OpenRefine MeetUp Nov 2015Toronto OpenRefine MeetUp Nov 2015
Toronto OpenRefine MeetUp Nov 2015Martin Magdinier
 
A Data Model, Workflow, and Architecture for Integrating Data
A Data Model, Workflow, and Architecture for Integrating DataA Data Model, Workflow, and Architecture for Integrating Data
A Data Model, Workflow, and Architecture for Integrating DataDavid Massart
 
Ginix generalized inverted index for keyword search
Ginix generalized inverted index for keyword searchGinix generalized inverted index for keyword search
Ginix generalized inverted index for keyword searchIEEEFINALYEARPROJECTS
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With PythonSarah Guido
 
Data and Donuts: The Impact of Data Management
Data and Donuts: The Impact of Data ManagementData and Donuts: The Impact of Data Management
Data and Donuts: The Impact of Data ManagementC. Tobin Magle
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? Robert Grossman
 

La actualidad más candente (20)

IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf dataIEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
IEEE 2014 JAVA DATA MINING PROJECTS Scalable keyword search on large rdf data
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 
Splunk and map_reduce
Splunk and map_reduceSplunk and map_reduce
Splunk and map_reduce
 
Himansu sahoo resume-ds
Himansu sahoo resume-dsHimansu sahoo resume-ds
Himansu sahoo resume-ds
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
AllegroGraph - Cognitive Probability Graph webcast
AllegroGraph - Cognitive Probability Graph webcastAllegroGraph - Cognitive Probability Graph webcast
AllegroGraph - Cognitive Probability Graph webcast
 
Introduction Big data
Introduction Big data  Introduction Big data
Introduction Big data
 
What are we doing with our data
What are we doing with our data What are we doing with our data
What are we doing with our data
 
Frequent Item Set Mining - A Review
Frequent Item Set Mining - A ReviewFrequent Item Set Mining - A Review
Frequent Item Set Mining - A Review
 
Design and creation of ontologies for environmental information retrieval
Design and creation of ontologies for environmental information retrievalDesign and creation of ontologies for environmental information retrieval
Design and creation of ontologies for environmental information retrieval
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 
CS215 - Lec 9 indexing and reclaiming space in files
CS215 - Lec 9  indexing and reclaiming space in filesCS215 - Lec 9  indexing and reclaiming space in files
CS215 - Lec 9 indexing and reclaiming space in files
 
Open access day
Open access dayOpen access day
Open access day
 
Toronto OpenRefine MeetUp Nov 2015
Toronto OpenRefine MeetUp Nov 2015Toronto OpenRefine MeetUp Nov 2015
Toronto OpenRefine MeetUp Nov 2015
 
A Data Model, Workflow, and Architecture for Integrating Data
A Data Model, Workflow, and Architecture for Integrating DataA Data Model, Workflow, and Architecture for Integrating Data
A Data Model, Workflow, and Architecture for Integrating Data
 
Ginix generalized inverted index for keyword search
Ginix generalized inverted index for keyword searchGinix generalized inverted index for keyword search
Ginix generalized inverted index for keyword search
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With Python
 
Data and Donuts: The Impact of Data Management
Data and Donuts: The Impact of Data ManagementData and Donuts: The Impact of Data Management
Data and Donuts: The Impact of Data Management
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 

Similar a Panda Provenance

20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vs20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vsIan Feller
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...Ilkay Altintas, Ph.D.
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptxXanGwaps
 
Data Café — A Platform For Creating Biomedical Data Lakes
Data Café — A Platform For Creating Biomedical Data LakesData Café — A Platform For Creating Biomedical Data Lakes
Data Café — A Platform For Creating Biomedical Data LakesPradeeban Kathiravelu, Ph.D.
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Srinath Perera
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsIlkay Altintas, Ph.D.
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
Etl with apache impala by athemaster
Etl with apache impala by athemasterEtl with apache impala by athemaster
Etl with apache impala by athemasterAthemaster Co., Ltd.
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analyticsAnirudh
 
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeeling Cheung
 
Large scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log miningLarge scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log miningitstuff
 
IEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slidesIEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slidesNish Parikh
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReducesscdotopen
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceMahir Haque
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of MusicLars Albertsson
 
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...Data Works MD
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sqlaftab alam
 

Similar a Panda Provenance (20)

20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vs20150814 Wrangling Data From Raw to Tidy vs
20150814 Wrangling Data From Raw to Tidy vs
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Data Café — A Platform For Creating Biomedical Data Lakes
Data Café — A Platform For Creating Biomedical Data LakesData Café — A Platform For Creating Biomedical Data Lakes
Data Café — A Platform For Creating Biomedical Data Lakes
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
Bridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable WorkflowsBridging Big Data and Data Science Using Scalable Workflows
Bridging Big Data and Data Science Using Scalable Workflows
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Etl with apache impala by athemaster
Etl with apache impala by athemasterEtl with apache impala by athemaster
Etl with apache impala by athemaster
 
Real time streaming analytics
Real time streaming analyticsReal time streaming analytics
Real time streaming analytics
 
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
 
Large scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log miningLarge scale Click-streaming and tranaction log mining
Large scale Click-streaming and tranaction log mining
 
IEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slidesIEEE.BigData.Tutorial.2.slides
IEEE.BigData.Tutorial.2.slides
 
Scalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduceScalable Similarity-Based Neighborhood Methods with MapReduce
Scalable Similarity-Based Neighborhood Methods with MapReduce
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data Infrastructure for a World of Music
Data Infrastructure for a World of MusicData Infrastructure for a World of Music
Data Infrastructure for a World of Music
 
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 

Último

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Último (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

Panda Provenance

  • 1. Panda: A System for Provenance and Data Presented By Vladimir Bukhin
  • 2. Contents • Use of Provenance. • Panda Goals. • Example Workflow. • Provenance Operations. • Panda Implementation.
  • 3. Use of Provenance • Explanation: Examine sources and evolution of data elements. • Verification: Auditing how data was produced. • Re-computation: Having found error, propagate changes downstream.
  • 4. Panda Goals • Merge data-based and process-based provenance. • Define provenance operators to query and analyze data mixed with provenance. • Create open-source configurable system that can be used for wide variety of applications, having coupling capabilities with outside data/systems.
  • 5. Example Workflow • De-duplicate 2 data sets. • Partition into Euro and USA, then take union. • Process predict Items most likely to be purchased from addition of 2 data sets. • Aggregation: Output of prediction table.
  • 6. Provenance Operations • Backward Tracing: If cowboy hats most sold item, where are people buying from? • Forward Tracing: If we correct an error in the data, how would the outcome change? • Forward Propagation: Rerun processes for concerned erroneous data and recalc end result. • Refresh: Recalculate due to new data.
  • 7. Panda Implementation • Query language to answer questions like: Which customer list contributes the most to the top 100 predicted items? • Use ‘predicates’ as references to refer back to data origins. (to trace back to info src.) • Python mapping/transformation nodes run per data point.
  • 8. References • R. Ikeda and J. Widom, Panda: A System for Provenance and Data, IEEE Data Engineering Bulletin,Vol. 33, No. 3. September 2010.

Notas del editor

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n