Luca Gregori∗, Paolo Missier†, Matthew Stidolph‡, Riccardo Torlone∗ and Alessandro Wood∗
∗Department of Engineering, Roma Tre University, Italy
†School of Computer Science, University of Birmingham, UK
‡Newcastle University, School of Computing
DATAPLAT@ICDE
May 2024
Utrecht, NL
Design and Development of a Provenance
Capture Platform for Data Science
Setting and questions
[Figure: Source datasets → Data processing → Training datasets → Training → M → Inference/generation → Model outputs]
Data explanation questions:
• Which data transformations were applied to raw input dataset(s) to generate the final
training set used for modelling?
• Which of the individual data items were affected by each of the transformations?
• What was the effect?
Provenance basics
Abstract data transformation operator: 𝐷 → (OP) → 𝐷ʹ
[Figure: entities D and D′ and activity A, linked by the PROV relations used, wasGeneratedBy and wasDerivedFrom]
Provenance expression:
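As a rough illustration, the provenance of one operator application can be written down as three PROV-style relations. The sketch below assumes a plain triple representation; the function name `prov_for_operator` and the string identifiers are hypothetical and not part of the platform.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def prov_for_operator(d_in: str, activity: str, d_out: str) -> List[Triple]:
    """Return the PROV relations describing D -> (OP) -> D'."""
    return [
        (activity, "used", d_in),             # the activity A read D
        (d_out, "wasGeneratedBy", activity),  # D' was produced by A
        (d_out, "wasDerivedFrom", d_in),      # D' derives from D
    ]

triples = prov_for_operator("D", "A", "D'")
for subject, relation, obj in triples:
    print(f"{relation}({subject}, {obj})")
```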
Extension to DAG topologies
Example: inputs $D^a_0$, $D^b_0$, $D^c_0$ are processed independently and eventually merged into $D_n$:
[Figure: pipeline DAG with OP1: $D^a_0 \rightarrow D^a_1$, OP2: $D^b_0 \rightarrow D^b_1$, OP3 producing $D^{bc}_0$, and OP4 producing $D^{abc}_3$; shown a second time annotated with used and wasGeneratedBy (wgby) edges]
The Big Provenance Dogma
Data provenance is an enabler for:
• Transparency
• Explainability
• Reproducibility
…for a variety of underlying process and source / target data combinations
[Figure: pipeline (Data processing, Training, M, Inference/generation) over training datasets and model outputs, annotated with Source, Process and Target]
Contributions
✓ Analysis of over 500 Data Science pipelines
  - “in the wild” → Kaggle
  - “controlled” → ML Bazaar
✓ Formal provenance semantics for a catalogue of commonly used Data Science operators
✓ Data Provenance for Data Science (DPDS)
  - automatically tracks granular provenance from Pandas
  - maximally transparent and minimally intrusive to the programmer
✓ Empirical evaluation against a grid of 3 benchmark datasets × 3 synthetic pipelines
Data processing pipelines analysis: ML Bazaar
✓ Facilitates developing ML and AutoML systems
✓ Workflow style: pipelines composed out of pre-defined primitives
✓ Data + task pairs with benchmark results over multiple data types
✗ Only 5 types of operators
✗ Single location, controlled ecosystem
DFS = Deep Feature Synthesis
Data processing pipelines analysis: Kaggle
Scope: top 200 most upvoted Python notebooks related to machine learning on Kaggle
• 29 unique pre-processing operations
• 12 appear in fewer than 10 pipelines, for example:
  - transposing
  - changing index values
• feature augmentation (58)
• scaling operations (38)
Data processing operators
Data reduction
Projection, Selection: $D' = \pi_C(D)$, $D' = \sigma_C(D)$
$\pi_{\{Cid, Gender, Age\}}(\sigma_{Age<30}(D))$
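In Pandas terms, the combined selection and projection above could look like the following minimal sketch. The dataframe contents and the extra `Zip` column are invented for illustration.

```python
import pandas as pd

# Toy dataset D; the values are made up for this example.
D = pd.DataFrame({
    "Cid": [1, 2, 3],
    "Gender": ["F", "M", "F"],
    "Age": [25, 41, 29],
    "Zip": ["00146", "NE1", "B15"],
})

selected = D[D["Age"] < 30]                   # selection: keep rows with Age < 30
reduced = selected[["Cid", "Gender", "Age"]]  # projection: keep three features
print(reduced)
```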
Data augmentation
Vertical augmentation
$\alpha^{\rightarrow}_{f(X):Y}$
$\alpha^{\rightarrow}_{f_1(Age):ageRange}(D)$
[Figure label: group by gender, avg(age)]
Horizontal augmentation
$E_2 = \alpha^{\downarrow}_{Gender:f_2(Age)}(D)$
$\alpha^{\downarrow}_{X:f(Y)}(D)$
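A hedged Pandas sketch of the two augmentation styles: the first derives a new feature `ageRange` from `Age`, the second appends one aggregated row per `Gender`. The bucketing function `f1`, the choice of the mean as aggregate, and the data are assumptions made for this example.

```python
import pandas as pd

D = pd.DataFrame({"Gender": ["F", "M", "F"], "Age": [25, 41, 29]})

# Vertical augmentation: derive a new feature ageRange from Age.
def f1(age: int) -> str:
    """Hypothetical bucketing function."""
    return "<30" if age < 30 else ">=30"

D_vert = D.assign(ageRange=D["Age"].map(f1))

# Horizontal augmentation: append one aggregated row per Gender (average Age).
agg = D.groupby("Gender", as_index=False)["Age"].mean()
D_horiz = pd.concat([D, agg], ignore_index=True)

print(D_vert)
print(D_horiz)
```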
Data transformation
$\tau_{f(X)}$
The transformation of a set of features X of D using a function f is obtained by substituting each value $d_{ia}$ with $f(d_{\ast a})$, for each feature a occurring in X.
Example: data imputation. Here f replaces nulls with the most frequent value, for column Zip:
$\tau_{f(Zip)}(D)$
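The imputation example can be sketched in Pandas as follows, assuming that "most frequent value" means the column mode over non-null entries. The data is invented.

```python
import pandas as pd

D = pd.DataFrame({"Cid": [1, 2, 3, 4], "Zip": ["NE1", None, "NE1", "B15"]})

# f looks at the whole Zip column: the most frequent non-null value.
most_frequent = D["Zip"].mode().iloc[0]

# Replace each null in Zip with that value; other columns are untouched.
D_imputed = D.assign(Zip=D["Zip"].fillna(most_frequent))
print(D_imputed)
```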
Data fusion: join and append
Join: $D^L \bowtie^{t}_{C} D^R$
Append: $D^L \uplus D^R$
Example: $D^L \bowtie^{inner}_{D^L.CId = D^R.CId} D^R$
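A minimal Pandas counterpart of the two fusion operators, under the assumption that join maps to `pd.merge` and append to `pd.concat`. The dataframes are invented.

```python
import pandas as pd

DL = pd.DataFrame({"CId": [1, 2, 3], "Gender": ["F", "M", "F"]})
DR = pd.DataFrame({"CId": [2, 3, 4], "Age": [41, 29, 33]})

# Inner join on the condition DL.CId = DR.CId.
joined = pd.merge(DL, DR, on="CId", how="inner")

# Append: stack the rows of both inputs; features missing on one side become NaN.
appended = pd.concat([DL, DR], ignore_index=True)

print(joined)
print(appended)
```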
Conceptual provenance capture model: templates
$\alpha^{\rightarrow}_{f_1(Age):ageRange}(D)$
A different provenance template pt𝜏 is associated with each type 𝜏 of operator
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
op: {old values: F, I, V} → {new values: F′, J, V′}
Binding rules
For $i : 1 \ldots n$:
used ent.: $[\langle F = X_m, I = i, V = D_{i,X_m} \rangle \mid X_m \in X]$
generated ent.: $[\langle F' = Y_h, J = i, V' = f(D_{i,X}) \rangle \mid Y_h \in Y]$
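The binding rules can be read as a loop over rows that emits one set of used entities and one set of generated entities per row. The sketch below is one possible reading; the function name `bind`, the dictionary layout, and the assumption that input and output rows share the same index are all illustrative rather than taken from the platform.

```python
import pandas as pd

def bind(d_in: pd.DataFrame, d_out: pd.DataFrame, X: list, Y: list) -> list:
    """For each row i, bind used entities (features in X) and generated entities (features in Y)."""
    bindings = []
    for i in d_in.index:
        used = [{"F": x, "I": i, "V": d_in.at[i, x]} for x in X]
        # F, J, V here play the role of F', J, V' in the binding rules.
        generated = [{"F": y, "J": i, "V": d_out.at[i, y]} for y in Y]
        bindings.append({"used": used, "generated": generated})
    return bindings

d_in = pd.DataFrame({"Age": [25, 41]})
d_out = d_in.assign(ageRange=["<30", ">=30"])
b = bind(d_in, d_out, X=["Age"], Y=["ageRange"])
print(b[0])
```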
Implementation by shape and value diff
Shape changes:
[Figure: decision diagram over the questions "Rows added?", "Rows removed?", "Columns added?", "Columns removed?", selecting one of the templates: horizontal augmentation, reduction by selection, reduction by projection, data transformation, data transformation (composite)]
Value changes for each column:
[Figure: decision diagram over the questions "Nulls reduced?", "Values changed?", selecting one of the templates: data transformation (imputation), data transformation; 1-1 derivations]
For each input/output pair Din, Dout of dataframes:
1. Diff both shapes and values of Din, Dout
2. Use the diff to:
• Select the appropriate template
• Bind the template variables using the
relevant values in the two dataframes
• Generate an instantiated provlet
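The diff-then-select procedure above might be sketched as below. This is not the DPDS implementation: the order of the checks, the mapping from diff outcomes to template names, and the "vertical augmentation" label for added columns are my reading of the slides and should be treated as assumptions.

```python
import pandas as pd

def select_template(d_in: pd.DataFrame, d_out: pd.DataFrame) -> str:
    """Pick a template name from a shape diff, then a value diff (illustrative only)."""
    rows_added = len(d_out.index.difference(d_in.index)) > 0
    rows_removed = len(d_in.index.difference(d_out.index)) > 0
    cols_added = len(d_out.columns.difference(d_in.columns)) > 0
    cols_removed = len(d_in.columns.difference(d_out.columns)) > 0

    # Shape diff.
    if cols_added and cols_removed:
        return "data transformation (composite)"
    if rows_added:
        return "horizontal augmentation"
    if rows_removed:
        return "reduction by selection"
    if cols_added:
        return "vertical augmentation"
    if cols_removed:
        return "reduction by projection"

    # Same shape: fall back to a value diff.
    if d_out.isna().sum().sum() < d_in.isna().sum().sum():
        return "data transformation (imputation)"
    if not d_in.equals(d_out):
        return "data transformation"
    return "no change"

d_in = pd.DataFrame({"E": ["a", None], "F": [1.0, 2.0]})
print(select_template(d_in, d_in.fillna("Ex")))
print(select_template(d_in, d_in.drop(columns=["E"])))
```

Binding the template variables and emitting the provlet would follow the selection step; they are left out here to keep the sketch short.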
Running Example
[Figure: pipeline Da, Db → Left join (K1, K2) → D1 → Impute all missing → D2; D2, Dc → Left join (K1, K2) → D3 → Impute E, F → D4 → Add ‘E4’, ‘Ex’, ‘E1’ → D5 → Remove ‘E’ → D6]
$D_1 = D_a \bowtie^{left}_{K_1,K_2} D_b$
$D_2 = \tau_{f_1(\ast)}(D_1)$
$D_3 = D_2 \bowtie^{left}_{K_1,K_2} D_c$
$D_4 = \tau_{f_2(E,F)}(D_3)$
$D_5 = \alpha^{\rightarrow}_{h(E):\{E4,Ex,E1\}}(D_4)$
$D_6 = \pi_{\{Ax,B,Ay,D,C,F,E4,Ex,E1\}}(D_5)$
Running Example
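import pandas as pd

# df_A, df_B, df_C: input dataframes sharing the key columns key1 and key2
df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')  # left join
df = df.fillna('imputed')  # imputation of all missing values
df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')  # left join
df = df.fillna(value={'E': 'Ex', 'F': 'Fx'})  # imputation of E and F
# one-hot encoding of column E
c = 'E'
dummies = []
dummies.append(pd.get_dummies(df[c]))
df_dummies = pd.concat(dummies, axis=1)
df = pd.concat((df, df_dummies), axis=1)  # add the indicator columns
df = df.drop([c], axis=1)  # remove the original column E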
Running Example
Dataframes | Diff | Template
D1 ← {Da, Db} | Explicit join provenance pattern
D2 ← D1 | Value change, reduced nulls → imputation | Data transformation
D3 ← {D2, Dc} | Explicit join provenance pattern
D4 ← D3 | Value change, reduced nulls → imputation | Data transformation
D5 ← D4 | Shape change, column(s) added | <wait!>
D6 ← D5 | Shape change, column(s) removed | Data transformation, composite
Program level transparency with control
Approach:
- add an observer to monitor dataframe changes
- mostly transparent to application
- some control over Tracker surfaced
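To make the observer idea concrete, here is a deliberately simplified sketch. The `Tracker` class, its `enabled` switch and its log format are hypothetical and do not reproduce the DPDS API; the sketch only shows how assignments to a tracked dataframe could be intercepted with little change to application code.

```python
import pandas as pd

class Tracker:
    """Hypothetical observer: logs an entry whenever the tracked dataframe is replaced."""

    def __init__(self, df: pd.DataFrame):
        self._df = df
        self.enabled = True  # the control surfaced to the programmer
        self.log = []

    @property
    def df(self) -> pd.DataFrame:
        return self._df

    @df.setter
    def df(self, new_df: pd.DataFrame) -> None:
        if self.enabled:
            # A real tracker would diff the two dataframes and emit provenance here.
            self.log.append({"shape_before": self._df.shape, "shape_after": new_df.shape})
        self._df = new_df

tracker = Tracker(pd.DataFrame({"E": ["a", None], "F": [1, 2]}))
tracker.df = tracker.df.fillna("Ex")         # observed
tracker.enabled = False
tracker.df = tracker.df.drop(columns=["F"])  # not observed
print(tracker.log)
```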
Provenance traversals – example
Capture, store and query element-level provenance
- Derivation of each element of each intermediate dataframe (when possible)
- Efficiently, at scale
[Figure: provenance graph over the dataframes df_A (df_-1), df_B (df_0) and df_1, with Join and fillna operators]
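An element-level traversal can be pictured as a backward walk over wasDerivedFrom edges. The sketch below uses an in-memory dictionary rather than a graph database, and the edges, element identifiers and the name `df_2` are invented for illustration.

```python
from collections import deque

# wasDerivedFrom edges: derived element -> elements it was derived from.
# Elements are (dataframe, row, column) triples; contents are made up.
derived_from = {
    ("df_2", 0, "E"): [("df_1", 0, "E")],  # value produced by fillna
    ("df_1", 0, "E"): [("df_B", 0, "E")],  # join copies E from df_B
    ("df_1", 0, "A"): [("df_A", 0, "A")],  # join copies A from df_A
}

def lineage(element):
    """Return all upstream elements reachable via wasDerivedFrom, breadth first."""
    seen, order, queue = set(), [], deque([element])
    while queue:
        for src in derived_from.get(queue.popleft(), []):
            if src not in seen:
                seen.add(src)
                order.append(src)
                queue.append(src)
    return order

print(lineage(("df_2", 0, "E")))
```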
Benchmarking: data x pipelines
Datasets:
Pipelines:
Provenance graphs are stored in a single Neo4j database
Results
The PT/PO ratio provides a rough indication of scalability:
- The graphs for the complete pipelines are close in size to the sum of the sizes of the components’ graphs
1,2,3: pipeline number
Conclusions
✓ DPDS generates granular provenance graphs that accurately represent the underlying data processing
✓ A potentially useful building block towards explanations in a Data-Centric AI setting
Limitations:
• No granularity control → limited scalability
• Operates only on Pandas dataframes
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Paolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Paolo Missier
 

Design and Development of a Provenance Capture Platform for Data Science

  • 1. Design and Development of a Provenance Capture Platform for Data Science
      Luca Gregori∗, Paolo Missier†, Matthew Stidolph‡, Riccardo Torlone∗ and Alessandro Wood∗
      ∗Department of Engineering, Roma Tre University, Italy
      †School of Computer Science, University of Birmingham, UK
      ‡Newcastle University, School of Computing
      DATAPLAT@ICDE, May 2024, Utrecht, NL
  • 2. Setting and questions
      [Figure: pipeline diagram with Source datasets, Data processing, Training datasets, Training, M, Inference/generation, Model outputs]
      Data explanation questions:
      • Which data transformations were applied to the raw input dataset(s) to generate the final training set used for modelling?
      • Which individual data items were affected by each transformation?
      • What was the effect?
  • 3. Provenance basics
      Abstract data transformation operator: 𝐷 → (OP) → 𝐷ʹ
      Provenance expression: [Figure: provenance graph over D, D′ and A with the relations used, wasGeneratedBy and wasDerivedFrom]
  • 4. Extension to DAG topologies
      Example: inputs $D^a_0$, $D^b_0$, $D^c_0$ are processed independently and eventually merged into $D_n$.
      [Figure: DAG of operators OP1 to OP4 over the datasets $D^a_0$, $D^a_1$, $D^b_0$, $D^b_1$, $D^c_0$, $D^{bc}_0$, $D^{abc}_3$, shown first as a pipeline and then as a provenance graph with used and wasGeneratedBy (wgby) edges]
  • 5. The Big Provenance Dogma
      Data provenance is an enabler for:
      • Transparency
      • Explainability
      • Reproducibility
      …for a variety of underlying process and source/target data combinations
      [Figure: pipeline diagram with Training datasets, Data processing, Training, M, Inference/generation, Model outputs, annotated with Source, Process and Target]
  • 6. Contributions
      ✓ Analysis of over 500 Data Science pipelines
          ▪ "in the wild" → Kaggle
          ▪ "controlled" → ML Bazaar
      ✓ Formal provenance semantics for a catalogue of commonly used Data Science operators
      ✓ Data Provenance for Data Science (DPDS)
          ▪ automatically tracks granular provenance from Pandas
          ▪ maximally transparent and minimally intrusive to the programmer
      ✓ Empirical evaluation against a grid of 3 benchmark datasets × 3 synthetic pipelines
  • 7. Data processing pipelines analysis: ML Bazaar
      ✓ Facilitates developing ML and AutoML systems
      ✓ Workflow style: pipelines composed of pre-defined primitives
      ✓ Data + task pairs with benchmark results over multiple data types
      ✗ Only 5 types of operators
      ✗ Single location, controlled ecosystem
      DFS = Deep Feature Synthesis
  • 8. Data processing pipelines analysis: Kaggle
      Scope: the 200 most upvoted Python notebooks related to machine learning on Kaggle
      ➢ 29 unique pre-processing operations
      ➢ 12 appear in fewer than 10 pipelines
          ▪ transposing
          ▪ changing index values
      ➢ feature augmentation (58)
      ➢ scaling operations (38)
  • 10. Data reduction
      Projection, Selection: $D' = \pi_C(D)$, $D' = \sigma_C(D)$
      $\pi_{\{Cid, Gender, Age\}}(\sigma_{Age<30}(D))$
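The two reduction operators correspond to familiar Pandas idioms. The following is a minimal sketch of the slide's example, assuming a toy dataframe: the column names Cid, Gender and Age come from the slide, while the Zip column and all data values are invented for illustration.

```python
import pandas as pd

# Toy input D; the values are illustrative only.
D = pd.DataFrame({
    "Cid": [1, 2, 3, 4],
    "Gender": ["F", "M", "F", "M"],
    "Age": [25, 41, 29, 33],
    "Zip": ["00146", "NE1", "B15", "00154"],
})

# Selection sigma_{Age<30}(D): keep only the rows that satisfy the condition.
selected = D[D["Age"] < 30]

# Projection pi_{Cid,Gender,Age}(...): keep only the listed columns.
reduced = selected[["Cid", "Gender", "Age"]]
```

Both steps preserve the original row index, which is what lets a shape diff tell afterwards which rows and columns were dropped.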
  • 11. Data augmentation
      Vertical augmentation: $\alpha^{\rightarrow}_{f(X):Y}$
          $\alpha^{\rightarrow}_{f_1(Age):ageRange}(D)$
      Horizontal augmentation: $\alpha^{\downarrow}_{X:f(Y)}(D)$
          $E_2 = \alpha^{\downarrow}_{Gender:f_2(Age)}(D)$ (group by gender, avg(age))
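One possible Pandas reading of the two augmentation operators is sketched below. It assumes that vertical augmentation adds derived columns and that horizontal augmentation appends aggregated records as new rows. The binning function f1 and the data values are hypothetical. f2 is taken to be the mean, following the slide's "avg(age)".

```python
import pandas as pd

D = pd.DataFrame({
    "Cid": [1, 2, 3, 4],
    "Gender": ["F", "M", "F", "M"],
    "Age": [25, 41, 29, 33],
})

# Vertical augmentation: derive a new feature ageRange from Age.
# f1 is a hypothetical binning function chosen for illustration.
def f1(age: int) -> str:
    return "under30" if age < 30 else "30plus"

D_vertical = D.assign(ageRange=D["Age"].map(f1))

# Horizontal augmentation (assumed reading): group by Gender, aggregate Age
# with f2 = mean, and append the aggregated records as new rows.
aggregated = D.groupby("Gender", as_index=False)["Age"].mean()
D_horizontal = pd.concat([D, aggregated], ignore_index=True)
```

Under this reading the appended rows carry no Cid value, so they are easy to tell apart from the original records.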
  • 12. Data transformation
      $\tau_{f(X)}$: the transformation of a set of features X of D using a function f is obtained by substituting each value $d_{ia}$ with $f(d_{*a})$, for each feature a occurring in X.
      Example: data imputation. Here f replaces nulls with the most frequent value, for column Zip: $\tau_{f(Zip)}(D)$
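The imputation example can be sketched in Pandas as follows. This is an illustration under assumed data: only the Zip column name and the "most frequent value" rule come from the slide.

```python
import pandas as pd

D = pd.DataFrame({
    "Cid": [1, 2, 3, 4],
    "Zip": ["00146", None, "00146", "NE1"],
})

# f: replace nulls in Zip with the column's most frequent value.
# Series.mode() ignores nulls by default; take the first mode if there are ties.
most_frequent = D["Zip"].mode().iloc[0]
D_out = D.assign(Zip=D["Zip"].fillna(most_frequent))
```

The shape of the dataframe is unchanged. Only the previously null cell differs between D and D_out, so only that cell needs a derivation recorded for a changed value.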
  • 13. Data fusion: join and append
      Join: $D^L \bowtie^{t}_{C} D^R$
      Append: $D^L \uplus D^R$
      $D^L \bowtie^{inner}_{D^L.CId = D^R.CId} D^R$
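In Pandas, the two fusion operators are usually written with `merge` and `concat`. The sketch below mirrors the inner join on CId from the slide. The two input dataframes and their non-key columns are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical left and right inputs sharing the key CId.
D_L = pd.DataFrame({"CId": [1, 2, 3], "Gender": ["F", "M", "F"]})
D_R = pd.DataFrame({"CId": [2, 3, 4], "Zip": ["NE1", "B15", "00146"]})

# Join of type t = inner with condition D_L.CId = D_R.CId.
joined = pd.merge(D_L, D_R, how="inner", on="CId")

# Append: stack the records of both inputs; features missing on one side
# are filled with nulls.
appended = pd.concat([D_L, D_R], ignore_index=True)
```

Each joined row has one contributing record on each side, whereas each appended row comes from exactly one input.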
  • 14. Conceptual provenance capture model: templates
      [Figure: provenance template for $\alpha^{\rightarrow}_{f_1(Age):ageRange}(D)$]
      A different provenance template pt𝜏 is associated with each type 𝜏 of operator
  • 15. Capturing provenance: bindings
    At runtime, when an operator o of type 𝜏 is executed, the template pt𝜏 for 𝜏 is selected. Data items from the operator's inputs and outputs are then used to bind the variables in the template.
    op: {old values: F, I, V} → {new values: F′, J, V′} + binding rules
    For $i = 1, \dots, n$:
    - used entities: $[\langle F = X_m,\ I = i,\ V = D_{i,X_m} \rangle \mid X_m \in X]$
    - generated entities: $[\langle F' = Y_h,\ J = i,\ V' = f(D_{i,X}) \rangle \mid Y_h \in Y]$
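The binding rules above can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: the operator is row-preserving (input and output share the same index), and the helper name `bind_entities` and the dictionary keys are hypothetical, not part of DPDS.

```python
import pandas as pd

def bind_entities(d_in, d_out, in_cols, out_cols):
    """Return one (used, generated) pair of entity lists per row index.

    Hypothetical helper: each entity is a dict holding feature, index and
    value, mirroring the <F, I, V> triples of the binding rules.
    """
    bindings = []
    for i in d_in.index:
        used = [{"feature": x, "index": i, "value": d_in.at[i, x]}
                for x in in_cols]
        generated = [{"feature": y, "index": i, "value": d_out.at[i, y]}
                     for y in out_cols]
        bindings.append((used, generated))
    return bindings

# Toy input: the ageRange column is derived from Age (values are made up).
d_in = pd.DataFrame({"Age": [23, 67]})
d_out = d_in.assign(ageRange=["young", "senior"])
pairs = bind_entities(d_in, d_out, ["Age"], ["ageRange"])
```

Each element of `pairs` holds the used entities and the generated entities for one row, which is the information a template needs to be instantiated.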
  • 16. Implementation by shape and value diff
    [Figure: decision diagram mapping diff outcomes to templates]
    - Shape changes checked: rows added? rows removed? columns added? columns removed?
    - Shape templates: horizontal augmentation, reduction by selection, reduction by projection, data transformation (composite), data transformation
    - Value changes checked for each column: nulls reduced? values changed?
    - Value templates: data transformation (imputation), data transformation, 1-1 derivations
    For each input/output pair Din, Dout of dataframes:
    1. Diff both the shapes and the values of Din and Dout
    2. Use the diff to:
       • select the appropriate template
       • bind the template variables using the relevant values in the two dataframes
       • generate an instantiated provlet
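The diff step can be sketched as follows. This is a minimal illustration, not the DPDS implementation: the function names `shape_diff` and `value_diff` are hypothetical, index labels are assumed to be unique, and the mapping from diff outcomes to templates is deliberately left out.

```python
import pandas as pd

def shape_diff(d_in, d_out):
    """Compare the row and column labels of two dataframes."""
    return {
        "rows_added": len(d_out.index.difference(d_in.index)) > 0,
        "rows_removed": len(d_in.index.difference(d_out.index)) > 0,
        "cols_added": len(d_out.columns.difference(d_in.columns)) > 0,
        "cols_removed": len(d_in.columns.difference(d_out.columns)) > 0,
    }

def value_diff(d_in, d_out):
    """For each shared column, report whether nulls decreased and values changed."""
    rows = d_in.index.intersection(d_out.index)
    report = {}
    for col in d_in.columns.intersection(d_out.columns):
        before, after = d_in.loc[rows, col], d_out.loc[rows, col]
        # A cell counts as unchanged if both values are equal or both are null.
        changed = ~(before.eq(after) | (before.isna() & after.isna()))
        report[col] = {
            "nulls_reduced": int(after.isna().sum()) < int(before.isna().sum()),
            "values_changed": bool(changed.any()),
        }
    return report

# Toy imputation step: only column E has a missing value to fill.
d_in = pd.DataFrame({"E": ["E1", None], "F": ["a", "b"]})
d_out = d_in.fillna({"E": "Ex"})
shapes = shape_diff(d_in, d_out)
values = value_diff(d_in, d_out)
```

In this toy case the shape is unchanged while column E shows both fewer nulls and changed values, the pattern the slides associate with imputation.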
  • 17. Running Example
    [Figure: pipeline diagram. Da and Db are left-joined on (K1, K2) into D1; all missing values are imputed to give D2; D2 and Dc are left-joined on (K1, K2) into D3; E and F are imputed to give D4; columns 'E4', 'Ex', 'E1' are added to give D5; column 'E' is removed to give D6.]
    $D_1 = D_a \bowtie^{\mathrm{left}}_{K_1,K_2} D_b$
    $D_2 = \tau_{f_1(*)}(D_1)$
    $D_3 = D_2 \bowtie^{\mathrm{left}}_{K_1,K_2} D_c$
    $D_4 = \tau_{f_2(E,F)}(D_3)$
    $D_5 = \alpha^{\rightarrow}_{h(E):\{E4,Ex,E1\}}(D_4)$
    $D_6 = \pi_{\{Ax,B,Ay,D,C,F,E4,Ex,E1\}}(D_5)$
  • 18. Running Example
    The pipeline from Da, Db, Dc to D6, written in pandas:
        import pandas as pd

        df = pd.merge(df_A, df_B, on=['key1', 'key2'], how='left')  # join
        df = df.fillna('imputed')  # imputation of all missing values
        df = pd.merge(df, df_C, on=['key1', 'key2'], how='left')  # join
        df = df.fillna(value={'E': 'Ex', 'F': 'Fx'})  # imputation of E and F
        # one-hot encoding of column E
        c = 'E'
        dummies = []
        dummies.append(pd.get_dummies(df[c]))
        df_dummies = pd.concat(dummies, axis=1)
        df = pd.concat((df, df_dummies), axis=1)
        df = df.drop([c], axis=1)  # remove the encoded column
  • 19. Running Example
    Dataframes | Diff | Template
    D1 ← {Da, Db} | Explicit join | provenance pattern
    D2 ← D1 | value change, reduced nulls → imputation | Data transformation
    D3 ← {D2, Dc} | Explicit join | provenance pattern
    D4 ← D3 | value change, reduced nulls → imputation | Data transformation
    D5 ← D4 | Shape change, column(s) added | <wait!>
    D6 ← D5 | Shape change, column(s) removed | Data transformation, composite
  • 20. Program-level transparency with control
    Approach:
    - add an observer to monitor dataframe changes
    - mostly transparent to the application
    - some control over the Tracker is surfaced to the programmer
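The observer idea can be illustrated with a small sketch. The slides name a Tracker but do not show its interface, so the class below, its `active` switch and its `history` list are assumptions made for illustration only; a real tracker would pass each observed pair to the diff and binding steps rather than recording shapes.

```python
import pandas as pd

class Tracker:
    """Hypothetical observer: records each reassignment of the tracked dataframe."""

    def __init__(self, df):
        self._df = df
        self.active = True   # the control surfaced to the programmer
        self.history = []    # (input shape, output shape) per observed step

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, new_df):
        # Every assignment is one input/output pair that could be diffed.
        if self.active:
            self.history.append((self._df.shape, new_df.shape))
        self._df = new_df

t = Tracker(pd.DataFrame({"E": ["E1", None], "K1": [1, 2]}))
t.df = t.df.fillna("imputed")        # observed
t.active = False
t.df = t.df.drop(columns=["K1"])     # not observed
```

The pipeline code stays ordinary pandas; only the assignment target changes, which is one way to read "mostly transparent to the application".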
  • 21. Provenance traversals: example
    Capture, store and query element-level provenance:
    - derivation of each element of each intermediate dataframe (when possible)
    - efficiently, at scale
    [Figure: example traversal over df_A (df_-1), df_B (df_0), Join, fillna, df_1]
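A backward traversal over element-level derivations can be sketched with an in-memory dictionary standing in for the graph database. The element identifiers (dataframe name, row, column) and the example edges are hypothetical; only the wasDerivedFrom relation comes from the slides.

```python
from collections import deque

def ancestors(derived_from, element):
    """Return every element reachable from `element` via wasDerivedFrom edges."""
    seen, queue = set(), deque([element])
    while queue:
        current = queue.popleft()
        for source in derived_from.get(current, ()):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen

# Hypothetical edges: one cell of D2 derives from D1, which derives from Db.
derived_from = {
    ("D2", 0, "E"): [("D1", 0, "E")],
    ("D1", 0, "E"): [("Db", 0, "E")],
}
lineage = ancestors(derived_from, ("D2", 0, "E"))
```

The breadth-first walk visits each element once, so it also terminates on graphs where several derivation paths converge on the same source.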
  • 22. Benchmarking: data x pipelines
    [Figure: benchmark datasets and pipelines]
    Provenance graphs are stored in a single Neo4j database.
  • 23. Results
    [Figure: results per pipeline; 1, 2, 3 denote the pipeline number]
    The PT/PO ratio provides a rough indication of scalability:
    - the graphs for the complete pipelines are close in size to the sum of the sizes of the components' graphs
  • 24. Conclusions
    ✓ DPDS generates granular provenance graphs that accurately represent the underlying data processing
    ✓ A potentially useful building block towards explanations in a Data-Centric AI setting
    Limitations:
    - No granularity control, hence limited scalability
    - Operates only on Pandas dataframes