Khalid Belhajjame, Noura Faci, Zakaria Maamar, Vanilson Burégio,
Edvan Soares and Mahmoud Berhamgi
Contact: kbelhajj@gmail.com
• Data-driven analysis pipelines
• Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving
• Tools for automating frequently performed data-intensive activities
• Provenance for the resulting datasets
  • The method followed
  • The resources used
  • The datasets used
Khalid Belhajjame @ DarliAP Workshop, 2019 2
[Figure: examples of workflow applications]
• GWAS, pharmacogenomics: association study of Nevirapine-induced skin rash in Thai population
• Trypanosomiasis (sleeping sickness parasite) in African cattle
• Astronomy and heliophysics
• Library document preservation
• Systems biology of micro-organisms
• Observing Systems Simulation Experiments (JPL, NASA)
• Biodiversity: invasive species modelling
[Credit: Carole A. Goble]
• In fields such as biomedicine and the social and behavioral sciences, workflow executions manipulate and generate sensitive information about individuals.
• There is, therefore, a serious concern that datasets may be inappropriately manipulated or misused during experiments, leading to leaks of sensitive data.
• Publishing the provenance of the executions of such workflows raises privacy concerns.
To our knowledge, no existing proposal assists scientists in the task of anonymizing the provenance of their experiments.
Our objective: we seek to assist scientists in the task of anonymizing workflow provenance to preserve the privacy of individuals.
• Most related work in the area has focused on the problem of securing workflow provenance and controlling access to it.
• Protecting the integrity of provenance data from corruption using cryptography techniques [Hasan and Khan, 2017; Lyle and Martin, 2010].
• Deriving a partial view on a workflow that conforms to pre-specified access permissions on the modules' inputs and outputs and their dependencies [Chebotko et al., 2008; Cohen Boulakia et al., 2008].
• Policy languages allowing scientists to specify relationships between datasets and the workflow modules, and their properties relevant to datasets [Alhaqbani et al., 2013; Gil et al., 2010].
• Protecting the privacy of the modules that compose the workflow by hiding certain parameters (attributes) of those modules [Davidson et al., 2011].
‘Differential privacy formalizes the idea that a "private" computation should not reveal whether any one person participated in the input or not, much less what their data are.’ [Frank McSherry]
(https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-03.md)
[Figure: three values of $320k, $340k and $330k next to an outlier of $30M. Sensitivity of the median ≈ 10k; sensitivity of the mean ≈ 30M. Credit: Steve Touw, Immuta]
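The contrast between the two statistics can be reproduced in a few lines of Python. This is only an illustrative sketch: it measures how far each statistic moves when the $30M record joins the three other values from the figure, which is one simple reading of "sensitivity". The approximate figures quoted on the slide may rest on a different definition, and the helper name `shift` is hypothetical.

```python
from statistics import mean, median

# Values taken from the figure; the helper below is a hypothetical illustration.
base = [320_000, 340_000, 330_000]
outlier = 30_000_000

def shift(stat, records, extra):
    """Absolute change of `stat` when one extra record joins the dataset."""
    return abs(stat(records + [extra]) - stat(records))

median_shift = shift(median, base, outlier)
mean_shift = shift(mean, base, outlier)

print(f"median shifts by {median_shift:,.0f}")
print(f"mean shifts by {mean_shift:,.0f}")
```

The median barely moves while the mean moves by millions, which is why a differentially private release of the mean would need far more noise than a release of the median.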
• For our work, we chose the most fundamental anonymization model, namely k-anonymity, which was proposed to protect individual privacy in data publishing.
• While k-anonymity is less powerful than differential privacy, it is suitable for our purposes, given that it provides the means for:
  • exploring the provenance of workflows,
  • examining the data products used and generated by the workflows,
  • preserving (to a certain extent) lineage information between data products.
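As a reminder of what k-anonymity requires, the following minimal sketch checks whether a set of records is k-anonymous with respect to chosen quasi-identifiers: every combination of quasi-identifier values must be shared by at least k records. The function name, the attribute names and the sample records are hypothetical and are not taken from the talk.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records."""
    groups = Counter(
        tuple(rec[attr] for attr in quasi_identifiers) for rec in records
    )
    return all(count >= k for count in groups.values())

# Hypothetical records whose quasi-identifiers are already generalized.
records = [
    {"age": "30-39", "zip": "750**", "diagnosis": "rash"},
    {"age": "30-39", "zip": "750**", "diagnosis": "none"},
    {"age": "40-49", "zip": "690**", "diagnosis": "rash"},
    {"age": "40-49", "zip": "690**", "diagnosis": "rash"},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))
print(is_k_anonymous(records, ["age", "zip"], k=3))
```

Because the records themselves remain in the published dataset (only generalized), the links between data products can still be followed, which is the property the bullets above rely on.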
• A workflow is defined by a triple [formula not preserved in the transcript].
• An operation op in OP is defined as [formula not preserved in the transcript].
• The data links: [formula not preserved in the transcript].
• Sensitive parameters
To specify that a given input or output parameter carries sensitive data, we use a boolean function [formula not preserved in the transcript] that is true if the data bound to <op, p> during the execution are sensitive.
• Anonymity degree
We use a function [formula not preserved in the transcript] to specify the anonymity degree of the parameter <op, p> with respect to a workflow instance insWf.
• Manually identifying a workflow's sensitive parameters and setting their anonymity degrees can be tedious.
• This is especially the case when the workflow includes a large number of operations.
• We assist the scientist in this task by leveraging parameter dependencies.
• A parameter <op, p> depends on a parameter <op', p'> in a workflow (DWf) if, during the execution of (DWf), the data bound to <op', p'> contribute to or influence the data bound to <op, p>.
• Given a workflow (DWf), the dependencies between its parameters are inferred as follows:
  • Given an operation (op) that belongs to (DWf), we can infer that the outputs of (op) depend on its inputs.
  • If the workflow (DWf) contains a data link connecting an output <op, o> to an input <op', i>, then <op', i> depends on <op, o>.
  • We also derive dependencies transitively: if <op, p> depends on <op', p'> and <op', p'> depends on <op'', p''>, then <op, p> depends on <op'', p''>.
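The three inference rules can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary-based workflow encoding, the function name `infer_dependencies` and the two-step example workflow are all hypothetical.

```python
from itertools import product

# Hypothetical workflow: operation name -> (input names, output names).
operations = {
    "align": (["reads"], ["alignment"]),
    "call": (["alignment"], ["variants"]),
}
# Data links: (producer op, output) -> (consumer op, input).
data_links = [(("align", "alignment"), ("call", "alignment"))]

def infer_dependencies(operations, data_links):
    """Return the set of pairs (a, b) meaning parameter a depends on b."""
    deps = set()
    # Rule 1: every output of an operation depends on each of its inputs.
    for op, (inputs, outputs) in operations.items():
        for o, i in product(outputs, inputs):
            deps.add(((op, o), (op, i)))
    # Rule 2: the input at the end of a data link depends on the linked output.
    for source, target in data_links:
        deps.add((target, source))
    # Rule 3: transitive closure, repeated until no new pair appears.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(deps), repeat=2):
            if b == c and (a, d) not in deps:
                deps.add((a, d))
                changed = True
    return deps

deps = infer_dependencies(operations, data_links)
print((("call", "variants"), ("align", "reads")) in deps)
```

The fixed-point loop is the simplest way to compute the closure; for large workflows a graph traversal per parameter would scale better.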
• A parameter <op', p'> that is not an input to the workflow may be sensitive if it depends on a workflow input that is known to be sensitive.
• Note that we say "may be" sensitive, because an operation that consumes sensitive datasets may produce non-sensitive datasets.
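A minimal sketch of this rule, assuming the dependencies are available as a transitively closed set of pairs (parameter, parameter it depends on). The function name, the pair encoding and the sample data are hypothetical.

```python
def candidate_sensitive(deps, sensitive_inputs):
    """Parameters that depend on at least one sensitive workflow input."""
    return {param for param, source in deps if source in sensitive_inputs}

# Hypothetical dependency pairs (parameter, parameter it depends on),
# assumed to be already transitively closed.
deps = {
    (("align", "alignment"), ("align", "reads")),
    (("call", "variants"), ("align", "reads")),
    (("report", "summary"), ("report", "template")),
}
# Workflow inputs that the scientist has marked as sensitive.
sensitive_inputs = {("align", "reads")}

flagged = candidate_sensitive(deps, sensitive_inputs)
print(sorted(flagged))
```

The result is only a list of candidates: the scientist still has to confirm or dismiss each flagged parameter, since an operation may turn sensitive inputs into non-sensitive outputs.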
• In addition to helping the designer identify sensitive intermediate and final output parameters, we also infer the anonymity degree that should be applied to the dataset instances of those sensitive parameters.
• The anonymity degree of a parameter <op', p'> given a workflow execution insWf can be defined as the maximum degree of the sensitive datasets that are used as input to the workflow and that contribute to the dataset instances of <op', p'>.
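The "maximum degree" rule can be sketched as below. Treating the anonymity degree as the k of k-anonymity is an assumption made for this illustration, and the function name, the pair encoding of dependencies and the sample values are hypothetical.

```python
def inferred_degree(param, deps, input_degrees):
    """Maximum anonymity degree among the sensitive workflow inputs that
    `param` depends on, or None if it depends on no sensitive input."""
    degrees = [
        input_degrees[source]
        for target, source in deps
        if target == param and source in input_degrees
    ]
    return max(degrees) if degrees else None

# Hypothetical, transitively closed pairs (parameter, input it depends on).
deps = {
    (("merge", "table"), ("load_a", "patients")),
    (("merge", "table"), ("load_b", "visits")),
    (("merge", "table"), ("load_c", "codes")),
}
# Degrees set by the scientist for the sensitive inputs of one execution.
input_degrees = {("load_a", "patients"): 5, ("load_b", "visits"): 10}

degree = inferred_degree(("merge", "table"), deps, input_degrees)
print(degree)
```

Taking the maximum is the conservative choice: a dataset derived from several sensitive inputs is protected at least as strongly as the most demanding of them.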
[Figure: architecture diagram]
• Labelled elements: data owners, sensitive data, non-sensitive data, public data repositories, trusted workflow environment, workflow workbench, workflow execution engine, data anonymizer, private data repository.
• Interaction labels (steps numbered 1 to 7 in the figure): share, launch execution, get inputs, store outputs, publish data, get data, launch data anonymization.
• For validation purposes, we used 20 different CWL workflows [1] and performed 500 executions per workflow. We computed the overhead of our method in terms of the computation of parameter dependencies, the identification of sensitive parameters and the computation of anonymity degrees.
• The results showed that the overhead is small compared to the execution of the workflow: on average, it takes less than a millisecond to perform all the necessary computation.
[1] view.commonwl.org/workflows
• We presented an approach for preserving privacy in the context of scientific workflows that rely heavily on large datasets.
• We have shown how data plays a role in i) identifying sensitive operation parameters in the workflow and ii) deriving the anonymity degree that needs to be enforced when publishing the dataset instances of these parameters.
• This is preliminary work that opens up opportunities for more research in the field of anonymization of workflow data.

Privacy-Preserving Data Analysis Workflows for eScience


Editor's Notes

  1. In this age of data-intensive science we’re witnessing the unprecedented generation and sharing of large scientific datasets, where the pace of data generation has far surpassed the pace of conducting analysis over the data. Scientific Workflows [6] are a recent but very popular method for task automation and resource integration. Using workflows, scientists are able to systematically weave datasets and analytical tools into pipelines, represented as networks of data processing operations connected with dataflow links. (Figure 1 illustrates a workflow from genomics, which “from a given set of gene ids, retrieves corresponding enzyme ids and finds the biological pathways involving them, then for each pathway retrieves its diagram with a designated coloring scheme”). As well as being automation pipelines, workflows are of paramount importance for the provenance of data generated from their execution [6]. Provenance refers to data’s derivation history starting from the original sources, namely its lineage.
  2. In fields such as biomedicine and the social and behavioral sciences, workflow executions manipulate and generate sensitive information about individuals. There is a serious concern that datasets may be inappropriately manipulated or misused during experiments, leading to leaks of sensitive data. Although this could happen inadvertently, the consequences remain the same. Publishing the provenance of the executions of such workflows raises privacy concerns. For example, record linking techniques can be applied to provenance traces to cross-reference datasets used and generated by the workflow modules with the intention of revealing private or sensitive information about individuals, thereby violating basic privacy rights.
  3. Some proposals protect the integrity of provenance data from corruption using sophisticated secure computing and cryptography techniques [Hasan and Khan, 2017; Lyle and Martin, 2010]. Chebotko {\em et al.} \cite{DBLP:conf/waim/ChebotkoCLFY08} discuss means for deriving a partial view on a workflow that conforms to pre-specified access permissions on the modules' inputs and outputs and their dependencies. Gil {\em et al.} \cite{DBLP:conf/semweb/CheungG07,DBLP:conf/aaaiss/GilF10} and Alhaqbani {\em et al.} \cite{Alhaqbani2013} proposed policy languages allowing scientists to specify relationships between datasets and the workflow modules, and their properties relevant to datasets. Policies can be used, for instance, to specify that the data instances of a module's output need to be anonymized. In doing so, however, the policy language does not specify how the datasets are to be anonymized, and even less how their lineage information is to be preserved. Davidson {\em et al.} \cite{DBLP:conf/icdt/DavidsonKRSTC11,DBLP:conf/pods/DavidsonKMPR11,DBLP:conf/cidr/DavidsonKTRCMS11} investigated a related problem but with a focus on module privacy. The objective of this line of proposals is to identify the subset of the inputs and outputs of the workflow modules, or more specifically attributes thereof, that need to be hidden to keep the functionality of the workflow modules hidden. Our objective is different in that we consider the modules that compose the workflow to be public, and we seek to anonymize the workflow provenance in order to hide sensitive information about individuals from the provenance records. In doing so, we examine anonymization techniques that generalize attribute values of data records, as opposed to hiding the attributes completely as done in \cite{DBLP:conf/pods/DavidsonKMPR11}.
  4. The intuition of differential privacy is that the removal or addition of a single record does not significantly affect the outcome of any analysis. With differential privacy, exploration is very hard under a privacy budget: you more or less have to know up front the questions you intend to ask, and you can only ask aggregate questions. Different techniques have been proposed in the literature for protecting the privacy of individuals, e.g., k-anonymity [28, 31], l-diversity [24], t-closeness [22] and differential privacy [11]. In particular, differential privacy [11] has recently gained momentum as the method of choice in statistical databases. It involves adding random noise to the data so that the distribution of the resulting dataset is almost invariant to the inclusion of any data record. While extremely powerful, differential privacy is not suitable for our purposes: in order to preserve a more rigorous guarantee of privacy, its application may hamper the utility of the anonymized provenance data [29]. Indeed, for it to be useful, provenance information should keep track of the data records that have been used and generated by the workflow modules as well as their connections (lineage), which may be lost or broken when applying differential privacy techniques. [Khalid: you need to check the validity of the following statement, with evidence (paper).]
  5. The anonymity degree of a parameter ($\mathtt{\langle p, op \rangle}$) of a~$\mathtt{DWf}$ is defined with respect to a given $\mathtt{DWf}$ instance~($\mathtt{insWf}$). Indeed, different instances of $\mathtt{DWf}$ may have input datasets with different anonymity degree requirements. For example, the owner of an input dataset used for a given workflow instance ($\mathtt{insWf_1}$) may impose a more stringent anonymity degree than the owner of an input dataset used for a different workflow instance ($\mathtt{insWf_2}$).
  6. Manually identifying the sensitive parameters of a workflow and setting their anonymity degrees can be tedious. This becomes a serious concern when the workflow includes a large number of operations. To address this issue, we propose in this section an approach that takes as input the sensitivity of the input parameters of the workflow (DWf) together with their anonymity degrees. It then detects the list of (intermediate and final) parameters in (DWf) that may be sensitive, and infers the anonymity degree that should be applied to the datasets bound to those parameters during the execution of (DWf).
  7. Taking the maximum anonymity degree of the contributing inputs ensures that the anonymity degrees imposed on those inputs are honored by the dependent parameter in question.
  8. This work opens up opportunities for more research in the field of anonymization of workflow data. In this respect, our ongoing work includes investigating the applicability of our solution to anonymization techniques
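
The propagation described in notes 6 and 7 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `propagate_degrees`, the dictionary-based workflow encoding, and the example operations are all hypothetical. It also makes a simplifying assumption that every output of an operation depends on all of that operation's inputs, whereas the approach presented here uses the data itself to decide which inputs actually contribute.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def propagate_degrees(operations, links, input_degrees):
    """Infer anonymity degrees for intermediate and final parameters.

    operations:    op name -> (input parameter names, output parameter names)
    links:         (input param, consumer op) -> (output param, producer op)
    input_degrees: (param, op) -> anonymity degree of a sensitive workflow input
    Returns (param, op) -> degree for every parameter found to be sensitive.
    """
    # Visit producers before consumers.
    deps = {op: set() for op in operations}
    for (_, consumer), (_, producer) in links.items():
        deps[consumer].add(producer)

    degrees = dict(input_degrees)
    for op in TopologicalSorter(deps).static_order():
        inputs, outputs = operations[op]
        # An input fed by a sensitive upstream output inherits its degree.
        for p in inputs:
            source = links.get((p, op))
            if source in degrees:
                degrees[(p, op)] = degrees[source]
        # Assumption: each output depends on all inputs of the operation.
        contributing = [degrees[(p, op)] for p in inputs if (p, op) in degrees]
        if contributing:
            strictest = max(contributing)  # honors every contributing input
            for p in outputs:
                degrees[(p, op)] = strictest
    return degrees


# Hypothetical three-operation workflow.
operations = {
    "link": (["patients", "reference"], ["linked"]),
    "stats": (["records"], ["report"]),
    "plot": (["catalog"], ["figure"]),
}
links = {("records", "stats"): ("linked", "link")}
input_degrees = {("patients", "link"): 5, ("reference", "link"): 2}

result = propagate_degrees(operations, links, input_degrees)
```

In this example the output of `link` receives degree 5, the stricter of its two input degrees, and that degree carries through to `stats`. The `plot` operation has no sensitive input, so its output is not flagged.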