This document discusses an approach for preserving privacy in scientific workflows that use large datasets. It proposes using k-anonymity to anonymize sensitive workflow data. Parameter dependencies are leveraged to identify sensitive parameters and to infer appropriate anonymity degrees. The approach was tested on 20 CWL workflows, with an average overhead of less than a millisecond. This preliminary work aims to assist scientists in anonymizing workflow data while enabling exploration of provenance and data products.
2. Data-driven analysis pipelines
- Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving
- Tools for automating frequently performed data-intensive activities
- Provenance for the resulting datasets:
  - The method followed
  - The resources used
  - The datasets used
Khalid Belhajjame @ DarliAP Workshop, 2019 2
3. Example workflow applications:
- GWAS, pharmacogenomics: association study of Nevirapine-induced skin rash in the Thai population
- Trypanosomiasis (sleeping-sickness parasite) in African cattle
- Astronomy & heliophysics
- Library document preservation
- Systems biology of micro-organisms
- Observing Systems Simulation Experiments (JPL, NASA)
- Biodiversity: invasive-species modelling
[Credit: Carole A. Goble]
4. In fields such as biomedicine and the social and behavioral sciences, workflow executions manipulate and generate sensitive information about individuals.
There is, therefore, a serious concern that datasets may be inappropriately manipulated or misused during experiments, leading to leaks or misuse of sensitive data.
Publishing the provenance of the executions of such workflows raises privacy concerns.
5. To our knowledge, no existing proposal assists scientists in the task of anonymizing the provenance of their experiments.
Our objective: to assist scientists in anonymizing workflow provenance so as to preserve the privacy of individuals.
Most related work in the area has focused on securing workflow provenance and policing access to it:
- Protecting the integrity of provenance data from corruption using cryptographic techniques [Hasan and Khan, 2017; Lyle and Martin, 2010].
- Deriving a partial view of a workflow that conforms to pre-specified access permissions on the modules' inputs, outputs, and their dependencies [Chebotko et al., 2008; Cohen-Boulakia et al., 2008].
- Policy languages allowing scientists to specify relationships between datasets and workflow modules, and module properties relevant to datasets [Alhaqbani et al., 2013; Gil et al., 2010].
- Protecting the privacy of the modules that compose the workflow by hiding certain parameters (attributes) of those modules [Davidson et al., 2011].
6. [Credit: Steve Touw, Immuta]
‘Differential privacy formalizes the idea that a "private" computation should
not reveal whether any one person participated in the input or not, much
less what their data are.’ - [Frank McSherry]
(https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-03.md)
Example salaries: $320k, $340k, $330k, and one outlier of $30M.
Sensitivity of the median ≈ $10k; sensitivity of the mean ≈ $30M.
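To make the slide's point concrete, the sketch below computes, for these specific salaries, how much the median and the mean can change when a single record is removed. The numbers are illustrative only; the slide's ~$10k / ~$30M figures correspond to a worst-case (global-sensitivity) convention over all possible datasets, not just this one.

```python
def median(xs):
    ys = sorted(xs)
    n = len(ys)
    mid = n // 2
    return ys[mid] if n % 2 else (ys[mid - 1] + ys[mid]) / 2

def mean(xs):
    return sum(xs) / len(xs)

salaries = [320_000, 340_000, 330_000, 30_000_000]

# Worst-case change in each statistic from removing a single record.
med_sens = max(abs(median(salaries) - median(salaries[:i] + salaries[i + 1:]))
               for i in range(len(salaries)))
mean_sens = max(abs(mean(salaries) - mean(salaries[:i] + salaries[i + 1:]))
                for i in range(len(salaries)))

print(med_sens)   # small: the median barely moves when one record is dropped
print(mean_sens)  # huge: the mean is dominated by the $30M outlier
```

The gap between the two values is exactly why differential privacy must add far more noise to release a mean than a median over salary data.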
7. For our work, we chose the most fundamental anonymization privacy model, k-anonymity, which was proposed to protect individual privacy in data publishing.
While k-anonymity is less powerful than differential privacy, it suits our purposes, given that it provides the means for:
- Exploring the provenance of workflows,
- Examining the data products used and generated by the workflows,
- Preserving (to a certain extent) lineage information between data products.
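The following is a minimal sketch of k-anonymity by generalization, not the paper's implementation: the records, the decade bucketing of ages, and the ZIP-code truncation are all hypothetical, chosen only to show the property being checked.

```python
from collections import Counter

def generalize(record):
    """Generalize quasi-identifiers: bucket age into decades, truncate the ZIP code."""
    age, zipcode, diagnosis = record
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zipcode[:3] + "**", diagnosis)

def is_k_anonymous(records, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(r[:2] for r in records)  # first two fields are quasi-identifiers
    return all(count >= k for count in groups.values())

raw = [(34, "75011", "flu"), (37, "75013", "asthma"),
       (31, "75019", "flu"), (52, "69002", "diabetes"),
       (58, "69005", "flu"), (55, "69008", "asthma")]

anonymized = [generalize(r) for r in raw]
print(is_k_anonymous(anonymized, k=3))  # True: each (age band, ZIP prefix) group has 3 records
```

Note that the sensitive attribute (the diagnosis) is published as-is; k-anonymity only blurs the quasi-identifiers so that no record can be singled out within its group.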
8.
- A workflow (DWf) is defined by a triple comprising, in particular, its set of operations (OP) and the data links that connect them.
- An operation op in OP is defined by its input and output parameters; a parameter is identified by the pair ⟨op, p⟩.
- A data link connects an output parameter ⟨op, o⟩ of one operation to an input parameter ⟨op', i⟩ of another.
13. Sensitive parameters
To specify that a given input or output parameter carries sensitive data, we use a boolean function sensitive(⟨op, p⟩) that is true if the data bound to ⟨op, p⟩ during the execution are sensitive.
Anonymity degree
We use a function degree(⟨op, p⟩, insWf) to specify the anonymity degree of the parameter ⟨op, p⟩ with respect to a workflow instance insWf.
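One way to represent these two annotations in code. The function names follow the slide; the dictionary-backed representation and the operation/parameter names (`load_patients`, `records`, etc.) are assumptions made for illustration.

```python
# Sensitivity flags keyed by parameter <op, p>.
sensitive_flags = {
    ("load_patients", "records"): True,    # hypothetical sensitive workflow input
    ("load_reference", "genome"): False,
}

# Anonymity degree per parameter and per workflow instance insWf:
# the k required when publishing the dataset bound to <op, p> in that instance.
anonymity_degrees = {
    ("load_patients", "records", "insWf_1"): 5,
    ("load_patients", "records", "insWf_2"): 10,  # a stricter data owner
}

def sensitive(op, p):
    """True if the data bound to <op, p> during execution are sensitive."""
    return sensitive_flags.get((op, p), False)

def degree(op, p, ins_wf):
    """Anonymity degree of <op, p> for workflow instance ins_wf (0 = no requirement)."""
    return anonymity_degrees.get((op, p, ins_wf), 0)

print(sensitive("load_patients", "records"))          # True
print(degree("load_patients", "records", "insWf_2"))  # 10
```

Keying the degree by instance reflects the point made later: different executions of the same workflow may consume datasets whose owners impose different anonymity requirements.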
14. Manual identification of a workflow's sensitive parameters and the setting of their anonymity degrees can be tedious. This is especially the case when the workflow includes a large number of operations. We assist the scientist in this task by leveraging parameter dependencies.
15. A parameter ⟨op, p⟩ depends on a parameter ⟨op', p'⟩ in a workflow (DWf) if, during the execution of (DWf), the data bound to ⟨op', p'⟩ contribute to or influence the data bound to ⟨op, p⟩.
Given a workflow (DWf), the dependencies between its parameters are inferred as follows:
- Given an operation (op) that belongs to (DWf), we can infer that the outputs of (op) depend on its inputs.
- If the workflow (DWf) contains a data link connecting an output ⟨op, o⟩ to an input ⟨op', i⟩, then ⟨op', i⟩ depends on ⟨op, o⟩.
- We also transitively derive dependencies between the operation parameters: if ⟨op, p⟩ depends on ⟨op', p'⟩ and ⟨op', p'⟩ depends on ⟨op'', p''⟩, then ⟨op, p⟩ depends on ⟨op'', p''⟩.
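The three inference rules can be sketched as follows. This is a minimal illustration, not the paper's implementation; the two-operation workflow (`extract`, `analyze`) and the dictionary representation are assumptions.

```python
from itertools import product

# Hypothetical workflow: operations with named inputs/outputs, plus data links
# connecting an output <op, o> to an input <op', i>.
operations = {
    "extract": {"inputs": ["cohort"], "outputs": ["table"]},
    "analyze": {"inputs": ["table_in"], "outputs": ["stats"]},
}
data_links = [(("extract", "table"), ("analyze", "table_in"))]

def infer_dependencies(operations, data_links):
    """Return the set of pairs (x, y) meaning: parameter x depends on parameter y."""
    deps = set()
    # Rule 1: each output of an operation depends on each of its inputs.
    for op, sig in operations.items():
        for o, i in product(sig["outputs"], sig["inputs"]):
            deps.add(((op, o), (op, i)))
    # Rule 2: a data link makes the consuming input depend on the producing output.
    for src, dst in data_links:
        deps.add((dst, src))
    # Rule 3: transitive closure, iterated until no new pair is added.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(deps), repeat=2):
            if b == c and (a, d) not in deps:
                deps.add((a, d))
                changed = True
    return deps

deps = infer_dependencies(operations, data_links)
print((("analyze", "stats"), ("extract", "cohort")) in deps)  # True, via transitivity
```

In this example the final output `stats` is found to depend on the workflow input `cohort` even though no direct link connects them, which is exactly the chain the next slides exploit.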
16. A parameter ⟨op', p'⟩ that is not an input to the workflow may be sensitive if it depends on a workflow input that is known to be sensitive.
Note that we say may be sensitive: an operation that consumes sensitive datasets may produce non-sensitive datasets.
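This propagation rule is a one-liner over the dependency relation. The dependency pairs and parameter names below are hypothetical; (x, y) reads "x depends on y".

```python
deps = {
    (("analyze", "stats"), ("extract", "cohort")),
    (("analyze", "plot"), ("extract", "settings")),
}
sensitive_inputs = {("extract", "cohort")}  # workflow inputs flagged as sensitive

def may_be_sensitive(param, deps, sensitive_inputs):
    """Flag param as possibly sensitive if it depends on any sensitive workflow input.
    Only 'may be': an operation consuming sensitive data can emit non-sensitive results."""
    return any((param, src) in deps for src in sensitive_inputs)

print(may_be_sensitive(("analyze", "stats"), deps, sensitive_inputs))  # True
print(may_be_sensitive(("analyze", "plot"), deps, sensitive_inputs))   # False
```

Because the rule only says "may be", the scientist still confirms or dismisses each flagged parameter; the inference narrows the review down from all parameters to the flagged ones.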
17. In addition to assisting the designer in identifying sensitive intermediate and final output parameters, we also infer the anonymity degree that should be applied to the dataset instances of those sensitive parameters.
The anonymity degree of a parameter ⟨op', p'⟩, given a workflow execution insWf, is defined as the maximum degree of the sensitive input datasets of the workflow that contribute to the dataset instances of ⟨op', p'⟩.
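The max-over-contributing-inputs definition can be sketched as below. All names and degrees are hypothetical; taking the maximum guarantees that the strictest requirement among the contributing inputs is honored.

```python
def inferred_degree(param, deps, input_degrees):
    """Anonymity degree of param for one workflow instance: the maximum degree
    among the sensitive workflow inputs that contribute to param (0 if none)."""
    contributing = [d for src, d in input_degrees.items() if (param, src) in deps]
    return max(contributing, default=0)

# (x, y) reads "x depends on y"; stats draws on both input datasets.
deps = {
    (("analyze", "stats"), ("load_a", "data")),
    (("analyze", "stats"), ("load_b", "data")),
}
# Degrees imposed by the owners of the input datasets for this instance.
input_degrees = {("load_a", "data"): 5, ("load_b", "data"): 10}

print(inferred_degree(("analyze", "stats"), deps, input_degrees))  # 10, the stricter owner wins
```

Had the minimum (or either input's degree alone) been used, publishing `stats` with k=5 would violate the k=10 requirement imposed by the owner of the second input.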
18. [Architecture figure: data owners share sensitive and non-sensitive data. Non-sensitive data reside in public data repositories; sensitive data sit in a private data repository within a trusted workflow environment comprising a workflow workbench, a workflow execution engine, and a data anonymizer. The numbered steps cover: share, launch execution, get inputs, store outputs, get data, launch data anonymization, and publish data.]
19. For validation purposes, we used 20 different CWL workflows [1], performed 500 executions per workflow, and measured the overhead of our method: the computation of parameter dependencies, the identification of sensitive parameters, and the computation of the anonymity degree.
The results showed that the overhead is small compared to the execution of the workflow: on average, all the necessary computation takes less than a millisecond.
[1] view.commonwl.org/workflows
20. We presented an approach for preserving privacy in the context of scientific workflows that rely heavily on large datasets.
We have shown how the data play a role in (i) identifying sensitive operation parameters in the workflow and (ii) deriving the anonymity degree that needs to be enforced when publishing the dataset instances of these parameters.
This preliminary work opens up opportunities for more research in the field of anonymizing workflow data.
In this age of data-intensive science we’re witnessing the unprecedented generation and sharing of large scientific datasets, where the pace of data generation has far surpassed the pace of conducting analysis over the data. Scientific Workflows [6] are a recent but very popular method for task automation and resource integration. Using workflows, scientists are able to systematically weave datasets and analytical tools into pipelines, represented as networks of data processing operations connected with dataflow links.
(Figure 1 illustrates a workflow from genomics, which “from a given set of gene ids, retrieves corresponding enzyme ids and finds the biological pathways involving them, then for each pathway retrieves its diagram with a designated coloring scheme”).
As well as being automation pipelines, workflows are of paramount importance for the provenance of data generated from their execution [6]. Provenance refers to data’s derivation history starting from the original sources, namely its lineage.
In fields such as biomedicine and social and behavioral sciences, workflow executions manipulate and generate sensitive information about individuals.
There is a serious concern that datasets may be inappropriately manipulated or misused during experiments, leading to leaks or misuse of sensitive data. Although this could happen inadvertently, the consequences remain the same.
Publishing the provenance of the executions of such workflows raises privacy concerns. For example, record linking techniques can be applied to provenance traces to cross-reference datasets used and generated by the workflow modules with the intention to reveal private or sensitive information about individuals, thereby violating basic privacy rights.
Protecting the integrity of provenance data from corruption using sophisticated secure computing and cryptography techniques has been investigated in prior work [Hasan and Khan, 2017; Lyle and Martin, 2010].
Chebotko {\em et al.} \cite{DBLP:conf/waim/ChebotkoCLFY08} discuss means for deriving a partial view of a workflow that conforms to pre-specified access permissions on the modules' inputs and outputs and their dependencies.
Gil {\em et al.} \cite{DBLP:conf/semweb/CheungG07,DBLP:conf/aaaiss/GilF10} and Alhaqbani {\em et al.} \cite{Alhaqbani2013} proposed policy languages allowing scientists to specify relationships between datasets and the workflow modules, and module properties relevant to datasets. Policies can be utilized, for instance, to specify that the data instances of a module's output need to be anonymized. In doing so, however, the policy language does not specify how the datasets are to be anonymized, and even less how their lineage information is to be preserved.
Davidson {\em et al.} \cite{DBLP:conf/icdt/DavidsonKRSTC11,DBLP:conf/pods/DavidsonKMPR11,DBLP:conf/cidr/DavidsonKTRCMS11} investigated a related problem but with a focus on module privacy. The objective of this line of proposals is to identify the subset of the inputs and outputs, or more specifically attributes thereof, of the workflow modules that need to be hidden to keep the functionality of the workflow modules hidden. Our objective is different in that we consider the modules that compose the workflow to be public, and we seek to anonymize the workflow provenance, with the objective of hiding sensitive information about individuals from the provenance records. In doing so, we examine anonymization techniques that generalize attribute values of data records, as opposed to hiding the attributes completely as done in \cite{DBLP:conf/pods/DavidsonKMPR11}.
The intuition of differential privacy is that the removal or addition of a single record does not significantly affect the outcome of any analysis. It is, however, very hard to do exploration under a privacy budget: one largely has to know up front which questions will be asked, and only aggregate questions can be asked.
Different techniques have been proposed in the literature for protecting the privacy of individuals, e.g., k-anonymity [28, 31], l-diversity [24], t-closeness [22] and differential privacy [11]. In particular, differential privacy [11] has recently gained momentum as the method of choice in statistical databases. It involves adding random noise to the data so that the distribution of the resulting dataset is almost invariant to the inclusion of any data record. While extremely powerful, differential privacy is not suitable for our purposes: in exchange for a more rigorous guarantee of privacy, its application may hamper the utility of the anonymized provenance data [29]. Indeed, for it to be useful, provenance information should keep track of the data records that have been used and generated by the workflow modules as well as their connections (lineage), which may be lost or broken when applying differential privacy techniques.
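The noise-addition idea can be sketched with the standard Laplace mechanism (a textbook construction, not something this paper implements): noise is scaled to sensitivity/ε, so an analyst sees only a perturbed aggregate, never the underlying records or their lineage.

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample from Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, sensitivity=1.0, epsilon=0.5, rng=random):
    """Differentially private count: add Laplace noise with scale sensitivity/epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

random.seed(42)
print(dp_count(120))  # close to 120, perturbed by noise of scale 2
```

The released value is useful for aggregate statistics but, by design, severs the link between the output and any individual record, which is precisely what makes the model ill-suited to publishing lineage-preserving provenance.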
The anonymity degree of a~$\mathtt{DWf}$'s parameter ($\mathtt{\langle p, op \rangle}$) is defined with respect to a given $\mathtt{DWf}$ instance~($\mathtt{insWf}$). Indeed, different instances of $\mathtt{DWf}$ may have input datasets with different anonymity-degree requirements. For example, the owner of an input dataset used for a given workflow instance ($\mathtt{insWf_1}$) may impose a more stringent anonymity degree than the owner of an input dataset used for a different workflow instance ($\mathtt{insWf_2}$).
Manual identification of a workflow’s parameters that are sensitive and setting their anonymity degrees can be tedious. This becomes a serious concern when the workflow includes a large number of operations. To address this issue, we propose in this section an approach that takes as input the sensitivity of the input parameters of the workflow (DWf) together with their anonymity degrees. It then detects the list of (intermediate and final) parameters in (DWf) that may be sensitive, and infers the anonymity degree that should be applied to the datasets bound to those parameters during the execution of (DWf).
Taking the maximum anonymity degree of the contributing inputs ensures that the anonymity degrees imposed on such inputs are honored by the dependent parameter in question.
This work opens up opportunities for more research in the field of anonymizing workflow data. In this respect, our ongoing work includes investigating the applicability of our solution to other anonymization techniques.