This document discusses an approach for preserving privacy in scientific workflows that use large datasets. It proposes using k-anonymity to anonymize sensitive workflow data. Parameter dependencies are leveraged to identify sensitive parameters and to infer appropriate anonymity degrees. The approach was tested on 20 CWL workflows, with an average overhead of less than a millisecond. This preliminary work aims to assist scientists in anonymizing workflow data while enabling exploration of provenance and data products.
2. Data-driven analysis pipelines
- Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving
- Tools for automating frequently performed data-intensive activities
- Provenance for the resulting datasets:
  - The method followed
  - The resources used
  - The datasets used
Khalid Belhajjame @ DarliAP Workshop, 2019 2
3. Example workflow applications:
- GWAS, pharmacogenomics: association study of Nevirapine-induced skin rash in the Thai population
- Trypanosomiasis (sleeping-sickness parasite) in African cattle
- Astronomy & heliophysics
- Library document preservation
- Systems biology of micro-organisms
- Observing Systems Simulation Experiments (JPL, NASA)
- Biodiversity: invasive-species modelling
[Credit: Carole A. Goble]
4. In fields such as biomedicine and the social and behavioral sciences, workflow executions manipulate and generate sensitive information about individuals.
There is, therefore, a serious concern that datasets may be inappropriately manipulated or misused during experiments, leading to leaks or misuse of sensitive data.
Publishing the provenance of the executions of such workflows raises privacy concerns.
5. To our knowledge, no existing proposal assists scientists in the task of anonymizing the provenance of their experiments.
Our objective: to assist scientists in anonymizing workflow provenance so as to preserve the privacy of individuals.
Most related work in the area has focused on securing workflow provenance and policing access to it:
- Protecting the integrity of provenance data from corruption using cryptographic techniques [Hasan and Khan, 2017; Lyle and Martin, 2010].
- Deriving a partial view of a workflow that conforms to pre-specified access permissions on the modules' inputs, outputs, and their dependencies [Chebotko et al., 2008; Cohen-Boulakia et al., 2008].
- Policy languages allowing scientists to specify relationships between datasets and workflow modules, and module properties relevant to datasets [Alhaqbani et al., 2013; Gil et al., 2010].
- Protecting the privacy of the modules that compose the workflow by hiding certain parameters (attributes) of those modules [Davidson et al., 2011].
6. [Credit: Steve Touw, Immuta]
‘Differential privacy formalizes the idea that a "private" computation should
not reveal whether any one person participated in the input or not, much
less what their data are.’ - [Frank McSherry]
(https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-03.md)
Example salaries: $320k, $340k, $330k, and one outlier of $30M.
Sensitivity of the median ≈ $10k; sensitivity of the mean ≈ $30M.
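To make the slide's point concrete, the sketch below computes, for these specific salaries, how much the median and the mean can change when a single record is removed. The numbers are illustrative only; the slide's ~$10k / ~$30M figures correspond to a worst-case (global-sensitivity) convention over all possible datasets, not just this one.

```python
def median(xs):
    ys = sorted(xs)
    n = len(ys)
    mid = n // 2
    return ys[mid] if n % 2 else (ys[mid - 1] + ys[mid]) / 2

def mean(xs):
    return sum(xs) / len(xs)

salaries = [320_000, 340_000, 330_000, 30_000_000]

# Worst-case change in each statistic from removing a single record.
med_sens = max(abs(median(salaries) - median(salaries[:i] + salaries[i + 1:]))
               for i in range(len(salaries)))
mean_sens = max(abs(mean(salaries) - mean(salaries[:i] + salaries[i + 1:]))
                for i in range(len(salaries)))

print(med_sens)   # small: the median barely moves when one record is dropped
print(mean_sens)  # huge: the mean is dominated by the $30M outlier
```

The gap between the two values is exactly why differential privacy must add far more noise to release a mean than a median over salary data.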
7. For our work, we chose the most fundamental anonymization privacy model, k-anonymity, which was proposed to protect individual privacy in data publishing.
While k-anonymity is less powerful than differential privacy, it suits our purposes, given that it provides the means for:
- Exploring the provenance of workflows,
- Examining the data products used and generated by the workflows,
- Preserving (to a certain extent) lineage information between data products.
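The following is a minimal sketch of k-anonymity by generalization, not the paper's implementation: the records, the decade bucketing of ages, and the ZIP-code truncation are all hypothetical, chosen only to show the property being checked.

```python
from collections import Counter

def generalize(record):
    """Generalize quasi-identifiers: bucket age into decades, truncate the ZIP code."""
    age, zipcode, diagnosis = record
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zipcode[:3] + "**", diagnosis)

def is_k_anonymous(records, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(r[:2] for r in records)  # first two fields are quasi-identifiers
    return all(count >= k for count in groups.values())

raw = [(34, "75011", "flu"), (37, "75013", "asthma"),
       (31, "75019", "flu"), (52, "69002", "diabetes"),
       (58, "69005", "flu"), (55, "69008", "asthma")]

anonymized = [generalize(r) for r in raw]
print(is_k_anonymous(anonymized, k=3))  # True: each (age band, ZIP prefix) group has 3 records
```

Note that the sensitive attribute (the diagnosis) is published as-is; k-anonymity only blurs the quasi-identifiers so that no record can be singled out within its group.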
8.
- A workflow (DWf) is defined by a triple comprising, in particular, its set of operations (OP) and the data links that connect them.
- An operation op in OP is defined by its input and output parameters; a parameter is identified by the pair ⟨op, p⟩.
- A data link connects an output parameter ⟨op, o⟩ of one operation to an input parameter ⟨op', i⟩ of another.
13. Sensitive parameters
To specify that a given input or output parameter carries sensitive data, we use a boolean function sensitive(⟨op, p⟩) that is true if the data bound to ⟨op, p⟩ during the execution are sensitive.
Anonymity degree
We use a function degree(⟨op, p⟩, insWf) to specify the anonymity degree of the parameter ⟨op, p⟩ with respect to a workflow instance insWf.
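One way to represent these two annotations in code. The function names follow the slide; the dictionary-backed representation and the operation/parameter names (`load_patients`, `records`, etc.) are assumptions made for illustration.

```python
# Sensitivity flags keyed by parameter <op, p>.
sensitive_flags = {
    ("load_patients", "records"): True,    # hypothetical sensitive workflow input
    ("load_reference", "genome"): False,
}

# Anonymity degree per parameter and per workflow instance insWf:
# the k required when publishing the dataset bound to <op, p> in that instance.
anonymity_degrees = {
    ("load_patients", "records", "insWf_1"): 5,
    ("load_patients", "records", "insWf_2"): 10,  # a stricter data owner
}

def sensitive(op, p):
    """True if the data bound to <op, p> during execution are sensitive."""
    return sensitive_flags.get((op, p), False)

def degree(op, p, ins_wf):
    """Anonymity degree of <op, p> for workflow instance ins_wf (0 = no requirement)."""
    return anonymity_degrees.get((op, p, ins_wf), 0)

print(sensitive("load_patients", "records"))          # True
print(degree("load_patients", "records", "insWf_2"))  # 10
```

Keying the degree by instance reflects the point made later: different executions of the same workflow may consume datasets whose owners impose different anonymity requirements.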
14. Manual identification of a workflow's sensitive parameters and the setting of their anonymity degrees can be tedious. This is especially the case when the workflow includes a large number of operations. We assist the scientist in this task by leveraging parameter dependencies.
15. A parameter ⟨op, p⟩ depends on a parameter ⟨op', p'⟩ in a workflow (DWf) if, during the execution of (DWf), the data bound to ⟨op', p'⟩ contribute to or influence the data bound to ⟨op, p⟩.
Given a workflow (DWf), the dependencies between its parameters are inferred as follows:
- Given an operation (op) that belongs to (DWf), we can infer that the outputs of (op) depend on its inputs.
- If the workflow (DWf) contains a data link connecting an output ⟨op, o⟩ to an input ⟨op', i⟩, then ⟨op', i⟩ depends on ⟨op, o⟩.
- We also transitively derive dependencies between the operation parameters: if ⟨op, p⟩ depends on ⟨op', p'⟩ and ⟨op', p'⟩ depends on ⟨op'', p''⟩, then ⟨op, p⟩ depends on ⟨op'', p''⟩.
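The three inference rules can be sketched as follows. This is a minimal illustration, not the paper's implementation; the two-operation workflow (`extract`, `analyze`) and the dictionary representation are assumptions.

```python
from itertools import product

# Hypothetical workflow: operations with named inputs/outputs, plus data links
# connecting an output <op, o> to an input <op', i>.
operations = {
    "extract": {"inputs": ["cohort"], "outputs": ["table"]},
    "analyze": {"inputs": ["table_in"], "outputs": ["stats"]},
}
data_links = [(("extract", "table"), ("analyze", "table_in"))]

def infer_dependencies(operations, data_links):
    """Return the set of pairs (x, y) meaning: parameter x depends on parameter y."""
    deps = set()
    # Rule 1: each output of an operation depends on each of its inputs.
    for op, sig in operations.items():
        for o, i in product(sig["outputs"], sig["inputs"]):
            deps.add(((op, o), (op, i)))
    # Rule 2: a data link makes the consuming input depend on the producing output.
    for src, dst in data_links:
        deps.add((dst, src))
    # Rule 3: transitive closure, iterated until no new pair is added.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(deps), repeat=2):
            if b == c and (a, d) not in deps:
                deps.add((a, d))
                changed = True
    return deps

deps = infer_dependencies(operations, data_links)
print((("analyze", "stats"), ("extract", "cohort")) in deps)  # True, via transitivity
```

In this example the final output `stats` is found to depend on the workflow input `cohort` even though no direct link connects them, which is exactly the chain the next slides exploit.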
16. A parameter ⟨op', p'⟩ that is not an input to the workflow may be sensitive if it depends on a workflow input that is known to be sensitive.
Note that we say may be sensitive: an operation that consumes sensitive datasets may produce non-sensitive datasets.
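This propagation rule is a one-liner over the dependency relation. The dependency pairs and parameter names below are hypothetical; (x, y) reads "x depends on y".

```python
deps = {
    (("analyze", "stats"), ("extract", "cohort")),
    (("analyze", "plot"), ("extract", "settings")),
}
sensitive_inputs = {("extract", "cohort")}  # workflow inputs flagged as sensitive

def may_be_sensitive(param, deps, sensitive_inputs):
    """Flag param as possibly sensitive if it depends on any sensitive workflow input.
    Only 'may be': an operation consuming sensitive data can emit non-sensitive results."""
    return any((param, src) in deps for src in sensitive_inputs)

print(may_be_sensitive(("analyze", "stats"), deps, sensitive_inputs))  # True
print(may_be_sensitive(("analyze", "plot"), deps, sensitive_inputs))   # False
```

Because the rule only says "may be", the scientist still confirms or dismisses each flagged parameter; the inference narrows the review down from all parameters to the flagged ones.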
17. In addition to assisting the designer in identifying sensitive intermediate and final output parameters, we also infer the anonymity degree that should be applied to the dataset instances of those sensitive parameters.
The anonymity degree of a parameter ⟨op', p'⟩, given a workflow execution insWf, is defined as the maximum degree of the sensitive input datasets of the workflow that contribute to the dataset instances of ⟨op', p'⟩.
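The max-over-contributing-inputs definition can be sketched as below. All names and degrees are hypothetical; taking the maximum guarantees that the strictest requirement among the contributing inputs is honored.

```python
def inferred_degree(param, deps, input_degrees):
    """Anonymity degree of param for one workflow instance: the maximum degree
    among the sensitive workflow inputs that contribute to param (0 if none)."""
    contributing = [d for src, d in input_degrees.items() if (param, src) in deps]
    return max(contributing, default=0)

# (x, y) reads "x depends on y"; stats draws on both input datasets.
deps = {
    (("analyze", "stats"), ("load_a", "data")),
    (("analyze", "stats"), ("load_b", "data")),
}
# Degrees imposed by the owners of the input datasets for this instance.
input_degrees = {("load_a", "data"): 5, ("load_b", "data"): 10}

print(inferred_degree(("analyze", "stats"), deps, input_degrees))  # 10, the stricter owner wins
```

Had the minimum (or either input's degree alone) been used, publishing `stats` with k=5 would violate the k=10 requirement imposed by the owner of the second input.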
18. [Architecture figure: data owners share sensitive and non-sensitive data. Non-sensitive data reside in public data repositories; sensitive data sit in a private data repository within a trusted workflow environment comprising a workflow workbench, a workflow execution engine, and a data anonymizer. The numbered steps cover: share, launch execution, get inputs, store outputs, get data, launch data anonymization, and publish data.]
19. For validation purposes, we used 20 different CWL workflows [1], performed 500 executions per workflow, and measured the overhead of our method: the computation of parameter dependencies, the identification of sensitive parameters, and the computation of the anonymity degree.
The results showed that the overhead is small compared to the execution of the workflow: on average, all the necessary computation takes less than a millisecond.
[1] view.commonwl.org/workflows
20. We presented an approach for preserving privacy in the context of scientific workflows that rely heavily on large datasets.
We have shown how the data play a role in (i) identifying sensitive operation parameters in the workflow and (ii) deriving the anonymity degree that needs to be enforced when publishing the dataset instances of these parameters.
This preliminary work opens up opportunities for more research in the field of anonymizing workflow data.
In this age of data-intensive science we’re witnessing the unprecedented generation and sharing of large scientific datasets, where the pace of data generation has far surpassed the pace of conducting analysis over the data. Scientific Workflows [6] are a recent but very popular method for task automation and resource integration. Using workflows, scientists are able to systematically weave datasets and analytical tools into pipelines, represented as networks of data processing operations connected with dataflow links.
(Figure 1 illustrates a workflow from genomics, which “from a given set of gene ids, retrieves corresponding enzyme ids and finds the biological pathways involving them, then for each pathway retrieves its diagram with a designated coloring scheme”).
As well as being automation pipelines, workflows are of paramount importance for the provenance of data generated from their execution [6]. Provenance refers to data’s derivation history starting from the original sources, namely its lineage.
In fields such as biomedicine and social and behavioral sciences, workflow executions manipulate and generate sensitive information about individuals.
There is a serious concern that datasets may be inappropriately manipulated or misused during experiments, leading to leaks or misuse of sensitive data. Although this could happen inadvertently, the consequences remain the same.
Publishing the provenance of the executions of such workflows raises privacy concerns. For example, record linking techniques can be applied to provenance traces to cross-reference datasets used and generated by the workflow modules with the intention to reveal private or sensitive information about individuals, thereby violating basic privacy rights.
Protecting the integrity of provenance data from corruption using sophisticated secure computing and cryptography techniques has been investigated in prior work [Hasan and Khan, 2017; Lyle and Martin, 2010].
Chebotko {\em et al.} \cite{DBLP:conf/waim/ChebotkoCLFY08} discuss means for deriving a partial view of a workflow that conforms to pre-specified access permissions on the modules' inputs and outputs and their dependencies.
Gil {\em et al.} \cite{DBLP:conf/semweb/CheungG07,DBLP:conf/aaaiss/GilF10} and Alhaqbani {\em et al.} \cite{Alhaqbani2013} proposed policy languages allowing scientists to specify relationships between datasets and the workflow modules, and module properties relevant to datasets. Policies can be utilized, for instance, to specify that the data instances of a module's output need to be anonymized. In doing so, however, the policy language does not specify how the datasets are to be anonymized, and even less how their lineage information is to be preserved.
Davidson {\em et al.} \cite{DBLP:conf/icdt/DavidsonKRSTC11,DBLP:conf/pods/DavidsonKMPR11,DBLP:conf/cidr/DavidsonKTRCMS11} investigated a related problem but with a focus on module privacy. The objective of this line of proposals is to identify the subset of the inputs and outputs, or more specifically attributes thereof, of the workflow modules that need to be hidden to keep the functionality of the workflow modules hidden. Our objective is different in that we consider the modules that compose the workflow to be public, and we seek to anonymize the workflow provenance, with the objective of hiding sensitive information about individuals from the provenance records. In doing so, we examine anonymization techniques that generalize attribute values of data records, as opposed to hiding the attributes completely as done in \cite{DBLP:conf/pods/DavidsonKMPR11}.
The intuition of differential privacy is that the removal or addition of a single record does not significantly affect the outcome of any analysis. It is, however, very hard to do exploration under a privacy budget: one largely has to know up front which questions will be asked, and only aggregate questions can be asked.
Different techniques have been proposed in the literature for protecting the privacy of individuals, e.g., k-anonymity [28, 31], l-diversity [24], t-closeness [22] and differential privacy [11]. In particular, differential privacy [11] has recently gained momentum as the method of choice in statistical databases. It involves adding random noise to the data so that the distribution of the resulting dataset is almost invariant to the inclusion of any data record. While extremely powerful, differential privacy is not suitable for our purposes: in exchange for a more rigorous guarantee of privacy, its application may hamper the utility of the anonymized provenance data [29]. Indeed, for it to be useful, provenance information should keep track of the data records that have been used and generated by the workflow modules as well as their connections (lineage), which may be lost or broken when applying differential privacy techniques.
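The noise-addition idea can be sketched with the standard Laplace mechanism (a textbook construction, not something this paper implements): noise is scaled to sensitivity/ε, so an analyst sees only a perturbed aggregate, never the underlying records or their lineage.

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample from Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, sensitivity=1.0, epsilon=0.5, rng=random):
    """Differentially private count: add Laplace noise with scale sensitivity/epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

random.seed(42)
print(dp_count(120))  # close to 120, perturbed by noise of scale 2
```

The released value is useful for aggregate statistics but, by design, severs the link between the output and any individual record, which is precisely what makes the model ill-suited to publishing lineage-preserving provenance.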
The anonymity degree of a~$\mathtt{DWf}$'s parameter ($\mathtt{\langle p, op \rangle}$) is defined with respect to a given $\mathtt{DWf}$ instance~($\mathtt{insWf}$). Indeed, different instances of $\mathtt{DWf}$ may have input datasets with different anonymity-degree requirements. For example, the owner of an input dataset used for a given workflow instance ($\mathtt{insWf_1}$) may impose a more stringent anonymity degree than the owner of an input dataset used for a different workflow instance ($\mathtt{insWf_2}$).
Manual identification of a workflow’s parameters that are sensitive and setting their anonymity degrees can be tedious. This becomes a serious concern when the workflow includes a large number of operations. To address this issue, we propose in this section an approach that takes as input the sensitivity of the input parameters of the workflow (DWf) together with their anonymity degrees. It then detects the list of (intermediate and final) parameters in (DWf) that may be sensitive, and infers the anonymity degree that should be applied to the datasets bound to those parameters during the execution of (DWf).
Taking the maximum anonymity degree of the contributing inputs ensures that the anonymity degrees imposed on such inputs are honored by the dependent parameter in question.
This work opens up opportunities for more research in the field of anonymizing workflow data. In this respect, our ongoing work includes investigating the applicability of our solution to other anonymization techniques.