Cast
1. PRESERVING PRIVACY IN
SEMANTIC-RICH TRAJECTORIES
OF HUMAN MOBILITY
Anna Monreale, Roberto Trasarti, Dino
Pedreschi, Chiara Renso
KDDLab, Pisa
Vania Bogorny
Univ. Santa Catarina, Brazil
Knowledge Discovery and Delivery Lab
(ISTI-CNR & Univ. Pisa)
www-kdd.isti.cnr.it
ANONIMO MEETING, Pisa, September 20-21, 2010
SPRINGL 2010, San Jose, November 2, 2010
2. How the story begins…
Semantic
trajectories
represent the
important places
visited by people
This information can
be privacy sensitive!
We should find a
good generalization
of the visited
places… preserving
semantics!
But how?
Can we use a taxonomy
of places to generalize
and find anonymous
datasets?
Let’s ask Anna, Dino
and Roberto for help!
3. Semantic Trajectories
The availability of trajectory data is increasing
From raw trajectories to new forms of trajectory data with
richer semantic information: semantic trajectories
Semantic trajectories represent the traces of moving objects as
sequences of stops and moves
A semantic trajectory can be represented as the sequence
of stops, e.g.
<Home, Work, ShoppingCenter, Gym>
4. Semantic Trajectory and
Privacy
Data owners should not reveal sensitive personal
information
Disclosure of personal sensitive information puts
the citizen’s privacy at risk.
Hiding personal identifiers may not be sufficient
Need for new privacy-preserving DT techniques
Privacy by Design
Natural trade-off between privacy quantification
and data utility
Analysis results should not be altered significantly
Privacy has to be maximized
5. Semantic Trajectories Analysis and
Privacy Issues
Analyzing datasets of semantic trajectories
may cause privacy issues
A visited place can allow an attacker to infer sensitive
personal information about an individual
Example: From the fact that a person has
stopped at an oncology clinic, an attacker can
derive private personal information about
that person's health.
6. Semantic Trajectories Analysis
and Privacy Issues
k-anonymity is not enough for a robust protection
When individuals with similar trajectories stop in
the same sensitive place, an attacker can easily infer
each individual's sensitive information.
Example:
#U1 <Park, Restaurant, Oncology Clinic>
#U2 <Park, Restaurant, Oncology Clinic>
This dataset is 2-anonymous but the attacker can
infer that the user has been to the Oncology
Clinic!!!
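The example above can be checked with a small sketch. It assumes, for illustration only, that Park and Restaurant act as quasi-identifier places and Oncology Clinic is the sensitive place; the helper name `split` and the variable names are hypothetical.

```python
from collections import defaultdict

# Assumption: Park and Restaurant are quasi-identifiers,
# Oncology Clinic is the only sensitive place.
SENSITIVE = {"Oncology Clinic"}

dataset = {
    "U1": ("Park", "Restaurant", "Oncology Clinic"),
    "U2": ("Park", "Restaurant", "Oncology Clinic"),
}

def split(trajectory):
    """Return (quasi-identifier stops, sensitive stops) of a trajectory."""
    quasi = tuple(p for p in trajectory if p not in SENSITIVE)
    sens = frozenset(p for p in trajectory if p in SENSITIVE)
    return quasi, sens

# Group users by their quasi-identifier sequence.
groups = defaultdict(list)
for user, trajectory in dataset.items():
    quasi, sens = split(trajectory)
    groups[quasi].append(sens)

# k-anonymity: every quasi-identifier sequence is shared by at least k users.
k = min(len(members) for members in groups.values())

# Share of users in the matching group who visited the sensitive place.
members = groups[("Park", "Restaurant")]
share = sum("Oncology Clinic" in s for s in members) / len(members)

print(k, share)
```

Here `k` is 2, so the dataset is 2-anonymous, yet `share` is 1.0: knowing only <Park, Restaurant> is enough to conclude that the user visited the clinic.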
7. The Privacy Framework
Anonymizes datasets of semantic trajectories
Based on semantic generalization and the
notion of c-safety, similar to the notion of
l-diversity in relational, tabular data
It is based on a taxonomy of places and the notions of
quasi-identifier places and sensitive places.
Preserves pattern mining results
8. Quasi-identifier and Sensitive
stops
The taxonomy of places
Represents important places and their semantic
categories in a given domain
quasi-identifier places: can be used to infer the
identity of the user
sensitive places: can disclose sensitive
information about the user
In general we do not have an a priori
classification, since it depends on the
application and the context
10. Privacy Model
Adversary Knowledge:
how we anonymize the data
the privacy place taxonomy describing the levels of
abstraction
the user U is in the dataset
a quasi-identifier place sequence SQ visited by the user
U
Attack Model:
Given SQ, the attacker builds a set of candidate semantic
trajectories containing SQ and tries to infer the sensitive
places visited by U.
We denote by Prob(SQ,S) the probability that, given a
quasi-identifier place sequence SQ related to a user U,
the attacker infers the sequence of sensitive places S
visited by the user.
11. C-Safe Dataset
We want to control the probability Prob(SQ, S)
A dataset ST is said to be c-safe with respect to the place set Q if,
for every quasi-identifier place sequence SQ
and for each set of sensitive places S,
Prob(SQ,S) ≤ c, with c ∈ [0,1].
Given a sequence of sensitive places S = s1, ..., sh
and a quasi-identifier sequence SQ, the
probability of inferring S is the conditional
probability:
Prob(SQ,S) = P(S|SQ)
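A minimal sketch of how this conditional probability could be estimated on a dataset that has not yet been generalized. It assumes the attacker treats every trajectory containing SQ as equally likely, and it reads "containing" as order-preserving subsequence matching; the function names and the toy dataset are hypothetical, not taken from the slides.

```python
def is_subsequence(needle, haystack):
    """True if the items of `needle` appear in `haystack` in the same order."""
    it = iter(haystack)
    return all(item in it for item in needle)

def prob(dataset, sq, s):
    """Estimate P(S | SQ) as the fraction of candidates that also contain S."""
    candidates = [t for t in dataset if is_subsequence(sq, t)]
    if not candidates:
        return 0.0
    hits = sum(is_subsequence(s, t) for t in candidates)
    return hits / len(candidates)

# Hypothetical toy dataset; Clinic plays the role of the sensitive place.
toy = [
    ("Home", "Park", "Clinic"),
    ("Home", "Park", "Gym"),
    ("Home", "Park", "Clinic", "Work"),
    ("Home", "Work"),
]

c = 0.45
p = prob(toy, sq=("Home", "Park"), s=("Clinic",))
is_safe = p <= c
print(p, is_safe)
```

Three trajectories contain <Home, Park> and two of them also contain Clinic, so the estimate is 2/3, which exceeds c = 0.45; this toy dataset would not be c-safe for that pair.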
12. How can we obtain a c-safe dataset?
The CAST (C-safe Anonymization of Semantic
Trajectories) algorithm guarantees that P(S|SQ)
≤ c for every S and SQ
While (|S| > 0)
  SL = { s ∈ S | length(s) = MaxLength(S) }
  While (|SL| >= m)
    1. Compute the cost of every possible group Gi of m
       sequences in SL as: CostGi = CostQGi + CostSGi.
    2. Apply the generalization to the group with the lowest
       cost, storing the results in R.
    3. Remove Gi from S and SL.
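The loop above can be sketched as follows. This is only an outline under stated assumptions: `cost` and `generalize` are caller-supplied placeholders (the toy versions below are not the CostQ/CostS functions of the slides), and sequences that cannot form a group of m at their length are simply set aside, a case the pseudocode does not spell out.

```python
from itertools import combinations

def cast_grouping(sequences, m, cost, generalize):
    """Greedy grouping loop; `cost` and `generalize` are supplied by the caller."""
    remaining = list(sequences)
    result, leftovers = [], []
    while remaining:
        max_len = max(len(s) for s in remaining)
        longest = [s for s in remaining if len(s) == max_len]
        while len(longest) >= m:
            # Step 1: evaluate the cost of every group of m sequences.
            best = min(
                combinations(range(len(longest)), m),
                key=lambda idx: cost([longest[i] for i in idx]),
            )
            # Step 2: generalize the cheapest group and store the result.
            result.extend(generalize([longest[i] for i in best]))
            # Step 3: remove the group from both collections.
            for i in sorted(best, reverse=True):
                remaining.remove(longest.pop(i))
        # Assumption: fewer than m sequences are left at this length; set them aside.
        for s in longest:
            remaining.remove(s)
            leftovers.append(s)
    return result, leftovers

# Toy stand-ins, for illustration only.
def toy_cost(group):
    return sum(len(set(column)) - 1 for column in zip(*group))

def toy_generalize(group):
    merged = tuple(col[0] if len(set(col)) == 1 else "*" for col in zip(*group))
    return [merged] * len(group)

data = [("a", "b"), ("a", "c"), ("x", "y"), ("a",)]
anonymized, leftovers = cast_grouping(data, 2, toy_cost, toy_generalize)
print(anonymized, leftovers)
```

Enumerating every group of m sequences grows combinatorially with the number of sequences, so this form is only practical for small inputs.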
13. Example (1): The process
13
Consider the following set of sequences, and m=3 and c=0.45:
S = {<S1, R2, H1, R1, C1, S2>
<S3, D1, R1, C1, S2>
<S1, P3, C2, D2, S2>
…}
14. Example (2): CostQ
CostQ is the number of hops in the taxonomy tree needed to generalize the
quasi-identifier sequences to a common one.
Consider the group:
<S1, R2, H1, R1, C1, S2>
<S3, D1, R1, C1, S2>
<S1, P3, C2, D2, S2>
CostQ = 6 + 6 + 6 = 18
<Station,Place,Entertainment,S2 (H1,C1)>
<Station,Place,Entertainment,S2 (C1)>
<Station,Place,Entertainment,S2 (C2)>
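The hop counts above can be reproduced with a short sketch. The taxonomy figure is not part of this transcript, so the `PARENT` map below is an assumption: Station, Entertainment and Place come from the generalized sequences, while the intermediate nodes Restaurant, Disco, Park and Outdoor are hypothetical and chosen only to be consistent with 6 hops per sequence. The sketch also assumes the quasi-identifier sequences have equal length and are aligned position by position.

```python
# Hypothetical child -> parent taxonomy (see the assumptions above).
PARENT = {
    "S1": "Station", "S2": "Station", "S3": "Station", "Station": "Place",
    "R1": "Restaurant", "R2": "Restaurant", "Restaurant": "Entertainment",
    "D1": "Disco", "D2": "Disco", "Disco": "Entertainment",
    "Entertainment": "Place",
    "P3": "Park", "Park": "Outdoor", "Outdoor": "Place",
}

def ancestors(place):
    """Path from a place up to the root, starting with the place itself."""
    chain = [place]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def common_ancestor(places):
    """Lowest node that is an ancestor (or self) of every place."""
    others = [set(ancestors(p)) for p in places[1:]]
    for node in ancestors(places[0]):
        if all(node in o for o in others):
            return node
    raise ValueError("no common ancestor")

def cost_q(group):
    """Return the common generalization and the hops needed per sequence."""
    target = tuple(common_ancestor(column) for column in zip(*group))
    per_sequence = [
        sum(ancestors(p).index(t) for p, t in zip(seq, target))
        for seq in group
    ]
    return target, per_sequence

# Quasi-identifier stops of the three sequences (H1, C1, C2 are sensitive).
group = [
    ("S1", "R2", "R1", "S2"),
    ("S3", "D1", "R1", "S2"),
    ("S1", "P3", "D2", "S2"),
]
target, per_sequence = cost_q(group)
print(target, per_sequence, sum(per_sequence))
```

Under this assumed taxonomy each sequence needs 6 hops to reach <Station, Place, Entertainment, S2>, for a total of 18.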
15. Example (3): CostS
CostS is the number of hops in the taxonomy tree needed to generalize the
sensitive places in order to obtain c-safety.
From the generalized group:
<Station,Place,Entertainment,S2 (H1,C1)>
<Station,Place,Entertainment,S2 (C1)>
<Station,Place,Entertainment,S2 (C2)>
CostS = 3
The total cost of this
group is 21 hops,
which is the lowest among
all combinations
<Station, Place, H1, Entertainment, Clinic, S2>
<Station, Place, Entertainment, Clinic, S2>
<Station, Place, Clinic, Entertainment, S2>
16. Example (4): Why is it c-safe?
<Station,Place,Entertainment,S2 (H1,C1)>
<Station,Place,Entertainment,S2 (C1)>
<Station,Place,Entertainment,S2 (C2)>
SQ = <Station, Place, Entertainment, S2>
Probability of a successful inference: Prob(SQ,H1) = 1/3 < c, Prob(SQ,C1) = 2/3 > c and
Prob(SQ,C2) = 1/3 < c
We need to generalize C1 to the higher representation level in the
taxonomy: Clinic.
The probability of C1 becomes 2/5 < c!
C-safe dataset:
<Station, Place, H1, Entertainment, Clinic, S2>
<Station, Place, Entertainment, Clinic, S2>
<Station, Place, Clinic, Entertainment, S2>
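The arithmetic of this example can be checked with a few lines. The sketch assumes c = 0.45 as stated earlier and that Clinic is the direct parent of both C1 and C2 (one hop each); the 2/5 value is taken from the slide as given, since its derivation depends on the taxonomy, which is not reproduced here.

```python
from collections import Counter
from fractions import Fraction

c = Fraction(45, 100)

# Sensitive places attached to each generalized trajectory of the group.
group_sensitive = [{"H1", "C1"}, {"C1"}, {"C2"}]

# Fraction of the group's trajectories that contain each sensitive place.
counts = Counter(p for places in group_sensitive for p in places)
probs = {p: Fraction(n, len(group_sensitive)) for p, n in counts.items()}

# Places whose inference probability exceeds the threshold c.
violations = sorted(p for p, pr in probs.items() if pr > c)

# Assumption: C1 and C2 are each one hop below Clinic in the taxonomy.
cost_s = sum(1 for places in group_sensitive for p in places if p in {"C1", "C2"})

print(probs, violations, cost_s)
```

Only C1 exceeds the threshold (2/3 > 0.45). Generalizing the three clinic visits to Clinic costs 3 hops, matching CostS = 3, and the resulting 2/5 reported on the slide is below c.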
17. Experiments
The dataset contains trajectories of
17000 moving cars in Milan, in one
week, collected through GPS
devices.
We found 6225 semantic trajectories with an
average length of 5.2 stops.
We ran the sequential pattern algorithm and
measured the quality of the results with two
measures:
the coverage coefficient
the distance coefficient
18. Experiments: Quality of the
analysis
The coverage coefficient measures how many
patterns extracted from the original dataset
are covered (have a superclass in the taxonomy)
by the patterns extracted from the anonymized
dataset
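One possible reading of this measure, sketched under assumptions: an anonymized pattern covers an original pattern when both have the same length and each anonymized item is the original item or one of its ancestors, and the coefficient is the fraction of original patterns covered. The taxonomy, the patterns and the function names below are hypothetical.

```python
# Hypothetical child -> parent taxonomy.
PARENT = {
    "C1": "Clinic", "C2": "Clinic", "Clinic": "Place",
    "R1": "Restaurant", "Restaurant": "Entertainment", "Entertainment": "Place",
    "S1": "Station", "Station": "Place",
}

def generalizes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    node = specific
    while True:
        if node == general:
            return True
        if node not in PARENT:
            return False
        node = PARENT[node]

def covers(anon_pattern, orig_pattern):
    """Item-by-item coverage of one pattern by another of the same length."""
    return len(anon_pattern) == len(orig_pattern) and all(
        generalizes(a, o) for a, o in zip(anon_pattern, orig_pattern)
    )

def coverage(original, anonymized):
    """Fraction of original patterns covered by some anonymized pattern."""
    covered = sum(any(covers(a, o) for a in anonymized) for o in original)
    return covered / len(original)

original = [("S1", "R1"), ("S1", "C1"), ("R1", "C2")]
anonymized = [("Station", "Entertainment"), ("Station", "Clinic")]
cov = coverage(original, anonymized)
print(cov)
```

In this toy case two of the three original patterns are covered, so the coefficient is 2/3.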
20. Experiments: Quality of the
analysis
The distance coefficient represents the distance, in
terms of steps in the taxonomy, needed to transform
the patterns extracted from the
original dataset into those extracted from the
anonymized dataset.
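A sketch of one way such a distance could be computed; the slide does not give the exact formula, so everything here is an assumption: each original pattern is matched to the closest anonymized pattern that generalizes it, the distance is the number of upward steps in the taxonomy, and the coefficient is the average over the original patterns that have a match. The taxonomy, patterns and function names are hypothetical.

```python
# Hypothetical child -> parent taxonomy.
PARENT = {
    "C1": "Clinic", "Clinic": "Place",
    "R1": "Restaurant", "Restaurant": "Entertainment", "Entertainment": "Place",
    "S1": "Station", "Station": "Place",
}

def steps_up(specific, general):
    """Upward steps from `specific` to `general`, or None if unreachable."""
    steps, node = 0, specific
    while node != general:
        if node not in PARENT:
            return None
        node = PARENT[node]
        steps += 1
    return steps

def pattern_distance(orig, anon):
    """Total steps to generalize `orig` into `anon`, or None if impossible."""
    if len(orig) != len(anon):
        return None
    total = 0
    for o, a in zip(orig, anon):
        s = steps_up(o, a)
        if s is None:
            return None
        total += s
    return total

def distance_coefficient(original, anonymized):
    """Average distance from each original pattern to its closest generalization."""
    best = []
    for o in original:
        ds = [d for a in anonymized if (d := pattern_distance(o, a)) is not None]
        if ds:
            best.append(min(ds))
    return sum(best) / len(best) if best else None

original = [("S1", "R1"), ("S1", "C1")]
anonymized = [("Station", "Entertainment"), ("Station", "Clinic"), ("Place", "Place")]
dist = distance_coefficient(original, anonymized)
print(dist)
```

Here the closest generalizations are 3 and 2 steps away, giving an average of 2.5; a lower value means the anonymized patterns stay closer to the original ones.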
22. Conclusions and Future work
Improve the algorithm with better heuristics
and remove the restriction to groups of a
fixed size.
More experiments with other mining
algorithms
More utility measures for the evaluation of
results
Another future research direction is
the exploitation of c-safe semantic
trajectory datasets for semantic tagging of
trajectories. How does the anonymization step