Many applications benefit from user location data, but location data raises privacy concerns. Anonymization can protect privacy, but identities can sometimes be inferred from supposedly anonymous data. This paper studies a new attack on the anonymity of location data. We show that if the approximate locations of an individual's home and workplace can both be deduced from a location trace, then the median size of the individual's anonymity set in the U.S. working population is 1, 21, and 34,980 for locations known at the granularity of a census block, census tract, and county respectively. The location data of people who live and work in different regions can be re-identified even more easily. Our results show that the threat of re-identification for location data is much greater when the individual's home and work locations can both be deduced from the data. To preserve anonymity, we offer guidance for obfuscating location traces before they are disclosed.
1. On the Anonymity of Home/Work
Location Pairs
Philippe Golle
Kurt Partridge
Presented by: Bo Begole
2. Location traces are useful for
personalized location-based services
[Figure: a location history table (timestamps and GPS coordinates, e.g. 11:57–12:45 at 37°26'39", -122°9'38") is mapped to inferred restaurant preferences by time of day (price range, cuisine such as Chinese or Italian), which feed personalized recommendations. Magitti, CHI 2008]
3. Location traces can be sensitive
They can reveal business connections, political
affiliation, medical condition, risky behaviors, etc.
Can damage reputations, unfairly raise insurance
premiums, increase divorce rates, etc.
Some ways to mitigate the risk
– “Only my trusted network provider knows my location”
» Location data is often shared with third party LBS providers
– “Privacy policies and legislation protect my location traces”
» Data collector may fall victim to attacks, or be dishonest themselves
– “I’ll report my location only infrequently when I need service”
» Many services require frequent location updates (e.g. friend finder)
– “I’ll report my location pseudonymously”
» Watch out for inferences!
4. Sometimes possible to infer identity by
combining data sources
Inference: what can be learned from combining existing knowledge with newly acquired information
E.g.: learn the medical record of William Weld [Swe00]
– Knowing the birth date and ZIP code of the governor of Massachusetts
– Can retrieve his health records from a supposedly anonymous database of state employee health-insurance claims
87% of the US population have a unique combination of date of birth, gender and postal code.
[Figure: Voter Registration records (name, street address, gender, postal code, date of birth, …) overlap Patient Records (gender, postal code, date of birth, cancer type) on the shared attributes gender, postal code and date of birth]
5. Inferences from location data
With 2-weeks’ worth of GPS data collected from a
subject’s car, Krumm [Pervasive 2007] showed we can
infer:
– Home address (median error < 60 m)
>5% of subjects are identifiable by combining the inferred address with
– a reverse geo-coder
– a Web-based white pages directory
7. The K-anonymity Guarantee
Problem
– If a location trace is unique…
– It may be combined with other public data
– To generate undesirable inferences
Solution: k-anonymity [Sweeney 2000]
– Principle: “Data is safe to release if at least k
people share the same attributes”
– Anonymity set: set of people with same attributes
– K-anonymity: size of the anonymity set
8. K-anonymity Example
– Cancer DB: (Post code: 34012, DOB 7/31/1945, Male)
– Goal: ensure 5-anonymity
– (Cambridge MA) 54,805 people
– (Cambridge MA, male) 26,854 people
– (Cambridge MA, 55 years old) 2,096 people
– (Cambridge MA, DOB: 7/31/1945) 6 people
– (Cambridge MA, DOB: 7/31/1945, Male) 3 people
– (Post code: 34012, DOB: 7/31/1945, Male) 1 person
Combine with voter registration: William Weld
The same can be done for 69 percent of the 54,805
people on the voting list of Cambridge, MA.
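The counting behind these anonymity-set sizes can be sketched in a few lines. This is an illustrative example with made-up records, not code or data from the paper: each person is reduced to a tuple of quasi-identifiers, and everyone sharing the same tuple forms one anonymity set.

```python
from collections import Counter

# Hypothetical records (postal code, date of birth, sex) -- illustrative
# values only, not drawn from any real database.
records = [
    ("34012", "7/31/1945", "M"),
    ("34012", "7/31/1945", "M"),
    ("34012", "1/2/1960",  "F"),
    ("02139", "7/31/1945", "M"),
]

def anonymity_set_size(rows, person):
    """Number of people sharing this person's quasi-identifiers."""
    return Counter(rows)[person]

def k_anonymity(rows):
    """k of the whole release: the size of the smallest anonymity set."""
    return min(Counter(rows).values())

print(anonymity_set_size(records, ("34012", "7/31/1945", "M")))  # -> 2
print(k_anonymity(records))  # -> 1: someone is unique, so 5-anonymity fails
```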
9. Inferences from location data
What level of accuracy of location information
could result in small sizes of k anonymity for
some portion of the population?
Assume a location trace reveals approximate …
– region someone lives (county or postal code)
– and also, region someone works (county or postal code)
County and postal code are coarse-grained
– A trace that reveals less is probably not very useful
– Useful traces in fact reveal considerably more!
– Therefore, our results are an upper bound on the size of
the anonymity set (things could be worse!)
10. Data source
Longitudinal Employer-Household Dynamics
(LEHD)
– Census Bureau program to compile information about
where people work and where they live
» County
» Census Tract ~= US Postal code
– Also records age, earnings and distribution across
industries.
– 103,289,243 workers from 42 states
LEHD is a synthetic dataset
– “The key statistical property to preserve in the
synthetic data is the joint distribution of workers
across home and work areas”
– 3 implicates (synthetic datasets) are available for
researchers to check robustness of results
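The paper's measurement reduces to grouping workers by their home/work pair and counting each group. A minimal sketch of that computation, using invented tract identifiers in place of the LEHD data:

```python
from collections import Counter
from statistics import median

# Toy stand-in for the LEHD joint distribution of workers across home
# and work areas; the tract names are invented for illustration.
workers = [
    ("tract_A", "tract_A"),   # lives and works in the same tract
    ("tract_A", "tract_B"),
    ("tract_A", "tract_B"),
    ("tract_C", "tract_B"),
]

# Each worker's anonymity set is everyone with the same home/work pair.
pair_counts = Counter(workers)
per_worker_k = [pair_counts[w] for w in workers]

print(per_worker_k)          # -> [1, 2, 2, 1]
print(median(per_worker_k))  # -> 1.5
```

Run over the full dataset, and separately with home-only or work-only keys, this per-worker distribution is what the plots on the following slide summarize.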
11. Size of live/work anonymity set
[Figure: log-scale plots of anonymity set size versus fraction of the working population, at census-tract granularity (left) and county granularity (right), for home location only, work location only, and both home & work. At census-tract granularity, 7% have k-anonymity of 1 and 50% have k-anonymity of 21; at county granularity, 5% have k-anonymity of 92 and 50% have k-anonymity of 35,000.]
12. Influence of living & working in the
same region versus different regions
[Figure: log-scale plots of anonymity set size versus fraction of the population, at census-tract (left) and county (right) granularity, comparing workers who live and work in the same location, workers who live and work in different locations, and all workers.]
13. Conclusion: Even coarse grained
location traces can identify individuals
Approximate home and work locations narrow a person to a
very small anonymity set
Geographic Region              Median Anonymity Set Size
County                         34,980
Census Tract (~postal code)    21
Census Block                   1
Being unique is not quite the same as being identifiable
– But the identity of a specific individual may be found by combining other
sources (white pages, patient records, police records, private
knowledge, employer records, Facebook pages, …)
We can estimate the k-anonymity of location-aware and other context-aware
technologies prior to widespread adoption using public datasets
– Population Census
– Longitudinal Employment Household Data (LEHD)
– American Time Use Survey (ATUS)
– Japan Statistics Bureau Survey on Time Use and Leisure Activities
14. Techniques to Estimate Privacy Effects of
Context-Aware Technologies
K-anonymity (the size of the anonymity set) is a metric to
characterize the level of privacy
Can estimate k-anonymity of context-awareness
technologies with public datasets
– Population Census
– Longitudinal Employment Household Data (LEHD)
– American Time Use Survey (ATUS)
– Japan Statistics Bureau Survey on Time Use and Leisure
Activities
K-anonymity is not a perfect measure
– May still leak information
» L-diversity [MGKV 2006]
» ε-differential privacy [Dwork et al]
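To make the l-diversity caveat concrete, here is a small illustrative check (hypothetical data, not code from the cited papers): a release can be k-anonymous yet still leak the sensitive attribute if everyone in a group shares the same value.

```python
from collections import defaultdict

def is_l_diverse(rows, l):
    """Distinct l-diversity: every group of records sharing the same
    quasi-identifiers must contain at least l distinct sensitive values."""
    groups = defaultdict(set)
    for quasi_ids, sensitive in rows:
        groups[quasi_ids].add(sensitive)
    return all(len(values) >= l for values in groups.values())

# 2-anonymous (two identical quasi-identifier tuples), but both records
# carry the same diagnosis, so the release is not 2-diverse.
release = [
    (("34012", "M"), "cancer"),
    (("34012", "M"), "cancer"),
]
print(is_l_diverse(release, 2))  # -> False
```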
Editor's notes
Thank you. I'll be presenting the work of my colleagues Philippe Golle, who is possibly the nicest man on the planet, and Kurt Partridge, who is probably the most healthy-living person I know. The research is an exploration of privacy risk from location data.
To explore this question, we turn to a publicly available data set called the Longitudinal Employer-Household Dynamics, which contains data on more than 100 million workers in the US, including the county and census tract (which is roughly similar to a postal code) where they live and work. I need to point out that the publicly available LEHD is actually synthesized from the real data set to preserve privacy. However, the key characteristic of the synthesis is that the distribution of where people live and where they work is preserved in the synthetic data. The Census Bureau makes three synthetic sets available, and Philippe and Kurt have conducted their analysis on all three sets with identical results.
Here are the results. The left diagram is for location at the granularity of census tract, the right is at the granularity of county. The vertical axis is the size of the anonymity set on a logarithmic scale. The horizontal axis is the fraction of the population. The curves plot the fraction of the population who have an anonymity set of a specific size. The red curve is derived from looking only at where a person works and the green from where they live. Notice that for those data alone, 50% of the population have an anonymity set of size 1,000 or more. But if you combine where a person lives with where they work, the anonymity set for 7% of the population is just 1, and for 50% of the population it is only as high as 21. On the county scale, the anonymity set size is substantially higher, with 5% being no smaller than 92 and 50% having greater than 35,000.
Median size of anonymity set: census tract, 21 people; county, 35,000 people. 5th-percentile size of anonymity set: census tract, 1 person; county, 92 people.
County: there are 2,784 counties and county equivalents (boroughs, parishes) in the 42 states included in the LEHD dataset.
Census Tract: census tracts are county subdivisions defined by the U.S. Census Bureau. In terms of size and number of residents, they are roughly comparable to ZIP codes. The LEHD dataset contains 64,881 tracts where workers live and 52,852 tracts where they work. The mean number of workers living in a tract is 1,592 and the median is 1,479. The mean number of people working in a tract is 1,954 and the median is 951 (it ranges from zero in completely residential tracts to 166,680 in a tract of Los Angeles County).
Workers who live and work in different locations (brown stars) have a much smaller anonymity set (i.e. are less anonymous) than those who live and work in the same location (blue diamonds). This holds true both for locations revealed at census-tract (left graph) and county granularity (right graph). The traces of workers who cross location boundaries to go to work are particularly vulnerable to re-identification attacks. Across the states included in the LEHD dataset, 94.1% of workers live and work in different census tracts. These numbers show that the extra anonymity afforded by living and working in the same census tract is uncommon. The fraction of workers who live and work in different counties is 43.5%. The percentages range from 63.8% in Virginia to 10.7% in Hawaii. Living and working in the same county is more frequent, and therefore had a bigger effect on the national average.
The point is that even coarse grained location traces can render individuals uniquely identifiable. The size k for US county granularity is fairly secure, with a k of 34 thousand for 50% of the population. For census tract and census block, which is described in the paper, k is quite small for 50% of the US population. Such small anonymity set sizes can leave some portion of the population vulnerable to being uniquely identified by combining with data from other sources.
One final takeaway is this. It is possible to understand the privacy risk of context awareness before we attain widespread adoption of the technologies. There are publicly available data sets that can serve as surrogates for context technologies, such as the LEHD which was described in this paper. In other work we have used census data itself, and there is a data set of a large number of individuals' self-described daily activities at the granularity of 15-minute time periods. The Japan Statistics Bureau provides a similar data set, and there are probably comparable data from other governments.
Write questions on a 1,000-yen note and I will take it to the author. Do you have any questions that I can take back to the authors?
We and others in this research field have been collecting traces of location for personalization of location-based services. From location traces, we can infer a person's individual preferences. For example, in the Magitti project that we reported at CHI 2008, we used location traces to identify a person's activity patterns and store and restaurant preferences, which were used to improve the quality of recommendations for local venues.
We know that some users are uncomfortable with the existence of records of their movement for a variety of reasons. In addition to restaurant and store preferences, such traces can reveal business connections, political affiliation, medical conditions and other aspects of a person's life that they might not want to expose, so that they don't have to deal with undesired consequences.
There are a number of ways to mitigate the risk of such exposure. You may trust that a service provider will not misuse the data, but sometimes they share such data with business affiliates. Legislation may be enacted to prevent misuse, but malicious parties may not abide by the laws. You can also just report your data infrequently, as needed, but some services require frequent updates in order to be useful. Finally, you can just use a false name, such as a user ID number, with the data. The risk here is that someone could potentially reverse-identify you from some unique characteristic of the data.
Another potential objection: do all the processing locally. The problem is that this consumes bandwidth for services that integrate with other data, and doesn't work well for social applications. "Pseudonymously" means, for example, obtaining service under a user ID that cannot be tied to the user's real identity.
So, for example, here we have two separate publicly available databases. One is the voter registration records, which record a person's name, gender and postal code. The second is a record of state employee health claims for cancer, which carefully does not include the names of people who made claims. However, in 2000, Sweeney used these databases to discover that William Weld, the governor of Massachusetts at the time, had cancer. Sweeney's analysis of US census data showed that 87% of the U.S. population is uniquely identifiable given their full date of birth (year, month and day), sex, and the ZIP code where they live.
Returning now to location traces, John Krumm showed at Pervasive 2007 that you can infer a person's home address with a median error of 60 meters from 2 weeks of GPS location data. Combining that with a reverse geo-coder and online white pages, he was able to accurately identify more than 5% of the people.
Krumm went on to show a number of techniques that obscure an individual's data traces, making it much more difficult to identify a particular individual. He showed that you can hide location points near the person's home, or add noise to all of the data points, or round the location samples to points on a grid. One of the conclusions of Krumm's work was that location obfuscation defeats re-identification at the cost of a significant loss in location precision, which could render some location-based services unusable. Yet there is no guarantee that even this degree of location obfuscation protects against other or all re-identification attacks.
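The grid-rounding and noise ideas sketched above can be illustrated as follows; the cell size and noise scale are arbitrary illustrative choices, not Krumm's actual parameters:

```python
import random

def snap_to_grid(lat, lon, cell_deg=0.01):
    """Round a coordinate pair to a grid of cell_deg degrees
    (roughly 1 km of latitude at this setting)."""
    return (round(lat / cell_deg) * cell_deg,
            round(lon / cell_deg) * cell_deg)

def add_noise(lat, lon, sigma_deg=0.005):
    """Perturb a point with independent Gaussian noise on each axis."""
    return (lat + random.gauss(0.0, sigma_deg),
            lon + random.gauss(0.0, sigma_deg))

# Every point inside the same cell maps to the same released location,
# so nearby readings become indistinguishable from one another.
print(snap_to_grid(37.4443, -122.1605))  # ~ (37.44, -122.16)
```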
So, to summarize: if a location trace is unique, it may be combined with other public data to generate undesirable inferences. But how big is the risk? What can we use to estimate the degree of risk? In 2000, Sweeney proposed one metric of privacy certainty called k-anonymity, which states that data are safe to release if at least k people share the same attributes. Those k people are called the anonymity set, and k describes the size of that set.
For example, recall the cancer database. Suppose we had a goal of ensuring k-anonymity of at least 5. If the dataset revealed only the city in which the patients lived, the k-anonymity size is 54 thousand. Adding gender reduces it to 26 thousand, which is still quite good. We can even add the age of the patient, reducing k-anonymity to approximately 2 thousand. But when we disclose date of birth, plus gender, plus postal code, the k-anonymity goes down to 6, then 3, and finally to just a single individual. The same can be done for 69 percent of the 54,805 people on the voting list of Cambridge, MA.
How would we apply this principle to location trace data? What granularity of location information could result in small k for some portion of the population? Well, let's look at some very coarse grain sizes to begin with: county and ZIP code. Assume a GPS location trace reveals not just where a person lives, which Krumm showed is inferable to 60 m accuracy, but also where they work, which is similarly inferable from a location trace. We start at this level because it is very coarse grained; anything less precise would probably not be useful to any location-based service. And useful traces would generally be even more precise. So the results are an upper bound on the anonymity set.