1. NISO Lightning Overview:
Identification & “Anonymization”
Micah Altman
Director of Research
MIT Libraries
Prepared for
NISO Workshop on Patron Privacy
Online
May 2015
2. DISCLAIMER
These opinions are my own; they are not the
opinions of MIT, Brookings, any of the project
funders, nor (with the exception of co-authored,
previously published work) those of my collaborators
Secondary disclaimer:
“It’s tough to make predictions, especially about
the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston
Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert
Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan
Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel,
Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
3. Collaborators & Co-Conspirators
Privacy Tools for Sharing Research Data Team
(Salil Vadhan, P.I.)
http://privacytools.seas.harvard.edu/people
Research Support
Supported in part by NSF grant CNS-1237235
4. Related Work
Main Project:
Privacy Tools for Sharing Research Data
http://privacytools.seas.harvard.edu/
Related publications:
Novak, K., Altman, M., Broch, E., Carroll, J. M., Clemins, P. J., Fournier, D., Laevart, C., et al. 2011. Communicating Science and Engineering Data in the Information Age. Computer Science and Telecommunications Board, National Academies Press.
Vadhan, S., et al. 2011. “Re: Advance Notice of Proposed Rulemaking: Human Subjects Research Protections.”
Altman, M., D. O’Brien, S. Vadhan, and A. Wood. 2014. “Big Data Study: Request for Information.”
O’Brien, D., et al. 2015. “When Is Information Purely Public?” (Mar. 27, 2015). Berkman Center Research Publication No. 2015-7.
Wood, A., et al. 2014. “Long-Term Longitudinal Studies” (July 22, 2014). Berkman Center Research Publication No. 2014-12.
Slides and reprints available from:
informatics.mit.edu
5. Identifiable private information is common
Birth date + zipcode +
gender uniquely identify
~87% of people in the U.S.
Can predict social security
number using
birthdate/place
Tables, graphs and maps
can reveal identifiable
information
People have been identified
through movie rankings,
search strings, writing
style…
Brownstein et al. 2006, NEJM 355(16)
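The quasi-identifier risk above can be sketched as a small computation: count how many records are unique on (birthdate, zip, gender). The records and field names below are invented for illustration; this is a sketch of the idea, not the published methodology.

```python
# Sketch: how identifying is a combination of quasi-identifiers?
# A record is re-identifiable by anyone who already knows those
# attributes whenever its combination is unique in the dataset.
from collections import Counter

records = [
    {"birthdate": "1961-01-01", "zipcode": "02145", "gender": "M"},
    {"birthdate": "1961-02-02", "zipcode": "02138", "gender": "M"},
    {"birthdate": "1972-03-25", "zipcode": "94041", "gender": "F"},
    {"birthdate": "1972-03-25", "zipcode": "02127", "gender": "F"},
]

def fraction_unique(records, quasi_ids):
    """Fraction of records whose quasi-identifier combination is unique."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return sum(1 for r in records
               if counts[tuple(r[q] for q in quasi_ids)] == 1) / len(records)

print(fraction_unique(records, ["birthdate", "zipcode", "gender"]))  # 1.0
print(fraction_unique(records, ["gender"]))                          # 0.0
```

Run over real population data with (birth date, 5-digit zip, gender), this fraction is what yields the ~87% figure cited above.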
6. Privacy is not Confidentiality…
(defining basic terms)
Privacy
Control over extent and circumstances of sharing
Confidentiality
Control over disclosure of information
Sensitive information
Information that would cause harm if improperly
disclosed
(to individual, institution, social group, or society)
Private personally identifiable information
Not already purely public
Directly or indirectly linkable to an identifiable individual
Possibly using externally available information
7. Legal Constraints are Complicated
Intellectual property: copyright, fair use, DMCA, database rights, moral rights, intellectual attribution, trade secret, patent, trademark
Confidentiality: Common Rule (45 CFR 46), HIPAA, FERPA, EU Privacy Directive, privacy torts (invasion, defamation), rights of publicity, state privacy laws, CIPSEA
Access rights: FOIA, state FOI laws, sensitive-but-unclassified, potentially harmful information (archeological sites, endangered species, animal testing, …), classified, EAR/ITAR export restrictions
Contract: journal replication requirements, funder open access, contracts, licenses, click-wrap, terms of use
8. Laws define “anonymized” differently
Identification criteria:
FERPA: direct, indirect, linked, bad intent
HIPAA: direct/indirect (18 identifiers), OR a statistician verifies minimal risk AND no actual knowledge of an identified individual
Common Rule: direct; indirect/linked -- if “readily identifiable”
MA 201 CMR 17: first initial + last name
Sensitivity criteria:
FERPA: any non-directory information
HIPAA: any medical information
Common Rule: private information -- based on harm
MA 201 CMR 17: financial, state, and federal identifiers
9. Different definitions of identifiability
Record-linkage
• “where’s waldo”
• Match a real person to a
precise record in a database
• Examples: direct identifiers.
• Caveats: Satisfies
compliance for specific laws,
but not generally; substantial
potential for harm remains
Indistinguishability
+ Heterogeneity
• “hiding in the crowd”
• People can be matched only
to cluster of records
• Based on quasi-ids
• Sensitive attributes must
also vary
• Examples: k-anonymity, l-diversity, attribute disclosure
• Caveats: Potential for
substantial harms may
remain
Learning
• “privacy, guaranteed”
• Formally bound the total
learning about any individual
that can occur from a query
• Examples: differential
privacy, zero-knowledge
proofs
• Caveats: Challenging to
implement, requires
interactive system
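The “hiding in the crowd” criteria above can be sketched as simple checks. The rows, field names, and generalized values below are illustrative only, not the slides’ actual dataset:

```python
# Sketch: k-anonymity and (distinct) l-diversity checks over quasi-identifiers.
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True if every quasi-identifier combination appears in >= k records."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())

def distinct_l_diverse(records, quasi_ids, sensitive, l):
    """True if every quasi-identifier group has >= l distinct sensitive values."""
    groups = {}
    for r in records:
        groups.setdefault(tuple(r[q] for q in quasi_ids), set()).add(r[sensitive])
    return all(len(vals) >= l for vals in groups.values())

rows = [
    {"birth_year": "196*", "zip3": "021", "gender": "M", "crimes": 0},
    {"birth_year": "196*", "zip3": "021", "gender": "M", "crimes": 0},
    {"birth_year": "197*", "zip3": "940", "gender": "F", "crimes": 1},
    {"birth_year": "197*", "zip3": "940", "gender": "F", "crimes": 1},
]

print(is_k_anonymous(rows, ["birth_year", "zip3", "gender"], k=2))               # True
print(distinct_l_diverse(rows, ["birth_year", "zip3", "gender"], "crimes", l=2)) # False
```

The False result illustrates the caveat above: the data is 2-anonymous, yet everyone in each group shares the same sensitive value, so attribute disclosure remains.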
10. How many things are wrong with this picture?
Name      SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
A. Jones  12341  01011961   02145    M       Raspberry           0
B. Jones  12342  02021961   02138    M       Pistachio           0
C. Jones  12343  11111972   94043    M       Chocolate           0
D. Jones  12344  12121972   94043    M       Hazelnut            0
E. Jones  12345  03251972   94041    F       Lemon               0
F. Jones  12346  03251972   02127    F       Lemon               1
G. Jones  12347  08081989   02138    F       Peach               1
H. Smith  12348  01011973   63200    F       Lime                2
I. Smith  12349  02021973   63300    M       Mango               4
J. Smith  12350  02021973   63400    M       Coconut             16
K. Smith  12351  03031974   64500    M       Frog                32
L. Smith  12352  04041974   64600    M       Vanilla             64
M. Smith  12353  04041974   64700    F       Pumpkin             128
N. Smith  12354  04041974   64800    F       Allergic            256
11. What’s wrong with this picture?
(Same table as the previous slide, with on-slide annotations:)
Column labels: Identifier; Private Identifier; Sensitive
Callouts: Unexpected response?; Mass resident; FERPA too?; Californian; Twins, separated at birth?
12. Common Approach: Suppress Information for Data Release
Published outputs:
* Jones * * 1961 021*
* Jones * * 1961 021*
* Jones * * 1972 9404*
* Jones * * 1972 9404*
* Jones * * 1972 9404*
Modal practice: “The correlation between X and Y was large and statistically significant”
Summary statistics
Contingency tables
Public-use sample microdata
Information visualization
13. Help, help, I’m being suppressed…
Name      SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
[Name 1]  12341  *1961      021*     M       Raspberry           .1
[Name 2]  12342  *1961      021*     M       Pistachio           -.1
[Name 3]  12343  *1972      940*     M       Chocolate           0
[Name 4]  12344  *1972      940*     M       Hazelnut            0
[Name 5]  12345  *1972      940*     F       Lemon               .6
[Name 6]  12346  *1972      021*     F       Lemon               .6
[Name 7]  12347  *1989      021*     *       Peach               64.6
[Name 8]  12348  *1973      632*     F       Lime                3
[Name 9]  12349  *1973      633*     M       Mango               3
Traditional static suppression techniques:
Data reduction (applied to rows, variables, or cells): global recode, local suppression, aggregation, synthetic data
Perturbation: microaggregation, rule-based data swapping, adding noise
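A minimal sketch of the global-recode and local-suppression steps shown above, assuming the MMDDYYYY birthdate and 5-digit zip formats used in the example tables; the field names and helper are illustrative:

```python
# Sketch: global recode (coarsen birthdate to year, zip to 3-digit prefix)
# plus local suppression of direct identifiers (name, SSN).
def recode(record):
    masked = dict(record)
    masked["Name"] = "*"                                  # local suppression
    masked["SSN"] = "*"                                   # drop direct identifier
    masked["Birthdate"] = "*" + record["Birthdate"][-4:]  # keep year only (MMDDYYYY)
    masked["Zipcode"] = record["Zipcode"][:3] + "*"       # recode to 3-digit prefix
    return masked

row = {"Name": "A. Jones", "SSN": "12341", "Birthdate": "01011961",
       "Zipcode": "02145", "Gender": "M"}
print(recode(row))
# {'Name': '*', 'SSN': '*', 'Birthdate': '*1961', 'Zipcode': '021*', 'Gender': 'M'}
```

Note that gender and the sensitive column pass through untouched, which is exactly why suppression alone can still leave the risks shown above.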
14. Suppression reduces utility
The common approach of anonymizing/suppressing data reduces its usefulness.
Minimizing disclosure in the presence of large external data sources reduces usefulness a lot.
Anonymized data is not simply less informative -- it typically yields biased analyses.
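The “learning” family from the identifiability definitions earlier takes a different tack: rather than suppressing cells, release a statistic with noise calibrated to sensitivity and a privacy budget. Below is a sketch of the basic Laplace mechanism of differential privacy; the epsilon value is illustrative, not a recommendation:

```python
# Sketch: differentially private count via the Laplace mechanism.
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) by inverse transform from a uniform draw."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Adding/removing one person changes a count by at most `sensitivity`,
    so noise with scale = sensitivity/epsilon bounds what any query reveals
    about any individual."""
    return true_count + laplace_noise(sensitivity / epsilon)

random.seed(0)
print(dp_count(42, epsilon=0.5))  # true count 42 plus Laplace noise
```

The noise is mean-zero, so aggregate analyses remain approximately unbiased, in contrast to the systematic bias suppression introduces.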
15. New Data – New Challenges
How to deidentify without completely destroying the data?
The “Netflix problem”: large, sparse datasets that overlap can be probabilistically linked [Narayanan and Shmatikov 2008]
The “GIS problem”: fine geo-spatial-temporal data is impossible to mask when correlated with external data [Zimmerman 2008]
The “Facebook problem”: masked network data can be identified if only a few nodes are controlled [Backstrom et al. 2007]
The “Blog problem”: pseudonymous communication can be linked through textual analysis [Novak et al. 2004]
[For more examples see Vadhan et al. 2010]
Source: [Calabrese 2008; Real Time Rome Project 2007]
16. Little Data – Big World
The “Favorite Ice Cream” problem
-- public information that is not risky can help us
learn information that is risky
The “Doesn’t Stay in Vegas” problem
-- information shared locally can be found anywhere
The “Data Exhaust problem”
-- wherever you go, there you are, and your data too!
17. Algorithmic Discrimination
• Emergent behavior of algorithms, big data, and behavior can produce discrimination on private personal characteristics
18. Information Science Approach:
Manage Privacy & Confidentiality Lifecycle
Collection:
Consent/licensing terms
Methods
Measures
Storage:
Systems information security
Data structures and partitioning
Dissemination:
Vetting
Disclosure limitation
Data use agreements
Lifecycle stages: creation/collection, storage/ingest, processing, internal sharing, analysis, external dissemination/publication, re-use, long-term access
Supporting frameworks: research methods; data management systems; legal/policy frameworks; statistical/computational frameworks
19. Hybrid Approaches
Collection limitations
Limits on what is collected; inform and consent
Data enclaves – physically restrict access to data
Examples: ICPSR, Census Research Data Center
May include availability of synthetic data as an aid to preparing model specifications
Advantages: extensive human auditing and vetting; information security threats much reduced
Disadvantages: expensive, slow, inconvenient to access
Controlled remote access
Varies from remote access to all data and output, to human vetting of output
Restrictions on use are easier to enforce
Advantages: auditable; potential to impose human review; potential to limit analysis
Disadvantages: complex to implement, slow
Model servers
Mediated remote access – analysis limited to designated models
Advantages: faster, no human in the loop
Disadvantages: statistical methods for ensuring model safety are immature – residuals, categorical variables, and dummy variables are all risky; very limited set of models currently supported; complex to implement
Experimental approaches
Personal data stores
Data auditing and accountability
21. Creative Commons License
This work, Managing Confidential Information in Research, by Micah Altman (http://redistricting.info) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.