De-identifying
Patron Data to
Balance Privacy
and Insight
Becky Yoose
Stephen Halsey
The Seattle Public Library
PLA 4/7/2016
Presentation Outline
• Overview of Data Management Principles, Policies,
and Practices
• Balancing Library Strategic Needs with Patron
Privacy
• SPL Methods of Data De-identification
• Summary
Principles, Policies, and
Practices
ALA Data Management
Guidelines
• Collection of personally identifiable information
• only when necessary to fulfill the mission of the library
• Should not share personally identifiable user
information with third parties, unless
• the library has obtained user permission
• has entered into a legal agreement with the vendor
• Make records available [to law enforcement
agencies and officers] only in response to properly
executed orders.
“An interpretation of the Library Bill of Rights.”
http://www.ala.org/advocacy/intfreedom/librarybill/interpretations/privacy
State Law: Revised Code of WA
• RCW 42.56.310: Library Records
• Any library record, the primary purpose of which is to
maintain control of library materials, or to gain access to
information, that discloses or could be used to disclose
the identity of a library user is exempt from disclosure
under this chapter.
• RCW 19.255.010: Disclosure, notice — Definitions
— Rights, remedies.
• First & last name combined with SSN, DL #, credit/debit
card number, authentication credentials, “account
number”
SPL Confidentiality of Patron Data
• It is the policy of The Seattle Public Library to
protect the confidentiality of borrower records as
part of its commitment to intellectual freedom.
• The Library will keep patron records confidential
and will not disclose this information except
• as necessary for the proper operation of the Library
• upon consent of the user
• pursuant to subpoena or court order
• as otherwise required by law.
The Seattle Public Library. “Confidentiality of Borrower Records.”
http://www.spl.org/about-the-library/library-use-policies/confidentiality-of-borrower-records
SPL Data Management Practices
• All records connecting a patron to an item that has
been held or borrowed, or to an information
resource that has been accessed, are deleted upon
the successful fulfillment of the transaction.
• Circulation records
• Public computer reservations
• Workstation use data (log files, caches, histories)
• Network logs
NIST: Two-part Definition of PII
1. Any information that can be used to distinguish
or trace an individual’s identity, such as name,
social security number, date and place of birth,
mother’s maiden name, or biometric records
2. Any other information that is linked or linkable to
an individual, such as medical, educational,
financial, and employment information
a) Libraries extend the second point by including
borrowing and information seeking activity
National Institute of Standards and Technology, via the Government Accountability Office's
expression of an amalgam of the definitions of PII from Office of Management and Budget
Memorandums 07-16 and 06-19. May 2008, http://www.gao.gov/new.items/d08536.pdf
Two-part Definition of PII
PII-1: Individual PII-2: Intellectual Pursuits
Balancing Library
Strategic Planning Needs
with Patron Privacy
3/16/2015
penguincakes/Flickr
“Unanswerable” Questions
• Longitudinal questions (e.g. trend analysis)
• Long-term (multi-year) rather than snapshot
• Trends, correlations, changes in behavior – not
necessarily individual activity
• Questions only about the type of behavioral use of
Library programs and services:
•“Do teen patrons remain active in their 20s?”
•“Do people use their neighborhood branch or
use the branch where relevant materials are?”
(e.g. Chinese language collection)
Privacy Requirements
• Passionate commitment to intellectual freedom
• Recognition that some patrons have no alternatives
• Intellectual content of transactions should always
be purged
• Avoid keeping records that show a person’s
whereabouts
Threats to patron privacy
• Law enforcement
• Seeking intellectual pursuit data
• Seeking patron whereabouts
• Hackers
• Library and ILS data
• Data leak
• Reconstruction of identity via data
• Embarrassment/loss of trust
• Notification costs
AOL data release (2006)
There was no personally
identifiable data provided by
AOL with those records, but
search queries themselves
can sometimes include such
information.
This was a
screw up
TechCrunch. “AOL: ‘This was a screw up’.” August 2006. http://techcrunch.com/2006/08/07/aol-this-was-a-screw-up/
AOL Example: User 4417749
numb fingers
60 single men
dog that urinates on everything
robert arnold
marion arnold
john arnold georgia
homes sold in shadow lake
subdivision gwinnet county georgia
landscapers in Lilburn, Ga
Thelma Arnold
Target(ed) Marketing
The Incredible Story Of How Target Exposed A Teen Girl's Pregnancy
http://www.businessinsider.com/the-incredible-story-of-how-target-exposed-a-teen-girls-pregnancy-2012-2
Data for Good
In a 2012 Pew Research survey,
64% of respondents said they would be
interested in personalized online
accounts. 29% would be very likely to
use the services.
Over the last 30 days, the SPL BiblioCommons opt-in
rate for borrowing history was 34%.
“7 Surprises About Libraries in Our Surveys”, Pew Research, 2014
http://www.pewresearch.org/fact-tank/2014/06/30/7-surprises-about-libraries-in-our-surveys/
Two-part Definition of PII
PII-1: People PII-2: Intellectual Pursuits
Data De-identification
• Designed to protect individual privacy while preserving
some of the dataset’s utility for other purposes.
• Make it hard or impossible to learn if an individual’s
data is in a data set, or determine any attributes about
an individual known to be in the data set.
• HIPAA: Data that does not identify an individual and
with respect to which there is no reasonable basis to
believe that the information can be used to identify an
individual
Garfinkel, Simson L. “De-Identification of Personally Identifiable Information.” April 2015.
NIST. http://csrc.nist.gov/publications/drafts/nistir-8053/nistir_8053_draft.pdf
Scott Nicholson and Catherine Arnott Smith, "Using Lessons From Health Care to Protect The
Privacy of Library Users: Guidelines For The De‐identification of Library Data Based on HIPAA,"
Proceedings of the American Society for Information Science and Technology 42, no. 1 (2005):
1198–1206.
SPL Methods of Data De-identification
Millennial Project
summary
Millennial Engagement
Persona - fictional person created from
de-identified data
The better than nothing method
of tracking
SPL Methods of Data De-identification
Data Warehouse
Methods
Age vs. DOB
• DOB: 3/15/1975
• Age 41
• DOB: 3/15/1975?
• 3/16/1975?
• 3/17/1975?
• 3/18/1975?
• 3/19/1975?
• 3/20/1975?
• 3/21/1975?
• 3/22/1975?
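A minimal sketch of the idea above: store a computed age instead of the date of birth, so that many possible birth dates collapse to the same stored value. The function name `age_on` and the extract date are assumptions for illustration, not details from the slides.

```python
from datetime import date

def age_on(dob: date, as_of: date) -> int:
    """Return whole years between dob and as_of."""
    years = as_of.year - dob.year
    # Subtract one if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years

# Assumed extract date for illustration (the presentation date).
extract_day = date(2016, 4, 7)

# Every candidate DOB on the slide maps to the same stored age.
for day in range(15, 23):
    print(date(1975, 3, day), "->", age_on(date(1975, 3, day), extract_day))
```

Only the age would be loaded into the warehouse; the DOB stays behind in the source system.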
Call numbers
• Call number: 914.30487 F683
• Format: DVD
• Collection: Beginning ESL
• Truncated call number: 91*, FIC
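One way the truncation above could be sketched, assuming Dewey call numbers start with three digits and that anything else is treated as fiction. Both assumptions, and the name `truncate_call_number`, are hypothetical and not taken from the slides.

```python
import re

def truncate_call_number(call_number: str, keep_digits: int = 2) -> str:
    """Reduce a call number to a coarse subject class."""
    match = re.match(r"\s*(\d{3})", call_number)
    if match is None:
        # Assumption: non-Dewey call numbers collapse into a single FIC bucket.
        return "FIC"
    # Keep only the leading digits of the Dewey class and mask the rest.
    return match.group(1)[:keep_digits] + "*"

print(truncate_call_number("914.30487 F683"))
```

The `keep_digits` parameter lets the same sketch produce a coarser class if two digits still reveal too much for a small collection.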
Timestamps vs. dates
• Timestamp
Mon, 11 Apr 2016 11:02:43 +0000
• Date
4/11/2015, 00:44
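A short sketch of reducing a timestamp to a calendar date, using the timestamp format shown above. The helper name `to_date` is an assumption; the parsing call is from the Python standard library.

```python
from datetime import date
from email.utils import parsedate_to_datetime

def to_date(timestamp: str) -> date:
    """Drop the time of day from an RFC 2822 style timestamp."""
    return parsedate_to_datetime(timestamp).date()

print(to_date("Mon, 11 Apr 2016 11:02:43 +0000").isoformat())
```

Dropping the time of day makes it harder to match a warehouse row against a door camera, a receipt, or a staff member's memory of who was at the desk.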
Extract-Transform-Load
Data               PII?
Barcode            Yes
Name               Yes
Address            Yes
Email Address      Yes
Phone Number       Yes
Date of Birth      Yes
Age                No
Gender             No
Zip Code           No
Registration Year  No

Data               III?
Barcode            Yes
Title              Yes
Author             Yes
Call Number        Yes – truncate it
Item Type          No
Branch             No
Date               No
Patrons                       Circulation
Age  Gender  Zip    Reg Year  Item Type  Dewey 100  Branch  Date
45   Male    98117  2004      CD         700        CEN     4/1/15
45   Male    98117  2004      Book       FIC        BAL     4/3/15
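The transform step could be sketched as an allow-list: only fields marked "No" in the tables above are copied into the warehouse row, and everything else is dropped by default. The field names, the `transform` function, and the sample record are hypothetical; the age and Dewey class are assumed to have been derived earlier in the pipeline.

```python
# Fields allowed into the warehouse (assumed names for illustration).
PATRON_KEEP = ("age", "gender", "zip", "reg_year")
ITEM_KEEP = ("item_type", "dewey_100", "branch", "date")

def transform(patron: dict, checkout: dict) -> dict:
    """Build one warehouse row from allow-listed fields only."""
    row = {key: patron[key] for key in PATRON_KEEP}
    row.update({key: checkout[key] for key in ITEM_KEEP})
    return row

# Made-up source records; name, barcode, and title never reach the row.
patron = {"barcode": "0000000000", "name": "EXAMPLE PATRON",
          "age": 45, "gender": "Male", "zip": "98117", "reg_year": 2004}
checkout = {"barcode": "1111111111", "title": "EXAMPLE TITLE",
            "item_type": "CD", "dewey_100": "700", "branch": "CEN",
            "date": "4/1/15"}

row = transform(patron, checkout)
print(row)
```

An allow-list fails safe: a new field added to the source system stays out of the warehouse until someone decides it belongs there.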
https://www.flickr.com/photos/accidentalhedonist/5707034712/
Belt & suspenders
• To identify a specific patron’s transaction, you’d
need to
• Breach ILS
• Recreate hash algorithm
• Breach data warehouse
• Look up patron
• Even then
• No intellectual content or whereabouts
• Only the fact of types of transactions
• Strict, clear policies for staff
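The slides do not say which hash algorithm is used. As a sketch under that caveat, a keyed hash (HMAC-SHA-256 here, chosen as an assumption) yields a stable de-identified patron ID that cannot be recomputed without a secret kept outside the warehouse. The function name, key, and barcode below are hypothetical.

```python
import base64
import hashlib
import hmac

def patron_deid(barcode: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a barcode with a keyed hash."""
    digest = hmac.new(secret_key, barcode.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")

# Hypothetical key and barcode; a real key would live outside the warehouse.
key = b"example-secret-key"
print(patron_deid("0000000000", key))
```

The same barcode always maps to the same ID, which allows longitudinal analysis, while an attacker who breaches only the warehouse cannot test candidate barcodes without the key.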
Data Warehouse
Example
Express Computer Policy
Are patrons abusing 15-minute “Express”
workstations?
Old policy
• 90 minutes per day for “Internet” workstation
• 15 minutes per day for “Express” workstation
• Total of 90 minutes per day for any workstation
• Allowed for “serial” use of Express workstations (6x per
day)
• Staff noticed (anecdotally) that “a lot” of patrons were
chaining Express sessions together
Sample Data – Workstation Use
COMPUTER LOC DATE MINS BTYPE BSTAT HOMEZIP AGE PATRONdeID REG_YR
BALLIB08 BAL 10/1/2014 5 br srad 98119 23 KEwHPoJpXY7K757HLmVQXHEyaEg= 2014
CENLIB5270 CEN 10/1/2014 90 br srsen 98121 69 JeTrHceC+nwaWc/DQZ8VBfgKbL4= 1992
BALLIB08 BAL 10/1/2014 15 br srad 98107 53 hzvXFK24blsKH9LW7Pkc5kHecto= 2014
IDCLIB12 IDC 10/1/2014 48 br srsen 98104 63 aS4MypnjX+KV699OVM525fWB//k= 2014
UNILIB12 UNI 10/1/2014 15 br srad 98105 36 kKdtdIrFDhQTuQwDQcqzGXkQYoc= 2010
BALLIB15 BAL 10/1/2014 25 br srsen 98117 71 RJ0bkOnwFlmwTrFrAf/fsYJUfMo= 1992
CENLIB3009 CEN 10/1/2014 1 br srad 98122 50 tK+QVA0PJvQk57147tU8VK08aZ8= 1996
UNILIB12 UNI 10/1/2014 15 br srsen 98115 81 JytJE+kXHCEMpsK8lUfd4MdU/U8= 1992
CENLIB5330 CEN 10/1/2014 90 br srad 98104 59 Q2se1NTE3m54zolcnS+SE19ZyTU= 2002
BALLIB15 BAL 10/1/2014 5 br srad 98104 51 gS08RjQIzUGSZSuStA2Tz7MfvzE= 2013
CENLIB5401 CEN 10/1/2014 7 br srad 98133 48 mTXkmtPG7e1Y0mMOmhxwb9RaB/c= 2011
CENLIB5330 CEN 10/1/2014 32 br srad 98104 44 Bi1XhBLDx4Jl9yA1Y2w/tSbZrXM= 2012
CENLIB3011 CEN 10/1/2014 58 br srad 98133 48 mTXkmtPG7e1Y0mMOmhxwb9RaB/c= 2011
CENLIB5270 CEN 10/1/2014 90 br kcad 98035 49 T3D+yZiijOFqEuKa39/D4iURCEo= 2009
QNALIB05 QNA 10/1/2014 74 br srsen 98109 71 y3FSFjyUfO4mc3lzUSUWMGeYLVA= 1992
CENLIB3009 CEN 10/1/2014 4 br srad 98109 29 szB++tBCmztvhjqEx3i3/S/g2Io= 2005
Sample Data – 2015-08-01
Patron DeID Number of Sessions Number of Minutes
34c4e0c201ac7f14f8eef3c14fb877ca 6 90
38b82f34e018ef6accc258e4d539cfd4 6 90
5f9511476cdda7020e6356b4a8d33419 6 90
8025078883a24a72a7f0f84077e14cef 7 90
8c2c07d77b1e8d14ffb3cb7a9489272a 6 90
b1dc081ef62a831a397623e45f9f0915 6 90
cfafdd529713f254d38dcdb480778a0e 6 90
789115f4939f7400fa8b4c3d1485b433 6 89
e3468d6731968e1081e7f4666edb5703 6 89
6af2d93348ed0c9c643cd4a74097c7f9 6 85
78668085a84eddb0869eb16a9c99ddcb 7 80
839cb6c87c4a886ab29ed9513cc008c8 5 75
83110fab344de28ea5731131ca207bdf 5 74
0ecee31153187778f8de69a41407a9bc 6 71
9b01fe4277e1153d75c35a565867986b 8 71
25ce5d330ae8cb26a17ac8798c26fb8d 5 70
eca127e85176f2392c90ff69e81cf782 5 60
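A per-day summary like the one above could be produced by grouping session rows on the de-identified patron ID. This is a sketch, not the Library's actual query; the function names, the six-session threshold, and the sample IDs are assumptions.

```python
from collections import defaultdict

def summarize(sessions):
    """Map each patron deID to (number of sessions, total minutes)."""
    totals = defaultdict(lambda: [0, 0])
    for deid, minutes in sessions:
        totals[deid][0] += 1
        totals[deid][1] += minutes
    return {deid: tuple(pair) for deid, pair in totals.items()}

def chained(summary, min_sessions=6):
    """Return deIDs with at least min_sessions sessions in the day."""
    return sorted(d for d, (count, _) in summary.items() if count >= min_sessions)

# Made-up rows for one day: (patron deID, session minutes).
day = [("aaa", 15)] * 6 + [("bbb", 15), ("bbb", 10)]
summary = summarize(day)
print(summary)
print(chained(summary))
```

The analysis answers the policy question without ever touching a name or barcode: it only needs to know that several sessions belonged to the same anonymous patron.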
Data Warehouse Summary
• Store identifiable and non-identifiable information
in separate locations
• Avoid storing any intellectually significant or
identifiable data
• Build internal data stores that help segment
anonymized patrons via behavior and
demographics
• Use this Library-owned and managed data store as
the source of information for external CRMs
Summary: Data-driven + patron privacy
https://www.flickr.com/photos/mladjenovic_n/4338457946/
Thank you
Becky Yoose
Library Applications and
Systems Manager
@yo_bj
Stephen Halsey
Director of Marketing
and Online Services
@halseyhoff
The Seattle Public Library
Serious security (Bonus Slide!)
AOKI
KAY
0123456789
4/16/2015

De-identifying Patron Data for Analytics and Intelligence

  • 1. De-identifying Patron Data to Balance Privacy and Insight Becky Yoose Stephen Halsey The Seattle Public Library ?? PLA 4/7/2016
  • 2. Presentation Outline • Overview of Data Management Principles, Policies, and Practices • Balancing Library Strategic Needs with Patron Privacy • SPL Methods of Data De-identification • Summary
  • 4. ALA Data Management Guidelines • Collection of personally identifiable information • only when necessary to fulfill the mission of the library • Should not share personally identifiable user information with third parties, unless • the library has obtained user permission • has entered into a legal agreement with the vendor • Make records available [to law enforcement agencies and officers] only in response to properly executed orders. “An interpretation of the Library Bill of Rights.” http://www.ala.org/advocacy/intfreedom/librarybill/interpretations/privacy
  • 5. State Law: Revised Code of WA • RCW 42.56.310: Library Records • Any library record, the primary purpose of which is to maintain control of library materials, or to gain access to information, that discloses or could be used to disclose the identity of a library user is exempt from disclosure under this chapter. • RCW 19.255.010: Disclosure, notice — Definitions — Rights, remedies. • First & last name combined with SSN, DL #, credit/debit card number, authentication credentials, “account number”
  • 6. SPL Confidentiality of Patron Data • It is the policy of The Seattle Public Library to protect the confidentiality of borrower records as part of its commitment to intellectual freedom. • The Library will keep patron records confidential and will not disclose this information except • as necessary for the proper operation of the Library • upon consent of the user • pursuant to subpoena or court order • as otherwise required by law. The Seattle Public Library. “Confidentiality of Borrower Records.” http://www.spl.org/about-the-library/library-use-policies/confidentiality-of-borrower-records
  • 7. SPL Data Management Practices • All records connecting a patron to an item that has been held or borrowed, or to an information resource that has been accessed, are deleted upon the successful fulfillment of the transaction. • Circulation records • Public computer reservations • Workstation use data (log files, caches, histories) • Network logs
  • 8. NIST: Two-part Definition of PII 1. Any information that can be used to distinguish or trace an individual‘s identity, such as name, social security number, date and place of birth, mother‘s maiden name, or biometric records 2. Any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information a) Libraries extend the second point by including borrowing and information seeking activity National Institute of Standards and Technology via the Government Accounting Office expression of an amalgam of the definitions of PII from Office of Management and Budget Memorandums 07-16 and 06-19. May 2008, http://www.gao.gov/new.items/d08536.pdf
  • 9. Two-part Definition of PII PII-1: Individual PII-2: Intellectual Pursuits
  • 10. Balancing Library Strategic Planning Needs with Patron Privacy
  • 12. “Unanswerable” Questions • Longitudinal questions (e.g. trended analysis) • Long-term (multi-year) rather than snapshot • Trends, correlations, changes in behavior – not necessarily individual activity • Questions only about the type of behavioral use of Library programs and services: •“Do teen patrons remain active in their 20s?” •“Do people use their neighborhood branch or use the branch where relevant materials are?” (e.g. Chinese language collection)
  • 13. Privacy Requirements • Passionate commitment to intellectual freedom • Recognition that some patrons have no alternatives • Intellectual content of transactions should always be purged • Avoid keeping records that show person’s whereabouts
  • 14. Threats to patron privacy • Law enforcement • Seeking intellectual pursuit data • Seeking patron whereabouts • Hackers • Library and ILS data • Data leak • Reconstruction of identity via data • Embarrassment/loss of trust • Notification costs
  • 15. AOL data release (2006) There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information. This was a screw up TechCrunch. “AOL: ‘This was a screw up’.” August 2006. http://techcrunch.com/2006/08/07/aol-this-was-a-screw-up/
  • 16. AOL Example: User 4417749 numb fingers 60 single men dog that urinates on everything robert arnold marion arnold john arnold georgia homes sold in shadow lake subdivision gwinnet county georgia landscapers in Lilburn, Ga Thelma Arnold
  • 17. Target(ed) Marketing The Incredible Story Of How Target Exposed A Teen Girl's Pregnancy http://www.businessinsider.com/the-incredible-story-of-how-target-exposed-a-teen-girls-pregnancy-2012-2
  • 18. Data for Good In a 2012 Pew Research survey, 64% of respondents said they would be interested in personalized online accounts. 29% would be very likely to use the services. Last 30 days for SPL BiblioCommons opt-in for borrowing history is 34%. “7 Surprises About Libraries in Our Surveys”, Pew Research, 2012 http://www.pewresearch.org/fact-tank/2014/06/30/7-surprises-about-libraries-in-our-surveys/
  • 19. Two-part Definition of PII PII-1: People PII-2: Intellectual Pursuits
  • 20. Data De-identification • Designed to protect individual privacy while preserving some of the dataset’s utility for other purposes. • Make it hard or impossible to learn if an individual’s data is in a data set, or determine any attributes about an individual known to be in the data set. • HIPAA: Data that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual Garfinkel, Simson L. “De-Identification of Personally Identifiable Information.” April 2015. NIST. http://csrc.nist.gov/publications/drafts/nistir-8053/nistir_8053_draft.pdf Scott Nicholson and Catherine Arnott Smith, "Using Lessons From Health Care to Protect The Privacy of Library Users: Guidelines For The De‐identification of Library Data Based on HIPAA," Proceedings of the American Society for Information Science and Technology 42, no. 1 (2005): 1198–1206.
  • 21. SPL Methods of Data De-identification Millennial Project summary
  • 23. Persona - fictional person created from de-identified data
  • 24. The better than nothing method of tracking
  • 25. SPL Methods of Data De-identification Data Warehouse Methods
  • 26. Age vs. DOB • DOB: 3/15/1975 • Age 41 • DOB: 3/15/1975? • 3/16/1975? • 3/17/1975? • 3/18/1975? • 3/19/1975? • 3/20/1975? • 3/21/1975? • 3/22/1975?
  • 27. Call numbers • Call number: 914.30487 F683 • Format: DVD • Collection: Beginning ESL • Truncated call number: 91*, FIC
  • 28. Timestamps vs. dates • Timestamp Mon, 11 Apr 2016 11:02:43 +0000 • Date 4/11/2016, 00:44
  • 29. Extract-Transform-Load
      Patrons
        Data               PII?
        Barcode            Yes
        Name               Yes
        Address            Yes
        Email Address      Yes
        Phone Number       Yes
        Date of Birth      Yes
        Age                No
        Gender             No
        Zip Code           No
        Registration Year  No
      Circulation
        Data               III?
        Barcode            Yes
        Title              Yes
        Author             Yes
        Call Number        Yes – truncate it
        Item Type          No
        Branch             No
        Date               No
      Transformed records
        Age  Gender  Zip    Reg Year  Item Type  Dewey 100  Branch  Date
        45   Male    98117  2004      CD         700        CEN     4/1/15
        45   Male    98117  2004      Book       FIC        BAL     4/3/15
  • 31. Belt & suspenders • To identify a specific patron’s transaction, you’d need to • Breach ILS • Recreate hash algorithm • Breach data warehouse • Look up patron • Even then • No intellectual content or whereabouts • Only the fact of types of transactions • Strict, clear policies for staff
  • 33. Express Computer Policy Are patrons abusing 15-minute “Express” workstations? Old policy • 90 minutes per day for “Internet” workstation • 15 minutes per day for “Express” workstation • Total of 90 minutes per day for any workstation • Allowed for “serial” use of Express workstations (6x per day) • Staff noticed (anecdotally) that “a lot” of patrons were chaining Express sessions together
  • 34. Sample Data – Workstation Use
      COMPUTER    LOC  DATE       MINS  BTYPE  BSTAT  HOMEZIP  AGE  PATRONdeID                    REG_YR
      BALLIB08    BAL  10/1/2014  5     br     srad   98119    23   KEwHPoJpXY7K757HLmVQXHEyaEg=  2014
      CENLIB5270  CEN  10/1/2014  90    br     srsen  98121    69   JeTrHceC+nwaWc/DQZ8VBfgKbL4=  1992
      BALLIB08    BAL  10/1/2014  15    br     srad   98107    53   hzvXFK24blsKH9LW7Pkc5kHecto=  2014
      IDCLIB12    IDC  10/1/2014  48    br     srsen  98104    63   aS4MypnjX+KV699OVM525fWB//k=  2014
      UNILIB12    UNI  10/1/2014  15    br     srad   98105    36   kKdtdIrFDhQTuQwDQcqzGXkQYoc=  2010
      BALLIB15    BAL  10/1/2014  25    br     srsen  98117    71   RJ0bkOnwFlmwTrFrAf/fsYJUfMo=  1992
      CENLIB3009  CEN  10/1/2014  1     br     srad   98122    50   tK+QVA0PJvQk57147tU8VK08aZ8=  1996
      UNILIB12    UNI  10/1/2014  15    br     srsen  98115    81   JytJE+kXHCEMpsK8lUfd4MdU/U8=  1992
      CENLIB5330  CEN  10/1/2014  90    br     srad   98104    59   Q2se1NTE3m54zolcnS+SE19ZyTU=  2002
      BALLIB15    BAL  10/1/2014  5     br     srad   98104    51   gS08RjQIzUGSZSuStA2Tz7MfvzE=  2013
      CENLIB5401  CEN  10/1/2014  7     br     srad   98133    48   mTXkmtPG7e1Y0mMOmhxwb9RaB/c=  2011
      CENLIB5330  CEN  10/1/2014  32    br     srad   98104    44   Bi1XhBLDx4Jl9yA1Y2w/tSbZrXM=  2012
      CENLIB3011  CEN  10/1/2014  58    br     srad   98133    48   mTXkmtPG7e1Y0mMOmhxwb9RaB/c=  2011
      CENLIB5270  CEN  10/1/2014  90    br     kcad   98035    49   T3D+yZiijOFqEuKa39/D4iURCEo=  2009
      QNALIB05    QNA  10/1/2014  74    br     srsen  98109    71   y3FSFjyUfO4mc3lzUSUWMGeYLVA=  1992
      CENLIB3009  CEN  10/1/2014  4     br     srad   98109    29   szB++tBCmztvhjqEx3i3/S/g2Io=  2005
  • 35. Sample Data – 2015-08-01
      Patron DeID                       Number of Sessions  Number of Minutes
      34c4e0c201ac7f14f8eef3c14fb877ca  6                   90
      38b82f34e018ef6accc258e4d539cfd4  6                   90
      5f9511476cdda7020e6356b4a8d33419  6                   90
      8025078883a24a72a7f0f84077e14cef  7                   90
      8c2c07d77b1e8d14ffb3cb7a9489272a  6                   90
      b1dc081ef62a831a397623e45f9f0915  6                   90
      cfafdd529713f254d38dcdb480778a0e  6                   90
      789115f4939f7400fa8b4c3d1485b433  6                   89
      e3468d6731968e1081e7f4666edb5703  6                   89
      6af2d93348ed0c9c643cd4a74097c7f9  6                   85
      78668085a84eddb0869eb16a9c99ddcb  7                   80
      839cb6c87c4a886ab29ed9513cc008c8  5                   75
      83110fab344de28ea5731131ca207bdf  5                   74
      0ecee31153187778f8de69a41407a9bc  6                   71
      9b01fe4277e1153d75c35a565867986b  8                   71
      25ce5d330ae8cb26a17ac8798c26fb8d  5                   70
      eca127e85176f2392c90ff69e81cf782  5                   60
  • 36. Data Warehouse Summary • Store identifiable and non-identifiable information in separate locations • Avoid storing any intellectually significant or identifiable data • Build internal data stores that help segment anonymized patrons via behavior and demographics • Use this Library-owned and managed data store as the source of information for external CRMs
  • 37. Summary: Data-driven + patron privacy https://www.flickr.com/photos/mladjenovic_n/4338457946/
  • 38. Thank you Becky Yoose Library Applications and Systems Manager @yo_bj Stephen Halsey Director of Marketing and Online Services @halseyhoff The Seattle Public Library
  • 39. Serious security (Bonus Slide!) AOKI KAY 0123456789 4/16/2015

Editor's notes

  1. [Stephen] Good afternoon, thanks for joining us, especially with so many good session choices right now AND of course this close to happy hour. First, introductions. Becky Yoose is our Library Applications and Systems Manager and has recently started work at SPL and is now overseeing development of our data warehouse initiative. I am SPL’s Director of Marketing and Online Services and I’ve been with the system almost 3 years. Our data warehouse program started a couple of years ago. A few people have worked on this presentation and the data warehouse who have since left (and come back). I stepped in to assist Becky with this presentation. My last day is now tomorrow. Becky and I now affectionately refer to the curse of this presentation, which I believe in, as I was briefly stuck in the elevator earlier today! When I started 3 years ago and I was building the marketing department and program, I discovered that the Library was regularly purging a treasure trove of useful data that could be used for insights into patron behavior. I chatted with our Director of IT to see what we could do about this and the data warehouse initiative was born. So, this session is about what SPL has done in the past couple of years to de-identify data to gain better insights into our patrons’ behavior without sacrificing their privacy. I want to note that the same principles and privacy goals are in place regardless of whether you are using one of the many CRMs now available OR if you are compiling trended data via Excel or scribbling it on a piece of paper and using an abacus.
  2. [Stephen]
  3. [Stephen] At the national level, libraries are guided by the Library Bill of Rights and the interpretations thereof by ALA. On the matter of personally identifiable data, ALA advises libraries to collect only what is necessary to fulfill the mission of the library. We, of course, argue that deriving patrons’ needs and wants from the data trails they make in our systems IS a necessary part of fulfilling our mission. ALA further advises that we should not share personally identifiable information (or PII) with 3rd parties unless we are authorized to do so and/or we have an agreement in place with 3rd parties to protect that data. Finally, ALA recommends that we comply with law enforcement requests for data, but only if they are valid.
  4. [Stephen] On the state level in Washington, there’s nothing that explicitly protects the privacy of library patrons or mandates the confidentiality of borrower records. What we do have that’s useful in this context is an exemption from state records retention requirements for library records and a definition of “personally identifiable information” which we will talk more about in a minute.
  5. [Becky] At SPL, our policy is probably similar to yours. We treat borrower records as confidential, meaning we won’t disclose them except under certain circumstances.
  6. [Becky] In practice, we take advantage of our exemption from public records retention and delete records as soon as they cease to be useful to us. To be more specific, we sever the connection between who you are and what you borrowed as long as you return it to us. We also sever the connection between who you are and what you did on a public computer. We do not capture or store information about what movie you’re streaming right now. Again, this is similar to how most libraries handle patron data. After these transactions are completed, we still know who you are and where you live – we keep a record of your account, for example. And we still know that “The Painter” by Peter Heller was checked out 127 times last week; nonetheless, we don’t know that you were the one who checked it out.
  7. [Becky] Let’s come back to Personally Identifiable Information. Earlier we saw how Washington state defined PII: name, SSN, DOB, account number, and other details about you that can help someone uniquely identify that you are you. That approach only looks at one facet of PII, however. The National Institute of Standards and Technology, by way of the Government Accountability Office, defines PII using essentially 2 facets – stuff about you, and stuff that you do that is linked to you. NIST specifically lists medical, educational, financial, and employment information. This kind of data, in sufficient quantities, can be used in certain circumstances to reverse engineer an identity. I might not know your name, but if I know all the classes you’re taking at the university and I get access to course rosters, I can, by process of elimination, narrow down or even determine precisely who you are.
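The process-of-elimination point can be illustrated with a small sketch. The course names, student names, and the `candidates` helper below are all invented for illustration; the only idea taken from the talk is that each additional piece of linked activity data shrinks the pool of people who could match.

```python
# Hypothetical rosters (made-up data): course name -> set of enrolled students.
rosters = {
    "CHEM 101": {"Ana", "Ben", "Cam", "Dee"},
    "HIST 210": {"Ben", "Cam", "Eli"},
    "MATH 300": {"Cam", "Dee", "Eli"},
}


def candidates(known_courses, rosters):
    """Return everyone who appears on every roster in known_courses."""
    sets = [rosters[course] for course in known_courses]
    return set.intersection(*sets) if sets else set()


# Each extra known course narrows the pool of possible identities.
print(len(candidates(["CHEM 101"], rosters)))
print(len(candidates(["CHEM 101", "HIST 210"], rosters)))
print(candidates(["CHEM 101", "HIST 210", "MATH 300"], rosters))
```

With no name attached to the activity data at all, three courses are enough to leave a single candidate in this toy example.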
  8. [Becky] So we have two important facets of PII to consider whenever we approach a problem such as the one we’re here to talk about today. PII-1 can be seen as data referring to a person, and PII-2 is the data about, in the case of libraries, someone’s intellectual pursuits. In our context, PII-2 includes the books that patrons read, search queries they execute, and the web sites they visit. This data is extremely valuable and is collected all the time by companies - it helps determine what Amazon recommends that you read or, more importantly, buy, and what Facebook suggests that you should “Like” so that they can market more products to you. Libraries should be sanctuaries from this kind of detailed data gathering that has become commonplace in the information age… yet we also need to gain insights and intelligence into the rapidly changing and evolving needs and wants of our patrons so that we can better serve them and continue to be a vital resource to our communities.
  9. [Stephen] Again, from the perspective of Marketing and Public Services, we want to use this data to learn how to effectively allocate resources (both financial and human) in the marketing of our programs and services to ALL audiences (including non-users, disengaged and underserved audiences); to track the effectiveness of marketing efforts; to develop our programs and services toward what our patrons desire and expect; and to track the success and satisfaction of our patrons’ engagement with our programs and services, ALL while balancing library strategic needs with patron privacy.
  10. [Becky] At the beginning of the project, we had a lot of data about what’s in our collection and the “behavior” of our collection independent of who is using it. For example, we knew that a certain title had circulated 231 times, or that items in the 910 call number range were popular. We had some data about our users (patron account records) but nothing about their completed activities. We made decisions about collections, services, facilities, etc. based on general demographics of a neighborhood, but not based on actual library use by residents. We did not know if services were used a lot by a few individuals or used somewhat by a lot of individuals. In short, we had data, but we could not use it in a way that would both respect user privacy and inform where library resources should go in terms of programs and services.
  11. [Stephen] This leads into unanswerable questions that are essential for us to answer if we want to allocate resources effectively. This includes identifying and serving underrepresented patrons and our non-users in general. And it means keeping data for more than a year. A couple of questions to illustrate the above point: People in their 20s (aka Millennials) are generally not active users of the public library. We see a lot of teens and children, and then people come back when they become parents themselves. (I call this the church effect.) For the patrons who *are* active in their 20s, had they been active users in their teens? We also have a high circ of materials in certain languages in certain branches, for example, Chinese-language materials at our International District branch. What is the zip code of patrons who check out those materials: is IDC their home branch or are they traveling from elsewhere in the city to get those materials? Seattle is growing incredibly fast, and shifting demographics from neighborhood to neighborhood make understanding the neighborhood information even more important.
  12. [Becky] When thinking of ways to solve this problem, we had to consider the following: SPL is very committed to free inquiry & privacy. Many users can’t “opt out” of using our service. With several online services you can say “Well, if you don’t like the terms of use, just don’t use it or use some other service you like better.” We recognize, however, that some of our users don’t have alternatives for services like the internet or the opportunity elsewhere to do their information seeking privately. As a principle we always want to purge the “intellectual content” of a transaction. We also want to avoid keeping records that show a person’s whereabouts or specific activities at a certain place or time. This is too much like surveillance.
  13. [Becky] If you’re trying to design a solution, you have to take into account any perceived threats. The one we talk about most is law enforcement or government. Libraries may receive requests or subpoenas for information on specific individuals or locations. In the analysis of risk, we balance the likelihood of something happening versus the harm that would result. These requests from law enforcement may not be very likely, but we perceive that they would inflict considerable harm on privacy and intellectual freedom, and we take them most seriously. We often hear about organizations who have lost control of their data to hackers; it seems like every week we have a news story about a major data breach. We don’t hold credit card numbers, social security numbers, or other data likely to have substantial monetary value to hackers. A lot of the data we hold, for example, addresses or phone numbers, can be found in a phone book. On the other hand, some patrons may have protection orders, unlisted phone numbers, sensitive jobs, or other reasons that this information must remain confidential. We should always practice good security for our systems, but this is not a special threat for libraries. Another risky scenario is an accidental leak of data or even an intentional release of data that is *believed* to be scrubbed of personally identifiable information, but which still can be reconstructed.
  14. [Becky] For example: in 2006, AOL released millions of search queries to researchers. AOL believed that they had removed PII from the data they released, but the search logs contained so many specific queries (for addresses, institutions, etc.) that researchers were able to reconstruct specific identities from various search terms that they identified as belonging to distinct persons. This data may not have been used for the research AOL originally intended but it proved an important point about privacy.
  15. [Becky] The New York Times profiled one of the cases. Some of the terms that belonged to user 4417749 were fairly generic and may only give you hints about the user’s age, marital status, and that he or she was probably the owner of a dog – an incontinent dog. There was also a flurry of search terms that included the same last name. Researchers knew from previous research that people tend to search a lot for themselves and relatives and so they inferred that since so many search terms were for the last name “Arnold,” user 4417749 might be an Arnold. Finally, they found search terms that provided more clues about 4417749’s whereabouts, which happened to be a very small town in Georgia. By process of elimination, then, they were able to pinpoint … Thelma Arnold, a 60-something widowed dog-owning Georgian who owned up to all the search terms that AOL had labeled as belonging to 4417749. The article does not explain if Thelma ever found a treatment for her dog’s frequent urination problem.
  16. [Stephen] For a marketer, this story of Target is a great one, as it relates to targeted (no pun intended) marketing: From Business Insider: Target broke through to a new level of customer tracking with the help of statistical genius Andrew Pole, according to a New York Times Magazine cover story by Charles Duhigg. Pole identified 25 products that when purchased together indicate a woman is likely pregnant. The value of this information was that Target could send coupons to the pregnant woman at an expensive period of her life. Plugged into Target's customer tracking technology, Pole's formula was a beast. Once it even exposed a teen girl's pregnancy: [A] man walked into a Target outside Minneapolis and demanded to see the manager. He was clutching coupons that had been sent to his daughter, and he was angry, according to an employee who participated in the conversation. “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?” The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again. On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.” It doesn’t get much more targeted, and some will say, intrusive, than this. But this is a great example of how specific data analysis can get.
  17. [Stephen] According to Pew: One of the foundational principles of librarians is supporting the privacy of patrons. Librarians have long resisted keeping or sharing records of the book-borrowing or computer-using activities of their patrons. However, in the age of book-recommendation practices on all kinds of websites, many patrons are comfortable with the idea of getting recommendations from librarians based on their previous book-reading habits. In a 2012 survey, 64% of respondents said they would be interested in personalized online accounts that provide customized recommendations for books based on their past library activity. Some 29% said they would be “very likely” to use a service if it were made available by their library. Indeed, our own statistics show that service is chosen by 34% of the new borrowers over the last 30 days.
  18. [Becky] If AOL had recognized, as we’ve discussed, that PII is actually a two-part concept, they would not have been surprised that even if you strip out PII-1, PII-2 may still be used to reconstruct an identity.
  19. [Becky] How can we get to the sweet spot where we know enough about what patrons are doing with our stuff but not so much that we lose their trust or risk violating the agreements we have with them? We now turn to the concept of data de-identification. This is a very tricky thing to get right. A NIST white paper outlines definitions, approaches, and considerations for a data de-identification approach. The concept of de-identifying data also exists in the privacy juggernaut that is HIPAA – in healthcare, analyzing patient data is critical for developing new treatments and new approaches; nevertheless personal identity is sacrosanct and must be protected at all times. In the article “Using Lessons From Health Care to Protect the Privacy of Library Users: Guidelines for the De-Identification of Library Data Based on HIPAA,” the authors enumerate the specific data elements that are considered to constitute PII according to HIPAA, which can be used as a general guideline for looking at library borrower data. While the original project leads were not aware of the article at the time, both the project and the paper share similar approaches to PII in libraries.
  20. [Stephen] Here is an example of one of those paper and Excel and abacus methods… and it refers to those largely in their 20s. This was a grant-funded project by the Allen Foundation in 2013 and 2014. The goal was to re-engage the Millennial population. But if we didn’t in some way collect patron data, in this case identifiable by age and borrowing statistics, we could not have known whether we were meeting the goals of the project and fulfilling the data reporting requirement of the grant.
  21. [Stephen] In this case the goal was to increase usage by Millennials by 15% over the summer of 2014 through targeted marketing and programs. One way to provide better targeting is through segmentation. We engaged an agency to perform the research. 5 subsegments were created from the Millennial demographic. Programs were then created based on the market research. Metrics were tracked on activity for all those cardholders that fell into the Millennial segment, defined as those ages 18-30 at the time of the grant. We developed our successful pop-up library program as well as a successful cooking program based on this market research, and we actually met our goals for the grant.
  22. [Stephen] This is an example of one of the personas that was created as a result of this market segmentation. De-identified data was collected and then mapped to personas that were created during the project. Aruna Kumar is a persona representing this cluster of users, in this case, digital only.
  23. [Stephen] The Horizon database had exports of usage of a few key items. The database had flags based on age. The tracked items were new borrowers, Boopsie logins, checkouts, Freegal downloads, and Overdrive downloads. We were able to track usage with this method. We’d get weekly snapshots and then we would trend the data in Excel. This, though, was the lead-up to the better method of de-identification that Becky is about to walk you through.
  24. [Becky] The Millennial Project, in hindsight, was the first iteration of the Data Warehouse that we are now currently building. We knew that this data was valuable and we proved that by using this data we could achieve real world impact. Nonetheless, we needed to evolve the Warehouse to balance patron privacy and data analysis. This next section will discuss how we are dealing with both PII-1 and PII-2 in our data de-identification scheme.
  25. [Becky] Let’s start with PII-1 information. Using the age vs. date of birth is a simple example of how you can keep data that is just as useful for us and that significantly reduces the risk of identifying a patron. Say we have the information that a non-fiction DVD was checked out by a patron. We can record that the patron was 41 years old on April 1, 2016, or we can record that their date of birth was March 15, 1975. Which is better? The age is. A person who is 41 years old on 4/1/16 could have been born on any date from 4/2/1974 to 4/1/1975. There are a lot more people whose birthdays fall in that yearlong span than there are people whose birthdays are specifically on 3/15/75, so using the age is a lot “blurrier” and less identifiable. But if we’re interested in knowing what kinds of materials appeal to different ages, the age is just as good as the DOB.
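As a rough illustration of the age-versus-DOB trade-off, the sketch below computes an age from a date of birth and then shows the range of birth dates that the stored age is consistent with. The function names are invented for this example and are not part of SPL's warehouse code.

```python
from datetime import date, timedelta


def age_on(dob, on_date):
    """Age in whole years on on_date."""
    years = on_date.year - dob.year
    # Subtract one if the birthday has not yet occurred in on_date's year.
    if (on_date.month, on_date.day) < (dob.month, dob.day):
        years -= 1
    return years


def possible_birth_dates(age, on_date):
    """Earliest and latest DOB consistent with a given age on on_date.

    Simplification: assumes on_date is not February 29.
    """
    latest = on_date.replace(year=on_date.year - age)
    earliest = on_date.replace(year=on_date.year - age - 1) + timedelta(days=1)
    return earliest, latest


checkout_day = date(2016, 4, 1)
age = age_on(date(1975, 3, 15), checkout_day)
earliest, latest = possible_birth_dates(age, checkout_day)
print(age, earliest, latest, (latest - earliest).days + 1)
```

Storing only the age replaces one exact birthday with a window of roughly a year of possible birthdays, which is the "blurring" described above.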
  26. [Becky] We’re also interested in hitting the right granularity of information about our materials. Using the full call number is obviously far too specific. There are many items in our collection that would have a unique call number, which would be just as bad as us saving the title of something you checked out. The format goes too far in the opposite direction—it’s so vague that it gives us hardly any useful information about what materials are popular. For example, recent popular movies, foreign-language TV shows, documentaries, and instructional DVDs could all appear to have the same format. If all these are lumped together, we’d probably just see that DVDs are popular among all groups, which is not very useful. The collection code seems promising, but as we have them set up, it is also too specific. Some collections, like adult fiction or even adult sci-fi/fantasy are quite large. But other collections are as specific as “beginning ESL.” This is too much “intellectually identifiable info” as we think of it, about what the patron is interested in. Truncating the call numbers appears to be the best match for what we want. This gives us separate buckets for adult fiction or YA fiction, and Dewey decades or centuries, but wouldn’t boil down to individual items.
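A call-number truncation step might look like the sketch below. It is only an illustration: the function name, the `dewey_digits` parameter, and the assumption that non-Dewey call numbers begin with a collection prefix such as "FIC" are ours, not details of SPL's actual transform.

```python
import re


def truncate_call_number(call_number, dewey_digits=2):
    """Reduce a call number to a coarse bucket.

    Dewey numbers keep only their leading digits ("914.30487 F683" -> "91*").
    Anything else keeps only its first token (assumed to be a prefix like "FIC").
    """
    call_number = call_number.strip()
    match = re.match(r"\d{3}", call_number)
    if match:
        # Mask the digits we drop so the bucket width stays visible.
        return match.group(0)[:dewey_digits] + "*" * (3 - dewey_digits)
    tokens = call_number.split()
    return tokens[0].upper() if tokens else ""


print(truncate_call_number("914.30487 F683"))
print(truncate_call_number("914.30487 F683", dewey_digits=1))
print(truncate_call_number("FIC HELLER"))
```

Choosing `dewey_digits` is the granularity decision discussed above: two digits keeps Dewey decades, one digit keeps only the century.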
  27. [Becky] Our everyday systems keep track of the exact time and date of an action. For example, a patron used a computer from 11:02am to 11:46am on April 11. We have no use for this level of detail in our analysis of patron behavior. We’d be interested to know that someone used a computer for 44 minutes on a Monday, though. By keeping the session length and the date, we can achieve that while not maintaining a record of someone’s whereabouts at a specific time and place. You might think we’d be interested in knowing how many sessions start between 10 and 11am or 11am and 12pm, and so forth. That’s true, but we don’t need to know about the patrons who are doing those sessions at the same time, so that kind of data doesn’t belong in *this* database, our database for longitudinal analysis.
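The reduction from exact timestamps to a date plus a session length can be sketched as follows. The field names and the `summarize_session` helper are hypothetical; the example times are the ones from the note above.

```python
from datetime import datetime

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def summarize_session(start, end):
    """Keep only the date, weekday, and length in whole minutes.

    The start and end times of day are deliberately discarded.
    """
    minutes = int((end - start).total_seconds() // 60)
    return {
        "date": start.date().isoformat(),
        "weekday": WEEKDAYS[start.weekday()],
        "minutes": minutes,
    }


session = summarize_session(
    datetime(2016, 4, 11, 11, 2),
    datetime(2016, 4, 11, 11, 46),
)
print(session)
```

The stored record still supports questions about session length and day of week, but no longer places a person at a computer at a specific time.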
  28. [Becky] So, what is the database we are building? In creating a data warehouse, you do something called ETL, which is extract, transform, and load. You take the data from the operational database where it was being collected, you do whatever you need to change it into the kind of data you want to store for analysis long term, and then you load it into where you’re going to store it. Here you can see a sample of the data that is stored in the ILS, data about a patron and an item that has circulated. [x] This is stored in separate tables. [x] Here you can see the data that is not personally or intellectually identifiable that we want to move. In the case of the call number, we have to transform it as we move it. [x] Here is it, transformed and pulled together into records about each transaction.
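The extract-transform-load step described above could be sketched roughly like this. Field names, the helper functions, and the dropped values (name, barcodes, title, author, call number) are made up for illustration; only the kept values mirror the sample row on the slide.

```python
import re

# Fields judged non-identifying on the slide; everything else is dropped.
PATRON_FIELDS_TO_KEEP = ("age", "gender", "zip", "reg_year")
ITEM_FIELDS_TO_KEEP = ("item_type", "branch", "date")


def coarse_class(call_number):
    """Dewey century ("782.42 B369" -> "700") or leading prefix ("FIC")."""
    match = re.match(r"\d{3}", call_number.strip())
    if match:
        return match.group(0)[0] + "00"
    tokens = call_number.split()
    return tokens[0].upper() if tokens else ""


def transform(patron, checkout):
    """Build one warehouse row from an ILS patron record and checkout record."""
    row = {field: patron[field] for field in PATRON_FIELDS_TO_KEEP}
    row.update({field: checkout[field] for field in ITEM_FIELDS_TO_KEEP})
    row["dewey_100"] = coarse_class(checkout["call_number"])
    return row


# Extract: hypothetical records as they might sit in the operational system.
patron = {"barcode": "0000000000", "name": "Pat Example", "age": 45,
          "gender": "Male", "zip": "98117", "reg_year": 2004}
checkout = {"barcode": "1111111111", "title": "Example Title",
            "author": "Example Author", "call_number": "782.42 B369",
            "item_type": "CD", "branch": "CEN", "date": "4/1/15"}

# Transform, then load (here the "warehouse" is just a list).
warehouse = [transform(patron, checkout)]
print(warehouse[0])
```

Using an allow-list of fields, rather than a list of fields to remove, means a new identifying column added to the source system is excluded by default.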
  29. [Becky] The most secure thing for us to do would be to record all of the transactions with the age and home branch of the patron, and then you would have no connection to individual patrons. The problem with this is that it prevents us from answering some of our most important questions, like “do people who check out ebooks still use print?” We need to be able to match up those transactions, but we just need to know that person A is person A and person B is person B, not that John Smith is person A and Jane Doe is person B. That is essential for longitudinal analysis. The best way we’ve found to do this is to record the borrower ID of the person who did the transaction, but to obfuscate it using some hashing algorithms, with a pinch of salt… while we are not making corned beef hash, it is a good representation of what we are doing. A hash is a way to ensure that borrower number 12345 is always turned into the same disguised ID, but there is no way to calculate it in the reverse. We are working with SHA-1 for now due to system constraints, but we are evaluating SHA-2’s compatibility within the warehouse. This isn’t bulletproof, because if you had access to our ILS database, you could repeat this procedure to learn the disguised version of that patron’s ID. However, this attacker would have to gain access to multiple, completely separate databases to pull that off. https://www.flickr.com/photos/accidentalhedonist/5707034712/
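A salted hash of a borrower ID can be sketched with Python's standard `hashlib`. This is a minimal illustration under stated assumptions: the talk does not say how the salt is combined with the ID or how the digest is encoded, so prepending the salt and Base64-encoding the result are our choices, and the salt value below is a placeholder.

```python
import base64
import hashlib


def deidentify(borrower_id, salt, algorithm="sha1"):
    """Return a stable, disguised ID for a borrower.

    algorithm defaults to SHA-1 to mirror the talk; "sha256" (SHA-2) is a
    drop-in alternative. The salt must be kept secret and stored apart
    from the warehouse.
    """
    data = salt + borrower_id.encode("utf-8")
    digest = hashlib.new(algorithm, data).digest()
    return base64.b64encode(digest).decode("ascii")


SALT = b"placeholder-secret-salt"  # hypothetical value for illustration only

a_first = deidentify("12345", SALT)
a_again = deidentify("12345", SALT)
b_first = deidentify("12346", SALT)
print(a_first == a_again, a_first == b_first, len(a_first))
print(len(deidentify("12345", SALT, algorithm="sha256")))
```

The same borrower always maps to the same DeID, so transactions can be linked over time, while anyone without the salt cannot simply hash a list of known borrower numbers to find matches.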
  30. [Becky] What this adds up to is a “belt and suspenders” approach, or an approach that has redundant failsafes baked into it. If you wanted to use our system to gain access about a specific patron, you’d first need to breach our ILS to find that patron’s identity and ID. Then you’d need to somehow recreate the hash algorithm we’d used to find out how that patron is represented in our data warehouse, and then breach the data warehouse, and then look them up there. Not to get computer science-y on you, but assuming you could even manage the part with the hashing, you’d still need to break into two databases, and this is our lower risk scenario! Okay, what about our government opponent who can force us to give them access and identify the patrons to them? Even then, remember that we have no details about titles, specific transactions, websites anyone visited, exact times or places that things happened. We only have a record of the fact that certain kinds of transactions occurred. And what about leaks? We do not plan to wholesale share this information with anyone without first going through a rigorous vetting process, and staff already have strict, clear policies about how, when, and why they can access patron data—these policies are already in place for our ILS and would apply equally to this data.
  31. [Becky] We are still in the early stages of the Data Warehouse; we have a few datasets in there, including door counts, computer session usage, branch information, and some preliminary ILS circulation and patron demographic data. Nonetheless the data we do have in the Warehouse has already proved valuable.
  32. [Becky] Our libraries have both “Public Internet” workstations, which can be reserved for up to 90 minutes per day, and “Express” workstations, which can be used for 15 minutes without prior reservation. Patrons are permitted up to 90 minutes of computer use per day regardless of the type of workstation they use. Library staff began to report that patrons were “frequently” logging on to the 15-minute Express workstations multiple times in succession, to the detriment (and annoyance) of patrons who were waiting patiently for a computer to open up. It seemed that some patrons had figured out that they could chain together up to six 15-minute sessions at the Express computers. This was obviously not the intended purpose of the Express stations. However, all we had to rely upon was anecdotal information from a few locations. With our traditional data collection practices, we had no way of telling whether any given run of six sessions on an Express workstation came from the same patron or from six different patrons.
  33. [Becky] Before de-identification, we only had the total number of sessions and minutes that Express workstations were used per day per branch. The de-identified data, however, gave us our answer. Since we now had public computer session data recorded along with a patron’s “DeID,” we could easily tell that, indeed, many, many patrons at all library locations were logging into Express workstations 6 times per day for 15 minutes per session. We couldn’t tell specifically who the patrons were, but we could tell when it was the same patron logging on 6 times versus when it was 6 distinct patrons each logging on for 15 minutes.
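The distinction drawn in this note (one patron logging on six times versus six patrons logging on once each) can be sketched as a simple count of Express sessions per DeID per day. The record layout, branch codes, and DeID values below are made up for illustration and are not SPL's actual schema or data.

```python
from collections import Counter

DAY = "2016-03-01"

# Hypothetical de-identified session records:
# (deid, date, branch, workstation_type, minutes)
sessions = (
    # One patron chaining six Express sessions at one branch
    [("a1f3", DAY, "BR1", "express", 15)] * 6
    # Six different patrons, one Express session each, at another branch
    + [(f"p{i}", DAY, "BR2", "express", 15) for i in range(6)]
)


def express_sessions_per_patron_day(records):
    """Count Express sessions for each (deid, date) pair."""
    counts = Counter()
    for deid, day, _branch, kind, _minutes in records:
        if kind == "express":
            counts[(deid, day)] += 1
    return counts


counts = express_sessions_per_patron_day(sessions)

# Patron-days with more than one Express session
repeaters = {key for key, n in counts.items() if n > 1}
```

Both branches show six sessions and 90 minutes in the aggregate totals; only the per-DeID count separates the two cases.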
  34. [Becky] The new way of collecting data allowed us to determine that this problem was, indeed, rampant. Over 25% of patrons used Express workstations more than 15 minutes per day. We made the decision to change the way our system handled public computer sessions. Now, patrons get 90 minutes per day on reservable public workstations, but only 15 minutes per day on Express workstations. This essentially means that patrons qualify for a total of 105 minutes per day on our computers, but it was decided that this was a fairer and more equitable way to allocate time given the abuses we were seeing of the Express workstations.
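The difference between the old shared allowance and the new per-type allowances can be illustrated with a short sketch. The limits come from these notes; the function names and the shape of the `used` dictionary are assumptions made for the example, not a description of the library's session management software.

```python
# Old rule: one shared 90-minute daily allowance across all workstations.
OLD_SHARED_LIMIT = 90

# New rule: separate daily allowances per workstation type.
NEW_LIMITS = {"reservable": 90, "express": 15}


def remaining_old(used: dict) -> int:
    """Minutes left today under the shared 90-minute allowance."""
    return max(0, OLD_SHARED_LIMIT - sum(used.values()))


def remaining_new(used: dict, station_type: str) -> int:
    """Minutes left today on one workstation type under the new rule."""
    return max(0, NEW_LIMITS[station_type] - used.get(station_type, 0))


# A patron who has just finished one 15-minute Express session.
used_today = {"express": 15}

old_left = remaining_old(used_today)                       # could keep chaining
new_express_left = remaining_new(used_today, "express")    # Express is used up
new_reservable_left = remaining_new(used_today, "reservable")
total_new_allowance = sum(NEW_LIMITS.values())
```

Under the old rule this patron still has 75 minutes that could be spent on five more Express sessions; under the new rule the Express allowance is exhausted while the full reservable allowance remains.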
  35. [Becky] In summary, we are still in the early days of the Data Warehouse; however, we are certain that we are going in the right direction in terms of its architecture and our methodology to de-identify patron information. Storing different types of information in different locations while avoiding keeping any PII-1 or 2 data helps mitigate the risks of storing such data in house. At the same time, we have enough detail that the data can be used to impact real-world operations. Finally, we have control over what we provide to third-party customer relationship management systems or other external systems.
  36. [Becky] We hope to have shown here that there is not necessarily a clear-cut conflict between patron privacy and being able to make data-driven decisions. If we are clever about it, we can maintain the promises we’ve made to patrons about their privacy while keeping enough “blurry” data to get some value out of it. It doesn’t need to be a data-driven approach VERSUS patron privacy; it can be data-driven PLUS patron privacy.
  37. [Author Comment - we moved this slide out of the deck in case we will not have room to talk about this historical issue; nonetheless, for those of you playing along at home, this is a reminder that we also have physical security vulnerabilities to address.] In discussing this issue, we’ve also discovered that our historic practices are far from perfect. Several security researchers in academic computer science, in discussions with us, quickly spotted serious flaws in our systems. One example is our holds system. A patron places a hold on an item. Usually we have a record of their intellectual pursuit for the three-week period of their checkout, but sometimes they wait for something on hold much longer than that. We’re holding a record of their intellectual inquiry that is vulnerable to law enforcement or government demands, sometimes for months at a time. Then we send them a plaintext email about that request and put the item on the shelf with a tag that only shows four letters of their last name and three of their first name. Of course, if you’re my colleague KAY AOKI, that’s your whole name. The item is on an open holds shelf that anyone can wander by. One security researcher had an off-the-cuff idea to improve this process. He said we should give each request a QR code that tracks it and is required to pick it up. Then the request is associated with that code and NOT with a patron record. This would be very secure! The idea has a few flaws though. First, there would be no way to contact the patron to tell them that their item is ready because we don’t know who they are. They’d just have to check for it periodically. Second, the patron would have to NOT LOSE their QR code in order to claim the item. Obviously, both of these assumptions are unrealistic from a customer service point of view. Third, it’s a QR code. 
Maybe this idea could be refined and become more practical, but we liked this story as an example of how real-world customer service and operations are sometimes incompatible with ideal solutions.