1. Managing Confidential Information in Research
Micah Altman
Senior Research Scientist
Institute for Quantitative Social Science
Harvard University
4. Personally identifiable private
information is surprisingly common
Includes information from a variety of
sources, such as…
Research data, even if you aren't the
original collector
Student “records” such as e-mail, grades
Logs from web-servers, other systems
Lots of things are potentially identifying:
Under some federal laws: IP
addresses, dates, zipcodes, …
Birth date + zipcode + gender uniquely
identify ~87% of people in the U.S.
[Sweeney 2002]
With date and place of birth, can guess
first five digits of social security number
(SSN) > 60% of the time. (Can guess the
whole thing in under 10 tries, for a
significant minority of people.) [Acquisti & Gross 2009]
Analysis of writing style or eclectic tastes has been used to identify individuals
Tables, graphs and maps can also reveal identifiable information [Brownstein et al. 2006, NEJM 355(16)]
4 [Micah Altman, 3/10/2011]
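The uniqueness claims above can be checked mechanically: count how many records share each (birth date, zipcode, gender) combination, and see what fraction of records are unique on those fields alone. A minimal sketch in Python over hypothetical toy records (not real data); the same computation, run over population data, is what produces figures like the ~87% cited above.

```python
from collections import Counter

# Hypothetical toy records (birth date, zipcode, gender) -- not real data.
records = [
    ("1961-01-01", "02145", "M"),
    ("1961-01-01", "02145", "M"),  # shares its combination with the row above
    ("1972-11-11", "94043", "M"),
    ("1972-03-25", "94041", "F"),
]

# Count how many records share each quasi-identifier combination.
counts = Counter(records)

# A record is re-identifiable by these fields alone if its combination is unique.
unique = [r for r in records if counts[r] == 1]
fraction_unique = len(unique) / len(records)
print(fraction_unique)  # 0.5: half of these toy records are unique
```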
6. IQSS (and affiliates) offer you support across all stages of your
quantitative research:
Research design, including:
design of surveys, selection of statistical methods.
Primary and secondary data collection, including:
the collection of geospatial and survey data.
Data management, including:
storage, cataloging, permanent archiving, and distribution.
Data analysis, including:
statistical consulting, GIS consulting, high performance research computing.
http://iq.harvard.edu/
7. The IQSS grants administration team helps with every aspect of
the grant process. Contact us when you are planning your
proposal.
Assisting in identifying research funding opportunities
Consulting on writing proposals
Assisting IQSS affiliates with:
preparation, review and submission of all grant applications
(“pre-award support”)
management of their sponsored research portfolio
(“post-award support”)
Interpret sponsor policies
Coordinate with FAS Research Administration and the Central Office for
Sponsored Programs
… And, of course, support seminars like this!
8. Goals for course
Overview of key areas
Identify key concepts & issues
Summarize Harvard
policies, procedures, resources
Establish framework for action
Provide connection to resources, literature
9. Outline
[Preliminaries]
Law, policy, ethics
Research methods, design, management
Information Security (Storage, Transmission, Use)
Disclosure Limitation
[Additional Resources & Summary of Recommendations]
10. Steps to Manage Confidential Research Data
Identify potentially sensitive information in planning
Identify legal requirements, institutional requirements, data use agreements
Consider obtaining a certificate of confidentiality
Plan for IRB review
Reduce sensitivity of collected data in design
Separate sensitive information in collection
Encrypt sensitive information in transit
Desensitize information in processing
Removing names and other direct identifiers
Suppressing, aggregating, or perturbing indirect identifiers
Protect sensitive information in systems
Use systems that are controlled, securely configured, and audited
Ensure people are authenticated, authorized, licensed
Review sensitive information before dissemination
Review disclosure risk
Apply non-statistical disclosure limitation
Apply statistical disclosure limitation
Review past releases and publicly available data
Check for changes in the law
Require a use agreement
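One common way to implement the "separate sensitive information" and "remove direct identifiers" steps above is keyed pseudonymization: replace names with an HMAC computed under a key that is stored apart from the analytic file. A hedged sketch, with hypothetical field names and records:

```python
import hashlib
import hmac
import secrets

# The key lives apart from the analytic file (e.g., with an approved
# custodian); without it, pseudonyms cannot be reversed or regenerated.
key = secrets.token_bytes(32)

def pseudonym(identifier: str) -> str:
    # Keyed hashing (HMAC-SHA256) is stable within a study but, unlike a plain
    # unsalted hash, cannot be brute-forced from a list of guessable names.
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical record: replace the direct identifier before storage/analysis.
row = {"name": "A. Jones", "zipcode": "02145", "outcome": 0}
deidentified = {"pid": pseudonym(row.pop("name")), **row}
print(sorted(deidentified))  # ['outcome', 'pid', 'zipcode']
```

Note that pseudonymization only removes the direct identifier; the remaining quasi-identifiers still need suppression, aggregation, or perturbation as the steps above describe.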
12. Law, Policy & Ethics
Ethical Obligations
Laws
Fun and games
Harvard Policies
[Summary]
13. Confidentiality & Research Ethics
Belmont Principles
Respect for Persons
individuals should be treated as autonomous agents
persons with diminished autonomy are entitled to protection
implies “informed consent”
implies respect for confidentiality and privacy
Beneficence
research must have individual and/or societal benefit to justify
risks
implies minimizing risk/benefit ratio
14. Scientific & Societal Benefits of Data Sharing
Increases replicability of research
Journal publication policies may apply
Increases scientific impact of research
Follow up studies
Extensions
Citations
Public interest in data produced by public funder
Funder policies may apply
Public interest in data that supports public policy
FOIA and state FOI laws may apply
Open data facilitates…
Transparent government
Scientific collaboration
Scientific verification
New forms of science
Participation in science
Hands-on education
Continuity of research
Sources: Fienberg et al. 1985; ICSU 2004; Nature 2009
15. Sources of Confidentiality Restrictions for University Research Data
Overlapping laws
Different laws apply to different cases
All affiliates subject to university policy
(Not included: EU directive, foreign laws, classified data, …)
16. 45 CFR 46 [Overview]
"The Common Rule"
Governs human subject research
With federal funds/ at federal institution
Establishes rules for conduct of research
Establishes confidentiality and consent requirement
for identified private data
However, some information may be required to be
disclosed under state and federal laws (e.g. in cases
of child abuse)
Delegates procedural decisions to Institutional
Review Boards (IRBs)
17. HIPAA [Overview]
Health Insurance Portability and Accountability Act
Protects personal health care information for "covered entities"
Detailed technical protection requirements
Provides clearest legal standards for dissemination
Provides a "safe harbor"
Has become an accepted practice for dissemination in
other areas where laws are less clear
HITECH Act of 2009 extends HIPAA
Extends coverage to associated entities of covered entities
Additional technical safeguards
Adds breach reporting requirement
HIPAA provides three dissemination options …
18. Dissemination under HIPAA [Option 1]
"safe harbor" -- remove 18 identifiers
[Personal identifiers]
Names
Social Security #'s; personal account #'s; certificate/license #'s; full-face photos (and comparable images); biometric IDs; medical record #'s
Any other unique identifying number, characteristic, or code
[Asset identifiers]
Fax #'s; phone #'s; vehicle #'s
Personal URLs; IP addresses; e-mail addresses
Device IDs and serial numbers
[Quasi identifiers]
Dates smaller than a year (and ages > 89 collapsed into one category)
Geographic subdivisions smaller than a state (except for the first 3 digits of zipcode, if the unit > 20,000 people)
And
Entity does not have actual knowledge [direct and clear awareness] that it
would be possible to use the remaining information alone or in
combination with other information to identify the subject
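The quasi-identifier rules above (dates coarsened, ages over 89 collapsed, zip codes truncated to 3 digits) amount to simple generalization transforms. An illustrative fragment follows; the function names are invented for this example, and this is a sketch of two of the rules, not a complete implementation of all 18 safe harbor identifier categories.

```python
# Illustrative safe-harbor-style generalizations (hypothetical helper names).

def generalize_age(birth_year: int, current_year: int = 2011) -> str:
    # Ages over 89 are collapsed into a single top-coded category.
    age = current_year - birth_year
    return "90+" if age > 89 else str(age)

def generalize_zip(zipcode: str) -> str:
    # Keep only the initial 3 digits -- and under safe harbor, only when the
    # corresponding geographic unit contains more than 20,000 people.
    return zipcode[:3] + "**"

print(generalize_age(1920), generalize_zip("02145"))  # 90+ 021**
```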
19. Dissemination under HIPAA [Option 2]
"limited dataset" – leave some quasi-IDs
Remove personal and asset identifiers
Permitted dates: dates of birth, death, service, years
Permitted geographic subdivisions: town, city, state, zip code
And
Require access control and data use agreement.
20. Dissemination under HIPAA [Option 3]
"qualified statistician" – statistical determination
Have qualified statistician determine, using generally
accepted statistical and scientific principles and
methods, that the risk is very small that the information
could be used, alone or in combination with other
reasonably available information, by the anticipated
recipient to identify the subject of the information.
Important caveats
Methods and results of the analysis must be documented
No bright line for “qualified”, text of rule is:
“a person with appropriate knowledge of and experience with
generally accepted statistical and scientific principles and
methods for rendering information not individually identifiable.”
[Section 164.514(b)(1)]
No clear definitions for “generally accepted” or “very small” or
“reasonably available information”…
however, there are references in the federal register to
statistical publications to be used as “starting points”
21. FERPA
Family Educational Rights and Privacy Act
Applies to schools that receive federal (D.O.E.) funding
Restricts use of student (not employee) information
Establishes
Right to privacy of educational records
Right to inspect and correct records (with appeal to Federal government)
Definition of public “directory” information
Right to block access to public “directory” information, and to other records
Educational records include:
Identified information about student
Maintained by institution
Not …
Employee records
Some medical and law-enforcement records
Records solely in the possession and for use by the creator (e.g. unpublished instructor notes)
Personally identifiable information includes:
Direct identifiers
Indirect (quasi) identifiers
Indirectly linkable identifiers
“Information requested by a person who the educational agency or institution reasonably
believes knows the identity of the student to whom the education record relates.”
22. MA 201 CMR 17
Standards for the Protection of Personal Information
Strongest U.S. general privacy protection law
Has been delayed/modified repeatedly
Requires reporting of breaches
If data is not encrypted
Or encryption key is released in conjunction with data
Requires specific technical protections:
Firewalls
Encryption of data transmitted in public
Anti-virus software
Software updates
23. Inconsistencies in Requirements and Definitions
Inconsistent definitions of "personally identifiable"
Inconsistent definitions of sensitive information
Requirements to de-identify do not always jibe with statistical realities
Coverage:
FERPA: students in educational institutions
HIPAA: medical information in "covered entities"
Common Rule: living persons in research by funded institutions
MA 201 CMR 17: Mass. residents
Identification criteria:
FERPA: direct, indirect, linked, bad intent (!)
HIPAA: direct, indirect, linked
Common Rule: direct, indirect, linked
MA 201 CMR 17: direct
Sensitivity criteria:
FERPA: any non-directory information
HIPAA: any medical information
Common Rule: private information, based on harm
MA 201 CMR 17: financial, state, federal identifiers
Management requirements:
FERPA: directory opt-out; [implied] good practice
HIPAA: consent; specific technical safeguards; breach reporting
Common Rule: consent; [implied] risk minimization
MA 201 CMR 17: specific technical safeguards; breach reporting
24. Third Party Requirements
Licensing requirements
Intellectual property requirements
Federal/state law and/or policy requirements
State protection of personal information laws
Freedom of information laws (FOIA & State FOI)
State mandatory abuse/neglect notification laws
And … think ahead to publisher requirements
Replication requirements
IP requirements
Examples
NSF requires data from funded research be shared
NIH requires a data sharing plan for large projects
Wellcome Trust requires a data sharing plan
Many leading journals require data sharing
25. (Some) More Laws & Standards
California Laws
Lots of rules
Applies to any data about California residents
Privacy policy
Disclosure
Reporting policy
EU Directive 95/46/EC
Data protection directive
Provides for notice, limits on purpose of use, consent, security, disclosure, access, accountability
Forbids transfer of data to entities in countries not compliant with the directive
U.S. is not compliant, but organizations can certify compliance with the FTC
No auditing/enforcement!
Substantial criticism of this arrangement
Payment Card Industry (PCI) Security Standards
Governs treatment of credit card numbers
Requires reports, audits, fines
Detailed technical measures
Not a law, but helps define good practice; large penalties
Nevada law mandates PCI standards
FISMA
Federal Information Security Management Act (FISMA), Public Law (P.L.) 107-347
Detailed technical controls over information systems
Is starting to be applied to NIH sponsored research
Sarbanes-Oxley (aka SOX, aka SARBOX)
Corporate and Auditing Accountability and Responsibility Act of 2002
Applies to U.S. public company boards, management, and public accounting firms
Rarely applies to research in universities
Section 404 requires annual assessment of organizational internal controls – but does not specify details of controls
Classified Data
Separate and complex rules and requirements
The University does not accept classified data
But may have "Controlled But Unclassified":
Vaguely defined area
Mostly government-produced
Penalties unclear
And… export-controlled information, under ITAR and EAR
Export controls include technologies and software; documentation/design documents may be included
… and over 1100 international human subjects laws…
26. Predicted Legal Changes for 2011…
Caselaw
“personal privacy” does not apply to information about
corporations (a corporation is not a “person” for this
purpose)
FCC v. AT&T (2011)
Scheduled
EU “cookie privacy” directive 2009/136/EC goes into
effect
Proposed updates to EU information privacy directives
Very Likely
New information privacy laws in selected states in 2011
Likely
Increased federal regulation of internet privacy
27. What's wrong with this picture?

Name      SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes committed
A. Jones  12341  01011961   02145    M       Raspberry           0
B. Jones  12342  02021961   02138    M       Pistachio           0
C. Jones  12343  11111972   94043    M       Chocolate           0
D. Jones  12344  12121972   94043    M       Hazelnut            0
E. Jones  12345  03251972   94041    F       Lemon               0
F. Jones  12346  03251972   02127    F       Lemon               1
G. Jones  12347  08081989   02138    F       Peach               1
H. Smith  12348  01011973   63200    F       Lime                2
I. Smith  12349  02021973   63300    M       Mango               4
J. Smith  12350  03031973   63400    M       Coconut             16
K. Smith  12351  03031974   64500    M       Frog                32
L. Smith  12352  04041974   64600    M       Vanilla             64
M. Smith  12353  04041974   64700    F       Pumpkin             128
N. Smith  12354  04041974   64800    F       Allergic            256
28. What's wrong with this picture?
(Column roles annotated on the slide: Name – identifier; SSN – sensitive private identifier; Birthdate, Zipcode, Gender – identifiers; # of crimes committed – sensitive private information)

Name      SSN    Birthdate  Zipcode  Gender  Favorite Ice Cream  # of crimes  Annotation
A. Jones  12341  01011961   02145    M       Raspberry           0            Mass resident
B. Jones  12342  02021961   02138    M       Pistachio           0
C. Jones  12343  11111972   94043    M       Chocolate           0            Californian
D. Jones  12344  12121972   94043    M       Hazelnut            0            Twins, separated at birth?
E. Jones  12345  03251972   94041    F       Lemon               0
F. Jones  12346  03251972   02127    F       Lemon               1            FERPA too?
G. Jones  12347  08081989   02138    F       Peach               1
H. Smith  12348  01011973   63200    F       Lime                2
I. Smith  12349  02021973   63300    M       Mango               4
J. Smith  12350  02021973   63400    M       Coconut             16
K. Smith  12351  03031974   64500    M       Frog                32
L. Smith  12352  04041974   64600    M       Vanilla             64           Unexpected response?
M. Smith  12353  04041974   64700    F       Pumpkin             128
N. Smith  12354  04041974   64800    F       Allergic            256
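The problem the annotations point at can be found mechanically: grouping the table by its quasi-identifiers shows which combinations single out exactly one person (k = 1). A sketch over the first six (hypothetical) rows:

```python
from collections import defaultdict

# First six rows of the toy table, reduced to quasi-identifiers (hypothetical data).
rows = [
    ("A. Jones", "01011961", "02145", "M"),
    ("B. Jones", "02021961", "02138", "M"),
    ("C. Jones", "11111972", "94043", "M"),
    ("D. Jones", "12121972", "94043", "M"),
    ("E. Jones", "03251972", "94041", "F"),
    ("F. Jones", "03251972", "02127", "F"),
]

# Group records by (birthdate, zipcode, gender); a group of size 1 means that
# combination singles out exactly one person (k = 1).
groups = defaultdict(list)
for name, birthdate, zipcode, gender in rows:
    groups[(birthdate, zipcode, gender)].append(name)

singled_out = sorted(name for members in groups.values()
                     if len(members) == 1 for name in members)
print(len(singled_out))  # 6: every row is unique on these three fields alone
```

Even after names and SSNs are dropped, every one of these records would remain unique on birthdate, zipcode, and gender, so the sensitive column is still linkable to individuals.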
31. Harvard: Enterprise Security Policy (HEISP)
Storing High Risk Confidential Information (HRCI)
Must not be stored on an individual user computer or portable storage device
Human subject information
All research on human subjects must be approved by the IRB
All proposals must include a data management plan
Personally identifiable medical information (PIMI)
"Covered entities" at Harvard are subject to HIPAA requirements
PIMI is to be treated as HRCI throughout the university
Obtaining confidential information requires approval
All confidential information must be encrypted when transported across any network
Public directories must adhere to privacy preferences established by the individuals
Identifying users with access to confidential information
System owners must be able to identify users that have access to confidential information
Strong passwords
No account/password sharing
Inhibit password guessing with logging and lockouts
Limit application availability time with timeouts
Limit user access to confidential information based on business need
Confidential information on Harvard computing devices must be protected
Must be stored on "target computers" or secure locked containers
Confidential information on portable devices must be encrypted
Laptops must have encryption (some schools require whole-disk encryption)
Systems must be scanned annually
Cannot save confidential information on a computer directly accessible from the internet or open Harvard networks
Employees who have access must annually agree to confidentiality agreements
Access to lists and databases of Harvard University ID numbers is restricted
Each school must provide training
Registrars have developed a common definition of FERPA directory information
Must adhere to student requests to block their directory information, per FERPA
Accepting payment cards – restricted to procedures outlined in the HU Credit Card Merchant Handbook
[More on next page…]
32. HEISP – Part 2
Physical environment
All digital/non-digital media must be properly protected
Computers must be physically secure
Automatic logging must be consistent with written policies
Vendor contracts
Require approval by security officer
Include OGC contract rider
Computer operator
Computer must be regularly updated and operated securely
Only necessary applications installed
Annually certify compliance with university policies
Computer setup
Must filter malicious traffic
"Target" systems and controllers
Private address space; locally firewalled
Annual vulnerability scanning
Network take-down
Network managers run vulnerability scans
May take computers off the network
Service resumption
Must have a service resumption plan if loss of confidential data is a substantial business risk
Incident response policy
Disposition and destruction of records
Acquisition/use by unauthorized persons must be reported to OGC
Interacting with legal authorities -- always refer to OGC unless imminent health/safety risk requires otherwise
Web-based surveys must have protections in place
33. Harvard: Research Data Security Policy (HRDSP)
Sensitivity of research data based on potential harm if disclosed:
Level 5 = “extremely sensitive”
Level 4 = “very sensitive” ~= HRCI
Level 3 = “sensitive” ~= HCI
Level 2 = “benign” ~= Good computer hygiene
Level 1 = anonymous and not business confidential
Required protections based on sensitivity
Level 5: Entirely disconnected from network (“bubble security”)
Level 4: Protections as per HRCI
Level 3: Protections as per HCI
Level 2: Good computer hygiene
Designates procedures for treatment of external data use agreements
[ next section ]
Legally binding
Can be both very detailed and not supported by Harvard security procedures
Investigator should not sign these – forward to OSP
Designates responsibilities for IRB, Investigator, OSP, IT, Security Officers.
security.harvard.edu/research-data-security-policy
34. Harvard: Researcher Responsibilities
… for knowing the rules
… for identifying potentially confidential information in all forms
(digital/analogue; on-line/off-line)
… for notifying recipients of their responsibility to protect confidentiality
… for obtaining IRB approval for any human subjects research
… for following an IRB approved plan
… for obtaining OSP approval of restricted data use agreements with providers, even
if no money involved
… and for proper
Storage
Access
Transmission
Disposal
Confidentiality is not an “IT problem”
35. Harvard: Staff – Personnel Manual
Protect Harvard information and systems
Keep your own information in Peoplesoft up to date
Comply with copyrights and DMCA
Comply with Harvard systems policies and procedures
All information produced at work is Harvard property
Attach only approved devices to the Harvard network
harvie.harvard.edu/docroot/standalone/Policies_Contracts/Staff_Personnel_Manual/Section2/Privacy.shtml
36. Key Concepts & Issues Review
Privacy
Control over extent and circumstances of sharing
Confidentiality
Treatment of private, sensitive information
Sensitive information
Information that would cause harm if disclosed and linked to an individual
Personally/individually identifiable information
Private information
Directly or indirectly linked to an identifiable individual
Human subjects
A living person…
who is interacted with to obtain research data
whose private identifiable information is included in research data
Research
Systematic investigation
Designed to develop or contribute to generalizable knowledge
“Common Rule”
Law governing funded human subjects research
HIPAA
Law governing use of personal health information in covered and associated entities
MA 201 CMR 17
Law governing use of certain personal identifiers for Massachusetts residents
37. Checklist: Identify Requirements
Check if research includes…
Interaction with humans: Common Rule & HEISP/HRDSP apply
Check if data used includes identified…
Student records: FERPA & HEISP/HRDSP apply
State, federal, financial IDs: state law & HEISP/HRDSP apply
Medical/health information: HIPAA (likely) & HEISP/HRDSP apply
Human subjects & private info: Common Rule & HEISP/HRDSP apply
Check for other requirements/restriction on data dissemination:
Data provider restrictions and University approvals thereof
Open data requirements and norms
University information policy
38. Resources
E.A. Bankert & R.J. Amdur, 2006, Institutional Review Board: Management and Function, Jones and Bartlett Publishers
P. Ohm, "Broken Promises of Privacy", SSRN Working Paper [ssrn.com/abstract=1450006]
D.J. Mazur, 2007, Evaluating the Science and Ethics of Research on Humans, Johns Hopkins University Press
IRB: Ethics & Human Research [Journal], Hastings Press
www.thehastingscenter.org/Publications/IRB/
Journal of Empirical Research on Human Research Ethics, University of California Press
ucpressjournals.com/journal.asp?j=jer
201 CMR 17 text
www.mass.gov/Eoca/docs/idtheft/201CMR17amended.pdf
FERPA Website
www.ed.gov/policy/gen/guid/fpco/ferpa/index.html
HIPAA Website
www.hhs.gov/ocr/privacy/
Common Rule Website
www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.htm
State laws
www.ncsl.org/Default.aspx?TabId=13489
Harvard Enterprise Information Security Policy/ Research Data Security Policy
www.security.harvard.edu
Harvard Institutional Review Board
www.fas.harvard.edu/~research/hum_sub/
Harvard FAS Policies and Procedures
www.fas-it.fas.harvard.edu/services/catalog/browse/39
IQSS Policies and Procedures
support.hmdc.harvard.edu/kb-930/hmdc_policies
39. Research Design, Methods, Management
Reducing risk
Sensitivity of information
Partitioning
Decreasing identification
Managing confidentiality and
dissemination
[Summary]
40. Trade-offs
Anonymity vs. research utility
Sensitivity vs. research utility
(Anonymity * Sensitivity) vs. research costs/efforts
41. Types of Sensitive Information
Information is sensitive if, once disclosed, there is a "significant" likelihood of harm
IRB literature suggests possible categories of
harm:
loss of insurability
loss of employability
criminal liability
psychological harm
social harm to a vulnerable group
reputational harm
emotional harm
dignitary harm
physical harm: risk of disease, injury, or death
42. Levels of Sensitivity
No widely accepted scale
Publicly available data not sensitive under the "Common Rule"
The Common Rule anchors the scale at "minimal risk":
"if disclosed, the probability and magnitude of harm or discomfort anticipated are not greater in and of themselves than those ordinarily encountered in daily life or during the performance of routine physical or psychological examinations or tests"
Harvard Research Data Security Policy
Level 5- Extremely sensitive information about individually identifiable people.
Information that if exposed poses significant risk of serious harm.
Includes information posing serious risk of criminal liability, serious psychological harm or other
significant injury, loss of insurability or employability, or significant social harm to an individual or
group.
Level 4 - Very sensitive information about individually identifiable people
Information that if exposed poses a non-minimal risk of moderate harm.
Includes civil liability, moderate psychological harm, or material social harm to individuals or
groups, medical records not classified as Level 5, sensitive-but-unclassified national security
information, and financial identifiers (as per HRCI standards).
Level 3- Sensitive information about individually identifiable people
Information that if disclosed poses a significant risk of minor harm.
Includes information that would reasonably be expected to damage reputation or cause
embarrassment; and FERPA records.
Level 2 – Benign information about individually identifiable people
Information that would not be considered harmful, but as to which a subject has been promised confidentiality.
Level 1 – De-identified information about people, and information not about people
43. IRB Review Scope
IRB approval needed for all:
federally-funded research;
or any research involving "human subjects" at (almost all) institutions receiving federal funding
(any organization operating under a general "federal-wide assurance")
All human subjects research at Harvard
Human subject: individual about whom an investigator
(whether professional or student) conducting research
obtains
(1) Data through intervention or interaction with a living
individual, or
(2) Identifiable private information about living individuals
See
www.hhs.gov/ohrp/
44. Research Not Requiring IRB Approval
Non-research:
not generalizable knowledge & systematic inquiry
Non-funded:
institution receives no federal funds for research
Not human subject:
No living people described
Observation only AND no private identifiable information is obtained
Human Subjects, but “exempt” under 45 CFR 46
use of existing, publicly-available data
use of existing non-public data, if individuals cannot be directly or indirectly identified
research conducted in educational settings, involving normal educational
practices
taste & food quality evaluation
federal program evaluation approved by agency head
observational, survey, test & interview of public officials and candidates
(in their formal capacity, or not identified)
Caution: not all "exempt" is exempt…
Some research on prisoners, children, not exemptable
Some universities require review of “exempt” research
Harvard requires review of all human subject research
See:
www.hhs.gov/ohrp/humansubjects/guidance/decisioncharts.htm
45. IRBs and Confidential Information
IRBs review consent procedures and documentation
IRBs may review data management plans
May require procedures to minimize risk of disclosure
May require procedures to minimize harm resulting from disclosure
IRBs make determinations of the sensitivity of information (potential harm resulting from disclosure)
IRBs make determinations regarding whether data is de-identified for "public use"
[see NHRPAC,
“Recommendations on Public Use Data Files”]
46. Harvard IRB Approval
The Harvard Institutional Review Board (IRB) must approve all human subjects research at Harvard prior to data collection or use
Research involves human subjects if:
There is any interaction or intervention with living humans; or
If identifiable private data about living humans is used
Some examples of human subject research in soc sci:
Surveys
Behavioral experiments
Educational tests and evaluations
Analysis of identified private data collected from people
(your e-mail inbox, logs of web-browsing activity, facebook activity, ebay
bids … )
The IRB will:
Assess research protocol
Identify whether research is exempt from further review and management
Identify sensitivity level of data
48. HRDSP Responsibilities
Researchers are responsible for disclosing to the IRB, and for following the IRB-approved plan
The IRB is responsible for ensuring the adequacy of investigators' plans, and for granting (lawful) variances from security requirements justified by research needs
IT is responsible for assisting with the identification of security level, and assisting in the implementation of security protections
The Security Officer/CIO may review IT facilities and approve (give written designation) that they meet protections for a given level
48 [Micah Altman, 3/10/2011]
49. Valuation of private information is uncertain
Privacy valuations are often inconsistent:
Framing effects: ordering, endowment effect, possibly others
Non-normal/uniform distribution of valuations
One study: < 10% of subjects would give up $2 of a $12 gift card to buy anonymity
of purchases
[Acquisti and Loewenstein 2009]
Cost-benefit of information security may not be optimal for users [Herley 2009]
E.g. loss from all phishing attacks is 100x less than the time spent avoiding them
Note, however, weaknesses in this analysis:
Only loss of time modeled – no valuation of privacy made
Institutional costs not included – only personal costs
Very simplified model – not calibrated through surveys, etc.
Repeated surveys of students show they tend to disclose a lot, e.g.:
>80% of students sampled in several studies had public Facebook pages with birthdays,
home town, and other private information
This information can easily be used to link to other databases!
Disclosure of extensive information on sexual orientation, private cell numbers,
drinking habits, etc. is not uncommon
[See Kolek & Saunders 2008]
Emerging markets for privacy?
Micropayments for disclosures
http://www.personal.com/
http://www.i-allow.com/
49 [Micah Altman, 3/10/2011]
50. Reducing Risk in Data Collection
Avoid collecting sensitive information unless it is
required by the research design, method, or hypothesis:
Unnecessary sensitive information is not minimal risk
Reducing sensitivity → higher participation, greater honesty
Collect sensitive information in private settings:
Reduces risk of disclosure
Increases participation
Reduce sensitivity through indirect measures:
Less sensitive proxies
E.g. implicit association test [Greenwald, et al. 1998]
Unfolding brackets
Group response collection
Randomized response technique [Warner 1965]
Item count/unmatched count/list experiment technique
50 [Micah Altman, 3/10/2011]
51. Managing Sensitive Data Collection
Separate:
sensitive measures, (quasi-)identifiers, other measures
If possible, avoid storing identifiers with measures:
Collect identifying information beforehand
Assign opaque subject identifiers
For sensitive data:
Collect on-line directly (with appropriate protections); or
Encrypt collection devices/media (laptops, USB keys, etc.)
For very/extremely sensitive data:
Collect with oversight directly; then
Store on an encrypted device; and
Transfer to a secure server as soon as feasible
51 [Micah Altman, 3/10/2011]
52. Randomized Response Technique
[Diagram: the subject rolls a die. If the roll is > 2, the subject
answers the sensitive question and the answer is recorded;
otherwise the subject simply says "YES".]
Variations:
- Ask two different questions
- Item counts with sensitive and non-sensitive items – eliminate
subject-randomization
- Regression analysis methods
52 [Micah Altman, 3/10/2011]
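The die-roll mechanism above (answer truthfully on a roll greater than 2, otherwise say "YES") can be sketched in a short simulation, together with the standard unbiased prevalence estimator. The 2/3 truth probability matches a fair six-sided die; the function names are illustrative, not from the slides.

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 2/3) -> bool:
    """With probability p_truth the subject answers the sensitive
    question truthfully; otherwise the die forces a "YES"."""
    if random.random() < p_truth:
        return true_answer
    return True  # forced "YES"

def estimate_prevalence(responses, p_truth: float = 2/3) -> float:
    """Unbias the observed proportion of "yes" answers:
    E[yes] = p_truth * pi + (1 - p_truth), so
    pi = (observed - (1 - p_truth)) / p_truth."""
    observed = sum(responses) / len(responses)
    return (observed - (1 - p_truth)) / p_truth

random.seed(42)
true_prevalence = 0.20
sample = [randomized_response(random.random() < true_prevalence)
          for _ in range(100_000)]
print(round(estimate_prevalence(sample), 2))  # close to 0.20 for large samples
```

Note the trade-off the next slides discuss: any individual "YES" is deniable, but the researcher recovers the population rate only with extra sampling variance.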
53. Our Table – Less (?) Sensitive

Name     | SSN   | Birthdate | Zipcode | Gender | Favorite Ice Cream | Treat? | # acts*
A. Jones | 12341 | 01011961  | 02145   | M      | Raspberry          | 0      | 0
B. Jones | 12342 | 02021961  | 02138   | M      | Pistachio          | 1      | 20
C. Jones | 12343 | 11111972  | 94043   | M      | Chocolate          | 0      | 0
D. Jones | 12344 | 12121972  | 94043   | M      | Hazelnut           | 1      | 12
E. Jones | 12345 | 03251972  | 94041   | F      | Lemon              | 0      | 0
F. Jones | 12346 | 03251972  | 02127   | F      | Lemon              | 1      | 7
G. Jones | 12347 | 08081989  | 02138   | F      | Peach              | 0      | 1
H. Smith | 12348 | 01011973  | 63200   | F      | Lime               | 1      | 17
I. Smith | 12349 | 02021973  | 63300   | M      | Mango              | 0      | 4
J. Smith | 12350 | 02021973  | 63400   | M      | Coconut            | 1      | 18
K. Smith | 12351 | 03031974  | 64500   | M      | Frog               | 0      | 32
L. Smith | 12352 | 04041974  | 64600   | M      | Vanilla            | 1      | 65
M. Smith | 12353 | 04041974  | 64700   | F      | Pumpkin            | 0      | 128
N. Smith | 12354 | 04041974  | 64800   | F      | Allergic           | 1      | 256

* Acts = crimes if treatment = 0; crimes + acts of generosity if treatment = 1
53 [Micah Altman, 3/10/2011]
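Even with the name and SSN columns dropped, the (birthdate, zipcode, gender) columns act as quasi-identifiers, as the Sweeney result cited earlier suggests. A minimal sketch of the uniqueness check, with rows copied from the example table:

```python
from collections import Counter

# Rows from the example table: (name, birthdate, zipcode, gender)
rows = [
    ("A. Jones", "01011961", "02145", "M"), ("B. Jones", "02021961", "02138", "M"),
    ("C. Jones", "11111972", "94043", "M"), ("D. Jones", "12121972", "94043", "M"),
    ("E. Jones", "03251972", "94041", "F"), ("F. Jones", "03251972", "02127", "F"),
    ("G. Jones", "08081989", "02138", "F"), ("H. Smith", "01011973", "63200", "F"),
    ("I. Smith", "02021973", "63300", "M"), ("J. Smith", "02021973", "63400", "M"),
    ("K. Smith", "03031974", "64500", "M"), ("L. Smith", "04041974", "64600", "M"),
    ("M. Smith", "04041974", "64700", "F"), ("N. Smith", "04041974", "64800", "F"),
]

# How many rows does each (birthdate, zipcode, gender) combination match?
counts = Counter((bd, zc, g) for _, bd, zc, g in rows)
unique = [name for name, bd, zc, g in rows if counts[(bd, zc, g)] == 1]
print(f"{len(unique)} of {len(rows)} rows uniquely identified by quasi-identifiers")
# -> 14 of 14 rows uniquely identified by quasi-identifiers
```

Every row in this toy table is pinned down by those three columns alone, so removing names and SSNs would not, by itself, de-identify it.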
54. Randomized Response – Pros and Cons
Pros
Can substantially reduce risks of disclosure
Can increase response rate
Can decrease mis-reporting
Warning!
None of the randomized models uses a formal measure of disclosure limitation
Some would clearly violate measures (such as differential privacy) we'll see in
section 4
Do not use as a replacement for disclosure limitation
Other Issues
Loss of statistical efficiency
(if compliance would otherwise be the same)
Complicates data analysis, especially model-based analysis
Leaving randomization up to the subject can be unreliable
May provide less confidentiality protection if:
Randomization is incomplete
Records of randomization assignment are kept
Lists of responses overlap across questions
Sensitive question response is large enough to dominate overall response
Non-sensitive question responses are extremely predictable, or publicly observable
54 [Micah Altman, 3/10/2011]
55. Partitioning Information
Reduces risk in information management.
Partition information based on sensitivity:
Identifying information
Descriptive information
Sensitive information
Other information
Segregate:
Storage of information
Access regimes
Data collection channels
Data transmission channels
Plan to segregate as early as feasible in data collection
and processing
Link segregated information with artificial keys …
55 [Micah Altman, 3/10/2011]
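Putting the partitioning and artificial-key advice together, a minimal sketch might look like the following (the dict-shaped records and field names are illustrative assumptions, not from the slides):

```python
import secrets

def partition(records):
    """Split each record into an identifying table and a de-identified
    table, linked only by an opaque, randomly generated key."""
    identified, deidentified = [], []
    for rec in records:
        link = secrets.token_hex(8)  # cryptographically strong random key
        identified.append({"link": link, "name": rec["name"], "ssn": rec["ssn"]})
        deidentified.append({"link": link, "measure": rec["measure"]})
    return identified, deidentified

ids, data = partition([{"name": "A. Jones", "ssn": "12341", "measure": 0}])
assert ids[0]["link"] == data[0]["link"]   # the tables can be rejoined via the key...
assert "ssn" not in data[0]                # ...but the measures carry no identifiers
```

The two tables can then be stored, transmitted, and access-controlled separately, as the slide recommends.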
56. Partitioned table

Identified table:

Name     | SSN   | Birthdate | Zipcode | Gender | LINK
A. Jones | 12341 | 01011961  | 02145   | M      | 1401
B. Jones | 12342 | 02021961  | 02138   | M      | 283
C. Jones | 12343 | 11111972  | 94043   | M      | 8979
D. Jones | 12344 | 12121972  | 94043   | M      | 7023
E. Jones | 12345 | 03251972  | 94041   | F      | 1498
F. Jones | 12346 | 03251972  | 02127   | F      | 1036
G. Jones | 12347 | 08081989  | 02138   | F      | 3864
H. Smith | 12348 | 01011973  | 63200   | F      | 2124
I. Smith | 12349 | 02021973  | 63300   | M      | 4339

Not-identified table:

LINK | Favorite Ice Cream | Treat | # acts
1401 | Raspberry          | 0     | 0
283  | Pistachio          | 1     | 20
8979 | Chocolate          | 0     | 0
7023 | Hazelnut           | 1     | 12
1498 | Lemon              | 0     | 0
1036 | Lemon              | 1     | 7
3864 | Peach              | 0     | 1
2124 | Lime               | 1     | 17
4339 | Mango              | 0     | 4
6629 | Coconut            | 1     | 18
9091 | Frog               | 0     | 32
9918 | Vanilla            | 1     | 65
4749 | Pumpkin            | 0     | 128
8197 | Allergic           | 1     | 256
56 [Micah Altman, 3/10/2011]
57. Choosing Linking Keys
Entirely randomized
Most resistant to relinking
Mapping from original ID to random keys is highly sensitive
Must keep, and be able to access, the mapping to add new identified data
Most computer-generated random numbers are not sufficient by themselves:
Most are PSEUDO-random – predictable sequences
Use a cryptographically secure PRNG: Blum Blum Shub, AES (or another block cipher) in counter mode; OR
Use real random numbers (e.g. from physical sources – see http://maltman.hmdc.harvard.edu/numal/); OR
Use a PRNG with a real random seed to randomize the order of the table, then another to generate the IDs for this
randomly ordered table
Encryption
More troublesome to compute
Same IDs + same key + same "salt" produce same values → facilitates merging
IDs can be recovered if the key is exposed or cracked, or the algorithm is weak
Cryptographic hash
e.g. SHA-256
Security is well understood
Tools available to compute
Same IDs produce same hashes → easier to merge new identified data
IDs cannot be recovered from the hash, because the hash loses information
IDs can be confirmed if identifying information is known or guessable
Cryptographic hash + secret key
Security is well understood
Tools available to compute
Same IDs produce same hashes → easier to merge new identified data
IDs cannot be recovered from the hash, because the hash loses information
IDs cannot be confirmed unless the key is also known
Do not choose arbitrary mathematical functions of other identifiers!
57 [Micah Altman, 3/10/2011]
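The "cryptographic hash + secret key" option above corresponds to a keyed hash such as HMAC-SHA256, available in Python's standard hmac module. A minimal sketch (the subject IDs are illustrative):

```python
import hashlib
import hmac
import secrets

# Generate the key once; store it separately from the data and protect
# it like the identifiers themselves.
key = secrets.token_bytes(32)

def linking_key(subject_id: str, key: bytes) -> str:
    """HMAC-SHA256 of a subject identifier: the same ID always maps to
    the same opaque key (so new identified data can be merged), but the
    ID cannot be recovered, or even confirmed, without the secret key."""
    return hmac.new(key, subject_id.encode(), hashlib.sha256).hexdigest()

k1 = linking_key("12341", key)
assert k1 == linking_key("12341", key)       # deterministic: merging works
assert linking_key("12342", key) != k1       # different subjects get different keys
```

Without the key, an adversary who can guess a subject's ID still cannot confirm the guess against the table, which is the advantage over a plain hash.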
58.
59. Anonymous Data Collection: Pros & Cons
Pros
Presumption that data is not identifiable
May increase participation
May increase honesty
Cons
Barrier to follow-up, longitudinal studies
Can conflict with quality control, validation
Data may still be indirectly identifiable if respondent
descriptive information is collected
Linking data to other sources of information may have
large research benefits
59 [Micah Altman, 3/10/2011]
60. Anonymous Data Collection Methods
Trusted third party intermediates
Respondent initiates re-contacts
No identifying information recorded
Use IDs randomized to subjects, then destroy the mapping
60 [Micah Altman, 3/10/2011]
61. Remote Data Collection Challenges
Where a network connection is readily available, it is
easy to transfer data as collected, or enter it on a remote system:
Encrypted network file transfer (e.g. SFTP, part of SSH)
Encrypted/tunneled network file system (e.g. ExpanDrive)
Where the network connection is less reliable, or the data is high-bandwidth:
Whole-disk-encrypted laptop
Plus encrypted cloud backup solutions: CrashPlan, BackBlaze,
SpiderOak
Small data, short term:
Encrypted USB keys (e.g. with IronKey, TrueCrypt, PGP)
Foreign travel:
Be aware of U.S. EAR export restrictions; use commercial or
widely-available open encryption software only. Do not use
bespoke software.
Be aware of country import restrictions (as of 2008): Burma,
Belarus, China, Hungary, Iran, Israel, Morocco, Russia, Saudi
Arabia, Tunisia, Ukraine
Encrypt data if possible, but don't break foreign laws. Check with the
Department of State.
61 [Micah Altman, 3/10/2011]
62. Online/Electronic Data Collection Challenges
IP addresses are identifiers
IP addresses can be logged automatically by the host, even if not intended by the researcher
IP addresses can trivially be observed as data is collected
Partial IP numbers can be used for probabilistic geographical identification at sub-zipcode levels
Cookies may be identifiers
Cookies provide a way to link data more easily
May or may not explicitly identify the subject
Jurisdiction
Data collected from subjects in other states/countries could subject you to laws in that jurisdiction
Jurisdiction may depend on residency of the subject, availability of the data collection instrument in the
jurisdiction, or explicit data collection efforts within the jurisdiction
Vendor
A vendor could retain IP addresses, identifying cookies, etc., even if not intended by the researcher
Recommendations
Use only vendors that certify compliance with your confidentiality policy
Do not retain IP numbers if data is being collected anonymously
Use SSL/TLS encryption unless data is non-sensitive and anonymous
Some tools for anonymizing IP addresses and system/network logs:
www.caida.org/tools/taxonomy/anonymization.xml
Harvard policy
Recommendations as above
Plus: do not use or display Level 4+ for web surveys
62 [Micah Altman, 3/10/2011]
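As one simple illustration of the kind of IP anonymization the tools above perform (not a substitute for them), an address can be coarsened before logging by zeroing its host bits. The prefix lengths here are common conventions, assumed for the sketch, not a stated policy:

```python
import ipaddress

def truncate_ip(addr: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Coarsen an IP address by zeroing the host bits, so the logged
    value identifies a network rather than a single machine."""
    ip = ipaddress.ip_address(addr)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)

print(truncate_ip("192.0.2.147"))  # -> 192.0.2.0
```

Truncation reduces, but does not eliminate, identifiability; for anonymous collection the slide's recommendation stands: do not retain IP numbers at all.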
63.
64. Certificates of Confidentiality
Issued by DHHS agencies such as NIH, CDC, FDA
Protect against many types of forced legal
disclosure of confidential information
May not protect against all state disclosure laws
Do not protect against voluntary disclosures by
researchers/research institutions
64 [Micah Altman, 3/10/2011]
65. Confidentiality & Consent
Best practice is to describe in the consent form:
Practices in place to protect confidentiality
Plans for making the data available, to whom, under
what circumstances, and the rationale
Limitations on confidentiality (e.g. limits to a certificate of
confidentiality under state law, planned voluntary disclosure)
The consent form should be consistent with your:
Data management plan
Data sharing plans and requirements
It is not generally best practice to promise:
Unlimited confidentiality
Destruction of all data
Restriction of all data to the original researchers
65 [Micah Altman, 3/10/2011]
66. Data Management Plan
When is it required?
Any NIH request over $500K
All NSF proposals after 12/31/2010
NIJ
Wellcome Trust
Any proposal where the collected data will be a resource beyond the project
Safeguarding data during collection:
Documentation
Backup and recovery
Review
Treatment of confidential information:
Overview: http://www.icpsr.org/DATAPASS/pdf/confidentiality.pdf
Separation of identifying and sensitive information
Obtain certificate of confidentiality, other legal safeguards
De-identification and public use files
Dissemination:
Archiving commitment (include letter of support)
Archiving timeline
Access procedures
Documentation
User vetting, tracking, and support
One size does not fit all projects.
66 [Micah Altman, 3/10/2011]
67. Data Management Plan Outline

Data description
nature of data {generated, observed, experimental information; samples; publications;
physical collections; software; models}
scale of data
Access and Sharing
Plans for depositing in an existing public database
Access procedures
Embargo periods
Access charges
Timeframe for access
Technical access methods
Restrictions on access
Audience
Potential secondary users
Potential scope or scale of use
Reasons not to share or reuse
Existing Data [if applicable]
description of existing data relevant to the project
plans for integration with data collection
added value of collection, need to collect/create new data
Formats
Generation and dissemination formats and procedural justification
Storage format and archival justification
Metadata and documentation
Metadata to be provided
Metadata standards used
Treatment of field notes and collection records
Planned documentation and supporting materials
Quality assurance procedures for metadata and documentation
Data Organization [if complex]
File organization
Naming conventions
Quality Assurance [if not described in main proposal]
Procedures for ensuring data quality in collections, and expected measurement error
Cleaning and editing procedures
Validation methods
Storage, backup, replication, and versioning
Facilities
Methods
Procedures
Frequency
Replication
Version management
Recovery guarantees
Security
Procedural controls
Technical controls
Confidentiality concerns
Access control rules
Restrictions on use
Budget
Cost of preparing data and documentation
Cost of permanent archiving
Intellectual Property Rights
Entities who hold property rights
Types of IP rights in data
Protections provided
Dispute resolution process
Legal Requirements
Provider requirements and plans to meet them
Institutional requirements and plans to meet them
Archiving and Preservation
Requirements for data destruction, if applicable
Procedures for long-term preservation
Institution responsible for long-term costs of data preservation
Succession plans for data should the archiving entity go out of existence
Ethics and privacy
Informed consent
Protection of privacy
Other ethical issues
Adherence
When will adherence to the data management plan be checked or demonstrated
Who is responsible for managing data in the project
Who is responsible for checking adherence to the data management plan
Responsibility
Individual or project team role responsible for data management
67 [Micah Altman, 3/10/2011]
68. IQSS Data Management Services
The Henry A. Murray Research Archive
Harvard's endowed permanent data archive
Assists in developing data management plans
Can provide cataloging assistance for public release of data
Dissemination of data through the IQSS Dataverse Network
The IQSS Dataverse Network
Standard data management plan for public, small data
Provides easy virtual archiving and dissemination
Data is catalogued and controlled by you
You theme and brand your virtual archive
Universally searchable, citable
Automatically provides data formatting and statistical analysis on-line
http://dvn.iq.harvard.edu
68 [Micah Altman, 3/10/2011]
69. Data Management Plan Examples (Summaries)
Example 1
The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the
New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial
features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if
not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical
data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting
subjects. Therefore, we are not planning to share the data.
Example 2
The proposed research will include data from approximately 500 subjects being screened for three bacterial
sexually transmitted diseases (STDs) at an inner city STD clinic. The final dataset will include self-reported
demographic and behavioral data from interviews with the subjects and laboratory data from urine specimens
provided. Because the STDs being studied are reportable diseases, we will be collecting identifying information.
Even though the final dataset will be stripped of identifiers prior to release for sharing, we believe that there remains
the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and
associated documentation available to users only under a data-sharing agreement that provides for: (1) a
commitment to using the data only for research purposes and not to identify any individual participant; (2) a
commitment to securing the data using appropriate computer technology; and (3) a commitment to destroying or
returning the data after analyses are completed.
Example 3
This application requests support to collect public-use data from a survey of more than 22,000 Americans over the
age of 50 every 2 years. Data products from this study will be made available without cost to researchers and
analysts. https://ssl.isr.umich.edu/hrs/
User registration is required in order to access or download files. As part of the registration process, users must
agree to the conditions of use governing access to the public release data, including restrictions against attempting
to identify study participants, destruction of the data after analyses are completed, reporting responsibilities,
restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource.
Registered users will receive user support, as well as information related to errors in the data, future releases,
workshops, and publication lists. The information provided to users will not be used for commercial purposes, and
will not be redistributed to third parties.
FROM NIH, [grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm#ex]
69 [Micah Altman, 3/10/2011]
70. External Data Usage Agreements

Between provider and individual:
Careful – you're liable
Harvard will not help if you sign
University review and OSP signature strongly recommended
Between provider and University:
University liable
Requires University-approved signer: OSP
Avoid nonstandard protections whenever possible:
DUAs can impose very specific and detailed requirements
Compatible in spirit does not imply compatibility in legal practice
Use University policies/procedures as a …

Controls on Confidential Information
Harvard University has developed extensive technical and administrative
procedures to be used with all identified personal information and other
confidential information. The University classifies this form of information
internally as "Harvard Confidential Information" – or HCI.
Any use of HCI at Harvard includes the following safeguards:
- Systems security. Any system used to store HCI is subject to a checklist of
technical and procedural security measures, including: operating system and
applications must be patched to current security levels, a host-based firewall is
enabled, anti-virus software is enabled and the definitions file is current.
- Server security. Any server used to distribute HCI to other systems (e.g. through
providing a remote file system), or otherwise offering login access, must employ
additional security measures, including: connection through a private network only;
limitation on length of idle sessions; limitations on incorrect password attempts;
and additional logging and monitoring.
- Access restriction: an individual is allowed to access HCI only if there is a
specific need for access. All access to HCI is over physically controlled and/or
encrypted channels.
- Disposal processes: including secure file erasure and document destruction.
- Encryption: HCI must be strongly encrypted whenever it is transmitted across a
public network, stored on a laptop, or stored on a portable device such as a flash
drive or on portable media.
This is only a brief summary. The full University security policy can be found here:
http://security.harvard.edu/heisp
And a more detailed checklist used to verify systems compliance is found here:
http://security.harvard.edu/files/resources/forms/
These safeguards are applied consistently throughout the university; we believe
that these requirements offer stringent protection for the requested data. And
these requirements will be applied in addition to any others required by a specific
data use agreement.
70 [Micah Altman, 3/10/2011]
71. IQSS Data Management Services
The Henry A. Murray Research Archive
Harvard's endowed permanent data archive
Assists in developing data management plans
Can provide cataloging assistance for public release of data
Dissemination of data through the IQSS Dataverse Network
Provides letters of commitment to permanent archiving
www.murray.harvard.edu
The IQSS Dataverse Network
Provides easy virtual archiving and dissemination
Data is catalogued and controlled by you
You theme and brand your virtual archive
Universally searchable, citable
Automatically provides data formatting and statistical analysis on-line
dvn.iq.harvard.edu
71 [Micah Altman, 3/10/2011]
72. Key Concepts & Issues Review
Levels of sensitivity
Anonymity criteria
Sensitivity reduction
Certificate of Confidentiality
Data sharing plan
Data management plan
Information partitioning
Linking keys
72 [Micah Altman, 3/10/2011]
Editor's notes
This work, Managing Confidential Information in Research, by Micah Altman (http://redistricting.info) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
- Institutions may give "limited assurance" of Common Rule compliance – just for funded research. Most give "general assurance".