The document discusses opportunities for partnerships between domain repositories and institutional repositories to enhance data curation. It provides examples of how they can partner to 1) archive and share data more widely to avoid duplicate data collection, 2) improve data documentation and address disclosure issues to make data more useful for future researchers, and 3) provide tools and expertise around data processing, confidentiality review, and access to help researchers use the data while protecting privacy. Productive partnerships require selection of pilot projects, investment of time, and willingness to take on new roles from both sides.
RDAP13 Jared Lyle: Domain Repositories and Institutional Repositories Partn…
1. Domain Repositories and Institutional
Repositories Partnering to Curate:
Opportunities and Examples
Jared Lyle
RDAP13
2. About ICPSR
• Founded in 1962 as a consortium of 21
universities to share the National Election
Survey
• Today: 700+ members around the world
• Data dissemination for more than 20 federal
and non-government sponsors
• 600,000+ visitors per year
3. What we do
• Acquire and archive social science data
• Distribute data to researchers
• Preserve data for future generations
• Provide training in quantitative methods
Archive size
• 8,000 data collections, over 60,000 data sets
• Grows by 300+ collections a year
• 9 Terabytes, soon to be 40+ Terabytes
7. “It saves funding and avoids
repeated data collecting efforts,
allows the verification and
replication of research findings,
facilitates scientific openness,
deters scientific misconduct, and
supports communication and
progress.”
Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.”
http://www.iassistdata.org/downloads/iqvol304niu.pdf
8. “Virtually all geneticists believe
that scientists should share their
results freely with peers…”
Louis, Jones, and Campbell (2002). “Sharing in Science.”
http://dx.doi.org/10.1511/2002.4.304
9. “…the era of data sharing has arrived.”
Samet (2009). “Data: To Share or Not to Share?”
http://dx.doi.org/10.1097/EDE.0b013e3181930df3
11. Most PIs indicated that they wanted
to be “Good Citizens” and help:
“This sounds like an exciting
project.”
“I hope your project is successful
because I think that it is
important.”
12. “Good Citizens” = high willingness
…but no time, money, or resources
to submit data to us.
13. Data Sharing (N=1,544)
[Bar chart. Categories: Data Are Archived; Has Copy of Data; Data Are Lost. Bar values shown: 58.7%, 25.7%, 14.2%.]
Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Shared?”
http://ori.hhs.gov/content/research-research-integrity-rri-conference-2009
See also: Pienta, Gutmann, Hoelter, Lyle, & Donakowski (2008). “The LEADS Database at ICPSR: Identifying Important ‘At Risk’ Social Science Data.”
http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf
14. Data Sharing (N=935)
Federal Agency   Shared Formally,    Shared Informally,     Not Shared
                 Archived (n=111)    Not Archived (n=415)   (n=409)
NSF (27.3%)      22.4%               43.7%                  33.9%
NIH (72.7%)      7.4%                45.0%                  47.6%
Total            11.5%               44.6%                  43.9%
Pienta, Alter, & Lyle (2010). “The Enduring Value of Social Science Research:
The Use and Reuse of Primary Research Data”.
http://hdl.handle.net/2027.42/78307
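As a quick arithmetic check on the table above, the Total row should equal the average of the NSF and NIH rows weighted by each agency's share of the sample. A minimal sketch using only the figures shown on the slide (the variable names are illustrative, not from the original study):

```python
# Agency shares of the N=935 sample and row percentages, as shown on the slide.
shares = {"NSF": 0.273, "NIH": 0.727}
rows = {
    "NSF": (22.4, 43.7, 33.9),  # archived, shared informally, not shared
    "NIH": (7.4, 45.0, 47.6),
}

# Weighted average of each column across the two agencies.
total = tuple(
    round(sum(shares[agency] * rows[agency][i] for agency in shares), 1)
    for i in range(3)
)
print(total)  # (11.5, 44.6, 43.9)
```

The result matches the slide's Total row, so the three rows are internally consistent.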
23. Disclosure Issues
• Direct Identifiers?
– personal names
– addresses (including ZIP codes)
– telephone numbers
– social security numbers
– driver license numbers
– patient numbers
– certification numbers,
24. Disclosure Issues
• Indirect Identifiers?
– detailed geography (e.g., state, county, or
census tract of residence)
– exact date of birth
– exact occupations held
– exact dates of events
– detailed income
25. Disclosure Issues
• External Linkages?
– public patient/medical records
– court records
– police and correction records
– Social Security records
– Medicare records
– driver’s licenses
– military records
27. “It saves funding and avoids
repeated data collecting efforts,
allows the verification and
replication of research findings,
facilitates scientific openness,
deters scientific misconduct, and
supports communication and
progress.”
Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.”
http://www.iassistdata.org/downloads/iqvol304niu.pdf
30. Emerging sources and types of data
• Geo-spatial
• Video
• Administrative data
• Online text
• Transactions
• Clicks
• Sensors
31. Partnerships
“We propose that domain specific
archives partner with institution based
repositories to provide expertise, tools,
guidelines, and best practices to the
research communities they serve.”
Green, Ann G., and Myron P. Gutmann. (2007) "Building
Partnerships Among Social
Science Researchers, Institution-based Repositories, and
Domain Specific Data Archives." OCLC Systems and
Services: International Digital Library Perspectives. 23: 35-
53. http://hdl.handle.net/2027.42/41214
38. Time & Willingness
http://www.flickr.com/photos/floridamemory/7026619371/
39. Survey of Repositories’ Data Needs
Inter-university Consortium for Political and Social
Research. Survey of Data Curation Services for
Repositories, 2012. ICPSR34302-v1. Ann Arbor, MI:
Inter-university Consortium for Political and Social
Research [distributor], 2012-09-21.
doi:10.3886/ICPSR34302.v1
40. Repository Suggested Solutions:
• Media recovery, format migration, data
recovery
• Cost estimating and policy review
• Metadata tools, documentation, and catalog
linkages
• Support networks and training
• Confidential data dissemination and
confidentiality review
44. • Suppressing unique cases
• Grouping values (e.g., 13-29=1, 30-49=2)
• Top-coding (e.g., >1,000=1,000)
• Aggregating geographic areas
• Swapping values
• Sampling within a larger data collection
• Adding “noise”
• Replacing real data with synthetic data
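Two of the treatments listed above, grouping values and top-coding, can be sketched in a few lines. This is only an illustration using the slide's own example brackets and cap; the function names and sample values are hypothetical, and it does not represent ICPSR's actual tooling:

```python
def group_age(age):
    """Collapse exact age into the example brackets from the slide."""
    if 13 <= age <= 29:
        return 1
    if 30 <= age <= 49:
        return 2
    return None  # outside the slide's example brackets; left unassigned here


def top_code(value, cap=1000):
    """Replace any value above the cap with the cap itself."""
    return min(value, cap)


ages = [13, 29, 30, 49, 65]      # made-up values
incomes = [250, 1000, 4800]      # made-up values
print([group_age(a) for a in ages])    # [1, 1, 2, 2, None]
print([top_code(v) for v in incomes])  # [250, 1000, 1000]
```

Both treatments trade detail for protection: the grouped or capped variable no longer singles out respondents with unusual values, but the exact values cannot be recovered.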
49. The Virtual Data Enclave (VDE) provides remote access
to quantitative data in a secure environment.
50.
51. Hermes Outputs
• ASCII data files
– Column- and tab-delimited
• Stat package setup files
– SAS, SPSS, Stata (.do and .dct)
• “Ready-to-go” data files
– SAS transport (CPORT engine)
– SPSS system (.sav)
– Stata system (.dta)
– R (.rda)
52. Your ideas on partnerships?
Useful categories for discussion?
• Media recovery, format migration, data
recovery
• Cost estimating and policy review
• Metadata tools, documentation, and catalog
linkages
• Support networks and training
• Confidential data dissemination and
confidentiality review
“At the end of 2011 ICPSR had about 9TB of content stored in Archival Storage. This measurement includes everything we have collected over the past 50 years, including content which is not packaged into "studies" for dissemination, such as TIGER/Line files and data packaged for SDA. This content is not compressed, and contains many duplicates[1], and so should be considered an upper bound.”
“Long-time ICPSR staff tell the story of how the 2000 Census doubled the size of ICPSR's holdings. (I'll speculate that perhaps ICPSR went from about 3TB of content prior to the 2000 Census, and then grew to 6TB thereafter.) In 2012-2013 ICPSR is likely to quadruple the size of its holdings, growing from about 9TB to nearly 40TB.”
http://techaticpsr.blogspot.com/2012/04/nature-of-icpsrs-holdings.html
Sharing data = formally archiving the data.
(on a 4-point scale, 49 percent “agree completely” and 42 percent “agree somewhat”)
Why are data not shared?
• Preparing data and documentation can be enormously time consuming
• Limited resources for data preparation
• Need to protect the confidentiality of respondents
• Fear of getting “scooped”
• Lack of rewards for sharing
Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Shared?” http://ori.hhs.gov/content/research-research-integrity-rri-conference-2009
• 4,883 NIH & NSF PIs emailed a survey
• 1,217 responses (24.9% response rate)
• 1,003 valid (collected data, not dissertation)
We attempted to invite all 4,883 of these PIs. The PI survey consisted of questions about research data collected, various methods for sharing research data, attitudes about data sharing, and demographic information. PIs were also asked about publications tied to the research project, including information about their own publications, research team publications, and publications outside the research team. We received 1,217 responses (24.9% response rate). For the analytic sample we selected PIs and their research data if (1) they confirmed they collected research data (86.6% of the responses), (2) they did not collect data for a dissertation award (n=33), and (3) they were not missing data on the dependent variable.
Enhancements by both Investigators & Archives – time, money, training, & tools
[Quote is from the National Longitudinal Survey of Youth’s explanation of its documentation (see: http://www.nlsinfo.org/nlsy97/97guide/chap3.htm#threethree).]
“A centuries-old fresco of Jesus Christ that was botched by a well-intentioned elderly woman has drawn hundreds of visitors and reporters to a north-eastern Spanish church - a positive push in tourism for the small town.
The "ecce homo" (or "behold the man"), painted by famous Spanish artist Elias Garcia Martinez, is now mockingly - if not affectionately - called "ecce mono" ("behold the monkey") after an 81-year-old Cecilia Jimenez of Borja tried to fix the deteriorating fresco by applying a paint brush.”
http://www.cbsnews.com/8301-503543_162-57501085-503543/ruined-fresco-draws-attention-fans-in-spain/
Traditionally, we’ve dealt with quantitative social surveys with properties and structures.
Along with project documentation, which is needed so secondary users can independently understand a data collection.
In this example, the very first value is “-1”. The variable is an age variable, as indicated by the name above the frequency table, therefore age cannot have a “-1” value, unless it has another valid meaning, such as “Inappropriate”, “Not applicable”, or “Missing”.
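A check like the one described here can be automated: list every code that falls outside a variable's plausible range and has no value label explaining it. The sketch below is a hypothetical illustration; the function name, the 0-120 range, and the sample data are assumptions, not part of ICPSR's review process:

```python
from collections import Counter


def undocumented_codes(values, valid_range, value_labels):
    """Return out-of-range codes that have no value label, with their counts."""
    low, high = valid_range
    counts = Counter(values)
    return {
        code: n
        for code, n in counts.items()
        if not (low <= code <= high) and code not in value_labels
    }


age = [-1, 18, 22, 22, 35, -1, 41]  # made-up data for illustration

# With no labels, the -1 code is flagged for follow-up with the PI.
print(undocumented_codes(age, (0, 120), {}))               # {-1: 2}

# Once the codebook documents -1, nothing is flagged.
print(undocumented_codes(age, (0, 120), {-1: "Missing"}))  # {}
```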
This variable might be missing descriptive information, which is problematic and could render the variable unusable. This variable’s name is generic, which might be fine as long as the codebook provides description. But without further labeling, this variable could be meaningless since no label is provided and the value labels are generic Yes/No. The end user wouldn’t be able to interpret the variable.
Here the top and bottom values for Age seem a bit off. It looks like the PI recoded everything <18 to 18 and everything >40 as 41, although it’s not explicit.
A disclosure risk review asks the question: do these data contain content that I need to restrict?
Major areas to check when assessing risk include:
1) Are there direct identifiers that reveal the identity of respondents that may have been obtained in the process of data collection?
2) Are there indirect identifiers that reveal the identity of respondents when they are used in combination with other data?
It can be more challenging to identify indirect identifiers. Careful attention must therefore be paid to interactions among the context of the study, the nature of the sample, and the characteristics of respondents to prevent ordinarily unrevealing information from becoming the pointer to an individual.
3) Are there external linkages that might reveal the identity of respondents?
The ability to link data from these files to data available through external sources may present an unacceptable risk of disclosure.
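The first of these checks, looking for direct identifiers, lends itself to a rough automated first pass over variable names and labels. The sketch below is an assumption-laden illustration: the keyword list is drawn from the direct-identifier slide, the function and variable names are hypothetical, and a keyword match is only a prompt for human review. It cannot detect indirect identifiers or external linkages.

```python
import re

# Keywords drawn from the direct-identifier list on the Disclosure Issues slide.
DIRECT_ID_PATTERNS = [
    r"name", r"address", r"zip", r"phone", r"social security|ssn",
    r"driver", r"patient", r"certif",
]


def flag_direct_identifiers(variables):
    """Return names of variables whose name or label matches a keyword."""
    pattern = re.compile("|".join(DIRECT_ID_PATTERNS), re.IGNORECASE)
    return [
        name
        for name, label in variables.items()
        if pattern.search(name) or pattern.search(label)
    ]


# Made-up variable names and labels for illustration.
variables = {
    "RESP_NAME": "Respondent name",
    "Q12": "Home telephone number",
    "AGE": "Age at interview",
    "V7": "ZIP code of residence",
}
print(flag_direct_identifiers(variables))  # ['RESP_NAME', 'Q12', 'V7']
```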
Data discovery requires variable-level metadata
• Data Documentation Initiative (DDI) is an XML standard for micro-data in the social sciences
• Federated search tools
• New search tools
• Variable level searching
• Question banks
• Harmonization tools
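To make variable-level searching concrete, here is a minimal sketch that searches variable names and labels in a small XML fragment. The fragment is a simplified, namespace-free stand-in modeled on the var and labl elements of DDI Codebook; real DDI files carry namespaces and far more metadata. The function name and sample variables are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified fragment modeled on DDI Codebook's var/labl elements (assumption:
# namespaces and most metadata are omitted for brevity).
DDI_FRAGMENT = """
<codeBook>
  <dataDscr>
    <var name="AGE"><labl>Age of respondent at interview</labl></var>
    <var name="INC"><labl>Total household income</labl></var>
    <var name="V3"><labl>Respondent voted in last election</labl></var>
  </dataDscr>
</codeBook>
"""


def search_variables(xml_text, term):
    """Return (name, label) pairs whose name or label contains the term."""
    root = ET.fromstring(xml_text)
    needle = term.lower()
    hits = []
    for var in root.iter("var"):
        name = var.get("name", "")
        label = var.findtext("labl", default="")
        if needle in name.lower() or needle in label.lower():
            hits.append((name, label))
    return hits


print(search_variables(DDI_FRAGMENT, "income"))
# [('INC', 'Total household income')]
```

Without labels like these, a search for "income" would have nothing to match, which is why discovery depends on variable-level metadata.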
At ICPSR, we use a “LEADS” database to actively discover important research data that should be preserved and disseminated. See:
Pienta, Gutmann, Hoelter, Lyle, and Donakowski (2008). “The LEADS Database at ICPSR: Identifying Important ‘At Risk’ Social Science Data.”
http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf
There are increasing sources and types of data produced.
One IR was handed 80 distinct pieces of removable media with over 30,000 files and no instructions. One file, as an amusing aside, was named “David’s Favourite Captain Haddock Curses”.
How to find the relevant files, and which of those are essential?
None of the IRs had the software to read statistical files, let alone the capability to recover or convert older files. This was when ICPSR could lend a direct hand.
Survey distributed March & April 2012.
• 60% completion rate (109/181)
• 27 U.S. states + D.C.
• 6 Canadian provinces
• UK, AU, NL, NO, SA
• 66% of respondents from social science repository mailing list
• Most from college or university
• Librarian: 54 (57%)
• Repository Manager: 35 (37%)
Of those who’d received or were planning to receive data (80%):
• Social Sciences (69%)
• Physical Sciences (47%)
• Humanities (36%)
• Biomedical (36%)
• Engineering (24%)
As we’re presenting guidelines and tips for creating a well-prepared data collection, keep in mind that a lot of the information we’re conveying in this presentation is found in our “ICPSR Guide to Social Science Data Preparation and Archiving” (a link to it is on the ICPSR web site).
Confidentiality review and treatment involves reviewing and modifying the actual data to reduce disclosure risk. Data collections undergo confidentiality review to determine whether the data contain any information that could be used to identify respondents. All direct identifiers should be removed from files.
There are a number of actions you can take to protect respondent confidentiality: removing, masking, or collapsing variables within public-use versions of the datasets, or restricting access to the data.
• Removing variables is a good solution for treating direct identifiers.
• Blanking masks identifiers by altering original data values to missing data codes. For example, ‘abcd’ becomes ‘ ’ (all blanks).
• Recoding alters original data values to missing data codes. For example, value ‘1234’ is changed to ‘9999’.
• Bracketing/collapsing combines the categories of a variable or merges the concepts embodied in two or more variables by creating a new summary variable. For example, age: 13-29=1, 30-49=2.
• Top/bottom coding groups the upper or lower limits to eliminate outliers. For instance, a sample with extreme values for income might top-code or round all income >$100,000.
• Perturbing is a more complex statistical technique that alters the data by suppressing variables, adding or removing records, or adding random noise to continuous or pseudo-continuous variables. This technique limits the appeal of the data since it alters the original data values.
• Restricting access by requiring users to apply for use, or by highly restricted access (e.g., secured enclave-only access).
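The blanking and recoding examples above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, blanking is taken to mean overwriting a string with spaces of the same width, and 9999 is used as the missing-data code only because it is the value in the example.

```python
def blank(value):
    """Mask a string identifier by overwriting it with blanks of equal width."""
    return " " * len(value)


def recode_to_missing(value, sensitive, missing_code=9999):
    """Replace any value in the sensitive set with a missing-data code."""
    return missing_code if value in sensitive else value


print(repr(blank("abcd")))               # '    '
print(recode_to_missing(1234, {1234}))   # 9999
print(recode_to_missing(42, {1234}))     # 42
```

Keeping the blanked field the same width preserves the layout of column-delimited files, which matters when setup files read values by position.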
The intent of the ICPSR pipeline process is to curate, “preserve and access information for the Long Term” (see “Reference Model for an Open Archival Information System (OAIS)”, Consultative Committee for Space Data Systems, Page 2-1. http://public.ccsds.org/publications/archive/650x0b1.PDF).
Throughout the pipeline, our intent is to ensure that curated data are independently understandable; that is, “the community should be able to understand the information without needing the assistance of the experts who produced the information” (see “Reference Model for an Open Archival Information System (OAIS)”, Consultative Committee for Space Data Systems, Page 3-1. http://public.ccsds.org/publications/archive/650x0b1.PDF).
The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.