RDAP13 Jared Lyle: Domain Repositories and Institutional Repositories Partn…

Domain Repositories and Institutional
Repositories Partnering to Curate:
Opportunities and Examples

Jared Lyle
RDAP13

About ICPSR
• Founded in 1962 as a consortium of 21
universities to share the National Election
Survey
• Today: 700+ members around the world
• Data dissemination for more than 20 federal
and non-government sponsors
• 600,000+ visitors per year

What we do
• Acquire and archive social science data
• Distribute data to researchers
• Preserve data for future generations
• Provide training in quantitative methods

Archive size
• 8,000 data collections, over 60,000 data sets
• Grows by 300+ collections a year
• 9 Terabytes, soon to be 40+ Terabytes

http://www.flickr.com/photos/dwiggs/3983200894/sizes/l/in/photostream/

“It saves funding and avoids
repeated data collecting efforts,
allows the verification and
replication of research findings,
facilitates scientific openness,
deters scientific misconduct, and
supports communication and
progress.”
Niu (2006). “Reward and Punishment Mechanism for Research Data Sharin
http://www.iassistdata.org/downloads/iqvol304niu.pdf

“Virtually all geneticists believe
that scientists should share their
results freely with peers…”

Louis, Jones, and Campbell (2002). “Sharing in Science.”
http://dx.doi.org/10.1511/2002.4.304

“…the era of data sharing has arrived.”

Samet (2009). “Data: To Share or Not to Share?”
http://dx.doi.org/10.1097/EDE.0b013e3181930df3

Most PIs indicated that they wanted
to be “Good Citizens” and help:

“This sounds like an exciting
project.”

“I hope your project is successful
because I think that it is
important.”

“Good Citizens” = high willingness

…but no time, money, or resources
to submit data to us.

Data Sharing (N=1,544)
[Bar chart. Categories: Data Are Archived; Has Copy of Data; Data Are Lost.
Bar labels: 58.7%, 25.7%, 14.2% (listed by bar height, not by category).]

Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Sh
http://ori.hhs.gov/content/research-research-integrity-rri-conference-2009
See also: Pienta, Gutmann, Hoelter, Lyle, & Donakowski (2008). “The LEADS Database at ICPS
Identifying Important “At Risk” Social Science Data.”
http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf

Data Sharing (N=935)

Federal Agency   Shared Formally,   Shared Informally,     Not Shared
                 Archived (n=111)   Not Archived (n=415)   (n=409)
NSF (27.3%)      22.4%              43.7%                  33.9%
NIH (72.7%)      7.4%               45.0%                  47.6%
Total            11.5%              44.6%                  43.9%
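
As a quick arithmetic check, the Total row looks consistent with an average of the NSF and NIH rows weighted by each agency's share of the sample. The sketch below assumes the parenthesized figures (27.3% and 72.7%) are those sample shares; the variable and function names are illustrative only.

```python
# Assumption: the parenthesized figures are each agency's share of the N=935 sample.
shares = {"NSF": 0.273, "NIH": 0.727}

# Row percentages from the table: (archived, shared informally, not shared).
rows = {
    "NSF": (22.4, 43.7, 33.9),
    "NIH": (7.4, 45.0, 47.6),
}

def weighted_totals(rows, shares):
    """Average each column across agencies, weighting by sample share."""
    n_cols = len(next(iter(rows.values())))
    return tuple(
        round(sum(shares[agency] * rows[agency][i] for agency in rows), 1)
        for i in range(n_cols)
    )

totals = weighted_totals(rows, shares)
print(totals)
```

Under that assumption the weighted averages reproduce the Total row to one decimal place.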

Pienta, Alter, & Lyle (2010). “The Enduring Value of Social Science Research:
The Use and Reuse of Primary Research Data”.
http://hdl.handle.net/2027.42/78307

A well-prepared data collection
“contains information intended to
be complete and self-explanatory”
for future users.

A corollary: Do no harm.

http://img.gawkerassets.com/img/17xbuy519gga2jpg/ku-xlarge.jpg

Documentation

http://dx.doi.org/10.3886/ICPSR31521.v1

Disclosure Issues
• Direct Identifiers?
– personal names
– addresses (including ZIP codes)
– telephone numbers
– social security numbers
– driver license numbers
– patient numbers
– certification numbers

Disclosure Issues
• Indirect Identifiers?
– detailed geography (e.g., state, county, or
census tract of residence)
– exact date of birth
– exact occupations held
– exact dates of events
– detailed income
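
Indirect identifiers matter mostly in combination. One way to gauge the risk, sketched here on made-up records, is to count how many respondents share each combination of indirect identifiers and flag the combinations held by fewer than some minimum number of people. The field names, sample records, and threshold are hypothetical, and this is not presented as ICPSR's own procedure.

```python
from collections import Counter

# Hypothetical records; field names are illustrative only.
records = [
    {"county": "Washtenaw", "birth_year": 1950, "occupation": "nurse"},
    {"county": "Washtenaw", "birth_year": 1950, "occupation": "nurse"},
    {"county": "Washtenaw", "birth_year": 1972, "occupation": "judge"},
    {"county": "Wayne", "birth_year": 1950, "occupation": "nurse"},
]

def risky_combinations(records, keys, min_count=2):
    """Return key combinations shared by fewer than min_count records."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return sorted(combo for combo, n in counts.items() if n < min_count)

flagged = risky_combinations(records, ["county", "birth_year", "occupation"])
print(flagged)
```

Here two of the four records are unique on the three fields taken together, even though no single field identifies anyone.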

Disclosure Issues
• External Linkages?
– public patient/medical records
– court records
– police and correction records
– Social Security records
– Medicare records
– driver’s licenses
– military records

Opportunity

http://www.flickr.com/photos/k3v1nm/3366181223/

“It saves funding and avoids
repeated data collecting efforts,
allows the verification and
replication of research findings,
facilitates scientific openness,
deters scientific misconduct, and
supports communication and
progress.”
Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing
http://www.iassistdata.org/downloads/iqvol304niu.pdf

“Search/Compare Variables” examines 2.1 million variables in 4,000 data collections

Emerging sources and types of data

• Geo-spatial
• Video
• Administrative data
• Online text
• Transactions
• Clicks
• Sensors

Partnerships
“We propose that domain specific
archives partner with institution based
repositories to provide expertise, tools,
guidelines, and best practices to the
research communities they serve.”

Green, Ann G., and Myron P. Gutmann. (2007) "Building Partnerships Among
Social Science Researchers, Institution-based Repositories, and Domain
Specific Data Archives." OCLC Systems and Services: International Digital
Library Perspectives. 23: 35-53. http://hdl.handle.net/2027.42/41214

http://www.icpsr.umich.edu/icpsrweb/IR/

5 Pilot Data Collections

http://www.flickr.com/photos/smithsonian/2551170386/

Finding interested partners

http://www.flickr.com/photos/usnationalarchives/4726917373/

Time & Willingness

http://www.flickr.com/photos/floridamemory/7026619371/

Survey of Repositories’ Data Needs

Inter-university Consortium for Political and Social
Research. Survey of Data Curation Services for
Repositories, 2012. ICPSR34302-v1. Ann Arbor, MI:
Inter-university Consortium for Political and Social
Research [distributor], 2012-09-21.
doi:10.3886/ICPSR34302.v1

Repository Suggested Solutions:

• Media recovery, format migration, data
recovery
• Cost estimating and policy review
• Metadata tools, documentation, and catalog
linkages
• Support networks and training
• Confidential data dissemination and
confidentiality review

http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf

2. Confidentiality Review & Treatment

• Suppressing unique cases
• Grouping values (e.g., 13-29=1, 30-49=2)
• Top-coding (e.g., >1,000=1,000)
• Aggregating geographic areas
• Swapping values
• Sampling within a larger data collection
• Adding “noise”
• Replacing real data with synthetic data
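
Three of the treatments listed above (grouping values, top-coding, and suppressing unique cases) can be sketched in a few lines. This is a minimal illustration on invented records, not ICPSR's tooling: the function names, field names, and the choice to leave out-of-range ages uncoded are all assumptions.

```python
from collections import Counter

def group_age(age):
    """Recode exact age into the bands from the slide (13-29=1, 30-49=2)."""
    if 13 <= age <= 29:
        return 1
    if 30 <= age <= 49:
        return 2
    return None  # Assumption: ages outside the slide's bands are left uncoded.

def top_code(value, cap=1000):
    """Replace any value above the cap with the cap itself (>1,000=1,000)."""
    return min(value, cap)

def suppress_unique(rows, key):
    """Drop rows whose key value occurs only once in the file."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] > 1]

# Hypothetical respondents.
raw = [
    {"age": 17, "income": 250, "tract": "A"},
    {"age": 35, "income": 4200, "tract": "A"},
    {"age": 48, "income": 990, "tract": "B"},
]

treated = [
    {
        "age_group": group_age(r["age"]),
        "income": top_code(r["income"]),
        "tract": r["tract"],
    }
    for r in suppress_unique(raw, "tract")
]
print(treated)
```

The lone respondent in tract "B" is suppressed, exact ages become band codes, and the one income above the cap is reported as 1,000.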

http://www.icpsr.umich.edu/icpsrweb/content/DSDR/tools/qualanon.html

The Virtual Data Enclave (VDE) provides remote access
to quantitative data in a secure environment.

Hermes Outputs
• ASCII data files
– Column- and tab-delimited

• Stat package setup files
– SAS, SPSS, Stata (.do and .dct)

• “Ready-to-go” data files
– SAS transport (CPORT engine)
– SPSS system (.sav)
– Stata system (.dta)
– R (.rda)
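
To make the two ASCII layouts concrete, the sketch below writes the same records both ways, taking "column-delimited" to mean fixed-width columns whose positions a setup file would describe. It is not the Hermes implementation; the variable names, widths, and records are invented for illustration.

```python
import csv
import io

# Hypothetical variables with fixed column widths.
layout = [("id", 4), ("age", 3), ("state", 2)]
rows = [
    {"id": 1, "age": 34, "state": "MI"},
    {"id": 22, "age": 7, "state": "OH"},
]

def to_tab_delimited(rows, layout):
    """One record per line, values separated by tabs, with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([name for name, _ in layout])
    for row in rows:
        writer.writerow([row[name] for name, _ in layout])
    return buf.getvalue()

def to_column_delimited(rows, layout):
    """One record per line, each value right-justified in its fixed-width column."""
    lines = [
        "".join(str(row[name]).rjust(width) for name, width in layout)
        for row in rows
    ]
    return "\n".join(lines) + "\n"

tab_text = to_tab_delimited(rows, layout)
fixed_text = to_column_delimited(rows, layout)
print(tab_text)
print(fixed_text)
```

The fixed-width file carries no separators or header, which is why it needs an accompanying setup file to be read correctly.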

Your ideas on partnerships?

Useful categories for discussion?
• Media recovery, format migration, data
recovery
• Cost estimating and policy review
• Metadata tools, documentation, and catalog
linkages
• Support networks and training
• Confidential data dissemination and
confidentiality review
