Keynote presented to KE workshop held in conjunction with the release of the report "A Surfboard for Riding the Wave
Towards a four country action programme on research data": http://www.knowledge-exchange.info/Default.aspx?ID=469
1. How many solutions does it take to change the face of research data?
Todd Vision
Dryad Digital Repository
University of North Carolina
at Chapel Hill
KE Workshop
14-15 November 2011
Bonn, Germany
2. “Es sollte nur ein Magazin der Kunst in der Welt sein wo der Künstler seine
Kunstwerke nur hinzugeben hätte um zu nehmen was er brauchte”
“There ought to be in the world a repository of art, to which the artist need
only bring his artworks in order to take what he needed”
Beethoven, letter to publisher F.A. Hoffmeister, 15 January 1801
3. Open dissection of research:
the Beethoven Repository
http://rockethub.com/projects/3755-open-dissection-of-research
8. Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow,
Passer domesticus. Biological Lectures from the Marine Biological Laboratory: 209-226.
10. Failure of peer-to-peer data sharing
Wicherts and colleagues requested data from
141 articles in American Psychological
Association journals.
“6 months later, after … 400 emails, [sending]
detailed descriptions of our study aims, approvals
of our ethical committee, signed assurances not
to share data with others, and even our full
resumes…” only 27% of authors complied
Wicherts, J.M., Borsboom, D., Kats, J., Molenaar, D. (2006). The poor availability
of psychological research data for reanalysis. American Psychologist, 61, 726-728.
11. News alert: scientists are human
“We related the reluctance to share research data for reanalysis to 1148
statistically significant results reported in 49 papers published in two
major psychology journals. We found the reluctance to share data to be
associated with weaker evidence (against the null hypothesis of no effect)
and a higher prevalence of apparent errors in the reporting of statistical
results. The unwillingness to share data was particularly clear when
reporting errors had a bearing on statistical significance”.
[Figure: panels labeled "Not shared" and "Shared"]
Wicherts et al. (2011) doi:10.1371/journal.pone.0026828
12. Lang GI, Botstein D (2011) PLoS ONE
doi:10.1371/journal.pone.0025290
101 pages!
13. Joint Data Archiving Policy (JDAP)
Data are important products of the scientific enterprise, and
they should be preserved and usable for decades in the
future.
As a condition for publication, data supporting the results in
the article should be deposited in an appropriate public
archive.
Authors may elect to embargo access to the data for a
period up to a year after publication.
Exceptions may be granted at the discretion of the editor,
especially for sensitive information.
Whitlock, M. C., M. A. McPeek, M. D. Rausher, L. Rieseberg, and A. J.
Moore. 2010. Data Archiving. American Naturalist. 175(2):145-146.
16. [Diagram: manuscript and data submission workflow linking author, journal, and Dryad]
Author
• prepare manuscript and related data files
• submit manuscript
• upload data
Journal
• manuscript review
• editor: accepted? (no / yes)
• send article description
Dryad
• data package
• curation (data curator)
• send data identifier (DOI)
Outputs
• published article (with data citation)
• published data (with article citation)
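The handoffs labeled on slide 16 can be sketched in code. This is a minimal illustration, not Dryad's actual system: the `Submission` class, the `run_workflow` function, the exact ordering of the steps, and the placeholder DOI are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Submission:
    """Hypothetical record of one manuscript and its data files."""
    title: str
    data_files: List[str]
    accepted: Optional[bool] = None
    data_doi: Optional[str] = None
    log: List[str] = field(default_factory=list)


def run_workflow(sub: Submission,
                 editor_accepts: bool,
                 mint_doi: Callable[[Submission], str]) -> Submission:
    """Walk a submission through the steps named on the slide.

    The step order is an assumption read off the diagram labels.
    """
    sub.log.append("author: prepare manuscript and related data files")
    sub.log.append("author: submit manuscript")
    sub.log.append("journal: manuscript review")
    sub.accepted = editor_accepts
    if not sub.accepted:
        # Rejected manuscripts never reach the repository in this sketch.
        sub.log.append("editor: not accepted")
        return sub
    sub.log.append("journal: send article description to repository")
    sub.log.append("author: upload data")
    sub.log.append("data curator: curate data package")
    sub.data_doi = mint_doi(sub)  # identifier is returned to the journal
    sub.log.append("repository: send data identifier (DOI)")
    sub.log.append("published article (with data citation) and "
                   "published data (with article citation)")
    return sub


if __name__ == "__main__":
    # The DOI below is a made-up placeholder, not a real identifier.
    done = run_workflow(Submission("Example article", ["table1.csv"]),
                        editor_accepts=True,
                        mint_doi=lambda s: "doi:10.0000/example.1")
    for step in done.log:
        print(step)
```

The point of the design is the reciprocal link at the end: the article cites the data and the data cites the article, which only works if the identifier is issued before publication.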
19. Survey of authors
What are the policies of your funder as they
apply to online public archiving? (n=983)
1% Forbids
21% Recommends
9% Requires
40% No policy
26% I don’t know
3% Other
20. Data policies among bioscience journals
[Figure: chart of journal data-sharing policies; labels IF=3.6, IF=6.0, IF=4.5; n=70]
Piwowar HA, Chapman WW (2008) A review of journal policies for sharing research
data. Presented at ELPUB2008, Nature Precedings hdl:10101/npre.2008.1700.1
27. Does sharing need to be altruistic?
• For a set of 85 cancer microarray
clinical trials
48% had publicly available data
These received 85% of the article citations
Independent of journal impact factor,
publication date, author nationality
Piwowar H, et al. (2007) Sharing Detailed Research Data Is
Associated with Increased Citation Rate. PLoS ONE 2(3): e308.
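As a rough worked example, the two percentages on slide 27 imply a ratio of mean citations per article between the two groups. This is a crude, unadjusted calculation from the slide's aggregate shares only; it is not the adjusted estimate reported in the cited paper, and the function name is an assumption made for illustration.

```python
def mean_citation_ratio(share_of_articles: float,
                        share_of_citations: float) -> float:
    """Ratio of mean citations per article: sharing vs. non-sharing group.

    Both arguments are fractions of the whole (articles and citations).
    """
    sharing = share_of_citations / share_of_articles
    non_sharing = (1 - share_of_citations) / (1 - share_of_articles)
    return sharing / non_sharing


# Slide figures: 48% of trials shared data and drew 85% of the citations.
ratio = mean_citation_ratio(0.48, 0.85)
print(f"Unadjusted ratio of mean citations: {ratio:.1f}x")
```

Because the calculation ignores journal, date, and author effects, it overstates what can be attributed to sharing alone.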
28. Taxonomy of data archiving benefits
Direct
• Verification of published research
• Preserving accessibility to data
• Allowing reuse and repurposing of data
• Discoverability of data
Indirect (costs avoided)
• Redundant data collection
• Inefficient legacy data curation
• Burden of sharing-upon-request
• Opportunity cost of science not done
Near term
• Protection against personnel turnover
• Availability for review and validation
Long term
• Secure long-term stewardship
• Increased impact per publication
Private
• Increased citations
• New collaborations
• New research opportunities
• Fulfilling funding mandates
Public
• More efficient use of research dollars
• Public trust in science
• Educational opportunities
• Improved methodologies
• More informed policy
Modified from Beagrie et al. (2010) Keeping Research Data Safe 2
29. Galilei, Galileo (1638) Discorsi e dimostrazioni matematiche, intorno à due
nuove scienze Attenenti alla Mecanica I Movimenti Locali. Elsevier
31. Costs
• Moderate economies of scale are required
At 10K packages/yr, $50/deposit, depending on curation
• What are the costs for supplementary online material (SOM)?
Journal of Clinical Investigation: $300 flat fee
Ecological Archives: $250 for up to 10 MB, with additional fees beyond that
FASEB: $100 per file
Beagrie N, Eakin-Richards L, Vision TJ (2010) Business models
and cost estimation: Dryad repository case study. iPRES 2010
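The figures on slide 31 can be put side by side with a little arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, the annual total is simply the slide's two numbers multiplied together, and the Ecological Archives fee is left out because the slide does not say what is charged beyond 10 MB.

```python
def annual_archive_budget(packages_per_year: int,
                          cost_per_deposit_usd: float) -> float:
    """Total yearly cost implied by a per-deposit cost at a given scale."""
    return packages_per_year * cost_per_deposit_usd


def som_fee_usd(venue: str, n_files: int = 1) -> float:
    """Supplementary-material fee for the two venues with simple pricing."""
    if venue == "Journal of Clinical Investigation":
        return 300.0            # flat fee, regardless of file count
    if venue == "FASEB":
        return 100.0 * n_files  # charged per file
    raise ValueError(f"no simple fee rule known for {venue!r}")


# Slide scenario: 10K packages per year at $50 per deposit.
print(annual_archive_budget(10_000, 50))

# A hypothetical article with three supplementary files.
for venue in ("Journal of Clinical Investigation", "FASEB"):
    print(venue, som_fee_usd(venue, n_files=3))
```

Under these numbers a three-file supplement costs the same at both venues, and several times the $50 per-deposit figure quoted for an archive operating at scale.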
32. What is the return on investment?
• A rigorous framework is lacking
But we can look at comparators
• Marginal cost of data archiving
$50/article is 2% of publication costs ($2.5K)
And 0.2% of grant costs/article (~$25K)
• Is the data worth 2% of the research investment?
Using DNA microarray data in GEO as a model
2,711 submissions in 2007
Data reused by 3rd parties in 1,150 articles
Vision TJ (2010) Open data and the social contract of scientific publishing. BioScience 60(5):330-331
Piwowar H, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473:285
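The percentages on slide 32 follow directly from the dollar figures given there, and the GEO numbers give a simple reuse ratio. The sketch below only reproduces that arithmetic; the variable names are assumptions, and the reuse ratio is just the quotient of the two counts on the slide, not a claim about how many individual datasets were reused.

```python
ARCHIVING_COST_USD = 50        # marginal cost per article (slide figure)
PUBLICATION_COST_USD = 2_500   # publication cost per article (slide figure)
GRANT_COST_USD = 25_000        # approximate grant cost per article (slide figure)

GEO_SUBMISSIONS_2007 = 2_711   # microarray submissions to GEO in 2007
REUSE_ARTICLES = 1_150         # third-party articles reusing those data


def share(part: float, whole: float) -> float:
    """Return part as a fraction of whole."""
    return part / whole


print(f"Share of publication cost: {share(ARCHIVING_COST_USD, PUBLICATION_COST_USD):.1%}")
print(f"Share of grant cost:       {share(ARCHIVING_COST_USD, GRANT_COST_USD):.1%}")
print(f"Reuse articles per submission: {share(REUSE_ARTICLES, GEO_SUBMISSIONS_2007):.2f}")
```

Read together, the figures frame the question on the slide: a cost of well under one percent of the grant is weighed against reuse on the order of four new articles for every ten deposits.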
35. DataONE network
Three major components for
flexibility, scalability and sustainability
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Investigator Toolkit
36. Concluding thoughts
• Archiving is essential
• Journals and learned societies will be at
least as important as institutions
• Funders cannot be shy about policy, and
must drive the marketplace
• For data, we can leverage much of what already
works well for traditional publications
• International cooperation is a must
39. Image and figure credits, by slide number:
12. CC-BY, Lang GI, Botstein D, source: A Test of the Coordinated Expression Hypothesis for the Origin and Maintenance of the GAL Cluster in Yeast. PLoS ONE 6(9): e25290, 2011. doi:10.1371/journal.pone.0025290
13. CC BY-SA 2.0, avlxyz, source: http://www.flickr.com/photos/avlxyz/4589977933/
16. Courtesy of Peggy Schaeffer
20. CC-BY, Piwowar HA, Chapman WW, source: A review of journal policies for sharing research data, Nature Precedings, hdl:10101/npre.2008.1700.1
21. CC BY 2.0, sashafatcat, source: http://www.flickr.com/photos/sashafatcat/2381412445
23, 24. CC-BY, H Piwowar, J Carlson, T Vision, unpublished
25. CC-BY, H Piwowar, source: http://researchremix.wordpress.com/2011/05/28/dear-nsf-reviewers/
26. CC BY-ND 2.0, Sivaprakash Kannan, source: http://www.flickr.com/photos/sivaprakash/294755142/
28. After: Beagrie N, Lavoie B, Woollard M (2010) Keeping Research Data Safe 2, http://www.jisc.ac.uk/media/documents/publications/reports/2010/keepingresearchdatasafe2.pdf
29. Galilei, Galileo (1638) Discorsi e dimostrazioni matematiche, intorno à due nuove scienze Attenenti alla Mecanica I Movimenti Locali. Elsevier. Source: original unknown.
30. CC BY-NC-SA 2.0, Coralie Mercer, source: http://www.flickr.com/photos/koalie/394934841/
33. CC BY-NC-ND 2.0, www.english.school.nz, source: http://www.flickr.com/photos/iei/2904115612/
34. Liberty ship under construction, source: http://en.wikipedia.org/wiki/File:Liberty_ship_construction_04_bottom.jpg, public domain