Talk I gave as part of the panel "How will cyberinfrastructure capabilities shape the future of scientific collaboration?" at the Cyberinfrastructure for Collaborative Science workshop, held at the National Evolutionary Synthesis Center (NESCent), May 18-20, 2011.
More information about the workshop at
https://www.nescent.org/wg_collabsci/2011_Workshop
Open science, open-source, and open data: Collaboration as an emergent property?
1. Open science, open-source,
and open data:
Collaboration as an
emergent property?
Cyberinfrastructure for
Collaborative Science
A workshop at NESCent, Durham, NC, May 18-20
Hilmar Lapp and Todd Vision, NESCent
2. “Openness is the new
flower power”
“We create the
scholarship. We
create the meta-data.
We create the tools.
We can reclaim and
reinvent the way that
scientific scholarship
is created and
disseminated.”
Peter Murray-Rust
http://blogs.ch.cam.ac.uk/pmr/2010/08/02/flowerpoint-step-by-step/
4. What does “open” mean?
“A piece of content
or data is open if
anyone is free to use,
reuse, and
redistribute it —
subject only, at most,
to the requirement
to attribute and
share-alike.”
5. What does “open” mean?
“A piece of content
or data is open if
anyone is free to
use, reuse, and
redistribute it —
subject only, at
most, to the
requirement to
attribute and share-
alike.”
9. Dryad: archiving, finding,
and sharing open data
• Data archived at publication • Data given persistent GUIDs
• Option for limited-term embargo • Metadata searchable
• Easy submission process • API for data exchange
• Data may be peer reviewed • Capacity for versioning/updates
• Persistent link from paper to data • Long-term preservation
• Data in public domain (CCZero) • Community governance
10. Dryad: archiving, finding,
and sharing open data
• Data archived at publication • Data given persistent GUIDs
• Option for limited-term embargo • Metadata searchable
• Easy submission process • API for data exchange
• Data may be peer reviewed • Capacity for versioning/updates
• Persistent link from paper to data • Long-term preservation
• Data in public domain (CCZero) • Community governance
12. • Two-layered process for gaining access to data
• “Individuals or groups of individuals that would like to use
the TRY database for a scientific project […] should submit a
proposal to TRY”
• “TRY will give the proponent the list of individuals […] that
will necessarily have to be […] consulted […]”
13. Another model for
controlled sharing: Text
Text
• Data description
published
• Data embargoed 1 (or
3) yrs post publication
• Afterwards it is
released under CCZero
14. Open Source:
The value of community
Web traffic 500
Mailing list membership
2008
400
2010
300
200
100
0
ajax anno arch cma deve gbro s
unce itect p l wse chema
ure
Example: Generic Model Organism Database
15. Community participation
in Phenoscape
• Anatomy Ontology Contributors: 25
• Taxonomy Ontology Contributors: 11
• Community data annotators: 10
• Workshop Participants: ~75
• Software testing volunteers: 33
• Internship students: 13
• Phenotype Ontology RCN: http://phenotypercn.org
16. Sustaining informatics
resources over the long term
• 875 modules in core, >422,000
lines of code
• Most widely used Perl toolkit in
the life sciences
• Active for 16 years
• No direct grant funding
• Continues to recruit new
contributors
• Leadership baton passed 5 times,
dozens of committers
• Stajich et al (2002) Genome
Biology: cited 509x
22. Summary
• The principles in common are freedom to
• Reuse
• Modify and recombine
• Redistribute
• These are also fundamental to enabling
synthetic collaborative science.
28. Whenever you do a
thing, act as if all the
world were watching.
Thomas Jefferson
29. Acknowledgments
• NESCent Directors • EvoInfo Working Group
and resident scientists participants
• NESCent Informatics team: • Participants and co-
R. Scherle, J. Balhoff, C. organizers of 5 hackathons
Kothari, D. Clements,
X. Liu, V. Gapeyev, J. Auman • Google OSPO, Google
Summer of Code students &
• Dryad team, UNC/MRC mentors
collaborators, including
Heather Piwowar • The O|B|F community and
fellow Bio* developers
• Phenoscape team (PIs P.
Mabee, M. Westerfield, T. • Comp. Phyloinformatics
Summer Course students &
Vision), curators, and other
instructors
collaborators
Editor's Notes
\n
\n
\n
\n
\n
Describing how the work of the summer interns got picked up by others while it was happening, based on the web presence of their evolving outputs. Examples, multiple outside groups stepped in to flesh out and maintain the database of funder policies on data archiving that Nic launched. Valerie’s resource pages on data citation led to students being invited by DataCite to contribute to the metadata model behind data DOIs. The interns also posted regularly to a project blog, shared their bibliographies on Mendeley, and their end-of-summer poster was likely seen by a much larger number of people online than at the conference, based on the Twitter buzz we observed.\n\nDownsides: There was fear (founded or not) of the consequences for organizational reputation to have things out there representing DataONE that might be seen as shoddy, or embarrasing to publishers, etc., and it made us add more disclaimers/qualifiers to the web documentation.  We discovered that there were some weaknesses with the platforms we used (e.g. OpenWetWare).  This particular batch of interns didn't seem to be intimidated.  And the open communication led to some efficiencies with the larger mentor group -- I'm not aware of inefficiencies that were introduced.  The blog may not have always been the best investment of time from a research-productivity standpoint, but was a useful learning experience for the students nonetheless. \n
“Researchers have found many reasons for making their notebook open: better collaborations, greater visibility and broader impact for funding requirements, sharing detailed methods, solving the dilemma of  null results publishing, helping other scientists advance, demonstrating the viability of the idea or participation in a community of open science.”\n
\n
Dryad provides a way to make data to be available not just for validation, but also for new method development, meta-analysis, and other types of synthetic science; it also give researchers credit for their data as a first-class scholarly product.\n
Dryad provides a way to make data to be available not just for validation, but also for new method development, meta-analysis, and other types of synthetic science; it also give researchers credit for their data as a first-class scholarly product.\n
An example of how valuable such archived data can be. NESCent postdoc Amy Zanne and colleagues led a group that in 2009 deposited to Dryad a dataset compiling wood anatomy data from 8412 plant species. This dataset has already been downloaded over 600 times! While some of these downloads may lead to citations, there is probably a good deal of data reuse for educational purposes, and likely also exploration of analytical methods on this unique dataset.\n\nThe inset from the corresponding Ecology Letters article shows the geographical distribution of wood density in North and South America. Each data point is the mean wood density value of all unique species occurrences in that cell. Wood density clearly varies in a very predictable way with temperature, precipitation, and seasonality. Dryad contains the data underlying this figure, but without Dryad, researchers would be unable to reconstruct the original data from this image for testing new hypotheses.\n
Contrast this with the newly announced TRY database. A similar community compilation of trait data. But access is guarded by a formidable two-tiered process.\n
And contrast that with another model for protecting the professional needs of researchers in their published data. Many of the Dryad partner journals allow authors to embargo data for a limited period, but they must archive it publicly upon publication, and reuse cannot be controlled after the embargo has been lifted. This is consistent with the idea that the original authors have a conflict of interest in the reuse of that data after publication, and should not be empowered to suppress independent reanalysis any more than they should for the contents of the article itself. Dryad supports 1 yr no-questions asked embargoes. In this case, a published dataset from some of the TRY contributors has been embargoed for 3 yrs at the discretion of the journal editor.\n
Mailing list traffic has gone, in the case of Gbrowse, from less then 50/month in 2007 to sometimes over 400/month in 2010\nWe know that there have been over 200 new installations of Gbrowse per month based on registrations\n
\n
\n
\n
\n
8 participating R packages\nAde4, Ape, apTreeshape, Geiger, Laser, OUCH, PaleoTS, Picante\nBefore the hackathon:1/8 were using version control (but none public), 1/8 had a mailing list\n