This document discusses the opportunities of open data sharing in the big data era, including quicker responses to problems, more collaboration, and harnessing crowd-sourced efforts. It provides examples of open data enabling scientific progress, such as genome analysis that helped control an E. coli outbreak. Open data can provide credit to data sharers and incentivize open science. The document advocates for removing barriers to open data like paywalls and silos through initiatives like GigaDB and GigaScience that integrate publishing and data platforms to maximize data utility.
Scott Edmunds at Tech4Dev on Open Publishing for the Big-Data Era
1. Open Publishing for the Big-Data Era
“Information is the currency of the future world”
William Gibson
Scott Edmunds, Peter Li, Huayan Gao, Chris Hunter, Si Zhe
Xiao, Tin-Lap Lee, Laurie Goodman
#Tech4Dev: 4th June 2014
2. Challenges/Opportunities in the Data-Driven Era
Challenges:
Quick response to climate change, food security & disease outbreaks
Enables:
Using networking power of the internet to tackle problems
Can ask new questions & find hidden patterns & connections
Build on each other's efforts quicker & more efficiently
More collaborations across more disciplines
Harness wisdom of the crowds: crowdsourcing, citizen science, crowdfunding
Enabled by:
Removing silos, standards/formats, open-access/data
3. Not enabled by: paywalls, silos, dead trees
• Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible. (Jon B. Buckheit and David L. Donoho, WaveLab and Reproducible Research, 1995)
• Lack of transparency, lack of credit for anything other than
“regular” dead tree publication
• If there is interest in data, only to monetise & repackage
4. New incentives/credit
• Data
• Software
• Re-use…
} = Credit
Credit where credit is overdue:
“One option would be to provide researchers who release data to public repositories with a means of accreditation.”
“An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.”
Nature Biotechnology 27, 579 (2009)
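The idea of searching the literature for every paper that used a particular data set can be sketched in a few lines of Python. This is only an illustration, not how GigaDB or any indexing service works: the function name `index_dataset_mentions` and the sample paper texts are hypothetical, and the pattern assumes dataset DOIs look like the two cited in this deck (prefix 10.5524 followed by six digits).

```python
import re
from collections import defaultdict

# Matches DOIs under the 10.5524 prefix seen in this deck's dataset citations.
# The six-digit suffix is an assumption based on those two examples only.
DATASET_DOI = re.compile(r"10\.5524/\d{6}")


def index_dataset_mentions(papers):
    """Map each dataset DOI to the sorted ids of papers whose text mentions it."""
    index = defaultdict(set)
    for paper_id, text in papers.items():
        for doi in DATASET_DOI.findall(text):
            index[doi].add(paper_id)
    return {doi: sorted(ids) for doi, ids in index.items()}


# Hypothetical paper texts, made up for illustration.
papers = {
    "paper-A": "Reads were taken from http://dx.doi.org/10.5524/100001 for assembly.",
    "paper-B": "We compared against 10.5524/100001 and 10.5524/100039.",
    "paper-C": "No shared data were used.",
}

mentions = index_dataset_mentions(papers)
for doi, paper_ids in sorted(mentions.items()):
    print(doi, "used by", ", ".join(paper_ids))
```

Because each dataset carries its own DOI, a count of the papers listed against it gives the data sharer a citation-style measure of re-use.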
5. GigaSolution: deconstructing the paper
www.gigadb.org
www.gigasciencejournal.com
Utilizes big-data infrastructure and expertise from:
Combines and integrates:
Open-access journal
Data Publishing Platform
Data Analysis Platform
7. Democratization: the “People's Parrot”
Puerto Rican Parrot Genome Project (Amazona vittata)
Was the rarest parrot, national bird of Puerto Rico
Community funded from artworks, fashion shows, beer brands, crowdfunding…
Genome annotated by students in community college as part of bioinformatics education
Paper and Data published and promoted in GigaScience and GigaDB
Taras K Oleksyk, et al. (2012): A Locally Funded Puerto Rican Parrot (Amazona vittata) Genome Sequencing Project Increases Avian Data and Advances Young Researcher Education. GigaScience 2012, 1:14
Steven J. O’Brien (2012): Genome empowerment for the Puerto Rican parrot – Amazona vittata. GigaScience 2012, 1:13
Oleksyk et al. (2012): Genomic data of the Puerto Rican Parrot (Amazona vittata) from a locally funded project. GigaScience. http://dx.doi.org/10.5524/100039
8. Crowdsourcing disease outbreaks:
To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011): Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
11. Downstream consequences:
“Last summer, biologist Andrew Kasarskis was eager to help decipher the genetic origin of the Escherichia coli strain
that infected roughly 4,000 people in Germany between May and July. But he knew it might take days for the
lawyers at his company — Pacific Biosciences — to parse the agreements governing how his team could use data
collected on the strain. Luckily, one team had released its data under a Creative Commons licence that allowed free
use of the data, allowing Kasarskis and his colleagues to join the international research effort and publish their work
without wasting time on legal wrangling.”
1. Citations (>200)
2. Therapeutics (primers, antimicrobials)
3. Platform Comparisons
4. Example for faster & more open science
12. 1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.