Scott Edmunds: Data publication in the data deluge
1. Data publication in the data deluge
Scott Edmunds, GigaScience/BGI Hong Kong
COASP 2012, Budapest, 20th September 2012
www.gigasciencejournal.com
2. The Data Challenge:
•1.2 zettabytes (1.2 × 10^21 bytes) electronic data generated globally each year
•>Exponential growth of genomics data (& growth in imaging and
MS data following)
Source: http://www.genome.gov/sequencingcosts/ (with apologies)
•Issues with reproducibility, hosting, curation, interoperability
•Need for better incentives to overcome these
Source: 1. Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.
3. Large-Scale Data
Journal/Database
In conjunction with:
Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Commissioning Editor: Nicole Nogoy, PhD
Lead Curator: Tam Sneddon D.Phil
Data Platform: Peter Li, PhD
www.gigasciencejournal.com
4. GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data”…
6. Anatomy of a Publication
Idea
Study
Metadata
Data
Analysis
Answer
7. Anatomy of a Data Publication
Idea
Study
Metadata
Data
Analysis
Answer
8. Issues for Data Publication
Idea
Cultural issues:
Study
Technical issues:
Metadata
Data
Analysis
Answer
9. Issues for Data Publication
Idea
Cultural issues:
Study
Metadata
Data
Analysis
Answer
Adoption held back by: journal policies, citation, tracking…
* T-Shirts available from Graham Steel / http://www.zazzle.co.uk/steelgraham
10. Issues for Data Publication
Idea
Study
Technical issues:
Metadata
What do we do with the data?
Data
Analysis
Answer
11. Issues for Data Publication
Idea
Study
Technical issues:
Metadata
What do we do with the data?
Data
Analysis
Answer
Lightweight:
•Metadata-only journals
•Get someone else to host
Heavyweight:
•Become a repository
12. To host or not to host?
Against: supplementary files argument
The Journal of Neuroscience
[Figure: average size of a Journal of Neuroscience article and supplemental material, in megabytes.]
Announcement Regarding Supplemental Material:
Beginning November 1, 2010, The Journal of
Neuroscience will no longer allow authors to include
supplemental material when they submit new manuscripts
and will no longer host supplemental material on its web site
for those articles.
“While the size of articles has grown gradually over the
past decade, the supplemental material associated with a
typical Journal article appears to be growing exponentially
and is rapidly approaching the size of an article. The
sheer volume of supplemental material is adversely
affecting peer review.”
Maunsell J. J. Neurosci. 2010;30:10599-10600
13. $1000 genome = million $ peer-review?
To review: (>6TBp, >1500 datasets)
S3 (storage) = $15,000
EC2 (analysis w/ BLASTx) = $500,000
Source: Folker Meyer/Wilkening et al. 2009, CLUSTER'09. IEEE International Conference on Cluster Computing and Workshops
14. $1000 genome = million $ peer-review?
ENCODE analysis Virtual Machine:
Containing: input data, code
bundles with scripts and
processing steps, outputs
AWS = ~$5,000
Source: James Taylor / http://encodeproject.org/ENCODE/integrativeAnalysis/VM
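The arithmetic behind this comparison can be sketched in a few lines of Python. The dollar figures are the ones quoted on the slides; the variable names are illustrative, and the comparison assumes the two workloads are roughly comparable, which the slides suggest but do not state.

```python
# Figures quoted on the slides (US dollars).
s3_storage = 15_000      # S3 storage for the >1500 datasets
ec2_blastx = 500_000     # EC2 analysis with BLASTx
encode_vm = 5_000        # approximate AWS cost of the ENCODE analysis VM

# Cost of re-running everything from the raw data.
review_from_scratch = s3_storage + ec2_blastx

# How many times cheaper the packaged virtual machine is.
ratio = review_from_scratch / encode_vm

print(f"Re-running the analysis from raw data: ${review_from_scratch:,}")
print(f"Re-running a packaged virtual machine: ~${encode_vm:,}")
print(f"Roughly {ratio:.0f}x cheaper with the packaged VM")
```

On these numbers a full re-analysis costs about half a million dollars, while a packaged virtual machine is about two orders of magnitude cheaper.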
15. To host or not to host?
For: reproducibility
The Guardian, 14th September 2012: Replication is the only solution to scientific fraud.
http://www.guardian.co.uk/commentisfree/2012/sep/14/solution-scientific-fraud-replication
For: “data is the new oil”
William Gibson: “Information is the currency of the future world”
Sir Tim Berners-Lee: “Data is a precious thing and will last longer than
the systems themselves”
Move compute to the data: think EC2 rather than S3
DNA Nexus + 0.5PB SRA data = $15 million given by Google
Source: DNA Nexus/SRA http://techcrunch.com/2011/10/12/dnanexus-raises-15-million-teams-with-google-to-host-massive-dna-database/
18. For data citation to work, needs:
1. Proven utility/potential user base.
2. Acceptance/inclusion by journals.
3. Data+Citation: inclusion in the references.
4. Tracking by citation indexes.
5. Usage of the metrics by the community…
19. Datacitation 1: utility/user base.
Establishment of data DOIs and use by databases:
Shackleton NJ, Hall MA, Vincent E (2001): Mean stable carbon isotope ratios
of Cibicidoides wuellerstorfi from sediment core MD95-2042 on the Iberian
margin, North Atlantic. PANGAEA - Data Publisher for Earth & Environmental
Science. http://doi.pangaea.de/10.1594/PANGAEA.58229
Cited in:
Pahnke K, Zahn R: Southern Hemisphere Water Mass Conversion Linked with North Atlantic
Climate Variability. Science 2005, 307:1741-1746.
Nocek B, Xu X, Savchenko A, Edwards A, Joachimiak A. 2007. PDB
ID: 2P06 Crystal structure of a predicted coding region AF_0060
from Archaeoglobus fulgidus DSM 4304. 10.2210/pdb2p06/pdb.
Cited in:
Andreeva A, Howorth D, Chandonia J-M, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: Data
growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008,
36:D419-425.
20. BGI Datasets Get DOI®s
Legend: Released pre-publication / Paper published in GigaScience
Invertebrate
Ant
- Florida carpenter ant
- Jerdon’s jumping ant
- Leaf-cutter ant
Roundworm
Schistosoma
Silkworm
Human
Asian individual (YH)
- DNA Methylome
- Genome Assembly
- Transcriptome
Cancer (14TB)
Single cell bladder cancer
HBV infected exomes
Ancient DNA
- Saqqaq Eskimo
- Aboriginal Australian
Vertebrates
Darwin’s Finch
Giant panda
Macaque
- Chinese rhesus
- Crab-eating
Mini-Pig
Naked mole rat
Parrot, Puerto Rican
Penguin
- Emperor penguin
- Adelie penguin
Pigeon, domestic
Polar bear
Sheep
Tibetan antelope
Microbe
E. coli O104:H4 TY-2482
T2D gut metagenome
Cell-Lines
Chinese Hamster Ovary
Mouse methylomes
Plants
Chinese cabbage
Cucumber
Foxtail millet
Pigeonpea
Potato
Sorghum
21. Our first DOI:
To maximize its utility to the research community and aid those fighting
the current epidemic, genomic data is released here into the public domain
under a CC0 license. Until the publication of research papers on the
assembly and whole-genome analysis of this isolate we would ask you to
cite this dataset as:
Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang,
Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun,
Y; Chen,Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song,Y ;
Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482
isolate genome sequencing consortium (2011)
Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen.
doi:10.5524/100001
http://dx.doi.org/10.5524/100001
To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to
Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
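As a rough illustration of what a citable dataset record involves, the sketch below assembles the reference-list entry and resolver link shown on this slide from a handful of metadata fields. The DatasetRecord class and both helper functions are hypothetical names invented for this example, not part of GigaDB or DataCite, and the author list is abbreviated.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Minimal citation metadata for a dataset (hypothetical structure)."""
    authors: str
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(record: DatasetRecord) -> str:
    """Return a reference-list entry in the style shown on the slide."""
    return (
        f"{record.authors} ({record.year}) {record.title}. "
        f"{record.publisher}. doi:{record.doi}"
    )


def resolver_url(doi: str) -> str:
    """Return the resolver link for a DOI, using the dx.doi.org form from the slide."""
    return f"http://dx.doi.org/{doi}"


record = DatasetRecord(
    authors="Li, D; Xi, F; Zhao, M; et al.",  # author list abbreviated for the example
    year=2011,
    title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
    publisher="BGI Shenzhen",
    doi="10.5524/100001",
)

print(format_citation(record))
print(resolver_url(record.doi))
```

Keeping the DOI as a separate field means the same record can feed both a human-readable reference and a machine-resolvable link.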
25. 1.3 The power of intelligently open data
The benefits of intelligently open data were powerfully
illustrated by events following an outbreak of a severe gastro-
intestinal infection in Hamburg in Germany in May 2011. This
spread through several European countries and the US,
affecting about 4000 people and resulting in over 50 deaths. All
tested positive for an unusual and little-known Shiga-toxin–
producing E. coli bacterium. The strain was initially analysed by
scientists at BGI-Shenzhen in China, working together with
those in Hamburg, and three days later a draft genome was
released under an open data licence. This generated interest
from bioinformaticians on four continents. 24 hours after the
release of the genome it had been assembled. Within a week
two dozen reports had been filed on an open-source site
dedicated to the analysis of the strain. These analyses
provided crucial information about the strain’s virulence and
resistance genes – how it spreads and which antibiotics are
effective against it. They produced results in time to help
contain the outbreak. By July 2011, scientists published papers
based on this work. By opening up their early sequencing
results to international collaboration, researchers in Hamburg
produced results that were quickly tested by a wide range of
experts, used to produce new knowledge and ultimately to
control a public health emergency.
31. Is the DOI…
* Certain types of genomics data must also be deposited in INSDC databases (SRA & GenBank).
32. And in more journals…
Hodkinson BP, Uehling JK, Smith ME (2012) Data from: Lepidostroma
vilgalysii, a new basidiolichen from the New World. Dryad Digital
Repository. doi:10.5061/dryad.j1g5dh23
Cited in:
Hodkinson BP, Uehling JK, Smith ME: Lepidostroma vilgalysii, a new basidiolichen
from the New World. Mycological Progress 2012. Advance Online Publication.
Roberts SB (2012) Herring Hepatic Transcriptome 34300 contigs.fa. Figshare. Available:
hdl.handle.net/10779/084d34370fbda29bbc67b3c5ecb02575. Accessed 2012 Jan 20.
Cited in:
Roberts SB, Hauser L, Seeb LW, Seeb JE (2012) Development of Genomic Resources
for Pacific Herring through Targeted Transcriptome Pyrosequencing. PLoS ONE 7(2):
e30908. doi:10.1371/journal.pone.0030908
33. For data citation to work, needs:
1. Proven utility/potential user base. ✔
2. Acceptance/inclusion by journals. ✔
3. Data+Citation: inclusion in the references. ✔
4. Tracking by citation indexes.
5. Usage of the metrics by the community…
35. Datacitation 4: tracking?
✗FAIL
DataCite metadata in harvestable form (OAI-PMH)
- lists some DataCite DOIs, but says:
Datasets listed are the “result of approximations in the indexing
algorithms.”
“Google Scholar's intended coverage is for scholarly articles. At
this point, we don't include datasets.”
36. Datacitation 4: tracking?
✗FAIL
DataCite metadata in harvestable form (OAI-PMH)
✗ Working on it. Coming soon…
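For readers unfamiliar with OAI-PMH, harvesting amounts to issuing HTTP requests with a small set of standard query parameters. The sketch below only builds such a request URL and makes no network call. The verb, metadataPrefix and from parameters are defined by the OAI-PMH protocol; the base URL and the function name are assumptions made for illustration and should be checked against the repository's own documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint, for illustration only; confirm the real base URL
# in the repository's documentation before harvesting.
BASE_URL = "https://oai.datacite.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Selective harvesting: only records changed on or after this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


url = list_records_url(BASE_URL, from_date="2012-01-01")
print(url)
```

A citation index could poll a URL of this shape periodically to pick up new dataset records, which is the kind of tracking this slide finds missing.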
38. Datacitation 5: metrics?
“As a result of diverse practices and tool
limitations, data citations are currently very
difficult to track.”
39. Datacitation 5: metrics?
✗FAIL
Research Remix, 29th May 2012: http://researchremix.wordpress.com/2012/05/29/dear-research-data-advocate-please-sign-the-petition-oamonday/
“I’m afraid we are making promises to data
creators about attribution and reward that we
can’t keep. “Make your data citeable!” is the cry.
OK. So citeable is step one. Cited is step two. But
for the citation to be useful, it has to be indexed
so that citation metrics can be tracked and
admired and used.
Who is indexing data citations right now? As far
as I can tell: absolutely no one.”
40. Where data citation is in 2012:
1. Proven utility/potential user base. ✔
2. Acceptance/inclusion by journals. ✔
3. Data+Citation: inclusion in the references. ✔
4. Tracking by citation indexes. ✔/✗
5. Usage of the metrics by the community… ✗
42. Addressing the reproducibility gap:
Computable methods/workflow systems
[Diagram labels: Bioinformatics development; Biomedical and bioinformatics research; Publishing]
43. Redefining what is a paper in the era of big-data?
goal: Executable Research Objects
Citable DOI
46. Methods + Data + Publication
• Background
• Methods (DOI for workflows?)
• Results (Data): doi:10.5524/100035
• Conclusions/Discussion
Publication: doi:10.1186/2047-217X-1-3
47. Data Methods Analysis
doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3
DOI: A + DOI: X = DOI: 1
48. Data Methods Analysis
doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3
DOI: A + DOI: X = DOI: 1
DOI: B + DOI: X = DOI: 2
49. Data Methods Analysis
doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3
DOI: A + DOI: X = DOI: 1
DOI: B + DOI: X = DOI: 2
DOI: A + DOI: Y = DOI: 3
50. Data Methods Analysis
doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3
DOI: A + DOI: X = DOI: 1
DOI: B + DOI: X = DOI: 2
DOI: A + DOI: Y = DOI: 3
A, B, C… X, Y, Z… = 4, 5, 6…
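The combinatorial point of this build-up can be sketched as follows: every pairing of a citable dataset with a citable method yields a distinct analysis that could itself be identified. The identifiers are the placeholders from the slide, and the sequential numbering is purely illustrative, not a real DOI minting scheme.

```python
from itertools import product

data_dois = ["DOI: A", "DOI: B"]      # placeholder dataset identifiers from the slide
method_dois = ["DOI: X", "DOI: Y"]    # placeholder workflow identifiers from the slide

# Map each (data, method) pair to an illustrative analysis identifier.
# Methods vary slowest so the numbering follows the order on the slide.
analyses = {}
for number, (method, data) in enumerate(product(method_dois, data_dois), start=1):
    analyses[(data, method)] = f"DOI: {number}"
    print(f"{data} + {method} = DOI: {number}")
```

With n datasets and m methods there are n × m possible analyses, which is why the slide's sequence continues with 4, 5, 6 and beyond.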
52. Different shaped publishable objects
Different levels of granularity:
Experiment (e.g. ACRG project), e.g. doi:10.5524/100001: Papers
Datasets (e.g. cancer type), e.g. doi:10.5524/100001-2: Data/Micropubs
Sample (e.g. specimen xyz), e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz
Smaller still? Facts/Assertions (~10^13 in literature): Nanopubs
53. Adding “value” publishing data
• Scope for different shaped publishable objects
• Scope for publishing methods/executable papers
• Peer review of data problematic
– Post publication peer review
– Change criteria (assess on transparency/access only)
– Better use of workflows/cloud/VMs
DOIs are cheap*, data is precious: maximise its use
* ish
54. Adding “value” publishing data
DOIs are cheap*, data is precious: maximise its use
* ish Source: Ross Mounce CC-BY http://rossmounce.co.uk/2012/09/04/the-gold-oa-plot-v0-2/
55. Thanks to:
Laurie Goodman, Tam Sneddon, Nicole Nogoy, Alexandra Basford, Peter Li, Jesse Si Zhe
Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Huayen Gao (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST), Cogini
Contact us: editorial@gigasciencejournal.com, database@gigasciencejournal.com
Follow us: @gigascience, facebook.com/GigaScience, blogs.openaccesscentral.com/blogs/gigablog/
www.gigadb.org
www.gigasciencejournal.com
Editor's notes
and an advanced search option…
Raw data has been submitted to the SRA, the assembly submitted to GenBank (no number), SV data to dbVar (it’s the first plant data they’ve received). Complements the traditional public databases by having all these “extra” data types, it’s all in one place, and it’s citable.
Leading on from that, current and future plans include collaborating with Tin-Lap Lee at the Chinese University of Hong Kong to integrate an instance of the Galaxy bioinformatics platform with GigaDB, so that users can make full use of the data in GigaDB by linking it to other resources and we can incorporate fully executable papers. One such submission is a new SOAPdenovo pipeline. The SOAP tools have been wrapped in Galaxy, the workflow defined in myExperiment, and the data will be issued with a DOI and accessible via GigaDB. Utilizing the BGI cloud if necessary, users will then be able to reproduce all the steps described in the GigaScience paper to test, reanalyze, compare results, etc.
Since we would like GigaDB to be a host for data types that have no other home, such as imaging data, we are investigating adding other tools, such as an image viewer, to support accessibility and usability of the data. So, if you have a large-scale biological or biomedical dataset and/or a pipeline or software that you would like to submit to GigaScience, we would love to hear from you, so please come and talk to Scott or myself.
That just leaves me to thank the GigaScience team: Laurie, Scott, Alexandra, Peter and Jesse; BGI for their support, specifically Shaoguang for IT and bioinformatics support; our collaborators on the database, website and tools: Tin-Lap, Qiong, Senhong, Yan and the Cogini web design team; DataCite for providing the DOI service; and the ISA Commons team for their support and advocacy for best-practice use of metadata reporting and sharing. Thank you for listening.