Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds' talk on GigaScience: Big Data, Data Citation and Future Data Handling, given at the International Conference of Genomics on 15 November 2011.


  1. Scott Edmunds: Big Data, Data Citation and Future Data Handling. William Gibson: "Information is the currency of the future world". www.gigasciencejournal.com (image: cc Flickr allan*)
  2. Data Tsunami? (image: Flickr cc: opensourceway)
  3. Rice v Wheat: consequences of publicly available genome data. [Chart: rice vs wheat; axis scale 0 to 700]
  4. Sharing aids everyone… "Sharing Detailed Research Data Is Associated with Increased Citation Rate." Piwowar HA, Day RS, Fridsma DB (2007) PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308. Every 10 datasets collected contributes to at least 4 papers in the following 3 years. Piwowar HA, Vision TJ, Whitlock MC (2011). Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a
  5. Problems? (image: Flickr cc: opensourceway)
  6. [Chart: sequencing cost ($ per Mbp); labels: Moore's Law, Sequencing, ~100,000X] Source: E Lander/Broad
  7. Sequencing Output; Data; Moore's/Kryder's Law
  8. Sequencing Output; Data; Dissemination?
  9. Potential sequencing capacity: 1 Illumina HiSeq 2000 (+TruSeq upgrade) = 600 Gb/run (12 days); × 128 HiSeq = 6 Tb/day = >2 Pb/year = ~2000 human genomes/day
  10. Difficulties keeping up… (image: Flickr cc: opensourceway)
  11. Do we have models for long term funding? Human Gene Mutation Database; Kyoto Encyclopedia of Genes and Genomes; ? (image: Flickr cc: opensourceway)
  12. Are there now too many hurdles?
  13. Are there now too many hurdles? Technical: too large volumes; too heterogeneous; no home for many data types; too time consuming. Economic: too expensive; no long-term funding. Cultural: inertia; no incentives to share; unaware of how.
  14. Potential solutions?
  15. Incentives/credit. Credit where credit is overdue: “One option would be to provide researchers who release data to public repositories with a means of accreditation.” “An ability to search the literature for all online papers that used a particular data set would enable appropriate attribution for those who share.” Nature Biotechnology 27, 579 (2009). Prepublication data sharing (Toronto International Data Release Workshop): “Data producers benefit from creating a citable reference, as it can later be used to reflect impact of the data sets.” Nature 461, 168-170 (2009).
  16. Data citation: DataCite and DOIs. Digital Object Identifiers (DOIs) offer a solution. Most widely used identifier for scientific articles. Researchers, authors, publishers know how to use them. Put datasets on the same playing field as articles. Dataset example: Yancheva et al (2007). Analyses on sediment of Lake Maar. PANGAEA. doi:10.1594/PANGAEA.587840
  17. Data citation: DataCite and DOIs. >1 million DOIs since Dec 2009. Central metadata repository to link with WoS/ISI: finally can track and credit use!
  18. Now taking submissions… Large-Scale Data Journal/Database. In conjunction with: [logos]. Editor-in-Chief: Laurie Goodman, PhD. Editor: Scott Edmunds, PhD. Assistant Editor: Alexandra Basford, PhD. www.gigasciencejournal.com
  19. Now taking submissions…
  20. Editorial Board: International. Stephan Beck (UK); Alvis Brazma (UK); Ann-Shyn Chiang (Taiwan); Richard Durbin (UK); Paul Flicek (UK); Robert Hanner (Canada); Yoshihide Hayashizaki (Japan); Henning Hermjakob (UK); Wolfgang Huber (Germany); Gary King (USA); Tin-Lap Lee (Hong Kong); Donald Moerman (Canada); Karen Nelson (USA); Francis Ouellette (Canada); Stephen O'Brien (USA); Hanchuan Peng (USA); Russell Poldrack (USA); Ming Qi (China/USA); Susanna-Assunta Sansone (UK); Michael Schatz (USA); David Schwartz (USA); Fritz Sommer (USA); Lincoln Stein (Canada); Sumio Sugano (Japan); Thomas Wachtler (Germany); Jun Wang (China); Alistair Young (New Zealand); Zang Yufeng (China); Marie Zins (France). www.gigasciencejournal.com
  21. Editorial Board: International. Stephan Beck (Epigenomics); Alvis Brazma (Transcriptomics); Ann-Shyn Chiang (Neuroscience); Richard Durbin (Genetics/Genomics); Paul Flicek (Genomics); Robert Hanner (DNA Barcoding/Ecology); Yoshihide Hayashizaki (Genomics); Henning Hermjakob (Proteomics); Wolfgang Huber (Functional Genomics); Gary King (Medicine); Tin-Lap Lee (Genomics); Donald Moerman (Functional Genomics); Karen Nelson (Metagenomics); Francis Ouellette (Genomics); Stephen O'Brien (Genomics); Hanchuan Peng (Imaging/Neuro); Russell Poldrack (Neuroscience); Ming Qi (Genetics); Susanna-Assunta Sansone (Standards); Michael Schatz (Cloud Computing); David Schwartz (Optical Mapping); Fritz Sommer (Neuroscience); Lincoln Stein (Cloud Computing); Sumio Sugano (Genomics); Thomas Wachtler (Neuroscience); Jun Wang (Genomics); Alistair Young (Medical Imaging); Zang Yufeng (Neuroscience); Marie Zins (Medicine). www.gigasciencejournal.com
  22. Criteria and Focus of Journal/Database: Reproducibility/Reuse; Utility/Usability; Standards/Searchability/Scale/Sharing; Data publishing/DOI. www.gigasciencejournal.com
  23. Use of Data = Importance + Usability (importance: subjective?; usability: easier to assess). www.gigasciencejournal.com
  24. Reproducibility/Reuse. BGI Cloud Computing resources for handling and analyzing large-scale data. Integrated tools to promote more widespread access, viewing, and analysis of data. Encourage and aid use of workflow systems for methods (e.g. submission of Galaxy XML files). www.gigasciencejournal.com
  25. Special Series/Hub for cloud-based tools. Technical notes: test tools in the BGI-Cloud. Tools + Test Data (BGI or user) in one place. Aids reproducibility. Aids reviewers (free). Aids authors: visibility (PubMed, etc.), hosting (included/free offers). Contact us: editorial@gigasciencejournal.com (image: Oledoe flickr cc) www.gigasciencejournal.com
  26. Standards/Searchability/Sharing. ISA-Tab compatibility to aid and promote best practice in metadata reporting. All supporting data must be publicly available. Ask for MIBBI compliance and use of reporting checklists. Part of the Biosharing network. www.gigasciencejournal.com
  27. Data publishing/DOI. New journal format combines standard manuscript publication with an extensive database to host all associated data. Data hosting will follow standard funding agency and community guidelines. DOI assignment available for submitted data to allow ease of finding and citing datasets, as well as for citation tracking. www.gigasciencejournal.com
  28. of data use/release?
  29. The era of the data consumer?
  30. The era of the data consumer??
  31. The era of the data consumer? Free access to data, but analysis hubs/nodes will form around it?
  32. GDSAP: Genomic Data Submission and Analytical Platform. [Diagram labels: Big data from the “Sequencing Farm”; Data, Data, Data…; Data Modeling; Pipeline design; Validation; Commercial applications; “Apps”] Tin-Lap Lee, CUHK
  33. New Database: www.gigaDB.org
  34. New Database: www.gigaDB.org
  35. BGI Datasets Get DOI®s. Invertebrate: Ant (Florida carpenter ant, Jerdon’s jumping ant, Leaf-cutter ant); Roundworm; Silkworm. Human: Asian individual (YH) (DNA Methylome, Genome Assembly, Transcriptome); Ancient DNA (coming soon) (Saqqaq Eskimo, Aboriginal Australian). Vertebrates: Giant panda; Macaque (Chinese rhesus, Crab-eating); Naked mole rat; Penguin (Emperor penguin, Adelie penguin); Pigeon, domestic; Polar bear; Sheep; Tibetan antelope. Microbe: E. coli O104:H4 TY-2482. Cell-Line: Chinese Hamster Ovary. Plants: Chinese cabbage; Cucumber; Foxtail millet; Pigeonpea; Potato; Sorghum. DOI shown on slide: doi:10.5524/100004
  36. BGI Datasets Get DOI®s. Many unpublished… (same dataset list as slide 35)
  37. Data also submitted to NCBI (including SV data to dbVar). Complemented by citable form, and data types including: assemblies of 3 strains; raw data; SNPs; InDels; CNVs; SV.
  38. Our first DOI: To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as: Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. http://dx.doi.org/10.5524/100001 To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
  39. “The way that the genetic data of the 2011 E. coli strain were disseminated globally suggests a more effective approach for tackling public health problems. Both groups put their sequencing data on the Internet, so scientists the world over could immediately begin their own analysis of the bug's makeup. BGI scientists also are using Twitter to communicate their latest findings.” “German scientists and their colleagues at the Beijing Genomics Institute in China have been working on uncovering secrets of the outbreak. BGI scientists revised their draft genetic sequence of the E. coli strain and have been sharing their data with dozens of scientists around the world as a way to "crowdsource" this data. By publishing their data publicly and freely, these other scientists can have a look at the genetic structure, and try to sort it out for themselves.”
  40. We want your data! scott@gigasciencejournal.com, editorial@gigasciencejournal.com, @gigascience, facebook.com/GigaScience, blogs.openaccesscentral.com/blogs/gigablog/, www.gigasciencejournal.com
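
The capacity estimate on slide 9 can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on that slide (600 Gb per 12-day run, 128 machines); the 3 Gb human genome size and the 1x coverage implied by dividing by it are assumptions added here to arrive at a genomes-per-day figure, and are not stated in the talk.

```python
# Figures quoted on slide 9 of the talk.
GB_PER_RUN = 600        # Gb per HiSeq 2000 run (with TruSeq upgrade)
DAYS_PER_RUN = 12       # days per run
NUM_MACHINES = 128      # number of HiSeq instruments

# Assumption (not from the slides): ~3 Gb per human genome, counted at 1x coverage.
HUMAN_GENOME_GB = 3.0


def daily_output_gb(gb_per_run: float, days_per_run: float, machines: int) -> float:
    """Aggregate output in Gb/day for a fleet of identical sequencers."""
    return gb_per_run / days_per_run * machines


per_day_gb = daily_output_gb(GB_PER_RUN, DAYS_PER_RUN, NUM_MACHINES)
per_day_tb = per_day_gb / 1_000                 # Gb -> Tb
per_year_pb = per_day_gb * 365 / 1_000_000      # Gb -> Pb
genomes_per_day = per_day_gb / HUMAN_GENOME_GB  # 1x-equivalent genomes

print(f"{per_day_tb:.1f} Tb/day, {per_year_pb:.2f} Pb/year, "
      f"~{genomes_per_day:.0f} genome-equivalents/day")
```

Under these assumptions the result lands close to the rounded values on the slide (about 6 Tb/day, more than 2 Pb/year, roughly 2000 genomes/day). At realistic sequencing depth the number of finished genomes per day would be far lower.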

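
Slides 16 and 38 argue that a DOI makes a dataset citable in the same way as an article. As a minimal sketch of that idea, the hypothetical `DatasetRecord` class below assembles a citation string in the pattern of the PANGAEA example on slide 16 and builds a resolver link. The class, its field names, and the default `https://doi.org/` resolver base are illustrative assumptions; slide 38 itself uses the older `http://dx.doi.org/` form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical container for the metadata needed to cite a dataset."""
    creators: str
    year: int
    title: str
    publisher: str
    doi: str

    def citation(self) -> str:
        # Mirrors the pattern of the slide 16 example:
        # Creators (Year). Title. Publisher. doi:...
        return (f"{self.creators} ({self.year}). {self.title}. "
                f"{self.publisher}. doi:{self.doi}")

    def resolver_url(self, base: str = "https://doi.org/") -> str:
        # The resolver base is an assumption; pass "http://dx.doi.org/"
        # to reproduce the link style used on slide 38.
        return base + self.doi


# Metadata taken from the dataset example on slide 16.
example = DatasetRecord(
    creators="Yancheva et al",
    year=2007,
    title="Analyses on sediment of Lake Maar",
    publisher="PANGAEA",
    doi="10.1594/PANGAEA.587840",
)

print(example.citation())
print(example.resolver_url())
```

Keeping the DOI as a bare identifier and adding the resolver prefix only when a link is needed means the same record can be cited as text or as a URL.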