Democratising Data Publishing: A Global Perspective discusses the need for open and FAIR data globally so that problems can be tackled more efficiently through collaboration. Challenges to open data include cultural and technical hurdles to data sharing, as well as concerns about funding open access models internationally. The document gives examples of initiatives by GigaScience and the African Orphan Crops Consortium that make large genomic datasets more accessible and usable for researchers and plant breeders through tools such as Galaxy. While bandwidth and data access agreements can pose difficulties, opening data benefits research and helps find solutions to issues like food security.
2. Need for FAIR (high quality) Open Data
Enables
• Using networking power of the internet to tackle problems
• Can ask new questions & find hidden patterns & connections
• Build on each other's efforts quicker & more efficiently
• More collaborations across more disciplines
• Harness wisdom of the crowds: crowdsourcing, citizen science,
crowdfunding
Global Challenges
• Quick response to climate change, food security & disease outbreaks
• Cultural & technical hurdles need to be overcome
5. Democratising Data at GigaScience
• GigaScience integrates and publishes all research objects to
maximise reproducibility, transparency and reuse
• GigaDB enables rapid publication of data associated with a
GigaScience manuscript
• GigaDB DOIs incentivise early release of data/code/etc.
• Data
• Software
• Models
• Pipelines
• Reviews
6. Democratising Data at GigaScience
Example: Disease outbreaks
• E. coli O104:H4 isolate TY-2482 in
Germany, >50 died, June 2011
• Crisis, mass panic, data needed
• BGI, working with Hamburg University,
let us share the data under CC0 with our
first data DOI from GigaDB
• Released via Twitter
• Did not know consequences of early release of data
• These data were considered of such great importance that we did not wish
to wait for publication
http://dx.doi.org/10.5524/100001
11. Democratising Data at GigaScience
• From Big Data to usable Data
• Example: WebTools for easy browsing and visualisation
• Pan-and-zoom map browser as a visual aid to allow the end user to
find datasets
13. Democratising Data at GigaScience
• From Big Data to usable Data
• Example: WebTools for easy browsing and visualisation
• 3D viewer allows users to interact and explore image data prior to data
download
• 3D models are CC0, can be downloaded, and are printable
14. Democratising Data at GigaScience
• Widening the target audience
• Bioinformaticians and ‘Big Data’ scientists are a
primary target audience
• Plugins and visualisations make access easier for the
less technically inclined
• Democratises access through education potential
and ease of use
15. Democratising Data at GigaScience
Difficulties we have encountered…
• Internet, i.e. bandwidth, unstable connections,
occasionally US institutions blocking Chinese IP
addresses, China blocking Google/Dropbox links
• Copying 10GB of data from South Africa took >1 month
because of power cuts
• Email communication difficulties due to spam filters
• Data access agreements (clinical data)
16. Democratising Data at GigaScience
• Example: Food security
• Rice, Oryza sativa L., is the
staple food for half the
world’s population
• By 2030, rice production
must increase by at least
25% to keep pace with
population growth
17. Democratising Data at GigaScience
Rice 3K project
• 3,000 rice genomes
• 13.4TB public data
• 6 months to copy
data to Sequence
Read Archive (SRA)
• Data published 4
years before
analysis published
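Worked example (not from the original slides): the transfer times quoted in this deck imply very low sustained throughput. The sketch below estimates it in Python, assuming decimal units (1 TB = 10^12 bytes), 30-day months, and a transfer that ran continuously for the whole period. The function and variable names are made up for illustration.

```python
# Rough effective-throughput estimate for the transfers quoted in the slides.
# Assumptions (not stated in the slides): decimal units, 30-day months,
# and continuous transfer over the whole period.

TB = 1e12            # bytes in a terabyte (decimal)
GB = 1e9             # bytes in a gigabyte (decimal)
DAYS_PER_MONTH = 30  # simplifying assumption


def effective_mbit_per_s(size_bytes: float, days: float) -> float:
    """Average throughput in Mbit/s for moving size_bytes in the given days."""
    seconds = days * 24 * 60 * 60
    return size_bytes * 8 / seconds / 1e6


# Rice 3K: 13.4TB copied to the SRA in about 6 months.
rice3k_mbit = effective_mbit_per_s(13.4 * TB, 6 * DAYS_PER_MONTH)

# South Africa: 10GB took more than 1 month, so this is an upper bound.
south_africa_mbit = effective_mbit_per_s(10 * GB, 1 * DAYS_PER_MONTH)

print(f"Rice 3K to SRA:    {rice3k_mbit:.2f} Mbit/s")
print(f"South Africa copy: {south_africa_mbit:.3f} Mbit/s (upper bound)")
```

Under these assumptions the Rice 3K copy averages under 10 Mbit/s and the South Africa transfer well under 0.1 Mbit/s, which is consistent with interruptions such as power cuts, rather than raw link speed, dominating the elapsed time.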
18. From Big Data to usable(ish) Data
• Although the 13TB of data in GigaDB was open (CC0), after analysis on the
Tianhe supercomputer the processed rice3K data came to 100TB
• AWS hosted the data for free, but it is expensive to process
https://aws.amazon.com/public-data-sets/3000-rice-genome/
19. Processed data finally published 1st May 2018, Nature v557, p43–49
https://www.nature.com/articles/s41586-018-0063-9
20. Democratising Data at GigaScience
• Example: Food security
• The African Orphan Crops
Consortium (AOCC) is
developing genomic
resources for 101 crops that
represent a significant part
of African/Asian diets.
• To date, the AOCC is working
on 69 genomes, 5 of which
are published in GigaDB.
Hyacinth bean
21. Growing Africa Out of Stunting, Hunger & Malnutrition:
The African Orphan Crops Consortium
• Stunting: Physical, Neurological, Economic
22. African Orphan Crops Consortium (AOCC)
• Provide genomic tools to accelerate breeding in 101 crops
important to African diets
• Define genetic diversity in 100 lines/species
• Train 150 top African plant breeders to use the latest strategies
and technologies in plant breeding
Courtesy: AOCC
24. Democratising Data at GigaScience
• Each AOCC genome is a single GigaDB dataset (with DOI)
25. Democratising Data at GigaScience
• From Big Data to usable Data
• Example: Easy-to-use plug and play RiceGalaxy
• Processed data and software tools made freely available
• GUI means plant breeders can utilise genetic data without coding skills
• Funded to run at low cost (<100 USD/month) via AWS Singapore & local
servers (2 vCPUs, 8GB RAM, 2 mounted volumes, 200GB total storage)
• CGIAR Excellence in Plant Breeding Platform/model will roll out to other
crops
29. Acknowledgements
Laurie Goodman, Editor in Chief
Scott Edmunds, Executive Editor
Chris Hunter, GigaDB Lead BioCurator
Mary Ann Tuli, GigaDB Data Editor
Xiao (Jesse) Si Zhe, Database Developer
Nicole Nogoy, Editor
Hans Zauner, Assistant Editor
Hongling Zhao, Assistant Editor
Peter Li, Lead Data Manager
Chen Qi, Shenzhen Office.
@GigaScience
facebook.com/GigaScience
http://gigasciencejournal.com/blog/
www.gigasciencejournal.com
www.gigadb.org
+ Weibo & WeChat
Editor's Notes
Up to 5,000-6,000 USD
Quadrupled data in the public domain. Data publication 4 years before analysis published in Nature