GigaDB and Galaxy: revolutionizing data dissemination, organization and analysis


This document discusses GigaDB and Galaxy, which are revolutionizing data dissemination, organization, and analysis. GigaDB is a new database integrated with the GigaScience journal to host large datasets with Digital Object Identifiers (DOIs) for citation. Galaxy is an open source platform for data analysis that GigaScience is using to make analyses reproducible by implementing workflows. The document provides examples of datasets hosted in GigaDB and describes efforts to integrate analysis tools from the BGI SOAP package into Galaxy workflows to enable their use and reuse within the community.


GigaDB and Galaxy: revolutionizing data
dissemination, organization and analysis

Peter Li
GigaScience
peter@gigasciencejournal.com

Journal and database for
large-scale data
in conjunction with

Editor-in-Chief: Laurie Goodman
Editor: Scott Edmunds
Commissioning Editor: Nicole Nogoy
Lead Curator: Tam Sneddon
Data Platform: Peter Li
www.gigasciencejournal.com

Why another *omics journal?

Already many journals publishing research
involving large data sets

Results
reproducibility

Unrepeatability of scientific results
Out of 18 microarray papers, results
from 10 could not be reproduced

Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses.
Nature Genetics 41: 149-155.

How are we supporting data
reproducibility?

Data sets

GigaScience paper

Analyses

Community tools for
data reproduction and reuse

Linking of papers and data
by citation of DOIs

Data set DOI

Paper DOI
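
A DOI can link a paper to its data because it resolves through the public doi.org resolver. A minimal Python sketch, assuming the GigaDB data set DOI 10.5524/100015 quoted in the editor's notes; the helper names `split_doi` and `doi_to_url` are illustrative and not part of any GigaDB or GigaScience API:

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public DOI resolver


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into its registrant prefix and item suffix."""
    prefix, sep, suffix = doi.strip().partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def doi_to_url(doi: str) -> str:
    """Build the resolver URL that redirects to the DOI's landing page."""
    prefix, suffix = split_doi(doi)
    return f"{DOI_RESOLVER}{prefix}/{quote(suffix, safe='/')}"


dataset_doi = "10.5524/100015"  # YH genome data set, taken from the notes
print(doi_to_url(dataset_doi))
```

A paper's reference list can cite the data set through the same kind of URL, which is what lets the paper DOI and the data set DOI point at each other.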

GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological
and biomedical research as it enters the era of “big-data”… (see more)

Faster download speeds

Aspera data transfer

BGI Datasets Get DOI®s
39 data sets
Key: released pre-publication; paper published in GigaScience

Invertebrate:
Ant
- Florida carpenter ant
- Jerdon’s jumping ant
- Leaf-cutter ant
Roundworm
Schistosoma
Silkworm
Parasitic nematode
Pacific oyster

Vertebrates:
Darwin’s Finch
Giant panda
Macaque
- Chinese rhesus
- Crab-eating
Mini-Pig
Naked mole rat
Parrot, Puerto Rican
Penguin
- Emperor penguin
- Adelie penguin
Pigeon, domestic
Polar bear
Sheep
Tibetan antelope

Microbe:
E. coli O104:H4 TY-2482
T2D gut metagenome

Cell-Lines:
Chinese Hamster Ovary
Mouse methylomes

Plants:
Chinese cabbage
Cucumber
Foxtail millet
Pigeonpea
Potato
Sorghum

H:
Asian individual (YH)
- DNA Methylome
- Genome Assembly
- Transcriptome
Cancer (14TB)
Single cell bladder cancer
HBV infected exomes
Ancient DNA
- Saqqaq Eskimo
- Aboriginal Australian

Currently: 39 public datasets
*10 citations in references*
Humans
Ancient DNA
- Aboriginal Australian
- Saqqaq Eskimo
Asian individual (YH)

What about the analyses?

Data sets

GigaScience paper

Analyses

How will we make analyses available
for downloading and execution?

Bioinformatics data analyses as workflows

Example workflow: Investigate the evolutionary relationships between proteins
Query → Protein sequences → Multiple sequence alignment
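
The idea of an analysis as a workflow can be sketched as an ordered chain of steps in which each step consumes the previous step's output. The step functions below are toy stand-ins rather than real tools (an assumption made for illustration): the "database" is an in-memory dictionary of invented sequences, and the "alignment" only pads sequences to equal length.

```python
from typing import Callable

# Toy in-memory "database"; the sequences are invented for illustration.
PROTEIN_DB = {
    "globin_a": "MVLSPADKT",
    "globin_b": "MVHLTPEEKS",
    "kinase_x": "MGSNKSKPK",
}


def find_sequences(query: str) -> dict[str, str]:
    """Step 1: return every database entry whose name contains the query."""
    return {name: seq for name, seq in PROTEIN_DB.items() if query in name}


def align(sequences: dict[str, str]) -> dict[str, str]:
    """Step 2: stand-in for multiple sequence alignment (right-pads with gaps)."""
    width = max((len(seq) for seq in sequences.values()), default=0)
    return {name: seq.ljust(width, "-") for name, seq in sequences.items()}


def run_workflow(data, steps: list[Callable]):
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        data = step(data)
    return data


result = run_workflow("globin", [find_sequences, align])
for name, row in result.items():
    print(name, row)
```

Because the workflow is just an ordered list of steps, the same list can be rerun by someone else on the same input, which is the property that makes a published analysis reproducible.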

Implement GigaScience workflows
in a community-accepted format

Open source

Over 20,000 main
Galaxy server users

Over 500 papers
citing Galaxy use

Over 55 Galaxy
servers deployed

http://galaxyproject.org

[Screenshot of the Galaxy interface: tool list, tool parameterisation, results panel]

Pilot project - Integrate BGI SOAP
package into Galaxy

Enable SOAP tools to be used from within Galaxy workflows

Integrate BGI SOAP package into
Galaxy
Data analysis pipelines

One Python wrapper per tool:
SOAP1, SOAP2, SOAPdenovo1, SOAPdenovo2, SOAPsnp, SOAPsplice
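
A wrapper of this kind usually does little more than assemble a command line, run the tool, and hand the output file back to the workflow system. A minimal sketch, assuming nothing about the real SOAP command-line options or about the wrappers in the GigaScience repository: `run_tool` is a hypothetical helper, and the Python interpreter stands in for a SOAP executable so that the example runs anywhere.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tool(executable: str, args: list[str], output_path: Path) -> int:
    """Run a command-line tool and save its standard output to a file.

    Returns the tool's exit code so the caller can report failures.
    """
    with output_path.open("w") as out:
        completed = subprocess.run(
            [executable, *args],
            stdout=out,
            stderr=subprocess.PIPE,
            text=True,
        )
    if completed.returncode != 0:
        # Pass the tool's own error message on to the caller's stderr.
        sys.stderr.write(completed.stderr)
    return completed.returncode


# Stand-in invocation: the "tool" is Python itself printing one line.
with tempfile.TemporaryDirectory() as workdir:
    result_file = Path(workdir) / "result.txt"
    exit_code = run_tool(sys.executable, ["-c", "print('assembly done')"], result_file)
    output_text = result_file.read_text()

print(exit_code, output_text.strip())
```

Keeping the wrapper this thin means the tool itself stays unchanged; only the wrapper needs to know how the workflow system passes in parameters and where it expects output files.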

GitHub open code repository

https://github.com/gigascience

Why publish in GigaScience?

Benefit                              Added value
• Data hosted in GigaDB              • No need to use own servers
• Allocation of DOIs to data         • Citable data
• Metadata in ISA-Tab format         • Aids reuse of data
• Galaxy tool integration            • Supports reuse of tools
• Use of tools in Galaxy workflows   • Improves documentation
                                     • Shows how tool can be used with other bioinf. software

Thanks to:

• Tin-Lap Lee and Huayan Gao - CUHK
• Tam, Jesse, Scott, Nicole & Laurie - GigaScience

peter@gigasciencejournal.com


8. http://gigadb.org

21. SOAPdenovo2 Galaxy workflow

22. http://www.myexperiment.org


Editor's Notes

Mini-pig genome published this month
DOIs
Provide example of a GigaScience paper
Mention DOI for the paper itself
Highlight data set generated and its DOI
And now that you all want to submit to GigaDB, how do you do that, how will people search for and find your data and, other than citing your DOI, what will they be able to do with the data? We have redesigned the underlying Giga database and we’re working on the front end, which we hope to make public early next month, so the following slides are a mix of screenshots from the development site overlaid with tweaks made in PowerPoint to illustrate features you can hope to see when we go live.
These include:
- a home page image slider for browsing datasets
- a text box search, which I will demonstrate shortly
***NEEDS REWORKING!!!!***
This is an example landing page for DOI 10.5524/100015 for the YH genome dataset. These pages are still in development but you can see the date released, title and abstract and how the dataset should be cited.
Additional information includes links to manuscripts and data accessions at EBI, NCBI or DDBJ.
There is then information on the samples and files.
A GigaDB dataset citation is also included in the YH Transcriptome paper published in Nature Biotechnology in February this year.
As you can see, the dataset was published in 2011 but this did not prevent subsequent publication of the analysis paper.
Over 20,000 users on the main server
Over 500 papers citing the use of Galaxy
Over 55 servers deployed on the Web
Allows scientists who may not have programming skills to compose data analysis pipelines.
DOIs can now be tracked in the new Thomson Reuters Data Citation Index, which gives a form of credit and makes the data more discoverable (Scott)
