Tin-Lap Lee: Next-Gen Sequencing Analysis by GigaGalaxy
1. Next-Gen Sequencing Analysis
by GigaGalaxy
Tin-Lap, LEE
School of Biomedical Sciences
CUHK-BGI Innovation Institute of Trans-omics,
The Chinese University of Hong Kong
2. CUHK-BGI Innovation Institute of Trans-Omics (CBIIT)
• Jointly established between The Chinese
University of Hong Kong (CUHK) and BGI
in July 2011.
• “We aim to provide a platform conducive
to training of multi-disciplinary talents
conversant with the knowledge and
application of genomics, proteomics,
genetics, computational biology and
bioinformatics, by capitalizing on both
institutions’ expertise and strengths in
genomic science.”
4. www.gigasciencejournal.com
Journal, data-platform and
database for large-scale data
Editor-in-Chief: Laurie Goodman
Executive Editor: Scott Edmunds
Commissioning Editor: Nicole Nogoy
Lead Curator: Chris Hunter
Data Platform: Peter Li
in conjunction with
6. Giga-Galaxy
Collaboration between GigaScience and CBIIT
A publicly accessible Galaxy server
Share some of the workload of the main Galaxy server
Host data and workflows published in GigaScience, particularly involving
NGS data analysis
SOAP package: advantages from GigaGalaxy
Application Instance: SOAPdenovo2 tool
9. GigaSolution: deconstructing the paper
www.gigadb.org
www.gigasciencejournal.com
galaxy.cbiit.cuhk.edu.hk
Combines and integrates:
Open-access journal
Data Publishing Platform
Data Analysis Platform
10. Data (doi:10.5524/100038) + Methods (doi:10.5524/100044) = Analysis (doi:10.1186/2047-217X-1-18)
Wang J et al., (2012): Updated genome assembly of YH: the first diploid genome sequence of a
Han Chinese individual (version 2, 07/2012). GigaScience Database.
http://dx.doi.org/10.5524/100038
Luo R et al., (2012): Software and supporting material for “SOAPdenovo2: An empirically improved
memory-efficient short read de novo assembly”. GigaScience Database.
http://dx.doi.org/10.5524/100044
Data
Methods
Luo R et al., (2012): SOAPdenovo2: an empirically improved memory-efficient short-read de novo
assembler. GigaScience, 1:18 (28th December 2012). http://dx.doi.org/10.1186/2047-217X-1-18
Analysis
Example
17. Assembly Supporting Tools
• SOAPfilter: removes reads with artifacts
• Kmerfreq HA: a k-mer frequency counter
• Corrector HA: corrects sequencing errors in short reads
• Gapcloser: closes gaps in scaffolds
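To make the idea of a k-mer frequency counter concrete, here is a minimal Python sketch. It is not the Kmerfreq HA implementation; the function names, the choice to skip k-mers containing N, and the toy reads are all assumptions made for illustration. It counts every length-k substring in a set of reads and then summarises how many distinct k-mers occur at each frequency, which is the kind of summary an error corrector could use to spot rare, possibly erroneous k-mers.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across a collection of reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" in kmer:
                # Assumption for this sketch: ignore k-mers with ambiguous bases.
                continue
            counts[kmer] += 1
    return counts


def frequency_spectrum(counts):
    """Map each occurrence count to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


# Toy reads, invented for illustration only.
reads = ["ACGTACGT", "CGTACGTA", "ACGTNCGT"]
counts = count_kmers(reads, k=4)
spectrum = frequency_spectrum(counts)

for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

Real counters for whole-genome data need far more memory-efficient structures than a Python dictionary; the sketch only shows the counting logic.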
25. Help Center: Shared Data
• Several Datasets are available from the shared data menu
for test-running the tools.
• Data Libraries
• Published Workflows
• Published Pages
29. How is GigaScience supporting data
reproducibility?
Data sets
Analyses
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18
~10000 accesses
Open-Code
8 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-Pipelines
Open-Workflows
DOI:10.5524/100038
Open-Data
78GB CC0 data
Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/
~5000 downloads
Enabled code to be picked apart by bloggers in a wiki
http://homolog.us/wiki/index.php?title=SOAPdenovo2
30. SOAPdenovo2 workflows implemented in
galaxy.cbiit.cuhk.edu.hk
Implemented entire workflow in GigaGalaxy server, inc.:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post-processing step
• Evaluation and visualization tools
Will be available for >25K Galaxy users in Galaxy Toolshed
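The step counts above can be sketched as an ordered list of stages. The slide does not name the individual steps, so the mapping below is an assumption: the three pre-processing steps are taken to be the supporting tools SOAPfilter, Kmerfreq HA and Corrector HA, the four modules are taken to be the SOAPdenovo2 stages pregraph, contig, map and scaff, and the post-processing step is taken to be Gapcloser. The `Step` class and helper functions are hypothetical and do not call Galaxy or SOAPdenovo2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    phase: str  # "pre-processing", "assembly" or "post-processing"
    name: str   # tool or module name (assumed mapping, see lead-in)


# Assumed ordering of the workflow; not taken from the published Galaxy workflow.
PIPELINE = [
    Step("pre-processing", "SOAPfilter"),
    Step("pre-processing", "Kmerfreq HA"),
    Step("pre-processing", "Corrector HA"),
    Step("assembly", "pregraph"),
    Step("assembly", "contig"),
    Step("assembly", "map"),
    Step("assembly", "scaff"),
    Step("post-processing", "Gapcloser"),
]


def steps_per_phase(pipeline):
    """Return how many steps belong to each phase, in first-seen order."""
    counts = {}
    for step in pipeline:
        counts[step.phase] = counts.get(step.phase, 0) + 1
    return counts


def run_order(pipeline):
    """Return the step names in the order they would be executed."""
    return [step.name for step in pipeline]


print(steps_per_phase(PIPELINE))
print(" -> ".join(run_order(PIPELINE)))
```

Evaluation and visualization tools are left out of the sketch because the slide does not say how many there are or where they attach.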
Galaxy is a web-based data analysis platform developed by PSU
Accessible, reproducible, and transparent
Easy to use, no command line, much shorter learning curve for biologists
The first section of this talk is about implementing a public instance using the Galaxy Tool Shed. We are currently implementing the first public SOAP instance on the platform.
The SOAP package provides a set of tools for processing NGS data. There are different versions of SOAP for mapping short reads to reference sequences. There are also tools such as SOAPdenovo, which constructs a new genome sequence, and SOAPsnp, which assembles a consensus sequence and identifies the SNPs on it relative to a reference. Documentation in the BGI SOAP package is limited in scope, making the tools difficult to use. We will be working with the BGI developers to provide test data and Galaxy pipelines demonstrating the use of SOAP.