Kipper: Sequence database versioning for Galaxy bioinformatics servers
1. KIPPER: SEQUENCE DATABASE VERSIONING FOR GALAXY BIOINFORMATICS SERVERS
Damion Dooley
Hsiao Lab, BC Public Health Microbiology & Reference Laboratory
And UBC Department of Pathology, Vancouver, Canada
https://github.com/Public-Health-Bioinformatics/kipper
https://github.com/Public-Health-Bioinformatics/versioned_data
2. How to recreate sequencing analysis?
Retrieve or redo sequencing data
Get right software versions
Get databases as they appeared on a certain date
3. Nice database vs. juggernaut
Periodically published
Varying ability to download past versions
RDP RNA v10.1 – 11.4 (5.5 GB)
Silva RNA v89 – 119 (2.6 GB)
Uniref (~50 versions, ~35 GB latest)
Pseudo-versioned
Version stated but no way to get past ones?
No client software for insert/delete diff
NCBI nt (58 GB)
NCBI nr (78 GB)
Ancient juggernaut supporting immortal database and crushing unwary sys admins in its path
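For databases that ship with no insert/delete diff, one could compute the diff locally from two downloaded snapshots. The sketch below is only an illustration of that idea, not Kipper's own code: the function names `parse_fasta` and `diff_snapshots` and the tiny sample records are hypothetical, and a changed record is treated as one delete plus one insert.

```python
def parse_fasta(text):
    """Return {sequence_id: (description, sequence)} parsed from FASTA text."""
    records = {}
    seq_id, desc, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if seq_id is not None:
                records[seq_id] = (desc, "".join(chunks))
            header = line[1:].split(None, 1) or [""]
            seq_id = header[0]
            desc = header[1] if len(header) > 1 else ""
            chunks = []
        else:
            chunks.append(line)
    if seq_id is not None:
        records[seq_id] = (desc, "".join(chunks))
    return records


def diff_snapshots(old, new):
    """Return (inserted_ids, deleted_ids) between two parsed snapshots.

    A record whose description or sequence changed appears in both lists.
    """
    inserted = sorted(k for k in new if k not in old or old[k] != new[k])
    deleted = sorted(k for k in old if k not in new or old[k] != new[k])
    return inserted, deleted


# Hypothetical sample data: B2 changes, C3 is new.
old = parse_fasta(">A1 first\nACGT\n>B2 second\nGGCC\n")
new = parse_fasta(">A1 first\nACGT\n>B2 second\nGGCA\n>C3 third\nTTAA\n")
inserted, deleted = diff_snapshots(old, new)
print("inserted:", inserted)
print("deleted:", deleted)
```

Holding both snapshots in memory is fine for small databases; for files the size of NCBI nt or nr a streaming comparison of sorted files would be the more realistic choice.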
7. Version listing
• Kipper is a python script
$ Kipper rdp_rna
• Add new version:
$ Kipper rdp_rna -i download.fasta -o.
• Retrieve a version by id:
$ Kipper rdp_rna -e -n11
10. Acknowledgements
This work was supported by Genome Canada / Genome BC Grant “A
Federated Bioinformatics Platform for Public Health Microbial
Genomics” to Fiona Brinkman, Gary Van Domselaar and William Hsiao.
More information about the IRIDA project (Integrated Rapid Infectious
Disease Analysis) can be found at http://www.irida.ca
Editor's Notes
EXPERIMENTAL REPRODUCIBILITY
HOW do you recreate a sequencing analysis, say, 2 years from now?
This juggling requires some infrastructure
MAYBE WE HAVE …
WHAT ABOUT THOSE …
NO SOFTWARE FOR GENERATING OLDER RESULTS
INFRASTRUCTURE PROBLEM
Anyone doing NCBI diff processing?
What if diff version regeneration speed >= download speed?
BIOMAJ + KIPPER
KIPPER IS PROTOTYPE
KIPPER HANDLES LARGE FILES THAT GIT HAS TO EXTERNALIZE
----- Meeting Notes (15-07-10 01:08) -----
The volume file holds all the diff info for a range of versions.
For each fasta sequence it records:
WHEN the fasta sequence was CREATED,
WHAT version it was deleted in,
ACCESSION ID, title & description
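As a rough illustration of how such created/deleted bookkeeping lets any version in the range be regenerated, here is a minimal sketch. The `VolumeRecord` layout, the `extract_version` function and the sample records are assumptions made for the example; they are not Kipper's actual volume file format. The sketch also assumes that a record is absent from the version it is marked as deleted in.

```python
from typing import NamedTuple, Optional


class VolumeRecord(NamedTuple):
    """Hypothetical in-memory form of one volume entry."""
    accession: str
    created: int             # version id in which the record first appears
    deleted: Optional[int]   # version id in which it disappears, or None
    description: str
    sequence: str


def extract_version(records, version):
    """Rebuild FASTA text for one version from created/deleted markers."""
    lines = []
    for r in records:
        alive = r.created <= version and (r.deleted is None or version < r.deleted)
        if alive:
            lines.append(">%s %s" % (r.accession, r.description))
            lines.append(r.sequence)
    return "\n".join(lines) + ("\n" if lines else "")


# Hypothetical volume: B2 is replaced in version 3, C3 arrives in version 2.
volume = [
    VolumeRecord("A1", 1, None, "first", "ACGT"),
    VolumeRecord("B2", 1, 3, "second", "GGCC"),
    VolumeRecord("B2", 3, None, "second", "GGCA"),
    VolumeRecord("C3", 2, None, "third", "TTAA"),
]

print(extract_version(volume, 1))
print(extract_version(volume, 3))
```

Because each record carries its own lifespan, one pass over the volume is enough to rebuild any version, with no chain of patches to replay.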
Retrieve any number of databases
Access by global version date
Lists dates (and/or ids) of available versions
Select one or more workflows on given database
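Access by a global version date can be read as: for each database, pick the latest version published on or before that date. The sketch below shows one way to do that lookup. The catalog structure, the database names `db_a` and `db_b`, the dates and the function names are all made up for illustration and do not come from the Galaxy tool itself.

```python
import bisect
from datetime import date


def version_on(versions, when):
    """Return the id of the latest version dated on or before `when`.

    `versions` is a list of (date, version_id) pairs sorted by date.
    Returns None if no version existed yet on that date.
    """
    dates = [d for d, _ in versions]
    i = bisect.bisect_right(dates, when)
    return versions[i - 1][1] if i else None


def resolve(catalog, when):
    """Map every database name to the version id valid on `when`."""
    return {name: version_on(versions, when) for name, versions in catalog.items()}


# Hypothetical catalog of version dates per database.
catalog = {
    "db_a": [(date(2014, 1, 15), 1), (date(2014, 6, 1), 2), (date(2015, 2, 10), 3)],
    "db_b": [(date(2014, 9, 1), 1)],
}

print(resolve(catalog, date(2014, 7, 1)))
```

A database with no version yet on the requested date resolves to `None`, which a workflow could treat as "not available" rather than silently using a newer release.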
SHOULDN’T PROVIDERS be convinced to supply their databases in a versioned format?