The utility of draft bacterial genomes for
gene function analysis and genomic island prediction
Julie A. Shay, Claire Bertelli, Bhavjinder K. Dhillon, and Fiona S.L. Brinkman
Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada
Canada’s Federal Genomics Research and Development Initiative
Acknowledgments
This work was made possible by funding from Genome Canada, Genome British Columbia, and the GRDI. Funding for project personnel was also provided by Cystic
Fibrosis Canada, the Swiss National Science Foundation, the CIHR/MSFHR Bioinformatics Training Program, and the Michael Smith Foundation for Health Research.
References
Comparing draft vs. complete genomes:
two examples
The problem: growing gap between
draft and complete genomes
Genomic Island (GI) analysis
•Draft and complete genomes were run through IslandViewer [5], a web-based GI prediction tool that incorporates two methods:
[Diagram: contigs (GenBank format) and a user-selected reference genome → contig alignment to reference genome with Mauve [3] → concatenate contigs based on alignment → normal IslandViewer analysis pipeline]
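The concatenation step can be illustrated with a minimal sketch. It assumes each contig already carries the reference position and strand reported by the aligner; the `Contig` class, its field names, and the example sequences are hypothetical and are not part of Mauve or IslandViewer.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    sequence: str
    ref_start: int          # hypothetical: start of the contig's alignment on the reference
    reverse: bool = False   # hypothetical: True if the contig aligned to the reverse strand


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def concatenate_contigs(contigs):
    """Order contigs by reference position and join them into one sequence.

    Also returns the 0-based, end-exclusive span of each contig in the
    joined sequence, so contig edges remain traceable afterwards.
    """
    ordered = sorted(contigs, key=lambda c: c.ref_start)
    parts, spans, offset = [], [], 0
    for contig in ordered:
        seq = reverse_complement(contig.sequence) if contig.reverse else contig.sequence
        parts.append(seq)
        spans.append((contig.name, offset, offset + len(seq)))
        offset += len(seq)
    return "".join(parts), spans


contigs = [
    Contig("contig_2", "GGGTTT", ref_start=500),
    Contig("contig_1", "AACC", ref_start=10, reverse=True),
]
pseudo_chromosome, spans = concatenate_contigs(contigs)
print(pseudo_chromosome)
print(spans)
```

Keeping the contig spans alongside the joined sequence is a design choice for this sketch: it lets later steps relate each prediction back to the nearest contig edge.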
[Diagram: Isolate → Draft Genome and Complete Genome → Compare]
Gene function category analysis
•Open reading frames (ORFs) were assigned to clusters of orthologous groups (COGs) [12] using RPS-BLAST [13]
•COG superfamily distributions were compared between complete genomes and the regions missing from drafts
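One way to make such a comparison concrete is sketched below. The poster does not name the statistical test that was used, so the hypergeometric upper-tail test here is an assumption, as are the function name and the toy category counts. The sketch also assumes the ORFs in the missing regions are a subset of the ORFs in the complete genome.

```python
from collections import Counter
from math import comb


def compare_cog_categories(complete_cogs, missing_cogs):
    """Compare COG category proportions between a complete genome and the
    ORFs missing from its draft.

    Returns {category: (proportion_in_complete, proportion_in_missing, p_value)},
    where p_value is the hypergeometric probability of seeing at least the
    observed number of ORFs from that category among the missing ORFs.
    """
    complete = Counter(complete_cogs)
    missing = Counter(missing_cogs)
    total_orfs, total_missing = len(complete_cogs), len(missing_cogs)
    denominator = comb(total_orfs, total_missing)
    results = {}
    for category in sorted(complete):
        in_category, observed = complete[category], missing[category]
        upper_tail = sum(
            comb(in_category, i) * comb(total_orfs - in_category, total_missing - i)
            for i in range(observed, min(in_category, total_missing) + 1)
        )
        results[category] = (
            in_category / total_orfs,
            observed / total_missing,
            upper_tail / denominator,
        )
    return results


# Toy data (not from the poster): one COG letter per ORF.
complete_genome = ["L"] * 10 + ["J"] * 40 + ["E"] * 50
missing_from_draft = ["L"] * 4 + ["J"] + ["E"]

for category, (p_complete, p_missing, p_value) in compare_cog_categories(
    complete_genome, missing_from_draft
).items():
    print(category, round(p_complete, 3), round(p_missing, 3), p_value)
```

A category that is overrepresented among the missing ORFs is, equivalently, underrepresented in the draft genome. With many categories tested at once, a multiple-testing correction would normally be applied to the p-values.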
Genes of interest
•Main data set: 36 Listeria monocytogenes isolates [1], each with a draft Illumina genome and a subsequently completed genome of the identical isolate
•Other data set: draft genomes from the Pseudomonas aeruginosa reference panel [2] and similar completed genomes (an identical completed reference was available for 2 strains)
•Each draft genome was aligned to the completed reference with Mauve Contig Mover [3]
SIGI-HMM [10]
•codon usage bias
•HMM-based method
IslandPath-DIMOB [11]
•dinucleotide bias
•presence of a mobility gene
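To give a feel for what a dinucleotide bias signal looks like, the sketch below scores a region by how far its dinucleotide frequencies deviate from the genome-wide frequencies. This is a simplified illustration only, not the scoring used by IslandPath-DIMOB; the function names and toy sequences are assumptions.

```python
from collections import Counter
from itertools import product

DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_freqs(seq):
    """Return the relative frequency of each of the 16 dinucleotides."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES)
    return {d: (counts[d] / total if total else 0.0) for d in DINUCLEOTIDES}


def dinucleotide_bias(region, genome):
    """Mean absolute difference in dinucleotide frequency between a region
    and the whole genome; larger values mean a more atypical composition."""
    region_freqs = dinucleotide_freqs(region)
    genome_freqs = dinucleotide_freqs(genome)
    return sum(
        abs(region_freqs[d] - genome_freqs[d]) for d in DINUCLEOTIDES
    ) / len(DINUCLEOTIDES)


# Toy sequences (not from the poster).
genome = "ACGT" * 50
typical_region = genome[:40]
atypical_region = "AT" * 20

print(dinucleotide_bias(typical_region, genome))
print(dinucleotide_bias(atypical_region, genome))
```

In this toy case the second region scores higher than the first because its composition differs from the rest of the sequence, which is the kind of signal a composition-based GI predictor looks for.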
•“Replication, recombination, and repair”
superfamily was significantly underrepresented
in draft genomes of both L. monocytogenes
and P. aeruginosa
•In particular, transposons tend to be missing
from draft genomes
Pipeline
Many GIs are present at
contig breaks, and these
GIs are more likely to be
missed by analysis of draft
genomes
[Figure: Number of GI predictions in Listeria genomes (y-axis, 0 to 180) by distance in base pairs from contig edge (x-axis bins: 0; 1 to 9; 10 to 99; 100 to 999; 1000 to 9999; 10000 to 99999; 100000 to 999999). Series: predictions missed in draft genome analysis; predictions correctly identified in draft genome analysis.]
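The binning behind a distance-from-contig-edge histogram can be sketched as follows. The sketch assumes GI predictions and contig breaks are both expressed in complete-genome coordinates; the function names, break positions, and GI coordinates are hypothetical.

```python
from collections import Counter


def distance_to_nearest_break(gi_start, gi_end, breaks):
    """Smallest distance in bp from a GI to any contig break.

    Returns 0 if a break falls inside the GI, and None if there are no breaks.
    """
    best = None
    for position in breaks:
        if gi_start <= position <= gi_end:
            return 0
        distance = min(abs(gi_start - position), abs(gi_end - position))
        best = distance if best is None else min(best, distance)
    return best


def distance_bin(distance):
    """Map a non-negative distance to a power-of-ten bin label such as '100 to 999'."""
    if distance == 0:
        return "0"
    low = 10 ** (len(str(distance)) - 1)
    return f"{low} to {low * 10 - 1}"


# Toy coordinates (not from the poster).
contig_breaks = [1000, 50000]
gi_predictions = [(900, 1200), (1005, 1500), (20000, 25000)]

bins = Counter(
    distance_bin(distance_to_nearest_break(start, end, contig_breaks))
    for start, end in gi_predictions
)
print(bins)
```

Counting missed and correctly identified predictions separately per bin would give the two series of such a histogram.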
[Figure: Thousands of genomes in database (y-axis, 0 to 250) by year (x-axis, 2008 to 2015). Series: NCBI SRA bacterial genomes; NCBI complete bacterial genomes.]
[A] RNA processing and modification
[B] Chromatin structure and dynamics
[C] Energy production and conversion
[D] Cell cycle control, cell division, chromosome partitioning
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[G] Carbohydrate transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[J] Translation, ribosomal structure and biogenesis
[K] Transcription
[L] Replication, recombination and repair
[M] Cell wall / membrane / envelope biogenesis
[N] Cell motility
[O] Posttranslational modification, protein turnover, chaperones
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism
[R] General function prediction only
[S] Function unknown
[T] Signal transduction mechanisms
[U] Intracellular trafficking, secretion, and vesicular transport
[V] Defense mechanisms
[W] Extracellular structures
[Z] Cytoskeleton
AMR (antimicrobial resistance) genes
Methods: Identified using the Resistance Gene Identifier [4] with the Comprehensive Antibiotic Resistance Database
Results: Not significantly underrepresented in Listeria or Pseudomonas draft genomes

Virulence factors
Methods: Predicted using a conservative reciprocal-best-BLAST-hit approach against virulence factors from VFDB, PATRIC, and Victors [5,6,7]
Results: Not significantly underrepresented in Listeria or Pseudomonas draft genomes

tRNA genes
Methods: Predicted using tRNAscan-SE [8] and ARAGORN [9]
Results: Significantly underrepresented in Listeria and Pseudomonas draft genomes
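The reciprocal-best-hit idea used for virulence factor prediction can be sketched in a few lines. The sketch assumes BLAST results have already been parsed into (query, subject, bit score) tuples. The poster does not give the score or identity thresholds that made the approach conservative, so none are applied here; the gene and virulence factor names are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_hits, reverse_hits):
    """Return (gene, reference) pairs that are each other's best hit."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted(
        (gene, ref) for gene, ref in forward.items() if reverse.get(ref) == gene
    )


# Toy hits (not from the poster): genome genes searched against a virulence
# factor set, and the virulence factor set searched against the genome.
genome_vs_vf = [("geneA", "vf1", 200), ("geneA", "vf2", 150), ("geneB", "vf2", 90)]
vf_vs_genome = [("vf1", "geneA", 198), ("vf2", "geneA", 150), ("vf2", "geneB", 80)]

print(reciprocal_best_hits(genome_vs_vf, vf_vs_genome))
```

Here geneB is not reported: its best hit is vf2, but the best hit of vf2 is geneA, so the pair is not reciprocal.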
[Figure: Percent missing from draft Listeria genomes (y-axis), by gene type]
[Figure: Proportion of total ORFs (y-axis, 0 to 0.6) by COG superfamily (x-axis, A to Z). Series: completed genome; regions missing from draft genome.]
Conclusion
[Figure categories: all protein-coding genes; AMR genes; virulence factors; tRNA genes]
Note: This image shows only genomes submitted to NCBI, so it underestimates the extent of the gap between draft and complete genomes.
1) Gilmour MW, et al. 2010 BMC Genomics 11:120.
2) De Soyza A, et al. 2013 MicrobiologyOpen 2(6):1010-23.
3) Darling AE, et al. 2010 PLoS One 5(6):e11147.
4) McArthur AG, et al. 2013 Antimicrob Agents Chemother 57(7):3348-57.
5) Dhillon BK, et al. 2015 NAR gkv401.
6) Chen L, et al. 2011 NAR gkr989.
7) Wattam AR, et al. 2014 NAR 42(D1):D581-9.
8) Lowe TM & Eddy SR 1997 NAR 25(5):955-64.
9) Laslett D & Canback B 2004 NAR 32(1):11-6.
10) Waack S, et al. 2006 BMC Bioinformatics 7:142.
11) Hsiao W, et al. 2003 Bioinformatics 19(3):418-20.
12) Tatusov RL, et al. 2003 BMC Bioinformatics 4:1.
13) Altschul SF, et al. 1997 NAR 25(17):3389-402.
•Draft genomes have limitations: certain gene
types, particularly those associated with mobile
elements, are disproportionately missing
•Draft genome analysis is still valuable for identifying virulence factors (VFs) and AMR genes in the species examined, but more species should be studied