This document provides an overview of tools and approaches for analyzing bacterial population genomics and evolution using next-generation sequencing (NGS) data. It discusses identifying variants from NGS reads using SNP-based or gene-by-gene approaches. It also covers assembly-free and assembly-based analyses, including tools for short-read assembly, pangenome alignment, core genome alignment, and ortholog clustering. Population genomics applications like cgMLST/wgMLST, population structure analysis, and recombination detection are also briefly introduced. The document aims to provide bacterial genomics researchers with a toolbox of software and strategies for population analysis using NGS data.
Bacterial Population Genomics Using Next-Generation Sequencing
1. Toolbox for bacterial
population analysis
using NGS
INTRODUCTION OF BACTERIAL POPULATION GENOMICS AND EVOLUTION
MIRKO ROSSI
ASS. PROF. ENVIRONMENTAL HYGIENE, FACULTY OF VETERINARY MEDICINE
2. I’m a vet and not a bioinformatician.. I’m a good example of an end-user!
I do not want to teach population genetics today … just give you some tips how to do it using
NGS in bacteria
If you are interested in bacterial population analysis … we are organizing an ad hoc course in
Spring ..
There are several more software/pipelines.. These are the ones I like/I know/I apply
If you want the slides, send me an email at mirko.rossi@helsinki.fi
If you are an MSc student in bioinformatics and interested in a thesis in applied bioinformatics in public
health microbiology and pathogen surveillance, please contact me ..
3. Bacterial population
A group of individuals of the same species
POPULATIONS, not individuals, evolve
Population and community are two different concepts … WE ARE SPEAKING OF INDIVIDUALS OF
THE SAME SPECIES!!!! … although the definition of species in bacteriology is quite vague
Population genomics attempts to understand a population through whole-genome analysis of a sample
of it, investigating the variation among a subset of the individual members of the population
“Sequence data is ideal for this, as the differences between individuals are often tiny (i.e. there
is very little variation) since they belong to a single population, and DNA sequence data allows
us to detect single nucleotide changes (i.e. provides high resolution)” (Kat Holt)
4. The sample is a subset of the population
Population
Universe
Reality
State of nature
Truth
parameters
Sample
Finite, random
noise
error
perturbation
statistics
Statistical inference: Extract maximum information from
sample in order to draw conclusions about population
Inductive not deductive
Source John Bunge
5. How many samples do I need to
sequence?
It depends on your question!
Accuracy is important.. but big numbers help!
Draft genomes are enough. Closing a genome is a waste of time and money!
Good draft: 100 €/s; closed: > 3000 €/s
Include in your analysis as much diversity as possible (time, space, phenotypes,...)
Sequence as much as you can … just stop before you go broke!!
1000 strains < 100 000 €
6. Bacterial population… different levels
Population of H. pylori living in a single stomach Population of H. pylori circulating globally
7. What do we want to measure?
Genetic Drift
◦ the change in the gene pool of a small population due to chance
Natural Selection
◦ Allele increasing fitness will accumulate in the population
◦ Cause ADAPTATION of Populations
Gene Flow
◦ is genetic exchange due to the migration of individuals between populations
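To make the definition of genetic drift concrete, here is a minimal sketch of a Wright-Fisher style simulation. It is not part of the original slides; the function name, population sizes and starting frequency are illustrative assumptions. It tracks one allele under pure chance, with no selection and no gene flow.

```python
import random


def wright_fisher(pop_size, start_freq, generations, seed=0):
    """Track the frequency of one allele under pure drift (no selection, no migration)."""
    rng = random.Random(seed)
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Each individual of the next generation draws its allele at random
        # from the current gene pool (binomial sampling).
        carriers = sum(1 for _ in range(pop_size) if rng.random() < freq)
        freq = carriers / pop_size
        trajectory.append(freq)
    return trajectory


# Hypothetical population sizes: drift is expected to be stronger in the small one.
small = wright_fisher(pop_size=20, start_freq=0.5, generations=50)
large = wright_fisher(pop_size=2000, start_freq=0.5, generations=50)
print("small population, final frequency:", small[-1])
print("large population, final frequency:", large[-1])
```

Running it with different seeds should show the small population wandering far from 0.5 (often to loss or fixation), while the large one stays close to the starting frequency.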
8. How do we measure (using NGS)?
Identify variants:
◦ SNP approach
◦ Gene-by-gene approach
Define which part of the gene pool is common in all the individuals of the population (core)
and which part is not (accessory)
Use of phylogenetic frameworks for reconstructing genealogy and non-phylogenetic
clustering methods for inferring population structure
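The split between core and accessory gene pool can be sketched with plain set operations. This is only an illustration, assuming each strain has already been reduced to a set of gene (or ortholog cluster) identifiers; the strain and gene names are hypothetical.

```python
def split_gene_pool(gene_content):
    """gene_content maps strain name -> set of gene (cluster) identifiers."""
    gene_sets = list(gene_content.values())
    pangenome = set().union(*gene_sets)      # every gene seen in any strain
    core = set.intersection(*gene_sets)      # genes present in all strains
    accessory = pangenome - core             # genes missing from at least one strain
    return core, accessory, pangenome


# Hypothetical toy data set of three strains.
strains = {
    "strain_A": {"gyrA", "recA", "cagA"},
    "strain_B": {"gyrA", "recA", "vacA"},
    "strain_C": {"gyrA", "recA"},
}
core, accessory, pangenome = split_gene_pool(strains)
print(sorted(core))       # ['gyrA', 'recA']
print(sorted(accessory))  # ['cagA', 'vacA']
```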
10. Identifying variants: SNP approaches
sample
NGS
WGS
reads
Mapping to reference
VCF/Fasta File with SNPs
• Needs a reference strain
• Monomorphic (Clonal) species
• Recombination/Horizontal gene transfer is a
problem
• Difficult to create a nomenclature
Source J. Carriço
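As a rough illustration of the last step of the SNP approach (going from per-sample sequences mapped to one reference to a FASTA-like file with SNPs only), the sketch below keeps the alignment columns where samples differ. It is a simplified assumption of how such a filter could look, not the code of any specific tool; the sample names and sequences are made up.

```python
def variable_sites(alignment):
    """alignment maps sample name -> sequence aligned to the same reference."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    positions = []
    for i in range(length):
        column = {s[i] for s in seqs}
        # Keep only columns where every sample has an unambiguous base
        # and at least two different bases are observed.
        if column <= set("ACGT") and len(column) > 1:
            positions.append(i)
    snp_alignment = {
        name: "".join(seq[i] for i in positions)
        for name, seq in alignment.items()
    }
    return positions, snp_alignment


# Hypothetical toy alignment; 'N' marks a position that could not be called.
aln = {"ref": "ACGTACGT", "s1": "ACGTACGA", "s2": "ATGTACGA", "s3": "ACGTNCGT"}
positions, snps = variable_sites(aln)
print(positions)  # [1, 7]
print(snps)       # {'ref': 'CT', 's1': 'CA', 's2': 'TA', 's3': 'CT'}
```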
11. Identifying variants: Gene-by-gene
sample
NGS
WGS
reads
• No need for reference strain
• Buffers recombination effect
• Simpler to create a nomenclature
• Population structure of non-monomorphic
species
• Multiple Schemas can be defined for a single
species
assembly
contigs
Central nomenclature server:
Schemas, Allele definitions and identifiers
Output :Allelic Profile
Source J. Carriço
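The idea of an allelic profile can be sketched as a lookup of each locus sequence in an allele database, assigning a new identifier when the sequence has not been seen before. This is a toy illustration under stated assumptions (exact sequence matching, 0 for a missing locus, hypothetical locus names), not the behaviour of any particular nomenclature server.

```python
def call_alleles(sample_loci, allele_db):
    """
    sample_loci: locus name -> sequence found in the sample's contigs
    allele_db:   locus name -> {sequence: allele identifier}, updated in place
    Returns the allelic profile as locus name -> allele identifier.
    """
    profile = {}
    for locus, known in allele_db.items():
        seq = sample_loci.get(locus)
        if seq is None:
            profile[locus] = 0  # assumption: 0 marks a locus that was not found
            continue
        if seq not in known:
            # New allele: give it the next free identifier for this locus.
            known[seq] = max(known.values(), default=0) + 1
        profile[locus] = known[seq]
    return profile


# Hypothetical schema with two loci.
db = {
    "locus1": {"ATGAAA": 1},
    "locus2": {"ATGCCC": 1, "ATGCCT": 2},
}
profile_a = call_alleles({"locus1": "ATGAAA", "locus2": "ATGCCT"}, db)
profile_b = call_alleles({"locus1": "ATGAAG"}, db)
print(profile_a)  # {'locus1': 1, 'locus2': 2}
print(profile_b)  # {'locus1': 2, 'locus2': 0}
```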
13. … I’m just using Illumina
For both de novo and re-sequencing
At the moment Illumina gives the
best benefit-cost ratio:
• High throughput
• Accuracy
• Possibility for multiplex
• Reasonable work flow time
• Easily accessible
For small genomes (1 to 2 Mb) it is
nowadays possible to sequence at
~90 euro/sample with minimum 40x
coverage
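The coverage figure can be checked with simple arithmetic: coverage is the total number of sequenced bases divided by the genome size. The sketch below assumes a 1.7 Mb genome and 150 bp reads, which are illustrative values and not taken from the slides.

```python
import math


def expected_coverage(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


# Assumed values: 1.7 Mb genome, 150 bp reads, 40x target.
n = reads_needed(target_coverage=40, read_length=150, genome_size=1_700_000)
print(n)  # 453334
print(expected_coverage(n, 150, 1_700_000))
```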
14. I have the reads for each strain.. OK, and now?
An overview of the main programs, platforms and approaches … sometimes it is a question of style!
15. I want some results from reads…
You can always map your reads against a close reference genome using “classical” short-read
aligners and extract SNPs: BWA, for example
Here just a (long) list http://omictools.com/read-alignment-c83-p1.html
Now you just need to decide the reference genome
Note that you might need to select more than one reference genome to tune your analysis
…Be aware that software designed specifically for
bacterial genomes is available
16. Assembly-free analyses
SNP CALLING AND CORE GENOME ALIGNMENTS - REFERENCE BASED MAPPING
Snippy
◦ One-by-one
◦ Combine a set of results that use the same reference to
generate a core SNP alignment
◦ A lot of output files
◦ Variants: SNPs, MNPs, INDELs, MIX
Input Requirements
◦ a reference genome in FASTA or GENBANK
format (can be in multiple contigs)
◦ query sequence read files in FASTQ or FASTA
format (can be .gz compressed)
Wombac
◦ Fast and “dirty”; several samples in a run
◦ Computations can be re-used for building new trees
◦ looks for substitution SNPs, not indels, and it may
miss some SNPs
Input Requirements
◦ a reference genome in FASTA or GENBANK format
(can be in multiple contigs)
◦ query sequences in
◦ a folder containing FASTQ short reads: eg. R1.fq.gz R2.fq.gz
◦ a multi-FASTA file: eg. contigs.fa or NC_273461.fna
◦ a .tar.gz file containing FASTA contig files: eg.
Ecoli_K12mut.contig.tar.gz (from EBI/NCBI)
https://github.com/tseemann/wombac
https://github.com/tseemann/snippy
@torstenseemann
17. Assembly-free analyses
SHORT READ SEQUENCE TYPING
Srst2
◦ Designed specifically for bacterial genomes
◦ Queries Illumina sequence data against an MLST database and/or a database
of gene sequences
◦ Reports the presence of STs (allele designation) and/or reference genes
Input Requirements
◦ Query: Illumina reads (fastq.gz format, but other options exist)
◦ A fasta reference sequence database to match to:
◦ For MLST, this means a fasta file of all allele sequences. If you want to assign STs, you also need a
tab-delim file which defines the ST profiles as a combination of alleles.
◦ For resistance/virulence genes, this means a fasta file of all the resistance genes/alleles that you
want to screen for, clustered into gene groups.
https://github.com/katholt/srst2
@DrKatHolt
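The tab-delimited ST profile file mentioned above defines each ST as a combination of alleles. The sketch below shows how such a table could be read and used to look up an ST. It is not srst2 code: it assumes a file with an "ST" column followed only by locus columns, and the locus names and profiles are hypothetical.

```python
import csv
import io


def load_st_profiles(handle):
    """Read a tab-delimited table: an 'ST' column plus one column per locus."""
    reader = csv.DictReader(handle, delimiter="\t")
    loci = [name for name in reader.fieldnames if name != "ST"]
    table = {}
    for row in reader:
        table[tuple(row[locus] for locus in loci)] = row["ST"]
    return loci, table


def assign_st(alleles, loci, table):
    """alleles maps locus -> allele number; returns the ST or 'unknown'."""
    key = tuple(str(alleles[locus]) for locus in loci)
    return table.get(key, "unknown")


# Hypothetical three-locus scheme, inlined here instead of read from disk.
profile_text = "ST\tgeneA\tgeneB\tgeneC\n1\t1\t1\t1\n5\t1\t3\t2\n"
loci, table = load_st_profiles(io.StringIO(profile_text))
print(assign_st({"geneA": 1, "geneB": 3, "geneC": 2}, loci, table))  # 5
print(assign_st({"geneA": 2, "geneB": 3, "geneC": 2}, loci, table))  # unknown
```

Real profile files may carry extra columns (for example a clonal complex), which would have to be excluded from the locus list.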
18. Stand-alone pipeline for SNP variant
Nullarbor
◦ Clean reads
◦ Species identification by k-mer analysis against a known genome database (Kraken)
◦ De novo assembly
◦ Annotation
◦ MLST
◦ Resistome
◦ SNP Variants
https://github.com/tseemann/nullarbor
@torstenseemann
19. … or you might prefer to assemble your
genome!
When you know little or nothing of your dataset (it is not possible to select a
reference genome)
In case of deep comparative genomics, when you are also interested in the accessory
genome (genes absent from your reference)
To extract the pangenome
Because having all your dataset assembled will facilitate downstream
applications
To develop common NOMENCLATURE
21. Assembly short reads
REFERENCE BASED ASSEMBLY
Mira (best assembler … for geeks since 1999 )
◦ multi-pass assembler/mapper for small genomes
(up to 150 Mb)
◦ has full overview on the whole project at any time
of the assembly, using all available data and
learning from mistakes
◦ Marks places of interest with tags so that these can
be found quickly in finishing programs
◦ can also do de novo and hybrid assembly
Input Requirements
◦ various formats (CAF, FASTA, FASTQ or PHD) from
Sanger, 454, Ion Torrent, Illumina
DE NOVO ASSEMBLY
Spades (a very good assembler for lazy people)
◦ is intended for both standard isolates and single-
cell MDA bacteria assemblies
◦ It does its work and very well
◦ Simple to run
spades.py --careful -1 R1.fastq.gz -2 R2.fastq.gz -o output_folder
◦ Can use Nanopore and PacBio for hybrid
assembly
Andrey’s lecture from WBG2014
https://docs.google.com/presentation/d/1wjrJGKhQQEHDwHF5OhQQyKnj5_c7duTAQjcDsBHTkWQ/edit#slide=id.g47b5b1626_0793
http://sourceforge.net/projects/mira-assembler/ http://bioinf.spbau.ru/spades
@BaCh_mira
22. Pangenome alignment
(up to 50 strains)
MUGSY
Genomes should be very similar
Mugsy (like Mauve) generates a
multiple-block local alignment
Alignment format is in MAF
MAUVE
Large-scale evolutionary events
It can align more divergent strains than Mugsy:
as little as 50% nucleotide identity
It aligns the pan-genome
Complete genome alignment in the eXtended
Multi-FastA (XMFA)
List groups of genes that are predicted to be
positionally orthologous
GUI available
http://mugsy.sourceforge.net/
http://darlinglab.org/mauve/
23. Core genome alignment
PARSNP
Designed to align the core genome of hundreds to thousands of bacterial genomes within a few
minutes to a few hours
Requires very similar strains: it uses MUMi to select the nearest genomes; only those with a
distance <= 0.01 are included, and all others are discarded
Input can be both draft assemblies and finished genomes, and output includes variant (SNP)
calls, core genome phylogeny and multi-alignments
Results are visualized using a GUI
https://harvest.readthedocs.org/en/latest/content/parsnp.html
25. Structural annotation
PRODIGAL
Gene finders
Very fast: 3000 genomes in about a week (8 CPUs,
16 GB RAM)
Prodigal can be run in one step on a single
genomic sequence or on a draft genome
containing many sequences.
It does not need to be supplied with any
knowledge of the organism, as it learns all the
properties it needs to on its own.
PROKKA
Structural and functional annotation
Fast automatic annotation on multi-core machines: <
15 min
Several dependencies tedious to install (… I
told you I’m very lazy!)
http://www.slideshare.net/torstenseemann/prokka-rapid-bacterial-genome-annotation-abphm-2013?related=1
https://github.com/hyattpd/prodigal/wiki https://github.com/tseemann/prokka
26. Ortholog clustering
ORTHAGOGUE
high speed estimation of homology relations
within and between species in massive data
sets
easy to use and offers flexibility through a
range of optional parameters
Input = all-against-all BLAST tabular output
Output = mcl file
-u -o XX: ignore e-value, use BLAST score,
exclude proteins with overlap < XX
ROARY
high speed stand alone pan genome pipeline
128 samples can be analysed in under 1 hour
using 1 GB of RAM and a single processor
Input = GFF3 format produced by Prokka
roary -e --mafft *.gff
FastTree -nt -gtr core_gene_alignment.aln >
my_tree.newick
Output = several files
https://code.google.com/p/orthagogue/ http://sanger-pathogens.github.io/Roary/
27. Gene-by-gene: pangenome,
coregenome, accessory genome
Ortholog clustering results, processed with ad hoc scripts:
Core Genome, Accessory Genome, Pangenome
(Everything included in Roary but not in OrthAgogue)
Phylogeny: RAxML, FastTree, BEAST
Population structure: BAPS, STRUCTURE
Recombination: BRATNEXTGEN, GUBBINS
29. cgMLST and wgMLST
Open source
BACTERIAL ISOLATE GENOME SEQUENCE DATABASE
◦ Jolley & Maiden 2010, BMC Bioinformatics 11:595 - http://pubmlst.org/software/database/bigsdb/
◦ PROs: Freely available, open-source, handles thousands of genomes, has several schemas implemented
for MLST for several bacterial species, and some extended MLST and core genome MLST (mainly Neisseria
sp. but soon to be expanded)
◦ CONs: Requires Perl knowledge to install and maintain
Source J. Carriço
@jacarrico
30. cgMLST and wgMLST
Commercial software
RIDOM SEQSPHERE+
◦ http://www.ridom.com/seqsphere/
◦ with client server solutions from assembly to allele calling and visualization for core genome MLST
(MLST+/ cgMLST)
APPLIED MATHS - BIONUMERICS 7.5
◦ http://www.applied-maths.com/news/bionumerics-version-75-released
◦ Commercial software with client server solutions from assembly to allele calling and visualization for
whole genome MLST (wgMLST)
Source J. Carriço
@jacarrico
31. cgMLST with Genome Profiler
Indexes alleles of the loci that are shared by the bacterial isolates, using both BLASTN and
BLASTX
Transforms WGS data into allele profile data
Using a reference genome, it attempts to account for gene paralogy using conserved gene
neighborhoods
http://jcm.asm.org/content/53/5/1765.abstract
32. cgMLST with Genome Profiler
Input files
◦ reference genome in gbk format (even in multi-gbk format from RAST) or a multi-FASTA file of the allele
sequences
◦ Query genomes in FASTA format (complete or draft – in contigs)
If you run the data for the first time, you use one of the genomes as reference to build a new
cgMLST scheme (ad hoc mode):
◦ perl GeP.pl -r NC_017282.gbk -g genome_list.txt
Data can be run with the cgMLST scheme created previously by GeP:
◦ perl GeP.pl -g genome_list.txt -o
Or you could use a multi-FASTA file of the allele sequences (nt) as reference (in this case all
possible paralogs are excluded - a fixed number of 999999999 will be assigned to expect-d)
◦ perl GeP.pl -r NC_017282.ffn -g genome_list.txt -n
33. cgMLST with Genome Profiler
Output files:
◦ output.txt records the information of all the loci in each of the test genome sequences
◦ difference_matrix.html contains a summary of the analysis and a matrix of pairwise
differences between the allelic profiles of the samples.
◦ Splitstree.nex allele profile of the isolates in NEXUS format, which can be opened in
Splitstree 4
◦ allele_profile.txt matrix of allele profile (input file of STRUCTURE and BAPS)
◦ core_genomes.fas alignment of the core genome in FASTA format
https://www.dropbox.com/sh/02pt21410hla1rf/AADGNL7W6Uxsb5cAR0kffSaUa?dl=0
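The matrix of pairwise differences between allelic profiles can be illustrated with a few lines of Python. This is a sketch of the general idea and not GeP's implementation; in particular, skipping loci coded as 0 (missing) is an assumption made here, and the isolate names and profiles are hypothetical.

```python
def allelic_differences(profiles, missing=0):
    """
    profiles maps isolate name -> list of allele numbers (same locus order).
    Returns the sorted names and a nested dict of pairwise difference counts.
    """
    names = sorted(profiles)
    matrix = {a: {b: 0 for b in names} for a in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            diff = sum(
                1
                for x, y in zip(profiles[a], profiles[b])
                # Assumption: loci missing in either isolate are not counted.
                if x != y and missing not in (x, y)
            )
            matrix[a][b] = matrix[b][a] = diff
    return names, matrix


# Hypothetical four-locus profiles.
profiles = {"iso1": [1, 1, 2, 4], "iso2": [1, 2, 2, 4], "iso3": [3, 2, 0, 5]}
names, matrix = allelic_differences(profiles)
for a in names:
    print(a, [matrix[a][b] for b in names])
# iso1 [0, 1, 3]
# iso2 [1, 0, 2]
# iso3 [3, 2, 0]
```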
34. Inferring recombination events
GUBBINS
Iteratively identifies loci containing elevated
densities of base substitutions while
concurrently constructing a phylogeny based
on the putative point mutations outside of
these regions
Runs in only a few hours on alignments of
hundreds of bacterial genome sequences.
BRATNEXTGEN
Bayesian analysis of recombinations in whole-
genome DNA sequence data
Uses a GUI
Divides the genome into segments, then for
each segment, detects genetically distinct
clusters of isolates and estimates the
probabilities of recombination events
Runs efficiently on a desktop computer .. I
tested up to 100 .. Results after an overnight (O/N) run
https://github.com/sanger-pathogens/Gubbins
http://www.helsinki.fi/bsg/software/BRAT-NextGen/
https://www.dropbox.com/s/gppp5xs2pkw87ms/BratNextGen_manual.pdf?dl=0
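The intuition behind looking for "elevated densities of base substitutions" can be sketched with a naive sliding window. This is deliberately much simpler than what Gubbins does (no phylogeny, no iteration, no statistical test); the window size, the threshold factor and the SNP positions are illustrative assumptions.

```python
def dense_windows(snp_positions, genome_length, window=1000, factor=3.0):
    """Flag fixed windows whose SNP count exceeds `factor` times the genome-wide expectation."""
    mean_density = len(snp_positions) / genome_length  # SNPs per base
    threshold = factor * mean_density * window
    flagged = []
    for start in range(0, genome_length, window):
        end = min(start + window, genome_length)
        count = sum(1 for p in snp_positions if start <= p < end)
        if count > threshold:
            flagged.append((start, end, count))
    return flagged


# Hypothetical SNP positions on a 10 kb alignment, with a cluster around 5.2-5.4 kb.
snps = [100, 5200, 5210, 5230, 5260, 5300, 5350, 9000]
print(dense_windows(snps, genome_length=10_000))  # [(5000, 6000, 6)]
```

Regions flagged this way are only candidates: a dense cluster of substitutions is what an imported (recombinant) fragment tends to look like, but a proper tool is needed to confirm it.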
35. Phylogeny (phylogeography) visualization
A directory for tree visualization
http://www.informatik.uni-rostock.de/~hs162/treeposter/poster.html
My favorite tree editor/viewer
http://itol.embl.de/
A very nice tool for phylogeography
http://microreact.org/showcase/