Snippy - Rapid bacterial variant calling - UK - Tue 5 May 2015

Snippy
Torsten Seemann
Balti & Bioinformatics - Birmingham, UK - Tue 5 May 2015
Rapid bacterial variant calling
& core genome alignments

Phyloflagomics
∷ UK / Birmingham
∷ Australia / Victoria
∷ Canada / British Columbia

A new home
Centre for Applied Microbial Genomics

Microbiological Diagnostic Unit
∷ Oldest public health lab in Australia
: established 1897 in Melbourne
: large historical isolate collection back to 1950s
∷ National reference laboratory
: Salmonella, Listeria, EHEC
∷ WHO regional reference lab
: vaccine preventable invasive bacterial pathogens

New director
∷ Professor Ben Howden
: clinician, microbiologist, pathologist
: early adopter of genomics and bioinformatics
: long term collaborator on MRSA/VRE w/ Tim Stinear
∷ Mandate
: modernise service delivery
: enhance research output and collaboration
: nationally lead the conversion to WGS

Hardware
∷ Sequencers
: NextSeq 500
: 3 x MiSeq
: PacBio RS II (arriving 22 May)
∷ Robots
: Perkin Elmer (does not have a Twitter account)
: Colony picker
∷ Compute
: 240 TB, 10 GigE, 3 x 72 core boxes

Variant calling
∷ Find DNA differences between genomes
: variants to explain phenotype
: validate your complemented mutant
∷ Two approaches
: reference based (read alignment)
: reference-free (de novo assembly / k-mer based)

Types of variants
∷ Substitutions
: single nucleotide polymorphism (snp) A➝C
: multiple nucleotide polymorphism (mnp) AG➝TC
∷ Indels
: insertion (ins) A➝AC
: deletion (del) ACCG➝AG
∷ Complex
: compound events AC➝T
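The categories above can be told apart from the REF and ALT alleles alone. The following is a minimal sketch, not Snippy's or freebayes' actual logic: `classify_variant` is a hypothetical helper, and the rule that an equal-length change counts as "mnp" only when every base differs is an assumption chosen to agree with the examples on these slides.

```python
def _common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def classify_variant(ref: str, alt: str) -> str:
    """Return 'snp', 'mnp', 'ins', 'del' or 'complex' for a REF/ALT pair."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    if len(ref) == len(alt):
        if len(ref) == 1:
            return "snp"
        # Assumption: an mnp changes every base; a partial match is complex.
        if all(r != a for r, a in zip(ref, alt)):
            return "mnp"
        return "complex"
    prefix = _common_prefix_len(ref, alt)
    suffix = _common_prefix_len(ref[::-1], alt[::-1])
    # Pure indel: the shorter allele equals the longer one with a single
    # contiguous run of bases removed.
    if prefix + suffix >= min(len(ref), len(alt)):
        return "ins" if len(alt) > len(ref) else "del"
    return "complex"


for ref, alt in [("A", "C"), ("AG", "TC"), ("A", "AC"), ("ACCG", "AG"), ("AC", "T")]:
    print(ref, alt, classify_variant(ref, alt))
```

Real variant callers also normalise and left-align alleles before typing them, which this sketch does not attempt.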

Snippy
∷ Fast → snappy
∷ Finds variants → SNPs
∷ Australian → Skippy the bush kangaroo

Input
∷ FASTQ files
: paired end, interleaved, or single-end
∷ Reference
: FASTA or Genbank
∷ Output folder
: self contained bundle of results

Inside the black box
∷ bwa mem - no clipping needed
∷ samtools - sorted, filtered BAM
∷ freebayes - split / GNU parallel / merge
∷ vcflib/vcftools - VCF filtering
∷ perl - glue
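The "split / GNU parallel / merge" step can be pictured as cutting the reference into regions, calling variants on each region in a separate process, and concatenating the per-region results. The sketch below covers only the splitting part. `split_regions` and its parameters are hypothetical names, the coordinates are 0-based half-open by assumption, and Snippy's own chunking may well differ.

```python
def split_regions(contig_lengths, n_chunks):
    """Cut contigs into (name, start, end) regions of roughly equal size.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    Regions never span two contigs, so the result can hold a few more
    than n_chunks entries.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    total = sum(contig_lengths.values())
    chunk = max(1, -(-total // n_chunks))  # ceiling division
    regions = []
    for name, length in contig_lengths.items():
        for start in range(0, length, chunk):
            regions.append((name, start, min(start + chunk, length)))
    return regions


# Hypothetical contig sizes, for illustration only.
regions = split_regions({"chr": 1000, "plas": 250}, n_chunks=4)
for region in regions:
    print(region)
```

Each region would then go to its own variant-calling job, with the outputs merged in reference order afterwards.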

Outputs
∷ Read alignments
: .bam / .bai
∷ Variants
: .vcf / .vcf.gz / .vcf.gz.tbi / .gff .bed .tab .csv .html
∷ Consensus
: reference with all variants applied to it
∷ Genome alignment
: reference with “-” (missing) and “N” (low depth)
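The consensus and genome alignment outputs are both derived from the reference sequence. Here is a minimal sketch of the two ideas, assuming non-overlapping variants with 1-based positions and a per-base depth list. `apply_variants`, `mask_by_depth` and the `min_depth` threshold are hypothetical names for illustration, not Snippy's internals.

```python
def apply_variants(reference, variants):
    """Apply (pos, ref, alt) variants to a reference; pos is 1-based."""
    seq = reference
    # Work right to left so indels do not shift coordinates still to be used.
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


def mask_by_depth(reference, depth, min_depth):
    """Write '-' where no reads align and 'N' where depth is below min_depth."""
    if len(reference) != len(depth):
        raise ValueError("need one depth value per reference base")
    masked = []
    for base, d in zip(reference, depth):
        if d == 0:
            masked.append("-")
        elif d < min_depth:
            masked.append("N")
        else:
            masked.append(base)
    return "".join(masked)


print(apply_variants("GATTACA", [(2, "A", "C"), (4, "T", "TG")]))
print(mask_by_depth("GATTACA", [5, 5, 0, 0, 2, 9, 9], min_depth=3))
```

Because the masked sequence keeps the reference's length and coordinates, the sequences from many isolates line up column by column without any further alignment step.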

TAB output
CHROM  POS     TYPE     REF   ALT   EVIDENCE        FTYPE  STRAND  NT_POS  AA_POS  LOCUS_TAG  GENE  PRODUCT
chr    5958    snp      A     G     G:44 A:0        CDS    +       41/600  13/200  ECO_0001   dnaA  replication protein DnaA
chr    35524   snp      G     T     T:73 G:1 C:1    tRNA   -
chr    45722   ins      ATT   ATTT  ATTT:43 ATT:1   CDS    -                       ECO_0045   gyrA  DNA gyrase
chr    100541  del      CAAA  CAA   CAA:38 CAAA:1   CDS    +                       ECO_0179         hypothetical protein
plas   619     complex  GATC  AATA  GATC:28 AATA:0
plas   3221    mnp      GA    CT    CT:39 CT:0      CDS    +                       ECO_p012   rep   hypothetical protein
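A table like this is straightforward to consume downstream. The sketch below assumes the file is tab-delimited with the header row shown and that EVIDENCE holds space-separated `allele:count` pairs. `read_tab` and `parse_evidence` are hypothetical helper names, and only the first six columns are used in the example.

```python
import csv
import io


def parse_evidence(text):
    """Turn 'T:73 G:1 C:1' into {'T': 73, 'G': 1, 'C': 1}."""
    counts = {}
    for item in text.split():
        allele, count = item.rsplit(":", 1)
        counts[allele] = int(count)
    return counts


def read_tab(handle):
    """Read variant rows, converting POS and EVIDENCE to structured values."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["POS"] = int(row["POS"])
        row["EVIDENCE"] = parse_evidence(row["EVIDENCE"])
        rows.append(row)
    return rows


# Two rows from the slide, re-typed with tab separators.
EXAMPLE = "\n".join([
    "\t".join(["CHROM", "POS", "TYPE", "REF", "ALT", "EVIDENCE"]),
    "\t".join(["chr", "5958", "snp", "A", "G", "G:44 A:0"]),
    "\t".join(["chr", "35524", "snp", "G", "T", "T:73 G:1 C:1"]),
])

rows = read_tab(io.StringIO(EXAMPLE))
for row in rows:
    print(row["CHROM"], row["POS"], row["TYPE"], row["EVIDENCE"])
```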

Phylogenetics 101
∷ Choose some genes
∷ Sequence each gene from each isolate
∷ Align the protein sequences of each gene
∷ Back-align to nucleotide space
∷ Concatenate all the alignments
∷ Construct a distance matrix (many ways)
∷ Draw a tree (many ways)
∷ Make wild inferences from little data

Phylogenomics 101
∷ Assemble each genome
∷ Perform whole genome alignment
: in nucleotide space, as we don’t know what is coding
: very computationally expensive
: can’t parallelize as with individual genes
∷ Continue as for phylogenetics

Whole genome alignment
bug1  GATTACCAGCATTAAGG-TTCTCCAATC
bug2  GAT---CTGCATTATGGATTCRNCATTC
bug3  G-TTACCAGCACTAA-------CCAGTC
∷ Ideally, feed this directly to a tree builder
∷ Properly model gaps, codons and ambiguity
∷ Hard!

Core genome
core  | |   |||||||||       ||||||
Core sites are present in all genomes.

Core SNPs
core  | |   |||||||||       ||||||
SNPs         |   |  |       |  |
Core SNPs = polymorphic sites in core genome

Unambiguous core SNPs
core  | |   |||||||||       ||||||
SNPs         |   |  |       |  |
SNPs’        |   |  |          |

Allele sites
SNPs’        |   |  |          |
bug1         a   t  a          a
bug2         t   t  t          t
bug3         a   c  a          g
site         1   2  3          4
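The progression from core sites to core SNPs to unambiguous core SNPs can be reproduced directly from the three-genome alignment shown earlier. This is a minimal sketch under two assumptions: a site is "core" when no genome has a gap there, and "unambiguous" means every genome shows A, C, G or T. The function names are illustrative, not Snippy's.

```python
ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_columns(aln):
    """0-based columns where every genome has a base (no gap)."""
    seqs = list(aln.values())
    return [i for i in range(len(seqs[0])) if all(s[i] != "-" for s in seqs)]


def core_snp_columns(aln):
    """Core columns where the genomes do not all show the same symbol."""
    return [i for i in core_columns(aln)
            if len({s[i] for s in aln.values()}) > 1]


def unambiguous_core_snp_columns(aln):
    """Core SNP columns where every genome has a plain A, C, G or T."""
    return [i for i in core_snp_columns(aln)
            if all(s[i] in "ACGT" for s in aln.values())]


def allele_alignment(aln):
    """Keep only the unambiguous core SNP columns of each genome."""
    cols = unambiguous_core_snp_columns(aln)
    return {name: "".join(seq[i] for i in cols) for name, seq in aln.items()}


print(len(core_columns(ALIGNMENT)), "core sites")
print(len(core_snp_columns(ALIGNMENT)), "core SNPs")
for name, alleles in allele_alignment(ALIGNMENT).items():
    print(name, alleles)
```

The four retained columns give the short per-isolate sequences that are passed on to tree building.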

Alignment ⇢ Tree
>bug1
ATAA
>bug2
TTTT
>bug3
ACAG
   +------ bug3
   |
---+--- bug1
   |
   +--------- bug2
--- 1 SNP
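One of the many ways to build the distance matrix mentioned under Phylogenetics 101 is simply to count differing sites between each pair of allele sequences. The sketch below does only that, using the three sequences from this slide. It is an illustration of pairwise SNP distance, not the method used to draw the tree above, and the function names are hypothetical.

```python
from itertools import combinations


def snp_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(alleles):
    """Pairwise SNP distances keyed by (name, name), names in sorted order."""
    return {
        (p, q): snp_distance(alleles[p], alleles[q])
        for p, q in combinations(sorted(alleles), 2)
    }


ALLELES = {"bug1": "ATAA", "bug2": "TTTT", "bug3": "ACAG"}

for pair, distance in distance_matrix(ALLELES).items():
    print(pair, distance)
```

A tree-building method such as neighbour joining would take a matrix like this as its input.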

Aligning to reference
∷ Why is whole genome alignment not used?
: involves genome (mis)assembly
: computationally difficult
: expensive to add or remove isolates
∷ Short-cut
: choose a single reference
: align each isolate’s reads to the reference
: core, by definition, must include the reference

Read mapping considerations
∷ Choice of reference
∷ Too divergent?
: reads may not align well
: will get too many core genome SNPs
∷ One solution
: Assemble one isolate and use as the reference

Remove taxon, different core (1)
SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
core1 |||   ||||||||||| ||||||||||
SNPs1        |      |      ||  |

Remove taxon, different core (2)
SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
core2 | |   |||||||||       ||||||
SNPs2        |   |  |       |  |

Remove taxon, different core (3)
SNPs         |   |  |       |  |
core  | |   |||||||||       ||||||
core3 | |||||||||||||       ||||||
SNPs3            |             |

Core genome alignments
∷ Core SNP alignments
: can shift dramatically with taxa content
: we are only using globally conserved sites
: remember variation still exists outside “core”
∷ Snippy will keep the full alignments
: quickly derive subsets on the fly
: adding isolates can be done quickly too
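Keeping the full per-isolate alignments is what makes subsetting cheap: the core and its SNPs can be recomputed for any chosen set of isolates with one pass over the columns. A minimal sketch using the three-genome example from the earlier slides follows. `core_summary` is a hypothetical helper, and ambiguous bases are counted as differences here, as in the "SNPs" rows of the diagrams.

```python
ALIGNMENT = {
    "bug1": "GATTACCAGCATTAAGG-TTCTCCAATC",
    "bug2": "GAT---CTGCATTATGGATTCRNCATTC",
    "bug3": "G-TTACCAGCACTAA-------CCAGTC",
}


def core_summary(aln, taxa):
    """Return (core_sites, core_snps) for the chosen subset of isolates."""
    core_sites = 0
    core_snps = 0
    for column in zip(*(aln[name] for name in taxa)):
        if "-" in column:
            continue  # a gap in any chosen isolate removes the site from the core
        core_sites += 1
        if len(set(column)) > 1:
            core_snps += 1
    return core_sites, core_snps


for taxa in (["bug1", "bug2", "bug3"], ["bug1", "bug2"],
             ["bug2", "bug3"], ["bug1", "bug3"]):
    print(taxa, core_summary(ALIGNMENT, taxa))
```

Dropping one isolate changes both the size of the core and the number of SNPs found in it, which is the shift illustrated on the "Remove taxon" slides.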

Snippy summary
∷ The good
: Fast, scales to 100 cores
: Simple, clean interface and output
∷ The bad
: Doesn’t yet report full variant consequences using snpEff
∷ The ugly?
: Written in Perl

Contact
∷ tseemann.github.io
∷ github.com/tseemann/snippy
∷ @torstenseemann
