1. Shaman Narayanasamy
Eco-Systems Biology Group
Supervisors: Paul Wilmes and Jorge Goncalves
PHD-2014-1/7934898
Computational approaches to predict
bacteriophage-host relationships
Robert A. Edwards, Katelyn McNair, Karoline Faust, Jeroen Raes, Bas E. Dutilh
Review Article, FEMS Microbiology Reviews (9 December 2015)
Computational Biology Pizza Club series: 25th May 2016
2. Article overview
• Metagenomics for identification of viral-host associations
• Introduction of wet-lab methods
• Focused on bacteriophages (phages) and bacterial
interactions
• Benchmark data: 820 bacteriophages, associated hosts and
publicly available metagenomic datasets
• Assessment of predictive power of in silico phage-host
signals:
– Abundance-based methods
– Sequence homology based methods
– Genetic homology
– CRISPRs
– Oligonucleotide profiles
– Compositional based methods
12. Experimental approaches for phage isolation
• Spot and plaque assays
• Liquid assays
• Viral tagging
• Microfluidic PCR
• PhageFISH
• Single cell sequencing
• Hi-C sequencing
13. Spot and plaque assays
Requires
• Pure culture of host
• Pure/environmental culture of phage
Disadvantages
• Low throughput
• Host isolation required
Photo adapted and modified from http://www.slideshare.net/Adrienna/global-food-safety2013
14. Liquid assays
Requires
• Pure culture of host
• Pure culture of phage
Disadvantages
• Use of OD readout *
• Low sensitivity (single endpoint values) *
• Host and phage isolate required
* Use redox dye, Omnilog platform and real-time/semiquantitative PCR
Figure adapted and modified from Goldberg et al. (2014)
15. Viral tagging
Requires
• Pure culture of host
• Pure culture/environmental isolate of phages
• Cell sorter (e.g. FACS)
Disadvantages
• Host isolate required
Figure adapted and modified from http://jgi.doe.gov/dyeing-learn-marine-viruses/
16. Microfluidic PCR
Requires
• Environmental microbial community sample
• PCR primers for target marker genes
Disadvantages
• Relies on marker genes for design of PCR primers
Figure adapted and modified from Dang & Sullivan (2014)
17. PhageFISH
Figures adapted and modified from Dang & Sullivan (2014) and Allers et al. (2013)
Requires
• Environmental microbial community sample
• PCR primers for target marker genes
Disadvantages
• Relies on marker genes for FISH probe design
18. Single cell sequencing
Requires
• Single microbial cell from environmental microbial community sample
Disadvantages
• Biased towards most abundant environmental microbe
Figure adapted and modified from Lasken (2012)
20. Quality assessment of predictions: ROC curves
• Assessment of binary classifier (Host/Not Host)
• Does not require cut-off value
• Based on the rate of accumulation of true and false positives
• True positive rate (Sensitivity), False positive rate (1-Specificity)
TPR = TP / (TP + FN)    FPR = FP / (FP + TN)
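As a minimal sketch of how such a curve is built, the Python below sweeps the decision threshold over a ranked list of candidate hosts and records one (FPR, TPR) point per distinct score. The function names `roc_points` and `auc`, and the toy scores and labels, are illustrative assumptions and are not taken from the review.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping the threshold.

    scores: higher means "more likely host".
    labels: 1 = true host, 0 = not host.
    Assumes at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Rank candidates from most to least confident prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last candidate of a tied score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example (made-up values): four candidate hosts scored for one phage.
scores = [0.9, 0.8, 0.4, 0.1]
labels = [1, 0, 1, 0]
curve = roc_points(scores, labels)
print(curve)
print(auc(curve))  # 0.75
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is why no single cut-off value is needed to compare methods.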
22. Abundance profiles
• Stern et al. (2012)
– Good correlation of phage-host abundance across human gut microbiome (metagenomes)
• Reyes et al. (2013)
– 2/5 phages correspond to decrease in host abundance (mouse gut)
• Nielsen et al. (2014)
– Occurrence of phage like gene sets corresponding to host (bacterial) gene set
– Includes known phage-host pairs
• Dutilh et al. (2014)
– 22% of metagenomic reads may be of phage origin
• Lima-Mendez et al. (2015); Tara Oceans survey
Figure adapted and modified from Nielsen et al. (2014) and Edwards et al. (2015)
• Improves with the availability of multiple samples from same/similar environments
• High spatio/temporal stratification; will improve as publicly available metagenome collection increases
• Time series datasets potentially used for time lagged associations
• Complicated and non-linear dynamics incompatible with straightforward correlation
• 12% correct identification of host
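The co-abundance idea can be sketched as follows: correlate one phage's abundance across samples with each candidate host's abundance and pick the best-correlated host. This is a simplified illustration using plain Pearson correlation; the function names and the abundance values are made up for the example, and it deliberately ignores the time lags and non-linear dynamics noted above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long abundance vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def predict_host(phage_profile, host_profiles):
    """Return the host whose abundance profile correlates best with the phage."""
    return max(host_profiles,
               key=lambda host: pearson(phage_profile, host_profiles[host]))


# Hypothetical relative abundances across four metagenomic samples.
phage = [1.0, 4.0, 2.0, 8.0]
hosts = {
    "host_A": [2.0, 8.0, 4.0, 16.0],
    "host_B": [9.0, 1.0, 5.0, 2.0],
}
print(predict_host(phage, hosts))
```

With only a handful of samples such correlations are easily spurious, which is consistent with the point that this signal improves as more samples from similar environments become available.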
23. Genetic homology
• Phage-host homology is an indication of recent common ancestry, implying interaction
• Host genes may benefit phages!
• Auxiliary metabolic genes
• Modi et al. (2013) and Dutilh et al. (2014)
Figure adapted and modified from Edwards et al. (2015)
• Amino acid based searches applicable for distantly related organisms (29.8%)
• Nucleotide based searches more accurate (38.5%)
• 30% host identified
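A homology-based prediction typically reduces to picking the best-scoring database hit per phage. The sketch below assumes BLAST results in the standard 12-column tabular layout (`-outfmt 6`: query, subject, percent identity, ..., e-value, bit score) and keeps the subject with the highest bit score for each query. The function name `best_hits` and the example lines are hypothetical.

```python
def best_hits(blast_tab_lines):
    """Map each query to its (subject, bitscore) with the highest bit score.

    Expects tab-separated lines in the default 12-column tabular layout,
    where column 1 is the query, column 2 the subject, column 12 the bit score.
    """
    best = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        bitscore = float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Made-up hits for one phage against two candidate host genomes.
example = [
    "phage_1\thost_A\t98.5\t200\t3\t0\t1\t200\t5001\t5200\t1e-95\t350.0",
    "phage_1\thost_B\t85.0\t120\t18\t0\t10\t129\t300\t419\t1e-20\t110.0",
]
print(best_hits(example))
```

Because the result depends entirely on which host genomes are in the database, a missing host simply yields the next-best (wrong) hit rather than no prediction.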
25. CRISPRs
• Studies:
– Human gut microbiome; Stern et al. (2012), Minot et al. (2013)
– Acidophilic biofilms; Andersson & Banfield (2008)
– Cow rumen; Berg Miller et al. (2012)
– Arctic glacial ice and soil; Sanguino et al. (2015)
– Marine environments; Anderson, Brazelton & Baross (2011), Cassman et al. (2012)
– Activated sludge; Narayanasamy et al. (unpublished)
• Little to no homology to known sequence
• Environmentally dependent
• Spacers are rapidly replaced
• Most suitable for recent phage-host interactions
• Not all prokaryotes encode CRISPRs (bacteria; 48 ± 30%, archaea; 63 ± 30%)
• Highly specific, but not sensitive
• Degeneracy of up to 13 mismatches allowed (Fineran et al., 2014)
Figure adapted and modified from Edwards et al. (2015)
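Spacer-based host prediction can be illustrated with an ungapped scan of each host's spacers against a phage genome on both strands, tolerating a chosen number of mismatches. This is a naive sketch for short toy sequences, not how large-scale searches are run in practice; all names and sequences below are invented for the example.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def hamming_hits(spacer, genome, max_mismatches=0):
    """List (position, mismatches) for ungapped spacer placements in genome."""
    hits = []
    for start in range(len(genome) - len(spacer) + 1):
        window = genome[start:start + len(spacer)]
        mismatches = sum(1 for a, b in zip(spacer, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


def hosts_targeting(phage_genome, host_spacers, max_mismatches=0):
    """Count, per host, the spacers that match the phage on either strand."""
    result = {}
    for host, spacers in host_spacers.items():
        matched = 0
        for spacer in spacers:
            forward = hamming_hits(spacer, phage_genome, max_mismatches)
            reverse = hamming_hits(reverse_complement(spacer), phage_genome,
                                   max_mismatches)
            if forward or reverse:
                matched += 1
        if matched:
            result[host] = matched
    return result


# Toy data: host_A carries an exact spacer, host_C one with a single mismatch.
phage = "ATGCGTACGTTAGCCTAGGATCCA"
spacers = {
    "host_A": ["CGTACGTTAG"],
    "host_B": ["TTTTTTTTTT"],
    "host_C": ["CGTACCTTAG"],
}
print(hosts_targeting(phage, spacers))
print(hosts_targeting(phage, spacers, max_mismatches=1))
```

Raising `max_mismatches` trades specificity for sensitivity, which mirrors the observation that the signal is highly specific but finds few hosts.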
26. Exact matches
• Integration of phage to host via homologous recombination
• attP (POP’) on phage genome and attB (BOB’) on bacterial genome
• Common identical core sequence (2-15 bp) between phage and host
• Adjacent to integrase gene in phage genome, near tRNA gene in bacterial genomes
Figure adapted and modified from Edwards et al. (2015)
• Longer matches more reliable
• Up to ~40% of hosts predicted correctly
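One simple way to look for such shared sequences is to find the longest exact substring common to a phage genome and each candidate host, then rank hosts by match length. The dynamic-programming sketch below is quadratic in sequence length and only checks the forward strand, so it is suitable for toy inputs only; the names and sequences are illustrative assumptions.

```python
def longest_exact_match(a, b):
    """Return the longest substring shared by a and b (forward strand only)."""
    best_len = 0
    best_end = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous position pair.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return a[best_end - best_len:best_end]


def rank_hosts(phage_genome, host_genomes):
    """Sort hosts by the length of their longest exact match, longest first."""
    matches = {host: longest_exact_match(phage_genome, genome)
               for host, genome in host_genomes.items()}
    return sorted(matches.items(), key=lambda item: len(item[1]), reverse=True)


# Toy sequences: host_A shares an 11 bp stretch with the phage.
phage = "GGATTACCGTTGACCA"
hosts = {
    "host_A": "TTTACCGTTGACGGG",
    "host_B": "CATCATCATCAT",
}
print(rank_hosts(phage, hosts))
```

The 2 bp match to `host_B` shows why short exact matches are uninformative: they occur by chance between almost any two sequences.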
27. Oligonucleotide profiles
[Figure legend: contig with cas gene; contig with known phage gene; contig with CRISPR locus]
• Phages ameliorate their genomic oligonucleotide profiles towards those of their hosts
• Avoid recognition by restriction enzymes
• Adjustment of codon usage to match available host tRNAs
• Ogilvie et al. (2013) identified 408 metagenomic fragments with phage like properties (4mers)
Figure adapted and modified from Narayanasamy et al. (unpublished) and Edwards et al. (2015)
• Profiles must not be too sparse, which favours shorter k-mers
• k = 3-8 predicted the correct host for 8-17% of phages
• Codon usage predicted the correct host for ~10% of phages
• GC content not informative
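To make the profile comparison concrete, the sketch below builds normalised k-mer frequency profiles and assigns a phage to the host with the smallest Euclidean distance between profiles. The choice of Euclidean distance, the function names, and the toy sequences are assumptions made for illustration; the review may use a different distance measure.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=4):
    """Relative frequency of every k-mer observed in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def euclidean(p, q):
    """Euclidean distance between two sparse frequency profiles."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(key, 0.0) - q.get(key, 0.0)) ** 2 for key in keys))


def closest_host(phage_seq, host_seqs, k=4):
    """Return the host whose k-mer profile is nearest to the phage profile."""
    phage_profile = kmer_profile(phage_seq, k)
    return min(host_seqs,
               key=lambda host: euclidean(phage_profile,
                                          kmer_profile(host_seqs[host], k)))


# Toy sequences: the phage is AT-rich like host_A, unlike GC-rich host_B.
phage = "ATATATATGCATATAT"
hosts = {
    "host_A": "ATATATATATATATAT",
    "host_B": "GCGCGGCGCCGCGGCC",
}
print(closest_host(phage, hosts, k=2))
```

On real genomes a larger `k` gives more discriminating profiles, but on short contigs most k-mers are then never observed, which is the sparsity problem noted above.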
28. Summary and overview
Signal category | Approach | Performance | Comments
Abundance profiles | Phage-host co-abundance profiles; association by correlation | 9.5% | Non-linear dynamics confound correlations
Genetic homology | Phage-host nucleotide and protein sequence homology | 38.5% (blastn); 29.8% (blastx) | Depends on database
CRISPRs | Spacer alignments to phage genomes | 15.1% (most similar); 21.3% (highest) | Occurrence of CRISPR system (~40% bacteria, ~70% archaea); no matches; not sensitive
Exact matches ** | Exact matches between phage and host genomes | 40.5% | Short exact matches may be random
Oligonucleotide profiles | Similarity of phage and host k-mer profiles | 17.2% (4-mer); 10.4% (codon) |
Table adapted and modified from Edwards et al. (2015)
29. Summary and overview
• Blastn and exact matches provide strongest signal
• Most methods predict between 1 - 4 bacteria as most likely host (better than random)
• Significant host genome fraction required (except for abundance-based method)
• Current knowledge still limited
• Phage host range (highly specific vs broad range)
• New methods and technology
Figure adapted and modified from Edwards et al. (2015)