Pathogen Genome Data
EMBL-EBI Bioinformatics of Plants and Plant Pathogens, 23rd May 2016
Leighton Pritchard [1,2,3]
[1] Information and Computational Sciences,
[2] Centre for Human and Animal Pathogens in the Environment,
[3] Dundee Effector Consortium,
The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
Acceptable Use Policy
Recording this talk, taking photos, and discussing the content using
email, Twitter, blogs, etc. is permitted (and encouraged),
provided that distraction to others is minimised.
These slides will be made available on SlideShare.
These slides, and supporting material including exercises, are
available at https://github.com/widdowquinn/Teaching-EMBL-Plant-Path-Genomics
Table of Contents
1 Introduction
Pathogen Genome Data
2 Public Genome Data Sources
Online Resources
3 Comparative Genomics
Why Comparative Genomics?
Whole Genome Comparisons
Feature Comparisons
4 Effector Prediction
Effector Characteristics
Introduction
What can pathogen genome data do for you?
Combining genomic data with comparative and evolutionary biology
addresses questions of pathogen evolution, adaptation and
lifestyle.
NCBI
http://www.ncbi.nlm.nih.gov/
Repository of record for pathogen (and other) genome data
Example: Ralstonia solanacearum
Browser interface
FTP repositories of genome data
- RefSeq
- GenBank
GenBank vs RefSeq
GenBank
- part of the International Nucleotide Sequence Database Collaboration (INSDC): EMBL/NCBI/DDBJ
- records ‘owned’ by submitter
- may include redundant information
RefSeq
- not part of INSDC
- records derived from GenBank, ‘owned’ by NCBI
- stable non-redundant foundation for functional and diversity studies
Ensembl
http://www.ensembl.org
Automated annotation on selected genomes
Specialised sub-collections
Ensembl Protists: http://protists.ensembl.org/
Ensembl Bacteria: http://bacteria.ensembl.org/
Ensembl Fungi: http://fungi.ensembl.org/
Downloadable resources
e.g. ftp://ftp.ensemblgenomes.org/pub/protists/
Ready-made comparative genomics!
Phytophthora genomics alignments (Avr3a)
Gene trees (Avr3a)
Other Sources
Sequencing centres, e.g.
JGI Genome Portals
Broad Institute - now retiring their online resources
Specialist databases, e.g.
PhytoPath - plant pathogens
PHI-Base 4 - curated plant-pathogen interaction data
FungiDB - fungi and oomycetes
CPGR - fungi and oomycetes (not recently updated)
Your friendly local sequencing centre!
Aspera is commonly used to connect to your private data
OPTIONAL WORKSHEET
worksheets/01-downloading_data_biopython.ipynb
Downloading genome data from NCBI with Biopython
MyBinder link
Why comparative genomics?
- Transfer functional information from model systems (E. coli, A. thaliana, D. melanogaster) to non-model systems
- Genome similarity ∝ phenotype? (functional genomics): virulence and host range
- Genome similarity ∝ relatedness? (phylogenomics): record of evolutionary processes and constraints
Genomes aren’t everything...
Context
- epigenetics
- tissue differentiation/differential expression
- mesoscale systems, etc.
Phenotypic plasticity, responses to
- temperature
- stress
- community, etc.
...and therefore systems biology...
Levels of comparison
Bulk Properties
e.g. k-mer spectra (Mash, MetaPalette, etc.)
Whole Genome Sequence
sequence similarity (BLAST, BLAT, MUMmer, etc.)
structure and organisation (Mauve, ACT, etc.)
Genome Features/Functional Components
numbers and types of features: genes, ncRNA, regulatory
elements, etc.
organisation of features: synteny, operons, regulons, etc.
functional complement (KEGG, etc.)
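As an illustration of a bulk-property comparison, the sketch below computes an exact Jaccard similarity between the k-mer sets of two sequences. This is a simplified stand-in: Mash estimates this quantity from MinHash sketches rather than from full k-mer sets, and the function names and the default k are assumptions made for this example only.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings (k-mers) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a, seq_b, k=4):
    """Jaccard similarity of two k-mer sets: |A & B| / |A | B|."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)
```

For real genomes k is usually much larger than 4, and both strands would normally be considered; neither refinement is shown here.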
Whole genome comparisons
Whole genome comparison
Comparisons of one complete or draft genome with another
(...or many others)
Minimum requirement: two genomes
Reference Genome
Comparator Genome
The experiment produces a comparative result that is dependent
on the choice of genomes.
Pairwise genome alignments
Pairwise comparisons produce alignments of similar regions.
Synteny and Collinearity
Genome rearrangements may occur after species divergence
Sequence similarity, and the order of similar regions, may be conserved
- collinear: conserved elements lie in the same linear order
- syntenous (or syntenic) elements:
  - (orig.) lie on the same chromosome
  - (mod.) are collinear
Evolutionary constraint (e.g. indicated by synteny) may indicate
functional constraint (and help determine orthology)
Vibrio mimicus [a]
[a] Hasan et al. (2010) Proc. Natl. Acad. Sci. USA 107:21134-21139 doi:10.1073/pnas.1013825107
Chromosomes
C-I: virulence genes.
C-II: environmental adaptation
C-II has undergone extensive rearrangement; C-I has not.
Suggests modularity of genome organisation, as a mechanism for
adaptation (HGT, two-speed genome).
Serratia symbiotica [a]
[a] Burke and Moran (2011) Genome Biol. Evol. 3:195-208 doi:10.1093/gbe/evr002
S. symbiotica is a recently adapted symbiont of aphids
Massive genomic decay: consequence of adaptation
Whole genome classification [a] [b]
[a] Baltrus (2016) Trends Microbiol. doi:10.1016/j.tim.2016.02.004
[b] Pritchard et al. (2016) Anal. Methods doi:10.1039/c5ay02550h
Widespread confusion about bacterial strain classification and
nomenclature
Taxonomies contradicted by bioinformatic classification
Databases populated by non-taxonomists
Philosophy and practice of taxonomy are in conflict
Classification can be independent of existing nomenclature
The route from genotype to phenotype is complicated
Time to abandon traditional bacterial species concepts?
An unambiguous sequence-based classification scheme is
possible
DNA-DNA hybridisation [a]
[a] Rossello-Mora and Amann (2001) FEMS Microbiol. Rev. doi:10.1016/S0168-6445(00)00040-1
“Gold Standard” for prokaryotic taxonomy since the 1960s: “70% identity ≈ same species.”
- Denature DNA from two organisms.
- Allow to anneal.
- Reassociation ≈ similarity, measured as ∆T of denaturation curves.
Proxy for sequence similarity: replace with genome analysis? [1]
[1] Chan et al. (2012) BMC Microbiol. doi:10.1186/1471-2180-12-302
Average Nucleotide Identity (ANIm) [a]
[a] Richter and Rossello-Mora (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106
1. Align genomes (MUMmer)
2. ANIm: mean % identity of all matches
DDH:ANIm relationship is linear
70% ID (DDH) ≈ 95% ANIb
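The two ANIm steps can be sketched as a small calculation over the aligned regions reported by a whole-genome aligner. This is a minimal sketch, not the published method: it assumes the matches are already parsed into (aligned length, percent identity) pairs, it assumes the regions do not overlap, and it weights the mean by aligned length, which is one reasonable reading of "mean % identity of all matches". The function name and input layout are hypothetical.

```python
def anim(matches, genome_length):
    """Summarise pairwise alignment matches.

    matches: list of (aligned_length, percent_identity) tuples, assumed
             to describe non-overlapping aligned regions.
    genome_length: total length of the query genome in bases.
    Returns (length-weighted mean percent identity, alignment coverage).
    """
    total_aligned = sum(length for length, _ in matches)
    if total_aligned == 0:
        return 0.0, 0.0
    identity = sum(length * pid for length, pid in matches) / total_aligned
    coverage = total_aligned / genome_length
    return identity, coverage
```

Reporting coverage alongside identity matters because a high identity computed over a small aligned fraction says little about the genome as a whole.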
55 Pectobacterium spp. ANIm [a]
[a] Pritchard et al. (2016) Anal. Methods doi:10.1039/c5ay02550h
- Ten species-level groups (four novel)
- P. carotovorum split: several species
- P. wasabiae split: two species
[Figure: heatmap of pairwise ANIm percentage identity for the 55 Pectobacterium isolates; rows and columns are isolate names; colour scale 0.00 to 1.00 (ANIm_percentage_identity)]
55 Pectobacterium spp. ANIm [a]
[a] Pritchard et al. (2016) Anal. Methods doi:10.1039/c5ay02550h
- All isolates align over >50% of whole genome
[Figure: heatmap of pairwise ANIm alignment coverage for the 55 Pectobacterium isolates; rows and columns are isolate names; colour scale 0.00 to 1.00 (ANIm_alignment_coverage)]
ANI
Advantages
- Average identity of all ‘homologous’ regions
- Approximates limiting case of MLST/MLSA/multigene comparisons
- Adaptable to variable thresholding (LINs) classifications
Criticisms
- Thresholds ‘arbitrary’, based on homologous regions only
- Taxonomic classification, not phylogenetic reconstruction
- No functional (or gene-based) interpretation; still need pangenome classification and analysis
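Threshold-based classification from an ANI matrix can be sketched as single-linkage grouping: two genomes fall in the same group if they are connected by a chain of pairwise identities at or above the threshold. The isolate names, the dictionary layout and the choice of single linkage are assumptions for illustration; other linkage rules give different groups.

```python
def group_by_threshold(names, identity, threshold=0.95):
    """Group genomes whose pairwise identity meets the threshold.

    names: list of genome names.
    identity: {(name_a, name_b): fractional identity} for each pair.
    Returns a sorted list of sorted groups (single-linkage clusters).
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), value in identity.items():
        if value >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

Running the same function at several thresholds gives nested groupings, which is the idea behind variable-threshold schemes.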
EXERCISE
exercises/01-whole_genome_comparisons.ipynb
Pairwise comparison of Pseudomonas genomes
ANIm classification of Pseudomonas isolates
MyBinder link
Chromosome painting [a]
[a] Yahara et al. (2013) Mol. Biol. Evol. 30:1454-1464 doi:10.1093/molbev/mst055
- “Chromosome painting” (FINESTRUCTURE) infers recombination-derived ‘chunks’
- Genome’s haplotype constructed in terms of recombination events from a ‘donor’ to a ‘recipient’ genome
- Recombination events summarised in a coancestry matrix.
- H. pylori: most within geographical bounds, but asymmetrical donation from Amerind/East Asian to European isolates.
Feature comparisons
Feature comparisons
Comparisons of the annotated features of one genome with another
(...or many others)
- gene features
- RNA features
- regulatory features
Equivalent features
The power of genomics is comparative genomics!
- Makes catalogues of genome components comparable between organisms
- Differences, e.g. presence/absence of equivalents, may support hypotheses for functional or phenotypic difference
- Can identify characteristic signals for diagnosis/epidemiology
- Can build parts lists and wiring diagrams for systems and synthetic biology
Orthologues [a] [b]
[a] Nehrt et al. (2011) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002073
[b] Chen and Zhang (2012) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002784
Orthologs/Orthologues
- “Homologs that diverged through speciation” (orig.)
- “Genes/products we think are probably the same thing” (mod. inform.)
Why orthologues? [a] [b] [c]
[a] Chen and Zhang (2012) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002784
[b] Dessimoz (2011) Brief. Bioinf. doi:10.1093/bib/bbr057
[c] Altenhoff and Dessimoz (2009) PLoS Comp. Biol. 5:e1000262 doi:10.1371/journal.pcbi.1000262
Formalise the idea of corresponding genes in different organisms.
Suggest two relationships:
- Evolutionary equivalence
- Functional equivalence (“The Ortholog Conjecture”)
The Ortholog Conjecture
Without duplication, a gene product is unlikely to change its basic
function, because this would lead to loss of the original function,
and this would be harmful.
Finding orthologues [a]
[a] Salichos and Rokas (2011) PLoS One 6:e18755 doi:10.1371/journal.pone.0018755.g006
Which discovery method performs best?
- Four methods tested against 2,723 curated orthologues from six Saccharomycetes: RBBH (and cRBH); RSD (and cRSD); MultiParanoid; OrthoMCL
- Rated by statistical performance metrics: sensitivity, specificity, accuracy, FDR
- cRBH most accurate and specific, with lowest FDR.
EXERCISE
exercises/02-cds_feature_comparisons.ipynb
RBBH analysis of Pseudomonas CDS feature annotations
MyBinder link
One-way BLAST vs RBBH
- One-way BLAST includes many low-quality hits
- Reciprocal best BLAST removes many low-quality matches
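The reciprocal filter can be sketched as follows, assuming the BLAST results for each direction have already been parsed into (query, subject, bitscore) tuples. The identifiers and the use of bitscore as the ranking criterion are assumptions for this example; the exercise notebook may rank and filter hits differently.

```python
def best_hits(hits):
    """Return {query: best-scoring subject} from (query, subject, score) tuples."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def rbbh(hits_ab, hits_ba):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(hits_ab)   # genome A queries against genome B
    reverse = best_hits(hits_ba)   # genome B queries against genome A
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

A one-way search would keep every entry in `forward`; the reciprocal test discards pairs where the relationship does not hold in both directions.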
The Pangenome
The Core Genome Hypothesis
“The core genome is the primary cohesive unit defining a bacterial
species”
Once equivalent genes have been identified, those present in
all related isolates can be identified: the core genome.
The remaining genes are the accessory genome, and are
expected to mediate functions that distinguish between
isolates.
Roary: Rapid large-scale prokaryote pan-genome analysis - works
on a desktop machine.
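Given a presence/absence table of equivalent genes, the core and accessory genomes reduce to set operations. The sketch below assumes each isolate is already described by a set of gene-family identifiers (all names are hypothetical); it is not how Roary is implemented, only the definition expressed as code.

```python
def pangenome(gene_families):
    """Split gene families into core and accessory sets.

    gene_families: {isolate name: set of gene-family identifiers}.
    Returns (core, accessory): families present in every isolate,
    and families present in at least one but not all isolates.
    """
    family_sets = list(gene_families.values())
    if not family_sets:
        return set(), set()
    pan = set().union(*family_sets)           # everything seen anywhere
    core = set.intersection(*family_sets)     # present in all isolates
    return core, pan - core
```

The core set shrinks as more isolates are added, so core genome size is always relative to the isolates chosen.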
Accessory genome [a] [b]
[a] Croll and McDonald (2012) PLoS Path. 8:e1002608 doi:10.1371/journal.ppat.1002608
[b] Baltrus et al. (2011) PLoS Path. 7:e1002132 doi:10.1371/journal.ppat.1002132
Accessory genomes
A cradle for adaptive evolution, particularly for bacterial
pathogens, such as Pseudomonas spp.
OPTIONAL WORKSHEET
worksheets/02-prokka_roary.ipynb
Annotation of pathogen genomes with Prokka
Calculation of the Pantoea agglomerans pangenome with
Roary
MyBinder link
What is an effector?
What is an effector?
Effector
A molecule produced by a pathogen that (directly?) modifies host
molecular/biochemical ‘behaviour’
- Inhibits enzyme action (e.g. Cladosporium fulvum AVR2, AVR4; Phytophthora infestans EPIC1, EPIC2B; P. sojae glucanase inhibitors)
- Cleaves a protein target (e.g. Pseudomonas syringae AvrRpt2)
- (De-)phosphorylates a protein target (e.g. P. syringae AvrRpm1, AvrB)
- Retargets a host system such as E3 ligase (e.g. P. syringae AvrPtoB; P. infestans Avr3a)
- Exerts regulatory control (e.g. Xanthomonas campestris AvrBs3)
What is an effector?
No unifying biochemical mechanism
No single test for ‘candidate effectors’, even in one organism
Effectors are modular [a] [b]
[a] Greenberg & Vinatzer (2003) Curr. Opin. Microbiol. doi:10.1016/S1369-5274(02)00004-8
[b] Collmer et al. (2002) Trends Microbiol. doi:10.1016/S0966-842X(02)02451-4
Delivery
N-terminal localisation/translocation domain
Activity
C-terminal functional/interaction domain
Effectors are modular [a] [b]
[a] Dong et al. (2011) PLoS One doi:10.1371/journal.pone.0020172.t004
[b] Boch et al. (2009) Science doi:10.1126/science.1178811
Delivery
Typically common to effector class: RxLR, T3E, CHxC
Activity
May be common (TAL) or divergent within effector class (RxLR, T3E)
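A shared N-terminal delivery signal suggests a simple first-pass screen: look for the class motif near the start of the protein. The sketch below searches for an RxLR motif (Arg, any residue, Leu, Arg) inside a residue window. The window bounds are hypothetical defaults chosen for illustration, and a motif match alone is weak evidence; real predictors usually combine it with other signals, such as a predicted signal peptide.

```python
import re

# R, any single residue, L, R
RXLR = re.compile(r"R.LR")


def find_rxlr(protein, start=20, end=100):
    """Return the 0-based position of the first RxLR motif lying entirely
    within residues [start, end) of the protein, or None if absent.

    The default window is an assumption for this example, not a
    published threshold.
    """
    match = RXLR.search(protein.upper(), start, end)
    return match.start() if match else None
```

Because only the N-terminal window is searched, the divergent C-terminal activity domain does not affect the result.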
Effector prediction tools (online) [a] [b]
[a] Sperschneider et al. (2015) PLoS Pathogens doi:10.1371/journal.ppat.1004806
[b] Sonah et al. (2016) Front. Plant Sci. doi:10.3389/fpls.2016.00126
Bacterial Type III Effectors
EffectiveT3
modlab
T3SEdb
Fungal/Oomycete Effectors
EffectorP
Galaxy Toolshed RxLR predictor
What do we look for? [a]
[a] Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
What if someone hasn’t built a classifier for your protein family?
- Tests are for protein family membership and/or ‘effector-like’ functional signal
- The same as any sequence classification problem (functional annotation)
- Many possible approaches
- (Supervised) machine learning problem: train, test, validate
Sequence space [a]
[a] Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
Known members of our effector class are in red
Similarity distance [a]
Define a representative centre, and a distance from it that includes
known effectors
Classify candidates [a]
Classify sequences within the distance as similar
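The centre-and-distance idea can be sketched with a deliberately simple distance: the fraction of mismatched positions between pre-aligned, equal-length sequences. This is a stand-in technique, not the profile HMM used in the exercise; the choice of the medoid as the representative centre and of the radius as the largest distance to a known member are assumptions for illustration.

```python
def mismatch_fraction(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def fit(known):
    """Choose a centre (the medoid of the known members) and a radius
    just large enough to include every known member."""
    centre = min(known, key=lambda s: sum(mismatch_fraction(s, t) for t in known))
    radius = max(mismatch_fraction(centre, t) for t in known)
    return centre, radius


def classify(candidate, centre, radius):
    """True if the candidate lies within the radius of the centre."""
    return mismatch_fraction(candidate, centre) <= radius
```

With this radius every known effector is classified as ‘in’ by construction, so performance on the training members says nothing about how the boundary treats unseen sequences.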
EXERCISE
exercises/03-effector_finding.ipynb
Downloading annotated Pseudomonas AvrPto1 effectors from
a public sequence repository
Building a (HMM) model from this training set
Searching public genome annotations with the model
MyBinder link
Choosing a distance [a]
[a] Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
- How do we define distance?
- How large a distance should we take?
- How do we know if we chose well?
Are you in or out? [a]
[a] Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
- The boundary (distance) classifies sequences as ‘in’ or ‘out’
- Sequences are predicted to be either in the class or not in the class
- Changing distance/boundary changes classification
TP/TN/FP/FN [a]
FPR/FNR/Sn/Sp/FDR [a]
Small boundary [a]
Medium boundary [a]
Large boundary [a]
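The abbreviations in the slide titles above follow the standard confusion-matrix definitions, sketched below for parallel lists of predicted and true class membership. The function names are hypothetical, and returning 0.0 when a denominator is zero is a convenience chosen for this example.

```python
def confusion(predicted, actual):
    """Count (TP, TN, FP, FN) from parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return tp, tn, fp, fn


def ratio(numerator, denominator):
    """Safe division; 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def rates(tp, tn, fp, fn):
    """Standard performance rates from confusion-matrix counts."""
    return {
        "sensitivity": ratio(tp, tp + fn),   # Sn: true positive rate
        "specificity": ratio(tn, tn + fp),   # Sp: true negative rate
        "fpr": ratio(fp, fp + tn),           # false positive rate
        "fnr": ratio(fn, fn + tp),           # false negative rate
        "fdr": ratio(fp, fp + tp),           # false discovery rate
    }
```

Widening the boundary moves sequences from ‘out’ to ‘in’, which can only raise sensitivity and the false positive rate together.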
Choosing a boundary [a]
[a] Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
- Assign known ‘positive’ and ‘negative’ examples
- Vary ‘distance’ and measure predictive performance (F-measure, AUC, ...)
- Choose the distance that gives the ‘best’ performance
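The boundary search can be sketched as a sweep over candidate distances, scoring each by the F-measure (here F1, the harmonic mean of precision and recall). The input layout, with one distance-from-centre and one known label per sequence, and the tie-breaking rule (first best candidate wins) are assumptions for this example.

```python
def f_measure(tp, fp, fn):
    """F1 score from counts; 0.0 if there are no positives at all."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_distance(distances, labels, candidates):
    """Return (distance, F1) for the candidate boundary with the highest F1.

    distances: distance of each labelled sequence from the centre.
    labels: True for known positives, False for known negatives.
    candidates: boundary distances to try.
    """
    best = None
    for boundary in candidates:
        pairs = list(zip(distances, labels))
        tp = sum(1 for d, y in pairs if d <= boundary and y)
        fp = sum(1 for d, y in pairs if d <= boundary and not y)
        fn = sum(1 for d, y in pairs if d > boundary and y)
        score = f_measure(tp, fp, fn)
        if best is None or score > best[1]:
            best = (boundary, score)
    return best
```

Which metric counts as ‘best’ is itself a choice: optimising F1 weighs false positives and false negatives differently from optimising accuracy or AUC.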
Crossvalidation [a]
[a] Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
Estimation of classifier performance depends on:
- Boundary choice/distance measure
- Composition of training set (‘positives’ and ‘negatives’)
Cross-validation gives objective estimate of performance
Many strategies (beyond today’s scope), including:
- leave-one-out (LOO)
- k-fold crossvalidation
- repeated (random) subsampling
Always validate against a hold-out set (not used to train the
classifier)
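A k-fold split can be sketched with the standard library alone. The generator below shuffles item indices once and yields one (train, test) pair per fold; the function name and the fixed default seed are assumptions for this example. Setting k equal to the number of items gives leave-one-out.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Every item appears in exactly one test fold. Any hold-out set
    should be set aside before calling this function.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k near-equal folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

The classifier is refitted on each training split and scored on the matching test split; the spread of the k scores indicates how sensitive performance is to training set composition.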
Post-crossvalidation [a]
[a] Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
- Crossvalidation gives ‘best’ method & parameters
- Apply ‘best’ method to complete dataset for prediction
BEWARE THE BASERATE FALLACY!
OPTIONAL WORKSHEET
worksheets/03-effector-finding.ipynb
The Baserate Fallacy in effector prediction and classification.
MyBinder link
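The base rate fallacy can be made concrete with Bayes' theorem: the probability that a predicted effector is a real one depends on how rare effectors are in the proteome, not only on the classifier's sensitivity and false positive rate. The numbers in the usage note are hypothetical and chosen only to show the effect.

```python
def positive_predictive_value(sensitivity, false_positive_rate, base_rate):
    """P(true effector | predicted effector) by Bayes' theorem.

    base_rate: fraction of all sequences that are true effectors.
    """
    true_positives = sensitivity * base_rate
    false_positives = false_positive_rate * (1 - base_rate)
    total = true_positives + false_positives
    return true_positives / total if total else 0.0
```

For a classifier with sensitivity 0.95 and false positive rate 0.05, applied where 1% of proteins are true effectors, `positive_predictive_value(0.95, 0.05, 0.01)` is about 0.16: most positive predictions would be false, despite apparently good test statistics.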
Licence: CC-BY-SA
By: Leighton Pritchard
This presentation is licensed under the Creative Commons
Attribution ShareAlike license
https://creativecommons.org/licenses/by-sa/4.0/

Más contenido relacionado

La actualidad más candente

[2013.09.27] extracting genomes from metagenomes
[2013.09.27] extracting genomes from metagenomes[2013.09.27] extracting genomes from metagenomes
[2013.09.27] extracting genomes from metagenomes
Mads Albertsen
 
Synthetic biology: Concepts and Applications
Synthetic biology: Concepts and ApplicationsSynthetic biology: Concepts and Applications
Synthetic biology: Concepts and Applications
USTC, Hefei, PRC
 
[13.07.07] albertsen mewe13 metagenomics
[13.07.07] albertsen mewe13 metagenomics[13.07.07] albertsen mewe13 metagenomics
[13.07.07] albertsen mewe13 metagenomics
Mads Albertsen
 

La actualidad más candente (20)

Metagenomic analysis
Metagenomic analysisMetagenomic analysis
Metagenomic analysis
 
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and MetagenomicsCross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
 
Bhojeshwari sahu
Bhojeshwari sahuBhojeshwari sahu
Bhojeshwari sahu
 
Reframing Phylogenomics
Reframing PhylogenomicsReframing Phylogenomics
Reframing Phylogenomics
 
Bayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation MetagenomicsBayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation Metagenomics
 
[2013.09.27] extracting genomes from metagenomes
[2013.09.27] extracting genomes from metagenomes[2013.09.27] extracting genomes from metagenomes
[2013.09.27] extracting genomes from metagenomes
 
Bioinformatics workshop presentation
Bioinformatics   workshop presentationBioinformatics   workshop presentation
Bioinformatics workshop presentation
 
David
DavidDavid
David
 
Microbial Agrogenomics 4/2/2015, UK-MX Workshop
Microbial Agrogenomics 4/2/2015, UK-MX WorkshopMicrobial Agrogenomics 4/2/2015, UK-MX Workshop
Microbial Agrogenomics 4/2/2015, UK-MX Workshop
 
Targeted RNA Sequencing, Urban Metagenomics, and Astronaut Genomics
Targeted RNA Sequencing, Urban Metagenomics, and Astronaut GenomicsTargeted RNA Sequencing, Urban Metagenomics, and Astronaut Genomics
Targeted RNA Sequencing, Urban Metagenomics, and Astronaut Genomics
 
Polyketide Synthase type III Isolated from Uncultured Deep-Sea Proteobacteriu...
Polyketide Synthase type III Isolated from Uncultured Deep-Sea Proteobacteriu...Polyketide Synthase type III Isolated from Uncultured Deep-Sea Proteobacteriu...
Polyketide Synthase type III Isolated from Uncultured Deep-Sea Proteobacteriu...
 
Analysis of binning tool in metagenomics
Analysis of binning tool in metagenomicsAnalysis of binning tool in metagenomics
Analysis of binning tool in metagenomics
 
Synthetic Biology & Global Health - Claire Marris
Synthetic Biology & Global Health - Claire MarrisSynthetic Biology & Global Health - Claire Marris
Synthetic Biology & Global Health - Claire Marris
 
Cross-Disciplinary Biomedical Research at Calit2
Cross-Disciplinary Biomedical Research at Calit2Cross-Disciplinary Biomedical Research at Calit2
Cross-Disciplinary Biomedical Research at Calit2
 
Paper-based synthetic gene networks
Paper-based synthetic gene networksPaper-based synthetic gene networks
Paper-based synthetic gene networks
 
Synthetic biology: Concepts and Applications
Synthetic biology: Concepts and ApplicationsSynthetic biology: Concepts and Applications
Synthetic biology: Concepts and Applications
 
[13.07.07] albertsen mewe13 metagenomics
[13.07.07] albertsen mewe13 metagenomics[13.07.07] albertsen mewe13 metagenomics
[13.07.07] albertsen mewe13 metagenomics
 
OBC | Synthetic biology announcing the coming technological revolution
OBC | Synthetic biology announcing the coming technological revolutionOBC | Synthetic biology announcing the coming technological revolution
OBC | Synthetic biology announcing the coming technological revolution
 
Building an Information Infrastructure to Support Microbial Metagenomic Sciences
Building an Information Infrastructure to Support Microbial Metagenomic SciencesBuilding an Information Infrastructure to Support Microbial Metagenomic Sciences
Building an Information Infrastructure to Support Microbial Metagenomic Sciences
 

Destacado

Deltanet Project - Final Report
Deltanet Project - Final ReportDeltanet Project - Final Report
Deltanet Project - Final Report
Gwilym Owen
 
Listas oficiales 2doa segundo a
Listas oficiales 2doa   segundo aListas oficiales 2doa   segundo a
Listas oficiales 2doa segundo a
Ciencias México
 
Link para encusta en línea contestar por favor
Link para encusta en línea contestar por favorLink para encusta en línea contestar por favor
Link para encusta en línea contestar por favor
naty leyva
 
Presentacion para el doc copia
Presentacion  para el doc   copiaPresentacion  para el doc   copia
Presentacion para el doc copia
martinez122
 
Sabha Boxes Factory last profile
Sabha Boxes Factory last profileSabha Boxes Factory last profile
Sabha Boxes Factory last profile
syed mustafa
 
Bienvenida a instructores sena
Bienvenida a instructores senaBienvenida a instructores sena
Bienvenida a instructores sena
kavabra
 

Destacado (15)

JSOL-B2C presentation
JSOL-B2C presentationJSOL-B2C presentation
JSOL-B2C presentation
 
Deltanet Project - Final Report
Deltanet Project - Final ReportDeltanet Project - Final Report
Deltanet Project - Final Report
 
How To Get The Justin Bieber Smile
How To Get The Justin Bieber SmileHow To Get The Justin Bieber Smile
How To Get The Justin Bieber Smile
 
Merieu p-frankestein-educador-9-32
Merieu p-frankestein-educador-9-32Merieu p-frankestein-educador-9-32
Merieu p-frankestein-educador-9-32
 
Fashion Styling & Creative Direction
Fashion Styling & Creative DirectionFashion Styling & Creative Direction
Fashion Styling & Creative Direction
 
Listas oficiales 2doa segundo a
Listas oficiales 2doa   segundo aListas oficiales 2doa   segundo a
Listas oficiales 2doa segundo a
 
Sistema cardiorespiratorio
Sistema cardiorespiratorioSistema cardiorespiratorio
Sistema cardiorespiratorio
 
Examen de ingles (1)
Examen de ingles (1)Examen de ingles (1)
Examen de ingles (1)
 
Link para encusta en línea contestar por favor
Link para encusta en línea contestar por favorLink para encusta en línea contestar por favor
Link para encusta en línea contestar por favor
 
Presentacion para el doc copia
Presentacion  para el doc   copiaPresentacion  para el doc   copia
Presentacion para el doc copia
 
Sociedad de responsabilidad limitada s. de r.l.
Sociedad de responsabilidad limitada s. de r.l.Sociedad de responsabilidad limitada s. de r.l.
Sociedad de responsabilidad limitada s. de r.l.
 
Нестерук О.І. Водойми Малинівки
Нестерук О.І. Водойми МалинівкиНестерук О.І. Водойми Малинівки
Нестерук О.І. Водойми Малинівки
 
Cocina fria
Cocina friaCocina fria
Cocina fria
 
Sabha Boxes Factory last profile
Sabha Boxes Factory last profileSabha Boxes Factory last profile
Sabha Boxes Factory last profile
 
Bienvenida a instructores sena
Bienvenida a instructores senaBienvenida a instructores sena
Bienvenida a instructores sena
 




Pathogen Genome Data

  • 1. Pathogen Genome Data
      EMBL-EBI Bioinformatics of Plants and Plant Pathogens, 23rd May 2016
      Leighton Pritchard (1,2,3)
      (1) Information and Computational Sciences, (2) Centre for Human and Animal Pathogens in the Environment, (3) Dundee Effector Consortium, The James Hutton Institute, Invergowrie, Dundee, Scotland, DD2 5DA
  • 2. Acceptable Use Policy
      Recording of this talk, taking photos, discussing the content using email, Twitter, blogs, etc. is permitted (and encouraged), provided distraction to others is minimised.
      These slides will be made available on SlideShare.
      These slides, and supporting material including exercises, are available at https://github.com/widdowquinn/Teaching-EMBL-Plant-Path-Genomics
  • 3. Table of Contents
      1 Introduction
        - Pathogen Genome Data
      2 Public Genome Data Sources
        - Online Resources
      3 Comparative Genomics
        - Why Comparative Genomics?
        - Whole Genome Comparisons
        - Feature Comparisons
      4 Effector Prediction
        - Effector Characteristics
  • 4. Introduction
      What can pathogen genome data do for you?
      Combining genomic data with comparative and evolutionary biology addresses questions of pathogen evolution, adaptation and lifestyle.
  • 6. NCBI
      http://www.ncbi.nlm.nih.gov/
      - Repository of record for pathogen (and other) genome data
      - Example: Ralstonia solanacearum
      - Browser interface
      - FTP repositories of genome data: RefSeq, GenBank
  • 7. GenBank vs RefSeq
      GenBank:
      - part of International Nucleotide Sequence Database Collaboration (INSDC): EMBL/NCBI/DDBJ
      - records ‘owned’ by submitter
      - may include redundant information
      RefSeq:
      - not part of INSDC
      - records derived from GenBank, ‘owned’ by NCBI
      - stable non-redundant foundation for functional and diversity studies
  • 8. Ensembl
      http://www.ensembl.org
      - Automated annotation on selected genomes
      - Specialised sub-collections:
        - Ensembl Protists: http://protists.ensembl.org/
        - Ensembl Bacteria: http://bacteria.ensembl.org/
        - Ensembl Fungi: http://fungi.ensembl.org/
      - Downloadable resources, e.g. ftp://ftp.ensemblgenomes.org/pub/protists/
      - Ready-made comparative genomics!
        - Phytophthora genomics alignments (Avr3a)
        - Gene trees (Avr3a)
  • 9. Other Sources
      - Sequencing centres, e.g.
        - JGI Genome Portals
        - Broad Institute (now retiring their online resources)
      - Specialist databases, e.g.
        - PhytoPath: plant pathogens
        - PHI-Base 4: curated plant-pathogen interaction data
        - FungiDB: fungi and oomycetes
        - CPGR: fungi and oomycetes (not recently updated)
      - Your friendly local sequencing centre! Aspera is commonly used to connect to your private data
  • 12. Why comparative genomics?
      - Transfer functional information from model systems (E. coli, A. thaliana, D. melanogaster) to non-model systems
      - Genome similarity ∝ phenotype? (functional genomics): virulence and host range
      - Genome similarity ∝ relatedness? (phylogenomics): record of evolutionary processes and constraints
  • 13. Genomes aren’t everything...
      - Context: epigenetics; tissue differentiation/differential expression; mesoscale systems, etc.
      - Phenotypic plasticity, responses to: temperature; stress; community, etc.
      ...and therefore systems biology...
  • 14. Levels of comparison
      Bulk Properties:
      - e.g. k-mer spectra (Mash, MetaPalette, etc.)
      Whole Genome Sequence:
      - sequence similarity (BLAST, BLAT, MUMmer, etc.)
      - structure and organisation (Mauve, ACT, etc.)
      Genome Features/Functional Components:
      - numbers and types of features: genes, ncRNA, regulatory elements, etc.
      - organisation of features: synteny, operons, regulons, etc.
      - functional complement (KEGG, etc.)
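As an illustration of a bulk-property comparison, the sketch below computes the exact Jaccard index of the k-mer sets of two short sequences. Tools such as Mash estimate a related quantity from MinHash sketches rather than from full k-mer sets; the function names and the toy value of k here are assumptions for illustration only, and reverse complements are ignored.

```python
def kmer_set(sequence, k):
    """Return the set of all k-mers (substrings of length k) in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity(seq_a, seq_b, k=4):
    """Jaccard index of two k-mer sets: shared k-mers / all distinct k-mers."""
    kmers_a, kmers_b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = kmers_a | kmers_b
    if not union:
        # Both sequences are shorter than k, so there is nothing to compare
        return 0.0
    return len(kmers_a & kmers_b) / len(union)


if __name__ == "__main__":
    print(jaccard_similarity("ACGTACGTTGCA", "ACGTACGATGCA", k=4))
```

Real genomes need much larger k (and a sketching scheme) for this to be practical; the exact calculation is only meant to show what is being compared.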
  • 16. Whole genome comparisons
      Whole genome comparison: comparisons of one complete or draft genome with another (...or many others)
      Minimum requirement: two genomes
      - Reference Genome
      - Comparator Genome
      The experiment produces a comparative result that is dependent on the choice of genomes.
  • 17. Pairwise genome alignments Pairwise comparisons produce alignments of similar regions.
  • 18. Synteny and Collinearity
      - Genome rearrangements may occur post-species divergence
      - Sequence similarity, and order of similar regions, may be conserved
        - collinear: conserved elements lie in the same linear sequence
        - syntenous (or syntenic) elements: (orig.) lie on the same chromosome; (mod.) are collinear
      - Evolutionary constraint (e.g. indicated by synteny) may indicate functional constraint (and help determine orthology)
  • 19. Vibrio mimicus (a)
      (a) Hasan et al. (2010) Proc. Natl. Acad. Sci. USA 107:21134-21139 doi:10.1073/pnas.1013825107
      - Chromosomes: C-I, virulence genes; C-II, environmental adaptation
      - C-II has undergone extensive rearrangement; C-I has not.
      - Suggests modularity of genome organisation, as a mechanism for adaptation (HGT, two-speed genome).
  • 20. Serratia symbiotica (a)
      (a) Burke and Moran (2011) Genome Biol. Evol. 3:195-208 doi:10.1093/gbe/evr002
      - S. symbiotica is a recently adapted symbiont of aphids
      - Massive genomic decay: consequence of adaptation
  • 21. Whole genome classification (a, b)
      (a) Baltrus (2016) Trends Microbiol. doi:10.1016/j.tim.2016.02.004
      (b) Pritchard et al. (2016) Anal. Methods doi:10.1039/c5ay02550h
      - Widespread confusion about bacterial strain classification and nomenclature
      - Taxonomies contradicted by bioinformatic classification
      - Databases populated by non-taxonomists
      - Philosophy and practice of taxonomy are in conflict
      - Classification can be independent of existing nomenclature
      - The route from genotype to phenotype is complicated
      - Time to abandon traditional bacterial species concepts?
      - An unambiguous sequence-based classification scheme is possible
  • 22. DNA-DNA hybridisation (a)
      (a) Rosselló-Mora and Amann (2001) FEMS Microbiol. Rev. doi:10.1016/S0168-6445(00)00040-1
      - “Gold Standard” for prokaryotic taxonomy, since 1960s. “70% identity ≈ same species.”
      - Denature DNA from two organisms. Allow to anneal.
      - Reassociation ≈ similarity, measured as ∆T of denaturation curves.
      - Proxy for sequence similarity: replace with genome analysis (1)?
      (1) Chan et al. (2012) BMC Microbiol. doi:10.1186/1471-2180-12-302
  • 23. Average Nucleotide Identity (ANIm) (a)
      (a) Richter and Rosselló-Móra (2009) Proc. Natl. Acad. Sci. USA doi:10.1073/pnas.0906412106
      1. Align genomes (MUMmer)
      2. ANIm: mean % identity of all matches
      DDH:ANIm linear; 70% ID ≈ 95% ANIb
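The second step above can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular tool: it assumes the aligner's output has already been reduced to non-overlapping (aligned length, percent identity) pairs, and it weights each match by its length, which is one reasonable reading of "mean % identity of all matches". The function name and input layout are hypothetical.

```python
def ani_and_coverage(matches, genome_length):
    """Summarise pairwise alignment matches.

    matches: list of (aligned_length, percent_identity) tuples, assumed
             non-overlapping (hypothetical, pre-parsed aligner output).
    genome_length: total length of the query genome in bases.
    Returns (length-weighted mean percent identity, fraction of genome aligned).
    """
    total_aligned = sum(length for length, _ in matches)
    if total_aligned == 0:
        return 0.0, 0.0
    weighted_identity = (
        sum(length * identity for length, identity in matches) / total_aligned
    )
    coverage = total_aligned / genome_length
    return weighted_identity, coverage


if __name__ == "__main__":
    toy_matches = [(1000, 98.0), (3000, 94.0)]
    print(ani_and_coverage(toy_matches, genome_length=8000))
```

Reporting coverage alongside identity matters because a high identity computed over a small aligned fraction says little about the genome as a whole.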
  • 24. 55 Pectobacterium spp. ANIm (a)
      (a) Pritchard et al. (2016) Anal. Methods doi:10.1039/c5ay02550h
      - Ten species-level groups (four novel)
      - P. carotovorum split: several species
      - P. wasabiae split: two species
      [Figure: heatmap of pairwise ANIm_percentage_identity across the 55 isolates (P. atrosepticum, P. betavasculorum, P. carotovorum, P. wasabiae, P. sp. SCC3193); colour scale 0.00 to 1.00]
  • 25. 55 Pectobacterium spp. ANIm (a)
      (a) Pritchard et al. (2016) Anal. Methods doi:10.1039/c5ay02550h
      - All isolates align over >50% of whole genome
      [Figure: heatmap of pairwise ANIm_alignment_coverage across the 55 isolates (P. atrosepticum, P. betavasculorum, P. carotovorum, P. wasabiae, P. sp. SCC3193); colour scale 0.00 to 1.00]
  • 26. ANI
      Advantages:
      - Average identity of all ‘homologous’ regions
      - Approximates limiting case of MLST/MLSA/multigene comparisons
      - Adaptable to variable thresholding (LINS) classifications
      Criticisms:
      - Thresholds ‘arbitrary’, based on homologous regions only
      - Taxonomic classification, not phylogenetic reconstruction
      - No functional (or gene-based) interpretation; still need pangenome classification and analysis
  • 27. EXERCISE exercises/01-whole_genome_comparisons.ipynb Pairwise comparison of Pseudomonas genomes ANIm classification of Pseudomonas isolates MyBinder link
  • 28. Chromosome painting (a)
      (a) Yahara et al. (2013) Mol. Biol. Evol. 30:1454-1464 doi:10.1093/molbev/mst055
      - “Chromosome painting” (FINESTRUCTURE) infers recombination-derived ‘chunks’
      - Genome’s haplotype constructed in terms of recombination events from a ‘donor’ to a ‘recipient’ genome
  • 29. Chromosome painting (a)
      (a) Yahara et al. (2013) Mol. Biol. Evol. 30:1454-1464 doi:10.1093/molbev/mst055
      - Recombination events summarised in a coancestry matrix.
      - H. pylori: most within geographical bounds, but asymmetrical donation from Amerind/East Asian to European isolates.
  • 31. Feature comparisons
      Feature comparisons: comparisons of the annotated features of one genome with another (...or many others)
      - gene features
      - RNA features
      - regulatory features
  • 32. Equivalent features
      The power of genomics is comparative genomics!
      - Makes catalogues of genome components comparable between organisms
      - Differences, e.g. presence/absence of equivalents, may support hypotheses for functional or phenotypic difference
      - Can identify characteristic signals for diagnosis/epidemiology
      - Can build parts lists and wiring diagrams for systems and synthetic biology
  • 33. Orthologues (a, b)
      (a) Nehrt et al. (2011) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002073
      (b) Chen and Zhang (2012) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002784
      Orthologs/Orthologues:
      - “Homologs that diverged through speciation” (orig.)
      - “Genes/products we think are probably the same thing” (mod. inform.)
  • 34. Why orthologues? (a, b, c)
      (a) Chen and Zhang (2012) PLoS Comp. Biol. doi:10.1371/journal.pcbi.1002784
      (b) Dessimoz (2011) Brief. Bioinf. doi:10.1093/bib/bbr057
      (c) Altenhoff and Dessimoz (2009) PLoS Comp. Biol. 5:e1000262 doi:10.1371/journal.pcbi.1000262
      - Formalise the idea of corresponding genes in different organisms.
      - Suggest two relationships: evolutionary equivalence; functional equivalence (“The Ortholog Conjecture”)
      The Ortholog Conjecture: without duplication, a gene product is unlikely to change its basic function, because this would lead to loss of the original function, and this would be harmful.
  • 35. Finding orthologues (a)
      (a) Salichos and Rokas (2011) PLoS One 6:e18755 doi:10.1371/journal.pone.0018755.g006
      Which discovery method performs best?
      - Four methods tested against 2,723 curated orthologues from six Saccharomycetes: RBBH (and cRBH); RSD (and cRSD); MultiParanoid; OrthoMCL
      - Rated by statistical performance metrics: sensitivity, specificity, accuracy, FDR
      - cRBH most accurate and specific, with lowest FDR.
  • 36. EXERCISE exercises/02-cds_feature_comparisons.ipynb RBBH analysis of Pseudomonas CDS feature annotations MyBinder link
  • 37. One-way BLAST vs RBBH One-way BLAST includes many low-quality hits
  • 38. One-way BLAST vs RBBH Reciprocal best BLAST removes many low-quality matches
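The reciprocal filter can be sketched as follows. This is a minimal illustration that assumes search results have already been parsed into (query, subject, bitscore) tuples; the function names and the input layout are hypothetical, and real analyses usually add identity and coverage filters.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples
          (a hypothetical, pre-parsed form of tabular search output).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits_a_vs_b, hits_b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b AND b's best hit is a."""
    forward = best_hits(hits_a_vs_b)
    reverse = best_hits(hits_b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 50.0),
              ("a2", "b1", 120.0), ("a3", "b3", 80.0)]
    b_vs_a = [("b1", "a1", 198.0), ("b2", "a1", 48.0), ("b3", "a3", 79.0)]
    print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data, a2 has a one-way best hit to b1, but b1's best hit is a1, so the (a2, b1) pair is discarded.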
  • 39. The Pangenome
      The Core Genome Hypothesis: “The core genome is the primary cohesive unit defining a bacterial species”
      - Once equivalent genes have been identified, those present in all related isolates can be identified: the core genome.
      - The remaining genes are the accessory genome, and are expected to mediate function that distinguishes between isolates.
      - Roary: rapid large-scale prokaryote pan-genome analysis; works on a desktop machine.
  • 40. Accessory genome (a, b)
      (a) Croll and McDonald (2012) PLoS Path. 8:e1002608 doi:10.1371/journal.ppat.1002608
      (b) Baltrus et al. (2011) PLoS Path. 7:e1002132 doi:10.1371/journal.ppat.1002132
      Accessory genomes: a cradle for adaptive evolution, particularly for bacterial pathogens, such as Pseudomonas spp.
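Once genes have been grouped into families of equivalents, the core/accessory split is a set operation. The sketch below assumes each isolate is represented by the set of gene-family identifiers it carries; the isolate and family names are made up, and this is not a description of how Roary works internally.

```python
def partition_pangenome(gene_families_by_isolate):
    """Split a pangenome into core and accessory gene families.

    gene_families_by_isolate: dict mapping isolate name -> iterable of
        gene-family identifiers present in that isolate (hypothetical input).
    Returns (core, accessory) as two sets.
    """
    family_sets = [set(families) for families in gene_families_by_isolate.values()]
    if not family_sets:
        return set(), set()
    pangenome = set().union(*family_sets)      # every family seen in any isolate
    core = set.intersection(*family_sets)      # families present in all isolates
    return core, pangenome - core


if __name__ == "__main__":
    isolates = {
        "isolate_1": {"gyrB", "recA", "hopA1"},
        "isolate_2": {"gyrB", "recA"},
        "isolate_3": {"gyrB", "recA", "avrPto1"},
    }
    core, accessory = partition_pangenome(isolates)
    print(sorted(core), sorted(accessory))
```

With draft genomes a strict "present in all isolates" rule can be too harsh, so a relaxed threshold (present in most isolates) is a common design choice.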
  • 41. OPTIONAL WORKSHEET worksheets/02-prokka_roary.ipynb Annotation of pathogen genomes with Prokka Calculation of the Pantoea agglomerans pangenome with Roary MyBinder link
  • 43. What is an effector?
  • 44. What is an effector?
      Effector: a molecule produced by a pathogen that (directly?) modifies host molecular/biochemical ‘behaviour’
      - Inhibits enzyme action (e.g. Cladosporium fulvum AVR2, AVR4; Phytophthora infestans EPIC1, EPIC2B; P. sojae glucanase inhibitors)
      - Cleaves a protein target (e.g. Pseudomonas syringae AvrRpt2)
      - (De-)phosphorylates a protein target (e.g. P. syringae AvrRPM1, AvrB)
      - Retargets a host system such as E3 ligase (e.g. P. syringae AvrPtoB; P. infestans Avr3a)
      - Regulatory control (e.g. Xanthomonas campestris AvrBs3)
  • 45. What is an effector?
      - No unifying biochemical mechanism
      - No single test for ‘candidate effectors’, even in one organism
  • 46. Effectors are modular (a, b)
      (a) Greenberg & Vinatzer (2003) Curr. Opin. Microbiol. doi:10.1016/S1369-5274(02)00004-8
      (b) Collmer et al. (2002) Trends Microbiol. doi:10.1016/S0966-842X(02)02451-4
      - Delivery: N-terminal localisation/translocation domain
      - Activity: C-terminal functional/interaction domain
  • 47. Effectors are modular (a, b)
      (a) Dong et al. (2011) PLoS One doi:10.1371/journal.pone.0020172.t004
      (b) Boch et al. (2009) Science doi:10.1126/science.1178811
      - Delivery: typically common to effector class: RxLR, T3E, CHxC
      - Activity: may be common (TAL) or divergent within effector class (RxLR, T3E)
  • 48. Effector prediction tools (online) (a, b)
      (a) Sperschneider et al. (2015) PLoS Pathogens doi:10.1371/journal.ppat.1004806
      (b) Sonah et al. (2016) Front. Plant Sci. doi:10.3389/fpls.2016.00126
      Bacterial Type III Effectors:
      - EffectiveT3
      - modlab
      - T3SEdb
      Fungal/Oomycete Effectors:
      - EffectorP
      - Galaxy Toolshed RxLR predictor
  • 49. What do we look for? (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
      What if someone hasn’t built a classifier for your protein family?
      - Tests are for protein family membership and/or ‘effector-like’ functional signal
      - The same as any sequence classification problem (functional annotation)
      - Many possible approaches
      - (Supervised) machine learning problem: train, test, validate
  • 50. Sequence space (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
      Known members of our effector class are in red
  • 51. Similarity distance (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
      Define a representative centre, and a distance from it that includes known effectors
  • 52. Classify candidates (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
      Classify sequences within the distance as similar
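The centre-and-distance idea from the last three slides can be sketched with a deliberately simple stand-in. Here the "centre" is a majority-rule consensus of pre-aligned, equal-length toy sequences and the "distance" is Hamming distance; both choices, and all names and sequences, are assumptions for illustration. Practical analyses would use a richer model (the exercise that follows builds an HMM).

```python
from collections import Counter


def consensus(aligned_seqs):
    """Majority-rule consensus of equal-length, pre-aligned sequences."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_seqs)
    )


def hamming(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(seq_a, seq_b))


def classify(candidates, known_members):
    """Label each candidate True if it lies within the class boundary.

    The boundary is the largest distance from the consensus (the centre)
    to any known member, so every known member is included by construction.
    """
    centre = consensus(known_members)
    radius = max(hamming(centre, seq) for seq in known_members)
    return {seq: hamming(centre, seq) <= radius for seq in candidates}


if __name__ == "__main__":
    known = ["RSLRA", "RSLRG", "RALRA"]      # toy 'known effectors'
    print(classify(["RSLRT", "KTWQA"], known))
```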
  • 53. EXERCISE exercises/03-effector_finding.ipynb Downloading annotated Pseudomonas AvrPto1 effectors from a public sequence repository Building a (HMM) model from this training set Searching public genome annotations with the model MyBinder link
  • 54. Choosing a distance (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
      - How do we define distance?
      - How large a distance should we take?
      - How do we know if we chose well?
  • 55. Are you in or out? (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
      - The boundary (distance) classifies sequences as ‘in’ or ‘out’
      - Sequences are predicted to be either in the class or not in the class
      - Changing distance/boundary changes classification
  • 56. TP/TN/FP/FN (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
  • 57. FPR/FNR/Sn/Sp/FDR (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
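The abbreviations in the slide title are standard ratios of the four confusion-matrix counts (TP, TN, FP, FN). A minimal sketch, using the usual textbook definitions and made-up counts:

```python
def classifier_metrics(tp, tn, fp, fn):
    """Standard performance ratios from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one real positive,
    one real negative, and one positive prediction).
    """
    return {
        "sensitivity": tp / (tp + fn),   # Sn: fraction of real positives found
        "specificity": tn / (tn + fp),   # Sp: fraction of real negatives rejected
        "FPR": fp / (fp + tn),           # false positive rate = 1 - Sp
        "FNR": fn / (fn + tp),           # false negative rate = 1 - Sn
        "FDR": fp / (fp + tp),           # fraction of positive calls that are wrong
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only
    for name, value in classifier_metrics(tp=80, tn=900, fp=100, fn=20).items():
        print(f"{name}: {value:.3f}")
```

Note that FDR depends on how many negatives are in the dataset, whereas sensitivity and specificity do not.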
  • 58. Small boundary (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
  • 59. Medium boundary (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
  • 60. Large boundary (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
  • 61. Choosing a boundary (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
      - Assign known ‘positive’ and ‘negative’ examples
      - Vary ‘distance’ and measure predictive performance (F-measure, AUC, ...)
      - Choose the distance that gives the ‘best’ performance
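The three steps above can be sketched as a simple threshold scan. This minimal example assumes each labelled sequence has already been reduced to a single distance from the class centre, and it uses the F-measure as the performance score; the function names, distances and candidate thresholds are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 if there are no true positives)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_distance(distances, labels, thresholds):
    """Scan candidate thresholds and return (threshold, F-measure) with the best score.

    distances: distance of each labelled sequence from the class centre
    labels:    True for known positives, False for known negatives
    Ties are resolved in favour of the threshold listed first.
    """
    scored = []
    for threshold in thresholds:
        predictions = [d <= threshold for d in distances]
        tp = sum(p and l for p, l in zip(predictions, labels))
        fp = sum(p and not l for p, l in zip(predictions, labels))
        fn = sum(l and not p for p, l in zip(predictions, labels))
        scored.append((f_measure(tp, fp, fn), threshold))
    best_score, best_threshold = max(scored, key=lambda pair: pair[0])
    return best_threshold, best_score


if __name__ == "__main__":
    distances = [1, 2, 3, 6, 7, 8]
    labels = [True, True, True, False, False, False]
    print(best_distance(distances, labels, thresholds=[2, 4, 7]))
```

A score measured on the same examples used to pick the threshold is optimistic, which is why the estimate should come from data held back from that choice.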
  • 62. Crossvalidation (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
      Estimation of classifier performance depends on:
      - Boundary choice/distance measure
      - Composition of training set (‘positives’ and ‘negatives’)
      Cross-validation gives objective estimate of performance. Many strategies (beyond today’s scope), including:
      - leave-one-out (LOO)
      - k-fold crossvalidation
      - repeated (random) subsampling
      Always validate against a hold-out set (not used to train the classifier)
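As a concrete picture of k-fold crossvalidation, the sketch below only generates the train/test index splits; fitting and scoring a classifier on each split is left out. It uses the standard library alone, the function name is hypothetical, and it assumes any hold-out set has already been removed from the data before splitting.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Items are shuffled once with a fixed seed so the splits are repeatable.
    Every item appears in exactly one test fold.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized folds
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


if __name__ == "__main__":
    for train, test in k_fold_indices(n_items=10, k=5):
        print("train:", sorted(train), "test:", sorted(test))
```

Setting k equal to the number of items gives leave-one-out as a special case.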
  • 63. Post-crossvalidation (a)
      (a) Pritchard & Broadhurst (2014) Methods Mol. Biol. doi:10.1007/978-1-62703-986-4_4
      - Crossvalidation gives ‘best’ method & parameters
      - Apply ‘best’ method to complete dataset for prediction
      BEWARE THE BASERATE FALLACY!
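The base rate fallacy is easiest to see with numbers. The sketch below applies Bayes' theorem to ask what fraction of positive predictions are real when true class members are rare. The sensitivity, false positive rate and base rate used in the example are invented for illustration, not taken from any real classifier.

```python
def positive_predictive_value(sensitivity, false_positive_rate, base_rate):
    """P(truly in class | predicted in class), by Bayes' theorem.

    sensitivity:         P(predicted positive | truly positive)
    false_positive_rate: P(predicted positive | truly negative)
    base_rate:           P(truly positive) in the dataset being searched
    """
    true_positive_mass = sensitivity * base_rate
    false_positive_mass = false_positive_rate * (1.0 - base_rate)
    return true_positive_mass / (true_positive_mass + false_positive_mass)


if __name__ == "__main__":
    # Hypothetical: a classifier with 95% sensitivity and a 5% false positive
    # rate, applied to a proteome in which 1% of proteins are real effectors.
    ppv = positive_predictive_value(0.95, 0.05, 0.01)
    print(f"Fraction of predictions that are real effectors: {ppv:.2f}")
```

Under these assumed numbers most positive calls are false, even though the classifier looks excellent on sensitivity and specificity alone.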
  • 64. OPTIONAL WORKSHEET worksheets/03-effector-finding.ipynb The Baserate Fallacy in effector prediction and classification. MyBinder link
  • 65. Licence: CC-BY-SA By: Leighton Pritchard This presentation is licensed under the Creative Commons Attribution ShareAlike license https://creativecommons.org/licenses/by-sa/4.0/