Multivariate exploration of microbial communities
Josh D. Neufeld
Braunschweig, Germany
December, 2013

Andre Masella (MSc): Computer science
Michael Lynch (PhD): Taxonomy, phylogenetics, ecology
Michael Hall (co-op): mathematics, programming, user friendly!
Posted on Slideshare without images and unpublished data
Quick history
Alpha and Beta diversity
Species that matter
Pipelines
Future prospects and problems
Who lives with whom, and why, and where?
Data reduction is essential for:
a) summarizing large numbers of observations
into manageable numbers
b) visualizing many interconnected variables in a
compact manner
Alpha diversity: species richness (and evenness)
within a single sample
Beta diversity: change in species composition
across a collection of samples
Gamma diversity: total species richness across an
environmental gradient
An (abbreviated) history
Numerical ecology
phenetics and statistical analysis of organismal
counts
macroecology

16S rRNA gene era
sequence analysis as a surrogate for counting
mapping of marker to taxonomy

NGS enabled synthesis of phenetics,
phylogenetics, and numerical ecology
Now generate V3-V4 bacterial amplicons (~450 bases)
Usually PE 300
Assembling paired-end
reads dramatically
reduces error
Corrects mismatches in
region of overlap
(quality threshold >0.9),
set a minimum overlap.
Can compare to perfect
overlap assembly:
“completelymissesthepoint”
(name changing soon)
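As a rough illustration of the overlap idea on this slide, the sketch below merges a forward read with the reverse complement of its mate, requires a minimum overlap, and resolves mismatches inside the overlap by keeping the base with the higher quality score. It is a deliberately simplified toy and not PANDAseq's probabilistic scoring; the function names, the mismatch cutoff, and the example reads are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, fwd_qual, rev, rev_qual, min_overlap=6, max_mismatch_frac=0.15):
    """Merge a read pair into one sequence, or return None if no overlap is found.

    Qualities are lists of integers, one per base, in read orientation.
    The mismatch cutoff is a hypothetical stand-in for a real quality threshold.
    """
    rc = reverse_complement(rev)
    rc_qual = rev_qual[::-1]
    # Try the longest candidate overlap first, down to the minimum overlap.
    for olen in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        a, b = fwd[-olen:], rc[:olen]
        mismatches = sum(1 for x, y in zip(a, b) if x != y)
        if mismatches <= max_mismatch_frac * olen:
            qa, qb = fwd_qual[-olen:], rc_qual[:olen]
            # Inside the overlap, keep whichever base has the higher quality.
            overlap = "".join(
                x if p >= q else y for x, y, p, q in zip(a, b, qa, qb)
            )
            return fwd[:-olen] + overlap + rc[olen:]
    return None


# Hypothetical 20-base amplicon sequenced as two 14-base reads (8-base overlap).
amplicon = "ACGTTGCAATCGGATCCAGT"
forward = amplicon[:14]
reverse = reverse_complement(amplicon[6:])
merged = merge_pair(forward, [30] * 14, reverse, [35] * 14)
```

A low-quality error in the forward read's overlap region is corrected by the higher-quality base from the mate, which is the intuition behind the error reduction claimed above.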
PANDAseq
>30x faster
than next
fastest
alternative
assembler
1. p-value threshold
2. parallelizes correctly
(both are now
added or fixed
in PANDAseq)
Biological Observation Matrix
BIOM file format (McDonald et al. 2012)
Standard recognized by EMP, MG-RAST,
VAMPS
Based on JSON data interchange format
Computational structure in multiple languages

“facilitates the efficient handling and
storage of large, sparse biological
contingency tables”
Encapsulates metadata and contingency
table (e.g., OTU table) in one file
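To make the "metadata plus sparse contingency table in one JSON file" idea concrete, here is a minimal sketch of a BIOM-style document built with Python's standard `json` module. The field names follow my reading of the BIOM 1.0 layout and should be checked against the specification; the OTU identifiers, sample identifiers, and metadata values are hypothetical.

```python
import json

# Hypothetical two-OTU, three-sample table in a sparse BIOM 1.0 style layout.
biom_table = {
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "example script",
    "date": "2013-12-01T00:00:00",
    "rows": [
        {"id": "OTU_1", "metadata": {"taxonomy": ["k__Bacteria", "p__Proteobacteria"]}},
        {"id": "OTU_2", "metadata": {"taxonomy": ["k__Bacteria", "p__Actinobacteria"]}},
    ],
    "columns": [
        {"id": "sample_A", "metadata": {"soil_depth": 5}},
        {"id": "sample_B", "metadata": {"soil_depth": 10}},
        {"id": "sample_C", "metadata": {"soil_depth": 20}},
    ],
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [2, 3],
    # Sparse entries are [row index, column index, count]; zeros are omitted.
    "data": [[0, 0, 12], [0, 2, 3], [1, 1, 7]],
}


def to_dense(table):
    """Expand the sparse [row, column, value] triples into a dense matrix."""
    n_rows, n_cols = table["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in table["data"]:
        dense[row][col] = value
    return dense


# Round-trip through JSON text, as a file on disk would.
round_trip = json.loads(json.dumps(biom_table))
dense = to_dense(round_trip)
```

Only three of the six cells are stored, which is the point of the sparse layout for large OTU tables.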
Quick history
Alpha and Beta diversity
Species that matter
Pipelines
Future prospects and problems
Diversity
(richness and evenness)
α-diversity: Richness and
Evenness

Shannon index (H’), Estimators (Chao1, ACE), Phylogenetic Diversity

Shannon index (H’): richness and evenness
Estimators: richness
Faith’s PD: phylogenetic richness
Stearns et al., 2011
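The indices named on this slide can be computed directly from a vector of OTU counts for one sample. The sketch below shows the Shannon index, Pielou's evenness derived from it, and the Chao1 richness estimator; Faith's PD is left out because it needs a phylogenetic tree. Function names and the example counts are assumptions for illustration.

```python
import math


def shannon_index(counts):
    """Shannon index H' (natural log) from OTU counts in one sample."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """Evenness J' = H' / ln(S), where S is the number of observed OTUs."""
    observed = sum(1 for c in counts if c > 0)
    return shannon_index(counts) / math.log(observed) if observed > 1 else 0.0


def chao1(counts):
    """Chao1 richness estimate from singleton (F1) and doubleton (F2) counts."""
    observed = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 > 0:
        return observed + f1 * f1 / (2 * f2)
    # Bias-corrected form, used here when there are no doubletons.
    return observed + f1 * (f1 - 1) / 2


even_sample = [10, 10, 10, 10]       # hypothetical, perfectly even
skewed_sample = [5, 3, 1, 1, 2, 0]   # hypothetical, two singletons, one doubleton
```

For the even sample H' equals ln(4) and evenness is 1; for the skewed sample Chao1 adds two unseen OTUs to the five observed.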

Hughes et al., 2001
“All biologists who sample natural
communities are plagued with the
problem of how well a sample reflects a
community’s ‘true’ diversity.”
Hughes et al. 2001
“Nonparametric estimators show particular promise for microbial data and in
some habitats may require sample sizes of only 200 to 1,000 clones to detect
richness differences of only tens of species.”
[Figure: proportion of Google Scholar results for "[sequencing technology] AND 16S" (Sanger, 454, Illumina) plotted with "rare biosphere" citations against time (year), 2000 to 2012. Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.]
GOALS
Understanding of community structure
Better alpha-diversity measures
Robust beta-diversity measures

Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
Stearns et al. 2011
Bartram et al. 2011
Clustering algorithms
(influence alpha diversity primarily)

CD-HIT (Li and Godzik, Sanford-Burnham Medical
Research Institute)
‘longest-sequence-first’ removal algorithm
Fast, many implementations (nucleotide, protein, OTU-specific)
Tends to be more stringent than UCLUST

UCLUST (R. Edgar, drive5.com)
Faster than CD-HIT
Tends to generate larger number of low-abundance OTUs
Broader range of clustering thresholds

"I do not recommend using the UCLUST algorithm or
CD-HIT for generating OTUs” – Robert Edgar
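The "longest-sequence-first" strategy listed above for CD-HIT can be sketched as a greedy loop: sort sequences by length, make the first one a cluster representative, and assign each later sequence to the first representative it matches at or above the identity threshold. This toy compares ungapped, prefix-aligned sequences, which is an assumption made for brevity; the real tools use word filters and alignments, and all names here are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence.

    Assumes the two sequences start at the same position (no gaps).
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def greedy_cluster(seqs, threshold=0.97):
    """Greedy longest-first clustering; returns (representative, members) pairs."""
    clusters = []
    for seq in sorted(seqs, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster (OTU).
            clusters.append((seq, [seq]))
    return clusters


reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT", "ACGTACGT"]
otus = greedy_cluster(reads, threshold=0.9)
```

Because assignment depends on input order and on the threshold, small changes in either can change the number of low-abundance OTUs, which is one reason clustering choices influence alpha diversity.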
CROP: Clustering 16S rRNA for OTU Prediction
“CROP can find clusters based on the natural organization of data without setting a
hard cut-off threshold (3%/5%) as required by hierarchical clustering methods.”
Chimeras
DNA from two or more parent molecules
PCR artifact
Can easily be classified as a “novel” sequence
Increases α-diversity

Software
ChimeraSlayer, Bellerophon, UCHIME, Pintail

Reference database or de novo
Classification and taxonomy
Ribosomal Database Project (RDP) classifier
Naïve Bayesian classifier (James Cole and Tiedje)
http://rdp.cme.msu.edu/

pplacer
Phylogenetic placement and visualization

BLAST
The tool we know and love

RTAX (UC Berkeley, Rob Knight involved)
http://dev.davidsoergel.com/trac/rtax/

mothur (Patrick Schloss)
http://www.mothur.org/

SINA (SILVA)
RDP classifier
Large training sets require active memory management
Can be easily run in parallel by breaking up very large data sets
Can classify Bacteria/Archaea SSU and fungal LSU (can be re-trained)
Algorithm:
determine the probability that an unknown query sequence is a member of a
known genus (training set), based on the profile of word subsets of known
genera.

Confidence estimation:
the number of times in 100 trials that a genus was selected based on a
random subset of words in the query

Take home:
The higher the diversity (bigger sequence space) of the training set, the
better the assignment
Longer query = better and more reliable assignment
Short reads (i.e., <250 bases) will have lower confidence estimates (cutoff of
0.5 suggested)
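The algorithm and confidence estimate described above can be sketched as a small word-based naive Bayes classifier with bootstrap resampling. This is an illustrative toy, not the RDP classifier's code: the word size is reduced to 4 so short example sequences work, the smoothing follows my understanding of the published method, and the training genera and sequences are invented.

```python
import math
import random
from collections import defaultdict

K = 4  # toy word size (assumption, chosen small for the short example sequences)


def words(seq, k=K):
    """Set of overlapping k-mers ("words") in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(training):
    """training maps genus name to a list of sequences; returns word statistics."""
    genus_counts, genus_sizes = {}, {}
    overall, n_total = defaultdict(int), 0
    for genus, seqs in training.items():
        counts = defaultdict(int)
        for seq in seqs:
            n_total += 1
            for w in words(seq):
                counts[w] += 1
                overall[w] += 1
        genus_counts[genus], genus_sizes[genus] = counts, len(seqs)
    return genus_counts, genus_sizes, overall, n_total


def log_score(query_words, genus, model):
    """Sum of log word probabilities for one genus, with a smoothed prior."""
    genus_counts, genus_sizes, overall, n_total = model
    score = 0.0
    for w in query_words:
        prior = (overall.get(w, 0) + 0.5) / (n_total + 1)
        score += math.log((genus_counts[genus].get(w, 0) + prior) / (genus_sizes[genus] + 1))
    return score


def classify(seq, model, trials=100, seed=0):
    """Return (best genus, fraction of bootstrap trials agreeing with it)."""
    rng = random.Random(seed)
    genera = sorted(model[1])
    all_words = sorted(words(seq))
    best = max(genera, key=lambda g: log_score(all_words, g, model))
    subset_size = max(1, len(all_words) // 8)
    hits = 0
    for _ in range(trials):
        # Each trial scores a random subset of the query's words.
        subset = [rng.choice(all_words) for _ in range(subset_size)]
        if max(genera, key=lambda g: log_score(subset, g, model)) == best:
            hits += 1
    return best, hits / trials


model = train({
    "GenusA": ["ACGTACGTACGTACGT", "ACGTACGAACGTACGT"],
    "GenusB": ["TTGGCCAATTGGCCAA", "TTGGCCAATTGGCCTA"],
})
genus, confidence = classify("ACGTACGTACGT", model)
```

A short query yields few words, so each bootstrap subset is tiny and the confidence becomes unstable, which matches the take-home point about short reads.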
Database sources
GreenGenes
Latest May 2013

SILVA
Latest 115 (August 2013)
Includes 18S, 23S, 28S, LSU

RDP Database
Latest 11 (October 2013)

GenBank
Research-specific
e.g., CORE Oral
Multivariate data reduction
β-diversity
Visualization (ordination) versus hypothesis
testing (MRPP, indicator species analysis)
Many more algorithms out there for
exploration and statistical testing
mostly through widely used R packages
vegan (Community Ecology Package)
labdsv (Ordination and Multivariate Analysis for
Ecology)
ape (Analyses of Phylogenetics and Evolution)
picante (community analyses etc.)
Visualization (ordination)
Complementary to data clustering
looks for discontinuities

Ordination extracts main trends as continuous
axes
analysis of the square matrix derived from the
OTU table

Non-parametric, unconstrained ordination
methods most widely used (and best suited)
methods that can work directly on a square matrix

An appropriate metric is required to derive
this square matrix
many options...
Metrics
Ordination is essentially reducing dimensionality
first requirement: accurately model differences
among samples
Models are *really* important. Examples include:
OTU presence/absence
Dice, Jaccard
OTU abundance
Bray-Curtis
Phylogenetic
UniFrac

“all models are wrong, some are useful”
- G.E. Box
“You can't publish anything without a PCoA plot anymore, but METRICS used to draw plot important.”
- Susan Huse
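The presence/absence and abundance metrics named on this slide are short enough to write out. The sketch below computes Dice, Jaccard, and Bray-Curtis dissimilarities between two samples and builds the square matrix that ordination works on; the function names and the example OTU table are assumptions for illustration.

```python
def _present(sample):
    """Indices of OTUs with a non-zero count."""
    return {i for i, count in enumerate(sample) if count > 0}


def jaccard(a, b):
    """Jaccard dissimilarity on presence/absence."""
    pa, pb = _present(a), _present(b)
    union = pa | pb
    return 1 - len(pa & pb) / len(union) if union else 0.0


def dice(a, b):
    """Dice dissimilarity on presence/absence."""
    pa, pb = _present(a), _present(b)
    total = len(pa) + len(pb)
    return 1 - 2 * len(pa & pb) / total if total else 0.0


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity on abundances."""
    total = sum(a) + sum(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / total if total else 0.0


def distance_matrix(samples, metric):
    """Square, symmetric matrix of pairwise dissimilarities between samples."""
    return [[metric(a, b) for b in samples] for a in samples]


# Hypothetical OTU table: one row per sample, one column per OTU.
otu_table = [
    [10, 0, 5, 0],
    [5, 5, 0, 0],
    [0, 0, 0, 9],
]
bc_matrix = distance_matrix(otu_table, bray_curtis)
```

The first two samples share one of three observed OTUs (Jaccard 2/3) yet differ by 0.6 under Bray-Curtis, so the choice of metric changes what "different" means before any plot is drawn.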
Metrics: UniFrac
A distance measure comparing multiple
communities using phylogenetic information
Requires sequence alignment and tree-building
PyNAST, MUSCLE, Infernal
Time-consuming and susceptible to poor phylogenetic
inference (does it matter?)

Weighted (abundance)
ecological features related to
abundance

Unweighted
ecological features related to
taxonomic presence/absence
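Unweighted UniFrac can be written as the fraction of observed branch length that leads to tips from only one of the two communities. The sketch below assumes the tree has already been built and is supplied as a flat list of (branch length, set of descendant tips) pairs, which is a simplification chosen for this example; the tree, tip names, and function name are hypothetical.

```python
def unweighted_unifrac(branches, community_a, community_b):
    """Unique branch length divided by total observed branch length.

    branches: iterable of (length, set_of_descendant_tips).
    community_a, community_b: sets of tip names present in each sample.
    """
    unique = 0.0
    observed = 0.0
    for length, tips in branches:
        in_a = bool(tips & community_a)
        in_b = bool(tips & community_b)
        if in_a or in_b:
            observed += length
            if in_a != in_b:
                # Branch leads to members of only one community.
                unique += length
    return unique / observed if observed else 0.0


# Hypothetical rooted tree ((t1:1,t2:1):2,(t3:1,t4:1):2) as a branch list.
tree = [
    (1.0, {"t1"}),
    (1.0, {"t2"}),
    (2.0, {"t1", "t2"}),
    (1.0, {"t3"}),
    (1.0, {"t4"}),
    (2.0, {"t3", "t4"}),
]
distance = unweighted_unifrac(tree, {"t1", "t2"}, {"t2", "t3"})
```

The weighted variant would scale each branch by the difference in relative abundance below it, which is why it responds to abundance shifts that the unweighted form ignores.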
Ordination example 1 (of many):

Principal Coordinates Analysis
Classical Multidimensional Scaling (MDS; Gower 1966)
Procedure:
based on eigenvectors
position objects in low-dimensional space while preserving
distance relationships as well as possible

highly flexible
can choose among many association measures

In microbial ecology, used for visualizing
phylogenetic or count-based distances
Consistent visual output for given distance matrix
Include variance explained (%) on Axis 1 and 2
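The eigenvector procedure above can be sketched with NumPy: square the distances, double-centre them (Gower centring), and scale the eigenvectors by the square roots of their eigenvalues. This is a minimal illustration under the assumption that NumPy is available; axes with non-positive eigenvalues, which non-Euclidean measures such as Bray-Curtis can produce, are simply dropped here, so the percentages are relative to the positive axes only.

```python
import numpy as np


def pcoa(dist):
    """Principal coordinates analysis of a square distance matrix.

    Returns (coordinates, percent variance explained per retained axis).
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Gower's double-centred matrix of squared distances.
    gower = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gower)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = eigvals > 1e-10                 # drop zero and negative axes
    coords = eigvecs[:, positive] * np.sqrt(eigvals[positive])
    explained = 100 * eigvals[positive] / eigvals[positive].sum()
    return coords, explained


# Hypothetical distances among three samples lying on a line at 0, 1 and 3.
coords, explained = pcoa([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
```

Because the solution comes from an eigendecomposition, the same distance matrix always gives the same coordinates up to axis sign, which is the "consistent visual output" noted above.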
Ordination example 2 (of many):

Non-metric Multidimensional Scaling
Ordination not based on eigenvectors
Does not preserve exact distances among objects
attempts to preserve ordering of samples (“ranks”)

Procedure:
iterative, tries to position the objects in a few (2-3) dimensions in such a way
that minimizes the “stress”
how well does the new ranked distribution of points represent the original
distances in the association matrix? Can express as R^2 on axes 1 and 2.
the adjustment goes on until the stress value reaches a local minimum
(heuristic solution)

NMDS often represents distance relationships better than PCoA in the
same number of dimensions
Susceptible to the “local minimum issue”, and therefore should have
strong starting point (e.g., PCoA) or many permutations
You won't get the same result each time you run the analysis. Try several
runs until you are comfortable with the result.
Do my treatments separate?
Beta-diversity: Hypothesis testing
Multiple methods, implemented in QIIME,
mothur, AXIOME
e.g., MRPP, adonis, NP-MANOVA (perMANOVA),
ANOSIM
Are treatment effects significant?

Because these are predominantly
nonparametric methods, tests for
significance rely on testing by permutation
Let's focus on MRPP
Multiresponse Permutation Procedures
Compare intragroup average distances with the
average distances that would have resulted from all
the other possible combinations
T statistic: more negative with
increasing group separation
(T>-10 common for ecology)
A statistic: Degree of scatter
within groups (A=1 when all
points fall on top of one another)
p value: likelihood of similar
separation with randomized
data.
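The comparison of observed within-group distances against permuted groupings can be sketched in a few lines. The code below computes the weighted mean within-group distance (delta), the A statistic as 1 minus observed over expected delta, and a permutation p value; the T statistic is omitted because it needs the moments of the permutation distribution. Group weights of n_i/N, the function names, and the example data are assumptions for illustration.

```python
import random
from itertools import combinations


def mrpp_delta(dist, labels):
    """Weighted mean within-group distance, weighting each group by n_i / N."""
    n = len(labels)
    delta = 0.0
    for group in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == group]
        pairs = list(combinations(members, 2))
        if not pairs:
            continue
        mean_within = sum(dist[i][j] for i, j in pairs) / len(pairs)
        delta += (len(members) / n) * mean_within
    return delta


def mrpp(dist, labels, permutations=999, seed=0):
    """Return (observed delta, A statistic, permutation p value)."""
    rng = random.Random(seed)
    observed = mrpp_delta(dist, labels)
    shuffled = list(labels)
    permuted = []
    for _ in range(permutations):
        rng.shuffle(shuffled)
        permuted.append(mrpp_delta(dist, shuffled))
    expected = sum(permuted) / len(permuted)
    a_stat = 1 - observed / expected
    # Small tolerance so ties with the observed value count as "as extreme".
    as_extreme = sum(1 for d in permuted if d <= observed + 1e-12)
    p_value = (1 + as_extreme) / (permutations + 1)
    return observed, a_stat, p_value


# Hypothetical one-dimensional data: two tight, well-separated groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
labels = ["a", "a", "a", "b", "b", "b"]
delta, a_stat, p_value = mrpp(dist, labels)
```

With only six samples there are few distinct groupings, so the p value cannot get very small no matter how clean the separation is, which is worth remembering for small designs.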
Quick history
Alpha and Beta diversity
Species that matter
Pipelines
Future prospects and problems
“PCoA plots are the first
step of a community
analysis, not the last.”
Josh Neufeld
Searching for species that matter
High dimensional data often have too many
features to investigate
solution: identify and study species significantly
associated with categorical metadata

Indicator species (Dufrene-Legendre)
calculates indicator value (fidelity and relative
abundance) of species
Permutation test for significance
Need solution for sparse data - be wary
of groups with small numbers of sites (influence on
permutation tests)
low abundance can artificially inflate indicator values
Specificity
Fidelity
IndVal (Dufrene & Legendre, 1997)
Specificity
Large mean abundance within group relative to summed
mean abundances of other groups

Fidelity
Presence in most or all sites of that group

Groups defined a priori by metadata or
statistical clustering
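Specificity and fidelity as defined above combine into the indicator value by simple multiplication. The sketch below computes both components and IndVal (as a percentage) for one species across groups of sites; the permutation test for significance is not included, and the function name and example abundances are assumptions for illustration.

```python
def indval(abundances, groups):
    """Indicator value of one species for each group of sites.

    abundances: per-site abundance of the species.
    groups: per-site group label, in the same order.
    Returns {group: (specificity, fidelity, indval_percent)}.
    """
    labels = sorted(set(groups))
    means, fidelity = {}, {}
    for g in labels:
        values = [a for a, lab in zip(abundances, groups) if lab == g]
        means[g] = sum(values) / len(values)
        # Fidelity: fraction of the group's sites where the species occurs.
        fidelity[g] = sum(1 for v in values if v > 0) / len(values)
    total = sum(means.values())
    result = {}
    for g in labels:
        # Specificity: group mean relative to the sum of all group means.
        specificity = means[g] / total if total else 0.0
        result[g] = (specificity, fidelity[g], 100 * specificity * fidelity[g])
    return result


# Hypothetical species counts at three "wet" and three "dry" sites.
scores = indval([10, 8, 0, 1, 0, 0], ["wet", "wet", "wet", "dry", "dry", "dry"])
```

The species is highly specific to the wet group but absent from one wet site, so imperfect fidelity pulls its indicator value down to about 63%.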
Simple linear correlations
Metadata    Taxon    R^2 value
mbc    k__Bacteria;p__Planctomycetes;c__Planctomycetia;o__Gemmatales;f__Isosphaeraceae;g__    0.611368489781491
mbc    k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__    0.677209935419981
mbn    k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__    0.64092523702996
soil_depth    k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Intrasporangiaceae;g__    0.669761188668774
mothur: cooccurrence function, measuring whether populations are co-occurring
more frequently than you would expect by chance.
Non-negative Matrix Factorization
NMF as a representation method for portraying
high-dimensional data as a small number of
taxonomic components.
Patterns of co-occurring OTUs can be
described by a smaller number of taxonomic
components.
Each sample represented by the collection of
component taxa, helping identify relationships
between taxa and the environment.
Jonathan Dushoff, McMaster University, Ontario, Canada
SSUnique
SILVA
SILVA
SILVA
SILVA
SILVA
Nakai et al. 2012

Lynch et al. 2012
Quick history
Alpha and Beta diversity
Species that matter
Pipelines
Future prospects and problems
Why pipelines?
Merge and manage (many) disparate techniques
Democratize analysis
improve accessibility

Accelerate pace of innovation, collaboration, and
research
Early synthesis
Early synthesis for numerical microbial ecology
Synthesis of 16S phylogenetics (Woese et al.)
and Hughes (Counting the uncountable)
Numerical ecology for microorganisms

Algorithm development
libshuff, dotur (mothur)

Analysis pipelines
QIIME, mothur
Knight Lab, U. Colorado at Boulder
Predominantly a collection of integrated Python/R
scripts
Many dependencies
easy managed installation:
qiime-deploy
MacQIIME
virtual box and Ubuntu fork
avoid for anything but small runs

Becoming the standard for marker gene studies
integrated analysis and visualization
easy access to broad computational biology toolbox
(Python/R)
Automation and extension
AXIOME and phyloseq
Extend existing technologies (QIIME, mothur, R,
custom)

Layers of abstraction
Automation and rapid re-analysis
Promote reproducible research (iPython, XML,
make)

Implement existing techniques (e.g., MRPP,
Dufrene-Legendre IndVal)
numerical microbial ecology needs to better
incorporate modern statistical theory

Develop and test new techniques
Axiometic
GUI companion for AXIOME
Cross-platform
New implementation in
development

Generates AXIOME file (XML)

xls template
coming soon for
all commands,
sample metadata,
and extra info…
much easier for
everyone.
“QIIME wraps many other software
packages, and these should be cited if
they are used. Any time you're using
tools that QIIME wraps, it is essential
to cite those tools.”
http://qiime.org/index.html
Quick history
Alpha and Beta diversity
Species that matter
Pipelines
Future prospects and problems
The future
As data get bigger, interpretation should be
“hands off”
Move towards hypothesis testing of high-dimensional taxonomic data

Convergence on Galaxy
e.g., QIIME in Galaxy is developing

Further extension to cloud services
e.g., Amazon EC2

Machine learning and data mining
applications
Open-source, web-based platform
Deployed locally or in the cloud
Ongoing development of 16S rRNA gene analysis
Galaxy Tool Shed (available tools)
“The advantages of having large numbers of
samples at shallow coverage (~1,000 sequences
per sample) clearly outweigh having a small
number of samples at greater coverage for many
datasets, suggesting that the focus for future
studies should be on broader sampling that can
reveal association with key biological
parameters rather than on deeper sequencing.”
“….even [phylogenetic beta-diversity]
measures suited to the underlying
mechanism of differentiation may
require deep sequencing to reveal
subtle patterns”
Dr. Donovan Parks
Method standardization
Impossible.
Data storage
Sequence read output is outpacing declines in data storage costs
Federated data?
File formats
e.g., FASTA (difficult to search, difficult to retrieve sequences, not space efficient,
do not ensure data is in correct format, no space for metadata, no absolute
standard)… relational databases?
Software
Free and Open Source enables an experiment to be faithfully replicated
Algorithms
Memory!
Many clustering and phylogenetic inference algorithms scale as n^2
Distributed, parallel, or cloud computing may not be helpful
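A quick back-of-the-envelope calculation shows why quadratic scaling is a memory problem: an all-against-all distance matrix for n sequences holds n(n-1)/2 values even in condensed form. The sketch below assumes 8-byte floating-point values, and the sequence counts in the loop are arbitrary examples.

```python
def distance_matrix_bytes(n_sequences, bytes_per_value=8, condensed=True):
    """Bytes needed to hold all pairwise distances among n sequences."""
    if condensed:
        # Upper triangle only: n * (n - 1) / 2 values.
        n_values = n_sequences * (n_sequences - 1) // 2
    else:
        n_values = n_sequences ** 2
    return n_values * bytes_per_value


for n in (10_000, 100_000, 1_000_000):
    gib = distance_matrix_bytes(n) / 2 ** 30
    print(f"{n:>9,d} sequences: {gib:,.1f} GiB")
```

A tenfold increase in sequences costs roughly a hundredfold increase in memory, so adding machines helps less than avoiding the full matrix in the first place.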
Metadata
What to do with it? How to marry sequence and metadata sets?
We need better metadata integration, not necessarily more/better metadata
What should we be doing?
(take-home messages)

*Surveys are really important for
spatial and temporal mapping
*Hypothesis testing follows (or implicit)
*What species account for treatment effects?
*Who tracks with who? (why=function)
*Who avoids who?
*Are all microorganisms accounted for? (no)
*How can we use this information to
manipulate, manage and predict ecosystems?
What should we be doing?
(take-home messages)

There is no “one way” to analyze 16S rRNA
You need to build a pipeline for you.
If this seems daunting, it is.
If this is not daunting, your hands are dirty.
It’s getting better all the tii-ime.
Helpful resources
Thank you
jneufeld@uwaterloo.ca

 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Último (20)

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

Introduction to 16S rRNA gene multivariate analysis

  • 1. Multivariate exploration of microbial communities Josh D. Neufeld Braunschweig, Germany December, 2013 Andre Masella (MSc): Computer science Michael Lynch (PhD): Taxonomy, phylogenetics, ecology Michael Hall (co-op): mathematics, programming, user friendly! Posted on Slideshare without images and unpublished data
  • 2. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  • 3. Who lives with whom, and why, and where? Data reduction is essential for: a) summarizing large numbers of observations into manageable numbers b) visualizing many interconnected variables in a compact manner Alpha diversity: species richness (and evenness) within a single sample Beta diversity: change in species composition across a collection of samples Gamma diversity: total species richness across an environmental gradient
  • 4. An (abbreviated) history Numerical ecology phenetics and statistical analysis of organismal counts macroecology 16S rRNA gene era sequence analysis as a surrogate for counting mapping of marker to taxonomy NGS enabled synthesis of phenetics, phylogenetics, and numerical ecology
  • 5. Now generate V3-V4 bacterial amplicons (~450 bases) Usually PE 300
  • 6. Assembling paired-end reads dramatically reduces error Corrects mismatches in region of overlap (quality threshold >0.9), set a minimum overlap. Can compare to perfect overlap assembly: “completelymissesthepoint” (name changing soon)
  • 8. 1. p-value threshold 2. parallelizes correctly (both are now added or fixed in PANDAseq)
  • 10. Biological Observation Matrix BIOM file format (MacDonald et al. 2012) Standard recognized by EMP, MG-RAST, VAMPS Based on JSON data interchange format Computational structure in multiple languages “facilitates the efficient handling and storage of large, sparse biological contingency tables” Encapsulates metadata and contingency table (e.g., OTU table) in one file
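A minimal sketch of how a sparse, JSON-based table can hold the contingency table and its metadata in one file. The field names follow the BIOM 1.0 JSON layout as an assumption and only a subset of fields is shown; the OTU IDs, sample IDs, taxonomy strings, and counts are hypothetical.

```python
import json

# Hypothetical OTU table: 3 OTUs (rows) x 2 samples (columns).
# Field names are assumed from the BIOM 1.0 JSON layout; check the format
# specification before relying on them. Only a subset of fields is shown.
table = {
    "format": "Biological Observation Matrix 1.0.0",
    "type": "OTU table",
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [3, 2],
    "rows": [
        {"id": "OTU_1", "metadata": {"taxonomy": ["k__Bacteria", "p__Proteobacteria"]}},
        {"id": "OTU_2", "metadata": {"taxonomy": ["k__Bacteria", "p__Actinobacteria"]}},
        {"id": "OTU_3", "metadata": {"taxonomy": ["k__Bacteria", "p__Planctomycetes"]}},
    ],
    "columns": [
        {"id": "sample_A", "metadata": {"soil_depth": 5}},
        {"id": "sample_B", "metadata": {"soil_depth": 20}},
    ],
    # Sparse triplets [row index, column index, count]; zeros are omitted.
    "data": [[0, 0, 12], [0, 1, 3], [2, 1, 7]],
}


def to_dense(biom):
    """Expand the sparse triplets into a dense rows-by-columns list of lists."""
    n_rows, n_cols = biom["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in biom["data"]:
        dense[row][col] = value
    return dense


# Round-trip through JSON text to show that the table and metadata travel together.
round_tripped = json.loads(json.dumps(table))
dense = to_dense(round_tripped)
```

Storing only the non-zero cells is what keeps large, sparse OTU tables compact.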
  • 11. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  • 12. Who lives with whom, and why, and where? Data reduction is essential for: a) summarizing large numbers of observations into manageable numbers b) visualizing many interconnected variables in a compact manner Alpha diversity: species richness (and evenness) within a single sample Beta diversity: change in species composition across a collection of samples Gamma diversity: total species richness across an environmental gradient
  • 14. α-diversity: Richness and Evenness Shannon index (H’), Estimators (Chao1, ACE), Phylogenetic Diversity Shannon index (H’): richness and evenness Estimators: richness Faith’s PD: phylogenetic richness Stearns et al., 2011 Hughes et al., 2001
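The indices named on this slide can be sketched in a few lines. This is an illustration only: the counts are hypothetical, and the Chao1 form used here is the bias-corrected variant, chosen as an assumption because it stays defined when there are no doubletons.

```python
import math


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i); reflects richness and evenness."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness estimate from singletons and doubletons."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical OTU counts for a single sample.
sample = [10, 5, 2, 1, 1, 0]
h_prime = shannon(sample)
richness_estimate = chao1(sample)  # 5 observed OTUs plus a correction for unseen ones
```

Faith's PD is omitted here because it needs a phylogenetic tree in addition to the counts.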
  • 15. “All biologists who sample natural communities are plagued with the problem of how well a sample reflects a community’s ‘true’ diversity.”
  • 16. Hughes et al. 2001 “Nonparametric estimators show particular promise for microbial data and in some habitats may require sample sizes of only 200 to 1,000 clones to detect richness differences of only tens of species.”
  • 17. [Figure: Google Scholar proportion of “[sequencing tech] AND 16S” results for Sanger, 454, and Illumina, with “Rare biosphere” citations, plotted against time (year), 2000 to 2012.] Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
  • 18. GOALS Understanding of community structure Better alpha-diversity measures Robust beta-diversity measures Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
  • 21. Clustering algorithms (influence alpha diversity primarily) CD-HIT (Li and Godzik, Sanford-Burnham Medical Research Institute) ‘longest-sequence-first’ removal algorithm Fast, many implementations (nucleotide, protein, OTU-specific) Tends to be more stringent than UCLUST UCLUST (R. Edgar, drive5.com) Faster than CD-HIT Tends to generate larger number of low-abundance OTUs Broader range of clustering thresholds “I do not recommend using the UCLUST algorithm or CD-HIT for generating OTUs” – Robert Edgar
  • 23. CROP: Clustering 16S rRNA for OTU Prediction (CROP) “CROP can find clusters based on the natural organization of data without setting a hard cut-off threshold (3%/5%) as required by hierarchical clustering methods.”
  • 24. Chimeras DNA from two or more parent molecules PCR artifact Can easily be classified as a “novel” sequence Increases α-diversity Software ChimeraSlayer, Bellerophon, UCHIME, Pintail Reference database or de novo
  • 25. Classification and taxonomy Ribosomal Database Project (RDP) classifier Naïve Bayesian classifier (James Cole and Tiedje) http://rdp.cme.msu.edu/ pplacer Phylogenetic placement and visualization BLAST The tool we know and love RTAX (UC Berkeley, Rob Knight involved) http://dev.davidsoergel.com/trac/rtax/ mothur (Patrick Schloss) http://www.mothur.org/ SINA (SILVA)
  • 26. RDP classifier Large training sets require active memory management Can be easily run in parallel by breaking up very large data sets Can classify Bacteria/Archaea SSU and fungal LSU (can be re-trained) Algorithm: determine the probability that an unknown query sequence is a member of a known genus (training set), based on the profile of word subsets of known genera. Confidence estimation: the number of times in 100 trials that a genus was selected based on a random subset of words in the query Take home: The higher the diversity (bigger sequence space) of the training set, the better the assignment Longer query = better and more reliable assignment Short reads (i.e., <250 bases) will have lower confidence estimates (cutoff of 0.5 suggested)
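The algorithm and confidence estimate described on this slide can be sketched as a toy word-based naive Bayes classifier. This is not the RDP implementation: the smoothing is simplified (a flat 0.5 pseudocount is an assumption), and the genus names and sequences are hypothetical.

```python
import math
import random

WORD = 8  # word length in bases; 8 is assumed here


def words(seq, k=WORD):
    """Set of overlapping k-base words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(genus_seqs):
    """Map each genus to (number of sequences, {word: sequences containing it})."""
    model = {}
    for genus, seqs in genus_seqs.items():
        tally = {}
        for seq in seqs:
            for w in words(seq):
                tally[w] = tally.get(w, 0) + 1
        model[genus] = (len(seqs), tally)
    return model


def best_genus(model, query_words):
    """Genus with the highest summed log word probability (simplified smoothing)."""
    def log_score(genus):
        n_seqs, tally = model[genus]
        return sum(math.log((tally.get(w, 0) + 0.5) / (n_seqs + 1)) for w in query_words)
    return max(sorted(model), key=log_score)


def classify(model, query, trials=100, seed=0):
    """Assign a genus, then re-assign from random word subsets to estimate confidence."""
    rng = random.Random(seed)
    all_words = sorted(words(query))
    assignment = best_genus(model, all_words)
    subset_size = max(1, len(all_words) // WORD)
    hits = sum(
        best_genus(model, rng.sample(all_words, subset_size)) == assignment
        for _ in range(trials)
    )
    return assignment, hits / trials


# Hypothetical training set with one sequence per genus.
model = train({
    "GenusA": ["ACGTACGTACGTACGTACGT"],
    "GenusB": ["TTGGCCAATTGGCCAATTGG"],
})
genus, confidence = classify(model, "ACGTACGTACGTACGTACGT")
```

The confidence value is the fraction of trials that agree with the full-query assignment, which is why short queries with few words tend to score lower.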
  • 27. Database sources GreenGenes Latest May 2013 SILVA Latest 115 (August 2013) Includes 18S, 23S, 28S, LSU RDP Database Latest 11 (October 2013) GenBank Research-specific e.g., CORE Oral
  • 29. β-diversity Visualization (ordination) versus hypothesis testing (MRPP, indicator species analysis) Many more algorithms out there for exploration and statistical testing mostly through widely used R packages vegan (Community Ecology Package) labdsv (Ordination and Multivariate Analysis for Ecology) ape (Analyses of Phylogenetics and Evolution) picante (community analyses etc.)
  • 30. Visualization (ordination) Complementary to data clustering looks for discontinuities Ordination extracts main trends as continuous axes analysis of the square matrix derived from the OTU table Non-parametric, unconstrained ordination methods most widely used (and best suited) methods that can work directly on a square matrix An appropriate metric is required to derive this square matrix many options...
  • 31. Metrics Ordination is essentially reducing dimensionality first requirement: accurately model differences among samples Models are *really* important. Examples include: OTU presence/absence (Dice, Jaccard); OTU abundance (Bray-Curtis); Phylogenetic (UniFrac). “all models are wrong, some are useful” - G.E. Box “You can't publish anything without a PCoA plot anymore, but METRICS used to draw plot important.” - Susan Huse
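To make the difference between presence/absence and abundance metrics concrete, here is a small sketch of Jaccard distance and Bray-Curtis dissimilarity. The sample vectors are hypothetical OTU counts.

```python
def jaccard_distance(a, b):
    """1 - (shared OTUs / OTUs present in either sample); ignores abundance."""
    present_a = {i for i, count in enumerate(a) if count > 0}
    present_b = {i for i, count in enumerate(b) if count > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Sum of absolute count differences divided by the total count of both samples."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def distance_matrix(samples, metric):
    """Square matrix of pairwise distances, the input that ordination works on."""
    return [[metric(s, t) for t in samples] for s in samples]


# Hypothetical samples with the same OTUs present but different abundances.
sample_1 = [10, 0, 5, 5]
sample_2 = [1, 0, 1, 18]
same_membership = jaccard_distance(sample_1, sample_2)  # identical OTU membership
different_structure = bray_curtis(sample_1, sample_2)   # abundances differ
square = distance_matrix([sample_1, sample_2], bray_curtis)
```

The two samples are indistinguishable under Jaccard yet clearly separated under Bray-Curtis, which is the sense in which the choice of metric shapes the plot.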
  • 32. Metrics: UniFrac A distance measure comparing multiple communities using phylogenetic information Requires sequence alignment and tree-building PyNAST, MUSCLE, Infernal Time-consuming and susceptible to poor phylogenetic inference (does it matter?) Weighted (abundance) ecological features related to abundance Unweighted ecological features related to taxonomic presence/absence
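A minimal sketch of the unweighted variant: the fraction of total branch length that leads to tips found in only one of the two communities. The tree representation (a list of branch lengths with the set of tips below each branch) and the tip names are assumptions made for illustration; real tools work from an alignment and an inferred tree.

```python
def unweighted_unifrac(branches, community_a, community_b):
    """Unique branch length divided by all branch length observed in either community."""
    unique = 0.0
    shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & community_a)
        in_b = bool(tips & community_b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
    observed = unique + shared
    return unique / observed if observed else 0.0


# Hypothetical tree ((t1:1,t2:1):1,(t3:1,t4:1):1), one entry per branch.
branches = [
    (1.0, {"t1"}), (1.0, {"t2"}), (1.0, {"t1", "t2"}),
    (1.0, {"t3"}), (1.0, {"t4"}), (1.0, {"t3", "t4"}),
]
no_overlap = unweighted_unifrac(branches, {"t1", "t2"}, {"t3", "t4"})
partial_overlap = unweighted_unifrac(branches, {"t1", "t3"}, {"t2", "t3"})
```

The weighted variant additionally scales each branch by the difference in relative abundance below it, which is not shown here.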
  • 33. Ordination example 1 (of many): Principal Coordinates Analysis Classical Multidimensional Scaling (MDS; Gower 1966) Procedure: based on eigenvectors position objects in low-dimensional space while preserving distance relationships as well as possible highly flexible can choose among many association measures In microbial ecology, used for visualizing phylogenetic or count-based distances Consistent visual output for given distance matrix Include variance explained (%) on Axis 1 and 2
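The eigenvector procedure can be sketched with NumPy (an assumed dependency): double-centre the squared distances, take the leading eigenvectors, and report the share of variance on each axis. The distance matrix below is a hypothetical example of three samples lying on a line.

```python
import numpy as np


def pcoa(dist, n_axes=2):
    """Classical MDS: coordinates and proportion of variance explained per axis."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (d ** 2) @ centering  # double-centred matrix
    eigvals, eigvecs = np.linalg.eigh(gower)
    order = np.argsort(eigvals)[::-1]                # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = eigvals > 1e-12
    explained = eigvals[:n_axes] / eigvals[positive].sum()
    # Negative eigenvalues (possible with non-Euclidean metrics) are clipped to zero.
    coords = eigvecs[:, :n_axes] * np.sqrt(np.clip(eigvals[:n_axes], 0, None))
    return coords, explained


# Hypothetical distances between samples sitting at positions 0, 1, and 3 on a line.
line_dist = [[0, 1, 3],
             [1, 0, 2],
             [3, 2, 0]]
coords, explained = pcoa(line_dist)
```

Because the result depends only on the distance matrix, repeated runs give the same plot up to the sign of each axis.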
  • 34. Ordination example 2 (of many): Non-metric Multidimensional Scaling Ordination not based on eigenvectors Does not preserve exact distances among objects attempts to preserve ordering of samples (“ranks”) Procedure: iterative, tries to position the objects in a few (2-3) dimensions in such a way that minimizes the “stress” how well does the new ranked distribution of points represent the original distances in the association matrix? Can express as R² on axes 1 and 2. the adjustment goes on until the stress value reaches a local minimum (heuristic solution) NMDS often represents distance relationships better than PCoA in the same number of dimensions Susceptible to the “local minimum issue”, and therefore should have strong starting point (e.g., PCoA) or many permutations You won't get the same result each time you run the analysis. Try several runs until you are comfortable with the result.
  • 35. Do my treatments separate?
  • 36. Beta-diversity: Hypothesis testing Multiple methods, implemented in QIIME, mothur, AXIOME e.g., MRPP, adonis, NP-MANOVA (perMANOVA), ANOSIM Are treatment effects significant? Because these are predominantly nonparametric methods, tests for significance rely on testing by permutation Let's focus on MRPP
  • 37. Multiresponse Permutation Procedures Compare intragroup average distances with the average distances that would have resulted from all the other possible combinations T statistic: more negative with increasing group separation (T>-10 common for ecology) A statistic: Degree of scatter within groups (A=1 when all points fall on top of one another) p value: likelihood of similar separation with randomized data.
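A rough sketch of the procedure: compute the group-size-weighted mean within-group distance, then compare it with the same quantity under shuffled group labels. Estimating the expected value and its spread from the permutations (rather than analytically) is an assumption of this sketch, and the sample positions and labels are hypothetical.

```python
import random
import statistics
from itertools import combinations


def weighted_delta(dist, labels):
    """Mean within-group distance, weighted by each group's share of samples."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    delta = 0.0
    for members in groups.values():
        pairs = list(combinations(members, 2))  # each group needs at least 2 samples
        mean_dist = sum(dist[i][j] for i, j in pairs) / len(pairs)
        delta += (len(members) / len(labels)) * mean_dist
    return delta


def mrpp(dist, labels, permutations=999, seed=0):
    """Return (T, A, p) with the null distribution built by label permutation."""
    rng = random.Random(seed)
    observed = weighted_delta(dist, labels)
    shuffled = list(labels)
    null = []
    for _ in range(permutations):
        rng.shuffle(shuffled)
        null.append(weighted_delta(dist, shuffled))
    expected = statistics.mean(null)
    t_stat = (observed - expected) / statistics.stdev(null)  # more negative = more separation
    a_stat = 1 - observed / expected                         # 1 when within-group distance is zero
    p_value = (sum(d <= observed + 1e-12 for d in null) + 1) / (permutations + 1)
    return t_stat, a_stat, p_value


# Hypothetical one-dimensional samples forming two well-separated groups.
positions = [0, 1, 2, 3, 10, 11, 12, 13]
labels = ["control"] * 4 + ["treated"] * 4
dist = [[abs(a - b) for b in positions] for a in positions]
t_stat, a_stat, p_value = mrpp(dist, labels)
```

With real data the distance matrix would come from a metric such as Bray-Curtis or UniFrac rather than from positions on a line.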
  • 38. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  • 39. “PCoA plots are the first step of a community analysis, not the last.” Josh Neufeld
  • 40. Searching for species that matter High dimensional data often have too many features to investigate solution: identify and study species significantly associated with categorical metadata Indicator species (Dufrene-Legendre) calculates indicator value (fidelity and relative abundance) of species Permutation test for significance Need solution for sparse data - be wary of groups with small numbers of sites (influence on permutation tests) low abundance can artificially inflate indicator values
  • 42. IndVal (Dufrene & Legendre, 1997) Specificity Large mean abundance within group relative to summed mean abundances of other groups Fidelity Presence in most or all sites of that group Groups defined a priori by metadata or statistical clustering
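The two components combine into an indicator value as specificity × fidelity × 100. A minimal sketch for one species follows, with hypothetical abundances and group labels; the permutation test for significance is left out.

```python
def indval(abundances, groups):
    """Indicator value of one species for each group of sites (0 to 100)."""
    by_group = {}
    for count, group in zip(abundances, groups):
        by_group.setdefault(group, []).append(count)
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    total = sum(means.values())
    result = {}
    for g, v in by_group.items():
        specificity = means[g] / total if total else 0.0  # share of summed mean abundance
        fidelity = sum(1 for c in v if c > 0) / len(v)    # fraction of sites occupied
        result[g] = 100 * specificity * fidelity
    return result


# Hypothetical counts of one OTU across three wet and three dry sites.
counts = [8, 12, 10, 0, 0, 2]
site_groups = ["wet", "wet", "wet", "dry", "dry", "dry"]
values = indval(counts, site_groups)
```

Here the OTU is abundant in every wet site and nearly absent from dry sites, so it scores highly as a wet-site indicator.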
  • 43. Simple linear correlations
      Metadata | Taxon | R^2 value
      mbc | k__Bacteria;p__Planctomycetes;c__Planctomycetia;o__Gemmatales;f__Isosphaeraceae;g__ | 0.611368489781491
      mbc | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.677209935419981
      mbn | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.64092523702996
      soil_depth | k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Intrasporangiaceae;g__ | 0.669761188668774
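An R^2 value of this kind can be computed as the squared Pearson correlation between a taxon's abundance and a metadata variable across samples. The sketch below assumes that definition; the variable names and numbers are hypothetical and are not the data behind the values above.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-sample values: a metadata variable and a taxon's relative abundance.
metadata_values = [1.0, 2.0, 3.0, 4.0]
taxon_abundance = [0.02, 0.04, 0.06, 0.08]
perfect_fit = r_squared(metadata_values, taxon_abundance)
weak_fit = r_squared([1, 2, 3], [1, 3, 2])
```

Screening every taxon against every metadata column this way produces many comparisons, so a correction for multiple testing is worth considering before interpreting individual hits.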
  • 44. mothur: cooccurrence function, measuring whether populations are co-occurring more frequently than you would expect by chance.
  • 45. Non-negative Matrix Factorization NMF as a representation method for portraying high-dimensional data as a small number of taxonomic components. Patterns of co-occurring OTUs can be described by a smaller number of taxonomic components. Each sample represented by the collection of component taxa, helping identify relationships between taxa and the environment. Jonathan Dushoff, McMaster University, Ontario, Canada
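As an illustration of the idea, the sketch below factorizes a samples-by-OTUs table into a small number of non-negative components using multiplicative updates. The update rule, NumPy dependency, iteration count, and the toy table are all assumptions of this sketch and not necessarily the method used in the work cited on the slide.

```python
import numpy as np


def nmf(v, n_components, iterations=500, seed=0, eps=1e-9):
    """Approximate v (samples x OTUs) as w @ h with non-negative w and h."""
    rng = np.random.default_rng(seed)
    n_samples, n_otus = v.shape
    w = rng.random((n_samples, n_components)) + eps  # sample weights per component
    h = rng.random((n_components, n_otus)) + eps     # OTU profile of each component
    for _ in range(iterations):
        # Multiplicative updates keep every entry non-negative.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h


# Hypothetical table: 4 samples x 5 OTUs built from two sets of co-occurring OTUs.
components = np.array([[5.0, 3.0, 0.0, 0.0, 1.0],
                       [0.0, 0.0, 4.0, 6.0, 1.0]])
weights = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
otu_table = weights @ components

w, h = nmf(otu_table, n_components=2)
relative_error = np.linalg.norm(otu_table - w @ h) / np.linalg.norm(otu_table)
```

Each row of `h` can be read as a group of co-occurring OTUs, and each row of `w` as how strongly a sample expresses each group.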
  • 54. SILVA
  • 55. SILVA
  • 56. SILVA
  • 57. SILVA
  • 58. SILVA
  • 59. Nakai et al. 2012 Lynch et al. 2012
  • 60. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  • 61. Why pipelines? Merge and manage (many) disparate techniques Democratize analysis improve accessibility Accelerate pace of innovation, collaboration, and research
  • 62. Early synthesis Early synthesis for numerical microbial ecology Synthesis of 16S phylogenetics (Woese et al.) and Hughes (Counting the uncountable) Numerical ecology for microorganisms Algorithm development libshuff, dotur (mothur) Analysis pipelines QIIME, mothur
  • 63. Knight Lab, U. Colorado at Boulder Predominantly a collection of integrated Python/R scripts Many dependencies easy managed installation: qiime-deploy MacQIIME virtual box and Ubuntu fork avoid for anything but small runs Becoming the standard for marker gene studies integrated analysis and visualization easy access to broad computational biology toolbox (Python/R)
  • 64. Automation and extension AXIOME and phyloseq Extend existing technologies (QIIME, mothur, R, custom) Layers of abstraction Automation and rapid re-analysis Promote reproducible research (iPython, XML, make) Implement existing techniques (e.g., MRPP, Dufrene-Legendre IndVal) numerical microbial ecology needs to better incorporate modern statistical theory Develop and test new techniques
  • 67. Axiometic GUI companion for AXIOME Cross-platform New implementation in development Generates AXIOME file (XML) xls template coming soon for all commands, sample metadata, and extra info… much easier for everyone.
  • 68. “QIIME wraps many other software packages, and these should be cited if they are used. Any time you're using tools that QIIME wraps, it is essential to cite those tools.” http://qiime.org/index.html
  • 69. Quick history Alpha and Beta diversity Species that matter Pipelines Future prospects and problems
  • 70. The future As data get bigger, interpretation should be “hands off” Move towards hypothesis testing of high-dimension taxonomic data Convergence on Galaxy e.g., QIIME in Galaxy is developing Further extension to cloud services e.g., Amazon EC2 Machine learning and data mining applications
  • 71. Open-source, web-based platform Deployed locally or in the cloud Ongoing development of 16S rRNA gene analysis
  • 73. “The advantages of having large numbers of samples at shallow coverage (~1,000 sequences per sample) clearly outweigh having a small number of samples at greater coverage for many datasets, suggesting that the focus for future studies should be on broader sampling that can reveal association with key biological parameters rather than on deeper sequencing.”
  • 74. “….even [phylogenetic beta-diversity] measures suited to the underlying mechanism of differentiation may require deep sequencing to reveal subtle patterns” Dr. Donovan Parks
  • 75. Method standardization Impossible. Data storage Sequence reads outpacing data storage costs Federated data? File formats e.g., FASTA (difficult to search, difficult to retrieve sequences, not space efficient, does not ensure data is in correct format, no space for metadata, no absolute standard)… relational databases? Software Free and Open Source enables an experiment to be faithfully replicated Algorithms Memory! Many clustering and phylogenetic inference algorithms scale as n² Distributed, parallel, or cloud computing may not be helpful Metadata What to do with it? How to marry sequence and metadata sets? We need better metadata integration, not necessarily more/better metadata
  • 76. What should we be doing? (take-home messages) *Surveys are really important for spatial and temporal mapping *Hypothesis testing follows (or implicit) *What species account for treatment effects? *Who tracks with who? (why=function) *Who avoids who? *Are all microorganisms accounted for? (no) *How can we use this information to manipulate, manage and predict ecosystems?
  • 77. What should we be doing? (take-home messages) There is no “one way” to analyze 16S rRNA You need to build a pipeline for you. If this seems daunting, it is. If this is not daunting, your hands are dirty. It’s getting better all the tii-ime.