Introduction to 16S rRNA gene multivariate analysis

Multivariate exploration of microbial communities
Josh D. Neufeld
Braunschweig, Germany
December, 2013

Andre Masella (MSc): Computer science
Michael Lynch (PhD): Taxonomy, phylogenetics, ecology
Michael Hall (co-op): mathematics, programming, user friendly!
Posted on Slideshare without images and unpublished data

Quick history
Alpha and Beta diversity
Species that matter
Pipelines
Future prospects and problems

Who lives with whom, and why, and where?
Data reduction is essential for:
a) summarizing large numbers of observations
into manageable numbers
b) visualizing many interconnected variables in a
compact manner
Alpha diversity: species richness (and evenness)
within a single sample
Beta diversity: change in species composition
across a collection of samples
Gamma diversity: total species richness across an
environmental gradient
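
The three definitions above can be sketched on a toy table. This is a minimal illustration with made-up OTU counts (sample and OTU names are hypothetical). Beta diversity is summarized here with Whittaker's multiplicative form (gamma divided by mean alpha), which is one possible choice and is not named on the slide.

```python
# Toy OTU counts per sample (hypothetical data, for illustration only).
samples = {
    "soil_A": {"otu1": 10, "otu2": 5, "otu3": 0, "otu4": 1},
    "soil_B": {"otu1": 0, "otu2": 7, "otu3": 3, "otu4": 0},
    "soil_C": {"otu1": 2, "otu2": 0, "otu3": 0, "otu4": 0},
}

def observed_otus(counts):
    """Return the set of OTUs with a non-zero count in one sample."""
    return {otu for otu, n in counts.items() if n > 0}

# Alpha diversity: richness within each single sample.
alpha = {name: len(observed_otus(c)) for name, c in samples.items()}

# Gamma diversity: total richness across the whole collection.
gamma = len(set().union(*(observed_otus(c) for c in samples.values())))

# Beta diversity (Whittaker's multiplicative form): gamma / mean alpha.
mean_alpha = sum(alpha.values()) / len(alpha)
beta_whittaker = gamma / mean_alpha

print(alpha, gamma, beta_whittaker)
```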

An (abbreviated) history
Numerical ecology
phenetics and statistical analysis of organismal
counts
macroecology

16S rRNA gene era
sequence analysis as a surrogate for counting
mapping of marker to taxonomy

NGS enabled synthesis of phenetics,
phylogenetics, and numerical ecology

Now generate V3-V4 bacterial amplicons (~450 bases)
Usually PE 300

Assembling paired-end
reads dramatically
reduces error
Corrects mismatches in
region of overlap
(quality threshold >0.9),
set a minimum overlap.
Can compare to perfect
overlap assembly:
“completelymissesthepoint”
(name changing soon)
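
The overlap idea above can be sketched as follows. This is a simplified illustration and not PANDAseq's algorithm: it accepts the longest overlap with few mismatches and keeps the higher-quality base at each mismatch, rather than computing a probabilistic quality score. The function names, the `max_mismatch_frac` parameter, and the test reads are all assumptions made for the example.

```python
def reverse_complement(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def merge_pair(fwd, fwd_q, rev, rev_q, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a forward read with the reverse complement of its mate.

    fwd_q / rev_q are per-base quality scores (higher = more reliable).
    Returns the merged sequence, or None if no acceptable overlap exists.
    """
    rc = reverse_complement(rev)
    rc_q = rev_q[::-1]
    # Try the longest overlap first: suffix of fwd against prefix of rc.
    for olen in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        a, b = fwd[-olen:], rc[:olen]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches <= max_mismatch_frac * olen:
            qa, qb = fwd_q[-olen:], rc_q[:olen]
            # In the overlap, keep whichever base has the higher quality.
            overlap = "".join(
                x if p >= q else y for x, y, p, q in zip(a, b, qa, qb)
            )
            return fwd[:-olen] + overlap + rc[olen:]
    return None

# Hypothetical 27-base amplicon read from both ends with 20-base reads.
amplicon = "ACGTTGCAAGCTTAGGCATCGATCGGA"
fwd_bases = list(amplicon[:20])
fwd_bases[15] = "T"              # simulated low-quality sequencing error
fwd = "".join(fwd_bases)
fwd_q = [30] * 20
fwd_q[15] = 5
rev = reverse_complement(amplicon[7:])
rev_q = [30] * 20

merged = merge_pair(fwd, fwd_q, rev, rev_q)
print(merged == amplicon)
```

The error in the forward read falls inside the 13-base overlap, so the higher-quality base from the reverse read replaces it.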

PANDAseq
>30x faster
than next
fastest
alternative
assembler

1. p-value threshold
2. parallelizes correctly
(both are now
added or fixed
in PANDAseq)

Biological Observation Matrix
BIOM file format (McDonald et al. 2012)
Standard recognized by EMP, MG-RAST,
VAMPS
Based on JSON data interchange format
Computational structure in multiple languages

“facilitates the efficient handling and
storage of large, sparse biological
contingency tables”
Encapsulates metadata and contingency
table (e.g., OTU table) in one file
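
A rough sketch of how a sparse contingency table and its metadata can sit in one JSON document. The field names follow my reading of the BIOM 1.0 JSON layout and should be checked against the specification; several fields the format expects (for example `format_url`, `generated_by`, `date`) are omitted here, and all IDs and metadata values are hypothetical.

```python
import json

# Minimal BIOM-like table: 3 OTUs (rows) x 2 samples (columns).
biom_like = {
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "type": "OTU table",
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [3, 2],
    "rows": [
        {"id": "OTU_1", "metadata": {"taxonomy": ["k__Bacteria", "p__Proteobacteria"]}},
        {"id": "OTU_2", "metadata": {"taxonomy": ["k__Bacteria", "p__Actinobacteria"]}},
        {"id": "OTU_3", "metadata": None},
    ],
    "columns": [
        {"id": "Sample_A", "metadata": {"soil_depth": 10}},
        {"id": "Sample_B", "metadata": {"soil_depth": 30}},
    ],
    # Sparse entries: [row index, column index, count]; zeros are not stored.
    "data": [[0, 0, 5], [1, 1, 12], [2, 0, 3], [2, 1, 1]],
}

def to_dense(table):
    """Expand the sparse triplets into a dense rows-by-columns list."""
    n_rows, n_cols = table["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in table["data"]:
        dense[row][col] = value
    return dense

# Round-trip through JSON text, as a file on disk would be.
loaded = json.loads(json.dumps(biom_like))
dense = to_dense(loaded)
print(dense)
```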

Diversity
(richness and evenness)

α-diversity: Richness and
Evenness

Shannon index (H’), Estimators (Chao1, ACE), Phylogenetic Diversity

Shannon index (H’): richness and evenness
Estimators: richness
Faith’s PD: phylogenetic richness
Stearns et al., 2011
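
A minimal sketch of two of the measures listed above, computed from a vector of OTU counts for one sample. Shannon uses the natural logarithm, and Chao1 is written in its bias-corrected form; both choices are assumptions, since the slide does not specify a variant. The count vectors are hypothetical.

```python
import math

def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over observed OTUs."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of singleton and doubleton OTUs.
    """
    s_obs = sum(1 for n in counts if n > 0)
    f1 = sum(1 for n in counts if n == 1)
    f2 = sum(1 for n in counts if n == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

even_sample = [10, 10, 10, 10]      # perfectly even, no rare OTUs
skewed_sample = [5, 1, 1, 1, 2]     # three singletons, one doubleton

print(shannon(even_sample), chao1(even_sample))
print(shannon(skewed_sample), chao1(skewed_sample))
```

For the even sample, H' equals ln(4) and Chao1 equals the observed richness, because there are no singletons to suggest unseen OTUs.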

Hughes et al., 2001

“All biologists who sample natural
communities are plagued with the
problem of how well a sample reflects a
community’s ‘true’ diversity.”

Hughes et al. 2001
“Nonparametric estimators show particular promise for microbial data and in
some habitats may require sample sizes of only 200 to 1,000 clones to detect
richness differences of only tens of species.”

[Figure: Google Scholar proportion of "[Sequencing tech] AND 16S" results (Sanger, 454, Illumina) and “Rare biosphere” citations, plotted against time (year), 2000 to 2012.]
Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.

GOALS
Understanding of community structure
Better alpha-diversity measures
Robust beta-diversity measures

Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.

Clustering algorithms
(influence alpha diversity primarily)

CD-HIT (Li and Godzik, Sanford-Burnham Medical
Research Institute)
‘longest-sequence-first’ removal algorithm
Fast, many implementations (nucleotide, protein, OTU-specific)
Tends to be more stringent than UCLUST

UCLUST (R. Edgar, drive5.com)
Faster than CD-HIT
Tends to generate larger number of low-abundance OTUs
Broader range of clustering thresholds

"I do not recommend using the UCLUST algorithm or
CD-HIT for generating OTUs” – Robert Edgar

CROP: Clustering 16S rRNA for OTU Prediction
“CROP can find clusters based on the natural organization of data without setting a
hard cut-off threshold (3%/5%) as required by hierarchical clustering methods.”

Chimeras
DNA from two or more parent molecules
PCR artifact
Can easily be classified as a “novel” sequence
Increases α-diversity

Software
ChimeraSlayer, Bellerophon, UCHIME, Pintail

Reference database or de novo

Classification and taxonomy
Ribosomal Database Project (RDP) classifier
Naïve Bayesian classifier (James Cole and Tiedje)
http://rdp.cme.msu.edu/

pplacer
Phylogenetic placement and visualization

BLAST
The tool we know and love

RTAX (UC Berkeley, Rob Knight involved)
http://dev.davidsoergel.com/trac/rtax/

mothur (Patrick Schloss)
http://www.mothur.org/

SINA (SILVA)

RDP classifier
Large training sets require active memory management
Can be easily run in parallel by breaking up very large data sets
Can classify Bacteria/Archaea SSU and fungal LSU (can be re-trained)
Algorithm:
determine the probability that an unknown query sequence is a member of a
known genus (training set), based on the profile of word subsets of known
genera.

Confidence estimation:
the number of times in 100 trials that a genus was selected based on a
random subset of words in the query

Take home:
The higher the diversity (bigger sequence space) of the training set, the
better the assignment
Longer query = better and more reliable assignment
Short reads (i.e., <250 base) will have lower confidence estimates (cutoff of
0.5 suggested)
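
The word-based assignment and the bootstrap confidence described above can be sketched with a toy classifier. This is not the RDP classifier's code: the word size of 8, the subset size (one eighth of the query words), and the simple add-0.5 smoothing are assumptions made for the sketch, and the genus names and sequences are hypothetical.

```python
import math
import random
from collections import Counter

K = 8  # word size (assumed here)

def words(seq, k=K):
    """Set of overlapping k-base words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(training):
    """training: dict of genus -> list of sequences."""
    model = {}
    for genus, seqs in training.items():
        counts = Counter()
        for s in seqs:
            counts.update(words(s))  # number of sequences containing each word
        model[genus] = (counts, len(seqs))
    return model

def best_genus(model, query_words):
    def log_score(genus):
        counts, n = model[genus]
        # Simplified smoothing so unseen words do not give log(0).
        return sum(math.log((counts[w] + 0.5) / (n + 1)) for w in query_words)
    return max(sorted(model), key=log_score)

def classify(model, query, trials=100, seed=0):
    """Return (genus, confidence), confidence = fraction of bootstrap trials
    in which a random subset of query words selects the same genus."""
    rng = random.Random(seed)
    all_words = sorted(words(query))
    assignment = best_genus(model, all_words)
    subset_size = max(1, len(all_words) // K)
    hits = sum(
        best_genus(model, rng.sample(all_words, subset_size)) == assignment
        for _ in range(trials)
    )
    return assignment, hits / trials

seq_a = "ACGATTACAGGATCAATGCAGTACCATAGGACTAGCATGA"
seq_b = "CGTTGCCTGGTCTGCGTTCCGTGCTTGCGGTCCTGTCG"
model = train({"Genus_A": [seq_a], "Genus_B": [seq_b]})
assignment, confidence = classify(model, seq_a[5:35])
print(assignment, confidence)
```

A shorter query yields fewer words per bootstrap subset, which is one way to see why short reads tend to receive lower confidence estimates.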

Database sources
GreenGenes
Latest May 2013

SILVA
Latest 115 (August 2013)
Includes 18S, 23S, 28S, LSU

RDP Database
Latest 11 (October 2013)

GenBank
Research-specific
e.g., CORE Oral

β-diversity
Visualization (ordination) versus hypothesis
testing (MRPP, indicator species analysis)
Many more algorithms out there for
exploration and statistical testing
mostly through widely used R packages
vegan (Community Ecology Package)
labdsv (Ordination and Multivariate Analysis for
Ecology)
ape (Analyses of Phylogenetics and Evolution)
picante (community analyses etc.)

Visualization (ordination)
Complementary to data clustering
looks for discontinuities

Ordination extracts main trends as continuous
axes
analysis of the square matrix derived from the
OTU table

Non-parametric, unconstrained ordination
methods most widely used (and best suited)
methods that can work directly on a square matrix

An appropriate metric is required to derive
this square matrix
many options...

Metrics
Ordination is essentially reducing dimensionality
first requirement: accurately model differences
among samples
Models are *really* important. Examples include:
OTU presence/absence
Dice, Jaccard
OTU abundance
Bray-Curtis
Phylogenetic
UniFrac

“all models are wrong, some are useful”
- G.E. Box
“You can't publish anything without a PCoA plot anymore, but METRICS used to draw plot important.”
- Susan Huse
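
To make the choice of metric concrete, here is a minimal sketch of one presence/absence measure (Jaccard) and one abundance measure (Bray-Curtis), plus the square sample-by-sample matrix that ordination works on. The count vectors are hypothetical.

```python
def jaccard_distance(x, y):
    """1 - (shared OTUs / OTUs present in either sample)."""
    present_x = {i for i, n in enumerate(x) if n > 0}
    present_y = {i for i, n in enumerate(y) if n > 0}
    union = present_x | present_y
    if not union:
        return 0.0
    return 1 - len(present_x & present_y) / len(union)

def bray_curtis(x, y):
    """Sum of absolute count differences over the sum of all counts."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0

def distance_matrix(table, metric):
    """Square, symmetric matrix of pairwise distances among samples."""
    return [[metric(r, s) for s in table] for r in table]

# Rows are samples, columns are OTUs (hypothetical counts).
otu_table = [
    [10, 0, 5, 0],
    [5, 5, 0, 0],
    [0, 0, 0, 8],
]
jac = distance_matrix(otu_table, jaccard_distance)
bc = distance_matrix(otu_table, bray_curtis)
print(jac[0][1], bc[0][1])
```

The first two samples share one of three OTUs (Jaccard distance 2/3) but differ less by abundance (Bray-Curtis 0.6), so the same data can give different pictures depending on the metric.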

Metrics: UniFrac
A distance measure comparing multiple
communities using phylogenetic information
Requires sequence alignment and tree-building
PyNAST, MUSCLE, Infernal
Time-consuming and susceptible to poor phylogenetic
inference (does it matter?)

Weighted (abundance)
ecological features related to
abundance

Unweighted
ecological features related to
taxonomic presence/absence
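
A minimal sketch of the unweighted form, assuming the tree has already been built and is given as a list of branches, each with a length and the set of tips below it. The tree, tip names, and communities are hypothetical; the weighted (abundance) variant is not shown.

```python
# Hypothetical 4-tip tree: four terminal branches and two internal branches.
branches = [
    (1.0, {"t1"}), (1.0, {"t2"}), (1.0, {"t3"}), (1.0, {"t4"}),
    (0.5, {"t1", "t2"}), (0.5, {"t3", "t4"}),
]

def unweighted_unifrac(branches, comm_a, comm_b):
    """Fraction of observed branch length that leads to only one community."""
    unique = 0.0
    total = 0.0
    for length, tips in branches:
        in_a = bool(tips & comm_a)
        in_b = bool(tips & comm_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unique += length
    return unique / total if total else 0.0

# Communities are the sets of tips (taxa) present in each sample.
disjoint = unweighted_unifrac(branches, {"t1", "t2"}, {"t3", "t4"})
overlapping = unweighted_unifrac(branches, {"t1", "t3"}, {"t2", "t3"})
print(disjoint, overlapping)
```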

Ordination example 1 (of many):

Principal Coordinates Analysis
Classical Multidimensional Scaling (MDS; Gower 1966)
Procedure:
based on eigenvectors
position objects in low-dimensional space while preserving
distance relationships as well as possible

highly flexible
can choose among many association measures

In microbial ecology, used for visualizing
phylogenetic or count-based distances
Consistent visual output for given distance matrix
Include variance explained (%) on Axis 1 and 2
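
The eigenvector procedure above can be sketched in plain Python. This is a teaching sketch under assumptions: it uses Gower double-centring followed by power iteration with deflation, which is adequate only when the leading eigenvalues are positive (non-Euclidean distances can produce negative eigenvalues that this sketch does not handle). In practice a linear algebra library would be used instead.

```python
import math

def pcoa(dist, n_axes=2, iters=500):
    """Return (coords, explained) for a symmetric distance matrix."""
    n = len(dist)
    # Gower double-centring of -0.5 * squared distances.
    a = [[-0.5 * d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
         for i in range(n)]
    trace = sum(b[i][i] for i in range(n))
    axes, explained = [], []
    for axis in range(n_axes):
        v = [math.sin(i + 1 + axis) for i in range(n)]  # deterministic start
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        vv = sum(x * x for x in v)
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(x * y for x, y in zip(v, bv)) / vv  # Rayleigh quotient
        if lam <= 1e-12:
            axes.append([0.0] * n)
            explained.append(0.0)
            continue
        unit = [x / math.sqrt(vv) for x in v]
        axes.append([math.sqrt(lam) * x for x in unit])
        explained.append(lam / trace)
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * unit[i] * unit[j]
    coords = [[axes[k][i] for k in range(n_axes)] for i in range(n)]
    return coords, explained

# Three samples lying on a line at positions 0, 1 and 3.
dist = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
coords, explained = pcoa(dist)
print(coords, explained)
```

Because the three samples lie on a line, the first axis recovers their spacing and accounts for essentially all of the variance; the `explained` values are the percentages to report on the axis labels.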

Ordination example 2 (of many):

Non-metric Multidimensional Scaling
Ordination not based on eigenvectors
Does not preserve exact distances among objects
attempts to preserve ordering of samples (“ranks”)

Procedure:
iterative, tries to position the objects in a few (2-3) dimensions in such a way
that minimizes the “stress”
how well does the new ranked distribution of points represent the original
distances in the association matrix? Can express as R^2 on axes 1 and 2.
the adjustment goes on until the stress value reaches a local minimum
(heuristic solution)

NMDS often represents distance relationships better than PCoA in the
same number of dimensions
Susceptible to the “local minimum issue”, and therefore should have
strong starting point (e.g., PCoA) or many permutations
You won't get the same result each time you run the analysis. Try several
runs until you are comfortable with the result.

Beta-diversity: Hypothesis testing
Multiple methods, implemented in QIIME,
mothur, AXIOME
e.g., MRPP, adonis, NP-MANOVA (perMANOVA),
ANOSIM
Are treatment effects significant?

Because these are predominantly
nonparametric methods, tests for
significance rely on testing by permutation
Let's focus on MRPP

Multiresponse Permutation Procedures
Compare intragroup average distances with the
average distances that would have resulted from all
the other possible combinations
T statistic: more negative with
increasing group separation
(T>-10 common for ecology)
A statistic: Degree of scatter
within groups (A=1 when all
points fall on top of one another)
p value: likelihood of similar
separation with randomized
data.
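
A rough permutation-based sketch of the three statistics above. Assumptions: groups are weighted by n_i/N, the expected delta and its spread are estimated from the permutations themselves (published implementations typically use an analytical approximation for T), and the data are hypothetical one-dimensional points.

```python
import random
import statistics

def mrpp(dist, groups, permutations=999, seed=0):
    """Return (delta, T, A, p) for a distance matrix and group labels."""
    n = len(groups)

    def delta(labels):
        # Weighted mean of within-group average pairwise distances.
        total = 0.0
        for g in sorted(set(labels)):
            idx = [i for i, lab in enumerate(labels) if lab == g]
            pairs = [(i, j) for k, i in enumerate(idx) for j in idx[k + 1:]]
            if pairs:
                mean_within = sum(dist[i][j] for i, j in pairs) / len(pairs)
                total += (len(idx) / n) * mean_within
        return total

    observed = delta(groups)
    rng = random.Random(seed)
    labels = list(groups)
    perm_deltas = []
    for _ in range(permutations):
        rng.shuffle(labels)
        perm_deltas.append(delta(labels))
    expected = sum(perm_deltas) / permutations
    spread = statistics.pstdev(perm_deltas)
    t_stat = (observed - expected) / spread if spread else 0.0
    a_stat = 1 - observed / expected  # chance-corrected within-group agreement
    extreme = sum(d <= observed + 1e-12 for d in perm_deltas)
    p_value = (extreme + 1) / (permutations + 1)
    return observed, t_stat, a_stat, p_value

# Two well-separated groups of five samples each.
points = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2, 10.3, 10.4]
groups = ["control"] * 5 + ["treated"] * 5
dist = [[abs(p - q) for q in points] for p in points]
delta_obs, t_stat, a_stat, p_value = mrpp(dist, groups)
print(delta_obs, t_stat, a_stat, p_value)
```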

“PCoA plots are the first
step of a community
analysis, not the last.”
Josh Neufeld

Searching for species that matter
High dimensional data often have too many
features to investigate
solution: identify and study species significantly
associated with categorical metadata

Indicator species (Dufrene-Legendre)
calculates indicator value (fidelity and relative
abundance) of species
Permutation test for significance
Need solution for sparse data - be wary
of groups with small numbers of sites (influence on
permutation tests)
low abundance can artificially inflate indicator values

IndVal (Dufrene & Legendre, 1997)
Specificity
Large mean abundance within group relative to summed
mean abundances of other groups

Fidelity
Presence in most or all sites of that group

Groups defined a priori by metadata or
statistical clustering
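
The indicator value can be sketched as specificity times fidelity, scaled to 0-100. This is a minimal illustration for a single taxon with hypothetical abundances and group labels; the permutation test for significance is not included.

```python
def indval(abundances, groups):
    """Indicator value (0-100) of one taxon for each group of sites.

    abundances: abundance of the taxon at each site
    groups: group label of each site (same order)
    """
    labels = sorted(set(groups))
    mean_abundance, frequency = {}, {}
    for g in labels:
        vals = [a for a, lab in zip(abundances, groups) if lab == g]
        mean_abundance[g] = sum(vals) / len(vals)
        # Fidelity: fraction of the group's sites where the taxon is present.
        frequency[g] = sum(1 for v in vals if v > 0) / len(vals)
    total = sum(mean_abundance.values())
    result = {}
    for g in labels:
        # Specificity: group mean relative to the sum of all group means.
        specificity = mean_abundance[g] / total if total else 0.0
        result[g] = 100 * specificity * frequency[g]
    return result

groups = ["wet", "wet", "wet", "dry", "dry", "dry"]
perfect = indval([5, 5, 5, 0, 0, 0], groups)
partial = indval([10, 8, 0, 0, 1, 0], groups)
print(perfect, partial)
```

A taxon found at every "wet" site and nowhere else scores 100 for that group; missing one wet site and appearing at one dry site lowers both fidelity and specificity.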

Simple linear correlations
Metadata | Taxon | R^2 value
mbc | k__Bacteria;p__Planctomycetes;c__Planctomycetia;o__Gemmatales;f__Isosphaeraceae;g__ | 0.611368489781491
mbc | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.677209935419981
mbn | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.64092523702996
soil_depth | k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Intrasporangiaceae;g__ | 0.669761188668774
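
R^2 values like those in the table can be obtained from a simple linear correlation between a metadata variable and a taxon's abundance across samples. A minimal sketch, with hypothetical values standing in for soil depth and relative abundance:

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector has no linear relationship to report
    return sxy * sxy / (sxx * syy)

soil_depth = [1, 2, 3, 4]              # hypothetical metadata values
taxon_abundance = [1, 3, 2, 4]         # hypothetical abundances of one taxon
r2 = r_squared(soil_depth, taxon_abundance)
print(r2)
```

Repeating this for every taxon against every metadata column, then keeping the pairs above a chosen R^2 threshold, gives a table of the kind shown above.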

mothur: cooccurrence function, measuring whether populations are co-occurring
more frequently than you would expect by chance.

Non-negative Matrix Factorization
NMF as a representation method for portraying
high-dimensional data as a small number of
taxonomic components.
Patterns of co-occurring OTUs can be
described by a smaller number of taxonomic
components.
Each sample represented by the collection of
component taxa, helping identify relationships
between taxa and the environment.
Jonathan Dushoff, McMaster University, Ontario, Canada

Nakai et al. 2012

Lynch et al. 2012

Why pipelines?
Merge and manage (many) disparate techniques
Democratize analysis
improve accessibility

Accelerate pace of innovation, collaboration, and
research

Early synthesis
Early synthesis for numerical microbial ecology
Synthesis of 16S phylogenetics (Woese et al.)
and Hughes (Counting the uncountable)
Numerical ecology for microorganisms

Algorithm development
libshuff, dotur (mothur)

Analysis pipelines
QIIME, mothur

Knight Lab, U. Colorado at Boulder
Predominantly a collection of integrated Python/R
scripts
Many dependencies
easy managed installation:
qiime-deploy
MacQIIME
virtual box and Ubuntu fork
avoid for anything but small runs

Becoming the standard for marker gene studies
integrated analysis and visualization
easy access to broad computational biology toolbox
(Python/R)

Automation and extension
AXIOME and phyloseq
Extend existing technologies (QIIME, mothur, R,
custom)

Layers of abstraction
Automation and rapid re-analysis
Promote reproducible research (IPython, XML,
make)

Implement existing techniques (e.g., MRPP,
Dufrene-Legendre IndVal)
numerical microbial ecology needs to better
incorporate modern statistical theory

Develop and test new techniques

Axiometic
GUI companion for AXIOME
Cross-platform
New implementation in
development

Generates AXIOME file (XML)

xls template
coming soon for
all commands,
sample metadata,
and extra info…
much easier for
everyone.

“QIIME wraps many other software
packages, and these should be cited if
they are used. Any time you're using
tools that QIIME wraps, it is essential
to cite those tools.”
http://qiime.org/index.html

The future
As data get bigger, interpretation should be
“hands off”
Move towards hypothesis testing of high-dimensional taxonomic data

Convergence on Galaxy
e.g., QIIME in Galaxy is developing

Further extension to cloud services
e.g., Amazon EC2

Machine learning and data mining
applications

Open-source, web-based platform
Deployed locally or in the cloud
Ongoing development of 16S rRNA gene analysis

Galaxy Workshed (available tools)

“The advantages of having large numbers of
samples at shallow coverage (~1,000 sequences
per sample) clearly outweigh having a small
number of samples at greater coverage for many
datasets, suggesting that the focus for future
studies should be on broader sampling that can
reveal association with key biological
parameters rather than on deeper sequencing.”

“….even [phylogenetic beta-diversity]
measures suited to the underlying
mechanism of differentiation may
require deep sequencing to reveal
subtle patterns”
Dr. Donovan Parks

Method standardization
Impossible.
Data storage
Sequence output is growing faster than data storage costs are falling
Federated data?
File formats
e.g., FASTA (difficult to search, difficult to retrieve sequences, not space efficient,
does not ensure data are in the correct format, no space for metadata, no absolute
standard)… relational databases?
Software
Free and Open Source enables an experiment to be faithfully replicated
Algorithms
Memory!
Many clustering and phylogenetic inference algorithms scale with n^2
Distributed, parallel, or cloud computing may not be helpful
Metadata
What to do with it? How to marry sequence and metadata sets?
We need better metadata integration, not necessarily more/better metadata

What should we be doing?
(take-home messages)

*Surveys are really important for
spatial and temporal mapping
*Hypothesis testing follows (or is implicit)
*What species account for treatment effects?
*Who tracks with whom? (why=function)
*Who avoids whom?
*Are all microorganisms accounted for? (no)
*How can we use this information to
manipulate, manage and predict ecosystems?

What should we be doing?
(take-home messages)

There is no “one way” to analyze 16S rRNA
You need to build a pipeline for you.
If this seems daunting, it is.
If this is not daunting, your hands are dirty.
It’s getting better all the tii-ime.

Thank you
jneufeld@uwaterloo.ca
