Short introductory talk on multivariate statistics for 16S rRNA gene analysis, given at the 2nd Soil Metagenomics conference in Braunschweig, Germany, December 2013. A previous talk had discussed quality filtering, chimera detection, and clustering algorithms.
Introduction to 16S rRNA gene multivariate analysis
1. Multivariate exploration of microbial communities
Josh D. Neufeld
Braunschweig, Germany
December, 2013
Andre Masella (MSc): Computer science
Michael Lynch (PhD): Taxonomy, phylogenetics, ecology
Michael Hall (co-op): mathematics, programming, user friendly!
Posted on Slideshare without images and unpublished data
2. Quick history
Alpha and Beta diversity
Species that matter
Pipelines
Future prospects and problems
3. Who lives with whom, and why, and where?
Data reduction is essential for:
a) summarizing large numbers of observations
into manageable numbers
b) visualizing many interconnected variables in a
compact manner
Alpha diversity: species richness (and evenness)
within a single sample
Beta diversity: change in species composition
across a collection of samples
Gamma diversity: total species richness across an
environmental gradient
4. An (abbreviated) history
Numerical ecology
phenetics and statistical analysis of organismal
counts
macroecology
16S rRNA gene era
sequence analysis as a surrogate for counting
mapping of marker to taxonomy
NGS enabled synthesis of phenetics,
phylogenetics, and numerical ecology
6. Assembling paired-end
reads dramatically
reduces error
Corrects mismatches in
region of overlap
(quality threshold >0.9),
set a minimum overlap.
Can compare to perfect
overlap assembly:
“completelymissesthepoint”
(name changing soon)
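A toy sketch of the overlap idea on this slide, not the algorithm of any particular assembler. It assumes the reverse read has already been reverse-complemented, that qualities are given as per-base probabilities of a correct call, and that a simple mean score over the overlap is compared with the threshold. The function name `merge_pair` and the scoring rule are illustrative assumptions.

```python
def merge_pair(fwd, fwd_q, rev_rc, rev_q, min_overlap=10, threshold=0.9):
    """Merge a forward read with a reverse-complemented reverse read.

    fwd_q and rev_q hold per-base probabilities that the call is correct.
    Returns the merged sequence, or None if no overlap scores above threshold.
    """
    max_overlap = min(len(fwd), len(rev_rc))
    # Try the longest candidate overlap first, down to the minimum allowed.
    for overlap in range(max_overlap, min_overlap - 1, -1):
        f_start = len(fwd) - overlap
        merged, scores = [], []
        for i in range(overlap):
            fb, fq = fwd[f_start + i], fwd_q[f_start + i]
            rb, rq = rev_rc[i], rev_q[i]
            if fb == rb:
                # Agreement: treat the call as wrong only if both reads are wrong.
                merged.append(fb)
                scores.append(1 - (1 - fq) * (1 - rq))
            else:
                # Mismatch: keep the better-supported base and penalize the score.
                base, q = (fb, fq) if fq >= rq else (rb, rq)
                merged.append(base)
                scores.append(q * (1 - min(fq, rq)))
        if sum(scores) / overlap > threshold:
            return fwd[:f_start] + "".join(merged) + rev_rc[overlap:]
    return None


# Hypothetical reads: the forward read carries one low-quality error (A for G).
forward = "AAACCCGGATTT"
forward_q = [0.99] * 8 + [0.3] + [0.99] * 3
reverse_rc = "GGGTTTACGT"
reverse_q = [0.99] * 10
assembled = merge_pair(forward, forward_q, reverse_rc, reverse_q, min_overlap=4)
```

Here the low-quality forward base is overruled by the high-quality reverse base in the region of overlap, which is the error-correcting behaviour the slide describes.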
10. Biological Observation Matrix
BIOM file format (McDonald et al. 2012)
Standard recognized by EMP, MG-RAST,
VAMPS
Based on JSON data interchange format
Computational structure in multiple languages
“facilitates the efficient handling and
storage of large, sparse biological
contingency tables”
Encapsulates metadata and contingency
table (e.g., OTU table) in one file
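A minimal sketch of what a sparse, JSON-based contingency table with embedded metadata can look like. The layout follows the general shape of BIOM 1.0, but only a subset of fields is shown, and the OTU identifiers, sample names, and metadata values are hypothetical. The helper `sparse_to_dense` is an illustrative name, not part of any BIOM library.

```python
import json

# Hypothetical two-OTU, three-sample table; "data" holds [row, column, count]
# triples, so zero counts are simply not stored.
biom_text = json.dumps({
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "type": "OTU table",
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [2, 3],
    "rows": [
        {"id": "OTU_1", "metadata": {"taxonomy": ["k__Bacteria", "p__Proteobacteria"]}},
        {"id": "OTU_2", "metadata": {"taxonomy": ["k__Bacteria", "p__Actinobacteria"]}},
    ],
    "columns": [
        {"id": "sample_A", "metadata": {"soil_depth": 5}},
        {"id": "sample_B", "metadata": {"soil_depth": 15}},
        {"id": "sample_C", "metadata": {"soil_depth": 30}},
    ],
    "data": [[0, 0, 12], [0, 2, 3], [1, 1, 7]],
})


def sparse_to_dense(table):
    """Expand the sparse triples into a dense rows-by-columns list of counts."""
    n_rows, n_cols = table["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, value in table["data"]:
        dense[row][col] = value
    return dense


table = json.loads(biom_text)
dense = sparse_to_dense(table)
sample_ids = [column["id"] for column in table["columns"]]
```

Keeping the sample metadata next to the counts is what lets downstream tools join OTU tables to environmental variables without a separate mapping file.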
14. α-diversity: Richness and
Evenness
Shannon index (H’), Estimators (Chao1, ACE), Phylogenetic Diversity
Shannon index (H’): richness and evenness
Estimators: richness
Faith’s PD: phylogenetic richness
Stearns et al., 2011
Hughes et al., 2001
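As a rough illustration of two of the measures named on this slide, the sketch below computes the Shannon index and a Chao1 richness estimate from a vector of OTU counts for one sample. The bias-corrected form of Chao1 is used here as an assumption, because it stays defined when there are no doubletons. The counts are hypothetical.

```python
import math


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    s_obs = sum(1 for c in counts if c > 0)   # observed richness
    f1 = sum(1 for c in counts if c == 1)     # singletons
    f2 = sum(1 for c in counts if c == 2)     # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


even_sample = [5, 5, 5, 5]           # four equally abundant OTUs
skewed_sample = [10, 5, 1, 1, 2, 0]  # two singletons, one doubleton
h_even = shannon(even_sample)        # equals ln(4) for a perfectly even sample
richness_estimate = chao1(skewed_sample)
```

Shannon responds to both richness and evenness, whereas Chao1 uses only the rarest classes (singletons and doubletons) to estimate how many OTUs were missed.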
15. “All biologists who sample natural
communities are plagued with the
problem of how well a sample reflects a
community’s ‘true’ diversity.”
16. Hughes et al. 2001
“Nonparametric estimators show particular promise for microbial data and in
some habitats may require sample sizes of only 200 to 1,000 clones to detect
richness differences of only tens of species.”
17. [Figure: Google Scholar proportion for “[Sequencing tech] AND 16S” (Sanger, 454, Illumina) and “Rare biosphere” citations, plotted against time (year), 2000 to 2012]
Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
18. GOALS
Understanding of community structure
Better alpha-diversity measures
Robust beta-diversity measures
Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.
21. Clustering algorithms
(influence alpha diversity primarily)
CD-HIT (Li and Godzik, Sanford-Burnham Medical
Research Institute)
‘longest-sequence-first’ removal algorithm
Fast, many implementations (nucleotide, protein, OTU-specific)
Tends to be more stringent than UCLUST
UCLUST (R. Edgar, drive5.com)
Faster than CD-HIT
Tends to generate larger number of low-abundance OTUs
Broader range of clustering thresholds
"I do not recommend using the UCLUST algorithm or
CD-HIT for generating OTUs” – Robert Edgar
23. CROP: Clustering 16S rRNA for OTU Prediction
“CROP can find clusters based on the natural organization of data without setting a
hard cut-off threshold (3%/5%) as required by hierarchical clustering methods.”
24. Chimeras
DNA from two or more parent molecules
PCR artifact
Can easily be classified as a “novel” sequence
Increases α-diversity
Software
ChimeraSlayer, Bellerophon, UCHIME, Pintail
Reference database or de novo
25. Classification and taxonomy
Ribosomal Database Project (RDP) classifier
Naïve Bayesian classifier (James Cole and Tiedje)
http://rdp.cme.msu.edu/
pplacer
Phylogenetic placement and visualization
BLAST
The tool we know and love
RTAX (UC Berkeley, Rob Knight involved)
http://dev.davidsoergel.com/trac/rtax/
mothur (Patrick Schloss)
http://www.mothur.org/
SINA (SILVA)
26. RDP classifier
Large training sets require active memory management
Can be easily run in parallel by breaking up very large data sets
Can classify Bacteria/Archaea SSU and fungal LSU (can be re-trained)
Algorithm:
determine the probability that an unknown query sequence is a member of a
known genus (training set), based on the profile of word subsets of known
genera.
Confidence estimation:
the number of times in 100 trials that a genus was selected based on a
random subset of words in the query
Take home:
The higher the diversity (bigger sequence space) of the training set, the
better the assignment
Longer query = better and more reliable assignment
Short reads (i.e., <250 base) will have lower confidence estimates (cutoff of
0.5 suggested)
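The algorithm and confidence estimate on this slide can be sketched as a toy word-based naive Bayes classifier with bootstrap support. This is a simplified illustration, not the RDP classifier itself: the short word size, the rule of subsampling (number of words / word size) words per trial, the training sequences, and the genus labels are all assumptions made for the example.

```python
import math
import random
from collections import defaultdict


def words(seq, word_size):
    """Set of overlapping words (k-mers) found in a sequence."""
    return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}


def train(training, word_size):
    """training: list of (genus, sequence) pairs. Returns (genera, log_prob)."""
    n_total = len(training)
    seqs_with_word = defaultdict(int)                   # over the whole training set
    genus_size = defaultdict(int)                       # sequences per genus
    genus_word = defaultdict(lambda: defaultdict(int))  # per-genus word counts
    for genus, seq in training:
        genus_size[genus] += 1
        for w in words(seq, word_size):
            seqs_with_word[w] += 1
            genus_word[genus][w] += 1

    def log_prob(genus, w):
        # Overall word prior, then the genus-specific word probability.
        prior = (seqs_with_word.get(w, 0) + 0.5) / (n_total + 1)
        return math.log((genus_word[genus].get(w, 0) + prior) / (genus_size[genus] + 1))

    return sorted(genus_size), log_prob


def classify(query, model, word_size, trials=100, seed=0):
    """Return (assigned genus, fraction of bootstrap trials agreeing with it)."""
    genera, log_prob = model
    rng = random.Random(seed)
    query_words = sorted(words(query, word_size))
    if not query_words:
        raise ValueError("query is shorter than the word size")

    def best(ws):
        return max(genera, key=lambda g: sum(log_prob(g, w) for w in ws))

    assignment = best(query_words)
    subset_size = max(1, len(query_words) // word_size)
    hits = sum(best(rng.choices(query_words, k=subset_size)) == assignment
               for _ in range(trials))
    return assignment, hits / trials


# Hypothetical training set with two made-up genera.
training_set = [
    ("genus_A", "ACGTACGTACGTACGTACGT"),
    ("genus_A", "ACGTACGTACGAACGTACGT"),
    ("genus_B", "TTGGCCAATTGGCCAATTGG"),
    ("genus_B", "TTGGCCAATTGGCCTATTGG"),
]
model = train(training_set, word_size=4)
genus, confidence = classify("ACGTACGTACGTACGT", model, word_size=4)
```

Shorter queries yield fewer words, so each bootstrap subset is smaller and more easily dominated by uninformative words, which is one way to see why short reads get lower confidence estimates.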
29. β-diversity
Visualization (ordination) versus hypothesis
testing (MRPP, indicator species analysis)
Many more algorithms out there for
exploration and statistical testing
mostly through widely used R packages
vegan (Community Ecology Package)
labdsv (Ordination and Multivariate Analysis for
Ecology)
ape (Analyses of Phylogenetics and Evolution)
picante (community analyses etc.)
30. Visualization (ordination)
Complementary to data clustering
looks for discontinuities
Ordination extracts main trends as continuous
axes
analysis of the square matrix derived from the
OTU table
Non-parametric, unconstrained ordination
methods most widely used (and best suited)
methods that can work directly on a square matrix
An appropriate metric is required to derive
this square matrix
many options...
31. Metrics
Ordination is essentially reducing dimensionality
first requirement: accurately model differences
among samples
Models are *really* important. Examples include:
OTU presence/absence
Dice, Jaccard
OTU abundance
Bray-Curtis
Phylogenetic
UniFrac
“all models are wrong, some are useful”
- G.E. Box
“You can't publish anything without a PCoA plot anymore, but METRICS used to draw plot important.”
- Susan Huse
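To make the difference between a presence/absence metric and an abundance metric concrete, here is a small sketch of Jaccard and Bray-Curtis dissimilarities between two samples. The count vectors are hypothetical, and both are assumed to list the same OTUs in the same order.

```python
def jaccard_distance(a, b):
    """1 - (shared OTUs / OTUs present in either sample); ignores abundance."""
    present_a = {i for i, count in enumerate(a) if count > 0}
    present_b = {i for i, count in enumerate(b) if count > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Sum of absolute count differences divided by the total counts."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


sample_1 = [10, 0, 5, 0]
sample_2 = [5, 5, 0, 0]
d_jaccard = jaccard_distance(sample_1, sample_2)
d_bray = bray_curtis(sample_1, sample_2)
```

The same pair of samples gets a different distance under each metric, which is why the choice of metric shapes every ordination drawn from the resulting square matrix.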
32. Metrics: UniFrac
A distance measure comparing multiple
communities using phylogenetic information
Requires sequence alignment and tree-building
PyNAST, MUSCLE, Infernal
Time-consuming and susceptible to poor phylogenetic
inference (does it matter?)
Weighted (abundance)
ecological features related to
abundance
Unweighted
ecological features related to
taxonomic presence/absence
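A minimal sketch of unweighted UniFrac, assuming the tree has already been built and is supplied as a list of branches, each with a length and the set of tips (OTUs) below it. That branch-list representation, the toy tree, and the sample contents are assumptions for illustration; real pipelines work from an alignment and an inferred tree.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """Fraction of observed branch length that leads to only one of two samples.

    branches: iterable of (length, set_of_tip_names_below_branch)
    sample_a, sample_b: sets of tip names (OTUs) present in each sample
    """
    unique = 0.0
    total = 0.0
    for length, tips in branches:
        in_a = bool(tips & sample_a)
        in_b = bool(tips & sample_b)
        if in_a or in_b:
            total += length          # branch is observed in at least one sample
            if in_a != in_b:
                unique += length     # branch is observed in exactly one sample
    return unique / total if total else 0.0


# Toy rooted tree: ((otu1:1, otu2:1):2, (otu3:1, otu4:1):2)
toy_branches = [
    (1.0, {"otu1"}),
    (1.0, {"otu2"}),
    (2.0, {"otu1", "otu2"}),
    (1.0, {"otu3"}),
    (1.0, {"otu4"}),
    (2.0, {"otu3", "otu4"}),
]
distance = unweighted_unifrac(toy_branches, {"otu1", "otu2"}, {"otu2", "otu3"})
```

The weighted variant would additionally scale each branch by the difference in relative abundance below it, which is what makes it sensitive to abundance rather than presence/absence.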
33. Ordination example 1 (of many):
Principal Coordinates Analysis
Classical Multidimensional Scaling (MDS; Gower 1966)
Procedure:
based on eigenvectors
position objects in low-dimensional space while preserving
distance relationships as well as possible
highly flexible
can choose among many association measures
In microbial ecology, used for visualizing
phylogenetic or count-based distances
Consistent visual output for given distance matrix
Include variance explained (%) on Axis 1 and 2
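The procedure above can be sketched in plain Python: centre the squared distances (Gower centring), extract the leading eigenvectors, and scale them by the square root of their eigenvalues. Power iteration with deflation stands in here for a proper eigen-solver, which is an assumption made to keep the example self-contained; it also assumes the leading eigenvalues are positive, which may not hold for every non-Euclidean metric.

```python
import math


def pcoa(dist, n_axes=2, iterations=500):
    """Return (coordinates, explained) for a square symmetric distance matrix."""
    n = len(dist)
    # Gower centring of -1/2 * squared distances.
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
         for i in range(n)]
    trace = sum(b[i][i] for i in range(n))  # equals the sum of all eigenvalues

    def times_b(v):
        return [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]

    columns, explained = [], []
    for axis in range(n_axes):
        # Power iteration from a fixed (deterministic) starting vector.
        v = [math.sin(i + 1 + axis) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = times_b(v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to extract on this axis
            v = [x / norm for x in w]
        eigenvalue = sum(x * y for x, y in zip(v, times_b(v)))  # Rayleigh quotient
        scale = math.sqrt(eigenvalue) if eigenvalue > 1e-9 else 0.0
        columns.append([x * scale for x in v])
        explained.append(eigenvalue / trace if trace > 0 else 0.0)
        # Deflate so the next pass finds the next axis.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    coordinates = [[column[i] for column in columns] for i in range(n)]
    return coordinates, explained


# Hypothetical distances among three samples lying on a line at 0, 1 and 3.
line_distances = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
coordinates, explained = pcoa(line_distances)
```

Because the result is determined by the distance matrix (up to axis sign), repeated runs give the same plot, and the `explained` fractions are the values to report on Axis 1 and 2.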
34. Ordination example 2 (of many):
Non-metric Multidimensional Scaling
Ordination not based on eigenvectors
Does not preserve exact distances among objects
attempts to preserve ordering of samples (“ranks”)
Procedure:
iterative, tries to position the objects in a few (2-3) dimensions in such a way
that minimizes the “stress”
how well does the new ranked distribution of points represent the original
distances in the association matrix? Can express as R^2 on axes 1 and 2.
the adjustment goes on until the stress value reaches a local minimum
(heuristic solution)
NMDS often represents distance relationships better than PCoA in the
same number of dimensions
Susceptible to the “local minimum issue”, and therefore should have
strong starting point (e.g., PCoA) or many permutations
You won't get the same result each time you run the analysis. Try several
runs until you are comfortable with the result.
36. Beta-diversity: Hypothesis testing
Multiple methods, implemented in QIIME,
mothur, AXIOME
e.g., MRPP, adonis, NP-MANOVA (perMANOVA),
ANOSIM
Are treatment effects significant?
Because these are predominantly
nonparametric methods, tests for
significance rely on testing by permutation
Let's focus on MRPP
37. Multiresponse Permutation Procedures
Compare intragroup average distances with the
average distances that would have resulted from all
the other possible combinations
T statistic: more negative with
increasing group separation
(T>-10 common for ecology)
A statistic: Degree of scatter
within groups (A=1 when all
points fall on top of one another)
p value: likelihood of similar
separation with randomized
data.
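A rough permutation-based sketch of the three statistics on this slide. It assumes group weights of n_g/N, estimates the expected delta and its spread from random label permutations rather than from an analytical approximation, and uses hypothetical one-dimensional data; established packages may differ in these details.

```python
import random
import statistics
from itertools import combinations


def weighted_delta(dist, labels):
    """Weighted mean of the average within-group distances (weights n_g / N)."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    n = len(labels)
    delta = 0.0
    for members in groups.values():
        pairs = list(combinations(members, 2))
        if pairs:
            mean_distance = sum(dist[i][j] for i, j in pairs) / len(pairs)
            delta += len(members) / n * mean_distance
    return delta


def mrpp(dist, labels, permutations=999, seed=0):
    """Return observed delta, T, A and a permutation p value."""
    rng = random.Random(seed)
    observed = weighted_delta(dist, labels)
    shuffled = list(labels)
    null = []
    for _ in range(permutations):
        rng.shuffle(shuffled)                 # reassign samples to groups at random
        null.append(weighted_delta(dist, shuffled))
    expected = statistics.mean(null)
    t_stat = (observed - expected) / statistics.stdev(null)
    a_stat = 1 - observed / expected          # equals 1 when within-group distances are 0
    # Small tolerance guards against floating-point ties with the observed value.
    as_extreme = sum(d <= observed + 1e-12 for d in null)
    p_value = (as_extreme + 1) / (permutations + 1)
    return {"delta": observed, "T": t_stat, "A": a_stat, "p": p_value}


# Hypothetical samples at positions 0, 1, 2 (group a) and 10, 11, 12 (group b).
positions = [0, 1, 2, 10, 11, 12]
distances = [[abs(x - y) for y in positions] for x in positions]
result = mrpp(distances, ["a", "a", "a", "b", "b", "b"])
```

With only six samples there are few distinct regroupings, so the smallest attainable p value is limited, which is worth keeping in mind when groups are small.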
39. “PCoA plots are the first
step of a community
analysis, not the last.”
Josh Neufeld
40. Searching for species that matter
High dimensional data often have too many
features to investigate
solution: identify and study species significantly
associated with categorical metadata
Indicator species (Dufrene-Legendre)
calculates indicator value (fidelity and relative
abundance) of species
Permutation test for significance
Need solution for sparse data - be wary
of groups with small numbers of sites (influence on
permutation tests)
low abundance can artificially inflate indicator values
42. IndVal (Dufrene & Legendre, 1997)
Specificity
Large mean abundance within group relative to summed
mean abundances of other groups
Fidelity
Presence in most or all sites of that group
Groups defined a priori by metadata or
statistical clustering
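The two components above combine into an indicator value as specificity times fidelity (times 100). A minimal sketch for a single species follows, with hypothetical abundances and group labels; the permutation test for significance is left out.

```python
def indval(abundances, groups):
    """Indicator value (0-100) of one species for each group of sites.

    abundances: abundance of the species at each site
    groups: group label of each site (same order as abundances)
    """
    by_group = {}
    for value, group in zip(abundances, groups):
        by_group.setdefault(group, []).append(value)
    means = {g: sum(values) / len(values) for g, values in by_group.items()}
    total_of_means = sum(means.values())
    result = {}
    for g, values in by_group.items():
        # Specificity: mean abundance in this group relative to the sum of group means.
        specificity = means[g] / total_of_means if total_of_means else 0.0
        # Fidelity: fraction of sites in this group where the species is present.
        fidelity = sum(1 for v in values if v > 0) / len(values)
        result[g] = 100 * specificity * fidelity
    return result


site_abundances = [8, 4, 0, 4, 0, 0, 4, 0]
site_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
values = indval(site_abundances, site_groups)
```

A species reaches 100 only when it occurs at every site of one group and nowhere else, so both components have to be high for a strong indicator.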
43. Simple linear correlations
Metadata | Taxon | R^2 value
mbc | k__Bacteria;p__Planctomycetes;c__Planctomycetia;o__Gemmatales;f__Isosphaeraceae;g__ | 0.611368489781491
mbc | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.677209935419981
mbn | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.64092523702996
soil_depth | k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Intrasporangiaceae;g__ | 0.669761188668774
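An R^2 value of this kind can be obtained as the squared Pearson correlation between a metadata variable and a taxon's abundance across samples. The sketch below uses hypothetical depth and abundance values, not the data behind the table.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no linear signal
    return sxy * sxy / (sxx * syy)


# Hypothetical values: a metadata variable and one taxon's relative abundance.
depth = [1, 2, 3, 4]
abundance = [1, 3, 2, 4]
fit = r_squared(depth, abundance)
```

Screening every taxon against every metadata column this way involves many comparisons, so high values are best treated as leads for follow-up rather than as tested hypotheses.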
44. mothur: cooccurrence function, measuring whether populations are co-occurring
more frequently than you would expect by chance.
45. Non-negative Matrix Factorization
NMF as a representation method for portraying
high-dimensional data as a small number of
taxonomic components.
Patterns of co-occurring OTUs can be
described by a smaller number of taxonomic
components.
Each sample represented by the collection of
component taxa, helping identify relationships
between taxa and the environment.
Jonathan Dushoff, McMaster University, Ontario, Canada
61. Why pipelines?
Merge and manage (many) disparate techniques
Democratize analysis
improve accessibility
Accelerate pace of innovation, collaboration, and
research
62. Early synthesis
Early synthesis for numerical microbial ecology
Synthesis of 16S phylogenetics (Woese et al.)
and Hughes (Counting the uncountable)
Numerical ecology for microorganisms
Algorithm development
libshuff, dotur (mothur)
Analysis pipelines
QIIME, mothur
63. Knight Lab, U. Colorado at Boulder
Predominantly a collection of integrated Python/R
scripts
Many dependencies
easy managed installation:
qiime-deploy
MacQIIME
virtual box and Ubuntu fork
avoid for anything but small runs
Becoming the standard for marker gene studies
integrated analysis and visualization
easy access to broad computational biology toolbox
(Python/R)
64. Automation and extension
AXIOME and phyloseq
Extend existing technologies (QIIME, mothur, R,
custom)
Layers of abstraction
Automation and rapid re-analysis
Promote reproducible research (IPython, XML,
make)
Implement existing techniques (e.g., MRPP,
Dufrene-Legendre IndVal)
numerical microbial ecology needs to better
incorporate modern statistical theory
Develop and test new techniques
67. Axiometic
GUI companion for AXIOME
Cross-platform
New implementation in
development
Generates AXIOME file (XML)
xls template
coming soon for
all commands,
sample metadata,
and extra info…
much easier for
everyone.
68. “QIIME wraps many other software
packages, and these should be cited if
they are used. Any time you're using
tools that QIIME wraps, it is essential
to cite those tools.”
http://qiime.org/index.html
70. The future
As data get bigger, interpretation should be
“hands off”
Move towards hypothesis testing of high-dimensional taxonomic data
Convergence on Galaxy
e.g., QIIME in Galaxy is developing
Further extension to cloud services
e.g., Amazon EC2
Machine learning and data mining
applications
73. “The advantages of having large numbers of
samples at shallow coverage (~1,000 sequences
per sample) clearly outweigh having a small
number of samples at greater coverage for many
datasets, suggesting that the focus for future
studies should be on broader sampling that can
reveal association with key biological
parameters rather than on deeper sequencing.”
75. Method standardization
Impossible.
Data storage
Sequence output is outpacing declines in data storage costs
Federated data?
File formats
e.g., FASTA (difficult to search, difficult to retrieve sequences, not space efficient,
does not ensure data are in the correct format, no space for metadata, no absolute
standard)… relational databases?
Software
Free and Open Source enables an experiment to be faithfully replicated
Algorithms
Memory!
Many clustering and phylogenetic inference algorithms scale as n^2
Distributed, parallel, or cloud computing may not be helpful
Metadata
What to do with it? How to marry sequence and metadata sets?
We need better metadata integration, not necessarily more/better metadata
76. What should we be doing?
(take-home messages)
*Surveys are really important for
spatial and temporal mapping
*Hypothesis testing follows (or is implicit)
*What species account for treatment effects?
*Who tracks with who? (why=function)
*Who avoids who?
*Are all microorganisms accounted for? (no)
*How can we use this information to
manipulate, manage and predict ecosystems?
77. What should we be doing?
(take-home messages)
There is no “one way” to analyze 16S rRNA
You need to build a pipeline for you.
If this seems daunting, it is.
If this is not daunting, your hands are dirty.
It’s getting better all the tii-ime.