2012 erin-crc-nih-seattle
1. Extracting genomes from
community sequencing
"What works, what will work, and what needs work"
C. Titus Brown
ctb@msu.edu
Computer Science; Microbiology; BEACON
Michigan State University
2. Warnings
This talk contains forward looking statements.
These forward-looking statements can be
identified by terminology such as “will”, “expects”,
and “believes”.
-- Safe Harbor provisions of the
U.S. Private Securities Litigation Reform Act
"Making predictions is difficult, especially if they're about the future."
-- Attributed to Niels Bohr
3. Thanks for the invitation!
So, Linda Mansfield and I were talking one day…
Her: "It'd be great to be able to look at communities with sequencing."
Me: "Oh, yeah, we can do that now."
My overall interest is in good hypothesis
generation from computational data, with a focus
on sequence data.
For the past three years, I have been working on
this specifically for soil metagenomics (and
mRNAseq, too).
7. Soil contains thousands to millions of species
("Collector's curves" of ~species)
[Figure: collector's curves of OTU counts. Y-axis: number of OTUs (0-2000); x-axis: number of sequences (100-8100). Series: Iowa Corn, Iowa Native Prairie, Kansas Corn, Kansas Native Prairie, Wisconsin Corn, Wisconsin Native Prairie, Wisconsin Restored Prairie, Wisconsin Switchgrass.]
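A collector's curve like the ones in the figure can be computed by shuffling the observed OTU labels and counting distinct OTUs as sequences accumulate. A minimal sketch; the function name and toy data are illustrative, not from the talk:

```python
import random

def collectors_curve(otu_labels, step=100, seed=42):
    """Collector's (rarefaction) curve: number of distinct OTUs
    observed as a function of the number of sequences sampled."""
    rng = random.Random(seed)
    shuffled = otu_labels[:]
    rng.shuffle(shuffled)
    seen = set()
    curve = []
    for i, otu in enumerate(shuffled, 1):
        seen.add(otu)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Toy community with skewed OTU abundances: otu_i has i+1 copies.
labels = [f"otu{i}" for i in range(50) for _ in range(i + 1)]
curve = collectors_curve(labels, step=200)
```

The curve flattens when further sequencing stops revealing new OTUs; in the soil data above it is still climbing after 8,000+ sequences.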
8. Ecology => function emphasis
What's there?
Is it really that complex a community?
How “deep” do we need to sequence to sample
thoroughly and systematically?
How is ecological complexity created &
maintained?
How does ecological complexity respond to
perturbation?
What organisms and gene functions are
present, including non-canonical carbon and
nitrogen cycling pathways?
What kind of organismal and functional overlap is there?
9. The human gut is a diverse place
Dethlefsen et al., 2008
10. Ecology vs function in human gut
We can observe recovery of
diversity after Cipro treatment; but
what is driving recovery at a
functional level?
Dethlefsen and Relman, 2011
11. Culture independent methods
Observation that 99% of microbes cannot easily
be cultured in the lab. (“The great plate count
anomaly”)
While this is less true for host-associated
microbes, culture independent methods are still
important:
Syntrophic relationships
Niche-specificity or unknown physiology
Dormant microbes
Abundance within communities
Single-cell sequencing & shotgun metagenomics are two common ways to investigate microbial communities.
13. Shotgun sequencing & assembly
Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
14. Shotgun sequencing & assembly
Why assembly?
Assumption free (no reference needed)
Necessary for soil and marine; useful for host-associated?
Assembly can serve as reference for transcriptome
interpretation
Fragment, sequence, computationally assemble.
What kind of results do you get?
Almost certainly chimerism between different strains; but still
useful for gene content & operon structure.
Specificity seems high, but sensitivity is dependent on
sequencing depth.
Because of sampling rate, Illumina is primary choice.
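The "fragment, sequence, computationally assemble" step is typically built on a de Bruijn graph for Illumina data. A toy sketch of the idea only, not the assembler used in the talk: real assemblers must also handle sequencing errors, repeats, and strain variation.

```python
from collections import defaultdict

def debruijn_contigs(reads, k=5):
    """Build a de Bruijn graph from reads and emit contigs by walking
    unambiguous (single-in, single-out) paths."""
    succ = defaultdict(set)   # (k-1)-mer -> set of following (k-1)-mers
    pred = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    nodes = set(succ) | set(pred)
    contigs = []
    for node in nodes:
        # skip nodes in the middle of a simple path
        if len(pred[node]) == 1 and len(succ[node]) == 1:
            continue
        for nxt in succ[node]:
            path = [node, nxt]
            while len(succ[path[-1]]) == 1 and len(pred[path[-1]]) == 1:
                path.append(next(iter(succ[path[-1]])))
            contigs.append(path[0] + "".join(p[-1] for p in path[1:]))
    return contigs

# Toy check: reads tiled across a short sequence reassemble it.
genome = "ATGGCGTGCAACT"
reads = [genome[i:i + 8] for i in range(len(genome) - 7)]
contigs = debruijn_contigs(reads, k=5)
```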
15. Shotgun metagenomics: good news
Cheap and easy to generate vast whole
metagenome/metatranscriptome shotgun data sets from
essentially any community you can sample.
Such data can be quite interesting!
Low hanging fruit – correlation with diet, etc.
Still early days for observation of “pan genome” and functional
content.
Potential to illuminate or inform:
Dynamics and selective pressures of antibiotic
resistance, virulence genes, and pathogenicity islands
Phage and viral communities
Community interactions.
16. Shotgun metagenomics: bad
news
Computational techniques are still relatively immature
Mapping to known genomes?
Discovery of unknown genomes & strain variants?
Sensitivity and specificity are hard to evaluate.
Computational ecosystem is not that rich…
Interpreting the data is still the bottleneck, of course.
Vast majority of genes not usefully annotated.
Reliance on specific reference databases, annotations.
Tools for (e.g.) inferring community interactions from
community dynamics & functional capacity are
desperately needed.
19. 2. Big data sets require big machines
For even relatively small data sets, metagenomic
assemblers scale poorly.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
Size of data set == big!!
(Estimated 6 weeks x 3 TB of RAM to do a 300 Gbp soil sample with a slightly modified conventional assembler.)
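The "memory ~ errors" relationship follows because each base-calling error in a read creates up to k spurious novel k-mers that the assembler must store. A back-of-envelope calculation; the rates and counts below are illustrative assumptions, not figures from the talk:

```python
# Why assembler memory tracks error count: each base error in a read
# creates up to k spurious novel k-mers, so error k-mers can quickly
# rival or exceed the k-mers of real variation.
k = 32
read_len = 100
error_rate = 0.01               # assumed 1% per-base Illumina-style error
n_reads = 3_000_000_000         # a 300 Gbp data set of 100 bp reads

true_kmers = 50_000_000_000     # assumed distinct k-mers of real variation
errors = n_reads * read_len * error_rate
error_kmers = errors * k        # worst case: k novel k-mers per error

print(error_kmers / true_kmers) # error k-mers outnumber real ones here
```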
21. Approach 1: Partitioning
Split reads into “bins”
belonging to
different source
species.
Can do this based
almost entirely on
connectivity of
sequences.
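The connectivity-based binning above can be sketched with union-find over shared k-mers. The real implementation (khmer) instead traverses a compact probabilistic graph representation, so treat this as an illustration of the idea only:

```python
def partition_reads(reads, k=5):
    """Group reads into partitions that share at least one k-mer,
    a proxy for connectivity in the assembly graph."""
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    kmer_owner = {}  # k-mer -> first read index that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in kmer_owner:
                union(kmer_owner[kmer], idx)
            else:
                kmer_owner[kmer] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())

# Reads 0/1 overlap, and reads 2/3 overlap: two partitions.
parts = partition_reads(["ATGGCGT", "GGCGTGC", "TTTTCCC", "TTCCCGG"], k=5)
```

Each partition can then be assembled independently, which is what makes the memory problem tractable.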
22. Technical challenges met (and defeated)
Novel data structure properties elucidated via
percolation theory analysis (Pell, Hintze, et al., in
review, PNAS).
Exhaustive in-memory traversal of graphs
containing 5-15 billion nodes.
Sequencing technology introduces false
sequences in graph (Howe et al., in prep.)
Only 20x improvement in assembly scaling.
23. (NOVEL)
Approach 2: Digital normalization
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A!
Overkill!!
This 100x will consume
disk space and, because
of errors, memory.
24. Digital normalization discards
redundant reads prior to assembly.
This removes reads and decreases data
size, eliminates errors from removed reads, and
normalizes coverage across loci.
Discarded reads can be used after assembly for
quantitative analysis.
A read's median k-mer count is a good estimator of "coverage".
This gives us a
reference-free
measure of
coverage.
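Putting the last two slides together, the digital normalization algorithm is: keep a read only if its median k-mer count so far is below a coverage cutoff, otherwise discard it as redundant. A sketch using an exact counting dictionary for clarity; khmer uses a fixed-memory probabilistic counting structure instead:

```python
from statistics import median

def digital_normalization(reads, k=5, cutoff=20):
    """Streaming diginorm sketch: a read's median k-mer count against
    the counts accumulated so far estimates its coverage; reads at or
    above the cutoff are discarded as redundant."""
    counts = {}
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        est_coverage = median(counts.get(km, 0) for km in kmers)
        if est_coverage < cutoff:
            kept.append(read)
            for km in kmers:
                counts[km] = counts.get(km, 0) + 1
    return kept

# 100 identical reads collapse to the cutoff; a novel read is kept.
kept_reads = digital_normalization(["ATGGCGTGCA"] * 100 + ["TTTTTCCCCC"])
```

Because errors are rare, error k-mers stay low-count and do not inflate the median, which is why discarding high-coverage reads also discards most errors.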
26. Shotgun data is often (1) high
coverage and (2) biased in coverage.
(MDA amplified)
27. Digital normalization fixes all that.
Normalizes coverage
Discards redundancy
Eliminates majority of
errors
Scales assembly dramatically
Assembly is 98% identical
30. How much? A mathematical
interlude.
Suppose we need 10x coverage to assemble a
microbial genome, and microbial genomes
average 5e6 bp of DNA.
Further suppose that we want to be able to
assemble a microbial species that is “1 in a
100000”, i.e. 1 in 1e5.
Shotgun sequencing samples randomly, so must
sample deeply to be sensitive.
10x coverage x 5e6 bp x 1e5 =~ 5e12, or 5 Tbp of sequence.
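The slide's arithmetic, as a quick sanity check:

```python
# Required sequencing depth to assemble a rare community member.
target_coverage = 10     # x coverage needed to assemble
genome_size = 5e6        # average microbial genome, bp
rarity = 1e5             # species is 1 in 100,000

required = target_coverage * genome_size * rarity
print(f"{required / 1e12:.0f} Tbp")   # -> 5 Tbp
```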
31. Example
Dethlefsen shotgun data set / Relman lab
251 M reads / 16 GB gzipped FASTQ
~24 hrs, <32 GB of RAM for full pipeline -- $24 on Amazon EC2
(reads => final assembly + mapping)
Assembly stats:
58,224 contigs > 1000 bp (average 3 kb)
summing to 190 Mbp genomic
~38 microbial genomes worth of DNA
~65% of reads mapped back to assembly
32. What do we get for soil?
Total Assembly | Total Contigs | % Reads Assembled | Predicted protein-coding genes | rplB genes
2.5 bill | 4.5 mill | 19% | 5.3 mill | 391
3.5 bill | 5.9 mill | 22% | 6.8 mill | 466
(The rplB count estimates the number of species.)
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes (Adina Howe)
Human genome ~3 billion bp
33. Extracting whole genomes?
So far, we have only assembled contigs, but not whole
genomes.
Can entire genomes be
assembled from metagenomic
data?
Iverson et al. (2012), from
the Armbrust lab, contains a
technique for scaffolding
metagenome contigs into
~whole genomes. YES.
34. Concluding thoughts on
assembly
Illumina is the only game in town for sequencing complex
microbial populations, but dealing with the data
(volume, errors) is tricky. This problem is being solved, by
us and others.
We're working to make it as close to push button as
possible, with objectively argued parameters and
tools, and methods for evaluating new tools and
sequencing types.
The community is working on dealing with data
downstream of sequencing & assembly.
Most pipelines were built around 454 data – long reads, and
relatively few of them.
With Illumina, we can get both long contigs and quantitative
information about their abundance. This necessitates
changes to pipelines like MG-RAST and HUMANn.
35. The interpretation challenge
For soil, we have generated approximately 1200
bacterial genomes worth of assembled genomic DNA
from two soil samples.
The vast majority of this genomic DNA contains
unknown genes with largely unknown function.
Most annotations of gene function & interaction are
from a few phylogenetically limited model organisms
Est 98% of annotations are computationally inferred:
transferred from model organisms to genomic
sequence, using homology.
Can these annotations be transferred? (Probably not.)
This will be the biggest sequence analysis challenge
of the next 50 years.
36. How will we annotate soil??
Total Assembly | Total Contigs | % Reads Assembled | Predicted protein-coding genes | rplB genes
2.5 bill | 4.5 mill | 19% | 5.3 mill | 391
3.5 bill | 5.9 mill | 22% | 6.8 mill | 466
(The rplB count estimates the number of species.)
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes (Adina Howe)
Human genome ~3 billion bp
37. Some lessons from C. jejuni
In vivo murine transfer experiments demonstrate substantial capacity for C. jejuni 11168 to adapt solely via modification of poly-G tracts (Jerome et al., 2011).
Bell et al. (unpub) have shown substantial variability
in gene content of Campylobacter strains. Gene
content and gene expression are both important to
understanding mechanisms of pathogenicity.
In vitro serial transfer experiments demonstrate that
rapid genomic adaptation to new environments occurs
at multiple loci, with substantial variation in genes of
unknown function (Jerome et al., in preparation).
39. What works?
Today,
From deep metagenomic data, you can get the
gene and operon content (including abundance of
both) from communities.
You can get microarray-like expression
information from metatranscriptomics.
40. What needs work?
Assembling ultra-deep samples is going to
require more engineering, but is straightforward.
(“Infinite assembly.”)
Building scaffolds and extracting whole genomes
has been done, but I am not yet sure how
feasible it is to do systematically with existing
tools (c.f. Armbrust Lab).
41. What will work, someday?
Sensitive analysis of strain variation.
Both assembly and mapping approaches do a poor
job detecting many kinds of biological novelty.
The 1000 Genomes Project has developed some
good tools that need to be evaluated on community
samples.
Ecological/evolutionary dynamics in vivo.
Most work done on 16s, not on genomes or
functional content.
Here, sensitivity is really important!
42. What are future needs?
High-quality, medium+ throughput annotation of
genomes?
Extrapolating from model organisms is both
immensely important and yet lacking.
Strong phylogenetic sampling bias in existing
annotations.
Synthetic biology for investigating non-model
organisms?
(Cleverness in experimental biology doesn't scale.)
43. Pubs, software, tutorials, etc.
Metagenome assembly / HMP tutorial:
http://ged.msu.edu/angus/nih-hmp-2012/
Everything I discussed is available pre-pub -- contact
ctb@msu.edu, or Google for
khmer – software package
kmer-percolation paper (in re-review, PNAS)
digital normalization paper (in review, PLoS One)
…a few dozen people are using these, one way or another.
44. Acknowledgements
Jason Pell, Qingpeng Zhang, Arend Hintze, and
Adina Howe
Soil: Jim Tiedje (MSU), Janet Jansson
(LBNL/JGI), Susannah Tringe (JGI)
Campy: Linda Mansfield, Julia Bell, JP Jerome,
Jeff Barrick
Funding:
USDA NIFA; NSF IOS; BEACON.