2012 erin-crc-nih-seattle
1. Extracting genomes from
community sequencing
"What works, what will work, and what needs work"
C. Titus Brown
ctb@msu.edu
Computer Science; Microbiology; BEACON
Michigan State University
2. Warnings
This talk contains forward looking statements.
These forward-looking statements can be
identified by terminology such as “will”, “expects”,
and “believes”.
-- Safe Harbor provisions of the
U.S. Private Securities Litigation Reform Act
"Making predictions is difficult, especially if they're about the future."
-- Attributed to Niels Bohr
3. Thanks for the invitation!
So, Linda Mansfield and I were talking one day…
Her: "It'd be great to be able to look at communities with sequencing."
Me: "Oh, yeah, we can do that now."
My overall interest is in good hypothesis
generation from computational data, with a focus
on sequence data.
For the past three years, I have been working on
this specifically for soil metagenomics (and
mRNAseq, too).
7. Soil contains thousands to millions of species
("Collector's curves" of ~species)
[Figure: collector's curves of OTU counts. Y-axis: number of OTUs (0-2000); x-axis: number of sequences (100-8100). Series: Iowa Corn, Iowa Native Prairie, Kansas Corn, Kansas Native Prairie, Wisconsin Corn, Wisconsin Native Prairie, Wisconsin Restored Prairie, Wisconsin Switchgrass.]
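A collector's curve like the ones in the figure can be computed by shuffling the observed OTU labels and counting distinct OTUs as sequences accumulate. A minimal sketch; the function name and toy data are illustrative, not from the talk:

```python
import random

def collectors_curve(otu_labels, step=100, seed=42):
    """Collector's (rarefaction) curve: number of distinct OTUs
    observed as a function of the number of sequences sampled."""
    rng = random.Random(seed)
    shuffled = otu_labels[:]
    rng.shuffle(shuffled)
    seen = set()
    curve = []
    for i, otu in enumerate(shuffled, 1):
        seen.add(otu)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Toy community with skewed OTU abundances: otu_i has i+1 copies.
labels = [f"otu{i}" for i in range(50) for _ in range(i + 1)]
curve = collectors_curve(labels, step=200)
```

The curve flattens when further sequencing stops revealing new OTUs; in the soil data above it is still climbing after 8,000+ sequences.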
8. Ecology => function emphasis
What's there?
Is it really that complex a community?
How “deep” do we need to sequence to sample
thoroughly and systematically?
How is ecological complexity created &
maintained?
How does ecological complexity respond to
perturbation?
What organisms and gene functions are
present, including non-canonical carbon and
nitrogen cycling pathways?
What kind of organismal and functional overlap is there?
9. The human gut is a diverse place
Dethlefsen et al., 2008
10. Ecology vs function in human gut
We can observe recovery of
diversity after Cipro treatment; but
what is driving recovery at a
functional level?
Dethlefsen and Relman, 2011
11. Culture independent methods
Observation that 99% of microbes cannot easily
be cultured in the lab. (“The great plate count
anomaly”)
While this is less true for host-associated
microbes, culture independent methods are still
important:
Syntrophic relationships
Niche-specificity or unknown physiology
Dormant microbes
Abundance within communities
Single-cell sequencing & shotgun metagenomics are two common ways to investigate microbial communities.
13. Shotgun sequencing & assembly
Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
14. Shotgun sequencing & assembly
Why assembly?
Assumption free (no reference needed)
Necessary for soil and marine; useful for host-associated?
Assembly can serve as reference for transcriptome
interpretation
Fragment, sequence, computationally assemble.
What kind of results do you get?
Almost certainly chimerism between different strains; but still
useful for gene content & operon structure.
Specificity seems high, but sensitivity is dependent on
sequencing depth.
Because of sampling rate, Illumina is primary choice.
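The "fragment, sequence, computationally assemble" step is typically built on a de Bruijn graph for Illumina data. A toy sketch of the idea only, not the assembler used in the talk: real assemblers must also handle sequencing errors, repeats, and strain variation.

```python
from collections import defaultdict

def debruijn_contigs(reads, k=5):
    """Build a de Bruijn graph from reads and emit contigs by walking
    unambiguous (single-in, single-out) paths."""
    succ = defaultdict(set)   # (k-1)-mer -> set of following (k-1)-mers
    pred = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    nodes = set(succ) | set(pred)
    contigs = []
    for node in nodes:
        # skip nodes in the middle of a simple path
        if len(pred[node]) == 1 and len(succ[node]) == 1:
            continue
        for nxt in succ[node]:
            path = [node, nxt]
            while len(succ[path[-1]]) == 1 and len(pred[path[-1]]) == 1:
                path.append(next(iter(succ[path[-1]])))
            contigs.append(path[0] + "".join(p[-1] for p in path[1:]))
    return contigs

# Toy check: reads tiled across a short sequence reassemble it.
genome = "ATGGCGTGCAACT"
reads = [genome[i:i + 8] for i in range(len(genome) - 7)]
contigs = debruijn_contigs(reads, k=5)
```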
15. Shotgun metagenomics: good news
Cheap and easy to generate vast whole
metagenome/metatranscriptome shotgun data sets from
essentially any community you can sample.
Such data can be quite interesting!
Low hanging fruit – correlation with diet, etc.
Still early days for observation of “pan genome” and functional
content.
Potential to illuminate or inform:
Dynamics and selective pressures of antibiotic
resistance, virulence genes, and pathogenicity islands
Phage and viral communities
Community interactions.
16. Shotgun metagenomics: bad
news
Computational techniques are still relatively immature
Mapping to known genomes?
Discovery of unknown genomes & strain variants?
Sensitivity and specificity are hard to evaluate.
Computational ecosystem is not that rich…
Interpreting the data is still the bottleneck, of course.
Vast majority of genes not usefully annotated.
Reliance on specific reference databases, annotations.
Tools for (e.g.) inferring community interactions from
community dynamics & functional capacity are
desperately needed.
19. 2. Big data sets require big machines
For even relatively small data sets, metagenomic
assemblers scale poorly.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
Size of data set == big!!
(Estimated 6 weeks x 3 TB of RAM to do a 300 Gbp soil sample with a slightly modified conventional assembler.)
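The "memory ~ errors" relationship follows because each base-calling error in a read creates up to k spurious novel k-mers that the assembler must store. A back-of-envelope calculation; the rates and counts below are illustrative assumptions, not figures from the talk:

```python
# Why assembler memory tracks error count: each base error in a read
# creates up to k spurious novel k-mers, so error k-mers can quickly
# rival or exceed the k-mers of real variation.
k = 32
read_len = 100
error_rate = 0.01               # assumed 1% per-base Illumina-style error
n_reads = 3_000_000_000         # a 300 Gbp data set of 100 bp reads

true_kmers = 50_000_000_000     # assumed distinct k-mers of real variation
errors = n_reads * read_len * error_rate
error_kmers = errors * k        # worst case: k novel k-mers per error

print(error_kmers / true_kmers) # error k-mers outnumber real ones here
```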
21. Approach 1: Partitioning
Split reads into “bins”
belonging to
different source
species.
Can do this based
almost entirely on
connectivity of
sequences.
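The connectivity-based binning above can be sketched with union-find over shared k-mers. The real implementation (khmer) instead traverses a compact probabilistic graph representation, so treat this as an illustration of the idea only:

```python
def partition_reads(reads, k=5):
    """Group reads into partitions that share at least one k-mer,
    a proxy for connectivity in the assembly graph."""
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    kmer_owner = {}  # k-mer -> first read index that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in kmer_owner:
                union(kmer_owner[kmer], idx)
            else:
                kmer_owner[kmer] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return list(groups.values())

# Reads 0/1 overlap, and reads 2/3 overlap: two partitions.
parts = partition_reads(["ATGGCGT", "GGCGTGC", "TTTTCCC", "TTCCCGG"], k=5)
```

Each partition can then be assembled independently, which is what makes the memory problem tractable.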
22. Technical challenges met (and defeated)
Novel data structure properties elucidated via
percolation theory analysis (Pell, Hintze, et al., in
review, PNAS).
Exhaustive in-memory traversal of graphs
containing 5-15 billion nodes.
Sequencing technology introduces false
sequences in graph (Howe et al., in prep.)
Only 20x improvement in assembly scaling.
23. (NOVEL)
Approach 2: Digital normalization
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A!
Overkill!!
This 100x will consume
disk space and, because
of errors, memory.
24. Digital normalization discards
redundant reads prior to assembly.
This removes reads and decreases data
size, eliminates errors from removed reads, and
normalizes coverage across loci.
Discarded reads can be used after assembly for
quantitative analysis.
A read's median k-mer count is a good estimator of "coverage".
This gives us a
reference-free
measure of
coverage.
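Putting the last two slides together, the digital normalization algorithm is: keep a read only if its median k-mer count so far is below a coverage cutoff, otherwise discard it as redundant. A sketch using an exact counting dictionary for clarity; khmer uses a fixed-memory probabilistic counting structure instead:

```python
from statistics import median

def digital_normalization(reads, k=5, cutoff=20):
    """Streaming diginorm sketch: a read's median k-mer count against
    the counts accumulated so far estimates its coverage; reads at or
    above the cutoff are discarded as redundant."""
    counts = {}
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        est_coverage = median(counts.get(km, 0) for km in kmers)
        if est_coverage < cutoff:
            kept.append(read)
            for km in kmers:
                counts[km] = counts.get(km, 0) + 1
    return kept

# 100 identical reads collapse to the cutoff; a novel read is kept.
kept_reads = digital_normalization(["ATGGCGTGCA"] * 100 + ["TTTTTCCCCC"])
```

Because errors are rare, error k-mers stay low-count and do not inflate the median, which is why discarding high-coverage reads also discards most errors.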
26. Shotgun data is often (1) high
coverage and (2) biased in coverage.
(MDA amplified)
27. Digital normalization fixes all that.
Normalizes coverage
Discards redundancy
Eliminates majority of
errors
Scales assembly dramatically
Assembly is 98% identical
30. How much? A mathematical
interlude.
Suppose we need 10x coverage to assemble a
microbial genome, and microbial genomes
average 5e6 bp of DNA.
Further suppose that we want to be able to
assemble a microbial species that is “1 in a
100000”, i.e. 1 in 1e5.
Shotgun sequencing samples randomly, so must
sample deeply to be sensitive.
10x coverage x 5e6 bp x 1e5 =~ 5e12, or 5 Tbp of sequence.
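The slide's arithmetic, as a quick sanity check:

```python
# Required sequencing depth to assemble a rare community member.
target_coverage = 10     # x coverage needed to assemble
genome_size = 5e6        # average microbial genome, bp
rarity = 1e5             # species is 1 in 100,000

required = target_coverage * genome_size * rarity
print(f"{required / 1e12:.0f} Tbp")   # -> 5 Tbp
```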
31. Example
Dethlefsen shotgun data set / Relman lab
251 M reads / 16 GB gzipped FASTQ
~24 hrs, <32 GB of RAM for full pipeline -- $24 on Amazon EC2
(reads => final assembly + mapping)
Assembly stats:
58,224 contigs > 1000 bp (average 3 kb)
summing to 190 Mbp genomic
~38 microbial genomes worth of DNA
~65% of reads mapped back to assembly
32. What do we get for soil?
Total Assembly | Total Contigs | % Reads Assembled | Predicted protein-coding genes | rplB genes
2.5 bill | 4.5 mill | 19% | 5.3 mill | 391
3.5 bill | 5.9 mill | 22% | 6.8 mill | 466
(The rplB count estimates the number of species.)
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes (Adina Howe)
Human genome ~3 billion bp
33. Extracting whole genomes?
So far, we have only assembled contigs, but not whole
genomes.
Can entire genomes be
assembled from metagenomic
data?
Iverson et al. (2012), from
the Armbrust lab, contains a
technique for scaffolding
metagenome contigs into
~whole genomes. YES.
34. Concluding thoughts on
assembly
Illumina is the only game in town for sequencing complex
microbial populations, but dealing with the data
(volume, errors) is tricky. This problem is being solved, by
us and others.
We're working to make it as close to push button as
possible, with objectively argued parameters and
tools, and methods for evaluating new tools and
sequencing types.
The community is working on dealing with data
downstream of sequencing & assembly.
Most pipelines were built around 454 data – long reads, and
relatively few of them.
With Illumina, we can get both long contigs and quantitative
information about their abundance. This necessitates
changes to pipelines like MG-RAST and HUMANn.
35. The interpretation challenge
For soil, we have generated approximately 1200
bacterial genomes worth of assembled genomic DNA
from two soil samples.
The vast majority of this genomic DNA contains
unknown genes with largely unknown function.
Most annotations of gene function & interaction are
from a few phylogenetically limited model organisms
Est 98% of annotations are computationally inferred:
transferred from model organisms to genomic
sequence, using homology.
Can these annotations be transferred? (Probably not.)
This will be the biggest sequence analysis challenge
of the next 50 years.
36. How will we annotate soil??
Total Assembly | Total Contigs | % Reads Assembled | Predicted protein-coding genes | rplB genes
2.5 bill | 4.5 mill | 19% | 5.3 mill | 391
3.5 bill | 5.9 mill | 22% | 6.8 mill | 466
(The rplB count estimates the number of species.)
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes (Adina Howe)
Human genome ~3 billion bp
37. Some lessons from C. jejuni
In vivo murine transfer experiments demonstrate substantial capacity for C. jejuni 11168 to adapt solely via modification of poly-G tracts (Jerome et al., 2011).
Bell et al. (unpub) have shown substantial variability
in gene content of Campylobacter strains. Gene
content and gene expression are both important to
understanding mechanisms of pathogenicity.
In vitro serial transfer experiments demonstrate that
rapid genomic adaptation to new environments occurs
at multiple loci, with substantial variation in genes of
unknown function (Jerome et al., in preparation).
39. What works?
Today,
From deep metagenomic data, you can get the
gene and operon content (including abundance of
both) from communities.
You can get microarray-like expression
information from metatranscriptomics.
40. What needs work?
Assembling ultra-deep samples is going to
require more engineering, but is straightforward.
(“Infinite assembly.”)
Building scaffolds and extracting whole genomes
has been done, but I am not yet sure how
feasible it is to do systematically with existing
tools (c.f. Armbrust Lab).
41. What will work, someday?
Sensitive analysis of strain variation.
Both assembly and mapping approaches do a poor
job detecting many kinds of biological novelty.
The 1000 Genomes Project has developed some
good tools that need to be evaluated on community
samples.
Ecological/evolutionary dynamics in vivo.
Most work done on 16s, not on genomes or
functional content.
Here, sensitivity is really important!
42. What are future needs?
High-quality, medium+ throughput annotation of
genomes?
Extrapolating from model organisms is both
immensely important and yet lacking.
Strong phylogenetic sampling bias in existing
annotations.
Synthetic biology for investigating non-model
organisms?
(Cleverness in experimental biology doesn't scale.)
43. Pubs, software, tutorials, etc.
Metagenome assembly / HMP tutorial:
http://ged.msu.edu/angus/nih-hmp-2012/
Everything I discussed is available pre-pub -- contact
ctb@msu.edu, or Google for
khmer – software package
kmer-percolation paper (in re-review, PNAS)
digital normalization paper (in review, PLoS One)
…a few dozen people are using these, one way or another.
44. Acknowledgements
Jason Pell, Qingpeng Zhang, Arend Hintze, and
Adina Howe
Soil: Jim Tiedje (MSU), Janet Jansson
(LBNL/JGI), Susannah Tringe (JGI)
Campy: Linda Mansfield, Julia Bell, JP Jerome,
Jeff Barrick
Funding:
USDA NIFA; NSF IOS; BEACON.