U Florida / Gainesville talk, apr 13 2011

Divide and conquer applied to metagenomic DNA C. Titus Brown ctb@msu.edu CSE / MMG, Michigan State University

A brief intro to shotgun assembly
Fragments:
  It was the best of times, it was the wor
  , it was the worst of times, it was the
  isdom, it was the age of foolishness
  mes, it was the age of wisdom, it was th
Assembled: It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
…but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.

Assemble based on word overlaps:
  the quick brown fox jumped
  jumped over the lazy dog
  => the quick brown fox jumped over the lazy dog
Repeats do cause problems:
  my chemical romance: nanana
  nanana, batman!
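A minimal sketch of the overlap idea on this slide, using the slide's own example strings. The function names (`suffix_prefix_overlaps`, `merge`) and the `min_overlap` parameter are illustrative assumptions, not part of any assembler mentioned in the talk.

```python
from typing import List, Optional


def suffix_prefix_overlaps(a: str, b: str) -> List[int]:
    """Return every length n where the last n characters of a equal the first n of b."""
    return [n for n in range(1, min(len(a), len(b)) + 1) if a.endswith(b[:n])]


def merge(a: str, b: str, min_overlap: int = 3) -> Optional[str]:
    """Join a and b on their longest overlap, or return None if no overlap is long enough."""
    lengths = [n for n in suffix_prefix_overlaps(a, b) if n >= min_overlap]
    if not lengths:
        return None
    return a + b[max(lengths):]


# Unique overlap: the two fragments join unambiguously.
assembled = merge("the quick brown fox jumped", "jumped over the lazy dog")
print(assembled)

# Repeat: several overlap lengths are equally valid, so the join is ambiguous.
repeat_overlaps = suffix_prefix_overlaps("my chemical romance: nanana", "nanana, batman!")
print(repeat_overlaps)
```

The repeat case returns more than one candidate overlap length, which is the toy version of why repeats cause problems for assembly.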

Whole genome shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)

How does assembly scale? Our assembly approach scales with the amount of genomic novelty present in the sample. For “sane” problems (microbes, human genome, etc.) this isn’t too bad, although challenging. For metagenomes, with millions of different species at different abundances, this is an intractable problem (so far)…

Great Plains Grand Challenge: sampling sites
Wisconsin: Native prairie (Goose Pond, Audubon); Long term cultivation (corn); Switchgrass rotation (previously corn); Restored prairie (from 1998)
Iowa: Native prairie (Morris prairie); Long term cultivation (corn)
Kansas: Native prairie (Konza prairie); Long term cultivation (corn)
[Photo labels: Iowa native prairie; Switchgrass (Wisconsin); Iowa >100 yr tilled]

Sampling strategy per site
[Figure: sampling layout around a reference soil core, with distance labels of 1 cm, 1 m and 10 m]
Soil cores: 1 inch diameter, 4 inches deep
Total: 8 reference metagenomes + 64 spatially separated cores (pyrotag sequencing)

Community composition. Soil metagenome: Illumina shotgun sequencing; 454 Titanium pyrotag sequencing; 454 Titanium shotgun sequencing.

What kinds of questions? What genes are present? What species are present? What are those species doing, physiologically speaking? How does “function” change with cultivation, CO2, fertilizer types, crop cycles, etc? We are at a “pre-question” stage, unfortunately…

The basic problem. Lots of metagenomic sequence data (200 GB Illumina for < $20k?) Assembly, especially metagenome assembly, scales poorly (due to high diversity). Standard assembly techniques don’t work well with sequences from multiple abundance genomes. Many people don’t have the necessary computational resources to assemble (~1 TB of RAM or more, if at all).

We can’t just throw more hardware at the problem… Lincoln Stein

Hat tip to Narayan Desai / ANL We don’t have enough resources or people to analyze data.

Data generation vs data analysis It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week. (Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.) …x1000 sequencers Many useful analyses do not scale linearly in RAM or CPU with the amount of data.

The challenge: Massive (and increasing) data generation capacity, operating at a boutique level, with algorithms that are wholly incapable of scaling to the data volume. Note: cloud computing isn’t a solution to a sustained scaling problem!! (See: Moore’s Law slide)

Life’s too short to tackle the easy problems – come to academia! Easy stuff like Google Search Awesomeness

Assembly of shotgun sequence
Fragments:
  It was the best of times, it was the wor
  , it was the worst of times, it was the
  isdom, it was the age of foolishness
  mes, it was the age of wisdom, it was th
Assembled: It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
…but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.

K-mer graphs - overlaps J.R. Miller et al. / Genomics (2010)

K-mer graphs - branching For decisions about which paths etc, biology-based heuristics come into play as well.
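As a rough illustration of the k-mer graphs on these two slides, the sketch below builds a graph in which each k-mer links to the k-mer that follows it in a read, then reports nodes with more than one successor. The toy reads, `k = 4` and the function names are assumptions chosen for the example. Reverse complements are ignored for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Set


def kmers_of(seq: str, k: int) -> List[str]:
    """All k-length substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_graph(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        ks = kmers_of(read, k)
        for current, following in zip(ks, ks[1:]):
            graph[current].add(following)
    return graph


# Hypothetical reads: the last two share a prefix and then diverge.
reads = ["ATGGCGT", "GGCGTGC", "GGCGAAT"]
graph = build_kmer_graph(reads, k=4)

# A branch is any k-mer with more than one possible next k-mer.
branches = {node: sorted(nxt) for node, nxt in graph.items() if len(nxt) > 1}
print(branches)
```

Here the reads agree up to `GGCG` and then split, so the assembler has to choose a path at that node.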

Billions and billions of …
>850:2:1:1943:15232/1 0 CCTGCCTGTGGAGCAGCCCACGCAGTTCGAGCTGATCATCAACCTCAAGACGGCCCAAGCCCTTGGCATCACGATT
>850:2:1:1943:15232/2 0 ACACCATTTAATCTTAGCCATAAAAGTTGTATAAGCATCAACGTTTTGTTTGTCTCAAAAAACGATTTTTTTTTTG
>850:2:1:1943:19543/1 0 ACTGTAGGTTTCTGGCTGCGTCCGACGATAGCAGCCCGCTCTGCCGACATTGTCA
>850:2:1:1945:16822/2 0 AGTCGACAGATCGACCTGAAGGAGGTGCCGGGAATTGAAGTCATCCAGGGCGCCGAGGAGAACTGATCGG
>850:2:1:1946:10202/2 0 AGCTTTTTCGCGCGCGTGAAAAAGCTTTGTCGATTTCTGGGTTTCGGCCTTCTCACAGTCACCGCCGAGGGCCGGG
>850:2:1:1947:6533/2 0 GGTCTCCGGACACACGAAGGCACGGCTCTCCGAGAAGCGGAGGATGTACTCGACCTCACGGCTGC
>850:2:1:1948:15431/1 0 ACCGCTTACTCGATGATGGAGCAAGGCAGAATCGACATGATTCTGAGCTCGCGTCCCGAAGATCGACGCGCGG
>850:2:1:1949:19998/1 0 AATTCAAAGTAGGCATTTTTGTTTTTGTAGGGTTGGCGATGTTAGGCGCGCTGGTCGTGCAATTC
>850:2:1:1950:4213/2 0 CCAACCGGGCCCTGGTCCTGCACGCCAACCTGTCCCCGCTGGTGG
>850:2:1:1950:1388/1 0 CAGCCGCAATGTTGGCATTCTTCAGCAGTTCGAGCGCCACAAAGCGGTCATTGTCTGAGGCTTCTGGG

Too much data – what can we do? Reduce the size of the data (either with an approximate or an exact approach) Divide & conquer: subdivide the problem. For exact data reduction or subdivision, need to grok the entire assembly graph structure. …but that is why assembly scales poorly in the first place.

Two exact data reduction techniques: Eliminate reads that do not connect to many other reads. Group reads by connectivity into different partitions of the entire graph. For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.

Eliminating unconnected reads “Graphsize filtering”

Subdividing reads by connection “Partitioning”

Two exact data reduction techniques: Eliminate reads that do not connect to many other reads (“graphsize filtering”). Group reads by connectivity into different partitions of the entire graph (“partitioning”). For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
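The two techniques can be sketched on toy data as follows. This is a simplified illustration, not the talk's implementation. It treats two reads as connected when they share at least one k-mer, groups them with union-find, and then drops reads whose partition contains fewer than `min_kmers` distinct k-mers. The reads, `k = 5`, the threshold and all function names are assumptions. Reverse complements are ignored.

```python
from typing import Dict, List


def kmers_of(seq: str, k: int) -> List[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def partition_reads(reads: List[str], k: int) -> List[List[str]]:
    """Group reads into connected components, linking reads that share a k-mer."""
    parent = list(range(len(reads)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen: Dict[str, int] = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for kmer in kmers_of(read, k):
            if kmer in first_seen:
                root_a, root_b = find(idx), find(first_seen[kmer])
                if root_a != root_b:
                    parent[root_a] = root_b
            else:
                first_seen[kmer] = idx

    groups: Dict[int, List[str]] = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return sorted(groups.values(), key=len, reverse=True)


def graphsize_filter(reads: List[str], k: int, min_kmers: int) -> List[str]:
    """Keep only reads whose partition holds at least min_kmers distinct k-mers."""
    kept: List[str] = []
    for group in partition_reads(reads, k):
        distinct = {kmer for read in group for kmer in kmers_of(read, k)}
        if len(distinct) >= min_kmers:
            kept.extend(group)
    return kept


# Hypothetical reads: two overlapping pairs plus one isolated low-complexity read.
reads = ["ATGGCGTGCA", "GCGTGCAATC", "TTTTCCCCGG", "CCCCGGAAAA", "ACACACACAC"]
partitions = partition_reads(reads, k=5)
kept = graphsize_filter(reads, k=5, min_kmers=5)
print([len(p) for p in partitions], len(kept))
```

Each partition can then be handed to an assembler independently, which is what makes the subdivision useful for memory and parallelism.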

Engineering overview Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure; With this, we can store graphs efficiently in memory, ~1-2 bytes/(unique) k-mer for arbitrary k. Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes). For details see source code (github.com/ctb/khmer), or online webinar: http://oreillynet.com/pub/e/1784

Store graph nodes in Bloom filter Graph traversal is done in full k-mer space; Presence/absence of individual nodes is kept in Bloom filter data structure (hash tables w/o collision tracking).
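A minimal sketch of the idea, assuming a plain bit-array Bloom filter with SHA-256-derived hash positions. This is not the khmer implementation: the class, its parameters and the `neighbors` helper are all hypothetical. Nodes are stored only as set bits, and the graph is traversed by generating the eight possible neighboring k-mers and asking the filter whether each one is present.

```python
import hashlib
from typing import Iterator, Set


class BloomFilter:
    """Bit-array membership structure: no false negatives, some false positives."""

    def __init__(self, n_bits: int, n_hashes: int) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive n_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))


def neighbors(bloom: BloomFilter, kmer: str) -> Set[str]:
    """Candidate neighbors in full k-mer space that the filter reports as present."""
    found: Set[str] = set()
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate in bloom:
                found.add(candidate)
    return found


K = 4
read = "ATGGCGTGCA"  # hypothetical read
read_kmers = [read[i:i + K] for i in range(len(read) - K + 1)]

bloom = BloomFilter(n_bits=1024, n_hashes=3)
for kmer in read_kmers:
    bloom.add(kmer)

print(sorted(neighbors(bloom, "GGCG")))
```

Because the filter never stores the k-mers themselves, memory per node is fixed by the bit-array size rather than by k. The cost is that a small fraction of absent k-mers may be reported as present.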

Practical application. Enables: graph trimming (exact removal); partitioning (exact subdivision); abundance filtering … all for K <= 64, for 200+ gb sequence collections. All results (except for comparison) obtained using a single Amazon EC2 4xlarge node, 68 GB of RAM / 8 cores. Similar running times to using Velvet alone.

We pre-filter data for assembly:

Does removing small graphs work?
Small data set (35m reads / 3.4 gb rhizosphere soil sample). Filtered at k=32, assembled at k=33 with ABySS.
                      N contigs   Total bp   Largest contig
Unfiltered (35m)      130         223,341    61,766
Filtered (2m reads)   130         223,341    61,766
YES.

Does partitioning into disconnected graphs work?
Partitioned same data set (35m reads / 3.5 gb) into 45k partitions containing > 10 reads; assembled partitions separately (k0=32, k=33).
                      N contigs   Total bp   Largest contig
Unfiltered (35m)      130         223,341    61,766
Sum partitions        130         223,341    61,766
YES.

Data reduction for assembly / practical details
Reduction performed on machine with 16 gb of RAM.
Removing poorly connected reads: 35m -> 2m reads.
- Memory required reduced from 40 gb to 2 gb;
- Time reduced from 4 hrs to 20 minutes.
Partitioning reads into disconnected groups:
- Biggest group is 300k reads
- Memory required reduced from 40 gb to 500 mb;
- Time reduced from 4 hrs to < 5 minutes/group.

Does it work on bigger data sets?
Iowa continuous corn GA2 partitions (218.5 m reads):
  P1: 204,582,365 reads
  P2: 3583 reads
  P3: 2917 reads
  P4: 2463 reads
  P5: 2435 reads
  P6: 2316 reads
  …
35 m read data set partition sizes:
  P1: 277,043 reads
  P2: 5776 reads
  P3: 4444 reads
  P4: 3513 reads
  P5: 2528 reads
  P6: 2397 reads
  …

Problem: big data sets have one big partition!? Too big to handle on EC2. Assembles with low coverage. Contains 2.5 bn unique k-mers (~500 microbial genomes), at ~3-5x coverage. As we sequence more deeply, the “lump” becomes a bigger percentage of reads => trouble! Both for our approach, and possibly for assembly in general (because it assembles more poorly than it should, for given coverage/size).

Why this lump? Real biological connectivity (rRNA, conserved genes, etc.) Bug in our software Sequencing artifact or error

Why this lump? Real biological connectivity? Probably not. - Increasing K from 32 to ~64 didn’t break up the lump: not biological. Bug in our software? Probably not. Sequencing artifact or error? YES. - (Note, we do filter & quality trim all sequences already)

“Good” vs “bad” assembly graph Low density High density

Non-biological levels of local graph connectivity:

Higher local graph density correlates with position in read

Higher local graph density correlates with position in read ARTIFACT

Trimming reads Trim at high “soddd”, sum of degree degree distribution: From each k-mer in each read, walk two k-mers in all directions in the graph; If more than 3 k-mers can be found at exactly two steps, trim remainder of sequence. Overly stringent; actually trimming (k-1) connectivity graph by degree.
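The trimming rule on this slide can be sketched on a toy graph as below. This is an illustration under stated assumptions, not the talk's code. An exact Python `set` stands in for the Bloom filter, `k = 4`, and the reads are invented. The choice to keep `read[:i + k - 1]` (everything before the last base of the first dense k-mer) is one reading of "trim remainder of sequence". The threshold of 3 is the one quoted on the slide.

```python
from typing import List, Set

K = 4
MAX_AT_TWO_STEPS = 3  # threshold quoted on the slide


def kmers_of(seq: str, k: int = K) -> List[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(kmer: str, graph: Set[str]) -> Set[str]:
    """K-mers in the graph one step before or after kmer (excluding kmer itself)."""
    found: Set[str] = set()
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate in graph and candidate != kmer:
                found.add(candidate)
    return found


def count_at_two_steps(kmer: str, graph: Set[str]) -> int:
    """Number of distinct k-mers at exactly two steps from kmer, in any direction."""
    first = neighbors(kmer, graph)
    second: Set[str] = set()
    for node in first:
        second |= neighbors(node, graph)
    second -= first
    second.discard(kmer)
    return len(second)


def trim_read(read: str, graph: Set[str], k: int = K) -> str:
    """Cut the read at the first k-mer whose two-step neighborhood is too dense."""
    for i, kmer in enumerate(kmers_of(read, k)):
        if count_at_two_steps(kmer, graph) > MAX_AT_TWO_STEPS:
            return read[: i + k - 1]  # assumption: drop the dense k-mer's last base onward
    return read


# Hypothetical data: one read runs into a k-mer (CGTA) that fans out four ways.
main_read = "TTGACCGTA"
fan_out_reads = ["CGTAAA", "CGTACA", "CGTAGA", "CGTATA"]
graph = {kmer for read in [main_read] + fan_out_reads for kmer in kmers_of(read)}

print(trim_read(main_read, graph))
print(trim_read("TTGACC", graph))
```

On a simple linear path each k-mer sees at most two k-mers at distance two, so only the region next to the fan-out is cut.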

Trimmed read examples
>895:5:1:1986:16019/2 TGAGCACTACCTGCGGGCCGGGGACCGGGTCAGCCTGCT CGACCTGGGCCAACCGATGCGCC
>895:5:1:1995:6913/1 TTGCGCGCCATGAAGCGGTTAACGCGCTCGGTCCATAGC GCGATG
>895:5:1:1995:6913/2 GTTCATCGCGCTATGGACCGAGCGCGTTAACCGCTTCAT GGCGCGCAAAGATCGGAAGAGCGTCGTGTAG

Preferential attachment due to bias Any sufficiently large collection of connected reads will have one or more reads containing an artifact; These artifacts will then connect that group of reads to all other groups possessing artifacts; …and all high-coverage contigs will amalgamate into a single graph.

Artifacts from sequencing falsely connect graphs

Groxel view of knot-like region / Arend Hintze

Density trimming breaks up the lump:
Old P1, soddd trimmed (204.6 m reads -> 179 m):
  P1: 23,444,332 reads
  P2: 60,703 reads
  P3: 48,818 reads
  P4: 39,755 reads
  P5: 34,902 reads
  P6: 33,284 reads
  …
Untrimmed partitioning (218.5 m reads):
  P1: 204,582,365 reads
  P2: 3583 reads
  P3: 2917 reads
  P4: 2463 reads
  P5: 2435 reads
  P6: 2316 reads
  …
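To put the two partition lists on the same scale, the short calculation below expresses the largest partition as a fraction of the reads that were partitioned. The totals are the rounded figures from the slide (218.5 m and 179 m), so the percentages are approximate. The variable names are illustrative.

```python
# Figures copied from the slide; totals are rounded there, so results are approximate.
untrimmed_total = 218_500_000   # "218.5 m reads"
untrimmed_p1 = 204_582_365      # largest partition before trimming

trimmed_total = 179_000_000     # old P1 after soddd trimming ("179 m")
trimmed_p1 = 23_444_332         # largest partition after repartitioning

frac_before = untrimmed_p1 / untrimmed_total
frac_after = trimmed_p1 / trimmed_total

print(f"largest partition before trimming: {frac_before:.1%}")
print(f"largest partition after trimming:  {frac_after:.1%}")
```

By this estimate the lump drops from roughly 94% of the partitioned reads to roughly 13% of the trimmed set.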

What does density trimming do to assembly?
204 m reads in lump: assembles into 52,610 contigs; total 73.5 MB
180 m reads in trimmed lump: assembles into 57,135 contigs; total 83.6 MB
(all contigs > 1kb)
Filtered/partitioned @k=32, assembled @ k=33, expcov=auto, cov_cutoff=0

Wait, what? Yes, trimming these “knot-like” sequences improves the overall assembly! We remove 25.6 m reads and gain 10.1 MB!? Trend is same for ABySS, another k-mer graph assembler, as well.

So what’s going on? Current assemblers are bad at dealing with certain graph structures (“knots”). If we can untangle knots for them, that’s good, maybe? Or, by eliminating locations where reads from differently abundant contigs connect, repeat resolution improves? Happens with other k-mer graph assemblers (ABySS), and with at least one other (non-metagenomic) data set.

OK, let’s assemble! Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to: 148,053 contigs, in 220 MB; max length 20322; max coverage ~10x. …all done on Amazon EC2, ~ 1 week for under $500. Filtered/partitioned @k=32, assembled @ k=33, expcov=auto, cov_cutoff=0

Full Iowa corn / mapping stats
1,806,800,000 QC/trimmed reads (1.8 bn)
204,900,000 reads map to some contig (11%)
37,244,000 reads map to contigs > 1kb (2.1%)
> 1 kb contig is a stringent criterion! Compare: 80% of MetaHIT reads to > 500 bp; 65%+ of rumen reads to > 1kb
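The percentages on this slide follow directly from the read counts. A quick check, with variable names chosen for illustration:

```python
# Read counts copied from the slide.
total_reads = 1_806_800_000       # QC/trimmed reads
mapped_any_contig = 204_900_000   # reads mapping to some contig
mapped_over_1kb = 37_244_000      # reads mapping to contigs > 1 kb

pct_any = 100 * mapped_any_contig / total_reads
pct_over_1kb = 100 * mapped_over_1kb / total_reads

print(f"map to some contig:   {pct_any:.1f}%")
print(f"map to contigs > 1kb: {pct_over_1kb:.1f}%")
```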

Success, tentatively. We are still evaluating assembly and assembly parameters; should be possible to improve in every way. (~10 hrs to redo entire assembly, once partitioned.) The main engineering point is that we can actually run this entire pipeline on a relatively small machine (8 core/68 GB RAM) We can do dozens of these in parallel on Amazon rental hardware. And, from our preliminary results, we get ~ equivalent assembly results as if we were scaling our hardware.

Conclusions Engineering: can assemble large data sets. Scaling: can assemble on rented machines. Science: can optimize assembly for individual partitions. Science: retain low-abundance.

Caveats. Quality of assembly?? Illumina sequencing bias/error issue needs to be explored. Scaffolding with Velvet causes systematic problems. Regardless of Illumina-specific issue, it’s good to have tools/approaches to look at structure of large graphs.

Future thoughts Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers. So it is a good first step to try, even if it doesn’t reduce the problem significantly. Divide & conquer approach should allow more sophisticated (compute intensive) graph analysis approaches in the future. This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence. Can k-mer filtering work for non-de Bruijn graph assemblers? (SGA, ALLPATHS-LG, …) mRNAseq and genome artifact filtering?

Better artifact filtering?
Kmer-> GTCGTAGTTCAGTTGGTTAGAACGCCGGCCTG
747:3:13:7042:16004/1 GATATCTGCAATATCCCGTTCGAATGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGGAGGCC
747:3:14:10559:9771/1 GAAATTCCGGTTTGATGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCG
747:3:14:17232:4498/1 CAAATTTGAGATCTGAGATCCCAGGGGTTTGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGG
747:3:15:7871:10206/1 TTTGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCGAGTCCCGTCGG
747:3:16:17865:15895/2 TCAGGAGACGCCAGGGCGGTCTGAGTTCTTCAGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGG
747:3:27:9549:13966/1 GGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCGAGTCCCGTCGGCTCCGCC
747:3:30:10672:3136/1 GCGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGGAGGTCGCGAGTTCGAGTCTCGTCGGCCC

All paths lead to the same k-mers Histogram of k-mer traversal counts. Number of times k-mer is traversed

Estimating sequencing return on investment
To reach ~rumen depth of sampling of top abundance organisms, would need ~1-2 TB.
[Figure labels: 10x sequencing coverage (1900 GB); 5x sequencing coverage (931 GB); <1% novel sequence]

Argonne National Laboratory Institute for Genomic and Systems Biology

Earth Microbiome Project (www.earthmicrobiome.org)
Goal: to systematically approach the problem of characterizing microbial life on earth.
Paradigm shift to analyzing communities from a microbe’s perspective.
Strategy: Explore microbes in environmental parameter space; design ‘ideal’ strategy to interrogate these biomes; acquire samples and sequence broad and deep (DNA, mRNA and rRNA); define microbial community structure and the protein universe.
Gilbert et al., 2010a,b, Standards in Genomic Sciences, open access

Challenges
2.4 quadrillion base pairs (2.4 petabases) = 8000 HiSeq 2000 runs.
Global Environmental Sample Database (GESI): identification and selection of 200,000 environmental samples (soil, air, marine and freshwater, host-associated, etc.).
The standardization of sampling, sample prep and sample processing, cataloging and sample metadata: the Genomic Standards Consortium can help!
The coordination of thousands of “volunteer” scientists for site characterization, sample collecting and processing.

Acknowledgements:
The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze
Collaborators: Jim Tiedje (Il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU)
Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
