Divide and conquer applied to metagenomic DNA C. Titus Brown ctb@msu.edu CSE / MMG, Michigan State University
A brief intro to shotgun assembly
Fragments:
  It was the best of times, it was the wor
  , it was the worst of times, it was the
  isdom, it was the age of foolishness
  mes, it was the age of wisdom, it was th
Assembled:
  It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
…but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.
Assemble based on word overlaps:
  the quick brown fox jumped
  jumped over the lazy dog
  the quick brown fox jumped over the lazy dog
Repeats do cause problems:
  my chemical romance: nanana nanana, batman!
Whole genome shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
How does assembly scale? Our assembly approach scales with the amount of genomic novelty present in the sample. For “sane” problems (microbes, human genome, etc.) this isn’t too bad, although challenging. For metagenomes, with millions of different species at different abundances, this is an intractable problem (so far)…
Great Plains Grand Challenge – Sampling sites
Wisconsin
  - Native prairie (Goose Pond, Audubon)
  - Long term cultivation (corn)
  - Switchgrass rotation (previously corn)
  - Restored prairie (from 1998)
Iowa
  - Native prairie (Morris prairie)
  - Long term cultivation (corn)
Kansas
  - Native prairie (Konza prairie)
  - Long term cultivation (corn)
[Image labels: Iowa Native Prairie; Switchgrass (Wisconsin); Iowa >100 yr tilled]
Sampling strategy per site
[Diagram labels: Reference soil; 1 cM, 1 M, 10 M]
Soil cores: 1 inch diameter, 4 inches deep
Total: 8 Reference metagenomes + 64 spatially separated cores (pyrotag sequencing)
Community composition Soil Metagenome Illumina shotgun sequencing 454 Titanium Pyrotag sequencing 454 Titanium Shotgun sequencing
What kinds of questions? What genes are present? What species are present? What are those species doing, physiologically speaking? How does “function” change with cultivation, CO2, fertilizer types, crop cycles, etc? We are at a “pre-question” stage, unfortunately…
The basic problem. Lots of metagenomic sequence data (200 GB Illumina for < $20k?) Assembly, especially metagenome assembly, scales poorly (due to high diversity). Standard assembly techniques don’t work well with sequences from multiple abundance genomes. Many people don’t have the necessary computational resources to assemble (~1 TB of RAM or more, if at all).
We can’t just throw more hardware at the problem… Lincoln Stein
Hat tip to Narayan Desai / ANL We don’t have enough resources or people to analyze data.
Data generation vs data analysis It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week.   (Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.)  …x1000 sequencers Many useful analyses do not scale linearly in RAM or CPU with the amount of data.
The challenge: Massive (and increasing) data generation capacity, operating at a boutique level, with algorithms that are wholly incapable of scaling to the data volume. Note: cloud computing isn’t a solution to a sustained scaling problem!!  (See: Moore’s Law slide)
Life’s too short to tackle the easy problems – come to academia! Easy stuff like Google Search Awesomeness
K-mer graphs - overlaps J.R. Miller et al. / Genomics (2010)
K-mer graphs - branching For decisions about which paths etc, biology-based heuristics come into play as well.
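To make the k-mer graph and its branch points concrete, here is a minimal Python sketch. It is a simplification under stated assumptions: it ignores reverse complements and sequencing errors, and `kmer_graph` and `branch_points` are illustrative names, not functions from any assembler.

```python
from collections import defaultdict


def kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Consecutive k-mers in a read overlap by k-1 bases.
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def branch_points(graph):
    """K-mers with more than one successor: places where a path must be chosen."""
    return sorted(kmer for kmer, successors in graph.items() if len(successors) > 1)


if __name__ == "__main__":
    g = kmer_graph(["ACGTA", "CGTT"], k=3)
    print(branch_points(g))
```

In this toy input the k-mer CGT is followed by GTA in one read and GTT in the other, so it is a branch point; deciding which way to go is where the heuristics come in.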
Billions and billions of …
>850:2:1:1943:15232/1   0 CCTGCCTGTGGAGCAGCCCACGCAGTTCGAGCTGATCATCAACCTCAAGACGGCCCAAGCCCTTGGCATCACGATT
>850:2:1:1943:15232/2   0 ACACCATTTAATCTTAGCCATAAAAGTTGTATAAGCATCAACGTTTTGTTTGTCTCAAAAAACGATTTTTTTTTTG
>850:2:1:1943:19543/1   0 ACTGTAGGTTTCTGGCTGCGTCCGACGATAGCAGCCCGCTCTGCCGACATTGTCA
>850:2:1:1945:16822/2   0 AGTCGACAGATCGACCTGAAGGAGGTGCCGGGAATTGAAGTCATCCAGGGCGCCGAGGAGAACTGATCGG
>850:2:1:1946:10202/2   0 AGCTTTTTCGCGCGCGTGAAAAAGCTTTGTCGATTTCTGGGTTTCGGCCTTCTCACAGTCACCGCCGAGGGCCGGG
>850:2:1:1947:6533/2    0 GGTCTCCGGACACACGAAGGCACGGCTCTCCGAGAAGCGGAGGATGTACTCGACCTCACGGCTGC
>850:2:1:1948:15431/1   0 ACCGCTTACTCGATGATGGAGCAAGGCAGAATCGACATGATTCTGAGCTCGCGTCCCGAAGATCGACGCGCGG
>850:2:1:1949:19998/1   0 AATTCAAAGTAGGCATTTTTGTTTTTGTAGGGTTGGCGATGTTAGGCGCGCTGGTCGTGCAATTC
>850:2:1:1950:4213/2    0 CCAACCGGGCCCTGGTCCTGCACGCCAACCTGTCCCCGCTGGTGG
>850:2:1:1950:1388/1    0 CAGCCGCAATGTTGGCATTCTTCAGCAGTTCGAGCGCCACAAAGCGGTCATTGTCTGAGGCTTCTGGG
Too much data – what can we do? Reduce the size of the data (either with an approximate or an exact approach) Divide & conquer: subdivide the problem. For exact data reduction or subdivision, need to grok the entire assembly graph structure. …but that is why assembly scales poorly in the first place.
Two exact data reduction techniques: Eliminate reads that do not connect to many other reads. Group reads by connectivity into different partitions of the entire graph. For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
Eliminating unconnected reads “Graphsize filtering”
Subdividing reads by connection “Partitioning”
Two exact data reduction techniques: Eliminate reads that do not connect to many other reads (“graphsize filtering”). Group reads by connectivity into different partitions of the entire graph (“partitioning”). For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
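A toy version of both techniques, as a sketch only. It assumes two reads are "connected" when they share at least one exact k-mer, uses union-find to group them, and stands in for graphsize filtering with a simple minimum-reads-per-partition cutoff. All names (`partition_reads`, `drop_small`, `min_reads`) are hypothetical and this is not the khmer implementation.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def partition_reads(reads, k):
    """Group read indices into partitions of reads linked by shared k-mers."""
    parent = list(range(len(reads)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in first_seen:
                parent[find(idx)] = find(first_seen[km])
            else:
                first_seen[km] = idx

    groups = defaultdict(list)
    for idx in range(len(reads)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


def drop_small(partitions, min_reads):
    """Stand-in for graphsize filtering: discard poorly connected partitions."""
    return [p for p in partitions if len(p) >= min_reads]


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "TTTTTT", "TTTTTA", "GGGGCC"]
    parts = partition_reads(reads, k=4)
    print(parts)
    print(drop_small(parts, min_reads=2))
```

Each surviving partition can then be handed to the assembler on its own, which is what makes the problem subdivisible.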
Engineering overview Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure; With this, we can store graphs efficiently in memory, ~1-2 bytes/(unique) k-mer for arbitrary k. Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes). For details see source code (github.com/ctb/khmer), or online webinar: http://oreillynet.com/pub/e/1784
Store graph nodes in Bloom filter Graph traversal is done in full k-mer space; Presence/absence of individual nodes is kept in Bloom filter data structure (hash tables w/o collision tracking).
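The idea of keeping only node presence/absence in a Bloom filter and doing traversal in k-mer space can be sketched as follows. This is a generic textbook Bloom filter written for illustration, not the khmer data structure; the class name, the SHA-256 based hashing, and the sizes used are all assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; membership tests can give false positives,
    never false negatives."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def neighbors(kmer, bloom):
    """Edges are implicit: try all 8 possible adjacent k-mers and keep
    the ones the filter reports as present."""
    candidates = [kmer[1:] + b for b in "ACGT"] + [b + kmer[:-1] for b in "ACGT"]
    return [c for c in candidates if c in bloom]


if __name__ == "__main__":
    bloom = BloomFilter(size_bits=1 << 16, num_hashes=4)
    read, k = "ACGTAC", 4
    for i in range(len(read) - k + 1):
        bloom.add(read[i:i + k])
    print(neighbors("CGTA", bloom))
```

Because only bits are stored, memory per k-mer is independent of k; the price is that a false positive can occasionally add a node or edge that is not in the data.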
Practical application
Enables:
  - graph trimming (exact removal)
  - partitioning (exact subdivision)
  - abundance filtering
… all for K <= 64, for 200+ gb sequence collections.
All results (except for comparison) obtained using a single Amazon EC2 4xlarge node, 68 GB of RAM / 8 cores.
Similar running times to using Velvet alone.
We pre-filter data for assembly:
Does removing small graphs work?
Small data set (35m reads / 3.4 gb rhizosphere soil sample)
Filtered at k=32, assembled at k=33 with ABySS
  N contigs   Total bp   Largest contig   Data set
  130         223,341    61,766           Unfiltered (35m)
  130         223,341    61,766           Filtered (2m reads)
YES.
Does partitioning into disconnected graphs work?
Partitioned same data set (35m reads / 3.5 gb) into 45k partitions containing > 10 reads; assembled partitions separately (k0=32, k=33).
  N contigs   Total bp   Largest contig   Data set
  130         223,341    61,766           Unfiltered (35m)
  130         223,341    61,766           Sum partitions
YES.
Data reduction for assembly / practical details
Reduction performed on machine with 16 gb of RAM.
Removing poorly connected reads: 35m -> 2m reads.
  - Memory required reduced from 40 gb to 2 gb;
  - Time reduced from 4 hrs to 20 minutes.
Partitioning reads into disconnected groups:
  - Biggest group is 300k reads
  - Memory required reduced from 40 gb to 500 mb;
  - Time reduced from 4 hrs to < 5 minutes/group.
Does it work on bigger data sets?
Iowa continuous corn GA2 partitions (218.5 m reads):
  P1: 204,582,365 reads
  P2: 3583 reads
  P3: 2917 reads
  P4: 2463 reads
  P5: 2435 reads
  P6: 2316 reads
  …
35 m read data set partition sizes:
  P1: 277,043 reads
  P2: 5776 reads
  P3: 4444 reads
  P4: 3513 reads
  P5: 2528 reads
  P6: 2397 reads
  …
Problem: big data sets have one big partition!? Too big to handle on EC2. Assembles with low coverage. Contains 2.5 bn unique k-mers (~500 microbial genomes), at ~3-5x coverage. As we sequence more deeply, the “lump” becomes a bigger percentage of reads => trouble! Both for our approach, and possibly for assembly in general (because it assembles more poorly than it should, for given coverage/size).
Why this lump? Real biological connectivity (rRNA, conserved genes, etc.) Bug in our software Sequencing artifact or error
Why this lump?
Real biological connectivity? Probably not.
  - Increasing K from 32 to ~64 didn’t break up the lump: not biological.
Bug in our software? Probably not.
Sequencing artifact or error? YES.
  - (Note, we do filter & quality trim all sequences already)
“Good” vs “bad” assembly graph Low density High density
Non-biological levels of local graph connectivity:
Higher local graph density correlates with position in read
Higher local graph density correlates with position in read ARTIFACT
Trimming reads Trim at high “soddd”, sum of degree degree distribution: From each k-mer in each read, walk two k-mers in all directions in the graph; If more than 3 k-mers can be found at exactly two steps, trim remainder of sequence. Overly stringent; actually trimming (k-1) connectivity graph by degree.
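The trimming rule on this slide can be sketched roughly as below. This is an interpretation, not the actual implementation: it assumes "exactly two steps" means graph distance 2 with either-direction neighbors, assumes the read is cut just before the first offending k-mer, and stores k-mers in a plain Python set rather than a Bloom filter. The names `two_steps_away`, `trim_read`, and `max_two_step` are made up.

```python
def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(kmer, kmer_set):
    """K-mers in the set that overlap kmer by k-1 bases, in either direction."""
    candidates = [kmer[1:] + b for b in "ACGT"] + [b + kmer[:-1] for b in "ACGT"]
    return {c for c in candidates if c in kmer_set}


def two_steps_away(kmer, kmer_set):
    """K-mers reachable in exactly two steps (not the start, not one step away)."""
    one = neighbors(kmer, kmer_set)
    two = set()
    for n in one:
        two |= neighbors(n, kmer_set)
    return two - one - {kmer}


def trim_read(read, kmer_set, k, max_two_step=3):
    """Cut the read before the first k-mer whose two-step neighborhood
    holds more than max_two_step k-mers; otherwise return it unchanged."""
    for i, km in enumerate(kmers(read, k)):
        if len(two_steps_away(km, kmer_set)) > max_two_step:
            return read[:i + k - 1]
    return read


if __name__ == "__main__":
    read, k = "ACGTTGCA", 4
    clean = set(kmers(read, k))
    # Hypothetical artifact: four extra k-mers all hanging off the read's last k-mer.
    tangled = clean | {"GCAA", "GCAC", "GCAG", "GCAT"}
    print(trim_read(read, clean, k))
    print(trim_read(read, tangled, k))
```

On a simple linear path each k-mer sees at most two k-mers at distance two, so the threshold of 3 only fires where the local graph fans out.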
Trimmed read examples
>895:5:1:1986:16019/2
TGAGCACTACCTGCGGGCCGGGGACCGGGTCAGCCTGCT
CGACCTGGGCCAACCGATGCGCC
>895:5:1:1995:6913/1
TTGCGCGCCATGAAGCGGTTAACGCGCTCGGTCCATAGC
GCGATG
>895:5:1:1995:6913/2
GTTCATCGCGCTATGGACCGAGCGCGTTAACCGCTTCAT
GGCGCGCAAAGATCGGAAGAGCGTCGTGTAG
Preferential attachment due to bias Any sufficiently large collection of connected reads will have one or more reads containing an artifact; These artifacts will then connect that group of reads to all other groups possessing artifacts; …and all high-coverage contigs will amalgamate into a single graph.
Artifacts from sequencing falsely connect graphs
Groxel view of knot-like region / Arend Hintze
Density trimming breaks up the lump:
Old P1, soddd trimmed (204.6 m reads -> 179 m):
  P1: 23,444,332 reads
  P2: 60,703 reads
  P3: 48,818 reads
  P4: 39,755 reads
  P5: 34,902 reads
  P6: 33,284 reads
  …
Untrimmed partitioning (218.5 m reads):
  P1: 204,582,365 reads
  P2: 3583 reads
  P3: 2917 reads
  P4: 2463 reads
  P5: 2435 reads
  P6: 2316 reads
  …
What does density trimming do to assembly?
204 m reads in lump: assembles into 52,610 contigs; total 73.5 MB
180 m reads in trimmed lump: assembles into 57,135 contigs; total 83.6 MB
(all contigs > 1kb)
Filtered/partitioned @k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
Wait, what? Yes, trimming these “knot-like” sequences improves the overall assembly! We remove 25.6 m reads and gain 10.1 MB!? Trend is same for ABySS, another k-mer graph assembler, as well.
So what’s going on? Current assemblers are bad at dealing with certain graph structures (“knots”). If we can untangle knots for them, that’s good, maybe? Or, by eliminating locations where reads from differently abundant contigs connect, repeat resolution improves? Happens with other k-mer graph assemblers (ABySS), and with at least one other (non-metagenomic) data set.
OK, let’s assemble!
Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to:
  - 148,053 contigs,
  - in 220 MB;
  - max length 20322
  - max coverage ~10x
…all done on Amazon EC2, ~ 1 week for under $500.
Filtered/partitioned @k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
Full Iowa corn / mapping stats
1,806,800,000 QC/trimmed reads (1.8 bn)
204,900,000 reads map to some contig (11%)
37,244,000 reads map to contigs > 1kb (2.1%)
> 1 kb contig is a stringent criterion!
Compare: 80% of MetaHIT reads to > 500 bp; 65%+ of rumen reads to > 1kb
Success, tentatively. We are still evaluating assembly and assembly parameters; should be possible to improve in every way.  (~10 hrs to redo entire assembly, once partitioned.) The main engineering point is that we can actually run this entire pipeline on a relatively small machine (8 core/68 GB RAM) We can do dozens of these in parallel on Amazon rental hardware. And, from our preliminary results, we get ~ equivalent assembly results as if we were scaling our hardware.
Conclusions Engineering: can assemble large data sets. Scaling: can assemble on rented machines. Science: can optimize assembly for individual partitions. Science: retain low-abundance.
Caveats Quality of assembly?? Illumina sequencing bias/error issue needs to be explored. Scaffolding with Velvet causes systematic problems Regardless of Illumina-specific issue, it’s good to have tools/approaches to look at structure of large graphs.
Future thoughts Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers.  So it is a good first step to try, even if it doesn’t reduce the problem significantly. Divide & conquer approach should allow more sophisticated (compute intensive) graph analysis approaches in the future. This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence. Can k-mer filtering work for non-de Bruijn graph assemblers? (SGA, ALLPATHS-LG, …) mRNAseq and genome artifact filtering?
Better artifact filtering?
Kmer-> GTCGTAGTTCAGTTGGTTAGAACGCCGGCCTG
747:3:13:7042:16004/1   GATATCTGCAATATCCCGTTCGAATGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGGAGGCC
747:3:14:10559:9771/1   GAAATTCCGGTTTGATGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCG
747:3:14:17232:4498/1   CAAATTTGAGATCTGAGATCCCAGGGGTTTGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGG
747:3:15:7871:10206/1   TTTGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCGAGTCCCGTCGG
747:3:16:17865:15895/2  TCAGGAGACGCCAGGGCGGTCTGAGTTCTTCAGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGG
747:3:27:9549:13966/1   GGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCGAGTCCCGTCGGCTCCGCC
747:3:30:10672:3136/1   GCGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGGAGGTCGCGAGTTCGAGTCTCGTCGGCCC
All paths lead to the same k-mers Histogram of k-mer traversal counts. Number of times k-mer is traversed
Estimating sequencing return on investment
To reach ~rumen depth of sampling of top abundance organisms, would need ~1-2 TB
[Plot labels: 10x Sequencing Coverage (1900 GB); 5x Sequencing Coverage (931 GB); <1% Novel Sequence]
Argonne National Laboratory Institute for Genomic and Systems Biology
Earth Microbiome Project (www.earthmicrobiome.org)
Goal – to systematically approach the problem of characterizing microbial life on earth
Paradigm shift to analyzing communities from a microbe’s perspective
Strategy:
  - Explore microbes in environmental parameter space
  - Design ‘ideal’ strategy to interrogate these biomes
  - Acquire samples and sequence broad and deep: DNA, mRNA and rRNA
  - Define microbial community structure and the protein universe
Gilbert et al., 2010a,b Standards in Genomic Science, open access
Challenges
  - 2.4 Quadrillion Base Pairs (2.4 Petabases) = 8000 HiSeq 2000 runs.
  - Global Environmental Sample Database (GESI): identification and selection of 200,000 environmental samples: soil, air, marine and freshwater, host-associated, etc.
  - The standardization of sampling, sample prep and sample processing, cataloging and sample metadata – Genomic Standards Consortium can help!
  - The coordination of thousands of “volunteer” scientists for site characterization, sample collecting and processing
Acknowledgements:
The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze
Collaborators: Jim Tiedje (Il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU)
Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.
U Florida / Gainesville  talk, apr 13 2011

Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 

Último (20)

Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 

U Florida / Gainesville talk, apr 13 2011

  • 1. Divide and conquer applied to metagenomic DNA C. Titus Brown ctb@msu.edu CSE / MMG, Michigan State University
  • 2. A brief intro to shotgun assembly It was the best of times, it was the wor , it was the worst of times, it was the isdom, it was the age of foolishness mes, it was the age of wisdom, it was th It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness …but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.
  • 3. Assemble based on word overlaps: "the quick brown fox jumped" + "jumped over the lazy dog" = "the quick brown fox jumped over the lazy dog". Repeats do cause problems: my chemical romance: nanana nanana, batman!
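As a minimal sketch of the word-overlap idea on this slide (not the method of any particular assembler), two fragments can be joined at their longest suffix/prefix overlap. The function names `overlap_len` and `merge` and the `min_len` parameter are hypothetical, chosen only for illustration.

```python
def overlap_len(a, b, min_len=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge(a, b, min_len=1):
    """Join `a` and `b` at their longest overlap; None if they do not overlap."""
    n = overlap_len(a, b, min_len)
    return a + b[n:] if n else None


left = "the quick brown fox jumped"
right = "jumped over the lazy dog"
merged = merge(left, right, min_len=3)
```

The repeat problem shows up immediately: "nanana" overlaps itself at 2, 4, and 6 characters, so the overlap alone cannot say how many copies of "na" the original text contained.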
  • 4. Whole genome shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
  • 5. How does assembly scale? Our assembly approach scales with the amount of genomic novelty present in the sample. For “sane” problems (microbes, human genome, etc.) this isn’t too bad, although challenging. For metagenomes, with millions of different species at different abundances, this is an intractable problem (so far)…
  • 6. Iowa Native Prairie Great Plains Grand Challenge –Sampling sites Wisconsin Native prairie (Goose Pond, Audubon) Long term cultivation (corn) Switchgrass rotation (previously corn) Restored prairie (from 1998) Iowa Native prairie (Morris prairie) Long term cultivation (corn) Kansas Native prairie (Konza prairie) Long term cultivation (corn) Switchgrass (Wisconsin) Iowa >100 yr tilled
  • 7. Sampling strategy per site. Soil cores: 1 inch diameter, 4 inches deep. Total: 8 reference metagenomes + 64 spatially separated cores (pyrotag sequencing). [Figure: sampling layout around a reference soil core, with spacings labeled 1 cm, 1 m, and 10 m]
  • 8. Community composition Soil Metagenome Illumina shotgun sequencing 454 Titanium Pyrotag sequencing 454 Titanium Shotgun sequencing
  • 9. What kinds of questions? What genes are present? What species are present? What are those species doing, physiologically speaking? How does “function” change with cultivation, CO2, fertilizer types, crop cycles, etc? We are at a “pre-question” stage, unfortunately…
  • 10.
  • 11. The basic problem. Lots of metagenomic sequence data (200 GB Illumina for < $20k?) Assembly, especially metagenome assembly, scales poorly (due to high diversity). Standard assembly techniques don’t work well with sequences from multiple abundance genomes. Many people don’t have the necessary computational resources to assemble (~1 TB of RAM or more, if at all).
  • 12. We can’t just throw more hardware at the problem… Lincoln Stein
  • 13. Hat tip to Narayan Desai / ANL We don’t have enough resources or people to analyze data.
  • 14. Data generation vs data analysis It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week. (Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.) …x1000 sequencers Many useful analyses do not scale linearly in RAM or CPU with the amount of data.
  • 15. The challenge: Massive (and increasing) data generation capacity, operating at a boutique level, with algorithms that are wholly incapable of scaling to the data volume. Note: cloud computing isn’t a solution to a sustained scaling problem!! (See: Moore’s Law slide)
  • 16. Life’s too short to tackle the easy problems – come to academia! Easy stuff like Google Search Awesomeness
  • 17. Assembly of shotgun sequence It was the best of times, it was the wor , it was the worst of times, it was the isdom, it was the age of foolishness mes, it was the age of wisdom, it was th It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness …but for 2 bn+ fragments. Not subdivisible; not easy to distribute; memory intensive.
  • 18. Assemble based on word overlaps: "the quick brown fox jumped" + "jumped over the lazy dog" = "the quick brown fox jumped over the lazy dog". Repeats do cause problems: my chemical romance: nanana nanana, batman!
  • 19. Whole genome shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
  • 20. K-mer graphs - overlaps J.R. Miller et al. / Genomics (2010)
  • 21. K-mer graphs - branching For decisions about which paths etc, biology-based heuristics come into play as well.
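To make the k-mer graph slides concrete, here is a small sketch, under simplifying assumptions, of building a k-mer graph from reads and listing the nodes where a path branches. It ignores reverse complements and sequencing errors, and the helper names (`kmers`, `build_kmer_graph`, `branching_nodes`) are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of `seq`, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
    return graph


def branching_nodes(graph):
    """K-mers with more than one successor, i.e. where a path decision is needed."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


graph = build_kmer_graph(["ACGTA", "CGTC"], k=3)
branches = branching_nodes(graph)
```

In this toy input, CGT is followed by GTA in one read and by GTC in the other, so it is the single branch point at which an assembler would need a heuristic to choose a path.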
  • 22.
  • 23. Billions and billions of … >850:2:1:1943:15232/1 0 CCTGCCTGTGGAGCAGCCCACGCAGTTCGAGCTGATCATCAACCTCAAGACGGCCCAAGCCCTTGGCATCACGATT >850:2:1:1943:15232/2 0 ACACCATTTAATCTTAGCCATAAAAGTTGTATAAGCATCAACGTTTTGTTTGTCTCAAAAAACGATTTTTTTTTTG >850:2:1:1943:19543/1 0 ACTGTAGGTTTCTGGCTGCGTCCGACGATAGCAGCCCGCTCTGCCGACATTGTCA >850:2:1:1945:16822/2 0 AGTCGACAGATCGACCTGAAGGAGGTGCCGGGAATTGAAGTCATCCAGGGCGCCGAGGAGAACTGATCGG >850:2:1:1946:10202/2 0 AGCTTTTTCGCGCGCGTGAAAAAGCTTTGTCGATTTCTGGGTTTCGGCCTTCTCACAGTCACCGCCGAGGGCCGGG >850:2:1:1947:6533/2 0 GGTCTCCGGACACACGAAGGCACGGCTCTCCGAGAAGCGGAGGATGTACTCGACCTCACGGCTGC >850:2:1:1948:15431/1 0 ACCGCTTACTCGATGATGGAGCAAGGCAGAATCGACATGATTCTGAGCTCGCGTCCCGAAGATCGACGCGCGG >850:2:1:1949:19998/1 0 AATTCAAAGTAGGCATTTTTGTTTTTGTAGGGTTGGCGATGTTAGGCGCGCTGGTCGTGCAATTC >850:2:1:1950:4213/2 0 CCAACCGGGCCCTGGTCCTGCACGCCAACCTGTCCCCGCTGGTGG >850:2:1:1950:1388/1 0 CAGCCGCAATGTTGGCATTCTTCAGCAGTTCGAGCGCCACAAAGCGGTCATTGTCTGAGGCTTCTGGG
  • 24. Too much data – what can we do? Reduce the size of the data (either with an approximate or an exact approach) Divide & conquer: subdivide the problem. For exact data reduction or subdivision, need to grok the entire assembly graph structure. …but that is why assembly scales poorly in the first place.
  • 25.
  • 26.
  • 27.
  • 28. Two exact data reduction techniques: Eliminate reads that do not connect to many other reads. Group reads by connectivity into different partitions of the entire graph. For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
  • 29. Eliminating unconnected reads “Graphsize filtering”
  • 30. Subdividing reads by connection “Partitioning”
  • 31. Two exact data reduction techniques: Eliminate reads that do not connect to many other reads (“graphsize filtering”). Group reads by connectivity into different partitions of the entire graph (“partitioning”). For k-mer graph assemblers like Velvet and ABySS, these are exact solutions.
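The two techniques can be sketched as follows. This is an illustration only, not the khmer implementation: it assumes two reads are connected when they share at least one k-mer, keeps an exact in-memory dictionary rather than a probabilistic structure, ignores reverse complements, and measures partition size in reads. All function names are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of `seq`, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def partition_reads(reads, k):
    """Group read indices that are connected through shared k-mers (union-find)."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in first_seen:
                a, b = find(idx), find(first_seen[km])
                if a != b:
                    parent[a] = b
            else:
                first_seen[km] = idx

    groups = defaultdict(list)
    for idx in range(len(reads)):
        groups[find(idx)].append(idx)
    return sorted(groups.values(), key=len, reverse=True)


def drop_small_partitions(partitions, min_reads):
    """Stand-in for graphsize filtering: discard poorly connected groups."""
    return [p for p in partitions if len(p) >= min_reads]


reads = ["ACGTAC", "GTACGG", "TTTTTT", "TTTTTA", "GGGGCC"]
partitions = partition_reads(reads, k=4)
kept = drop_small_partitions(partitions, min_reads=2)
```

Each surviving partition can then be handed to an assembler on its own, which is what makes the divide-and-conquer step possible.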
  • 32. Engineering overview Built a k-mer graph representation based on Bloom filters, a simple probabilistic data structure; With this, we can store graphs efficiently in memory, ~1-2 bytes/(unique) k-mer for arbitrary k. Also implemented efficient global traversal of extremely large graphs (5-20 bn nodes). For details see source code (github.com/ctb/khmer), or online webinar: http://oreillynet.com/pub/e/1784
  • 33. Store graph nodes in Bloom filter Graph traversal is done in full k-mer space; Presence/absence of individual nodes is kept in Bloom filter data structure (hash tables w/o collision tracking).
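A rough sketch of the idea on these two slides: k-mers are recorded in a Bloom filter, and the graph is traversed by generating all eight possible neighbors of a k-mer and asking the filter which ones exist. This is a textbook Bloom filter written for illustration, not khmer's data structure; the class, the hashing scheme, and the sizes are assumptions.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # One salted hash per hash function; each yields one bit position.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def load_reads(reads, k, bloom):
    """Record every k-mer of every read as a graph node."""
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])


def neighbors(kmer, bloom):
    """Try all 8 candidate neighbors in k-mer space; keep those the filter reports."""
    found = []
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate in bloom:
                found.append(candidate)
    return found


bloom = BloomFilter(n_bits=1 << 16, n_hashes=4)
load_reads(["ACGTAC"], k=4, bloom=bloom)
next_nodes = neighbors("ACGT", bloom)
```

Only the bit array is stored, so memory does not depend on k. The cost is that a false positive can report a neighbor that was never in the data, which is why the filter has to be sized for the number of unique k-mers expected.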
  • 34. Practical application Enables: graph trimming (exact removal); partitioning (exact subdivision); abundance filtering; … all for K <= 64, for 200+ GB sequence collections. All results (except for comparison) obtained using a single Amazon EC2 4xlarge node, 68 GB of RAM / 8 cores. Similar running times to using Velvet alone.
  • 35. We pre-filter data for assembly:
  • 36. Does removing small graphs work? Small data set (35m reads / 3.4 gb rhizosphere soil sample). Filtered at k=32, assembled at k=33 with ABySS. Unfiltered (35m reads): 130 contigs, 223,341 total bp, largest contig 61,766. Filtered (2m reads): 130 contigs, 223,341 total bp, largest contig 61,766. YES.
  • 37. Does partitioning into disconnected graphs work? Partitioned same data set (35m reads / 3.5 gb) into 45k partitions containing > 10 reads; assembled partitions separately (k0=32, k=33). Unfiltered (35m reads): 130 contigs, 223,341 total bp, largest contig 61,766. Sum of partitions: 130 contigs, 223,341 total bp, largest contig 61,766. YES.
  • 38. Data reduction for assembly / practical details Reduction performed on machine with 16 gb of RAM. Removing poorly connected reads: 35m -> 2m reads. - Memory required reduced from 40 gb to 2 gb; - Time reduced from 4 hrs to 20 minutes. Partitioning reads into disconnected groups: - Biggest group is 300k reads - Memory required reduced from 40 gb to 500 mb; - Time reduced from 4 hrs to < 5 minutes/group.
  • 39. Does it work on bigger data sets? Iowa continuous corn GA2 partitions (218.5 m reads): P1: 204,582,365 reads P2: 3583 reads P3: 2917 reads P4: 2463 reads P5: 2435 reads P6: 2316 reads … 35 m read data set partition sizes: P1: 277,043 reads P2: 5776 reads P3: 4444 reads P4: 3513 reads P5: 2528 reads P6: 2397 reads …
  • 40. Problem: big data sets have one big partition!? Too big to handle on EC2. Assembles with low coverage. Contains 2.5 bn unique k-mers (~500 microbial genomes), at ~3-5x coverage. As we sequence more deeply, the “lump” becomes a bigger percentage of reads => trouble! Both for our approach, and possibly for assembly in general (because it assembles more poorly than it should, for given coverage/size).
  • 41. Why this lump? Real biological connectivity (rRNA, conserved genes, etc.) Bug in our software Sequencing artifact or error
  • 42.
  • 43. “Good” vs “bad” assembly graph Low density High density
  • 44. Non-biological levels of local graph connectivity:
  • 45. Higher local graph density correlates with position in read
  • 46. Higher local graph density correlates with position in read ARTIFACT
  • 47. Trimming reads Trim at high “soddd”, sum of degree degree distribution: From each k-mer in each read, walk two k-mers in all directions in the graph; If more than 3 k-mers can be found at exactly two steps, trim remainder of sequence. Overly stringent; actually trimming (k-1) connectivity graph by degree.
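The trimming rule can be sketched as below. This is one reading of the slide, not the original code: "exactly two steps" is taken to mean k-mers whose shortest distance from the starting k-mer is two, neighbors are found by (k-1) overlap in either direction, a plain Python set stands in for the graph, and the read is cut just before the first offending k-mer. The names and the `max_two_step` parameter are hypothetical.

```python
def neighbors(kmer, kmer_set):
    """K-mers in `kmer_set` that overlap `kmer` by k-1 bases, in either direction."""
    out = set()
    for base in "ACGT":
        for cand in (kmer[1:] + base, base + kmer[:-1]):
            if cand in kmer_set and cand != kmer:
                out.add(cand)
    return out


def count_at_two_steps(kmer, kmer_set):
    """Number of distinct k-mers first reached after exactly two steps."""
    one_step = neighbors(kmer, kmer_set)
    two_step = set()
    for n in one_step:
        two_step |= neighbors(n, kmer_set)
    two_step -= one_step
    two_step.discard(kmer)
    return len(two_step)


def trim_read(read, k, kmer_set, max_two_step=3):
    """Cut the read before the first k-mer whose local graph is too dense."""
    for i in range(len(read) - k + 1):
        if count_at_two_steps(read[i:i + k], kmer_set) > max_two_step:
            return read[:i + k - 1]
    return read


# Toy graph: AAACCC fans out to two successors, each with two successors.
dense = {"AAACCC", "AACCCG", "AACCCT", "ACCCGA", "ACCCGT", "ACCCTA", "ACCCTG"}
trimmed = trim_read("GGAAACCCG", 6, dense)
```

On an unbranched path each k-mer has at most two k-mers at distance two (one in each direction), so a count above three signals branching on more than one side of the k-mer.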
  • 48. Trimmed read examples >895:5:1:1986:16019/2 TGAGCACTACCTGCGGGCCGGGGACCGGGTCAGCCTGCT CGACCTGGGCCAACCGATGCGCC >895:5:1:1995:6913/1 TTGCGCGCCATGAAGCGGTTAACGCGCTCGGTCCATAGC GCGATG >895:5:1:1995:6913/2 GTTCATCGCGCTATGGACCGAGCGCGTTAACCGCTTCAT GGCGCGCAAAGATCGGAAGAGCGTCGTGTAG
  • 49. Preferential attachment due to bias Any sufficiently large collection of connected reads will have one or more reads containing an artifact; These artifacts will then connect that group of reads to all other groups possessing artifacts; …and all high-coverage contigs will amalgamate into a single graph.
  • 50. Artifacts from sequencing falsely connect graphs
  • 51. Preferential attachment due to bias Any sufficiently large collection of connected reads will have one or more reads containing an artifact; These artifacts will then connect that group of reads to all other groups possessing artifacts; …and all high-coverage contigs will amalgamate into a single graph.
  • 52. Groxel view of knot-like region / Arend Hintze
  • 53. Density trimming breaks up the lump: Old P1, soddd trimmed (204.6 m reads -> 179 m): P1: 23,444,332 reads P2: 60,703 reads P3: 48,818 reads P4: 39,755 reads P5: 34,902 reads P6: 33,284 reads … Untrimmed partitioning (218.5 m reads): P1: 204,582,365 reads P2: 3583 reads P3: 2917 reads P4: 2463 reads P5: 2435 reads P6: 2316 reads …
  • 54. What does density trimming do to assembly? 204 m reads in lump: assembles into 52,610 contigs; total 73.5 MB 180 m reads in trimmed lump: assembles into 57,135 contigs; total 83.6 MB (all contigs > 1kb) Filtered/partitioned @k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
  • 55. Wait, what? Yes, trimming these “knot-like” sequences improves the overall assembly! We remove 25.6 m reads and gain 10.1 MB!? Trend is same for ABySS, another k-mer graph assembler, as well.
  • 56. So what’s going on? Current assemblers are bad at dealing with certain graph structures (“knots”). If we can untangle knots for them, that’s good, maybe? Or, by eliminating locations where reads from differently abundant contigs connect, repeat resolution improves? Happens with other k-mer graph assemblers (ABySS), and with at least one other (non-metagenomic) data set.
  • 57. OK, let’s assemble! Iowa corn (HiSeq + GA2): 219.11 Gb of sequence assembles to: 148,053 contigs, in 220 MB; max length 20322 max coverage ~10x …all done on Amazon EC2, ~ 1 week for under $500. Filtered/partitioned @k=32, assembled @ k=33, expcov=auto, cov_cutoff=0
  • 58. Full Iowa corn / mapping stats 1,806,800,000 QC/trimmed reads (1.8 bn) 204,900,000 reads map to some contig (11%) 37,244,000 reads map to contigs > 1kb (2.1%) > 1 kb contig is a stringent criterion! Compare: 80% of MetaHIT reads to > 500 bp; 65%+ of rumen reads to > 1kb
  • 59. Success, tentatively. We are still evaluating assembly and assembly parameters; should be possible to improve in every way. (~10 hrs to redo entire assembly, once partitioned.) The main engineering point is that we can actually run this entire pipeline on a relatively small machine (8 core/68 GB RAM) We can do dozens of these in parallel on Amazon rental hardware. And, from our preliminary results, we get ~ equivalent assembly results as if we were scaling our hardware.
  • 60. Conclusions Engineering: can assemble large data sets. Scaling: can assemble on rented machines. Science: can optimize assembly for individual partitions. Science: retain low-abundance.
  • 61. Conclusions Engineering: can assemble large data sets. Scaling: can assemble on rented machines. Science: can optimize assembly for individual partitions. Science: retain low-abundance.
  • 62. Caveats Quality of assembly?? Illumina sequencing bias/error issue needs to be explored. Scaffolding with Velvet causes systematic problems Regardless of Illumina-specific issue, it’s good to have tools/approaches to look at structure of large graphs.
  • 63. Future thoughts Our pre-filtering technique always has lower memory requirements than Velvet or other assemblers. So it is a good first step to try, even if it doesn’t reduce the problem significantly. Divide & conquer approach should allow more sophisticated (compute intensive) graph analysis approaches in the future. This approach enables (in theory) assembly of arbitrarily large amounts of metagenomic DNA sequence. Can k-mer filtering work for non-de Bruijn graph assemblers? (SGA, ALLPATHS-LG, …) mRNAseq and genome artifact filtering?
  • 64. Better artifact filtering? Kmer-> GTCGTAGTTCAGTTGGTTAGAACGCCGGCCTG 747:3:13:7042:16004/1 GATATCTGCAATATCCCGTTCGAATGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGGAGGCC 747:3:14:10559:9771/1 GAAATTCCGGTTTGATGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCG 747:3:14:17232:4498/1 CAAATTTGAGATCTGAGATCCCAGGGGTTTGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGG 747:3:15:7871:10206/1 TTTGCGGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCGAGTCCCGTCGG 747:3:16:17865:15895/2 TCAGGAGACGCCAGGGCGGTCTGAGTTCTTCAGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGG 747:3:27:9549:13966/1 GGAGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGTCGGAGGTCGCGGGTTCGAGTCCCGTCGGCTCCGCC 747:3:30:10672:3136/1 GCGGGGTCGTAGTTCAGTTGGTTAGAACGCCGGCCTGTCACGCCGGAGGTCGCGAGTTCGAGTCTCGTCGGCCC
  • 65. All paths lead to the same k-mers Histogram of k-mer traversal counts. Number of times k-mer is traversed
  • 66. Estimating sequencing return on investment To reach ~rumen depth of sampling of top abundance organisms, would need ~1-2 TB 10x Sequencing Coverage (1900 GB) 5x Sequencing Coverage (931 GB) <1% Novel Sequence
  • 67. Argonne National Laboratory Institute for Genomic and Systems Biology
  • 68. Earth Microbiome Project www.earthmicrobiome.org Goal – to systematically approach the problem of characterizing microbial life on earth Paradigm shift to analyzing communities from a microbe’s perspective: Strategy: Explore microbes in environmental parameter space Design ‘ideal’ strategy to interrogate these biomes Acquire samples and sequence broad and deep both DNA, mRNA and rRNA Define microbial community structure and the protein universe Gilbert et al., 2010a,b Standards in Genomic Sciences, open access
  • 69. Challenges 2.4 Quadrillion Base Pairs (2.4 Petabases) = 8000 HiSEQ2000 runs. Global Environmental Sample Database (GESI): identification and selection of 200,000 environmental samples, soil, air, marine and freshwater, host-associated, etc. The standardization of sampling, sample prep and sample processing, cataloging and sample metadata – Genomic Standards Consortium can help! The coordination of thousands of “volunteer” scientists for site characterization, sample collecting and processing Earth Microbiome Project www.earthmicrobiome.org
  • 70. Acknowledgements: The k-mer gang: Adina Howe, Jason Pell, Rosangela Canino-Koning, Qingpeng Zhang, Arend Hintze. Collaborators: Jim Tiedje (Il padrino); Janet Jansson, Rachel Mackelprang, Regina Lamendella, Susannah Tringe, and many others (JGI); Charles Ofria (MSU). Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.

Editor's notes

  1. Expand on this last point
  2. Bridge between this kind of view and k-mers
  3. Constant memory
  4. @@
  5. @@
  6. @@ k up to 64 graph
  7. Expand; talk about density, circumference
  8. @@ redo
  9. @@ redo
  10. Details!
  11. 2x coverage vs 10x coverage? Add “reads”
  12. For Iowa corn soil sample, doubling depth from 931 GB (~5x) to 1.9 TB (~10x) should yield < 1% novel sequence. For Iowa prairie soil, doubling from 375 GB to 750 GB reaches about the same point.
  13. Sites of imaginative potential – iconic locations – Iconic Sampling – Life is strange – microbes are stranger – how do we capitalize on this?
  14. Funding: MSU startup, USDA NIFA, DOE, BEACON, Amazon.