Extracting genomes from
  community sequencing
  “What works, what will work, and what
              needs work”

             C. Titus Brown
             ctb@msu.edu
Computer Science; Microbiology; BEACON
       Michigan State University
Warnings

 This talk contains forward looking statements.
    These forward-looking statements can be
identified by terminology such as “will”, “expects”,
                  and “believes”.
                           -- Safe Harbor provisions of the
                       U.S. Private Securities Litigation Act


  “Making predictions is difficult, especially if
          they’re about the future.”
                                  -- Attributed to Niels Bohr
Thanks for the invitation!
 So, Linda Mansfield and I were talking one day…
   Her: “It’d be great to be able to look at communities
    with sequencing.”
   Me: “Oh, yeah, we can do that now.”


 My overall interest is in good hypothesis
 generation from computational data, with a focus
 on sequence data.

 For the past three years, I have been working on
 this specifically for soil metagenomics (and
 mRNAseq, too).
Deep connection between
   human gut and soil
Soil is full of uncultured microbes
  Estimates of microbial diversity in agricultural soil ~1m species/gram




                                                         Randy Jackson
SAMPLING LOCATIONS
Soil contains thousands to millions of species
                                            (“Collector’s curves” of ~species)

[Figure: collector's curves, number of OTUs (0 to 2000) vs. number of sequences (100 to 8100), for Iowa Corn, Iowa Native Prairie, Kansas Corn, Kansas Native Prairie, Wisconsin Corn, Wisconsin Native Prairie, Wisconsin Restored Prairie, and Wisconsin Switchgrass]
Ecology => function emphasis
 What’s there?
 Is it really that complex a community?
 How “deep” do we need to sequence to sample
    thoroughly and systematically?
   How is ecological complexity created &
    maintained?
   How does ecological complexity respond to
    perturbation?
   What organisms and gene functions are
    present, including non-canonical carbon and
    nitrogen cycling pathways?
   What kind of organismal and functional overlap is
The human gut is a diverse place




                      Dethlefsen et al., 2008
Ecology vs function in human gut
      We can observe recovery of
   diversity after Cipro treatment; but
      what is driving recovery at a
             functional level?




                                   Dethlefsen and Relman, 2011
Culture independent methods
 Observation that 99% of microbes cannot easily
  be cultured in the lab. (“The great plate count
  anomaly”)
 While this is less true for host-associated
  microbes, culture independent methods are still
  important:
     Syntrophic relationships
     Niche-specificity or unknown physiology
     Dormant microbes
     Abundance within communities

  Single-cell sequencing & shotgun metagenomics are
        two common ways to investigate microbial
                       communities.
Shotgun metagenomics
 Collect samples;


 Extract DNA;


 Feed into sequencer;


 Computationally analyze.




                      Wikipedia: Environmental shotgun sequencing.p
Shotgun sequencing & assembly
  Randomly fragment & sequence from DNA;
       reassemble computationally.




                     UMD assembly primer (cbcb.umd.edu)
Shotgun sequencing & assembly
 Why assembly?
   Assumption free (no reference needed)
   Necessary for soil and marine; useful for host-associated?
   Assembly can serve as reference for transcriptome
    interpretation

 Fragment, sequence, computationally assemble.


 What kind of results do you get?
   Almost certainly chimerism between different strains; but still
    useful for gene content & operon structure.
   Specificity seems high, but sensitivity is dependent on
    sequencing depth.

 Because of sampling rate, Illumina is primary choice.
Shotgun metagenomics: good news
 Cheap and easy to generate vast whole
  metagenome/metatranscriptome shotgun data sets from
  essentially any community you can sample.

 Such data can be quite interesting!
   Low hanging fruit – correlation with diet, etc.
   Still early days for observation of “pan genome” and functional
    content.

 Potential to illuminate or inform:
   Dynamics and selective pressures of antibiotic
    resistance, virulence genes, and pathogenicity islands
   Phage and viral communities
   Community interactions.
Shotgun metagenomics: bad
news
 Computational techniques are still relatively immature
   Mapping to known genomes?
   Discovery of unknown genomes & strain variants?
   Sensitivity and specificity are hard to evaluate.
   Computational ecosystem is not that rich…


 Interpreting the data is still the bottleneck, of course.
   Vast majority of genes not usefully annotated.
   Reliance on specific reference databases, annotations.
   Tools for (e.g.) inferring community interactions from
    community dynamics & functional capacity are
    desperately needed.
The computational conundrum


              More data => better.

and

 More data => computationally more challenging.
1. Assembly depends on high
coverage
2. Big data sets require big machines
For even relatively small data sets, metagenomic
  assemblers scale poorly.

Memory usage ~ “real” variation + number of errors

Number of errors ~ size of data set

Size of data set == big!!

(Estimated 6 weeks x 3 TB of RAM to do 300gb soil
  sample, with a slightly modified conventional
  assembler.)
Our “Grand Challenge” dataset
Approach 1: Partitioning

Split reads into “bins”
 belonging to
 different source
 species.
Can do this based
 almost entirely on
 connectivity of
 sequences.
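
A minimal sketch of connectivity-based binning, under loud assumptions: reads are linked whenever they share an exact k-mer, and linked reads are merged with a union-find structure. The function names (`kmers`, `partition_reads`), the tiny `k`, and the toy reads are hypothetical and chosen for illustration; reverse complements and sequencing errors are ignored, and this is not the data structure used for the billions-of-nodes graphs discussed in the talk.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def partition_reads(reads, k=5):
    """Group reads into bins; reads joined by a chain of shared k-mers share a bin."""
    parent = list(range(len(reads)))  # union-find forest over read indices

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in first_seen:
                union(first_seen[km], idx)
            else:
                first_seen[km] = idx

    bins = defaultdict(list)
    for idx, read in enumerate(reads):
        bins[find(idx)].append(read)
    return list(bins.values())


# Toy input: two pairs of overlapping reads with no k-mers shared between pairs.
toy_reads = ["ACGTACGGA", "TACGGATTC", "GGGCCCTTTAAA", "CCTTTAAAGG"]
for number, members in enumerate(partition_reads(toy_reads, k=5), start=1):
    print(number, members)
```

Each resulting bin can then be assembled on its own, which is what makes the problem tractable on smaller machines.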
Technical challenges met (and defeated)
 Novel data structure properties elucidated via
 percolation theory analysis (Pell, Hintze, et al., in
 review, PNAS).

 Exhaustive in-memory traversal of graphs
 containing 5-15 billion nodes.

 Sequencing technology introduces false
 sequences in graph (Howe et al., in prep.)

 Only 20x improvement in assembly scaling.
(NOVEL)

Approach 2: Digital normalization


                         Suppose you have a
                      dilution factor of A (10) to
                      B (1). To get 10x of B you
                        need to get 100x of A!
                                Overkill!!

                       This 100x will consume
                      disk space and, because
                         of errors, memory.
Digital normalization discards
redundant reads prior to assembly.




      This removes reads and decreases data
   size, eliminates errors from removed reads, and
            normalizes coverage across loci.

  Discarded reads can be used after assembly for
               quantitative analysis.
A read’s median k-mer count is a
good estimator of “coverage”.
                           This gives us a
                           reference-free
                             measure of
                              coverage.
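
The idea can be sketched in a few lines, assuming an exact in-memory counter and a single pass over the reads: a read is kept only while the median count of its k-mers is still below a cutoff, and only kept reads update the counts. The names `normalize`, `k`, and `cutoff`, and the toy values used here, are hypothetical illustrations rather than the settings or the counting structure of any particular software package.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return the list of all substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=4, cutoff=3):
    """Keep a read only if its median k-mer count so far is below cutoff."""
    counts = Counter()  # exact k-mer counts; an assumption made for simplicity
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue  # read is shorter than k, so coverage cannot be estimated
        estimated_coverage = median(counts[km] for km in kms)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(kms)  # only retained reads contribute to the counts
    return kept


# Toy input: one locus sampled 10 times, another sampled twice.
toy_reads = ["ACGTTGCA"] * 10 + ["GGATCCTA"] * 2
kept_reads = normalize(toy_reads, k=4, cutoff=3)
print(len(toy_reads), "reads in,", len(kept_reads), "reads kept")
```

Because the median is taken over all k-mers in a read, a few erroneous k-mers with low counts do not change the coverage estimate much.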
Shotgun data is often (1) high
coverage and (2) biased in coverage.


                               (MD amplified)
Digital normalization fixes all that.

                           Normalizes coverage

                           Discards redundancy

                           Eliminates majority of
                           errors

                           Scales assembly dramatically

                           Assembly is 98% identical
Digital normalization retains information, while
discarding data and errors
Evaluating sensitivity & specificity

    E. coli @ 10x + soil




                            98.5% of E. coli
How much? A mathematical
interlude.
 Suppose we need 10x coverage to assemble a
  microbial genome, and microbial genomes
  average 5e6 bp of DNA.
 Further suppose that we want to be able to
  assemble a microbial species that is “1 in a
  100000”, i.e. 1 in 1e5.
 Shotgun sequencing samples randomly, so must
  sample deeply to be sensitive.

10x coverage x 5e6 bp x 1e5 =~ 50e11 bp, or 5 Tbp of
                    sequence.
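
The arithmetic above can be checked with a short script. It is a sketch under the same simplifying assumptions as the slide (uniform random sampling and a single average genome size); the function name `required_bases` and its parameter names are hypothetical.

```python
def required_bases(coverage, genome_size_bp, rarity):
    """Bases to sequence so a genome present at 1 in `rarity` reaches `coverage`.

    Assumes uniform random sampling and equal genome sizes across the community.
    """
    return coverage * genome_size_bp * rarity


total_bp = required_bases(coverage=10, genome_size_bp=5_000_000, rarity=100_000)
print(f"{total_bp:.1e} bp, i.e. {total_bp / 1e12:g} Tbp")
```

Changing `rarity` shows how quickly the required depth grows for rarer community members.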
Example
Dethlefsen shotgun data set / Relman lab

251 million reads / 16 GB FASTQ, gzipped
~24 hrs, < 32 GB of RAM for full pipeline -- $24 on
  Amazon EC2
  (reads => final assembly + mapping)

Assembly stats:
       58,224 contigs > 1000 bp (average 3 kb)
            summing to 190 Mb genomic
        ~38 microbial genomes' worth of DNA
     ~65% of reads mapped back to assembly
What do we get for soil?

    Total assembly   Total contigs   % reads assembled   Predicted protein coding   rplB genes
    2.5 billion bp   4.5 million     19%                 5.3 million                391
    3.5 billion bp   5.9 million     22%                 6.8 million                466

    (The rplB gene count estimates the number of species.)
    Putting it in perspective:
    Total equivalent of ~1200 bacterial genomes          Adina Howe
    Human genome ~3 billion bp
Extracting whole genomes?
So far, we have only assembled contigs, but not whole
 genomes.

Can entire genomes be
assembled from metagenomic
data?

Iverson et al. (2012), from
the Armbrust lab, describes a
technique for scaffolding
metagenome contigs into
~whole genomes. YES.
Concluding thoughts on
assembly
 Illumina is the only game in town for sequencing complex
  microbial populations, but dealing with the data
  (volume, errors) is tricky. This problem is being solved, by
  us and others.

 We’re working to make it as close to push button as
  possible, with objectively argued parameters and
  tools, and methods for evaluating new tools and
  sequencing types.

 The community is working on dealing with data
  downstream of sequencing & assembly.
   Most pipelines were built around 454 data – long reads, and
    relatively few of them.
   With Illumina, we can get both long contigs and quantitative
    information about their abundance. This necessitates
    changes to pipelines like MG-RAST and HUMAnN.
The interpretation challenge
 For soil, we have generated approximately 1200
  bacterial genomes worth of assembled genomic DNA
  from two soil samples.

 The vast majority of this genomic DNA contains
  unknown genes with largely unknown function.

 Most annotations of gene function & interaction are
  from a few phylogenetically limited model organisms
   An estimated 98% of annotations are computationally inferred:
    transferred from model organisms to genomic
    sequence, using homology.
   Can these annotations be transferred? (Probably not.)

   This will be the biggest sequence analysis challenge
                     of the next 50 years.
How will we annotate soil??

    Total assembly   Total contigs   % reads assembled   Predicted protein coding   rplB genes
    2.5 billion bp   4.5 million     19%                 5.3 million                391
    3.5 billion bp   5.9 million     22%                 6.8 million                466

    (The rplB gene count estimates the number of species.)
    Putting it in perspective:
    Total equivalent of ~1200 bacterial genomes          Adina Howe
    Human genome ~3 billion bp
Some lessons from C. jejuni
 In vivo murine transfer experiments demonstrate
  substantial capacity for C. jejuni 11168 to adapt solely
  via modification of poly-G tracts (Jerome et al., 2011).

 Bell et al. (unpub) have shown substantial variability
  in gene content of Campylobacter strains. Gene
  content and gene expression are both important to
  understanding mechanisms of pathogenicity.

 In vitro serial transfer experiments demonstrate that
  rapid genomic adaptation to new environments occurs
  at multiple loci, with substantial variation in genes of
  unknown function (Jerome et al., in preparation)
Multilocus “strain” variation in C.
jejuni drives rapid adaptation
What works?
Today,

 From deep metagenomic data, you can get the
 gene and operon content (including abundance of
 both) from communities.

 You can get microarray-like expression
 information from metatranscriptomics.
What needs work?

 Assembling ultra-deep samples is going to
 require more engineering, but is straightforward.
 (“Infinite assembly.”)

 Building scaffolds and extracting whole genomes
 has been done, but I am not yet sure how
 feasible it is to do systematically with existing
 tools (cf. Armbrust Lab).
What will work, someday?

 Sensitive analysis of strain variation.
   Both assembly and mapping approaches do a poor
    job detecting many kinds of biological novelty.
   The 1000 Genomes Project has developed some
    good tools that need to be evaluated on community
    samples.


 Ecological/evolutionary dynamics in vivo.
   Most work done on 16S, not on genomes or
    functional content.
   Here, sensitivity is really important!
What are future needs?
 High-quality, medium+ throughput annotation of
 genomes?
   Extrapolating from model organisms is both
    immensely important and yet lacking.
   Strong phylogenetic sampling bias in existing
    annotations.


 Synthetic biology for investigating non-model
 organisms?
   (Cleverness in experimental biology doesn’t
                     scale)
Pubs, software, tutorials, etc.
Metagenome assembly / HMP tutorial:
      http://ged.msu.edu/angus/nih-hmp-2012/

 Everything I discussed is available pre-pub -- contact
               ctb@msu.edu, or Google for

               khmer – software package
     kmer-percolation paper (in re-review, PNAS)
   digital normalization paper (in review, PLoS One)

   …a few dozen people using, one way or another.
Acknowledgements
 Jason Pell, Qingpeng Zhang, Arend Hintze, and
  Adina Howe
 Soil: Jim Tiedje (MSU), Janet Jansson
  (LBNL/JGI), Susannah Tringe (JGI)
 Campy: Linda Mansfield, Julia Bell, JP Jerome,
  Jeff Barrick

Funding:
USDA NIFA; NSF IOS; BEACON.

 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

2012 erin-crc-nih-seattle

  • 1. Extracting genomes from community sequencing “What works, what will work, and what needs work” C. Titus Brown ctb@msu.edu Computer Science; Microbiology; BEACON Michigan State University
  • 2. Warnings This talk contains forward-looking statements. These forward-looking statements can be identified by terminology such as “will”, “expects”, and “believes”. -- Safe Harbor provisions of the U.S. Private Securities Litigation Act “Making predictions is difficult, especially if they’re about the future.” -- Attributed to Niels Bohr
  • 3. Thanks for the invitation!  So, Linda Mansfield and I were talking one day…  Her: “It’d be great to be able to look at communities with sequencing.”  Me: “Oh, yeah, we can do that now.”  My overall interest is in good hypothesis generation from computational data, with a focus on sequence data.  For the past three years, I have been working on this specifically for soil metagenomics (and mRNAseq, too).
  • 4. Deep connection between human gut and soil
  • 5. Soil is full of uncultured microbes Estimates of microbial diversity in agricultural soil ~1 million species/gram Randy Jackson
  • 7. Soil contains thousands to millions of species (“Collector’s curves” of ~species) [Figure: number of OTUs (0 to 2000) versus number of sequences (100 to 8100) for Iowa Corn, Iowa Native Prairie, Kansas Corn, Kansas Native Prairie, Wisconsin Corn, Wisconsin Native Prairie, Wisconsin Restored Prairie, and Wisconsin Switchgrass.]
  • 8. Ecology => function emphasis  What’s there?  Is it really that complex a community?  How “deep” do we need to sequence to sample thoroughly and systematically?  How is ecological complexity created & maintained?  How does ecological complexity respond to perturbation?  What organisms and gene functions are present, including non-canonical carbon and nitrogen cycling pathways?  What kind of organismal and functional overlap is
  • 9. The human gut is a diverse place Dethlefsen et al., 2008
  • 10. Ecology vs function in human gut We can observe recovery of diversity after Cipro treatment; but what is driving recovery at a functional level? Dethlefsen and Relman, 2011
  • 11. Culture-independent methods  Observation that 99% of microbes cannot easily be cultured in the lab. (“The great plate count anomaly”)  While this is less true for host-associated microbes, culture-independent methods are still important:  Syntrophic relationships  Niche-specificity or unknown physiology  Dormant microbes  Abundance within communities Single-cell sequencing & shotgun metagenomics are two common ways to investigate microbial communities.
  • 12. Shotgun metagenomics  Collect samples;  Extract DNA;  Feed into sequencer;  Computationally analyze. Wikipedia: Environmental shotgun sequencing.p
  • 13. Shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
  • 14. Shotgun sequencing & assembly  Why assembly?  Assumption free (no reference needed)  Necessary for soil and marine; useful for host-associated?  Assembly can serve as reference for transcriptome interpretation  Fragment, sequence, computationally assemble.  What kind of results do you get?  Almost certainly chimerism between different strains; but still useful for gene content & operon structure.  Specificity seems high, but sensitivity is dependent on sequencing depth.  Because of sampling rate, Illumina is primary choice.
  • 15. Shotgun metagenomics: good news  Cheap and easy to generate vast whole metagenome/metatranscriptome shotgun data sets from essentially any community you can sample.  Such data can be quite interesting!  Low hanging fruit – correlation with diet, etc.  Still early days for observation of “pan genome” and functional content.  Potential to illuminate or inform:  Dynamics and selective pressures of antibiotic resistance, virulence genes, and pathogenicity islands  Phage and viral communities  Community interactions.
  • 16. Shotgun metagenomics: bad news  Computational techniques are still relatively immature  Mapping to known genomes?  Discovery of unknown genomes & strain variants?  Sensitivity and specificity are hard to evaluate.  Computational ecosystem is not that rich…  Interpreting the data is still the bottleneck, of course.  Vast majority of genes not usefully annotated.  Reliance on specific reference databases, annotations.  Tools for (e.g.) inferring community interactions from community dynamics & functional capacity are desperately needed.
  • 17. The computational conundrum More data => better. and More data => computationally more challenging.
  • 18. 1. Assembly depends on high coverage
  • 19. 2. Big data sets require big machines For even relatively small data sets, metagenomic assemblers scale poorly. Memory usage ~ “real” variation + number of errors; number of errors ~ size of data set; size of data set == big!! (Estimated 6 weeks x 3 TB of RAM to do 300gb soil sample, with a slightly modified conventional assembler.)
  • 21. Approach 1: Partitioning Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences.
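The partitioning idea on this slide can be illustrated with a toy sketch. It assumes that two reads are "connected" when they share at least one k-mer, and it groups reads into connected components with a union-find structure. The names `kmers` and `partition_reads` and the value of `k` are hypothetical, the k-mer table is a plain in-memory dict, and reverse complements are ignored, so this is not the implementation described in the talk.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def partition_reads(reads, k=5):
    """Bin reads so that reads linked by a chain of shared k-mers land together.

    Assumption for illustration: connectivity == sharing an exact k-mer.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Find the representative of x's component, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for km in kmers(read, k):
            if km in first_seen:
                ra, rb = find(first_seen[km]), find(idx)
                if ra != rb:
                    parent[rb] = ra  # merge the two components
            else:
                first_seen[km] = idx

    bins = defaultdict(list)
    for idx, read in enumerate(reads):
        bins[find(idx)].append(read)
    return sorted(bins.values())


if __name__ == "__main__":
    reads = ["ACGTACGTAA", "GTACGTAACC", "TTTTGGGGCC", "GGGGCCTTAA"]
    for n, group in enumerate(partition_reads(reads, k=5)):
        print(n, group)
```

Each resulting bin could then be assembled on its own, which is the point of partitioning: no single assembly has to hold the whole data set in memory.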
  • 22. Technical challenges met (and defeated)  Novel data structure properties elucidated via percolation theory analysis (Pell, Hintze, et al., in review, PNAS).  Exhaustive in-memory traversal of graphs containing 5-15 billion nodes.  Sequencing technology introduces false sequences in graph (Howe et al., in prep.)  Only 20x improvement in assembly scaling.
  • 23. (NOVEL) Approach 2: Digital normalization Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory.
  • 24. Digital normalization discards redundant reads prior to assembly. This removes reads and decreases data size, eliminates errors from removed reads, and normalizes coverage across loci. Discarded reads can be used after assembly for quantitative analysis.
  • 25. A read’s median k-mer count is a good estimator of “coverage”. This gives us a reference-free measure of coverage.
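The two slides above (discard redundant reads; estimate coverage from a read's median k-mer count) can be combined into a minimal sketch. It makes several assumptions: reads are processed in a single streaming pass, k-mers are counted exactly in a `Counter` rather than in a memory-efficient structure, reverse complements are ignored, and the names `digital_normalize`, `k` and `cutoff` and their default values are illustrative only. It is not the khmer implementation.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only if its estimated coverage so far is below `cutoff`.

    Coverage is estimated without a reference as the median count of the
    read's k-mers among the reads that have already been kept.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute counts
        # Otherwise the read is redundant and is discarded.
    return kept


if __name__ == "__main__":
    # Ten copies of one read (high coverage) plus one read from a rarer locus.
    reads = ["ACGTACGTTGCA"] * 10 + ["TTTTTTTTGGGG"]
    kept = digital_normalize(reads, k=4, cutoff=3)
    print(len(kept), "of", len(reads), "reads kept")
```

Because discarded reads never update the counts, the sequencing errors they carry never enter the table, which matches the claim that normalization removes errors along with redundant data.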
  • 26. Shotgun data is often (1) high coverage and (2) biased in coverage. (MD amplified)
  • 27. Digital normalization fixes all that. Normalizes coverage Discards redundancy Eliminates majority of errors Scales assembly dramat Assembly is 98% identica
  • 28. Digital normalization retains information, while discarding data and errors
  • 29. Evaluating sensitivity & specificity E. coli @ 10x + soil 98.5% of E. coli
  • 30. How much? A mathematical interlude.  Suppose we need 10x coverage to assemble a microbial genome, and microbial genomes average 5e6 bp of DNA.  Further suppose that we want to be able to assemble a microbial species that is “1 in 100,000”, i.e. 1 in 1e5.  Shotgun sequencing samples randomly, so must sample deeply to be sensitive. 10x coverage x 5e6 bp x 1e5 =~ 50e11 bp, or 5 Tbp of sequence.
  • 31. Example Dethlefsen shotgun data set / Relman lab 251 m reads / 16 gb FASTQ gzipped ~ 24 hrs, < 32 gb of RAM for full pipeline -- $24 on Amazon EC2 (reads => final assembly + mapping) Assembly stats: 58,224 contigs > 1000 bp (average 3 kb) summing to 190 mb genomic ~38 microbial genomes worth of DNA ~65% of reads mapped back to assembly
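The arithmetic on this slide can be checked with a short sketch. The function name `required_sequencing_bp` is hypothetical. It takes the slide's simplifications at face value: every genome is the same size, and a "1 in 1e5" species contributes 1/1e5 of the sampled DNA.

```python
def required_sequencing_bp(coverage, genome_size_bp, rarity):
    """Total bases to sequence so a 1-in-`rarity` species reaches `coverage`.

    Random sampling means only 1/rarity of the sequenced bases come from the
    target genome, so the total is scaled up by `rarity`.
    """
    return coverage * genome_size_bp * rarity


if __name__ == "__main__":
    total = required_sequencing_bp(coverage=10, genome_size_bp=5_000_000,
                                   rarity=100_000)
    print(f"{total:.1e} bp = {total / 1e12:.0f} Tbp")  # 5.0e+12 bp = 5 Tbp
```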
  • 32. What do we get for soil?
      Total Assembly   Total Contigs   % Reads Assembled   Predicted protein coding   rplb genes
      2.5 bill         4.5 mill        19%                 5.3 mill                   391
      3.5 bill         5.9 mill        22%                 6.8 mill                   466
      (The rplb gene count estimates the number of species.)
      Putting it in perspective: Total equivalent of ~1200 bacterial genomes; Human genome ~3 billion bp. Adina Howe
  • 33. Extracting whole genomes? So far, we have only assembled contigs, but not whole genomes. Can entire genomes be assembled from metagenomic data? YES. Iverson et al. (2012), from the Armbrust lab, contains a technique for scaffolding metagenome contigs into ~whole genomes.
  • 34. Concluding thoughts on assembly  Illumina is the only game in town for sequencing complex microbial populations, but dealing with the data (volume, errors) is tricky. This problem is being solved, by us and others.  We’re working to make it as close to push button as possible, with objectively argued parameters and tools, and methods for evaluating new tools and sequencing types.  The community is working on dealing with data downstream of sequencing & assembly.  Most pipelines were built around 454 data – long reads, and relatively few of them.  With Illumina, we can get both long contigs and quantitative information about their abundance. This necessitates changes to pipelines like MG-RAST and HUMAnN.
  • 35. The interpretation challenge  For soil, we have generated approximately 1200 bacterial genomes worth of assembled genomic DNA from two soil samples.  The vast majority of this genomic DNA contains unknown genes with largely unknown function.  Most annotations of gene function & interaction are from a few phylogenetically limited model organisms.  An estimated 98% of annotations are computationally inferred: transferred from model organisms to genomic sequence, using homology.  Can these annotations be transferred? (Probably not.) This will be the biggest sequence analysis challenge of the next 50 years.
  • 36. How will we annotate soil??
      Total Assembly   Total Contigs   % Reads Assembled   Predicted protein coding   rplb genes
      2.5 bill         4.5 mill        19%                 5.3 mill                   391
      3.5 bill         5.9 mill        22%                 6.8 mill                   466
      (The rplb gene count estimates the number of species.)
      Putting it in perspective: Total equivalent of ~1200 bacterial genomes; Human genome ~3 billion bp. Adina Howe
  • 37. Some lessons from C. jejuni  In vivo murine transfer experiments demonstrate substantial capacity for C. jejuni 11168 to adapt solely via modification of poly-G tracts (Jerome et al., 2011).  Bell et al. (unpub) have shown substantial variability in gene content of Campylobacter strains. Gene content and gene expression are both important to understanding mechanisms of pathogenicity.  In vitro serial transfer experiments demonstrate that rapid genomic adaptation to new environments occurs at multiple loci, with substantial variation in genes of unknown function (Jerome et al., in preparation).
  • 38. Multilocus “strain” variation in C. jejuni drives rapid adaptation
  • 39. What works? Today,  From deep metagenomic data, you can get the gene and operon content (including abundance of both) from communities.  You can get microarray-like expression information from metatranscriptomics.
  • 40. What needs work?  Assembling ultra-deep samples is going to require more engineering, but is straightforward. (“Infinite assembly.”)  Building scaffolds and extracting whole genomes has been done, but I am not yet sure how feasible it is to do systematically with existing tools (cf. Armbrust Lab).
  • 41. What will work, someday?  Sensitive analysis of strain variation.  Both assembly and mapping approaches do a poor job detecting many kinds of biological novelty.  The 1000 Genomes Project has developed some good tools that need to be evaluated on community samples.  Ecological/evolutionary dynamics in vivo.  Most work done on 16S, not on genomes or functional content.  Here, sensitivity is really important!
  • 42. What are future needs?  High-quality, medium+ throughput annotation of genomes?  Extrapolating from model organisms is both immensely important and yet lacking.  Strong phylogenetic sampling bias in existing annotations.  Synthetic biology for investigating non-model organisms? (Cleverness in experimental biology doesn’t scale)
  • 43. Pubs, software, tutorials, etc. Metagenome assembly / HMP tutorial: http://ged.msu.edu/angus/nih-hmp-2012/ Everything I discussed is available pre-pub -- contact ctb@msu.edu, or Google for: khmer – software package; kmer-percolation paper (in re-review, PNAS); digital normalization paper (in review, PLoS One). …a few dozen people using, one way or another.
  • 44. Acknowledgements  Jason Pell, Qingpeng Zhang, Arend Hintze, and Adina Howe  Soil: Jim Tiedje (MSU), Janet Jansson (LBNL/JGI), Susannah Tringe (JGI)  Campy: Linda Mansfield, Julia Bell, JP Jerome, Jeff Barrick Funding: USDA NIFA; NSF IOS; BEACON.

Editor's Notes

  1. Development of antibiotic resistance; vacancy of niches for resource consumption by antibiotic sensitive; ??