Making assembly cheap & easy, and consequences
thereof
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Feb 2014
ctb@msu.edu
Generally, yay #openscience!
Everything discussed here:
• Code: github.com/ged-lab/ ; BSD license
• Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
• Twitter: @ctitusbrown
• Grants on Lab Web site: http://ged.msu.edu/research.html
• Preprints: on arXiv, q-bio: ‘diginorm arxiv’
Problem under consideration: shotgun metagenomics
• Collect samples;
• Extract DNA;
• Feed into sequencer;
• Computationally analyze.

“Sequence it all and let the bioinformaticians sort it out”
Wikipedia: Environmental shotgun sequencing.png
Analogy: we seek an understanding
of humanity via our libraries.

http://eofdreams.com/library.html;
But, our only observation tool is
shredding a mixture of all of the
books & digitizing the shreds.

http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
Points:
• Lots of fragments needed! (Deep sampling.)
• Having read and understood some books will help quite a bit. (Prior knowledge.)
• Rare books will be harder to reconstruct than common books.
• Errors in OCR process matter quite a bit.
• The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books.
• A categorization system would be an invaluable but not infallible guide to book topics.
• Understanding the language would help you validate & understand the books.
Investigating soil microbial communities
• 95% or more of soil microbes cannot be cultured in lab.
• Very little transport in soil and sediment => slow mixing rates.
• Estimates of immense diversity:
  - Billions of microbial cells per gram of soil.
  - Million+ microbial species per gram of soil (Gans et al., 2005)
  - One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory)
“By 'soil' we understand (Vil'yams, 1931) a loose
surface layer of earth capable of yielding plant
crops. In the physical sense the soil represents a
complex disperse system consisting of three
phases: solid, liquid, and gaseous.”

Microbes live in & on:
• Surfaces of aggregate particles;
• Pores within microaggregates;

N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS
http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
Questions to address
• Role of soil microbes in nutrient cycling:
  - How does agricultural soil differ from native soil?
  - How do soil microbial communities respond to climate perturbation?
• Genome-level questions:
  - What kind of strain-level heterogeneity is present in the population?
  - What are the phage and viral populations & dynamics?
  - What species are where, and how much is shared between different geographical locations?
Must use culture independent and metagenomic approaches
• Many reasons why you can’t or don’t want to culture: cross-feeding, niche specificity, dormancy, etc.
• If you want to get at underlying function, 16S analysis alone is not sufficient.
Single-cell sequencing & shotgun metagenomics are two common ways to investigate complex microbial communities.
Shotgun metagenomics
• Collect samples;
• Extract DNA;
• Feed into sequencer;
• Computationally analyze.

“Sequence it all and let the bioinformaticians sort it out”
Wikipedia: Environmental shotgun sequencing.png
Computational reconstruction of
(meta)genomic content.

http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
Points:
• Lots of fragments needed! (Deep sampling.)
• Having read and understood some books will help quite a bit. (Reference genomes.)
• Rare books will be harder to reconstruct than common books.
• Errors in OCR process matter quite a bit. (Sequencing error.)
• The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. (We don’t understand most microbial function.)
• A categorization system would be an invaluable but not infallible guide to book topics. (Phylogeny can guide interpretation.)
• Understanding the language would help you validate & understand the books.
Great Prairie Grand Challenge -- SAMPLING LOCATIONS

2008
A “Grand Challenge” dataset (DOE/JGI)
Total: 1,846 Gbp soil metagenome
[Figure: bar chart of basepairs of sequencing (Gbp, axis 0 to 600) per sample, by platform (GAII, HiSeq). Samples: Iowa, continuous corn; Iowa, native prairie; Kansas, cultivated corn; Kansas, native prairie; Wisconsin, continuous corn; Wisconsin, native prairie; Wisconsin, restored prairie; Wisconsin, switchgrass. Reference levels: MetaHIT (Qin et al., 2011), 578 Gbp; Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp.]
Why do we need so much data?!
• 20-40x coverage is necessary; 100x is ~sufficient.
• Mixed population sampling => sensitivity driven by lowest abundance.
  - For example, for E. coli in 1/1000 dilution, you would need approximately 100x coverage of a 5 Mbp genome at 1/1000, or 500 Gbp of sequence! (For soil, estimate is 50 Tbp.)
• Sequencing is straightforward; data analysis is not.

“$1000 genome with $1m analysis”
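
The E. coli arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `required_sequencing_bp` and its parameters are illustrative (not from the talk), and it assumes reads are sampled uniformly in proportion to each genome's share of the DNA.

```python
def required_sequencing_bp(genome_size_bp, target_coverage, dilution_factor):
    """Total bases to sequence so that a genome diluted 1-in-`dilution_factor`
    within a mixed sample reaches `target_coverage` on average.

    Assumes idealized uniform sampling of the community (an assumption of
    this sketch, not a measured property of any data set).
    """
    if dilution_factor < 1:
        raise ValueError("dilution_factor must be >= 1")
    # Bases that must land on the target genome, scaled up by the dilution.
    return genome_size_bp * target_coverage * dilution_factor


# E. coli example from the slide: 5 Mbp genome, 100x coverage, 1/1000 dilution.
total_bp = required_sequencing_bp(5_000_000, 100, 1000)
print(total_bp // 10**9, "Gbp")  # prints "500 Gbp"
```

The result scales linearly with the dilution factor, which is why the rarest member of the community drives the sequencing depth.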
Great Prairie Grand Challenge goals
• How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest soil data set ever sequenced, ~2010.)
• What can we learn about soil from looking at the reconstructed metagenome? (See list of questions.)
Assembly graphs scale with data size, not
information.

Conway T C , Bromage A J Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,
please email: journals.permissions@oup.com
The Problem
• We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware.
• No locality to the data in terms of graph structure.
• Since ~2008:
  - The field has engaged in lots of engineering optimization…
  - …but the data generation rate has consistently outstripped Moore’s Law.
Several solutions
1. More efficient exploration of data.
2. Subdivide data.
3. Discard redundant data.
Primary approach: Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Diversity vs richness.

The high-coverage reads in sample A are unnecessary for
Digital normalization
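
The idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ged-lab implementation: the names `kmers` and `normalize_by_median` and the default `k` and `cutoff` values are chosen here for illustration, an exact in-memory `Counter` stands in for whatever counting structure a real tool would use, and reverse complements and sequencing errors are ignored. A read is kept only if the median count of its k-mers, among reads already kept, is below the cutoff.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize_by_median(reads, k=20, cutoff=20):
    """Single-pass sketch of digital normalization.

    Keeps a read only if the median count of its k-mers (counted over the
    reads kept so far) is below `cutoff`; kept reads then update the counts.
    The exact Counter is an assumption of this sketch, chosen for clarity.
    """
    counts = Counter()
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)
            yield read


# Toy input: one read seen 10 times (abundant) and one read seen once (rare).
abundant = "ACGTTGCAAGGCTTAACCGG"
rare = "TTTTTGGGGGCCCCCAAAAA"
kept = list(normalize_by_median([abundant] * 10 + [rare], k=5, cutoff=3))
print(len(kept))  # 4: three copies of the abundant read plus the rare read
```

Because the function is a generator that consults only the counts accumulated so far, it looks at each read once and never needs the whole data set in memory.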
Diginorm is “lossy compression”
• Nearly perfect from an information theoretic perspective:
  - Discards 95% or more of data for genomes.
  - Loses < 0.02% of information.
Where are we taking this?
• Streaming online algorithms only look at data ~once.
• Diginorm is streaming, online…
• Conceptually, can move many aspects of sequence analysis into streaming mode.
=> Extraordinary potential for computational efficiency.
=> Streaming, online variant calling.

Single pass, reference free, tunable, streaming online variant calling.
Potentially quite clinically useful.
Prospective: sequencing tumor cells
• Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations.
• 1000 cells x 3 Gbp x 20x coverage: 60 Tbp of sequence.
• Most of this data will be redundant and not useful.
• Developing diginorm-based algorithms to eliminate data while retaining variant information.
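
The data volume quoted for the tumor project follows directly from the three factors on the slide. A minimal sketch of the arithmetic, with an illustrative function name that is not from the talk:

```python
def total_sequence_bp(n_cells, genome_size_bp, coverage):
    """Raw bases generated when every cell is sequenced to `coverage`."""
    return n_cells * genome_size_bp * coverage


# Figures from the slide: 1000 cells, 3 Gbp genome, 20x coverage.
total = total_sequence_bp(1000, 3_000_000_000, 20)
print(total // 10**12, "Tbp")  # prints "60 Tbp"
```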
The real challenge: understanding
• We have gotten distracted by shiny toys: sequencing!! Data!!
• Data is now plentiful! But:
  - We typically have no knowledge of what > 50% of an environmental metagenome “means”, functionally.
  - Most data is not openly available, so we cannot mine correlations across data sets.
  - Most computational science is not reproducible, so I can’t reuse other people’s tools or approaches.
Data intensive biology & hypothesis generation
• My interest in biological data is to enable better hypothesis generation.
My interests
• Open source ecosystem of analysis tools.
• Loosely coupled APIs for querying databases.
• Publishing reproducible and reusable analyses, openly.
• Education and training.

“Platform perspective”
Practical implications of diginorm
• Data is (essentially) free;
• For some problems, analysis is now cheaper than data gathering (i.e. essentially free);
• …plus, we can run most of our approaches in the cloud.
khmer-protocols
• Effort to provide standard “cheap” assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
• Open, versioned, forkable, citable.

Pipeline: Read cleaning => Diginorm => Assembly => Annotation => RSEM differential expression
CC0; BSD; on github; in reStructuredText.
Can we incentivize data sharing?
• ~$100-$150/transcriptome in the cloud
• Offer to analyze people’s existing data for free, IFF they open it up within a year.
See: “Dead Sea Scrolls & Open Marine Transcriptome Project” blog post; CephSeq white paper.
“Research singularity”
The data a researcher generates in their lab constitutes an increasingly small component of the data used to reach a conclusion.
Corollary: The true value of the data an individual
investigator generates should be considered in the
context of aggregate data.
Even if we overcome the social barriers and
incentivize sharing, we are, needless to say, not
remotely prepared for sharing all the data.
IPython Notebook: data + code => IPython Notebook
Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Camille Scott
• Jordan Fish
• Michael Crusoe
• Leigh Sheneman

Collaborators
• Jim Tiedje, MSU
• Susannah Tringe and Janet Jansson (JGI, LBNL)
• Erich Schwarz, Caltech / Cornell
• Paul Sternberg, Caltech
• Robin Gasser, U. Melbourne
• Weiming Li, MSU
• Shana Goffredi, Occidental

Funding
USDA NIFA; NSF IOS; NIH; BEACON.

A Domino Admins Adventures (Engage 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

2014 sage-talk

  • 1. Making assembly cheap & easy, and consequences thereof C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University Feb 2014 ctb@msu.edu
  • 2. Generally, yay #openscience! Everything discussed here: Code: github.com/ged-lab/ ; BSD license. Blog: http://ivory.idyll.org/blog (“titus brown blog”). Twitter: @ctitusbrown. Grants on lab Web site: http://ged.msu.edu/research.html ; Preprints: on arXiv, q-bio (“diginorm arxiv”).
  • 3. Problem under consideration: shotgun metagenomics. Collect samples; extract DNA; feed into sequencer; computationally analyze. “Sequence it all and let the bioinformaticians sort it out.” Image: Wikipedia, Environmental shotgun sequencing.png
  • 4. Analogy: we seek an understanding of humanity via our libraries. http://eofdreams.com/library.html;
  • 5. But, our only observation tool is shredding a mixture of all of the books & digitizing the shreds. http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
  • 6. Points: Lots of fragments needed! (Deep sampling.) Having read and understood some books will help quite a bit. (Prior knowledge.) Rare books will be harder to reconstruct than common books. Errors in OCR process matter quite a bit. The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. A categorization system would be an invaluable but not infallible guide to book topics. Understanding the language would help you validate & understand the books.
  • 7. Investigating soil microbial communities. 95% or more of soil microbes cannot be cultured in lab. Very little transport in soil and sediment => slow mixing rates. Estimates of immense diversity: billions of microbial cells per gram of soil; million+ microbial species per gram of soil (Gans et al., 2005). One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory).
  • 8. “By 'soil' we understand (Vil'yams, 1931) a loose surface layer of earth capable of yielding plant crops. In the physical sense the soil represents a complex disperse system consisting of three phases: solid, liquid, and gaseous.” Microbes live in & on: surfaces of aggregate particles; pores within microaggregates. N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS, http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
  • 9. Questions to address. Role of soil microbes in nutrient cycling: How does agricultural soil differ from native soil? How do soil microbial communities respond to climate perturbation? Genome-level questions: What kind of strain-level heterogeneity is present in the population? What are the phage and viral populations & dynamics? What species are where, and how much is shared between different geographical locations?
  • 10. Must use culture-independent and metagenomic approaches. Many reasons why you can't or don't want to culture: cross-feeding, niche specificity, dormancy, etc. If you want to get at underlying function, 16S analysis alone is not sufficient. Single-cell sequencing & shotgun metagenomics are two common ways to investigate complex microbial communities.
  • 11. Shotgun metagenomics. Collect samples; extract DNA; feed into sequencer; computationally analyze. “Sequence it all and let the bioinformaticians sort it out.” Image: Wikipedia, Environmental shotgun sequencing.png
  • 12. Computational reconstruction of (meta)genomic content. http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
  • 13. Points: Lots of fragments needed! (Deep sampling.) Having read and understood some books will help quite a bit. (Reference genomes.) Rare books will be harder to reconstruct than common books. Errors in OCR process matter quite a bit. (Sequencing error.) The more, different specialized libraries you sample, the more likely you are to discover valid correlations between topics and books. (We don’t understand most microbial function.) A categorization system would be an invaluable but not infallible guide to book topics. (Phylogeny can guide interpretation.) Understanding the language would help you validate
  • 14. Great Prairie Grand Challenge --SAMPLING LOCATIONS 2008
  • 15. A “Grand Challenge” dataset (DOE/JGI). Total: 1,846 Gbp soil metagenome. [Bar chart: basepairs of sequencing (Gbp) per sample, GAII and HiSeq. Samples: Iowa, continuous corn; Iowa, native prairie; Kansas, cultivated corn; Kansas, native prairie; Wisconsin, continuous corn; Wisconsin, native prairie; Wisconsin, restored prairie; Wisconsin, switchgrass. Reference levels: MetaHIT (Qin et al., 2011), 578 Gbp; rumen (Hess et al., 2011), 268 Gbp; rumen k-mer filtered, 111 Gbp; NCBI nr database, 37 Gbp.]
  • 16. Why do we need so much data?! 20-40x coverage is necessary; 100x is ~sufficient. Mixed population sampling => sensitivity driven by lowest abundance. For example, for E. coli in 1/1000 dilution, you would need approximately 100x coverage of a 5 Mbp genome at 1/1000, or 500 Gbp of sequence! (For soil, estimate is 50 Tbp.) Sequencing is straightforward; data analysis is not. “$1000 genome with $1m analysis”
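The arithmetic on this slide can be checked with a short sketch. The function name and its parameters are illustrative assumptions, not something from the talk, and the model is deliberately naive: it assumes the target genome makes up exactly 1/dilution of the sequenced DNA and that reads are sampled uniformly.

```python
def required_sequencing_bp(genome_size_bp, target_coverage, dilution):
    """Total bases to sequence so that a genome present at 1/`dilution`
    of the sample reaches `target_coverage`.

    Hypothetical helper for illustration only.
    """
    if dilution < 1:
        raise ValueError("dilution must be >= 1 (1 means an undiluted sample)")
    return genome_size_bp * target_coverage * dilution


# E. coli example from the slide: 5 Mbp genome, 100x coverage, 1/1000 dilution.
total_bp = required_sequencing_bp(5_000_000, 100, 1000)
print(f"{total_bp / 1e9:.0f} Gbp")  # prints "500 Gbp"
```

The same product explains why sensitivity is driven by the lowest-abundance member: the required depth grows linearly with the dilution of the rarest genome you want to recover.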
  • 17. Great Prairie Grand Challenge goals. How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest soil data set ever sequenced, ~2010.) What can we learn about soil from looking at the reconstructed metagenome? (See list of questions.)
  • 18. Assembly graphs scale with data size, not information. Figure: Conway TC, Bromage AJ. Bioinformatics 2011;27:479-486. © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
  • 19. The Problem. We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware. No locality to the data in terms of graph structure. Since ~2008: the field has engaged in lots of engineering optimization… …but the data generation rate has consistently outstripped Moore's Law.
  • 20. Several solutions: 1. More efficient exploration of data. 2. Subdivide data. 3. Discard redundant data.
  • 21. Primary approach: digital normalization (a computational version of library normalization). Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Diversity vs richness. The high-coverage reads in sample A are unnecessary for
  • 28. Diginorm is “lossy compression”. Nearly perfect from an information theoretic perspective: discards 95% or more of data for genomes; loses < 0.02% of information.
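The idea of discarding redundant reads can be sketched as a single-pass filter: estimate each read's coverage as the median count of its k-mers among the reads kept so far, and drop the read once that estimate reaches a cutoff. This is a minimal illustration, not the khmer implementation. The function names, the exact dictionary counting (a real tool would need a memory-efficient counting structure), and the default values of `k` and `cutoff` are all assumptions made for the example.

```python
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalization(reads, k=20, cutoff=20):
    """Single-pass filter: keep a read only if the median count of its
    k-mers, among the reads kept so far, is below `cutoff`.

    Assumption: an exact dict stands in for the memory-efficient
    counting structure a real implementation would use.
    """
    counts = {}
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: no k-mers, so it is dropped
        if median(counts.get(km, 0) for km in read_kmers) < cutoff:
            # Read adds new coverage: record its k-mers and keep it.
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
            yield read


# Toy data: one abundant read repeated 50 times, plus one rare read.
common, rare = "ACGTTGCAAG", "GGATCCTAGA"
kept = list(digital_normalization([common] * 50 + [rare], k=4, cutoff=5))
print(len(kept), "reads kept out of 51")
```

Because the filter is a generator that looks at each read once and only consults counts from reads already kept, high-coverage reads are trimmed down to roughly the cutoff while the rare read passes through untouched.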
  • 29. Where are we taking this? Streaming online algorithms only look at data ~once. Diginorm is streaming, online… Conceptually, can move many aspects of sequence analysis into streaming mode. => Extraordinary potential for computational efficiency.
  • 30. => Streaming, online variant calling. Single pass, reference free, tunable, streaming online variant calling. Potentially quite clinically useful.
  • 31. Prospective: sequencing tumor cells. Goal: phylogenetically reconstruct causal “driver mutations” in face of passenger mutations. 1000 cells x 3 Gbp x 20x coverage: 60 Tbp of sequence. Most of this data will be redundant and not useful. Developing diginorm-based algorithms to eliminate data while retaining variant information.
  • 32. The real challenge: understanding. We have gotten distracted by shiny toys: sequencing!! Data!! Data is now plentiful! But: we typically have no knowledge of what > 50% of an environmental metagenome “means”, functionally. Most data is not openly available, so we cannot mine correlations across data sets. Most computational science is not reproducible, so I can't reuse other people's tools or approaches.
  • 33. Data intensive biology & hypothesis generation. My interest in biological data is to enable better hypothesis generation.
  • 34. My interests. Open source ecosystem of analysis tools. Loosely coupled APIs for querying databases. Publishing reproducible and reusable analyses, openly. Education and training. “Platform perspective”
  • 35. Practical implications of diginorm. Data is (essentially) free; for some problems, analysis is now cheaper than data gathering (i.e. essentially free); …plus, we can run most of our approaches in the cloud.
  • 36. khmer-protocols. Effort to provide standard “cheap” assembly protocols for the cloud. Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set. Open, versioned, forkable, citable. [Diagram of pipeline stages: read cleaning; diginorm; assembly; annotation; RSEM differential expression.]
  • 37. CC0; BSD; on github; in reStructuredText.
  • 38. Can we incentivize data sharing? ~$100-$150/transcriptome in the cloud. Offer to analyze people's existing data for free, IFF they open it up within a year. See: “Dead Sea Scrolls & Open Marine Transcriptome Project” blog post; CephSeq white paper.
  • 39. “Research singularity”. The data a researcher generates in their lab constitutes an increasingly small component of the data used to reach a conclusion. Corollary: the true value of the data an individual investigator generates should be considered in the context of aggregate data. Even if we overcome the social barriers and incentivize sharing, we are, needless to say, not remotely prepared for sharing all the data.
  • 40.
  • 41. My interests. Open source ecosystem of analysis tools. Loosely coupled APIs for querying databases. Publishing reproducible and reusable analyses, openly. Education and training. “Platform perspective”
  • 42. IPython Notebook: data + code => IPython Notebook
  • 43. Acknowledgements. Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Camille Scott, Jordan Fish, Michael Crusoe, Leigh Sheneman. Collaborators: Jim Tiedje, MSU; Susannah Tringe and Janet Jansson (JGI, LBNL); Erich Schwarz, Caltech / Cornell; Paul Sternberg, Caltech; Robin Gasser, U. Melbourne; Weiming Li, MSU; Shana Goffredi, Occidental. Funding: USDA NIFA; NSF IOS; NIH; BEACON.

Editor's notes

  1. @@ change slide up => more complex diversity