SlideShare una empresa de Scribd logo
1 de 61
Assembly I
C. Titus Brown
ctb@msu.edu
Aug 7, 2013
whoami?
Asst Prof, Microbiology & Computer
Science, Michigan State University
Some definitions
 Bioinformaticians write the software that
takes your perfectly good data and
produces bad clusters and assemblies.
Some definitions
 Bioinformaticians write the software that
takes your perfectly good data and
produces bad clusters and assemblies.
 Statisticians are the people who take your
perfectly good clusters and tell you that
your results are not statistically significant.
whoami?
 Primary scientific interest:
Effective analysis and generation of high-
quality biological hypotheses from sequencing
data.
 Now entirely dry lab.
whoami?
 Cross-cutting interests:
 Good (efficient, accurate, remixable)
software development, on the open
source model.
 Reproducibility.
 Open science; preprints, blogging, Twitter.
The Plan
 Th lecture: intro to metagenome assembly
 Th eve lab: k-mer & assembly basics
 Fri lecture/lab: assembling environmental
metagenomes
 Fri afternoon lab: doing metagenome
assembly
 Fri evening: reproduce, you fools!
Outline
1) Dealing with shotgun metagenomics
2) Short-read annotation & why/why not
3) Annotating soil reads – a story
4) Assembly stories!
Outline
1) Dealing with shotgun metagenomics
2) Short-read annotation & why/why not
3) Annotating soil reads – a story
4) Assembly stories!
Shotgun metagenomics
 Collect samples;
 Extract DNA;
 Feed into sequencer;
 Computationally analyze.
Wikipedia: Environmental shotgun
sequencing.png
FASTQ etc.
@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
Name
Quality score
Note: /1 and /2 => interleaved
Shotgun sequencing & assembly
Randomly fragment & sequence from DNA;
reassemble computationally.
UMD assembly primer (cbcb.umd.edu)
Dealing with shotgun
metagenome reads.
1) Annotate/analyze individual reads.
2) Assemble reads into
contigs, genomes, etc.
3) A middle ground (I’ll mention on Friday)
Outline
1) Dealing with shotgun metagenomics
2) Short-read annotation & why/why not
3) Annotating soil reads – a story
4) Assembly stories!
Annotating individual reads
 Works really well when you have EITHER
 (a) evolutionarily close references
 (b) rather long sequences
(This is obvious, right?)
Annotating individual reads #2
 We have found that this does not work well
with Illumina samples from unexplored
environments (e.g. soil).
 Sensitivity is fine (correct match is usually
there)
 Specificity is bad (correct match may be
drowned out by incorrect matches)
Recommendation:
For reads < 200-300 bp,
 Annotate individual reads for human-
associated samples, or exploration of well-
studied systems.
 For everything else, look to assembly.
So, why assemble?
 Increase your ability to assign
homology/orthology correctly!!
Essentially all functional annotation systems
depend on sequence similarity to assign
homology. This is why you want to assemble
your data.
Why else would you want to
assemble?
 Assemble new “reference”.
 Look for large-scale variation from
reference – pathogenicity islands, etc.
 Discriminate between different members of
gene families.
 Discover operon assemblages & annotate on
co-incidence of genes.
 Reduce size of data!!
Why don’t you want to assemble??
 Abundance threshold – low-coverage filter.
 Strain variation
 Chimerism
Outline
1) Dealing with shotgun metagenomics
2) Short-read annotation & why/why not
3) Annotating soil reads – a story
4) Assembly stories!
A story: looking at land
management with 454 shotgun
 Tracy Teal, Vicente Gomez-Alvarado, & Tom
Schmidt
 Ask detailed questions of @tracykteal on
Twitter, please :)
How do microbial communities change with
land management?
www.glbrc.org)
12)
Kellogg Biological Station LTER
el. Changes in soil organic matter re-
e difference between net C uptake by
and losses of carbon from crop harvest
om the microbial oxidation of crop
s and soil organic matter (22).
conventional tillage system exhibited
GWP of 114 g CO2 equivalents m 1
(Table 2). About half of this potential
ntributed by N2O production (52 g
uivalents m 2
year 1
), with an equiv-
mount (50 g CO2 equivalents m 2
) contributed by the combined effects
of fertilizer and lime. The CO2 cost of fuel
use was also significant but less than that of
either lime or fertilizer. No soil C accumulat-
ed in this system, nor did CH4 oxidation
significantly offset any GWP sources.
The net GWP of the no-till system (14 g
CO2 equivalents m 2
year 1
) was substan-
tially lower than that of the conventional
tillage system, mostly because of increased
C storage in no-till soils. Slightly lower
fuel costs were offset by somewhat higher
lime inputs and N2O fluxes. Intermediate to
CH4 oxidation
and N2O pro-
n (bottom) in
and perennial
g systems and
aged systems.
crops were
ed as conven-
cropping sys-
as no-till sys-
slow–chemical
systems, or as
systems (no
er or manure).
cessional sys-
wereeither nev-
d (NT) or his-
y tilled (HT)
establishment.
ems were rep-
three to four
on the same or
soil series;flux-
ere measured
e 1991–99 pe-
hereareno sig-
differences
05) amongbars
hare the same
on the basis of
s of variance.
es indicate av-
fluxes when in-
thesingleday of anomalously highfluxesintheno-till andlow-input systemsin1999and
espectively (15).
NPP), soil nitrogen availability, and soil organic carbon (30) among study sites (10). Values are
xcept that organic Cvaluesare 1999 means.
onFebruary18,2010www.sciencemag.orgDownloadedfrom
MethaneNitrousoxide
AG Conventional Agriculture
ES Early Successional
SF Successional Forest
DF Deciduous Forest
* * * *
*
* * *
Teal TK,Gomez-AlvarezV,Schmidt TM
Wednesday, August 7, 13
PCoA120.2%
PCoA28.9%
●
●
AG
ES
SF
DF
Functional potential changes
with land management
454 shotgun metagenomes annotated with MG-RAST
Analysis uses amatrix of the 7058 genes annotated
Teal TK,Gomez-AlvarezV,Schmidt TM
Wednesday, August 7, 13
PCoA120.2%
PCoA28.9%
●
●
AG
ES
SF
DF
Denitrification
Nitrogen metabolism contributes to the
differentiation of communities
Teal TK,Gomez-AlvarezV,Schmidt TM
Wednesday, August 7, 13
Denitrifying microbes
B
●
●
●
●
Nitrous oxide
production
differs with land
management
Teal TK,Gomez-AlvarezV,Schmidt TM
More denitrification potential in Ag soils
www.glbrc.org)
23)
Assessinggene abundanceg soils
23)
Abundance of housekeepinggenes
in the libraries consistent across
samples and different genes
Where lena is average gene length (924bp) and lenx is the average length of gene X.
# of annotated reads is used for normalization because it incorporates both library size & quality.
Z =
#of matchestogene X in libraryY
#of annotated readsin libraryY
*
lena
lenx
!
Teal TK,Gomez-AlvarezV,Schmidt TM
More denitrification potential in Ag soils
www.glbrc.org)
23)
Assessingdenitrification potential
(presence of denitrification genes)
Increased denitrification potential inAG and ESsites
Consistent across denitrification pathway
Teal TK,Gomez-AlvarezV,Schmidt TM
More than 20 year recovery for denitrifiers
Teal TK,Gomez-AlvarezV,Schmidt TM
Proportion of the community that are denitrifiers
changes with land management
Denitrification gene abundance is normalized to average housekeepinggene abundance. This is used
as an approximation of the proportion of the community that has that gene
(assuming single copy housekeepingand target genes).
e.g.It can be seen here that inAG ~20%of the community has the nirK gene versus ~12%in DF
High denitrifer diversity
High denitrification diversity
Reads annotated as nirK blasted against reference database of diverse nirKs and each read assigned
to one of three clades. Phylogeny of nirKs is challenging and BLAST matches,especially since we’re
using varying nirK regions is inexact,so analysis limited to this clade level.
Only CladeA captured in standard PCR surveys
Teal TK,Gomez-AlvarezV,Schmidt TM
Wednesday, August 7, 13
Annotating soil reads - thoughts
 Possible to find well-known genes using long
(454) reads.
 Normalize for organism abundance!
 Primer independence can be important!
 Note, replicates give you error bars…
A few assembly stories
 Low complexity/Osedax symbionts
 High complexity/soil.
Osedax symbionts
Table S1 ! Specimens, and collection sites, used in this study
Site Dive1
Date Time Zone
(months)
# of
specimens
Osedax frankpressi
with Rs1 symbiont
2890m T486 Oct 2002 8 2
T610 Aug 2003 18 3
T1069 Jan 2007 59 2
DR010 Mar 2009 85 3
DR098 Nov 2009 93 4
DR204 Oct 2010 104 1
DR234 Jun 2011 112 2
1820m T1048 Oct 2006 7 3
T1071 Jan 2007 10 3
DR012 Mar 2009 36 4
DR2362
Jun 2011 63 2 
O. frankpressi from DR236
1
Dive numbers begin with the remotely operated vehicle name; T= Tiburon, DR = Doc
Ricketts (both owned and operated by the Monterey Bay Aquarium Research Institute) .
2
Samples used for genomic analysis were collected during dive DR236.
Goffredi et al., submitted.
16s
shotgun
Metagenomic assembly followed
by binning enabled isolation of
fairly complete genomes
Assembly allowed genomic
content comparisons to nearest
cultured relative
Osedax assembly story
Conclusions include:
 Osedax symbionts have genes needed for
free living stage;
 metabolic versatility in
carbon, phosphate, and iron uptake
strategies;
 Genome includes mechanisms for
intracellular survival, and numerous
potential virulence capabilities.
Osedax assembly story
 Low diversity metagenome!
 Physical isolation => MDA => sequencing =>
diginorm => binning =>
 94% complete Rs1
 66-89% complete Rs2
 Note: many interesting critters are hard to
isolate => so, basically, metagenomes.
Human-associated communities
 “Time series community genomics analysis
reveals rapid shifts in bacterial
species, strains, and phage during infant gut
colonization.” Sharon et al. (Banfield lab);
Genome Res. 2013 23: 111-120
Setup
 Collected 11 fecal samples from premature
female, days 15-24.
 260m 100-bp paired-end Illumina HiSeq
reads; 400 and 900 base fragments.
 Assembled > 96% of reads into contigs >
500bp; 8 complete genomes; reconstructed
genes down to 0.05% of population
abundance.
Sharon et al., 2013; pmid 22936250
Key strategy: abundance binning
Bin reads by k-mer
abundance
Assemble most
abundance bin
Remove reads that
map to assembly
Sharon et al., 2013; pmid 22936250
Sharon et al., 2013; pmid 22936250
Tracking abundance
Sharon et al., 2013; pmid 22936250
Conclusions
 Recovered strain variation, phage
variation, abundance variation, lateral gene
transfer.
 Claim that “recovered genomes are superior
to draft genomes generated in most isolate
genome sequencing projects.”
Great Prairie Grand Challenge -
soil
 What ecosystem level functions are present, and how do
microbes do them?
 How does agricultural soil differ from native soil?
 How does soil respond to climate perturbation?
 Questions that are not easy to answer without shotgun
sequencing:
 What kind of strain-level heterogeneity is present in the
population?
 What does the phage and viral population look like?
 What species are where?
A “Grand Challenge” dataset (DOE/JGI)
0
100
200
300
400
500
600
Iowa,
Continuous
corn
Iowa, Native
Prairie
Kansas,
Cultivated
corn
Kansas,
Native
Prairie
Wisconsin,
Continuous
corn
Wisconsin,
Native
Prairie
Wisconsin,
Restored
Prairie
Wisconsin,
Switchgrass
BasepairsofSequencing(Gbp)
GAII HiSeq
Rumen (Hess et. al, 2011), 268 Gbp
MetaHIT (Qin et. al, 2011), 578 Gbp
NCBI nr database,
37 Gbp
Total: 1,846 Gbp soil metagenome
Rumen K-mer Filtered,
111 Gbp
Approach 1: Digital normalization
(a computational version of library normalization)
Suppose you have
a dilution factor
of A (10) to B(1).
To get 10x of B
you need to get
100x of A!
Overkill!!
This 100x will
consume disk
space
and, because of
errors, memory.
We can discard it
for you…
Approach 2: Data partitioning
(a computational version of cell sorting)
Split reads into “bins”
belonging to different
source species.
Can do this based almost
entirely on connectivity
of sequences.
“Divide and conquer”
Memory-efficient
implementation helps
to scale assembly.
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Assembly results for Iowa corn and prairie
(2x ~300 Gbp soil metagenomes)
Total
Assembly
Total Contigs
(> 300 bp)
% Reads
Assembled
Predicted
protein
coding
2.5 bill 4.5 mill 19% 5.3 mill
3.5 bill 5.9 mill 22% 6.8 mill
Adina Howe
Resulting contigs are low coverage.
Figure11: Coverage(median basepair) distribution of assembled contigsfrom soil metagenomes.
Corn Prairie
Iowa prairie & corn - very even.
Taxonomy– Iowa prairie
Note: this is predicted taxonomy of contigs w/o considering abundance
or length. (MG-RAST)
Taxonomy – Iowa corn
Note: this is predicted taxonomy of contigs w/o
considering abundance or length. (MG-RAST)
Strain variation?
Toptwoallele
frequencies
Position within contig
Of 5000 most
abundant
contigs, only 1
has a
polymorphism
rate > 5%
Can measure by
read mapping.
Concluding thoughts on assembly
(I)
 There’s no standard approach yet; almost
every paper uses a specialized pipeline of
some sort.
 More like genome assembly
 But unlike transcriptomics…
Concluding thoughts on assembly
(II)
 Anecdotally, everyone worries about strain
variation.
 Some groups (e.g. Banfield, us) have found that
this is not a problem in their system so far.
 Others (viral metagenomes! HMP!) have found
this to be a big concern.
Concluding thoughts on assembly
(III)
 Some groups have found metagenome assembly to
be very useful.
 Others (us! soil!) have not yet proven its utility.
Questions that will be addressed
tomorrow morning.
 How much sequencing should I do?
 How do I evaluate metagenome assemblies?
 Which assembler is best?
High coverage is critical.
Low coverage is the dominant
problem blocking assembly of
metagenomes
Materials
 In addition to materials for this class, see:
http://software-carpentry.org/v4/
http://ged.msu.edu/angus/

Más contenido relacionado

La actualidad más candente

2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocols
c.titus.brown
 
[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomics[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomics
Mads Albertsen
 

La actualidad más candente (20)

2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
Basics of Genome Assembly
Basics of Genome Assembly Basics of Genome Assembly
Basics of Genome Assembly
 
Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and MetagenomicsCross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
Cross-Kingdom Standards in Genomics, Epigenomics and Metagenomics
 
2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocols
 
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
 
Bayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation MetagenomicsBayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation Metagenomics
 
Sweden_eemis_big_data
Sweden_eemis_big_dataSweden_eemis_big_data
Sweden_eemis_big_data
 
2014 davis-talk
2014 davis-talk2014 davis-talk
2014 davis-talk
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
Jan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carrollJan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carroll
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomics
 
[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomics[2013.10.29] albertsen genomics metagenomics
[2013.10.29] albertsen genomics metagenomics
 
Jan2016 bio nano han cao
Jan2016 bio nano han caoJan2016 bio nano han cao
Jan2016 bio nano han cao
 
Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
Aug2015 analysis team spiral genetics
Aug2015 analysis team spiral geneticsAug2015 analysis team spiral genetics
Aug2015 analysis team spiral genetics
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
What's in a name? Better vocabularies = better bioinformatics?
What's in a name? Better vocabularies = better bioinformatics?What's in a name? Better vocabularies = better bioinformatics?
What's in a name? Better vocabularies = better bioinformatics?
 

Destacado

Actions You Can Take In Volatile Market Linkedin
Actions You Can Take In Volatile Market LinkedinActions You Can Take In Volatile Market Linkedin
Actions You Can Take In Volatile Market Linkedin
goldglove41
 
Printemps
PrintempsPrintemps
Printemps
JURY
 
Ma R Ia Mo Nt E Ss Or I
Ma R Ia Mo Nt E Ss Or IMa R Ia Mo Nt E Ss Or I
Ma R Ia Mo Nt E Ss Or I
guest5f4c783
 
Urinary System #4
Urinary System #4Urinary System #4
Urinary System #4
avlainich
 
Laurence
LaurenceLaurence
Laurence
JURY
 
Ashleigh and Sarah's: Killer Whales
Ashleigh and Sarah's: Killer WhalesAshleigh and Sarah's: Killer Whales
Ashleigh and Sarah's: Killer Whales
Takahe One
 
Ekonomik i̇stikrar
Ekonomik i̇stikrarEkonomik i̇stikrar
Ekonomik i̇stikrar
sadettin
 

Destacado (20)

PROCESS elementary
PROCESS elementaryPROCESS elementary
PROCESS elementary
 
Actions You Can Take In Volatile Market Linkedin
Actions You Can Take In Volatile Market LinkedinActions You Can Take In Volatile Market Linkedin
Actions You Can Take In Volatile Market Linkedin
 
Printemps
PrintempsPrintemps
Printemps
 
Alegrijesy rebujos daycare
Alegrijesy rebujos daycareAlegrijesy rebujos daycare
Alegrijesy rebujos daycare
 
Ma R Ia Mo Nt E Ss Or I
Ma R Ia Mo Nt E Ss Or IMa R Ia Mo Nt E Ss Or I
Ma R Ia Mo Nt E Ss Or I
 
Flourishing in Life as a Lawyer
Flourishing in Life as a LawyerFlourishing in Life as a Lawyer
Flourishing in Life as a Lawyer
 
Urinary System #4
Urinary System #4Urinary System #4
Urinary System #4
 
Eterna Si Fascinanta Romanie
Eterna Si Fascinanta RomanieEterna Si Fascinanta Romanie
Eterna Si Fascinanta Romanie
 
Social Media Evidence
Social Media EvidenceSocial Media Evidence
Social Media Evidence
 
Coke
CokeCoke
Coke
 
Laurence
LaurenceLaurence
Laurence
 
Absence Makes You a Goner: Dealing with Employee Leave
Absence Makes You a Goner: Dealing with Employee LeaveAbsence Makes You a Goner: Dealing with Employee Leave
Absence Makes You a Goner: Dealing with Employee Leave
 
Scaling metagenome assembly
Scaling metagenome assemblyScaling metagenome assembly
Scaling metagenome assembly
 
It's Time Summit 2012
It's Time Summit 2012It's Time Summit 2012
It's Time Summit 2012
 
House Hunting fun
House Hunting funHouse Hunting fun
House Hunting fun
 
Ashleigh and Sarah's: Killer Whales
Ashleigh and Sarah's: Killer WhalesAshleigh and Sarah's: Killer Whales
Ashleigh and Sarah's: Killer Whales
 
extremal matchmoving
extremal matchmovingextremal matchmoving
extremal matchmoving
 
Ekonomik i̇stikrar
Ekonomik i̇stikrarEkonomik i̇stikrar
Ekonomik i̇stikrar
 
Light Duty, Good Faith Job Offers + Transitional Work
Light Duty, Good Faith Job Offers + Transitional WorkLight Duty, Good Faith Job Offers + Transitional Work
Light Duty, Good Faith Job Offers + Transitional Work
 
Future Developments + Regulations
Future Developments + RegulationsFuture Developments + Regulations
Future Developments + Regulations
 

Similar a 2013 stamps-intro-assembly

2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
c.titus.brown
 
21 genomes and their evolution
21   genomes and their evolution21   genomes and their evolution
21 genomes and their evolution
Renee Ariesen
 

Similar a 2013 stamps-intro-assembly (20)

2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
Statistical analyses services | Pharmaceutical writing companies | Statistica...
Statistical analyses services | Pharmaceutical writing companies | Statistica...Statistical analyses services | Pharmaceutical writing companies | Statistica...
Statistical analyses services | Pharmaceutical writing companies | Statistica...
 
Big Data Field Museum
Big Data Field MuseumBig Data Field Museum
Big Data Field Museum
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...
Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...
Tom Delmont: From the Terragenome Project to Global Metagenomic Comparisons: ...
 
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome SequencingMicrobial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
 
Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...
Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...
Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...
 
ppgardner-lecture03-genomesize-complexity.pdf
ppgardner-lecture03-genomesize-complexity.pdfppgardner-lecture03-genomesize-complexity.pdf
ppgardner-lecture03-genomesize-complexity.pdf
 
2014 naples
2014 naples2014 naples
2014 naples
 
Utility of transcriptome sequencing for phylogenetic
Utility of transcriptome sequencing for phylogeneticUtility of transcriptome sequencing for phylogenetic
Utility of transcriptome sequencing for phylogenetic
 
2014 villefranche
2014 villefranche2014 villefranche
2014 villefranche
 
Iowa State Bioinformatics BCB Symposium 2018 - There and Back Again
Iowa State Bioinformatics BCB Symposium 2018 - There and Back AgainIowa State Bioinformatics BCB Symposium 2018 - There and Back Again
Iowa State Bioinformatics BCB Symposium 2018 - There and Back Again
 
Genome Assembly 2018
Genome Assembly 2018Genome Assembly 2018
Genome Assembly 2018
 
21 genomes and their evolution
21   genomes and their evolution21   genomes and their evolution
21 genomes and their evolution
 
Biology Literature Review Example
Biology Literature Review ExampleBiology Literature Review Example
Biology Literature Review Example
 
Biology Literature Review Example
Biology Literature Review ExampleBiology Literature Review Example
Biology Literature Review Example
 
The Emerging Global Community of Microbial Metagenomics Researchers
The Emerging Global Community of Microbial Metagenomics ResearchersThe Emerging Global Community of Microbial Metagenomics Researchers
The Emerging Global Community of Microbial Metagenomics Researchers
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
EVE 161 Winter 2018 Class 10
EVE 161 Winter 2018 Class 10EVE 161 Winter 2018 Class 10
EVE 161 Winter 2018 Class 10
 
New generation Sequencing
New generation Sequencing New generation Sequencing
New generation Sequencing
 

Más de c.titus.brown

Más de c.titus.brown (20)

2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Último (20)

A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

2013 stamps-intro-assembly

  • 1. Assembly I C. Titus Brown ctb@msu.edu Aug 7, 2013
  • 2. whoami? Asst Prof, Microbiology & Computer Science, Michigan State University
  • 3. Some definitions  Bioinformaticians write the software that takes your perfectly good data and produces bad clusters and assemblies.
  • 4. Some definitions  Bioinformaticians write the software that takes your perfectly good data and produces bad clusters and assemblies.  Statisticians are the people who take your perfectly good clusters and tell you that your results are not statistically significant.
  • 5. whoami?  Primary scientific interest: Effective analysis and generation of high- quality biological hypotheses from sequencing data.  Now entirely dry lab.
  • 6. whoami?  Cross-cutting interests:  Good (efficient, accurate, remixable) software development, on the open source model.  Reproducibility.  Open science; preprints, blogging, Twitter.
  • 7. The Plan  Th lecture: intro to metagenome assembly  Th eve lab: k-mer & assembly basics  Fri lecture/lab: assembling environmental metagenomes  Fri afternoon lab: doing metagenome assembly  Fri evening: reproduce, you fools!
  • 8. Outline 1) Dealing with shotgun metagenomics 2) Short-read annotation & why/why not 3) Annotating soil reads – a story 4) Assembly stories!
  • 9. Outline 1) Dealing with shotgun metagenomics 2) Short-read annotation & why/why not 3) Annotating soil reads – a story 4) Assembly stories!
  • 10. Shotgun metagenomics  Collect samples;  Extract DNA;  Feed into sequencer;  Computationally analyze. Wikipedia: Environmental shotgun sequencing.png
  • 12. Shotgun sequencing & assembly Randomly fragment & sequence from DNA; reassemble computationally. UMD assembly primer (cbcb.umd.edu)
  • 13. Dealing with shotgun metagenome reads. 1) Annotate/analyze individual reads. 2) Assemble reads into contigs, genomes, etc. 3) A middle ground (I’ll mention on Friday)
  • 14. Outline 1) Dealing with shotgun metagenomics 2) Short-read annotation & why/why not 3) Annotating soil reads – a story 4) Assembly stories!
  • 15. Annotating individual reads  Works really well when you have EITHER  (a) evolutionarily close references  (b) rather long sequences (This is obvious, right?)
  • 16. Annotating individual reads #2  We have found that this does not work well with Illumina samples from unexplored environments (e.g. soil).  Sensitivity is fine (correct match is usually there)  Specificity is bad (correct match may be drowned out by incorrect matches)
  • 17. Recommendation: For reads < 200-300 bp,  Annotate individual reads for human- associated samples, or exploration of well- studied systems.  For everything else, look to assembly.
  • 18. So, why assemble?  Increase your ability to assign homology/orthology correctly!! Essentially all functional annotation systems depend on sequence similarity to assign homology. This is why you want to assemble your data.
  • 19. Why else would you want to assemble?  Assemble new “reference”.  Look for large-scale variation from reference – pathogenicity islands, etc.  Discriminate between different members of gene families.  Discover operon assemblages & annotate on co-incidence of genes.  Reduce size of data!!
  • 20. Why don’t you want to assemble??  Abundance threshold – low-coverage filter.  Strain variation  Chimerism
  • 21. Outline 1) Dealing with shotgun metagenomics 2) Short-read annotation & why/why not 3) Annotating soil reads – a story 4) Assembly stories!
  • 22. A story: looking at land management with 454 shotgun  Tracy Teal, Vicente Gomez-Alvarado, & Tom Schmidt  Ask detailed questions of @tracykteal on Twitter, please :)
  • 23. How do microbial communities change with land management? www.glbrc.org) 12) Kellogg Biological Station LTER el. Changes in soil organic matter re- e difference between net C uptake by and losses of carbon from crop harvest om the microbial oxidation of crop s and soil organic matter (22). conventional tillage system exhibited GWP of 114 g CO2 equivalents m 1 (Table 2). About half of this potential ntributed by N2O production (52 g uivalents m 2 year 1 ), with an equiv- mount (50 g CO2 equivalents m 2 ) contributed by the combined effects of fertilizer and lime. The CO2 cost of fuel use was also significant but less than that of either lime or fertilizer. No soil C accumulat- ed in this system, nor did CH4 oxidation significantly offset any GWP sources. The net GWP of the no-till system (14 g CO2 equivalents m 2 year 1 ) was substan- tially lower than that of the conventional tillage system, mostly because of increased C storage in no-till soils. Slightly lower fuel costs were offset by somewhat higher lime inputs and N2O fluxes. Intermediate to CH4 oxidation and N2O pro- n (bottom) in and perennial g systems and aged systems. crops were ed as conven- cropping sys- as no-till sys- slow–chemical systems, or as systems (no er or manure). cessional sys- wereeither nev- d (NT) or his- y tilled (HT) establishment. ems were rep- three to four on the same or soil series;flux- ere measured e 1991–99 pe- hereareno sig- differences 05) amongbars hare the same on the basis of s of variance. es indicate av- fluxes when in- thesingleday of anomalously highfluxesintheno-till andlow-input systemsin1999and espectively (15). NPP), soil nitrogen availability, and soil organic carbon (30) among study sites (10). Values are xcept that organic Cvaluesare 1999 means. onFebruary18,2010www.sciencemag.orgDownloadedfrom MethaneNitrousoxide AG Conventional Agriculture ES Early Successional SF Successional Forest DF Deciduous Forest * * * * * * * * Teal TK,Gomez-AlvarezV,Schmidt TM Wednesday, August 7, 13
  • 24. PCoA120.2% PCoA28.9% ● ● AG ES SF DF Functional potential changes with land management 454 shotgun metagenomes annotated with MG-RAST Analysis uses amatrix of the 7058 genes annotated Teal TK,Gomez-AlvarezV,Schmidt TM Wednesday, August 7, 13
  • 25. PCoA120.2% PCoA28.9% ● ● AG ES SF DF Denitrification Nitrogen metabolism contributes to the differentiation of communities Teal TK,Gomez-AlvarezV,Schmidt TM Wednesday, August 7, 13
  • 26. Denitrifying microbes B ● ● ● ● Nitrous oxide production differs with land management Teal TK,Gomez-AlvarezV,Schmidt TM
  • 27. More denitrification potential in Ag soils www.glbrc.org) 23) Assessinggene abundanceg soils 23) Abundance of housekeepinggenes in the libraries consistent across samples and different genes Where lena is average gene length (924bp) and lenx is the average length of gene X. # of annotated reads is used for normalization because it incorporates both library size & quality. Z = #of matchestogene X in libraryY #of annotated readsin libraryY * lena lenx ! Teal TK,Gomez-AlvarezV,Schmidt TM
  • 28. More denitrification potential in Ag soils www.glbrc.org) 23) Assessingdenitrification potential (presence of denitrification genes) Increased denitrification potential inAG and ESsites Consistent across denitrification pathway Teal TK,Gomez-AlvarezV,Schmidt TM
  • 29. More than 20 year recovery for denitrifiers Teal TK,Gomez-AlvarezV,Schmidt TM Proportion of the community that are denitrifiers changes with land management Denitrification gene abundance is normalized to average housekeepinggene abundance. This is used as an approximation of the proportion of the community that has that gene (assuming single copy housekeepingand target genes). e.g.It can be seen here that inAG ~20%of the community has the nirK gene versus ~12%in DF
  • 30. High denitrifer diversity High denitrification diversity Reads annotated as nirK blasted against reference database of diverse nirKs and each read assigned to one of three clades. Phylogeny of nirKs is challenging and BLAST matches,especially since we’re using varying nirK regions is inexact,so analysis limited to this clade level. Only CladeA captured in standard PCR surveys Teal TK,Gomez-AlvarezV,Schmidt TM Wednesday, August 7, 13
  • 31. Annotating soil reads - thoughts  Possible to find well-known genes using long (454) reads.  Normalize for organism abundance!  Primer independence can be important!  Note, replicates give you error bars…
  • 32. A few assembly stories  Low complexity/Osedax symbionts  High complexity/soil.
  • 33. Osedax symbionts Table S1 ! Specimens, and collection sites, used in this study Site Dive1 Date Time Zone (months) # of specimens Osedax frankpressi with Rs1 symbiont 2890m T486 Oct 2002 8 2 T610 Aug 2003 18 3 T1069 Jan 2007 59 2 DR010 Mar 2009 85 3 DR098 Nov 2009 93 4 DR204 Oct 2010 104 1 DR234 Jun 2011 112 2 1820m T1048 Oct 2006 7 3 T1071 Jan 2007 10 3 DR012 Mar 2009 36 4 DR2362 Jun 2011 63 2  O. frankpressi from DR236 1 Dive numbers begin with the remotely operated vehicle name; T= Tiburon, DR = Doc Ricketts (both owned and operated by the Monterey Bay Aquarium Research Institute) . 2 Samples used for genomic analysis were collected during dive DR236. Goffredi et al., submitted.
  • 35. Metagenomic assembly followed by binning enabled isolation of fairly complete genomes
  • 36. Assembly allowed genomic content comparisons to nearest cultured relative
  • 37. Osedax assembly story Conclusions include:  Osedax symbionts have genes needed for free living stage;  metabolic versatility in carbon, phosphate, and iron uptake strategies;  Genome includes mechanisms for intracellular survival, and numerous potential virulence capabilities.
  • 38. Osedax assembly story  Low diversity metagenome!  Physical isolation => MDA => sequencing => diginorm => binning =>  94% complete Rs1  66-89% complete Rs2  Note: many interesting critters are hard to isolate => so, basically, metagenomes.
  • 39. Human-associated communities  “Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization.” Sharon et al. (Banfield lab); Genome Res. 2013 23: 111-120
  • 40. Setup  Collected 11 fecal samples from premature female, days 15-24.  260m 100-bp paired-end Illumina HiSeq reads; 400 and 900 base fragments.  Assembled > 96% of reads into contigs > 500bp; 8 complete genomes; reconstructed genes down to 0.05% of population abundance.
  • 41. Sharon et al., 2013; pmid 22936250
  • 42. Key strategy: abundance binning Bin reads by k-mer abundance Assemble most abundance bin Remove reads that map to assembly Sharon et al., 2013; pmid 22936250
  • 43. Sharon et al., 2013; pmid 22936250
  • 44. Tracking abundance Sharon et al., 2013; pmid 22936250
  • 45. Conclusions  Recovered strain variation, phage variation, abundance variation, lateral gene transfer.  Claim that “recovered genomes are superior to draft genomes generated in most isolate genome sequencing projects.”
  • 46. Great Prairie Grand Challenge - soil  What ecosystem level functions are present, and how do microbes do them?  How does agricultural soil differ from native soil?  How does soil respond to climate perturbation?  Questions that are not easy to answer without shotgun sequencing:  What kind of strain-level heterogeneity is present in the population?  What does the phage and viral population look like?  What species are where?
  • 47. A “Grand Challenge” dataset (DOE/JGI) 0 100 200 300 400 500 600 Iowa, Continuous corn Iowa, Native Prairie Kansas, Cultivated corn Kansas, Native Prairie Wisconsin, Continuous corn Wisconsin, Native Prairie Wisconsin, Restored Prairie Wisconsin, Switchgrass BasepairsofSequencing(Gbp) GAII HiSeq Rumen (Hess et. al, 2011), 268 Gbp MetaHIT (Qin et. al, 2011), 578 Gbp NCBI nr database, 37 Gbp Total: 1,846 Gbp soil metagenome Rumen K-mer Filtered, 111 Gbp
  • 48. Approach 1: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
  • 49. Approach 2: Data partitioning (a computational version of cell sorting) Split reads into “bins” belonging to different source species. Can do this based almost entirely on connectivity of sequences. “Divide and conquer” Memory-efficient implementation helps to scale assembly.
  • 50. Putting it in perspective: Total equivalent of ~1200 bacterial genomes Human genome ~3 billion bp Assembly results for Iowa corn and prairie (2x ~300 Gbp soil metagenomes) Total Assembly Total Contigs (> 300 bp) % Reads Assembled Predicted protein coding 2.5 bill 4.5 mill 19% 5.3 mill 3.5 bill 5.9 mill 22% 6.8 mill Adina Howe
  • 51. Resulting contigs are low coverage. Figure11: Coverage(median basepair) distribution of assembled contigsfrom soil metagenomes.
  • 52. Corn Prairie Iowa prairie & corn - very even.
  • 53. Taxonomy– Iowa prairie Note: this is predicted taxonomy of contigs w/o considering abundance or length. (MG-RAST)
  • 54. Taxonomy – Iowa corn Note: this is predicted taxonomy of contigs w/o considering abundance or length. (MG-RAST)
  • 55. Strain variation? Toptwoallele frequencies Position within contig Of 5000 most abundant contigs, only 1 has a polymorphism rate > 5% Can measure by read mapping.
  • 56. Concluding thoughts on assembly (I)  There’s no standard approach yet; almost every paper uses a specialized pipeline of some sort.  More like genome assembly  But unlike transcriptomics…
  • 57. Concluding thoughts on assembly (II)  Anecdotally, everyone worries about strain variation.  Some groups (e.g. Banfield, us) have found that this is not a problem in their system so far.  Others (viral metagenomes! HMP!) have found this to be a big concern.
  • 58. Concluding thoughts on assembly (III)  Some groups have found metagenome assembly to be very useful.  Others (us! soil!) have not yet proven its utility.
  • 59. Questions that will be addressed tomorrow morning.  How much sequencing should I do?  How do I evaluate metagenome assemblies?  Which assembler is best?
  • 60. High coverage is critical. Low coverage is the dominant problem blocking assembly of metagenomes
  • 61. Materials  In addition to materials for this class, see: http://software-carpentry.org/v4/ http://ged.msu.edu/angus/

Notas del editor

  1. Both BLAST. Shotgun: annotated reads only.
  2. Beware of Janet Jansson bearing data.
  3. Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to e.g. marine free spawning animals.