SlideShare una empresa de Scribd logo
1 de 66
Descargar para leer sin conexión
De novo Genome Assembly
A/Prof Torsten Seemann
Winter School in Mathematical & Computational Biology - Brisbane, Australia - 4 July 2016
Introduction
The human genome has 47 pieces
MT
(or XY)
The shortest piece is 48,000,000 bp
Total haploid size
3,200,000,000 bp
(3.2 Gbp)
Total diploid size
6,400,000,000 bp
(6.4 Gbp)
∑ = 6,400,000,000 bp
Human DNA iSequencer ™ 46 chromosomal
and1 mitochondrial
sequences
In an ideal world ...
AGTCTAGGATTCGCTATAG
ATTCAGGCTCTGATATATT
TCGCGGCATTAGCTAGAGA
TCTCGAGATTCGTCCCAGT
CTAGGATTCGCTAT
AAGTCTAAGATTC...
The real world ( for now )
Short
fragments
Reads are stored in “FASTQ” files
Genome
Sequencing
Assemble Compare
Genome assembly
(the red pill)
De novo genome assembly
Ideally, one sequence per replicon.
Millions of short
sequences
(reads)
A few long
sequences
(contigs)
Reconstruct the original genome
from the sequence reads only
De novo genome assembly
“From scratch”
An example
A small “genome”
Friends,
Romans,
countrymen,
lend me your ears;
I’ll return
them
tomorrow!
• Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
Shakespearomics
Whoops!
I dropped
them.
• Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
• Overlaps
Friends, Rom
ds, Romans, count
ns, countrymen, le
crymen, lend me
send me your ears;
Shakespearomics
I am good
with words.
• Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
• Overlaps
Friends, Rom
ds, Romans, count
ns, countrymen, le
crymen, lend me
send me your ears;
• Majority consensus
Friends, Romans, countrymen, lend me your ears; (1 contig)
Shakespearomics
We have
reached a
consensus !
Overlap - Layout - Consensus
Amplified DNA
Shear DNA
Sequenced reads
Overlaps
Layout
Consensus ↠ “Contigs”
Assembly graphs
Overlap graph
Another example is this
Overlaps find
Do the graph traverse
Size matters not. Look at me. Judge me by my size, do you?
Size matters not. Look at me. Judge me by my soze, do you?
2 supporting reads
1 supporting reads
So far, so good.
Why is it so hard?
What makes a jigsaw puzzle hard?
Repetitive
regions
Lots of
pieces
Missing
pieces
Multiple
copies
No corners
(circular
genomes)
No box
Dirty
pieces
Frayed
pieces : .,
What makes genome assembly hard
Size of the human genome = 3.2 x 109
bp (3,200,000,000)
Typical short read length = 102
bp (100)
A puzzle with millions to billions of pieces
Storing in RAM is a challenge ⇢ “succinct data structure”
1. Many pieces (read length is very short compared to the genome)
What makes genome assembly hard
Finding overlaps means
examining every pair of reads
Comparisons = N×(N-1)/2
~ N2
Lots of smart tricks to reduce
this close to ~N
2. Lots of overlaps
What makes genome assembly hard
ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATA
TATATATA TATATATA TATATATA TATATATA TATATATA
TATATATA TATATATA TATATATA
TATATATA TATATATA TATATATA
3. Lots of sky (short repeats)
How could we possibly assemble this segment of DNA?
All the reads are the same!
What makes genome assembly hard
Read 1: GGAACCTTTGGCCCTGT
Read 2: GGCGCTGTCCATTTTAGAAACC
4. Dirty pieces (sequencing errors)
1. The error in Read 2 might prevent us from seeing that it
overlaps with Read 1
2. How would we know which is the correct sequence?
What makes genome assembly hard
5. Multiple copies (long repeats)
Our old nemesis the REPEAT !
Gene Copy 1 Gene Copy 2
Repeats
What is a repeat?
A segment of DNA
that occurs more than once
in the genome
Major classes of repeats in the human genome
Repeat Class Arrangement Coverage (Hg) Length (bp)
Satellite (micro, mini) Tandem 3% 2-100
SINE Interspersed 15% 100-300
Transposable elements Interspersed 12% 200-5k
LINE Interspersed 21% 500-8k
rDNA Tandem 0.01% 2k-43k
Segmental Duplications Tandem or
Interspersed
0.2% 1k-100k
Repeats
Repeat copy 1 Repeat copy 2
Collapsed repeat consensus
1 locus
4 contigs
Repeats are hubs in the graph
Long reads can span repeats
Repeat copy 1 Repeat copy 2
long
reads
1 contig
Draft vs Finished genomes (bacteria)
250 bp - Illumina - $250 12,000 bp - Pacbio - $2500
The two laws of repeats
1. It is impossible to resolve repeats of length L
unless you have reads longer than L.
2. It is impossible to resolve repeats of length L
unless you have reads longer than L.
How much data do we need?
Coverage vs Depth
Coverage
Fraction of the genome
sequenced by
at least one read
Depth
Average number of reads
that cover any given region
Intuitively more reads should increase coverage and depth
Example: Read length = ⅛ Genome Length
Coverage = ⅜
Depth = ⅜
Genome
Reads
Example: Read length = ⅛ Genome Length
Coverage = 4.5/8
Depth = 5/8
Genome
Reads
Newly Covered Regions = 1.5 Reads
Example: Read length = ⅛ Genome Length
Coverage = 5.2/8
Depth = 7/8
Genome
Reads
Newly Covered Regions = 0.7 Reads
Example: Read length = ⅛ Genome Length
Coverage = 6.7/8
Depth = 10/8
Genome
Reads
Newly Covered Regions = 1.5 Reads
Depth is easy to calculate
Depth = N × L / G
Depth = Y / G
N = Number of reads
L = Length of a read
G = Genome length
Y = Sequence yield (N x L)
Example: Coverage of the Tarantula genome
“The size estimate of the tarantula
genome based on k-mer analysis is 6 Gb
and we sequenced at 40x coverage from
a single female A. geniculata”
Sangaard et al 2014. Nature Comm (5) p1-11
Depth = N × L / G
40 = N x 100 / (6 Billion)
N = 40 * 6 Billion / 100
N = 2.4 Billion
Coverage and depth are related
Approximate
formula for
coverage
assuming
random reads
Coverage = 1 - e-Depth
Much more sequencing needed in reality
● Sequencing is not random
○ GC and AT rich regions
are under represented
○ Other chemistry quirks
● More depth needed for:
○ sequencing errors
○ polymorphisms
Assessing assemblies
Contiguity
Completeness
Correctness
Contiguity
● Desire
○ Fewer contigs
○ Longer contigs
● Metrics
○ Number of contigs
○ Average contig length
○ Median contig length
○ Maximum contig length
○ “N50”, “NG50”, “D50”
Contiguity: the N50 statistic
Completeness : Total size
Proportion of the original genome represented by the assembly
Can be between 0 and 1
Proportion of estimated genome size
… but estimates are not perfect
Completeness: core genes
Proportion of coding sequences can be estimated based on
known core genes thought to be present in a wide variety of
organisms.
Assumes that the proportion of assembled genes is equal to the
proportion of assembled core genes.
Number of Core Genes in
Assembly
Number of Core Genes in
Database
In the past this was done with a tool called CEGMA
There is a new tool for this called BUSCO
Correctness
Proportion of the assembly that is free from errors
Errors include
1. Mis-joins
2. Repeat compressions
3. Unnecessary duplications
4. Indels / SNPs caused by assembler
Correctness: check for self consistency
● Align all the reads back to the contigs
● Look for inconsistencies
Original Reads
Align
Assembly
Mapped Read
Assemble ALL the things
Why assemble genomes
● Produce a reference for new species
● Genomic variation can be studied by comparing
against the reference
● A wide variety of molecular biological tools become
available or more effective
● Coding sequences can be studied in the context of their
related non-coding (eg regulatory) sequences
● High level genome structure (number, arrangement of
genes and repeats) can be studied
Vast majority of genomes remain unsequenced
Estimated number of species ~ 9 Million
Whole genome sequences (including
incomplete drafts) ~ 3000
Estimated number of species Millions to
Billions
Genomic sequences ~ 150,000
Eukaryotes Bacteria & Archaea
Every individual has a different genome
Some diseases such as cancer give rise to massive genome level variation
Not just genomes
● Transcriptomes
○ One contig for every isoform
○ Do not expect uniform coverage
● Metagenomes
○ Mixture of different organisms
○ Host, bacteria, virus, fungi all at once
○ All different depths
● Metatranscriptomes
○ Combination of above!
Conclusions
Take home points
● De novo assembly is the process of reconstructing a long
sequence from many short ones
● Represented as a mathematical “overlap graph”
● Assembly is very challenging (“impossible”) because
○ sequencing bias under represents certain regions
○ Reads are short relative to genome size
○ Repeats create tangled hubs in the assembly graph
○ Sequencing errors cause detours and bubbles in the assembly graph
Contact
tseemann.github.io
torsten.seemann@gmail.com
@torstenseemann
The End

Más contenido relacionado

La actualidad más candente

Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
hemantbreeder
 

La actualidad más candente (20)

RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
sequence of file formats in bioinformatics
sequence of file formats in bioinformaticssequence of file formats in bioinformatics
sequence of file formats in bioinformatics
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
COMPARATIVE GENOMICS.ppt
COMPARATIVE GENOMICS.pptCOMPARATIVE GENOMICS.ppt
COMPARATIVE GENOMICS.ppt
 
Transcriptomics approaches
Transcriptomics approachesTranscriptomics approaches
Transcriptomics approaches
 
Transcriptome analysis
Transcriptome analysisTranscriptome analysis
Transcriptome analysis
 
genome sequencing, types by kk sahu sir
genome sequencing, types by kk sahu sirgenome sequencing, types by kk sahu sir
genome sequencing, types by kk sahu sir
 
PacBio SMRT - THIRD GENERATION SEQUENCING TECHNIQUE
PacBio SMRT - THIRD GENERATION SEQUENCING TECHNIQUEPacBio SMRT - THIRD GENERATION SEQUENCING TECHNIQUE
PacBio SMRT - THIRD GENERATION SEQUENCING TECHNIQUE
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
 
Genomic databases
Genomic databasesGenomic databases
Genomic databases
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
System biology and its tools
System biology and its toolsSystem biology and its tools
System biology and its tools
 
Genome annotation
Genome annotationGenome annotation
Genome annotation
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
 
RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities
 
Kegg
KeggKegg
Kegg
 

Destacado

Project report-on-bio-informatics
Project report-on-bio-informaticsProject report-on-bio-informatics
Project report-on-bio-informatics
Daniela Rotariu
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformatics
Abhishek Vatsa
 

Destacado (12)

Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1
 
Project report-on-bio-informatics
Project report-on-bio-informaticsProject report-on-bio-informatics
Project report-on-bio-informatics
 
Bioinformatics A Biased Overview
Bioinformatics A Biased OverviewBioinformatics A Biased Overview
Bioinformatics A Biased Overview
 
Bioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big DataBioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big Data
 
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura AdamMapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adam
 
DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification
 
Formal languages to map Genotype to Phenotype in Natural Genomes
Formal languages to map Genotype to Phenotype in Natural GenomesFormal languages to map Genotype to Phenotype in Natural Genomes
Formal languages to map Genotype to Phenotype in Natural Genomes
 
Ap Chapter 21
Ap Chapter 21Ap Chapter 21
Ap Chapter 21
 
Molecular Markers: Major Applications in Insects
Molecular Markers: Major Applications in InsectsMolecular Markers: Major Applications in Insects
Molecular Markers: Major Applications in Insects
 
How to be a bioinformatician
How to be a bioinformaticianHow to be a bioinformatician
How to be a bioinformatician
 
Gene concept
Gene conceptGene concept
Gene concept
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformatics
 

Similar a De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
c.titus.brown
 
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Integrated DNA Technologies
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08
Computer Science Club
 

Similar a De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016 (20)

GENOME_STRUCTURE1.ppt
GENOME_STRUCTURE1.pptGENOME_STRUCTURE1.ppt
GENOME_STRUCTURE1.ppt
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
1_7_genome_1.ppt
1_7_genome_1.ppt1_7_genome_1.ppt
1_7_genome_1.ppt
 
Lecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsLecture on the annotation of transposable elements
Lecture on the annotation of transposable elements
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
Cot Curve_Dr. Sonia.pdf
Cot Curve_Dr. Sonia.pdfCot Curve_Dr. Sonia.pdf
Cot Curve_Dr. Sonia.pdf
 
Apolo Taller en BIOS
Apolo Taller en BIOS Apolo Taller en BIOS
Apolo Taller en BIOS
 
Sept2016 sv 10_x
Sept2016 sv 10_xSept2016 sv 10_x
Sept2016 sv 10_x
 
Apollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityApollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research community
 
NGSWorkflows.ppt
NGSWorkflows.pptNGSWorkflows.ppt
NGSWorkflows.ppt
 
Genome structure
Genome structure Genome structure
Genome structure
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
 
Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012
 
DNA as Storage Medium
DNA as Storage MediumDNA as Storage Medium
DNA as Storage Medium
 
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
 
DNA sequencing
DNA sequencing  DNA sequencing
DNA sequencing
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08
 
Genome
GenomeGenome
Genome
 

Más de Torsten Seemann

De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015
Torsten Seemann
 

Más de Torsten Seemann (20)

How to write bioinformatics software no one will use
How to write bioinformatics software no one will useHow to write bioinformatics software no one will use
How to write bioinformatics software no one will use
 
How to write bioinformatics software people will use and cite - t.seemann - ...
How to write bioinformatics software people will use and cite -  t.seemann - ...How to write bioinformatics software people will use and cite -  t.seemann - ...
How to write bioinformatics software people will use and cite - t.seemann - ...
 
Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016
 
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
 
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...
 
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
 
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...
Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...
 
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
 
De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015
 
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
 
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
 
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
 
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
 
Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015
 
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
 
Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
 
Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013
 
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
 
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
 

Último

POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Silpa
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
ANSARKHAN96
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 

Último (20)

FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
GBSN - Biochemistry (Unit 2) Basic concept of organic chemistry
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
 
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
Human & Veterinary Respiratory Physilogy_DR.E.Muralinath_Associate Professor....
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 

De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au - 4 july 2016

  • 1. De novo Genome Assembly A/Prof Torsten Seemann Winter School in Mathematical & Computational Biology - Brisbane, Australia - 4 July 2016
  • 3. The human genome has 47 pieces MT (or XY)
  • 4. The shortest piece is 48,000,000 bp Total haploid size 3,200,000,000 bp (3.2 Gbp) Total diploid size 6,400,000,000 bp (6.4 Gbp) ∑ = 6,400,000,000 bp
  • 5. Human DNA iSequencer ™ 46 chromosomal and1 mitochondrial sequences In an ideal world ... AGTCTAGGATTCGCTATAG ATTCAGGCTCTGATATATT TCGCGGCATTAGCTAGAGA TCTCGAGATTCGTCCCAGT CTAGGATTCGCTAT AAGTCTAAGATTC...
  • 6. The real world ( for now ) Short fragments Reads are stored in “FASTQ” files Genome Sequencing
  • 7.
  • 10. De novo genome assembly Ideally, one sequence per replicon. Millions of short sequences (reads) A few long sequences (contigs) Reconstruct the original genome from the sequence reads only
  • 11. De novo genome assembly “From scratch”
  • 13. A small “genome” Friends, Romans, countrymen, lend me your ears; I’ll return them tomorrow!
  • 14. • Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me Shakespearomics Whoops! I dropped them.
  • 15. • Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me • Overlaps Friends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears; Shakespearomics I am good with words.
  • 16. • Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me • Overlaps Friends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears; • Majority consensus Friends, Romans, countrymen, lend me your ears; (1 contig) Shakespearomics We have reached a consensus !
  • 17. Overlap - Layout - Consensus Amplified DNA Shear DNA Sequenced reads Overlaps Layout Consensus ↠ “Contigs”
  • 22. Do the graph traverse Size matters not. Look at me. Judge me by my size, do you? Size matters not. Look at me. Judge me by my soze, do you? 2 supporting reads 1 supporting reads
  • 23. So far, so good.
  • 24.
  • 25. Why is it so hard?
  • 26. What makes a jigsaw puzzle hard? Repetitive regions Lots of pieces Missing pieces Multiple copies No corners (circular genomes) No box Dirty pieces Frayed pieces : .,
  • 27. What makes genome assembly hard Size of the human genome = 3.2 x 109 bp (3,200,000,000) Typical short read length = 102 bp (100) A puzzle with millions to billions of pieces Storing in RAM is a challenge ⇢ “succinct data structure” 1. Many pieces (read length is very short compared to the genome)
  • 28. What makes genome assembly hard Finding overlaps means examining every pair of reads Comparisons = N×(N-1)/2 ~ N2 Lots of smart tricks to reduce this close to ~N 2. Lots of overlaps
  • 29. What makes genome assembly hard ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA 3. Lots of sky (short repeats) How could we possibly assemble this segment of DNA? All the reads are the same!
  • 30. What makes genome assembly hard Read 1: GGAACCTTTGGCCCTGT Read 2: GGCGCTGTCCATTTTAGAAACC 4. Dirty pieces (sequencing errors) 1. The error in Read 2 might prevent us from seeing that it overlaps with Read 1 2. How would we know which is the correct sequence?
  • 31. What makes genome assembly hard 5. Multiple copies (long repeats) Our old nemesis the REPEAT ! Gene Copy 1 Gene Copy 2
  • 33. What is a repeat? A segment of DNA that occurs more than once in the genome
  • 34. Major classes of repeats in the human genome Repeat Class Arrangement Coverage (Hg) Length (bp) Satellite (micro, mini) Tandem 3% 2-100 SINE Interspersed 15% 100-300 Transposable elements Interspersed 12% 200-5k LINE Interspersed 21% 500-8k rDNA Tandem 0.01% 2k-43k Segmental Duplications Tandem or Interspersed 0.2% 1k-100k
  • 35. Repeats Repeat copy 1 Repeat copy 2 Collapsed repeat consensus 1 locus 4 contigs
  • 36. Repeats are hubs in the graph
  • 37. Long reads can span repeats Repeat copy 1 Repeat copy 2 long reads 1 contig
  • 38. Draft vs Finished genomes (bacteria) 250 bp - Illumina - $250 12,000 bp - Pacbio - $2500
  • 39. The two laws of repeats 1. It is impossible to resolve repeats of length L unless you have reads longer than L. 2. It is impossible to resolve repeats of length L unless you have reads longer than L.
  • 40. How much data do we need?
  • 41. Coverage vs Depth Coverage Fraction of the genome sequenced by at least one read Depth Average number of reads that cover any given region Intuitively more reads should increase coverage and depth
  • 42. Example: Read length = ⅛ Genome Length Coverage = ⅜ Depth = ⅜ Genome Reads
  • 43. Example: Read length = ⅛ Genome Length Coverage = 4.5/8 Depth = 5/8 Genome Reads Newly Covered Regions = 1.5 Reads
  • 44. Example: Read length = ⅛ Genome Length Coverage = 5.2/8 Depth = 7/8 Genome Reads Newly Covered Regions = 0.7 Reads
  • 45. Example: Read length = ⅛ Genome Length Coverage = 6.7/8 Depth = 10/8 Genome Reads Newly Covered Regions = 1.5 Reads
  • 46. Depth is easy to calculate Depth = N × L / G Depth = Y / G N = Number of reads L = Length of a read G = Genome length Y = Sequence yield (N x L)
  • 47. Example: Coverage of the Tarantula genome “The size estimate of the tarantula genome based on k-mer analysis is 6 Gb and we sequenced at 40x coverage from a single female A. geniculata” Sangaard et al 2014. Nature Comm (5) p1-11 Depth = N × L / G 40 = N x 100 / (6 Billion) N = 40 * 6 Billion / 100 N = 2.4 Billion
  • 48. Coverage and depth are related Approximate formula for coverage assuming random reads Coverage = 1 - e-Depth
  • 49. Much more sequencing needed in reality ● Sequencing is not random ○ GC and AT rich regions are under represented ○ Other chemistry quirks ● More depth needed for: ○ sequencing errors ○ polymorphisms
  • 52. Contiguity ● Desire ○ Fewer contigs ○ Longer contigs ● Metrics ○ Number of contigs ○ Average contig length ○ Median contig length ○ Maximum contig length ○ “N50”, “NG50”, “D50”
  • 53. Contiguity: the N50 statistic
  • 54. Completeness : Total size Proportion of the original genome represented by the assembly Can be between 0 and 1 Proportion of estimated genome size … but estimates are not perfect
  • 55. Completeness: core genes Proportion of coding sequences can be estimated based on known core genes thought to be present in a wide variety of organisms. Assumes that the proportion of assembled genes is equal to the proportion of assembled core genes. Number of Core Genes in Assembly Number of Core Genes in Database In the past this was done with a tool called CEGMA There is a new tool for this called BUSCO
  • 56. Correctness Proportion of the assembly that is free from errors Errors include 1. Mis-joins 2. Repeat compressions 3. Unnecessary duplications 4. Indels / SNPs caused by assembler
  • 57. Correctness: check for self consistency ● Align all the reads back to the contigs ● Look for inconsistencies Original Reads Align Assembly Mapped Read
  • 59. Why assemble genomes ● Produce a reference for new species ● Genomic variation can be studied by comparing against the reference ● A wide variety of molecular biological tools become available or more effective ● Coding sequences can be studied in the context of their related non-coding (eg regulatory) sequences ● High level genome structure (number, arrangement of genes and repeats) can be studied
  • 60. Vast majority of genomes remain unsequenced Estimated number of species ~ 9 Million Whole genome sequences (including incomplete drafts) ~ 3000 Estimated number of species Millions to Billions Genomic sequences ~ 150,000 Eukaryotes Bacteria & Archaea Every individual has a different genome Some diseases such as cancer give rise to massive genome level variation
  • 61.
  • 62. Not just genomes ● Transcriptomes ○ One contig for every isoform ○ Do not expect uniform coverage ● Metagenomes ○ Mixture of different organisms ○ Host, bacteria, virus, fungi all at once ○ All different depths ● Metatranscriptomes ○ Combination of above!
  • 64. Take home points ● De novo assembly is the process of reconstructing a long sequence from many short ones ● Represented as a mathematical “overlap graph” ● Assembly is very challenging (“impossible”) because ○ sequencing bias under represents certain regions ○ Reads are short relative to genome size ○ Repeats create tangled hubs in the assembly graph ○ Sequencing errors cause detours and bubbles in the assembly graph