Validating and improving the D. melanogaster reference genome sequence using PacBio de novo assemblies
1. Validating and improving the
D. melanogaster reference genome sequence
using PacBio de novo assemblies
Casey M. Bergman
@bergmanlab
@caseybergman
University of Liverpool
Centre for Genomic Research
PacBio Symposium
4 April 2014
2. Credits
• Danny Miller (Stowers Institute)
• Jane Landolin, Kristi Kim, Jason Chin & Edwin Hauw
(Pacific Biosciences)
• Sue Celniker & Roger Hoskins (Berkeley Drosophila
Genome Project)
• Sergey Koren & Adam Phillippy (National Biodefense
Analysis and Countermeasures Center)
• Raquel Linheiro (University of Manchester)
6. The strategy we have used is called chromosomal walking and jumping; it is
shown diagrammatically in Figure 1. The chromosomal origin of any non-repeated
segment of D. melanogaster DNA (Dm segment) can be determined by in situ
hybridization of that DNA to polytene chromosomes. When the sites of
hybridization are visualized by tritium autoradiography, the position is usually
confined to one or a few bands, which is similar to the precision of the cytological
localizations of rearrangement breakpoints or the localizations of well-mapped
genes. If a DNA sequence is found within a few bands of a gene of interest, that
sequence can be used as the starting point for a chromosomal walk to the gene. A
"step" in the walking procedure involves screening a recombinant DNA library of
random large Dm segments to collect those that overlap the starting point. The
[Figure 1 diagram; labels: START HERE, LEFT FUSION FRAGMENT, RIGHT FUSION FRAGMENT, INVERSION BREAK (x2)]
Fig. 1. The strategy for walking and jumping. The upper chromosome represents a portion of the
right arm of the third chromosome with normal cytology (drawn from the map of Bridges, 1941), and
the lower chromosome has an inversion of the region from 87E to 89E. A few steps of a chromosomal
walk are shown diagrammatically below the 87E region (not to scale with the chromosome). When the
walk reached the site of the inversion breakpoint, the DNA from that position could be used to
identify the two fusion fragments isolated from the inversion chromosome. The foreign DNA in the
fusion fragments (tandem circles) was homologous to normal chromosomal DNA at the right or distal
inversion breakpoint, and thus it served as the origin of a chromosomal walk in 89E.
e.g. Bender et al. (1983) PMID: 6410077
“The” Drosophila genome circa 1990
12. Release  Date      Total size of scaffolds  Total size of contigs  Contigs  Contig N50
    1        Mar 2000  116,117,226              114,201,085            1427     220,490
    2        Oct 2000  116,109,070              114,448,849            1103     318,193
    3        Dec 2002  116,781,562              116,739,493            50       14,289,516
    4        Apr 2004  118,357,599              118,348,386            28       18,203,742
    5        Mar 2006  120,381,546              120,290,946            14       21,485,538
Euchromatic genome assemblies
Several gaps persist in euchromatic arms
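As a rough illustration of the gaps that persist, the difference between total scaffold size and total contig size in the release table can be read as the number of bases sitting in gaps. This is a minimal sketch, assuming that scaffold size minus contig size approximates gap content; the `releases` dictionary and the `gap_bases` helper are illustrative names that simply restate the table values.

```python
# Scaffold and contig totals (nt) restated from the release table.
# Keys are release numbers; values are (total scaffolds, total contigs).
releases = {
    1: (116_117_226, 114_201_085),
    2: (116_109_070, 114_448_849),
    3: (116_781_562, 116_739_493),
    4: (118_357_599, 118_348_386),
    5: (120_381_546, 120_290_946),
}


def gap_bases(scaffold_total, contig_total):
    """Bases in scaffolds not covered by contigs (assumed here to be gaps)."""
    return scaffold_total - contig_total


gaps = {rel: gap_bases(s, c) for rel, (s, c) in releases.items()}

for rel, gap in gaps.items():
    print(f"Release {rel}: {gap:,} nt not covered by contigs")
```

Under this assumption the uncovered sequence shrinks by well over an order of magnitude between Release 1 and Release 5, but it never reaches zero.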
13. ~ 120 Mb of euchromatin
~ 60-100 Mb heterochromatin
“The” Drosophila genome since 2000
14. Hoskins et al. (2007) PMID: 17569867
Heterochromatic genome assemblies
~350 Kb
in Rel5
15. Release  Scaffolds            Total Size of Scaffolds  Contigs  Total Size of Contigs
    1        0                    0                        0        0
    2        1 (U)                7,513,406                1000     5,530,718
    3        2604                 20,941,991               3810     17,150,417
    4        0                    0                        0        0
    5        8 (U + armHet + mt)  19,350,335               3044     16,535,110
Majority of heterochromatin unassembled
Heterochromatic genome assemblies
16. Low coverage pilot experiment with Hawley Lab
http://bergmanlab.smith.man.ac.uk/?p=1971
17. High coverage experiment with PacBio & BDGP
http://blog.pacificbiosciences.com/2014/01/data-release-preliminary-de-novo.html
http://www.ncbi.nlm.nih.gov/sra/?term=SRP040522
18. Metric Value
Library Size (Kb) 15
Chemistry P5-C3
# SMRT cells 42
Run time (days) 6
# bases (nt) 15,208,567,933
# reads 1,514,730
avg length (nt) 10,040
N50 (nt) 14,214
Max (nt) 44,766
High coverage PacBio dataset for
D. melanogaster BDGP reference strain
20. >90x coverage based on reference mapping
http://bergmanlab.smith.man.ac.uk/?p=2176
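The coverage figure above comes from reference mapping. As a back-of-envelope cross-check, raw sequencing depth can also be estimated from the dataset totals on slide 18. The sketch below is an illustration only: the genome sizes (120, 180 and 220 Mb) are assumptions derived from the talk's rough estimates of ~120 Mb euchromatin plus ~60-100 Mb heterochromatin, and raw depth is not the same quantity as mapped coverage.

```python
TOTAL_BASES = 15_208_567_933  # "# bases (nt)" from the dataset table
TOTAL_READS = 1_514_730       # "# reads" from the dataset table


def mean_read_length(total_bases, total_reads):
    """Average read length in nt."""
    return total_bases / total_reads


def raw_depth(total_bases, genome_size):
    """Raw sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


avg_len = mean_read_length(TOTAL_BASES, TOTAL_READS)

# Assumed genome sizes in Mb: euchromatin only, and euchromatin plus the
# lower and upper heterochromatin estimates quoted in the talk.
depth_by_size = {
    size_mb: raw_depth(TOTAL_BASES, size_mb * 1_000_000)
    for size_mb in (120, 180, 220)
}

print(f"mean read length: {avg_len:.0f} nt")
for size_mb, depth in depth_by_size.items():
    print(f"assumed genome size {size_mb} Mb: about {depth:.0f}x raw depth")
```

The mean read length recovered this way agrees with the 10,040 nt reported in the table, which is a useful sanity check on the totals.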
21. PacBio-only assemblies of the
D. melanogaster genome
Assembly     Read Set     Pre-assembly  Assembler  Quivered
CA25x        25x longest  PBcR          CA 8.1     n
CA25x-Q      25x longest  PBcR          CA 8.1     y
CA50x        50x longest  PBcR          CA 8.1     n
FALCON-Q     25x longest  FALCON        FALCON     y
FALCON-PBcR  70x          PBcR          FALCON     n
FALCON-AWS   all          FALCON        FALCON     n
Koren & Phillippy (unpublished)
Chin & Bergman (unpublished)
22. Assembly     Contigs  Contig N50 (nt)  Max Contig (nt)
    CA25x        128      15,297,019       24,622,056
    CA25x-Q      128      15,305,620       24,648,237
    CA50x        131      4,105,199        24,577,947
    FALCON-Q     434      5,001,041        21,336,512
    FALCON-PBcR  1774     7,499,810        25,727,813
    FALCON-AWS   955      7,882,002        21,631,108
PacBio-only assemblies of the
D. melanogaster genome
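For readers unfamiliar with the Contig N50 column: N50 is the contig length such that contigs of that length or longer contain at least half of the assembled bases. The following is a minimal sketch of the standard definition, not the code used to produce the table; the function name `n50` and the toy lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy example: total is 290, half is 145, and 80 + 70 = 150 reaches it.
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```

Because N50 depends only on the length distribution, an assembly with more contigs can still report a higher N50 than one with fewer, as the FALCON and CA50x rows illustrate.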
23. Long-range contiguity of CA25x assembly
Koren & Phillippy (unpublished)
http://cbcb.umd.edu/software/PBcR/dmel.html
[Figure panels: X, 3R, 3L, 2L, 4, 2R]
28. Identification of Y-chromosome contigs in
PacBio assemblies by female/male depth ratio
[Figure: histogram of female/male depth ratio (x-axis) versus log10 frequency (y-axis) for chr2L, chr2R, chr3L, chr3R, chr4, chrX and chrYHet; annotated: bwa, short read DNA-seq]
Linheiro & Bergman (unpublished)
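The idea behind the female/male depth ratio can be sketched in a few lines. With XX females and XY males sequenced to similar depth, autosomal sequence is expected near a ratio of 1, X-linked sequence near 2, and Y-linked sequence near 0. The code below is an illustration under stated assumptions, not the authors' pipeline: the per-contig mean depths, the normalisation by the median autosomal ratio, and the thresholds `y_max` and `x_min` are all hypothetical choices.

```python
from statistics import median


def depth_ratios(female_depth, male_depth, autosomal_contigs):
    """Female/male depth ratio per contig, scaled so autosomes sit near 1.

    Contigs with zero male depth are skipped to avoid division by zero.
    """
    raw = {
        name: female_depth[name] / male_depth[name]
        for name in male_depth
        if male_depth[name] > 0
    }
    # Scaling by the median autosomal ratio corrects for unequal library depth.
    scale = median(raw[name] for name in autosomal_contigs)
    return {name: value / scale for name, value in raw.items()}


def classify(ratio, y_max=0.25, x_min=1.5):
    """Label a contig by its ratio; thresholds are illustrative assumptions."""
    if ratio <= y_max:
        return "Y-like"
    if ratio >= x_min:
        return "X-like"
    return "autosome-like"


# Toy mean depths per contig (hypothetical values, not from the talk).
female = {"ctgA": 30.0, "ctgB": 31.0, "ctgX": 30.0, "ctgY": 0.5}
male = {"ctgA": 30.0, "ctgB": 29.0, "ctgX": 15.0, "ctgY": 14.0}

ratios = depth_ratios(female, male, autosomal_contigs=["ctgA", "ctgB"])
labels = {name: classify(value) for name, value in ratios.items()}
print(labels)
```

In practice the thresholds would need to be chosen from the observed ratio distribution, since repeats shared between the Y chromosome and other chromosomes can give intermediate values.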
29. [Figure: female/male depth ratio along contig 0052_00|quiver|quiver; x-axis: location in chr (x10000), window (10Kb step); y-axis: female/male depth ratio; legend: A ratio, X ratio, Y ratio, X log 100 count, Y log 100 count]
Identification of Y-chromosome contigs in
PacBio Assemblies by female/male depth ratio
Linheiro & Bergman (unpublished)
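A per-window version of the same ratio, as plotted along a single contig, might look like the sketch below. It assumes per-base depth lists for the female and male samples that have already been normalised to comparable library depth, and it uses consecutive non-overlapping windows; the function name `windowed_ratio` and the toy depths are hypothetical.

```python
def windowed_ratio(female_cov, male_cov, window=10_000):
    """Female/male depth ratio in consecutive non-overlapping windows.

    female_cov and male_cov are per-base depths along one contig.
    Windows with no male coverage are reported as None.
    """
    if len(female_cov) != len(male_cov):
        raise ValueError("coverage lists must have the same length")
    ratios = []
    for start in range(0, len(male_cov), window):
        female_sum = sum(female_cov[start:start + window])
        male_sum = sum(male_cov[start:start + window])
        # The ratio of sums equals the ratio of mean depths within a window.
        ratios.append(female_sum / male_sum if male_sum > 0 else None)
    return ratios


# Toy contig of 8 positions with a window of 4 (hypothetical values):
# the first half has no female reads, the second half has double depth.
female_cov = [0] * 4 + [20] * 4
male_cov = [10] * 8
toy_windows = windowed_ratio(female_cov, male_cov, window=4)
print(toy_windows)
```

A contig whose windows all sit near 0 is a candidate Y-chromosome contig, whereas a contig that switches between values may deserve a closer look for misassembly or shared repeats.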
30. Improvement of the Y-chromosome
assembly & gene models
Celniker (unpublished)
31. Take Home (I)
• View of D. melanogaster genome has been changing
for >100 years & is still not complete
• Frontier of D. melanogaster genome assembly is in
heterochromatic regions (model for repeat-rich plant
genomes)
• PacBio long reads can be used to generate long-
range de novo assemblies that can close
euchromatic gaps & generate large heterochromatic
contigs
• Bioinformatic challenges: better pre-assembly
algorithms, better polishing algorithms, *.h5 data
archiving
32. Take Home (II)
• Early, open release of genomic data by small labs
can stimulate big returns & new collaborations
• PacBio has the right corporate philosophy of engaging/
collaborating with the genomics community (open
data, open source)