1. Whole genome restriction maps for
nonmodel organisms: genomic
resources where there were none.
Sue Brown
Division of Biology
Kansas State University
Tuesday, February 25, 14
2. Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome
maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results
3. Genomes
• Genomes come in many sizes
• Genome assemblies come in many qualities
• Draft Assemblies
▫ Most genomes sequenced today (nonmodel)
• Finished Assemblies
▫ Model organisms (lots of resources)
Human
Computational
Genetic and genomic tools
• Genomic resources increase the value of the
genome sequence
▫ Reverse genetic approaches
4. Many initiatives to sequence genomes
• 1,000 human genomes
▫ To provide a deep catalog of human genetic
variation
• Genome 10K: started as an initiative to
sequence 10,000 vertebrate genomes. The database
currently catalogs specimens from over 16,000
organisms
▫ To understand how complex animal life evolved
through changes in DNA and use this knowledge to
become better stewards of the planet
6. Why sequence 5,000 insect genomes?
• 53% of all living species
• Maintenance and productivity of natural and agricultural
ecosystems
• Consume or damage 25% of all agricultural, forestry and
livestock production
▫ >$30 Billion in annual loss
• Vector plant, animal and human disease
▫ >$50 Billion cost world wide
• Just as human and veterinary medicine now rely on
personal or animal genome info, revealing info stored in
their genomes will transform our ability to manage
insects that threaten our health, food supply and
economic security
• Improve our lives
7. Standard Draft Genome Assemblies
• Highly fragmented, even at deep coverage
• Scaffolds terminate in repetitive regions
• Relatively low N50 values
• Example:
• 7x Sanger-based Tribolium castaneum
genome assembly
8. Tribolium castaneum genomics
• Cot analysis
▫ Genome ~200Mb
▫ Long stretches of unique sequence
▫ Low methylation
• 9 autosomes, X and Y
Jeff Stuart, Purdue
9. Standard Draft
Minimally or unfiltered data, from any number of
different sequencing platforms, that are assembled
into contiguous strings of bases (AGTC), with no gaps
(contigs).
This is the minimum
standard for submission
to public databases.
Science Oct 9, 2009 pp236-237
http://compbio.pbworks.com
10. Molecular linkage map used to anchor scaffolds
in chromosome builds (ChLG)
Low X coverage, no Y, marker density varies
12. T. castaneum assembly stats
• Number of contigs: 8,814
• Contig N50: 43,511 bp
• Number of scaffolds: 481
• Scaffold N50: 975,455 bp
• Total number of chromosomes: 10 (-Y)
• Unmapped scaffolds: 352
• Single contig scaffolds: 1,835
• 2,321 scaffolds in total when single-contig scaffolds are counted with the 481 multi-contig scaffolds
13. Scaffold structure of the Tribolium
genome assembly
[Figure. Labels: ChLG, NW, AAJJ, 300K Ns; Unanchored, DS, AAJJ]
14. Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome
maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results
15. Genome assembly improvements
An independent platform to validate and improve genomes
[...] genetic recombination map to the assembly scaffolds,
anchoring greater than 90% of the assembled sequence [1] (Fig. 1).
To improve this draft assembly, we constructed physical maps of the
T. castaneum genome. Using the [...] orientation of scaffolds have
been corrected, and scaffolds have been extended by spanning
repetitive regions.
[1] Nature 2008 452:949-55.
Figure 2. Genome refinements
• T. castaneum 3.0: Baylor Sanger 7x draft assembly and molecular genetic map
▫ length (Mb): 160.466
▫ scaffolds: 2321
▫ scaffold N50 (Mb): 0.98
▫ multicontig scaffolds: 481
• T. castaneum 4.0: Illumina long-distance jump libraries were used to extend scaffolds into gaps and to capture gaps with Atlas gap-link and gap-filler
▫ length (Mb): 160.862
▫ scaffolds: 2219
▫ scaffold N50 (Mb): 1.16
▫ multicontig scaffolds: 411
• T. castaneum 4.0 and gam-ngs: gam-ngs merged the Illumina assembly and T. cas 4.0, extending several unknowns and an LGX scaffold
▫ length (Mb): 160.864
▫ scaffolds: 2219
▫ scaffold N50 (Mb): 1.16
▫ multicontig scaffolds: 411
• T. castaneum 4.0 and gam-ngs plus BioNano maps: sequence scaffolds were aligned to maps with IrysView; the alignment was filtered and used to create new scaffolds
▫ length (Mb): 189.629
▫ scaffolds: 2153
▫ scaffold N50 (Mb): 3.31
▫ multicontig scaffolds: 341
[Figure panels: Assembly; Mis-assemblies; ChLG3]
16. How to validate a de novo assembly?
• Describe assembly
▫ # contigs, # scaffolds, total bases, N50 lengths,
coverage, # ESTs, # orthologs found
• But is the assembly accurate?
▫ Compare to BAC sequences
▫ If you have the resources
• Need independent (reasonably priced) method
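The descriptive statistics listed on this slide can be computed directly from sequence lengths. Below is a minimal sketch of the standard N50 definition; the function names and the toy lengths are made up for illustration and are not taken from the Tribolium assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def describe_assembly(lengths):
    """Collect the basic descriptive statistics for a list of lengths."""
    return {
        "count": len(lengths),
        "total_bases": sum(lengths),
        "n50": n50(lengths),
    }


# Toy scaffold lengths (hypothetical, for illustration only).
toy_scaffolds = [100, 80, 50, 30, 20, 10, 10]
stats = describe_assembly(toy_scaffolds)
print(stats)
```

These numbers describe contiguity only; they say nothing about whether the joins are correct, which is the point of the next bullet.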
17. Genome maps based on landmarks
• BioNano Genomics
▫ San Diego, California
• Imaging ultra-long molecules of DNA
• Labeled at restriction sites
18. Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome
maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results
22. Samples loaded into 2 flow cells per chip
• 3 lasers, 3 detection channels
• Detect YOYO-1 in DNA backbone
• Fluorescent nucleotides at labeled sites
25. A long repeat in the Tribolium genome
26. Mapping individual images back to map
• Regions flanking repeat are unique
• Some sites are polymorphic
27. Limitations of the Irys system
• Sample prep is very specific
• Requires gram amounts of starting material
• Bacterial cells, tissue culture cells, eukaryotic
nuclei
• Less complex tissue is best
▫ Blood
▫ Embryos
• Not applicable to transcriptomics projects
• contig N50 >30Kb (5 restriction sites)
28. Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome
maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results
30. Align BNG maps to in silico maps (.xmap)
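An in silico map is the list of positions where the labeling enzyme's recognition motif occurs in a sequence scaffold. The sketch below is one simple way to build such a list. The motifs GCTCTTC (BspQI) and CCTCAGC (BbvCI) are the enzymes' recognition sequences; the function names and the toy sequence are hypothetical, and the sketch reports motif start positions rather than exact nick positions.

```python
import re

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def in_silico_sites(seq, motifs):
    """Return sorted 1-based start positions of each motif on either strand."""
    seq = seq.upper()
    positions = set()
    for motif in motifs:
        motif = motif.upper()
        # Search the motif and its reverse complement so both strands count.
        for pattern in {motif, reverse_complement(motif)}:
            # A lookahead lets overlapping occurrences be reported too.
            for match in re.finditer("(?=" + re.escape(pattern) + ")", seq):
                positions.add(match.start() + 1)
    return sorted(positions)


# Toy scaffold (hypothetical) containing one site per pattern searched.
toy_scaffold = "AAGCTCTTCAAAGAAGAGCTTCCTCAGCAA"
sites = in_silico_sites(toy_scaffold, ["GCTCTTC", "CCTCAGC"])
print(sites)
```

The resulting positions play the role of the label positions in a cmap, which is what the BNG maps are aligned against.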
31. File formats are similar to generating sequence data...
Sequence data: Image files > basecall > fastq > de novo assemble > fasta > align > sam
BNG data: Image files > call labels > bnx > de novo assemble > cmap > align > xmap

fastq
@SRR014849.2 EIXKN4201AKDUH/2
TCAAGTGGTGAACGGCAGAAA
+
<=B:==B:=<?6=B;<;=B=)

fasta
>contig1
TCAAGTGGTGAACGGCAGAAA

sam
HWI-ST330_C0NEHACXX:2:1101:17113:52802#0 69 contig1 2578 0 * = 2578 0 ATTACGGCCCATGGTTCAGAATAATGACGAATAGAAATACTAGTACTATATCCCCTAAAAAA <@CFFFFFHHGFHJHIJJJJJJJJJFJJJFGFHEHIHGHJGIJHIIIJJJJJJJJIJIIJIH YT:Z:UP

bnx
0 21 202146.4
1 1096.2 8973.8
QX11 10.0565 11.7966
QX12 0.0187 0.0604

cmap
#h CMapId ContigLength NumSites SiteID LabelChannel Position StdDev Coverage Occurrence
#f int float int int int float float int int
393 225073.2 21 1 1 20.0 0.0 3 3

xmap
#h XmapEntryID QryContigID RefcontigID QryStartPos QryEndPos RefStartPos RefEndPos Orientation Confidence HitEnum
#f int int int float float float float string float string
1 94 1 444392.7 5839.8 57024.0 550038.8 - 28.87 1M1D2M3I4D1M3I2M1I7M1I1M1I9M1I1M1I2M1I3M1D2M
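The xmap record on this slide can be read with a few lines of code. This is a minimal sketch that assumes whitespace-separated fields in the ten-column order given by the #h line above; the function names are hypothetical. It treats HitEnum as a CIGAR-like string of M, D and I runs and only tallies them, without interpreting what each operation means biologically.

```python
import re

# Column names and types copied from the #h and #f lines on the slide.
XMAP_COLUMNS = [
    ("XmapEntryID", int), ("QryContigID", int), ("RefcontigID", int),
    ("QryStartPos", float), ("QryEndPos", float),
    ("RefStartPos", float), ("RefEndPos", float),
    ("Orientation", str), ("Confidence", float), ("HitEnum", str),
]


def parse_xmap_line(line):
    """Turn one xmap data line into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(XMAP_COLUMNS):
        raise ValueError("expected %d fields, got %d" % (len(XMAP_COLUMNS), len(fields)))
    return {name: cast(value) for (name, cast), value in zip(XMAP_COLUMNS, fields)}


def read_xmap(lines):
    """Parse data lines, skipping blank lines and '#' header lines."""
    return [parse_xmap_line(l) for l in lines if l.strip() and not l.startswith("#")]


def summarize_hit_enum(hit_enum):
    """Total the run lengths for each operation letter in a HitEnum string."""
    totals = {"M": 0, "D": 0, "I": 0}
    for count, op in re.findall(r"(\d+)([MDI])", hit_enum):
        totals[op] += int(count)
    return totals


example = [
    "#h XmapEntryID QryContigID RefcontigID QryStartPos QryEndPos "
    "RefStartPos RefEndPos Orientation Confidence HitEnum",
    "1 94 1 444392.7 5839.8 57024.0 550038.8 - 28.87 "
    "1M1D2M3I4D1M3I2M1I7M1I1M1I9M1I1M1I2M1I3M1D2M",
]
records = read_xmap(example)
hit_totals = summarize_hit_enum(records[0]["HitEnum"])
print(records[0]["Confidence"], hit_totals)
```

The Confidence column parsed here is the score the alignment filtering step works on.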
32. Visualizing an xmap
[Figure labels: contig id; sequence-based scaffold; label alignment; BioNano contig map; coverage]
33. Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome
maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results
35. K-INBRE i5K Github scripts:
Irys scaffolding scripts and manuals written by Jennifer Shelton and Nic Herndon
Assembly workflow was developed by Ernest Lam (BioNano)
git clone https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
36. Assembly pipeline
• Developed with Ernest Lam (BioNano)
• Scripts available at the i5K-KINBRE script share on GitHub: Irys-scaffolding
https://github.com/i5K-KINBRE-script-share/Irys-scaffolding
37. Filtering alignments
Label density varies throughout the
genome so we created scripts to filter in
two passes:
Pass 1: looks for high confidence
score over at least ~30% of the total
possible alignment
Pass 2: looks for low confidence
score over the majority of the total
possible alignment (~90%)
Pass 1 finds most high quality
alignments.
Pass 2 finds high-quality low-density
alignments.
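The two passes can be sketched as a single filter that keeps an alignment if it satisfies either rule. This is an illustration only, not the Irys-scaffolding implementation: the class, the field names and the two confidence cutoffs (40 and 15) are hypothetical placeholders, while the ~30% and ~90% fractions come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    scaffold_id: str
    confidence: float
    aligned_length: float   # length actually aligned
    possible_length: float  # longest alignment possible for this pair

    @property
    def aligned_fraction(self):
        if self.possible_length <= 0:
            raise ValueError("possible_length must be positive")
        return self.aligned_length / self.possible_length


def two_pass_filter(alignments, high_conf=40.0, low_conf=15.0,
                    pass1_min_fraction=0.30, pass2_min_fraction=0.90):
    """Keep alignments that pass either rule.

    Pass 1: high confidence over at least ~30% of the possible alignment.
    Pass 2: lower confidence over ~90% of the possible alignment.
    The confidence cutoffs are assumed values, not taken from the slides.
    """
    kept = []
    for aln in alignments:
        pass1 = aln.confidence >= high_conf and aln.aligned_fraction >= pass1_min_fraction
        pass2 = aln.confidence >= low_conf and aln.aligned_fraction >= pass2_min_fraction
        if pass1 or pass2:
            kept.append(aln)
    return kept


# Hypothetical alignments for illustration.
candidates = [
    Alignment("a", 50.0, 40.0, 100.0),  # passes pass 1
    Alignment("b", 20.0, 95.0, 100.0),  # passes pass 2 (low density, long)
    Alignment("c", 20.0, 50.0, 100.0),  # fails both
    Alignment("d", 10.0, 99.0, 100.0),  # confidence too low for either pass
]
kept_ids = [aln.scaffold_id for aln in two_pass_filter(candidates)]
print(kept_ids)
```

Accepting either rule is what lets regions with few labels through: they cannot reach a high score, but they can align along nearly their whole length.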
38. Filtering alignments
Super-scaffolded scaffolds are joined
in a new reference fasta file.
Overlapping scaffolds have a 30 bp
spacing gap between them
If a scaffold aligns more than once
only the longest alignment is used
If two alignments have the same
length only the highest confidence
alignment is used
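The selection and joining rules on this slide can be sketched as follows. The function names and dict keys are hypothetical. One assumption goes beyond the slide: scaffolds that do not overlap are separated by a run of Ns equal to the gap length estimated from the map, while overlapping scaffolds get the 30 bp spacer described above.

```python
def pick_best_alignments(alignments):
    """Keep one alignment per scaffold: the longest, ties broken by confidence."""
    best = {}
    for aln in alignments:
        key = (aln["aligned_length"], aln["confidence"])
        current = best.get(aln["scaffold_id"])
        if current is None or key > (current["aligned_length"], current["confidence"]):
            best[aln["scaffold_id"]] = aln
    return best


def join_scaffolds(ordered, spacer=30):
    """Join (sequence, gap_to_next) pairs given in map order.

    gap_to_next is the estimated distance in bp to the next scaffold;
    zero or negative means the scaffolds overlap, so the fixed spacer is
    used instead. The gap value of the last scaffold is ignored.
    """
    parts = []
    for index, (sequence, gap) in enumerate(ordered):
        parts.append(sequence)
        if index < len(ordered) - 1:
            parts.append("N" * (gap if gap > 0 else spacer))
    return "".join(parts)


# Hypothetical alignments: s1 aligns twice, s2 has two equal-length hits.
alignments = [
    {"scaffold_id": "s1", "aligned_length": 500.0, "confidence": 20.0},
    {"scaffold_id": "s1", "aligned_length": 800.0, "confidence": 12.0},
    {"scaffold_id": "s2", "aligned_length": 300.0, "confidence": 9.0},
    {"scaffold_id": "s2", "aligned_length": 300.0, "confidence": 15.0},
]
best = pick_best_alignments(alignments)
joined = join_scaffolds([("ACGT", 5), ("GGCC", -12), ("TTAA", 0)])
print(best["s1"]["aligned_length"], best["s2"]["confidence"], len(joined))
```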
39. Outline
• de novo genome assembly and i5K
• Improving assemblies with Bionano genome
maps
▫ Irys system
▫ File formats
▫ Assembly pipeline
▫ Alignment filtering
• Results
40. BNG restriction maps for T. castaneum
• Dual nicked with Nt.BspQI and Nb.BbvCI
• 28.6 Gb of molecules (>150 Kb) = ~143x coverage of the 200 Mb Tribolium genome
• N contigs: 216
• Total Contig Len (Mb): 200.473
• Avg. Contig Len (Mb): 0.928
• Contig N50 (Mb): 1.350
• Total Ref Len (Mb): 157.186
• Total Contig Len / Ref Len : 1.275
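The derived figures on this slide follow from the raw ones by simple division. A short check, using only the values listed above (variable names are made up for the sketch):

```python
# Values taken from the slide.
molecule_bases = 28.6e9        # 28.6 Gb of molecules
genome_size = 200e6            # ~200 Mb Tribolium genome
n_contigs = 216
total_contig_len_mb = 200.473
total_ref_len_mb = 157.186

# Coverage is total molecule length divided by genome size.
coverage = molecule_bases / genome_size
# Average contig length is total contig length divided by contig count.
avg_contig_len_mb = total_contig_len_mb / n_contigs
# Ratio of genome map length to sequence reference length.
len_ratio = total_contig_len_mb / total_ref_len_mb

print(round(coverage), round(avg_contig_len_mb, 3), round(len_ratio, 3))
```

A ratio above 1 means the genome maps span more than the sequence reference does.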
41. ChLG X
ChLGX had 13 scaffolds. Alignment to BioNano maps
captured gaps and validated order for 11 of 13 scaffolds,
incorporated 2 unplaced scaffolds and identified a
potential misplaced scaffold (scaffold 2 aligns with another
linkage group).
44. ChLG 7
Alignment to BioNano maps captured gaps and validated
order for 13 of 15 scaffolds. Scaffold 14 needs to be
reversed in the super-scaffold.
50. What does it cost?
• 100-500Mb genome <$5,000
▫ 70-100x coverage
• 1Gb genome <$8,000
▫ 70-100x coverage
• completely dependent on homogeneity of starting
material
• assembly and analysis software is included in
price
51. Summary
• Standard Draft Genomes are highly fragmented
• BNG provides independent platform
• Whole genome restriction maps
• Validate assembly
• Extend scaffolds/Size Gaps
• Identify structural variants
• Identify haplotypes
• Comprehensive view of repetitive DNA (HORs)
• A validated genome assembly improves downstream analyses
52. Thanks to:
• Michelle Gordon
▫ Research Assistant: optimizing sample preps
• Jennifer Shelton
▫ Biologist turned Bioinformaticist
• Nic Herndon
▫ Computer scientist turned Bioinformaticist
• BioNano Genomics
▫ Ernest Lam
▫ Weiping Wang