1. Jared Simpson
Ontario Institute for Cancer Research
&
Department of Computer Science
University of Toronto
Error correction, assembly
and consensus algorithms
for MinION data
London Calling, May 14th, 2015
3. An overview of NGS assembly
• Illumina data: short reads, very accurate, very deep
• Nearly all Illumina assembly is based on exact matching algorithms
• Fragmented assemblies
• Algorithms for Illumina data do not work for long, noisy reads
• PacBio developed a pipeline ("HGAP") to assemble their data
• We used this recipe as a starting point but with custom components
13. Assembly Polishing
• The consensus problem is viewed as choosing a sequence C′ that maximizes the probability of the event data:

$$C' = \arg\max_{S \in \mathcal{C}} P(D \mid S)$$

where

$$P(D \mid S) = \prod_{k=1}^{r} P(e_{i,k}, e_{i+1,k}, \ldots, e_{j,k} \mid S, \Theta)$$
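The arg max above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `choose_consensus` and `toy_log_likelihood` are hypothetical names, and the toy likelihood (one Gaussian current level per base) is a stand-in for the real event model, not the method used in the talk. The product over the r reads becomes a sum of log-likelihoods.

```python
import math

# Hypothetical per-base mean levels; a stand-in for a real pore model.
TOY_LEVELS = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def toy_log_likelihood(events, sequence, sigma=0.5):
    """Toy log P(events | sequence): one Gaussian event per base (assumption)."""
    if len(events) != len(sequence):
        return -math.inf
    total = 0.0
    for event, base in zip(events, sequence):
        mu = TOY_LEVELS[base]
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total -= (event - mu) ** 2 / (2 * sigma ** 2)
    return total


def choose_consensus(candidates, reads, log_likelihood=toy_log_likelihood):
    """Return the candidate S that maximizes sum_k log P(read_k | S)."""
    def total_log_prob(sequence):
        return sum(log_likelihood(events, sequence) for events in reads)
    return max(candidates, key=total_log_prob)


reads = [[1.1, 2.0, 2.9, 4.2], [0.9, 2.1, 3.0, 3.8]]
best = choose_consensus(["ACGT", "ACGA"], reads)
```

Working in log space avoids numerical underflow when many reads cover the same region.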
23. A simple model
• What is the probability of observing events E given a sequence S?
• Assuming for the moment there are no missing or extra events:

$$P(e_1, e_2, \ldots, e_n \mid s_1, s_2, \ldots, s_n, \Theta) = \prod_{i=1}^{n} P(e_i \mid s_i, \mu_{s_i}, \sigma_{s_i})$$

$$P(e_i \mid k, \mu_k, \sigma_k) = \mathcal{N}(\mu_k, \sigma_k^2)$$
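A minimal Python sketch of this simple model follows, assuming one event per 5-mer and a Gaussian emission for each 5-mer. The table `TOY_MODEL` and its values are invented for illustration; a real pore model would supply the mean and standard deviation of every 5-mer.

```python
import math

KMER_LEN = 5

# Hypothetical pore model: 5-mer -> (mean current level, standard deviation).
TOY_MODEL = {
    "AAAAA": (50.0, 1.0),
    "AAAAC": (55.0, 1.5),
    "AAACG": (60.0, 2.0),
}


def gaussian_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def kmers(sequence, k=KMER_LEN):
    """All overlapping k-mers of the sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def log_prob_events(events, sequence, model=TOY_MODEL):
    """log P(e_1..e_n | s_1..s_n), with no missing or extra events."""
    sequence_kmers = kmers(sequence)
    if len(sequence_kmers) != len(events):
        raise ValueError("simple model needs exactly one event per k-mer")
    return sum(
        gaussian_logpdf(event, *model[kmer])
        for event, kmer in zip(events, sequence_kmers)
    )


close = log_prob_events([50.0, 55.0, 60.0], "AAAAACG")
far = log_prob_events([52.0, 55.0, 60.0], "AAAAACG")
```

Events that sit closer to the model means give a higher log probability, so `close` is larger than `far`.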
27. Nanopore HMM
• Must consider:
  • over-segmentation
  • under-segmentation
  • missed short events
• HMM:
  • M states: match an event to a 5-mer
  • E states: extra observation of an event
  • K states: no event observed for a 5-mer
P(D|S)
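$$P(\pi, e_1, e_2, \ldots, e_n \mid S, \Theta) = \prod_{i=1}^{n} P(e_i \mid \pi_i, \mu_{s_i}, \sigma_{s_i}) \, P(\pi_i \mid \pi_{i-1}, S)$$

$$P(e_1, e_2, \ldots, e_n \mid S, \Theta) = \sum_{\pi} P(\pi, e_1, e_2, \ldots, e_n \mid S, \Theta)$$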
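The sum over all state paths is usually computed with the forward algorithm rather than by enumerating paths. The sketch below is a simplified illustration, not the implementation from the talk: the pore model `TOY_MODEL`, the fixed transition probabilities `P_MATCH`, `P_EXTRA`, `P_SKIP`, and the function names are all assumptions. It keeps one table each for match (M), extra (E) and skip (K) states.

```python
import math

KMER_LEN = 5
NEG_INF = -math.inf

# Hypothetical pore model: 5-mer -> (mean current level, standard deviation).
TOY_MODEL = {"AAAAA": (50.0, 1.0), "AAAAC": (55.0, 1.5), "AAACG": (60.0, 2.0)}

# Hypothetical, fixed transition probabilities (not renormalised for skip
# states, which cannot move to an extra state in this sketch).
P_MATCH, P_EXTRA, P_SKIP = 0.8, 0.1, 0.1


def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def logsumexp(*values):
    """log(sum(exp(v))) that tolerates -inf entries."""
    finite = [v for v in values if v != NEG_INF]
    if not finite:
        return NEG_INF
    top = max(finite)
    return top + math.log(sum(math.exp(v - top) for v in finite))


def forward_log_prob(events, sequence, model=TOY_MODEL):
    """log P(events | sequence), summed over all match/extra/skip paths."""
    kmers = [sequence[i:i + KMER_LEN] for i in range(len(sequence) - KMER_LEN + 1)]
    n, m = len(events), len(kmers)
    # Tables indexed [events consumed][k-mers visited].
    match = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    extra = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    skip = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0.0  # virtual start state before the first k-mer
    for i in range(n + 1):
        for j in range(1, m + 1):
            if i >= 1:
                emit = gaussian_logpdf(events[i - 1], *model[kmers[j - 1]])
                # M: first event for k-mer j, arriving from k-mer j-1.
                match[i][j] = emit + math.log(P_MATCH) + logsumexp(
                    match[i - 1][j - 1], extra[i - 1][j - 1], skip[i - 1][j - 1])
                # E: another event for the same k-mer j.
                extra[i][j] = emit + math.log(P_EXTRA) + logsumexp(
                    match[i - 1][j], extra[i - 1][j])
            # K: k-mer j produced no event at all.
            skip[i][j] = math.log(P_SKIP) + logsumexp(
                match[i][j - 1], extra[i][j - 1], skip[i][j - 1])
    return logsumexp(match[n][m], extra[n][m], skip[n][m])


one_event = forward_log_prob([50.0], "AAAAA")
```

With one event and one 5-mer, the only path with non-zero probability is a single match, so `one_event` equals the log of `P_MATCH` plus the Gaussian log density of that event.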
31. Assembly Accuracy
[Figure: scatter plots of 5-mer count in the reference (x-axis) against 5-mer count in the draft assembly and in the polished assembly (y-axis); bar chart of counts in draft vs. reference for the k-mers TTTTT, AAAAA, TTTTG, CAAAA, CTTTT, AAAAG, CCCCC, GGGGG.]
Draft: 98.5% accuracy Polished: 99.5% accuracy
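The 5-mer count comparison plotted on this slide can be reproduced in outline with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and it counts 5-mers on the forward strand only, which may differ from how the counts in the figure were produced.

```python
from collections import Counter


def count_kmers(sequence, k=5):
    """Count every overlapping k-mer in the sequence (forward strand only)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def compare_kmer_counts(reference, assembly, k=5):
    """Map each k-mer to (count in reference, count in assembly)."""
    ref_counts = count_kmers(reference, k)
    asm_counts = count_kmers(assembly, k)
    all_kmers = sorted(set(ref_counts) | set(asm_counts))
    # Counter returns 0 for k-mers that are absent from one of the sequences.
    return {kmer: (ref_counts[kmer], asm_counts[kmer]) for kmer in all_kmers}


# Toy sequences: the "assembly" is missing one A from a homopolymer run.
table = compare_kmer_counts("AAAAAAC", "AAAAAC")
```

Plotting the first element of each pair against the second gives a scatter plot of the kind described above; points off the diagonal are k-mers that are over- or under-represented in the assembly.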
32. Assembly Accuracy
[Figure: scatter plot of 5-mer count in the reference (x-axis) against 5-mer count in the polished assembly (y-axis); bar charts of counts for the k-mers TTTTT, AAAAA, TTTTG, CAAAA, CTTTT, AAAAG, CCCCC, GGGGG, comparing draft vs. reference and polished vs. reference.]
33. Aligning Events to a Reference
• The HMM can also align events to a reference genome
• Read about it here:
  • http://simpsonlab.github.io/2015/04/08/eventalign/
34. Planned Improvements
• Model dwell duration to better call homopolymers
  CTAAAAAAAAAAAAGTACA
• SNP calling/genotyping
  $$P(g_i \mid D) = \frac{P(D \mid g_i) \, P(g_i)}{P(D)}$$
• Improve scalability to handle larger genomes
• Use signal data during error correction
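The genotype posterior on this slide is Bayes' rule, with P(D) obtained by summing over all candidate genotypes. A small Python sketch follows; the function name, the genotype labels and the numbers are hypothetical, and the likelihoods P(D|g) are assumed to come from some event model such as the HMM.

```python
import math


def genotype_posteriors(log_likelihoods, priors):
    """Return P(g | D) for each genotype g.

    log_likelihoods: genotype -> log P(D | g)
    priors:          genotype -> P(g)
    """
    log_joint = {g: log_likelihoods[g] + math.log(priors[g]) for g in log_likelihoods}
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_joint.values())
    scaled = {g: math.exp(v - top) for g, v in log_joint.items()}
    evidence = sum(scaled.values())  # proportional to P(D)
    return {g: value / evidence for g, value in scaled.items()}


posteriors = genotype_posteriors(
    {"A": math.log(0.9), "C": math.log(0.1)},  # made-up likelihoods
    {"A": 0.5, "C": 0.5},                      # made-up uniform prior
)
```

With a uniform prior the posterior is just the normalised likelihood, so genotype "A" gets probability 0.9 in this toy case.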