Alignment of raw reads in Avadis NGS
1. Pioneering Scientific Intelligence
DNA/Small RNA Alignment in Avadis NGS 1.3
Strictly Confidential © Strand Life Sciences
2. Questions we will seek to answer in this presentation
What is an Alignment algorithm?
What issues must an Alignment algorithm consider?
How do Alignment algorithms work?
How does CoBWeb work?
How does CoBWeb compare with other algorithms?
How is CoBWeb exposed in Avadis NGS?
What is the future evolution of CoBWeb?
4. Subject’s Genome
AGGCTACGCATTTCCCATAAAGACCCACGCTTAAGTTC
AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC
Reference Genome, close but not quite the same as the Subject’s Genome
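As a small illustration of "close but not quite the same", the two sequences on this slide can be compared position by position. This is only a sketch: it assumes the two strings have equal length and differ by substitutions alone (no gaps), and the name `mismatch_positions` is ours, not something from Avadis NGS.

```python
# The two sequences exactly as they appear on the slide.
seq_a = "AGGCTACGCATTTCCCATAAAGACCCACGCTTAAGTTC"
seq_b = "AGGCTACGCATGTCCCATAATGACCCACACTTAAGTTC"


def mismatch_positions(a, b):
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(mismatch_positions(seq_a, seq_b))  # [11, 20, 28]
```

The two 38-base sequences differ at three positions, which is the kind of small divergence an aligner has to tolerate.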
7. Handling paired reads
[Figure: Subject’s Genome shown against the Reference Genome, which contains two Repeat Regions; one placement is marked ×]
8. A variety of Read Lengths
Short reads: ~50, few mismatches and gaps
Long reads: few hundreds to thousands, many more mismatches and gaps
9. Speed and Memory
Run in 4GB RAM
Allow use of multiple cores/processors
Billions of reads
Scale speed with more memory
11. Indexing the Genome to find Seed Matches
Scanning the Reference for each Read takes too long
The Reference Index
The Index very quickly yields locations in the Reference where some part (seed) of the Read matches.
This Seed occurs at Reference locations x1, x2…
This Seed occurs at Reference locations x3, x4…
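The idea of looking seeds up in an index instead of scanning the Reference can be sketched with a plain hash table of k-mers. This is an illustrative stand-in, not the index used by CoBWeb (which the later slides describe as Burrows-Wheeler based); the function names and the fixed seed length `k` are assumptions made for the example.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_hits(read, index, k):
    """Yield (offset_in_read, reference_position) for each
    non-overlapping k-mer seed taken from the read."""
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            yield offset, pos


reference = "ACGTACGTTACG"
index = build_seed_index(reference, 4)
for offset, pos in seed_hits("ACGTTACG", index, 4):
    # pos - offset is the read start implied by this seed hit
    print(offset, pos, pos - offset)
```

Seed hits that imply the same read start (here, start 4 for both seeds) point to a location worth examining in detail.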
12. Detailed Alignment at Seed Match Locations
[Figure: Read placed against the Reference at a Seed Match]
How many Mismatches and Gaps are needed for the Read to match around the Seed?
Smith-Waterman or Dynamic Programming
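The question on this slide can be answered with a small dynamic-programming table. The sketch below is a unit-cost variant (each mismatch or single-base gap costs 1) in which the Read must be used in full but may start and end anywhere inside a Reference window around the seed. It is not CoBWeb's actual scoring scheme, which the slides do not specify, and the name `min_edits` is hypothetical.

```python
def min_edits(read, window):
    """Fewest mismatches plus single-base gaps needed to align the
    whole read to some substring of the reference window."""
    # prev[j]: best cost of aligning read[:i] to a substring of the
    # window ending at position j. Row 0 is all zeros because the
    # alignment may start anywhere in the window at no cost.
    prev = [0] * (len(window) + 1)
    for i in range(1, len(read) + 1):
        curr = [i] + [0] * len(window)  # read[:i] against nothing: i gaps
        for j in range(1, len(window) + 1):
            substitution = prev[j - 1] + (read[i - 1] != window[j - 1])
            gap_in_window = prev[j] + 1      # read base has no partner
            gap_in_read = curr[j - 1] + 1    # window base has no partner
            curr[j] = min(substitution, gap_in_window, gap_in_read)
        prev = curr
    # The alignment may also end anywhere in the window.
    return min(prev)


print(min_edits("ACGT", "TTACGTTT"))  # 0: exact match inside the window
print(min_edits("ACGT", "TTAGGTTT"))  # 1: one mismatch
print(min_edits("ACGT", "TTAGTTT"))   # 1: one gap
```

Because only the window around each seed hit is examined, the table stays small even when the Reference is large.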
13. The Burrows-Wheeler based Index
The original Reference: C G A C $
All its circular shifts, sorted lexicographically, each with its Circular Shift Index on the left:
2  A C $ C G
0  C G A C $
3  C $ C G A
1  G A C $ C
4  $ C G A C
The last column of the sorted shifts (G $ A C C) is the BWT.
The Index comprises the BWT and the Circular Shift Indices, along with some housekeeping data structures.
The Circular Shift Indices can be sampled to fit into reduced memory at the expense of speed without sacrificing correctness.
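The construction on this slide can be reproduced in a few lines. This is a naive sketch that materialises every circular shift, which is fine for a five-character example but not how a genome-scale index would be built. Note that the slide sorts `$` after the letters, whereas Python's default string ordering would put it first, so the sketch uses a custom sort key; the function names are ours.

```python
def sort_key(ch):
    """Order characters as on the slide: letters first, '$' last."""
    return (ch == "$", ch)


def burrows_wheeler(text):
    """Return (circular shift indices in sorted order, BWT string)
    for a text that already ends with the '$' terminator."""
    n = len(text)
    shifts = sorted(
        range(n),
        key=lambda i: [sort_key(c) for c in text[i:] + text[:i]],
    )
    # The BWT is the last character of each sorted shift, i.e. the
    # character just before the shift's starting position.
    bwt = "".join(text[i - 1] for i in shifts)
    return shifts, bwt


shifts, bwt = burrows_wheeler("CGAC$")
print(shifts)  # [2, 0, 3, 1, 4]
print(bwt)     # G$ACC
```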
14. The Burrows-Wheeler based Index
[Figure: Read placed against the Reference at an EXACT Match]
All Exact Matches of a Read (NO Mismatches or Gaps) in the Reference can be found in time proportional to the length of the Read and largely independent of the size of the Reference.
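The standard way to get this behaviour from a Burrows-Wheeler index is backward search: the Read is processed from its last character to its first, and each step narrows a range of sorted shifts. The sketch below assumes that technique; the slides do not say exactly how CoBWeb implements it. For brevity it counts occurrences with `str.count`, which scans the BWT; a practical index would precompute those counts so that each step takes constant time. All names are hypothetical.

```python
def sort_key(ch):
    """Order characters with '$' last, as in the example on the previous slide."""
    return (ch == "$", ch)


def build_index(reference):
    """Build (shift indices, BWT, first-row table) for reference + '$'."""
    text = reference + "$"
    n = len(text)
    shifts = sorted(
        range(n),
        key=lambda i: [sort_key(c) for c in text[i:] + text[:i]],
    )
    bwt = "".join(text[i - 1] for i in shifts)
    # first[c]: number of characters in the text that sort before c,
    # i.e. the first sorted row that starts with c.
    first = {
        c: sum(1 for d in text if sort_key(d) < sort_key(c))
        for c in set(text)
    }
    return shifts, bwt, first


def exact_matches(read, index):
    """Return sorted reference positions where the read occurs exactly."""
    shifts, bwt, first = index
    lo, hi = 0, len(bwt)  # half-open range of sorted shifts
    for c in reversed(read):
        if c == "$" or c not in first:
            return []
        lo = first[c] + bwt[:lo].count(c)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return []
    return sorted(shifts[row] for row in range(lo, hi))


index = build_index("ACGTACGTTACG")
print(exact_matches("ACG", index))  # [0, 4, 9]
```

The loop runs once per Read character, which is where the "proportional to the length of the Read" behaviour comes from once the counts are precomputed.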
16. Seeding Strategy
This 15-mer occurs at locations x1, x2…
This 15-mer occurs at locations x3, x4…
This whole 30-mer occurs at location x5
Use the BW based index, augmented with additional data structures for speed, to find one or more Long Seed Matches in the Reference
Justification: Most long Reads do not have Mismatches and Gaps strewn across their length; there are usually long stretches that match exactly. And Long Seeds will have few matching locations.
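The effect described here, that a longer seed has fewer matching locations, can be seen by extending a seed one base at a time and watching its hit list shrink. This sketch uses a plain substring search in place of the BW based index, and the function names and the toy sequences are assumptions made for the example.

```python
def occurrences(seed, reference):
    """Return every start position of seed in reference (overlaps allowed)."""
    locations = []
    pos = reference.find(seed)
    while pos != -1:
        locations.append(pos)
        pos = reference.find(seed, pos + 1)
    return locations


def longest_seed(read, reference, start=0):
    """Extend a seed from read[start] while it still matches somewhere.
    Returns (seed_length, matching_reference_locations)."""
    best = (0, [])
    for end in range(start + 1, len(read) + 1):
        locations = occurrences(read[start:end], reference)
        if not locations:
            break
        best = (end - start, locations)
    return best


reference = "ACGTTTACGAAACGTC"
print(occurrences("ACG", reference))       # [0, 6, 11]
print(occurrences("ACGT", reference))      # [0, 11]
print(longest_seed("ACGTCA", reference))   # (5, [11])
```

The 3-base seed matches three places, the 4-base seed two, and the longest seed only one, so far fewer locations need the detailed alignment step.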
17. Advantages
Seed length is not specified in advance, so Long and Short reads can be handled seamlessly.
Separating the Smith-Waterman phase from the BW Index search allows an unlimited number of gaps and mismatches.
19. Comparison with BWA
Read Length 50
CoBWeb: 94% Alignment Score with up to 2 Gaps
BWA: 4% error + 1 gap of possibly multiple length
Read Length 150
A little faster than BWA with comparable results
21. Entry
Two new experiment types, DNA Alignment and Small-RNA Alignment
22. The Alignment Workflow
Run Alignment, and then create a DNA Variant or ChIP-Seq Experiment from the results.
23. Alignment Parameters
Specify number of Mismatches and Gaps, and handling of Multiple Matching.
Specify Adaptor Trimming (only for Small RNA) and 3’ and 5’ trimming based on quality.
Screen against Contaminant Databases.
25. ToDos
Chimeric Reads
RNA-Seq Alignment
Base Quality recalibration
Affine Gap Costs