Overview of Genome Assembly Algorithms
1. Introduction
OLC
Graph theory and assembly
deBruijn - Euler
Genome Assembly Algorithms and Software
(or... what to do with all that sequence data?)
Konstantinos Krampis
Asst. Professor, Informatics
J. Craig Venter Institute
George Washington University, Nov. 2nd 2011
Konstantinos Krampis Genome Assembly Algorithms and Software
2. Outline
Introduction
    Why do we need genome assembly
    Definitions of genome assembly
OLC
    Overlap
    Layout
    Consensus
    OLC assembly software and publications
Graph theory and assembly
    Definition of a graph
    Graphs and genome assembly
deBruijn - Euler
    An alternative assembly graph
    Constructing a de Bruijn graph from reads
    Genome assembly from de Bruijn graphs
    deBruijn assembly software and publications
3. Introduction
We cannot read the complete genome with the sequencer from one end to the other!
DNA isolated from a cell is amplified
The DNA is broken into fragments (shearing)
Fragments are "read" with the sequencer
Use the fragments (reads) to reconstruct the genome from sequencing reads
Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press
4. Introduction
Assembly: hierarchical process to reconstruct the genome from reads
Assemble the puzzle of the genome from the reads: overlaps connect the pieces
Oversample the genome so that reads overlap
Key approach: a data structure representing overlaps, and algorithms operating on that data structure
Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press
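To make "oversample the genome" concrete, here is a minimal sketch (not from the slides) that shears a random toy genome into fixed-length reads and computes the average coverage as number of reads × read length / genome length. The function names, the toy genome, and the random seed are all illustrative assumptions.

```python
import random

def shear(genome, read_len, n_reads, rng):
    """Sample n_reads substrings of length read_len at random start positions."""
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [(s, genome[s:s + read_len]) for s in starts]

def coverage(genome_len, read_len, n_reads):
    """Average number of reads covering each base of the genome."""
    return n_reads * read_len / genome_len

def per_base_depth(genome_len, reads):
    """Count how many reads cover each position."""
    depth = [0] * genome_len
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            depth[i] += 1
    return depth

rng = random.Random(0)  # fixed seed so the sketch is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(200))
reads = shear(genome, read_len=20, n_reads=50, rng=rng)
depth = per_base_depth(len(genome), reads)
print(coverage(len(genome), 20, 50))  # 5.0
```

Higher coverage makes it more likely that neighbouring reads overlap, which is what the assembly steps that follow rely on.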
5. Introduction
Two major algorithmic paradigms for genome assembly
Overlap - Layout - Consensus (OLC): well established, more powerful method, but more difficult to implement
OLC: first to be used successfully for complex eukaryotic genomes (Drosophila, H. sapiens)
deBruijn - Euler: newer, easier to implement, problematic in complex genomes (for current implementations)
6. OLC
Find Overlaps by aligning the sequence of the reads
Layout the reads based on which aligns to which
Get Consensus by joining all read sequences, merging overlaps
Sequencer reads in random direction, left-to-right or right-to-left
Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press
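Assuming "random direction" means that a read may come from either DNA strand, an overlap step has to compare each read both as given and as its reverse complement. A minimal sketch of that helper follows; the names `reverse_complement` and `canonical` are illustrative, not taken from any assembler.

```python
# Translation table for the four DNA bases (assumption: reads contain only A, C, G, T).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence as it would be read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical(seq):
    """Pick one fixed representative of a read and its reverse complement."""
    return min(seq, reverse_complement(seq))

print(reverse_complement("AACG"))  # CGTT
```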
7. OLC
Sequence alignment, all-against-all reads (Smith-Waterman, BLAST, other?)
Computationally intensive but easily parallelizable
Represent read overlap by connecting with a directed link
First step in creating the genome assembly graph (more later)
Credit: Kececioglu and Myers 1995, Algorithmica 13:7-51
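The all-against-all overlap step can be sketched as follows. This is a simplification: it uses exact suffix-prefix matching rather than Smith-Waterman or BLAST alignment, and the reads, the `min_len` threshold, and the function names are illustrative assumptions.

```python
from itertools import permutations

def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Compare every ordered pair of reads; each overlap becomes a weighted directed link."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap_len(a, b, min_len)
        if n:
            edges[(a, b)] = n
    return edges

reads = ["ATGGCGT", "GCGTGCA", "TGCAATG"]
edges = overlap_graph(reads)
for (a, b), n in edges.items():
    print(a, "->", b, "overlap", n)
```

Each ordered pair is compared independently of the others, which is why this step is easy to parallelize.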
8. OLC
Create a consistent linear (ideally) ordering of the reads
Remove redundancy, so no two dovetails leave the same edge
No containment edge is followed by a dovetail edge
Remove cycles, one link in, one out
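One simple way to obtain "one link in, one out" with no cycles is a greedy heuristic: take the heaviest overlaps first and skip any link that would break those rules. The sketch below is that heuristic only, not the layout algorithm of Kececioglu and Myers or of any particular assembler. It ignores containment edges, and the overlap weights are hard-coded for illustration.

```python
def greedy_layout(reads, edges):
    """Return chains of reads with at most one link in and one link out per read."""
    nxt, prev = {}, {}
    # Heaviest overlaps first (ties keep their original order).
    for (a, b), weight in sorted(edges.items(), key=lambda kv: -kv[1]):
        if a in nxt or b in prev:
            continue  # a already has an outgoing link, or b an incoming one
        # Walk forward from b; reaching a means the new link would close a cycle.
        node = b
        while node in nxt:
            node = nxt[node]
        if node == a:
            continue
        nxt[a], prev[b] = b, a
    # Every chain starts at a read that has no incoming link.
    chains = []
    for read in reads:
        if read not in prev:
            chain = [read]
            while chain[-1] in nxt:
                chain.append(nxt[chain[-1]])
            chains.append(chain)
    return chains

reads = ["ATGGCGT", "GCGTGCA", "TGCAATG"]
edges = {  # illustrative suffix-prefix overlap lengths
    ("ATGGCGT", "GCGTGCA"): 4,
    ("GCGTGCA", "TGCAATG"): 4,
    ("TGCAATG", "ATGGCGT"): 3,
}
print(greedy_layout(reads, edges))
```

In this toy input the weakest link (weight 3) would close a cycle, so it is dropped and the three reads form a single linear chain.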
9. OLC
Multiple Sequence Alignment (ClustalW) algorithms? No phylogeny here...
Vote for the most abundant nucleotide for each position
Incorporate read quality data
Create a pre-consensus from high-quality reads, and align the remaining reads to it
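The voting idea can be sketched as a per-column vote in which each read's base is weighted by its quality score. The layout format (offset, sequence, qualities), the quality values, and the function name are assumptions made for illustration. The sketch also assumes the reads are already placed without gaps.

```python
from collections import defaultdict

def consensus(layout, length):
    """Quality-weighted vote per column; layout holds (offset, sequence, qualities) tuples."""
    out = []
    for pos in range(length):
        votes = defaultdict(int)
        for offset, seq, quals in layout:
            i = pos - offset
            if 0 <= i < len(seq):
                votes[seq[i]] += quals[i]
        # "N" marks a column that no read covers.
        out.append(max(votes, key=votes.get) if votes else "N")
    return "".join(out)

layout = [
    (0, "ATGGCGT", [30] * 7),
    (3, "GCGTGCA", [30] * 7),
    (3, "GCTTGCA", [10] * 7),  # low-quality read with an error at its third base
]
print(consensus(layout, 10))  # ATGGCGTGCA
```

Setting every quality to 1 reduces this to a plain majority vote for the most abundant nucleotide.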
10. OLC
Celera Assembler
Developed at Celera Genomics for the first Drosophila and human genome assemblies
Continued development at the J. Craig Venter Institute as an open source project
http://wgs-assembler.SourceForge.net (Licence: GPL)
Plenty of wiki (developer + user) documentation, examples, user forums
Other OLC implementations: Arachne, PCAP, Newbler, Phrap, TIGR Assembler
11. OLC
Celera Assembler publications
Myers et al (2000) A whole-genome assembly of Drosophila
Levy et al (2007) The diploid genome sequence of an individual human
Zimin et al (2009) The domestic cow, Bos taurus
Dalloul et al (2010) The domestic turkey, Meleagris gallopavo
Lorenzi et al (2010) New assembly of Entamoeba histolytica
Lawniczak et al (2010) Divergence in Anopheles gambiae
Jones et al (2011) The marine filamentous cyanobacterium Lyngbya majuscula
Miller et al The Tasmanian devil, Sarcophilus harrisii
Prüfer et al The great ape bonobo, Pan paniscus
Gordon et al The cotton bollworm moth, Helicoverpa
12. Graph theory and assembly
and now a bit of Graph Theory...
13. Graph theory and assembly
Graph G with a set of vertices (nodes) V: {P,T,Q,S,R}
and a set of edges (links between nodes) E: {(P,T),(P,Q),(P,S),(Q,T),(S,T),(Q,S),(S,Q),(Q,R),(R,S)}
Walk from P to R: (P,Q),(Q,R)
Walk from R to T: (R,S),(S,Q),(Q,T) or (R,S),(S,T)
Walk from R to P: not possible
Credit: Introduction to Graph Theory, Robin J. Wilson
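The graph on this slide can be written down directly and searched for walks. The sketch treats the edges as directed, which is what "walk from R to P: not possible" implies. The breadth-first search and the name `find_walk` are illustrative choices.

```python
from collections import deque

V = {"P", "T", "Q", "S", "R"}
E = [("P", "T"), ("P", "Q"), ("P", "S"), ("Q", "T"), ("S", "T"),
     ("Q", "S"), ("S", "Q"), ("Q", "R"), ("R", "S")]

def adjacency(edges):
    """Map each vertex to the list of vertices its outgoing edges lead to."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    return adj

def find_walk(edges, start, goal):
    """Breadth-first search; returns a shortest walk as a list of edges, or None."""
    adj = adjacency(edges)
    parent = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            walk = []
            while parent[u] is not None:
                walk.append((parent[u], u))
                u = parent[u]
            return walk[::-1]
        for v in adj.get(u, []):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

print(find_walk(E, "P", "R"))  # [('P', 'Q'), ('Q', 'R')]
print(find_walk(E, "R", "T"))  # [('R', 'S'), ('S', 'T')]
print(find_walk(E, "R", "P"))  # None
```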
14. Graph theory and assembly
Trail: a walk of the graph where each edge is visited only once
Example Trail: (P,Q), (Q,R), (R,S), (S,Q), (Q,S), (S,T)
Path: a walk where each vertex is visited only once
Example Path: (P,Q), (Q,R), (R,S), (S,T)
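The two definitions can be checked mechanically on the example graph with vertices P, T, Q, S, R. This is a minimal sketch; the function names are illustrative and the edges are treated as directed.

```python
E = {("P", "T"), ("P", "Q"), ("P", "S"), ("Q", "T"), ("S", "T"),
     ("Q", "S"), ("S", "Q"), ("Q", "R"), ("R", "S")}

def is_walk(walk, edges):
    """Every step is an edge of the graph and starts where the previous step ended."""
    in_graph = all(e in edges for e in walk)
    connected = all(a[1] == b[0] for a, b in zip(walk, walk[1:]))
    return in_graph and connected

def is_trail(walk, edges):
    """A walk in which no edge is repeated."""
    return is_walk(walk, edges) and len(set(walk)) == len(walk)

def is_path(walk, edges):
    """A walk in which no vertex is repeated."""
    if not walk:
        return True
    vertices = [walk[0][0]] + [v for _, v in walk]
    return is_walk(walk, edges) and len(set(vertices)) == len(vertices)

trail = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "Q"), ("Q", "S"), ("S", "T")]
path = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "T")]
print(is_trail(trail, E), is_path(trail, E))  # True False
print(is_trail(path, E), is_path(path, E))    # True True
```

The example trail revisits the vertices Q and S, so it is a trail but not a path; every path is also a trail.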
15. Graph theory and assembly
Credit: Saad Mneimneh, CUNY
16. Graph theory and assembly
Represent sequence overlaps as a graph with weighted edges
SCS (shortest common superstring) solution: find a Path (visiting every vertex exactly once) that maximizes the weight sum
Hamiltonian Cycle or Traveling Salesman Problem
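For a handful of reads the maximum-weight path can be found by brute force, trying every ordering of the reads. The sketch below assumes exact suffix-prefix overlaps as edge weights and that no read is contained in another. The reads and function names are illustrative.

```python
from itertools import permutations

def overlap_len(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def best_hamiltonian_path(reads):
    """Try every ordering of the reads and keep the one with the largest overlap sum."""
    best_order, best_weight = None, -1
    for order in permutations(reads):
        weight = sum(overlap_len(a, b) for a, b in zip(order, order[1:]))
        if weight > best_weight:
            best_order, best_weight = order, weight
    return best_order, best_weight

def spell(order):
    """Merge the reads along the path, writing each overlap only once."""
    text = order[0]
    for a, b in zip(order, order[1:]):
        text += b[overlap_len(a, b):]
    return text

reads = ["ATGGCGT", "GCGTGCA", "TGCAATG"]
order, weight = best_hamiltonian_path(reads)
print(order, weight)  # ('ATGGCGT', 'GCGTGCA', 'TGCAATG') 8
print(spell(order))   # ATGGCGTGCAATG
```

The superstring length equals the total read length minus the path weight (21 - 8 = 13 here), so maximizing the weight sum minimizes the length. The number of orderings grows factorially with the number of reads, which is why this only works on toy inputs.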
17. Graph theory and assembly
Which edge to start from?
NO: misses a vertex
NO: misses edge with large weight
18. Graph theory and assembly
YES!: all vertices and edge with large weight
19. Graph theory and assembly
A more realistic version of a read / string overlap graph (C. jejuni)
Credit: Eugene W. Myers Bioinformatics 21:79-85
20. Graph theory and assembly
Computational Complexity
SCS solution by searching for a Hamiltonian Cycle on a graph is a difficult algorithmic problem (NP-hard)
Using approximation or greedy algorithms can yield 2- to 4-approximation solutions (at most twice to four times the length of the optimal shortest string)
Transformation of the Overlap Graph to a String Graph leads to a polynomial-time solution. No assembler implementation yet.
Polynomial (P): O(n), O(n^2), O(n^3), etc.
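A greedy approximation can be sketched in a few lines: repeatedly merge the two strings with the largest overlap until one string remains. This is a minimal illustration under the assumption of exact suffix-prefix overlaps and no read contained in another. It is not the algorithm of any specific assembler, and the names are illustrative.

```python
def overlap_len(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_scs(reads):
    """Greedy superstring: merge the best-overlapping pair until one string is left."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (-1, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

reads = ["ATGGCGT", "GCGTGCA", "TGCAATG"]
print(greedy_scs(reads))  # ATGGCGTGCAATG
```

Each round compares all pairs, so the running time is polynomial in the number of reads, but the result is only guaranteed to be a superstring, not the shortest one.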
21. deBruijn - Euler
Pevzner, Tang and Waterman, An Eulerian path approach to DNA fragment assembly, PNAS 98 (2001) 9748-9753.
22. deBruijn - Euler
deBruijn graph: a directed graph representing overlaps between sequences of symbols
Credit: Wikipedia
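A de Bruijn graph can be built from reads by cutting them into k-mers: every distinct k-mer becomes an edge from its first k-1 symbols to its last k-1 symbols. The sketch below is a minimal illustration. The reads, the choice k=4, and the function name are assumptions, and sequencing errors and reverse complements are ignored.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn(reads, k=4)
for node, successors in graph.items():
    print(node, "->", successors)
```

No read-against-read alignment is needed: overlapping reads share k-mers, so they end up on the same edges automatically.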
23. deBruijn - Euler
24. deBruijn - Euler
25. deBruijn - Euler
26. deBruijn - Euler
In a real genome scenario...
Credit: Flicek and Birney 2009, Nature Methods 6, S6 - S12
27. deBruijn - Euler
Euler’s algorithm
Using Euler's algorithm we can find a path that visits each edge of the de Bruijn genome assembly graph once, in order to concatenate the edge labels and "spell out" the assembly. Polynomial time!
Credit: Wikipedia
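One standard way to find such a path is Hierholzer's algorithm, which runs in time linear in the number of edges. The sketch below uses it on a small hand-written de Bruijn graph. The graph, the function names, and the assumption that an Eulerian path exists (this is not checked) are all illustrative.

```python
def eulerian_path(graph):
    """Hierholzer's algorithm; graph maps a node to its successors, one entry per edge."""
    adj = {u: list(vs) for u, vs in graph.items()}  # copy, edges are consumed below
    out_deg = {u: len(vs) for u, vs in adj.items()}
    in_deg = {}
    for vs in adj.values():
        for v in vs:
            in_deg[v] = in_deg.get(v, 0) + 1
    nodes = set(out_deg) | set(in_deg)
    # Start where out-degree exceeds in-degree by one; otherwise any node with edges.
    start = next((n for n in sorted(nodes)
                  if out_deg.get(n, 0) - in_deg.get(n, 0) == 1), min(adj))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if adj.get(u):
            stack.append(adj[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())    # dead end: node is final in the path
    return path[::-1]

def spell(path):
    """First node in full, then the last symbol of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:])

graph = {"ATG": ["TGG"], "TGG": ["GGC"], "GGC": ["GCG"], "GCG": ["CGT"],
         "CGT": ["GTG"], "GTG": ["TGC"], "TGC": ["GCA"]}
print(spell(eulerian_path(graph)))  # ATGGCGTGCA
```

Here the nodes carry the (k-1)-mer labels, so appending the last symbol of each visited node is equivalent to concatenating the edge labels with their overlaps merged.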
28. deBruijn - Euler
Euler assembler (the very first), Pevzner et al 2001 PNAS 98:9748-9753
Velvet assembler (more user friendly),
Both of these assemblers store the complete graph in computer memory: 512GB-1024GB for human genomes
At JCVI we have two 1024GB (1TB) RAM servers for assembly
Others: ABYSS, YAGA, Contrail-Bio, PASHA: parallel (distributed memory) assemblers on computer clusters
29. deBruijn - Euler
Thank you!
contact: kkrampis@jcvi.org
We hire interns at the J. Craig Venter Institute:
http://www.jcvi.org/cms/education/internship-program/
Some of my other projects - Cloud Computing:
http://tinyurl.com/cloudbiolinux-jcvi
http://www.cloudbiolinux.org