Introduction
                             OLC
        Graph theory and assembly
                  deBruijn - Euler




Genome Assembly Algorithms and Software
   (or... what to do with all that sequence data?)


                   Konstantinos Krampis

                    Asst. Professor, Informatics
                     J. Craig Venter Institute




   George Washington University, Nov. 2nd 2011



            Konstantinos Krampis     Genome Assembly Algorithms and Software

Introduction
    Why do we need genome assembly
    Definitions of genome assembly
OLC
    Overlap
    Layout
    Consensus
    OLC assembly software and publications
Graph theory and assembly
    Definition of a graph
    Graphs and genome assembly
deBruijn - Euler
    An alternative assembly graph
    Constructing a de Bruijn graph from reads
    Genome assembly from de Bruijn graphs
    deBruijn assembly software and publications



The sequencer cannot read the
complete genome from one end to
the other!

DNA isolated from a cell is
amplified

Broken into fragments (shearing)

Fragments are "read" with the
sequencer

Use the fragments (reads) to
reconstruct the genome

Credit: Masahiro Kasahara, Large-Scale Genome Sequence
Processing, Imperial College Press





Assembly: hierarchical process
to reconstruct genome from
reads

Assemble the puzzle of the
genome from the reads:
overlaps connect the pieces

Oversample the genome so that
reads overlap

Key approach: data structure
representing overlaps, and
algorithms operating on that
data structure

Credit: Masahiro Kasahara, Large-Scale Genome Sequence
Processing, Imperial College Press
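
Oversampling is usually quantified as coverage: the average number of reads covering each base, N * L / G for N reads of length L from a genome of length G. The following is a minimal sketch of shotgun sampling, assuming uniformly random read start positions and a fixed read length. The names `shotgun_reads` and `coverage` are illustrative and do not come from any assembler.

```python
import random

def shotgun_reads(genome, read_len, n_reads, seed=0):
    """Sample fixed-length reads at uniformly random start positions."""
    rng = random.Random(seed)
    last_start = len(genome) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]

def coverage(n_reads, read_len, genome_len):
    """Average number of reads covering each base: N * L / G."""
    return n_reads * read_len / genome_len

# Toy random "genome" of 1000 bases (an assumption for illustration only).
rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(1000))
reads = shotgun_reads(genome, read_len=50, n_reads=200)
print(coverage(len(reads), 50, len(genome)))  # 10.0
```

At 10x coverage most neighbouring reads share a stretch of sequence, which is what makes overlap detection possible.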




Two major algorithmic paradigms for genome assembly


       Overlap - Layout - Consensus (OLC): well established,
       more powerful method, but more difficult to implement

       OLC: first to be used successfully for complex eukaryotic
       genomes (Drosophila, H. sapiens)

       deBruijn - Euler: newer, easier to implement, problematic
       in complex genomes (for current implementations)







Find Overlaps by aligning
the sequence of the reads

Layout the reads based
on which aligns to which

Get Consensus by joining
all read sequences,
merging overlaps

Sequencer reads in
random direction,
left-to-right or
right-to-left

Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing,
Imperial College Press
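
Because a read may come from either strand, overlap detection typically has to consider each read together with its reverse complement. A minimal sketch, assuming plain A/C/G/T sequences with no ambiguity codes (the helper names are illustrative):

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(read):
    """Complement every base, then reverse the sequence."""
    return read.translate(_COMPLEMENT)[::-1]

def both_orientations(read):
    """The two strings an overlapper would need to compare against."""
    return (read, reverse_complement(read))

print(reverse_complement("ATGCC"))  # GGCAT
```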







Sequence alignment,
all-against-all reads
(Smith-Waterman,
BLAST, other?)

Computationally intensive
but easily parallelizable

Represent read overlap by
connecting with directed
link

First step in creating the
genome assembly graph
(more later)

Credit: Kececioglu and Myers 1995, Algorithmica 13:7-51
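
The all-against-all step can be sketched as follows. This is a simplification: it uses exact suffix-prefix matching as a stand-in for a real alignment such as Smith-Waterman, so it does not tolerate sequencing errors. The function names and the `min_len` threshold are assumptions for illustration.

```python
from itertools import permutations

def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the longest match is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Directed links (a, b) -> overlap length, from all ordered read pairs."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap_len(a, b, min_len)
        if n:
            edges[(a, b)] = n
    return edges

reads = ["ACGTAC", "GTACGG", "ACGGTT"]
for (a, b), n in overlap_graph(reads).items():
    print(a, "->", b, n)
# ACGTAC -> GTACGG 4
# GTACGG -> ACGGTT 4
```

Each ordered pair is independent of the others, which is why this step parallelizes easily.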




Create a consistent linear
(ideally) ordering of the
reads


Remove redundancy, so
no two dovetail edges leave
the same vertex

No containment edge is
followed by a dovetail
edge


Remove cycles, one link
in, one out
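
One simple way to obtain such an ordering is a greedy heuristic: take overlaps from longest to shortest and keep a link only if it preserves "one link in, one link out" and does not close a cycle. This is a sketch under stated assumptions, not the layout procedure of Kececioglu and Myers; it ignores containment edges, and `greedy_layout` is a hypothetical name.

```python
def greedy_layout(reads, edges):
    """Order reads into chains: at most one link in and one link out per read,
    and no cycles. edges maps (a, b) -> overlap length."""
    parent = {r: r for r in reads}

    def find(r):
        # Union-find root lookup with path halving.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    nxt, has_in = {}, set()
    for (a, b), _w in sorted(edges.items(), key=lambda e: -e[1]):
        if a in nxt or b in has_in:
            continue          # a second link out of a, or into b
        if find(a) == find(b):
            continue          # link would close a cycle
        nxt[a] = b
        has_in.add(b)
        parent[find(a)] = find(b)

    chains = []
    for r in reads:
        if r not in has_in:   # reads with no incoming link start a chain
            chain = [r]
            while chain[-1] in nxt:
                chain.append(nxt[chain[-1]])
            chains.append(chain)
    return chains

edges = {("A", "B"): 5, ("A", "C"): 3, ("B", "C"): 4, ("C", "A"): 2}
print(greedy_layout(["A", "B", "C", "D"], edges))  # [['A', 'B', 'C'], ['D']]
```

In the toy example the link A -> C is dropped as redundant, C -> A is dropped because it would create a cycle, and the unconnected read D forms its own chain.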






Multiple Sequence
Alignment (ClustalW)
algorithms? No
phylogeny here...

Vote for the most abundant
nucleotide for each position

Incorporate read quality data


Create pre-consensus from
high-quality reads, and align
remaining reads to it
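
The voting step can be sketched as a quality-weighted majority vote per position. This assumes the reads have already been placed at fixed offsets by the layout step, with no insertions or deletions; real consensus modules handle gaps. The `(offset, sequence, qualities)` tuple format is an assumption for illustration.

```python
from collections import defaultdict

def consensus(placed_reads):
    """placed_reads: list of (offset, sequence, qualities), gap-free placement.

    Each base votes for its position with a weight equal to its quality score.
    Positions not covered by any read are skipped.
    """
    votes = defaultdict(lambda: defaultdict(int))
    for offset, seq, quals in placed_reads:
        for i, (base, q) in enumerate(zip(seq, quals)):
            votes[offset + i][base] += q
    return "".join(max(votes[pos], key=votes[pos].get) for pos in sorted(votes))

placed = [
    (0, "ACGTAC", [30, 30, 30, 30, 30, 30]),
    (2, "GTTCGG", [30, 30, 10, 30, 30, 30]),  # low-quality T at position 4
    (4, "ACGGTT", [30, 30, 30, 30, 30, 30]),
]
print(consensus(placed))  # ACGTACGGTT
```

At position 4 the two high-quality A calls (total weight 60) outvote the single low-quality T (weight 10).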





Celera Assembler

   Developed at Celera Genomics for first Drosophila and human genome
   assemblies

   Continued development at J. Craig Venter Inst. as open source project

   http://wgs-assembler.SourceForge.net (Licence: GPL)

   Plenty of wiki (developer + user) documentation, examples, user forums

   Other OLC implementations: Arachne, PCAP, Newbler, Phrap, TIGR
   Assembler





Celera Assembler publications

    Myers et al (2000) A whole-genome assembly of Drosophila
    Levy et al (2007) The diploid genome sequence of an individual human
    Zimin et al (2009) The domestic cow, Bos taurus
    Dalloul et al (2010) The domestic turkey, Meleagris gallopavo
    Lorenzi et al (2010) New assembly of Entamoeba histolytica
    Lawniczak et al (2010) Divergence in Anopheles gambiae
    Jones et al (2011) The marine filamentous cyanobacterium Lyngbya
    majuscula
    Miller et al The Tasmanian devil, Sarcophilus harrisii
    Prüfer et al The great ape bonobo, Pan paniscus
    Gordon et al The cotton bollworm moth, Helicoverpa


and now a bit of Graph Theory...








Graph G with set of vertices (nodes)
V: {P,T,Q,S,R}

set of edges (links between nodes)
E: {(P,T),(P,Q),(P,S),(Q,T),
(S,T),(Q,S),(S,Q),(Q,R),(R,S)}

walk from P to R: (P,Q),(Q,R)

walk from R to T: (R,S),(S,Q),(Q,T)
or (R,S),(S,T)

walk from R to P: not possible

Credit: Introduction to Graph Theory, Robin J. Wilson
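
The example graph can be written down directly as a list of directed edges, and walks can be found with a breadth-first search. A minimal sketch (the function names are illustrative):

```python
from collections import deque

# Directed edges of the example graph G.
E = [("P", "T"), ("P", "Q"), ("P", "S"), ("Q", "T"), ("S", "T"),
     ("Q", "S"), ("S", "Q"), ("Q", "R"), ("R", "S")]

def adjacency(edges):
    """Map each vertex to the list of vertices its outgoing edges lead to."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    return adj

def find_walk(edges, start, goal):
    """Breadth-first search: a shortest walk as a list of edges, or None."""
    adj = adjacency(edges)
    prev = {start: None}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            walk = []
            while prev[u] is not None:   # follow predecessors back to start
                walk.append((prev[u], u))
                u = prev[u]
            return walk[::-1]
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None

print(find_walk(E, "P", "R"))  # [('P', 'Q'), ('Q', 'R')]
print(find_walk(E, "R", "T"))  # [('R', 'S'), ('S', 'T')]
print(find_walk(E, "R", "P"))  # None
```

No walk from R to P exists because no edge points into P.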






Trail: a walk of the graph where
each edge is visited only once

Example Trail: (P,Q), (Q,R),
(R,S), (S,Q), (Q,S), (S,T)

Path: a walk where each vertex
is visited only once

Example Path: (P,Q), (Q,R),
(R,S), (S,T)
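
Both definitions can be checked mechanically on the example graph. A small sketch, with helper names chosen for illustration:

```python
# Directed edges of the example graph.
E = {("P", "T"), ("P", "Q"), ("P", "S"), ("Q", "T"), ("S", "T"),
     ("Q", "S"), ("S", "Q"), ("Q", "R"), ("R", "S")}

def is_walk(edges, seq):
    """Every step is an edge of the graph and consecutive edges join up."""
    return (all(e in edges for e in seq)
            and all(a[1] == b[0] for a, b in zip(seq, seq[1:])))

def is_trail(edges, seq):
    """A walk in which no edge is used more than once."""
    return is_walk(edges, seq) and len(set(seq)) == len(seq)

def is_path(edges, seq):
    """A walk in which no vertex is visited more than once."""
    if not seq or not is_walk(edges, seq):
        return False
    vertices = [seq[0][0]] + [v for _, v in seq]
    return len(set(vertices)) == len(vertices)

trail = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "Q"), ("Q", "S"), ("S", "T")]
path = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "T")]
print(is_trail(E, trail), is_path(E, trail))  # True False
print(is_trail(E, path), is_path(E, path))    # True True
```

The example trail revisits Q and S, so it is a trail but not a path; note that (S,Q) and (Q,S) count as different edges in a directed graph.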







Credit: Saad Mneimneh, CUNY








Represent sequence overlaps as
a graph with weighted edges

Shortest common superstring (SCS)
solution: find a Path (visit
every vertex once) that
maximizes the weight sum

Hamiltonian Cycle or Traveling
Salesman Problem
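
For a handful of reads the search can be done by brute force: try every ordering of the reads, score it by the sum of overlap lengths between consecutive reads, and keep the best. The sketch below assumes error-free reads, exact suffix-prefix overlaps as edge weights, and that no read is contained in another. The number of orderings grows factorially, so this is only an illustration.

```python
from itertools import permutations

def overlap_len(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def best_hamiltonian_path(reads):
    """Ordering that visits every read once and maximizes total overlap."""
    best_order, best_weight = None, -1
    for order in permutations(reads):
        weight = sum(overlap_len(a, b) for a, b in zip(order, order[1:]))
        if weight > best_weight:
            best_order, best_weight = order, weight
    return best_order, best_weight

def spell(order):
    """Merge reads along the path, writing each overlap only once."""
    s = order[0]
    for a, b in zip(order, order[1:]):
        s += b[overlap_len(a, b):]
    return s

order, weight = best_hamiltonian_path(["GTACGG", "ACGGTT", "ACGTAC"])
print(order, weight)  # ('ACGTAC', 'GTACGG', 'ACGGTT') 8
print(spell(order))   # ACGTACGGTT
```

Maximizing the summed overlap minimizes the length of the merged string, since every overlapping base is written once instead of twice.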






Which edge to start from?




NO: misses a vertex
NO: misses edge with large weight






YES!: all vertices and edge with large weight






A more realistic version of a read / string overlap graph (C. jejuni)
Credit: Eugene W. Myers Bioinformatics 21:79-85




Computational Complexity

   SCS solution by searching for a
   Hamiltonian Cycle on a graph is a
   difficult algorithmic problem
   (NP-hard)

   Using approximation or greedy
   algorithms can yield 2- to
   4-approximation solutions (at most twice or
   four times the length of the
   optimal shortest string)

   Transformation of Overlap Graph
   to String Graph leads to
   Polynomial time solution. No
   assembler implementation yet.

   Polynomial (P): O(n), O(n^2), O(n^3), etc.
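
A greedy approximation can be sketched in a few lines: repeatedly merge the two strings with the longest suffix-prefix overlap until one string remains. The sketch assumes a non-empty list of error-free reads and exact overlaps; the function names are illustrative, and the result is a common superstring that is not guaranteed to be the shortest one.

```python
def overlap_len(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_superstring(reads):
    """Greedy merge of the best-overlapping pair until one string is left."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best_n, best_i, best_j = -1, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads[0]

print(greedy_superstring(["GTACGG", "ACGGTT", "ACGTAC"]))  # ACGTACGGTT
```

Each round costs a quadratic number of overlap computations and there are at most n - 1 rounds, so the running time is polynomial in the number of reads.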




Pevzner, Tang and
Waterman, An
Eulerian path
approach to DNA
fragment assembly,
PNAS 98 2001
9748-9753.








deBruijn graph: a directed graph representing overlaps between
sequences of symbols
Credit: Wikipedia
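
A de Bruijn graph can be built from reads by cutting every read into k-mers: each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The sketch below is a simplification that assumes error-free reads and keeps each distinct k-mer only once; real assemblers also track k-mer multiplicity and both strands.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

reads = ["ACGTAC", "GTACGG", "ACGGTT"]
for node, successors in de_bruijn(reads, k=4).items():
    print(node, "->", successors)
# ACG -> ['CGT', 'CGG']
# CGT -> ['GTA']
# GTA -> ['TAC']
# TAC -> ['ACG']
# CGG -> ['GGT']
# GGT -> ['GTT']
```

No read-against-read alignment is needed: overlaps between reads are implied by shared k-mers.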



In a real genome scenario...




Credit: Flicek and Birney 2009, Nature Methods 6, S6 - S12





Euler’s algorithm




   Using Euler’s algorithm we can find a path that visits each edge of the de
   Bruijn genome assembly graph once, in order to concatenate the edge
   labels and "spell out" the assembly. Polynomial time!
   Credit: Wikipedia
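
A path that uses every edge once can be found with Hierholzer's algorithm, a standard linear-time method for Eulerian paths. The sketch below assumes a non-empty graph that actually has an Eulerian path, with (k-1)-mers as nodes; the small graph is hard-coded for illustration and is not taken from the slides.

```python
def eulerian_path(graph):
    """Hierholzer's algorithm on a dict node -> list of successor nodes."""
    out_edges = {u: list(vs) for u, vs in graph.items()}
    balance = {}  # out-degree minus in-degree per node
    for u, vs in graph.items():
        balance[u] = balance.get(u, 0) + len(vs)
        for v in vs:
            balance[v] = balance.get(v, 0) - 1
    # Start at the node with one extra outgoing edge, else at any node.
    start = next((u for u, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges.get(u):
            stack.append(out_edges[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit the node
    return path[::-1]

def spell(path):
    """First node in full, then the last base of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:])

graph = {"ACG": ["CGT", "CGG"], "CGT": ["GTA"], "GTA": ["TAC"],
         "TAC": ["ACG"], "CGG": ["GGT"], "GGT": ["GTT"]}
print(spell(eulerian_path(graph)))  # ACGTACGGTT
```

Every edge is pushed and popped once, so the running time is linear in the number of edges.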







Euler assembler (the very first), Pevzner et al 2001 PNAS
98:9748-9753

Velvet assembler (more user friendly)

Both of these assemblers store the complete graph in computer
memory: 512GB-1024GB for human genomes

At JCVI we have two 1024GB (1TB) RAM servers for assembly

Others: ABySS, YAGA, Contrail-Bio, PASHA: parallel (distributed
memory) assemblers on computer clusters





Thank you!


    contact: kkrampis@jcvi.org

    We hire interns at the J. Craig Venter Institute:
    http://www.jcvi.org/cms/education/internship-program/

    Some of my other projects - Cloud Computing:
    http://tinyurl.com/cloudbiolinux-jcvi
    http://www.cloudbiolinux.org




From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Overview of Genome Assembly Algorithms

  • 1. Introduction; OLC; Graph theory and assembly; deBruijn - Euler. Genome Assembly Algorithms and Software (or... what to do with all that sequence data?). Konstantinos Krampis, Asst. Professor, Informatics, J. Craig Venter Institute. George Washington University, Nov. 2nd 2011.
  • 2. Introduction: Why do we need genome assembly; Definitions of genome assembly. OLC: Overlap; Layout; Consensus; OLC assembly software and publications. Graph theory and assembly: Definition of a graph; Graphs and genome assembly. deBruijn - Euler: An alternative assembly graph; Constructing a de Bruijn graph from reads; Genome assembly from de Bruijn graphs; deBruijn assembly software and publications.
  • 3. Cannot read the complete genome with the sequencer from one end to the other! DNA isolated from a cell is amplified. Broken into fragments (shearing). Fragments are "read" with the sequencer. Use the fragments (reads) to reconstruct the genome. Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press.
  • 4. Assembly: hierarchical process to reconstruct the genome from reads. Assemble the puzzle of the genome from the reads: overlaps connect the pieces. Oversample the genome so that reads overlap. Key approach: a data structure representing overlaps, and algorithms operating on that data structure. Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press.
  • 5. Two major algorithmic paradigms for genome assembly. Overlap - Layout - Consensus (OLC): well established, more powerful method, but more difficult to implement. OLC: first to be used successfully for complex eukaryotic genomes (Drosophila, H. sapiens). deBruijn - Euler: newer, easier to implement, problematic in complex genomes (for current implementations).
  • 6. Find Overlaps by aligning the sequences of the reads. Layout the reads based on which aligns to which. Get Consensus by joining all read sequences, merging overlaps. The sequencer reads in random direction, left-to-right or right-to-left. Credit: Masahiro Kasahara, Large-Scale Genome Sequence Processing, Imperial College Press.
  • 7. Sequence alignment, all-against-all reads (Smith-Waterman, BLAST, other?). Computationally intensive but easily parallelizable. Represent each read overlap by connecting the two reads with a directed link. First step in creating the genome assembly graph (more later). Credit: Kececioglu and Myers 1995, Algorithmica 13:7-51.
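The overlap step on this slide can be illustrated with a minimal sketch. It assumes error-free reads and exact suffix-prefix matches, which is a simplification: a real assembler would use an alignment method such as the ones named on the slide. The function names `overlap_length` and `overlap_graph`, the `min_len` parameter, and the example reads are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads, min_len: int = 3):
    """All-against-all comparison of reads.

    Returns {(a, b): overlap} where each key is a directed link a -> b.
    """
    edges = {}
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_len)
        if length:
            edges[(a, b)] = length
    return edges


if __name__ == "__main__":
    example_reads = ["ATGCGT", "CGTACC", "ACCTGA"]  # made-up reads
    for (a, b), length in overlap_graph(example_reads).items():
        print(f"{a} -> {b} (overlap {length})")
```

Every ordered pair of reads is compared independently of the others, which is why this step is costly but easy to split across processors.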
  • 8. Create a consistent linear (ideally) ordering of the reads. Remove redundancy, so no two dovetails leave the same edge. No containment edge is followed by a dovetail edge. Remove cycles: one link in, one out.
  • 9. Multiple Sequence Alignment (ClustalW) algorithms? No phylogeny here... Vote for the most abundant nucleotide at each position. Incorporate read quality data. Create a pre-consensus from high-quality reads, and align the remaining reads to it.
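The per-position vote can be sketched as follows. This is a minimal illustration that assumes the layout step has already assigned each read a start offset and that reads contain no insertions or deletions. It does not use quality values, which the slide says should also be incorporated. The function name `consensus` and the example layout are hypothetical.

```python
from collections import Counter


def consensus(layout):
    """Majority-vote consensus.

    `layout` is a list of (offset, read) pairs, where `offset` is the
    position of the read's first base in the layout.
    Positions covered by no read are reported as 'N'.
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    # Pick the most abundant nucleotide in every column.
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


if __name__ == "__main__":
    example_layout = [
        (0, "ATGCGT"),
        (3, "CGAACC"),  # carries one sequencing error (A instead of T)
        (3, "CGTACC"),
        (6, "ACCTGA"),
    ]
    print(consensus(example_layout))
```

In the example the erroneous base is outvoted two to one by the other reads covering that position.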
  • 10. Celera Assembler. Developed at Celera Genomics for the first Drosophila and human genome assemblies. Continued development at J. Craig Venter Inst. as an open source project: http://wgs-assembler.SourceForge.net (Licence: GPL). Plenty of wiki (developer + user) documentation, examples, user forums. Other OLC implementations: Arachne, PCAP, Newbler, Phrap, TIGR Assembler.
  • 11. Celera Assembler publications: Myers et al (2000) A whole-genome assembly of Drosophila; Levy et al (2007) The diploid genome sequence of an individual human; Zimin et al (2009) The domestic cow, Bos taurus; Dalloul et al (2010) The domestic turkey, Meleagris gallopavo; Lorenzi et al (2010) New assembly of Entamoeba histolytica; Lawniczak et al (2010) Divergence in Anopheles gambiae; Jones et al (2011) The marine filamentous cyanobacterium Lyngbya majuscula; Miller et al, The Tasmanian devil, Sarcophilus harrisii; Prüfer et al, The great ape bonobo, Pan paniscus; Gordon et al, The cotton bollworm moth, Helicoverpa.
  • 12. And now a bit of Graph Theory...
  • 13. Graph G with set of vertices (nodes) V: {P,T,Q,S,R} and set of edges (links between nodes) E: {(P,T),(P,Q),(P,S),(Q,T),(S,T),(Q,S),(S,Q),(Q,R),(R,S)}. Walk from P to R: (P,Q),(Q,R). Walk from R to T: (R,S),(S,Q),(Q,T) or (R,S),(S,T). Walk from R to P: not possible. Credit: Introduction to Graph Theory, Robin J. Wilson.
  • 14. Trail: a walk of the graph where each edge is visited only once. Example Trail: (P,Q), (Q,R), (R,S), (S,Q), (Q,S), (S,T). Path: a walk where each vertex is visited only once. Example Path: (P,Q), (Q,R), (R,S), (S,T).
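The definitions of walk, trail and path can be checked mechanically on the example graph G from the slides. The sketch below is only an illustration: the helper names `is_walk`, `is_trail`, `is_path` and `reachable` are hypothetical, and a walk is represented as a non-empty list of directed edges.

```python
from collections import deque

# Directed edges of the example graph G, as listed on the slide.
EDGES = [("P", "T"), ("P", "Q"), ("P", "S"), ("Q", "T"), ("S", "T"),
         ("Q", "S"), ("S", "Q"), ("Q", "R"), ("R", "S")]
EDGE_SET = set(EDGES)


def is_walk(walk):
    """Every edge exists in G and consecutive edges share a vertex."""
    in_graph = all(edge in EDGE_SET for edge in walk)
    connected = all(a[1] == b[0] for a, b in zip(walk, walk[1:]))
    return in_graph and connected


def is_trail(walk):
    """A walk in which no edge is used more than once."""
    return is_walk(walk) and len(set(walk)) == len(walk)


def is_path(walk):
    """A walk in which no vertex is visited more than once."""
    vertices = [walk[0][0]] + [edge[1] for edge in walk]
    return is_walk(walk) and len(set(vertices)) == len(vertices)


def reachable(start):
    """Vertices that can be reached from `start` along directed edges."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for a, b in EDGES:
            if a == node and b not in seen:
                seen.add(b)
                queue.append(b)
    return seen


if __name__ == "__main__":
    trail = [("P", "Q"), ("Q", "R"), ("R", "S"),
             ("S", "Q"), ("Q", "S"), ("S", "T")]
    path = [("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "T")]
    print(is_trail(trail), is_path(trail))  # trail revisits Q and S
    print(is_path(path))
    print("P" in reachable("R"))            # no walk from R to P
```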
  • 15. Credit: Saad Mneimneh, CUNY.
  • 16. Represent sequence overlaps as a graph with weighted edges. SCS (shortest common superstring) solution: find a Path (visiting every vertex exactly once) that maximizes the sum of edge weights. Hamiltonian Cycle or Traveling Salesman Problem.
  • 17. Which edge to start from? NO: misses a vertex. NO: misses the edge with large weight.
  • 18. YES!: all vertices and the edge with large weight.
  • 19. A more realistic version of a read / string overlap graph (C. jejuni). Credit: Eugene W. Myers, Bioinformatics 21:79-85.
  • 20. Computational Complexity. Finding the SCS by searching for a Hamiltonian Cycle on a graph is a difficult algorithmic problem (NP-hard). Approximation or greedy algorithms can yield 2- to 4-approximation solutions (a string up to two or four times the length of the optimal shortest string). Transformation of the Overlap Graph to a String Graph leads to a polynomial-time solution. No assembler implementation yet. Polynomial (P): $O(n)$, $O(n^2)$, $O(n^3)$, etc.
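A greedy approach of the kind mentioned on this slide can be sketched in a few lines: repeatedly merge the two strings with the largest suffix-prefix overlap until one string remains. This is an illustrative sketch and not the method of any particular assembler. It assumes error-free reads and exact overlaps, and the result is a common superstring that is not guaranteed to be the shortest one. The names `overlap_length` and `greedy_superstring` are hypothetical.

```python
def overlap_length(a: str, b: str) -> int:
    """Longest suffix of `a` that is also a prefix of `b` (0 if none)."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_superstring(reads):
    """Greedy merge: always join the pair with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads fully contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best = (-1, 0, 1)  # (overlap, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    length = overlap_length(a, b)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        merged = reads[i] + reads[j][length:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0] if reads else ""


if __name__ == "__main__":
    print(greedy_superstring(["ATGCGT", "CGTACC", "ACCTGA"]))
```

Each round scans all ordered pairs, so the sketch runs in polynomial time, but the locally best merge can block a better global solution, which is where the approximation factor comes from.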
  • 21. Pevzner, Tang and Waterman (2001), An Eulerian path approach to DNA fragment assembly, PNAS 98:9748-9753.
  • 22. deBruijn graph: a directed graph representing overlaps between sequences of symbols. Credit: Wikipedia.
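One common way to build such a graph from reads is sketched below. It takes every k-mer in the reads, uses its first k-1 symbols as the source node and its last k-1 symbols as the target node, and adds one directed edge per distinct k-mer. Counting each distinct k-mer once is a simplifying assumption made here, and the function name `de_bruijn_graph`, the value of `k`, and the example reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Return {(k-1)-mer: [successor (k-1)-mers]} built from the reads.

    Each distinct k-mer contributes exactly one edge
    prefix(k-mer) -> suffix(k-mer).
    """
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    example_reads = ["ATGGCG", "GCGTGC", "GTGCA"]  # made-up reads
    for node, successors in de_bruijn_graph(example_reads, k=3).items():
        print(node, "->", successors)
```

Unlike the overlap graph, no read-against-read comparison is needed: overlaps are implicit in shared (k-1)-mers.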
  • 26. In a real genome scenario... Credit: Flicek and Birney 2009, Nature Methods 6, S6-S12.
  • 27. Euler's algorithm. Using Euler's algorithm we can find a path that visits each edge of the de Bruijn genome assembly graph once, in order to concatenate the edge labels and "spell out" the assembly. Polynomial time! Credit: Wikipedia.
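A path that uses every edge once can be found with Hierholzer's algorithm, which is one standard way to compute an Eulerian path and is used here only as an illustration. The sketch assumes the graph is given as a dictionary from each (k-1)-mer node to its list of successor nodes and that an Eulerian path exists. The function names `eulerian_path` and `spell` and the example graph are hypothetical.

```python
from collections import defaultdict


def eulerian_path(graph):
    """Hierholzer's algorithm on {node: [successors]}.

    Assumes an Eulerian path exists. Returns the visited nodes in order.
    """
    out_edges = {node: list(succ) for node, succ in graph.items()}
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, succ in graph.items():
        balance[node] += len(succ)
        for nxt in succ:
            balance[nxt] -= 1
    # Start at the node with one surplus outgoing edge, if there is one;
    # otherwise the graph has an Eulerian cycle and any node will do.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges.get(node):
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: emit node
    return path[::-1]


def spell(path):
    """Concatenate the first node with the last symbol of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    example_graph = {  # made-up de Bruijn graph with k = 3
        "AT": ["TG"], "TG": ["GG", "GC"], "GG": ["GC"],
        "GC": ["CG", "CA"], "CG": ["GT"], "GT": ["TG"],
    }
    print(spell(eulerian_path(example_graph)))
```

Each edge is pushed and popped once, so the running time is linear in the number of edges. When the graph contains repeated (k-1)-mers, more than one Eulerian path may exist, so the spelled string is one valid reconstruction rather than a unique answer.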
  • 28. Euler assembler (the very first), Pevzner et al 2001 PNAS 98:9748-9753. Velvet assembler (more user friendly). Both of those assemblers store the complete graph in computer memory: 512GB-1024GB for human genomes. At JCVI we have two 1024GB (1TB) RAM servers for assembly. Others: ABySS, YAGA, Contrail-Bio, PASHA, which are parallel (distributed memory) assemblers on computer clusters.
  • 29. Thank you! Contact: kkrampis@jcvi.org. We hire interns at the J. Craig Venter Institute: http://www.jcvi.org/cms/education/internship-program/ Some of my other projects - Cloud Computing: http://tinyurl.com/cloudbiolinux-jcvi http://www.cloudbiolinux.org