Sequencing, Alignment and
       Assembly


        Shaun Jackman
   Genome Sciences Centre
   of the BC Cancer Agency
       Vancouver, Canada
          2011-July-14
Outline
●   DNA sequencing
●   Sequence alignment
●   Sequence assembly
●   Running ABySS
●   Assembly visualization (ABySS-Explorer)
●   Transcriptome assembly, alternative splicing,
    and visualization


                                                    2
DNA sequencing technologies
●   Sanger
●   454 Life Sciences
●   Illumina
●   SOLiD
●   Ion Torrent
●   Pacific Biosciences
●   Helicos

                                    3
Sequence alignment




                     4
Sequence alignment
●   Global sequence alignment
●   Local sequence alignment
●   Glocal sequence alignment




    The term glocal is a portmanteau of global and local.



                                                            5
Global alignment
●   Base-by-base alignment of one sequence to
    another allowing for both mismatches and gaps
●   Example:
    AGAGTGCTGCCGCC
    AGATGTACTGCGCC
●   Alignment:
    AGA-GTGCTGCCGCC
    ||| || |||| |||
    AGATGTACTGC-GCC
●   12 matches of 15 bp = 80% identity
                                                    6
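The identity figure on the slide above can be recomputed from the two aligned rows. This is a minimal sketch; the function name `alignment_identity` is made up for illustration, and it assumes gaps are written as `-` and that identity is counted over all alignment columns, as the slide does.

```python
def alignment_identity(top: str, bottom: str) -> tuple[int, int, float]:
    """Return (matches, columns, percent identity) for two gapped rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    # A column is a match only when both rows hold the same base (not a gap).
    matches = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    return matches, len(top), 100.0 * matches / len(top)


matches, columns, identity = alignment_identity("AGA-GTGCTGCCGCC",
                                                "AGATGTACTGC-GCC")
print(f"{matches} matches of {columns} bp = {identity:.0f}% identity")
```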
Local alignment
●   Given two sequences, find a matching
    substring from each of those two sequences
●   Example:
    AGATGTGCTGCCGCC
    TTTGTACTGAAA
●   AGATGTGCTGCCGCC
       ||| |||
     TTTGTACTGAAA
●   6 matches of 7 bp = 86% identity
                                                 7
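One common way to compute a local alignment is the Smith-Waterman dynamic programme. The slides do not name an algorithm, so this is only a sketch, and the scores (match +1, mismatch -1, gap -1) are assumptions chosen for the toy example.

```python
def smith_waterman(a: str, b: str, match: int = 1, mismatch: int = -1,
                   gap: int = -1) -> tuple[int, str, str]:
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("AGATGTGCTGCCGCC", "TTTGTACTGAAA")
print(score, top, bottom)
```

With these scores the best local alignment is the 7 bp block shown on the slide (6 matches, 1 mismatch).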
Glocal alignment
●   Given a query sequence and a reference
    sequence, identify a substring of the reference
    sequence that matches the entirety of the query
    sequence.
●   Example:
    Reference: ACGTAGATGTGCTGCCGCCACGT
    Query: TTTGTACTGAAA
●   ACGTAGATGTGCTGCCGCCACGT
           ||| |||
         TTTGTACTGAAA
●   6 matches of 12 bp = 50% identity
                                                  8
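A gap-free version of glocal alignment can be sketched by sliding the whole query along the reference and keeping the placement with the most matches. Real glocal aligners also allow gaps; the function name `best_ungapped_glocal` is hypothetical, and the reference string is the one drawn in the alignment above.

```python
def best_ungapped_glocal(reference: str, query: str) -> tuple[int, int]:
    """Return (offset, matches) of the best gap-free placement of the query."""
    if len(query) > len(reference):
        raise ValueError("query must not be longer than the reference")
    best_offset, best_matches = 0, -1
    for offset in range(len(reference) - len(query) + 1):
        window = reference[offset:offset + len(query)]
        # Every query base is scored, which is what makes this glocal.
        matches = sum(1 for r, q in zip(window, query) if r == q)
        if matches > best_matches:
            best_offset, best_matches = offset, matches
    return best_offset, best_matches


reference = "ACGTAGATGTGCTGCCGCCACGT"
query = "TTTGTACTGAAA"
offset, matches = best_ungapped_glocal(reference, query)
print(f"offset {offset}: {matches} matches of {len(query)} bp "
      f"= {100 * matches / len(query):.0f}% identity")
```

Unlike the local alignment, the unmatched ends of the query count against the identity, which is why the same 6 matches give 50% here.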
Criteria for choosing an aligner
●   Global, local or glocal alignment
●   Aligning short sequences to long sequences
    such as short reads to a reference
●   Aligning long sequences to long sequences
    such as long reads or contigs to a reference
●   Handles small gaps (insertions and deletions)
●   Handles large gaps (introns)
●   Handles split alignments (chimera)
●   Speed and ease of use
                                                    9
Short sequence aligners
●   Bowtie
●   BWA
●   GSNAP
●   SOAP




                                     10
Long sequence aligners
●   BLAT
●   BWA-SW
●   Exonerate
●   GMAP
●   MUMmer




                                    11
Seed and extend
●   For large sequences, an exhaustive alignment
    is very slow
●   Many aligners start by finding perfect or
    near-perfect matches to seeds
●   The seeding strategy has a large effect on the
    sensitivity of the aligner
●   BLAT, for example, requires two perfect nearby
    11-mer matches

                                                     12
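The seeding step can be sketched with a k-mer index of the reference. This is an illustration only, not BLAT's implementation: the toy sequences use k = 3 rather than 11, and "nearby" is simplified to "on the same diagonal" (reference position minus query position), which is an assumption.

```python
from collections import Counter, defaultdict


def seed_hits(reference: str, query: str, k: int) -> list[tuple[int, int]]:
    """Return (query_pos, reference_pos) for every exact k-mer match."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    hits = []
    for qpos in range(len(query) - k + 1):
        for rpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, rpos))
    return hits


def candidate_diagonals(hits, min_hits: int = 2) -> list[int]:
    """Diagonals supported by at least min_hits seeds; extend only these."""
    counts = Counter(rpos - qpos for qpos, rpos in hits)
    return sorted(d for d, n in counts.items() if n >= min_hits)


hits = seed_hits("ACGTAGATGTGCTGCCGCCACGT", "TTTGTACTGAAA", k=3)
print(hits)
print(candidate_diagonals(hits))
```

Only the diagonals that pass the seed filter need the slow base-by-base extension, which is where the speed-up comes from.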
Sequence assembly




                    13
Assembly
●   Reference-based assembly
    ●   Align, Layout, Consensus
    ●   not de novo
●   de novo assembly




                                   14
De Novo Assembly Strategies
●   Hierarchical sequencing
●   Shotgun sequencing




                                    15
Applications of Assembly
●   Genome
●   Exome
●   Transcriptome
●   Amplicon




                                    16
Assembly Algorithms
●   Greedy
●   Overlap, layout, consensus
●   De Bruijn Graph or k-mer assembly
●   Burrows-Wheeler transform and FM-Index
●   Clustering




                                             17
Greedy
●   Find two sequences with the largest overlap
    and merge them; repeat
●   Flaw: prone to misassembly




                                                  18
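The greedy strategy can be sketched in a few lines. The minimum overlap of 3 bases and the function names are assumptions for illustration; the sketch ignores reverse complements and sequencing errors.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    min_len is assumed to be at least 1.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return seqs  # nothing left to merge
        merged = seqs[i] + seqs[j][n:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (i, j)]
        seqs.append(merged)


print(greedy_assemble(["AGATGTGC", "TGTGCTGCC", "CTGCCGCC"]))
```

Because each merge is final, a repeat that produces a large but wrong overlap is merged without question, which is the misassembly risk noted on the slide.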
Overlap, Layout, Consensus
●   Overlap
    Find all pairs of sequences that overlap
●   Layout
    Remove redundant and weak overlaps
●   Consensus
    Merge pairs of sequences that overlap
    unambiguously. That is, pairs of sequences that
    overlap only with each other and no other
    sequence.

                                                  19
Overlap graph
●   A vertex is a string
●   An edge represents an overlap between two
    strings
●   Used by Overlap-Layout-Consensus
    assemblers
         U AGATGTGCTGCCGCC
         V        TGCTGCCGCCTTGGA

                    U      V
                                                20
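The overlap graph for U and V above can be built by testing every ordered pair of strings. This is a brute-force sketch; the names and the minimum overlap of 5 are assumptions, and real overlap assemblers use an index rather than an all-pairs loop.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads: dict, min_overlap: int) -> dict:
    """reads maps a name to a sequence. Returns {(u, v): overlap length}."""
    edges = {}
    for u, seq_u in reads.items():
        for v, seq_v in reads.items():
            if u != v:
                n = overlap(seq_u, seq_v)
                if n >= min_overlap:
                    edges[(u, v)] = n  # directed edge u -> v
    return edges


reads = {"U": "AGATGTGCTGCCGCC", "V": "TGCTGCCGCCTTGGA"}
print(overlap_graph(reads, min_overlap=5))
```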
De Bruijn Graph
●   A De Bruijn Graph is a particular kind of overlap
    graph
●   Every vertex is a string of length k
●   Every edge is an overlap of length k-1
●   Used by De Bruijn Graph assemblers




                                                    21
De Bruijn Graph
●   For each input read of length l, (l - k + 1) k-mers
    are generated by sliding a window of length k
    over the read
●   Each k-mer is a vertex of the de Bruijn graph
●   Two adjacent k-mers are an edge of the de Bruijn
    graph
      Read (l = 12):
         ATCATACATGAT
      k-mers (k = 9):
         ATCATACAT
          TCATACATG
           CATACATGA
            ATACATGAT

                                                      22
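The sliding window can be written as a one-line helper. A minimal sketch; the function name `kmers` is only for illustration.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return the (l - k + 1) k-mers of a read of length l, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


for kmer in kmers("ATCATACATGAT", 9):
    print(kmer)
```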
De Bruijn Graph
●   A simple graph for k = 5
●   Two reads
    ●   GGACATC
    ●   GGACAGA
          GGACA → GACAT → ACATC
          GGACA → GACAG → ACAGA


                                            23
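The small graph above can be built from the two reads as follows. A sketch only: it stores the graph as a plain dict from each k-mer to the set of k-mers that follow it in some read, and it ignores reverse complements, which a real assembler has to handle.

```python
def de_bruijn_graph(reads, k: int) -> dict:
    """Map each k-mer (vertex) to the set of k-mers that follow it (edges)."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        # Consecutive k-mers in a read overlap by k-1 bases.
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


graph = de_bruijn_graph(["GGACATC", "GGACAGA"], k=5)
for vertex, successors in graph.items():
    print(vertex, "->", sorted(successors))
```

The vertex GGACA has two successors, which is the branch where the two reads diverge.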
Burrows-Wheeler transform
             and the FM-index
●   A return to Overlap, Layout, Consensus
●   Uses the Ferragina-Manzini index to find all the
    pairs of overlapping sequences efficiently




                                                   24
Overlap, Layout, Consensus
●   ARACHNE
●   CAP3
●   Celera assembler
●   MIRA
●   Newbler
●   Phrap



                                    25
De Bruijn Graph
●   ABySS
●   ALLPATHS
●   SOAPdenovo
●   Velvet




                                 26
Burrows-Wheeler Transform
●   String Graph Assembler (SGA)




                                   27
Clustering
●   Phusion (and Phrap)
●   Curtain (and Velvet)




                                28
ABySS
●   de Bruijn graph assembler
●   Strengths
    ●   small memory footprint
    ●   distributed processing using MPI
    ●   can handle very large genomes




                                           29
Velvet
●   de Bruijn graph assembler
●   Strengths
    ●   can use paired-end or mate-pair libraries
    ●   can use long reads
    ●   can use a reference genome




                                                    30
SGA
●   Overlap assembler using the BWT
●   Strengths
    ●   small memory footprint
    ●   mix short reads and long reads
    ●   resolving repeats with size near the read length




                                                           31
Assembling to find variants




                              32
Small deletion in a tandem repeat
●   The reference has 5 repetitions of a short
    7-base sequence: GGCTGGA
●   The sample has only 4 repetitions, one fewer
        Sample
0006813 TCCAAAT.......ggctggaggctggaggctggaggctggaggcATGTGTTAGTG 0006861
>>>>>>> |||||||       |||||||||||||||||||||||||||||||||||||||||| >>>>>>>
2356747 TCCAAATggctggaggctggaggctggaggctggaggctggaggcATGTGTTAGTG 2356802
        Reference
Alignment of short reads may not
            show the deletion
●   Aligning reads to the reference perfectly covers the
    reference with no more than 2 errors per read
●   Alignment will not find the small 7-base deletion
Reference:
        TCCAAATggctggaggctggaggctggaggctggaggctggaggcATGTGTTAGTG

Alignment:
        TCCAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGG
         CCAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGC
          CAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCA
           AAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCAT
            AATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATG
                    ATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGT
                     TGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTG
                      GGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGT
                       GCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTT
                        CTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTA
                         TGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAG
                          GGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAGT
                           GAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAGTG
Assembly clearly shows the deletion
●   Assembling the reads and aligning the resulting contig to
    the reference clearly shows the small 7-base deletion.
Reads:  TCCAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGG
         CCAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGC
          CAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCA
           AAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCAT
            AATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATG
             ATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGT
              TGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTG
               GGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGT
                GCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTT
                 CTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTA
                  TGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAG
                   GGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAGT
                    GAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAGTG
Contig: TCCAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAGTG

Alignment:
0006813 TCCAAAT.......ggctggaggctggaggctggaggctggaggcATGTGTTAGTG 0006861
>>>>>>> |||||||       |||||||||||||||||||||||||||||||||||||||||| >>>>>>>
2356747 TCCAAATggctggaggctggaggctggaggctggaggctggaggcATGTGTTAGTG 2356802
Running ABySS




                36
Input file formats of ABySS
●   FASTA
●   FASTQ
●   Illumina QSEQ
●   Eland export
●   SAM
●   BAM
●   Compressed: gz, bz2, xz, tar

                                      37
Running ABySS
●   Assemble the paired-end reads in the file
    reads.fa
    ● abyss-pe name=ecoli k=32 n=10
        in=reads.fa
●   Assemble the paired-end reads in the files
    reads_1.fa and reads_2.fa:
    ●   abyss-pe name=ecoli k=32 n=10
          in='reads_1.fa reads_2.fa'

                                                 38
Running ABySS in parallel
●   Run ABySS using eight processes
    ●  abyss-pe np=8 name=ecoli k=32 n=10
          in='reads_1.fa reads_2.fa'
●   ABySS uses MPI, the Message Passing
    Interface. Open MPI is an open-source
    implementation of MPI




                                            39
Running ABySS in parallel
           on a cluster (SGE)
●   Run ABySS on a cluster using 8 processes
    ● qsub -pe openmpi 8 -N ecoli
         abyss-pe np=8 name=ecoli k=32 n=10
         in='reads_1.fa reads_2.fa'
●   abyss-pe uses the environment variables
    JOB_NAME and NSLOTS passed to it by SGE
    as the default values for name and np



                                              40
Running ABySS in parallel
           on a cluster (SGE)
          for many values of k
●   Assemble every 8th k from 32 to 96
    ● qsub -pe openmpi 8 -N ecoli -t 32-96:8
        abyss-pe n=10
        in='reads_1.fa reads_2.fa'
●   abyss-pe uses the environment variable
    SGE_TASK_ID passed to it by SGE as the
    default value for k



                                               41
Assembling multiple libraries
●   abyss-pe name=ecoli
      k=32 n=10
      lib='pe200 pe500'
      pe200='pe200_1.fa pe200_2.fa'
      pe500='pe500_1.fa pe500_2.fa'




                                       42
Assembling a mix of paired-end and
        single-end reads
●   abyss-pe name=ecoli
      k=32 n=10
      lib='pe200 pe500'
      pe200='pe200_1.fa pe200_2.fa'
      pe500='pe500_1.fa pe500_2.fa'
      se='long.fa'




                                      43
Parameters of ABySS
●   name: name of the assembly
●   lib: name of the libraries (one or more)
●   se: paths of the single-end read files
●   ${lib}: paths of the read files for that library
●   Example
    abyss-pe name=ecoli k=32 n=10
      lib='pe200 pe500'
      pe200='pe200_1.fa pe200_2.fa'
      pe500='pe500_1.fa pe500_2.fa'
      se='long.fa'
                                                     44
Parameters of ABySS
              Sequence assembly
●   k: the size of a k-mer
●   q: quality trimming removes low-quality bases
    from the ends of reads
●   e and c: coverage-threshold parameters
    ●   e: erosion removes bases from the ends of contigs
    ●   c: coverage threshold removes entire contigs
●   p: the minimum identity for bubble popping


                                                            45
Parameters of ABySS
            Paired-end assembly
●   s: the minimum size of a seed contig
●   n: the number of pairs required to join two
    contigs
●   Example
    abyss-pe name=ecoli
      k=64 q=3 p=0.9 s=100 n=10
      lib='pe200 pe500'
      pe200='pe200_1.fa pe200_2.fa'
      pe500='pe500_1.fa pe500_2.fa'
      se='long.fa'
                                                  46
Stages of ABySS
●   Assemble the reads without paired-end
    information
●   Map the reads back to the assembly
●   Use the paired-end information to merge
    contigs from the first stage into larger
    sequences




                                               47
Optimizing k
●   Assemble every 8th k from 32 to 96
    Nine assemblies: 32 40 48 56 64 72 80 88 96
●   Find the peak
●   Assemble every 2nd k around the peak
    For example, if the peak were at k=64...
    Eight assemblies: 56 58 60 62 66 68 70 72
●   SGE:
    qsub -t 32-96:8 qsub-abyss.sh
    qsub -t 56-72:2 qsub-abyss.sh
                                                  48
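The two rounds of the k sweep can be listed with a pair of small helpers. The function names and the radius of 8 around the peak are assumptions chosen to reproduce the numbers on the slide; the peak itself is skipped in the second round because it was already assembled in the first.

```python
def coarse_ks(start: int = 32, stop: int = 96, step: int = 8) -> list[int]:
    """First round: every 8th k from start to stop inclusive."""
    return list(range(start, stop + 1, step))


def fine_ks(peak: int, radius: int = 8, step: int = 2) -> list[int]:
    """Second round: every 2nd k around the peak, excluding the peak."""
    return [k for k in range(peak - radius, peak + radius + 1, step)
            if k != peak]


print(coarse_ks())
print(fine_ks(64))
```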
Output files of ABySS
●   ${name}-contigs.fa
    The final contigs in FASTA format
●   ${name}-bubbles.fa
    The equal-length variant sequences (FASTA)
●   ${name}-indel.fa
    The different-length variant sequences (FASTA)
●   ${name}-contigs.dot
    The contig overlap graph in Graphviz format

                                                  49
Intermediate output files of ABySS
●   .adj: contig overlap graph in ABySS adj format
●   .dist: estimates of the distance between contigs
    in ABySS dist format
●   .path: lists of contigs to be merged
●   .hist: fragment-size histogram of a library
●   coverage.hist: k-mer coverage histogram



                                                     50
Assembly/alignment visualization




                                   51
Assembly/alignment visualization
●   Display how the reads were used in the
    assembly (or align to the reference)
●   Show paired-end reads and highlight locations
    where the pairs are discordant
●   Browse annotations and variants
●   Standard file formats are BAM, VCF and GFF,
    though there are many others



                                                    52
Visualization tools
●   UCSC Genome Browser
●   Integrative Genomics Viewer (IGV)
●   Tablet
●   gap5
●   consed
●   ABySS-Explorer



                                        53
Integrative Genomics Viewer (IGV)
●   Can visualize short
    read alignments and
    many other types of
    data




                                        54
ABySS-Explorer




                 55
ABySS-Explorer




                 56
K-mer coverage histogram
●   Counts the number of
    occurrences of each
    k-mer
●   Useful for estimating
    the size of the
    genome




                                   57
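A k-mer coverage histogram takes two counting passes: count each k-mer, then count how many k-mers share each count. This is a sketch on toy reads; it is not the code that writes ABySS's coverage.hist file, and it ignores reverse complements.

```python
from collections import Counter


def kmer_coverage_histogram(reads, k: int) -> Counter:
    """Map a coverage value to the number of distinct k-mers seen that often."""
    kmer_counts = Counter(read[i:i + k]
                          for read in reads
                          for i in range(len(read) - k + 1))
    return Counter(kmer_counts.values())


histogram = kmer_coverage_histogram(["ATCAT", "ATCAT", "TCATG"], k=4)
for coverage in sorted(histogram):
    print(coverage, histogram[coverage])
```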
N50 and Nxx plot
●   The N50 is the
    weighted median of
    contig sizes
●   The N50 summarizes
    a single point on the
    Nxx plot
●   Better assemblies are
    further to the right


                                 58
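The N50 and the other points of the Nxx plot can be computed by sorting the contigs from longest to shortest and accumulating their lengths. A minimal sketch; the function name `nxx` is only for illustration.

```python
def nxx(lengths, xx: int = 50) -> int:
    """Smallest contig length such that contigs at least this long
    hold at least xx percent of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * xx:
            return length
    raise ValueError("no contigs given")


contigs = [2, 3, 4, 5, 6]
print("N50 =", nxx(contigs, 50))
print("N90 =", nxx(contigs, 90))
```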
ABySS-Explorer
Assembly graph visualization




                               59
Assembly Ambiguities

 True genome sequence

GGATTGAAAAAAAAAAAAAAAAGTAGCACGAATATACATAGAAAAAAAAAAAAAAAAATTACG




Assembled sequence
de Bruijn graph representation




                                 Cydney Nielsen               60
Starting Point




   Cydney Nielsen   61
Cydney Nielsen   62
Sequence length




                         one oscillation = 100 nt



        Cydney Nielsen                              63
Paired-end reads




 After building the initial single-end (SE) contigs from k-mer
sequences, ABySS uses paired-end reads to resolve ambiguities.


                          Cydney Nielsen                         64
Paired-end contigs

Paired-end reads are used to construct paired-end (PE) contigs




        … 13+ 44- 46+ 4+ 79+ 70+ …


        blue gradient = paired end contig
        orange = selected single end contig
                       Cydney Nielsen                            65
Cydney Nielsen   66
Cydney Nielsen   67
Transcriptome Assembly,
   Alternative Splicing
           and
       Visualization



                          68
http://www.eurasnet.info/clinicians/alternative-splicing/what-is-alternative-splicing/diversity
Assembly                    ABySS
Alignment                   GMAP
Detection & Visualisation   Sircah
ABySS

  Assemble transcriptome data

Transcriptome reads → Assembly
GMAP

Align contigs to the reference genome
           Annotate introns

      Assembly → Alignments
Sircah

Detect alternative splicing events

Alignments → Alternative splicing
EST_match
Sircah Visualisation

        Draw splicing diagrams

Alternative splicing → Splicing diagrams
EST_match




SpliceGraph
Acknowledgments
    Supervisors
●   İnanç Birol
●   Steven Jones
    Team
●   Readman Chiu
●   Rod Docking
●   Ka Ming Nip
●   Karen Mungall
●   Jenny Qian
●   Tony Raymond
                                    80
ABySS Algorithm




                  81
An assembly in two stages
●   Stage I: Sequence assembly algorithm
●   Stage II: Paired-end assembly algorithm




                                              82
Stage 1
      Sequence assembly algorithm
●   Load k-mers: load the reads, breaking each read
    into k-mers
●   Find overlaps: find adjacent k-mers, which
    overlap by k-1 bases
●   Prune tips: remove k-mers resulting from
    read errors
●   Pop bubbles: remove variant sequences
●   Generate contigs



                                                        83
Load the reads
●   For each input read of length l, (l - k + 1) k-mers
    are generated by sliding a window of length k
    over the read
●   Each k-mer is a vertex of the de Bruijn graph
●   Two adjacent k-mers are an edge of the de Bruijn
    graph
      Read (l = 12):
         ATCATACATGAT
      k-mers (k = 9):
         ATCATACAT
          TCATACATG
           CATACATGA
            ATACATGAT

                                                      84
De Bruijn Graph
●   A simple graph for k = 5
●   Two reads
    ●   GGACATC
    ●   GGACAGA
          GGACA → GACAT → ACATC
          GGACA → GACAG → ACAGA


                                            85
Pruning tips
●   Read errors cause
    tips




                                86
Pruning tips
●   Read errors cause
    tips
●   Pruning tips
    removes the
    erroneous reads
    from the assembly




                                87
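Tip pruning can be sketched on a graph stored as a dict from each k-mer to the set of its successors. This is not ABySS's implementation: the length threshold `max_tip_len` is an assumed parameter, only tips that end in a dead end are handled (not tips that start from one), and two short branches leaving the same vertex would both be removed.

```python
def prune_tips(graph: dict, max_tip_len: int) -> dict:
    """graph maps every vertex to the set of its successors.

    Removes dead-end branches of at most max_tip_len vertices that
    hang off a branch point.
    """
    pred = {v: set() for v in graph}
    for v, succs in graph.items():
        for s in succs:
            pred[s].add(v)
    to_remove = set()
    for v in graph:
        if graph[v]:
            continue  # not a dead end
        path, cur, is_tip = [v], v, False
        while len(pred[cur]) == 1:
            parent = next(iter(pred[cur]))
            if len(graph[parent]) > 1:
                is_tip = True  # reached the branch point
                break
            path.append(parent)
            cur = parent
        if is_tip and len(path) <= max_tip_len:
            to_remove.update(path)
    return {v: {s for s in succs if s not in to_remove}
            for v, succs in graph.items() if v not in to_remove}


# Correct read GGACATCGT plus an erroneous read GGACAG, k = 5.
graph = {
    "GGACA": {"GACAT", "GACAG"},
    "GACAT": {"ACATC"}, "ACATC": {"CATCG"}, "CATCG": {"ATCGT"},
    "ATCGT": set(),
    "GACAG": set(),  # tip caused by the read error
}
print(sorted(prune_tips(graph, max_tip_len=2)))
```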
Popping bubbles
●   Variant sequences cause
    bubbles
●   Popping bubbles removes
    the variant sequence from
    the assembly
●   Repeat sequences with
    small differences also
    cause bubbles




                                 88
Assemble contigs
●   Remove ambiguous
    edges
●   Output contigs in
    FASTA format




                                  89
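Contig generation can be sketched as a walk along unambiguous edges, taken here to mean an edge whose source has exactly one successor and whose target has exactly one predecessor. This is a simplified illustration, not ABySS's code; it skips reverse complements and does not emit contigs that form a closed cycle.

```python
from collections import defaultdict


def contigs(reads, k: int) -> list[str]:
    """Build the k-mer graph and merge k-mers along unambiguous edges."""
    succ, pred, vertices = defaultdict(set), defaultdict(set), set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        vertices.update(kmers)
        for a, b in zip(kmers, kmers[1:]):
            succ[a].add(b)
            pred[b].add(a)

    def unambiguous(a, b):
        return len(succ[a]) == 1 and len(pred[b]) == 1

    result = []
    for v in sorted(vertices):
        # Skip v if it simply continues a contig started earlier.
        if len(pred[v]) == 1 and unambiguous(next(iter(pred[v])), v):
            continue
        contig, cur = v, v
        while len(succ[cur]) == 1:
            nxt = next(iter(succ[cur]))
            if not unambiguous(cur, nxt):
                break
            contig += nxt[-1]  # each step adds one new base
            cur = nxt
        result.append(contig)
    return result


for contig in contigs(["GGACATC", "GGACAGA"], k=5):
    print(">contig")
    print(contig)
```

For the two reads of the earlier k = 5 example, the branch at GGACA splits the assembly into three contigs.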
Paired-end assembly algorithm
                       Stage 2
●   Align the reads to the contigs of the first stage
●   Generate an empirical fragment-size
    distribution using the paired reads that align to
    the same contig
●   Estimate the distance between contigs using
    the paired reads that align to different contigs




                                                        90
Align the reads to the contigs
                      KAligner
●   Every k-mer in the single-end
    assembly is unique
●   KAligner can map reads with k
    consecutive correct bases
●   ABySS may use other aligners,
    including BWA and bowtie




                                        91
Empirical fragment-size distribution
                     ParseAligns
●   Generate an empirical fragment-size
    distribution using the paired reads that align to
    the same contig




                                                        92
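The empirical distribution can be sketched by keeping only the pairs whose two reads align to the same contig and counting their fragment sizes. The tuple layout below is a made-up input format for illustration, not the format that ParseAligns reads.

```python
from collections import Counter


def fragment_size_counts(pairs) -> Counter:
    """pairs: (contig_1, fragment_start, contig_2, fragment_end) per read pair.

    Only pairs with both reads on the same contig give a fragment size.
    """
    sizes = Counter()
    for contig_1, start, contig_2, end in pairs:
        if contig_1 == contig_2:
            sizes[end - start] += 1
    return sizes


pairs = [
    ("c1", 100, "c1", 300),
    ("c1", 50, "c1", 250),
    ("c1", 10, "c1", 209),
    ("c1", 500, "c2", 40),  # spans two contigs, so it is skipped here
]
print(fragment_size_counts(pairs))
```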
Estimate distances between contigs
                     DistanceEst
●   Estimate the distance between contigs using
    the paired reads that align to different contigs

    [Figure: contigs linked by distance estimates
     d = 25 ± 8, d = 3 ± 5, d = 6 ± 5 and d = 4 ± 3]

                                                       93
Maximum likelihood estimator
                    DistanceEst
●   Use the empirical paired-
    end size distribution
●   Maximize the likelihood
    function
●   Find the most likely
    distance between the two
    contigs



                                     94
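The maximum likelihood idea can be sketched as follows. For each pair that spans two contigs, the part of the fragment that lies on the contigs is known, so a candidate distance d implies a fragment size of that partial span plus d. This is a simplified illustration and not the DistanceEst implementation; the term "partial span", the candidate range and the pseudocount are all assumptions.

```python
import math


def estimate_distance(size_counts: dict, partial_spans, candidates,
                      pseudocount: float = 1e-6) -> int:
    """Return the candidate distance d with the highest likelihood."""
    total = sum(size_counts.values())

    def log_likelihood(d):
        # Probability of each implied fragment size under the empirical
        # distribution; the pseudocount avoids log(0) for unseen sizes.
        return sum(math.log(size_counts.get(span + d, 0) / total + pseudocount)
                   for span in partial_spans)

    return max(candidates, key=log_likelihood)


size_counts = {199: 1, 200: 2, 201: 1}   # empirical fragment sizes
partial_spans = [175, 175, 174, 176]     # bases covered on the two contigs
print(estimate_distance(size_counts, partial_spans, range(0, 51)))
```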
Paired-end algorithm
                   continued...
●   Generate paths: find paths through the contig
    adjacency graph that agree with the distance
    estimates
●   Merge paths: merge overlapping paths
●   Generate contigs: merge the contigs in these
    paths and output the FASTA file




                                                      95
Find consistent paths
                    SimpleGraph
●   Find paths through the contig adjacency graph
    that agree with the distance estimates




                     d=4±3

                  Actual distance = 3
                                                    96
Merge overlapping paths
                    MergePaths
●   Merge paths that overlap




                                   97
Generate the FASTA output
●   Merge the contigs in these paths.
●   Output the FASTA file




    GATTTTTG   GAC GTCTTGATCTT   CAC    GTATTG CTATT

                                                       98

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Último (20)

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Sequencing, Alignment and Assembly

  • 1. Sequencing, Alignment and Assembly Shaun Jackman Genome Sciences Centre of the BC Cancer Agency Vancouver, Canada 2011-July-14
  • 2. Outline ● DNA sequencing ● Sequence alignment ● Sequence assembly ● Running ABySS ● Assembly visualization (ABySS-Explorer) ● Transcriptome assembly, alternative splicing, and visualization
  • 3. DNA sequencing technologies ● Sanger ● 454 Life Sciences ● Illumina ● SOLiD ● Ion Torrent ● Pacific Bio ● Helicos
  • 5. Sequence alignment ● Global sequence alignment ● Local sequence alignment ● Glocal sequence alignment ● The term glocal is a portmanteau of global and local.
  • 6. Global alignment
    ● Base-by-base alignment of one sequence to another allowing for both mismatches and gaps
    ● Example:
      AGAGTGCTGCCGCC
      AGATGTACTGCGCC
    ● Alignment:
      AGA-GTGCTGCCGCC
      ||| || |||| |||
      AGATGTACTGC-GCC
    ● 12 matches of 15 bp = 80% identity
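The identity figure on this slide can be checked with a few lines of Python. This is only a sketch: the function name `percent_identity` is made up for illustration, and it assumes identity is counted as matching columns divided by all alignment columns, gaps included, which is how the slide arrives at 12 of 15.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity of two equal-length alignment rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    # A column counts as a match only when both rows carry the same base.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)


# The two rows of the global alignment shown on the slide.
identity = percent_identity("AGA-GTGCTGCCGCC", "AGATGTACTGC-GCC")
print(f"{identity:.0f}% identity")
```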
  • 7. Local alignment
    ● Given two sequences, find a matching substring from each of those two sequences
    ● Example:
      AGATGTGCTGCCGCC
      TTTGTACTGAAA
    ● Alignment:
      AGATGTGCTGCCGCC
         ||| |||
       TTTGTACTGAAA
    ● 6 matches of 7 bp = 86% identity
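One common way to compute a local alignment is the Smith-Waterman dynamic programme. The slides do not name an algorithm or a scoring scheme, so the sketch below is an assumption throughout: the function name and the scores (match +1, mismatch -1, gap -2) are chosen only for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


score = local_alignment_score("AGATGTGCTGCCGCC", "TTTGTACTGAAA")
print(score)
```

With these scores, the 7-column alignment on the slide (6 matches, 1 mismatch) scores 6 - 1 = 5.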
  • 8. Glocal alignment
    ● Given a query sequence and a reference sequence, identify a substring of the reference sequence that matches the entirety of the query sequence.
    ● Example:
      Reference: ACGTAGATGTGCTGCCGCCACGT
      Query: TTTGTACTGAAA
    ● Alignment:
      ACGTAGATGTGCTGCCGCCACGT
             ||| |||
           TTTGTACTGAAA
    ● 6 matches of 12 bp = 50% identity
  • 9. Criteria for choosing an aligner ● Global, local or glocal alignment ● Aligning short sequences to long sequences such as short reads to a reference ● Aligning long sequences to long sequences such as long reads or contigs to a reference ● Handles small gaps (insertions and deletions) ● Handles large gaps (introns) ● Handles split alignments (chimera) ● Speed and ease of use
  • 10. Short sequence aligners ● Bowtie ● BWA ● GSNAP ● SOAP
  • 11. Long sequence aligners ● BLAT ● BWA-SW ● Exonerate ● GMAP ● MUMmer
  • 12. Seed and extend ● For large sequences, an exhaustive alignment is very slow ● Many aligners start by finding perfect or near perfect matches to seeds ● The seeding strategy has a large effect on the sensitivity of the aligner ● BLAT for example requires two perfect nearby 11-mer matches
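The seeding step can be sketched as an exact k-mer lookup. This is a toy illustration rather than how any particular aligner is implemented: the function names are hypothetical, and k = 3 is used only because the example sequences are short (BLAT, per the slide, uses 11-mers).

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_seed_hits(query, index, k):
    """Return (query_pos, ref_pos) for each exact k-mer match."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for rpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, rpos))
    return hits


# Sequences from the local alignment example; k = 3 is a toy seed length.
index = build_seed_index("AGATGTGCTGCCGCC", 3)
hits = find_seed_hits("TTTGTACTGAAA", index, 3)
print(hits)
```

Both hits lie on the same diagonal (ref_pos - query_pos = 1), which is the kind of evidence an aligner would then extend into a full alignment.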
  • 14. Assembly ● Reference-based assembly: align, layout, consensus (not de novo) ● De novo assembly
  • 15. De Novo Assembly Strategies ● Hierarchical sequencing ● Shotgun sequencing
  • 16. Applications of Assembly ● Genome ● Exome ● Transcriptome ● Amplicon
  • 17. Assembly Algorithms ● Greedy ● Overlap, layout, consensus ● De Bruijn Graph or k-mer assembly ● Burrows-Wheeler transform and FM-Index ● Clustering
  • 18. Greedy ● Find two sequences with the largest overlap and merge them; repeat ● Flaw: prone to misassembly
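The greedy strategy can be sketched as follows. The code assumes exact suffix-prefix overlaps and ignores reverse complements and sequencing errors, which a real assembler cannot do; the function names are illustrative only.

```python
def overlap_length(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(seqs, min_overlap=1):
    """Repeatedly merge the pair of sequences with the largest exact overlap."""
    seqs = list(seqs)
    while len(seqs) > 1:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap_length(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if i is None or n < min_overlap:
            break  # nothing left that overlaps enough
        merged = seqs[i] + seqs[j][n:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (i, j)] + [merged]
    return seqs


# Two strings that overlap by 10 bases.
contigs = greedy_assemble(["AGATGTGCTGCCGCC", "TGCTGCCGCCTTGGA"])
print(contigs)
```

Because the largest overlap is always taken first, a repeat that produces a long spurious overlap is merged before the true neighbour, which is the misassembly risk the slide mentions.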
  • 19. Overlap, Layout, Consensus ● Overlap: find all pairs of sequences that overlap ● Layout: remove redundant and weak overlaps ● Consensus: merge pairs of sequences that overlap unambiguously, that is, pairs of sequences that overlap only with each other and no other sequence
  • 20. Overlap graph ● A vertex is a string ● An edge represents an overlap between two strings ● Used by Overlap-Layout-Consensus assemblers ● Example: U = AGATGTGCTGCCGCC, V = TGCTGCCGCCTTGGA, edge U → V
  • 21. De Bruijn Graph ● A De Bruijn Graph is a particular kind of overlap graph ● Every vertex is a string of length k ● Every edge is an overlap of length k-1 ● Used by De Bruijn Graph assemblers
  • 22. De Bruijn Graph
    ● For each input read of length l, (l - k + 1) k-mers are generated by sliding a window of length k over the read
    ● Each k-mer is a vertex of the de Bruijn graph
    ● Two adjacent k-mers are an edge of the de Bruijn graph
    Read (l = 12):
      ATCATACATGAT
    k-mers (k = 9):
      ATCATACAT
       TCATACATG
        CATACATGA
         ATACATGAT
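The sliding window can be written in one line of Python. The helper name `kmers` is an illustrative choice; the read and k are the ones from the slide.

```python
def kmers(read, k):
    """All (l - k + 1) k-mers of a read of length l, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# The read from the slide: l = 12 and k = 9 give 12 - 9 + 1 = 4 k-mers.
for kmer in kmers("ATCATACATGAT", 9):
    print(kmer)
```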
  • 23. De Bruijn Graph
    ● A simple graph for k = 5
    ● Two reads
      ● GGACATC
      ● GGACAGA
    Graph:
      GGACA → GACAT → ACATC
      GGACA → GACAG → ACAGA
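A minimal sketch of building this graph from the two reads is shown below. It assumes error-free reads and ignores reverse complements, and the adjacency structure (a dict of sets) is an illustrative choice rather than ABySS's data structure.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Consecutive k-mers in a read overlap by k - 1 bases and form an edge.
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + 1 + k])
    return graph


graph = de_bruijn_graph(["GGACATC", "GGACAGA"], 5)
for source in sorted(graph):
    print(source, "->", sorted(graph[source]))
```

The vertex GGACA has two successors, which is the branch where the two reads diverge. Vertices with no outgoing edge (ACATC and ACAGA) appear only as targets.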
  • 24. Burrows-Wheeler transform and the FM-index ● A return to Overlap, Layout, Consensus ● Uses the Ferragina-Manzini index to find all the pairs of overlapping sequences efficiently
  • 25. Overlap, Layout, Consensus ● ARACHNE ● CAP3 ● Celera assembler ● MIRA ● Newbler ● Phrap
  • 26. De Bruijn Graph ● ABySS ● ALLPATHS ● SOAPdenovo ● Velvet
  • 27. Burrows-Wheeler Transform ● String Graph Assembler (SGA)
  • 28. Clustering ● Phusion (and Phrap) ● Curtain (and Velvet)
  • 29. ABySS ● de Bruijn graph assembler ● Strengths: small memory footprint; distributed processing using MPI; can handle very large genomes
  • 30. Velvet ● de Bruijn graph assembler ● Strengths: can use paired-end or mate-pair libraries; can use long reads; can use a reference genome
  • 31. SGA ● Overlap assembler using the BWT ● Strengths: small memory footprint; can mix short reads and long reads; resolves repeats with size near the read length
  • 32. Assembling to find variants
  • 33. Small deletion in a tandem repeat
    ● The reference has 5 repetitions of a short 7-base sequence: GGCTGGA
    ● The sample has only 4 repetitions, one fewer
    Sample:
      0006813 TCCAAAT.......ggctggaggctggaggctggaggctggaggcATGTGTTAGTG 0006861
      >>>>>>> ||||||| |||||||||||||||||||||||||||||||||||||||||| >>>>>>>
      2356747 TCCAAATggctggaggctggaggctggaggctggaggctggaggcATGTGTTAGTG 2356802
    Reference
  • 34. Alignment of short reads may not show the deletion
    ● Aligning reads to the reference perfectly covers the reference with no more than 2 errors per read
    ● Alignment will not find the small 7-base deletion
    Reference:
      TCCAAATggctggaggctggaggctggaggctggaggctggaggcATGTGTTAGTG
    Alignment:
      TCCAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGG
      CCAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGC
      CAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCA
      AAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCAT
      AATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATG
      ATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGT
      TGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTG
      GGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGT
      GCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTT
      CTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTA
      TGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAG
      GGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAGT
      GAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAGTG
  • 35. Assembly clearly shows the deletion
    ● Assembling the reads and aligning the resulting contig to the reference clearly shows the small 7-base deletion.
    Reads:
      TCCAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGG
      CCAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGC
      CAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCA
      AAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCAT
      AATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATG
      ATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGT
      TGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTG
      GGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGT
      GCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTT
      CTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTA
      TGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAG
      GGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAGT
      GAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAGTG
    Contig:
      TCCAAATGGCTGGAGGCTGGAGGCTGGAGGCTGGAGGCATGTGTTAGTG
    Alignment:
      0006813 TCCAAAT.......ggctggaggctggaggctggaggctggaggcATGTGTTAGTG 0006861
      >>>>>>> ||||||| |||||||||||||||||||||||||||||||||||||||||| >>>>>>>
      2356747 TCCAAATggctggaggctggaggctggaggctggaggctggaggcATGTGTTAGTG 2356802
  • 37. Input file formats of ABySS ● FASTA ● FASTQ ● Illumina QSEQ ● Eland export ● SAM ● BAM ● Compressed: gz, bz2, xz, tar
  • 38. Running ABySS ● Assemble the paired-end reads in the file reads.fa: ● abyss-pe name=ecoli k=32 n=10 in=reads.fa ● Assemble the paired-end reads in the files reads_1.fa and reads_2.fa: ● abyss-pe name=ecoli k=32 n=10 in='reads_1.fa reads_2.fa'
  • 39. Running ABySS in parallel ● Run ABySS using eight processes: ● abyss-pe np=8 name=ecoli k=32 n=10 in='reads_1.fa reads_2.fa' ● ABySS uses MPI, the Message Passing Interface. OpenMPI is an open-source implementation of MPI
  • 40. Running ABySS in parallel on a cluster (SGE) ● Run ABySS on a cluster using 8 processes: ● qsub -pe openmpi 8 -N ecoli abyss-pe np=8 name=ecoli k=32 n=10 in='reads_1.fa reads_2.fa' ● abyss-pe uses the environment variables JOB_NAME and NSLOTS passed to it by SGE as the default values for name and np
  • 41. Running ABySS in parallel on a cluster (SGE) for many values of k ● Assemble every 8th k from 32 to 96: ● qsub -pe openmpi 8 -N ecoli -t 32-96:8 abyss-pe n=10 in='reads_1.fa reads_2.fa' ● abyss-pe uses the environment variable SGE_TASK_ID passed to it by SGE as the default value for k, so k is not set on the command line
  • 42. Assembling multiple libraries ● abyss-pe name=ecoli k=32 n=10 lib='pe200 pe500' pe200='pe200_1.fa pe200_2.fa' pe500='pe500_1.fa pe500_2.fa'
  • 43. Assembling a mix of paired-end and single-end reads ● abyss-pe name=ecoli k=32 n=10 lib='pe200 pe500' pe200='pe200_1.fa pe200_2.fa' pe500='pe500_1.fa pe500_2.fa' se='long.fa'
  • 44. Parameters of ABySS ● name: name of the assembly ● lib: name of the libraries (one or more) ● se: paths of the single-end read files ● ${lib}: paths of the read files for that library ● Example: abyss-pe name=ecoli k=32 n=10 lib='pe200 pe500' pe200='pe200_1.fa pe200_2.fa' pe500='pe500_1.fa pe500_2.fa' se='long.fa'
  • 45. Parameters of ABySS: sequence assembly ● k: the size of a k-mer ● q: quality trimming removes low-quality bases from the ends of reads ● e and c: coverage-threshold parameters ● e: erosion removes bases from the ends of contigs ● c: coverage threshold removes entire contigs ● p: the minimum identity for bubble popping
  • 46. Parameters of ABySS: paired-end assembly ● s: the minimum size of a seed contig ● n: the number of pairs required to join two contigs ● Example: abyss-pe name=ecoli k=64 q=3 p=0.9 s=100 n=10 lib='pe200 pe500' pe200='pe200_1.fa pe200_2.fa' pe500='pe500_1.fa pe500_2.fa' se='long.fa'
  • 47. Stages of ABySS ● Assemble the read sequences without paired-end information ● Map the reads back to the assembly ● Use the paired-end information to merge contigs from the first stage into larger sequences
  • 48. Optimizing k ● Assemble every 8th k from 32 to 96. Nine assemblies: 32 40 48 56 64 72 80 88 96 ● Find the peak ● Assemble every 2nd k around the peak. For example, if the peak were at k=64, eight assemblies: 56 58 60 62 66 68 70 72 ● SGE: qsub -t 32-96:8 qsub-abyss.sh; qsub -t 56-72:2 qsub-abyss.sh
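The two rounds of the k sweep can be enumerated as below. The helper names and the default radius of 8 around the peak are assumptions chosen to reproduce the numbers on the slide; they are not part of ABySS.

```python
def coarse_k_values(start=32, stop=96, step=8):
    """First round: every 8th k from 32 to 96 inclusive."""
    return list(range(start, stop + 1, step))


def fine_k_values(peak, radius=8, step=2):
    """Second round: every 2nd k around the peak, skipping the peak itself."""
    return [k for k in range(peak - radius, peak + radius + 1, step) if k != peak]


print(coarse_k_values())   # nine values of k
print(fine_k_values(64))   # eight values of k around a peak at 64
```

Note that an SGE task range of 56-72:2 also contains 64, so submitting it as written on the slide would run nine tasks and repeat the assembly at the peak.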
  • 49. Output files of ABySS ● ${name}-contigs.fa: the final contigs in FASTA format ● ${name}-bubbles.fa: the equal-length variant sequences (FASTA) ● ${name}-indel.fa: the different-length variant sequences (FASTA) ● ${name}-contigs.dot: the contig overlap graph in Graphviz format
  • 50. Intermediate output files of ABySS ● .adj: contig overlap graph in ABySS adj format ● .dist: estimates of the distance between contigs in ABySS dist format ● .path: lists of contigs to be merged ● .hist: fragment-size histogram of a library ● coverage.hist: k-mer coverage histogram
  • 52. Assembly/alignment visualization ● Display how the reads were used in the assembly (or align to the reference) ● Show paired-end reads and highlight locations where the pairs are discordant ● Browse annotations and variants ● Standard file formats are BAM, VCF and GFF, though there are many
  • 53. Visualization tools ● UCSC Genome Browser ● Integrative Genomics Viewer (IGV) ● Tablet ● gap5 ● consed ● ABySS-Explorer
  • 54. Integrative Genomics Viewer (IGV) ● Can visualize short read alignments and many other types of data
  • 57. K-mer coverage histogram ● Counts the number of occurrences of each k-mer ● Useful for estimating the size of the genome
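A k-mer coverage histogram can be sketched with two counting passes: first count how often each k-mer occurs, then count how many k-mers share each occurrence count. The function name is illustrative, and the sketch ignores reverse complements; it does not reproduce the format of ABySS's coverage.hist file.

```python
from collections import Counter


def kmer_coverage_histogram(reads, k):
    """Map a coverage value to the number of distinct k-mers seen that often."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())


# The two reads from the k = 5 de Bruijn graph example.
histogram = kmer_coverage_histogram(["GGACATC", "GGACAGA"], 5)
for coverage in sorted(histogram):
    print(coverage, histogram[coverage])
```

Here GGACA occurs in both reads, so one k-mer has coverage 2 and the other four have coverage 1.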
  • 58. N50 and Nxx plot ● The N50 is the weighted median of contig sizes ● The N50 summarizes a single point on the Nxx plot ● Better assemblies are further to the right
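The weighted median can be computed by sorting contigs from longest to shortest and walking down until half of the total assembled bases is covered. This sketch uses that common definition of N50; the function name and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs")


# Illustrative lengths: total is 20, and 6 + 5 = 11 covers at least half.
print(n50([2, 3, 4, 5, 6]))
```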
  • 60. Assembly Ambiguities ● True genome sequence: GGATTGAAAAAAAAAAAAAAAAGTAGCACGAATATACATAGAAAAAAAAAAAAAAAAATTACG ● [Figure: assembled sequence and its de Bruijn graph representation; credit: Cydney Nielsen]
  • 61. Starting Point ● [Figure; credit: Cydney Nielsen]
  • 63. Sequence length ● One oscillation = 100 nt ● [Figure; credit: Cydney Nielsen]
  • 64. Paired-end reads ● After building the initial single-end (SE) contigs from k-mer sequences, ABySS uses paired-end reads to resolve ambiguities. ● [Figure; credit: Cydney Nielsen]
  • 65. Paired-end contigs ● Paired-end reads are used to construct paired-end (PE) contigs ● Example path: … 13+ 44- 46+ 4+ 79+ 70+ … ● Figure legend: blue gradient = paired-end contig; orange = selected single-end contig ● [Figure; credit: Cydney Nielsen]
  • 68. Transcriptome Assembly, Alternative Splicing and Visualization
  • 71. Pipeline ● Assembly: ABySS ● Alignment: GMAP ● Detection & Visualisation: Sircah
  • 72. ABySS ● Assemble transcriptome data ● Transcriptome reads → Assembly
  • 74. GMAP ● Align contigs to the reference genome ● Annotate introns ● Assembly → Alignments
  • 76. Sircah ● Detect alternative splicing events ● Alignments → Alternative splicing
  • 78. Sircah Visualisation ● Draw splicing diagrams ● Alternative splicing → Splicing diagrams
  • 80. Acknowledgments ● Supervisors: İnanç Birol, Steven Jones ● Team: Readman Chiu, Rod Docking, Ka Ming Nip, Karen Mungall, Jenny Qian, Tony Raymond
  • 82. An assembly in two stages ● Stage I: Sequence assembly algorithm ● Stage II: Paired-end assembly algorithm
  • 83. Stage 1: Sequence assembly algorithm ● Load k-mers: load the reads, breaking each read into k-mers ● Find overlaps: find adjacent k-mers, which overlap by k-1 bases ● Prune tips: remove k-mers resulting from read errors ● Pop bubbles: remove variant sequences ● Generate contigs
  • 84. Load the reads
    ● For each input read of length l, (l - k + 1) k-mers are generated by sliding a window of length k over the read
    ● Each k-mer is a vertex of the de Bruijn graph
    ● Two adjacent k-mers are an edge of the de Bruijn graph
    Read (l = 12):
      ATCATACATGAT
    k-mers (k = 9):
      ATCATACAT
       TCATACATG
        CATACATGA
         ATACATGAT
  • 85. De Bruijn Graph
    ● A simple graph for k = 5
    ● Two reads
      ● GGACATC
      ● GGACAGA
    Graph:
      GGACA → GACAT → ACATC
      GGACA → GACAG → ACAGA
  • 86. Pruning tips ● Read errors cause tips
  • 87. Pruning tips ● Read errors cause tips ● Pruning tips removes the erroneous reads from the assembly
  • 88. Popping bubbles ● Variant sequences cause bubbles ● Popping bubbles removes the variant sequence from the assembly ● Repeat sequences with small differences also cause bubbles
  • 89. Assemble contigs ● Remove ambiguous edges ● Output contigs in FASTA format
  • 90. Stage 2: Paired-end assembly algorithm ● Align the reads to the contigs of the first stage ● Generate an empirical fragment-size distribution using the paired reads that align to the same contig ● Estimate the distance between contigs using the paired reads that align to different contigs
  • 91. Align the reads to the contigs (KAligner) ● Every k-mer in the single-end assembly is unique ● KAligner can map reads with k consecutive correct bases ● ABySS may use other aligners, including BWA and bowtie
  • 92. Empirical fragment-size distribution (ParseAligns) ● Generate an empirical fragment-size distribution using the paired reads that align to the same contig
  • 93. Estimate distances between contigs (DistanceEst) ● Estimate the distance between contigs using the paired reads that align to different contigs ● Example estimates from the figure: d = 25 ± 8, d = 3 ± 5, d = 6 ± 5, d = 4 ± 3
  • 94. Maximum likelihood estimator (DistanceEst) ● Use the empirical paired-end size distribution ● Maximize the likelihood function ● Find the most likely distance between the two contigs
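The idea behind the maximum likelihood estimate can be sketched in a simplified form. This is not DistanceEst's implementation: it assumes each read pair spanning the two contigs contributes a "partial size" (the fragment length excluding the unknown gap), so the true fragment size is the partial size plus the distance d, and it scores each candidate d against the empirical fragment-size histogram. All names, the pseudo-probability floor, and the numbers are hypothetical.

```python
import math
from collections import Counter


def estimate_distance(fragment_sizes, partial_sizes, candidate_distances):
    """Pick the candidate distance d that maximises the log-likelihood.

    fragment_sizes: sizes observed from pairs aligning to the same contig.
    partial_sizes:  fragment size minus the unknown gap, one per spanning pair.
    """
    histogram = Counter(fragment_sizes)
    total = sum(histogram.values())
    floor = 1e-9  # assumed pseudo-probability for sizes never observed

    def log_likelihood(d):
        return sum(
            math.log(max(histogram.get(x + d, 0) / total, floor))
            for x in partial_sizes
        )

    return max(candidate_distances, key=log_likelihood)


# Made-up numbers: fragments cluster around 200 bp, partial sizes around 190 bp.
d = estimate_distance(
    fragment_sizes=[198, 199, 200, 200, 200, 201, 202],
    partial_sizes=[190, 190, 191, 189],
    candidate_distances=range(0, 21),
)
print(d)
```

The estimate lands where shifting the partial sizes by d best lines them up with the peak of the empirical distribution.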
  • 95. Paired-end algorithm continued... ● Generate paths: find paths through the contig adjacency graph that agree with the distance estimates ● Merge paths: merge overlapping paths ● Generate contigs: merge the contigs in these paths and output the FASTA file
  • 96. Find consistent paths (SimpleGraph) ● Find paths through the contig adjacency graph that agree with the distance estimates ● Example from the figure: estimate d = 4 ± 3, actual distance = 3
  • 97. Merge overlapping paths (MergePaths) ● Merge paths that overlap
  • 98. Generate the FASTA output ● Merge the contigs in these paths ● Output the FASTA file ● Sequences shown in the figure: GATTTTTG GAC GTCTTGATCTT CAC GTATTG CTATT