[I0D51A] Bioinformatics: High-Throughput Analysis
   Next-generation sequencing. Part 2: Mapping
Prof Jan Aerts
Faculty of Engineering - ESAT/SCD
jan.aerts@esat.kuleuven.be

TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)




Context




Assembly vs mapping




Trapnell & Salzberg, 2009




challenges:
  • how quickly can we align the reads to the genome?
  • what do we do with repetitive sequences?

Approaches


  • hash-based
  • Burrows-Wheeler transform

(figure: Trapnell & Salzberg, 2009)
Hash-based mapping

E.g. MAQ


Steps:


  • Index reference genome (or sequence reads) => creates hash index (= big
    file: >50GB)


  • Divide each read into segments (seeds) and look up in table

                           seed                   positions
                            ...                       ...
                          AAGC                  3,473,2738,...
                          AAGG                 34,236,1827,...
                          AAGT                8,172,782,1921,...
                            ...                       ...
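
The seed lookup above can be sketched roughly as follows. This is a minimal illustration, not how MAQ is implemented: the function names, the tiny reference string and the fixed seed length `k` are all assumptions made for the example. Every k-mer of the reference is stored with its 1-based positions, and each non-overlapping seed of a read then votes for a candidate start position.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer in the reference to its 1-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos + 1)
    return index


def candidate_positions(read, index, k):
    """Look up each non-overlapping seed of the read; return candidate read starts."""
    hits = set()
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            # Shift the seed hit back to where the read itself would start.
            hits.add(pos - offset)
    return sorted(hits)


# Toy data (hypothetical): a 16 bp "genome" and an 8 bp read.
reference = "ACGTAAGCTTAAGGCA"
index = build_seed_index(reference, k=4)
print(index["AAGC"])                                 # [5]
print(candidate_positions("AAGCTTAA", index, k=4))   # [5]
```

Candidate positions still need to be verified by a full alignment of the read; the index only narrows down where to look.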
Burrows-Wheeler transform

E.g. BWA


Used in data compression (e.g. bzip2) => index is much smaller than a hash-based
index (<2GB)


Alignment speed: 30x faster than MAQ


Steps:


  • Create BWT index of genome


  • Align read 1 character at a time to BWT-transformed genome


Burrows-Wheeler transform




(figure: creating the Burrows-Wheeler transform for read mapping)
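
As a rough illustration of how the transform is created: write down all rotations of the text, sort them, and read off the last column. The sketch below assumes a toy character order (G < L < O < ^ < $), chosen so that the result matches the F and L columns of the GOOGOL example in this lecture. Real indexers typically derive the BWT from a suffix array rather than materialising every rotation.

```python
# Assumed sort order for the toy alphabet of the GOOGOL example.
ORDER = {ch: rank for rank, ch in enumerate("GLO^$")}


def bwt(text):
    """Return the last column of the sorted rotation matrix of text."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort(key=lambda rot: [ORDER[ch] for ch in rot])
    return "".join(rot[-1] for rot in rotations)


def first_column(text):
    """Return the first column F, i.e. the characters of text in sorted order."""
    return "".join(sorted(text, key=ORDER.__getitem__))


print(bwt("^GOOGOL$"))           # O^OOGG$L
print(first_column("^GOOGOL$"))  # GGLOOO^$
```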
Inverse BWT: recreating original text
if BWT = O^OOGG$L => what was original text?


O^OOGG$L = last column L => first column F = L sorted

          Last column L                First column F

               O                               G
               ^                               G
               O                               L
               O         -- sort -->           O
               G                               O
               G                               O
               $                               ^
               L                               $
Inverse BWT: recreating original text

   ith occurrence of a character in L is same text occurrence as
   the ith occurrence in F

               F        L

   1st G      G         O      1st O

   2nd G      G         ^       1st ^

    1st L      L        O      2nd O

   1st O      O         O      3rd O

   2nd O      O         G      1st G

   3rd O      O         G      2nd G

    1st ^     ^         $       1st $

    1st $      $        L       1st L
Rebuild the text from back to front, using the F/L table above. Start at the
row where F = $ (the end of the text): the character in L on that row is the
one that precedes it in the text. Jump to the row where F holds that same
occurrence, and repeat until every character has been used.

   row (F)      L on that row      text so far

   (start)                         $
   1st $        1st L              L$
   1st L        2nd O              OL$
   2nd O        1st G              GOL$
   1st G        1st O              OGOL$
   1st O        3rd O              OOGOL$
   3rd O        2nd G              GOOGOL$
   2nd G        1st ^              ^GOOGOL$
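
The walk through the F and L columns can be sketched in a few lines. This is an illustrative implementation under the same assumed toy character order (G < L < O < ^ < $) as the GOOGOL example; `inverse_bwt` and `ORDER` are names invented for the sketch. A stable sort of L yields F and, at the same time, the mapping from the i-th occurrence of a character in L to the i-th occurrence in F.

```python
# Assumed sort order for the toy alphabet of the GOOGOL example.
ORDER = {ch: rank for rank, ch in enumerate("GLO^$")}


def inverse_bwt(last, end="$"):
    """Recover the original text from its BWT (last column L)."""
    n = len(last)
    # order[f_row] = row in L whose character ends up at F[f_row].
    # Ties are broken by row number, so the i-th occurrence in L
    # becomes the i-th occurrence in F.
    order = sorted(range(n), key=lambda i: (ORDER[last[i]], i))
    first = [last[i] for i in order]
    lf = [0] * n
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row

    row = first.index(end)   # the row where F is the end marker
    text = end
    for _ in range(n - 1):
        text = last[row] + text   # L on this row precedes F on this row
        row = lf[row]             # jump to the same occurrence in F
    return text


print(inverse_bwt("O^OOGG$L"))  # ^GOOGOL$
```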
Searching using BWT

uses row index and fact that rows are alphabetically sorted => binary search
e.g. at what positions does “GO” occur in “^GOOGOL$”?




take the middle row: is “GO” alphabetically before or after this row?
-> if before: take the middle row of the first half (if after: the last half) and discard the other
half
-> repeat until the string is found
-> row indices indicate positions of the substring: “GO” is at positions 2 and 5
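
A minimal sketch of this binary search. It uses a plain sorted list of suffixes and Python's default character order instead of a compressed BWT index, so it only illustrates the idea of searching sorted rows; the function name is an assumption. Positions are reported 1-based to match the slide.

```python
from bisect import bisect_left, bisect_right


def find_occurrences(text, pattern):
    """Return the 1-based start positions of pattern in text."""
    # Row index: start positions of all suffixes, sorted alphabetically.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # Truncating sorted suffixes to the pattern length keeps them sorted,
    # so the matching rows form one contiguous block.
    prefixes = [text[i:i + len(pattern)] for i in suffix_array]
    lo = bisect_left(prefixes, pattern)    # binary search: first matching row
    hi = bisect_right(prefixes, pattern)   # binary search: one past the last
    return sorted(i + 1 for i in suffix_array[lo:hi])


print(find_occurrences("^GOOGOL$", "GO"))  # [2, 5]
```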
Issues

• placing reads in regions that do not exist in the reference genome

• sequencing errors and variation: the alignment between a read and its true
 source in the genome may have more differences than the alignment with some
 other copy of a repeat
 What if there are many nucleotide differences with the closest fully sequenced genome?


• placing reads in repetitive regions: MAQ & bwa return only 1 mapping; if
 multiple mappings are equally good: mapQ = 0


• MAQ & bwa: use paired-end information => might prefer correct distance
 over correct alignment



File formats

SAM (Sequence Alignment/Map) format = unified format for storing read
alignments to a reference genome


BAM = binary version of SAM for fast querying




7172283      83          chr9        139389482      60         90M         =        139389330     -242      ACGGGAG...      #######...
7172283    163         chr9          139389330      60        90M         =         139389482      242      TAGGAGG...      EHHHHHH...
7705896      83        chr9          139389513      60        90M         =         139389512      -91      GCTGGGG...      EBCHHFC...
7705896    163         chr9          139389512      60        90M         =         139389513       91      AGCTGGG...      HHHHHHH...




           1                   QNAME                                                 query template name
           2                    FLAG                                                      bitwise flag
           3                   RNAME                                              reference sequence name
           4                    POS                                           1-based leftmost mapping position
           5                   MAPQ                                                     mapping quality
           6                   CIGAR                                                     CIGAR string
           7                   RNEXT                                               reference name of mate
           8                   PNEXT                                                   position of mate
           9                    TLEN                                              observed template length
           10                   SEQ                                                       sequence
           11                   QUAL                                          ASCII of Phred-scaled base quality
                                                                                                          http://samtools.sourceforge.net/SAM1.pdf
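
To make the eleven mandatory fields concrete, here is a small sketch that splits one tab-separated SAM record into named fields and decodes QUAL. The record is a shortened toy version of the first example line (the sequence is cut to the 7 bases visible above and the CIGAR adjusted to match), and the class and function names are assumptions for illustration. QUAL is decoded as Phred+33, as in the SAM specification.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse the 11 mandatory, tab-separated fields of one alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def phred_scores(qual):
    """Convert the ASCII quality string to Phred scores (ASCII code minus 33)."""
    return [ord(ch) - 33 for ch in qual]


# Toy record (shortened, hypothetical): real reads in the example are 90 bp.
toy = "7172283\t83\tchr9\t139389482\t60\t7M\t=\t139389330\t-242\tACGGGAG\t#######"
record = parse_sam_line(toy)
print(record.rname, record.pos, record.mapq)  # chr9 139389482 60
print(phred_scores(record.qual))              # [2, 2, 2, 2, 2, 2, 2]
```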




paired data




7172283    83    chr9    139389482    60    90M    =    139389330    -242    ACGGGAG...    #######...
7172283   163    chr9    139389330    60    90M    =    139389482     242    TAGGAGG...    EHHHHHH...
7705896    83    chr9    139389513    60    90M    =    139389512     -91    GCTGGGG...    EBCHHFC...
7705896   163    chr9    139389512    60    90M    =    139389513      91    AGCTGGG...    HHHHHHH...
SAM format: FLAG field

numeric    binary                        description

  1       00000001     template has multiple fragments in sequencing

  2       00000010   each fragment properly mapped according to aligner

  4       00000100                 fragment is unmapped

  8       00001000                   mate is unmapped

  16      00010000           sequence is reverse complemented

  32      00100000              sequence of mate is reversed

  64      01000000               is first fragment in template

 128      10000000             is second fragment in template

SAM FLAG: examples

• 83 = 64 + 16 + 2 + 1 = 01010011


    template has multiple fragments, each fragment is properly aligned,
    fragment is not unmapped, mate is not unmapped, sequence is reverse
    complemented, sequence of mate is not reversed, this is the first fragment
    in the template, this is not the second fragment in the template


• 163 = 128 + 32 + 2 + 1 = 10100011


    template has multiple fragments, each fragment is properly aligned,
    fragment is not unmapped, mate is not unmapped, sequence is not
    reverse complemented, sequence of mate is reversed, this is not the first
    fragment in the template, this is the second fragment in the template
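
The two examples can be checked with a short sketch that tests each bit of the flag in turn. The bit values and descriptions are taken from the FLAG table in this lecture; the function and constant names are assumptions.

```python
# (bit value, description) pairs from the FLAG table.
FLAG_BITS = [
    (1, "template has multiple fragments in sequencing"),
    (2, "each fragment properly mapped according to aligner"),
    (4, "fragment is unmapped"),
    (8, "mate is unmapped"),
    (16, "sequence is reverse complemented"),
    (32, "sequence of mate is reversed"),
    (64, "is first fragment in template"),
    (128, "is second fragment in template"),
]


def decode_flag(flag):
    """Return the descriptions of all bits that are set in flag."""
    return [description for bit, description in FLAG_BITS if flag & bit]


for flag in (83, 163):
    print(flag, format(flag, "08b"))
    for description in decode_flag(flag):
        print("   ", description)
```

The same bitwise test (`flag & 4`) is what distinguishes mapped from unmapped reads.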


SAM format: CIGAR string

  M    alignment match (can be sequence match or mismatch)

   I                 insertion to the reference

  D                  deletion to the reference

  N              skipped region from the reference

  S      soft clipping (clipped sequence is present in SEQ)

  H    hard clipping (clipped sequence is not present in SEQ)

  P       padding (silent deletion from padded reference)

  =                      sequence match

  X                    sequence mismatch




CIGAR string: example




               read   ACGCA-TGCAGTtagacgt

          reference   ACGCAGTG--GT

              CIGAR   5M1D2M2I2M7S
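
A short sketch that splits a CIGAR string into its operations and counts how many bases each side of the alignment consumes. The function names are assumptions; which operations consume read or reference bases follows the SAM specification (M, I, S, = and X consume the read; M, D, N, = and X consume the reference).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def consumed_lengths(cigar):
    """Return (bases of the read in SEQ, bases of the reference) covered."""
    ops = parse_cigar(cigar)
    read_length = sum(n for n, op in ops if op in "MIS=X")
    reference_length = sum(n for n, op in ops if op in "MDN=X")
    return read_length, reference_length


print(parse_cigar("5M1D2M2I2M7S"))
# [(5, 'M'), (1, 'D'), (2, 'M'), (2, 'I'), (2, 'M'), (7, 'S')]
print(consumed_lengths("5M1D2M2I2M7S"))  # (18, 10)
```

The result agrees with the example: the read is 18 bases long including the 7 soft-clipped ones, and it spans 10 bases of the reference.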
Running bwa (FASTQ -> BAM)

http://bio-bwa.sourceforge.net


Steps:


  1. Create index for genome (only has to be done once)


  2. Run “bwa aln” to find suffix array coordinates of good hits of each
    individual read


  3. Run “bwa samse/sampe”, which converts suffix array coordinates to
    chromosomal coordinates and pairs the reads (for sampe)
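
The three steps could be strung together for paired-end reads roughly as below. This is a sketch only: the file names are hypothetical placeholders, the helper names are invented, and it assumes a `bwa` version that still provides the `aln` and `sampe` subcommands and writes their results to standard output.

```python
import subprocess


def bwa_paired_end_commands(ref, fastq1, fastq2, prefix):
    """Return (argv, output file) pairs for index -> aln (x2) -> sampe."""
    sai1 = f"{prefix}_1.sai"
    sai2 = f"{prefix}_2.sai"
    return [
        (["bwa", "index", ref], None),                 # step 1, once per genome
        (["bwa", "aln", ref, fastq1], sai1),           # step 2, first read file
        (["bwa", "aln", ref, fastq2], sai2),           # step 2, second read file
        (["bwa", "sampe", ref, sai1, sai2, fastq1, fastq2], f"{prefix}.sam"),  # step 3
    ]


def run(commands):
    """Run each command, redirecting standard output to the given file if any."""
    for argv, output_path in commands:
        if output_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(output_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)


# Hypothetical file names; call run(commands) on a machine with bwa installed.
commands = bwa_paired_end_commands("ref.fa", "reads_1.fq", "reads_2.fq", "aln")
for argv, output_path in commands:
    print(" ".join(argv), "->", output_path)
```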
Running “bwa” without arguments returns help.




bwa: indexing the genome

Only has to be done once!


To index chromosome 17 only:


  1. Download chr17.fa.gz from UCSC Genome Browser (downloads section) and unzip it


  2. Run bwa index -a is chr17.fa
bwa: finding suffix array coordinates for reads




bwa: converting suffix array coordinates to
chromosome coordinates




Using Galaxy for read mapping




Viewing BAM files

Many options:


• Integrative Genomics Viewer (IGV) by Broad Institute


• samtools tview


• UCSC genome browser


• bamview


• bambino


• ...
Viewing BAM files: IGV

http://www.broadinstitute.org/software/igv/


Java WebStart




(IGV screenshot with labelled parts: coverage, reads, polymorphisms, gene model)
Is this a known SNP?




File -> Load from Server...




Yes, it is...




Viewing BAM files: samtools tview

                           http://samtools.sourceforge.net




Viewing BAM files: UCSC Genome Browser

http://genome.ucsc.edu
-> “Genome Browser”
-> “Manage Custom Tracks”
-> “Add Custom Tracks”
-> In “Edit configuration”:


  track type=bam name="My BAM" bigDataUrl=http://med.kuleuven.be/lcb/teaching/aln.sorted.bam


-> “Submit”


aln.sorted.bam contains reads that map to the first 10Mb of chr17



whole chromosome




zoomed in




zoomed in even further   query template names




Read details




Manipulating SAM/BAM files

• convert SAM <-> BAM


• remove PCR duplicates


• sort BAM file - necessary for loading into tools such as IGV


• index BAM file - necessary for loading into tools such as IGV


• local realignment around indels


• base quality recalibration


• pileup - i.e. convert from read-based to position-based; SNP calling


• ...

Manipulating SAM/BAM files - tools: samtools




                                           Li et al, 2009



         http://samtools.sourceforge.net


(screenshots of samtools commands: convert SAM to BAM, sort, index)
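
The convert, sort and index steps could be chained as in the sketch below. It assumes a samtools 1.x command line, where `view` and `sort` accept `-o` for the output file; older releases from the time of these slides took an output prefix for `sort` instead. The file names and the helper name are hypothetical.

```python
import subprocess


def samtools_commands(sam_path, prefix):
    """Return the argv lists for SAM -> BAM, sort, and index (samtools 1.x syntax assumed)."""
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],   # convert SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],       # sort by coordinate
        ["samtools", "index", sorted_bam],                 # creates the .bai index
    ]


def run(commands):
    """Run the commands in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Hypothetical file names; call run(...) on a machine with samtools installed.
for argv in samtools_commands("aln.sam", "aln"):
    print(" ".join(argv))
```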
Manipulating SAM/BAM files - tools: PICARD

http://picard.sourceforge.net


= Java-based command-line utility with functionality similar to samtools


useful commands:


   • MarkDuplicates - flags duplicate records (i.e. due to PCR amplification
    bias)


   • CalculateHsMetrics - calculates set of Hybrid Selection specific metrics

   • SamToFastq - extracts read sequences and qualities from SAM file

Duplicate removal

 PCR amplification bias


   some reads: better amplified than others => bias!!


   => keep only one (with highest mapping Q)


   (figure labels: PCR went well; PCR didn’t go so well; PCR didn’t work)
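
The "keep only one" rule can be sketched as follows. This is a deliberately simplified illustration of the idea on this slide, not what Picard or samtools actually do: reads are treated as duplicates when they share reference name, position and strand, mate information is ignored, and the dictionary layout of a read is an assumption.

```python
def remove_duplicates(reads):
    """Keep, per (reference, position, strand), the read with the highest MAPQ."""
    best = {}
    for read in reads:
        is_reverse = bool(read["flag"] & 16)   # FLAG bit 16: reverse complemented
        key = (read["rname"], read["pos"], is_reverse)
        kept = best.get(key)
        if kept is None or read["mapq"] > kept["mapq"]:
            best[key] = read
    return list(best.values())


# Toy reads (hypothetical values).
reads = [
    {"qname": "a", "rname": "chr9", "pos": 100, "flag": 0, "mapq": 20},
    {"qname": "b", "rname": "chr9", "pos": 100, "flag": 0, "mapq": 60},
    {"qname": "c", "rname": "chr9", "pos": 100, "flag": 16, "mapq": 30},
    {"qname": "d", "rname": "chr9", "pos": 250, "flag": 0, "mapq": 10},
]
print([read["qname"] for read in remove_duplicates(reads)])  # ['b', 'c', 'd']
```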
Picard:

java -Xmx2048m \
   -jar /path_to_picard/MarkDuplicates.jar \
   INPUT=input.bam \
   OUTPUT=output.bam \
   METRICS_FILE=output.metrics \
   VALIDATION_STRINGENCY=LENIENT


samtools:

samtools rmdup input.bam output.bam
Manipulating SAM/BAM files - tools: GATK

GATK = Genome Analysis Toolkit, developed by Broad Institute


http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit


  • Full variant discovery workflow

  • Variant evaluation

  • ...




Base quality recalibration

• Why?


     correct for variation in quality with machine cycle, sequence context, lane,
     baseQ, ...


• Steps:


   • Identify what to correct for


   • Calculate covariates


   • Apply covariates


   • Check (create plots)
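
The "calculate" step can be pictured as counting, per combination of covariates, how often a base disagrees with the reference at positions that are not known variants, and turning that error rate into a Phred score (Q = -10 log10 p). The sketch below is a toy version of that idea and not GATK's actual model: the choice of covariates (reported quality and machine cycle), the +1/+2 smoothing and all names are assumptions.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """observations: (reported_q, cycle, is_mismatch) per base.

    Returns {(reported_q, cycle): empirical Phred quality}.
    """
    counts = defaultdict(lambda: [0, 0])   # bin -> [bases seen, mismatches]
    for reported_q, cycle, is_mismatch in observations:
        bin_counts = counts[(reported_q, cycle)]
        bin_counts[0] += 1
        bin_counts[1] += int(is_mismatch)

    table = {}
    for key, (seen, mismatches) in counts.items():
        # Assumed smoothing so that bins without mismatches stay finite.
        error_rate = (mismatches + 1) / (seen + 2)
        table[key] = -10 * math.log10(error_rate)
    return table


# Toy data: 998 bases reported as Q30 at cycle 1, of which 9 are mismatches.
observations = [(30, 1, False)] * 989 + [(30, 1, True)] * 9
print(round(empirical_qualities(observations)[(30, 1)], 1))  # 20.0
```

In this toy bin the bases were reported as Q30 but behave like Q20; applying the table would mean replacing the reported quality of every base in a bin with the empirical one.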

Mapping quality dependent on sequence context




java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  --DBSNP resources/dbsnp_129_hg18.rod \
  -I my_reads.bam \
  -T CountCovariates \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -recalFile my_reads.recal_data.csv

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  -I my_reads.bam \
  -T TableRecalibration \
  -outputBam my_reads.recal.bam \
  -recalFile my_reads.recal_data.csv
Local realignment near indels




java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals


java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I input.bam \
  -R ref.fasta \
  -T IndelRealigner \
  -targetIntervals /path/to/output.intervals \
  -o realignedBam.bam
Exercises




Aligning reads to reference on the command line


Log in to the server mentioned on Toledo, and:


From directory ~jaerts/i0d51a/: copy the files s_1_sequence_small.txt,
s_2_sequence_small.txt and chr9.fa to your own home directory.



Given that s_1_sequence_small.txt and s_2_sequence_small.txt contain paired
reads: align these against chr9. You’ll first have to create an index for chr9 (see slides).
Also convert the resulting SAM file to a BAM file.


How many of the reads were mapped? How many could not be mapped? There’s an
easy way to do this with grep, but you get an extra point if you use the bitwise flag.


How many reads mapped without indels or clipping (i.e. CIGAR string equal to “90M”)?
Remember that M covers both sequence matches and mismatches.
Aligning reads to reference using Galaxy



Log into your account on Galaxy.



Align the reads in s_1_sequence_small.txt and s_2_sequence_small.txt (that
you uploaded in the last lesson) against hg19. Perform the mapping using BWA
for Illumina. Use the built-in index “Human (Homo sapiens): hg19 Full” (type
“hg19” in the “Select a reference genome” box). Do not suppress the header in
the output SAM file.



Using Galaxy: create a histogram of the insert sizes of this DNA sequencing
library (tip: you’ll need some commands from the “Text Manipulation” and
“Filter and Sort” groups)




Investigating BAM file with IGV


Start the IGV application from http://www.broadinstitute.org/software/igv/download
(750MB version) and open the first10Mbchr17.sorted.bam file which you can download
from Toledo.



• Is this data from a whole-genome sequencing experiment, or rather from some type of
  pulldown? If the latter: what type of pulldown (i.e. what were the targets)?



• Is the complete CDS of the KIF1C gene covered?


• What is the left-most gene that is also in OMIM (you can find those at “Load from Server
  -> hg19 -> Phenotype and Disease Associations”)? Are all its exons covered?



• At position 11,928 of chromosome 17: is this a SNP? If it is: is it already known in
  dbSNP? What about position 13,905?

 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsKarakKing
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 

Último (20)

Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17How to Add New Custom Addons Path in Odoo 17
How to Add New Custom Addons Path in Odoo 17
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 

Next generation sequencing course - part 2: sequence mapping

  • 1. [I0D51A] Bioinformatics: High-Throughput Analysis. Next-generation sequencing. Part 2: Mapping. Prof Jan Aerts, Faculty of Engineering - ESAT/SCD, jan.aerts@esat.kuleuven.be. TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)
  • 4. Challenges (figure: Trapnell & Salzberg, 2009):
      • How quickly can we align the reads to the genome?
      • What do we do with repetitive sequences?
  • 5. Approaches: hash-based mapping and the Burrows-Wheeler transform (figure: Trapnell & Salzberg, 2009)
  • 6. Hash-based mapping, e.g. MAQ. Steps:
      • Index the reference genome (or the sequence reads) => creates a hash index (= big file: >50GB)
      • Divide each read into segments (seeds) and look each seed up in the table:

        seed    positions
        ...     ...
        AAGC    3,473,2738,...
        AAGG    34,236,1827,...
        AAGT    8,172,782,1921,...
        ...     ...
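The seed lookup on slide 6 can be sketched in a few lines of Python. This is a toy illustration, not how MAQ itself is implemented: the function names, the 4-base seed length and the short reference string are assumptions made for the example, and mismatches are ignored.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Map every k-mer (seed) of the reference to its 1-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos + 1)
    return index


def candidate_starts(read, index, k):
    """Look up non-overlapping seeds of the read and vote for read start positions."""
    votes = Counter()
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            # A seed found at `pos` implies the read starts `offset` bases earlier.
            votes[pos - offset] += 1
    return votes


reference = "ACGTAAGCTTAAGGCATAAGTCC"  # hypothetical toy reference
index = build_index(reference, 4)
print(candidate_starts("AAGGCATA", index, 4).most_common(1))  # [(11, 2)]
```

The start position supported by the most seeds is the best candidate; a real mapper would then align the whole read at that position and score the mismatches.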
  • 7. Burrows-Wheeler transform, e.g. BWA. Used in data compression (e.g. bzip) => index is much smaller than a hash-based index (<2GB). Alignment speed: 30x faster than MAQ. Steps:
      • Create the BWT index of the genome
      • Align each read 1 character at a time to the BWT-transformed genome
  • 8. Burrows-Wheeler transform (figure panels: Creating Burrows-Wheeler; 2. Read mapping)
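Creating the transform can be sketched as follows: write down all rotations of the text, sort them, and read off the last column. This is a naive version for illustration only (real indexers avoid materialising every rotation). The `sort_key` helper is an assumption of this sketch: it sorts letters first, then `^`, then `$`, which is the row order implied by the F column on the inverse-BWT slides.

```python
def sort_key(text):
    """Sort letters alphabetically, then '^', then '$'."""
    marker_rank = {"^": 1, "$": 2}
    return [(marker_rank.get(ch, 0), ch) for ch in text]


def bwt(text):
    """Return the last column of the sorted rotation matrix of `text`."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort(key=sort_key)
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("^GOOGOL$"))  # O^OOGG$L
```

Because the transform only reorders characters, its output always contains exactly the same letters as the input text.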
  • 9. Inverse BWT: recreating the original text. If BWT = O^OOGG$L, what was the original text? The BWT string O^OOGG$L is the last column L; sorting it gives the first column F:

        Last column L    First column F
        O                G
        ^                G
        O                L
        O                O
        G                O
        G                O
        $                ^
        L                $
  • 10. Inverse BWT: recreating the original text. The ith occurrence of a character in L is the same text occurrence as the ith occurrence of that character in F:

        F        L
        1st G    1st O
        2nd G    1st ^
        1st L    2nd O
        1st O    3rd O
        2nd O    1st G
        3rd O    2nd G
        1st ^    1st $
        1st $    1st L
  • 11. Same F/L table; text reconstructed so far: $
  • 12. Same F/L table; text reconstructed so far: L$
  • 13. Same F/L table; text reconstructed so far: OL$
  • 14. Same F/L table; text reconstructed so far: GOL$
  • 15. Same F/L table; text reconstructed so far: OGOL$
  • 16. Same F/L table; text reconstructed so far: OOGOL$
  • 17. Same F/L table; text reconstructed so far: GOOGOL$
  • 18. Same F/L table; text reconstructed so far: ^GOOGOL$
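The walk through the F/L table on slides 10 to 18 can be sketched as below. It starts in the row whose F character is `$`, prepends that row's L character, and jumps to the row of F holding the same occurrence of that character. The function name and the sort order (letters, then `^`, then `$`) are assumptions chosen to match the table; the sketch also assumes the text contains exactly one `$`.

```python
def inverse_bwt(last):
    """Rebuild the original text from the last column L of the BWT."""
    marker_rank = {"^": 1, "$": 2}
    n = len(last)
    # Stable sort of L's row numbers by character: f_to_l[j] is the row of L
    # that holds the same text occurrence as row j of F.
    f_to_l = sorted(range(n), key=lambda i: (marker_rank.get(last[i], 0), last[i]))
    # Invert it: lf[i] is the row of F that matches row i of L.
    lf = [0] * n
    for f_row, l_row in enumerate(f_to_l):
        lf[l_row] = f_row

    row = lf[last.index("$")]  # the row whose F character is '$'
    text = "$"
    for _ in range(n - 1):
        text = last[row] + text  # L holds the character preceding F in the text
        row = lf[row]
    return text


print(inverse_bwt("O^OOGG$L"))  # ^GOOGOL$
```

The stable sort is what encodes the rule from slide 10: equal characters keep their relative order, so the ith occurrence in L lands on the ith occurrence in F.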
  • 19. Searching using BWT: uses the row index and the fact that the rows are alphabetically sorted => binary search. E.g. at what positions does “GO” occur in “^GOOGOL$”?
      • Take the middle row: is “GO” alphabetically before or after this row?
      • If before: take the middle row of the first half (if after: of the last half) and discard the other half
      • Repeat until the string is found
      • The row indices indicate the positions of the substring: “GO” is at positions 2 and 5
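The binary search described on slide 19 can be sketched with a sorted list of suffixes (a suffix array), which plays the role of the row index here. This is a simplified stand-in, not BWA's actual backward search: the function name is hypothetical, and Python's default character order is used, which does not change which positions are found.

```python
def find_occurrences(text, pattern):
    """Return the 1-based positions of `pattern` in `text` via binary search."""
    m = len(pattern)
    # Row index: start positions of all suffixes, in sorted suffix order.
    rows = sorted(range(len(text)), key=lambda i: text[i:])

    def prefix(row):
        return text[rows[row]:rows[row] + m]

    # First row whose m-character prefix is >= pattern.
    lo, hi = 0, len(rows)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First row whose m-character prefix is > pattern.
    hi = len(rows)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(rows[row] + 1 for row in range(first, lo))


print(find_occurrences("^GOOGOL$", "GO"))  # [2, 5]
```

All rows that start with the pattern are adjacent in the sorted order, so two binary searches (for the first and the last such row) are enough to report every occurrence.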
  • 20. Issues
      • Placing reads in regions that do not exist in the reference genome
      • Sequencing errors and variations: the alignment between a read and its true source in the genome may have more differences than the alignment with some other copy of a repeat. What if there are many nucleotide differences with the closest fully sequenced genome?
      • Placing reads in repetitive regions: MAQ & bwa return only 1 mapping; if there are multiple: mapQ = 0
      • MAQ & bwa use paired-end information => might prefer correct distance over correct alignment
  • 21. File formats
      • SAM (Sequence Alignment/Map) format = unified format for storing read alignments to a reference genome
      • BAM = binary version of SAM for fast querying
  • 22. Example SAM records:

        7172283  83   chr9  139389482  60  90M  =  139389330  -242  ACGGGAG...  #######...
        7172283  163  chr9  139389330  60  90M  =  139389482  242   TAGGAGG...  EHHHHHH...
        7705896  83   chr9  139389513  60  90M  =  139389512  -91   GCTGGGG...  EBCHHFC...
        7705896  163  chr9  139389512  60  90M  =  139389513  91    AGCTGGG...  HHHHHHH...

      Fields:

        1   QNAME  query template name
        2   FLAG   bitwise flag
        3   RNAME  reference sequence name
        4   POS    1-based leftmost mapping position
        5   MAPQ   mapping quality
        6   CIGAR  CIGAR string
        7   RNEXT  reference name of mate
        8   PNEXT  position of mate
        9   TLEN   observed template length
        10  SEQ    sequence
        11  QUAL   ASCII of Phred-scaled base quality

      http://samtools.sourceforge.net/SAM1.pdf
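A record like the ones on slide 22 can be split into its 11 mandatory fields as sketched below. The `SamRecord` type and `parse_sam_line` function are hypothetical names for this example, optional tag fields are ignored, and SEQ and QUAL are left truncated exactly as they appear on the slide.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse the 11 mandatory tab-separated fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


# First record of the slide; SEQ and QUAL are truncated there as well.
line = "\t".join(["7172283", "83", "chr9", "139389482", "60", "90M",
                  "=", "139389330", "-242", "ACGGGAG...", "#######..."])
record = parse_sam_line(line)
print(record.rname, record.pos, record.tlen)  # chr9 139389482 -242
```

Header lines of a SAM file start with `@` and would have to be skipped before calling such a parser.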
  • 23. Paired data: in the four records above, lines that share a query template name (7172283, 7705896) are the two mates of one pair. RNEXT (“=”, i.e. the same reference) and PNEXT point to the mate, and TLEN has opposite signs for the two mates.
  • 24. SAM format: FLAG field

        numeric  binary    description
        1        00000001  template has multiple fragments in sequencing
        2        00000010  each fragment properly mapped according to aligner
        4        00000100  fragment is unmapped
        8        00001000  mate is unmapped
        16       00010000  sequence is reverse complemented
        32       00100000  sequence of mate is reversed
        64       01000000  is first fragment in template
        128      10000000  is second fragment in template
  • 25. SAM FLAG: examples
      • 83 = 64 + 16 + 2 + 1 = 01010011: template has multiple fragments, each fragment is properly aligned, fragment is not unmapped, mate is not unmapped, sequence is reverse complemented, sequence of mate is not reversed, this is the first fragment in the template, this is not the second fragment in the template
      • 163 = 128 + 32 + 2 + 1 = 10100011: template has multiple fragments, each fragment is properly aligned, fragment is not unmapped, mate is not unmapped, sequence is not reverse complemented, sequence of mate is reversed, this is not the first fragment in the template, this is the second fragment in the template
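Decoding a FLAG value amounts to testing each bit with a bitwise AND. The sketch below uses the eight bits and descriptions from slide 24; the names `FLAG_BITS`, `decode_flag` and `is_mapped` are made up for this example.

```python
FLAG_BITS = [
    (1, "template has multiple fragments in sequencing"),
    (2, "each fragment properly mapped according to aligner"),
    (4, "fragment is unmapped"),
    (8, "mate is unmapped"),
    (16, "sequence is reverse complemented"),
    (32, "sequence of mate is reversed"),
    (64, "is first fragment in template"),
    (128, "is second fragment in template"),
]


def decode_flag(flag):
    """Return the descriptions of all bits that are set in a SAM FLAG value."""
    return [description for bit, description in FLAG_BITS if flag & bit]


def is_mapped(flag):
    """A read is mapped when the 'fragment is unmapped' bit (4) is not set."""
    return flag & 4 == 0


for flag in (83, 163):
    print(flag, format(flag, "08b"), decode_flag(flag))
```

The same `flag & 4` test is what samtools applies when filtering on the unmapped bit, so counting mapped reads does not require looking at any other field.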
  • 26. SAM format: CIGAR string

        M  alignment match (can be sequence match or mismatch)
        I  insertion to the reference
        D  deletion to the reference
        N  skipped region from the reference
        S  soft clipping (clipped sequence is present in SEQ)
        H  hard clipping (clipped sequence is not present in SEQ)
        P  padding (silent deletion from padded reference)
        =  sequence match
        X  sequence mismatch
  • 27. CIGAR string: example

        read       ACGCA-TGCAGTtagacgt
        reference  ACGCAGTG--GT
        CIGAR      5M1D2M2I2M7S
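The example CIGAR string can be checked by parsing it into (length, operation) pairs and adding up how many bases each operation consumes. In the SAM specification M, I, S, = and X consume read bases, while M, D, N, = and X consume reference bases. The function names below are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings with characters the pattern did not consume.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (read bases, reference bases) covered by the alignment."""
    ops = parse_cigar(cigar)
    read_length = sum(length for length, op in ops if op in "MIS=X")
    reference_length = sum(length for length, op in ops if op in "MDN=X")
    return read_length, reference_length


print(parse_cigar("5M1D2M2I2M7S"))
print(cigar_lengths("5M1D2M2I2M7S"))  # (18, 10)
```

The result agrees with the slide: the read ACGCATGCAGTtagacgt has 18 bases, and the aligned stretch of the reference, ACGCAGTGGT, has 10.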
  • 28. Running bwa (FASTQ -> BAM). http://bio-bwa.sourceforge.net Steps:
      1. Create an index for the genome (only has to be done once)
      2. Run “bwa aln” to find the suffix array coordinates of good hits for each individual read
      3. Run “bwa samse” (single-end) or “bwa sampe” (paired-end), which converts the suffix array coordinates to chromosomal coordinates and, for sampe, pairs the reads
  • 29. Running “bwa” without arguments returns help.
  • 30. bwa: indexing the genome. Only has to be done once! To index chromosome 17 only:
      1. Download chr17.fa.gz from the UCSC Genome Browser (downloads section) and decompress it to chr17.fa
      2. Run bwa index -a is chr17.fa
  • 32. bwa: finding suffix array coordinates for reads
  • 33. bwa: converting suffix array coordinates to chromosome coordinates
  • 34. Using Galaxy for read mapping
  • 35. Viewing BAM files. Many options:
      • Integrative Genomics Viewer (IGV) by Broad Institute
      • samtools tview
      • UCSC genome browser
      • bamview
      • bambino
      • ...
  • 36. Viewing BAM files: IGV. http://www.broadinstitute.org/software/igv/ (Java WebStart)
  • 37. IGV screenshot with labelled tracks: coverage, reads, polymorphisms, gene model
  • 38. Is this a known SNP?
  • 39. File -> Load from Server...
  • 41. Viewing BAM files: samtools tview. http://samtools.sourceforge.net
  • 43. Viewing BAM files: UCSC Genome Browser. http://genome.ucsc.edu -> “Genome Browser” -> “Manage Custom Tracks” -> “Add Custom Tracks” -> in “Edit configuration”: track type=bam name="My BAM" bigDataUrl=http://med.kuleuven.be/lcb/teaching/aln.sorted.bam -> “Submit”. aln.sorted.bam contains reads that map to the first 10Mb of chr17.
  • 45. zoomed in
  • 46. zoomed in even further: query template names
  • 48. Manipulating SAM/BAM files
      • convert SAM <-> BAM
      • remove PCR duplicates
      • sort BAM file - necessary for loading into tools such as IGV
      • index BAM file - necessary for loading into tools such as IGV
      • local realignment around indels
      • base quality recalibration
      • pileup - i.e. convert from read-based to position-based; SNP calling
      • ...
  • 49. Manipulating SAM/BAM files - tools: samtools (Li et al, 2009). http://samtools.sourceforge.net
  • 50. Command screenshots: convert SAM to BAM; sort; index
  • 51. Manipulating SAM/BAM files - tools: PICARD. http://picard.sourceforge.net = Java-based command-line utility with functionality similar to samtools. Useful commands:
      • MarkDuplicates - flags duplicate records (i.e. due to PCR amplification bias)
      • CalculateHsMetrics - calculates set of Hybrid Selection specific metrics
      • SamToFastq - extracts read sequences and qualities from SAM file
  • 53. Duplicate removal: PCR amplification bias. Some reads are amplified better than others => bias! => keep only one read per set of duplicates (the one with the highest mapping quality). Figure labels: “PCR went well”, “PCR didn’t go so well”, “PCR didn’t work”.
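The idea of keeping one read per set of duplicates can be sketched as below. This is a deliberate simplification and not the algorithm used by Picard or samtools: reads are treated as duplicates when they share reference name, position and strand, mate information is ignored, and the dictionary layout of a read is an assumption of the example.

```python
def remove_duplicates(reads):
    """Keep, per (reference, position, strand), the read with the highest MAPQ."""
    best = {}
    for read in reads:
        is_reverse = bool(read["flag"] & 16)  # bit 16: sequence is reverse complemented
        key = (read["rname"], read["pos"], is_reverse)
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())


# Hypothetical reads: r1 and r2 start at the same position on the same strand.
reads = [
    {"qname": "r1", "flag": 0, "rname": "chr9", "pos": 1000, "mapq": 37},
    {"qname": "r2", "flag": 0, "rname": "chr9", "pos": 1000, "mapq": 60},
    {"qname": "r3", "flag": 16, "rname": "chr9", "pos": 1000, "mapq": 20},
]
print([read["qname"] for read in remove_duplicates(reads)])  # ['r2', 'r3']
```

Read r3 is kept even though it starts at the same position, because it lies on the other strand and is therefore not considered a copy of the same fragment in this sketch.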
  • 54. Duplicate removal commands:
      • Picard: java -Xmx2048m -jar /path_to_picard/MarkDuplicates.jar INPUT=input.bam OUTPUT=output.bam METRICS_FILE=output.metrics VALIDATION_STRINGENCY=LENIENT
      • samtools: samtools rmdup input.bam output.bam
  • 55. Manipulating SAM/BAM files - tools: GATK. GATK = Genome Analysis Toolkit, developed by Broad Institute. http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
      • Full variant discovery workflow
      • Variant evaluation
      • ...
  • 56. Base quality recalibration
      • Why? Correct for variation in quality with machine cycle, sequence context, lane, baseQ, ...
      • Steps:
          • Identify what to correct for
          • Calculate covariates
          • Apply covariates
          • Check (create plots)
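The principle behind the "calculate covariates" step can be sketched as follows: group the sequenced bases by covariate values, count how often a base in each group disagrees with the reference, and convert that error rate into an empirical Phred quality, Q = -10 * log10(p). This is an illustration only, not GATK's implementation. The choice of covariates (reported quality and machine cycle), the function names and the +1/+2 smoothing are assumptions, and a real tool would first skip known variant sites such as those in dbSNP.

```python
import math
from collections import defaultdict


def phred(error_probability):
    """Convert an error probability into a Phred-scaled quality."""
    return -10 * math.log10(error_probability)


def empirical_qualities(observations):
    """Map (reported quality, cycle) to the empirically observed quality.

    `observations` holds one (reported_quality, cycle, is_mismatch) tuple per base.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, bases]
    for reported_quality, cycle, is_mismatch in observations:
        bin_counts = counts[(reported_quality, cycle)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1
    # Smoothing (assumed for this sketch) avoids log10(0) in bins without mismatches.
    return {key: phred((mismatches + 1) / (bases + 2))
            for key, (mismatches, bases) in counts.items()}


# Hypothetical data: 1000 bases reported as Q30 in cycle 1, 10 of them mismatching.
observations = [(30, 1, True)] * 10 + [(30, 1, False)] * 990
print(round(empirical_qualities(observations)[(30, 1)]))  # 20
```

In this made-up example the machine reports Q30 (1 error in 1000) but the data show roughly 1 error in 100, so applying the recalibration would lower these bases to about Q20.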
  • 57. Mapping quality dependent on sequence context
  • 58. Base quality recalibration with GATK:
      • Count covariates: java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R resources/Homo_sapiens_assembly18.fasta --DBSNP resources/dbsnp_129_hg18.rod -I my_reads.bam -T CountCovariates -cov ReadGroupCovariate -cov QualityScoreCovariate -cov DinucCovariate -recalFile my_reads.recal_data.csv
      • Recalibrate: java -Xmx4g -jar GenomeAnalysisTK.jar -l INFO -R resources/Homo_sapiens_assembly18.fasta -I my_reads.bam -T TableRecalibration -outputBam my_reads.recal.bam -recalFile my_reads.recal_data.csv
  • 61. Local realignment around indels with GATK:
      • Create targets: java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /path/to/reference.fasta -o /path/to/output.intervals
      • Realign: java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir -jar /path/to/GenomeAnalysisTK.jar -I input.bam -R ref.fasta -T IndelRealigner -targetIntervals /path/to/output.intervals -o realignedBam.bam
  • 62. Exercises
  • 63. Aligning reads to reference on the command line. Log in on the server mentioned on Toledo, and:
      • From directory ~jaerts/i0d51a/, copy the files s_1_sequence_small.txt, s_2_sequence_small.txt and chr9.fa to your own home directory.
      • Given that s_1_sequence_small.txt and s_2_sequence_small.txt contain paired reads, align these against chr9. You’ll first have to create an index for chr9 (see slides). Also convert the resulting SAM file to a BAM file.
      • How many of the reads were mapped? How many could not be mapped? There’s an easy way to do this with grep, but you get an extra point if you use the bitwise flag.
      • How many reads mapped over their full length without indels or clipping (i.e. CIGAR string equal to “90M”)? Note that M covers both sequence matches and mismatches.
  • 64. Aligning reads to reference using Galaxy. Log into your account on Galaxy.
      • Align the reads in s_1_sequence_small.txt and s_2_sequence_small.txt (that you uploaded in the last lesson) against hg19. Perform the mapping using BWA for Illumina. Use the built-in index “Human (Homo sapiens): hg19 Full” (type “hg19” in the “Select a reference genome” box). Do not suppress the header in the output SAM file.
      • Using Galaxy, create a histogram of the insert sizes of this DNA sequencing library (tip: you’ll need some commands from the “Text Manipulation” and “Filter and Sort” groups).
  • 65. Investigating BAM file with IGV. Start the IGV application from http://www.broadinstitute.org/software/igv/download (750MB version) and open the first10Mbchr17.sorted.bam file, which you can download from Toledo.
      • Is this data from a whole-genome sequencing experiment, or rather from some type of pulldown? If the latter: what type of pulldown (i.e. what were the targets)?
      • Is the complete CDS of the KIF1C gene covered?
      • What is the left-most gene that is also in OMIM (you can find those at “Load from Server -> hg19 -> Phenotype and Disease Associations”)? Are all its exons covered?
      • At position 11,928 of chromosome 17: is this a SNP? If it is: is it already known in dbSNP? What about position 13,905?