Next generation sequencing course - part 2: sequence mapping

[I0D51A] Bioinformatics: High-Throughput Analysis
Next-generation sequencing. Part 2: Mapping
Prof Jan Aerts
Faculty of Engineering - ESAT/SCD
jan.aerts@esat.kuleuven.be

TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)

Assembly vs mapping

Trapnell & Salzberg, 2009

challenges:
• how quickly can we align the reads to the genome?
• what do we do with repetitive sequences?

Approaches

• hash-based
• Burrows-Wheeler transform

Trapnell & Salzberg, 2009

Hash-based mapping

E.g. MAQ

Steps:

• Index reference genome (or sequence reads) => creates hash index (= big
file: >50GB)

• Divide each read into segments (seeds) and look up in table

seed    positions
...     ...
AAGC    3, 473, 2738, ...
AAGG    34, 236, 1827, ...
AAGT    8, 172, 782, 1921, ...
...     ...
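The seed lookup above can be sketched in a few lines of Python. This is only an illustration of the idea, not MAQ's actual implementation (which, among other things, uses spaced seeds); the function names, the toy reference and the seed length k = 4 are assumptions.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Map every k-mer of the reference to its 1-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos + 1)
    return index


def candidate_positions(read, index, k):
    """Split the read into non-overlapping seeds and look each one up.

    Returns {candidate read start (1-based): number of supporting seeds}.
    """
    votes = defaultdict(int)
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return dict(votes)


reference = "ACGTAAGCTTAAGG"  # toy reference, assumption for illustration
index = build_seed_index(reference, k=4)
print(index["AAGC"])                                  # [5]
print(candidate_positions("AAGCTTAA", index, k=4))    # {5: 2}
```

Candidate positions supported by several seeds would then be checked with a full alignment of the read.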

Burrows-Wheeler transform

E.g. BWA

Used in data compression (e.g. bzip) => index: much smaller than hash-based
index (<2GB)

Alignment speed: 30x faster than MAQ

Steps:

• Create BWT index of genome

• Align read 1 character at a time to BWT-transformed genome

Burrows-Wheeler transform

[Figure: creating the Burrows-Wheeler transform]
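As a minimal sketch, the transform can be computed by sorting all rotations of the text and reading off the last column. The example text ^GOOGOL$ is the one used on the following slides; the character order (letters first, then ^, then $) is an assumption chosen to match the F column shown there, and real aligners build the index through a suffix array rather than explicit rotations.

```python
# Assumed sort order: letters, then start marker ^, then end marker $.
ORDER = "GLO^$"


def bwt(text, order=ORDER):
    """Return the last column of the sorted rotation matrix of text."""
    rank = {ch: i for i, ch in enumerate(order)}
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort(key=lambda rot: [rank[ch] for ch in rot])
    return "".join(rot[-1] for rot in rotations)


print(bwt("^GOOGOL$"))  # O^OOGG$L
```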

Inverse BWT: recreating original text
if BWT = O^OOGG$L => what was the original text?

O^OOGG$L = last column L => first column F = sorted L

Last column L    --sort-->    First column F
O                             G
^                             G
O                             L
O                             O
G                             O
G                             O
$                             ^
L                             $

Inverse BWT: recreating original text

The i-th occurrence of a character in L corresponds to the same position in the
text as the i-th occurrence of that character in F

          F    L
1st G     G    O    1st O
2nd G     G    ^    1st ^
1st L     L    O    2nd O
1st O     O    O    3rd O
2nd O     O    G    1st G
3rd O     O    G    2nd G
1st ^     ^    $    1st $
1st $     $    L    1st L

Reconstruction: start with the end marker $, go to its row in F, prepend the
character found in column L of that row, then move to the row where that same
occurrence appears in F. Repeat until the start marker ^ has been prepended.

row in F    character in L    text so far
(start)                       $
1st $       1st L             L$
1st L       2nd O             OL$
2nd O       1st G             GOL$
1st G       1st O             OGOL$
1st O       3rd O             OOGOL$
3rd O       2nd G             GOOGOL$
2nd G       1st ^             ^GOOGOL$
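The walk through the F and L columns can be sketched as follows. The function name and the assumed character order (letters, then ^, then $, as in the F column of the slides) are illustrative choices; the code relies on the text containing exactly one end marker.

```python
def inverse_bwt(last, order="GLO^$", end="$"):
    """Rebuild the original text from its BWT (column L)."""
    rank = {ch: i for i, ch in enumerate(order)}
    n = len(last)
    # Stable sort of the rows of L by character gives column F: entry f of
    # this list is the row of L that holds the same text occurrence as row f of F.
    f_to_l = sorted(range(n), key=lambda i: (rank[last[i]], i))
    lf = [0] * n  # lf[row in L] = row in F with the same occurrence
    for f_row, l_row in enumerate(f_to_l):
        lf[l_row] = f_row
    first = [last[i] for i in f_to_l]

    row = first.index(end)  # start at the row whose F entry is the end marker
    text = end
    for _ in range(n - 1):
        text = last[row] + text  # prepend the character in column L
        row = lf[row]            # jump to the same occurrence in column F
    return text


print(inverse_bwt("O^OOGG$L"))  # ^GOOGOL$
```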

Searching using BWT

uses the row index and the fact that rows are alphabetically sorted => binary search
e.g. at what positions does “GO” occur in “^GOOGOL$”?

take the middle row: is “GO” alphabetically before or after this row?
-> if before: continue with the middle row of the first half (if after: the last half) and
discard the other half
-> repeat until the string is found
-> row indices indicate the positions of the substring: “GO” is at positions 2 and 5
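A minimal sketch of this binary search, using a plain suffix array (the sorted start positions of all suffixes) as the row index. It uses Python's default character order rather than the order on the slides, which does not change the result, and it is not how BWA searches in practice (BWA matches the read one character at a time against the BWT index). Function names are assumptions.

```python
def build_suffix_array(text):
    """0-based start positions of all suffixes, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the 1-based positions at which pattern occurs in text."""
    m = len(pattern)

    def prefix(row):
        start = suffix_array[row]
        return text[start:start + m]

    # First row whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First row whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[row] + 1 for row in range(first, lo))


text = "^GOOGOL$"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "GO"))  # [2, 5]
```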
Issues

• placing reads in regions that do not exist in the reference genome

• sequencing errors and variations: alignment between read and true source in
genome may have more differences than alignment with some other copy of
repeat
What if many nucleotide differences with closest fully sequenced genome?

• placing reads in repetitive regions: MAQ & bwa return only 1 mapping; if
a read maps equally well to multiple locations: mapQ = 0

• MAQ & bwa: use paired-end information => might prefer correct distance
over correct alignment

File formats

SAM (Sequence Alignment/Map) format = unified format for storing read
alignments to a reference genome

BAM = binary version of SAM for fast querying

7172283 83 chr9 139389482 60 90M = 139389330 -242 ACGGGAG... #######...
7172283 163 chr9 139389330 60 90M = 139389482 242 TAGGAGG... EHHHHHH...
7705896 83 chr9 139389513 60 90M = 139389512 -91 GCTGGGG... EBCHHFC...
7705896 163 chr9 139389512 60 90M = 139389513 91 AGCTGGG... HHHHHHH...

1 QNAME query template name
2 FLAG bitwise ﬂag
3 RNAME reference sequence name
4 POS 1-based leftmost mapping position
5 MAPQ mapping quality
6 CIGAR CIGAR string
7 RNEXT reference name of mate
8 PNEXT position of mate
9 TLEN observed template length
10 SEQ sequence
11 QUAL ASCII of Phred-scaled base quality
http://samtools.sourceforge.net/SAM1.pdf
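As an illustration of the eleven mandatory fields, the sketch below splits one alignment line into named fields and decodes QUAL characters into Phred scores (ASCII code minus 33). SAM files are tab-delimited; the record is the first one shown above, with SEQ and QUAL truncated exactly as on the slide, so it is not a complete valid record. The names `SamRecord`, `parse_sam_line` and `phred_scores` are assumptions.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Parse the 11 mandatory fields of one alignment line (optional tags ignored)."""
    columns = line.rstrip("\n").split("\t")[:11]
    values = [
        int(value) if name in INT_FIELDS else value
        for name, value in zip(SamRecord._fields, columns)
    ]
    return SamRecord(*values)


def phred_scores(qual):
    """Convert a QUAL string (Phred+33 encoding) to a list of base qualities."""
    return [ord(ch) - 33 for ch in qual]


line = "\t".join([
    "7172283", "83", "chr9", "139389482", "60", "90M",
    "=", "139389330", "-242", "ACGGGAG...", "#######...",  # truncated on the slide
])
record = parse_sam_line(line)
print(record.rname, record.pos, record.tlen)  # chr9 139389482 -242
print(phred_scores("#EH"))                    # [2, 36, 39]
```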

paired data

• both reads of a pair share the same QNAME (here 7172283 and 7705896)
• RNEXT “=” means that the mate maps to the same reference sequence
• PNEXT holds the position of the mate; TLEN has opposite signs for the two mates (e.g. -242 and 242)

SAM format: FLAG field

numeric   binary     description
1         00000001   template has multiple fragments in sequencing
2         00000010   each fragment properly mapped according to aligner
4         00000100   fragment is unmapped
8         00001000   mate is unmapped
16        00010000   sequence is reverse complemented
32        00100000   sequence of mate is reversed
64        01000000   is first fragment in template
128       10000000   is second fragment in template

SAM FLAG: examples

• 83 = 64 + 16 + 2 + 1 = 01010011

template has multiple fragments, each fragment is properly aligned,
fragment is not unmapped, mate is not unmapped, sequence is reverse
complemented, sequence of mate is not reversed, this is the first fragment
in the template, this is not the second fragment in the template

• 163 = 128 + 32 + 2 + 1 = 10100011

template has multiple fragments, each fragment is properly aligned,
fragment is not unmapped, mate is not unmapped, sequence is not
reverse complemented, sequence of mate is reversed, this is not the first
fragment in the template, this is the second fragment in the template
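The two examples above can be reproduced with a bitwise AND against each flag value. A minimal sketch; the short labels are assumptions chosen for readability, and only the eight bits listed on the FLAG slide are covered.

```python
FLAG_BITS = [
    (1, "paired"),            # template has multiple fragments
    (2, "proper_pair"),       # each fragment properly mapped
    (4, "unmapped"),          # fragment is unmapped
    (8, "mate_unmapped"),     # mate is unmapped
    (16, "reverse"),          # sequence is reverse complemented
    (32, "mate_reverse"),     # sequence of mate is reversed
    (64, "first_in_pair"),    # first fragment in template
    (128, "second_in_pair"),  # second fragment in template
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


print(decode_flag(83))   # ['paired', 'proper_pair', 'reverse', 'first_in_pair']
print(decode_flag(163))  # ['paired', 'proper_pair', 'mate_reverse', 'second_in_pair']
```

A read is unmapped exactly when `flag & 4` is non-zero, which is the test needed to count mapped and unmapped reads from the FLAG column.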

SAM format: CIGAR string

M alignment match (can be sequence match or mismatch)

I insertion to the reference

D deletion to the reference

N skipped region from the reference

S soft clipping (clipped sequence is present in SEQ)

H hard clipping (clipped sequence is not present in SEQ)

P padding (silent deletion from padded reference)

= sequence match

X sequence mismatch

CIGAR string: example

read       ACGCA-TGCAGTtagacgt
reference  ACGCAGTG--GT
CIGAR      5M1D2M2I2M7S
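The example can be checked by parsing the CIGAR string and summing the operations that consume read bases (M, I, S, =, X) and those that consume reference bases (M, D, N, =, X). A minimal sketch with assumed function names:

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def query_length(cigar):
    """Number of read bases present in SEQ."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar):
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


print(parse_cigar("5M1D2M2I2M7S"))
print(query_length("5M1D2M2I2M7S"))      # 18 = length of ACGCATGCAGTtagacgt
print(reference_length("5M1D2M2I2M7S"))  # 10 = length of ACGCAGTGGT
```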

Running bwa (FASTQ -> BAM)

http://bio-bwa.sourceforge.net

Steps:

1. Create index for genome (only has to be done once)

2. Run “bwa aln” to find suffix array coordinates of good hits of each
individual read

3. Run “bwa samse/sampe”, which converts suffix array coordinates to
chromosomal coordinates and pairs the reads (for sampe)
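For paired-end reads, the three steps translate into the command lines built below. This is a sketch that only assembles and prints the commands; it does not run them. The helper name, the output prefix and the intermediate file names are assumptions, and the final `samtools view` call assumes a samtools release that auto-detects SAM input (older releases also need `-S`).

```python
import shlex


def bwa_paired_end_commands(reference, fastq1, fastq2, prefix="aln"):
    """Return the shell commands for index -> aln -> sampe -> BAM conversion."""
    ref, fq1, fq2 = (shlex.quote(p) for p in (reference, fastq1, fastq2))
    sai1, sai2 = f"{prefix}_1.sai", f"{prefix}_2.sai"
    return [
        f"bwa index -a is {ref}",                                   # step 1, once per genome
        f"bwa aln {ref} {fq1} > {sai1}",                            # step 2, first reads
        f"bwa aln {ref} {fq2} > {sai2}",                            # step 2, second reads
        f"bwa sampe {ref} {sai1} {sai2} {fq1} {fq2} > {prefix}.sam",  # step 3
        f"samtools view -b -o {prefix}.bam {prefix}.sam",           # SAM -> BAM
    ]


for command in bwa_paired_end_commands(
    "chr9.fa", "s_1_sequence_small.txt", "s_2_sequence_small.txt"
):
    print(command)
```

For single-end data, one `bwa aln` call followed by `bwa samse` takes the place of the two `aln` calls and `sampe`.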

Running “bwa” without arguments returns help.

bwa: indexing the genome

Only has to be done once!

To index chromosome 17 only:

1. Download chr17.fa.gz from UCSC Genome Browser (downloads section) and uncompress it

2. Run bwa index -a is chr17.fa

bwa: finding suffix array coordinates for reads

bwa: converting suffix array coordinates to chromosome coordinates

Using Galaxy for read mapping

Viewing BAM files

Many options:

• Integrative Genomics Viewer (IGV) by Broad Institute

• samtools tview

• UCSC genome browser

• bamview

• bambino

• ...
Viewing BAM files: IGV

http://www.broadinstitute.org/software/igv/

Java WebStart

[IGV screenshot with labelled tracks: coverage, reads, polymorphisms, gene model]

Is this a known SNP?

File -> Load from Server...

Viewing BAM files: samtools tview

http://samtools.sourceforge.net

Viewing BAM files: UCSC Genome Browser

http://genome.ucsc.edu
-> “Genome Browser”
-> “Manage Custom Tracks”
-> “Add Custom Tracks”
-> In “Edit configuration”:

track type=bam name="My BAM" bigDataUrl=http://med.kuleuven.be/lcb/teaching/aln.sorted.bam

-> “Submit”

aln.sorted.bam contains reads that map to the first 10Mb of chr17

[UCSC Genome Browser screenshots: whole chromosome; zoomed in even further, showing query template names]

Manipulating SAM/BAM files

• convert SAM <-> BAM

• remove PCR duplicates

• sort BAM file - necessary for loading into tools such as IGV

• index BAM file - necessary for loading into tools such as IGV

• local realignment around indels

• base quality recalibration

• pileup - i.e. convert from read-based to position-based; SNP calling

• ...

Manipulating SAM/BAM files - tools: samtools

Li et al, 2009

http://samtools.sourceforge.net

[Screenshots of samtools commands: convert SAM to BAM, sort, index]

Manipulating SAM/BAM files - tools: PICARD

http://picard.sourceforge.net

= Java-based command-line utility with functionality similar to samtools

useful commands:

• MarkDuplicates - flags duplicate records (i.e. due to PCR amplification
bias)

• CalculateHsMetrics - calculates set of Hybrid Selection specific metrics

• SamToFastq - extracts read sequences and qualities from SAM file

Duplicate removal

PCR amplification bias

some reads are amplified better than others => bias!

=> keep only one read of each set of duplicates (the one with the highest mapping quality)

[Figure labels: PCR went well; PCR didn’t go so well; PCR didn’t work]
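The rule "keep only one, with the highest mapping quality" can be sketched as below. Treating reads with the same reference name, position and strand as duplicates is a simplifying assumption made for illustration; real tools use more information (for example both mates of a pair). The dictionary keys and the function name are hypothetical.

```python
def remove_duplicates(reads):
    """Keep, per (reference, position, strand), the read with the highest MAPQ."""
    best = {}
    for read in reads:
        key = (read["rname"], read["pos"], read["reverse"])
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())


reads = [
    {"qname": "a", "rname": "chr9", "pos": 100, "reverse": False, "mapq": 20},
    {"qname": "b", "rname": "chr9", "pos": 100, "reverse": False, "mapq": 60},
    {"qname": "c", "rname": "chr9", "pos": 200, "reverse": False, "mapq": 30},
]
print([read["qname"] for read in remove_duplicates(reads)])  # ['b', 'c']
```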

Picard

java -Xmx2048m \
  -jar /path_to_picard/MarkDuplicates.jar \
  INPUT=input.bam \
  OUTPUT=output.bam \
  METRICS_FILE=output.metrics \
  VALIDATION_STRINGENCY=LENIENT

samtools

samtools rmdup input.bam output.bam

Manipulating SAM/BAM files - tools: GATK

GATK = Genome Analysis Toolkit, developed by Broad Institute

http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

• Full variant discovery workflow

• Variant evaluation

• ...

Base quality recalibration

• Why?

correct for variation in quality with machine cycle, sequence context, lane,
baseQ, ...

• Steps:

• Identify what to correct for

• Calculate covariates

• Apply covariates

• Check (create plots)
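The idea behind the "calculate" and "apply" steps can be illustrated with a single covariate, the reported base quality: count how often bases with a given reported quality disagree with the reference (at positions that are not known variants) and convert the observed error rate back to the Phred scale, Q = -10 log10(error rate). This is a heavily simplified sketch, not GATK's implementation; the function names and the +1/+2 smoothing of the error rate are assumptions.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """observations: iterable of (reported_quality, is_mismatch) pairs.

    Returns {reported quality: empirical Phred quality}.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [bases, mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)
    table = {}
    for reported_q, (bases, mismatches) in counts.items():
        error_rate = (mismatches + 1) / (bases + 2)  # smoothing avoids log10(0)
        table[reported_q] = round(-10 * math.log10(error_rate))
    return table


def recalibrate(qualities, table):
    """Replace each reported quality by its empirical value, if one is known."""
    return [table.get(q, q) for q in qualities]


# 1000 bases reported as Q30, of which 10 are mismatches: about 1 error in 100.
observations = [(30, False)] * 990 + [(30, True)] * 10
table = empirical_qualities(observations)
print(table)                         # {30: 20}
print(recalibrate([30, 25], table))  # [20, 25]
```

A full recalibration would bin the counts by several covariates at once (read group, machine cycle, dinucleotide context), as the -cov options in the GATK command indicate.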

Mapping quality dependent on sequence context

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  --DBSNP resources/dbsnp_129_hg18.rod \
  -I my_reads.bam \
  -T CountCovariates \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -recalFile my_reads.recal_data.csv

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  -I my_reads.bam \
  -T TableRecalibration \
  -outputBam my_reads.recal.bam \
  -recalFile my_reads.recal_data.csv

Local realignment near indels

java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I input.bam \
  -R ref.fasta \
  -T IndelRealigner \
  -targetIntervals /path/to/output.intervals \
  -o realignedBam.bam
Aligning reads to reference on the command line

Log in to the server mentioned on Toledo, and:

From directory ~jaerts/i0d51a/: copy the files s_1_sequence_small.txt,
s_2_sequence_small.txt and chr9.fa to your own home directory.

Given that s_1_sequence_small.txt and s_2_sequence_small.txt contain paired
reads: align these against chr9. You’ll first have to create an index for chr9 (see slides).
Also convert the resulting SAM file to a BAM file.

How many of the reads were mapped? How many could not be mapped? There’s an
easy way to do this with grep, but you get an extra point if you can use the bitwise flag.

How many reads mapped over their full length without indels or clipping (i.e. CIGAR string
equal to “90M”)? Note that M covers both sequence matches and mismatches.

Aligning reads to reference using Galaxy

Log into your account on Galaxy.

Align the reads in s_1_sequence_small.txt and s_2_sequence_small.txt (that
you uploaded in the last lesson) against hg19. Perform the mapping using BWA
for Illumina. Use the built-in index “Human (Homo sapiens): hg19 Full” (type
“hg19” in the “Select a reference genome” box). Do not suppress the header in
the output SAM file.

Using Galaxy: create a histogram of the insert sizes of this DNA sequencing
library (tip: you’ll need some commands from the “Text Manipulation” and
“Filter and Sort” groups)

Investigating BAM file with IGV

Start the IGV application from http://www.broadinstitute.org/software/igv/download
(750MB version) and open the first10Mbchr17.sorted.bam file which you can download
from Toledo.

• Is this data from a whole-genome sequencing experiment, or rather from some type of
pulldown? If the latter: what type of pulldown (i.e. what were the targets).

• Is the complete CDS of the KIF1C gene covered?

• What is the left-most gene that is also in OMIM (you can find those at “Load from Server
-> hg19 -> Phenotype and Disease Associations”)? Are all its exons covered?

• At position 11,928 of chromosome 17: is this a SNP? If it is: is it already known in
dbSNP? What about position 13,905?
