2. RNA-Seq
• Application of next-generation sequencing (NGS) technology
to RNA, for transcript identification and quantification.
• Can be used for:
– Estimating the number of transcripts in the sample
(transcriptomics or expression profiling)
– Revealing sequence variation
– Detecting alternative splicing
– Comparing gene expression profiles of healthy versus diseased tissue
6. Read-Mapping Challenges
• NGS Computational challenges
• Memory footprint
• Millions of short reads
• RNA-Seq Special Mapping Concerns
• New technology, old problems
• Exact vs inexact matches
Source: Wikipedia
7. Algorithms For Read Mapping
• Build an index
– Hash table
– Burrows-Wheeler transform (BWT); FM index
• Seed and extend
– Find the set of positions where reads are most likely to align
– Refine the alignment at the target locations
8. Hash Tables
• Use hash tables to store the positions of all k-mers
in a genome
Genome (Chr 9), positions 0-21:
AATCGCATAGTTATTAATGCTA
10-mers and their stored locations:
AATCGCATAG - Chr 9, location 0
ATCGCATAGT - Chr 9, location 1
TCGCATAGTT - Chr 9, location 2
CGCATAGTTA - Chr 9, location 3
GCATAGTTAT - Chr 9, location 4
CATAGTTATT - Chr 9, location 5
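The k-mer table on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `build_kmer_index` is hypothetical, and a real read mapper would store the index far more compactly than a Python dictionary.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the genome to the 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return dict(index)


genome = "AATCGCATAGTTATTAATGCTA"  # toy Chr 9 sequence from the slide
index = build_kmer_index(genome, k=10)

print(index["AATCGCATAG"])  # [0]
print(index["CATAGTTATT"])  # [5]
```

Looking up a read's k-mer in the dictionary returns candidate positions in constant time, which is what makes the hash table useful as a seed finder.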
9. Burrows-Wheeler Transformation (BWT)
Input String: GCTAGCTA
Rotations      Sorted rotations
GCTAGCTA       AGCTAGCT
CTAGCTAG       AGCTAGCT
TAGCTAGC       CTAGCTAG
AGCTAGCT       CTAGCTAG
GCTAGCTA       GCTAGCTA
CTAGCTAG       GCTAGCTA
TAGCTAGC       TAGCTAGC
AGCTAGCT       TAGCTAGC
Output String (last column of the sorted rotations): TTGGAACC
• Reversible transformation
• Repetitive nature of the
outcome makes it easier to
compress
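The rotate-and-sort construction on this slide can be sketched directly in Python. Two things here are assumptions and not part of the slide: the function names, and the `$` sentinel appended before inversion so that the original string can be picked out unambiguously. This naive version is for illustration only; practical aligners build the transform from a suffix array.

```python
def bwt(text: str) -> str:
    """Return the last column of the sorted rotations of text."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Invert a BWT whose input ended with a unique sentinel character."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the last column to every row, then re-sort the rows.
        table = sorted(ch + row for ch, row in zip(last_column, table))
    for row in table:
        if row.endswith(sentinel):
            return row[:-1]
    raise ValueError("sentinel not found in transformed string")


print(bwt("GCTAGCTA"))  # TTGGAACC
print(inverse_bwt(bwt("GCTAGCTA$")))  # GCTAGCTA
```

The runs of identical letters in the output (TT, GG, AA, CC) are what make the transformed string easier to compress.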
13. RNA-Seq: Special Mapping Concerns
• Many RNA-Seq reads map directly to the reference genome, but many
others do not: because they come from spliced RNA, they
span exon–exon junctions.
• Methods to deal with junction reads
• Align to the reference transcriptome (well annotated).
• Align to the reference genome and build a junction library
from known adjacent exons and then align unmapped reads to
junction library
• Map reads to the genome and identify putative exons (indel-finding
algorithm); from these candidate exons, build all
possible exon-exon junctions
• De novo assembly of RNA-Seq reads
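As a rough illustration of the junction-library approach, the sketch below joins the end of each exon to the start of the next one. The function name, the toy exon sequences and the choice of `read_length - 1` flanking bases are all assumptions made for this example; they are not taken from any particular tool.

```python
def build_junction_library(exons: list[str], read_length: int) -> dict[tuple[int, int], str]:
    """Build one junction sequence per pair of adjacent exons.

    Each junction keeps read_length - 1 bases on either side, so any read
    crossing the junction has at least one base in each exon.
    """
    if read_length < 2:
        raise ValueError("read_length must be at least 2")
    flank = read_length - 1
    junctions = {}
    for i in range(len(exons) - 1):
        junctions[(i, i + 1)] = exons[i][-flank:] + exons[i + 1][:flank]
    return junctions


# Hypothetical exons of one gene, in genomic order.
exons = ["AAAACCCC", "GGGGTTTT", "ACGTACGT"]
library = build_junction_library(exons, read_length=5)

print(library[(0, 1)])  # CCCCGGGG
print("CCGGG" in library[(0, 1)])  # True
```

A read such as `CCGGG` cannot align contiguously to the genome if an intron separates the two exons, but it matches the junction sequence exactly.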
19. Summarizing Reads
• Aggregate reads over biologically meaningful units such as transcripts or
genes
• Count the number of reads overlapping the exons in a gene (but a significant
proportion of the reads will also map outside annotated regions)
Genome Biology 2010 11:220, DOI: 10.1186/gb-2010-11-12-220
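Counting reads over the exons of a gene comes down to an interval-overlap test. The sketch below assumes half-open `(start, end)` coordinates on a single chromosome and uses made-up positions; the function name is hypothetical, and real counting tools also deal with strand, multi-mapping reads and overlapping genes.

```python
def count_reads_in_gene(exons: list[tuple[int, int]], reads: list[tuple[int, int]]) -> int:
    """Count reads that overlap at least one exon of the gene."""
    count = 0
    for read_start, read_end in reads:
        if any(read_start < exon_end and exon_start < read_end
               for exon_start, exon_end in exons):
            count += 1
    return count


# Hypothetical coordinates for illustration only.
exons = [(100, 200), (300, 400)]
reads = [(90, 110), (150, 180), (210, 240), (390, 420), (500, 530)]

print(count_reads_in_gene(exons, reads))  # 3
```

The two reads that are not counted fall in an intron and downstream of the gene, which is the "outside annotated regions" case mentioned on the slide.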
20. Count Normalization
• Number of reads aligned to a gene gives a measure of
its level of expression
• Normalization of the count data
• Sequencing depth
• Length bias
Figure: read count versus transcript length (short transcript: low read count; long transcript: high read count)
Nature Methods 8, 469–477 (2011), Doi:10.1038/nmeth.1613
21. Count Normalization
• RPKM (Reads Per Kilobase of exon model per Million mapped reads)
• FPKM (Fragments Per Kilobase of exon model per Million mapped reads)
• TPM (Transcripts per million)
\[
\mathrm{RPKM} = \frac{\text{Raw number of reads}}{\text{Exon length (kb)}} \times \frac{1{,}000{,}000}{\text{Number of mapped reads in the sample}}
\]
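The RPKM definition, and TPM alongside it, can be written as two small functions. This is a sketch under stated assumptions: the function names are hypothetical, lengths are given in base pairs, and the numbers in the usage lines are invented. TPM is computed in the usual way, by first dividing each count by the transcript length and then rescaling so the values sum to one million.

```python
def rpkm(read_count: float, length_bp: float, total_mapped_reads: float) -> float:
    """Reads per kilobase of exon model per million mapped reads."""
    return read_count / (length_bp / 1_000) / (total_mapped_reads / 1_000_000)


def tpm(read_counts: list[float], lengths_bp: list[float]) -> list[float]:
    """Transcripts per million for all transcripts in one sample."""
    rates = [count / (length / 1_000) for count, length in zip(read_counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate / total_rate * 1_000_000 for rate in rates]


# Invented numbers: 1,000 reads on a 2 kb transcript, 10 million mapped reads.
print(rpkm(1_000, 2_000, 10_000_000))  # 50.0

values = tpm([10, 20], [1_000, 1_000])
print(round(sum(values)))  # 1000000
```

Unlike RPKM, the TPM values of a sample always sum to the same total, which makes proportions easier to compare between samples.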
22. Count Normalization
Gene/Transcript Name    R1 counts    R2 counts
A (50 kb)               37000        70000
B (100 kb)              50000        110000
C (200 kb)              50000        88000
D (-- kb)               ----         ----
XDD (-- kb)             ----         -----
Total number of reads   2000000      4000000
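As a worked example, RPKM can be computed for genes A, B and C from the counts in this table, treating the "Total number of reads" row as the number of mapped reads in each sample (an assumption). Rows D and XDD are left out because their values are not given, which is also why TPM, which needs every gene in the sample, is not attempted here.

```python
# gene: (length in kb, R1 count, R2 count), copied from the table
table = {
    "A": (50, 37_000, 70_000),
    "B": (100, 50_000, 110_000),
    "C": (200, 50_000, 88_000),
}
total_reads = {"R1": 2_000_000, "R2": 4_000_000}

rpkm_values = {}
for gene, (length_kb, r1, r2) in table.items():
    rpkm_values[gene] = (
        r1 / length_kb / (total_reads["R1"] / 1_000_000),
        r2 / length_kb / (total_reads["R2"] / 1_000_000),
    )

for gene, (rpkm_r1, rpkm_r2) in rpkm_values.items():
    print(gene, rpkm_r1, rpkm_r2)
# A 370.0 350.0
# B 250.0 275.0
# C 125.0 110.0
```

B and C have the same raw count in R1, yet C's RPKM is half of B's because C is twice as long; and although every R2 count is larger than its R1 counterpart, the normalized values are close once the doubled sequencing depth is taken into account.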
25. Differential Expression
• The goal of differential expression (DE) analysis is to
identify genes whose abundance has changed
significantly across experimental
conditions
• Biological replicates (to account for biological
variation)
• Ranked list of genes with associated p-values
and fold changes
• DE tools: edgeR, DESeq
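To make the "fold changes" part of the ranked list concrete, the sketch below computes a log2 fold change from replicate means and ranks genes by its magnitude. It is deliberately simplified and is not what edgeR or DESeq do: those tools fit a negative binomial model to the counts and report p-values, which this example does not attempt. The function name, the pseudocount and all numbers are assumptions.

```python
import math


def log2_fold_change(control: list[float], treated: list[float], pseudocount: float = 1.0) -> float:
    """log2 ratio of mean normalized counts (treated over control)."""
    mean_control = sum(control) / len(control)
    mean_treated = sum(treated) / len(treated)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


# Invented normalized counts: gene -> (control replicates, treated replicates)
genes = {
    "geneX": ([99, 99], [399, 399]),
    "geneY": ([199, 199], [199, 199]),
    "geneZ": ([799, 799], [99, 99]),
}

ranked = sorted(genes, key=lambda g: abs(log2_fold_change(*genes[g])), reverse=True)
print(ranked)  # ['geneZ', 'geneX', 'geneY']
```

The pseudocount keeps the ratio defined when a gene has zero counts in one condition.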
26. Alignment Independent Quantification
• Sailfish
• Salmon
• Kallisto
Main Idea
• Quantify the abundance of known transcripts
• Read mapping is unnecessary
• Replace inexact pattern matching with exact sub-pattern counting
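The idea of replacing alignment with exact k-mer lookups can be illustrated with a toy index over known transcripts. This is only a sketch of the counting idea: Sailfish, Salmon and Kallisto do considerably more than this, and none of the function names or sequences below come from those tools.

```python
from collections import Counter, defaultdict


def kmers(sequence: str, k: int) -> list[str]:
    """All overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_transcript_index(transcripts: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, sequence in transcripts.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(name)
    return index


def count_kmer_hits(reads: list[str], index: dict[str, set[str]], k: int) -> Counter:
    """Count exact k-mer matches per transcript; no alignment is performed."""
    hits = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            for name in index.get(kmer, ()):
                hits[name] += 1
    return hits


# Hypothetical transcripts and reads.
transcripts = {"tx1": "ACGTACGTGG", "tx2": "TTGACCATGA"}
reads = ["ACGTAC", "GACCAT", "CGTGG"]

index = build_transcript_index(transcripts, k=4)
hits = count_kmer_hits(reads, index, k=4)
print(hits["tx1"], hits["tx2"])  # 5 3
```

Every lookup is an exact dictionary hit, so no inexact matching is needed at any point.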
Accurate maps of transcript start and end sites
Detect sequence rearrangements and abnormal transcript structures (common in tumours)
It reflects the current state of the cell and can reveal pathological mechanisms
In the past, techniques such as microarrays were used to study gene expression. A microarray consists of an array of probes whose sequences represent particular regions of the genes to be monitored. But there were several limitations:
High background levels due to cross hybridization
Reliance on prior knowledge about the genome
On the other hand, the signal from RNA-Seq data is digital in nature because you get counts.
It has base-pair level resolution and a much higher dynamic range of expression levels.
We can find novel transcripts and fusion products.
Extraction of the RNA
Remove contaminant DNA
If the goal of the experiment is expression profiling, polyA selection is used to enrich mRNA in eukaryotes, but it will miss non-coding RNAs and RNAs that lack polyA tails. The other library preparation option is to deplete rRNA.
Library preparation can introduce biases such as amplification of GC-rich regions and generation of duplicate sequences.
Pattern searching and data compression are old computational problems.
Exact matching is very quick, but inexact matching (Smith-Waterman algorithm), which takes SNPs and indels into account, is very slow.
First build an index and find the most probable sites where reads can match. Then at these putative sites (narrowed down) do local alignment.
Reads are coming from the mRNA and we are trying to match them to the genome.
Splicing is a post-transcriptional modification in which non-coding regions are removed.
Many transcripts will share exons
Transcriptomes are incomplete even for well-studied species
In the first step of the alignment you can start by aligning reads either to the reference genome or to the transcriptome. Aligning to the transcriptome is a new feature in TopHat2. It improves the overall accuracy and sensitivity of the mapping. It also speeds up the analysis because the transcriptome is smaller.
Some of the reads will not be mapped because they come from unknown transcripts not present in the annotation, and there will also be poorly aligned reads.
So the next step is to take these unmapped reads and find novel splice sites. TopHat2 does this by splitting the unmapped reads into non-overlapping segments, 25 bp long by default, and aligning these segments against the genome. The maximum intron size is 100 kb by default, and that is the window in which we look for matches of the left and right segments. When that pattern is detected, TopHat2 tries to find the most likely location of the splice sites.
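The segment-splitting step described in this note can be sketched as follows. This is not TopHat2's code: the function names are hypothetical, the treatment of a short trailing segment is an assumption, and the 25 bp and 100 kb defaults are simply the values quoted in the note.

```python
def split_into_segments(read: str, segment_length: int = 25) -> list[str]:
    """Split a read into non-overlapping segments (the last one may be shorter)."""
    return [read[i:i + segment_length] for i in range(0, len(read), segment_length)]


def could_span_intron(left_end: int, right_start: int, max_intron: int = 100_000) -> bool:
    """True if two mapped segments are separated by a gap no larger than max_intron.

    left_end is the genomic coordinate just past the left segment;
    right_start is where the right segment begins.
    """
    gap = right_start - left_end
    return 0 < gap <= max_intron


read = "ACGTA" * 15  # a made-up 75 bp read
segments = split_into_segments(read)
print([len(segment) for segment in segments])  # [25, 25, 25]

print(could_span_intron(1_025, 6_000))    # True
print(could_span_intron(1_025, 500_000))  # False
```

A left and right segment that map within the window but not next to each other mark a candidate intron, which is where the search for the exact splice site would then begin.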
After detecting the splice junction, TopHat2 puts together
based on known junction signals (GT-AG, GC-AG and AT-AC).
Overview of RNA-seq analysis. Reads produced by an RNA-seq experiment are aligned to the genome, then clustered into a graph structure that is traversed to recover all possible isoforms at one locus. Lastly, a subset of transcripts is selected and their abundance quantified from the input reads.
Number of reads aligned gives a measure of the level of expression
Cell type specific exon
Let A and B be two RNA-seq experiments under the same conditions, by which I mean there are no differentially expressed genes. If experiment A generates twice as many reads as B, it is likely that the counts from experiment A will be doubled.
Length bias: the expected number of reads mapped to a gene is proportional to both the abundance and the length of the isoforms transcribed from that gene
Adjust for the sequencing depth (“Million” part)
Adjust for the gene length (“kilobase” part)
Sequencing depth of a sample: the second experiment generates twice as many reads
A read with errors still has many ‘good’ k-mers
Only k-mers overlapping errors will be discarded or mis-counted
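A quick way to see this is to introduce a single substitution into a read and count how many k-mers are still identical to the error-free ones. The read sequence, error position and function names below are all invented for the example.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """All overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def good_kmer_count(true_read: str, observed_read: str, k: int) -> tuple[int, int]:
    """Return (number of unaffected k-mers, total number of k-mers)."""
    true_kmers = kmers(true_read, k)
    observed_kmers = kmers(observed_read, k)
    good = sum(t == o for t, o in zip(true_kmers, observed_kmers))
    return good, len(observed_kmers)


true_read = "ACGTTGCAAGGCTTAACCGT"                  # made-up 20 bp read
observed_read = true_read[:10] + "T" + true_read[11:]  # G -> T error at position 10

print(good_kmer_count(true_read, observed_read, k=5))  # (11, 16)
```

Only the five 5-mers that cover the erroneous base are affected; the other eleven still match exactly.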