RNA-seq Analysis
1. RNA-seq analysis
Mikael Huss
Bioinformatics scientist at WABI (Wallenberg Advanced Infrastructure for Bioinformatics), Science for Life Laboratory / DBB, Stockholm University
February 13, 2013
2. Omics, biology and diseases
[Figure labels: Genomics + RNA profiles + Protein “parts list” + Protein profiles + Interactomics; Systems biology; Pathways, molecular targets, diagnostics]
3. Approximate contents of talk
- Gene expression analysis in general; differences between RNA-seq and microarrays
- Typical workflow(s) for RNA-seq analysis
- Normalization issues
- Visualization
- Differential expression analysis
I have tried to include many references so you can go back to these slides for
reference afterwards
4. How DNA gets transcribed to RNA (and then translated to proteins) varies between e.g.
- Tissues
- Cell types
- Cell states
- Individuals
5. What can gene expression tell us?
Basic research
- How do gene expression patterns determine cellular identity? (tissues, cell types …)
- How does gene expression control early development in an embryo?
- What kinds of genes are expressed in response to specific stimuli (infections, smoking, environmental pollution, gym exercise …)?
- What kinds of genes do bacteria or other microorganisms express in the human gut / in soil / in oceans under different conditions?
… and much, much more …
6. What can gene expression tell us?
Diseases
- Which genes are over- (or under-)expressed in patients vs. healthy controls?
- Which genes are correlated to disease progression?
- Can markers of hidden disease be found by sequencing blood plasma?
7. Gene expression signatures for disease?
Hypothesis: Cell types are stable states in a “space” of gene expression patterns. Diseases (e.g. cancers) distort the gene expression so that the cell ends up in the wrong stable state.
Furusawa and Kaneko, Biology Direct 2009 4:17
8. Can the research community find such patterns?
On-line prediction competitions, objectively scored by the organizers
Diagnosing MS (multiple sclerosis), lung cancer, psoriasis, COPD (KOL)
Prognosticating breast cancer outcome
9. Human tissue RNA-seq data sets
Genotype-Tissue Expression project
http://commonfund.nih.gov/GTEx/
Illumina Human Body Map
accessed via ReCount database, bowtie-bio.sourceforge.net/recount/
Wang 2008 data set of ~15 human tissues
accessed via ReCount
RNA-seq Atlas
http://medicalgenomics.org/rna_seq_atlas
Human Protein Atlas
http://www.proteinatlas.org (tissue RNA-seq data not yet publicly released)
10. Tools for genome-scale gene expression measurements
Microarrays (ca. 1995)
- Sometimes called “gene chips”
- Based on hybridization
RNA sequencing (ca. 2008 in current form)
- Based on sampling
12. Alternative: rRNA depletion
There are various kits for depleting rRNA instead
Pluses:
- Can use for microorganisms that don’t have poly-A tails
- Thus, can use for simultaneous host/pathogen expression profiling
- Can find non-coding RNA
Minuses:
- Usually leaves in quite a lot of rRNA
- In practice, often variable efficiency between samples -> hard to compare results
13. Sequencing platforms
Platform (generation)                                                                    Length/read   Reads/run   Bases/run   Speed
ABI 3730xl, Sanger sequencing (“old school”)                                             800 bp        96          60 kbp      10 years/HG
454 Life Sciences, pyrosequencing (“2nd gen”)                                            400 bp        1 million   400 Mbp     1 month/HG
SOLiD + Illumina (“2nd gen”)                                                             100 bp        2 billion   500 Gbp     1 day/HG
Pacific Biosciences, Oxford Nanopore etc., single-molecule sequencing (“3rd gen”)        20 000+ bp    5 million   100 Gbp     10 min/HG
14. Microarray: Hybridization
[Figure source: Wikipedia]
The design of the microarray determines what you can detect in a sample
15. RNA sequencing: Sampling
It is possible to detect transcripts that are not known a priori (in advance)
16. RNA-seq advantages
The non-dependence on a reference makes possible:
- Meta-transcriptomics
- Detecting novel splice variants
- Detecting novel transcripts
- Fusion transcripts
- Non-coding transcripts
20. What does one do with RNA-seq reads?
- Mapping (also called alignment)
- (de novo) Assembly
21. Mapping (alignment) vs. assembly
Imagine a book being ripped to pieces with word or sentence fragments ending up on each piece of paper.
If you have a copy of the book that you can compare the pieces to, you have a mapping (alignment) problem.
If you have no copy of the book, you have a de novo assembly problem.
22. Mapping to a reference genome
Reads from the sequencer (the figure marks a sequencing error and a genetic variation in the reads):
CAATCAGA G TCCCACTGTGG
AGACG TCCCACTGTGGGGTG
GTGAAGTGTCCGTAGATGTGTG
GCAAATGCAATCAGACG TCCC
[Figure: gene (or transcript) sequence that the reads are mapped to]
23. Mapping to a reference genome
AGACG TCCCACTGTGGGGTG
GTGAAGTGTCCGTAGATGTGTG
GCAAATGCAATCAGACG TCCC
24. Mapping to a reference genome
GTGAAGTGTCCGTAGATGTGTG
GCAAATGCAATCAGACG TCCC
25. Mapping to a reference genome
GCAAATGCAATCAGACG TCCC
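The mapping idea in the slides above can be illustrated with a brute-force sketch: slide each read along the reference and keep the placement with the fewest mismatches. The reference and reads here are hypothetical toy data (not the sequences from the figure), the function names are made up for illustration, and real aligners use indexed data structures rather than scanning every position.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best ungapped placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm <= max_mismatches and (best is None or mm < best[1]):
            best = (pos, mm)
    return best


# Hypothetical toy data, chosen only for illustration.
reference = "GCAAATGCAATCAGACGTCCCACTGTGGGGTG"
reads = [
    "CAATCAGACGTCCC",  # matches the reference exactly
    "CAATCAGAGGTCCC",  # one mismatch (sequencing error or genetic variation)
    "TTTTTTTTTTTTTT",  # does not map anywhere
]
for read in reads:
    print(read, map_read(read, reference))
```

Allowing a small number of mismatches is what lets reads with sequencing errors or genetic variation still find their place on the reference.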
27. Mapping to the genome vs. the transcriptome
Vs. the genome:
- Can (in principle) detect new transcripts, splice variants
- Less sensitive, need a lot of coverage to discover new things
- Need a “splice-aware” aligner such as TopHat, MapSplice, RUM etc.
Vs. the transcriptome:
- Not unbiased anymore, tied to existing annotation
- Faster, more sensitive, need less coverage
The best of both worlds?
- Tools like TopHat (v1.4 and up) now do both
28. If it had been de novo assembly
CAATCAGA G TCCCACTGTGG
AGACG TCCCACTGTGGGGTG
GTGAAGTGTCCGTAGATGTGTG
GCAAATGCAATCAGACG TCCC
Assembly:
CAATCAGA G TCCCACTGTGG
AGACG TCCCACTGTGGGGTG
GCAAATGCAATCAGACG TCCC
“Singleton”:
GTGAAGTGTCCGTAGATGTGTG
-> Consensus sequence(s)
29. Assembly of RNA-seq reads
Will not be discussed much further here.
Most popular de novo assemblers build de Bruijn graphs in which overlapping k-mers are connected to each other. The programs then try to find paths through the graph.
This typically needs a LOT of RAM. One can try to pre-process the reads using “digital normalization”.
Tools:
- Trinity
- Velvet/Oases
- CLC Bio (commercial)
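As a rough illustration of the de Bruijn graph idea (not how any of the tools above are actually implemented), the sketch below breaks reads into k-mers, links each k-mer's prefix to its suffix, and follows unambiguous edges to spell out a contig. The reads and the function names are hypothetical; real assemblers also have to deal with sequencing errors, branches and coverage.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds an edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Follow unambiguous edges from `start`, spelling out a contig."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop if the path loops back on itself
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads, for illustration only.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn(reads, k=4)
print(walk(graph, "ATG"))
```

The walk stops as soon as a node has zero or several successors, which is where a real assembler has to decide between alternative paths (for example, splice variants).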
30. Assembly of RNA-seq reads
Typical workflow could be:
- Clean the reads properly (remove adapters, low-quality reads)
- Useful tools: FastQC, PRINSEQ, FASTX toolkit etc.
- Run assembly tool of choice, resulting in a set of contigs
- BLAST the contigs against the nt database, check the % overlap with transcripts in related organisms
- Map your original reads back to the contigs and count the reads overlapping each contig (<- comparison of assembly & mapping)
31. Quantifying expression with RNA-seq
Microarrays give a continuous (floating-point) expression value for each gene
RNA-seq gives an integer value for each gene (“digital expression”): read counts
32. Example (SciLifeLab) mapping workflow
FASTQ file(s)
-> TopHat 2.0
-> BAM file
-> Picard tools (SortSam, MarkDuplicates)
-> Sorted BAM file with duplicate reads removed
   -> HTSeq 0.5 -> Gene-level count files (for DE analysis)
   -> Cufflinks 2.0 -> Gene- and isoform-level expression estimates (FPKM, for reporting)
34. (what it would look like mapped to the genome)
[Figure: reads mapped across Exon 1, Exon 2 and Exon 3]
Need a special mapping algorithm which allows large gaps, a “split-read aligner”
35. (what we would actually observe – of course we don’t know which reads come from which isoform)
Statistical algorithms are needed to estimate what proportion of reads comes from which isoform. (For example, maximum likelihood / expectation maximization)
36. (Table: F = free, C = commercial, D = description only)
Name                     F/C/D   Type of approach
Xing et al. 2006         D       Maximum likelihood
Partek                   C       Maximum likelihood
Li et al. 2010           D       Maximum likelihood
Avadis                   C       Maximum likelihood
IsoEM                    F       Maximum likelihood
MISO                     F       Maximum likelihood (MCMC)
Cufflinks                F       Maximum likelihood
rQuant                   F       Least squares (quadratic programming)
Rpkmforgenes.py          F       Least squares
Howard and Heber 2010    D       Least squares
FluxCapacitor            F       Linear programming
CLC Bio                  C       ?
NSMAP                    F       Nonnegative Sparse Maximum A Posteriori
ALEXA-SEQ                F       Use only reads that are compatible with a single isoform
NEUMA                    D       Normalization by Expected Uniquely Mappable Area
37. Some remarks on isoform quantification
- It is necessary for correct gene-level quantification as well, because straight read-counting methods can never be fully correct (from the 2012 CuffDiff2 paper)
- Xing et al. (2006) gave the basic idea for EM-based isoform quantification, to which other programs (Cufflinks, MISO, IsoEM, …) have added various “bells and whistles”
- It is actually pretty hard to do isoform quantification well, because there can be a lot of possible isoforms and not enough sequence coverage to estimate them
38. Basic idea of the EM approach
We have a set of reads mapping to some locus
- Some fit one specific isoform
- Some fit several isoforms
If we knew the isoforms’ expression levels, we could distribute the reads proportionally
to those. But we don’t!
On the other hand, if we knew the probability of each read to match each isoform, we
could estimate the isoforms’ expression pretty well. But we don’t know that either.
So … start with a guess and iterate!
- Assign reads to isoforms according to some initial guess
- Re-estimate isoform expression levels
- Repeat until convergence!
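The guess-and-iterate loop above can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm of Cufflinks, MISO or IsoEM: it assumes all isoforms have the same effective length and that each read is either compatible with an isoform or not. The function name and the toy read set are hypothetical.

```python
def em_isoforms(compat, n_iter=200):
    """Estimate isoform proportions from read/isoform compatibility.

    compat[r] is the set of isoform indices that read r is compatible with.
    Assumes equal effective lengths for all isoforms (a simplification).
    """
    n_iso = 1 + max(i for c in compat for i in c)
    theta = [1.0 / n_iso] * n_iso  # initial guess: equal expression
    for _ in range(n_iter):
        counts = [0.0] * n_iso
        # Assign each read fractionally to its compatible isoforms,
        # in proportion to the current expression estimates.
        for c in compat:
            total = sum(theta[i] for i in c)
            for i in c:
                counts[i] += theta[i] / total
        # Re-estimate expression levels from the fractional assignments.
        theta = [x / len(compat) for x in counts]
    return theta


# Hypothetical locus: 30 reads fit only isoform 0, 10 fit only isoform 1,
# and 60 fit both isoforms.
reads = [{0}] * 30 + [{1}] * 10 + [{0, 1}] * 60
print(em_isoforms(reads))
```

In this toy case the uniquely assignable reads (30 vs. 10) determine how the 60 ambiguous reads get shared out, so the estimates settle at a 3:1 ratio.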
39. Gene fusion detection with RNA-seq
Beyond isoforms: Detect pieces of different genes that have been fused
Look for reads that map in “wrong” ways
Wang et al. Briefings in Bioinformatics doi:10.1093/bib/bbs044
40. Some further comments on microarrays and RNA-seq
- Microarrays are still cheaper and faster.
- You may be able to run more replicates, which is important for statistical power.
- RNA-seq has a wider measurement range.
- Low expressed transcripts:
  - Microarrays have high background signal -> poor measurement
  - RNA-seq can measure well if you sequence very deeply
- Medium expressed transcripts:
  - Microarrays measure well
  - RNA-seq measures well if sequenced relatively deeply
- High expressed transcripts:
  - Microarrays measure poorly because of saturation
  - RNA-seq measures well
- Less is understood about how to pre-process and normalize RNA-seq data.
- One interesting aspect of RNA-seq: You can continue to sequence a sample more to obtain better gene expression estimates.
41. Analysis
- Pre-processing and normalization
- Visualization
- Differential gene expression analysis
- (Gene set analysis, pathway analysis, gene expression signatures … -> try to find the biological significance)
45. Pre-processing
Why do we do pre-processing and normalization of RNA-seq (or microarray) data?
- To correct for batch effects
  - Different labs
  - Different preparation times
  - Etc.
- To correct for intrinsic technical biases in the technologies
- To make the expression value distributions conform to some assumptions in order to perform statistical tests
46. RNA-seq pre-processing
For RNA-seq data, it is still less understood than for microarrays how one should pre-process and normalize the data.
Let’s look at some aspects (that sometimes apply to both RNA-seq and microarray data)
47. R and Bioconductor
Very helpful for (e.g.) microarray and RNA-seq differential expression analysis
Microarray:
- affy, lumi (read raw microarray signal files & preprocess)
- limma (differential expression analysis with complex designs)
RNA-seq:
- DESeq, edgeR, baySeq (differential expression analysis based on count data)
- SAMSeq (nonparametric differential expression analysis)
48. Variance stabilization
Raw data (could be microarray signal or RNA-seq counts): higher value -> higher variability (noise)
Log transform: lower value -> higher variability. Too aggressive
Variance stabilizing transform: e.g. voom() in limma package
http://bridgecrest.blogspot.se/2011_09_01_archive.html
49. Quantifying expression with RNA-seq
If you want to compare RNA-seq counts between different genes and/or samples, consider:
- Longer genes/transcripts are expected to generate more reads
- The more you sequence, the more reads you get from each gene
Therefore, the standard measure has been RPKM (reads per kilobase per million mapped reads), which corrects for transcript length and sequencing depth:
\[
\mathrm{RPKM} = \frac{X_t \,/\, \left( l_{\mathrm{eff},t} / 10^{3} \right)}{N_{\mathrm{lib}} / 10^{6}} = \frac{10^{9} \cdot X_t}{N_{\mathrm{lib}} \cdot l_{\mathrm{eff},t}}
\]
(X_t: number of reads mapped to transcript/gene/… t
N_lib: number of mapped reads in library
l_eff,t: effective length of transcript/gene/… t)
FPKM is a paired-end version of this
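As a small worked example of the RPKM definition (reads per kilobase of effective length per million mapped reads), here is a sketch in Python. The function name and the numbers are hypothetical and only meant to show the arithmetic.

```python
def rpkm(read_count, effective_length_bp, mapped_reads_in_library):
    """RPKM = 10^9 * X_t / (N_lib * l_eff,t)."""
    return 1e9 * read_count / (mapped_reads_in_library * effective_length_bp)


# Hypothetical numbers: 1000 reads on a transcript with an effective length
# of 2000 bp, in a library with 20 million mapped reads.
print(rpkm(1000, 2000, 20_000_000))  # 25.0
```

Equivalently: 1000 reads over 2 kb is 500 reads per kilobase, and dividing by 20 (million mapped reads) gives 25.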
50. Alternatives
TPM – “transcripts per million”
A slightly modified RPKM measure that
accounts for differences in gene length
distribution in the transcript population
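A minimal sketch of how TPM is commonly computed: divide each count by the transcript length, then rescale so that the values in a sample sum to one million. The function name and the toy numbers are assumptions for illustration.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from read counts and transcript lengths."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]  # reads per base
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Hypothetical sample with three transcripts.
values = tpm(counts=[100, 200, 300], lengths_bp=[1000, 1000, 2000])
print(values, sum(values))
```

Because the values always sum to one million within a sample, TPM can be read as a proportion of the transcript population in that sample.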
51. Alternatives
TMM – “trimmed mean of M values”
Attempts to correct for differences in RNA composition between samples
E.g. if certain genes are very highly expressed in one tissue but not another, there will be less “sequencing real estate” left for the less expressed genes in that tissue, and RPKM normalization (or similar) will give biased expression values for them compared to the other sample
[Figure: RNA population 1 vs. RNA population 2]
Equal sequencing depth -> orange and red will get lower RPKM in RNA population 1, although the expression levels are actually the same in populations 1 and 2
Robinson and Oshlack Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
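The composition effect described above can be reproduced with a toy calculation. This sketch only demonstrates the bias that TMM is designed to correct; it does not implement TMM itself. The gene names, abundances and function names are hypothetical.

```python
def expected_counts(abundance, depth):
    """Split `depth` reads among genes in proportion to their RNA abundance."""
    total = sum(abundance.values())
    return {gene: depth * a / total for gene, a in abundance.items()}


def reads_per_million(counts):
    """Depth-normalized counts (equal to RPKM for genes of length 1 kb)."""
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}


# Hypothetical abundances: genes A and B are identical in both populations,
# but population 1 also contains a very highly expressed gene C.
population_1 = {"A": 1, "B": 1, "C": 8}
population_2 = {"A": 1, "B": 1, "C": 0}
depth = 1_000_000  # equal sequencing depth in both samples

rpm_1 = reads_per_million(expected_counts(population_1, depth))
rpm_2 = reads_per_million(expected_counts(population_2, depth))
print(rpm_1["A"], rpm_2["A"])
```

Gene A gets a five times lower value in population 1 even though its true abundance is unchanged, because gene C uses up most of the sequencing "real estate".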
55. Practical issues with normalization methods
Limma / voom can give negative values
TMM cannot be done on a single sample
56. RNA-seq pre-processing
In RNA-seq, normalization of counts is often interwoven with differential expression analysis and done implicitly in DE packages such as DESeq, edgeR etc.
Normalized values like RPKM are usually only used for reporting expression values, not testing for differential expression.
Why?
57. Count nature of RNA-seq data
These methods want to use the added statistical power provided by the count nature of RNA-seq data.
Simplified toy example:
Scenario 1: A 30000-bp transcript has 1000 counts in sample A and 700 counts in sample B.
Scenario 2: A 300-bp transcript has 10 counts in sample A and 7 counts in sample B.
Assume that the sequencing depths are the same in both samples and both scenarios. Then the RPKM in sample A is the same in both scenarios, and likewise for sample B.
In scenario 1, we can be more confident that there is a true difference in the expression level than in scenario 2 (although we would want more replicates of course!), by analogy to a coin flip: 700 heads out of 1000 trials gives much more confidence that a coin is biased than 7 heads out of 10 trials
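The coin-flip analogy can be checked numerically. The sketch below computes the exact probability of seeing at least k heads in n flips of a fair coin; it is only an illustration of the analogy, not the test that DESeq or edgeR use, and the function name is hypothetical.

```python
from math import comb


def upper_tail(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5), computed exactly with integers."""
    favourable = sum(comb(n, i) for i in range(k, n + 1))
    return favourable / 2 ** n


print(upper_tail(7, 10))      # 7 or more heads out of 10
print(upper_tail(700, 1000))  # 700 or more heads out of 1000
```

Seven or more heads out of ten happens about 17% of the time with a fair coin, whereas 700 or more out of 1000 is astronomically unlikely, even though the proportion of heads is the same.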
58. Visualization
Can be useful for “sanity checking”, outlier detection and exploratory analysis in general
Examples of useful visualizations
- Heat maps
- PCA/MDS/NMF
- Box plots, violin plots etc.
59. Box plots
Useful for comparing groups
Adding the actual data points is optional but can be interesting
60. Sample correlation heat maps
Heat maps are ubiquitous in transcriptomics
Correlations between samples, hierarchical clustering
Used for “sanity checks”, outlier detection
[Figure panels: two tissues; batch effects]
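The numbers behind such a heat map are just pairwise correlations between samples. A minimal sketch, assuming a handful of hypothetical log-expression values per sample (the sample names and values are made up, and Pearson correlation is one possible choice of measure):

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(samples):
    """All pairwise sample correlations as a nested dictionary."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


# Hypothetical log-expression values for four genes in three samples.
samples = {
    "liver_1": [8.0, 2.0, 5.0, 1.0],
    "liver_2": [7.5, 2.5, 5.2, 1.1],
    "brain_1": [1.0, 7.0, 5.0, 8.0],
}
corr = correlation_matrix(samples)
for name, row in corr.items():
    print(name, [round(v, 2) for v in row.values()])
```

Samples from the same tissue should correlate more strongly with each other than with other tissues; a sample that does not is a candidate outlier or a sign of a batch effect.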
61. Gene / sample heat maps
With a smaller collection of genes, one sometimes looks at gene/sample heat maps
63. PCA plots
Nice thing with PCA: you can also see how much each gene contributes to each
principal component -> a kind of feature selection
64. Alternatives to PCA
NMF: non-negative matrix factorization. Also a matrix decomposition technique (like PCA)
“A bioinformatic assay for pluripotency in human cells”, Nature Methods, doi:10.1038/nmeth.1580
65. PCA plot of human tissue RNA-seq
Red – GTEx
Green – Body Map
Black – Human Protein Atlas
66. # of genes taking up X% of sequences
[Figure: GTEx RPKM; labeled genes: HBA1, HBB, HBA2]
68. # of genes taking up X% of sequences
[Figure: Wang/Sandberg]
69. Differential expression analysis
Many tools available!
Easily the most common type of analysis, even though it is understood that
gene expression levels are not independent of each other, and should in
principle be considered together.
However, since the number of samples is typically << the number of
measured genes, a full model is usually not feasible to construct in practice.
Some sort of feature selection is needed.
73. Differential expression analysis
One would simply like to do a t-test or something like that for each gene, but …
- Assumes normal distribution & no mean-variance dependence
- Hard to estimate variance from few samples
- Multiple testing issue
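To make the multiple testing issue concrete, here is a sketch of the Benjamini-Hochberg procedure, one widely used way of turning per-gene p-values into false discovery rate adjusted values. It is shown as an illustration; the slides do not state which correction each package applies, and the p-values below are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

With thousands of genes tested, a raw p-value threshold of 0.05 would be expected to flag many genes by chance alone, which is why the adjusted values are the ones to threshold on.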
74. Parametric vs. non-parametric methods
It would be nice to not have to assume anything about the expression value distributions but only use rank-order statistics -> methods like SAM (Significance Analysis of Microarrays) or SAMSeq (equivalent for RNA-seq data)
However, it is (typically) harder to show statistical significance with non-parametric methods when there are few replicates.
My rule of thumb:
- Many replicates (more than ~10) in each group -> use SAM(Seq)
- Otherwise use DESeq or another parametric method
Note that Simon Anders (creator of DESeq) says that non-parametric methods are definitely better with 12 replicates and maybe already at five
http://seqanswers.com/forums/showpost.php?p=74264&postcount=3
78. Standard DE methods
Limma (microarrays, RNA-seq)
edgeR, DESeq (RNA-seq)
Distributional issue: Solved by variance stabilizing transform in limma. edgeR and DESeq model the count data using a negative binomial distribution and use their own modified statistical tests based on that.
Multiple testing issue: All of these packages report false discovery rate (corrected p values).
Variance estimation issue: These packages (in slightly different ways) “borrow” information across genes to get a better variance estimate. One says that the estimates “shrink” from gene-specific estimates towards a common mean value.
81. Complex designs
The simplest case is when you just want to compare two groups against each other.
But what if you have several factors that you want to control for?
E.g. you have taken tumor samples at two different time points from six patients,
cultured the samples and treated them with two different anticancer drugs and a mock
control treatment. -> 2x6x3 = 36 samples.
Now you want to assess the differential expression in response to one of the
anticancer drugs, drug X. You could just compare all “drug X” samples to all control
samples but the inter-subject variability might be larger than the specific drug effect.
Enter limma / DESeq / edgeR which can work with factorial designs
(SAMSeq cannot, which is another reason one might not want to use it)
82. Limma and factorial designs
limma stands for “linear models for microarray data”
Essentially, the expression of each gene is modeled with a linear relation
http://www.math.ku.dk/~richard/courses/bioconductor2009/handout/19_08_Wednesday/KU-August2009-LIMMA/PPT-PDF/Robinson-limma-linear-models-ku-2009.6up.pdf
The design matrix describes all the conditions, e.g. treatment, patient, time etc.
y = a + b*treatment + c*time + d*patient + e
(a: baseline/average; e: error term/noise)
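To show what a design matrix for the 2x6x3 example might look like, here is a sketch that builds one with indicator (dummy) columns. The level names ("P1", "t1", "mock", "drugX", "drugY") are hypothetical labels; in practice one would let R build this with model.matrix() rather than by hand.

```python
from itertools import product

# Hypothetical level names for the 6 patients x 2 time points x 3 treatments.
patients = [f"P{i}" for i in range(1, 7)]
times = ["t1", "t2"]
treatments = ["mock", "drugX", "drugY"]

samples = list(product(patients, times, treatments))  # 36 samples


def dummy(value, levels):
    """Indicator columns for every level except the first (the baseline)."""
    return [1 if value == level else 0 for level in levels[1:]]


columns = (
    ["intercept"]
    + [f"treatment_{t}" for t in treatments[1:]]
    + [f"time_{t}" for t in times[1:]]
    + [f"patient_{p}" for p in patients[1:]]
)

# One row per sample: intercept, then treatment, time and patient indicators.
design = [
    [1] + dummy(trt, treatments) + dummy(t, times) + dummy(p, patients)
    for p, t, trt in samples
]

print(columns)
print(samples[0], design[0])
print(len(design), "samples x", len(columns), "coefficients")
```

The coefficient on the treatment_drugX column then estimates the drug X effect relative to mock treatment, with patient and time differences absorbed by their own columns.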
84. Take-away messages from DE tool comparison
- CuffDiff2, which should theoretically be better, seems to work worse, probably
due to the increased “statistical burden” from isoform expression estimation
- The HTSeq quantification which is theoretically “wrong” seems to give good
results with downstream software
- It is practically always better to sequence more biological replicates than to
sequence the same samples deeper
Omitted from this comparison
- gains from ability to do complex designs
- non-parametric methods
85. The end
Contact me at mikael.huss@scilifelab.se if you have any questions