RNA-seq: general concept, goal and experimental design - part 1
1. Defining the goal of RNA-seq analysis for differential expression
Joachim Jacob
20 and 27 January 2014
This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts hereof.
2. Great power comes with great responsibility
RNA-seq enables one to
1) get an idea of which genes are active
2) quantify expression of each transcript
3) quantify alternative splicing
… (use your imagination)
Principles of transcriptome analysis and gene expression quantification: an RNA-seq
tutorial. http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12109/abstract
3. Great power comes with great responsibility
You can't do it all.
RNA-seq is powerful, but we have to aim for a specific goal.
Our goal is to detect
differential expression
on the gene level.
8. The central dogma
Level of protein production
causes the phenotypic differences
[Figure labels: GDA; yeast; yeast mutant; GDA + vit C; ?]
9. The central dogma
Level of templates for protein production
causes the phenotypic differences
[Figure labels: GDA; yeast; yeast mutant; GDA + vit C; ?]
10. The central dogma
Level of mRNA copies
causes the phenotypic differences
[Figure labels: GDA; yeast; yeast mutant; GDA + vit C; ?]
11. Does it hold?
Level of mRNA copies
Level of templates for protein production
Level of protein production
Presence/concentration of proteins in a cell
Difference in protein activity
Phenotype
12. Problem reduction
We can measure mRNA levels (much easier
than protein levels).
So we measure mRNA.
The level of mRNA is a proxy of the level of
protein activity causing the aberrant
phenotype.
13. How to measure mRNA
1. Q-PCR (real-time)
A lot of work to measure few
genes, in a relatively wide array
of tissues. Very accurate.
2. Microarray
Easier way to measure many
predefined genes in a relatively
wide array of tissues. Robust.
3. RNA-seq
14. RNA-seq protocol in a nutshell
● Get your sample (figure: yeast sample)
● Lyse the cells and extract RNA
● Convert the RNA to cDNA
● The cDNA pool gets sequenced
The result is sequence information from
scratch. No prior information is needed.
Comprehensive comparative analysis of strand-specific RNA sequencing methods
http://www.nature.com/nmeth/journal/v7/n9/full/nmeth.1491.html
Comparative analysis of RNA sequencing methods for degraded or low-input samples
http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html
15. The predecessors of RNA-seq
● ESTs: expressed sequence tags, ideal for discovery of new genes.
● SAGE: serial analysis of gene expression, measurement of the number of copies of mRNA.
http://www.montana.edu/observatory/people/mcdermottlab.html
16. The predecessors of RNA-seq
● ESTs: expressed sequence tags, ideal for discovery of new genes.
● SAGE: serial analysis of gene expression, measurement of the number of copies of mRNA.
http://www.sagenet.org/findings/index.html
17. The predecessors of RNA-seq
● ESTs: expressed sequence tags
● SAGE: serial analysis of gene expression
Low throughput: long sequence information, but for only ~thousands of genes.
18. Concept of measuring with RNA-seq
One template of protein production
GeneA GeneB GeneC
Extract mRNA and turn it into cDNA.
Fragment, ligate adaptor, amplify.
Put a fraction of the pool on the sequencer to read fragments.
Figure: All things must pass: contrasts and commonalities in eukaryotic and bacterial mRNA decay, Nature Reviews Molecular Cell Biology 11, 467–478
20. So many steps at which our assumption can fail
Phenotype
Define the phenotype
Proteins
Are a proxy for protein activity
mRNA levels
Represent the RNA pool we've extracted
cDNA pool
Represent the cDNA pool we've created
RNA-seq reads
21. So many steps at which our assumption can fail
Phenotype
Protein activity is regulated: phosphorylation, ubiquitination,...
Proteins
mRNA templates have different speeds of protein production: availability of tRNAs, rate of mRNA degradation, alternative splicing events,...
mRNA levels
Loss on RNA extraction, 90% of RNA in the cell is rRNA, ligation of adapters, conversion to cDNA not 100%
cDNA pool
Failure to map reads to the correct gene, lane-specific biases on reading cDNA fragments,...
RNA-seq reads
22. Consequence: focus on comparison
Phenotype A      Phenotype B      (possibly due to differences in expression)
Proteins         Proteins
mRNA levels      mRNA levels
cDNA pool        cDNA pool
RNA-seq reads    RNA-seq reads
23. Consequence: focus on comparison
Phenotype A      Phenotype B
Proteins         Proteins
mRNA levels      mRNA levels
cDNA pool        cDNA pool
RNA-seq reads    RNA-seq reads
DESIGN OF EXPERIMENT
24. Comparing number of reads to genes
Figure: sample → RNA-seq → reads for GeneA, GeneB, GeneC
Obviously, the number of reads is dependent on:
1. the expression level of the gene (OUR QUESTION)
2. the total number of reads generated
3. the length of the transcript
Normalisation is needed!
25. Experimental design
Our focus: which genes are differentially expressed
between different conditions?
Obviously, the number of reads is dependent on:
1. the expression level of the gene
2. the total number of reads generated
3. the length of the transcript
How many reads to sequence?
Which normalisation is needed?
26. Experimental design
Our focus: which genes are differentially expressed
between different conditions?
“How can we detect genes for which the counts of
reads change between conditions more
systematically than as expected by chance”
We must design an experiment in which we can test
this deviance from chance.
Oshlack et al. 2010. From RNA-seq reads to differential expression results. Genome Biology 2010,
11:220 http://genomebiology.com/2010/11/12/220
27. How many reads to sequence?
In other words: how deep to sequence? What is the
required 'depth of sequencing'?
Figure: two samples, each sequenced with RNA-seq, giving reads for GeneA, GeneB and GeneC.
The final test will look at ratios:
           GeneA   GeneB   GeneC
sample 1   6       5       3
sample 2   5       6       4
ratio      1.2     0.83    0.75
28. How many reads to sequence?
The difference between the lowest gene count and
the highest gene count is typically 10^5. This is called
the dynamic range.
Linear scale is useless.
The logarithmic scale is better.
Wait! Something's not correct here!
29. Zero remains zero!
We are working with counts. A nonzero count is at least 1. A gene with zero counts has either not been sequenced yet (sequencing not deep enough) or is not expressed in that condition.
It is not a full logarithmic scale: it starts at zero.
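A minimal Python sketch of why zero needs separate treatment on a logarithmic scale. The helper name `log2_count` and the example counts are illustrative assumptions, not part of the original slides; zero counts are simply kept apart instead of being transformed.

```python
import math

def log2_count(count):
    """Return log2 of a read count, or None for a zero count.

    log2(0) is undefined, so a zero count has no position on a log axis.
    """
    if count < 0:
        raise ValueError("a read count cannot be negative")
    if count == 0:
        return None
    return math.log2(count)

# Hypothetical counts spanning a dynamic range of about 10^5.
for count in (0, 1, 10, 1000, 100000):
    print(count, log2_count(count))
```

A count of 1 maps to 0 on the log2 scale, so every observed (nonzero) count lands at or above zero, while a true zero stays off the scale.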
30. So keep all counts above zero?
Assuming equal sequencing depth in the samples,
and these counts. Do all these genes differ in
expression?
Gene     sample 1   sample 2   ratio
GeneA    5          10         2
GeneB    15         30         2
GeneC    40         80         2
GeneD    100        200        2
GeneE    1000       2000       2
GeneZ    1          2          2
31. So keep everything above zero?
Sequencing the result of the same steps again is
called a technical replicate.
Is there a trend in how these numbers change?
Gene     sample 1   sample 2   ratio
GeneA    11         10         0.91
GeneB    11         30         2.73
GeneC    60         80         1.33
GeneD    79         200        2.53
GeneE    1150       2000       1.74
GeneZ    5          1          0.20
2?
32. Technical replicates
We take the same cDNA pool and sequence it several
times: technical replicates.
Gene     sample 1   sample 2   sample 3   sample 4
GeneA    11         5          4          4
GeneB    11         16         14         8
GeneC    60         45         32         38
GeneD    79         102        95         110
GeneE    1150       1023       987        1005
GeneZ    3          0          0          1
33. The Poisson distribution
The counts of technical replicates follow a Poisson distribution (Marioni et al. 2008). The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare.
Figure (from Wikipedia): three Poisson distributions. These can be 3 different genes, each with its own Poisson distribution. Lambda is the mean of the gene's distribution, expressed as a number of reads. Y-axis: the chance to pick that number of reads.
34. The Poisson distribution
So when we have 4 technical replicates sequenced to a great depth (say 10 M reads), we can get, by chance, these numbers for 3 different genes.
GeneA 0, 0, 1, 3
GeneB 2, 3, 4, 7
GeneC 8, 9, 11, 14
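To make this concrete, here is a small Python sketch that asks how likely each of the replicate counts above is under a Poisson model. Using the mean of the four replicates as the estimate of lambda is an assumption made for illustration; the function name `poisson_pmf` is likewise just a label chosen here.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k reads when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Technical replicate counts taken from the slide above.
replicates = {
    "GeneA": [0, 0, 1, 3],
    "GeneB": [2, 3, 4, 7],
    "GeneC": [8, 9, 11, 14],
}

for gene, counts in replicates.items():
    lam = sum(counts) / len(counts)  # replicate mean used as estimate of lambda
    probs = ", ".join(f"P({c})={poisson_pmf(c, lam):.3f}" for c in counts)
    print(f"{gene}: lambda={lam:.1f}  {probs}")
```

None of the listed counts is an unlikely draw for its gene, which is the point of the slide: spread of this size is expected from sampling alone.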
35. Working the intuition
How many blue balls?
How many red balls?
Draw 10
Draw 10 more
Draw 10 more
Estimate how large the fraction is in the set?
36. The intuition with the balls
Color      10 draws   20 draws   30 draws   40 draws
Blue
Red
No color
37. Conclusion of the experiment
The bigger the fraction in the pool, the quicker (i.e. with less sequencing depth) we are certain about the estimate of that fraction.
estimate = count; variance = count
For lower counts, the variance is relatively bigger than the variance for higher counts.
CV (coefficient of variation) = sqrt(count)/count
Genes with lower expression need much deeper sequencing than genes with higher expression levels.
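The CV formula can be tabulated in a few lines of Python. This is only a sketch of the arithmetic; the function name `poisson_cv` and the chosen counts are illustrative.

```python
import math

def poisson_cv(count):
    """Coefficient of variation of a Poisson count: sqrt(count) / count."""
    if count <= 0:
        raise ValueError("count must be positive")
    return math.sqrt(count) / count

for count in (1, 10, 100, 1000, 10000):
    print(f"count={count:>5}  CV={poisson_cv(count):.3f}")
```

Because the CV falls with the square root of the count, halving the relative uncertainty of a gene requires roughly four times as many reads for that gene.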
38. Comparing counts
“Here we show the overlap of Poisson
distributions of single measurements at
different read counts. Because relative
Poisson uncertainty is high at low read
counts, a count of 1 versus 2 has very
little power to discriminate a true 2X fold
change, though at higher counts a 2X fold
change becomes significant.
In an actual experiment, the width of the
distribution would be greater due to
additional biological and technical
uncertainty, but the uncertainty to the
mean expression would narrow with
each additional replicate.”
Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression.
Bioinformatics (2013) doi: 10.1093/bioinformatics/btt015
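The statement in the quote can be checked with an exact Poisson calculation. The Python sketch below is a deliberate simplification and not the method used by Scotty: it assumes a single measurement, treats the baseline mean as known, and uses a one-sided test at a hypothetical significance level of 0.05. All function names are made up for this example.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))

def power_two_fold(base_mean, alpha=0.05):
    """Return (critical count, power) for detecting a doubling of base_mean.

    The critical count is the smallest count whose tail probability under
    the baseline is at most alpha; power is the chance of reaching it when
    the true mean is twice the baseline.
    """
    critical = 0
    while poisson_tail(critical, base_mean) > alpha:
        critical += 1
    power = poisson_tail(critical, 2 * base_mean)
    return critical, power

for base in (1, 10, 100):
    critical, power = power_two_fold(base)
    print(f"mean {base:>3}: call 'up' at count >= {critical:>3}, "
          f"power for a true 2x change = {power:.2f}")
```

Under these assumptions a gene with a baseline mean of 1 read has only about a 14% chance of showing a true doubling, while at a baseline mean of 100 reads the doubling is detected almost always.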
39. Comparing technical replicates
Figure: mean and variance of the counts across technical replicates (axes: log2 of the counts), showing the relation between mean and variance according to Poisson and a lowess fit through the data.
Risso et al. “GC-Content Normalization for RNA-Seq Data”
BMC Bioinformatics 2011, 12:480
http://www.biomedcentral.com/1471-2105/12/480 - EDASeq package (R)
40. But Poisson does not seem to fit
Extending the samples to real biological samples, this mean-variance relationship does not hold...
Plotted using EDASeq
Package in R.
41. But Poisson does not seem to fit
Extending the samples to real biological samples, this mean-variance relationship does not hold!
Something is going on!
Reasonable fit
Plotted using EDASeq
Package in R.
42. An extra source of variation
Compared with the Poisson distribution, the counts are 'overdispersed': between biological replicates, the variance at higher counts is bigger than Poisson predicts.
Something is going on!
Plotted using EDASeq
Package in R.
43. An extra source of variation
For Poisson: CV = std dev / mean = √μ / μ, so CV² = 1/μ
If an additional distribution is involved (also dependent on π, the fraction of the gene in the cDNA pool), we have a mixture of distributions:
CV² = 1/μ + φ
Here 1/μ is the Poisson term (dominant at low counts!) and φ is the dispersion.
The generalization of Poisson with this extra parameter is the Negative Binomial.
This model fits better!
44. The negative binomial model
The NB model fits observed expression data of
RNA-seq better. It is a generalization of Poisson, and
2 parameters need to be estimated (μ and φ)
The count of gene g in sample j has:
Mean = μ_gj
Variance = μ_gj + φ_g μ_gj²
Biological CV² = φ_g, so Biological CV = √φ_g
Methods differ in how they estimate this dispersion per gene. It can only be measured with true biological replicates.
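The link between this variance and the CV² of the mixture follows in one line, using only the mean, variance and dispersion symbols defined on this slide (CV² is the variance divided by the squared mean):

```latex
\mathrm{CV}^2_{gj}
  = \frac{\mathrm{Var}(\mathrm{count}_{gj})}{\mu_{gj}^2}
  = \frac{\mu_{gj} + \varphi_g\,\mu_{gj}^2}{\mu_{gj}^2}
  = \frac{1}{\mu_{gj}} + \varphi_g
```

The first term is the Poisson part and shrinks as the mean count grows; the second term is the dispersion and does not depend on the mean.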
45. Variation summary, intuitively
Total CV² = Technical CV² + Biological CV²
For low counts, the Poisson (technical) variation or
the measurement error is dominant.
For higher counts, the Poisson variation gets smaller,
and another source of variation becomes dominant,
the dispersion or the biological variation. Biological
variation does not get smaller with higher counts.
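A short Python sketch of this decomposition, showing which term dominates at which count. The dispersion value of 0.04 (a biological CV of 20%) and the mean counts are hypothetical, and the function name is chosen only for this example.

```python
def cv2_components(mean_count, dispersion):
    """Split total CV^2 into technical (Poisson) and biological parts."""
    technical = 1.0 / mean_count
    biological = dispersion
    return technical, biological, technical + biological

phi = 0.04  # hypothetical dispersion, i.e. a biological CV of 20%
for mu in (5, 10, 100, 1000):
    tech, bio, total = cv2_components(mu, phi)
    dominant = "technical" if tech > bio else "biological"
    print(f"mean={mu:>4}  technical={tech:.4f}  biological={bio:.4f}  "
          f"total={total:.4f}  dominant: {dominant}")
```

The two terms are equal at a mean count of 1/φ (25 reads for this value of φ); below that the Poisson term dominates, above it the biological term does.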
46. Beyond the NB model
It appears from analysis of many
biological replicates (#=69) that not
every gene can be modeled as NB:
the Poisson-Tweedie model
provides a further generalisation
and a better fit for many genes
(with an additional shape
parameter).
Left figure: raw data shows that about 26% of
the genes fit a NB model. Depending on the
estimated shape parameter, other
distributions fit better.
Esnaola et al. BMC Bioinformatics 2013, 14:254
http://www.biomedcentral.com/1471-2105/14/254
47. Consequence for our design
● For low counts: the uncertainty is big due to Poisson.
● For high counts: the uncertainty is big due to biological variation. Highly expressed genes differ more in their natural variation (regulated by cellular processes) than lowly expressed genes.
● If we focus on the ratios between the conditions: is it reasonable to set a restriction on fold change? Highly expressed genes can have a smaller fold change and still be significant. Lowly expressed genes can exceed a fold change of 2 without being significant.
48. Consequence on fold change
The fold-change cut-off routinely applied in microarray analysis is of no use in RNA-seq.
Volcano plot
Blue and red: known DE genes
These often-applied cut-offs can prevent the detection of DE genes
49. Long story to say...
We need to estimate the model behind the count.
Never work without biological replicates.
Never work with 2 biological replicates.
Try avoiding working with 3 biological replicates.
Go for at least 4 biological replicates.
52. Summary
Obviously, the number of reads is dependent on:
1. chance
→ Define the count model (NB) from replicates
2. the expression level of the gene
→ Compare the ratios with a test
3. the total number of reads generated
4. the length of the transcript
53. The total number of reads generated
sample
RNA-seq
GeneA GeneB GeneC
sample
More RNA-seq
GeneA GeneB GeneC
The number of reads is dependent on the total
number of reads generated. If one library is
sequenced to 20M reads, and another one to
40M, most genes will ~double their counts.
54. Normalization for library size
Naive approach: divide by total library size. This is no longer applied!
Why not? Composition matters!
2 things to remember:
- zero sum system (or “we cannot count what we can't sequence”)
- 5 orders of magnitude
55. Normalization for library size
2 things to remember:
- zero sum system
- 5 orders of magnitude
In every sample, a lot of reads are spent on a few extremely highly expressed genes. Which genes these are differs between libraries, and including them negatively affects the naive size normalization.
56. Normalization for library size
Schematically: when normalized on library size (squares represent the number of reads).
Few genes with enormous counts: there is NO SATURATION of these counts
Rest of the genes
All counts for library A
Rest of the genes
All counts for library B
57. Normalization for library size
Better normalization would be as shown below. DESeq2 and edgeR apply such an approach (see later).
Figure labels: rest of the genes = 100%, in both libraries.
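The effect of library composition can be illustrated with a toy example in Python. The sketch compares naive total-count scaling with a median-of-ratios size factor, which is the idea behind the normalization in DESeq (edgeR uses a different, trimmed-mean approach). This is not the packages' actual code; the function names and the counts are invented for illustration.

```python
import math
from statistics import median

def total_count_factors(samples):
    """Naive size factors: each library total relative to the first library."""
    totals = [sum(s) for s in samples]
    return [t / totals[0] for t in totals]

def median_of_ratios_factors(samples):
    """Size factor per sample: median ratio of each gene to its geometric mean.

    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero.
    """
    n_genes = len(samples[0])
    log_geo_means = {}
    for g in range(n_genes):
        values = [s[g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means[g] = sum(math.log(v) for v in values) / len(values)
    factors = []
    for s in samples:
        log_ratios = [math.log(s[g]) - lgm for g, lgm in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical counts: four unchanged genes plus one gene that is
# extremely highly expressed in library B only.
library_a = [100, 200, 300, 400, 1000]
library_b = [100, 200, 300, 400, 9000]

print("total-count factors:     ", total_count_factors([library_a, library_b]))
print("median-of-ratios factors:", median_of_ratios_factors([library_a, library_b]))
```

With these numbers the naive factor for library B is 5, so dividing by it would make the four unchanged genes look five-fold lower. The median-of-ratios factor stays at 1 for both libraries because the single extreme gene does not move the median.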
58. Gene length influences the count
“Longer transcripts generate more reads”
True! But the transcript length does not differ
between samples. Since we are concerned with
relative differences between samples, this needs
no normalization (this story changes in case of
absolute quantification).
Sample A
Sample B
Gene A
Gene A
Gene B
Gene B
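Why the length drops out of a between-sample comparison can be written out explicitly. The sketch assumes, as a simplification, that the expected count is proportional to the three factors listed earlier, with the same proportionality constant in both samples. The symbols e (expression level), L (transcript length) and N (total number of reads) are introduced here for illustration only.

```latex
\mathrm{E}[\mathrm{count}_{g,A}] \propto e_{g,A}\, L_g\, N_A,
\qquad
\mathrm{E}[\mathrm{count}_{g,B}] \propto e_{g,B}\, L_g\, N_B

\frac{\mathrm{E}[\mathrm{count}_{g,A}]}{\mathrm{E}[\mathrm{count}_{g,B}]}
  = \frac{e_{g,A}\, L_g\, N_A}{e_{g,B}\, L_g\, N_B}
  = \frac{e_{g,A}}{e_{g,B}} \cdot \frac{N_A}{N_B}
```

The length L_g cancels because it is the same gene in both samples. The library sizes N_A and N_B do not cancel, which is why library size still has to be normalized.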
59. Between sample variation
Properties of libraries/samples can affect the counts and lead to variation. This is called between-lane variation. Obvious ones: library size (how many reads are sampled) and library composition.
Libraries/samples can also differ in how gene properties relate to gene counts, which increases variation. This is called within-lane variation.
60. GC-content of genes can influence counts
GC-content differs between genes. But it does
not change between samples, so there should
be no problem for relative expression
comparison.
We can visualize the
relationship between
counts and GC very
easily (see right). There is
some trend, and it is
equal for all samples.
EDASeq (R)
61. GC-content of genes can influence counts
Sometimes, samples show different relationships
between GC-content of the genes and the counts.
This within-lane (or intra-sample) variation needs to be corrected for, so that the differentially expressed genes in one sample are not all the GC-rich ones.
Length can also have this effect.
62. What we need to know for our set-up
We want to detect differentially expressed genes
between 2 or more conditions.
For this, we need to apply the conditions in a
controlled environment (randomisation,...).
For good testing, we need to have some biological
replicates per condition.
For cost effectiveness, we determine how deep we
will sequence from each sample.
We analyse the reads, get raw counts and do the test!
63. Library preparation and lane loading
HiSeq2000: 24 single-index barcodes available. 1 lane gives 150-180 M reads. One lane of 50 bp SE costs approx. €1,500.
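These figures translate directly into reads and cost per sample. The Python sketch below assumes that the reads of a lane are divided evenly over the multiplexed samples, which is an idealization; the function name and the chosen sample numbers are illustrative.

```python
def reads_per_sample(lane_reads, n_samples):
    """Reads per sample if a lane is shared evenly between n_samples."""
    if n_samples < 1:
        raise ValueError("need at least one sample per lane")
    return lane_reads / n_samples

LANE_READS_RANGE = (150e6, 180e6)  # reads per lane, from the slide
MAX_BARCODES = 24                  # single-index barcodes, from the slide
LANE_COST_EUR = 1500               # approximate cost of one 50 bp SE lane

for n in (6, 12, MAX_BARCODES):
    low = reads_per_sample(LANE_READS_RANGE[0], n)
    high = reads_per_sample(LANE_READS_RANGE[1], n)
    cost = LANE_COST_EUR / n
    print(f"{n:>2} samples per lane: {low / 1e6:.2f}-{high / 1e6:.2f} M reads "
          f"per sample, approx. {cost:.2f} EUR per sample")
```

Loading all 24 barcodes on one lane gives between 6.25 and 7.5 M reads per sample under this even-split assumption.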
64. Bioinformatics analysis will take most of your time
Quality control (QC) of raw reads
Preprocessing: filtering of reads and read parts, to help our goal of differential detection.
QC of preprocessing
Mapping to a reference genome (alternative: to a transcriptome)
QC of the mapping
Count table extraction
QC of the count table
DE test
Biological insight
66. Bioinformatics analysis will take most of your time
1. Quality control (QC) of raw reads
2. Preprocessing: filtering of reads and read parts, to help our goal of differential detection. Followed by QC of preprocessing.
3. Mapping to a reference genome (alternative: to a transcriptome). Followed by QC of the mapping.
4. Count table extraction. Followed by QC of the count table.
5. DE test
6. Biological insight
69. Deeper, or more replicates?
Variance will be lower with more reads, but sequencing another biological replicate is preferred over sequencing deeper or adding technical replicates.
Doi: 10.1093/bioinformatics/btt015
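The preference for replicates can be seen from the negative binomial variance introduced earlier. The Python sketch below compares doubling the depth with doubling the number of biological replicates. It assumes independent replicates and uses hypothetical values (mean count 100, dispersion 0.1, 3 replicates); the function name is made up for this example.

```python
def cv2_of_mean(mean_count, dispersion, n_replicates):
    """CV^2 of the average count over n independent biological replicates.

    Uses the negative binomial relation CV^2 = 1/mean + dispersion for one
    replicate, divided by the number of replicates.
    """
    return (1.0 / mean_count + dispersion) / n_replicates

baseline = cv2_of_mean(100, 0.1, 3)         # 3 replicates, mean count 100
deeper = cv2_of_mean(200, 0.1, 3)           # same replicates, twice the depth
more_replicates = cv2_of_mean(100, 0.1, 6)  # same depth, twice the replicates

print(f"baseline:        CV^2 = {baseline:.4f}")
print(f"twice as deep:   CV^2 = {deeper:.4f}")
print(f"twice the reps:  CV^2 = {more_replicates:.4f}")
```

Doubling the depth only shrinks the Poisson term, which is already small at this count, whereas doubling the replicates halves the whole expression, including the biological term.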
71. Scotty – power analysis
Power: the probability to reject the null hypothesis if the alternative is
true.
'How many samples and how deep in order to minimize false
negatives'.
(a null hypothesis is always a scenario in which there is no difference,
hence no differential expression).
Alternative tools:
http://wiki.bits.vib.be/index.php/RNAseq_toolbox
74. Keywords
A read count of a gene is dependent on:
1. chance
2. expression level
3. transcript length
4. depth of sequencing
5. GC-content
Poisson distribution
Negative binomial distribution
Condition
Sample
Normalization
Write in your own words what the terms mean