1 of 74
This presentation is available under the Creative Commons
Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts
hereof.
RNA-seq for DE analysis training
Defining the goal of RNA-seq
analysis for differential
expression
Joachim Jacob
22 and 24 April 2014
2 of 74
With great power comes great responsibility
You can't do it all.
RNA-seq is powerful, so we have
to aim for a specific goal.
Our goal is to detect
differential expression
on the gene level.
3 of 74
With great power comes great responsibility
RNA-seq enables one to
1) identify which genes are active
2) quantify expression of each transcript
3) quantify alternative splicing
… (use your imagination)
Principles of transcriptome analysis and gene expression quantification: an RNA-seq
tutorial. http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12109/abstract
4 of 74
Differential expression: useful?
What are we looking for?
Explanations of observed phenotypes
yeast
GDA
Yeast mutant
GDA + vit C
why?
5 of 74
The central dogma
<What?>
yeast
GDA
Yeast mutant
GDA + vit C
?
causes the phenotypic differences
6 of 74
The central dogma
yeast
GDA
Yeast mutant
GDA + vit C
Difference in protein activity
causes the phenotypic differences
7 of 74
The central dogma
yeast
GDA
Yeast mutant
GDA + vit C
Presence/concentration of proteins in a cell
causes the phenotypic differences
8 of 74
The central dogma
yeast
GDA
Yeast mutant
GDA + vit C
?
Different regulation of protein production
causes the phenotypic differences
9 of 74
The central dogma
yeast
GDA
Yeast mutant
GDA + vit C
?
Level of templates for protein production
causes the phenotypic differences
10 of 74
The central dogma
yeast
GDA
Yeast mutant
GDA + vit C
?
Level of mRNA copies
causes the phenotypic differences
11 of 74
Does the reason for measuring DE make sense?
Difference in protein activity
Level of mRNA copies
Level of templates for protein production
Level of protein production
Presence/concentration of proteins in a cell
Phenotype
12 of 74
Problem reduction
We can measure mRNA levels (much easier
than protein levels).
So we measure mRNA.
The level of mRNA is a proxy for the level of
protein activity causing the aberrant
phenotype.
13 of 74
How to measure mRNA levels
1. Q-PCR (real-time): a lot of work to measure few
genes, in a relatively wide array
of tissues. Very accurate.
2. Microarray: an easier way to measure many
predefined genes in a relatively
wide array of tissues. Robust.
3. RNA-seq
14 of 74
Sequence-based measuring: high level view
● Get your sample
● Lyse the cells and extract RNA
● Convert the RNA to cDNA
● The cDNA pool gets sequenced
The result is sequence information from
scratch. No prior information is needed.
Yeast sample
Comprehensive comparative analysis of strand-specific RNA sequencing methods
http://www.nature.com/nmeth/journal/v7/n9/full/nmeth.1491.html
Comparative analysis of RNA sequencing methods for degraded or low-input samples
http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html
15 of 74
RNA-seq is not a new idea
● ESTs: expressed
sequence tags, ideal for
discovery of new genes.
● SAGE: serial analysis of
gene expression,
measurement of
number of copies of
mRNA
http://www.montana.edu/observatory/people/mcdermottlab.htm
16 of 74
RNA-seq is not a new idea
● ESTs: expressed
sequence tags, ideal for
discovery of new genes.
● SAGE: serial analysis of
gene expression,
measurement of
number of copies of
mRNA
http://www.sagenet.org/findings/index.html
17 of 74
RNA-seq is not a new idea
● ESTs: expressed sequence tags
● SAGE: serial analysis of gene expression
Low throughput: long sequence
information, but for only ~thousands of
genes.
18 of 74
Concept of measuring with RNA-seq
Extract mRNA
and turn into
cDNA
Fragment, ligate
adaptors, amplify,
select by size.
Put a fraction of
the pool on a
high throughput
sequencer to
read fragments.
One template of protein
production, mRNA
Figure: All things must pass: contrasts and commonalities in eukaryotic and bacterial mRNA decay, Nature Reviews Molecular Cell Biology 11, 467–478
GeneA GeneB GeneC
cell
nucleus
DNA
19 of 74
Every step means some loss
Yeast sample
20 of 74
RNA-seq numbers might explain phenotype
Phenotype
Proteins
mRNA levels
cDNA pool
RNA-seq read numbers
Represent the cDNA pool we've created
Represent the RNA pool we've extracted
Are a proxy for protein activity
Define the phenotype
21 of 74
So many steps can undermine our assumption
Phenotype
Proteins
mRNA levels
cDNA pool
RNA-seq read numbers
Protein activity is regulated:
phosphorylation, ubiquitination, ...
mRNA templates differ in their
rate of protein production:
availability of tRNAs,
rate of mRNA degradation,
alternative splicing events, ...
Loss during RNA extraction (90% of
the RNA in a cell is rRNA); ligation
of adapters and conversion to cDNA
are not 100% efficient
Failure to map reads to the correct
gene, lane-specific biases when
reading cDNA fragments, ...
22 of 74
Consequence: focus on comparison
Phenotype A
Proteins
mRNA pool
cDNA pool
RNA-seq reads
Phenotype B
Proteins
mRNA pool
cDNA pool
RNA-seq reads
Possibly due
to differences in
expression
23 of 74
Consequence: focus on comparison
Phenotype A
Proteins
mRNA pool
cDNA pool
RNA-seq reads
Phenotype B
Proteins
mRNA pool
cDNA pool
RNA-seq reads
DESIGN OF
EXPERIMENT
24 of 74
Comparing read numbers per gene
GeneA GeneB GeneC
sample
RNA-seq
Obviously, the number of reads is dependent on:
1. the expression level of the gene
2. the total number of reads generated
3. the length of the transcript
OUR QUESTION
25 of 74
Interpreting the counts for our goal
Our focus: which genes are differentially expressed
between different conditions?
Obviously, the number of reads is dependent on:
1. the expression level of the gene
2. the total number of reads generated
3. the length of the transcript
26 of 74
Experimental design
Our focus: which genes are differentially expressed
between different conditions?
“How can we detect genes for which the counts of
reads change between conditions more
systematically than as expected by chance”
We must design an experiment in which we can test
this deviance from chance.
Oshlack et al. 2010. From RNA-seq reads to differential expression results. Genome Biology 2010,
11:220 http://genomebiology.com/2010/11/12/220
27 of 74
How many reads to sequence?
In other words: how deep to sequence? What is the
required 'depth of sequencing'?
The final test will look at ratios:
            GeneA   GeneB   GeneC
sample 1    6       5       3
sample 2    5       6       4
ratio       1.2     0.83    0.75
28 of 74
How many reads to sequence?
The ratio between the lowest gene count and
the highest gene count is typically 10^5. This is called
the dynamic range.
Linear scale is useless. The logarithmic scale is better.
Wait! Something's not correct here!
29 of 74
Zero remains zero!
We are working with counts. A count is >= 1. A gene
with zero counts is either not yet sequenced (not
deep enough) or not expressed in that condition.
It is not a full logarithmic scale:
it starts at zero.
0
30 of 74
How do the numbers change?
Assuming equal sequencing depth in the samples,
and these counts: do all these genes differ in
expression?
Gene    sample 1   sample 2   ratio
GeneA   5          10         2
GeneB   15         30         2
GeneC   40         80         2
GeneD   100        200        2
GeneE   1000       2000       2
GeneZ   1          2          2
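The ratio column can be recomputed from the two count columns. A minimal sketch, with the counts copied from the table above; the name `fold_change` is chosen here for illustration only.

```python
# Counts copied from the slide: (sample 1, sample 2) per gene.
counts = {
    "GeneA": (5, 10),
    "GeneB": (15, 30),
    "GeneC": (40, 80),
    "GeneD": (100, 200),
    "GeneE": (1000, 2000),
    "GeneZ": (1, 2),
}

def fold_change(first, second):
    """Ratio of the count in the second sample to the count in the first."""
    return second / first

ratios = {gene: fold_change(a, b) for gene, (a, b) in counts.items()}
for gene, ratio in ratios.items():
    print(gene, ratio)
```

Every gene gives a ratio of 2, although the counts behind it range from 1 to 2000. The ratio alone does not tell how much evidence supports it.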
31 of 74
How do the numbers change?
Gene    sample 1   sample 2   ratio
GeneA   11         10         0.91
GeneB   11         30         2.73
GeneC   60         80         1.33
GeneD   79         200        2.53
GeneE   1150       2000       1.74
GeneZ   5          1          0.20
2?
Is there a trend in how
these numbers change?
Sequencing the result of the same steps again is
called a technical replicate.
32 of 74
Technical replicates
Gene    sample   sample   sample   sample
GeneA   11       5        4        4
GeneB   11       16       14       8
GeneC   60       45       32       38
GeneD   79       102      95       110
GeneE   1150     1023     987      1005
GeneZ   3        0        0        1
We take the same cDNA pool and sequence it several
times: technical replicates.
33 of 74
The Poisson distribution
The counts of technical replicates follow a Poisson
distribution (Marioni et al. 2008). The Poisson distribution
can be applied to systems with a large number of possible
events, each of which is rare.
Figure (from Wikipedia): the curves can
represent 3 different genes, each with
its own Poisson distribution. Lambda is
the mean of the gene's distribution,
i.e. its expected number of reads.
Y-axis: probability of observing
that number of reads.
34 of 74
The Poisson distribution
Say we have 4 technical replicates, each sequenced to
a large depth (say 10 M reads). By chance, we can get
these numbers for 3 different genes:
GeneA 0, 0, 1, 3
GeneB 2, 3, 4, 7
GeneC 8, 9, 11, 14
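How likely such numbers are follows from the Poisson probability P(k) = lambda^k * e^(-lambda) / k!. A small sketch, assuming hypothetical means (lambda) of 1, 4 and 10 reads for the three genes; these values are illustrative and not taken from the slides.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k reads when the mean is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Hypothetical mean read counts for three genes (illustrative values only).
for lam in (1, 4, 10):
    p_zero = poisson_pmf(0, lam)
    p_at_most_two = sum(poisson_pmf(k, lam) for k in range(3))
    print(f"lambda={lam}: P(0 reads)={p_zero:.3f}, P(<=2 reads)={p_at_most_two:.3f}")
```

With a mean of 1 read, a count of zero is expected in about 37% of the replicates (e^-1 is roughly 0.368), which is how a lowly expressed gene can show 0, 0, 1, 3.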
35 of 74
Working the intuition
How many blue balls?
How many red balls?
Draw 10
Draw 10 more
Draw 10 more
Estimate how large the fraction is in the set?
36 of 74
The intuition with the balls
Color 10 draws 20 draws 30 draws 40 draws
Blue
Red
No color
37 of 74
Conclusion of the experiment
The larger the fraction in the pool, the sooner (i.e.
with less sequencing depth) we are certain about the
estimate of that fraction.
For lower counts, the variance is
relatively bigger than the
variance for higher counts.
CV (coefficient of variation) =
sqrt(count)/count
Genes with lower expression
need much deeper sequencing
than genes with higher
expression levels.
estimate = count; variance = count
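The formula CV = sqrt(count)/count can be evaluated for a few counts to see how quickly the relative uncertainty drops. A minimal sketch; the function name `poisson_cv` and the chosen counts are illustrative.

```python
import math

def poisson_cv(count):
    """Coefficient of variation of a Poisson count: sqrt(count) / count."""
    return math.sqrt(count) / count

for count in (1, 10, 100, 1000, 10000):
    print(f"count={count:>5}  CV={poisson_cv(count):.3f}")
```

A count of 1 has a CV of 100%, a count of 100 a CV of 10%, and a count of 10000 a CV of 1%: a hundred times more reads are needed for a tenfold smaller relative error.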
38 of 74
Comparing counts
“Here we show the overlap of Poisson
distributions of single measurements at
different read counts. Because relative
Poisson uncertainty is high at low read
counts, a count of 1 versus 2 has very
little power to discriminate a true 2X fold
change, though at higher counts a 2X fold
change becomes significant.
In an actual experiment, the width of the
distribution would be greater due to
additional biological and technical
uncertainty, but the uncertainty to the
mean expression would narrow with
each additional replicate.”
Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression.
Bioinformatics (2013) doi: 10.1093/bioinformatics/btt015
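The overlap described in the quote can be approximated numerically. The sketch below assumes a simple overlap measure, the sum over k of min(P1(k), P2(k)); this measure is a choice made here for illustration and is not necessarily what Scotty computes.

```python
import math

def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of k reads with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def poisson_overlap(lam_a, lam_b):
    """Shared probability mass of two Poisson distributions (1 = identical)."""
    largest = max(lam_a, lam_b)
    # Sum far enough into the tail that the remaining mass is negligible.
    upper = int(largest + 10 * math.sqrt(largest) + 20)
    total = 0.0
    for k in range(upper + 1):
        p_a = math.exp(poisson_log_pmf(k, lam_a))
        p_b = math.exp(poisson_log_pmf(k, lam_b))
        total += min(p_a, p_b)
    return total

# The same 2x fold change at increasing read counts.
for base in (1, 10, 100, 1000):
    print(f"{base} vs {2 * base}: overlap = {poisson_overlap(base, 2 * base):.4f}")
```

With this measure, the distributions for 1 and 2 reads share roughly two thirds of their probability mass, while those for 100 and 200 reads hardly overlap at all.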
39 of 74
Comparing technical replicates
Risso et al. “GC-Content Normalization for RNA-Seq Data”
BMC Bioinformatics 2011, 12:480
http://www.biomedcentral.com/1471-2105/12/480 - EDASeq package (R)
Figure: variance versus mean of the counts (both axes: log2 of the counts).
Lines shown: the relationship between mean and variance according to Poisson,
and a lowess fit through the data.
40 of 74
But Poisson does not seem to fit
Extending the samples to real biological samples, this
mean variance relationship does not hold...
Plotted using EDASeq
Package in R.
41 of 74
But Poisson does not seem to fit
Extending the samples to real biological samples, this
mean variance relationship does not hold!
Plotted using EDASeq
Package in R.
Reasonable fit
Something is going on!
42 of 74
An extra source of variation
The counts are 'overdispersed' relative to the Poisson
distribution: between biological replicates, the variance
is bigger than Poisson predicts, most clearly for higher counts.
Plotted using EDASeq
Package in R.
Something is going on!
43 of 74
An extra source of variation
For Poisson: CV = std dev / mean => $CV^2 = 1/\mu$
If an additional distribution is involved (also
dependent on π, the fraction of the gene in the cDNA
pool), we have a mixture of distributions:
$CV^2 = 1/\mu + \phi$
The first term, $1/\mu$, dominates at low counts; the second term, $\phi$, is the dispersion.
The generalization of Poisson
with this extra parameter is
the negative binomial (NB).
This model fits better!
44 of 74
The negative binomial model
The NB model fits observed expression data of RNA-
seq better. It is a generalization of Poisson, and 2
parameters need to be estimated (μ and φ)
The count of gene g in sample j has
Mean = $\mu_{gj}$
Variance = $\mu_{gj} + \phi_g \mu_{gj}^2$
Biological $CV^2 = \phi_g$
=> Biological CV = $\sqrt{\phi_g}$
Methods differ in how they estimate this dispersion per gene.
It can only be measured with true biological replicates.
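The NB variance can be compared with the Poisson variance for a few mean counts. A minimal sketch, assuming a hypothetical dispersion of 0.04 (a biological CV of 20%); the function names are chosen here for illustration.

```python
def nb_variance(mu, phi):
    """Variance of a negative binomial count with mean mu and dispersion phi."""
    return mu + phi * mu ** 2

def nb_cv_squared(mu, phi):
    """Squared coefficient of variation: variance / mean^2 = 1/mu + phi."""
    return nb_variance(mu, phi) / mu ** 2

phi = 0.04  # hypothetical dispersion, i.e. a biological CV of 20%
for mu in (10, 100, 1000):
    print(f"mean={mu:>4}  Poisson variance={mu:>4}  "
          f"NB variance={nb_variance(mu, phi):.0f}")
```

With a dispersion of zero the NB variance equals the mean, which is the Poisson case again.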
45 of 74
Variation summary, intuitively
Total CV² = Technical CV² + Biological CV²
For low counts, the Poisson (technical) variation or
the measurement error is dominant.
For higher counts, the Poisson variation gets smaller,
and another source of variation becomes dominant,
the dispersion or the biological variation. Biological
variation does not get smaller with higher counts.
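The two parts of the total CV² can be put side by side to see where the dominant source of variation switches. A small sketch, again assuming a hypothetical dispersion of 0.04; names and values are illustrative.

```python
def cv_squared_parts(mean_count, dispersion):
    """Return (technical CV^2, biological CV^2) for a gene."""
    technical = 1.0 / mean_count   # Poisson part, shrinks as the count grows
    biological = dispersion        # stays the same, whatever the count
    return technical, biological

dispersion = 0.04  # hypothetical value (biological CV of 20%)
for mean_count in (5, 50, 500, 5000):
    technical, biological = cv_squared_parts(mean_count, dispersion)
    dominant = "technical" if technical > biological else "biological"
    print(f"count={mean_count:>4}  technical={technical:.4f}  "
          f"biological={biological:.4f}  dominant={dominant}")
```

Under this assumption the two sources are equal at a count of 1/0.04 = 25. Below that the Poisson part dominates, above it the biological part does.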
46 of 74
Beyond the NB model
It appears from analysis of many
biological replicates (#=69) that not
every gene can be modeled as NB:
the Poisson-Tweedie model
provides a further generalisation
and a better fit for many genes
(with an additional shape
parameter).
Left figure: raw data shows that about 26% of
the genes fit a NB model. Depending on the
estimated shape parameter, other
distributions fit better.
Esnaola et al. BMC Bioinformatics 2013, 14:254
http://www.biomedcentral.com/1471-2105/14/254
47 of 74
Consequence for our design
● For low counts: the uncertainty is big due to
Poisson variation.
● For high counts: the uncertainty is big due to
biological variation (highly expressed genes differ
more in their natural variation, regulated by cellular
processes, than lowly expressed genes).
● If we focus on the ratios between the conditions:
is it reasonable to set a restriction on fold change?
Highly expressed genes can have a smaller fold change and still be
significant. Lowly expressed genes can exceed a fold change of 2 without being significant.
48 of 74
Consequence on fold change
The fold-change cut-off routinely applied in microarray
analysis is of no use in RNA-seq.
Volcano plot. Blue and red:
known DE genes.
The cut-offs that are often
applied can prevent
detecting DE genes.
49 of 74
Just remember...
We need to estimate the model behind the count.
Never work without biological replicates.
Never work with 2 biological replicates.
Try avoiding working with 3 biological replicates.
Go for at least 4 biological replicates.
50 of 74
Break?
51 of 74
Overview
GeneA GeneB GeneC
Sample 1
RNA-seq
GeneA GeneB GeneC
Sample 2
RNA-seq
GeneA GeneB GeneC
Sample 3
RNA-seq
GeneA GeneB GeneC
Sample 4
RNA-seq
GeneA GeneB GeneC
Sample 5
RNA-seq
GeneA GeneB GeneC
Sample 6
RNA-seq
Condition X
Condition Y
52 of 74
Factors influencing read count
Obviously, the number of reads is dependent on:
1. chance
→ Define the count model (NB) from replicates
2. the expression level of the gene
→ Compare the ratios with a test
3. the total number of reads generated
4. the length of the transcript
53 of 74
Library size influences read counts
GeneA GeneB GeneC
sample
RNA-seq
The number of reads is dependent on the total
number of reads generated. If one library is
sequenced to 20M reads, and another one to
40M, most genes will ~double their counts.
GeneA GeneB GeneC
sample
More RNA-seq
54 of 74
Normalization for library size
Naive approach: divide by total library size. This is no
longer done!
Why not? Composition matters!
2 things to remember:
- zero-sum system, or “every gene we measure takes up a part (at
least one read) of the total library”
- the counts span 5 orders of magnitude
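Why composition matters can be shown with made-up numbers. In the sketch below, four genes have identical counts in two libraries and only one highly expressed gene differs; all counts and names are hypothetical.

```python
# Hypothetical counts for two libraries. The four "ordinary" genes have
# identical counts; only the highly expressed gene differs.
library_a = {"gene1": 100, "gene2": 200, "gene3": 300, "gene4": 400, "huge": 1000}
library_b = {"gene1": 100, "gene2": 200, "gene3": 300, "gene4": 400, "huge": 9000}

def scale_by_total(library):
    """Naive normalization: each count as a fraction of the library total."""
    total = sum(library.values())
    return {gene: count / total for gene, count in library.items()}

norm_a = scale_by_total(library_a)
norm_b = scale_by_total(library_b)
for gene in library_a:
    print(gene, round(norm_b[gene] / norm_a[gene], 2))
```

After dividing by the library total, the four unchanged genes appear five times lower in library B, purely because one gene takes up most of the reads there.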
55 of 74
Normalization for library size
Every gene takes up at
least one read. But in
every sample, a lot of
reads are spent on a few
extremely highly
expressed genes. The reason
is unknown. These genes often differ
between samples. This
fact biases average-based
(naive)
normalization attempts.
Figure: comparing 2 samples. X-axis: average count (log2). Y-axis: count difference (log2 ratio).
56 of 74
Normalization for library size
Schematically: when normalized on library size
(squares represent numbers of reads).
Rest of the genes
Rest of the genes
Few genes with enormous counts
All counts for library A All counts for library B
57 of 74
Normalization for library size
Better normalization would be as shown below.
DESeq2 and edgeR apply such an approach (see
later).
Figure: after normalization, the rest of the genes makes up 100% in both libraries.
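One approach of this kind is the median-of-ratios idea described for DESeq: per gene, divide each count by the geometric mean across libraries, then take the median of these ratios per library as its size factor. The sketch below is a simplified illustration of that idea, not the package's implementation; the counts are hypothetical.

```python
import math
from statistics import median

# Hypothetical counts per gene for two libraries; one highly expressed
# gene dominates the second library.
counts = {
    "gene1": [100, 100],
    "gene2": [200, 200],
    "gene3": [300, 300],
    "gene4": [400, 400],
    "huge":  [1000, 9000],
}

def size_factors(counts):
    """Median-of-ratios size factor per library (simplified sketch)."""
    n_libraries = len(next(iter(counts.values())))
    ratios = [[] for _ in range(n_libraries)]
    for values in counts.values():
        if min(values) == 0:
            continue  # a zero count gives no usable geometric mean
        geo_mean = math.exp(sum(math.log(v) for v in values) / n_libraries)
        for j, value in enumerate(values):
            ratios[j].append(value / geo_mean)
    return [median(r) for r in ratios]

factors = size_factors(counts)
print(factors)
```

Because the median ignores the one extreme gene, both libraries get a size factor of about 1 and the unchanged genes stay unchanged.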
58 of 74
Gene length influences the count
“Longer transcripts generate more reads”
True! But the transcript length does not differ
between samples. Since we are concerned with
relative differences between samples, this needs
no normalization (this story changes in case of
absolute quantification).
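That the transcript length cancels out of a ratio can be checked with a toy model. The sketch assumes reads are simply proportional to expression x length x depth; this proportional form and all numbers are assumptions made for illustration.

```python
def expected_reads(expression, length, depth_factor):
    """Toy model: reads are proportional to expression x length x depth."""
    return expression * length * depth_factor

# Hypothetical gene of 2000 bases, twice as expressed in sample B.
length = 2000
reads_a = expected_reads(expression=5, length=length, depth_factor=0.01)
reads_b = expected_reads(expression=10, length=length, depth_factor=0.01)
print(reads_b / reads_a)  # the length cancels out of the ratio
```

Changing `length` changes both read counts, but not their ratio.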
Sample A Sample B
Gene A
Gene B
Gene A
Gene B
59 of 74
The many flavours of sample variation
Some properties of libraries/samples can
affect the counts, and lead to variation. This is
called between-lane variation. Obvious ones:
library size (how many reads are sampled),
library composition.
Different libraries/samples differ sometimes in
how gene properties relate to gene counts. This
is called within-lane variation.
60 of 74
GC-content of genes can influence counts
GC-content differs between genes. But it does
not change between samples, so there should
be no problem for relative expression
comparison.
We can visualize the
relationship between
counts and GC very
easily (see right). There is
some trend, and it is
equal for all samples.
EDAseq (R)
61 of 74
GC-content of genes can influence counts
Sometimes, samples show different relationships
between GC-content of the genes and the counts.
This within-lane (or
intra-sample) variation
needs to be corrected for,
so that in one sample not
all differentially expressed
genes are also the GC-rich
ones.
Gene length can also have this
effect.
62 of 74
Putting our experiment together
We want to detect differentially expressed genes
between 2 or more conditions.
For this, we need to apply the conditions in a
controlled environment (randomisation,...).
For good testing, we need to have some biological
replicates per condition.
For cost effectiveness, we determine how deep we
will sequence from each sample.
We analyse the reads, get raw counts and do the test!
63 of 74
On the sequencer
HiSeq2000: 24 single-index barcodes available. 1
lane gives 150-180 M reads. One lane of 50 bp SE costs
approx. €1,500.
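From these numbers, the depth and cost per sample follow from how many samples share a lane. A back-of-the-envelope sketch, assuming the reads of a lane are divided evenly over the samples (real multiplexing only approximates this); the function name is illustrative.

```python
LANE_READS_MIN = 150_000_000   # reads per lane, lower bound from the slide
LANE_READS_MAX = 180_000_000   # reads per lane, upper bound from the slide
LANE_COST_EUR = 1500           # approximate cost of one 50 bp SE lane
MAX_BARCODES = 24              # single-index barcodes available

def per_sample(samples_per_lane):
    """Return (min reads, max reads, cost in EUR) per sample on one lane."""
    if not 1 <= samples_per_lane <= MAX_BARCODES:
        raise ValueError("samples per lane must be between 1 and 24")
    return (LANE_READS_MIN // samples_per_lane,
            LANE_READS_MAX // samples_per_lane,
            LANE_COST_EUR / samples_per_lane)

for n in (6, 8, 12, 24):
    low, high, cost = per_sample(n)
    print(f"{n:>2} samples/lane: {low / 1e6:.1f}-{high / 1e6:.1f} M reads, "
          f"~{cost:.0f} EUR each")
```

For example, 6 samples on one lane gives 25-30 M reads per sample at about €250 each.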
64 of 74
Bioinformatics analysis of the output
Quality control (QC) of raw reads
Preprocessing: filtering of reads
and read parts, to help our goal
of differential detection.
QC of preprocessing
Mapping to a reference genome
(alternative: to a transcriptome)
QC of the mapping
Count table extraction
QC of the count table
DE test
Biological insight
65 of 74
Bioinformatics analysis will take most of your time
Quality control (QC) of raw reads
Preprocessing: filtering of reads
and read parts, to help our goal
of differential detection.
QC of preprocessing
Mapping to a reference genome
(alternative: to a transcriptome)
QC of the mapping
Count table extraction
QC of the count table
DE test
Biological insight
1
2
3
4
5
6
66 of 74
Overview
Anders et al. Count-based differential expression analysis of RNA
sequencing data using R and Bioconductor. 2013
http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
67 of 74
The numbers get reduced with every step
20M
25M
15M
~16%
~5%
~10%
~30% decrease
from sequenced
reads to counted
reads.
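The overall decrease can be checked by chaining the per-step losses. The sketch assumes that the three percentages on the slide are successive losses, each applied to what remains after the previous step; the slide text does not state this explicitly.

```python
def remaining_fraction(losses):
    """Fraction of reads left after applying each fractional loss in turn."""
    fraction = 1.0
    for loss in losses:
        fraction *= 1.0 - loss
    return fraction

# Losses read from the slide; treating them as successive steps is an assumption.
losses = [0.16, 0.05, 0.10]
kept = remaining_fraction(losses)
print(f"kept {kept:.1%}, lost {1 - kept:.1%}")
```

Under this assumption about 72% of the sequenced reads end up being counted, in line with the roughly 30% decrease mentioned above.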
68 of 74
Deeper, or more replicates?
Variance will be lower with more reads: but
sequencing another biological replicate is
preferred over sequencing deeper, or technical reps.
Busby et al. Scotty: a web tool for designing RNA-Seq experiments to
measure differential gene expression. Doi: 10.1093/bioinformatics/btt015
69 of 74
There is a tool to help you set up
70 of 74
Scotty: power analysis
'How many samples and how deep in order to
minimize false negatives?'
Power: the probability to reject the null hypothesis if
the alternative is true. A null hypothesis is always a
scenario in which there is no difference, hence no
differential expression.
Check the BITS wiki:
http://wiki.bits.vib.be/index.php/RNAseq_toolbox
71 of 74
Help with design
http://wiki.bits.vib.be/index.php/RNAseq_toolbox
72 of 74
How many samples to sequence?
→ Scotty exercise
73 of 74
Keywords
A read count of a gene is dependent on:
1. chance
2. expression level
3. transcript length
4. depth of sequencing
5. GC-content
Poisson distribution
Negative binomial distribution
Condition
Sample
Normalization
Write in your own words what the terms mean
74 of 74
Reads
All my references available at:
https://www.zotero.org/groups/dernaseq/items

Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Lokesh Kothari
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
RizalinePalanog2
 

Último (20)

High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptx
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 

Part 1 of RNA-seq for DE analysis: Defining the goal

  • 1. This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof. RNA-seq for DE analysis training Defining the goal of RNA-seq analysis for differential expression Joachim Jacob 22 and 24 April 2014
  • 2. 2 of 74 Great power comes with great responsibility You can't do it all. RNA-seq is powerful, so we have to aim for a specific goal. Our goal is to detect differential expression at the gene level.
  • 3. 3 of 74 With great power comes great responsibility RNA-seq enables one to 1) get an idea of which genes are active 2) quantify expression of each transcript 3) quantify alternative splicing … (use your imagination) Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial. http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12109/abstract
  • 4. 4 of 74 Differential expression: useful? What are we looking for? Explanations of observed phenotypes yeast GDA Yeast mutant GDA + vit C why?
  • 5. 5 of 74 The central dogma <What?> yeast GDA Yeast mutant GDA + vit C ? causes the phenotypic differences
  • 6. 6 of 74 The central dogma yeast GDA Yeast mutant GDA + vit C Difference in protein activity causes the phenotypic differences
  • 7. 7 of 74 The central dogma yeast GDA Yeast mutant GDA + vit C Presence/concentration of proteins in a cell causes the phenotypic differences
  • 8. 8 of 74 The central dogma yeast GDA Yeast mutant GDA + vit C ? Different regulation of protein production causes the phenotypic differences
  • 9. 9 of 74 The central dogma yeast GDA Yeast mutant GDA + vit C ? Level of templates for protein production causes the phenotypic differences
  • 10. 10 of 74 The central dogma yeast GDA Yeast mutant GDA + vit C ? Level of mRNA copies causes the phenotypic differences
  • 11. 11 of 74 Does the reason for measuring DE make sense? Difference in protein activity Level of mRNA copies Level of templates for protein production Level of protein production Presence/concentration of proteins in a cell Phenotype
  • 12. 12 of 74 Problem reduction We can measure mRNA levels (much easier than protein levels). So we measure mRNA. The level of mRNA is a proxy for the level of protein activity causing the aberrant phenotype.
  • 13. 13 of 74 How to measure mRNA levels 1. Q-PCR (real-time): a lot of work to measure a few genes, in a relatively wide array of tissues. Very accurate. 2. Microarray: an easier way to measure many predefined genes in a relatively wide array of tissues. Robust. 3. RNA-seq
  • 14. 14 of 74 Sequence-based measuring: high level view ● Get your sample ● Lyse the cells and extract RNA ● Convert the RNA to cDNA ● The cDNA pool gets sequenced The result is sequence information from scratch. No prior information is needed. Yeast sample Comprehensive comparative analysis of strand-specific RNA sequencing methods http://www.nature.com/nmeth/journal/v7/n9/full/nmeth.1491.html Comparative analysis of RNA sequencing methods for degraded or low-input samples http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html
  • 15. 15 of 74 RNA-seq is not a new idea ● ESTs: expressed sequence tags, ideal for discovery of new genes. ● SAGE: serial analysis of gene expression, measurement of number of copies of mRNA http://www.montana.edu/observatory/people/mcdermottlab.htm
  • 16. 16 of 74 RNA-seq is not a new idea ● ESTs: expressed sequence tags, ideal for discovery of new genes. ● SAGE: serial analysis of gene expression, measurement of number of copies of mRNA http://www.sagenet.org/findings/index.html
  • 17. 17 of 74 RNA-seq is not a new idea ● ESTs: expressed sequence tags ● SAGE: serial analysis of gene expression Low throughput: long sequence information, but for only ~thousands of genes.
  • 18. 18 of 74 Concept of measuring with RNA-seq Extract mRNA and turn into cDNA Fragment, ligate Adaptor, amplify, size selection. Put a fraction of the pool on a high throughput sequencer to read fragments. One template of protein production, mRNA Figure: All things must pass: contrasts and commonalities in eukaryotic and bacterial mRNA decay, Nature Reviews Molecular Cell Biology 11, 467–478 GeneA GeneB GeneC cell nucleus DNA
  • 19. 19 of 74 Every step means some loss Yeast sample
  • 20. 20 of 74 RNA-seq numbers might explain phenotype Phenotype Proteins mRNA levels cDNA pool RNA-seq read numbers Represent the cDNA pool we've created Represent the RNA pool we've extracted Are a proxy for protein activity Define the phenotype
  • 21. 21 of 74 So many steps can break our assumption Phenotype Proteins mRNA levels cDNA pool RNA-seq read numbers Protein activity is regulated: phosphorylation, ubiquitination,... mRNA templates differ in their speed of protein production: availability of tRNAs, rate of mRNA degradation, alternative splicing events,... Loss on RNA extraction, 90% of RNA in the cell is rRNA, ligation of adapters and conversion to cDNA are not 100% efficient. Failure to map reads to the correct gene, lane-specific biases in reading cDNA fragments,...
  • 22. 22 of 74 Consequence: focus on comparison Phenotype A Proteins mRNA pool cDNA pool RNA-seq reads Phenotype B Proteins mRNA pool cDNA pool RNA-seq reads Possibly due to differences in expression
  • 23. 23 of 74 Consequence: focus on comparison Phenotype A Proteins mRNA pool cDNA pool RNA-seq reads Phenotype B Proteins mRNA pool cDNA pool RNA-seq reads DESIGN OF EXPERIMENT
  • 24. 24 of 74 Comparing read numbers per gene GeneA GeneB GeneC sample RNA-seq Obviously, the number of reads is dependent on: 1. the expression level of the gene 2. the total number of reads generated 3. the length of the transcript OUR QUESTION
  • 25. 25 of 74 Interpreting the counts for our goal Our focus: which genes are differentially expressed between different conditions? Obviously, the number of reads is dependent on: 1. the expression level of the gene 2. the total number of reads generated 3. the length of the transcript
  • 26. 26 of 74 Experimental design Our focus: which genes are differentially expressed between different conditions? “How can we detect genes for which the counts of reads change between conditions more systematically than as expected by chance” We must design an experiment in which we can test this deviance from chance. Oshlack et al. 2010. From RNA-seq reads to differential expression results. Genome Biology 2010, 11:220 http://genomebiology.com/2010/11/12/220
  • 27. 27 of 74 How many reads to sequence? In other words: how deep to sequence? What is the required 'depth of sequencing'? The final test will look at ratios of the counts in two samples: GeneA 6 vs 5 (ratio 1.2), GeneB 5 vs 6 (ratio 0.83), GeneC 3 vs 4 (ratio 0.75).
  • 28. 28 of 74 How many reads to sequence? The difference between the lowest gene count and the highest gene count is typically a factor of 10^5. This is called the dynamic range. A linear scale is useless. The logarithmic scale is better. Wait! Something's not correct here!
  • 29. 29 of 74 Zero remains zero! We are working with counts. An observed count is >= 1. A gene with zero counts is either not yet sequenced (we did not sequence deep enough) or not expressed in that condition. So it is not a full logarithmic scale: it starts at zero.
  • 30. 30 of 74 How do the numbers change? Assuming equal sequencing depth in the samples, and these counts: do all these genes differ in expression? Gene: sample 1, sample 2, ratio. GeneA: 5, 10, 2. GeneB: 15, 30, 2. GeneC: 40, 80, 2. GeneD: 100, 200, 2. GeneE: 1000, 2000, 2. GeneZ: 1, 2, 2.
  • 31. 31 of 74 How do the numbers change? Gene: sample 1, sample 2, ratio. GeneA: 11, 10, 0.91. GeneB: 11, 30, 2.72. GeneC: 60, 80, 1.33. GeneD: 79, 200, 2.53. GeneE: 1150, 2000, 1.74. GeneZ: 5, 1, 0.20. Is the ratio still 2? Is there a trend in how these numbers change? Sequencing the result of the same steps again is called a technical replicate.
  • 32. 32 of 74 Technical replicates We take the same cDNA pool and sequence it several times: technical replicates. Counts in four technical replicates: GeneA: 11, 5, 4, 4. GeneB: 11, 16, 14, 8. GeneC: 60, 45, 32, 38. GeneD: 79, 102, 95, 110. GeneE: 1150, 1023, 987, 1005. GeneZ: 3, 0, 0, 1.
  • 33. 33 of 74 The Poisson distribution The counts of technical replicates follow a Poisson distribution (Marioni et al 2008). The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare. From Wikipedia. The curves could be 3 different genes, each with its own Poisson distribution. Lambda is the mean of the gene's distribution, a certain number of reads. Y-axis: the chance of picking that number of reads.
  • 34. 34 of 74 The Poisson distribution So when we have 4 technical replicates sequenced to a great depth (say 10 M reads), we can get, by chance, these numbers for 3 different genes: GeneA 0, 0, 1, 3. GeneB 2, 3, 4, 7. GeneC 8, 9, 11, 14.
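As a rough illustration of the Poisson picture on this slide, the sketch below computes the chance of seeing exactly k reads for a gene whose mean count is lambda. The function name poisson_pmf and the three lambda values are illustrative assumptions, not taken from the slides.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k reads when the expected count is lam."""
    if k < 0 or lam <= 0:
        raise ValueError("k must be >= 0 and lam must be > 0")
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Three hypothetical genes, each with its own mean count (lambda).
for lam in (1, 4, 10):
    probs = ", ".join(f"P({k})={poisson_pmf(k, lam):.3f}" for k in range(6))
    print(f"lambda={lam:>2}: {probs}")
```

For a gene with a mean of 1 read, a zero count turns up in roughly 37% of draws, which is why a zero in a single replicate says little about whether the gene is expressed.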
  • 35. 35 of 74 Working the intuition How many blue balls? How many red balls? Draw 10 Draw 10 more Draw 10 more Estimate how large the fraction is in the set?
  • 36. 36 of 74 The intuition with the balls Color 10 draws 20 draws 30 draws 40 draws Blue Red No color
  • 37. 37 of 74 Conclusion of the experiment The bigger the fraction in the pool, the quicker (i.e. with less sequencing depth) we are certain about the estimate of that fraction. For lower counts, the variance is relatively bigger than the variance for higher counts. CV (coefficient of variation) = sqrt(count)/count. Genes with lower expression need much deeper sequencing than genes with higher expression levels. estimate=count; variance=count
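The relation CV = sqrt(count)/count can be tabulated directly. This is a minimal sketch assuming pure Poisson noise (estimate = count, variance = count, as on the slide); the function name poisson_cv and the chosen counts are illustrative.

```python
import math


def poisson_cv(count: float) -> float:
    """Coefficient of variation of a Poisson count: sqrt(count) / count."""
    if count <= 0:
        raise ValueError("count must be positive")
    return math.sqrt(count) / count


# Relative uncertainty shrinks only with the square root of the count.
for count in (1, 10, 100, 1000, 10000):
    print(f"count={count:>6}  CV={poisson_cv(count):.3f}")
```

Reaching a CV of 10% takes about 100 reads on a gene, and each further tenfold gain in precision costs a hundredfold more reads, which is the sense in which lowly expressed genes need much deeper sequencing.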
  • 38. 38 of 74 Comparing counts “Here we show the overlap of Poisson distributions of single measurements at different read counts. Because relative Poisson uncertainty is high at low read counts, a count of 1 versus 2 has very little power to discriminate a true 2X fold change, though at higher counts a 2X fold change becomes significant. In an actual experiment, the width of the distribution would be greater due to additional biological and technical uncertainty, but the uncertainty to the mean expression would narrow with each additional replicate.” Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Bioinformatics (2013) doi: 10.1093/bioinformatics/btt015
  • 39. 39 of 74 Comparing technical replicates Risso et al. “GC-Content Normalization for RNA-Seq Data” BMC Bioinformatics 2011, 12:480 http://www.biomedcentral.com/1471-2105/12/480 - EDASeq package (R) Figure labels: correlation between mean and variance according to Poisson; lowess fit through the data; axes in log2 of the counts.
  • 40. 40 of 74 But Poisson does not seem to fit Extending the samples to real biological samples, this mean-variance relationship does not hold... Plotted using EDASeq Package in R.
  • 41. 41 of 74 But Poisson does not seem to fit Extending the samples to real biological samples, this mean-variance relationship does not hold! Plotted using EDASeq Package in R. Reasonable fit Something is going on!
  • 42. 42 of 74 An extra source of variation Relative to the Poisson distribution, the counts are 'overdispersed': between biological replicates, the variance at higher counts is bigger than Poisson predicts. Plotted using EDASeq Package in R. Something is going on!
  • 43. 43 of 74 An extra source of variation For Poisson: CV = std dev / mean, so CV² = 1/μ. If an additional distribution is involved (also dependent on π, the fraction of the gene in the cDNA pool), we have a mixture of distributions: CV² = 1/μ + φ. The 1/μ term dominates at low counts; φ is the dispersion. A generalization of Poisson with this extra parameter, the negative binomial model, fits better!
  • 44. 44 of 74 The negative binomial model The NB model fits observed RNA-seq expression data better. It is a generalization of Poisson, and 2 parameters need to be estimated (μ and φ). The count of gene g in sample j has mean = μ_gj and variance = μ_gj + φ_g μ_gj². Biological CV² = φ_g, so biological CV = √φ_g. Methods differ in how they estimate this dispersion per gene; it can only be measured with true biological replicates.
  • 45. 45 of 74 Variation summary, intuitively Total CV² = Technical CV² + Biological CV² For low counts, the Poisson (technical) variation or the measurement error is dominant. For higher counts, the Poisson variation gets smaller, and another source of variation becomes dominant, the dispersion or the biological variation. Biological variation does not get smaller with higher counts.
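The split of total CV² into a technical and a biological part can be checked numerically. The sketch below uses the NB variance μ + φμ² from the slides; the function names and the dispersion value 0.04 (a biological CV of 20%) are hypothetical choices for illustration.

```python
def nb_variance(mu: float, phi: float) -> float:
    """Negative binomial variance for mean mu and dispersion phi."""
    return mu + phi * mu ** 2


def cv2_parts(mu: float, phi: float) -> tuple[float, float, float]:
    """Return (technical CV^2, biological CV^2, total CV^2)."""
    technical = 1.0 / mu   # Poisson part, shrinks as the count grows
    biological = phi       # dispersion, independent of the count
    return technical, biological, technical + biological


PHI = 0.04  # hypothetical dispersion, i.e. a biological CV of 20%
for mu in (5, 25, 500, 5000):
    tech, bio, total = cv2_parts(mu, PHI)
    print(f"mean={mu:>5}  technical={tech:.4f}  biological={bio:.4f}  total={total:.4f}")
```

With this dispersion the two parts are equal at a mean count of 25; below that the Poisson term dominates, above it the biological term does, and sequencing deeper no longer reduces the total much.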
  • 46. 46 of 74 Beyond the NB model It appears from analysis of many biological replicates (#=69) that not every gene can be modeled as NB: the Poisson-Tweedie model provides a further generalisation and a better fit for many genes (with an additional shape parameter). Left figure: raw data shows that about 26% of the genes fit a NB model. Depending on the estimated shape parameter, other distributions fit better. Esnaola et al. BMC Bioinformatics 2013, 14:254 http://www.biomedcentral.com/1471-2105/14/254
  • 47. 47 of 74 Consequence for our design ● For low counts: the uncertainty is big due to Poisson ● For high counts: the uncertainty is big due to biological variation. (highly expressed genes differ in their natural variation (regulated by cellular processes) more than lowly expressed genes). ● If we focus on the ratios between the conditions: is it reasonable to set a restriction on fold change? Highly expressed genes can have a smaller fold change and still be significant. Lowly expressed genes can exceed a fold change of 2.
  • 48. 48 of 74 Consequence on fold change The fold-change cut-off commonly applied in microarray analysis is not useful in RNA-seq. Blue and red: known DE genes Volcano plot These commonly applied cut-offs can prevent the detection of DE genes
  • 49. 49 of 74 Just remember... We need to estimate the model behind the count. Never work without biological replicates. Never work with 2 biological replicates. Try avoiding working with 3 biological replicates. Go for at least 4 biological replicates.
  • 51. 51 of 74 Overview GeneA GeneB GeneC Sample 1 RNA-seq GeneA GeneB GeneC Sample 2 RNA-seq GeneA GeneB GeneC Sample 3 RNA-seq GeneA GeneB GeneC Sample 4 RNA-seq GeneA GeneB GeneC Sample 5 RNA-seq GeneA GeneB GeneC Sample 6 RNA-seq Condition X Condition Y
  • 52. 52 of 74 Factors influencing read count Obviously, the number of reads is dependent on: 1. chance → Define the count model (NB) from replicates 2. the expression level of the gene → Compare the ratios with a test 3. the total number of reads generated 4. the length of the transcript
  • 53. 53 of 74 Library size influences read counts GeneA GeneB GeneC sample RNA-seq The number of reads is dependent on the total number of reads generated. If one library is sequenced to 20M reads, and another one to 40M, most genes will ~double their counts. GeneA GeneB GeneC sample More RNA-seq
  • 54. 54 of 74 Normalization for library size Naive approach: divide by total library size. Is not applied anymore! Why not? Composition matters! 2 things to remember: - zero sum system or “every gene we measure takes up a part (at least one read) of the total library” - 5 orders of magnitude
  • 55. 55 of 74 Normalization for library size Every gene takes up at least one read. But in every sample, a lot of reads are spent on a few extremely highly expressed genes. Reason unknown. Often different between samples. This fact biases average-based (naïve) normalization attempts. Figure axes: average count (log2) and count difference (log2 ratio), comparing 2 samples.
  • 56. 56 of 74 Normalization for library size Schematically: when normalized on library size (squares represent the number of reads). Rest of the genes Rest of the genes Few genes with enormous counts All counts for library A All counts for library B
  • 57. 57 of 74 Normalization for library size Better normalization would be as shown below. DESeq2 and edgeR apply such an approach (see later). Figure: in both libraries, the rest of the genes is scaled to 100%.
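One way to see why a few enormous counts distort library-size scaling is a median-of-ratios size factor, the idea behind DESeq-style normalization. The code below is a simplified stand-alone sketch, not the DESeq2 or edgeR implementation; the function name and the toy counts are hypothetical.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts: list[list[int]]) -> list[float]:
    """counts[g][j] is the count of gene g in sample j; returns one factor per sample."""
    n_samples = len(counts[0])
    # Only genes with a nonzero count in every sample have a defined log ratio.
    log_rows = [[math.log(c) for c in row] for row in counts if all(c > 0 for c in row)]
    if not log_rows:
        raise ValueError("no gene has nonzero counts in all samples")
    # Log of the per-gene geometric mean across samples (the pseudo-reference).
    log_geo_means = [sum(row) / n_samples for row in log_rows]
    factors = []
    for j in range(n_samples):
        log_ratios = [row[j] - ref for row, ref in zip(log_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: sample 2 is sequenced twice as deep, and one gene explodes in it.
toy_counts = [[10, 20], [30, 60], [80, 160], [200, 400], [1000, 20000]]
size_factors = median_of_ratios_size_factors(toy_counts)
totals = [sum(row[j] for row in toy_counts) for j in range(2)]
print("library-size ratio:", totals[1] / totals[0])
print("size-factor ratio: ", size_factors[1] / size_factors[0])
```

In this toy case the library totals differ by a factor of about 15.6 because of the single extreme gene, while the median-based factors recover the factor of 2 shared by the rest of the genes.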
  • 58. 58 of 74 Gene length influences the count “Longer transcripts generate more reads” True! But the transcript length does not differ between samples. Since we are concerned with relative differences between samples, this needs no normalization (this story changes in case of absolute quantification). Sample A Sample B Gene A Gene B Gene A Gene B
  • 59. 59 of 74 The many flavours of sample variation Some properties of libraries/samples can affect the counts, and lead to variation. This is called between-lane variation. Obvious ones: library size (how many reads are sampled), library composition. Different libraries/samples sometimes differ in how gene properties relate to gene counts. This is called within-lane variation.
  • 60. 60 of 74 GC-content of genes can influence counts GC-content differs between genes. But it does not change between samples, so there should be no problem for relative expression comparison. We can visualize the relationship between counts and GC very easily (see right). There is some trend, and it is equal for all samples. EDAseq (R)
  • 61. 61 of 74 GC-content of genes can influence counts Sometimes, samples show different relationships between the GC-content of the genes and the counts. This within-lane (or intra-sample) variation needs to be corrected for, so that the differentially expressed genes in one sample are not simply the GC-rich ones. Length can also have this effect.
  • 62. 62 of 74 Putting our experiment together We want to detect differentially expressed genes between 2 or more conditions. For this, we need to apply the conditions in a controlled environment (randomisation,...). For good testing, we need to have some biological replicates per condition. For cost effectiveness, we determine how deep we will sequence from each sample. We analyse the reads, get raw counts and do the test!
  • 63. 63 of 74 On the sequencer HiSeq2000: 24 single-index barcodes available. 1 lane gives 150-180 M reads. One lane of 50 bp SE approx €1.500.
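The lane figures on this slide translate into a rough reads-per-sample and cost-per-sample budget. The sketch below assumes reads are spread evenly over the multiplexed samples and counts only the lane price (no library preparation); the function name is hypothetical.

```python
def per_sample_budget(lane_reads: float, lane_cost_eur: float, n_samples: int) -> tuple[float, float]:
    """Return (reads per sample, cost per sample in EUR) for one multiplexed lane."""
    if not 1 <= n_samples <= 24:  # 24 single-index barcodes available
        raise ValueError("n_samples must be between 1 and 24")
    return lane_reads / n_samples, lane_cost_eur / n_samples


LANE_COST_EUR = 1500
for n in (4, 8, 12, 24):
    low, cost = per_sample_budget(150e6, LANE_COST_EUR, n)
    high, _ = per_sample_budget(180e6, LANE_COST_EUR, n)
    print(f"{n:>2} samples/lane: {low / 1e6:.2f}-{high / 1e6:.2f} M reads each, about {cost:.2f} EUR each")
```

Trading samples per lane against depth per sample in this way is the starting point for the replicate-versus-depth question raised later in the deck.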
  • 64. 64 of 74 Bioinformatics analysis of the output Quality control (QC) of raw reads Preprocessing: filtering of reads and read parts, to help our goal of differential detection. QC of preprocessing Mapping to a reference genome (alternative: to a transcriptome) QC of the mapping Count table extraction QC of the count table DE test Biological insight
  • 65. 65 of 74 Bioinformatics analysis will take most of your time Quality control (QC) of raw reads Preprocessing: filtering of reads and read parts, to help our goal of differential detection. QC of preprocessing Mapping to a reference genome (alternative: to a transcriptome) QC of the mapping Count table extraction QC of the count table DE test Biological insight 1 2 3 4 5 6
  • 66. 66 of 74 Overview Anders et al. Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. 2013 http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
  • 67. 67 of 74 The numbers get reduced with every step 20M 25M 15M ~16% ~5% ~10% ~30% decrease from sequenced reads to counted reads.
  • 68. 68 of 74 Deeper, or more replicates? Variance will be lower with more reads: but sequencing another biological replicate is preferred over sequencing deeper, or technical reps. Busby et al. Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Doi: 10.1093/bioinformatics/btt015
  • 69. 69 of 74 There is a tool to help you set up
  • 70. 70 of 74 Scotty: power analysis 'How many samples and how deep in order to minimize false negatives?' Power: the probability to reject the null hypothesis if the alternative is true. A null hypothesis is always a scenario in which there is no difference, hence no differential expression. Check the BITS wiki: http://wiki.bits.vib.be/index.php/RNAseq_toolbox
  • 71. 71 of 74 Help with design http://wiki.bits.vib.be/index.php/RNAseq_toolbox
  • 72. 72 of 74 How many samples to sequence? → Scotty exercise
  • 73. 73 of 74 Keywords A read count of a gene is dependent on: 1. chance 2. expression level 3. transcript length 4. depth of sequencing 5. GC-content Poisson distribution Negative binomial distribution Condition Sample Normalization Write in your own words what the terms mean
  • 74. 74 of 74 Reads All my references available at: https://www.zotero.org/groups/dernaseq/items