Part 4 of the training session 'RNA-seq for differential expression analysis' considers extracting the count table from a mapping, and performing QC to detect sample biases. See http://www.bits.vib.be
RNA-seq for DE analysis: extracting counts and QC - part 4
1. Generating the count table and validating assumptions
RNA-seq for DE analysis training
Joachim Jacob
20 and 27 January 2014
This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts hereof.
2. Goal
Summarize the read counts per gene from a mapping result.
The outcome is a raw count table on which we can perform some QC.
This table is used by the differential expression algorithm to detect DE genes.
4. The challenge
'Exons' are the type of features used here. They are summarized per 'gene'.
Alt splicing
Overlaps no feature
Concept:
GeneA = exon 1 + exon 2 + exon 3 + exon 4 = 215 reads
GeneB = exon 1 + exon 2 + exon 3 = 180 reads
No normalization yet! Just pure counts, aka 'raw counts'.
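The concept above can be sketched in a few lines of R. The per-exon read numbers are hypothetical (the slide only gives the gene totals), and the sketch ignores reads that span two exons, which a real counting tool counts only once per gene.

```r
# Hypothetical per-exon read counts; only the gene totals (215 and 180)
# come from the slide, the split over exons is made up for illustration.
exon_counts <- data.frame(
  gene  = c(rep("GeneA", 4), rep("GeneB", 3)),
  exon  = c(1:4, 1:3),
  reads = c(50, 60, 40, 65, 70, 80, 30)
)

# Summarize the exon-level counts per gene: raw counts, no normalization.
gene_counts <- tapply(exon_counts$reads, exon_counts$gene, sum)
print(gene_counts)
```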
5. Tools to count features
● Different tools exist to accomplish this:
http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Feature_counting
6. Dealing with ambiguity
● We focus on the gene level: merge all counts over different isoforms into one, taking into account:
● Reads that do not overlap a feature, but appear in introns. Take into account?
● Reads that align to more than one feature (exon or transcript). Transcripts can be overlapping, perhaps on different strands. (PE and strandedness can resolve this partially.)
● Reads that partially overlap a feature, not following known annotations.
7. HTSeq count has 3 modes
HTSeq-count recommends the 'union' mode. But depending on your genome, you may opt for the 'intersection-strict' mode. Galaxy allows experimenting!
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
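To make the difference between the three modes concrete, here is a toy re-implementation in R, based on the mode definitions in the HTSeq documentation linked above. It is a simplified sketch, not HTSeq's own code: the gene coordinates and the function names (features_per_position, assign_read) are hypothetical, and strand, paired ends and multi-mapping reads are ignored.

```r
# Toy annotation: each gene is the set of exonic positions it covers
# (hypothetical coordinates).
genes <- list(
  geneA = c(1:100, 201:300),
  geneB = 280:400
)

# For every position of a read, list the genes overlapping that position.
features_per_position <- function(read_positions, genes) {
  lapply(read_positions, function(p) {
    names(genes)[vapply(genes, function(g) p %in% g, logical(1))]
  })
}

# Assign a read to a gene, "no_feature" or "ambiguous".
assign_read <- function(read_positions, genes,
                        mode = c("union", "intersection-strict",
                                 "intersection-nonempty")) {
  mode <- match.arg(mode)
  sets <- features_per_position(read_positions, genes)
  if (mode == "intersection-nonempty") {
    sets <- sets[lengths(sets) > 0]  # ignore positions without any feature
  }
  if (length(sets) == 0) {
    return("no_feature")
  }
  combined <- if (mode == "union") Reduce(union, sets) else Reduce(intersect, sets)
  if (length(combined) == 0) {
    "no_feature"
  } else if (length(combined) == 1) {
    combined
  } else {
    "ambiguous"
  }
}

# A read hanging over the end of an exon of geneA:
overhang <- 90:110
# A read inside geneA that partially overlaps geneB:
overlap <- 270:290

sapply(c("union", "intersection-strict", "intersection-nonempty"),
       function(m) c(overhang = assign_read(overhang, genes, m),
                     overlap  = assign_read(overlap, genes, m)))
```

In this toy case 'union' keeps the overhanging read but calls the overlapping read ambiguous, while 'intersection-strict' does the opposite. This is why the best choice can depend on how densely genes overlap in your genome.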
8. Indicate the SE or PE nature of your data
(note: mate-pair is not
appropriate naming here)
The annotation file with the coordinates
of the features to be counted
mode
Reverse stranded: check with mapping viz
Check with mapping QC (see earlier)
For RNA-seq DE we summarize over 'exons' grouped by 'gene_id'. Make sure these fields are correct in your GTF file.
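One way to check these fields before counting is to load the GTF in R and inspect the feature column and the attributes. The sketch below uses a made-up four-line GTF held in a string; for a real annotation you would pass a file path to read.delim instead. The column names are the usual GTF column names, chosen here for readability.

```r
# Made-up GTF content (tab-separated, 9 columns), for illustration only.
gtf_text <- paste(
  'chrI\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
  'chrI\ttoy\texon\t300\t400\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";',
  'chrI\ttoy\tCDS\t120\t200\t.\t+\t0\tgene_id "geneA"; transcript_id "geneA.1";',
  'chrI\ttoy\texon\t900\t950\t.\t-\t.\tgene_id "geneB"; transcript_id "geneB.1";',
  sep = "\n"
)

gtf <- read.delim(
  text = gtf_text, header = FALSE, quote = "", comment.char = "#",
  stringsAsFactors = FALSE,
  col.names = c("seqname", "source", "feature", "start", "end",
                "score", "strand", "frame", "attribute")
)

# The feature type we count over must be present ...
exons <- gtf[gtf$feature == "exon", ]
stopifnot(nrow(exons) > 0)

# ... and every exon needs the attribute we group by.
has_gene_id <- grepl('gene_id "[^"]+"', exons$attribute)
stopifnot(all(has_gene_id))

# Number of exon records per gene_id.
exons$gene_id <- sub('.*gene_id "([^"]+)".*', "\\1", exons$attribute)
exons_per_gene <- table(exons$gene_id)
print(exons_per_gene)
```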
15. QC: general metrics
Which genes are most highly present?
Which fractions do they occupy?
42 genes (0.63%) of the 6665 genes take 25% of all counts, e.g. TEF1alpha, putative ribo prot,...
This graph (counts per gene) can be constructed from the count table.
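A minimal sketch of how such a number can be derived from one column of the count table: sort the genes by count and find how many top genes are needed to reach a given fraction of all counts. The function name genes_for_fraction and the toy counts are hypothetical.

```r
# Number of top genes needed to reach a given fraction of all counts.
genes_for_fraction <- function(counts, fraction) {
  sorted <- sort(counts, decreasing = TRUE)
  cumulative <- cumsum(sorted) / sum(sorted)
  unname(which(cumulative >= fraction)[1])
}

# Toy counts for one sample (made-up values).
counts <- c(g1 = 500, g2 = 300, g3 = 100, g4 = 50, g5 = 30, g6 = 20)

n_top <- genes_for_fraction(counts, 0.25)
cat(n_top, "gene(s), or",
    round(100 * n_top / length(counts), 2),
    "% of the genes, take 25% of all counts\n")
```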
17. QC: general metrics
● We can plot the counts per sample: filter out the zeros, and transform to log2.
The bulk of the genes have counts in the hundreds.
A few are extremely highly expressed.
A minority have extremely low counts.
(x-axis: log2(count))
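A minimal base R sketch of this plot for a single sample, with made-up counts. The zeros are removed first because log2(0) is -Inf.

```r
# Hypothetical raw counts for one sample (toy values, not from the slides).
sample_counts <- c(geneA = 0, geneB = 3, geneC = 150, geneD = 420,
                   geneE = 0, geneF = 98000, geneG = 260)

# Drop the zeros, then move to the log2 scale.
log_counts <- log2(sample_counts[sample_counts > 0])

# Kernel density estimate of the log2 counts; plot() draws the curve.
dens <- density(log_counts)
plot(dens, main = "Sample 1", xlab = "log2(count)")
```

To compare samples, the same steps can be repeated per column of the count table and the curves drawn on one plot with lines().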
18. QC: log2 density graph
● We can do this for all samples, and merge
All samples show nice overlap, peaks are similar.
(Figure annotation: strange deviation here.)
19. QC: log2 merging samples
Here, we take one sample, plot the log2 density graph, add the counts of another sample, and plot again, add the counts of another sample, etc. until we have merged all samples.
We see a horizontal shift of the graph, rather than a vertical shift, pointing to no saturation.
21. QC: rarefaction curve
What is the total number of detected features, and how does the feature space increase with each additional sample added?
There should be saturation, but here there is none.
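Code:
library(ggplot2)
# nonzero_counts: one row per merge step, with the cumulative number of
# reads (total) and the number of genes with at least one count (counts)
ggplot(data = nonzero_counts, aes(x = total, y = counts)) +
  geom_line() +
  labs(x = "total number of sequenced reads",
       y = "number of genes with counts > 0")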
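The plotting code expects a table nonzero_counts with the columns total and counts. The slides do not show how it was built; the following is one possible sketch, assuming a count matrix with genes in rows and samples in columns (toy values) and merging the samples in column order.

```r
# Toy count matrix: 4 genes (rows) x 3 samples (columns), made-up values.
counts_matrix <- matrix(
  c(10, 0, 5, 0,   # sample A
     3, 0, 0, 2,   # sample B
     1, 0, 4, 0),  # sample C
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("A", "B", "C"))
)

# Running sum per gene over the samples: column k holds the counts of
# samples 1..k added up (A, A + B, A + B + C).
merged <- t(apply(counts_matrix, 1, cumsum))

# One row per merge step: total reads so far and number of detected genes.
nonzero_counts <- data.frame(
  total  = colSums(merged),
  counts = colSums(merged > 0)
)
print(nonzero_counts)
```

If the number of detected genes stops growing while the total keeps increasing, the curve has saturated.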
22. QC: rarefaction curve
Sample A
Sample A + sample B
Sample A + sample B + sample C
Etc.
rRNA genes
Saturation: OK!
23. QC: transformations for viz
Regularized log (rLog) and 'Variance Stabilizing Transformation'
(VST) as alternatives to log2.
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
24. QC: count transformations
Not normalizations!
● Techniques used for microarray can be applied on VST transformed counts.
http://www.biomedcentral.com/1471-2105/14/91
(Figure panels: Log2, rLog, VST)
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
25. QC including condition info
● We can also include condition information, to interpret our QC better.
● For this, we need to gather sample information.
Make a separate file in which sample info is provided (metadata).
26. QC with condition info
What are the differences in counts in each sample dependent on? Here: counts are dependent on the treatment and the strain. This must match the sample descriptions file.
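A small sketch of what such a sample descriptions table can look like in R, and how to check that it matches the columns of the count table. The sample names and factor levels are hypothetical. In practice the table would be read from the separate metadata file, for example with read.delim.

```r
# Hypothetical sample descriptions: one row per sample.
sample_info <- data.frame(
  sample    = c("WT_ctrl", "WT_treated", "mut_ctrl", "mut_treated"),
  strain    = c("WT", "WT", "mutant", "mutant"),
  treatment = c("control", "treated", "control", "treated"),
  stringsAsFactors = FALSE
)

# Column names of the count table (here deliberately in a different order).
count_columns <- c("mut_ctrl", "WT_ctrl", "mut_treated", "WT_treated")

# Every counted sample must be described, and vice versa.
stopifnot(setequal(count_columns, sample_info$sample))

# Reorder the descriptions so that row i describes count column i.
sample_info <- sample_info[match(count_columns, sample_info$sample), ]
print(sample_info)
```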
27. QC with condition info
Clustering of the distance between samples based on
transformed counts can reveal sample errors.
(Figure panels: VST transformed, rLog transformed. The colour scale shows the distance measure between samples. Similar conditions should cluster together.)
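The idea of clustering sample distances can be sketched in base R. Note that this sketch uses log2(count + 1) as a simple stand-in for the VST or rLog transformation, and the count values and sample names are made up.

```r
# Toy count table: 5 genes (rows) x 4 samples (columns), made-up values.
counts <- matrix(
  c(100, 110, 400, 380,
     50,  45,   5,   8,
    200, 210, 190, 205,
     10,  12,  80,  75,
    300, 280, 310, 295),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5),
                  c("WT_1", "WT_2", "MUT_1", "MUT_2"))
)

# Stand-in transformation (NOT VST or rLog): log2 with a pseudocount of 1.
transformed <- log2(counts + 1)

# Euclidean distance between samples; dist() works on rows, so transpose.
sample_dist <- dist(t(transformed))

# Hierarchical clustering of the samples, cut into two groups.
clustering <- hclust(sample_dist)
groups <- cutree(clustering, k = 2)
print(round(as.matrix(sample_dist), 2))
print(groups)
```

A replicate that ends up in the group of the other condition is a candidate for a sample swap or another sample error.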
28. QC with condition info
Clustering of transformed counts can reveal sample
errors.
VST transformed
rLog transformed
29. QC with condition info
Principal component (PC) analysis makes it possible to display the samples in a 2D scatterplot based on the variability between the samples. Samples close to each other resemble each other more.
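A minimal PCA sketch in base R with prcomp. As an assumption for illustration, log2(count + 1) stands in for the VST or rLog transformation, and the counts, sample names and metadata columns (strain, day) are made up. Merging the PC scores with the sample descriptions makes it possible to colour or label the scatterplot by any recorded variable.

```r
# Toy count table: 5 genes (rows) x 4 samples (columns), made-up values.
counts <- matrix(
  c(100, 110, 400, 380,
     50,  45,   5,   8,
    200, 210, 190, 205,
     10,  12,  80,  75,
    300, 280, 310, 295),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5),
                  c("WT_1", "WT_2", "MUT_1", "MUT_2"))
)

# Hypothetical sample descriptions.
sample_info <- data.frame(
  sample = c("WT_1", "WT_2", "MUT_1", "MUT_2"),
  strain = c("WT", "WT", "mutant", "mutant"),
  day    = c("Day 1", "Day 2", "Day 1", "Day 2"),
  stringsAsFactors = FALSE
)

# Stand-in transformation (NOT VST or rLog), then PCA with samples as rows.
pca <- prcomp(t(log2(counts + 1)))

# Fraction of the variance captured by each principal component.
variance_explained <- pca$sdev^2 / sum(pca$sdev^2)

# First two PC scores per sample, joined with the metadata.
scores <- data.frame(sample = rownames(pca$x),
                     PC1 = pca$x[, 1], PC2 = pca$x[, 2])
scores <- merge(scores, sample_info, by = "sample")
print(scores)

# 2D scatterplot, labelled by sample and coloured by strain.
plot(scores$PC1, scores$PC2, col = factor(scores$strain), pch = 19,
     xlab = "PC1", ylab = "PC2")
text(scores$PC1, scores$PC2, labels = scores$sample, pos = 3)
```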
30. Collect enough metadata
Why do these resemble each other?
31. QC with condition info
During library preparation, collect as much information as possible, to add to the sample descriptions. Pay particular attention to differences between samples: e.g. day of preparation, centrifuges used, ...
32. Collect enough metadata
In the QC of the count table, you can map this additional info to the PC graph. In this case, library prep on a different day had an effect on the WT samples (batch effect).
Additional metadata: Day 1, Day 2