RNA-seq for DE analysis: extracting counts and QC - part 4

Generating the count table
and validating assumptions
RNA-seq for DE analysis training
Joachim Jacob
20 and 27 January 2014

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts hereof.

Goal
Summarize the read counts per gene from
a mapping result.
The outcome is a raw count table on
which we can perform some QC.
This table is used by the differential
expression algorithm to detect DE genes.

The challenge
'Exons' are the type of features used here.
They are summarized per 'gene'

Alt splicing
Overlaps no feature

Concept:
GeneA = exon 1 + exon 2 + exon 3 + exon 4 = 215 reads
GeneB = exon 1 + exon 2 + exon 3 = 180 reads
No normalization yet! Just pure counts, aka 'raw counts'.
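A minimal sketch of this concept in Python. The per-exon numbers are invented for illustration; only the gene totals (215 and 180) come from the slide, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical per-exon read counts; only the gene totals match the slide.
exon_counts = [
    ("GeneA", "exon1", 60), ("GeneA", "exon2", 75),
    ("GeneA", "exon3", 50), ("GeneA", "exon4", 30),
    ("GeneB", "exon1", 90), ("GeneB", "exon2", 50),
    ("GeneB", "exon3", 40),
]

def summarize_per_gene(records):
    """Sum exon-level counts into one raw (unnormalized) count per gene."""
    totals = defaultdict(int)
    for gene, _exon, count in records:
        totals[gene] += count
    return dict(totals)

raw_counts = summarize_per_gene(exon_counts)
print(raw_counts)
```

Real counting tools assign each read once per gene rather than adding up exon totals, so this only illustrates the grouping idea.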

Tools to count features
● Different tools exist to accomplish this:

http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Feature_counting

Dealing with ambiguity
● We focus on the gene level: merge all counts over different isoforms into one, taking into account:
    ● Reads that do not overlap a feature, but appear in introns. Take into account?
    ● Reads that align to more than one feature (exon or transcript). Transcripts can be overlapping, perhaps on different strands. (PE and strandedness can resolve this partially.)
    ● Reads that partially overlap a feature, not following known annotations.

HTSeq count has 3 modes
HTSeq-count recommends the 'union' mode. But depending on your genome, you may opt for the 'intersection-strict' mode. Galaxy allows experimenting!

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
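The difference between the modes can be sketched with a toy function. This is not HTSeq code: it assumes the read has already been reduced to one set of overlapping feature names per aligned base, and the function name and return labels are made up for illustration.

```python
def assign_read(position_features, mode="union"):
    """Toy version of the three overlap modes.

    position_features: one set of feature names per aligned base of the read.
    Returns a feature name, "no_feature" or "ambiguous".
    """
    sets = [set(s) for s in position_features]
    if mode == "union":
        # Every feature touched by any base of the read.
        candidates = set().union(*sets)
    elif mode == "intersection-strict":
        # Only features that cover every base of the read.
        candidates = set.intersection(*sets) if sets else set()
    elif mode == "intersection-nonempty":
        # As strict, but bases that overlap no feature are ignored.
        nonempty = [s for s in sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError("unknown mode: " + mode)
    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))

# A read that hangs over the edge of gene A into unannotated sequence.
partial = [{"A"}, {"A"}, set()]
# A read that starts in gene A and ends where genes A and B overlap.
overlap = [{"A"}, {"A", "B"}]

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign_read(partial, mode), assign_read(overlap, mode))
```

The two example reads show why the choice matters: the partially overlapping read is lost in the strict mode, while the read touching two genes is ambiguous only in the union mode.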

Indicate the SE or PE nature of your data
(note: mate-pair is not
appropriate naming here)
The annotation file with the coordinates
of the features to be counted
mode
Reverse stranded: check with mapping viz
Check with mapping QC (see earlier)
For RNA-seq DE we summarize over
'exons' grouped by 'gene_id'. Make sure
these fields are correct in your GTF file.
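A quick way to check these fields is to parse a few lines of the GTF file yourself. The sketch below assumes the standard 9-column, tab-separated GTF layout; the example line, its IDs and the helper name are made up.

```python
import re

# A made-up GTF line (9 tab-separated columns); the IDs are placeholders.
gtf_line = (
    "chrI\texample\texon\t100\t250\t.\t+\t.\t"
    'gene_id "GENE0001"; transcript_id "GENE0001.1";'
)

def exon_gene_id(line, feature_type="exon", id_attribute="gene_id"):
    """Return the grouping ID of a GTF line, or None if it is not counted."""
    if line.startswith("#"):
        return None  # comment or header line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != feature_type:
        return None  # malformed line or a different feature type
    pattern = r"\b" + re.escape(id_attribute) + r' "([^"]+)"'
    match = re.search(pattern, fields[8])
    return match.group(1) if match else None

print(exon_gene_id(gtf_line))
```

If this returns None for most exon lines of your annotation, the feature type or the ID attribute probably differs from what the counting tool expects.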

Resulting count table column

One sample!

Merging to create experiment count table
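Merging the per-sample columns could look roughly like this. The sample names, gene names and counts are placeholders, and treating a gene that is missing from a sample as 0 is an assumption of this sketch.

```python
def merge_count_columns(per_sample):
    """Combine {sample: {gene: count}} into rows of one experiment table.

    Genes missing from a sample get 0; the first row is the header.
    """
    samples = list(per_sample)
    genes = sorted({g for counts in per_sample.values() for g in counts})
    table = [["gene"] + samples]
    for gene in genes:
        table.append([gene] + [per_sample[s].get(gene, 0) for s in samples])
    return table

# Hypothetical per-sample columns (sample and gene names are placeholders).
per_sample = {
    "sample1": {"geneA": 215, "geneB": 180},
    "sample2": {"geneA": 190, "geneB": 201, "geneC": 3},
}
for row in merge_count_columns(per_sample):
    print("\t".join(str(v) for v in row))
```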

Quality control of count table
Relative numbers

Absolute numbers

In the end, we used about 70% of the reads. Check for your experiment.
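To check this for your own experiment, compare the counts assigned to genes with the total. The sketch assumes the count column also holds special counters whose names start with '__' (such as '__no_feature'); depending on the tool version the naming may differ. All numbers are invented.

```python
def fraction_counted(column):
    """Return the fraction of one sample's reads that were assigned to genes.

    Assumes special counters are named with a leading '__' (for example
    '__no_feature'); adapt the test if your tool names them differently.
    """
    assigned = sum(n for name, n in column.items() if not name.startswith("__"))
    total = sum(column.values())
    return assigned / total if total else 0.0

# Invented numbers chosen so that 70% of the reads are assigned to genes.
column = {
    "geneA": 400, "geneB": 300,
    "__no_feature": 150, "__ambiguous": 50, "__alignment_not_unique": 100,
}
print(f"{fraction_counted(column):.0%} of reads counted")
```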

Quality control of count table
2 types of QC:
● General metrics
● Sample-specific quality control

QC: general metrics
● General numbers

QC: general metrics
Which genes are most highly present?
Which fractions do they occupy?
Gene

Counts

42 genes (0.63%) of the 6665 genes take 25% of all counts.
This graph can be
constructed from
the count table.
TEF1alpha, putative ribo prot,...
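The same number can be derived from any count column: sort the genes by count and accumulate until the chosen fraction is reached. A small sketch with invented toy counts (the function name is hypothetical); the last line only redoes the arithmetic for 42 out of 6665 genes.

```python
def genes_for_fraction(counts, fraction=0.25):
    """Return how many of the highest-count genes reach `fraction` of all counts."""
    total = sum(counts)
    running = 0
    for n_genes, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= fraction * total:
            return n_genes
    return len(counts)

# Invented toy counts: one dominant gene and nine modest ones.
toy_counts = [500, 100, 100, 100, 50, 50, 40, 30, 20, 10]
n_top = genes_for_fraction(toy_counts, 0.25)
print(n_top, "of", len(toy_counts), "genes =", f"{n_top / len(toy_counts):.1%}")

# 42 of 6665 genes, expressed as a percentage.
print(f"{42 / 6665:.2%}")
```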

QC: general metrics
● We can plot the counts per sample: filter out the zeros and transform to log2.

The bulk of the genes have counts
in the hundreds.

Few are extremely highly expressed
A minority have extremely low counts
log2(count)
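A rough text version of such a plot, as a sketch: zeros are removed, the remaining counts are log2-transformed, and the values are binned. This is a histogram rather than a smoothed density, and the counts are invented.

```python
import math
from collections import Counter

def log2_histogram(counts):
    """Drop zero counts, take log2, and bin by the integer part of log2(count)."""
    logs = [math.log2(c) for c in counts if c > 0]
    return Counter(math.floor(v) for v in logs)

# Invented counts for one sample; zeros are filtered out before the log.
sample_counts = [0, 0, 1, 3, 120, 250, 300, 512, 900, 40000]
hist = log2_histogram(sample_counts)
for bin_start in sorted(hist):
    print(f"log2 in [{bin_start}, {bin_start + 1}): {'#' * hist[bin_start]}")
```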

QC: log2 density graph
● We can do this for all samples, and merge
All samples show
nice overlap, peaks
are similar

Strange deviation here

QC: log2 merging samples
Here, we take one sample,
plot the log2 density
graph, add the counts of
another sample, and plot
again, add the counts of
another sample, etc. until
we have merged all
samples.
We see a horizontal shift
of the graph, rather than a
vertical shift, pointing to
no saturation.

QC: rarefaction curve
What is the number
of total detected
features, how does
the feature space
increase with each
additional sample
added?
There should be
saturation, but
here there is none.
Code:
library(ggplot2)
ggplot(data = nonzero_counts, aes(x = total, y = counts)) +
  geom_line() +
  labs(x = "total number of sequenced reads",
       y = "number of genes with counts > 0")

Sample A
Sample A + sample B
Sample A + sample B + sample C
Etc.
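The points of such a curve can be computed by merging the samples one at a time. The sketch below is one possible way to build the two values plotted above (`total` and `counts`); how the original data frame was constructed is not shown in the slides, so this is an assumption. Note that the total here is the number of counted reads, and all numbers are invented.

```python
def rarefaction_points(samples):
    """Cumulatively merge samples; return (total reads, genes with count > 0)."""
    merged = {}
    points = []
    for counts in samples:
        for gene, n in counts.items():
            merged[gene] = merged.get(gene, 0) + n
        total = sum(merged.values())
        detected = sum(1 for n in merged.values() if n > 0)
        points.append((total, detected))
    return points

# Invented counts for samples A, B and C.
samples = [
    {"g1": 10, "g2": 0, "g3": 5, "g4": 0},
    {"g1": 8, "g2": 2, "g3": 4, "g4": 0},
    {"g1": 12, "g2": 1, "g3": 6, "g4": 0},
]
print(rarefaction_points(samples))
```

When the number of detected genes stops increasing while the total keeps growing, the curve has reached saturation.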

QC: rarefaction curve
rRNA genes

Saturation: OK!

QC: transformations for viz

Regularized log (rLog) and 'Variance Stabilizing Transformation'
(VST) as alternatives to log2.
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html

QC: count transformations
Not normalizations!
● Techniques used for microarrays can be applied to VST-transformed counts.
Log2

http://www.biomedcentral.com/1471-2105/14/91

rLog

VST

http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html

QC including condition info
● We can also include condition information, to interpret our QC better.
● For this, we need to gather sample information.
Make a separate file in which sample info is provided (metadata)

QC with condition info

What are the differences in
counts in each sample
dependent on? Here: counts are
dependent on the treatment
and the strain. Must match
the sample descriptions file.
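A simple consistency check between the count table and the sample descriptions might look like this. The file layout, the column names (`sample`, `strain`, `treatment`) and the sample names are assumptions made for the example.

```python
import csv
import io

# Hypothetical sample description file; column names are assumptions.
sample_sheet = io.StringIO(
    "sample,strain,treatment\n"
    "s1,WT,control\n"
    "s2,WT,treated\n"
    "s3,mutant,control\n"
)
count_table_columns = ["s1", "s2", "s3", "s4"]

def check_metadata(sheet, columns):
    """Return (samples lacking metadata, metadata rows lacking counts)."""
    described = [row["sample"] for row in csv.DictReader(sheet)]
    missing_metadata = [c for c in columns if c not in described]
    missing_counts = [s for s in described if s not in columns]
    return missing_metadata, missing_counts

print(check_metadata(sample_sheet, count_table_columns))
```

Both lists should be empty before continuing with the QC; here sample s4 has counts but no description.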

Clustering of the distance between samples based on
transformed counts can reveal sample errors.

VST transformed

Colour scale of the distance measure between samples. Similar conditions should cluster together.

rLog transformed
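The distance measure behind such a plot can be sketched as follows. To stay self-contained, the example uses log2(count + 1) as a simple stand-in for the VST or rLog transformation, so the values will differ from those in the figures; sample names and counts are invented.

```python
import math

def sample_distances(table):
    """Euclidean distances between samples on log2(count + 1) values.

    `table` maps sample name -> list of counts in the same gene order.
    log2(count + 1) is a simple stand-in here, not VST or rLog.
    """
    logged = {s: [math.log2(c + 1) for c in counts] for s, counts in table.items()}
    names = list(logged)
    return {
        (a, b): math.dist(logged[a], logged[b])
        for a in names for b in names
    }

# Invented counts: two control replicates and one treated sample.
table = {
    "ctrl_1": [100, 20, 0, 300],
    "ctrl_2": [110, 18, 1, 280],
    "treated_1": [10, 200, 50, 310],
}
dist = sample_distances(table)
print(round(dist[("ctrl_1", "ctrl_2")], 2), round(dist[("ctrl_1", "treated_1")], 2))
```

In this toy table the two control replicates are closer to each other than to the treated sample, which is the pattern expected when samples are labelled correctly.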

Clustering of transformed counts can reveal sample
errors.

VST transformed

rLog transformed

Principal component (PC) analysis makes it possible to display
the samples in a 2D scatterplot based on the variability
between the samples. Samples close to each other
resemble each other more.

Collect enough metadata

Why do
these resemble
each other?

During library preparation, collect as much
information as possible, to add to the sample
descriptions. Pay particular attention to differences
between samples: e.g. day of preparation,
centrifuges used, ...

In the QC of the count table, you can map this
additional info to the PC graph. In this case, library
prep on a different day had an effect on the WT
samples (batch effect).

Day 1
Day 2

Additional metadata

Next step
Now that we know our data inside out, we
can run a DE algorithm on the count table!

Keywords
Raw counts
VST

Write in your own words what the terms mean
