The document discusses issues and pitfalls with using publicly available RNA-seq data. It was presented by Mikael Huss from the SciLifeLab and Stockholm University at RNA-Seq Europe in Basel. Huss works with a team of bioinformaticians at SciLifeLab to analyze new sequencing data and put it into context with existing information to ensure the data makes sense. The presentation addresses how to evaluate RNA-seq data quality and compare it to array data.
RNA-Seq data issues and pitfalls
1. Using publicly available RNA-seq data
Issues and pitfalls
RNA-Seq Europe, Basel
4 December, 2013
Mikael Huss
WABI, SciLifeLab / Stockholm University
6. Human tissue RNA-seq data sets
Project name | GEO / AE accession | Representative publication | FASTQ | BAM | Counts | R/FPKM
BodyMap2.0 | GSE30611 | - | Yes | (Yes) | (Yes) | (Yes)
RNA-Seq Atlas | E-MTAB-305 | PMID: 22345621 | Yes | No | No | Yes
Evolution of Gene Expression | GSE30352 | PMID: 22012392 | Yes | No | No | No
Wang | GSE12946 | PMID: 18978772 | Yes | No | No | Yes
GTex | - | - | No | No | Yes | Yes
Human Protein Atlas (SciLifeLab/KTH) | - | - | No | No | No | No
FANTOM5 (RIKEN) | - | - | No | No | No | No
Coming soon!
8. Comparing RNA-seq expression profiles
How reproducible are RNA-seq expression profiles?
What is the relative importance of different factors?
- Cross-study biases, batch effects
- Bioinformatics pipelines
[Diagram: lab steps (RNA extraction, sequencing library preparation, sequencing) followed by bioinformatics steps (mapping, quantification, DE or other downstream analysis)]
10. Comparing RNA-seq expression profiles
[Diagram: lab steps (RNA extraction, sequencing library preparation, sequencing) followed by bioinformatics steps (mapping, quantification, DE or other downstream analysis)]
“In this paper, we have demonstrated that technical variation in RNA-seq experiments is small and that results from RNA-seq experiments performed in different laboratories are consistent. This conclusion is valid as long as all participating laboratories use the exact same protocols […] and versions of sample preparation and sequencing kits.”
14. What about different samples and pipelines?
We wanted to know how consistent expression profiles are between studies
Also, how much the processing pipeline (mapping, quantification etc.) affects results
Decided to look at publicly available human tissue RNA-seq data
We want the samples from different tissues and studies to cluster by tissue rather than study, and we want the genes we identify as tissue-specific to be found in independent catalogues of tissue-specific genes
15. First try
Let’s cluster reported R/FPKM values from some processed public data sets!
A hassle to combine IDs!
[Figure: clustered heatmap of the samples WangHeart, WangBrain, GTexRandomBrain1, GTexRandomBrain2, GTexRandomHeart1, GTexRandomHeart2, AtlasHeart and AtlasBrain; color scale 0.2 to 1; one sample is annotated "Actually hypothalamus"]
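Combining reported values from several portals means reconciling gene identifiers before anything can be clustered. The block below is a minimal sketch of that step, assuming each sample has already been reduced to a mapping from a shared gene identifier to a reported R/FPKM value. The sample names and numbers are made up, and Pearson correlation on log2(x + 1) is just one possible similarity measure, not necessarily the one behind the figure.

```python
import math


def shared_log_profiles(datasets, pseudocount=1.0):
    """Keep only genes present in every sample and log2-transform the values."""
    shared = sorted(set.intersection(*(set(d) for d in datasets.values())))
    profiles = {
        name: [math.log2(values[gene] + pseudocount) for gene in shared]
        for name, values in datasets.items()
    }
    return shared, profiles


def pearson(x, y):
    """Plain Pearson correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_matrix(profiles):
    """All pairwise correlations, keyed by (sample_a, sample_b)."""
    names = sorted(profiles)
    return {(a, b): pearson(profiles[a], profiles[b]) for a in names for b in names}


# Hypothetical toy input: gene ID -> reported R/FPKM, one dict per sample.
toy = {
    "StudyA_brain": {"G1": 120.0, "G2": 3.0, "G3": 0.5, "G4": 40.0},
    "StudyB_brain": {"G1": 95.0, "G2": 5.0, "G3": 1.0, "G4": 22.0, "G5": 7.0},
    "StudyA_heart": {"G1": 2.0, "G2": 150.0, "G3": 60.0, "G4": 35.0},
}
genes, profiles = shared_log_profiles(toy)
corr = correlation_matrix(profiles)
for pair in [("StudyA_brain", "StudyB_brain"), ("StudyA_brain", "StudyA_heart")]:
    print(pair, round(corr[pair], 3))
```

Genes missing from any one sample (G5 here) are dropped, which is the simplest policy; with real portals the harder part is mapping between identifier systems before this intersection is taken.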
16. Human tissue RNA-seq data sets
Project name | GEO / AE accession | Representative publication | FASTQ | BAM | Counts | R/FPKM
BodyMap2.0 | GSE30611 | - | Yes | (Yes) | (Yes) | (Yes)
RNA-Seq Atlas | E-MTAB-305 | Krupp et al. (2012), PMID: 22345621 | Yes | No | No | Yes
Evolution of Gene Expression Levels in Mammalian Organs | GSE30352 | Brawand et al. (2011), PMID: 22012392 | Yes | No | No | No
Wang | GSE12946 | Wang et al. (2008), PMID: 18978772 | Yes | No | No | Yes
GTex | - | Lonsdale et al. (2013), PMID: 23715323 | No | No | Yes | Yes
Human Protein Atlas (SciLifeLab/KTH) | - | - | No | No | No | No
FANTOM5 (RIKEN) | - | - | No | No | No | No
Coming soon!
22. Cufflinks was used to calculate FPKMs
By default, Cufflinks skips loci with >1,000,000 alignments and reports them with FPKM=0 and a HIDATA flag
Fix: Look for HIDATA in the output and, if you find any, rerun Cufflinks with a higher --max-bundle-frags value
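Checking for the flag can be scripted. This is a minimal sketch, assuming the Cufflinks output is a tab-separated tracking table with a header row containing `tracking_id`, `locus` and a status column whose name ends in `_status`; those column names are assumptions about the file layout, and the two-row table below is made up.

```python
import csv
import io


def find_flagged_loci(tracking_file, flag="HIDATA"):
    """Return (tracking_id, locus) for every row whose status column equals `flag`."""
    reader = csv.DictReader(tracking_file, delimiter="\t")
    # Assumed layout: one or more columns named like "FPKM_status".
    status_columns = [c for c in reader.fieldnames if c.endswith("_status")]
    flagged = []
    for row in reader:
        if any(row[c] == flag for c in status_columns):
            flagged.append((row["tracking_id"], row["locus"]))
    return flagged


# Hypothetical, heavily trimmed tracking table for illustration only.
example = io.StringIO(
    "tracking_id\tgene_short_name\tlocus\tFPKM\tFPKM_status\n"
    "GENE_A\tA\tchr1:100-900\t12.5\tOK\n"
    "GENE_B\tB\tchrM:1-16000\t0\tHIDATA\n"
)
flagged = find_flagged_loci(example)
print(flagged)
```

With a real file, pass an open file handle instead of the `io.StringIO` object. Any hit means the FPKM=0 for that locus is an artifact of the skipped bundle, not a measurement.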
23. Re-process from FASTQ
Benchmark to find out if we can obtain a sensible clustering of samples from different sources.
Get brain, kidney, heart data in RNA-Seq Atlas, BodyMap, Evolution of Gene Expression
Tips:
The SRAdb package in Bioconductor can be very handy for obtaining FASTQ files if you like R
InSilicoDb also has a decent interface to GEO/ArrayExpress
26. Re-process from FASTQ
Benchmark to find out if we can obtain a sensible clustering of samples from different sources.
Get brain, kidney, heart data in RNA-Seq Atlas, BodyMap, Evolution of Gene Expression
Map to genome/transcriptome using TopHat and STAR
Quantify with HTSeq (counts) or Cufflinks (FPKM)
DE analysis with edgeR/limma/CuffDiff and compare with an external set of tissue-specific genes
30. RNA-seq normalization: different goals
- R/FPKM: (Mortazavi et al. 2008)
  - Correct for: differences in sequencing depth and transcript length
  - Aiming to: compare a gene across samples and different genes within a sample
- TMM: (Robinson and Oshlack 2010)
  - Correct for: differences in transcript pool composition; extreme outliers
  - Aiming to: provide better across-sample comparability
- TPM: (Li et al. 2010, Wagner et al. 2012)
  - Correct for: transcript length distribution in RNA pool
  - Aiming to: provide better across-sample comparability
- limma voom (logCPM): (Law et al. 2013)
  - Aiming to: stabilize variance; remove dependence of variance on the mean
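The arithmetic behind three of these units can be written out directly. This is a minimal sketch for a single sample, assuming gene-level counts and one effective length per gene; it uses the sum of the counts as the number of mapped reads, which is a simplification. The log-CPM offsets (0.5 and 1) follow the form commonly described for voom and are an assumption here. TMM is left out because it needs a reference sample and trimmed means.

```python
import math


def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts.values())
    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: c / lengths_bp[g] for g, c in counts.items()}
    denominator = sum(rate.values())
    return {g: r / denominator * 1e6 for g, r in rate.items()}


def log_cpm(counts, prior=0.5):
    """log2 counts per million with small offsets to avoid log(0)."""
    library_size = sum(counts.values())
    return {
        g: math.log2((c + prior) / (library_size + 1.0) * 1e6)
        for g, c in counts.items()
    }


# Hypothetical toy sample: three genes, counts and lengths made up.
counts = {"G1": 500, "G2": 1500, "G3": 8000}
lengths = {"G1": 1000, "G2": 3000, "G3": 2000}

fpkm_values = fpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
logcpm_values = log_cpm(counts)
print(fpkm_values)
print(tpm_values)
print(logcpm_values)
```

In this toy sample G1 and G2 get the same FPKM and TPM despite a threefold difference in counts, because G2 is three times longer. TPM values always sum to one million within a sample, whereas the FPKM sum depends on the length distribution of the expressed genes.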
33. “Cross-validation”
• If we are right in that gene expression profiles from tissues are comparable across studies when normalized in this way, it should work for a new data set that we have not used up until this point
• Introduce Wang/Sandberg data set and repeat (only brain and heart available, poly-A enriched, 1x32 bp on earliest Solexa system, ~15 Mreads mapped)
37. Concordance of differentially expressed genes with “gold standard” (TiGER)
Use samples from different studies as pseudo-replicates (no biol reps within the studies)
Look at top 100 genes (in terms of FDR) for each tool
Tissue | Method | # genes with FDR < 1% | Overlap with TiGER
Brain | CuffDiff2 | 208 | 22/100
Brain | limma | 2649 | 26/100
Brain | edgeR | 4249 | 37/100
Heart | CuffDiff2 | 42 | 18/100
Heart | limma | 576 | 46/100
Heart | edgeR | 2438 | 32/100
Kidney | CuffDiff2 | 82 | 24/100
Kidney | limma | 1517 | 60/100
Kidney | edgeR | 2688 | 59/100
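The two numbers reported per tool can be computed from a ranked result list. This is a minimal sketch, assuming each tool's output has been reduced to (gene, FDR) pairs and the reference catalogue to a set of gene identifiers. The function names and the toy values are hypothetical, and ties in FDR are broken by input order.

```python
def count_significant(results, fdr_cutoff=0.01):
    """Number of genes with FDR below the cutoff."""
    return sum(1 for _, fdr in results if fdr < fdr_cutoff)


def top_n_overlap(results, reference_genes, n=100):
    """Overlap between the n genes with the lowest FDR and a reference set.

    Returns (hits, genes_considered); genes_considered can be smaller
    than n when fewer results are available.
    """
    ranked = sorted(results, key=lambda item: item[1])
    top = {gene for gene, _ in ranked[:n]}
    return len(top & set(reference_genes)), len(top)


# Hypothetical toy input; real use would take the top 100 per tool and tissue.
toy_results = [("G1", 0.0001), ("G2", 0.2), ("G3", 0.004), ("G4", 0.03), ("G5", 0.0009)]
toy_reference = {"G1", "G4", "G9"}

significant = count_significant(toy_results)
hits, considered = top_n_overlap(toy_results, toy_reference, n=3)
print(significant, hits, considered)
```

Fixing the list length for every tool, as the slide does with the top 100, keeps the overlap comparable even when the tools call very different numbers of genes significant.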
38. [Figure, two panels: "Top 100 genes' overlap with TiGER" and "Top 100 genes' overlap with SpeCond"; x-axis: brain, heart, kidney; y-axis: 0 to 100; series: limma, edgeR, CuffDiff]
limma and edgeR can include additional explanatory factors (such as study)
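Including study as an explanatory factor amounts to adding columns to the design matrix. This is a minimal sketch of a treatment-coded design with an intercept, assuming one tissue-of-interest indicator and three studies. The sample annotations are made up, and the column naming only imitates the style of R model matrices. It illustrates the idea and is not the limma or edgeR API.

```python
def design_matrix(samples, factors):
    """Build a treatment-coded design matrix with an intercept.

    samples: list of dicts mapping factor name -> level.
    factors: factor names to include, in column order.
    The alphabetically first level of each factor is the baseline.
    """
    columns = ["(Intercept)"]
    levels = {}
    for factor in factors:
        levels[factor] = sorted({s[factor] for s in samples})
        columns += [f"{factor}{level}" for level in levels[factor][1:]]
    rows = []
    for s in samples:
        row = [1]
        for factor in factors:
            row += [1 if s[factor] == level else 0 for level in levels[factor][1:]]
        rows.append(row)
    return columns, rows


# Hypothetical annotations: brain vs. other tissue, in three studies A, B, C.
samples = [
    {"tissue": "brain", "study": "A"},
    {"tissue": "other", "study": "A"},
    {"tissue": "brain", "study": "B"},
    {"tissue": "other", "study": "B"},
    {"tissue": "brain", "study": "C"},
    {"tissue": "other", "study": "C"},
]
columns, rows = design_matrix(samples, ["tissue", "study"])
print(columns)
for row in rows:
    print(row)
```

The study columns absorb study-wide shifts in expression, so the tissue coefficient is estimated from within-study differences.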
40. Some observations
Human tissue RNA-seq data sets from different sources seem to be fairly comparable at a global level, but proper normalization is important.
It seems easier to compare results of count-based workflows than, e.g., Cufflinks FPKMs, although the latter are theoretically more correct.
The alignment program used, however, does not seem to make a crucial difference.
The ability to include additional explanatory factors in differential gene expression analysis (edgeR, DESeq, limma) is important for sensitivity.
41. Recommendations
1. Reprocess the data from FASTQ in a consistent way. (Corollary: beware of values reported in portals for processed expression data.)
2. If using Cufflinks, be aware of the HIDATA flag and use the --max-bundle-frags option to avoid the issue.
3. To compare your samples on a global scale, you may want to start from counts and try both a scaling normalization like TMM and a log-like transform (e.g. log-CPM).
4. To compare FPKM values from e.g. Cufflinks, try including only protein-coding genes, filter to genes with a mean FPKM>1 or high variance, and take logs.
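The filtering in recommendation 4 can be sketched as follows, assuming a table of FPKM values per gene across samples and a separate gene-to-biotype annotation. The `protein_coding` label, the pseudocount of 1 before taking logs and the toy values are assumptions. Only the mean-FPKM filter is shown; the high-variance alternative is omitted.

```python
import math


def filter_and_log(fpkm_table, biotype, min_mean=1.0, pseudocount=1.0):
    """Keep protein-coding genes with mean FPKM above min_mean; return log2 values.

    fpkm_table: gene -> list of FPKM values, one per sample.
    biotype: gene -> annotation label (assumed to use "protein_coding").
    """
    kept = {}
    for gene, values in fpkm_table.items():
        if biotype.get(gene) != "protein_coding":
            continue
        if sum(values) / len(values) <= min_mean:
            continue
        kept[gene] = [math.log2(v + pseudocount) for v in values]
    return kept


# Hypothetical toy table: two samples, three genes.
fpkm_table = {"G1": [10.0, 30.0], "G2": [0.2, 0.4], "G3": [50.0, 70.0]}
biotype = {"G1": "protein_coding", "G2": "protein_coding", "G3": "lincRNA"}

filtered = filter_and_log(fpkm_table, biotype)
print(filtered)
```

G2 is dropped for low expression and G3 for its biotype, leaving only G1 for downstream clustering.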