SlideShare una empresa de Scribd logo
1 de 34
RNA-seq quality control and pre-processing
AllBio workshop
January 8, 2014
Mikael Huss, SciLifeLab, Sweden
RNA-seq quality control and pre-
processing
- Generic high-throughput sequencing QC tools (e g FastQC, PRINSEQ)
- RNA-seq specific QC tools (e g RSeQC, RNASeQC)
- Pre-mapping QC (sequence qualities, sequence overrepresentation)
- Pre-processing (trimming etc)
- Post-mapping QC (distribution of mapped regions, contamination etc)
FastQC
FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
Also check out PRINSEQ (http://prinseq.sourceforge.net)
Sequence quality score plots Sequence overrepresentation plots
RSeQC (https://code.google.com/p/rseqc/)
Trimming
- Adapter trimming
• May increase mapping rates
• Absolutely essential for small RNA
• Probably improves de novo assemblies
- Quality trimming
• May increase mapping rates
• May also lead to loss of information
Lots of software doing either of these or both. E g Cutadapt, Trim Galore!, PRINSEQ,
Trimmomatic, Sickle/Scythe, FASTX Toolkit, etc.
Adapter trimming
Illumina TruSeq DNA Adapters De-Mystified by James Schiemer
http://tucf-genomics.tufts.edu/documents/protocols/TUCF_Understanding_Illumina_TruSeq_Adapters.pdf
Seq. starts here
Most common case
DNA fragment of interest shorter than read length
100-bp read
Short fragment of interest
Will always happen for e g miRNA
Quality trimming
Rationale:
Erroneous base calls (often towards the ends of reads
but also in the beginning) can have a detrimental
effect on
• de novo assembly (spurious paths and
bubbles in the assembly graph  increased
memory consumption and complexity)
• mapping rates for reference based analysis
• variant calling
Assume that the reported quality values (QVs) for
these erroneous base calls will be low. Therefore you
want to trim away regions with average QVs below
some threshold.
One way to quality trim
BWA, CutAdapt, CLC Bio and many others use slightly different versions of “PHRED
trimming”, or the so-called “modified Mott algorithm”.
The basic idea is to trim from either the 3’ end, or both the 3’ and 5’ end, and keep
track of a running sum of deviations from the threshold (negative if the base has
lower quality than the cutoff, positive if higher). The read is trimmed where this sum
is minimal.
If the trimmed sequence is too short (e.g. <30 bp), it is discarded.
So, (at least) 2 user defined parameters:
- quality score cutoff
- min length of sequence to keep
Details of the Mott algorithm plus several other trimming methods are given in
http://research.bioinformatics.udel.edu/genomics/ngsShoRT/download/advanced_user_guide.pdf
Is trimming beneficial?
Two recent papers + a blog post: http://genomebio.org/is-trimming-is-beneficial-in-rna-seq/
Del Fabbro C et al (2013) An Extensive Evaluation of Read
Trimming Effects on Illumina NGS Data Analysis. PLoS ONE
8(12): e85024. doi:10.1371/journal.pone.0085024
“trimming is beneficial in RNA-Seq, SNP identification
and genome assembly procedures, with the best effects
evident for intermediate quality thresholds (Q between
20 and 30)”
“Although very aggressive quality trimming is common, this
study suggests that a more gentle trimming, specifically of
those nucleotides whose Phred score < 2 or < 5, is optimal
for most studies across a wide variety of metrics.”
MacManes MD (2013)
On the optimal trimming of high-throughput mRNAseq
data doi: 10.1101/000422
Erroneous bases
in assembly
# complete exons
Software comparison, RNA/DNA-Seq Assembly-oriented, RNA-seq only
Some comments on software
I do not always trim – just in cases where it appears to improve results
TrimGalore - wrapper to CutAdapt with both quality and adapter trimming
(my own choice as it works smoothly with paired-end reads as well)
Trimmomatic – quality and adapter trimming, lots of options
Scythe (adapter) – Sickle (quality) trimming combo
Beginnings of reads
Bias in sequence composition is often
(always?) seen in the first 12-15 bp in
Illumina RNA-seq data sets
Not clear if trimming the 5’ helps here.
According to an authoritative source you should always remove the first base and preferably a couple of
more bases afterwards  (I have not personally done this so far)
Thought to be due to issues with “random” hexamer priming
Hansen et al. (2010) Biases in Illumina transcriptome sequencing caused by random hexamer priming
Nucleic Acids Res. 2010 July; 38(12): e131. doi: 10.1093/nar/gkq224
Poly-A tails
Seldom captured in Illumina HiSeq runs
Could complicate mapping & lead to false positive hits in sequence databases
PRINSEQ low-complexity filter
EMBOSS TrimEST http://emboss.sourceforge.net/apps/cvs/emboss/apps/trimest.html
(etc).
GC bias
(disclaimer – I have never adjusted for this!)
Risso D et al. (2011) GC-Content
Normalization for RNA-Seq Data. BMC
Bioinformatics, 12:480 doi:10.1186/1471-
2105-12-480
“We […] demonstrate the existence of strong sample-
specific GC-content effects on RNA-Seq read counts, which
can substantially bias differential expression analysis”
Roberts A et al. (2011) Improving RNA-Seq
expression estimates by correcting for
fragment bias. Genome Biology, 12:R22
doi:10.1186/gb-2011-12-3-r22
+ several other papers
“The biochemistry of RNA-Seq library preparation results in
cDNA fragments that are not uniformly distributed within
the transcripts they represent. This non-uniformity must be
accounted for when estimating expression levels …”
CQN package for R (BioConductor)
http://www.bioconductor.org/packages/
2.13/bioc/html/cqn.html
Post-mapping QC
- Contamination
- Duplicates
- Genomic features covered
What’s lurking in your data?
Screen for contaminating genomes, vectors, adapter sequences
FastQ Screen: http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/
Poor man’s version: simply BLAST (e.g.) 1000 random sequences against nt
What’s lurking in your data?
Could also be done pre-mapping
PRINSEQ
Dinucleotide frequencies
Comparing metagenomes
Contamination by similar genomes
Need to look at the distribution of the number of mismatches per alignment (e g NM:i:
attribute in the BAM/SAM file)
Duplicate sequences
Observing identical sequences in a sequencing run could result from
-Genuine, multiple observations of the same sequence from different source molecules
-Amplification from PCR steps in library preparation or sequencing
-Optical duplicates
-Exhausting the library; sequencing the same molecule several times
Note:
For resequencing applications (whole-genome, exome sequencing) it is standard
practice to remove duplicate sequences. For RNA-seq, things are more complicated.
Duplicates are usually removed after mapping because it is simple. E g look for paired-
end reads where both mates map to the same coordinates.
Duplicates and RNA-seq
# of sequences taking up X% of the sequences
HBA1 HBB HBA2
Duplicates and RNA-seq
Millions of reads mapping to a single short transcript  will look like a LOT of duplicates!
(this also happens with rRNA)
Thus, highly expressed transcripts having lots of “duplicate” reads is normal!
dupRadar (H. Klein et al)
http://sourceforge.net/projects/dupradar/
Predicting library complexity
Daley T and Smith AD. Predicting the molecular complexity of sequencing libraries.
Nature Methods 10, 325–327 (2013) doi:10.1038/nmeth.2375
RSeQC (https://code.google.com/p/rseqc/)
Gene body coverage
RNA quality affects the shape
If this profile has strange spikes, there may be extreme overrepresentation of sequences
Benjamin
Sigurgeirsson
Mapping to genomic features
RSeQC plot
Insert size distribution
Negative insert size implies overlapping mate reads
For assembly, might want to join overlapping mates into “pseudo-single-end reads”
I have used FLASH; other tools mentioned here
http://thegenomefactory.blogspot.se/2012/11/tools-to-merge-overlapping-paired-
end.html
Clustering to check for outliers and
batch effects
Cluster according to tissue Cluster according to prep or sequencing batch
●●●
●
●
●
●
●
●
Cufflinks FPKM
PC1
PC2
●
●
●
●
●
●
●
●
●
●
●
●
(edgeR) TMM
PC1
PC2
●
●
●
●
●
●
●
●
●
●
●
●
(limma) logCPM
PC1
PC2
●
●●
●
●
●
●
●
●
●
●
●
logCPM−TMM
PC1
PC2
●
●
●
Red – brain
Blue – heart
Black – kidney
Circles – Study 1
Triangles – Study 2
Squares – Study 3
… or PCA plots
But it can be tricky
because a lot depends
on the normalization
- R/FPKM: (Mortazavi et al. 2008)
- Correct for: differences in sequencing depth and transcript length
- Aiming to: compare a gene across samples and diff genes within sample
- TMM: (Robinson and Oshlack 2010)
- Correct for: differences in transcript pool composition; extreme outliers
- Aiming to: provide better across-sample comparability
- TPM: (Li et al 2010, Wagner et al 2012)
- Correct for: transcript length distribution in RNA pool
- Aiming to: provide better across-sample comparability
- Limma voom (logCPM): (Lawet al 2013)
- Aiming to: stabilize variance; remove dependence of variance on the mean
Normalization: different goals
TPM – Transcripts Per Million
Blog post that explains how it works
http://blog.nextgenetics.net/?e=51
A slightly modified RPKM measure that
accounts for differences in gene length
distribution in the transcript population
TMM – Trimmed Mean of M values
Attempts to correct for differences in RNA composition between samples
E g if certain genes are very highly expressed in one tissue but not another, there will be less
“sequencing real estate” left for the less expressed genes in that tissue and RPKM normalization
(or similar) will give biased expression values for them compared to the other sample
Robinson and Oshlack Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
RNA population 1 RNA population 2
Equal sequencing depth -> orange and red will get lower RPKM in RNA population 1 although the
expression levels are actually the same in populations 1 and 2
Across-sample comparability
Dillies et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs046
Comments on normalization
Constantly evolving area. My current recommendations:
For reporting gene expression estimates: Use TPM if possible (RSEM, Sailfish, eXpress)
For differential expression analysis: Use TMM, DESeq normalization or similar
For clustering and visualization: I prefer TMM + a log transform (limma-voom)
Questions?
Thanks to:
Thomas Svensson + the whole WABI group at SciLifeLab in Stockholm & Uppsala
Benjamin Sigurgeirsson (SciLifeLab/KTH)
Gary Schroth (Illumina)

Más contenido relacionado

La actualidad más candente

An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAGRF_Ltd
 
Rna seq and chip seq
Rna seq and chip seqRna seq and chip seq
Rna seq and chip seqJyoti Singh
 
Next generation-sequencing.ppt-converted
Next generation-sequencing.ppt-convertedNext generation-sequencing.ppt-converted
Next generation-sequencing.ppt-convertedShweta Tiwari
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
Whole Genome Sequencing Analysis
Whole Genome Sequencing AnalysisWhole Genome Sequencing Analysis
Whole Genome Sequencing AnalysisEfi Athieniti
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Sebastian Schmeier
 
Single cell RNA sequencing; Methods and applications
Single cell RNA sequencing; Methods and applicationsSingle cell RNA sequencing; Methods and applications
Single cell RNA sequencing; Methods and applicationsfaraharooj
 
Third Generation Sequencing
Third Generation Sequencing Third Generation Sequencing
Third Generation Sequencing priyanka raviraj
 
Analysis of ChIP-Seq Data
Analysis of ChIP-Seq DataAnalysis of ChIP-Seq Data
Analysis of ChIP-Seq DataPhil Ewels
 
Introduction to second generation sequencing
Introduction to second generation sequencingIntroduction to second generation sequencing
Introduction to second generation sequencingDenis C. Bauer
 
Threading modeling methods
Threading modeling methodsThreading modeling methods
Threading modeling methodsratanvishwas
 

La actualidad más candente (20)

An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
 
The STRING database
The STRING databaseThe STRING database
The STRING database
 
NGS File formats
NGS File formatsNGS File formats
NGS File formats
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
 
ILLUMINA SEQUENCE.pptx
ILLUMINA SEQUENCE.pptxILLUMINA SEQUENCE.pptx
ILLUMINA SEQUENCE.pptx
 
Rna seq and chip seq
Rna seq and chip seqRna seq and chip seq
Rna seq and chip seq
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
Next generation-sequencing.ppt-converted
Next generation-sequencing.ppt-convertedNext generation-sequencing.ppt-converted
Next generation-sequencing.ppt-converted
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
Whole Genome Sequencing Analysis
Whole Genome Sequencing AnalysisWhole Genome Sequencing Analysis
Whole Genome Sequencing Analysis
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)
 
Single cell RNA sequencing; Methods and applications
Single cell RNA sequencing; Methods and applicationsSingle cell RNA sequencing; Methods and applications
Single cell RNA sequencing; Methods and applications
 
Third Generation Sequencing
Third Generation Sequencing Third Generation Sequencing
Third Generation Sequencing
 
Analysis of ChIP-Seq Data
Analysis of ChIP-Seq DataAnalysis of ChIP-Seq Data
Analysis of ChIP-Seq Data
 
Introduction to second generation sequencing
Introduction to second generation sequencingIntroduction to second generation sequencing
Introduction to second generation sequencing
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
Threading modeling methods
Threading modeling methodsThreading modeling methods
Threading modeling methods
 
Variants of PCR
Variants of PCRVariants of PCR
Variants of PCR
 
Ngs ppt
Ngs pptNgs ppt
Ngs ppt
 

Destacado

RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2BITS
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5BITS
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1BITS
 
RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4BITS
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3BITS
 
BITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics dataBITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics dataBITS
 
BITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra toolBITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra toolBITS
 
BITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome levelBITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome levelBITS
 
Productivity tips - Introduction to linux for bioinformatics
Productivity tips - Introduction to linux for bioinformaticsProductivity tips - Introduction to linux for bioinformatics
Productivity tips - Introduction to linux for bioinformaticsBITS
 
The structure of Linux - Introduction to Linux for bioinformatics
The structure of Linux - Introduction to Linux for bioinformaticsThe structure of Linux - Introduction to Linux for bioinformatics
The structure of Linux - Introduction to Linux for bioinformaticsBITS
 
Reproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilReproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilChristian Frech
 
RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6BITS
 
Text mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformaticsText mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformaticsBITS
 
Managing your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformaticsManaging your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformaticsBITS
 
BITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry dataBITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry dataBITS
 
BITS - Comparative genomics: gene family analysis
BITS - Comparative genomics: gene family analysisBITS - Comparative genomics: gene family analysis
BITS - Comparative genomics: gene family analysisBITS
 
Differential gene expression
Differential gene expressionDifferential gene expression
Differential gene expressionDenis C. Bauer
 
RNA-seq based Genome Annotation with mGene.ngs and MiTie
RNA-seq based Genome Annotation with mGene.ngs and MiTieRNA-seq based Genome Annotation with mGene.ngs and MiTie
RNA-seq based Genome Annotation with mGene.ngs and MiTieGunnar Rätsch
 
Masurca genome assembly with super reads
Masurca  genome assembly with super readsMasurca  genome assembly with super reads
Masurca genome assembly with super readsAbdullah Khan Zehady
 

Destacado (20)

RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2RNA-seq: analysis of raw data and preprocessing - part 2
RNA-seq: analysis of raw data and preprocessing - part 2
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1
 
RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4RNA-seq for DE analysis: extracting counts and QC - part 4
RNA-seq for DE analysis: extracting counts and QC - part 4
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
 
BITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics dataBITS - Genevestigator to easily access transcriptomics data
BITS - Genevestigator to easily access transcriptomics data
 
BITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra toolBITS - Comparative genomics: the Contra tool
BITS - Comparative genomics: the Contra tool
 
BITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome levelBITS - Comparative genomics on the genome level
BITS - Comparative genomics on the genome level
 
Productivity tips - Introduction to linux for bioinformatics
Productivity tips - Introduction to linux for bioinformaticsProductivity tips - Introduction to linux for bioinformatics
Productivity tips - Introduction to linux for bioinformatics
 
The structure of Linux - Introduction to Linux for bioinformatics
The structure of Linux - Introduction to Linux for bioinformaticsThe structure of Linux - Introduction to Linux for bioinformatics
The structure of Linux - Introduction to Linux for bioinformatics
 
Reproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilReproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and Anduril
 
RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6RNA-seq for DE analysis: the biology behind observed changes - part 6
RNA-seq for DE analysis: the biology behind observed changes - part 6
 
Text mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformaticsText mining on the command line - Introduction to linux for bioinformatics
Text mining on the command line - Introduction to linux for bioinformatics
 
Managing your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformaticsManaging your data - Introduction to Linux for bioinformatics
Managing your data - Introduction to Linux for bioinformatics
 
Illumina Sequencing
Illumina SequencingIllumina Sequencing
Illumina Sequencing
 
BITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry dataBITS - Protein inference from mass spectrometry data
BITS - Protein inference from mass spectrometry data
 
BITS - Comparative genomics: gene family analysis
BITS - Comparative genomics: gene family analysisBITS - Comparative genomics: gene family analysis
BITS - Comparative genomics: gene family analysis
 
Differential gene expression
Differential gene expressionDifferential gene expression
Differential gene expression
 
RNA-seq based Genome Annotation with mGene.ngs and MiTie
RNA-seq based Genome Annotation with mGene.ngs and MiTieRNA-seq based Genome Annotation with mGene.ngs and MiTie
RNA-seq based Genome Annotation with mGene.ngs and MiTie
 
Masurca genome assembly with super reads
Masurca  genome assembly with super readsMasurca  genome assembly with super reads
Masurca genome assembly with super reads
 

Similar a RNA-seq quality control and pre-processing

Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012Dan Gaston
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisSANJANA PANDEY
 
rnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdfrnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdfPushpendra83
 
Knowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and VariantsKnowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and VariantsGolden Helix Inc
 
Fruit breedomics workshop wp6 from marker assisted breeding to genomics assis...
Fruit breedomics workshop wp6 from marker assisted breeding to genomics assis...Fruit breedomics workshop wp6 from marker assisted breeding to genomics assis...
Fruit breedomics workshop wp6 from marker assisted breeding to genomics assis...fruitbreedomics
 
RNASeq Experiment Design
RNASeq Experiment DesignRNASeq Experiment Design
RNASeq Experiment DesignYaoyu Wang
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSHAMNAHAMNA8
 
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...Jonathan Eisen
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08Computer Science Club
 
RNA-Seq_Presentation
RNA-Seq_PresentationRNA-Seq_Presentation
RNA-Seq_PresentationToyin23
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsDelaina Hawkins
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsGolden Helix Inc
 
Under the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS ResearchersUnder the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS Researchers Golden Helix Inc
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_coursehansjansen9999
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pubsesejun
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...GenomeInABottle
 
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-Seq
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-SeqNUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-Seq
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-SeqHimanshu Sethi
 

Similar a RNA-seq quality control and pre-processing (20)

Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
Tools for Transcriptome Data Analysis
Tools for Transcriptome Data AnalysisTools for Transcriptome Data Analysis
Tools for Transcriptome Data Analysis
 
rnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdfrnaseq2015-02-18-170327193409.pdf
rnaseq2015-02-18-170327193409.pdf
 
Knowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and VariantsKnowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and Variants
 
Fruit breedomics workshop wp6 from marker assisted breeding to genomics assis...
Fruit breedomics workshop wp6 from marker assisted breeding to genomics assis...Fruit breedomics workshop wp6 from marker assisted breeding to genomics assis...
Fruit breedomics workshop wp6 from marker assisted breeding to genomics assis...
 
RNASeq Experiment Design
RNASeq Experiment DesignRNASeq Experiment Design
RNASeq Experiment Design
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGS
 
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08
 
RNA-Seq_Presentation
RNA-Seq_PresentationRNA-Seq_Presentation
RNA-Seq_Presentation
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research Workflows
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research Workflows
 
20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop20140711 4 e_tseng_ercc2.0_workshop
20140711 4 e_tseng_ercc2.0_workshop
 
Under the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS ResearchersUnder the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS Researchers
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_course
 
20110524zurichngs 1st pub
20110524zurichngs 1st pub20110524zurichngs 1st pub
20110524zurichngs 1st pub
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
 
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-Seq
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-SeqNUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-Seq
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-Seq
 

Más de mikaelhuss

ML crash course
ML crash courseML crash course
ML crash coursemikaelhuss
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in Rmikaelhuss
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsmikaelhuss
 
Comparing public RNA-seq data
Comparing public RNA-seq dataComparing public RNA-seq data
Comparing public RNA-seq datamikaelhuss
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsmikaelhuss
 
Data analytics challenges in genomics
Data analytics challenges in genomicsData analytics challenges in genomics
Data analytics challenges in genomicsmikaelhuss
 

Más de mikaelhuss (6)

ML crash course
ML crash courseML crash course
ML crash course
 
Deep learning with Tensorflow in R
Deep learning with Tensorflow in RDeep learning with Tensorflow in R
Deep learning with Tensorflow in R
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomics
 
Comparing public RNA-seq data
Comparing public RNA-seq dataComparing public RNA-seq data
Comparing public RNA-seq data
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomics
 
Data analytics challenges in genomics
Data analytics challenges in genomicsData analytics challenges in genomics
Data analytics challenges in genomics
 

Último

Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxSuji236384
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Joonhun Lee
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)Areesha Ahmad
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flyPRADYUMMAURYA1
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONrouseeyyy
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 

Último (20)

Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 

RNA-seq quality control and pre-processing

  • 1. RNA-seq quality control and pre-processing AllBio workshop January 8, 2014 Mikael Huss, SciLifeLab, Sweden
  • 2. RNA-seq quality control and pre- processing - Generic high-throughput sequencing QC tools (e g FastQC, PRINSEQ) - RNA-seq specific QC tools (e g RSeQC, RNASeQC) - Pre-mapping QC (sequence qualities, sequence overrepresentation) - Pre-processing (trimming etc) - Post-mapping QC (distribution of mapped regions, contamination etc)
  • 3. FastQC FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) Also check out PRINSEQ (http://prinseq.sourceforge.net) Sequence quality score plots Sequence overrepresentation plots
  • 5. Trimming - Adapter trimming • May increase mapping rates • Absolutely essential for small RNA • Probably improves de novo assemblies - Quality trimming • May increase mapping rates • May also lead to loss of information Lots of software doing either of these or both. E g Cutadapt, Trim Galore!, PRINSEQ, Trimmomatic, Sickle/Scythe, FASTX Toolkit, etc.
  • 6. Adapter trimming Illumina TruSeq DNA Adapters De-Mystified by James Schiemer http://tucf-genomics.tufts.edu/documents/protocols/TUCF_Understanding_Illumina_TruSeq_Adapters.pdf Seq. starts here
  • 7. Most common case DNA fragment of interest shorter than read length 100-bp read Short fragment of interest Will always happen for e g miRNA
  • 8. Quality trimming Rationale: Erroneous base calls (often towards the ends of reads but also in the beginning) can have a detrimental effect on • de novo assembly (spurious paths and bubbles in the assembly graph  increased memory consumption and complexity) • mapping rates for reference based analysis • variant calling Assume that the reported quality values (QVs) for these erroneous base calls will be low. Therefore you want to trim away regions with average QVs below some threshold.
  • 9. One way to quality trim BWA, CutAdapt, CLC Bio and many others use slightly different versions of “PHRED trimming”, or the so-called “modified Mott algorithm”. The basic idea is to trim from either the 3’ end, or both the 3’ and 5’ end, and keep track of a running sum of deviations from the threshold (negative if the base has lower quality than the cutoff, positive if higher). The read is trimmed where this sum is minimal. If the trimmed sequence is too short (e.g. <30 bp), it is discarded. So, (at least) 2 user defined parameters: - quality score cutoff - min length of sequence to keep Details of the Mott algorithm plus several other trimming methods are given in http://research.bioinformatics.udel.edu/genomics/ngsShoRT/download/advanced_user_guide.pdf
  • 10. Is trimming beneficial? Two recent papers + a blog post: http://genomebio.org/is-trimming-is-beneficial-in-rna-seq/ Del Fabbro C et al (2013) An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis. PLoS ONE 8(12): e85024. doi:10.1371/journal.pone.0085024 “trimming is beneficial in RNA-Seq, SNP identification and genome assembly procedures, with the best effects evident for intermediate quality thresholds (Q between 20 and 30)” “Although very aggressive quality trimming is common, this study suggests that a more gentle trimming, specifically of those nucleotides whose Phred score < 2 or < 5, is optimal for most studies across a wide variety of metrics.” MacManes MD (2013) On the optimal trimming of high-throughput mRNAseq data doi: 10.1101/000422 Erroneous bases in assembly # complete exons Software comparison, RNA/DNA-Seq Assembly-oriented, RNA-seq only
  • 11. Some comments on software I do not always trim – just in cases where it appears to improve results TrimGalore - wrapper to CutAdapt with both quality and adapter trimming (my own choice as it works smoothly with paired-end reads as well) Trimmomatic – quality and adapter trimming, lots of options Scythe (adapter) – Sickle (quality) trimming combo
  • 12. Beginnings of reads Bias in sequence composition is often (always?) seen in the first 12-15 bp in Illumina RNA-seq data sets Not clear if trimming the 5’ helps here. According to an authoritative source you should always remove the first base and preferably a couple of more bases afterwards  (I have not personally done this so far) Thought to be due to issues with “random” hexamer priming Hansen et al. (2010) Biases in Illumina transcriptome sequencing caused by random hexamer priming Nucleic Acids Res. 2010 July; 38(12): e131. doi: 10.1093/nar/gkq224
  • 13. Poly-A tails Seldom captured in Illumina HiSeq runs Could complicate mapping & lead to false positive hits in sequence databases PRINSEQ low-complexity filter EMBOSS TrimEST http://emboss.sourceforge.net/apps/cvs/emboss/apps/trimest.html (etc).
  • 14. GC bias (disclaimer – I have never adjusted for this!) Risso D et al. (2011) GC-Content Normalization for RNA-Seq Data. BMC Bioinformatics, 12:480 doi:10.1186/1471- 2105-12-480 “We […] demonstrate the existence of strong sample- specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis” Roberts A et al. (2011) Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology, 12:R22 doi:10.1186/gb-2011-12-3-r22 + several other papers “The biochemistry of RNA-Seq library preparation results in cDNA fragments that are not uniformly distributed within the transcripts they represent. This non-uniformity must be accounted for when estimating expression levels …” CQN package for R (BioConductor) http://www.bioconductor.org/packages/ 2.13/bioc/html/cqn.html
  • 15. Post-mapping QC - Contamination - Duplicates - Genomic features covered
  • 16. What’s lurking in your data? Screen for contaminating genomes, vectors, adapter sequences FastQ Screen: http://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/ Poor man’s version: simply BLAST (e.g.) 1000 random sequences against nt
  • 17. What’s lurking in your data? Could also be done pre-mapping PRINSEQ Dinucleotide frequencies Comparing metagenomes
  • 18. Contamination by similar genomes Need to look at the distribution of the number of mismatches per alignment (e g NM:i: attribute in the BAM/SAM file)
  • 19. Duplicate sequences Observing identical sequences in a sequencing run could result from -Genuine, multiple observations of the same sequence from different source molecules -Amplification from PCR steps in library preparation or sequencing -Optical duplicates -Exhausting the library; sequencing the same molecule several times Note: For resequencing applications (whole-genome, exome sequencing) it is standard practice to remove duplicate sequences. For RNA-seq, things are more complicated. Duplicates are usually removed after mapping because it is simple. E g look for paired- end reads where both mates map to the same coordinates.
  • 20. Duplicates and RNA-seq # of sequences taking up X% of the sequences HBA1 HBB HBA2
  • 21. Duplicates and RNA-seq Millions of reads mapping to a single short transcript  will look like a LOT of duplicates! (this also happens with rRNA) Thus, highly expressed transcripts having lots of “duplicate” reads is normal! dupRadar (H. Klein et al) http://sourceforge.net/projects/dupradar/
  • 22. Predicting library complexity Daley T and Smith AD. Predicting the molecular complexity of sequencing libraries. Nature Methods 10, 325–327 (2013) doi:10.1038/nmeth.2375
  • 24. Gene body coverage RNA quality affects the shape If this profile has strange spikes, there may be extreme overrepresentation of sequences Benjamin Sigurgeirsson
  • 25. Mapping to genomic features RSeQC plot
  • 26. Insert size distribution Negative insert size implies overlapping mate reads For assembly, might want to join overlapping mates into “pseudo-single-end reads” I have used FLASH; other tools mentioned here http://thegenomefactory.blogspot.se/2012/11/tools-to-merge-overlapping-paired- end.html
  • 27. Clustering to check for outliers and batch effects Cluster according to tissue Cluster according to prep or sequencing batch
  • 28. ●●● ● ● ● ● ● ● Cufflinks FPKM PC1 PC2 ● ● ● ● ● ● ● ● ● ● ● ● (edgeR) TMM PC1 PC2 ● ● ● ● ● ● ● ● ● ● ● ● (limma) logCPM PC1 PC2 ● ●● ● ● ● ● ● ● ● ● ● logCPM−TMM PC1 PC2 ● ● ● Red – brain Blue – heart Black – kidney Circles – Study 1 Triangles – Study 2 Squares – Study 3 … or PCA plots But it can be tricky because a lot depends on the normalization
  • 29. - R/FPKM: (Mortazavi et al. 2008) - Correct for: differences in sequencing depth and transcript length - Aiming to: compare a gene across samples and diff genes within sample - TMM: (Robinson and Oshlack 2010) - Correct for: differences in transcript pool composition; extreme outliers - Aiming to: provide better across-sample comparability - TPM: (Li et al 2010, Wagner et al 2012) - Correct for: transcript length distribution in RNA pool - Aiming to: provide better across-sample comparability - Limma voom (logCPM): (Lawet al 2013) - Aiming to: stabilize variance; remove dependence of variance on the mean Normalization: different goals
  • 30. TPM – Transcripts Per Million Blog post that explains how it works http://blog.nextgenetics.net/?e=51 A slightly modified RPKM measure that accounts for differences in gene length distribution in the transcript population
  • 31. TMM – Trimmed Mean of M values Attempts to correct for differences in RNA composition between samples E g if certain genes are very highly expressed in one tissue but not another, there will be less “sequencing real estate” left for the less expressed genes in that tissue and RPKM normalization (or similar) will give biased expression values for them compared to the other sample Robinson and Oshlack Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25 RNA population 1 RNA population 2 Equal sequencing depth -> orange and red will get lower RPKM in RNA population 1 although the expression levels are actually the same in populations 1 and 2
  • 32. Across-sample comparability Dillies et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs046
  • 33. Comments on normalization Constantly evolving area. My current recommendations: For reporting gene expression estimates: Use TPM if possible (RSEM, Sailfish, eXpress) For differential expression analysis: Use TMM, DESeq normalization or similar For clustering and visualization: I prefer TMM + a log transform (limma-voom)
  • 34. Questions? Thanks to: Thomas Svensson + the whole WABI group at SciLifeLab in Stockholm & Uppsala Benjamin Sigurgeirsson (SciLifeLab/KTH) Gary Schroth (Illumina)