Dgaston dec-06-2012
1. Bioinformatics: Intro to RNA-Seq Analysis
Integrated Learning Session
Daniel Gaston, PhD
Dr. Karen Bedard Lab, Department of Pathology
December 6th, 2012
2. Overview
Introduction
Considerations for RNA-Seq
Computational Resources/Options
Analysis of RNA-Seq Data
Principle of analyzing RNA-Seq
General RNA-Seq analysis pipeline
“Tuxedo” pipeline
Alternative tools
Resources
http://www.slideshare.net/DanGaston
3. Before You Start: Considerations for RNA-Seq Analysis
Next-Generation Sequencing experiments generate
a lot of raw data
25-40 GB/sample/replicate for most transcriptomes/tissue
types/cell lines/conditions
Require more computational resources than many
labs routinely have available to analyse data
Several processing “cores” (8 minimum)
Large amount of RAM (16GB+)
Large amount of disk storage space for intermediate and
final results files in addition to raw FastQ files
Can take a significant amount of time per sample (days to
a week)
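A quick way to compare a machine against the minimums above is sketched here. The thresholds (8 cores, 16 GB) come from this slide; the commands are common but not universal, and the RAM check assumes a Linux system with /proc/meminfo.

```shell
# Number of online CPU cores (falls back to nproc if getconf lacks the key).
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc)
echo "CPU cores: $cores"
if [ "$cores" -lt 8 ]; then
    echo "  warning: fewer than 8 cores"
fi

# /proc/meminfo is Linux-specific; on other systems this step is skipped.
if [ -r /proc/meminfo ]; then
    ram_gb=$(awk '/^MemTotal:/ { printf "%d", $2 / 1024 / 1024 }' /proc/meminfo)
    echo "RAM: ${ram_gb} GB"
    if [ "$ram_gb" -lt 16 ]; then
        echo "  warning: less than 16 GB of RAM"
    fi
fi

# Free space on the file system that holds the current directory.
df -h .
```

Run it from the directory where the FastQ files and results will live, since df reports on that file system only.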
6. So I Ran an RNA-Seq Experiment. Now What?
Need to go from raw “read” data to gene expression
data
We now have:
De-multiplexed fastq files for each individual sample and
replicate
We want lists of:
Differentially expressed genes/transcripts
Potentially novel genes/transcripts
Potentially novel splice junctions
Potential fusion events
Organize your data, programs, and additional
resources (discussed later)
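One possible way to organize a project is sketched below. Every directory name is a hypothetical choice for illustration, not a layout prescribed by the talk or by any of the tools.

```shell
# Hypothetical layout, created in a throwaway location for demonstration.
project=$(mktemp -d)/rnaseq_project

mkdir -p "$project/raw_fastq" \
         "$project/reference/genome" \
         "$project/reference/annotation" \
         "$project/alignments" \
         "$project/assemblies" \
         "$project/diff_expression" \
         "$project/logs"

# Show the top-level folders that were created.
ls -1 "$project"
```

Keeping raw FastQ files, references, and each stage's output in separate folders makes it easier to re-run one step without touching the others.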
7. What is the Raw Data?
A single lane of Illumina HiSeq 2000 sequencing
produces ~ 250 – 300 million “reads” of sequencing
Can be paired or single-end sequencing (paired-end
preferred)
Various sequencing lengths (number of sequencing
cycles)
2x50bp, 2x75bp, 2x100bp, 2x150bp most common
Cost versus amount of usable data
True raw data is actually image data with colour
intensities, which are then converted into text (A, C, G,
T and quality scores) stored in FastQ format
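To make the FastQ format concrete, here is a small sketch that writes a toy two-read file and counts its reads. The read names, sequences, and quality strings are made up, and the line-count approach assumes the usual unwrapped four-lines-per-record layout.

```shell
# Write a two-read toy FastQ file (sequences and qualities are made up).
fq=$(mktemp)
cat > "$fq" <<'EOF'
@read1/1
ACGTACGTAC
+
IIIIIIIIII
@read2/1
TTGCAAGCTA
+
IIIIHHHHGG
EOF

# Each record is exactly four lines: @name, bases, "+", qualities.
lines=$(wc -l < "$fq" | tr -d ' ')
reads=$((lines / 4))
echo "reads: $reads"

# The second line of each record holds the bases; tabulate read lengths.
awk 'NR % 4 == 2 { print length($0) }' "$fq" | sort -n | uniq -c
```

The same two commands work on a real FastQ file, although counting 250 to 300 million reads will take a while.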
10. What is Short-Read Alignment?
Paired-End Reads
Section of Reference
Chromosome
11. What’s Special About RNA-Seq
Normally the distance between paired reads and the size of
insertions are both constrained
With RNA-Seq the source is mRNA, not genomic
DNA
Mapping to a reference genome, not transcriptome
Need to account for introns, pairs can be much
further apart than expected
17. What else can we look for?
Combine with ChIP-Seq to differentiate various
levels of regulation
Integrative analyses to identify common elements
(micro-RNA, transcription factors, molecular
pathways, protein-DNA interactions)
Combine with whole-exome or whole-genome
sequencing
Allele-specific expression
Allelic imbalance
LOH
Large genomic rearrangements/abnormalities
18. Caution
Need to differentiate between real data and artifacts
Differentiate between biologically meaningful data
and “noise”
Sample selection, experimental design, biological
replication (not technical replication), and robust
statistical methods are important
Looking at your data “by eye” is useful, but needs to
be backed up by stats
Avoid experimenter bias
Try and be holistic in your analyses
21. What you need before you begin
The individual programs
Reference genome (hg19/GRCh37)
FASTA file of whole genome, each chromosome is a
sequence entry
Bowtie2 Index files for reference genome
Index files are compressed representations of the
genome that allow reads to be aligned to the reference
efficiently and in parallel
Gene/Transcript annotation reference (UCSC,
Ensembl, ENCODE, etc)
Gives information about the location of genes and
important features such as location of introns, exons,
splice junctions, etc
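A quick sanity check on a reference FASTA is to count its sequence entries and their lengths. The sketch below uses a toy file with made-up names and sequences in place of a real genome.

```shell
# Toy reference FASTA; names and sequences are made up.
fa=$(mktemp)
cat > "$fa" <<'EOF'
>chr1
ACGTACGTACGTACGT
ACGTACGT
>chr2
TTTTGGGGCCCCAAAA
EOF

# One ">" header per sequence entry, i.e. one per chromosome.
n_seqs=$(grep -c '^>' "$fa")
echo "sequence entries: $n_seqs"

# Length of each entry, summed over its wrapped sequence lines.
seq_lengths=$(awk '/^>/ { name = substr($0, 2); next }
                   { len[name] += length($0) }
                   END { for (n in len) print n, len[n] }' "$fa" | sort)
echo "$seq_lengths"
```

If pre-built index files are not available, Bowtie2 ships a bowtie2-build program that takes the FASTA file and an output base name and writes the index files.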
22. Step 0: Bowtie
Bowtie forms the core of TopHat for short-read
alignment
Initial mapping of subset of reads (~5 million) to a
reference transcriptome to estimate inner-distance
mean/median and standard deviation for tophat
This info can be retrieved from the library prep stage
but is actually better to estimate from your final data
Sample command-line:
bowtie2 -x /path/to/transcriptome_ref -q --phred33 --local -p 8 -1 read1.fastq -2 read2.fastq -S output.sam
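Once the subset is mapped, the inner-distance mean and standard deviation can be pulled from the SAM file. The sketch below is one way to do that with awk, not the author's script. It assumes the template length (SAM column 9) approximates the fragment size and that both mates have the same length; with local alignment and soft clipping this is only an estimate. The toy SAM records are made up so the example runs on its own.

```shell
# Build a toy SAM file with three properly paired 50 bp read pairs.
# All names, positions, and template lengths are made up for illustration.
sam=$(mktemp)
bases=$(awk 'BEGIN { while (i++ < 50) printf "A" }')
quals=$(awk 'BEGIN { while (i++ < 50) printf "I" }')
{
    printf '@HD\tVN:1.0\tSO:unsorted\n'
    for tlen in 200 220 180; do
        # First mate (flag 99) has a positive TLEN, second mate (flag 147) a negative one.
        printf 'pair%s\t99\ttx1\t100\t30\t50M\t=\t%s\t%s\t%s\t%s\n' \
            "$tlen" "$((100 + tlen - 50))" "$tlen" "$bases" "$quals"
        printf 'pair%s\t147\ttx1\t%s\t30\t50M\t=\t100\t-%s\t%s\t%s\n' \
            "$tlen" "$((100 + tlen - 50))" "$tlen" "$bases" "$quals"
    done
} > "$sam"

# Mean and sample standard deviation of the inner distance.
stats=$(awk -F '\t' '
    /^@/ { next }                            # skip header lines
    int($2 / 2) % 2 == 1 && $9 > 0 {         # proper pairs, leftmost mate only
        d = $9 - 2 * length($10)             # inner distance for this pair
        n++; sum += d; sumsq += d * d
    }
    END {
        if (n < 2) { print "too few pairs"; exit 1 }
        mean = sum / n
        sd = sqrt((sumsq - n * mean * mean) / (n - 1))
        printf "mean=%.1f sd=%.1f", mean, sd
    }' "$sam")
echo "$stats"
```

On real data, replace "$sam" with the output.sam produced by the mapping step; the two numbers feed the TopHat -r and --mate-std-dev options.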
23. Step 1: Tophat
Tophat is a short-read mapper capable of aligning
reads to a reference genome and finding exon-exon
junctions
Can be provided a list of known junctions, do de
novo junction discovery, or both
Also has an option to find potential fusion-gene
transcripts
Sample command-line:
tophat -p 8 -G gene_annotations.gtf -r inner_distance --mate-std-dev std_dev -o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq
24. About TopHat Options
-o: The path/name of a directory in which to place all
of the TopHat output files
-G: The path to and name of an annotation file so TopHat
can be aware of known junctions
Reference Genome: Given as path and “base
name.” If reference genome saved as:
/genomes/hg19/genome.fa then the relevant path
and basename would be /genomes/hg19/genome
Inner Distance = Fragment size – (2 x read length)
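As a worked example of the formula, the sketch below wraps it in a small shell function. The function name and the example numbers are illustrative assumptions; fragment size here means the sequenced insert without adapters.

```shell
inner_distance() {
    # $1 = fragment size (insert only, adapters excluded), $2 = read length
    echo $(( $1 - 2 * $2 ))
}

inner_distance 300 100    # 2x100 bp reads from a 300 bp fragment
inner_distance 150 100    # mates overlap, so the result is negative
```

The first call gives the value that would be passed to TopHat's -r option for that library. A negative result means the two mates overlap.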
25. TopHat: Additional options
--no-mixed
--b2-very-sensitive
--fusion-search
Running above options on 6 processing cores on
one sample took ~26 hours
26. Step 2: Cufflinks
Cufflinks performs gene and transcript discovery
Many possible options
No novel discovery, use only a reference group of
transcripts
de novo mode (shown below, beginner's default)
Mixed Reference-Guided Assembly and de novo
discovery.
Options for more robust normalization methods and error
correction
Sample command-line:
cufflinks -p 8 -o Cufflinks.out/ accepted_hits.bam
28. Step 4: Cuffdiff
Calculates expression levels of transcripts in
samples
Estimates differential expression between samples
Calculates significance value for difference in
expression levels between samples
Also groups together transcripts that share the same
transcription start site, to help identify genes under
transcriptional/post-transcriptional regulation
Sample command-line:
cuffdiff -o Output.dir/ -b /path/to/genome.fa -p 8 -L Cond1,Cond2 -u merged.gtf cond1.bam cond2.bam
29. Cuffdiff Output
FPKM values for genes, isoforms, CDS, and groups of
genes from same Transcription Start Site for each
condition
FPKM is the normalized “expression value” used in RNA-Seq
Count files of above
As above but on a per replicate basis
Differential expression test results for genes, CDS,
primary transcripts, spliced transcripts on a per sample
(condition) comparison basis (Each possible X vs Y
comparison unless otherwise specified)
Includes identifiers, expression levels, expression difference
values, p-values, q-values, and yes/no significance field
Differential splicing tests, differential coding output,
differential promoter use
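The differential expression tables are tab-delimited text, so significant genes can be pulled out with awk. The sketch below builds a toy table whose column names follow the gene-level Cuffdiff output (gene_exp.diff) as best I recall it; check the header of your own file, since the lookup is done by column name. All gene names and values are made up.

```shell
# Toy stand-in for a Cuffdiff gene-level results file; all values are made up.
diff_file=$(mktemp)
{
    printf 'test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\tlog2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n'
    printf 'G1\tG1\tGENEA\tchr1:100-900\tCond1\tCond2\tOK\t10.0\t40.0\t2.0\t4.1\t0.0001\t0.002\tyes\n'
    printf 'G2\tG2\tGENEB\tchr2:50-700\tCond1\tCond2\tOK\t12.0\t13.0\t0.115\t0.3\t0.61\t0.80\tno\n'
    printf 'G3\tG3\tGENEC\tchr3:10-400\tCond1\tCond2\tOK\t30.0\t7.5\t-2.0\t-3.8\t0.0003\t0.004\tyes\n'
} > "$diff_file"

# Keep rows that were tested successfully and flagged as significant.
sig_genes=$(awk -F '\t' '
    NR == 1 {                                # map header names to column numbers
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    $(col["status"]) == "OK" && $(col["significant"]) == "yes" {
        print $(col["gene"]), $(col["log2(fold_change)"]), $(col["q_value"])
    }' "$diff_file")
echo "$sig_genes"
```

Looking columns up by header name rather than by position keeps the filter working if the column order differs between versions.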
32. Help!
Command X failed
Keep calm
Don't blame the computer
Check input files and formats
Google/SeqAnswers/Biostars
Results look “weird”
Check the raw data
Re-check the commands you used
RNA-Seq analysis is an experiment:
Maintain good records of what you did, like any other
experiment
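One lightweight way to keep such records is to run each command through a small wrapper that logs it first. The function name run_logged and the log location are hypothetical; this is a sketch of the idea, not a replacement for a proper lab notebook.

```shell
# Log file for this analysis session (a temporary file here, for demonstration).
log=$(mktemp)

run_logged() {
    # Append a timestamp and the exact command line to the log, then run it.
    printf '%s\t%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$log"
    "$@"
}

run_logged echo "alignment step would go here"
run_logged ls /

# Review what was run and when.
cat "$log"
```

In practice the wrapped commands would be the tophat, cufflinks, and cuffdiff calls, and the log would live in the project folder alongside the results.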
33. Alternative tools
Alternative short-read alignment
BWA -> Not splice-aware, cannot align RNA-Seq reads across introns
GSNAP
STAR -> Requires minimum of 30GB of RAM
Alternative transcript reconstruction
STAR
Scripture
Alternative Expression/Abundance Estimation
DESeq
DEXSeq
edgeR
36. Additional Resources
Trapnell et al. (2012) Differential gene and transcript
expression analysis of RNA-Seq experiments with TopHat
and Cufflinks. Nature Protocols 7(3):562-578
www.biostars.org (Q&A site)
SeqAnswers Forum
GENCODE Gene Annotations
http://www.gencodegenes.org/
ftp://ftp.sanger.ac.uk/pub/gencode
TopHat / Illumina iGenomes References and
Annotation Files:
http://tophat.cbcb.umd.edu/igenomes.html
37. Acknowledgements
Dalhousie University
Dr. Karen Bedard
Dr. Chris McMaster
Dr. Andrew Orr
Dr. Conrad Fernandez
Dr. Marissa Leblanc
Mat Nightingale
Bedard Lab
IGNITE
Dr. Graham Dellaire
Montgomery Lab, Stanford
Dr. Stephen Montgomery
BHCRI CRTP Skills Acquisition Program
44. The Cancer Genome Atlas
Identify cancer subtypes, actionable driver
mutations, personalized/genomic/precision medicine
More than $275 million in funding from NIH
Multiple research groups around the world
20 cancer types being studied
205 publications from the research network since
late 2008
49. What is UNIX?
UNIX and UNIX-Like are a family of computer
operating systems originally developed at AT&T's
Bell Labs
Apple OS X and iOS (UNIX)
Linux (UNIX-Like)
50. Intro
The terminal (command-line) isn't THAT scary.
Maintaining a Linux environment can be challenging,
but most of these analyses can also be done in an
OS X environment
Installing software can sometimes be cumbersome
and confusing, however many standard
bioinformatics programs and software libraries are
fairly easy to set-up
Working with the programs from the command-line
will often give you a better appreciation for what the
program does and what it requires
51. Terms to Know
Path: The location of a directory, file, or command on
the computer.
Example: /Users/dan (OS X home directory)
52. The Commands You Need to Know
ls: Lists the files in the current directory. Directories
(folders) are just a special type of file themselves
cd: Change directory
pwd: View the full path of the directory you are
currently in
cat: Displays the contents of a file on the terminal
screen
head / tail : Displays the top or bottom contents of a
file to the screen respectively
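The commands above can be tried safely in a throwaway directory, as in the sketch below. The file and folder names are made up, and mkdir and printf are used only to create something to look at.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"
pwd                                  # full path of the current directory

mkdir results
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > notes.txt

ls                                   # shows notes.txt and results
cat notes.txt                        # prints the whole file
head -n 3 notes.txt                  # first three lines
tail -n 2 notes.txt                  # last two lines

cd results                           # move into a sub-directory (relative path)
pwd
cd ..                                # go back up one level
```

head and tail are especially handy for peeking at FastQ or SAM files that are far too large to open with cat.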