Bioinformatics: Intro to RNA-Seq Analysis

                   Integrated Learning Session
                           Daniel Gaston, PhD
 Dr. Karen Bedard Lab, Department of Pathology

                           December 6th, 2012
Overview
   Introduction
       Considerations for RNA-Seq
       Computational Resources/Options
   Analysis of RNA-Seq Data
       Principle of analyzing RNA-Seq
       General RNA-Seq analysis pipeline
       “Tuxedo” pipeline
       Alternative tools
   Resources
       http://www.slideshare.net/DanGaston
Before You Start: Considerations for RNA-Seq Analysis
   Next-Generation Sequencing experiments generate
    a lot of raw data
       25-40 GB/sample/replicate for most transcriptomes/tissue
        types/cell lines/conditions

   Require more computational resources than many
    labs routinely have available for data analysis
       Several processing “cores” (8 minimum)
       Large amount of RAM (16GB+)
       Large amount of disk storage space for intermediate and
        final results files in addition to raw FastQ files
       Can take a significant amount of time per sample (days to
        a week)
Computational Options
   Local (Large workstation or cluster)
   Remote Computer/Cluster
    (ComputeCanada/ACENet)
   Cloud Services
       Amazon Web Services
   Cloud/Local Bioinformatics “Portals”
       Galaxy
       Chipster
       GenomeSpace
       CloudBioLinux
       CloudMan
       BioCloudCentral (Interface to CloudMan, CloudBioLinux,
        etc)
RNA-Seq Analysis Workflow
So I Ran an RNA-Seq Experiment. Now
What?
   Need to go from raw “read” data to gene expression
    data
   We now have:
       De-multiplexed fastq files for each individual sample and
        replicate
   We want lists of:
       Differentially expressed genes/transcripts
       Potentially novel genes/transcripts
       Potentially novel splice junctions
       Potential fusion events
   Organize your data, programs, and additional
    resources (discussed later)
What is the Raw Data
   A single lane of Illumina HiSeq 2000 sequencing
    produces ~ 250 – 300 million “reads” of sequencing
   Can be paired or single-end sequencing (paired-end
    preferred)
   Various sequencing lengths (number of sequencing
    cycles)
       2x50bp, 2x75bp, 2x100bp, 2x150bp most common
       Cost versus amount of usable data
   True raw data is actually image data with colour
    intensities that are then converted into text (A, C, G,
    T and quality scores) in a format called FastQ
FastQ
@M00814:1:000000000-A2472:1:1101:14526:1866 1:N:0:1
TGGAACATGCGTGCGNAGCCGAAAGTGTGTCCCCACTTTCATATGAAGAAAGAC
+
?????BBBBB9?+<+#,,6C>CAEEHHHFFHHHHFEHHHHHHHHHGHHHHHHHHHH


   FASTA-like text format in which each read has a header line, a
    sequence line, a “+” separator line, and a line of quality scores,
    one for every base in the sequenced read
   In Paired-End Sequencing one file for each “end” of
    sequencing (Primer 1 and Primer 2)
   Quality scores are encoded with a single character
    representing a number. The most common encoding scheme
    is called Phred33. Older Illumina software (versions 1.3 – 1.7)
    used Phred64, but the current generation does not.
       Often needs to be set explicitly in alignment programs
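
The encoding described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the default offset of 33 assumes Phred33 data (pass 64 for old Phred64 files). The second function uses the standard Phred relationship between a score Q and the error probability, P = 10^(-Q/10).

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(phred_score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred_score / 10)


# First ten quality characters of the example record shown on the FastQ slide
scores = decode_qualities("?????BBBBB")
print(scores)
print(error_probability(scores[0]))
```

Under Phred33, “?” decodes to 30 (about a 1 in 1,000 chance of a wrong base call) and “#” decodes to 2, which is why the “N” base in the example read carries a “#”.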
General Analysis Pipeline

         Short-Read Alignment


        Transcript Reconstruction


         Abundance/Expression


        Visualization / Statistics
What is Short-Read Alignment?




[Figure labels: Paired-End Reads; Section of Reference Chromosome]
What’s Special About RNA-Seq


   Normally the distance between paired reads and the size of
    insertions are both constrained
   With RNA-Seq the source is mRNA, not genomic
    DNA
   Mapping to a reference genome, not a transcriptome
   Need to account for introns: pairs can be much
    further apart than expected
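
To make the intron effect concrete, here is a small Python sketch. The gene model, coordinates, and function name are all made up for illustration. It maps positions on a spliced transcript back to the genome and compares the gap between two mates in both coordinate systems.

```python
def transcript_to_genome(pos, exons):
    """Map a 0-based transcript position to a genomic coordinate.

    exons: list of (genomic_start, genomic_end) half-open intervals,
    listed in transcript order on the forward strand.
    """
    remaining = pos
    for start, end in exons:
        length = end - start
        if remaining < length:
            return start + remaining
        remaining -= length
    raise ValueError("position lies beyond the end of the transcript")


# Hypothetical gene: two 200 bp exons separated by a 5,000 bp intron
exons = [(1000, 1200), (6200, 6400)]

mate1_end = 150     # transcript position where mate 1 ends (in exon 1)
mate2_start = 250   # transcript position where mate 2 starts (in exon 2)

inner_on_transcript = mate2_start - mate1_end
inner_on_genome = (transcript_to_genome(mate2_start, exons)
                   - transcript_to_genome(mate1_end, exons))
print(inner_on_transcript, inner_on_genome)
```

The mates are 100 bp apart on the mRNA but 5,100 bp apart on the genome, because the intron sits between them. A DNA-oriented aligner that enforces a tight distance constraint would reject such a pair.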
Transcript Reconstruction: Intron/Exon
Junctions




[Figure labels: Exon 1; Exon 2; Exon 3]
Transcript Reconstruction: Alternative
Splicing




[Figure labels: Exon 1; Exon 2; Exon 3]
Transcript Reconstruction: Novel
Exon/Transcript Identification




[Figure labels: Exon 1; Exon 2; Exon X; Exon 3]
Transcript Reconstruction: Fusion
Transcripts




[Figure labels: Exon 1; Exon 2; Exon 3; Gene 2 Exon 4]
Transcript Reconstruction: Differential
Expression

[Figure labels: Sample 1; Sample 2]
What else can we look for?
   Combine with ChIP-Seq to differentiate various
    levels of regulation
   Integrative analyses to identify common elements
    (micro-RNA, transcription factors, molecular
    pathways, protein-DNA interactions)
   Combine with whole-exome or whole-genome
    sequencing
       Allele-specific expression
       Allelic imbalance
       LOH
       Large genomic rearrangements/abnormalities
Caution
   Need to differentiate between real data and artifacts
   Differentiate between biologically meaningful data
    and “noise”
   Sample selection, experimental design, biological
    replication (not technical replication), and robust
    statistical methods are important
   Looking at your data “by eye” is useful, but needs to
    be backed up by stats
   Avoid experimenter bias
   Try to be holistic in your analyses
Visualizing with IGV
“Tuxedo” Analysis Pipeline

                         Bowtie


                         Tophat


                       Cufflinks
      Cufflinks   Cuffcompare   Cuffmerge   Cuffdiff




                  CummeRbund
What you need before you begin
   The individual programs
   Reference genome (hg19/GRCh37)
       FASTA file of whole genome, each chromosome is a
        sequence entry
   Bowtie2 Index files for reference genome
       Index files are compressed representations of the
        genome that allow alignment to the reference efficiently
        and in parallel
   Gene/Transcript annotation reference (UCSC,
    Ensembl, ENCODE, etc)
       Gives information about the location of genes and
        important features such as location of introns, exons,
        splice junctions, etc
Step 0: Bowtie
   Bowtie forms the core of TopHat for short-read
    alignment
   Initial mapping of subset of reads (~5 million) to a
    reference transcriptome to estimate inner-distance
    mean/median and standard deviation for tophat
   This info can be retrieved from the library prep stage,
    but it is actually better to estimate it from your final data
   Sample command-line:

    bowtie2 -x /path/to/transcriptome_ref.fa -q --phred33 --local -p 8 -1
    read1.fastq -2 read2.fastq -S output.sam
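
One way to turn the resulting SAM file into the two numbers TopHat needs is sketched below in Python. This is not part of the Tuxedo tools. It assumes both mates have the same length and that the SAM TLEN field (column 9) approximates the fragment size, which is only roughly true when reads are soft-clipped in local mode. The function names and the example records are made up.

```python
import statistics


def inner_distances(sam_lines):
    """Yield one inner-distance estimate per properly paired fragment."""
    for line in sam_lines:
        if line.startswith("@"):        # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        tlen = int(fields[8])           # signed template (fragment) length
        read_length = len(fields[9])
        # Keep properly paired reads (flag bit 0x2); TLEN > 0 selects the
        # leftmost mate so that each fragment is counted only once.
        if flag & 0x2 and tlen > 0:
            yield tlen - 2 * read_length


def summarize(sam_lines):
    """Return (mean, standard deviation) of the inner distances."""
    values = list(inner_distances(sam_lines))
    return statistics.mean(values), statistics.pstdev(values)


# Synthetic records for illustration: two fragments with 50 bp mates
seq, qual = "A" * 50, "I" * 50
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    f"r1\t99\ttx1\t100\t42\t50M\t=\t250\t200\t{seq}\t{qual}",
    f"r1\t147\ttx1\t250\t42\t50M\t=\t100\t-200\t{seq}\t{qual}",
    f"r2\t99\ttx1\t300\t42\t50M\t=\t470\t220\t{seq}\t{qual}",
]
mean, std_dev = summarize(example)
print(mean, std_dev)
```

On a real run the same function can be given an open file handle, for example `summarize(open("output.sam"))`. The rounded mean and standard deviation are then the values passed to TopHat as the inner distance and its standard deviation.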
Step 1: Tophat
   Tophat is a short-read mapper capable of aligning
    reads to a reference genome and finding exon-exon
    junctions
   Can be provided a list of known junctions, do de
    novo junction discovery, or both
   Also has an option to find potential fusion-gene
    transcripts
   Sample command-line:

    tophat -p 8 -G gene_annotations.gtf -r inner_distance --mate-std-dev
    std_dev -o Output.dir /path/to/bowtie2indexes/genome read1.fastq
    read2.fastq
About TopHat Options
   -o: The path/name of a directory in which to place all
    of the TopHat output files
   -G: The path to and name of an annotation file so TopHat
    can be aware of known junctions
   Reference Genome: Given as path and “base
    name.” If reference genome saved as:
    /genomes/hg19/genome.fa then the relevant path
    and basename would be /genomes/hg19/genome
   Inner Distance = Fragment size – (2 x read length)
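
As a quick worked example of the formula above, here it is as a Python function. The function name and the numbers are illustrative, not taken from a real library prep.

```python
def inner_distance(fragment_size, read_length):
    """Inner distance between mates: fragment size minus both read lengths."""
    return fragment_size - 2 * read_length


# 300 bp fragments sequenced as 2x100bp leave 100 bp unsequenced in the middle
print(inner_distance(300, 100))

# 150 bp fragments sequenced as 2x100bp: the mates overlap by 50 bp
print(inner_distance(150, 100))
```

The formula gives a negative value whenever the fragment is shorter than the two reads combined, which means the mates overlap.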
TopHat: Additional options
   --no-mixed
   --b2-very-sensitive
   --fusion-search
   Running above options on 6 processing cores on
    one sample took ~26 hours
Step 2: Cufflinks
   Cufflinks performs gene and transcript discovery
   Many possible options
       No novel discovery, use only a reference group of
        transcripts
       de novo mode (shown below, beginner's default)
       Mixed Reference-Guided Assembly and de novo
        discovery.
       Options for more robust normalization methods and error
        correction
   Sample command-line:

    cufflinks -p 8 -o Cufflinks.out/ accepted_hits.bam
Step 3: Cuffmerge
   Merges sample assemblies, estimates abundances, and
    cleans up the transcriptome
   Sample command-line:

    cuffmerge -g gene_annotations.gtf -s /path/to/genome.fa -p 8
    text_list_of_assemblies.txt
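
The text file given to cuffmerge lists the assembly GTF files, one path per line. A minimal Python sketch for building it follows. It relies on Cufflinks writing its assembly to `transcripts.gtf` inside each output directory. The directory names and function names are hypothetical.

```python
from pathlib import Path


def assembly_list(cufflinks_dirs):
    """Return one transcripts.gtf path per Cufflinks output directory."""
    return [(Path(d) / "transcripts.gtf").as_posix() for d in cufflinks_dirs]


def write_assembly_list(cufflinks_dirs, list_path="text_list_of_assemblies.txt"):
    """Write the paths, one per line, to the file handed to cuffmerge."""
    Path(list_path).write_text("\n".join(assembly_list(cufflinks_dirs)) + "\n")


# Hypothetical output directories, one per sample/replicate
print(assembly_list(["Cond1_rep1.out", "Cond2_rep1.out"]))
```

Calling `write_assembly_list([...])` with the real Cufflinks output directories creates the list file in the current directory.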
Step 4: Cuffdiff
   Calculates expression levels of transcripts in
    samples
   Estimates differential expression between samples
   Calculates significance value for difference in
    expression levels between samples
   Also groups together transcripts that all start from
    same start site. Identify genes under
    transcriptional/post-transcriptional regulation
   Sample command-line:

    cuffdiff -o Output.dir/ -b /path/to/genome.fa -p 8 -L Cond1,Cond2 -u
    merged.gtf cond1.bam cond2.bam
Cuffdiff Output
   FPKM values for genes, isoforms, CDS, and groups of
    genes from same Transcription Start Site for each
    condition
       FPKM is the normalized “expression value” used in RNA-Seq
   Count files of above
   As above but on a per replicate basis
   Differential expression test results for genes, CDS,
    primary transcripts, spliced transcripts on a per sample
    (condition) comparison basis (Each possible X vs Y
    comparison unless otherwise specified)
       Includes identifiers, expression levels, expression difference
        values, p-values, q-values, and yes/no significance field
   Differential splicing tests, differential coding output,
    differential promoter use
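
FPKM stands for Fragments Per Kilobase of transcript per Million mapped fragments. The Python sketch below shows only that basic definition. Cuffdiff itself estimates how many fragments belong to each isoform with a statistical model, so this is not how it arrives at its values. The function name and the numbers are illustrative.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions)


# 500 fragments on a 2,000 bp transcript in a library of 25 million fragments
print(fpkm(500, 2_000, 25_000_000))
```

Dividing by transcript length and by library size is what lets values be compared between long and short transcripts and between samples sequenced to different depths.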
Step 5: CummeRbund (R)




                         Trapnell et al., 2012
Visualization




                Trapnell et al., 2012
Help!
   Command X failed
       Keep calm
       Don't blame the computer
       Check input files and formats
       Google/SeqAnswers/Biostars
   Results look “weird”
       Check the raw data
       Re-check the commands you used
   RNA-Seq analysis is an experiment:
       Maintain good records of what you did, like any other
        experiment
Alternative tools
   Alternative short-read alignment
       BWA -> Not splice-aware, so it cannot align RNA-Seq reads across exon junctions
       GSNAP
       STAR -> Requires minimum of 30GB of RAM
   Alternative transcript reconstruction
       STAR
       Scripture
   Alternative Expression/Abundance Estimation
       DESeq
       DEXSeq
       edgeR
Resources
Software Websites
   TopHat          http://tophat.cbcb.umd.edu
   Cufflinks       http://cufflinks.cbcb.umd.edu
   STAR            http://gingeraslab.cshl.edu/STAR/
   Scripture       http://www.broadinstitute.org/software/scripture/

   Bioconductor    http://www.bioconductor.org/
       DEXSeq
       DESeq
       edgeR
Additional Resources
   Trapnell et al. (2012) Differential gene and transcript
    expression analysis of RNA-Seq experiments with TopHat
    and Cufflinks. Nature Protocols 7(3)
   www.biostars.org (Q&A site)
   SeqAnswers Forum
   GENCODE Gene Annotations
       http://www.gencodegenes.org/
       ftp://ftp.sanger.ac.uk/pub/gencode
   TopHat / Illumina iGenomes References and
    Annotation Files:
       http://tophat.cbcb.umd.edu/igenomes.html
Acknowledgements
   Dalhousie University
       Dr. Karen Bedard
       Dr. Chris McMaster
       Dr. Andrew Orr
       Dr. Conrad Fernandez
       Dr. Marissa Leblanc
       Mat Nightingale
       Bedard Lab
       IGNITE
   Dr. Graham Dellaire
   Montgomery Lab, Stanford
       Dr. Stephen Montgomery
   BHCRI CRTP Skills Acquisition Program
Experimental Data for Genes of Interest
UCSC Genome Browser
MetabolicMine
NCI Pathway Interaction Database
The Cancer Genome Atlas
   Identify cancer subtypes, actionable driver
    mutations, personalized/genomic/precision medicine
   More than $275 million in funding from NIH
   Multiple research groups around the world
   20 cancer types being studied
   205 publications from the research network since
    late 2008
UNIX/Linux command-line basics
What is UNIX?
   UNIX and UNIX-Like are a family of computer
    operating systems originally developed at AT&T's
    Bell Labs
       Apple OS X and iOS (UNIX)
       Linux (UNIX-Like)
Intro
   The terminal (command-line) isn't THAT scary.
    Maintaining a Linux environment can be challenging,
    but most of these analyses can also be done in an
    OS X environment
   Installing software can sometimes be cumbersome
    and confusing, however many standard
    bioinformatics programs and software libraries are
    fairly easy to set up
   Working with the programs from the command-line
    will often give you a better appreciation for what the
    program does and what it requires
Terms to Know
   Path: The location of a directory, file, or command on
    the computer.
       Example: /Users/dan (OS X home directory)
The Commands You Need to Know
   ls: Lists the files in the current directory. Directories
    (folders) are just a special type of file themselves
   cd: Change directory
   pwd: View the full path of the directory you are
    currently in
   cat: Displays the contents of a file on the terminal
    screen
   head / tail : Displays the top or bottom contents of a
    file to the screen respectively

Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 

Último (20)

Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
Email Marketing Automation for Bonterra Impact Management (fka Social Solutio...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 

Dgaston dec-06-2012

  • 1. Bioinformatics: Intro to RNA-Seq Analysis Integrated Learning Session Daniel Gaston, PhD Dr. Karen Bedard Lab, Department of Pathology December 6th, 2012
  • 2. Overview
      • Introduction
          • Considerations for RNA-Seq
          • Computational Resources/Options
      • Analysis of RNA-Seq Data
          • Principle of analyzing RNA-Seq
          • General RNA-Seq analysis pipeline
          • “Tuxedo” pipeline
          • Alternative tools
      • Resources
          • http://www.slideshare.net/DanGaston
  • 3. Before You Start: Considerations for RNA-Seq Analysis
      • Next-Generation Sequencing experiments generate a lot of raw data
          • 25-40 GB/sample/replicate for most transcriptomes/tissue types/cell lines/conditions
      • Analysing the data requires more computational resources than many labs routinely have available
          • Several processing “cores” (8 minimum)
          • Large amount of RAM (16GB+)
          • Large amount of disk storage space for intermediate and final results files in addition to raw FastQ files
          • Can take a significant amount of time per sample (days to a week)
  • 4. Computational Options
      • Local (Large workstation or cluster)
      • Remote Computer/Cluster (ComputeCanada/ACENet)
      • Cloud Services
          • Amazon Web Services
      • Cloud/Local Bioinformatics “Portals”
          • Galaxy
          • Chipster
          • GenomeSpace
          • CloudBioLinux
          • CloudMan
          • BioCloudCentral (Interface to CloudMan, CloudBioLinux, etc)
  • 6. So I Ran an RNA-Seq Experiment. Now What?
      • Need to go from raw “read” data to gene expression data
      • We now have:
          • De-multiplexed fastq files for each individual sample and replicate
      • We want lists of:
          • Differentially expressed genes/transcripts
          • Potentially novel genes/transcripts
          • Potentially novel splice junctions
          • Potential fusion events
      • Organize your data, programs, and additional resources (discussed later)
  • 7. What is the Raw Data
      • A single lane of Illumina HiSeq 2000 sequencing produces ~250-300 million sequencing “reads”
      • Can be paired-end or single-end sequencing (paired-end preferred)
      • Various sequencing lengths (number of sequencing cycles)
          • 2x50bp, 2x75bp, 2x100bp, 2x150bp most common
          • Cost versus amount of usable data
      • True raw data is actually image data with colour intensities, which are then converted into a text format called FastQ (A, C, G, T and quality scores)
  • 8. FastQ
        @M00814:1:000000000-A2472:1:1101:14526:1866 1:N:0:1
        TGGAACATGCGTGCGNAGCCGAAAGTGTGTCCCCACTTTCATATGAAGAAAGAC
        +
        ?????BBBBB9?+<+#,,6C>CAEEHHHFFHHHHFEHHHHHHHHHGHHHHHHHHHH
      • FASTA-like text format with four lines per read: a header line, the sequence line, a separator line (+), and a quality score for every base in the sequenced read
      • In Paired-End Sequencing there is one file for each “end” of sequencing (Primer 1 and Primer 2)
      • Quality scores are encoded with a single character representing a number. The most common encoding scheme is called Phred33. Older Illumina software (Illumina 1.3 – 1.7) used Phred64, but the current generation does not.
      • The encoding often needs to be set explicitly in alignment programs
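The quality encoding described above can be checked by hand. The following is a minimal shell sketch, assuming Phred+33 and exactly four lines per record; the function names `phred33_score` and `count_reads` and the two toy reads are made up for illustration and are not part of any sequencing toolkit.

```shell
# Decode one quality character under the Phred+33 scheme:
# score = ASCII code of the character minus 33.
phred33_score() {
    local ascii
    ascii=$(printf '%d' "'$1")   # a leading quote makes printf emit the character code
    echo $(( ascii - 33 ))
}

# Count reads in a FASTQ file, assuming exactly four lines per record
# (no wrapped sequence or quality lines).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Toy two-read file (made-up reads, for illustration only).
demo_fastq=$(mktemp)
cat > "$demo_fastq" <<'EOF'
@read1
ACGTN
+
HHB?#
@read2
TTGCA
+
?????
EOF

phred33_score '?'            # Phred+33 score of "?"
phred33_score 'H'            # Phred+33 score of "H"
count_reads "$demo_fastq"    # number of records in the toy file
```

Under Phred+64 the same characters would decode to scores 31 lower, which is why a wrongly assumed encoding can silently distort quality-aware alignment.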
  • 9. General Analysis Pipeline: Short-Read Alignment → Transcript Reconstruction → Abundance/Expression → Visualization / Statistics
  • 10. What is Short-Read Alignment? Paired-End Reads Section of Reference Chromosome
  • 11. What’s Special About RNA-Seq
      • Normally the distance between paired reads and the size of insertions are both constrained
      • With RNA-Seq the source is mRNA, not genomic DNA
      • Mapping to a reference genome, not a transcriptome
      • Need to account for introns: pairs can be much further apart than expected
  • 14. Transcript Reconstruction: Novel Exon/Transcript Identification Exon1 Exon 2 Exon X Exon 3
  • 17. What else can we look for?
      • Combine with ChIP-Seq to differentiate various levels of regulation
      • Integrative analyses to identify common elements (micro-RNA, transcription factors, molecular pathways, protein-DNA interactions)
      • Combine with whole-exome or whole-genome sequencing
          • Allele-specific expression
          • Allelic imbalance
          • LOH
          • Large genomic rearrangements/abnormalities
  • 18. Caution
      • Need to differentiate between real data and artifacts
      • Differentiate between biologically meaningful data and “noise”
      • Sample selection, experimental design, biological replication (not technical replication), and robust statistical methods are important
      • Looking at your data “by eye” is useful, but needs to be backed up by stats
      • Avoid experimenter bias
      • Try to be holistic in your analyses
  • 20. “Tuxedo” Analysis Pipeline Bowtie Tophat Cufflinks Cufflinks Cuffcompare Cuffmerge Cuffdiff CummeRbund
  • 21. What you need before you begin
      • The individual programs
      • Reference genome (hg19/GRCh37)
          • FASTA file of whole genome, each chromosome is a sequence entry
      • Bowtie2 Index files for reference genome
          • Index files are compressed representations of the genome that allow alignment to the reference efficiently and in parallel
      • Gene/Transcript annotation reference (UCSC, Ensembl, ENCODE, etc)
          • Gives information about the location of genes and important features such as location of introns, exons, splice junctions, etc
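Before starting a long run it can help to confirm the index is complete. The sketch below assumes a standard small Bowtie2 index, which consists of six files sharing one basename; the function name `check_bt2_index` is hypothetical, and large-genome indexes with the `.bt2l` suffix are not handled.

```shell
# Check that all six small-index Bowtie2 files exist for a given basename.
# (Large genomes may use the .bt2l suffix instead; that case is not handled here.)
check_bt2_index() {
    local base=$1 suffix missing=0
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -e "$base.$suffix" ]; then
            echo "missing: $base.$suffix" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demo with empty placeholder files in a temporary directory.
idx_dir=$(mktemp -d)
for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
    : > "$idx_dir/genome.$suffix"
done
check_bt2_index "$idx_dir/genome" && echo "index looks complete"
```

The argument is the index basename (for example `/genomes/hg19/genome`), not the name of any single index file.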
  • 22. Step 0: Bowtie
      • Bowtie forms the core of TopHat for short-read alignment
      • Initial mapping of a subset of reads (~5 million) to a reference transcriptome to estimate the inner-distance mean/median and standard deviation for TopHat
      • This info can be retrieved from the library prep stage, but it is actually better to estimate it from your final data
      • Sample command-line (Bowtie2 syntax):
        bowtie2 -x /path/to/transcriptome_ref.fa -q --phred33 --local -p 8 -1 read1.fastq -2 read2.fastq -S output.sam
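One way to turn the resulting SAM file into the two numbers TopHat needs is sketched below. It assumes the alignments are against a transcriptome (so template lengths do not span introns), that all reads have the same length, and that the template length sits in SAM column 9 (TLEN); the function name `estimate_inner_distance` and the toy alignments are illustrative only.

```shell
# Estimate mean inner distance and standard deviation from a SAM file.
# Uses column 9 (TLEN, the observed template length); only positive values are
# counted so that each pair contributes once. The read length is passed in.
estimate_inner_distance() {
    local sam=$1 read_len=$2
    awk -F '\t' -v len="$read_len" '
        !/^@/ && $9 > 0 { n++; sum += $9; sumsq += $9 * $9 }
        END {
            if (n == 0) { print "no pairs with positive TLEN found"; exit 1 }
            mean = sum / n
            sd = sqrt(sumsq / n - mean * mean)   # population standard deviation
            printf "%.1f %.1f\n", mean - 2 * len, sd
        }' "$sam"
}

# Toy SAM file: one header line and three pairs with fragment sizes 290, 300, 310.
demo_sam=$(mktemp)
{
    printf '@HD\tVN:1.0\n'
    printf 'p1\t99\ttx1\t100\t60\t100M\t=\t290\t290\t*\t*\n'
    printf 'p1\t147\ttx1\t290\t60\t100M\t=\t100\t-290\t*\t*\n'
    printf 'p2\t99\ttx1\t500\t60\t100M\t=\t700\t300\t*\t*\n'
    printf 'p3\t99\ttx1\t900\t60\t100M\t=\t1110\t310\t*\t*\n'
} > "$demo_sam"

# Prints the mean inner distance and the standard deviation, space-separated.
estimate_inner_distance "$demo_sam" 100
```

The standard deviation of the inner distance equals that of the fragment size, because subtracting a constant read length does not change the spread.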
  • 23. Step 1: Tophat
      • Tophat is a short-read mapper capable of aligning reads to a reference genome and finding exon-exon junctions
      • Can be provided a list of known junctions, do de novo junction discovery, or both
      • Also has an option to find potential fusion-gene transcripts
      • Sample command-line:
        tophat -p 8 -G gene_annotations.gtf -r inner_distance --mate-std-dev std_dev -o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq
  • 24. About TopHat Options
      • -o: The path/name of a directory in which to place all of the TopHat output files
      • -G: The path to and name of an annotation file, so TopHat can be aware of known junctions
      • Reference Genome: Given as path and “base name.” If the reference genome is saved as /genomes/hg19/genome.fa, then the relevant path and basename would be /genomes/hg19/genome
      • Inner Distance = Fragment size - (2 x read length)
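The two conventions above (index basename and inner distance) reduce to a line of shell each. This is a small sketch with hypothetical helper names `index_basename` and `inner_distance`; the fragment sizes are example values, not recommendations.

```shell
# Strip the .fa extension to get the "path and base name" form TopHat expects.
index_basename() {
    echo "${1%.fa}"
}

# Inner distance = fragment size - (2 x read length)
inner_distance() {
    local fragment_size=$1 read_length=$2
    echo $(( fragment_size - 2 * read_length ))
}

index_basename /genomes/hg19/genome.fa   # -> /genomes/hg19/genome
inner_distance 300 100                   # 300 bp fragments, 2x100bp reads -> 100
inner_distance 180 100                   # short fragments -> -20
```

A negative result means the two mates overlap, which is expected when fragments are shorter than twice the read length.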
  • 25. TopHat: Additional options
      • --no-mixed
      • --b2-very-sensitive
      • --fusion-search
      • Running above options on 6 processing cores on one sample took ~26 hours
  • 26. Step 2: Cufflinks
      • Cufflinks performs gene and transcript discovery
      • Many possible options
          • No novel discovery, use only a reference group of transcripts
          • de novo mode (shown below, beginner's default)
          • Mixed Reference-Guided Assembly and de novo discovery
      • Options for more robust normalization methods and error correction
      • Sample command-line:
        cufflinks -p 8 -o Cufflinks.out/ accepted_hits.bam
  • 27. Step 3: Cuffmerge
      • Merges sample assemblies, estimates abundances, and cleans up the transcriptome
      • Sample command-line:
        cuffmerge -g gene_annotations.gtf -s /path/to/genome.fa -p 8 text_list_of_assemblies.txt
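The text list passed to cuffmerge is simply one assembly path per line. The sketch below builds it, assuming each sample's Cufflinks output directory contains a `transcripts.gtf` file; the directory names and the helper `make_assembly_list` are hypothetical.

```shell
# Write one transcripts.gtf path per line, one line per sample directory.
make_assembly_list() {
    local out=$1; shift
    local dir
    : > "$out"                       # start with an empty list file
    for dir in "$@"; do
        if [ -f "$dir/transcripts.gtf" ]; then
            echo "$dir/transcripts.gtf" >> "$out"
        else
            echo "warning: no transcripts.gtf in $dir" >&2
        fi
    done
}

# Demo with two placeholder sample directories in a temporary location.
work=$(mktemp -d)
mkdir -p "$work/sampleA.cufflinks" "$work/sampleB.cufflinks"
: > "$work/sampleA.cufflinks/transcripts.gtf"
: > "$work/sampleB.cufflinks/transcripts.gtf"

make_assembly_list "$work/text_list_of_assemblies.txt" "$work"/sample*.cufflinks
cat "$work/text_list_of_assemblies.txt"
```

Warning on a missing assembly, rather than skipping it silently, makes it harder to merge fewer samples than intended.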
  • 28. Step 4: Cuffdiff
      • Calculates expression levels of transcripts in samples
      • Estimates differential expression between samples
      • Calculates significance value for difference in expression levels between samples
      • Also groups together transcripts that all start from the same start site, to identify genes under transcriptional/post-transcriptional regulation
      • Sample command-line:
        cuffdiff -o Output.dir/ -b /path/to/genome.fa -p 8 -L Cond1,Cond2 -u merged.gtf cond1.bam cond2.bam
  • 29. Cuffdiff Output
      • FPKM values for genes, isoforms, CDS, and groups of genes from same Transcription Start Site for each condition
          • FPKM is the normalized “expression value” used in RNA-Seq
      • Count files of above
      • As above but on a per replicate basis
      • Differential expression test results for genes, CDS, primary transcripts, spliced transcripts on a per sample (condition) comparison basis (Each possible X vs Y comparison unless otherwise specified)
          • Includes identifiers, expression levels, expression difference values, p-values, q-values, and yes/no significance field
      • Differential splicing tests, differential coding output, differential promoter use
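The yes/no significance field makes a first pass over the test results easy to script. The sketch below assumes a tab-delimited results file with a header row containing columns named `gene`, `q_value` and `significant`; the toy table and its values are invented for illustration, and the column layout should be checked against the actual output. The function name `significant_genes` is hypothetical.

```shell
# Print gene name and q-value for rows flagged significant.
# Columns are located by header name rather than by position.
significant_genes() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("gene" in col) || !("q_value" in col) || !("significant" in col)) {
                print "expected columns not found" > "/dev/stderr"; exit 1
            }
            next
        }
        $col["significant"] == "yes" { print $col["gene"] "\t" $col["q_value"] }
    ' "$1"
}

# Toy results table (made-up values); spaces are converted to tabs.
demo_diff=$(mktemp)
tr ' ' '\t' > "$demo_diff" <<'EOF'
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
XLOC_000001 XLOC_000001 GENE_A chr1:100-200 Cond1 Cond2 OK 10 40 2 3.1 0.0001 0.002 yes
XLOC_000002 XLOC_000002 GENE_B chr2:300-400 Cond1 Cond2 OK 5 6 0.263 0.4 0.6 0.8 no
EOF

significant_genes "$demo_diff"
```

Looking columns up by name keeps the filter working if the column order differs between versions.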
  • 30. Step 5: CummeRbund (R) Trapnell et al., 2012
  • 31. Visualization Trapnell et al., 2012
  • 32. Help!
      • Command X failed
          • Keep calm
          • Don't blame the computer
          • Check input files and formats
          • Google/SeqAnswers/Biostars
      • Results look “weird”
          • Check the raw data
          • Re-check the commands you used
      • RNA-Seq analysis is an experiment:
          • Maintain good records of what you did, like any other experiment
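One lightweight way to keep such records is to log every command before running it. This is only a sketch: the wrapper name `run_logged` and the `ANALYSIS_LOG` variable are made up here, and the `echo` stands in for a real analysis command.

```shell
# Append a timestamp, the working directory and the full command line to a log
# file, then run the command unchanged.
run_logged() {
    local log=${ANALYSIS_LOG:-analysis_log.txt}
    printf '%s\t%s\t%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$PWD" "$*" >> "$log"
    "$@"
}

# Demo: log to a temporary file and run a stand-in command.
ANALYSIS_LOG=$(mktemp)
run_logged echo "pretend this is a tophat command"
cat "$ANALYSIS_LOG"
```

The log then serves as a lab notebook of exactly which options were used for each sample, which helps when re-checking commands after a "weird" result.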
  • 33. Alternative tools
      • Alternative short-read alignment
          • BWA -> Not splice-aware, so it cannot properly align RNA-Seq reads across introns
          • GSNAP
          • STAR -> Requires minimum of 30GB of RAM
      • Alternative transcript reconstruction
          • STAR
          • Scripture
      • Alternative Expression/Abundance Estimation
          • DESeq
          • DEXSeq
          • edgeR
  • 35. Software Websites
      • TopHat http://tophat.cbcb.umd.edu
      • Cufflinks http://cufflinks.cbcb.umd.edu
      • STAR http://gingeraslab.cshl.edu/STAR/
      • Scripture http://www.broadinstitute.org/software/scripture/
      • Bioconductor http://www.bioconductor.org/
          • DEXSeq
          • DESeq
          • edgeR
  • 36. Additional Resources
      • Differential gene and transcript expression analysis of RNA-Seq Experiments with TopHat and Cufflinks (2012) Nature Protocols. 7(3)
      • www.biostars.org (Q&A site)
      • SeqAnswers Forum
      • GENCODE Gene Annotations
          • http://www.gencodegenes.org/
          • ftp://ftp.sanger.ac.uk/pub/gencode
      • TopHat / Illumina iGenomes References and Annotation Files:
          • http://tophat.cbcb.umd.edu/igenomes.html
  • 37. Acknowledgements
      • Dalhousie University
          • Dr. Graham Dellaire
          • Dr. Karen Bedard
          • Dr. Chris McMaster
          • Dr. Andrew Orr
          • Dr. Conrad Fernandez
          • Dr. Marissa Leblanc
          • Mat Nightingale
      • Montgomery Lab, Stanford
          • Dr. Stephen Montgomery
      • BHCRI CRTP Skills Acquisition Program
      • Bedard Lab
      • IGNITE
  • 38. Experimental Data for Genes of Interest
  • 44. The Cancer Genome Atlas
      • Identify cancer subtypes, actionable driver mutations, personalized/genomic/precision medicine
      • More than $275 million in funding from NIH
      • Multiple research groups around the world
      • 20 cancer types being studied
      • 205 publications from the research network since late 2008
  • 49. What is UNIX?
      • UNIX and UNIX-like systems are a family of computer operating systems originally developed at AT&T's Bell Labs
          • Apple OS X and iOS (UNIX)
          • Linux (UNIX-Like)
  • 50. Intro
      • The terminal (command-line) isn't THAT scary. Maintaining a Linux environment can be challenging, but most of these analyses can also be done in an OS X environment
      • Installing software can sometimes be cumbersome and confusing; however, many standard bioinformatics programs and software libraries are fairly easy to set up
      • Working with the programs from the command-line will often give you a better appreciation for what the program does and what it requires
  • 51. Terms to Know
      • Path: The location of a directory, file, or command on the computer.
          • Example: /Users/dan (OS X home directory)
  • 52. The Commands You Need to Know
      • ls: Lists the files in the current directory. Directories (folders) are just a special type of file themselves
      • cd: Change directory
      • pwd: View the full path of the directory you are currently in
      • cat: Displays the contents of a file on the terminal screen
      • head / tail: Displays the top or bottom contents of a file to the screen respectively
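A short practice session with these commands might look like the sketch below. It works in a throwaway temporary directory, and the file and directory names (`notes.txt`, `results`) are made up for the exercise.

```shell
# Work in a throwaway directory so nothing important is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"
pwd                                   # full path of the current directory

# Create a five-line text file and an empty sub-directory to look at.
printf 'line %s\n' 1 2 3 4 5 > notes.txt
mkdir results

ls                                    # lists notes.txt and results
cat notes.txt                         # whole file
head -n 2 notes.txt                   # first two lines
tail -n 1 notes.txt                   # last line

cd results                            # move into the sub-directory
pwd
cd ..                                 # and back up one level
```

On real FastQ files, `head` is the safer way to peek at the contents, since `cat` would print many gigabytes to the screen.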