RNA-seq analysis

Mikael Huss
Bioinformatics scientist at WABI (Wallenberg Advanced Infrastructure for Bioinformatics), Science for Life Laboratory / DBB, Stockholm University

February 13, 2013
  
Omics, biology and diseases

Genomics + RNA profiles + protein “parts list” + protein profiles + interactomics

Systems biology

Pathways, molecular targets, diagnostics
  
Approximate contents of talk



- Gene expression analysis in general; differences between RNA-seq and microarrays

- Typical workflow(s) for RNA-seq analysis

- Normalization issues

- Visualization

- Differential expression analysis




I have tried to include many references so you can go back to these slides for
reference afterwards
How DNA gets transcribed to RNA (and then translated to proteins) varies between e.g.

- Tissues
- Cell types
- Cell states
- Individuals
  
What can gene expression tell us?

Basic research

- How do gene expression patterns determine cellular identity (tissues, cell types …)?
- How does gene expression control early development in an embryo?
- What kinds of genes are expressed in response to specific stimuli (infections, smoking, environmental pollution, gym exercise …)?
- What kinds of genes do bacteria or other microorganisms express in the human gut / in soil / in oceans under different conditions?

… and much, much more …
  
What can gene expression tell us?

Diseases

- Which genes are over- (or under-)expressed in patients vs. healthy controls?
- Which genes are correlated to disease progression?
- Can markers of hidden disease be found by sequencing blood plasma?
  
Gene expression signatures for disease?

Hypothesis: Cell types are stable states in a “space” of gene expression patterns.

Diseases (e.g. cancers) distort the gene expression so that the cell ends up in the wrong stable state.

Furusawa and Kaneko, Biology Direct 2009, 4:17
  
Can the research community find such patterns?

Online prediction competitions, objectively scored by the organizers

- Diagnosing MS (multiple sclerosis), lung cancer, psoriasis, COPD (KOL)
- Prognosticating breast cancer outcome
  
Human tissue RNA-seq data sets
  



Genotype-Tissue Expression project
http://commonfund.nih.gov/GTEx/

Illumina Human Body Map
accessed via ReCount database, bowtie-bio.sourceforge.net/recount/

Wang 2008 data set of ~15 human tissues
accessed via ReCount

RNA-seq Atlas
http://medicalgenomics.org/rna_seq_atlas

Human Protein Atlas
http://www.proteinatlas.org (tissue RNA-seq data not yet publicly released)
Tools for genome-scale gene expression measurements

Microarrays (ca. 1995)
- Sometimes called “gene chips”
- Based on hybridization

RNA sequencing (ca. 2008 in current form)
- Based on sampling
  
Typical (m)RNA-seq experiment

[Figure: steps of a typical mRNA-seq experiment, with the “library” and the resulting reads marked. Source: http://cmb.molgen.mpg.de]
  
Alternative: rRNA depletion
  



There are various kits for depleting rRNA instead

Pluses:
- Can use for microorganisms that don’t have poly-A tails
- Thus, can use for simultaneous host/pathogen expression profiling
- Can find non-coding RNA

Minuses:
- Usually leaves in quite a lot of rRNA
- In practice, often variable efficiency between samples -> hard to compare results
Sequencing platforms

              ABI 3730xl           454 Life Sciences    SOLiD + Illumina     Pacific Biosciences,
              Sanger sequencing    pyrosequencing                            Oxford Nanopore etc.
                                                                             Single-molecule sequencing

Generation    “old school”         “2nd gen”            “2nd gen”            “3rd gen”
Length/read   800 bp               400 bp               100 bp               20 000+ bp
Reads/run     96                   1 million            2 billion            5 million
Bases/run     60 kbp               400 Mbp              500 Gbp              100 Gbp
Speed         10 years/HG          1 month/HG           1 day/HG             10 min/HG

(HG = human genome)
  
Microarray: Hybridization

[Figure: microarray hybridization. Source: Wikipedia]

The design of the microarray determines what you can detect in a sample
  
RNA sequencing: Sampling

It is possible to detect transcripts that are not known a priori (in advance)
  
RNA-seq advantages

The non-dependence on a reference makes possible:

- meta-transcriptomics
- detecting novel splice variants
- detecting novel transcripts
    - Fusion transcripts
    - Non-coding transcripts
  
Some examples

[Figure: examples from RNA-seq Atlas, Wang 2008 and HPA, with skeletal muscle and adipose tissue samples marked]
What does one do with RNA-seq reads?

- Mapping (also called alignment)
- (de novo) Assembly
  
Mapping (alignment) vs. assembly

Imagine a book being ripped to pieces with word or sentence fragments ending up on each piece of paper.

If you have a copy of the book that you can compare the pieces to, you have a mapping (alignment) problem.

If you have no copy of the book, you have a de novo assembly problem.
  
Mapping to a reference genome

Reads from the sequencer (figure labels: sequencing error, genetic variation)

    CAATCAGA G TCCCACTGTGG
    AGACG TCCCACTGTGGGGTG
    GTGAAGTGTCCGTAGATGTGTG
    GCAAATGCAATCAGACG TCCC

Gene (or transcript) sequence
  
Mapping to the genome vs. the transcriptome
  



Vs. the genome:
- Can (in principle) detect new transcripts, splice variants
- Less sensitive, need a lot of coverage to discover new things
- Need a “splice-aware” aligner such as TopHat, MapSplice, RUM etc.

Vs. the transcriptome:
- Not unbiased anymore, tied to existing annotation
- Faster, more sensitive, need less coverage

The best of both worlds?
- Tools like TopHat (v1.4 and up) now do both
If it had been de novo assembly

    CAATCAGA G TCCCACTGTGG
    AGACG TCCCACTGTGGGGTG
    GTGAAGTGTCCGTAGATGTGTG
    GCAAATGCAATCAGACG TCCC

Assembly

           CAATCAGA G TCCCACTGTGG
                AGACG TCCCACTGTGGGGTG
    GCAAATGCAATCAGACG TCCC

    “singleton”: GTGAAGTGTCCGTAGATGTGTG

Consensus sequence(s)
  


                                                   	
  
                                                 	
  
Assembly of RNA-seq reads
  




Will not be discussed much further here.

Most popular de novo assemblers build de Bruijn graphs where overlapping k-mers
are connected to each other. The programs then try to find paths through the graph

Typically needs a LOT of RAM. Can try to pre-process using “digital normalization”

Tools:
    - Trinity
    - Velvet/Oases
    - CLC Bio (commercial)
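
The de Bruijn graph idea mentioned above can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not how Trinity or Velvet/Oases are implemented: the two reads and k = 4 are made up, the reads are assumed to be error-free, and the helper names (build_de_bruijn, walk_unambiguous_path) are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Connect each k-mer's (k-1)-base prefix to its (k-1)-base suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while there is exactly one way forward."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:  # stop instead of looping forever in a cycle
            break
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig


# Two made-up, error-free reads that overlap by five bases (GGCGT).
reads = ["ATGGCGT", "GGCGTGC"]
graph = build_de_bruijn(reads, k=4)

# Start from a node that no edge points to (the left end of the path).
targets = {t for successors in graph.values() for t in successors}
start = sorted(node for node in graph if node not in targets)[0]
contig = walk_unambiguous_path(graph, start)
print(contig)
```

Real assemblers also have to deal with sequencing errors, repeats and alternative isoforms, which create branches in the graph; that is a large part of why they need so much memory.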
Assembly of RNA-seq reads
  

Typical workflow could be:

- Clean the reads properly (remove adapters, low-quality reads)
     - Useful tools: FastQC, PRINSEQ, FASTX toolkit etc.

- Run assembly tool of choice, resulting in a set of contigs

- BLAST the contigs against nt database, check for % overlap by transcript in
related organisms

- Map your original reads back to the contigs and count the reads overlapping each

[Figure: comparison of assembly & mapping]
Quantifying expression with RNA-seq

Microarrays give a continuous (floating-point) expression value for each gene

RNA-seq gives an integer value for each gene (“digital expression”): read counts
  
Example (SciLifeLab) mapping workflow

FASTQ file(s)
   -> TopHat 2.0
BAM file
   -> Picard tools (SortSam, MarkDuplicates)
Sorted BAM file with duplicate reads removed
   -> HTSeq 0.5 -> Gene-level count files (for DE analysis)
   -> Cufflinks 2.0 -> Gene- and isoform-level expression estimates (FPKM, for reporting)
RNA-seq mapping: different isoforms

Isoform 1: Exon 1, Exon 2, Exon 3
Isoform 2: Exon 1, Exon 2

[Figure: what it would look like mapped to the genome, with reads on Exon 1, Exon 2 and Exon 3]

Need a special mapping algorithm which allows large gaps, a “split-read aligner”

[Figure: what we would actually observe – of course we don’t know which reads come from which isoform]

Statistical algorithms are needed to estimate what proportion of reads comes from which isoform (for example, maximum likelihood / expectation maximization).
  
Name                     Free (F) / Commercial (C) /   Type of approach
                         Description only (D)

Xing et al. 2006         D                             Maximum likelihood
Partek                   C                             Maximum likelihood
Li et al. 2010           D                             Maximum likelihood
Avadis                   C                             Maximum likelihood
IsoEM                    F                             Maximum likelihood
MISO                     F                             Maximum likelihood (MCMC)
Cufflinks                F                             Maximum likelihood
rQuant                   F                             Least squares (quadratic programming)
Rpkmforgenes.py          F                             Least squares
Howard and Heber 2010    D                             Least squares
FluxCapacitor            F                             Linear programming
CLC Bio                  C                             ?
NSMAP                    F                             Nonnegative Sparse Maximum A Posteriori
ALEXA-SEQ                F                             Use only reads that are compatible with a single isoform
NEUMA                    D                             Normalization by Expected Uniquely Mappable Area
  
Some remarks on isoform quantification


- It is necessary for correct gene-level quantification as well, because straight read counting methods can never be fully correct (from the 2012 CuffDiff2 paper)

- Xing et al. (2006) gave the basic idea for EM-based isoform quantification, to which other programs (Cufflinks, MISO, IsoEM, …) have added various “bells and whistles”

- It is actually pretty hard to do isoform quantification well because there can be a lot of possible isoforms -> not enough sequence coverage to estimate
Basic idea of the EM approach


We have a set of reads mapping to some locus
   - Some fit one specific isoform
   - Some fit several isoforms

If we knew the isoforms’ expression levels, we could distribute the reads proportionally
to those. But we don’t!

On the other hand, if we knew the probability of each read to match each isoform, we
could estimate the isoforms’ expression pretty well. But we don’t know that either.

So … start with a guess and iterate!

- Assign reads to isoforms according to some initial guess
- Re-estimate isoform expression levels
- Repeat until convergence!
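
The guess-and-iterate loop above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of Cufflinks, MISO or IsoEM: it assumes two isoforms, ignores isoform length and fragment biases, runs a fixed number of iterations instead of testing for convergence, and uses made-up read counts. The function name em_isoform_fractions is hypothetical.

```python
def em_isoform_fractions(read_compat, n_isoforms, n_iter=200):
    """read_compat: one set of compatible isoform indices per read."""
    theta = [1.0 / n_isoforms] * n_isoforms  # initial guess: equal expression
    for _ in range(n_iter):
        assigned = [0.0] * n_isoforms
        # Assign each read fractionally, in proportion to the current estimates.
        for compatible in read_compat:
            total = sum(theta[i] for i in compatible)
            for i in compatible:
                assigned[i] += theta[i] / total
        # Re-estimate expression as the share of reads assigned to each isoform.
        theta = [a / len(read_compat) for a in assigned]
    return theta


# Made-up locus: 30 reads fit only isoform 0, 10 fit only isoform 1,
# and 60 reads fit both isoforms.
reads = [{0}] * 30 + [{1}] * 10 + [{0, 1}] * 60
theta = em_isoform_fractions(reads, n_isoforms=2)
print([round(t, 3) for t in theta])
```

In this toy case the ambiguous reads end up being split 3:1, the same ratio as the unambiguous reads, so the estimates settle at 0.75 and 0.25.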
Gene fusion detection with RNA-seq

Beyond isoforms: Detect pieces of different genes that have been fused

Look for reads that map in “wrong” ways

Wang et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs044
  
Some further comments on microarrays and RNA-seq

- Microarrays are still cheaper and faster.
    - You may be able to run more replicates, which is important for statistical power.

- RNA-seq has a wider measurement range.
    - Low expressed transcripts:
        - Microarrays have high background signal -> poor measurement
        - RNA-seq can measure well if you sequence very deeply
    - Medium expressed transcripts:
        - Microarrays measure well
        - RNA-seq measures well if sequenced relatively deeply
    - High expressed transcripts:
        - Microarrays measure poorly because of saturation
        - RNA-seq measures well

- Less is understood about how to pre-process and normalize RNA-seq data.

- One interesting aspect of RNA-seq: You can continue to sequence a sample more to obtain better gene expression estimates.
  
Analysis

- Pre-processing and normalization
- Visualization
- Differential gene expression analysis
- (Gene set analysis, pathway analysis, gene expression signatures … -> try to find the biological significance)
  
Pre-processing

Why do we do pre-processing and normalization of RNA-seq (or microarray) data?

- To correct for batch effects
    - Different labs
    - Different preparation times
    - Etc.
- To correct for intrinsic technical biases in the technologies
- To make the expression value distributions conform to some assumptions in order to perform statistical tests
  
RNA-seq pre-processing

For RNA-seq data, it is still less well understood than for microarrays how one should pre-process and normalize the data. Let’s look at some aspects (that sometimes apply to both RNA-seq and microarray data)
  
R and Bioconductor

Very helpful for (e.g.) microarray and RNA-seq differential expression analysis

Microarray:
- affy, lumi (read raw microarray signal files & preprocess)
- limma (differential expression analysis with complex designs)

RNA-seq:
- DESeq, edgeR, baySeq (differential expression analysis based on count data)
- SAMSeq (nonparametric differential expression analysis)
Variance stabilization
  

Raw data
(could be microarray signal or RNA-seq counts)

Higher value -> higher variability (noise)




Log transform

Lower value -> higher variability. Too aggressive




Variance stabilizing transform
e.g. voom() in limma package

                 http://bridgecrest.blogspot.se/2011_09_01_archive.html
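
The three situations above (raw, log, variance-stabilized) can be reproduced numerically with a small Python sketch. It rests on assumptions that are not in the slides: counts are taken to be pure Poisson (real RNA-seq counts are usually more variable), and the square root is used as the stabilizing transform because it is the textbook choice for Poisson data. It is not what voom() does. The variances are computed exactly from the Poisson probabilities rather than by simulation.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def variance_after(transform, lam):
    """Exact variance of transform(X) for X ~ Poisson(lam), via the pmf."""
    upper = int(lam + 20 * math.sqrt(lam) + 50)  # tail beyond this is negligible
    probs = [poisson_pmf(k, lam) for k in range(upper + 1)]
    mean = sum(p * transform(k) for k, p in enumerate(probs))
    return sum(p * (transform(k) - mean) ** 2 for k, p in enumerate(probs))


transforms = {
    "raw": lambda x: x,
    "log2(x + 1)": lambda x: math.log2(x + 1),
    "sqrt(x)": math.sqrt,
}
# Compare a low-count gene (mean 10) with a high-count gene (mean 1000).
variances = {
    name: {lam: variance_after(f, lam) for lam in (10, 1000)}
    for name, f in transforms.items()
}
for name, by_mean in variances.items():
    print(name, {lam: round(v, 4) for lam, v in by_mean.items()})
```

Under these assumptions the raw variance grows with the mean, the log transform flips the relation so that the low-count gene is the noisier one, and the square root gives a variance close to 0.25 at both means.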
Quantifying expression with RNA-seq

If you want to compare RNA-seq counts between different genes and/or samples, consider:

- Longer genes/transcripts are expected to generate more reads
- The more you sequence, the more reads you get from each gene

Therefore, the standard measure has been RPKM (reads per kilobase per million mapped reads), which corrects for transcript length and sequencing depth:

$$
\mathrm{RPKM} = \frac{X_t \,/\, \left( l_{\mathrm{eff},t} / 10^3 \right)}{N_{\mathrm{lib}} / 10^6} = \frac{10^9 \cdot X_t}{N_{\mathrm{lib}} \cdot l_{\mathrm{eff},t}}
$$

where
- $X_t$: number of reads mapped to transcript/gene/… t
- $N_{\mathrm{lib}}$: number of mapped reads in the library
- $l_{\mathrm{eff},t}$: effective length of transcript/gene/… t

FPKM is a paired-end version of this
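
As a quick check of the RPKM definition, here is a minimal Python sketch. The function name rpkm and all numbers are hypothetical; the arithmetic is the standard reads-per-kilobase-per-million calculation, done once in a single expression and once in two steps.

```python
def rpkm(reads_on_transcript, effective_length_bp, mapped_reads_in_library):
    """RPKM = 10^9 * X_t / (N_lib * l_eff,t)."""
    return 1e9 * reads_on_transcript / (mapped_reads_in_library * effective_length_bp)


# Hypothetical numbers: 500 reads on a 2000-bp transcript, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)

# Same calculation in two steps: reads per kilobase, then per million mapped reads.
reads_per_kb = 500 / (2000 / 1e3)
two_step = reads_per_kb / (10_000_000 / 1e6)
print(value, two_step)
```

Both forms give 25 for these numbers, which mirrors the two sides of the formula.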
Alternatives
  




TPM – “transcripts per million”


A slightly modified RPKM measure that
accounts for differences in gene length
distribution in the transcript population
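
A minimal Python sketch of TPM, assuming the usual definition: divide each gene's read count by its length, then rescale so that the values sum to one million. The gene names, counts and lengths are made up, and the function name tpm is hypothetical.

```python
def tpm(counts, lengths_bp):
    """Length-normalized read rates, rescaled to sum to one million."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: 1e6 * rate / total_rate for gene, rate in rates.items()}


# Made-up example with three genes.
counts = {"gene_a": 100, "gene_b": 100, "gene_c": 300}
lengths_bp = {"gene_a": 1000, "gene_b": 2000, "gene_c": 3000}
values = tpm(counts, lengths_bp)
print(values)
```

Because the values always sum to one million within a sample, a TPM value can be read as a share of the transcript pool in that sample.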
Alternatives

  TMM – “trimmed mean of M values”

  Attempts to correct for differences in RNA composition between samples

  E.g. if certain genes are very highly expressed in one tissue but not another, there will be less “sequencing real estate” left for the less expressed genes in that tissue, and RPKM normalization (or similar) will give biased expression values for them compared to the other sample

[Figure: RNA population 1 and RNA population 2]

Equal sequencing depth -> orange and red will get lower RPKM in RNA population 1 although the expression levels are actually the same in populations 1 and 2

Robinson and Oshlack, Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
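
The composition effect can be reproduced with a toy calculation in Python. The sketch below is a heavily simplified stand-in for TMM, not the published method: it uses an unweighted trimmed mean of per-gene log ratios and skips the weighting and the trimming on absolute expression that the paper describes. Gene names and counts are made up, and the function names are hypothetical.

```python
import math


def cpm(counts):
    """Counts per million: scale by total library size only."""
    total = sum(counts.values())
    return {gene: 1e6 * c / total for gene, c in counts.items()}


def trimmed_mean_log_ratio(sample, reference, trim=0.3):
    """Unweighted trimmed mean of log2 ratios of library-size-scaled counts."""
    n_s, n_r = sum(sample.values()), sum(reference.values())
    log_ratios = sorted(
        math.log2((sample[g] / n_s) / (reference[g] / n_r))
        for g in sample
        if sample[g] > 0 and reference.get(g, 0) > 0
    )
    k = int(len(log_ratios) * trim)
    kept = log_ratios[k:len(log_ratios) - k] or log_ratios
    return sum(kept) / len(kept)


# Both libraries have 1000 reads. Genes a-e are equally expressed in both
# populations, but in population 1 a dominant gene uses up 60% of the reads.
population_1 = {"a": 80, "b": 80, "c": 80, "d": 80, "e": 80, "dominant": 600}
population_2 = {"a": 200, "b": 200, "c": 200, "d": 200, "e": 200, "dominant": 0}

cpm_1, cpm_2 = cpm(population_1), cpm(population_2)
scale = 2 ** trimmed_mean_log_ratio(population_1, population_2)
corrected_1 = {gene: value / scale for gene, value in cpm_1.items()}
print(cpm_1["a"], cpm_2["a"], corrected_1["a"])
```

With library-size scaling alone, gene a looks 2.5 times lower in population 1. Dividing by the estimated scale factor (0.4 here) brings it back in line with population 2.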
Across-sample comparability
  




Dillies et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs046
Practical issues with normalization methods
  


Limma / voom can give negative values


TMM cannot be done on a single sample
RNA-seq pre-processing

In RNA-seq, normalization of counts is often interwoven with differential expression analysis and done implicitly in DE packages such as DESeq, edgeR etc.

Normalized values like RPKM are usually only used for reporting expression values, not testing for differential expression.

Why?
  
Count nature of RNA-seq data

These methods want to use the added statistical power provided by the count nature of RNA-seq data.

Simplified toy example:

Scenario 1: A 30000-bp transcript has 1000 counts in sample A and 700 counts in sample B.

Scenario 2: A 300-bp transcript has 10 counts in sample A and 7 counts in sample B.

Assume that the sequencing depths are the same in both samples and both scenarios. Then the RPKM for sample A is the same in both scenarios, and likewise for sample B.

In scenario 1, we can be more confident that there is a true difference in the expression level than in scenario 2 (although we would want more replicates of course!). By analogy to a coin flip: 700 heads out of 1000 trials gives much more confidence that a coin is biased than 7 heads out of 10 trials.
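
The coin-flip analogy can be made concrete with an exact binomial calculation in Python. This is only an illustration of the analogy, not the test that DESeq or edgeR perform; the function name two_sided_binomial_p is hypothetical.

```python
from math import comb


def two_sided_binomial_p(heads, flips):
    """Exact two-sided p-value against a fair coin (symmetric, so double one tail)."""
    k = max(heads, flips - heads)
    upper_tail = sum(comb(flips, i) for i in range(k, flips + 1))
    return min(1.0, 2 * upper_tail / 2 ** flips)


p_small = two_sided_binomial_p(7, 10)
p_large = two_sided_binomial_p(700, 1000)
print(p_small, p_large)
```

Seven heads out of ten is entirely compatible with a fair coin (p is about 0.34), whereas 700 out of 1000 is not, even though the proportion of heads is identical.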
Visualization

Can be useful for “sanity checking”, outlier detection and exploratory analysis in general

Examples of useful visualizations

- Heat maps
- PCA/MDS/NMF
- Box plots, violin plots etc.
  
Box plots
  




Useful for comparing groups

Adding the actual data points is optional but can be interesting
Sample correlation heat maps

Heat maps are ubiquitous in transcriptomics
Correlations between samples, hierarchical clustering

Used for “sanity checks”, outlier detection

[Figure: two example heat maps, labeled “Two tissues” and “Batch effects”]
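
The numbers behind a sample correlation heat map can be sketched as follows in Python. The sample names and expression values are made up, and the choice of Pearson correlation on log2(x + 1) values is an assumption for illustration; plotting and hierarchical clustering are left out.

```python
import math


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Made-up expression values for five genes in four samples from two tissues.
samples = {
    "muscle_1": [500, 20, 3, 80, 0],
    "muscle_2": [450, 25, 5, 70, 1],
    "fat_1": [10, 300, 90, 75, 40],
    "fat_2": [12, 280, 110, 60, 35],
}
logged = {name: [math.log2(v + 1) for v in values] for name, values in samples.items()}

# Sample-by-sample correlation matrix; this is what the heat map would color.
corr = {a: {b: pearson(logged[a], logged[b]) for b in logged} for a in logged}
for name, row in corr.items():
    print(name, [round(r, 2) for r in row.values()])
```

Samples from the same tissue should correlate more strongly with each other than with the other tissue; a sample that breaks this pattern is a candidate outlier or a sign of a batch effect.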
Gene / sample heat maps

With a smaller collection of genes, one sometimes looks at gene/sample heat maps
PCA plots

Another way to see how samples cluster

Nice thing with PCA: you can also see how much each gene contributes to each principal component -> a kind of feature selection
Alternatives to PCA

NMF: non-negative matrix factorization. Also a matrix decomposition technique (like PCA)

“A bioinformatic assay for pluripotency in human cells”, Nature Methods, doi:10.1038/nmeth.1580
PCA plot of human tissue RNA-seq

Red – GTEx
Green – Body Map
Black – Human Protein Atlas
# of genes taking up X% of sequences

[Figure: GTEx RPKM, with the genes HBA1, HBB and HBA2 labeled]
[Figure: GTEx]
[Figure: Wang/Sandberg]
Differential expression analysis
  




Many tools available!


Easily the most common type of analysis, even though it is understood that
gene expression levels are not independent of each other, and should in
principle be considered together.


However, since the number of samples is typically << the number of
measured genes, a full model is usually not feasible to construct in practice.
Some sort of feature selection is needed.
Differential expression analysis
  




One would simply like to do a t-test or something like that for each gene, but
…

- Assumes normal distribution & no mean-variance dependence
- Hard to estimate variance from few samples
- Multiple testing issue
Parametric vs. non-parametric methods
  


It would be nice to not have to assume anything about the expression value distributions but only use rank-order statistics -> methods like SAM (Significance Analysis of Microarrays) or SAMSeq (equivalent for RNA-seq data)

However, it is (typically) harder to show statistical significance with non-parametric methods with few replicates.

My rule of thumb:

- Many replicates (~ >10) in each group -> use SAM(Seq)
- Otherwise use DESeq or other parametric method

Note that Simon Anders (creator of DESeq) says that non-parametric methods are definitely better with 12 replicates and maybe already at five

http://seqanswers.com/forums/showpost.php?p=74264&postcount=3
Standard DE methods
  


Limma (microarrays, RNA-seq)
edgeR, DESeq (RNA-seq)

Distributional issue: Solved by variance stabilizing transform in limma

edgeR and DESeq model the count data using a negative binomial distribution and
use their own modified statistical tests based on that.

Multiple testing issue: All of these packages report false discovery rate (corrected
p values).

Variance estimation issue: These packages (in slightly different ways) “borrow”
information across genes to get a better variance estimate. One says that the
estimates “shrink” from gene-specific estimates towards a common mean value.
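
To illustrate what “corrected p values” means for the multiple testing issue, here is a minimal Python sketch of the Benjamini-Hochberg adjustment. Using Benjamini-Hochberg here is an assumption for illustration; the slides do not say which procedure each package applies. The p-values are made up and the function name benjamini_hochberg is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
print(adjusted)
```

Calling genes significant when the adjusted value is below, say, 0.05 then aims to keep the expected fraction of false positives among the called genes at or below 5%.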
CuffDiff2

Integrates isoform quantification + differential expression analysis
Complex designs
  


The simplest case is when you just want to compare two groups against each other.

But what if you have several factors that you want to control for?

E.g. you have taken tumor samples at two different time points from six patients,
cultured the samples and treated them with two different anticancer drugs and a mock
control treatment. -> 2x6x3 = 36 samples.

Now you want to assess the differential expression in response to one of the
anticancer drugs, drug X. You could just compare all “drug X” samples to all control
samples but the inter-subject variability might be larger than the specific drug effect.

 Enter limma / DESeq / edgeR which can work with factorial designs

(SAMSeq cannot, which is another reason one might not want to use it)
Limma and factorial designs

limma stands for “linear models for microarray data”

Essentially, the expression of each gene is modeled with a linear relation

http://www.math.ku.dk/~richard/courses/bioconductor2009/handout/19_08_Wednesday/KU-August2009-LIMMA/PPT-PDF/Robinson-limma-linear-models-ku-2009.6up.pdf

The design matrix describes all the conditions, e.g. treatment, patient, time etc.

y = a + b*treatment + c*time + d*patient + e

where a is the baseline/average and e is the error term/noise.
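
As a rough sketch of what fitting such a model for one gene involves, the example below builds a small design matrix and solves for the coefficients by ordinary least squares in plain Python. The data, the coefficient values and the helper names are invented, and the factors are coded as simple 0/1 columns; limma's own fitting adds the variance moderation described earlier and handles multi-level factors.

```python
from itertools import product


def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_linear_model(design, y):
    """Least-squares coefficients from the normal equations X'X beta = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical experiment: every combination of treatment, time and patient (0/1).
samples = list(product([0, 1], repeat=3))
# Design matrix columns: intercept (a), treatment (b), time (c), patient (d).
design = [[1, t, h, p] for t, h, p in samples]
# Noise-free toy expression values generated from known coefficients.
y = [5.0 + 2.0 * t + 0.5 * h + 3.0 * p for t, h, p in samples]

a, b, c, d = fit_linear_model(design, y)
print(f"a={a:.2f} b={b:.2f} c={c:.2f} d={d:.2f}")
```

The treatment coefficient b is the quantity of interest here: it is the estimated drug effect after time and patient have been accounted for.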
Recent DE software comparison
Take-away messages from DE tool comparison

- CuffDiff2, which should theoretically be better, seems to work worse, probably
due to the increased “statistical burden” from isoform expression estimation

- The HTSeq quantification which is theoretically “wrong” seems to give good
results with downstream software

- It is practically always better to sequence more biological replicates than to
sequence the same samples deeper

Omitted from this comparison
    - gains from ability to do complex designs
    - non-parametric methods
The end

Contact me at mikael.huss@scilifelab.se if you have any questions

RNA-seq Analysis

  • 1. RNA-­‐seq  analysis   Mikael  Huss   Bioinforma7cs  scien7st  at  WABI  (Wallenberg   Advanced  Infrastructure  for  Bioinforma7cs),  Science   for  Life  Laboratory  /  DBB,  Stockholm  university     February  13,  2013  
  • 2. Omics,  biology  and  diseases   + + + + Protein “parts Protein Genomics RNA profiles Interactomics list” profiles Systems biology Pathways,  molecular  targets,  diagnos5cs  
  • 3. Approximate contents of talk - Gene expression analysis in general; differences between RNA-seq and microarrays - Typical workflow(s) for RNA-seq analysis - Normalization issues - Visualization - Differential expression analysis I have tried to include many references so you can go back to these slides for reference afterwards
  • 4. How  DNA  get  transcribed  to  RNA  (and  then   translated  to  proteins)  varies  between  e.  g.   -­‐Tissues   -­‐ Cell  types   -­‐ Cell  states   -­‐Individuals  
  • 5. What  can  gene  expression  tell  us?   Basic  research   -­‐ How  do  gene  expression  paUerns  determine  cellular  iden7ty?  (7ssues,  cell  types  …)   -­‐ How  does  gene  expression  control  early  development  in  an  embryo?   -­‐ What  kinds  of  genes  are  expressed  in  response  to  specific  s7muli  (infec7ons,  smoking,   environmental  pollu7on,  gym  exercise  …)?   -­‐ What  kinds  of  genes  do  bacteria  or  other  microorganisms  express  in  the  human  gut  /  in   soil  /  in  oceans  under  different  condi7ons?   …  and  much,  much  more  …  
  • 6. What  can  gene  expression  tell  us?   Diseases   -­‐ Which  genes  are  over-­‐  (or  under-­‐)expressed  in  pa7ents  vs.  healthy  controls?   -­‐ Which  genes  are  correlated  to  disease  progression?   -­‐ Can  markers  of  hidden  disease  be  found  by  sequencing  blood  plasma?  
  • 7. Gene  expression  signatures  for  disease?   Hypothesis:   Cell  types  are  stable   states  in  a  “space”  of   gene  expression  paUerns.   Diseases  (e  g  cancers)   distort  the  gene   expression  so  that  the  cell   ends  up  in  the  wrong   stable  state.   Furusawa  and  Kaneko,  Biology  Direct  2009  4:17    
  • 8. Can  the  research  community  find  such  paUerns?   On-­‐line  predic7on  compe77ons,  objec7vely  scored  by  the  organizers   Diagnosing  MS  (mul/ple  sclerosis),  lung  cancer,  psoriasis,  COPD  (KOL)   Prognos/ca/ng  breast  cancer  outcome  
  • 9. Human  7ssue  RNA-­‐seq  data  sets   Genotype-Tissue Expression project http://commonfund.nih.gov/GTEx/ Illumina Human Body Map accessed via ReCount database, bowtie-bio.sourceforge.net/recount/ Wang 2008 data set of ~15 human tissues accessed via ReCount RNA-seq Atlas http://medicalgenomics.org/rna_seq_atlas Human Protein Atlas http://www.proteinatlas.org (tissue RNA-seq data not yet publicly released)
  • 10. Tools  for  genome-­‐scale  gene  expression  measurements   Microarrays  (c:a  1995)   Some7mes  called  “gene  chips”   Based  on  hybridiza7on   RNA  sequencing  (c:a  2008  in  current  form)   Based  on  sampling  
  • 11. Typical  (m)RNA-­‐seq  experiment   “library”  -­‐>   <-­‐  reads   hUp://cmb.molgen.mpg.de  
  • 12. Alterna7ve:  rRNA  deple7on   There are various kits for depleting rRNA instead Pluses: - Can use for microorganisms that don’t have poly-A tails - Thus, can use for simultaneous host/pathogen expression profiling - Can find non-coding RNA Minuses: -Usually leaves in quite a lot of rRNA -In practice, often variable efficiency between samples -> hard to compare results
  • 13. Sequencing platforms
    Platform                                    Method                       Length/read   Reads/run   Bases/run   Speed         Generation
    ABI 3730xl                                  Sanger sequencing            800 bp        96          60 kbp      10 years/HG   “old school”
    454 Life Sciences                           pyrosequencing               400 bp        1 million   400 Mbp     1 month/HG    “2nd gen”
    SOLiD + Illumina                                                         100 bp        2 billion   500 Gbp     1 day/HG      “2nd gen”
    Pacific Biosciences, Oxford Nanopore etc.   single-molecule sequencing   20 000+ bp    5 million   100 Gbp     10 min/HG     “3rd gen”
    (HG: human genome)
  • 14. Microarray: Hybridization
    [Figure. Source: Wikipedia]
    The design of the microarray determines what you can detect in a sample
  • 15. RNA sequencing: Sampling
    It is possible to detect transcripts that are not known a priori (in advance)
  • 16. RNA-seq advantages
    The non-dependence on a reference makes possible:
    - Meta-transcriptomics
    - Detecting novel splice variants
    - Detecting novel transcripts
    - Fusion transcripts
    - Non-coding transcripts
  • 18. Some examples
    [Figure panels: RNA-seq Atlas; Wang 2008]
  • 19. Some examples
    [Figure panels: RNA-seq Atlas; Wang 2008; HPA. Marked regions: skeletal muscle, adipose tissue]
  • 20. What does one do with RNA-seq reads?
    - Mapping (also called alignment)
    - (de novo) Assembly
  • 21. Mapping (alignment) vs. assembly
    Imagine a book being ripped to pieces with word or sentence fragments ending up on each piece of paper.
    If you have a copy of the book that you can compare the pieces to, you have a mapping (alignment) problem.
    If you have no copy of the book, you have a de novo assembly problem.
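The mapping half of the book analogy can be sketched in a few lines. This is only a toy illustration, not how real aligners work: it slides each read along a reference and accepts the placement with the fewest mismatches. The reference string, the reads, and the names `hamming` and `map_read` are all made up for this sketch.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
    return best


# Hypothetical reference ("the copy of the book") and reads ("the pieces").
reference = "GCAAATGCAATCAGACGTCCCACTGTGGGGTG"
reads = [
    "CAATCAGACGTCC",  # matches the reference exactly
    "CAATCAGATGTCC",  # one mismatch (sequencing error or genetic variation)
    "TTTTTTTTTTTTT",  # does not come from this reference
]

hits = [map_read(r, reference) for r in reads]
print(hits)
```

Real mappers use indexes rather than a linear scan, and split-read aligners additionally allow large gaps for introns.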
  • 22. Mapping to a reference genome
    [Figure: reads from the sequencer placed along a gene (or transcript) sequence; labels: sequencing error, genetic variation]
    Reads from the sequencer:
    CAATCAGA G TCCCACTGTGG
    AGACG TCCCACTGTGGGGTG
    GTGAAGTGTCCGTAGATGTGTG
    GCAAATGCAATCAGACG TCCC
  • 27. Mapping to the genome vs. the transcriptome
    Vs. the genome:
    - Can (in principle) detect new transcripts, splice variants
    - Less sensitive, need a lot of coverage to discover new things
    - Need a “splice-aware” aligner such as TopHat, MapSplice, RUM etc.
    Vs. the transcriptome:
    - Not unbiased anymore, tied to existing annotation
    - Faster, more sensitive, need less coverage
    The best of both worlds?
    - Tools like TopHat (v1.4 and up) now do both
  • 28. If it had been de novo assembly
    Reads:
    CAATCAGA G TCCCACTGTGG
    AGACG TCCCACTGTGGGGTG
    GTGAAGTGTCCGTAGATGTGTG
    GCAAATGCAATCAGACG TCCC
    Assembly:
    CAATCAGA G TCCCACTGTGG
    AGACG TCCCACTGTGGGGTG
    GCAAATGCAATCAGACG TCCC
    “singleton”: GTGAAGTGTCCGTAGATGTGTG
    -> Consensus sequence(s)
  • 29. Assembly of RNA-seq reads
    Will not be discussed much further here.
    Most popular de novo assemblers build de Bruijn graphs where overlapping k-mers are connected to each other. The programs then try to find paths through the graph.
    Typically needs a LOT of RAM. Can try to pre-process using “digital normalization”.
    Tools:
    - Trinity
    - Velvet/Oases
    - CLC Bio (commercial)
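To make the de Bruijn graph idea a bit more concrete, here is a minimal sketch under strong simplifying assumptions: error-free reads, a single transcript, and no repeated (k-1)-mers. The reads and the helper names `de_bruijn` and `walk` are hypothetical, and real assemblers such as Trinity or Velvet do far more (error correction, bubble removal, isoform handling).

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Follow unambiguous edges from `start` and spell out the contig."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop if we would enter a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads from one short transcript.
reads = ["GCAAATGCAATC", "TGCAATCAGACG", "CAGACGTCCCAC"]
graph = de_bruijn(reads, k=6)
contig = walk(graph, "GCAAA")
print(contig)
```

The walk stops as soon as a node has zero or several successors, which is where a real assembler has to make decisions and where much of the memory and complexity goes.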
  • 30. Assembly of RNA-seq reads
    Typical workflow could be:
    - Clean the reads properly (remove adapters, low-quality reads)
      - Useful tools: FastQC, PRINSEQ, FASTX toolkit etc.
    - Run assembly tool of choice, resulting in a set of contigs
    - BLAST the contigs against nt database, check for % overlap by transcript in related organisms
    - Map your original reads back to the contigs and count the reads overlapping each
    [Figure: comparison of assembly & mapping]
  • 31. Quantifying expression with RNA-seq
    Microarrays give a continuous (floating-point) expression value for each gene
    RNA-seq gives an integer value for each gene (“digital expression”): read counts
  • 32. Example (SciLifeLab) mapping workflow
    FASTQ file(s)
    -> TopHat 2.0
    -> BAM file
    -> Picard tools (SortSam, MarkDuplicates)
    -> Sorted BAM file with duplicate reads removed
       -> HTSeq 0.5 -> Gene-level count files (for DE analysis)
       -> Cufflinks 2.0 -> Gene- and isoform-level expression estimates (FPKM, for reporting)
  • 33. RNA-seq mapping: different isoforms
    [Figure: Isoform 1: Exon 1, Exon 2, Exon 3. Isoform 2: Exon 1, Exon 2]
  • 34. (what it would look like mapped to the genome)
    [Figure: Exon 1, Exon 2, Exon 3]
    Need a special mapping algorithm which allows large gaps, a “split-read aligner”
  • 35. (what we would actually observe; of course we don’t know which reads come from which isoform)
    Statistical algorithms needed to estimate what proportion of reads comes from which isoform. (For example, maximum likelihood / expectation maximization)
  • 36. Isoform quantification tools (F = free, C = commercial, D = description only)
    Name                    F/C/D   Type of approach
    Xing et al. 2006        D       Maximum likelihood
    Partek                  C       Maximum likelihood
    Li et al. 2010          D       Maximum likelihood
    Avadis                  C       Maximum likelihood
    IsoEM                   F       Maximum likelihood
    MISO                    F       Maximum likelihood (MCMC)
    Cufflinks               F       Maximum likelihood
    rQuant                  F       Least squares (quadratic programming)
    Rpkmforgenes.py         F       Least squares
    Howard and Heber 2010   D       Least squares
    FluxCapacitor           F       Linear programming
    CLC Bio                 C       ?
    NSMAP                   F       Nonnegative Sparse Maximum A Posteriori
    ALEXA-SEQ               F       Use only reads that are compatible with a single isoform
    NEUMA                   D       Normalization by Expected Uniquely Mappable Area
  • 37. Some remarks on isoform quantification
    - It is necessary for correct gene-level quantification as well, because straight read counting methods can never be fully correct (from the 2012 CuffDiff2 paper)
    - Xing et al. (2006) gave the basic idea for EM-based isoform quantification, to which other programs (Cufflinks, MISO, IsoEM, …) have added various “bells and whistles”
    - It is actually pretty hard to do isoform quantification well because there can be a lot of possible isoforms -> not enough sequence coverage to estimate
  • 38. Basic idea of the EM approach
    We have a set of reads mapping to some locus
    - Some fit one specific isoform
    - Some fit several isoforms
    If we knew the isoforms’ expression levels, we could distribute the reads proportionally to those. But we don’t!
    On the other hand, if we knew the probability of each read to match each isoform, we could estimate the isoforms’ expression pretty well. But we don’t know that either.
    So … start with a guess and iterate!
    - Assign reads to isoforms according to some initial guess
    - Re-estimate isoform expression levels
    - Repeat until convergence!
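The guess-and-iterate loop above can be written down as a small sketch. It makes simplifying assumptions that real tools do not: all isoforms are treated as having the same effective length, every compatible read is equally likely under each compatible isoform, and a fixed number of iterations stands in for a convergence check. The function name `em_isoform_abundance` and the read counts are invented for illustration.

```python
def em_isoform_abundance(read_compat, n_isoforms, n_iter=200):
    """Estimate relative isoform abundances from read/isoform compatibility.

    read_compat: one tuple of compatible isoform indices per read.
    Isoform lengths are ignored (assumed equal) to keep the sketch short.
    """
    theta = [1.0 / n_isoforms] * n_isoforms  # initial guess: equal expression
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms,
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[i] for i in compat)
            for i in compat:
                counts[i] += theta[i] / total
        # M-step: re-estimate abundances from the fractional read counts.
        theta = [c / len(read_compat) for c in counts]
    return theta


# Hypothetical locus: 30 reads fit only isoform 0, 10 fit only isoform 1,
# and 60 reads are compatible with both.
reads = [(0,)] * 30 + [(1,)] * 10 + [(0, 1)] * 60
theta = em_isoform_abundance(reads, n_isoforms=2)
print(theta)
```

In this toy case the ambiguous reads end up split 3:1, following the ratio of the uniquely assignable reads, so the estimates settle at 0.75 and 0.25.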
  • 39. Gene fusion detection with RNA-seq
    Beyond isoforms: Detect pieces of different genes that have been fused
    Look for reads that map in “wrong” ways
    Wang et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs044
  • 40. Some further comments on microarrays and RNA-seq
    - Microarrays are still cheaper and faster.
    - You may be able to run more replicates, which is important for statistical power.
    - RNA-seq has a wider measurement range.
    - Lowly expressed transcripts:
      - Microarrays have high background signal -> poor measurement
      - RNA-seq can measure well if you sequence very deeply
    - Medium expressed transcripts:
      - Microarrays measure well
      - RNA-seq measures well if sequenced relatively deeply
    - Highly expressed transcripts:
      - Microarrays measure poorly because of saturation
      - RNA-seq measures well
    - Less is understood about how to pre-process and normalize RNA-seq data.
    - One interesting aspect of RNA-seq: You can continue to sequence a sample more to obtain better gene expression estimates.
  • 41. Analysis
    - Pre-processing and normalization
    - Visualization
    - Differential gene expression analysis
    - (Gene set analysis, pathway analysis, gene expression signatures … -> try to find the biological significance)
  • 45. Pre-processing
    Why do we do pre-processing and normalization of RNA-seq (or microarray) data?
    - To correct for batch effects
      - Different labs
      - Different preparation times
      - Etc.
    - To correct for intrinsic technical biases in the technologies
    - To make the expression value distributions conform to some assumptions in order to perform statistical tests
  • 46. RNA-seq pre-processing
    For RNA-seq data, it is still less understood than for microarrays how one should pre-process and normalize the data. Let’s look at some aspects (that sometimes apply to both RNA-seq and microarray data)
  • 47. R and Bioconductor
    Very helpful for (e.g.) microarray and RNA-seq differential expression analysis
    Microarray:
    - affy, lumi (read raw microarray signal files & preprocess)
    - limma (differential expression analysis with complex designs)
    RNA-seq:
    - DESeq, edgeR, baySeq (differential expression analysis based on count data)
    - SAMSeq (nonparametric differential expression analysis)
  • 48. Variance stabilization
    - Raw data (could be microarray signal or RNA-seq counts): higher value -> higher variability (noise)
    - Log transform: lower value -> higher variability. Too aggressive
    - Variance stabilizing transform, e.g. voom() in limma package
    http://bridgecrest.blogspot.se/2011_09_01_archive.html
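The three regimes can be illustrated with simulated counts. The sketch below assumes pure Poisson noise (real RNA-seq data are usually more variable) and uses a square-root transform as a stand-in variance-stabilizing transform; it is not what voom() does. The sampler `poisson` and the helper `var_after` are written just for this example.

```python
import math
import random
import statistics


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def var_after(transform, values):
    """Variance of the values after applying `transform` to each of them."""
    return statistics.pvariance([transform(v) for v in values])


rng = random.Random(0)
low = [poisson(4, rng) for _ in range(3000)]     # lowly expressed gene
high = [poisson(100, rng) for _ in range(3000)]  # highly expressed gene


def log2p1(x):
    return math.log2(x + 1)


raw = (var_after(float, low), var_after(float, high))
logged = (var_after(log2p1, low), var_after(log2p1, high))
rooted = (var_after(math.sqrt, low), var_after(math.sqrt, high))

print("raw   ", raw)     # variance grows with the mean
print("log2  ", logged)  # variance now larger for the low gene
print("sqrt  ", rooted)  # roughly equal variance for both genes
```

The square root is the classic choice for Poisson-like data; for overdispersed counts the appropriate transform differs, which is part of why dedicated packages exist.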
  • 49. Quantifying expression with RNA-seq
    If you want to compare RNA-seq counts between different genes and/or samples, consider:
    - Longer genes/transcripts are expected to generate more reads
    - The more you sequence, the more reads you get from each gene
    Therefore, the standard measure has been RPKM (reads per kilobase per million mapped reads), which corrects for transcript length and sequencing depth:
    $$\mathrm{RPKM} = \frac{X_t / \left( l_{\mathrm{eff},t} / 10^3 \right)}{N_{\mathrm{lib}} / 10^6} = \frac{10^9 \cdot X_t}{N_{\mathrm{lib}} \cdot l_{\mathrm{eff},t}}$$
    ($X_t$: number of reads mapped to transcript/gene/… $t$; $N_{\mathrm{lib}}$: number of mapped reads in library; $l_{\mathrm{eff},t}$: effective length of transcript/gene/… $t$)
    FPKM is a paired-end version of this
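As a quick worked example of the RPKM definition, here is a small sketch. The function name `rpkm` and the library size of 20 million mapped reads are assumptions made for illustration; the two read-count/length pairs are chosen so that both transcripts get the same RPKM.

```python
def rpkm(read_count, effective_length_bp, mapped_reads_in_library):
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * read_count / (mapped_reads_in_library * effective_length_bp)


library_size = 20_000_000  # assumed number of mapped reads in the library

long_transcript = rpkm(1000, 30_000, library_size)  # 1000 reads on 30 kb
short_transcript = rpkm(10, 300, library_size)      # 10 reads on 300 bp

print(long_transcript, short_transcript)
```

Both calls give about 1.67, even though one value rests on 1000 reads and the other on only 10, which is the information that is lost when counts are converted to RPKM.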
  • 50. Alternatives
    TPM: “transcripts per million”
    A slightly modified RPKM measure that accounts for differences in gene length distribution in the transcript population
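One common way to compute TPM is to divide each count by the transcript length first and then rescale so that the values in a sample sum to one million. The sketch below follows that recipe; the function name `tpm` and the example counts and lengths are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]  # reads per base
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Hypothetical sample with three transcripts.
counts = [1000, 10, 500]
lengths_bp = [30_000, 300, 1000]
values = tpm(counts, lengths_bp)
print(values)
```

Because the values always sum to one million within a sample, a TPM value can be read as a proportion of the transcript population, which is not true of RPKM.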
  • 51. Alternatives
    TMM: “trimmed mean of M values”
    Attempts to correct for differences in RNA composition between samples.
    E.g. if certain genes are very highly expressed in one tissue but not another, there will be less “sequencing real estate” left for the less expressed genes in that tissue, and RPKM normalization (or similar) will give biased expression values for them compared to the other sample.
    [Figure: RNA population 1 vs. RNA population 2. Equal sequencing depth -> orange and red will get lower RPKM in RNA population 1 although the expression levels are actually the same in populations 1 and 2]
    Robinson and Oshlack, Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
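The “sequencing real estate” effect and a TMM-style correction can be sketched on a toy example. This is a deliberately simplified version of the idea, not the edgeR implementation: it trims only on the log ratios (M values), uses no precision weights, and the function name `simple_tmm_factor` and all counts are invented.

```python
import math


def simple_tmm_factor(sample, reference, trim=0.2):
    """Scaling factor for `sample` relative to `reference` (simplified TMM).

    Simplifications of this sketch: no precision weights, and trimming
    only on the log ratios (M values), not on average expression.
    """
    n_s, n_r = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s * n_r) / (r * n_s))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    cut = int(trim * len(m_values))
    kept = m_values[cut:len(m_values) - cut]
    return 2 ** (sum(kept) / len(kept))


# Both libraries have 1 million reads. Genes 1-4 are truly equally expressed
# in both populations, but gene 5 dominates population 1.
pop1 = [50_000, 50_000, 50_000, 50_000, 800_000]
pop2 = [200_000] * 5

factor = simple_tmm_factor(pop1, pop2)
cpm_naive = 1e6 * pop1[0] / sum(pop1)
cpm_adjusted = 1e6 * pop1[0] / (sum(pop1) * factor)
print(factor, cpm_naive, cpm_adjusted)
```

Without the correction, gene 1 looks four times lower in population 1; scaling the library size by the factor brings it back in line with population 2.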
  • 52. Across-sample comparability
    [Figure from Dillies et al., Briefings in Bioinformatics, doi:10.1093/bib/bbs046]
  • 55. Practical issues with normalization methods
    - Limma / voom can give negative values
    - TMM cannot be done on a single sample
  • 56. RNA-seq pre-processing
    In RNA-seq, normalization of counts is often interwoven with differential expression analysis and done implicitly in DE packages such as DESeq, edgeR etc.
    Normalized values like RPKM are usually only used for reporting expression values, not testing for differential expression.
    Why?
  • 57. Count nature of RNA-seq data
    These methods want to use the added statistical power provided by the count nature of RNA-seq data.
    Simplified toy example:
    - Scenario 1: A 30000-bp transcript has 1000 counts in sample A and 700 counts in sample B.
    - Scenario 2: A 300-bp transcript has 10 counts in sample A and 7 counts in sample B.
    Assume that the sequencing depths are the same in both samples and both scenarios. Then the RPKM in sample A is the same in both scenarios, and likewise for sample B.
    In scenario 1, we can be more confident that there is a true difference in the expression level than in scenario 2 (although we would want more replicates of course!), by analogy to a coin flip: 700 heads out of 1000 trials gives much more confidence that a coin is biased than 7 heads out of 10 trials.
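The coin-flip analogy can be checked with an exact binomial calculation. The sketch computes the one-sided probability of seeing at least the observed number of heads with a fair coin; the function name `upper_tail` is made up, and this is only the analogy from the slide, not the test that DE packages apply to counts.

```python
from math import comb


def upper_tail(heads, flips):
    """P(at least `heads` heads in `flips` flips of a fair coin)."""
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips


p_small = upper_tail(7, 10)       # 7 heads out of 10
p_large = upper_tail(700, 1000)   # 700 heads out of 1000
print(p_small, p_large)
```

Seven heads in ten flips happens by chance about 17% of the time, whereas 700 in 1000 is essentially impossible for a fair coin, although the proportion of heads is 0.7 in both cases.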
  • 58. Visualization
    Can be useful for “sanity checking”, outlier detection and exploratory analysis in general
    Examples of useful visualizations
    - Heat maps
    - PCA/MDS/NMF
    - Box plots, violin plots etc.
  • 59. Box plots
    Useful for comparing groups
    Adding the actual data points is optional but can be interesting
  • 60. Sample correlation heat maps
    Heat maps are ubiquitous in transcriptomics
    Correlations between samples, hierarchical clustering
    Used for “sanity checks”, outlier detection
    [Figure panels: Two tissues; Batch effects]
  • 61. Gene / sample heat maps
    With a smaller collection of genes, one sometimes looks at gene/sample heat maps
  • 62. PCA plots
    Another way to see how samples cluster
  • 63. PCA plots
    Nice thing with PCA: you can also see how much each gene contributes to each principal component -> a kind of feature selection
  • 64. Alternatives to PCA
    NMF: non-negative matrix factorization. Also a matrix decomposition technique (like PCA)
    “A bioinformatic assay for pluripotency in human cells”, Nature Methods, doi:10.1038/nmeth.1580
  • 65. PCA plot of human tissue RNA-seq
    Red: GTEx
    Green: Body Map
    Black: Human Protein Atlas
  • 66. # of genes taking up X% of sequences
    [Figure: GTEx, RPKM; labeled genes: HBA1, HBB, HBA2]
  • 67. # of genes taking up X% of sequences
    [Figure: GTEx]
  • 68. # of genes taking up X% of sequences
    [Figure: Wang/Sandberg]
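The quantity plotted on these slides can be computed directly from a vector of per-gene counts. The sketch below sorts genes by count and reports how many of the top genes are needed to reach a given fraction of all reads; the function name `genes_for_fraction` and the counts are hypothetical, not taken from GTEx or the Wang data.

```python
def genes_for_fraction(counts, fraction):
    """Smallest number of top-expressed genes covering `fraction` of all reads."""
    total = sum(counts)
    running = 0
    for n, c in enumerate(sorted(counts, reverse=True), start=1):
        running += c
        if running >= fraction * total:
            return n
    return len(counts)


# Hypothetical sample dominated by one very highly expressed gene.
counts = [600, 250, 80, 40, 20, 10]
half = genes_for_fraction(counts, 0.5)
ninety = genes_for_fraction(counts, 0.9)
print(half, ninety)
```

A sample where a handful of genes covers most of the reads is exactly the situation in which composition-aware normalization matters.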
  • 69. Differential expression analysis
    Many tools available!
    Easily the most common type of analysis, even though it is understood that gene expression levels are not independent of each other, and should in principle be considered together.
    However, since the number of samples is typically << the number of measured genes, a full model is usually not feasible to construct in practice. Some sort of feature selection is needed.
  • 73. Differential expression analysis
    One would simply like to do a t-test or something like that for each gene, but …
    - Assumes normal distribution & no mean-variance dependence
    - Hard to estimate variance from few samples
    - Multiple testing issue
  • 74. Parametric vs. non-parametric methods
    It would be nice to not have to assume anything about the expression value distributions but only use rank-order statistics.
    -> methods like SAM (Significance Analysis of Microarrays) or SAMSeq (equivalent for RNA-seq data)
    However, it is (typically) harder to show statistical significance with non-parametric methods with few replicates.
    My rule of thumb:
    - Many replicates (~ >10) in each group -> use SAM(Seq)
    - Otherwise use DESeq or other parametric method
    Note that Simon Anders (creator of DESeq) says that non-parametric methods are definitely better with 12 replicates and maybe already at five
    http://seqanswers.com/forums/showpost.php?p=74264&postcount=3
  • 79. Standard DE methods
    Limma (microarrays, RNA-seq)
    edgeR, DESeq (RNA-seq)
    Distributional issue: Solved by variance stabilizing transform in limma. edgeR and DESeq model the count data using a negative binomial distribution and use their own modified statistical tests based on that.
    Multiple testing issue: All of these packages report false discovery rate (corrected p values).
    Variance estimation issue: These packages (in slightly different ways) “borrow” information across genes to get a better variance estimate. One says that the estimates “shrink” from gene-specific estimates towards a common mean value.
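For the multiple testing part, a widely used way to obtain false-discovery-rate-adjusted p values is the Benjamini-Hochberg procedure. The sketch below assumes that procedure; the function name `benjamini_hochberg` and the example p values are invented, and the packages named above ship their own implementations.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```

With thousands of genes tested at once, thresholding the adjusted values rather than the raw ones is what keeps the expected proportion of false positives among the reported genes under control.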
  • 80. CuffDiff2
    Integrates isoform quantification + differential expression analysis
  • 81. Complex designs
    The simplest case is when you just want to compare two groups against each other.
    But what if you have several factors that you want to control for?
    E.g. you have taken tumor samples at two different time points from six patients, cultured the samples and treated them with two different anticancer drugs and a mock control treatment. -> 2x6x3 = 36 samples.
    Now you want to assess the differential expression in response to one of the anticancer drugs, drug X. You could just compare all “drug X” samples to all control samples, but the inter-subject variability might be larger than the specific drug effect.
    -> Enter limma / DESeq / edgeR, which can work with factorial designs (SAMSeq cannot, which is another reason one might not want to use it)
  • 82. Limma and factorial designs
    limma stands for “linear models for microarray data”
    Essentially, the expression of each gene is modeled with a linear relation
    The design matrix describes all the conditions, e.g. treatment, patient, time etc.
    y = a + b*treatment + c*time + d*patient + e
    (a: baseline/average; e: error term/noise)
    http://www.math.ku.dk/~richard/courses/bioconductor2009/handout/19_08_Wednesday/KU-August2009-LIMMA/PPT-PDF/Robinson-limma-linear-models-ku-2009.6up.pdf
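The linear-model idea can be sketched for a single gene with plain least squares. This only illustrates how a design matrix separates a treatment effect from a patient effect; limma additionally moderates the variance estimates across genes, which is not shown. The helper names `solve` and `least_squares`, the design, and the expression values are all made up, and the time factor is left out to keep the example small.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x


def least_squares(design, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Columns: intercept, treatment (0 = control, 1 = drug X), patient (0 or 1).
design = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
y = [5.0, 7.0, 8.0, 10.0]  # hypothetical log expression of one gene

baseline, treatment_effect, patient_effect = least_squares(design, y)
print(baseline, treatment_effect, patient_effect)
```

Here the patient difference (3) is larger than the drug effect (2), yet the fit recovers both separately, which is the point of including patient as a factor in the design.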
  • 83. Recent DE software comparison
  • 84. Take-away messages from DE tool comparison
    - CuffDiff2, which should theoretically be better, seems to work worse, probably due to the increased “statistical burden” from isoform expression estimation
    - The HTSeq quantification, which is theoretically “wrong”, seems to give good results with downstream software
    - It is practically always better to sequence more biological replicates than to sequence the same samples deeper
    Omitted from this comparison:
    - Gains from ability to do complex designs
    - Non-parametric methods
  • 85. The end
    Contact me at mikael.huss@scilifelab.se if you have any questions