RNASeq Experiment Design

Introduction to RNASeq Data Analysis and Experimental Design

Published in: Technology

  2. Overview
     • Earlier: libraries to raw reads.
     Now:
     • What to do with RNA-seq reads?
     • How to design an RNA-seq experiment?
  3. Illumina HiSeq
     (Figure from Blencowe B J et al. Genes Dev. 2009;23:1379-1386)
  4. Reads are ready. Now what?
     bcl2fastq output: big FASTQ files (2-30 GB)
     • Reads represent real biology.
     • More reads corresponding to a transcript indicate higher abundance of that transcript.
     • Reads may represent novel transcripts or novel arrangements of exons that are not present in any known reference genome.
     • New exon-exon junctions, RNA editing, and nucleotide variations (SNPs) may all be present in the read data.
     How do we translate these raw reads into biological knowledge? Start with sequence alignment.
  5. Reads are ready. Now what?
     FASTQ: do we have a genome reference?
     • No: perform full de novo transcriptome construction.
     • Yes: do we have a transcript/gene annotation reference?
       – No: perform alignment-guided de novo transcriptome assembly.
       – Yes: align to the genome, then choose one of:
         · Quantification only: accept only alignments that correspond to known transcripts (like microarray).
         · Align to known exons but accept alternative arrangements.
         · Align to known exons plus other regions.
  6. What to map to?
     Map to a genome with no gene annotation.
     • Assembling transcripts from exon regions is difficult and requires complex statistical algorithms.
     • Identifying alternative transcript isoforms is unreliable.
     • Usually this is best for novel or unannotated genomes.
     (Figure labels: exons, genome reference)
  7. What to map to?
     Map to the genome, with knowledge of transcript annotations.
     • A well-annotated genome reference is required.
     • To map effectively to exon junctions, you need a mapping algorithm that can divide the sequencing reads and map the portions independently.
     • Identifying alternative transcript isoforms involves complex algorithms.
  8. Which sequence mappers to use?
     • An RNASeq alignment algorithm must be
       – Fast
       – Able to handle SNPs, indels, and sequencing errors
       – Able to maintain accurate quantification
       – Able to allow for introns in reference genome alignment (spliced alignment detection)
     • Burrows-Wheeler Transform (BWT) mappers
       – Fast
       – Limited mismatches allowed (<3)
       – Limited indel detection ability
       – Examples: Bowtie2, BWA, Tophat
       – Use cases: large and conserved genomes and transcriptomes
     • Hash table mappers
       – Require a large amount of RAM for indexing
       – More mismatches allowed
       – Indel detection
       – Examples: GSNAP, SHRiMP, STAR
       – Use cases: highly variable or smaller genomes and transcriptomes
  9. Generalized Analysis Workflow
     • RNA-seq reads (fastq file)
     • Alignment (SAM/BAM file): Bowtie2, BWA, Tophat, GSNAP, SHRiMP, STAR
     • Assemble transcripts (transcript isoforms)
       – Trinity - http://
       – Cufflinks - http://
     • Gene or transcript quantification (count reads)
       – HTSeq - http://www- anders/HTSeq/doc/overview.html
       – Cufflinks - http://
       – Bioconductor - http://
  10. Tophat/Cufflinks Workflow
     • Input: RNA-seq reads (fastq file)
     • Tophat: align to the genome using Bowtie/Tophat.
       – Inputs: genome reference (.fasta), gene annotations (.gtf)
       – Output: SAM/BAM file
       – Spliced fragments align to known exon-exon junctions.
       – Genomic mapped reads may identify novel isoforms.
     • Cufflinks
       – Inputs: genome reference (.fasta), gene annotations (.gtf)
       – Outputs: transcript isoforms, gene/transcript quantification
       – Cufflinks identifies mutually exclusive exons. Graph-based analysis uses a shortest-path algorithm to determine
  11. Sequence Alignment Files
     BAM/SAM alignment files
     • SAM is the standard alignment file format generated by all mappers.
     • Alignment files are stored as BAM files, an industry standard.
     • BAM is a compressed (binary) version of the SAM file. BAM is not human-readable. It can be indexed so that huge alignment files can be read and searched rapidly by other tools and genome browsers.
     • A suite of tools (called "samtools") is used to convert between SAM and BAM.
     • Samtools can also be used to index BAM files for faster visualization in IGV or the UCSC Genome Browser.
  12. SAM format
     http://
     (Figure labels: format version, reference sequence name, reference sequence length, sort order, CIGAR string)
  13. CIGAR Strings
     Compact Idiosyncratic Gapped Alignment Report
     http://
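A CIGAR string is easier to follow with a small parser. The sketch below is not part of the original slides; it assumes the operation letters defined in the SAM specification (M, I, D, N, S, H, P, =, X), and the function names are illustrative only. It shows how the CIGAR field tells you how many reference bases an alignment spans and whether the read crosses an intron (the `N` operation), which is what a spliced aligner reports for a read covering an exon-exon junction.

```python
import re

# Which operations consume reference bases or read (query) bases,
# following the table in the SAM specification.
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")

_CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":  # '*' means the CIGAR is unavailable
        return []
    tokens = _CIGAR_TOKEN.findall(cigar)
    # Reject strings that contain anything besides valid tokens.
    if "".join(length + op for length, op in tokens) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in tokens]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def is_spliced(cigar):
    """True if the alignment skips a reference region (an intron)."""
    return any(op == "N" for _, op in parse_cigar(cigar))
```

For example, `20M500N30M` describes a 50-base read that spans 550 reference bases: 20 aligned bases, a 500-base skipped region, then 30 more aligned bases.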
  14. Differential Gene Expression Analysis
     • Given samples from different experimental conditions, find changes in transcriptome profiles.
     • Allows for hypothesis generation on molecular abnormalities and mechanisms that may contribute to the tumor phenotype.
     • Provides insights into potential biological mechanisms associated with experimental/diseased conditions.
  15. Standard Transcriptome Sequencing Pipeline
     • Sample annotations
     • STAR aligner
     • featureCounts
     • DESeq, GSEA, QC
     • HTML report
  16. This is really a simple sequence counting problem
     Data: NGS randomly samples and sequences all gene transcripts from the samples (so the number of reads correlates with the number of transcripts).
     Objective: Does gene X have more copies in condition Z than in condition B (Z > B)?
     (Figure: genes X, Y, Z in condition Z and in condition B)
  17. Counting Rules for RNASeq
     • Count mapped reads, not base pairs.
     • Count each read at most once.
     • Discard a read if
       – It cannot be uniquely mapped
       – Its alignment overlaps with several genes
       – The alignment quality score is bad
       – (For paired-end reads) the mates do not map to the same gene (potential fusion genes)
     • Do not discard read duplicates (the same read appearing multiple times).
     • Keep track of the alignment method and parameters.
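The counting rules above can be sketched as a small filter. This is an illustration rather than the behaviour of any particular counting tool: the `Alignment` record, its field names, and the `min_mapq=10` threshold are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alignment:
    """Hypothetical, simplified view of one aligned read."""
    read_id: str
    genes: frozenset            # genes overlapped by this alignment
    mapq: int                   # alignment quality score
    num_hits: int               # 1 means the read mapped uniquely
    mate_genes: Optional[frozenset] = None  # paired-end only


def count_reads(alignments, min_mapq=10):
    """Count reads per gene following the rules on the slide."""
    counts = Counter()
    seen = set()
    for aln in alignments:
        if aln.read_id in seen:        # count each read at most once
            continue
        seen.add(aln.read_id)
        if aln.num_hits != 1:          # not uniquely mapped
            continue
        if len(aln.genes) != 1:        # overlaps several genes, or none
            continue
        if aln.mapq < min_mapq:        # poor alignment quality
            continue
        if aln.mate_genes is not None and aln.mate_genes != aln.genes:
            continue                   # mates disagree: potential fusion
        (gene,) = aln.genes
        counts[gene] += 1              # one read, not its base pairs
    return counts
```

Two reads with identical sequence but different read IDs are both counted, which matches the rule about keeping duplicates.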
  18. What kind of questions can be answered from sequence count data?
     Gene Sequence Count Data
     Gene  | Healthy 1 | Healthy 2 | Healthy 3 | Patient 1 | Patient 2 | Patient 3
     CCT2  | 50        | 60        | 45        | 75        | 5         | 69
     TP53  | 30        | 72        | 30        | 127       | 40        | 80
     CXCR5 | 3         | 10        | 60        | 20        | 5         | 40
     Is gene TP53 upregulated in patient samples?
     - Hint: If healthy samples were sequenced at 20 million reads and patient samples were sequenced at 80 million reads, does it change the answer?
     Are there more TP53 transcript copies compared to CCT2?
     - Hint: The TP53 transcript is a lot longer than CCT2.
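The first hint can be worked through numerically. The sketch below uses the TP53 counts from the table and the library sizes from the hint (20 million reads for healthy samples, 80 million for patient samples). Scaling to counts per million reads is an assumption chosen for illustration; the slide does not name a method at this point.

```python
# Raw TP53 counts from the table on the slide.
tp53 = {
    "healthy": [30, 72, 30],
    "patient": [127, 40, 80],
}
# Library sizes taken from the hint on the slide.
library_size = {"healthy": 20_000_000, "patient": 80_000_000}


def mean(values):
    return sum(values) / len(values)


def mean_counts_per_million(group):
    """Mean count in a group, scaled to one million sequenced reads."""
    return mean(tp53[group]) / library_size[group] * 1_000_000


# Patient vs healthy ratio before and after accounting for depth.
raw_ratio = mean(tp53["patient"]) / mean(tp53["healthy"])
scaled_ratio = mean_counts_per_million("patient") / mean_counts_per_million("healthy")

print(f"raw patient/healthy ratio:    {raw_ratio:.2f}")
print(f"scaled patient/healthy ratio: {scaled_ratio:.2f}")
```

On raw counts TP53 looks roughly 1.9-fold higher in patients, but once the fourfold difference in sequencing depth is taken into account the ratio falls to about 0.47, so the apparent upregulation reverses.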
  19. Direct comparison of read counts per gene is problematic
     More sequence reads map to a transcript if it is
     a) Long
     b) At a higher depth of coverage
     (Figure labels: Read Counts = 12, Depth = 3X; Read Counts = 5, Depth = 3X; Read Counts = 11, Depth = 5X; Read Counts = 5, Depth = 3X)
     We cannot claim that the blue transcript is transcribed at a higher level than the green transcript based on read counts.
  20. Normalization of RNASeq Count Data
     • Data normalization is ALWAYS required to compare one sequencing result to another.
     • It brings count data from different experiments to the same scale for comparison.
     • RNASeq count data normalization adjusts the data such that:
       – Genes with different lengths can be compared
       – Total sequence counts are taken into account
  21. RPKM: Reads per Kilobase per Million Mapped Reads
     C = number of mappable reads in a feature (exon or transcript)
     N = number of mappable reads in the experiment
     L = length of the feature in base pairs
     The easiest way to normalize is to take the number of mapped reads on a transcript and divide it by the length of the transcript and the total number of reads.
     Nature Methods 5, 621-628 (2008)
     • Generally corrects for biases
     • Vulnerable to bias from a few highly expressed genes driving N to be large
     • Used to be the standard, but not anymore
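With C, N and L defined as above, the usual way to write RPKM is `RPKM = 10^9 * C / (N * L)`: the factor 10^9 combines "per kilobase" (10^3) and "per million reads" (10^6). The formula itself does not survive in the slide text, so the sketch below states it as an assumption; the function name and the example numbers are illustrative.

```python
def rpkm(feature_reads, total_mapped_reads, feature_length_bp):
    """Reads per kilobase of feature per million mapped reads.

    feature_reads      -- C, mappable reads in the feature
    total_mapped_reads -- N, mappable reads in the experiment
    feature_length_bp  -- L, feature length in base pairs
    """
    if total_mapped_reads <= 0 or feature_length_bp <= 0:
        raise ValueError("N and L must be positive")
    return feature_reads * 1e9 / (total_mapped_reads * feature_length_bp)


# Two hypothetical transcripts with the same raw count (500 reads)
# in the same 20-million-read library, but different lengths.
short_transcript = rpkm(500, 20_000_000, 1_000)
long_transcript = rpkm(500, 20_000_000, 2_000)
print(short_transcript, long_transcript)
```

The two transcripts have identical counts, yet the shorter one gets twice the RPKM value, which is the length correction the slide describes.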
  22. Other Normalization Methods
     Upper Quartile Method
     Aim: Correct for the bias that total read count is strongly dependent on a few highly expressed transcripts.
     Method: Use the upper quartile (75th percentile) of each sample's transcript counts as the scaling factor and report normalized counts.
     Bullard et al. BMC Bioinformatics 2010, 11:94
     Geometric Mean Method (the DESeq method)
     Aim: Minimize the effect of the majority of sequences and concentrate on variation between conditions.
     Assumption: The majority of transcripts are not differentially expressed.
     Method: Take the geometric mean of each gene's read counts across samples as the reference value; the scaling factor s_j for sample j is the median over genes i of k_ij / (k_i1 × k_i2 × ... × k_im)^(1/m).
     k_ij = number of reads in sample j assigned to gene i
     v = sample index, 1 to m
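The geometric mean method can be sketched in a few lines of plain Python. This is a minimal illustration of the median-of-ratios idea, not the DESeq package itself: the function name is made up, and skipping genes that have a zero count in any sample is an assumption made so that the geometric mean stays defined.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts -- list of genes, each a list of raw counts per sample (k_ij).
    Returns one scaling factor s_j per sample:
        s_j = median over genes i of k_ij / geometric_mean(k_i1..k_im)
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(k > 0 for k in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    # Work in log space: log of the geometric mean is the mean of logs.
    log_geo_means = [sum(math.log(k) for k in row) / len(row) for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide each raw count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[k / s for k, s in zip(row, factors)] for row in counts]
```

If one sample was simply sequenced twice as deeply as another, its size factor comes out twice as large, and the normalized counts for the two samples agree.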
  23. Inferring Differential Expression (DE)
     Method   | Normalization | Needs replicates | Input      | Statistics for DE                                                      | Availability
     edgeR    | Library size  | Yes              | Raw counts | Empirical Bayesian estimation based on negative binomial distribution | R/Bioconductor
     DESeq    | Library size  | No               | Raw counts | Negative binomial distribution                                         | R/Bioconductor
     baySeq   | Library size  | Yes              | Raw counts | Empirical Bayesian estimation based on negative binomial distribution | R/Bioconductor
     LIMMA    | Library size  | Yes              | Raw counts | Empirical Bayesian estimation                                          | R/Bioconductor
     CuffDiff | RPKM          | No               | RPKM       | Log ratio                                                              | Standalone
  24. Typical DE Result Table
     • Gene or transcript name
     • Mean expression levels
     • Fold change: a measurement of the magnitude of change, calculated as FC = baseMeanB / baseMeanA. Typically log2(FC) is reported.
     • Significance: use the adjusted P value (padj) instead of the raw P value (pval) unless you know what you are doing.
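The fold change column is a one-line calculation. In the sketch below the `pseudocount` parameter is an assumption added for illustration, since the slide's formula is undefined when a mean is zero; the function name is also illustrative.

```python
import math


def log2_fold_change(base_mean_a, base_mean_b, pseudocount=0.0):
    """log2(FC) with FC = baseMeanB / baseMeanA, as defined on the slide.

    A positive pseudocount is an optional guard against zero means.
    """
    numerator = base_mean_b + pseudocount
    denominator = base_mean_a + pseudocount
    if numerator <= 0 or denominator <= 0:
        raise ValueError("means must be positive (or use a pseudocount)")
    return math.log2(numerator / denominator)


print(log2_fold_change(50, 200))   # expression is higher in condition B
print(log2_fold_change(200, 50))   # expression is lower in condition B
```

Reporting log2(FC) makes the scale symmetric: a fourfold increase gives +2 and a fourfold decrease gives -2.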
  25. Why use the adjusted P-value instead of the raw P-value?
     Multiple Comparison Problem: when a large number of statistical tests are performed simultaneously (as in genomic analysis), some tests will have P values less than 0.05 purely by chance, even if all your null hypotheses are really true.
     The Dead Thinking Salmon Experiment (Bennett-Salmon-2009)
     - Buy a whole salmon.
     - Take fMRI images of the salmon. Like genomic analysis, fMRI asks whether a small region (voxel) of the brain is active.
     - Some region WILL BE significantly active if enough pictures and enough voxels are taken,
     - suggesting the dead salmon is thinking...
     - Nothing is significant if the p-value is adjusted.
     Methods for adjustment: Bonferroni correction, FDR controlling procedures
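To make the scale of the problem concrete: with, say, 20,000 genes tested at a 0.05 threshold and no real differences at all, about 20,000 × 0.05 = 1,000 genes would still pass by chance. The sketch below implements the two adjustment families named above in plain Python. The slide only says "FDR controlling procedures", so using the Benjamini-Hochberg procedure as the FDR example is an assumption, as are the function names.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that each adjusted value
    # is never larger than the adjusted value of the next-larger p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni is the stricter of the two; the FDR approach tolerates a controlled fraction of false positives among the genes called significant, which is why it is the more common choice for genome-wide tables.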
  26. Heatmap and Hierarchical Clustering
     • The most common representation for differential expression analysis.
     • Hierarchical clustering on both samples and genes is often performed to identify similar samples/genes.
     • Can be generated using many tools, such as the R/Bioconductor heatmap and gplots packages.
  27. Functional Enrichment Analysis
     • Use gene expression to identify pathways or gene functions that are over-represented.
     • Addresses the question: "What biological functions are different between sample groups?"
     • Many open-source and proprietary tools
       – GSEA (http://
       – DAVID (https://
       – TopGO/GOSEQ (R/Bioconductor)
       – Ingenuity Pathway Analysis (QIAGEN, proprietary)
     • Detailed discussion is out of scope for this course.
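One generic way to ask whether a pathway is over-represented is a hypergeometric test on the overlap between the differentially expressed genes and the pathway's genes. The sketch below shows that generic test only; it is not a description of how any of the tools listed above work (GSEA, for example, uses a rank-based approach), and the function and parameter names are illustrative.

```python
from math import comb


def enrichment_pvalue(universe, pathway, de_genes, overlap):
    """Probability of seeing at least `overlap` pathway genes by chance.

    universe -- number of genes tested in the experiment
    pathway  -- number of those genes annotated to the pathway
    de_genes -- number of differentially expressed genes
    overlap  -- number of DE genes that belong to the pathway
    """
    if not (0 <= pathway <= universe and 0 <= de_genes <= universe):
        raise ValueError("pathway and de_genes must lie within the universe")
    upper = min(pathway, de_genes)
    total = comb(universe, de_genes)
    # Sum the hypergeometric probabilities for overlaps >= the observed one.
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, de_genes - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 genes tested, 100 in the pathway,
# 500 DE genes, 10 of which fall in the pathway.
print(enrichment_pvalue(20_000, 100, 500, 10))
```

With these numbers only 2.5 pathway genes would be expected among the DE genes by chance, so an overlap of 10 gives a small p-value. When many pathways are tested, these p-values need multiple-testing adjustment as well.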
  29. Design RNASeq Experiment
     • Biological comparison(s)
     • Replicates
     • Read length
     • Paired end/single read
     • Read depth
     • Pooling
  30. Biological System in Question
     Simple question. Examples:
     • Cell line groups treated with different conditions
     • Patient groups with the same disease treated with different treatments
     Complex question. Examples:
     • Matched patient samples from both normal and diseased tissues
     • Normal and cancer samples obtained from a genotypically diverse population
  31. Experimental Questions
     • What are my goals?
       – Differential expression analysis of genes?
       – Differential expression analysis of transcripts?
       – Identify rare transcript isoforms?
       – Identify transcript polymorphisms?
       – Identify non-coding RNA populations such as miRNA and lincRNA?
     • What are the characteristics of the system?
       – Large, complex genome? (e.g. human)
       – Highly heterogeneous sample population? (e.g. breast tumor)
       – No reference genome or transcriptome?
       – High degree of alternative splicing?
  32. Experimental Questions
     • What are the sequencing options?
     • How much money to spend?
  33. What are Single Read (SR) and Paired End (PE) sequencing?
     Single Read (SR): only one end of each cDNA fragment is sequenced, generating one read per fragment.
     Paired End (PE): the cDNA fragment is sequenced from both ends, generating two reads per fragment from two directions.
  34. What are Single Read (SR) and Paired End (PE) sequencing?
     Single Read (SR)
     - Samples the same number of cDNA fragments as PE
     - Generates half as many reads (half the depth) as PE
     - Suitable for gene expression level detection
     - Substantially cheaper than PE
     Paired End (PE)
     - Samples the same number of cDNA fragments as SR
     - Allows for more accurate detection of structural variants, and for novel isoform identification and quantification
     (Figure label: reference sequence)
  35. Impacts of Read Length on RNASeq
     A longer read length (e.g. 75 bp vs 50 bp) provides:
     - Better ability to assemble unknown transcripts
     - Higher accuracy when mapping reads to complex regions (e.g. repeats, highly polymorphic regions)
     - Splice junction detection is most affected by read length
     Is a longer read length (e.g. 100 bp vs 50 bp) always better?
     - Not necessarily
     - Long reads convey minimal to no advantage for differential gene expression analysis
     (Figure labels: 50 bp, 75 bp)
  36. Impacts of Sequencing Depth
     • A quick means to detect more genes and transcript variants with low expression (the more reads you sequence, the more genes you find).
     • Requires a log-scale (multiplicative) increase in depth for a linear increase in genes detected.
     (Figure: genes X, Y, Z in RNASeq 1, 30 million reads, and RNASeq 2, 10 million reads)
  37. Number of reads needed for an experiment
     • Different RNA sequencing applications require different numbers of reads.
     • More genes are detected with higher sequencing depth.
     • However, the gain in detected genes diminishes substantially as depth increases.
     • Understand your sequencing system before deciding on depth.
     • You can always increase depth by additional sequencing of the same library.
       – Unlike microarray, there is very limited batch effect for RNASeq.
     Differential expression in RNA-seq: A matter of depth. Genome Res. 2011.
  38. Experimental Design
     • Technical replicates
       – Not needed: RNASeq has low technical variation
     • Minimize batch effects
     • Biological replicates
       – Not needed for novel transcript identification and transcriptome assembly
       – Essential for differential expression analysis
       – Difficult to estimate the minimum number
         • 3+ for cell lines
         • 5+ for inbred lines (e.g. mouse, model organisms)
         • 20+ for human samples (usually unachievable)
       – Must have 3+ to perform statistical analysis
  39. Experimental Design
     • Pooling samples
       – Limited RNA obtainable
         • Tumor samples from hard-to-reach tissue types (e.g. brain)
       – Novel transcriptome assembly
       – Don't do it unless you know what you are doing
  40. Questions to ask when getting raw RNASeq data back
     • How was the RNA extracted?
     • How was the RNASeq library constructed?
     • Which platform was the library sequenced on?
     • What was the read length?
     • Was sequencing done with single read or paired end?
     • How many reads were sequenced per sample?
     • Where is the QC report?
  41. Checklist for getting RNASeq DE analysis results back
     [ ] Fastq files
     [ ] FastQC report
     [ ] BAM files
     [ ] RNASeq QC report (not discussed)
     [ ] Table of differentially expressed genes/transcripts
     [ ] Heatmaps
     [ ] Functional enrichment analysis table
  42. Recognize Yourself as a Genomic Data Consumer
     Bioinformaticists/Data Scientists
     - Let data drive scientific hypothesis generation
     - Start with raw data (e.g. fastq)
     - Process raw data with privately tuned pipelines
     KNOW YOUR DATA SOURCE
     Translational Scientists
     - Start with a specific hypothesis derived from observation
     - Find processed data to perform secondary analysis
     - Use readily available tools
     - Interpret results in the context of the initial hypothesis
     KNOW YOUR TOOLS
  43. THE END