Talk in gene discovery session at PAGXXII (https://pag.confex.com/pag/xxii/webprogram/Session2128.html)
Joint work with Jonas Behr, Gabriele Schweikert, Andre Kahles and others.
Abstract: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in detection of expressed genes and transcripts. However, the immense dynamic range of gene expression, limitations and biases of the sequencing technology, as well as the observed complexity of the transcriptional landscape pose profound computational challenges. We discuss several of these challenges and, based on illustrative simulation examples, identify the limits of state-of-the-art tools in reconstructing multiple alternative transcripts even if sufficient information is provided. We propose a novel framework, called MiTie, for simultaneous transcript reconstruction and quantification based on combinatorial optimization. We use the negative binomial distribution to define a likelihood function and use a regularization approach to select a small number of transcripts quantitatively explaining the observed read data. We show that the resulting regularized maximum likelihood problem can be formulated as a mixed integer programming problem (MIP) which can be solved optimally using standard optimization approaches. We will also describe an extension of the discriminative gene finding system mGene that takes advantage of RNA-seq reads. We demonstrate that the extended system mGene.ngs predicts transcript annotations significantly more accurately when using RNA-seq data, and more accurately than tools for transcriptome reconstruction that are based solely on RNA-seq data. Finally, we illustrate how a combination of gene finding and transcriptome reconstruction methods like MiTie can be used to accurately annotate newly sequenced genomes without prior annotations.
RNA-seq based Genome Annotation with mGene.ngs and MiTie
1. RNA-Seq-based Genome Annotation
using mGene.ngs and MiTie
Gunnar Rätsch
Biomedical Data Science Group
Computational Biology Center
Memorial Sloan-Kettering Cancer Center
gxr #mGene #MiTie #PAGXXII
2. Memorial Sloan-Kettering Cancer Center
Acknowledgements and Disclosures
Main contributors
Gabriele Schweikert
Jonas Behr
Andre Kahles
Funding
Financial interest disclosure
© Gunnar Rätsch (cBio@MSKCC)
RNA-Seq-based Annotation using mGene.ngs and MiTie
PAG XXII Gene Discovery Workshop
6. Memorial Sloan-Kettering Cancer Center
Genome Annotation Pipeline(s)
11. Proposed new gene finding method (mGene.ngs) for reannotation of
19 A. thaliana genomes (and genome assembly + analysis).
12. Memorial Sloan-Kettering Cancer Center
mGene.ngs Overview
Goal: Predict annotation based on RNA-seq and genomic sequence information
Learn function f(y|x) that scores gene models y based on different sources of information x
Train parameters such that $f(y \mid x) \gg f(y' \mid x)$ for all $y' \neq y$ (“large margin”)
Hidden semi-Markov Support Vector Machines (HsM-SVMs) [Altun et al., 2003; Rätsch and Sonnenburg, 2007]
Automatically adapts to quality of RNA-seq data/alignments
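The large-margin training condition can be illustrated with a small sketch. It is not the mGene.ngs implementation: the toy scoring function, the explicit candidate list, and the margin of 1.0 are assumptions made for illustration, whereas an HsM-SVM scores segmentations with learned parameters and searches over them with dynamic programming.

```python
def margin_violations(score, x, y_true, candidates, margin=1.0):
    """Wrong labelings whose score is not at least `margin` below the true one."""
    s_true = score(x, y_true)
    return [y for y in candidates
            if y != y_true and s_true - score(x, y) < margin]


def structured_hinge_loss(score, x, y_true, candidates, margin=1.0):
    """Size of the largest margin violation (0.0 if every wrong labeling is separated)."""
    s_true = score(x, y_true)
    worst = max((score(x, y) for y in candidates if y != y_true),
                default=float("-inf"))
    return max(0.0, margin - (s_true - worst))


def toy_score(x, y):
    """Hypothetical f(y|x): rewards labeling a position as exonic (1) when evidence x_i > 0.5."""
    return sum((2 * yi - 1) * (xi - 0.5) for xi, yi in zip(x, y))


x = (0.9, 0.8, 0.1)            # per-position evidence (assumed input)
y_true = (1, 1, 0)             # true labeling
candidates = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 0, 0)]

violating = margin_violations(toy_score, x, y_true, candidates)
loss = structured_hinge_loss(toy_score, x, y_true, candidates)
```

Training would adjust the parameters of the scoring function until the list of violating labelings is empty, or as small as a regularized objective allows.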
15. Memorial Sloan-Kettering Cancer Center
Training of mGene
[Figure: true gene model (numeric labels 2 to 5) along the genomic position axis; STEP 1: SVM signal predictions for tss, tis, acc, don and stop; score f(y|x) along genomic position.]
17. Memorial Sloan-Kettering Cancer Center
Training of mGene
[Figure: true gene model (numeric labels 2 to 5) and a wrong gene model along the genomic position axis; STEP 1: SVM signal predictions for tss, tis, acc, don and stop; score f(y|x), with a large margin marked.]
18. Memorial Sloan-Kettering Cancer Center
Training of mGene.ngs
[Figure: true gene model (numeric labels 2 to 5) and a wrong gene model along the genomic position axis; STEP 1: SVM signal predictions for tss, tis, acc, don and stop; RNA-seq coverage and intron support from spliced reads; score f(y|x), with a large margin marked.]
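One way to derive an "intron support from spliced reads" feature is to count, for each intron, the alignments that contain it as a gap. The sketch below is an assumption about how such counts could be collected, not the mGene.ngs code. It relies only on the SAM convention that the CIGAR operation `N` marks a skipped reference region, and takes alignments as hypothetical `(start, cigar)` pairs with 0-based start positions.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by an operation code from the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_alignment(start, cigar):
    """Return introns as 0-based half-open (start, end) intervals, one per N operation."""
    pos = start
    introns = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((pos, pos + length))
        if op in "MDN=X":          # operations that consume reference positions
            pos += length
    return introns


def intron_support(alignments):
    """Count how many alignments support each intron."""
    support = Counter()
    for start, cigar in alignments:
        support.update(introns_from_alignment(start, cigar))
    return support


# Hypothetical alignments: two spliced reads over the same intron, one unspliced read.
support = intron_support([(100, "30M200N46M"), (110, "20M200N56M"), (500, "76M")])
```

A per-position coverage track can be built the same way by incrementing a counter for every reference position covered by an `M`, `=` or `X` operation.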
19. Memorial Sloan-Kettering Cancer Center
Training of mGene.ngs
[Figure: true gene model (numeric labels 2 to 5) and a wrong gene model along the genomic position axis; STEP 1: SVM signal predictions for tss, tis, acc, don and stop; RNA-seq coverage and intron support from spliced reads; score f(y|x), now with a larger margin.]
20. Memorial Sloan-Kettering Cancer Center
Results for C. elegans
RNA-seq:
paired-end, strand-specific RNA ligation based protocol
76 bp reads, 50 million reads
Alignment with PALMapper
Evaluation:
Transcript-level F-score of coding transcripts
... for different expression levels
Compare mGene (ab initio), mGene.ngs, Cufflinks
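The transcript-level F-score used in the evaluation can be sketched as follows. The matching criterion is an assumption: here a predicted transcript counts as correct only if its exon coordinates are identical to those of an annotated transcript, which may differ from the exact criterion used in the talk.

```python
def transcript_f_score(predicted, annotated):
    """Harmonic mean of precision and recall over exactly matching transcripts."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(annotated)
    return 2 * precision * recall / (precision + recall)


# Transcripts as tuples of (exon_start, exon_end) pairs; the coordinates are made up.
annotated = [
    ((100, 200), (300, 400)),
    ((100, 200), (350, 400)),
    ((900, 1000),),
]
predicted = [
    ((100, 200), (300, 400)),   # matches the first annotated transcript
    ((100, 250), (300, 400)),   # wrong exon boundary, counts as a false positive
]

f = transcript_f_score(predicted, annotated)
```

To evaluate "for different expression levels", the same function would be applied separately to the transcripts falling into each expression percentile bin.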
21. Memorial Sloan-Kettering Cancer Center
Results for C. elegans
22. Memorial Sloan-Kettering Cancer Center
Digestion
Observations:
RNA-seq helps to improve performance
Genomic signals help considerably (compare with Cufflinks)
Problems:
Need existing annotation for training
Cannot predict non-coding transcripts
23. Memorial Sloan-Kettering Cancer Center
Skimming and Non-coding Transcripts
25. Memorial Sloan-Kettering Cancer Center
Learning Strategy
28. Memorial Sloan-Kettering Cancer Center
Results for C. elegans
[Figure: F-score (y-axis, 0 to 0.7) versus expression percentile (x-axis, 0 to 100) for: mGene − ab initio w/ annotation; mGene.ngs − w/ annotation; Cufflinks − Trapnell et al. 2010; mGene.ngs − w/o annotation.]
29. Memorial Sloan-Kettering Cancer Center
Results for C. elegans
[Figure: F-score (y-axis, 0 to 0.7) versus expression percentile (x-axis, 0 to 100) for: mGene − ab initio w/ annotation; mGene.ngs − w/ annotation; Cufflinks − Trapnell et al. 2010; mGene.ngs − w/o annotation; mGene.nc − w/o annotation.]
30. Memorial Sloan-Kettering Cancer Center
Results for C. elegans
[Figure: F-score (y-axis, 0 to 0.7) versus expression percentile (x-axis, 0 to 100) for: mGene − ab initio w/ annotation; mGene.ngs − w/ annotation; Cufflinks − Trapnell et al. 2010; mGene.ngs − w/o annotation; mGene.nc − w/o annotation.]
De novo prediction works!
Modeling noncoding transcripts improves coding transcript prediction.
31. Memorial Sloan-Kettering Cancer Center
Gene Finding vs. Transcript Assembly
Gene expression level: low to high
Gene finding + RNA-seq => only one transcript
RNA transcript assembly => multiple transcripts
32. BIOINFORMATICS
ORIGINAL PAPER
Genome analysis
Vol. 29 no. 20 2013, pages 2529–2538
doi:10.1093/bioinformatics/btt442
Advance Access publication August 25, 2013
MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples
Jonas Behr1,2,*,†, André Kahles1, Yi Zhong1, Vipin T. Sreedharan1, Philipp Drewe1 and Gunnar Rätsch1,*
1Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA and 2Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany
Associate Editor: Ivo Hofacker
ABSTRACT
Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led
to tremendous improvements in the detection of expressed genes and
reconstruction of RNA transcripts. However, the extensive dynamic
range of gene expression, technical limitations and biases, as well
as the observed complexity of the transcriptional landscape, pose
profound computational challenges for transcriptome reconstruction.
Results: We present the novel framework MITIE (Mixed Integer
Transcript IdEntification) for simultaneous transcript reconstruction
and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few
transcripts collectively explaining the observed read data and show
how to find the optimal solution using Mixed Integer Programming.
MITIE can (i) take advantage of known transcripts, (ii) reconstruct
and quantify transcripts simultaneously in multiple samples, and
(iii) resolve the location of multi-mapping reads. It is designed for
genome- and assembly-based transcriptome reconstruction. We
present an extensive study based on realistic simulated RNA-Seq
data. When compared with state-of-the-art approaches, MITIE
proves to be significantly more sensitive and overall more accurate.
Moreover, MITIE yields substantial performance gains when used with
multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of
reconstructing omitted transcript annotations and the specificity with
respect to annotated transcripts. Our results corroborate that a well-
genic locus by means of alternative splicing, transcription start
and termination (e.g. Nilsen and Graveley, 2010; Rätsch et al.,
2007; Schweikert et al., 2009). A comprehensive catalog of all
transcripts encoded by a genomic locus is essential for downstream analyses that aim at a more detailed understanding of
gene expression and RNA processing regulation.
RNA-Seq is a method for parallel sequencing of a large number of RNA molecules based on high-throughput sequencing
technologies (ENCODE Project Consortium et al., 2012;
Mortazavi et al., 2008; Wang et al., 2009). Currently available
sequencing platforms typically provide several 10–100 millions of
sequence fragments (reads) with a typical length of 50–150 bases.
By mapping these reads back to the genome, one can determine
where gene products are encoded in the genome (e.g. Denoeud
et al., 2008; Guttman et al., 2010; Trapnell et al., 2010; Xia et al.,
2011) and collect evidence of RNA processing such as splicing
(Bradley et al., 2012; Sonnenburg et al., 2007) or RNA-editing
(Bahn et al., 2012).
In many cases, the RNA-Seq reads are first aligned to a reference genome using an alignment tool that identifies possible
read origins within the genome. Contiguous regions covered with
read alignments (possibly with small gaps) are candidates for
exonic segments. Alignment tools for RNA-Seq reads, such as
PALMapper (De Bona et al., 2008; Jean et al., 2010), TopHat
MiTie
Transcript prediction via combinatorial optimization that combines
evidence from multiple experiments & achieves higher accuracy.
33. Memorial Sloan-Kettering Cancer Center
Transcript Reconstruction with RNA-seq
Reads
Genome-based assembly (Cufflinks, Scripture)
Read alignments
De novo assembly (Trinity, Oases)
Genomic DNA
Data processing
Segment graph
Optimization
108 possible transcripts, 1028 possible subsets of transcripts
35. Memorial Sloan-Kettering Cancer Center
Enumerate and Quantify all Transcripts
Segment Graph
Potential Transcripts
[Behr et al., 2013]
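Enumerating potential transcripts from a segment graph amounts to listing the paths through a directed acyclic graph. The sketch below is a minimal illustration under assumptions of my own (segments numbered from 0, edges given as an adjacency dictionary, explicit start and end segments). It is not MiTie's data structure, and full enumeration is only feasible for small graphs.

```python
def enumerate_transcripts(edges, sources, sinks):
    """List every path from a source segment to a sink segment in a DAG."""
    paths = []

    def extend(path):
        node = path[-1]
        if node in sinks:
            paths.append(tuple(path))
        for successor in edges.get(node, ()):
            extend(path + [successor])

    for source in sources:
        extend([source])
    return paths


def transcript_matrix(paths, n_segments):
    """Binary matrix with one row per transcript and one column per segment."""
    return [[1 if segment in path else 0 for segment in range(n_segments)]
            for path in paths]


# Hypothetical segment graph with four segments: segments 1 and 2 can each be skipped.
edges = {0: [1, 2], 1: [2, 3], 2: [3]}
paths = enumerate_transcripts(edges, sources=[0], sinks={3})
U = transcript_matrix(paths, n_segments=4)
```

Each row of `U` marks the segments contained in one potential transcript, which is the kind of 0/1 matrix shown on the slides.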
36. Memorial Sloan-Kettering Cancer Center
Enumerate and Quantify all Transcripts
[Figure: segment graph and the binary (0/1) matrix of potential transcripts.]
[Behr et al., 2013]
37. Memorial Sloan-Kettering Cancer Center
Enumerate and Quantify all Transcripts
[Figure: segment graph, binary (0/1) matrix of potential transcripts, abundance of each of the eight potential transcripts in two samples, and the resulting expected coverage.]
Abundance per potential transcript:
Sample1: 0.0, 0.2, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0
Sample2: 0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0, 0.0
R. Bohnert and G. Rätsch, NAR (2010)
[Behr et al., 2013]
38. Memorial Sloan-Kettering Cancer Center
Enumerate and Quantify all Transcripts
[Figure: segment graph, binary (0/1) matrix of potential transcripts, abundance of each of the eight potential transcripts in two samples, and the resulting expected coverage.]
Abundance per potential transcript:
Sample1: 0.0, 0.2, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0
Sample2: 0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0, 0.0
$\min_{W} \; L(U^{T} \times W,\, C) + \gamma \times \lVert W \rVert_{1}$
where $U^{T} \times W$ is the expected coverage and $C$ is the observed coverage.
R. Bohnert and G. Rätsch, NAR (2010)
[Behr et al., 2013]
39. Memorial Sloan-Kettering Cancer Center
Simultaneous Identification & Quantification
[Figure: segment graph, binary (0/1) transcripts matrix with k rows, abundance of each transcript in two samples, and the resulting expected coverage.]
Abundance per transcript:
Sample1: 0.8, 0.2, 0.0, 0.0
Sample2: 0.9, 0.0, 0.1, 0.0
[Behr et al., 2013]
40. Memorial Sloan-Kettering Cancer Center
Simultaneous Identification & Quantification
[Figure: segment graph, binary (0/1) transcripts matrix U with N rows, abundance W of each transcript in two samples, and the resulting expected coverage.]
Abundance W per transcript:
Sample1: 0.8, 0.2, 0.0, 0.0
Sample2: 0.9, 0.0, 0.1, 0.0
$\min_{U,W} \; L(U^{T} \times W,\, C) + \gamma \times N$
where $U^{T} \times W$ is the expected coverage and $C$ is the observed coverage.
[Behr et al., 2013]
42. Memorial Sloan-Kettering Cancer Center
Simultaneous Identification & Quantification
[Figure: segment graph, binary (0/1) transcripts matrix U with N rows, abundance W of each transcript in two samples, and the resulting expected coverage.]
Abundance W per transcript:
Sample1: 0.8, 0.2, 0.0, 0.0
Sample2: 0.9, 0.0, 0.1, 0.0
$\min_{U,W} \; L(U^{T} \times W,\, C) + \gamma \times N \quad \text{s.t. } U \text{ is valid}$
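The objective min L(U^T × W, C) + γ × N can be made concrete with a sketch that only evaluates it for given candidates; MiTie itself optimizes over U and W with mixed integer programming, which is not reproduced here. Several details are assumptions: the negative binomial is parameterized by its mean and a fixed dispersion of 0.1, coverage is treated as counts per segment, and N is taken to be the number of transcripts with nonzero abundance in at least one sample.

```python
import math


def nb_neg_log_lik(observed, expected, dispersion=0.1):
    """Negative log-likelihood of a count under a negative binomial with the given mean."""
    r = 1.0 / dispersion                  # NB size parameter (assumed fixed)
    mu = max(expected, 1e-9)              # avoid log(0) for segments with zero expected coverage
    log_p = (math.lgamma(observed + r) - math.lgamma(r) - math.lgamma(observed + 1)
             + r * math.log(r / (r + mu)) + observed * math.log(mu / (r + mu)))
    return -log_p


def expected_coverage(U, W):
    """U^T W: U is transcripts x segments (0/1), W is transcripts x samples."""
    n_transcripts, n_segments, n_samples = len(U), len(U[0]), len(W[0])
    return [[sum(U[t][s] * W[t][j] for t in range(n_transcripts))
             for j in range(n_samples)]
            for s in range(n_segments)]


def objective(U, W, C, gamma):
    """Loss between expected and observed coverage plus gamma times the number of used transcripts."""
    E = expected_coverage(U, W)
    loss = sum(nb_neg_log_lik(C[s][j], E[s][j])
               for s in range(len(C)) for j in range(len(C[0])))
    n_used = sum(1 for row in W if any(w > 0 for w in row))
    return loss + gamma * n_used


U = [[1, 1, 1], [1, 0, 1]]                # two candidate transcripts over three segments
C = [[10], [8], [10]]                     # observed coverage, one sample (made-up counts)
two_transcripts = [[8], [2]]              # explains C exactly
one_transcript = [[9], [0]]               # sparser, slightly worse fit
```

With a small penalty such as γ = 0.05 the two-transcript solution has the lower objective value, while with γ = 1.0 the single-transcript solution wins, which is the trade-off the regularization term controls.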
46. Memorial Sloan-Kettering Cancer Center
MiTie’s Main Features
Uses a likelihood function L based on a probabilistic model for
the read coverage.
Uses combinatorial optimization to find transcripts that explain
data from multiple RNA-seq libraries
Newly predicted transcripts are penalized (once).
Can use already known/confirmed transcripts without penalty.
Provides a p-value for each transcript providing a confidence
measure for presence of predicted transcript.
Log-likelihood ratio test: $T_t = -2 \log \frac{p(D \mid M)}{p(D \mid M_t)}$
[Behr et al., 2013]
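A statistic of the form T_t = −2 log(p(D|M) / p(D|M_t)) can be turned into a p-value once a reference distribution is chosen. The sketch below assumes the standard asymptotic chi-square distribution with one degree of freedom, which may not be what MiTie uses; the log-likelihood values are made up.

```python
import math


def lrt_statistic(log_lik_m, log_lik_mt):
    """T_t = -2 * (log p(D|M) - log p(D|M_t)), computed from log-likelihoods."""
    return -2.0 * (log_lik_m - log_lik_mt)


def chi2_sf_1df(t):
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(t, 0.0) / 2.0))


# Hypothetical log-likelihoods of the two models being compared.
t_stat = lrt_statistic(log_lik_m=-105.0, log_lik_mt=-100.0)
p_value = chi2_sf_1df(t_stat)
```

The closed form via `math.erfc` holds only for one degree of freedom; comparing models that differ in more parameters would need a general chi-square survival function.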
47. Memorial Sloan-Kettering Cancer Center
MiTie Results
[Figure, panel A: Human Simulated Data. F-score on transcript level versus number of samples (1 to 5) for MITIE + MMO, MITIE, Cufflinks + Cuffmerge, and Cufflinks.]
[Figure, panel B: D. melanogaster modENCODE Data. F-score on transcript level versus number of samples (1 to 7) for MITIE and Cufflinks + Cuffmerge.]
[Behr et al., 2013]
48. Memorial Sloan-Kettering Cancer Center
Gene Finding vs. Transcript Assembly
Gene expression level: low to high
mGene.ngs => only one transcript
MiTie => multiple transcripts
low confidence to high confidence for alternative transcripts
49. Memorial Sloan-Kettering Cancer Center
Conclusions
Genome annotation pipeline
Transcript Skimmer identifies highly expressed genes for training
mGene.ngs predicts coding and non-coding transcripts
MiTie predicts alternative transcripts for highly expressed genes
Genome annotation pipeline requires only
Genome sequence
RNA-seq alignments
Good for annotating new genomes or improving existing ones
Sources are free http://bioweb.me/mgene
http://bioweb.me/mitie
Functionality partially available in Galaxy instance
(http://galaxy.cbio.mskcc.org)
Thank you!
56. References I
Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. Mach. Learn., pages 3–10, 2003.
J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Rätsch. RNA-seq and tiling arrays for improved gene finding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URL http://www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.
Jonas Behr, André Kahles, Yi Zhong, Vipin T Sreedharan, Philipp Drewe, and Gunnar Rätsch. MITIE: Simultaneous RNA-seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–38, Oct 2013. doi: 10.1093/bioinformatics/btt442.
RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu, G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Schölkopf, M Nordborg, G Rätsch, JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi: 10.1126/science.1138632.
G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In K. Tsuda, B. Schoelkopf, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.
G Rätsch and S Sonnenburg. Large scale hidden semi-Markov SVMs. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS’06), volume 19, pages 1161–1168, Cambridge, MA, 2007. MIT Press. URL http://www.fml.tuebingen.mpg.de/raetsch/projects/HSMSVM.
G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.
Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Krüger, Sören Sonnenburg, and Gunnar Rätsch. mGene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 2009. URL http://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html. Advance access June 29, 2009.
S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks, 2002.
Sören Sonnenburg, Alexander Zien, and Gunnar Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.
57. References II
VT Sreedharan, SJ Schultheiss, G Jean, A Kahles, R Bohnert, P Drewe, P Mudrakarta, N Görnitz, G Zeller, and Gunnar Rätsch. Oqtans: The RNA-seq workbench in the cloud for complete and reproducible quantitative transcriptome analysis. Bioinformatics, 2014. Bioinformatics Advance Access published January 11, 2014.
G Zeller, RM Clark, K Schneeberger, A Bohlen, D Weigel, and G Rätsch. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.
A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799–807, September 2000.