RNA-Seq-based Genome Annotation
using mGene.ngs and MiTie
Gunnar Rätsch
Biomedical Data Science Group
Computational Biology Center
Memorial Sloan-Kettering Cancer Center
gxr #mGene #MiTie #PAGXXII

Acknowledgements and Disclosures
Main contributors
Gabriele Schweikert
Jonas Behr

Andre Kahles

Funding

Financial interest disclosure

© Gunnar Rätsch (cBio@MSKCC)

RNA-Seq-based Annotation using mGene.ngs and MiTie

PAG XXII Gene Discovery Workshop

2

Genome Annotation Pipeline(s)

Proposed new gene finding method (mGene.ngs) for reannotation of
19 A. thaliana genomes (and genome assembly + analysis).

mGene.ngs Overview
Goal: Predict annotation based on RNA-seq and genomic sequence information
Learn function $f(y \mid x)$ that scores gene models $y$ based on different sources of information $x$
Train parameters such that $f(y \mid x) \gg f(y' \mid x)$ for the true gene model $y$ and all $y' \neq y$ (“large margin”)

Hidden semi-Markov Support Vector Machines (HsM-SVMs) [Altun et al., 2003; Rätsch and Sonnenburg, 2007]

Automatically adapts to quality of RNA-seq data/alignments
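
The large-margin condition can be illustrated with a small sketch. It assumes a linear scoring function over a feature vector per gene model. The feature vectors, the function names and the single subgradient step are hypothetical and only show the idea; actual HsM-SVM training solves a larger optimization problem over all competing gene models.

```python
def score(weights, features):
    """Linear score f(y|x) = <w, phi(x, y)> for one gene model."""
    return sum(w * phi for w, phi in zip(weights, features))


def margin_violation(weights, true_features, wrong_features, margin=1.0):
    """Amount by which f(y|x) fails to exceed f(y'|x) by the margin."""
    gap = score(weights, true_features) - score(weights, wrong_features)
    return max(0.0, margin - gap)


def subgradient_step(weights, true_features, wrong_features, lr=0.1, margin=1.0):
    """One subgradient step on the hinge slack (regularizer omitted for brevity)."""
    if margin_violation(weights, true_features, wrong_features, margin) == 0.0:
        return list(weights)  # margin already satisfied, nothing to do
    return [w + lr * (t - f)
            for w, t, f in zip(weights, true_features, wrong_features)]


# Hypothetical features: [summed splice-signal score, introns supported by spliced reads]
true_model = [2.0, 3.0]
wrong_model = [1.0, 0.0]
w = [0.0, 0.0]
print(margin_violation(w, true_model, wrong_model))  # 1.0
w = subgradient_step(w, true_model, wrong_model)
print(margin_violation(w, true_model, wrong_model))
```

In this toy setting the weight on an informative feature grows whenever the margin is violated, which is one way to read the statement that the method adapts to the quality of the RNA-seq evidence.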

Training of mGene
[Figure: true gene model and wrong gene model along the genomic position. STEP 1: SVM signal predictions (tss, tis, acc, don, stop). The score f(y|x) separates the true from the wrong gene model by a large margin.]

Training of mGene.ngs
[Figure: true gene model and wrong gene model along the genomic position. STEP 1: SVM signal predictions (tss, tis, acc, don, stop), with RNA-seq tracks added: coverage and intron support from spliced reads. The score f(y|x) separates the true from the wrong gene model by a larger margin.]

Results for C. elegans
RNA-seq:
paired-end, strand-specific, RNA-ligation-based protocol
76 bp reads, 50 million reads
Alignment with PALMapper
Evaluation:
Transcript-level F-score of coding transcripts
... for different expression levels
Compare mGene (ab initio), mGene.ngs, Cufflinks
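
A transcript-level F-score can be computed along the following lines. This is a minimal sketch that assumes a predicted transcript counts as correct only if its exon structure matches an annotated transcript exactly. The matching criterion, the function name and the coordinates are assumptions for illustration; the evaluation in the talk may define a match differently.

```python
def transcript_f_score(predicted, annotated):
    """Harmonic mean of precision and sensitivity on the transcript level.

    Each transcript is a tuple of (start, end) exon coordinates.
    """
    pred, ann = set(predicted), set(annotated)
    true_positives = len(pred & ann)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    sensitivity = true_positives / len(ann)
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical exon structures
annotated = [((100, 200), (300, 400)),
             ((100, 200), (350, 400)),
             ((500, 650),)]
predicted = [((100, 200), (300, 400)),   # matches the annotation
             ((500, 650),),              # matches the annotation
             ((100, 220), (300, 400)),   # wrong donor site
             ((700, 800),)]              # not annotated
print(transcript_f_score(predicted, annotated))
```

To obtain a curve over expression levels, the same function could be applied separately to the transcripts falling into each expression percentile bin.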


Digestion
Observations:
RNA-seq improves performance
Genomic signals help considerably (see Cufflinks)
Problems:
Need existing annotation for training
Cannot predict non-coding transcripts


Skimming and Non-coding Transcripts


Learning Strategy


Results for C. elegans
[Figure: F-score (y-axis, 0 to 0.7) versus expression percentile (x-axis, 0 to 100) for the following methods:
mGene - ab initio w/ annotation
mGene.ngs - w/ annotation
cufflinks - Trapnell et al. 2010
mGene.ngs - w/o annotation
mGene.nc - w/o annotation]
De novo prediction works!
Modeling noncoding transcripts improves coding transcript prediction.

Gene Finding vs. Transcript Assembly
[Figure: gene expression level axis, from low to high]
Gene finding + RNA-seq => only one transcript
RNA transcript assembly => multiple transcripts

BIOINFORMATICS, ORIGINAL PAPER, Genome analysis
Vol. 29 no. 20 2013, pages 2529–2538
doi:10.1093/bioinformatics/btt442
Advance Access publication August 25, 2013

MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples
Jonas Behr1,2, André Kahles1, Yi Zhong1, Vipin T. Sreedharan1, Philipp Drewe1 and Gunnar Rätsch1
1 Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA and 2 Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany
Associate Editor: Ivo Hofacker

ABSTRACT
Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction.
Results: We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-…

… genic locus by means of alternative splicing, transcription start and termination (e.g. Nilsen and Graveley, 2010; Rätsch et al., 2007; Schweikert et al., 2009). A comprehensive catalog of all transcripts encoded by a genomic locus is essential for downstream analyses that aim at a more detailed understanding of gene expression and RNA processing regulation.
RNA-Seq is a method for parallel sequencing of a large number of RNA molecules based on high-throughput sequencing technologies (ENCODE Project Consortium et al., 2012; Mortazavi et al., 2008; Wang et al., 2009). Currently available sequencing platforms typically provide several 10–100 millions of sequence fragments (reads) with a typical length of 50–150 bases. By mapping these reads back to the genome, one can determine where gene products are encoded in the genome (e.g. Denoeud et al., 2008; Guttman et al., 2010; Trapnell et al., 2010; Xia et al., 2011) and collect evidence of RNA processing such as splicing (Bradley et al., 2012; Sonnenburg et al., 2007) or RNA-editing (Bahn et al., 2012).
In many cases, the RNA-Seq reads are first aligned to a reference genome using an alignment tool that identifies possible read origins within the genome. Contiguous regions covered with read alignments (possibly with small gaps) are candidates for exonic segments. Alignment tools for RNA-Seq reads, such as PALMapper (De Bona et al., 2008; Jean et al., 2010), TopHat …

MiTie
Transcript prediction via combinatorial optimization that combines evidence from multiple experiments & achieves higher accuracy.

Transcript Reconstruction with RNA-seq
[Figure: workflow with the labels Reads, Genome Based Assembly (Cufflinks, Scripture), Read alignments, De novo Assembly (Trinity, Oases), Genomic DNA, Data processing, Segment graph, Optimization]
108 possible transcripts, 1028 possible subsets of transcripts
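
The reason the number of possible transcripts grows so quickly can be seen by enumerating paths through a segment graph. The sketch below assumes the graph is acyclic, with edges pointing from a segment to any segment that may follow it in a transcript. The graph, the function name and the choice of start and end segments are hypothetical.

```python
def enumerate_transcripts(edges, sources, sinks):
    """Return every path from a start segment to an end segment.

    edges maps a segment to the segments that may directly follow it.
    The graph is assumed to be acyclic.
    """
    paths = []

    def extend(path):
        node = path[-1]
        if node in sinks:
            paths.append(tuple(path))
        for nxt in edges.get(node, ()):
            extend(path + [nxt])

    for start in sources:
        extend([start])
    return paths


# Hypothetical locus with five segments; segments 2 and 4 can be skipped
edges = {1: [2, 3], 2: [3], 3: [4, 5], 4: [5]}
paths = enumerate_transcripts(edges, sources=[1], sinks={5})
print(len(paths))  # 4
for path in paths:
    print(path)
```

Each independently skippable segment doubles the number of paths, so exhaustive enumeration is only feasible for small loci, which is one motivation for treating transcript selection as an optimization problem.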

Enumerate and Quantify all Transcripts
[Figure: segment graph, the potential transcripts enumerated from it, and the expected coverage per sample]
Potential transcripts U (rows: transcripts, columns: segments) and abundance W per sample:

transcript  segments        Sample1  Sample2
1           1 0 1 0 1 1 1   0.0      0.0
2           1 0 1 1 1 1 1   0.2      0.0
3           1 1 1 0 1 1 1   0.0      0.1
4           1 0 1 0 1 1 1   0.8      0.9
5           1 1 1 1 1 0 0   0.0      0.0
6           1 0 1 1 1 0 0   0.0      0.0
7           1 1 1 0 1 0 0   0.0      0.0
8           1 0 1 0 1 0 0   0.0      0.0

$\min_{W} \; L(U^{T} \times W,\, C) + \gamma \times \lVert W \rVert_{1}$
with expected coverage $U^{T} \times W$ and observed coverage $C$.
R. Bohnert and G. Rätsch, NAR (2010)
[Behr et al., 2013]
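
Quantification with a fixed set of potential transcripts and an L1 penalty on the abundances can be sketched as follows. The sketch swaps in a squared loss for L and uses a simple proximal gradient method with non-negative abundances. Both are assumptions made for brevity and are not the loss or solver of the cited work; the matrix and coverage values are hypothetical.

```python
def quantify(U, c, gamma, iters=3000):
    """Minimise 0.5 * ||U^T w - c||^2 + gamma * sum(w) subject to w >= 0.

    U: list of transcripts, each a list of 0/1 segment indicators.
    c: observed coverage per segment (one sample).
    """
    n_transcripts, n_segments = len(U), len(c)
    # The trace of U U^T bounds its largest eigenvalue, so 1/trace is a safe step size.
    step = 1.0 / max(1.0, sum(u * u for row in U for u in row))
    w = [0.0] * n_transcripts
    for _ in range(iters):
        expected = [sum(U[t][s] * w[t] for t in range(n_transcripts))
                    for s in range(n_segments)]
        residual = [e - o for e, o in zip(expected, c)]
        for t in range(n_transcripts):
            grad = sum(U[t][s] * residual[s] for s in range(n_segments))
            # Gradient step on the loss, then shrink by gamma and clip at zero.
            w[t] = max(0.0, w[t] - step * (grad + gamma))
    return w


# Hypothetical locus: transcript 0 skips the middle segment, transcript 1 includes it
U = [[1, 0, 1],
     [1, 1, 1]]
coverage = [1.0, 0.2, 1.0]
print(quantify(U, coverage, gamma=0.0))
print(quantify(U, coverage, gamma=0.1))
```

With gamma set to zero the abundances reproduce the observed coverage; a positive gamma shrinks them, and with many candidate transcripts it pushes the abundance of poorly supported ones to exactly zero.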

Simultaneous Identification & Quantification
[Figure: segment graph, transcripts matrix U with N rows, abundance W for Sample1 and Sample2, and expected coverage]
Abundance W (rows: transcripts):

Sample1  Sample2
0.8      0.9
0.2      0.0
0.0      0.1
0.0      0.0

$\min_{U,W} \; L(U^{T} \times W,\, C) + \gamma \times N$
s.t. $U$ is valid
[Behr et al., 2013]
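
The joint objective, a data-fit term plus a penalty on the number N of transcripts, can be illustrated by brute force on a tiny example. The sketch assumes a squared loss, one sample, and a short list of already valid candidate transcripts to choose from. All of these are simplifications: the slide describes a combinatorial optimization over U itself, and the names and numbers below are hypothetical.

```python
from itertools import combinations


def fit_abundance(U, c, iters=3000):
    """Non-negative least squares by projected gradient descent."""
    if not U:
        return []
    step = 1.0 / max(1.0, sum(u * u for row in U for u in row))
    w = [0.0] * len(U)
    for _ in range(iters):
        residual = [sum(row[s] * w_t for row, w_t in zip(U, w)) - c[s]
                    for s in range(len(c))]
        w = [max(0.0, w_t - step * sum(u * r for u, r in zip(row, residual)))
             for row, w_t in zip(U, w)]
    return w


def squared_loss(U, w, c):
    expected = [sum(row[s] * w_t for row, w_t in zip(U, w)) for s in range(len(c))]
    return 0.5 * sum((e - o) ** 2 for e, o in zip(expected, c))


def select_transcripts(candidates, c, gamma):
    """Try every subset of candidates; return (objective, subset, abundances)."""
    best = None
    for n in range(len(candidates) + 1):
        for subset in combinations(range(len(candidates)), n):
            U = [candidates[i] for i in subset]
            w = fit_abundance(U, c)
            objective = squared_loss(U, w, c) + gamma * n
            if best is None or objective < best[0]:
                best = (objective, subset, w)
    return best


candidates = [[1, 0, 1], [1, 1, 1], [1, 1, 0]]  # hypothetical valid transcripts
coverage = [1.0, 0.2, 1.0]
print(select_transcripts(candidates, coverage, gamma=0.005)[1])  # (0, 1)
print(select_transcripts(candidates, coverage, gamma=0.05)[1])   # (0,)
```

A small penalty keeps the minor isoform that explains the coverage of the middle segment, while a larger penalty drops it. Enumerating subsets scales exponentially, which is why a dedicated solver is needed for realistic loci.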

MiTie’s Main Features
Uses a likelihood function L based on a probabilistic model for the read coverage.
Uses combinatorial optimization to find transcripts that explain data from multiple RNA-seq libraries.
Newly predicted transcripts are penalized (once).
Can use already known/confirmed transcripts without penalty.
Provides a p-value for each transcript as a confidence measure for the presence of the predicted transcript.
Log-likelihood ratio test: $T_t = -2 \log \frac{p(D \mid M)}{p(D \mid M_t)}$
[Behr et al., 2013]
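
How a likelihood ratio statistic turns into a p-value can be sketched as below. The sketch assumes the two models are nested and differ by one free parameter, so that the statistic is compared against a chi-square distribution with one degree of freedom. That null distribution is an assumption and is not stated on the slide; the log-likelihood values are hypothetical.

```python
import math


def lrt_statistic(loglik_restricted, loglik_full):
    """-2 * log of the likelihood ratio (restricted model over full model)."""
    return -2.0 * (loglik_restricted - loglik_full)


def chi2_sf_1df(t):
    """P(X > t) for a chi-square variable with one degree of freedom."""
    if t <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(t / 2.0))


# Hypothetical log-likelihoods of the read data without and with one transcript
statistic = lrt_statistic(loglik_restricted=-105.0, loglik_full=-100.0)
print(statistic)  # 10.0
print(chi2_sf_1df(statistic))
```

A small p-value indicates that the data are explained much better when the transcript is included, which is the sense in which it can serve as a confidence measure.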


MiTie Results
[Figure, panel A: Human Simulated Data. F-score on transcript level (y-axis, 0.35 to 0.45) versus number of samples (x-axis, 1 to 5) for MITIE + MMO, MITIE, Cufflinks + Cuffmerge and Cufflinks.]
[Figure, panel B: D. melanogaster modENCODE Data. F-score on transcript level (y-axis, 0.29 to 0.37) versus number of samples (x-axis, 1 to 7) for MITIE and Cufflinks + Cuffmerge.]
[Behr et al., 2013]

Gene Finding vs. Transcript Assembly
[Figure: gene expression level axis, from low to high]
mGene.ngs => only one transcript
MiTie => multiple transcripts
Confidence for alternative transcripts: low at low expression, high at high expression


Conclusions
Genome annotation pipeline
Transcript Skimmer identifies highly expressed genes for training
mGene.ngs predicts coding and non-coding transcripts
MiTie predicts alternative transcripts for highly expressed genes

Genome annotation pipeline requires only
Genome sequence
RNA-seq alignments

Good for annotating new genomes or improving existing ones
Sources are free: http://bioweb.me/mgene and http://bioweb.me/mitie
Functionality partially available in Galaxy instance (http://galaxy.cbio.mskcc.org)

Thank you!

Just published:

Check out:
http://oqtans.org
http://galaxy.cbio.mskcc.org
[Sreedharan et al., 2014]

References
Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. Mach. Learn., pages 3–10, 2003.
J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Rätsch. RNA-seq and tiling arrays for improved gene finding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URL http://www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.
J. Behr, A. Kahles, Y. Zhong, V. T. Sreedharan, P. Drewe, and G. Rätsch. MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–38, Oct 2013. doi: 10.1093/bioinformatics/btt442.
R. M. Clark, G. Schweikert, C. Toomajian, S. Ossowski, G. Zeller, P. Shinn, N. Warthmann, T. T. Hu, G. Fu, D. A. Hinds, H. Chen, K. A. Frazer, D. H. Huson, B. Schölkopf, M. Nordborg, G. Rätsch, J. R. Ecker, and D. Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi: 10.1126/science.1138632.
G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In K. Tsuda, B. Schölkopf, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.
G. Rätsch and S. Sonnenburg. Large scale hidden semi-Markov SVMs. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS’06), volume 19, pages 1161–1168, Cambridge, MA, 2007. MIT Press. URL http://www.fml.tuebingen.mpg.de/raetsch/projects/HSMSVM.
G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.
G. Schweikert, A. Zien, G. Zeller, J. Behr, C. Dieterich, C. S. Ong, P. Philips, F. De Bona, L. Hartmann, A. Bohlen, N. Krüger, S. Sonnenburg, and G. Rätsch. mGene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 2009. URL http://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html. Advance access June 29, 2009.
S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks, 2002.
S. Sonnenburg, A. Zien, and G. Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.
V. T. Sreedharan, S. J. Schultheiss, G. Jean, A. Kahles, R. Bohnert, P. Drewe, P. Mudrakarta, N. Görnitz, G. Zeller, and G. Rätsch. Oqtans: The RNA-seq workbench in the cloud for complete and reproducible quantitative transcriptome analysis. Bioinformatics, 2014. Advance Access published January 11, 2014.
G. Zeller, R. M. Clark, K. Schneeberger, A. Bohlen, D. Weigel, and G. Rätsch. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.
A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799–807, September 2000.

F Giordano ScanPAV Analysis PipelineF Giordano ScanPAV Analysis Pipeline
F Giordano ScanPAV Analysis Pipeline
 
Bioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisBioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysis
 
Genomics, Bioinformatics, and Pathology
Genomics, Bioinformatics, and PathologyGenomics, Bioinformatics, and Pathology
Genomics, Bioinformatics, and Pathology
 
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics WorkshopLopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
Lopez-Bigas talk at the EBI/EMBL Cancer Genomics Workshop
 
The National Center for Biotechnology Information (NCBI) Pathogen Analysis Pi...
The National Center for Biotechnology Information (NCBI) Pathogen Analysis Pi...The National Center for Biotechnology Information (NCBI) Pathogen Analysis Pi...
The National Center for Biotechnology Information (NCBI) Pathogen Analysis Pi...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
 
NGS and the molecular basis of disease: a practical view
NGS and the molecular basis of disease: a practical viewNGS and the molecular basis of disease: a practical view
NGS and the molecular basis of disease: a practical view
 
Apac distributor training series 3 swift product for cancer study
Apac distributor training series 3  swift product for cancer studyApac distributor training series 3  swift product for cancer study
Apac distributor training series 3 swift product for cancer study
 
final_presentation
final_presentationfinal_presentation
final_presentation
 
160627 giab for festival sv workshop
160627 giab for festival sv workshop160627 giab for festival sv workshop
160627 giab for festival sv workshop
 
Kshivets O. Lung Cancer Surgery: Prognosis
Kshivets O. Lung Cancer Surgery: PrognosisKshivets O. Lung Cancer Surgery: Prognosis
Kshivets O. Lung Cancer Surgery: Prognosis
 
Research Program Genetic Gains (RPGG) Review Meeting 2021: Forward Breeding: ...
Research Program Genetic Gains (RPGG) Review Meeting 2021: Forward Breeding: ...Research Program Genetic Gains (RPGG) Review Meeting 2021: Forward Breeding: ...
Research Program Genetic Gains (RPGG) Review Meeting 2021: Forward Breeding: ...
 
Karen miga centromere sequence characterization and variant detection
Karen miga centromere sequence characterization and variant detectionKaren miga centromere sequence characterization and variant detection
Karen miga centromere sequence characterization and variant detection
 
Next generation sequencing by Muhammad Abbas
Next generation sequencing by Muhammad AbbasNext generation sequencing by Muhammad Abbas
Next generation sequencing by Muhammad Abbas
 

RNA-seq based Genome Annotation with mGene.ngs and MiTie

  • 1. RNA-Seq-based Genome Annotation using mGene.ngs and MiTie. Gunnar Rätsch, Biomedical Data Science Group, Computational Biology Center, Memorial Sloan-Kettering Cancer Center. gxr #mGene #MiTie #PAGXXII
  • 2. Acknowledgements and Disclosures. Main contributors: Gabriele Schweikert, Jonas Behr, Andre Kahles. Funding. Financial interest disclosure. © Gunnar Rätsch (cBio@MSKCC), RNA-Seq-based Annotation using mGene.ngs and MiTie, PAG XXII Gene Discovery Workshop.
  • 6. Genome Annotation Pipeline(s)
  • 11. Proposed new gene finding method (mGene.ngs) for reannotation of 19 A. thaliana genomes (and genome assembly + analysis).
  • 12. mGene.ngs Overview. Goal: predict annotation based on RNA-seq and genomic sequence information. Learn a function $f(y \mid x)$ that scores gene models $y$ based on different sources of information $x$. Train parameters such that $f(y \mid x) \gg f(y' \mid x)$ for all $y' \neq y$ ("large margin"). Hidden semi-Markov Support Vector Machines (HsM-SVMs) [Altun et al., 2003, Rätsch and Sonnenburg, 2007]. Automatically adapts to quality of RNA-seq data/alignments.
  • 15. Training of mGene. [Figure: true gene model along the genomic position; STEP 1: SVM signal predictions (tss, tis, acc, don, stop); score f(y|x).]
  • 17. Training of mGene. [Figure: true gene model and wrong gene model along the genomic position; STEP 1: SVM signal predictions (tss, tis, acc, don, stop); score f(y|x) separates the two models by a large margin.]
  • 18. Training of mGene.ngs. [Figure: true gene model and wrong gene model along the genomic position; STEP 1: SVM signal predictions (tss, tis, acc, don, stop); RNA-seq coverage; intron support from spliced reads; score f(y|x) with a large margin.]
  • 19. Training of mGene.ngs. [Figure: true gene model and wrong gene model along the genomic position; STEP 1: SVM signal predictions (tss, tis, acc, don, stop); RNA-seq coverage; intron support from spliced reads; score f(y|x) with a larger margin.]
  • 20. Results for C. elegans. RNA-seq: paired-end, strand-specific, RNA ligation based protocol; 76 bp reads, 50 million reads; alignment with PALMapper. Evaluation: transcript-level F-score of coding transcripts ... for different expression levels. Compare mGene (ab initio), mGene.ngs, cufflinks.
  • 21. Results for C. elegans
  • 22. Digestion. Observations: RNA-seq helps to improve performance; genomic signals help much (see cufflinks). Problems: need existing annotation for training; cannot predict non-coding transcripts.
  • 23. Skimming and Non-coding Transcripts
  • 25. Learning Strategy
  • 28. Results for C. elegans. [Plot: F-score (0 to 0.7) against expression percentile (0 to 100). Curves: mGene, ab initio, w/ annotation; mGene.ngs, w/ annotation; cufflinks (Trapnell et al. 2010); mGene.ngs, w/o annotation.]
  • 29. Results for C. elegans. [Plot: F-score (0 to 0.7) against expression percentile (0 to 100). Curves: mGene, ab initio, w/ annotation; mGene.ngs, w/ annotation; cufflinks (Trapnell et al. 2010); mGene.ngs, w/o annotation; mGene.nc, w/o annotation.]
  • 30. Results for C. elegans. [Plot: F-score (0 to 0.7) against expression percentile (0 to 100). Curves: mGene, ab initio, w/ annotation; mGene.ngs, w/ annotation; cufflinks (Trapnell et al. 2010); mGene.ngs, w/o annotation; mGene.nc, w/o annotation.] De novo prediction works! Modeling noncoding transcripts improves coding transcript prediction.
  • 31. Gene Finding vs. Transcript Assembly. Gene expression level: low to high. Gene finding + RNA-seq => only one transcript. RNA transcript assembly => multiple transcripts.
  • 32. Bioinformatics, Original Paper, Genome analysis. Vol. 29 no. 20 2013, pages 2529–2538, doi:10.1093/bioinformatics/btt442. Advance Access publication August 25, 2013. MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Jonas Behr, André Kahles, Yi Zhong, Vipin T. Sreedharan, Philipp Drewe and Gunnar Rätsch. Computational Biology Center, Sloan-Kettering Institute, 1275 York Avenue, New York, NY 10065, USA, and Friedrich Miescher Laboratory, Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany. Associate Editor: Ivo Hofacker.
    ABSTRACT. Motivation: High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction.
    Results: We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well- [...]
    Introduction (excerpt): [...] genic locus by means of alternative splicing, transcription start and termination (e.g. Nilsen and Graveley, 2010; Rätsch et al., 2007; Schweikert et al., 2009). A comprehensive catalog of all transcripts encoded by a genomic locus is essential for downstream analyses that aim at a more detailed understanding of gene expression and RNA processing regulation. RNA-Seq is a method for parallel sequencing of a large number of RNA molecules based on high-throughput sequencing technologies (ENCODE Project Consortium et al., 2012; Mortazavi et al., 2008; Wang et al., 2009). Currently available sequencing platforms typically provide several 10–100 millions of sequence fragments (reads) with a typical length of 50–150 bases. By mapping these reads back to the genome, one can determine where gene products are encoded in the genome (e.g. Denoeud et al., 2008; Guttman et al., 2010; Trapnell et al., 2010; Xia et al., 2011) and collect evidence of RNA processing such as splicing (Bradley et al., 2012; Sonnenburg et al., 2007) or RNA-editing (Bahn et al., 2012). In many cases, the RNA-Seq reads are first aligned to a reference genome using an alignment tool that identifies possible read origins within the genome. Contiguous regions covered with read alignments (possibly with small gaps) are candidates for exonic segments. Alignment tools for RNA-Seq reads, such as PALMapper (De Bona et al., 2008; Jean et al., 2010), TopHat [...]
    Slide caption: MiTie: transcript prediction via combinatorial optimization that combines evidence from multiple experiments and achieves higher accuracy.
  • 33. Transcript Reconstruction with RNA-seq Reads. Genome-based assembly (Cufflinks, Scripture): read alignments. De novo assembly (Trinity, Oases). Genomic DNA, data processing, segment graph, optimization. 108 possible transcripts, 1028 possible subsets of transcripts.
  • 35. Enumerate and Quantify all Transcripts. Segment graph; potential transcripts. [Behr et al., 2013]
  • 36. Memorial Sloan-Kettering Cancer Center Enumerate and Quantify all Transcripts Segment Graph Potential Transcripts 1 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 [Behr et al., 2013] c Gunnar R¨tsch (cBio@MSKCC) a RNA-Seq-based Annotation using mGene.ngs and MiTie PAG XXII Gene Discovery Workshop 16
  • 37. Memorial Sloan-Kettering Cancer Center Enumerate and Quantify all Transcripts Segment Graph Abundance Potential Transcripts 1 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 Sample1 Sample2 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0.0 0.2 0.0 0.8 0.0 0.0 0.0 0.0 Expected coverage 0.0 0.0 0.1 0.9 0.0 0.0 0.0 0.0 R. Bohnert and G. R¨tsch, NAR (2010) a c Gunnar R¨tsch (cBio@MSKCC) a RNA-Seq-based Annotation using mGene.ngs and MiTie [Behr et al., 2013] PAG XXII Gene Discovery Workshop 16
  • 38. Memorial Sloan-Kettering Cancer Center Enumerate and Quantify all Transcripts Segment Graph Abundance Potential Transcripts 1 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1 1 0 0 Sample1 Sample2 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 min L( U T × W , W 0.0 0.2 0.0 0.8 0.0 0.0 0.0 0.0 Expected coverage 0.0 0.0 0.1 0.9 0.0 0.0 0.0 0.0 C )+γ× W 1 expected coverage observed coverage R. Bohnert and G. R¨tsch, NAR (2010) a [Behr et al., 2013] c Gunnar R¨tsch (cBio@MSKCC) a RNA-Seq-based Annotation using mGene.ngs and MiTie PAG XXII Gene Discovery Workshop 16
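The quantification step on this slide fits abundances W for a fixed set of potential transcripts U, with an L1 penalty that drives most abundances to zero. The sketch below illustrates the idea for a single sample. It swaps in a squared-error loss for the likelihood-based loss L and uses plain proximal gradient descent with a non-negativity constraint; both choices, and the name `quantify`, are assumptions made for illustration and are not the method of Bohnert and Rätsch or of MiTie.

```python
def quantify(U, c, gamma=0.01, iters=5000):
    """Minimise 0.5 * ||U^T w - c||^2 + gamma * sum(w) subject to w >= 0.

    U: list of transcripts, each a 0/1 list over segments.
    c: observed coverage per segment (single sample).
    """
    k, s = len(U), len(c)
    # Safe step size: the trace of U U^T bounds its largest eigenvalue.
    step = 1.0 / max(1.0, sum(u * u for row in U for u in row))
    w = [0.0] * k
    for _ in range(iters):
        # Expected coverage per segment, (U^T w)_j, and its residual.
        expected = [sum(U[t][j] * w[t] for t in range(k)) for j in range(s)]
        residual = [expected[j] - c[j] for j in range(s)]
        for t in range(k):
            grad = sum(U[t][j] * residual[j] for j in range(s)) + gamma
            # Gradient step followed by projection onto w >= 0.
            w[t] = max(0.0, w[t] - step * grad)
    return w


# Three segments, three candidate transcripts; the coverage was generated
# from 0.8 * transcript 0 + 0.2 * transcript 1.
U = [[1, 1, 1], [1, 0, 1], [1, 1, 0]]
c = [1.0, 0.8, 1.0]
w = quantify(U, c)
print([round(x, 3) for x in w])
```

On this toy input the third abundance is driven to zero while the first two stay close to the generating values 0.8 and 0.2, with a small shrinkage caused by the penalty.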
  • 39. Simultaneous Identification & Quantification. [Figure: segment graph; transcripts matrix U with N rows; abundance matrix W with one column per sample (Sample1: 0.8 0.2 0.0 0.0; Sample2: 0.9 0.0 0.1 0.0); expected coverage.] Objective: $\min_{U,W} L(U^\top W, C) + \gamma N$ subject to $U$ being valid, where $U^\top W$ is the expected coverage, $C$ is the observed coverage, and $N$ is the number of transcripts in $U$. [Behr et al., 2013]
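The joint objective on this slide can be made concrete on a toy locus. The sketch below searches over all subsets of graph paths exhaustively, fits abundances for each subset, and keeps the subset with the smallest loss plus gamma times the number of transcripts. Exhaustive search and the squared-error loss are stand-ins chosen for readability; the talk describes combinatorial optimization with a likelihood-based loss, and none of the function names here come from MiTie.

```python
from itertools import combinations


def enumerate_paths(succ, node, sink):
    """All node-to-sink paths of an acyclic segment graph, as node lists."""
    if node == sink:
        return [[sink]]
    return [[node] + rest
            for nxt in succ.get(node, [])
            for rest in enumerate_paths(succ, nxt, sink)]


def fit_abundances(rows, c, iters=2000):
    """Non-negative least squares by projected gradient: (w, squared loss)."""
    k, s = len(rows), len(c)
    step = 1.0 / sum(u for row in rows for u in row)  # 1 / trace(U U^T)
    w = [0.0] * k
    for _ in range(iters):
        res = [sum(rows[t][j] * w[t] for t in range(k)) - c[j] for j in range(s)]
        w = [max(0.0, w[t] - step * sum(rows[t][j] * res[j] for j in range(s)))
             for t in range(k)]
    res = [sum(rows[t][j] * w[t] for t in range(k)) - c[j] for j in range(s)]
    return w, sum(r * r for r in res)


def best_transcript_set(succ, source, sink, n_segments, c, gamma):
    """Return (score, transcripts, abundances) minimising loss + gamma * N."""
    paths = enumerate_paths(succ, source, sink)
    # "U is valid": every row of U is the indicator vector of a graph path.
    rows = [[1 if j in p else 0 for j in range(n_segments)] for p in paths]
    best = None
    for n in range(1, len(rows) + 1):
        for subset in combinations(range(len(rows)), n):
            w, loss = fit_abundances([rows[i] for i in subset], c)
            score = loss + gamma * n
            if best is None or score < best[0]:
                best = (score, [paths[i] for i in subset], w)
    return best


# Toy locus: segments 1 and 3 are optional. The coverage was generated from
# 0.8 * (all five segments) + 0.2 * (segments 0, 2, 4).
succ = {0: [1, 2], 1: [2], 2: [3, 4], 3: [4]}
c = [1.0, 0.8, 1.0, 0.8, 1.0]
score, transcripts, w = best_transcript_set(succ, 0, 4, 5, c, gamma=0.01)
print(transcripts, [round(x, 3) for x in w])
```

With this small penalty the two generating transcripts are recovered. A larger gamma makes a single transcript cheaper than the exact two-transcript fit, which shows how the penalty trades data fit against the number of predicted transcripts.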
  • 46. MiTie’s Main Features. Uses a likelihood function L based on a probabilistic model for the read coverage. Uses combinatorial optimization to find transcripts that explain data from multiple RNA-seq libraries. Newly predicted transcripts are penalized (once). Can use already known/confirmed transcripts without penalty. Provides a p-value for each predicted transcript as a confidence measure for its presence. Log-likelihood ratio test: $T_t = -2 \log \frac{p(D \mid M)}{p(D \mid M_t)}$ [Behr et al., 2013]
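A log-likelihood ratio statistic of this kind is commonly turned into a p-value by comparing it with a chi-square distribution. The sketch below assumes two nested models that differ by one transcript and a chi-square reference distribution with one degree of freedom; the slide does not state which distribution MiTie uses or which of the two models includes transcript t, so both are labeled assumptions here, and the argument names `loglik_without` and `loglik_with` are hypothetical.

```python
import math


def lrt_p_value(loglik_without, loglik_with):
    """Likelihood ratio test for one extra transcript (assumed 1 dof).

    loglik_without: log-likelihood of the data under the model lacking t.
    loglik_with:    log-likelihood of the data under the model including t.
    """
    # Test statistic: -2 * log(likelihood ratio), clipped at zero.
    stat = max(0.0, -2.0 * (loglik_without - loglik_with))
    # Chi-square survival function with 1 dof: P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


stat, p = lrt_p_value(loglik_without=-1052.3, loglik_with=-1046.9)
print(round(stat, 2), p)
```

A small p-value indicates that adding the transcript improves the fit more than would be expected by chance under the assumed reference distribution.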
  • 47. MiTie Results. [Figure: F-score on transcript level versus number of samples. Panel A, human simulated data, 1 to 5 samples: MiTie + MMO, MiTie, Cufflinks + Cuffmerge, Cufflinks. Panel B, D. melanogaster modENCODE data, 1 to 7 samples: MiTie, Cufflinks + Cuffmerge.] [Behr et al., 2013]
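The transcript-level F-score in these plots combines sensitivity and precision into one number. A minimal sketch follows, assuming that a predicted transcript counts as correct only if it matches an annotated transcript exactly and that each transcript is represented as a tuple of intron coordinates; the matching criterion used in the evaluation by Behr et al. may differ, and the function name is hypothetical.

```python
def transcript_f_score(predicted, annotated):
    """Harmonic mean of precision and sensitivity on exact transcript matches."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)    # correct among predictions
    sensitivity = true_positives / len(annotated)  # recovered among annotations
    return 2 * precision * sensitivity / (precision + sensitivity)


# Each transcript is a tuple of (intron_start, intron_end) pairs.
annotated = [((100, 200),), ((100, 200), (300, 400)), ((100, 400),), ((150, 200),)]
predicted = [((100, 200),), ((100, 200), (300, 400)), ((120, 200),)]
print(round(transcript_f_score(predicted, annotated), 3))  # 0.571
```

Here two of three predictions are correct (precision 2/3) and two of four annotated transcripts are recovered (sensitivity 1/2), giving an F-score of 4/7.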
  • 48. Gene Finding vs. Transcript Assembly. [Figure: axis of gene expression level from low to high.] At low expression: mGene.ngs = only one transcript, low confidence for alternative transcripts. At high expression: MiTie = multiple transcripts, high confidence for alternative transcripts.
  • 49. Conclusions. Genome annotation pipeline: Transcript Skimmer identifies highly expressed genes for training; mGene.ngs predicts coding and non-coding transcripts; MiTie predicts alternative transcripts for highly expressed genes. The genome annotation pipeline requires only a genome sequence and RNA-seq alignments. Good for annotating new genomes or improving existing ones. Sources are free: http://bioweb.me/mgene and http://bioweb.me/mitie. Functionality partially available in Galaxy instance (http://galaxy.cbio.mskcc.org). Thank you!
  • 55. Just published [Sreedharan et al., 2014]. Check out: http://oqtans.org and http://galaxy.cbio.mskcc.org
  • 56. References I
    Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov Support Vector Machines. In Proc. 20th Int. Conf. Mach. Learn., pages 3–10, 2003.
    J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski, K. Schneeberger, D. Weigel, and G. Rätsch. RNA-seq and tiling arrays for improved gene finding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URL http://www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.
    Jonas Behr, André Kahles, Yi Zhong, Vipin T Sreedharan, Philipp Drewe, and Gunnar Rätsch. MiTie: Simultaneous RNA-seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–38, Oct 2013. doi: 10.1093/bioinformatics/btt442.
    RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu, G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Schölkopf, M Nordborg, G Rätsch, JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi: 10.1126/science.1138632.
    G. Rätsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. In K. Tsuda, B. Schölkopf, and J.-P. Vert, editors, Kernel Methods in Computational Biology. MIT Press, 2004.
    G. Rätsch and S. Sonnenburg. Large scale hidden semi-Markov SVMs. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS’06), volume 19, pages 1161–1168, Cambridge, MA, 2007. MIT Press. URL http://www.fml.tuebingen.mpg.de/raetsch/projects/HSMSVM.
    G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.
    Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich, Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Krüger, Sören Sonnenburg, and Gunnar Rätsch. mGene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Research, 2009. URL http://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html. Advance access June 29, 2009.
    S. Sonnenburg, G. Rätsch, A. Jagota, and K.-R. Müller. New methods for splice-site recognition. In Proc. International Conference on Artificial Neural Networks, 2002.
    Sören Sonnenburg, Alexander Zien, and Gunnar Rätsch. ARTS: Accurate Recognition of Transcription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.
  • 57. References II
    VT Sreedharan, SJ Schultheiss, G Jean, A Kahles, R Bohnert, P Drewe, P Mudrakarta, N Görnitz, G Zeller, and Gunnar Rätsch. Oqtans: The RNA-seq workbench in the cloud for complete and reproducible quantitative transcriptome analysis. Bioinformatics, 2014. Advance Access published January 11, 2014.
    G Zeller, RM Clark, K Schneeberger, A Bohlen, D Weigel, and G Rätsch. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.
    A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, and K.-R. Müller. Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites. Bioinformatics, 16(9):799–807, September 2000.