1. Statistics for K-mer Based
Splicing Event Analysis
Data Learner Miner Practitioner
Ruofei Du, Hao Li, Hui Miao, Shangfu Peng
2. Alternative Splicing Events
Image from: "Alternative Splicing Event" Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. 2 Apr. 2014.
<http://en.wikipedia.org/wiki/Alternative_splicing>
● Alternative splicing is used to describe
any case in which a primary transcript
can be spliced in more than one pattern
to generate multiple and distinct
mRNAs.
● 5 traditional basic modes; most
common: exon skipping.
● It is a widespread mechanism for
generating protein diversity and
regulating protein expression.
3. ● Improve
understanding of
cell
differentiation
and classify
disease types
Image from: Sammeth, Michael, Sylvain Foissac, and Roderic Guigó. "A General Definition and Nomenclature for Alternative Splicing
Events." PLoS Computational Biology 4.8 (2008): e1000147.
4. Alternative Splicing Events
● Different species tend to have different
splicing event patterns.
● Different splicing events can also indicate
abnormal cell activity, such as cancer
Image from: Sammeth, Michael, Sylvain Foissac, and Roderic Guigó. "A General Definition and Nomenclature for Alternative Splicing
Events." PLoS Computational Biology 4.8 (2008): e1000147.
5. Abundance Estimation for
Alternative Splicing Events
● Given RNA-Seq samples, estimate the abundance and
the relative proportion of every alternative transcription
path
Image from: Hu, Yin, et al. "DiffSplice: the genome-wide detection of differential splicing events with RNA-seq." Nucleic acids research 41.2 (2013): e39-e39.
6. Abundance Estimation for Isoforms
● The Standard Paradigm
o Read alignment step can be very computationally
intensive.
● Sailfish
o Far faster than the standard paradigm
o Replace the step of read mapping with the much
faster and simpler process of k-mer counting
Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms Rob Patro, Stephen M. Mount, and
Carl Kingsford. Manuscript Submitted (2013) http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
7. K-mer
● A fixed sized (K) sequence
Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight
Algorithms Rob Patro, Stephen M. Mount, and Carl Kingsford. Manuscript Submitted
(2013) http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
● A string of length N contains
N-K+1 k-mers
● One can build a K-mer index to
represent a string
1-mers: A C G T
2-mers: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
7-mer    ID  N
ATTCGAC  1   1
TTCGACA  2   1
TCGACAG  3   1
...
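The k-mer index on this slide can be sketched in a few lines of Python. This is only an illustration: the input string `ATTCGACAG` is an assumption chosen to reproduce the three table rows shown above (the original string is longer and elided), and the function names are hypothetical.

```python
from collections import Counter

def kmers(seq, k):
    """Return all k-mers of seq, left to right (N - k + 1 of them)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_index(seq, k):
    """Map each distinct k-mer to (ID, count); IDs follow first appearance, starting at 1."""
    all_kmers = kmers(seq, k)
    counts = Counter(all_kmers)
    index = {}
    for km in all_kmers:
        if km not in index:
            index[km] = (len(index) + 1, counts[km])
    return index

seq = "ATTCGACAG"          # assumed example string, N = 9
index = kmer_index(seq, 7)  # 9 - 7 + 1 = 3 seven-mers
for km, (kmer_id, n) in index.items():
    print(km, kmer_id, n)
```

Each row printed corresponds to one row of the 7-mer table: the k-mer, its ID, and the number of times it occurs in the string.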
8. Sailfish Workflow
● Indexing
o Build K-mer index for known
isoform transcripts
Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using
Lightweight Algorithms Rob Patro, Stephen M. Mount, and Carl Kingsford.
● Quantification
o Counts the number of times
each K-mer occurs in the
reads.
o Estimating abundances via an
EM algorithm
11. Our Proposal
● We propose to investigate a scalable statistical method
that uses k-mers and a k-mer index to estimate the abundance of
alternative splicing events.
● We will focus on the
most frequent event type:
Exon Skipping Event
o the method can be extended
naturally to other event types
Shen, Shihao, et al. "MATS: a Bayesian framework for Flexible Detection of Differential Alternative Splicing from RNA-Seq Data."
Nucleic Acids Research 40.8 (2012): e6
12. Initial Idea
● Variables for abundance:
o Class I: Each exon i
o Class II: Each exon-exon junction (non-spliced)
o Class III: Each spliced junction
● Build a k-mer index for a specific gene: e.g. A B C D E
o Exon A, B, C, D, E
o Non-spliced junction AB, BC, CD, DE
o Spliced junction AC, BD, CE
● On the reads side, aggregate k-mer counts as in Sailfish
● Use EM to do maximum likelihood estimation
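The three variable classes for the example gene A B C D E could be indexed roughly as follows. This is a sketch under stated assumptions, not the proposal's implementation: the exon sequences and function names are made up for illustration, every exon is assumed to be at least k-1 bases long, and only single-exon skips are enumerated for Class III.

```python
def exon_kmers(seq, k):
    """Class I: k-mers lying entirely inside one exon."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def junction_kmers(left, right, k):
    """k-mers that span the boundary between two joined exons."""
    if k < 2:
        raise ValueError("k must be at least 2 to span a junction")
    # Last k-1 bases of the left exon plus first k-1 bases of the right exon:
    # every k-mer of this window crosses the junction.
    window = left[-(k - 1):] + right[:k - 1]
    return {window[i:i + k] for i in range(len(window) - k + 1)}

def build_features(exons, k):
    """Return k-mer sets for Class I, II and III features of one gene."""
    names = list(exons)
    class1 = {n: exon_kmers(exons[n], k) for n in names}
    # Class II: adjacent exons (non-spliced junctions in the slide's terms).
    class2 = {a + b: junction_kmers(exons[a], exons[b], k)
              for a, b in zip(names, names[1:])}
    # Class III: junctions that skip exactly one exon.
    class3 = {a + c: junction_kmers(exons[a], exons[c], k)
              for a, c in zip(names, names[2:])}
    return class1, class2, class3

# Hypothetical exon sequences, for illustration only.
exons = {"A": "ACGTAC", "B": "GGATCC", "C": "TTGACA", "D": "CATGGA", "E": "AGCTTG"}
class1, class2, class3 = build_features(exons, 4)
print(list(class2))  # junctions between adjacent exons
print(list(class3))  # junctions that skip one exon
```

A k-mer can in principle belong to more than one feature (for example, a junction k-mer that also occurs inside an exon); handling such ambiguity is what the EM step is for.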
13. Advantage
● Does not require knowing the isoform space.
● Replace the step of read mapping, and provide a faster
approach for splicing event analysis.
15. Questions
1. What is the drawback of the straightforward method: get the Pi of each
isoform using EM first, and then calculate the frequency of events?
2. Why do we have to use EM? Why not solve equations?
3. Do we need to know the frequency of the five events?
Editor's Notes
Good morning everyone, we’re the Data Learner Miner Practitioner team. Today we’re going to talk about our project proposal: statistics for k-mer based splicing event analysis.
So what is alternative splicing, and what are alternative splicing events? Alternative splicing is a regulated process during gene expression that results in a single gene coding for multiple proteins. There are five traditional basic modes of alternative splicing events: exon skipping, mutually exclusive exons, alternative donor site, alternative acceptor site, and intron retention.
For the exon skipping case, an exon (as the yellow one in the figure) may be skipped from the primary transcript. This is the most common mode in mammalian pre-mRNAs.
Mutually exclusive exons: One of the two yellow exons is retained in mRNAs after splicing, but not both.
Alternative donor site: An alternative 5' splice junction (donor site) is used, changing the 3' boundary of the upstream exon.
Alternative acceptor site: An alternative 3' splice junction (acceptor site) is used, changing the 5' boundary of the downstream exon.
Intron retention: A subsequence in one exon may be spliced out as an intron or simply retained. This is distinguished from exon skipping because the retained sequence is not flanked by introns.
Alternative splicing is a widespread mechanism for generating protein diversity and regulating protein expression. The term alternative splicing is used in biology to describe any case in which a primary transcript can be spliced in more than one pattern to generate multiple, distinct mRNAs.
AS events are available for the following model organisms:
Caenorhabditis elegans
Danio rerio
Drosophila melanogaster
Homo sapiens
Mus musculus
Rattus norvegicus
So why are we interested in splicing events?
For one thing, different species tend to have different splicing event patterns. For example, for each of the 12 compared species, a pie diagram shows the distribution of splicing events across 5 structurally different classes. It’s clear from the figure that mammals have more exon skipping and complex events and fewer retained introns than invertebrates.
For another, different splicing events can also indicate abnormal cell activity, such as cancer. Splicing event analysis could improve our understanding of cell differentiation and help classify disease types.
Next, Hui will introduce abundance estimation for alternative splicing events.
Estimate the abundance and the relative proportion of every alternative transcription path. Subsequently, the estimators for the expression of each alternative splicing module (ASM) are propagated to derive an estimator for the overall gene expression.
Ambiguously mapped reads are shuffled around, usually with the goal of uniform coverage.
K-mers are robust to errors.
Longer k-mers may result in less ambiguity, but may be more affected by errors in the reads.
Shorter k-mers, though more ambiguous, may be more robust to errors in the reads.
Sailfish works in two phases: indexing and quantification
The most important data structure in the index is the minimal perfect hash function that maps each k-mer in the reference transcripts to an index between 0 and the number of different k-mers in the transcripts such that no two k-mers share an index. This allows us to quickly index and count any k-mer from the reads that also appears in the transcripts.
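As a rough illustration of what this index provides, the sketch below assigns each distinct transcript k-mer a unique integer in the range 0 to n-1 and uses it to count read k-mers. A plain Python dict stands in for the minimal perfect hash function, so this shows the mapping's contract rather than Sailfish's actual data structure. The function names and toy sequences are assumptions, and reverse complements are ignored.

```python
def build_kmer_ids(transcripts, k):
    """Give every distinct k-mer in the transcripts a unique ID in [0, n)."""
    ids = {}
    for seq in transcripts.values():
        for i in range(len(seq) - k + 1):
            # setdefault assigns the next free ID only to unseen k-mers.
            ids.setdefault(seq[i:i + k], len(ids))
    return ids

def count_read_kmers(reads, ids, k):
    """Count read k-mers that also appear in the transcripts; skip the rest."""
    counts = [0] * len(ids)
    for read in reads:
        for i in range(len(read) - k + 1):
            idx = ids.get(read[i:i + k])
            if idx is not None:
                counts[idx] += 1
    return counts

# Toy data, for illustration only.
transcripts = {"t1": "ACGTAC", "t2": "CGTACG"}
reads = ["ACGTA", "GTACC"]
ids = build_kmer_ids(transcripts, 3)
counts = count_read_kmers(reads, ids, 3)
print(ids)
print(counts)
```

The k-mer `ACC` from the second read does not occur in any transcript, so it is simply dropped; no alignment is attempted.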
Sailfish then applies an expectation maximization (EM) procedure to determine maximum likelihood estimates for the relative abundance of each transcript. This procedure is similar to the EM algorithm used by RSEM [5], except that k-mers rather than fragments are probabilistically assigned to transcripts.
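The idea of probabilistically assigning k-mers to transcripts can be sketched with a plain mixture-model EM. This is a simplified illustration and not Sailfish's exact update rule: each transcript is assumed to emit its k-mers in proportion to how often they occur in it, `theta` is the fraction of k-mers drawn from each transcript, and all names and toy data are hypothetical.

```python
from collections import Counter

def kmer_profile(seq, k):
    """P(k-mer | transcript): relative frequency of each k-mer in the transcript."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {km: c / total for km, c in counts.items()}

def em_abundance(transcripts, read_kmer_counts, k, iters=100):
    """Estimate the fraction of observed k-mers that come from each transcript."""
    profiles = {t: kmer_profile(s, k) for t, s in transcripts.items()}
    theta = {t: 1.0 / len(profiles) for t in profiles}  # uniform start
    for _ in range(iters):
        assigned = dict.fromkeys(profiles, 0.0)
        for km, c in read_kmer_counts.items():
            # E-step: split this k-mer's count across transcripts.
            weights = {t: theta[t] * profiles[t].get(km, 0.0) for t in profiles}
            z = sum(weights.values())
            if z == 0:
                continue  # k-mer not explained by any transcript
            for t, w in weights.items():
                assigned[t] += c * w / z
        total = sum(assigned.values())
        if total == 0:
            break
        # M-step: new mixture weights are the normalized assigned counts.
        theta = {t: a / total for t, a in assigned.items()}
    return theta

# Toy data: two transcripts with disjoint k-mers, so the answer is unambiguous.
transcripts = {"t1": "AAAAC", "t2": "GGGGT"}
read_kmer_counts = {"AAA": 6, "AAC": 3, "GGG": 2, "GGT": 1}
theta = em_abundance(transcripts, read_kmer_counts, 3)
print(theta)
```

With disjoint k-mers the estimate is just the share of counts per transcript; the E-step only matters once transcripts share k-mers, which is the usual case for isoforms of one gene.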
Counts the number of times each K-mer occurs in the reads.
Applies EM to determine maximum likelihood estimates for the abundance of each transcript
By working with k-mers, we can replace the computationally intensive step of read mapping with the much faster and simpler process of k-mer counting
We also avoid any dependence on read mapping parameters
Two k-mers are equivalent from the perspective of the EM algorithm if they occur in the same set of transcript sequences with the same rate
This reduction in the number of active variables substantially reduces the computational requirements of the EM procedure
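A minimal sketch of this grouping follows. It assumes that "rate" means the number of times a k-mer occurs in a transcript, which is an interpretation rather than something stated in these notes; the function name and toy transcripts are hypothetical.

```python
from collections import Counter, defaultdict

def equivalence_classes(transcripts, k):
    """Group k-mers that occur in the same transcripts with the same counts."""
    occurrences = defaultdict(dict)  # k-mer -> {transcript: count}
    for name, seq in transcripts.items():
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        for km, c in counts.items():
            occurrences[km][name] = c
    classes = defaultdict(list)
    for km, per_transcript in occurrences.items():
        # The signature is hashable and identical for equivalent k-mers.
        signature = tuple(sorted(per_transcript.items()))
        classes[signature].append(km)
    return dict(classes)

# Toy data, for illustration only.
transcripts = {"t1": "ACGTT", "t2": "ACGTA"}
classes = equivalence_classes(transcripts, 3)
for signature, members in classes.items():
    print(signature, members)
```

Here four distinct k-mers collapse into three classes, so the EM procedure would track three variables instead of four; on real transcriptomes the reduction is far larger.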
The basic idea is to focus on the frequency of exon-exon junctions. As in this picture, we name the exons exon 1, exon 2 and exon 3.
If we detect a 1-3 junction in the reads, we know that an exon skipping event has occurred. So our task is to estimate the frequency of each exon-exon junction.
Recall that Sailfish estimates mu_i for each isoform. Similarly, we introduce three classes of variables on the gene side to be estimated.
The first class is mu_i for each single exon.
The second and third classes are the mu for the exon-exon junctions: the second class covers non-spliced junctions and the third class covers spliced junctions.
For example, take the gene sequence ABCDE, where A, B, C, D and E are exons. The first class is the mu for A, B, C, D, E. The second class is the mu for AB. The third class is mu for
If we know all the mu values, the sum of the mu values in the third class is exactly the frequency of exon skipping events.
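As a tiny worked example of that sum, assuming the mu values have already been estimated (the numbers below are invented purely for illustration and are not results):

```python
# Hypothetical estimated mu values for the Class III (spliced) junctions
# of the gene A B C D E; real values would come from the EM step.
class3_mu = {"AC": 0.125, "BD": 0.0625, "CE": 0.0625}

# Exon skipping frequency = sum of mu over all Class III junctions.
exon_skipping_frequency = sum(class3_mu.values())
print(exon_skipping_frequency)
```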
To estimate mu, we build a k-mer index for each variable on the gene side. On the reads side, we aggregate k-mer counts as Sailfish does.
Resolving the isoform space is a hard task in biology.
Similar to Sailfish