Statistics for K-mer Based Splicing Analysis

Statistics for K-mer Based
Splicing Event Analysis
Data Learner Miner Practitioner
Ruofei Du, Hao Li, Hui Miao, Shangfu Peng

Alternative Splicing Events
Image from: "Alternative Splicing Event" Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. 2 Apr. 2014.
<http://en.wikipedia.org/wiki/Alternative_splicing>
● Alternative splicing is used to describe
any case in which a primary transcript
can be spliced in more than one pattern
to generate multiple and distinct
mRNAs.
● 5 traditional basic modes; most
common: exon skipping.
● It is a widespread mechanism for
generating protein diversity and
regulating protein expression.

● Improve
understanding of
cell
differentiation
and classify
disease types
Image from: Sammeth, Michael, Sylvain Foissac, and Roderic Guigó. "A General Definition and Nomenclature for Alternative Splicing
Events." PLoS Computational Biology 4.8 (2008): e1000147.

● Different species tend to have different
splicing event patterns.
● Different splicing events also indicates the
abnormal cells activities, such as cancer
Image from: Sammeth, Michael, Sylvain Foissac, and Roderic Guigó. "A General Definition and Nomenclature for Alternative Splicing
Events." PLoS Computational Biology 4.8 (2008): e1000147.

Abundance Estimation for
● Given RNA-Seq samples, estimate the abundance and
the relative proportion of every alternative transcription
path
Image from: Hu, Yin, et al. "DiffSplice: the genome-wide detection of differential splicing events with RNA-seq." Nucleic acids research 41.2 (2013): e39-e39.

Abundance Estimation for Isoforms
● The Standard Paradigm
o Read alignment step can be very computationally
intensive.
● Sailfish
o Far faster than the standard paradigm
o Replace the step of read mapping with the much
faster and simpler process of k-mer counting
Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms Rob Patro, Stephen M. Mount, and
Carl Kingsford. Manuscript Submitted (2013) http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf

K-mer
● A fixed sized (K) sequence
Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight
Algorithms Rob Patro, Stephen M. Mount, and Carl Kingsford. Manuscript Submitted
(2013) http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
A
C
G
T
AA AC AG AT
CA CC CG CT
GA GC GG GT
TA TC TG TT
● A string of length N contains
N-K+1 k-mers
● One can build K-mer index to
represent a string
7-mer iD N
ATTCGAC 1 1
TTCGACA 2 1
TCGACAG 3 1
...
1-mer 2-mer

Sailfish Workflow
● Indexing
o Build K-mer index for known
isoform transcripts
Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using
Lightweight Algorithms Rob Patro, Stephen M. Mount, and Carl Kingsford.
● Quantification
o Counts the number of times
each K-mer occurs in the
reads.
o Estimating abundances via an
EM algorithm

Sailfish Workflow: Indexing
● Perfect Hashing
http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
Domain(K-mer) Range([0,|D|-1])

Sailfish Workflow: Quantification
2.K-mer Allocation to
Transcripts
http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
1. Read Data K-mer Counting

Our Proposal
● We propose to investigate the scalable statistic method
using k-mer and k-mer index to estimate abundance of
alternative splicing events.
● We will focus on the
most frequent event type:
Exon Skipping Event
o other event types can
be extended naturally
Shen, Shihao, et al. "MATS: a Bayesian framework for Flexible Detection of Differential Alternative Splicing from RNA-Seq Data."
Nucleic Acids Research 40.8 (2012): e6
(1) (2) (3)

● Variables for abundance:
● Build k-mer index for a specific gene: e.g. A B C D E
● On reads part, aggregated k-mer counts like Sailfish
● Use EM to do maximum likelihood estimation
Class I: Each exon i
Class II: Each exon-exon junction (non-spliced)
Class III: Each spliced junction
Initial Idea
Exon A, B, C, D, E
Non-spliced junction AB, BC, CD, DE
Spliced junction AC, BD, CE

Advantage
● Do not require to know the Isoform space.
● Replace the step of read mapping, and provide a faster
approach for splicing event analysis.

Questions
1. The drawback of the straightforward method: get the Pi of each
Isoform using EM first, and then calculate the frequency of events.
2. Why we have to use EM, why not solve equations?
3. Require to know the frequency of the five events?

Statistics for K-mer Based Splicing Analysis

Recommended

Recommended

More Related Content

Similar to Statistics for K-mer Based Splicing Analysis

Similar to Statistics for K-mer Based Splicing Analysis (20)

More from Ruofei Du

More from Ruofei Du (12)

Recently uploaded

Recently uploaded (20)

Statistics for K-mer Based Splicing Analysis

Editor's Notes