Statistics for K-mer Based
Splicing Event Analysis
Data Learner Miner Practitioner
Ruofei Du, Hao Li, Hui Miao, Shangfu Peng
Alternative Splicing Events
Image from: "Alternative Splicing Event" Wikipedia: The Free Encyclopedia. Wikimedia Foundation, Inc. 2 Apr. 2014.
<http://en.wikipedia.org/wiki/Alternative_splicing>
● Alternative splicing is used to describe
any case in which a primary transcript
can be spliced in more than one pattern
to generate multiple and distinct
mRNAs.
● 5 traditional basic modes; most
common: exon skipping.
● It is a widespread mechanism for
generating protein diversity and
regulating protein expression.
● Splicing event analysis can improve our understanding of cell differentiation and help classify disease types
Image from: Sammeth, Michael, Sylvain Foissac, and Roderic Guigó. "A General Definition and Nomenclature for Alternative Splicing
Events." PLoS Computational Biology 4.8 (2008): e1000147.
Alternative Splicing Events
● Different species tend to have different
splicing event patterns.
● Different splicing events can also indicate abnormal cell activity, such as cancer
Image from: Sammeth, Michael, Sylvain Foissac, and Roderic Guigó. "A General Definition and Nomenclature for Alternative Splicing
Events." PLoS Computational Biology 4.8 (2008): e1000147.
Abundance Estimation for
Alternative Splicing Events
● Given RNA-Seq samples, estimate the abundance and
the relative proportion of every alternative transcription
path
Image from: Hu, Yin, et al. "DiffSplice: the genome-wide detection of differential splicing events with RNA-seq." Nucleic acids research 41.2 (2013): e39-e39.
Abundance Estimation for Isoforms
● The Standard Paradigm
o The read alignment step can be very computationally intensive.
● Sailfish
o Far faster than the standard paradigm
o Replaces the read mapping step with the much faster and simpler process of k-mer counting
Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms." Manuscript submitted (2013). http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
K-mer
● A sequence of fixed length K
● 1-mers: A, C, G, T
● 2-mers: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT
● A string of length N contains N-K+1 k-mers
● One can build a k-mer index to represent a string

7-mer     ID   N
ATTCGAC   1    1
TTCGACA   2    1
TCGACAG   3    1
...

Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms." Manuscript submitted (2013). http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
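The k-mer definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `kmers` and `build_kmer_index` are made up for the example, the input string is chosen so that it reproduces the three 7-mers listed in the table, and the table's last column is assumed here to be an occurrence count.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order (N-K+1 of them)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_index(seq, k):
    """Map each distinct k-mer to (id, count).

    Ids start at 1 and follow the order of first appearance.
    Treating the second value as an occurrence count is an assumption.
    """
    all_kmers = kmers(seq, k)
    counts = Counter(all_kmers)
    index = {}
    for kmer in all_kmers:
        if kmer not in index:
            index[kmer] = (len(index) + 1, counts[kmer])
    return index


seq = "ATTCGACAG"          # hypothetical string, length N = 9
index = build_kmer_index(seq, 7)
for kmer, (kmer_id, count) in index.items():
    print(kmer, kmer_id, count)
```

With N = 9 and K = 7 the string yields 9 - 7 + 1 = 3 k-mers, matching the count formula above.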
Sailfish Workflow
● Indexing
o Build a k-mer index for the known isoform transcripts
● Quantification
o Count the number of times each k-mer occurs in the reads
o Estimate abundances with an EM algorithm
Patro, Rob, Stephen M. Mount, and Carl Kingsford. "Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms."
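The counting part of the quantification phase might look like the following minimal sketch. It is not Sailfish's implementation: `count_indexed_kmers` is a hypothetical helper, the transcript and reads are toy data, and the sketch ignores reverse complements and sequencing errors.

```python
from collections import Counter


def count_indexed_kmers(reads, indexed_kmers, k):
    """Count occurrences in the reads of k-mers that appear in the index.

    k-mers that are not in the transcript index are skipped.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in indexed_kmers:
                counts[kmer] += 1
    return counts


k = 7
transcript = "ATTCGACAG"                     # toy transcript
indexed = {transcript[i:i + k] for i in range(len(transcript) - k + 1)}
reads = ["ATTCGACA", "TTCGACAG", "GGGGGGGG"]  # toy reads
counts = count_indexed_kmers(reads, indexed, k)
print(dict(counts))
```

The third read contributes nothing because none of its k-mers occur in the transcript.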
Sailfish Workflow: Indexing
● Perfect Hashing
o Domain: the set D of k-mers
o Range: the integers [0, |D|-1]
http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
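To illustrate the mapping from the k-mer domain D onto [0, |D|-1], here is a small sketch that assigns each distinct k-mer a unique integer. Note the assumption: it uses an ordinary Python dictionary as a stand-in and does not construct a true minimal perfect hash function; `build_kmer_ids` and the toy transcripts are hypothetical.

```python
def build_kmer_ids(transcripts, k):
    """Assign every distinct k-mer in the transcripts a unique id in [0, |D|-1].

    A dict stands in for the perfect hash; no two k-mers share an id.
    """
    ids = {}
    for transcript in transcripts:
        for i in range(len(transcript) - k + 1):
            kmer = transcript[i:i + k]
            if kmer not in ids:
                ids[kmer] = len(ids)
    return ids


transcripts = ["ATTCGACAG", "TTCGACAGT"]  # toy transcripts sharing two 7-mers
ids = build_kmer_ids(transcripts, 7)

# Because ids are dense, counts can live in a flat array indexed by id.
counts = [0] * len(ids)
counts[ids["TTCGACA"]] += 1
print(ids, counts)
```

The dense id range is what makes a flat count array possible: one slot per distinct k-mer, with no gaps.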
Sailfish Workflow: Quantification
1. Read Data K-mer Counting
2. K-mer Allocation to Transcripts
http://www.cs.cmu.edu/~ckingsf/class/02714-f13/Lec05-sailfish.pdf
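The allocation step can be illustrated with a simplified EM loop that splits each k-mer's count among the transcripts containing it. This is a sketch under stated assumptions, not Sailfish's actual procedure: `em_allocate` is a hypothetical function, `theta` is the estimated fraction of k-mers originating from each transcript, and each transcript is assumed to emit its k-mers uniformly.

```python
from collections import Counter


def em_allocate(kmer_counts, transcript_kmers, n_iter=100):
    """Estimate per-transcript k-mer fractions by EM (simplified sketch).

    kmer_counts: observed count of each k-mer in the reads.
    transcript_kmers: transcript name -> Counter of its k-mers.
    """
    names = list(transcript_kmers)
    n_kmers = {t: sum(transcript_kmers[t].values()) for t in names}
    theta = {t: 1.0 / len(names) for t in names}
    for _ in range(n_iter):
        assigned = {t: 0.0 for t in names}
        # E-step: share each k-mer's count among compatible transcripts.
        for kmer, count in kmer_counts.items():
            weights = {
                t: theta[t] * transcript_kmers[t][kmer] / n_kmers[t]
                if n_kmers[t] else 0.0
                for t in names
            }
            z = sum(weights.values())
            if z == 0:
                continue  # k-mer not explained by any transcript
            for t in names:
                assigned[t] += count * weights[t] / z
        # M-step: renormalise the expected counts.
        total = sum(assigned.values())
        if total == 0:
            break
        theta = {t: assigned[t] / total for t in names}
    return theta


# Toy example: the k-mer "CG" is shared by both transcripts.
transcript_kmers = {"T1": Counter(["AC", "CG"]), "T2": Counter(["CG", "GT"])}
kmer_counts = {"AC": 30, "CG": 40, "GT": 10}
theta = em_allocate(kmer_counts, transcript_kmers)
print(theta)
```

In this toy case the unique k-mers (AC for T1, GT for T2) determine how the 40 shared counts are divided, and the estimate settles at roughly 0.75 for T1 and 0.25 for T2.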
Our Proposal
● We propose to investigate a scalable statistical method that uses k-mers and a k-mer index to estimate the abundance of alternative splicing events.
● We will focus on the most frequent event type: exon skipping
o The approach extends naturally to other event types
Shen, Shihao, et al. "MATS: A Bayesian Framework for Flexible Detection of Differential Alternative Splicing from RNA-Seq Data." Nucleic Acids Research 40.8 (2012): e61.
Initial Idea
● Build a k-mer index for a specific gene, e.g. one with exons A, B, C, D, E
● On the reads side, aggregate k-mer counts as Sailfish does
● Use EM for maximum likelihood estimation
● Variables for abundance:
o Class I: each exon i (A, B, C, D, E)
o Class II: each exon-exon junction with no exon skipped, "non-spliced" (AB, BC, CD, DE)
o Class III: each junction that skips an exon, "spliced" (AC, BD, CE)
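One way to picture the three classes is to enumerate their k-mers for a toy gene. The sketch below is only an illustration of the idea: the exon sequences are invented, `junction_kmers` and `build_classes` are hypothetical helpers, and Class III is limited to junctions that skip exactly one exon, as in the AC, BD, CE example.

```python
def kmers(seq, k):
    """All overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def junction_kmers(left, right, k):
    """k-mers that span the boundary between two joined exon sequences."""
    # Keep at most k-1 bases on each side so every k-mer crosses the boundary.
    joined = left[max(0, len(left) - (k - 1)):] + right[:k - 1]
    return kmers(joined, k)


def build_classes(exons, k):
    """Return (Class I, Class II, Class III) k-mer sets keyed by name."""
    names = list(exons)
    class1 = {n: kmers(exons[n], k) for n in names}
    class2 = {
        names[i] + names[i + 1]:
            junction_kmers(exons[names[i]], exons[names[i + 1]], k)
        for i in range(len(names) - 1)
    }
    class3 = {
        names[i] + names[i + 2]:
            junction_kmers(exons[names[i]], exons[names[i + 2]], k)
        for i in range(len(names) - 2)
    }
    return class1, class2, class3


# Invented exon sequences for a gene with exons A..E.
exons = {"A": "ACGT", "B": "GGCA", "C": "TTAG", "D": "CCAT", "E": "GATC"}
class1, class2, class3 = build_classes(exons, 3)
print(list(class2), list(class3))
```

Counts of Class III k-mers in the reads then serve as direct evidence for exon skipping, without enumerating full isoforms. Real exons are much longer than k, so the short toy sequences here only demonstrate the bookkeeping.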
Advantage
● Does not require knowledge of the isoform space.
● Replaces the read mapping step, providing a faster approach to splicing event analysis.
Thank you
Questions
1. What is the drawback of the straightforward method: first get the Pi of each isoform using EM, then calculate the frequency of events?
2. Why do we have to use EM? Why not solve the equations directly?
3. Do we need to know the frequency of the five event types?


Editor's Notes

  1. Good morning everyone, we’re data learner miner practitioner team. Today we’re going to talk about our project proposal: statistics for k-mer based splicing event analysis.
  2. So what are alternative splicing and what are alternative splicing events? Alternative splicing is a regulated process during gene expression that results in a single gene coding for multiple proteins. There are five traditional basic modes of alternative splicing events: Exon skipping, Mutually exclusive exons, Alternative donor site, Alternative acceptor site, and Intron retention. For the exon skipping case, an exon (as the yellow one in the figure) may be skipped from the primary transcript. This is the most common mode in mammalian pre-mRNAs. Mutually exclusive exons: One of the two yellow exons is retained in mRNAs after splicing, but not both. Alternative donor site: An alternative 5' splice junction (donor site) is used, changing the 3' boundary of the upstream exon. Alternative acceptor site: An alternative 3' splice junction (acceptor site) is used, changing the 5' boundary of the downstream exon. Intron retention: A subsequence in one exon may be spliced out as an intron or simply retained. This is distinguished from exon skipping because the retained sequence is not flanked by introns. Alternative splicing is a widespread mechanism for generating protein diversity and regulating protein expression. The term alternative splicing is used in biology to describe any case in which a primary transcript can be spliced in more than one pattern to generate multiple, distinct mRNAs. AS events are available for the following model organisms: Caenorhabditis elegans Danio rerio Drosophila melanogaster Homo sapiens Mus musculus Rattus norvegicus
  3. So why are we interested in splicing events? For one thing, different species tend to have different splicing event patterns. For example, for each of the 12 compared species, a pie diagram shows the distribution of splicing events across 5 structurally different classes. It’s clear from the figure that mammals have more exon skipping and complex events and fewer retained introns than invertebrates. For another, different splicing events also indicate abnormal cell activities, such as cancer.
  4. Splicing event analysis could improve our understanding of cell differentiation and help classify disease types. Next, Hui will introduce abundance estimation for alternative splicing events.
  5. The goal is to estimate the abundance and the relative proportion of every alternative transcription path. Subsequently, the estimators for the expression of each ASM (alternative splicing module) are propagated to derive an estimator for the overall gene expression.
  6. Ambiguously mapped reads are shuffled around, usually with the goal of uniform coverage.
  7. K-mers are robust to errors. Longer k-mers may result in less ambiguity but may be more affected by errors in the reads. Shorter k-mers, though more ambiguous, may be more robust to errors in the reads.
  8. Sailfish works in two phases: indexing and quantification. The most important data structure in the index is the minimal perfect hash function, which maps each k-mer in the reference transcripts to an index between 0 and the number of different k-mers in the transcripts, such that no two k-mers share an index. This allows us to quickly index and count any k-mer from the reads that also appears in the transcripts. Sailfish then applies an expectation maximization (EM) procedure to determine maximum likelihood estimates for the relative abundance of each transcript. This procedure is similar to the EM algorithm used by RSEM [5], except that k-mers rather than fragments are probabilistically assigned to transcripts.
  9. Sailfish works in two phases: indexing and quantification. The most important data structure in the index is the minimal perfect hash function, which maps each k-mer in the reference transcripts to an index between 0 and the number of different k-mers in the transcripts, such that no two k-mers share an index. This allows us to quickly index and count any k-mer from the reads that also appears in the transcripts. Sailfish then applies an expectation maximization (EM) procedure to determine maximum likelihood estimates for the relative abundance of each transcript. This procedure is similar to the EM algorithm used by RSEM [5], except that k-mers rather than fragments are probabilistically assigned to transcripts.
  10. Sailfish counts the number of times each k-mer occurs in the reads, then applies EM to determine maximum likelihood estimates for the abundance of each transcript. By working with k-mers, we can replace the computationally intensive step of read mapping with the much faster and simpler process of k-mer counting. We also avoid any dependence on read mapping parameters. Two k-mers are equivalent from the perspective of the EM algorithm if they occur in the same set of transcript sequences at the same rate. Grouping equivalent k-mers reduces the number of active variables, which substantially reduces the computational requirements of the EM procedure.
  11. The basic idea is to focus on the frequency of exon-exon junctions. In this picture, we name the exons exon 1, exon 2 and exon 3. If we detect a 1-3 junction in the reads, we know that one exon skipping event has occurred. So our task is to estimate the frequency of each exon-exon junction.
  12. Recall that Sailfish estimates mu_i for each isoform. Similarly, here we introduce three classes of variables on the gene side to be estimated. The first class is mu_i for each single exon. The second and third classes are the mu for all exon-exon junctions: the second class is for non-spliced junctions and the third class is for spliced junctions. For example, take the gene sequence ABCDE, where A, B, C, D and E are exons. The first class is the mu for A, B, C, D, E. The second class is the mu for AB. The third class is the mu for the spliced junctions. If we know all the mu values, the sum of the mu of the third class is exactly the frequency of exon skipping events. To estimate mu, we build a k-mer index for each variable on the gene side. On the reads side, we aggregate k-mer counts as Sailfish does.
  13. which is a hard task in biology. Similar to Sailfish.
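
The three classes of variables from note 12 can be sketched for the example gene ABCDE as follows. This is a minimal illustration, not the team's implementation: the exon sequences, the value of K and the function names are hypothetical. It also assumes that a "non-spliced junction" joins two neighbouring exons (such as AB) and that a "spliced junction" joins two exons with at least one exon skipped in between (like the 1-3 junction in note 11). For each junction, only the k-mers that span the exon boundary are indexed, since only those are evidence for the junction itself.

```python
from itertools import combinations


def junction_kmers(upstream, downstream, k):
    """k-mers that span the boundary between two joined exons."""
    if k < 2 or min(len(upstream), len(downstream)) < k - 1:
        raise ValueError("need k >= 2 and exons of at least k - 1 bases")
    # Last k-1 bases of the upstream exon plus first k-1 of the downstream exon.
    window = upstream[-(k - 1):] + downstream[:k - 1]
    return [window[i:i + k] for i in range(len(window) - k + 1)]


def build_variables(exons, k):
    """Build a k-mer list per variable; exons is a list of (name, sequence) in gene order."""
    variables = {"exon": {}, "non_spliced_junction": {}, "spliced_junction": {}}
    # Class 1: one variable per single exon.
    for name, seq in exons:
        variables["exon"][name] = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Classes 2 and 3: one variable per ordered pair of exons.
    for (i, (n1, s1)), (j, (n2, s2)) in combinations(enumerate(exons), 2):
        cls = "non_spliced_junction" if j == i + 1 else "spliced_junction"
        variables[cls][n1 + n2] = junction_kmers(s1, s2, k)
    return variables


# Hypothetical exon sequences for the gene ABCDE.
gene = [
    ("A", "ACGTTGCA"),
    ("B", "GGATCCTT"),
    ("C", "TTCAGGCT"),
    ("D", "CATGCATA"),
    ("E", "GTTACCGA"),
]
K = 4
variables = build_variables(gene, K)

for cls, members in variables.items():
    print(cls, sorted(members))
```

Once a mu has been estimated for every variable, the exon skipping frequency in note 12 would be the sum of the mu values over the `spliced_junction` class.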