Introducing
MMseqs
MARTIN STEINEGGER
GENE CENTER MUNICH
Motivation
Sequence genome, gene prediction, search reads against UniProt, map to protein / organism
Reads: 7 lanes × 200M reads ≈ 7 × 200M = 1.4×10^9 sequences of 50 amino acids
UniProt: 5×10^7 protein sequences
BLAST: ~40 000 days (16 cores)
MMseqs: ~40 days (16 cores)
Growth of the UniProtKB/TrEMBL Protein Sequence Database
Result Protein Search
Method        Build & read index   Search time   Speed-up factor
MMseqs s=4    1h 17m               6m            950x
MMseqs s=7    1h 17m               11m           518x
swipe         36m                  2d 5h 34m     1.8x
BLAST         36m                  3d 23h 01m    1x
ublast        1h 52m               46m           127x
RAPsearch     2h 11m               10h 56m       9.5x
Search: 7 616 proteins against UniProt (54 790 250 sequences)
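The speed-up column can be checked from the search times alone. The sketch below is a minimal illustration, assuming the speed-up factor is the BLAST search time divided by each tool's search time (index build time excluded); the helper names `parse_duration` and `speed_up` are hypothetical and not part of any tool.

```python
import re

# Seconds per unit used in the table's duration strings.
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

def parse_duration(text):
    """Convert a duration such as '3d 23h 01m' to seconds."""
    return sum(int(value) * UNIT_SECONDS[unit]
               for value, unit in re.findall(r"(\d+)\s*([dhms])", text))

def speed_up(search_time, baseline="3d 23h 01m"):
    """Ratio of the baseline (BLAST) search time to another search time."""
    return parse_duration(baseline) / parse_duration(search_time)

for tool, search_time in [("MMseqs s=4", "6m"),
                          ("MMseqs s=7", "11m"),
                          ("swipe", "2d 5h 34m")]:
    print(f"{tool}: {speed_up(search_time):.1f}x")
```

Under this assumption the MMseqs and swipe rows come out at roughly 950x, 518x and 1.8x.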
ROC5
Hit lists per query (hits marked TP or FP):
query 1: db 50, db 48
query 2: db 55, db 43
query 3: db 65, db 63, db 62, db 59, db 56
query 4: db 100, db 99
ROC5 value per query: query 4: 0.2; query 1: 0.4; query 3: 0.6; query 2: 1.0
[Plot: fraction of queries (.25, .50, .75, 1.00) vs. ROC5 (.2, .4, .6, 1.0); AUC 0.6]
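As a rough illustration of the ROC5 benchmark, the sketch below computes a ROC_n value from a ranked list of TP/FP labels and then summarizes per-query values as a fraction-of-queries curve. It assumes the common definition of ROC_n (area under the ROC curve up to the n-th false positive, normalized to [0, 1]); the function names and the label lists in the usage lines are hypothetical, since the TP/FP assignment of the hits on the slide is not recoverable.

```python
def roc_n(labels, total_tp, n=5):
    """ROC_n for a ranked hit list (best hit first).

    labels:   True for a true positive, False for a false positive.
    total_tp: number of true positives that could be found for this query.
    """
    if total_tp == 0:
        return 0.0
    tp_seen = fp_seen = area = 0
    for is_tp in labels:
        if is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen          # TPs ranked above this false positive
            if fp_seen == n:
                break
    # If the list ends before the n-th FP, the missing FPs rank below all hits.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_tp)

def fraction_of_queries_at_least(roc_values, threshold):
    """One point of the cumulative curve: share of queries with ROC_n >= threshold."""
    return sum(v >= threshold for v in roc_values) / len(roc_values)

def area_under_fraction_curve(roc_values):
    """The area under the cumulative curve equals the mean ROC_n value."""
    return sum(roc_values) / len(roc_values)

values = [0.2, 0.4, 0.6, 1.0]             # per-query ROC5 values from the slide
print(fraction_of_queries_at_least(values, 0.6))
print(area_under_fraction_curve(values))
```

For the four per-query values on the slide the mean is 0.55, which is close to the AUC of 0.6 shown in the plot.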
Result Protein Search
[Plot: fraction of queries vs. ROC5 for each method, labelled by speed-up factor: 950x, 518x, 127x, 9.5x, 1.8x, 1x]
Search: SCOP25 (7 616 sequences) against UniProtKB (283 406) + SCOP25 (7 616)
- true positive: same SCOP superfamily
- false positive: different SCOP fold
- ignored: same fold, different superfamily
Workflow Protein Search
Prefiltering: search space 10^8 × 10^8, ~7 days for UniProt (5.4×10^7 sequences)
Alignment: search space 10^8 × 10^2, ~2 days
[Diagram: Query 1 ... Query n are matched against the Database; prefiltering keeps candidate hits per query, alignment produces the final hit list]
Prefiltering result, query 1: db 5: (123), db 23: (68), db 2: (32), ...
Alignment result, query 1: hit 1: 123, hit 2: 68, hit 3: 32; ... query n: ...
Filtering Sequences with k-mers
[Figure: k-mer matches between Sequence 1 and Sequence 2, shown for homologous proteins and for unrelated proteins]
2014/5/8
[Figure panels: exact matches of length 3; similar matches of length 6]
Filtering Sequences with k-mers
Exact 3-mer matches vs. similar 6-mer matches
Information is power: k-mers as long as possible.
But we need inexact matches to keep sensitivity high.

k-mer setting         Prob. of chance k-mer match   Prob. of homologous match at 25% sequence identity
3-mer, exact          1.2×10^-3                     1/64
5-mer, exact          3×10^-7                       1/1024
5-mer, 25 similar     7.5×10^-6                     1/40
6-mer, 100 similar    1.5×10^-6                     1/40
7-mer, 400 similar    3×10^-7                       1/40

Prob. of chance k-mer match: keep low for high speed!
Prob. of homologous match: keep high for high sensitivity!
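Some of these probabilities can be approximated with a very simple model. The sketch below is an illustration only, assuming uniform amino acid frequencies over a 20-letter alphabet and independent positions; `chance_match_prob` and `exact_homolog_prob` are hypothetical helper names. It gives the order of magnitude of the chance-match values for the 5-, 6- and 7-mer rows and the 1/64 and 1/1024 entries for exact matches, but it does not model the 1/40 values for similar k-mers.

```python
def chance_match_prob(k, n_similar=1, alphabet_size=20):
    """Probability that a random k-mer equals one of n_similar accepted k-mers,
    assuming uniform, independent residues (a simplifying assumption)."""
    return n_similar / alphabet_size ** k

def exact_homolog_prob(k, identity=0.25):
    """Probability that k consecutive aligned residues are all identical
    when each position is identical with probability `identity`."""
    return identity ** k

for k, n_similar in [(5, 1), (5, 25), (6, 100), (7, 400)]:
    print(f"{k}-mer, {n_similar} similar: {chance_match_prob(k, n_similar):.1e}")

print(exact_homolog_prob(3))   # 1/64
print(exact_homolog_prob(5))   # 1/1024
```

The trade-off on the slide follows directly: longer k-mers lower the chance-match probability, and accepting more similar k-mers raises it again in exchange for sensitivity.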
Prefiltering Algorithm
Most critical part of MMseqs regarding speed and memory consumption.
Calculates similarity scores on multiple CPUs.
Computationally intensive parts are vectorized.
[Diagram: Query Set and Database Set]
Query: . . L G T M H W V R Q A . .
List of k-mers (query match): MHWVRQ 42, MHWVKQ 34, MHWVRE 34, ...
Index table of database: k-mers AAAAAA, AAAAAR, ..., MHWVRE, ..., XXXXXX; entry MHWVRE lists Seq. Ids 5351, 43314, 2314, ...
Sum of scores by Db. Seq. Idx.: 1: 23 ... 2314: 11+34 ... 5351: 42+34 ... 43314: 12+34
Result of query 1: db 5351: (123), db 2314: (68), db 2: (62)
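The index-table lookup described on this slide can be sketched in a few lines. This is a toy illustration, not the MMseqs implementation: it assumes a hypothetical scoring function `toy_score` (+4 for identity, -1 otherwise) in place of a real substitution matrix, only generates single-substitution neighbours, and uses made-up sequences and thresholds.

```python
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(a, b):
    """Hypothetical substitution score: +4 for identity, -1 otherwise."""
    return 4 if a == b else -1

def build_index(db, k):
    """Index table: map each k-mer to the ids of database sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index

def similar_kmers(kmer, threshold):
    """Yield (k-mer, score) for the query k-mer and its single-substitution
    neighbours whose score against the query k-mer reaches the threshold."""
    exact = sum(toy_score(c, c) for c in kmer)
    if exact >= threshold:
        yield kmer, exact
    for pos, original in enumerate(kmer):
        for residue in ALPHABET:
            if residue == original:
                continue
            score = exact - toy_score(original, original) + toy_score(original, residue)
            if score >= threshold:
                yield kmer[:pos] + residue + kmer[pos + 1:], score

def prefilter(query, index, k, threshold):
    """Sum the scores of all similar k-mer matches per database sequence."""
    scores = defaultdict(int)
    for i in range(len(query) - k + 1):
        for neighbour, score in similar_kmers(query[i:i + k], threshold):
            for seq_id in index.get(neighbour, ()):
                scores[seq_id] += score
    return sorted(scores.items(), key=lambda item: -item[1])

db = {1: "MHWVRE", 2: "MHWVKQ", 3: "AAAAAA"}
print(prefilter("MHWVRQ", build_index(db, 6), k=6, threshold=19))
```

Because the database is indexed once, each query k-mer costs only a table lookup per similar k-mer, which is why the list of similar k-mers has to stay short for speed.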
Z-scores correct for background k-mer matches
[…]: summed k-mer match score of query with target protein
[…], with […] from calibration run
[…]: expected score from background matches
# expected chance k-mer matches
Poisson-distributed matches
Fast Smith-Waterman alignment using SSE2
Uses Michael Farrar's version of the Smith-Waterman algorithm to align prefiltering outputs.
[Diagram: Prefiltering Result (Query 1 ... Query n, Hit 1 ...) is turned into the Alignment Result]
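For reference, the recurrence that the vectorized aligner evaluates can be written as a plain scalar loop. The sketch below is a textbook Smith-Waterman score computation with a linear gap penalty, not Farrar's striped SSE2 version; the match, mismatch and gap values are arbitrary illustration parameters.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b (scalar, linear gaps)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]                     # first column is always 0
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

print(smith_waterman_score("MHWVRQ", "MHWVKQ"))
```

The SIMD version computes the same cell values but processes many cells of a column per instruction, which is where the speed-up over this loop comes from.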
Multi-core parallelization over query sequences
Thread-level parallelization with OpenMP.
Splits the query database into packages and matches them against the database set.
[Diagram: on one Node, the query set is split into four packages (query seqs. 0 - 25.000, 25.001 - 50.000, 50.001 - 75.000, 75.001 - 100.000), one per core (Core 1 to Core 4); every core searches the shared Database Set and writes its own results]
Example results:
query k1: db 12: 103, db 71: 58, db 92: 52, ...
query k2: db 15: 152, db 23: 88, db 24: 32, ...
query k3: db 5: 123, db 23: 68, db 2: 32, ...
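The package-splitting idea can be illustrated with a thread pool. The sketch below is a Python analogue, not the OpenMP code: it assumes a toy similarity measure (number of shared 3-mers) in place of the real prefilter, and all function names and sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_packages(items, n_packages):
    """Split a list into at most n_packages contiguous chunks of similar size."""
    size = max(1, -(-len(items) // n_packages))   # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def shared_kmers(a, b, k=3):
    """Toy similarity: number of k-mer positions in b whose k-mer occurs in a."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return sum(1 for i in range(len(b) - k + 1) if b[i:i + k] in kmers)

def search_package(package, database):
    """Search every query of one package against the whole database."""
    results = {}
    for query_id, query in package:
        hits = [(db_id, shared_kmers(query, seq)) for db_id, seq in database.items()]
        results[query_id] = sorted((h for h in hits if h[1] > 0), key=lambda h: -h[1])
    return results

def parallel_search(queries, database, n_workers=4):
    """Each worker handles one query package; the database is shared read-only."""
    packages = split_into_packages(queries, n_workers)
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(lambda p: search_package(p, database), packages):
            merged.update(partial)
    return merged

queries = [("q1", "MHWVRQ"), ("q2", "AAAAAA"), ("q3", "GLTRET")]
database = {"d1": "MHWVRE", "d2": "AAAAAR", "d3": "GLTRETVSR"}
print(parallel_search(queries, database))
```

Splitting over queries needs no synchronization between workers because each query's result list is written by exactly one worker.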
Multi-node parallelization over database sequences
From top to bottom:
1. Message Passing Interface
2. Thread Level Parallelism
3. Data Level Parallelism
[Diagram: the database is split across nodes (Node 1: DB seq. 0 - 100.000; Node 2: DB seq. 100.001 - 200.000; Node 3: DB seq. 200.001 - 300.000); each node searches the queries against its part, and the results are aggregated]
Why Sequence Clustering
[Figure: sequences (e.g. GLTRETVSR) grouped into clusters]
Workflow of MMseqs
Prefiltering -> Alignment -> Clustering
[Diagram: Query 1 ... Query n are matched against the Database; hits per query are aligned and then clustered]
Clustering with greedy set cover
Linear time and space greedy set cover algorithm to cluster results.
[Diagram: Query Set and Database Set -> Alignment Result -> Clustering Result]
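The greedy set cover idea can be sketched as follows: repeatedly pick the sequence whose alignment hits cover the most still-unassigned sequences and make it a cluster representative. This is a deliberately simple version for illustration; it rescans all candidates in every round and is therefore not the linear-time variant mentioned on the slide. The input format (a dict from sequence id to the set of ids it aligns to) is an assumption.

```python
def greedy_set_cover(neighbours):
    """Cluster sequences greedily.

    neighbours: dict mapping each sequence id to the set of ids it aligns to.
    Returns a dict mapping each representative to the members of its cluster.
    """
    unassigned = set(neighbours)
    clusters = {}
    while unassigned:
        # Representative = sequence covering the most unassigned sequences
        # (ties are broken by sorted id order to keep the result deterministic).
        rep = max(sorted(unassigned),
                  key=lambda s: len((neighbours[s] | {s}) & unassigned))
        members = (neighbours[rep] | {rep}) & unassigned
        clusters[rep] = members
        unassigned -= members
    return clusters

alignment_hits = {
    "a": {"b", "c"}, "b": {"a"}, "c": {"a"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
print(greedy_set_cover(alignment_hits))
```

Picking the largest set first tends to give fewer, larger clusters than assigning each sequence to the first hit it matches.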
Cascaded Clustering
[Diagram: the data to cluster passes through repeated rounds of Prefiltering -> Alignment -> Clustering at 90%, 50% and 20% sequence identity; axes: Speed, Sensitivity]
Updating
We created an updating mechanism that is able to detect changes and update the current database.
We also guarantee stable cluster identifiers.
[Diagram: sequences are divided into new, old and deleted sequences; Update = Old Result + new against new + new against old]
Updating: N × ΔN
Reclustering: N × N
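A small sketch of the bookkeeping behind an update, under stated assumptions: sequences are identified by stable ids, changes are detected by comparing id sets, and the cost is counted as the number of sequence comparisons from the two formulas above. The function names and the example numbers are hypothetical.

```python
def diff_databases(old_ids, new_ids):
    """Classify sequence ids as new, kept (old) or deleted between two releases."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    return {
        "new": new_ids - old_ids,
        "kept": old_ids & new_ids,
        "deleted": old_ids - new_ids,
    }

def comparison_counts(n_total, n_new):
    """Comparisons needed for an update (N × ΔN) versus reclustering (N × N)."""
    return {"update": n_total * n_new, "recluster": n_total * n_total}

changes = diff_databases(old_ids=["s1", "s2", "s3"], new_ids=["s2", "s3", "s4", "s5"])
print(changes)

# Hypothetical release: 5.4e7 sequences in total, 1e6 of them new.
counts = comparison_counts(54_000_000, 1_000_000)
print(counts["recluster"] / counts["update"])   # saving factor N / ΔN
```

The saving factor is simply N / ΔN, so updating pays off most when each release adds only a small fraction of new sequences.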
Clustering Results
Method                    Clusters   Corrupted clusters   Seq. per cluster   Time
MMseqs s=4 naive clust    85 780     3.4                  3.4                4m 03s
MMseqs s=4 set cover      60 915     1                    4.7                4m 02s
MMseqs cascaded s=4       41 173     3                    7.0                3m 35s
MMseqs s=7                29 801     2                    9.7                9m 26s
MMseqs cascaded s=7       22 541     1                    12.9               5m 07s
blastclust                21 890     1                    13.3               7h 25m 01s
CD-HIT                    114 386    260                  2.5                1h 25m 01s
kClust                    91 681     1                    3.2                9m 57s
Usearch                   157 981    11                   1.8                45s
Clustered data set: UniProtKB (283 406 sequences) + SCOP25 (7 616 sequences)
Summary
- BLAST-like searches at up to 1000x speed
- Application on metagenomics datasets
- Copes with huge amounts of sequence data
- Clustering large protein sequence data sets with best sensitivity/speed
Outlook
- More sensitive core algorithm
- Profile searches => boosts sensitivity at same speed
- Applications in metagenomics
- E.g. gut microbiomes for medical research, soil for agriculture etc.
- Nucleotide sequence version to be tested
Thanks
Maria Hauser: Development, Gene Center Munich, Ludwig-Maximilians-Universität
Johannes Söding: PI, Max Planck Institute Göttingen
Justas Dapkunas: Betatest, Institute of Biotechnology, Vilnius University
Klaus Faidt: Betatest, Max Planck Institute Tübingen
Borisas Bursteinas: Betatest, EBI: UniProt development
Andreas Hauser: FFindex
Thank you
for your time.
Discussion
Backup
MILOT MIRDITA
Result Protein Search
[Figure legend: TP, FP]
ROC5
Hit lists per query (hits marked TP or FP):
query 1: db 50, db 48
query 2: db 55, db 43
query 3: db 65, db 63, db 62, db 59, db 56
query 4: db 100, db 99
ROC over all queries (hits pooled and sorted): db 100, db 99, db 65, db 63, db 62, db 59, db 56, db 55, db 50, db 48, db 43
query 3 contributes ½ of the scores
query 4 contributes all highest scores
ROC5 value per query: query 4: 0.2; query 1: 0.4; query 3: 0.6; query 2: 1.0
[Plot: fraction of queries (.25, .50, .75, 1.00) vs. ROC5 (.2, .4, .6, 1.0); AUC 0.6]

MMseqs NGS 2014
