MMseqs NGS 2014

Introducing
MMseqs
MARTIN STEINEGGER
GENE CENTER MUNICH

Motivation
Sequence genome → Gene prediction → Search reads against UniProt → Map to protein / organism
Reads: 7 lanes × 200M reads ≈ 7 × 200M seqs of 50 amino acids = 1.4×10^9 seqs
UniProt: 5×10^7 protein seqs
BLAST: ~40 000 days (16 cores)
MMseqs: ~40 days (16 cores)

Growth of the UniProtKB/TrEMBL
Protein Sequence Database

Result Protein Search
Benchmark: 7 616 proteins searched against UniProt (54 790 250 sequences)

Method        Build & read index   Search time    Speed-up factor
MMseqs s=4    1h 17m               6m             950x
MMseqs s=7    1h 17m               11m            518x
swipe         36m                  2d 5h 34m      1.8x
BLAST         36m                  3d 23h 01m     1x
ublast        1h 52m               46m            127x
RAPsearch     2h 11m               10h 56m        9.5x

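ROC5
Figure: per-query hit lists ranked by score (TP = true positive, FP = false positive)
query 1: db 50, db 48
query 2: db 55, db 43
query 3: db 65, db 63, db 62, db 59, db 56
query 4: db 100, db 99
ROC5 value per query: query 4: 0.2, query 1: 0.4, query 3: 0.6, query 2: 1.0
Plot: fraction of queries (y axis: .25, .50, .75, 1.00) against ROC5 (x axis: .2, .4, .6, 1.0); AUC 0.6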
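The plot on this slide can be rebuilt from the four per-query values. The sketch below is a minimal illustration, not the benchmark code: the function names are hypothetical, and it assumes the curve shows, for each ROC5 threshold, the fraction of queries that reach at least that value.

```python
def fraction_of_queries_curve(roc_values):
    """For each distinct value t, the fraction of queries with ROC5 >= t."""
    n = len(roc_values)
    return [(t, sum(v >= t for v in roc_values) / n)
            for t in sorted(set(roc_values))]


def mean_roc(roc_values):
    """Mean per-query value; equals the area under the step curve on [0, 1]."""
    return sum(roc_values) / len(roc_values)


# Per-query ROC5 values as listed on the slide.
slide_values = {"query 4": 0.2, "query 1": 0.4, "query 3": 0.6, "query 2": 1.0}
curve = fraction_of_queries_curve(list(slide_values.values()))
```

For the slide's values the curve passes through (0.2, 1.00), (0.4, 0.75), (0.6, 0.50) and (1.0, 0.25), which matches the axis ticks. The mean is 0.55, which the slide appears to report rounded as AUC 0.6.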

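Plot: fraction of queries against ROC5
Database: SCOP25 / UniProtKB, 283 406 sequences; queries: SCOP25, 7 616 sequences (search)
- true positive: same SCOP superfamily
- false positive: different SCOP fold
- ignored: same fold, different superfamily
Speed-up factors shown for the curves: 950x, 518x, 9.5x, 127x, 1.8x, 1x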
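The true positive / false positive rules of this benchmark can be written as a small classifier. This is a sketch under an assumption: each sequence carries a SCOP classification string of the form class.fold.superfamily.family (for example a.1.1.2). The function name is hypothetical.

```python
def classify_hit(query_sccs, hit_sccs):
    """Label a hit as 'TP', 'FP' or 'ignored' from two SCOP classification strings."""
    q = query_sccs.split(".")
    h = hit_sccs.split(".")
    if q[:3] == h[:3]:
        return "TP"        # same class, fold and superfamily
    if q[:2] != h[:2]:
        return "FP"        # different fold
    return "ignored"       # same fold, different superfamily
```

Hits labelled "ignored" are left out of the ROC5 count entirely, so they neither reward nor penalize a method.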

Workflow Protein Search
Prefiltering: search space 10^8 × 10^8, ~ 7 days for UniProt (5.4×10^7)
Alignment: search space 10^8 × 10^2, ~ 2 days
Figure: queries 1 to n are matched against the database; prefiltering yields a scored hit list per query (query 1: db 5 (123), db 23 (68), db 2 (32), ...), and alignment turns it into the final hits (query 1: hit 1: 123, hit 2: 68, hit 3: 32; ... query n).

Filtering Sequences with k-mers
Figure: Sequence 1 against Sequence 2, for homologous proteins and for unrelated proteins

2014/5/8
Exact matches of length 3 Similar matches of length 6

Exact 3-mer matches Similar 6-mer matches
Information is power: k-mers as long as possible
But we need inexact matches to keep sensitivity high

k-mer setting         Prob. of chance k-mer match    Prob. of homologous match at 25% sequence identity
3-mer, exact          1.2×10^-3                      1/64
5-mer, exact          3×10^-7                        1/1024
5-mer, 25 similar     7.5×10^-6                      1/40
6-mer, 100 similar    1.5×10^-6                      1/40
7-mer, 400 similar    3×10^-7                        1/40

Prob. of chance k-mer match: keep low for high speed!
Prob. of homologous match: keep high for high sensitivity!
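A rough way to see where numbers of this size come from is a back-of-the-envelope estimate. The sketch below assumes uniform amino acid frequencies over a 20-letter alphabet and independent positions; the function names are hypothetical, and this is not how the slide's values were necessarily obtained.

```python
ALPHABET_SIZE = 20  # standard amino acids, assumed equally frequent


def chance_match_probability(k, n_similar=1):
    """Chance that a random k-mer hits one of n_similar accepted k-mers."""
    return n_similar / ALPHABET_SIZE ** k


def exact_homolog_match_probability(k, identity=0.25):
    """Chance that k consecutive positions are all identical at a given identity."""
    return identity ** k


estimates = {
    "5-mer, exact": chance_match_probability(5),
    "5-mer, 25 similar": chance_match_probability(5, 25),
    "6-mer, 100 similar": chance_match_probability(6, 100),
    "7-mer, 400 similar": chance_match_probability(7, 400),
}
```

Under these assumptions the 5-, 6- and 7-mer estimates land close to the table (about 3×10^-7, 8×10^-6, 1.6×10^-6 and 3×10^-7), and the exact homologous match probabilities are 1/64 and 1/1024. The 3-mer chance value and the 1/40 for similar k-mers are not reproduced by this simple estimate.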

Prefiltering Algorithm
Most critical part of MMseqs regarding speed and memory consumption.
Calculates similarity scores on multiple CPUs.
Computationally intense parts are vectorized.
Figure: prefiltering data flow
Query set: query 1 = . . L G T M H W V R Q A . .
List of k-mers for the query k-mer MHWVRQ, with scores: MHWVRQ 42, MHWVKQ 34, MHWVRE 34, ...
Index table of database (database set): k-mers AAAAAA, AAAAAR, ..., MHWVRE, ..., XXXXXX; the match MHWVRE points to Seq. Ids 5351, 43314, 2314, ...
Sum of scores by Db. Seq. Idx.: 1: 23, ..., 2314: 11+34, ..., 5351: 42+34, ..., 43314: 12+34
Result of query 1: db 5351: (123), db 2314: (68), db 2: (62)
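The data flow in this figure (index table lookup, then a sum of k-mer scores per database sequence) can be sketched in a few lines. This is an illustration only, not the MMseqs implementation: the residue scores are a made-up toy scheme rather than a substitution matrix, similar k-mers are limited to single substitutions, and all names are hypothetical.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Toy notion of "similar residues" (an assumption, not a real matrix).
SIMILAR_PAIRS = {frozenset(p) for p in ("KR", "EQ", "DE", "IL", "IV", "LV")}


def residue_score(a, b):
    if a == b:
        return 4
    return 2 if frozenset((a, b)) in SIMILAR_PAIRS else -1


def build_index(database, k):
    """Index table: k-mer -> ids of the database sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def similar_kmers(kmer, threshold):
    """Yield (k-mer, score) for kmer and its single-substitution neighbours."""
    base = sum(residue_score(c, c) for c in kmer)
    if base >= threshold:
        yield kmer, base
    for i, original in enumerate(kmer):
        for c in AMINO_ACIDS:
            if c == original:
                continue
            score = base - residue_score(original, original) + residue_score(original, c)
            if score >= threshold:
                yield kmer[:i] + c + kmer[i + 1:], score


def prefilter(query, index, k, threshold):
    """Sum k-mer match scores per database sequence, best hits first."""
    totals = defaultdict(int)
    for i in range(len(query) - k + 1):
        for kmer, score in similar_kmers(query[i:i + k], threshold):
            for seq_id in index.get(kmer, ()):
                totals[seq_id] += score
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

With a database such as {1: "MHWVRE", 2: "MHWVKQ", 3: "AAAAAA"}, the query MHWVRQ reaches sequences 1 and 2 only through similar 6-mers, which is the reason the slide gives for using inexact matches.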

Z-scores correct for background k-mer matches
- summed k-mer match score of query with target protein
- parameters from calibration run
- expected score from background matches
- number of expected chance k-mer matches
- Poisson distributed matches

Fast Smith-Waterman alignment using SSE2
Using Michael Farrar’s version of the Smith-Waterman algorithm to align prefiltering outputs.
Figure: prefiltering result (Hit 1, ... for Query 1 to Query n) → alignment result
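For readers unfamiliar with the recurrence being accelerated, here is a plain scalar Smith-Waterman score computation. It is a reference sketch only: it is not Farrar's striped SSE2 variant, it uses a linear gap penalty, and the match, mismatch and gap values are arbitrary assumptions instead of a substitution matrix.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current  # only one row of the matrix is kept in memory
    return best
```

Every cell depends on three neighbours, which is why a vectorized version has to reorder the computation; the scalar form above is only meant to show what is computed for each prefiltering hit.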

Multi core parallelization over query sequence
Thread level parallelization with OpenMP.
Splits query database in packages and matches them against the database set.
Figure: one node with four cores sharing the database set. Core 1 to Core 4 process query seqs. 0 - 25.000, 25.001 - 50.000, 50.001 - 75.000 and 75.001 - 100.000; each core writes scored hit lists for its queries (e.g. query k1: db 12: 103, db 71: 58, db 92: 52; query k2: db 15: 152, db 23: 88, db 24: 32) into the result.
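The package scheme can be illustrated with a thread pool. This is a sketch of the partitioning pattern only: the talk uses OpenMP, whereas this example uses Python threads (which do not give a real speed-up for CPU-bound work), and the shared k-mer score and all function names are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_packages(items, n_packages):
    """Split a list into n_packages contiguous, near-equal packages."""
    size, rest = divmod(len(items), n_packages)
    packages, start = [], 0
    for i in range(n_packages):
        end = start + size + (1 if i < rest else 0)
        packages.append(items[start:end])
        start = end
    return packages


def kmer_set(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def search_package(package, database):
    """Score every query of one package against the whole database set."""
    results = {}
    for query in package:
        hits = [(len(kmer_set(query) & kmer_set(target)), db_id)
                for db_id, target in database.items()]
        results[query] = sorted((h for h in hits if h[0] > 0), reverse=True)
    return results


def parallel_search(queries, database, n_workers=4):
    """Run one package per worker and merge the per-package results."""
    merged = {}
    packages = split_into_packages(queries, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(lambda p: search_package(p, database), packages):
            merged.update(partial)
    return merged
```

Because every worker reads the same database and writes results only for its own queries, no coordination between packages is needed beyond the final merge.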

Multi node parallelization over database sequence
From top to bottom:
1.  Message Passing Interface
2.  Thread Level Parallelism
3.  Data Level Parallelism
Figure: Node 1, Node 2 and Node 3 hold DB Seq. 0 - 100.000, 100.001 - 200.000 and 200.001 - 300.000; every node searches all queries against its part, and the results are aggregated.
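The aggregation step can be sketched as a merge of per-node hit lists. The example assumes each node returns, for every query, a list of (score, db id) pairs already sorted by descending score; plain Python stands in for the MPI communication, and the function name is hypothetical.

```python
import heapq


def merge_node_results(node_results, max_hits=None):
    """Merge per-node results (query -> sorted hit list) into one ranking per query."""
    merged = {}
    all_queries = set().union(*node_results)
    for query in all_queries:
        hit_lists = [result.get(query, []) for result in node_results]
        # Each list is sorted by descending score, so merge on the negated score.
        hits = list(heapq.merge(*hit_lists, key=lambda hit: -hit[0]))
        merged[query] = hits if max_hits is None else hits[:max_hits]
    return merged


# Hypothetical results from two nodes holding different database slices.
node_1 = {"query 1": [(123, 5), (32, 2)]}
node_2 = {"query 1": [(68, 100023)], "query 2": [(50, 100007)]}
aggregated = merge_node_results([node_1, node_2])
```

Splitting over the database means each node needs only its slice of the index in memory, at the price of this extra merge.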

Why Sequence Clustering
Figure: sequences (e.g. GLTRETVSR) grouped into clusters

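Workflow of MMseqs
Prefiltering → Alignment → Clustering
Figure: queries 1 to n are matched against the database (prefiltering), the hits of each query are aligned, and the alignment results are clustered.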

Clustering with greedy set cover
Linear time and space greedy set cover algorithm to cluster results.
Figure: query set and database set → alignment result → clustering result
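The idea of greedy set cover clustering can be shown with a short example. This is a simple quadratic-time sketch, not the linear time and space algorithm the slide refers to; the input format (sequence id mapped to the set of ids it aligns to) and the function name are assumptions.

```python
def greedy_set_cover(neighbours):
    """Cluster sequences: repeatedly pick the sequence covering most unassigned ones.

    neighbours maps each sequence id to the set of ids it aligns to.
    Returns a dict representative -> set of cluster members.
    """
    unassigned = set(neighbours)
    clusters = {}
    while unassigned:
        # Ties are broken by the smallest id, because max keeps the first maximum.
        representative = max(
            sorted(unassigned),
            key=lambda s: len((neighbours[s] | {s}) & unassigned),
        )
        members = (neighbours[representative] | {representative}) & unassigned
        clusters[representative] = members
        unassigned -= members
    return clusters


alignments = {1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1, 5}, 5: {4}, 6: set()}
clusters = greedy_set_cover(alignments)
```

In this toy input, sequence 1 covers the most sequences and becomes the first representative; sequence 5 is left as its own cluster because its only neighbour was already assigned.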

Cascaded Clustering
Prefiltering → Alignment → Clustering, repeated at 90%, 50% and 20% sequence identity
Figure: axes Speed, Sensitivity and Data to cluster across the three steps
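The cascade can be illustrated as repeated clustering of the previous step's representatives at a looser threshold. This is a toy sketch under strong assumptions: identity is computed position by position without any alignment, a simple first-match greedy clustering replaces the prefiltering, alignment and set cover steps, sequences are assumed to be unique, and all names are hypothetical.

```python
def identity(a, b):
    """Toy identity: fraction of equal positions, relative to the longer sequence."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def cluster_once(seqs, threshold):
    """Each sequence joins the first representative it matches, else starts a cluster."""
    clusters = {}
    for seq in seqs:
        for representative in clusters:
            if identity(representative, seq) >= threshold:
                clusters[representative].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


def cascaded_clustering(seqs, thresholds=(0.9, 0.5, 0.2)):
    """Cluster at each threshold in turn, passing only representatives onward."""
    clusters = {seq: [seq] for seq in seqs}
    for threshold in thresholds:
        merged = {}
        for representative, joined in cluster_once(list(clusters), threshold).items():
            merged[representative] = [m for r in joined for m in clusters[r]]
        clusters = merged
    return clusters
```

Each step sees fewer sequences than the one before, so the most expensive, most sensitive comparison runs on the smallest input.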

Updating
We created an updating mechanism that is able to detect changes and update the current database.
We also guarantee stable cluster identifiers.
Figure: the new release consists of old sequences and new sequences, minus deleted sequences; Update = Old Result + New against Old + New against New
Updating: N × ΔN
Reclustering: N × N
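The change detection and the cost comparison can be sketched with set operations. The example assumes sequences are tracked by identifier only; the identifiers and function names are hypothetical.

```python
def plan_update(old_ids, release_ids):
    """Split a new release into kept (old), new and deleted sequence ids."""
    old, release = set(old_ids), set(release_ids)
    return {
        "old": old & release,       # already clustered, results can be reused
        "new": release - old,       # must be searched against old and new
        "deleted": old - release,   # must be removed from existing clusters
    }


def comparisons(n, delta_n):
    """Pairwise comparison counts from the slide: N × ΔN versus N × N."""
    return {"updating": n * delta_n, "reclustering": n * n}


plan = plan_update(["P1", "P2", "P3"], ["P2", "P3", "P4"])
cost = comparisons(n=1000, delta_n=10)
```

With 1000 sequences and 10 new ones, updating needs 10 000 comparisons where reclustering needs 1 000 000, a saving by the factor N / ΔN.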

Clustering Results
Data set (cluster): SCOP25 / UniProtKB, 283 406 sequences; SCOP25, 7 616 sequences

Method                    Clusters   Corrupted clusters   Seq. per cluster   Time
MMseqs s=4 naive clust    85 780     3.4                  3.4                4m 03s
MMseqs s=4 set cover      60 915     1                    4.7                4m 02s
MMseqs cascaded s=4       41 173     3                    7.0                3m 35s
MMseqs s=7                29 801     2                    9.7                9m 26s
MMseqs cascaded s=7       22 541     1                    12.9               5m 07s
blastclust                21 890     1                    13.3               7h 25m 01s
CD-HIT                    114 386    260                  2.5                1h 25m 01s
kClust                    91 681     1                    3.2                9m 57s
Usearch                   157 981    11                   1.8                45s

Summary
- BLAST-like searches at up to 1000x speed
- Application on metagenomics datasets
- Copes with huge sequence data amounts
- Clustering large protein seq data sets with best sensitivity/speed
Outlook
- More sensitive core algorithm
- Profile searches => boosts sensitivity at same speed
- Applications in metagenomics
  - E.g. gut microbiomes for medical research, soil for agriculture etc.
- Nucleotide sequence version to be tested

Thanks
Maria Hauser: Development (Gene Center Munich, Ludwig-Maximilians-Universität)
Johannes Söding: PI (Max Planck Institute Göttingen)
Justas Dapkunas: Betatest (Institute of Biotechnology, Vilnius University)
Klaus Faidt: Betatest (Max Planck Institute Tübingen)
Borisas Bursteinas: Betatest (EBI: UniProt development)
Andreas Hauser: FFindex

Thank you
for your time.
Discussion

ROC5 and ROC
Figure: per-query hit lists ranked by score (TP = true positive, FP = false positive)
query 1: db 50, db 48 (ROC 0.4)
query 2: db 55, db 43 (ROC 1.0)
query 3: db 65, db 63, db 62, db 59, db 56 (ROC 0.6)
query 4: db 100, db 99 (ROC 0.2)
ROC over all queries pools the hits into one ranked list: db 100, db 99, db 65, db 63, db 62, db 59, db 56, db 55, db 50, db 48, db 43. Query 3 contributes ½ of the scores and query 4 contributes all highest scores.
ROC5 value per query: query 4: 0.2, query 1: 0.4, query 3: 0.6, query 2: 1.0
Plot: fraction of queries (y axis: .25, .50, .75, 1.00) against ROC5 (x axis: .2, .4, .6, 1.0); AUC 0.6
