MMseqs (Many-against-Many sequence searching) is a novel software suite for very fast protein sequence searches and clustering of huge protein sequence data sets, such as sets of predicted protein sequences or 6-frame-translated open reading frames (ORFs) from large metagenomics experiments. MMseqs is around 1000 times faster than protein BLAST and sensitive enough to capture similarities down to less than 30% sequence identity.
At the core of MMseqs are two modules for the comparison of two sequence sets with each other. The first, the prefiltering module, computes the similarities between all sequences in one set and all sequences in the other based on a very fast and sensitive alignment-free metric, the sum of scores of similar 7-mers. The second module implements an AVX2-accelerated Smith-Waterman alignment of all sequences that pass a cut-off for the score in the first module. Due to its unparalleled combination of speed and sensitivity, searches of all predicted ORFs in large metagenomics data sets through the entire UniProt or NCBI-NR databases will be feasible. This could allow many reads that are too diverged to be mapped by current software to be assigned to functional clusters and taxonomic clades.
MMseqs' third module clusters sequence sets efficiently, based on the similarity graph obtained from comparing the sequence set with itself in modules 1 and 2. MMseqs further supports an updating mode in which sequences can be added to an existing clustering with stable cluster identifiers and without the need to recluster the entire sequence set. MMseqs will therefore be used to offer high-quality clustered versions of the UniProt database down to a 30% sequence similarity threshold.
7. Workflow Protein Search
Prefiltering: search space 10^8 × 10^8, ~ 7 days for UniProt (5.4 × 10^7)
Alignment: search space 10^8 × 10^2, ~ 2 days
[Figure: queries 1 ... n are compared against the database. Prefiltering produces a scored list of database sequences per query (e.g. query 1: db 5: (123), db 23: (68), db 2: (32), ...). Alignment turns these lists into the final hits per query (e.g. query 1: hit 1: 123, hit 2: 68, hit 3: 32).]
9. Filtering Sequences with k-mers
2014/5/8
MARTIN STEINEGGER
[Figure panels: exact matches of length 3; similar matches of length 6]
10. Filtering Sequences with k-mers
[Figure panels: exact 3-mer matches; similar 6-mer matches]
Information is power: k-mers as long as possible.
But we need inexact matches to keep sensitivity high.

k-mer setting         Prob. of chance        Prob. of homologous match
                      k-mer match            at 25% sequence identity
3-mer, exact          1.2 × 10^-3            1/64
5-mer, exact          3 × 10^-7              1/1024
5-mer, 25 similar     7.5 × 10^-6            1/40
6-mer, 100 similar    1.5 × 10^-6            1/40
7-mer, 400 similar    3 × 10^-7              1/40
                      Keep low for           Keep high for
                      high speed!            high sensitivity!
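The trade-off in the table above can be approximated with a few lines of arithmetic. The sketch below assumes uniform amino-acid frequencies over a 20-letter alphabet and independent alignment positions. The slide does not state either assumption, so the numbers need not coincide with the slide's values. The function names are illustrative only.

```python
from fractions import Fraction

ALPHABET_SIZE = 20  # assumption: all 20 amino acids are equally likely


def chance_match_prob(k: int, n_similar: int = 1) -> float:
    """Probability that a random k-mer hits a list of n_similar k-mers."""
    return n_similar / ALPHABET_SIZE ** k


def exact_homologous_match_prob(k: int, identity: Fraction) -> Fraction:
    """Probability that k consecutive aligned residues are all identical,
    assuming positions are independent (a simplification)."""
    return identity ** k


# (k, number of similar k-mers) for the five settings on the slide
for k, n in [(3, 1), (5, 1), (5, 25), (6, 100), (7, 400)]:
    print(f"{k}-mer, {n:>3} similar: chance match ~ {chance_match_prob(k, n):.1e}")

# Exact matches at 25% sequence identity
print(exact_homologous_match_prob(3, Fraction(1, 4)))  # 1/64
print(exact_homologous_match_prob(5, Fraction(1, 4)))  # 1/1024
```

Under these assumptions, longer k-mers with a growing list of similar k-mers keep the chance-match probability low, which is the point the slide makes about speed.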
11. Prefiltering Algorithm
Most critical part of MMseqs regarding speed and memory consumption.
Calculates similarity scores on multiple CPUs.
Computationally intense parts are vectorized.
[Figure: prefiltering of a query set against a database set. For each k-mer position of the query (e.g. ...LGTMHWVRQA...), a list of similar k-mers with scores is generated (MHWVRQ: 42, MHWVKQ: 34, MHWVRE: 34). Each k-mer is looked up in the index table of the database (entries AAAAAA, AAAAAR, ..., MHWVRE, ..., XXXXXX), which lists the ids of the sequences containing it (e.g. 5351, 43314, 2314). On a match, the k-mer score is added to the sum of scores of that database sequence (e.g. 11+34 for sequence 2314, 42+34 for 5351, 12+34 for 43314).]
Result of query 1:
db 5351: (123)
db 2314: (68)
db 2: (62)
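The index-table lookup shown on this slide can be sketched in plain Python. This is a toy illustration and not the MMseqs implementation. The scoring constants and the similarity groups in `GROUPS` are hypothetical stand-ins for a real substitution matrix. The similar k-mer list is also limited to single substitutions for brevity.

```python
from collections import defaultdict

MATCH_SCORE = 4    # toy score for an identical residue (assumption)
SIMILAR_SCORE = 2  # toy score for a substitution within a group (assumption)
GROUPS = ["KR", "DE", "QN", "ST", "ILVM", "FYW", "AG"]  # hypothetical groups
SIMILAR = {aa: group.replace(aa, "") for group in GROUPS for aa in group}


def build_index(db, k):
    """Index table: k-mer -> set of database sequence ids containing it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def similar_kmers(kmer):
    """Yield (k-mer, score): the k-mer itself plus single-substitution variants."""
    yield kmer, MATCH_SCORE * len(kmer)
    for pos, aa in enumerate(kmer):
        for other in SIMILAR.get(aa, ""):
            variant = kmer[:pos] + other + kmer[pos + 1:]
            yield variant, MATCH_SCORE * (len(kmer) - 1) + SIMILAR_SCORE


def prefilter(query, index, k):
    """Sum the scores of similar k-mer matches per database sequence."""
    scores = defaultdict(int)
    for i in range(len(query) - k + 1):
        for kmer, score in similar_kmers(query[i:i + k]):
            for seq_id in index.get(kmer, ()):
                scores[seq_id] += score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


db = {5351: "LGTMHWVRQA", 2314: "MHWVKQ", 2: "AAAAAA"}
index = build_index(db, k=6)
print(prefilter("GTMHWVRQ", index, k=6))  # [(5351, 72), (2314, 22)]
```

Sequence 2314 is found only through the similar k-mer MHWVKQ, which illustrates why inexact matches help sensitivity.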
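12. Z-scores correct for background k-mer matches
[Equation: Z-score of the prefiltering score; the symbols are not preserved in this transcript. The terms defined on the slide are listed below.]
Summed k-mer match score of query with target protein
Expected score from background matches, with a parameter from a calibration run
Number of expected chance k-mer matches
Poisson-distributed matches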
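The slide's exact formula is not recoverable here, so the following is only a generic illustration of the idea it names: correcting an observed number of k-mer matches for the number expected by chance, under a Poisson background where the variance equals the mean. Both function names, and the choice to work on match counts rather than summed scores, are assumptions of this sketch.

```python
import math


def expected_chance_matches(query_len, target_len, p_chance, k):
    """Expected number of chance k-mer matches between two sequences,
    assuming every pair of k-mer positions matches independently with p_chance."""
    n_pairs = max(query_len - k + 1, 0) * max(target_len - k + 1, 0)
    return n_pairs * p_chance


def poisson_z_score(observed, expected):
    """Z-score against a Poisson background (standard deviation = sqrt(mean))."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return (observed - expected) / math.sqrt(expected)


# Example: 6-mers with a chance-match probability of 1.5e-6 per position pair
background = expected_chance_matches(300, 400, p_chance=1.5e-6, k=6)
print(background)
print(poisson_z_score(observed=3, expected=background))
```

Because the expected background grows with the product of the two sequence lengths, long sequences need more matches than short ones to reach the same Z-score.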
13. Fast Smith-Waterman alignment using SSE2
Uses Michael Farrar's version of the Smith-Waterman algorithm to align the prefiltering outputs.
[Figure: for each query (Query 1 ... Query n), the prefiltering result is passed to the alignment step, which produces the alignment result (Hit 1, ...).]
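For reference, the recurrence that the SIMD version accelerates can be written as a scalar Smith-Waterman. The sketch below is not Farrar's striped SSE2 algorithm. It uses a linear gap penalty and a toy match/mismatch score, both chosen here for brevity, and returns only the best local alignment score.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (scalar reference version).

    Uses a linear gap penalty and keeps only two rows of the DP matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Align a query only against the sequences that passed prefiltering.
prefilter_hits = {5351: "LGTMHWVRQA", 2314: "MHWVKQ"}
query = "MHWVRQ"
for db_id, target in prefilter_hits.items():
    print(db_id, smith_waterman_score(query, target))
```

Restricting the alignment to prefiltering hits is what shrinks the search space of this step.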
14. Multi-core parallelization over query sequences
Thread-level parallelization with OpenMP.
Splits the query database into packages and matches them against the database set.
[Figure: one node with four cores (Core 1 to Core 4) sharing the database set. The query sequences are split into four packages (0 - 25,000; 25,001 - 50,000; 50,001 - 75,000; 75,001 - 100,000). Each core writes per-query hit lists to a common result, e.g. query 1: db 5: 123, db 23: 68, db 2: 32; query k1: db 12: 103, db 71: 58, db 92: 52; query k2: db 15: 152, db 23: 88, db 24: 32.]
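The package split can be illustrated in Python, with a thread pool standing in for the OpenMP threads named on the slide. This is only a sketch of the partitioning scheme: Python threads do not run CPU-bound work in parallel, and `toy_score` is a hypothetical placeholder for the real prefiltering and alignment.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_packages(n_queries, n_workers):
    """Return near-equal [start, end) index ranges, one per worker."""
    base, extra = divmod(n_queries, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        end = start + base + (1 if w < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def toy_score(query, target, k=3):
    """Placeholder similarity: number of distinct shared k-mers."""
    q = {query[i:i + k] for i in range(len(query) - k + 1)}
    t = {target[i:i + k] for i in range(len(target) - k + 1)}
    return len(q & t)


def parallel_search(queries, database, n_workers=4):
    """Each worker matches its query package against the whole database."""
    def work(bounds):
        start, end = bounds
        return {q: max(database, key=lambda t: toy_score(q, t))
                for q in queries[start:end]}

    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(work, split_into_packages(len(queries), n_workers)):
            results.update(partial)
    return results


print(split_into_packages(100_000, 4))
print(parallel_search(["MHWVRQ", "AAAAK"], ["LGTMHWVRQA", "AAAAAA"], n_workers=2))
```

Splitting over queries needs no communication between workers, since every package is searched against the same read-only database.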
15. Multi-node parallelization over database sequences
From top to bottom:
1. Message Passing Interface
2. Thread Level Parallelism
3. Data Level Parallelism
[Figure: the database is split across nodes (Node 1: DB seq. 0 - 100,000; Node 2: DB seq. 100,001 - 200,000; Node 3: DB seq. 200,001 - 300,000). Each node searches all queries against its part of the database, and the per-node results are aggregated.]
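The aggregation step in this figure amounts to merging, for every query, the hit lists that the nodes produced for their database slices. The sketch below shows that merge in plain Python without any MPI calls. The data layout (a dict from query to a list of `(score, db_id)` tuples sorted by descending score) is an assumption made for illustration.

```python
import heapq


def merge_node_results(per_node, top_n=None):
    """Merge per-node hit lists into one ranked list per query.

    per_node: list of dicts, query -> [(score, db_id), ...] sorted by
    descending score. Returns the same layout with all nodes combined.
    """
    collected = {}
    for node_result in per_node:
        for query, hits in node_result.items():
            collected.setdefault(query, []).append(hits)
    return {
        query: list(heapq.merge(*lists, key=lambda hit: hit[0], reverse=True))[:top_n]
        for query, lists in collected.items()
    }


node1 = {"query 1": [(123, 5), (32, 2)]}              # DB seq. 0 - 100,000
node2 = {"query 1": [(68, 100023)],                   # DB seq. 100,001 - 200,000
         "query 2": [(88, 150000)]}
print(merge_node_results([node1, node2]))
```

Since each node only holds a slice of the database, this layout also divides the memory needed for the index table among the nodes.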
20. Updating
We created an updating mechanism that is able to detect changes and update the current database.
We also guarantee stable cluster identifiers.
[Figure: the database consists of old, new, and deleted sequences. The update combines the old result with the searches new against new and new against old.]
Updating: N × ΔN
Reclustering: N × N
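One way to picture the update is the following toy sketch. It is not the MMseqs procedure: it assumes clusters are stored as a dict from a stable cluster identifier to its member ids, uses a hypothetical `hamming_similar` test in place of a real search, and ignores sequences whose content changed under the same id.

```python
def hamming_similar(a, b, max_mismatches=1):
    """Hypothetical similarity test: equal length, at most one mismatch."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= max_mismatches


def update_clustering(clusters, old_db, new_db, is_similar=hamming_similar):
    """Update clusters without reclustering; existing cluster ids are kept."""
    added = new_db.keys() - old_db.keys()
    deleted = old_db.keys() - new_db.keys()

    # Drop deleted sequences; clusters that become empty disappear.
    updated = {}
    for cluster_id, members in clusters.items():
        kept = [m for m in members if m not in deleted]
        if kept:
            updated[cluster_id] = kept

    # Compare each new sequence against existing clusters (new against old)
    # and against clusters opened earlier in this loop (new against new).
    for seq_id in sorted(added):
        for cluster_id, members in updated.items():
            if any(is_similar(new_db[seq_id], new_db[m]) for m in members):
                members.append(seq_id)
                break
        else:
            updated[seq_id] = [seq_id]  # new cluster, named after its first member
    return updated


old_db = {"a": "MHWVRQ", "b": "MHWVKQ", "c": "AAAAAA"}
new_db = {"a": "MHWVRQ", "b": "MHWVKQ", "d": "MHWVRE", "e": "GGGGGG"}
clusters = {"a": ["a", "b"], "c": ["c"]}
print(update_clustering(clusters, old_db, new_db))
```

Only the ΔN new sequences are searched, which is where the N × ΔN cost on the slide comes from, compared with N × N for a full reclustering.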
22. Summary
• BLAST-like searches at up to 1000x speed
• Application on metagenomics datasets
• Copes with huge sequence data amounts
• Clustering large protein sequence data sets with best sensitivity/speed
Outlook
• More sensitive core algorithm
• Profile searches => boosts sensitivity at same speed
• Applications in metagenomics
• E.g. gut microbiomes for medical research, soil for agriculture, etc.
• Nucleotide sequence version to be tested
23. Thanks
Maria Hauser: Development (Gene Center Munich, Ludwig-Maximilians-Universität)
Johannes Söding: PI (Max Planck Institute Göttingen)
Justas Dapkunas: Betatest (Institute of Biotechnology, Vilnius University)
Klaus Faidt: Betatest (Max Planck Institute Tübingen)
Borisas Bursteinas: Betatest (EBI: UniProt development)
Andreas Hauser: FFindex
27. ROC5
[Figure: comparison of a single ROC curve over all queries with per-query ROC5 values.]
Hit lists per query:
query 1: db 50, db 48
query 2: db 55, db 43
query 3: db 65, db 63, db 62, db 59, db 56
query 4: db 100, db 99
ROC (all queries pooled into one list, TP vs. FP): AUC 0.6
Pooled list: db 100, db 99, db 65, db 63, db 62, db 59, db 56, db 55, db 50, db 48, db 43
Query 3 contributes ½ of the scores; query 4 contributes all highest scores.
ROC5 (one ROC value per query, plotted as fraction of queries vs. ROC5):
query 4: 0.2
query 1: 0.4
query 3: 0.6
query 2: 1.0
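The slide does not spell out how ROC5 is computed. The sketch below assumes the common definition: the area under a query's ROC curve up to the fifth false positive, normalized to the range 0 to 1. It also computes the fraction of queries that reach a given ROC5 value, which is the quantity on the vertical axis of the plot. The function names and the handling of lists with fewer than five false positives are choices made for this sketch.

```python
def roc_n(labels, total_tp, n=5):
    """ROC_n of one query.

    labels: ranked hits, best first; True = true positive, False = false positive.
    total_tp: number of true positives that exist for this query.
    """
    if total_tp == 0:
        return 0.0
    tp_seen = fp_seen = area = 0
    for is_tp in labels:
        if is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen  # TPs ranked above this false positive
            if fp_seen == n:
                break
    # If the list ends early, the missing false positives are assumed
    # to rank below every listed hit.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_tp)


def fraction_of_queries_at_least(roc_values, threshold):
    """Fraction of queries whose ROC_n is at least `threshold`."""
    return sum(v >= threshold for v in roc_values) / len(roc_values)


print(roc_n([True, False, True, False, False, False, False], total_tp=2))  # 0.9

roc5_per_query = [0.2, 0.4, 0.6, 1.0]  # values listed on the slide
for x in (0.2, 0.4, 0.6, 1.0):
    print(x, fraction_of_queries_at_least(roc5_per_query, x))
```

Because every query contributes one value regardless of how many hits it has, a single query with many high-scoring hits cannot dominate the curve the way it can in a pooled ROC.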