Habilitation colloquium

Predicting the function of
biomolecules
Dr. Dariusz Plewczyński
ICM UW

1

What is bioinformatics?

Bioinformatics is a young research field which, by collecting and analyzing
large-scale experimental data, provides the natural theoretical and
methodological foundation for the life sciences.

It provides functional characterization of proteins, metabolites,
RNA/DNA molecules and other types of biomolecules found in the
cells of living organisms.

[Figure: (A) cellular processes of a model cell, from metabolism, transcription and translation to DNA replication and cytokinesis; (B) simulation loop that initializes the cell variables, updates time and cell variables through the cell process submodels, and repeats until the cell has divided. Karr, Covert et al. Cell 150, 389-401 (2012)]

2

• The aim of bioinformatics is therefore to
provide a research language for
understanding biology

• the analysis of information about the biomolecules
from which the whole living organism is built.

[Figure: two growth charts. Known protein sequences, 2003-2012, in millions, reaching 16,393,342. Structures in PDB, 1990-2012, in thousands, reaching 84,508.]

3

• There is a need to further develop and design new
methods for predicting protein function
• Modeling and analysis of selected systems of
key importance in chemistry, biology and medicine.
• Interdisciplinary topics addressed:
– bioinformatics,
– cheminformatics, drug design,
– virtual screening,
– protein sequence and structure,
– post-translational modifications (phosphorylation, ...),
– exploration and analysis of large-scale data,
– machine learning, cluster analysis, meta-learning,
consensus learning,
– data visualization.

4

Research
motivation

5

Modern molecular biology provides
large amounts of experimental data coming,
among others, from large-scale studies:

the Human Genome Project;
next-generation sequencing;
structural genomics;
microarray experiments, mass spectrometry;
virtual screening.

[Figure: page from the paper "Novel SGNH hydrolases in Leptospira", Cell Cycle (www.landesbioscience.com), describing the SGNH hydrolase superfamily, its flavodoxin-like fold, and the catalytic triad (S77, D328, H333 in gi|116332481) that is fully conserved in DUF1574 proteins]

6

Large-scale experimental data are extremely
costly and difficult to interpret further.

a very high error rate;
information noise;
non-linear dependencies in the data and features;
uncertainty about the proper characterization of
biomolecules.

Computational methods such as machine
learning provide a natural language for describing
data collected in this way, and additionally
help research teams
formulate research hypotheses.

7

Research
methodology

[Figure: workflow of the prediction pipeline. INPUT: objects & features, annotations. Representations generation: structure similarity, PCA/ICA, properties, text mining, external tools, external annotations, external databases. SQL training database. Feature decomposition: scores, thresholds, structural similarity, features similarity. Machine learning: Support Vector Machine, Neural Networks, Random Forest, Decision Trees. MLcons consensus. OUTPUT: model, features, decision, reliability score.]

8

The habilitation publications focus on
applications of machine learning methods,
physico-chemical methods, and hybrid
methods to the analysis of biological data.

A typical example of cooperation between
computational methods, physico-chemical
tools, and biological knowledge is
the problem of predicting the function of a protein
from its amino-acid sequence.

1 RPDFCLEPPY 10
11 TGPCKARIIR 20
21 YFYNAKAGLC 30
31 QTFVYGGCRA 40 FUNCTION
41 KRNNFKSAED 50
51 CMRTCGGA 58
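
As a minimal sketch of the first step in sequence-based function prediction, the 58-residue sequence shown above can be turned into a fixed-length numeric feature vector. The amino-acid composition used here is only an illustrative choice of representation, not the feature set of the cited publications, and the name `composition` is hypothetical.

```python
from collections import Counter

# The 58-residue example sequence from the slide, joined into one string.
SEQUENCE = (
    "RPDFCLEPPY" "TGPCKARIIR" "YFYNAKAGLC"
    "QTFVYGGCRA" "KRNNFKSAED" "CMRTCGGA"
)

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence: str) -> dict:
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence)
    return {aa: counts[aa] / len(sequence) for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Print the 20-dimensional feature vector, one residue type per line.
    for aa, fraction in composition(SEQUENCE).items():
        print(f"{aa}: {fraction:.3f}")
```

A vector of this kind has the same length for every protein, which is what most machine learning methods expect as input.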

9

Applying computational methods to the analysis
of data from complex biological
systems requires:

• collecting training data;
• identifying local physico-chemical features;
• assessing the quality of the models;
• building a meta-model;
• carrying out a full characterization of the biological
system.

[Figure: diagram with the elements Training data and Model]

10

[Figure: the same diagram extended with the elements Training data, Model, Discovery Process and Existing Knowledge]

How meta-models (meta-servers,
meta-tools) are built and how they are applied
in bioinformatics.

The process of building meta-tools can be
multi-scale:

microscale;

mesoscale;

macroscale.

11

Theory of meta-learning in the limit of
an infinite statistical ensemble

an ensemble of interacting individual
algorithms (cellular automata);

differences between algorithms, data, and
features;

diversity, both methodological
and representational

[DMP1] “Landau Theory of Meta-Learning” D. Plewczynski in Springer Lecture Notes
in Computer Science (LNCS) P. Bouvry et al. (Eds.): SIIS 2011, LNCS 7053, pp.
142-153. Springer, Heidelberg (2011);

13

The data fusion method.

Large-scale data coming from many
heterogeneous sources;
representation of the information flow between databases;
assessment of the quality of the information gathered in this way

[DMP3] “Protein-protein interaction and pathway databases, a graphical review” T.
Klingström and D. Plewczynski. Briefings in Bioinformatics Epub Sep 17 (2010);

14

Virtual screening: no knowledge of the three-dimensional
structure of the therapeutic target.

[DMP4] “Performance of machine learning methods for ligand-based virtual
screening” by D. Plewczynski, S. Spieser and U. Koch. Combinatorial Chemistry &
High Throughput Screening CCHTS 12(4): 358-68 (2009);
[DMP5] “Assessing Different Classification Methods for Virtual Screening” by D.
Plewczynski, S. Spieser, U. Koch. J. Chem. Inf. Model. 46(3), p.1098-106 (2006);

15

Virtual screening: known three-dimensional
structure of the protein -> physico-chemical methods,
building a meta-server

[DMP6] “VoteDock: consensus docking method for prediction of protein-ligand
interactions” D. Plewczynski, M. Łaźniewski, M. von Grotthuss, L. Rychlewski and K.
Ginalski. Journal of Computational Chemistry (2010);

16

A set of 1300 protein-ligand pairs from the PDBbind database, version 2007 (with both
the true structure of the complex and its binding affinity known).
The meta-tool correctly predicts the three-dimensional conformation for as many as 20%
more complexes than the individual methods, and for 10% more than
the best of the seven physico-chemical programs considered.


The deviation between the predicted and
the true structure improves by about 0.5 Å.
The Pearson correlation coefficient between the predicted and
the actual binding affinity improves to 0.5 when linear regression
is applied to the docking results of the individual programs.
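
The two quality measures quoted above can be sketched as follows. This is a minimal illustration, assuming that the structural deviation is measured as a root-mean-square deviation (RMSD) over matched atoms and that both poses are already expressed in the same coordinate frame; the function names are hypothetical and the code is not taken from the cited publication.

```python
import math


def rmsd(predicted, reference):
    """RMSD between two equally ordered lists of (x, y, z) atom coordinates."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(predicted, reference)
    )
    return math.sqrt(squared / len(predicted))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long number lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

With these definitions, a lower `rmsd` means a docked pose closer to the experimental one, and a `pearson` value closer to 1 means predicted affinities that track the measured ones more faithfully.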


Further development of the method: integrating both classes
of computational methods to improve their
speed and accuracy, and determining how the
resulting hybrid computational meta-tool
performs in large-scale
pharmaceutical projects.

The random forest (RF) method combined with docking, applied to 5 protein
targets, makes it possible to find 60% of the active inhibitors while docking
only 10% of the compounds from the whole drug collection (MDDR).

[DMP7] “Virtual High-Throughput Screening using combined Random Forest and
Flexible Docking” by D. Plewczynski, L. Rychlewski, Marcin von Grotthuss, and
Krzysztof Ginalski. Combinatorial Chemistry & High Throughput Screening, CCHTS

19
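
The statement that most actives are recovered while only a small part of the collection is docked can be expressed as a recall measured on the top-ranked fraction of compounds. The sketch below assumes that a fast pre-filter (for example a random forest) has already assigned a score to every compound; the function name and the toy numbers are hypothetical.

```python
def recall_at_fraction(scores, is_active, fraction):
    """Fraction of all actives found among the top-scoring part of the collection.

    scores: pre-filter score per compound (higher means more likely active).
    is_active: 1 for known actives, 0 otherwise, in the same order as scores.
    fraction: share of the collection passed on to docking, in (0, 1].
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    total_actives = sum(is_active)
    if total_actives == 0:
        raise ValueError("at least one active compound is required")
    # Rank compound indices from the highest to the lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_selected = max(1, int(len(scores) * fraction))
    selected = ranked[:n_selected]
    return sum(is_active[i] for i in selected) / total_actives


if __name__ == "__main__":
    # Toy data: ten compounds, three of them active.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    toy_labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
    print(recall_at_fraction(toy_scores, toy_labels, 0.5))
```

Only the selected compounds would then be passed to the slower docking step.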

Brainstorming: improving the results by
applying many machine learning methods
simultaneously.

The meta-method improves classification results
by about 10% compared with
individual methods.

[DMP8] “Brainstorming: weighted voting prediction of inhibitors for protein targets”
D. Plewczynski. Journal of Molecular Modelling Epub, Sep 21 (2010);

20
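
A weighted vote over several classifiers, as named in the title of [DMP8], can be sketched in a few lines. The weighting scheme below (one non-negative weight per method, for example its validation accuracy) and the decision threshold are assumptions made for illustration; the slides do not specify how the weights are chosen.

```python
def weighted_vote(predictions, weights, threshold=0.5):
    """Combine binary predictions (0 or 1) from several methods.

    Returns the weighted share of votes for class 1 and the consensus label.
    """
    if len(predictions) != len(weights) or not predictions:
        raise ValueError("need one weight per prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(w * p for p, w in zip(predictions, weights)) / total
    return score, int(score >= threshold)


if __name__ == "__main__":
    # Three hypothetical methods vote on one compound: active (1) or inactive (0).
    print(weighted_vote([1, 0, 0], [3, 1, 1]))
```

A single trusted method can outvote several weaker ones, which is the main difference from simple majority voting.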

Post-translational modifications of proteins: local analysis of protein sequence and
structure in order to determine their biological function.

It is enough to select the linear, local neighborhood of the modified
amino acid and to fix a natural representation (a set of features) for
such a short sequence fragment.
This allows very fast and effective scanning of whole proteomes for
automatic identification of modifications in new proteins.

[DMP10] “AMS 3.0: prediction of post-translational modifications” S. Basu and D.
Plewczynski. BMC Bioinformatics (2010), Apr 28, 11:210;
[DMP11] “AutoMotif Server: prediction of single residue post-translational
modifications in proteins” by D. Plewczynski, A. Tkacz, L. Wyrwicz and L. Rychlewski.

21
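
Selecting the linear, local neighborhood of a candidate residue amounts to cutting a fixed-length window out of the sequence. The sketch below is an illustration only: the target residues S, T and Y (typical phosphorylation sites), the window of four residues on each side and the padding character are assumptions, not parameters taken from the cited tools.

```python
def residue_windows(sequence, targets="STY", flank=4, pad="-"):
    """Yield (position, residue, window) for every target residue.

    position is 1-based; window has length 2 * flank + 1, is centered on the
    residue, and is padded with `pad` near the ends of the sequence.
    """
    padded = pad * flank + sequence + pad * flank
    for index, residue in enumerate(sequence):
        if residue in targets:
            yield index + 1, residue, padded[index : index + 2 * flank + 1]


if __name__ == "__main__":
    for position, residue, window in residue_windows("RPDFCLEPPYTGPC"):
        print(position, residue, window)
```

Each window can then be encoded as a feature vector and scored by a trained classifier, which is what makes scanning a whole proteome fast.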

Analysis of therapeutic targets

from predicting the biological activity of small chemical
molecules (i.e. metabolites or inhibitors) to the function of larger
biomolecules (proteins).
a method for selecting proteins associated with disease processes.
the notion of sequence similarity.
structural similarity used to determine their enzyme
number.

[Text truncated in the source slide; only fragments survive: "Both methods, ... structural ... but they are also ... proteins with similar ... amino-acid ... global structure ... structural methods ... about twice ... homologous ... level ... statistical ..."]

[DMP9] “Meta-basic estimates the size of druggable human genome” by D.
Plewczynski and L. Rychlewski. Journal of Molecular Modelling 15, 695-699 (2009);

22

An algorithm combining a local sequence description with a structural
characterization of the protein chain.

Prediction of the local three-dimensional conformation of a protein
using only information about its amino-acid sequence.
Although the local structure of the protein chain is predicted
solely from the sequence, using both
kinds of information together significantly improves the quality of the alignment between
proteins.

[DMP12] “SEA and FRAGlib – an Integrated Web Service for Improving the Alignment
Quality based on Segments Comparison” by D. Plewczynski, L. Rychlewski, L.
Jaroszewski, Y. Ye, A. Godzik. BMC Bioinformatics 5(1):98 (2004);

23

Summary &
acknowledgements

24
29

The habilitation thesis presented here consists of a series of 12
publications that appeared in the years 2004-2011.
These works concern the application of computational methods,
machine learning and data analysis in bioinformatics, in
particular to the modeling and analysis of biomolecules.

Aim: supporting biological, chemical
and medical research.

The main research result is the demonstration that, in
many applications, meta-algorithms give
a significant improvement over
individual methods (up to 10% better prediction
quality on bioinformatics test sets).

A noticeable improvement in the classification of large-scale
data is achieved through the simultaneous use of heterogeneous
methods (both physico-chemical and machine
learning).

25

Thank you for your attention!

CuraGen Corporation
Science, 2003

26

Evaluation of
classifiers

[Figure: pages from Tom Fawcett's paper on ROC graphs. Figure 1: confusion matrix with the true class (p, n) in columns and the hypothesized class (Y, N) in rows, holding True Positives, False Positives, False Negatives and True Negatives, with column totals P and N. Figure 2: a basic ROC graph showing five discrete classifiers (A-E), False Positive rate on the x-axis and True Positive rate on the y-axis.]

Common performance metrics calculated from the confusion matrix:

$\text{fp rate} = \frac{FP}{N}$, $\text{tp rate} = \frac{TP}{P}$
$\text{precision} = \frac{TP}{TP + FP}$, $\text{recall} = \frac{TP}{P}$
$\text{accuracy} = \frac{TP + TN}{P + N}$
$\text{F-measure} = \frac{2}{1/\text{precision} + 1/\text{recall}}$

Additional terms associated with ROC curves:

$\text{sensitivity} = \text{recall}$
$\text{specificity} = \frac{TN}{FP + TN} = 1 - \text{fp rate}$
$\text{positive predictive value} = \text{precision}$

Algorithm 2: Efficient method for generating ROC points

Inputs: L, the set of test examples; f(i), the probabilistic classifier's estimate
that example i is positive; P and N, the number of positive and negative
examples.
Outputs: R, a list of ROC points increasing by fp rate.
Require: P > 0 and N > 0

 1: L_sorted ← L sorted decreasing by f scores
 2: FP ← TP ← 0
 3: R ← ⟨⟩
 4: f_prev ← −∞
 5: i ← 1
 6: while i ≤ |L_sorted| do
 7:   if f(i) ≠ f_prev then
 8:     push (FP/N, TP/P) onto R
 9:     f_prev ← f(i)
10:   end if
11:   if L_sorted[i] is a positive example then
12:     TP ← TP + 1
13:   else /* i is a negative example */
14:     FP ← FP + 1
15:   end if
16:   i ← i + 1
17: end while
18: push (FP/N, TP/P) onto R /* This is (1,1) */
19: end

TP and FP both start at zero. For each positive instance we increment TP and for every
negative instance we increment FP. We maintain a stack R of ROC points, pushing a new
point onto R after each instance is processed. The final output is the stack R, which will
contain points on the ROC curve.
Let n be the number of points in the test set. This algorithm requires an O(n log n) sort
followed by an O(n) scan down the list, resulting in O(n log n) total complexity.
Statements 7-10 are necessary in order to correctly handle sequences of equally scored
instances.

T. Fawcett (2004)

27
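
Algorithm 2 translates almost line by line into Python. The sketch below follows the pseudocode from the slide; the function name `roc_points` and the use of `None` as the initial "previous score" sentinel (in place of minus infinity) are choices made for this illustration.

```python
def roc_points(scores, labels):
    """Return ROC points (fp_rate, tp_rate), increasing by fp rate.

    scores: classifier estimates f(i) that example i is positive.
    labels: True for positive examples, False for negative ones.
    """
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    # Statement 1: sort the examples by decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = []
    tp = fp = 0
    previous_score = None  # sentinel that differs from every real score
    for score, label in ranked:
        # Statements 7-10: emit a point only when the score changes, so a run
        # of equally scored examples contributes a single ROC point.
        if score != previous_score:
            points.append((fp / negatives, tp / positives))
            previous_score = score
        if label:
            tp += 1
        else:
            fp += 1
    points.append((fp / negatives, tp / positives))  # this is (1, 1)
    return points


if __name__ == "__main__":
    print(roc_points([0.9, 0.8, 0.7, 0.6], [True, True, False, False]))
```

The sort dominates the cost, so the sketch keeps the O(n log n) complexity stated for the algorithm.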

Ensembles

[Figure: three panels (Statistical, Computational, Representational), each showing a hypothesis space H with individual hypotheses h1-h4 and the target f]

[Figure: pages from Thomas G. Dietterich, "Comparing Supervised Classification Learning Algorithms". Figure 1: a taxonomy of statistical questions in machine learning; the boxed node (Question 8) is the subject of the article. Figure 5: probability of type I error for each statistical test; the four adjacent bars for each test represent the probability of type I error for ε = 0.10, 0.20, 0.30, and 0.40; error bars show 95% confidence intervals for these probabilities; the horizontal dotted line shows the target probability of 0.05. Figure 7: type I error rates for four statistical tests; the three bars within each test correspond to the EXP6, letter recognition, and Pima data sets.]

Question 1: Suppose we are given a large sample of data and a classifier C.
The classifier C may have been constructed using part of the data, but there
are enough data remaining for a separate test set. Hence, we can measure
the accuracy of C on the test set and construct a binomial confidence interval
(Snedecor & Cochran, 1989; Efron & Tibshirani, 1993; Kohavi, 1995). Note
that in Question 1, the classifier could have been produced by any method
(e.g., interviewing an expert); it need not have been produced by a learning
algorithm.
Question 2: Given a small data set, S, suppose we apply learning algorithm
A to S to construct classifier C_A. How accurately will C_A classify new
examples? Because we have no separate test set, there is no direct way to
answer this question. A frequently applied strategy is to convert this question
into Question 6: Can we predict the accuracy of algorithm A when it is
trained on randomly selected data sets of (approximately) the same size as
S? If so, then we can predict the accuracy of C_A, which was obtained from
training on S.

T. Dietterich (1998, 2000)

28
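
The binomial confidence interval mentioned in Question 1 can be approximated in a few lines. The sketch below uses the normal (Wald) approximation with z = 1.96 for a 95% interval, which is an assumption made here for brevity; the quoted text does not say which interval construction is meant, and the approximation is poor for small test sets or accuracies close to 0 or 1.

```python
import math


def accuracy_confidence_interval(correct, total, z=1.96):
    """Normal-approximation confidence interval for test-set accuracy.

    correct: number of correctly classified test examples.
    total: size of the separate test set.
    z: standard normal quantile (1.96 corresponds to roughly 95%).
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    accuracy = correct / total
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / total)
    # Clip to the valid range of an accuracy value.
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


if __name__ == "__main__":
    low, high = accuracy_confidence_interval(90, 100)
    print(f"accuracy 0.90, interval [{low:.3f}, {high:.3f}]")
```

The interval narrows with the square root of the test-set size, which is why a large separate test set is required in Question 1.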

Testing on multiple
datasets

[Figure: pages from J. Demšar (2006), "Statistical Comparisons of Classifiers over Multiple Data Sets"]

Table 1: An overview of the papers accepted to International Conference on Machine Learning
in years 1999-2003. The reported percentages (the third line and below) apply to the
number of papers relevant for our study.

                                          1999  2000  2001  2002  2003
Total number of papers                      54   152    80    87   118
Relevant papers for our study               19    45    25    31    54
Sampling method [%]
  cross validation, leave-one-out           22    49    44    42    56
  random resampling                         11    29    44    32    54
  separate subset                            5    11     0    13     9
Score function [%]
  classification accuracy                   74    67    84    84    70
  classification accuracy - exclusively     68    60    80    58    67
  recall, precision...                      21    18    16    25    19
  ROC, AUC                                   0     4     4    13     9
  deviations, confidence intervals          32    42    48    42    19
Overall comparison of classifiers [%]       53    44    44    26    45
  averages over the data sets                0     4     6     0    10
  t-test to compare two algorithms          16    11     4     6     7
  pairwise t-test one vs. others             5    11    16     3     7
  pairwise t-test each vs. each             16    13     4     6     4
  counts of wins/ties/losses                 5     4     0     6     9
  counts of significant wins/ties/losses    16     4     8    16     6

Table 2: Comparison of AUC for C4.5 with m = 0 and C4.5 with m tuned for the optimal AUC. The
columns on the right-hand illustrate the computation and would normally not be published
in an actual paper.

                          C4.5   C4.5+m  difference  rank
adult (sample)            0.763  0.768   +0.005       3.5
breast cancer             0.599  0.591   −0.008       7
breast cancer wisconsin   0.954  0.971   +0.017       9
cmc                       0.628  0.661   +0.033      12
ionosphere                0.882  0.888   +0.006       5
iris                      0.936  0.931   −0.005       3.5
liver disorders           0.661  0.668   +0.007       6
lung cancer               0.583  0.583    0.000       1.5
lymphography              0.775  0.838   +0.063      14
mushroom                  1.000  1.000    0.000       1.5
primary tumor             0.940  0.962   +0.022      11
rheum                     0.619  0.666   +0.047      13
voting                    0.972  0.981   +0.009       8
wine                      0.957  0.978   +0.021      10

Despite the repetitive warnings against multiple hypotheses testing, the Bonferroni
correction is used only in a few ICML papers annually. A common non-parametric
approach is to count the number of times an algorithm performs better, worse or equally
to the others; counting is sometimes pairwise, resulting in a matrix of wins/ties/losses
count, and the alternative is to count the number of data sets on which the algorithm
outperformed all the others.

J. Demšar (2006)

29
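
The difference and rank columns of Table 2 are the ingredients of a Wilcoxon signed-ranks comparison of two classifiers over multiple data sets. The sketch below recomputes the ranks and the two rank sums from the tabulated differences. It assumes that tied absolute differences receive their average rank and that the ranks of zero differences are split evenly between the two sums; the differences are stored in thousandths so that ties are detected exactly, and all names are hypothetical.

```python
# AUC differences (C4.5+m minus C4.5) from Table 2, in thousandths.
DIFFERENCES = [5, -8, 17, 33, 6, -5, 7, 0, 63, 0, 22, 47, 9, 21]


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def signed_rank_sums(differences):
    """Return (R_plus, R_minus), splitting the ranks of zero differences evenly."""
    ranks = average_ranks([abs(d) for d in differences])
    r_plus = sum(r for d, r in zip(differences, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(differences, ranks) if d < 0)
    zero_share = sum(r for d, r in zip(differences, ranks) if d == 0) / 2
    return r_plus + zero_share, r_minus + zero_share


if __name__ == "__main__":
    print(average_ranks([abs(d) for d in DIFFERENCES]))
    print(signed_rank_sums(DIFFERENCES))
```

For the Table 2 values the recomputed ranks match the rank column, and the rank sums come out as 93 and 12; the smaller sum is the statistic that would be compared against a table of critical values.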
