Crowd computing: All your base are belong to us


David C. Thompson

What is about to happen

•  Some background on:

–  me

–  competition

•  Crowdsourced science through the ‘ages’

•  The data set

•  The Kaggle process

•  An overview of the competition

•  The models and implementation

•  What we have learnt

Behold! Let the science begin …

about.me/dcthompson

My favourite papers from each period:

[1] J. Chem. Phys. 122, 124107 (2005)

[2] J. Chem. Phys. 128, 224103 (2008)

[3] J. Chem. Inf. Model. 49, 1889 (2009)

[4] J. Chem. Inf. Model. 51, 93 (2011)

A funny thing happened at my 1st external communications conference …

Or …

A heart-wrenching tale of man versus coffee machine …

Or …

How an external networking opportunity brought some ‘gamification’ to research


Do real science, at home.

What happens when you search for ‘blindfolded archery’

I never make predictions. And I never will*

Lots of opportunity to translate problems, from all fields, into systems with gaming elements

•  Goal – What do you hope to achieve by playing the game?

•  Rules – The limitations on how you can achieve the goals

•  Feedback – How close are you to achieving your goal?

•  Voluntary participation – Everyone playing the game accepts the goals, the rules, and the feedback

* Paul Gascoigne

http://janemcgonigal.com/

What you should know about this exercise

•  We wanted to investigate the utility of the process

•  We wanted to move with speed

•  We wanted to use a data set the scientific community had previously seen

•  We wanted to be inclusive – no domain expertise needed

Shameless slide reuse …*

“All models are wrong, but some models are useful” – G. E. P. Box

“…the validity of any given model is of limited scope, as is the case with any mental construct that we have about what our molecules are doing, whether we used a software package or waved our hands around in the air.” – D. Lowe

Simulation and its discontents, Sherry Turkle, Cambridge, MA: MIT Press (2009)

* D. C. Thompson et al. Schrödinger Regional User Meeting, New York, NY 2009

The data set

•  Version 2 of the Hansen Ames mutagenicity data was used

•  The following protocol was observed:

What happened                          # of molecules (removed)
Download smiles                        6512
Conversion with Corina                 6503 (9)
Remove non-zero formal charge          6419 (84)
Remove if more than 99 atoms           6414 (5)
Remove if contain undesirable atoms*   6252 (162)

http://doc.ml.tu-berlin.de/toxbenchmark/

J. Chem. Inf. Model. 49, 2077 (2009)

* D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn
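
The three removal rules on this slide (non-zero formal charge, more than 99 atoms, undesirable atoms) can be sketched as a chain of filters. This is a minimal illustration only: the record layout (`formal_charge`, `elements`) and the toy molecules are hypothetical, and the real protocol ran on structures converted with Corina, which is not reproduced here. Whether the 99-atom limit counted hydrogens is also an assumption.

```python
# Elements listed in the slide's footnote as undesirable.
UNDESIRABLE = {"D", "B", "Al", "P", "Ga", "Si", "Ge", "Sn", "As", "Sb",
               "Se", "Te", "At", "He", "Ne", "Ar", "Kr", "Xe", "Rn"}
MAX_ATOMS = 99  # "Remove if more than 99 atoms"


def filter_molecules(molecules):
    """Apply the removal rules in order; return survivors and per-rule removal counts."""
    rules = [
        ("non-zero formal charge", lambda m: m["formal_charge"] != 0),
        ("more than 99 atoms", lambda m: len(m["elements"]) > MAX_ATOMS),
        ("undesirable atoms", lambda m: any(e in UNDESIRABLE for e in m["elements"])),
    ]
    removed = {}
    kept = list(molecules)
    for name, is_bad in rules:
        survivors = [m for m in kept if not is_bad(m)]
        removed[name] = len(kept) - len(survivors)
        kept = survivors
    return kept, removed


# Hypothetical toy records, one per rule plus one that passes everything.
toy = [
    {"name": "benzene-like", "formal_charge": 0, "elements": ["C"] * 6 + ["H"] * 6},
    {"name": "charged", "formal_charge": 1, "elements": ["N", "H", "H", "H", "H"]},
    {"name": "too-big", "formal_charge": 0, "elements": ["C"] * 100},
    {"name": "silane-like", "formal_charge": 0, "elements": ["Si", "H", "H", "H", "H"]},
]
kept, removed = filter_molecules(toy)
```

Counting removals per rule mirrors the "(removed)" column of the protocol, so each step can be checked against the running total.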

Descriptor calculation

SD file, descriptor calculation – 6252 x 5030

–  Filter for low variance (≤ 0.01); removed 2537

–  Remove for high correlation (> 0.90); removed 716

–  Descriptor normalization resulted in 6252 x 1777

.csv file
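
A rough sketch of the two descriptor filters named on this slide, using the stated cut-offs (variance ≤ 0.01 removed, correlation > 0.90 removed). The slide does not say how correlated pairs were resolved, so the greedy "keep the first, drop later near-duplicates" rule, the use of absolute Pearson correlation, and the toy columns are all assumptions.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_descriptors(columns, var_cutoff=0.01, corr_cutoff=0.90):
    """Return the names of descriptor columns that survive both filters."""
    # Step 1: drop near-constant descriptors (variance <= cut-off).
    informative = {name: vals for name, vals in columns.items()
                   if pvariance(vals) > var_cutoff}
    # Step 2: greedily keep a descriptor only if it is not highly
    # correlated with any descriptor already kept.
    kept = {}
    for name, vals in informative.items():
        if all(abs(pearson(vals, other)) <= corr_cutoff for other in kept.values()):
            kept[name] = vals
    return list(kept)


# Hypothetical toy descriptor matrix, stored column-wise.
toy_columns = {
    "a": [0, 1, 2, 3],
    "b": [0, 2, 4, 6.1],   # nearly a multiple of "a"
    "c": [5, 5, 5, 5],     # constant
    "d": [1, 0, 1, 0],
}
surviving = filter_descriptors(toy_columns)
```

Here "c" falls to the variance filter and "b" to the correlation filter, leaving "a" and "d".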

Descriptor Engine                   # of descriptors
MOE 2D                              76 (186)
Atom Pair                           696 (1920)
MolConn-Z                           174 (745)
Pipeline Pilot Property Counts      5 (130)
Daylight fingerprints               825 (2048)
clogP                               0 (1)

J. Chem. Inf. Model. 49, 2077 (2009)

Testing Framework

•  Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.

•  Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.

“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
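
To make the public/private idea concrete, here is a small sketch of partitioning a held-out test set into the two leaderboard splits. The 25% public fraction, the seed, and the function name are illustrative assumptions; the slide does not state the proportions used in this competition.

```python
import random


def split_test_set(test_ids, public_fraction=0.25, seed=0):
    """Randomly partition test-set IDs into public and private leaderboard splits."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(test_ids)
    rng.shuffle(shuffled)
    n_public = int(len(shuffled) * public_fraction)
    public = set(shuffled[:n_public])   # scored live during the competition
    private = set(shuffled[n_public:])  # scored only once the competition closes
    return public, private


public_ids, private_ids = split_test_set(range(1000))
```

Because participants only ever see scores on the public split, the private split stays an honest estimate of generalization error.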

Expectations

“Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”

•  20 models generated with different algorithms and descriptors

•  Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set

•  Inter-laboratory accuracy for Ames test reported at 85%

Expectation: Models should have similar accuracy to literature

Goal: Models should be balanced; sensitivity and specificity should be high

J. Chem. Inf. Model. 50, 2094 (2010)

http://www.kaggle.com/c/bioresponse

Performance as a function of time

796 players

703 teams

8841 entries

55 forum topics, 409 posts

\[ \text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \]
Final Ranking   Team Name                            Public Ranking   Δ (log loss)
1               Winter is Coming & Sergey            11               0
2               seelary                              26               7E-05
3               bluehat                              1                0.00051
4               jazz                                 15               0.0014
5               Wayne Zhang & Gxav & woshialex       19               0.00146
6               Indy Actuaries                       38               0.00184
7               bluemaster & imran                   7                0.00231
8               Efiimov & Bers & Cragin & vsu        4                0.00241
9               y_tag                                18               0.0026
10              Killian O’Connor                     44               0.00285
11              PlanetThanet & SirGuessalot          40               0.00298
12              AussieTim                            48               0.00335
13              Jason Farmer                         31               0.00347
14              GreenPeace                           16               0.00356
15              mars                                 32               0.00388
16              Fuzzify                              60               0.00392
17              Emanuele                             63               0.00395
18              HappyHour                            10               0.00431
19              Baltic                               30               0.00465
20              dejavu                               20               0.00482
352             Random Forest Benchmark              373              0.04184
541             Support Vector Machine Benchmark     522              0.12147
647             Optimized Constant Value Benchmark   638              0.31414
650             Uniform Benchmark                    642              0.31959

https://github.com/emanuele/kaggle_pbr

https://github.com/benhamner/BioResponse

#FTW Strategies

•  Feature selection

–  All three winning teams identified D27 as important. What is it? Organon toxicophore*

•  RF + complementary approaches

•  Blending

* J. Med. Chem. 48, 312 (2005)

“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
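
As an illustration of the "Blending" strategy, the sketch below averages the predicted probabilities of several models, optionally with weights. It assumes blending here means a weighted average of per-molecule probabilities; the winning teams' actual blending schemes are not described on the slide, and the model outputs shown are made up.

```python
def blend(predictions, weights=None):
    """Weighted average of per-sample probabilities from several models."""
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models      # default: plain average
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


# Hypothetical probabilities from a random forest and one complementary model.
rf_probs = [0.9, 0.2, 0.6]
other_probs = [0.7, 0.4, 0.8]
blended = blend([rf_probs, other_probs])
```

Averaging tends to help most when the blended models make different kinds of errors, which fits the "RF + complementary approaches" bullet.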

Private Set Performance

Confusion matrix layout:

TP   FN
FP   TN

Se: TP/(TP+FN)

Sp: TN/(FP+TN)

CCR: (Se + Sp)/2

Group           Model     TP    FN    FP    TN    Se     Sp     CCR
Benchmarks      RF        888   150   166   672   0.86   0.80   0.83
Benchmarks      SVM       822   216   215   673   0.79   0.74   0.77
Winning Teams   Team 1    873   165   151   687   0.84   0.82   0.83
Winning Teams   Team 2    888   150   165   673   0.86   0.80   0.83
Winning Teams   Team 3    893   145   162   676   0.86   0.80   0.83
Other           Team 17   896   142   169   669   0.86   0.80   0.83
Other           D27       781   257   215   623   0.75   0.74   0.75
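
The Se, Sp and CCR definitions on this slide translate directly into code. The sketch below applies them to the Team 1 confusion matrix from the slide (TP 873, FN 165, FP 151, TN 687); the function name is an illustrative choice.

```python
def classification_rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, correct classification rate)."""
    se = tp / (tp + fn)        # Se: TP/(TP+FN)
    sp = tn / (fp + tn)        # Sp: TN/(FP+TN)
    ccr = (se + sp) / 2        # CCR: (Se + Sp)/2
    return se, sp, ccr


# Team 1 counts as reported on the slide.
se, sp, ccr = classification_rates(tp=873, fn=165, fp=151, tn=687)
print(f"Se={se:.2f} Sp={sp:.2f} CCR={ccr:.2f}")  # prints Se=0.84 Sp=0.82 CCR=0.83
```

CCR weights the two classes equally, so it rewards the balanced models named as the goal earlier in the deck.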

Okay, where’s this ‘second’ web service?

BIpredict

Physicochemical properties are updated as molecule is built

Atomistic descriptor values are appended directly to the molecule


* D. C. Thompson Chemical Computing Group, User Group Meeting, Montreal, 2011

So, what did we learn?

•  Was this useful?

–  Yes

•  Participation was high, contributors and contributions were diverse*

•  A large number of models were of a high quality

–  Differences in top models in log loss metric are small

–  Different statistical measures lead to different rankings

–  RandomForest benchmark has high correct classification rate (CCR)

* Sort of

‘Machine learning that matters’

Machine learning skill

Domain expertise

Kiri L. Wagstaff. Machine Learning that Matters. Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012. (CL #12-2026)

Know your meme

http://roflcon.org/

http://katemiltner.com/

Thanks to:

Lilly Ackley

Ben Hamner

Amy Kunkel

Mehul Patel

Alex Renner, PhD

All Kaggle participants – esp. Winter is Coming & Sergey
