Results from the Boehringer Ingelheim Pharmaceuticals, Inc. 'Predicting a biological response' Kaggle competition. The central thesis is that many problems can be framed with gaming elements, lowering the barrier to participation and increasing engagement. Presented at the Bio-IT Cloud Summit, Data-focused Cloud Applications session, Sept. 12-13, Hotel Kabuki, San Francisco, CA.
3. What is about to happen
• Some background on:
– me
– competition
• Crowdsourced science through the ‘ages’
• The data set
• The Kaggle process
• An overview of the competition
• The models and implementation
• What we have learnt
6. about.me/dcthompson
My favourite papers from each period:
[1] J. Chem. Phys. 122, 124107 (2005)
[2] J. Chem. Phys. 128, 224103 (2008)
[3] J. Chem. Inf. Model. 49, 1889 (2009)
[4] J. Chem. Inf. Model. 51, 93 (2011)
7. A funny thing happened at my 1st external communications conference …
Or … A heart-wrenching tale of man versus coffee machine …
Or … How an external networking opportunity brought some ‘gamification’ to research
11. I never make predictions. And I never will*
Lots of opportunity to translate problems, from all fields, into systems with gaming elements
• Goal – What do you hope to achieve by playing the game?
• Rules – The limitations on how you can achieve the goals
• Feedback – How close are you to achieving your goal?
• Voluntary participation – Everyone playing the game accepts the goals, the rules, and the feedback
* Paul Gascoigne
http://janemcgonigal.com/
15. What you should know about this exercise
• We wanted to investigate the utility of the process
• We wanted to move with speed
• We wanted to use a data set the scientific community had previously seen
• We wanted to be inclusive – no domain expertise needed
16. Shameless slide reuse … *
“All models are wrong, but some models are useful” – G. E. P. Box
“…the validity of any given model is of limited scope, as is the case with any mental construct that we have about what our molecules are doing, whether we used a software package or waved our hands around in the air.” – D. Lowe
Simulation and its discontents, Sherry Turkle, Cambridge, MA: MIT Press (2009)
* D. C. Thompson et al. Schrödinger Regional User Meeting, New York, NY 2009
17. The data set
• Version 2 of the Hansen Ames mutagenicity data was used
• The following protocol was observed:

What happened                          # of molecules (removed)
Download SMILES                        6512
Conversion with Corina                 6503 (9)
Remove non-zero formal charge          6419 (84)
Remove if more than 99 atoms           6414 (5)
Remove if contain undesirable atoms*   6252 (162)

http://doc.ml.tu-berlin.de/toxbenchmark/
J. Chem. Inf. Model. 49, 2077 (2009)
* D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn
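The filtering steps in the protocol can be sketched as a chain of keep/drop predicates applied in order, recording how many molecules survive each step and how many are removed. This is a minimal sketch, not the pipeline used for the competition: the `Molecule` record, its fields, and the toy input are hypothetical, and it assumes "formal charge" means the net charge of the molecule and that the 99-atom limit counts every atom, hydrogens included.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Elements listed in the slide footnote as undesirable.
UNDESIRABLE = {"D", "B", "Al", "P", "Ga", "Si", "Ge", "Sn", "As", "Sb",
               "Se", "Te", "At", "He", "Ne", "Ar", "Kr", "Xe", "Rn"}


@dataclass
class Molecule:
    """Hypothetical pre-parsed record; not a type from any real toolkit."""
    name: str
    atoms: List[str]      # element symbols, one entry per atom
    formal_charge: int    # assumed to be the net formal charge


# Each step is (label, predicate); the predicate returns True to keep.
STEPS: List[Tuple[str, Callable[[Molecule], bool]]] = [
    ("Remove non-zero formal charge", lambda m: m.formal_charge == 0),
    ("Remove if more than 99 atoms", lambda m: len(m.atoms) <= 99),
    ("Remove if contain undesirable atoms",
     lambda m: not UNDESIRABLE.intersection(m.atoms)),
]


def run_protocol(mols):
    """Apply the steps in order; return survivors and (label, kept, removed) rows."""
    report = []
    for label, keep in STEPS:
        kept = [m for m in mols if keep(m)]
        report.append((label, len(kept), len(mols) - len(kept)))
        mols = kept
    return mols, report


# Toy input, invented for illustration only.
toy = [
    Molecule("benzene", ["C"] * 6 + ["H"] * 6, 0),
    Molecule("acetate", ["C", "C", "O", "O", "H", "H", "H"], -1),
    Molecule("big", ["C"] * 100, 0),
    Molecule("silane", ["Si", "H", "H", "H", "H"], 0),
]
survivors, report = run_protocol(toy)
for label, kept, removed in report:
    print(f"{label}: {kept} ({removed})")
```

The report rows use the same "remaining (removed)" layout as the slide, so a real run could be checked against the published counts step by step.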
19. Testing Framework
• Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.
• Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
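The public/private arrangement can be illustrated by partitioning the test rows once and scoring every submission on each part separately. This is a sketch under stated assumptions: the 25% public fraction, the fixed seed, the toy labels, and the use of plain accuracy as the score are all illustrative choices, not details of the actual competition.

```python
import random


def split_test_set(n_rows, public_fraction=0.25, seed=0):
    """Randomly partition test-row indices into public and private subsets.

    public_fraction and seed are illustrative assumptions.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_public = int(round(n_rows * public_fraction))
    return sorted(indices[:n_public]), sorted(indices[n_public:])


def accuracy(y_true, y_pred, indices):
    """Fraction of correct labels over the given subset of rows."""
    return sum(y_true[i] == y_pred[i] for i in indices) / len(indices)


# Toy labels and one submission (hypothetical data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

public, private = split_test_set(len(y_true))
public_score = accuracy(y_true, y_pred, public)    # visible during the competition
private_score = accuracy(y_true, y_pred, private)  # revealed only at the end
print(public_score, private_score)
```

Because participants only ever see the public score, tuning against it does not leak information about the private rows that decide the final ranking.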
20. Expectations
“Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”
• 20 models generated with different algorithms and descriptors
• Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set
• Inter-laboratory accuracy for Ames test reported at 85%
Expectation: Models should have similar accuracy to literature
Goal: Models should be balanced; sensitivity and specificity should be high
J. Chem. Inf. Model. 50, 2094 (2010)
23. Performance as a function of time
796 players
703 teams
8841 entries
55 forum topics, 409 posts
$$\text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
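The log loss metric on this slide is the mean binary cross-entropy between the true labels and the predicted probabilities. A minimal sketch follows; clipping the probabilities with `eps` is an assumption added here to avoid taking log(0), and the slide does not say whether the competition scorer did the same.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; eps clipping is an assumption, see lead-in."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# An uninformative prediction of 0.5 everywhere scores ln(2).
print(log_loss([1, 0], [0.5, 0.5]))
```

Lower is better, and a confident wrong answer is penalised far more heavily than a hesitant one, which is why the metric rewards well-calibrated probabilities rather than just correct labels.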
25. #FTW Strategies
• Feature selection
All three winning teams identified D27 as important. What is it? Organon toxicophore*
• RF + complementary approaches
• Blending
* J. Med. Chem. 49, 312 (2005)
“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
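The slide does not say how the winning teams blended their models. One common form of blending is a weighted average of the predicted probabilities from several models, sketched below. The `blend` helper, the equal default weights, and the example probabilities are all hypothetical.

```python
def blend(prob_lists, weights=None):
    """Weighted average of per-model predicted probabilities, row by row."""
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0] * n_models  # equal weights by default (an assumption)
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")
    n_rows = len(prob_lists[0])
    if any(len(probs) != n_rows for probs in prob_lists):
        raise ValueError("all models must score the same rows")
    total = float(sum(weights))
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_rows)
    ]


rf_probs = [0.9, 0.2, 0.6]     # hypothetical random forest outputs
other_probs = [0.7, 0.4, 0.2]  # hypothetical outputs from a complementary model
blended = blend([rf_probs, other_probs])
print(blended)
```

Averaging tends to help most when the models make different kinds of errors, which fits the "RF + complementary approaches" bullet.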
26. Private Set Performance
Se: TP/(TP+FN)
Sp: TN/(FP+TN)
CCR: (Se + Sp)/2

Group          Model     TP    FN    FP    TN    Se     Sp     CCR
Benchmarks     RF        888   150   166   672   0.86   0.80   0.83
Benchmarks     SVM       822   216   215   673   0.79   0.74   0.77
Benchmarks     D27       781   257   215   623   0.75   0.74   0.75
Winning Teams  Team 1    873   165   151   687   0.84   0.82   0.83
Winning Teams  Team 2    888   150   165   673   0.86   0.80   0.83
Winning Teams  Team 3    893   145   162   676   0.86   0.80   0.83
Other          Team 17   896   142   169   669   0.86   0.80   0.83
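The three rates defined on this slide follow directly from the confusion-matrix counts. The sketch below applies the slide's formulas to the RF benchmark, assuming the first pair of numbers listed for each model is (TP, FN) and the second pair is (FP, TN), as the TP/FN over FP/TN legend suggests. The function name is hypothetical.

```python
def classification_rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, correct classification rate)."""
    se = tp / (tp + fn)   # Se: TP/(TP+FN)
    sp = tn / (fp + tn)   # Sp: TN/(FP+TN)
    ccr = (se + sp) / 2   # CCR: (Se + Sp)/2
    return se, sp, ccr


# Counts for the RF benchmark as listed on the slide.
se, sp, ccr = classification_rates(tp=888, fn=150, fp=166, tn=672)
print(f"Se={se:.2f} Sp={sp:.2f} CCR={ccr:.2f}")  # Se=0.86 Sp=0.80 CCR=0.83
```

CCR weights the two classes equally regardless of how many mutagens and non-mutagens the set contains, which matches the stated goal of balanced models.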
27. Okay, where’s this ‘second’ web service?
BIpredict
Physicochemical properties are updated as molecule is built
Atomistic descriptor values are appended directly to the molecule
* D. C. Thompson Chemical Computing Group, User Group Meeting, Montreal, 2011
28. So, what did we learn?
• Was this useful?
– Yes
• Participation was high, contributors and contributions were diverse*
• A large number of models were of a high quality
– Differences in top models in log loss metric are small
– Different statistical measures lead to different rankings
– RandomForest benchmark has high correct classification rate (CCR)
* Sort of
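The point that different statistical measures lead to different rankings can be seen on a toy case: log loss rewards calibrated probabilities, while CCR only looks at thresholded labels. In the sketch below the two "models" and their outputs are invented for illustration, and the 0.5 decision threshold is an assumption.

```python
import math


def log_loss(y_true, y_prob):
    """Mean binary cross-entropy (inputs here stay strictly inside (0, 1))."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    ) / len(y_true)


def ccr(y_true, y_prob, threshold=0.5):
    """Correct classification rate: mean of sensitivity and specificity."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, q in zip(y_true, y_pred) if y == 1 and q == 1)
    tn = sum(1 for y, q in zip(y_true, y_pred) if y == 0 and q == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return (tp / n_pos + tn / n_neg) / 2


y_true = [1, 1, 0, 0]
model_a = [0.90, 0.90, 0.10, 0.99]  # mostly right, one very confident error
model_b = [0.60, 0.45, 0.55, 0.40]  # hedged, two mild errors

for name, probs in (("A", model_a), ("B", model_b)):
    print(name, round(log_loss(y_true, probs), 3), ccr(y_true, probs))
```

Here model A wins on CCR while model B wins on log loss, so the choice of metric decides which one ranks first.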
29. ‘Machine learning that matters’
[Figure labels: machine learning skill, domain expertise]
Kiri L. Wagstaff. Machine Learning that Matters. Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012.