Crowd Computing: All Your Base Are Belong to Us
David C. Thompson
  
What is about to happen
•  Some background on:
     –  me
     –  competition
•  Crowdsourced science through the ‘ages’
•  The data set
•  The Kaggle process
•  An overview of the competition
•  The models and implementation
•  What we have learnt
  
Behold! Let the science begin …
http://amzn.to/OyQMVf
about.me/dcthompson
  




My favourite papers from each period:
[1] J. Chem. Phys. 122, 124107 (2005)
[2] J. Chem. Phys. 128, 224103 (2008)
[3] J. Chem. Inf. Model. 49, 1889 (2009)
[4] J. Chem. Inf. Model. 51, 93 (2011)
  
A funny thing happened at my 1st external communications conference …
Or …
A heart-wrenching tale of man versus coffee machine …
Or …
How an external networking opportunity brought some ‘gamification’ to research
http://www.taviscurry.com/
  
Do real science, at home.
What happens when you search for ‘blindfolded archery’
  
I never make predictions. And I never will*
Lots of opportunity to translate problems, from all fields, into systems with gaming elements
•  Goal – What do you hope to achieve by playing the game?
•  Rules – The limitations on how you can achieve the goals
•  Feedback – How close are you to achieving your goal?
•  Voluntary participation – Everyone playing the game accepts the goals, the rules, and the feedback
* Paul Gascoigne
http://janemcgonigal.com/
http://fold.it/portal/
  
What you should know about this exercise
•  We wanted to investigate the utility of the process
•  We wanted to move with speed
•  We wanted to use a data set the scientific community had previously seen
•  We wanted to be inclusive – no domain expertise needed
  
Shameless slide reuse … *
“All models are wrong, but some models are useful” – G. E. P. Box
“…the validity of any given model is of limited scope, as is the case with any mental construct that we have about what our molecules are doing, whether we used a software package or waved our hands around in the air.” – D. Lowe
Simulation and its discontents, Sherry Turkle, Cambridge, MA: MIT Press (2009)
* D. C. Thompson et al. Schrödinger Regional User Meeting, New York, NY 2009
  
The data set
•  Version 2 of the Hansen Ames mutagenicity data was used
•  The following protocol was observed:

What happened                             # of molecules (removed)
Download SMILES                           6512
Conversion with Corina                    6503 (9)
Remove non-zero formal charge             6419 (84)
Remove if more than 99 atoms              6414 (5)
Remove if containing undesirable atoms*   6252 (162)

http://doc.ml.tu-berlin.de/toxbenchmark/
J. Chem. Inf. Model. 49, 2077 (2009)
* D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn
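The filtering protocol on this slide can be sketched in a few lines of Python. This is only an illustration: the record format (a dict with `formal_charge` and `elements`), the reading of "formal charge" as a per-molecule net charge, and whether hydrogens count towards the 99-atom limit are all assumptions, and the Corina conversion step is not modelled. The counts in `FUNNEL` are the ones reported on the slide.

```python
# Undesirable elements, copied from the slide footnote.
UNDESIRABLE = {"D", "B", "Al", "P", "Ga", "Si", "Ge", "Sn", "As", "Sb",
               "Se", "Te", "At", "He", "Ne", "Ar", "Kr", "Xe", "Rn"}
MAX_ATOMS = 99

# (step, molecules remaining, molecules removed), as reported on the slide.
FUNNEL = [
    ("Download SMILES", 6512, 0),
    ("Conversion with Corina", 6503, 9),
    ("Remove non-zero formal charge", 6419, 84),
    ("Remove if more than 99 atoms", 6414, 5),
    ("Remove if containing undesirable atoms", 6252, 162),
]


def keep_molecule(record):
    """Apply the three removal rules to one pre-parsed record.

    `record` is a hypothetical dict: {"formal_charge": int, "elements": [str, ...]}.
    """
    if record["formal_charge"] != 0:
        return False
    if len(record["elements"]) > MAX_ATOMS:
        return False
    if UNDESIRABLE.intersection(record["elements"]):
        return False
    return True


def check_funnel(steps):
    """Return True if every count equals the previous count minus its removals."""
    for (_, prev_n, _), (_, n, removed) in zip(steps, steps[1:]):
        if prev_n - removed != n:
            return False
    return True


def total_removed(steps):
    """Total number of molecules removed across all steps."""
    return sum(removed for _, _, removed in steps)
```

`check_funnel(FUNNEL)` confirms that the reported counts are internally consistent, with 260 molecules removed in total between download and the final set.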
  
Descriptor calculation
SD file, descriptor calculation – 6252 x 5030
     –  Filter for low variance (≤ 0.01); removed 2537
     –  Remove for high correlation (> 0.90); removed 716
     –  Descriptor normalization resulted in a 6252 x 1777 .csv file

Descriptor Engine                 # of descriptors
MOE 2D                            76 (186)
Atom Pair                         696 (1920)
MolConn-Z                         174 (745)
Pipeline Pilot Property Counts    5 (130)
Daylight fingerprints             825 (2048)
clogP                             0 (1)

J. Chem. Inf. Model. 49, 2077 (2009)
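A minimal sketch of the two filters and the normalization step, in plain Python. The slide gives only the thresholds (variance ≤ 0.01, correlation > 0.90). Everything else here is an assumption: the use of population variance, the greedy keep-first rule for correlated pairs, absolute Pearson correlation, and z-score normalization. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def filter_descriptors(columns, var_cut=0.01, corr_cut=0.90):
    """Return the names of the descriptor columns that survive both filters.

    `columns` maps a descriptor name to its list of values (one per molecule).
    """
    # Step 1: drop descriptors whose variance is at or below the cut-off.
    varied = [name for name, vals in columns.items() if pvariance(vals) > var_cut]
    # Step 2: greedy pass; drop a descriptor if it is highly correlated
    # with any descriptor that has already been kept.
    kept = []
    for name in varied:
        if all(abs(pearson(columns[name], columns[k])) <= corr_cut for k in kept):
            kept.append(name)
    return kept


def zscore(values):
    """Normalize one descriptor column to zero mean and unit variance."""
    m = mean(values)
    s = sqrt(pvariance(values))
    return [(v - m) / s for v in values]
```

The slide's own bookkeeping is consistent with this order of operations: 5030 − 2537 − 716 = 1777 columns.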
  
Testing Framework
•  Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.
•  Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
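The public/private split can be illustrated with a short Python sketch. The slides do not say how the test set was divided, so the random shuffle, the 25% public fraction, and the function names below are assumptions made for illustration only.

```python
import random


def split_test_set(ids, public_fraction=0.25, seed=0):
    """Randomly partition test-set ids into (public, private) lists.

    The public fraction is an assumed value, not one taken from the slides.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_public = int(len(shuffled) * public_fraction)
    return shuffled[:n_public], shuffled[n_public:]


def accuracy(y_true, y_pred, ids):
    """Fraction of correct labels over the given ids (dicts keyed by id)."""
    return sum(y_true[i] == y_pred[i] for i in ids) / len(ids)
```

During the competition a submission would be scored only on the `public` ids. The `private` ids are scored once at the end, which is why the final ranking can differ from the public one.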
  
Expectations
“Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”
•  20 models generated with different algorithms and descriptors
•  Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set
•  Inter-laboratory accuracy for Ames test reported at 85%
Expectation: Models should have similar accuracy to literature
Goal: Models should be balanced; sensitivity and specificity should be high
J. Chem. Inf. Model. 50, 2094 (2010)
http://www.kaggle.com/c/bioresponse
  
Performance as a function of time
796 players
703 teams
8841 entries
55 forum topics, 409 posts

\[ \text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \]
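The log loss metric on this slide is straightforward to compute. The sketch below is a plain-Python version. Clipping predictions with `eps` is a common practical safeguard against log(0) and is an assumption here, not something stated on the slide.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss: the mean negative log-likelihood of the true labels.

    y_true holds 0/1 labels; y_prob holds predicted probabilities of class 1.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip so that log(0) can never occur.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

Predicting 0.5 for every molecule gives a log loss of ln 2 regardless of the labels. A confident wrong prediction is penalised much more heavily than a hesitant one.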
  
Final Ranking   Team Name                               Public Ranking   Δ (log loss)
1               Winter is Coming & Sergey               11               0
2               seelary                                 26               7E-05
3               bluehat                                 1                0.00051
4               jazz                                    15               0.0014
5               Wayne Zhang & Gxav & woshialex          19               0.00146
6               Indy Actuaries                          38               0.00184
7               bluemaster & imran                      7                0.00231
8               Efiimov & Bers & Cragin & vsu           4                0.00241
9               y_tag                                   18               0.0026
10              Killian O’Connor                        44               0.00285
11              PlanetThanet & SirGuessalot             40               0.00298
12              AussieTim                               48               0.00335
13              Jason Farmer                            31               0.00347
14              GreenPeace                              16               0.00356
15              mars                                    32               0.00388
16              Fuzzify                                 60               0.00392
17              Emanuele                                63               0.00395
18              HappyHour                               10               0.00431
19              Baltic                                  30               0.00465
20              dejavu                                  20               0.00482
352             Random Forest Benchmark                 373              0.04184
541             Support Vector Machine Benchmark        522              0.12147
647             Optimized Constant Value Benchmark      638              0.31414
650             Uniform Benchmark                       642              0.31959

https://github.com/emanuele/kaggle_pbr
https://github.com/benhamner/BioResponse
  
#FTW Strategies
•  Feature selection
     All three winning teams identified D27 as important.
     What is it?
     Organon toxicophore*
•  RF + complementary approaches
•  Blending
* J. Med. Chem. 49, 312 (2005)
“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
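The slide names blending without saying how the winning teams did it. One common form is a weighted average of the probabilities predicted by several models, sketched below. The function name and the weighting scheme are assumptions for illustration, not the winners' actual method.

```python
def blend(predictions, weights=None):
    """Weighted average of per-model probability lists.

    `predictions` is a list of lists, one list of probabilities per model,
    all of the same length. Equal weights are used when none are given.
    """
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    n_items = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_items)
    ]
```

For example, `blend([rf_probs, other_probs], weights=[3, 1])` would lean on a random forest while letting a complementary model adjust its predictions. Both variable names are hypothetical.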
  
Private Set Performance

Confusion matrix layout:
TP   FN
FP   TN

Se: TP/(TP+FN)
Sp: TN/(FP+TN)
CCR: (Se + Sp)/2

Group           Model     TP    FN    FP    TN    Se     Sp     CCR
Benchmarks      RF        888   150   166   672   0.86   0.80   0.83
Benchmarks      SVM       822   216   215   673   0.79   0.74   0.77
Winning Teams   Team 1    873   165   151   687   0.84   0.82   0.83
Winning Teams   Team 2    888   150   165   673   0.86   0.80   0.83
Winning Teams   Team 3    893   145   162   676   0.86   0.80   0.83
Other           Team 17   896   142   169   669   0.86   0.80   0.83
Other           D27       781   257   215   623   0.75   0.74   0.75
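The three measures defined on this slide follow directly from the confusion-matrix counts. As a quick check, the sketch below recomputes them for three of the rows, using the counts from the slide. The function names and the `COUNTS` dictionary are introduced here for illustration.

```python
def sensitivity(tp, fn):
    """Se = TP / (TP + FN): fraction of true positives recovered."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Sp = TN / (FP + TN): fraction of true negatives recovered."""
    return tn / (fp + tn)


def ccr(tp, fn, fp, tn):
    """Correct classification rate: the mean of Se and Sp."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


# (TP, FN, FP, TN) as reported on the slide.
COUNTS = {
    "RF benchmark": (888, 150, 166, 672),
    "Team 1": (873, 165, 151, 687),
    "D27": (781, 257, 215, 623),
}

if __name__ == "__main__":
    for name, (tp, fn, fp, tn) in COUNTS.items():
        print(name,
              round(sensitivity(tp, fn), 2),
              round(specificity(tn, fp), 2),
              round(ccr(tp, fn, fp, tn), 2))
```

Because CCR averages the two rates, a model cannot score well by favouring one class, which matches the stated goal of balanced models.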
  
Okay, where’s this ‘second’ web service?
BIpredict
Physicochemical properties are updated as the molecule is built
Atomistic descriptor values are appended directly to the molecule
* D. C. Thompson Chemical Computing Group, User Group Meeting, Montreal, 2011
  
So, what did we learn?
•  Was this useful?
     –  Yes
•  Participation was high, contributors and contributions were diverse*
•  A large number of models were of a high quality
     –  Differences in top models in log loss metric are small
     –  Different statistical measures lead to different rankings
     –  RandomForest benchmark has high correct classification rate (CCR)
* Sort of
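The point that different statistical measures lead to different rankings can be seen from the private-set confusion-matrix counts reported earlier in the deck. The sketch below is an illustration only: it uses unrounded values computed from those counts, and the helper names are hypothetical.

```python
# (TP, FN, FP, TN) on the private set, as reported in the deck.
COUNTS = {
    "RF benchmark": (888, 150, 166, 672),
    "Team 1": (873, 165, 151, 687),
    "Team 2": (888, 150, 165, 673),
    "Team 3": (893, 145, 162, 676),
    "Team 17": (896, 142, 169, 669),
}


def metrics(tp, fn, fp, tn):
    """Return sensitivity, specificity and CCR for one confusion matrix."""
    se = tp / (tp + fn)
    sp = tn / (fp + tn)
    return {"se": se, "sp": sp, "ccr": (se + sp) / 2}


def best_by(metric, counts=COUNTS):
    """Name of the model with the highest value of the chosen metric."""
    return max(counts, key=lambda name: metrics(*counts[name])[metric])


if __name__ == "__main__":
    for m in ("se", "sp", "ccr"):
        print(m, best_by(m))
```

On these counts each measure picks a different leader: Team 17 for sensitivity, Team 1 for specificity, and Team 3 for CCR, while Team 1 finished first on log loss.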
  
‘Machine learning that matters’
Domain expertise        Machine learning skill
Kiri L. Wagstaff. Machine Learning that Matters. Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012. (CL #12-2026)
  
Know your meme
http://roflcon.org/
http://katemiltner.com/
  
Thanks to:
Lilly Ackley
Ben Hamner
Amy Kunkel
Mehul Patel
Alex Renner, PhD
All Kaggle participants – esp. Winter is Coming & Sergey
  

Más contenido relacionado

Destacado

Grand Theft Engagement
Grand Theft EngagementGrand Theft Engagement
Grand Theft EngagementDavid Thompson
 
Connect With STEM - See It, Be It
Connect With STEM - See It, Be ItConnect With STEM - See It, Be It
Connect With STEM - See It, Be ItDavid Thompson
 
Thinking inside the box - What is the role of digital within the four walls o...
Thinking inside the box - What is the role of digital within the four walls o...Thinking inside the box - What is the role of digital within the four walls o...
Thinking inside the box - What is the role of digital within the four walls o...David Thompson
 
Competitive data science: A tale of two web services
Competitive data science: A tale of two web servicesCompetitive data science: A tale of two web services
Competitive data science: A tale of two web servicesDavid Thompson
 
Redressing the Baseline: Exploit vs. Explore
Redressing the Baseline: Exploit vs. ExploreRedressing the Baseline: Exploit vs. Explore
Redressing the Baseline: Exploit vs. ExploreDavid Thompson
 

Destacado (6)

Grand Theft Engagement
Grand Theft EngagementGrand Theft Engagement
Grand Theft Engagement
 
Connect With STEM - See It, Be It
Connect With STEM - See It, Be ItConnect With STEM - See It, Be It
Connect With STEM - See It, Be It
 
Thinking inside the box - What is the role of digital within the four walls o...
Thinking inside the box - What is the role of digital within the four walls o...Thinking inside the box - What is the role of digital within the four walls o...
Thinking inside the box - What is the role of digital within the four walls o...
 
Competitive data science: A tale of two web services
Competitive data science: A tale of two web servicesCompetitive data science: A tale of two web services
Competitive data science: A tale of two web services
 
Change Management 2.0
Change Management 2.0Change Management 2.0
Change Management 2.0
 
Redressing the Baseline: Exploit vs. Explore
Redressing the Baseline: Exploit vs. ExploreRedressing the Baseline: Exploit vs. Explore
Redressing the Baseline: Exploit vs. Explore
 

Similar a Crowd computing: All your base are belong to us

Managing Uncertainty in Value-based SE
Managing Uncertainty in Value-based SEManaging Uncertainty in Value-based SE
Managing Uncertainty in Value-based SECS, NcState
 
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...
Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#Secur...Alex Pinto
 
Close encounters in MDD: when Models meet Code
Close encounters in MDD: when Models meet CodeClose encounters in MDD: when Models meet Code
Close encounters in MDD: when Models meet Codelbergmans
 
Close Encounters in MDD: when models meet code
Close Encounters in MDD: when models meet codeClose Encounters in MDD: when models meet code
Close Encounters in MDD: when models meet codelbergmans
 
Local collaborative autoencoders (WSDM2021)
Local collaborative autoencoders (WSDM2021)Local collaborative autoencoders (WSDM2021)
Local collaborative autoencoders (WSDM2021)민진 최
 
Project Build With Maven
Project Build With MavenProject Build With Maven
Project Build With Mavenelliando dias
 
Distance-based bias in model-directed optimization of additively decomposable...
Distance-based bias in model-directed optimization of additively decomposable...Distance-based bias in model-directed optimization of additively decomposable...
Distance-based bias in model-directed optimization of additively decomposable...Martin Pelikan
 
In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?CS, NcState
 
DIMACS10: Parallel Community Detection for Massive Graphs
DIMACS10: Parallel Community Detection for Massive GraphsDIMACS10: Parallel Community Detection for Massive Graphs
DIMACS10: Parallel Community Detection for Massive GraphsJason Riedy
 
Grid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudGrid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudAdianto Wibisono
 
Distilling dark knowledge from neural networks
Distilling dark knowledge from neural networksDistilling dark knowledge from neural networks
Distilling dark knowledge from neural networksAlexander Korbonits
 
Network Metrics and Measurements in the Era of the Digital Economies
Network Metrics and Measurements in the Era of the Digital EconomiesNetwork Metrics and Measurements in the Era of the Digital Economies
Network Metrics and Measurements in the Era of the Digital EconomiesPavel Loskot
 
Linkage Learning for Pittsburgh LCS: Making Problems Tractable
Linkage Learning for Pittsburgh LCS: Making Problems TractableLinkage Learning for Pittsburgh LCS: Making Problems Tractable
Linkage Learning for Pittsburgh LCS: Making Problems TractableXavier Llorà
 
20080501 software verification_sharygina_lecture01
20080501 software verification_sharygina_lecture0120080501 software verification_sharygina_lecture01
20080501 software verification_sharygina_lecture01Computer Science Club
 
Using Simulation for Decision Support: Lessons Learned from FireGrid
Using Simulation for Decision Support: Lessons Learned from FireGridUsing Simulation for Decision Support: Lessons Learned from FireGrid
Using Simulation for Decision Support: Lessons Learned from FireGridgwickler
 
Accelerating Deep Learning Inference 
on Mobile Systems
Accelerating Deep Learning Inference 
on Mobile SystemsAccelerating Deep Learning Inference 
on Mobile Systems
Accelerating Deep Learning Inference 
on Mobile SystemsDarian Frajberg
 

Crowd computing: All your base are belong to us

  • 1.
  • 2. Crowd Computing: All Your Base Are Belong to Us
       David C. Thompson
  • 3. What is about to happen
       • Some background on:
         – me
         – competition
       • Crowdsourced science through the ‘ages’
       • The data set
       • The Kaggle process
       • An overview of the competition
       • The models and implementation
       • What we have learnt
  • 4. Behold! Let the science begin …
  • 6. about.me/dcthompson
       My favourite papers from each period:
       [1] J. Chem. Phys. 122, 124107 (2005)
       [2] J. Chem. Phys. 128, 224103 (2008)
       [3] J. Chem. Inf. Model. 49, 1889 (2009)
       [4] J. Chem. Inf. Model. 51, 93 (2011)
  • 7. A funny thing happened at my 1st external communications conference …
       Or …
       A heart-wrenching tale of man versus coffee machine …
       Or …
       How an external networking opportunity brought some ‘gamification’ to research
  • 9. Do real science, at home.
  • 10. What happens when you search for ‘blindfolded archery’
  • 11. I never make predictions. And I never will*
        Lots of opportunity to translate problems, from all fields, into systems with gaming elements
        • Goal – What do you hope to achieve by playing the game?
        • Rules – The limitations on how you can achieve the goals
        • Feedback – How close are you to achieving your goal?
        • Voluntary participation – Everyone playing the game accepts the goals, the rules, and the feedback
        * Paul Gascoigne
        http://janemcgonigal.com/
  • 12.
  • 14.
  • 15. What you should know about this exercise
        • We wanted to investigate the utility of the process
        • We wanted to move with speed
        • We wanted to use a data set the scientific community had previously seen
        • We wanted to be inclusive – no domain expertise needed
  • 16. Shameless slide reuse … *
        “All models are wrong, but some models are useful” – G. E. P. Box
        “…the validity of any given model is of limited scope, as is the case with any mental construct that we have about what our molecules are doing, whether we used a software package or waved our hands around in the air.” – D. Lowe
        Simulation and its discontents, Sherry Turkle, Cambridge, MA: MIT Press (2009)
        * D. C. Thompson et al. Schrödinger Regional User Meeting, New York, NY 2009
  • 17. The data set
        • Version 2 of the Hansen Ames mutagenicity data was used
        • The following protocol was observed:

        What happened                          # of molecules (removed)
        Download SMILES                        6512
        Conversion with Corina                 6503 (9)
        Remove non-zero formal charge          6419 (84)
        Remove if more than 99 atoms           6414 (5)
        Remove if contain undesirable atoms*   6252 (162)

        * D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn
        http://doc.ml.tu-berlin.de/toxbenchmark/
        J. Chem. Inf. Model. 49, 2077 (2009)
  • 18. Descriptor calculation
        SD file, descriptor calculation – 6252 x 5030
        – Filter for low variance (≤ 0.01); removed 2537
        – Remove for high correlation (> 0.90); removed 716
        – Descriptor normalization resulted in 6252 x 1777 .csv file

        Descriptor Engine                # of descriptors
        MOE 2D                           76 (186)
        Atom Pair                        696 (1920)
        MolConn-Z                        174 (745)
        Pipeline Pilot Property Counts   5 (130)
        Daylight fingerprints            825 (2048)
        clogP                            0 (1)

        J. Chem. Inf. Model. 49, 2077 (2009)
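
The two filtering steps on this slide (low variance, then high correlation) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the slide does not say whether population or sample variance was used, how correlated pairs were resolved, or where normalization sits relative to the cutoffs. The sketch assumes population variance and a greedy "keep the first column seen" rule; the function names and the toy data are hypothetical.

```python
import math
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(columns, var_cutoff=0.01, corr_cutoff=0.90):
    """Return the names of descriptor columns that survive both filters.

    columns: dict mapping descriptor name -> list of values (one per molecule).
    Assumption: population variance, and a greedy pass that keeps a column
    only if it is not highly correlated with any column already kept.
    """
    # Step 1: drop descriptors whose variance is at or below the cutoff.
    survivors = [name for name, values in columns.items()
                 if pvariance(values) > var_cutoff]

    # Step 2: drop descriptors highly correlated with one already kept.
    kept = []
    for name in survivors:
        if all(abs(pearson(columns[name], columns[k])) <= corr_cutoff for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy descriptors, not taken from the Ames data set.
demo = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with "a"
    "c": [1.0, 1.0, 1.0, 1.0],  # zero variance
    "d": [4.0, 1.0, 3.0, 2.0],
}
print(filter_descriptors(demo))
```

The greedy rule makes the result depend on column order; other tie-breaking rules (for example, dropping the column with the larger mean correlation) are equally plausible readings of the slide.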
  • 19. Testing Framework
        • Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.
        • Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
        “Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
  • 20. Expectations
        “Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”
        • 20 models generated with different algorithms and descriptors
        • Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set
        • Inter-laboratory accuracy for Ames test reported at 85%
        Expectation: Models should have similar accuracy to literature
        Goal: Models should be balanced; sensitivity and specificity should be high
        J. Chem. Inf. Model. 50, 2094 (2010)
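
A minimal sketch of the public/private idea described on this slide: the held-out test set is partitioned once, scores on one part are shown during the competition, and scores on the other part decide the final ranking. The split fraction, the seed, and the function name below are assumptions for illustration; the slides do not state how the split was actually made.

```python
import random


def split_test_set(ids, public_fraction=0.25, seed=0):
    """Partition test-set ids into (public, private) lists.

    Assumption: a simple seeded random shuffle; the real competition
    split is not described in the slides.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_public = int(len(shuffled) * public_fraction)
    return shuffled[:n_public], shuffled[n_public:]


public, private = split_test_set(range(100), public_fraction=0.25, seed=42)
print(len(public), len(private))
```

Because participants only ever see feedback on the public part, a model tuned heavily against that feedback can rank differently once the private part is scored.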
  • 22.
  • 23. Performance as a function of time
        796 players
        703 teams
        8841 entries
        55 forum topics, 409 posts
        $$\text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right]$$
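
The log loss metric on this slide is straightforward to compute directly. The sketch below follows the formula term by term, with y_i the true 0/1 label and ŷ_i the predicted probability of class 1. The clipping constant `eps` is an assumption added to avoid log(0); it is not part of the slide.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss.

    y_true: sequence of 0/1 labels.
    y_pred: sequence of predicted probabilities for class 1.
    eps:    clipping bound (an assumption, used only to keep log() finite).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip into (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Predicting 0.5 for every molecule gives ln(2), whatever the labels are.
print(log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))
```

Lower is better, and confident wrong predictions are penalized much more heavily than hesitant ones.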
  • 24.
        Final     Team Name                              Public    Δ (log loss)
        Ranking                                          Ranking
        1         Winter is Coming & Sergey              11        0
        2         seelary                                26        7E-05
        3         bluehat                                1         0.00051
        4         jazz                                   15        0.0014
        5         Wayne Zhang & Gxav & woshialex         19        0.00146
        6         Indy Actuaries                         38        0.00184
        7         bluemaster & imran                     7         0.00231
        8         Efiimov & Bers & Cragin & vsu          4         0.00241
        9         y_tag                                  18        0.0026
        10        Killian O’Connor                       44        0.00285
        11        PlanetThanet & SirGuessalot            40        0.00298
        12        AussieTim                              48        0.00335
        13        Jason Farmer                           31        0.00347
        14        GreenPeace                             16        0.00356
        15        mars                                   32        0.00388
        16        Fuzzify                                60        0.00392
        17        Emanuele                               63        0.00395
        18        HappyHour                              10        0.00431
        19        Baltic                                 30        0.00465
        20        dejavu                                 20        0.00482
        352       Random Forest Benchmark                373       0.04184
        541       Support Vector Machine Benchmark       522       0.12147
        647       Optimized Constant Value Benchmark     638       0.31414
        650       Uniform Benchmark                      642       0.31959

        https://github.com/emanuele/kaggle_pbr
        https://github.com/benhamner/BioResponse
  • 25. #FTW Strategies
        • Feature selection
          All three winning teams identified D27 as important. What is it? Organon toxicophore*
        • RF + complementary approaches
        • Blending
        * J. Med. Chem. 49, 312 (2005)
        “Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
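
The slide names blending without saying how the winning teams did it. One common and very simple form is a weighted average of the class-1 probabilities from several models, sketched below. This is an assumption for illustration, not the teams' actual method; the function name, weights, and example numbers are hypothetical.

```python
def blend(predictions, weights=None):
    """Weighted average of per-model probability lists.

    predictions: list of lists, one list of class-1 probabilities per model,
                 all of the same length.
    weights:     optional per-model weights; equal weights if omitted.
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all prediction lists must have the same length")
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n)
    ]


# Hypothetical outputs from a random forest and one complementary model.
rf_probs = [0.2, 0.8, 0.6]
other_probs = [0.4, 0.6, 0.9]
print(blend([rf_probs, other_probs]))
```

Averaging tends to help most when the blended models make different kinds of errors, which is consistent with the "RF + complementary approaches" bullet.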
  • 26. Private Set Performance
        Se: TP/(TP+FN)   Sp: TN/(FP+TN)   CCR: (Se + Sp)/2

        Group           Model     TP    FN    FP    TN    Se     Sp     CCR
        Benchmarks      RF        888   150   166   672   0.86   0.80   0.83
        Benchmarks      SVM       822   216   215   673   0.79   0.74   0.77
        Winning Teams   Team 1    873   165   151   687   0.84   0.82   0.83
        Winning Teams   Team 2    888   150   165   673   0.86   0.80   0.83
        Winning Teams   Team 3    893   145   162   676   0.86   0.80   0.83
        Other           Team 17   896   142   169   669   0.86   0.80   0.83
        Other           D27       781   257   215   623   0.75   0.74   0.75
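
The three rates defined on this slide follow directly from the confusion-matrix counts. As a small worked example, the sketch below applies the slide's definitions to the Team 1 counts (TP 873, FN 165, FP 151, TN 687); the function name is hypothetical.

```python
def classification_rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, correct classification rate)."""
    se = tp / (tp + fn)   # Se: fraction of actual positives predicted positive
    sp = tn / (fp + tn)   # Sp: fraction of actual negatives predicted negative
    ccr = (se + sp) / 2   # CCR: mean of Se and Sp
    return se, sp, ccr


se, sp, ccr = classification_rates(tp=873, fn=165, fp=151, tn=687)
print(f"Se={se:.2f} Sp={sp:.2f} CCR={ccr:.2f}")  # Se=0.84 Sp=0.82 CCR=0.83
```

Because CCR weights the two classes equally, it can rank models differently from log loss, which scores the predicted probabilities rather than the thresholded labels.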
  • 27. Okay, where’s this ‘second’ web service?
        BIpredict
        Physicochemical properties are updated as molecule is built
        Atomistic descriptor values are appended directly to the molecule
        * D. C. Thompson Chemical Computing Group, User Group Meeting, Montreal, 2011
  • 28. So, what did we learn?
        • Was this useful?
          – Yes
        • Participation was high, contributors and contributions were diverse*
        • A large number of models were of a high quality
          – Differences in top models in log loss metric are small
          – Different statistical measures lead to different rankings
          – RandomForest benchmark has high correct classification rate (CCR)
        * Sort of
  • 29. ‘Machine learning that matters’
        Machine learning skill
        Domain expertise
        Kiri L. Wagstaff. Machine Learning that Matters. Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012. (CL #12-2026)
  • 30. Know your meme
        http://roflcon.org/
        http://katemiltner.com/
  • 31. Thanks to:
        Lilly Ackley
        Ben Hamner
        Amy Kunkel
        Mehul Patel
        Alex Renner, PhD
        All Kaggle participants – esp. Winter is Coming & Sergey