Learning to Change Projects

Raymond Borges, Tim Menzies

Lane Department of Computer Science & Electrical Engineering
West Virginia University

PROMISE’12: Lund, Sweden
Sept 21, 2012
Sound bites

    Less prediction, more decision
    Data has shape
        “Data mining” = “carving” out that shape
    To reveal shape, remove irrelevancies
        Cut the cr*p
        Use reduction operators: dimension, column, row, rule
    Show, don’t code
        Once you can see shape, inference is superfluous.
        Implications for other research.
Decisions, Decisions...

  Tom Zimmermann:
      “We forget that the original motivation for predictive modeling was
      making decisions about software projects.”

  ICSE 2012 Panel on Software Analytics
      “Prediction is all well and good, but what about decision making?”

  Predictive models are useful
      They focus an inquiry onto particular issues
      but predictions are sub-routines of decision processes
Q: How to Build Decision Systems?

  1996: T. Menzies, Applications of abduction: knowledge-level modeling,
  International Journal of Human Computer Studies

  Score contexts, e.g. Hate, Love; count frequencies of ranges in each:
      Diagnosis = what went wrong.   δ = Hate(now) − Love(past)
      Monitor   = what not to do.    δ = Hate(next) − Love(now)
      Planning  = what to do next.   δ = Love(next) − Hate(now)

  δ = X − Y = contrast set = things frequent in X but rare in Y
      TAR3 (2003), WHICH (2010), etc.

  But for PROMISE effort estimation data
      Contrast sets are obvious...
      ... once you find the underlying shape of the data.
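A minimal sketch of that contrast-set idea: collect range frequencies in two scored contexts and keep the ranges that are frequent in one but rare in the other. The row layout, the Love/Hate example data, and the 0.5/0.1 support thresholds are illustrative assumptions, not values from the paper.

from collections import Counter

def contrast_set(x_rows, y_rows, min_support=0.5, max_other=0.1):
    # each row is a dict of {attribute: range}; returns ranges frequent in X but rare in Y
    def freq(rows):
        counts = Counter((attr, val) for row in rows for attr, val in row.items())
        return {k: n / len(rows) for k, n in counts.items()}
    fx, fy = freq(x_rows), freq(y_rows)
    return {k: round(fx[k] - fy.get(k, 0.0), 2)
            for k in fx
            if fx[k] >= min_support and fy.get(k, 0.0) <= max_other}

# e.g. a plan: delta = Love(next) - Hate(now)
hate_now  = [{"acap": 2, "pcap": 3}, {"acap": 2, "pcap": 4}]
love_next = [{"acap": 3, "pcap": 5}, {"acap": 3, "pcap": 5}]
print(contrast_set(love_next, hate_now))   # {('acap', 3): 1.0, ('pcap', 5): 1.0}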
Q: How to find the underlying shape of the data?



   Data mining = data carving

   To find the signal in the noise...


   Timm’s algorithm
      1   Find some cr*p
      2   Throw it away
      3   Go to 1




IDEA = Iterative Dichotomization on Every Attribute

  Timm’s algorithm
     1   Find some cr*p
     2   Throw it away
     3   Go to 1

     1   Dimensionality reduction
     2   Column reduction
     3   Row reduction
     4   Rule reduction

  And in the reduced data, inference is obvious.
IDEA = Iterative Dichotomization on Every Attribute

    1   Dimensionality reduction (recursive fast PCA)

  Fastmap (Faloutsos’94)
      W = anything
      X = furthest from W
      Y = furthest from X
      Takes time O(2N)
      Let c = dist(X,Y)
      If Z has distances a, b to X, Y then
      Z projects to (a² + c² − b²) / (2c)

  Platt’05: Fastmap = Nyström algorithm = fast & approximate PCA
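A minimal sketch of that Fastmap step, assuming rows are numeric tuples and Euclidean distance; the random choice of W and the example rows are illustrative.

import math, random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fastmap_1d(rows):
    # W = any row; X = furthest from W; Y = furthest from X
    # i.e. two linear passes, O(2N) distance calculations
    w = random.choice(rows)
    x = max(rows, key=lambda r: dist(r, w))
    y = max(rows, key=lambda r: dist(r, x))
    c = dist(x, y)
    if c == 0:                      # all rows identical: nothing to project
        return [0.0] * len(rows)
    # z projects to (a^2 + c^2 - b^2) / (2c), with a = dist(z,x), b = dist(z,y)
    return [(dist(z, x) ** 2 + c ** 2 - dist(z, y) ** 2) / (2 * c) for z in rows]

rows = [(0, 0), (1, 1), (2, 2), (9, 9)]
print(fastmap_1d(rows))             # positions along the X..Y axis

Splitting rows at the median of that axis and recursing is one way to get the “recursive” part of the dimensionality reduction; the exact stopping rule is not shown here.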
IDEA = Iterative Dichotomization on Every Attribute

    1   Dimensionality reduction (recursive fast PCA)
    2   Column reduction (info gain)

        Sort columns by their diversity
        Keep columns that select for fewest clusters

        e.g. nine rows in two clusters
        cluster c1 has acap=2,3,3,3,3; pcap=3,3,4,5,5
        cluster c2 has acap=2,2,2,3; pcap=3,4,4,5

            p(acap = 2) = 0.44                  p(acap = 3) = 0.55
            p(pcap = 3) = p(pcap = 4) = 0.33    p(pcap = 5) = 0.33

            p(acap = 2|c1) = 0.25               p(acap = 2|c2) = 0.75
            p(acap = 3|c1) = 0.8                p(acap = 3|c2) = 0.2
            p(pcap = 3|c1) = 0.67               p(pcap = 3|c2) = 0.33
            p(pcap = 4|c1) = 0.33               p(pcap = 4|c2) = 0.67
            p(pcap = 5|c1) = 0.67               p(pcap = 5|c2) = 0.33

            I(col)  = Σx p(x) · ( Σc −p(x|c) · log p(x|c) )
            I(acap) = 0.239 ← keep
            I(pcap) = 0.273 ← prune
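A minimal sketch of that diversity score, reading the conditional as the probability of a cluster given a column value (which is what the worked numbers above correspond to); the choice of log base is an assumption, since the slide does not state one.

import math
from collections import Counter, defaultdict

def diversity(values, clusters, base=10):
    # I(col) = sum over values x of p(x) * entropy of the cluster labels
    # among rows with that value; lower = the column separates clusters better
    n = len(values)
    by_value = defaultdict(list)
    for v, c in zip(values, clusters):
        by_value[v].append(c)
    score = 0.0
    for cs in by_value.values():
        p_x = len(cs) / n
        ent = -sum((k / len(cs)) * math.log(k / len(cs), base)
                   for k in Counter(cs).values())
        score += p_x * ent
    return score

# the nine-row example: five rows in c1, four in c2
clusters = ["c1"] * 5 + ["c2"] * 4
acap = [2, 3, 3, 3, 3,  2, 2, 2, 3]
pcap = [3, 3, 4, 5, 5,  3, 4, 4, 5]
print(diversity(acap, clusters), diversity(pcap, clusters))  # acap scores lower, so keep it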
IDEA = Iterative Dichotomization on Every Attribute

    1   Dimensionality reduction (recursive fast PCA)
    2   Column reduction (info gain)
    3   Row reduction (replace clusters with their mean)

  Replace all leaf cluster instances with their centroid
      Described only using the columns within 50% of the minimum diversity.
      e.g. Nasa93 reduces to 12 columns and 13 centroids.
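A minimal sketch of that row reduction, assuming leaf clusters arrive as lists of rows (dicts) and reading “within 50% of min diversity” as a score no more than 1.5 times the smallest; both the data layout and that reading are assumptions.

def reduce_rows(leaf_clusters, diversities, slack=0.5):
    # keep only the low-diversity columns, then replace each leaf cluster
    # with a single centroid row described by those columns
    lo = min(diversities.values())
    keep = [col for col, d in diversities.items() if d <= lo * (1 + slack)]
    return {cid: {col: sum(r[col] for r in rows) / len(rows) for col in keep}
            for cid, rows in leaf_clusters.items()}

leaf_clusters = {"c1": [{"acap": 2, "pcap": 3}, {"acap": 3, "pcap": 5}],
                 "c2": [{"acap": 2, "pcap": 4}, {"acap": 3, "pcap": 4}]}
diversities = {"acap": 0.239, "pcap": 0.273}
print(reduce_rows(leaf_clusters, diversities))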
Nasa93 reduces to 12 columns and 13 centroids




IDEA = Iterative Dichotomization on Every Attribute

    1   Dimensionality reduction (recursive fast PCA)
    2   Column reduction (info gain)
    3   Row reduction (replace clusters with their mean)
    4   Rule reduction (contrast home vs neighbors)

  Surprise: after steps 1, 2, 3...
      Further computation is superfluous.
      Visuals are sufficient for contrast set generation.
Manual Construction of Contrast Sets




       Table5 = Your “home” cluster
       Table6 = Projects of similar size
       Table7 = Nearby project with fearsome effort
       Contrast set = delta on last line
Why Cluster120?




       Is it valid that cluster120 costs so much?
       Yes, if building core services with cost amortized over N future apps.
       No, if racing to get products to a competitive market.
       We do not know, but at least we are focused on that issue.
Reductions on PROMISE data sets

  Size of the reduced data sets (scatter plot of rows vs. columns omitted; see table):

                           reduced
      data set          rows    columns
      Albrecht             4          4
      China               66         15
      Cocomo81             8         18
      Cocomo81e            4         16
      Cocomo81o            4         16
      Cocomo81s            2         16
      Desharnais           8         19
      Desharnais L1        6         10
      Desharnais L2        4         10
      Desharnais L3        2         10
      Finnish              6          2
      Kemerer              2          7
      Miyazaki’94          6          3
      Nasa93              13         12
      Nasa93 center 5      7         16
      Nasa93 center1       2         15
      Nasa93 center2       5         16
      SDR                  4         21
      Telcom1              2          1

  Q: throwing away too much?
Q: Throwing Away Too Much?

  Estimates = class variable of nearest centroid in reduced space
      Compare to 90 pre-processor*learner combinations from Kocagueneli et al.,
      TSE 2011, On the Value of Ensemble Learning in Effort Estimation.
      Performance measure = MRE = |predicted − actual| / actual

  9 pre-processors:
     1   norm: normalize numerics 0..1, min..max
     2   log: replace numerics of the non-class columns with their logarithms
     3   PCA: replace non-class columns with principal components
     4   SWReg: cull uninformative columns with stepwise regression
     5   Width3bin: divide numerics into 3 bins with boundaries (max−min)/3
     6   Width5bin: divide numerics into 5 bins with boundaries (max−min)/5
     7   Freq3bins: split numerics into 3 equal-size percentile bins
     8   Freq5bins: split numerics into 5 equal-size percentile bins
     9   None: no pre-processor

  10 learners:
     1   1NN: simple one nearest neighbor
     2   ABE0-1nn: analogy-based estimation using the nearest neighbor
     3   ABE0-5nn: analogy-based estimation using the median of the five nearest neighbors
     4   CART(yes): regression trees, with sub-tree post-pruning
     5   CART(no): regression trees, no post-pruning
     6   NNet: two-layered neural net
     7   LReg: linear regression
     8   PLSR: partial least squares regression
     9   PCR: principal components regression
    10   SWReg: stepwise regression
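A minimal sketch of that evaluation: estimate a new project as the effort of its nearest centroid in the reduced space, then score with MRE. The centroid layout, the Euclidean distance, and the example numbers are illustrative assumptions.

def mre(predicted, actual):
    # magnitude of relative error
    return abs(predicted - actual) / actual

def estimate(project, centroids):
    # centroids = list of (features, effort) pairs from the row-reduction step;
    # the estimate is the effort of the nearest centroid
    feats, effort = min(centroids,
                        key=lambda ce: sum((project[k] - ce[0][k]) ** 2 for k in ce[0]))
    return effort

centroids = [({"acap": 2.5, "pcap": 4.0}, 100), ({"acap": 4.0, "pcap": 5.0}, 300)]
print(estimate({"acap": 3, "pcap": 4}, centroids))   # 100
print(mre(predicted=120, actual=100))                # 0.2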
Results

       Perennial problem with assessing different effort estimation tools:
           MRE is not normal: low valley, high hills (injects much variance)

       IDEA’s predictions are no better or worse than the others’, and avoid all the hills
Related Work

              Cluster using (a) centrality (e.g. k-means);
      (b) connectedness (e.g. dbScan); (c) separation (e.g. IDEA)

      Who                          case-based   clustering                   feature selection   task
      Shepperd (1997)                  √                                                         predict
      Boley (1998)                                recursive PCA                                  predict
      Bettenburg et al. (MSR’12)                  recursive regression                           predict
      Posnett et al. (ASE’11)                     on file/package divisions                      predict
      Menzies et al. (ASE’11)          √          FastMap                                        contrast
      IDEA                             √          √                               √              contrast
Back to the Sound bites

    Less prediction, more decision
    Data has shape
        “Data mining” = “carving” out that shape
    To reveal shape, remove irrelevancies
        Cut the cr*p
        IDEA = reduction operators: dimension, column, row, rule
    Show, don’t code
        Once you can see shape, inference is superfluous.
        Implications for other research.
Questions? Comments?




