Who’s who? (in GNOME)

Erik Kouters
Bogdan Vasilescu
Alexander Serebrenik
Mark G.J. van den Brand
It’s all about communication

      “Test #14 fails sometimes”

      “The error should be somewhere here…”

      “I know how to fix it!”

/ Mathematics and Computer Science                   9/26/12   PAGE 2
One person - multiple aliases

      Contributors sign off using a <name, email> alias.




<John Doe, john.doe@gmail.com>




        <John J. Doe, john.doe@yahoo.com>


One person - multiple aliases

        Names:                            Emails:
        •  John Doe                       •  j.doe@domainA
        •  Doe John                       •  john.doe@domainB
        •  J Doe                          •  john DOT doe AT domainC
        •  John                           •  jdoe@domainD
        •  John Dooe                      •  john@domainE
        •  John J. Doe
        •  John Joseph Doe
        •  John “Bone” Doe

      Identity merge algorithms:
           •  the “noisier” the data, the worse they perform!
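
One way to make such variants comparable is to reduce every alias to a set of lowercase terms. The sketch below is only an illustration: the function name `alias_terms`, the `min_len` cutoff, and the handling of "DOT"/"AT" obfuscation are assumptions, not the authors' implementation.

```python
import re

def alias_terms(name, email, min_len=2):
    """Split a <name, email> alias into a set of lowercase terms (hypothetical helper)."""
    # Undo simple obfuscation such as "john DOT doe AT domainC".
    email = re.sub(r"\s+DOT\s+", ".", email)
    email = re.sub(r"\s+AT\s+", "@", email)
    # Only the part before "@" is used; the domain is ignored in this sketch.
    prefix = email.split("@", 1)[0]
    # Split on anything that is not an ASCII letter or digit
    # (accented letters would also be treated as separators here).
    parts = re.split(r"[^0-9a-z]+", f"{name} {prefix}".lower())
    # Drop empty strings and very short fragments such as single initials.
    return {p for p in parts if len(p) >= min_len}
```

With these assumptions, `alias_terms("J Doe", "john DOT doe AT domainC")` and `alias_terms("John Doe", "john.doe@domainB")` both reduce to the terms `john` and `doe`, while a misspelling such as "John Dooe" still yields a different term and needs approximate matching.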

Latent Semantic Analysis

  johnd@domainA:   <John Doe, johnd@domainA>
                   <John Joseph Doe, johnd@domainA>
                   {john, johnd, joseph, doe}

  Document-term matrix

        john      1     ..   ..   ..
        johnd     1     ..   ..   ..
        joseph    1     ..   ..   ..
        jdoe      3/4   ..   ..   ..
        doe       1     ..   ..   ..

  max similarity(jdoe, {john, johnd, joseph, doe})
        = similarity(jdoe, doe)
        = 1 – Levenshtein(jdoe, doe) / max(length(jdoe), length(doe))
        = 1 – 1/4 = 3/4
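
The 3/4 entry for jdoe can be reproduced with a short sketch of the normalized Levenshtein similarity given on this slide. The function names are illustrative, and the edit distance is the standard dynamic-programming formulation rather than the authors' code.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """1 - Levenshtein(a, b) / max(length(a), length(b)), as on the slide."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def max_similarity(term, terms):
    """Best similarity between a term and any term of a document."""
    return max(similarity(term, t) for t in terms)

entry = max_similarity("jdoe", {"john", "johnd", "joseph", "doe"})
```

Here `entry` is 0.75: the closest term is `doe`, one deletion away from `jdoe`, and the longer of the two strings has four characters.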

Latent Semantic Analysis

  <John Smith, john@domainA>
  <John Brown, john@domainB>

  Document-term matrix

        john      1     ..   ..   ..
        johnd     1     ..   ..   ..
        joseph    1     ..   ..   ..
        jdoe      3/4   ..   ..   ..
        doe       1     ..   ..   ..

  •  Inverse document frequency
  •  Singular value decomposition
  •  Rank (noise) reduction
  •  Cosine between documents
  •  Merge similar documents
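
A minimal sketch of two of these steps, inverse document frequency weighting and the cosine between documents, shows why a shared first name alone should not merge John Smith with John Brown. The singular value decomposition and rank reduction steps are deliberately left out, since they need a linear algebra library. The function names and the three-document toy corpus are assumptions made for illustration.

```python
import math

def idf_weights(docs):
    """idf(t) = log(N / df(t)), where df(t) counts the documents containing t."""
    n = len(docs)
    df = {}
    for terms in docs.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    return {t: math.log(n / count) for t, count in df.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy corpus: one document per email address, unit weight per term.
docs = {
    "john@domainA": {"john": 1.0, "smith": 1.0},
    "john@domainB": {"john": 1.0, "brown": 1.0},
    "johnd@domainA": {"john": 1.0, "johnd": 1.0, "joseph": 1.0, "doe": 1.0},
}
idf = idf_weights(docs)
weighted = {d: {t: w * idf[t] for t, w in terms.items()}
            for d, terms in docs.items()}

raw = cosine(docs["john@domainA"], docs["john@domainB"])
reweighted = cosine(weighted["john@domainA"], weighted["john@domainB"])
```

In this toy corpus `john` occurs in every document, so its weight drops to zero: the raw cosine between the two documents is about 0.5, while the IDF-weighted cosine is 0.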

Empirical evaluation

      GNOME – all projects (Git)
           •  8618 different aliases
           •  only 4989 unique!

[Figure: F-measure (axis from 0.75 to 1.00) of the Simple, Bird et al., and LSA approaches, in two panels: "Average case" and "Worst case".]

Fig. 1. The f-measures for the competing approaches. The f-measure ranges between 0 and 1 (the higher the value, the better). LSA performs as well as …
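
The F-measure is the harmonic mean of precision and recall. The sketch below computes it over pairs of aliases that were merged into the same identity. The slides do not say how precision and recall were counted, so the pairwise scheme and the function names are assumptions.

```python
from itertools import combinations

def merged_pairs(clusters):
    """Set of unordered alias pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs

def f_measure(predicted, truth):
    """Pairwise F-measure of a predicted merge against the true identities."""
    pred, gold = merged_pairs(predicted), merged_pairs(truth)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)            # pairs merged correctly
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: a1, a2, a3 belong to one person, b1 to another.
truth = [{"a1", "a2", "a3"}, {"b1"}]
predicted = [{"a1", "a2"}, {"a3", "b1"}]
score = f_measure(predicted, truth)
```

In this example one of the two predicted pairs is correct (precision 1/2) and one of the three true pairs is found (recall 1/3), so `score` is about 0.4.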
Summary

… algorithm based on LSA, robust … crepancies in VCS aliases. Empir… Git repositories has shown equall… algorithm as the state of the art in … performance in the worst case. … approach presented in FRASR, o… software repositories [10].

Evaluation setup:

… varying the remaining. After the sensitivity analysis … restricted the range of minLen to {2, 3, 4}, levThr to … 0.75}, cosThr to {0.65, 0.70, 0.75}, and k was fixed to … of the number of terms. In the average case, for each of ten repetitions, training was performed on one tenth of the GNOME aliases (≈ 860), and testing on ten random subsets of the same size from the remaining aliases. Samples were taken instead of the entire remaining data for computational efficiency reasons. In the worst case, because of fewer aliases in the dataset (673), for each of the ten repetitions, training was performed on one third of the data and testing on the other two thirds. All algorithms have been implemented in Python and, as well as the data, can be made available upon request.

References

  [1] C. Bird et al. “Mining emai… ACM, 2006, pp. 137–143.
  [2] A. Capiluppi, … Serebrenik … oping an H-Index for OSS D… 2012, pp. 251–254.
  [3] P. Christen. “A comparison … Techniques and practical … 2006, pp. 290–294.
  [4] S.T. Dumais. “Improving … from external sources”. In: … 23.2 (1991), pp. 229–236.
  [5] D.M. German. “The GNO… of open source, global softw… ware Process 8.4 (2003), p…
  [6] M. Goeminne and T. Mens. …


ICSM 2012 ERA
