ICSM 2012 ERA

Who’s who? (in GNOME)

Erik Kouters
Bogdan Vasilescu
Alexander Serebrenik
Mark G.J. van den Brand

It’s all about communication

“Test #14 fails sometimes”

“The error should be somewhere here…”

“I know how to fix it!”

/ Mathematics and Computer Science 9/26/12 PAGE 2

One person - multiple aliases

Contributors sign off using a <name, email> alias.

<John Doe, john.doe@gmail.com>

<John J. Doe, john.doe@yahoo.com>


One person - multiple aliases

Names:
•  John Doe
•  Doe John
•  J Doe
•  John
•  John Dooe
•  John J. Doe
•  John Joseph Doe
•  John “Bone” Doe

Emails:
•  j.doe@domainA
•  john.doe@domainB
•  john DOT doe AT domainC
•  jdoe@domainD
•  john@domainE

Identity merge algorithms:
•  the “noisier” the data, the worse they perform!
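To make the “noise” concrete, here is a minimal Python sketch of the kind of clean-up an identity merge algorithm might apply before comparing aliases. The function names normalize_email and normalize_name are hypothetical, and the rules (lowercasing, undoing “AT”/“DOT” obfuscation, ignoring word order, ASCII letters only) are assumptions for illustration, not the rules used by the authors.

```python
import re


def normalize_email(email: str) -> str:
    """Lowercase an address and undo simple ' AT ' / ' DOT ' obfuscation."""
    cleaned = email.strip().lower()
    cleaned = re.sub(r"\s+at\s+", "@", cleaned)
    cleaned = re.sub(r"\s+dot\s+", ".", cleaned)
    return cleaned


def normalize_name(name: str) -> str:
    """Lowercase a name, drop punctuation, and ignore word order.

    Assumption: only ASCII letters are kept, which is too crude for real data.
    """
    parts = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(parts))


print(normalize_email("john DOT doe AT domainC"))
print(normalize_name("Doe John") == normalize_name("John Doe"))
print(normalize_name("John J. Doe") == normalize_name("John Doe"))
```

Simple normalization merges “Doe John” with “John Doe”, but it does not help with variants such as “John J. Doe” or “John Dooe”, which is where approximate matching comes in.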


Latent Semantic Analysis

johnd@domainA:

<John Doe, johnd@domainA>
<John Joseph Doe, johnd@domainA>

Terms: {john, johnd, joseph, doe}

Document-term matrix

Term     johnd@domainA  ..  ..  ..
john     1              ..  ..  ..
johnd    1              ..  ..  ..
joseph   1              ..  ..  ..
jdoe     3/4            ..  ..  ..
doe      1              ..  ..  ..

Entry for jdoe:

max similarity(jdoe, {john, johnd, joseph, doe})
= similarity(jdoe, doe)
= 1 – Levenshtein(jdoe, doe) / max(length(jdoe), length(doe))
= 1 – 1/4 = 3/4
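The similarity computation on this slide can be reproduced with a short Python sketch. The formula is the one shown above. The helper names levenshtein, similarity, and best_match are illustrative and not taken from the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1 - Levenshtein(a, b) / max(length(a), length(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


def best_match(term: str, vocabulary: list[str]) -> tuple[float, str]:
    """Highest similarity between term and any term of the document."""
    return max((similarity(term, other), other) for other in vocabulary)


print(best_match("jdoe", ["john", "johnd", "joseph", "doe"]))  # (0.75, 'doe')
```

The closest term to jdoe is doe (one deletion over a maximum length of four), which gives the matrix entry 3/4.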


Latent Semantic Analysis

<John Smith, john@domainA>
<John Brown, john@domainB>

•  Inverse document frequency
•  Singular value decomposition
•  Rank (noise) reduction
•  Cosine between documents
•  Merge similar documents
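The steps on this slide can be sketched with NumPy as follows. This is a rough illustration under stated assumptions, not the authors' code: the IDF variant log(N / df), the toy matrix, the rank k = 3, and the threshold 0.75 are all chosen for the example, and the names lsa_similarities and merge_candidates are hypothetical.

```python
import numpy as np


def lsa_similarities(term_doc, k):
    """Cosine similarity between documents in a rank-k LSA space.

    term_doc: matrix with one row per term and one column per document.
    """
    X = np.asarray(term_doc, dtype=float)
    n_docs = X.shape[1]
    # Inverse document frequency: terms found in many documents weigh less.
    df = np.count_nonzero(X, axis=1)
    idf = np.log(n_docs / np.maximum(df, 1))
    weighted = X * idf[:, None]
    # Singular value decomposition, then rank (noise) reduction to k.
    _, s, vt = np.linalg.svd(weighted, full_matrices=False)
    docs = (np.diag(s[:k]) @ vt[:k]).T  # one row per document
    # Cosine between documents.
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = docs / norms
    return unit @ unit.T


def merge_candidates(sims, threshold):
    """Pairs of documents whose cosine reaches the threshold."""
    n = sims.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sims[i, j] >= threshold]


# Toy data (assumed): rows are john, smith, brown, doe, jdoe.
# Columns: John Smith, John Brown, and two John Doe aliases.
X = [[1, 1, 1, 1],
     [1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 0.75]]
sims = lsa_similarities(X, k=3)
print(merge_candidates(sims, threshold=0.75))
```

In this toy example the term john occurs in every document, so its IDF weight is zero and John Smith and John Brown are not merged merely because they share a first name. Only the two John Doe documents end up close enough to be merge candidates.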


Empirical evaluation

GNOME – all projects (Git)
•  8618 different aliases
•  only 4989 unique!

[Figure: F-measure (y-axis, 0.75 to 1.00) for Simple, Bird et al., and LSA; panels: Average case, Worst case.]

Fig. 1. The f-measures for the competing approaches. The f-measure ranges between 0 and 1 (the higher the value, the better). LSA performs as well as …
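For reference, the F-measure is the harmonic mean of precision and recall. The sketch below computes it over pairs of aliases judged to belong to the same person. The slides do not say how precision and recall were counted, so this pairwise formulation, like the function name f_measure and the toy pairs, is an assumption.

```python
def f_measure(predicted_pairs, true_pairs):
    """Harmonic mean of precision and recall over merged alias pairs."""
    predicted, truth = set(predicted_pairs), set(true_pairs)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical alias ids: two of three predicted merges are correct,
# and two of three true merges are found.
predicted = {(1, 2), (2, 3), (4, 5)}
truth = {(1, 2), (2, 3), (1, 3)}
print(round(f_measure(predicted, truth), 3))  # 0.667
```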

Summary

… and varying the remaining. After the sensitivity analysis … restricted the range of minLen to {2, 3, 4}, levThr to {…, 0.75}, cosThr to {0.65, 0.70, 0.75}, and k was fixed to … of the number of terms. In the average case, for each of ten repetitions, training was performed on one tenth of the GNOME aliases (≈ 860), and testing on ten random subsets of the same size from the remaining aliases. Samples were chosen instead of the entire remaining data for computational efficiency reasons. In the worst case, because of fewer aliases in the dataset (673), for each of the ten repetitions, training was performed on one third of the data and testing on the other two thirds. All algorithms have been implemented in Python and, as well as the data, can be made available upon request.

… algorithm based on LSA, robust … discrepancies in VCS aliases. Empir… Git repositories has shown equall… algorithm as the state of the art in … performance in the worst case. … approach presented in FRASR, o… software repositories [10].

[Figure: document-term matrix column (john 1, johnd 1, joseph 1, jdoe 3/4, doe 1) and F-measure plots for Simple, Bird et al., and LSA; panels: Average case, Worst case.]

References
[1] C. Bird et al. “Mining emai…” … ACM, 2006, pp. 137–143.
[2] A. Capiluppi, A. Serebrenik, … “…oping an H-Index for OSS D…” … 2012, pp. 251–254.
[3] P. Christen. “A comparison …” … Techniques and practical … 2006, pp. 290–294.
[4] S.T. Dumais. “Improving … from external sources”. In: … 23.2 (1991), pp. 229–236.
[5] D.M. German. “The GNO… of open source, global softw…” … ware Process 8.4 (2003), p…
[6] M. Goeminne and T. Mens. …
