Kouters, E, Vasilescu, B, Serebrenik, A and van den Brand, MGJ (2012), "Who's who in GNOME: using LSA to merge software repository identities", In Proceedings of the 28th IEEE International Conference on Software Maintenance - Early Research Achievements (ICSM 2012 ERA), pp. 592-595. IEEE.
1. Who’s who? (in GNOME)
Erik Kouters
Bogdan Vasilescu
Alexander Serebrenik
Mark G.J. van den Brand
2. It’s all about communication
The error should be somewhere here…
Test #14 fails sometimes
I know how to fix it!
/ Mathematics and Computer Science 9/26/12 PAGE 2
3. One person - multiple aliases
Contributors sign off using a <name, email> alias.
<John Doe, john.doe@gmail.com>
<John J. Doe, john.doe@yahoo.com>
4. One person - multiple aliases
Names:
• John Doe
• Doe John
• J Doe
• John
• John Dooe
• John J. Doe
• John Joseph Doe
• John “Bone” Doe
Emails:
• j.doe@domainA
• john.doe@domainB
• john DOT doe AT domainC
• jdoe@domainD
• john@domainE
Identity merge algorithms:
• the “noisier” the data, the worse they perform!
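To make the effect of noise concrete, here is a minimal sketch of a naive alias normalizer. It is not one of the algorithms compared in this talk; the function names `normalize_email` and `name_key` are hypothetical and chosen only for illustration. It handles reordered names and "DOT/AT" obfuscation, but misspellings and nicknames still yield different keys.

```python
import re
import unicodedata


def normalize_email(email):
    """Undo simple obfuscation such as 'john DOT doe AT domainC' and lowercase."""
    e = re.sub(r"\s+AT\s+", "@", email.strip(), flags=re.IGNORECASE)
    e = re.sub(r"\s+DOT\s+", ".", e, flags=re.IGNORECASE)
    return e.lower()


def name_key(name):
    """Order-insensitive key: lowercase alphabetic tokens, sorted.

    Punctuation and typographic quotes are dropped; accents are stripped
    by decomposing to NFKD and discarding non-ASCII characters.
    """
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    tokens = re.findall(r"[a-z]+", ascii_name.lower())
    return tuple(sorted(tokens))


# Reordered names collapse to the same key ...
same = name_key("John Doe") == name_key("Doe John")
# ... but a typo or a nickname does not.
typo_differs = name_key("John Dooe") != name_key("John Doe")
nickname_differs = name_key("John “Bone” Doe") != name_key("John Doe")
```

Exact-match keys like these are brittle, which is the motivation for a similarity-based approach that tolerates noise.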
6. Latent Semantic Analysis
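Aliases are treated as documents:
<John Smith, john@domainA>
<John Brown, john@domainB>
Steps:
• Inverse document frequency
• Singular value decomposition
• Rank (noise) reduction
• Cosine between documents
• Merge similar documents
Term-document matrix (first column shown; the remaining entries are elided on the slide):
john    1    ..  ..  ..
johnd   1    ..  ..  ..
joseph  1    ..  ..  ..
jdoe    3/4  ..  ..  ..
doe     1    ..  ..  ..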
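The steps listed on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the tokenizer, the particular IDF formula, the union-find merge, and the names `terms`, `lsa_similarity`, and `merge_aliases` are all hypothetical. In particular, the fractional matrix entries shown on the slide (such as 3/4) are not reproduced here; this sketch uses plain term counts. It requires NumPy.

```python
import re

import numpy as np


def terms(alias):
    """Split a (name, email) alias into lowercase terms (assumed tokenizer)."""
    name, email = alias
    local_part = email.split("@")[0]
    text = (name + " " + local_part).lower()
    return [t for t in re.split(r"[^a-z]+", text) if len(t) >= 2]


def lsa_similarity(aliases, k):
    """Cosine similarity between aliases in a rank-k latent space."""
    vocab = sorted({t for a in aliases for t in terms(a)})
    row = {t: i for i, t in enumerate(vocab)}
    # Term-document matrix: one row per term, one column per alias.
    A = np.zeros((len(vocab), len(aliases)))
    for j, alias in enumerate(aliases):
        for t in terms(alias):
            A[row[t], j] += 1.0
    # Inverse document frequency weighting (one common variant, assumed).
    df = np.count_nonzero(A, axis=1)
    A *= (np.log(len(aliases) / df) + 1.0)[:, None]
    # Singular value decomposition, then rank (noise) reduction to k dimensions.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    docs = (np.diag(s[:k]) @ Vt[:k]).T  # one row per alias
    # Cosine between documents.
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms < 1e-12] = 1.0
    unit = docs / norms
    return unit @ unit.T


def merge_aliases(aliases, k, cos_thr):
    """Merge aliases whose cosine similarity reaches cos_thr (union-find)."""
    sim = lsa_similarity(aliases, k)
    parent = list(range(len(aliases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(aliases)):
        for j in range(i + 1, len(aliases)):
            if sim[i, j] >= cos_thr:
                parent[find(i)] = find(j)
    groups = {}
    for i, alias in enumerate(aliases):
        groups.setdefault(find(i), []).append(alias)
    return list(groups.values())
```

Calling `merge_aliases(aliases, k=2, cos_thr=0.75)` on a list of `(name, email)` tuples returns a list of groups, each group being the aliases assigned to one identity. Both `k` and `cos_thr` are tuning parameters; the values here are placeholders.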
7. Empirical evaluation
GNOME – all projects (Git)
• 8618 different aliases
• only 4989 unique!
[Figure: F-measure (y-axis, 0.75 to 1.00) for the Simple, Bird et al., and LSA algorithms; two panels: Average case and Worst case.]
Fig. 1. The f-measures for the competing approaches. The f-measure ranges between 0 and 1 (the higher the value, the better). LSA performs as well as …
Paper excerpt visible behind the slide (clipped): … was performed on one third of the data and testing on the other two thirds. All algorithms have been implemented in Python and, as well as the data, can be made available upon request.
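As a reminder of what the y-axis measures, the sketch below computes an F-measure for an identity merge by comparing pairs of aliases placed in the same group. The pairwise definition is an assumption made for illustration; the slides do not state how precision and recall were computed, and the names `same_identity_pairs` and `f_measure` are hypothetical.

```python
from itertools import combinations


def same_identity_pairs(groups):
    """Set of unordered alias pairs that a grouping puts in the same identity."""
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs


def f_measure(predicted, truth):
    """Harmonic mean of pairwise precision and recall, in [0, 1]."""
    p = same_identity_pairs(predicted)
    t = same_identity_pairs(truth)
    if not p and not t:
        return 1.0  # nothing to merge and nothing merged
    true_positives = len(p & t)
    precision = true_positives / len(p) if p else 0.0
    recall = true_positives / len(t) if t else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


truth = [["a", "b", "c"], ["d"]]
predicted = [["a", "b"], ["c", "d"]]
score = f_measure(predicted, truth)  # precision 1/2, recall 1/3
```

In this toy case one of the two predicted pairs is correct and one of the three true pairs is found, so the score is well below 1 even though only a single alias was misplaced.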
8. Summary
Paper excerpt visible behind the slide (left column, clipped):
… varying the remaining. After the sensitivity analysis … restricted the range of minLen to {2, 3, 4}, levThr to {…, 0.75}, cosThr to {0.65, 0.70, 0.75}, and k was fixed to … of the number of terms. In the average case, for each of ten repetitions, training was performed on one tenth of the GNOME aliases (≈ 860), and testing on ten random subsets of the same size from the remaining aliases. Samples were taken instead of the entire remaining data for computational efficiency reasons. In the worst case, because of fewer aliases in the dataset (673), for each of the ten repetitions, training was performed on one third of the data and testing on the other two thirds. All algorithms have been implemented in Python and, as well as the data, can be made available upon request.
Paper excerpt (right column, clipped):
… algorithm based on LSA, robust … discrepancies in VCS aliases. Empir… Git repositories has shown equall… algorithm as the state of the art in … performance in the worst case. … approach presented in FRASR, o… software repositories [10].
REFERENCES
[1] Bird et al. “Mining emai…” … ACM, 2006, pp. 137–143.
[2] A. Capiluppi, A. Serebrenik, … “…oping an H-Index for OSS D…” … 2012, pp. 251–254.
[3] P. Christen. “A comparison …” … Techniques and practical … 2006, pp. 290–294.
[4] S.T. Dumais. “Improving … from external sources”. In: … 23.2 (1991), pp. 229–236.
[5] D.M. German. “The GNO… of open source, global softw…” … ware Process 8.4 (2003), p…
[6] M. Goeminne and T. Mens. …
[Overlaid on the slide: the term-document matrix from slide 6 and the F-measure plots (Average case, Worst case; Simple, Bird et al., LSA) from slide 7.]