Revisiting the Applicability of the Pareto Principle to Core Development Teams in Open Source Software Projects
1. Is the Pareto Principle Applicable to the Core Teams of GitHub Projects?
Kazuhiro Yamashita, Yasutaka Kamei, Shane McIntosh, Naoyasu Ubayashi, Ahmed E. Hassan
2. Core developers play a critical role in software development
Nakakoji: Core developers are responsible for guiding and coordinating the development of an OSS project.
Mockus: The most productive developers, who have made roughly 80% of the total contributions.
3. In fact, some argue that core developers in OSS projects follow the Pareto Principle
[Figure: the Pareto Principle. 20% of the effort yields 80% of the result.]
4. Pareto Principle in Software Development
[Figure: in a project, 20% of the developers produce 80% of the artifacts.]
5. Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle
[Figure: prior studies grouped as Pareto, Non-Pareto, or Other: Goeminne (IWSQM), Robles (RAMSS), Mockus (TOSEM), Geldenhuys (ECSEAA), Koch (ISJ), Dinh-Trong (TSE).]
The results depend on a small number of case study systems
6. Prior studies have arrived at mixed conclusions about core teams and the Pareto Principle
[Figure: prior studies grouped by the reported number of core developers, "< 10 or 15" or Other: Goeminne (IWSQM), Robles (RAMSS), Mockus (TOSEM), Geldenhuys (ECSEAA), Koch (ISJ), Dinh-Trong (TSE).]
7. Overview of our study of core teams on GitHub
Applicability of the Pareto Principle
Number of Core Developers
8. Overview of our study of core teams on GitHub
Applicability of the Pareto Principle
Number of Core Developers
Core and Non-Core Developers Activities
9. Collecting and analyzing GitHub data to study core team activity
[Figure: study pipeline. Projects are filtered, heuristics split developers into Core and Non-Core, and then two analyses follow: Core Team Size (calculate proportions) and Activity (classify commits).]
11. Preprocessing GitHub data to handle forks, duplicates, and to remove immature projects
8,510,504 repositories -> 2,496 repositories
13. Using heuristics to identify core team members
Three heuristics, each yielding a set of core developers: Commit-based, LOC-based, Access-based
14. Our commit-based core contributor heuristic
[Figure: number of commits made by each of the contributors A, B, C, and D.]
16. Step2: Compute the proportion of commits that each contributor made
[Figure: commit ratios of the four contributors in sorted order: 60%, 20%, 10%, 10%.]
17. Step3: Core contributors are those developers below the 0.8 cumulative contribution cutoff
[Figure: cumulative commit ratio per contributor, with the axis marked at 0.6, 0.8, and 1.0 and the cutoff at 0.8.]
Pct. CoreDev: 2/4*100 = 50%
Num CoreDev: 2
22. Our approach to studying core team size
Compliance with the Pareto Principle: the percentage of core developers lies between 10% and 30%
The example project does not follow the Pareto Principle
Stratify projects along the confounding factors: LOC, Total Authors, and Age, each split into Small, Medium, and Large
24. Often, there are fewer than 15 core developers in a project
[Figure: number of core developers in projects.]
Share of projects with 15 or fewer core developers:
Commit-Based: 88%
LOC-Based: 98%
Access-Based: 96%
25. Overview of our study of core teams on GitHub
Applicability of the Pareto Principle: more than half of the projects do not follow the Pareto principle
Number of Core Developers: most projects have 15 or fewer core developers
Core and Non-Core Developers Activities
28. Our approach to study activity
By using the keywords, we classify the commits.
Activity      Type                     Keywords
Development   Forward Engineering      implement, add, request
Maintenance   Reengineering            optimiz, adjust
Maintenance   Corrective Engineering   bug, fix, issue, error
Management    Management               license, formatting, TODO
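29. No big differences in proportions of development activities
[Figure: proportions of development activities of core and non-core developers, one panel each for Commit-Based, LOC-Based, and Access-Based.]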
30. Overview of our study of core teams on GitHub
Applicability of the Pareto Principle: more than half of the projects do not follow the Pareto principle
Number of Core Developers: most projects have 15 or fewer core developers
Core and Non-Core Developers Activities: there are no big differences between core and non-core activities
32. Extremely large core teams may be interesting
Number of projects by number of core developers:
Heuristic      ≤15     16-20   21-50   51-100   ≥101
Commit-Based   2,197   98      137     17       47
LOC-Based      2,454   15      13      4        10
Access-Based   1,164   24      24      0        0
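As a rough cross-check, the shares of projects with 15 or fewer core developers can be recomputed from the counts in this table. The sketch below assumes that each share is taken relative to the row total (the Access-Based row sums to 1,212 rather than 2,496); the variable names are illustrative only.

```python
# Project counts per core-team-size bucket, copied from the table above.
# Bucket order: <=15, 16-20, 21-50, 51-100, >=101 core developers.
counts = {
    "Commit-Based": [2197, 98, 137, 17, 47],
    "LOC-Based": [2454, 15, 13, 4, 10],
    "Access-Based": [1164, 24, 24, 0, 0],
}

# Assumption: each share is relative to the total of its own row.
shares = {
    heuristic: round(100 * buckets[0] / sum(buckets))
    for heuristic, buckets in counts.items()
}

for heuristic, share in shares.items():
    print(f"{heuristic}: {share}% of projects have 15 or fewer core developers")
```

Under that assumption, the rounded shares agree with the 88%, 98%, and 96% reported for the three heuristics.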
33. Many projects face a bus factor risk
Share of projects with fewer than 5 core developers (and with exactly 1):
Commit-Based: 43% (Core=1: 8%)
LOC-Based: 81% (Core=1: 24%)
Access-Based: 54% (Core=1: 21%)
In fact, most projects have fewer than 5 core developers
44. Fork
One of the features of GitHub
[Figure: an original repository is forked (cloned) into a fork repository, which sends pull requests back to the original repository.]
47. Data Extraction
(1) Filter projects using GHTorrent
Remove forked repositories.
Remove repositories with fewer than 10 developers.
Remove repositories that are developed outside of GitHub.
8,510,504 repositories -> 4,618 repositories
56. Data Extraction
(5) Filter projects by metrics
Remove repositories with fewer than 10 developers.
Remove repositories with fewer than 1,000 LOC.
4,618 repositories -> 2,496 repositories
Editor's notes
I’m Kazuhiro Yamashita, a PhD student at Kyushu University, Japan.
Today, I would like to talk about my research.
The title of the talk is “Is the Pareto principle applicable to core teams of GitHub projects?”
This is a collaboration between Kyushu University and Queen’s University.
In this study, we focus on core developers and the Pareto principle.
Core developers are developers who play important roles in software development projects.
For example, Nakakoji et al. state that core developers are responsible for guiding and coordinating the development of an OSS project.
On the other hand, Mockus et al. define core developers as the most productive developers who have made roughly 80% of the total contributions.
The definitions differ slightly, but both say that core developers are important.
Given this, core developers are key to the success of OSS projects.
Hence, there are papers that focus on core developers.
This is the agenda of this talk.
First, we look at the definitions of core developers and the Pareto principle.
Next, we show the previous results. Then, we show our research questions, which are derived from the previous results.
After our research questions, we describe our case study. Finally, we conclude this study.
Therefore, there are papers that focus on core developers.
Some papers claim that the size of the core team in a successful project follows the Pareto principle.
Some of these papers argue that the proportion of core developers in OSS projects follows the Pareto principle.
The Pareto principle, also known as the 80-20 rule, states that roughly 80% of the results come from 20% of the causes, as in this figure.
The principle originates from economics, but it has also been applied to various other fields, including software engineering.
In the software development context, such papers claim that 20% of developers produce 80% of artifacts.
As we described, there are papers that claim that the size of the core team in a successful project follows the Pareto principle.
On the other hand, there are papers that claim that the size of the core team does not follow the Pareto principle.
In other words, prior studies have arrived at mixed conclusions about core teams and the Pareto principle.
We suspect that these mixed conclusions arise because the results depend on a small number of case study systems.
In fact, the prior studies used at most 9 OSS projects.
In addition to the Pareto principle, prior studies have also arrived at mixed conclusions about the number of core developers.
Mockus et al. claim that the number of core developers is fewer than 10 or 15, but other papers report different numbers.
For instance, Dinh-Trong et al. showed that 27 to 42 developers account for more than 80% of the contributions in the FreeBSD project.
On the other hand, there is a paper that claims that the proportion of core developers does not follow the Pareto principle.
In addition to the Pareto principle, some papers report the exact number of core developers.
However, the numbers differ from paper to paper.
When we consider why such discrepancies occur, we find that all of the results depend on a small number of case study systems.
From the previous work, we derive research question 1 and its motivation.
In RQ1, we would like to generalize the previous results; in other words, we would like to know whether the proportion of core developers follows the Pareto principle.
Additionally, we would like to know the typical number of core developers.
Therefore, we formulate the research question.
In addition to the size of the core team, Mockus et al. claim that a group larger by an order of magnitude than the core team will repair defects.
From this statement, we assume that non-core developers work more on bug fixing than on implementing new functions.
Therefore, we formulate research question 2 according to this assumption.
The motivation of RQ2 is that we would like to know the proportions of activities of core and non-core developers.
By determining the proportions of activities, we would like to confirm our assumption.
The second research question is that …
From these points, we derived the first part of our study.
In this part, we focus on core team size and study the applicability of the Pareto principle to core developers using GitHub projects.
Prior studies discuss not only the proportions but also the numbers of core developers.
Therefore, we also study the number of core developers in this part.
In the second part of our study, we focus on the activities of core and non-core developers.
This part is also derived from a prior study.
In that study, Mockus et al. state that a group larger by an order of magnitude than the core team will repair defects.
From this statement, we assume that non-core developers work more on fixing bugs than on implementing new functionality.
Hence, we study the activities of core and non-core developers in the second part.
This is an overview of our study.
Now we show the steps for collecting and analyzing GitHub data to study core team activity.
As the common part of both studies, we perform two steps to collect data and identify core developers.
After these two steps, we perform both studies.
In the study of core team size, we calculate the proportion and number of core developers in each project, and then determine whether the proportions follow the Pareto principle.
In the study of activity, we extract the commits of both types of developers, and then classify the commits and compare their activities.
We now explain each step of our study.
First, we show how we filter projects.
In this study, we use GitHub projects as our dataset. Initially, the dataset includes 8.5 million repositories.
However, it also includes forked repositories, duplicates, and immature projects.
To remove such repositories, we preprocess the dataset.
After the preprocessing, 2,496 repositories remain.
We conduct our case study on these 2,496 repositories.
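The filtering criteria named on the Data Extraction slides (no forks, developed on GitHub, at least 10 developers, at least 1,000 LOC) could be sketched as below. This is only an illustration under assumptions: the `Repo` record, its field names, and the sample repositories are hypothetical, and the intermediate extraction steps that the slides do not show are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Repo:
    # Hypothetical record; field names are not taken from GHTorrent.
    name: str
    is_fork: bool
    developed_on_github: bool
    num_developers: int
    loc: int


def keep(repo: Repo, min_developers: int = 10, min_loc: int = 1000) -> bool:
    """Return True if the repository passes all four filtering criteria."""
    return (
        not repo.is_fork
        and repo.developed_on_github
        and repo.num_developers >= min_developers
        and repo.loc >= min_loc
    )


# Made-up sample data, for illustration only.
repos: List[Repo] = [
    Repo("mature", False, True, 25, 40_000),
    Repo("forked", True, True, 25, 40_000),
    Repo("mirror", False, False, 25, 40_000),
    Repo("tiny-team", False, True, 3, 40_000),
    Repo("tiny-code", False, True, 25, 500),
]

selected = [repo.name for repo in repos if keep(repo)]
print(selected)
```

The thresholds are exposed as parameters so that they can be varied when checking how sensitive the resulting dataset is to them.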
Next, we show the heuristics that we use to identify core developers.
In this study, we use three heuristics.
In the commit-based heuristic, we identify core developers using the number of commits made by each developer.
In the LOC-based heuristic, we identify core developers using the number of lines of code changed by each developer.
In the access-based heuristic, we identify core developers using access rights.
With the access-based heuristic, we can identify core developers by whether or not a developer has access rights to the repository.
However, for the commit-based and LOC-based heuristics, we need a way to separate core from non-core developers.
We show the steps to identify core developers with the commit-based heuristic using this example project.
In this project, there are 4 developers, and each has made some commits.
As the first step, we sort the developers by their number of commits in descending order.
After sorting, we calculate the proportion of commits made by each developer.
For example, developer A made 6 of the 10 commits. Hence, the proportion for developer A is 60%.
Finally, we calculate the cumulative proportion and identify the developers who are below the 0.8 cumulative cutoff as core developers.
In this example, developers A and C are core developers, and B and D are non-core developers.
The percentage of core developers in this case is 50%, and the number of core developers is 2.
The LOC-based heuristic follows the same steps as the commit-based heuristic, but it uses LOC instead of the number of commits.
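The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation. Only A's 6 of 10 commits is stated in the talk; the counts for B, C, and D are assumed so that they match the 60%, 20%, 10%, 10% ratios and the stated result. Treating a cumulative share of exactly 0.8 as core is also an assumption, chosen because it reproduces the example.

```python
from typing import Dict, List


def core_developers(contributions: Dict[str, int], cutoff_pct: int = 80) -> List[str]:
    """Return the developers whose cumulative share stays at or below the cutoff."""
    total = sum(contributions.values())
    # Step 1: sort by contribution count in descending order (ties broken by name).
    ranked = sorted(contributions.items(), key=lambda item: (-item[1], item[0]))
    core: List[str] = []
    cumulative = 0
    for name, count in ranked:
        # Steps 2 and 3: accumulate the share and compare it with the cutoff.
        # Integer arithmetic avoids floating-point issues at exactly 0.8.
        cumulative += count
        if cumulative * 100 > cutoff_pct * total:
            break
        core.append(name)
    return core


# Assumed commit counts (only A's 6 of 10 is given in the talk).
commits = {"A": 6, "B": 1, "C": 2, "D": 1}
core = core_developers(commits)
pct_core = 100 * len(core) / len(commits)
print(core, pct_core)
```

The same function could be applied to per-developer LOC counts for the LOC-based heuristic. Note that under this boundary rule a project whose top contributor alone exceeds the cutoff would have no core developers; how such projects are handled is not stated in the talk.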
We identified core and non-core developers in each project.
Now we show the answers to our questions.
These are our two questions.
First, we show the results about core team size.
The questions that we address are: Is the Pareto principle applicable? And what is the typical number of core developers?
This is the corresponding part of the figure.
This slide shows our concrete approach to studying core team size.
To check the applicability of the Pareto principle, we need to define thresholds.
In this study, we use the range from 10% to 30% as the thresholds.
Therefore, the example project that we used to explain the steps of our heuristic does not follow the Pareto principle.
This is because 50% of the developers in the example project are core developers.
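The compliance check described here amounts to a range test on the percentage of core developers. A minimal sketch follows; the function name is hypothetical, and treating both bounds as inclusive is an assumption, since the talk only gives the range.

```python
def follows_pareto(num_core: int, num_developers: int,
                   low_pct: float = 10.0, high_pct: float = 30.0) -> bool:
    """True if the percentage of core developers lies within [low_pct, high_pct]."""
    pct_core = 100.0 * num_core / num_developers
    return low_pct <= pct_core <= high_pct


# Example project from the talk: 2 core developers out of 4 (50%).
print(follows_pareto(2, 4))
# Hypothetical project: 2 core developers out of 10 (20%).
print(follows_pareto(2, 10))
```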
In addition to checking applicability, we stratify projects along the confounding factors to find trends.
We do so because we assume that three factors, LOC, total authors, and project age, may affect the size of the core team.
For example, a project with a small number of total authors tends to have a higher proportion of core developers.
Since the results for all heuristics and confounding factors show similar trends, we show only the results of the commit-based heuristic stratified by LOC.
(Note: make this clear on the slide itself.)
These figures show the results of the commit-based heuristic, stratified by LOC.
The x-axis shows the percentage of core developers and the y-axis shows the number of projects.
From left to right, the figures show the distributions for small, medium, and large LOC projects, respectively.
In each figure, the dotted lines are the thresholds of the Pareto principle.
From the figures, we find that the proportions of core developers are widely spread.
In fact, more than half of the projects are outside the range of the Pareto principle.
Therefore, we conclude that the proportions of core developers do not follow the Pareto principle.
When we check the number of core developers, almost 90% or more of the projects have 15 or fewer core developers.
From the study of core team size, we obtained these results.
Next, we address the second question.
In this study, we focus on the activities of core and non-core developers.
This is the corresponding part of the study in this figure.
To compare the activities, we need to classify the commits.
We first explain the method that we used, and then show the results.
To identify developer activities, we use the method proposed by Hattori and Lanza.
The method classifies commits into four categories using the commit comments.
This table shows the four categories and example keywords.
The Forward Engineering category covers activities that implement new functionality; a representative keyword is “implement”.
The Reengineering category covers modifications to existing code; a representative keyword is “optimize”.
The Corrective Engineering category covers bug fixing activities; a representative keyword is “bug”.
The Management category covers activities that manage the project; a representative keyword is “TODO”.
If no keyword appears in the commit comment, the commit is classified into the Unknown category.
Also, if there is no comment, the commit is classified into the Empty category.
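A keyword-based classifier along these lines might look like the sketch below. It is an illustration only, not Hattori and Lanza's method as published: it uses just the example keywords shown on the slide, and the case-insensitive substring matching and the rule that the first matching category wins are assumptions.

```python
from typing import Optional

# Example keywords from the slide; the complete keyword lists are not given there.
KEYWORDS = {
    "Forward Engineering": ["implement", "add", "request"],
    "Reengineering": ["optimiz", "adjust"],
    "Corrective Engineering": ["bug", "fix", "issue", "error"],
    "Management": ["license", "formatting", "todo"],
}


def classify_commit(comment: Optional[str]) -> str:
    """Classify a commit by the first category whose keyword appears in its comment."""
    if comment is None or not comment.strip():
        return "Empty"
    text = comment.lower()
    # Assumption: categories are checked in the order listed above.
    for category, keywords in KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Unknown"


for message in ["Implement login page", "Optimize query planner",
                "Fix crash on startup", "Update TODO list", "Bump version", ""]:
    print(repr(message), "->", classify_commit(message))
```

Plain substring matching is simple but coarse: for instance, "add" also matches "address". A word-boundary or stemming-based match would be a natural refinement.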
This figure shows the proportions of the categories for each type of developer.
For example, the blue bars show the proportions of the Forward Engineering category and the yellow bars show Corrective Engineering.
Under our assumption, the proportion of corrective engineering activity would be larger for non-core developers.
However, from the figure, we find that there are no big differences in the proportions of corrective engineering.
Furthermore, the other three activities also have similar proportions.
Therefore, we conclude that there are no big differences between core and non-core activities.
Finally, we obtained these results from our study.
Now we discuss some points that follow from our results.
First, we think that extremely large core teams may be interesting.
We think it is natural that the proportions of core developers are widely spread.
However, there are projects in which more than 50% of the developers are core developers.
It may be interesting to find out how such a large number of core developers is coordinated and how this affects project quality.
(Note: replace the figure; use absolute head counts instead of percentages.)
First, we think that extremely large core teams may be interesting.
We think it is natural that the proportions of core developers are widely spread.
However, there are projects that have more than 50 core developers.
It may be interesting to find out how such a large number of core developers is coordinated and how this affects project quality.
Next, we think that many projects face a bus factor risk.
We showed that many projects have 15 or fewer core developers.
In fact, many projects have fewer than 5 core developers.
For example, with the LOC-based heuristic, 81% of projects have fewer than 5 core developers and 24% of projects have only 1 core developer.
From this fact, we assume that many projects face a bus factor risk.
Now we conclude our talk.
First, we showed prior studies and our two questions, which are derived from those studies.
Then, we showed our case study design for addressing the two questions.
From the case study, we found that core team proportions are widely spread and that there are no big differences in the proportions of development activity between core and non-core developers.
That’s all. Thank you.