The network of collaborations in an open source project can reveal relevant emergent properties that influence its prospects of success.
In this work, we analyze open source projects to determine whether they exhibit rich-club behavior, a phenomenon in which contributors with a high number of collaborations (i.e., those strongly connected within the collaboration network) are likely to cooperate with other well-connected individuals. The presence or absence of a rich-club has an impact on the sustainability and robustness of the project.
For this analysis, we build and study a dataset of the 100 most popular projects on GitHub, exploiting connectivity patterns in the graph structure of collaborations that arise from commits, issues and pull requests. Results show that rich-club behavior is present in all the projects, but only a few of them have an evident club structure. We compute coefficients both for single-source graphs and for the overall interaction graph, showing that rich-club behavior varies across different layers of software development. We provide possible explanations of our results, as well as implications for further analysis.
Analyzing rich club behavior in open source projects
1. Analyzing Rich-Club Behavior
in Open Source Projects
OpenSym 2019, the 15th International Symposium on Open Collaboration
Skövde, Sweden
Mattia Gasparini (1), Javier Luis Cànovas Izquierdo (2),
Robert Clarisò (2), Marco Brambilla (1), Jordi Cabot (2)
(1) Politecnico di Milano, (2) Universitat Oberta de Catalunya
2. Introduction
• Git and GitHub data to analyze evolution,
success and management of Open Source
Software.
• Define developers' behavioral patterns.
• Discover how collaborations between
developers work.
4. Rich-club coefficient
• Graph structural property:
It represents the tendency of well-connected nodes (i.e., hubs) to interact with other well-connected nodes.
• Formulation:
$$\phi(k) = \frac{2 E_k}{N_k (N_k - 1)}$$
$$\rho(k) = \frac{\phi(k)}{\phi_{random}(k)}$$
$E_k$: number of edges between nodes of degree greater than or equal to $k$
$N_k$: number of nodes with degree greater than or equal to $k$
$\phi(k)$: rich-club coefficient
$\rho(k)$: normalized rich-club coefficient
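The definition of the rich-club coefficient on this slide can be sketched directly in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function name and the edge-list input format are assumptions, and it follows the slide's "degree greater than or equal to k" convention (library implementations may use a strict inequality instead).

```python
from itertools import combinations


def rich_club_coefficient(edges, k):
    """Return phi(k) = 2 * E_k / (N_k * (N_k - 1)) for an undirected edge list.

    Hypothetical helper: nodes count as club members when degree >= k,
    as stated on the slide. Returns None when fewer than two nodes qualify.
    """
    # Build an adjacency map, ignoring self-loops and duplicate edges.
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # N_k: nodes whose degree is at least k.
    club = {node for node, nbrs in adjacency.items() if len(nbrs) >= k}
    n_k = len(club)
    if n_k < 2:
        return None

    # E_k: edges whose endpoints are both in the club.
    e_k = sum(1 for u, v in combinations(club, 2) if v in adjacency[u])
    return 2 * e_k / (n_k * (n_k - 1))


# A triangle (a, b, c) with one extra low-degree node d attached to a.
example_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")]
print(rich_club_coefficient(example_edges, 2))
```

In the toy graph, the nodes with degree at least 2 are exactly the triangle, so they form a complete subgraph and the coefficient reaches its maximum.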
5. Related Work
• Rich-club phenomenon for a specific project [2],
or for a single FLOSS community [3].
• Study of the presence of a rich-club effect
across the whole GitHub social network [4].
• Analysis on open source communities exploiting
email exchanges among participants [5].
[2] Weifeng Pan, Bing Li, Yutao Ma, and Jing Liu. 2011. Multi-granularity evolution analysis of software using complex network theory
[3] Guido Conaldi. 2010. Flat for the few, steep for the many: Structural cohesion and Rich-Club effect as measures of hierarchy and control in FLOSS communities
[4] Antonio Lima, Luca Rossi, and Mirco Musolesi. 2014. Coding Together at Scale: GitHub as a Collaborative Social Network
[5] Sergi Valverde and Ricard V. Solé. 2007. Self-organization versus hierarchy in open-source social networks
6. Case Study
Top-100 starred projects in 2016 on
GitHub
926K commits produced by 50K Git users
1.3M issues-related events generated by
118K GitHub users
280K pull-request-related events
generated by 20K GitHub users
8. Data Collection &
Preprocessing
• Git repository cloning for
commit data using Gitana
• GitHub activity for issues
and PRs, obtained by querying
GHArchive
• Duplicity and clashing
problem
9. Graphs Construction
• Definition of 4 undirected graphs:
a. PR graph
b. Commits graph
c. Issues graph
d. Supergraph (a + b + c)
• Nodes: users
• Edges connect a pair of users if
they interacted on the same
element (issue, PR, file)
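The construction described on this slide can be sketched as follows. The sketch assumes each interaction is available as a `(user, element_id)` pair; the function name, the sample users and the element identifiers are hypothetical, and the supergraph is taken here as the plain union of the three edge sets.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(interactions):
    """Connect every pair of users who interacted on the same element.

    interactions: iterable of (user, element_id) pairs (assumed format).
    Returns a set of undirected edges stored as sorted (user_a, user_b) tuples.
    """
    users_by_element = defaultdict(set)
    for user, element in interactions:
        users_by_element[element].add(user)

    edges = set()
    for users in users_by_element.values():
        # Sorting gives each undirected edge a single canonical form.
        for u, v in combinations(sorted(users), 2):
            edges.add((u, v))
    return edges


# Hypothetical sample data for the three layers.
pr_edges = build_edges([("alice", "pr-1"), ("bob", "pr-1"), ("carol", "pr-2")])
commit_edges = build_edges([("alice", "src/a.py"), ("carol", "src/a.py")])
issue_edges = build_edges(
    [("alice", "issue-7"), ("bob", "issue-7"), ("carol", "issue-7")]
)

# Supergraph (a + b + c): union of the edges of the three layers.
supergraph_edges = pr_edges | commit_edges | issue_edges
```

Storing edges as sorted tuples in a set keeps the graphs undirected and unweighted, so repeated interactions between the same two users collapse into one edge.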
11. Rich-club Coefficient
Calculation
• Calculation using algorithm
implementation included in
NetworkX [6]
• Normalized coefficient
ρ(k): rich-club effect
relevant if ρ(k) > 1
• Discard networks for which
randomization fails
[6] https://networkx.github.io/documentation/stable/reference/algorithms/rich_club.html
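A calculation along these lines might look like the sketch below, which calls `networkx.rich_club_coefficient` with `normalized=True` and returns `None` when the randomization step does not succeed. The wrapper name, the choice of `Q`, the fixed seed and the decision to treat a division by zero as a failure are assumptions made for illustration, not details taken from the study.

```python
import networkx as nx


def max_normalized_rich_club(graph, swaps_per_edge=10, seed=42):
    """Return the maximum normalized coefficient rho(k), or None on failure.

    Hypothetical wrapper: a network is discarded (None) when the random
    rewiring used for normalization fails or yields an undefined ratio.
    """
    # Work on a simple undirected copy without self-loops.
    simple = nx.Graph(graph)
    simple.remove_edges_from(list(nx.selfloop_edges(simple)))

    try:
        rho = nx.rich_club_coefficient(
            simple, normalized=True, Q=swaps_per_edge, seed=seed
        )
    except (nx.NetworkXException, ZeroDivisionError):
        # Rewiring failed, or the randomized graph had a zero coefficient.
        return None

    return max(rho.values()) if rho else None


# A graph with fewer than four nodes cannot be rewired, so it is discarded.
print(max_normalized_rich_club(nx.complete_graph(3)))
```

A returned value above 1 would indicate, for at least one degree k, more connections among high-degree nodes than in the degree-preserving random baseline.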
12. Rich-club Coefficient
Results
• 60 projects have a defined
coefficient for the
supergraph.
• Each graph presents a rich-club
effect, since ρ(k) > 1
for some k
18. Maximum coefficient distribution
• Distribution of the maximum
rich-club coefficient for each
type of graph across the studied
projects.
• Mean value around 1 for issues
and commits graph
coefficients: weak rich-club
presence.
• Mean value around 1.4 for PR
graph coefficients: strong
rich-club presence.
Further insights
19. Multi-club users
• 25 of the 60 projects present a set
of users belonging to multiple rich-clubs.
• Distribution of multi-club users
across the 25 projects.
• These developers form a community with
strong influence at each level of the
project.
Further insights
20. Conclusions
First systematic evaluation of rich-club
behavior in open source projects:
• 60% of the projects show rich-clubs in the
supergraph, mostly with a slight effect.
• Rich-club behavior could undermine the open
paradigm, but the phenomenon requires further
analysis.
• Strong rich-club presence in PR graphs may
stem from the criticality of the activity.
• 25 of the 60 projects have users belonging to
multiple rich-clubs.
GitHub is the most popular service for developing and maintaining open source software. Each user interacts with many other users during project development (commits, issues, pull requests), which defines collaboration networks. Studying collaboration networks helps in discovering properties and behaviors that influence the development, management and success of an OSS project.
In a rich-club, developers collaborate mostly with the same fixed subset of other important colleagues instead of spreading their cooperation across the whole team.
Formally, it can be measured by the so-called rich-club coefficient ϕ(k).
Intuitively, ϕ(k) measures how far the set of nodes with degree at least k is from being a complete subgraph. The value of ϕ(k) ranges from 0 (all nodes are disconnected) to 1 (a clique), with higher values showing a stronger rich-club behavior in the network. It tends to increase with k even in random networks, so a normalized coefficient has been introduced in the literature: ϕ(k) is divided by the coefficient calculated for a random network with the same degree distribution as the original one.
The presence or absence of rich-clubs in open source projects has not been studied systematically, nor on a dataset as large as the one that GitHub can now provide.
Clashing: same name of different users
Duplicity: different names for the same users
Solution: use SHA value to associate git commits to GitHub users (if still present)
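The idea of using the commit SHA to tie Git authors to GitHub users can be illustrated with the sketch below. It assumes a lookup table from commit SHA to GitHub login is already available; the function name, the tagged-tuple identity format and the fallback to the Git author name when the SHA is no longer present are all assumptions made for illustration.

```python
def resolve_authors(git_commits, login_by_sha):
    """Map each commit to one identity, preferring the GitHub login.

    git_commits: iterable of (sha, git_author_name) pairs (assumed format).
    login_by_sha: dict from commit SHA to GitHub login (assumed available).
    """
    resolved = []
    for sha, author_name in git_commits:
        login = login_by_sha.get(sha)
        if login is not None:
            # The SHA is known on GitHub: the login identifies the user.
            identity = ("github", login)
        else:
            # Assumed fallback: keep the Git name, tagged so that it can
            # never be confused with a GitHub login.
            identity = ("git", author_name)
        resolved.append((sha, identity))
    return resolved


# Hypothetical data: two spellings of one user, and one name shared by two users.
commits = [("a1", "J. Doe"), ("b2", "John Doe"), ("c3", "Alex"), ("d4", "Alex")]
logins = {"a1": "jdoe", "b2": "jdoe", "c3": "alex-one", "d4": "alex-two"}
identities = dict(resolve_authors(commits, logins))
```

In this example the duplicity case (two names, one user) collapses to a single identity, while the clashing case (one name, two users) stays separated.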
Two users are connected in the PR graph if they commented/interacted on the same PR…
The rich-club coefficient is calculated for each project's supergraph to obtain a global view of the effect. The maximum value for each project is shown: each of the 60 graphs presents rich-club behavior, even if most of them have values only slightly higher than 1. For this reason, we want to better understand the correspondence between the coefficient and the actual graphs.
The first example that we take is the materialize repository: the rich-club coefficient is presented as a function of node degree. A rich-club behavior can be noticed for a range of degrees, with a peak at k=49, which should correspond to groups of nodes with degree at least 49 connected to each other.
This seems to go against the open source paradigm: the project is "owned" by a few users.
The project was established in 2014 by a team of 4 developers and has 3,853 commits and 252 contributors. Nevertheless, it has only two top contributors (more than 1,000 commits each), both of whom belong to the original team, and no frequent contributors.
Mixed behavior: the coefficient is slightly above 1, then drops dramatically. The overall intuition is that the graph does not present rich-clubs.
It was publicly announced by Apple in 2014 and was later open sourced in December 2015. Currently, the project has more than 84k commits and 674 contributors, with 14 top contributors (more than 1,000 commits) and 44 frequent contributors (between 100 and 1,000 commits). Remarkably, 4 of the top contributors and 21 of the frequent contributors do not belong to Apple according to their GitHub profiles. This is a sign that the project has successfully attracted and retained external talent.
This table presents the 10 projects with the highest coefficient for the supergraph. Alongside them, the coefficient for the other kinds of graphs is reported when it could be calculated. In fact, these other graphs can also "hide" further club structures.
As a further insight, we show the distribution of the maximum coefficient for each kind of graph. The blue line is the one already discussed. The green and orange lines show the distribution of the maximum coefficient for commits and issues: the density peaks at 1, meaning that most of the graphs do not present strong rich-clubs. The red line peaks around 1.4: most of the projects present evident rich-club structures. This behavior could be related to the fact that PRs are the most critical level in open source software development and a few trustworthy developers are in charge of most of the tasks.
We also focused our attention on the users: in 25 of the 60 projects (about 42%), there are users that belong to multiple clubs. The distribution presents the number of users shared across all the project's clubs: this means that, on average, 7 developers are in the PR club as well as in the commits and issues clubs. These developers form a sub-community inside the project that has strong influence at all of the project's levels.
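Finding users who sit in more than one club amounts to counting club memberships per user, as in the sketch below. It assumes the members of each layer's rich-club have already been extracted as sets; the function name, the `min_clubs` threshold and the sample users are hypothetical.

```python
from collections import Counter


def multi_club_users(clubs_by_layer, min_clubs=2):
    """Return the users who belong to at least `min_clubs` rich-clubs.

    clubs_by_layer: dict from layer name to a collection of club members
    (assumed to be computed beforehand for each graph of a project).
    """
    # Count in how many layers each user appears as a club member.
    memberships = Counter(
        user for members in clubs_by_layer.values() for user in set(members)
    )
    return {user for user, count in memberships.items() if count >= min_clubs}


# Hypothetical clubs for one project.
clubs = {
    "pr": {"alice", "bob"},
    "commits": {"alice", "carol"},
    "issues": {"alice", "bob", "dave"},
}
print(sorted(multi_club_users(clubs)))               # users in two or more clubs
print(sorted(multi_club_users(clubs, min_clubs=3)))  # users in all three clubs
```

Setting `min_clubs` to the number of layers restricts the result to users present in the PR, commits and issues clubs at the same time.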
Since the rich-club phenomenon is quite complex and its application to OSS communities is relatively new, plenty of further work can be done. First, we want to apply the weighted version of the coefficient to check whether other patterns arise. Second, we want to extend the analysis to the module and ecosystem levels. Third, we want to introduce a time variable: in this work the graphs are built from the entire data as a single 1-year snapshot, but it is possible to build monthly graphs and check whether temporal clubs show up.
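The monthly variant mentioned above could start by bucketing interactions by calendar month before building one graph per bucket. The sketch below only illustrates that first step, under the assumption that each event carries an ISO 8601 timestamp; the function name and the event format are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def split_by_month(events):
    """Group (user, element_id) interactions by (year, month).

    events: iterable of (iso_timestamp, user, element_id) tuples
    (assumed format, e.g. "2016-01-15T10:00:00").
    """
    monthly = defaultdict(list)
    for timestamp, user, element in events:
        moment = datetime.fromisoformat(timestamp)
        monthly[(moment.year, moment.month)].append((user, element))
    return dict(monthly)


# Hypothetical events spanning two months.
sample_events = [
    ("2016-01-15T10:00:00", "alice", "pr-1"),
    ("2016-01-20T09:30:00", "bob", "pr-1"),
    ("2016-02-01T00:00:00", "carol", "issue-2"),
]
monthly_interactions = split_by_month(sample_events)
```

Each monthly bucket could then be turned into its own graph and analyzed separately to see whether clubs appear or dissolve over time.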
With this, I have concluded the presentation. Thank you for the attention.