Presentation by Natarajan Chidambaram during the International ICSE Workshop on Bots in Software Engineering (BotSE 2023) in Australia. Joint work with Mehdi Golzadeh, Tom Mens, Alexandre Decan of the Software Engineering Lab of the University of Mons and with Eleni Constantinou.
Recognising bot activity in collaborative software development
1. Recognizing bot activity in collaborative software development
Natarajan Chidambaram
Software Engineering Lab, University of Mons, Belgium
Supported by Service public de Wallonie – Recherche under grant n°2010235 “ARIAC BY DIGITALWALLONIA4.AI”
and Fonds de la Recherche Scientifique – FNRS under grant numbers F.4515.23, O.0157.18F-RG43 and T.0017.18
SECO-ASSIST
2. M. Golzadeh, T. Mens, A. Decan, E. Constantinou and N. Chidambaram, "Recognizing Bot Activity in Collaborative Software Development," in IEEE Software, vol. 39, no. 5, pp. 56-61, Sept.-Oct. 2022, doi: 10.1109/MS.2022.3178601.
6. Detecting bots, why?
• Recognise and accredit project contributors
• Which types of contributions to consider?
• How to identify the contributors?
• How to measure contribution effort?
• Find and hire experts
• Understand and improve the project development process
• Avoid bias in socio-technical and bot-based studies
7. Prevalence of Bots in GitHub
• 1 out of 10 top contributors is a bot
• 12 out of 21 are not marked as [bot] by GitHub
9. M. Golzadeh, A. Decan and N. Chidambaram, "On the Accuracy of Bot Detection Techniques," 2022 IEEE/ACM 4th International Workshop on Bots in Software Engineering (BotSE),
2022, pp. 1-5, doi: 10.1145/3528228.3528406.
10. Accuracy of Bot Identification
Contributor type for 540 contributors present in 27 GitHub projects
13. N. Chidambaram, A. Decan, T. Mens, A dataset of bot and human activities in GitHub, in: International Conference on Mining Software Repositories (MSR), IEEE, 2023.
14. GitHub Events API: can retrieve the latest 300 events in the last 90 days
Mapping from event types to activity types (examples):
• IssuesEvent, action "opened" → Opening issue; "closed" → Closing issue; "reopened" → Reopening issue
• IssuesEvent + IssueCommentEvent → Closing issue (when the issue is closed with a comment)
• CreateEvent → Creating repository, Creating branch or Creating tag
15.
                # contributors   # activities
Bot dataset          385            649,755
Human dataset        616            184,056
Total              1,001            833,811
• 834K activities obtained from 1M+ events
• 24 activity types
• 1K contributors
• 105 days (25 Nov 2022 - 9 Mar 2023)
{
  "date": "2022-11-26T14:13:19+00:00",
  "activity": "Commenting issue",
  "contributor": "kubevirt-bot",
  "repository": "kubevirt/kubevirt",
  "comment": {
    "length": 255,
    "GH_node": "IC_kwDOBJIk985PKH4s"
  },
  "issue": {
    "id": 8294,
    "title": "SRIOV VF interface not found in VM",
    "created_at": "2022-08-13T11:10:06+00:00",
    "status": "open",
    "closed_at": null,
    "resolved": false,
    "GH_node": "I_kwDOBJIk985Pvz5k"
  },
  "conversation": {
    "comments": 9
  }
}
JSON format
16. Usefulness of the Dataset
• Analyse most frequent activities
• Find differences in behaviour between bots and humans
• Forecast future contributor activities
• Distinguish bot and human contributors
• Develop a new bot detection technique
Hello everyone. I am Natarajan Chidambaram, a PhD student in the Software Engineering Lab at the University of Mons, Belgium. Welcome to this presentation on "Recognizing bot activity in collaborative software development".
In this presentation, I will explain the work we published in the IEEE Software special issue. The study's main message is that bot activities should be recognised in collaborative software development.
Beyond that paper, I will also present the results we obtained in our follow-up studies on this topic.
This is an example of a contributor creating a pull request, and we can see a GitHub App commenting under it. Here, the GitHub API marks this contributor as [bot].
So, this brings us to the importance of detecting bots in GitHub repositories. First, from the organisational point of view, we want to recognise and accredit project contributors; the challenges are which types of contributions to consider, how to identify the contributors, and how to measure contribution effort. We also want to find and hire experts at performing certain tasks. Second, from the researcher's point of view, we need to understand and improve the project development process, and to avoid bias in socio-technical and bot-based studies.
To analyse the prevalence of bots in GitHub, we considered 10 large open-source projects used for developing programming languages such as Java, JavaScript, Python and Rust. Each row in the figure represents a project, and we rank the contributors by the number of commits they made to the project. The blue boxes are bots identified by the GitHub API, the black boxes are bots that we detected manually and that are not identified by the GitHub API, and the other boxes are human contributors. Contributors highlighted with a black border are responsible for at least 1% of the total commits in the repository. We can draw two clear inferences. First, 10% of the top contributors are bots. Second, more than half of these bots are not marked as [bot] by GitHub. With this many bots among the top contributors to a repository, human contributors might lose motivation, since their efforts are not acknowledged; in the last row, for example, bots are the top two contributors. So, we use bot identification tools to identify these bots.
These are two bot identification techniques developed in our lab. BoDeGHa detects bots based on their issue and pull request comments within a repository, while BoDeGiC identifies bots based on their commit messages within a repository.
In further work, we evaluated the accuracy of existing bot identification techniques.
We considered the top 20 contributors, in terms of commits, in 27 popular GitHub projects. As mentioned earlier, BoDeGiC is a bot identification tool that works on commit messages; "list of bots" is a curated list of known bot accounts on GitHub; the "bot" suffix heuristic checks whether the contributor's name ends in "bot"; BoDeGHa is another bot identification tool; and "GitHub account type" is the account type reported by the GitHub API. None of the tools detects bots perfectly. The contributors on the left side of the line are bots, and those on the right are humans. The list of bots, the "bot" suffix and the GitHub account type did not classify any human as a bot, but they classified many bots as humans. Very few accounts were classified as bots by all the tools.
As none of the tools is perfect, we hypothesised that an ensemble model combining all these tools and methods would improve bot identification. We developed such a model, named EnsBod, and it performs better than each of the individual bot identification techniques. There are other bot identification tools, such as BIMAN, which considers the "bot" string at the end of the contributor's name, patterns in commit messages, and features related to the files changed in commits. Another tool, BotHunter, was not available at the time of this study; it is also an ensemble of BoDeGHa, BoDeGiC and BIMAN with some additional features.
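To give an idea of how an ensemble of bot detectors can combine the individual tools, here is a minimal majority-voting sketch. This is illustrative only: `ensemble_vote` and the per-tool labels are hypothetical names, and EnsBod's actual combination strategy is more sophisticated than plain voting.

```python
from collections import Counter

def ensemble_vote(predictions):
    """Combine per-tool predictions ('bot', 'human', or None when a tool
    has too little activity to decide) by majority vote over the tools
    that did not abstain."""
    votes = Counter(p for p in predictions.values() if p is not None)
    if not votes:
        return "unknown"  # every tool abstained
    # Break ties in favour of 'human' to keep false positives low.
    return "bot" if votes["bot"] > votes["human"] else "human"

# Hypothetical per-tool outputs for one contributor:
print(ensemble_vote({
    "BoDeGHa": "bot",
    "BoDeGiC": None,        # not enough commit messages
    "bot_suffix": "human",  # name does not end in 'bot'
    "list_of_bots": "bot",
}))  # → bot
```

Abstentions matter here: a tool that lacks data should not drag the vote toward "human", which is why `None` predictions are filtered out before counting.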
Some bot contributors could not be detected by any of the tools and methods: in the circle, we can see that these bots are either marked as humans or have too little activity to reach a decision. Even EnsBod cannot identify them. This is because the tools consider only a limited set of activities, such as commenting. These unknown contributors can be active in other activities, such as publishing a release or performing code reviews. So, by considering all the activities these contributors perform in software repositories, we might be able to detect bots more effectively.
So, we built such a dataset of contributor activities (forking a repository, creating a tag, deleting a branch, publishing a release, and so on) that can be used for further analysis.
To get the contributor activity data, we depend on GitHub events API. Through this API we can retrieve the latest 300 events that the contributor has performed in the last 90 days. So, to collect all the contributor events that can be used for the analysis, we queried the API at regular intervals. Here is an example of a GitHub event type IssuesEvent. The action value in the payload determines the activity that the contributor is performing. The action can be closed for Closing issue, opened for opening issue and reopened for reopening an issue. Although these are completely different activities, they are reported under the same Event type in GitHub Events. So, we created a list of contributors and created a dataset of all their ACTIVITIES in GtiHub. This is not a one to one mapping, we identified the activity types from a single or a combination of events. One event type, CreateEvent can lead to three different activities. Whereas on the other hand, for Closing Issue, If it is just closed, then only IssuesEvent will be triggered, but if it is closed with a comment then 2 events would be triggered, IssuesEvent and IssueCommentEvent. Also, depending on the payload the activity changes for the same combination of event types.
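The single-event part of this mapping can be sketched as a small dispatch function. The event shapes below match what the GitHub Events API returns (`type` plus a `payload` with `action` or `ref_type`), but the function name and the handling of only two event types are illustrative; the full mapping covers 24 activity types and also combines multiple events, which is not shown here.

```python
def activity_from_event(event):
    """Map one raw GitHub event (as returned by the Events API) to an
    activity type. Returns None for event types not covered by this
    sketch or for unrecognised payloads."""
    payload = event.get("payload", {})
    if event["type"] == "IssuesEvent":
        # The payload's "action" value decides which activity this is.
        return {"opened": "Opening issue",
                "closed": "Closing issue",
                "reopened": "Reopening issue"}.get(payload.get("action"))
    if event["type"] == "CreateEvent":
        # One event type can yield three different activities,
        # depending on what was created.
        return {"repository": "Creating repository",
                "branch": "Creating branch",
                "tag": "Creating tag"}.get(payload.get("ref_type"))
    return None

print(activity_from_event(
    {"type": "CreateEvent", "payload": {"ref_type": "tag"}}))
# → Creating tag
```

The combined cases (e.g. closing an issue with a comment, which fires both an IssuesEvent and an IssueCommentEvent) would require looking at a window of events rather than a single one.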
To quantify our dataset: we identified 834 thousand activities from more than 1 million events, covering 24 different activity types performed by 1,000 contributors over 105 days. Earlier I mentioned that the GitHub Events API only provides a contributor's latest 300 events from the last 90 days, but our dataset contains ALL THE ACTIVITIES performed by these contributors over 105 days, which can no longer be obtained through the API. We have two datasets: one for the activities performed by 385 bots and another for the activities performed by 616 human contributors. On the right, we can see an extract of one activity from our dataset, in JSON format. For each activity we provide the first four fields: the date of the activity, the activity type, the contributor who performed it and the repository in which it was performed. The other fields, such as comment, issue and conversation, are specific to the activity type "Commenting issue" and vary for other activity types.
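Working with the dataset is then a matter of iterating over its JSON records. A minimal sketch, assuming each record carries the four common fields shown in the extract (the abridged in-memory records and the function name are illustrative, not the dataset's exact file layout):

```python
from collections import Counter

# Two abridged records in the dataset's JSON shape (common fields only):
records = [
    {"date": "2022-11-26T14:13:19+00:00", "activity": "Commenting issue",
     "contributor": "kubevirt-bot", "repository": "kubevirt/kubevirt"},
    {"date": "2022-11-27T09:00:00+00:00", "activity": "Closing issue",
     "contributor": "kubevirt-bot", "repository": "kubevirt/kubevirt"},
]

def activities_per_type(records):
    """Tally how often each activity type occurs in a list of records."""
    return Counter(r["activity"] for r in records)

print(activities_per_type(records))
```

The same one-liner pattern extends to counting per contributor or per repository by changing the key extracted from each record.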
There are various use cases for this dataset. On the descriptive analysis side, one can analyse the most frequent activities and find behavioural differences between bots and humans based on their activities. On the machine learning side, one can forecast future contributor activities, train a model that distinguishes bot and human contributors, and develop a new bot identification technique.
In ongoing work using this dataset, we statistically identified some distinguishing features between bots and humans based on their activities: mainly the time these contributors take to switch between repositories, the dispersion of activity types across repositories, the number of activity types, and the variation in activity frequency.
These preliminary insights have been accepted at SATToSE. Next, we plan to identify more distinguishing features between bots and humans, train and validate a model that can identify bot contributors, and develop a tool for this purpose.
To summarise: first we saw the prevalence of bots in GitHub and highlighted the lack of bot identification by GitHub; then we evaluated the performance of existing bot identification techniques and found that they are imperfect because they do not consider all the activities contributors perform in software projects. So, we built an activity dataset and found some distinguishing features between bots and humans, which can be used to develop a new bot identification technique.