Recognizing bot activity in collaborative
software development
Natarajan Chidambaram
Software Engineering Lab, University of Mons, Belgium
Supported by Service public de Wallonie – Recherche under grant n°2010235 “ARIAC BY DIGITALWALLONIA4.AI”
and Fonds de la Recherche Scientifique – FNRS under grant numbers F.4515.23, O.0157.18F-RG43 and T.0017.18
M. Golzadeh, T. Mens, A. Decan, E. Constantinou and N. Chidambaram, "Recognizing Bot Activity in Collaborative Software Development," in IEEE Software, vol. 39, no. 5, pp. 56-61, Sept.-Oct. 2022, doi: 10.1109/MS.2022.3178601.
https://doi.org/10.1145/3528228.3528406
https://doi.org/10.1109/MS.2022.3178601
Detecting bots, why?
• Recognise and accredit project contributors
• Which types of contributions to consider?
• How to identify the contributors?
• How to measure contribution effort?
• Find and hire experts
• Understand and improve the project development process
• Avoid bias in socio-technical and bot-based studies
Prevalence of Bots in GitHub
• 1 out of 10 top contributors is a bot
• 12 out of 21 are not marked as [bot] by GitHub
BoDeGHa BoDeGiC
M. Golzadeh, A. Decan and N. Chidambaram, "On the Accuracy of Bot Detection Techniques," 2022 IEEE/ACM 4th International Workshop on Bots in Software Engineering (BotSE),
2022, pp. 1-5, doi: 10.1145/3528228.3528406.
Accuracy of bot Identification
Contributor type for 540 contributors present in 27 GitHub projects
Accuracy of bot Identification
Bot Identification
N. Chidambaram, A. Decan, T. Mens, A dataset of bot and human activities in GitHub, in: International Conference on Mining Software Repositories (MSR), IEEE, 2023.
GitHub Events API:
Can retrieve the latest 300 events in the last 90 days
Event types and the activities derived from them (from the figure):
• CreateEvent: Creating repository, Creating branch, Creating tag
• IssuesEvent: Opening issue, Closing issue, Reopening issue (action: reopened)
• IssuesEvent + IssueCommentEvent (action: created): Closing issue
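The event-to-activity idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `event_to_activity` and the lookup tables are hypothetical, and the sketch handles single events only, not combinations such as IssuesEvent plus IssueCommentEvent. It assumes the GitHub event payload exposes `action` for IssuesEvent and `ref_type` for CreateEvent.

```python
from typing import Optional

# Activity names follow the slide; the lookup tables are a simplified assumption.
ISSUE_ACTIONS = {
    "opened": "Opening issue",
    "closed": "Closing issue",
    "reopened": "Reopening issue",
}
CREATE_REF_TYPES = {
    "repository": "Creating repository",
    "branch": "Creating branch",
    "tag": "Creating tag",
}


def event_to_activity(event: dict) -> Optional[str]:
    """Map one GitHub event to an activity name, or None if it is not covered."""
    payload = event.get("payload", {})
    if event.get("type") == "IssuesEvent":
        return ISSUE_ACTIONS.get(payload.get("action"))
    if event.get("type") == "CreateEvent":
        return CREATE_REF_TYPES.get(payload.get("ref_type"))
    return None


# Hypothetical, heavily trimmed events for illustration only.
events = [
    {"type": "IssuesEvent", "payload": {"action": "closed"}},
    {"type": "CreateEvent", "payload": {"ref_type": "tag"}},
    {"type": "WatchEvent", "payload": {"action": "started"}},
]
activities = [event_to_activity(e) for e in events]
```

A fuller mapping would also need to group events that belong together (for example, an issue closed with a comment), which this sketch deliberately leaves out.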
                # contributors    # activities
Bot dataset                385         649,755
Human dataset              616         184,056
Total                    1,001         833,811
• 834K activities obtained from 1M+ events
• 24 activity types
• 1K contributors
• 105 days (25 Nov 2022 - 9 Mar 2023)
{
  "date": "2022-11-26T14:13:19+00:00",
  "activity": "Commenting issue",
  "contributor": "kubevirt-bot",
  "repository": "kubevirt/kubevirt",
  "comment": {
    "length": 255,
    "GH_node": "IC_kwDOBJIk985PKH4s"
  },
  "issue": {
    "id": 8294,
    "title": "SRIOV VF interface not found in VM",
    "created_at": "2022-08-13T11:10:06+00:00",
    "status": "open",
    "closed_at": null,
    "resolved": false,
    "GH_node": "I_kwDOBJIk985Pvz5k"
  },
  "conversation": {
    "comments": 9
  }
}
JSON format
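As a rough illustration of how records in this format could be consumed, the sketch below counts activities per contributor using only the four common fields (date, activity, contributor, repository). It assumes the records are available as a JSON array, which the slides do not state; the second sample record and the helper name `activity_counts` are hypothetical.

```python
import json
from collections import Counter

# Hypothetical sample: the first record reuses the common fields from the
# slide, the second one is invented for illustration.
raw = """
[
  {"date": "2022-11-26T14:13:19+00:00", "activity": "Commenting issue",
   "contributor": "kubevirt-bot", "repository": "kubevirt/kubevirt"},
  {"date": "2022-11-26T15:00:00+00:00", "activity": "Closing issue",
   "contributor": "kubevirt-bot", "repository": "kubevirt/kubevirt"}
]
"""


def activity_counts(records):
    """Return {contributor: Counter of activity types} for a list of records."""
    counts = {}
    for record in records:
        per_contributor = counts.setdefault(record["contributor"], Counter())
        per_contributor[record["activity"]] += 1
    return counts


records = json.loads(raw)
counts = activity_counts(records)
```

Activity-specific fields such as comment, issue and conversation are ignored here because they differ between activity types.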
Usefulness of the Dataset
• Analyse most frequent activities
• Find differences in behaviour between bots and humans
• Forecast future contributor activities
• Distinguish bot and human contributors
• Develop a new bot detection technique
Some Distinguishing Features
• Number of activity types
• Variation in activity frequency
Some Distinguishing Features
• Hours to shift between repositories
• Dispersion of activity types across repositories
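Two of these features can be sketched from activity records as follows. The slides do not give exact definitions, so the ones used here are assumptions: the number of activity types is the count of distinct activity values, and the hours to shift between repositories is the median time gap between consecutive activities that occur in different repositories. The function names and the sample records are hypothetical.

```python
from datetime import datetime
from statistics import median


def activity_type_count(records):
    """Number of distinct activity types performed by one contributor."""
    return len({r["activity"] for r in records})


def median_hours_between_repo_switches(records):
    """Median gap in hours between consecutive activities in different repositories."""
    ordered = sorted(records, key=lambda r: datetime.fromisoformat(r["date"]))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if prev["repository"] != cur["repository"]:
            delta = datetime.fromisoformat(cur["date"]) - datetime.fromisoformat(prev["date"])
            gaps.append(delta.total_seconds() / 3600)
    return median(gaps) if gaps else None


# Hypothetical records for a single contributor.
sample = [
    {"date": "2022-11-26T10:00:00+00:00", "activity": "Opening issue", "repository": "a/x"},
    {"date": "2022-11-26T11:00:00+00:00", "activity": "Commenting issue", "repository": "a/x"},
    {"date": "2022-11-26T13:00:00+00:00", "activity": "Commenting issue", "repository": "b/y"},
    {"date": "2022-11-26T14:00:00+00:00", "activity": "Closing issue", "repository": "a/x"},
]
n_types = activity_type_count(sample)
switch_hours = median_hours_between_repo_switches(sample)
```

Features like these could then be compared between the bot and human groups, or fed to a classifier; which statistic works best is left open here.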
SATToSE 2023

Editor's notes

  1. Hello everyone. I am Natarajan Chidambaram, a PhD student in the Software Engineering Lab at the University of Mons, Belgium. Welcome to this presentation on the topic “Recognizing bot..”.
  2. In this presentation, I am going to explain the work reported in this research paper, which was published in an IEEE special issue. The main aim of the study is to highlight that bot activity should be recognised in collaborative software development.
  3. Apart from presenting the work done in that one research paper, I will also talk about the results we obtained in follow-up studies.
  4. This is an example of a contributor creating a pull request, with a GitHub App commenting under it. Here, we can see that the GitHub API marks this contributor as [bot].
  5. This brings us to the importance of detecting bots in GitHub repositories. First, from the organisation's point of view, it is needed to recognise and accredit project contributors, but the challenge is which.., how.., how.., and then to find and hire experts for certain tasks. Second, from the researcher's point of view, we need to understand.., avoid bias…
  6. To analyse the prevalence of bots in GitHub, we considered 10 large open-source projects used for developing programming languages such as Java, JavaScript, Python and Rust. Each row in the figure represents a project, and contributors are ranked by the number of commits they made in the project. The blue boxes are bots identified by the GitHub API; the black boxes are bots that were detected manually and not identified by the GitHub API. The other boxes are human contributors. Contributors highlighted with a black border are responsible for at least 1% of the total commits in the repository. We can clearly draw two inferences here. First, 10% of the top contributors are bots. Second, more than half of these bots are not marked as bots by GitHub. With this many bots among the top contributors to a repository, human contributors might lose motivation because their efforts are not acknowledged. For example, in the last row, bots are the top 2 contributors. So, we use bot identification tools to identify these bots.
  7. These are the two bot identification tools that were developed in our lab. BoDeGHa detects bots that are involved in issue and pull request commenting activity within a repository; BoDeGiC identifies bots from their commit messages within a repository.
  8. In further work, we evaluated the accuracy of existing bot identification techniques.
  9. We considered the top 20 contributors in terms of commits in 27 popular GitHub projects. As mentioned earlier, BoDeGiC is a bot identification tool that works on commit messages; “list of bots” is a list of bot contributors present in GitHub; “bot” suffix checks for the string “bot” at the end of the contributor’s name; BoDeGHa is another bot identification tool; and GitHub account type is the account type provided by the GitHub API. None of the tools detects bots perfectly. The contributors to the left of the line are bots and the contributors to the right are humans. List of bots, “bot” suffix and GitHub account type did not classify any human as a bot, but they classified many bots as humans. Very few accounts are classified as bots by all the tools.
  10. As none of the tools is perfect, we thought that an ensemble model combining all these tools and methods would improve bot identification. So, we developed such a model, named EnsBod. The ensemble model seems to work better than each of the individual bot identification techniques. There are other bot identification tools, such as BIMAN, which considers the “bot” string at the end of the contributor’s name, patterns in commit messages and features related to the files changed in commits. Another tool, BotHunter, was not available at the time of this study. It is also an ensemble model of BoDeGHa, BoDeGiC and BIMAN, with some additional features.
  11. Some bot contributors could not be detected by any of the tools and methods. In the circle, we can see that these bots are either marked as humans or do not have enough activity for a decision to be made. Even EnsBod cannot identify these bots. This is because the tools consider only a limited set of activities, such as commenting. The unknown contributors may be active in other activities, such as publishing a release, performing code review and so on. So, by considering all the activities that these contributors perform in software repositories, we might be able to detect bots more effectively.
  12. So, we developed a dataset of such contributor activities, for example forking a repository, creating a tag, deleting a branch, publishing a release and so on, which can be used for further analysis.
  13. To get the contributor activity data, we depend on the GitHub Events API. Through this API we can retrieve the latest 300 events that a contributor has performed in the last 90 days. So, to collect all the contributor events needed for the analysis, we queried the API at regular intervals. Here is an example for the GitHub event type IssuesEvent. The action value in the payload determines the activity that the contributor is performing: closed for closing an issue, opened for opening an issue and reopened for reopening an issue. Although these are completely different activities, they are reported under the same event type in GitHub Events. So, we created a list of contributors and built a dataset of all their ACTIVITIES in GitHub. This is not a one-to-one mapping: we identified the activity types from a single event or a combination of events. One event type, CreateEvent, can lead to three different activities. On the other hand, for Closing issue, if the issue is just closed, only an IssuesEvent is triggered, but if it is closed with a comment, two events are triggered: IssuesEvent and IssueCommentEvent. Also, for the same combination of event types, the activity changes depending on the payload.
  14. To quantify our dataset: we identified 834 thousand activities from more than 1 million events. It contains 24 different activity types, performed by 1000 contributors over a period of 105 days. Earlier I mentioned that the GitHub Events API provides only the latest 300 events of a contributor in the last 90 days, but our dataset contains ALL THE ACTIVITIES performed by these contributors over 105 days, which cannot be obtained through the API right now. We have two datasets: one for the activities performed by 385 bots and another for the activities performed by 616 human contributors. On the right, we can see an extract of one activity from our dataset. The data is in JSON format. For each activity we provide the first 4 fields: the date of the activity, the activity type, the contributor who performed the activity and the repository in which it was performed. The other fields, such as comment, issue and conversation, are specific to this activity type, “Commenting issue”, and vary for other activity types.
  15. This dataset has various use cases. On the descriptive analysis side, one can analyse the most frequent activities and find differences in behaviour between bots and humans based on their activities. On the machine learning side, one can forecast future contributor activities, develop a model that distinguishes bot and human contributors, and develop a new bot identification technique.
  16. Further distinguishing features are the number of activity types and the variation in activity frequency.
  17. In ongoing work using this dataset, we statistically identified some distinguishing features between bots and humans based on their activities. They are mainly the time taken by contributors to shift between repositories and the dispersion of activity types across repositories.
  18. These preliminary insights have been accepted at SATToSE. In future, we might identify more distinguishing features between bots and humans, train and validate a model that can identify bot contributors, and develop a tool that can be used for this purpose.
  19. To summarise: first, we saw the prevalence of bots in GitHub and highlighted the lack of bot identification by GitHub. Then we evaluated the performance of existing bot identification techniques and found that they are not perfect, as they do not consider all the activities that contributors perform in software projects. So, we developed an activity dataset and found some distinguishing features between bots and humans, which can be used to develop a new bot identification technique.