1. An Empirical Study of Popularity and
Quality of NPM Packages
Ahmed Zerouali
2. Motivation
Number of libraries in the best-known open source package managers
- 206k libraries - Java
- 162k packages - PHP
- 600k packages - JavaScript
As of 10 Nov 2017
3. Motivation
Reasons to choose the right OS software:
Software Quality
Software Features
Software Support and Documentation
Software Popularity
…
…
4. Motivation
Choosing the right OS software:
Software Quality
Software Features
Software Support and Documentation
Software Popularity
…
…
5. Motivation
Interviews with developers:
C. Bogart, C. Kästner, J. Herbsleb, and F. Thung. How to break
an API: Cost negotiation and community values in three software
ecosystems. In Int’l Symp. Foundations of Software Engineering
(FSE), pages 109–120. ACM, 2016.
Popularity and community reputation are the most influential factors.
9. Method
libraries.io: An open source repository containing metadata (size, dependents, dependencies) of packages extracted from 23 package managers.
npms.io: An open source search engine that computes a normalized score between 0 and 1 for the popularity, quality and maintenance of npm packages.
10. Method
characteristics score
Popularity:
- # stars
- # forks
- # subscribers
- # contributors
- # dependents
- # downloads
- # downloads acceleration
Quality:
- README?
- License?
- .gitignore and friends?
- Has tests?
- # test coverage
- Is the build passing?
- # outdated deps & vulnerabilities?
- Has custom website?
- Has linters configured?
Maintenance:
- # ratio of open issues vs. total issues
- # time to close issues
- # commits frequency
- # release frequency
11. Data Extraction
- libraries.io:
  - Download the prepared and available metadata from 15th June 2017
  - Or use the API to get the latest information (rate limit = 60 requests/minute)
- npms.io: use the API (no rate limit)
13. Research Questions
RQ0 (preliminary question): How are measures of package
popularity related to the use of a package?
RQ1: Is there a relationship between package quality
and package popularity?
RQ2: Is there a relationship between the maintainability
and popularity of packages?
RQ3: Are deprecated packages still being used?
RQ4: How different are packages used in web frontend
development in the context of all packages?
14. Data Analysis
- Data analysis and processing: import pandas
- Data visualization: import matplotlib
  import seaborn
- Analytics: import scipy
15. RQ0 - How are measures of package popularity
related to the use of a package?
Pearson correlation coefficient R = 0.8
16. RQ0 - How are measures of package popularity
related to the use of a package?
Almost 4 out of 10 packages are not used by any other package or
external repository.
35% of packages don’t have any direct dependency.
17. RQ1 - Is there a relationship between package quality
and package popularity?
- Quality
  - Testing (tests, test coverage, build status)
  - Carefulness (license, readme, .gitignore, ...)
  - Health (outdated dependencies and vulnerabilities)
  - Branding (badges and homepage)
- Popularity
  - Community interest (npms.io)
  - Dependent external repositories (libraries.io)
18. RQ1 - Is there a relationship between package quality
and package popularity?
Distribution of popularity in terms of community interest, number of dependent
repositories and quality score of npm packages, split into packages that have at
least one dependency and packages that don’t.
19. RQ1 - Is there a relationship between package quality
and package popularity?
Pearson correlation coefficient R < 0.33 for both testing and carefulness
20. RQ2 - Is there a relationship between the
maintainability and popularity of packages?
Distribution of maintenance characteristics scores, grouped into packages
that have a commit score above the median and packages that have a commit
score below the median (0.25)
21. RQ2 - Is there a relationship between the
maintainability and popularity of packages?
22. RQ3: Are deprecated packages still being used?
- Packages declared ‘deprecated’ in the status: 768
- Packages declared ‘deprecated’ in the description: 1,522
Total deprecated packages found: 2,290 (out of all npm)
Total deprecated packages found in npms.io: 836
23. RQ3: Are deprecated packages still being used?
0.4% of all npm packages are deprecated, and they are used less.
Deprecated packages are used by popular packages too.
Deprecated packages have the same characteristics as the other
packages, except for size, release frequency, commit frequency and
issue fixing.
24. RQ4: How different are packages used in web
frontend development in the context of all packages?
- Packages on Bower: 65,397
- Packages on Bower and npm: 25,203
Total front-end packages found in npms.io: 20,210
25. RQ4: How different are packages used in web
frontend development in the context of all packages?
27. Conclusion
Investigated the relationship between software popularity and quality.
Used npms.io and libraries.io.
Found that:
Software popularity and quality are weakly correlated.
Maintenance has little impact on Popularity.
Only a small number of packages are deprecated.
Front-end packages are more popular.
28. Future Work
- Cross-ecosystem comparisons: detect differences in the
relation between popularity and quality across ecosystems.
- Qualitative analysis: carrying out interviews and surveys
with package developers.
One of the most crippling choices new developers and even existing ones face is deciding what programming language to work in, which frameworks to use and which library to learn. Given there are literally thousands of libraries to choose from, and all have their own pros and cons, it can be difficult to decide what to learn.
Why it is important to pick the right software
Often, you can find many open-source choices that appear to fit your need, but picking the wrong software can have expensive consequences. A lot of time is required to learn new software and integrate it into your project, and time is money.
Choosing the wrong software can be an expensive mistake.
Among the reasons developers weigh when choosing new software are:
Software quality (SQ): Is this software library well tested and well written?
Software features (SF): What functionality does it provide?
Software support and documentation (SSD): Is it well documented?
Software popularity (SP): For example, is it used by a lot of people?
Out of all these reasons, popularity seems to be the most influential factor.
Researchers interviewed developers involved in open source software ecosystems about their reasons for selecting software, and most answers were related to popularity and community reputation.
But does this factor imply good software quality?
Do popular software packages, for example in JavaScript, have good development quality?
Let’s take an example: Sinon, a testing package ranked 14, has good test coverage and all builds passing,
while Chai, which is also a testing library, has failing builds and lower test coverage than that package.
We wanted to verify whether this is the case for a lot of libraries, and whether there is indeed a link between software quality and popularity.
We investigated this issue for packages that are hosted in the npm package manager.
We chose npm because it is now the largest package registry in the world, and because JavaScript is one of the most used programming languages.
We used two open source package tracking tools:
libraries.io, which contains the metadata of package dependencies extracted from 23 package managers,
and
npms.io, which is an open source …
The scores are calculated using many different metrics.
For the data extraction, we had the choice between downloading the prepared and available data of 15th June 2017 or using their API to get the latest information. Since there is not a lot of time between June and October, and also because libraries.io has a rate limit, we used the available metadata.
For npms.io, we used their API.
Downloading the data from npms.io was also really fast.
After combining the data from both sources, out of the 516,705
packages in the libraries.io dataset extracted on 15th June 2017, we found 308,777 that are also in npms.io.
We also observed that all packages in npms.io are hosted on GitHub, which is of great value to us, since
our purpose is to analyze packages that evolve in the same.
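The combination step described above can be sketched with a pandas inner join. This is only an illustration under stated assumptions: the tiny tables, the column names (`name`, `dependent_repos`, `popularity`, `quality`) and the values are hypothetical, not the real libraries.io or npms.io schemas.

```python
import pandas as pd

# Hypothetical miniature extracts; real column names and values are not given in the talk.
libraries_io = pd.DataFrame({
    "name": ["left-pad", "sinon", "chai", "private-pkg"],
    "dependent_repos": [120, 4500, 9000, 0],
})
npms_io = pd.DataFrame({
    "name": ["sinon", "chai", "left-pad"],
    "popularity": [0.71, 0.83, 0.35],
    "quality": [0.95, 0.80, 0.60],
})

# An inner join keeps only the packages that appear in both sources.
combined = libraries_io.merge(npms_io, on="name", how="inner")
print(len(combined), "of", len(libraries_io), "packages matched")
```

Joining on the package name is an assumption made for the sketch; any stable identifier shared by both sources would serve the same purpose.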
To empirically study the relationship between software quality and popularity, we consider the following research questions.
To answer these questions,
we used only Python for extracting, cleaning and preparing the data,
as well as for the analysis.
To play with the data we used:
Our aim with this preliminary question is to better understand the concept of popularity.
In order to study popularity, we rely on the popularity score of npms.io. This score includes, among other metrics, the number of other npm packages directly depending on a package. We also rely on the number of dependent repositories, a metric extracted from libraries.io. It counts the number of Git
repositories that do not correspond to an npm package yet depend on the npm package under consideration.
The package scores computed by npms.io are values between 0 and 1. To facilitate comparison with the aforementioned metric from libraries.io, we normalize this metric to a value between 0 and 1.
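The talk does not say which normalization was applied to the dependent-repositories count. As a sketch, assuming a plain min-max rescaling (one common way to map a metric onto the range 0 to 1), it could look like this; the function name and the sample counts are hypothetical.

```python
import pandas as pd

def min_max_normalize(values: pd.Series) -> pd.Series:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column carries no ranking information; map it to 0.
        return pd.Series(0.0, index=values.index)
    return (values - lo) / (hi - lo)

# Hypothetical counts of dependent repositories for four packages.
dependent_repos = pd.Series([0, 10, 50, 200])
normalized = min_max_normalize(dependent_repos)
print(normalized.tolist())
```

Heavily skewed counts are often log-transformed before rescaling; whether that was done here is not stated in the talk.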
As shown in this figure, the scatter plot of npm package popularity in terms of community interest against the number of dependent repositories reveals a correlation between both kinds of popularity.
To confirm this we calculated the … and we found a strong correlation at R = 0.81.
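A Pearson coefficient between two score columns can be computed directly in pandas, as in the sketch below. The data frame, its column names and its values are made up for illustration and do not reproduce the study's data or its R value. `scipy.stats.pearsonr` yields the same coefficient together with a p-value.

```python
import pandas as pd

# Hypothetical scores for five packages, both already scaled to [0, 1].
scores = pd.DataFrame({
    "community_interest": [0.05, 0.10, 0.20, 0.40, 0.80],
    "dependent_repos_norm": [0.00, 0.08, 0.25, 0.35, 0.90],
})

# Pearson's r measures the strength of the linear relationship between two columns.
r = scores["community_interest"].corr(scores["dependent_repos_norm"], method="pearson")
print(f"Pearson r = {r:.2f}")
```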
We also found that.
For the first question, we verified whether there is quantitative evidence of a relation between popularity, in terms of community interest and dependent external repositories, and quality in terms of …
We observe that most npm packages have low popularity within the community and have very few external repositories depending on them, while most of them have a good quality score.
We also verified statistically whether packages that do not use any dependency are different, but we couldn’t find a statistically significant difference.
To calculate the quality scores, high weights were given to carefulness and testing.
To take a deeper look at how these two metrics are distributed, we divided packages into quintiles by their popularity score.
We found statistically that, for most categories, the carefulness characteristic is higher than the testing characteristic.
We also checked whether carefulness and testing correlate with popularity across all packages, and we found only a weak linear correlation.
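The quintile split described above can be sketched with `pandas.qcut`, which cuts a column into equal-sized bins by rank. The ten packages and their scores below are hypothetical, and taking the median per quintile is just one way to summarize each group.

```python
import pandas as pd

# Hypothetical scores for ten packages.
packages = pd.DataFrame({
    "popularity":  [0.01, 0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.50, 0.90],
    "carefulness": [0.60, 0.70, 0.65, 0.80, 0.75, 0.85, 0.80, 0.90, 0.95, 0.90],
    "testing":     [0.10, 0.20, 0.15, 0.30, 0.40, 0.35, 0.50, 0.60, 0.70, 0.80],
})

# Quintile 1 holds the least popular fifth of packages, quintile 5 the most popular fifth.
packages["quintile"] = pd.qcut(packages["popularity"], q=5, labels=[1, 2, 3, 4, 5])

# Compare the two quality characteristics within each popularity quintile.
summary = packages.groupby("quintile", observed=True)[["carefulness", "testing"]].median()
print(summary)
```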
After that, we studied the relation between maintenance activity and popularity.
We expected packages under active maintenance to be more popular than packages that are no longer being maintained.
When checking the source code of npms.io, we found that it has difficulty evaluating packages whose repositories have issues disabled or have zero issues. That’s why, for this particular research question, we filtered them out.
We investigated the relationship between releasing, committing and fixing issues.
We grouped all packages considered for this analysis into two categories of equal size, based on the median value of the commit frequency.
We found that npm packages that commit frequently have good issue-fixing scores and also release frequently.
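A median split like the one described above might be written as follows. This is a sketch on invented data: the column names (`commits_frequency`, `releases_frequency`, `issues_score`) are hypothetical stand-ins for the npms.io maintenance characteristics.

```python
import pandas as pd

# Hypothetical maintenance characteristics for six packages.
maintenance = pd.DataFrame({
    "commits_frequency":  [0.1, 0.2, 0.3, 0.5, 0.7, 0.9],
    "releases_frequency": [0.0, 0.1, 0.2, 0.4, 0.6, 0.8],
    "issues_score":       [0.2, 0.3, 0.3, 0.6, 0.7, 0.9],
})

# Split packages into two groups around the median commit score.
median_commit = maintenance["commits_frequency"].median()
above = maintenance["commits_frequency"] > median_commit
maintenance["group"] = above.map({True: "above median", False: "at or below median"})

# Summarize the other maintenance characteristics for each group.
by_group = maintenance.groupby("group")[["releases_frequency", "issues_score"]].median()
print(by_group)
```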
Using the maintenance score, we checked whether we could find different distributions of the number of dependent npm packages and repositories. As before, we divided packages into quintiles according to their maintenance score.
As shown in the figure, we couldn’t find a relevant difference between the distributions,
which means that maintenance does not have a large impact on the popularity of npm packages.
To find out how deprecated npm packages are being handled, we identified all npm packages in the libraries.io dataset that have a “deprecated” status. Among all packages, we found only 768.
After that, we manually analyzed the descriptions of all packages that contain the string ‘deprecat’.
We filtered out packages designed to handle deprecation and found 1,522 more deprecated packages.
Of all these deprecated packages, only 836 were found in our dataset.
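The automatic part of this identification, before the manual review, can be sketched as two filters. The table below is hypothetical, and the exact status value stored by libraries.io (`"Deprecated"` here) is an assumption.

```python
import pandas as pd

# Hypothetical package metadata; the status value "Deprecated" is an assumption.
pkgs = pd.DataFrame({
    "name": ["old-lib", "util-deprecate-helper", "fresh-lib", "legacy-ui"],
    "status": ["Deprecated", None, None, None],
    "description": [
        "An old library",
        "Mark functions as deprecated with a warning",
        "A new library",
        "DEPRECATED: use new-ui instead",
    ],
})

# Step 1: packages explicitly flagged through their status field.
by_status = pkgs["status"].eq("Deprecated")

# Step 2: packages whose description mentions 'deprecat' in any letter case.
mentions = pkgs["description"].str.contains("deprecat", case=False, na=False)

# Candidates for manual review: mentioned in the description but not flagged by status.
candidates = pkgs[mentions & ~by_status]
print(candidates["name"].tolist())
```

The second filter also catches packages that merely deal with deprecation, such as the helper in this toy table, which is why a manual pass over the candidates is still needed.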
We analyzed their scores and popularity.
After that, in order to find out how front-end packages differ,
we extracted all packages that are hosted on Bower, the package manager dedicated to the front end.
We then identified which of these packages are also on npm.
Finally, we found 20,210 packages that are hosted on both Bower and npm and are in our dataset.
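The selection of front-end packages amounts to two set intersections. The sketch below uses invented package lists and assumes packages are matched by name across registries, which the talk does not confirm.

```python
# Hypothetical package names; matching purely by name is an assumption.
bower_names = {"jquery", "bootstrap", "angular", "bower-only-widget"}
npm_names = {"jquery", "bootstrap", "angular", "express", "sinon"}
dataset_names = {"jquery", "angular", "express", "sinon"}

# Packages hosted on both Bower and npm.
on_both = bower_names & npm_names

# Front-end packages that are also present in the combined dataset.
front_end = on_both & dataset_names

# Every other package in the dataset forms the comparison group.
other = dataset_names - front_end
print(sorted(front_end), sorted(other))
```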
For these packages and the other packages hosted on npm, we compared their maintenance, quality and popularity characteristics scores.
We found that front-end packages differ in size, age and popularity. They are more popular than the other packages.
Since we only used metrics already computed by npms.io and libraries.io, our results could differ when relying on other metrics that define and implement quality or popularity in a different way.
We did not differentiate or classify the npm packages by their category or domain, which may impact our findings.
This work presented an empirical analysis of software package popularity and quality in npm. Using the available data on libraries.io and npms.io, two open source services that provide software dependency tracking, we analyzed the characteristics of open source npm packages in order to investigate the relationship between quality and popularity within the npm ecosystem.