To ensure that publications are assigned to clusters in a meaningful way, we introduce the notion of stable clusters. Essentially, a cluster is stable if it is insensitive to small changes in the underlying data. Bootstrapping is used to make small changes in the data. It is shown that if we want to have an accurate and detailed clustering, we need to be satisfied with a clustering that doesn’t comprehensively cover all publications. Publications that do not clearly belong to one of the main topics in a field cannot be assigned to a cluster.
1. On cluster stability
Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University
15th International Conference on Scientometrics & Informetrics
Istanbul, Turkey, June 30, 2015
2. Introduction
• A clustering technique can be used to obtain highly
detailed clustering results (i.e., a large number of
clusters)
• A clustering technique can be used to force each
publication to be assigned to a cluster
• However, in a highly detailed clustering, is the
assignment of publications to clusters still meaningful?
4. Cluster stability
• To ensure that publications are assigned to clusters in a
meaningful way, we introduce the notion of stable
clusters
• Essentially, a cluster is stable if it is insensitive to small
changes in the underlying data
• Bootstrapping is used to make small changes in the data
5. Identification of stable clusters: Step 1
• Collect the citation network of publications
• Create a large number (e.g., 100) of bootstrap citation
networks:
– A bootstrap citation network is a weighted variant of the original citation
network in which each edge has an integer weight drawn from a
Poisson distribution with mean 1 (cf. Rosvall & Bergstrom, 2009)
• In each bootstrap citation network, perform clustering
• For each pair of publications, calculate the proportion of
the bootstrap clustering results in which the publications
are in the same cluster
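
The procedure in Step 1 can be sketched in a few lines of Python. This is a minimal illustration for small networks, not the implementation used in the talk. All function and parameter names (`poisson1`, `bootstrap_network`, `toy_clustering`, `co_clustering_proportions`, `n_boot`, `seed`) are hypothetical. The slides do not say which clustering technique is applied to each bootstrap network, so `toy_clustering` is a deliberately simple stand-in (connected components over edges with positive weight). Any function with the same signature can be passed in its place.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def poisson1(rng):
    """Draw an integer from a Poisson distribution with mean 1 (Knuth's method)."""
    limit, k, p = math.exp(-1.0), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def bootstrap_network(edges, rng):
    """Give every original edge an independent Poisson(1) integer weight."""
    return {edge: poisson1(rng) for edge in edges}


def toy_clustering(nodes, weighted_edges):
    """Stand-in clustering (an assumption, not the method from the talk):
    connected components over edges whose bootstrap weight is positive."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for (a, b), weight in weighted_edges.items():
        if weight > 0:
            parent[find(a)] = find(b)
    return {n: find(n) for n in nodes}


def co_clustering_proportions(nodes, edges, n_boot=100,
                              cluster=toy_clustering, seed=0):
    """For each pair of publications, return the proportion of bootstrap
    clusterings in which both fall in the same cluster. Pairs that are
    never clustered together are omitted from the result."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    ordered = sorted(nodes)
    for _ in range(n_boot):
        labels = cluster(nodes, bootstrap_network(edges, rng))
        # All-pairs loop: quadratic in the number of publications,
        # so this is only suitable for small examples.
        for a, b in combinations(ordered, 2):
            if labels[a] == labels[b]:
                counts[(a, b)] += 1
    return {pair: c / n_boot for pair, c in counts.items()}


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("c", "d")]
props = co_clustering_proportions(nodes, edges)
```

With the stand-in clustering, a pair joined by a single citation stays together only when its edge draws a non-zero weight, which happens with probability 1 − e⁻¹ ≈ 0.63 per bootstrap network. A real clustering technique would behave differently, and a real citation network would need a sparse approach rather than the all-pairs loop.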
7. Identification of stable clusters: Step 2
• Create a network of publications with an edge between
two publications if the publications are in the same
cluster in at least a certain proportion (e.g., 0.9) of the
bootstrap clustering results
• Identify connected components in the newly created
network
• Each connected component represents a stable cluster
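
Step 2 amounts to thresholding the pairwise proportions and finding connected components. The sketch below assumes the proportions are available as a dictionary keyed by publication pairs. The function name `stable_clusters` and the parameters `proportions` and `threshold` are hypothetical. Publications with no edge at the chosen threshold are left out of the result, which is one way to read the point that some publications cannot be assigned to a cluster.

```python
from collections import defaultdict


def stable_clusters(proportions, threshold=0.9):
    """Return the connected components of the network that links two
    publications when they share a cluster in at least `threshold`
    of the bootstrap clustering results."""
    neighbours = defaultdict(set)
    for (a, b), p in proportions.items():
        if p >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, clusters = set(), []
    for start in list(neighbours):
        if start in seen:
            continue
        # Depth-first search collecting one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Made-up proportions for illustration only.
example = {
    ("a", "b"): 0.95,
    ("b", "c"): 0.92,
    ("c", "d"): 0.50,
    ("e", "f"): 0.90,
}
clusters = stable_clusters(example)
```

In this made-up example, publication "d" is linked to "c" in only half of the runs, so it falls below the threshold and ends up in no stable cluster. Note that "a" and "c" land in the same component through "b" even though no proportion is given for that pair directly, because connected components are transitive.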
15. Conclusions
• If we want to have an accurate and detailed clustering,
we need to be satisfied with a clustering that doesn’t
comprehensively cover all publications
• Publications that do not clearly belong to one of the main
topics in a field cannot be assigned to a cluster
• Cluster stability analysis can be used to distinguish
between meaningful and non-meaningful assignments of
publications to clusters
17. References
Rosvall, M., & Bergstrom, C.T. (2009). Mapping change in large
networks. PLoS ONE, 5(1), e8694.
http://dx.doi.org/10.1371/journal.pone.0008694
Waltman, L., & Van Eck, N.J. (2012). A new methodology for
constructing a publication-level classification system of
science. JASIST, 63(12), 2378-2392.
http://dx.doi.org/10.1002/asi.22748
Waltman, L., & Van Eck, N.J. (2013). A smart local moving
algorithm for large-scale modularity-based community
detection. European Physical Journal B, 86(11), 471.
http://dx.doi.org/10.1140/epjb/e2013-40829-0