Life-Cycles and Mutual Effects of Scientific Communities: ASNA 2010

Life-Cycles and Mutual Effects of Scientific
Communities
Václav Belák, Marcel Karnstedt, Conor Hayes
Digital Enterprise Research Institute
NUI Galway

ASNA 2010, Zürich


Motivation

Progress in science is often measured by citation measures,
which are relatively static
Detecting and explaining the evolution and life-cycles of communities
provides better evidence of progress
Previous approaches focused mainly on analysing co-citation
graphs or textual clustering
Little work exists on the analysis of cross-community effects
Kuhn [5] claimed that the development of scientific knowledge
proceeds in discrete steps:
Pre-paradigm period
Paradigm period—normal science
Crisis
Reaction to the crisis—paradigm shift



Cross-Community Effects I
Expected Phenomena
Clique: Graph & Network Analysis Cluster

[Figure: (a) Community shift, labelled "Paradigm shift"; (b) Community merge (with community shift), labelled "Paradigm merge"]



Cross-Community Effects II

Although inspired by Kuhn, we expected the evolution of
communities to take a milder form
Instead of paradigm shift, we looked for community
shift
Community merge is a complementary but rather
uninteresting phenomenon
We therefore investigated combinations of shifts with subsequent merges,
i.e. community shift/merges
Instead of paradigm articulation, we looked for
community specialization
Co-citation networks of two big camps in CS were analysed:
Semantic Web (solution-driven) and Information Retrieval
(problem-driven) [1]



Outline

1 Methodology
2 Data-Sets
3 Results
4 Conclusion and Future Work



Initial Expectations & Requirements

The methodology was developed with a set of certain requirements
arising from the nature of the problem:
1 Dynamic data-set represented by snapshots of several
consecutive time-steps
2 Communities have to be identified in the network in each
time-step
3 Authors (nodes in general) have to be uniquely identified
across all time-steps
4 For topical analysis, meta-data (topics) describing the nodes
are necessary



Community Detection

We identified communities using three popular algorithms:
Infomap [7]
Louvain [2]
WT [8]
All have publicly available implementations, are able to
operate over weighted networks, and produce non-overlapping
communities
In each time-step $t$, we identified a clustering $C^t$ of $n$
communities: $C^t = \{c_1^t, c_2^t, \ldots, c_n^t\}$, where $n$ is determined
automatically for each time-step



Tracking of Dynamic Communities
Communities are identified independently for each time-step.
It is thus necessary to track the evolution of each community
in further time-steps
Communities were matched according to the highest Jaccard
coefficient:

$$\mathrm{match}(c_i^t) = \arg\max_{c_j^{t+1} \in C^{t+1}} \frac{|c_i^t \cap c_j^{t+1}|}{|c_i^t \cup c_j^{t+1}|}$$

Important ancestors and descendants were identified by
modified Jaccard coefficients:

$$\mathrm{ancestor}(c_i^t, c_j^{t+1}) = \frac{|c_i^t \cap c_j^{t+1}|}{|c_j^{t+1}|}, \qquad \mathrm{descendant}(c_i^t, c_j^{t+1}) = \frac{|c_i^t \cap c_j^{t+1}|}{|c_i^t|}$$
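
The matching and the ancestor/descendant coefficients can be sketched in a few lines of Python. This is only an illustration, assuming each community is a plain set of author identifiers; the function names, labels and toy data are hypothetical and not taken from the authors' implementation.

```python
from typing import Dict, Hashable, Optional, Set

Community = Set[str]                    # set of author identifiers
Clustering = Dict[Hashable, Community]  # community label -> members


def jaccard(a: Community, b: Community) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ancestor(c_t: Community, c_next: Community) -> float:
    """Share of the later community c_next whose members come from c_t."""
    return len(c_t & c_next) / len(c_next) if c_next else 0.0


def descendant(c_t: Community, c_next: Community) -> float:
    """Share of the earlier community c_t whose members moved to c_next."""
    return len(c_t & c_next) / len(c_t) if c_t else 0.0


def match(c_t: Community, next_clustering: Clustering) -> Optional[Hashable]:
    """Label of the community in the next time-step with the highest Jaccard
    coefficient, or None if nothing overlaps."""
    best_label, best_score = None, 0.0
    for label, members in next_clustering.items():
        score = jaccard(c_t, members)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: hypothetical author ids and community labels.
c6 = {"a", "b", "c", "d"}
next_step = {"c6": {"a", "b"}, "c26": {"c", "d", "e"}}

best = match(c6, next_step)                      # continuation of c6
share_of_c26 = ancestor(c6, next_step["c26"])    # 2 of 3 members of c26
share_of_c6 = descendant(c6, next_step["c26"])   # 2 of 4 members of c6
```

In the toy data, the community keeps its label through the Jaccard match, while the ancestor coefficient still exposes that most of the new community was recruited from it.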



Visualization

To compare and inspect the state of the network in different
time-steps, a suitable visualization is very helpful
Nodes that appeared previously should keep similar positions
Colours denoting the affiliation of a node to its cluster
should be preserved
As we did not find any existing tool meeting these
requirements, we built our own based on JUNG
Another tool, based on Graphviz, was built to automatically
create diagrams of ancestors and descendants from the
respective relations



Topic Detection I

We mined keywords using NLP techniques [3] from the
abstracts or full-texts for almost 70% of the underlying articles
Tokenised and stemmed [6] keywords were then assigned to
each author
The ability of keywords to discriminate authors was ranked
according to their frequency (TF) and uniqueness in the
corpus (IAF): TF-IAF
Each author $a$ in time-step $t$ was thus described by a
bag-of-words vector $k_a^t$
The topical description of a cluster $c$ was obtained as the centroid of
its members
Cosine similarity was used to determine the topical similarity of
two clusters
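
The pipeline on this slide (keyword weighting, cluster centroid, cosine similarity) might look roughly as follows. The slides do not give the exact TF-IAF formula, so the sketch assumes an analogue of TF-IDF in which IAF is the logarithm of the number of authors divided by the number of authors using the keyword. The function names and the toy keywords are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List

Vector = Dict[str, float]  # keyword -> weight


def tf_iaf(author_keywords: Dict[str, List[str]]) -> Dict[str, Vector]:
    """Weight each author's keywords by tf * log(N / af).

    ASSUMPTION: af is the number of authors using the keyword and N the
    number of authors; the slides only name the measure "TF-IAF".
    """
    n_authors = len(author_keywords)
    author_freq = Counter()
    for keywords in author_keywords.values():
        author_freq.update(set(keywords))
    return {
        author: {
            kw: count * math.log(n_authors / author_freq[kw])
            for kw, count in Counter(keywords).items()
        }
        for author, keywords in author_keywords.items()
    }


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of sparse vectors."""
    if not vectors:
        return {}
    total = defaultdict(float)
    for vector in vectors:
        for kw, weight in vector.items():
            total[kw] += weight
    return {kw: weight / len(vectors) for kw, weight in total.items()}


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is all-zero)."""
    dot = sum(weight * v.get(kw, 0.0) for kw, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data: hypothetical authors with stemmed keywords.
vectors = tf_iaf({
    "a1": ["ontolog", "ontolog", "web"],
    "a2": ["ontolog", "rdf"],
    "a3": ["retriev", "queri", "web"],
})
sw_cluster = centroid([vectors["a1"], vectors["a2"]])
ir_cluster = centroid([vectors["a3"]])
similarity = cosine(sw_cluster, ir_cluster)
```

With this weighting, a keyword used by every author receives weight zero, which is the intended effect of ranking by uniqueness.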



Topic Detection II
Interpretation of a cluster’s topic was based on characterizing
keywords—a union of:
20 highest ranked keywords
20 most frequent keywords
We were particularly interested in cross-community activity
between IR and SW camps
The definition of what counts as an IR-related or SW-related community was
based on frequent patterns mined from the publications
Any event detected by the community topic evolution measures and
associated with both IR- and SW-related communities was
then considered an instance of inter-camp dynamics
Meta-data was used to assess the quality of the clusterings; WT
was omitted from further analysis
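
The union of highest-ranked and most frequent keywords used to interpret a cluster's topic could be computed as below. This is a sketch under assumptions: the ranking weights are taken to be the cluster's keyword weights (e.g. its centroid), and the function name, parameters and toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def characterizing_keywords(
    weights: Dict[str, float],   # keyword -> rank weight within the cluster
    occurrences: List[str],      # every keyword occurrence in the cluster
    k: int = 20,
) -> Set[str]:
    """Union of the k highest-ranked and the k most frequent keywords."""
    top_ranked = sorted(weights, key=weights.get, reverse=True)[:k]
    top_frequent = [kw for kw, _ in Counter(occurrences).most_common(k)]
    return set(top_ranked) | set(top_frequent)


# Toy data with k=2 instead of the 20 used on the slide.
keywords = characterizing_keywords(
    weights={"qa": 0.9, "trec": 0.7, "retrieval": 0.2, "system": 0.1},
    occurrences=["system", "system", "retrieval", "retrieval", "qa", "system"],
    k=2,
)
```

The two lists complement each other: the ranked list surfaces distinctive terms, while the frequency list keeps generic but dominant ones.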



Measures

Overlap measures alone yield a huge number of interactions
between communities
The solution is to apply more specific measures or to use the
simple ones in combination
We developed and/or used two categories of measures:
1 community life-cycle measures for measuring and
explaining the state and the evolution of a community
2 community topic evolution measures for revealing
cross-community phenomena like community shift



Community Life-Cycle Measures

Structural perspective:
size $S$
average vertex betweenness $B$, $R_B \in \mathbb{R}^+$
relative density $\rho$, $R_\rho \in [0, 1]$
author entropy $A$, $R_A \in [0, 1]$
Topical perspective:
topic drift $T$, $R_T \in [0, 1]$
cluster content ratio $H$, $R_H \in \mathbb{R}^+$



Community Topic Evolution Measures

We looked for parallel changes in the structure and topic of
communities
Structural and topical measures were combined by
multiplication, for simplicity and because the range remains
within [0, 1]
Community shift $P_S$ may be detected as the emergence of a
new community topically distinct from its ancestor:

$$P_S(c_i^t, c_j^{t+1}) = \mathrm{dissim}(c_i^t, c_j^{t+1}) \times \mathrm{ancestor}(c_i^t, c_j^{t+1})$$



Community Topic Evolution Measures II

Community shift/merge $P_{S/M}$ may be detected as a merge of
two topically distinct communities:

$$P_{S/M}(c_i^t, c_j^{t+1}) = \mathrm{dissim}(c_i^t, c_j^{t+1}) \times \mathrm{descendant}(c_i^t, c_j^{t+1})$$

Note that both $P_S$ and $P_{S/M}$ are defined only for two
different communities, i.e. only if $i \neq j$
Community topic change $P_C$ expresses a change of topic of a
structurally stable community:

$$P_C(c_i^t) = \mathrm{dissim}(c_i^t, c_i^{t+1}) \times (1 - A(c_i^{t+1}))$$

Only events with values > 0.5 and with a minimal overlap of
10 authors were selected for deeper analysis
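
A minimal sketch of the three topic evolution measures and the selection filter follows. The slides do not define dissim or how author entropy is computed, so this sketch assumes dissim is one minus the cosine similarity of the topic vectors and takes the entropy as a precomputed input. The class, function names and toy data are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Community:
    authors: Set[str]
    topic: Dict[str, float]      # keyword -> weight, e.g. a centroid
    author_entropy: float = 0.0  # A in [0, 1], assumed to be precomputed


def dissim(a: Community, b: Community) -> float:
    """ASSUMPTION: topical dissimilarity = 1 - cosine similarity."""
    dot = sum(w * b.topic.get(kw, 0.0) for kw, w in a.topic.items())
    norm_a = math.sqrt(sum(w * w for w in a.topic.values()))
    norm_b = math.sqrt(sum(w * w for w in b.topic.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def overlap(a: Community, b: Community) -> int:
    return len(a.authors & b.authors)


def p_shift(c_t: Community, c_next: Community) -> float:
    """P_S = dissim x ancestor, for two different communities."""
    if not c_next.authors:
        return 0.0
    return dissim(c_t, c_next) * overlap(c_t, c_next) / len(c_next.authors)


def p_shift_merge(c_t: Community, c_next: Community) -> float:
    """P_S/M = dissim x descendant, for two different communities."""
    if not c_t.authors:
        return 0.0
    return dissim(c_t, c_next) * overlap(c_t, c_next) / len(c_t.authors)


def p_topic_change(c_t: Community, c_same_next: Community) -> float:
    """P_C = dissim x (1 - A) for the same community in the next time-step."""
    return dissim(c_t, c_same_next) * (1.0 - c_same_next.author_entropy)


def is_candidate(value: float, c_t: Community, c_next: Community,
                 threshold: float = 0.5, min_overlap: int = 10) -> bool:
    """Selection rule from the slide: value > 0.5 and at least 10 shared authors."""
    return value > threshold and overlap(c_t, c_next) >= min_overlap


# Toy data: 12 of 20 authors leave an IR community for a new SW community.
old = Community({f"a{i}" for i in range(20)}, {"retrieval": 1.0})
new = Community({f"a{i}" for i in range(12)}, {"semantic": 1.0})

shift = p_shift(old, new)        # new community consists entirely of old members
merge = p_shift_merge(old, new)  # 12 of the 20 old members moved
selected = is_candidate(shift, old, new)
```

Because both factors lie in [0, 1], a high product requires the structural link and the topical difference to be strong at the same time.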



Data-Set
We first picked a set of major conferences in both fields
We then selected publications from these conferences from
DBLP for 2000–2009
A co-citation network of 5772 authors and 817642 edges over
all years was extracted
3-year time-steps with 2-year overlap: 2000–2002,
2001–2003, . . .
The total number of articles was 39314, for which we were able to
scrape 22975 abstracts and 3740 full-texts
Nearly 70% coverage by content
We scraped 18313 author-provided keywords for 4102 distinct
articles
Coverage by this high-quality meta-data was 10%
We mined 263742 keywords from abstracts and full-texts
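
The overlapping time-steps and the two coverage figures can be reproduced with a short calculation. The sketch assumes that only full 3-year windows are generated and that the scraped abstracts and full-texts belong to disjoint articles; neither point is stated on the slide. The function name is hypothetical.

```python
from typing import List, Tuple


def sliding_windows(first_year: int, last_year: int,
                    width: int = 3, step: int = 1) -> List[Tuple[int, int]]:
    """Inclusive (start, end) year windows; width 3 and step 1 give a
    2-year overlap between consecutive windows."""
    return [(start, start + width - 1)
            for start in range(first_year, last_year - width + 2, step)]


windows = sliding_windows(2000, 2009)

# Figures from the slide.
articles = 39314
abstracts, full_texts = 22975, 3740
articles_with_author_keywords = 4102

# ASSUMPTION: abstracts and full-texts cover disjoint sets of articles.
content_coverage = (abstracts + full_texts) / articles
keyword_coverage = articles_with_author_keywords / articles
```

Under these assumptions the content coverage comes out at about 68% and the author-keyword coverage at about 10%, in line with the figures quoted on the slide.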



Shift of Louvain Community 26

The emergence of Louvain community 26 was identified as an
inter-camp community shift $P_S \doteq 0.62$ in 2006
It was formed from community 6 “web IR” (80%) and
community 5 “SW” (20%)
The keywords in 2006, like “navigation”, “personalization”,
and “semantic web”, suggest transdisciplinary topics
Massive influence of community 15 “SW and IR” in 2007 and
a change of topic towards “SW and business processes”
Observed as a low topic drift $T \doteq 0.29$
IR-related keywords appeared again among the characterizing
keywords in 2008
The topic then stabilized: $T \doteq 0.65$



Evolution of Louvain Community 26
Communities 6 “web information retrieval”, 5 “semantic web”,
15 “semantic web and information retrieval” and their descendant
community 26
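[Figure: ancestor and descendant diagram of communities c5, c6, c15 and c26 over the time-steps 2005–2007, 2006–2008, 2007–2009 and 2008–2009. Edge labels as extracted, by row: c5: 20, 2.8, 48.6; c6/c26: 80, 4.7, 51.4; c15: 90.6, 8.3]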



Position of Louvain Community 26 in 2006 and 2007

Communities 6 “web information retrieval” (pink), 5 “semantic
web” (red), 15 “semantic web and information retrieval” (violet)
and their descendant community 26 (green)



Specialization of Infomap Community 9

Initially oriented towards general and core SW-related topics in 2000
Between 2002 and 2004 we identified 3 shifts
One of these shifts was community 99 “semantic desktop and
personalization”
The community itself then specialized in “SW services”
$S$, $T$, and $H$ provided valuable insights
$\rho$, $B$, and $A$ did not seem to provide any further insights



Life-Cycle Measures of Infomap Community 9

[Figure: life-cycle measures ρ, H, B, S, A and T of Infomap community 9 over time, 2000–2008. Left axis: H, T, A, ρ (0 to 2); right axis: B, S (0 to 4500)]



Life-Cycle Measures of Infomap Community 99

[Figure: life-cycle measures ρ, H, B, S, A and T of Infomap community 99 over time, 2003–2008. Left axis: H, T, A, ρ (0.2 to 1.6); right axis: B, S (0 to 1000)]



Shift/Merge of Community 86
We identified a shift/merge $P_{S/M} \doteq 0.91$ of community 86
with community 0
Both communities were concerned with IR-related topics, but
each had its specific theme:
86 being more focused on “development”, “engine”, and
“system”
0 being more focused on “question answering”
90.9% of authors from 86 moved to community 0
Relative density $\rho \doteq 0.47$ and a high cluster content ratio
$H \doteq 1.91$ suggest it was topically coherent, but structurally
weak
It is not possible to generalize about the suitability of any life-cycle
measure, as we identified only one shift/merge



Tag Clouds of Communities 86 and 0

community (year): characterising keywords

c86 (2002): intuitive, development, ir, retrieval, control, implemented, describing, high-dimensional, reducing, engine, execution, advanced, information, system, multi-dimensional, image, usin, accurate, time, precise, features, queries, service, dataset, document, analysis, large, structure, cluster, and, web, processing

c0 (2003): resolution, evaluation, passages, architecture, question, qa, patterns, definitions, development, trec, mit, candidates, linguistic, retrieval, answering, system, analysis, javelin, modules, advanced, methods, science, information, approaches, processing, using, computer, language, techniques



Change of topic of Infomap community 54
An inter-camp community topic change $P_C \doteq 0.58$ was identified
for Infomap community 54 between 2005 and 2006
The topic changed from “knowledge management” and
“information extraction” towards “knowledge querying” and
“semantic web”
Zero author entropy A suggests this might have been caused
by new members joining the community
34.5% were completely new, i.e. they did not come from any
previous community
20.7% coming from 54 “knowledge management and
information extraction”
17.2% coming from 29 “ontologies and SW”
6.9% coming from 70 “ontologies and folksonomies”
6.9% coming from 112 “semantic web services”



Tag Clouds of Infomap community 54

community (year): characterising keywords

c54 (2005): organizational, kms, sw, capturing, environment, working, ie, acquisition, wikifactory, legacy, manager, goal, semantic, tool, cooperative, layers, healthier, defining, quantitative, knowledge, web, text, learning, techniques, computer, supporting, science, machine, documents, information, system

c54 (2006): ontologies, language, query, specification, knowledge, manager, semantic, pure, capturing, data, search, keyword, layers, keyword-based, hybrid, architecture, spreadsheet, web, ie, application, information, modelling, approach, algorithm, using, methodic, retrieval, service, system, structures



Emergence of Intermediary Louvain Community 15

The most complex scenario we investigated
It first emerged as a descendant of community 4 “IR” with the
topic “cross-language IR”, which was identified as a
community shift $P_S \doteq 0.55$ in 2003
From 2004, this community was under the massive influence of
community 5 “SW”, which caused a change towards
SW-related topics, $P_C \doteq 0.31$
From 2005, IR-related keywords appeared again among its
characterizing keywords, while those keywords disappeared from
community 5
Therefore, while community 5 kept its focus on the core
SW-related topics, it contributed substantially to the formation of a new
interdisciplinary community



Betweenness of Louvain Community 15

Despite still being focused mainly on SW-related topics,
community 15 acted as an intermediary between both camps
This hypothesis is supported by its high average author
betweenness B

                  2004–2006             2007–2009
                  S      B              S      B
c15               444    1591.01659     445    2535.02
entire network    2776   2066.70764     2190   2192.85117
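
The comparison in the table, the average betweenness of one community against that of the whole network, can be illustrated on a toy graph. The sketch below uses Brandes' algorithm on an unweighted, undirected graph without normalisation; the slides do not say which variant, weighting or normalisation produced the values above, so the numbers are not comparable. All names and the toy graph are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set

Graph = Dict[Hashable, Set[Hashable]]  # symmetric adjacency sets


def betweenness(graph: Graph) -> Dict[Hashable, float]:
    """Brandes' algorithm for unweighted, undirected graphs (unnormalised)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        order = []                        # nodes in order of discovery
        preds = {v: [] for v in graph}    # predecessors on shortest paths
        sigma = {v: 0 for v in graph}     # number of shortest paths
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):         # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected path was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}


def average_betweenness(scores: Dict[Hashable, float],
                        members: Iterable[Hashable]) -> float:
    members = list(members)
    return sum(scores[m] for m in members) / len(members) if members else 0.0


# Toy graph: two triangles joined through a single intermediary node "m".
edges = [("l1", "l2"), ("l1", "l3"), ("l2", "l3"), ("l3", "m"),
         ("m", "r1"), ("r1", "r2"), ("r1", "r3"), ("r2", "r3")]
graph: Graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

scores = betweenness(graph)
intermediary_avg = average_betweenness(scores, ["l3", "m", "r1"])
network_avg = average_betweenness(scores, graph)
```

In the toy graph the nodes bridging the two triangles have a much higher average betweenness than the network as a whole, which is the pattern the slide interprets as an intermediary role.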



Position of Louvain Community 15 in 2004 and 2007

Community 5 “SW” (red—left side), “IR” communities 0, 4, 6 and
9 (grey, beige, pink and red—right side, respectively) and their
intermediary community 15 (violet)



Conclusion and Future Work I

We presented a general and scalable methodology for analysis
of cross-community phenomena uniquely combining
topological and content analysis and supported by special
visualization techniques
Three community topic evolution measures tailored for
identifying phenomena like community shift, shift/merge, and
change of topic were proposed and successfully assessed
Community shift and topic change were detected quite
commonly, which suggests that they are part of many
community life-cycles
Community shift/merge was detected very rarely, which either
means we have to improve the measure or that this is simply a
rare phenomenon
We proposed life-cycle measures characterising the states and
evolution of communities



Conclusion and Future Work II
The assessment showed that average vertex betweenness,
relative density, cluster content ratio, and topic drift offered
valuable insights into the phenomena revealed by community
topic evolution measures
We observed strong shifts, $P_S \to 1$, when the shifted
community disappeared in the next time-step
These strong shifts usually had very different but coherent
topics
They might have been the initial sources of new topics or even
research streams
Frequently, a newly emerged community had a quite weak
structure (low $\rho$, high $A$) and/or topic (low $T$), while these
characteristics then improved in the subsequent time-steps
$B$ seems to be a good measure for the identification of
intermediary communities



Conclusion and Future Work III

We intend to cluster the community life-cycles by the
characteristic events expressed by all the measures
We expect this to provide an automated way of extracting
life-cycle taxonomies
The combination of content and structural analysis allowed us
to assess the quality of clusterings that were obtained solely by inspecting
the structure of the network
We consider this original approach fertile ground for
future research
We plan to use other algorithms, e.g. a co-clustering algorithm
for both content and objects [4]
We will extend the whole work to a larger data-set



References I

[1] R. Baeza-Yates, P. Mika, and H. Zaragoza.
Search, Web 2.0, and the Semantic Web.
IEEE Intelligent Systems, 23(1):80–82, 2008.
[2] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte,
and Etienne Lefebvre.
Fast unfolding of communities in large networks.
Journal of Statistical Mechanics: Theory and Experiment,
P10008, 2008.
[3] Georgeta Bordea.
Concept Extraction Applied to the Task of Expert Finding.
In The Semantic Web: Research and Applications,
pages 451–456.
Springer, 2010.



References II

[4] Derek Greene and Pádraig Cunningham.
Spectral Co-Clustering for Dynamic Bipartite Graphs.
Technical report, School of Computer Science & Informatics,
UCD, 2010.
[5] Th. S. Kuhn.
The Structure of Scientific Revolutions.
University Of Chicago Press, December 1996.
[6] Martin F. Porter.
An algorithm for suffix stripping.
Program, 14:130–137, 1980.



References III

[7] Martin Rosvall and Carl T. Bergstrom.
Maps of random walks on complex networks reveal community
structure.
Proceedings of the National Academy of Sciences USA, volume 105, pages
1118–1123, 2008.
[8] Ken Wakita and Toshiyuki Tsurumi.
Finding community structure in a mega-scale social networking
service.
In IADIS international conference on WWW/Internet 2007,
pages 153–162, 2007.

