Research seminar given at the Poznan University of Technology, Poland, June 2014. The topic was the automatic and unsupervised discovery of topics in social networks.
4. Web 2.0 (Social Web)
◦ Huge amounts of highly heterogeneous and unstructured
user-generated data on the Web (e.g. Wikipedia, blogs)
and in social networks (e.g. Facebook, Twitter)
Global aim of our work
◦ Develop tools based on Artificial Intelligence techniques
that can analyze all this information in an automatic and
unsupervised way and build knowledge structures
Some previous work
Ontology-Based Information Extraction
Ontology Learning from the Web
6. Focus on Social Networks – Twitter
500 million short messages (tweets) per day
Hashtags
7. Hashtags can be taken as indicators of the topic of
a tweet
Given a large number of tweets, most approaches
to automatic topic detection try to cluster tweets
(or cluster hashtags) in some way
Most common solution: cluster hashtags according to
their syntactic co-occurrence
9. A semantic management of hashtags will provide
a more coherent classification than the usual ones
based on syntactic co-occurrence.
Remainder of the talk:
◦ Unsupervised semantic clustering of hashtags
◦ Case study – Medical tweets
11. After obtaining the hashtags from a given corpus
of tweets, a three-step analytic process is applied:
◦ Semantic annotation of hashtags
◦ Hashtag clustering
◦ Selection of relevant clusters
12. Idea: give meaning to each hashtag, by linking it
to a WordNet concept
◦ #SagradaFamilia => Church
◦ #LFC => Football Club
Rationale: if we are able to associate each
hashtag to a concept in an ontology, we will be
able to apply ontology-based semantic similarity
measures to know the degree of relationship
between pairs of hashtags
13. Step 1: The hashtag matches directly with a
WordNet concept
◦ Word-breaking techniques and iterative prefix/suffix
analysis are applied
◦ #Cathedral, #GothicCathedral match with the “Cathedral”
concept
Easy, but most hashtags do not appear directly in
WordNet
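The direct-matching step above can be sketched as follows. The camel-case splitting heuristic, the prefix-stripping loop, and the `lexicon` lookup (standing in for WordNet) are assumptions for illustration; the slides do not detail the exact word-breaking procedure:

```python
import re

def split_hashtag(tag):
    """Break a hashtag into candidate words at camel-case
    boundaries: '#GothicCathedral' -> ['gothic', 'cathedral']."""
    body = tag.lstrip("#")
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", body)
    return [w.lower() for w in spaced.split()]

def match_concept(tag, lexicon):
    """Try the whole hashtag first; then iteratively drop leading
    (modifier) words until the remainder matches a lexicon entry.
    `lexicon` stands in for a WordNet concept lookup."""
    words = split_hashtag(tag)
    while words:
        candidate = " ".join(words)
        if candidate in lexicon:
            return candidate
        words = words[1:]  # strip one prefix word and retry
    return None  # not found; left for the later annotation steps
```

With `lexicon = {"cathedral"}`, both `#Cathedral` and `#GothicCathedral` resolve to the concept `cathedral`, while `#LFC` falls through to the next step.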
18. At this point each hashtag h is associated with a list
Lh of one or more WordNet concepts
◦ The hashtags that have not been annotated in the
previous step are dismissed
In order to apply a clustering process it is
necessary to define a measure of semantic
similarity between pairs of hashtags (i.e. between
pairs of lists of WordNet concepts)
19. We have considered that the similarity between
two hashtags h1 and h2 is the maximum similarity
between a concept in Lh1 and a concept in Lh2
◦ Any ontology-based semantic similarity measure between concepts could be applied
[Diagram: h1 annotated with concepts C1, C2 and h2 with C3, C4, C5;
the six pairwise similarities are 0.2, 0.1, 0.5, 0.6, 0.3 and 0.1,
so the similarity between h1 and h2 is the maximum, 0.6]
20. Using this similarity between hashtags, we perform
a hierarchical clustering of the set of hashtags
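The maximum-pairwise rule can be sketched with the six values shown in the diagram; which value belongs to which concept pair is an illustrative assumption, and the similarity table stands in for any ontology-based measure:

```python
from itertools import product

def hashtag_similarity(Lh1, Lh2, concept_sim):
    """Similarity between two hashtags: the maximum similarity between
    a concept annotating the first and a concept annotating the second."""
    return max(concept_sim(c1, c2) for c1, c2 in product(Lh1, Lh2))

# Toy pairwise concept similarities standing in for an
# ontology-based measure (values from the slide's diagram;
# their assignment to pairs is illustrative).
SIM = {("C1", "C3"): 0.2, ("C1", "C4"): 0.1, ("C1", "C5"): 0.5,
       ("C2", "C3"): 0.6, ("C2", "C4"): 0.3, ("C2", "C5"): 0.1}

sim = hashtag_similarity(["C1", "C2"], ["C3", "C4", "C5"],
                         lambda a, b: SIM[(a, b)])
# sim is the maximum pairwise value, 0.6
```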
21. Due to the nature of social tags, traditional
clustering methods provide solutions with a large
number of irrelevant classes
It is important to analyse the clustering tree and
determine which classes of hashtags are good
enough to be shown to the user
22. filtering (HC, minK, maxK, t1, t2)
finalClusts := Ø
forall k in maxK .. minK           -- cut the tree and obtain k classes
  forall c in 1 .. k
    b := inter-cluster-homogeneity(HC[k][c])   -- homogeneity of class c
    if (b >= t1) && (|HC[k][c]| >= t2)
       && (∄ e in finalClusts | e ⊆ HC[k][c])
      add HC[k][c] to finalClusts
return finalClusts
A class is selected if it is big enough, it is homogeneous
enough, and it is not a superset of any previously
selected class
Finally, a semantic centroid is calculated for each selected class
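The selection step can be sketched in Python. The interface is an assumption: `cuts[k]` is taken to be the list of classes obtained by cutting the clustering tree at level k, and `homogeneity` is a caller-supplied scoring function:

```python
def filtering(cuts, min_k, max_k, t1, t2, homogeneity):
    """Select the relevant classes from a hierarchical clustering.

    cuts[k] -- the partition of the hashtag set into k classes
               (each class a set of hashtags) at cut level k.
    """
    final_clusts = []
    for k in range(max_k, min_k - 1, -1):  # from particular to general cuts
        for cls in cuts[k]:
            if (homogeneity(cls) >= t1                # homogeneous enough
                    and len(cls) >= t2                # big enough
                    and not any(e <= cls              # no previously selected
                                for e in final_clusts)):  # class is a subset
                final_clusts.append(cls)
    return final_clusts
```

The subset test `e <= cls` rejects a general class once one of its more particular sub-classes has already been selected at a finer cut.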
26. 5000 medical tweets related to
Oncology, extracted from Symplur
(www.symplur.com)
Collected from October 31st 2012 to January
11th 2013
The set contains 1086 different
hashtags
Using the WordNet + Wikipedia
semantic annotation process, 930
hashtags (85.6%) were annotated
◦ Half of the annotations are made in the
first step (WordNet) and the other half in
the second step (Wikipedia)
◦ 156 hashtags (14.4%) were removed
[Bar chart: number of tweets by number of hashtags per tweet
(1 to 12): 2530, 793, 769, 371, 293, 129, 52, 30, 2, 3, 24, 1]
27. The remaining 930 hashtags were manually
examined.
◦ 536 (57.6%) were relevant medical hashtags, and they
were classified into 16 manually labelled categories
(organs, professions, medical tests, etc.)
◦ 394 (42.4%) were considered noisy or unrelated to
Medicine
29. maxK=200, minK=5
◦ The algorithm proceeds from the cut that divides the set in 200
classes up to the cut that divides the set in 5 classes; thus, it
moves from more particular classes to more general classes
t1: minimum inter-cluster homogeneity
◦ All the values between 0 and 1 (in 0.1 steps) were tested.
◦ In this talk I will consider the value 0.70.
t2: minimum number of elements
◦ All the even values between 2 and 20 were tested.
◦ In this talk I will consider the value 10.
With these parameters, 31 classes were obtained
30. A: Manual set of 16 correct classes (536 HTs) + a
noisy 17th class (394 HTs)
B: Set of 31 classes (930 HTs) obtained by the
system
We calculate, for each class Bi in B:
◦ Its semantic centroid
◦ The class Aj in A with which it shares the most elements
Precision: the fraction of items of Bi that belong to Aj
Recall: the fraction of items of Aj that appear in Bi
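The per-class comparison can be sketched as follows; the toy hashtags and classes are invented for illustration:

```python
def evaluate_class(b_class, a_partition):
    """For a system class Bi, find the manual class Aj sharing the most
    elements and return (precision, recall) with respect to it."""
    best = max(a_partition, key=lambda a: len(b_class & a))
    overlap = len(b_class & best)
    return overlap / len(b_class), overlap / len(best)

# Invented toy data: one system class against two manual classes.
b_i = {"#chemo", "#radiotherapy", "#surgery", "#running"}
a_classes = [{"#chemo", "#radiotherapy", "#surgery", "#biopsy"},
             {"#running", "#gym"}]
precision, recall = evaluate_class(b_i, a_classes)
# 3 of the 4 items of b_i fall in the best-matching manual class,
# which itself has 4 items: precision = recall = 0.75
```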
31. [Results table: each class Bi in B with its semantic centroid
and size, and its best-matching class Aj in A with the
corresponding precision, recall and manual label]
33. The unsupervised analysis of the set of HTs
contained in a corpus of tweets is very hard,
because half of them may be noisy or unrelated to
the domain, and they have a very heterogeneous
nature
Our hypothesis is that semantic measures of
similarity between HTs will lead to better
classifications than standard co-occurrence
techniques
In a test on 5000 medical tweets, 13 of the 16
manually labelled classes are found, with different
degrees of precision and recall
34. Evaluate the quality of the semantic annotation step
Test different ontology-based semantic similarity
measures in the clustering step
Explore in depth the influence of the thresholds on
the selection step
Obtain, as a result, a hierarchy of classes at different
levels of abstraction, rather than a flat partition
Test the system on different sets of tweets
◦ Size: from thousands to millions of tweets
◦ Domain: uni-domain or general corpus