Poster presented at the 2014 European Conference on Artificial Intelligence - Unsupervised semantic clustering of Twitter hashtags - automatic topic detection in Twitter
ECAI 2014 poster - Unsupervised semantic clustering of Twitter hashtags
1. Unsupervised semantic clustering of Twitter hashtags
Introduction
Carlos Vicient, Antonio Moreno
Micro-blogging services such as Twitter constitute one of the most
successful kinds of applications in the current Social Web.
Every day more than 500 million tweets are sent.
Growing interest in the design and development of tools that allow
users to analyse large unstructured repositories of user-tagged data.
Problems of data visualisation.
Semantic information retrieval & information extraction.
Hashtag recommendation,etc.
Our research is focused in the automated clustering of the hashtags
present in a set of tweets, which may lead to a straightforward discovery
of its main topics.
Methodology
Universitat Rovira i Virgili
ITAKA research group - Intelligent Technologies for Advanced Knowledge Acquisition
Dept. of Computer Science and Mathematics
Avda. Paisos Catalans, 26. 43007 Tarragona, Spain.
{carlos.vicient, antonio.moreno}@urv.cat
Goal: Group hashtags in clusters and select the most relevant ones.
A bottom-up analysis from maxK clusters to minK clusters is performed.
Filtering thresholds:
T1: establish the minimum inter-cluster homogeneity.
T2: establish the minimum number of elements per cluster.
Fig.2. Filtering process (t1=0.6, t2=3)
--
536 relevant medical hashtags
classified in 16 manually labelled
classes.
2530
793 769
371
Table 1. Results of Oncology tweet set
1)
-
293
This unsupervised domain-independent methodology allows a
Future work
organization
[ , Companies,
Hospital, Award, Care,
]
Winner, 0.4 0.6 0.9 0.8
Using this matrix, the similarity between concepts is
calculated using the Wu-Palmer [2] distance.
21st European Conference on Artificial Intelligence. ECAI 2014. Prague. August 18th-22nd.
[1] Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database (Language,
Speech, and Communication). MIT Press.
Hashtags are unstructured and unlimited: problems like synonymy,
polysemy, lexically similar hashtags, acronyms,a combination of
several words, or just invented expressions.
Current hashtag clustering methods are mostly based on a syntactic
analysis of their co-ocurrence.
Goal: The aim of this stage is to fill the gap between hashtags and
concepts in order to be able to compare two different hashtags at
the semantic level.
semantic clustering of a set of hashtags and the identification of
its most relevant topics, filtering the large proportion of noise
inherent to these sets.
--
2)
--
Oncology dataset:
5000 tweets (www.symplur.com)
1086 hashtags (930 annotated)
Automatic results (minK=5, MaxK=200, t1=0.70 and t2=10):
13 of the 16 target manual classes were identified
15 relevant clusters (with a total number of 266 hashtags)
Fig.1. Semantic annotation process
1)
2)
3)
Analysis of the full content of the tweets.
Improve the treatment of polysemic hashtags.
Change the bottom-up analysis of the hierarchical tree by a top-down
procedure, so that general clusters are considered before the specific ones.
Semantic Annotation
WordNet [1]
Entity
illness
tumor
cancer
sign of
zodiac
cancer
(crab)
medical
institution
clinic
Clinic
Organization
Semantic Similarity
Goal: Establish a mechanism to compare two different hashtags.
Build a similarity matrix that contains all the hashtags.
Clustering & filtering
Evaluation
Id Centroid Size Prec. Rec. Manual class
1 Woody_Plant 11 64% 41% Substances
2 Day 10 50% 24% Temporal
3 Therapy 20 75% 37% Medical tests
4 Medicine 17 76% 23% Medications
5 Cancer 46 80% 62% Cancer
6 Court 14 43% 43% Hospitals
7 Biotechnology 10 60% 15% Biological
8 Health 23 43% 45% Health Care
9 Medicine 43 60% 63% Medical Fields
10 System 10 70% 21% Body Parts
11 Area 11 73% 18% Geographical
Locations
12 Teaching 10 40% 14% Academic,
Research
13 Person 16 38% 12% Medical Jobs
14 Center 10 40% 12% Body Parts
15 Doctor 15 60% 18% Medical Jobs
16 – 31 - - - - Noise
129
52 30 2 3 24 1
1 2 3 4 5 6 7 8 9 10 11 12
#
t
w
e
e
t
s
#hashtags
[2] Wu, Z. and Palmer, M.: Verb semantics and lexical selection. In Proceedings of the
32nd annual Meeting of the Association for Computational Linguistics, New
Mexico, USA. 1994. 133-138
Hashtag1: “#Mayo” { clinic, organization }
Hashtag2: “#AustinCancerCenter” { center, medical institution }
*c = LeastCommonSubsumer(a,b)