BEN-GURION UNIVERSITY OF THE NEGEV
FACULTY OF ENGINEERING SCIENCES
DEPARTMENT OF INDUSTRIAL ENGINEERING & MANAGEMENT
An unsupervised approach to user classification in online social networks
THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE M.Sc DEGREE
By: Barak Yichye
December, 2016
Supervised by: Dr. Dan Vilenchik
Author:…………………….. Date:…………..
Supervisor:……………………… Date: 27/12/2016
Chairman of Graduate Studies Committee:……………….. Date:…………..
December, 2016
Acknowledgement
I would like to take this opportunity to express my deep gratitude to my advisor Dr.
Dan Vilenchik, whose ideas and innovative thinking are the base of my thesis work.
Thank you for sharing your expertise and countless hours, it was my pleasure to work
under your guidance and learn the true way of a researcher.
Contents
1 Introduction 5
1.1 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Methodology 11
2.1 Sparse PCA and the semantic dimension . . . . . . . . . . . . . . . . 13
2.1.1 The Integrity Score . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 A Note on Computational Efficiency . . . . . . . . . . . . . . 15
2.2 Semantic robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Connection to Anomaly Detection . . . . . . . . . . . . . . . . . . . . 17
2.4 Crawling Twitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3 Related Work 20
4 Twitter - Case study 23
4.1 Analyzing the Sparse PCA progression . . . . . . . . . . . . . . . . . 24
4.2 Using PC2 for Spam Detection . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Using PC3 for content providers . . . . . . . . . . . . . . . . . . . . . 30
4.4 Other Methods of Unsupervised Learning . . . . . . . . . . . . . . . . 32
5 Discussion 35
6 Figures 37
7 Bibliography 48
List of Figures
1 User UML diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2 User tweets UML diagram . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Crawler algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 PC1 Progression, fame measure. . . . . . . . . . . . . . . . . . . . . . 39
5 PC2 Progression, spam detector. . . . . . . . . . . . . . . . . . . . . . 39
6 PC3 Progression, Content detector. . . . . . . . . . . . . . . . . . . . 39
7 Top PCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
8 Top 4-sparse PCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
9 PC1 VS PC2 Factor Map. . . . . . . . . . . . . . . . . . . . . . . . . 40
10 PC2 VS PC3 SPCA Factor Map. . . . . . . . . . . . . . . . . . . . . 41
11 PC2 VS PC3 SPCA Factor Map. . . . . . . . . . . . . . . . . . . . . 42
12 various k scree plot. Each color represents the scree plot for a k-
sparse PCA solution. The x-axis is the PC number, the y-axis is the
percentage of variance explained by that PC. . . . . . . . . . . . . . . 43
13 Top truncated PCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
14 Top truncated 4-sparse PCs . . . . . . . . . . . . . . . . . . . . . . . 43
15 PC1 VS PC2 SPCA factor map. . . . . . . . . . . . . . . . . . . . . . 44
16 PC1 VS PC3 PCA factor map. . . . . . . . . . . . . . . . . . . . . . . 45
17 Spam Detection ROC plot. AUC=0.98 . . . . . . . . . . . . . . . . . 46
18 Combining the plains . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
19 PC1 VS PC3 scatter plot. . . . . . . . . . . . . . . . . . . . . . . . . 47
20 PC2 VS PC3 scatter plot. . . . . . . . . . . . . . . . . . . . . . . . . 47
21 PC1 VS PC2 scatter plot. . . . . . . . . . . . . . . . . . . . . . . . . 48
List of Tables
1 Feature details. * The measure is computed over the recent 150 tweets. 18
2 PC3 TOP10 users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1. Introduction
Online social networks and in particular Microblogging services such as Twitter and
Instagram have become an important feature of the daily life of millions of users.
In addition to communicating with friends and family, microblogging services are
used as recommendation services, real-time news sources and targeted advertising
platforms for commercial companies. The tremendous increase in popularity of social
networking sites allows companies to collect a huge amount of personal information
about the users, their friends, and their habits. From a business point of view, this
represents an opportunity to reach a large audience for marketing purposes. On the
other hand this wealth of information, as well as the ease with which one can reach
many users, attracted the interest of malicious parties. In particular, spammers
are always looking for ways to reach new victims with their unsolicited messages.
Therefore the task of classifying users in the social network is a fundamental and
important one, and this is the focus of the present work.
User classification in online social networks has been studied widely in the literature;
the goal is to predict various user attributes from the user’s profile. The attributes
are numerous and diverse, for example trying to predict the user’s ethnicity and
political affiliation [32], gender, age, regional origin [36], occupational class [33], user
income [34], and demographics [9]. The list is of course much (much) longer, but
one thing is common to all the aforementioned works: the prominent features that
are used are all text-based, sociolinguistic, and to compute them various (non-trivial)
tools of text and sentiment analysis are used. In addition, all these works apply
supervised learning algorithms (either for text analysis or the prediction task).
In this master thesis we want to study an orthogonal direction and ask if and what
quality of classification may be obtained from an online social network when using
only the simplest of statistics. By simple statistics we mean the network structure
(e.g. who follows whom, who is friends with whom) and basic communication behavior
traits (e.g. number of tweets per day); the content of the user’s feed is ignored. If
classification is possible, we may further ask what is the minimal number of features
that still allows for meaningful classification, or in other words how succinct can
the labelling be made. The motivation behind these questions is partly theoretical:
how much information is needed to capture interesting “signals” about the users of
an online social network? Indeed, a recent work by Rao et al. [36] tried a machine
learning approach to obtain an accurate prediction of latent user attributes such
as gender or age (which are latent in Twitter) when only using simple statistics as
mentioned here. They report that they were not able to perform the task, and that
adding sociolinguistic features (based on the user’s tweets) increased dramatically
the accuracy of the prediction. Another motivation for studying simple-statistics
classification is practical and the following example highlights its crux. One of the
new players in the social network arena is Snapchat, which passed Twitter in daily
usage (150 million people using it each day as of 2016). The trademark of Snapchat is the
fact that the content disappears shortly after being posted. Therefore relying on
content to provide insights about the user is impossible in this case. More generally
the content that users post is not always textual, e.g. pictures, videos. Extending
text analysis to pictures and videos is not an easy task and requires sophisticated
algorithms. Therefore it is helpful to understand what can be learned about the users
in the network without analyzing the content that they post.
The second question that we want to study in this master thesis is whether
unsupervised learning algorithms are useful in the context of user classification, and if so,
can one develop a scoring mechanism to evaluate the goodness of fit? In particular
we are interested in a notion we call “semantic robustness”: since in an unsupervised
model it is only in hindsight that one interprets the model and assigns it a semantic
meaning, it is reasonable to ask how robust is the given interpretation. For example,
if one is to take a certain subset of the data (e.g. a certain subnetwork) and recompute
the model for this subset, will the model retain its original semantic interpretation?
Finally, let us stress that the point of this work is not to show that one can
replace supervised learning algorithms that use in-depth information about the user
with unsupervised learning that relies only on simple statistics. This is of course too
naive. The goal of this master thesis is to see what can be achieved with one (or
both) hands tied behind the back: relinquishing text analysis and labeled data, the
two main pillars of current user classification algorithms.
In this work we answer the research questions that we proposed affirmatively and
demonstrate that meaningful user classification is obtainable in the Twitter social net-
work using an unsupervised approach and using only simple (non-textual) statistics.
In addition we develop a new scoring mechanism that allows us to assess the quality
of the classification and perform model selection. In the next section we describe our
results in more detail.
1.1. Our Contribution
Natural unsupervised learning algorithms for classification are clustering algorithms
(such as k-means) or feature-transformation methods such as Principal Component
Analysis (PCA for short) and its variants (sparse PCA, kernel PCA), or Independent
Component Analysis (ICA). Our method relies on PCA (in Section 4.4 we discuss the
various options and what led us to choose PCA). In the context of user classification,
the principal components (PC’s) define a set of labels, which may (or may not)
have a meaningful semantic interpretation. The PC’s may represent complex user
attributes such as being a celebrity or being a spammer. Typically the top r PC’s are
considered, where the parameter r is chosen according to the total variance explained
by that set. The PC’s induce a “soft” classification of the users in the sense that
for each user we measure how much it is “of type PCi”. A user may be classified
according to the most dominant label (measured as the length of the projection on
the relevant PC). The method is reviewed in detail in Section 2.
The main contribution of our work is summarized as follows:
(1) We propose a generic approach that may be applied in a straightforward manner
to various online social networks (such as LinkedIn, Instagram, Twitter, etc). Our
approach is generic in the sense that the user-profile statistics that we use are com-
mon (or very similar) across many social networks, e.g. the number of followers or
the number of likes.
(2) We introduce a new concept which we call the semantic dimension of the problem.
In order for the PC’s to be amenable to meaningful interpretation, it is (almost)
lem. In order for the PC’s to be amicable to meaningful interpretation, it is (almost)
obligatory that the obtained PC’s will be sparse. A common practice to achieve this
goal is to zero out the entries of lowest absolute value in every PC. We suggest a new
methodology to perform this task, and a way of validating the result. We suggest to
use sparse PCA instead of standard PCA and a way to choose the correct sparsity
parameter k (the number of allowed non-zeros in every PC). We identify the minimal
semantically-admissible sparsity parameter kmin as the semantic dimension of the
problem, alongside r which is the algebraic dimension. We suggest to choose kmin
by solving a progression of k-sparse PCA problems for increasing values of k. Each
k-sparse solution receives a score which reflects how well the non-sparse labels are
retained (details in Section 2.1). The progression also allows to zoom-in on how the
feature set that defines each label evolves, and enables refined and insightful feature
selection.
(3) We suggest a new score which we call the semantic robustness of the classification.
Robustness in the sense that the labels remain semantically valid for various types of
(sub-)networks: for example users from a specific region, students in a certain school,
etc. To compute the robustness score we perform a truncated crawl of the social
network. In this crawl we ignore users with high expression of any of the derived
labels (i.e. with a large projection on any of the leading PC’s). Our robustness score
reflects the extent to which the PC’s of the “truncated” covariance matrix retain the
semantics of the original PC’s (details in Section 2.2). Using the robustness score
we can perform a kind of cross validation and choose the sparsity parameter which
gives the best robustness results. Indeed in our experiment one of the sparse solutions
obtained a better robustness score than the non-restricted PCA.
(4) We show how to use the PC’s to obtain useful labeling of the users in the Twitter
network. For example, one of the PC’s turns out to be a perceptron for spam detec-
tion. We tested our classifier on a set of 164 accounts, 69 spam and 95 legitimate,
taken from [15]. Our classifier obtained high precision and recall rates (around 95%)
and an AUC score of 0.98. We interpret the leading PC, PC1, as a measure of fame in
the network and PC3 as a measure of legal activity (content providers). We demon-
strate how one can use these labels to locate for example locally active bloggers: we
look in our Twitter crawling sample for a user with relatively low projection on PC1
and a relatively high projection on PC3. One such account is KarenInzunzam with
a PC1 value of 0.000095 and a PC3 value of 0.34. Another example is Susan Hozak
with PC1 value of 0.000032 and PC3 value of 0.45. As a benchmark for the projection
values we note that the vast majority of users have a nearly-zero projection on either
PC’s. We can identify an active celebrity LanaDelRey with PC1 projection of 36 and
PC3 of 2.7, and on the other hand a not so famous news account which is extremely
active, littlebytenews, with PC1 projection of 2.97 and PC3 value of 40.
We demonstrate our methodology on one of the more prominent players in the so-
cial network arena – Twitter. Twitter is a social network designed as a microblogging
platform, where users send short text messages (called tweets) that appear on their
friends’ pages. Unlike Facebook and MySpace, no personal information is shown on
Twitter pages by default, which reinforces our motivation for relying only on simple
statistics. Users are identified only by a username and, optionally, by a real name.
A Twitter user can start “following” another user. As a consequence, he receives the
user’s tweets on his own page. The user who is “followed” can, if he wants, follow
the other one back. Tweets can be grouped by hashtags, which are popular words,
beginning with a “#” character. This allows users to efficiently search who is posting
topics of interest at a certain time. When a user likes another user’s tweet, he can
decide to retweet it. As a result, that message is shown to all his followers.
2. Methodology
In this section we describe our proposed methodology for user classification in online
social networks using PCA and sparse PCA. We use bold lower-case letters, e.g. x, to
signify vectors, and non-bold lower-case letters, e.g. x, to signify scalars. Upper-case
letters are reserved for matrices. We consider all vectors as column vectors. We let
p be the number of features collected for each user, n the number of users in the
sample and X the resulting n × p data matrix. Let x = (x1, . . . , xp) be the random
variable with the “true” underlying distribution of the feature set (to which we of
course have no direct access), and Σ = E[xxᵀ] its population covariance matrix. For
simplicity of presentation we assume that the data is centered (i.e. the mean is zero).
Let ˆΣ = (1/n) XᵀX be the sample covariance matrix of x.
For sake of completeness we start with a brief description of the PCA method. The
first principal component (PC) is defined to be the direction (unit vector) v1 ∈ R^p
in which the variance of x is maximal. The variance of x in direction v is given by
the expression vᵀΣv. Therefore v1 = argmax over unit vectors v ∈ R^p of vᵀΣv. The latter is the
Rayleigh Quotient definition of the largest eigenvalue of a matrix, therefore v1 is the leading
eigenvector of Σ and λ1 = v1ᵀΣv1 is the variance explained by v1. The remaining
PC’s are defined in a similar way and together they form an orthonormal basis of R^p.
The sample PC’s ˆv1, . . . , ˆvp are the eigenvectors of the sample covariance matrix ˆΣ.
Under various reasonable assumptions it was proven that the principal components
v1, . . . , vp converge to the sample ones ˆv1, . . . , ˆvp [2, 27]. We assume that this is true
in our case, and we justify it by the fact that we are in the “fixed p, large n” regime,
where the ratio p/n tends to 0.
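The computation just described can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis software: it centers a data matrix, forms the sample covariance matrix (1/n)XᵀX, and extracts its eigenvalues and eigenvectors (the sample PC's), sorted by explained variance. NumPy is assumed; the random data is a stand-in for a real crawl.

```python
import numpy as np

def sample_pcs(X):
    """Sample PCs of a centered n x p data matrix X.

    Returns the eigenvalues (sorted in descending order) and the
    corresponding eigenvectors (as columns) of (1/n) X^T X.
    """
    n = X.shape[0]
    cov = (X.T @ X) / n               # sample covariance matrix
    lam, V = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]     # re-sort in descending order
    return lam[order], V[:, order]

# toy example: 1000 "users", 12 features (as in the thesis setting)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
X -= X.mean(axis=0)                   # center the data
lam, V = sample_pcs(X)
```

Since the covariance matrix is symmetric, `eigh` is the appropriate (and numerically stable) routine, and the returned eigenvectors are automatically orthonormal.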
We now proceed to explain how PCA may be used in the context of user classification.
To that end it is instructive to take a geometric approach to PCA. Associate
the feature xi with the standard basis vector ei ∈ R^p (the vector with 1 in the i-th
coordinate and 0 otherwise). In this notation, the value of the i-th feature of a user
profile x0 is the scalar product ⟨ei, x0⟩. In the same manner, the principal components
v1, . . . , vp form a new coordinate system (representing a new set of features),
and the extent to which x0 is “of type vi” is given by ⟨vi, x0⟩. The new features
v1, . . . , vp are linear combinations of the original set of features, and may represent
more complex user characteristics. For example if the original set of features contains
basic statistics such as the number of tweets per day, number of URLs per tweet, etc,
the new features may represent more complex user-statistics such as a measure of
“celebrity”, “content consumer”, “spammer”, and so on. The obtained classification
is therefore “soft”, in the sense that there is no single label per user but rather a
continuum along each axis, and the most prominent axis for a certain user may serve
as his classification label.
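The soft classification described above amounts to projecting a user profile onto the top PCs and, if a single (hard) label is wanted, taking the most prominent axis. A minimal sketch (NumPy assumed; the 3-feature example and the use of the standard basis as "PCs" are purely illustrative):

```python
import numpy as np

def soft_labels(x0, V, r):
    """Project a user profile x0 onto the top r PCs (columns of V).

    Returns the r projections <v_i, x0> and the index of the most
    dominant label (largest absolute projection).
    """
    proj = V[:, :r].T @ x0
    return proj, int(np.argmax(np.abs(proj)))

# illustrative example: with the standard basis as "PCs", the
# projections are just the original features
V = np.eye(3)
x0 = np.array([0.1, -2.0, 0.5])
proj, label = soft_labels(x0, V, r=3)   # label == 1 (second axis dominates)
```

The projections form a continuum along each axis, matching the "soft" interpretation in the text; `label` is only a convenience when a single class per user is required.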
Typically, only a few PC’s carry a meaningful semantic signal about the users,
and the remaining PC’s correspond to noise. The question is how to identify that
subset. Unfortunately there is no single statistically-sound rule that fits all settings
and specifies how to select the PC’s, but rather various (ad-hoc) heuristics. The
following observation is nevertheless useful: the percentage of variance explained in
the direction of a certain vector v is given by vᵀˆΣv / tr(ˆΣ), where tr(ˆΣ) = λ1 + · · · + λp
is the trace of the matrix. By symmetry, a random vector will explain on average
a 1/p-fraction of the variance. Therefore as long as the i-th PC explains a
λi / tr(ˆΣ) > 1/p fraction of the variance, it is reasonable to look into the signal
that it carries. To conclude, the relevant subset of PCs consists of the top r PCs
(sorted according to the corresponding eigenvalue), where r can be determined
according to the 1/p-fraction threshold we described.
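The 1/p-fraction rule can be computed directly from the eigenvalues. A small sketch (the example eigenvalues are made up for illustration):

```python
import numpy as np

def choose_r(lam):
    """Number of PCs whose explained-variance fraction exceeds 1/p."""
    lam = np.asarray(lam, dtype=float)
    p = lam.size
    frac = lam / lam.sum()            # lambda_i / tr(Sigma_hat)
    return int(np.sum(frac > 1.0 / p))

# example: fractions are 0.5, 0.3, 0.15, 0.03, 0.015, 0.005 and
# 1/p = 1/6 ~ 0.167, so only the first two PCs pass the threshold
lam = [5.0, 3.0, 1.5, 0.3, 0.15, 0.05]
r = choose_r(lam)   # -> 2
```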
2.1. Sparse PCA and the semantic dimension
The framework that was presented above is useful only to the extent that one can
interpret the PC’s in a meaningful way. In practice many times the PC’s are linear
combinations of all (or most of) the original features, which hinders interpretability.
To deal with this phenomenon, a common practice is to zero out PC entries whose
absolute value is small.
We suggest to take an alternative approach, and use sparse PCA. Rather than
finding the eigenvectors of Σ, we look for the unit vector v with at most k non-zero
entries (k is a parameter of the problem) such that vᵀΣv is maximal; v is called
the leading k-sparse eigenvector (although it is not necessarily an eigenvector of ˆΣ).
Similarly we compute the remaining k-sparse eigenvectors.
In an unsupervised setting there is no standard way to measure the fit of a model
and in particular no recipe to choose the minimal k, kmin, that still enables meaningful
classification (what we refer to as the semantic dimension of the problem). We define
a new score which we call the integrity score and use it to perform model selection.
Specifically, we generate a progression of k-sparse PCA problems and use Equation (1)
to measure the integrity of every solution. To choose kmin we can generate a plot
similar to a scree plot, or choose the minimal k whose integrity score is above some
threshold.
2.1.1. The Integrity Score
Given a covariance matrix ˆΣ ∈ R^{p×p} and a natural number k ≤ p, the integrity score,
denoted by ι(k), is a number in the range [0, 1], and it is computed as a function of
the non-restricted and the k-sparse PCA solutions over ˆΣ. It represents a measure of
the “semantic distance” between the two solutions.
To compute the integrity score we first assume that r, the number of relevant
PC’s, is already determined. Before we present the formula for ι(k) let us discuss the
motivation behind it. The main question is how to quantify the extent to which the
semantics of a PC vi is retained in the k-sparse solution. We view the semantics of
the label represented by vi as composed of its loadings (i.e. the values of the entries of
vi) and the amount of variance it explains, λi. Therefore our score accounts for the
similarity between the vi’s and the k-sparse eigenvectors vi^(k), and the difference in
the explained variance |λi − λi^(k)|, where λi^(k) = (vi^(k))ᵀ ˆΣ vi^(k). Formally, the score is given
by the following formula,

ι(k) = (1/2r) Σ_{i=1}^{r} [ 1 − |λi − λi^(k)| / max{λi, λi^(k)} + |⟨vi, vi^(k)⟩| ].   (1)
Let us conclude with two remarks:
• Observe that the score of the non-restricted PCA is always 1, i.e. ι(p) = 1. The
closer the score is to 1, the closer the k-sparse solution is to the non-restricted
PCA solution.
• It may be the case that the semantic labels of the k-sparse solution are a per-
mutation of the non-restricted labels. Therefore, if needed, an appropriate per-
mutation should be applied before computing ι(k). In our case no permutation
was needed (see Section 4).
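Equation (1) translates directly into code. The sketch below assumes the two solutions are already aligned (no permutation needed, as was the case in this work); eigenvectors are passed as matrix columns.

```python
import numpy as np

def integrity_score(lams, V, lams_k, Vk, r):
    """Integrity score iota(k) of Equation (1).

    lams, V: eigenvalues and PCs (as columns) of the non-restricted
    solution; lams_k, Vk: their k-sparse counterparts, assumed to be
    aligned so that column i of Vk matches column i of V.
    """
    total = 0.0
    for i in range(r):
        # closeness of explained variance, in [0, 1]
        var_term = 1.0 - abs(lams[i] - lams_k[i]) / max(lams[i], lams_k[i])
        # closeness of loadings: |<v_i, v_i^(k)>|, in [0, 1] for unit vectors
        dot_term = abs(V[:, i] @ Vk[:, i])
        total += var_term + dot_term
    return total / (2 * r)
```

As a sanity check, feeding the non-restricted solution for both arguments yields a score of exactly 1, matching the remark that ι(p) = 1.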
2.1.2. A Note on Computational Efficiency
While (non-restricted) PCA can be solved efficiently by computing the eigenvectors of
a symmetric matrix, sparse PCA is a difficult combinatorial problem, and in fact NP-
hard [29, 30]. Nevertheless, when the dimension p is small (12 in our case), sparse
PCA can be solved exactly in a matter of seconds (by a naive exhaustive search
approach).
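For small p, the exhaustive search mentioned above can exploit the fact that, once the support (set of non-zero coordinates) is fixed, the best unit vector is the leading eigenvector of the corresponding k × k principal submatrix. A minimal sketch of this idea (illustrative only; `sparse_pc1` is a name chosen here, not from the thesis software):

```python
import itertools
import numpy as np

def sparse_pc1(cov, k):
    """Leading k-sparse eigenvector of `cov` by exhaustive search.

    Enumerates all C(p, k) supports; for each, the maximizer of v^T cov v
    over unit vectors with that support is the top eigenvector of the
    k x k principal submatrix. Feasible only for small p (p = 12 here).
    """
    p = cov.shape[0]
    best_val, best_vec = -np.inf, None
    for support in itertools.combinations(range(p), k):
        idx = np.array(support)
        sub = cov[np.ix_(idx, idx)]          # principal submatrix
        lam, U = np.linalg.eigh(sub)
        if lam[-1] > best_val:               # lam[-1] is the top eigenvalue
            best_val = lam[-1]
            v = np.zeros(p)
            v[idx] = U[:, -1]                # embed back into R^p
            best_vec = v
    return best_val, best_vec
```

With p = 12 the largest enumeration is C(12, 6) = 924 supports, each requiring a tiny eigendecomposition, which is consistent with the "matter of seconds" claim in the text.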
In cases where p is large or even grows with n, solving sparse PCA exactly is not
a computationally feasible task. There are various heuristic approaches to obtain
an approximate solution [10, 29, 37, 46, 47], but in such cases consistency issues may
appear [20]. This gives yet another motivation to understand what can be achieved
for the classification problem with as few statistics as possible, thus keeping the
computational effort to solve sparse PCA in check.
2.2. Semantic robustness
Suppose that we obtained satisfactory classification results from the PC’s. A natural
question to ask is how semantically robust the classification is. For example,
if we were to take a subnetwork of users in a certain town, or students in a certain
school, and run the same classification methodology on that subnetwork, would
the subnetwork labels retain the semantic meaning of the global labels? Since the
setting is of unsupervised nature, there are no pre-defined labels and no reference
point to compare against. We suggest a way to use a variant of ι(k) to measure the
robustness of the classification. The robustness score, together with ι(k), provide a
justifiable way to perform model selection (i.e. choose the parameter k with highest
robustness/integrity score).
The robustness score, ρ(k), is a function of two covariance matrices. The first
covariance matrix is obtained from a standard crawl of the social network (see Section 4
where we describe exactly how the crawl is performed). The second covariance matrix
is obtained from a second crawl, which we call the truncated crawl. In that crawl we
ignore users that have a large projection on any of the top r PC’s of the first crawl. We
compute the top r k-sparse PC’s of the first covariance matrix, v1^(k), . . . , vr^(k), and
the top r k-sparse PC’s of the “truncated” covariance matrix, v1^(k,trunc), . . . , vr^(k,trunc).
The robustness score is given by

ρ(k) = (1/r) Σ_{i=1}^{r} |⟨vi^(k), vi^(k,trunc)⟩|.   (2)
Let us remark that:
• The number ρ(k) is in the range [0, 1] and the closer ρ is to 1 the closer the two
solutions are. Therefore we may conclude that the labels represented by the
PC’s are semantically more robust.
• As mentioned in the previous section, if needed an appropriate permutation
should be applied to the PC’s before computing ρ(k). In our case indeed such
a permutation was needed (see Section 4).
• We do not use the eigenvalues in the computation, unlike the integrity score,
since we compare two different crawls resulting in two different matrices, and
eigenvalues may fluctuate without any semantic significance implied.
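Equation (2) is the eigenvector part of the integrity score averaged over the two crawls. A minimal sketch, assuming the PCs of the two crawls have already been matched by the appropriate permutation (which, per the remark above, was needed in this case):

```python
import numpy as np

def robustness_score(Vk, Vk_trunc, r):
    """Robustness score rho(k) of Equation (2).

    Vk, Vk_trunc: top r k-sparse PCs (as columns) of the standard and
    the truncated crawl, assumed already permutation-matched. Note that
    eigenvalues are deliberately not used (see the remark above).
    """
    return float(np.mean([abs(Vk[:, i] @ Vk_trunc[:, i]) for i in range(r)]))
```

Identical PC sets give ρ = 1, the best possible robustness.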
2.3. Connection to Anomaly Detection
Our methodology of classifying users via PCA shares similarities with the framework
of network anomaly detection. For completeness, we briefly describe this framework.
For a given natural number r ≤ p, define Nr to be the subspace spanned by the first
r PC’s {v1, . . . vr}. The basic underlying assumption of traffic anomaly detection is
that Nr, the “normal space”, corresponds to the primary regular trends (e.g., daily,
weekly) of the traffic matrix. The detection procedure relies on the decomposition of
a traffic sample x into normal and abnormal components, x = xn + xa, such that xn
is the projection of x onto Nr and xa the projection onto the perpendicular “abnormal” subspace.
The measurement x is declared to be an anomaly if the 2-norm ‖xa‖ exceeds a certain
threshold. For more details we refer the reader to [21, 22, 24, 23]. In our work we
were able to classify Twitter users into normal users vs spam/robots, in a very similar
fashion to the anomaly detection framework. In [40] this framework was implemented
for finding anomalous users in Facebook, Yelp and Twitter. Their key observation
is that in all three online social networks (Facebook, Yelp, Twitter) the normal user
behavior is low-dimensional (i.e., few PC’s capture most of the variance). They
verified that indeed attacker behavior appears anomalous relative to normal user
behavior (as captured by the top PC’s). The validation was done against a large
labeled data set of anomalous users of various kinds.
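The decomposition x = xn + xa and the threshold test can be sketched as follows (an illustration of the anomaly-detection framework just described, with PCs passed as matrix columns and an arbitrary threshold):

```python
import numpy as np

def is_anomaly(x, V, r, threshold):
    """PCA-subspace anomaly test.

    Decomposes x = x_n + x_a, where x_n is the projection of x onto the
    normal subspace N_r spanned by the top r PCs (columns of V), and
    flags x as anomalous if the residual 2-norm ||x_a|| exceeds the
    threshold.
    """
    Nr = V[:, :r]
    x_n = Nr @ (Nr.T @ x)     # projection onto the normal subspace
    x_a = x - x_n             # abnormal component
    return bool(np.linalg.norm(x_a) > threshold)
```

With orthonormal PC columns, `Nr @ (Nr.T @ x)` is exactly the orthogonal projection onto N_r, so the residual lives in the perpendicular "abnormal" subspace.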
2.4. Crawling Twitter
In this section we shall describe in detail the software that we designed to generate
a sample of the Twitter social network. As we already mentioned, we attempt to
use a feature set which is as generic as possible in order to make the method easily
Var Attribute Description
x1 NumOfTweets Total number of tweets
x2 NumOfFollowers Total number of users following me
x3 NumOfFollowing Total number of users I follow
x4 LikesGivenToOthers Number of tweets that I like
x5 NumOfTxt Total number of tweets that contain only text *
x6 NumOfUrl Total number of tweets that contain URLs *
x7 NumOfMyRT Number of other users’ tweets that I re-tweet *
x8 NumOfOthRT Total number of retweets of my tweets by others *
x9 TweetsPerDay Total number of tweets divided by lifetime (in days)
x10 NumOfUserMent Number of other users mentioned in the tweets *
x11 NumOfHashTag Number of hashtags # referenced in the tweets *
x12 LikesGivenToMe Number of likes that my tweets received
Table 1: Feature details. * The measure is computed over the recent 150 tweets.
transferable to other social networks. Hence, for example, we ignore the content of the
tweets, since not all social networks are text-oriented, e.g. Instagram and Snapchat.
Our feature set. We decided upon 12 quantitative features that capture various
aspects of the activity of a user in a social network. Table 1 summarizes the 12
features. Features x1, x3–x7 reflect the volume of activity for a specific user.
Features x2, x8–x12 capture the interaction between users.
Twitter’s API. We used the official Twitter API to crawl the Twitter
network. Via the API we send a data request and receive the Twitter server’s response.
To access the API we needed to subscribe to the Twitter developers website and
receive a special account alongside a unique access token to establish a connection
between local computers and Twitter servers. Access to Twitter servers is limited to
once every 15 minutes [49]. In every request we are allowed to query the API about
180 Twitter accounts (per developer account) [50]. The extraction of 180 Twitter
accounts took three minutes on average.
Our crawler is written in C# and uses an open-source C# library called Tweetinvi [51],
which provides a set of basic methods to communicate with the Twitter server. The
platform C# was chosen after confirming that the code can be easily adapted to
work with the popular APIs of other social networks. The software contains three
classes that take care of the user data and the crawling procedure. Figure 2 describes
the UML structure.
• User class holds the data of a single Twitter account. This class holds all
the basic and general information regarding the user that doesn’t require any
special calculation (for example: number of followers, number of tweets).
• UserTweets class inherits from the User class and its purpose is to compute the
tweet-based feature set for each specific user. This class holds a function that takes
the last 300 tweets of each user and calculates the tweets-related features
described in Table 1.
• DataExport class is responsible for gathering all the User class data and organizing
it in a table data type. It enables exporting the data into an Excel-compatible
format (CSV). Validation scripts are run before exporting the data in order to remove
duplicates and incomplete data. Another important role of the DataExport class
is to perform backups of the data during the crawling step, every time a 15-minute
collection interval ends.
The Crawling algorithm. The crawling method that we use is BFS. We use
a queue to hold Twitter user IDs. The queue is initialized with a manually selected
user. Before the crawling starts we activate the seven different Twitter developer
accounts by sending to the Twitter server a request containing the special user token.
Once the developer user is approved, the crawl proceeds in a BFS manner by popping
the first user from the queue, retrieving all accounts that this user follows,
and pushing them to the end of the queue. A detailed view of the algorithm
is given in Figure 3.
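The BFS procedure above can be sketched in a few lines. This is a simplified illustration, not the C# crawler itself: `fetch_following` is a hypothetical placeholder for the rate-limited Tweetinvi/API call that returns the IDs a user follows, and rate limiting, validation, and backups are omitted.

```python
from collections import deque

def bfs_crawl(seed_id, fetch_following, max_users):
    """BFS crawl sketch starting from a manually selected seed user.

    `fetch_following` is a hypothetical stand-in for the API call that
    returns the account IDs a given user follows.
    """
    queue = deque([seed_id])
    seen = {seed_id}                  # avoid re-crawling duplicates
    profiles = []
    while queue and len(profiles) < max_users:
        uid = queue.popleft()         # pop the first user from the queue
        profiles.append(uid)
        for fid in fetch_following(uid):
            if fid not in seen:
                seen.add(fid)
                queue.append(fid)     # push followings to the end
    return profiles
```

The `seen` set plays the role of the validation step that removes duplicates, but applied during the crawl rather than at export time.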
3. Related Work
There are two main quantitative approaches to study online social networks. The first,
which one may call the classical approach, was initially proposed in [3] and followed
by a large body of work. In this approach the network is represented as a graph,
where nodes denote objects (people, webpages) and edges between the objects denote
interactions (friendships, physical interactions, links). The graph is then analyzed
through the lens of graph theory as studied in Computer Science, and phenomena like
the power-law degree distribution and the small-world phenomenon are examined. Rel-
evant papers to our theme deal with the problem of community detection in which
network nodes are joined together in tightly knit groups, between which there are
only looser connections. These communities often correspond to groups of nodes
that share a common property, role or function [13, 14], and in particular community
structure may be used to detect anomalies such as spam reviewers [42, 43]. Another
strand of relevant work deals with phenomena such as core/periphery structure which
captures the notion that many networks decompose into a densely connected core and
a sparsely connected periphery [5, 17].
In this master thesis we are interested in a statistical machine-learning approach,
which we believe enables more nuanced classification results. At each node
(user) various statistics are collected (that are relevant to the classification task),
then one of the many machine learning algorithms is applied to the data to compute
a prediction (classification) model. In this line of work, the overwhelming majority of
work uses supervised learning algorithms to derive the model. In addition, as algo-
rithms for natural language processing (NLP) and automated sentiment analysis of
text developed, many of the features used for user classification were based on these
tools. The body of work is very large and here we give a random sample. Researchers
investigated the detection of gender from traditional text [16] or movie reviews [31],
the blogger’s age [6], or user biographic data from search queries [19, 45]. As online
social networks developed, researchers tried to predict the user’s ethnicity and
political affiliation [32], gender, age, regional origin [36], occupational class [33], user
income [34], and demographics [9]. Another central task is detection of anomalous
and fraudulent user behavior and specifically detecting spam activity. We survey re-
lated work in that direction when we discuss our results obtained for spam detection
using PC2 in Section 4.2.
On the other hand, very few researchers attempted to perform user classification
using an unsupervised approach (although in some sense the graph-oriented approach
we presented above is an unsupervised approach). Even fewer tried classification
without analyzing the content that the user publishes on his profile (in fact we know
of only two such works, [7] and [40]). This is of course understandable – why not use
all the statistical and computational power at hand to solve the problem with better
accuracy? Indeed, [36] report that relinquishing text-based features, and relying only
on what we called simple statistics, made the prediction of latent properties such as
age and gender in Twitter practically impossible.
As we mentioned before, the purpose of the present work is not to show that one
can perform the aforementioned classification tasks with unsupervised learning and
simple statistics. This is probably false. Our purpose is to understand what can be
said about user classification in online social networks in the unsupervised feature-restricted
setting. Having said that, one of the results we obtained was a perceptron
for spam detection which performed very well on test data. The problem of spam
detection is typically solved with supervised learning and heavily relies on content
analysis. In this case we were able to suggest an unsupervised feature-restricted
alternative.
The two works most relevant to ours, [7] and [40], used PCA to classify users;
[7] studied the YouTube network. Similarly to our findings,
a meaningful classification was obtained using the top PC’s and only simple
statistics. In [7] the top four PC’s carry semantic signals similar to our findings:
measures of popularity and activity (but notably no spam label). In [40] the target
is detecting anomalous users in online social networks. The methodology is different
than ours. The working hypothesis is that the top PC’s represent the plane of normal
user behavior, and anomalous activity is identified as having a small projection on
the normal plane and a large projection on orthogonal ones. The set of features
used in [40] is different and includes for example counting the likes that a user
gives in Facebook according to various page categories (sports, politics, etc). The
methodology was applied to Facebook, Twitter and Flickr with findings similar to
ours: user behavior is captured by a small number of PC’s (three to five depending
on the network). The authors of [40] validated the efficiency of their predictor on
a set of labeled data with good accuracy. Unlike [40], we are not tuned to a specific
task, but rather to uncovering the latent labels in the network, come what may.
Our work extends [7] by introducing a fully-fledged methodology to perform the
task, including quantitative measures to evaluate the goodness of the classification,
and by introducing the utilization of sparse PCA. In addition, our work reinforces the
possibility of using unsupervised learning methods to find meaningful classification
in online social networks by applying our methodology to Twitter.
4. Twitter - Case study
In this section we present the results of applying our method to the Twitter social
network. Recall that our goal is to identify various user-types as captured by the
principal components of the covariance matrix and to compute a score for the clas-
sification using Equations (1) and (2). This task is not a-priori certain to succeed
since there is no guarantee that the PC’s can be interpreted in a meaningful way. To
collect the Twitter data we implemented a crawler that crawled the social network
graph, following a snowball approach, exploiting the public API provided by Twitter.
This approach is commonly used in the literature [39]. Crawling starts from a list
of randomly selected users and proceeds in a BFS manner. At each step the crawler
pops a user v from the queue and explores its outgoing links (there is a link from v to
w if v follows w) by adding them to the end of the queue. The crawling rate was
about 25,000 users per day (there are limitations posed by the Twitter API), and we
collected a total of 284,758 active Twitter accounts. For each account we collected
the set of features described in Table 1.
The attributes in Table 1 represent two types of information: data about the
user’s activity in the social plane (followers, following, re-tweets) and data about the
user’s activity in the content plane (tweets, text vs. urls, etc). Twitter’s social network
graph is directed, unlike Facebook’s for example. Since the features have different
scales, we normalized the data set to unit variance, as is common in such cases (see
for example [24]). Using the 284,758 Twitter profiles we created a 12 × 12 covariance
matrix Σ̂.
4.1. Analyzing the Sparse PCA progression
To compute the integrity measure of the progression we first need to fix r, which is the
number of PC’s that we consider. To this end we compute the leading eigenvectors
of the 12 × 12 correlation matrix Σ̂, and sort them in decreasing order of the
corresponding eigenvalues. The eigenvalue λi of the i-th eigenvector is proportional
to the percentage of variance it explains (the variance it explains equals λi / Σj λj).
These ordered values often show a clear “elbow” that separates the most important
dimensions (characterized by a higher percentage of variance) from less important ones.
The first three PCs account for 18.15%, 16.22% and 13.06% of the total variance,
totalling about 50% of the variance (see Figure 12). A random vector would explain
on average 1/12 = 8.33% of the variance, therefore it is reasonable to assume that
the top three PC’s carry real signal rather than noise. Therefore we set r = 3.
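The computation just described — standardizing the features, forming the 12 × 12 matrix, and ranking eigenvectors by the fraction of variance they explain — can be sketched with NumPy. This is an illustrative sketch, not the exact code used in the thesis.

```python
import numpy as np

def top_pcs(X, r):
    """Standardize features, form the correlation matrix, and return the
    r leading eigenpairs together with their explained-variance ratios."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # unit-variance features
    sigma = np.cov(Z, rowvar=False)            # 12 x 12 for 12 features
    vals, vecs = np.linalg.eigh(sigma)         # eigh returns ascending order
    order = np.argsort(vals)[::-1]             # sort by decreasing eigenvalue
    vals, vecs = vals[order], vecs[:, order]
    explained = vals / vals.sum()              # lambda_i / sum_j lambda_j
    return vals[:r], vecs[:, :r], explained[:r]
```

The "elbow" can then be read off the `explained` array, exactly as in the scree plot of Figure 12.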
We solved a progression of k-sparse PCA problems for k = 2, 3, 4, 5 and k = 12
(i.e. non-restricted PCA). Figures 4 and 9 show how the top two PC’s evolve as we
increase k.
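One simple way to solve a k-sparse PCA problem of this kind is the truncated power method: at every power iteration, keep only the k largest-magnitude coordinates and renormalize. We show it here as an illustrative sketch under that assumption, without claiming it is the solver used in this work.

```python
import numpy as np

def k_sparse_pc(sigma, k, n_iter=200, seed=0):
    """Leading k-sparse principal component of the covariance matrix
    `sigma`, via truncated power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=sigma.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = sigma @ v
        keep = np.argsort(np.abs(w))[-k:]   # indices of the k largest entries
        sparse = np.zeros_like(w)
        sparse[keep] = w[keep]              # zero out everything else
        v = sparse / np.linalg.norm(sparse)
    return v
```

On a diagonal covariance matrix the method recovers a leading component supported on at most k coordinates, which is the behavior the progression above relies on.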
Zooming in on PC1. At k = 2 the two non-zero features are the likes given to
the user, and the number of retweets of his messages. This is clearly a measure of
popularity in the social network. At k = 3 the number of followers joins in; however,
for k > 3 no additional significant features appear. At k = 12 (non-restricted PCA)
the two most dominant features (largest entries in absolute value) are indeed the likes
given to the user and the number of retweets, see Figure 7. Looking at our sample,
indeed the top users in the direction of PC1 are teen pop-idols like Justin Bieber,
Zayn Malik.
Zooming in on PC2. Now let us zoom in on the second PC and follow its evolution
in Figure 9. For k = 2 we see two features with opposite signs: the number of messages
containing only text, and the number of messages that contain a URL. At k = 4 two
additional features of opposite signs join in: the number of tweets that the user re-tweets
and the number of tweets that contain hash-tags. The aggregated attributes are
related to the type of activity of the user in Twitter. The negative values (text and
retweet) are characteristic of human Twitter accounts, and the positive ones are more
typical of robot and spam accounts. Indeed, the main way of spamming in Twitter
is by hashtags (how? Simply include a trending hashtag in your tweet and anyone
who clicks the trending topic will see your ad, for free) and urls, which appear in
shortened form on Twitter and make it impossible to know where the url is leading.
We were able to use PC2 as a linear model for spam detection (see Section 4.2).
The benefit of using sparse PCA as opposed to standard PCA is demonstrated
graphically in the factor map of PC2 vs PC3, Figures 10 and 11. The separation of
features is much more evident in the sparse case.
Deciding the value of kmin. To answer this question we compute the integrity
score ι(k) for various k values according to Equation (1). The following table summarizes
the results:

        k = 2   k = 3   k = 4   k = 5   k = 12
ι(k)    0.8     0.84    0.9     0.92    1

To conclude, already for k = 2 we get a good integrity score and we may set kmin = 2.
The choice kmin = 2 is also reinforced by Figure 12, which displays similar scree plot
lines for all k values, and by the fact that our spam detection perceptron (which uses
PC2) already works very well for k = 2 (see the table in the next section).
Computing the robustness score. We performed a truncated crawl, collecting
74,320 Twitter accounts. We ignored users whose projection on either of the PC’s
was more than 1 and obtained the matrix Σ̂^trunc. As a case study, let us compare
the robustness of k = 12 (non-restricted PCA) and k = 4. Figures 13 and 14 show
the loading table for the top three PC’s of the truncated covariance matrix. For
convenience we also provide the cosine similarity matrix of the PC’s:

        PC1^trunc   PC2^trunc   PC3^trunc
PC1     0.94        0.17        0.12
PC2     0.06        0.04        0.01
PC3     0.01        0.2         0.41

It is evident from Figures 13 and 14 that the spam-potential label PC2 is no longer
valid in the truncated search. The similarity matrix shows that PC2 is nearly orthogonal
to the top three truncated PC’s. This is reasonable since local sub-networks
typically do not contain spammers (at least not robots). Therefore a straightforward
computation of ρ(k) would not give the right conclusion. Rather, the correct
similarity matrix to consider is the following, and its score is ρ(12) = 0.675.
        PC1^trunc   PC3^trunc
PC1     0.94        0.12
PC3     0.01        0.41
The robustness score for the 4-sparse classification is higher. Below is the similarity
matrix for k = 4:

        PC1^trunc   PC2^trunc   PC3^trunc
PC1     0.97        0           0
PC2     0           0.09        0.06
PC3     0           0.48        0.21

In this case, the best result is obtained for the matrix whose rows are PC1, PC3 and
whose columns are PC1^trunc, PC2^trunc. The robustness score is ρ(4) = 0.725, which is larger than
ρ(12) = 0.675. Therefore we may conclude that the 4-sparse classification is semantically
more robust than the one obtained using non-restricted PCA. This conclusion is
non-trivial and could not be reached without the new measures that we introduced.
It is a good substitute for regularization, whose penalty parameter cannot be chosen
using cross-validation in an unsupervised setting.
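The score ρ(k) above can be reproduced as the best average cosine similarity over matchings between a subset of the PC's and a subset of the truncated PC's; for example, pairing PC1 with PC1^trunc and PC3 with PC2^trunc gives (0.97 + 0.48)/2 = 0.725. The brute-force sketch below reflects our reading of that computation; the function names are ours.

```python
import numpy as np
from itertools import combinations, permutations

def cosine_matrix(A, B):
    """Absolute cosine similarities between the columns of A and of B."""
    A = A / np.linalg.norm(A, axis=0)
    B = B / np.linalg.norm(B, axis=0)
    return np.abs(A.T @ B)

def robustness_score(pcs, pcs_trunc, m):
    """Best average similarity over all matchings of m PC's from each side.

    `pcs` and `pcs_trunc` hold the components as columns; brute force is
    fine here since at most a handful of PC's are ever matched.
    """
    C = cosine_matrix(pcs, pcs_trunc)
    best = 0.0
    for rows in combinations(range(C.shape[0]), m):
        for cols in permutations(range(C.shape[1]), m):
            avg = float(np.mean([C[i, j] for i, j in zip(rows, cols)]))
            best = max(best, avg)
    return best
```

When the similarity matrix is already known, the same search can of course be run directly on it instead of on the component vectors.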
Finally, let us examine the top users in the PC1^trunc and PC2^trunc measures in the
truncated crawl sample. The top users in the PC1^trunc measure indeed include local celebrities
such as Becky Whitesides, mother of Knoxville teen pop idol Jacob Whitesides, and
TheGoodGodAbove - the official Twitter account of God. The top users in the PC2^trunc
measure include content providers such as blogger Chiara Ferragni (The Blonde Salad),
or aeshir from Alberta, Canada, who provides satirical and comical content. To conclude,
we see that the labels obtained in the first search are repeated truthfully in
the truncated search.
4.2. Using PC2 for Spam Detection
The proliferation of social networking has contributed to an increase in spam activity
[41]. Spammers send unsolicited messages to users with varying purposes, which
include, but are not limited to advertising, propagating pornography, phishing, and
spreading viruses. URL attacks are aided by Twitter’s 140 character limit on tweets
as many legitimate users need to use link-shortening services to reduce the length of
their URLs. The ability to disguise URL destinations has made Twitter a particularly
attractive target for spammers, which has motivated the development of several spam
detection techniques. Spammers may be of various types, for example software-robots,
fake accounts, or hijacked legitimate accounts.
There are many approaches to detecting compromised or false user accounts. Here
we focus on works that use machine-learning (in the Related Work section we men-
tioned a graph-based approach that uses traditional graph parameters to perform the
task). The common machine-learning approach to spam detection uses pre-labeled
data in a supervised learning framework [1, 4, 8, 12, 44, 25]. All these works achieve
precision rates of around 90%. Miller et al. [11] treat the identification of spammers
as an anomaly detection problem rather than classification, where outliers are flagged
as spammers. They utilize a combination of user metrics and one-gram text features.
Their approach achieves an F1 score of 82%, with high accuracy but low precision.
Our approach is also inspired by the PCA-based anomaly detection paradigm [21, 22,
23, 24]. Indeed, [40] implemented this framework and tested it on several online social
networks. The set of features used in [40] is different from ours and includes for
example counting the likes that a user gives in Facebook according to various page
categories (sports, politics, etc). The methodology was applied to Facebook, Twitter
and Flickr with findings similar to ours: user behavior is captured by a small number
of PC’s (three to five depending on the network). The authors of [40] validated the
efficiency of their predictor on a set of labeled data with good accuracy. Another work
which treats spam detection as an anomaly-detection problem is [26]. Rather than
using PCA, a clustering model is obtained and outliers are classified as spammers.
Our approach is also unsupervised and uses PCA but is conceptually much simpler
than [40] and [26], and furthermore does not use any textual features (which the latter
two use). In the aftermath of applying PCA, we observed that one of the PC’s is a
candidate for a spam detection model. Our model is a perceptron: given a Twitter
account x0 ∈ R^12, we compute its projection on PC2. We tested our classifier on a
set of 164 accounts, 64 spam and 95 legitimate, taken from [15]. Figure 17 shows the
ROC plot, and the obtained AUC is 0.98.

        Precision   Recall   F1 score
k = 2   0.98        0.88     0.93
k = 3   1           0.87     0.93
k = 4   1           0.88     0.94
k = 5   0.95        0.97     0.96
k = 12  1           0.94     0.97
The above table describes the classification results with threshold 1, i.e. the user is
classified as spam if ⟨x0, PC2⟩ ≤ 1. The rows of the table give the results for various
choices of k, namely using PC2 from various k-sparse PCA solutions. As k increases
the measurements improve, but not significantly, and already for k = 2
we obtain satisfactory results. We may safely conclude that two features suffice to
separate spam from legitimate Twitter accounts: the number of text-only tweets and
the number of tweets containing a URL.
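The perceptron and the metrics reported in the table can be sketched as follows. This is an illustrative sketch: the PC2 vector and the feature matrix are placeholders, and we only fix the convention that projections at or below the threshold are flagged as spam.

```python
import numpy as np

def spam_flags(X, pc2, threshold=1.0):
    """Flag row x0 of X as spam when <x0, PC2> <= threshold."""
    return (X @ pc2) <= threshold

def precision_recall_f1(pred, truth):
    """Standard binary-classification metrics from boolean arrays."""
    tp = np.sum(pred & truth)       # spam correctly flagged
    fp = np.sum(pred & ~truth)      # legitimate accounts flagged
    fn = np.sum(~pred & truth)      # spam that slipped through
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With the labeled set of [15] in place of the placeholders, evaluating this classifier for each k-sparse PC2 reproduces rows of the table above.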
4.3. Using PC3 for content providers
After examining the prominent features of the third PC we notice that PC3 measures
the extent to which a user is a content provider, i.e. a user that contributes a lot of
information and keeps things interesting and up to date in his social network account.
Analyzing Figures 4 and 16 we see that the prominent features of PC3 indicate a very
active social network user: features such as the number of user tweets, tweets per day,
and the number of URLs. It is interesting to contrast PC3 with PC2, which indicates
the extent of non-legitimate activity; PC3 also contains features like the number of
user mentions and the number of hashtags, which suggest a benign interaction with
other users.
In Figures 9 and 11, when running the sparse PCA model on the first
data set, another interesting feature comes into the picture: likes given to others. This
feature, together with three additional features (the number of tweets, the number of
user mentions, and tweets per day), gives us another verification that the third PCA
plane can be used as a classifier for users that interact with the social network and
contribute a lot of new content. For example, the top-10 users table includes the news
provider littlebytesnews, the video gaming support accounts XboxSupport and PSP,
and the American teen content provider ChelseaAMusic.
Twitter User Name Loading
littlebytesnews 44.6
RedScareBot 28.17
PSP 27.07
XboxSupport 26.42
icarusaccount 20.64
favstar1000es 17.82
stevendickinson 17.66
TheArabHash 16.72
ComercioCenter 16.54
ChelseaAMusic 16.41
Table 2: PC3 TOP10 users
Zooming into the data set. We now show how one can use the three labels
obtained from PC1, PC2, PC3 to classify Twitter accounts from the crawler’s database.
This will give us another opportunity to validate the semantic soundness of the claimed
labels and to show the practical usefulness of our method. For every Twitter account
we compute its coordinates in the new semantic system spanned by the top three PCs.
To recall, PC1 indicates the level of popularity in the social network, PC2 indicates
the level of legitimacy of the user inside the network, and PC3 indicates the level of
activity and content contribution of the user to the social network. Our labels may
also provide a vantage point for marketers: using PC1 and PC3 the marketer can
identify “local stars” of a particular local subnetwork and use them for advancing
new products, targeted commercials, targeted local information, etc. As we verified
in Section 2.2, the obtained labels are semantically robust and retain their meaning
also in subnetworks. In Table 18 we present a set of users selected according to the
following criteria. We chose users with a mid-range projection on PC1 (values between
0 and -0.01) and a large projection on PC2 so that they will not count as spam accounts
(a user counts as a spammer if ⟨x0, PC2⟩ ≤ −1.25). The results clearly show that
indeed the users we chose may be labeled as local stars in the social network,
such as comedians, local singers, bloggers, lifestyle advisors and the like.
4.4. Other Methods of Unsupervised Learning
It is natural to ask what classification results may be obtained using other unsupervised
learning algorithms. The first and most natural candidate is the k-means
algorithm, in which the data is partitioned into k clusters such that the within-cluster
distance is much smaller than the between-cluster distance (if possible of
course). The average silhouette coefficient is a number in [−1, 1] and is a standard
measure of how well the points are clustered (the closer the score to one the better
the clustering). We ran k-means on a sub-sample of 50,000 users (we could not
run on the entire set of 250K users due to performance issues) with k = 3, 4, 5. All
the executions resulted in poor clustering, with an average silhouette coefficient of
0.16−0.17. We also ran a soft clustering algorithm known as fuzzy k-means.
In this algorithm each point x is assigned a k-dimensional vector vx whose entries
are interpreted as weights. The quantity vx[i] / Σj vx[j] measures to what extent x
belongs to cluster i. The closer the entries are to 1/k, the fuzzier the clustering. We
ran R’s fuzzy k-means algorithm (FANNY) on the 50,000-user data set for
k = 3, 4. Dunn’s normalized coefficient (which measures the average fuzziness of
the clustering) was 0.53 and 0.502 for k = 3 and k = 4 respectively. Rounding the fuzzy
clustering to a crisp clustering gave one cluster with an average silhouette of about 0.9
and the remaining clusters with negative silhouettes (for k = 3 and k = 4). Namely,
rounding the fuzzy solution gave bad clusters.
We ran PCA on the same sub-sample, and the same semantic map obtained on the
entire data set was reconstructed. Indeed, the poor results of the k-means algorithm
are to be expected, as the scatter plots in Figure 21 suggest: rather than a clustered
landscape we see a heavy-tailed distribution. In other words, the vast majority of
the users are neither celebrities, nor spammers, nor content providers. They all fall in
one large cluster, and running k-means with k > 1 is an artificial attempt to break
this cluster. More generally, the clustering offered by k-means is too coarse for our
purpose, while PCA offers what we called “soft clustering”. Instead
of assigning each user to one cluster, we measure the expression of a certain label
(which represents some complex quality) in that user. Therefore a user may belong
to many “clusters” at the same time, with different significance levels.
PCA is a member of a larger family of algorithms performing various transforma-
tions of the original set of features. Two other algorithms that we ran on our data
are kernel PCA and ICA (Independent Component Analysis).
We ran kernel PCA with two different kernels: a gaussian kernel and a polynomial
one. The Gaussian kernel wiped all the variance in the dataset and basically projected
all points to a single point. On the other hand the polynomial kernel focused semantics
on the measure of fame: the top users in the first PC were users with very high
number of followers (tens of millions, e.g. Barack Obama), and for the second PC
users with a very large number of tweets. We could not interpret the third PC. This
is also consistent with the corresponding eigenvalues: 0.35, 0.23 and 0.09. In addition
the spam measure disappeared. To conclude, we found that using various kernels
may enhance and focus the semantics on certain user qualities, and we leave the
further understanding of how different kernels enhance different semantical aspects
as a question for future research.
Finally, we applied ICA (Independent Component Analysis) to our dataset. The
ICA algorithm seeks a transformation of the original feature set into components
that are as statistically independent as possible. Typically ICA and PCA are
combined, where first PCA is applied to
“whiten” the data (remove linear correlations) and then ICA is applied to the PC’s.
More formally, let F = (x1, x2, . . . , xp) be the original feature set, and let PC1, . . . , PCr
be the top r PC’s. Define the new feature set V = (v1, . . . , vr) via vi = ⟨PCi, F⟩. The
top s IC’s Y = (y1, . . . , ys) are given by Y = W V^T, where W ∈ R^{r×s} is the output of
the ICA optimization function. In our case we chose s = r = 3 and the matrix W we
obtained is
W =
    [  0.97     0.25     0.04   ]
    [ -0.25     0.97     0.01   ]
    [  0.04     0.00001  0.999  ].
The matrix is rather close to the identity matrix, which means that the PC’s are not
only uncorrelated but also close to being statistically independent.
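The PCA-then-ICA pipeline described above can be sketched with scikit-learn. This is an illustrative sketch under that assumption: when the PC scores are already nearly independent, the unmixing matrix returned by FastICA is close to a signed, scaled permutation of the identity, which is the effect observed here.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

def pca_then_ica(X, r=3, seed=0):
    """Whiten via the top r PCs, then run FastICA on the PC scores
    and return the r x r unmixing matrix W."""
    V = PCA(n_components=r).fit_transform(X)   # scores on the top r PC's
    ica = FastICA(n_components=r, whiten="unit-variance",
                  max_iter=1000, random_state=seed)
    ica.fit(V)
    return ica.components_                     # the unmixing matrix W
```

If each row of the returned W has a single dominant entry (after normalizing the rows), the PC's were already close to statistically independent.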
The results of the ICA combined with the factor map in Figures 15 and 16 suggest
the following interesting insight: the signals that each principal component carries
are approximately both statistically and semantically independent. In other words,
in a quantifiable sense, one can argue that the PCA labelling is succinct. The latter
notion of succinctness is naturally combined with the notion of a semantic dimension,
and together provide another aspect of the obtained labelling.
5. Discussion
The recent technological advances provide us with the ability to cheaply accumulate
unlabeled data on a very large scale. On the other hand labeled data may be costly
and hard to obtain. Therefore we see great value in understanding what contribu-
tion unsupervised methods provide for tasks that are traditionally approached via
supervised learning. This was our research question in this master thesis, and in
particular we asked for the minimal number of features that still enables meaningful
classification. In this work we introduced a new methodology to derive soft
classification using sparse PCA alongside two new scores, integrity and robustness, to
help with the problem of model selection. We applied our methodology to the Twit-
ter social network and derived three labels: measure of fame (celebrity), spammer,
and content provider. Using the integrity score we concluded that using merely two
features per label is sufficient, and using the robustness score we concluded that a
sparse solution was more semantically robust than the non-restricted one (Section 4).
The limitations of our technique are obvious – we learn the labels that are inherent
(yet latent) in the data and do not train a model for a specific goal. For example, if the
goal is to predict the age of the user or his political affiliation, this may not be possible
from the labels that were found via PCA. Also the fact that we are intentionally not
using any linguistic features and ignore the content that the user posts is a limitation.
On the positive side, we obtain a cheaply-computable methodology that generalizes
easily across online social networks, and may provide insights also for networks where
textual content is not available (either intentionally like Snapchat or due to the fact
that the main content is graphical like Instagram).
An interesting question for future research is to use our methodology in order to
compare different online social networks. We presented several parameters that may
serve as the basis for comparison: the semantic labels, the semantic dimension (kmin),
and the semantic robustness. Another interesting parameter that we suggest computing
is the semantic redundancy of the network. As we observed at the end of Section
4.4, the PC’s that we obtained were both (approximately) statistically independent
and semantically orthogonal (Figures 15 and 16). Therefore one may argue that the
classification we obtained has little redundancy (i.e., it is succinct). Taking into account
the extent to which the PC’s are feature-wise orthogonal, the semantic dimension,
and the extent to which the PC’s are statistically independent (for example using ICA
as a proxy for independence), one can concoct a measure of semantic redundancy.
This measure can then be used to compare different online networks.
6. Figures
Figure 1: User UML diagram
Figure 2: User tweets UML diagram
Figure 3: Crawler algorithm
Figure 4: PC1 progression, fame measure.
Figure 5: PC2 progression, spam detector.
Figure 6: PC3 progression, content detector.
Figure 7: Top PCs
Figure 8: Top 4-sparse PCs
Figure 9: PC1 vs PC2 factor map.
Figure 10: PC2 vs PC3 SPCA factor map.
Figure 11: PC2 vs PC3 SPCA factor map.
Figure 12: Scree plots for various k. Each color represents the scree plot for a k-sparse
PCA solution. The x-axis is the PC number; the y-axis is the percentage of variance
explained by that PC.
Figure 13: Top truncated PCs
Figure 14: Top truncated 4-sparse PCs
Figure 15: PC1 vs PC2 SPCA factor map.
Figure 16: PC1 vs PC3 PCA factor map.
Figure 17: Spam detection ROC plot. AUC = 0.98
Figure 18: Combining the planes
Figure 19: PC1 vs PC3 scatter plot.
Figure 20: PC2 vs PC3 scatter plot.
Figure 21: PC1 vs PC2 scatter plot.
7. Bibliography
[1] A. Amleshwaram, N. Reddy, S. Yadav, and C. Yang. Cats: Characterizing
automation of twitter spammers. Technical report, Department of Electrical and
Computer Engineering, Texas A&M University, 2013.
[2] T.W. Anderson. An introduction to multivariate statistical analysis. Wiley series
in probability and mathematical statistics. Wiley, 2nd edition, 1984.
[3] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation
in large social networks: Membership, growth, and evolution. In proc. of the
12th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD), pages 44–54, 2006.
[4] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting spammers
on twitter. In Seventh annual Collaboration Electronic messaging Anti-Abuse
Spam Conference Redmond Washington U.S., 2010.
[5] S. P. Borgatti and M. G. Everett. Models of core/periphery structures. In Social
Networks, 21(4):375–395, 2000.
[6] J. Burger and J. Henderson. An exploration of observable features related to
blogger age. In Computational Approaches to Analyzing Weblogs: Papers from
the 2006 AAAI Spring Symposium, pages 15–20, 2006.
[7] C. Canali, S. Casolari, and R. Lancellotti. A quantitative methodology to identify
relevant users in social networks. In Business Applications of Social Network
Analysis (BASNA), 2010 IEEE International Workshop on, pages 1–8, 2010.
[8] M. Chuah and M. McCord. Detection on twitter using traditional classifiers.
In Autonomic and Trusted Computing: 8th International Conferencem, Banff,
Canada, pages 2–4, 2011.
[9] A. Culotta, N.R. Kumar, and J. Cutler. Predicting the Demographics of Twitter
Users from Website Traffic Data. In AAAI, pages 72–78, 2015.
[10] A. d’Aspremont, L. El-Ghaoui, M. Jordan, and G. Lanckriet. A direct formula-
tion for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–
448, 2004.
[11] Z. Miller, B. Dickinson, W. Deitrick, W. Hu, and A.H. Wang. Twitter spammer
detection using data stream clustering. Information Sciences, 260:64–73, 2014.
[12] M. Fernandes, P. Patel, and T. Marwala. Automated detection of human users
in Twitter. Procedia Computer Science, 53:224–231, 2015.
[13] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174,
2010.
[14] M. Girvan and M. Newman. Community structure in social and biological net-
works. In PNAS 99(12):7821–7826, 2002.
[15] A. Gulec and Y. Khan. Feature selection techniques for spam detection on
Twitter. Technical report, Electronic Commerce Technologies (CSI 5389) Project
Report, School of EE-CS, University of Ottawa, 2014.
[16] S. Herring and J. Paolillo. Gender and genre variation in weblogs. Journal of
Sociolinguistics, 10(4):439–459, 2006.
[17] P. Holme. Core-periphery organization of complex networks. Phys. Rev. E,
72(4):046111, 2005.
[18] I.T. Jolliffe. Principal Component Analysis. Springer series in statistics. Springer,
2nd edition, 2002.
[19] R. Jones, B. Pang, R Kumar, and A. Tomkins. I know what you did last summer
- query logs and user privacy. In proc. of the sixteenth ACM conference on
information and knowledge management, pages 909–914, 2007.
[20] R. Krauthgamer, B. Nadler, and D. Vilenchik. Do semidefinite relaxations solve
sparse pca up to the information limit? Annals of Statistics, 43(3):1300–1322,
06 2015.
[21] A. Lakhina, M. Crovella, and C. Diot. Characterization of network-wide anoma-
lies in traffic flows. In proc. of the 4th ACM SIGCOMM Conference on Internet
Measurement, pages 201–206, 2004.
[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies.
SIGCOMM Comput. Commun. Rev., 34(4):219–230, 2004.
[23] A. Lakhina, M. Crovella, and C. Diot. Mining anomalies using traffic feature
distributions. SIGCOMM Comput. Commun. Rev., 35(4):217–228, 2005.
[24] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft.
Structural analysis of network traffic flows. SIGMETRICS Perform. Eval. Rev.,
32(1):61–72, 2004.
[25] C. Meda, F. Bisio, P. Gastaldo, and R. Zunino. A machine learning approach for Twitter spammers detection. In 2014 International Carnahan Conference on Security Technology (ICCST), pages 1–6, 2014.
[26] Z. Miller, B. Dickinson, W. Deitrick, W. Hu, and A. Wang. Twitter spammer detection using data stream clustering. In Information Sciences, 260:64–73, 2014.
[27] R. J. Muirhead. Aspects of Multivariate Statistical Theory. Wiley, New York, 1982.
[28] A. Mukherjee, B. Liu, and N. Glance. Spotting fake reviewer groups in consumer reviews. In Proceedings of the 21st international conference on World Wide Web, pages 191–200. ACM, 2012.
[29] B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In proc. of the 23rd International Conference on Machine Learning, pages 641–648, 2006.
[30] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM J.
Comput., 24(2):227–234, 1995.
[31] J. Otterbacher. Inferring gender of movie reviewers: Exploiting writing style,
content and metadata. In proc. of the 19th ACM international conference on
information and knowledge management, pages 369–378, 2010.
[32] M. Pennacchiotti and A.M. Popescu. A machine learning approach to Twitter user classification. In proc. of the 5th International Conference on Weblogs and Social Media, pages 281–288, 2011.
[33] D. Preoţiuc-Pietro, V. Lampos, and N. Aletras. An analysis of the user occupational class through Twitter content. In The Association for Computational Linguistics, 2015.
[34] D. Preoţiuc-Pietro, S. Volkova, V. Lampos, Y. Bachrach, and N. Aletras. Studying user income through language, behaviour and affect in social media. In PLoS ONE, 10(9):e0138717, 2015.
[35] D. Ramage, S. Dumais, and D. Liebling. Characterizing microblogs with topic models. In proc. of the 4th International Conference on Weblogs and Social Media, pages 1–1, 2010.
[36] D. Rao, D. Yarowsky, A. Shreevats, and M. Gupta. Classifying latent user
attributes in Twitter. In proc. of the 2nd International Workshop on Search and
Mining User-generated Contents, pages 37–44, 2010.
[37] N. Trendafilov, I.T. Jolliffe, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12:531–547, 2003.
[38] D. Vilenchik and B. Yichye. Twitter Crawler. https://github.com/barakyi/
Twitter_crawler, 2016.
[39] B. Viswanath, A. Mislove, M. Cha, and K. Gummadi. On the evolution of user
interaction in facebook. In proc. of the 2nd ACM Workshop on Online Social
Networks, pages 37–42, 2009.
[40] B. Viswanath, M. Bashir, M. Crovella, S. Guha, K. Gummadi, B. Krishnamurthy,
and A. Mislove. Towards detecting anomalous user behavior in online social
networks. In 23rd USENIX Security Symposium (USENIX Security 14), pages
223–238, 2014.
[41] A. Wang. Detecting spam bots in online social networking sites: a machine learning approach. In 24th Annual IFIP WG 11.3 Working Conference on Data and Applications Security, 2009.
[42] G. Wang, S. Xie, B. Liu, and P. Yu. Review Graph Based Online Store Review
Spammer Detection. In proc. of the 11th IEEE International Conference on Data
Mining, pages 1242–1247, 2011.
[43] G. Wang, S. Xie, B. Liu, and P. Yu. Identify Online Store Review Spammers via
Social Review Graph. In ACM Trans. Intell. Syst. Technol., 3(4):1–21, 2012.
[44] A. Wang. Don’t follow me: Spam detection in Twitter. In Int’l Conference on Security and Cryptography (SECRYPT), pages 1–10, 2014.
[45] I. Weber and C. Castillo. The demographics of web search. In proc. of ACM
SIGIR conference on Research and development in information retrieval, pages
523–530, 2010.
[46] D. Witten, R. Tibshirani, and R. Hastie. A penalized matrix decomposition, with
applications to sparse principal components and canonical correlation analysis.
Biostatistics, 10(3):515–534, 2009.
[47] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15:262–286, 2006.
[48] Z. Zhang, H. Zha, and H. Simon. Low-rank approximations with sparse factors
I: Basic algorithms and error analysis.
[49] Twitter rate limitation - 15 minute time interval. https://dev.twitter.com/rest/public/rate-limiting
[50] Twitter rate limitation - 180 user requests. https://dev.twitter.com/rest/reference/get/users/show
[51] Tweetinvi - a C# library implementing the Twitter API. https://tweetinvi.codeplex.com

Contents

1 Introduction 5
    1.1 Our Contribution 7
2 Methodology 11
    2.1 Sparse PCA and the semantic dimension 13
        2.1.1 The Integrity Score 14
        2.1.2 A Note on Computational Efficiency 15
    2.2 Semantic robustness 15
    2.3 Connection to Anomaly Detection 17
    2.4 Crawling Twitter 17
3 Related Work 20
4 Twitter - Case study 23
    4.1 Analyzing the Sparse PCA progression 24
    4.2 Using PC2 for Spam Detection 28
    4.3 Using PC3 for content providers 30
    4.4 Other Methods of Unsupervised Learning 32
5 Discussion 35
6 Figures 37
7 Bibliography 48
List of Figures

1 User UML diagram 37
2 User tweets UML diagram 37
3 Crawler algorithm 38
4 PC1 progression, fame measure 39
5 PC2 progression, spam detector 39
6 PC3 progression, content detector 39
7 Top PCs 39
8 Top 4-sparse PCs 39
9 PC1 vs PC2 factor map 40
10 PC2 vs PC3 SPCA factor map 41
11 PC2 vs PC3 SPCA factor map 42
12 Various-k scree plot. Each color represents the scree plot for a k-sparse PCA solution; the x-axis is the PC number, the y-axis is the percentage of variance explained by that PC. 43
13 Top truncated PCs 43
14 Top truncated 4-sparse PCs 43
15 PC1 vs PC2 SPCA factor map 44
16 PC1 vs PC3 PCA factor map 45
17 Spam detection ROC plot. AUC = 0.98 46
18 Combining the planes 46
19 PC1 vs PC3 scatter plot 47
20 PC2 vs PC3 scatter plot 47
21 PC1 vs PC2 scatter plot 48

List of Tables

1 Feature details. * The measure is computed over the recent 150 tweets. 18
2 PC3 top-10 users 31
1. Introduction

Online social networks, and in particular microblogging services such as Twitter and Instagram, have become an important part of the daily life of millions of users. In addition to communicating with friends and family, microblogging services are used as recommendation services, real-time news sources, and targeted advertising platforms for commercial companies. The tremendous increase in the popularity of social networking sites allows companies to collect a huge amount of personal information about the users, their friends, and their habits. From a business point of view, this represents an opportunity to reach a large audience for marketing purposes. On the other hand, this wealth of information, as well as the ease with which one can reach many users, has attracted the interest of malicious parties. In particular, spammers are always looking for ways to reach new victims with their unsolicited messages. Therefore the task of classifying users in the social network is a fundamental and important one, and it is the focus of the present work.

User classification in online social networks has been studied widely in the literature; the goal is to predict various user attributes from the user's profile. The attributes are numerous and diverse, for example the user's ethnicity and political affiliation [32], gender, age, and regional origin [36], occupational class [33], income [34], and demographics [9]. The list is of course much (much) longer, but one thing is common to all the aforementioned works: the prominent features that are used are all text-based and sociolinguistic, and computing them requires various (non-trivial) tools of text and sentiment analysis. In addition, all these works apply supervised learning algorithms (either for the text analysis or for the prediction task).

In this master thesis we want to study an orthogonal direction and ask if and what
quality of classification may be obtained from an online social network when using only the simplest of statistics. By simple statistics we mean the network structure (e.g. who follows whom, who is friends with whom) and basic communication behavior traits (e.g. the number of tweets per day); the content of the user's feed is ignored. If classification is possible, we may further ask what is the minimal number of features that still allows for meaningful classification, or in other words, how succinct the labelling can be made.

The motivation behind these questions is in part theoretical: how much information is needed to capture interesting "signals" about the users of an online social network? Indeed, a recent work by Rao et al. [36] tried a machine learning approach to obtain an accurate prediction of latent user attributes such as gender or age (which are latent in Twitter) using only simple statistics as mentioned here. They report that they were not able to perform the task, and that adding sociolinguistic features (based on the user's tweets) dramatically increased the accuracy of the prediction.

Another motivation for studying simple-statistics classification is practical, and the following example highlights its crux. One of the new players in the social network arena is Snapchat (which passed Twitter in daily usage, with 150 million people using it each day as of 2016). The trademark of Snapchat is that the content disappears shortly after being posted. Therefore relying on content to provide insights about the user is impossible in this case. More generally, the content that users post is not always textual, e.g. pictures and videos. Extending text analysis to pictures and videos is not an easy task and requires sophisticated algorithms. Therefore it is helpful to understand what can be learned about the users in the network without analyzing the content that they post.
The second question that we want to study in this master thesis is whether unsupervised learning algorithms are useful in the context of user classification, and if so,
can one develop a scoring mechanism to evaluate the goodness of fit? In particular we are interested in a notion we call "semantic robustness": since in an unsupervised model it is only in hindsight that one interprets the model and assigns it a semantic meaning, it is reasonable to ask how robust the given interpretation is. For example, if one takes a certain subset of the data (e.g. a certain subnetwork) and recomputes the model for this subset, will the model retain its original semantic interpretation?

Finally, let us stress that the point of this work is not to show that one can replace supervised learning algorithms that use in-depth information about the user with unsupervised learning that relies only on simple statistics. This would of course be too naive. The goal of this master thesis is to see what can be achieved with one (or both) hands tied behind the back: relinquishing text analysis and labeled data, the two main pillars of current user classification algorithms.

In this work we answer the research questions that we posed affirmatively and demonstrate that meaningful user classification is obtainable in the Twitter social network using an unsupervised approach and only simple (non-textual) statistics. In addition we develop a new scoring mechanism that allows us to assess the quality of the classification and perform model selection. In the next section we describe our results in more detail.

1.1. Our Contribution

Natural unsupervised learning algorithms for classification are clustering algorithms (such as k-means), feature-transformation methods such as Principal Component Analysis (PCA for short) and its variants (sparse PCA, kernel PCA), and Independent Component Analysis (ICA). Our method relies on PCA (in Section 4.4 we discuss the
various options and what led us to choose PCA). In the context of user classification, the principal components (PCs) define a set of labels, which may (or may not) have a meaningful semantic interpretation. The PCs may represent complex user attributes such as being a celebrity or being a spammer. Typically the top r PCs are considered, where the parameter r is chosen according to the total variance explained by that set. The PCs induce a "soft" classification of the users, in the sense that for each user we measure how much it is "of type PC_i". A user may be classified according to the most dominant label (measured as the length of the projection on the relevant PC). The method is reviewed in detail in Section 2.

The main contribution of our work is summarized as follows:

(1) We propose a generic approach that may be applied in a straightforward manner to various online social networks (such as LinkedIn, Instagram, Twitter, etc.). Our approach is generic in the sense that the user-profile statistics that we use are common (or very similar) across many social networks, e.g. the number of followers or the number of likes.

(2) We introduce a new concept which we call the semantic dimension of the problem. In order for the PCs to be amenable to meaningful interpretation, it is (almost) obligatory that the obtained PCs be sparse. A common practice to achieve this goal is to zero out the entries of lowest absolute value in every PC. We suggest a new methodology to perform this task, and a way of validating the result. We suggest using sparse PCA instead of standard PCA, together with a way to choose the correct sparsity parameter k (the number of allowed non-zeros in every PC). We identify the minimal semantically-admissible sparsity parameter kmin as the semantic dimension of the
problem, alongside r, which is the algebraic dimension. We suggest choosing kmin by solving a progression of k-sparse PCA problems for increasing values of k. Each k-sparse solution receives a score which reflects how well the non-sparse labels are retained (details in Section 2.1). The progression also allows us to zoom in on how the feature set that defines each label evolves, and enables refined and insightful feature selection.

(3) We suggest a new score which we call the semantic robustness of the classification. Robustness in the sense that the labels remain semantically valid for various types of (sub-)networks: for example users from a specific region, students in a certain school, etc. To compute the robustness score we perform a truncated crawl of the social network. In this crawl we ignore users with a high expression of any of the derived labels (i.e. with a large projection on any of the leading PCs). Our robustness score reflects the extent to which the PCs of the "truncated" covariance matrix retain the semantics of the original PCs (details in Section 2.2). Using the robustness score we can perform a kind of cross-validation and choose the sparsity parameter which gives the best robustness results. Indeed, in our experiment one of the sparse solutions obtained a better robustness score than the non-restricted PCA.

(4) We show how to use the PCs to obtain a useful labeling of the users in the Twitter network. For example, one of the PCs turns out to be a perceptron for spam detection. We tested our classifier on a set of 164 accounts, 69 spam and 95 legitimate, taken from [15]. Our classifier obtained high precision and recall rates (around 95%) and an AUC score of 0.98. We interpret the leading PC, PC1, as a measure of fame in the network, and PC3 as a measure of legal activity (content providers). We demonstrate how one can use these labels to locate, for example, locally active bloggers: we
look in our Twitter crawling sample for a user with a relatively low projection on PC1 and a relatively high projection on PC3. One such account is KarenInzunzam, with a PC1 value of 0.000095 and a PC3 value of 0.34. Another example is Susan Hozak, with a PC1 value of 0.000032 and a PC3 value of 0.45. As a benchmark for the projection values we note that the vast majority of users have a nearly-zero projection on either PC. We can identify an active celebrity, LanaDelRey, with a PC1 projection of 36 and a PC3 projection of 2.7, and on the other hand a not-so-famous news account which is extremely active, littlebytenews, with a PC1 projection of 2.97 and a PC3 value of 40.

We demonstrate our methodology on one of the more prominent players in the social network arena – Twitter. Twitter is a social network designed as a microblogging platform, where users send short text messages (called tweets) that appear on their friends' pages. Unlike Facebook and MySpace, no personal information is shown on Twitter pages by default, which reinforces our motivation for relying only on simple statistics. Users are identified only by a username and, optionally, by a real name. A Twitter user can start "following" another user; as a consequence, he receives that user's tweets on his own page. The user who is "followed" can, if he wants, follow the other one back. Tweets can be grouped by hashtags, which are popular words beginning with a "#" character. This allows users to efficiently search who is posting topics of interest at a certain time. When a user likes another user's tweet, he can decide to retweet it. As a result, that message is shown to all his followers.
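The robustness check described in contribution (3) can be sketched in a few lines. The following is a minimal illustration, not the thesis code: it drops users whose projection on any leading PC is extreme (mimicking a truncated crawl) and measures how well the recomputed PCs align with the originals via the absolute inner product. The 90% quantile threshold and the toy data are assumptions made for the example.

```python
import numpy as np

def top_pcs(X, r):
    """Top-r principal components of the sample covariance of X (columns of V)."""
    Xc = X - X.mean(axis=0)
    w, V = np.linalg.eigh((Xc.T @ Xc) / Xc.shape[0])
    return V[:, np.argsort(w)[::-1][:r]]

def pc_alignment_after_truncation(X, r=2, quantile=0.9):
    """Recompute the PCs on a 'truncated' sample that excludes users with an
    extreme projection on any leading PC, and return |<v_i, v_i'>| per PC."""
    V = top_pcs(X, r)
    proj = np.abs((X - X.mean(axis=0)) @ V)          # projection of each user on each PC
    extreme = (proj > np.quantile(proj, quantile, axis=0)).any(axis=1)
    V_trunc = top_pcs(X[~extreme], r)
    # absolute inner products handle the sign ambiguity of eigenvectors;
    # we assume the labels come back in the same order (no permutation)
    return np.abs((V * V_trunc).sum(axis=0))

# Toy data: 500 "users", 5 features with two dominant directions.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) * np.array([5.0, 3.0, 1.0, 1.0, 1.0])
alignment = pc_alignment_after_truncation(X)         # values near 1 = robust semantics
```

A full robustness score would also compare explained variances, as Section 2.2 describes; this sketch captures only the loading-alignment part.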
2. Methodology

In this section we describe our proposed methodology for user classification in online social networks using PCA and sparse PCA. We use bold lower-case letters, e.g. x, to signify vectors, and non-bold lower-case letters, e.g. x, to signify scalars. Upper-case letters are reserved for matrices. We consider all vectors as column vectors.

We let p be the number of features collected for each user, n the number of users in the sample, and X the resulting n × p data matrix. Let x = (x_1, ..., x_p) be the random variable with the "true" underlying distribution of the feature set (to which we of course have no direct access), and Σ = E[xx^T] its population covariance matrix. For simplicity of presentation we assume that the data is centered (i.e. the mean is zero). Let Σ̂ = (1/n) X^T X be the sample covariance matrix of x.

For the sake of completeness we start with a brief description of the PCA method. The first principal component (PC) is defined to be the direction (unit vector) v_1 ∈ R^p in which the variance of x is maximal. The variance of x in direction v is given by the expression v^T Σ v, therefore

    v_1 = argmax_{v ∈ R^p, ||v|| = 1} v^T Σ v.

The latter is the Rayleigh quotient definition of the largest eigenvalue of a matrix, therefore v_1 is the leading eigenvector of Σ, and λ_1 = v_1^T Σ v_1 is the variance explained by v_1. The remaining PCs are defined in a similar way, and together they form an orthonormal basis of R^p. The sample PCs v̂_1, ..., v̂_p are the eigenvectors of the sample covariance matrix Σ̂. Under various reasonable assumptions it was proven that the principal components v_1, ..., v_p converge to the sample ones v̂_1, ..., v̂_p [2, 27]. We assume that this is true in our case, and we justify it by the fact that we are in the "fixed p, large n" regime, where the ratio p/n tends to 0.

We now proceed to explain how PCA may be used in the context of user classification. To that end it is instructive to take a geometric approach to PCA. Associate the feature x_i with the standard basis vector e_i ∈ R^p (the vector with 1 in the i-th coordinate and 0 otherwise). In this notation, the value of the i-th feature of a user profile x_0 is the scalar product ⟨e_i, x_0⟩. In the same manner, the principal components v_1, ..., v_p form a new coordinate system (representing a new set of features), and the extent to which x_0 is "of type v_i" is given by ⟨v_i, x_0⟩. The new features v_1, ..., v_p are linear combinations of the original set of features, and may represent more complex user characteristics. For example, if the original set of features contains basic statistics such as the number of tweets per day, the number of URLs per tweet, etc., the new features may represent more complex user statistics such as a measure of "celebrity", "content consumer", "spammer", and so on. The obtained classification is therefore "soft", in the sense that there is no single label per user but rather a continuum along each axis, and the most prominent axis for a certain user may serve as his classification label.

Typically, only several PCs carry a meaningful semantic signal about the users, and the remaining PCs correspond to noise. The question is how to identify that subset. Unfortunately there is no single statistically-sound rule that fits all settings and specifies how to select the PCs, but rather various (ad-hoc) heuristics. The following observation is nevertheless useful: the percentage of variance explained in the direction of a certain vector v is given by v^T Σ̂ v / tr(Σ̂), where tr(Σ̂) = λ_1 + ... + λ_p is the trace of the matrix. By symmetry, a random vector will explain on average a 1/p-fraction of the variance. Therefore, as long as the i-th PC explains a λ_i / tr(Σ̂) > 1/p fraction of the variance, it is reasonable to look into the signal that it carries.
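As a concrete illustration, the eigendecomposition of the sample covariance matrix, the users' projections on the PCs, and the 1/p-fraction rule can all be computed in a few lines of NumPy. This is a sketch on synthetic data, not the code used in the thesis:

```python
import numpy as np

def pca_user_labels(X):
    """PCA on an n-by-p user-feature matrix X.

    Returns the PCs (columns of V, sorted by decreasing eigenvalue), the
    explained-variance fractions lambda_i / tr(Sigma_hat), and the users'
    projections, i.e. how much each user is "of type PC_i"."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = (Xc.T @ Xc) / Xc.shape[0]          # sample covariance (1/n) X^T X
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort: largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    frac = eigvals / eigvals.sum()           # fraction of variance per PC
    scores = Xc @ eigvecs                    # projection of every user on every PC
    return eigvecs, frac, scores

# Toy data: 200 "users", 6 features, two of which dominate the variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) @ np.diag([5.0, 3.0, 1.0, 1.0, 1.0, 1.0])
V, frac, scores = pca_user_labels(X)
r = int(np.sum(frac > 1.0 / X.shape[1]))     # the 1/p-fraction rule for choosing r
```

On this toy data only the two dominant directions beat the 1/p baseline, so the rule selects r = 2.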
To conclude, the relevant subset of PCs consists of the top $r$ PCs (sorted by decreasing eigenvalue), where $r$ can be determined according to the $1/p$-fraction threshold described above.

2.1. Sparse PCA and the semantic dimension

The framework presented above is useful only to the extent that the PCs can be interpreted in a meaningful way. In practice the PCs are often linear combinations of all (or most of) the original features, which hinders interpretability. A common way to deal with this phenomenon is to zero out PC entries whose absolute value is small. We suggest an alternative approach: sparse PCA. Rather than finding the eigenvectors of $\hat\Sigma$, we look for the unit vector $v$ with at most $k$ non-zero entries ($k$ is a parameter of the problem) such that $v^T \hat\Sigma v$ is maximal; $v$ is called the leading $k$-sparse eigenvector (although it is not necessarily an eigenvector of $\hat\Sigma$). The remaining $k$-sparse eigenvectors are computed similarly. In an unsupervised setting there is no standard way to measure the fit of a model, and in particular no recipe for choosing the minimal $k$, denoted $k_{\min}$, that still enables meaningful classification (what we refer to as the semantic dimension of the problem). We define a new score, which we call the integrity score, and use it to perform model selection. Specifically, we generate a progression of $k$-sparse PCA problems and use Equation (1) to measure the integrity of every solution. To choose $k_{\min}$ we can generate a plot similar to a scree plot, or choose the minimal $k$ whose integrity score is above some threshold.
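For small $p$, the leading $k$-sparse eigenvector can be found by searching over all size-$k$ supports: for a fixed support $S$, the best unit vector supported on $S$ is the leading eigenvector of the submatrix of $\hat\Sigma$ restricted to $S$. A minimal NumPy sketch of this exhaustive search (with an illustrative toy covariance matrix; the thesis code itself is written in C#):

```python
import itertools
import numpy as np

def sparse_pc(cov, k):
    """Leading k-sparse principal component of a p x p covariance matrix,
    found by exhaustive search over all size-k support sets (feasible for
    small p, such as the 12-feature setting described above)."""
    p = cov.shape[0]
    best_val, best_vec = -np.inf, None
    for support in itertools.combinations(range(p), k):
        idx = np.array(support)
        sub = cov[np.ix_(idx, idx)]        # restrict Sigma to the support
        vals, vecs = np.linalg.eigh(sub)   # eigenpairs of the submatrix
        if vals[-1] > best_val:            # keep the best leading eigenvalue
            best_val = vals[-1]
            v = np.zeros(p)
            v[idx] = vecs[:, -1]
            best_vec = v
    return best_val, best_vec              # v^T Sigma v and the k-sparse unit vector

# toy check: a covariance dominated by two correlated features
Sigma = np.diag([1.0, 1.0, 0.1, 0.1])
Sigma[0, 1] = Sigma[1, 0] = 0.9
val, vec = sparse_pc(Sigma, 2)
print(np.count_nonzero(vec))   # 2 non-zero loadings
print(round(val, 2))           # 1.9: variance explained on the best support
```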
2.1.1. The Integrity Score

Given a covariance matrix $\hat\Sigma \in \mathbb{R}^{p \times p}$ and a natural number $k \le p$, the integrity score, denoted $\iota(k)$, is a number in the range $[0, 1]$ computed as a function of the non-restricted and the $k$-sparse PCA solutions over $\hat\Sigma$. It represents a measure of the "semantic distance" between the two solutions. To compute the integrity score we first assume that $r$, the number of relevant PCs, has already been determined.

Before we present the formula for $\iota(k)$, let us discuss the motivation behind it. The main question is how to quantify the extent to which the semantics of a PC $v_i$ is retained in the $k$-sparse solution. We view the semantics of the label represented by $v_i$ as composed of its loadings (i.e., the values of the entries of $v_i$) and the amount of variance it explains, $\lambda_i$. Therefore our score accounts for the similarity between the $v_i$'s and the $k$-sparse eigenvectors $v_i^{(k)}$, and for the difference in explained variance $|\lambda_i - \lambda_i^{(k)}|$, where $\lambda_i^{(k)} = v_i^{(k)T} \hat\Sigma v_i^{(k)}$. Formally, the score is given by the following formula,

$$\iota(k) = \frac{1}{2r} \sum_{i=1}^{r} \left[ \left(1 - \frac{|\lambda_i - \lambda_i^{(k)}|}{\max\{\lambda_i, \lambda_i^{(k)}\}}\right) + \left|\langle v_i, v_i^{(k)} \rangle\right| \right]. \quad (1)$$

Let us conclude with two remarks:
• Observe that the score of the non-restricted PCA is always 1, i.e., $\iota(p) = 1$. The closer the score is to 1, the closer the $k$-sparse solution is to the non-restricted PCA solution.
• It may be the case that the semantic labels of the $k$-sparse solution are a permutation of the non-restricted labels. Therefore, if needed, an appropriate permutation should be applied before computing $\iota(k)$. In our case no permutation
was needed (see Section 4).

2.1.2. A Note on Computational Efficiency

While (non-restricted) PCA can be solved efficiently by computing the eigenvectors of a symmetric matrix, sparse PCA is a difficult combinatorial problem, and in fact NP-hard [29, 30]. Nevertheless, when the dimension $p$ is small (12 in our case), sparse PCA can be solved exactly in a matter of seconds by a naive exhaustive-search approach. When $p$ is large, or even grows with $n$, solving sparse PCA exactly is not a computationally feasible task. There are various heuristic approaches for obtaining an approximate solution [10, 29, 37, 46, 47], but in such cases consistency issues may arise [20]. This gives yet another motivation to understand what can be achieved for the classification problem with as few statistics as possible, thus keeping the computational effort of solving sparse PCA in check.

2.2. Semantic robustness

Suppose that we obtained satisfactory classification results from the PCs. A natural question to ask is how semantically robust the classification is. For example, suppose we were to take a subnetwork of users in a certain town, or students in a certain school, and run the same classification methodology on that subnetwork. Would the subnetwork labels retain the semantic meaning of the global labels? Since the setting is unsupervised, there are no pre-defined labels and no reference point to compare against. We suggest a way to use a variant of $\iota(k)$ to measure the robustness of the classification. The robustness score, together with $\iota(k)$, provides a
justifiable way to perform model selection (i.e., choose the parameter $k$ with the highest robustness/integrity score).

The robustness score, $\rho(k)$, is a function of two covariance matrices. The first covariance matrix is obtained from a standard crawl of the social network (see Section 4, where we describe exactly how the crawl is performed). The second covariance matrix is obtained from a second crawl, which we call the truncated crawl. In that crawl we ignore users that have a large projection on any of the top $r$ PCs of the first crawl. We compute the top $r$ $k$-sparse PCs of the first covariance matrix, $v_1^{(k)}, v_2^{(k)}, \ldots, v_r^{(k)}$, and the top $r$ $k$-sparse PCs of the truncated covariance matrix, $v_1^{(k,\mathrm{trunc})}, \ldots, v_r^{(k,\mathrm{trunc})}$. The robustness score is given by

$$\rho(k) = \frac{1}{r} \sum_{i=1}^{r} \left|\langle v_i^{(k)}, v_i^{(k,\mathrm{trunc})} \rangle\right|. \quad (2)$$

Let us remark that:
• The number $\rho(k)$ is in the range $[0, 1]$, and the closer $\rho$ is to 1 the closer the two solutions are; in that case we may conclude that the labels represented by the PCs are semantically more robust.
• As mentioned in the previous section, if needed an appropriate permutation should be applied to the PCs before computing $\rho(k)$. In our case such a permutation was indeed needed (see Section 4).
• Unlike in the integrity score, we do not use the eigenvalues in the computation, since we compare two different crawls resulting in two different matrices, and eigenvalues may fluctuate without any semantic significance implied.
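To make Equations (1) and (2) concrete, here is a minimal NumPy sketch of both scores (the toy vectors are illustrative; any required permutation is assumed to have been applied before the call, as noted above):

```python
import numpy as np

def integrity(eigvals, eigvecs, sp_vals, sp_vecs):
    """Integrity score iota(k), Equation (1): compares the top r
    non-restricted eigenpairs with their k-sparse counterparts."""
    r = len(eigvals)
    total = 0.0
    for lam, v, lam_k, v_k in zip(eigvals, eigvecs, sp_vals, sp_vecs):
        total += 1 - abs(lam - lam_k) / max(lam, lam_k)  # explained-variance term
        total += abs(np.dot(v, v_k))                     # loading-similarity term
    return total / (2 * r)

def robustness(pcs, pcs_trunc):
    """Robustness score rho(k), Equation (2): mean absolute cosine similarity
    between matched k-sparse PCs of the two crawls (eigenvalues are ignored)."""
    return float(np.mean([abs(np.dot(u, w)) for u, w in zip(pcs, pcs_trunc)]))

# sanity checks on toy unit vectors
v1, v2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
i_score = integrity([3.0, 2.0], [v1, v2], [3.0, 2.0], [v1, v2])
r_score = robustness([v1, v2], [np.array([0.8, 0.6]), np.array([-0.6, 0.8])])
print(i_score)  # 1.0: identical solutions give the maximal score
print(r_score)  # 0.8: mean of the two matched |cosine| similarities
```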
2.3. Connection to Anomaly Detection

Our methodology of classifying users via PCA shares similarities with the framework of network anomaly detection, which we briefly describe for completeness. For a given natural number $r \le p$, define $N_r$ to be the subspace spanned by the first $r$ PCs $\{v_1, \ldots, v_r\}$. The basic underlying assumption of traffic anomaly detection is that $N_r$, the "normal space", corresponds to the primary regular trends (e.g., daily, weekly) of the traffic matrix. The detection procedure relies on decomposing a traffic sample $x$ into normal and abnormal components, $x = x_n + x_a$, such that $x_n$ is the projection of $x$ onto $N_r$ and $x_a$ is the projection onto the perpendicular "abnormal" subspace. The measurement $x$ is declared an anomaly if the 2-norm $\|x_a\|_2$ exceeds a certain threshold. For more details we refer the reader to [21, 22, 23, 24].

In our work we were able to classify Twitter users into normal users vs. spam/robots in a very similar fashion to the anomaly detection framework. In [40] this framework was implemented for finding anomalous users in Facebook, Yelp and Twitter. Their key observation is that in all three online social networks, normal user behavior is low-dimensional (i.e., few PCs capture most of the variance). They verified that attacker behavior indeed appears anomalous relative to normal user behavior (as captured by the top PCs). The validation was done against a large labeled data set of anomalous users of various kinds.

2.4. Crawling Twitter

In this section we describe in detail the software that we designed to generate a sample of the Twitter social network. As already mentioned, we attempt to use a feature set which is as generic as possible in order to make the method easily
transferable to other social networks. Hence, for example, we ignore the content of the tweets, since not all social networks are text-oriented (e.g., Instagram and Snapchat).

Our feature set. We decided upon 12 quantitative features that capture various aspects of a user's activity in a social network. Table 1 summarizes the 12 features. Features $x_1, x_3 - x_7$ reflect the volume of activity of a specific user; features $x_2, x_8 - x_{12}$ capture the interaction between users.

Var   Attribute            Description
x1    NumOfTweets          Total number of tweets
x2    NumOfFollowers       Total number of users following me
x3    NumOfFollowing       Total number of users I follow
x4    LikesGivenToOthers   Number of tweets that I like
x5    NumOfTxt             Total number of tweets that contain only text *
x6    NumOfUrl             Total number of tweets that contain URLs *
x7    NumOfMyRT            Number of other users' tweets that I re-tweet *
x8    NumOfOthRT           Total number of retweets of my tweets by others *
x9    TweetsPerDay         Total number of tweets divided by lifetime (in days)
x10   NumOfUserMent        Number of other users mentioned in the tweets *
x11   NumOfHashTag         Number of hashtags (#) referenced in the tweets *
x12   LikesGivenToMe       Number of likes that my tweets received

Table 1: Feature details. * The measure is computed over the recent 150 tweets.

Twitter's API. We used the official Twitter API to crawl the Twitter network. Via the API we send a data request and receive the Twitter server's response. To access the API we subscribed to the Twitter developers website and received a special account along with a unique access token to establish a connection between local computers and Twitter servers. Access to Twitter servers is limited to once every 15 minutes [49], and in every request we are allowed to query the API about 180 Twitter accounts (per developer account) [50]. The extraction of 180 Twitter accounts took three minutes on average.

Our crawler is written in C# and uses an open-source C# class called Tweetinvi [51]
which provides a set of basic methods to communicate with the Twitter server. The C# platform was chosen after confirming that the code can easily be adapted to work with the popular APIs of other social networks. The software contains three classes that take care of the user data and the crawling procedure; Figure 2 describes the UML structure.

• The User class holds the data of a single Twitter account: all the basic, general information about the user that does not require any special computation (for example, number of followers, number of tweets).
• The UserTweets class inherits from User, and its purpose is to compute the tweet-related feature set for each specific user. This class holds a function that takes the last 300 tweets of each user and calculates the tweet-related features described in Table 1.
• The DataExport class is responsible for gathering all the User data and organizing it in a table data type. It enables exporting the data into Excel format (csv). Validation scripts are run before exporting the data in order to remove duplicates and incomplete data. Another important role of the DataExport class is to back up the data during the crawling step, every time a 15-minute collection interval ends.

The crawling algorithm. The crawling method that we use is BFS. We use a queue to hold Twitter user IDs. The queue is initialized with a manually selected user. Before the crawling starts, we activate seven different Twitter developer accounts by sending the Twitter server a request containing the special user token.
Once the developer account is approved, the crawl proceeds in a BFS manner: pop the first user from the queue, retrieve all the accounts that this user follows, and push them to the end of the queue. A detailed view of the algorithm is given in Figure 3.

3. Related Work

There are two main quantitative approaches to studying online social networks. The first is what one may call the classical approach, initially proposed in [3] and followed by a large body of work. In this approach the network is represented as a graph, where nodes denote objects (people, webpages) and edges between the objects denote interactions (friendships, physical interactions, links). The graph is then analyzed through the lens of graph theory, examining phenomena such as the power-law degree distribution and the small-world phenomenon. Papers relevant to our theme deal with the problem of community detection, in which network nodes are joined together in tightly knit groups, between which there are only looser connections. These communities often correspond to groups of nodes that share a common property, role or function [13, 14]; in particular, community structure may be used to detect anomalies such as spam reviewers [42, 43]. Another strand of relevant work deals with phenomena such as core/periphery structure, which captures the notion that many networks decompose into a densely connected core and a sparsely connected periphery [5, 17].

In this master's thesis we are interested in the second, statistical machine-learning approach, which we believe enables more nuanced classification results. At each node (user) various statistics are collected (that are relevant to the classification task),
then one of the many machine-learning algorithms is applied to the data to compute a prediction (classification) model. In this line of work, the overwhelming majority of studies use supervised learning algorithms to derive the model. In addition, as algorithms for natural language processing (NLP) and automated sentiment analysis of text matured, many of the features used for user classification came to be based on these tools. The body of work is very large, and here we give a random sample. Researchers investigated the detection of gender from traditional text [16] or movie reviews [31], the blogger's age [6], and user biographic data from search queries [19, 45]. As online social networks developed, researchers tried to predict the user's ethnicity and political affiliation [32], gender, age and regional origin [36], occupational class [33], income [34], and demographics [9]. Another central task is the detection of anomalous and fraudulent user behavior, and specifically of spam activity. We survey related work in that direction when we discuss our spam-detection results obtained using PC2 in Section 4.2.

On the other hand, very few researchers have attempted user classification using an unsupervised approach (although in some sense the graph-oriented approach presented above is unsupervised). Even fewer tried classification without analyzing the content that the user publishes on his profile (in fact we know of only two such works, [7] and [40]). This is of course understandable: why not use all the statistical and computational power at hand to solve the problem with better accuracy? Indeed, [36] report that relinquishing text-based features and relying only on what we called simple statistics made the prediction of latent properties such as age and gender in Twitter practically impossible.

As we mentioned before, the purpose of the present work is not to show that one
can perform the aforementioned classification tasks with unsupervised learning and simple statistics. This is probably false. Our purpose is to understand what can be said about user classification in online social networks in the unsupervised, feature-restricted setting. Having said that, one of the results we obtained was a perceptron for spam detection which performed very well on test data. The problem of spam detection is typically solved with supervised learning and relies heavily on content analysis; in this case we were able to suggest an unsupervised, feature-restricted alternative.

The two works most relevant to ours are a study conducted on the YouTube network, where PCA was used to classify users [7], and [40]. Similarly to our findings, a meaningful classification was obtained using the top PCs and only simple statistics. In [7] the top four PCs carry semantic signals similar to our findings: measures of popularity and activity (but notably no spam label). In [40] the target is detecting anomalous users in online social networks, and the methodology is different from ours: the working hypothesis is that the top PCs span the plane of normal user behavior, and anomalous activity is identified as having a small projection on the normal plane and a large projection on the orthogonal ones. The set of features used in [40] is also different, and includes for example counts of the likes that a user gives on Facebook according to various page categories (sports, politics, etc.). The methodology was applied to Facebook, Twitter and Flickr with findings similar to ours: user behavior is captured by a small number of PCs (three to five, depending on the network). The authors of [40] validated the efficiency of their predictor on a set of labeled data with good accuracy. Unlike [40], we are not tuned to a specific task, but rather to uncovering the latent labels in the network, come what may.
Our work extends [7] by introducing a fully-fledged methodology for performing the task, including quantitative measures to evaluate the goodness of the classification, and by introducing the use of sparse PCA. In addition, our work reinforces the possibility of using unsupervised learning methods to find a meaningful classification in online social networks by applying our methodology to Twitter.

4. Twitter - Case study

In this section we present the results of applying our method to the Twitter social network. Recall that our goal is to identify various user types as captured by the principal components of the covariance matrix, and to compute a score for the classification using Equations (1) and (2). This task is not a priori certain to succeed, since there is no guarantee that the PCs can be interpreted in a meaningful way.

To collect the Twitter data we implemented a crawler that crawled the social network graph following a snowball approach, exploiting the public API provided by Twitter. This approach is commonly used in the literature [39]. Crawling starts from a list of randomly selected users and proceeds in a BFS manner: at each step the crawler pops a user $v$ from the queue and explores its outgoing links (there is a link from $v$ to $w$ if $v$ follows $w$) by adding them to the end of the queue. The crawling rate was about 25,000 users per day (there are limitations posed by the Twitter API), and we collected a total of 284,758 active Twitter accounts. For each account we collected the set of features described in Table 1.

The attributes in Table 1 represent two types of information: data about the user's activity in the social plane (followers, following, re-tweets) and data about the user's activity in the content plane (tweets, text vs. URLs, etc.). Twitter's social network
graph is a directed one, unlike Facebook's, for example. Since the features have different scales, we normalized the data set to unit variance, as is common in such cases (see for example [24]). Using the 284,758 Twitter profiles we created a $12 \times 12$ covariance matrix $\hat\Sigma$.

4.1. Analyzing the Sparse PCA progression

To compute the integrity measure of the progression we first need to fix $r$, the number of PCs that we consider. To this end we compute the leading eigenvectors of the $12 \times 12$ correlation matrix $\hat\Sigma$ and sort them in decreasing order of the corresponding eigenvalues. The eigenvalue $\lambda_i$ of the $i$th eigenvector is proportional to the percentage of variance it explains (the variance it explains equals $\lambda_i / \sum_j \lambda_j$). These ordered values often show a clear "elbow" that separates the most important dimensions (characterized by a higher percentage of variance) from the less important ones. The first three PCs account for 18.15%, 16.22% and 13.06% of the total variance, totalling about 50% of the variance (see Figure 12). A random vector would explain on average $1/12 = 8.33\%$ of the variance, so it is reasonable to assume that the top three PCs carry real signal rather than noise. Therefore we set $r = 3$.

We solved a progression of $k$-sparse PCA problems for $k = 2, 3, 4, 5$ and $k = 12$ (i.e., non-restricted PCA). Figures 4 and 9 show how the top two PCs evolve as we increase $k$.

Zooming in on PC1. At $k = 2$ the two non-zero features are the likes given to the user and the number of retweets of his messages. This is clearly a measure of popularity in the social network. At $k = 3$ the number of followers joins in; however,
for $k > 3$ no additional significant features appear. At $k = 12$ (non-restricted PCA) the two most dominant features (largest entries in absolute value) are indeed the likes given to the user and the number of retweets; see Figure 7. Looking at our sample, the top users in the direction of PC1 are indeed teen pop idols such as Justin Bieber and Zayn Malik.

Zooming in on PC2. Let us now zoom in on the second PC and follow its evolution in Figure 9. For $k = 2$ we see two features with opposite signs: the number of messages containing only text, and the number of messages that contain a URL. At $k = 4$ two additional features of opposite signs join in: the number of tweets that I re-tweet and the number of tweets that contain hashtags. The aggregated attributes relate to the type of activity of the user on Twitter. The negative values (text and retweet) are characteristic of human Twitter accounts, while the positive ones are more typical of robot and spam accounts. Indeed, the main ways of spamming on Twitter are hashtags (how? simply include a trending hashtag in your tweet, and anyone who clicks the trending topic will see your ad, for free) and URLs, which appear in shortened form on Twitter and make it impossible to know where the URL leads. We were able to use PC2 as a linear model for spam detection (see Section 4.2). The benefit of using sparse PCA over standard PCA is demonstrated graphically in the factor map of PC2 vs. PC3, Figures 10 and 11: the separation of features is much more evident in the sparse case.

Deciding the value of $k_{\min}$. To answer this question we compute the integrity score $\iota(k)$ for various values of $k$ according to Equation (1). The following table summarizes the results.
        k = 2   k = 3   k = 4   k = 5   k = 12
ι(k)    0.8     0.84    0.9     0.92    1

To conclude, already for $k = 2$ we get a good integrity score, and we may set $k_{\min} = 2$. The choice $k_{\min} = 2$ is also reinforced by Figure 12, which displays similar scree-plot lines for all $k$ values, and by the fact that our spam-detection perceptron (which uses PC2) already works very well for $k = 2$ (see the table in the next section).

Computing the robustness score. We performed a truncated crawl, collecting 74,320 Twitter accounts. We ignored users whose projection on any of the PCs was more than 1 and obtained the matrix $\hat\Sigma^{\mathrm{trunc}}$. As a case study, let us compare the robustness of $k = 12$ (non-restricted PCA) and $k = 4$. Figures 13 and 14 show the loading tables for the top three PCs of the truncated covariance matrix. For convenience we also provide the cosine similarity matrix of the PCs:

        PC1^trunc   PC2^trunc   PC3^trunc
PC1     0.94        0.17        0.12
PC2     0.06        0.04        0.01
PC3     0.01        0.2         0.41

It is evident from Figures 13 and 14 that the spam-potential label PC2 is no longer valid in the truncated crawl. The similarity matrix shows that PC2 is nearly orthogonal to the top three truncated PCs. This is reasonable, since local sub-networks typically do not contain spammers (at least not robots). Therefore a straightforward computation of $\rho(k)$ would not give the right conclusion. Rather, the correct similarity matrix to consider is the following, and its score is $\rho(12) = 0.675$.
        PC1^trunc   PC3^trunc
PC1     0.94        0.12
PC3     0.01        0.41

The robustness score of the 4-sparse classification is higher. Below is the similarity matrix for $k = 4$:

        PC1^trunc   PC2^trunc   PC3^trunc
PC1     0.97        0           0
PC2     0           0.09        0.06
PC3     0           0.48        0.21

In this case the best result is obtained for the matrix whose rows are PC1, PC3 and whose columns are PC1^trunc, PC2^trunc. The robustness score is $\rho(4) = 0.725$, which is larger than $\rho(12) = 0.675$. Therefore we may conclude that the 4-sparse classification is semantically more robust than the one obtained using non-restricted PCA. This conclusion is non-trivial and could not have been reached without the new measures that we introduced. It is a good substitute for regularization, whose penalty parameter cannot be chosen using cross-validation in an unsupervised setting.

Finally, let us examine the top users along PC1^trunc and PC2^trunc in the truncated crawl sample. The top users along PC1^trunc indeed include local celebrities such as Becky Whitesides, mother of the Knoxville teen pop idol Jacob Whitesides, and TheGoodGodAbove, the "Official Twitter account of God". The top users along PC2^trunc include content providers such as the blogger Chiara Ferragni (The Blonde Salad), or aeshir from Alberta, Canada, who provides satirical and comical content. To conclude, we see that the labels obtained in the first crawl are repeated truthfully in the truncated crawl.
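As a quick arithmetic check, both scores follow directly from Equation (2) by averaging the matched cosine similarities read off the similarity tables:

```python
# rho(12): rows PC1, PC3 matched with columns PC1^trunc, PC3^trunc
rho_12 = (0.94 + 0.41) / 2
# rho(4): rows PC1, PC3 matched with columns PC1^trunc, PC2^trunc
rho_4 = (0.97 + 0.48) / 2
print(round(rho_12, 3), round(rho_4, 3))  # 0.675 0.725
```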
4.2. Using PC2 for Spam Detection

The proliferation of social networking has contributed to an increase in spam activity [41]. Spammers send unsolicited messages to users with varying purposes, which include, but are not limited to, advertising, propagating pornography, phishing, and spreading viruses. URL attacks are aided by Twitter's 140-character limit on tweets, as many legitimate users need to use link-shortening services to reduce the length of their URLs. The ability to disguise URL destinations has made Twitter a particularly attractive target for spammers, which has motivated the development of several spam-detection techniques. Spammers may be of various types, for example software robots, fake accounts, or hijacked legitimate accounts.

There are many approaches to detecting compromised or fake user accounts. Here we focus on works that use machine learning (in the Related Work section we mentioned a graph-based approach that uses traditional graph parameters to perform the task). The common machine-learning approach to spam detection uses pre-labeled data in a supervised learning framework [1, 4, 8, 12, 25, 44]. All these works achieve precision rates of around 90%. Miller et al. [11] treat the identification of spammers as an anomaly-detection problem rather than classification, where outliers are flagged as spammers. They utilize a combination of user metrics and one-gram text features. Their approach achieves an F1 score of 82%, with high accuracy but low precision.

Our approach is also inspired by the PCA-based anomaly-detection paradigm [21, 22, 23, 24]. Indeed, [40] implemented this framework and tested it on several online social networks. The set of features used in [40] is different from ours and includes for example counts of the likes that a user gives on Facebook according to various page categories (sports, politics, etc.). The methodology was applied to Facebook, Twitter
and Flickr with findings similar to ours: user behavior is captured by a small number of PCs (three to five, depending on the network). The authors of [40] validated the efficiency of their predictor on a set of labeled data with good accuracy. Another work which treats spam detection as an anomaly-detection problem is [26]; rather than using PCA, a clustering model is obtained and outliers are classified as spammers. Our approach is also unsupervised and uses PCA, but it is conceptually much simpler than [40] and [26], and furthermore does not use any textual features (which the latter two use).

In the aftermath of applying PCA, we observed that one of the PCs is a candidate for a spam-detection model. Our model is a perceptron: given a Twitter account $x_0 \in \mathbb{R}^{12}$, we compute its projection on PC2. We tested our classifier on a set of 164 accounts, 64 spam and 95 legitimate, taken from [15]. Figure 17 shows the ROC plot; the obtained AUC is 0.98. The table below describes the classification results with threshold 1, i.e., the user is classified as SPAM if $\langle x_0, \mathrm{PC2} \rangle \le 1$. The rows of the table give the results for various choices of $k$, namely using PC2 from the various $k$-sparse PCA solutions.

        Precision   Recall   F1 score
k = 2   0.98        0.88     0.93
k = 3   1           0.87     0.93
k = 4   1           0.88     0.94
k = 5   0.95        0.97     0.96
k = 12  1           0.94     0.97

Although the measurements improve as $k$ increases, the improvement is not significant, and already for $k = 2$ we obtain satisfactory results. We may safely conclude that two features suffice to separate spam from legitimate Twitter accounts: the number of text-only tweets and the number of tweets containing a URL.
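The perceptron itself is a single dot product against a threshold. The sketch below illustrates it; the 2-sparse PC2 loadings and the two account vectors are hypothetical (chosen so that text-heavy accounts project positively and URL-heavy accounts negatively, matching the sign convention of the threshold rule above), not the learned component:

```python
import numpy as np

def spam_perceptron(x, pc2, threshold=1.0):
    """Flag x as spam when its projection on PC2 is at most the threshold,
    i.e. <x, PC2> <= 1, as described above."""
    return float(np.dot(x, pc2)) <= threshold

# hypothetical 2-sparse PC2 over the 12 features of Table 1:
# positive loading on NumOfTxt (x5), negative on NumOfUrl (x6)
pc2 = np.zeros(12)
pc2[4], pc2[5] = 0.71, -0.71

human = np.zeros(12); human[4] = 5.0   # mostly text-only tweets (standardized units)
bot = np.zeros(12);   bot[5] = 5.0     # mostly URL tweets
print(spam_perceptron(bot, pc2), spam_perceptron(human, pc2))  # True False
```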
4.3. Using PC3 for content providers

Examining the prominent features of the third PC, we notice that PC3 measures the extent to which a user is a content provider, i.e., a user that contributes a lot of information and keeps things interesting and up to date in his social network account. Analyzing Figures 4 and 16, we see that the prominent features of PC3 indicate a very active social network user: features such as the number of tweets, tweets per day, and the number of URLs. It is interesting to contrast PC3 with PC2, which indicates the extent of non-legitimate activity: PC3 also contains features like the number of user mentions and the number of hashtags, which suggest a benign interaction with other users. In Figures 9 and 11, when running the sparse PCA model on the first data set, another interesting feature enters the picture: likes given to others. Together with three additional features (number of tweets, number of user mentions, and tweets per day), it gives further verification that the third PC can be used as a classifier for users who interact with the social network and contribute a lot of new content. For example, the top-10 users table (Table 2) includes the news provider littlebytesnews, the video-gaming support accounts XboxSupport and PSP, and the American teen content provider ChelseaAMusic.
Twitter User Name   Loading
littlebytesnews     44.6
RedScareBot         28.17
PSP                 27.07
XboxSupport         26.42
icarusaccount       20.64
favstar1000es       17.82
stevendickinson     17.66
TheArabHash         16.72
ComercioCenter      16.54
ChelseaAMusic       16.41

Table 2: PC3 top-10 users

Zooming into the data set. We now show how one can use the three labels obtained from PC1, PC2 and PC3 to classify Twitter accounts from the crawler's database. This gives us another opportunity to validate the semantic soundness of the claimed labels and to show the practical usefulness of our method. For every Twitter account we compute its coordinates in the new semantic system spanned by the top three PCs. To recall, PC1 indicates the level of popularity in the social network, PC2 indicates the level of legitimacy of the user inside the network, and PC3 indicates the level of activity and content contribution of the user to the social network. Our labels may also provide a vantage point for marketers: using PC1 and PC3, a marketer can identify the "local stars" of a particular local subnetwork and use them for promoting new products, targeted commercials, targeted local information, etc. As we verified in Section 2.2, the obtained labels are semantically robust and retain their meaning also in subnetworks. In Table 18 we present a set of users selected according to the following criteria: users with a mid-range projection on PC1 (values between 0 and -0.01) and a large projection on PC2, so that they do not count as spam accounts (a user counts as a spammer if $\langle x_0, \mathrm{PC2} \rangle \le -1.25$). The results clearly show that
the users we chose may indeed be labeled as local stars of the social network, such as comedians, local singers, bloggers, lifestyle advisors and the like.

4.4. Other Methods of Unsupervised Learning

It is natural to ask what classification results may be obtained using other unsupervised learning algorithms. The first and most natural candidate is the k-means algorithm, in which the data is partitioned into k clusters such that the within-cluster distance is much smaller than the between-cluster distance (when possible, of course). The average silhouette coefficient is a number in [-1, 1] and is a standard measure of how well the points are clustered (the closer the score is to one, the better the clustering). We ran k-means on a sub-sample of 50,000 users (we could not run on the entire set of 250K users due to performance issues) with k = 3, 4, 5. All the executions resulted in poor clustering, with an average silhouette coefficient of 0.16-0.17.

We also ran a soft clustering algorithm known as fuzzy k-means. In this algorithm each point x is assigned a k-dimensional vector $v_x$ whose entries are interpreted as weights; the quantity $v_x[i] / \sum_j v_x[j]$ measures the extent to which x belongs to cluster i. The closer the entries are to 1/k, the fuzzier the clustering. We ran R's fuzzy k-means algorithm (FANNY) on the 50,000-user data set for k = 3, 4. Dunn's normalized coefficient (which measures the average fuzziness of the clustering) was 0.53 and 0.502 for k = 3 and k = 4, respectively. Rounding the fuzzy clustering to a crisp clustering gave one cluster with an average silhouette of about 0.9 and the remaining clusters with negative silhouettes (for both k = 3 and k = 4). Namely, rounding the fuzzy solution gave bad clusters.

We ran PCA on the same sub-sample, and the same semantic map obtained on the
entire data set was reconstructed. Indeed, the poor results of the k-means algorithm are to be expected, as the scatter plots in Figure 21 suggest: rather than a clustered landscape we see a heavy-tailed distribution. In other words, the vast majority of the users are neither celebrities, nor spammers, nor content providers. They all fall into one large cluster, and running k-means with k > 1 is an artificial attempt to break this cluster apart. More generally, the clustering offered by k-means is too coarse for our purpose, while PCA offers what we called "soft clustering": instead of assigning each user to a single cluster, we measure how strongly a certain label (which represents some complex quality) is expressed in that user. Therefore a user may belong to many "clusters" at the same time, with different significance levels. PCA is a member of a larger family of algorithms performing various transformations of the original set of features. Two other algorithms that we ran on our data are kernel PCA and ICA (Independent Component Analysis). We ran kernel PCA with two different kernels: a Gaussian kernel and a polynomial one. The Gaussian kernel wiped out all the variance in the dataset and essentially projected all points to a single point. The polynomial kernel, on the other hand, focused the semantics on the measure of fame: the top users in the first PC were users with a very high number of followers (tens of millions, e.g., Barack Obama), and in the second PC users with a very large number of tweets. We could not interpret the third PC. This is also consistent with the corresponding eigenvalues: 0.35, 0.23, and 0.09. In addition, the spam measure disappeared. To conclude, we found that using various kernels may enhance and focus the semantics on certain user qualities, and we leave the further understanding of how different kernels enhance different semantic aspects as a question for future research.
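The experiments in this subsection can be reproduced in outline as follows. This is only a sketch: it assumes scikit-learn (we used R's FANNY for the fuzzy clustering), and the synthetic Gaussian matrix below merely stands in for the standardized 50,000-user feature matrix; the kernel parameters are illustrative, not the ones used in our runs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import KernelPCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the (standardized) user-feature matrix.
X = rng.standard_normal((800, 6))

# k-means for k = 3, 4, 5; the average silhouette (a number in [-1, 1])
# measures cluster quality. On the Twitter data it stayed around 0.16-0.17.
for k in (3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"k={k}: average silhouette = {score:.3f}")

# Kernel PCA with a Gaussian (RBF) and a polynomial kernel. The kernel
# choice (and its parameters, e.g. gamma or degree) governs which
# qualities of the data the top components emphasize.
rbf_emb = KernelPCA(n_components=3, kernel="rbf").fit_transform(X)
poly_emb = KernelPCA(n_components=3, kernel="poly", degree=2).fit_transform(X)
```

On pure Gaussian data the silhouette scores come out low, mirroring the heavy-tailed, single-cluster landscape we observed on the real data.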
Finally, we applied ICA (Independent Component Analysis) to our dataset. The ICA algorithm seeks a transformation of the original feature set that maximizes the mutual information with the original set while minimizing statistical dependence. Typically ICA and PCA are combined, where first PCA is applied to "whiten" the data (remove linear correlations) and then ICA is applied to the PCs. More formally, let F = (x_1, x_2, ..., x_p) be the original feature set, and let PC_1, ..., PC_r be the top r PCs. Define the new feature set V = (v_1, ..., v_r) via v_i = ⟨PC_i, F⟩. The top s ICs Y = (y_1, ..., y_s) are given by Y = W V^T, where W ∈ R^{s×r} is the output of the ICA optimization. In our case we chose s = r = 3, and the matrix W we obtained is

W = [  0.97   0.25     0.04
      -0.25   0.97     0.01
       0.04   0.00001  0.999 ].

The matrix is rather close to the identity matrix, which means that the PCs are not only uncorrelated but also close to being statistically independent. The results of ICA, combined with the factor maps in Figures 15 and 16, suggest the following interesting insight: the signals that the principal components carry are approximately both statistically and semantically independent. In other words, in a quantifiable sense, one can argue that the PCA labelling is succinct. This notion of succinctness combines naturally with the notion of a semantic dimension, and together they provide another aspect of the obtained labelling.
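The PCA-then-ICA pipeline can be sketched as follows, again assuming scikit-learn and synthetic data in place of the real feature matrix. The recovered matrix `W` plays the role of the unmixing matrix in Y = W V^T; on the Twitter data it came out close to the identity, though on arbitrary data it need not.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
# Synthetic stand-in for the user-feature matrix.
X = rng.standard_normal((1000, 6))

# Step 1: PCA whitens the data (removes linear correlations) and keeps
# the top r = 3 components V.
V = PCA(n_components=3, whiten=True).fit_transform(X)

# Step 2: FastICA searches for an unmixing matrix W that makes the s = 3
# resulting components as statistically independent as possible.
ica = FastICA(n_components=3, random_state=0, max_iter=500)
Y = ica.fit_transform(V)
W = ica.components_  # the matrix W in Y = W V^T

print(np.round(W, 2))
```

Comparing `W` against the identity matrix quantifies how close the PCs already were to statistical independence.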
5. Discussion

Recent technological advances provide us with the ability to cheaply accumulate unlabeled data on a very large scale. Labeled data, on the other hand, may be costly and hard to obtain. Therefore we see great value in understanding what contribution unsupervised methods can make to tasks that are traditionally approached via supervised learning. This was our research question in this master thesis, and in particular we asked what is the minimal number of features that still enables meaningful classification. In this master thesis we introduced a new methodology that derives a soft classification using sparse PCA, alongside two new scores, integrity and robustness, that help with the problem of model selection. We applied our methodology to the Twitter social network and derived three labels: measure of fame (celebrity), spammer, and content provider. Using the integrity score we concluded that merely two features per label are sufficient, and using the robustness score we concluded that the sparse solution was more semantically robust than the unrestricted one (Section 4). The limitations of our technique are obvious: we learn the labels that are inherent (yet latent) in the data and do not train a model for a specific goal. For example, if the goal is to predict the age of a user or his political affiliation, this may not be possible from the labels found via PCA. The fact that we intentionally use no linguistic features and ignore the content that the user posts is also a limitation. On the positive side, we obtain a cheaply computable methodology that generalizes easily across online social networks, and may provide insights also for networks where textual content is not available (either intentionally, as in Snapchat, or because the main content is graphical, as in Instagram). An interesting question for future research is to use our methodology in order to
compare different online social networks. We presented several parameters that may serve as the basis for comparison: the semantic labels, the semantic dimension (k_min), and the semantic robustness. Another interesting parameter that we suggest computing is the semantic redundancy of the network. As we observed at the end of Section 4.4, the PCs that we obtained were both (approximately) statistically independent and semantically orthogonal (Figures 15 and 16). Therefore one may argue that the classification we obtained has little redundancy (i.e., it is succinct). Taking into account the extent to which the PCs are feature-wise orthogonal, the semantic dimension, and the extent to which the PCs are statistically independent (using, for example, ICA as a proxy for independence), one can concoct a measure of semantic redundancy. This measure can then be used to compare different online networks.
6. Figures

Figure 1: User UML diagram
Figure 2: User tweets UML diagram
Figure 3: Crawler algorithm
Figure 4: PC1 progression, fame measure
Figure 5: PC2 progression, spam detector
Figure 6: PC3 progression, content detector
Figure 7: Top PCs
Figure 8: Top 4-sparse PCs
Figure 9: PC1 vs. PC2 factor map
Figure 10: PC2 vs. PC3 SPCA factor map
Figure 11: PC2 vs. PC3 SPCA factor map
Figure 12: Scree plots for various k. Each color represents the scree plot for a k-sparse PCA solution; the x-axis is the PC number, the y-axis is the percentage of variance explained by that PC.
Figure 13: Top truncated PCs
Figure 14: Top truncated 4-sparse PCs
Figure 15: PC1 vs. PC2 SPCA factor map
Figure 16: PC1 vs. PC3 PCA factor map
Figure 17: Spam detection ROC plot (AUC = 0.98)
Figure 18: Combining the planes
Figure 19: PC1 vs. PC3 scatter plot
Figure 20: PC2 vs. PC3 scatter plot
Figure 21: PC1 vs. PC2 scatter plot

7. Bibliography

[1] A. Amleshwaram, N. Reddy, S. Yadav, and C. Yang. CATS: Characterizing automation of Twitter spammers. Technical report, Department of Electrical and Computer Engineering, Texas A&M University, 2013.
[2] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Mathematical Statistics. Wiley, 2nd edition, 1984.
[3] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. In proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 44–54, 2006.
[4] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting spammers on Twitter. In Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, Redmond, Washington, U.S., 2010.
[5] S. P. Borgatti and M. G. Everett. Models of core/periphery structures. Social Networks, 21(4):375–395, 2000.
[6] J. Burger and J. Henderson. An exploration of observable features related to blogger age. In Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium, pages 15–20, 2006.
[7] C. Canali, S. Casolari, and R. Lancellotti. A quantitative methodology to identify relevant users in social networks. In proc. of the IEEE International Workshop on Business Applications of Social Network Analysis (BASNA), pages 1–8, 2010.
[8] M. Chuah and M. McCord. Spam detection on Twitter using traditional classifiers. In Autonomic and Trusted Computing: 8th International Conference, Banff, Canada, pages 2–4, 2011.
[9] A. Culotta, N. R. Kumar, and J. Cutler. Predicting the demographics of Twitter users from website traffic data. In AAAI, pages 72–78, 2015.
[10] A. d'Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2004.
[11] Z. Miller, B. Dickinson, W. Deitrick, W. Hu, and A. Wang. Twitter spammer detection using data stream clustering. Information Sciences, 260:64–73, 2014.
[12] M. Fernandes, P. Patel, and T. Marwala. Automated detection of human users in Twitter. Procedia Computer Science, 53:224–231, 2015.
[13] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
[14] M. Girvan and M. Newman. Community structure in social and biological networks. PNAS, 99(12):7821–7826, 2002.
[15] A. Gulec and Y. Khan. Feature selection techniques for spam detection on Twitter. Technical report, Electronic Commerce Technologies (CSI 5389) Project Report, School of EECS, University of Ottawa, 2014.
[16] S. Herring and J. Paolillo. Gender and genre variation in weblogs. Journal of Sociolinguistics, 10(4):439–459, 2006.
[17] P. Holme. Core-periphery organization of complex networks. Phys. Rev. E, 72(4):046111, 2005.
[18] I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer, 2nd edition, 2002.
[19] R. Jones, B. Pang, R. Kumar, and A. Tomkins. I know what you did last summer: query logs and user privacy. In proc. of the 16th ACM Conference on Information and Knowledge Management, pages 909–914, 2007.
[20] R. Krauthgamer, B. Nadler, and D. Vilenchik. Do semidefinite relaxations solve sparse PCA up to the information limit? Annals of Statistics, 43(3):1300–1322, 2015.
[21] A. Lakhina, M. Crovella, and C. Diot. Characterization of network-wide anomalies in traffic flows. In proc. of the 4th ACM SIGCOMM Conference on Internet Measurement, pages 201–206, 2004.
[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. SIGCOMM Comput. Commun. Rev., 34(4):219–230, 2004.
[23] A. Lakhina, M. Crovella, and C. Diot. Mining anomalies using traffic feature distributions. SIGCOMM Comput. Commun. Rev., 35(4):217–228, 2005.
[24] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft. Structural analysis of network traffic flows. SIGMETRICS Perform. Eval. Rev., 32(1):61–72, 2004.
[25] C. Meda, F. Bisio, P. Gastaldo, and R. Zunino. A machine learning approach for Twitter spammers detection. In 2014 International Carnahan Conference on Security Technology (ICCST), pages 1–6, 2014.
[26] Z. Miller, B. Dickinson, W. Deitrick, W. Hu, and A. Wang. Twitter spammer detection using data stream clustering. Information Sciences, 260:64–73, 2014.
[27] R. J. Muirhead. Aspects of Multivariate Statistical Theory. Wiley, New York, 1982.
[28] A. Mukherjee, B. Liu, and N. Glance. Spotting fake reviewer groups in consumer reviews. In proc. of the 21st International Conference on World Wide Web, pages 191–200. ACM, 2012.
[29] B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for
sparse LDA. In proc. of the 23rd International Conference on Machine Learning, pages 641–648, 2006.
[30] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM J. Comput., 24(2):227–234, 1995.
[31] J. Otterbacher. Inferring gender of movie reviewers: Exploiting writing style, content and metadata. In proc. of the 19th ACM International Conference on Information and Knowledge Management, pages 369–378, 2010.
[32] M. Pennacchiotti and A. M. Popescu. A machine learning approach to Twitter user classification. In proc. of the 5th International Conference on Weblogs and Social Media, pages 281–288, 2011.
[33] D. Preoţiuc-Pietro, V. Lampos, and N. Aletras. An analysis of the user occupational class through Twitter content. In The Association for Computational Linguistics, 2015.
[34] D. Preoţiuc-Pietro, S. Volkova, V. Lampos, Y. Bachrach, and N. Aletras. Studying user income through language, behaviour and affect in social media. PLoS ONE, 10(9):e0138717, 2015.
[35] D. Ramage, S. Dumais, and D. Liebling. Characterizing microblogs with topic models. In proc. of the 4th International Conference on Weblogs and Social Media, 2010.
[36] D. Rao, D. Yarowsky, A. Shreevats, and M. Gupta. Classifying latent user attributes in Twitter. In proc. of the 2nd International Workshop on Search and Mining User-generated Contents, pages 37–44, 2010.
[37] N. Trendafilov, I. T. Jolliffe, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12:531–547, 2003.
[38] D. Vilenchik and B. Yichye. Twitter Crawler. https://github.com/barakyi/Twitter_crawler, 2016.
[39] B. Viswanath, A. Mislove, M. Cha, and K. Gummadi. On the evolution of user interaction in Facebook. In proc. of the 2nd ACM Workshop on Online Social Networks, pages 37–42, 2009.
[40] B. Viswanath, M. Bashir, M. Crovella, S. Guha, K. Gummadi, B. Krishnamurthy, and A. Mislove. Towards detecting anomalous user behavior in online social networks. In 23rd USENIX Security Symposium (USENIX Security 14), pages 223–238, 2014.
[41] A. Wang. Detecting spam bots in online social networking sites: a machine learning approach. In 24th Annual IFIP WG 11.3 Working Conference on Data and Applications Security, 2009.
[42] G. Wang, S. Xie, B. Liu, and P. Yu. Review graph based online store review spammer detection. In proc. of the 11th IEEE International Conference on Data Mining, pages 1242–1247, 2011.
[43] G. Wang, S. Xie, B. Liu, and P. Yu. Identify online store review spammers via social review graph. ACM Trans. Intell. Syst. Technol., 3(4):1–21, 2012.
[44] A. Wang. Don't follow me: Spam detection in Twitter. In Int'l Conference on Security and Cryptography (SECRYPT), pages 1–10, 2014.
[45] I. Weber and C. Castillo. The demographics of web search. In proc. of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 523–530, 2010.
[46] D. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
[47] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15:262–286, 2006.
[48] Z. Zhang, H. Zha, and H. Simon. Low-rank approximations with sparse factors I: Basic algorithms and error analysis.
[49] Twitter rate limiting - 15 minute time interval. https://dev.twitter.com/rest/public/rate-limiting
[50] Twitter rate limiting - 180 user requests. https://dev.twitter.com/rest/reference/get/users/show
[51] Tweetinvi - a C# library implementing the Twitter API. https://tweetinvi.codeplex.com