
  1. Making the Most of Tweet-Inherent Features for Social Spam Detection on Twitter Bo Wang, Arkaitz Zubiaga, Maria Liakata and Rob Procter Department of Computer Science University of Warwick 18th May 2015
  2. Social Spam on Twitter Motivation • Social spam is an important issue in social media services such as Twitter, e.g.: • Users inject tweets into trending topics. • Users reply with promotional messages that provide a link. • We want to be able to identify these spam tweets in a Twitter stream.
  3. Social Spam on Twitter Why Did We Feel the Need to Identify Spam? • We started tracking events via the streaming API. • The collected streams were often riddled with noisy tweets.
  4. Social Spam on Twitter Example
  5. Social Spam on Twitter Our Approach • Spammer detection: unsuitable, as we cannot aggregate a user's historical data from a stream. • Alternative solution: determine whether a tweet is spam from its inherent features.
  6. Social Spam on Twitter Definitions • Spam was originally coined for unsolicited email. • How to define spam for Twitter? (not easy!) • Twitter has its own definition of spam, which allows a certain level of advertising: • It refers to the user level rather than the tweet level, e.g., users who massively follow others. • It is harder to define a spam tweet than a spammer.
  7. Social Spam on Twitter Our Definition • Twitter spam: noisy content produced by users whose behaviour differs from what the system is intended for, with the goal of grabbing attention by exploiting the social media service's characteristics.
  8. Spammer vs. Spam Detection What Did Others Do? • Most previous work focused on spammer detection (users). • They used features which are not readily available in a tweet: • For example, historical user behaviour and network features. • Not feasible for our use.
  9. Spammer vs. Spam Detection What Do We Want To Do Instead? • (Near) Real-time spam detection, limited to features readily available in a stream of tweets. • Contributions: • Test on two existing datasets, adapted to our purposes. • Definition of different feature sets. • Compare different classification algorithms. • Investigate the use of different tweet-inherent features.
  10. Datasets • We relied on two (spammer vs non-spammer) datasets: • Social Honeypot (Lee et al., 2011 [1]): used social honeypots to attract spammers. • 1KS-10KN (Yang et al., 2011 [2]): harvested tweets containing certain malicious URLs. • From spammer dataset to spam dataset: randomly select one tweet from each spammer or legitimate user (see the sketch below). • Social Honeypot: 20,707 spam vs 19,249 non-spam (∼1:1). • 1KS-10KN: 1,000 spam vs 9,828 non-spam (∼1:10).
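A minimal sketch of this per-user sampling step, assuming tweets are already grouped by labelled user; the input structure and variable names are illustrative assumptions, not the authors' code:

```python
import random

def sample_one_tweet_per_user(tweets_by_user, label):
    """Randomly pick one tweet from each labelled user, as in the conversion above."""
    return [(random.choice(tweets), label)
            for tweets in tweets_by_user.values() if tweets]

# Toy stand-ins for the per-user tweet collections of either dataset.
spammer_tweets = {"u1": ["WIN a FREE prize!!! http://t.co/x", "Get followers fast http://t.co/y"]}
legit_tweets = {"u2": ["Heading to the #Microposts2015 workshop"]}

dataset = (sample_one_tweet_per_user(spammer_tweets, label=1)
           + sample_one_tweet_per_user(legit_tweets, label=0))
```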
  11. Feature Engineering
  User features:
  • Length of profile name
  • Length of profile description
  • Number of followings (FI)
  • Number of followers (FE)
  • Number of tweets posted
  • Age of the user account, in hours (AU)
  • Ratio of number of followings and followers (FE/FI)
  • Reputation of the user (FE/(FI + FE))
  • Following rate (FI/AU)
  • Number of tweets posted per day
  • Number of tweets posted per week
  Content features (a few of these are sketched below):
  • Number of words
  • Number of characters
  • Number of white spaces
  • Number of capitalised words
  • Number of capitalised words per word
  • Maximum word length
  • Mean word length
  • Number of exclamation marks
  • Number of question marks
  • Number of URL links
  • Number of URL links per word
  • Number of hashtags
  • Number of hashtags per word
  • Number of mentions
  • Number of mentions per word
  • Number of spam words
  • Number of spam words per word
  • Part-of-speech tags of every tweet
  N-grams:
  • Uni + bi-gram or bi + tri-gram
  Sentiment features:
  • Automatically created sentiment lexicons
  • Manually created sentiment lexicons
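To make the content features concrete, here is a hedged sketch of how a handful of them could be computed for a single tweet; the spam-word lexicon and regular expressions are illustrative assumptions, not the extractors used in the paper:

```python
import re

SPAM_WORDS = {"free", "win", "prize", "followers"}  # illustrative lexicon only

def content_features(tweet):
    """Compute a subset of the content features listed above for one tweet."""
    words = tweet.split()
    n = max(len(words), 1)  # avoid division by zero on empty tweets
    caps = sum(1 for w in words if w.isupper() and len(w) > 1)
    spam = sum(1 for w in words if w.lower().strip("!?.,#@") in SPAM_WORDS)
    return {
        "num_words": len(words),
        "num_chars": len(tweet),
        "num_whitespace": tweet.count(" "),
        "num_caps_words": caps,
        "caps_per_word": caps / n,
        "max_word_len": max((len(w) for w in words), default=0),
        "mean_word_len": sum(len(w) for w in words) / n,
        "num_exclamations": tweet.count("!"),
        "num_questions": tweet.count("?"),
        "num_urls": len(re.findall(r"https?://\S+", tweet)),
        "num_hashtags": len(re.findall(r"#\w+", tweet)),
        "num_mentions": len(re.findall(r"@\w+", tweet)),
        "num_spam_words": spam,
        "spam_words_per_word": spam / n,
    }

print(content_features("WIN a FREE prize!!! http://t.co/x #luck"))
```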
  12. Evaluation Experiment Settings • 5 widely used classification algorithms: Bernoulli Naive Bayes, KNN, SVM, Decision Tree and Random Forests. • Hyperparameters optimised on a subset of the dataset kept separate from the train/test sets. • All 4 feature sets were combined. • 10-fold cross-validation (see the sketch below).
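A minimal sketch of this setup using scikit-learn (an assumption; the slides do not name the toolkit), with synthetic data standing in for the real feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the combined feature matrix and spam labels
# (weights ~9:1 mimics the 1KS-10KN class imbalance).
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.9], random_state=0)

classifiers = {
    "Bernoulli NB": BernoulliNB(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# 10-fold cross-validation, reporting F1 as in the tables that follow.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```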
  13. Evaluation Selection of Classifier
  Classifier     | 1KS-10KN: Precision / Recall / F1 | Social Honeypot: Precision / Recall / F1
  Bernoulli NB   | 0.899 / 0.688 / 0.778 | 0.772 / 0.806 / 0.789
  KNN            | 0.924 / 0.706 / 0.798 | 0.802 / 0.778 / 0.790
  SVM            | 0.872 / 0.708 / 0.780 | 0.844 / 0.817 / 0.830
  Decision Tree  | 0.788 / 0.782 / 0.784 | 0.914 / 0.916 / 0.915
  Random Forest  | 0.993 / 0.716 / 0.831 | 0.941 / 0.950 / 0.946
  • Random Forests outperform the others in both F1 and precision.
  • Better performance on Social Honeypot (1:1 class ratio rather than 1:10?).
  • Results are only 4% below those of the original papers, which require historical user features.
  14. Evaluation Evaluation of Features (w/ Random Forests)
  Feature set             | 1KS-10KN: Precision / Recall / F1 | Social Honeypot: Precision / Recall / F1
  User features (U)       | 0.895 / 0.709 / 0.791 | 0.938 / 0.940 / 0.940
  Content features (C)    | 0.951 / 0.657 / 0.776 | 0.771 / 0.753 / 0.762
  Uni + Bi-gram (Binary)  | 0.930 / 0.725 / 0.815 | 0.759 / 0.727 / 0.743
  Uni + Bi-gram (Tf)      | 0.959 / 0.715 / 0.819 | 0.783 / 0.767 / 0.775
  Uni + Bi-gram (Tfidf)   | 0.943 / 0.726 / 0.820 | 0.784 / 0.765 / 0.775
  Bi + Tri-gram (Tfidf)   | 0.931 / 0.684 / 0.788 | 0.797 / 0.656 / 0.720
  Sentiment features (S)  | 0.966 / 0.574 / 0.718 | 0.679 / 0.727 / 0.702
  • Testing feature sets one by one:
  • User features (U) are the most determinant for Social Honeypot.
  • N-gram features perform best for 1KS-10KN.
  • Potentially due to the different dataset generation approaches?
  15. Evaluation Evaluation of Features (w/ Random Forests)
  Feature set                     | 1KS-10KN: Precision / Recall / F1 | Social Honeypot: Precision / Recall / F1
  Best single feature set         | 0.943 / 0.726 / 0.820 | 0.938 / 0.940 / 0.940
  U + C                           | 0.974 / 0.708 / 0.819 | 0.938 / 0.949 / 0.943
  U + Bi & Tri-gram (Tf)          | 0.972 / 0.745 / 0.843 | 0.937 / 0.949 / 0.943
  U + S                           | 0.948 / 0.732 / 0.825 | 0.940 / 0.944 / 0.942
  Uni & Bi-gram (Tf) + S          | 0.964 / 0.721 / 0.824 | 0.797 / 0.744 / 0.770
  C + S                           | 0.970 / 0.649 / 0.777 | 0.778 / 0.762 / 0.770
  C + Uni & Bi-gram (Tf)          | 0.968 / 0.717 / 0.823 | 0.783 / 0.757 / 0.770
  U + C + Uni & Bi-gram (Tf)      | 0.985 / 0.727 / 0.835 | 0.934 / 0.949 / 0.941
  U + C + S                       | 0.982 / 0.704 / 0.819 | 0.937 / 0.948 / 0.942
  U + Uni & Bi-gram (Tf) + S      | 0.994 / 0.720 / 0.834 | 0.928 / 0.946 / 0.937
  C + Uni & Bi-gram (Tf) + S      | 0.966 / 0.720 / 0.824 | 0.806 / 0.758 / 0.782
  U + C + Uni & Bi-gram (Tf) + S  | 0.988 / 0.725 / 0.835 | 0.936 / 0.947 / 0.942
  • However, when we combine feature sets:
  • The same combination performs best (F1) on both datasets: U + Bi & Tri-gram (Tf) (see the sketch below).
  • Combining features helps us capture different types of spam tweets.
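A sketch of how the winning combination (user features stacked with term-frequency n-grams) could be assembled, assuming scikit-learn and SciPy; the example tweets and user-feature columns are made up for illustration:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

tweets = ["WIN a FREE prize!!! Click http://t.co/x", "Great keynote at the workshop today"]
labels = [1, 0]  # spam = 1, non-spam = 0

# Hypothetical user-feature columns, e.g. followings (FI), followers (FE), account age (AU).
user_feats = np.array([[12000.0, 150.0, 40.0],
                       [300.0, 280.0, 9000.0]])

# Bi- + tri-grams with raw term frequencies (the "Tf" weighting above).
ngrams = CountVectorizer(ngram_range=(2, 3)).fit_transform(tweets)

X = hstack([csr_matrix(user_feats), ngrams])  # U + Bi & Tri-gram (Tf)
clf = RandomForestClassifier(random_state=0).fit(X, labels)
```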
  16. Evaluation Computational Efficiency • Beyond accuracy, how can all these features be applied efficiently in a stream?
  17. Evaluation Computational Efficiency
  Feature set                             | Computation time (seconds per 1,000 tweets)
  User features                           | 0.0057
  N-gram                                  | 0.3965
  Sentiment features                      | 20.9838
  Number of spam words (NSW)              | 19.0111
  Part-of-speech counts (POS)             | 0.6139
  Content features including NSW and POS  | 20.2367
  Content features without NSW            | 1.0448
  Content features without POS            | 19.6165
  • Tested on a regular computer (2.8 GHz Intel Core i7 processor, 16 GB memory).
  • The features that performed best in combination (user features and n-grams) are also the most efficient to compute (a timing sketch follows below).
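As a rough illustration of how such timings can be obtained (not the authors' benchmarking harness), one can wall-clock each extractor over a fixed batch of tweets:

```python
import time

def time_feature_set(extract, tweets):
    """Wall-clock seconds to run one feature extractor over a batch of tweets."""
    start = time.perf_counter()
    for t in tweets:
        extract(t)
    return time.perf_counter() - start

tweets = ["WIN a FREE prize!!! http://t.co/x"] * 1000

# `len` is a trivial stand-in extractor; substitute e.g. content_features from earlier.
print(f"1k tweets: {time_feature_set(len, tweets):.4f} s")
```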
  18. Conclusion • Random Forests were found to be the most accurate classifier. • Comparable performance to previous work (-4%) while limiting features to those available in a single tweet. • Using multiple feature sets increases the chances of capturing different types of spam and makes it harder for spammers to evade detection. • Different feature sets perform best on each dataset when used separately, but the same combination works well for both.
  19. Future Work • Our spam corpus was constructed by picking tweets from spammers. • Need to study whether legitimate users are also likely to post spam tweets, and how that could affect the results. • A more recent, manually labelled spam/non-spam dataset. • Feasibility of cross-dataset spam classification?
  20. That’s it! • Any Questions?
  21. References
  [1] K. Lee, B. D. Eoff, and J. Caverlee. Seven months with the devils: A long-term study of content polluters on Twitter. In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors, ICWSM. The AAAI Press, 2011.
  [2] C. Yang, R. C. Harkreader, and G. Gu. Die free or live hard? Empirical evaluation and new design for fighting evolving Twitter spammers. In Proceedings of the 14th International Conference on Recent Advances in Intrusion Detection (RAID'11), pages 318–337, Berlin, Heidelberg, 2011. Springer-Verlag.