Making the Most of Tweet-Inherent Features for
Social Spam Detection on Twitter
Bo Wang, Arkaitz Zubiaga, Maria Liakata and Rob Procter
Department of Computer Science
University of Warwick
18th May 2015
Social Spam on Twitter
Motivation
• Social spam is an important issue in social media services
such as Twitter, e.g.:
• Users inject tweets into trending topics.
• Users reply with promotional messages providing a link.
• We want to be able to identify these spam tweets in a
Twitter stream.
Social Spam on Twitter
Why Did We Need to Identify Spam?
• We started tracking events via the streaming API.
• The resulting collections were often riddled with noisy tweets.
Social Spam on Twitter
Example
Social Spam on Twitter
Our Approach
• Detection of spammers: unsuitable, as we cannot aggregate a user's historical data from a stream.
• Alternative solution: determine whether a tweet is spam from its inherent features alone.
Social Spam on Twitter
Definitions
• Spam originally coined for unsolicited email.
• How to define spam for Twitter? (not easy!)
• Twitter has its own definition of spam, which allows a certain level of advertising:
• It refers to the user level rather than the tweet level, e.g., users who massively follow others.
• It is harder to define a spam tweet than a spammer.
Social Spam on Twitter
Our Definition
• Twitter spam: noisy content produced by users whose behaviour differs from what the system is intended for, with the goal of grabbing attention by exploiting the social media service's characteristics.
Spammer vs. Spam Detection
What Did Others Do?
• Most previous work focused on spammer detection (users).
• They used features which are not readily available in a
tweet:
• For example, historical user behaviour and network
features.
• Not feasible for our use.
Spammer vs. Spam Detection
What Do We Want To Do Instead?
• (Near) Real-time spam detection, limited to features
readily available in a stream of tweets.
• Contributions:
• Test on two existing datasets, adapted to our purposes.
• Definition of different feature sets.
• Compare different classification algorithms.
• Investigate the use of different tweet-inherent features.
Datasets
• We relied on two (spammer vs non-spammer) datasets:
• Social Honeypot (Lee et al., 2011 [1]): used social honeypots
to attract spammers.
• 1KS-10KN (Yang et al., 2011 [2]): harvested tweets
containing certain malicious URLs.
• From spammer dataset to our spam dataset: randomly select one tweet from each spammer or legitimate user (a sketch follows this list).
• Social Honeypot: 20,707 spam vs 19,249 non-spam (∼1:1).
• 1KS-10KN: 1,000 spam vs 9,828 non-spam (∼1:10).
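A minimal sketch of this sampling step in Python, assuming each dataset is available as a list of user records with hypothetical fields "user_id", "label" and "tweets" (not the authors' actual data format):

    import random

    def build_tweet_dataset(users, seed=42):
        """Sample one tweet per user, keeping the user's spam/non-spam label."""
        rng = random.Random(seed)
        dataset = []
        for user in users:
            if not user["tweets"]:
                continue                               # skip users with no tweets
            tweet = rng.choice(user["tweets"])         # one random tweet per user
            dataset.append({"text": tweet, "label": user["label"]})
        return dataset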
Feature Engineering
User features
• Length of profile name
• Length of profile description
• Number of followings (FI)
• Number of followers (FE)
• Number of tweets posted
• Age of the user account, in hours (AU)
• Ratio of number of followings and followers (FE/FI)
• Reputation of the user (FE/(FI + FE))
• Following rate (FI/AU)
• Number of tweets posted per day
• Number of tweets posted per week

Content features
• Number of words
• Number of characters
• Number of white spaces
• Number of capitalised words
• Number of capitalised words per word
• Maximum word length
• Mean word length
• Number of exclamation marks
• Number of question marks
• Number of URL links
• Number of URL links per word
• Number of hashtags
• Number of hashtags per word
• Number of mentions
• Number of mentions per word
• Number of spam words
• Number of spam words per word
• Part-of-speech tags of every tweet

N-grams
• Uni + bi-grams or bi + tri-grams

Sentiment features
• Automatically created sentiment lexicons
• Manually created sentiment lexicons
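As an illustration, several of the content features above can be computed directly from the raw tweet text. A sketch only: the spam-word list and tokenisation below are placeholders, not the resources used in the paper.

    import re

    SPAM_WORDS = {"free", "win", "click", "offer"}     # placeholder spam lexicon

    def content_features(text):
        words = text.split()
        n_words = len(words) or 1                      # avoid division by zero
        return {
            "num_words": len(words),
            "num_chars": len(text),
            "num_white_spaces": text.count(" "),
            "num_caps_words": sum(w.isupper() for w in words),
            "max_word_length": max((len(w) for w in words), default=0),
            "mean_word_length": sum(len(w) for w in words) / n_words,
            "num_exclamation_marks": text.count("!"),
            "num_question_marks": text.count("?"),
            "num_urls": len(re.findall(r"https?://\S+", text)),
            "num_hashtags": text.count("#"),
            "num_mentions": text.count("@"),
            "num_spam_words": sum(w.lower().strip(".,!?") in SPAM_WORDS for w in words),
        }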
Evaluation
Experiment Settings
• 5 widely-used classification algorithms: Bernoulli Naive Bayes, KNN, SVM, Decision Tree and Random Forests.
• Hyperparameters optimised on a subset of the dataset kept separate from the train/test sets.
• All 4 feature sets were combined.
• 10-fold cross-validation (a minimal sketch of this setup follows).
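A minimal sketch of this setup with scikit-learn, assuming a feature matrix X and binary labels y (0 = non-spam, 1 = spam) have already been built; the hyperparameters shown are illustrative defaults rather than the optimised values.

    from sklearn.naive_bayes import BernoulliNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import LinearSVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_validate

    classifiers = {
        "Bernoulli NB": BernoulliNB(),
        "KNN": KNeighborsClassifier(),
        "SVM": LinearSVC(),
        "Decision Tree": DecisionTreeClassifier(),
        "Random Forest": RandomForestClassifier(n_estimators=100),
    }

    for name, clf in classifiers.items():
        scores = cross_validate(clf, X, y, cv=10,
                                scoring=("precision", "recall", "f1"))
        print(name,
              scores["test_precision"].mean(),
              scores["test_recall"].mean(),
              scores["test_f1"].mean())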
Evaluation
Selection of Classifier
Classifier        1KS-10KN (Precision / Recall / F1)   Social Honeypot (Precision / Recall / F1)
Bernoulli NB      0.899 / 0.688 / 0.778                0.772 / 0.806 / 0.789
KNN               0.924 / 0.706 / 0.798                0.802 / 0.778 / 0.790
SVM               0.872 / 0.708 / 0.780                0.844 / 0.817 / 0.830
Decision Tree     0.788 / 0.782 / 0.784                0.914 / 0.916 / 0.915
Random Forest     0.993 / 0.716 / 0.831                0.941 / 0.950 / 0.946
• Random Forests outperform others in terms of
F1-measure and Precision.
• Better performance on Social Honeypot (1:1 ratio rather
than 1:10?).
• Results only ~4% below the original papers, which require historical user features.
Evaluation
Evaluation of Features (w/ Random Forests)
Feature set                1KS-10KN (Precision / Recall / F1)   Social Honeypot (Precision / Recall / F1)
User features (U)          0.895 / 0.709 / 0.791                0.938 / 0.940 / 0.940
Content features (C)       0.951 / 0.657 / 0.776                0.771 / 0.753 / 0.762
Uni + Bi-gram (Binary)     0.930 / 0.725 / 0.815                0.759 / 0.727 / 0.743
Uni + Bi-gram (Tf)         0.959 / 0.715 / 0.819                0.783 / 0.767 / 0.775
Uni + Bi-gram (Tfidf)      0.943 / 0.726 / 0.820                0.784 / 0.765 / 0.775
Bi + Tri-gram (Tfidf)      0.931 / 0.684 / 0.788                0.797 / 0.656 / 0.720
Sentiment features (S)     0.966 / 0.574 / 0.718                0.679 / 0.727 / 0.702
• Testing feature sets one by one:
• User features (U) are most discriminative for Social Honeypot.
• N-gram features work best for 1KS-10KN (an n-gram extraction sketch follows).
• Potentially due to the different dataset generation approaches?
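The different n-gram weightings in the table correspond to standard vectorisers; a sketch assuming texts is a list of tweet strings (vectoriser settings are illustrative).

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    unibi_binary = CountVectorizer(ngram_range=(1, 2), binary=True).fit_transform(texts)
    unibi_tf     = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)   # term frequency
    unibi_tfidf  = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
    bitri_tfidf  = TfidfVectorizer(ngram_range=(2, 3)).fit_transform(texts)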
Evaluation
Evaluation of Features (w/ Random Forests)
Feature set                        1KS-10KN (Precision / Recall / F1)   Social Honeypot (Precision / Recall / F1)
Best single feature set            0.943 / 0.726 / 0.820                0.938 / 0.940 / 0.940
U + C                              0.974 / 0.708 / 0.819                0.938 / 0.949 / 0.943
U + Bi & Tri-gram (Tf)             0.972 / 0.745 / 0.843                0.937 / 0.949 / 0.943
U + S                              0.948 / 0.732 / 0.825                0.940 / 0.944 / 0.942
Uni & Bi-gram (Tf) + S             0.964 / 0.721 / 0.824                0.797 / 0.744 / 0.770
C + S                              0.970 / 0.649 / 0.777                0.778 / 0.762 / 0.770
C + Uni & Bi-gram (Tf)             0.968 / 0.717 / 0.823                0.783 / 0.757 / 0.770
U + C + Uni & Bi-gram (Tf)         0.985 / 0.727 / 0.835                0.934 / 0.949 / 0.941
U + C + S                          0.982 / 0.704 / 0.819                0.937 / 0.948 / 0.942
U + Uni & Bi-gram (Tf) + S         0.994 / 0.720 / 0.834                0.928 / 0.946 / 0.937
C + Uni & Bi-gram (Tf) + S         0.966 / 0.720 / 0.824                0.806 / 0.758 / 0.782
U + C + Uni & Bi-gram (Tf) + S     0.988 / 0.725 / 0.835                0.936 / 0.947 / 0.942
• However, when we combine feature sets:
• The same combination performs best (F1) on both datasets: U + Bi & Tri-gram (Tf) (a combination sketch follows).
• Combining features helps us capture different types of spam tweets.
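Feature sets can be combined by horizontally stacking the corresponding matrices. A sketch, assuming user_feats is a 2-D array of user features, ngram_feats is a sparse n-gram matrix from a vectoriser, and y holds the labels.

    from scipy.sparse import csr_matrix, hstack
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_validate

    # stack dense user features next to sparse n-gram features
    combined = hstack([csr_matrix(user_feats), ngram_feats]).tocsr()

    scores = cross_validate(RandomForestClassifier(n_estimators=100),
                            combined, y, cv=10,
                            scoring=("precision", "recall", "f1"))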
Evaluation
Computational Efficiency
• Beyond accuracy, how can all these features be applied
efficiently in a stream?
Evaluation
Computational Efficiency
Feature set                                   Comp. time (seconds) for 1k tweets
User features 0.0057
N-gram 0.3965
Sentiment features 20.9838
Number of spam words (NSW) 19.0111
Part-of-speech counts (POS) 0.6139
Content features including NSW and POS 20.2367
Content features without NSW 1.0448
Content features without POS 19.6165
• Tested on a regular computer (2.8 GHz Intel Core i7 processor, 16 GB memory).
• The features that performed best in combination (User and N-grams) are also those computed most efficiently (a timing sketch follows).
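Timings like these can be reproduced with a simple wall-clock measurement over a batch of 1,000 tweets; the extractor passed in (e.g. the content_features sketch above) is interchangeable.

    import time

    def time_feature_extraction(texts, extract):
        """Return wall-clock seconds to extract features for all texts."""
        start = time.perf_counter()
        for text in texts:
            extract(text)
        return time.perf_counter() - start

    # e.g. time_feature_extraction(tweets[:1000], content_features)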
Conclusion
• Random Forests were found to be the most accurate
classifier.
• Comparable performance to previous work (-4%) while
limiting features to those in a tweet.
• The use of multiple feature sets increases the chance of capturing different types of spam, and makes it harder for spammers to evade detection.
• Different feature sets perform best on each dataset when used separately, but the same combination works best on both datasets when features are combined.
Future Work
• Our spam corpus was constructed by picking tweets from known spammers.
• We need to study whether legitimate users are also likely to post spam tweets, and how that could affect the results.
• Build a more recent, manually labelled spam/non-spam dataset.
• Feasibility of cross-dataset spam classification?
That’s it!
• Any Questions?
References
[1] K. Lee, B. D. Eoff, and J. Caverlee. Seven months with the devils: A long-term study of content polluters on Twitter. In L. A. Adamic, R. A. Baeza-Yates, and S. Counts, editors, ICWSM. The AAAI Press, 2011.
[2] C. Yang, R. C. Harkreader, and G. Gu. Die free or live hard? Empirical evaluation and new design for fighting evolving Twitter spammers. In Proceedings of the 14th International Conference on Recent Advances in Intrusion Detection, RAID'11, pages 318–337, Berlin, Heidelberg, 2011. Springer-Verlag.
