Techniques for Automating Quality Assessment of Context-specific Content on Social Media Services

Prateek Dewan
PhD Thesis Defense
November 14, 2017
prateekd@iiitd.ac.in
Committee members
Dr. Alessandra Sala
Dr. Sanasam Ranbir Singh
Dr. Aditya Telang
Dr. Ponnurangam Kumaraguru (Advisor)

Who am I?
• Data Scientist at Apple
• PhD student since February, 2012 – IIIT-Delhi
• Masters (2010 – 2012), IIIT-Delhi
• Collaborations
• IBM IRL (Delhi and Bengaluru), Symantec Research Labs (Pune), Dublin City
University (Ireland), UFMG (Brazil)
• Worked in Privacy and Security on Online Social Media
• Research interests
• Applied Machine Learning
• Natural Language Processing
• Web Security

Online Social Media: The Big Picture

“With great power comes great responsibility”

Thesis statement
• To design and evaluate automated techniques for quality
assessment of context-specific content on social media
services in real time
• Focus: Facebook
• Biggest Online Social Media service
• 2.01 billion monthly active users
• Roughly 2 out of every 7 people on the planet use Facebook
• Most sought-after OSN for news

Proposed Solution
Identify → Characterize → Model → Prototype → Deploy → Evaluate

Scope
• Establishing the definition of poor quality content
• What content counts as poor quality?
• Untrustworthy
• Child unsafe
• Misleading information
• Hoaxes, scams, clickbait
• Violence, hate speech
• Definition conforming to
• Facebook’s community standards [1]
• Definitions of page spam
[1] https://www.facebook.com/communitystandards

Approach
Characterize
• Poor quality posts published on Facebook
• Facebook pages publishing poor quality content
• Misinformation spread on Facebook through images
Model
• Ground truth extraction using URL blacklists and human annotation
• Experiments with multiple supervised learning techniques
• Two-fold model to identify malicious content in real time
Implement
• Facebook Inspector (FbI) architecture
• Live deployment via REST API and browser plug-ins for Chrome and Firefox
• 3,000+ downloads, 180+ daily active users, 1 million+ posts analyzed
• Evaluation in terms of response time, performance, and usability

Approach
• Poor quality posts published on Facebook
Characterize
Model
Implement

Dataset
Data Type Quantity
Unique posts 4,465,371
Unique entities 3,373,953
Unique users 2,983,707
Unique pages 390,246
Unique URLs 480,407
Unique posts with one or more URLs 1,222,137
Unique entities posting URLs 856,758
Unique posts with one or more malicious URLs 11,217
Unique entities posting one or more malicious URLs 7,962
Unique malicious URLs 4,622

Establishing Ground Truth
• Extracted posts containing one or more URLs
• 1.2 million out of 4.4 million posts in total
• 480k unique URLs
• Used six URL blacklists
• Google Safe Browsing (malware / phishing)
• VirusTotal (spam / malware / phishing)
• SURBL (spam)
• Web of Trust (trust score)*
• Spamhaus (spam)
• PhishTank (phishing)
• Posts containing one or more blacklisted URLs were marked as poor quality (11,217 in all)
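The labeling rule above (a post is poor quality if any of its URLs is blacklisted) can be sketched as follows. This is a minimal illustration, not the thesis pipeline: the blacklist here is a hypothetical in-memory set of host names, whereas the talk queried six external blacklist services, and the URL pattern is a simplification.

```python
import re
from urllib.parse import urlparse

# Hypothetical local blacklist of host names; the talk instead queried
# six external services (Google Safe Browsing, VirusTotal, and others).
BLACKLISTED_HOSTS = {"malware.example", "phish.example"}

# Simple pattern for http(s) URLs; real posts may need a more robust parser.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text):
    """Return every http(s) URL found in the post text."""
    return URL_PATTERN.findall(text)


def is_blacklisted(url, blacklist=BLACKLISTED_HOSTS):
    """True if the URL's host name appears in the blacklist."""
    host = urlparse(url).hostname or ""
    return host.lower() in blacklist


def is_poor_quality(text, blacklist=BLACKLISTED_HOSTS):
    """A post is poor quality if at least one of its URLs is blacklisted."""
    return any(is_blacklisted(url, blacklist) for url in extract_urls(text))
```

Posts without any URL fall through as not blacklisted, which is why a second, human-annotated dataset is needed for them.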

Web of Trust
• A URL (e.g., http://www.domain.com) was marked malicious if either condition held:
• Reputation: Unsatisfactory / Poor / Very poor (less than 60) with Confidence: High (greater than 10)
• OR Category: Negative
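The Web of Trust rule on this slide can be written as a small predicate. This is a sketch under assumptions: the function name and argument names are hypothetical, reputation and confidence are taken to be integers on a 0 to 100 scale, and the reputation and confidence conditions are assumed to be combined with AND before the OR with the category check.

```python
def is_malicious_wot(reputation, confidence, categories=()):
    """Apply the slide's Web of Trust rule to one domain.

    reputation, confidence: integers, assumed to be on a 0-100 scale.
    categories: category group names reported for the domain (assumed strings).
    """
    # Unsatisfactory / Poor / Very poor reputation, reported with high confidence.
    low_reputation = reputation < 60 and confidence > 10
    # Any category in the "Negative" group marks the domain on its own.
    negative_category = "negative" in {c.lower() for c in categories}
    return low_reputation or negative_category
```

For example, `is_malicious_wot(35, 50)` is `True`, while the same reputation with confidence 5 is not flagged unless a negative category is present.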

Findings
• Facebook’s current techniques do not suffice
• 65% of all poor quality posts were still on Facebook after 4 or more months
• Gathered likes from 52,169 unique users; comments from 8,784 unique users
• Facebook’s partnership with Web of Trust?
• 88% of all malicious URLs had poor reputation on WOT
• No warning pages

Distribution of poor quality posts
[Figure: distribution of poor quality posts; panels: Pages, Users, Entities, Posts]

Approach
• Facebook pages publishing poor quality content
Characterize
Model
Implement

Facebook Pages posting poor quality content
Dewan, P., Bagroy, S., and Kumaraguru, P. Hiding in Plain Sight: Characterizing and Detecting Malicious Facebook Pages. Published at IEEE/ACM Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, USA, 2016 (Short paper).

Ground Truth extraction: Facebook pages
4.4 million posts → 10,341 malicious posts (1,557 pages; 5,868 users) → 627 malicious pages
• Criterion for a malicious page: 1 or more malicious URLs in the most recent 100 posts
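The page-level criterion (one or more malicious URLs in the most recent 100 posts) can be sketched as below. The post representation, a dict with `created_time` and `urls` keys, and the function name are assumptions made for illustration.

```python
def is_malicious_page(posts, malicious_urls, window=100):
    """True if any of the page's `window` most recent posts has a malicious URL.

    posts: list of dicts with a sortable "created_time" and a list of "urls"
           (hypothetical structure, not the Graph API schema).
    malicious_urls: set of URLs already labeled malicious.
    """
    # Keep only the most recent `window` posts.
    recent = sorted(posts, key=lambda p: p["created_time"], reverse=True)[:window]
    return any(url in malicious_urls for post in recent for url in post["urls"])
```

A page whose only malicious post is older than its 100 most recent posts is therefore not labeled malicious.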

Dataset of pages posting poor quality content
WOT response     No. of pages   No. of posts
Child unsafe     387            10,891
Untrustworthy    317            8,057
Questionable     312            8,859
Negative         266            5,863
Adult content    162            3,290
Spam             124            4,985
Phishing         39             495
Total            627 (31)       20,999
• Numbers in brackets are Verified pages

Content analysis (page names)
• Sentence tokenization → Word tokenization → Case normalization → Stemming → Stopword removal
• N-gram analysis (n = 1, 2, 3)
• Politically polarized entities amongst poor quality pages
• British National Party (BNP), The Tea Party, English Defense League,
American Defense League, American Conservatives, Geert Wilders
supporters…
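The page-name preprocessing and n-gram counting can be sketched with the standard library alone. The stopword list and the suffix-stripping `crude_stem` are toy stand-ins chosen for this example (a real run would use a proper stemmer such as Porter's and a full stopword list); the order of steps follows the slide.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not the list used in the thesis).
STOPWORDS = {"the", "of", "and", "a", "in", "for", "to"}


def crude_stem(word):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(name):
    """Return one token list per sentence of a page name."""
    result = []
    for sentence in re.split(r"[.!?]+", name):            # sentence tokenization
        words = re.findall(r"[A-Za-z0-9']+", sentence)    # word tokenization
        words = [w.lower() for w in words]                # case normalization
        words = [crude_stem(w) for w in words]            # stemming
        words = [w for w in words if w not in STOPWORDS]  # stopword removal
        if words:
            result.append(words)
    return result


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(names, n):
    """Count n-grams over a collection of page names."""
    counts = Counter()
    for name in names:
        for tokens in preprocess(name):
            counts.update(ngrams(tokens, n))
    return counts
```

Calling `ngram_counts(names, n)` for n = 1, 2, 3 and reading `most_common()` gives the frequent unigrams, bigrams, and trigrams.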

Network analysis
• Collusive behavior within pages posting poor quality content
[Figure: network graphs; panels: Shares, Likes, Comments]

Temporal activity
• Activity ratio: $\frac{\text{no. of time units active}}{\text{total no. of time units}}$ during complete observation period
• Malicious pages are more active than benign pages
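The activity ratio can be computed as in the sketch below. It assumes the time unit is one day (the page feature list mentions a daily activity ratio) and that the observation period is given as an inclusive date range; the function name is hypothetical.

```python
from datetime import date


def activity_ratio(post_dates, start, end):
    """Fraction of days in [start, end] on which the page posted at least once.

    post_dates: iterable of datetime.date values, one per post.
    start, end: first and last day of the observation period (inclusive).
    """
    total_days = (end - start).days + 1
    if total_days <= 0:
        raise ValueError("end must not be before start")
    # A day counts once no matter how many posts were made on it.
    active_days = {d for d in post_dates if start <= d <= end}
    return len(active_days) / total_days
```

A page that posted on 2 distinct days of a 10-day period has an activity ratio of 0.2, regardless of how many posts it made on those days.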

Approach
• Misinformation spread on Facebook through images
Characterize
Model
Implement

Why?: The Human Brain - Images versus text
• Human brain processes images 60,000 times faster than
text

Are we doing enough to "understand" images?
• Most research to analyze social media content focuses on text
• Topic modelling
• Sentiment analysis
• Does it capture everything?
• Studies related to images are limited to small scale
• Few hundred images manually annotated and analyzed
• What can be done?
• Automated techniques for image summarization; Deep Learning and
Convolutional Neural Networks (CNNs) to scale across large no. of images
• Domain transfer learning
• Optical Character Recognition

Methodology
• Images posted on Facebook during the Paris Attacks, November 2015
• 3-tier pipeline for extracting high-level image descriptors from images
Data type                 Quantity
Unique posts              131,548
Unique users              106,275
Posts with images         75,277
Total images extracted    57,748
Total unique images       15,123
Pipeline: Images → three tiers → Manual calibration → Human understandable descriptors
• Tier 1: Visual Themes (Inception v3)
• Tier 2: Image Sentiment (DeCAF trained on SentiBank)
• Tier 3: Text embedded in images (Optical Character Recognition, then Text Sentiment (LIWC) + Topics (TF))

Tier I: Visual Themes
• ImageNet Large Scale Visual Recognition Challenge
(ILSVRC), 2012
• 1.2 million images, 1,000 categories
• Model used: Google’s Inception-v3, trained on the ILSVRC 2012 dataset (reported top-1 error: 17.2%)
• 48-layer Deep Convolutional Neural Network

Tier I: Visual Themes contd.
• All images labeled using Inception-v3
• Validation:
• Random sample of 2,545 images annotated by 3 human annotators
• 38.87% accuracy (majority voting)
• Manual calibration
• Renamed 7 out of the top 30 (most frequently occurring) labels
• New accuracy: 51.3%
• Why rename? → Inception-v3 labeled PeaceForParis images in our dataset as “Bolo Tie”
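The validation and manual calibration steps can be sketched as follows. The slides do not say how the annotators' input was recorded, so this example assumes each annotator supplies a label per image, that the majority label is the reference, and that calibration is a simple rename map applied to the model's labels. All names are hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Label chosen by more than half of the annotators, or None if there is none."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def accuracy(predicted, annotations, rename=None):
    """Share of images whose (optionally renamed) model label matches the majority vote.

    predicted: list of model labels, one per image.
    annotations: list of per-image vote lists, aligned with `predicted`.
    rename: optional dict mapping a model label to a dataset-specific label.
    """
    rename = rename or {}
    correct = 0
    for label, votes in zip(predicted, annotations):
        if rename.get(label, label) == majority_label(votes):
            correct += 1
    return correct / len(predicted)
```

Passing a rename map such as `{"bolo tie": "PeaceForParis"}` raises the measured accuracy whenever the renamed label is what annotators actually saw.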

Tier II: Image Sentiment
• Domain Transfer Learning
• Inception-v3’s last layer retrained using SentiBank
• SentiBank
• Images collected from Flickr using Adjective Noun Pairs (ANPs) as search
query
• ANPs: happy dog, adorable baby, abandoned house
• Weakly labeled dataset of images carrying emotion
• Final training set – 133,108 negative + 305,100 positive sentiment images
• 10-fold random subsampling
• 69.8% accuracy
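Random subsampling validation, as used for the sentiment model, repeatedly draws a random train/test split and averages the test accuracy. The sketch below is generic: the number of rounds matches the slide (10), but the test fraction, the seed, and the `train_and_predict` callback interface are assumptions, and no image model is included.

```python
import random


def random_subsampling_accuracy(samples, labels, train_and_predict,
                                rounds=10, test_fraction=0.1, seed=0):
    """Average test accuracy over `rounds` random train/test splits.

    train_and_predict(train_x, train_y, test_x) must return one predicted
    label per element of test_x (hypothetical callback interface).
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    cut = int(len(indices) * test_fraction)
    if cut == 0:
        raise ValueError("not enough samples for the requested test fraction")
    scores = []
    for _ in range(rounds):
        rng.shuffle(indices)  # new random split each round
        test_idx, train_idx = indices[:cut], indices[cut:]
        predictions = train_and_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        hits = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)
```

Unlike k-fold cross-validation, the test sets of different rounds may overlap.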

Tier III: Text embedded in images
• Optical Character Recognition (OCR)
• Tesseract OCR (Python)
• 31,689 images had text
• Manually extracted text from a random sample of 1,000 images
• Compared with OCR output using string similarity metrics
• ~62% accuracy
Tesseract output: “No-one thinks that these people are representative of Christians. So why do so many think that these people are representative of Muslims?”
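Comparing OCR output against manually extracted text can be sketched with the standard library. The slides only say "string similarity metrics", so `difflib.SequenceMatcher` is a stand-in chosen for this example, and the normalization (lowercasing, collapsing whitespace) is an assumption.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def ocr_similarity(ocr_text, reference_text):
    """Similarity in [0, 1] between OCR output and the manual transcription."""
    return SequenceMatcher(None, normalize(ocr_text), normalize(reference_text)).ratio()


def mean_similarity(pairs):
    """Average similarity over (ocr_text, reference_text) pairs."""
    return sum(ocr_similarity(o, r) for o, r in pairs) / len(pairs)
```

Averaging `ocr_similarity` over the manually transcribed sample gives a single accuracy-like figure for the OCR step.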

Image and post text had different topics
• Text embedded in images depicted more negative
sentiment than user generated textual content
[Figure: topics in text embedded in images vs. user generated text]

Sentiment: Images versus text
• Image sentiment was more positive than text sentiment
[Figure: sentiment value / volume fraction against no. of hours after the attacks (8 to 280); series: Post Text, Image Text, Image, Volume Fraction]

Poor quality image content popular on Facebook

Approach
Characterize
Model
Implement

Revisiting -- Establishing Ground Truth
• Extracted posts containing one or more URLs
• 1.2 million out of 4.4 million posts in total
• 480k unique URLs
• Used six URL blacklists
• Google Safe Browsing (malware / phishing)
• VirusTotal (spam / malware / phishing)
• SURBL (spam)
• Web of Trust (trust score)*
• Spamhaus (spam)
• PhishTank (phishing)
• Posts containing one or more blacklisted URLs were marked as poor quality (11,217 in all)

Ground Truth extraction – Dataset II
•What if a post does not have a URL?
• 500 random Facebook posts x 17 events x 3 annotators
• Definition of malicious post
• “Any irrelevant or unsolicited messages sent over the Internet, typically to large
numbers of users, for the purposes of advertising, phishing, spreading malware, etc.
are categorized as spam. In terms of online social media, social spam is any content
which is irrelevant / unrelated to the event under consideration, and / or aimed at
spreading phishing, malware, advertisements, self promotion etc., including bulk
messages, profanity, insults, hate speech, malicious links, fraudulent reviews, scams,
fake information etc.”
• Final dataset (all 3 annotators agreed on the same label)
• 571 malicious posts
• 3,841 benign posts
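Of the 500 × 17 = 8,500 annotated posts, 571 + 3,841 = 4,412 received the same label from all three annotators. The unanimous-agreement filter can be sketched as below; the dict-based input format and the function name are assumptions for illustration.

```python
def unanimous_labels(annotations):
    """Keep only posts on which every annotator chose the same label.

    annotations: dict mapping a post id to the list of labels it received.
    Returns a dict mapping each retained post id to its agreed label.
    """
    return {
        post_id: labels[0]
        for post_id, labels in annotations.items()
        if labels and len(set(labels)) == 1
    }
```

Posts with any disagreement are dropped rather than resolved by majority vote, which trades dataset size for label reliability.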

Feature set: Facebook Posts
Source (no. of features): Features
• Entity (9): isPage, gender, pageCategory, hasUsername, usernameLength, nameLength, numWordsInName, locale, pageLikes
• Textual content (18): Presence of !, ?, !!, ??, emoticons (smile, frown), numWords, avgWordLength, numSentences, avgSentenceLength, numDictionaryWords, numHashtags, hashtagsPerWord, numCharacters, numURLs, URLsPerWord, numUppercaseCharacters, numWords / numUniqueWords
• Metadata (10): Application, Presence of facebook.com URL, Presence of apps.facebook.com URL, Presence of Facebook event URL, hasMessage, hasStory, hasPicture, hasLink, type, linkLength
• Link (7): http / https, numHyphens, numParameters, avgParameterLength, numSubdomains, pathLength
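A few of the textual and link features can be computed as in the sketch below. Feature names follow the table, but the exact definitions are assumptions: words are whitespace-separated tokens, `avgParameterLength` is the mean length of query parameter values, and `numSubdomains` assumes a two-label registered domain (which is wrong for suffixes such as `co.uk`).

```python
import re
from urllib.parse import urlparse, parse_qsl


def textual_features(message):
    """Subset of the textual content features for one post message."""
    words = message.split()
    n = len(words)
    hashtags = [w for w in words if w.startswith("#") and len(w) > 1]
    urls = re.findall(r"https?://\S+", message)
    return {
        "numWords": n,
        "avgWordLength": sum(len(w) for w in words) / n if n else 0.0,
        "numCharacters": len(message),
        "numHashtags": len(hashtags),
        "hashtagsPerWord": len(hashtags) / n if n else 0.0,
        "numURLs": len(urls),
        "URLsPerWord": len(urls) / n if n else 0.0,
        "numUppercaseCharacters": sum(c.isupper() for c in message),
        "hasExclamation": "!" in message,
        "hasQuestionMark": "?" in message,
    }


def link_features(url):
    """Link features for one URL; definitions are assumptions (see lead-in)."""
    parts = urlparse(url)
    host_labels = (parts.hostname or "").split(".") if parts.hostname else []
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "isHttps": parts.scheme == "https",
        "numHyphens": url.count("-"),
        "numParameters": len(params),
        "avgParameterLength": (
            sum(len(value) for _, value in params) / len(params) if params else 0.0
        ),
        "numSubdomains": max(len(host_labels) - 2, 0),
        "pathLength": len(parts.path),
    }
```

The resulting dicts can be merged per post and passed to any standard classifier after encoding the boolean and categorical values.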

Supervised learning: Dataset I
Classifier \ Features   Entity   Text    Metadata   Link    All     Top 7
Naïve Bayes             54.79    52.41   71.60      69.25   56.15   74.72
Decision Tree           63.02    64.78   80.56      82.34   84.67   86.17
Random Forest           63.47    66.25   80.67      82.56   85.05   86.62
SVM (RBF)               61.77    64.89   78.75      81.45   75.89   83.66

Supervised learning: Dataset II
Classifier \ Features   Entity   Text    Metadata   Link    All
Naïve Bayes             51.67    51.60   72.45      77.58   67.63
Decision Tree           51.66    73.16   79.01      81.04   76.17
Random Forest           52.86    76.56   79.87      81.49   80.56
SVM (RBF)               53.16    76.52   78.18      80.37   73.79

Feature set: Facebook Pages
• Page features: Likes, talking about, description length, bio, category, name, location, check-ins, …
• Posting behavior: Daily activity ratio, post types, post likes, post comments, post shares, post engagement ratio, post language, average post length, no. of unique URLs in posts, no. of unique domains in posts, etc.
• Supervised learning
• Page + post features
• 55 features from page information
• 41 features from posting behavior
• Bag of words
• Content generated by pages

Supervised learning: Page + post features
Classifier            Feature set   Accuracy (%)   ROC AUC
Naïve Bayesian        Page          63.95          0.685
Naïve Bayesian        Post          69.61          0.753
Naïve Bayesian        Page + Post   70.81          0.776
Logistic Regression   Page          67.38          0.745
Logistic Regression   Post          76.55          0.825
Logistic Regression   Page + Post   76.71          0.846
Decision Trees        Page          65.55          0.668
Decision Trees        Post          71.37          0.720
Decision Trees        Page + Post   70.81          0.758
Random Forest         Page          67.86          0.750
Random Forest         Post          74.95          0.829
Random Forest         Page + Post   75.27          0.837

Supervised learning: Bag of words
Classifier            Feature set   Accuracy (%)   ROC AUC
Naïve Bayesian        Unigrams      68.27          0.682
Naïve Bayesian        Bigrams       69.06          0.690
Naïve Bayesian        Trigrams      69.77          0.697
Logistic Regression   Bigrams       74.34          0.791
Decision Trees        Bigrams       67.05          0.678
Random Forest         Bigrams       71.80          0.802
Sparse NN             Bigrams       84.12          0.872

Model for real time detection
• Model for pages depends on posts published by pages
• Can’t be used for detection in real time
• Two-fold supervised learning model using post features
• Utilizing class probabilities for decision making
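One way to combine the class probabilities of two classifiers is sketched below. The slides do not spell out the combination rule, so everything here is an assumption made for illustration: a post is flagged when either classifier assigns it a malicious probability at or above a threshold, and the threshold value of 0.5 is hypothetical.

```python
def two_fold_decision(p_malicious_1, p_malicious_2, threshold=0.5):
    """Combine the malicious-class probabilities of two classifiers.

    Assumed rule (not confirmed by the source): flag the post when either
    classifier's malicious probability reaches the threshold.
    """
    for p in (p_malicious_1, p_malicious_2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    if max(p_malicious_1, p_malicious_2) >= threshold:
        return "malicious"
    return "benign"
```

Swapping `max` for `min` would require both classifiers to agree, which lowers false positives at the cost of recall; which trade-off the deployed system makes is not stated here.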

Decision boundary
[Figure: decision boundary over the outputs of Classifier 1 and Classifier 2 (Low to High), separating Malicious from Benign]
Approach
Characterize
Model
Implement

Facebook Inspector (FbI): Architecture

FbI stats
Date of public launch August 23, 2015
Total Incoming Requests 9 million +
Total public posts analyzed 3.5 million +
Total downloads 5,000+
Daily active users 250+
Total unique browsers 1,250+
Posts marked as malicious 615,000+
Posts marked as benign 2.9 million+

FbI evaluation: Response time
• ~80% posts processed within 3 seconds
• Average time per post: 2.635 seconds

FbI evaluation: Usability
• Usability study with 53 participants
• SUS score: 81.36 (A grade)
• Higher perceived usability than over 90% of all systems evaluated using the SUS scale
• 98.1% participants found FbI “easy to use”
• 67.9% participants would like to use FbI frequently
• Quotes from users:
• “Saves your time spent on spam links and hence enhances user
experience.”
• “[Facebook Inspector] Can be useful for minors and people who lack
the judgement to decide how the post is.”
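For reference, the standard System Usability Scale scoring can be sketched as follows: each participant answers 10 items on a 1 to 5 scale, odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5. The function names are hypothetical, and the study score is assumed to be the mean over participants.

```python
def sus_score(responses):
    """SUS score (0-100) for one participant's 10 responses on a 1-5 scale."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("expected 10 responses, each between 1 and 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5


def mean_sus(all_responses):
    """Average SUS score over participants."""
    return sum(sus_score(r) for r in all_responses) / len(all_responses)
```

A participant answering 3 to every item scores 50, so a study mean of 81.36 sits well above the neutral midpoint.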

Contributions summary
• Identified and characterized poor quality content spread on
Facebook, with the purpose of identifying poor quality
posts published during news-making events in real time
• Evaluated supervised learning approaches for identifying
poor quality posts on Facebook in real time, using entity,
textual, metadata, and URL features
• Deployed and evaluated a novel framework and system for
real time detection of poor quality posts on Facebook
during news-making events

How does it help?
• Social media services are the primary source of information for the majority of Internet users
• Content is unmoderated and crowd-sourced; not everything you see is true
• Facebook Inspector provides a useful and usable real world solution to
assist users
• Methodology for fast and accurate summarization of image datasets
pertaining to a given topic
• Government agencies / brands can use this methodology to quickly produce
high-level summaries of events / products and gauge the pulse of the
masses

Real world impact
• Real time system Facebook Inspector built to identify poor
quality content is used by 250+ Facebook users, and has
processed over 9 million requests
• A unique dataset of Facebook posts containing malicious
URLs, pages posting malicious content, and images
depicting misinformation from 20+ news-making events

Limitations and future work
• Current system does not incorporate user feedback
• We would like to enable users to provide feedback to make a more
personalized detection model
• Computer vision techniques have limited accuracy on social
media content
• Object detection, sentiment analysis, and optical character
recognition techniques we used are not tested thoroughly on social
media content
• Identify and rank users on the basis of degree of malice
• The more malicious content a user generates, the higher the ranking

Acknowledgements
• NIXI for travel support (eCRS, 2014)
• IIIT-Delhi for travel support (ASONAM, 2017)
• Govt. of India for funding during PhD
• Collaborators and co-authors: Dr. Anand Kashyap, Shrey Bagroy,
Anshuman Suri, Varun Bharadhwaj, Aditi Mithal
• Monitoring committee: Dr. Vinayak and Dr. Sambuddho
• Peers: Dr. Niharika Sachdeva, Anupama Aggarwal, Dr. Paridhi Jain,
Dr. Aditi Gupta, Srishti Gupta, Rishabh Kaushal
• Members of Precog@IIITD and CERC
• Everyone else who has been part of my journey…

Publications – Part of thesis
• Dewan, P., Bagroy, S., and Kumaraguru, P.
Hiding in Plain Sight: The Anatomy of Malicious Pages on Facebook.
Book chapter, Lecture Notes in Social Networks, Springer 2017 (To appear)
• Dewan, P., Suri, A., Bharadhwaj, V., Mithal, A., and Kumaraguru, P.
Towards Understanding Crisis Events On Online Social Networks Through Pictures.
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
(ASONAM), 2017.
• Dewan, P., and Kumaraguru, P.
Facebook Inspector (FbI): Towards Automatic Real Time Detection of Malicious Content on
Facebook.
Social Network Analysis and Mining Journal (SNAM), 2017. Volume 7, Issue 1.
• Dewan, P., Bagroy, S., and Kumaraguru, P.
Hiding in Plain Sight: Characterizing and Detecting Malicious Facebook Pages.
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
(ASONAM), 2016 (Short paper)
• Dewan, P., and Kumaraguru, P.
Towards Automatic Real Time Identification of Malicious Posts on Facebook.
Thirteenth Annual Conference on Privacy, Security and Trust (PST), 2015
• Dewan, P., Kashyap, A., and Kumaraguru, P.
Analyzing Social and Stylometric Features to Identify Spear phishing Emails.
APWG eCrime Research Symposium (eCRS), 2014

Publications – Other
• Kaushal, R., Chandok, S., Jain P., Dewan, P., Gupta, N., and Kumaraguru, P.
Nudging Nemo: Helping Users Control Linkability across Social Networks.
9th International Conference on Social Informatics (SocInfo), 2017 (Short paper).
• Deshpande, P., Joshi, S., Dewan, P., Murthy, K., Mohania, M., Agrawal, S.
The Mask of ZoRRo: preventing information leakage from documents.
Knowledge and Information Systems Journal, 2014
• Mittal, S., Gupta, N., Dewan, P., Kumaraguru, P.
Pinned it! A large scale study of the Pinterest network.
1st ACM IKDD Conference on Data Sciences (CoDS), 2014
• Dewan, P., Gupta, M., Goyal, K., and Kumaraguru, P.
MultiOSN: Realtime Monitoring of Real World Events on Multiple Online Social Media
IBM ICARE 2013
• Magalhães, T., Dewan, P., Kumaraguru, P., Melo-Minardi, R., and Almeida, V.
uTrack: Track Yourself! Monitoring Information on Online Social Media.
22nd International World Wide Web Conference (WWW) (2013)
• Conway M., Dewan P., Kumaraguru P., McInerney L.
'White Pride Worldwide': A Meta-analysis of Stormfront.org
Internet, Politics, Policy 2012: Big Data, Big Challenges?, Oxford Internet Institute,
University of Oxford.

Thank you!
prateekd@iiitd.ac.in
http://precog.iiitd.edu.in/people/prateek
