Detect Cyberbullying Using ML and Sentiment Analysis

DETECTING THE PRESENCE OF CYBERBULLYING USING COMPUTER SOFTWARE
Ashish Arora
Department of Computer and Electrical Engineering and Computer Science
Florida Atlantic University
Mentor: Dr. Taghi M. Khoshgoftaar

WHAT IS CYBERBULLYING?
Cyberbullying is the use of electronic media or communication channels to bully a person, typically by sending intimidating or threatening messages.
The technology is used to intentionally hurt or embarrass another person.
It involves the use of information and communication technologies to support

COMMENTS INVOLVING NEGATIVITY AND PROFANITY
[Diagram: Cyberbullying, with the labels Profanity, Negativity, Sexuality, Race/Culture, Intelligence, Physical Attributes]

ISSUES RELATED TO CYBERBULLYING
Conversations must be classified as normal chat/text or as falling under bullying attributes.
Cyberbullying is one of the most mentally damaging problems on the internet.
It can have a catastrophic impact on self-esteem and personal lives, especially those of students.
The data needs to be categorized properly before any approach is used to stop cyberbullying activity.

WHAT IS A SUITABLE SOLUTION?
Machine learning: Online Patrol Crawler
Sentiment analysis
Software to detect cyberbullying content

MACHINE LEARNING METHOD - ONLINE PATROL CRAWLER
This method is designed to curb malicious online entries, especially on informal school websites.
It uses a machine learning method known as the Support Vector Machine (SVM) to detect inappropriate entries.
The software is designed to detect cyberbullying cases automatically.
The data for classification is taken from informal school websites.
These informal school websites contain slanderous information about teachers and students.

PREVIOUS APPROACH
1. Detection of cyberbullying activity
2. Saving the URL of the website
3. Printing out websites containing a cyberbullying entry
4. Sending a deletion request for the suspicious entry to the website admin or internet provider
5. Informing the police or legal affairs bureau
6. Confirming the deletion of the entry containing cyberbullying activity

MACHINE LEARNING APPROACH
[Diagram: the machine learning module consists of a training phase and a test phase]

TRAINING PHASE STEPS
Crawling School Website
Manually detecting cyber-bullying entries
Extraction of vulgar words and adding them to lexicon
Estimating word similarity with Levenshtein distance
Training with Support Vector Machine Algorithm

TEST PHASE STEPS
Crawling School Website
Detecting Cyber-bullying entries by SVM model
Part of speech analysis of the detected harmful entry
Estimating word similarity with Levenshtein distance
Marking and visualizing harmful entries

ESTIMATION OF WORD SIMILARITY - LEVENSHTEIN DISTANCE
Suspicious entries are gathered manually to form a lexicon of vulgar words distinctive of cyberbullying entries.
Users often change the spelling of words and write in a non-normalized way, e.g. ‘See You’ is written as ‘CU’ in chats or forums.
Levenshtein distance is used to calculate the similarity of words used in chat.
The Levenshtein distance between two strings is the minimum number of operations required to transform one string into the other, where the only available operations are deletion, insertion, or substitution of a single character.
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
kitten → sitten (substitution of "s" for "k")
sitten → sittin (substitution of "i" for "e")
sittin → sitting (insertion of "g" at the end).
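The definition above can be sketched as a short dynamic-programming routine. This is a minimal illustration, not the implementation used in the Online Patrol Crawler; the function names `levenshtein` and `closest_lexicon_word` are hypothetical, as is the idea of picking the single nearest lexicon entry.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed
    # so far and b[:j]; it starts as the distance from "" to b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest_lexicon_word(word, lexicon):
    """Return the lexicon entry nearest to word, together with its distance.
    Hypothetical helper: the slides do not describe how matches are chosen."""
    best = min(lexicon, key=lambda entry: levenshtein(word, entry))
    return best, levenshtein(word, best)


print(levenshtein("kitten", "sitting"))  # 3, matching the worked example
```

A misspelled word can then be compared against every lexicon entry, with a small distance treated as a likely match. Where to set that cutoff is a design choice the slides do not specify.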

SUPPORT VECTOR MACHINE METHOD OF CLASSIFICATION
SVM is a supervised machine learning method used for classifying data.
Given a set of training samples divided into two categories, A and B, the SVM training algorithm generates a model that predicts whether a test sample belongs to category A or B. Here the two categories are harmful (cyberbullying) and non-harmful entries.
Samples are represented as points in space (vectors).
SVM constructs a hyperplane in that space with the largest distance (margin) to the nearest training data points.
The larger the margin, the lower the generalization error of the classifier.
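The margin idea is commonly written as the optimization problem below. This is the standard hard-margin linear formulation, given here as an assumption: the slides do not say which SVM variant or kernel was used. The symbols w, b, x_i, and y_i are introduced only for this illustration.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Here x_i is the feature vector of training entry i, and y_i is +1 or -1 depending on whether the entry is harmful or non-harmful. The distance between the two margin boundaries is 2 / ||w||, so minimizing ||w|| maximizes the margin.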

EVALUATION OF SVM MODEL
Data needs to be prepared for training the SVM model.
For the training data, 966 entries were gathered during manual online patrol, of which human annotators classified 750 as harmful and 216 as non-harmful.
These entries were fed to SVM light, a software implementation of the SVM algorithm.
The result is reported as an F-score, which is computed from precision and recall.
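As a rough sketch of how an F-score is derived from precision and recall, the function below computes the weighted harmonic mean. The slides do not state which F-score variant was reported, so the balanced F1 (beta = 1) used in the example is an assumption, and `f_score` is a hypothetical name.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.
    beta = 1 gives the balanced F1 score (an assumption here)."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing was retrieved correctly
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall figures as reported in the slides for the SVM model.
print(f_score(0.799, 0.983))
```

With the reported 79.9% precision and 98.3% recall, the balanced F1 comes out at roughly 0.88.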

METHODOLOGY
[Flow diagram: data, pre-processing, and feature extraction; a training data set of 966 entries (750 harmful, 216 not harmful); SVM training with SVM light (software for building SVM models); 10-fold cross-validation; evaluation on a test data set; result of the SVM model: 79.9% precision and 98.3% recall]
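The 10-fold cross-validation named on this slide can be sketched as follows. This is only an illustration of how the folds are formed: `k_fold_indices` is a hypothetical helper, and it splits the data in its given order, whereas in practice the entries would usually be shuffled (and often stratified by class) first. The slides do not say how the folds were drawn.

```python
def k_fold_indices(n_samples: int, k: int = 10):
    """Yield (train_indices, test_indices) for each of k folds.
    Fold sizes differ by at most one when n_samples is not divisible by k."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# 966 is the size of the training set reported in the slides.
for train, test in k_fold_indices(966, k=10):
    print(len(train), len(test))
```

Each entry is used for testing exactly once and for training in the other nine folds, and the reported scores are then typically averaged over the folds.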

RANKING THE WORDS
Apart from classifying cyberbullying entries, there is a need to determine how harmful a given entry is.
The harmfulness of an entry is calculated using the T-score.
To calculate the harmfulness of a whole entry, the T-scores of all vulgar words in it are summed. The more frequently a word occurs in a sentence, the higher its T-score.
The more frequently occurring words an entry contains, the higher it ranks in harmfulness.
T-score = a/b
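The ranking step can be sketched as a sum of per-word scores. The slides do not define a and b, so the sketch below does not compute T-scores itself: it assumes the scores are already available in a dictionary. The names `entry_harmfulness` and `rank_entries`, the placeholder words, and the score values are all hypothetical.

```python
def entry_harmfulness(entry: str, word_scores: dict) -> float:
    """Sum the scores of all lexicon words found in the entry.
    A word that occurs several times is counted each time."""
    tokens = entry.lower().split()
    return sum(word_scores.get(token, 0.0) for token in tokens)


def rank_entries(entries, word_scores):
    """Order entries from most to least harmful."""
    return sorted(entries,
                  key=lambda entry: entry_harmfulness(entry, word_scores),
                  reverse=True)


# Placeholder lexicon with made-up scores, for illustration only.
scores = {"badword": 2.0, "worseword": 3.5}
entries = ["hello there", "badword you", "badword worseword badword"]
print(rank_entries(entries, scores))
```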

DISCUSSION
The SVM model used to distinguish between harmful and non-harmful entries achieved 79.9% precision and 98.3% recall.
The approach is less accurate when preparing the lexicon of vulgar words: matching words by Levenshtein distance sometimes gives inaccurate results.
New vulgar words appear frequently, so a way is needed to extract new harmful words from the internet automatically.

DETECTING CYBERBULLYING ON A SOCIAL NETWORK SITE – TWITTER
A sentiment classifier based on a machine learning algorithm is used to classify tweets into negative and positive categories.
The aim is to detect bullying instances in social networks and increase their visibility.
Twitter is used as the source of data.

PREVIOUS APPROACH
A machine learning algorithm was used to classify the sentiment of Twitter messages.
The previous approach classified tweets as positive or negative based on specific emoticons found in the messages.
In this approach, commonly used abusive words are used for labelling instead of emoticons.
Graph visualizations, both dynamic and static, illustrate the clustering of bullies over a period of time.

PROPOSED APPROACH
The software application should be capable of accurately classifying Twitter messages as negative or positive with respect to some commonly used terms.
It focuses mainly on gender bullying, using four words with different polarity.
Amazon’s Mechanical Turk was used to confirm their “bullying” polarity.

PROPOSED APPROACH
Once the polarity of the words is confirmed, the data is processed to extract relevant information, such as the username of the person who posted the negative tweet (the potential bully) and the username of the person mentioned in the tweet.
The outcome of the monitoring process is a set of social graphs.
The social graphs are categorized into bully and victim social graphs.
The purpose of these graphs is to visualize all detected bullying instances, find clusters of bullies, and show hidden connections between victims over a period of time.
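The extraction of poster and mentioned usernames can be sketched as below. This is a minimal illustration under stated assumptions: each tweet is taken to be a dictionary with hypothetical `user` and `text` fields, mentions are assumed to look like `@name`, and `bully_victim_edges` is an invented helper rather than part of the described software.

```python
import re
from collections import Counter

# Assumed mention syntax: "@" followed by letters, digits, or underscores.
MENTION = re.compile(r"@(\w+)")


def bully_victim_edges(negative_tweets):
    """Count (poster, mentioned user) pairs over tweets classified as negative.
    Each pair is a candidate bully -> victim edge in a social graph."""
    edges = Counter()
    for tweet in negative_tweets:
        author = tweet["user"]
        for mentioned in MENTION.findall(tweet["text"]):
            if mentioned != author:  # ignore self-mentions
                edges[(author, mentioned)] += 1
    return edges


tweets = [
    {"user": "alice", "text": "@bob you are awful"},
    {"user": "alice", "text": "@bob @carol go away"},
    {"user": "dave", "text": "no mention here"},
]
print(bully_victim_edges(tweets))
```

An edge list of this kind, with counts as weights, is the sort of input a graph visualization tool can use to show clusters of posters and the users they target.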

TECHNOLOGY USED
LingPipe – A tool kit for processing text using computational linguistics. Implements the naïve Bayes algorithm.
Tweet Extractor – Extracts tweets from Twitter continuously.
Gephi – Open-source graph visualization and manipulation software.
Amazon’s Mechanical Turk Service – A crowdsourcing marketplace that coordinates the use of human intelligence to perform tasks that computers are unable to do.

DATA COLLECTION AND PRE-PROCESSING
Around 5000 tweets were collected from different sources.
A bag-of-words model is used: every word in a sentence is taken as a feature, and the whole sentence is represented by an unordered collection of words.
[Diagram: sources of the 5000 tweets: previously collected data from Stanford students and previously collected data from university professors; Mechanical Turk was used to validate the polarity of the tweets]
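The bag-of-words representation described on this slide can be sketched in a few lines. The tokenization rule (lowercasing and splitting on anything that is not a letter, digit, or apostrophe) is an assumption made for the example; the slides do not describe how tweets were tokenized, and `bag_of_words` is a hypothetical name.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Represent a sentence as an unordered collection of word counts."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


print(bag_of_words("You are what you tweet"))
# Word order is discarded: these two sentences get the same representation.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
```

Because order is ignored, the representation is simple and cheap to build, at the cost of losing information that depends on word order.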

APPROACH
A framework was built on top of the LingPipe tool kit for processing text using computational linguistics.
The framework uses LingPipe’s Naive Bayes machine learning classifier as a baseline.
The framework treats the classifier and the feature extractor as one component.
As part of data collection and pre-processing, Twitter was searched for tweets containing the words of interest (negative words).
[Diagram: the framework combines the LingPipe Naïve Bayes classifier with the Tweet Extractor, which extracts tweets]

DATA COLLECTION
[Diagram: an open-source library and the Streaming API crawl the Twitter timeline for tweets containing the words of interest, which are used as training data]
For training data, messages that contained the words “Gay,” “Homo,” “Dike,” and “Queer” were collected using the in-house tweet extractor.
The test data was collected at random by streaming in public tweets from Twitter’s public timeline.
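Selecting tweets that contain a word of interest can be sketched as a whole-word match. This is not the in-house tweet extractor, whose internals the slides do not describe; the function names are hypothetical and the example uses neutral placeholder words.

```python
import re


def contains_word_of_interest(text: str, words_of_interest) -> bool:
    """True if any word of interest appears in the text as a whole word,
    ignoring case. Substrings inside longer words do not count."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return any(word.lower() in tokens for word in words_of_interest)


def select_training_tweets(tweets, words_of_interest):
    """Keep only the tweets that mention at least one word of interest."""
    return [t for t in tweets if contains_word_of_interest(t, words_of_interest)]


# Placeholder words, for illustration only.
sample = ["I like Apple pie", "pineapple season", "nothing relevant"]
print(select_training_tweets(sample, ["apple"]))
```

Matching whole words rather than substrings keeps unrelated words that merely contain a term of interest out of the training data.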

A training data set and a test data set were created to train the classifier.
The training data consists of messages containing the 4 words of interest: ‘Gay’, ‘Homo’, ‘Dike’ and ‘Queer’.
5000 tweets: approximately 3/4 of the collected tweets were negative and 1/4 were positive.
460 tweets were manually labelled as negative, and 500 tweets were labelled positive by Amazon’s Mechanical Turk Service.
The labelled data is validated by selecting a random sample of the collected data and using Amazon’s Mechanical Turk to confirm its sentiment.
Survey used:
Opinion                               Polarity Value
Negative with bullying intentions     B
Negative without bullying intentions  A
Positive or good content              P
Neutral                               N

CLASSIFICATION – NAIVE BAYES CLASSIFIER
The focus of this approach is to find the polarity of tweets.
Each word in a tweet is considered a unique variable in the Naïve Bayes model.
Goal: the probability that a word belongs to the positive or the negative class.
[Flow diagram: collecting data set for training → pre-processing data set → training data → training the model → sentiment detection (positive, negative)]
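A word-based Naïve Bayes classifier of the kind described on this slide can be sketched in plain Python. This is not LingPipe's implementation: the class name `NaiveBayes`, the whitespace tokenization, and the add-one (Laplace) smoothing are assumptions chosen to keep the example short, and the training sentences are invented.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over words, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()

    def train(self, labelled_docs):
        for text, label in labelled_docs:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log prior plus the sum of log likelihoods of each word
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]  # 0 if unseen
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes()
model.train([
    ("you are great", "positive"),
    ("have a nice day", "positive"),
    ("you are stupid", "negative"),
    ("nobody likes you loser", "negative"),
])
print(model.predict("you are a loser"))
print(model.predict("nice day"))
```

Working in log space avoids numeric underflow on longer texts, and the smoothing keeps a single unseen word from forcing a class probability to zero.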

RESULTS
Amazon’s Mechanical Turk classified unlabelled data, which was used to verify and validate the newly labelled data provided by the machine learning algorithm.
Results (training: 500 tweets)
Classifier        Positive  Negative  Accuracy
Naïve Bayes       65.7%     72.9%     67.3%
Amazon’s MTurk    65.2%     74.0%     67.1%

CONCLUSION
This approach leverages the power of sentiment analysis.
The classifier was close to 70% accurate.
This is not the best possible result, owing to restrictions on accessing unlimited content from Twitter.

CYBERBULLYING BLOCKER APPLICATION FOR ANDROID
New types of internet-connected devices, such as smartphones and tablets, have further exacerbated the problem of cyberbullying.
This Android application automatically detects possibly harmful content in a text.
The application uses a machine learning method to spot undesirable content.

APPLICATION
The application is built for devices running Android OS.
Java 8 and Android Studio were used.
It gives users an interface for detecting harmful content.

HARMFUL CONTENT DETECTION PROCESS
The application contains one activity responsible for interacting with the user.
To check for harmful content, the application starts a background thread.
The user can still use the device even if the checking process takes a while.
[Flow diagram: user inputs text on the mobile screen → pushes a button to select the method → feedback to the user]

METHODOLOGY
The method classifies messages as harmful or not using a classifier trained with a language modelling method based on a brute-force algorithm.
Brute force: algorithms using a combinatorial approach usually generate a massive number of combinations, each a potential answer to a given problem.
The algorithm is applied to the automatic extraction of sentence patterns.
The data was collected by Internet Patrol and annotated by experts: 1490 harmful and 1508 non-harmful entries.
All patterns used in classification are stored on the mobile device.
The method operates locally and does not require an internet connection.
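One way to read the combinatorial pattern extraction is as the enumeration of every ordered combination of a sentence's words. The sketch below follows that reading. It is an assumption, not the application's actual algorithm, which the slides do not detail; `sentence_patterns` and its `max_len` cap are hypothetical.

```python
from itertools import combinations


def sentence_patterns(tokens, max_len=None):
    """Enumerate every ordered, possibly non-contiguous combination of tokens.
    A sentence of n tokens yields 2**n - 1 patterns when max_len is not set,
    which is why the approach generates a massive number of candidates."""
    n = len(tokens)
    limit = n if max_len is None else min(max_len, n)
    patterns = []
    for size in range(1, limit + 1):
        for positions in combinations(range(n), size):
            patterns.append(tuple(tokens[i] for i in positions))
    return patterns


for pattern in sentence_patterns(["you", "are", "mean"]):
    print(pattern)
```

Since the number of patterns doubles with every extra word, capping the pattern length with `max_len` is one plausible way to keep the stored pattern set small enough for a mobile device.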

RESULT
Precision = 79%
Recall = 79%
Requires minimal human effort
Recall is the ratio of the number of relevant records retrieved to the total number of relevant records in the database.
Precision is the ratio of the number of relevant records retrieved to the total number of records retrieved, relevant and irrelevant.
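The two definitions translate directly into code. In this sketch a "relevant record retrieved" is a true positive, an irrelevant record retrieved is a false positive, and a relevant record that was missed is a false negative. The function name and the counts in the example are illustrative, not figures from the study.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall) from raw counts."""
    retrieved = true_positives + false_positives  # everything flagged
    relevant = true_positives + false_negatives   # everything that should be flagged
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Illustrative counts: 8 harmful entries found, 2 false alarms, 2 missed.
print(precision_recall(8, 2, 2))
```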

OTHER SOFTWARE PRODUCTS ON THE MARKET FOR DETECTING CYBERBULLYING
FearNot! – An interactive drama/video game that teaches children strategies to prevent bullying and social exclusion.
Samaritans Radar – The application alerted a user when it spotted someone being bullied, depressed, or sending disturbing suicidal signals. It was discontinued due to privacy concerns.
ReThink – A smartphone application that shows a pop-up warning message when the user tries to send a message with harmful content.
PocketGuardian – A parental monitoring app that detects not only cyberbullying in texts but also harmful images. It uses a machine learning algorithm. Disadvantage: costs $4 per month.

PROPOSAL TO FILTER SUSPECTED MESSAGES
A filtering mechanism classifies messages as “abusive” or “non-abusive” (or “positive” and “negative”, respectively).
In a practical system, the filter will not be completely reliable; there will be false positives and false negatives in at least some cases.
Some cases, such as threats, require extra effort.
It is difficult to create an automated system that reliably recognizes threats that should be reported to the police.
The problem of false positives and the problem of discarding threats can both be mitigated by diverting messages labelled abusive to a trusted third party.
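The diversion idea can be sketched as a router that never discards a message: anything the filter labels abusive goes to a review queue for the trusted third party instead of being deleted. The scoring function, the 0.5 threshold, and all names below are hypothetical stand-ins for whatever classifier a real system would use.

```python
def route_messages(messages, abuse_score, threshold=0.5):
    """Split messages into those delivered normally and those diverted
    to a trusted third party for review. Nothing is dropped."""
    delivered, review_queue = [], []
    for message in messages:
        if abuse_score(message) >= threshold:
            review_queue.append(message)  # a human decides what happens next
        else:
            delivered.append(message)
    return delivered, review_queue


def toy_score(message: str) -> float:
    """Stand-in classifier: flags messages containing a placeholder word."""
    return 1.0 if "badword" in message.lower().split() else 0.0


delivered, review_queue = route_messages(
    ["see you at lunch", "you badword"], toy_score)
print(delivered)
print(review_queue)
```

Because flagged messages are kept rather than deleted, a false positive can be released by the reviewer and a genuine threat is preserved for reporting.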

CHALLENGES
Preventing the removal of valuable messages when attempting to filter the data.
Privacy concerns
Incidents should be reported as early as possible.
False reporting
