Detecting the presence of cyberbullying using computer software
Presentation involving techniques to detect cyberbullying activities using computer software.

Slide 1: DETECTING THE PRESENCE OF CYBERBULLYING USING COMPUTER SOFTWARE
Ashish Arora, Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University
Mentor: Dr. Taghi M. Khoshgoftaar

Slide 2: WHAT IS CYBERBULLYING?
- Cyberbullying is the use of electronic media or communication channels to bully a person, typically by sending messages of an intimidating or threatening nature.
- Technology is used to intentionally hurt or embarrass another person.
- It involves the use of information and communication technologies to support …

Slide 3: COMMENTS INVOLVING NEGATIVITY AND PROFANITY
- Cyberbullying comments involve profanity and negativity.
- Common targets: sexuality, race/culture, intelligence, physical attributes.

Slide 4: ISSUES RELATED TO CYBERBULLYING
- Classifying a conversation as normal chat/text or under bullying attributes.
- Cyberbullying is one of the most mentally damaging problems on the internet; it has a catastrophic impact on self-esteem and personal lives, especially of students.
- The data needs to be categorized properly before applying any approach to stop cyberbullying activity.

Slide 5: WHAT IS THE SUITABLE SOLUTION?
- Machine learning: online patrol crawler
- Sentiment analysis
- Software to detect cyberbullying content

Slide 6: MACHINE LEARNING METHOD: ONLINE PATROL CRAWLER
- This method is designed to curb online malicious entries, especially on informal school websites.
- It uses a machine learning method known as the Support Vector Machine (SVM) to detect inappropriate entries.
- The software is designed to detect cyberbullying cases automatically.
- The data for classification is taken from informal school websites, which contain slandering information about teachers and students.

Slide 7: PREVIOUS APPROACH
1. Detection of cyberbullying activity
2. Saving the URL of the website
3. Printing out websites containing the cyberbullying entry
4. Sending a deletion request for the suspicious entry to the website admin or internet provider
5. Informing the police or legal affairs bureau
6. Confirming the deletion of the entry containing the cyberbullying activity

Slide 8: MACHINE LEARNING APPROACH
- The machine learning module has two phases: a training phase and a test phase.

Slide 9: TRAINING PHASE STEPS
1. Crawling the school website
2. Manually detecting cyberbullying entries
3. Extracting vulgar words and adding them to a lexicon
4. Estimating word similarity with the Levenshtein distance
5. Training with the Support Vector Machine algorithm

Slide 10: TEST PHASE STEPS
1. Crawling the school website
2. Detecting cyberbullying entries with the SVM model
3. Part-of-speech analysis of each detected harmful entry
4. Estimating word similarity with the Levenshtein distance
5. Marking and visualizing harmful entries

Slide 11: ESTIMATION OF WORD SIMILARITY: LEVENSHTEIN DISTANCE
- Suspicious entries were gathered manually to form a lexicon of vulgar words distinctive of cyberbullying entries.
- Users often change the spelling of words and write in a non-normalized way, e.g. "see you" is written as "CU" in chats or forums.
- The Levenshtein distance is used to calculate the similarity of words used in chat.
- The Levenshtein distance between two strings is the minimum number of operations required to transform one string into the other, where the available operations are deletion, insertion, or substitution of a single character.
- For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and no shorter edit sequence exists:
  kitten → sitten (substitution of "s" for "k")
  sitten → sittin (substitution of "i" for "e")
  sittin → sitting (insertion of "g" at the end)

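The edit-distance computation described above can be sketched with the classic dynamic-programming formulation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep only the previous row of the DP table to save memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3, matching the slide's example
```

In the crawler, a chat word whose distance to a lexicon entry falls below a small threshold would be treated as a misspelled variant of that vulgar word.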
Slide 12: SUPPORT VECTOR MACHINE METHOD OF CLASSIFICATION
- SVM is a supervised machine learning method used for classifying data.
- Given a set of training samples divided into two categories A and B, the SVM training algorithm generates a model that predicts whether a test sample belongs to category A or B; here, entries are classified as harmful (cyberbullying) or non-harmful.
- Samples are represented as points in space (vectors). SVM constructs a hyperplane with the largest distance to the nearest training data points; the larger the margin, the lower the generalization error of the classifier.

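The slides train their model with SVM_light; as a rough illustration of the idea only (not the actual tool), here is a minimal linear classifier trained by sub-gradient descent on the hinge loss, the objective underlying linear SVMs. The one-dimensional toy data is invented:

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_linear_svm(samples, labels, lam=0.001, eta=0.1, epochs=50):
    """Sub-gradient descent on the regularized hinge loss.
    labels are +1 (harmful) or -1 (non-harmful)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * (dot(w, x) + b) < 1:  # inside the margin: take a hinge step
                w = [wi - eta * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += eta * y
            else:                        # correct side of the margin: only shrink w
                w = [wi * (1 - eta * lam) for wi in w]
    return w, b

def predict(w, b, x):
    # Side of the separating hyperplane determines the predicted class.
    return 1 if dot(w, x) + b >= 0 else -1

# Toy one-dimensional, linearly separable data.
X = [[1.0], [2.0], [-1.0], [-2.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print(predict(w, b, [3.0]), predict(w, b, [-3.0]))  # 1 -1
```

A real system would represent each entry as a feature vector (e.g. vulgar-word counts) and use a dedicated library for training.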
Slide 13: EVALUATION OF SVM MODEL
- Data needs to be prepared for training the SVM model.
- For training data, 966 entries were gathered during manual online patrol, of which human annotators classified 750 as harmful and 216 as non-harmful.
- These entries were fed to SVM_light, a software package implementing the SVM algorithm.
- The result is reported as an F-score, which is expressed in terms of precision and recall.

Slide 14: METHODOLOGY
- Training data set: 966 entries (750 harmful, 216 not harmful)
- Pre-processing and feature extraction of the data, then SVM training with SVM_light (a software package for building SVM models)
- Evaluation on the test data set with 10-fold cross-validation
- Result of the SVM model: 79.9% precision and 98.3% recall

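The F-score mentioned above is the harmonic mean of precision and recall; plugging in the reported figures:

```python
def f_score(precision, recall):
    # F1 score: harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# The reported 79.9% precision and 98.3% recall give an F1 of about 0.88.
print(f_score(0.799, 0.983))
```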
Slide 15: RANKING THE WORDS
- Apart from classifying cyberbullying entries, there is a need to determine how harmful a given entry is.
- The harmfulness of an entry is calculated using the T-score: T-score = a/b.
- To calculate the harmfulness of a whole entry, the T-scores of all its vulgar words are summed.
- The higher a word's occurrence frequency, the higher its T-score; the more frequently occurring words an entry contains, the higher it ranks in harmfulness.

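The slide leaves the terms a and b of the T-score undefined, so the per-word scores below are made up; given such a lexicon, the entry-level ranking step is just a sum over the entry's vulgar words:

```python
# Hypothetical per-word T-scores; the real values come from the patrol data.
T_SCORE = {"idiot": 2.0, "loser": 1.5, "stupid": 1.2}

def entry_harmfulness(entry: str) -> float:
    # Sum the T-scores of all vulgar words appearing in the entry;
    # non-lexicon words contribute nothing.
    return sum(T_SCORE.get(word, 0.0) for word in entry.lower().split())

print(entry_harmfulness("You are an idiot and a loser"))  # 3.5
```

Entries would then be sorted by this total to produce the harmfulness ranking.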
Slide 16: DISCUSSION
- The SVM model used to distinguish harmful from non-harmful information achieved 79.9% precision and 98.3% recall.
- The approach is less accurate when preparing the lexicon of vulgar words; words matched by Levenshtein distance sometimes give inaccurate results.
- New vulgar words appear frequently, so a way is needed to automatically extract new harmful words from the internet.

Slide 17: DETECTING CYBERBULLYING ON A SOCIAL NETWORK SITE: TWITTER
- A sentiment classifier is used to classify tweets into negative and positive categories using a machine learning algorithm.
- The aim is to detect bullying instances in social networks and increase their visibility.
- Twitter is used as the source of data.

Slide 18: PREVIOUS APPROACH
- A machine learning algorithm for classifying the sentiment of Twitter messages.
- The previous approach classified tweets as positive or negative with respect to specific emoticons found in Twitter messages; in this approach, commonly used abusive words are used for labelling instead of emoticons.
- Graph visualizations, both dynamic and static, illustrate the clustering of bullies over a period of time.

Slide 19: PROPOSED APPROACH
- This software application would accurately classify Twitter messages as negative or positive with respect to some commonly used terms.
- It focuses mainly on gender bullying, using four words with different polarity.
- Amazon's Mechanical Turk was used to confirm their "bullying" polarity.

Slide 20: PROPOSED APPROACH (CONT.)
- Once the polarity of words is confirmed, the data is processed to extract relevant information, such as the username of the person who posted the negative tweet (the potential bully) and the username of the person mentioned in the tweet.
- The outcome of the monitoring process is a set of social graphs, categorized into bully and victim social graphs.
- The purpose of these graphs is to visualize all detected bullying instances, find clusters of bullies, and show hidden connections between victims over a period of time.

Slide 21: TECHNOLOGY USED
- LingPipe: a toolkit for processing text using computational linguistics; implements the naïve Bayes algorithm.
- Tweet extractor: extracts tweets from Twitter continuously.
- Gephi: open-source graph visualization and manipulation software.
- Amazon's Mechanical Turk service: a crowdsourcing marketplace that coordinates the use of human intelligence to perform tasks that computers are unable to do.

Slide 22: DATA COLLECTION AND PRE-PROCESSING
- Around 5,000 tweets were collected from different sources: previously collected data from Stanford students and from university professors.
- A bag-of-words model is used: every word in a sentence is a feature, and the whole sentence is represented as an unordered collection of words.
- Mechanical Turk was used to validate the polarity of the tweets.

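The bag-of-words representation described above can be sketched in a few lines (the example sentence is arbitrary):

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    # Every word becomes a feature; word order is discarded,
    # only the counts of each word remain.
    return Counter(sentence.lower().split())

print(bag_of_words("the dog chased the cat"))
```

Real pipelines also strip punctuation and normalize tokens before counting.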
Slide 23: APPROACH
- Built a framework on top of the LingPipe toolkit for processing text using computational linguistics.
- The framework uses LingPipe's naïve Bayes machine learning classifier as a baseline and treats the classifier and feature extractor as one component.
- As part of data collection and pre-processing, Twitter was accessed to look for tweets containing the words of interest (negative words).
- Framework = LingPipe naïve Bayes classifier + tweet extractor (extracts tweets).

Slide 24: DATA COLLECTION
- An open-source library and the Streaming API crawl the Twitter timeline for tweets containing the words of interest.
- For training data, messages containing the words "Gay," "Homo," "Dike," and "Queer" were collected using the in-house tweet extractor.
- The test data was collected at random by streaming public tweets from Twitter's public timeline.

Slide 25:
- To train the classifier, a training data set and a test data set were created.
- The training data consists of messages containing the four words of interest: "Gay," "Homo," "Dike," and "Queer."
- Of the 5,000 collected tweets, approximately 3/4 were negative and 1/4 positive.
- 460 tweets were manually labelled as negative, and 500 tweets were labelled positive by Amazon's Mechanical Turk service.
- The labelled data was validated by selecting a random sample of the collected data and using Amazon's Mechanical Turk to confirm its sentiment.

Survey opinion polarity values:
  Opinion                              | Polarity value
  Negative with bullying intentions    | B
  Negative without bullying intentions | A
  Positive or good content             | P
  Neutral                              | N

Slide 26: CLASSIFICATION: NAÏVE BAYES CLASSIFIER
- The focus of this approach is to find the polarity of tweets.
- Each word in a tweet is treated as a unique variable in the naïve Bayes model.
- Goal: the probability that a word belongs to the positive or the negative class.
- Pipeline: collect the data set for training → pre-process the data set → train the model → sentiment detection (positive, negative).

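LingPipe's classifier is a Java component; as a language-agnostic sketch of the underlying computation, here is a word-level naïve Bayes polarity classifier. The toy training tweets and the add-one smoothing are assumptions, not taken from the slides:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (list_of_words, label) pairs, label in {"pos", "neg"}.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(model, words):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log P(class) + sum of log P(word | class), with add-one smoothing
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data for illustration only.
model = train_nb([
    (["you", "are", "great"], "pos"),
    (["nice", "and", "kind"], "pos"),
    (["you", "are", "stupid"], "neg"),
    (["stupid", "loser"], "neg"),
])
print(classify(model, ["stupid", "loser"]))  # neg
```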
Slide 27: RESULTS
- Amazon's Mechanical Turk classified unlabelled data, which was used to verify and validate the newly labelled data provided by the machine learning algorithm.

Results (training on 500 tweets):
  Classifier     | Positive | Negative | Accuracy
  Naïve Bayes    | 65.7%    | 72.9%    | 67.3%
  Amazon's MTurk | 65.2%    | 74.0%    | 67.1%

Slide 28: CONCLUSION
- This approach leverages the power of sentiment analysis.
- The classifier was close to 70% accurate.
- The result is not as good as expected, owing to restrictions on accessing unlimited content from Twitter.

Slide 29: CYBERBULLYING BLOCKER APPLICATION FOR ANDROID
- New types of internet-connected devices, such as smartphones and tablets, have further exacerbated the problem of cyberbullying.
- An Android application automatically detects possible harmful content in a text.
- The application uses a machine learning method to spot undesirable content.

Slide 30: APPLICATION
- The application is built for devices running Android OS; Java 8 and Android Studio were used.
- It gives users an interface for detecting harmful content.

Slide 31: HARMFUL CONTENT DETECTION PROCESS
- The application contains one activity responsible for interacting with the user.
- For the harmful-content check, the application starts a background thread, so the user can keep using the device even if the check takes a while.
- Flow: the user inputs text on the mobile screen → pushes a button to select the method → feedback is shown to the user.

Slide 32: METHODOLOGY
- The method classifies messages as harmful or not using a classifier trained with a language modelling method based on a brute-force algorithm.
- Brute force: algorithms using a combinatorial approach usually generate a massive number of combinations, i.e. potential answers to a given problem.
- The algorithm is applied to automatically extract sentence patterns.
- Actual data collected by internet patrol (annotated by experts): 1,490 harmful and 1,508 non-harmful entries.
- All patterns used in classification are stored on the mobile device; the method operates locally and does not require an internet connection.

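The slides do not spell out the pattern-extraction step. One plausible reading of the brute-force combinatorial approach (the sentences below are invented) is to enumerate every ordered word sub-sequence up to a small length as a candidate sentence pattern, then score a message by how many known harmful patterns it contains:

```python
from itertools import combinations

def extract_patterns(words, max_len=3):
    # Brute force: every ordered sub-sequence of up to max_len words
    # becomes a candidate sentence pattern.
    patterns = set()
    for n in range(1, max_len + 1):
        patterns.update(combinations(words, n))
    return patterns

def matched_patterns(message, harmful_patterns):
    # Number of known harmful patterns that also occur in the message.
    return len(extract_patterns(message.split()) & harmful_patterns)

# Hypothetical harmful training sentence.
harmful = extract_patterns("you are a loser".split())
print(matched_patterns("you loser", harmful))  # 3
```

In the real application the pattern set would be distilled from the 1,490 annotated harmful entries and shipped with the app for offline use.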
Slide 33: METHOD (CONT.)

Slide 34: RESULT
- Precision = 79%
- Recall = 79%
- Requires minimal human effort
- Recall is the ratio of the number of relevant records retrieved to the total number of relevant records in the database.
- Precision is the ratio of the number of relevant records retrieved to the total number of records (relevant and irrelevant) retrieved.

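The two definitions above, applied to sets of record IDs (the numbers are arbitrary):

```python
def precision_recall(retrieved: set, relevant: set):
    # precision: fraction of retrieved records that are relevant
    # recall: fraction of relevant records that were retrieved
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(p, r)  # 0.75 0.6
```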
Slide 35: OTHER SOFTWARE PRODUCTS IN THE MARKET FOR DETECTING CYBERBULLYING
- FearNot!: an interactive drama/video game that teaches children strategies to prevent bullying and social exclusion.
- Samaritans Radar: alerted a user when it spotted someone being bullied, depressed, or sending disturbing suicidal signals; the application was stopped due to privacy concerns.
- ReThink: a smartphone application that shows a pop-up warning when a user tries to send a message containing harmful content.
- PocketGuardian: a parental monitoring app that detects not only cyberbullying texts but also harmful images, using a machine learning algorithm. Disadvantage: costs $4 per month.

Slide 36: PROPOSAL TO FILTER SUSPECTED MESSAGES
- A filtering mechanism classifies messages as "abusive" or "non-abusive" (or "positive" and "negative," respectively).
- In a practical system the filter will not be completely reliable; there will be false positives and false negatives in at least some cases.
- Some cases, such as threats, require extra effort; it is difficult to create an automated system that reliably recognizes threats that should be reported to the police.
- The problem of false positives and the problem of discarding threats can both be mitigated by diverting messages labelled abusive to a trusted third party.

Slide 37: EXAMPLE OF FILTERING SYSTEM

Slide 38: CHALLENGES
- Preventing the removal of valuable messages when filtering the data
- Privacy concerns
- Incidents should be reported as early as possible
- False reporting