Group Details:-

Dhara Shah               z3299353
Imad Hashmi              z3193866
Zuo Cui                  z3261136

Our Paper:- Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten and I. Osipkov,
Spamming Botnets: Signatures and Characteristics, in Proceedings of ACM
SIGCOMM 2008, pp. 171-182, Seattle, USA, August 2008.

Flow of the Literature Review is as follows:-

Introduction
Background and Previous Work
Focus on Technology used in Paper
Future and Related Work

Introduction

Email has become a widespread means of communication around the world, with
millions of messages transferred every minute, so it is unsurprising that the
service has long been abused. One of the most common abuses is spamming, which
advertisers use to push advertisements for their products to legitimate email
users. The following discussion covers the methods that anti-spam systems use
to detect spam emails and botnets.

Background and Previous Work

A great deal of research has addressed the identification and filtering of
email spam. Based on the part of the email used for detection, this work can be
broadly classified into two main categories: non-content-based and
content-based.

Non-content-based filters, also known as address-based filters, examine
information in the email header such as the IP address or email address.
Blacklists and whitelists are the common techniques in this category. A
blacklist records the IP addresses or email addresses that have sent spam;
conversely, a whitelist contains all acceptable email addresses. Both can be
deployed on client computers or on email servers. Cook et al. (2006)
experimented with a domain-specific blacklist that worked on the mail server to
reduce the amount of spam entering the network. However, a blacklist can easily
cause false positives: once a sender sends spam, its IP address or email
address is recorded in the blacklist, and consequently all other legitimate
mail from that address is marked as spam.
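
The false-positive problem described above can be illustrated with a minimal
sketch. The class name AddressFilter, its methods and the sample addresses are
hypothetical, and the choice to let the whitelist take precedence is an
assumption of this sketch rather than something taken from the cited work.

```python
from dataclasses import dataclass, field


@dataclass
class AddressFilter:
    """Address-based filter: decisions use header fields only, never content."""
    blacklist: set = field(default_factory=set)  # addresses/IPs seen sending spam
    whitelist: set = field(default_factory=set)  # addresses/IPs always accepted

    def classify(self, sender: str, source_ip: str) -> str:
        # Assumption of this sketch: whitelist entries win over blacklist entries.
        if sender in self.whitelist or source_ip in self.whitelist:
            return "accept"
        if sender in self.blacklist or source_ip in self.blacklist:
            return "reject"
        return "unknown"

    def report_spam(self, sender: str, source_ip: str) -> None:
        # A single spam report lists both the address and the IP.
        self.blacklist.update({sender, source_ip})


mail_filter = AddressFilter(whitelist={"colleague@example.org"})
mail_filter.report_spam("alice@example.com", "203.0.113.7")

# A later, legitimate message from the same address is rejected as well,
# even though it arrives from a different IP: this is the false positive.
later_verdict = mail_filter.classify("alice@example.com", "198.51.100.20")
```

Here later_verdict is "reject" purely because of the earlier report, which is
the weakness of address-based filtering noted above.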

Content-based detection filters spam by analyzing the message content of
received email, which overcomes the drawbacks of non-content-based filters.
These filters scan the content for sensitive keywords to identify spam. This
category includes heuristic filters and Bayesian filters.

Heuristic detection, also known as rule-based analysis, uses regular
expression rules to detect phrases or characters that are common in spam
mail. Rules can be defined on email header information, keywords or URLs in
the content. William Cohen (1996) successfully used learned rules to classify
emails into different folders, but there has been little related research on
rule-based spam detection.

However, the precision of spam detection relies on the rules set by mail
system managers, so defining the rules takes a significant amount of time.
After that, the rules must be refined frequently: if the pre-set rules on the
mail system are not updated, the filters will not work efficiently on new spam
with new features. In addition, the rules are rigid and can easily cause false
positives.
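
A rule-based filter of this kind can be sketched as a list of regular
expressions applied to parts of a message. The three rules, their names and the
sample messages below are hypothetical and chosen only for illustration; a real
rule set would be far larger and maintained by the mail system managers.

```python
import re

# Hypothetical rules: (name, compiled pattern, message part the rule inspects).
RULES = [
    ("subject_all_caps", re.compile(r"^[^a-z]*[A-Z][^a-z]*$"), "subject"),
    ("money_phrase", re.compile(r"\b(free money|act now|winner)\b", re.I), "body"),
    ("raw_ip_url", re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}"), "body"),
]


def matched_rules(message: dict) -> list:
    """Return the names of all rules that fire on the message."""
    return [name for name, pattern, part in RULES
            if pattern.search(message.get(part, ""))]


def is_spam(message: dict, threshold: int = 1) -> bool:
    """Flag the message when at least `threshold` rules fire."""
    return len(matched_rules(message)) >= threshold


spam = {"subject": "YOU ARE A WINNER",
        "body": "Act now: http://192.0.2.10/claim"}
ham = {"subject": "Quarterly report",
       "body": "The winner of the design review is team B."}

spam_hits = matched_rules(spam)  # all three rules fire
ham_hits = matched_rules(ham)    # the keyword rule fires on a legitimate mail
```

The second message shows the rigidity discussed above: the keyword rule fires
on a legitimate mail, so with a threshold of one it is a false positive.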

Because content-based detection of spam can be treated as a text
classification problem, several machine learning approaches have been applied
to it. The Bayesian approach is one of those proposed. In 1998, Bayesian
classification techniques were applied to spam filtering (Sahami et al. 1998).
A Bayesian filter records the occurrence of certain words or phrases in the
message content and then evaluates, from these statistics, the probability
that a message is spam. In the reported results, Bayesian filters eliminated
more than 95% of spam in the experiments and identified 80% of incoming junk
mail in a real scenario. Bayesian filters can therefore provide a high rate of
correct detection for plain-text content.
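
The idea of estimating a spam probability from word statistics can be sketched
with a small naive Bayes classifier. This is a generic multinomial naive Bayes
with add-one smoothing, not the exact model of Sahami et al.; the class name
NaiveBayesFilter and the tiny training set are assumptions made for
illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list:
    return text.lower().split()


class NaiveBayesFilter:
    """Word-count naive Bayes; assumes both classes have training data."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Class prior, then one smoothed likelihood term per word.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            log_score[label] = score
        # Normalise the two log scores into a probability for "spam".
        top = max(log_score.values())
        weight = {k: math.exp(v - top) for k, v in log_score.items()}
        return weight["spam"] / (weight["spam"] + weight["ham"])


bayes = NaiveBayesFilter()
bayes.train("cheap pills buy now", "spam")
bayes.train("buy cheap watches now", "spam")
bayes.train("meeting agenda for monday", "ham")
bayes.train("please review the agenda", "ham")

p_spam_like = bayes.spam_probability("buy cheap pills")
p_ham_like = bayes.spam_probability("monday meeting agenda")
```

With this training data the first probability is above 0.5 and the second is
below it. As the following paragraph notes, the quality of such a filter
depends entirely on the training data it has seen.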

Bayesian filtering is now widely used alongside other methods in many spam
detection technologies to improve accuracy. However, Bayesian filters have
some issues. First, as with other machine learning approaches, their accuracy
depends on the quality of the training data and the training process. Second,
although Bayesian filters provide high precision for plain-text content, they
have difficulty detecting the growing volume of spam that contains images.
Further research at Okayama University therefore addressed image spam (Uemura
et al. 2008). It designed a method that allows an existing Bayesian filter to
learn image information, such as the file size or name, and then evaluate the
probability from the learning results. The experiments showed that the false
negative rate dropped while the false positive rate stayed almost the same.
This means the method can play only a supporting role in Bayesian spam
identification, because images provide less information for distinguishing
spam from legitimate mail.

Content-based detection systems have many advantages, but they cost time and
a large amount of processing space because they go through the complete email.
There is a need for an anti-spam system that combines the advantages of
content-based and non-content-based spam detection.

AutoRE, a software system designed by the Microsoft research group and the
subject of our anchor paper, tries to combine both types of detection, i.e.
content-based and non-content-based. We now discuss in detail how AutoRE
combines the two.

Focus on Technology used in Paper

Unlike previous solutions for detecting botnets (such as Spamhaus or other
blacklists), AutoRE creates and trains itself dynamically in real time. When a
set of emails is supplied to it, it works in 3 major steps, which are as
follows:-
    1. URL Pre-processing
    2. Group Selector
    3. Regular Expression Generation
It is important to understand that the goal is not simply to classify emails
as spam or not spam. By a broad definition, any email sent regularly and in
bulk is spam, yet bulk email is not necessarily malicious: a normal user might
send an email to their complete contact list that is relevant to the
recipients. Our focus is on spam generated by botnets, which carries no
meaningful content and is sent only to accomplish some malicious mission.
Because botnets are automated, programmed systems, their sending behaviour
follows a pattern, and the steps listed above are designed to catch that
pattern. URL pre-processing considers the following parameters:-
    1. URL String
    2. Source server IP address
    3. Email sending time
All forwarded messages are discarded, because a legitimate forwarding server
could be mistaken for a botnet member. URL strings that look suspiciously
random or span multiple domains are extracted. URL strings such as a.com or
b.com are unlikely to come from botnets, because they are registered domain
names that are not economically feasible for spammers. The URL strings are
then broken down and grouped by domain name. Because spam emails are observed
to advertise a particular product or advertising campaign, domain-specific
signatures are created for each group. From these domain-specific signatures,
domain-agnostic regular expressions are derived, which lowers the false
positive rate and identifies botnets even when they change their domains.
Before the generalised regular expression is created, a domain-specific
signature must satisfy three properties: it must be distributed, bursty and
specific. Only then can it be classified as a spam signature.
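
The step from a domain-specific signature to a domain-agnostic regular
expression can be sketched as follows. This is a deliberately simple
common-prefix heuristic and not AutoRE's own signature-generation algorithm;
the function names, the example URLs and the decision to summarise the varying
tail as a character class with length bounds are all assumptions of this
sketch.

```python
import os
import re
from urllib.parse import urlsplit


def path_and_query(url: str) -> str:
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def tail_class(tails: list) -> str:
    """Pick a character class that covers every varying tail."""
    joined = "".join(tails)
    if joined.isascii() and joined.isalpha():
        return "[a-zA-Z]"
    if joined.isascii() and joined.isalnum():
        return "[a-zA-Z0-9]"
    return r"\S"


def signature_body(urls: list) -> str:
    """Fixed common prefix of the path/query plus a summarised random tail."""
    rests = [path_and_query(u) for u in urls]
    prefix = os.path.commonprefix(rests)
    tails = [rest[len(prefix):] for rest in rests]
    shortest, longest = min(map(len, tails)), max(map(len, tails))
    return re.escape(prefix) + f"{tail_class(tails)}{{{shortest},{longest}}}"


def domain_specific(urls: list) -> str:
    domain = urlsplit(urls[0]).netloc
    return r"https?://" + re.escape(domain) + signature_body(urls)


def domain_agnostic(urls: list) -> str:
    # Same body, but any host name is accepted in place of the original domain.
    return r"https?://[^/]+" + signature_body(urls)


group = [
    "http://shop-a.example/n/?167&abcdefghi",
    "http://shop-a.example/n/?167&qrstuvwxyzab",
]
specific = re.compile(domain_specific(group))
agnostic = re.compile(domain_agnostic(group))

# The same campaign after the spammer moves to a new domain.
moved = "http://shop-b.example/n/?167&zzzzzzzzzz"
```

The domain-specific pattern no longer matches the moved URL, while the
domain-agnostic pattern still does, which is the benefit described above.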

When grouping, it is important to decide how to group the domains, since n
emails may contain up to n different domain names. The distributed property is
assessed using temporal correlation, and the bursty property is assessed over
a span of 5 days, since it is observed that most ASes are active for a minimum
of 5 days.
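
A minimal sketch of the grouping and of two of the property checks is given
below. The record layout, the function names and the AS-count threshold are
hypothetical; only the 5-day span comes from the text above. The sketch assumes
the source AS of each sending IP has already been looked up, and it leaves out
the "specific" property.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from urllib.parse import urlsplit


@dataclass(frozen=True)
class UrlRecord:
    url: str             # URL string found in the email
    source_as: int       # AS number of the sending server (assumed pre-resolved)
    sent_at: datetime    # email sending time


def group_by_domain(records: list) -> dict:
    """Group URL records by the domain name in the URL."""
    groups = defaultdict(list)
    for record in records:
        groups[urlsplit(record.url).netloc.lower()].append(record)
    return dict(groups)


def is_distributed(group: list, min_ases: int) -> bool:
    # Sent from many different ASes (the threshold is a caller-chosen assumption).
    return len({record.source_as for record in group}) >= min_ases


def is_bursty(group: list, max_span: timedelta = timedelta(days=5)) -> bool:
    # All emails of the group fall inside the 5-day span mentioned above.
    times = [record.sent_at for record in group]
    return max(times) - min(times) <= max_span


start = datetime(2007, 7, 1)
records = [
    UrlRecord("http://spam.example/a1", 64501, start),
    UrlRecord("http://spam.example/b2", 64502, start + timedelta(hours=6)),
    UrlRecord("http://spam.example/c3", 64503, start + timedelta(days=2)),
    UrlRecord("http://news.example/today", 64510, start),
    UrlRecord("http://news.example/later", 64510, start + timedelta(days=30)),
]
groups = group_by_domain(records)
candidates = sorted(domain for domain, group in groups.items()
                    if is_distributed(group, min_ases=3) and is_bursty(group))
```

Only the group sent from several ASes within a short period remains a
candidate for signature generation.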

Once the domains have been classified into groups, the next step is to
generate the regular expression. Generating a regular expression rather than a
token conjunction reduces the false positive rate, because the keywords used
in a token conjunction are words that may or may not be part of an email.
After creating the domain-specific groups, we create a signature for each
group, so that classification no longer depends on the group's domain and
becomes domain-agnostic. This ensures that if the botnets change domain in
future they will still be detected, because the new domain will match the same
regular expression and group signature that classifies them as spam. This is a
distinctive feature of AutoRE that helps it find the maximum number of spam
emails with a minimum false positive rate.

After categorizing the emails and assigning them their respective regular
expressions, it is important to verify that the emails classified as spam
really are spam. This takes 2 steps. First, the suspected IP addresses are
queried against a blacklist; those found in the list are directly classified
as spam senders. For those that are not, a behavioural test is run to decide
whether they are spam. The behavioural test is performed on each campaign, and
the points to take care of are as follows:-

1). Similarity of Email Properties
2). Similarity of Sending Time
3). Similarity of Email Sending Behaviour

Because the targeted emails are generated and sent by automated systems, the
properties listed above play a big role. However random the sending algorithm
is designed to be, the frequency of sending means that a pattern is bound to
emerge.
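
The two verification steps can be sketched as a blacklist lookup followed by a
check on one of the three similarity points, the sending time. The function
name, the 12-hour window and the 80% fraction are hypothetical values chosen
for illustration and are not taken from the paper; the email-property and
sending-behaviour checks are omitted.

```python
from datetime import datetime, timedelta


def verify_campaign(emails, blacklist,
                    window=timedelta(hours=12), min_fraction=0.8):
    """emails: list of (source_ip, sent_at) pairs matched by one signature.

    Returns (confirmed, suspected): IPs found on the blacklist, and unlisted
    IPs whose campaign shows tightly clustered sending times.
    """
    confirmed = {ip for ip, _ in emails if ip in blacklist}
    unlisted = {ip for ip, _ in emails if ip not in blacklist}
    if not unlisted:
        return confirmed, set()

    # Behavioural test (sending time only): most emails of the campaign
    # should fall within `window` of the middle sending time.
    times = sorted(sent_at for _, sent_at in emails)
    centre = times[len(times) // 2]
    close = sum(abs(t - centre) <= window for t in times)
    synchronized = close / len(times) >= min_fraction
    return confirmed, (unlisted if synchronized else set())


t0 = datetime(2007, 7, 1, 9, 0)
campaign = [
    ("203.0.113.1", t0),
    ("203.0.113.2", t0 + timedelta(hours=1)),
    ("203.0.113.3", t0 + timedelta(hours=2)),
]
confirmed, suspected = verify_campaign(campaign, blacklist={"203.0.113.1"})
```

In this example one IP is confirmed by the blacklist, and the other two are
flagged because the whole campaign was sent within a few hours.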

The work does not end there: by means of this software we can study the
characteristics of botnets and predict the traffic and spam emails they are
going to generate. The study of botnets has revealed many facts that are
pointers for future research in anti-spam systems. In the next section we
present the results of the study on botnets and their use in technologies that
emerged after AutoRE.

Future and Related Work

Characteristics of Botnets and their use in present anti spam systems:-

1). Spam Sending Patterns over the network
This characteristic is used in A Dynamic Reputation Service for Spotting
Spammers [1]. SpamSpotter is real-time (like AutoRE) reputation software for
filtering spam messages. It classifies email senders in real time based on
their global sending behaviour, an approach called behavioural detection.
SpamSpotter then applies a third-party behavioural machine learning algorithm
to this data to generate a reputation for each sender. A preferred algorithm
in SpamSpotter is SNARE [5], a network-level behavioural algorithm that
identifies spam senders by their email sending behaviour rather than by their
addresses or the content they send. In some cases, SNARE is efficient enough
to identify a spammer before it has sent a large number of email messages.
AutoRE studied similar behaviour, though SpamSpotter goes a step further by
implementing the SNARE algorithm to calculate the reputation of a sender.

2). Distribution of IP Address
One of the botnet characteristics studied during the AutoRE experiments was
the distribution of IP addresses. This is an important characteristic to
study, as it helps us understand and stop the wide spread of botnets. This
property has been extended by Studying Spamming Botnets Using Botlab [6].
Botnets are the most widely used spamming technique today: it is estimated
that 85% of the billions of spam messages are generated by botnets. The paper
presents a botnet monitoring platform called Botlab, which monitors all
incoming spam traffic at a certain location. It scans the spam messages and
obtains bot binaries through spam links. A human operator then runs specific
tools on these binaries to obtain information about the bots sending the spam.
Botlab then executes multiple captive, sandboxed nodes from various botnets,
allowing it to observe the precise outgoing spam feeds from these nodes. It
scours the spam feeds for URLs, gathers information on scams, and identifies
exploit links. Finally, it correlates the incoming and outgoing spam feeds to
identify the most active botnets and the set of compromised hosts comprising
each botnet.
Another extension that studies the characteristics of botnets and uses them
for detection is BotGraph: Large Scale Spamming Botnet Detection [2]. BotGraph
detects the abnormal sharing of IP addresses among account holders in an email
system. Applied to two months of Hotmail logs totalling 450GB of data,
BotGraph identified over 26 million bot accounts with a low false positive
rate of 0.44%. BotGraph has also been implemented as a distributed, clustered
algorithm using the MapReduce technique. It can detect both botnet sign-ups
and botnet email accounts that have already been created.
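
The idea of spotting abnormal IP sharing among accounts can be sketched as a
small graph problem: link two accounts when they share enough IP addresses and
report the connected groups. This is only an illustration of the idea and not
BotGraph's actual construction; the function name, the min_shared threshold
and the sample login data are assumptions, and the pairwise comparison used
here would not scale to the log sizes quoted above.

```python
from collections import defaultdict
from itertools import combinations


def shared_ip_components(logins, min_shared=2):
    """logins: iterable of (account, ip) pairs.

    Returns a list of account sets; accounts are linked when they share at
    least `min_shared` IP addresses, directly or through other accounts.
    """
    ips_by_account = defaultdict(set)
    for account, ip in logins:
        ips_by_account[account].add(ip)

    # Build the account graph: an edge means "shares enough IP addresses".
    neighbours = defaultdict(set)
    for a, b in combinations(sorted(ips_by_account), 2):
        if len(ips_by_account[a] & ips_by_account[b]) >= min_shared:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Collect connected components with an iterative depth-first search.
    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbours[node] - seen)
        components.append(component)
    return components


logins = [
    ("bot1", "198.51.100.1"), ("bot1", "198.51.100.2"),
    ("bot2", "198.51.100.1"), ("bot2", "198.51.100.2"), ("bot2", "198.51.100.3"),
    ("bot3", "198.51.100.2"), ("bot3", "198.51.100.3"),
    ("user", "192.0.2.50"),
]
groups_found = shared_ip_components(logins)
```

Here the three accounts that keep appearing on the same addresses end up in one
group, while the ordinary account is not linked to anyone.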

Another interesting finding from the AutoRE research, in the category of
network traffic scanning, was the increase in the use of static IP addresses
from Nov 2006 to July 2007. This finding helped improve blacklists by
populating them with static IP addresses. The research also suggested that
botnets are evolving and creating more sophisticated and polymorphic URLs to
bypass anti-spam systems.

One major disadvantage of AutoRE is that it has not been implemented in
practice in real time. Its methods are still under investigation, and its
real-time deployment is still awaited.

References

1). A Dynamic Reputation Service for Spotting Spammers, Anirudh
Ramachandran, Shuang Hao, Hitesh Khandelwal, Nick Feamster, Santosh
Vempala, School of Computer Science, Georgia Tech
http://www.cs.purdue.edu/homes/hkhande/projects/spam/spam_nsdi.pdf

2). BotGraph: Large Scale Spamming Botnet Detection
http://research.microsoft.com/pubs/79413/botgraph.pdf

Sahami, M, Dumais, S, Heckerman, D & Horvitz, E 1998, ‘A Bayesian approach to
filtering junk E-mail’, Learning for Text Categorization: Papers from the 1998
Workshop, Madison, Wisconsin.

3). Cohen, W 1996 ‘Learning Rules that Classify E-Mail’, Advances in Inductive
Logic Programming, pp. 124-143

4). Cook, D, Hartnett, J, Manderson, K & Scanlan, J 2006, ‘Catching Spam Before
it Arrives: Domain Specific Dynamic Blacklists’, Proceedings of the 2006
Australasian workshops on Grid computing and e-research, Vol. 54, pp. 193-202.


5). SNARE: Spatio-temporal Network-level Automatic Reputation Engine
http://hdl.handle.net/1853/25135

6). Studying Spamming Botnets Using Botlab
http://www.cs.washington.edu/homes/arvind/papers/botlab.pdf


7). Uemura, M & Tabata, T 2008, ‘Design and Evaluation of a
Bayesian-filter-based Image Spam Filtering Method’, 2008 International
Conference on Information Security and Assurance, IEEE

Más contenido relacionado

La actualidad más candente

Spam Email: 8 Dos and Dont's
Spam Email: 8 Dos and Dont'sSpam Email: 8 Dos and Dont's
Spam Email: 8 Dos and Dont'sSaneBox
 
A Survey: SMS Spam Filtering
A Survey: SMS Spam FilteringA Survey: SMS Spam Filtering
A Survey: SMS Spam Filteringijtsrd
 
Detecting Spambot as an Antispam Technique for Web Internet BBS
Detecting Spambot as an Antispam Technique for Web Internet BBSDetecting Spambot as an Antispam Technique for Web Internet BBS
Detecting Spambot as an Antispam Technique for Web Internet BBSijsrd.com
 
How an Enterprise SPAM Filter Works
How an Enterprise SPAM Filter Works How an Enterprise SPAM Filter Works
How an Enterprise SPAM Filter Works Pinpointe On-Demand
 
Survey on spam filtering
Survey on spam filteringSurvey on spam filtering
Survey on spam filteringChippy Thomas
 
A multi layer architecture for spam-detection system
A multi layer architecture for spam-detection systemA multi layer architecture for spam-detection system
A multi layer architecture for spam-detection systemcsandit
 
Spam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes AlgorithmSpam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes AlgorithmAkshay Pal
 
A review of spam filtering and measures of antispam
A review of spam filtering and measures of antispamA review of spam filtering and measures of antispam
A review of spam filtering and measures of antispamAlexander Decker
 
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...IJNSA Journal
 
E-Mail Security Using Spam Mail Detection and Filtering System
E-Mail Security Using Spam Mail Detection and Filtering SystemE-Mail Security Using Spam Mail Detection and Filtering System
E-Mail Security Using Spam Mail Detection and Filtering Systemrahulmonikasharma
 

La actualidad más candente (17)

Spam Email identification
Spam Email identificationSpam Email identification
Spam Email identification
 
Spam Filtering
Spam FilteringSpam Filtering
Spam Filtering
 
Spam Email: 8 Dos and Dont's
Spam Email: 8 Dos and Dont'sSpam Email: 8 Dos and Dont's
Spam Email: 8 Dos and Dont's
 
B0940509
B0940509B0940509
B0940509
 
A Survey: SMS Spam Filtering
A Survey: SMS Spam FilteringA Survey: SMS Spam Filtering
A Survey: SMS Spam Filtering
 
402 406
402 406402 406
402 406
 
Detecting Spambot as an Antispam Technique for Web Internet BBS
Detecting Spambot as an Antispam Technique for Web Internet BBSDetecting Spambot as an Antispam Technique for Web Internet BBS
Detecting Spambot as an Antispam Technique for Web Internet BBS
 
How an Enterprise SPAM Filter Works
How an Enterprise SPAM Filter Works How an Enterprise SPAM Filter Works
How an Enterprise SPAM Filter Works
 
Survey on spam filtering
Survey on spam filteringSurvey on spam filtering
Survey on spam filtering
 
Spam
SpamSpam
Spam
 
Spam, security
Spam, securitySpam, security
Spam, security
 
A multi layer architecture for spam-detection system
A multi layer architecture for spam-detection systemA multi layer architecture for spam-detection system
A multi layer architecture for spam-detection system
 
Spam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes AlgorithmSpam filtering with Naive Bayes Algorithm
Spam filtering with Naive Bayes Algorithm
 
Jt3616901697
Jt3616901697Jt3616901697
Jt3616901697
 
A review of spam filtering and measures of antispam
A review of spam filtering and measures of antispamA review of spam filtering and measures of antispam
A review of spam filtering and measures of antispam
 
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...
MINIMIZING THE TIME OF SPAM MAIL DETECTION BY RELOCATING FILTERING SYSTEM TO ...
 
E-Mail Security Using Spam Mail Detection and Filtering System
E-Mail Security Using Spam Mail Detection and Filtering SystemE-Mail Security Using Spam Mail Detection and Filtering System
E-Mail Security Using Spam Mail Detection and Filtering System
 

Destacado

Bachelorthesis
BachelorthesisBachelorthesis
BachelorthesisDhara Shah
 
NetworkPaperthesis1
NetworkPaperthesis1NetworkPaperthesis1
NetworkPaperthesis1Dhara Shah
 
Interactive Powerpoint
Interactive PowerpointInteractive Powerpoint
Interactive Powerpointpurofutbol
 
Vremena Goda
Vremena GodaVremena Goda
Vremena GodaCaHHu
 
Soccer Presentation1
Soccer Presentation1Soccer Presentation1
Soccer Presentation1purofutbol
 
Organizational Culture MEASURING
Organizational Culture MEASURINGOrganizational Culture MEASURING
Organizational Culture MEASURINGbertvanderlinden
 
Data_Management_Seminar_Dhara_Shah
Data_Management_Seminar_Dhara_ShahData_Management_Seminar_Dhara_Shah
Data_Management_Seminar_Dhara_ShahDhara Shah
 
Bachelorthesis.compressed
Bachelorthesis.compressedBachelorthesis.compressed
Bachelorthesis.compressedDhara Shah
 
Geert Hofstede model for analysing organizational cultures
Geert Hofstede model for analysing organizational culturesGeert Hofstede model for analysing organizational cultures
Geert Hofstede model for analysing organizational culturesbertvanderlinden
 
Welcome To Computers
Welcome To ComputersWelcome To Computers
Welcome To Computersconnerk
 
NetworkPaperthesis2
NetworkPaperthesis2NetworkPaperthesis2
NetworkPaperthesis2Dhara Shah
 
Network paperthesis2
Network paperthesis2Network paperthesis2
Network paperthesis2Dhara Shah
 
web training
web trainingweb training
web trainingsourabh4u
 
measuring organizational culture
measuring organizational culturemeasuring organizational culture
measuring organizational culturebertvanderlinden
 
I phone programming project report
I phone programming project reportI phone programming project report
I phone programming project reportDhara Shah
 
Reinventing Healthcare to Serve People, Not Institutions
Reinventing Healthcare to Serve People, Not InstitutionsReinventing Healthcare to Serve People, Not Institutions
Reinventing Healthcare to Serve People, Not InstitutionsTim O'Reilly
 

Destacado (20)

Bachelorthesis
BachelorthesisBachelorthesis
Bachelorthesis
 
NetworkPaperthesis1
NetworkPaperthesis1NetworkPaperthesis1
NetworkPaperthesis1
 
Interactive Powerpoint
Interactive PowerpointInteractive Powerpoint
Interactive Powerpoint
 
Vremena Goda
Vremena GodaVremena Goda
Vremena Goda
 
Soccer Presentation1
Soccer Presentation1Soccer Presentation1
Soccer Presentation1
 
Organizational Culture MEASURING
Organizational Culture MEASURINGOrganizational Culture MEASURING
Organizational Culture MEASURING
 
Data_Management_Seminar_Dhara_Shah
Data_Management_Seminar_Dhara_ShahData_Management_Seminar_Dhara_Shah
Data_Management_Seminar_Dhara_Shah
 
Bachelorthesis.compressed
Bachelorthesis.compressedBachelorthesis.compressed
Bachelorthesis.compressed
 
Master_Thesis
Master_ThesisMaster_Thesis
Master_Thesis
 
Master thesis
Master thesisMaster thesis
Master thesis
 
Geert Hofstede model for analysing organizational cultures
Geert Hofstede model for analysing organizational culturesGeert Hofstede model for analysing organizational cultures
Geert Hofstede model for analysing organizational cultures
 
Welcome To Computers
Welcome To ComputersWelcome To Computers
Welcome To Computers
 
NetworkPaperthesis2
NetworkPaperthesis2NetworkPaperthesis2
NetworkPaperthesis2
 
Network paperthesis2
Network paperthesis2Network paperthesis2
Network paperthesis2
 
web training
web trainingweb training
web training
 
measuring organizational culture
measuring organizational culturemeasuring organizational culture
measuring organizational culture
 
Natural science 2 reviewer
Natural science 2 reviewerNatural science 2 reviewer
Natural science 2 reviewer
 
I phone programming project report
I phone programming project reportI phone programming project report
I phone programming project report
 
My personal brand
My personal brandMy personal brand
My personal brand
 
Reinventing Healthcare to Serve People, Not Institutions
Reinventing Healthcare to Serve People, Not InstitutionsReinventing Healthcare to Serve People, Not Institutions
Reinventing Healthcare to Serve People, Not Institutions
 

Similar a Network paperthesis1

Identification of Spam Emails from Valid Emails by Using Voting
Identification of Spam Emails from Valid Emails by Using VotingIdentification of Spam Emails from Valid Emails by Using Voting
Identification of Spam Emails from Valid Emails by Using VotingEditor IJCATR
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Editor IJCATR
 
Analysis of an image spam in email based on content analysis
Analysis of an image spam in email based on content analysisAnalysis of an image spam in email based on content analysis
Analysis of an image spam in email based on content analysisijnlc
 
A multi layer architecture for spam-detection system
A multi layer architecture for spam-detection systemA multi layer architecture for spam-detection system
A multi layer architecture for spam-detection systemcsandit
 
How to Keep Spam Off Your Network
How to Keep Spam Off Your NetworkHow to Keep Spam Off Your Network
How to Keep Spam Off Your NetworkGFI Software
 
Thematic and self learning method for
Thematic and self learning method forThematic and self learning method for
Thematic and self learning method forijcsa
 
Monitoring the Spread of Active Worms in Internet
Monitoring the Spread of Active Worms in InternetMonitoring the Spread of Active Worms in Internet
Monitoring the Spread of Active Worms in InternetIOSR Journals
 
miniproject.ppt.pptx
miniproject.ppt.pptxminiproject.ppt.pptx
miniproject.ppt.pptxAnush90
 
Seminar on web mail filter
Seminar   on   web mail filterSeminar   on   web mail filter
Seminar on web mail filterShalini Gs
 
Do Humans Beat Computers At Pattern Recognition
Do Humans Beat Computers At Pattern RecognitionDo Humans Beat Computers At Pattern Recognition
Do Humans Beat Computers At Pattern RecognitionBitdefender
 
Overview of Anti-spam filtering Techniques
Overview of Anti-spam filtering TechniquesOverview of Anti-spam filtering Techniques
Overview of Anti-spam filtering TechniquesIRJET Journal
 
Identifying Valid Email Spam Emails Using Decision Tree
Identifying Valid Email Spam Emails Using Decision TreeIdentifying Valid Email Spam Emails Using Decision Tree
Identifying Valid Email Spam Emails Using Decision TreeEditor IJCATR
 
Blockmail Technical White Paper
Blockmail   Technical White PaperBlockmail   Technical White Paper
Blockmail Technical White Paperniallmmackey
 

Similar a Network paperthesis1 (20)

Identification of Spam Emails from Valid Emails by Using Voting
Identification of Spam Emails from Valid Emails by Using VotingIdentification of Spam Emails from Valid Emails by Using Voting
Identification of Spam Emails from Valid Emails by Using Voting
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
Spam Detection in Social Networks Using Correlation Based Feature Subset Sele...
 
Analysis of an image spam in email based on content analysis
Analysis of an image spam in email based on content analysisAnalysis of an image spam in email based on content analysis
Analysis of an image spam in email based on content analysis
 
DEVELOPMENT OF AN EFFECTIVE BAYESIAN APPROACH FOR SPAM FILTERING
DEVELOPMENT OF AN EFFECTIVE BAYESIAN APPROACH FOR SPAM FILTERINGDEVELOPMENT OF AN EFFECTIVE BAYESIAN APPROACH FOR SPAM FILTERING
DEVELOPMENT OF AN EFFECTIVE BAYESIAN APPROACH FOR SPAM FILTERING
 
A multi layer architecture for spam-detection system
A multi layer architecture for spam-detection systemA multi layer architecture for spam-detection system
A multi layer architecture for spam-detection system
 
How to Keep Spam Off Your Network
How to Keep Spam Off Your NetworkHow to Keep Spam Off Your Network
How to Keep Spam Off Your Network
 
Thematic and self learning method for
Thematic and self learning method forThematic and self learning method for
Thematic and self learning method for
 
Monitoring the Spread of Active Worms in Internet
Monitoring the Spread of Active Worms in InternetMonitoring the Spread of Active Worms in Internet
Monitoring the Spread of Active Worms in Internet
 
miniproject.ppt.pptx
miniproject.ppt.pptxminiproject.ppt.pptx
miniproject.ppt.pptx
 
Seminar on web mail filter
Seminar   on   web mail filterSeminar   on   web mail filter
Seminar on web mail filter
 
Do Humans Beat Computers At Pattern Recognition
Do Humans Beat Computers At Pattern RecognitionDo Humans Beat Computers At Pattern Recognition
Do Humans Beat Computers At Pattern Recognition
 
Overview of Anti-spam filtering Techniques
Overview of Anti-spam filtering TechniquesOverview of Anti-spam filtering Techniques
Overview of Anti-spam filtering Techniques
 
SPAM FILTERS
SPAM FILTERSSPAM FILTERS
SPAM FILTERS
 
Identifying Valid Email Spam Emails Using Decision Tree
Identifying Valid Email Spam Emails Using Decision TreeIdentifying Valid Email Spam Emails Using Decision Tree
Identifying Valid Email Spam Emails Using Decision Tree
 
Email deliverability
Email deliverabilityEmail deliverability
Email deliverability
 
Blockmail Technical White Paper
Blockmail   Technical White PaperBlockmail   Technical White Paper
Blockmail Technical White Paper
 
M dgx mde0mde=
M dgx mde0mde=M dgx mde0mde=
M dgx mde0mde=
 

Network paperthesis1

  • 1. Group Details:-

Dhara Shah z3299353
Imad Hashmi z3193866
Zuo Cui z3261136

Our Paper:- Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten and I. Osipkov, Spamming Botnets: Signatures and Characteristics, in Proceedings of ACM SIGCOMM 2008, pp. 171-182, Seattle, USA, August 2008.

Flow of the Literature Review is as follows:-

Introduction
Background and Previous Work
Focus on Technology used in Paper
Future and Related Work

Introduction

Email has become a widespread means of communication around the world, with millions of messages transferred every minute, so it is understandable that illegitimate use of the email service has also been in practice for a long time. One of the many abuses of this service is spamming, which advertisers around the world use to send advertisements for their products to legitimate email users. The following discussion covers the methods used by anti-spam systems to detect spam emails and botnets.

Background and Previous Work

There has been a lot of research on the identification and filtering of email spam. Based on the part of the email used for spam detection, this work can generally be classified into two main categories: non-content-based and content-based.

Non-content-based filters are also known as address-based filters. They examine information in the email header, such as the IP address or email address. Blacklists and whitelists are the common techniques in this category. A blacklist records the IP addresses or email addresses that send spam; conversely, a whitelist contains all acceptable email addresses. Both can be deployed on client computers or on email servers. Cook et al. (2006) experimented with a domain-specific blacklist that worked on the mail server to reduce the amount of spam entering the network. However, a blacklist can easily cause false positives. If a sender is caught sending spam, its IP address or email address is recorded in the blacklist. Consequently, all other legitimate mail from that address is also marked as spam.
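The address-based filtering described above, including its false-positive problem, can be illustrated with a minimal sketch. The function name `classify_by_address`, the three result labels and all addresses are assumptions made for illustration; they do not come from Cook et al. or from any particular mail system.

```python
def classify_by_address(sender_ip, sender_email, blacklist, whitelist):
    """Address-based (non-content-based) check: only header data is used."""
    if sender_email in whitelist:
        return "accept"
    if sender_ip in blacklist or sender_email in blacklist:
        return "spam"
    return "unknown"


# Hypothetical lists; the IP addresses are documentation-only ranges.
blacklist = {"203.0.113.5"}
whitelist = {"colleague@example.org"}

# A normally legitimate host sends spam once and is recorded in the blacklist.
blacklist.add("192.0.2.10")

# Every later message from that host is now rejected, even legitimate mail:
later_mail = classify_by_address("192.0.2.10", "user@example.net", blacklist, whitelist)
known_good = classify_by_address("192.0.2.10", "colleague@example.org", blacklist, whitelist)
stranger = classify_by_address("198.51.100.20", "new@example.com", blacklist, whitelist)
```

In this sketch the whitelist is checked first, which is one possible design choice; a system that checks the blacklist first would also reject the whitelisted sender.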
  • 2. Content-based detection filters spam by analysing the message content of received email, which overcomes the drawbacks of non-content-based filters. Such filters scan the content for sensitive keywords in order to identify spam. This category includes heuristic filters and Bayesian filters.

Heuristic detection, also known as rule-based analysis, uses regular-expression rules to detect phrases or characters that are common in spam mail. Rules can be defined over email header information, keywords or URLs in the content. William Cohen (1996) successfully used learned rules to classify emails into different folders, but there is little related research on rule-based spam detection. The precision of this approach depends on the rules set by the mail system managers, so defining the rules takes a significant amount of time, and the rules must then be refined frequently. If the preset rules on the mail system are not updated, the filters will not work efficiently on new spam with new features. In addition, the rules are rigid and prone to false positives.

Because content-based spam detection can be treated as a text classification problem, several machine learning approaches have been applied to it. Bayesian classification is one of those proposed. In 1998, Bayesian classification techniques were applied to spam filtering (Sahami et al., 1998). The filter records the occurrence of certain words or phrases in the message content and then uses these statistics to estimate the probability that a message is spam. As a result, the Bayesian filters eliminated more than 95% of spam in the experiments and identified 80% of incoming junk mail in a real scenario. Bayesian filters can clearly provide a high rate of correct detection on plain-text content.

Bayesian filtering is now widely used together with other methods in many spam detection technologies to improve accuracy. However, Bayesian filters have some issues. First, as with other machine learning approaches, their accuracy depends on the quality of the training data and the training process. Second, although Bayesian filters provide high precision for plain-text content, they have difficulty detecting the rapidly growing volume of spam that contains images. Further research at Okayama University therefore addressed the detection of image spam (Uemura & Tabata 2008). It designed a method that allows an existing Bayesian filter to learn image information, such as the file size or name, and then evaluate the spam probability from the learning results. Experiments with this method showed that the false negative rate dropped while the false positive rate stayed almost the same. This means the method can only play a supporting role in identifying spam with a Bayesian filter, because images provide less information for distinguishing spam from legitimate mail.
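The word-statistics idea behind Bayesian filtering can be sketched as a small naive Bayes classifier. This is a generic textbook formulation with add-one smoothing, not the exact model of Sahami et al.; the function names and the four training messages are made up for illustration.

```python
import math
from collections import Counter


def train(messages):
    """messages: list of (text, is_spam). Count words and messages per class."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, class_counts


def spam_probability(text, word_counts, class_counts):
    """Estimate P(spam | text) with naive Bayes and add-one smoothing."""
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    total_messages = sum(class_counts.values())
    log_score = {}
    for label in (True, False):
        words_in_class = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_messages)
        for word in text.lower().split():
            count = word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (words_in_class + len(vocabulary)))
        log_score[label] = score
    # Normalise the two log scores into a probability.
    highest = max(log_score.values())
    weights = {label: math.exp(s - highest) for label, s in log_score.items()}
    return weights[True] / (weights[True] + weights[False])


# Tiny hypothetical training set; real filters need far more data.
training = [
    ("cheap pills buy now", True),
    ("buy cheap watches now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday with the team", False),
]
word_counts, class_counts = train(training)
p_spam_like = spam_probability("buy cheap pills", word_counts, class_counts)
p_ham_like = spam_probability("monday meeting", word_counts, class_counts)
```

The sketch also shows why training data quality matters: the estimate depends entirely on which words happened to appear in each class, and an image-only message would contribute no words at all.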
  • 3. Content-based detection has many advantages, but it costs considerable time and processing space because it goes through the complete email. There is a need for an anti-spam system that combines the advantages of content-based and non-content-based spam detection. AutoRE, a system designed by a Microsoft Research group and described in our anchor paper, tries to combine both types of detection. We now discuss in detail how AutoRE combines the two.

Focus on Technology used in Paper

Unlike previous solutions for detecting botnets (such as Spamhaus and other blacklists), AutoRE creates and trains itself dynamically. To do this it applies three major steps to a set of emails supplied to it:

1. URL Pre-processing
2. Group Selector
3. Regular Expression Generation

It is important to understand that the goal is not simply to separate spam from non-spam. Loosely defined, any email that is sent regularly and in bulk is spam, but such email is not necessarily malicious: an ordinary user might send a message to an entire contact list that is nevertheless relevant to its recipients. Our focus is on spam generated by botnets, which is not relevant to its recipients, carries no real meaning and is sent only to accomplish some malicious purpose. Because botnets are autonomous, programmed systems, there is a pattern in their sending behaviour. The steps listed above are followed to catch that pattern.

During URL pre-processing the following parameters are considered:

1. URL string
2. Source server IP address
3. Email sending time

All forwarded messages are discarded, because a legitimate forwarding server could be mistaken for a botnet member. URL strings that look suspiciously random or that span multiple domains are extracted. URL strings such as a.com or b.com are unlikely to come from botnets, because they are registered domain names, which are not economically feasible for spammers.

The URL strings are then broken down and grouped by domain name. Because spam emails are observed to advertise a particular product or a particular advertising campaign, domain-specific signatures are created first. From these domain-specific signatures, domain-agnostic regular expressions are generated, which give better results in the form of lower false positive rates and which still identify botnets when they change their domains. Before the generalised regular expression is created, a domain-specific signature must satisfy three properties: it must be distributed, bursty and specific. Only then can it be classified as a spam signature.
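The grouping of URLs by domain and the check for distributed, bursty sending can be sketched as follows. This is a simplification under stated assumptions and not AutoRE's implementation: "distributed" is approximated by the number of distinct source IP addresses, the thresholds `min_sources` and `max_span` are illustrative, the "specific" property is not modelled, and all records are invented.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from urllib.parse import urlparse


def group_by_domain(records):
    """records: iterable of (url, source_ip, sent_at). Group them by URL domain."""
    groups = defaultdict(list)
    for url, source_ip, sent_at in records:
        groups[urlparse(url).hostname].append((source_ip, sent_at))
    return groups


def is_distributed_and_bursty(group, min_sources=3, max_span=timedelta(days=5)):
    """True if many distinct sources used the domain within a short time span."""
    sources = {ip for ip, _ in group}
    times = [sent_at for _, sent_at in group]
    return len(sources) >= min_sources and max(times) - min(times) <= max_span


# Hypothetical pre-processed records: URL string, source IP, sending time.
records = [
    ("http://shop.example/p/ab12", "192.0.2.1", datetime(2007, 7, 1, 9)),
    ("http://shop.example/p/cd34", "198.51.100.7", datetime(2007, 7, 1, 10)),
    ("http://shop.example/p/ef56", "203.0.113.9", datetime(2007, 7, 2, 8)),
    ("http://news.example/today", "192.0.2.44", datetime(2007, 7, 1, 9)),
]
groups = group_by_domain(records)
candidates = sorted(d for d, g in groups.items() if is_distributed_and_bursty(g))
```

Here only `shop.example` is kept as a candidate campaign, because it was advertised from three different sources within two days, while `news.example` appears in a single message.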
  • 4. When grouping, it is very important to decide how the domains are grouped, because n emails may contain up to n different domain names. For the distributed property, temporal correlation is taken into account, and the bursty property is evaluated over a span of 5 days, since it was observed that most ASes are active for at least 5 days.

Once the domains have been classified into groups, the next step is to generate the regular expression. Generating a regular expression rather than a token conjunction helps reduce the false positive rate, because the keywords used in a token conjunction are words that may or may not be part of an email. After the domain-specific groups are created, a signature is generated for each group; classification is then no longer tied to the group's domain, that is, it is domain-agnostic. This ensures that if the botnets change domain in the future they will still be detected, because URLs on the new domain match the same regular expression and group signature that classify them as spam. This is a unique feature of AutoRE, which helps it find the maximum number of spam emails with a minimum false positive rate.

After the emails have been categorised and assigned their respective regular expressions, it is very important to verify that the emails classified as spam really are spam. This takes two steps. First, the suspected IP addresses are checked against a blacklist; those found in the list are classified as spam directly. For those that are not found, a behavioural test is run to decide whether they are spam. This behavioural test is carried out on each campaign and considers the following points:

1). Similarity of Email Properties
2). Similarity of Sending Time
3). Similarity of Email Sending Behaviour

Because the targeted emails are generated and sent by automated systems, these properties play a big role. As botnets are automated systems, they are bound to show a pattern: however random the sending algorithm is designed to be, the frequency of occurrence eventually produces one. The work does not end here, because the software also makes it possible to study the characteristics of botnets and to predict the traffic and spam emails they are going to generate. This study of botnets has revealed many facts that are pointers for future research on anti-spam systems. The next section presents the results of the study of botnets and their use in technologies that emerged after AutoRE.
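The step from a domain-specific signature to a domain-agnostic regular expression can be illustrated with a toy generaliser. It is a deliberately simple stand-in and not the regular expression generation algorithm of the paper: path segments that are identical in all sample URLs are kept literally, differing segments become a length-bounded wildcard, and the domain is replaced by a wildcard. The function name and all URLs are hypothetical.

```python
import re
from urllib.parse import urlparse


def domain_agnostic_regex(urls):
    """Build one regex from sample URLs of a campaign, ignoring the domain."""
    paths = [urlparse(url).path.split("/") for url in urls]
    if len({len(p) for p in paths}) != 1:
        raise ValueError("sample URLs must have the same number of path segments")
    parts = []
    for segments in zip(*paths):
        if len(set(segments)) == 1:
            # Same text in every sample: keep it as a literal.
            parts.append(re.escape(segments[0]))
        else:
            # Varies between samples: allow any text within the seen lengths.
            lengths = [len(s) for s in segments]
            parts.append("[^/]{%d,%d}" % (min(lengths), max(lengths)))
    return re.compile(r"^https?://[^/]+" + "/".join(parts) + "$")


# Hypothetical URLs from one campaign, all seen on the same domain.
samples = [
    "http://shop1.example/promo/ab12cd/buy",
    "http://shop1.example/promo/zz90qq/buy",
]
signature = domain_agnostic_regex(samples)

same_campaign_new_domain = bool(signature.match("http://other.example/promo/qw34er/buy"))
unrelated_url = bool(signature.match("http://other.example/news/qw34er/buy"))
```

Because the domain is not part of the signature, a URL from the same campaign on a new domain still matches, while a URL with a different fixed structure does not.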
  • 5. Future and Related Work

Characteristics of Botnets and their use in present anti-spam systems:-

1). Spam Sending Patterns over the network

This characteristic is used in A Dynamic Reputation Service for Spotting Spammers [1]. SpamSpotter is real-time reputation software (like AutoRE) for filtering spam messages. SpamSpotter classifies email senders in real time based on their global sending behaviour; this is called behavioural detection. SpamSpotter then applies a third-party behavioural machine learning algorithm to this data to generate the reputation of senders. A preferred algorithm in SpamSpotter is SNARE, a network-level behavioural algorithm that identifies spam senders based on their email sending behaviour instead of their addresses or the content they send. In some cases SNARE is efficient enough to identify a spammer before it has sent a large number of email messages. AutoRE also studied similar behaviour, though SpamSpotter goes a step further by using the SNARE algorithm to calculate the reputation of a sender.

2). Distribution of IP Address

One of the characteristics of botnets observed during the experiments with AutoRE was the distribution of IP addresses. This is a very important characteristic to study, as it can help us understand and stop the wide spread of botnets. This property has been extended by Studying Spamming Botnets Using Botlab [6]. Botnets are the most widely used spamming technique these days. It is estimated that 85% of the billions of spam messages are generated by botnets. This paper presents a botnet monitoring platform called Botlab, which monitors all incoming spam traffic at a certain location. It scans the spam messages and obtains bot binaries through spam links. A human operator then runs specific tools on these binaries to obtain information about the bots sending the spam. Botlab then executes multiple captive, sandboxed nodes from various botnets, allowing it to observe the precise outgoing spam feeds from these nodes. It scours the spam feeds for URLs, gathers information on scams, and identifies exploit links. Finally, it correlates the incoming and outgoing spam feeds to identify the most active botnets and the set of compromised hosts comprising each botnet.

Another extension that studies the characteristics of botnets and uses them for detection is BotGraph: Large Scale Spamming Botnet Detection [2]. BotGraph detects abnormal sharing of IP addresses among account holders in an email system. Applied to two months of Hotmail logs totalling 450GB of data, BotGraph successfully identified over 26 million bot accounts with a low false positive rate of 0.44%. BotGraph has also been implemented as a distributed, clustered algorithm using the MapReduce technique. BotGraph can detect both botnet sign-ups and botnet email accounts that have already been created.
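The idea of detecting accounts that share IP addresses can be sketched as a connected-components computation over a login log. This is a much reduced illustration and not BotGraph's actual algorithm or thresholds: two accounts are linked as soon as they share one IP address, and the function name, the `min_group_size` parameter and the log entries are assumptions.

```python
from collections import defaultdict


def find_account_groups(logins, min_group_size=3):
    """logins: iterable of (account, ip). Return groups linked by shared IPs."""
    parent = {}

    def find(account):
        parent.setdefault(account, account)
        while parent[account] != account:
            parent[account] = parent[parent[account]]  # path compression
            account = parent[account]
        return account

    first_account_on_ip = {}
    for account, ip in logins:
        find(account)  # register the account
        if ip in first_account_on_ip:
            # Merge this account with the group already seen on the IP.
            parent[find(account)] = find(first_account_on_ip[ip])
        else:
            first_account_on_ip[ip] = account

    components = defaultdict(set)
    for account in parent:
        components[find(account)].add(account)
    return [group for group in components.values() if len(group) >= min_group_size]


# Hypothetical login log: (account name, IP address used).
logins = [
    ("bot1", "192.0.2.1"),
    ("bot2", "192.0.2.1"),
    ("bot2", "192.0.2.2"),
    ("bot3", "192.0.2.2"),
    ("alice", "198.51.100.5"),
]
suspicious_groups = find_account_groups(logins)
```

In this example `bot1`, `bot2` and `bot3` end up in one group because they are chained together through two shared addresses, while `alice` stays alone. A production system would need a stricter linking rule, since legitimate users also share addresses behind proxies and NAT.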
  • 6. One more interesting finding from the AutoRE research, under the category of scanning network traffic, was the increase in the use of static IP addresses from November 2006 to July 2007. This finding helped improve blacklists by populating them with static IP addresses. The research also suggested that botnets are evolving and creating more sophisticated, polymorphic URLs to bypass anti-spam systems. One major disadvantage of AutoRE is that it has not been implemented in practice as a real-time system. Its methods are still under investigation, and a real-time deployment is still awaited.

References

1). A Dynamic Reputation Service for Spotting Spammers, Anirudh Ramachandran, Shuang Hao, Hitesh Khandelwal, Nick Feamster, Santosh Vempala, School of Computer Science, Georgia Tech. http://www.cs.purdue.edu/homes/hkhande/projects/spam/spam_nsdi.pdf
2). BotGraph: Large Scale Spamming Botnet Detection. http://research.microsoft.com/pubs/79413/botgraph.pdf
M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, ‘A Bayesian approach to filtering junk E-mail’, in Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, 1998.
3). Cohen, W 1996, ‘Learning Rules that Classify E-Mail’, Advances in Inductive Logic Programming, pp. 124-143.
4). Cook, D, Hartnett, J, Manderson, K & Scanlan, J 2006, ‘Catching Spam Before it Arrives: Domain Specific Dynamic Blacklists’, Proceedings of the 2006 Australasian workshops on Grid computing and e-research, Vol. 54, pp. 193-202.
5). SNARE: Spatio-temporal Network-level Automatic Reputation Engine. http://hdl.handle.net/1853/25135
6). Studying Spamming Botnets Using Botlab. http://www.cs.washington.edu/homes/arvind/papers/botlab.pdf
7). Uemura, M & Tabata, T 2008, ‘Design and Evaluation of a Bayesian-filter-based Image Spam Filtering Method’, 2008 International Conference on Information Security and Assurance, IEEE.