Network paperthesis1

Group Details:-

Dhara Shah z3299353
Imad Hashmi z3193866
Zuo Cui z3261136

Our Paper:- Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten and I. Osipkov,
Spamming Botnets: Signatures and Characteristics, in Proceedings of ACM
SIGCOMM 2008, pp. 171-182, Seattle, USA, August 2008.

Flow of the Literature Review is as follows:-

Introduction
Background and Previous Work
Focus on Technology used in Paper
Future and Related Work

Introduction

Since email has become a widespread means of communication around the
world, with millions of messages transferred every minute, it is
unsurprising that illegitimate use of the email service has long been in
practice. One of the many abuses of this service is spamming, which
advertisers around the world use to send advertisements for their products to
legitimate email users. The following discussion covers the methods used by
anti-spam systems to detect spam emails and botnets.

Background and Previous Work

There has been a lot of research on the identification and filtering of email
spam. Based on the part of the email used for spam detection, this work can
generally be classified into two main categories: non-content-based and
content-based.

Non-content-based filters are also known as address-based filters. They
examine information in the email header, such as the IP address or email
address. Blacklists and whitelists are the common techniques in this category.
A blacklist records the IP addresses or email addresses that send spam.
Conversely, a whitelist contains all acceptable email addresses. Both can be
deployed on client computers or email servers. Cook et al. (2006)
experimented with a domain-specific blacklist that worked on the mail server to
reduce the amount of spam entering the network. But a blacklist may easily
cause false positives. If a sender's address sends spam, its IP address or
email address is recorded in the blacklist. Consequently, all other legitimate
mail from that address is marked as spam.
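
The address-based filtering described above can be sketched as follows. This is
a minimal illustration only: the function name, the sample addresses and the
three-way result are our own assumptions and are not taken from any of the
cited systems.

```python
# Hypothetical address-based filter; all names and addresses are illustrative.
BLACKLIST = {"203.0.113.7", "spammer@example.net"}
WHITELIST = {"colleague@example.org"}


def address_filter(sender_ip, sender_email):
    """Classify a message using only header information (IP and address)."""
    if sender_email in WHITELIST:
        return "accept"
    if sender_ip in BLACKLIST or sender_email in BLACKLIST:
        return "reject"
    # Neither list knows this sender, so another filter has to decide.
    return "unknown"


# A legitimate user behind a blacklisted IP is rejected: a false positive.
print(address_filter("203.0.113.7", "honest.user@example.com"))
```

The last line shows the drawback discussed above: the decision never looks at
the message itself, so legitimate mail from a listed address is still rejected.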

Content-based detection filters spam by analyzing the message content of
received email, which overcomes the drawbacks of non-content-based filters.
These filters scan the content for sensitive keywords to identify spam. This
type of filter includes heuristic filters and Bayesian filters.

Heuristic detection, also known as rule-based analysis, uses regular
expression rules to detect phrases or characters that are common in spam
mail. Rules can be set on email header information, keywords or URLs in the
content. William Cohen (1996) successfully used learned rules to classify
emails into different folders. But there is little related research on
rule-based spam detection.

However, the precision of spam detection relies on the rules set by mail
system managers, so defining the rules takes a significantly long time. After
that, the rules must be refined frequently. If the pre-set rules on the mail
system are not updated, the filters will not work efficiently on new spam with
new features. Besides, the rules are rigid and can easily cause false positives.
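
A rule-based filter of the kind described above can be sketched as follows.
The three rules, the field names and the threshold are hypothetical and are
chosen only to illustrate the idea; a real mail system would maintain a much
larger rule set.

```python
import re

# Hypothetical rules: (email field, regular expression to look for).
RULES = [
    ("subject", re.compile(r"free\s+money", re.IGNORECASE)),
    ("body", re.compile(r"click\s+here", re.IGNORECASE)),
    # A link that points at a raw IP address instead of a host name.
    ("body", re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}")),
]


def matched_rules(email):
    """Return the indexes of the rules that match the email (a dict)."""
    return [
        index
        for index, (field, pattern) in enumerate(RULES)
        if pattern.search(email.get(field, ""))
    ]


def is_spam(email, threshold=2):
    """Flag the email when at least `threshold` rules match."""
    return len(matched_rules(email)) >= threshold


suspicious = {"subject": "FREE money now", "body": "Click here: http://192.0.2.1/x"}
harmless = {"subject": "Meeting", "body": "Please click here to view the agenda"}
print(is_spam(suspicious), is_spam(harmless))
```

Lowering the threshold to 1 would flag the harmless message too, which
illustrates why rigid rules easily cause false positives.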

In addition, because content-based spam detection can be treated as a text
classification problem, several machine learning approaches have been applied
to it. Bayesian classification is one of the approaches proposed. In 1998,
Bayesian classification techniques were applied to the problem of spam
filtering (Sahami et al, 1998). The filter records the occurrence of certain
words or phrases in the message content and then uses these statistics to
evaluate the probability that a message is spam. In that work, the Bayesian
filters eliminated more than 95% of spam in the experiments and identified
80% of incoming junk mail in a real scenario. This shows that Bayesian filters
can achieve a high rate of correct detection on plain-text content.
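
The idea of evaluating a spam probability from word statistics can be sketched
with a small naive Bayes classifier. This is a minimal sketch under our own
assumptions (whitespace tokenisation, Laplace smoothing, a toy training set)
and is not the filter evaluated by Sahami et al.

```python
import math
from collections import Counter


def train(messages):
    """Count words per class. `messages` is a list of (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals


def spam_probability(text, counts, totals):
    """Return P(spam | text); assumes both classes appear in the training data."""
    vocabulary = set(counts[True]) | set(counts[False])
    log_score = {}
    for label in (True, False):
        words_in_class = sum(counts[label].values())
        # Start from the class prior, then add one term per word.
        score = math.log(totals[label] / (totals[True] + totals[False]))
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the product.
            score += math.log(
                (counts[label][word] + 1) / (words_in_class + len(vocabulary))
            )
        log_score[label] = score
    # Normalise the two log scores into a probability.
    highest = max(log_score.values())
    spam = math.exp(log_score[True] - highest)
    ham = math.exp(log_score[False] - highest)
    return spam / (spam + ham)


training = [
    ("buy cheap pills now", True),
    ("cheap pills online", True),
    ("meeting agenda attached", False),
    ("lunch meeting tomorrow", False),
]
counts, totals = train(training)
print(spam_probability("cheap pills", counts, totals))
```

As noted in the text, the result depends entirely on the training data: with
only four training messages the probabilities are illustrative at best.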

Bayesian filtering is now widely combined with other methods in many spam
detection technologies to improve accuracy. However, Bayesian filters have
some issues. First, as with other machine learning approaches, the accuracy of
a Bayesian filter depends on the quality of the training data and the training
process. Second, although Bayesian filters provide high precision for
plain-text content, they struggle to detect the growing volume of spam that
contains images. Further research at Okayama University therefore addressed
image spam (Uemura et al. 2008). It designed a method that allows an existing
Bayesian filter to learn image information, such as the file size or name, and
then evaluate the probability from the learning results. Experiments with this
method showed that the false negative rate dropped while the false positive
rate stayed almost the same. This means the method can play only a supporting
role in Bayesian spam identification, because images provide little
information for distinguishing spam from legitimate mail.

Content-based detection has many advantages, but it costs considerable time
and processing space because it goes through the complete email. There is a
need for an anti-spam system that combines the advantages of content-based and
non-content-based spam detection.

AutoRE, the software designed by the Microsoft research group and described in
our anchor paper, tries to combine both types of detection system, i.e.
content-based and non-content-based. We now discuss in detail how AutoRE
combines the two.

Focus on Technology used in Paper

Unlike previous solutions for detecting botnets (such as Spamhaus and other
blacklists), AutoRE creates its signatures and trains itself dynamically, in
real time. To do this, it applies 3 major steps to the set of emails supplied
to it. They are as follows:-
1. URL Pre-processing
2. Group Selector
3. Regular Expression Generation
It is important to understand that we are not simply classifying emails as
spam or not spam. By definition, any email that is sent regularly and in bulk
is spam, but such emails are not necessarily malicious: a normal user might
send an email to their complete contact list that is relevant to its
recipients. Our focus is on spam emails generated by botnets, which carry no
relevant content and are sent only to accomplish some malicious mission.
Because botnets are autonomous, programmed systems, there is a pattern in
their sending behaviour, and the steps listed above are followed to catch that
pattern. During URL pre-processing, the following parameters are
considered:-
1. URL String
2. Source server IP address
3. Email sending time
All forwarded messages are discarded, as a legitimate forwarding server could
otherwise be mistaken for a botnet member. URL strings that look suspiciously
random, or that span multiple domains, are extracted. URL strings such as
a.com or b.com are unlikely to come from botnets, since these are registered
domain names that are not economically feasible for spammers. The URL strings
are then broken down and grouped by their domain names. Because spam emails
are observed to advertise a particular product or a particular advertising
campaign, domain-specific signatures are created for each group. From these
domain-specific signatures, domain-agnostic regular expressions are created.
These give better results in the form of reduced false positive rates and can
still identify the botnets when they change their domains. Before the
generalised regular expression is created, a domain-specific signature must
satisfy three properties: it must be distributed, bursty and specific. Only
then can it be classified as a spam signature.
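
The pre-processing and generalisation steps above can be sketched as follows.
This is only an illustration under our own assumptions: the record layout, the
function names and the example URLs are hypothetical, and replacing the domain
with a wildcard is a simplification that stands in for AutoRE's actual regular
expression generation.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def group_by_domain(records):
    """Group (url, source_ip, sent_time) by domain, skipping forwarded mail.

    `records` holds (url, source_ip, sent_time, is_forwarded) tuples.
    """
    groups = defaultdict(list)
    for url, source_ip, sent_time, is_forwarded in records:
        if is_forwarded:
            continue  # a forwarding server could be mistaken for a bot
        domain = urlparse(url).hostname
        if domain is None:
            continue  # not an absolute URL, nothing to group on
        groups[domain].append((url, source_ip, sent_time))
    return dict(groups)


def make_signatures(domain, path_pattern):
    """Build a domain-specific regex and its domain-agnostic counterpart."""
    specific = re.compile(r"https?://" + re.escape(domain) + path_pattern)
    agnostic = re.compile(r"https?://[^/]+" + path_pattern)
    return specific, agnostic


records = [
    ("http://shop-a.example/n/?167&abcdefghij", "198.51.100.1", 0, False),
    ("http://shop-a.example/n/?167&klmnopqrst", "198.51.100.2", 5, False),
    ("http://news.example/story", "198.51.100.3", 9, True),
]
groups = group_by_domain(records)
specific, agnostic = make_signatures("shop-a.example", r"/n/\?167&[a-zA-Z]{9,25}")

moved = "http://shop-b.example/n/?167&abcdefghij"
print(bool(specific.fullmatch(moved)), bool(agnostic.fullmatch(moved)))
```

The final line shows why the domain-agnostic form matters: once the campaign
moves to a new domain, only the generalised expression still matches.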

When grouping, it is important to decide how to group the domains, because n
emails can contain up to n distinct domain names. Temporal correlation is
taken into account when evaluating the distributed property, and the bursty
property is evaluated over a span of 5 days, as it was observed that most ASes
are active for a minimum of 5 days.
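
The two group properties can be checked along the following lines. This is a
sketch under stated assumptions: the IP-to-AS mapping and the AS-count
threshold `min_ases` are hypothetical inputs, and we interpret "bursty" as all
emails of a group falling inside a 5-day window.

```python
from datetime import datetime, timedelta


def is_distributed(source_ips, ip_to_as, min_ases):
    """True when the senders span at least `min_ases` distinct ASes."""
    ases = {ip_to_as[ip] for ip in source_ips if ip in ip_to_as}
    return len(ases) >= min_ases


def is_bursty(sent_times, max_span=timedelta(days=5)):
    """True when every email of the group was sent within `max_span`."""
    if not sent_times:
        return False
    return max(sent_times) - min(sent_times) <= max_span


# Hypothetical mapping from sender IP to AS number.
ip_to_as = {"198.51.100.1": 64500, "198.51.100.2": 64501, "203.0.113.9": 64501}
senders = ["198.51.100.1", "198.51.100.2", "203.0.113.9"]
times = [datetime(2007, 1, 1), datetime(2007, 1, 4)]

print(is_distributed(senders, ip_to_as, min_ases=2), is_bursty(times))
```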
Once the domains have been classified into groups, the next step is generating
the regular expression. Generating a regular expression rather than a token
conjunction helps reduce the false positive rate, because the keywords used in
a token conjunction are words that may or may not be part of an email. After
creating the domain-specific groups, we create a signature for each group, and
classification is then no longer tied to the group: it is domain-agnostic.
This ensures that if the botnets change domain in future they will still be
detected, because URLs on the new domain will match the same regular
expression and group signature that classify them as spam. This is a unique
feature of AutoRE that helps it find the maximum number of spam emails with a
minimum false positive rate.

After categorising the emails and assigning them their respective regular
expressions, it is important to verify that the emails classified as spam
actually are spam. There are 2 steps. First, we query the blacklist for our
suspected IP addresses; those found in the list are directly classified as
spam. For those that are not, we run a behavioural test to determine whether
they are spam or not. This behavioural test is done on each campaign, and the
points to take care of are as follows:-

1). Similarity of Email Properties
2). Similarity of Sending Time
3). Similarity of Email Sending Behaviour

Because the emails we are targeting are generated and sent by automated
systems, the properties mentioned above play a big role. Botnets are bound to
show a pattern: however random the sending algorithm is designed to be, the
sheer frequency of sending means that a pattern will emerge.
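
The two verification steps described above (blacklist lookup, then a
behavioural test on the remaining senders) can be sketched as follows. The
function names, the choice of property and the 0.8 threshold are our own
illustrative assumptions; the paper's actual similarity measures are not
reproduced here.

```python
from collections import Counter


def split_by_blacklist(suspect_ips, blacklist):
    """Step 1: separate blacklisted senders from those needing more tests."""
    listed = sorted(ip for ip in suspect_ips if ip in blacklist)
    unlisted = sorted(ip for ip in suspect_ips if ip not in blacklist)
    return listed, unlisted


def dominant_share(values):
    """Fraction of a campaign's emails that share the most common value."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)


def passes_behavioural_test(property_values, min_share=0.8):
    """Step 2: treat a campaign as automated when one value dominates."""
    return dominant_share(property_values) >= min_share


suspects = {"198.51.100.1", "198.51.100.2", "203.0.113.9"}
listed, unlisted = split_by_blacklist(suspects, blacklist={"203.0.113.9"})

# Hypothetical per-email property, e.g. the hour at which each email was sent.
sending_hours = [3, 3, 3, 3, 4]
print(listed, unlisted, passes_behavioural_test(sending_hours))
```

The same `dominant_share` check could be run once per property in the list
above (email properties, sending time, sending behaviour), each with its own
threshold.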

It does not end here: by means of this software we can also study the
characteristics of botnets and predict the traffic and spam emails that are
going to be generated. This study of botnets has revealed many facts that are
pointers for future research on anti-spam systems. In the next section we
present the results of the study on botnets and their use in technologies
that emerged after AutoRE.

Future and Related Work

Characteristics of Botnets and their use in present anti spam systems:-

1). Spam Sending Patterns over the network
The above characteristic is used in A Dynamic Reputation Service for
Spotting Spammers [1]. SpamSpotter is real-time (like AutoRE) reputation
software for filtering spam messages. It classifies email senders in real time
based on their global sending behaviour, an approach called behavioural
detection. SpamSpotter then applies a third-party behavioural machine learning
algorithm to this data to generate a reputation for each sender. The preferred
algorithm in SpamSpotter is SNARE [5], a network-level behavioural algorithm
that identifies spam senders based on their email sending behaviour instead of
their addresses or the content they send. In some cases, SNARE is efficient
enough to identify a spammer before it has sent a large number of email
messages.
AutoRE also studied similar behaviour, though SpamSpotter goes a step further
by implementing the SNARE algorithm to calculate the reputation of a sender.

2). Distribution of IP Address
One of the characteristics of botnets observed during experimentation with
AutoRE was the distribution of IP addresses. This is a very important
characteristic to study, as it can help us understand and stop the wide spread
of botnets.

This property has been extended by Studying Spamming Botnets Using
Botlab [6]. Botnets are the most widely used spamming technique these days; it
is estimated that 85% of the billions of spam messages sent are generated by
botnets. The paper presents a botnet monitoring platform called Botlab, which
monitors all incoming spam traffic at a certain location. It scans the spam
messages and obtains bot binaries through spam links. A human operator then
runs specific tools on these binaries to obtain information about the bots
sending the spam. Botlab then executes multiple captive, sandboxed nodes from
various botnets, allowing it to observe the precise outgoing spam feeds from
these nodes. It scours the spam feeds for URLs, gathers information on scams,
and identifies exploit links. Finally, it correlates the incoming and outgoing
spam feeds to identify the most active botnets and the set of compromised
hosts comprising each botnet.

Another extension that studies the characteristics of botnets and uses them
for detection is BotGraph: Large Scale Spamming Botnet Detection [2].
BotGraph detects abnormal sharing of IP addresses among account holders in an
email system. Applied to two months of Hotmail logs totalling 450GB of data,
BotGraph successfully identified over 26 million bot accounts with a low false
positive rate of 0.44%. BotGraph has also been implemented as a distributed,
cluster-based algorithm using the MapReduce technique. It can detect botnet
sign-ups as well as already created botnet email accounts.
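
The idea of detecting accounts through abnormal IP sharing can be illustrated
with a small single-machine sketch. It is a simplification under our own
assumptions (the function name, the `min_shared` threshold and the toy login
log are hypothetical) and is not the distributed BotGraph implementation.

```python
from collections import defaultdict
from itertools import combinations


def shared_ip_components(logins, min_shared=2):
    """Group accounts that share at least `min_shared` login IPs.

    `logins` is an iterable of (account, ip) pairs. Returns a list of sets,
    one per connected group of accounts.
    """
    accounts_by_ip = defaultdict(set)
    for account, ip in logins:
        accounts_by_ip[ip].add(account)

    # Edge weight = number of IP addresses two accounts have in common.
    weight = defaultdict(int)
    for accounts in accounts_by_ip.values():
        for a, b in combinations(sorted(accounts), 2):
            weight[(a, b)] += 1

    neighbours = defaultdict(set)
    for (a, b), shared in weight.items():
        if shared >= min_shared:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Connected components over the thresholded graph (depth-first search).
    seen, components = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


logins = [
    ("u1", "ip1"), ("u2", "ip1"), ("u3", "ip1"), ("u4", "ip1"),
    ("u1", "ip2"), ("u2", "ip2"), ("u3", "ip2"),
    ("u5", "ip9"),
]
print(shared_ip_components(logins))
```

In this toy log, u4 shares only one address with the others and u5 shares
none, so neither is grouped with the three accounts that share two addresses.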

One more interesting finding that came out of the AutoRE research, in the
category of scanning network traffic, was the increase in the use of static IP
addresses from Nov 2006 to July 2007. This finding helped improve blacklists
by populating them with static IP addresses. The research also suggested that
botnets are evolving and creating more sophisticated, polymorphic URLs to
bypass anti-spam systems.

One major disadvantage of AutoRE is that it has not yet been deployed in real
time in practice. Its methods are still under investigation, and a real-time
deployment is still awaited.

References

1). Ramachandran, A, Hao, S, Khandelwal, H, Feamster, N & Vempala, S, ‘A
Dynamic Reputation Service for Spotting Spammers’, School of Computer
Science, Georgia Tech
http://www.cs.purdue.edu/homes/hkhande/projects/spam/spam_nsdi.pdf

2). BotGraph: Large Scale Spamming Botnet Detection
http://research.microsoft.com/pubs/79413/botgraph.pdf

3). Cohen, W 1996, ‘Learning Rules that Classify E-Mail’, Advances in Inductive
Logic Programming, pp. 124-143

4). Cook, D, Hartnett, J, Manderson, K & Scanlan, J 2006, ‘Catching Spam Before
it Arrives: Domain Specific Dynamic Blacklists’, Proceedings of the 2006
Australasian workshops on Grid computing and e-research, Vol. 54, pp. 193-202.

5). SNARE: Spatio-temporal Network-level Automatic Reputation Engine
http://hdl.handle.net/1853/25135

6). Studying Spamming Botnets Using Botlab
http://www.cs.washington.edu/homes/arvind/papers/botlab.pdf

7). Uemura, M & Tabata, T 2008, ‘Design and Evaluation of a Bayesian-filter-based
Image Spam Filtering Method’, 2008 International Conference on Information
Security and Assurance, 2008 IEEE

8). Sahami, M, Dumais, S, Heckerman, D & Horvitz, E 1998, ‘A Bayesian approach
to filtering junk E-mail’, Learning for Text Categorization: Papers from the
1998 Workshop, Madison, Wisconsin.
