2. Spam
● 70% of total email volume
● Estimated cost : $20.5 billion/year
● To fight better, need better strategic knowledge
● Examples :
● “Guaranteed Results”
● “Make YourPenis 3-inches longer & thicker, girl will love you 1k”
Thibault Debatty Parallel Spam Clustering with Apache Hadoop 2
3. Spam
● Examples :
● “Guaranteed Results”
● “Make YourPenis 3-inches longer & thicker, girl will love you 1k”
● Annotations on the two examples : close IP, same domain
4. Problem statement
● Cluster spams in parallel :
● To get useful insights
● Fast!
● Dataset : 1 million spams (231MB)
5. Problem statement
● Subject Your Special Order #253650
● Charset windows-1250
● Geo GB
● Day 2010-10-01
● Host virginmedia.com
● ip 82.4.229.158
● Lang english
● Size 1482
● From berry_wagnertl@migrosbank.ch
● Rcpt brady@domain0140.com
6. What's next...
1. MapReduce and Apache Hadoop
2. Parallel K-means
3. Implementation
4. Benchmarks and speedup analysis
5. Cluster visualization
7. 1. MapReduce
● Model for processing large data sets
● Master node splits and distributes dataset
● 2 steps :
1. Map : worker nodes process data, and pass partial results to master
2. Reduce : master combines partial results
● Also name of Google's implementation
8. 1. Apache Hadoop
● Free implementation of MapReduce
● Written in Java
● Processes large amounts of data (petabytes)
● Used by :
● Yahoo : more than 10,000 cores
● Facebook : 30 PB of data
● Distributed filesystem (HDFS) + data locality
9. 1. Apache Hadoop
● Job Tracker
● ≃ Master
● Divides input data into “splits”
● Schedules map tasks (with data locality)
● Schedules reduce tasks on nodes
● Checks task health
10. 1. Apache Hadoop
● Map outputs <key, value> pairs
● Reduce receives <key, list of values>
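The <key, value> to <key, list of values> flow can be imitated in memory, without Hadoop. The sketch below is only an illustration: the class MapReduceSketch, its method names, and the choice of counting spams per sending host are assumptions made for the example, not part of the talk or of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the map -> shuffle -> reduce flow (no Hadoop involved). */
final class MapReduceSketch {

    /** Map step: one input record in, one <key, value> pair out (here <host, 1>). */
    static Map.Entry<String, Integer> map(String host) {
        return Map.entry(host, 1);
    }

    /** Shuffle step: group every value emitted under the same key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: combine the list of values of one key into a single result. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs the three steps in sequence on a list of sending hosts. */
    static Map<String, Integer> countByHost(List<String> hosts) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String host : hosts) {
            pairs.add(map(host));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((host, ones) -> counts.put(host, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // "example.org" is a made-up host used only for this demonstration.
        System.out.println(countByHost(List.of("virginmedia.com", "example.org", "virginmedia.com")));
    }
}
```

In Hadoop the shuffle is done by the framework between the map and reduce tasks; here it is written out explicitly to make the grouping visible.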
11. 2. KMeans
● Select initial centers
● Until stop criterion is reached :
● Assign each point to closest center
● Compute new center
● Advantages :
● Suited to large datasets
● Can be implemented in parallel
● Computation : O(nki) (n points, k centers, i iterations)
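The loop described on this slide can be sketched sequentially as follows. This is a toy version under stated assumptions: points are plain 1-D numbers with absolute difference as distance, the stop criterion is "no center moved", and the class and method names (KMeansSketch, closest, cluster) are hypothetical.

```java
import java.util.Arrays;

/** Sequential k-means on 1-D points; a toy stand-in for the spam feature space. */
final class KMeansSketch {

    /** Index of the center closest to x. */
    static int closest(double x, double[] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (Math.abs(x - centers[c]) < Math.abs(x - centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs at most maxIterations rounds, stopping early once no center moves. */
    static double[] cluster(double[] points, double[] initialCenters, int maxIterations) {
        double[] centers = initialCenters.clone();
        for (int it = 0; it < maxIterations; it++) {
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            for (double p : points) {                 // assignment: n * k distance computations
                int c = closest(p, centers);
                sum[c] += p;
                count[c]++;
            }
            double[] next = centers.clone();          // an empty cluster keeps its old center
            for (int c = 0; c < centers.length; c++) {
                if (count[c] > 0) {
                    next[c] = sum[c] / count[c];      // new center: mean of assigned points
                }
            }
            if (Arrays.equals(next, centers)) {       // stop criterion: centers are stable
                break;
            }
            centers = next;
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        System.out.println(Arrays.toString(cluster(points, new double[] {1, 2}, 10)));
    }
}
```

The inner assignment loop performs n × k distance computations per iteration, which is where the O(nki) total comes from.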
12. 2. Parallel KMeans
● “Parallel K-Means Clustering Based on MapReduce”
Weizhong Zhao, Huifang Ma and Qing He
● Map (point) :
● Compute distance to each center
● Output <id closest center, point>
● Reduce (list of points) :
● Compute center
● Output <center>
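The Map and Reduce steps listed above can be written as two small functions. The sketch below runs one iteration in memory on 1-D toy points; it does not use the Hadoop API, it takes the mean as the center, and the names ParallelKMeansSketch, map, reduce and iterate are chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** One k-means iteration phrased as map and reduce steps, on 1-D toy points. */
final class ParallelKMeansSketch {

    /** Map (point): compute the distance to each center, output <id of closest center, point>. */
    static Map.Entry<Integer, Double> map(double point, double[] centers) {
        int closest = 0;
        for (int c = 1; c < centers.length; c++) {
            if (Math.abs(point - centers[c]) < Math.abs(point - centers[closest])) {
                closest = c;
            }
        }
        return Map.entry(closest, point);
    }

    /** Reduce (list of points): compute the new center, here the mean of the points. */
    static double reduce(List<Double> points) {
        double sum = 0;
        for (double p : points) {
            sum += p;
        }
        return sum / points.size();
    }

    /** One iteration: map every point, group by center id, reduce each group. */
    static double[] iterate(double[] points, double[] centers) {
        Map<Integer, List<Double>> groups = new TreeMap<>();
        for (double p : points) {
            Map.Entry<Integer, Double> pair = map(p, centers);
            groups.computeIfAbsent(pair.getKey(), id -> new ArrayList<>()).add(pair.getValue());
        }
        double[] next = centers.clone();   // a center that received no point is kept as-is
        groups.forEach((id, assigned) -> next[id] = reduce(assigned));
        return next;
    }

    public static void main(String[] args) {
        double[] next = iterate(new double[] {1, 2, 3, 10, 11, 12}, new double[] {1, 2});
        System.out.println(java.util.Arrays.toString(next));
    }
}
```

Each call to map depends only on one point and the current centers, so the map calls can run on different nodes; only the reduce step needs all points of a given center together.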
15. 3. Implementation :
Abstract KMeansMapper
public void configure(JobConf job) {
// reads from
// "/it_" + job.get("iteration") + "/partxxxxx"
this.fetchCenters(job);
}
public void map(key, value,...) {
IPoint point = this.createPointInstance();
point.parse(value);
...
}
public abstract IPoint createPointInstance();
public abstract ICenter createCenterInstance();
16. 3. Implementation :
Abstract KMeansReducer
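public void reduce(key, values, ...) {
    // Build the new center, keeping a reference to the old one
    new_center = this.createCenterInstance();
    new_center.setOldCenter(old_center);
    // Consume every point assigned to this center
    while (values.hasNext()) {
        point = values.next();
        new_center.addPoint(point);
    }
    new_center.compute();
    output.collect(new_center);
}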
public abstract IPoint createPointInstance();
public abstract ICenter createCenterInstance();
17. 3. Implementation :
Spam Clustering
● Distance between spams :
Weighted Average of feature distances
● Text features : Jaro distance
18. 3. Implementation :
Spam Clustering
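Jaro similarity = $\frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$ (0 when m = 0)
Where :
● |s_1|, |s_2| = lengths of the two strings;
● m = number of matching characters;
● t = number of matching characters not located at the same position / 2.
Matching = same character, not farther apart than $\left\lfloor \max(|s_1|, |s_2|) / 2 \right\rfloor - 1$ positions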
=> Takes misspelling into account
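A minimal sketch of the standard Jaro similarity, with the Jaro distance taken as 1 - similarity. The class name JaroSketch and the convention that two empty strings have similarity 1 are assumptions of this example; the implementation used in the talk is not shown on the slides.

```java
/** Standard Jaro similarity between two strings, and the derived distance. */
final class JaroSketch {

    static double similarity(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) {
            return 1.0;   // convention chosen for this sketch
        }
        // Two equal characters match only if they are at most `window` positions apart.
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int m = 0;
        for (int i = 0; i < s1.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(s2.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) {
            return 0.0;
        }
        // Walk both sequences of matched characters in order and count disagreements.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double t = outOfOrder / 2.0;
        return (m / (double) s1.length() + m / (double) s2.length() + (m - t) / m) / 3.0;
    }

    /** Jaro distance: 0 for identical strings, 1 for strings with no matching character. */
    static double distance(String s1, String s2) {
        return 1.0 - similarity(s1, s2);
    }

    public static void main(String[] args) {
        System.out.println(similarity("MARTHA", "MARHTA"));
    }
}
```

For "MARTHA" and "MARHTA", all six characters match and the swapped T and H give t = 1, so the similarity is (1 + 1 + 5/6) / 3 = 17/18, which shows how a small misspelling keeps the two strings close.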
19. 3. Implementation :
Spam Clustering
Distance between spams :
Weighted Average of feature distances
● Text features : Jaro distance
● IP : Number of different bits / 32
● Size : max 10% difference
● Day : arctangent-shaped function
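Two of the pieces above are precise enough to sketch: the IP distance (number of differing bits / 32) and the weighted average of feature distances. The example assumes IPv4 addresses in dotted notation; the class SpamDistanceSketch, its method names and the weights in main are hypothetical. The Size and Day distances are left out because the slide does not give their exact form.

```java
/** IP bit distance and weighted average of per-feature distances. */
final class SpamDistanceSketch {

    /** Packs a dotted IPv4 address such as "82.4.229.158" into a 32-bit integer. */
    static int toInt(String ip) {
        int packed = 0;
        for (String octet : ip.split("\\.")) {
            packed = (packed << 8) | Integer.parseInt(octet);
        }
        return packed;
    }

    /** IP distance: number of bits that differ between the two addresses, divided by 32. */
    static double ipDistance(String ipA, String ipB) {
        return Integer.bitCount(toInt(ipA) ^ toInt(ipB)) / 32.0;
    }

    /** Weighted average of feature distances; distances[i] is weighted by weights[i]. */
    static double weightedAverage(double[] distances, double[] weights) {
        double total = 0;
        double weightSum = 0;
        for (int i = 0; i < distances.length; i++) {
            total += distances[i] * weights[i];
            weightSum += weights[i];
        }
        return total / weightSum;
    }

    public static void main(String[] args) {
        double ip = ipDistance("82.4.229.158", "82.4.229.159");
        // The second distance (0.2) and both weights are made-up values for the demonstration.
        System.out.println(weightedAverage(new double[] {ip, 0.2}, new double[] {1.0, 2.0}));
    }
}
```

Counting differing bits over the whole address is the definition given on the slide; as long as each feature distance lies in [0, 1], the weighted average also stays in [0, 1].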
21. 3. Implementation :
Spam Clustering
● Center of cluster :
● Text features : Longest Common Subsequence;
● Charset, Geo (country code), Lang, Day :
most often occurring value;
● Size : average value.
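The three center rules above can be sketched as follows. The longest common subsequence is shown for two strings only; how it is extended to a whole cluster (for example by folding it over the points one by one) is not stated on the slide and would be an assumption. The class CenterSketch and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Per-feature center computations: LCS for text, mode for categories, mean for size. */
final class CenterSketch {

    /** Longest common subsequence of two strings (classic dynamic programming). */
    static String lcs(String a, String b) {
        // len[i][j] = length of the LCS of a.substring(i) and b.substring(j)
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = a.length() - 1; i >= 0; i--) {
            for (int j = b.length() - 1; j >= 0; j--) {
                len[i][j] = a.charAt(i) == b.charAt(j)
                        ? len[i + 1][j + 1] + 1
                        : Math.max(len[i + 1][j], len[i][j + 1]);
            }
        }
        // Rebuild one LCS by following the table from the start of both strings.
        StringBuilder out = new StringBuilder();
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            if (a.charAt(i) == b.charAt(j)) {
                out.append(a.charAt(i));
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return out.toString();
    }

    /** Most often occurring value; on a tie, the value that reached the top count first. */
    static String mostFrequent(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (String v : values) {
            int c = counts.merge(v, 1, Integer::sum);
            if (best == null || c > counts.get(best)) {
                best = v;
            }
        }
        return best;
    }

    /** Average value, used for the Size feature. */
    static double average(int[] sizes) {
        double sum = 0;
        for (int s : sizes) {
            sum += s;
        }
        return sum / sizes.length;
    }

    public static void main(String[] args) {
        System.out.println(lcs("Guaranteed Results", "Guaranted Resultz"));
        System.out.println(mostFrequent(List.of("GB", "US", "GB")));
        System.out.println(average(new int[] {1482, 1500}));
    }
}
```

Using a subsequence rather than a substring lets the text center keep the characters shared by the cluster members even when a few characters were inserted or changed in between.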
22. 4. Benchmarks
● Small Cluster : 3 nodes
● Single core
● 2GB RAM
● Gigabit Ethernet network
● Data replication : 3
23. 4. Benchmarks
● n = 1M spams
● k = 30
● i = 10
=> 1131 sec
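As a rough reading of this figure, assuming the O(nki) cost model and counting only the distance computations of the assignment step (the measured time also includes I/O, job scheduling and center computation):

```latex
n \cdot k \cdot i = 10^{6} \times 30 \times 10 = 3 \times 10^{8} \ \text{distance computations},
\qquad
\frac{3 \times 10^{8}}{1131\ \text{s}} \approx 2.65 \times 10^{5} \ \text{distance computations per second}
```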