Parallel SPAM Clustering with Hadoop

Parallel Spam Clustering
with Apache Hadoop

Thibault Debatty

Spam
● 70% of total email volume
● Estimated cost : $20.5 billion/year
● To fight it better, we need better strategic knowledge
● Examples :
● “Guaranteed Results”
● “Make YourPenis 3-inches longer & thicker, girl will love you 1k”


Spam
● The two example spams above have in common :
● Close IP
● Same domain


Problem statement
● Cluster spams in parallel :
● To get useful insights
● Fast!
● Dataset : 1 million spams (231MB)


Problem statement
● Subject : Your Special Order #253650
● Charset : windows-1250
● Geo : GB
● Day : 2010-10-01
● Host : virginmedia.com
● IP : 82.4.229.158
● Lang : english
● Size : 1482
● From : berry_wagnertl@migrosbank.ch
● Rcpt : brady@domain0140.com

What's next...

1. MapReduce and Apache Hadoop
2. Parallel K-means
3. Implementation
4. Benchmarks and speedup analysis
5. Cluster visualization


1. MapReduce
● Model for processing large data sets
● Master node splits and distributes dataset
● 2 steps :
1. Map : worker nodes process data, and pass partial results to master
2. Reduce : master combines partial results
● Also the name of Google's implementation
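The two steps can be illustrated with a minimal in-memory sketch. This is not Hadoop code: the class `MapReduceSketch` and its methods are hypothetical names chosen for the illustration, and the `run` method stands in for the framework (splitting the input, grouping the map output by key, calling reduce once per key). The example counts words in spam subjects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceSketch {
    // Map step: one input record in, a list of <key, value> pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: one key with all the values emitted for it, one result out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: map every record, group by key, reduce each group.
    static Map<String, Integer> run(List<String> input) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {delivery=1, guaranteed=2, results=1}
        System.out.println(run(List.of("Guaranteed Results", "guaranteed delivery")));
    }
}
```

In a real cluster the map calls run on different worker nodes and the grouping is done by the framework's sort phase; only `map` and `reduce` are written by the programmer.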


1. Apache Hadoop

● Free implementation of MapReduce
● Written in Java
● Processes large amounts of data (petabytes)
● Used by :
● Yahoo : 10,000+ cores
● Facebook : 30 PB of data
● Distributed filesystem (HDFS) + data locality

1. Apache Hadoop
● Job Tracker
● ≃ Master
● Divides input data into “splits”
● Schedules map tasks (with data locality)
● Schedules reduce tasks on nodes
● Monitors task health


1. Apache Hadoop
● Map emits <key, value> pairs
● Reduce receives <key, list of values>


2. KMeans
● Select initial centers
● Until stop criterion is reached :
● Assign each point to closest center
● Compute new center
● Advantages :
● Suited to large datasets
● Can be implemented in parallel
● Computation : O(nki) for n points, k centers and i iterations
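The loop above can be sketched sequentially for simple 2D points. This is a minimal illustration, not the talk's implementation: the class `KMeansSketch`, the squared Euclidean distance and the stop criterion (no center moves, or a maximum number of iterations) are assumptions made for the example.

```java
import java.util.Arrays;

class KMeansSketch {
    // Index of the center closest to point p (squared Euclidean distance).
    static int closest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dx = p[0] - centers[c][0];
            double dy = p[1] - centers[c][1];
            double d = dx * dx + dy * dy;
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    // Runs at most maxIterations rounds and stops early when no center moves.
    static double[][] cluster(double[][] points, double[][] initialCenters, int maxIterations) {
        int k = initialCenters.length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int it = 0; it < maxIterations; it++) {
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            // Assignment step: n points times k centers distance computations.
            for (double[] p : points) {
                int c = closest(p, centers);
                sums[c][0] += p[0];
                sums[c][1] += p[1];
                counts[c]++;
            }
            // Update step: each center becomes the mean of its assigned points.
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double nx = sums[c][0] / counts[c];
                double ny = sums[c][1] / counts[c];
                if (nx != centers[c][0] || ny != centers[c][1]) {
                    moved = true;
                }
                centers[c][0] = nx;
                centers[c][1] = ny;
            }
            if (!moved) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] init = {{0, 0}, {10, 10}};
        // prints [[0.0, 1.0], [10.0, 11.0]]
        System.out.println(Arrays.deepToString(cluster(points, init, 10)));
    }
}
```

The nested loops make the O(nki) cost visible: each of the i iterations compares every one of the n points with every one of the k centers.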

2. Parallel KMeans
● “Parallel K-Means Clustering Based on MapReduce”
Weizhong Zhao, Huifang Ma and Qing He
● Map (point) :
● Compute distance to each center
● Output <id closest center, point>
● Reduce (list of points) :
● Compute center
● Output <center>
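One iteration of this scheme can be sketched in memory as follows. The names `ParallelKMeansSketch`, `mapPoint`, `reducePoints` and `iterate` are hypothetical, points are plain `double[]` vectors, and the grouping by center id that Hadoop performs between the two phases is simulated with a `TreeMap`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ParallelKMeansSketch {
    // Map (point): compute the distance to each center, return the id of the closest.
    static int mapPoint(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = 0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centers[c][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    // Reduce (list of points): the new center is the mean of the points.
    static double[] reducePoints(List<double[]> points) {
        double[] center = new double[points.get(0).length];
        for (double[] p : points) {
            for (int j = 0; j < center.length; j++) {
                center[j] += p[j];
            }
        }
        for (int j = 0; j < center.length; j++) {
            center[j] /= points.size();
        }
        return center;
    }

    // One iteration = one job: map every point, group by center id, reduce each group.
    static Map<Integer, double[]> iterate(double[][] points, double[][] centers) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (double[] p : points) {
            groups.computeIfAbsent(mapPoint(p, centers), id -> new ArrayList<>()).add(p);
        }
        Map<Integer, double[]> newCenters = new TreeMap<>();
        for (Map.Entry<Integer, List<double[]>> e : groups.entrySet()) {
            newCenters.put(e.getKey(), reducePoints(e.getValue()));
        }
        return newCenters;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] centers = {{0, 0}, {10, 10}};
        iterate(points, centers).forEach(
            (id, center) -> System.out.println(id + " -> " + Arrays.toString(center)));
    }
}
```

Each `mapPoint` call depends only on one point and the current centers, which is why the map phase can be spread over many nodes; only the centers have to be shared between iterations.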


3. Implementation : KMeans
● Abstract KMeans
● Abstract KMeansMapper
● Abstract KMeansReducer
● Interface IPoint
● Interface ICenter
● 2 concrete implementations :
● Spam
● Simple 2D points


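3. Implementation : Abstract KMeans
// Write the initial centers to "/it_0/part00000"
this.writeInitialCentroids();
for (…) {
    // One MapReduce job per K-means iteration
    conf.setMapperClass(this.mapper);
    conf.setReducerClass(this.reducer);
    // Tells the mappers which set of centers to read
    conf.setInt("iteration", iteration);
    // The new centers become the input of the next iteration
    FileOutputFormat.setOutputPath(conf, new Path(... "/it_" + (iteration + 1)));
    ...
}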


3. Implementation : Abstract KMeansMapper
public void configure(JobConf job) {
    // reads from
    // "/it_" + job.get("iteration") + "/partxxxxx"
    this.fetchCenters(job);
}
public void map(key, value,...) {
    IPoint point = this.createPointInstance();
    point.parse(value);
    ...
}
public abstract IPoint createPointInstance();
public abstract ICenter createCenterInstance();


3. Implementation : Abstract KMeansReducer
public void reduce(key, values, …) {
    new_center = this.createCenterInstance();
    new_center.setOldCenter(old_center);
    while (values.hasNext()) {
        // Consume the next value; without next() the loop never ends
        IPoint point = this.createPointInstance();
        point.parse(values.next());
        new_center.addPoint(point);
    }
    new_center.compute();
    output.collect(new_center);
}
public abstract IPoint createPointInstance();
public abstract ICenter createCenterInstance();

3. Implementation : Spam Clustering
● Distance between spams : weighted average of feature distances
● Text features : Jaro distance


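3. Implementation : Spam Clustering

Jaro similarity = $\frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$
Where :
● $|s_1|$, $|s_2|$ = lengths of the two strings;
● m = number of matching characters;
● t = number of matching characters not located at the same position / 2.
Matching = same character, not farther apart than $\left\lfloor \frac{\max(|s_1|, |s_2|)}{2} \right\rfloor - 1$ positions
=> Takes misspelling into account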
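A minimal sketch of the Jaro similarity, using the standard textbook definition: similarity = (m/|s1| + m/|s2| + (m - t)/m) / 3, with a matching window of floor(max(|s1|, |s2|) / 2) - 1 positions. The class name `JaroSketch` is hypothetical, and whether the talk's implementation handles edge cases (empty strings, no match) the same way is an assumption.

```java
class JaroSketch {
    // Jaro similarity in [0, 1]; the Jaro distance is 1 - similarity.
    static double similarity(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) {
            return 1.0;
        }
        // Two equal characters match only if they are at most `window` positions apart.
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int m = 0;
        for (int i = 0; i < s1.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(s2.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) {
            return 0.0;
        }
        // Compare matched characters in order; those that differ are out of position.
        int outOfPosition = 0;
        int j = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[j]) {
                j++;
            }
            if (s1.charAt(i) != s2.charAt(j)) {
                outOfPosition++;
            }
            j++;
        }
        double t = outOfPosition / 2.0;
        return (m / (double) s1.length() + m / (double) s2.length() + (m - t) / m) / 3.0;
    }

    public static void main(String[] args) {
        // "MARTHA" vs "MARHTA": m = 6, t = 1, similarity about 0.944
        System.out.println(similarity("MARTHA", "MARHTA"));
    }
}
```

Swapping two letters, as in the example, only costs one transposition, which is why the measure tolerates the small misspellings that spammers use to vary a subject line.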


3. Implementation : Spam Clustering
● Distance between spams : weighted average of feature distances
● Text features : Jaro distance
● IP : number of different bits / 32
● Size : max 10% difference
● Day : arctangent-shaped function
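The IP distance and the weighted average can be sketched as follows. The class `SpamDistanceSketch`, its method names and the weights used in `main` are hypothetical; the slides do not give the actual weights, and the size and day distances are left out because their exact formulas are not stated.

```java
class SpamDistanceSketch {
    // Packs a dotted IPv4 address such as "82.4.229.158" into 32 bits.
    static int parseIp(String ip) {
        int bits = 0;
        for (String part : ip.split("\\.")) {
            bits = (bits << 8) | Integer.parseInt(part);
        }
        return bits;
    }

    // IP distance: number of different bits / 32, from 0.0 (same address) to 1.0.
    static double ipDistance(String a, String b) {
        return Integer.bitCount(parseIp(a) ^ parseIp(b)) / 32.0;
    }

    // Weighted average of per-feature distances, each expected in [0, 1].
    static double weightedAverage(double[] distances, double[] weights) {
        double sum = 0;
        double totalWeight = 0;
        for (int i = 0; i < distances.length; i++) {
            sum += weights[i] * distances[i];
            totalWeight += weights[i];
        }
        return sum / totalWeight;
    }

    public static void main(String[] args) {
        // The two addresses differ in a single bit: 1 / 32 = 0.03125
        double ip = ipDistance("82.4.229.158", "82.4.229.159");
        // Hypothetical subject distance and weights, for illustration only.
        double subject = 0.2;
        System.out.println(weightedAverage(new double[] {subject, ip}, new double[] {2, 1}));
    }
}
```

Counting different bits makes addresses in the same subnet close to each other, since they share their high-order bits.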



3. Implementation : Spam Clustering
● Center of cluster :
● Text features : Longest Common Subsequence;
● Charset, Geo (country code), Lang, Day :
most often occurring value;
● Size : average value.
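A sketch of how such a center could be computed. The class `CenterSketch` is hypothetical, and folding a pairwise longest common subsequence over all the values of a cluster is an assumption: the slides only say that the text center is a Longest Common Subsequence.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CenterSketch {
    // Longest common subsequence of two strings (classic dynamic programming).
    static String lcs(String a, String b) {
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                len[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? len[i - 1][j - 1] + 1
                        : Math.max(len[i - 1][j], len[i][j - 1]);
            }
        }
        // Walk back through the table to rebuild one longest subsequence.
        StringBuilder out = new StringBuilder();
        int i = a.length();
        int j = b.length();
        while (i > 0 && j > 0) {
            if (a.charAt(i - 1) == b.charAt(j - 1)) {
                out.append(a.charAt(i - 1));
                i--;
                j--;
            } else if (len[i - 1][j] >= len[i][j - 1]) {
                i--;
            } else {
                j--;
            }
        }
        return out.reverse().toString();
    }

    // Text center (assumption): fold the pairwise LCS over all values.
    static String textCenter(List<String> values) {
        String center = values.get(0);
        for (String v : values.subList(1, values.size())) {
            center = lcs(center, v);
        }
        return center;
    }

    // Categorical center: the most often occurring value.
    static String mostFrequent(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (String v : values) {
            int n = counts.merge(v, 1, Integer::sum);
            if (best == null || n > counts.get(best)) {
                best = v;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // prints "Your receipt #" (the varying digits are dropped)
        System.out.println(textCenter(List.of("Your receipt #123", "Your receipt #987")));
        // prints GB
        System.out.println(mostFrequent(List.of("GB", "FR", "GB")));
    }
}
```

A subsequence keeps only the characters shared by all members, so a text center reads like a template of the campaign with the variable parts removed.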


4. Benchmarks
● Small Cluster : 3 nodes
● Single core
● 2GB RAM
● Gigabit Ethernet network
● Data replication : 3


4. Benchmarks
● n = 1M spams
● k = 30
● i = 10
=> 1131 sec


4. Benchmarks : scalability
[Figure: execution time (sec, axis from 0 to 3500) for 1 node, 2 nodes and 3 nodes]


4. Benchmarks : Hadoop Overhead
Sequential : 2424 sec
3 servers (theoretical) : 808 sec
3 servers (real) : 1131 sec
Overhead : 323 sec (40% of the theoretical time)
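The same figures can be read as a speedup and a parallel efficiency. The symbols S and E are the usual notation for these quantities and are not used on the slides; the values follow directly from the measured times.

```latex
T_{\text{theoretical}} = \frac{2424}{3} = 808 \text{ sec}, \qquad
\text{overhead} = \frac{1131 - 808}{808} \approx 0.40

S = \frac{2424}{1131} \approx 2.14, \qquad
E = \frac{S}{3} \approx 0.71
```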


4. Benchmarks : Hadoop Overhead
[Figure: MPI Jumpshot]


4. Benchmarks : Hadoop Overhead
No data (setup) : 76 sec (9.5%)
Trivial distance (setup + sort) : 242 sec
Sort : 166 sec (20.5%)
Remaining : 81 sec (10%)

4. Benchmarks : Weka and Mahout
● 10 million 2D points
● Weka (sequential) : 5355 sec
● Hadoop : 1841 sec (2.9x faster)
● Mahout : more than 4 h ?


4. Benchmarks
● Bigger cluster :
● 27 nodes
● 2 x 4 cores
● 16 GB
● Deployment:
● Shared home dir (NFS)
● Custom setup script
● Executed on all nodes through SSH


4. Benchmarks : Cluster 1M spams
            Small cluster    Bigger cluster
● Cores     3                216
● k         30               4000
● Time      1131 sec         2484 sec


4. Benchmarks : Comparison
            Small cluster    Bigger cluster    Ratio
● Cores     3                216               x 72
● k         30               4000              x 133
● Time      1131 sec         2484 sec
Expected : 2089 sec
Difference : 19%
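The expected time appears to follow from the O(nki) cost: assuming the execution time grows linearly with k and shrinks linearly with the number of cores (an assumption, with n and i unchanged), the small-cluster time can be rescaled as follows.

```latex
T_{\text{expected}} = 1131 \times \frac{133}{72} \approx 2089 \text{ sec}, \qquad
\frac{2484 - 2089}{2089} \approx 0.19
```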


4. Benchmarks : Profiling and optimization
● With String dates : 1131 sec
● With timestamps : 770 sec (- 32%)


5. Results
● "Your receipt #"
● From: ""
● To: "@domain4.com"
● "LinkedIn Messages, /0/2010"
● From: "adjustsc5837@rodneymoore.com"
● To: "@domain0140.com"
● ""
● From: "LiliKepp5219@telemar.net.br"
● To: "@domain4.c"


5. Results Visualization
● "eil rder #"
● From: "hilton_ns@datares.com.my"


Conclusion
● Hadoop allows faster clustering
● But:
● Limitations
● Lacks a graphical performance analysis tool (like MPI Jumpshot)
● Programmer needs to understand the inner workings!
● A lot of room for improvement:
● Memcached to store intermediate centers?
● MPI to intercept method calls between JVMs?
● Selection of initial centers (canopy?), stop criterion?
● Distance computation (WOWA)
● Clustering algorithm (online clustering)
● Influence of data locality and data size?

Questions ?

