Parallel Spam Clustering
 with Apache Hadoop

    Thibault Debatty
Spam
 ●   70% of total email volume
 ●   Estimated cost : $20.5 billion/year
 ●   To fight better, need better strategic knowledge
 ●   Examples :
       ●   “Guaranteed Results”
       ●   “Make YourPenis 3-inches longer & thicker, girl will
           love you 1k”
 ●   Links between the two examples : close IP, same domain

Problem statement
 ●   Cluster spams in parallel :
       ●   To get useful insights
       ●   Fast!
 ●   Dataset : 1 million spams (231MB)




Problem statement
 ●   Subject         Your Special Order #253650
 ●   Charset         windows-1250
 ●   Geo             GB
 ●   Day             2010-10-01
 ●   Host            virginmedia.com
 ●   ip              82.4.229.158
 ●   Lang            english
 ●   Size            1482
 ●   From            berry_wagnertl@migrosbank.ch
 ●   Rcpt            brady@domain0140.com
What's next...

1. MapReduce and Apache Hadoop
2. Parallel K-means
3. Implementation
4. Benchmarks and speedup analysis
5. Clusters visualization

1. MapReduce
 ●   Model for processing large data sets
 ●   Master node splits and distributes dataset
 ●   2 steps :
       1. Map : worker nodes process data,
          and pass partial results to master
       2. Reduce : master combines partial results
 ●   Also name of Google's implementation
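
The two steps above can be illustrated with a small in-memory sketch. It is not Hadoop code: the class name `MiniMapReduce` and its methods are hypothetical, and the example job (counting spams per Geo country code) is only an assumption chosen to match the dataset fields shown earlier.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, single-process illustration of the map / group / reduce flow. */
public final class MiniMapReduce {
    private MiniMapReduce() {}

    /** Map step: one input record (a Geo country code) becomes one <key, value> pair. */
    public static List<Map.Entry<String, Integer>> map(String geoCode) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new AbstractMap.SimpleEntry<>(geoCode, 1));
        return pairs;
    }

    /** Reduce step: all values emitted for one key are combined into one result. */
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs map on every record, groups the pairs by key, then reduces each group. */
    public static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> pair : map(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }
}
```

In a real cluster the grouping step is done by the framework between the map and reduce tasks; here it is a plain `TreeMap`.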



1. Apache Hadoop

 ●   Free implementation of MapReduce
 ●   Written in Java
 ●   Processes large amounts of data (petabytes)
 ●   Used by :
       ●   Yahoo : 10,000+ cores
       ●   Facebook : 30 PB of data
 ●   Distributed filesystem (HDFS) + data locality
1. Apache Hadoop
 ●   Job Tracker
       ●   ≃ Master
       ●   Divides input data into “splits”
       ●   Schedules map tasks (with data locality)
       ●   Schedules reduce tasks on nodes
       ●   Checks task health




1. Apache Hadoop
[Figure : data flow. Map tasks emit <key, value> pairs; reduce tasks receive <key, list of values>]

2. KMeans
 ●   Select initial centers
 ●   Until stop criterion is reached :
       ●   Assign each point to closest center
       ●   Compute new center
 ●   Advantages :
       ●   Suited to large datasets
       ●   Can be implemented in parallel
 ●   Computation : O(nki) for n points, k centers and i iterations
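
The loop above can be sketched sequentially on 2D points. This is a minimal illustration, not the implementation from the talk: the class `KMeans2D`, the use of squared Euclidean distance and the "no center moved" stop criterion are all assumptions.

```java
/** Hypothetical sequential K-means on 2D points (each point is {x, y}). */
public final class KMeans2D {
    private KMeans2D() {}

    /** Returns the centers after at most maxIterations assign/recompute rounds. */
    public static double[][] cluster(double[][] points, double[][] initialCenters,
                                     int maxIterations) {
        int k = initialCenters.length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            // Assign each point to its closest center: n * k distance computations.
            for (double[] p : points) {
                int c = closest(p, centers);
                sums[c][0] += p[0];
                sums[c][1] += p[1];
                counts[c]++;
            }
            // Compute the new centers as the mean of the assigned points.
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // empty cluster: keep the previous center
                }
                double x = sums[c][0] / counts[c];
                double y = sums[c][1] / counts[c];
                if (x != centers[c][0] || y != centers[c][1]) {
                    moved = true;
                }
                centers[c][0] = x;
                centers[c][1] = y;
            }
            if (!moved) {
                break; // assumed stop criterion: centers are stable
            }
        }
        return centers;
    }

    /** Index of the center with the smallest squared Euclidean distance to p. */
    public static int closest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dx = p[0] - centers[c][0];
            double dy = p[1] - centers[c][1];
            double d = dx * dx + dy * dy;
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }
}
```

Each round costs n * k distance computations, which is where the O(nki) total comes from.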
2. Parallel KMeans
 ●   “Parallel K-Means Clustering Based on MapReduce”
     Weizhong Zhao, Huifang Ma and Qing He
 ●   Map (point) :
       ●   Compute distance to each center
       ●   Output <id closest center, point>
 ●   Reduce (list of points) :
       ●   Compute center
       ●   Output <center>
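
One iteration of the Map and Reduce functions described above could look like the following in-memory sketch. It only mimics the data flow on 2D points; the class `KMeansMapReduceSketch` and its method names are hypothetical, and no Hadoop API is used.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical single-process version of one parallel K-means iteration. */
public final class KMeansMapReduceSketch {
    private KMeansMapReduceSketch() {}

    /** Map (point): compute the distance to each center, output <id closest center, point>. */
    public static Map.Entry<Integer, double[]> map(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dx = point[0] - centers[c][0];
            double dy = point[1] - centers[c][1];
            double d = dx * dx + dy * dy;
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return new AbstractMap.SimpleEntry<>(Integer.valueOf(best), point);
    }

    /** Reduce (list of points): compute the center of the points, here their mean. */
    public static double[] reduce(List<double[]> points) {
        double sumX = 0;
        double sumY = 0;
        for (double[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        return new double[] {sumX / points.size(), sumY / points.size()};
    }

    /** Runs map on all points, groups by center id, then reduces each group. */
    public static double[][] iterate(double[][] points, double[][] centers) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (double[] p : points) {
            Map.Entry<Integer, double[]> pair = map(p, centers);
            groups.computeIfAbsent(pair.getKey(), id -> new ArrayList<>())
                  .add(pair.getValue());
        }
        double[][] next = new double[centers.length][];
        for (int c = 0; c < centers.length; c++) {
            // A center that received no point keeps its previous position.
            next[c] = groups.containsKey(c) ? reduce(groups.get(c)) : centers[c].clone();
        }
        return next;
    }
}
```

The map calls are independent of each other, which is what lets them run on different nodes; only the centers have to be shared between iterations.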


3. Implementation : KMeans
 ●   Abstract KMeans
       ●   Abstract KMeansMapper
       ●   Abstract KMeansReducer
       ●   Interface IPoint
       ●   Interface ICenter
 ●   2 concrete implementations :
       ●   Spam
       ●   Simple 2D points


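3. Implementation : Abstract KMeans
// Initial centers are written to "/it_0/part-00000"
this.writeInitialCentroids();
for (…) {
    // One MapReduce job per K-means iteration
    conf.setMapperClass(this.mapper);
    conf.setReducerClass(this.reducer);
    // Tells the mappers which iteration's centers to read
    conf.setInt("iteration", iteration);
    // The centers computed by this job go to the next iteration's directory
    setOutputPath(... "/it_" + (iteration + 1));
    ...
}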




3. Implementation : Abstract KMeansMapper
// Called once per map task, before any call to map()
public void configure(JobConf job) {
    // reads from
    // "/it_" + job.get("iteration") + "/part-xxxxx"
    this.fetchCenters(job);
}
// Called once per input record
public void map(key, value,...) {
    IPoint point = this.createPointInstance();
    point.parse(value);
    ...
}
// Implemented by each concrete mapper (spam, simple 2D points)
public abstract IPoint createPointInstance();
public abstract ICenter createCenterInstance();

3. Implementation : Abstract KMeansReducer
// Called once per center id, with all the points assigned to that center
public void reduce(key, values, …) {
    new_center = this.createCenterInstance();
    new_center.setOldCenter(old_center);
    while (values.hasNext()) {
        // Advance the iterator (how the value becomes a point is elided on the slide)
        point = values.next();
        new_center.addPoint(point);
    }
    new_center.compute();
    output.collect(new_center);
}
public abstract IPoint createPointInstance();
public abstract ICenter createCenterInstance();
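
The slides only name the two interfaces and the calls made on them (`parse`, `setOldCenter`, `addPoint`, `compute`). A possible shape, with a 2D implementation, is sketched below. The method signatures, the wrapper class `KMeansTypesSketch`, the "x,y" input format and the rule "an empty cluster keeps its old center" are assumptions, not the author's code.

```java
/** Hypothetical sketch of the IPoint / ICenter abstractions for simple 2D points. */
public final class KMeansTypesSketch {
    private KMeansTypesSketch() {}

    public interface IPoint {
        /** Fills the point from one line of input. */
        void parse(String line);
    }

    public interface ICenter {
        void setOldCenter(ICenter oldCenter);
        void addPoint(IPoint point);
        /** Computes the new center from the points added so far. */
        void compute();
    }

    public static final class Point2D implements IPoint {
        public double x;
        public double y;

        @Override
        public void parse(String line) {
            String[] parts = line.split(","); // assumed format: "x,y"
            x = Double.parseDouble(parts[0].trim());
            y = Double.parseDouble(parts[1].trim());
        }
    }

    public static final class Center2D implements ICenter {
        public double x;
        public double y;
        private double sumX;
        private double sumY;
        private int count;
        private ICenter oldCenter;

        @Override
        public void setOldCenter(ICenter oldCenter) {
            this.oldCenter = oldCenter;
        }

        @Override
        public void addPoint(IPoint point) {
            Point2D p = (Point2D) point; // this sketch only handles 2D points
            sumX += p.x;
            sumY += p.y;
            count++;
        }

        @Override
        public void compute() {
            if (count > 0) {
                x = sumX / count;
                y = sumY / count;
            } else if (oldCenter instanceof Center2D) {
                // Assumption: an empty cluster keeps the position of its old center.
                x = ((Center2D) oldCenter).x;
                y = ((Center2D) oldCenter).y;
            }
        }
    }
}
```

Keeping the accumulation inside the center object is what lets the same abstract reducer serve both the spam and the 2D implementations.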
3. Implementation : Spam Clustering
 ●   Distance between spams :
     Weighted Average of feature distances
       ●   Text features : Jaro distance




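3. Implementation : Spam Clustering

     Jaro similarity = $\frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$
     Where :
       ●   $|s_1|$, $|s_2|$ = lengths of the two strings;
       ●   m = number of matching characters;
       ●   t = number of matching characters not located at the
           same position / 2.
     Matching = same character, not farther than $\lfloor \max(|s_1|, |s_2|) / 2 \rfloor - 1$ positions apart
     => Takes misspelling into account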
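
The Jaro similarity described above can be sketched as follows, using the standard definition of the measure. The class name `Jaro` is illustrative, and taking the distance as 1 minus the similarity is an assumption about how the talk converts one into the other.

```java
/** Sketch of the standard Jaro similarity between two strings. */
public final class Jaro {
    private Jaro() {}

    public static double similarity(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) {
            return 1.0;
        }
        // Two equal characters match only if they are at most `window` positions apart.
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int m = 0;
        for (int i = 0; i < s1.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(s2.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) {
            return 0.0;
        }
        // Walk both strings over their matched characters and count the pairs
        // that differ; t is half that count.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[j]) {
                j++;
            }
            if (s1.charAt(i) != s2.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        double t = outOfOrder / 2.0;
        return (m / (double) s1.length() + m / (double) s2.length() + (m - t) / m) / 3.0;
    }

    /** Assumed conversion: distance = 1 - similarity. */
    public static double distance(String s1, String s2) {
        return 1.0 - similarity(s1, s2);
    }
}
```

For the classic pair "MARTHA" / "MARHTA", all six characters match and one transposition is counted, so the similarity is (1 + 1 + 5/6) / 3, about 0.944.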

3. Implementation : Spam Clustering
     Distance between spams :
     Weighted Average of feature distances
       ●   Text features : Jaro distance
       ●   IP : Number of different bits / 32
       ●   Size : max 10% difference
       ●   Day : arctangent-shaped function
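
Two of the items above are precise enough to sketch: the IP distance (number of differing bits / 32) and the weighted average of per-feature distances. The class `SpamDistanceSketch` and the weights in the test are hypothetical; the slides do not give the actual weights, and the size and day distances are left out because their exact form is not stated.

```java
/** Hypothetical sketch of the IP distance and of the weighted average of distances. */
public final class SpamDistanceSketch {
    private SpamDistanceSketch() {}

    /** Number of bits that differ between two IPv4 addresses, divided by 32. */
    public static double ipDistance(String ip1, String ip2) {
        return Integer.bitCount(toInt(ip1) ^ toInt(ip2)) / 32.0;
    }

    /** Packs a dotted IPv4 address such as "82.4.229.158" into one 32-bit int. */
    static int toInt(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        int value = 0;
        for (String part : parts) {
            value = (value << 8) | (Integer.parseInt(part) & 0xFF);
        }
        return value;
    }

    /** Weighted average of feature distances; each distance is expected in [0, 1]. */
    public static double weightedAverage(double[] distances, double[] weights) {
        if (distances.length != weights.length) {
            throw new IllegalArgumentException("One weight per distance is required");
        }
        double weightedSum = 0;
        double totalWeight = 0;
        for (int i = 0; i < distances.length; i++) {
            weightedSum += distances[i] * weights[i];
            totalWeight += weights[i];
        }
        return weightedSum / totalWeight;
    }
}
```

Because the high-order bits of an address identify the network, two hosts in the same subnet differ only in a few low-order bits and end up with a small distance.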




3. Implementation : Spam Clustering
 ●   Center of cluster :
       ●   Text features : Longest Common Subsequence;
       ●   Charset, Geo (country code), Lang, Day :
           most often occurring value;
       ●   Size : average value.
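
The center rules above can be sketched with a textbook longest-common-subsequence routine and a most-frequent-value helper. The class `CenterSketch` is hypothetical, and folding the LCS pairwise over a list is an assumption: it is a simple heuristic, not an exact LCS of many strings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helpers for computing a cluster center from its spams. */
public final class CenterSketch {
    private CenterSketch() {}

    /** Longest common subsequence of two strings (dynamic programming). */
    public static String lcs(String a, String b) {
        int n = a.length();
        int m = b.length();
        // len[i][j] = LCS length of a.substring(i) and b.substring(j)
        int[][] len = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                len[i][j] = a.charAt(i) == b.charAt(j)
                        ? len[i + 1][j + 1] + 1
                        : Math.max(len[i + 1][j], len[i][j + 1]);
            }
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (a.charAt(i) == b.charAt(j)) {
                out.append(a.charAt(i));
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return out.toString();
    }

    /** Text feature of the center: LCS folded over all values (pairwise heuristic). */
    public static String lcsOfAll(List<String> values) {
        String center = values.get(0);
        for (int k = 1; k < values.size(); k++) {
            center = lcs(center, values.get(k));
        }
        return center;
    }

    /** Categorical feature of the center: the most often occurring value. */
    public static String mostFrequent(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String v : values) {
            int c = counts.merge(v, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = v;
            }
        }
        return best;
    }
}
```

Subjects that share a template but differ in an order number collapse to the template, so a center's subject reads like a pattern rather than a real subject line.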




4. Benchmarks
 ●   Small Cluster : 3 nodes
       ●   Single core
       ●   2GB RAM
       ●   Gigabit Ethernet network
 ●   Data replication : 3




4. Benchmarks
 ●   n = 1M spams
 ●   k = 30
 ●   i = 10
     => 1131 sec




4. Benchmarks : scalability
[Figure : execution time (sec), axis from 0 to 3500, for 1 node, 2 nodes and 3 nodes]

4. Benchmarks : Hadoop Overhead
Sequential :                          2424 sec
3 servers (theoretical) :             808 sec
3 servers (real) :                    1131 sec
Overhead :                            323 sec (40% of the theoretical time)
  No data (setup) :                   76 sec  (9.5%)
  Trivial distance (setup + sort) :   242 sec
  Sort :                              166 sec (20.5%)
  Remaining :                         81 sec  (10%)
[Figure labelled "MPI Jumpshot"]
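
The figures above are linked by simple arithmetic, worked out below. The speedup and efficiency values are derived here from the slide's numbers and do not appear on the slides; the percentages are read as fractions of the 808 sec theoretical time, which is the only reading that reproduces them.

```latex
\begin{align*}
T_{\text{theoretical}} &= \frac{2424}{3} = 808 \text{ s} \\
\text{overhead} &= 1131 - 808 = 323 \text{ s}, \qquad \frac{323}{808} \approx 40\% \\
\text{sort} &= 242 - 76 = 166 \text{ s}, \qquad \frac{166}{808} \approx 20.5\% \\
\text{remaining} &= 323 - 242 = 81 \text{ s}, \qquad \frac{81}{808} \approx 10\% \\
\text{speedup} &= \frac{2424}{1131} \approx 2.14, \qquad \text{efficiency} \approx \frac{2.14}{3} \approx 71\%
\end{align*}
```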
4. Benchmarks : Weka and Mahout
 ●   10 million 2D points
 ●   Weka (sequential) :   5355 sec
 ●   Hadoop :              1841 sec (2.9x faster)
 ●   Mahout :              more than 4 h ?




4. Benchmarks
 ●   Bigger cluster :
      ●   27 nodes
      ●   2 x 4 cores
      ●   16 GB
 ●   Deployment:
      ●   Shared home dir (NFS)
      ●   Custom setup script
      ●   Executed on all nodes through SSH


4. Benchmarks : Comparison (cluster 1M spams)

                     Small cluster      Bigger cluster      Ratio
       ●   Cores     3                  216                 x 72
       ●   k         30                 4000                x 133
       ●   Time      1131 sec           2484 sec

     Expected time on the bigger cluster : 2089 sec
     Difference : 19%
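
The expected time can be reproduced under one assumption that the slide does not state explicitly: with n and i fixed, the running time grows linearly with k (as in O(nki)) and shrinks linearly with the number of cores.

```latex
\begin{align*}
T_{\text{expected}} &= 1131 \times \frac{133}{72} \approx 2089 \text{ s} \\
\frac{T_{\text{real}}}{T_{\text{expected}}} &= \frac{2484}{2089} \approx 1.19 \quad \Rightarrow \quad \text{difference} \approx 19\%
\end{align*}
```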




4. Benchmarks : Profiling and optimization
 ●   With String dates :   1131 sec
 ●   With timestamps :     770 sec (- 32%)




5. Results
 ●   "Your receipt #"
      ●    From: ""
      ●    To: "@domain4.com"
 ●   "LinkedIn Messages, /0/2010"
      ●    From: "adjustsc5837@rodneymoore.com"
      ●    To: "@domain0140.com"
 ●   ""
      ●    From: "LiliKepp5219@telemar.net.br"
      ●    To: "@domain4.c"

5. Results Visualization
 ●   "eil rder #"
       ●   From: "hilton_ns@datares.com.my"




Conclusion
 ●   Hadoop allows faster clustering
 ●   But:
     ●   Limitations
     ●   Lacks a graphical performance analysis tool (like MPI Jumpshot)
     ●   Programmer needs to understand the inner workings!
 ●   A lot of room for improvement:
     ●   Memcached to store intermediate centers?
     ●   MPI to intercept method calls between JVMs?
     ●   Selection of initial centers (canopy?), stop criterion?
     ●   Distance computation (WOWA)
     ●   Clustering algorithm (online clustering)
     ●   Influence of data locality and data size?
Questions ?




Thibault Debatty        Parallel Spam Clustering with Apache Hadoop   37

Más contenido relacionado

La actualidad más candente

Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaAndrew Montalenti
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Sonal Raj
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormNati Shalom
 
Realtime processing with storm presentation
Realtime processing with storm presentationRealtime processing with storm presentation
Realtime processing with storm presentationGabriel Eisbruch
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceP. Taylor Goetz
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataTravis Oliphant
 
Storm presentation
Storm presentationStorm presentation
Storm presentationShyam Raj
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleDataWorks Summit/Hadoop Summit
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter StormUwe Printz
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormEugene Dvorkin
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesDataWorks Summit/Hadoop Summit
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012Dan Lynn
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time ComputationSonal Raj
 

La actualidad más candente (19)

Real-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and KafkaReal-time streams and logs with Storm and Kafka
Real-time streams and logs with Storm and Kafka
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Realtime processing with storm presentation
Realtime processing with storm presentationRealtime processing with storm presentation
Realtime processing with storm presentation
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
 
Scale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyDataScale up and Scale Out Anaconda and PyData
Scale up and Scale Out Anaconda and PyData
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as example
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter Storm
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 
Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 
Storm Real Time Computation
Storm Real Time ComputationStorm Real Time Computation
Storm Real Time Computation
 

Destacado

Multi-Agent System for APT Detection
Multi-Agent System for APT DetectionMulti-Agent System for APT Detection
Multi-Agent System for APT DetectionThibault Debatty
 
Building k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text DataBuilding k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text DataThibault Debatty
 
Apt sharing tisa protalk 2-2554
Apt sharing tisa protalk 2-2554Apt sharing tisa protalk 2-2554
Apt sharing tisa protalk 2-2554TISA
 
Advanced Persistent Threats (Shining the Light on the Industries' Best Kept S...
Advanced Persistent Threats (Shining the Light on the Industries' Best Kept S...Advanced Persistent Threats (Shining the Light on the Industries' Best Kept S...
Advanced Persistent Threats (Shining the Light on the Industries' Best Kept S...Security B-Sides
 
2015 APT APC Result Letter (APT Program)
2015 APT APC Result Letter (APT Program)2015 APT APC Result Letter (APT Program)
2015 APT APC Result Letter (APT Program)Andre Phyffer
 
Understanding advanced persistent threats (APT)
Understanding advanced persistent threats (APT)Understanding advanced persistent threats (APT)
Understanding advanced persistent threats (APT)Dan Morrill
 
Persistence is Key: Advanced Persistent Threats
Persistence is Key: Advanced Persistent ThreatsPersistence is Key: Advanced Persistent Threats
Persistence is Key: Advanced Persistent ThreatsSameer Thadani
 
NTXISSACSC2 - Advanced Persistent Threat (APT) Life Cycle Management Monty Mc...
NTXISSACSC2 - Advanced Persistent Threat (APT) Life Cycle Management Monty Mc...NTXISSACSC2 - Advanced Persistent Threat (APT) Life Cycle Management Monty Mc...
NTXISSACSC2 - Advanced Persistent Threat (APT) Life Cycle Management Monty Mc...North Texas Chapter of the ISSA
 
Introduction to Advanced Persistent Threats (APT) for Non-Security Engineers
Introduction to Advanced Persistent Threats (APT) for Non-Security EngineersIntroduction to Advanced Persistent Threats (APT) for Non-Security Engineers
Introduction to Advanced Persistent Threats (APT) for Non-Security EngineersOllie Whitehouse
 
Security Intelligence: Advanced Persistent Threats
Security Intelligence: Advanced Persistent ThreatsSecurity Intelligence: Advanced Persistent Threats
Security Intelligence: Advanced Persistent ThreatsPeter Wood
 
Common Techniques To Identify Advanced Persistent Threat (APT)
Common Techniques To Identify Advanced Persistent Threat (APT)Common Techniques To Identify Advanced Persistent Threat (APT)
Common Techniques To Identify Advanced Persistent Threat (APT)Yuval Sinay, CISSP, C|CISO
 

Destacado (11)

Multi-Agent System for APT Detection
Multi-Agent System for APT DetectionMulti-Agent System for APT Detection
Multi-Agent System for APT Detection
 
Building k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text DataBuilding k-nn Graphs From Large Text Data
Building k-nn Graphs From Large Text Data
 
Apt sharing tisa protalk 2-2554
Apt sharing tisa protalk 2-2554Apt sharing tisa protalk 2-2554
Apt sharing tisa protalk 2-2554
 
Advanced Persistent Threats (Shining the Light on the Industries' Best Kept S...
Advanced Persistent Threats (Shining the Light on the Industries' Best Kept S...Advanced Persistent Threats (Shining the Light on the Industries' Best Kept S...
Advanced Persistent Threats (Shining the Light on the Industries' Best Kept S...
 
2015 APT APC Result Letter (APT Program)
2015 APT APC Result Letter (APT Program)2015 APT APC Result Letter (APT Program)
2015 APT APC Result Letter (APT Program)
 
Understanding advanced persistent threats (APT)
Understanding advanced persistent threats (APT)Understanding advanced persistent threats (APT)
Understanding advanced persistent threats (APT)
 
Persistence is Key: Advanced Persistent Threats
Persistence is Key: Advanced Persistent ThreatsPersistence is Key: Advanced Persistent Threats
Persistence is Key: Advanced Persistent Threats
 
NTXISSACSC2 - Advanced Persistent Threat (APT) Life Cycle Management Monty Mc...
NTXISSACSC2 - Advanced Persistent Threat (APT) Life Cycle Management Monty Mc...NTXISSACSC2 - Advanced Persistent Threat (APT) Life Cycle Management Monty Mc...
NTXISSACSC2 - Advanced Persistent Threat (APT) Life Cycle Management Monty Mc...
 
Introduction to Advanced Persistent Threats (APT) for Non-Security Engineers
Introduction to Advanced Persistent Threats (APT) for Non-Security EngineersIntroduction to Advanced Persistent Threats (APT) for Non-Security Engineers
Introduction to Advanced Persistent Threats (APT) for Non-Security Engineers
 
Security Intelligence: Advanced Persistent Threats
Security Intelligence: Advanced Persistent ThreatsSecurity Intelligence: Advanced Persistent Threats
Security Intelligence: Advanced Persistent Threats
 
Common Techniques To Identify Advanced Persistent Threat (APT)
Common Techniques To Identify Advanced Persistent Threat (APT)Common Techniques To Identify Advanced Persistent Threat (APT)
Common Techniques To Identify Advanced Persistent Threat (APT)
 

Similar a Parallel SPAM Clustering with Hadoop

Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)tsliwowicz
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
Zero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter MigrationZero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter MigrationScott Miao
 
Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!DataWorks Summit
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talksyhadoop
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjugDavid Morin
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Sparktsliwowicz
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
SD, a P2P bug tracking system
SD, a P2P bug tracking systemSD, a P2P bug tracking system
SD, a P2P bug tracking systemJesse Vincent
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
Deploying Hadoop-based Bigdata Environments
Deploying Hadoop-based Bigdata Environments Deploying Hadoop-based Bigdata Environments
Deploying Hadoop-based Bigdata Environments buildacloud
 
Deploying Hadoop-Based Bigdata Environments
Deploying Hadoop-Based Bigdata EnvironmentsDeploying Hadoop-Based Bigdata Environments
Deploying Hadoop-Based Bigdata EnvironmentsPuppet
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideDanairat Thanabodithammachari
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 

Similar a Parallel SPAM Clustering with Hadoop (20)

Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Zero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter MigrationZero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter Migration
 
Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!Hadoop Hardware @Twitter: Size does matter!
Hadoop Hardware @Twitter: Size does matter!
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjug
 
Taboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache SparkTaboola Road To Scale With Apache Spark
Taboola Road To Scale With Apache Spark
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DMUpgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
 
SD, a P2P bug tracking system
SD, a P2P bug tracking systemSD, a P2P bug tracking system
SD, a P2P bug tracking system
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
002 Introduction to hadoop v3
002   Introduction to hadoop v3002   Introduction to hadoop v3
002 Introduction to hadoop v3
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Deploying Hadoop-based Bigdata Environments
Deploying Hadoop-based Bigdata Environments Deploying Hadoop-based Bigdata Environments
Deploying Hadoop-based Bigdata Environments
 
Deploying Hadoop-Based Bigdata Environments
Deploying Hadoop-Based Bigdata EnvironmentsDeploying Hadoop-Based Bigdata Environments
Deploying Hadoop-Based Bigdata Environments
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guide
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 

Parallel SPAM Clustering with Hadoop

  • 1. Parallel Spam Clustering with Apache Hadoop Thibault Debatty
  • 2. Spam
      ● 70% of total email volume
      ● Estimated cost: $20.5 billion/year
      ● Fighting spam more effectively requires better strategic knowledge
      ● Examples:
          ● “Guaranteed Results”
          ● “Make YourPenis 3-inches longer & thicker, girl will love you 1k”
  • 3. Spam
      ● 70% of total email volume
      ● Estimated cost: $20.5 billion/year
      ● Fighting spam more effectively requires better strategic knowledge
      ● Examples (annotated on the slide with "Close IP" and "Same domain"):
          ● “Guaranteed Results”
          ● “Make YourPenis 3-inches longer & thicker, girl will love you 1k”
  • 4. Problem statement
      ● Cluster spams in parallel:
          ● To get useful insights
          ● Fast!
      ● Dataset: 1 million spams (231 MB)
  • 5. Problem statement
      ● Subject   Your Special Order #253650
      ● Charset   windows-1250
      ● Geo       GB
      ● Day       2010-10-01
      ● Host      virginmedia.com
      ● ip        82.4.229.158
      ● Lang      english
      ● Size      1482
      ● From      berry_wagnertl@migrosbank.ch
      ● Rcpt      brady@domain0140.com
  • 6. What's next...
      1. MapReduce and Apache Hadoop
      2. Parallel K-means
      3. Implementation
      4. Benchmarks and speedup analysis
      5. Cluster visualization
  • 7. 1. MapReduce
      ● Model for processing large data sets
      ● Master node splits and distributes the dataset
      ● 2 steps:
          1. Map: worker nodes process data and pass partial results to the master
          2. Reduce: master combines partial results
      ● Also the name of Google's implementation
  • 8. 1. Apache Hadoop
      ● Free implementation of MapReduce
      ● Written in Java
      ● Processes large amounts of data (PB)
      ● Used by:
          ● Yahoo: more than 10,000 cores
          ● Facebook: 30 PB of data
      ● Distributed filesystem (HDFS) + data locality
  • 9. 1. Apache Hadoop
      ● Job Tracker
          ● ≃ Master
          ● Divides input data into “splits”
          ● Schedules map tasks (with data locality)
          ● Schedules reduce tasks on nodes
          ● Checks task health
  • 10. 1. Apache Hadoop
      (diagram: map tasks emit <key, value> pairs; reduce tasks receive <key, list of values>)
  • 11. 2. KMeans
      ● Select initial centers
      ● Until the stop criterion is reached:
          ● Assign each point to the closest center
          ● Compute the new centers
      ● Advantages:
          ● Suited to large datasets
          ● Can be implemented in parallel
          ● Computation O(nki) for n points, k centers and i iterations
  • 12. 2. Parallel KMeans
      ● “Parallel K-Means Clustering Based on MapReduce”, Weizhong Zhao, Huifang Ma and Qing He
      ● Map (point):
          ● Compute distance to each center
          ● Output <id of closest center, point>
      ● Reduce (list of points):
          ● Compute center
          ● Output <center>
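
The map and reduce roles of slide 12 can be illustrated without Hadoop. The following is a minimal in-memory sketch for 2D points, assuming squared Euclidean distance; the class name `KMeansIterationSketch` and its methods are hypothetical and are not taken from the presentation's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of one k-means iteration; no Hadoop classes are used.
class KMeansIterationSketch {

    // Map step: return the id (index) of the center closest to the point.
    static int closestCenter(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dx = point[0] - centers[c][0];
            double dy = point[1] - centers[c][1];
            double dist = dx * dx + dy * dy; // squared Euclidean distance (assumption)
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Reduce step: the new center is the mean of the points assigned to it.
    static double[] mean(List<double[]> points) {
        double sx = 0;
        double sy = 0;
        for (double[] p : points) {
            sx += p[0];
            sy += p[1];
        }
        return new double[] {sx / points.size(), sy / points.size()};
    }

    // One iteration: map every point, group by center id, reduce each group.
    static double[][] iterate(double[][] points, double[][] centers) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (double[] p : points) {
            groups.computeIfAbsent(closestCenter(p, centers), k -> new ArrayList<>()).add(p);
        }
        double[][] next = new double[centers.length][];
        for (int c = 0; c < centers.length; c++) {
            List<double[]> assigned = groups.get(c);
            // A center with no assigned point keeps its old position.
            next[c] = (assigned == null) ? centers[c] : mean(assigned);
        }
        return next;
    }
}
```

In a real MapReduce job the grouping by center id is done by the framework's shuffle and sort phase rather than by an explicit map structure.
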
  • 13. 3. Implementation: KMeans
      ● Abstract KMeans
      ● Abstract KMeansMapper
      ● Abstract KMeansReducer
      ● Interface IPoint
      ● Interface ICenter
      ● 2 concrete implementations:
          ● Spam
          ● Simple 2D points
  • 14. 3. Implementation: Abstract KMeans
      // Write the initial centers to "/it_0/part-00000"
      this.writeInitialCentroids();
      // One MapReduce job per iteration
      for (…) {
          conf.setMapperClass(this.mapper);
          conf.setReducerClass(this.reducer);
          // The mappers use this value to locate the current centers
          conf.setInt("iteration", iteration);
          // The centers of the next iteration are written to "/it_<iteration + 1>"
          SetOutputPath(... "/it_" + (iteration + 1));
          ...
      }
  • 15. 3. Implementation: Abstract KMeansMapper
      public void configure(JobConf job) {
          // Reads the current centers from
          // "/it_" + job.get("iteration") + "/part-xxxxx"
          this.fetchCenters(job);
      }
      public void map(key, value, ...) {
          // The concrete subclass decides which kind of point to build
          IPoint point = this.createPointInstance();
          point.parse(value);
          ...
      }
      public abstract IPoint createPointInstance();
      public abstract ICenter createCenterInstance();
  • 16. 3. Implementation: Abstract KMeansReducer
      public void reduce(key, values, …) {
          new_center = this.createCenterInstance();
          new_center.setOldCenter(old_center);
          while (values.hasNext()) {
              // point is built from values.next() (elided on the slide)
              new_center.addPoint(point);
          }
          new_center.compute();
          output.collect(new_center);
      }
      public abstract IPoint createPointInstance();
      public abstract ICenter createCenterInstance();
  • 17. 3. Implementation: Spam Clustering
      ● Distance between spams: weighted average of feature distances
      ● Text features: Jaro distance
  • 18. 3. Implementation: Spam Clustering
      Jaro similarity = 1/3 × (m/|s1| + m/|s2| + (m − t)/m)
      Where:
          ● |s1|, |s2| = lengths of the two strings;
          ● m = number of matching characters;
          ● t = number of matching characters not located at the same position / 2.
      Matching = same character, not farther apart than max(|s1|, |s2|)/2 − 1 positions (rounded down)
      => Takes misspellings into account
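
The Jaro definition on slide 18 can be sketched as follows. This uses the standard formulation of the Jaro similarity, and it assumes that the "Jaro distance" of the slides is 1 minus the similarity; the class name `JaroSketch` is hypothetical and this is not the presentation's own code.

```java
// Hypothetical sketch of the standard Jaro similarity between two strings.
class JaroSketch {

    static double similarity(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) {
            return 1.0;
        }
        // Two equal characters match only if they are at most `window` positions apart.
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];
        int m = 0; // number of matching characters
        for (int i = 0; i < s1.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(s2.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) {
            return 0.0;
        }
        // Walk both sequences of matched characters and count those out of order.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[j]) {
                j++;
            }
            if (s1.charAt(i) != s2.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        double t = outOfOrder / 2.0; // number of transpositions
        return (m / (double) s1.length() + m / (double) s2.length() + (m - t) / m) / 3.0;
    }

    // Assumption: distance is defined as 1 - similarity.
    static double distance(String s1, String s2) {
        return 1.0 - similarity(s1, s2);
    }
}
```

For the classic pair "MARTHA" and "MARHTA", all six characters match and one transposition is counted, which gives a similarity of 17/18.
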
  • 19. 3. Implementation: Spam Clustering
      Distance between spams: weighted average of feature distances
      ● Text features: Jaro distance
      ● IP: number of different bits / 32
      ● Size: max 10% difference
      ● Day: arctangent-shaped function
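
Two parts of slide 19 are specified precisely enough to sketch: the IP distance (number of different bits / 32) and the weighted average. In this minimal sketch the class name `FeatureDistanceSketch`, the method names and any weights passed in are hypothetical; the slides do not give the actual weights, nor the exact size and day functions, so those are left out.

```java
// Hypothetical sketch of the IP distance and the weighted average of feature distances.
class FeatureDistanceSketch {

    // Parse a dotted IPv4 address such as "82.4.229.158" into a 32-bit integer.
    static int parseIpv4(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Invalid octet in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // Number of differing bits between the two addresses, divided by 32.
    static double ipDistance(String a, String b) {
        return Integer.bitCount(parseIpv4(a) ^ parseIpv4(b)) / 32.0;
    }

    // Weighted average of per-feature distances; the weights are caller-supplied assumptions.
    static double weightedAverage(double[] distances, double[] weights) {
        double sum = 0;
        double weightSum = 0;
        for (int i = 0; i < distances.length; i++) {
            sum += weights[i] * distances[i];
            weightSum += weights[i];
        }
        return sum / weightSum;
    }
}
```

With this definition, two addresses that differ only in their last bit are at distance 1/32, which is consistent with the "Close IP" observation on slide 3.
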
  • 20. 3. Implementation: Spam Clustering
  • 21. 3. Implementation: Spam Clustering
      ● Center of cluster:
          ● Text features: Longest Common Subsequence;
          ● Charset, Geo (country code), Lang, Day: most frequently occurring value;
          ● Size: average value.
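
The center computation of slide 21 can be sketched for two of its rules: the longest common subsequence of two text values and the most frequent value of a categorical feature. The class name `CenterSketch` is hypothetical, and the slides do not say how the subsequence is extended to more than two strings, so this sketch only covers the pairwise case.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of two center-computation rules.
class CenterSketch {

    // Longest common subsequence of two strings, by dynamic programming.
    static String lcs(String a, String b) {
        // len[i][j] = length of the LCS of a.substring(i) and b.substring(j)
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = a.length() - 1; i >= 0; i--) {
            for (int j = b.length() - 1; j >= 0; j--) {
                if (a.charAt(i) == b.charAt(j)) {
                    len[i][j] = 1 + len[i + 1][j + 1];
                } else {
                    len[i][j] = Math.max(len[i + 1][j], len[i][j + 1]);
                }
            }
        }
        // Rebuild one LCS by following the table from the top-left corner.
        StringBuilder out = new StringBuilder();
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            if (a.charAt(i) == b.charAt(j)) {
                out.append(a.charAt(i));
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return out.toString();
    }

    // Most frequent value in the list; on a tie, the value that reached the top count first.
    static String mostFrequent(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String v : values) {
            int count = counts.merge(v, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = v;
            }
        }
        return best;
    }
}
```

Applied to two subjects that share a prefix but end in unrelated order numbers, `lcs` keeps only the shared part, which matches the shape of cluster centers such as "Your receipt #" on the results slides.
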
  • 22. 4. Benchmarks
      ● Small cluster: 3 nodes
          ● Single core
          ● 2 GB RAM
          ● Gigabit Ethernet network
          ● Data replication: 3
  • 23. 4. Benchmarks
      ● n = 1M spams
      ● k = 30
      ● i = 10
      => 1131 sec
  • 24. 4. Benchmarks: scalability
      (bar chart: execution time in seconds, axis from 0 to 3500, for 1 node, 2 nodes and 3 nodes)
  • 25. 4. Benchmarks: scalability
  • 26. 4. Benchmarks: Hadoop Overhead
      Sequential: 2424 sec
      3 servers (theoretical): 808 sec
      3 servers (real): 1131 sec
      Overhead: 323 sec (40%)
  • 27. 4. Benchmarks: Hadoop Overhead
      Sequential: 2424 sec
      3 servers (theoretical): 808 sec
      3 servers (real): 1131 sec
      Overhead: 323 sec (40%)
      MPI Jumpshot
  • 28. 4. Benchmarks: Hadoop Overhead
      Sequential: 2424 sec
      3 servers (theoretical): 808 sec
      3 servers (real): 1131 sec
      Overhead: 323 sec (40%)
          No data (setup): 76 sec (9.5%)
          Trivial distance (setup + sort): 242 sec
          Sort: 166 sec (20.5%)
          Remaining: 81 sec (10%)
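
The figures on slides 26 to 28 fit together as shown below. The percentages appear to be taken relative to the theoretical 808 sec; that reading is an inference from the numbers, not something the slides state. The speedup on the last line is derived from the slide values and is not given on the slides.

```latex
\begin{align*}
T_{\text{theoretical}} &= \frac{2424}{3} = 808 \text{ s} \\
T_{\text{overhead}} &= 1131 - 808 = 323 \text{ s}, & \frac{323}{808} &\approx 0.40 \\
T_{\text{sort}} &= 242 - 76 = 166 \text{ s}, & \frac{166}{808} &\approx 0.205 \\
T_{\text{remaining}} &= 323 - 242 = 81 \text{ s}, & \frac{81}{808} &\approx 0.10 \\
\text{speedup} &= \frac{2424}{1131} \approx 2.14
\end{align*}
```
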
  • 29. 4. Benchmarks: Weka and Mahout
      ● 10 million 2D points
      ● Weka (sequential): 5355 sec
      ● Hadoop: 1841 sec (2.9x faster)
      ● Mahout: more than 4 h (?)
  • 30. 4. Benchmarks
      ● Bigger cluster:
          ● 27 nodes
          ● 2 x 4 cores
          ● 16 GB
      ● Deployment:
          ● Shared home dir (NFS)
          ● Custom setup script
          ● Executed on all nodes through SSH
  • 31. 4. Benchmarks: Cluster 1M spams
      Small cluster:        Bigger cluster:
      ● 3 cores             ● 216 cores
      ● k = 30              ● k = 4000
      ● 1131 sec            ● 2484 sec
  • 32. 4. Benchmarks: Comparison
      Small cluster:        Bigger cluster:
      ● 3 cores             ● 216 cores (x 72)
      ● k = 30              ● k = 4000 (x 133)
      ● 1131 sec            ● 2484 sec
      Expected: 2089 sec
      Difference: 19%
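
The expected time on slide 32 can be reproduced by assuming that execution time grows linearly with k (in line with the O(nki) cost from slide 11) and shrinks linearly with the number of cores. This scaling model is an interpretation of the slide, not something it states explicitly.

```latex
\begin{align*}
\frac{216}{3} &= 72, \qquad \frac{4000}{30} \approx 133 \\
T_{\text{expected}} &= 1131 \times \frac{133}{72} \approx 2089 \text{ s} \\
\frac{T_{\text{real}} - T_{\text{expected}}}{T_{\text{expected}}} &= \frac{2484 - 2089}{2089} \approx 0.19
\end{align*}
```
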
  • 33. 4. Benchmarks: Profiling and optimization
      With String dates:    With timestamps:
      ● 1131 sec            ● 770 sec (- 32%)
  • 34. 5. Results
      ● "Your receipt #"
          ● From: ""
          ● To: "@domain4.com"
      ● “LinkedIn Messages, /0/2010"
          ● From: "adjustsc5837@rodneymoore.com"
          ● To: "@domain0140.com"
      ● ""
          ● From: "LiliKepp5219@telemar.net.br"
          ● To: "@domain4.c"
  • 35. 5. Results: Visualization
      ● "eil rder #"
          ● From: "hilton_ns@datares.com.my"
  • 36. Conclusion
      ● Hadoop allows faster clustering
      ● But:
          ● Limitations
          ● Lacks a graphical performance analysis tool (like MPI Jumpshot)
          ● Programmer needs to understand the inner workings!
      ● A lot of room for improvement:
          ● Memcached to store intermediate centers?
          ● MPI to intercept method calls between JVMs?
          ● Selection of initial centers (canopy?), stop criterion?
          ● Distance computation (WOWA)
          ● Clustering algorithm (online clustering)
          ● Influence of data locality and data size?
  • 37. Questions?