Apache hama @ Samsung SW Academy

Apache Hama
a Bulk Synchronous Parallel Computing

Edward J. Yoon
<edwardyoon@apache.org>

Who Am I
• Edward J. Yoon
– @eddieyoon
• Founder of Apache Hama
• PMC member of Apache BigTop
• Oracle Employee

What’s Hama?
• Open Source
– Under Apache 2.0 License
• Written In Java
• Apache Top Level Project

Characteristics
• a General BSP computing engine
– M/R like Input/Output Formatter
• SequenceFile, Text, Accumulo, Hbase, …, etc.
– Job Manager
– Checkpoint Recovery
• Streaming and Pipes
– Python, C++, …, etc.
• Graph and Machine Learning Packages
– K-means, Gradient Descent, Collaborative Filtering

Bulk Synchronous Parallel?
• Originally introduced by Valiant
• a Sequence of supersteps

Compare to M/R and MPI
• Supports message-passing paradigm style of
application development
• Provides a flexible, simple, and easy-to-use
small APIs
• Enables to perform better than MPI for
communication-intensive applications
• Guarantees impossibility of deadlocks or
collisions in the communication mechanisms

So, fit for what?
• Processing Big Data w/ complicated
relationships
– e.g., graph or network.
• Iterative or Recursive scientific applications
• Continuous Event Processing

Could be applied to
• Analyze user actions and patterns
• Social Target Marketing
• Observe evolution of Social networks
• Detect anomaly rapidly in Real-time
• Business Intelligence

Internals
• Pluggable RPC Architecture for message
transfer
– e.g., Hadoop RPC, Avro RPC, …, etc.
• Message Collector, Bundler, and
Compressor to reduce network overheads
and contentions
– e.g., Snappy, Bzip2, …, etc.

BSP API

public abstract void bsp(BSPPeer<K1, V1, K2, V2, M> peer)
throws IOException, SyncException;

BSP Examples
• Pi Calculation
• Sparse Matrix-Vector Multiplication
• K-means Clustering
• Gradient Descent

Graph API

public void compute(Iterator<M> messages)
throws IOException;

Graph Examples
• In-link Count
• Single Source Shortest Path
• Pagerank
• Bipartitie Matching
• Semi-Clustering

SSSP Performance
• a SSSP for random
graph of 1 billion edges
is computed in 400
seconds on 1 Oracle
BDA

Apache hama @ Samsung SW Academy

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (19)

Similar a Apache hama @ Samsung SW Academy

Similar a Apache hama @ Samsung SW Academy (20)

Más de Edward Yoon

Más de Edward Yoon (12)

Apache hama @ Samsung SW Academy