Brief introduction on Hadoop, Dremel, Pig, FlumeJava and Cassandra

A Brief Discussion on: Hadoop
MapReduce, Pig,
FlumeJava, Cascading & Dremel

Presented By: Somnath Mazumdar
29th Nov 2011

MapReduce
è  Based on Google's MapReduce Programming Framework
è  FileSystem: GFS for MapReduce ... HDFS for Hadoop
è  Language: MapReduce is written in C++ but Hadoop is in Java
è  Basic Functions : Map and Reduce inspired by similar primitives in
LISP and other languages...
Why we should use ???
l  Automatic parallelization and distribution
l  Fault-tolerance
l  I/O scheduling
l  Status and monitoring

MapReduce
Map Function:
(1)  Processes an input key/value pair
(2)  Produces a set of intermediate pairs
Syntax:
map (key, value) -> list(key, inter_value)

Reduce Function:
(1)  Combines all intermediate values for a particular key
(2)  Produces a set of merged output values
Syntax:
reduce (out_key, list(inter_value)) -> list(out_value)

Programming Model
[Figure: word count example]
HDFS -> Map Phase -> Intermediate Result -> Reduce Phase -> HDFS

Map Phase:
M1: "Hello World, Bye World!" -> (Hello, 1) (Bye, 1) (World, 1) (World, 1)
M2: "Welcome to UCD, Goodbye to UCD." -> (Welcome, 1) (to, 1) (to, 1) (Goodbye, 1) (UCD, 1) (UCD, 1)
M3: "Hello MapReduce, Goodbye to MapReduce." -> (Hello, 1) (to, 1) (Goodbye, 1) (MapReduce, 1) (MapReduce, 1)

Reduce Phase:
R1: (Hello, 2) (Bye, 1) (Welcome, 1) (to, 3)
R2: (World, 2) (UCD, 2) (Goodbye, 2) (MapReduce, 2)
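
The word count example above can be reproduced with a minimal single-process sketch in Python. This is not Hadoop code: the names map_fn, shuffle, reduce_fn and word_count are made up for illustration, and the shuffle step stands in for what the framework does between the Map and Reduce phases.

```python
import re
from collections import defaultdict


def map_fn(key, value):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in re.findall(r"[A-Za-z]+", value)]


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in real MapReduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(out_key, inter_values):
    """Reduce: merge all intermediate values of one key into a list of outputs."""
    return [sum(inter_values)]


def word_count(splits):
    intermediate = []
    for name, text in splits.items():
        intermediate.extend(map_fn(name, text))
    return {word: reduce_fn(word, values)[0]
            for word, values in shuffle(intermediate).items()}


# The three input splits from the slide.
splits = {
    "M1": "Hello World, Bye World!",
    "M2": "Welcome to UCD, Goodbye to UCD.",
    "M3": "Hello MapReduce, Goodbye to MapReduce.",
}
counts = word_count(splits)
```

In a real cluster the map calls run in parallel on different machines and the keys are partitioned across several reducers (R1 and R2 on the slide); the sketch keeps everything in one process to show only the data flow.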

MapReduce
Applications:
(1)  Distributed grep & Distributed sort
(2)  Web link-graph reversal,
(3)  Web access log stats,
(4)  Document clustering,
(5)  Machine Learning and so on...

To know more:

è  MapReduce: Simplified Data Processing on Large Clusters
by Jeffrey Dean and Sanjay Ghemawat, Google, Inc.

è  Hadoop: The Definitive Guide - O'Reilly Media

PIG
è  First Pig developed at Yahoo Research around 2006 later moved to
Apache Software Foundation
è  Pig is a data flow programming environment for processing large files
based on MapReduce / Hadoop.
è  High-level platform for creating MapReduce programs used
with Hadoop and HDFS
è  Apache library that interprets scripts written in Pig Latin and runs
them on a Hadoop cluster.

At Yahoo! 40% of all Hadoop jobs are run with Pig

PIG
Workflow:
First step: Load the input data.
Second step: Manipulate the data with operations such as filter,
foreach, distinct, or any user-defined functions.
Third step: Group the data.
Final step: Write the data into the DFS, or repeat the steps if another
dataset arrives.
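
The workflow steps can be sketched in plain Python to show the shape of the data flow. This is not Pig Latin and does not use the Pig engine; the record layout ("user,url") and the function names load and run_workflow are assumptions made for illustration only.

```python
from collections import defaultdict


def load(lines):
    """First step: load input, parsing 'user,url' lines into tuples."""
    return [tuple(line.split(",")) for line in lines]


def run_workflow(lines):
    records = load(lines)

    # Second step: manipulate the data.
    filtered = [r for r in records if r[1] != ""]                # filter
    projected = [(user, url.lower()) for user, url in filtered]  # foreach
    unique = sorted(set(projected))                              # distinct

    # Third step: group the data by user.
    grouped = defaultdict(list)
    for user, url in unique:
        grouped[user].append(url)

    # Final step: a real job would write to the DFS; here we return the result.
    return dict(grouped)


lines = ["alice,A.com", "alice,a.com", "bob,b.org", "carol,"]
result = run_workflow(lines)
```

Each step consumes the output of the previous one, which is the data-flow style that a Pig Latin script expresses declaratively and that Pig turns into Hadoop jobs.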

Scripts written in Pig Latin ---[Pig library/engine]---> Hadoop-ready jobs

Take-away point: Do more with data, not with functions.

Cascading
A query API and query planner for defining, sharing, and executing data
processing workflows.

Supports creating and executing complex data processing workflows on a
Hadoop cluster using any JVM-based language (Java, JRuby, Clojure,
etc.).

Originally authored by Chris Wensel (founder of Concurrent, Inc.)
What does it offer?
Data Processing API (core)
Process Planner
Process Scheduler
How to use it?
1. Install Hadoop.
2. Deploy the Hadoop job .jar, which must contain the Cascading .jars.

Cascading: ‘Source-Pipe-Sink’
How does it work?
Source: Data is captured from sources.
Pipes: Created independently of the data they will process, which makes
‘pipes’ reusable.
Sinks: Results are stored in output files, or ‘sinks’.
The Data Processing API provides the Source-Pipe-Sink mechanism.
Once a pipe assembly is tied to data sources and sinks, it is called a
‘flow’ (topological scheduler). Flows can be grouped into a
‘cascade’ (CascadeConnector class), and the process scheduler will
ensure a given flow does not execute until all its dependencies are
satisfied.
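
A minimal Python sketch of the Source-Pipe-Sink idea and of dependency-ordered flows follows. It does not use the Cascading API: the Flow class, run_cascade function and the in-memory store are hypothetical stand-ins, and the ordering uses Python's standard graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Flow:
    """A pipe bound to a named source and a named sink."""

    def __init__(self, name, source, pipe, sink):
        self.name, self.source, self.pipe, self.sink = name, source, pipe, sink

    def run(self, store):
        store[self.sink] = self.pipe(store[self.source])


def run_cascade(flows, store):
    """Run flows so that no flow starts before the flow producing its source."""
    by_name = {f.name: f for f in flows}
    produced_by = {f.sink: f.name for f in flows}
    graph = {
        f.name: ({produced_by[f.source]} if f.source in produced_by else set())
        for f in flows
    }
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        by_name[name].run(store)
    return order


# Pipes are plain functions, defined independently of any data.
def tokenize(lines):
    return [word for line in lines for word in line.split()]


def count(words):
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


store = {"raw": ["to be or not to be"]}
flows = [
    Flow("count", "tokens", count, "counts"),    # listed first on purpose
    Flow("tokenize", "raw", tokenize, "tokens"),
]
order = run_cascade(flows, store)
```

Even though the "count" flow is listed first, it runs only after "tokenize" has written the "tokens" data it depends on.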

Cascading
Pipe assembly ---[MR job planner]---> graph of dependent MapReduce jobs.
Also provides external data interfaces for data...

It efficiently supports splits, joins, grouping, and sorting.

Uses: log file analysis, bioinformatics, machine learning, predictive
analytics, web content mining, etc.

Cascading was cited as one of the top five most powerful Hadoop projects
by SD Times in 2011.

FlumeJava
A Java library that makes it easy to develop, test, and run
efficient data-parallel pipelines.
Born in May 2009 at Google.
The library is a collection of classes representing immutable parallel
collections.
FlumeJava:
1. Abstracts away how data is represented (as an in-memory data
structure or as a file).
2. Abstracts away implementation details (a local loop or a
remote MR job).
3. Implements parallel operations using deferred evaluation.

FlumeJava
How does it work?
Step 1: Invoke a parallel operation.
Step 2: Do not run it yet. Instead:
2.1. Record the operation and its arguments.
2.2. Save them in an internal execution plan graph
structure.
2.3. Construct the execution plan for the whole computation.
Step 3: Optimize the execution plan.
Step 4: Execute it.
Faster than a typical MR pipeline with the same logical structure, and
easier to write.
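
The deferred-evaluation steps can be illustrated with a small Python sketch. It is not the FlumeJava API: the Deferred class and its methods are hypothetical, the "plan" is just a linear list of recorded functions rather than a graph, and the only optimization shown is fusing consecutive element-wise operations into one pass.

```python
class Deferred:
    """A deferred collection: it holds an execution plan, not results."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def parallel_do(self, fn):
        # Steps 1-2: record the operation instead of running it.
        return Deferred(self.source, self.ops + (fn,))

    def optimize(self):
        # Step 3: fuse the recorded chain into a single function so the
        # data is traversed once instead of once per operation.
        ops = self.ops

        def fused(x):
            for fn in ops:
                x = fn(x)
            return x

        return fused

    def run(self):
        # Step 4: execute the optimized plan.
        fused = self.optimize()
        return [fused(x) for x in self.source]


calls = []


def double(x):
    calls.append("double")
    return x * 2


def increment(x):
    calls.append("increment")
    return x + 1


plan = Deferred([1, 2, 3]).parallel_do(double).parallel_do(increment)
calls_before_run = list(calls)  # nothing has executed yet
result = plan.run()
```

Building the plan calls neither function; work happens only inside run(), which is what gives the library room to optimize the whole pipeline before executing it.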

FlumeJava
Data Model:
PCollection<T>: the central class, an immutable bag of elements of type T.
Can be unordered (a collection, more efficient) or ordered (a sequence).
PTable<K, V>: the second central class.
An immutable multi-map with keys of class K and values of class V.
Operators:
parallelDo(PCollection<T>): the core parallel primitive
groupByKey(PTable<K, V>) -> PTable<K, Collection<V>>
combineValues(PTable<K, Collection<V>>) -> PTable<K, V>
flatten(): logical view of multiple PCollections as one PCollection
join()
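
To make the operators concrete, here is an eager Python analogue in which a PCollection is modelled as a list and a PTable as a list of (key, value) pairs. The function names mirror the operators above but are only illustrative; the real library is Java, runs these lazily, and may execute them as MapReduce jobs.

```python
from collections import defaultdict


def parallel_do(pcollection, do_fn):
    """Apply do_fn to each element; do_fn returns zero or more outputs."""
    return [out for elem in pcollection for out in do_fn(elem)]


def group_by_key(ptable):
    """Turn (key, value) pairs into (key, [values]) pairs."""
    groups = defaultdict(list)
    for key, value in ptable:
        groups[key].append(value)
    return list(groups.items())


def combine_values(grouped, combine_fn):
    """Collapse each key's list of values into a single value."""
    return [(key, combine_fn(values)) for key, values in grouped]


def flatten(*pcollections):
    """View several collections as one."""
    return [elem for pc in pcollections for elem in pc]


lines = flatten(["hello world"], ["bye world"])
pairs = parallel_do(lines, lambda line: [(word, 1) for word in line.split()])
counts = dict(combine_values(group_by_key(pairs), sum))
```

Chaining parallel_do, group_by_key and combine_values in this way corresponds to the map, shuffle and reduce parts of a single MapReduce job.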

Dremel
A distributed system for interactive analysis of large datasets, in
production at Google since 2006.
Provides a custom, scalable data management solution built over shared
clusters of commodity machines.
Three key aspects:
1. Storage format: column-striped storage representation for
non-relational nested data (lossless representation).
Why nested?
Nested data backs a platform-neutral, extensible mechanism for
serializing structured data at Google (Protocol Buffers).
What is the main aim?
Store all values of a given field consecutively to improve retrieval
efficiency.
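
The idea of storing all values of a field consecutively can be sketched in Python as follows. This is a simplification, not Dremel's format: the stripe function and the sample records are made up, and the sketch handles only required, non-repeated fields. Dremel additionally stores repetition and definition levels so that records with repeated and optional fields can be reassembled losslessly.

```python
from collections import defaultdict


def stripe(records):
    """Split nested records into one column per field path."""
    columns = defaultdict(list)

    def walk(prefix, value):
        if isinstance(value, dict):
            for name, child in value.items():
                walk(f"{prefix}.{name}" if prefix else name, child)
        else:
            columns[prefix].append(value)

    for record in records:
        walk("", record)
    return dict(columns)


# Hypothetical nested records.
records = [
    {"doc_id": 10, "name": {"url": "http://A", "country": "us"}},
    {"doc_id": 20, "name": {"url": "http://B", "country": "gb"}},
]
columns = stripe(records)
```

A query that touches only name.country now reads one short list instead of scanning every full record, which is where the retrieval efficiency comes from.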

Dremel
2. Query language: Provides a high-level, SQL-like language to express
ad hoc queries.
It is efficiently implementable on columnar nested storage.
Fields are referenced using path expressions.
Supports nested subqueries, inter- and intra-record aggregation, joins,
etc.
3. Execution: Multi-level serving tree concept (as used in distributed
search engines).
Several queries can execute simultaneously.
The query dispatcher schedules queries based on priorities and
balances load.
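
A multi-level serving tree can be sketched in Python for a simple COUNT query. This is only an illustration under assumed names (serve, leaf_count, the tree layout and sample rows are all hypothetical): leaves scan their own rows, and each level above sums the partial results of its children.

```python
def leaf_count(rows, predicate):
    """A leaf server scans its local rows and returns a partial count."""
    return sum(1 for row in rows if predicate(row))


def serve(node, predicate):
    """Intermediate nodes fan the query out and add up partial results."""
    if isinstance(node, dict):
        return sum(serve(child, predicate) for child in node["children"])
    return leaf_count(node, predicate)


# Root -> two intermediate servers -> three leaves holding rows.
tree = {
    "children": [
        {"children": [
            [{"country": "us"}, {"country": "gb"}],
            [{"country": "us"}],
        ]},
        {"children": [
            [{"country": "ie"}, {"country": "us"}],
        ]},
    ]
}

# Roughly: SELECT COUNT(*) WHERE country = 'us'
total = serve(tree, lambda row: row["country"] == "us")
```

Because each level only combines small partial aggregates, the root never has to see the raw rows.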

I am lost... Are MR and Dremel
the same?
Features                    MapReduce (aka MR)                              Dremel
Birth year & place          Since 2004 @ Google                             Since 2006 @ Google
Type                        Distributed & parallel programming framework    Distributed interactive ad hoc query system
Scalable & fault tolerant   Yes                                             Yes
Data processing             Record oriented                                 Column oriented
Batch processing            Yes                                             No
In situ processing          No                                              Yes

Take-away point: Dremel complements MapReduce-based
computing.
