Brief introduction to Hadoop, Dremel, Pig, FlumeJava and Cassandra
1. A Brief Discussion on: Hadoop
MapReduce, Pig,
FlumeJava, Cascading & Dremel
Presented By: Somnath Mazumdar
29th Nov 2011
2. MapReduce
- Hadoop MapReduce is based on Google's MapReduce programming framework
- File system: GFS for Google MapReduce, HDFS for Hadoop
- Language: Google MapReduce is written in C++, Hadoop is written in Java
- Basic functions: Map and Reduce, inspired by similar primitives in
  LISP and other languages
Why should we use it?
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring
3. MapReduce
Map Function:
(1) Processes an input key/value pair
(2) Produces a set of intermediate pairs
Syntax:
map (key, value) -> list(key, inter_value)
Reduce Function:
(1) Combines all intermediate values for a particular key
(2) Produces a set of merged output values
Syntax:
reduce (out_key, list(inter_value)) -> list(out_value)
5. MapReduce
Applications:
(1) Distributed grep & Distributed sort
(2) Web link-graph reversal,
(3) Web access log stats,
(4) Document clustering,
(5) Machine Learning and so on...
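The map and reduce signatures, applied to word count and to application (2), can be sketched in plain Python as below. This is a single-process illustration, not Hadoop code: `run_mapreduce`, `wc_map`, `wc_reduce`, `links_map`, `links_reduce` and the sample inputs are all hypothetical names and data made up for the example.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, shuffle, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        # map(key, value) -> list(key, inter_value)
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)  # shuffle: group by key
    # reduce(out_key, list(inter_value)) -> list(out_value)
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

# Word count: emit (word, 1) for each word, then sum the ones per word.
def wc_map(doc_id, text):
    return [(word.lower(), 1) for word in text.split()]

def wc_reduce(word, ones):
    return [sum(ones)]

# Web link-graph reversal: emit (target, source), then collect the sources.
def links_map(source_url, target_urls):
    return [(target, source_url) for target in target_urls]

def links_reduce(target_url, source_urls):
    return sorted(set(source_urls))

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the end")]
counts = run_mapreduce(docs, wc_map, wc_reduce)

pages = [("a.example", ["b.example", "c.example"]),
         ("b.example", ["c.example"]),
         ("c.example", ["a.example"])]
inlinks = run_mapreduce(pages, links_map, links_reduce)
```

In a real cluster the map calls, the shuffle and the reduce calls run on different machines; only the two user functions change from one application to the next.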
To know more:
- MapReduce: Simplified Data Processing on Large Clusters
  by Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
- Hadoop: The Definitive Guide - O'Reilly Media
7. PIG
- Pig was first developed at Yahoo Research around 2006 and later moved
  to the Apache Software Foundation
- Pig is a data flow programming environment for processing large files,
  based on MapReduce / Hadoop
- High-level platform for creating MapReduce programs used
  with Hadoop and HDFS
- Apache library that interprets scripts written in Pig Latin and runs
  them on a Hadoop cluster
At Yahoo!, 40% of all Hadoop jobs are run with Pig
8. PIG
WorkFlow:
First step: Load the input data.
Second step: Manipulate the data with operations such as filter,
foreach, distinct, or any user-defined function.
Third step: Group the data.
Final stage: Write the data into the DFS, or repeat the steps if
another dataset arrives.
Scripts written in Pig Latin ---(Pig Library/Engine)---> Hadoop-ready jobs
Take away point: Do more with data, not with functions.
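The load, manipulate, group, store workflow can be mimicked in a few lines of Python. This is only an analogy for the data flow, not Pig Latin and not how the Pig engine executes; the tab-separated sample data and the function names (`load`, `keep_nonempty`, `group_by_user`, `store`) are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input: user, url, bytes transferred (tab separated).
RAW = "alice\t/home\t120\nbob\t/home\t0\nalice\t/docs\t300\nbob\t/docs\t50\n"

def load(text):
    # First step: load the input data as (user, url, bytes) tuples.
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(user, url, int(size)) for user, url, size in reader]

def keep_nonempty(rows):
    # Second step: manipulate the data, here a filter on the bytes column.
    return [row for row in rows if row[2] > 0]

def group_by_user(rows):
    # Third step: group the data, here total bytes per user.
    totals = defaultdict(int)
    for user, _url, size in rows:
        totals[user] += size
    return dict(totals)

def store(totals):
    # Final stage: write the result (a string stands in for the DFS).
    out = io.StringIO()
    for user in sorted(totals):
        out.write(f"{user}\t{totals[user]}\n")
    return out.getvalue()

result = store(group_by_user(keep_nonempty(load(RAW))))
```

Each function corresponds to one statement of a script; the point of Pig is that the script describes the data flow and the engine decides how to turn it into MapReduce jobs.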
9. Cascading
A query API and query planner for defining, sharing, and executing data
processing workflows.
Supports creating and executing complex data processing workflows on a
Hadoop cluster using any JVM-based language (Java, JRuby, Clojure,
etc.).
Originally authored by Chris Wensel (founder of Concurrent, Inc.)
What does it offer?
Data Processing API (core)
Process Planner
Process Scheduler
How to use it?
1. Install Hadoop.
2. Deploy the Hadoop job .jar, which must contain the Cascading .jars.
10. Cascading: ‘Source-Pipe-Sink’
How does it work?
Source: Data is captured from sources.
Pipes: Created independently of the data they will process, so the same
‘pipe’ can be reused.
Sinks: Results are stored in output files, or ‘sinks’.
The Data Processing API provides the Source-Pipe-Sink mechanism.
Once a pipe assembly is tied to data sources and sinks, it is called a
‘flow’ (topological scheduler). Flows can be grouped into a
‘cascade’ (CascadeConnector class), and the process scheduler
ensures that a given flow does not execute until all its dependencies
are satisfied.
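A minimal Python sketch of the source-pipe-sink idea follows. It is not the Cascading API: the `Pipe` and `Flow` classes and `run_cascade` are hypothetical stand-ins, a dictionary plays the role of the file system, and the dependency ordering uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

class Pipe:
    """A reusable chain of steps, defined without any data attached."""
    def __init__(self, *steps):
        self.steps = steps

    def run(self, records):
        for step in self.steps:
            records = step(records)
        return records

class Flow:
    """A pipe tied to a named source and a named sink."""
    def __init__(self, source, pipe, sink):
        self.source, self.pipe, self.sink = source, pipe, sink

    def run(self, store):
        store[self.sink] = self.pipe.run(store[self.source])

def run_cascade(flows, store):
    # A flow depends on whichever flow writes the data it reads.
    writers = {flow.sink: flow for flow in flows}
    deps = {flow: {writers[flow.source]} if flow.source in writers else set()
            for flow in flows}
    # Topological order: no flow runs before its dependencies.
    order = list(TopologicalSorter(deps).static_order())
    for flow in order:
        flow.run(store)
    return [flow.sink for flow in order]

clean = Pipe(lambda rs: [r.strip().lower() for r in rs],
             lambda rs: [r for r in rs if r])
count = Pipe(lambda rs: [len(rs)])

store = {"raw": [" Error ", "", "ok", "ERROR"]}
# Flows are listed out of order on purpose; the scheduler sorts them.
flows = [Flow("cleaned", count, "total"), Flow("raw", clean, "cleaned")]
executed = run_cascade(flows, store)
```

The pipes `clean` and `count` know nothing about `raw` or `total`; binding them to a source and a sink is what makes a flow, and the cascade only has to order the flows.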
11. Cascading
Pipe Assembly ---(MR Job Planner)---> graph of dependent MapReduce
jobs.
Also provides external data interfaces for data.
It efficiently supports splits, joins, grouping, and sorting.
Uses: log file analysis, bioinformatics, machine learning, predictive
analytics, web content mining, etc.
Cascading was cited by SD Times in 2011 as one of the top five most
powerful Hadoop projects.
12. FlumeJava
Java library API that makes it easy to develop, test and run
efficient data-parallel pipelines.
Born in May 2009 @ Google Lab
The library is built around classes that represent immutable parallel
collections.
FlumeJava:
1. Abstracts away how data is represented: as an in-memory data
structure or as a file.
2. Abstracts away implementation details, such as whether an operation
runs as a local loop or as a remote MR job.
3. Implements parallel operations using deferred evaluation.
13. FlumeJava
How does it work?
Step 1: Invoke the parallel operation.
Step 2: Do not run it yet. Instead:
2.1. Record the operation and its arguments.
2.2. Save them into an internal execution plan graph
structure.
2.3. Construct the execution plan for the whole computation.
Step 3: Optimize the execution plan.
Step 4: Execute it.
Faster than a typical MR pipeline with the same logical structure, and
easier to write.
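The four steps can be illustrated with a toy deferred-evaluation class in Python. This is a sketch under simplifying assumptions, not FlumeJava: the `Deferred` class is hypothetical, the plan is a flat list rather than a graph, and the only optimization shown is fusing adjacent element-wise operations into one pass.

```python
class Deferred:
    """Records operations instead of running them (deferred evaluation)."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)  # the recorded execution plan

    def parallel_do(self, fn):
        # Steps 1 and 2: invoking the operation only records it in the plan.
        return Deferred(self.source, self.ops + (("map", fn),))

    def optimize(self):
        # Step 3: fuse runs of adjacent map operations into a single pass.
        fused = []
        for kind, fn in self.ops:
            if fused and kind == "map" and fused[-1][0] == "map":
                prev = fused[-1][1]
                fused[-1] = ("map", lambda x, f=prev, g=fn: g(f(x)))
            else:
                fused.append((kind, fn))
        return Deferred(self.source, fused)

    def run(self):
        # Step 4: execute the plan, one pass over the data per operation.
        data = list(self.source)
        for _kind, fn in self.ops:
            data = [fn(x) for x in data]
        return data

plan = Deferred(range(5)).parallel_do(lambda x: x + 1).parallel_do(lambda x: x * 10)
optimized = plan.optimize()
result = optimized.run()
```

Nothing touches the data until `run()` is called; because the whole plan is known by then, the two recorded operations can be merged so the data is traversed once instead of twice.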
14. FlumeJava
Data Model:
PCollection<T>: Central class, an immutable bag of elements of type T.
Can be unordered (a collection, more efficient) or ordered (a sequence).
PTable<K, V>: Second central class.
Immutable multi-map with keys of class K and values of class V.
Operators:
parallelDo(PCollection<T>): Core parallel primitive
groupByKey(PTable<K, V>) -> PTable<K, Collection<V>>
combineValues(PTable<K, Collection<V>>) -> PTable<K, V>
flatten(): Logical view of multiple PCollections as one PCollection
join()
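How the operators compose can be shown with a rough Python analogue, where tuples stand in for immutable PCollections and tuples of (key, value) pairs stand in for PTables. The function names mirror the operators but are hypothetical Python helpers, not the FlumeJava API, and everything runs eagerly in one process.

```python
from collections import defaultdict

def parallel_do(fn, pcollection):
    # fn maps one element to zero or more output elements.
    return tuple(out for item in pcollection for out in fn(item))

def group_by_key(ptable):
    # (K, V) pairs -> (K, tuple of V) pairs.
    groups = defaultdict(list)
    for key, value in ptable:
        groups[key].append(value)
    return tuple((key, tuple(values)) for key, values in groups.items())

def combine_values(combiner, grouped):
    # (K, collection of V) pairs -> (K, V) pairs.
    return tuple((key, combiner(values)) for key, values in grouped)

def flatten(*pcollections):
    # View several collections as one.
    return tuple(item for pc in pcollections for item in pc)

lines_a = ("to be or", "not to be")
lines_b = ("be quick",)
words = parallel_do(lambda line: [(w, 1) for w in line.split()],
                    flatten(lines_a, lines_b))
counts = dict(combine_values(sum, group_by_key(words)))
```

Here `parallel_do` plays the role of the map side, `group_by_key` the shuffle, and `combine_values` the reduce side of a MapReduce job.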
15. Dremel
A distributed system for interactive analysis of large datasets, in use
at Google since 2006.
Provides a custom, scalable data management solution built over shared
clusters of commodity machines.
Three features / key aspects:
1. Storage format: Column-striped storage representation for
nonrelational nested data (lossless representation).
Why nested?
The nested data model backs a platform-neutral, extensible mechanism for
serializing structured data at Google (Protocol Buffers).
What is the main aim?
Store all values of a given field consecutively to improve retrieval
efficiency.
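The idea of storing all values of a field consecutively can be sketched in Python for the simplest case of flat records. This is a deliberate simplification: Dremel's format also handles nested and repeated fields by storing repetition and definition levels alongside the values, which this sketch omits. The names `stripe` and `assemble` and the sample records are made up for the example.

```python
def stripe(records, fields):
    # Store all values of each field consecutively (one list per column).
    return {f: [rec.get(f) for rec in records] for f in fields}

def assemble(columns):
    # Rebuild the records from the columns (lossless for this flat case).
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]

records = [
    {"doc_id": 10, "lang": "en", "size": 512},
    {"doc_id": 20, "lang": "de", "size": 128},
    {"doc_id": 30, "lang": "en", "size": 256},
]
columns = stripe(records, ["doc_id", "lang", "size"])

# A query on one field scans only that column, not whole records.
total_size = sum(columns["size"])
```

A query that touches only `size` reads one list and never looks at `doc_id` or `lang`, which is where the retrieval efficiency comes from.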
16. Dremel
2. Query language: Provides a high-level, SQL-like language to express
ad hoc queries.
It is efficiently implementable on columnar nested storage.
Fields are referenced using path expressions.
Supports nested subqueries, inter- and intra-record aggregation, joins,
etc.
3. Execution: Multi-level serving tree concept (as used in distributed
search engines).
Several queries can execute simultaneously.
The query dispatcher schedules queries based on their priorities and
balances the load.
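The serving-tree idea, where leaf servers scan their local data and each level above combines partial results, can be sketched as a recursive aggregation in Python. The tree layout, the function names (`leaf_scan`, `merge`, `serve`) and the sample numbers are hypothetical; the sketch runs in one process and leaves out query rewriting, scheduling and fault handling.

```python
def leaf_scan(shard, predicate):
    # Leaf server: scan the local shard and return a partial (count, sum).
    values = [v for v in shard if predicate(v)]
    return (len(values), sum(values))

def merge(partials):
    # Intermediate or root server: combine partial aggregates from children.
    counts, sums = zip(*partials)
    return (sum(counts), sum(sums))

def serve(node, predicate):
    # A node is either a shard (list of numbers) or a list of child nodes.
    if node and isinstance(node[0], list):
        return merge([serve(child, predicate) for child in node])
    return leaf_scan(node, predicate)

tree = [
    [[1, 8, 3], [10, 2]],  # intermediate server with two leaves
    [[7, 7], [4, 9, 6]],   # intermediate server with two leaves
]
count, total = serve(tree, lambda v: v > 5)
```

Only small partial aggregates travel up the tree, so the root never sees the raw rows; this works for aggregates such as COUNT and SUM that can be combined from partial results.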
17. I am lost.. Are MR and Dremel the same??
Features                   MapReduce aka MR                              Dremel
Birth year & place         Since 2004 @ Google lab                       Since 2006 @ Google lab
Type                       Distributed & parallel programming framework  Distributed interactive ad hoc query system
Scalable & fault tolerant  Yes                                           Yes
Data processing            Record oriented                               Column oriented
Batch processing           Yes                                           No
In situ processing         No                                            Yes
Take away point: Dremel complements MapReduce-based computing.