2. What Is Hadoop?
• A system for processing mind-bogglingly large amounts
of data
• Inspired by Google's MapReduce and Google File System papers
• The Apache Hadoop software library is a framework
that allows for the distributed processing of large data sets
across clusters of computers using simple programming
models.
• It is designed to scale up from single servers to thousands
of machines, each offering local computation and storage.
• Rather than rely on hardware to deliver high availability,
the library itself is designed to detect and handle failures at
the application layer, so delivering a highly-available service
on top of a cluster of computers, each of which may be
prone to failures.
3. Hadoop Core
• Open-source, flexible, and available architecture
for large-scale computation and data processing on
a network of commodity hardware
• Open source software + commodity hardware
• IT cost reduction
• MapReduce: computation
• HDFS: storage
4. Hadoop, Why?
• Need to process Multi Petabyte Datasets
• Expensive to build reliability in each application.
• Nodes fail every day
• Failure is expected, rather than exceptional.
• The number of nodes in a cluster is not constant.
• Need a common infrastructure
• Efficient, reliable, Open Source (Apache License)
• The above goals are the same as Condor's, but
• workloads are IO bound and not CPU bound
5. Hadoop History
• 2004—Initial versions of what is now Hadoop Distributed Filesystem and
MapReduce implemented by Doug Cutting and Mike Cafarella.
• December 2005—Nutch ported to the new framework. Hadoop runs reliably on
20 nodes.
• January 2006—Doug Cutting joins Yahoo!.
• February 2006—Apache Hadoop project officially started to support the
standalone development of MapReduce and HDFS.
• February 2006—Adoption of Hadoop by Yahoo! Grid team.
• April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006—Yahoo! set up a Hadoop research cluster—300 nodes.
• May 2006—Sort benchmark run on 500 nodes in 42 hours (better hardware than
April benchmark).
• October 2006—Research cluster reaches 600 nodes.
• December 2006—Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3
hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007—Research cluster reaches 900 nodes.
• April 2007—Research clusters—2 clusters of 1000 nodes.
• April 2008—Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
• October 2008—Loading 10 terabytes of data per day onto research clusters.
• March 2009—17 clusters with a total of 24,000 nodes
• April 2009—Won the minute sort by sorting 500 GB in 59 seconds (on 1400
nodes) and the 100 terabyte sort in 173 minutes (on 3400 nodes).
7. How does HDFS Work?
Suppose we have a file of size 300 MB.
(Sidebar: MapReduce has undergone a complete overhaul in
hadoop-0.23, and we now have what we call MapReduce 2.0
(MRv2) or YARN. The fundamental idea of MRv2 is to split up
the two major functionalities of the JobTracker, resource
management and job scheduling/monitoring, into separate
daemons.)
8. How does HDFS Work?
HDFS splits the file into blocks. The size of each block is
128 MB, so the 300 MB file becomes three blocks:
128 MB, 128 MB, and 44 MB.
9. How does HDFS Work?
HDFS will keep 3 copies of each block.
HDFS stores these blocks on datanodes and distributes the
replicas across the DNs.
10. How does HDFS Work?
The Name Node tracks blocks and Data nodes.
[Diagram: a single Name Node tracking a cluster of DataNodes (DN).]
11. How does HDFS Work?
Sometimes a datanode will die. Not a problem: the Name Node
notices the missing replicas and re-replicates those blocks
onto the remaining datanodes.
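To make this concrete, here is a minimal sketch (not from the
original deck) that asks HDFS for a file's block layout through
the public FileSystem API; the path to inspect comes from the
first command-line argument, and cluster settings come from the
usual configuration files on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    // One BlockLocation per block; each lists the datanodes holding a replica.
    for (BlockLocation block :
        fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}

For the 300 MB file above, this would print three blocks, each
listing the three datanodes that hold its replicas.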
16. MapReduce: Programming Model
Process data using special map() and reduce()
functions
The map() function is called on every item in the
input and emits a series of intermediate key/value
pairs
All values associated with a given key are grouped
together
The reduce() function is called on every unique
key, and its value list, and emits a value that is
added to the output
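To pin down that contract, here is a hypothetical sketch in plain
Java (not the Hadoop API, which slide 40 shows) that simulates the
three phases in memory: map every input item to key/value pairs,
group the values by key, then reduce each group.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

public class MiniMapReduce {
  static <I, K, V> Map<K, V> run(List<I> input,
      Function<I, List<Map.Entry<K, V>>> mapFn,
      BiFunction<K, List<V>, V> reduceFn) {
    // Shuffle phase: group all intermediate values by key.
    Map<K, List<V>> groups = new LinkedHashMap<>();
    for (I item : input)                               // map() on every item
      for (Map.Entry<K, V> kv : mapFn.apply(item))
        groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
              .add(kv.getValue());
    Map<K, V> out = new LinkedHashMap<>();
    for (Map.Entry<K, List<V>> g : groups.entrySet())  // reduce() per unique key
      out.put(g.getKey(), reduceFn.apply(g.getKey(), g.getValue()));
    return out;
  }

  public static void main(String[] args) {
    List<String> lines =
        Arrays.asList("How now brown cow", "How does it work now");
    Map<String, Integer> counts = run(lines,
        line -> {
          // Emit <word, 1> for every whitespace-separated token.
          List<Map.Entry<String, Integer>> kvs = new ArrayList<>();
          for (String w : line.split("\\s+")) kvs.add(new SimpleEntry<>(w, 1));
          return kvs;
        },
        (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum());
    // Prints {How=2, now=2, brown=1, cow=1, does=1, it=1, work=1},
    // the same flow slide 17 draws.
    System.out.println(counts);
  }
}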
17. MapReduce: Programming Model
[Diagram: the MapReduce framework applied to word counting.
The input lines "How now Brown cow" and "How does It work now"
are fed to Map tasks, which emit intermediate pairs such as
<How,1>, <now,1>, <brown,1>, <cow,1>. The framework groups the
values by key (<How,1 1>, <now,1 1>, ...), and Reduce tasks sum
each list, producing the output: brown 1, cow 1, does 1, How 2,
it 1, now 2, work 1.]
19. MapReduce Life Cycle
[Diagram: write a map function and a reduce function, then run
the program as a MapReduce job.]
20. Hadoop Environment
Hadoop has become the kernel of the distributed
operating system for Big Data
The project includes these modules:
Hadoop Common: The common utilities that support
the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A
distributed file system that provides high-throughput
access to application data.
Hadoop YARN: A framework for job scheduling and
cluster resource management.
Hadoop MapReduce: A YARN-based system for
parallel processing of large data sets.
23. What is ZooKeeper?
• A centralized service for maintaining:
• Configuration information and naming
• Distributed synchronization
• Group services
• A set of tools to build distributed applications that can
safely handle partial failures
• ZooKeeper was designed to store coordination data:
• Status information
• Configuration
• Location information
24. ZooKeeper
• ZooKeeper allows distributed processes to coordinate
with each other through a shared hierarchical name
space of data registers (we call these registers
znodes), much like a file system.
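A minimal sketch with the standard ZooKeeper Java client,
assuming an ensemble reachable at localhost:2181; the znode path
and payload here are made up for illustration.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeDemo {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
      if (event.getState() == KeeperState.SyncConnected) connected.countDown();
    });
    connected.await();  // the handshake is asynchronous; wait before first call
    // Znodes live in a hierarchical namespace, much like file system paths.
    // (Throws NodeExists if the znode is already there.)
    zk.create("/demo-config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    byte[] data = zk.getData("/demo-config", false, null);
    System.out.println(new String(data));  // prints "v1"
    zk.close();
  }
}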
26. FLUME
• Flume is a distributed, reliable, and available data
collection service
• It efficiently collects, aggregates, and moves large
amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• One-stop solution for data collection of all formats
• It has a simple and flexible architecture based on
streaming data flows
29. Sqoop
• Apache Sqoop(TM) is a tool designed for efficiently
transferring bulk data between Apache Hadoop and
structured data stores such as relational databases.
• Easy, parallel database import/export
• What can you do with it?
• Import data from an RDBMS into HDFS
• Export data from HDFS back into an RDBMS
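For example (the connection string, database, and table names
below are hypothetical; the flags are standard Sqoop options):

sqoop import \
  --connect jdbc:mysql://db.example.com/shop \
  --username dbuser \
  --table orders \
  --target-dir /data/orders

sqoop export \
  --connect jdbc:mysql://db.example.com/shop \
  --username dbuser \
  --table order_summary \
  --export-dir /data/summary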
32. Why Hive and Pig?
Although MapReduce is very powerful, it can also be
complex to master
Many organizations have business or data analysts who
are skilled at writing SQL queries, but not at writing
Java code
Many organizations have programmers who are skilled
at writing code in scripting languages
Hive and Pig are two projects which evolved separately
to help such people analyze huge amounts of data
via MapReduce
Hive was initially developed at Facebook, Pig at Yahoo!
33. Hive
What is Hive?
An SQL-like interface to Hadoop
Data Warehouse infrastructure that provides easy data
summarization and ad hoc querying and the analysis
of large datasets stored in Hadoop compatible file
systems.
MapReduce for execution
HDFS for storage
Hive Query Language
Basic SQL: Select, From, Join, Group-By
Equi-Join, Multi-Table Insert, Multi-Group-By
Batch query, e.g.:
SELECT storeid, SUM(price) FROM purchases WHERE price > 100 GROUP BY storeid;
35. Pig
Apache Pig is a platform to analyze large data sets.
In simple terms: you have lots and lots of data on which
you need to do some processing or analysis. One way is
to write MapReduce code and then run that processing
on the data.
The other way is to write Pig scripts, which are in turn
converted to MapReduce code that processes your data.
Pig consists of two parts:
• Pig Latin language
• Pig engine
A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';
36. Pig Latin & Pig Engine
• Pig Latin is a scripting language which allows you to
describe how data from one or more inputs should be
read, how it should be processed, and where it should
be stored.
• The flows can be simple or complex, with some
processing applied in between. Data can be picked
from multiple inputs.
• We can say Pig Latin describes a directed acyclic
graph where the edges are data flows and the nodes
are operators that process the data.
• Pig Engine:
• The job of the engine is to execute the data flow written
in Pig Latin in parallel on the Hadoop infrastructure.
37. Why Pig is required when we can code it all in MR
• Pig provides all standard data processing operations
like sort, group, join, filter, order by, and union right
inside Pig Latin
• In MR we have to do lots of manual coding.
• Pig optimizes Pig Latin scripts while compiling them
into MR jobs.
• It creates an optimized version of MapReduce to run on
Hadoop
• It takes much less time to write a Pig Latin script than to
write the corresponding MR code
• Where Pig is useful:
Transactional ETL data pipelines (most common use)
Research on raw data
Iterative processing
39. WordCount Example
• Input
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
• The reduce just sums up the values
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
40. WordCount Example In MapReduce
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Tokenize the line and emit <word, 1> for every token.
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the 1s emitted for this word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
41. WordCount Example By Pig
A = LOAD 'wordcount/input' USING PigStorage() AS
(token:chararray);
B = GROUP A BY token;
C = FOREACH B GENERATE group, COUNT(A) as count;
DUMP C;
42. WordCount Example By Hive
CREATE TABLE wordcount (token STRING);
LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;
SELECT token, count(*) FROM wordcount GROUP BY token;
45. HBase
• Apache HBase™ is the Hadoop database, a
distributed, scalable, big data store.
• Apache HBase is an open-source, distributed, versioned, column-
oriented store modeled after Google's Bigtable: A Distributed Storage
System for Structured Data by Chang et al.
• Coordinated by Zookeeper
• Low Latency
• Random Reads And Writes
• Distributed Key/Value Store
• Simple API
– PUT
– GET
– DELETE
– SCAN
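A minimal sketch of those four operations with the HBase Java
client, assuming a table 'users' with a column family 'cf'
already exists (the table, row key, and values are made up for
illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrud {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // PUT: write a cell (row key, column family, qualifier, value)
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
          Bytes.toBytes("Alice"));
      table.put(put);
      // GET: read one row back
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));
      // SCAN: iterate over a range of rows
      try (ResultScanner scanner = table.getScanner(new Scan())) {
        for (Result row : scanner) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
      // DELETE: remove the row
      table.delete(new Delete(Bytes.toBytes("row1")));
    }
  }
}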
46. HBase
HBase is a type of "NoSQL" database. NoSQL?
"NoSQL" is a general term meaning that the database
isn't an RDBMS which supports SQL as its primary
access language, but there are many types of NoSQL
databases: BerkeleyDB is an example of a local
NoSQL database, whereas HBase is very much a
distributed database.
Technically speaking, HBase is really more a "Data
Store" than "Data Base" because it lacks many of the
features you find in an RDBMS, such as typed
columns, secondary indexes, triggers, and advanced
query languages, etc.
49. What is Oozie?
Oozie is a server-based workflow scheduler system to
manage Apache Hadoop jobs (e.g. loading data, storing
data, analyzing data, cleaning data, running MapReduce
jobs, etc.)
A Java web application
Oozie is a workflow scheduler for Hadoop
[Diagram: a workflow of Jobs 1-5, triggered by time or by
data availability.]
51. What is Mahout?
Machine-learning tool
Distributed and scalable machine learning algorithms
on the Hadoop platform
Makes building intelligent applications easier and faster
Its core algorithms for clustering, classification, and
batch-based collaborative filtering are implemented
on top of Apache Hadoop using the map/reduce
paradigm.