This document discusses the Hadoop framework. It provides an overview of Hadoop and its core components, MapReduce and HDFS. It describes how Hadoop is suitable for processing large datasets in distributed environments using commodity hardware. It also summarizes some of Hadoop's limitations and how additional tools like HBase, Pig Latin, and Hive can expand its capabilities.
2. Hadoop
Scale up: multiprocessor (SMP) machines -> expensive
Scale out: commodity hardware
Cost efficiency = performance per unit cost
-> roughly 12 times higher for commodity hardware than for SMP
Communication between nodes is faster in SMP
But for data-intensive applications, the workload requires a cluster of
machines
-> network transfer is inevitable
3. Hadoop
Hadoop:
Open-source framework implementing Map/Reduce in a distributed
environment.
Initially developed at Yahoo, following Google's designs:
-> Map/Reduce framework: modelled on Google's MapReduce.
-> HDFS – Hadoop Distributed File System: modelled on the Google File System (GFS).
Moved to an open-source licence – now an Apache project.
Used by Yahoo (20,000 servers), Google, Amazon, eBay, Facebook
4. Hadoop
Suitable for processing TBs and PBs of data
Reliability – commodity machines have less reliable disks
-> mean time between failures ~ 1000 days per machine
-> a 10000-server cluster experiences about 10 failures a day
Redundant distribution and processing
-> Data is distributed in n replicas
-> Code is spread over m "slots" across the cluster, with m > n
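The arithmetic behind that failure rate: with a mean time between failures of about 1000 days per machine, a 10000-machine cluster sees on average 10000 / 1000 = 10 machine failures per day.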
5. Hadoop
Sequential access
-> Data is too big to fit in memory, and random access is expensive
-> Data is accessed sequentially: no seeks, no binary-tree search
"Shared nothing" architecture
-> Shared state would be unmaintainable in a highly asynchronous environment
Values represented as lists of <key,value> pairs
-> Limits explicit communication between nodes
-> Keys provide the information used to move data around the cluster
6. HDFS
Distributed file system – decouples the namespace from the data.
Partitions a dataset across a cluster
File system targeted at "very large" files – TB, PB
-> A small number of large files keeps namespace management tractable
Files are written once – no update or append
-> Allows the distribution of files into blocks to be optimised
Fault tolerance through redundancy of data
Targeted at batch processing: high throughput, but high latency!
7. HDFS
Files are divided into blocks (64 MB, 128 MB) if size(file) > size(block)
-> Each file / block is the atomic input for one map instance.
HDFS block >> disk block, to reduce the number of disk seeks
relative to the amount of data loaded. Allows streaming reads.
Blocks simplify storage and management:
Metadata is maintained outside the data, separating security and failure
management from the intensive disk operations.
Replication is done at block level.
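A minimal sketch of how block placement can be observed from a client, assuming a reachable cluster; the class name and the path /data/input.txt are illustrative, not from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/data/input.txt")); // hypothetical file
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      // Each block reports the DataNodes holding one of its replicas.
      System.out.println(b.getOffset() + " - " + b.getLength() + " : "
          + String.join(",", b.getHosts()));
    }
  }
}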
9. HDFS
Replica placement strategy – minimise transfers across rack
network switches while keeping a reasonable load-balancing ratio: 1
replica on the local rack, 2 on the same remote rack
-> write bandwidth optimisation: the write pipeline transits 2 network switches instead of 3.
Affinity – maps are executed on nodes where the block is present. If that is not
possible, rack awareness is used to minimise the distance between
process and data -> move the program to the data.
Cluster rebalancing – additional replicas can be created
dynamically when a block is in high demand.
10. HDFS
Scalability and performance are limited by the single namespace-server
architecture
NameNode and DataNodes are decoupled for scalability:
-> Metadata operations are fast; data operations are heavy and slow
-> With one server for both, data operations would dominate and the
namespace would become the bottleneck
The whole namespace is held in RAM + periodic backup to disk (journal)
-> limits the number of files:
1 GB of metadata ~ 1 PB of physical storage
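As a rough sanity check of that rule of thumb (assuming 128 MB blocks): 1 PB of data is about 8 million blocks, so 1 GB of NameNode RAM leaves on the order of 128 bytes of metadata per block, before per-file and per-replica overhead.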
13. Communications
Synchronisation (DataNode -> NameNode):
Heartbeat (every 3 seconds):
- Total disk capacity
- Used disk space
- Number of data transfers in progress on the node (used for load balancing)
Block report (every hour or on demand):
- List of block IDs
- Length
- Generation stamp
16. Benchmark
TeraSort benchmark
1800 machines, dual 2 GHz Intel Xeon with hyper-threading, 4 GB
memory
At most 1 reduce task per machine
10^10 records of 100 bytes each (1 TB of input data) – record keys
distributed so that reducers receive balanced loads
Map: extract a 10-byte key, the original line is the value, emit (key, value)
Reduce: identity function
M = 10000 map tasks, split size = 64 MB, R = 4000 reduce tasks
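Checking the stated input size: 10^10 records x 100 bytes = 10^12 bytes, i.e. about 1 TB.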
17. Benchmark
Input rate – peaks at 13 MB/s and stops once the map
phase has finished.
It is higher than the shuffle and reduce rates because of
data locality.
Shuffle rate – starts as soon as the first map output has
been generated. It pauses once the first batch of 1800
reduce tasks (1 reducer per machine) has received its data,
and resumes after those first reducers finish their processing.
Reduce rate – the first write rates are higher; then the
second round of shuffles begins again, so the rate
decreases slightly.
The rate is lower than the shuffle rate -> 2 copies are written
for the output.
18. Tuning
Increase the number of reducers (see the driver sketch below)
If there are more reducers than available slots, the faster machines will
execute more reducer instances
+ Better load balancing
+ Lower cost of failure (less work is lost per failed task)
- Higher global overhead (more task setup and scheduling)
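A minimal driver sketch showing where the reducer count is set; the class name, paths, and the value 8000 are illustrative assumptions, not values from the slides (mapper and reducer default to the identity classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MoreReducersDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "identity job with extra reducers");
    job.setJarByClass(MoreReducersDriver.class);
    // Mapper and reducer are left at their identity defaults; only the reducer count is tuned.
    job.setNumReduceTasks(8000);   // hypothetical: ~2x the 4000 reduce slots of the benchmark cluster
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}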
19. Tuning
In-mapper combining
Combiner execution is optional and left to the decision of the framework.
To force the aggregation inside map():
- State preservation – all map() iterations of a task are executed in a single JVM instance
- Emit the whole aggregated result at once in a hook at the end of the task
map (filename, file-contents):
  array = new associative_array(int, int)
  for each number in file-contents:
    array[number] += number^2
  for each number in array:
    emit (number, array[number])
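A hedged Java sketch of the same pattern in the MapReduce API, where cleanup() plays the role of the hook; the input format (one integer per line) and the types are assumptions for illustration:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SumOfSquaresMapper
    extends Mapper<LongWritable, Text, IntWritable, LongWritable> {

  // State preserved across map() calls: one Mapper instance (one JVM) per map task.
  private final Map<Integer, Long> sums = new HashMap<Integer, Long>();

  @Override
  protected void map(LongWritable offset, Text line, Context context) {
    int number = Integer.parseInt(line.toString().trim());
    long square = (long) number * number;
    Long current = sums.get(number);
    sums.put(number, current == null ? square : current + square);
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit the aggregated result once, instead of one pair per input record.
    for (Map.Entry<Integer, Long> e : sums.entrySet()) {
      context.write(new IntWritable(e.getKey()), new LongWritable(e.getValue()));
    }
  }
}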
20. Tuning
Configuration parameters
Map parameters
io.sort.mb: size of the in-memory buffer holding the map output
io.sort.spill.percent: how full the buffer may get before its contents are written to disk.
The writing is done in the background, but the map task blocks if the buffer fills up before the
disk can complete the flush.
-> Increase the buffer size if the map functions themselves need little memory
-> Increase the buffer size / spill percent to keep spilling fluid – possible if the disks
can handle parallel writes efficiently
tasktracker.http.threads: number of threads on map nodes serving reduce (shuffle)
requests
-> Increase for big clusters and large jobs (see the sketch below)
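A sketch of how the two job-level parameters above could be set from a driver; the values are hypothetical starting points, not recommendations from the slides. tasktracker.http.threads is a TaskTracker daemon setting and belongs in mapred-site.xml on the worker nodes rather than in the job configuration.

import org.apache.hadoop.conf.Configuration;

public class MapSpillTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    conf.setInt("io.sort.mb", 200);                 // hypothetical: larger in-memory buffer for map output
    conf.setFloat("io.sort.spill.percent", 0.90f);  // hypothetical: start the background spill at 90% full
    return conf;
  }
}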
21. Tuning
Configuration parameters
Merge/sort parameters
mapred.job.shuffle.input.buffer.percent: fraction of the reduce task's memory used
for holding the map outputs.
They are written to disk after reaching mapred.job.shuffle.merge.percent of that memory, or
mapred.inmem.merge.threshold files
-> Increase the memory usage if reduce tasks are small and the number of
mappers is not much bigger than the number of reducers
22. Tuning
Configuration parameters
io.sort.factor
Merge factor: the number of streams merged at once when building the merged file from the map
outputs received by a reducer – after one merge round, the number of remaining input files is
nb_received_map_outputs / io.sort.factor
-> Increase it if the nodes have plenty of free memory.
-> Take into account mapred.job.shuffle.input.buffer.percent, which reduces the
memory available for the merge (see the sketch below).
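An illustrative sketch grouping the shuffle/merge parameters named on the last two slides; all values are hypothetical, not recommendations from the slides:

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.80f); // share of reducer heap for map outputs
    conf.setFloat("mapred.job.shuffle.merge.percent", 0.70f);        // spill to disk past this fill level...
    conf.setInt("mapred.inmem.merge.threshold", 1000);               // ...or past this many in-memory files
    conf.setInt("io.sort.factor", 50);                               // streams merged per merge round
    return conf;
  }
}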
23. Scheduler
The scheduler is based on jobs, not tasks.
FIFO
Each job uses all the available resources, penalising other users
Fair scheduler
The cluster is shared fairly between the different users
-> One pool of jobs per user
-> Preemption if new jobs change the sharing balance and make a pool too resource-
intensive
No affinity score is computed for tasks during scheduling; data affinity is
applied only after the scheduler has selected a task
-> This is a serious handicap for a data grid!
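For concreteness, a sketch of the properties used to switch the JobTracker from FIFO to the Fair Scheduler; the allocation-file path is hypothetical, and in practice these settings live in the JobTracker's mapred-site.xml rather than in a job's configuration:

import org.apache.hadoop.conf.Configuration;

public class FairSchedulerSetup {
  public static Configuration jobTrackerConf() {
    Configuration conf = new Configuration();
    // Replace the default FIFO scheduler with the Fair Scheduler (one pool per user by default).
    conf.set("mapred.jobtracker.taskScheduler",
             "org.apache.hadoop.mapred.FairScheduler");
    // Optional allocation file defining per-pool shares and preemption timeouts (hypothetical path).
    conf.set("mapred.fairscheduler.allocation.file", "/etc/hadoop/fair-scheduler.xml");
    return conf;
  }
}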
24. Weakness
The JobTracker is tied to the resources:
-> harder to draw on a pool of available resources
-> no dynamic scalability; resource planning has to be fixed in advance
-> harder to define SLAs on a shared grid
Pull mode between TaskTracker and JobTracker
-> peak/valley issue – idle periods between polling times. The heartbeat
frequency can be increased, but at the risk of saturating the network
No possibility to pin resources (slave nodes) to a job
25. Weakness
The reduce phase can only begin after the end of the map phase
M = number of maps, R = number of reduces
M_slots = number of map slots, R_slots = number of reduce slots
Tm = average duration of a map, Tr = average duration of a reduce
Total time = Tm * max(1, M/M_slots) + Tr * max(1, R/R_slots)
If the reduce phase could begin as soon as the first map result is available:
-> R would be bigger, since there would be at least as many reduces as outputs from
the maps -> R_new = max(M, R) = M most of the time
Total time = max(Tr, Tm) * max(1, 2*M / (M_slots + R_slots))
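A numeric illustration with made-up values: M = 4000 maps of Tm = 60 s on M_slots = 2000, and R = 2000 reduces of Tr = 30 s on R_slots = 2000. The sequential model gives 60 * 2 + 30 * 1 = 150 s, while the overlapped model with R_new = M = 4000 gives max(60, 30) * (2 * 4000 / 4000) = 120 s.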
26. ADDONS - HBASE
Database – column-oriented storage
-> suited to data warehousing
Based on Google BigTable
Stores huge amounts of data
Efficient access to elements by (row, column)
Not all columns need to be present for every row!
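A brief sketch of (row, column) access through the classic HBase Java client; the table name, column family, and values are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");                 // hypothetical table
    // Write one cell: row "user1", column family "info", qualifier "city".
    Put put = new Put(Bytes.toBytes("user1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Tehran"));
    table.put(put);
    // Read it back by (row, column).
    Result result = table.get(new Get(Bytes.toBytes("user1")));
    byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
    System.out.println(Bytes.toString(city));
    table.close();
  }
}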
27. ADDONS - PIG LATIN
High-level data flow language running on Hadoop. Used for data analysis.
fileA:
User1, a
User1, b
User2, c
Log = LOAD 'fileA' AS (user, value);
Grp = GROUP Log BY user;
Count = FOREACH Grp GENERATE group, COUNT(Log);
STORE Count INTO 'outputFile';
outputFile:
User1, 2
User2, 1
28. ADDONS - HIVE
High-level language – SQL-like load and query
CREATE TABLE T (a INT, b STRING) ...
LOAD DATA INPATH 'file_name' INTO TABLE T;
SELECT * FROM ...
Allows joins and other more powerful features
29. CONCLUSION
• Hadoop is a fair middleware for distributed data processing
• Restrictive usage: high-volume data and <key,value>
processing
• No clear separation between processing and resource
management
• But… a very active project; evolution will bring improvements