This document discusses the Hadoop framework. It provides an overview of Hadoop and its core components, MapReduce and HDFS. It describes how Hadoop is suitable for processing large datasets in distributed environments using commodity hardware. It also summarizes some of Hadoop's limitations and how additional tools like HBase, Pig Latin, and Hive can expand its capabilities.
2. Hadoop
Scale up: multiprocessor (SMP) machines -> expensive
Scale out: commodity hardware
Cost efficiency = performance per unit cost
-> roughly 12 times higher for commodity hardware than for SMP
Communication between nodes is faster in SMP
But for data-intensive applications, the workload requires a cluster of
machines
-> network transfer is inevitable
3. Hadoop
Hadoop:
Open-source framework implementing Map/Reduce in a distributed
environment.
Initially developed at Yahoo, following Google's designs:
-> Map/Reduce framework: modelled on Google's MapReduce.
-> HDFS – Hadoop Distributed File System: modelled on the Google File System (GFS).
Moved to an open-source licence – now an Apache project.
Used by Yahoo (20,000 servers), Google, Amazon, eBay, Facebook
4. Hadoop
Suitable for processing TBs and PBs of data
Reliability – commodity machines have less reliable disks
-> mean time between failures ~ 1000 days per machine
-> a 10000-server cluster experiences about 10 failures a day
Redundant distribution and processing
-> Data is distributed in n replicas
-> Code is spread over m "slots" across the cluster, with m > n
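The arithmetic behind that failure rate: with a mean time between failures of about 1000 days per machine, a 10000-machine cluster sees on average 10000 / 1000 = 10 machine failures per day.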
5. Hadoop
Sequential access
-> Data is too big to fit in memory, and random access is expensive
-> Data is accessed sequentially: no seeks, no binary-tree search
"Shared nothing" architecture
-> Shared state would be unmaintainable in a highly asynchronous environment
Values represented as lists of <key,value> pairs
-> Limits explicit communication between nodes
-> Keys provide the information used to move data around the cluster
6. HDFS
Distributed file system – decouples the namespace from the data.
Partitions a dataset across a cluster
File system targeted at "very large" files – TB, PB
-> A small number of large files keeps namespace management tractable
Files are written once – no update or append
-> Allows the distribution of files into blocks to be optimised
Fault tolerance through redundancy of data
Targeted at batch processing: high throughput, but high latency!
7. HDFS
Files are divided into blocks (64 MB, 128 MB) if size(file) > size(block)
-> Each file / block is the atomic input for one map instance.
HDFS block >> disk block, to reduce the number of disk seeks
relative to the amount of data loaded. Allows streaming reads.
Blocks simplify storage and management:
Metadata is maintained outside the data, separating security and failure
management from the intensive disk operations.
Replication is done at block level.
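A minimal sketch of how block placement can be observed from a client, assuming a reachable cluster; the class name and the path /data/input.txt are illustrative, not from the slides:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/data/input.txt")); // hypothetical file
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      // Each block reports the DataNodes holding one of its replicas.
      System.out.println(b.getOffset() + " - " + b.getLength() + " : "
          + String.join(",", b.getHosts()));
    }
  }
}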
9. HDFS
Replica placement strategy – minimise transfers across rack
network switches while keeping a reasonable load-balancing ratio: 1
replica on the local rack, 2 on the same remote rack
-> write bandwidth optimisation: the write pipeline transits 2 network switches instead of 3.
Affinity – maps are executed on nodes where the block is present. If that is not
possible, rack awareness is used to minimise the distance between
process and data -> move the program to the data.
Cluster rebalancing – additional replicas can be created
dynamically when a block is in high demand.
10. HDFS
Scalability and performance are limited by the single namespace-server
architecture
NameNode and DataNodes are decoupled for scalability:
-> Metadata operations are fast; data operations are heavy and slow
-> With one server for both, data operations would dominate and the
namespace would become the bottleneck
The whole namespace is held in RAM + periodic backup to disk (journal)
-> limits the number of files:
1 GB of metadata ~ 1 PB of physical storage
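As a rough sanity check of that rule of thumb (assuming 128 MB blocks): 1 PB of data is about 8 million blocks, so 1 GB of NameNode RAM leaves on the order of 128 bytes of metadata per block, before per-file and per-replica overhead.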
13. Communications
Synchronisation (DataNode -> NameNode):
Heartbeat (every 3 seconds):
- Total disk capacity
- Used disk space
- Number of data transfers in progress on the node (used for load balancing)
Block report (every hour or on demand):
- List of block IDs
- Length
- Generation stamp
16. Benchmark
TeraSort benchmark
1800 machines, dual 2 GHz Intel Xeon with hyper-threading, 4 GB
memory
At most 1 reduce task per machine
10^10 records of 100 bytes each (1 TB of input data) – record keys
distributed so that reducers receive balanced loads
Map: extract a 10-byte key, the original line is the value, emit (key, value)
Reduce: identity function
M = 10000 map tasks, split size = 64 MB, R = 4000 reduce tasks
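Checking the stated input size: 10^10 records x 100 bytes = 10^12 bytes, i.e. about 1 TB.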
17. Benchmark
Input rate – peaks at 13 MB/s and stops once the map
phase has finished.
It is higher than the shuffle and reduce rates because of
data locality.
Shuffle rate – starts as soon as the first map output has
been generated. It pauses once the first batch of 1800
reduce tasks (1 reducer per machine) has received its data,
and resumes after those first reducers finish their processing.
Reduce rate – the first write rates are higher; then the
second round of shuffles begins again, so the rate
decreases slightly.
The rate is lower than the shuffle rate -> 2 copies are written
for the output.
18. Tuning
Increase the number of reducers (see the driver sketch below)
If there are more reducers than available slots, the faster machines will
execute more reducer instances
+ Better load balancing
+ Lower cost of failure (less work is lost per failed task)
- Higher global overhead (more task setup and scheduling)
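A minimal driver sketch showing where the reducer count is set; the class name, paths, and the value 8000 are illustrative assumptions, not values from the slides (mapper and reducer default to the identity classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MoreReducersDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "identity job with extra reducers");
    job.setJarByClass(MoreReducersDriver.class);
    // Mapper and reducer are left at their identity defaults; only the reducer count is tuned.
    job.setNumReduceTasks(8000);   // hypothetical: ~2x the 4000 reduce slots of the benchmark cluster
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}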
19. Tuning
In-mapper combining
Combiner execution is optional and left to the decision of the framework.
To force the aggregation inside map():
- State preservation – all map() iterations of a task are executed in a single JVM instance
- Emit the whole aggregated result at once in a hook at the end of the task
map (filename, file-contents):
  array = new associative_array(int, int)
  for each number in file-contents:
    array[number] += number^2
  for each number in array:
    emit (number, array[number])
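A hedged Java sketch of the same pattern in the MapReduce API, where cleanup() plays the role of the hook; the input format (one integer per line) and the types are assumptions for illustration:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SumOfSquaresMapper
    extends Mapper<LongWritable, Text, IntWritable, LongWritable> {

  // State preserved across map() calls: one Mapper instance (one JVM) per map task.
  private final Map<Integer, Long> sums = new HashMap<Integer, Long>();

  @Override
  protected void map(LongWritable offset, Text line, Context context) {
    int number = Integer.parseInt(line.toString().trim());
    long square = (long) number * number;
    Long current = sums.get(number);
    sums.put(number, current == null ? square : current + square);
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit the aggregated result once, instead of one pair per input record.
    for (Map.Entry<Integer, Long> e : sums.entrySet()) {
      context.write(new IntWritable(e.getKey()), new LongWritable(e.getValue()));
    }
  }
}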
20. Tuning
Configuration parameters
Map parameters
io.sort.mb: size of the in-memory buffer holding the map output
io.sort.spill.percent: how full the buffer may get before its contents are written to disk.
The writing is done in the background, but the map task blocks if the buffer fills up before the
disk can complete the flush.
-> Increase the buffer size if the map functions themselves need little memory
-> Increase the buffer size / spill percent to keep spilling fluid – possible if the disks
can handle parallel writes efficiently
tasktracker.http.threads: number of threads on map nodes serving reduce (shuffle)
requests
-> Increase for big clusters and large jobs (see the sketch below)
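A sketch of how the two job-level parameters above could be set from a driver; the values are hypothetical starting points, not recommendations from the slides. tasktracker.http.threads is a TaskTracker daemon setting and belongs in mapred-site.xml on the worker nodes rather than in the job configuration.

import org.apache.hadoop.conf.Configuration;

public class MapSpillTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    conf.setInt("io.sort.mb", 200);                 // hypothetical: larger in-memory buffer for map output
    conf.setFloat("io.sort.spill.percent", 0.90f);  // hypothetical: start the background spill at 90% full
    return conf;
  }
}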
21. Tuning
Configuration parameters
Merge/sort parameters
mapred.job.shuffle.input.buffer.percent: fraction of the reduce task's memory used
for holding the map outputs.
They are written to disk after reaching mapred.job.shuffle.merge.percent of that memory, or
mapred.inmem.merge.threshold files
-> Increase the memory usage if reduce tasks are small and the number of
mappers is not much bigger than the number of reducers
22. Tuning
Configuration parameters
io.sort.factor
Merge factor: the number of streams merged at once when building the merged file from the map
outputs received by a reducer – after one merge round, the number of remaining input files is
nb_received_map_outputs / io.sort.factor
-> Increase it if the nodes have plenty of free memory.
-> Take into account mapred.job.shuffle.input.buffer.percent, which reduces the
memory available for the merge (see the sketch below).
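An illustrative sketch grouping the shuffle/merge parameters named on the last two slides; all values are hypothetical, not recommendations from the slides:

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
  public static Configuration tunedConf() {
    Configuration conf = new Configuration();
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.80f); // share of reducer heap for map outputs
    conf.setFloat("mapred.job.shuffle.merge.percent", 0.70f);        // spill to disk past this fill level...
    conf.setInt("mapred.inmem.merge.threshold", 1000);               // ...or past this many in-memory files
    conf.setInt("io.sort.factor", 50);                               // streams merged per merge round
    return conf;
  }
}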
23. Scheduler
The scheduler is based on jobs, not tasks.
FIFO
Each job uses all the available resources, penalising other users
Fair scheduler
The cluster is shared fairly between the different users
-> One pool of jobs per user
-> Preemption if new jobs change the sharing balance and make a pool too resource-
intensive
No affinity score is computed for tasks during scheduling; data affinity is
applied only after the scheduler has selected a task
-> This is a serious handicap for a data grid!
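For concreteness, a sketch of the properties used to switch the JobTracker from FIFO to the Fair Scheduler; the allocation-file path is hypothetical, and in practice these settings live in the JobTracker's mapred-site.xml rather than in a job's configuration:

import org.apache.hadoop.conf.Configuration;

public class FairSchedulerSetup {
  public static Configuration jobTrackerConf() {
    Configuration conf = new Configuration();
    // Replace the default FIFO scheduler with the Fair Scheduler (one pool per user by default).
    conf.set("mapred.jobtracker.taskScheduler",
             "org.apache.hadoop.mapred.FairScheduler");
    // Optional allocation file defining per-pool shares and preemption timeouts (hypothetical path).
    conf.set("mapred.fairscheduler.allocation.file", "/etc/hadoop/fair-scheduler.xml");
    return conf;
  }
}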
24. Weakness
The JobTracker is tied to the resources:
-> harder to draw on a pool of available resources
-> no dynamic scalability; resource planning has to be fixed in advance
-> harder to define SLAs on a shared grid
Pull mode between TaskTracker and JobTracker
-> peak/valley issue – idle periods between polling times. The heartbeat
frequency can be increased, but at the risk of saturating the network
No possibility to pin resources (slave nodes) to a job
25. Weakness
The reduce phase can only begin after the end of the map phase
M = number of maps, R = number of reduces
M_slots = number of map slots, R_slots = number of reduce slots
Tm = average duration of a map, Tr = average duration of a reduce
Total time = Tm * max(1, M/M_slots) + Tr * max(1, R/R_slots)
If the reduce phase could begin as soon as the first map result is available:
-> R would be bigger, since there would be at least as many reduces as outputs from
the maps -> R_new = max(M, R) = M most of the time
Total time = max(Tr, Tm) * max(1, 2*M / (M_slots + R_slots))
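A numeric illustration with made-up values: M = 4000 maps of Tm = 60 s on M_slots = 2000, and R = 2000 reduces of Tr = 30 s on R_slots = 2000. The sequential model gives 60 * 2 + 30 * 1 = 150 s, while the overlapped model with R_new = M = 4000 gives max(60, 30) * (2 * 4000 / 4000) = 120 s.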
26. ADDONS - HBASE
Database – column-oriented storage
-> suited to data warehousing
Based on Google BigTable
Stores huge amounts of data
Efficient access to elements by (row, column)
Not all columns need to be present for every row!
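A brief sketch of (row, column) access through the classic HBase Java client; the table name, column family, and values are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");                 // hypothetical table
    // Write one cell: row "user1", column family "info", qualifier "city".
    Put put = new Put(Bytes.toBytes("user1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Tehran"));
    table.put(put);
    // Read it back by (row, column).
    Result result = table.get(new Get(Bytes.toBytes("user1")));
    byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
    System.out.println(Bytes.toString(city));
    table.close();
  }
}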
27. ADDONS - PIG LATIN
High-level data flow language running on Hadoop. Used for data analysis.
fileA:
User1, a
User1, b
User2, c
Log = LOAD 'fileA' AS (user, value);
Grp = GROUP Log BY user;
Count = FOREACH Grp GENERATE group, COUNT(Log);
STORE Count INTO 'outputFile';
outputFile:
User1, 2
User2, 1
28. ADDONS - HIVE
High-level language – SQL-like load and query
CREATE TABLE T (a INT, b STRING) ...
LOAD DATA INPATH 'file_name' INTO TABLE T;
SELECT * FROM ...
Allows joins and other more powerful features
29. CONCLUSION
• Hadoop is a fair middleware for distributed data processing
• Restrictive usage: high-volume data and <key,value>
processing
• No clear separation between processing and resource
management
• But… a very active project; evolution will bring improvements