Hadoop MapReduce Introduction and Deep Insight

July 9, 2012
Anty Rao
Big Data Engineering Team
Hanborq Inc.

Outline
• Architecture
• Job Tracker
• Task Tracker
• Map/Reduce internal
• Optimization
• YARN

Architecture

[Diagram: the Client talks to the JobTracker over MapReduce RPC; three TaskTrackers exchange heartbeats with the JobTracker; each TaskTracker hosts several Child JVMs.]

Job Tracker
• Manages cluster resources
• Job scheduling

Implementation Overview

ExpireLaunchingTasks
• A thread that times out tasks that have been assigned to
task trackers but have not reported back yet.
• Once a task has reported in, its task tracker takes over
responsibility for monitoring its execution, such as killing
an unresponsive task.
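
The expiry check described above could be sketched roughly as follows. This is a minimal illustration, not the JobTracker's actual code: the class `LaunchExpirySketch`, its method names, and the 10-minute timeout used in `main` are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: tracks task attempts that were assigned but have not reported yet.
class LaunchExpirySketch {
    private final long timeoutMs;
    private final Map<String, Long> launchTimes = new HashMap<>();

    LaunchExpirySketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called when a task attempt is handed to a task tracker.
    synchronized void addLaunch(String attemptId, long nowMs) {
        launchTimes.put(attemptId, nowMs);
    }

    // Called on the first status report; from here on the task tracker monitors the task.
    synchronized void reported(String attemptId) {
        launchTimes.remove(attemptId);
    }

    // One pass of the expiry thread: returns the attempts that never reported in time.
    synchronized List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = launchTimes.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() > timeoutMs) {
                expired.add(e.getKey());
                it.remove();
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        LaunchExpirySketch sketch = new LaunchExpirySketch(600_000L); // assumed timeout
        sketch.addLaunch("attempt_1", 0L);
        sketch.addLaunch("attempt_2", 0L);
        sketch.reported("attempt_1");
        System.out.println(sketch.expire(700_000L));
    }
}
```

Passing the clock value in explicitly keeps the sketch deterministic; a real thread would read the system clock and sleep between passes.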

ExpireTrackers
• Monitors task tracker status and expires
task trackers that have gone down.
• After a task tracker dies, reschedules all tasks
that resided on it.

RetireJobs
• Removes old finished jobs that
have been around too long.
• The job tracker can’t retain info for all finished jobs.
• There is also an upper limit on the number of finished jobs
retained per user.
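
The two retirement rules above (age and a per-user cap) can be sketched as follows. The class `RetireJobsSketch`, its fields, and the limits used in `main` are hypothetical, and the sketch assumes each user's jobs are recorded in finish-time order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of retiring finished jobs by age and by a per-user cap.
class RetireJobsSketch {
    static final class FinishedJob {
        final String jobId;
        final String user;
        final long finishTimeMs;

        FinishedJob(String jobId, String user, long finishTimeMs) {
            this.jobId = jobId;
            this.user = user;
            this.finishTimeMs = finishTimeMs;
        }
    }

    private final long maxAgeMs;
    private final int maxPerUser;
    private final Map<String, Deque<FinishedJob>> byUser = new HashMap<>();

    RetireJobsSketch(long maxAgeMs, int maxPerUser) {
        this.maxAgeMs = maxAgeMs;
        this.maxPerUser = maxPerUser;
    }

    // Records a finished job; returns ids retired because the user exceeded the cap.
    List<String> jobFinished(FinishedJob job) {
        Deque<FinishedJob> queue = byUser.computeIfAbsent(job.user, u -> new ArrayDeque<>());
        queue.addLast(job);
        List<String> retired = new ArrayList<>();
        while (queue.size() > maxPerUser) {
            retired.add(queue.removeFirst().jobId); // oldest job of this user goes first
        }
        return retired;
    }

    // One pass of the retire thread: drops jobs that finished more than maxAgeMs ago.
    List<String> retireOld(long nowMs) {
        List<String> retired = new ArrayList<>();
        for (Deque<FinishedJob> queue : byUser.values()) {
            while (!queue.isEmpty() && nowMs - queue.peekFirst().finishTimeMs > maxAgeMs) {
                retired.add(queue.removeFirst().jobId);
            }
        }
        return retired;
    }

    public static void main(String[] args) {
        RetireJobsSketch sketch = new RetireJobsSketch(1000L, 2); // assumed limits
        sketch.jobFinished(new FinishedJob("job_1", "alice", 0L));
        sketch.jobFinished(new FinishedJob("job_2", "alice", 100L));
        System.out.println(sketch.jobFinished(new FinishedJob("job_3", "alice", 200L)));
        System.out.println(sketch.retireOld(1150L));
    }
}
```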

JobInitThread
• Initializes jobs that have just been
created.
• Job initialization includes:
– Create split info per map
– Create map tasks
– Create reduce tasks

TaskCommitQueue
• A thread that performs all HDFS-related
operations for tasks
– Promote outputs of COMMIT_PENDING tasks
– Discard outputs for FAILED/KILLED tasks
• All local file system operations are
handled by the task trackers.

HTTP Server
• Serves job tracker status
• Serves the status of all jobs
– Per-job metrics
• Serves job history

Key Data Structures
• JobInProgress
– Maintains all the info needed to keep a job on the straight and narrow.
– Keeps the job’s JobProfile and its latest JobStatus, plus a set of
tables for bookkeeping its tasks.
– Penalizes a lost task tracker once for each job that had tasks
running on it when it was lost.

• TaskInProgress
– Maintains all the info needed for a task during the lifetime of its owning
job.
– A given task might be speculatively executed or re-executed.
– Maintains multiple task states, one per task attempt.
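
The idea that one task owns several attempts, each with its own state, can be sketched as below. This is a simplified illustration under assumed names (`TaskInProgressSketch`, `State`, the attempt id format); the real TaskInProgress class tracks far more.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one task, many attempts (re-executed or speculative).
class TaskInProgressSketch {
    enum State { RUNNING, SUCCEEDED, FAILED, KILLED }

    private final String taskId;
    private final Map<String, State> attempts = new LinkedHashMap<>();
    private int nextAttempt = 0;

    TaskInProgressSketch(String taskId) {
        this.taskId = taskId;
    }

    // Starts a new attempt, e.g. after a failure or for speculative execution.
    String addAttempt() {
        String attemptId = taskId + "_" + nextAttempt++;
        attempts.put(attemptId, State.RUNNING);
        return attemptId;
    }

    // Records the latest state reported for one attempt.
    void update(String attemptId, State state) {
        if (!attempts.containsKey(attemptId)) {
            throw new IllegalArgumentException("unknown attempt: " + attemptId);
        }
        attempts.put(attemptId, state);
    }

    // The task is complete as soon as any one attempt has succeeded.
    boolean isComplete() {
        return attempts.containsValue(State.SUCCEEDED);
    }

    int runningAttempts() {
        int running = 0;
        for (State s : attempts.values()) {
            if (s == State.RUNNING) {
                running++;
            }
        }
        return running;
    }

    public static void main(String[] args) {
        TaskInProgressSketch tip = new TaskInProgressSketch("task_m_000001");
        String first = tip.addAttempt();
        String speculative = tip.addAttempt();
        tip.update(speculative, State.SUCCEEDED);
        tip.update(first, State.KILLED);
        System.out.println(tip.isComplete());
    }
}
```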

The whole life of a job

The life of a job
• Client
– The user creates a custom mapper and reducer; the client
computes splits and uploads the job configuration file, jar
file, and split meta info to HDFS
– Submits the job to the job tracker
• Job Tracker
– Initializes the job: reads in the job split info, determines the
final number of maps, and creates all needed map and reduce
tasks along with the structures that represent them
– Tasks are pulled by task trackers through heartbeats

The life of a job
• Task Tracker
– Pulls tasks from the job tracker through heartbeats
– Initializes the job, only once per job
– Initializes the task
• Downloads all needed jar files, configuration files, and
distributed cache files from HDFS to local disk
• Creates a staging working directory for the task on local disk
• Localizes the configuration file
– Builds the Java launch options and sets up the Child
JVM

The life of a job
• Child JVM
– Uses RPC with the task tracker to get its task info
– Does the actual work: executes the map or
reduce function, reporting status regularly so that
the task tracker does not kill it as unresponsive.
– Retrieves map completion events from the task tracker, if
needed, and reports fetch failures to the TT
– When the task is done, reports COMMIT_PENDING or
SUCCEEDED state to the TT

Task Tracker
• Per-node agent
• Manage tasks

Implementation Overview

TT Main Thread
• Heartbeats with the JT periodically to report task
status and retrieve directives, which include
launch task, kill job, and kill task actions
• Kills tasks that have been unresponsive for longer
than the configured time period
• If there isn’t enough disk space to
accommodate all running tasks, picks tasks to
kill
• If the TT has expired, reinitializes itself.

taskCleanupThread
• Thread dedicated to processing cleanup
actions assigned by the JT
– Kill job action
– Kill task action

directoryCleanupThread
• Before a task executes, an execution
environment is created
– Create staging directory
– Copy configuration file
– Etc.
• While a task runs, it may produce multiple
intermediate files in the local staging directory
• After the job/task completes or fails, this thread deletes all
of these leftover directories and files.

taskLauncher
• Localize job
• Localize task
• Create a taskRunner thread to manage
Child JVM

TaskRunner
• It’s a Thread
• Two types
– MapTaskRunner
– ReduceTaskRunner
• Main duties
– Builds the Java launch options and the
execution environment
– Launches and kills the Child JVM.

MapEventsFetcherThread
• When reduce tasks are in the shuffle
phase, fetches map completion events from
the JT over RPC, on a per-job basis.

Child JVM
• Actually execute map/reduce function
• Report status to TT periodically
• Retrieve map completion event from TT
for reducer task if needed.

Key data structures
• Running Jobs
– JobID
– JobConf
– Set<TaskInProgress>
• TaskInProgress
– Task
– TaskStatus
– TaskRunner

Map/Reduce Internal

Map/Reduce Programming Model

Source: Hadoop: The Definitive Guide

Map Phase Diagram

Steps of Map Phase
• Records emitted by the map function are continually put into a
circular buffer.
• When buffer usage exceeds io.sort.mb * io.sort.spill.percent, a
spill starts: records are sorted by partition and then by key, and
the buffer is written to disk together with an index file that
records where each partition begins.
• A final merge combines all the intermediate spill files into a
single large file, plus an index file.
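
The spill steps above can be sketched in a few lines. This is an illustration under assumptions, not the MapTask implementation: `SpillSketch`, `Record`, and the helper names are hypothetical, records are held as objects rather than serialized bytes, and the index is reduced to the list offset where each partition starts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the map-side spill: threshold, sort order, partition index.
class SpillSketch {
    static final class Record {
        final int partition;
        final String key;

        Record(int partition, String key) {
            this.partition = partition;
            this.key = key;
        }
    }

    // Buffer usage (in bytes) at which a spill starts: io.sort.mb * io.sort.spill.percent.
    static long spillThresholdBytes(int ioSortMb, double spillPercent) {
        return (long) (ioSortMb * 1024L * 1024L * spillPercent);
    }

    // Sorts buffered records by partition first, then by key within a partition.
    static void sortForSpill(List<Record> buffer) {
        buffer.sort(Comparator.comparingInt((Record r) -> r.partition)
                .thenComparing((Record r) -> r.key));
    }

    // Index: for each partition, the offset in the sorted list where it begins.
    static int[] partitionStarts(List<Record> sorted, int numPartitions) {
        int[] starts = new int[numPartitions];
        int pos = 0;
        for (int p = 0; p < numPartitions; p++) {
            while (pos < sorted.size() && sorted.get(pos).partition < p) {
                pos++;
            }
            starts[p] = pos;
        }
        return starts;
    }

    public static void main(String[] args) {
        List<Record> buffer = new ArrayList<>();
        buffer.add(new Record(1, "b"));
        buffer.add(new Record(0, "z"));
        buffer.add(new Record(1, "a"));
        buffer.add(new Record(0, "c"));
        sortForSpill(buffer);
        System.out.println(Arrays.toString(partitionStarts(buffer, 3)));
        System.out.println(spillThresholdBytes(100, 0.5)); // example values only
    }
}
```

An empty partition simply gets the same start offset as the next one, which is how a reducer with no input from this map would see a zero-length segment.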

Main map-side tuning Knobs

Reduce Phase Diagram

<property>
  <name>mapred.tasktracker.indexcache.mb</name>
  <value>10</value>
  <description>The maximum memory that a task tracker allows for the
  index cache that is used when serving map outputs to reducers.
  </description>
</property>

Steps of Reduce Phase
• Pull data over from the maps. If there is space
available in memory and the map output is
smaller than
25% * HeapSize * mapred.job.shuffle.input.buffer.percent,
keep it in memory; otherwise store it directly on disk.
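
The memory-or-disk decision above amounts to two comparisons, sketched here. The class and method names are hypothetical, the 25% factor is taken from the slide, and the real shuffle code may behave differently when memory is short (for example by waiting rather than going to disk).

```java
// Hypothetical sketch of the reduce-side decision: keep a fetched map output in memory or not.
class ShuffleBufferSketch {
    // Total shuffle buffer: HeapSize * mapred.job.shuffle.input.buffer.percent.
    static long shuffleBufferBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent);
    }

    // Largest single map output allowed in memory: 25% of the shuffle buffer.
    static long maxSingleOutputBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent * 0.25);
    }

    // True if the output is small enough and there is still room in the buffer.
    static boolean fitsInMemory(long outputBytes, long usedBytes,
                                long heapBytes, double inputBufferPercent) {
        long buffer = shuffleBufferBytes(heapBytes, inputBufferPercent);
        long maxSingle = maxSingleOutputBytes(heapBytes, inputBufferPercent);
        return outputBytes < maxSingle && usedBytes + outputBytes <= buffer;
    }

    public static void main(String[] args) {
        long heap = 1024L * 1024L * 1024L; // assumed 1 GB reduce heap
        double percent = 0.5;              // example value only
        System.out.println(fitsInMemory(64L << 20, 0L, heap, percent));
        System.out.println(fitsInMemory(256L << 20, 0L, heap, percent));
    }
}
```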

Steps of Reduce Phase(Cont.)
• The merge operation merges and sorts data
from memory and/or disk and writes the result to
disk. It comes in two
flavors:
– In-memory merge
• Triggered when accumulated memory usage
exceeds mapred.job.shuffle.merge.percent.
– On-disk merge
• Triggered when the number of files on disk
exceeds the configured threshold.

Steps of Reduce Phase(Cont.)
• When shuffle and sort complete, the following
constraints must be satisfied before the reduce
function is fed:
– Memory used for buffering reduce input
can’t exceed
mapred.job.reduce.input.buffer.percent;
– The number of files on disk can’t exceed io.sort.factor

Notes about Reduce
• Shuffle and sort use a percentage of the reduce heap
to buffer shuffle data, because reduce can’t
start until shuffle and sort complete. By contrast,
the map-side buffer size is
determined by io.sort.mb.
• Reduce input may consist of multiple files, not
necessarily a single file; a heap
iterator merges them to feed the reduce function.
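
A heap iterator over several sorted inputs is a k-way merge driven by a priority queue. The sketch below shows the idea on in-memory lists of string keys; `HeapMergeSketch` and `Cursor` are hypothetical names, and real reduce input would be read from memory segments and disk files instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: feed reduce from several sorted segments without merging them to one file.
class HeapMergeSketch {
    // A cursor over one sorted segment; 'head' is the next key it would yield.
    private static final class Cursor {
        final Iterator<String> it;
        String head;

        Cursor(Iterator<String> it) {
            this.it = it;
            this.head = it.next();
        }

        boolean advance() {
            if (it.hasNext()) {
                head = it.next();
                return true;
            }
            return false;
        }
    }

    // Returns all keys in sorted order; each segment must already be sorted.
    static List<String> merge(List<List<String>> segments) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> segment : segments) {
            if (!segment.isEmpty()) {
                heap.add(new Cursor(segment.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            out.add(smallest.head);
            if (smallest.advance()) {
                heap.add(smallest); // re-insert with its new head
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> segments = new ArrayList<>();
        segments.add(Arrays.asList("a", "d"));
        segments.add(Arrays.asList("b", "c"));
        System.out.println(merge(segments));
    }
}
```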

Reduce-side Key Parameters

Optimization Tuning
• We can use
mapred.job.reduce.input.buffer.percent, which
specifies how much memory can be set aside
as the reduce input buffer
• Compare the following
cases
– Case-1
– Case-2
– Case-3

Case-1

All reduce input reside on disk

Case-2

Part of the data in memory, plus data on
disk, as reduce input

Case-3

Much better, all data in memory

• If the reduce function doesn’t stress memory too
much, we can spare some memory to
buffer reduce input and boost overall
performance.
• What’s more, if the input data is small, we can
let reduces hold all intermediate data in
memory, with no disk access.

Shuffle:
Netty Server & Batch Fetch (1)
• Less TCP connection overhead.
• Reduce the effect of TCP slow start.
• More importantly, better shuffle scheduling in
the reduce phase results in better overall
performance.

Shuffle:
Netty Server & Batch Fetch (2)
One connection per map
• Each fetch thread in a reduce copies one map output per
connection, even if there are many outputs on the TT.

vs

Batch fetch
• A fetch thread copies multiple map outputs per connection.
• This fetch thread takes over the TT; other fetch threads can’t
fetch outputs from this TT during the copying period.
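
The scheduling side of batch fetch comes down to grouping pending map outputs by the task tracker that serves them, so that one connection can carry a whole batch. A minimal sketch, with hypothetical class, field, and host names:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group pending map outputs by host so each host needs one connection.
class BatchFetchSketch {
    static final class MapOutput {
        final String host;
        final String mapId;

        MapOutput(String host, String mapId) {
            this.host = host;
            this.mapId = mapId;
        }
    }

    // Returns host -> map ids to fetch over a single connection to that host.
    static Map<String, List<String>> batchByHost(List<MapOutput> pending) {
        Map<String, List<String>> batches = new LinkedHashMap<>();
        for (MapOutput output : pending) {
            batches.computeIfAbsent(output.host, h -> new ArrayList<>()).add(output.mapId);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<MapOutput> pending = new ArrayList<>();
        pending.add(new MapOutput("tt1", "map_0"));
        pending.add(new MapOutput("tt2", "map_1"));
        pending.add(new MapOutput("tt1", "map_2"));
        System.out.println("connections, one per map: " + pending.size());
        System.out.println("connections, batch fetch: " + batchByHost(pending).size());
    }
}
```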

Sort Avoidance
• Many real-world jobs require shuffling but not sorting, and
sorting adds considerable overhead.
– Hash Aggregation
– Hash Join
– … etc.

• When sorting is turned off, the mapper feeds data to the reducer
which directly passes the data to the Reduce() function bypassing
the intermediate sorting step.
– Spilling, Partitioning, Merging and Reducing will be more efficient.

• How to turn off sorting?
– JobConf job = (JobConf) getConf();
– job.setBoolean("mapred.sort.avoidance", true);

• MAPREDUCE-4039

Sort Avoidance: Spill and Partition
• When spilling, records are compared by partition
only.
• Partitions are ordered with counting sort [O(n)]
rather than quicksort [O(n log n)].
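
Ordering records by partition alone is a natural fit for counting sort, because partition numbers are small integers in a known range. The sketch below is a generic stable counting sort over partition ids; the class and method names are hypothetical and it is not taken from the MAPREDUCE-4039 patch.

```java
import java.util.Arrays;

// Hypothetical sketch: stable counting sort of record indices by partition id, O(n + partitions).
class PartitionCountingSort {
    // partitions[i] is the partition of record i; returns record indices grouped by partition.
    static int[] sortByPartition(int[] partitions, int numPartitions) {
        int[] starts = new int[numPartitions + 1];
        for (int p : partitions) {
            starts[p + 1]++;                 // count records per partition
        }
        for (int p = 0; p < numPartitions; p++) {
            starts[p + 1] += starts[p];      // prefix sums give each partition's start offset
        }
        int[] order = new int[partitions.length];
        for (int i = 0; i < partitions.length; i++) {
            order[starts[partitions[i]]++] = i; // place record i in its partition's next slot
        }
        return order;
    }

    public static void main(String[] args) {
        int[] partitions = {2, 0, 1, 0, 2};
        System.out.println(Arrays.toString(sortByPartition(partitions, 3)));
    }
}
```

Records within a partition keep their arrival order, which is all that is needed once key ordering is no longer required.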

Sort Avoidance: Early Reduce
(Remove shuffle barrier)
• Currently the reduce function can’t start until
all map outputs have been fetched.
• When sorting is unnecessary, the reduce function
can start as soon as any map
output is available.
• This greatly improves overall performance!

Sort Avoidance: Bytes Merge
• No overhead from key/value serialization/deserialization
or comparison
• No need to handle records, just bytes
• Just concatenate byte streams together: read in bytes,
write out bytes.
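
A bytes merge in this sense is plain concatenation. The sketch below uses in-memory byte arrays to stay self-contained; `BytesMergeSketch` is a hypothetical name, and a real merge would stream between files and would also have to deal with any per-segment framing, which is ignored here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: merge segments by concatenating raw bytes, never decoding records.
class BytesMergeSketch {
    static byte[] concat(List<byte[]> segments) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] segment : segments) {
            out.write(segment, 0, segment.length); // no deserialization, no key comparison
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        List<byte[]> segments = new ArrayList<>();
        segments.add("k1v1".getBytes(StandardCharsets.UTF_8));
        segments.add("k2v2".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(concat(segments), StandardCharsets.UTF_8));
    }
}
```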

Sort Avoidance:
Sequential Reduce Input
• Input files are read sequentially to feed the reduce
function, so there are no disk seeks and performance
is better.

YARN
(Yet Another Resource Negotiator)

Current Limitations
• Hard partition of resources into map and
reduce slots
– Low resource utilization
• Lacks support for alternate paradigms
– Iterative applications implemented using
MapReduce are 10x slower.
– Hacks for the likes of MPI/Graph Processing
• Lack of wire-compatible protocols
– Client and cluster must be of the same version
– Applications and work flows cannot migrate to
different clusters

Current Limitations(Cont.)
• Scalability
– Maximum cluster size – 4,000 nodes
– Maximum concurrent tasks – 40,000
– Coarse synchronization in JobTracker
• Single point of failure
– Failure kills all queued and running jobs
– Jobs need to be re-submitted by user
• Restart is very tricky due to complex state

Yarn Architecture

Architecture
• Resource Manager
– Global resource scheduler
– Hierarchical queues
• Node Manager
– Per-machine agent
– Manages the life-cycle of container
– Container resource monitoring
• Application Master
– Per-application
– Manages application scheduling and task execution
– E.g. MapReduce Application Master

Design Centre
• Split up the two major functions of the
JobTracker
– Cluster resource management
– Application life-cycle management
• MapReduce becomes a user-land library

Code
• MapReduce Classic
– Mess
• Yarn
– better

Questions?

ant.rao@gmail.com

Secondary Sort
• Want to sort by value
• Solution
– setOutputKeyComparatorClass
– setOutputValueGroupingComparator
– Partitioner
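
How the three pieces cooperate can be shown without Hadoop on the classpath. In the sketch below, `SORT` plays the role of the comparator set with setOutputKeyComparatorClass, `GROUP` the one set with setOutputValueGroupingComparator, and `partition` the Partitioner. All class and field names are hypothetical, and the composite key (natural key plus value) is an assumption about how the job is modeled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of secondary sort: composite key, sort comparator, grouping comparator.
class SecondarySortSketch {
    static final class CompositeKey {
        final String natural; // the real grouping key
        final int value;      // the value we also want sorted

        CompositeKey(String natural, int value) {
            this.natural = natural;
            this.value = value;
        }
    }

    // Sort order: natural key first, then value.
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.natural)
                    .thenComparingInt(k -> k.value);

    // Grouping: keys with the same natural key go to one reduce call.
    static final Comparator<CompositeKey> GROUP =
            Comparator.comparing((CompositeKey k) -> k.natural);

    // Partition on the natural key only, so one reducer sees all values of a key.
    static int partition(CompositeKey key, int numReduces) {
        return (key.natural.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    // Simulates one reducer: sorts its keys, then splits them into reduce-call groups.
    static List<List<Integer>> reduceGroups(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<Integer>> groups = new ArrayList<>();
        CompositeKey previous = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups.add(new ArrayList<>()); // a new natural key starts a new group
            }
            groups.get(groups.size() - 1).add(key.value);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = new ArrayList<>();
        keys.add(new CompositeKey("b", 2));
        keys.add(new CompositeKey("a", 3));
        keys.add(new CompositeKey("a", 1));
        keys.add(new CompositeKey("b", 1));
        System.out.println(reduceGroups(keys));
    }
}
```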
