7. ExpireLaunchingTasks
• A thread that times out tasks that have been
assigned to task trackers but have not
reported back yet.
• Once a task tracker has reported on a task, the
task tracker takes over responsibility for
monitoring its execution, such as killing
unresponsive tasks.
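The expiry logic described above can be sketched as follows. This is a minimal illustration, not the actual JobTracker code: the class name LaunchExpirySketch, its methods, and the use of plain string attempt IDs are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: tracks tasks that were assigned but have not reported back.
final class LaunchExpirySketch {
    private final long timeoutMs;
    // attempt id -> time (ms) at which the task was handed to a task tracker
    private final Map<String, Long> launchTimes = new LinkedHashMap<>();

    LaunchExpirySketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called when a task is assigned to a task tracker.
    synchronized void addLaunched(String attemptId, long nowMs) {
        launchTimes.put(attemptId, nowMs);
    }

    // Called on the first status report; the task tracker now monitors the task.
    synchronized void reported(String attemptId) {
        launchTimes.remove(attemptId);
    }

    // Called periodically by the expiry thread; returns the attempts to fail.
    synchronized List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = launchTimes.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() >= timeoutMs) {
                expired.add(e.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A background thread would call expire() at a fixed interval and mark the returned attempts as failed so that they can be rescheduled.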
8. ExpireTrackers
• Monitors task tracker status and expires
task trackers that have gone down.
• After a task tracker dies, all tasks that
resided on it are rescheduled.
9. RetireJobs
• Removes old finished jobs that
have been around too long.
• The job tracker can’t retain info for every
finished job.
• There is also an upper limit on the number of
retained jobs per user.
10. JobInitThread
• Initializes jobs that have just been
created.
• Job initialization includes:
– Create split info per map
– Create map tasks
– Create reduce tasks
11. TaskCommitQueue
• A thread that performs all HDFS-related
operations for tasks
– Promote outputs of COMMIT_PENDING tasks
– Discard outputs for FAILED/KILLED tasks
• All local file system operations are
handled by the task trackers.
12. HTTP Server
• Supply job tracker status
• Supply all job status
– Per job metrics
• Supply history job status
13. Key Data Structures
• JobInProgress
– Maintains all the info for keeping a job on the straight and narrow.
– Keeps the job’s JobProfile and its latest JobStatus, plus a set of
tables for bookkeeping of its tasks.
– Penalizes a task tracker for each job that had tasks
running on it when the tracker was lost.
• TaskInProgress
– Maintains all the info needed for a task during the lifetime of its
owning job.
– A given task might be speculatively executed or re-executed.
– Maintains multiple task states, one for each task attempt.
16. The life of a job
• Client
– The user writes a custom mapper and reducer; the client
computes splits and uploads the job configuration file, jar
file, and split meta info to HDFS
– Submits the job to the job tracker
• Job Tracker
– Initializes the job, reads in the job split info, determines the
final number of maps, and creates all needed map and reduce
tasks along with the structures that represent
these tasks
– Task trackers pull tasks through heartbeats
17. The life of a job
• Task Tracker
– Pulls tasks from the job tracker through heartbeats
– Initializes the job, only once per job
– Initializes the task
• Downloads all needed jar files, configuration files, and
distributed cache entries from HDFS to local disk
• Creates a staging working directory for the task on local disk
• Localizes the configuration file
– Builds the Java launch options and sets up the child
JVM
18. The life of a job
• Child JVM
– Makes an RPC to the task tracker to get its task info
– Does the actual work: executes the map or
reduce function, reporting status regularly during this
period so that the task tracker does not kill it as unresponsive.
– Retrieves map completion events from the task tracker, if
needed, and reports fetch failures to the TT
– When the task is done, reports the COMMIT_PENDING or
SUCCEEDED state to the TT
22. TT Main Thread
• Heartbeats to the JT periodically to report task
status and retrieve directives, which include
launch task, kill job, and kill task
actions
• Kills tasks that stay unresponsive beyond the
configured time period
• If there isn’t enough disk space to
accommodate all running tasks, picks tasks to
kill
• If the TT expires, it reinitializes itself.
24. directoryCleanupThread
• Before a task executes, an execution
environment is created
– Create staging directory
– Copy configuration file
– Etc.
• While a task runs, it may produce multiple
intermediate files in the local staging directory
• After a job or task completes or fails, all
these leftover directories and files are deleted.
26. TaskRunner
• It’s a thread
• Two types
– MapTaskRunner
– ReduceTaskRunner
• Main duties
– Builds the Java launch options and
execution environment
– Launches and kills the child JVM.
28. Child JVM
• Executes the map or reduce function
• Reports status to the TT periodically
• Retrieves map completion events from the TT
for reduce tasks, if needed.
33. Steps of Map Phase
• Records emitted by the map function are put into a circular
buffer continually.
• When buffer usage exceeds
io.sort.mb*io.sort.spill.percent, a spill starts: it
sorts records by partition, then by key, and writes
the buffer out to disk, together with an index file
that records the position where each partition
begins.
• A merge then combines all the intermediate files into a
single large file, plus an index file.
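The spill step above can be illustrated with a small sketch. It is not Hadoop's MapOutputBuffer: the class SpillSketch, the Rec type, and the in-memory list standing in for the circular buffer are assumptions made for the example, and io.sort.mb is taken to be in megabytes with io.sort.spill.percent as a fraction (for example 100 and 0.80).

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the spill threshold and the sort-plus-index step.
final class SpillSketch {
    // One buffered map output record (illustrative only).
    static final class Rec {
        final int partition;
        final String key;
        final String value;

        Rec(int partition, String key, String value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Buffer usage in bytes at which a spill starts: io.sort.mb * io.sort.spill.percent.
    static long spillThresholdBytes(int ioSortMb, double spillPercent) {
        return Math.round(ioSortMb * 1024L * 1024L * spillPercent);
    }

    // Sorts the buffer by partition, then key, and returns an index where
    // index[p] is the position at which partition p begins in the sorted buffer.
    static int[] sortAndIndex(List<Rec> buffer, int numPartitions) {
        buffer.sort(Comparator.comparingInt((Rec r) -> r.partition)
                .thenComparing(r -> r.key));
        int[] start = new int[numPartitions];
        int pos = 0;
        for (int p = 0; p < numPartitions; p++) {
            start[p] = pos;
            while (pos < buffer.size() && buffer.get(pos).partition == p) {
                pos++;
            }
        }
        return start;
    }
}
```

With io.sort.mb = 100 and io.sort.spill.percent = 0.80, the threshold works out to 83,886,080 bytes (80 MB). An empty partition simply shares its start position with the next one.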
36. <property>
  <name>mapred.tasktracker.indexcache.mb</name>
  <value>10</value>
  <description>The maximum memory that a task tracker allows for the
  index cache that is used when serving map outputs to reducers.
  </description>
</property>
37. Steps of Reduce Phase
• Pull data over from the maps. If there is space
available in memory and the size of the file is
less than
25%*HeapSize*mapred.job.shuffle.input.buffer.percent,
put the file in memory; otherwise
store the file directly on disk.
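The memory-or-disk decision above can be written out as a short sketch. The class ShuffleMemorySketch and its method names are hypothetical, and "space available" is interpreted here as the new output still fitting in the shuffle buffer, which is an assumption rather than a statement about the real copier code.

```java
// Hypothetical sketch of the reduce-side decision: memory or disk?
final class ShuffleMemorySketch {
    // A single map output may use at most 25% of the shuffle buffer (from the text above).
    static final double MAX_SINGLE_SEGMENT_FRACTION = 0.25;

    // Shuffle buffer size: HeapSize * mapred.job.shuffle.input.buffer.percent.
    static long shuffleBufferBytes(long heapBytes, double inputBufferPercent) {
        return Math.round(heapBytes * inputBufferPercent);
    }

    // True if the map output should be kept in memory, false if it goes to disk.
    static boolean fitsInMemory(long mapOutputBytes, long usedBytes,
                                long heapBytes, double inputBufferPercent) {
        long buffer = shuffleBufferBytes(heapBytes, inputBufferPercent);
        long maxSingle = Math.round(buffer * MAX_SINGLE_SEGMENT_FRACTION);
        // Assumption: "space available" means the output still fits in the buffer.
        boolean spaceAvailable = usedBytes + mapOutputBytes <= buffer;
        return mapOutputBytes < maxSingle && spaceAvailable;
    }
}
```

For example, with a 1,000-byte heap and a buffer percent of 0.5, the shuffle buffer is 500 bytes and any single map output of 125 bytes or more goes straight to disk.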
38. Steps of Reduce Phase (Cont.)
• The merge operation merges and sorts data
from memory and/or disk and writes the result to
disk. Merge operations come in two
flavors:
– In-memory merge operation
• An in-memory merge can be triggered when the
accumulated memory usage exceeds
mapred.job.shuffle.merge.percent.
– On-disk merge operation
• An on-disk merge is triggered when the number of
files on disk exceeds a configured threshold.
39. Steps of Reduce Phase (Cont.)
• When shuffle and sort complete, the following
constraints must be satisfied before the reduce
function is fed:
– Memory used to buffer reduce input
can’t exceed
mapred.job.reduce.input.buffer.percent;
– The number of files on disk can’t exceed io.sort.factor
40. Notes about Reduce
• Shuffle and sort use a percentage of the reduce heap
to buffer shuffle data, because the reduce function can’t
start until shuffle and sort complete. In the
map phase, by contrast, the buffer size is
set by io.sort.mb.
• Reduce input may consist of multiple files, not
necessarily a single file. A heap
iterator is used to feed the reduce function.
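A heap iterator over several sorted inputs is a k-way merge driven by a priority queue. The sketch below shows the idea with in-memory string lists standing in for sorted segments; the class HeapMergeSketch and its Cursor helper are hypothetical names, not Hadoop's Merger classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: feed consumers from many sorted segments via a min-heap.
final class HeapMergeSketch {
    // One sorted segment together with its current smallest unread key.
    private static final class Cursor {
        final Iterator<String> rest;
        String head;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    // Merges already-sorted segments into one sorted sequence.
    static List<String> merge(List<List<String>> sortedSegments) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> seg : sortedSegments) {
            if (!seg.isEmpty()) {
                heap.add(new Cursor(seg.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();      // segment with the smallest head
            out.add(c.head);
            if (c.rest.hasNext()) {      // advance it and put it back
                c.head = c.rest.next();
                heap.add(c);
            }
        }
        return out;
    }
}
```

In a real reduce task the merged records would be streamed to the reduce function rather than collected into a list; the list is used here only to keep the example small.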
42. Optimization Tuning
• We can make use of
mapred.job.reduce.input.buffer.percent, which
specifies how much memory can be spared for
use as the reduce input buffer
• Compare the following
cases
– Case-1
– Case-2
– Case-3
46. • If the reduce function doesn’t stress memory too
much, we can spare some memory to
buffer reduce input and boost overall
performance.
• What’s more, if the input data is small, we can
let reduces hold all intermediate data in
memory, with no disk access involved.
48. Shuffle: Netty Server & Batch Fetch (1)
• Less TCP connection overhead.
• Reduces the effect of TCP slow start.
• More importantly, better shuffle scheduling in the
reduce phase results in better overall
performance.
49. Shuffle: Netty Server & Batch Fetch (2)
One connection per map
• Each fetch thread in a reduce copies one map output per
connection, even if there are many outputs on the TT.
vs
Batch fetch
• A fetch thread copies multiple map outputs per connection.
• This fetch thread takes over the TT; other fetch threads can’t fetch
outputs from this TT during the copying period.
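The difference between the two schemes comes down to how pending map outputs are grouped before connecting. A minimal sketch, assuming each pending output is described by a hypothetical {host, map attempt id} pair (the class BatchFetchSketch is not part of Hadoop):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plan one connection per task tracker instead of one per map output.
final class BatchFetchSketch {
    // Groups pending map outputs by host; each group is fetched over a single connection.
    static Map<String, List<String>> planBatches(List<String[]> pending) {
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String[] p : pending) {
            // p[0] = task tracker host, p[1] = map attempt id (illustrative layout)
            byHost.computeIfAbsent(p[0], h -> new ArrayList<>()).add(p[1]);
        }
        return byHost;
    }
}
```

With one connection per map, the number of connections equals the number of pending outputs; with batch fetch it equals the number of distinct hosts returned by planBatches.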
50. Sort Avoidance
• Many real-world jobs require shuffling but not sorting, and the
sorting adds significant overhead.
– Hash Aggregation
– Hash Join
– … etc.
• When sorting is turned off, the mapper feeds data to the reducer
which directly passes the data to the Reduce() function bypassing
the intermediate sorting step.
– Spilling, Partitioning, Merging and Reducing will be more efficient.
• How to turn off sorting?
– JobConf job = (JobConf) getConf();
– job.setBoolean("mapred.sort.avoidance", true);
• MAPREDUCE-4039
51. Sort Avoidance: Spill and Partition
• During a spill, records are compared by partition
only.
• Records are ordered by partition with a counting sort, O(n),
rather than a quicksort, O(n log n).
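Counting sort works here because the partition number is a small integer in a known range. The sketch below orders record indices by partition in linear time; the class PartitionCountingSort and the array-of-partition-numbers input are assumptions for illustration, not the code from the patch.

```java
// Hypothetical sketch: order records by partition with a counting sort.
final class PartitionCountingSort {
    // Returns record indices ordered by partition, keeping the original order
    // within each partition. Runs in O(n + numPartitions) time.
    static int[] sortByPartition(int[] partitions, int numPartitions) {
        int[] start = new int[numPartitions + 1];
        for (int p : partitions) {
            start[p + 1]++;                  // count records per partition
        }
        for (int p = 0; p < numPartitions; p++) {
            start[p + 1] += start[p];        // prefix sums give each partition's first slot
        }
        int[] order = new int[partitions.length];
        for (int i = 0; i < partitions.length; i++) {
            order[start[partitions[i]]++] = i;   // place record i in its partition's next slot
        }
        return order;
    }
}
```

For partitions {2, 0, 1, 0, 2} with three partitions, the returned order is {1, 3, 2, 0, 4}: both partition-0 records first, then the partition-1 record, then the two partition-2 records.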
52. Sort Avoidance: Early Reduce (Remove shuffle barrier)
• Currently the reduce function can’t start until
all map outputs have been fetched.
• When sorting is unnecessary, the reduce function
can start as soon as any map
output is available.
• This greatly improves overall performance!
53. Sort Avoidance: Bytes Merge
• No overhead of key/value serialization/deserialization
or comparison
• Don’t care about records, just bytes
• Just concatenate byte streams together: read in
bytes, write out bytes.
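A bytes merge of this kind reduces to copying streams end to end. A minimal sketch, assuming the segments are available as plain InputStreams (the class BytesMergeSketch is a hypothetical name, and real segment files may carry headers or checksums that this ignores):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical sketch: merge segments by concatenating raw bytes, with no record parsing.
final class BytesMergeSketch {
    // Copies every segment to the output in order and returns the total bytes written.
    static long concat(List<? extends InputStream> segments, OutputStream out) {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        try {
            for (InputStream in : segments) {
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);   // bytes in, bytes out
                    total += n;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

No key or value is ever deserialized or compared, which is where the saving over a sorted merge comes from.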
54. Sort Avoidance: Sequential Reduce Input
• Input files are read sequentially to feed the reduce
function, so there are no disk seeks and
performance is better.
56. Current Limitations
• Hard partition of resources into map and
reduce slots
– Low resource utilization
• Lacks support for alternate paradigms
– Iterative applications implemented using
MapReduce are 10x slower.
– Hacks for the likes of MPI/Graph Processing
• Lack of wire-compatible protocols
– Client and cluster must be of the same version
– Applications and work flows cannot migrate to
different clusters
57. Current Limitations (Cont.)
• Scalability
– Maximum cluster size: 4,000 nodes
– Maximum concurrent tasks: 40,000
– Coarse synchronization in JobTracker
• Single point of failure
– Failure kills all queued and running jobs
– Jobs need to be re-submitted by user
• Restart is very tricky due to complex state
59. Architecture
• Resource Manager
– Global resource scheduler
– Hierarchical queues
• Node Manager
– Per-machine agent
– Manages the life-cycle of container
– Container resource monitoring
• Application Master
– Per-application
– Manages application scheduling and task execution
– E.g. MapReduce Application Master
60. Design Centre
• Split up the two major functions of the
JobTracker
– Cluster resource management
– Application life-cycle management
• MapReduce becomes a user-land library