Akhilesh Joshi
akhileshjoshi123@gmail.com
▪Local Aggregation
▪Pairs and Stripes
▪Order inversion
▪Graph algorithms
▪ In Hadoop, intermediate results are written to local disk before being sent over the
network. Since network and disk latencies are relatively expensive compared to
other operations, reductions in the amount of intermediate data translate into
increases in algorithmic efficiency.
▪ In MapReduce, local aggregation of intermediate results is one of the keys to
efficient algorithms.
▪ Hence we use a COMBINER to perform this local aggregation, reducing the number
of intermediate key-value pairs passed from the Mapper to the Reducer.
▪ Combiners provide a general mechanism within the MapReduce framework to
reduce the amount of intermediate data generated by the mappers.
▪ They can be understood as mini-reducers that process the output of mappers.
▪ Combiners aggregate term counts across the documents processed by each map
task
▪ CONCLUSION
This results in a reduction in the number of intermediate key-value pairs that
need to be shuffled across the network ==> from the order of total number of terms in
the collection to the order of the number of unique terms in the collection.
An associative array (i.e., Map in Java) is introduced inside the mapper to tally up
term counts within a single document: instead of emitting a key-value pair for each
term in the document, this version emits a key-value pair for each unique term in the
document.
NOTE : the reducer is not changed !
In this case, we initialize an associative array for holding term counts. Since it is
possible to preserve state across multiple calls of the Map method (one call per input
key-value pair), we can continue to accumulate partial term counts in the associative
array across multiple documents, and emit key-value pairs only when the mapper
has processed all documents. That is, emission of intermediate data is deferred until
the Close method in the pseudo-code.
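The pseudo-code can be sketched as a runnable Python simulation (the setup/map/close class convention here is illustrative, not Hadoop's actual API):

```python
from collections import defaultdict

class InMapperCombiningWordCount:
    """In-mapper combining: partial counts are held in an associative
    array across map() calls and emitted only in close()."""

    def setup(self):
        self.counts = defaultdict(int)  # the associative array

    def map(self, docid, doc):
        # tally inside the mapper instead of emitting (term, 1) per token
        for term in doc.split():
            self.counts[term] += 1

    def close(self, emit):
        # deferred emission: one pair per unique term seen by this mapper
        for term, count in self.counts.items():
            emit(term, count)

mapper = InMapperCombiningWordCount()
mapper.setup()
mapper.map(1, "the quick brown fox")
mapper.map(2, "the lazy dog")
out = {}
mapper.close(lambda k, v: out.update({k: v}))
print(out)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Note that only one pair per unique term leaves the mapper, instead of one pair per token.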
IN MAPPER COMBINER
IMPLEMENTATION
▪ Advantage of using this design pattern is that we will have control over the local
aggregation
▪ In-mapper combiners should be preferred over actual combiners, since actual
combiners incur the overhead of creating and destroying objects for every
intermediate key-value pair
▪ Combiners do reduce the amount of intermediate data, but the mapper still emits
the full set of key-value pairs; the combiner only aggregates them afterwards
▪ Yes ! Using an in-mapper combiner tweaks the mapper to preserve state across
the documents
▪ This can create a subtle bug: the outcome of the algorithm may depend on the
order in which key-value pairs are received; we call it an ORDER-DEPENDENT BUG !
▪ Such a problem is difficult to detect when we are dealing with large datasets
Another Disadvantage ?? 
▪ There must be sufficient memory to hold the associative array until all the
key-value pairs are processed (in our word count example, the vocabulary may
grow too large to fit in the associative array !)
▪ SOLUTION : flush the in-memory state periodically by maintaining a counter. The size of
the blocks to be flushed is empirical and is hard to determine.
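A minimal sketch of the periodic-flush variant in Python; max_keys stands in for the empirical threshold mentioned above:

```python
from collections import defaultdict

def mapper_with_flush(docs, emit, max_keys=2):
    """In-mapper combining with periodic flushing: once the associative
    array holds max_keys entries, emit its contents and clear it so
    memory stays bounded regardless of vocabulary size."""
    counts = defaultdict(int)
    for doc in docs:
        for term in doc.split():
            counts[term] += 1
        if len(counts) >= max_keys:      # flush before memory runs out
            for term, c in counts.items():
                emit(term, c)
            counts.clear()
    for term, c in counts.items():       # final flush (the Close step)
        emit(term, c)

pairs = []
mapper_with_flush(["a b", "a c a"], lambda k, v: pairs.append((k, v)))
print(pairs)  # partial counts; downstream reducers still sum them correctly
```

Flushing emits the same key more than once, so less aggregation happens than with the pure in-mapper combiner, but the reducer's sum is unaffected.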
▪ Problem Statement : Compute the mean of the values for each key (say the key is an
alphanumeric employee id and the value is a salary)
▪ Addition and Multiplication are associative
▪ Division and Subtraction are not associative
1. Computing Average directly in reducer (no combiner)
2. Using Combiners to reduce workload on reducer
3. Using in-memory combiner to increase efficiency of approach 2
1. Computing Average directly in reducer (no combiner)
2. Using Combiners to reduce workload on reducer : DOES NOT WORK
3. Using in-memory combiner to increase efficiency of approach 2
WHY ?
Because the average is not associative, combiners would calculate averages in
separate map tasks and send them to the reducer. The reducer would then take these
averages and combine them into an average again. This leads to a wrong result,
since : AVERAGE(1,2,3,4,5) ≠ AVERAGE(AVERAGE(1,2),AVERAGE(3,4,5))
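A quick arithmetic check of this non-associativity:

```python
def average(xs):
    return sum(xs) / len(xs)

direct = average([1, 2, 3, 4, 5])                        # 15 / 5 = 3.0
nested = average([average([1, 2]), average([3, 4, 5])])  # (1.5 + 4.0) / 2 = 2.75
print(direct, nested)  # 3.0 2.75 : averaging averages gives the wrong answer
```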
▪ NOTES :
▪ NO COMBINER USED
▪ AVERAGE IS CALCULATED IN REDUCER
▪ MAPPER USED : IDENTITY MAPPER
▪ This algorithm works but has some
problems
Problems:
1. requires shuffling all key-value pairs from mappers to reducers
across the network
2. reducer cannot be used as a combiner
INCORRECT
Notes :
1. Combiners used
2. Wrong, since the output of the
combiner must match the output of the
mapper; here the output of the combiner
is a (sum, count) pair whereas the output
of the mapper was just a list of integers
3. This breaks the basic MapReduce
contract
CORRECTLY
Notes:
Correct implementation of combiner since
output of mapper is matching with output of
combiner
What if I don’t use a combiner ?
The reducer will still be able to calculate the mean
correctly at the end ; the combiner just acts as an
intermediary to reduce the reducer's workload.
Also , the output of the reducer need not be the same
as that of the combiner or mapper.
MORE EFFICIENT THAN ALL OTHER VERSIONS
Notes :
▪ Inside the mapper, the partial sums and counts
associated with each string are held in memory
across input key-value pairs
▪ Intermediate key-value pairs are emitted only after
the entire input split has been processed
▪ The in-mapper combiner uses resources
efficiently to reach the desired result
▪ The workload on the reducer is somewhat reduced
▪ WE ARE EMITTING SUM AND COUNT TO REACH
THE AVERAGE i.e. associative operations for a non-
associative (average) result.
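The in-mapper version for the mean, sketched in Python under the same illustrative setup/map/close convention:

```python
from collections import defaultdict

class InMapperMean:
    """Partial (sum, count) per key is held in memory across inputs
    and emitted only after the entire input split has been processed."""

    def setup(self):
        self.sums = defaultdict(int)
        self.counts = defaultdict(int)

    def map(self, key, value):
        self.sums[key] += value      # accumulate; emit nothing yet
        self.counts[key] += 1

    def close(self, emit):
        for key in self.sums:        # deferred emission of (sum, count)
            emit(key, (self.sums[key], self.counts[key]))

m = InMapperMean()
m.setup()
for k, v in [("e1", 100), ("e1", 200), ("e2", 50)]:
    m.map(k, v)
partials = {}
m.close(lambda k, v: partials.update({k: v}))
print(partials)  # {'e1': (300, 2), 'e2': (50, 1)}
```

The reducer then sums the (sum, count) pairs per key and performs a single division.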
▪ The concept of stripes is to aggregate data prior to the Reducers by using a
Combiner. There are several benefits to this, discussed below. When a Mapper
completes, its intermediate data sits idle in the pairs approach until all Mappers are
complete. With stripes, the intermediate data is passed to the Combiner, which
can start processing the data like a Reducer. So, instead of Mappers sitting
idle, they can execute the Combiner until the slowest Mapper finishes.
▪ Link : http://nosql.mypopescu.com/post/19286669299/mapreduce-pairs-and-
stripes-explained/
▪ Input to the problem
▪ Key-value pairs in the form of a docid and a doc
▪The mapper
▪ Processes each input document
▪ Emits key-value pairs with:
▪ Each co-occurring word pair as the key
▪ The integer one (the count) as the value
▪ This is done with two nested loops:
▪ The outer loop iterates over all words
▪ The inner loop iterates over all neighbors
▪The reducer:
▪ Receives pairs relative to co-occurring words
▪ This requires modifying the partitioner
▪ Computes an absolute count of the joint event
▪ Emits the pair and the count as the final key-value output
▪ Basically reducers emit the cells of the matrix
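A minimal Python sketch of the pairs approach (the window size and function names are illustrative):

```python
from collections import defaultdict

def pairs_map(doc, emit, window=2):
    """Pairs approach: two nested loops, the outer over all words, the
    inner over neighbors within the window; emit ((w, u), 1) per pair."""
    words = doc.split()
    for i, w in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        for u in left + right:
            emit((w, u), 1)

def pairs_reduce(pairs):
    # absolute count of each joint event: one matrix cell per key
    counts = defaultdict(int)
    for key, v in pairs:
        counts[key] += v
    return dict(counts)

emitted = []
pairs_map("a b c", lambda k, v: emitted.append((k, v)))
cells = pairs_reduce(emitted)
print(cells[("a", "b")])  # 1
```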
▪ Input to the problem
▪ Key-value pairs in the form of a docid and a doc
▪ The mapper:
▪ Same two nested loops structure as before
▪ Co-occurrence information is first stored in an associative array
▪ Emit key-value pairs with words as keys and the corresponding arrays as values
▪ The reducer:
▪ Receives all associative arrays related to the same word
▪ Performs an element-wise sum of all associative arrays with the same key
▪ Emits key-value output in the form of word, associative array
▪ Basically, reducers emit rows of the co-occurrence matrix
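The stripes approach can be sketched similarly in Python (window size and names again illustrative): the mapper emits one associative array per word, and the reducer performs the element-wise sum.

```python
from collections import defaultdict

def stripes_map(doc, emit, window=1):
    """Stripes approach: accumulate neighbor counts per word in an
    associative array first, then emit (word, stripe)."""
    words = doc.split()
    stripes = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for u in words[max(0, i - window):i] + words[i + 1:i + 1 + window]:
            stripes[w][u] += 1
    for w, stripe in stripes.items():
        emit(w, dict(stripe))

def stripes_reduce(stripe_list):
    """Element-wise sum of all stripes for one word: a full row of
    the co-occurrence matrix."""
    row = defaultdict(int)
    for stripe in stripe_list:
        for u, c in stripe.items():
            row[u] += c
    return dict(row)

out = []
for doc in ["a b a", "a b"]:
    stripes_map(doc, lambda w, s: out.append((w, s)))
a_row = stripes_reduce([s for w, s in out if w == "a"])
print(a_row)  # {'b': 3}
```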
▪ Generates a large number of key-value pairs (also intermediate)
▪ The benefit from combiners is limited, as it is less likely for a mapper to process
multiple occurrences of a word
▪ Does not suffer from memory paging problems
▪ More compact
▪ Generates fewer and shorter intermediate keys
▪ Can make better use of combiners
▪ The framework has less sorting to do
▪ The values are more complex and have serialization/deserialization overhead
▪ Greatly benefits from combiners, as the key space is the vocabulary
▪ Suffers from memory paging problems, if not properly engineered
“STRIPES”
▪ Idea: group together pairs into an associative array
▪ Each mapper takes a sentence:
▪ Generate all co-occurring term pairs
▪ For each term, emit a → { b: countb, c: countc, d: countd … }
▪ Reducers perform element-wise sum of associative arrays
(a, b) → 1
(a, c) → 2
(a, d) → 5
(a, e) → 3
(a, f) → 2
⇒ a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
= a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
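The element-wise sum of stripes can be checked with Python's Counter:

```python
from collections import Counter

s1 = {"b": 1, "d": 5, "e": 3}
s2 = {"b": 1, "c": 2, "d": 2, "f": 2}
merged = dict(Counter(s1) + Counter(s2))  # element-wise sum of two stripes
print(sorted(merged.items()))  # [('b', 2), ('c', 2), ('d', 7), ('e', 3), ('f', 2)]
```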
▪ Combiners can be used in both pairs and stripes, but implementing the
combiner in stripes gives better results because of the associative array
▪ The stripes approach might encounter memory problems, since it tries to fit the
associative array into memory
▪ The pairs approach does not face such problems w.r.t. in-memory space
▪ THE STRIPES APPROACH PERFORMS BETTER THAN PAIRS, BUT EACH HAS ITS
OWN SIGNIFICANCE.
▪ The memory problem in stripes can be dealt with by dividing the entire vocabulary
into buckets and applying stripes on individual buckets; this in turn reduces the
memory allocation required for the stripes approach.
ORDER INVERSION
▪ A drawback of the co-occurrence matrix is that some words appear together very
frequently simply because one of the words is very common
▪ Solution :
▪ convert absolute counts into relative frequencies, f(wj | wi). That is, what proportion of
the time does wj appear in the context of wi?
▪ f(wj | wi) = N(wi, wj) / Σw′ N(wi, w′), where N(·, ·) is the number of times a co-occurring
word pair is observed
▪ The denominator is called the marginal (the sum of the counts of the conditioning
variable co-occurring with anything else)
▪ In the reducer, the counts of all words that co-occur with the conditioning variable
(wi) are available in the associative array
▪ Hence, the sum of all those counts gives the marginal
▪ Then we divide the joint counts by the marginal and we’re done
▪ The reducer receives the pair (wi , wj) and the count
▪ From this information alone it is not possible to compute f(wj|wi)
▪ Fortunately, as for the mapper, also the reducer can preserve state across multiple
keys
▪ We can buffer in memory all the words that co-occur with wi and their counts
▪ This is basically building the associative array in the stripes method
We must define the sort order of the pair
▪ In this way, the keys are first sorted by the left word, and then by the right word (in the
pair)
▪ Hence, we can detect if all pairs associated with the word we are conditioning on (wi)
have been seen
▪ At this point, we can use the in-memory buffer, compute the relative frequencies and emit
We must ensure that all pairs with the same left word are sent to the same reducer. This does not happen automatically,
hence we use a custom partitioner to achieve this task . . .
▪ Emit a special key-value pair to capture the marginal
▪ Control the sort order of the intermediate key, so that the special key-value pair is
processed first
▪ Define a custom partitioner for routing intermediate key-value pairs
▪ Preserve state across multiple keys in the reducer
▪ The order inversion pattern is a nice trick that lets a reducer see intermediate results before it
processes the data that generated them.
▪ We illustrate this with the example of computing relative frequencies for co-occurring word pairs
e.g. what are the relative frequencies of words occurring within a small window of the word
"dog"? The mapper counts word pairs in the corpus, so its output looks like:
((dog, cat), 125)
((dog, foot), 246)
▪ But it also keeps a running total of all the word pairs containing "dog", outputting this as ((dog,*),
5348)
▪ Using a suitable partitioner, so that all (dog,...) pairs get sent to the same reducer, and choosing
the "*" token so that it occurs before any word in the sort order, the reducer sees the total ((dog,*),
5348) first, followed by all the other counts, and can trivially store the total and then output relative
frequencies.
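A Python sketch of the reducer side of this trick: because "*" sorts before any word, the marginal for each left word arrives first (the names and tiny input are illustrative; in real Hadoop a custom partitioner routes by left word and the framework does the sorting):

```python
MARGINAL = "*"  # special token; "*" sorts before any lowercase word

def relative_frequencies(pairs):
    """Order inversion: after sorting, ((w, '*'), marginal) precedes every
    ((w, u), count), so the reducer stores only the marginal (not a full
    in-memory stripe) and emits relative frequencies directly."""
    out, marginal = {}, None
    for (w, u), count in sorted(pairs):
        if u == MARGINAL:
            marginal = count          # seen first for each left word w
        else:
            out[(w, u)] = count / marginal
    return out

pairs = [(("dog", "cat"), 125), (("dog", "foot"), 246), (("dog", MARGINAL), 5348)]
freqs = relative_frequencies(pairs)
print(round(freqs[("dog", "cat")], 4))  # 0.0234
```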
The benefit of the pattern is that it avoids an extra MapReduce
iteration without creating any additional scalability bottleneck.
▪ Input to reducers are sorted by the keys
▪ Values are arbitrarily ordered
▪ We may want to order reducer values either ascending or descending.
▪ Solution :
▪ Buffer reducer values in memory and sort
▪ Disadvantage : if the data is too large , it may not fit in memory ; also there is
unnecessary creation of objects on the memory heap
▪ Use secondary sort design pattern in map reduce
▪ Uses shuffle and sort method
▪ Reducer values will be sorted
▪ Secondary key sorting is done by creating a composite key
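The composite-key idea can be simulated in Python: sorting happens on (natural key, value) while grouping uses only the natural key, so each reducer sees its values already sorted without buffering them:

```python
from itertools import groupby

records = [("k1", 5), ("k2", 1), ("k1", 2), ("k1", 9), ("k2", 7)]

# 1. shuffle & sort on the composite key (natural key, value)
shuffled = sorted(records)  # tuples compare lexicographically

# 2. group by the natural key only, as a grouping comparator would
grouped = {k: [v for _, v in grp]
           for k, grp in groupby(shuffled, key=lambda kv: kv[0])}
print(grouped)  # {'k1': [2, 5, 9], 'k2': [1, 7]}
```

In Hadoop the same split of responsibilities is achieved with a custom sort comparator, partitioner, and grouping comparator on the composite key.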
▪Parallel BFS
▪Page Rank
Design patterns in MapReduce

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Último (20)

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Design patterns in MapReduce

  • 2. ▪Local Aggregation ▪Pairs and Stripes ▪Order inversion ▪Graph algorithms
  • 3. ▪ In Hadoop, intermediate results are written to local disk before being sent over the network. Since network and disk latencies are expensive relative to other operations, reducing the amount of intermediate data translates directly into gains in algorithmic efficiency. ▪ In MapReduce, local aggregation of intermediate results is one of the keys to efficient algorithms. ▪ Hence we use a COMBINER to perform this local aggregation and shrink the set of intermediate key-value pairs passed from the mapper to the reducer.
  • 4.
  • 5. ▪ Combiners provide a general mechanism within the MapReduce framework to reduce the amount of intermediate data generated by the mappers. ▪ They can be understood as mini-reducers that process the output of mappers. ▪ Combiners aggregate term counts across the documents processed by each map task ▪ CONCLUSION This results in a reduction in the number of intermediate key-value pairs that need to be shuffled across the network ==> from the order of total number of terms in the collection to the order of the number of unique terms in the collection.
  • 6. An associative array (i.e., a Map in Java) is introduced inside the mapper to tally up term counts within a single document: instead of emitting a key-value pair for each term occurrence in the document, this version emits a key-value pair for each unique term in the document. NOTE: the reducer is not changed!
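The per-document tally can be sketched in plain Python (a simulation of the mapper logic, not Hadoop API code; the function name and sample document are illustrative):

```python
from collections import Counter

def map_with_tally(doc_id, doc):
    # tally term counts within this one document, then emit one
    # (term, count) pair per unique term instead of one pair per token
    for term, count in Counter(doc.split()).items():
        yield term, count

pairs = dict(map_with_tally("d1", "the dog saw the cat"))
```

For this five-token document the mapper emits only four pairs, one per unique term, with "the" already pre-aggregated to 2.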
  • 7. In this case, we initialize an associative array for holding term counts. Since it is possible to preserve state across multiple calls of the Map method (for each input key-value pair), we can continue to accumulate partial term counts in the associative array across multiple documents, and emit key-value pairs only when the mapper has processed all documents. That is, emission of intermediate data is deferred until the Close method in the pseudo-code. IN-MAPPER COMBINER IMPLEMENTATION
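A minimal Python sketch of the in-mapper combiner pattern, mirroring the Map/Close structure of the pseudo-code (class and method names are illustrative):

```python
from collections import defaultdict

class InMapperCombiner:
    """Preserve partial term counts across documents; emit only in close()."""
    def __init__(self):
        self.counts = defaultdict(int)

    def map(self, doc_id, doc):
        for term in doc.split():
            self.counts[term] += 1   # accumulate; emit nothing yet

    def close(self):
        # deferred emission of the fully aggregated pairs
        return list(self.counts.items())

m = InMapperCombiner()
m.map("d1", "the dog")
m.map("d2", "the cat")
emitted = m.close()
```

State is preserved across both documents, so "the" leaves the mapper exactly once, with count 2.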
  • 8. ▪ The advantage of this design pattern is that we gain full control over the local aggregation ▪ In-mapper combining should be preferred over actual combiners: the framework creates and destroys combiner objects repeatedly, which adds overhead, and it is not even guaranteed to run them ▪ A combiner does reduce the amount of intermediate data shipped over the network, but the mapper still emits every key-value pair before the combiner aggregates them; with in-mapper combining, only the aggregated pairs are ever materialized
  • 9. ▪ Yes! Using an in-mapper combiner tweaks the mapper to preserve state across documents ▪ This can introduce an ORDER-DEPENDENT BUG: if the algorithm's outcome depends on the order in which key-value pairs arrive, the preserved state can silently change the result ▪ Such problems are difficult to detect when dealing with large datasets. Another disadvantage? ▪ Sufficient memory is needed to hold the partial results until all key-value pairs are processed (in our word-count example, the vocabulary may exceed what the associative array can hold) ▪ SOLUTION: flush the in-memory structure periodically by maintaining a counter. The block size at which to flush is empirical and hard to determine.
  • 10. ▪ Problem statement: compute the mean of the values for each key (say the key is an alphanumeric employee id and the value is a salary) ▪ Addition and multiplication are associative ▪ Division and subtraction are not associative
  • 11. 1. Computing Average directly in reducer (no combiner) 2. Using Combiners to reduce workload on reducer 3. Using in-memory combiner to increase efficiency of approach 2
  • 12. 1. Computing the average directly in the reducer (no combiner) 2. Using combiners to reduce the workload on the reducer: DOES NOT WORK 3. Using an in-mapper combiner to increase the efficiency of approach 2 WHY? Because the average is not associative, combiners would compute averages inside separate map tasks and send them to the reducer. The reducer would then average those averages, which leads to a wrong answer, since: AVERAGE(1,2,3,4,5) ≠ AVERAGE(AVERAGE(1,2), AVERAGE(3,4,5))
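The non-associativity is easy to check numerically (values match the inequality on the slide):

```python
def average(values):
    return sum(values) / len(values)

full = average([1, 2, 3, 4, 5])                          # true mean
# what a combiner-of-averages would compute on a 2/3 split:
nested = average([average([1, 2]), average([3, 4, 5])])  # (1.5 + 4.0) / 2
```

`full` is 3.0 but `nested` is 2.75: averaging per-split averages weights the two splits equally even though they contain different numbers of values.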
  • 13. ▪ NOTES: ▪ NO COMBINER USED ▪ THE AVERAGE IS CALCULATED IN THE REDUCER ▪ MAPPER USED: IDENTITY MAPPER ▪ This algorithm works but has some problems. Problems: 1. It requires shuffling all key-value pairs from mappers to reducers across the network 2. The reducer cannot be used as a combiner
  • 14. INCORRECT Notes: 1. Combiners used 2. Wrong, because the output of the combiner must match the output of the mapper: here the combiner emits (sum, count) pairs whereas the mapper emits plain integers 3. This violates the basic MapReduce contract between mapper, combiner, and reducer
  • 15. CORRECT Notes: A correct implementation of the combiner, since the output of the mapper matches the output of the combiner. What if I don't use a combiner? The reducer will still compute the mean correctly at the end; the combiner just acts as an intermediary that reduces the reducer's workload. Also, the output of the reducer need not match that of the combiner or mapper.
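The correct scheme can be sketched in Python: both mapper and combiner emit (sum, count) pairs, and only the reducer performs the final (non-associative) division. The split of salaries across two map tasks is illustrative:

```python
def mapper(key, salary):
    yield key, (salary, 1)          # emit (sum, count), not an average

def sum_counts(pairs):
    # identical aggregation for combiner and reducer: add sums and counts
    s = sum(p[0] for p in pairs)
    c = sum(p[1] for p in pairs)
    return s, c

# two map tasks with deliberately uneven splits of salaries 1..5
task1 = [v for k, s in [("e", 1), ("e", 2)] for _, v in mapper(k, s)]
task2 = [v for k, s in [("e", 3), ("e", 4), ("e", 5)] for _, v in mapper(k, s)]
combined1 = sum_counts(task1)       # per-task combiner output
combined2 = sum_counts(task2)
s, c = sum_counts([combined1, combined2])
mean = s / c                        # division happens once, in the reducer
```

Because (sum, count) addition is associative, the uneven split no longer matters: the reducer gets 15/5 = 3.0 either way.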
  • 16. MORE EFFICIENT THAN ALL OTHER VERSIONS Notes: ▪ Inside the mapper, the partial sums and counts associated with each string are held in memory across input key-value pairs ▪ Intermediate key-value pairs are emitted only after the entire input split has been processed ▪ The in-mapper combiner uses resources efficiently to reach the desired result ▪ The workload on the reducer is somewhat reduced ▪ WE EMIT SUM AND COUNT TO REACH THE AVERAGE, i.e. associative operations in service of a non-associative (average) result.
  • 17. ▪ The concept of stripes is to aggregate data prior to the reducers by using a combiner. There are several benefits to this, discussed below. When a mapper completes, its intermediate data sits idle when pairing until all mappers are complete. With striping, the intermediate data is passed to the combiner, which can start processing through the data like a reducer. So, instead of mappers sitting idle, they can execute the combiner until the slowest mapper finishes. ▪ Link: http://nosql.mypopescu.com/post/19286669299/mapreduce-pairs-and-stripes-explained/
  • 18.
  • 19. ▪ Input to the problem ▪ Key-value pairs in the form of a docid and a doc ▪The mapper ▪ Processes each input document ▪ Emits key-value pairs with: ▪ Each co-occurring word pair as the key ▪ The integer one (the count) as the value ▪ This is done with two nested loops: ▪ The outer loop iterates over all words ▪ The inner loop iterates over all neighbors ▪The reducer: ▪ Receives pairs relative to co-occurring words ▪ This requires modifying the partitioner ▪ Computes an absolute count of the joint event ▪ Emits the pair and the count as the final key-value output ▪ Basically reducers emit the cells of the matrix
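The two nested loops of the pairs approach can be simulated in Python (assuming, for illustration, that a word's neighbors are all the other words in the line; real implementations usually use a fixed window):

```python
from collections import defaultdict

def pairs_mapper(doc):
    words = doc.split()
    for i, w in enumerate(words):        # outer loop: each word
        for j, u in enumerate(words):    # inner loop: its neighbors
            if i != j:
                yield (w, u), 1          # key = co-occurring pair, value = 1

# simulate shuffle + reduce: sum the ones per co-occurring pair
counts = defaultdict(int)
for pair, one in pairs_mapper("a b a"):
    counts[pair] += one
```

For the line "a b a", the reducer-side sums yield the cells (a,b)=2, (b,a)=2, and (a,a)=2, one cell of the co-occurrence matrix per key.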
  • 20.
  • 21. ▪ Input to the problem ▪ Key-value pairs in the form of a docid and a doc ▪ The mapper: ▪ Same two nested loops structure as before ▪ Co-occurrence information is first stored in an associative array ▪ Emit key-value pairs with words as keys and the corresponding arrays as values ▪ The reducer: ▪ Receives all associative arrays related to the same word ▪ Performs an element-wise sum of all associative arrays with the same key ▪ Emits key-value output in the form of word, associative array ▪ Basically, reducers emit rows of the co-occurrence matrix
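A Python sketch of the stripes approach (same illustrative assumption as before: neighbors are the other words in the line). The reducer's element-wise sum reuses the numbers from the "stripes" example slide:

```python
from collections import Counter

def stripes_mapper(doc):
    words = doc.split()
    for i, w in enumerate(words):
        # one associative array (stripe) per word occurrence
        stripe = Counter(u for j, u in enumerate(words) if j != i)
        yield w, stripe

def stripes_reducer(word, stripes):
    total = Counter()
    for s in stripes:
        total.update(s)       # element-wise sum of associative arrays
    return word, total        # one row of the co-occurrence matrix

# element-wise sum, as in the worked example on the "stripes" slide
_, total = stripes_reducer("a", [Counter({"b": 1, "d": 5, "e": 3}),
                                 Counter({"b": 1, "c": 2, "d": 2, "f": 2})])
```

Summing a → {b:1, d:5, e:3} with a → {b:1, c:2, d:2, f:2} gives a → {b:2, c:2, d:7, e:3, f:2}, matching the slide.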
  • 22.
  • 23. ▪ Generates a large number of key-value pairs (also intermediate) ▪ The benefit from combiners is limited, as it is less likely for a mapper to process multiple occurrences of a word ▪ Does not suffer from memory paging problems
  • 24. ▪ More compact ▪ Generates fewer and shorter intermediate keys ▪ Can make better use of combiners ▪ The framework has less sorting to do ▪ The values are more complex and have serialization/deserialization overhead ▪ Greatly benefits from combiners, as the key space is the vocabulary ▪ Suffers from memory paging problems, if not properly engineered
  • 25. “STRIPES” ▪ Idea: group together pairs into an associative array ▪ Each mapper takes a sentence: ▪ Generate all co-occurring term pairs ▪ For each term, emit a → { b: countb, c: countc, d: countd … } ▪ Reducers perform element-wise sum of associative arrays (a, b) → 1 (a, c) → 2 (a, d) → 5 (a, e) → 3 (a, f) → 2 a → { b: 1, c: 2, d: 5, e: 3, f: 2 } a → { b: 1, d: 5, e: 3 } a → { b: 1, c: 2, d: 2, f: 2 } a → { b: 2, c: 2, d: 7, e: 3, f: 2 } +
  • 26. ▪ Combiners can be used in both pairs and stripes, but implementing combiners in stripes gives better results because of the associative array ▪ The stripes approach might run into memory problems, since it tries to fit the associative array into memory ▪ The pairs approach does not face such in-memory problems ▪ THE STRIPES APPROACH PERFORMED BETTER THAN PAIRS, BUT EACH HAS ITS OWN SIGNIFICANCE. ▪ The memory problem in stripes can be dealt with by dividing the entire vocabulary into buckets and applying stripes to individual buckets; this in turn reduces the memory required by the stripes approach.
  • 27. ORDER INVERSION ▪ A drawback of the co-occurrence matrix is that some word pairs appear together very frequently simply because one of the words is very common ▪ Solution: ▪ Convert absolute counts into relative frequencies, f(wj | wi). That is, what proportion of the time does wj appear in the context of wi? ▪ N(·, ·) is the number of times a co-occurring word pair is observed ▪ The denominator is called the marginal (the sum of the counts of the conditioning variable co-occurring with anything else)
  • 28. ▪ In the reducer, the counts of all words that co-occur with the conditioning variable (wi) are available in the associative array ▪ Hence, the sum of all those counts gives the marginal ▪ Then we divide the joint counts by the marginal and we're done
  • 29. ▪ The reducer receives the pair (wi, wj) and the count ▪ From this information alone it is not possible to compute f(wj | wi) ▪ Fortunately, just like the mapper, the reducer can preserve state across multiple keys ▪ We can buffer in memory all the words that co-occur with wi and their counts ▪ This is basically building the associative array of the stripes method We must define the sort order of the pair ▪ In this way, the keys are first sorted by the left word, and then by the right word (in the pair) ▪ Hence, we can detect when all pairs associated with the word we are conditioning on (wi) have been seen ▪ At this point, we can use the in-memory buffer, compute the relative frequencies, and emit We must ensure that all pairs with the same left word are sent to the same reducer. This does not happen automatically, so we use a custom partitioner to achieve it.
  • 30.
  • 31. ▪ Emit a special key-value pair to capture the marginal ▪ Control the sort order of the intermediate key, so that the special key-value pair is processed first ▪ Define a custom partitioner for routing intermediate key-value pairs ▪ Preserve state across multiple keys in the reducer
  • 32. ▪ The order inversion pattern is a nice trick that lets a reducer see intermediate results before it processes the data that generated them. ▪ We illustrate this with the example of computing relative frequencies for co-occurring word pairs, e.g. what are the relative frequencies of words occurring within a small window of the word "dog"? The mapper counts word pairs in the corpus, so its output looks like: ((dog, cat), 125) ((dog, foot), 246) ▪ But it also keeps a running total of all the word pairs containing "dog", outputting this as ((dog, *), 5348) ▪ Using a suitable partitioner, so that all (dog, ...) pairs get sent to the same reducer, and choosing the "*" token so that it occurs before any word in the sort order, the reducer sees the total ((dog, *), 5348) first, followed by all the other counts, and can trivially store the total and then output relative frequencies. The benefit of the pattern is that it avoids an extra MapReduce iteration without creating any additional scalability bottleneck.
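The reducer side of order inversion can be sketched in Python. The input list below stands in for what the reducer receives after the custom sort (the "*" marginal first) and the custom partitioner (all (w, ·) pairs at one reducer); the counts are illustrative, not from a real corpus:

```python
def relative_frequencies(sorted_pairs):
    # the custom sort order guarantees (w, '*') arrives before any (w, u);
    # a custom partitioner (assumed) routes all (w, .) to this reducer
    marginal = {}
    out = {}
    for (w, u), count in sorted_pairs:
        if u == "*":
            marginal[w] = count              # marginal seen first: store it
        else:
            out[(w, u)] = count / marginal[w]  # joint / marginal
    return out

freqs = relative_frequencies([
    (("dog", "*"), 4),      # marginal total for "dog"
    (("dog", "cat"), 1),
    (("dog", "foot"), 3),
])
```

With a marginal of 4, the reducer emits f(cat | dog) = 0.25 and f(foot | dog) = 0.75 in a single pass, with no buffering of co-occurring words.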
  • 33. ▪ Input to reducers is sorted by key ▪ Values are arbitrarily ordered ▪ We may want the reducer values ordered, either ascending or descending. ▪ Solution: ▪ Buffer the reducer values in memory and sort ▪ Disadvantage: if the data is too large it may not fit in memory; it also creates unnecessary objects on the heap ▪ Use the secondary sort design pattern in MapReduce ▪ Uses the shuffle-and-sort machinery ▪ Reducer values arrive sorted
  • 34. ▪ Secondary key sorting is done by creating a composite key
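The idea can be simulated in a few lines of Python (the records are illustrative; in real Hadoop one also supplies a custom partitioner and grouping comparator so that records are still partitioned and grouped on the natural key alone):

```python
# composite key = (natural key, value); the framework's shuffle sorts on
# the full composite key, so each reducer sees its values already ordered
records = [("emp1", 50), ("emp2", 30), ("emp1", 20), ("emp2", 40)]
shuffled = sorted(records)   # tuples sort lexicographically: key, then value
```

After the sort, every group's values arrive ascending, with no in-memory buffering in the reducer: emp1 sees 20 then 50, and emp2 sees 30 then 40.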