Akhilesh Joshi
akhileshjoshi123@gmail.com
▪Local Aggregation
▪Pairs and Stripes
▪Order inversion
▪Graph algorithms
▪ In Hadoop, intermediate results are written to local disk before being sent over the
network. Since network and disk latencies are relatively expensive compared to
other operations, reductions in the amount of intermediate data translate into
increases in algorithmic efficiency.
▪ In MapReduce, local aggregation of intermediate results is one of the keys to
efficient algorithms.
▪ Hence we use a COMBINER to perform this local aggregation, reducing the number
of intermediate key-value pairs passed from the Mapper to the Reducer.
▪ Combiners provide a general mechanism within the MapReduce framework to
reduce the amount of intermediate data generated by the mappers.
▪ They can be understood as mini-reducers that process the output of mappers.
▪ Combiners aggregate term counts across the documents processed by each map
task
▪ CONCLUSION
This results in a reduction in the number of intermediate key-value pairs that
need to be shuffled across the network ==> from the order of total number of terms in
the collection to the order of the number of unique terms in the collection.
An associative array (i.e., Map in Java) is introduced inside the mapper to tally up
term counts within a single document: instead of emitting a key-value pair for each
term in the document, this version emits a key-value pair for each unique term in the
document.
NOTE : the reducer is not changed !
In this case, we initialize an associative array for holding term counts. Since it is
possible to preserve state across multiple calls of the Map method (one call per input
key-value pair), we can continue to accumulate partial term counts in the associative
array across multiple documents, and emit key-value pairs only when the mapper
has processed all documents. That is, emission of intermediate data is deferred until
the Close method in the pseudo-code.
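The pseudo-code can be sketched as a runnable Python simulation (the setup/map/close class convention here is illustrative, not Hadoop's actual API):

```python
from collections import defaultdict

class InMapperCombiningWordCount:
    """In-mapper combining: partial counts are held in an associative
    array across map() calls and emitted only in close()."""

    def setup(self):
        self.counts = defaultdict(int)  # the associative array

    def map(self, docid, doc):
        # tally inside the mapper instead of emitting (term, 1) per token
        for term in doc.split():
            self.counts[term] += 1

    def close(self, emit):
        # deferred emission: one pair per unique term seen by this mapper
        for term, count in self.counts.items():
            emit(term, count)

mapper = InMapperCombiningWordCount()
mapper.setup()
mapper.map(1, "the quick brown fox")
mapper.map(2, "the lazy dog")
out = {}
mapper.close(lambda k, v: out.update({k: v}))
print(out)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Note that only one pair per unique term leaves the mapper, instead of one pair per token.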
IN MAPPER COMBINER
IMPLEMENTATION
▪ Advantage of using this design pattern is that we will have control over the local
aggregation
▪ In-mapper combiners should be preferred over actual combiners, since actual
combiners incur the overhead of creating and destroying objects for every
intermediate key-value pair
▪ Combiners do reduce the amount of intermediate data, but the mapper still emits
the full set of key-value pairs; the combiner only aggregates them afterwards
▪ Yes ! Using an in-mapper combiner tweaks the mapper to preserve state across
the documents
▪ This can create a subtle bug: the outcome of the algorithm may depend on the
order in which key-value pairs are received; we call it an ORDER-DEPENDENT BUG !
▪ Such a problem is difficult to detect when we are dealing with large datasets
Another Disadvantage ?? 
▪ There must be sufficient memory to hold the associative array until all the
key-value pairs are processed (in our word count example, the vocabulary may
grow too large to fit in the associative array !)
▪ SOLUTION : flush the in-memory state periodically by maintaining a counter. The size of
the blocks to be flushed is empirical and is hard to determine.
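A minimal sketch of the periodic-flush variant in Python; max_keys stands in for the empirical threshold mentioned above:

```python
from collections import defaultdict

def mapper_with_flush(docs, emit, max_keys=2):
    """In-mapper combining with periodic flushing: once the associative
    array holds max_keys entries, emit its contents and clear it so
    memory stays bounded regardless of vocabulary size."""
    counts = defaultdict(int)
    for doc in docs:
        for term in doc.split():
            counts[term] += 1
        if len(counts) >= max_keys:      # flush before memory runs out
            for term, c in counts.items():
                emit(term, c)
            counts.clear()
    for term, c in counts.items():       # final flush (the Close step)
        emit(term, c)

pairs = []
mapper_with_flush(["a b", "a c a"], lambda k, v: pairs.append((k, v)))
print(pairs)  # partial counts; downstream reducers still sum them correctly
```

Flushing emits the same key more than once, so less aggregation happens than with the pure in-mapper combiner, but the reducer's sum is unaffected.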
▪ Problem Statement : Compute the mean of the values for each key (say the key is an
alphanumeric employee id and the value is a salary)
▪ Addition and Multiplication are associative
▪ Division and Subtraction are not associative
1. Computing Average directly in reducer (no combiner)
2. Using Combiners to reduce workload on reducer
3. Using in-memory combiner to increase efficiency of approach 2
1. Computing Average directly in reducer (no combiner)
2. Using Combiners to reduce workload on reducer : DOES NOT WORK
3. Using in-memory combiner to increase efficiency of approach 2
WHY ?
Because the average is not associative, combiners would calculate averages in
separate map tasks and send them to the reducer. The reducer would then take these
averages and combine them into an average again. This leads to a wrong result,
since : AVERAGE(1,2,3,4,5) ≠ AVERAGE(AVERAGE(1,2),AVERAGE(3,4,5))
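A quick arithmetic check of this non-associativity:

```python
def average(xs):
    return sum(xs) / len(xs)

direct = average([1, 2, 3, 4, 5])                        # 15 / 5 = 3.0
nested = average([average([1, 2]), average([3, 4, 5])])  # (1.5 + 4.0) / 2 = 2.75
print(direct, nested)  # 3.0 2.75 : averaging averages gives the wrong answer
```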
▪ NOTES :
▪ NO COMBINER USED
▪ AVERAGE IS CALCULATED IN REDUCER
▪ MAPPER USED : IDENTITY MAPPER
▪ This algorithm works but has some
problems
Problems:
1. requires shuffling all key-value pairs from mappers to reducers
across the network
2. reducer cannot be used as a combiner
INCORRECT
Notes :
1. Combiners used
2. Wrong, since the output of the
combiner must match the output of the
mapper; here the output of the combiner
is a (sum, count) pair whereas the output
of the mapper was just a list of integers
3. This breaks the basic MapReduce
contract
CORRECTLY
Notes:
Correct implementation of combiner since
output of mapper is matching with output of
combiner
What if I don’t use a combiner ?
The reducer will still be able to calculate the mean
correctly at the end ; the combiner just acts as an
intermediary to reduce the reducer's workload.
Also , the output of the reducer need not be the same
as that of the combiner or mapper.
MORE EFFICIENT THAN ALL OTHER VERSIONS
Notes :
▪ Inside the mapper, the partial sums and counts
associated with each string are held in memory
across input key-value pairs
▪ Intermediate key-value pairs are emitted only after
the entire input split has been processed
▪ The in-mapper combiner uses resources
efficiently to reach the desired result
▪ The workload on the reducer is somewhat reduced
▪ WE ARE EMITTING SUM AND COUNT TO REACH
THE AVERAGE i.e. associative operations for a non-
associative (average) result.
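The in-mapper version for the mean, sketched in Python under the same illustrative setup/map/close convention:

```python
from collections import defaultdict

class InMapperMean:
    """Partial (sum, count) per key is held in memory across inputs
    and emitted only after the entire input split has been processed."""

    def setup(self):
        self.sums = defaultdict(int)
        self.counts = defaultdict(int)

    def map(self, key, value):
        self.sums[key] += value      # accumulate; emit nothing yet
        self.counts[key] += 1

    def close(self, emit):
        for key in self.sums:        # deferred emission of (sum, count)
            emit(key, (self.sums[key], self.counts[key]))

m = InMapperMean()
m.setup()
for k, v in [("e1", 100), ("e1", 200), ("e2", 50)]:
    m.map(k, v)
partials = {}
m.close(lambda k, v: partials.update({k: v}))
print(partials)  # {'e1': (300, 2), 'e2': (50, 1)}
```

The reducer then sums the (sum, count) pairs per key and performs a single division.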
▪ The concept of stripes is to aggregate data prior to the Reducers by using a
Combiner. There are several benefits to this, discussed below. When a Mapper
completes, its intermediate data sits idle in the pairs approach until all Mappers are
complete. With stripes, the intermediate data is passed to the Combiner, which
can start processing the data like a Reducer. So, instead of Mappers sitting
idle, they can execute the Combiner until the slowest Mapper finishes.
▪ Link : http://nosql.mypopescu.com/post/19286669299/mapreduce-pairs-and-
stripes-explained/
▪ Input to the problem
▪ Key-value pairs in the form of a docid and a doc
▪The mapper
▪ Processes each input document
▪ Emits key-value pairs with:
▪ Each co-occurring word pair as the key
▪ The integer one (the count) as the value
▪ This is done with two nested loops:
▪ The outer loop iterates over all words
▪ The inner loop iterates over all neighbors
▪The reducer:
▪ Receives pairs relative to co-occurring words
▪ This requires modifying the partitioner
▪ Computes an absolute count of the joint event
▪ Emits the pair and the count as the final key-value output
▪ Basically reducers emit the cells of the matrix
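A minimal Python sketch of the pairs approach (the window size and function names are illustrative):

```python
from collections import defaultdict

def pairs_map(doc, emit, window=2):
    """Pairs approach: two nested loops, the outer over all words, the
    inner over neighbors within the window; emit ((w, u), 1) per pair."""
    words = doc.split()
    for i, w in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        for u in left + right:
            emit((w, u), 1)

def pairs_reduce(pairs):
    # absolute count of each joint event: one matrix cell per key
    counts = defaultdict(int)
    for key, v in pairs:
        counts[key] += v
    return dict(counts)

emitted = []
pairs_map("a b c", lambda k, v: emitted.append((k, v)))
cells = pairs_reduce(emitted)
print(cells[("a", "b")])  # 1
```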
▪ Input to the problem
▪ Key-value pairs in the form of a docid and a doc
▪ The mapper:
▪ Same two nested loops structure as before
▪ Co-occurrence information is first stored in an associative array
▪ Emit key-value pairs with words as keys and the corresponding arrays as values
▪ The reducer:
▪ Receives all associative arrays related to the same word
▪ Performs an element-wise sum of all associative arrays with the same key
▪ Emits key-value output in the form of word, associative array
▪ Basically, reducers emit rows of the co-occurrence matrix
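The stripes approach can be sketched similarly in Python (window size and names again illustrative): the mapper emits one associative array per word, and the reducer performs the element-wise sum.

```python
from collections import defaultdict

def stripes_map(doc, emit, window=1):
    """Stripes approach: accumulate neighbor counts per word in an
    associative array first, then emit (word, stripe)."""
    words = doc.split()
    stripes = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for u in words[max(0, i - window):i] + words[i + 1:i + 1 + window]:
            stripes[w][u] += 1
    for w, stripe in stripes.items():
        emit(w, dict(stripe))

def stripes_reduce(stripe_list):
    """Element-wise sum of all stripes for one word: a full row of
    the co-occurrence matrix."""
    row = defaultdict(int)
    for stripe in stripe_list:
        for u, c in stripe.items():
            row[u] += c
    return dict(row)

out = []
for doc in ["a b a", "a b"]:
    stripes_map(doc, lambda w, s: out.append((w, s)))
a_row = stripes_reduce([s for w, s in out if w == "a"])
print(a_row)  # {'b': 3}
```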
▪ Generates a large number of key-value pairs (also intermediate)
▪ The benefit from combiners is limited, as it is less likely for a mapper to process
multiple occurrences of a word
▪ Does not suffer from memory paging problems
▪ More compact
▪ Generates fewer and shorter intermediate keys
▪ Can make better use of combiners
▪ The framework has less sorting to do
▪ The values are more complex and have serialization/deserialization overhead
▪ Greatly benefits from combiners, as the key space is the vocabulary
▪ Suffers from memory paging problems, if not properly engineered
“STRIPES”
▪ Idea: group together pairs into an associative array
▪ Each mapper takes a sentence:
▪ Generate all co-occurring term pairs
▪ For each term, emit a → { b: countb, c: countc, d: countd … }
▪ Reducers perform element-wise sum of associative arrays
(a, b) → 1
(a, c) → 2
(a, d) → 5
(a, e) → 3
(a, f) → 2
⇒ a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
= a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
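The element-wise sum of stripes can be checked with Python's Counter:

```python
from collections import Counter

s1 = {"b": 1, "d": 5, "e": 3}
s2 = {"b": 1, "c": 2, "d": 2, "f": 2}
merged = dict(Counter(s1) + Counter(s2))  # element-wise sum of two stripes
print(sorted(merged.items()))  # [('b', 2), ('c', 2), ('d', 7), ('e', 3), ('f', 2)]
```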
▪ Combiners can be used in both pairs and stripes, but implementing the
combiner in stripes gives better results because of the associative array
▪ The stripes approach might encounter memory problems, since it tries to fit the
associative array into memory
▪ The pairs approach does not face such problems w.r.t. in-memory space
▪ THE STRIPES APPROACH PERFORMS BETTER THAN PAIRS, BUT EACH HAS ITS
OWN SIGNIFICANCE.
▪ The memory problem in stripes can be dealt with by dividing the entire vocabulary
into buckets and applying stripes on individual buckets; this in turn reduces the
memory allocation required for the stripes approach.
ORDER INVERSION
▪ A drawback of the co-occurrence matrix is that some words appear together very
frequently simply because one of the words is very common
▪ Solution :
▪ convert absolute counts into relative frequencies, f(wj | wi). That is, what proportion of
the time does wj appear in the context of wi?
▪ f(wj | wi) = N(wi, wj) / Σw′ N(wi, w′), where N(·, ·) is the number of times a co-occurring
word pair is observed
▪ The denominator is called the marginal (the sum of the counts of the conditioning
variable co-occurring with anything else)
▪ In the reducer, the counts of all words that co-occur with the conditioning variable
(wi) are available in the associative array
▪ Hence, the sum of all those counts gives the marginal
▪ Then we divide the joint counts by the marginal and we’re done
▪ The reducer receives the pair (wi , wj) and the count
▪ From this information alone it is not possible to compute f(wj|wi)
▪ Fortunately, as for the mapper, also the reducer can preserve state across multiple
keys
▪ We can buffer in memory all the words that co-occur with wi and their counts
▪ This is basically building the associative array in the stripes method
We must define the sort order of the pair
▪ In this way, the keys are first sorted by the left word, and then by the right word (in the
pair)
▪ Hence, we can detect if all pairs associated with the word we are conditioning on (wi)
have been seen
▪ At this point, we can use the in-memory buffer, compute the relative frequencies and emit
We must ensure that all pairs with the same left word are sent to the same reducer. This does not happen automatically,
hence we use a custom partitioner to achieve this task . . .
▪ Emit a special key-value pair to capture the marginal
▪ Control the sort order of the intermediate key, so that the special key-value pair is
processed first
▪ Define a custom partitioner for routing intermediate key-value pairs
▪ Preserve state across multiple keys in the reducer
▪ The order inversion pattern is a nice trick that lets a reducer see intermediate results before it
processes the data that generated them.
▪ We illustrate this with the example of computing relative frequencies for co-occurring word pairs
e.g. what are the relative frequencies of words occurring within a small window of the word
"dog"? The mapper counts word pairs in the corpus, so its output looks like:
((dog, cat), 125)
((dog, foot), 246)
▪ But it also keeps a running total of all the word pairs containing "dog", outputting this as ((dog,*),
5348)
▪ Using a suitable partitioner, so that all (dog,...) pairs get sent to the same reducer, and choosing
the "*" token so that it occurs before any word in the sort order, the reducer sees the total ((dog,*),
5348) first, followed by all the other counts, and can trivially store the total and then output relative
frequencies.
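A Python sketch of the reducer side of this trick: because "*" sorts before any word, the marginal for each left word arrives first (the names and tiny input are illustrative; in real Hadoop a custom partitioner routes by left word and the framework does the sorting):

```python
MARGINAL = "*"  # special token; "*" sorts before any lowercase word

def relative_frequencies(pairs):
    """Order inversion: after sorting, ((w, '*'), marginal) precedes every
    ((w, u), count), so the reducer stores only the marginal (not a full
    in-memory stripe) and emits relative frequencies directly."""
    out, marginal = {}, None
    for (w, u), count in sorted(pairs):
        if u == MARGINAL:
            marginal = count          # seen first for each left word w
        else:
            out[(w, u)] = count / marginal
    return out

pairs = [(("dog", "cat"), 125), (("dog", "foot"), 246), (("dog", MARGINAL), 5348)]
freqs = relative_frequencies(pairs)
print(round(freqs[("dog", "cat")], 4))  # 0.0234
```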
The benefit of the pattern is that it avoids an extra MapReduce
iteration without creating any additional scalability bottleneck.
▪ Input to reducers are sorted by the keys
▪ Values are arbitrarily ordered
▪ We may want to order reducer values either ascending or descending.
▪ Solution :
▪ Buffer reducer values in memory and sort
▪ Disadvantage : if the data is too large , it may not fit in memory ; also there is
unnecessary creation of objects on the memory heap
▪ Use secondary sort design pattern in map reduce
▪ Uses shuffle and sort method
▪ Reducer values will be sorted
▪ Secondary key sorting is done by creating a composite key
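The composite-key idea can be simulated in Python: sorting happens on (natural key, value) while grouping uses only the natural key, so each reducer sees its values already sorted without buffering them:

```python
from itertools import groupby

records = [("k1", 5), ("k2", 1), ("k1", 2), ("k1", 9), ("k2", 7)]

# 1. shuffle & sort on the composite key (natural key, value)
shuffled = sorted(records)  # tuples compare lexicographically

# 2. group by the natural key only, as a grouping comparator would
grouped = {k: [v for _, v in grp]
           for k, grp in groupby(shuffled, key=lambda kv: kv[0])}
print(grouped)  # {'k1': [2, 5, 9], 'k2': [1, 7]}
```

In Hadoop the same split of responsibilities is achieved with a custom sort comparator, partitioner, and grouping comparator on the composite key.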
▪Parallel BFS
▪Page Rank
Design patterns in MapReduce

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Último (20)

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Design patterns in MapReduce

  • 2. ▪Local Aggregation ▪Pairs and Stripes ▪Order inversion ▪Graph algorithms
  • 3. ▪ In Hadoop, intermediate results are written to local disk before being sent over the network. Since network and disk latencies are expensive relative to other operations, reducing the amount of intermediate data translates directly into gains in algorithmic efficiency. ▪ In MapReduce, local aggregation of intermediate results is one of the keys to efficient algorithms. ▪ Hence we use a COMBINER to perform this local aggregation and shrink the set of intermediate key-value pairs passed from the mapper to the reducer.
  • 4.
  • 5. ▪ Combiners provide a general mechanism within the MapReduce framework to reduce the amount of intermediate data generated by the mappers. ▪ They can be understood as mini-reducers that process the output of mappers. ▪ Combiners aggregate term counts across the documents processed by each map task ▪ CONCLUSION This results in a reduction in the number of intermediate key-value pairs that need to be shuffled across the network ==> from the order of total number of terms in the collection to the order of the number of unique terms in the collection.
  • 6. An associative array (i.e., a Map in Java) is introduced inside the mapper to tally up term counts within a single document: instead of emitting a key-value pair for each term occurrence in the document, this version emits a key-value pair for each unique term in the document. NOTE: the reducer is not changed!
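The per-document tally can be sketched in plain Python (a simulation of the mapper logic, not Hadoop API code; the function name and sample document are illustrative):

```python
from collections import Counter

def map_with_tally(doc_id, doc):
    # tally term counts within this one document, then emit one
    # (term, count) pair per unique term instead of one pair per token
    for term, count in Counter(doc.split()).items():
        yield term, count

pairs = dict(map_with_tally("d1", "the dog saw the cat"))
```

For this five-token document the mapper emits only four pairs, one per unique term, with "the" already pre-aggregated to 2.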
  • 7. In this case, we initialize an associative array for holding term counts. Since it is possible to preserve state across multiple calls of the Map method (for each input key-value pair), we can continue to accumulate partial term counts in the associative array across multiple documents, and emit key-value pairs only when the mapper has processed all documents. That is, emission of intermediate data is deferred until the Close method in the pseudo-code. IN-MAPPER COMBINER IMPLEMENTATION
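A minimal Python sketch of the in-mapper combiner pattern, mirroring the Map/Close structure of the pseudo-code (class and method names are illustrative):

```python
from collections import defaultdict

class InMapperCombiner:
    """Preserve partial term counts across documents; emit only in close()."""
    def __init__(self):
        self.counts = defaultdict(int)

    def map(self, doc_id, doc):
        for term in doc.split():
            self.counts[term] += 1   # accumulate; emit nothing yet

    def close(self):
        # deferred emission of the fully aggregated pairs
        return list(self.counts.items())

m = InMapperCombiner()
m.map("d1", "the dog")
m.map("d2", "the cat")
emitted = m.close()
```

State is preserved across both documents, so "the" leaves the mapper exactly once, with count 2.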
  • 8. ▪ The advantage of this design pattern is that we gain full control over the local aggregation ▪ In-mapper combining should be preferred over actual combiners: the framework creates and destroys combiner objects repeatedly, which adds overhead, and it is not even guaranteed to run them ▪ A combiner does reduce the amount of intermediate data shipped over the network, but the mapper still emits every key-value pair before the combiner aggregates them; with in-mapper combining, only the aggregated pairs are ever materialized
  • 9. ▪ Yes! Using an in-mapper combiner tweaks the mapper to preserve state across documents ▪ This can introduce an ORDER-DEPENDENT BUG: if the algorithm's outcome depends on the order in which key-value pairs arrive, the preserved state can silently change the result ▪ Such problems are difficult to detect when dealing with large datasets. Another disadvantage? ▪ Sufficient memory is needed to hold the partial results until all key-value pairs are processed (in our word-count example, the vocabulary may exceed what the associative array can hold) ▪ SOLUTION: flush the in-memory structure periodically by maintaining a counter. The block size at which to flush is empirical and hard to determine.
  • 10. ▪ Problem statement: compute the mean of the values for each key (say the key is an alphanumeric employee id and the value is a salary) ▪ Addition and multiplication are associative ▪ Division and subtraction are not associative
  • 11. 1. Computing Average directly in reducer (no combiner) 2. Using Combiners to reduce workload on reducer 3. Using in-memory combiner to increase efficiency of approach 2
  • 12. 1. Computing the average directly in the reducer (no combiner) 2. Using combiners to reduce the workload on the reducer: DOES NOT WORK 3. Using an in-mapper combiner to increase the efficiency of approach 2 WHY? Because the average is not associative, combiners would compute averages inside separate map tasks and send them to the reducer. The reducer would then average those averages, which leads to a wrong answer, since: AVERAGE(1,2,3,4,5) ≠ AVERAGE(AVERAGE(1,2), AVERAGE(3,4,5))
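The non-associativity is easy to check numerically (values match the inequality on the slide):

```python
def average(values):
    return sum(values) / len(values)

full = average([1, 2, 3, 4, 5])                          # true mean
# what a combiner-of-averages would compute on a 2/3 split:
nested = average([average([1, 2]), average([3, 4, 5])])  # (1.5 + 4.0) / 2
```

`full` is 3.0 but `nested` is 2.75: averaging per-split averages weights the two splits equally even though they contain different numbers of values.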
  • 13. ▪ NOTES: ▪ NO COMBINER USED ▪ THE AVERAGE IS CALCULATED IN THE REDUCER ▪ MAPPER USED: IDENTITY MAPPER ▪ This algorithm works but has some problems. Problems: 1. It requires shuffling all key-value pairs from mappers to reducers across the network 2. The reducer cannot be used as a combiner
  • 14. INCORRECT Notes: 1. Combiners used 2. Wrong, because the output of the combiner must match the output of the mapper: here the combiner emits (sum, count) pairs whereas the mapper emits plain integers 3. This violates the basic MapReduce contract between mapper, combiner, and reducer
  • 15. CORRECT Notes: A correct implementation of the combiner, since the output of the mapper matches the output of the combiner. What if I don't use a combiner? The reducer will still compute the mean correctly at the end; the combiner just acts as an intermediary that reduces the reducer's workload. Also, the output of the reducer need not match that of the combiner or mapper.
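The correct scheme can be sketched in Python: both mapper and combiner emit (sum, count) pairs, and only the reducer performs the final (non-associative) division. The split of salaries across two map tasks is illustrative:

```python
def mapper(key, salary):
    yield key, (salary, 1)          # emit (sum, count), not an average

def sum_counts(pairs):
    # identical aggregation for combiner and reducer: add sums and counts
    s = sum(p[0] for p in pairs)
    c = sum(p[1] for p in pairs)
    return s, c

# two map tasks with deliberately uneven splits of salaries 1..5
task1 = [v for k, s in [("e", 1), ("e", 2)] for _, v in mapper(k, s)]
task2 = [v for k, s in [("e", 3), ("e", 4), ("e", 5)] for _, v in mapper(k, s)]
combined1 = sum_counts(task1)       # per-task combiner output
combined2 = sum_counts(task2)
s, c = sum_counts([combined1, combined2])
mean = s / c                        # division happens once, in the reducer
```

Because (sum, count) addition is associative, the uneven split no longer matters: the reducer gets 15/5 = 3.0 either way.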
  • 16. MORE EFFICIENT THAN ALL OTHER VERSIONS Notes: ▪ Inside the mapper, the partial sums and counts associated with each string are held in memory across input key-value pairs ▪ Intermediate key-value pairs are emitted only after the entire input split has been processed ▪ The in-mapper combiner uses resources efficiently to reach the desired result ▪ The workload on the reducer is somewhat reduced ▪ WE EMIT SUM AND COUNT TO REACH THE AVERAGE, i.e. associative operations in service of a non-associative (average) result.
  • 17. ▪ The concept of stripes is to aggregate data prior to the reducers by using a combiner. There are several benefits to this, discussed below. When a mapper completes, its intermediate data sits idle when pairing until all mappers are complete. With striping, the intermediate data is passed to the combiner, which can start processing through the data like a reducer. So, instead of mappers sitting idle, they can execute the combiner until the slowest mapper finishes. ▪ Link: http://nosql.mypopescu.com/post/19286669299/mapreduce-pairs-and-stripes-explained/
  • 18.
  • 19. ▪ Input to the problem ▪ Key-value pairs in the form of a docid and a doc ▪The mapper ▪ Processes each input document ▪ Emits key-value pairs with: ▪ Each co-occurring word pair as the key ▪ The integer one (the count) as the value ▪ This is done with two nested loops: ▪ The outer loop iterates over all words ▪ The inner loop iterates over all neighbors ▪The reducer: ▪ Receives pairs relative to co-occurring words ▪ This requires modifying the partitioner ▪ Computes an absolute count of the joint event ▪ Emits the pair and the count as the final key-value output ▪ Basically reducers emit the cells of the matrix
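The two nested loops of the pairs approach can be simulated in Python (assuming, for illustration, that a word's neighbors are all the other words in the line; real implementations usually use a fixed window):

```python
from collections import defaultdict

def pairs_mapper(doc):
    words = doc.split()
    for i, w in enumerate(words):        # outer loop: each word
        for j, u in enumerate(words):    # inner loop: its neighbors
            if i != j:
                yield (w, u), 1          # key = co-occurring pair, value = 1

# simulate shuffle + reduce: sum the ones per co-occurring pair
counts = defaultdict(int)
for pair, one in pairs_mapper("a b a"):
    counts[pair] += one
```

For the line "a b a", the reducer-side sums yield the cells (a,b)=2, (b,a)=2, and (a,a)=2, one cell of the co-occurrence matrix per key.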
  • 20.
  • 21. ▪ Input to the problem ▪ Key-value pairs in the form of a docid and a doc ▪ The mapper: ▪ Same two nested loops structure as before ▪ Co-occurrence information is first stored in an associative array ▪ Emit key-value pairs with words as keys and the corresponding arrays as values ▪ The reducer: ▪ Receives all associative arrays related to the same word ▪ Performs an element-wise sum of all associative arrays with the same key ▪ Emits key-value output in the form of word, associative array ▪ Basically, reducers emit rows of the co-occurrence matrix
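A Python sketch of the stripes approach (same illustrative assumption as before: neighbors are the other words in the line). The reducer's element-wise sum reuses the numbers from the "stripes" example slide:

```python
from collections import Counter

def stripes_mapper(doc):
    words = doc.split()
    for i, w in enumerate(words):
        # one associative array (stripe) per word occurrence
        stripe = Counter(u for j, u in enumerate(words) if j != i)
        yield w, stripe

def stripes_reducer(word, stripes):
    total = Counter()
    for s in stripes:
        total.update(s)       # element-wise sum of associative arrays
    return word, total        # one row of the co-occurrence matrix

# element-wise sum, as in the worked example on the "stripes" slide
_, total = stripes_reducer("a", [Counter({"b": 1, "d": 5, "e": 3}),
                                 Counter({"b": 1, "c": 2, "d": 2, "f": 2})])
```

Summing a → {b:1, d:5, e:3} with a → {b:1, c:2, d:2, f:2} gives a → {b:2, c:2, d:7, e:3, f:2}, matching the slide.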
  • 22.
  • 23. ▪ Generates a large number of key-value pairs (also intermediate) ▪ The benefit from combiners is limited, as it is less likely for a mapper to process multiple occurrences of a word ▪ Does not suffer from memory paging problems
  • 24. ▪ More compact ▪ Generates fewer and shorter intermediate keys ▪ Can make better use of combiners ▪ The framework has less sorting to do ▪ The values are more complex and have serialization/deserialization overhead ▪ Greatly benefits from combiners, as the key space is the vocabulary ▪ Suffers from memory paging problems, if not properly engineered
  • 25. “STRIPES” ▪ Idea: group together pairs into an associative array ▪ Each mapper takes a sentence: ▪ Generate all co-occurring term pairs ▪ For each term, emit a → { b: countb, c: countc, d: countd … } ▪ Reducers perform element-wise sum of associative arrays (a, b) → 1 (a, c) → 2 (a, d) → 5 (a, e) → 3 (a, f) → 2 a → { b: 1, c: 2, d: 5, e: 3, f: 2 } a → { b: 1, d: 5, e: 3 } a → { b: 1, c: 2, d: 2, f: 2 } a → { b: 2, c: 2, d: 7, e: 3, f: 2 } +
  • 26. ▪ Combiners can be used in both pairs and stripes, but implementing combiners in stripes gives better results because of the associative array ▪ The stripes approach might run into memory problems, since it tries to fit the associative array into memory ▪ The pairs approach does not face such in-memory problems ▪ THE STRIPES APPROACH PERFORMED BETTER THAN PAIRS, BUT EACH HAS ITS OWN SIGNIFICANCE. ▪ The memory problem in stripes can be dealt with by dividing the entire vocabulary into buckets and applying stripes to individual buckets; this in turn reduces the memory required by the stripes approach.
  • 27. ORDER INVERSION ▪ A drawback of the co-occurrence matrix is that some word pairs appear together very frequently simply because one of the words is very common ▪ Solution: ▪ Convert absolute counts into relative frequencies, f(wj | wi). That is, what proportion of the time does wj appear in the context of wi? ▪ N(·, ·) is the number of times a co-occurring word pair is observed ▪ The denominator is called the marginal (the sum of the counts of the conditioning variable co-occurring with anything else)
  • 28. ▪ In the reducer, the counts of all words that co-occur with the conditioning variable (wi) are available in the associative array ▪ Hence, the sum of all those counts gives the marginal ▪ Then we divide the joint counts by the marginal and we're done
  • 29. ▪ The reducer receives the pair (wi, wj) and the count ▪ From this information alone it is not possible to compute f(wj | wi) ▪ Fortunately, just like the mapper, the reducer can preserve state across multiple keys ▪ We can buffer in memory all the words that co-occur with wi and their counts ▪ This is basically building the associative array of the stripes method We must define the sort order of the pair ▪ In this way, the keys are first sorted by the left word, and then by the right word (in the pair) ▪ Hence, we can detect when all pairs associated with the word we are conditioning on (wi) have been seen ▪ At this point, we can use the in-memory buffer, compute the relative frequencies, and emit We must ensure that all pairs with the same left word are sent to the same reducer. This does not happen automatically, so we use a custom partitioner to achieve it.
  • 30.
  • 31. ▪ Emit a special key-value pair to capture the marginal ▪ Control the sort order of the intermediate key, so that the special key-value pair is processed first ▪ Define a custom partitioner for routing intermediate key-value pairs ▪ Preserve state across multiple keys in the reducer
  • 32. ▪ The order inversion pattern is a nice trick that lets a reducer see intermediate results before it processes the data that generated them. ▪ We illustrate this with the example of computing relative frequencies for co-occurring word pairs, e.g. what are the relative frequencies of words occurring within a small window of the word "dog"? The mapper counts word pairs in the corpus, so its output looks like: ((dog, cat), 125) ((dog, foot), 246) ▪ But it also keeps a running total of all the word pairs containing "dog", outputting this as ((dog, *), 5348) ▪ Using a suitable partitioner, so that all (dog, ...) pairs get sent to the same reducer, and choosing the "*" token so that it occurs before any word in the sort order, the reducer sees the total ((dog, *), 5348) first, followed by all the other counts, and can trivially store the total and then output relative frequencies. The benefit of the pattern is that it avoids an extra MapReduce iteration without creating any additional scalability bottleneck.
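The reducer side of order inversion can be sketched in Python. The input list below stands in for what the reducer receives after the custom sort (the "*" marginal first) and the custom partitioner (all (w, ·) pairs at one reducer); the counts are illustrative, not from a real corpus:

```python
def relative_frequencies(sorted_pairs):
    # the custom sort order guarantees (w, '*') arrives before any (w, u);
    # a custom partitioner (assumed) routes all (w, .) to this reducer
    marginal = {}
    out = {}
    for (w, u), count in sorted_pairs:
        if u == "*":
            marginal[w] = count              # marginal seen first: store it
        else:
            out[(w, u)] = count / marginal[w]  # joint / marginal
    return out

freqs = relative_frequencies([
    (("dog", "*"), 4),      # marginal total for "dog"
    (("dog", "cat"), 1),
    (("dog", "foot"), 3),
])
```

With a marginal of 4, the reducer emits f(cat | dog) = 0.25 and f(foot | dog) = 0.75 in a single pass, with no buffering of co-occurring words.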
  • 33. ▪ Input to reducers is sorted by key ▪ Values are arbitrarily ordered ▪ We may want the reducer values ordered, either ascending or descending. ▪ Solution: ▪ Buffer the reducer values in memory and sort ▪ Disadvantage: if the data is too large it may not fit in memory; it also creates unnecessary objects on the heap ▪ Use the secondary sort design pattern in MapReduce ▪ Uses the shuffle-and-sort machinery ▪ Reducer values arrive sorted
  • 34. ▪ Secondary key sorting is done by creating a composite key
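The idea can be simulated in a few lines of Python (the records are illustrative; in real Hadoop one also supplies a custom partitioner and grouping comparator so that records are still partitioned and grouped on the natural key alone):

```python
# composite key = (natural key, value); the framework's shuffle sorts on
# the full composite key, so each reducer sees its values already ordered
records = [("emp1", 50), ("emp2", 30), ("emp1", 20), ("emp2", 40)]
shuffled = sorted(records)   # tuples sort lexicographically: key, then value
```

After the sort, every group's values arrive ascending, with no in-memory buffering in the reducer: emp1 sees 20 then 50, and emp2 sees 30 then 40.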