MapReduce Concepts
Hadoop MapReduce
•Hadoop MapReduce is a
–Software framework
–For easily writing applications
–Which process vast amounts of data (multi-terabyte data-sets)
–In-parallel on large clusters (thousands of nodes) of commodity hardware
–In a reliable, fault-tolerant manner.
•A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
•The framework sorts the outputs of the maps, which are then input to the reduce tasks.
•Typically both the input and the output of the job are stored in a file-system.
•The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
JobTracker and TaskTracker
•The MapReduce framework consists of a
–single master JobTracker and
–one slave TaskTracker per cluster-node.
•The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.
•The slaves execute the tasks as directed by the master.
JobTracker and TaskTracker
Job Specification
•Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes.
•These, and other job parameters, comprise the job configuration.
•The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
•Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in Java.
–Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
–Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non-JNI based).
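For example, a streaming job can use standard Unix utilities as the mapper and reducer; a typical invocation (with the jar path abbreviated) looks like: hadoop jar hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /usr/bin/wc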
Input Output
•The MapReduce framework operates exclusively on <key, value> pairs.
•That is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
•The key and value classes have to be serializable by the framework and hence need to implement the Writable interface.
•Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
•Input and Output types of a MapReduce job:
–(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
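For instance, in the classic word count job this pipeline is instantiated as: (input) <line offset, line text> -> map -> <word, 1> -> combine -> <word, partial sum> -> reduce -> <word, total count> (output).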
Hadoop Map/Reduce Architecture
[Architecture diagram] The Client provides a UI for submitting jobs and polls status information. The JobTracker (master) accepts MR jobs, assigns tasks to slaves, monitors tasks, and handles failures. The TaskTrackers (slaves) run the Map and Reduce tasks and manage intermediate output. Each Task runs the Map and Reduce functions and reports progress.
Hadoop HDFS + MR cluster - putting them together
[Cluster diagram] The Client submits the job to the JobTracker and monitors it through an HTTP monitoring UI; block locations are obtained from the Namenode. The JobTracker schedules tasks onto machines that run both Datanodes (D) and Tasktrackers (T), so computation can be placed next to the data.
Locality optimizations
–With large data, bandwidth to data is a problem
–Map tasks are scheduled close to the inputs when possible
Automatic re-execution on failure
Word Count Data Flow
Ways of using Hadoop 
•Map-Reduce Java API 
–http://hadoop.apache.org/mapreduce/ 
•Streaming 
–http://hadoop.apache.org/common/docs/current/mapred_tutorial.html 
•Pipes 
–http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html
•Pig 
–http://hadoop.apache.org/pig/ 
•Hive 
–http://hive.apache.org
Example: Compute TF-IDF using Map/Reduce 
•TF-IDF (Term Frequency, Inverse Document Frequency) 
•Is a basic technique 
•To compute the relevancy of a document with respect to a particular term 
•"Term" is a generalized element contains within a document. 
•A "term" is a generalized idea of what a document contains. 
•E.g. a term can be a word, a phrase, or a concept.
TF-IDF 
•Intuitively, the relevancy of a document to a term can be calculated
•from how frequently that term shows up in the document,
•i.e.: the count of the term in that document divided by the total number of terms in it.
•We call this the "term frequency".
TF-IDF 
•On the other hand,
•if this is a very common term which appears in many other documents,
•then its relevancy should be reduced,
•i.e.: penalized using the count of documents having this term divided by the total number of documents.
•We call this the "document frequency".
TF-IDF 
•The overall relevancy of a document with respect to a term can be computed using both the term frequency and document frequency. 
•relevancy = tf-idf = term frequency * log (1 / document frequency)
•This is called tf-idf.
•A "document" can be considered as a multi-dimensional vector where each dimension represents a term with the tf-idf as its value.
Example: Compute TF-IDF using Map/Reduce
Basic MapReduce Patterns 
1.Counting and Summing 
2.Collating 
3.Filtering (“Grepping”), Parsing, and Validation 
4.Distributed Task Execution 
5.Sorting 
6.Iterative Message Passing (Graph Processing) 
7.Distinct Values (Unique Items Counting) 
8.Cross-Correlation 
9.Selection 
10.Projection 
11.Union 
12.Intersection 
13.Difference 
14.GroupBy and Aggregation
15.Joining
16.Time-series: Moving Average
Counting and Summing 
•Problem Statement: 
–There are a number of documents where each document is a set of terms.
–It is required to calculate the total number of occurrences of each term across all documents.
–Alternatively, it can be an arbitrary function of the terms. 
–For instance, there is a log file where each record contains a response time and it is required to calculate an average response time. 
•Applications: Log Analysis, Data Querying
Counting and Summing
Solution:
class Mapper
  method Map(docid id, doc d)
    for all term t in doc d do
      Emit(term t, count 1)

class Combiner
  method Combine(term t, [c1, c2,...])
    sum = 0
    for all count c in [c1, c2,...] do
      sum = sum + c
    Emit(term t, count sum)

class Reducer
  method Reduce(term t, counts [c1, c2,...])
    sum = 0
    for all count c in [c1, c2,...] do
      sum = sum + c
    Emit(term t, count sum)
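For concreteness, here is how this pattern looks as a runnable Hadoop Java job. This is essentially the classic word count example; the class names and combiner choice follow the standard Hadoop tutorial rather than anything specific to this deck.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts for each word; also usable as the Combiner.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // map-side partial sums
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the Reducer as the Combiner is valid here because summation is associative and commutative, so partial sums computed on the map side do not change the final result.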
Collating 
•Problem Statement: 
–There is a set of items and some function of one item.
–It is required to save all items that have the same value of this function into one file, or to perform some other computation that requires all such items to be processed as a group.
–The most typical example is the building of inverted indexes.
•Applications: Inverted Indexes, ETL
Collating
Solution:
Mapper computes a given function for each item and emits the value of the function as a key and the item itself as a value.

class Mapper
  method Map(docid id, doc d)
    for all term t in doc d do
      Emit(term f(t), t)

Reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the emitted keys are terms (words) and the values are the IDs of the documents where the term was found.

class Reducer
  method Reduce(term f(t), [t1, t2,...])
    //process or save the grouped items
    Emit(term f(t), [t1, t2,...])
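For instance, to build an inverted index over documents d1 = "a b" and d2 = "b", the pairs emitted are (a, d1), (b, d1), (b, d2); the Reducer for key b receives [d1, d2] and saves the posting list b -> [d1, d2].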
Filtering (“Grepping”), Parsing, and Validation 
•Problem Statement: 
–There is a set of records and it is required to collect all records that meet some condition, or to transform each record (independently from other records) into another representation.
–The latter case includes such tasks as text parsing and value extraction, and conversion from one format to another.
•Applications: Log Analysis, Data Querying, ETL, Data Validation
Filtering (“Grepping”), Parsing, and Validation 
Solution: 
Mapper takes records one by one and emits accepted items or their transformed versions.
Distributed Task Execution 
•Problem Statement: 
–There is a large computational problem that can be divided into multiple parts, and results from all parts can be combined together to obtain a final result.
•Applications: Physical and Engineering Simulations, Numerical Analysis, Performance Testing
Distributed Task Execution 
Solution:
The problem description is split into a set of specifications, and the specifications are stored as input data for Mappers.
Each Mapper takes a specification, performs the corresponding computations and emits results.
The Reducer combines all emitted parts into the final result.

Case Study: Simulation of a Digital Communication System
There is a software simulator of a digital communication system (like WiMAX) that passes some volume of random data through the system model and computes the error probability of the throughput.
Each Mapper runs the simulation for a specified amount of data, which is 1/Nth of the required sample, and emits its error rate.
The Reducer computes the average error rate.
Sorting 
•Problem Statement: 
–There is a set of records and it is required to sort these records by some rule or process these records in a certain order. 
•Applications: ETL, Data Analysis
Sorting
Solution:
Mappers just emit all items as values associated with sorting keys that are assembled as a function of the items.
Nevertheless, in practice sorting is often used in quite tricky ways; that's why it is said to be the heart of MapReduce (and Hadoop).
In particular, it is very common to use composite keys to achieve secondary sorting and grouping (see the sketch below).
Sorting in MapReduce is originally intended for sorting the emitted key-value pairs by key, but there exist techniques that leverage Hadoop implementation specifics to achieve sorting by values.
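As a sketch of the composite-key idea (the class and field names below are illustrative assumptions, not part of the original deck), a custom key implements Hadoop's WritableComparable so the framework can serialize it and order it during the sort, comparing first by the natural grouping key and then by the secondary field:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class CompositeKey implements WritableComparable<CompositeKey> {
  private String group; // natural (grouping) key
  private long order;   // secondary sort field

  public CompositeKey() {} // required by Hadoop for deserialization

  public CompositeKey(String group, long order) {
    this.group = group;
    this.order = order;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(group);
    out.writeLong(order);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    group = in.readUTF();
    order = in.readLong();
  }

  @Override
  public int compareTo(CompositeKey other) {
    int cmp = group.compareTo(other.group); // sort by group first
    if (cmp != 0) return cmp;
    return Long.compare(order, other.order); // then by the secondary field
  }
}

To complete the secondary-sort pattern one also supplies a Partitioner and a grouping comparator that look only at the group field, so all records of a group reach the same reduce call while arriving sorted by the secondary field.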
Iterative Message Passing (Graph Processing) 
•Problem Statement: 
–There is a network of entities and relationships between them. 
–It is required to calculate a state of each entity on the basis of properties of the other entities in its neighborhood. 
–This state can represent a distance to other nodes, an indication that there is a neighbor with certain properties, a characteristic of neighborhood density, and so on.
•Applications: Graph Analysis, Web Indexing
Iterative Message Passing (Graph Processing) 
Solution:
A network is stored as a set of nodes, and each node contains a list of adjacent node IDs.
Conceptually, MapReduce jobs are performed in an iterative way, and at each iteration each node sends messages to its neighbors.
Each neighbor updates its state on the basis of the received messages.
Iterations are terminated by some condition like a fixed maximal number of iterations (say, the network diameter) or negligible changes in states between two consecutive iterations.
From the technical point of view, the Mapper emits messages for each node using the ID of the adjacent node as a key.
As a result, all messages are grouped by the receiving node, and the Reducer is able to re-compute the state and rewrite the node with its new state.
Iterative Message Passing (Graph Processing)
Solution:
class Mapper
  method Map(id n, object N)
    Emit(id n, object N)
    for all id m in N.OutgoingRelations do
      Emit(id m, message getMessage(N))

class Reducer
  method Reduce(id m, [s1, s2,...])
    object M = null
    messages = []
    for all s in [s1, s2,...] do
      if IsObject(s) then
        M = s
      else // s is a message
        messages.add(s)
    M.State = calculateState(messages)
    Emit(id m, object M)
Case Study: Availability Propagation Through The Tree of Categories 
•Problem Statement: 
–This problem is inspired by a real-life eCommerce task.
–There is a tree of categories that branches out from large categories (like Men, Women, Kids) to smaller ones (like Men Jeans or Women Dresses), and eventually to small end-of-line categories (like Men Blue Jeans).
–An end-of-line category is either available (contains products) or not.
–A higher-level category is available if there is at least one available end-of-line category in its subtree.
–The goal is to calculate availabilities for all categories, given that the availabilities of end-of-line categories are known.
Case Study: Availability Propagation Through The Tree of Categories
class N
  State in {True = 2, False = 1, null = 0} // initialized to 1 or 2 for end-of-line categories, 0 otherwise

method getMessage(object N)
  return N.State

method calculateState(state s, data [d1, d2,...])
  return max( [d1, d2,...] )
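For example (an illustrative tree, not from the original deck): if Men Blue Jeans is available (state 2) and Men Black Jeans is empty (state 1), their parent Men Jeans receives the messages [2, 1] and computes max(2, 1) = 2, i.e. available; the result then propagates one level further up the tree per iteration.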
Case Study: Breadth-First Search 
•Problem Statement: There is a graph and it is required to calculate the distance (number of hops) from one source node to all other nodes in the graph.
Case Study: Breadth-First Search
class N
  State is distance, initialized to 0 for the source node and INFINITY for all other nodes

method getMessage(N)
  return N.State + 1

method calculateState(state s, data [d1, d2,...])
  return min( s, min( [d1, d2,...] ) ) // keep the best distance seen so far
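To trace this on a small path graph A - B - C with source A: initially A = 0 and B = C = INFINITY. In the first iteration A sends the message 0 + 1 to B, so B's state becomes min(INFINITY, 1) = 1; in the second iteration B sends 1 + 1 to C, whose state becomes 2. The number of iterations required equals the distance to the farthest reachable node (the graph diameter in the worst case).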
Distinct Values (Unique Items Counting) 
•Problem Statement: 
–There is a set of records that contain fields F and G. Count the total number of unique values of field F for each subset of records that have the same G (grouped by G).
–The problem can be generalized and formulated in terms of faceted search: There is a set of records. Each record has a field F and an arbitrary number of category labels G = {G1, G2, …}. Count the total number of unique values of field F for each subset of records for each value of any label.
•Example Applications: Log Analysis, Unique Users Counting 
Record 1: F=1, G={a, b} 
Record 2: F=2, G={a, d, e} 
Record 3: F=1, G={b} 
Record 4: F=3, G={a, b} 
Result: 
a -> 3 // F=1, F=2, F=3 
b -> 2 // F=1, F=3 
d -> 1 // F=2 
e -> 1 // F=2
Distinct Values (Unique Items Counting)
Solution:
PHASE 1
class Mapper
  method Map(null, record [value f, categories [g1, g2,...]])
    for all category g in [g1, g2,...]
      Emit(record [g, f], count 1)

class Reducer
  method Reduce(record [g, f], counts [n1, n2, ...])
    Emit(record [g, f], null)

PHASE 2
class Mapper
  method Map(record [g, f], null)
    Emit(value g, count 1)

class Reducer
  method Reduce(value g, counts [n1, n2,...])
    Emit(value g, sum( [n1, n2,...] ))
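Tracing the example records above: Phase 1 deduplicates the (label, value) pairs, so its output contains each of [a,1], [a,2], [a,3], [b,1], [b,3], [d,2], [e,2] exactly once; Phase 2 then counts pairs per label, yielding a -> 3, b -> 2, d -> 1, e -> 1.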
Cross-Correlation 
•Problem Statement:
–There is a set of tuples of items.
–For each possible pair of items, calculate the number of tuples where these items co-occur.
–If the total number of items is N, then N*N values should be reported.
–This problem appears in:
•text analysis (say, items are words and tuples are sentences),
•market analysis (customers who buy this tend to also buy that).
–If N*N is small, the matrix can fit in the memory of a single machine.
Market Basket Analysis
•Suppose a store sells N different products and has data on B customer purchases (each purchase is called a "market basket")
•An association rule is expressed as "If x is sold, then y is sold":
–x → y
•"Support" for x → y: Joint probability P(x ^ y)
–The probability of finding x and y together in a random basket
•"Confidence" in x → y: Conditional probability P(y | x)
–The probability of finding y in a basket if the basket already contains x
•The goal is finding rules with high support and high confidence
–How are they computed?
–Cooccur[x, x] represents the number of baskets that contain x
–Cooccur[x, y] represents the number of baskets that contain x & y
–Cooccur[x, y] == Cooccur[y, x]
–Support: calculated as Cooccur[x, y] / B
–Confidence: calculated as Cooccur[x, y] / Cooccur[x, x]
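A quick illustrative calculation (the numbers are made up): with B = 100 baskets, Cooccur[x, x] = 40 and Cooccur[x, y] = 20, the rule x → y has support 20/100 = 0.2 and confidence 20/40 = 0.5.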
Cross-Correlation
Solution (the "pairs" approach):
class Mapper
  method Map(null, basket = [i1, i2,...])
    for all item i in basket
      for all item j in basket
        Emit(pair [i, j], count 1)

class Reducer
  method Reduce(pair [i, j], counts [c1, c2,...])
    s = sum( [c1, c2,...] )
    Emit(pair [i, j], count s)
Cross-Correlation
Solution (the "stripes" approach):
class Mapper
  method Map(null, basket = [i1, i2,...])
    for all item i in basket
      H = new AssociativeArray: item -> counter
      for all item j in basket
        H{j} = H{j} + 1
      Emit(item i, stripe H)

class Reducer
  method Reduce(item i, stripes [H1, H2,...])
    H = new AssociativeArray: item -> counter
    H = merge-sum( [H1, H2,...] )
    for all item j in H.keys()
      Emit(pair [i, j], H{j})
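The trade-off between the two solutions: the pairs approach emits one record per co-occurring pair and therefore generates far more intermediate data, while the stripes approach emits one associative array per item, which compresses the intermediate traffic and benefits more from combiners, at the cost of holding a whole stripe (up to N counters) in memory.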
Selection
Solution:
class Mapper
  method Map(rowkey key, tuple t)
    if t satisfies the predicate
      Emit(tuple t, null)
Projection
Solution:
class Mapper
  method Map(rowkey key, tuple t)
    tuple g = project(t) // extract required fields into tuple g
    Emit(tuple g, null)

class Reducer
  method Reduce(tuple t, array n) // n is an array of nulls
    Emit(tuple t, null)
Union
Solution:
class Mapper
  method Map(rowkey key, tuple t)
    Emit(tuple t, null)

class Reducer
  method Reduce(tuple t, array n) // n is an array of one or two nulls
    Emit(tuple t, null)
Intersection
Solution:
class Mapper
  method Map(rowkey key, tuple t)
    Emit(tuple t, null)

class Reducer
  method Reduce(tuple t, array n) // n is an array of one or more nulls
    if n.size() >= 2
      Emit(tuple t, null)
Difference
Solution:
// We want to compute the difference R - S.
// Mapper emits all tuples with a tag, which is the name of the set this record came from.
// Reducer emits only records that came from R but not from S.
class Mapper
  method Map(rowkey key, tuple t)
    Emit(tuple t, string t.SetName) // t.SetName is either 'R' or 'S'

class Reducer
  method Reduce(tuple t, array n) // array n can be ['R'], ['S'], ['R', 'S'], or ['S', 'R']
    if n.size() = 1 and n[1] = 'R'
      Emit(tuple t, null)
Join
•A JOIN is a means for combining fields from two tables by using values common to each.
•ANSI standard SQL specifies four types of JOIN: INNER, OUTER, LEFT, and RIGHT.
•As a special case, a table (base table, view, or joined table) can JOIN to itself in a self-join.
•Join algorithms
–Nested loop
–Sort-Merge
–Hash

SELECT *
FROM employee
INNER JOIN department ON employee.DepartmentID = department.DepartmentID;

SELECT *
FROM employee
JOIN department ON employee.DepartmentID = department.DepartmentID;

SELECT *
FROM employee
LEFT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID;
Nested loop
for each tuple r in R do
  for each tuple s in S do
    if r and s satisfy the join condition
      then output the tuple <r, s>
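Nested loop performs |R| * |S| comparisons, so it is practical only when at least one relation is small; the sort-merge and hash algorithms below avoid this quadratic cost.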
Sort-Merge
p in P; q in Q; g in Q // cursors over the two sorted inputs
while more tuples in inputs do
  while p.a < g.b do
    advance p
  end while
  while p.a > g.b do
    advance g // a group might begin here
  end while
  while p.a == g.b do
    q = g // mark group beginning
    while p.a == q.b do
      add <p, q> to the result
      advance q
    end while
    advance p // move forward
  end while
  g = q // candidate to begin next group
end while
Sort-Merge : MapReduce
•This algorithm joins two sets R and L on some key k.
•The Mapper goes through all tuples from R and L, extracts key k from each tuple, marks the tuple with a tag that indicates which set it came from ('R' or 'L'), and emits the tagged tuple using k as a key.
•The Reducer receives all tuples for a particular key k and puts them into two buckets: one for R and one for L.
•When the two buckets are filled, the Reducer runs a nested loop over them and emits a cross join of the buckets.
•Each emitted tuple is a concatenation of an R-tuple, an L-tuple, and the key k.
•This approach has the following disadvantages:
–The Mapper emits absolutely all data, even for keys that occur only in one set and have no pair in the other.
–The Reducer should hold all data for one key in memory. If the data doesn't fit in memory, it is the Reducer's responsibility to handle this by some kind of swap.
Sort-Merge : MapReduce Pseudocode
Solution:
class Mapper
  method Map(null, tuple [join_key k, value v1, value v2,...])
    Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2, ...] ])

class Reducer
  method Reduce(join_key k, tagged_tuples [t1, t2,...])
    H = new AssociativeArray: set_name -> values
    for all tagged_tuple t in [t1, t2,...] // separate values into 2 arrays
      H{t.tag}.add(t.values)
    for all values r in H{'R'} // produce a cross-join of the two arrays
      for all values l in H{'L'}
        Emit(null, [k, r, l])
Hash Join
•The Hash Join algorithm consists of a 'build' phase and a 'probe' phase.
•In its simplest variant, the smaller dataset is loaded into an in-memory hash table in the build phase.
•In the 'probe' phase, the larger dataset is scanned and joined with the relevant tuple(s) by looking into the hash table.

for all p in P do
  load p into in-memory hash table H
end for
for all q in Q do
  if H contains p matching with q then
    add <p, q> to the result
  end if
end for
Hash Join : MapReduce Pseudocode
•Let's assume that we join two sets, R and L, and R is relatively small. If so, R can be distributed to all Mappers, and each Mapper can load it and index it by the join key.
class Mapper
  method Initialize
    H = new AssociativeArray: join_key -> tuple from R
    R = loadR()
    for all [ join_key k, tuple [r1, r2,...] ] in R
      H{k} = H{k}.append( [r1, r2,...] )

  method Map(join_key k, tuple l)
    for all tuple r in H{k}
      Emit(null, tuple [k, r, l])
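In Hadoop, distributing the small set R to every Mapper is typically done through the distributed cache (for example, adding the file with Job.addCacheFile and reading it in the Mapper's setup method); this map-only variant is commonly called a map-side (or replicated/broadcast) join.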
End of session
Day 2: MapReduce Concepts

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
Hadoop Map Reduce Arch
Hadoop Map Reduce ArchHadoop Map Reduce Arch
Hadoop Map Reduce Arch
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Hadoop job chaining
Hadoop job chainingHadoop job chaining
Hadoop job chaining
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Topic 6: MapReduce Applications
Topic 6: MapReduce ApplicationsTopic 6: MapReduce Applications
Topic 6: MapReduce Applications
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Scalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduceScalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 

Similar a Hadoop map reduce concepts

Intro to Data Structure & Algorithms
Intro to Data Structure & AlgorithmsIntro to Data Structure & Algorithms
Intro to Data Structure & AlgorithmsAkhil Kaushik
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfBikalAdhikari4
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionSubhas Kumar Ghosh
 
FALLSEM2022-23_BCSE202L_TH_VL2022230103292_Reference_Material_I_25-07-2022_Fu...
FALLSEM2022-23_BCSE202L_TH_VL2022230103292_Reference_Material_I_25-07-2022_Fu...FALLSEM2022-23_BCSE202L_TH_VL2022230103292_Reference_Material_I_25-07-2022_Fu...
FALLSEM2022-23_BCSE202L_TH_VL2022230103292_Reference_Material_I_25-07-2022_Fu...AntareepMajumder
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.pptArumugam90
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptRubenGabrielHernande
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduceShrihari Rathod
 
Design and Analysis of Algorithms.pptx
Design and Analysis of Algorithms.pptxDesign and Analysis of Algorithms.pptx
Design and Analysis of Algorithms.pptxSyed Zaid Irshad
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigKhanKhaja1
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfTSANKARARAO
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 

Similar a Hadoop map reduce concepts (20)

Intro to Data Structure & Algorithms
Intro to Data Structure & AlgorithmsIntro to Data Structure & Algorithms
Intro to Data Structure & Algorithms
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
FALLSEM2022-23_BCSE202L_TH_VL2022230103292_Reference_Material_I_25-07-2022_Fu...
FALLSEM2022-23_BCSE202L_TH_VL2022230103292_Reference_Material_I_25-07-2022_Fu...FALLSEM2022-23_BCSE202L_TH_VL2022230103292_Reference_Material_I_25-07-2022_Fu...
FALLSEM2022-23_BCSE202L_TH_VL2022230103292_Reference_Material_I_25-07-2022_Fu...
 
try
trytry
try
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
Unit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdfUnit-2 Hadoop Framework.pdf
Unit-2 Hadoop Framework.pdf
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.ppt
 
SecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.pptSecondPresentationDesigning_Parallel_Programs.ppt
SecondPresentationDesigning_Parallel_Programs.ppt
 
Join Algorithms in MapReduce
Join Algorithms in MapReduceJoin Algorithms in MapReduce
Join Algorithms in MapReduce
 
Unit 2
Unit 2Unit 2
Unit 2
 
Design and Analysis of Algorithms.pptx
Design and Analysis of Algorithms.pptxDesign and Analysis of Algorithms.pptx
Design and Analysis of Algorithms.pptx
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 

Más de Subhas Kumar Ghosh

07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descentSubhas Kumar Ghosh
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clusteringSubhas Kumar Ghosh
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hiveSubhas Kumar Ghosh
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)Subhas Kumar Ghosh
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysisSubhas Kumar Ghosh
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorSubhas Kumar Ghosh
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepSubhas Kumar Ghosh
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configurationSubhas Kumar Ghosh
 

Más de Subhas Kumar Ghosh (20)

07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent07 logistic regression and stochastic gradient descent
07 logistic regression and stochastic gradient descent
 
06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering06 how to write a map reduce version of k-means clustering
06 how to write a map reduce version of k-means clustering
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
01 hbase
01 hbase01 hbase
01 hbase
 
06 pig etl features
06 pig etl features06 pig etl features
06 pig etl features
 
05 pig user defined functions (udfs)
05 pig user defined functions (udfs)05 pig user defined functions (udfs)
05 pig user defined functions (udfs)
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis02 naive bays classifier and sentiment analysis
02 naive bays classifier and sentiment analysis
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Hadoop Day 3
Hadoop Day 3Hadoop Day 3
Hadoop Day 3
 
Hadoop map reduce v2
Hadoop map reduce v2Hadoop map reduce v2
Hadoop map reduce v2
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 
Hadoop availability
Hadoop availabilityHadoop availability
Hadoop availability
 
Hadoop scheduler
Hadoop schedulerHadoop scheduler
Hadoop scheduler
 
Hadoop data management
Hadoop data managementHadoop data management
Hadoop data management
 
02 Hadoop deployment and configuration
02 Hadoop deployment and configuration02 Hadoop deployment and configuration
02 Hadoop deployment and configuration
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 

Último

Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 

Último (20)

Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 

Hadoop map reduce concepts

  • 2. HadoopMapReduce •HadoopMapReduceis a –Software framework –For easily writing applications –Which process vast amounts of data (multi-terabyte data-sets) –In-parallel on large clusters (thousands of nodes) of commodity hardware –In a reliable, fault-tolerant manner. •A MapReducejobusually splits the input data-set into independent chunks which are processed by themap tasksin a completely parallel manner. •The framework sorts the outputs of the maps, which are then input to thereduce tasks. •Typically both the input and the output of the job are stored in a file-system. •The framework takes care of scheduling tasks, monitoring them and re- executes the failed tasks.
  • 3. JobTrackerand TaskTracker •The MapReduceframework consists of a –single masterJobTrackerand –one slaveTaskTrackerper cluster-node. •The master is responsible for scheduling the jobs' component -tasks on the slaves, monitoring them and re-executing the failed tasks. •The slaves execute the tasks as directed by the master.
  • 5. Job Specification •Minimally, applications specify the input/output locations and supplymapandreducefunctions via implementations of appropriate interfaces and/or abstract-classes. •These, and other job parameters, comprise thejob configuration. •The Hadoopjob clientthen submits the job (jar/executable etc.) and configuration to theJobTrackerwhich then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job- client. •Although the Hadoopframework is implemented in JavaTM, MapReduceapplications need not be written in Java. –HadoopStreamingis a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapperand/or the reducer. –HadoopPipesis aSWIG-compatibleC++ APIto implement MapReduceapplications (non JNITMbased).
  • 6. Input Output •The MapReduceframework operates exclusively on<key, value>pairs. •That is, the framework views the input to the job as a set of<key, value>pairs and produces a set of<key, value>pairs as the output of the job, conceivably of different types. •Thekeyandvalueclasses have to be serializableby the framework and hence need to implement theWritableinterface. •Additionally, thekeyclasses have to implement theWritableComparableinterface to facilitate sorting by the framework. •Input and Output types of a MapReducejob: –(input)<k1, v1>->map-><k2, v2>->combine-><k2, v2>->reduce- ><k3, v3>(output)
  • 7. TaskTracker JobTracker-Master Client TaskTracker TaskTracker - Slaves Run Map and Reduce task Manage intermediate output UI for submitting jobs Polls status information Accepts MR jobs Assigns tasks to slaves Monitors tasks Handles failures Task Run the Map and Reduce functions Report progress HadoopMap/Reduce Architecture
  • 8. D D D D T T JobTracker Namenode Machines with Datanodesand Tasktrackers T T T D Client Submit Job HTTP Monitoring UI Get Block Locations Locality optimizations –With large data, bandwidth to data is a problem –Map tasks are scheduled close to the inputs when possible Automatic re-execution on failure ←. tasks HadoopHDFS + MR cluster -putting them together
  • 10. Ways of using Hadoop •Map-Reduce Java API –http://hadoop.apache.org/mapreduce/ •Streaming –http://hadoop.apache.org/common/docs/current/mapred_tutorial.html •Pipes –http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/pipes/package- summary.html •Pig –http://hadoop.apache.org/pig/ •Hive –http://hive.apache.org
  • 11. Example: Compute TF-IDF using Map/Reduce •TF-IDF (Term Frequency, Inverse Document Frequency) •Is a basic technique •To compute the relevancy of a document with respect to a particular term •"Term" is a generalized element contains within a document. •A "term" is a generalized idea of what a document contains. •E.g. a term can be a word, a phrase, or a concept.
  • 12. TF-IDF •Intuitively, the relevancy of a document to a term can be calculated •from the percentage of that term shows up in the document •i.e.: the count of the term in that document divide by the total number of terms in it. •We called this the "term frequency“
  • 13. TF-IDF •On the other hand, •If this is a very common term which appears in many other documents, •Then its relevancy should be reduced. •i.e.: the count of documents having this term divided by total number of documents. •We called this the "document frequency“
  • 14. TF-IDF •The overall relevancy of a document with respect to a term can be computed using both the term frequency and document frequency. •relevancy = tf-idf= term frequency * log (1 / document frequency) •This is called tf-idf. •A "document" can be considered as a multi-dimensional vector where each dimension represents a term with the tf-idfas its value.
  • 15. Example: Compute TF-IDF using Map/Reduce
  • 16. Basic MapReduce Patterns 1.Counting and Summing 2.Collating 3.Filtering (“Grepping”), Parsing, and Validation 4.Distributed Task Execution 5.Sorting 6.Iterative Message Passing (Graph Processing) 7.Distinct Values (Unique Items Counting) 8.Cross-Correlation 9.Selection 10.Projection 11.Union 12.Intersection 13.Difference 14.GroupByand Aggregation 15.Joining 16.Time-series –moving average
  • 17. Counting and Summing •Problem Statement: –There is a number of documents where each document is a set of terms. –It is required to calculate a total number ofoccurrences of each term in all documents. –Alternatively, it can be an arbitrary function of the terms. –For instance, there is a log file where each record contains a response time and it is required to calculate an average response time. •Applications: Log Analysis, Data Querying
  • 18. Counting and Summing Solution: class Mapper method Map(docidid, doc d) for all term t in doc d do Emit(term t, count 1) class Combiner method Combine(term t, [c1, c2,...]) sum = 0 for all count c in [c1, c2,...] do sum = sum + c Emit(term t, count sum) class Reducer method Reduce(term t, counts [c1, c2,...]) sum = 0 for all count c in [c1, c2,...] do sum = sum + c Emit(term t, count sum)
  • 19. Collating •Problem Statement: –There is a set of items and some function of one item. –It is required to save all items that have the same value of function into one file or perform some other computation that requires all such items to be processed as a group. –The most typical example is building of inverted indexes. •Applications: Inverted Indexes, ETL
  • 20. Collating Solution: Mappercomputes a given function for each item and emits value of the function as a key and item itself as a value. class Mapper method Map(docidid, doc d) for all term t in doc d do Emit(term f(t), t) Reducer obtains all items grouped by function value and process or save them. In case of inverted indexes, items are terms (words) and function is a document ID where the term was found. class Reducer method Reduce(term f(t), [t1, t2,...]) //process or save Emit(term t, count sum)
  • 21. Filtering (“Grepping”), Parsing, and Validation •Problem Statement: –There is a set of records and it is required to collect all records that meet some condition or transform each record (independently from other records) into another representation. –The later case includes such tasks as text parsing and value extraction, conversion from one format to another. •Applications: Log Analysis, Data Querying, ETL, Data Validation
  • 22. Filtering (“Grepping”), Parsing, and Validation Solution: Mappertakes records one by one and emits accepted items or their transformed versions
  • 23. Distributed Task Execution •Problem Statement: –There is a large computational problem that can bedivided into multiple parts and results from all parts can be combined together to obtain a final result. •Applications: Physical and Engineering Simulations, Numerical Analysis, Performance Testing
  • 24. Distributed Task Execution Solution: Problem description is split in a set of specifications and specifications are stored as input data for Mappers. Each Mappertakes a specification, performs corresponding computations and emits results. Reducer combines all emitted parts into the final result. Case Study: Simulation of a Digital Communication System There is a software simulator of a digital communication system like WiMAXthat passes some volume of random data through the system model and computes error probability of throughput. Each Mapperruns simulation for specified amount of data which is 1/Nth of the required sampling and emit error rate. Reducer computes average error rate.
  • 25. Sorting •Problem Statement: –There is a set of records and it is required to sort these records by some rule or process these records in a certain order. •Applications: ETL, Data Analysis
  • 26. Sorting Solution: Mappersjust emit all items as values associated with the sorting keys that are assembled as function of items. Nevertheless, in practice sorting is often used in a quite tricky way, that’s why it is said to be a heart of MapReduce (and Hadoop). In particular, it is very common to use composite keys to achieve secondary sorting and grouping. Sorting in MapReduce is originally intended for sorting of the emitted key-value pairs by key, but there exist techniques that leverage Hadoopimplementationspecifics to achieve sorting by values.
  • 27. Iterative Message Passing (Graph Processing) •Problem Statement: –There is a network of entities and relationships between them. –It is required to calculate a state of each entity on the basis of properties of the other entities in its neighborhood. –This state can represent a distance to other nodes, indication that there is a neighbor with the certain properties, characteristic of neighborhood density and so on. •Applications: Graph Analysis, Web Indexing
  • 28. Iterative Message Passing (Graph Processing) Solution: A network is stored as a set of nodes and each node contains a list of adjacent node IDs. Conceptually, MapReduce jobs are performed in iterative way and at each iteration each node sends messages to its neighbors. Each neighbor updates its state on the basis of the received messages. Iterations are terminated by some condition like fixed maximal number of iterations (say, network diameter) or negligible changes in states between two consecutive iterations. From the technical point of view, Mapperemits messages for each node using ID of the adjacent node as a key. As result, all messages are grouped by the incoming node and reducer is able to re-compute state and rewrite node with the new state.
  • 29. Iterative Message Passing (Graph Processing) Solution: class Mapper method Map(id n, object N) Emit(id n, object N) for all id m in N.OutgoingRelationsdo Emit(id m, message getMessage(N)) class Reducer method Reduce(id m, [s1, s2,...]) object M = null messages = [] for all s in [s1, s2,...] do if IsObject(s) then M = s else // s is a message messages.add(s) M.State= calculateState(messages) Emit(id m, object M)
  • 30. Case Study: Availability Propagation Through The Tree of Categories •Problem Statement: –This problem is inspired by real life eCommercetask. –There is a tree of categories that branches out from large categories (like Men, Women, Kids) to smaller ones (like Men Jeans or Women Dresses), and eventually to small end-of-line categories (like Men Blue Jeans). –End-of-line category is either available (contains products) or not. –Some high level category is available if there is at least one available end-of-line category in its subtree. –The goal is to calculate availabilities for all categories if availabilities of end-of- line categories are know.
  • 31. Case Study: Availability Propagation Through The Tree of Categories class N State in {True = 2, False = 1, null = 0}, //initialized 1 or 2 for end- of-line categories, 0 otherwise method getMessage(object N) return N.State method calculateState(state s, data [d1, d2,...]) return max( [d1, d2,...] )
  • 32. Case Study: Breadth-First Search •Problem Statement:There is a graph and it is required to calculate distance (a number of hops) from one source node to all other nodes in the graph.
  • 33. Case Study: Breadth-First Search class N State is distance, initialized 0 for source node, INFINITY for all other nodes method getMessage(N) return N.State+ 1 method calculateState(state s, data [d1, d2,...]) min( [d1, d2,...] )
  • 34. Distinct Values (Unique Items Counting) •Problem Statement: –There is a set of records that contain fields F and G. Count the total number of uniquevalues of filed F for each subset of records that have the same G (grouped by G). –The problem can be a little bit generalized and formulated in terms of faceted search: There is a set of records. Each record has field F and arbitrary number of category labels G = {G1, G2, …} . Count the total number of uniquevalues of filed F for each subset of records for each value of any label. •Example Applications: Log Analysis, Unique Users Counting Record 1: F=1, G={a, b} Record 2: F=2, G={a, d, e} Record 3: F=1, G={b} Record 4: F=3, G={a, b} Result: a -> 3 // F=1, F=2, F=3 b -> 2 // F=1, F=3 d -> 1 // F=2 e -> 1 // F=2
  • 35. Distinct Values (Unique Items Counting) Solution: PHASE-1 class Mapper method Map(null, record [value f, categories [g1, g2,...]]) for all category g in [g1, g2,...] Emit(record [g, f], count 1) class Reducer method Reduce(record [g, f], counts [n1, n2, ...]) Emit(record [g, f], null ) PHASE-2 class Mapper method Map(record [f, g], null) Emit(value g, count 1) class Reducer method Reduce(value g, counts [n1, n2,...]) Emit(value g, sum( [n1, n2,...] ) )
  • 36. Cross-Correlation
  •Problem Statement:
  –There is a set of tuples of items.
  –For each possible pair of items, calculate the number of tuples where these items co-occur.
  –If the total number of items is N, then N*N values should be reported.
  –This problem appears in:
    •text analysis (say, items are words and tuples are sentences),
    •market analysis (customers who buy this tend to also buy that).
  –If N*N is small, the matrix can fit in the memory of a single machine.
  • 37. Market Basket Analysis
  •Suppose a store sells N different products and has data on B customer purchases (each purchase is called a "market basket").
  •An association rule is expressed as "If x is sold, then y is sold": x -> y
  •"Support" for x -> y: the joint probability P(x ^ y) –The probability of finding x and y together in a random basket
  •"Confidence" in x -> y: the conditional probability P(y | x) –The probability of finding y in a basket if the basket already contains x
  •The goal is to find rules with high support and high confidence. How are they computed?
  –Cooccur[x, x] represents the number of baskets that contain x
  –Cooccur[x, y] represents the number of baskets that contain both x and y
  –Cooccur[x, y] == Cooccur[y, x]
  –Support: Cooccur[x, y] / B
  –Confidence: Cooccur[x, y] / Cooccur[x, x]
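  To make the formulas concrete, here is a toy support/confidence computation; the Cooccur counters are built exactly as defined above, while the basket contents are made up for illustration.

  from itertools import permutations
  from collections import Counter

  baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread"}, {"milk"}]
  B = len(baskets)

  cooccur = Counter()
  for basket in baskets:
      for x in basket:
          cooccur[(x, x)] += 1                    # Cooccur[x, x]: baskets containing x
      for x, y in permutations(basket, 2):
          cooccur[(x, y)] += 1                    # Cooccur[x, y]: baskets containing both

  support = cooccur[("milk", "bread")] / B                             # P(milk ^ bread) = 0.5
  confidence = cooccur[("milk", "bread")] / cooccur[("milk", "milk")]  # P(bread | milk) = 2/3
  print(support, confidence)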
  • 38. Cross-Correlation Solution ("pairs" approach):
  class Mapper
     method Map(null, basket = [i1, i2,...])
        for all item i in basket
           for all item j in basket
              Emit(pair [i, j], count 1)

  class Reducer
     method Reduce(pair [i, j], counts [c1, c2,...])
        s = sum( [c1, c2,...] )
        Emit(pair [i, j], count s)
  • 39. Cross-Correlation Solution ("stripes" approach):
  class Mapper
     method Map(null, basket = [i1, i2,...])
        for all item i in basket
           H = new AssociativeArray: item -> counter
           for all item j in basket
              H{j} = H{j} + 1
           Emit(item i, stripe H)

  class Reducer
     method Reduce(item i, stripes [H1, H2,...])
        H = new AssociativeArray: item -> counter
        H = merge-sum( [H1, H2,...] )
        for all item j in H.keys()
           Emit(pair [i, j], H{j})
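  Both variants can be compared locally. The "pairs" approach emits one counter per (i, j) pair, while the "stripes" approach emits one associative array per item, which reduces the number of intermediate records at the cost of larger values. A small simulation with made-up baskets:

  from collections import Counter, defaultdict

  baskets = [["a", "b", "c"], ["a", "b"], ["b", "c"]]

  # Pairs: map emits ((i, j), 1); reduce sums the counts per pair
  pairs = Counter()
  for basket in baskets:
      for i in basket:
          for j in basket:
              pairs[(i, j)] += 1                  # the diagonal (i, i) counts baskets containing i

  # Stripes: map emits (i, {j: 1, ...}); reduce merge-sums the stripes per item i
  stripes = defaultdict(Counter)
  for basket in baskets:
      for i in basket:
          for j in basket:
              stripes[i][j] += 1

  assert all(stripes[i][j] == c for (i, j), c in pairs.items())   # both yield the same matrix
  print(pairs[("a", "b")])   # 2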
  • 40. Selection Solution:
  class Mapper
     method Map(rowkey key, tuple t)
        if t satisfies the predicate
           Emit(tuple t, null)
  • 41. Projection Solution:
  class Mapper
     method Map(rowkey key, tuple t)
        tuple g = project(t) // extract the required fields into tuple g
        Emit(tuple g, null)

  class Reducer
     method Reduce(tuple t, array n) // n is an array of nulls
        Emit(tuple t, null)
  • 42. Union Solution:
  class Mapper
     method Map(rowkey key, tuple t)
        Emit(tuple t, null)

  class Reducer
     method Reduce(tuple t, array n) // n is an array of one or two nulls
        Emit(tuple t, null)
  • 43. Intersection Solution:
  class Mapper
     method Map(rowkey key, tuple t)
        Emit(tuple t, null)

  class Reducer
     method Reduce(tuple t, array n) // n is an array of one or more nulls
        if n.size() >= 2
           Emit(tuple t, null)
  • 44. Difference Solution:
  // We want to compute the difference R – S.
  // The Mapper emits all tuples with a tag, which is the name of the set the record came from.
  // The Reducer emits only records that came from R but not from S.
  class Mapper
     method Map(rowkey key, tuple t)
        Emit(tuple t, string t.SetName) // t.SetName is either 'R' or 'S'

  class Reducer
     method Reduce(tuple t, array n) // array n can be ['R'], ['S'], ['R', 'S'], or ['S', 'R']
        if n.size() = 1 and n[1] = 'R'
           Emit(tuple t, null)
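  The intersection and difference reducers above differ only in the condition applied to each group of tags. A minimal local sketch of both, with made-up relations R and S; sorting plus groupby stands in for the shuffle:

  from itertools import groupby

  R = [("alice", 1), ("bob", 2), ("carol", 3)]
  S = [("bob", 2), ("dave", 4)]

  tagged = sorted([(t, "R") for t in R] + [(t, "S") for t in S])

  intersection, difference = [], []
  for t, group in groupby(tagged, key=lambda x: x[0]):
      tags = [tag for _, tag in group]
      if len(tags) >= 2:                  # the tuple is present in both sets
          intersection.append(t)
      if tags == ["R"]:                   # the tuple came only from R
          difference.append(t)

  print(intersection)   # [('bob', 2)]
  print(difference)     # [('alice', 1), ('carol', 3)]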
  • 45. Join
  •A JOIN is a means for combining fields from two tables by using values common to each.
  •ANSI-standard SQL specifies several types of JOIN: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS.
  •As a special case, a table (base table, view, or joined table) can JOIN to itself in a self-join.
  •Join algorithms: –Nested loop –Sort-Merge –Hash

  SELECT * FROM employee INNER JOIN department ON employee.DepartmentID = department.DepartmentID;
  SELECT * FROM employee JOIN department ON employee.DepartmentID = department.DepartmentID;
  SELECT * FROM employee LEFT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID;
  • 46. Nested loop
  for each tuple r in R do
     for each tuple s in S do
        if r and s satisfy the join condition then
           output the tuple <r, s>
  • 47. Sort-Merge
  p in P; q, g in Q // P and Q are sorted on the join attributes a and b
  while more tuples in inputs do
     while p.a < g.b do
        advance p
     end while
     while p.a > g.b do
        advance g // a group might begin here
     end while
     while p.a == g.b do
        q = g // mark group beginning
        while p.a == q.b do
           add <p, q> to the result
           advance q
        end while
        advance p // move forward
     end while
     g = q // candidate to begin next group
  end while
  • 48. Sort-Merge: MapReduce
  •This algorithm joins two sets R and L on some key k.
  •The Mapper goes through all tuples from R and L, extracts key k from each tuple, marks the tuple with a tag that indicates which set it came from ('R' or 'L'), and emits the tagged tuple using k as a key.
  •The Reducer receives all tuples for a particular key k and puts them into two buckets, one for R and one for L.
  •Once both buckets are filled, the Reducer runs a nested loop over them and emits a cross-join of the buckets.
  •Each emitted tuple is a concatenation of an R-tuple, an L-tuple, and the key k.
  •This approach has the following disadvantages:
  –The Mapper emits absolutely all data, even for keys that occur only in one set and have no pair in the other.
  –The Reducer must hold all data for one key in memory. If the data doesn't fit in memory, it is the Reducer's responsibility to handle this by some kind of swapping.
  • 49. Sort-Merge: MapReduce Pseudocode Solution:
  class Mapper
     method Map(null, tuple [join_key k, value v1, value v2,...])
        Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2,...]])

  class Reducer
     method Reduce(join_key k, tagged_tuples [t1, t2,...])
        H = new AssociativeArray: set_name -> values
        for all tagged_tuple t in [t1, t2,...] // separate values into 2 arrays
           H{t.tag}.add(t.values)
        for all values r in H{'R'} // produce a cross-join of the two arrays
           for all values l in H{'L'}
              Emit(null, [k, r, l])
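  A compact local simulation of this reduce-side join, with made-up sets R and L; the per-key buckets correspond to the associative array H in the pseudocode.

  from collections import defaultdict

  R = [(1, "r1"), (2, "r2"), (2, "r3")]           # (join_key, payload)
  L = [(2, "l1"), (3, "l2")]

  groups = defaultdict(lambda: {"R": [], "L": []})
  for k, v in R:                                  # map: emit tagged tuples, grouped by key
      groups[k]["R"].append(v)
  for k, v in L:
      groups[k]["L"].append(v)

  for k, buckets in groups.items():               # reduce: nested loop over the two buckets
      for r in buckets["R"]:
          for l in buckets["L"]:
              print(k, r, l)                      # emits (2, r2, l1) and (2, r3, l1)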
  • 50. Hash Join
  •The Hash Join algorithm consists of a 'build' phase and a 'probe' phase.
  •In its simplest variant, the smaller dataset is loaded into an in-memory hash table in the build phase.
  •In the 'probe' phase, the larger dataset is scanned and joined with the relevant tuple(s) by looking into the hash table.

  for all p in P do
     load p into in-memory hash table H
  end for
  for all q in Q do
     if H contains a tuple p matching q then
        add <p, q> to the result
     end if
  end for
  • 51. Hash Join: MapReduce Pseudocode
  •Let's assume that we join two sets, R and L, where R is relatively small. If so, R can be distributed to all Mappers, and each Mapper can load it and index it by the join key.

  class Mapper
     method Initialize
        H = new AssociativeArray: join_key -> tuple from R
        R = loadR()
        for all [join_key k, tuple [r1, r2,...]] in R
           H{k} = H{k}.append( [r1, r2,...] )
     method Map(join_key k, tuple l)
        for all tuple r in H{k}
           Emit(null, tuple [k, r, l])
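  And a local sketch of this map-side variant: the small set R is indexed by join key up front (the 'build' phase from the previous slide), then L is streamed past the hash table (the 'probe'). The data is made up for illustration.

  from collections import defaultdict

  R = [(1, "r1"), (2, "r2"), (2, "r3")]           # small set, fits in memory
  L = [(2, "l1"), (3, "l2"), (2, "l3")]           # large set, streamed tuple by tuple

  H = defaultdict(list)
  for k, r in R:                                  # build phase (Mapper Initialize)
      H[k].append(r)

  for k, l in L:                                  # probe phase (Map over L)
      for r in H.get(k, []):
          print(k, r, l)                          # emits (2, r2, l1), (2, r3, l1), (2, r2, l3), (2, r3, l3)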
  • 52. References
  F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99-110, 2010.
  R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD, pages 495-506, 2010.
  H. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker Jr. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD Conference, pages 1029-1040, 2007.
  A. Okcan and M. Riedewald. Processing theta-joins using MapReduce. In SIGMOD, 2011.
  F. N. Afrati and J. D. Ullman. Optimizing Multiway Joins in a Map-Reduce Environment. IEEE Trans. Knowl. Data Eng. 23(9): 1282-1298, 2011.
  A. Metwally and C. Faloutsos. V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors. PVLDB, vol. 5, 2012.
  • 53. End of session Day 2: MapReduce Concepts