This document discusses processing large datasets with Python and Hadoop. It begins with an example of finding the highest temperature from a climate dataset using a map-reduce approach. Next, it provides code examples for implementing map-reduce in pure Python, with Hadoop Streaming, and with the Dumbo library. The document then discusses using Amazon Elastic MapReduce for running Hadoop jobs on AWS. It poses a question about how to implement breadth-first search as a map-reduce algorithm and ends with an example of using MongoDB's map-reduce functionality.
3.
high_temp = 0
for line in open('1901'):
    line = line.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if temp != "+9999" and quality in "01459":
        high_temp = max(high_temp, float(temp))
print(high_temp)

How can we make this scale? (And do more interesting things?)
5. Jeffrey Dean – Google – 2004
“Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”
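The model Dean describes can be sketched in a few lines of plain Python using the canonical word-count example (the function names and sample records here are illustrative, not from the paper): a map step emits intermediate key/value pairs, the pairs are grouped by key, and a reduce step combines all values that share a key.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    # emit one intermediate (key, value) pair per word
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # combine all values that share the same key
    return (key, sum(values))

records = ["the quick fox", "the lazy dog"]

# sorting plays the role of the framework's shuffle: it brings
# identical keys together so groupby can hand each group to reduce_fn
intermediate = sorted(pair for rec in records for pair in map_fn(rec))
result = dict(reduce_fn(k, [v for _, v in grp])
              for k, grp in groupby(intermediate, key=itemgetter(0)))
# result maps each word to its total count across all records
```

In a real MapReduce framework the sort-and-group step is done by the system across machines; the programmer supplies only the two functions.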
7. MapReduce in pure Python

from functools import reduce  # reduce moved to functools in Python 3

def mapper(line):
    line = line.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if temp != "+9999" and quality in "01459":
        return float(temp)
    return None

# drop the None results so max() never sees them
output = (t for t in map(mapper, open('1901')) if t is not None)
print(reduce(max, output))
8. Hadoop Streaming

mapper.py:

import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):
        print("%s\t%s" % (year, temp))

reducer.py:

import sys

(last_key, max_val) = (None, None)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if key == last_key:
        max_val = max(max_val, int(val))
    else:
        if last_key is not None:
            print("%s\t%s" % (last_key, max_val))
        (last_key, max_val) = (key, int(val))
if last_key is not None:
    print("%s\t%s" % (last_key, max_val))

cat dataFile | python mapper.py | sort | python reducer.py
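One nice property of Streaming jobs is that the whole pipeline can be dry-run on a laptop without Hadoop, which makes debugging much easier. A minimal in-process simulation of cat | mapper | sort | reducer is sketched below; the make_record helper and the sample year/temperature values are fabricated here purely to mimic the fixed-width record layout the slides slice with [15:19], [87:92], and [92:93].

```python
import re
from itertools import groupby

def make_record(year, temp, quality):
    # fabricate a 93-character fixed-width line shaped like the dataset
    chars = ["0"] * 93
    chars[15:19] = year      # year field
    chars[87:92] = temp      # signed 5-char temperature field
    chars[92:93] = quality   # quality flag
    return "".join(chars)

def run_mapper(lines):
    # same logic as mapper.py, but yielding pairs instead of printing
    for line in lines:
        val = line.strip()
        year, temp, q = val[15:19], val[87:92], val[92:93]
        if temp != "+9999" and re.match("[01459]", q):
            yield (year, int(temp))

def run_reducer(pairs):
    # pairs must arrive sorted by key, exactly as `sort` guarantees
    for key, grp in groupby(pairs, key=lambda kv: kv[0]):
        yield (key, max(v for _, v in grp))

records = [make_record("1901", "+0015", "1"),
           make_record("1901", "-0011", "1"),
           make_record("1902", "+0025", "4")]

# sorted() stands in for the shell's `sort` between the two scripts
out = dict(run_reducer(sorted(run_mapper(records))))
```

Running this gives one (year, max_temp) pair per year, mirroring what the reducer script would print.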
18. Elastic MapReduce

CLI – API – Web Console

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
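Through the API route, a Streaming job like the one above becomes a "step" in an EMR job flow. The helper below sketches one such step in the shape boto3's emr client expects for run_job_flow(Steps=[...]); the bucket paths are placeholders, and the exact field names and the command-runner.jar convention are assumptions to verify against the current EMR API reference.

```python
def streaming_step(name, mapper_s3, reducer_s3, input_s3, output_s3):
    """Build one Hadoop Streaming step dict for an EMR job flow
    (field names assumed from the boto3 EMR request shape)."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "%s,%s" % (mapper_s3, reducer_s3),
                "-mapper", mapper_s3.rsplit("/", 1)[-1],
                "-reducer", reducer_s3.rsplit("/", 1)[-1],
                "-input", input_s3,
                "-output", output_s3,
            ],
        },
    }

step = streaming_step("max-temp",
                      "s3://mybucket/mapper.py",
                      "s3://mybucket/reducer.py",
                      "s3://mybucket/input/",
                      "s3://mybucket/output/")
# step would then be passed as Steps=[step] in a run_job_flow call
```

The same step can equally be submitted from the CLI or pasted together in the web console; the API version is just the scriptable form.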
24. MapReduce algorithms ARE different

BFS(G, s)                          // G is the graph and s is the starting node
  for each vertex u ∈ V[G] − {s}
      do color[u] ← WHITE          // color of vertex u
         d[u] ← ∞                  // distance from source s to vertex u
         π[u] ← NIL                // predecessor of u
  color[s] ← GRAY
  d[s] ← 0
  π[s] ← NIL
  Q ← Ø                            // Q is a FIFO queue
  ENQUEUE(Q, s)
  while Q ≠ Ø                      // iterates as long as there are gray vertices
      do u ← DEQUEUE(Q)
         for each v ∈ Adj[u]
             do if color[v] = WHITE    // discover the undiscovered adjacent vertices
                   then color[v] ← GRAY    // enqueued whenever painted gray
                        d[v] ← d[u] + 1
                        π[v] ← u
                        ENQUEUE(Q, v)
         color[u] ← BLACK          // painted black whenever dequeued
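The FIFO queue above is inherently sequential, which is exactly why this textbook BFS does not translate directly to MapReduce. The standard reformulation runs in rounds: in each round, a "map" over the current frontier proposes the tentative distance d[u] + 1 for every neighbor v, and a "reduce" per node keeps the minimum proposed distance; nodes whose distance improved form the next frontier. A minimal single-machine sketch of this round structure (the adjacency-list graph and names are illustrative):

```python
from collections import defaultdict

INF = float("inf")

def bfs_mapreduce(adj, source):
    # dist holds the best-known distance from source to each node
    dist = {u: INF for u in adj}
    dist[source] = 0
    frontier = {source}
    while frontier:
        # "map" phase: each frontier node emits (neighbor, d[u] + 1)
        proposals = defaultdict(list)
        for u in frontier:
            for v in adj[u]:
                proposals[v].append(dist[u] + 1)
        # "reduce" phase: per node, keep the minimum proposed distance
        frontier = set()
        for v, ds in proposals.items():
            best = min(ds)
            if best < dist[v]:
                dist[v] = best
                frontier.add(v)   # improved nodes drive the next round
    return dist

adj = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
distances = bfs_mapreduce(adj, "s")
```

On a cluster, each round is one MapReduce job and the "rounds" loop becomes a driver that resubmits jobs until no distance improves, so the number of jobs equals the graph's diameter rather than its node count.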