MapReduce: theory and practice

MapReduce 101
by Chaordic Systems

Big Data, what's the big deal?
Why is this talk relevant to you?
● we have too much data
to process in a single computer

● we make too few informed decisions
based on the data we have

● we have too little {time|CPU|memory}
to analyze all this data

● 'cuz not everything needs to be on-line
It's 2013 but doing batch processing is still OK

Map-what?
And why MapReduce and not, say, MPI?
● Simple computation model
MapReduce exposes a simple (and limited) computational model.
It can be restrictive at times, but it is a trade-off.

● Fault-tolerance, parallelization and
distribution among machines for free
The framework deals with this for you so you don't have to

● Because it is the bread and butter of Big
Data processing
It is available on all major cloud computing platforms, and it is what
other Big Data systems compare themselves against.

Outline
● Fast recap on Python and whatnot
● Introduction to MapReduce
● Counting Words
● MrJob and EMR
● Real-life examples

Fast recap
Let's assume you know what the following is:
● JSON
● Python's yield keyword
● Generators in Python
● Amazon S3
● Amazon EC2
If you don't, raise your hand now. REALLY

Recap
JSON
JSON (JavaScript Object Notation) is a
lightweight data-interchange format.
It's as if XML and JavaScript slept together and gave birth to a bastard but good-looking child.
{"timestamp": "2011-08-15 22:17:31.334057",
 "track_id": "TRACCJA128F149A144",
 "tags": [["Bossa Nova", "100"],
          ["jazz", "20"],
          ["acoustic", "20"],
          ["romantic", "20"]],
 "title": "Segredo",
 "artist": "Jo\u00e3o Gilberto"}

Recap
Python generators
From Python's wiki:
“Generator functions allow you to declare a
function that behaves like an iterator, i.e. it
can be used in a for loop.”
The difference is: a generator can be iterated (or read)
only once as you don't store things in memory but create
them on the fly [2].
You can create generators using the yield keyword.
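A minimal sketch of the "only once" point above. The function name squares is made up for illustration; the second pass over the same generator finds nothing left to read.

```python
def squares(n):
    """Yield 0, 1, 4, ... one value at a time instead of building a list."""
    for i in range(n):
        yield i * i


gen = squares(4)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing left: the values were never stored

print(first_pass)
print(second_pass)
```

If you need the values twice, call squares(4) again or keep the list.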

Recap
Python yield keyword
It's just like a return, but turns your function into
a generator.
Your function suspends its execution after yielding a value and resumes
when the next item is requested from the generator (the next loop iteration).

def count_from_1():
    i = 1
    while True:
        yield i  # hand i to the caller and pause here
        i += 1

# Prints 1, 2, 3, ... forever; interrupt it with Ctrl+C.
for j in count_from_1():
    print(j)
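Since count_from_1 never ends, here is a small sketch of how to take just a few values from it. It uses next() and itertools.islice from the standard library; the variable names are only for illustration.

```python
from itertools import islice


def count_from_1():
    i = 1
    while True:
        yield i
        i += 1


counter = count_from_1()
first = next(counter)    # runs the function up to the first yield
second = next(counter)   # resumes right after the yield, loops, yields again

# islice stops asking for items after five of them, so the loop ends.
first_five = list(islice(count_from_1(), 5))
print(first, second, first_five)
```

Nothing is computed until somebody asks for the next item, which is why an infinite generator is harmless.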

Recap
Amazon S3
From Wikipedia:
“Amazon S3 (Simple Storage Service) is an
online storage web service offered by
Amazon Web Services.”
It's like a distributed filesystem that is easy to
use from other Amazon services, especially from
Amazon Elastic MapReduce.

Recap
EC2 - Elastic Compute Cloud
From Wikipedia:
“EC2 allows users to rent virtual computers
on which to run their own computer
applications”
So you can rent clusters on demand: no need to maintain,
keep fixing and keep up to date your ever-breaking cluster of
computers. Less headache, moar action.
Instances can be purchased on demand for fixed prices or
you can bid on them.

MapReduce:
a quick introduction

MapReduce
MapReduce builds on the observation that
many tasks have the same structure:
computation is applied over a large number of
records to generate partial results, which are
then aggregated in some fashion.

MapReduce
Map
Reduce

Typical (big data) problem
● Iterate over a large number of records
● Extract something of interest from each (Map)
● Shuffle and sort intermediate results
● Aggregate intermediate results (Reduce)
● Generate final output

Phases of a MapReduction
MapReduce has the following steps:
map(key, value) -> [(key1, value1), (key1, value2)]
combine

May happen in parallel, on multiple
machines!

sort + shuffle
reduce(key1, [value1, value2]) -> [(keyX, valueY)]

Notice:
Reduce phase only starts after all mappers
have completed.
Yes, there is a synchronization barrier right there.

There is no global knowledge
Neither mappers nor reducers know what other mappers (or reducers) are
processing

Counting Words
Counting the number of occurrences of a word
in a document collection is quite a big deal.
Let's try with a small example:
"Me gusta correr, me gustas tu.
Me gusta la lluvia, me gustas tu."

Counting Words
"Me gusta correr, me gustas tu.
Me gusta la lluvia, me gustas tu."
me 4
gusta 2
correr 1
gustas 2
tu 2
la 1
lluvia 1

Counting words - in Python
count = {}
with open('input') as doc:
    for line in doc:
        for w in line.split():
            # Start at 0 the first time a word is seen.
            count[w] = count.get(w, 0) + 1

Easy, right? Yeah... too easy. Let's split what
we do for each line and aggregate, shall we?

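Counting words - in MapReduce

def map_get_words(self, key, line):
    # Map: emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1

def reduce_sum_words(self, word, occurrences):
    # Reduce: occurrences holds every 1 emitted for this word.
    yield word, sum(occurrences)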
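To try the two functions above without Hadoop, here is a single-process sketch of the map, shuffle and reduce phases. run_mapreduce is a made-up helper for illustration, not part of any framework, and the functions are written without self so they run standalone.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle and reduce sequentially, in one process."""
    # Map phase: each record is processed independently of the others.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Barrier: all map output exists before the shuffle starts.
    # Shuffle + sort: collect every value under its key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: each call sees one key and only that key's values.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def map_get_words(key, line):
    for word in line.split():
        yield word, 1


def reduce_sum_words(word, occurrences):
    yield word, sum(occurrences)


lines = [(1, "me gusta correr me gustas tu"),
         (2, "me gusta la lluvia me gustas tu")]
counts = run_mapreduce(lines, map_get_words, reduce_sum_words)
print(counts)
```

A real framework runs the map calls and the reduce calls on many machines; the order of the three phases stays the same.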

What is Map's output?
def map_get_words(self, key, line):
    for word in line.split():
        yield word, 1

key=1, line="me gusta correr me gustas tu"
('me', 1)
('gusta', 1)
('correr', 1)
('me', 1)
('gustas', 1)
('tu', 1)

key=2, line="me gusta la lluvia me gustas tu"
('me', 1)
('gusta', 1)
('la', 1)
('lluvia', 1)
('me', 1)
('gustas', 1)
('tu', 1)

What about shuffle?
Think of it as a distributed group by
operation.
In the local map instance/node:

● it sorts the map output,
● groups the values by their key,
● sends each key and its associated values to the
reduce node responsible for that key.
In the reduce instance/node:

● the framework joins all values associated with this key
in a single list - for you, for free.
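The "distributed group by" above can be sketched locally. Everything here is an assumption for illustration: two reducers, and a CRC32 hash to decide which reducer owns a key (a real framework uses its own partitioner).

```python
from collections import defaultdict
from zlib import crc32

NUM_REDUCERS = 2  # hypothetical cluster size


def partition(key, num_reducers=NUM_REDUCERS):
    """Pick the reducer responsible for a key, using a stable hash."""
    return crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers=NUM_REDUCERS):
    """Sort (key, value) pairs, group them by key, route each group to one reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in sorted(map_output):
        reducers[partition(key, num_reducers)][key].append(value)
    return reducers


map_output = [("me", 1), ("gusta", 1), ("correr", 1),
              ("me", 1), ("gustas", 1), ("tu", 1)]
for number, groups in enumerate(shuffle(map_output)):
    print(number, dict(groups))
```

Whatever the hash says, all values of a given key end up in the same reducer, which is the only guarantee the reducer needs.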

What's Shuffle output? or
What's Reducer input?

Key       (input) Values
correr    [1]
gusta     [1, 1]
gustas    [1, 1]
la        [1]
lluvia    [1]
me        [1, 1, 1, 1]
tu        [1, 1]

Notice:
This table represents a global view.
"In real life", each reducer instance only knows about its
own key and values.

What's Reducer output?
def reduce_sum_words(self, word, occurrences):
    yield word, sum(occurrences)

word      occurrences     output
correr    [1]             (correr, 1)
gusta     [1, 1]          (gusta, 2)
gustas    [1, 1]          (gustas, 2)
la        [1]             (la, 1)
lluvia    [1]             (lluvia, 1)
me        [1, 1, 1, 1]    (me, 4)
tu        [1, 1]          (tu, 2)

MapReduce (main) Implementations
Google MapReduce
● C++
● Proprietary

Apache Hadoop
● Java
○ interfaces for anything that runs in the JVM
○ Hadoop Streaming for a pipe-like, programming-language-agnostic
interface
● Open source

Nobody really cares about the others (for now... ;)
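A rough sketch of what "pipe-like" means for Hadoop Streaming: each step reads lines from standard input and writes lines to standard output, with key and value separated by a tab. The two functions below take any iterable of lines so they can be tried locally; in a real job they would read sys.stdin and print their results. The names mapper and reducer are just for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Turn raw text lines into 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same word.

    Assumes the lines arrive sorted by key, as they do after the shuffle.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


text = ["me gusta correr me gustas tu"]
# Local equivalent of: cat input | mapper | sort | reducer
result = list(reducer(sorted(mapper(text))))
print(result)
```

Because the contract is only "lines in, lines out", the same two steps could be written in any language that can read a pipe.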

Amazon Elastic MapReduce (EMR)
Amazon Elastic MapReduce
● Uses Hadoop with extra sauce
● Creates a Hadoop cluster on demand
● It's magical -- except when it fails
● Can be sort of unpredictable at times
○ Installing Python modules can fail for no clear reason

MrJob
It's a Python interface for Hadoop Streaming
jobs that is really easy to use
● Can run jobs locally or in EMR.
● Takes care of uploading your Python code to
EMR.
● Works best if everything is in a single
Python module.
● Easy interface to chain sequences of M/R
steps.
● Some basic tools to aid debugging.

Counting words
Full MrJob Example
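from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def get_words(self, key, line):
        # Mapper: emit (word, 1) for every word in the line.
        for word in line.split():
            yield word, 1

    def sum_words(self, word, occurrences):
        # Reducer: add up all the 1s emitted for this word.
        yield word, sum(occurrences)

    def steps(self):
        # self.mr is the API at the time of this talk; later mrjob releases use mrjob.step.MRStep.
        return [self.mr(self.get_words, self.sum_words)]

if __name__ == '__main__':
    MRWordCounter.run()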

MrJob
Launching a job
Running it locally
python countwords.py --conf-path=mrjob.conf input.txt

Running it in EMR
Do not forget to set the AWS_ environment variables!

python countwords.py \
    --conf-path=mrjob.conf \
    -r emr \
    's3://ufcgplayground/data/words/*' \
    --no-output \
    --output-dir=s3://ufcgplayground/tmp/bla/

MrJob
Installing and Environment setup
Install MrJob using pip or easy_install
Do not, I repeat, DO NOT install the version packaged in Ubuntu/Debian.

sudo pip install mrjob

Set up your environment with AWS credentials
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

Set up your environment to look for MrJob
settings:
export MRJOB_CONF=<path to mrjob.conf>

MrJob
Installing and Environment setup
Use our sample MrJob app as your template
git clone https://github.com/chaordic/mr101ufcg.git

Modify the sample mrjob.conf so that your jobs
are labeled with your team's name
It's the Right Thing © to do.
s3_logs_uri: s3://ufcgplayground/yournamehere/log/
s3_scratch_uri: s3://ufcgplayground/yournamehere/tmp/

Profit!

Target Categories
Objective: Find the most commonly viewed
categories per user
Input:
● views and orders
Patterns used:
● simple aggregation

Map input
zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [eletro, caos, furadeira]
lojaX, fulano, [livros, arte, anime]

The key is the first two fields of each record.

Sort + Shuffle

Reduce Input
Key                Values
(zezin, fulano)    [telefone, celulares, vivo]
                   [eletro, caos, furadeira]
(lojaX, fulano)    [livros, arte, anime]

Reduce Output
Key                Values
(zezin, fulano)    ([telefone, celulares, vivo], 2)
                   ([eletro, caos, furadeira], 1)
(lojaX, fulano)    ([livros, arte, anime], 3)
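One possible shape for this job, as a sketch only. The function names, the record layout (store, user, category path) and the sample records are assumptions for illustration, so the counts below are not the ones on the slide. The dictionary stands in for sort + shuffle.

```python
from collections import Counter, defaultdict


def map_view(record):
    """Emit ((store, user), category_path) for one view or order record."""
    store, user, category_path = record
    yield (store, user), tuple(category_path)


def reduce_count_categories(key, category_paths):
    """Count how often each category path shows up for one (store, user)."""
    counts = Counter(category_paths)
    for path, total in counts.most_common():
        yield key, (list(path), total)


# Hypothetical input: one category path is viewed twice.
records = [
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["eletro", "caos", "furadeira"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
]

# Local stand-in for sort + shuffle.
grouped = defaultdict(list)
for record in records:
    for key, value in map_view(record):
        grouped[key].append(value)

results = [out for key in sorted(grouped)
           for out in reduce_count_categories(key, grouped[key])]
for line in results:
    print(line)
```

The mapper does no counting at all: it only picks the key. All the aggregation happens in the reducer, which is what "simple aggregation" refers to.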

Filter Expensive Categories
Objective: List all categories where a user
purchased something expensive.
Input:
● Orders (for price and user information)
● Products (for category information)
Patterns used:
● merge using reducer

Map Input

BuyOrders
store    product    user        price
lojaX    livro      fulano      R$ 20
lojaX    iphone     deltrano    R$ 1800

Products
store    product
lojaX    livro
lojaX    iphone

We have to merge those tables above!
The common key is (store, product).

Map Output
Map input                                      Key                Value
BuyOrders: lojaX, livro, fulano, R$ 20         (nothing, it's cheap)
BuyOrders: lojaX, iphone, deltrano, R$ 1800    (lojaX, iphone)    {"usuario": "deltrano"}
Products: lojaX, livro                         (lojaX, livro)     {"cat": [livros...]}
Products: lojaX, iphone                        (lojaX, iphone)    {"cat": [telefone...]}

Reduce Input
Key                Values
(lojaX, livro)     {"cat": [livros, arte, anime]}
(lojaX, iphone)    {"usuario": "deltrano"}
                   {"cat": [telefone, celulares, vivo]}

Those are the parts we care about!

Reduce Output
Key                  Values
(lojaX, deltrano)    [telefone, celulares, vivo]
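A sketch of the "merge using reducer" pattern walked through above. The threshold EXPENSIVE, the function names and the argument layout are assumptions for illustration; the dictionary plays the role of sort + shuffle.

```python
from collections import defaultdict

EXPENSIVE = 1000  # hypothetical price threshold, in R$


def map_order(store, product, user, price):
    """Emit the buyer under the (store, product) key, for expensive orders only."""
    if price >= EXPENSIVE:
        yield (store, product), {"usuario": user}


def map_product(store, product, categories):
    """Emit the product's categories under the same (store, product) key."""
    yield (store, product), {"cat": categories}


def reduce_join(key, values):
    """Pair every expensive buyer with the categories of the product bought."""
    store, _product = key
    values = list(values)  # values arrive in no particular order
    users = [v["usuario"] for v in values if "usuario" in v]
    categories = [v["cat"] for v in values if "cat" in v]
    for user in users:
        for cat in categories:
            yield (store, user), cat


orders = [("lojaX", "livro", "fulano", 20),
          ("lojaX", "iphone", "deltrano", 1800)]
products = [("lojaX", "livro", ["livros", "arte", "anime"]),
            ("lojaX", "iphone", ["telefone", "celulares", "vivo"])]

# Local stand-in for sort + shuffle: both inputs land under the same keys.
grouped = defaultdict(list)
for order in orders:
    for key, value in map_order(*order):
        grouped[key].append(value)
for product in products:
    for key, value in map_product(*product):
        grouped[key].append(value)

joined = [out for key in sorted(grouped)
          for out in reduce_join(key, grouped[key])]
print(joined)
```

The join works because both mappers agree on the key. A key with categories but no buyer, such as (lojaX, livro) here, simply produces no output.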


Real datasets, real problems
In the following hour we will write code to
analyse some real datasets:
● Twitter Dataset (from an article published in WWW'10)
● LastFM Dataset, from The Million Song Dataset

Supporting code
● available on GitHub, at https://github.com/chaordic/mr101ufcg
● comes with sample data under data for
local runs.

Twitter Followers Dataset
A somewhat big dataset
● 41.7 million profiles
● 1.47 billion social relations (who follows who)
● 25 GB of uncompressed data

Available at s3://mr101ufcg/data/twitter/ ...
● splitted/*.gz
full dataset split into small compressed files

● numeric2screen.txt
numeric id to original screen name mapping

● followed_by.txt
original 25 GB dataset as a single file

Twitter Followers Dataset
Each line in followed_by.txt has the
following format:
user_id \t follower_id

For instance:
12 \t 38
12 \t 41
13 \t 47
13 \t 52
13 \t 53
14 \t 56
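A small sketch of reading this format, using the six sample lines above. parse_line is a made-up helper; the Counter does locally what a mapper emitting (user_id, 1) and a summing reducer would do in a job.

```python
from collections import Counter


def parse_line(line):
    """Split 'user_id<TAB>follower_id' into a pair of integers."""
    user_id, follower_id = line.rstrip("\n").split("\t")
    return int(user_id), int(follower_id)


sample = ["12\t38", "12\t41", "13\t47", "13\t52", "13\t53", "14\t56"]

# One count per line: each line is one follower of user_id.
followers = Counter(user for user, _follower in map(parse_line, sample))
print(followers)
```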

Million Song Dataset project's
Last.fm Dataset
A not-so-big dataset
● 943,347 tracks
● 1.2 GB of compressed data
Yeah, it is not all that big...

Available at s3://mr101ufcg/data/lastfm/ ...
● metadata/*.gz
Track metadata information, in JSONProtocol format.

● similars/*.gz
Track similarity information, in JSONProtocol format.

Million Song Dataset project's
Last.fm Dataset
JSONProtocol encodes key-value pairs in
a single line, as JSON-encoded values
separated by a tab character (\t).
<JSON encoded data> \t <JSON encoded data>

Example line:
"TRACHOZ12903CCA8B3" \t {"timestamp": "2011-09-07 22:12:47.150438",
"track_id": "TRACHOZ12903CCA8B3", "tags": [],
"title": "Close Up", "artist": "Charles Williams"}
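A minimal sketch of decoding one such line by hand with the standard json module, using the example line above. parse_json_protocol is a made-up helper name; inside a MrJob job the protocol does this for you.

```python
import json


def parse_json_protocol(line):
    """Decode one 'JSON key<TAB>JSON value' line into Python objects."""
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


line = ('"TRACHOZ12903CCA8B3"\t{"timestamp": "2011-09-07 22:12:47.150438", '
        '"track_id": "TRACHOZ12903CCA8B3", "tags": [], '
        '"title": "Close Up", "artist": "Charles Williams"}')

key, value = parse_json_protocol(line)
print(key, value["title"], value["artist"])
```

Splitting on the first tab is safe because JSON writes a tab inside a string as an escape sequence, never as a raw tab character.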

Stuff I didn't talk about but that is sorta
cool
Persistent jobs
Serialization (protocols in MrJob parlance)
Amazon EMR Console
Hadoop dashboard (and port 9100)

Combiners
Are just like reducers, but run right after a Map and
just before data is sent over the network during the shuffle.
Combiners must...
● be associative: a.(b.c) == (a.b).c
● be commutative: a.b == b.a
● have the same input and output types as your Map
output type.
Caveats:
● Combiners can be executed zero, one or many times,
so don't make your MR depend on them
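A sketch of why the rules above matter, using word count. combine and reduce_all are made-up helpers that simulate one combiner per mapper and the final reducers; because addition is associative and commutative, the result is the same with or without the combiner.

```python
from collections import defaultdict


def combine(map_output):
    """Pre-aggregate one mapper's output locally, before it crosses the network."""
    partial = defaultdict(int)
    for word, count in map_output:
        partial[word] += count
    return sorted(partial.items())


def reduce_all(pairs):
    """Sum the counts for every word, as the reducers would after the shuffle."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


mapper_1 = [("me", 1), ("gusta", 1), ("correr", 1),
            ("me", 1), ("gustas", 1), ("tu", 1)]
mapper_2 = [("me", 1), ("gusta", 1), ("la", 1), ("lluvia", 1),
            ("me", 1), ("gustas", 1), ("tu", 1)]

without_combiner = reduce_all(mapper_1 + mapper_2)
with_combiner = reduce_all(combine(mapper_1) + combine(mapper_2))

print(len(mapper_1), "pairs before combining,", len(combine(mapper_1)), "after")
print(without_combiner == with_combiner)
```

The combiner only shrinks what goes over the network. Since the job gives the same answer without it, it is safe for the framework to skip it or run it more than once.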

Reference & Further reading
[1] MapReduce: A Crash Course
[2] StackOverflow: The Python yield keyword explained
[3] Explicando iterables, generators e yield no python
[4] MapReduce: Simplified Data Processing on Large Clusters
[5] MrJob 4.0 - Quick start
[6] Amazon EC2 Instance Types

Life beyond MapReduce
How about reading up on other frameworks for
distributed processing of Big Data?
● Spark
● Storm
● GraphLab
And don't get me started on NoSQL...

Many thanks to...

for supporting this course.
You know there will be some live, intense, groovy Elastic MapReduce action
right after this presentation, right?

Questions?
Feel free to contact me at tiago.macambira@chaordicsystems.com.br

Or follow us @chaordic

So, let's write some code?
Twitter Dataset
● Count how many followers each user has
● Discover the user with the most followers
● What if I want the top-N most followed?

LastFM
● Merge similarity and metadata for tracks
● What is the most "plain" song?
● What is the plainest rock song according only to rock
songs?
