MapReduce 101
by Chaordic Systems
Brought to you by...
Big Data, what's the big deal?
Why is this talk relevant to you?
● we have too much data
to process in a single computer

● we make too few informed decisions
based on the data we have

● we have too little {time|CPU|memory}
to analyze all this data

● 'cuz not everything needs to be on-line
It's 2013 but doing batch processing is still OK
Map-what?
And why MapReduce and not, say MPI?
● Simple computation model
MapReduce exposes a simple (and limited) computational model.
It can be restraining at times, but it is a trade-off.

● Fault-tolerance, parallelization and
distribution among machines for free
The framework deals with this for you so you don't have to

● Because it is the bread-and-butter of Big
Data processing
It is available on all major cloud computing platforms, and it is the baseline
other Big Data systems compare themselves against.
Outline
● Fast recap on python and whatnot
● Introduction to MapReduce
● Counting Words
● MrJob and EMR
● Real-life examples
Fast recap
Let's assume you know what the following is:
● JSON
● Python's yield keyword
● Generators in Python
● Amazon S3
● Amazon EC2
If you don't, raise your hand now. REALLY
Recap
JSON
JSON (JavaScript Object Notation) is a
lightweight data-interchange format.
It's as if XML and JavaScript slept together and gave birth to a bastard but good-looking child.
{"timestamp": "2011-08-15 22:17:31.334057",
 "track_id": "TRACCJA128F149A144",
 "tags": [["Bossa Nova", "100"],
          ["jazz", "20"],
          ["acoustic", "20"],
          ["romantic", "20"]],
 "title": "Segredo",
 "artist": "Jo\u00e3o Gilberto"}
Recap
Python generators
From Python's wiki:
“Generator functions allow you to declare a
function that behaves like an iterator, i.e. it
can be used in a for loop.”
The difference from a list is: a generator can be iterated (or read)
only once, as you don't store things in memory but create
them on the fly [2].
You can create generators using the yield keyword.
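A minimal sketch of the "only once" behaviour described above. The squares function is a made-up example, not something from the course code:

```python
def squares(n):
    """Generator: yields 0, 1, 4, ... one value at a time."""
    for i in range(n):
        yield i * i

gen = squares(4)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing is left to iterate

print(first_pass)   # [0, 1, 4, 9]
print(second_pass)  # []
```

If you need the values twice, call squares(4) again or keep the list.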
Recap
Python yield keyword
It's just like a return, but it turns your function into
a generator.
Your function suspends its execution after yielding a value and resumes
when the next item is requested from the generator (the next loop iteration).

def count_from_1():
    i = 1
    while True:
        yield i
        i += 1

# Prints 1, 2, 3, ... forever; interrupt it to stop.
for j in count_from_1():
    print(j)
Recap
Amazon S3
From Wikipedia:
“Amazon S3 (Simple Storage Service) is an
online storage web service offered by
Amazon Web Services.”
It's like a distributed filesystem that is easy to
use from other Amazon services, especially from
Amazon Elastic MapReduce.
Recap
EC2 - Elastic Compute Cloud
From Wikipedia:
“EC2 allows users to rent virtual computers
on which to run their own computer
applications”
So you can rent clusters on demand: no need to maintain,
keep fixing and keep up to date your ever-breaking cluster of
computers. Less headache, moar action.
Instances can be purchased on demand for fixed prices, or
you can bid on them.
MapReduce:
a quick introduction
MapReduce
MapReduce builds on the observation that
many tasks have the same structure:
computation is applied over a large number of
records to generate partial results (Map), which are
then aggregated in some fashion (Reduce).
Typical (big data) problem
● Iterate over a large number of records
● Extract something of interest from each (Map)
● Shuffle and sort intermediate results
● Aggregate intermediate results (Reduce)
● Generate final output
Phases of a MapReduction
A MapReduce job has the following steps:
map(key, value) -> [(key1, value1), (key1, value2)]
combine
sort + shuffle
reduce(key1, [value1, value2]) -> [(keyX, valueY)]

May happen in parallel, on multiple
machines!
Notice:
Reduce phase only starts after all mappers
have completed.
Yes, there is a synchronization barrier right there.

There is no global knowledge
Neither mappers nor reducers know what other mappers (or reducers) are
processing
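To make the phases concrete, here is a single-process sketch of map, sort + shuffle and reduce. It is only an imitation for illustration: run_mapreduce, parity_mapper and sum_reducer are hypothetical names, and a real framework runs these steps on many machines.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of map -> sort + shuffle -> reduce."""
    # Map: each (key, value) record is processed independently.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))

    # Sort + shuffle: gather every intermediate value under its key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce: starts only after all map output has been grouped (the barrier).
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output

def parity_mapper(key, number):
    # Toy mapper: label each number as even or odd.
    yield ("even" if number % 2 == 0 else "odd"), number

def sum_reducer(key, values):
    # Toy reducer: add up all values that share a key.
    yield key, sum(values)

result = run_mapreduce(enumerate([1, 2, 3, 4, 5]), parity_mapper, sum_reducer)
print(result)  # [('even', 6), ('odd', 9)]
```

Note that the mapper and the reducer only ever see their own arguments: no global knowledge.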
Counting Words
Counting the number of occurrences of a word
in a document collection is quite a big deal.
Let's try with a small example:
"Me gusta correr, me gustas tu.
Me gusta la lluvia, me gustas tu."
me 4
gusta 2
correr 1
gustas 2
tu 2
la 1
lluvia 1
Counting words - in Python
count = {}
with open('input') as doc:
    for line in doc:
        words = line.split()
        for w in words:
            count[w] = count.get(w, 0) + 1

Easy, right? Yeah... too easy. Let's separate what
we do for each line from the aggregation, shall we?
Counting words - in MapReduce

def map_get_words(self, key, line):
    for word in line.split():
        yield word, 1

def reduce_sum_words(self, word, occurrences):
    yield word, sum(occurrences)
What is Map's output?
def map_get_words(self, key, line):
    for word in line.split():
        yield word, 1

key=1, line="me gusta correr me gustas tu"
('me', 1)
('gusta', 1)
('correr', 1)
('me', 1)
('gustas', 1)
('tu', 1)

key=2, line="me gusta la lluvia me gustas tu"
('me', 1)
('gusta', 1)
('la', 1)
('lluvia', 1)
('me', 1)
('gustas', 1)
('tu', 1)
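You can check the map output yourself without any framework. This sketch uses the mapper as a plain function (no self) and feeds it the two lines by hand; the line numbers used as keys are an assumption for illustration only.

```python
def map_get_words(key, line):
    for word in line.split():
        yield word, 1

lines = ["me gusta correr me gustas tu",
         "me gusta la lluvia me gustas tu"]

map_output = []
for key, line in enumerate(lines, start=1):
    # Each line is mapped on its own; the key is ignored by this mapper.
    map_output.extend(map_get_words(key, line))

print(len(map_output))   # 13
print(map_output[:3])    # [('me', 1), ('gusta', 1), ('correr', 1)]
```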
What about shuffle?
Think of it as a distributed group by
operation.
In the local map instance/node:

● it sorts map output values,
● groups them by their key,
● sends each group of key and associated values to the
reduce node responsible for this key.
In the reduce instance/node:

● the framework joins all values associated with this key
in a single list - for you, for free.
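The "distributed group by" can be imitated locally with a sort followed by itertools.groupby. This is a sketch on one machine only, using the map output of the first line; the shuffle helper is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

map_output = [("me", 1), ("gusta", 1), ("correr", 1),
              ("me", 1), ("gustas", 1), ("tu", 1)]

def shuffle(pairs):
    # Sort by key so equal keys become adjacent, then group them.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

reducer_input = list(shuffle(map_output))
print(reducer_input)
# [('correr', [1]), ('gusta', [1]), ('gustas', [1]), ('me', [1, 1]), ('tu', [1])]
```

groupby only merges adjacent items, which is why the sort has to come first.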
What's Shuffle output? or
What's Reducer input?
Key      (input) Values
correr   [1]
gusta    [1, 1]
gustas   [1, 1]
la       [1]
lluvia   [1]
me       [1, 1, 1, 1]
tu       [1, 1]

Notice:
This table represents a global view.
"In real life", each reducer instance only knows about its
own key and values.
What's Reducer output?
def reduce_sum_words(self, word, occurrences):
    yield word, sum(occurrences)

word     occurrences    output
correr   [1]            (correr, 1)
gusta    [1, 1]         (gusta, 2)
gustas   [1, 1]         (gustas, 2)
la       [1]            (la, 1)
lluvia   [1]            (lluvia, 1)
me       [1, 1, 1, 1]   (me, 4)
tu       [1, 1]         (tu, 2)
MapReduce (main) Implementations
Google MapReduce
● C++
● Proprietary

Apache Hadoop
● Java
  ○ interfaces for anything that runs in the JVM
  ○ Hadoop Streaming for a pipe-like, programming
    language agnostic interface
● Open source

Nobody really cares about the others (for now... ;)
Amazon Elastic MapReduce (EMR)
Amazon Elastic MapReduce
● Uses Hadoop with some extra sauce
● Creates a Hadoop cluster on demand
● It's magical -- except when it fails
● Can be sort of unpredictable at times
  ○ Installing Python modules can fail for no clear reason
MrJob
It's a Python interface for Hadoop Streaming
jobs that is really easy to use
● Can run jobs locally or in EMR.
● Takes care of uploading your Python code to
EMR.
● Works best if everything is in a single
Python module.
● Easy interface to chain sequences of M/R
steps.
● Some basic tools to aid debugging.
Counting words
Full MrJob Example
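from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def get_words(self, key, line):
        # Mapper: emit (word, 1) for every word in the line.
        for word in line.split():
            yield word, 1

    def sum_words(self, word, occurrences):
        # Reducer: add up the 1s emitted for each word.
        yield word, sum(occurrences)

    def steps(self):
        # self.mr() builds a step in the mrjob release used in this course;
        # later releases replace it with MRStep from mrjob.step.
        return [self.mr(self.get_words, self.sum_words),]

if __name__ == '__main__':
    MRWordCounter.run()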
MrJob
Launching a job
Running it locally
python countwords.py --conf-path=mrjob.conf \
    input.txt

Running it in EMR
Do not forget to set the AWS_* environment variables!

python countwords.py \
    --conf-path=mrjob.conf \
    -r emr \
    's3://ufcgplayground/data/words/*' \
    --no-output \
    --output-dir=s3://ufcgplayground/tmp/bla/
MrJob
Installing and Environment setup
Install MrJob using pip or easy_install
Do not, I repeat, DO NOT install the version packaged in Ubuntu/Debian.

sudo pip install mrjob

Set up your environment with AWS credentials
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

Set up your environment to look for MrJob
settings:
export MRJOB_CONF=<path to mrjob.conf>
MrJob
Installing and Environment setup
Use our sample MrJob app as your template
git clone https://github.com/chaordic/mr101ufcg.git

Modify the sample mrjob.conf so that your jobs
are labeled with your team's name
It's the Right Thing © to do.
s3_logs_uri: s3://ufcgplayground/yournamehere/log/
s3_scratch_uri: s3://ufcgplayground/yournamehere/tmp/

Profit!
Real-life examples
Target Categories
Objective: Find the most commonly viewed
categories per user
Input:
● views and orders
Patterns used:
● simple aggregation
Map input

zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [telefone, celulares, vivo]
zezin, fulano, [eletro, caos, furadeira]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]
lojaX, fulano, [livros, arte, anime]

Key: the first two fields of each record, e.g. (zezin, fulano)

Sort + Shuffle

Reduce Input

(zezin, fulano)
[telefone, celulares, vivo]
[telefone, celulares, vivo]
[eletro, caos, furadeira]

(lojaX, fulano)
[livros, arte, anime]
[livros, arte, anime]
[livros, arte, anime]

Reduce Output

(zezin, fulano)
([telefone, celulares, vivo], 2)
([eletro, caos, furadeira], 1)

(lojaX, fulano)
([livros, arte, anime], 3)
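A possible sketch of this job in plain Python, with a dict standing in for sort + shuffle. The function names map_view and reduce_count are assumptions, and the category lists are turned into tuples only so they can be counted.

```python
from collections import Counter, defaultdict

views = [
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["eletro", "caos", "furadeira"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
]

def map_view(store, user, category):
    # Key: the first two fields; value: the category path (hashable tuple).
    yield (store, user), tuple(category)

def reduce_count(key, categories):
    # Simple aggregation: count each category, most viewed first.
    yield key, Counter(categories).most_common()

grouped = defaultdict(list)   # stands in for sort + shuffle
for record in views:
    for key, value in map_view(*record):
        grouped[key].append(value)

result = {}
for key, values in grouped.items():
    for out_key, out_value in reduce_count(key, values):
        result[out_key] = out_value

print(result[("zezin", "fulano")])
# [(('telefone', 'celulares', 'vivo'), 2), (('eletro', 'caos', 'furadeira'), 1)]
```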
Filter Expensive Categories
Objective: List all categories where a user
purchased something expensive.
Input:
● Orders (for price and user information)
● Products (for category information)
Patterns used:
● merge using reducer
Map Input

BuyOrders
lojaX   livro    fulano     R$ 20
lojaX   iphone   deltrano   R$ 1800

Products
lojaX   livro    [livros, arte, anime]
lojaX   iphone   [telefone, celulares, vivo]

We have to merge those tables above!
Common key: the first two columns, e.g. (lojaX, iphone)

Map Output

Key               Value
(lojaX, livro)    (nothing, it's cheap)
(lojaX, iphone)   {"usuario": "deltrano"}
(lojaX, livro)    {"cat": [livros...]}
(lojaX, iphone)   {"cat": [telefone...]}

Reduce Input

Key               Values
(lojaX, livro)    {"cat": [livros, arte, anime]}
(lojaX, iphone)   {"usuario": "deltrano"}
                  {"cat": [telefone, celulares, vivo]}

The values under (lojaX, iphone) are the parts we care about!

Reduce Output

Key                 Values
(lojaX, deltrano)   [telefone, celulares, vivo]
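The "merge using reducer" pattern could look roughly like the sketch below. Everything runs in one process, the EXPENSIVE threshold is a made-up value (the slides only say R$ 20 is cheap), and the function names are hypothetical.

```python
from collections import defaultdict

orders = [("lojaX", "livro", "fulano", 20),
          ("lojaX", "iphone", "deltrano", 1800)]
products = [("lojaX", "livro", ["livros", "arte", "anime"]),
            ("lojaX", "iphone", ["telefone", "celulares", "vivo"])]

EXPENSIVE = 1000  # hypothetical price threshold, not from the slides

def map_order(store, product, user, price):
    if price >= EXPENSIVE:            # cheap orders emit nothing
        yield (store, product), {"usuario": user}

def map_product(store, product, category):
    yield (store, product), {"cat": category}

def reduce_join(key, values):
    # Both tables meet here because they share the (store, product) key.
    store, _product = key
    values = list(values)
    users = [v["usuario"] for v in values if "usuario" in v]
    cats = [v["cat"] for v in values if "cat" in v]
    for user in users:
        for cat in cats:
            yield (store, user), cat

grouped = defaultdict(list)           # stands in for sort + shuffle
for order in orders:
    for k, v in map_order(*order):
        grouped[k].append(v)
for product in products:
    for k, v in map_product(*product):
        grouped[k].append(v)

output = []
for key in sorted(grouped):
    output.extend(reduce_join(key, grouped[key]))

print(output)
# [(('lojaX', 'deltrano'), ['telefone', 'celulares', 'vivo'])]
```

Keys that only got a category (no expensive order) produce no output, which is exactly the filter we want.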
Real Datasets
Real datasets, real problems
In the following hour we will write code to
analyse some real datasets:
● Twitter Dataset (from an article published in WWW'10)
● LastFM Dataset, from The Million Song Dataset

Supporting code
● available at GitHub, under
https://github.com/chaordic/mr101ufcg
● comes with sample data in the data directory for
local runs.
Twitter Followers Dataset
A somewhat big dataset
● 41.7 million profiles
● 1.47 billion social relations (who follows whom)
● 25 GB of uncompressed data

Available at s3://mr101ufcg/data/twitter/ ...
● splitted/*.gz
full dataset split into small compressed files

● numeric2screen.txt
numeric id to original screen name mapping

● followed_by.txt
original 25 GB dataset as a single file
Twitter Followers Dataset
Each line in followed_by.txt has the
following format:
user_id \t follower_id

For instance:
12 \t 38
12 \t 41
13 \t 47
13 \t 52
13 \t 53
14 \t 56
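Parsing this format is a one-liner: split each line on the tab character. The sketch below runs on the sample rows only; map_follow is a hypothetical mapper name, and a Counter stands in for the reducer.

```python
from collections import Counter

# The six sample rows, with a real tab between the two ids.
sample = "12\t38\n12\t41\n13\t47\n13\t52\n13\t53\n14\t56\n"

def map_follow(_, line):
    # Each line is "user_id<TAB>follower_id"; emit one count per follower.
    user_id, follower_id = line.rstrip("\n").split("\t")
    yield int(user_id), 1

followers = Counter()
for n, line in enumerate(sample.splitlines()):
    for user_id, one in map_follow(n, line):
        followers[user_id] += one   # stands in for a summing reducer

print(sorted(followers.items()))  # [(12, 2), (13, 3), (14, 1)]
```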
Million Song Dataset project's
Last.fm Dataset
A not-so-big dataset
● 943,347 tracks
● 1.2 GB of compressed data
Yeah, it is not all that big...

Available at s3://mr101ufcg/data/lastfm/ ...
● metadata/*.gz
Track metadata information, in JSONProtocol format.

● similars/*.gz
Track similarity information, in JSONProtocol format.
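Each line in these files holds a JSON-encoded key and a JSON-encoded value separated by a tab. A minimal sketch of reading one such line by hand follows; parse_json_protocol is a hypothetical helper (MrJob decodes these lines for you), and the sample line is the one from the dataset description.

```python
import json

line = ('"TRACHOZ12903CCA8B3"\t{"timestamp": "2011-09-07 22:12:47.150438", '
        '"track_id": "TRACHOZ12903CCA8B3", "tags": [], '
        '"title": "Close Up", "artist": "Charles Williams"}')

def parse_json_protocol(raw):
    # Split on the first tab only, then decode each side as JSON.
    raw_key, raw_value = raw.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)

key, value = parse_json_protocol(line)
print(key)              # TRACHOZ12903CCA8B3
print(value["title"])   # Close Up
```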
Million Song Dataset project's
Last.fm Dataset
JSONProtocol encodes key-value pairs in
a single line using JSON-encoded values
separated by a tab character ( \t ).
<JSON encoded data> \t <JSON encoded data>

Example line:
"TRACHOZ12903CCA8B3" \t {"timestamp": "2011-09-07 22:12:47.150438",
"track_id": "TRACHOZ12903CCA8B3", "tags": [],
"title": "Close Up", "artist": "Charles Williams"}
Questions?
Stuff I didn't talk about but that is sorta
cool
Persistent jobs
Serialization (protocols in MrJob parlance)
Amazon EMR Console
Hadoop dashboard (and port 9100)
Combiners
They are just like reducers, but run right after a map and
just before data is sent over the network during the shuffle.
Combiners must...
● be associative: a.(b.c) == (a.b).c
● be commutative: a.b == b.a
● take and emit the same types as your map
output.
Caveats:
● Combiners can be executed zero, one or many times,
so don't make your MR depend on them
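A small sketch of why a sum works as a combiner: pre-aggregating each mapper's local output does not change the final result. The helper names and the two fake mapper outputs are made up for illustration.

```python
from collections import defaultdict

def combine_sum(pairs):
    """Pre-aggregate one mapper's local output before it hits the network."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())

def reduce_all(*outputs):
    # Final reduce: sum whatever arrives, combined or not.
    totals = defaultdict(int)
    for out in outputs:
        for word, count in out:
            totals[word] += count
    return dict(totals)

mapper_a = [("me", 1), ("gusta", 1), ("me", 1)]
mapper_b = [("me", 1), ("tu", 1)]

without_combiner = reduce_all(mapper_a, mapper_b)
with_combiner = reduce_all(combine_sum(mapper_a), combine_sum(mapper_b))

print(without_combiner == with_combiner)  # True
print(combine_sum(mapper_a))              # [('gusta', 1), ('me', 2)]
```

The combiner emits (word, count) pairs, the same shape as the map output, so the reducer cannot tell whether it ran.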
Reference & Further reading
[1] MapReduce: A Crash Course
[2] StackOverflow: The Python yield keyword explained
[3] Explicando iterables, generators e yield no python
[4] MapReduce: Simplified Data Processing on Large Clusters
[5] MrJob 4.0 - Quick start
[6] Amazon EC2 Instance Types
Life beyond MapReduce
What about reading about other frameworks for
distributed processing with Big Data?
● Spark
● Storm
● GraphLab
And don't get me started on NoSQL...
Many thanks to...

for supporting this course.
You know there will be some live, intense, groovy Elastic MapReduce action
right after this presentation, right?
Questions?
Feel free to contact me at
tiago.macambira@chaordicsystems.com.br

Or follow us @chaordic
So, let's write some code?
Twitter Dataset
● Count how many followers each user has
● Discover the user with the most followers
● What if I want the top-N most followed?

LastFM
● Merge similarity and metadata for tracks
● What is the most "plain" song?
● What is the plainest rock song according only to rock
songs?
Extra slides

Más contenido relacionado

La actualidad más candente

Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...
Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...
Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...Colin Barré-Brisebois
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Johan Andersson
 
Creating Games for Asha - platform
Creating Games for Asha - platformCreating Games for Asha - platform
Creating Games for Asha - platformJussi Pohjolainen
 
Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Johan Andersson
 
Oit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked ListsOit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked ListsHolger Gruen
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...Johan Andersson
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
 
Triangle Visibility buffer
Triangle Visibility bufferTriangle Visibility buffer
Triangle Visibility bufferWolfgang Engel
 
Exploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic LanguagesExploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic LanguagesTobias Lindaaker
 
Seeing with Python presented at PyCon AU 2014
Seeing with Python presented at PyCon AU 2014Seeing with Python presented at PyCon AU 2014
Seeing with Python presented at PyCon AU 2014Mark Rees
 
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...MLconf
 
Approaching zero driver overhead
Approaching zero driver overheadApproaching zero driver overhead
Approaching zero driver overheadCass Everitt
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineNarann29
 
Thinking Functionally with Clojure
Thinking Functionally with ClojureThinking Functionally with Clojure
Thinking Functionally with ClojureJohn Stevenson
 
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)Jyh-Miin Lin
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-renderingmistercteam
 

La actualidad más candente (20)

Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...
Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...
Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...
 
Beyond porting
Beyond portingBeyond porting
Beyond porting
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
 
Creating Games for Asha - platform
Creating Games for Asha - platformCreating Games for Asha - platform
Creating Games for Asha - platform
 
Redux Thunk
Redux ThunkRedux Thunk
Redux Thunk
 
Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!
 
Oit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked ListsOit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked Lists
 
Lec_4_1_IntrotoPIG.pptx
Lec_4_1_IntrotoPIG.pptxLec_4_1_IntrotoPIG.pptx
Lec_4_1_IntrotoPIG.pptx
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
Triangle Visibility buffer
Triangle Visibility bufferTriangle Visibility buffer
Triangle Visibility buffer
 
Exploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic LanguagesExploiting Concurrency with Dynamic Languages
Exploiting Concurrency with Dynamic Languages
 
Seeing with Python presented at PyCon AU 2014
Seeing with Python presented at PyCon AU 2014Seeing with Python presented at PyCon AU 2014
Seeing with Python presented at PyCon AU 2014
 
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
Anima Anadkumar, Principal Scientist, Amazon Web Services, Endowed Professor,...
 
Approaching zero driver overhead
Approaching zero driver overheadApproaching zero driver overhead
Approaching zero driver overhead
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering Pipeline
 
Thinking Functionally with Clojure
Thinking Functionally with ClojureThinking Functionally with Clojure
Thinking Functionally with Clojure
 
Scope Stack Allocation
Scope Stack AllocationScope Stack Allocation
Scope Stack Allocation
 
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
A minimal introduction to Python non-uniform fast Fourier transform (pynufft)
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-rendering
 

Destacado

Redes de Sensores e Robôs: Um novo paradigma de Monitoramento e Atuação
Redes de Sensores e Robôs: Um novo paradigma de Monitoramento e AtuaçãoRedes de Sensores e Robôs: Um novo paradigma de Monitoramento e Atuação
Redes de Sensores e Robôs: Um novo paradigma de Monitoramento e AtuaçãoPET Computação
 
Cooperação e Codificação de Rede Aplicadas as RSSF Industriais
Cooperação e Codificação de Rede Aplicadas as RSSF IndustriaisCooperação e Codificação de Rede Aplicadas as RSSF Industriais
Cooperação e Codificação de Rede Aplicadas as RSSF IndustriaisPET Computação
 
Processamento e visualização tridimensional de imagens de satelite e radar
Processamento e visualização tridimensional de imagens de satelite e radarProcessamento e visualização tridimensional de imagens de satelite e radar
Processamento e visualização tridimensional de imagens de satelite e radarPET Computação
 
Processamento e visualização tridimensional de imagens de Satelite e Radar
Processamento e visualização tridimensional de imagens de Satelite e RadarProcessamento e visualização tridimensional de imagens de Satelite e Radar
Processamento e visualização tridimensional de imagens de Satelite e RadarPET Computação
 
Com a cabeça nas nuvens: montando ambientes para aplicações elásticas
 Com a cabeça nas nuvens: montando ambientes para aplicações elásticas Com a cabeça nas nuvens: montando ambientes para aplicações elásticas
Com a cabeça nas nuvens: montando ambientes para aplicações elásticasPET Computação
 
2 aula micro e macro ambientes
2 aula micro e macro ambientes2 aula micro e macro ambientes
2 aula micro e macro ambientesAmanda Negreti
 
apresentação_dissertação
apresentação_dissertaçãoapresentação_dissertação
apresentação_dissertaçãoAna Hantt
 
Software Evolution: From Legacy Systems, Service Oriented Architecture to Clo...
Software Evolution: From Legacy Systems, Service Oriented Architecture to Clo...Software Evolution: From Legacy Systems, Service Oriented Architecture to Clo...
Software Evolution: From Legacy Systems, Service Oriented Architecture to Clo...PET Computação
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome EconomyHelge Tennø
 

Destacado (14)

Redes de Sensores e Robôs: Um novo paradigma de Monitoramento e Atuação
Redes de Sensores e Robôs: Um novo paradigma de Monitoramento e AtuaçãoRedes de Sensores e Robôs: Um novo paradigma de Monitoramento e Atuação
Redes de Sensores e Robôs: Um novo paradigma de Monitoramento e Atuação
 
Refactoring like a boss
Refactoring like a bossRefactoring like a boss
Refactoring like a boss
 
Linux em tempo real
Linux em tempo realLinux em tempo real
Linux em tempo real
 
Cooperação e Codificação de Rede Aplicadas as RSSF Industriais
Cooperação e Codificação de Rede Aplicadas as RSSF IndustriaisCooperação e Codificação de Rede Aplicadas as RSSF Industriais
Cooperação e Codificação de Rede Aplicadas as RSSF Industriais
 
Latex
LatexLatex
Latex
 
Processamento e visualização tridimensional de imagens de satelite e radar
Processamento e visualização tridimensional de imagens de satelite e radarProcessamento e visualização tridimensional de imagens de satelite e radar
Processamento e visualização tridimensional de imagens de satelite e radar
 
Processamento e visualização tridimensional de imagens de Satelite e Radar
Processamento e visualização tridimensional de imagens de Satelite e RadarProcessamento e visualização tridimensional de imagens de Satelite e Radar
Processamento e visualização tridimensional de imagens de Satelite e Radar
 
Com a cabeça nas nuvens: montando ambientes para aplicações elásticas
 Com a cabeça nas nuvens: montando ambientes para aplicações elásticas Com a cabeça nas nuvens: montando ambientes para aplicações elásticas
Com a cabeça nas nuvens: montando ambientes para aplicações elásticas
 
2 aula micro e macro ambientes
2 aula micro e macro ambientes2 aula micro e macro ambientes
2 aula micro e macro ambientes
 
Micro e macro tendências 2015
Micro e macro tendências 2015Micro e macro tendências 2015
Micro e macro tendências 2015
 
Planejamento automático
Planejamento automáticoPlanejamento automático
Planejamento automático
 
apresentação_dissertação
apresentação_dissertaçãoapresentação_dissertação
apresentação_dissertação
 
Software Evolution: From Legacy Systems, Service Oriented Architecture to Clo...
Software Evolution: From Legacy Systems, Service Oriented Architecture to Clo...Software Evolution: From Legacy Systems, Service Oriented Architecture to Clo...
Software Evolution: From Legacy Systems, Service Oriented Architecture to Clo...
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome Economy
 

Similar a MapReduce: teoria e prática

Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterSudhang Shankar
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeRizwan Habib
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16MLconf
 
Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Data Analytics and Simulation in Parallel with MATLAB*
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM Joy Rahman
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsZvi Avraham
 
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015Codemotion
 
An Introduction to MapReduce
An Introduction to MapReduce An Introduction to MapReduce
An Introduction to MapReduce Sina Ebrahimi
 
GRAPHICAL STRUCTURES in our lives
GRAPHICAL STRUCTURES in our livesGRAPHICAL STRUCTURES in our lives
GRAPHICAL STRUCTURES in our livesxryuseix
 

MapReduce: teoria e prática

  • 3. Big Data, what's the big deal? Why is this talk relevant to you? ● we have too much data to process on a single computer ● we make too few informed decisions based on the data we have ● we have too little {time|CPU|memory} to analyze all this data ● 'cuz not everything needs to be online: it's 2013, but doing batch processing is still OK
  • 4. Map-what? And why MapReduce and not, say, MPI? ● Simple computation model: MapReduce exposes a simple (and limited) computational model. It can be restrictive at times, but that is the trade-off. ● Fault tolerance, parallelization and distribution among machines for free: the framework deals with this for you so you don't have to. ● Because it is the bread and butter of Big Data processing: it is available on all major cloud computing platforms, and it is what other Big Data systems compare themselves against.
  • 5. Outline ● Fast recap on python and whatnot ● Introduction to MapReduce ● Counting Words ● MrJob and EMR ● Real-life examples
  • 7. Fast recap Let's assume you know what the following are: ● JSON ● Python's yield keyword ● Generators in Python ● Amazon S3 ● Amazon EC2 If you don't, raise your hand now. REALLY
  • 8. Recap JSON JSON (JavaScript Object Notation) is a lightweight data-interchange format. It's as if XML and JavaScript slept together and gave birth to a bastard but good-looking child.
      {"timestamp": "2011-08-15 22:17:31.334057",
       "track_id": "TRACCJA128F149A144",
       "tags": [["Bossa Nova", "100"],
                ["jazz", "20"],
                ["acoustic", "20"],
                ["romantic", "20"]],
       "title": "Segredo",
       "artist": "Jo\u00e3o Gilberto"}
  • 9. Recap Python generators From Python's wiki: “Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop.” The difference is: a generator can be iterated (or read) only once, as you don't store things in memory but create them on the fly [2]. You can create generators using the yield keyword.
  • 10. Recap Python yield keyword It's just like a return, but it turns your function into a generator. Your function suspends its execution after yielding a value and resumes when the next item is requested (the next loop iteration).
      def count_from_1():
          i = 1
          while True:
              yield i      # hand one value to the caller, then pause here
              i += 1

      for j in count_from_1():   # loops forever: the generator never ends
          print(j)
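To see the suspend-and-resume behaviour without an endless loop, here is a minimal sketch that pulls a few values at a time from the same generator. The use of `itertools.islice` and the variable names are illustrative choices, not part of the original slides.

```python
from itertools import islice

def count_from_1():
    """Yield 1, 2, 3, ... forever; each value is produced only when requested."""
    i = 1
    while True:
        yield i
        i += 1

numbers = count_from_1()

# islice requests exactly three items, so the generator runs three steps and pauses.
first_three = list(islice(numbers, 3))

# Asking again resumes where the generator stopped; earlier values are gone for good.
next_two = list(islice(numbers, 2))

print(first_three, next_two)
```

Because nothing is stored, the second call cannot see 1, 2 or 3 again; that is the "read only once" property from the previous slide.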
  • 11. Recap Amazon S3 From Wikipedia: “Amazon S3 (Simple Storage Service) is an online storage web service offered by Amazon Web Services.” It's like a distributed filesystem that is easy to use from other Amazon services, especially from Amazon Elastic MapReduce.
  • 12. Recap EC2 - Elastic Compute Cloud From Wikipedia: “EC2 allows users to rent virtual computers on which to run their own computer applications” So you can rent clusters on demand; no need to maintain, fix and update your ever-breaking cluster of computers. Less headache, moar action. Instances can be purchased on demand for fixed prices, or you can bid on them.
  • 14. MapReduce MapReduce builds on the observation that many tasks have the same structure: computation is applied over a large number of records to generate partial results (Map), which are then aggregated in some fashion (Reduce).
  • 17. Typical (big data) problem
      Map:
        ● Iterate over a large number of records
        ● Extract something of interest from each
      ● Shuffle and sort intermediate results
      Reduce:
        ● Aggregate intermediate results
        ● Generate final output
  • 18. Phases of a MapReduction A MapReduce job has the following steps:
      map(key, value) -> [(key1, value1), (key1, value2)]
      combine
      sort + shuffle
      reduce(key1, [value1, value2]) -> [(keyX, valueY)]
      May happen in parallel, on multiple machines!
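The phases above can be imitated in a single process, which helps when reasoning about what each step receives and returns. The sketch below is only an illustration under that assumption: `run_mapreduce` and its arguments are hypothetical names, and a real framework would spread the same steps across many machines.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer, combiner=None):
    """Run map -> (combine) -> sort + shuffle -> reduce in a single process."""
    # Map: each input record yields zero or more (key, value) pairs.
    mapped = []
    for key, value in records:
        mapped.extend(mapper(key, value))

    # Combine (optional): pre-aggregate map output before the shuffle.
    if combiner is not None:
        local = defaultdict(list)
        for k, v in mapped:
            local[k].append(v)
        mapped = [pair for k in local for pair in combiner(k, local[k])]

    # Sort + shuffle: collect all values that share a key.
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)

    # Reduce: one call per key, with all of that key's values.
    output = []
    for k in sorted(groups):
        output.extend(reducer(k, groups[k]))
    return output


def get_words(_, line):
    # Lowercase and strip punctuation so "Me" and "me," count as the same word.
    for word in line.lower().split():
        yield word.strip('.,'), 1

def sum_words(word, occurrences):
    yield word, sum(occurrences)

lines = [(1, "Me gusta correr, me gustas tu."),
         (2, "Me gusta la lluvia, me gustas tu.")]
counts = dict(run_mapreduce(lines, get_words, sum_words, combiner=sum_words))
print(counts)
```

Passing the reducer as the combiner works here only because summing is associative and commutative; that is not true of every reducer.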
  • 20. Notice: the Reduce phase only starts after all mappers have completed. Yes, there is a synchronization barrier right there. There is no global knowledge: neither mappers nor reducers know what other mappers (or reducers) are processing.
  • 22. Counting Words Counting the number of occurrences of a word in a document collection is quite a big deal. Let's try with a small example: "Me gusta correr, me gustas tu. Me gusta la lluvia, me gustas tu."
  • 23. Counting Words "Me gusta correr, me gustas tu. Me gusta la lluvia, me gustas tu."
      me      4
      gusta   2
      correr  1
      gustas  2
      tu      2
      la      1
      lluvia  1
  • 24. Counting words - in Python
      doc = open('input')
      count = {}
      for line in doc:
          words = line.split()
          for w in words:
              # start at 0 the first time a word is seen
              count[w] = count.get(w, 0) + 1
      Easy, right? Yeah... too easy. Let's split what we do for each line and aggregate, shall we?
  • 25. Counting words - in MapReduce
      def map_get_words(self, key, line):
          for word in line.split():
              yield word, 1

      def reduce_sum_words(self, word, occurrences):
          yield word, sum(occurrences)
  • 26. What is Map's output?
      def map_get_words(self, key, line):
          for word in line.split():
              yield word, 1
      key=1, line="me gusta correr me gustas tu"
        -> ('me', 1) ('gusta', 1) ('correr', 1) ('me', 1) ('gustas', 1) ('tu', 1)
      key=2, line="me gusta la lluvia me gustas tu"
        -> ('me', 1) ('gusta', 1) ('la', 1) ('lluvia', 1) ('me', 1) ('gustas', 1) ('tu', 1)
  • 29. What about shuffle? Think of it as a distributed group-by operation. In the local map instance/node, the framework: ● sorts the map output values, ● groups them by their key, ● sends each group of key and associated values to the reduce node responsible for that key. In the reduce instance/node: ● the framework joins all values associated with a key in a single list - for you, for free.
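As a rough illustration of the group-by idea, the sketch below sorts and groups each mapper's output locally, routes every group to a reducer chosen by hashing the key, and merges the lists on arrival. The reducer count, the CRC32-based `partition` function and all helper names are assumptions made for this example; they are not how any particular framework is implemented.

```python
from itertools import groupby
from operator import itemgetter
from zlib import crc32

NUM_REDUCERS = 2  # hypothetical number of reduce nodes

def partition(key):
    """Pick the reducer responsible for a key (stable hash, for illustration only)."""
    return crc32(key.encode("utf-8")) % NUM_REDUCERS

def local_shuffle(map_output):
    """Map side: sort by key, group, and route each group to its reducer."""
    outgoing = {r: [] for r in range(NUM_REDUCERS)}
    ordered = sorted(map_output, key=itemgetter(0))
    for key, pairs in groupby(ordered, key=itemgetter(0)):
        outgoing[partition(key)].append((key, [value for _, value in pairs]))
    return outgoing

def merge_at_reducer(incoming_groups):
    """Reduce side: join the value lists that arrive for the same key."""
    merged = {}
    for key, values in incoming_groups:
        merged.setdefault(key, []).extend(values)
    return merged

mapper_1 = [("me", 1), ("gusta", 1), ("correr", 1), ("me", 1), ("gustas", 1), ("tu", 1)]
mapper_2 = [("me", 1), ("gusta", 1), ("la", 1), ("lluvia", 1), ("me", 1), ("gustas", 1), ("tu", 1)]

sent = [local_shuffle(mapper_1), local_shuffle(mapper_2)]
reducer_inputs = [
    merge_at_reducer(group for out in sent for group in out[r])
    for r in range(NUM_REDUCERS)
]
```

Since both mappers use the same `partition` function, every occurrence of a key lands on the same reducer, even though neither mapper knows what the other one produced.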
  • 30. What's Shuffle output? or What's Reducer input?
      Key (input)   Values
      correr        [1]
      gusta         [1, 1]
      gustas        [1, 1]
      la            [1]
      lluvia        [1]
      me            [1, 1, 1, 1]
      tu            [1, 1]
      Notice: this table represents a global view. "In real life", each reducer instance only knows about its own keys and values.
  • 31. What's Reducer output?
      def reduce_sum_words(self, word, occurrences):
          yield word, sum(occurrences)
      word      occurrences    output
      correr    [1]            (correr, 1)
      gusta     [1, 1]         (gusta, 2)
      gustas    [1, 1]         (gustas, 2)
      la        [1]            (la, 1)
      lluvia    [1]            (lluvia, 1)
      me        [1, 1, 1, 1]   (me, 4)
      tu        [1, 1]         (tu, 2)
  • 33. MapReduce (main) Implementations Google MapReduce ● C++ ● Proprietary Apache Hadoop ● Java ○ interfaces for anything that runs in the JVM ○ Hadoop Streaming for a pipe-like, programming-language-agnostic interface ● Open source Nobody really cares about the others (for now... ;)
  • 34. Amazon Elastic MapReduce (EMR) Amazon Elastic MapReduce ● Uses Hadoop with extra sauces ● Creates a Hadoop cluster on demand ● It's magical -- except when it fails ● Can be sort of unpredictable sometimes ○ Installing Python modules can fail for no clear reason
  • 35. MrJob It's a Python library for Hadoop Streaming jobs with a really easy-to-use interface ● Can run jobs locally or in EMR. ● Takes care of uploading your Python code to EMR. ● Works best when everything is in a single Python module. ● Easy interface to chain sequences of M/R steps. ● Some basic tools to aid debugging.
  • 36. Counting words Full MrJob Example
      from mrjob.job import MRJob

      class MRWordCounter(MRJob):
          def get_words(self, key, line):
              # mapper: emit (word, 1) for every word in the line
              for word in line.split():
                  yield word, 1

          def sum_words(self, word, occurrences):
              # reducer: add up the ones emitted for this word
              yield word, sum(occurrences)

          def steps(self):
              return [self.mr(self.get_words, self.sum_words),]

      if __name__ == '__main__':
          MRWordCounter.run()
  • 37. MrJob Launching a job
      Running it locally:
        python countwords.py --conf-path=mrjob.conf input.txt
      Running it in EMR (do not forget to set the AWS_ environment variables!):
        python countwords.py --conf-path=mrjob.conf -r emr 's3://ufcgplayground/data/words/*' --no-output --output-dir=s3://ufcgplayground/tmp/bla/
  • 38. MrJob Installing and Environment setup
      Install MrJob using pip or easy_install. Do not, I repeat, DO NOT install the version packaged in Ubuntu/Debian.
        sudo pip install mrjob
      Set up your environment with AWS credentials:
        export AWS_ACCESS_KEY_ID=...
        export AWS_SECRET_ACCESS_KEY=...
      Set up your environment to look for MrJob settings:
        export MRJOB_CONF=<path to mrjob.conf>
  • 39. MrJob Installing and Environment setup
      Use our sample MrJob app as your template:
        git clone https://github.com/chaordic/mr101ufcg.git
      Modify the sample mrjob.conf so that your jobs are labeled with your team's name. It's the Right Thing © to do.
        s3_logs_uri: s3://ufcgplayground/yournamehere/log/
        s3_scratch_uri: s3://ufcgplayground/yournamehere/tmp/
      Profit!
  • 40. Real-life examples
  • 41. Target Categories Objective: Find the most commonly viewed categories per user Input: ● views and orders Patterns used: ● simple aggregation
  • 42. Map input
      zezin, fulano, [telefone, celulares, vivo]
      zezin, fulano, [telefone, celulares, vivo]
      zezin, fulano, [eletro, caos, furadeira]
      lojaX, fulano, [livros, arte, anime]
      lojaX, fulano, [livros, arte, anime]
      lojaX, fulano, [livros, arte, anime]
  • 43. Key: the first two fields of each record, e.g. (zezin, fulano).
  • 44. Sort + Shuffle groups the map output by key. Reduce Input:
      Key               Values
      (zezin, fulano)   [telefone, celulares, vivo]
                        [telefone, celulares, vivo]
                        [eletro, caos, furadeira]
      (lojaX, fulano)   [livros, arte, anime]
                        [livros, arte, anime]
                        [livros, arte, anime]
  • 46. Reduce Output
      Key               Values
      (zezin, fulano)   ([telefone, celulares, vivo], 2)
                        ([eletro, caos, furadeira], 1)
      (lojaX, fulano)   ([livros, arte, anime], 3)
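The Target Categories walkthrough can be sketched as a mapper that keys each record by its first two fields and a reducer that counts category paths. This is a local, single-process approximation: the function names are hypothetical, and reading the first two fields as store and user is an assumption based on the sample data.

```python
from collections import Counter, defaultdict

def map_view(record):
    """Emit ((store, user), category_path) for one input record."""
    store, user, categories = record          # field meaning assumed from the slides
    yield (store, user), tuple(categories)    # tuple, so the path can be counted

def reduce_count(key, category_paths):
    """Count how often each category path appears for this key, most common first."""
    yield key, Counter(category_paths).most_common()

records = [
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["telefone", "celulares", "vivo"]),
    ("zezin", "fulano", ["eletro", "caos", "furadeira"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
    ("lojaX", "fulano", ["livros", "arte", "anime"]),
]

# Stand-in for sort + shuffle: group mapper output by key.
groups = defaultdict(list)
for record in records:
    for key, value in map_view(record):
        groups[key].append(value)

result = dict(pair for key in groups for pair in reduce_count(key, groups[key]))
```

With mrjob, the two functions would become the mapper and reducer methods of a job class, and the grouping loop would be done by the framework.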
  • 47. Filter Expensive Categories Objective: List all categories where a user purchased something expensive. Input: ● Orders (for price and user information) ● Products (for category information) Patterns used: ● merge using reducer
  • 48. Map Input
      BuyOrders:
        lojaX  livro   fulano    R$ 20
        lojaX  iphone  deltrano  R$ 1800
      Products:
        lojaX  livro   [livros, arte, anime]
        lojaX  iphone  [telefone, celulares, vivo]
      We have to merge the two tables above!
  • 49. Common key: the first two fields of each record, e.g. (lojaX, iphone).
  • 50. Map Output
      Map input                                   Key               Value
      lojaX livro fulano R$ 20                    (nothing, it is cheap)
      lojaX iphone deltrano R$ 1800               (lojaX, iphone)   {"usuario": "deltrano"}
      lojaX livro [livros, arte, anime]           (lojaX, livro)    {"cat": [livros...]}
      lojaX iphone [telefone, celulares, vivo]    (lojaX, iphone)   {"cat": [telefone...]}
  • 51. Reduce Input
      Key               Values
      (lojaX, livro)    {"cat": [livros, arte, anime]}
      (lojaX, iphone)   {"usuario": "deltrano"}
                        {"cat": [telefone, celulares, vivo]}
  • 53. Those are the parts we care about: the user and the category list.
  • 54. Reduce Output
      Key                 Values
      (lojaX, deltrano)   [telefone, celulares, vivo]
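The "merge using reducer" pattern can be sketched as two mappers that emit differently shaped values under the same key, plus a reducer that pairs them up. The price threshold, the function names and the in-process grouping are assumptions for illustration; the slides do not say what counts as expensive.

```python
from collections import defaultdict

EXPENSIVE = 1000  # hypothetical threshold in R$; the slides do not give one

def map_order(store, product, user, price):
    """Orders: emit the buyer only when the purchase is expensive."""
    if price >= EXPENSIVE:
        yield (store, product), {"usuario": user}

def map_product(store, product, categories):
    """Products: always emit the category path."""
    yield (store, product), {"cat": categories}

def reduce_join(key, values):
    """Pair every expensive buyer with the categories of the same product."""
    store, _product = key
    values = list(values)
    categories = [v["cat"] for v in values if "cat" in v]
    users = [v["usuario"] for v in values if "usuario" in v]
    for user in users:
        for cat in categories:
            yield (store, user), cat

orders = [("lojaX", "livro", "fulano", 20), ("lojaX", "iphone", "deltrano", 1800)]
products = [("lojaX", "livro", ["livros", "arte", "anime"]),
            ("lojaX", "iphone", ["telefone", "celulares", "vivo"])]

# Stand-in for sort + shuffle: both mappers feed the same grouping by key.
groups = defaultdict(list)
for order in orders:
    for key, value in map_order(*order):
        groups[key].append(value)
for product in products:
    for key, value in map_product(*product):
        groups[key].append(value)

joined = [pair for key in sorted(groups) for pair in reduce_join(key, groups[key])]
```

The key (lojaX, livro) reaches the reducer with a category but no buyer, so it produces no output, which is how the cheap order gets filtered out.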
  • 56. Real datasets, real problems In the following hour we will write code to analyse some real datasets: ● Twitter Dataset (from an article published in WWW'10) ● LastFM Dataset, from The Million Song Datset Supporting code ● available at GitHub, under https://github. com/chaordic/mr101ufcg ● comes with sample data under data for local runs.
  • 57. Twitter Followers Dataset A somewhat big dataset ● 41.7 million profiles ● 1.47 billion social relations (who follows whom) ● 25 GB of uncompressed data Available at s3://mr101ufcg/data/twitter/ ... ● splitted/*.gz full dataset split into small compressed files ● numeric2screen.txt numeric id to original screen name mapping ● followed_by.txt original 25 GB dataset as a single file
  • 58. Twitter Followers Dataset Each line in followed_by.txt has the following format:
      user_id \t follower_id
    For instance:
      12 \t 38
      12 \t 41
      13 \t 47
      13 \t 52
      13 \t 53
      14 \t 56
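As a rough sketch of how such lines could be consumed, the snippet below parses the tab-separated format and counts followers per user in plain Python. It assumes both ids are integers; `parse_line`, `mapper` and `count_followers` are illustrative names, not functions from the course repository.

```python
from collections import Counter


def parse_line(line):
    """Split 'user_id<TAB>follower_id' into a pair of ints."""
    user_id, follower_id = line.split("\t")
    return int(user_id), int(follower_id)


def mapper(line):
    """Emit (user_id, 1) for every follower relation."""
    user_id, _follower_id = parse_line(line)
    yield user_id, 1


def count_followers(lines):
    """Local stand-in for shuffle + reduce: sum the ones per user."""
    counts = Counter()
    for line in lines:
        for user_id, one in mapper(line.strip()):
            counts[user_id] += one
    return dict(counts)


SAMPLE = ["12\t38", "12\t41", "13\t47", "13\t52", "13\t53", "14\t56"]
COUNTS = count_followers(SAMPLE)
```

With the six sample lines from the slide, user 12 ends up with two followers, user 13 with three and user 14 with one.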
  • 59. Million Song Dataset project's Last.fm Dataset A not-so-big dataset ● 943,347 tracks ● 1.2 GB of compressed data Yeah, it is not all that big... Available at s3://mr101ufcg/data/lastfm/ ... ● metadata/*.gz Track metadata information, in JSONProtocol format. ● similars/*.gz Track similarity information, in JSONProtocol format.
  • 60. Million Song Dataset project's Last.fm Dataset JSONProtocol encodes a key-value pair in a single line, using JSON-encoded values separated by a tab character (\t).
      <JSON encoded data> \t <JSON encoded data>
    Example line:
      "TRACHOZ12903CCA8B3" \t {"timestamp": "2011-09-07 22:12:47.150438", "track_id": "TRACHOZ12903CCA8B3", "tags": [], "title": "Close Up", "artist": "Charles Williams"}
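To make the line format concrete, here is a small sketch that decodes and encodes such a line using only Python's standard `json` module. The helper names `decode_line` and `encode_pair` are made up for this example; inside a MrJob job the framework's own protocol classes do this work for you.

```python
import json


def decode_line(line):
    """Turn '<json key><TAB><json value>' into a (key, value) pair."""
    # Split on the first tab only; JSON escapes tabs inside strings,
    # so the first raw tab is always the separator.
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


def encode_pair(key, value):
    """Inverse of decode_line: one line, key and value joined by a tab."""
    return json.dumps(key) + "\t" + json.dumps(value)


LINE = (
    '"TRACHOZ12903CCA8B3"\t'
    '{"timestamp": "2011-09-07 22:12:47.150438", '
    '"track_id": "TRACHOZ12903CCA8B3", "tags": [], '
    '"title": "Close Up", "artist": "Charles Williams"}'
)

KEY, VALUE = decode_line(LINE)
```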
  • 62. Stuff I didn't talk about but that is sorta cool ● Persistent jobs ● Serialization (protocols in MrJob parlance) ● Amazon EMR Console ● Hadoop dashboard (and port 9100)
  • 63. Combiners are just like reducers, but they run right after a Map and right before data is sent over the network during the shuffle. Combiners must... ● be associative: a.(b.c) == (a.b).c ● be commutative: a.b == b.a ● have the same input and output types as your Map output type. Caveats: ● Combiners can be executed zero, one or many times, so don't make your MR depend on them
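The caveat above can be checked with a small local experiment. The sketch below uses a sum as the combining operation, which is associative and commutative, and shows that the final result is the same whether the combiner runs zero, one or two times. The node names and the sample pairs are invented for the illustration.

```python
from collections import defaultdict


def combine_sum(pairs):
    """Sum values per key; input and output are both (key, int) pairs."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return sorted(partial.items())


def reduce_sum(pairs):
    """The reducer applies the same operation to whatever arrives."""
    return dict(combine_sum(pairs))


# Hypothetical map output produced on two different machines.
MAP_OUTPUT_NODE_A = [(12, 1), (12, 1), (13, 1)]
MAP_OUTPUT_NODE_B = [(13, 1), (13, 1), (14, 1)]

# Combiner executed zero times.
WITHOUT = reduce_sum(MAP_OUTPUT_NODE_A + MAP_OUTPUT_NODE_B)

# Combiner executed once on each node.
WITH_ONCE = reduce_sum(
    combine_sum(MAP_OUTPUT_NODE_A) + combine_sum(MAP_OUTPUT_NODE_B)
)

# Combiner executed twice on node A and once on node B.
WITH_TWICE = reduce_sum(
    combine_sum(combine_sum(MAP_OUTPUT_NODE_A))
    + combine_sum(MAP_OUTPUT_NODE_B)
)
```

An operation such as a plain average would not pass this check, since the average of partial averages is in general not the overall average.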
  • 64. Reference & Further reading [1] MapReduce: A Crash Course [2] StackOverflow: The python yield keyword explained [3] Explicando iterables, generators e yield no python [4] MapReduce: Simplified Data Processing on Large Clusters
  • 65. Reference & Further reading [5] MrJob 4.0 - Quick start [6] Amazon EC2 Instance Types
  • 66. Life beyond MapReduce What about reading up on other frameworks for distributed processing of Big Data? ● Spark ● Storm ● GraphLab And don't get me started on NoSQL...
  • 67. Many thanks to... for supporting this course. You know there will be some live, intense, groovy Elastic MapReduce action right after this presentation, right?
  • 68. Questions? Feel free to contact me at tiago.macambira@chaordicsystems.com.br Or follow us @chaordic
  • 70. So, let's write some code? Twitter Dataset ● Count how many followers each user has ● Discover the user with the most followers ● What if I want the top-N most followed? LastFM ● Merge similarity and metadata for tracks ● What is the most "plain" song? ● What is the plainest rock song according only to rock songs?