Introduction To Elastic MapReduce at WHUG

Possible real-world situation
● We have big data and/or a very long,
embarrassingly parallel computation
● Our data may grow fast
● We want to start trying Hadoop as soon as possible

● We do not have our own infrastructure
● We do not have Hadoop administrators
● We have limited funds

Possible solution
Amazon Elastic MapReduce (EMR)
● Hadoop framework running on Amazon's
web-scale infrastructure

EMR Benefits
Elastic (scalable)
● Use one, a hundred, or even thousands of
instances to process even petabytes of data
● Modify the number of instances while the job
flow is running
● Start computation within minutes

EMR Benefits
Easy to use
● No configuration necessary
○ No need to set up hardware and networking, or to
run, manage and tune the performance of a
Hadoop cluster
● Easy-to-use tools and plugins available
○ AWS Web Management Console
○ Command Line Tools by Amazon
○ Amazon EMR API, SDK, Libraries
○ Plugins for IDEs (e.g. Eclipse & Karmasphere Studio
for EMR)

EMR Benefits
Reliable
● Built on Amazon's highly available and
battle-tested infrastructure
● Provision new nodes to replace those that
fail
● Used by e.g.:

EMR Benefits
Cost effective
● Pay for what you use (for each started hour)
● Choose from various instance types to meet
your requirements
● Reserve instances for 1 or 3 years to pay
less per hour
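
Billing "for each started hour" means a partial hour is charged as a full one. The sketch below illustrates only that rounding; the function name and the hourly rate are hypothetical and not taken from Amazon's price list.

```python
import math


def on_demand_cost(runtime_minutes, instances, hourly_rate):
    """Estimate cost when every started hour is billed in full.

    hourly_rate is a hypothetical price per instance-hour, not a real quote.
    """
    billed_hours = math.ceil(runtime_minutes / 60)
    return billed_hours * instances * hourly_rate


# 10 instances running for 61 minutes are each billed for 2 hours.
print(on_demand_cost(61, 10, 0.5))  # 10.0
```

Under this model, a job that finishes just after the hour boundary costs as much as one that uses the whole second hour.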

EMR Overview
Amazon Elastic MapReduce (Amazon EMR)
works in conjunction with
● Amazon EC2 to rent computing instances
(with Hadoop installed)
● Amazon S3 to store input and output data,
scripts/applications and logs

EMR Architectural Overview

* image from the Internet

EC2 Instance Types

* image from Big Data University, Course: "Hadoop and the Amazon Cloud"

EMR Pricing - "On-demand" instances
Standard Family Instances (US East Region)

http://aws.amazon.com/elasticmapreduce/pricing/

EC2 & S3 Pricing - Real-world example
New York Times wanted to host all public
domain articles from 1851 to 1922.
● 11 million articles
● 4 TB of raw image TIFF input data converted
to 1.5 TB of PDF documents
● 100 EC2 Instances rented
● < 24 hours of computation
● $240 paid (not including storage & bandwidth)
● 1 employee assigned to this task
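
As a rough check of these figures, the implied price per instance-hour can be worked out from the numbers on the slide. The sketch assumes every instance was billed for the full 24 hours, which is an upper bound since the slide says "< 24 hours".

```python
instances = 100
hours = 24           # assumed upper bound; the slide says "< 24 hours"
total_paid = 240.0   # USD, excluding storage and bandwidth

instance_hours = instances * hours
implied_rate = total_paid / instance_hours

print(instance_hours)  # 2400
print(implied_rate)    # 0.1
```

So the compute bill corresponds to roughly $0.10 per instance-hour under that assumption.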

EC2 & S3 Pricing - Real-world example

How much did they pay for storage and bandwidth?

S3 Pricing

http://aws.amazon.com/s3/pricing/

EC2 & S3 Pricing Calculator
Simple Monthly Calculator:
http://calculator.s3.amazonaws.com/calc5.html

AWS Free Usage Tier (Per Month)
Available for free to new AWS customers for 12
months following AWS sign-up date e.g.:
● 750 hours of Amazon EC2 Micro Instance
usage
○ 613 MB of memory and 32-bit or 64-bit platform
● 5 GB of Amazon S3 standard storage,
20,000 Get and 2,000 Put Requests
● 15 GB of bandwidth out aggregated across
all AWS services

EMR - Support for Hadoop Ecosystem
Develop and run MapReduce applications using:
● Java
● Streaming (e.g. Ruby, Perl, Python, PHP, R,
or C++)
● Pig
● Hive

HBase can be easily installed using a set of EC2
scripts

EMR - Featured Users

* logos from http://aws.amazon.com/elasticmapreduce/

EMR - Case Study - Yelp

● help people connect
with great local businesses
● share reviews and insights

● as of November 2010:
○ 39 million monthly unique visitors
○ in total, 14 million reviews posted

EMR - Case Study - Yelp
● uses S3 to store daily logs (~100GB/day)
and photos
● uses EMR to power features like
○ People who viewed this also viewed
○ Review highlights
○ Autocomplete in search box
○ Top searches
● implements jobs in Python and uses their
own open-source library, mrjob, to run them
on EMR

mrjob - WordCount example
from mrjob.job import MRJob


class MRWordCounter(MRJob):
    def mapper(self, key, line):
        # emit (word, 1) for every word in the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        # sum all counts emitted for the same word
        yield word, sum(occurrences)


if __name__ == '__main__':
    MRWordCounter.run()
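
To see what the framework does around the mapper and reducer, the same word count can be simulated in plain Python without mrjob or Hadoop. This is only an illustrative sketch: run_local is a hypothetical helper that does the grouping ("shuffle") in memory, whereas a real cluster distributes it across nodes.

```python
from collections import defaultdict


def mapper(line):
    # Same logic as the mrjob mapper: one (word, 1) pair per word.
    for word in line.split():
        yield word, 1


def reducer(word, occurrences):
    # Same logic as the mrjob reducer: total count per word.
    yield word, sum(occurrences)


def run_local(lines):
    # Shuffle: group all mapper values by key.
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    # Reduce: one call per distinct key, in sorted key order.
    results = {}
    for word in sorted(groups):
        for key, total in reducer(word, groups[word]):
            results[key] = total
    return results


print(run_local(["to be or not to be", "to do"]))
# {'be': 2, 'do': 1, 'not': 1, 'or': 1, 'to': 3}
```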

mrjob - run on EMR
$ python wordcount.py \
    --ec2-instance-type c1.medium \
    --num-ec2-instances 10 \
    -r emr 's3://input-bucket/*.txt' > output

Million Song Dataset
● Contains detailed acoustic and contextual
data for one million popular songs
● ~300 GB of data
● Publicly available
○ for download:
http://www.infochimps.com/collections/million-songs
○ for processing using EMR:
http://tbmmsd.s3.amazonaws.com/

Million Song Dataset
Contains data such as:
● Song's title, year and hotness
● Song's tempo, duration, danceability,
energy, loudness, segments count, preview
(URL to mp3 file) and so on
● Artist's name and hotness

Million Song Dataset - Song's density
Song's density* can be defined as the average
number of notes or atomic sounds (called
segments) per second in a song.

density = segmentCnt / duration

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
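
A worked example of the formula, using made-up numbers for a single song (the function name and the values are hypothetical):

```python
def song_density(segment_cnt, duration_seconds):
    """Average number of segments per second of a song."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return segment_cnt / duration_seconds


# A 4-minute song (240 s) with 900 segments.
print(song_density(900, 240.0))  # 3.75
```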

Million Song Dataset - Task*
Simple music recommendation system
● Calculate density for each song
● Find hot songs with similar density

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
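
One possible way to do the second step, once densities are known, is a range lookup over songs sorted by density. This is a sketch under assumptions, not the method used in the talk: the tolerance, the hotness threshold, the tuple layout and the sample songs are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def similar_hot_songs(songs, target_density, tolerance=0.1, min_hotness=0.5):
    """Return titles of hot songs whose density is within +/- tolerance.

    songs: list of (density, hotness, title) tuples, sorted by density.
    """
    densities = [song[0] for song in songs]
    lo = bisect_left(densities, target_density - tolerance)
    hi = bisect_right(densities, target_density + tolerance)
    return [title for _, hotness, title in songs[lo:hi] if hotness >= min_hotness]


# Hypothetical sample data: (density, hotness, title)
songs = sorted([
    (3.70, 0.90, "Song A"),
    (3.75, 0.20, "Song B"),
    (3.80, 0.80, "Song C"),
    (5.10, 0.95, "Song D"),
])
print(similar_hot_songs(songs, 3.75))  # ['Song A', 'Song C']
```

Song B falls inside the density window but is dropped because its hotness is below the threshold; Song D is hot but too dense.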

Million Song Dataset - MapReduce
Input data
● 339 files
● Each file contains ~3 000 songs
● Each song is represented by one line in
input file
● Fields are separated by a tab character

Mapper
● Reads the song's data from each line of the
input text
● Calculates the song's density
● Emits the song's density as the key, with some
other details as the value

<line_offset, song_data> ->
<density, (artist_name, song_title, song_url)>

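public void map(LongWritable key, Text value,
        OutputCollector<FloatWritable, TripleTextWritable> output,
        Reporter reporter) throws IOException {

    // parse the tab-separated song record from the current input line
    song.parseLine(value.toString());
    // skip records with non-positive tempo or duration (also avoids division by zero)
    if (song.tempo > 0 && song.duration > 0) {
        // calculate density: segments per second
        float density = ((float) song.segmentCnt) / song.duration;

        densityWritable.set(density);
        songWritable.set(song.artistName, song.title, song.preview);

        // emit <density, (artist_name, song_title, song_url)>
        output.collect(densityWritable, songWritable);
    }
}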
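
Since EMR also supports Streaming jobs, the same mapper logic could be sketched in Python. The column order below is an assumption made for illustration; the real layout of the input files is not given on the slides, and map_line is a hypothetical helper name.

```python
def map_line(line):
    """Turn one tab-separated song record into (density, details), or None.

    ASSUMED column order (hypothetical, not the dataset's documented layout):
    artist_name, title, preview_url, tempo, duration, segment_cnt
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        return None  # malformed record
    artist, title, url = fields[0], fields[1], fields[2]
    try:
        tempo = float(fields[3])
        duration = float(fields[4])
        segment_cnt = int(fields[5])
    except ValueError:
        return None  # non-numeric field
    # Same filter as the Java mapper: skip non-positive tempo or duration.
    if tempo <= 0 or duration <= 0:
        return None
    return segment_cnt / duration, (artist, title, url)


sample = "Some Artist\tSome Title\thttp://example.com/a.mp3\t120.0\t200.0\t500\n"
print(map_line(sample))
# (2.5, ('Some Artist', 'Some Title', 'http://example.com/a.mp3'))
```

In an actual Streaming job, a small loop over sys.stdin would call map_line for each line and print the key and value separated by a tab.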

Reducer
● Identity Reducer
● Each reducer gets density values from a
different range: [i, i+1)*,**

<density, [(artist_name, song_title, song_url)]> ->
<density, (artist_name, song_title, song_url)>

* thanks to a custom Partitioner
** not optimal partitioning (partitions are not balanced)
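
The range-based partitioning described above can be sketched as follows. The talk's actual Partitioner is not shown, so this is only an illustration: clamping out-of-range densities to the last reducer is an assumption, and the sample densities are made up.

```python
import math
from collections import Counter


def density_partition(density, num_reducers):
    """Send densities in [i, i+1) to reducer i.

    Assumption: densities beyond the last range are clamped to the last reducer.
    """
    index = int(math.floor(density))
    return max(0, min(index, num_reducers - 1))


# Hypothetical densities, clustered between 2 and 4 segments per second.
densities = [0.4, 2.1, 2.5, 2.9, 3.3, 3.7, 3.9, 4.2, 11.0]
load = Counter(density_partition(d, 5) for d in densities)
print(sorted(load.items()))  # [(0, 1), (2, 3), (3, 3), (4, 2)]
```

With this sample, reducer 1 receives nothing while reducers 2 and 3 receive most of the records, which illustrates why fixed-width ranges give unbalanced partitions.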

Demo - used software
● Karmasphere Studio for EMR (Eclipse
plugin)
○ graphical environment that supports the complete
lifecycle for developing for Amazon Elastic
MapReduce, including prototyping, developing,
testing, debugging, deploying and optimizing
Hadoop jobs
(http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)

Demo - used software
● Karmasphere Studio for EMR (Eclipse
plugin)

images from:
http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html

Please watch the video on the WHUG channel on
YouTube:

http://www.youtube.com/watch?v=Azwilbn8GCs
