Perl on Amazon Elastic MapReduce
A Gentle Introduction to MapReduce
• Distributed computing model
• Mappers process the input and forward intermediate results to reducers.
• Reducers aggregate these intermediate results, and emit the final results.
$ map | sort | reduce
MapReduce
• Input data sent to mappers as (k, v) pairs.
• After processing, mappers emit (k_out, v_out).
• These pairs are sorted and sent to reducers.
• All (k_out, v_out) pairs for a given k_out are sent to a single reducer.
MapReduce


• Reducers get (k, [v_1, v_2, …, v_n]).
• After processing, the reducer emits a (k_f, v_f) per result.
MapReduce


We wanted to have a world map showing where people were starting our games (like Mozilla Glow).
Glowfish
MapReduce
• Input: ( epoch, IP address )
• Mappers group these into 5-minute blocks, and emit ( blockId, IP address )
• Reducers get ( blockId, [ ip_1, ip_2, …, ip_n ] )
• Do a geo lookup and emit ( epoch, [ ( lat1, lon1 ), ( lat2, lon2 ), … ] )
$ map | sort | reduce
Apache Hadoop
• Distributed programming framework
• Implements MapReduce
• Does all the usual distributed-programming heavy lifting for you
• Highly fault-tolerant, with automatic task re-assignment in case of failure
• You focus on mappers and reducers
Apache Hadoop
• Native Java API
• Streaming API, which can use mappers and reducers written in any programming language.
• Distributed file system (HDFS)
• Distributed Cache
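The streaming contract is nothing more than lines on STDIN and tab-separated key/value lines on STDOUT. As a rough sketch (not part of the original deck), a word-count mapper in Perl is just:

#!/usr/bin/env perl
# Minimal streaming-mapper sketch: read lines on STDIN, emit "word<TAB>1" pairs.
use strict;
use warnings;

while ( my $line = <STDIN> ) {
    chomp $line;
    print "$_\t1\n" for grep { length } split /\s+/, $line;
}

Hadoop handles the sort; the matching reducer only has to sum consecutive counts per word.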
Amazon Elastic MapReduce
• On-demand Hadoop clusters running on EC2 instances.
• Improved S3 support for storage of input and output data.
• Build workflows by sending jobs to a cluster.
EMR Downsides
• No control over the machine images.
• Perl 5.8.8
• Ephemeral: when your cluster is shut down (or dies), HDFS is gone.
• HDFS is not available at cluster-creation time.
• Debian
Streaming vs. Native


$ cat | map | sort | reduce
Streaming vs. Native

Instead of
( k, [ v1, v2, …, vn ] )
reducers get
(( k1, v1 ), …, ( k1, vn ), ( k2, v1 ), …, ( k2, vm ))
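In practice this means a streaming reducer has to detect the key change itself. A minimal sketch of that pattern (not from the deck; it assumes tab-separated, key-sorted input, and the group handler is only illustrative):

#!/usr/bin/env perl
# Streaming-reducer skeleton sketch: values for a key arrive on consecutive
# lines, so accumulate until the key changes, then flush the finished group.
use strict;
use warnings;

my ( $current_key, @values );

while ( my $line = <STDIN> ) {
    chomp $line;
    my ( $key, $value ) = split /\t/, $line, 2;
    if ( defined $current_key && $key ne $current_key ) {
        handle_group( $current_key, \@values );
        @values = ();
    }
    $current_key = $key;
    push @values, $value;
}
handle_group( $current_key, \@values ) if defined $current_key;    # last group

sub handle_group {
    my ( $key, $values ) = @_;
    print "$key\t" . scalar( @{$values} ) . "\n";    # e.g. emit a count per key
}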
Composite Keys
• Reducers receive both keys and values sorted
• Merge 3 tables:
  userid, 0, …          # customer info
  userid, 1, …          # payments history
  userid, recordid1, …  # clickstream
  userid, recordid2, …  # clickstream
Streaming vs. Native

• Limited API
• About a 7-10% increase in run time
• About a 1000% decrease in development time (as reported by a non-representative sample of developers)
Where's My Towel?
• Tasks run chrooted in a non-deterministic location.
• It's easy to store files in HDFS when submitting a job, but impossible to store directory trees.
• For native Java jobs, your dependencies get packaged in the JAR alongside your code.
Streaming's Little Helpers
Define your inputs and outputs:
--input s3://events/2011-30-10

--output s3://glowfish/output/2011-30-10
Streaming's Little Helpers
You can use any class in Hadoop's classpath, e.g. as a comparator, partitioner, or codec; several come bundled:
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
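For a composite key like the userid merge shown earlier, the job usually also has to say how many output fields form the key and which of them drive partitioning and sorting. A hedged example (option names as in the Hadoop 0.20 streaming docs, values purely illustrative):

-D stream.num.map.output.key.fields=2
-D mapred.text.key.partitioner.options=-k1,1
-D mapred.text.key.comparator.options="-k1,1 -k2,2n"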
Streaming's Little Helpers
• Use S3 to store…
  • input data
  • output data
  • supporting data (e.g., Geo-IP)
  • your code
Mapper and Reducer

To specify the mapper and reducer to be
used in your streaming job, you can point
Hadoop to S3:
--mapper s3://glowfish/bin/mapper.pl

--reducer s3://glowfish/bin/reducer.pl
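Put together, a streaming step could be submitted with Amazon's elastic-mapreduce command-line tool roughly like this (a sketch only; flag names vary between tool versions, and the paths are the example ones used throughout this deck):

elastic-mapreduce --create --name "glowfish" --stream \
  --input   s3://events/2011-30-10 \
  --output  s3://glowfish/output/2011-30-10 \
  --mapper  s3://glowfish/bin/mapper.pl \
  --reducer s3://glowfish/bin/reducer.pl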
Support Files
When specifying a file to store in the Distributed Cache, a URI fragment will be used as a symlink in the local filesystem:
 -cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat
Dependencies


But if you store an archive (Zip, TGZ, or JAR) in the Distributed Cache…
   -cacheArchive s3://glowfish/lib/perllib.tgz
Dependencies


But if you store an archive (Zip, TGZ, or JAR) in the Distributed Cache…
   -cacheArchive s3://glowfish/lib/perllib.tgz#locallib
Dependencies


Hadoop will uncompress it and create a link to whatever directory it created, in the task's working directory.
Dependencies


Which is where it stores your mapper and reducer.
Dependencies


use lib qw/ locallib /;
Mapper
#!/usr/bin/env perl

use strict;
use warnings;

use lib qw/ locallib /;

use JSON::PP;

my $decoder    = JSON::PP->new->utf8;
my $missing_ip = 0;

while ( <> ) {
  chomp;
  next unless /load_complete/;
  my @line = split /\t/;
  # column 1 is a millisecond epoch: convert to seconds, then to a 5-minute slot
  my ( $epoch, $payload ) = ( int( $line[1] / 1000 / 300 ), $line[5] );
  my $json = $decoder->decode( $payload );
  if ( ! exists $json->{'ip'} ) {
    $missing_ip++;
    next;
  }
  print "$epoch\t$json->{'ip'}\n";
}

print STDERR "reporter:counter:Job Counters,MISSING_IP,$missing_ip\n";
Reducer
#!/usr/bin/env perl

use strict;
use warnings;
use lib qw/ locallib /;

use Geo::IP;
use Regexp::Common qw/ net /;
use Readonly;

Readonly::Scalar my $TAB => "\t";
my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
  or die "Could not open GeoIP database: $!\n";

my $format_errors      = 0;
my $invalid_ip_address = 0;
my $geo_lookup_errors  = 0;

my $time_slot;
my $previous_time_slot = -1;
Reducer
while ( <> ) {
  chomp;

  my @cols = split( $TAB );
  if ( scalar @cols != 2 ) {
    $format_errors++;
    next;
  }
  my $ip_addr;
  ( $time_slot, $ip_addr ) = @cols;   # $time_slot is file-scoped: it is also used after the loop
  if ( $previous_time_slot != -1 &&
       $time_slot != $previous_time_slot ) {
    # we've entered a new time slot, write the previous one out
    emit( $time_slot, $previous_time_slot );
  }

  if ( $ip_addr !~ /$RE{net}{IPv4}/ ) {
    $invalid_ip_address++;
    $previous_time_slot = $time_slot;
    next;
  }
Reducer
  my $geo_record = $geo->record_by_addr( $ip_addr );
  if ( ! defined $geo_record ) {
    $geo_lookup_errors++;
    $previous_time_slot = $time_slot;
    next;
  }

  # update entry for time slot with lat and lon

  $previous_time_slot = $time_slot;
} # while ( <> )

# flush the final time slot (emit() itself is not shown in the deck)
emit( $time_slot + 1, $time_slot );

print STDERR "reporter:counter:Job Counters,FORMAT_ERRORS,$format_errors\n";
print STDERR "reporter:counter:Job Counters,INVALID_IPS,$invalid_ip_address\n";
print STDERR "reporter:counter:Job Counters,GEO_LOOKUP_ERRORS,$geo_lookup_errors\n";
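Since streaming mappers and reducers are plain filters, the whole flow can be smoke-tested locally before renting a cluster. A hedged sketch, assuming a small sample of the raw event log, a GeoLiteCity.dat in the working directory, and the emit() routine that the deck elides:

$ cat sample-events.log | ./mapper.pl | sort | ./reducer.pl > slots.out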
Recap
• EMR clusters are volatile!
Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single reducer, sorted.
Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single reducer, sorted.
• Use S3 for everything, and plan your dataflow ahead.
( On data )
• Store it wisely, e.g. using a directory structure like the following to get free partitioning in Hive and other tools:
       s3://bucket/path/data/run_date=2011-11-12
• Don't worry about getting the data out of S3; you can always write a simple job that does that and run it at the end of your workflow.
Recap
• EMR clusters are volatile.
• Values for a given key will all go to a single reducer, sorted. Watch for the key changing.
• Use S3 for everything, and plan your dataflow ahead.
• Make carton a part of your life, and especially of your build tool's.
( carton )
• Shipwright for humans
• Reads dependencies from Makefile.PL
• Installs them locally to your app
• Deploy your stuff, including carton.lock
• Run carton install --deployment
• Tar result and upload to S3
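A minimal Makefile.PL for the mapper and reducer above might look like this sketch (module list taken from the scripts; version constraints omitted), which is what carton reads before installing everything under local/:

use ExtUtils::MakeMaker;

# Dependencies carton will read and install under ./local (a sketch, not from the deck)
WriteMakefile(
    NAME      => 'Glowfish',
    VERSION   => '0.01',
    PREREQ_PM => {
        'JSON::PP'       => 0,
        'Geo::IP'        => 0,
        'Regexp::Common' => 0,
        'Readonly'       => 0,
    },
);

Tarring the installed module tree then gives the perllib.tgz that -cacheArchive unpacks as locallib.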
URLs

• The MapReduce Paper
  http://labs.google.com/papers/mapreduce.html
• Apache Hadoop
  http://hadoop.apache.org/
• Amazon Elastic MapReduce
  http://aws.amazon.com/elasticmapreduce/
URLs

• Hadoop Streaming Tutorial (Apache)
  http://hadoop.apache.org/common/docs/r0.20.2/streaming.html
• Hadoop Streaming How-To (Amazon)
  http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/CreateJobFlowStreaming.html
URLs

• Amazon EMR Perl Client Library
  http://aws.amazon.com/code/Elastic-MapReduce/2309
• Amazon EMR Command-Line Tool
  http://aws.amazon.com/code/Elastic-MapReduce/2264
That's All, Folks!

Slides available at
http://slideshare.net/pfig/perl-on-amazon-elastic-mapreduce

me@pedrofigueiredo.org



Editor's Notes

2. Sort/shuffle between the two steps, guaranteeing that all mapper results for a single key go to the same reducer, and that workload is distributed evenly.
4. The sorting guarantees that all values for a given key are sent to a single reducer.
6. Mozilla Glow tracked Firefox 4 downloads on a world map, in near real-time.
8. On a 50-node cluster, processing ~3BN events takes 11 minutes, including data transfers. Two hours' worth takes 3 minutes, so we can easily have data from 5 minutes ago. One day to modify the Glow protocol, one day to build. Everything stored on S3.
11. Serialisation, heartbeat, node management, directory, etc. Speculative task execution: the first one to finish wins. Potentially very simple and contained code.
12. You supply the mapper, reducer, and driver code.
13. S3 gives you virtually unlimited storage with very high redundancy. S3 performance: ~750MB of uncompressed data (110-byte rows -> ~7M rows/sec). All this is controlled using a REST API. Jobs are called 'steps' in EMR lingo.
14. No way to customise the image and, e.g., install your own Perl. So it's a good idea to store the final results of a workflow in S3. No way to store dependencies in HDFS when the cluster is created.
17. If you set a value to 0, you'll know that it's going to be the first (k, v) the reducer will see, 1 will be the second, etc. When the userid changes, it's a new user.
18. E.g., no control over output file names, many of the API settings can't be configured programmatically (cmd-line switches), no separate mappers per input, etc. Because reducer input is also sorted on keys, when the key changes you know you won't be seeing any more of those. Might need to keep track of the current key, to use as the previous.
19. So how do you get all the CPAN goodness you know and love in there? HDFS operations are limited to copy, move, and delete, and the host OS doesn't see it - no untar'ing!
20. Can have multiple inputs.
21. That -D is a Hadoop define, not a JVM system property definition.
22. On a streaming job you specify the programs to use as mapper and reducer.
25. In the unknown directory where the task is running, making it accessible to it.
33. At the end of the job, Hadoop aggregates counters from all tasks.
47. Hive partitioning.