2. Who Am I?
• Pig committer and PMC member
• Original member of the engineering team at Yahoo that took Pig from research to production
• Author of Programming Pig from O'Reilly
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Tech lead of the team at Hortonworks that does Pig, Hive, and HCatalog
• Member of the Apache Software Foundation and the Incubator PMC
4–8. What is Pig?
• A data flow language
users = load 'users';
grouped = group users by zipcode;
byzip = foreach grouped generate zipcode, COUNT(users);
store byzip into 'count_by_zip';
• that translates a script into a series of MapReduce jobs and then executes those jobs:
MapReduce job:
Input: ./users
Map: project(zipcode, userid)
Shuffle key: zipcode
Reduce: count
Output: ./count_by_zip
• Pig lives on the client machine; there is nothing to install on the cluster
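Because the runtime lives entirely on the client, launching a job is just a local command against a script; a minimal sketch (script name illustrative):
pig count_by_zip.pig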
10. New Features in Pig 0.10
• Released in April 2012
• This release was a collaborative effort, with major features added by Twitter, Yahoo, Hortonworks, and Google Summer of Code students
• Not all the new features are covered here; see http://hortonworks.com/blog/new-features-in-apache-pig-0-10/ for a complete list
11. Ruby UDFs
• In Pig 0.8 and 0.9, UDFs could be written in Java and Python; now Ruby is also supported
• Evaluated via JRuby
power.pig:
register 'power.rb' using jruby as rf;
data = load 'input' as (a:int, b:int);
powered = foreach data generate rf.power(a, b);
power.rb:
require 'pigudf'
class Power < PigUdf
outputSchema "a:int"
def power(mantissa, exponent)
return nil if mantissa.nil? or exponent.nil?
mantissa**exponent
end
end
• Can also do Algebraic and Accumulator UDFs in Ruby (like in Java, but unlike in Python)
12. PigStorage With Schemas
• By default, PigStorage (the default load/store function) does not use a schema
• In 0.10, it can store a schema if instructed to
• Schema stored in side file .pig_schema
• If a schema is available, it is automatically used
A = load 'studenttab10k' as
(name:chararray, age:int, gpa:double);
store A into 'foo' using PigStorage('\t', '-schema');
A = load 'foo';
B = foreach A generate name, age;
13. Additional UDF Improvements
• Automatic generation of simpler UDFs
– If you implement an Algebraic UDF, Pig can generate Accumulator & basic UDFs
– If you implement an Accumulator UDF, Pig can generate a basic UDF
• JSON load and store functions (see the sketch at the end of this slide)
– Requires a schema that describes the JSON; does not intuit the schema from the data
– Schema stored in a side file, so no need to declare it in the script
• Built in UDFs for Bloom filters
– BuildBloom builds a bloom filter for one or more columns for a given input
– Can be constructed to be a certain size (# of hash functions and # of bits) or based on the desired false positive rate
– Bloom takes the file generated by BuildBloom and applies it to an input
define bb BuildBloom('Hash.JENKINS_HASH', '1000', '0.01');
A = load 'users';
B = group A all;
C = foreach B generate bb(A.name);
store C into 'mybloom';
define bloom Bloom('mybloom');
A = load 'transactions';
B = filter A by bloom(name);
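A minimal sketch of the new JSON functions (file names and schema are illustrative):
a = load 'input.json' using JsonLoader('name:chararray, age:int');
store a into 'output' using JsonStorage();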
14. Language Improvements
• Boolean is now supported as a first-class data type
a = load 'foo' as (n:chararray, a:int, g:double, b:boolean);
• Default split destination: otherwise
– Records that do not match any of the ifs go to this destination
– Records can still go to multiple ifs
split a into b if id < 3, c if id > 5, d otherwise;
• Maps, tuples, and bags can now be generated without UDFs:
B = foreach A generate [key#value], (col1, col2), {(col1), (col2)};
• Register a collection of jars at once with globs:
– Uses HDFS globbing syntax
register '/home/me/jars/*.jar';
15. Performance Improvements
• Hash-based aggregation
– Up to 50% faster aggregation for data sets with a small number of distinct keys
– The Pig runtime automatically selects the aggregation implementation
• Push limit to loader
– When a limit can be applied at load time, Pig now stops reading records once the limit is reached (see the sketch below)
– Does not work after group, join, distinct, or order by
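A minimal sketch of a limit Pig can push into the loader (file and field names illustrative):
users = load 'users' as (name:chararray, zip:chararray);
first100 = limit users 100; -- reading stops once 100 records are produced
dump first100;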
16. Current Work in Pig – Not Yet Released
• Work done on the internal data representation and the map → reduce transfer to lower the memory footprint and enhance performance
• Datetime type has been added
• Development of CUBE, ROLLUP, and RANK operators – patches posted and being reviewed
• Pig running natively on Windows – in the process of posting patches
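Since these operators are unreleased and under review, the syntax may still change; a rough sketch of how they might look (relation and field names illustrative):
cubed = cube sales by cube(product, region);
ranked = rank students by gpa desc;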
17. Pig with Hadoop 2.0
• Pig 0.10 is the first release of Pig that works with Hadoop 2.0 (formerly known as Hadoop 0.23)
• By default, Pig 0.10 works with Hadoop 1.0
• Must be recompiled to work with Hadoop 2.0
– All the pieces are included with the released code; you just need to run ant with the right flags set (see below)
• Does not yet take advantage of new features in Hadoop 2.0
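The flag in question is the hadoopversion property on the ant build; something like the following (verify against the build instructions for your release):
ant -Dhadoopversion=23 jar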
19. Pig Execution Today
users = load 'users';
grouped = group users by zipcode;
byzip = foreach grouped generate zipcode, COUNT(users);
store byzip into 'count_by_zip';
MapReduce job:
Input: ./users
Map: project(zipcode, userid)
Shuffle key: zipcode
Reduce: count
Output: ./count_by_zip
• All planning is done up front
• No use is made of any statistics or information that we have
• Pig (mostly) uses vanilla MapReduce
20–27. Re-optimize on the Fly
[Figure: a planned pipeline of MapReduce jobs, shown executing step by step; the legend distinguishes planned jobs from executed jobs.]
• Observe the output size of the two completed jobs feeding the join (50G and 1G); notice that one of them is small enough to fit in memory
• The join can then be changed to an FR (fragment-replicate) join, which is map only, and combined with the last MR job
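Today the FR join shown above has to be requested by hand with the 'replicated' hint; a minimal sketch of what the optimizer would choose automatically (file and field names illustrative):
big = load 'transactions' as (name:chararray, amount:double);
small = load 'users' as (name:chararray, zip:chararray);
-- every relation after the first is loaded into memory on each map task
joined = join big by name, small by name using 'replicated';
store joined into 'joined_output';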
28–30. Modify MapReduce
users = load 'users';
grouped = group users by zipcode;
byzip = foreach grouped generate zipcode, COUNT(users) as cnt;
sorted = order byzip by cnt;
store sorted into 'count_by_zip';
[Figure: the script compiles to two MapReduce jobs: Map → Reduce, then Map → Reduce.]
• The second map is useless: whatever can be done in it can always be done in the preceding reduce, and having it costs an extra write to and read from HDFS
[Figure: with a modified MapReduce, the plan becomes Map → Reduce → Reduce.]
31. Today
[Figure: Pig, Hive, and other tools each have their own Plan → Optimize → Execute stack.]
• Different in the front end; very similar in the backend
• With HCatalog, different apps can share metadata (see the sketch below)
• No ability to share UDFs, operators, or innovations between projects
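A sketch of the metadata sharing HCatalog enables (table name illustrative; at the time, the load function lived in org.apache.hcatalog.pig):
raw = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
-- the schema comes from the HCatalog metastore, so no 'as' clause is needed
recent = filter raw by datestamp == '20120601';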
32. Data Virtual Machine
[Figure: Pig, Hive, and others each keep their own Plan layer; a shared Data Virtual Machine provides the Optimize and Execute layers.]
33. Questions & Answers
TRY
download at hortonworks.com
LEARN
Hortonworks University
FOLLOW
twitter: @hortonworks
Facebook: facebook.com/hortonworks
MORE EVENTS
hortonworks.com/events