2. Who Am I?
• Pig committer and PMC member
• Original member of the engineering team at Yahoo that took Pig from research to production
• Author of Programming Pig from O'Reilly
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Tech lead of the team at Hortonworks that does Pig, Hive, and HCatalog
• Member of the Apache Software Foundation and the Incubator PMC
4–8. What is Pig?
• A data flow language
users = load 'users';
grouped = group users by zipcode;
byzip = foreach grouped generate zipcode, COUNT(users);
store byzip into 'count_by_zip';
• that translates a script into a series of MapReduce jobs and then executes those jobs:
MapReduce job:
Input: ./users
Map: project(zipcode, userid)
Shuffle key: zipcode
Reduce: count
Output: ./count_by_zip
• Pig lives on the client machine; there is nothing to install on the cluster
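Because the runtime lives entirely on the client, launching a job is just a local command against a script; a minimal sketch (script name illustrative):
pig count_by_zip.pig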
10. New Features in Pig 0.10
• Released in April 2012
• This release was a collaborative effort, with major features added by Twitter, Yahoo, Hortonworks, and Google Summer of Code students
• Not all the new features are covered here; see http://hortonworks.com/blog/new-features-in-apache-pig-0-10/ for a complete list
11. Ruby UDFs
• In Pig 0.8 and 0.9, UDFs could be written in Java and Python; now Ruby is also supported
• Evaluated via JRuby
power.pig:
register 'power.rb' using jruby as rf;
data = load 'input' as (a:int, b:int);
powered = foreach data generate rf.power(a, b);
power.rb:
require 'pigudf'
class Power < PigUdf
outputSchema "a:int"
def power(mantissa, exponent)
return nil if mantissa.nil? or exponent.nil?
mantissa**exponent
end
end
• Can also do Algebraic and Accumulator UDFs in Ruby (like in Java, but unlike in Python)
12. PigStorage With Schemas
• By default, PigStorage (the default load/store function) does not use a schema
• In 0.10, it can store a schema if instructed to
• Schema stored in side file .pig_schema
• If a schema is available, it is automatically used
A = load 'studenttab10k' as
(name:chararray, age:int, gpa:double);
store A into 'foo' using PigStorage('\t', '-schema');
A = load 'foo';
B = foreach A generate name, age;
13. Additional UDF Improvements
• Automatic generation of simpler UDFs
– If you implement an Algebraic UDF, Pig can generate Accumulator & basic UDFs
– If you implement an Accumulator UDF, Pig can generate a basic UDF
• JSON load and store functions (see the sketch at the end of this slide)
– Requires a schema that describes the JSON; does not intuit the schema from the data
– Schema stored in a side file, so no need to declare it in the script
• Built in UDFs for Bloom filters
– BuildBloom builds a bloom filter for one or more columns for a given input
– Can be constructed to be a certain size (# of hash functions and # of bits) or based on the desired false positive rate
– Bloom takes the file generated by BuildBloom and applies it to an input
define bb BuildBloom('Hash.JENKINS_HASH', '1000', '0.01');
A = load 'users';
B = group A all;
C = foreach B generate bb(A.name);
store C into 'mybloom';
define bloom Bloom('mybloom');
A = load 'transactions';
B = filter A by bloom(name);
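A minimal sketch of the new JSON functions (file names and schema are illustrative):
a = load 'input.json' using JsonLoader('name:chararray, age:int');
store a into 'output' using JsonStorage();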
14. Language Improvements
• Boolean is now supported as a first-class data type
a = load 'foo' as (n:chararray, a:int, g:double, b:boolean);
• Default split destination: otherwise
– Records that do not match any of the ifs go to this destination
– Records can still go to multiple ifs
split a into b if id < 3, c if id > 5, d otherwise;
• Maps, tuples, and bags can now be generated without UDFs:
B = foreach A generate [key#value], (col1, col2), {(col1), (col2)};
• Register a collection of jars at once with globs:
– Uses HDFS globbing syntax
register '/home/me/jars/*.jar';
15. Performance Improvements
• Hash-based aggregation
– Up to 50% faster aggregation for data sets with a small number of distinct keys
– The Pig runtime automatically selects the aggregation implementation
• Push limit to loader
– When a limit can be applied at load time, Pig now stops reading records once the limit is reached (see the sketch below)
– Does not work after group, join, distinct, or order by
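A minimal sketch of a limit Pig can push into the loader (file and field names illustrative):
users = load 'users' as (name:chararray, zip:chararray);
first100 = limit users 100; -- reading stops once 100 records are produced
dump first100;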
16. Current Work in Pig – Not Yet Released
• Work done on the internal data representation and the map → reduce transfer to lower the memory footprint and enhance performance
• Datetime type has been added
• Development of CUBE, ROLLUP, and RANK operators – patches posted and being reviewed
• Pig running natively on Windows – in the process of posting patches
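Since these operators are unreleased and under review, the syntax may still change; a rough sketch of how they might look (relation and field names illustrative):
cubed = cube sales by cube(product, region);
ranked = rank students by gpa desc;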
17. Pig with Hadoop 2.0
• Pig 0.10 is the first release of Pig that works with Hadoop 2.0 (formerly known as Hadoop 0.23)
• By default, Pig 0.10 works with Hadoop 1.0
• Must be recompiled to work with Hadoop 2.0
– All the pieces are included with the released code; you just need to run ant with the right flags set (see below)
• Does not yet take advantage of new features in Hadoop 2.0
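The flag in question is the hadoopversion property on the ant build; something like the following (verify against the build instructions for your release):
ant -Dhadoopversion=23 jar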
19. Pig Execution Today
users = load 'users';
grouped = group users by zipcode;
byzip = foreach grouped generate zipcode, COUNT(users);
store byzip into 'count_by_zip';
MapReduce job:
Input: ./users
Map: project(zipcode, userid)
Shuffle key: zipcode
Reduce: count
Output: ./count_by_zip
• All planning is done up front
• No use is made of any statistics or information that we have
• Pig (mostly) uses vanilla MapReduce
20–27. Re-optimize on the Fly
[Figure: a planned pipeline of MapReduce jobs, shown executing step by step; the legend distinguishes planned jobs from executed jobs.]
• Observe the output size of the two completed jobs feeding the join (50G and 1G); notice that one of them is small enough to fit in memory
• The join can then be changed to an FR (fragment-replicate) join, which is map only, and combined with the last MR job
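Today the FR join shown above has to be requested by hand with the 'replicated' hint; a minimal sketch of what the optimizer would choose automatically (file and field names illustrative):
big = load 'transactions' as (name:chararray, amount:double);
small = load 'users' as (name:chararray, zip:chararray);
-- every relation after the first is loaded into memory on each map task
joined = join big by name, small by name using 'replicated';
store joined into 'joined_output';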
28–30. Modify MapReduce
users = load 'users';
grouped = group users by zipcode;
byzip = foreach grouped generate zipcode, COUNT(users) as cnt;
sorted = order byzip by cnt;
store sorted into 'count_by_zip';
[Figure: the script compiles to two MapReduce jobs: Map → Reduce, then Map → Reduce.]
• The second map is useless: whatever can be done in it can always be done in the preceding reduce, and having it costs an extra write to and read from HDFS
[Figure: with a modified MapReduce, the plan becomes Map → Reduce → Reduce.]
31. Today
[Figure: Pig, Hive, and other tools each have their own Plan → Optimize → Execute stack.]
• Different in the front end; very similar in the backend
• With HCatalog, different apps can share metadata (see the sketch below)
• No ability to share UDFs, operators, or innovations between projects
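A sketch of the metadata sharing HCatalog enables (table name illustrative; at the time, the load function lived in org.apache.hcatalog.pig):
raw = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
-- the schema comes from the HCatalog metastore, so no 'as' clause is needed
recent = filter raw by datestamp == '20120601';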
32. Data Virtual Machine
[Figure: Pig, Hive, and others each keep their own Plan layer; a shared Data Virtual Machine provides the Optimize and Execute layers.]
33. Questions & Answers
TRY
download at hortonworks.com
LEARN
Hortonworks University
FOLLOW
twitter: @hortonworks
Facebook: facebook.com/hortonworks
MORE EVENTS
hortonworks.com/events