3. Pig Dev Philosophy
Pigs Eat Anything.
Pigs Live Anywhere.
Pigs Are Domestic Animals.
Pigs Fly.
http://pig.apache.org/philosophy.html
4. Data Type
Name Description Example
Scalar int Signed 32-bit integer 10
long Signed 64-bit integer 10L
float 32-bit floating point 10.5F
double 64-bit floating point 10.5
Arrays chararray Character array (string)
in Unicode UTF-8 format
Hello World
bytearray Byte array (blog)
Complex tuple An ordered set of fields. (19,2)
bag An collection of tuples. {(19,2), (18, 1)}
map A set of key value pairs. [open#apache]
Key is chararray.
Key is unique.
Value is any type.
5. Run local
pix -x local
grunt> a = LOAD '/etc/passwd' USING PigStorage(':');
grunt> DUMP a;
grunt> EXPLAIN a;
grunt> b = FOREACH a GENERATE $0 as id;
grunt> DUMP b;
grunt> EXPLAIN b;
grunt> c = FOREACH a GENERATE $1 as id;
grunt> DUMP c;
grunt> EXPLAIN c;
grunt> STORE b INTO ‘id.out’;
6. Word Count
pix -x local
grunt> a = LOAD './input.txt';
grunt> b = FOREACH a GENERATE FLATTEN(TOKENIZE((CHARARRAY)
$0)) AS word;
grunt> c = GROUP b by word;
grunt> d = FOREACH c GENERATE COUNT(b), GROUP;
grunt> e = ORDER d BY $0;
grunt> STORE e INTO './wordcount';
7. Schema
pix -x local
grunt> records = LOAD ‘sample.txt’ AS(year:int,
temperature:int, quality:int);
grunt> filtered_records = FILTER records BY temperature !=
9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR
quality == 5 OR quality == 9);
grunt> grouped_records = GROUP filtered_records BY year;
grunt> max_temp = FOREACH grouped_records GENERATE GROUP,
MAX(filterd_records.temperature);
grunt> DUMP max_temp;
grunt> ILLUSTRATE max_temp;
9. UDFs
Extends EvalFunc, FilterFunc, LoadFunc
Override
Make jar
grunt> REGISTER pig-examples.jar;
grunt> filtered = FILTER records BY temperature != 9999 AND
com.hadoop.pig.IsGoodQuality(quality);
grunt> DEFINE isGood com.hadoopbook.pig.IsGoodQuality();