This document provides an overview of Pig, an open source platform for analyzing large datasets. It describes Pig's data model, syntax, and key components like Pig Latin for expressing data flows and operations. Pig allows for analyzing structured and unstructured data using a high-level language that is compiled into MapReduce jobs for execution on Hadoop. The document outlines Pig's capabilities and limitations, and provides resources for learning more.
3. Structured Data Management in Hadoop
State of the World
▪ HBase is a Hadoop subproject
  ▪ Powerset and Rapleaf are the main contributors
▪ Hypertable is Bigtable in C++
  ▪ Zvents is the main contributor
▪ Pig is an Apache Incubator project
  ▪ Yahoo! is the main contributor
▪ JAQL has been released as open source
  ▪ IBM is the main contributor
▪ Hive is not yet available publicly; hopefully under contrib/ soon
  ▪ Facebook is the main contributor
4. Pig
Philosophy
▪ Pigs Eat Anything
  ▪ Operate on data with or without metadata
  ▪ Operate on relational, nested, or unstructured data
▪ Pigs Live Anywhere
  ▪ The language is independent of the execution environment
▪ Pigs are Domestic Animals
  ▪ Integrate user code wherever possible
  ▪ Allow control over code reorganization when optimizing
▪ Pigs Fly
5. Pig
Components
▪ Pig Latin
  ▪ Dataflow programming language; procedural, not declarative
  ▪ Algebraic: each step specifies only a single data transformation
  ▪ Parse, verify, and build a logical plan
▪ Evaluation Mechanisms
  ▪ Local evaluation in a single JVM
  ▪ Compilation to Hadoop MapReduce
▪ Grunt: interactive shell
▪ Pig Pen: debugging environment
6. Pig
Data Model
Pig has four types of data items:
▪ Atom: a string or a number
▪ Tuple: a “data record” consisting of an ordered sequence of “fields”
  ▪ Denoted with < > bracketing
▪ Bag: an unordered collection of tuples with possible duplicates and possibly inconsistent schemas
  ▪ Denoted with { } bracketing
▪ Map: an unordered collection of data items where each data item has an associated key; the key must be a string
  ▪ Denoted with [ ] bracketing
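The four data types map naturally onto ordinary Python structures. The sketch below is illustrative only (these are not Pig's actual classes): atoms become strings or numbers, tuples become Python tuples, bags become lists (unordered, duplicates allowed), and maps become dicts with string keys.

```python
# Modeling Pig's four data types with plain Python structures
# (an illustrative sketch, not Pig's implementation).

atom = 'apache'                            # Atom: a string or a number
tup = (1, 2, 3)                            # Tuple: ordered sequence of fields
bag = [(2, 3, 4), (4, 6, 8), (5, 7, 11)]   # Bag: tuples, duplicates allowed
m = {'apache': 'search'}                   # Map: string key -> data item

# Data items nest freely; this is the example value t used on later slides:
t = (1, bag, m)
print(t[0])            # 1
print(t[2]['apache'])  # search
```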
7. Pig
Data Model, continued
▪ Fields in a tuple may be named for easier access
▪ A “relation” is a Bag that has been assigned a name (an “alias”)
▪ Example:
  ▪ Let t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, [‘apache’: ‘search’] >
  ▪ Give the fields of t the names “f1”, “f2”, and “f3”
  ▪ Give the fields of the tuples of the bag the names “g1”, “g2”, and “g3”
▪ We’ll look at Pig’s data access syntax on the next page
8. Pig
Data Access
▪ t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, [‘apache’: ‘search’] >

  Method of Data Access | Example                | Value for t                  | Applies to which Data Item
  ----------------------|------------------------|------------------------------|---------------------------
  Constant              | ‘1.0’ or ‘apache.org’  | the constant itself          | Atom
  Positional Reference  | $0                     | ‘1’                          | Tuple
  Named Reference       | f1                     | ‘1’                          | Tuple
  Projection            | f2.$0                  | { <2>, <4>, <5> }            | Bag
  Multiple Projection   | f2.(g1, g3)            | { <2, 4>, <4, 8>, <5, 11> }  | Bag
  Map Lookup            | f3#’apache’            | ‘search’                     | Map
  Multiple Map Lookup   | (?)                    | ?                            | Map
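The table's access methods can be mimicked on the Python model of t. This is a sketch of the semantics, not Pig's syntax or implementation; the variable names are my own.

```python
# Simulating the data-access operations from the table on the example value t.
t = (1, [(2, 3, 4), (4, 6, 8), (5, 7, 11)], {'apache': 'search'})

# Positional reference $0 (the named reference f1 resolves to the same field)
f1 = t[0]                                   # -> 1

# Projection f2.$0: keep only field 0 of each tuple in the bag
proj = [(row[0],) for row in t[1]]          # -> [(2,), (4,), (5,)]

# Multiple projection f2.(g1, g3): keep fields 0 and 2 of each tuple
multi = [(row[0], row[2]) for row in t[1]]  # -> [(2, 4), (4, 8), (5, 11)]

# Map lookup f3#'apache'
val = t[2]['apache']                        # -> 'search'
```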
9. Pig
Questions
▪ How does a tuple with named fields differ from a map?
▪ How does a tuple of tuples differ from a bag?
▪ When do you ever use a map?
▪ For further information, see Pig’s documentation and mailing lists:
  ▪ Web site: incubator.apache.org/pig
  ▪ Wiki: http://wiki.apache.org/pig
  ▪ Paper: http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf
  ▪ Language reference: http://wiki.apache.org/pig/PigLatin
10. Pig
Statements
▪ A Pig Latin statement is a command that produces a relation
▪ Pig commands can take zero, one, or more relations as input
▪ Pig commands can span multiple lines and must end with “;”
▪ To play with Pig syntax, you can use the Grunt shell or the StandAloneParser
11. Pig
Example Data
▪ Let ‘a.txt’ be a tab-delimited file with values:
    1  2  3
    4  2  1
    8  3  4
    4  3  3
    7  2  5
    8  4  3
12. Pig
Example Data
▪ Let ‘b.txt’ be a tab-delimited file with values:
    2  4
    8  9
    1  3
    2  7
    2  9
    4  6
    4  9
13. Pig
Statements: LOAD and STORE
▪ LOAD <filename> [USING <function>] [AS <schema>]
▪ Example:
    grunt> a = LOAD ‘a.txt’ USING PigStorage(‘\t’) AS (f1, f2, f3);
▪ Now a is a relation with six tuples that share a common schema:
    a = { <1, 2, 3>, <4, 2, 1>, <8, 3, 4>, <4, 3, 3>, <7, 2, 5>, <8, 4, 3> }
  ▪ All the tuples have field names “f1”, “f2”, and “f3”
▪ PigStorage() can be replaced by any deserialization function
▪ STORE <relation> INTO <filename> [USING <function>] does the reverse
  ▪ PigStorage() can’t handle nested relations; use BinStorage() instead
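What PigStorage(‘\t’) does during a LOAD can be sketched as splitting each input line on the delimiter and attaching the schema's field names. The `load` helper below is hypothetical, written only to illustrate the semantics; it is not Pig's code.

```python
import io

# A sketch of LOAD 'a.txt' USING PigStorage('\t') AS (f1, f2, f3):
# split each line on the delimiter and pair the pieces with field names.
# 'load' is a hypothetical helper for illustration, not a Pig API.

def load(lines, delimiter='\t', schema=('f1', 'f2', 'f3')):
    """Parse delimited lines into a bag of named-field records."""
    return [dict(zip(schema, line.rstrip('\n').split(delimiter)))
            for line in lines]

# Stand-in for the file a.txt from the earlier slide:
a_txt = io.StringIO('1\t2\t3\n4\t2\t1\n8\t3\t4\n4\t3\t3\n7\t2\t5\n8\t4\t3\n')
a = load(a_txt)
print(a[0])   # {'f1': '1', 'f2': '2', 'f3': '3'}
```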
14. Pig
Statements: FILTER
▪ FILTER <relation> BY <condition>
▪ Example:
    grunt> x = FILTER a BY f1 == ‘8’ OR f3 > 4;
▪ The relation x has three tuples, which again share the schema (f1, f2, f3):
    x = { <8, 3, 4>, <8, 4, 3>, <7, 2, 5> }
▪ In addition to standard numerical comparisons, you can also do string comparisons and even regular expression matching
▪ You can also use your own comparison function
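The effect of that FILTER on the tuples of a can be sketched with a plain list comprehension (an illustration of the semantics only, using numeric tuples rather than Pig's typed fields):

```python
# A sketch of x = FILTER a BY f1 == '8' OR f3 > 4 over the tuples of a.
a = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# Keep a tuple when field f1 (position 0) is 8 or field f3 (position 2) > 4.
x = [t for t in a if t[0] == 8 or t[2] > 4]
print(x)   # [(8, 3, 4), (7, 2, 5), (8, 4, 3)]
```

Bags are unordered, so the slide's listing { <8, 3, 4>, <8, 4, 3>, <7, 2, 5> } is the same result in a different order.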
15. Pig
Statements: GROUP
▪ GROUP <relation> BY [<fields> | ALL | ANY]
▪ Only makes sense if the tuples in the relation have at least partially shared schemas
▪ Example:
    grunt> y = GROUP x BY f1;
▪ The relation y has two tuples, which share the schema (group, x):
    y = { < 7, { < 7, 2, 5 > } >, < 8, { < 8, 3, 4 >, < 8, 4, 3 > } > }
▪ Using ANY returns a single tuple that collects all the input tuples into a single bag
▪ Note that GROUP is just syntactic sugar for COGROUP over a single relation
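The grouping step can be sketched with a dictionary of bags keyed by the grouping field (again an illustration of the semantics, not Pig's implementation):

```python
from collections import defaultdict

# A sketch of y = GROUP x BY f1: each output tuple is
# (group key, bag of all input tuples with that key).
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]

groups = defaultdict(list)
for t in x:
    groups[t[0]].append(t)          # group by field f1 (position 0)

y = [(key, bag) for key, bag in groups.items()]
# y contains (7, [(7, 2, 5)]) and (8, [(8, 3, 4), (8, 4, 3)])
```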
16. Pig
Statements: COGROUP
▪ COGROUP <relation> BY <fields> [INNER][, <relation> BY <fields> [INNER]];
▪ Example:
    grunt> z = COGROUP x BY f3 INNER, b BY $0 INNER;
▪ With INNER on both relations, only the group key present in both survives; z has one tuple with the schema (group, x, b):
    z = { < 4, { < 8, 3, 4 > }, { < 4, 6 >, < 4, 9 > } > }
▪ Note that we could have used multiple fields with BY
▪ The INNER keyword on a relation tosses out the group records for which that relation’s bag is empty
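The COGROUP semantics can be sketched by grouping each relation on its own key and then keeping only keys whose bags are non-empty on both sides, which is the effect of INNER on both relations (an illustrative sketch, not Pig's implementation):

```python
from collections import defaultdict

# A sketch of z = COGROUP x BY f3 INNER, b BY $0 INNER.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

xg, bg = defaultdict(list), defaultdict(list)
for t in x:
    xg[t[2]].append(t)              # x keyed by f3 (position 2)
for t in b:
    bg[t[0]].append(t)              # b keyed by $0 (position 0)

# INNER on both sides: keep only keys with a non-empty bag in each relation.
z = [(k, xg[k], bg[k]) for k in xg if bg.get(k)]
print(z)   # [(4, [(8, 3, 4)], [(4, 6), (4, 9)])]
```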
17. Pig
Statements: FOREACH ... GENERATE
▪ FOREACH <relation> GENERATE <data item>, <data item>, ...;
▪ Example:
    grunt> w = FOREACH x GENERATE f1, f3;
  ▪ Equivalent to the projection x.(f1, f3)
▪ The relation w has three tuples, which share the schema (f1, f3):
    w = { <8, 4>, <8, 3>, <7, 5> }
▪ Can also have “nested projections”:
    grunt> u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
    u = { <7, 5>, <8, 7> }, where tuples have the schema (group, thirdcolsum)
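Both FOREACH examples reduce to simple per-tuple transformations, sketched below on the Python models of x and y (illustration only):

```python
# Sketches of the two FOREACH ... GENERATE examples.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
y = [(7, [(7, 2, 5)]), (8, [(8, 3, 4), (8, 4, 3)])]

# w = FOREACH x GENERATE f1, f3;  -- a plain projection of fields 0 and 2
w = [(t[0], t[2]) for t in x]
print(w)   # [(8, 4), (8, 3), (7, 5)]

# u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
# -- a nested projection: sum field f3 within each group's bag
u = [(group, sum(t[2] for t in bag)) for group, bag in y]
print(u)   # [(7, 5), (8, 7)]
```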
18. Pig
More Keywords and Statements
▪ FLATTEN
▪ JOIN
▪ ORDER
▪ DISTINCT
▪ CROSS
▪ UNION
▪ SPLIT
▪ Write your own functions: http://wiki.apache.org/pig/PigFunctions
19. Pig
Physical Execution via Hadoop MapReduce
▪ How is a logical Pig plan executed via Hadoop?
▪ Details are in the SIGMOD paper
▪ Essentially, each (CO)GROUP results in a new map and reduce function
▪ As in Teradata, intermediate data is materialized in the DFS
▪ For Pig commands that take multiple relations as input, an additional field is inserted into each tuple to indicate which relation it came from
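That tagging scheme can be sketched in miniature: the map phase emits each tuple under its group key with an extra field naming its source relation, the shuffle groups by key, and the reduce phase uses the tag to split each key's tuples back into per-relation bags. This is a toy single-process illustration of the idea, not Pig's or Hadoop's actual code.

```python
from collections import defaultdict

# A toy sketch of compiling COGROUP x BY f3, b BY $0 into one map/reduce pass.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

def map_phase():
    # Emit (group key, tagged tuple); the tag records the source relation.
    for t in x:
        yield t[2], ('x', t)        # x keyed by f3
    for t in b:
        yield t[0], ('b', t)        # b keyed by $0

shuffled = defaultdict(list)        # the shuffle groups emissions by key
for key, tagged in map_phase():
    shuffled[key].append(tagged)

def reduce_phase(key, tagged_tuples):
    # The tag field routes each tuple back into its relation's bag.
    bags = {'x': [], 'b': []}
    for tag, t in tagged_tuples:
        bags[tag].append(t)
    return key, bags['x'], bags['b']

z = [reduce_phase(k, v) for k, v in shuffled.items()]
# The group-4 record holds x's tuple <8, 3, 4> and b's tuples <4, 6>, <4, 9>.
```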
20. Pig
Grunt Shell
▪ Allows you to maintain a working session
▪ You can interact with the DFS as well as your Pig logical objects
▪ The DUMP command lets you see the objects you are working with
▪ The ILLUSTRATE command provides simple debugging
▪ For more, check out http://wiki.apache.org/pig/Grunt
21. Pig
Pig Pen
▪ Runs a sequence of Pig commands over a representative sample of data
▪ It is difficult to generate a representative sample when using highly selective FILTER or COGROUP statements
▪ The algorithm runs multiple sampling passes over the data and generates representative data if necessary
▪ Allows for incremental construction of complex Pig commands
23. Pig
What’s Missing?
▪ Metadata repository
▪ Browse schemas for persistent data
▪ Library of serialization and deserialization functions
▪ Optimized logical and physical organization of data
▪ SQL interface
▪ UDFs in any language
▪ Execution dataflows other than MapReduce
  ▪ Hash joins, aggregate operators that don’t require a sort, etc.
▪ Query optimization
24. (c) 2008 Facebook, Inc. or its licensors. “Facebook” is a registered trademark of Facebook, Inc. All rights reserved. 1.0