This document provides an overview of Pig, an open source platform for analyzing large datasets. It describes Pig's data model, syntax, and key components like Pig Latin for expressing data flows and operations. Pig allows for analyzing structured and unstructured data using a high-level language that is compiled into MapReduce jobs for execution on Hadoop. The document outlines Pig's capabilities and limitations, and provides resources for learning more.
3. Structured Data Management in Hadoop
State of the World
▪ HBase is a Hadoop subproject
  ▪ Powerset and Rapleaf are the main contributors
▪ Hypertable is Bigtable in C++
  ▪ Zvents is the main contributor
▪ Pig is an Apache Incubator project
  ▪ Yahoo! is the main contributor
▪ JAQL has been released as open source
  ▪ IBM is the main contributor
▪ Hive is not yet available publicly; hopefully under contrib/ soon
  ▪ Facebook is the main contributor
4. Pig
Philosophy
▪ Pigs Eat Anything
  ▪ Operate on data with or without metadata
  ▪ Operate on relational, nested, or unstructured data
▪ Pigs Live Anywhere
  ▪ The language is independent of the execution environment
▪ Pigs are Domestic Animals
  ▪ Integrate user code wherever possible
  ▪ Allow control over code reorganization when optimizing
▪ Pigs Fly
5. Pig
Components
▪ Pig Latin
  ▪ Dataflow programming language; procedural, not declarative
  ▪ Algebraic: each step specifies only a single data transformation
  ▪ Parse, verify, and build a logical plan
▪ Evaluation Mechanisms
  ▪ Local evaluation in a single JVM
  ▪ Compilation to Hadoop MapReduce
▪ Grunt: interactive shell
▪ Pig Pen: debugging environment
6. Pig
Data Model
Pig has four types of data items:
▪ Atom: a string or a number
▪ Tuple: a “data record” consisting of an ordered sequence of “fields”
  ▪ Denoted with < > bracketing
▪ Bag: an unordered collection of tuples with possible duplicates and possibly inconsistent schemas
  ▪ Denoted with { } bracketing
▪ Map: an unordered collection of data items where each data item has an associated key; the key must be a string
  ▪ Denoted with [ ] bracketing
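The four data types map naturally onto ordinary Python structures. The sketch below is illustrative only (these are not Pig's actual classes): atoms become strings or numbers, tuples become Python tuples, bags become lists (unordered, duplicates allowed), and maps become dicts with string keys.

```python
# Modeling Pig's four data types with plain Python structures
# (an illustrative sketch, not Pig's implementation).

atom = 'apache'                            # Atom: a string or a number
tup = (1, 2, 3)                            # Tuple: ordered sequence of fields
bag = [(2, 3, 4), (4, 6, 8), (5, 7, 11)]   # Bag: tuples, duplicates allowed
m = {'apache': 'search'}                   # Map: string key -> data item

# Data items nest freely; this is the example value t used on later slides:
t = (1, bag, m)
print(t[0])            # 1
print(t[2]['apache'])  # search
```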
7. Pig
Data Model, continued
▪ Fields in a tuple may be named for easier access
▪ A “relation” is a Bag that has been assigned a name (an “alias”)
▪ Example:
  ▪ Let t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, [‘apache’: ‘search’] >
  ▪ Give the fields of t the names “f1”, “f2”, and “f3”
  ▪ Give the fields of the tuples of the bag the names “g1”, “g2”, and “g3”
▪ We’ll look at Pig’s data access syntax on the next page
8. Pig
Data Access
▪ t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, [‘apache’: ‘search’] >

  Method of Data Access | Example                | Value for t                  | Applies to which Data Item
  ----------------------|------------------------|------------------------------|---------------------------
  Constant              | ‘1.0’ or ‘apache.org’  | the constant itself          | Atom
  Positional Reference  | $0                     | ‘1’                          | Tuple
  Named Reference       | f1                     | ‘1’                          | Tuple
  Projection            | f2.$0                  | { <2>, <4>, <5> }            | Bag
  Multiple Projection   | f2.(g1, g3)            | { <2, 4>, <4, 8>, <5, 11> }  | Bag
  Map Lookup            | f3#’apache’            | ‘search’                     | Map
  Multiple Map Lookup   | (?)                    | ?                            | Map
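The table's access methods can be mimicked on the Python model of t. This is a sketch of the semantics, not Pig's syntax or implementation; the variable names are my own.

```python
# Simulating the data-access operations from the table on the example value t.
t = (1, [(2, 3, 4), (4, 6, 8), (5, 7, 11)], {'apache': 'search'})

# Positional reference $0 (the named reference f1 resolves to the same field)
f1 = t[0]                                   # -> 1

# Projection f2.$0: keep only field 0 of each tuple in the bag
proj = [(row[0],) for row in t[1]]          # -> [(2,), (4,), (5,)]

# Multiple projection f2.(g1, g3): keep fields 0 and 2 of each tuple
multi = [(row[0], row[2]) for row in t[1]]  # -> [(2, 4), (4, 8), (5, 11)]

# Map lookup f3#'apache'
val = t[2]['apache']                        # -> 'search'
```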
9. Pig
Questions
▪ How does a tuple with named fields differ from a map?
▪ How does a tuple of tuples differ from a bag?
▪ When do you ever use a map?
▪ For further information, see Pig’s documentation and mailing lists:
  ▪ Web site: incubator.apache.org/pig
  ▪ Wiki: http://wiki.apache.org/pig
  ▪ Paper: http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf
  ▪ Language reference: http://wiki.apache.org/pig/PigLatin
10. Pig
Statements
▪ A Pig Latin statement is a command that produces a relation
▪ Pig commands can take zero, one, or more relations as input
▪ Pig commands can span multiple lines and must end with “;”
▪ To play with Pig syntax, you can use the Grunt shell or the StandAloneParser
11. Pig
Example Data
▪ Let ‘a.txt’ be a tab-delimited file with values:
    1  2  3
    4  2  1
    8  3  4
    4  3  3
    7  2  5
    8  4  3
12. Pig
Example Data
▪ Let ‘b.txt’ be a tab-delimited file with values:
    2  4
    8  9
    1  3
    2  7
    2  9
    4  6
    4  9
13. Pig
Statements: LOAD and STORE
▪ LOAD <filename> [USING <function>] [AS <schema>]
▪ Example:
    grunt> a = LOAD ‘a.txt’ USING PigStorage(‘\t’) AS (f1, f2, f3);
▪ Now a is a relation with six tuples that share a common schema:
    a = { <1, 2, 3>, <4, 2, 1>, <8, 3, 4>, <4, 3, 3>, <7, 2, 5>, <8, 4, 3> }
  ▪ All the tuples have field names “f1”, “f2”, and “f3”
▪ PigStorage() can be replaced by any deserialization function
▪ STORE <relation> INTO <filename> [USING <function>] does the reverse
  ▪ PigStorage() can’t handle nested relations; use BinStorage() instead
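What PigStorage(‘\t’) does during a LOAD can be sketched as splitting each input line on the delimiter and attaching the schema's field names. The `load` helper below is hypothetical, written only to illustrate the semantics; it is not Pig's code.

```python
import io

# A sketch of LOAD 'a.txt' USING PigStorage('\t') AS (f1, f2, f3):
# split each line on the delimiter and pair the pieces with field names.
# 'load' is a hypothetical helper for illustration, not a Pig API.

def load(lines, delimiter='\t', schema=('f1', 'f2', 'f3')):
    """Parse delimited lines into a bag of named-field records."""
    return [dict(zip(schema, line.rstrip('\n').split(delimiter)))
            for line in lines]

# Stand-in for the file a.txt from the earlier slide:
a_txt = io.StringIO('1\t2\t3\n4\t2\t1\n8\t3\t4\n4\t3\t3\n7\t2\t5\n8\t4\t3\n')
a = load(a_txt)
print(a[0])   # {'f1': '1', 'f2': '2', 'f3': '3'}
```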
14. Pig
Statements: FILTER
▪ FILTER <relation> BY <condition>
▪ Example:
    grunt> x = FILTER a BY f1 == ‘8’ OR f3 > 4;
▪ The relation x has three tuples, which again share the schema (f1, f2, f3):
    x = { <8, 3, 4>, <8, 4, 3>, <7, 2, 5> }
▪ In addition to standard numerical comparisons, you can also do string comparisons and even regular expression matching
▪ You can also use your own comparison function
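The effect of that FILTER on the tuples of a can be sketched with a plain list comprehension (an illustration of the semantics only, using numeric tuples rather than Pig's typed fields):

```python
# A sketch of x = FILTER a BY f1 == '8' OR f3 > 4 over the tuples of a.
a = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# Keep a tuple when field f1 (position 0) is 8 or field f3 (position 2) > 4.
x = [t for t in a if t[0] == 8 or t[2] > 4]
print(x)   # [(8, 3, 4), (7, 2, 5), (8, 4, 3)]
```

Bags are unordered, so the slide's listing { <8, 3, 4>, <8, 4, 3>, <7, 2, 5> } is the same result in a different order.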
15. Pig
Statements: GROUP
▪ GROUP <relation> BY [<fields> | ALL | ANY]
▪ Only makes sense if the tuples in the relation have at least partially shared schemas
▪ Example:
    grunt> y = GROUP x BY f1;
▪ The relation y has two tuples, which share the schema (group, x):
    y = { < 7, { < 7, 2, 5 > } >, < 8, { < 8, 3, 4 >, < 8, 4, 3 > } > }
▪ Using ANY returns a single tuple that collects all the input tuples into a single bag
▪ Note that GROUP is just syntactic sugar for COGROUP over a single relation
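The grouping step can be sketched with a dictionary of bags keyed by the grouping field (again an illustration of the semantics, not Pig's implementation):

```python
from collections import defaultdict

# A sketch of y = GROUP x BY f1: each output tuple is
# (group key, bag of all input tuples with that key).
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]

groups = defaultdict(list)
for t in x:
    groups[t[0]].append(t)          # group by field f1 (position 0)

y = [(key, bag) for key, bag in groups.items()]
# y contains (7, [(7, 2, 5)]) and (8, [(8, 3, 4), (8, 4, 3)])
```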
16. Pig
Statements: COGROUP
▪ COGROUP <relation> BY <fields> [INNER][, <relation> BY <fields> [INNER]];
▪ Example:
    grunt> z = COGROUP x BY f3 INNER, b BY $0 INNER;
▪ With INNER on both relations, only the group key present in both survives; z has one tuple with the schema (group, x, b):
    z = { < 4, { < 8, 3, 4 > }, { < 4, 6 >, < 4, 9 > } > }
▪ Note that we could have used multiple fields with BY
▪ The INNER keyword on a relation tosses out the group records for which that relation’s bag is empty
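The COGROUP semantics can be sketched by grouping each relation on its own key and then keeping only keys whose bags are non-empty on both sides, which is the effect of INNER on both relations (an illustrative sketch, not Pig's implementation):

```python
from collections import defaultdict

# A sketch of z = COGROUP x BY f3 INNER, b BY $0 INNER.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

xg, bg = defaultdict(list), defaultdict(list)
for t in x:
    xg[t[2]].append(t)              # x keyed by f3 (position 2)
for t in b:
    bg[t[0]].append(t)              # b keyed by $0 (position 0)

# INNER on both sides: keep only keys with a non-empty bag in each relation.
z = [(k, xg[k], bg[k]) for k in xg if bg.get(k)]
print(z)   # [(4, [(8, 3, 4)], [(4, 6), (4, 9)])]
```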
17. Pig
Statements: FOREACH ... GENERATE
▪ FOREACH <relation> GENERATE <data item>, <data item>, ...;
▪ Example:
    grunt> w = FOREACH x GENERATE f1, f3;
  ▪ Equivalent to the projection x.(f1, f3)
▪ The relation w has three tuples, which share the schema (f1, f3):
    w = { <8, 4>, <8, 3>, <7, 5> }
▪ Can also have “nested projections”:
    grunt> u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
    u = { <7, 5>, <8, 7> }, where tuples have the schema (group, thirdcolsum)
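Both FOREACH examples reduce to simple per-tuple transformations, sketched below on the Python models of x and y (illustration only):

```python
# Sketches of the two FOREACH ... GENERATE examples.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
y = [(7, [(7, 2, 5)]), (8, [(8, 3, 4), (8, 4, 3)])]

# w = FOREACH x GENERATE f1, f3;  -- a plain projection of fields 0 and 2
w = [(t[0], t[2]) for t in x]
print(w)   # [(8, 4), (8, 3), (7, 5)]

# u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
# -- a nested projection: sum field f3 within each group's bag
u = [(group, sum(t[2] for t in bag)) for group, bag in y]
print(u)   # [(7, 5), (8, 7)]
```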
18. Pig
More Keywords and Statements
▪ FLATTEN
▪ JOIN
▪ ORDER
▪ DISTINCT
▪ CROSS
▪ UNION
▪ SPLIT
▪ Write your own functions: http://wiki.apache.org/pig/PigFunctions
19. Pig
Physical Execution via Hadoop MapReduce
▪ How is a logical Pig plan executed via Hadoop?
▪ Details are in the SIGMOD paper
▪ Essentially, each (CO)GROUP results in a new map and reduce function
▪ As in Teradata, intermediate data is materialized in the DFS
▪ For Pig commands that take multiple relations as input, an additional field is inserted into each tuple to indicate which relation it came from
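That tagging scheme can be sketched in miniature: the map phase emits each tuple under its group key with an extra field naming its source relation, the shuffle groups by key, and the reduce phase uses the tag to split each key's tuples back into per-relation bags. This is a toy single-process illustration of the idea, not Pig's or Hadoop's actual code.

```python
from collections import defaultdict

# A toy sketch of compiling COGROUP x BY f3, b BY $0 into one map/reduce pass.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

def map_phase():
    # Emit (group key, tagged tuple); the tag records the source relation.
    for t in x:
        yield t[2], ('x', t)        # x keyed by f3
    for t in b:
        yield t[0], ('b', t)        # b keyed by $0

shuffled = defaultdict(list)        # the shuffle groups emissions by key
for key, tagged in map_phase():
    shuffled[key].append(tagged)

def reduce_phase(key, tagged_tuples):
    # The tag field routes each tuple back into its relation's bag.
    bags = {'x': [], 'b': []}
    for tag, t in tagged_tuples:
        bags[tag].append(t)
    return key, bags['x'], bags['b']

z = [reduce_phase(k, v) for k, v in shuffled.items()]
# The group-4 record holds x's tuple <8, 3, 4> and b's tuples <4, 6>, <4, 9>.
```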
20. Pig
Grunt Shell
▪ Allows you to maintain a working session
▪ You can interact with the DFS as well as your Pig logical objects
▪ The DUMP command lets you see the objects you are working with
▪ The ILLUSTRATE command provides simple debugging
▪ For more, check out http://wiki.apache.org/pig/Grunt
21. Pig
Pig Pen
▪ Runs a sequence of Pig commands over a representative sample of data
▪ It is difficult to generate a representative sample when using highly selective FILTER or COGROUP statements
▪ The algorithm runs multiple sampling passes over the data and generates representative data if necessary
▪ Allows for incremental construction of complex Pig commands
23. Pig
What’s Missing?
▪ Metadata repository
▪ Browse schemas for persistent data
▪ Library of serialization and deserialization functions
▪ Optimized logical and physical organization of data
▪ SQL interface
▪ UDFs in any language
▪ Execution dataflows other than MapReduce
  ▪ Hash joins, aggregate operators that don’t require a sort, etc.
▪ Query optimization
24. (c) 2008 Facebook, Inc. or its licensors. “Facebook” is a registered trademark of Facebook, Inc. All rights reserved. 1.0