2. This talk is about Apache Pig
• High-level data flow language (think: DSL) for writing
Hadoop MapReduce jobs
• Why and when should you care about Pig?
• You are a Hadoop beginner
• … and want to implement a JOIN, for instance
• You are a Hadoop expert
• You only scratch your head when you see
public static void main(String... args)
• You think Java is not the best tool for this job [pun!]
• Think: too low-level, too many lines of code, no interactive mode
for exploratory analysis, readability > performance, et cetera
Apache Hadoop, Pig and Hive are trademarks of the Apache Software Foundation.
Java is a trademark of Oracle Corporation.
Verisign Public
3. A basic Pig script
• Example: sorting user records by users’ age
records = LOAD '/path/to/input'
    AS (user:chararray, age:int);
sorted_records = ORDER records BY age DESC;
STORE sorted_records INTO '/path/to/output';
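For comparison, here is a minimal plain-Java sketch of what the script above computes, run on a few in-memory lines instead of a Hadoop cluster. The class name, method name and sample data are made up for illustration, and the tab-separated input format is an assumption based on Pig's default loader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortByAgeSketch {
    /** Sorts tab-separated "user<TAB>age" lines by age, oldest first. */
    static List<String> sortByAgeDesc(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        // Parse the second field as an int, like the age:int schema in the script.
        sorted.sort(Comparator.comparingInt(
                (String line) -> Integer.parseInt(line.split("\t")[1].trim())).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical sample records, not taken from the talk.
        List<String> lines = Arrays.asList("alice\t31", "bob\t45", "carol\t27");
        for (String line : sortByAgeDesc(lines)) {
            System.out.println(line);
        }
    }
}
```

Even this toy version needs explicit parsing and a comparator, and it does none of the distributed work that Pig hands to MapReduce.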
• Popular alternatives to Pig
• Hive: ~ SQL for Hadoop
• Hadoop Streaming: use any programming language for MR
• Even though you still write code in a “real” programming
language, Streaming provides an environment that makes it more
convenient than native Hadoop Java code.
4. Preliminaries
• Talk is based on Pig 0.10.0, released in April ’12
• Some notable 0.10.0 improvements
• Hadoop 2.0 support
• Loading and storing JSON
• Ctrl-C’ing a Pig job will terminate all associated Hadoop jobs
• Amazon S3 support
6. “Testing” Pig scripts – some examples
DESCRIBE | EXPLAIN | ILLUSTRATE | DUMP
$ pig -x local
$ pig [-debug | -dryrun]
$ pig -param input=/path/to/small-sample.txt
7. “Testing” Pig scripts (cont.)
• JobTracker UI
• PigStats, JobStats, HadoopJobHistoryLoader
Now what have you been using?
Also: inspecting Hadoop log files, …
8. However…
• Previous approaches are primarily useful (and used)
for creating the Pig script in the first place
• Like ILLUSTRATE
• None of them are really geared towards unit testing
• Difficult to automate (think: production environment)
#!/bin/bash
pig -param date=$1 -param output=$2 myscript.pig
hadoop fs -copyToLocal $2 /tmp/jobresult
if [ ARGH!!! ] ...
• Difficult to integrate into a typical development
workflow, e.g. backed by Maven, Java and a CI server
$ mvn clean test ??
Maven is a trademark of the Apache Software Foundation.
10. PigUnit
• Available in Pig since version 0.8
“PigUnit provides a unit-testing framework that plugs into JUnit
to help you write unit tests that can be run on a regular basis.”
-- Alan F. Gates, Programming Pig
• Easy way to add Pig unit testing to your dev workflow
iff you are a Java developer
• See “Tips and Tricks” later for working around this constraint
• Works with both JUnit and TestNG
• PigUnit docs have “potential”
• Some basic examples, then it’s looking at the source code of
both PigUnit and Pig (but it’s manageable)
• http://pig.apache.org/docs/r0.10.0/test.html#pigunit
11. Getting PigUnit up and running
• PigUnit is not included in current Pig releases :(
• You must manually build the PigUnit jar file
$ cd /path/to/pig-sources # can be a release tarball
$ ant jar pigunit-jar
...
$ ls -l pig*jar
-rw-r--r-- 1 mnoll mnoll 17768497 ... pig.jar
-rw-r--r-- 1 mnoll mnoll 285627 ... pigunit.jar
• Add these jar(s) to your CLASSPATH, done!
12. PigUnit and Maven
• Unfortunately the Apache Pig project does not yet
publish an official Maven artifact for PigUnit
WILL NOT WORK IN pom.xml :(
<dependency>
<groupId>org.apache.pig</groupId>
<artifactId>pigunit</artifactId>
<version>0.10.0</version>
</dependency>
• Alternatives:
• Publish to your local Artifactory instance
• Use a local file-based <repository>
• Use <scope>system</scope> with a <systemPath> in pom.xml (not recommended)
• Use trusted third-party repos like Cloudera’s
Artifactory is a trademark of JFrog Ltd.
14. A simple PigUnit test
• Here, we provide input + output data in the Java code
• Pig script is read from file wordcount.pig
@Test
public void testSimpleExample() throws Exception {
    // PigTest's constructor and assertOutput() throw checked exceptions
    PigTest simpleTest = new PigTest("wordcount.pig");
    String[] input = { "foo", "bar", "foo" };
    String[] expectedOutput = {
        "(foo,2)",
        "(bar,1)"
    };
    simpleTest.assertOutput(
        "aliasInput", input,
        "aliasOutput", expectedOutput
    );
}
15. A simple PigUnit test (cont.)
• wordcount.pig
-- PigUnit populates the alias 'aliasInput'
-- with the test input data
aliasInput = LOAD '<tmpLoc>' AS <schema>;
-- ...here comes your actual code...
-- PigUnit will treat the contents of the alias
-- 'aliasOutput' as the actual output data in
-- the assert statement
aliasOutput = <your_final_statement>;
-- Note: PigUnit ignores STORE operations by default
STORE aliasOutput INTO 'output';
16. A simple PigUnit test (cont.)
simpleTest.assertOutput(
    "aliasInput", input,             // (1)
    "aliasOutput", expectedOutput    // (2)
);
(1) PigUnit injects input[] = { "foo", "bar", "foo" } into the
alias named aliasInput in the Pig script.
For this purpose PigUnit creates a temporary file, writes the
equivalent of StringUtils.join(input, "\n") to the file,
and finally makes its location available to the LOAD operation.
(2) PigUnit opens an iterator on the content of aliasOutput, and runs
assertEquals() based on StringUtils.join(..., "\n")
with expectedOutput and the actual content.
See o.a.p.pigunit.{PigTest, Cluster} and o.a.p.test.Util.
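The two steps described above can be imitated in plain Java to make the mechanics concrete. This is only a sketch of the described behaviour, not PigUnit's actual code: the class and method names are hypothetical, and String.join stands in for StringUtils.join.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class AssertOutputSketch {
    /** Step 1: write the newline-joined input rows to a temporary file. */
    static Path writeInput(String[] input) {
        try {
            Path tmp = Files.createTempFile("pigunit-sketch-", ".txt");
            Files.write(tmp, String.join("\n", input).getBytes(StandardCharsets.UTF_8));
            return tmp; // this location would be handed to the LOAD operation
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Step 2: join both sides with newlines and compare them as single strings. */
    static boolean outputMatches(String[] expected, List<String> actualTuples) {
        return String.join("\n", expected).equals(String.join("\n", actualTuples));
    }

    public static void main(String[] args) {
        Path tmp = writeInput(new String[] { "foo", "bar", "foo" });
        System.out.println("input written to " + tmp);
        // Hypothetical rows, standing in for what the iterator over aliasOutput yields.
        List<String> actual = Arrays.asList("(foo,2)", "(bar,1)");
        System.out.println(outputMatches(new String[] { "(foo,2)", "(bar,1)" }, actual));
    }
}
```

Because the comparison works on joined strings, row order matters: the same tuples in a different order count as a mismatch.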
17. PigUnit drawbacks
• How to divide your “main” Pig script into testable units?
• Only run a single end-to-end test for the full script?
• Extract testable snippets from the main script?
• Argh, code duplication!
• Split the main script into logical units = smaller scripts; then run
individual tests and include the smaller scripts in the main script
• Ok-ish but splitting too much makes the Pig code hard to
understand (too many trees, no forest).
• PigUnit is a nice tool but batteries are not included
• It does work but it is not as convenient or powerful as you’d like.
• Notably you still need to know and write Java to use it. But one
compelling reason for Pig is that you can do without Java.
• You may end up writing your own wrapper/helper lib around it.
• Consider contributing this back to the Apache Pig project!
19. Connecting to a real cluster (default: local mode)
// this is not enough to enable cluster mode in PigUnit
pigServer = new PigServer(ExecType.MAPREDUCE);
// ...do PigUnit stuff...
// rather:
Properties props = System.getProperties();
if (clusterMode)
    props.setProperty("pigunit.exectype.cluster", "true");
else
    props.remove("pigunit.exectype.cluster"); // java.util.Properties has no removeProperty()
• $HADOOP_CONF_DIR must be in CLASSPATH
• Similar approach for enabling LZO support
• mapred.output.compress => "true"
• mapred.output.compression.codec => "c.h.c.lzo.LzopCodec"
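The property toggle for cluster mode shown earlier on this slide can be wrapped in a small helper so that tests do not repeat the if/else. This is a sketch with hypothetical class and method names. It assumes that the mere presence of the system property switches PigUnit to cluster mode, which is why local mode clears the key instead of setting it to "false".

```java
final class PigUnitModeSketch {
    /** System property named on the slide for switching PigUnit to cluster mode. */
    static final String CLUSTER_KEY = "pigunit.exectype.cluster";

    /** Sets the property for cluster mode and clears it entirely for local mode. */
    static void setClusterMode(boolean clusterMode) {
        if (clusterMode) {
            System.setProperty(CLUSTER_KEY, "true");
        } else {
            System.clearProperty(CLUSTER_KEY);
        }
    }

    /** Reports whether the property is currently present (assumed to be what matters). */
    static boolean isClusterMode() {
        return System.getProperties().containsKey(CLUSTER_KEY);
    }

    public static void main(String[] args) {
        setClusterMode(true);
        System.out.println("cluster mode: " + isClusterMode());
        setClusterMode(false);
        System.out.println("cluster mode: " + isClusterMode());
    }
}
```

Call setClusterMode() before the first PigTest is created, since the property is presumably read when PigUnit sets up its execution environment.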
20. Write a convenient PigUnit runner for your users
• Pig user != Java developer
• Pig users should only need to provide three files:
• pig/myscript.pig
• input/testdata.txt
• output/expected.txt
• PigUnit runner discovers and runs tests for users
• PigTest#assertOutput() can also handle files
• But you must manage file uploads and similar “glue” yourself
pigUnitRunner.runPigTest(
new Path(scriptFile),
new Path(inputFile),
new Path(expectedOutputFile)
);
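The discovery half of such a runner could look roughly like the sketch below. Everything here is hypothetical: the class, the TestCase holder and the convention that one test directory holds pig/*.pig, input/testdata.txt and output/expected.txt are assumptions derived from the three-file layout listed above. The PigUnit call itself is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PigTestDiscoverySketch {
    /** The three files a Pig user provides for one test. */
    static final class TestCase {
        final Path script;
        final Path input;
        final Path expected;

        TestCase(Path script, Path input, Path expected) {
            this.script = script;
            this.input = input;
            this.expected = expected;
        }
    }

    /** Returns one TestCase per *.pig file in testDir/pig, sorted by script name. */
    static List<TestCase> discover(Path testDir) {
        Path scriptDir = testDir.resolve("pig");
        Path input = testDir.resolve("input").resolve("testdata.txt");
        Path expected = testDir.resolve("output").resolve("expected.txt");
        List<TestCase> cases = new ArrayList<>();
        // An incomplete layout yields no tests rather than a failure.
        if (!Files.isDirectory(scriptDir) || !Files.isRegularFile(input)
                || !Files.isRegularFile(expected)) {
            return cases;
        }
        try (DirectoryStream<Path> scripts = Files.newDirectoryStream(scriptDir, "*.pig")) {
            for (Path script : scripts) {
                cases.add(new TestCase(script, input, expected));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        cases.sort(Comparator.comparing((TestCase c) -> c.script.getFileName().toString()));
        return cases;
    }

    /** Creates a throwaway directory with the three-file layout, for demonstration only. */
    static Path createSampleLayout() {
        try {
            Path dir = Files.createTempDirectory("pigunit-runner-sketch-");
            Files.createDirectories(dir.resolve("pig"));
            Files.createDirectories(dir.resolve("input"));
            Files.createDirectories(dir.resolve("output"));
            Files.createFile(dir.resolve("pig").resolve("myscript.pig"));
            Files.createFile(dir.resolve("input").resolve("testdata.txt"));
            Files.createFile(dir.resolve("output").resolve("expected.txt"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (TestCase c : discover(createSampleLayout())) {
            // A real runner would hand these three paths to its PigUnit wrapper here.
            System.out.println(c.script + " | " + c.input + " | " + c.expected);
        }
    }
}
```

Keeping discovery separate from execution lets Pig users add a test by dropping files into place, without touching any Java.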
21. Slightly off-topic: Java/Pig combo
• Pig API provides nifty features to control Pig workflows
through Java
• Similar to how working with PigUnit feels
• Definitely worth a look!
// 'pigParams' is the main glue between Java and Pig here,
// e.g. to specify the location of input data
pigServer.registerScript(scriptInputStream, pigParams);
ExecJob job = pigServer.store(
    "aliasOutput",
    "/path/to/output",
    "PigStorage()"
);
if (job != null && job.getStatus() == JOB_STATUS.COMPLETED)
    System.out.println("Happy world!");