Practical Pig + PigUnit




 Michael G. Noll, Verisign
 July 2012
This talk is about Apache Pig

   • High-level data flow language (think: DSL) for writing
     Hadoop MapReduce jobs
   • Why and when should you care about Pig?
        • You are a Hadoop beginner
               • … and want to implement a JOIN, for instance
        • You are a Hadoop expert
        • You only scratch your head when you see
             public static void main(String[] args)
           • You think Java is not the best tool for this job [pun!]
                  • Think: too low-level, too many lines of code, no interactive mode
                    for exploratory analysis, readability > performance, et cetera




                     Apache Hadoop, Pig and Hive are trademarks of the Apache Software Foundation.
Verisign Public      Java is a trademark of Oracle Corporation.
A basic Pig script

   • Example: sorting user records by users’ age
        records = LOAD '/path/to/input'
                     AS (user:chararray, age:int);

        sorted_records = ORDER records BY age DESC;

        STORE sorted_records INTO '/path/to/output';



   • Popular alternatives to Pig
           • Hive: ~ SQL for Hadoop
           • Hadoop Streaming: use any programming language for MR
                  • Even though you still write code in a “real” programming
                    language, Streaming provides an environment that makes it more
                    convenient than native Hadoop Java code.

Preliminaries

   • Talk is based on Pig 0.10.0, released in April ’12
   • Some notable 0.10.0 improvements
           •      Hadoop 2.0 support
           •      Loading and storing JSON
           •      Ctrl-C’ing a Pig job will terminate all associated Hadoop jobs
           •      Amazon S3 support
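
   • For example, the new JSON support is exposed through the built-in
     JsonLoader and JsonStorage (path and schema below are hypothetical):

```pig
-- Load newline-delimited JSON records with an explicit schema
users = LOAD '/path/to/users.json'
        USING JsonLoader('user:chararray, age:int');

-- ...transformations...

STORE users INTO '/path/to/output' USING JsonStorage();
```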




Testing Pig – a primer




“Testing” Pig scripts – some examples


              DESCRIBE | EXPLAIN | ILLUSTRATE | DUMP


              $ pig -x local


              $ pig [-debug | -dryrun]


              $ pig -param input=/path/to/small-sample.txt




“Testing” Pig scripts (cont.)

   • JobTracker UI              • PigStats, JobStats,
                                  HadoopJobHistoryLoader



  Now what have you been using?



     Also: inspecting Hadoop log files, …


However…

   • Previous approaches are primarily useful (and used)
     for creating the Pig script in the first place
           • Like ILLUSTRATE
   • None of them are really geared towards unit testing
   • Difficult to automate (think: production environment)
                   #!/bin/bash
                   pig -param date=$1 -param output=$2 myscript.pig
                   hadoop fs -copyToLocal $2 /tmp/jobresult
                   if [ ARGH!!! ] ...


   • Difficult to integrate into a typical development
     workflow, e.g. backed by Maven, Java and a CI server
                  $ mvn clean test              ??

Verisign Public     Maven is a trademark of the Apache Software Foundation.
PigUnit




PigUnit

   • Available in Pig since version 0.8
              “PigUnit provides a unit-testing framework that plugs into JUnit
              to help you write unit tests that can be run on a regular basis.”
              -- Alan F. Gates, Programming Pig

   • Easy way to add Pig unit testing to your dev workflow
     iff you are a Java developer
           • See “Tips and Tricks” later for working around this constraint
   • Works with both JUnit and TestNG
   • PigUnit docs have “potential”
           • Some basic examples, then it’s looking at the source code of
             both PigUnit and Pig (but it’s manageable)
   • http://pig.apache.org/docs/r0.10.0/test.html#pigunit

Getting PigUnit up and running

   • PigUnit is not included in current Pig releases :(
   • You must manually build the PigUnit jar file

         $ cd /path/to/pig-sources # can be a release tarball
         $ ant jar pigunit-jar
         ...
         $ ls -l pig*jar
      -rw-r--r-- 1 mnoll mnoll 17768497 ... pig.jar
      -rw-r--r-- 1 mnoll mnoll   285627 ... pigunit.jar



   • Add these jar(s) to your CLASSPATH, done!




PigUnit and Maven

   • Unfortunately the Apache Pig project does not yet
     publish an official Maven artifact for PigUnit
                  WILL NOT WORK IN pom.xml :(
                  <dependency>
                      <groupId>org.apache.pig</groupId>
                      <artifactId>pigunit</artifactId>
                      <version>0.10.0</version>
                  </dependency>

   • Alternatives:
           •      Publish to your local Artifactory instance
           •      Use a local file-based <repository>
           •      Use a <system> scope in pom.xml (not recommended)
           •      Use trusted third-party repos like Cloudera’s
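
   • A local file-based repository might look like this in pom.xml (the
     repository id, directory name, and layout are hypothetical):

```xml
<!-- Hypothetical file-based repository checked into the project tree -->
<repositories>
  <repository>
    <id>local-thirdparty</id>
    <url>file://${project.basedir}/thirdparty-repo</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pigunit</artifactId>
    <version>0.10.0</version>
  </dependency>
</dependencies>
```

     One way to populate such a repository is the maven-install-plugin's
     install-file goal with its localRepositoryPath parameter, which writes
     into a chosen directory instead of ~/.m2.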


Verisign Public       Artifactory is a trademark of JFrog Ltd.
A simple PigUnit test




A simple PigUnit test

   • Here, we provide input + output data in the Java code
   • Pig script is read from file wordcount.pig
        @Test
        public void testSimpleExample() throws Exception {
            PigTest simpleTest = new PigTest("wordcount.pig");

            String[] input = { "foo", "bar", "foo" };
            String[] expectedOutput = {
                "(foo,2)",
                "(bar,1)"
            };

            simpleTest.assertOutput(
                "aliasInput", input,
                "aliasOutput", expectedOutput
            );
        }
A simple PigUnit test (cont.)

   • wordcount.pig
        -- PigUnit populates the alias 'aliasInput'
        -- with the test input data
        aliasInput = LOAD '<tmpLoc>' AS <schema>;

        -- ...here comes your actual code...

        -- PigUnit will treat the contents of the alias
        -- 'aliasOutput' as the actual output data in
        -- the assert statement
        aliasOutput = <your_final_statement>;

        -- Note: PigUnit ignores STORE operations by default
        STORE aliasOutput INTO 'output';
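
   • Filled in, a minimal wordcount.pig matching the test above might look
     like this (the tokenization shown is one possible implementation;
     PigUnit substitutes the LOAD location at test time):

```pig
-- PigUnit overrides this LOAD with its temporary input file
aliasInput = LOAD '/path/to/input' AS (line:chararray);

-- split each line into words and count occurrences per word
words = FOREACH aliasInput GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
aliasOutput = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;

-- ignored by PigUnit
STORE aliasOutput INTO 'output';
```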




A simple PigUnit test (cont.)
                    simpleTest.assertOutput(
        1               "aliasInput", input,
        2               "aliasOutput", expectedOutput
                    );



        1          Pig injects input[] = { "foo", "bar", "foo" } into the
                   alias named aliasInput in the Pig script.
                   For this purpose Pig creates a temporary file, writes the
                   equivalent of StringUtils.join(input, "\n") to the file,
                   and finally makes its location available to the LOAD operation.


        2          Pig opens an iterator on the content of aliasOutput, and runs
                   assertEquals() based on StringUtils.join(..., "\n")
                   with expectedOutput and the actual content.

            See o.a.p.pigunit.{PigTest, Cluster} and o.a.p.test.Util.

PigUnit drawbacks

• How to divide your “main” Pig script into testable units?
       • Only run a single end-to-end test for the full script?
       • Extract testable snippets from the main script?
                  • Argh, code duplication!
       • Split the main script into logical units = smaller scripts; then run
         individual tests and include the smaller scripts in the main script
                  • Ok-ish but splitting too much makes the Pig code hard to
                    understand (too many trees, no forest).
• PigUnit is a nice tool but batteries are not included
       • It does work but it is not as convenient or powerful as you’d like.
                  • Notably you still need to know and write Java to use it. But one
                    compelling reason for Pig is that you can do without Java.
       • You may end up writing your own wrapper/helper lib around it.
                  • Consider contributing this back to the Apache Pig project!
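
• One way to split a script into testable units without plain copy/paste is
  Pig's macro support (added in Pig 0.9, so available in 0.10; all file and
  alias names below are hypothetical):

```pig
-- wordcount_macros.pig: the reusable, individually testable unit
DEFINE word_count(lines) RETURNS counts {
    words   = FOREACH $lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    $counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;
};

-- main.pig: pulls the unit back into the "main" script
IMPORT 'wordcount_macros.pig';
lines  = LOAD '/path/to/input' AS (line:chararray);
counts = word_count(lines);
```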


Tips and tricks




Connecting to a real cluster (default: local mode)

      // this is not enough to enable cluster mode in PigUnit
      pigServer = new PigServer(ExecType.MAPREDUCE);
      // ...do PigUnit stuff...

      // rather:
      Properties props = System.getProperties();
      if (clusterMode)
          props.setProperty("pigunit.exectype.cluster", "true");
      else
          props.remove("pigunit.exectype.cluster");

   • $HADOOP_CONF_DIR must be in CLASSPATH
   • Similar approach for enabling LZO support
            • mapred.output.compress => "true"
            • mapred.output.compression.codec => "c.h.c.lzo.LzopCodec"



Write a convenient PigUnit runner for your users

   • Pig user != Java developer
   • Pig users should only need to provide three files:
           •    pig/myscript.pig
           • input/testdata.txt
           • output/expected.txt
   • PigUnit runner discovers and runs tests for users
           • PigTest#assertOutput() can also handle files
           • But you must manage file uploads and similar “glue” yourself

      pigUnitRunner.runPigTest(
          new Path(scriptFile),
          new Path(inputFile),
          new Path(expectedOutputFile)
      );
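
   • A minimal sketch of such test discovery, assuming the pig/input/output
     layout above and matching base names per test (both conventions are
     assumptions; the actual PigTest#assertOutput() call is omitted):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class PigUnitTestDiscovery {

    /** Returns the base names (e.g. "myscript") for which all three files exist. */
    public static List<String> discover(File baseDir) {
        List<String> testNames = new ArrayList<>();
        File[] scripts = new File(baseDir, "pig").listFiles((dir, name) -> name.endsWith(".pig"));
        if (scripts == null) return testNames;
        for (File script : scripts) {
            String base = script.getName().replaceFirst("\\.pig$", "");
            File input = new File(baseDir, "input/" + base + ".txt");
            File expected = new File(baseDir, "output/" + base + ".txt");
            if (input.isFile() && expected.isFile()) {
                testNames.add(base);
                // here the runner would invoke PigUnit on the (script, input, expected) triple
            }
        }
        return testNames;
    }

    public static void main(String[] args) throws Exception {
        // Create a throwaway layout to demonstrate discovery
        File base = new File(System.getProperty("java.io.tmpdir"), "pigunit-demo");
        new File(base, "pig").mkdirs();
        new File(base, "input").mkdirs();
        new File(base, "output").mkdirs();
        new File(base, "pig/wordcount.pig").createNewFile();
        new File(base, "input/wordcount.txt").createNewFile();
        new File(base, "output/wordcount.txt").createNewFile();
        System.out.println(discover(base));
    }
}
```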


Slightly off-topic: Java/Pig combo

   • Pig API provides nifty features to control Pig workflows
     through Java
           • Similar to how working with PigUnit feels
   • Definitely worth a look!
   // 'pigParams' is the main glue between Java and Pig here,
   // e.g. to specify the location of input data
   pigServer.registerScript(scriptInputStream, pigParams);

   ExecJob job = pigServer.store(
           "aliasOutput",
           "/path/to/output",
           "PigStorage()"
       );

   if (job != null && job.getStatus() == JOB_STATUS.COMPLETED)
       System.out.println("Happy world!");

Thank You




© 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and
designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United
States and in foreign countries. All other trademarks are property of their respective owners.

Making Hadoop based analytics simple for everyone to use
 
A real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operatorA real life project using Cassandra at a large Swiss Telco operator
A real life project using Cassandra at a large Swiss Telco operator
 
Data Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2CData Analytics – B2B vs. B2C
Data Analytics – B2B vs. B2C
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Closing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data AnalysisClosing The Loop for Evaluating Big Data Analysis
Closing The Loop for Evaluating Big Data Analysis
 
Big Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companiesBig Data and Data Science for traditional Swiss companies
Big Data and Data Science for traditional Swiss companies
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Educating Data Scientists of the Future
Educating Data Scientists of the FutureEducating Data Scientists of the Future
Educating Data Scientists of the Future
 
Unleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data WarehouseUnleash the power of Big Data in your existing Data Warehouse
Unleash the power of Big Data in your existing Data Warehouse
 
Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?
 
Project "Babelfish" - A data warehouse to attack complexity
 Project "Babelfish" - A data warehouse to attack complexity Project "Babelfish" - A data warehouse to attack complexity
Project "Babelfish" - A data warehouse to attack complexity
 
Brainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density ChoiceBrainserve Datacenter: the High-Density Choice
Brainserve Datacenter: the High-Density Choice
 
Urturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maketUrturn on AWS: scaling infra, cost and time to maket
Urturn on AWS: scaling infra, cost and time to maket
 
The World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC DatagridThe World Wide Distributed Computing Architecture of the LHC Datagrid
The World Wide Distributed Computing Architecture of the LHC Datagrid
 
New opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph databaseNew opportunities for connected data : Neo4j the graph database
New opportunities for connected data : Neo4j the graph database
 
Technology Outlook - The new Era of computing
Technology Outlook - The new Era of computingTechnology Outlook - The new Era of computing
Technology Outlook - The new Era of computing
 
In-Store Analysis with Hadoop
In-Store Analysis with HadoopIn-Store Analysis with Hadoop
In-Store Analysis with Hadoop
 
Big Data Visualization With ParaView
Big Data Visualization With ParaViewBig Data Visualization With ParaView
Big Data Visualization With ParaView
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 

Último

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Último (20)

How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Practical Pig and PigUnit (Michael Noll, Verisign)

Testing Pig – a primer
“Testing” Pig scripts – some examples

           DESCRIBE | EXPLAIN | ILLUSTRATE | DUMP

           $ pig -x local
           $ pig [-debug | -dryrun]
           $ pig -param input=/path/to/small-sample.txt
“Testing” Pig scripts (cont.)

   • JobTracker UI
   • PigStats, JobStats, HadoopJobHistoryLoader
   • Also: inspecting Hadoop log files, …

   Now what have you been using?
However…

   • Previous approaches are primarily useful (and used) for creating the
     Pig script in the first place
           • Like ILLUSTRATE
   • None of them are really geared towards unit testing
           • Difficult to automate (think: production environment)

                  #!/bin/bash
                  pig -param date=$1 -param output=$2 myscript.pig
                  hadoop fs -copyToLocal $2 /tmp/jobresult
                  if [ ARGH!!! ] ...

           • Difficult to integrate into a typical development workflow,
             e.g. one backed by Maven, Java and a CI server

                  $ mvn clean test   ??
PigUnit

   • Available in Pig since version 0.8

     “PigUnit provides a unit-testing framework that plugs into JUnit to
     help you write unit tests that can be run on a regular basis.”
     -- Alan F. Gates, Programming Pig

   • Easy way to add Pig unit testing to your dev workflow iff you are a
     Java developer
           • See “Tips and Tricks” later for working around this constraint
   • Works with both JUnit and TestNG
   • PigUnit docs have “potential”
           • Some basic examples, then it’s looking at the source code of
             both PigUnit and Pig (but it’s manageable)
           • http://pig.apache.org/docs/r0.10.0/test.html#pigunit
Getting PigUnit up and running

   • PigUnit is not included in current Pig releases :(
   • You must manually build the PigUnit jar file

           $ cd /path/to/pig-sources   # can be a release tarball
           $ ant jar pigunit-jar
           ...
           $ ls -l pig*jar
           -rw-r--r-- 1 mnoll mnoll 17768497 ... pig.jar
           -rw-r--r-- 1 mnoll mnoll   285627 ... pigunit.jar

   • Add these jar(s) to your CLASSPATH, done!
PigUnit and Maven

   • Unfortunately the Apache Pig project does not yet publish an official
     Maven artifact for PigUnit

           <!-- WILL NOT WORK in pom.xml :( -->
           <dependency>
             <groupId>org.apache.pig</groupId>
             <artifactId>pigunit</artifactId>
             <version>0.10.0</version>
           </dependency>

   • Alternatives:
           • Publish to your local Artifactory instance
           • Use a local file-based <repository>
           • Use a <system> scope in pom.xml (not recommended)
           • Use trusted third-party repos like Cloudera’s
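Of the alternatives above, the local file-based `<repository>` needs the least infrastructure. A minimal sketch, assuming you have installed the manually built pigunit.jar into a `repo/` directory inside your project (the repository id and the `repo/` location are assumptions for illustration, not an official setup):

```xml
<!-- In pom.xml: resolve the hand-built pigunit.jar from a repo/
     directory checked into the project (hypothetical layout). -->
<repositories>
  <repository>
    <id>project-local</id>
    <url>file://${project.basedir}/repo</url>
  </repository>
</repositories>
```

The jar must first be laid out in Maven's repository directory structure, e.g. with the install plugin: `mvn install:install-file -Dfile=pigunit.jar -DgroupId=org.apache.pig -DartifactId=pigunit -Dversion=0.10.0 -Dpackaging=jar -DlocalRepositoryPath=repo`. After that, the `<dependency>` coordinates shown above resolve from `repo/`.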
A simple PigUnit test
A simple PigUnit test

   • Here, we provide input + output data in the Java code
   • Pig script is read from file wordcount.pig

           @Test
           public void testSimpleExample() throws Exception {
               PigTest simpleTest = new PigTest("wordcount.pig");

               String[] input = { "foo", "bar", "foo" };
               String[] expectedOutput = { "(foo,2)", "(bar,1)" };

               simpleTest.assertOutput(
                   "aliasInput", input,
                   "aliasOutput", expectedOutput
               );
           }
A simple PigUnit test (cont.)

   • wordcount.pig

           -- PigUnit populates the alias 'aliasInput'
           -- with the test input data
           aliasInput = LOAD '<tmpLoc>' AS <schema>;

           -- ...here comes your actual code...

           -- PigUnit will treat the contents of the alias
           -- 'aliasOutput' as the actual output data in
           -- the assert statement
           aliasOutput = <your_final_statement>;

           -- Note: PigUnit ignores STORE operations by default
           STORE aliasOutput INTO 'output';
A simple PigUnit test (cont.)

           simpleTest.assertOutput(
               "aliasInput", input,           // (1)
               "aliasOutput", expectedOutput  // (2)
           );

   (1) PigUnit injects input[] = { "foo", "bar", "foo" } into the alias
       named aliasInput in the Pig script. For this purpose it creates a
       temporary file, writes the equivalent of
       StringUtils.join(input, "\n") to the file, and finally makes its
       location available to the LOAD operation.

   (2) PigUnit opens an iterator on the content of aliasOutput, and runs
       assertEquals() based on StringUtils.join(..., "\n") with
       expectedOutput and the actual content.

   See o.a.p.pigunit.{PigTest, Cluster} and o.a.p.test.Util.
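The mechanism behind (1) and (2) can be sketched with plain JDK classes. This is an illustration of the idea only, not PigUnit's actual implementation; the class and method names below are made up for the sketch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class InjectionSketch {

    // (1) Write the test input to a temp file, one record per line --
    // roughly what PigUnit does before pointing LOAD at the file.
    static Path writeInput(String[] input) throws IOException {
        Path tmp = Files.createTempFile("pigunit-input-", ".txt");
        Files.write(tmp, Arrays.asList(input));
        return tmp;
    }

    // (2) Compare actual output records against the expected ones by
    // joining both sides with newlines, as the assert step does.
    static boolean outputMatches(List<String> actual, String[] expected) {
        return String.join("\n", actual).equals(String.join("\n", expected));
    }

    public static void main(String[] args) throws IOException {
        Path in = writeInput(new String[] { "foo", "bar", "foo" });
        System.out.println(Files.readAllLines(in)); // [foo, bar, foo]
        System.out.println(outputMatches(
                Arrays.asList("(foo,2)", "(bar,1)"),
                new String[] { "(foo,2)", "(bar,1)" })); // true
    }
}
```

Because the comparison is a plain string equality on the joined records, the order of the output records matters, which is why sorted output (as in the ORDER BY example earlier) is easy to assert on.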
PigUnit drawbacks

   • How to divide your “main” Pig script into testable units?
           • Only run a single end-to-end test for the full script?
           • Extract testable snippets from the main script?
                  • Argh, code duplication!
           • Split the main script into logical units = smaller scripts;
             then run individual tests and include the smaller scripts in
             the main script
                  • Ok-ish, but splitting too much makes the Pig code hard
                    to understand (too many trees, no forest)
   • PigUnit is a nice tool, but batteries are not included
           • It does work, but it is not as convenient or powerful as
             you’d like.
           • Notably you still need to know and write Java to use it. But
             one compelling reason for Pig is that you can do without Java.
           • You may end up writing your own wrapper/helper lib around it.
                  • Consider contributing this back to the Apache Pig
                    project!
Connecting to a real cluster (default: local mode)

           // this is not enough to enable cluster mode in PigUnit
           pigServer = new PigServer(ExecType.MAPREDUCE);
           // ...do PigUnit stuff...

           // rather:
           Properties props = System.getProperties();
           if (clusterMode)
               props.setProperty("pigunit.exectype.cluster", "true");
           else
               props.remove("pigunit.exectype.cluster");

   • $HADOOP_CONF_DIR must be in CLASSPATH
   • Similar approach for enabling LZO support
           • mapred.output.compress => "true"
           • mapred.output.compression.codec => "c.h.c.lzo.LzopCodec"
Write a convenient PigUnit runner for your users

   • Pig user != Java developer
   • Pig users should only need to provide three files:
           • pig/myscript.pig
           • input/testdata.txt
           • output/expected.txt
   • PigUnit runner discovers and runs tests for users
           • PigTest#assertOutput() can also handle files
           • But you must manage file uploads and similar “glue” yourself

           pigUnitRunner.runPigTest(
               new Path(scriptFile),
               new Path(inputFile),
               new Path(expectedOutputFile)
           );
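The test-discovery half of such a runner can be sketched with plain JDK file APIs, no Pig dependency needed. The three-file layout (`pig/`, `input/testdata.txt`, `output/expected.txt`) comes from the slide; everything else here, including the assumption that each test lives in its own subdirectory, is a hypothetical design for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

public class PigTestDiscovery {

    // One discovered test: the Pig script plus its input and expected output.
    record TestCase(Path script, Path input, Path expectedOutput) {}

    // Scan every subdirectory of testsRoot for the layout
    //   <dir>/pig/<name>.pig, <dir>/input/testdata.txt,
    //   <dir>/output/expected.txt
    // Directories missing any of the three files are skipped.
    static List<TestCase> discover(Path testsRoot) throws IOException {
        List<TestCase> cases = new ArrayList<>();
        try (Stream<Path> dirs = Files.list(testsRoot)) {
            for (Path dir : dirs.filter(Files::isDirectory).sorted().toList()) {
                Optional<Path> script;
                try (Stream<Path> pigs = Files.list(dir.resolve("pig"))) {
                    script = pigs.filter(p -> p.toString().endsWith(".pig"))
                                 .findFirst();
                } catch (IOException e) {
                    continue; // no pig/ directory in this test dir
                }
                Path input = dir.resolve("input/testdata.txt");
                Path expected = dir.resolve("output/expected.txt");
                if (script.isPresent() && Files.exists(input)
                        && Files.exists(expected)) {
                    cases.add(new TestCase(script.get(), input, expected));
                }
            }
        }
        return cases;
    }
}
```

The runner would then loop over the discovered cases and hand each triple to something like the `runPigTest(...)` call above; that glue (and any file uploads to HDFS) stays the runner's job.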
Slightly off-topic: Java/Pig combo

   • Pig API provides nifty features to control Pig workflows through Java
           • Similar to how working with PigUnit feels
   • Definitely worth a look!

           // 'pigParams' is the main glue between Java and Pig here,
           // e.g. to specify the location of input data
           pigServer.registerScript(scriptInputStream, pigParams);

           ExecJob job = pigServer.store(
               "aliasOutput", "/path/to/output", "PigStorage()"
           );
           if (job != null && job.getStatus() == JOB_STATUS.COMPLETED)
               System.out.println("Happy world!");
Thank You

   © 2012 VeriSign, Inc. All rights reserved. VERISIGN and other
   trademarks, service marks, and designs are registered or unregistered
   trademarks of VeriSign, Inc. and its subsidiaries in the United States
   and in foreign countries. All other trademarks are property of their
   respective owners.