Apache Pig Power Tools
a quick tour
By
Viswanath Gangavaram
Data Scientist
R&D, DSG, Ilabs, [24] 7 INC
What we are going to cover
 A very short introduction to Apache Pig
 The Grunt Shell: An interactive shell to write and execute Pig-Latin and to access HDFS
 Advanced Pig Relational Operators
 Built-in functions
 User defined functions
 DEFINE(UDFs, Streaming, Macros)
 UDFs Vs. Pig Streaming
 JSON Parsing
 Single Row relation
 Real Python in Pig (nltk, numpy, scipy, etc.)
 Embedding Pig Latin in Python for iterative processing
 Hadoop Globbing
 Hue:- Hadoop ecosystem in the Browser
 External Libraries:- Piggybank, DataFu, DataFu Hour Glass, SimpleJson, ElephantBird
A very short introduction to Apache Pig
Pig provides a higher level of abstraction for data users, giving them access to the power and
flexibility of Hadoop without requiring them to write extensive data processing applications
in low-level Java code (MapReduce code). (From the preface of “Programming Pig”)
A very short introduction to Apache Pig
[Figure: Apache Pig in the Hadoop 1.0 ecosystem]
[Figure: Apache Pig execution life cycle]
A very short introduction to Apache Pig
• Apache Pig is a high-level platform for executing data flows in parallel on Hadoop. The language for this
platform is called Pig Latin, which includes operators for many of the traditional data operations (join,
sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and
writing data.
– Pigs fly
• Pig processes data quickly. Designers want to consistently improve its performance, and not
implement features in ways that weigh pig down so it can't fly.
• What does it mean to be Pig?
– Pigs Eat Everything
• Pig can operate on data whether it has metadata or not. It can operate on data that is
relational, nested, or unstructured. And it can easily be extended to operate on data beyond
files, including key/value stores, databases, etc.
– Pigs Live Everywhere
• Pig is intended to be a language for parallel data processing. It is not tied to one
particular parallel framework. Check for Pig on Tez
– Pigs Are Domestic Animals
• Pig is designed to be easily controlled and modified by its users.
• Pig allows integration of user code wherever possible, so it currently supports user defined
field transformation functions, user defined aggregates, and user defined conditionals.
• Pig supports user provided load and store functions.
• It supports external executables via its stream command and Map Reduce jars via its
MapReduce command.
• It allows users to provide a custom partitioner for their jobs in some circumstances and to set
the level of reduce parallelism for their jobs.
Why do we need to embrace this sort of philosophy?
Because that’s the reality.
Apache Pig “Word counting:- The hello world of MapReduce”
inputFile = LOAD 'mary' USING TextLoader() AS (line);
words = FOREACH inputFile GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
DUMP cntd;
Output:-
(This, 2)
(is, 2)
(my, 2)
(first, 2)
(apache, 2)
(pig, 2)
(program, 2)
“mary” file content:-
This is my first apache pig program
This is my first apache pig program
Apache Pig Latin: A data flow language
• Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs
should be read, processed, and then stored to one or more outputs in parallel.
• To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges
are data flows and the nodes are operators that process the data.
Comparing query languages (Hive/SQL) and data flow languages (Pig)
• After a cursory look, people often say that Pig Latin is a procedural version of SQL. Although there are
certainly similarities, there are more differences. SQL is a query language. Its focus is to allow users to
form queries. It allows users to describe what question they want answered, but not how they want it
answered. In Pig Latin, on the other hand, the user describes exactly how to process the input data.
• Another major difference is that SQL is oriented around answering one question. When users want to do
several data operations together, they must either write separate queries, storing the intermediate data
into temporary tables, or write it in one query using subqueries inside that query to do the earlier steps of
the processing. However, many SQL users find subqueries confusing and difficult to form properly. Also,
using subqueries creates an inside-out design where the first step in the data pipeline is the innermost
query.
• Pig, however, is designed with a long series of data operations in mind, so there is no need to write the
data pipeline in an inverted set of subqueries or to worry about storing data in temporary tables.
• SQL is the English of data processing. It has the nice feature that everyone and every tool knows it, which
means the barrier to adoption is very low. Our goal is to make Pig Latin the native language of parallel
data-processing systems such as Hadoop. It may take some learning, but it will allow users to utilize the
power of Hadoop much more fully. - Extracted from “Programming Pig”
Pig’s Data types
 Scalar types
• int, long, float, double, chararray, bytearray
 Complex types
• Map
– A map in Pig is a chararray to data element mapping, where that element can be any Pig
type, including a complex type.
– The chararray is called a key and is used as an index to find the element, referred to as the
value.
– Map constants are formed using brackets to delimit the map, a hash between keys and
values, and a comma between key-value pairs.
» ['dept'#'dsg', 'team'#'r&d']
• Tuple
– A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into
fields, with each field containing one data element. These elements can be of any type.
– Tuple constants use parentheses to indicate the tuple and commas to delimit fields in
the tuple.
» ('boss', 55)
• Bag
– A bag is an unordered collection of tuples.
– Bag constants are constructed using braces, with the tuples in the bag separated by
commas.
» { ('a', 20), ('b', 20), ('c', 30) }
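A minimal sketch of declaring these complex types in a LOAD schema and drilling into them (file name and fields are hypothetical):

emp = LOAD 'employees.txt'
      AS (info:map[], name_age:(name:chararray, age:int), scores:{t:(subject:chararray, marks:int)});
proj = FOREACH emp GENERATE
       info#'dept' AS dept,                  -- map lookup by key
       name_age.name AS name,                -- tuple field by name
       FLATTEN(scores) AS (subject, marks);  -- un-nests the bag into one row per tuple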
Running Pig
One can run Pig (execute Pig Latin statements and Pig commands) using various modes.
                                          Local Mode   MapReduce Mode
Interactive Mode (Grunt shell):
Pig Latin statements and Pig commands        Yes            Yes
Batch Mode                                   Yes            Yes
Execution Modes:-
 Local Mode
 To run Pig in local mode, you need access to a single machine; all files are installed and run using
your local host and file system. Specify local mode using the -x flag (pig -x local).
 MapReduce Mode
 To run Pig in MapReduce mode, you need access to a Hadoop cluster and HDFS installation.
MapReduce mode is the default mode; you can, but do not need to, specify it using the -x flag (pig or
pig -x mapreduce).
/* local mode */
pig -x local …
java -cp pig.jar org.apache.pig.Main -x local …
/* mapreduce mode */
pig or pig -x mapreduce …
java -cp pig.jar org.apache.pig.Main ...
java -cp pig.jar org.apache.pig.Main -x mapreduce ...
Relational Operators
 LOAD
 Loads data from the file system.
 LOAD 'data' [USING function] [AS schema];
 If you specify a directory name, all the files in the directory are loaded.
 A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
 STORE
 Stores or saves results to the file system.
 STORE alias INTO 'directory' [USING function];
 A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
 STORE A INTO 'myoutput' USING PigStorage('*');
 LIMIT
 Limits the number of output tuples.
 alias = LIMIT alias n;
 A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
 B = LIMIT A 5;
 FILTER
 Selects tuples from a relation based on some condition.
 alias = FILTER alias BY expression;
 A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
 B = FILTER A BY f2 > 2;
 DISTINCT
 Removes duplicate tuples in a relation.
 alias = DISTINCT alias [PARTITION BY partitioner] [PARALLEL n];
 A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
 B = DISTINCT A;
 DUMP
 Dumps or displays results to screen.
 DUMP alias;
 A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
 DUMP A;
 ORDER BY
 Sorts a relation based on one or more fields.
 alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] }
[PARALLEL n];
 A = LOAD 't.txt' USING PigStorage('\t') AS (f1:int, f2:int);
 B = ORDER A BY f2;
 DUMP B;
 UNION
 Computes the union of two or more relations.
 alias = UNION [ONSCHEMA] alias, alias [, alias …];
 L1 = LOAD 'f1' AS (a : int, b : float);
 L2 = LOAD 'f1' AS (a : long, c : chararray);
 U = UNION ONSCHEMA L1, L2;
 DESCRIBE U ;
 U : {a : long, b : float, c : chararray}
 FOREACH
 Generates data transformations based on columns of data.
 alias = FOREACH { block | nested_block };
 X = FOREACH A GENERATE f1;
 X = FOREACH B {
 S = FILTER A BY 'xyz' == '3';
 GENERATE COUNT (S.$0);
}
 CROSS
 Computes the cross product of two or more relations.
 alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
 A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
 B = LOAD 'data2' AS (b1:int,b2:int);
 X = CROSS A, B;
 (CO)GROUP
 Groups the data in one or more relations.
 The GROUP and COGROUP operators are identical.
 alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected'
| 'merge'] [PARTITION BY partitioner] [PARALLEL n];
 A = load 'student' AS (name:chararray, age:int, gpa:float);
 B = GROUP A BY age;
 DUMP B;
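With hypothetical student rows (john,18,4.0F), (mary,19,3.8F), (bill,20,3.9F) and (joe,18,3.8F), DUMP B prints one tuple per distinct age, with the matching input tuples collected into a bag:
(18,{(john,18,4.0F),(joe,18,3.8F)})
(19,{(mary,19,3.8F)})
(20,{(bill,20,3.9F)})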
 Join(Inner)
 Performs an inner join of two or more relations based on common field values.
 alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY
{expression|'('expression [, expression …]')'} …) [USING 'replicated' | 'skewed' | 'merge' |
'merge-sparse'] [PARTITION BY partitioner] [PARALLEL n];
 A = load 'mydata';
 B = load 'mydata';
 C = join A by $0, B by $0;
 DUMP C;
 Join(Outer)
 Performs an outer join of two relations based on common field values.
 alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-
alias-column [USING 'replicated' | 'skewed' | 'merge'] [PARTITION BY partitioner]
[PARALLEL n];
 A = LOAD 'a.txt' AS (n:chararray, a:int);
 B = LOAD 'b.txt' AS (n:chararray, m:chararray);
 C = JOIN A by $0 LEFT OUTER, B BY $0;
 DUMP C;
The Grunt Shell: An interactive shell to write and execute Pig-Latin and to access HDFS
 Shell commands
 fs
 Invokes any FsShell command from within a Pig script or the Grunt shell.
 fs -mkdir /tmp
 fs -copyFromLocal file-x file-y
 fs -ls file-y
 sh
 Invokes any sh shell command from within a Pig script or the Grunt shell.
 ls
 pwd
 Utility commands
 clear
 exec
 help
 history
 kill
 exec
 Run a Pig script.
 exec [-param param_name = param_value] [-param_file file_name] [script]
 Use the exec command to run a Pig script with no interaction between the script and the
Grunt shell (batch mode).
 Aliases defined in the script are not available to the shell.
 run
 Run a Pig script.
 run [-param param_name = param_value] [-param_file file_name] script
 Unlike exec, run executes in interactive mode: the script runs in the context of the Grunt
shell, so aliases defined in the script remain available to the shell afterward.
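A minimal Grunt session contrasting the two (script and alias names hypothetical):
grunt> exec wordcount.pig     /* batch: aliases defined inside stay invisible */
grunt> run wordcount.pig      /* interactive: its aliases join the session */
grunt> top10 = LIMIT cntd 10; /* works after run (cntd came from the script), fails after exec */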
Advanced Relational Operators
 Splitting Data into Training and Testing Dataset
 SPLIT
 SPLIT users into kids if age < 18, adults if age >= 18 and age < 65, seniors otherwise;
 SPLIT data into testing if RANDOM() <= 0.10, training otherwise;
 The SPLIT operator cannot handle nondeterministic functions (such as RANDOM).
 Thus the second statement above won’t work and will raise an error; the macro below works around this by first materializing RANDOM() as a field:
DEFINE split_into_training_testing(inputData, split_percentage)
RETURNS training, testing{
data = foreach $inputData generate RANDOM() as random_assignment, *;
SPLIT data into testing_data if random_assignment <= $split_percentage, training_data otherwise;
$training = foreach training_data generate $1..;
$testing = foreach testing_data generate $1..;
};
inData = LOAD 'some_files.txt' USING PigStorage('\t');
training, testing = split_into_training_testing(inData, 0.1);
Syntax for Macro definition:-
DEFINE macro_name (param [, param ...]) RETURNS {void | alias [, alias ...]} { pig_latin_fragment };
Syntax for Macro expansion:-
alias [, alias ...] = macro_name (param [, param ...]) ;
 ASSERT
 Assert a condition on the data.
 ASSERT alias BY expression [, message];
 A = LOAD 'data' AS (a0:int,a1:int,a2:int);
 ASSERT A by a0 > 0, 'a0 should be greater than 0';
 CUBE
 Performs cube/rollup operations.
 alias = CUBE alias BY { CUBE expression | ROLLUP expression }, [ CUBE expression | ROLLUP
expression ] [PARALLEL n];
 cubedinp = CUBE salesinp BY CUBE(product,year);
 rolledup = CUBE salesinp BY ROLLUP(region,state,city);
 cubed_and_rolled = CUBE salesinp BY CUBE(product,year), ROLLUP(region, state, city);
 SAMPLE
 Selects a random sample of data; the sample size is a fraction between 0 and 1.
 SAMPLE alias size;
 A = LOAD 'data' AS (f1:int,f2:int,f3:int);
 X = SAMPLE A 0.01;
 RANK
 Returns each tuple with the rank within a relation.
 alias = RANK alias [ BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] }
[DENSE] ];
 B = rank A;
 C = rank A by f1 DESC, f2 ASC;
 C = rank A by f1 DESC, f2 ASC DENSE;
 MAPREDUCE
 Executes native MapReduce jobs inside a Pig script.
 alias1 = MAPREDUCE 'mr.jar' STORE alias2 INTO 'inputLocation' USING storeFunc LOAD
'outputLocation' USING loadFunc AS schema [`params, ... `];
 A = LOAD 'WordcountInput.txt';
 B = MAPREDUCE 'wordcount.jar' STORE A INTO 'inputDir' LOAD 'outputDir' AS
(word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;
 IMPORT
 Import macros defined in a separate file.
 IMPORT 'file-with-macro';
 STREAM
 Sends data to an external script or program.
 alias = STREAM alias [, alias …] THROUGH {`command` | cmd_alias } [AS schema] ;
 A = LOAD 'data';
 B = STREAM A THROUGH `perl stream.pl -n 5`;
DEFINE:- UDFs, Streaming
 Assigns an alias to a UDF or streaming command.
 DEFINE alias {function | [`command` [input] [output] [ship] [cache] [stderr] ] };
DEFINE CMD `perl PigStreaming.pl - nameMap` input(stdin using PigStreaming(',')) output(stdout
using PigStreaming(','));
A = LOAD 'file';
B = STREAM A THROUGH CMD;
DEFINE CMD `script` ship('/a/b/script');
OP = STREAM IP THROUGH CMD;
DEFINE Y `stream.pl data.gz` SHIP('/work/stream.pl') CACHE('/input/data.gz#data.gz');
X = STREAM A THROUGH Y;
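Per the Pig documentation, SHIP copies a file from the client machine to the cluster for the job’s tasks, while CACHE points tasks at a file already stored in HDFS; in the CACHE argument, the name after # is the symlink the task sees.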
Built-in functions
 Eval functions
 AVG
 CONCAT
 COUNT
 COUNT_STAR
 Math functions
 ABS
 SQRT
 Etc …
 STRING functions
 ENDSWITH
 TRIM
 …
 Datetime functions
 AddDuration
 GetDay
 GetHour
 …
 Dynamic Invokers
 DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
 encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
 decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
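Here InvokeForString wraps the static Java method java.net.URLDecoder.decode as a chararray-returning UDF; the second argument ('String String') lists the method’s parameter types. Sibling invokers (InvokeForInt, InvokeForLong, InvokeForFloat, InvokeForDouble) cover other return types.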
User Defined functions
Single row relations
a = load 'a.txt';
b = group a all;
c = foreach b generate COUNT(a) as sum;
d = order a by $0;
e = limit d c.sum/100;
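Because b is grouped on ALL, c holds exactly one row, so c.sum can be referenced as a scalar in a later statement; e therefore keeps the first 1% of the rows of the ordered relation d.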
Real Python in Pig (nltk, numpy, scipy, etc.)
from pig_util import outputSchema
import nltk
import sys
import platform
from nltk.stem.lancaster import LancasterStemmer
@outputSchema("as:int")
def square(num):
if num == None:
return None
return ((num) * (num))
@outputSchema("word:chararray")
def returnString(word):
st = LancasterStemmer()
return st.stem('maximum') + 't'+ word +'t'+ word + 't' + platform.python_version()
@outputSchema("word:chararray")
def wordSteming(word):
st = LancasterStemmer()
return st.stem(word)
register 'streamingPython.py' using streaming_python as myfuncs;
a = LOAD 't.txt' as (a:chararray, b:chararray);
b = foreach a generate myfuncs.returnString('this is pig; this is weird') , myfuncs.square(25);
DUMP b;
Embedding Pig Latin in Python for iterative processing
 To enable control flow, you can embed Pig Latin statements and Pig commands in the Python, JavaScript and
Groovy scripting languages using a JDBC-like compile, bind, run model.
 DEMO
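A minimal sketch of that compile/bind/run model in Python, run under Jython with pig script.py (the Pig Latin fragment and file names are placeholders):

#!/usr/bin/python
from org.apache.pig.scripting import Pig

# Compile the parameterized pipeline once.
P = Pig.compile("""
A = LOAD '$input' AS (line:chararray);
-- ... one iteration of the real data flow goes here ...
STORE A INTO '$output';
""")

input = 'original_data'
for i in range(10):  # fixed iteration count, just for the sketch
    output = 'out-' + str(i)
    result = P.bind({'input': input, 'output': output}).runSingle()
    if not result.isSuccessful():
        raise RuntimeError('Pig job failed at iteration %d' % i)
    input = output   # this iteration's output feeds the next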
Hadoop Globbing
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
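Because LOAD hands its path straight to Hadoop’s FileSystem.globStatus, glob patterns work directly in Pig paths (directory layout hypothetical):
logs = LOAD '/logs/2014/0{3,4}/*/part-*' USING PigStorage('\t');  -- every part file for March and April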
Hue:- Hadoop ecosystem in the Browser
Pig’s Debugging Operators
 d alias - shortcut for DUMP. If alias is omitted, the last defined alias is used.
 de alias - shortcut for DESCRIBE. If alias is omitted, the last defined alias is used.
 e alias - shortcut for EXPLAIN. If alias is omitted, the last defined alias is used.
 i alias - shortcut for ILLUSTRATE. If alias is omitted, the last defined alias is used.
 q - quit the Grunt shell.
 Use the DUMP operator to display results to your terminal screen.
 Use the DESCRIBE operator to review the schema of a relation.
 Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to
compute a relation.
 Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.
Shortcuts for Debugging Operators
Resources
 Introduction to Apache Pig by Adam Kawa
 Apache DataFu(incubating)
 Building Data Products at LinkedIn with DataFu
 A Brief tour of DataFu
 Pig Fundamentals
 Building a high level dataflow system on top of MapReduce: The Pig Experience
 Pig Hive Cascading
 Developing Pig on Apache Tez
 How to make your map-reduce jobs perform as well as Pig: Lessons from Pig optimizations
 Apache Pig: Macro for splitting data into training and testing dataset
Resources
• Programming Pig
– http://chimera.labs.oreilly.com/books/1234000001811/index.html
• Apache Pig’s Official Documentation
– http://pig.apache.org/docs/r0.12.1/
• Pig Design Patterns
– http://www.packtpub.com/pig-design-patterns/book
• External Libraries
– Piggybank
– DataFu
– DataFu Hourglass
– SimpleJson
– ElephantBird
So what is pig?
Pig is a champion