SlideShare una empresa de Scribd logo
1 de 22
Descargar para leer sin conexión
Prepared by,
Vetri.V
What is Pig?
 A data flow language and execution environment for exploring very large
datasets. Pig runs on HDFS and Map/Reduce clusters.
 Pig is a scripting language.
 No compiler
 Rapid prototyping.
 Command line prompt (grunt shell)
 Pig is a domain specific language
 No control flow (no if/then/else
 Specific to data flows
 Not for writing ray tracers
 For the distribution of a pre-existing ray tracer.
What isn’t pig?
 A general framework for all distributed computation.
 Pig is map/reduce, just easier.
 A general purpose language.
 No scope
 Minimal variable support
 No control flow
Why we need Pig?
 Writing native Map/Reduce is hard
 Difficult to make abstractions
 Extremely verbose
 400lines of java becomes<30 lines of pig
 Joins are very difficult
 a big motivator for pig
 Basically, everything about java M/R is painful.
Pig has two execution types or modes:
 Local mode and Map/Reduce mode.
Local Mode:
 To run the scripts in local mode, no Hadoop or HDFS installation is required.
All files are installed and run from your local host and file system.
Map/reduce Mode:
 To run the scripts in map/reduce mode, you need access to a Hadoop cluster
and HDFS installation.
Local mode
 In local mode, Pig runs in a single JVM and accesses the local file system. This
mode is suitable only for small datasets and when trying out Pig.
Prepared by,
Vetri.V
Running the Pig Scripts in Local Mode:
 To run the Pig scripts in local mode, do the following:
1. Move to the pig_tmp directory.
2. Execute the following command using script1-local.pig (or script2-
local.pig).
$ pig -x local script1-local.pig
The output may contain a few Hadoop warnings which can be ignored:
3. A directory named script1-local-results.txt (or script2-local-results.txt) is
created. This directory contains the results file, part-r-0000.
Running the Pig Scripts in Map/reduce Mode:
 To run the Pig scripts in mapreduce mode, do the following:
1. Move to the pig_ tmp directory.
2. Copy the excite.log.bz2 file from the pigtmp directory to the HDFS directory.
$ hadoopfs –copyFromLocal excite.log.bz2 .
PIG INSTALLATIONS:
 Before install pig make sure ant installed on the machine.
 Step 1:
Download Pig tarball
 Step 2:
Untar Pig
 Step 3:
Set environment Variable HADOOP_HOME and HADOOP_CONF_DIR
 Step 4:
$cd /usr/local/pig
 Step 5:
$ant
 Step 6:
$cd contrib/piggybank/java
 Step 7:
$ant
This will generate /usr/local/pig/contrib/piggybank/java/piggybank.jar.
Test your Pig Installations:
$ export PIG_HOME=/usr/local/pig
$ export HADOOP_HOME=/opt/hadoop
$ pig
grunt> copyFromLocal /etc/passwd /tmp/passwd
grunt> A = load '/tmp/passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> DUMP B;
(root)
(bin)
(daemon)
(adm)
Prepared by,
Vetri.V
Data Model- Pig:
 This includes Pig's data types, how it handles concepts like missing data, and
how you can describe your data to Pig.
Data types:
 It has two types.
 Scalar types.
 Int (ints are represented in interface by java.lang.Integer)
 Long( long are represented in interface by java.lang.Long)
 Float (float are represented in interface by java.lang.Float)
 Double
 Chararray (chararray are represented in interface by
java.lang.String)
 Bytearray
 Complex types.
 Complex Types: Tuples
o Every row in a relation is a tuple.
o Allows random access.
o Wraps ArrayList<Object>
o Must fit in memory.
o Can have a schema, but is not enforced.
 Potential for optimization.
 Complex types: Bags
o Pig’s only spillable data structure.
 Full structure does not have to fit in memory
o Two key operations
 Add a tuple.
 Iterate over Tuples
o No random access.
o No order guarantees.
o The object version of a relation.
 Every row is a tuple.
 Complex types: Maps
o Wraps HashMap<String,Object>
 Keys must be string.
o Must fit in memory
o Can be cumbersome to use.
 Vale type poorly understood in script.
Prepared by,
Vetri.V
Data types and syntax:
Datatype Syntax Example
int int as (a:int)
long long as (a:long)
float float as (a:float)
double double as (a:double)
chararray chararray as
(a:chararray)
bytearray bytearray as
(a:bytearray)
map map[] or map[type] where type is any valid type. This
declares all values in the map to be of this type.
as (a:map[],
b:map[int])
tuple tuple() or tuple(list_of_fields) where list_of_fieldsis a comma
separated list of field declarations.
as (a:tuple(),
b:tuple(x:int,
y:int))
bag bag{} or bag{t:(list_of_fields)} where list_of_fieldsis a comma
separated list of field declarations. Note that, oddly enough,
the tuple inside the bag must have a name, here specified as t,
even though you will never be able to access that
tuple t directly.
(a:bag{},
b:bag{t:(x:int,
y:int)})
Understanding a basic pig script:
Prepared by,
Vetri.V
Examples: Word Count:
The Pig object model:
Relations:
 Fundamental building block
 Analogous to a table, not a variable.
Prepared by,
Vetri.V
A detour: firing up pig:
Loading data:
Projection (aka FOREACH):
Prepared by,
Vetri.V
 Foreach means do something on every row in a relation.
 Creates a new relation.
 In this example, pruned is a new relation whose columns will be first name
age.
 With schemas, can also use column aliases.
Nested FOREACH:
 An advanced, but extremely powerful use of FOREACH let’s a script do more
analysis on the reducer.
 Imagine we wanted the distinct number of ages per department.
Schemas and types:
 Schema-less analysis is useful.
 Many data sources don’t have clear types.
 But schemas are also very useful.
 Ensuring correctness.
 Aiding optimization
 Schema gives an alias and type to the columns.
 Absent columns will be made null.
 Extra columns will be thrown out.
DESCRIBE- prints the schema of a relation:
Prepared by,
Vetri.V
Schemas vs. types:
 Schemas:
 A description of the types present.
 Used to help maintain correctness.
 Generally not enforced once script is run.
 Types:
 Describes the data present in a column.
 Generally parallel java types.
Type overview:
 Pig has a nested object model.
 Nested types
 Complex objects can contain other object.
 Pig primitives mirror java primitives.
 String, Int, Long, Float, Double.
 DataByteArray wraps a byte[].
 Working to add native Date Time support, and more.
Filter:
 A predicate is evaluated for each row.
 If false, the row is thrown out.
 Supports complicated predicates
 Boolean logic
 Regular expressions
 See Pig documentation for more.
Grouping:
 Abstractly, GROUPING creates a relation with unique keys, and the associated
rows.
 Examples: how many people in our data set are in each department?
Prepared by,
Vetri.V
Visualizing the group:
USING GROUP: an example:
 Goal: what % does each age group make of the total?
Prepared by,
Vetri.V
Understanding group
Groups: a retrospective:
 Grouping does not change the data.
 Reorganizes it based on the given key.
 Can group on multiple key.
 First column is always called group.
 A compound group key will be a tuple (“group”) whose elements are
the keys.
 Second column is bag.
 Name is the grouped relation
 Contains every row associated with key.
Flattening:
 Flatten is the opposite of group.
 Turns tuples into columns.
 Turns bags into rows.
Prepared by,
Vetri.V
Flattening Tuples(cont):
Flattening Bags:
 Syntax is the same as flattening tuples, but the idea is different.
 Tuples contain columns, thus flattening a tuple turn one column into many
columns.
 Bag contains, so flattening a bag turns one row into many rows.
Prepared by,
Vetri.V
Data is the same, just with different nesting. On the left, the rows are divided into
different bags.
Flattening Bags (cont):
 The schema indicates what is going on.
 Group goes from a flat structure to a nested one.
 Flatten goes from a nested structure to a flat one.
 Now that we have seen grouping, there’s a useful operation.
Prepared by,
Vetri.V
Fun with flattens:
 The group example can be done with flattens.
Prepared by,
Vetri.V
Flattening multiple Bags:
 The result from multiple flatten statements will be crossed.
 To only select a few columns in a Bag, syntax is bag_alias.(col1, coli2)
Joins:
 A big motivator for pig easier joins.
 Compares relation using a given key.
 Output all combinations of rows with equal keys.
 See appendix for more variations.
 How many credits does each student need?
 Joined schema is concatenation of joined relations schema.
 Relation name appended to aliases in case of ambiguity.
 In this case, there are two “dept” aliases.
Prepared by,
Vetri.V
Order by:
 Order by globally sorts a relation on a key (or set of keys).
 Global sort not guaranteed to be preserved through other transformations.
 A store after a global sort will result in one or more globally sorted part files.
Extending pig: UDFs
 UDF’s coupled with pig’s object model, allow for extensive transformation and
analysis.
What is a UDF?
 A user defined function (UDF) is a java function implementing EvalFunc<T>,
and can be used in a Pig script.
 Additional support for functions in python, Ruby, groovy.
 Much of the core Pig functionality is actually implemented in UDFs.
 COUNT in the previous example.
 Useful for learning how to implement your own.
 Src/org/apache/pig/built has many examples.
Types of UDFs:
 EvalFunc<T>
 Simple, one to one functions.
 Accumulator<T>
 Many to one.
 Left associative, NOT commutative.
 Algebraic<T>
 Many to one
Prepared by,
Vetri.V
 Associative, commutative.
 Makes use of combiners.
 All UDFs must returns pig types.
 Even intermediate stages.
EvalFunc <T>:
 Simplest kind of UDF.
 Only need to implement an “exe” function.
 Not ideal for “many to one “functions that vastly reduce amount of data ( such
as SUM or COUNT).
 In these cases, Algebraics are superior.
 Src/org/apache/pig/builtin/TOKENIZE.java is a nontrivial example.
A basic UDF:
Accumulator <T>:
 Used when the input is a large bag, but order matters.
 Allow you to work on the bag incrementally, can be much more
memory efficient.
 Difference between Algebraic UDFs is generally that you need to work on the
data in a given order.
 Used for session analysis when you need to analyze events in the
order they actually occurred.
 Src/org/apache/pig/builtin/COUNT.java is an example.
 Also implements algebraic (most Algebraic functions are also
Accumulative)
Algebraic:
 Commutative, algebraic functions.
 You can apply the function to any subset of the data (even partial
results) in any order.
 The most efficient.
 Takes advantage of hadoop combiners.
 Also the most complicated.
 Src/org/apache/pig/builtin/COUNT.java is an example.
Prepared by,
Vetri.V
What is map/Reduce barrier?
 A Map/Reduce barrier is a part of a script that forces a reduce stage.
 Some scripts can be done with just mappers.
Students=Load’students.txt as
(first:chararray,last:chararray,age:int,dept:chararray);
Students_filtered=FILTER students BY age>=20;
Students_proj=FOREACH students_filtered GENERATE last,dept;
 But most will need the full map/reduce cycle.
 The group is the difference, a “map reduce barrier” which
requires a reduce step.
Map/Reduce implications of operators:
 What will cause a map/reduce job?
 GROUP and COGROUP
 JOIN
 Excluding replicated join.
 CROSS
 To be avoided unless you are absolutely certain.
 Potential for huge explosion in data
 ORDER
 DISTINCT
 What will cause multiple map reduce jobs?
 Multiple uses of the above operations.
 Forking code paths.
 First step is identifying the M/R barriers.
Job 1:
Job 2:
Prepared by,
Vetri.V
Job 3:
A DAG example:
Projections:
 Projection reduces the amount of data being processed.
 Especially important b/w map and reduce stages when data goes
over the network.
Scalar projection:
 All interactions and transformations is in Pig is done on relations.
 Sometimes, we want access to an aggregate.
 Scalar projection allows us to use intermediate aggregate results in a
script.
Prepared by,
Vetri.V
SUM, COUNT, COUNT_STAR:
 In general, SUM, COUNT, and other aggregates implicitly work on the first
column.
 COUNT counts only non-null fields.
 COUNT_STAR counts all fields.
Sorting:
 Sorting is a global operation, but can be distributed.
 Must approximate distribution of the sort key.
 Imagine evenly distributed data between 1 and 100. With 10 reducers, can
send 1-10 to computer 1,11-20 to computer 2, and so on.
 In this way, the computation is distributed but the sort is global.
 Pig inserts a sorting job before an order by to estimate the key distribution.
Spilling:
 Spilling means that, at any time, a data structure can be asked to write itself to
disk.
 In Pig, there is a memory useage threshold
 This is why you can only add to bags, or iterate on them.
 Adding could force a spill to disk.
 Iterating can mean having to go disk for the contents.
JOIN OPTIMIZATIONS:
 Pig has three join optimizations. Using them can potentially makes jobs run
MUCH faster.
 Replicated join
 -a=join rel1 by x, rel2 by using ‘replicated’;
 Skewed join
 -a=join rel1 by x, rel2 by using ‘skewed’;
Prepared by,
Vetri.V
 Merge join.
 -a=join rel1 by x, rel2 by using ‘merge;
 Replicated join:
 Can be used when.
 Every relation besides the left-most relation can fit in
memory.
 Will invoke a map-side join.
 Will load all other relations into memory in the mapper and
do the join in place.
 Where applicable, massive resource savings.
 Skewed join:
 Useful when one of the relations being joined has a key which
dominates.
 Web logs, for example, often have a logged out user id which
can be a large % of the keys.
 The algorithm first samples the key distribution, and then replicates
the most popular keys.
 Some overhead, but worth it in cases of bad skew.
 Only works if there is skew in one relation
 If both relations have skew, the join degenerates to a cross,
which is unavoidable.
 Merge join:
 This is useful when you have relations that are already ordered.
 Cutting edge let’s you put an “order by” before the merge join.
 Will index the blocks that correspond to the relations, then will do a
traditional merge algorithm.
 Huge savings when applicable.
What happens when run a script?
 First, pig parses your script using ANTLR.
 The parser creates an intermediate representation (AST).
 The AST is converted to a logical plan.
 The logical plan is optimized, and then converts to a physical plan.
 The physical plan is optimized, and then converted to a series of Map/Reduce
jobs.
JOIN (inner)
 Performs inner, equijoin of two or more relations based on common field
values.
Prepared by,
Vetri.V
Example
 Suppose we have relations A and B.
 A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
 B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
 X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
JOIN (outer)
 Performs an outer join of two or more relations based on common field
values.
 It operates left outer and right outer and full outer join. The usage keywords
are namely LEFT OUTER, RIGHT OUTER, FULL OUTER.
Example:
 Left outer join
A = LOAD 'a.txt' AS (n:chararray, a:int);
B = LOAD 'b.txt' AS (n:chararray, m:chararray);
C = JOIN A by $0 LEFT OUTER, B BY $0;
Prepared by,
Vetri.V
LIMIT
 Limits the number of output tuples.
 Usage
 X = LIMIT A 3; (where x -relation and 3 is the number of tuple)
--- Thank you ---

Más contenido relacionado

La actualidad más candente

Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoopChirag Ahuja
 
Session 19 - MapReduce
Session 19  - MapReduce Session 19  - MapReduce
Session 19 - MapReduce AnandMHadoop
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet FormatYue Chen
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hiveSubhas Kumar Ghosh
 
Hive and HiveQL - Module6
Hive and HiveQL - Module6Hive and HiveQL - Module6
Hive and HiveQL - Module6Rohit Agrawal
 
Introduction to scoop and its functions
Introduction to scoop and its functionsIntroduction to scoop and its functions
Introduction to scoop and its functionsRupak Roy
 
Session 04 pig - slides
Session 04   pig - slidesSession 04   pig - slides
Session 04 pig - slidesAnandMHadoop
 
Working with Hive Analytics
Working with Hive AnalyticsWorking with Hive Analytics
Working with Hive AnalyticsManish Chopra
 
Import and Export Big Data using R Studio
Import and Export Big Data using R StudioImport and Export Big Data using R Studio
Import and Export Big Data using R StudioRupak Roy
 
Import web resources using R Studio
Import web resources using R StudioImport web resources using R Studio
Import web resources using R StudioRupak Roy
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Jeremy Walsh
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008athusoo
 
White paper on cassandra
White paper on cassandraWhite paper on cassandra
White paper on cassandraNavanit Katiyar
 
Introductive to Hive
Introductive to Hive Introductive to Hive
Introductive to Hive Rupak Roy
 
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainApache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainYahoo Developer Network
 

La actualidad más candente (19)

Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoop
 
Session 19 - MapReduce
Session 19  - MapReduce Session 19  - MapReduce
Session 19 - MapReduce
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet Format
 
Hive commands
Hive commandsHive commands
Hive commands
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
Hive and HiveQL - Module6
Hive and HiveQL - Module6Hive and HiveQL - Module6
Hive and HiveQL - Module6
 
R stata
R stataR stata
R stata
 
Introduction to scoop and its functions
Introduction to scoop and its functionsIntroduction to scoop and its functions
Introduction to scoop and its functions
 
Session 04 pig - slides
Session 04   pig - slidesSession 04   pig - slides
Session 04 pig - slides
 
Working with Hive Analytics
Working with Hive AnalyticsWorking with Hive Analytics
Working with Hive Analytics
 
Import and Export Big Data using R Studio
Import and Export Big Data using R StudioImport and Export Big Data using R Studio
Import and Export Big Data using R Studio
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Import web resources using R Studio
Import web resources using R StudioImport web resources using R Studio
Import web resources using R Studio
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008
 
20081030linkedin
20081030linkedin20081030linkedin
20081030linkedin
 
White paper on cassandra
White paper on cassandraWhite paper on cassandra
White paper on cassandra
 
Introductive to Hive
Introductive to Hive Introductive to Hive
Introductive to Hive
 
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainApache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
 

Similar a Pig

Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
220 runtime environments
220 runtime environments220 runtime environments
220 runtime environmentsJ'tong Atong
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Martin Odersky
 
Pig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaramPig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaramViswanath Gangavaram
 
Modules of the twenties
Modules of the twentiesModules of the twenties
Modules of the twentiesPuppet
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparationKushaal Singla
 
Future Programming Language
Future Programming LanguageFuture Programming Language
Future Programming LanguageYLTO
 
C interview question answer 1
C interview question answer 1C interview question answer 1
C interview question answer 1Amit Kapoor
 
Python interview questions and answers
Python interview questions and answersPython interview questions and answers
Python interview questions and answerskavinilavuG
 
Python interview questions and answers
Python interview questions and answersPython interview questions and answers
Python interview questions and answersRojaPriya
 
1183 c-interview-questions-and-answers
1183 c-interview-questions-and-answers1183 c-interview-questions-and-answers
1183 c-interview-questions-and-answersAkash Gawali
 
Python and Zope: An introduction (May 2004)
Python and Zope: An introduction (May 2004)Python and Zope: An introduction (May 2004)
Python and Zope: An introduction (May 2004)Kiran Jonnalagadda
 
Embedding Pig in scripting languages
Embedding Pig in scripting languagesEmbedding Pig in scripting languages
Embedding Pig in scripting languagesJulien Le Dem
 

Similar a Pig (20)

Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Unit 3
Unit 3Unit 3
Unit 3
 
220 runtime environments
220 runtime environments220 runtime environments
220 runtime environments
 
What is c language
What is c languageWhat is c language
What is c language
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009Scala Talk at FOSDEM 2009
Scala Talk at FOSDEM 2009
 
Pig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaramPig power tools_by_viswanath_gangavaram
Pig power tools_by_viswanath_gangavaram
 
Modules of the twenties
Modules of the twentiesModules of the twenties
Modules of the twenties
 
C interview-questions-techpreparation
C interview-questions-techpreparationC interview-questions-techpreparation
C interview-questions-techpreparation
 
Technical Interview
Technical InterviewTechnical Interview
Technical Interview
 
Andy On Closures
Andy On ClosuresAndy On Closures
Andy On Closures
 
Future Programming Language
Future Programming LanguageFuture Programming Language
Future Programming Language
 
C interview question answer 1
C interview question answer 1C interview question answer 1
C interview question answer 1
 
Python interview questions and answers
Python interview questions and answersPython interview questions and answers
Python interview questions and answers
 
Python interview questions and answers
Python interview questions and answersPython interview questions and answers
Python interview questions and answers
 
1183 c-interview-questions-and-answers
1183 c-interview-questions-and-answers1183 c-interview-questions-and-answers
1183 c-interview-questions-and-answers
 
Python and Zope: An introduction (May 2004)
Python and Zope: An introduction (May 2004)Python and Zope: An introduction (May 2004)
Python and Zope: An introduction (May 2004)
 
Embedding Pig in scripting languages
Embedding Pig in scripting languagesEmbedding Pig in scripting languages
Embedding Pig in scripting languages
 
Go1
Go1Go1
Go1
 
Lab 1 Essay
Lab 1 EssayLab 1 Essay
Lab 1 Essay
 

Último

UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 

Último (20)

20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 

Pig

  • 1. Prepared by, Vetri.V What is Pig?  A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and Map/Reduce clusters.  Pig is a scripting language.  No compiler  Rapid prototyping.  Command line prompt (grunt shell)  Pig is a domain specific language  No control flow (no if/then/else  Specific to data flows  Not for writing ray tracers  For the distribution of a pre-existing ray tracer. What isn’t pig?  A general framework for all distributed computation.  Pig is map/reduce, just easier.  A general purpose language.  No scope  Minimal variable support  No control flow Why we need Pig?  Writing native Map/Reduce is hard  Difficult to make abstractions  Extremely verbose  400lines of java becomes<30 lines of pig  Joins are very difficult  a big motivator for pig  Basically, everything about java M/R is painful. Pig has two execution types or modes:  Local mode and Map/Reduce mode. Local Mode:  To run the scripts in local mode, no Hadoop or HDFS installation is required. All files are installed and run from your local host and file system. Map/reduce Mode:  To run the scripts in map/reduce mode, you need access to a Hadoop cluster and HDFS installation. Local mode  In local mode, Pig runs in a single JVM and accesses the local file system. This mode is suitable only for small datasets and when trying out Pig.
  • 2. Prepared by, Vetri.V Running the Pig Scripts in Local Mode:  To run the Pig scripts in local mode, do the following: 1. Move to the pig_tmp directory. 2. Execute the following command using script1-local.pig (or script2- local.pig). $ pig -x local script1-local.pig The output may contain a few Hadoop warnings which can be ignored: 3. A directory named script1-local-results.txt (or script2-local-results.txt) is created. This directory contains the results file, part-r-0000. Running the Pig Scripts in Map/reduce Mode:  To run the Pig scripts in mapreduce mode, do the following: 1. Move to the pig_ tmp directory. 2. Copy the excite.log.bz2 file from the pigtmp directory to the HDFS directory. $ hadoopfs –copyFromLocal excite.log.bz2 . PIG INSTALLATIONS:  Before install pig make sure ant installed on the machine.  Step 1: Download Pig tarball  Step 2: Untar Pig  Step 3: Set environment Variable HADOOP_HOME and HADOOP_CONF_DIR  Step 4: $cd /usr/local/pig  Step 5: $ant  Step 6: $cd contrib/piggybank/java  Step 7: $ant This will generate /usr/local/pig/contrib/piggybank/java/piggybank.jar. Test your Pig Installations: $ export PIG_HOME=/usr/local/pig $ export HADOOP_HOME=/opt/hadoop $ pig grunt> copyFromLocal /etc/passwd /tmp/passwd grunt> A = load '/tmp/passwd' using PigStorage(':'); grunt> B = foreach A generate $0 as id; grunt> DUMP B; (root) (bin) (daemon) (adm)
  • 3. Prepared by, Vetri.V Data Model- Pig:  This includes Pig's data types, how it handles concepts like missing data, and how you can describe your data to Pig. Data types:  It has two types.  Scalar types.  Int (ints are represented in interface by java.lang.Integer)  Long( long are represented in interface by java.lang.Long)  Float (float are represented in interface by java.lang.Float)  Double  Chararray (chararray are represented in interface by java.lang.String)  Bytearray  Complex types.  Complex Types: Tuples o Every row in a relation is a tuple. o Allows random access. o Wraps ArrayList<Object> o Must fit in memory. o Can have a schema, but is not enforced.  Potential for optimization.  Complex types: Bags o Pig’s only spillable data structure.  Full structure does not have to fit in memory o Two key operations  Add a tuple.  Iterate over Tuples o No random access. o No order guarantees. o The object version of a relation.  Every row is a tuple.  Complex types: Maps o Wraps HashMap<String,Object>  Keys must be string. o Must fit in memory o Can be cumbersome to use.  Vale type poorly understood in script.
  • 4. Prepared by, Vetri.V Data types and syntax: Datatype Syntax Example int int as (a:int) long long as (a:long) float float as (a:float) double double as (a:double) chararray chararray as (a:chararray) bytearray bytearray as (a:bytearray) map map[] or map[type] where type is any valid type. This declares all values in the map to be of this type. as (a:map[], b:map[int]) tuple tuple() or tuple(list_of_fields) where list_of_fieldsis a comma separated list of field declarations. as (a:tuple(), b:tuple(x:int, y:int)) bag bag{} or bag{t:(list_of_fields)} where list_of_fieldsis a comma separated list of field declarations. Note that, oddly enough, the tuple inside the bag must have a name, here specified as t, even though you will never be able to access that tuple t directly. (a:bag{}, b:bag{t:(x:int, y:int)}) Understanding a basic pig script:
  • 5. Prepared by, Vetri.V Examples: Word Count: The Pig object model: Relations:  Fundamental building block  Analogous to a table, not a variable.
  • 6. Prepared by, Vetri.V A detour: firing up pig: Loading data: Projection (aka FOREACH):
  • 7. Prepared by, Vetri.V  Foreach means do something on every row in a relation.  Creates a new relation.  In this example, pruned is a new relation whose columns will be first name age.  With schemas, can also use column aliases. Nested FOREACH:  An advanced, but extremely powerful use of FOREACH let’s a script do more analysis on the reducer.  Imagine we wanted the distinct number of ages per department. Schemas and types:  Schema-less analysis is useful.  Many data sources don’t have clear types.  But schemas are also very useful.  Ensuring correctness.  Aiding optimization  Schema gives an alias and type to the columns.  Absent columns will be made null.  Extra columns will be thrown out. DESCRIBE- prints the schema of a relation:
  • 8. Prepared by, Vetri.V Schemas vs. types:  Schemas:  A description of the types present.  Used to help maintain correctness.  Generally not enforced once script is run.  Types:  Describes the data present in a column.  Generally parallel java types. Type overview:  Pig has a nested object model.  Nested types  Complex objects can contain other object.  Pig primitives mirror java primitives.  String, Int, Long, Float, Double.  DataByteArray wraps a byte[].  Working to add native Date Time support, and more. Filter:  A predicate is evaluated for each row.  If false, the row is thrown out.  Supports complicated predicates  Boolean logic  Regular expressions  See Pig documentation for more. Grouping:  Abstractly, GROUPING creates a relation with unique keys, and the associated rows.  Examples: how many people in our data set are in each department?
  • 9. Prepared by, Vetri.V Visualizing the group: USING GROUP: an example:  Goal: what % does each age group make of the total?
  • 10. Prepared by, Vetri.V Understanding group Groups: a retrospective:  Grouping does not change the data.  Reorganizes it based on the given key.  Can group on multiple key.  First column is always called group.  A compound group key will be a tuple (“group”) whose elements are the keys.  Second column is bag.  Name is the grouped relation  Contains every row associated with key. Flattening:  Flatten is the opposite of group.  Turns tuples into columns.  Turns bags into rows.
  • 11. Prepared by, Vetri.V Flattening Tuples(cont): Flattening Bags:  Syntax is the same as flattening tuples, but the idea is different.  Tuples contain columns, thus flattening a tuple turn one column into many columns.  Bag contains, so flattening a bag turns one row into many rows.
  • 12. Prepared by, Vetri.V Data is the same, just with different nesting. On the left, the rows are divided into different bags. Flattening Bags (cont):  The schema indicates what is going on.  Group goes from a flat structure to a nested one.  Flatten goes from a nested structure to a flat one.  Now that we have seen grouping, there’s a useful operation.
  • 13. Prepared by, Vetri.V Fun with flattens:  The group example can be done with flattens.
  • 14. Prepared by, Vetri.V Flattening multiple Bags:  The result from multiple flatten statements will be crossed.  To only select a few columns in a Bag, syntax is bag_alias.(col1, coli2) Joins:  A big motivator for pig easier joins.  Compares relation using a given key.  Output all combinations of rows with equal keys.  See appendix for more variations.  How many credits does each student need?  Joined schema is concatenation of joined relations schema.  Relation name appended to aliases in case of ambiguity.  In this case, there are two “dept” aliases.
  • 15. Prepared by, Vetri.V Order by:  Order by globally sorts a relation on a key (or set of keys).  Global sort not guaranteed to be preserved through other transformations.  A store after a global sort will result in one or more globally sorted part files. Extending pig: UDFs  UDF’s coupled with pig’s object model, allow for extensive transformation and analysis. What is a UDF?  A user defined function (UDF) is a java function implementing EvalFunc<T>, and can be used in a Pig script.  Additional support for functions in python, Ruby, groovy.  Much of the core Pig functionality is actually implemented in UDFs.  COUNT in the previous example.  Useful for learning how to implement your own.  Src/org/apache/pig/built has many examples. Types of UDFs:  EvalFunc<T>  Simple, one to one functions.  Accumulator<T>  Many to one.  Left associative, NOT commutative.  Algebraic<T>  Many to one
  • 16. Prepared by, Vetri.V  Associative, commutative.  Makes use of combiners.  All UDFs must returns pig types.  Even intermediate stages. EvalFunc <T>:  Simplest kind of UDF.  Only need to implement an “exe” function.  Not ideal for “many to one “functions that vastly reduce amount of data ( such as SUM or COUNT).  In these cases, Algebraics are superior.  Src/org/apache/pig/builtin/TOKENIZE.java is a nontrivial example. A basic UDF: Accumulator <T>:  Used when the input is a large bag, but order matters.  Allow you to work on the bag incrementally, can be much more memory efficient.  Difference between Algebraic UDFs is generally that you need to work on the data in a given order.  Used for session analysis when you need to analyze events in the order they actually occurred.  Src/org/apache/pig/builtin/COUNT.java is an example.  Also implements algebraic (most Algebraic functions are also Accumulative) Algebraic:  Commutative, algebraic functions.  You can apply the function to any subset of the data (even partial results) in any order.  The most efficient.  Takes advantage of hadoop combiners.  Also the most complicated.  Src/org/apache/pig/builtin/COUNT.java is an example.
  • 17. Prepared by, Vetri.V What is map/Reduce barrier?  A Map/Reduce barrier is a part of a script that forces a reduce stage.  Some scripts can be done with just mappers. Students=Load’students.txt as (first:chararray,last:chararray,age:int,dept:chararray); Students_filtered=FILTER students BY age>=20; Students_proj=FOREACH students_filtered GENERATE last,dept;  But most will need the full map/reduce cycle.  The group is the difference, a “map reduce barrier” which requires a reduce step. Map/Reduce implications of operators:  What will cause a map/reduce job?  GROUP and COGROUP  JOIN  Excluding replicated join.  CROSS  To be avoided unless you are absolutely certain.  Potential for huge explosion in data  ORDER  DISTINCT  What will cause multiple map reduce jobs?  Multiple uses of the above operations.  Forking code paths.  First step is identifying the M/R barriers. Job 1: Job 2:
  • 18. Prepared by, Vetri.V Job 3: A DAG example: Projections:  Projection reduces the amount of data being processed.  Especially important b/w map and reduce stages when data goes over the network. Scalar projection:  All interactions and transformations is in Pig is done on relations.  Sometimes, we want access to an aggregate.  Scalar projection allows us to use intermediate aggregate results in a script.
  • 19. Prepared by, Vetri.V SUM, COUNT, COUNT_STAR:  In general, SUM, COUNT, and other aggregates implicitly work on the first column.  COUNT counts only non-null fields.  COUNT_STAR counts all fields. Sorting:  Sorting is a global operation, but can be distributed.  Must approximate distribution of the sort key.  Imagine evenly distributed data between 1 and 100. With 10 reducers, can send 1-10 to computer 1,11-20 to computer 2, and so on.  In this way, the computation is distributed but the sort is global.  Pig inserts a sorting job before an order by to estimate the key distribution. Spilling:  Spilling means that, at any time, a data structure can be asked to write itself to disk.  In Pig, there is a memory useage threshold  This is why you can only add to bags, or iterate on them.  Adding could force a spill to disk.  Iterating can mean having to go disk for the contents. JOIN OPTIMIZATIONS:  Pig has three join optimizations. Using them can potentially makes jobs run MUCH faster.  Replicated join  -a=join rel1 by x, rel2 by using ‘replicated’;  Skewed join  -a=join rel1 by x, rel2 by using ‘skewed’;
  • 20. Prepared by, Vetri.V  Merge join.  -a=join rel1 by x, rel2 by using ‘merge;  Replicated join:  Can be used when.  Every relation besides the left-most relation can fit in memory.  Will invoke a map-side join.  Will load all other relations into memory in the mapper and do the join in place.  Where applicable, massive resource savings.  Skewed join:  Useful when one of the relations being joined has a key which dominates.  Web logs, for example, often have a logged out user id which can be a large % of the keys.  The algorithm first samples the key distribution, and then replicates the most popular keys.  Some overhead, but worth it in cases of bad skew.  Only works if there is skew in one relation  If both relations have skew, the join degenerates to a cross, which is unavoidable.  Merge join:  This is useful when you have relations that are already ordered.  Cutting edge let’s you put an “order by” before the merge join.  Will index the blocks that correspond to the relations, then will do a traditional merge algorithm.  Huge savings when applicable. What happens when run a script?  First, pig parses your script using ANTLR.  The parser creates an intermediate representation (AST).  The AST is converted to a logical plan.  The logical plan is optimized, and then converts to a physical plan.  The physical plan is optimized, and then converted to a series of Map/Reduce jobs. JOIN (inner)  Performs inner, equijoin of two or more relations based on common field values.
  • 21. Prepared by, Vetri.V Example  Suppose we have relations A and B.  A = LOAD 'data1' AS (a1:int,a2:int,a3:int); DUMP A; (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)  B = LOAD 'data2' AS (b1:int,b2:int); DUMP B; (2,4) (8,9) (1,3) (2,7) (2,9) (4,6) (4,9)  X = JOIN A BY a1, B BY b1; DUMP X; (1,2,3,1,3) (4,2,1,4,6) (4,3,3,4,6) (4,2,1,4,9) (4,3,3,4,9) (8,3,4,8,9) (8,4,3,8,9) JOIN (outer)  Performs an outer join of two or more relations based on common field values.  It operates left outer and right outer and full outer join. The usage keywords are namely LEFT OUTER, RIGHT OUTER, FULL OUTER. Example:  Left outer join A = LOAD 'a.txt' AS (n:chararray, a:int); B = LOAD 'b.txt' AS (n:chararray, m:chararray); C = JOIN A by $0 LEFT OUTER, B BY $0;
  • 22. Prepared by, Vetri.V LIMIT  Limits the number of output tuples.  Usage  X = LIMIT A 3; (where x -relation and 3 is the number of tuple) --- Thank you ---