Big Data & Hadoop Training
By Aravindu Sandela
Topics to Discuss Today
Session 4
- Need of Pig
- Pig Components
- Why Pig was created?
- Pig Data Types
- Why go for Pig when MapReduce is there?
- Use Case in Healthcare
- Use cases where Pig is used
- Pig UDF
- Where not to use Pig
- Pig vs. Hive
Let’s start with PIG
Need of Pig
Do you know Java?
10 lines of Pig = 200 lines of Java
Plus built-in operations like Join, Group, Filter, Sort, and more…
Why Was Pig Created?
- An ad-hoc way of creating and executing MapReduce jobs on very large data sets
- Rapid development
- No Java knowledge required
- Developed by Yahoo!
Why Should I Go For Pig When There Is MR?
- 1/20th the lines of code
- 1/16th the development time
[Bar charts comparing Hadoop and Pig: lines of code and development time in minutes]
- Performance on par with raw Hadoop
Why Should I Go For Pig When There Is MR?
MapReduce:
- Powerful model for parallelism, but based on a rigid procedural structure.
- Provides a good opportunity to parallelize algorithms.
- You must think in terms of map and reduce functions.
- More than likely it will require Java programmers.
Pig:
- Provides the desirable higher-level declarative language.
- Similar to a SQL query: the user specifies the "what" and leaves the "how" to the underlying processing engine.
Where Should I Use Pig?
- Pig is a data flow language. It sits on top of Hadoop and makes it possible to create complex jobs that process large volumes of data quickly and efficiently.
- It will consume any data you feed it: structured, semi-structured, or unstructured.
- Pig provides the common data operations (filters, joins, ordering) and nested data types (tuples, bags, and maps) that are missing from MapReduce.
- Pig's multi-query approach combines certain types of operations into a single pipeline, reducing the number of times the data is scanned. This means 1/20th the lines of code and 1/16th the development time compared to writing raw MapReduce.
- Pig scripts are easier and faster to write than standard Java Hadoop jobs, and Pig has a lot of clever optimizations, like multi-query execution, that can make complex queries execute quicker.
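As a small sketch of the multi-query approach (the file name and field names here are hypothetical), a single script can feed one LOAD into several pipelines with multiple STOREs, and Pig will share the input scan between them:

```pig
-- Hypothetical tab-delimited log of (user, url, time).
visits = LOAD 'visits.log' AS (user:chararray, url:chararray, time:long);

-- Two independent pipelines over the same input...
by_user     = GROUP visits BY user;
user_counts = FOREACH by_user GENERATE group AS user, COUNT(visits) AS n;

by_url     = GROUP visits BY url;
url_counts = FOREACH by_url GENERATE group AS url, COUNT(visits) AS n;

-- ...with two STOREs in one script: multi-query execution lets Pig
-- scan 'visits.log' only once for both outputs.
STORE user_counts INTO 'out/user_counts';
STORE url_counts  INTO 'out/url_counts';
```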
Where Not to Use Pig?
- Really nasty data formats or completely unstructured data (video, audio, raw human-readable text).
- Pig is slower than well-tuned MapReduce jobs.
- When you would like more control to optimize your code.
- The Pig platform is designed for ETL-type use cases; it is not a great choice for real-time scenarios.
- Pig is also not the right choice for pinpointing a single record in very large data sets.
- Pig offers specialized joins (fragment-replicate, skewed, and merge joins), but the user has to know when to use which join.
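To illustrate the specialized joins mentioned above (relation and field names are hypothetical), the join strategy is selected with a USING clause, and picking an appropriate one is the user's responsibility:

```pig
big   = LOAD 'big_data'   AS (name:chararray, value:int);
small = LOAD 'small_data' AS (name:chararray, label:chararray);

-- Fragment-replicate join: the relation listed last is replicated to
-- every map task; only suitable when it fits in memory.
j1 = JOIN big BY name, small BY name USING 'replicated';

-- Skewed join: handles join keys with heavily skewed distributions.
j2 = JOIN big BY name, small BY name USING 'skewed';

-- Merge join: both inputs must already be sorted on the join key.
j3 = JOIN big BY name, small BY name USING 'merge';
```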
What is Pig?
Pig is an open-source, high-level dataflow system. It provides a simple language for queries and data manipulation, Pig Latin, that is compiled into MapReduce jobs which run on Hadoop.
Why is it important?
Companies like Yahoo!, Google, and Microsoft are collecting enormous data sets in the form of click streams, search logs, and web crawls. Some form of ad-hoc processing and analysis of all of this information is required.
Use cases where Pig is used…
- Processing of web logs
- Data processing for search platforms
- Support for ad-hoc queries across large datasets
- Quick prototyping of algorithms for processing large datasets
Conceptual Data Flow
1. Load Visits (User, URL, Time)
2. Load Pages (URL, PageRank)
3. Join on url = url
4. Group by User
5. Compute average PageRank
6. Filter avgPR > 0.5
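The conceptual flow above can be sketched in Pig Latin (the input file names and schemas are assumptions for illustration):

```pig
visits = LOAD 'visits' AS (user:chararray, url:chararray, time:long);
pages  = LOAD 'pages'  AS (url:chararray, pagerank:double);

joined  = JOIN visits BY url, pages BY url;
by_user = GROUP joined BY visits::user;

-- Average PageRank of the pages each user visited.
avg_pr = FOREACH by_user GENERATE group AS user,
                                  AVG(joined.pages::pagerank) AS avgPR;

result = FILTER avg_pr BY avgPR > 0.5;
DUMP result;
```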
Use Case in Healthcare
1. Take a DB dump in CSV format and ingest it into HDFS.
2. Map tasks read the CSV file from HDFS.
3. Each map task de-identifies columns based on configuration.
4. Store the de-identified CSV file back into HDFS.
Pig – Basic Program Structure
Execution Modes:
- Local mode: executes in a single JVM and works exclusively with the local file system. Great for development, experimentation, and prototyping.
- Hadoop mode (also known as MapReduce mode): Pig renders Pig Latin into MapReduce jobs and executes them on the cluster. Can execute against a semi-distributed or fully-distributed Hadoop installation.
Pig – Basic Program Structure
Script: Pig can run a script file that contains Pig commands. Example: pig script.pig runs the commands in the local file script.pig.
Grunt: Grunt is an interactive shell for running Pig commands. It is also possible to run Pig scripts from within Grunt using run and exec.
Embedded: You can run Pig programs from Java, much like you can use JDBC to run SQL programs from Java.
Pig Is Made Up of Two Components
1) Pig Latin: the language used to express data flows.
2) Execution environments: local execution in a single JVM, or distributed execution on a Hadoop cluster.
Pig Execution
- Pig resides on the user machine; jobs execute on the Hadoop cluster.
- No need to install anything extra on your Hadoop cluster!
Pig Latin Program
A Pig Latin program is made up of a series of operations or transformations that are applied to the input data to produce output. Pig turns the transformations into a series of MapReduce jobs.
- Field – a piece of data.
- Tuple – an ordered set of fields, written with "(" and ")", e.g. (10.4, 5, word, 4, field1)
- Bag – a collection of tuples, written with "{" and "}", e.g. {(10.4, 5, word, 4, field1), (this, 1, blah)}
Similar to a relational database: a bag is like a table, and a tuple is like a row in a table. Unlike a relational database, bags do not require that all tuples contain the same number of fields.
Four Basic Types of Data Models
Atom, Tuple, Bag, and Map.
Data Model
Supports four basic types:
- Atom: a simple atomic value (int, long, double, string), e.g. 'Abhi'
- Tuple: a sequence of fields, each of which can be any data type, e.g. ('Abhi', 14)
- Bag: a collection of tuples of potentially varying structures; can contain duplicates, e.g. {('Abhi'), ('Manu', (14, 21))}
- Map: an associative array; the key must be a chararray, but the value can be any type.
Pig Data Types
Pig Data Type    Implementing Class
Bag              org.apache.pig.data.DataBag
Tuple            org.apache.pig.data.Tuple
Map              java.util.Map<Object, Object>
Integer          java.lang.Integer
Long             java.lang.Long
Float            java.lang.Float
Double           java.lang.Double
Chararray        java.lang.String
Bytearray        byte[]
Pig Latin Relational Operators

Loading and Storing:
- LOAD – Loads data from the file system.
- STORE – Saves a relation to the file system or other storage.
- DUMP – Prints a relation to the console.

Filtering:
- FILTER – Removes unwanted rows from a relation.
- DISTINCT – Removes duplicate rows from a relation.
- FOREACH...GENERATE – Adds or removes fields from a relation.
- STREAM – Transforms a relation using an external program.

Grouping and Joining:
- JOIN – Joins two or more relations.
- COGROUP – Groups the data in two or more relations.
- GROUP – Groups the data in a single relation.
- CROSS – Creates the cross product of two or more relations.

Sorting:
- ORDER – Sorts a relation by one or more fields.
- LIMIT – Limits the size of a relation to a maximum number of tuples.

Combining and Splitting:
- UNION – Combines two or more relations into one.
- SPLIT – Splits a relation into two or more relations.
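A short script exercising several of these operators together (the input file and schema are hypothetical):

```pig
-- Hypothetical tab-delimited input (name, age, gpa).
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

adults  = FILTER A BY age >= 18;          -- Filtering
names   = FOREACH adults GENERATE name;   -- keep only the name field
uniq    = DISTINCT names;                 -- remove duplicate rows
ordered = ORDER uniq BY name;             -- Sorting
top3    = LIMIT ordered 3;                -- cap the relation at 3 tuples

STORE top3 INTO 'out/top3';               -- Storing
```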
Pig Latin – Nulls
Pig includes the concept of a data element being null. In Pig, when a data element is null, it means the value is unknown. Data of any type can be null.
Note: the concept of null in Pig is the same as in SQL, unlike other languages such as Java, C, or Python.
Data
File – Student (Name, Age, GPA):
Joe     18      2.5
Sam             3.0
Angel   21      7.9
John    17      9.0
Joe     19      2.9

File – StudentRoll (Name, Roll No.):
Joe     45
Sam     24
Angel   1
John    12
Joe     19
Pig Latin – GROUP Operator
Example of the GROUP operator:
A = load 'student' as (name:chararray, age:int, gpa:float);
dump A;
(joe,18,2.5)
(sam,,3.0)
(angel,21,7.9)
(john,17,9.0)
(joe,19,2.9)
X = group A by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)})
(sam,{(sam,,3.0)})
(john,{(john,17,9.0)})
(angel,{(angel,21,7.9)})
Pig Latin – COGROUP Operator
Example of the COGROUP operator:
A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'studentRoll' as (name:chararray, rollno:int);
X = cogroup A by name, B by name;
dump X;
(joe,{(joe,18,2.5),(joe,19,2.9)},{(joe,45),(joe,19)})
(sam,{(sam,,3.0)},{(sam,24)})
(john,{(john,17,9.0)},{(john,12)})
(angel,{(angel,21,7.9)},{(angel,1)})
Joins and COGROUP
JOIN and COGROUP perform similar functions: JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.
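Using the same student and studentRoll data as in the COGROUP example, the contrast can be sketched as follows:

```pig
A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'studentRoll' as (name:chararray, rollno:int);

-- JOIN flattens: one output record per matching pair of input records,
-- e.g. (joe,18,2.5,joe,45) -- contrast with COGROUP's nested bags.
J = join A by name, B by name;
dump J;
```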
UNION
UNION: To merge the contents
of two or more relations.
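A minimal UNION sketch (the input files and schemas are hypothetical):

```pig
-- Hypothetical inputs with compatible schemas.
A = LOAD 'students_2013' AS (name:chararray, gpa:float);
B = LOAD 'students_2014' AS (name:chararray, gpa:float);

-- UNION concatenates the tuples of A and B; it does not remove
-- duplicates and does not preserve order.
C = UNION A, B;
DUMP C;
```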
Diagnostic Operators & UDF Statements
Pig Latin Diagnostic Operators:
- DESCRIBE: Prints a relation's schema.
- EXPLAIN: Prints the logical and physical plans.
- ILLUSTRATE: Shows a sample execution of the logical plan, using a generated subset of the input.
Pig Latin UDF Statements:
- REGISTER: Registers a JAR file with the Pig runtime.
- DEFINE: Creates an alias for a UDF, streaming script, or command specification.
Describe
Use the DESCRIBE operator to review a relation's fields and data types.
EXPLAIN: Logical Plan
Use the EXPLAIN operator to review the logical, physical, and MapReduce execution plans used to compute the specified relation. The logical plan shows the pipeline of operators to be executed to build the relation. Type checking and backend-independent optimizations (such as applying filters early) also apply here.
EXPLAIN: Physical Plan
The physical plan shows how the logical operators are translated to backend-specific physical operators. Some backend optimizations also apply.
Illustrate
The ILLUSTRATE command demonstrates a "good" example of the input data, judged by three measures:
1. Completeness
2. Conciseness
3. Degree of realism
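Applied to the student data from the earlier GROUP example, the three diagnostic operators look like this:

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
X = GROUP A BY name;

DESCRIBE X;    -- prints X's schema (group key plus a bag of A tuples)
EXPLAIN X;     -- prints the logical, physical, and MapReduce plans for X
ILLUSTRATE X;  -- shows a sample run of the plan on a small generated subset
```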
Pig Latin – File Loaders
- TextLoader: loads from a plain-text format. Each line corresponds to a tuple whose single field is the line of text.
- CSVLoader: loads CSV files.
- XMLLoader: loads XML files.
Pig Latin – File Loaders
- PigStorage: the default storage. Loads/stores relations using a field-delimited text format. Tab is the default delimiter; other delimiters can be specified in the query with USING PigStorage(' ').
- BinStorage: loads/stores relations from/to binary files, using Hadoop Writable objects.
- BinaryStorage: contains only single-field tuples with a value of type bytearray; used with Pig streaming.
- PigDump: stores relations using the toString() representation of tuples.
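A sketch of overriding PigStorage's default delimiter (the file names are hypothetical):

```pig
-- Comma-delimited input: PigStorage(',') overrides the default tab.
A = LOAD 'student.csv' USING PigStorage(',')
    AS (name:chararray, age:int, gpa:float);

-- Store the relation back out pipe-delimited.
STORE A INTO 'out/student_piped' USING PigStorage('|');
```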
Pig Latin – Creating a UDF

import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class IsOfAge extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            // Keep only tuples whose age is one of the values of interest.
            return i == 18 || i == 19 || i == 21 || i == 23 || i == 27;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
Pig Latin – Calling a UDF
How to call a UDF:
register myudf.jar;
X = filter A by IsOfAge(age);
Big Data Hadoop Training

Más contenido relacionado

La actualidad más candente

Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1Vemula Ravi
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapakapa rohit
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFSEdureka!
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
XML Parsing with Map Reduce
XML Parsing with Map ReduceXML Parsing with Map Reduce
XML Parsing with Map ReduceEdureka!
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Edureka!
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1Giovanna Roda
 
Hadoop online training
Hadoop online training Hadoop online training
Hadoop online training Keylabs
 
Hadoop for Java Professionals
Hadoop for Java ProfessionalsHadoop for Java Professionals
Hadoop for Java ProfessionalsEdureka!
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 

La actualidad más candente (20)

Hadoop interview quations1
Hadoop interview quations1Hadoop interview quations1
Hadoop interview quations1
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
XML Parsing with Map Reduce
XML Parsing with Map ReduceXML Parsing with Map Reduce
XML Parsing with Map Reduce
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Introduction to Hadoop part1
Introduction to Hadoop part1Introduction to Hadoop part1
Introduction to Hadoop part1
 
Hadoop online training
Hadoop online training Hadoop online training
Hadoop online training
 
Hadoop for Java Professionals
Hadoop for Java ProfessionalsHadoop for Java Professionals
Hadoop for Java Professionals
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 

Similar a Big Data Hadoop Training

unit-4-apache pig-.pdf
unit-4-apache pig-.pdfunit-4-apache pig-.pdf
unit-4-apache pig-.pdfssuser92282c
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateYahoo Developer Network
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIYoni Davidson
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGAdam Kawa
 
Introduction to pig.
Introduction to pig.Introduction to pig.
Introduction to pig.Triloki Gupta
 
Pig: Data Analysis Tool in Cloud
Pig: Data Analysis Tool in Cloud Pig: Data Analysis Tool in Cloud
Pig: Data Analysis Tool in Cloud Jianfeng Zhang
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache PigJason Shao
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache PigSachin Vakkund
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 

Similar a Big Data Hadoop Training (20)

Unit 4-apache pig
Unit 4-apache pigUnit 4-apache pig
Unit 4-apache pig
 
unit-4-apache pig-.pdf
unit-4-apache pig-.pdfunit-4-apache pig-.pdf
unit-4-apache pig-.pdf
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Unit 4 lecture2
Unit 4 lecture2Unit 4 lecture2
Unit 4 lecture2
 
Unit V.pdf
Unit V.pdfUnit V.pdf
Unit V.pdf
 
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Apache Pig
Apache PigApache Pig
Apache Pig
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
Introduction to pig.
Introduction to pig.Introduction to pig.
Introduction to pig.
 
Pig: Data Analysis Tool in Cloud
Pig: Data Analysis Tool in Cloud Pig: Data Analysis Tool in Cloud
Pig: Data Analysis Tool in Cloud
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache Pig
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Pig
PigPig
Pig
 

Último

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Último (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Big Data Hadoop Training

  • 1. Big Data & Hadoop Training By Aravindu Sandela
  • 2. Topics to Discuss Today Session 4 Need of PIG PIG Components Why PIG was created? PIG Data Types Why go for PIG when MapReduce is there? Use Case in Healthcare Use Cases where Pig is used PIG UDF Where not to use PIG PIG Vs Hive Let’s start with PIG
  • 3. Need of Pig Do you know Java? 10 lines of PIG = 200 lines of Java + Built in operations like: Join, Group, Filter, Sort and more… Oh Really!
  • 4. Why Was Pig Created? An ad-hoc way of creating and executing map-reduce jobs on very large data sets Rapid Development No Java is required Developed by Yahoo!
  • 5. Why Should I Go For Pig When There Is MR? 1/20 the lines of the code 1/16 the Development Time 400 150 300 100 200 Minutes 200 100 50 0 0 Hadoop Pig Hadoop Performance on par with waw Hadoop Pig
  • 6. Why Should I Go For Pig When There Is MR? MapReduce Powerful model for parallelism. Based on a rigid procedural structure. Provides a good opportunity to parallelize algorithm. Have a higher level declarative language Must think in terms of map and reduce functions More than likely will require Java programmers PIG It is desirable to have a higher level declarative language. Similar to SQL query where the user specifies the what and leaves the “how” to the underlying processing engine.
  • 7. Where Should I Use Pig? Pig is a data flow language. It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently. It will consume any data that you feed it: structured, semi-structured, or unstructured. Pig provides the common data operations (filters, joins, ordering) and nested data types (tuples, bags, and maps) that are missing from MapReduce. Pig’s multi-query approach combines certain types of operations together in a single pipeline, reducing the number of times the data is scanned. This means 1/20th the lines of code and 1/16th the development time compared to writing raw MapReduce. Pig scripts are easier and faster to write than standard Java Hadoop jobs, and Pig has a lot of clever optimizations, like multi-query execution, which can make complex queries execute quicker.
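As a minimal sketch of the multi-query approach (file names and schema are hypothetical), one LOAD can feed several pipelines, and Pig combines them so the input is scanned only once:

```pig
-- hypothetical input file and schema; both pipelines share one scan of 'visits'
visits      = LOAD 'visits' AS (user:chararray, url:chararray, time:long);
by_user     = GROUP visits BY user;
user_counts = FOREACH by_user GENERATE group AS user, COUNT(visits) AS n;
by_url      = GROUP visits BY url;
url_counts  = FOREACH by_url GENERATE group AS url, COUNT(visits) AS n;
-- two STOREs in one script trigger Pig's multi-query execution
STORE user_counts INTO 'output/by_user';
STORE url_counts  INTO 'output/by_url';
```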
  • 8. Where not to use Pig? Really nasty data formats or completely unstructured data (video, audio, raw human-readable text). Pig is definitely slower than hand-tuned MapReduce jobs. When you would like more power to optimize your code. The Pig platform is designed for ETL-type use cases; it is not a great choice for real-time scenarios. Pig is also not the right choice for pinpointing a single record in very large data sets. Specialized joins (fragment-replicate, skewed, merge): the user has to know when to use which join.
  • 9. What is Pig? Pig is an open-source high-level dataflow system. It provides a simple language for queries and data manipulation Pig Latin, that is compiled into map-reduce jobs that are run on Hadoop. Why is it important? Companies like Yahoo, Google and Microsoft are collecting enormous data sets in the form of click streams, search logs, and web crawls. Some form of ad-hoc processing and analysis of all of this information is required.
  • 10. Use cases where Pig is used… Processing of Web Logs Data processing for search platforms Support for Ad Hoc queries across large datasets. Quick Prototyping of algorithms for processing large datasets.
  • 11. Conceptual Data Flow: Load Visits (User, URL, Time) and Load Pages (URL, PageRank) → Join on URL → Group by User → Compute Average PageRank → Filter avgPR > 0.5.
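The conceptual flow above can be sketched in Pig Latin; file names are hypothetical and the schemas are taken from the diagram:

```pig
-- hypothetical input files matching the diagram's schemas
visits = LOAD 'visits' AS (user:chararray, url:chararray, time:long);
pages  = LOAD 'pages'  AS (url:chararray, pagerank:double);
-- join the two relations on URL
joined = JOIN visits BY url, pages BY url;
-- group by user and average the page rank of the pages each user visited
grouped = GROUP joined BY user;
avg_pr  = FOREACH grouped GENERATE group AS user, AVG(joined.pagerank) AS avgPR;
-- keep only users whose average page rank exceeds 0.5
answer  = FILTER avg_pr BY avgPR > 0.5;
DUMP answer;
```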
  • 12. Use Case: take a DB dump in CSV format and ingest it into HDFS. Map tasks read the CSV file from HDFS, de-identify columns based on configuration, and store the de-identified CSV file back into HDFS.
  • 13. Pig – Basic Program Structure. Execution modes: Local mode executes in a single JVM and works exclusively with the local file system; great for development, experimentation, and prototyping. Hadoop mode (also known as MapReduce mode): Pig renders Pig Latin into MapReduce jobs and executes them on the cluster; it can execute against a semi-distributed or fully distributed Hadoop installation. Ways to run Pig: Script, Grunt, Embedded.
  • 14. Pig – Basic Program Structure. Script: Pig can run a script file that contains Pig commands. Example: pig script.pig runs the commands in the local file script.pig. Grunt: an interactive shell for running Pig commands; it is also possible to run Pig scripts from within Grunt using run and exec (execute). Embedded: you can run Pig programs from Java, much like you can use JDBC to run SQL programs from Java.
  • 15. Pig is made up of two components: 1) Pig Latin, the language used to express data flows. 2) Execution environments: local execution in a single JVM, or distributed execution on a Hadoop cluster.
  • 16. Pig Execution: no need to install anything extra on your Hadoop cluster! Pig resides on the user machine; the job executes on the cluster.
  • 17. Pig Latin Program: it is made up of a series of operations or transformations that are applied to the input data to produce output; Pig turns the transformations into a series of MapReduce jobs. Field: a piece of data. Tuple: an ordered set of fields, represented with “(” and “)”, e.g. (10.4, 5, word, 4, field1). Bag: a collection of tuples, represented with “{” and “}”, e.g. {(10.4, 5, word, 4, field1), (this, 1, blah)}. Similar to a relational database: a bag is like a table and a tuple is like a row in a table. Unlike a relational database, bags do not require that all tuples contain the same number of fields.
  • 18. Four Basic Types of Data Models: Atom, Tuple, Bag, Map.
  • 19. Data Model supports four basic types. Atom: a simple atomic value (int, long, double, string), e.g. ‘Abhi’. Tuple: a sequence of fields, each of which can be any of the data types, e.g. (‘Abhi’, 14). Bag: a collection of tuples of potentially varying structures; can contain duplicates, e.g. {(‘Abhi’), (‘Manu’, (14, 21))}. Map: an associative array; the key must be a chararray, but the value can be any type.
  • 20. Pig Data Types and their implementing classes: Bag → org.apache.pig.data.DataBag; Tuple → org.apache.pig.data.Tuple; Map → java.util.Map<Object, Object>; Integer → java.lang.Integer; Long → java.lang.Long; Float → java.lang.Float; Double → java.lang.Double; Chararray → java.lang.String; Bytearray → byte[].
  • 21. Pig Latin Relational Operators. Loading and Storing: LOAD loads data from the file system; STORE saves a relation to the file system or other storage; DUMP prints a relation to the console. Filtering: FILTER removes unwanted rows from a relation; DISTINCT removes duplicate rows from a relation; FOREACH...GENERATE adds or removes fields from a relation; STREAM transforms a relation using an external program. Grouping and Joining: JOIN joins two or more relations; COGROUP groups the data in two or more relations; GROUP groups the data in a single relation; CROSS creates the cross product of two or more relations. Sorting: ORDER sorts a relation by one or more fields; LIMIT limits the size of a relation to a maximum number of tuples. Combining and Splitting: UNION combines two or more relations into one; SPLIT splits a relation into two or more relations.
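A minimal sketch combining several of the operators above (file paths and the GPA threshold are hypothetical; the schema matches the student file used later in the deck):

```pig
-- load, filter, sort, limit, and store in one short pipeline
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
B = FILTER A BY gpa >= 3.0;   -- Filtering: drop low-GPA rows
C = ORDER B BY gpa DESC;      -- Sorting: highest GPA first
D = LIMIT C 10;               -- keep at most 10 tuples
STORE D INTO 'output/top_students';
```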
  • 22. Pig Latin – Nulls. Pig includes the concept of a data element being null: when a data element is null, its value is unknown. Data of any type can be null. Note that the concept of null in Pig is the same as in SQL, unlike in languages such as Java, C, or Python.
  • 23. Data Files. Student file (Name, Age, GPA): Joe, 18, 2.5; Sam, , 3.0; Angle, 21, 7.9; John, 17, 9.0; Joe, 19, 2.9. StudentRoll file (Name, Roll No.): Joe, 45; Sam, 24; Angle, 1; John, 12; Joe, 19.
  • 24. Pig Latin –Group Operator Example of GROUP Operator: A = load 'student' as (name:chararray, age:int, gpa:float); dump A; ( joe,18,2.5) (sam,,3.0) (angel,21,7.9) ( john,17,9.0) ( joe,19,2.9) X = group A by name; dump X; ( joe,{( joe,18,2.5),( joe,19,2.9)}) (sam,{(sam,,3.0)}) ( john,{( john,17,9.0)}) (angel,{(angel,21,7.9)})
  • 25. Pig Latin –COGroup Operator Example of COGROUP Operator: A = load 'student' as (name:chararray, age:int,gpa:float); B = load 'studentRoll' as (name:chararray, rollno:int); X = cogroup A by name, B by name; dump X; ( joe,{( joe,18,2.5),( joe,19,2.9)},{( joe,45),( joe,19)}) (sam,{(sam,,3.0)},{(sam,24)}) ( john,{( john,17,9.0)},{( john,12)}) (angel,{(angel,21,7.9)},{(angel,1)})
  • 26. Joins and COGROUP JOIN and COGROUP operators perform similar functions. JOIN creates a flat set of output records while COGROUP creates a nested set of output records.
  • 27. UNION UNION: To merge the contents of two or more relations.
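As a short sketch of UNION (file names are hypothetical; the two relations should have compatible schemas):

```pig
-- merge two relations with the same schema into one
A = LOAD 'students_2012' AS (name:chararray, gpa:float);
B = LOAD 'students_2013' AS (name:chararray, gpa:float);
C = UNION A, B;   -- C contains all tuples of A and B; tuple order is not guaranteed
DUMP C;
```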
  • 28. Diagnostic Operators & UDF Statements Pig Latin Diagnostic Operators Types of Pig Latin Diagnostic Operators: DESCRIBE : Prints a relation’s schema. EXPLAIN : Prints the logical and physical plans. ILLUSTRATE : Shows a sample execution of the logical plan, using a generated subset of the input. Pig Latin UDF Statements Types of Pig Latin UDF Statements: REGISTER: Registers a JAR file with the Pig runtime. DEFINE : Creates an alias for a UDF, streaming script, or a command specification.
  • 29. Describe: use the DESCRIBE operator to review the fields and data types of a relation.
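A tiny sketch of DESCRIBE on the student relation (file path is hypothetical; the printed schema format may vary slightly between Pig versions):

```pig
A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
DESCRIBE A;   -- prints the schema, e.g.: A: {name: chararray,age: int,gpa: float}
```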
  • 30. EXPLAIN: Logical Plan. Use the EXPLAIN operator to review the logical, physical, and MapReduce execution plans that are used to compute the specified relation. The logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent optimizations (such as applying filters early on) are also applied.
  • 31. EXPLAIN : Physical Plan The physical plan shows how the logical operators are translated to backend-specific physical operators. Some backend optimizations also apply.
  • 32. Illustrate: the ILLUSTRATE command is used to generate a “good” example of the input data, judged by three measures: 1) completeness, 2) conciseness, 3) degree of realism.
  • 33. Pig Latin –File Loaders Pig Latin File Loaders TextLoader: Loads from a plain text format Each line corresponds to a tuple whose single field is the line of text CSVLoader: Loads CSV files XML Loader: Loads XML files
  • 34. Pig Latin – File Loaders. PigStorage: the default storage; loads/stores relations using a field-delimited text format. Tab is the default delimiter; other delimiters can be specified in the query with “USING PigStorage(',')”. BinStorage: loads/stores relations from or to binary files; uses Hadoop Writable objects. BinaryStorage: tuples contain only a single field whose value is of type bytearray; used with Pig streaming. PigDump: stores relations using the “toString()” representation of tuples.
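A quick sketch of choosing a delimiter with PigStorage (file paths are hypothetical):

```pig
-- load a comma-delimited file, then store it tab-delimited (the default)
A = LOAD 'student.csv' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);
STORE A INTO 'output/student_tsv' USING PigStorage('\t');
```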
  • 35. Pig Latin – Creating a UDF (a filter function returning true only for certain ages):
    import java.io.IOException;
    import org.apache.pig.FilterFunc;
    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.data.Tuple;

    public class IsOfAge extends FilterFunc {
        @Override
        public Boolean exec(Tuple tuple) throws IOException {
            if (tuple == null || tuple.size() == 0) {
                return false;
            }
            try {
                Object object = tuple.get(0);
                if (object == null) {
                    return false;
                }
                int i = (Integer) object;
                return i == 18 || i == 19 || i == 21 || i == 23 || i == 27;
            } catch (ExecException e) {
                throw new IOException(e);
            }
        }
    }
  • 36. Pig Latin –Calling A UDF How to call a UDF? register myudf.jar; X = filter A by IsOfAge(age);