Running TPC-H On Pig
Jie Li, Koichi Ishida, Muzhi Zhao, Ralf Diestelkaemper, Xuan Wang, Yin Lin
CPS 216: Data Intensive Computing Systems, Dec 9, 2011
Goals
[1] https://issues.apache.org/jira/browse/HIVE-600
Benchmark Set Up
Initial Result
Six Rules Of Writing Efficient Pig Scripts
Rule 1: Reorder JOINs properly
* We focused on the default hash join. The replicated join does not apply to most of the TPC-H joins, and its benefit is negligible in most queries.
Apply Rule 1 to TPC-H
Rule 2: COGROUP
Rule 2 Example
Apply Rule 2 to TPC-H Query 13
Rule 3: FLATTEN
Rule 3 Example: SQL vs. Pig
Pig:
t1 = group A by x;
t2 = foreach t1 generate FLATTEN(A), AVG(A.y) as avg_y;
t3 = filter t2 by y < avg_y;
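To make the data flow of the Pig script above easier to follow, here is a minimal plain-Python sketch of what it computes: each row of A is kept only if its y is below the average y of its x-group. The function name `below_group_average` and the sample rows are hypothetical, and the sketch only mirrors the semantics; it says nothing about how Pig executes the script.

```python
from collections import defaultdict


def below_group_average(rows):
    """Keep rows whose y is below the average y of their x-group.

    rows: iterable of (x, y) tuples, standing in for relation A.
    Returns a list of (x, y, avg_y) tuples, like relation t3.
    """
    # t1 = group A by x;
    groups = defaultdict(list)
    for x, y in rows:
        groups[x].append((x, y))

    result = []
    for bag in groups.values():
        # AVG(A.y) is computed once per group ...
        avg_y = sum(y for _, y in bag) / len(bag)
        # ... and FLATTEN(A) re-emits every row of the bag next to it (t2).
        # t3 = filter t2 by y < avg_y;
        result.extend((x, y, avg_y) for x, y in bag if y < avg_y)
    return result


# Hypothetical sample data for relation A.
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 60.0)]
print(below_group_average(rows))
```

The point of the pattern is that the grouped rows and their aggregate travel together, so the comparison needs no second pass that joins A back to its per-group averages.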
Apply Rule 2 and 3 to TPC-H Query 17
Rule 4: Project before (CO)GROUP
Rule 4 Example
Rule 5: Remove types in LOAD
Apply Rule 5 to TPC-H Query 6
Rule 6: Use hash-based aggregation
Query 1 (Rule 6 will be applicable soon) Q1 has a group-by and several aggregations.
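As a rough illustration of why hash-based aggregation suits a query shaped like Q1 (one group-by, several aggregations), the sketch below keeps one small running state per group key in a hash table and updates it in a single pass. It shows the general technique in plain Python, not Pig's implementation; the function name `hash_aggregate` and the column layout (key, quantity, price) are assumptions made for the example.

```python
from collections import defaultdict


def hash_aggregate(rows):
    """Group-by with several aggregates in one pass over the input.

    rows: iterable of (key, quantity, price) tuples (hypothetical columns).
    Returns {key: (sum_qty, sum_price, avg_price, count)}.
    """
    # key -> [sum_qty, sum_price, count]; only running totals are held
    # in memory, never the rows themselves.
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for key, qty, price in rows:
        state = acc[key]
        state[0] += qty
        state[1] += price
        state[2] += 1
    # Averages are derived from the totals once the pass is complete.
    return {k: (sq, sp, sp / n, n) for k, (sq, sp, n) in acc.items()}


# Hypothetical sample input.
sample = [("A", 1, 10), ("B", 2, 20), ("A", 3, 30)]
print(hash_aggregate(sample))
```

When the number of distinct keys is small, the table stays tiny however many rows flow through it, which is the situation where this approach tends to pay off.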
Six Rules Summary
All rewritten queries based on Rules 1-5
Updated Result
Acknowledgement

Pig TPC-H Benchmark and Performance Tuning
