Effective Hive Queries
Secrets From the Pros

We will be starting at 11:03 PDT
Use the Chat Pane in GoToWebinar to ask questions!

Assess your level and learn new stuff
This webinar is intended for intermediate audiences
(familiar with Apache Hive and Hadoop, but not experts)
  
AGENDA

This webinar provides tips on improving performance and making better use of resources through the following best practices:
•  Data Layout (Partitions and Buckets)
•  Data Sampling (Bucket and Block sampling)
•  Data Processing (Bucket Map Join and Parallel execution)
  
Dataset Used

Airline Bookings All (includes stops at intermediate cities)
•  # of records: 276M
•  Columns: Itinerary ID; Year & Quarter of Travel; Trip Origin City & State; Trip Destination City & State; Distance between Origin & Destination

Airline Bookings Origin Only (only first leg of travel)
•  # of records: 116M
•  Columns: Itinerary ID; Year & Quarter of Travel; Trip Origin City & State; Trip Destination City & State; Distance between Origin & Destination

Census (human population by US State)
•  # of records: 50
•  Columns: State code & Name; Population
  
#1 - Data Partitioning

•  Problem Pattern
    –  Query a subset of data in a table
    –  Subset identified by a "Column_Name = X" filter
•  Solution Pattern
    –  Lay out data in sub-directories, with each directory associated with a value of the partition column
    –  A filter on the partition column then picks just a single sub-directory
•  Approach
    –  Use the PARTITIONED BY clause
•  Benefit
    –  Partition pruning
    –  2.7x faster on a query on the Airline Bookings dataset (29 seconds)
  
#1 - Data Partitioning

[Diagram: the Airline Bookings All table is stored as one sub-directory per value of the Origin State partition column (CA, WY, AL). Each sub-directory holds the files inside that partition (File1001.dat, File1002.dat, File100n.dat; File3001.dat, File3002.dat, File300n.dat; Filex001.dat, Filex002.dat, Filex00n.dat).]

SELECT origin_city, origin_state
FROM Airline_Bookings_All
WHERE origin_state = 'CA'

CREATE TABLE Airline_Bookings_All
….
PARTITIONED BY (origin_state STRING)
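
As a rough sketch of how a table partitioned this way might be loaded and how pruning could be checked: the demo table `bookings_by_state_demo`, the staging table `bookings_staging`, and every column except `origin_state` are assumptions made for illustration, since the slide elides the real column list. The two dynamic-partition settings are standard Hive properties whose defaults vary by version.

```sql
-- Hypothetical demo table; only origin_state comes from the slides.
CREATE TABLE bookings_by_state_demo (
  itinid      STRING,
  origin_city STRING
)
PARTITIONED BY (origin_state STRING);

-- Let Hive derive the partition value from the data itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column goes last in the SELECT list; each row is written
-- to the sub-directory for its origin_state value.
-- bookings_staging is an assumed, already-loaded unpartitioned table.
INSERT OVERWRITE TABLE bookings_by_state_demo PARTITION (origin_state)
SELECT itinid, origin_city, origin_state
FROM bookings_staging;

-- List the partitions (sub-directories) that now exist.
SHOW PARTITIONS bookings_by_state_demo;

-- With pruning, the reported input partitions should include only
-- origin_state=CA rather than every state.
EXPLAIN DEPENDENCY
SELECT origin_city, origin_state
FROM bookings_by_state_demo
WHERE origin_state = 'CA';
```

Checking the plan before running the query is a cheap way to confirm that the filter really is on the partition column; a filter on any other column would still read every sub-directory.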
  
#2 - Data Bucketing

•  Problem Pattern
    –  Join data in two large tables efficiently
    –  Sample data inside a table efficiently
•  Solution Pattern
    –  More efficient processing by storing data in hash buckets
•  Approach
    –  Bucket the table using CLUSTERED BY .. INTO n BUCKETS
•  Benefit
    –  Bucket Map Join
    –  Bucket Sampling
  
#2 - Data Bucketing

CREATE TABLE Airline_Bookings_All
…
CLUSTERED BY (itinid) INTO 64 BUCKETS

set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE Airline_Bookings_All
SELECT …
FROM ..

[Diagram: Airline_Bookings_All is stored as 64 files, File00.dat, File01.dat, through File63.dat. Each file contains all the rows whose itinid column hashes to the same bucket, that is, the hash of itinid modulo the number of buckets.]
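
A fuller version of these steps might look like the sketch below. The table `bookings_bucketed_demo`, the staging table `bookings_staging`, and every column other than `itinid` are hypothetical stand-ins, because the slide elides the real column list and source query.

```sql
-- Hypothetical demo table; only itinid and the bucket count come from the slides.
CREATE TABLE bookings_bucketed_demo (
  itinid       STRING,
  origin_city  STRING,
  origin_state STRING
)
CLUSTERED BY (itinid) INTO 64 BUCKETS;

-- On Hive 0.x/1.x this makes the insert use one reducer per bucket.
-- Hive 2.0 and later always enforce bucketing, so the setting is not needed there.
SET hive.enforce.bucketing = true;

-- bookings_staging is an assumed, already-loaded unbucketed table.
INSERT OVERWRITE TABLE bookings_bucketed_demo
SELECT itinid, origin_city, origin_state
FROM bookings_staging;

-- The output should report the bucket count and the bucketing column.
DESCRIBE FORMATTED bookings_bucketed_demo;
```

Populating the table through an INSERT ... SELECT matters here: files copied directly into the table directory are not redistributed by hash, so the bucket metadata would no longer describe the data.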
  
#2 - Data Bucketing

[Diagram: tables a and b, each stored as a set of files containing table data bucketed on a column (File1001.dat, File1002.dat, File100n.dat and Filex001.dat, Filex002.dat, Filex00m.dat).]

set hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(a, b) */ a.*, b.*
FROM Airline_Bookings_All a JOIN Airline_Bookings_Origin_Only b
ON a.itinid = b.itinid

Note:
1.  Both tables are bucketed on the itinid column
2.  The number of buckets in one table is a multiple of the number of buckets in the other
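
The bucket map join examples in the Hive documentation name only the table to be held in memory in the MAPJOIN hint. A variant of the query above along those lines is sketched here. Hinting `b` (the smaller table, at 116M records against 276M) is a choice made for this sketch, and whether `hive.ignore.mapjoin.hint` needs changing depends on the Hive version in use.

```sql
SET hive.optimize.bucketmapjoin = true;

-- Some Hive releases ignore MAPJOIN hints by default; this turns hints back on.
-- Whether it is needed is an assumption about the reader's version.
SET hive.ignore.mapjoin.hint = false;

-- Only b is hinted, so matching buckets of b are loaded into memory
-- while the buckets of a are streamed through the mappers.
EXPLAIN
SELECT /*+ MAPJOIN(b) */ a.*, b.*
FROM Airline_Bookings_All a
JOIN Airline_Bookings_Origin_Only b
  ON a.itinid = b.itinid;
```

If the plan shows a map-side join operator rather than a reduce-side join, the hint was honoured. Dropping the EXPLAIN keyword runs the query itself.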
  
#3 - Bucket Sampling

•  Problem Pattern
    –  Work on joinable samples of data from different tables
•  Solution Pattern
    –  Use Bucket Sampling
•  Approach
    –  TABLESAMPLE (BUCKET x OUT OF y ON column)
•  Benefit
    –  Useful while working with sample data and joins
  
#3 - Bucket Sampling

[Diagram: table a is stored as files Filex001.dat, Filex002.dat, through Filex064.dat and table b as files Filey001.dat, Filey002.dat, through Filey064.dat, both containing bookings data bucketed on itinid. Filex030.dat and Filey030.dat correspond to the bucket sampled by the query below.]

SELECT a.*, b.*
FROM Airline_Bookings_All TABLESAMPLE(BUCKET 30 OUT OF 64 ON itinid) a
   , Airline_Bookings_Origin_Only TABLESAMPLE(BUCKET 30 OUT OF 64 ON itinid) b
WHERE a.itinid = b.itinid
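
The sample size is controlled by the "OUT OF y" part. The sketch below assumes Airline_Bookings_All is bucketed into 64 buckets on itinid, as on the earlier slide, and uses count(*) only to make the relative sizes visible. The mapping of sample buckets to stored buckets follows the rule given in the Hive sampling documentation.

```sql
-- y = 64 matches the table's bucket count: exactly one stored bucket
-- (the 30th) is read, roughly 1/64 of the rows.
SELECT count(*)
FROM Airline_Bookings_All TABLESAMPLE(BUCKET 30 OUT OF 64 ON itinid) s;

-- y = 16: each sample bucket is made of 64/16 = 4 stored buckets
-- (the 3rd, 19th, 35th and 51st), roughly 1/16 of the rows.
SELECT count(*)
FROM Airline_Bookings_All TABLESAMPLE(BUCKET 3 OUT OF 16 ON itinid) s;

-- Sampling on something other than the bucketing column still works,
-- but Hive has to scan the whole table to do it.
SELECT count(*)
FROM Airline_Bookings_All TABLESAMPLE(BUCKET 3 OUT OF 16 ON rand()) s;
```

Using the same x, y and column on both sides of a join, as in the query above, keeps the two samples joinable because matching itinid values land in the same bucket number in both tables.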
  
#4 - Block Sampling

•  Problem Pattern
    –  View a sample of the data within a table
    –  Sample size expressed as a number of rows, a percentage of the data, or a number of MBs
•  Solution Pattern
    –  Use Block Sampling
•  Approach
    –  Use TABLESAMPLE (n PERCENT), TABLESAMPLE (nM), or TABLESAMPLE (n ROWS)
•  Benefit
    –  Getting a random sample from the table
    –  More options to specify how large a sample to generate
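
The three forms might be used as follows. The table name comes from the earlier slides, while the specific sizes (0.1 percent, 100M, 10 rows) are arbitrary values picked for illustration.

```sql
-- At least about 0.1% of the input by size. Sampling works at block
-- granularity, so the actual sample can be larger than requested.
SELECT * FROM Airline_Bookings_All TABLESAMPLE(0.1 PERCENT) s;

-- About 100 MB of input or more.
SELECT * FROM Airline_Bookings_All TABLESAMPLE(100M) s;

-- 10 rows from each input split, so the total row count depends on
-- how many splits the table has.
SELECT * FROM Airline_Bookings_All TABLESAMPLE(10 ROWS) s;

-- Changing the seed selects a different sample on the next run.
SET hive.sample.seednumber = 7;
```

Unlike bucket sampling, none of these forms require the table to be bucketed, which makes them convenient for a first look at any table.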
  
#5 - Parallel Execution

SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct
FROM
  (
    SELECT itinid,
           year,
           quarter,
           origin,
           originstate
    FROM air_travel_bookings_8
  ) a
JOIN
  (
    SELECT itinid,
           origin,
           originstate
    FROM air_travel_origins_8
  ) b
ON (    a.itinid = b.itinid
    AND a.origin = b.origin
    AND a.originstate = b.originstate)
GROUP BY
  a.year, a.quarter, a.origin, a.originstate;

[Diagram: the query's Stage 1, Stage 2 and Stage 3 under two settings. With "set hive.exec.parallel = false;" the stages run one after another. With "set hive.exec.parallel = true;" Stage 1 and Stage 2 run side by side, followed by Stage 3.]
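
One way to try this out is sketched below, using the same query and table names as the slide. The thread-count value shown is Hive's documented default and is included only to show where the limit is set.

```sql
-- Allow stages that do not depend on each other to run at the same time.
SET hive.exec.parallel = true;

-- Upper bound on how many stages may run concurrently.
SET hive.exec.parallel.thread.number = 8;

-- The STAGE DEPENDENCIES section of the plan shows which stages are
-- root stages and which must wait; only independent stages can overlap.
EXPLAIN
SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct
FROM
  (SELECT itinid, year, quarter, origin, originstate
   FROM air_travel_bookings_8) a
JOIN
  (SELECT itinid, origin, originstate
   FROM air_travel_origins_8) b
ON (    a.itinid = b.itinid
    AND a.origin = b.origin
    AND a.originstate = b.originstate)
GROUP BY a.year, a.quarter, a.origin, a.originstate;
```

Parallel stages compete for the same cluster slots, so the gain depends on how much spare capacity the cluster has while the query runs.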
  
Summary

•  Iterate quickly on query design
    –  Use Bucket and Block Sampling
•  Run queries faster
    –  Partitioning to invoke Partition Pruning
    –  Bucketing to invoke Bucket Map Joins
    –  Execute complex queries in parallel
  
THANK YOU

Managed Cluster, Built-In Connectors, Friendly User-Interface, Dedicated Support

What's Included?
•  100% Managed Hadoop Cluster in the Cloud
•  Auto-Scaling Cluster. Full Life-cycle Management
•  +12 Connectors to Applications and Data Sources
•  14-Day Free Trial (free account available)
•  24/7 Customer Support

www.qubole.com/try
  
