A session from Qubole Best Practice Webinar Series- “Big Data Secrets from the Pros”. Covers how to make Apache Hive queries run faster by
a. Better layout of data on HDFS via partitioning and bucketing
b. Designing test queries by using block and bucket sampling before running the queries on large datasets
c. Using bucket map joins and parallel processing to run queries faster
Visit www.qubole.com for more information.
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
Effective Hive Queries
1. Effec%ve
Hive
Queries
Secrets
From
the
Pros
We
will
be
star,ng
at
11:03
PDT
Use
the
Chat
Pane
in
GoToWebinar
to
Ask
Ques%ons!
Assess
your
level
and
learn
new
stuff
This
webinar
is
intended
for
intermediate
audiences
(familiar
with
Apache
Hive
and
Hadoop,
but
not
experts)
?
2. AGENDA
This
Webinar
provides
%ps
on
improving
the
performance
and
beJer
u%lizing
resources
using
the
following
best
prac%ces:
• Data
Layout
(Par%%ons
and
Buckets)
• Data
Sampling
(Bucket
and
Block
sampling)
• Data
Processing
(Bucket
Map
Join
and
Parallel
execu%on)
3. Dataset
Used
#
of
records:
276M
records
Columns:
I%nerary
ID
Year
&Quarter
of
Travel
Trip
Origin
City
&
State
Trip
Des%na%on
City
&
State
Distance
between
Origin
&
Des%na%on
Airline
Bookings
All
Includes
stops
at
intermediate
ci%es
#
of
records:
116M
records
Columns:
I%nerary
ID
Year
&Quarter
of
Travel
Trip
Origin
City
&
State
Trip
Des%na%on
City
&
State
Distance
between
Origin
&
Des%na%on
Airline
Bookings
Origin
Only
Only
first
leg
of
travel
#
of
records:
50
Columns:
State
code
&
Name
Popula%on
Census
Human
popula%on
by
US
State
4. #1
-‐
Data
Par%%oning
• Problem
PaJern
– Query
a
subset
of
data
in
a
table
– Subset
iden%fied
by
“Column_Name
=
X”
filter
• Solu%on
paJern
– Layout
data
in
sub-‐directories
with
each
directory
associated
with
a
value
of
the
par%%on
column
– The
filter
on
par%%on
column
just
picks
a
single
sub
directory
• Approach
– Use
PARTITION
BY
clause
• Benefit
– Par%%on
pruning
– 2.7x
faster
on
a
query
on
Airline
Bookings
Dataset
(29
seconds)
5. #1
-‐
Data
Par%%oning
Airline
Bookings
All
Table
Origin
State
(Par%%on
Column
/
Sub-‐directory)
CA
WY
AL
File1001.dat
File1002.dat
File100n.dat
File3001.dat
File3002.dat
File300n.dat
Filex001.dat
Filex002.dat
Filex00n.dat
Files
inside
the
par%%on
SELECT
origin_city,
origin_state
FROM
Airline_Bookings_All
WHERE
origin_state
=
‘CA’
CREATE
TABLE
Airline_Bookings_All
….
PARTITIONED
BY
(origin_state
STRING)
6. #2
-‐
Data
Bucke%ng
• Problem
PaJern
– Join
data
in
two
large
tables
efficiently
– Sample
data
inside
a
table
efficiently
• Solu%on
paJern
– More
efficient
processing
by
storing
data
in
hash
buckets
• Approach
• Use
bucke%ng
using
CLUSTERED
BY
..
INTO
n
BUCKETS
• Benefit
– Bucket
Map
Join
– Bucket
Sampling
7. #2
–
Data
Bucke%ng
CREATE
TABLE
Airline_Bookings_All
…
CLUSTERED
BY
(i%nid)
INTO
64
BUCKETS
set
hive.enforce.bucke%ng
=
true;
INSERT
OVERWRITE
TABLE
Airline_Bookings_All
SELECT
…
FROM
..
Ailrine_Bookings_All
File00.dat
File63.dat
File01.dat
Each
File
contains
all
the
rows
that
correspond
to
the
same
hash
of
i%nid
column
8. #2
-‐
Data
Bucke%ng
a
File1001.dat
File1002.dat
File100n.dat
Filex001.dat
Filex002.dat
Filex00m.dat
Files
containing
table
data
bucketed
on
a
column
b
set
hive.op%mize.bucketmapjoin
=
true;
SELECT
/*+
MAPJOIN(a,
b)
*/
a.*,
b.*
FROM
Airline_Bookings_All
a
JOIN
Airline_Bookings_Origin_Only
b
ON
a.i%nid
=
b.i%nid
Note:
1. Both
the
tables
are
bucketed
on
i%nid
column
2. The
numbers
of
buckets
in
the
two
tables
are
a
strict
mul%ple
of
each
other
9. #3
-‐
Bucket
Sampling
• Problem
PaJern
– Work
on
joinable
samples
of
data
from
different
tables
• Solu%on
paJern
– Use
Bucket
Sampling
• Approach
• TABLESAMPLE
(BUCKET
x
OUT
OF
Y
ON
column)
• Benefit
– Useful
while
working
with
sample
data
and
joins
10. #3
-‐
Bucket
Sampling
Filex002.dat
Filex030.dat
Filex064.dat
Files
containing
bookings
data
bucketed
on
i%nid
a
SELECT
a.*,
b.*
FROM
Airline_Bookings_All
TABLESAMPLE(bucket
30
out
of
64
on
i%nid)
a
,
Airline_Bookings_Origin_Only
TABLESAMPLE(bucket
30
out
of
64
on
i%nid)
b
WHERE
a.i%nid
=
b.i%nid
Filex001.dat
Filex063.dat
Filey002.dat
Filey030.dat
Filey064.dat
b
Filey001.dat
Filey063.dat
11. #4
–
Block
Sampling
• Problem
PaJern
– View
a
sample
of
a
data
with
in
a
table
– Sample
size
expressed
as
number
of
rows,
%age
of
data,
or
number
of
MBs
• Solu%on
paJern
– Use
Block
sampling
• Approach
– Use
TABLESAMPLE
(n%,
nM,
or
n
ROWS)
• Benefit
– Geyng
a
random
sample
from
the
table
– More
op%ons
to
specify
how
many
samples
to
generate
12. #5
–
Parallel
Execu%on
SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct
FROM
(
SELECT itinid,
year,
quarter,
origin,
originstate
FROM air_travel_bookings_8
)a
JOIN
(
SELECT itinid,
origin,
originstate
FROM air_travel_origins_8
)B
ON
( A.itinid = b.itinid
and a.origin = b.origin
and a.originstate = b.originstate)
GROUP BY
a.year, a.quarter, a.origin, a.originstate;
Stage
1
Stage
2
Stage
3
Stage
1
Stage
2
Stage
3
Stage
1
Stage
2
Stage
3
set
hive.exec.parallel
=
false;
set
hive.exec.parallel
=
true;
13. Summary
• Iterate
quickly
on
Query
Design
– Use
Bucket
and
Block
Sampling
• Run
queries
faster
– Par%%oning
to
invoke
Par%%on
Pruning
– Bucke%ng
to
invoke
Bucket
Map
Joins
– Execute
complex
queries
in
parallel
14. THANK
YOU
Managed
Cluster
Built-‐In
Connectors
Friendly
User-‐Interface
Dedicated
Support
• 100%
Managed
Hadoop
Cluster
in
the
Cloud
• Auto-‐Scaling
Cluster.
Full
Life-‐cycle
Management
• +12
Connectors
to
Applica%ons
and
Data
Sources
• 14-‐Day
Free
Trial
(free
account
available)
• 24/7
Customer
Support
What’s
Included?
è
www.qubole.com/try
ç