Effective Hive Queries

Eﬀec%ve
Hive
Queries

Secrets
From
the
Pros

We
will
be
star,ng
at
11:03
PDT

Use
the
Chat
Pane
in
GoToWebinar
to
Ask
Ques%ons!

Assess
your
level
and
learn
new
stuﬀ

This
webinar
is
intended
for
intermediate
audiences

(familiar
with
Apache
Hive
and
Hadoop,
but
not
experts)

?

AGENDA

This
Webinar
provides
%ps
on
improving
the
performance
and

beJer
u%lizing
resources
using
the
following
best
prac%ces:

•  Data
Layout
(Par%%ons
and
Buckets)

•  Data
Sampling
(Bucket
and
Block
sampling)

•  Data
Processing
(Bucket
Map
Join
and
Parallel

execu%on)

Dataset
Used

#
of
records:

276M
records

Columns:

I%nerary
ID

Year
&Quarter
of
Travel

Trip
Origin
City
&
State

Trip
Des%na%on
City
&
State

Distance
between
Origin
&
Des%na%on

Airline
Bookings
All

Includes
stops
at
intermediate
ci%es

#
of
records:

116M
records

Columns:

I%nerary
ID

Year
&Quarter
of
Travel

Trip
Origin
City
&
State

Trip
Des%na%on
City
&
State

Distance
between
Origin
&
Des%na%on

Airline
Bookings
Origin
Only

Only
ﬁrst
leg
of
travel

#
of
records:

50

Columns:

State
code
&
Name

Popula%on

Census

Human
popula%on
by
US
State

#1
-‐
Data
Par%%oning

•  Problem
PaJern

–  Query
a
subset
of
data
in
a
table

–  Subset
iden%fied
by
“Column_Name
=
X”
filter

•  Solu%on
paJern

–  Layout
data
in
sub-‐directories
with
each
directory
associated

with
a
value
of
the
par%%on
column

–  The
filter
on
par%%on
column
just
picks
a
single
sub
directory

•  Approach

–  Use
PARTITION
BY
clause

•  Benefit

–  Par%%on
pruning

–  2.7x
faster
on
a
query
on
Airline
Bookings
Dataset
(29
seconds)

#1
-‐
Data
Par%%oning

Airline
Bookings
All
Table

Origin
State
(Par%%on

Column
/
Sub-‐directory)
CA
WY
AL

File1001.dat

File1002.dat

File100n.dat

File3001.dat

File3002.dat

File300n.dat

Filex001.dat

Filex002.dat

Filex00n.dat

Files
inside
the

par%%on

SELECT
origin_city,
origin_state

FROM
Airline_Bookings_All

WHERE
origin_state
=
‘CA’

CREATE
TABLE

….

PARTITIONED
BY
(origin_state
STRING)

#2
-‐
Data
Bucke%ng

•  Problem
PaJern

–  Join
data
in
two
large
tables
efficiently

–  Sample
data
inside
a
table
efficiently

•  Solu%on
paJern

–  More
efficient
processing
by
storing
data
in
hash
buckets

•  Approach

•  Use
bucke%ng
using
CLUSTERED
BY
..
INTO
n
BUCKETS

•  Benefit

–  Bucket
Map
Join

–  Bucket
Sampling

#2
–
Data
Bucke%ng

CREATE
TABLE

…

CLUSTERED
BY
(i%nid)
INTO
64
BUCKETS

set
hive.enforce.bucke%ng
=
true;

INSERT
OVERWRITE
TABLE

SELECT
…

FROM
..

Ailrine_Bookings_All

File00.dat

File63.dat

File01.dat

Each
File
contains
all

the
rows
that

correspond
to
the

same
hash
of
i%nid

column

#2
-‐
Data
Bucke%ng

a

File1001.dat

File1002.dat

File100n.dat

Filex001.dat

Filex002.dat

Filex00m.dat

Files
containing
table

data
bucketed
on
a

column

b

set
hive.op%mize.bucketmapjoin
=
true;

SELECT
/*+
MAPJOIN(a,
b)
*/
a.*,
b.*

FROM
a
JOIN
Airline_Bookings_Origin_Only
b

ON
a.i%nid
=
b.i%nid

Note:

1.  Both
the
tables
are
bucketed
on
i%nid
column

2.  The
numbers
of
buckets
in
the
two
tables
are
a
strict
mul%ple
of
each
other

#3
-‐
Bucket
Sampling

•  Problem
PaJern

–  Work
on
joinable
samples
of
data
from
diﬀerent
tables

•  Solu%on
paJern

–  Use
Bucket
Sampling

•  Approach

•  TABLESAMPLE
(BUCKET
x
OUT
OF
Y
ON
column)

•  Beneﬁt

–  Useful
while
working
with
sample
data
and
joins

#3
-‐
Bucket
Sampling

Filex002.dat

Filex030.dat

Filex064.dat

Files
containing
bookings
data

bucketed
on
i%nid

a

SELECT
a.*,
b.*

FROM
TABLESAMPLE(bucket
30
out
of
64
on
i%nid)
a

,
Airline_Bookings_Origin_Only
TABLESAMPLE(bucket
30
out
of
64
on
i%nid)
b

WHERE
a.i%nid
=
b.i%nid

Filex001.dat

Filex063.dat

Filey002.dat

Filey030.dat

Filey064.dat

b

Filey001.dat

Filey063.dat

#4
–
Block
Sampling

•  Problem
PaJern

–  View
a
sample
of
a
data
with
in
a
table

–  Sample
size
expressed
as
number
of
rows,
%age
of
data,
or

number
of
MBs

•  Solu%on
paJern

–  Use
Block
sampling

•  Approach

–  Use
TABLESAMPLE
(n%,
nM,
or
n
ROWS)

•  Beneﬁt

–  Geyng
a
random
sample
from
the
table

–  More
op%ons
to
specify
how
many
samples
to
generate

#5
–
Parallel
Execu%on

SELECT a.year, a.quarter, a.origin, a.originstate, count(*) ct
FROM
(
SELECT itinid,
year,
quarter,
origin,
originstate
FROM air_travel_bookings_8
)a
JOIN
(
SELECT itinid,
origin,
originstate
FROM air_travel_origins_8
)B
ON
( A.itinid = b.itinid
and a.origin = b.origin
and a.originstate = b.originstate)
GROUP BY
a.year, a.quarter, a.origin, a.originstate;
Stage
1

Stage
2

Stage
3

Stage
1

Stage
2

Stage
3

Stage
1
Stage
2

Stage
3

set
hive.exec.parallel
=
false;

set
hive.exec.parallel
=
true;

Summary

•  Iterate
quickly
on
Query
Design

– Use
Bucket
and
Block
Sampling

•  Run
queries
faster

– Par%%oning
to
invoke
Par%%on
Pruning

– Bucke%ng
to
invoke
Bucket
Map
Joins

– Execute
complex
queries
in
parallel

THANK
YOU

Managed
Cluster
Built-‐In
Connectors
Friendly
User-‐Interface
Dedicated
Support

•  100%
Managed
Hadoop
Cluster
in
the
Cloud

•  Auto-‐Scaling
Cluster.
Full
Life-‐cycle
Management

•  +12
Connectors
to
Applica%ons
and
Data
Sources

•  14-‐Day
Free
Trial
(free
account
available)

•  24/7
Customer
Support

What’s
Included?

è
www.qubole.com/try
ç

Effective Hive Queries

Recommended

Recommended

More Related Content

Similar to Effective Hive Queries

Similar to Effective Hive Queries (20)

More from Qubole

More from Qubole (20)

Recently uploaded

Recently uploaded (20)

Effective Hive Queries