Identifying Appropriate Test Statistics Involving Population Mean
Overview of stinger interactive query for hive
1. Overview
of
S+nger:
Interac+ve
Query
for
Hive
@ddkaiser
linkedin.com/in/dkaiser
slideshare.net/ddkaiser
dkaiser@cdk.com
dkaiser@hortonworks.com
OC
Big
Data
Meetup
#1
May
21,
2014
David
Kaiser
2. Who Am I?
David
Kaiser
20+
years
experience
with
Linux
3
years
experience
with
Hadoop
Career
experiences:
• Data
Warehousing
• Geospa+al
Analy+cs
• Open-‐source
Solu+ons
and
Architecture
Employed
at
Hortonworks
as
a
Senior
Solu+ons
Engineer
@ddkaiser
linkedin.com/in/dkaiser
slideshare.net/ddkaiser
dkaiser@cdk.com
dkaiser@hortonworks.com
3. Overview of Stinger: Interactive Query for Hive
• Abstract:
– Hadoop
is
about
so
much
more
than
batch
processing.
With
the
recent
release
of
Hadoop
2,
there
have
been
many
new
approaches
for
increased
applica+on
performance.
– Hive
is
the
most
used
SQL
implementa+on
on
Hadoop.
– Hive
provides
the
most
amount
of
SQL
compa+bility
on
Hadoop.
– But…
Hive
is
Slow.
– Hive
WAS
Slow.
– This
talk
will
discuss
the
S+nger
ini+a+ve,
which
improved
Hive
performance
over
100x.
4. S"nger
Project
(announced
February
2013)
Batch AND Interactive SQL-in-Hadoop
Stinger Initiative
A broad, community-based effort to
drive the next generation of HIVE
Hive
0.13,
April
2014
• Hive
on
Apache
Tez
• Query
Service
• Buffer
Cache
• Cost
Based
Op+mizer
(Op+q)
• Vectorized
Processing
Hive
0.11,
May
2013:
• Base
Op+miza+ons
• SQL
Analy+c
Func+ons
• ORCFile,
Modern
File
Format
Hive
0.12,
October
2013:
• VARCHAR,
DATE
Types
• ORCFile
predicate
pushdown
• Advanced
Op+miza+ons
• Performance
Boosts
via
YARN
Speed
Improve
Hive
query
performance
by
100X
to
allow
for
interac+ve
query
+mes
(seconds)
Scale
The
only
SQL
processing
in
Hadoop
designed
for
queries
that
scale
from
TB
to
PB
SQL
Support
broadest
range
of
SQL
seman+cs
for
analy+c
applica+ons
running
against
Hadoop
Goals:
An Open Community at its finest: Apache Hive Contribution
1,672Jira Tickets Closed
145Developers
44Companies
~400,000Lines Of Code Added…
13Months
5. Outcomes from the Stinger Project
Page 5
Feature
Descrip"on
Benefit
Tez
Integra+on
Tez
is
significantly
beeer
engine
than
MapReduce
Latency
Vectorized
Query
Take
advantage
of
modern
hardware
by
processing
thousand-‐row
blocks
rather
than
row-‐at-‐a-‐+me.
Throughput
Query
Planner
Using
extensive
sta+s+cs
now
available
in
Metastore
to
beeer
plan
and
op+mize
query,
including
predicate
pushdown
during
compila+on
to
eliminate
por+ons
of
input
(beyond
par++on
pruning)
Latency
ORC
File
Columnar,
type
aware
format
with
indices
Latency
Cost
Based
Op+mizer
(Op+q)
Join
re-‐ordering
and
other
op+miza+ons
based
on
column
sta+s+cs
including
histograms
etc.
Latency
Hive
as
a
Service
Leaves
engine
running
between
sessions
Latency
Buffer
Cache
Leaves
most
used
HDFS
file
blocks
in
memory
Latency
6. Hadoop 2: Moving Past MapReduce
Page
6
HADOOP
1.0
HDFS
(redundant,
reliable
storage)
MapReduce
(cluster
resource
management
&
data
processing)
HDFS2
(redundant,
highly-‐available
&
reliable
storage)
YARN
(cluster
resource
management)
MapReduce
(data
processing)
Others
HADOOP
2.0
Single
Use
System
Batch
Apps
Mul/
Purpose
Pla5orm
Batch,
Interac/ve,
Online,
Streaming,
…
7. Apache Tez as the new Primitive
HDFS2
(redundant,
reliable
storage)
Tez
(execu+on
engine)
YARN
(cluster
resource
management)
HADOOP
2.0
MapReduce
as
Base
Apache
Tez
as
Base
HDFS
(redundant,
reliable
storage)
MapReduce
(cluster
resource
management
&
data
processing)
Pig
(data
flow)
Hive
(sql)
Others
(Cascading)
HADOOP
1.0
Data
Flow
Pig
SQL
Hive
Others
(Cascading)
Batch
MapReduce
Slider
(con+nuous
execu+on)
Online
Data
Processing
HBase,
Accumulo
Real
Time
Stream
Processing
Storm
8. Complete Open Source Stack
• YARN is the logical extension of Apache Hadoop
– Complements
HDFS,
the
data
reservoir
• Resource Management for the Enterprise Data Lake
– Shared,
secure,
mul+-‐tenant
Hadoop
Allows for all processing in Open-Source Hadoop
Page
8
HDFS2
(Redundant,
Reliable
Storage)
YARN
(Cluster
Resource
Management)
BATCH
(MapReduce)
INTERACTIVE
(Tez)
STREAMING
(Storm,
S4,…)
GRAPH
(Giraph)
IN-‐MEMORY
(Spark)
HPC
MPI
(OpenMPI)
ONLINE
(HBase)
OTHER
(Search)
(Weave…)
9. Feature
Descrip"on
Benefit
Tez
Session
Overcomes
Map-‐Reduce
job-‐launch
latency
by
pre-‐
launching
Tez
AppMaster
Latency
Tez
Container
Pre-‐
Launch
Overcomes
Map-‐Reduce
latency
by
pre-‐launching
hot
containers
ready
to
serve
queries.
Latency
Tez
Container
Re-‐
Use
Finished
maps
and
reduces
pick
up
more
work
rather
than
exi+ng.
Reduces
latency
and
eliminates
difficult
split-‐size
tuning.
Out
of
box
performance!
Latency
Run+me
re-‐
configura+on
of
DAG
Run+me
query
tuning
by
picking
aggrega+on
parallelism
using
online
query
sta+s+cs
Throughput
Tez
In-‐Memory
Cache
Hot
data
kept
in
RAM
for
fast
access.
Latency
Complex
DAGs
Tez
Broadcast
Edge
and
Map-‐Reduce-‐Reduce
paeern
improve
query
scale
and
throughput.
Throughput
Hive On Tez - Execution
10. ORC File Advantages
Sustained Query Times
Apache Hive 0.12 provides
sustained acceptable query
times even at petabyte scale
131
GB
(78%
Smaller)
File
Size
Comparison
Across
Encoding
Methods
Dataset:
TPC-‐DS
Scale
500
Dataset
221
GB
(62%
Smaller)
Encoded
with
Text
Encoded
with
RCFile
Encoded
with
ORCFile
Encoded
with
Parquet
505
GB
(14%
Smaller)
585
GB
(Original
Size)
• Larger
Block
Sizes
• Columnar
format
arranges
columns
adjacent
within
the
file
for
compression
&
fast
access
Impala
Hive
12
Smaller Footprint
Better encoding with ORC in
Apache Hive 0.12 reduces resource
requirements for your cluster.
11. ORCFile
File
Format
Page 11
Query-‐Op"mized:
Split-‐able,
columnar
storage
file
Efficient
Reads:
Break
into
large
“stripes”
of
data
for
efficient
read
Fast
Filtering:
Built
in
index,
min/max,
metadata
for
fast
filtering
blocks
-‐
bloom
filters
if
desired
Efficient
Compression:
Decompose
complex
row
types
into
primi+ves:
massive
compression
and
efficient
comparisons
for
filtering
Precomputa"on:
Built
in
aggregates
per
block
(min,
max,
count,
sum,
etc.)
12. A Journey to SQL Compliance
Evolu"on
of
SQL
Compliance
in
Hive
SQL
Datatypes
SQL
Seman"cs
INT/TINYINT/SMALLINT/BIGINT
SELECT,
INSERT
FLOAT/DOUBLE
GROUP
BY,
ORDER
BY,
HAVING
BOOLEAN
JOIN
on
explicit
join
key
ARRAY,
MAP,
STRUCT,
UNION
Inner,
outer,
cross
and
semi
joins
STRING
Sub-‐queries
in
the
FROM
clause
BINARY
ROLLUP
and
CUBE
TIMESTAMP
UNION
DECIMAL
Standard
aggrega+ons
(sum,
avg,
etc.)
DATE
Custom
Java
UDFs
VARCHAR
Windowing
func+ons
(OVER,
RANK,
etc.)
CHAR
Advanced
UDFs
(ngram,
XPath,
URL)
Interval
Types
Sub-‐queries
for
IN/NOT
IN,
HAVING
JOINs
in
WHERE
Clause
INSERT/UPDATE/DELETE
Legend
Hive
10
or
earlier
Roadmap
Hive
11
Hive
12
Hive
13
13. Tez – Execution Performance
• Performance gains over Map Reduce
– Eliminate
replicated
write
barrier
between
successive
computa+ons.
– Eliminate
job
launch
overhead
of
workflow
jobs.
– Eliminate
extra
stage
of
map
reads
in
every
workflow
job.
– Eliminate
queue
and
resource
conten+on
suffered
by
workflow
jobs
that
are
started
aper
a
predecessor
job
completes.
Page
13
Pig/Hive
-‐
MR
Pig/Hive
-‐
Tez
14. Hive
–
MR
Hive
–
Tez
Hive-on-MR vs. Hive-on-Tez
SELECT
a.state,
COUNT(*),
AVERAGE(c.price)
FROM
a
JOIN
b
on
(a.id
=
b.id)
JOIN
c
on
(a.itemId
=
c.itemId)
GROUP
by
a.state
SELECT
a.state
JOIN
(a,
c)
SELECT
c.price
SELECT
b.id
JOIN(a,
b)
GROUP
BY
a.state
COUNT(*)
AVERAGE(c.price)
M M M
R R
M M
R
M M
R
M M
R
HDFS
HDFS
HDFS
M M M
R R
R
M M
R
R
SELECT
a.state,
c.itemId
JOIN
(a,
c)
JOIN(a,
b)
GROUP
BY
a.state
COUNT(*)
AVERAGE(c.price)
SELECT
b.id
Tez
avoids
unneeded
writes
to
HDFS
15. Vectorization
• Rewrite all operations to operate on blocks of 1K+
records, rather than one record at a time
• Block is array of Java scalars, not Objects (eliminate
Objects – compounding GC gains over time)
• Avoids many function calls, CPU pipeline stalls
• Size to fit in L1 cache, avoid cache misses
Page
15
16. Stinger Phase 3: Unlocking Interactive Query
S"nger
Phase
3:
Features
and
Benefits
Container
Pre-‐Launch
Overcomes
Java
VM
startup
latency
by
pre-‐
launching
hot
containers
ready
to
serve
queries
Container
Re-‐Use
Finished
Maps
and
Reduces
pick
up
more
work
rather
than
exi+ng.
Reduces
latency
and
eliminates
difficult
split
size
tuning
Tez
Integra+on
Tez
Broadcast
Edge
and
Map-‐Reduce-‐Reduce
paeern
improve
query
scale
and
throughput
In-‐Memory
Cache
Hot
data
kept
in
RAM
for
fast
access
22. Your Fastest On-ramp to Enterprise Hadoop™!
Page
22
hep://hortonworks.com/products/hortonworks-‐sandbox/
The
Sandbox
lets
you
experience
Apache
Hadoop
from
the
convenience
of
your
own
laptop
–
no
data
center,
no
cloud
and
no
internet
connec+on
needed!
The
Hortonworks
Sandbox
is:
• A
free
download:
hep://hortonworks.com/products/hortonworks-‐sandbox/
• A
complete,
self
contained
virtual
machine
with
Apache
Hadoop
pre-‐configured
• A
personal,
portable
and
standalone
Hadoop
environment
• A
set
of
hands-‐on,
step-‐by-‐step
tutorials
that
allow
you
to
learn
and
explore
Hadoop