1. Cloudera Impala: A Modern SQL Query Engine for Hadoop
Justin Erickson | Product Manager
January 2013
2. Agenda
• Intro to Impala
• Impala's Architecture
• Comparisons
Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
3. Why Hadoop?
• Scalability
  • Scales simply by adding nodes
  • Local processing to avoid network bottlenecks
• Flexibility
  • All kinds of data (blobs, documents, records, etc.)
  • In all forms (structured, semi-structured, unstructured)
  • Store anything, then later analyze what you need
• Efficiency
  • Cost efficiency (<$1k/TB) on commodity hardware
  • Unified storage, metadata, security (no duplication or synchronization)
4. What's Impala?
• Interactive SQL
  • Typically 4-35x faster than Hive (observed up to 100x faster)
  • Responses in seconds instead of minutes (sometimes sub-second)
• Nearly ANSI-92 standard SQL queries with HiveQL
  • Compatible SQL interface for existing Hadoop/CDH applications
  • Based on industry-standard SQL
• Natively on Hadoop/HBase storage and metadata
  • Flexibility, scale, and cost advantages of Hadoop
  • No duplication/synchronization of data and metadata
  • Local processing to avoid network bottlenecks
• Separate runtime from MapReduce
  • MapReduce is designed for batch and is great at it
  • Impala is purpose-built for low-latency SQL queries on Hadoop
5. So what?
• Interactive BI/analytics
  • BI tools impractical on Hadoop before Impala
  • Move from 10s of Hadoop users per cluster to 100s of SQL users
  • More and faster value from "big data"
• ELT/data processing with tight SLAs
  • Sub-minute SLAs now possible
• Cost efficiency
  • Fewer nodes to meet response time SLAs
6. Impala Architecture
• Two binaries: impalad and statestored
• Impala daemon (impalad)
  • one Impala daemon on each node with data
  • handles external client requests and all internal requests related to query execution
• State store daemon (statestored)
  • provides name service and metadata distribution
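The division of labor between the two daemons can be pictured with a toy model. This is a minimal sketch and not Impala's implementation: the classes `ToyImpalad` and `ToyStateStore` and their methods (`register`, `members`, `publish_metadata`, `on_metadata`) are hypothetical names invented for illustration. It only shows the two roles the slide names for statestored: a name service (which daemons are in the cluster) and metadata distribution (pushing table definitions to every registered daemon).

```python
class ToyImpalad:
    """Hypothetical stand-in for an impalad; holds a local metadata copy."""

    def __init__(self, host):
        self.host = host
        self.metadata = {}  # table name -> list of column names

    def on_metadata(self, table, columns):
        # Called by the state store when a table definition is published.
        self.metadata[table] = list(columns)


class ToyStateStore:
    """Hypothetical stand-in for statestored: membership + metadata fan-out."""

    def __init__(self):
        self._daemons = {}  # host -> ToyImpalad
        self._tables = {}   # table name -> list of column names

    def register(self, daemon):
        # Name service: record the daemon and bring it up to date.
        self._daemons[daemon.host] = daemon
        for table, columns in self._tables.items():
            daemon.on_metadata(table, columns)

    def members(self):
        # Sorted so callers see a stable order.
        return sorted(self._daemons)

    def publish_metadata(self, table, columns):
        # Metadata distribution: push the definition to every member.
        self._tables[table] = list(columns)
        for daemon in self._daemons.values():
            daemon.on_metadata(table, columns)


store = ToyStateStore()
node_a, node_b = ToyImpalad("node-a"), ToyImpalad("node-b")
store.register(node_a)
store.publish_metadata("sales", ["state", "revenue"])
store.register(node_b)  # a late joiner still receives existing metadata
```

In this toy, every daemon ends up with the same table definitions regardless of when it joined, which is the property a planner on any node would rely on.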
7. Impala Architecture: Query Execution Phases
• Request arrives via ODBC/JDBC/Beeswax/Shell
• Planner turns request into collections of plan fragments
• Coordinator initiates execution on impalads local to data
• During execution:
  • intermediate results are streamed between executors
  • query results are streamed back to client
  • subject to limitations imposed by blocking operators (top-n, aggregation)
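The streaming behavior described above, and the way blocking operators limit it, can be sketched with Python generators. This is an illustrative model only: the function names (`scan`, `filter_rows`, `top_n`) and the sample rows are assumptions, not Impala internals. A streaming operator yields each row as soon as it is ready, while a blocking operator such as top-n must consume all of its input before it can emit anything.

```python
import heapq

def scan(rows, log):
    # Streaming source: hands rows downstream one at a time.
    for row in rows:
        log.append(("scan", row["state"]))
        yield row

def filter_rows(rows, min_revenue):
    # Streaming operator: each row passes through (or not) immediately.
    for row in rows:
        if row["revenue"] >= min_revenue:
            yield row

def top_n(rows, n, log):
    # Blocking operator: must see every input row before the first output.
    best = heapq.nlargest(n, rows, key=lambda r: r["revenue"])
    log.append(("top_n ready", len(best)))
    for row in best:
        yield row

data = [
    {"state": "CA", "revenue": 30},
    {"state": "NY", "revenue": 50},
    {"state": "TX", "revenue": 10},
    {"state": "WA", "revenue": 40},
]

# Streaming pipeline: the first result is available after one scanned row.
log = []
first = next(filter_rows(scan(data, log), min_revenue=20))
streamed_scans = len(log)

# Pipeline with a blocking top-n: all rows are scanned before any output.
log = []
top = list(top_n(scan(data, log), n=2, log=log))
```

After the first pipeline, `streamed_scans` shows that only one row had to be scanned to produce a result; after the second, `log` shows all four scans completing before the top-n operator reports that it is ready.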
8. Impala Architecture: Planner
• Example: query with join and aggregation
    SELECT state, SUM(revenue)
    FROM HdfsTbl h JOIN HbaseTbl b ON (...)
    GROUP BY 1 ORDER BY 2 DESC LIMIT 10
[Figure: query plan diagram for the example. Operators shown: TopN, Agg, Hash Join, Exch, Hdfs Scan, Hbase Scan. Plan fragments are labeled "at coordinator", "at DataNodes", and "at region servers".]
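The shape of such a plan can be mimicked in plain Python to show why aggregation can appear both on the data nodes and at the coordinator. The sketch below is an illustration under stated assumptions, not Impala code: the sample rows, the join key `cust_id`, the choice of which table holds which column, and the helper names are all invented (the slide elides the join condition as `(...)`). Each simulated node joins its local rows against a copy of the smaller table, pre-aggregates, and hands its partial sums to a coordinator, which merges them and applies the top-n.

```python
from collections import defaultdict
import heapq

# Hypothetical data: the smaller table (cust_id -> state) is sent to every node.
hbase_rows = [(1, "CA"), (2, "NY"), (3, "CA")]

# Hypothetical per-node partitions of the larger table: (cust_id, revenue).
hdfs_partitions = [
    [(1, 10.0), (2, 5.0)],
    [(3, 7.0), (2, 20.0), (1, 1.0)],
]

def node_fragment(local_rows, broadcast_rows):
    """Runs on one node: hash join, then a partial SUM per state."""
    build = dict(broadcast_rows)          # hash table on the join key
    partial = defaultdict(float)
    for cust_id, revenue in local_rows:   # probe side
        state = build.get(cust_id)
        if state is not None:             # inner join: drop non-matches
            partial[state] += revenue
    return dict(partial)

def coordinator_fragment(partials, limit):
    """Runs once: merge partial sums, then ORDER BY sum DESC LIMIT n."""
    merged = defaultdict(float)
    for partial in partials:
        for state, subtotal in partial.items():
            merged[state] += subtotal
    return heapq.nlargest(limit, merged.items(), key=lambda kv: kv[1])

partials = [node_fragment(p, hbase_rows) for p in hdfs_partitions]
result = coordinator_fragment(partials, limit=10)
```

Pre-aggregating on each node means only one row per state crosses the network from each node, rather than every joined row.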
9. Impala Architecture: Query Execution
• Request arrives via ODBC/JDBC/Beeswax/Shell
[Figure: cluster diagram. A SQL App sends a SQL request over ODBC. Components shown: Hive Metastore, HDFS NN, Statestore, and three nodes, each with a Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase.]
10. Impala Architecture: Query Execution
• Planner turns request into collections of plan fragments
• Coordinator initiates execution on impalads local to data
[Figure: cluster diagram. Components shown: SQL App, ODBC, Hive Metastore, HDFS NN, Statestore, and three nodes, each with a Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase.]
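One way to picture "execution on impalads local to data" is as an assignment problem: each HDFS block has replicas on some hosts, and the coordinator prefers a host that holds a replica. The sketch below is a simplified assumption, not Impala's scheduler; the function `assign_scans`, the block and host names, and the least-loaded tie-break are all invented for illustration.

```python
def assign_scans(block_replicas, impalad_hosts):
    """Map each block to an impalad, preferring hosts that store a replica.

    block_replicas: dict of block id -> list of hosts holding that block.
    impalad_hosts:  hosts running an impalad.
    """
    load = {host: 0 for host in impalad_hosts}
    assignment = {}
    for block in sorted(block_replicas):
        local = [h for h in block_replicas[block] if h in load]
        # Fall back to any impalad (a remote read) if no replica is local.
        candidates = local or list(impalad_hosts)
        # Tie-break: least-loaded host, then host name for determinism.
        host = min(candidates, key=lambda h: (load[h], h))
        assignment[block] = host
        load[host] += 1
    return assignment

replicas = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-a", "node-c"],
    "blk_3": ["node-b", "node-c"],
    "blk_4": ["node-z"],  # no impalad runs on this host
}
plan = assign_scans(replicas, ["node-a", "node-b", "node-c"])
```

In this example the first three blocks are each read by a host that stores them, and only `blk_4` requires a remote read.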
11. Impala Architecture: Query Execution
• Intermediate results are streamed between impalads
• Query results are streamed back to client
[Figure: cluster diagram. Query results return to the SQL App over ODBC. Components shown: Hive Metastore, HDFS NN, Statestore, and three nodes, each with a Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase.]
12. Impala and Hive
• Shared with Hive:
  • Metadata (table definitions)
  • ODBC/JDBC drivers
  • Hue Beeswax
  • SQL syntax (HiveQL)
  • Flexible file formats
  • Machine pool
• Improvements:
  • Purpose-built query engine direct on HDFS and HBase
  • No JVM startup and no MapReduce
  • In-memory data transfers
  • Native distributed relational query engine
13. What about an EDW/RDBMS?
• "Right tool for the right job"
• EDW/RDBMS great for:
  • OLTP's complex transactions
  • Highly planned and optimized known workloads
  • Operational reports and drill-down into repeated known queries
• Impala's great for:
  • Exploratory analytics with new, previously unknown queries
  • Queries on big and growing data sets
• EDW/RDBMS can't:
  • Dump in raw data, then later define schema and query what you want
  • Evolve schemas without an expensive schema upgrade planning process
  • Scale simply by adding nodes
  • Store at < $1k/TB instead of $10-150k/TB
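To make the storage-cost comparison concrete, the per-terabyte figures from the slide can be applied to a data set size. The 100 TB size below is an arbitrary assumption for illustration; the $1k and $10-150k per TB figures are the slide's, and the Hadoop number is an upper bound because the slide says "less than".

```python
DATA_TB = 100  # hypothetical data set size

HADOOP_MAX_PER_TB = 1_000              # slide: "< $1k/TB"
EDW_PER_TB_RANGE = (10_000, 150_000)   # slide: "$10-150k/TB"

hadoop_max = DATA_TB * HADOOP_MAX_PER_TB
edw_low, edw_high = (DATA_TB * price for price in EDW_PER_TB_RANGE)

print(f"Hadoop:    under ${hadoop_max:,}")
print(f"EDW/RDBMS: ${edw_low:,} to ${edw_high:,}")
print(f"Ratio:     more than {edw_low // hadoop_max}x to {edw_high // hadoop_max}x")
```

Because the ratio depends only on the per-terabyte prices, the same 10x to 150x gap holds for any data set size under these figures.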
14. Alternative Hadoop Query Approaches
• MapReduce
  • High-latency MR
  • Separate nodes for SQL/MR
  • Duplicate metadata, security, SQL, MR, etc.
• Remote Query
  • Network bottleneck
  • Separate nodes for SQL/MR
  • Duplicate metadata, security, SQL, MR, etc.
• Side Storage
  • Query subset of data
  • RDBMS rigid schema
  • Duplicate storage, metadata, security, SQL, etc.
[Figure: diagrams of the three approaches. Labels shown: Hive, Query Engine, MR, Query Node, HDFS, NN, DN.]
15. Comparing Impala to Dremel
• What is Dremel:
  • columnar storage for data with nested structures
  • distributed scalable aggregation on top of that
• Columnar storage in Hadoop: joint project between Cloudera and Twitter
  • new columnar format, derived from Doug Cutting's Trevni
  • stores data in appropriate native/binary types
  • can also store nested structures similar to Dremel's ColumnIO
• Distributed aggregation: Impala
• Impala plus columnar format: a superset of the published version of Dremel (which didn't support joins and multiple file formats)
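The appeal of columnar storage for aggregation can be shown with a small in-memory analogy. This is only a sketch with invented sample records; it does not reflect the on-disk layout of Trevni, ColumnIO, or the new format. In a row layout, summing one field walks every record in full; in a column layout, the same sum reads a single sequence of natively typed values.

```python
from array import array

# Row-oriented: one record per row, all fields together.
rows = [
    ("CA", 2012, 30.5),
    ("NY", 2012, 50.0),
    ("CA", 2013, 19.5),
]

# Column-oriented: one sequence per field. Numeric columns use typed arrays,
# echoing "stores data in appropriate native/binary types".
columns = {
    "state": [r[0] for r in rows],
    "year": array("i", (r[1] for r in rows)),
    "revenue": array("d", (r[2] for r in rows)),
}

# SUM(revenue) over rows: every record is visited as a whole.
row_total = sum(r[2] for r in rows)

# SUM(revenue) over columns: only the revenue column is read.
col_total = sum(columns["revenue"])

values_stored_per_scan_row = len(rows) * len(rows[0])  # all fields of all rows
values_read_columnar = len(columns["revenue"])         # one field of all rows
```

Both layouts give the same total, but the columnar scan reads a third of the values here, and the gap widens as tables gain more columns.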
16. Impala Roadmap
• GA (target April 2013)
  • All CDH4 OSes: RHEL/CentOS, Ubuntu, Debian, SLES
  • JDBC driver
  • More formats: Avro, LZO-compressed
  • Columnar format
  • MR/Impala resource isolation
  • Perf (joins, aggregations, SQL features)
  • Automatic metadata distribution
• Post-GA top requests:
  • UDFs
  • Memory caching
  • Nested data
  • Window functions