11. Overview
impalad daemon runs on HDFS nodes
Queries run on "relevant" nodes
Supports common HDFS file formats
statestored (for cluster metadata) & Hive metastore (for database metadata)
12. Overview (cont'd)
Does not use Map/Reduce
Not fault tolerant!
(a query fails if any part of it fails on any node)
Submit queries via Hue/Beeswax, Thrift API, CLI, ODBC, JDBC
16. 9 queries, run in CDH Quickstart VM
CDH 4.2, Impala 1.0
Pseudo-cluster + Impala daemons
No other load on system during queries
Hardware:
MacBook Pro Retina, mid 2012
Intel i7 2.6 GHz quad-core processor
16 GB RAM, 4 GB for VM (VMware 5)
17. Benchmarks (cont'd)
Impala vs. Hive performance
"TPC-DS" sample dataset (http://www.tpc.org/tpcds/)
Queries range from simple projections to multiple joins, aggregation, multiple predicates, and ORDER BY
24. Query "G"
select
count(c.c_customer_sk)
from customer c
join customer_address ca
on c.c_current_addr_sk = ca.ca_address_sk
join customer_demographics cd
on c.c_current_cdemo_sk = cd.cd_demo_sk
where
ca.ca_zip in ('20191', '20194') and
cd.cd_credit_rating in ('Unknown', 'High Risk');
26. Query "TPC-DS"
select
  i_item_id,
  s_state,
  avg(ss_quantity) agg1,
  avg(ss_list_price) agg2,
  avg(ss_coupon_amt) agg3,
  avg(ss_sales_price) agg4
from store_sales
join date_dim
  on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
join item
  on (store_sales.ss_item_sk = item.i_item_sk)
join customer_demographics
  on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk)
join store
  on (store_sales.ss_store_sk = store.s_store_sk)
where
  cd_gender = 'M' and
  cd_marital_status = 'S' and
  cd_education_status = 'College' and
  d_year = 2002 and
  s_state in ('TN','SD', 'SD', 'SD', 'SD', 'SD')
group by
  i_item_id,
  s_state
order by
  i_item_id,
  s_state
limit 100;
27. Query    Hive (sec)   # M/R jobs   Impala (sec)   x Hive perf.
    A        13.8         1            0.25           54
    B        30.0         1            0.41           73
    C        33.3         1            0.42           79
    D        23.2         1            0.64           36
    E        21.6         1            0.62           35
    F        59.1         2            1.96           30
    G        78.5         3            1.56           50
    H        59.6         2            1.89           32
    TPC-DS   204.5        6            3.23           63
(remember, unscientific...)
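The "x Hive perf." column is the Hive time divided by the Impala time. A minimal Python sketch that recomputes it from the rounded timings in the table above; the `timings` dict and `speedup` helper are illustrative names, not part of the benchmark. Recomputed values can differ by about one from the slide's column, presumably because the slide used unrounded timings.

```python
# Hive and Impala wall-clock times in seconds, copied from the table above.
timings = {
    "A": (13.8, 0.25),
    "B": (30.0, 0.41),
    "C": (33.3, 0.42),
    "D": (23.2, 0.64),
    "E": (21.6, 0.62),
    "F": (59.1, 1.96),
    "G": (78.5, 1.56),
    "H": (59.6, 1.89),
    "TPC-DS": (204.5, 3.23),
}


def speedup(hive_sec, impala_sec):
    """Return how many times faster Impala was than Hive."""
    return hive_sec / impala_sec


# Speedup factor per query, keyed by query label.
speedups = {query: speedup(h, i) for query, (h, i) in timings.items()}

for query, factor in speedups.items():
    print(f"{query:<7}{factor:6.1f}x")
```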
33. Queries performed in-memory
Intermediate data never hits disk!
Data streamed to clients
Execution engine:
C++
runtime code generation
intrinsics for optimization
41. Current Limitations
(as of version 1.0.1)
No join order optimization
No custom file formats, SerDes or UDFs
LIMIT required when using ORDER BY
Joins limited by aggregate memory of cluster
("put larger table on left")
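One way to read the "put larger table on left" advice: a hash join builds an in-memory hash table from the right-hand input and streams the left-hand input past it, so only the right-hand table has to fit in memory. The Python sketch below illustrates that idea under those assumptions. It is not Impala's implementation, and `hash_join` and the sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, left_key, right_key):
    """Join two iterables of dicts where left_key equals right_key.

    The right input is loaded entirely into an in-memory hash table
    (build side); the left input is streamed one row at a time (probe side).
    """
    build = defaultdict(list)
    for row in right_rows:
        build[row[right_key]].append(row)

    for row in left_rows:
        # .get() avoids inserting empty buckets for non-matching keys.
        for match in build.get(row[left_key], []):
            yield {**row, **match}


# Hypothetical sample rows using TPC-DS-style column names.
store_sales = [
    {"ss_item_sk": 1, "ss_quantity": 5},
    {"ss_item_sk": 2, "ss_quantity": 3},
    {"ss_item_sk": 1, "ss_quantity": 7},
]
item = [{"i_item_sk": 1, "i_item_id": "AAAA"}]

# Larger table on the left (streamed), smaller table on the right (in memory).
joined = list(hash_join(store_sales, item, "ss_item_sk", "i_item_sk"))
print(len(joined))
```

Memory use here grows with the right-hand input (`item`), while the larger `store_sales` input is only iterated once.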
42. Current Limitations
(as of version 1.0.1)
No advanced data structures
(arrays, maps, json, etc.)
Only basic DDL (otherwise do in Hive)
Limited file formats and compression
(though probably fine for most people)
45. Comparing Impala to Dremel
"Dremel is a scalable, interactive ad-hoc
query system for analysis of read-only
nested data. By combining multi-level
execution trees and columnar data layout, it
is capable of running aggregation queries
over trillion-row tables in seconds. The
system scales to thousands of CPUs and
petabytes of data, and has thousands of
users at Google."
- http://research.google.com/pubs/pub36632.html
46. Comparing Impala to Dremel
Impala = Dremel features circa 2010 + join
support, assuming columnar data format
(but, Google doesn't stand still...)
Dremel is mature and in production
Basis for Google's BigQuery
47. Comparing Impala to Hive
Hive uses Map/Reduce -> high latency
Impala is an in-memory, low-latency query engine
Impala sacrifices fault tolerance for performance
49. Comparing Impala to Drill
"Apache Drill is an open-source software framework that supports
data-intensive distributed applications for interactive analysis of
large-scale datasets. Drill is the open source version of Google's Dremel
system which is available as an IaaS service called Google BigQuery. One
explicitly stated design goal is that Drill is able to scale to 10,000 servers
or more and to be able to process petabytes of data and trillions of
records in seconds. Currently, Drill is incubating at Apache."
- http://incubator.apache.org/drill/drill_overview.html
50. Comparing Impala to Stinger
"The Stinger Initiative is a collection of
development threads in the Hive community
that will deliver 100X performance
improvements as well as SQL compatibility."
- http://hortonworks.com/stinger/
51. Comparing Impala to Stinger
Stinger
Improve Hive performance (e.g. optimize execution plan)
Support for analytics (e.g. OVER clause, window functions)
Tez framework to optimize execution
Columnar file format
http://hortonworks.com/stinger/
52. Stinger Phase 1 performance...
(Stinger phase 1 is really just Hive 0.11)
54. Same 9 queries (as w/ Impala), run in HortonWorks Sandbox VM
HortonWorks Data Platform (HDP) 1.3
Running pseudo-cluster
No other load on system during queries
Hardware (same as w/ Impala):
MacBook Pro Retina, mid 2012
Intel i7 2.6 GHz quad-core processor
16 GB RAM, 4 GB for VM (VMware 5)
55. Query    Hive (sec)   # M/R jobs   Stinger Phase 1 (sec)   # M/R jobs   x Hive perf.
    A        13.8         1            10.0                    1            1.4
    B        30.0         1            15.8                    1            1.9
    C        33.3         1            14.1                    1            2.4
    D        23.2         1            18.7                    1            1.2
    E        21.6         1            19.7                    1            1.1
    F        59.1         2            34.3                    1            1.7
    G        78.5         3            35.2                    1            2.2
    H        59.6         2            31.5                    1            1.9
    TPC-DS   204.5        6            37.2                    1            5.5
(remember, unscientific...)
56. Query    Stinger Phase 1 (sec)   Impala (sec)   x Stinger perf.
    A        10.0                    0.25           39
    B        15.8                    0.41           38
    C        14.1                    0.42           33
    D        18.7                    0.64           29
    E        19.7                    0.62           32
    F        34.3                    1.96           18
    G        35.2                    1.56           23
    H        31.5                    1.89           17
    TPC-DS   37.2                    3.23           12
(remember, unscientific...)
57. Impala Review
In-memory, distributed SQL query engine
Integrates into existing HDFS
Not Map/Reduce
Focus on performance (native code)
Interactive data analysis
Competition...
58. References
Google Dremel - http://research.google.com/pubs/pub36632.html
Apache Drill - http://incubator.apache.org/drill/
TPC-DS dataset - http://www.tpc.org/tpcds/
Stinger Initiative - http://hortonworks.com/blog/100x-faster-hive/
Stinger Initiative - http://hortonworks.com/stinger/
Cloudera Impala resources - http://www.cloudera.com/content/support/en/documentation/cloudera-impala/cloudera-impala-documentation-v1-latest.html
Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real - http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
59. Photo Attributions
Impala - http://www.flickr.com/photos/gerardstolk/5897570970/
Measuring tape - http://www.morguefile.com/archive/display/24850
Bridge frame - http://www.morguefile.com/archive/display/9699
Balance - http://www.morguefile.com/archive/display/93433
* All others are iStockPhoto (I paid for them...)