SlideShare a Scribd company logo
1 of 58
1
Impala:
Modern, Open-Source SQL Engine
For Hadoop
Gwen Shapira
@gwenshap
gshapira@cloudera.com
Agenda
• Why Hadoop?
• Data Processing in Hadoop
• User’s view of Impala
• Impala Use Cases
• Impala Architecture
• Performance highlights
2
3
In the beginning….
was the database
For a while,
the database was all we needed.
4
Data is not what it used to be
5
DataGrowth
STRUCTURED DATA – 20%
1980 2012
UNSTRUCTUREDDATA–80%
Hadoop was Invented to Solve:
• Large volumes of data
• Data that is only valuable in bulk
• High ingestion rates
• Data that requires more processing
• Differently structured data
• Evolving data
• High license costs
6
What is Apache Hadoop?
7
Has the Flexibility to Store and
Mine Any Type of Data
 Ask questions across structured and
unstructured data that were previously
impossible to ask or solve
 Not bound by a single schema
Excels at
Processing Complex Data
 Scale-out architecture divides workloads
across multiple nodes
 Flexible file system eliminates ETL
bottlenecks
Scales
Economically
 Can be deployed on commodity
hardware
 Open source platform guards against
vendor lock
Hadoop Distributed
File System (HDFS)
Self-Healing, High
Bandwidth Clustered
Storage
MapReduce
Distributed Computing
Framework
Apache Hadoop is an open source
platform for data storage and processing
that is…
 Distributed
 Fault tolerant
 Scalable
CORE HADOOP SYSTEM COMPONENTS
8
Processing Data in Hadoop
Map Reduce
• Versatile
• Flexible
• Scalable
• High latency
• Batch oriented
• Java
• Challenging paradigm
9
Hive & Pig
• Hive – Turn SQL into MapReduce
• Pig – Turn execution plans into MapReduce
• Makes MapReduce easier
• But not any faster
10
Towards a Better Map Reduce
• Spark – Next generation MapReduce
With in-memory caching
Lazy Evaluation
Fast recovery times from node failures
• Tez – Next generation MapReduce.
Reduced overhead, more flexibility.
Currently Alpha
11
12
And now to something completely different!
What is Impala?
13
Impala Overview
14
Interactive SQL for Hadoop
 Responses in seconds
 Nearly ANSI-92 standard SQL with Hive SQL
Native MPP Query Engine
 Purpose-built for low-latency queries
 Separate runtime from MapReduce
 Designed as part of the Hadoop ecosystem
Open Source
 Apache-licensed
Impala Overview
Runs directly within Hadoop
 reads widely used Hadoop file formats
 talks to widely used Hadoop storage managers
 runs on same nodes that run Hadoop processes
High performance
 C++ instead of Java
 runtime code generation
 completely new execution engine – No MapReduce
 Beta version released since October 2012
 General availability (v1.0) release out since April 2013
 Latest release (v1.2.3) released on December 23rd
Impala is Production Ready
User View of Impala: Overview
• Distributed service in cluster:
one Impala daemon on each node with data
• Highly available: no single point of failure
• Submit query to any daemon:
• ODBC/JDBC
• Impala CLI
• Hue
• Query is distributed to all nodes with relevant data
• Impala uses Hive’s metadata
User View of Impala: File Formats
• There is no ‘Impala format’.
• Impala supports:
• Uncompressed/lzo-compressed text files
• Sequence files and RCFile with snappy/gzip
compression
• Avro data files
• Parquet columnar format (more on that later)
• HBase
User View of Impala: SQL Support
• Most of SQL-92
• INSERT INTO … SELECT …
• Only equi-joins; no non-equi joins, no cross products
• Order By requires Limit (for now)
• DDL support
• SQL-style authorization via Apache Sentry (incubating)
• UDFs and UDAFs are supported
20
Use Cases
Impala Use Cases
21
Interactive BI/analytics on more data
Asking new questions – exploration, ML
Data processing with tight SLAs
Query-able archive w/full fidelity
Cost-effective, ad hoc query environment that
offloads the data warehouse for:
Global Financial Services Company
22
Saved 90% on incremental EDW spend &
improved performance by 5x
Offload data warehouse for query-able archive
Store decades of data cost-effectively
Process & analyze on the same system
Improved capabilities through interactive query
on more data
Digital Media Company
24
20x performance improvement for exploration
& data discovery
Easily identify new data sets for modeling
Interact with raw data directly to test
hypotheses
Avoid expensive DW schema changes
Accelerate ‘time to answer’
25
Impala Architecture
Impala Architecture
• Impala daemon (impalad) – N instances
• Query execution
• State store daemon (statestored) – 1 instance
• Provides name service and metadata distribution
• Catalog daemon (catalogd) – 1 instance
• Relays metadata changes to all impalad’s
Impala Query Execution
27
Query Planner
Query Coordinator
Query Executor
HDFS DN HBase
SQL App
ODBC
Hive
Metastore
HDFS NN Statestore
Query Planner
Query Coordinator
Query Executor
HDFS DN HBase
Query Planner
Query Coordinator
Query Executor
HDFS DN HBase
SQL request
1) Request arrives via ODBC/JDBC/HUE/Shell
Impala Query Execution
28
Query Planner
Query Coordinator
Query Executor
HDFS DN HBase
SQL App
ODBC
Hive
Metastore
HDFS NN Statestore
Query Planner
Query Coordinator
Query Executor
HDFS DN HBase
Query Planner
Query Coordinator
Query Executor
HDFS DN HBase
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
Impala Query Execution
29
Query Planner
Query Coordinator
Query Executor
HDFS DN HBase
SQL App
ODBC
Hive
Metastore
HDFS NN Statestore
Query Planner
Query Coordinator
Query Executor
HDFS DN HBase
Query Planner
Query Coordinator
Query Executor
HDFS DN HBase
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
Query results
Query Planner
2-phase planning
 Left deep tree
 Partition plan to maximize data locality
Join order
 Before 1.2.3: Order of tables in query.
 1.2.3 and above: Cost based if statistics exist
Plan Operators
 Scan, HashJoin, HashAggregation, Union, TopN, Exchange
 All operators are fully distributed
30
31
Query Execution Example
Simple Example
SELECT state, SUM(revenue)
FROM HdfsTbl h
JOIN HbaseTbl b ON (id)
GROUP BY state
ORDER BY 2 desc LIMIT 10
How does a database execute a query?
• Left Deep Tree
• Data flows from bottom
to top
TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
Wait – Why is this a left-deep tree?
HashJoin
Scan: t1
Scan: t3
Scan: t2
HashJoin
Agg
HashJoin
Scan: t0
How does a database execute a query?
• Hash Join Node fills the
hash table with the RHS
table data.
• So, the RHS table (Hbase
scan) is scanned first.
TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
Scan
Hbase
first
How does a database execute a query?
• Hash Join Node fills the
hash table with the RHS
table data. TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
How does a database execute a query?
• Hash Join Node fills the
hash table with the RHS
table data. TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
How does a database execute a query?
• Hash Join Node fills the
hash table with the RHS
table data. TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
How does a database execute a query?
• Start scanning LHS (Hdfs)
table
• For each row from LHS,
probe the hash table for
matching rows
TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
Probe hash
table and a
matching
row is
found.
How does a database execute a query?
• Matched rows are
bubbled up the
execution tree TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
How does a database execute a query?
• Continue scanning the
LHS (Hdfs) table
• For each row from
LHS, probe the hash
table for matching rows
• Unmatched rows are
discarded
TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
No
matching
row
How does a database execute a query?
• Continue scanning the
LHS (Hdfs) table
• For each row from
LHS, probe the hash
table for matching rows
• Unmatched rows are
discarded
TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
How does a database execute a query?
• Continue scanning the
LHS (Hdfs) table
• For each row from LHS,
probe the hash table for
matching rows
• Unmatched rows are
discarded
TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
Probe hash
table and a
matching
row is
found.
How does a database execute a query?
• Matched rows are
bubbled up the
execution tree TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
How does a database execute a query?
• Continue scanning the
LHS (Hdfs) table
• For each row from LHS,
probe the hash table for
matching rows
• Unmatched rows are
discarded
TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
No
matching
row
How does a database execute a query?
• All rows have been
returned from the hash
join node. Agg node can
start returning rows
• Rows are bubbled up the
execution tree
TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
How does a database execute a query?
• Rows from the
aggregation node
bubbles up to the top-n
node
TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
How does a database execute a query?
• Rows from the
aggregation node
bubbles up to the top-n
node
• When all rows are
returned by the agg
node, top-n node can
restart return rows to
the end-user
TopN
Agg
Hash
Join
Hdfs
Scan
Hbase
Scan
Key takeaways
 Data flows from bottom to top in the execution tree
and finally goes to the end user
 Larger tables go on the left
 Collect statistics
 Filter early
49
Simpler Example
SELECT state, SUM(revenue)
FROM HdfsTbl h
JOIN HbaseTbl b ON (id)
GROUP BY state
How does an MPP database execute a query?
Tbl b
Scan
Hash
Join
Tbl a
Scan
Exch
Agg
Exch
Agg
Agg
Hash
Join
Tbl a
Scan
Tbl b
Scan
Broadcast
Re-distribute by
“state”
How does a MPP database execute a query
A join B
A join B
A join B
Local
Agg
Local
Agg
Local
Agg
Scan and
Broadcast
Tbl B
Final
Agg
Final
Agg
Final
Agg
Re-distribute by
“state”
Local read
Tbl A
53
Performance
Impala Performance Results
• Impala’s Latest Milestone:
• Comparable commercial MPP DBMS speed
• Natively on Hadoop
• Three Result Sets:
• Impala vs Hive 0.12 (Impala 6-70x faster)
• Impala vs “DBMS-Y” (Impala average of 2x faster)
• Impala scalability (Impala achieves linear scale)
• Background
• 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported
language)
• Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
• Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks)
• Methodical testing (multiple runs, reviewed fairness for competition, etc)
• Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
54
Impala vs Hive 0.12 (Lower bars are better)
55
Impala vs “DBMS-Y” (Lower bars are
better)
56
Impala Scalability: 2x the Hardware
(Expectation: Cut Response Times in Half)
57
Impala Scalability: 2x the Hardware and 2x Users/Data
(Expectation: Constant Response Times)
58
2x the Users, 2x the Hardware
2x the Data, 2x the Hardware
59

More Related Content

What's hot

Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Scott Leberknight
 
How Impala Works
How Impala WorksHow Impala Works
How Impala WorksYue Chen
 
Impala presentation
Impala presentationImpala presentation
Impala presentationtrihug
 
Cloudera Impala technical deep dive
Cloudera Impala technical deep diveCloudera Impala technical deep dive
Cloudera Impala technical deep divehuguk
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera, Inc.
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLliuknag
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Guy Harrison
 
Impala Performance Update
Impala Performance UpdateImpala Performance Update
Impala Performance UpdateCloudera, Inc.
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentationpunesparkmeetup
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureGwen (Chen) Shapira
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...Yahoo Developer Network
 

What's hot (20)

Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
 
How Impala Works
How Impala WorksHow Impala Works
How Impala Works
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
Cloudera Impala technical deep dive
Cloudera Impala technical deep diveCloudera Impala technical deep dive
Cloudera Impala technical deep dive
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Cloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQLCloudera Impala + PostgreSQL
Cloudera Impala + PostgreSQL
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
Impala Performance Update
Impala Performance UpdateImpala Performance Update
Impala Performance Update
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 
Apache HAWQ Architecture
Apache HAWQ ArchitectureApache HAWQ Architecture
Apache HAWQ Architecture
 
Jags Ramnarayan's presentation
Jags Ramnarayan's presentationJags Ramnarayan's presentation
Jags Ramnarayan's presentation
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding Failure
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
 

Viewers also liked

Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Cloudera, Inc.
 
Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)Cloudera, Inc.
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
SQL-on-Hadoop with Apache Tajo, and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo,  and application case of SK TelecomSQL-on-Hadoop with Apache Tajo,  and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo, and application case of SK TelecomGruter
 
Impala use case @ Zoosk
Impala use case @ ZooskImpala use case @ Zoosk
Impala use case @ ZooskCloudera, Inc.
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaJason Shih
 
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio AlcacerApache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio AlcacerStratio
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala InternalsDavid Groozman
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopCloudera, Inc.
 
지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기Ted Won
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 

Viewers also liked (16)

Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
ImpalaToGo use case
ImpalaToGo use caseImpalaToGo use case
ImpalaToGo use case
 
SQL-on-Hadoop with Apache Tajo, and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo,  and application case of SK TelecomSQL-on-Hadoop with Apache Tajo,  and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo, and application case of SK Telecom
 
Impala use case @ Zoosk
Impala use case @ ZooskImpala use case @ Zoosk
Impala use case @ Zoosk
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using Impala
 
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio AlcacerApache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
Apache Spark & Cassandra use case at Telefónica Cbs by Antonio Alcacer
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
 
지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 

Similar to Incredible Impala

impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfssusere05ec21
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Remy Rosenbaum
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series DatabasePramit Choudhary
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
Architectural Evolution Starting from Hadoop
Architectural Evolution Starting from HadoopArchitectural Evolution Starting from Hadoop
Architectural Evolution Starting from HadoopSpagoWorld
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 

Similar to Incredible Impala (20)

Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Apache hive
Apache hiveApache hive
Apache hive
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 
Apache drill
Apache drillApache drill
Apache drill
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Architectural Evolution Starting from Hadoop
Architectural Evolution Starting from HadoopArchitectural Evolution Starting from Hadoop
Architectural Evolution Starting from Hadoop
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 

More from Gwen (Chen) Shapira

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep DiveGwen (Chen) Shapira
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote Gwen (Chen) Shapira
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGwen (Chen) Shapira
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Gwen (Chen) Shapira
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebookGwen (Chen) Shapira
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Gwen (Chen) Shapira
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupGwen (Chen) Shapira
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings MeetupGwen (Chen) Shapira
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereGwen (Chen) Shapira
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersGwen (Chen) Shapira
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupGwen (Chen) Shapira
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupGwen (Chen) Shapira
 

More from Gwen (Chen) Shapira (20)

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep Dive
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service mesh
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebook
 
Kafka reliability velocity 17
Kafka reliability   velocity 17Kafka reliability   velocity 17
Kafka reliability velocity 17
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data Meetup
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn Meetup
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 

Recently uploaded

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 

Recently uploaded (20)

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 

Incredible Impala

  • 1. 1 Impala: Modern, Open-Source SQL Engine For Hadoop Gwen Shapira @gwenshap gshapira@cloudera.com
  • 2. Agenda • Why Hadoop? • Data Processing in Hadoop • User’s view of Impala • Impala Use Cases • Impala Architecture • Performance highlights 2
  • 4. For a while, the database was all we needed. 4
  • 5. Data is not what it used to be 5 DataGrowth STRUCTURED DATA – 20% 1980 2012 UNSTRUCTUREDDATA–80%
  • 6. Hadoop was Invented to Solve: • Large volumes of data • Data that is only valuable in bulk • High ingestion rates • Data that requires more processing • Differently structured data • Evolving data • High license costs 6
  • 7. What is Apache Hadoop? 7 Has the Flexibility to Store and Mine Any Type of Data  Ask questions across structured and unstructured data that were previously impossible to ask or solve  Not bound by a single schema Excels at Processing Complex Data  Scale-out architecture divides workloads across multiple nodes  Flexible file system eliminates ETL bottlenecks Scales Economically  Can be deployed on commodity hardware  Open source platform guards against vendor lock Hadoop Distributed File System (HDFS) Self-Healing, High Bandwidth Clustered Storage MapReduce Distributed Computing Framework Apache Hadoop is an open source platform for data storage and processing that is…  Distributed  Fault tolerant  Scalable CORE HADOOP SYSTEM COMPONENTS
  • 9. Map Reduce • Versatile • Flexible • Scalable • High latency • Batch oriented • Java • Challenging paradigm 9
  • 10. Hive & Pig • Hive – Turn SQL into MapReduce • Pig – Turn execution plans into MapReduce • Makes MapReduce easier • But not any faster 10
  • 11. Towards a Better Map Reduce • Spark – Next generation MapReduce With in-memory caching Lazy Evaluation Fast recovery times from node failures • Tez – Next generation MapReduce. Reduced overhead, more flexibility. Currently Alpha 11
  • 12. 12 And now to something completely different!
  • 14. Impala Overview 14 Interactive SQL for Hadoop  Responses in seconds  Nearly ANSI-92 standard SQL with Hive SQL Native MPP Query Engine  Purpose-built for low-latency queries  Separate runtime from MapReduce  Designed as part of the Hadoop ecosystem Open Source  Apache-licensed
  • 15. Impala Overview Runs directly within Hadoop  reads widely used Hadoop file formats  talks to widely used Hadoop storage managers  runs on same nodes that run Hadoop processes High performance  C++ instead of Java  runtime code generation  completely new execution engine – No MapReduce
  • 16.  Beta version released since October 2012  General availability (v1.0) release out since April 2013  Latest release (v1.2.3) released on December 23rd Impala is Production Ready
  • 17. User View of Impala: Overview • Distributed service in cluster: one Impala daemon on each node with data • Highly available: no single point of failure • Submit query to any daemon: • ODBC/JDBC • Impala CLI • Hue • Query is distributed to all nodes with relevant data • Impala uses Hive’s metadata
  • 18. User View of Impala: File Formats • There is no ‘Impala format’. • Impala supports: • Uncompressed/lzo-compressed text files • Sequence files and RCFile with snappy/gzip compression • Avro data files • Parquet columnar format (more on that later) • HBase
  • 19. User View of Impala: SQL Support • Most of SQL-92 • INSERT INTO … SELECT … • Only equi-joins; no non-equi joins, no cross products • Order By requires Limit (for now) • DDL support • SQL-style authorization via Apache Sentry (incubating) • UDFs and UDAFs are supported
  • 21. Impala Use Cases 21 Interactive BI/analytics on more data Asking new questions – exploration, ML Data processing with tight SLAs Query-able archive w/full fidelity Cost-effective, ad hoc query environment that offloads the data warehouse for:
  • 22. Global Financial Services Company 22 Saved 90% on incremental EDW spend & improved performance by 5x Offload data warehouse for query-able archive Store decades of data cost-effectively Process & analyze on the same system Improved capabilities through interactive query on more data
  • 23. Digital Media Company 24 20x performance improvement for exploration & data discovery Easily identify new data sets for modeling Interact with raw data directly to test hypotheses Avoid expensive DW schema changes Accelerate ‘time to answer’
  • 25. Impala Architecture • Impala daemon (impalad) – N instances • Query execution • State store daemon (statestored) – 1 instance • Provides name service and metadata distribution • Catalog daemon (catalogd) – 1 instance • Relays metadata changes to all impalad’s
  • 26. Impala Query Execution 27 Query Planner Query Coordinator Query Executor HDFS DN HBase SQL App ODBC Hive Metastore HDFS NN Statestore Query Planner Query Coordinator Query Executor HDFS DN HBase Query Planner Query Coordinator Query Executor HDFS DN HBase SQL request 1) Request arrives via ODBC/JDBC/HUE/Shell
  • 27. Impala Query Execution 28 Query Planner Query Coordinator Query Executor HDFS DN HBase SQL App ODBC Hive Metastore HDFS NN Statestore Query Planner Query Coordinator Query Executor HDFS DN HBase Query Planner Query Coordinator Query Executor HDFS DN HBase 2) Planner turns request into collections of plan fragments 3) Coordinator initiates execution on impalad(s) local to data
  • 28. Impala Query Execution 29 Query Planner Query Coordinator Query Executor HDFS DN HBase SQL App ODBC Hive Metastore HDFS NN Statestore Query Planner Query Coordinator Query Executor HDFS DN HBase Query Planner Query Coordinator Query Executor HDFS DN HBase 4) Intermediate results are streamed between impalad(s) 5) Query results are streamed back to client Query results
  • 29. Query Planner 2-phase planning  Left deep tree  Partition plan to maximize data locality Join order  Before 1.2.3: Order of tables in query.  1.2.3 and above: Cost based if statistics exist Plan Operators  Scan, HashJoin, HashAggregation, Union, TopN, Exchange  All operators are fully distributed 30
  • 31. Simple Example SELECT state, SUM(revenue) FROM HdfsTbl h JOIN HbaseTbl b ON (id) GROUP BY state ORDER BY 2 desc LIMIT 10
  • 32. How does a database execute a query? • Left Deep Tree • Data flows from bottom to top TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 33. Wait – Why is this a left-deep tree? HashJoin Scan: t1 Scan: t3 Scan: t2 HashJoin Agg HashJoin Scan: t0
  • 34. How does a database execute a query? • Hash Join Node fills the hash table with the RHS table data. • So, the RHS table (Hbase scan) is scanned first. TopN Agg Hash Join Hdfs Scan Hbase Scan Scan Hbase first
  • 35. How does a database execute a query? • Hash Join Node fills the hash table with the RHS table data. TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 36. How does a database execute a query? • Hash Join Node fills the hash table with the RHS table data. TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 37. How does a database execute a query? • Hash Join Node fills the hash table with the RHS table data. TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 38. How does a database execute a query? • Start scanning LHS (Hdfs) table • For each row from LHS, probe the hash table for matching rows TopN Agg Hash Join Hdfs Scan Hbase Scan Probe hash table and a matching row is found.
  • 39. How does a database execute a query? • Matched rows are bubbled up the execution tree TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 40. How does a database execute a query? • Continue scanning the LHS (Hdfs) table • For each row from LHS, probe the hash table for matching rows • Unmatched rows are discarded TopN Agg Hash Join Hdfs Scan Hbase Scan No matching row
  • 41. How does a database execute a query? • Continue scanning the LHS (Hdfs) table • For each row from LHS, probe the hash table for matching rows • Unmatched rows are discarded TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 42. How does a database execute a query? • Continue scanning the LHS (Hdfs) table • For each row from LHS, probe the hash table for matching rows • Unmatched rows are discarded TopN Agg Hash Join Hdfs Scan Hbase Scan Probe hash table and a matching row is found.
  • 43. How does a database execute a query? • Matched rows are bubbled up the execution tree TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 44. How does a database execute a query? • Continue scanning the LHS (Hdfs) table • For each row from LHS, probe the hash table for matching rows • Unmatched rows are discarded TopN Agg Hash Join Hdfs Scan Hbase Scan No matching row
  • 45. How does a database execute a query? • All rows have been returned from the hash join node. Agg node can start returning rows • Rows are bubbled up the execution tree TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 46. How does a database execute a query? • Rows from the aggregation node bubbles up to the top-n node TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 47. How does a database execute a query? • Rows from the aggregation node bubbles up to the top-n node • When all rows are returned by the agg node, top-n node can restart return rows to the end-user TopN Agg Hash Join Hdfs Scan Hbase Scan
  • 48. Key takeaways  Data flows from bottom to top in the execution tree and finally goes to the end user  Larger tables go on the left  Collect statistics  Filter early 49
  • 49. Simpler Example SELECT state, SUM(revenue) FROM HdfsTbl h JOIN HbaseTbl b ON (id) GROUP BY state
  • 50. How does an MPP database execute a query? Tbl b Scan Hash Join Tbl a Scan Exch Agg Exch Agg Agg Hash Join Tbl a Scan Tbl b Scan Broadcast Re-distribute by “state”
  • 51. How does a MPP database execute a query A join B A join B A join B Local Agg Local Agg Local Agg Scan and Broadcast Tbl B Final Agg Final Agg Final Agg Re-distribute by “state” Local read Tbl A
  • 53. Impala Performance Results • Impala’s Latest Milestone: • Comparable commercial MPP DBMS speed • Natively on Hadoop • Three Result Sets: • Impala vs Hive 0.12 (Impala 6-70x faster) • Impala vs “DBMS-Y” (Impala average of 2x faster) • Impala scalability (Impala achieves linear scale) • Background • 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language) • Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB) • Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks) • Methodical testing (multiple runs, reviewed fairness for competition, etc) • Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/ 54
  • 54. Impala vs Hive 0.12 (Lower bars are better) 55
  • 55. Impala vs “DBMS-Y” (Lower bars are better) 56
  • 56. Impala Scalability: 2x the Hardware (Expectation: Cut Response Times in Half) 57
  • 57. Impala Scalability: 2x the Hardware and 2x Users/Data (Expectation: Constant Response Times) 58 2x the Users, 2x the Hardware 2x the Data, 2x the Hardware
  • 58. 59

Editor's Notes

  1. Open source system implemented (mostly) In Java. Provides reliable and scalable storage and processing of extremely large volumes of data (TB to PB) on commodity hardware.Two primary components, HDFS and MapReduce. Based on software originally developed at Google.An important aspect of Hadoop in the context of this talk is that it allows for inexpensive storage (<10% of other solutions) of any type of data, and places no constraints on how that data is processed.Allows companies to begin storing data that was previously thrown away.Hadoop has value as a stand-alone analytics system, but many if not most organizations will be looking to integrate into existing data infrastructure.
  2. Take away – MapReduce is going away in favor of multi-framework Hadoop. Most of the replacements are improved MR. Impala is different.
  3. Interactive SQL for HadoopResponses in seconds vs. minutes or hours4-65x faster than Hive; up to 100x seenNearly ANSI-92 standard SQL with HiveQLCREATE, ALTER, SELECT, INSERT, JOIN, subqueries, etc.ODBC/JDBC drivers Compatible SQL interface for existing Hadoop/CDH applicationsNative MPP Query EnginePurpose-built for low latency queries – another application being brought to HadoopSeparate runtime from MapReduce which is designed for batch processingTightly integrated with Hadoop ecosystem – major design imperative and differentiator for ClouderaSingle system (no integration)Native, open file formats that are compatible across the ecosystem (no copying)Single metadata model (no synchronization)Single set of hardware and system resources (better performance, lower cost)Integrated, end-to-end security (no vulnerabilities)Open SourceKeeps with our strategy of an open platform – i.e. if it stores or processes data, it’s open sourceApache-licensedCode available on Github
  4. Interactive BI/Analytics on more dataRaw, full fidelity data – nothing lost through aggregation or ETL/LTNew sources & types – structured/unstructuredHistorical dataAsking new questionsExploration and data discovery for analytics and machine learning – need to find a data set for a model, which requires lots of simple queries to summarize, count, and validate.Hypothesis testing – avoid having to subset and fit the data to a warehouse just to ask a single questionData processing with tight SLAsCost-effective platformMinimize data movementReduce strain on data warehouseQuery-able storageReplace production data warehouse for DR/active archiveStore decades of data cost effectively (for better modeling or data retention mandates) without sacrificing the capability to analyze
  5. Nows, we’ve finished scanning the RHS table and have finished building the hash table. We can now start scanning the LHS table to do the join.
  6. If there’s a match, the joined row will bubble up the execution tree to the aggregation node.
  7. This row doesn’t match. So, it won’t bubble up.
  8. Now that all the rows have been returned from the hash join node, the aggregation node can start returning rows.
  9. B are scanned in parallel, and broadcast to all impalad. Each Impalad reads its local data block for A and do the join. This is broadcast join. After the join is done, we do the aggregation. But before we can produce the final result, we need to redistribute the result of “local agg” according to the group by expression “state” and do the final aggregate.
  10. We added a redundant condition in the WHERE clause to the query that doesn't change the query semantics or results returned. This is transparently mentioned in both our public blog post and published queries as the "explicit partition filter/predicate." Like window functions, this is done as a workaround for a feature limitation in both Impala and Hive to match what a user should to do optimize for these systems. Please also note that this change was done for all compared systems (Impala, Hive, and "DBMS-Y") to ensure an apples-to-apples comparison.