SlideShare una empresa de Scribd logo
1 de 45
Descargar para leer sin conexión
SQL Engines for
Hadoop – The
case for Impala
Budapest Data Forum, June 4th, 2015
tiny.cloudera.com/mark-sql-budapest
Mark Grover | @mark_grover
SQL Engines for
Hadoop – The
case for Impala
Budapest Data Forum, June 4th, 2015
tiny.cloudera.com/mark-sql-budapest
Mark Grover | @mark_grover
Grover Mark
SQL Engines for
Hadoop – The
case for Impala
Budapest Data Forum, June 4th, 2015
tiny.cloudera.com/mark-sql-budapest
Mark Grover | @mark_grover
Grover Mark
József
4
Agenda
5
Agenda
•  Word Count
©2015 Cloudera, Inc. All Rights Reserved.
6
Agenda
©2015 Cloudera, Inc. All Rights Reserved.
7
Past
©2015 Cloudera, Inc. All Rights Reserved.
8
Future
©2015 Cloudera, Inc. All Rights Reserved.
9
What is Apache Hadoop?
9
Has the Flexibility to Store
and Mine Any Type of Data
§  Ask questions across structured and
unstructured data that were previously
impossible to ask or solve
§  Not bound by a single schema
Excels at
Processing Complex Data
§  Scale-out architecture divides
workloads across multiple nodes
§  Flexible file system eliminates ETL
bottlenecks
Scales
Economically
§  Can be deployed on commodity
hardware
§  Open source platform guards
against vendor lock
Hadoop
Distributed File
System (HDFS)
Self-Healing, High
Bandwidth Clustered
Storage
MapReduce
Distributed Computing
Framework
Apache Hadoop is an open source
platform for data storage and
processing that is…
ü  Scalable
ü  Fault tolerant
ü  Distributed
CORE HADOOP SYSTEM COMPONENTS
©2015 Cloudera, Inc. All Rights Reserved.
10
What is Apache Hadoop?
Has the Flexibility to Store
and Mine Any Type of Data
§  Ask questions across structured and
unstructured data that were previously
impossible to ask or solve
§  Not bound by a single schema
Excels at
Processing Complex Data
§  Scale-out architecture divides
workloads across multiple nodes
§  Flexible file system eliminates ETL
bottlenecks
Scales
Economically
§  Can be deployed on commodity
hardware
§  Open source platform guards
against vendor lock
Hadoop
Distributed File
System (HDFS)
Self-Healing, High
Bandwidth Clustered
Storage
MapReduce
Distributed Computing
Framework
Apache Hadoop is an open source
platform for data storage and
processing that is…
ü  Scalable
ü  Fault tolerant
ü  Distributed
CORE HADOOP SYSTEM COMPONENTS
©2015 Cloudera, Inc. All Rights Reserved.
11
What? No schema?
•  How do you use SQL?
•  Do you lose Schema-on-read when using SQL-on-Hadoop?
©2015 Cloudera, Inc. All Rights Reserved.
12
SQL engines
13
Hive
©2015 Cloudera, Inc. All Rights Reserved.
•  First SQL engine
•  Converts SQL to MapReduce
•  New execution engine additions
–  Hive-on-Spark
–  Hive-on-Tez
14
Confusion #1
•  If there’s Hive, why we need more execution engines?
–  Just a matter of speed?
©2015 Cloudera, Inc. All Rights Reserved.
15
Early architecture using Hive
©2015 Cloudera, Inc. All Rights Reserved.
Hive
Hadoop
RDBMS
Web servers
Different behaviour
based on
aggregations
16
Early architecture using Hive
©2015 Cloudera, Inc. All Rights Reserved.
Hive
Hadoop
RDBMS
Web servers
Different behaviour
based on
aggregations
Why do
we need
this?
17
Why not?
©2015 Cloudera, Inc. All Rights Reserved.
Hive
Hadoop
Web servers
Aggregations/
Recommendations
Different behaviour
based on
aggregations
18
Aha #1
•  Performance
–  Speed of execution
–  Concurrency
•  Need an MPP like solution
©2015 Cloudera, Inc. All Rights Reserved.
19
Aha #1
•  Impala
•  Drill
•  Presto
•  Hive
–  Hive-on-Tez
–  Hive-on-Spark
©2015 Cloudera, Inc. All Rights Reserved.
20
Impala
21
Impala - Goals
•  General-purpose SQL query engine:
•  Works for both for analytical and transactional/single-row workloads
•  Runs directly within Hadoop:
–  reads widely used Hadoop file formats
–  talks to widely used Hadoop storage managers
–  runs on same nodes that run Hadoop processes
•  High performance
–  Execution times
–  Concurrency
•  Open source
©2015 Cloudera, Inc. All Rights Reserved.
22
User view of Impala
•  There is no ‘Impala format’!
•  Supported file formats:
–  uncompressed/lzo-compressed text files
–  sequence files with snappy/gzip compression
–  RCFile with snappy/gzip compression
–  Avro data files
–  Parquet (columnar format)
–  HBase
–  And, more…
©2015 Cloudera, Inc. All Rights Reserved.
23
Impala Use Cases
Interactive BI/analytics on more data
Asking new questions – exploration, ML
Data processing with tight SLAs
Query-able archive w/full fidelity
Cost-effective, ad hoc query environment that offloads/
replaces the data warehouse for:
24
Global Financial Services Company
Saved 90% on incremental EDW spend &
improved performance by 5x
Offload data warehouse for query-able archive
Store decades of data cost-effectively
Process & analyze on the same system
Improved capabilities through interactive query on more
data
25
Digital Media Company
20x performance improvement for exploration & data
discovery
Easily identify new data sets for modeling
Interact with raw data directly to test hypotheses
Avoid expensive DW schema changes
Accelerate ‘time to answer’
26
Architecture of
Impala
27
Impala Architecture
•  Three binaries: impalad, statestored, catalogd
•  Impala daemon (impalad) – N instances
–  handles client requests and all internal requests related to query execution
•  State store daemon (statestored) – 1 instance
–  Provides name service and metadata distribution
•  Catalog daemon (catalogd) – 1 instance
–  Relays metadata changes to all impalad’s
28
Impala Architecture: Query Execution
•  Request arrives via odbc/jdbc
Query	
  Planner	
  
Query	
  Executor	
  
HDFS	
  DN	
   HBase	
  
SQL	
  App	
  
ODBC	
  
Query	
  Planner	
  
Query	
  Coordinator	
  
Query	
  Executor	
  
HDFS	
  DN	
   HBase	
  
Query	
  Planner	
  
Query	
  Executor	
  
HDFS	
  DN	
   HBase	
  
SQL	
  request	
  
Query	
  Coordinator	
   Query	
  Coordinator	
  
HiveMetastore	
   HDFS	
  NN	
   statestored	
   catalogd	
  
29
Impala Architecture: Query Execution
•  Planner turns request into collections of plan fragments
•  Coordinator initiates execution on remote impalad's
Query	
  Planner	
  
Query	
  Coordinator	
  
Query	
  Executor	
  
HDFS	
  DN	
   HBase	
  
SQL	
  App	
  
ODBC	
  
Query	
  Planner	
  
Query	
  Coordinator	
  
Query	
  Executor	
  
HDFS	
  DN	
   HBase	
  
Query	
  Planner	
  
Query	
  Coordinator	
  
Query	
  Executor	
  
HDFS	
  DN	
   HBase	
  
HiveMetastore	
   HDFS	
  NN	
   statestored	
   catalogd	
  
30
Impala Architecture: Query Execution
•  Intermediate results are streamed between impalad's Query results
are streamed back to client
Query	
  Planner	
  
Query	
  Coordinator	
  
Query	
  Executor	
  
HDFS	
  DN	
   HBase	
  
SQL	
  App	
  
ODBC	
  
Query	
  Planner	
  
Query	
  Coordinator	
  
Query	
  Executor	
  
HDFS	
  DN	
   HBase	
  
Query	
  Planner	
  
Query	
  Coordinator	
  
Query	
  Executor	
  
HDFS	
  DN	
   HBase	
  
query	
  results	
  
HiveMetastore	
   HDFS	
  NN	
   statestored	
   catalogd	
  
31
Query Planning: Overview
•  2-phase planning process:
–  single-node plan
–  plan partitioning: partition single-node plan to maximize scan locality, minimize
data movement
•  Parallelization of operators:
–  All query operators are fully distributed
32
Single-Node Plan: Example Query	
  
SELECT t1.custid,
SUM(t2.revenue) AS revenue
FROM LargeHdfsTable t1
JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC LIMIT 10;
33
Query Planning: Single-Node Plan
•  Single-node plan for example:
HashJoin
Scan: t1
Scan: t3
Scan: t2
HashJoin
TopN
Agg
34
Single-node plan
•  SQL query as a left-deep tree of plan operators
•  Scan, HashJoin, HashAggregation, Union, TopN, Exchange
35
Plan Partitioning
•  Partition single-node plan
–  Maximize scan locality
–  Minimize data movement
•  Parallelization of operators:
–  All query operators are fully distributed
36
Query	
  Planning:	
  Distributed	
  Plans	
  
HashJoinScan: t1
Scan: t3
Scan: t2
HashJoin
TopN
Pre-Agg
MergeAgg
TopN
Broadcast
Broadcast
hash t2.idhash t1.id1
hash t1.custid
at HDFS DN
at HBase RS
at coordinator
37
Impala Execution Engine
•  Written in C++ for minimal execution overhead
•  Internal in-memory tuple format puts fixed-width data at fixed offsets
•  Uses intrinsics/special cpu instructions for text parsing, crc32
computation, etc.
•  Runtime code generation for “big loops”
38
Runtime code generation
•  example of "big loop": insert batch of rows into hash table
•  known at query compile time: # of tuples in a batch, tuple layout,
column types, etc.
•  generate at compile time: unrolled loop that inlines all function calls,
contains no dead code, minimizes branches
•  code generated using LLVM
39
Comparing Impala to Dremel
•  What is Dremel?
–  columnar storage for data with nested structures
–  distributed scalable aggregation on top of that
•  Columnar storage in Hadoop: Parquet
–  stores data in appropriate native/binary types
–  can also store nested structures similar to Dremel's ColumnIO
–  Parquet is open source: github.com/parquet
•  Distributed aggregation: Impala
•  Impala plus Parquet: a superset of the published version of Dremel
(which didn't support joins)
40
But, what makes Impala fast?
•  No MapReduce
•  Use of memory
•  LLVM
•  C++
•  Vectorization
•  Tight integration with Parquet
41
Confusion #2 – What do I use?
•  Hive on Mapreduce?
•  Hive on Tez?
•  Hive on Spark?
•  Impala?
•  Spark SQL?
•  Drill?
©2015 Cloudera, Inc. All Rights Reserved.
42
Processing Frameworks in Hadoop
©2015 Cloudera, Inc. All Rights Reserved.
Hadoop Storage managers (HDFS/HBase/Solr)
Hive
Pig
Tez
Cascading
MapReduce
Hive
Cascading
Crunch
Pig
Mahout
Giraph
Dato(Graphlab)
Spark
Hive(inbeta)
Cascading
Crunch
Pig
MLlib
GraphX
SQL
engines
General purpose
execution engines
Storage
managers
Impala
Drill
H20
Oryx
Graph
processing
engines
Machine
Learning
engines
Abstraction
engines
Storm/Trident
SparkStreaming
Real-time
frameworks
SparkSQL
Presto
43
Free books!
•  @hadooparchbook
•  hadooparchitecturebook.com
•  github.com/hadooparchitecturebook
•  slideshare.com/hadooparchbook
•  Later today at 3:10 PM
©2015 Cloudera, Inc. All Rights Reserved.
44
Impala
•  SQL-on-Hadoop engine
•  Near real time SQL
•  Reads YOUR file formats
•  Allows to write custom UDFs
•  Open source
•  Commonly used for Data Warehouse offloading
©2015 Cloudera, Inc. All Rights Reserved.
45
I’d love to talk more!
•  @mark_grover
•  Find me at Cloudera booth!
©2015 Cloudera, Inc. All Rights Reserved.

Más contenido relacionado

La actualidad más candente

Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataMike Percy
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera, Inc.
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduJeremy Beard
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera, Inc.
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaData Science London
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Data Con LA
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...Yahoo Developer Network
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoopmarkgrover
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 

La actualidad más candente (20)

Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
 
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera Impala: A Modern SQL Engine for Apache Hadoop
Cloudera Impala: A Modern SQL Engine for Apache Hadoop
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Introducing Kudu
Introducing KuduIntroducing Kudu
Introducing Kudu
 
Hive vs. Impala
Hive vs. ImpalaHive vs. Impala
Hive vs. Impala
 

Destacado

How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...Remy Rosenbaum
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationshadooparchbook
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRmarkgrover
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Grouphadooparchbook
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata Londonhadooparchbook
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialhadooparchbook
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis FirehoseAmazon Web Services
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 

Destacado (13)

How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
How Auto Microcubes Work with Indexing & Caching to Deliver a Consistently Fa...
 
Strata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applicationsStrata EU tutorial - Architectural considerations for hadoop applications
Strata EU tutorial - Architectural considerations for hadoop applications
 
Application architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MRApplication architectures with Hadoop and Sessionization in MR
Application architectures with Hadoop and Sessionization in MR
 
Application Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User GroupApplication Architectures with Hadoop - UK Hadoop User Group
Application Architectures with Hadoop - UK Hadoop User Group
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorialStrata NY 2014 - Architectural considerations for Hadoop applications tutorial
Strata NY 2014 - Architectural considerations for Hadoop applications tutorial
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 

Similar a SQL Engines for Hadoop - The case for Impala

Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016StampedeCon
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...ssuserd3a367
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsAshish Mrig
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauSam Palani
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCPrecisely
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptxbetalab
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Mladen Kovacevic
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfssusere05ec21
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...Amazon Web Services
 

Similar a SQL Engines for Hadoop - The case for Impala (20)

Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
 
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & TableauBig Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
Big Data Analytics on the Cloud Oracle Applications AWS Redshift & Tableau
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDC
 
Big Data_Architecture.pptx
Big Data_Architecture.pptxBig Data_Architecture.pptx
Big Data_Architecture.pptx
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
 

Más de markgrover

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting datamarkgrover
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 markgrover
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationmarkgrover
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsenmarkgrover
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy designmarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadatamarkgrover
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beammarkgrover
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speedmarkgrover
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyftmarkgrover
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyftmarkgrover
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spotmarkgrover
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoopmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorialmarkgrover
 

Más de markgrover (20)

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integration
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy design
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorial
 

Último

Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdfSuman Jyoti
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 

Último (20)

Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 

SQL Engines for Hadoop - The case for Impala

  • 1. SQL Engines for Hadoop – The case for Impala Budapest Data Forum, June 4th, 2015 tiny.cloudera.com/mark-sql-budapest Mark Grover | @mark_grover
  • 2. SQL Engines for Hadoop – The case for Impala Budapest Data Forum, June 4th, 2015 tiny.cloudera.com/mark-sql-budapest Mark Grover | @mark_grover Grover Mark
  • 3. SQL Engines for Hadoop – The case for Impala Budapest Data Forum, June 4th, 2015 tiny.cloudera.com/mark-sql-budapest Mark Grover | @mark_grover Grover Mark József
  • 5. 5 Agenda •  Word Count ©2015 Cloudera, Inc. All Rights Reserved.
  • 6. 6 Agenda ©2015 Cloudera, Inc. All Rights Reserved.
  • 7. 7 Past ©2015 Cloudera, Inc. All Rights Reserved.
  • 8. 8 Future ©2015 Cloudera, Inc. All Rights Reserved.
  • 9. 9 What is Apache Hadoop? 9 Has the Flexibility to Store and Mine Any Type of Data §  Ask questions across structured and unstructured data that were previously impossible to ask or solve §  Not bound by a single schema Excels at Processing Complex Data §  Scale-out architecture divides workloads across multiple nodes §  Flexible file system eliminates ETL bottlenecks Scales Economically §  Can be deployed on commodity hardware §  Open source platform guards against vendor lock Hadoop Distributed File System (HDFS) Self-Healing, High Bandwidth Clustered Storage MapReduce Distributed Computing Framework Apache Hadoop is an open source platform for data storage and processing that is… ü  Scalable ü  Fault tolerant ü  Distributed CORE HADOOP SYSTEM COMPONENTS ©2015 Cloudera, Inc. All Rights Reserved.
  • 10. 10 What is Apache Hadoop? Has the Flexibility to Store and Mine Any Type of Data §  Ask questions across structured and unstructured data that were previously impossible to ask or solve §  Not bound by a single schema Excels at Processing Complex Data §  Scale-out architecture divides workloads across multiple nodes §  Flexible file system eliminates ETL bottlenecks Scales Economically §  Can be deployed on commodity hardware §  Open source platform guards against vendor lock Hadoop Distributed File System (HDFS) Self-Healing, High Bandwidth Clustered Storage MapReduce Distributed Computing Framework Apache Hadoop is an open source platform for data storage and processing that is… ü  Scalable ü  Fault tolerant ü  Distributed CORE HADOOP SYSTEM COMPONENTS ©2015 Cloudera, Inc. All Rights Reserved.
  • 11. 11 What? No schema? •  How do you use SQL? •  Do you lose Schema-on-read when using SQL-on-Hadoop? ©2015 Cloudera, Inc. All Rights Reserved.
  • 13. 13 Hive ©2015 Cloudera, Inc. All Rights Reserved. •  First SQL engine •  Converts SQL to MapReduce •  New execution engine additions –  Hive-on-Spark –  Hive-on-Tez
  • 14. 14 Confusion #1 •  If there’s Hive, why we need more execution engines? –  Just a matter of speed? ©2015 Cloudera, Inc. All Rights Reserved.
  • 15. 15 Early architecture using Hive ©2015 Cloudera, Inc. All Rights Reserved. Hive Hadoop RDBMS Web servers Different behaviour based on aggregations
  • 16. 16 Early architecture using Hive ©2015 Cloudera, Inc. All Rights Reserved. Hive Hadoop RDBMS Web servers Different behaviour based on aggregations Why do we need this?
  • 17. 17 Why not? ©2015 Cloudera, Inc. All Rights Reserved. Hive Hadoop Web servers Aggregations/ Recommendations Different behaviour based on aggregations
  • 18. 18 Aha #1 •  Performance –  Speed of execution –  Concurrency •  Need an MPP like solution ©2015 Cloudera, Inc. All Rights Reserved.
  • 19. 19 Aha #1 •  Impala •  Drill •  Presto •  Hive –  Hive-on-Tez –  Hive-on-Spark ©2015 Cloudera, Inc. All Rights Reserved.
  • 21. 21 Impala - Goals •  General-purpose SQL query engine: •  Works for both for analytical and transactional/single-row workloads •  Runs directly within Hadoop: –  reads widely used Hadoop file formats –  talks to widely used Hadoop storage managers –  runs on same nodes that run Hadoop processes •  High performance –  Execution times –  Concurrency •  Open source ©2015 Cloudera, Inc. All Rights Reserved.
  • 22. 22 User view of Impala •  There is no ‘Impala format’! •  Supported file formats: –  uncompressed/lzo-compressed text files –  sequence files with snappy/gzip compression –  RCFile with snappy/gzip compression –  Avro data files –  Parquet (columnar format) –  HBase –  And, more… ©2015 Cloudera, Inc. All Rights Reserved.
  • 23. 23 Impala Use Cases Interactive BI/analytics on more data Asking new questions – exploration, ML Data processing with tight SLAs Query-able archive w/full fidelity Cost-effective, ad hoc query environment that offloads/ replaces the data warehouse for:
  • 24. 24 Global Financial Services Company Saved 90% on incremental EDW spend & improved performance by 5x Offload data warehouse for query-able archive Store decades of data cost-effectively Process & analyze on the same system Improved capabilities through interactive query on more data
  • 25. 25 Digital Media Company 20x performance improvement for exploration & data discovery Easily identify new data sets for modeling Interact with raw data directly to test hypotheses Avoid expensive DW schema changes Accelerate ‘time to answer’
  • 27. 27 Impala Architecture •  Three binaries: impalad, statestored, catalogd •  Impala daemon (impalad) – N instances –  handles client requests and all internal requests related to query execution •  State store daemon (statestored) – 1 instance –  Provides name service and metadata distribution •  Catalog daemon (catalogd) – 1 instance –  Relays metadata changes to all impalad’s
  • 28. 28 Impala Architecture: Query Execution •  Request arrives via odbc/jdbc Query  Planner   Query  Executor   HDFS  DN   HBase   SQL  App   ODBC   Query  Planner   Query  Coordinator   Query  Executor   HDFS  DN   HBase   Query  Planner   Query  Executor   HDFS  DN   HBase   SQL  request   Query  Coordinator   Query  Coordinator   HiveMetastore   HDFS  NN   statestored   catalogd  
  • 29. 29 Impala Architecture: Query Execution •  Planner turns request into collections of plan fragments •  Coordinator initiates execution on remote impalad's Query  Planner   Query  Coordinator   Query  Executor   HDFS  DN   HBase   SQL  App   ODBC   Query  Planner   Query  Coordinator   Query  Executor   HDFS  DN   HBase   Query  Planner   Query  Coordinator   Query  Executor   HDFS  DN   HBase   HiveMetastore   HDFS  NN   statestored   catalogd  
  • 30. 30 Impala Architecture: Query Execution •  Intermediate results are streamed between impalad's Query results are streamed back to client Query  Planner   Query  Coordinator   Query  Executor   HDFS  DN   HBase   SQL  App   ODBC   Query  Planner   Query  Coordinator   Query  Executor   HDFS  DN   HBase   Query  Planner   Query  Coordinator   Query  Executor   HDFS  DN   HBase   query  results   HiveMetastore   HDFS  NN   statestored   catalogd  
  • 31. 31 Query Planning: Overview •  2-phase planning process: –  single-node plan –  plan partitioning: partition single-node plan to maximize scan locality, minimize data movement •  Parallelization of operators: –  All query operators are fully distributed
  • 32. 32 Single-Node Plan: Example Query   SELECT t1.custid, SUM(t2.revenue) AS revenue FROM LargeHdfsTable t1 JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id) JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id) WHERE t3.category = 'Online' GROUP BY t1.custid ORDER BY revenue DESC LIMIT 10;
  • 33. 33 Query Planning: Single-Node Plan •  Single-node plan for example: HashJoin Scan: t1 Scan: t3 Scan: t2 HashJoin TopN Agg
  • 34. 34 Single-node plan •  SQL query as a left-deep tree of plan operators •  Scan, HashJoin, HashAggregation, Union, TopN, Exchange
  • 35. 35 Plan Partitioning •  Partition single-node plan –  Maximize scan locality –  Minimize data movement •  Parallelization of operators: –  All query operators are fully distributed
  • 36. 36 Query  Planning:  Distributed  Plans   HashJoinScan: t1 Scan: t3 Scan: t2 HashJoin TopN Pre-Agg MergeAgg TopN Broadcast Broadcast hash t2.idhash t1.id1 hash t1.custid at HDFS DN at HBase RS at coordinator
  • 37. 37 Impala Execution Engine •  Written in C++ for minimal execution overhead •  Internal in-memory tuple format puts fixed-width data at fixed offsets •  Uses intrinsics/special cpu instructions for text parsing, crc32 computation, etc. •  Runtime code generation for “big loops”
  • 38. 38 Runtime code generation •  example of "big loop": insert batch of rows into hash table •  known at query compile time: # of tuples in a batch, tuple layout, column types, etc. •  generate at compile time: unrolled loop that inlines all function calls, contains no dead code, minimizes branches •  code generated using LLVM
  • 39. 39 Comparing Impala to Dremel •  What is Dremel? –  columnar storage for data with nested structures –  distributed scalable aggregation on top of that •  Columnar storage in Hadoop: Parquet –  stores data in appropriate native/binary types –  can also store nested structures similar to Dremel's ColumnIO –  Parquet is open source: github.com/parquet •  Distributed aggregation: Impala •  Impala plus Parquet: a superset of the published version of Dremel (which didn't support joins)
  • 40. 40 But, what makes Impala fast? •  No MapReduce •  Use of memory •  LLVM •  C++ •  Vectorization •  Tight integration with Parquet
  • 41. 41 Confusion #2 – What do I use? •  Hive on Mapreduce? •  Hive on Tez? •  Hive on Spark? •  Impala? •  Spark SQL? •  Drill? ©2015 Cloudera, Inc. All Rights Reserved.
  • 42. 42 Processing Frameworks in Hadoop ©2015 Cloudera, Inc. All Rights Reserved. Hadoop Storage managers (HDFS/HBase/Solr) Hive Pig Tez Cascading MapReduce Hive Cascading Crunch Pig Mahout Giraph Dato(Graphlab) Spark Hive(inbeta) Cascading Crunch Pig MLlib GraphX SQL engines General purpose execution engines Storage managers Impala Drill H20 Oryx Graph processing engines Machine Learning engines Abstraction engines Storm/Trident SparkStreaming Real-time frameworks SparkSQL Presto
  • 43. 43 Free books! •  @hadooparchbook •  hadooparchitecturebook.com •  github.com/hadooparchitecturebook •  slideshare.com/hadooparchbook •  Later today at 3:10 PM ©2015 Cloudera, Inc. All Rights Reserved.
  • 44. 44 Impala •  SQL-on-Hadoop engine •  Near real time SQL •  Reads YOUR file formats •  Allows to write custom UDFs •  Open source •  Commonly used for Data Warehouse offloading ©2015 Cloudera, Inc. All Rights Reserved.
  • 45. 45 I’d love to talk more! •  @mark_grover •  Find me at Cloudera booth! ©2015 Cloudera, Inc. All Rights Reserved.