4. About me
Oracle ACE
Data and Linux geek
Long-time open source
supporter
Works for @redgluept as a
Data Architect
@drune
5. Big Data Thinking Strategy
●Think small
●Think big
●Don’t think at all (hype is here)
6. What is Apache Hive?
●Open source, TB/PB-scale data warehousing
framework based on Hadoop
●The first and most complete SQL-on-Hadoop solution
●SQL:2003 and SQL:2011 compatible
●Data stored in several formats
●Several execution engines available
●Interactive query support (in-memory cache)
7. Apache Hive - Before you ask
●Data warehouse/OLAP activities (data mining, data
exploration, batch processing, ETL, etc) - “the
heavy lifting of data”
●Low-cost scaling, built with extensibility in mind
●Use it at large dataset (gigabyte/terabyte) scale
●Don’t use Hive for any OLTP activities
●ACID support exists, but is not recommended yet
8. The reason behind Hive
I had written, as part of working with the Feed team - what became - a rather complicated MR
job to rank friends by mutual friends.
In doing so I had pretty much used every Hadoop trick in the bag (partitioners, separate
map and reduce sorting keys, comparators, in-memory hash tables and so on) and realized how
hard it was to write an optimal MR job (particularly on large data sets).
Assembling data into complex data structures was also painful.
I really wanted to see these types of operators exposed in a high level declarative form so
that the average user would never have to go through this. Fortunately - our team had
Oracle veterans well versed in the art of SQL.
Joydeep Sen Sarma (Facebook)
9. The reason behind Hive
Instead of complex MR jobs
You have declarative language...
10. Apache Hive versions & branches
●master branch → Version 2.x: new code, new features; Hadoop 2.x supported
●branch-1 → Version 1.x: stable, backwards compatibility, critical bug fixes only, stable features; Hadoop 1.x and 2.x supported
11. Data Model (data units & types)
●Supports primitive column types (integers,
floating-point numbers, strings, dates/times and booleans)
●Supports complex types: structs, maps and
arrays
●Concept of databases, tables, partitions and
buckets
●SerDe: a serialize/deserialize API used to
move data in and out of tables
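As a sketch of where a SerDe plugs in, the OpenCSVSerde that ships with Hive can be named at table creation time (the table and columns below are invented for illustration; note this SerDe treats every column as STRING):

```sql
-- Illustrative table: the OpenCSVSerde parses each CSV line into columns
CREATE TABLE flights_csv (
  flightnum STRING,
  origin    STRING,
  dest      STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
STORED AS TEXTFILE;
```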
12. Data Model (partitions & bucketing)
● Partitioning: used for distributing load horizontally, performance benefits and
data organization
PARTITIONED BY (flightName STRING, AircraftName STRING)
/employees/flightName=ABC/AircraftName=XYZ
● Buckets (clusters): decompose data sets into more manageable parts, help
with map-side joins, and allow correct sampling within a bucket
“Records with the same flightID will always be stored in the same bucket.
Assuming the number of flightIDs is much greater than the number of buckets, each
bucket will have many flightIDs”
CLUSTERED BY (flightID) INTO XX BUCKETS;
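Putting the two clauses above into one statement, a minimal sketch (the table name, columns and the bucket count of 32 are assumptions for illustration):

```sql
-- Illustrative: partitioned by flight/aircraft name, bucketed by flight ID
CREATE TABLE flights (
  flightID INT,
  depTime  STRING
)
PARTITIONED BY (flightName STRING, AircraftName STRING)
CLUSTERED BY (flightID) INTO 32 BUCKETS;
```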
13. Data Model (complex data types)
Array: ordered collection of fields, all of the same type - array(1,2)
Map: unordered key/value pairs; keys are primitives, values are any type - map(‘a’,1,‘b’,2)
Struct: a collection of named fields - struct(‘a’,10,2.5)
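A sketch of how the three complex types look in DDL and in queries (the table and columns are invented for illustration):

```sql
-- Illustrative table using all three complex types
CREATE TABLE crew (
  name    STRING,
  aliases ARRAY<STRING>,
  hours   MAP<STRING, INT>,
  address STRUCT<city:STRING, country:STRING>
);

-- Access: [index] for arrays, [key] for maps, dot notation for structs
SELECT aliases[0], hours['2016-01'], address.city FROM crew;
```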
15. HiveQL
●HiveQL is an SQL-like query language for Hive
●Supports DDL and DML
●Supports multi-table inserts
●Possible to write custom map-reduce scripts
●Supports UDFs, UDAFs and UDTFs
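A multi-table insert scans the source once and fans the rows out to several targets; a sketch with invented table names:

```sql
-- One pass over flights feeds two destination tables
FROM flights f
INSERT OVERWRITE TABLE long_haul  SELECT f.* WHERE f.distance >= 3000
INSERT OVERWRITE TABLE short_haul SELECT f.* WHERE f.distance < 3000;
```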
16. DDL (some examples)
HIVE> CREATE DATABASE/SCHEMA, TABLE, VIEW, INDEX
HIVE> DROP DATABASE/SCHEMA, TABLE, VIEW, INDEX
HIVE> TRUNCATE TABLE
HIVE> ALTER DATABASE/SCHEMA, TABLE, VIEW
HIVE> SHOW DATABASES/SCHEMAS, TABLES, TBLPROPERTIES, VIEWS,
PARTITIONS, FUNCTIONS
HIVE> DESCRIBE DATABASE/SCHEMA, table_name, view_name
17. File formats
● Parquet: compressed, efficient columnar data
representation available to any project in the Hadoop ecosystem
● ORC: made for Hive; supports the Hive type model, columnar
storage, block compression, predicate pushdown, ACID*,
etc
● Avro: uses JSON for defining data types and protocols, and
serializes data in a compact binary format
● Compressed file formats (LZO, GZIP)
● Plain text files
● Any other data with a well-defined format can be
read (CSV, JSON, XML, etc)
18. ORC
●Stored as columns and compressed = smaller disk
reads
●ORC has a built-in index, min/max values and
other aggregates (e.g. sum, max) = skips entire
blocks to speed up reads
●ORC implements predicate pushdown and bloom
filters
●ORC scales
●You should use it :-)
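A minimal sketch of an ORC-backed table (the table name and the SNAPPY compression choice are assumptions, not from the slides):

```sql
-- ORC storage; 'orc.compress' is an ORC table property
CREATE TABLE flightperf_orc (
  flightnum INT,
  origin    STRING,
  dest      STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```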
19. Indexing
● Not recommended, mainly because of ORC
● ORC has built-in indexes which allow the format to skip
blocks of data during reads
● Hive indexes are implemented as tables
● Compact and bitmap indexes are supported
● Index tables hold information about which data is in
which blocks and are used to skip data (as ORC already
does)
● Not supported on the Tez engine - ignored
● Indexes in Hive are not like indexes in other databases.
21. Hive Architecture
Clients: Hive Web Interface, Hive CLI (beeline, hive), JDBC/ODBC
Thrift Server (HiveServer2)
Driver: Compiler (Parser, Semantic Analyser, Logical Plan Generator,
Query Plan Generator), Optimizer, Executor, Metastore client
Metastore (RDBMS)
Execution engines: MapReduce, Tez, Spark
Resource management: YARN
Storage: HDFS, HBase, Azure Storage, Amazon S3
22. Metastore
● Typically stored in an RDBMS (MySQL, SQL Server,
PostgreSQL, Derby*) - ACID and concurrency on metadata
queries
● Contains metadata for databases, tables and partitions
● Provides two features: data discovery and data abstraction
● Data abstraction: provides information about data formats,
extractors and loaders at table creation time, reused afterwards (ex:
dictionary tables in Oracle)
● Data discovery: discover relevant and specific data; allows
other tools to use the metadata to explore the data (ex: SparkSQL)
24. Execution engines
● 3 execution engines are available:
○ MapReduce (mr)
○ Tez
○ Spark
MR: the original, most stable and most reliable; batch-oriented, disk-based
parallelism (like traditional Hadoop MR jobs).
Tez: high-performance batch and interactive data processing. Stable 99%
of the time. The one you should use. Default on HDP.
Spark: uses Apache Spark (an in-memory computing platform). High-performance
(like Tez), not widely used in production (yet), but making good progress.
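Switching between the three engines is a session-level setting; a sketch (the flights table is invented):

```sql
-- Pick the engine for the current session: mr, tez or spark
SET hive.execution.engine=tez;
SELECT COUNT(*) FROM flights;  -- now runs as a Tez DAG
```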
25. MapReduce vs Tez/Spark
MapReduce:
● One pair of map and reduce does one level of aggregation over the
data. Complex computations typically require multiple such steps.
Tez/Spark:
● DAG (Directed Acyclic Graph)
● The graph does not have cycles because the fault tolerance
mechanism used by Tez is re-execution of failed tasks
● The limitations of MapReduce in Hadoop became a key point to
introduce DAG
● Pipelines consecutive map steps into one
● Avoids the forced serialization between separate MapReduce jobs
26. Tez & DAGs
DAG Definition:
● Data processing is expressed in the form of a directed acyclic graph
(DAG)
Two main components:
● vertices - nodes in the graph representing processing of data
○ user logic, which analyses and modifies the data, sits in the vertices
● edges - representing movement of data between processing steps
○ Defines routing of data between tasks (One-To-One, Broadcast,
Scatter-Gather)
○ Defines when a consumer task is scheduled (Sequential,
Concurrent)
○ Defines the lifetime/reliability of a task output
27. Hive Cost Based Optimizer - Why
● Distributed SQL query processing in Hadoop differs from conventional
relational query engines when it comes to handling of intermediate
result sets
● Query processing requires sorting and reassembling of intermediate
result sets - shuffling
● Most of the existing optimizations in Hive are about minimizing
shuffling cost, plus logical optimizations like filter pushdown,
projection pruning and partition pruning
● Join reordering and join algorithm selection become possible with a
cost-based optimizer.
28. Hive CBO - What to get
● Based on a project called Apache Calcite (https://calcite.apache.org/)
● With a cost-based optimizer you get:
○ How to order joins (join reordering)
○ Which algorithm to use for a join
○ Whether an intermediate result should be persisted or recomputed on
failure
○ The degree of parallelism at any operator (number of mappers and
reducers)
○ Semi-join selection
○ (other optimizer tricks, like histograms)
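The demo settings later in these notes boil down to enabling the CBO and feeding it statistics; a sketch using the customer table from the demo:

```sql
-- Enable the cost-based optimizer and statistics-driven planning
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- The CBO needs table and column statistics to cost alternative plans
ANALYZE TABLE customer COMPUTE STATISTICS;
ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS;
```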
30. Hive - The present-future
● Tez and Spark head to head on performance and stability
● LLAP (Live Long and Process) - Hive interactive queries
● ACID
31. Hive next big thing: LLAP
● Sub-second queries (interactive queries)
● In-memory caching layer with async I/O
● Fast concurrent execution
● Move from disk-oriented to memory-oriented execution (the trend)
● Disks are connected to CPUs via the network - data locality is no longer relevant
SQL:2011 - Seventh revision of the ISO (1987) and ANSI (1986) standard for the SQL database query language
2007 - 15TB
2009 - 2PB
https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-UnderstandingHiveBranches
Release and feature branches not added to the slide as they might be too complex
Predicate pushdown: running operations that filter or cut down data as close to the beginning of your map-reduce pipeline as possible
Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
Show create tables different formats (ORC and PLAINTEXT)
Create an index on a table:
Not supported in TEZ
Set hive.execution.engine=mr
create index idxFlightNum on table flightperfall(flightnum) AS 'COMPACT' WITH DEFERRED REBUILD;
alter index idxFlightNum ON flightperfall rebuild;
show formatted index on flightperfall;
explain select * from flightperfall where flightnum=613 limit 1;
set hive.optimize.index.filter.compact.minsize=10;
explain select * from flightperfall where flightnum=613 limit 1;
Set hive.optimize.index.filter.compact.minsize=5368709120
Execution times;
Show operator tree with index and without index
ORC vs CSV query time:
select * from flightperfall_orc where flightnum=613 limit 1;
Describe - components
HiveCLI - management tools
Ambari - Apache Ambari is a tool for provisioning, managing, and monitoring Apache Hadoop clusters.
HiveServer2 - HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. It is based on Apache Thrift RPC. It is an improved version of HiveServer and supports multi-client concurrency, authentication, and better support for open API clients like JDBC and ODBC.
Driver - Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution
Compiler - The component that parses the query, does semantic analysis on the different query blocks and query expressions and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore:
Parser – Transform a query string to a parse tree representation
Semantic Analyser - Transform the parse tree to an internal query representation (column names are verified and expansions like * are performed), Type-checking and any implicit type conversions and partition checking.
Logical Plan Generator - Convert the internal query representation to a logical plan, which consists of a tree of operators. This step also includes the optimizer to transform the plan to improve performance;
Query Plan Generator – Convert the logical plan to a series of map-reduce tasks (or DAGs stages)
Optimizer - As of 2011, it was rule-based and performed the following: column pruning and predicate pushdown. Now it is cost based like RDBMS.
Executor engine (Processing) - The component which executes the execution plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these different stages of the plan and executes these stages
Metastore - The component that stores all the structure information of the various tables and partitions in the warehouse including column and column type information, the serializers and deserializers necessary to read and write data and the corresponding HDFS files where the data is stored.
https://cwiki.apache.org/confluence/display/Hive/Design
ssh root@127.0.0.1 -p 2222 (sandbox)
Test CLI (beeline and hive cmd)
Beeline: !connect jdbc:hive2://localhost:10000
Show ambari
- Identify metastore hive (mysql database)
- mysql -u root -p ; password: hadoop ; show databases; use hive; select * from DBS; select * from TBLS;
Identify execution engines:
SET hive.execution.engine
Identify CBO active:
set hive.cbo.enable;
set hive.compute.query.using.stats;
set hive.stats.fetch.column.stats;
set hive.stats.fetch.partition.stats;
explain select * from sample_07, sample_08 where sample_07.code = sample_08.code and sample_07.salary > 1000;
Conditions for CBO, for example: statistics on tables, columns or other conditions (too few joins).
Show a database, a table and a file stored in HDFS
Hdfs
Tez – Hindi for “speed”
Example: jobs A and B are independent of each other, but job C needs the results from A and B to complete, Tez will execute A and B in any order and forward the results to C
One-To-One: Data from the ith producer task routes to the ith consumer task.
Broadcast: Data from a producer task routes to all consumer tasks.
Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards
Sequential: Consumer task may be scheduled after a producer task completes.
Concurrent: Consumer task must be co-scheduled with a producer task.
In Hive most of the optimizations are not based on the cost of query execution. Most of the optimizations do not rearrange the operator tree except for filter push down and operator merging.
http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-14-2.pdf
Query:
SELECT year, month, origin, dest, distance FROM flights.flightperfall_orc where flightnum in (select max(flightnum) from flights.flightperfpartorc where year=2008)
MR: (41.1 seconds)
Tez: (3.761 seconds)
Show Tez View (via ambari)
analyze table customer COMPUTE STATISTICS;
analyze table customer COMPUTE STATISTICS for columns;
use foodmart;
explain select * from sales_fact_dec_1998 sf, customer c, product p, store ss
where sf.customer_id = c.customer_id
and p.product_id = sf.product_id
and ss.store_id = sf.store_id
and sf.customer_id > 100
and ss.store_id = 5