Hive SQL Query Language for Analyzing Big Data

Hive

BALA KRISHNA G
Global Big Data Bootcamp – Jan 2014
(http://globalbigdataconference.com)

Global Big Data Conference - 2014

My introduction
Senior Software and Research Engineer
Big data trainer
More than 1.5 years of experience with Hadoop and Storm
Worked at various large companies (Sun/Oracle, IBM, etc.)

www.linkedin.com/in/gbalakrishna/
bala.gsbk@outlook.com

Speaker : Bala



Agenda
Class structure
– 1 hour lecture and 1 ½ hour lab

Lecture
– Need for Hive
– Hive history
– Hive powered by
– What is Hive?
– Hive Architecture
– Hive Query Life cycle
– Hive Query Language (HiveQL)

Lab:
– Extensive hands-on experience with Hive
– Derive various insights from a real-world dataset using Hive


Need for Hive

“Do I need to learn Java?”

“Don’t worry! I am here to rescue you.”

Need for Hive contd.,
In general, a single MR (MapReduce) job does not suffice to derive BI (Business Intelligence)
Often, a series of complex MR jobs must be chained together (advanced data processing)
[Figure: a workflow of six chained MapReduce jobs, MR 1 to MR 6. Legend: MR = MapReduce, mapper task, reducer task]

20 lines of Hive code can correspond to ~200 lines of Java code
Lowers development time significantly (~16 times)

[Figure: two bar charts comparing Hadoop and Pig, one for lines of code and one for development time in minutes]

You focus only on the “WHAT” part of your data analysis
The “HOW” part is taken care of by the framework


Hive powered by

Used for processing large amounts of user data and central to meeting the company's reporting needs

Data analytics and data cleaning

Ad hoc queries, reporting, and analytics
And many more…
https://cwiki.apache.org/confluence/display/Hive/PoweredBy

What is Hive?

Data warehouse built on top of Hadoop
Provides an SQL-like interface to analyze data
An open source project under Apache
Works on the high-throughput, high-latency principle (same as Hadoop)
Ability to plug in custom MapReduce programs
Mainly targeted at structured data
Hides the complexity of MapReduce programs from the end user

Hive Architecture

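[Figure: Hive architecture. Components shown: client interfaces (CLI, web interface, and Python, Perl, and ODBC clients through the Hive Thrift Server); within Hive, the Driver (compiler, optimizer, plan executor) and the Metastore; within Hadoop, MapReduce and HDFS]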

Metastore
Stores table metadata such as database location, owner, creation time, access attributes, table schema, etc.
Comprises two components: 1) service, 2) data storage
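[Figure: three metastore configurations
– Embedded metastore: driver and metastore service run inside the Hive service, backed by Derby
– Local metastore: driver and metastore service run inside the Hive service, backed by MySQL
– Remote metastore: the driver talks to a separate metastore server, backed by MySQL]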
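To see what the metastore holds for a table, the following minimal sketch can be run from the Hive CLI. It assumes a table named sample (the example table used throughout these slides) already exists and is partitioned.

```sql
-- Prints the metadata kept in the metastore for the table:
-- columns, owner, creation time, HDFS location, partition keys, and so on.
DESCRIBE FORMATTED sample;

-- Lists the partitions registered in the metastore
-- (only valid when the table is partitioned).
SHOW PARTITIONS sample;
```

None of these statements read the table's data files; they are answered from metadata alone.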

Hive Query Life cycle Insight


Hive Query Life cycle contd.,
[Figure: Hive query life cycle in 14 numbered steps across these components: Hive interface, driver, compiler (parser, semantic analyzer, logical plan generator, optimizer, physical plan generator), metastore, execution engine, and Hadoop MapReduce]
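One way to observe the outcome of the compile steps is the EXPLAIN statement, which prints the plan Hive generates for a query without running it. A minimal sketch, assuming the sample table from these slides exists:

```sql
-- Shows the stage dependencies and the per-stage plan
-- (map and reduce operator trees) produced by the compiler.
EXPLAIN
SELECT state, COUNT(*)
FROM sample
GROUP BY state;
```

The exact output differs between Hive versions, so it is best treated as a diagnostic aid rather than a stable format.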

Data Models
Database: a namespace for tables
Table: container of the actual data
Example table "sample" with columns: Id, Name, Age, Sex, State

In the Hive warehouse, a table is stored as a folder
/user/$USER/warehouse/sample


Data Models contd.,
Partition: horizontal slice of a table by a partition key
Let's say the sample table is partitioned by the State column
[Figure: the sample table (Id, Name, Age, Sex, State) split horizontally into Partition 1 and Partition 2]

Partitions are stored as subfolders under the sample directory
/user/$USER/warehouse/sample/State=AL/
/user/$USER/warehouse/sample/State=NC/
/user/$USER/warehouse/sample/State=GA/
/user/$USER/warehouse/sample/State=ND/


Data Models contd.,
Bucket: divides the data into further chunks by another column (useful for sampling)
Let's say the sample table is partitioned by the ‘State’ column and clustered by the ‘Age’ column into 2 buckets
In the warehouse, the data is stored as
/user/$USER/warehouse/sample/State=AL/part-00000
/user/$USER/warehouse/sample/State=AL/part-00001
/user/$USER/warehouse/sample/State=GA/part-00000
/user/$USER/warehouse/sample/State=GA/part-00001
.
.
/user/$USER/warehouse/sample/State=ND/part-00000
/user/$USER/warehouse/sample/State=ND/part-00001
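A partitioned and bucketed layout like the one above could be declared roughly as follows. This is a sketch, not the deck's lab code: the table name sample_bucketed is hypothetical, and the INSERT assumes a source table sample with matching columns.

```sql
-- Partition by state, then split each partition into 2 buckets by age.
-- The partition column is declared only in PARTITIONED BY.
CREATE TABLE sample_bucketed (id INT, name STRING, age INT, sex STRING)
PARTITIONED BY (state STRING)
CLUSTERED BY (age) INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- On Hive versions before 2.0, bucketing is only enforced on insert
-- after running: SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE sample_bucketed PARTITION (state = 'AL')
SELECT id, name, age, sex
FROM sample
WHERE state = 'AL';

-- Sampling: read only the first of the 2 buckets.
SELECT *
FROM sample_bucketed TABLESAMPLE (BUCKET 1 OUT OF 2 ON age) s
WHERE state = 'AL';
```

Filtering on the partition column lets Hive read only the matching subfolder, and TABLESAMPLE then narrows the read to a single bucket file within it.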

Data Loading Techniques
Managed table: a table managed by the Hive warehouse
– 1) Load a file from the local file system into the Hive warehouse (the file is copied)
[Figure: local FS file → copy → Hive warehouse on HDFS]

– 2) Load a file from HDFS into the Hive warehouse (the file is moved rather than copied)
[Figure: HDFS file → Hive warehouse]

Data Loading Techniques contd.,
External table: a table that is only referenced by the Hive warehouse
– 3) The file is managed directly in HDFS, without copying it into the Hive warehouse
[Figure: HDFS file ← referenced by ← Hive warehouse]

Data Loading Techniques contd.,
When should you choose an external table, and when a managed table?


Question - 01
In which scenario would you use Hive?
1. Completely unstructured nasty data
2. Structured data
3. Any kind of data
4. None of the above

Question – 01 answer

2. Hive is mainly used to analyze structured data. Typically, Hive runs on data generated by a MapReduce job or by Pig.


Question - 02
Which option is not correct about the Metastore?
1. It stores the table location
2. It has information about the number of partitions and the number of buckets
3. It can give you the time at which the table was created
4. It stores the actual data


Question – 02 answer

4. The Metastore stores only the metadata. The actual data is stored in HDFS.


Question – 03 (last question)
What is incorrect about Hive?
1. Hive internally generates MapReduce jobs to serve your query
2. Hive runs on top of HDFS
3. Hive is proprietary software
4. Hive supports multiple interfaces to interact with


Question – 03 answer

3. Hive is open source, not proprietary software. The Hive community is growing very rapidly.


Hive Query Language (Hive QL)
Data types – provides types for variables
DDL – provides a way to define databases, tables, etc.
DML – provides a way to modify content
Query statements – provides a way to retrieve the content


Data types

Primitive Types

Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)
Booleans: BOOLEAN (TRUE or FALSE)
Floating point numbers: FLOAT (4 bytes), DOUBLE (8 bytes)
String: STRING (sequence of characters)

Usage
variable_name <Data Type>
ex: name STRING

Data types contd.,
Complex Types

ARRAY: collection of multiple values of the same data type
Usage: name ARRAY<data type>
ex: marks ARRAY<INT>

STRUCT: collection of multiple values of different data types
Usage: name STRUCT<field1:type1, field2:type2, field3:type3, …>
ex: record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>

MAP: collection of (key, value) pairs
Usage: name MAP<key type, value type>
ex: score MAP<STRING, INT>

Data types contd.,
The key of a MAP must be a primitive type
Referencing complex types
Previous example:
– marks ARRAY<INT>
– record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>
– score MAP<STRING, INT>
SELECT marks[0], record.name, score['joe'] FROM <tablename>;

A complex type inside a complex type is allowed
– array inside a struct (as seen before)
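The three complex types can be tried together in one table. The sketch below uses a hypothetical table name, student_scores, and relies on Hive's default delimiters; note that struct fields are declared as name:type pairs.

```sql
-- Hypothetical table holding one column of each complex type.
CREATE TABLE student_scores (
  marks  ARRAY<INT>,
  record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>,
  score  MAP<STRING, INT>
);

-- Array elements by index, struct fields by dot notation,
-- map values by key; nested access combines the notations.
SELECT marks[0],
       record.name,
       record.marks[0],
       score['joe'],
       size(marks)
FROM student_scores;
```

Indexing past the end of an array or looking up a missing map key returns NULL rather than raising an error.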


DDL
CREATE TABLE sample(id INT, name STRING, age INT, sex STRING)  -- schema
COMMENT 'This is a sample table'  -- comment for readability
PARTITIONED BY (state STRING)  -- partition data by the state column
ROW FORMAT DELIMITED  -- rows are delimited by '\n'
FIELDS TERMINATED BY ','  -- fields are terminated by ','
STORED AS TEXTFILE;  -- store the file as a text file

The partition column state is declared only in PARTITIONED BY; Hive rejects a column that appears in both the column list and the partition list
The table is created in the warehouse directory and completely managed by Hive
A specific row format and file format can be expressed by a custom SerDe


SerDe

SerDe stands for Serializer and Deserializer

Deserializer (read path):
HDFS file → InputFileFormat → <key, value> → Deserializer → Row

Serializer (write path):
Row → Serializer → <key, value> → OutputFileFormat → HDFS file
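A table can name its SerDe explicitly instead of using ROW FORMAT DELIMITED. The sketch below uses RegexSerDe, which ships with Hive, to split each line by a regular expression. The table name and the two-column layout are hypothetical, and the columns are kept as STRING because older versions of this SerDe only support string columns.

```sql
-- Each line of the text file is matched against input.regex;
-- every capture group becomes one column, in order.
CREATE TABLE sample_regex (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^,]*),([^,]*)"
)
STORED AS TEXTFILE;
```

Lines that do not match the expression are returned as rows of NULLs, which makes this SerDe handy for semi-regular log files.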

DDL contd.,
CREATE EXTERNAL TABLE external_sample(id INT, name STRING, age INT, sex STRING, state STRING)
LOCATION '/user/department/sample';

The table is not created in the warehouse directory; it is only referenced by Hive
The referenced data is in HDFS, under the directory /user/department/sample


DDL contd.,
DROP TABLE sample;
Since the sample table is managed by Hive, this deletes the entire data along with the metadata
DROP TABLE external_sample;
Since the external_sample table is *not* managed by Hive, this deletes only the metadata, leaving the actual data untouched


DML
Load data into a managed table from the local file system
LOAD DATA LOCAL INPATH '/home/hive/sample.txt' INTO TABLE sample;
The file '/home/hive/sample.txt' is in the local file system
It is copied into the Hive warehouse folder

Load data into a managed table from HDFS
LOAD DATA INPATH '/user/hive/sample.txt' INTO TABLE sample;
The file '/user/hive/sample.txt' is in HDFS
It is moved (not copied) into the Hive warehouse folder
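When the target table is partitioned, as the sample table is in the DDL slide, the load generally has to name the partition it fills. A minimal sketch; the file name sample_al.txt is hypothetical.

```sql
-- Append a local file to the state='AL' partition of sample.
LOAD DATA LOCAL INPATH '/home/hive/sample_al.txt'
INTO TABLE sample
PARTITION (state = 'AL');

-- Replace the existing contents of that partition instead of appending.
LOAD DATA LOCAL INPATH '/home/hive/sample_al.txt'
OVERWRITE INTO TABLE sample
PARTITION (state = 'AL');
```

LOAD DATA does not parse or validate the file; it only places it under the partition's folder, so a mismatch with the table's row format shows up later as NULLs at query time.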


DML contd.,
Insert results into a new table
INSERT OVERWRITE TABLE newsample
SELECT * FROM sample;
The newsample table must be created beforehand
The select query results are loaded into newsample, overwriting its contents

Create a new table with an automatically derived schema
CREATE TABLE newsample
AS SELECT * FROM sample;
Creates the newsample table with an automatically derived schema
The query results are populated into it


Query statements
To list available databases
SHOW DATABASES;

To use a particular database
USE <databasename>;

To list all tables available in a database
SHOW TABLES;


Query statements contd.,
select
SELECT * FROM sample;

Aggregation functions
SELECT COUNT(DISTINCT state) FROM sample;

Group by, Sort by, Order by
SELECT COUNT(*) FROM sample GROUP BY state;
SELECT * FROM sample SORT BY id DESC;

FROM sample SELECT * ORDER BY id ASC;
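The difference between the ordering clauses is easy to miss, so here is a short sketch against the same sample table. The column alias people is illustrative only.

```sql
-- GROUP BY: one output row per state, with its row count.
SELECT state, COUNT(*) AS people
FROM sample
GROUP BY state;

-- ORDER BY: a total order over the whole result,
-- which forces the final sort through a single reducer.
SELECT * FROM sample ORDER BY id ASC;

-- SORT BY: rows are ordered only within each reducer.
-- DISTRIBUTE BY controls which reducer each row goes to.
SELECT *
FROM sample
DISTRIBUTE BY state
SORT BY state, id DESC;
```

On large tables SORT BY scales better because the work is spread across reducers, at the cost of having no global order.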


Query statements contd.,
Joins
SELECT s.*, o.*
FROM sample s
JOIN orders o
ON (s.id = o.id);

Left and right outer joins are also supported
Multiple joins in one query are accepted
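An outer join keeps rows that have no match on the other side. A minimal sketch, assuming the orders table from the join above exists and has an id column:

```sql
-- Every row of sample appears in the result;
-- the orders columns are NULL where no order matches.
SELECT s.id, s.name, o.*
FROM sample s
LEFT OUTER JOIN orders o
ON (s.id = o.id);
```

Swapping LEFT for RIGHT keeps every row of orders instead.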


Custom Functions
UDF:
– User-defined function
– Complex or additional logic can be expressed
– Operates row by row

UDAF:
– User-defined aggregate function
– Custom aggregation logic can be written
– Operates on the groups produced by the GROUP BY clause

UDTF:
– User-defined table-generating function
– Takes a single input row and can emit multiple output rows
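Hive's built-in functions fall into the same three families, which makes them a convenient way to see the difference before writing custom ones. The sketch below assumes the sample table from these slides; the jar path, function name, and class in the closing comment are hypothetical.

```sql
-- UDF: one value in, one value out, applied to each row.
SELECT upper(name) FROM sample;

-- UDAF: many rows in, one value out per group.
SELECT state, avg(age) FROM sample GROUP BY state;

-- UDTF: one row in, several rows out. explode() emits one row per
-- array element, so each sample row appears twice here.
SELECT s.id, t.n
FROM sample s
LATERAL VIEW explode(array(1, 2)) t AS n;

-- A custom function packaged in a jar would be registered like this
-- (hypothetical path and class name):
--   ADD JAR /path/to/my-udfs.jar;
--   CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';
```

LATERAL VIEW joins each input row to the rows its UDTF call generates, which is what lets other columns be selected alongside the exploded values.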


Hive Limitations
Not suitable for unstructured data
Suited to OLAP (analysis) systems, not to OLTP (transactional) workloads
Representing machine learning algorithms can be challenging
Performance trade-off compared with hand-written MR programs in various scenarios
– The gap is narrowing from release to release


Important practical tips
Hive logs: /tmp/$USER/hive.log
To list the available functions: SHOW FUNCTIONS
To get help on a specific function: DESCRIBE FUNCTION <function_name>
Config files are in the /usr/lib/hive/conf folder
– hive-site.xml, hive-default.xml (the -f option runs a HiveQL script file; it does not select a config file)

Parameters can be set within a Hive session using SET
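The tips above translate into a handful of statements that can be typed at the Hive prompt. The properties shown are only examples of commonly used settings, not a recommended configuration.

```sql
-- List all available functions, then get help on one of them.
SHOW FUNCTIONS;
DESCRIBE FUNCTION upper;
DESCRIBE FUNCTION EXTENDED upper;

-- Set a parameter for the current session only.
SET hive.cli.print.header=true;

-- Print the current value of a parameter.
SET hive.cli.print.header;
```

Values set this way last until the session ends; permanent settings belong in hive-site.xml.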



Q/A


Backup slides


Schema on Read
Schema on read: the schema is applied when the data is queried, not when it is loaded
Schema on write: the schema is enforced when the data is loaded
Trade-offs of schema on read
– Faster load time
– Slower query time, since the data is parsed at query time
