Table of Contents
Introducing Apache Hive ............................................................................................................................... 3
Starting Notes ........................................................................................................................................... 3
What is Apache Hive? ............................................................................................................................... 3
Motivation for Hive ................................................................................................................................... 3
Some History ............................................................................................................................................. 3
Deep Dive into Hive ...................................................................................................................................... 4
Hive Architecture ...................................................................................................................................... 4
Hive Data Model ....................................................................................................................................... 5
Hive Query Language (HiveQL) ................................................................................................................. 5
Hive Web Interface ................................................................................................................................... 5
Hive Authorizations................................................................................................................................... 6
Getting Hands dirty – Hive Hands On ........................................................................................................... 7
Hive Installation ........................................................................................................................................ 7
Sample Hive Queries ................................................................................................................................. 8
Hive SerDe ................................................................................................................................. 9
Hive Partitions ........................................................................................................................... 9
Hive JDBC .................................................................................................................................. 9
Changing Hive default Data store ........................................................................................................... 11
Hive User Defined Functions (UDF) ........................................................................................................ 12
Introducing Apache Hive
Starting Notes
This is a Hive kick-start guide: it covers basic Hive concepts and shows how Hive works on top of
Hadoop. The tutorial expects you to be familiar with Hadoop basics and assumes that you have a
preliminary understanding of Map-Reduce programming.
What is Apache Hive?
Apache Hive is an open source data warehouse system for Hadoop. Hive is an abstraction over Apache
Hadoop and provides users an SQL-like query interface. The user only sees table-like data and can
fire Hive queries to get results from the data; Hive internally creates, plans, and executes the Map-
Reduce jobs for us and returns the desired results. Hive is suitable for both structured and semi-
structured data.
Motivation for Hive
The prime motivation for Hive was to enable users to quickly build business solutions using
a familiar language rather than having to think in Map-Reduce for every problem. Hive uses its own
query language, HiveQL, which is very similar to traditional SQL.
Some History
Hive, written in Java, was initially developed at Facebook, is released under the Apache 2.0 License, and
is now also developed by companies such as Netflix. At the time of writing, the current version of Hive is
0.9.0, which is compatible with Hadoop 0.20.x and 1.x.
Deep Dive into Hive
Hive Architecture
Hive works on top of Hadoop and needs Apache Hadoop to be running on your box. A quick note on the
main components of the Hive architecture:
• Hadoop - Hive needs Hadoop as a Base Framework to operate.
• Driver - Hive's driver receives queries and manages their execution, communicating with the Hadoop world.
• Command Line Interface (CLI) – The Hive CLI is the console for firing Hive Queries. The CLI would
be used for operating on our data.
• Web Interface - Hive also provides a web interface to monitor/administer Hive jobs.
• Metastore – the Metastore is Hive's metadata repository, which stores all the structure information
of the various tables/partitions in Hive.
• Thrift Server – Hive ships with a Thrift server via which we can expose Hive as a service,
which can then be connected to via JDBC/ODBC, etc.
Apart from the components above, a few more vital actors are responsible for query
execution:
• Hive Compiler which is responsible for the semantic analysis of the input query and creating an
execution plan. The execution plan is a DAG of stages.
• Hive Execution Engine which executes the execution plan created by the Hive Compiler.
• Hive also has an Optimizer that optimizes the execution plan.
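The execution plan the compiler produces can be inspected from the CLI with the EXPLAIN keyword. A minimal sketch (the ratings table here is purely illustrative):

```sql
-- Prints the DAG of stages for the query, typically a
-- map-reduce stage followed by a fetch stage
EXPLAIN
SELECT movie_id, COUNT(1)
FROM ratings
GROUP BY movie_id;
```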
Hive Data Model
Data in Hive is organized into three data models:
• Tables – similar to tables in a relational database; each table maps to a directory in HDFS.
• Partitions – every table can have one or more partition keys. The data is stored in files based on
the partition key specified. Without a partition all the values would be submitted to the MR Job,
whereas on specifying the partition key only a small subset of data would be passed to the MR
jobs. Hive makes different directories for different partitions to hold data.
• Buckets – the data in each partition may again be divided into buckets based on the hash values.
Each bucket is stored as a file in the partition directory.
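As a sketch, a single table can combine all three levels of the model (the page_views table and its columns are hypothetical):

```sql
-- A table partitioned by date; within each partition the rows
-- are hashed on user_id into 16 bucket files
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 16 BUCKETS;
```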
Hive Query Language (HiveQL)
HiveQL is very similar to SQL, and users can use SQL-like syntax for loading and querying tables. Hive queries
are checked by the compiler for correctness, and an execution plan is created from them. The Hive
execution engine then runs the plan accordingly. The Hive Language Manual can be found here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Hive Web Interface
The Hive web interface is an alternative to the command line interface. We can use the Hive web
interface for administering and monitoring Hive jobs. By default, you can view the Hive web interface
here:
http://localhost:9999/hwi
Note: Your Hive Server must be running in order to view the Hive Web Interface.
Hive Authorizations
The Hive authorization system consists of users, groups, and roles. A role is a name given to a set of
grants and can be reused. A role may be assigned to users, groups, and other roles. Hive roles must
be created manually before being used. Users and groups need not be created manually: the
Metastore determines the username of the connecting user and the groups associated with her.
More on users, Groups and Roles can be found here:
https://cwiki.apache.org/Hive/languagemanual-auth.html
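As a sketch of the syntax (the role, table, and user names are illustrative; see the language manual above for details):

```sql
CREATE ROLE analyst;
GRANT SELECT ON TABLE movies TO ROLE analyst;
GRANT ROLE analyst TO USER bob;
```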
Getting Hands dirty – Hive Hands On
Hive Installation
This is a short, crisp guide to installing Hive on top of your Hadoop setup. Apache Hadoop is a
prerequisite for Apache Hive and must be installed on your box before we can proceed.
1. Download Hive: Download a Hive release from the Apache Hive downloads page.
2. Extract Hive: Extract your Hive archive to any location of your choice.
3. Export Environment Variables & Path:
Hive needs three environment variables set:
• JAVA_HOME: the path where your java is present.
• HADOOP_HOME: path where your Hadoop is present.
• HIVE_HOME: path where you’ve just extracted your hive.
Set these environment variables accordingly to continue. You can export environment variables
by the shell command:
$> export HADOOP_HOME=/home/ubuntu/hadoop/
$> export HIVE_HOME=/home/ubuntu/hive/
$> export PATH=$PATH:$HADOOP_HOME/bin
$> export PATH=$PATH:$HIVE_HOME/bin
4. Create Warehouse directory for Hive:
Hive stores all its data in the directory /user/hive/warehouse/, so let's create that path. (On a
real cluster this directory lives on HDFS and would be created with hadoop fs -mkdir.)
$> sudo mkdir -p /user/hive/warehouse
5. Start Hadoop:
Hive needs Hadoop running for its operations. Let's start Hadoop with the start-all script. Since
you have added HADOOP_HOME/bin to PATH, you should be able to call start-all.sh directly;
otherwise go to the Hadoop bin directory and issue start-all.sh from there.
$> start-all.sh
6. Start Hive:
Finally start hive by issuing the command ‘hive’.
$> hive
hive> show databases;
7. You can also start the Hive server with the command:
$> hive --service hiveserver
Sample Hive Queries
Below are sample Hive queries to create tables and import data into them. The queries assume the
data files are present on your local file system, with the following data format:
File: movies.dat
movie_id:movie_name:tags
File: ratings.dat
user_id:movie_id:rating:timestamp
Download Movie Lens Dataset Here: http://www.grouplens.org/node/73
CREATE TABLES:
———————-------
CREATE TABLE movies (movie_id int, movie_name string, tags string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';

CREATE TABLE ratings (user_id string, movie_id string, rating float, tmstmp
string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
LOAD DATA FROM LOCAL PATH (Not HDFS):
———————————————————------------------
LOAD DATA LOCAL INPATH '/home/ubuntu/workspace/data/movies.dat'
OVERWRITE INTO TABLE movies;

LOAD DATA LOCAL INPATH '/home/ubuntu/workspace/data/ratings.dat'
OVERWRITE INTO TABLE ratings;
VERIFY:
———–---
DESCRIBE movies;
DESCRIBE ratings;
OR,
SELECT * FROM movies LIMIT 10;
SELECT * FROM ratings LIMIT 10;
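With the tables loaded, a simple aggregate query shows the kind of analysis HiveQL enables; a sketch using the tables created above (the cast compensates for movie_id being declared as a string in ratings):

```sql
-- Top 10 movies by average rating
SELECT m.movie_name, AVG(r.rating) AS avg_rating
FROM movies m
JOIN ratings r ON (m.movie_id = CAST(r.movie_id AS INT))
GROUP BY m.movie_name
ORDER BY avg_rating DESC
LIMIT 10;
```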
Hive SerDe
The above queries work fine with single-character delimiters, but many a time we face situations where
we have a complex or multi-character delimiter. In these scenarios we need to use a Hive SerDe to get
our data into Hive tables.
Here is a sample SerDe usage for a USER file which has '::' as the field delimiter. The
data in the USER file is in this format: id::gender::age::occupation::zipcode
While using this SerDe we specify a regular expression which is used to divide each line of data into
fields. (The RegexSerDe ships in the hive-contrib jar, which may need to be added to the session with
ADD JAR first.)
Query Using SerDe:
---------------------
-- NOTE: the contrib RegexSerDe requires all columns to be declared as string
CREATE TABLE USER (id STRING, gender STRING, age STRING, occupation STRING,
zipcode STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)"
);
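To see what fields the regular expression above extracts, here is a quick illustration of the same pattern outside Hive (Python is used purely for demonstration; the sample line follows the id::gender::age::occupation::zipcode format):

```python
import re

# The same pattern the RegexSerDe is configured with
pattern = r"(.*)::(.*)::(.*)::(.*)::(.*)"

line = "1::F::1::10::48067"
fields = re.match(pattern, line).groups()
print(fields)  # ('1', 'F', '1', '10', '48067')
```

Each capture group becomes one column of the table, which is why the table declares exactly five columns.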
Hive Partitions
Hive partitions can be created on any particular field/column data. The hive partitions make your
queries faster and Hadoop by default keeps your entire data divided into partitions into separate
directories. We can create a partitioned table in Hive by this query below; here we are choosing the date
column to be partitioned:
create table table_name (
id int,
date string,
name string
)
partitioned by (date string)
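Data can then be loaded into a specific partition; a sketch (the file path is illustrative):

```sql
-- Lands in the directory for the date='2013-01-01' partition
LOAD DATA LOCAL INPATH '/home/ubuntu/workspace/data/day1.dat'
INTO TABLE table_name PARTITION (date = '2013-01-01');
```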
Hive JDBC
On starting the Hive Thrift server hive can be exposed as a service and we can connect to hive via JDBC.
Connecting to Hive via JDBC is similar to connecting to any relational DB like MySQL etc. Below is a
sample code for demonstrating few common hive queries:
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJDBC {

    private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

    /**
     * @param args
     * @throws SQLException
     */
    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        Connection con =
            DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        stmt.executeQuery("drop table " + tableName);
        ResultSet res = stmt.executeQuery("create table " + tableName
            + " (city string, temperature int)");
        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }
        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }
        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: the file is expected to be ctrl-A separated with two fields per line
        String filepath = "/home/ubuntu/yash_workspace/data";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getInt(2));
        }
        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
        stmt.close();
        con.close();
    }
}
The above code fires two select queries on a newly created table in Hive and prints the output to the
console. Other, more complex queries are left to the reader to explore.
Changing Hive default Data store
Hive by default uses the Derby database as its data store, and many a time we may need to change this
default to some other database. There are a couple of configuration changes that we need to take care
of to change Hive's default data store. Here we will change the default data store to a MySQL DB.
Below are the steps for using a MySQL database as Hive's data store:
• Download MySQL JDBC Driver, and copy the jar file in Hive’s lib folder.
• Create a ‘metastore_db’ database in MySQL.
• Create a user ‘hiveuser’ in MySQL.
• Grant all permissions to 'hiveuser': GRANT ALL ON *.* TO 'hiveuser'@'localhost' IDENTIFIED BY
'your_password';
• Add the following configuration tags to hive-site.xml :
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore_db?createDatabaseIfNotExists=true</value>
<description>The jdbc connection string for the metastore_db you just created</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for jdbc</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
<description>DB username, we just created on MySQL</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>your_password</value>
<description>Password for your user – hiveuser</description>
</property>
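The MySQL-side steps in the bullet list above can be sketched as follows (run in the mysql client; the user name and password are placeholders to match the configuration):

```sql
CREATE DATABASE metastore_db;
CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL ON *.* TO 'hiveuser'@'localhost';
FLUSH PRIVILEGES;
```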
Hive User Defined Functions (UDF)
Hive allows users to create their own user defined functions and use them in their Hive queries. You
need to extend Hive's UDF class to create your own UDF. Below is a small UDF for auto-increment
functionality, which can also be used to get the ROWNUM of a table in Hive (note that the counter
lives inside a single UDF instance, so values are only sequential within one task):
package com.confusedcoders.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

public class AutoIncrUdf extends UDF {

    int lastValue;

    public int evaluate() {
        lastValue++;
        return lastValue;
    }
}
USAGE:
add jar /home/ubuntu/Desktop/HiveUdf.jar;
create temporary function incr as 'com.confusedcoders.hive.udf.AutoIncrUdf';
SELECT userid, incr() as rownum FROM USER LIMIT 10;