Apache Hive Walkthrough




    YASH SHARMA - ConfusedCoders
Table of Contents
       Introducing Apache Hive
          Starting Notes
          What is Apache Hive?
          Motivation for Hive
          Some History
       Deep Dive into Hive
          Hive Architecture
          Hive Data Model
          Hive Query Language (HiveQL)
          Hive Web Interface
          Hive Authorizations
       Getting Hands dirty – Hive Hands On
          Hive Installation
          Sample Hive Queries
          Hive SerDe
          Hive Partitions
          Hive JDBC
          Changing Hive's Default Data Store
          Hive User Defined Functions (UDF)
Introducing Apache Hive


       Starting Notes
       This is a Hive kick-start guide: it covers basic Hive concepts and shows how Hive works on top of
       Hadoop. The tutorial expects you to be familiar with Hadoop basics and to have a preliminary
       understanding of Map-Reduce programming.



       What is Apache Hive?
       Apache Hive is an open source data warehouse system for Hadoop. Hive is an abstraction over Apache
       Hadoop that gives users an SQL-like query interface. The user sees only table-like data and fires
       Hive queries against it; internally, Hive plans, creates and executes Map-Reduce jobs and returns
       the desired results. Hive is suitable for both structured and semi-structured data.



       Motivation for Hive
       The prime motivation for Hive was to let users quickly build business solutions in a familiar
       language rather than having to think in Map-Reduce for every problem. Hive uses its own query
       language, HiveQL, which is very similar to traditional SQL.



       Some History
       Hive written in Java language was initially developed by Facebook, and is under Apache2.0 License, and
       is now being developed by companies like Netflix. The current version of Hive is 0.9.9 which is
       compatible with Hadoop 0.20.x and 1.x.
Deep Dive into Hive


       Hive Architecture
       Below is a diagram showing the Hive architecture and its components. Hive works on top of Hadoop and
       needs Apache Hadoop to be running on your box. Here is a quick note on each component:

           •   Hadoop – Hive needs Hadoop as the base framework to operate on.
           •   Driver – Hive has its own driver to communicate with the Hadoop world.
           •   Command Line Interface (CLI) – the Hive CLI is the console for firing Hive queries; it is
               what we will use to operate on our data.
           •   Web Interface – Hive also provides a web interface for monitoring and administering Hive jobs.
           •   Metastore – the Metastore is Hive's system catalog; it stores all the structural information
               about the tables and partitions in Hive.
           •   Thrift Server – Hive ships with a Thrift server through which Hive can be exposed as a
               service and connected to via JDBC/ODBC.
Apart from the components above, a few more actors are vital to query execution:

           •   The Hive Compiler, which is responsible for the semantic analysis of the input query and for
               creating an execution plan. The execution plan is a DAG of stages.
           •   The Hive Execution Engine, which executes the plan created by the compiler.
           •   The Hive Optimizer, which optimizes the execution plan before it is run.



       Hive Data Model
       Data in Hive is organized into three data models:

           •   Tables – similar to tables in relational databases, with typed columns (see the DDL sketch
               after this list).
           •   Partitions – every table can have one or more partition keys, and the data is stored in
               separate files based on the partition value. Without a partition key all the data would be
               fed to the MR job, whereas with a partition key specified only a small subset of the data is
               passed to the MR jobs. Hive creates a separate directory for each partition to hold its data.
           •   Buckets – the data in each partition may further be divided into buckets based on the hash of
               a chosen column. Each bucket is stored as a file in the partition directory.
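
       As a quick sketch of how tables, partitions and buckets compose in DDL (the table and column names
       here are only illustrative, not part of this tutorial's dataset):

       CREATE TABLE page_views (user_id STRING, url STRING)
       PARTITIONED BY (view_date STRING)
       CLUSTERED BY (user_id) INTO 32 BUCKETS;

       Hive then stores the data under one directory per view_date value, split into 32 bucket files each.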



       Hive Query Language (HiveQL)
       HiveQL is very similar to SQL, and users can use SQL-like syntax for loading and querying tables.
       Hive queries are checked by the compiler for correctness, and an execution plan is created from each
       query. The Hive Executor then runs the execution plan. The Hive Language Manual can be found here:

       https://cwiki.apache.org/confluence/display/Hive/LanguageManual
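
       To see the plan the compiler produces for a query, prefix it with EXPLAIN. A minimal sketch, assuming
       the ratings table created later in this guide:

               hive> EXPLAIN SELECT movie_id, COUNT(*) FROM ratings GROUP BY movie_id;

       The output lists the DAG of stages (map-reduce stages and a fetch stage) that Hive will run.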



       Hive Web Interface
       The Hive web interface is an alternative to the command line interface that we can use for
       administering and monitoring Hive jobs. By default the web interface is served at:

       http://localhost:9999/hwi

       Note: the web interface service must be running for this page to be reachable.
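
       In Hive 0.x the web interface runs as its own Hive service; assuming a standard distribution, it can
       be started with:

               $> hive --service hwi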
Hive Authorizations
       The Hive authorization system consists of Users, Groups and Roles. A role is a named, reusable set of
       grants that can be assigned to users, to groups, or to other roles. Hive roles must be created
       manually before being used. Users and groups need not be created manually: the Metastore determines
       the username of the connecting user and the groups associated with it.
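
       As a sketch of the corresponding DDL (the role and user names here are hypothetical):

               CREATE ROLE analyst;
               GRANT SELECT ON TABLE movies TO ROLE analyst;
               GRANT ROLE analyst TO USER alice;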

       More on users, Groups and Roles can be found here:

       https://cwiki.apache.org/Hive/languagemanual-auth.html
Getting Hands dirty – Hive Hands On


       Hive Installation
       This is a short, crisp guide to installing Hive on top of your Hadoop setup. Apache Hadoop is a
       prerequisite for Apache Hive and must be installed on your box before we can proceed.

           1. Download Hive: You can download Apache Hive from the Apache Hive website (http://hive.apache.org/).
          2. Extract Hive: Extract your Hive archive to any location of your choice.
          3. Export Environment Variables & Path:
             Hive needs three environment variables set:
                  • JAVA_HOME: the path where your Java installation lives.
                  • HADOOP_HOME: the path where your Hadoop is installed.
                  • HIVE_HOME: the path where you've just extracted Hive.

               Set these environment variables accordingly to continue. You can export environment variables
               with shell commands like the ones below (the paths are examples; substitute your own locations):

               $> export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

               $> export HADOOP_HOME=/home/ubuntu/hadoop/

               $> export HIVE_HOME=/home/ubuntu/hive/

               $> export PATH=$PATH:$HADOOP_HOME/bin

               $> export PATH=$PATH:$HIVE_HOME/bin

           4. Create Warehouse directory for Hive:
              Hive stores all its table data under /user/hive/warehouse/ on HDFS, so let's create that path
              (and make it group-writable) with the Hadoop fs commands:

               $> hadoop fs -mkdir /user/hive/warehouse

               $> hadoop fs -chmod g+w /user/hive/warehouse




           5. Start Hadoop:
              Hive needs Hadoop running for its operations, so let's start Hadoop with the start-all script.
              Since you have added HADOOP_HOME/bin to PATH, you should be able to call start-all.sh
              directly; otherwise go to the Hadoop bin directory and issue it from there.

                      $> start-all.sh

           6. Start Hive:
              Finally, start Hive by issuing the command 'hive', and verify the setup by listing the
              databases:

               $> hive

               hive> show databases;

You can also start the Hive server with the command:

                $> hive --service hiveserver

       Sample Hive Queries
       Below are sample Hive queries to create tables and import data into them. The queries assume the data
       files are present on your local file system, with the following formats:

       File: movies.dat

       movie_id:movie_name:tags


       File: ratings.dat

       user_id:movie_id:rating:timestamp



       Download Movie Lens Dataset Here: http://www.grouplens.org/node/73


        CREATE TABLES:
        ———————-------
        CREATE TABLE movies (movie_id int, movie_name string, tags string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
        CREATE TABLE ratings (user_id string, movie_id string, rating float, tmstmp
        string)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';



        LOAD DATA FROM LOCAL PATH (Not HDFS):
        ———————————————————------------------
        LOAD DATA LOCAL INPATH '/home/ubuntu/workspace/data/movies.dat'
        OVERWRITE INTO TABLE movies;
        LOAD DATA LOCAL INPATH '/home/ubuntu/workspace/data/ratings.dat'
        OVERWRITE INTO TABLE ratings;



        VERIFY:
        ———–---
        DESCRIBE movies;

        DESCRIBE ratings;

        OR:

        SELECT * FROM movies LIMIT 10;
        SELECT * FROM ratings LIMIT 10;
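
        With both tables loaded, richer queries follow the usual SQL patterns. A sketch of a join and
        aggregation (the CAST is needed because movie_id was declared int in movies but string in ratings):

        SELECT m.movie_name, AVG(r.rating) AS avg_rating
        FROM ratings r JOIN movies m ON (CAST(r.movie_id AS INT) = m.movie_id)
        GROUP BY m.movie_name;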



       Hive SerDe
        The above queries work fine with single-character delimiters, but we often face situations with a
        complex or multi-character delimiter. In these scenarios we need a Hive SerDe to get our data into
        Hive tables.

        Here is a sample SerDe usage for a USER file which has '::' as the field delimiter. The data in the
        USER file is in this format: id::gender::age::occupation::zipcode

        When using the RegexSerDe we specify a regular expression that splits each line of data into fields.

        Query Using SerDe (note that the contrib RegexSerDe expects every column to be declared STRING; cast
        in your queries where numeric values are needed):

        ---------------------

        CREATE TABLE USER (id STRING, gender STRING, age STRING, occupation STRING,
        zipcode STRING)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
        WITH SERDEPROPERTIES (
        "input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)"
        );
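
        The RegexSerDe lives in Hive's contrib jar, which may need to be added to the session first; the jar
        path below is an assumption, so use the one under your own $HIVE_HOME/lib:

        add jar /home/ubuntu/hive/lib/hive-contrib-0.9.0.jar;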




       Hive Partitions
        Hive partitions can be created on any particular field/column. Partitions make your queries faster:
        Hive keeps the data for each partition in its own directory, so queries that filter on the partition
        key scan only the relevant directories. We can create a partitioned table in Hive with the query
        below; here we choose a date column as the partition key. Note that the partition column is declared
        only in the PARTITIONED BY clause and must not be repeated in the regular column list:

        create table table_name (
          id                int,
          name              string
        )
        partitioned by (date string);
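
        Loading data into a partitioned table then names the target partition explicitly. A sketch with a
        hypothetical local file:

        LOAD DATA LOCAL INPATH '/home/ubuntu/workspace/data/part.dat'
        INTO TABLE table_name PARTITION (date='2013-01-01');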



       Hive JDBC
        Once the Hive Thrift server is started, Hive is exposed as a service and we can connect to it via
        JDBC. Connecting to Hive via JDBC is similar to connecting to any relational DB like MySQL. Below is
        sample code demonstrating a few common Hive queries:
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.SQLException;
        import java.sql.Statement;

        public class HiveJDBC {
          // HiveServer1 JDBC driver class
          private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

          /**
           * @param args
           * @throws SQLException
           */
          public static void main(String[] args) throws SQLException {
            try {
              Class.forName(driverName);
            } catch (ClassNotFoundException e) {
              e.printStackTrace();
              System.exit(1);
            }
            Connection con =
                DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            String tableName = "testHiveDriverTable";
            stmt.executeQuery("drop table " + tableName);
            ResultSet res = stmt.executeQuery("create table " + tableName
                + " (city string, temperature int)");

            // show tables
            String sql = "show tables '" + tableName + "'";
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            if (res.next()) {
              System.out.println(res.getString(1));
            }

            // describe table
            sql = "describe " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            while (res.next()) {
              System.out.println(res.getString(1) + "\t" + res.getString(2));
            }

            // load data into table
            // NOTE: the filepath has to be local to the Hive server, and the file is
            // expected to be ctrl-A separated with two fields per line
            String filepath = "/home/ubuntu/yash_workspace/data";
            sql = "load data local inpath '" + filepath + "' into table " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);

            // select * query
            sql = "select * from " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            while (res.next()) {
              System.out.println(res.getString(1) + "\t" + res.getInt(2));
            }

            // regular hive query
            sql = "select count(1) from " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            while (res.next()) {
              System.out.println(res.getString(1));
            }

            stmt.close();
            con.close();
          }
        }
The above code fires two select queries on a newly created table in Hive and prints the output to
       the console. More complex queries are left to the reader to explore.
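
       To compile and run the client, the Hive and Hadoop jars must be on the classpath. As a rough sketch
       (jar names and locations vary by version, so treat these paths as assumptions):

       $> javac -cp "$HIVE_HOME/lib/*" HiveJDBC.java

       $> java -cp ".:$HIVE_HOME/lib/*:$HADOOP_HOME/hadoop-core-1.0.4.jar" HiveJDBC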




       Changing Hive's Default Data Store
       Hive by default uses an embedded Derby database as its metastore, and we often need to change this
       default data store to some other database. A couple of configuration changes are needed to do so.
       Here we will change the default data store to MySQL.

       Below are the steps for using a MySQL database as Hive's metastore:
           • Download the MySQL JDBC driver, and copy the jar file into Hive's lib folder.
           • Create a 'metastore_db' database in MySQL.
           • Create a user 'hiveuser' in MySQL.
           • Grant all permissions to 'hiveuser': GRANT ALL ON *.* TO 'hiveuser'@'localhost' IDENTIFIED BY
              'your_password';
           • Add the following configuration tags to hive-site.xml:



           <property>
           <name>javax.jdo.option.ConnectionURL</name>
           <value>jdbc:mysql://localhost/metastore_db?createDatabaseIfNotExists=true</value>
           <description>The jdbc connection string for the metastore_db you just created</description>
           </property>


            <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.jdbc.Driver</value>
            <description>Driver class name for JDBC</description>
            </property>


            <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>hiveuser</value>
            <description>DB username we just created in MySQL</description>
            </property>


            <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>your_password</value>
            <description>Password for the hiveuser user</description>
            </property>
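
        For reference, the MySQL side of this setup might look like the following, using the placeholder
        names from the list above:

        mysql> CREATE DATABASE metastore_db;

        mysql> GRANT ALL ON metastore_db.* TO 'hiveuser'@'localhost' IDENTIFIED BY 'your_password';

        mysql> FLUSH PRIVILEGES;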




       Hive User Defined Functions (UDF)
       Hive allows users to create their own User Defined Functions and use them in Hive queries. To create
       a UDF you extend Hive's UDF class. Below is a small UDF providing auto-increment functionality, which
       can also be used to get a ROWNUM-like value for a table in Hive. Note that each task gets its own UDF
       instance, so the counter is only unique within a task, not across the whole job:

       package com.confusedcoders.hive.udf;

       import org.apache.hadoop.hive.ql.exec.UDF;


        public class AutoIncrUdf extends UDF {

            // Counter held per UDF instance, i.e. per task.
            int lastValue;

            public int evaluate() {
                lastValue++;
                return lastValue;
            }
        }



       USAGE:

       add jar /home/ubuntu/Desktop/HiveUdf.jar;

       create temporary function incr as 'com.confusedcoders.hive.udf.AutoIncrUdf';

        SELECT id, incr() AS rownum FROM USER LIMIT 10;
