Hive

BALA KRISHNA G
Global Big Data Bootcamp – Jan 2014
(http://globalbigdataconference.com)

Global Big Data Conference - 2014
My introduction
Senior Software and Research Engineer
Big data trainer
More than 1.5 years of experience with Hadoop and Storm
Worked at various large companies, including Sun/Oracle and IBM

www.linkedin.com/in/gbalakrishna/
bala.gsbk@outlook.com

Speaker : Bala

Agenda
Class structure
– 1 hour lecture and 1 ½ hour lab

Lecture
– Need for Hive
– Hive history
– Hive powered by
– What is Hive?
– Hive Architecture
– Hive Query Life cycle
– Hive Query Language (HiveQL)

Lab:
– Extensive hands-on experience with Hive
– Derive various insights from a real-world dataset using Hive

Need for Hive

Do I need to learn Java?
Don’t worry! I am here to rescue you.
Need for Hive contd.,
In general, a single MR job does not suffice to derive BI (Business Intelligence)
Oftentimes, a series of complex MR jobs must be chained together (advanced data processing)
[Figure: a chain of six MapReduce jobs, MR 1 to MR 6. Legend: MR – Map Reduce, Mapper Task, Reducer Task]
Need for Hive contd.,
20 lines of Hive code can do the work of ~200 lines of Java code
Lowers the development time significantly (~16 times)

[Figure: two bar charts comparing Hadoop and Pig, one for lines of code and one for development time in minutes]
Need for Hive contd.,
Lets you focus on the “WHAT” part of your data analysis
The “HOW” part is taken care of by the framework
Hive powered by

Used for processing large amounts of user data and
central to meeting the company's reporting needs

Data analytics and Data cleaning

Ad hoc queries reporting and analytics
And many more…
https://cwiki.apache.org/confluence/display/Hive/PoweredBy
What is Hive?

Data warehouse built on top of Hadoop
Provides an SQL-like interface to analyze data
An open source project under Apache
Designed for high throughput at the cost of high latency (same as Hadoop)
Ability to plug in custom MapReduce programs
Mainly targeted at structured data
Hides the complexity of MapReduce programs from the end user
Hive Architecture

[Figure: Hive architecture. HIVE: CLI, Web Interface, and Hive Thrift Server (for Python, Perl, and ODBC clients); Driver with Compiler, Optimizer, and Plan executor; Meta Store. HADOOP: Map Reduce and HDFS]
Metastore
Stores table metadata such as database location, owner, creation time, access attributes, and table schema
Comprises two components: 1) Service 2) Data storage
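Three ways to deploy the metastore:
– Embedded Metastore: Driver and Metastore Service run inside the Hive Service, backed by Derby
– Local Metastore: Driver and Metastore Service run inside the Hive Service, backed by MySQL
– Remote Metastore: Driver runs inside the Hive Service and talks to a separate Metastore Server, backed by MySQL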
Hive Query Life cycle Insight

Hive Query Life cycle contd.,
[Figure: the numbered steps (1 to 14) a query takes through the Hive Interface, Driver, Compiler (Parser, Semantic Analyzer, Logical plan generator, Optimizer, Physical plan generator), Metastore, Execution Engine, and Hadoop Map Reduce]
Data Models
Database: Namespace that holds tables
Table: Container of the actual data
Example table sample with columns Id, Name, Age, Sex, State
In the Hive warehouse, the table is stored as a folder:
/user/$USER/warehouse/sample

Data Models contd.,
Partition: Horizontal slice of a table by a partition key
Let's say the sample table is partitioned by the State column
[Figure: the sample table (Id, Name, Age, Sex, State) split horizontally into Partition 1 and Partition 2]
Stored as subfolders under the sample directory, one per partition value:
/user/$USER/warehouse/sample/State=AL/
/user/$USER/warehouse/sample/State=NC/
/user/$USER/warehouse/sample/State=GA/
/user/$USER/warehouse/sample/State=ND/

Data Models contd.,
Bucket: Divides the data into further chunks by another column, which is useful for sampling
Let's say the sample table is partitioned by the ‘State’ column and
clustered by the ‘Age’ column into 2 buckets
In the warehouse, the data is stored as
/user/$USER/warehouse/sample/State=AL/part-00000
/user/$USER/warehouse/sample/State=AL/part-00001
/user/$USER/warehouse/sample/State=GA/part-00000
/user/$USER/warehouse/sample/State=GA/part-00001
...
/user/$USER/warehouse/sample/State=ND/part-00000
/user/$USER/warehouse/sample/State=ND/part-00001
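
A rough sketch of how a partitioned, bucketed layout like the one above might be declared and sampled in HiveQL. The table name sample_bucketed is hypothetical, and the INSERT assumes a source table named sample that already holds the rows.

```sql
-- Hypothetical table: partitioned by state, clustered by age into 2 buckets.
CREATE TABLE sample_bucketed (id INT, name STRING, age INT, sex STRING)
PARTITIONED BY (state STRING)
CLUSTERED BY (age) INTO 2 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- On Hive releases before 2.0, first run: SET hive.enforce.bucketing = true;
-- so that the insert writes one file per bucket.
INSERT OVERWRITE TABLE sample_bucketed PARTITION (state = 'AL')
SELECT id, name, age, sex FROM sample WHERE state = 'AL';

-- Read only the first of the two buckets, i.e. a sample of the rows.
SELECT * FROM sample_bucketed TABLESAMPLE (BUCKET 1 OUT OF 2 ON age);
```

Because rows are assigned to buckets by hashing the age column, sampling one bucket avoids scanning the whole partition.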
Data Loading Techniques
Managed table: table managed by the Hive warehouse
– 1) Copy a file from the local file system into the Hive warehouse
Local FS file → copy → Hive warehouse (HDFS)
– 2) Move a file from HDFS into the Hive warehouse
HDFS file → move → Hive warehouse
A file that is already in HDFS is moved, not copied, when it is loaded
Data Loading Techniques contd.,
External table: table that is only referenced by the Hive warehouse
– 3) Manage the file directly in HDFS without copying it into the Hive warehouse
HDFS file ← referenced ← Hive warehouse
Data Loading Techniques contd.,
Explain when to use an external table and when to use a managed table

Question - 01
In which scenario do you use Hive?
1. Completely unstructured nasty data
2. Structured data
3. Any kind of data
4. None of the above
Question – 01 answer

2. Hive is mainly used to analyze
structured data. Typically, Hive runs on
data generated by a
MapReduce job or Pig

Question - 02
Which option is not correct about the Metastore?
1. It stores the table location
2. It has information about the number of partitions and number of buckets
3. It can give you the time at which the table was created
4. It stores the actual data
Question – 02 answer

4. Metastore stores only the metadata.
Actual data is stored in HDFS.

Question – 03 (last question)
What is incorrect about Hive?
1. Hive internally generates MapReduce jobs to serve your query
2. Hive runs on top of HDFS
3. Hive is proprietary software
4. Hive supports multiple interfaces to interact with
Question – 03 answer

3. Hive is open source, not
proprietary software. The Hive community
is growing very rapidly.

Hive Query Language (Hive QL)
Data types – provides types for variables
DDL – provides a way to define databases, tables, etc.,
DML – provides a way to modify content
Query statements – provides a way to retrieve the content

Data types

Primitive Types
Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes)
Booleans: BOOLEAN (TRUE or FALSE)
Floating point numbers: FLOAT (4 bytes), DOUBLE (8 bytes)
String: STRING (sequence of characters)

Usage
variable_name <Data Type>
ex: name STRING
Data types contd.,
Complex Types

ARRAY: collection of multiple values of the same data type
Usage: name ARRAY<type>
ex: marks ARRAY<INT>

STRUCT: collection of multiple values of different data types
Usage: name STRUCT<field1:type1, field2:type2, field3:type3, …>
ex: record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>

MAP: collection of (key, value) pairs
Usage: name MAP<key type, value type>
ex: score MAP<STRING, INT>
Data types contd.,
Key must be a primitive in MAP
Referencing complex types
Previous example:
– marks ARRAY<INT>
– record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>
– score MAP<STRING, INT>
SELECT marks[0], record.name, score['joe'] FROM <tablename>;

Complex type inside a complex type is allowed
– array inside a struct (as seen before)
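
To make the referencing rules concrete, here is a minimal sketch that declares the three complex columns in one table and reads from them. The table name student_scores is hypothetical; the column names follow the examples above.

```sql
-- Hypothetical table holding one column of each complex type.
CREATE TABLE student_scores (
  marks  ARRAY<INT>,
  record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>,
  score  MAP<STRING, INT>
);

SELECT
  marks[0],          -- array element by zero-based index
  record.name,       -- struct field by dot notation
  record.marks[1],   -- array nested inside a struct
  score['joe'],      -- map value by key
  size(marks)        -- number of elements in the array
FROM student_scores;
```

Note that struct fields are declared with a colon between the field name and its type.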

DDL
CREATE TABLE sample(id INT, name STRING, age INT, sex STRING)  -- schema
COMMENT 'This is a sample table'  -- comment for readability
PARTITIONED BY (state STRING)  -- partition data by the state column
ROW FORMAT DELIMITED  -- rows are delimited by '\n'
FIELDS TERMINATED BY ','  -- fields are terminated by ','
STORED AS TEXTFILE;  -- store the data as a text file

The partition column (state) is declared only in PARTITIONED BY; Hive rejects a column that is repeated in the column list
Table is created in the warehouse directory and is completely managed by Hive
A specific row format and file format can be expressed with a custom SerDe

SerDe

SerDe stands for Serializer and Deserializer

Deserializer (reading):
HDFS File → InputFileFormat → <Key, Value> → Deserializer → Row

Serializer (writing):
Row → Serializer → <Key, Value> → OutputFileFormat → HDFS File
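
As an illustration of plugging a SerDe into a table definition, the sketch below uses the OpenCSVSerde class, assuming a Hive release that ships it (0.14 or later, which is newer than this deck). The table name sample_csv is hypothetical.

```sql
-- Hypothetical table; OpenCSVSerde exposes every column as STRING,
-- so the columns are declared that way.
CREATE TABLE sample_csv (id STRING, name STRING, age STRING, sex STRING, state STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE;
```

Here the SerDe, rather than ROW FORMAT DELIMITED, decides how each line of the file becomes a row, which helps when fields contain quoted commas.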

DDL contd.,
CREATE EXTERNAL TABLE external_sample(id INT, name STRING,
age INT, sex STRING, state STRING)
LOCATION '/user/department/sample';

Table is not created in the warehouse directory; it is only referenced by Hive
The referenced data stays in HDFS under the directory /user/department/sample

DDL contd.,
DROP TABLE sample;
Since the sample table is managed by Hive, this deletes the entire data along with
the metadata
DROP TABLE external_sample;
Since the external_sample table is *not* managed by Hive, this deletes only the
metadata, leaving the actual data untouched
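
One way to check this behaviour, sketched under the assumption that both tables exist as defined earlier and that the statements are run from the Hive CLI (the dfs command is a CLI feature, not HiveQL):

```sql
-- Check how Hive classifies each table before dropping anything.
DESCRIBE FORMATTED sample;           -- output includes Table Type: MANAGED_TABLE
DESCRIBE FORMATTED external_sample;  -- output includes Table Type: EXTERNAL_TABLE

-- Dropping the external table removes only its metastore entry.
DROP TABLE IF EXISTS external_sample;

-- The files under the table's LOCATION should still be listed afterwards.
dfs -ls /user/department/sample;
```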

DML
Load data into managed table from local file system
LOAD DATA LOCAL INPATH '/home/hive/sample.txt' INTO TABLE sample;
The file '/home/hive/sample.txt' is in the local file system
It is copied into the Hive warehouse folder

Load data into managed table from HDFS
LOAD DATA INPATH '/user/hive/sample.txt' INTO TABLE sample;
The file '/user/hive/sample.txt' is in HDFS
It is moved (not copied) into the Hive warehouse folder
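
When the target table is partitioned, as sample is by state, a load normally names the partition it fills. A minimal sketch, assuming a hypothetical file sample_AL.txt that contains only the rows for state AL:

```sql
-- The file name is hypothetical. Its rows carry no state field,
-- because the partition value comes from the PARTITION clause.
LOAD DATA LOCAL INPATH '/home/hive/sample_AL.txt'
INTO TABLE sample PARTITION (state = 'AL');

-- List the partitions Hive now knows about.
SHOW PARTITIONS sample;

-- A filter on the partition column lets Hive read only the matching subfolder.
SELECT id, name, age FROM sample WHERE state = 'AL';
```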

DML contd.,
Insert results into a new table
INSERT OVERWRITE TABLE newsample
SELECT * FROM sample;
The newsample table must be created beforehand
The select query results are loaded into newsample, overwriting its existing content

Create a new table with automatically derived schema
CREATE TABLE newsample
AS SELECT * FROM sample;
Creates the newsample table with an automatically derived schema
The query results are populated into it

Query statements
To list available databases
SHOW DATABASES;

To use a particular database
USE <databasename>;

To list all tables available in a database
SHOW TABLES;

Query statements contd.,
Select
SELECT * FROM sample;

Aggregation functions
SELECT COUNT(DISTINCT state) FROM sample;

Group by, Sort by, Order by
SELECT COUNT(*) FROM sample GROUP BY state;
SELECT * FROM sample SORT BY id DESC;

FROM sample SELECT * ORDER BY id ASC;
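
The difference between the two sorting clauses is easy to miss, so here is a short sketch against the same sample table. ORDER BY produces one globally sorted result, while SORT BY only sorts the rows within each reducer; DISTRIBUTE BY controls which reducer a row goes to.

```sql
-- Count rows per state, keeping the grouping column in the output.
SELECT state, COUNT(*) AS people
FROM sample
GROUP BY state;

-- Total ordering across the whole result.
SELECT * FROM sample ORDER BY id ASC;

-- Send all rows of a state to the same reducer, then sort within each reducer.
SELECT * FROM sample
DISTRIBUTE BY state
SORT BY state ASC, id DESC;
```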

Query statements contd.,
Joins
SELECT s.*, o.*
FROM sample s
JOIN orders o
ON (s.id = o.id);

Left and right outer joins are also supported
Multiple joins in one query are accepted
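
A sketch of the outer join variant, reusing the sample and orders tables from the query above. Only the id column of orders is known from the slide, so nothing else is selected from it.

```sql
-- Keep every row of sample, with NULLs where no order matches.
SELECT s.id, s.name, o.id AS order_id
FROM sample s
LEFT OUTER JOIN orders o
ON (s.id = o.id);

-- Rows of sample that have no matching row in orders.
SELECT s.id, s.name
FROM sample s
LEFT OUTER JOIN orders o
ON (s.id = o.id)
WHERE o.id IS NULL;
```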

Custom Functions
UDF:
– User-defined function
– Complex/additional logic can be expressed
– Operates row by row

UDAF:
– User-defined aggregate function
– Custom aggregation logic can be written
– Operates on the groups produced by the GROUP BY clause

UDTF:
– User-defined table-generating function
– Takes a single input row and can emit multiple output rows
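
Hive's built-in functions fall into the same three categories, which gives a quick way to see how each kind behaves before writing a custom one. The table student_marks below is hypothetical.

```sql
-- Hypothetical table for illustration.
CREATE TABLE student_marks (name STRING, state STRING, marks ARRAY<INT>);

-- UDF style: one value in, one value out, for every row.
SELECT upper(name) FROM student_marks;

-- UDAF style: one value per group.
SELECT state, count(*) AS students
FROM student_marks
GROUP BY state;

-- UDTF style: explode() turns one row into one row per array element.
SELECT name, mark
FROM student_marks
LATERAL VIEW explode(marks) m AS mark;
```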

Hive Limitations
Not suitable for unstructured data
Suited to OLAP systems (analysis) rather than OLTP
Expressing machine learning algorithms can be a challenging
task
Performance trade-off compared with hand-written MR programs in various
scenarios
– The gap is narrowing from release to release

Important practical tips
Hive logs: /tmp/$USER/hive.log
To list the available functions: SHOW FUNCTIONS
To get help on a specific function: DESCRIBE FUNCTION
<function_name>
Explain the config files in the /usr/lib/hive/conf folder
– hive-site.xml, hive-default.xml, (or) specify custom file using -f option ?

Setting parameters in the Hive session
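
A few statements that cover these tips, as they might be typed in a Hive session. The property hive.cli.print.header is only an example of a session parameter.

```sql
-- List the functions, then ask for help on one of them.
SHOW FUNCTIONS;
DESCRIBE FUNCTION upper;
DESCRIBE FUNCTION EXTENDED upper;

-- Show the current value of a parameter, then change it for this session only.
SET hive.cli.print.header;
SET hive.cli.print.header=true;

-- Print all Hive and Hadoop configuration values visible to the session.
SET -v;
```

Values changed with SET last only for the current session; permanent settings belong in hive-site.xml.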

References
White, Tom. Hadoop: The Definitive Guide
https://cwiki.apache.org/confluence/display/Hive/Home
http://www.sfbayacm.org/wp/wp-content/uploads/2010/01/sig_2010_v21.pdf
Venner, Jason (2009). Pro Hadoop
http://hortonworks.com/big-data-insights/how-facebook-uses-hadoop-and-hive/

Q/A

Backup slides

Schema on Read (?)
[To do] where to put this slide?
Explain what is schema on read
Explain what is schema on write
Advantages of using schema on read
– Faster load time
– Impacts query time
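
A small sketch of what schema on read means in practice. The table raw_people and its location are hypothetical; the point is that the files are not checked against the schema when the table is defined or loaded, only when a query reads them.

```sql
-- Defining the table does not parse or validate the files under LOCATION.
CREATE EXTERNAL TABLE raw_people (id INT, name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/department/raw_people';

-- The schema is applied here, at query time. With the default text format,
-- a field that cannot be read as INT comes back as NULL instead of failing,
-- so this counts rows whose age is missing or unreadable.
SELECT COUNT(*) AS rows_without_age
FROM raw_people
WHERE age IS NULL;
```

Loading is fast because nothing is parsed up front; the parsing cost is paid on every query instead.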

UiPath Community: Communication Mining from Zero to Hero
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 

Hive SQL Query Language for Analyzing Big Data

  • 1. Hive BALA KRISHNA G Global Big Data Bootcamp – Jan 2014 (http://globalbigdataconference.com) Global Big Data Conference - 2014
  • 2. My introduction: Senior Software and Research Engineer. Big data trainer. More than 1.5 years of experience with Hadoop and Storm. Worked at various large companies (Sun/Oracle, IBM, etc.). www.linkedin.com/in/gbalakrishna/ bala.gsbk@outlook.com
  • 3. Agenda. Class structure: 1 hour lecture and 1 ½ hour lab. Lecture: Need for Hive; Hive history; Hive powered by; What is Hive?; Hive Architecture; Hive Query Life cycle; Hive Query Language (HiveQL). Lab: extensive hands-on experience with Hive; derive various insights from a real-world dataset using Hive.
  • 4. Need for Hive: “Do I need to learn Java?” “Don’t worry! I am here to rescue you.”
  • 5. Need for Hive contd.: In general, one MR job is not sufficient to derive BI (Business Intelligence). Oftentimes a series of complex MR jobs chained together is required (advanced data processing). [Diagram: six chained jobs, MR 1 to MR 6. Legend: MR = MapReduce, mapper task, reducer task.]
  • 6. Need for Hive contd.: 20 lines of Hive code can correspond to ~200 lines of Java code. This lowers development time significantly (~16 times). [Two bar charts comparing Hadoop and Pig: lines of code, and development time in minutes.]
  • 7. Need for Hive contd.: You focus only on the “WHAT” part of your data analysis. The “HOW” part is taken care of by the framework.
  • 8. Hive powered by [company logos not preserved]. Listed uses: processing large amounts of user data and meeting central company reporting needs; data analytics and data cleaning; ad hoc queries, reporting, and analytics; and many more… https://cwiki.apache.org/confluence/display/Hive/PoweredBy
  • 9. What is Hive? A data warehouse built on top of Hadoop. Provides an SQL-like interface for analyzing data. An open source project under Apache. Works on the high-throughput, high-latency principle (same as Hadoop). Can plug in custom MapReduce programs. Mainly targeted at structured data. Hides MapReduce program complexity from the end user.
  • 10. Hive Architecture [diagram]. Interfaces: CLI, Web Interface, and Hive Thrift Server (clients: Python, ODBC, Perl). Hive: Driver (Compiler, Optimizer, Plan executor) and Metastore. Hadoop: MapReduce and HDFS.
  • 11. Metastore: Stores table metadata such as database location, owner, creation time, access attributes, table schema, etc. Comprises two components: 1) service, 2) data storage. [Diagram of three configurations inside the Hive service. Embedded Metastore: Driver and Metastore Service, backed by Derby. Local Metastore: Driver and Metastore Service, backed by MySQL. Remote Metastore: Driver talks to a separate Metastore Server, backed by MySQL.]
  • 12. Hive Query Life cycle [diagram; surviving label: Insight]
  • 13. Hive Query Life cycle contd. [Diagram: a query passes through 14 numbered steps across the Hive Interface, Driver, Compiler (Parser, Semantic Analyzer, Logical plan generator, Optimizer, Physical plan generator), Metastore, Execution Engine, and Hadoop MapReduce.]
  • 14. Data Models. Database: a namespace for tables. Table: container of the actual data. Example table: sample (Id, Name, Age, Sex, State). In the Hive warehouse a table is stored as a folder: /user/$USER/warehouse/sample
  • 15. Data Models contd.: Partition: a horizontal slice of a table by a partition key. Say the sample table (Id, Name, Age, Sex, State) is partitioned by the State column. Each partition is stored as a subfolder under the sample directory: /user/$USER/warehouse/sample/State=AL/ /user/$USER/warehouse/sample/State=NC/ /user/$USER/warehouse/sample/State=GA/ /user/$USER/warehouse/sample/State=ND/
  • 16. Data Models contd.: Bucket: divides the data into further chunks by another column, for sampling. Say the sample table is partitioned by the ‘State’ column and clustered by the ‘Age’ column into 2 buckets. In the warehouse, the data is stored as: /user/$USER/warehouse/sample/State=AL/part-00000 /user/$USER/warehouse/sample/State=AL/part-00001 /user/$USER/warehouse/sample/State=GA/part-00000 /user/$USER/warehouse/sample/State=GA/part-00001 ... /user/$USER/warehouse/sample/State=ND/part-00000 /user/$USER/warehouse/sample/State=ND/part-00001
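The partition-plus-bucket layout described on the Data Models slides can be sketched in HiveQL as follows. This is a minimal sketch, not the speaker's lab code: the table name sample_bucketed is hypothetical, and the columns simply mirror the slides' sample table.

```sql
-- Hypothetical table; columns follow the slides' "sample" table.
-- The partition column (state) is declared only in PARTITIONED BY.
CREATE TABLE sample_bucketed (
  id   INT,
  name STRING,
  age  INT,
  sex  STRING
)
PARTITIONED BY (state STRING)        -- one subfolder per state value
CLUSTERED BY (age) INTO 2 BUCKETS    -- two bucket files inside each partition
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Bucketing is what makes cheap sampling possible:
-- this reads only the first of the two buckets.
SELECT *
FROM sample_bucketed TABLESAMPLE (BUCKET 1 OUT OF 2 ON age);
```

On older Hive releases (before 2.0), inserts honor the bucket count only when hive.enforce.bucketing is set to true, so it is worth checking the setting for the version in use.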
  • 17. Data Loading Techniques. Managed table: a table managed by the Hive warehouse. 1) Local FS file → (copy) → HDFS file in the Hive warehouse: the file is copied from the local file system into the Hive warehouse. 2) HDFS file → Hive warehouse: the file is brought from HDFS into the Hive warehouse (a LOAD DATA without LOCAL moves the file rather than copying it).
  • 18. Data Loading Techniques contd.: External table: a table that is only referenced by the Hive warehouse. 3) HDFS file ← (referenced) ← Hive warehouse: the file is managed directly in HDFS without copying it into the Hive warehouse.
  • 19. Data Loading Techniques contd.: Discussion: when should you use an external table, and when a managed table?
  • 20. Question 01: In which scenario do you use Hive? 1. Completely unstructured nasty data 2. Structured data 3. Any kind of data 4. None of the above
  • 21. Question 01, answer: 2. Hive is mainly used to analyze structured data. Typically, Hive runs on data generated by a MapReduce job or by Pig.
  • 22. Question 02: Which option is not correct about the Metastore? 1. It stores the table location 2. It has information about the number of partitions and number of buckets 3. It can give you the time at which the table was created 4. It stores the actual data
  • 23. Question 02, answer: 4. The Metastore stores only the metadata. The actual data is stored in HDFS.
  • 24. Question 03 (last question): What is incorrect about Hive? 1. Hive internally generates MapReduce jobs to serve your query 2. Hive runs on top of HDFS 3. Hive is proprietary software 4. Hive supports multiple interfaces to interact with
  • 25. Question 03, answer: 3. Hive is open source, not proprietary software. The Hive community is growing very rapidly.
  • 26. Hive Query Language (HiveQL). Data types: provide types for variables. DDL: provides a way to define databases, tables, etc. DML: provides a way to modify content. Query statements: provide a way to retrieve content.
  • 27. Data types. Primitive types. Integers: TINYINT (1 byte), SMALLINT (2 bytes), INT (4 bytes), BIGINT (8 bytes). Booleans: BOOLEAN (TRUE or FALSE). Floating point numbers: FLOAT (4 bytes), DOUBLE (8 bytes). String: STRING (sequence of characters). Usage: variable_name <data type>, e.g. name STRING
  • 28. Data types contd.: Complex types. ARRAY: a collection of multiple values of the same data type. Usage: name ARRAY<primitive type>, e.g. marks ARRAY<INT>. STRUCT: a collection of multiple values of different data types. Usage: name STRUCT<field1:type1, field2:type2, field3:type3, …>, e.g. record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>> (each field is written as name:type). MAP: a collection of (key, value) pairs. Usage: name MAP<key type, value type>, e.g. score MAP<STRING, INT>
  • 29. Data types contd.: The key in a MAP must be a primitive type. Referencing complex types, using the previous examples (marks ARRAY<INT>; record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>; score MAP<STRING, INT>): SELECT marks[0], record.name, score['joe'] A complex type inside a complex type is allowed, e.g. an array inside a struct (as seen above).
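To make the complex-type declarations and references above concrete, here is a small HiveQL sketch. The table name students is hypothetical; the column names and types are the ones used on the data type slides, and the default row format is assumed.

```sql
-- Hypothetical table holding one column of each complex type from the slides.
CREATE TABLE students (
  marks  ARRAY<INT>,
  record STRUCT<name:STRING, id:INT, marks:ARRAY<INT>>,  -- struct fields are name:type
  score  MAP<STRING, INT>                                -- map key must be primitive
);

-- Arrays are indexed from 0, struct fields use dot notation,
-- and map values are looked up by key.
SELECT marks[0],
       record.name,
       record.marks[0],   -- array nested inside the struct
       score['joe']
FROM students;
```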
  • 30. DDL: CREATE TABLE sample (id INT, name STRING, age INT, sex STRING) COMMENT 'This is a sample table' PARTITIONED BY (state STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; Annotations: the column list is the schema; COMMENT adds a comment for readability; PARTITIONED BY partitions the data by the state column (the partition column is declared only here, because Hive rejects a column repeated in both the column list and the partition list); ROW FORMAT DELIMITED means rows are delimited by '\n'; fields are terminated by ','; the file is stored as a text file. The table is created in the warehouse directory and is completely managed by Hive. A specific row format and file format can be expressed with a custom SerDe.
  • 31. SerDe: SerDe stands for Serializer and Deserializer. Deserializer (read path): HDFS file → InputFileFormat → <Key, Value> → Deserializer → Row. Serializer (write path): Row → Serializer → <Key, Value> → OutputFileFormat → HDFS file.
  • 32. DDL contd.: CREATE EXTERNAL TABLE external_sample (id INT, name STRING, age INT, sex STRING, state STRING) LOCATION '/user/department/sample'; The table is not created in the warehouse directory; it is only referenced by Hive. The referenced data is in HDFS, at the path /user/department/sample.
  • 33. DDL contd.: DROP TABLE sample; Since the sample table is managed by Hive, this deletes the entire data along with the metadata. DROP TABLE external_sample; Since the external_sample table is *not* managed by Hive, this deletes only the metadata and leaves the actual data untouched.
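The managed versus external distinction matters most at drop time, so it can help to check a table's type before dropping it. The sketch below uses the slides' table names (sample, external_sample) and assumes both tables already exist.

```sql
-- The "Table Type" row of the output reads MANAGED_TABLE or EXTERNAL_TABLE,
-- and the "Location" row shows where the data lives.
DESCRIBE FORMATTED sample;
DESCRIBE FORMATTED external_sample;

-- Managed table: metadata and the warehouse data are both removed.
DROP TABLE IF EXISTS sample;

-- External table: only the metadata is removed;
-- the files under /user/department/sample stay in HDFS.
DROP TABLE IF EXISTS external_sample;
```

IF EXISTS is optional; it only suppresses the error when the table is already gone.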
  • 34. DML. Load data into a managed table from the local file system: LOAD DATA LOCAL INPATH '/home/hive/sample.txt' INTO TABLE sample; The file '/home/hive/sample.txt' is on the local file system and is copied into the Hive warehouse folder. Load data into a managed table from HDFS: LOAD DATA INPATH '/user/hive/sample.txt' INTO TABLE sample; The file '/user/hive/sample.txt' is in HDFS and is moved (not copied) into the Hive warehouse folder.
  • 35. DML contd.: Insert results into an existing table: INSERT OVERWRITE TABLE newsample SELECT * FROM sample; The newsample table must be created beforehand; the query results are loaded into newsample, overwriting its contents. Create a new table with an automatically derived schema: CREATE TABLE newsample AS SELECT * FROM sample; This creates the newsample table with a schema derived from the query and populates it with the query results.
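Because the sample table on the DDL slide is partitioned by state, loads and inserts into it typically have to say which partition they target. The following is a sketch under assumptions: the file name sample_al.txt and the table sample_by_state are hypothetical, and the two SET lines enable dynamic partitioning for the session.

```sql
-- Static partition: every row in this file goes into state = 'AL'.
LOAD DATA LOCAL INPATH '/home/hive/sample_al.txt'
INTO TABLE sample PARTITION (state = 'AL');

-- Hypothetical target table with the same layout as sample.
CREATE TABLE sample_by_state (id INT, name STRING, age INT, sex STRING)
PARTITIONED BY (state STRING);

-- Dynamic partitioning: Hive derives the partition from the last
-- column of the SELECT instead of from a literal value.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE sample_by_state PARTITION (state)
SELECT id, name, age, sex, state
FROM sample;
```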
  • 36. Query statements. To list available databases: SHOW DATABASES; To use a particular database: USE <databasename>; To list all tables available in a database: SHOW TABLES;
  • 37. Query statements contd.: Select: SELECT * FROM sample; Aggregation functions: SELECT COUNT(DISTINCT state) FROM sample; Group by, sort by, order by: SELECT COUNT(*) FROM sample GROUP BY state; SELECT * FROM sample SORT BY id DESC; SELECT * FROM sample ORDER BY id ASC;
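The difference between SORT BY and ORDER BY is easy to miss in the one-line examples above, so here is a slightly fuller sketch against the slides' sample table. The alias people is illustrative only.

```sql
-- Selecting the grouping key alongside the count makes the result readable.
SELECT state, COUNT(*) AS people
FROM sample
GROUP BY state;

-- ORDER BY produces a single total order over all rows.
SELECT * FROM sample ORDER BY id ASC;

-- SORT BY only orders rows within each reducer. Pairing it with
-- DISTRIBUTE BY sends all rows for a state to the same reducer,
-- so each state's rows come out sorted by id.
SELECT *
FROM sample
DISTRIBUTE BY state
SORT BY state, id DESC;
```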
  • 38. Query statements contd.: Joins: SELECT s.*, o.* FROM sample s JOIN orders o ON (s.id = o.id); Left and right outer joins are also supported. Multiple joins in one query are accepted.
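As a companion to the inner join above, an outer join might look like the sketch below. It reuses the slide's sample and orders tables and the same join key; the shape of orders beyond its id column is not given in the deck, so the query selects o.* rather than naming columns.

```sql
-- Keeps every row of sample; the orders columns are NULL
-- for rows that have no matching order.
SELECT s.*, o.*
FROM sample s
LEFT OUTER JOIN orders o ON (s.id = o.id);
```

Swapping LEFT for RIGHT keeps every row of orders instead.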
  • 39. Custom Functions. UDF: user-defined function; expresses complex or additional logic; operates row by row. UDAF: user-defined aggregate function; expresses custom aggregation logic; operates on the groups produced by a GROUP BY clause. UDTF: user-defined table-generating function; takes one input row and can emit multiple output rows.
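Custom functions are called the same way as Hive's built-in ones, so the three kinds can be illustrated with built-ins. This is a sketch: upper, avg, and explode are standard Hive functions, while the student_marks table is hypothetical.

```sql
-- UDF-style: one value in, one value out, applied row by row.
SELECT upper(name) FROM sample;

-- UDAF-style: one value per group produced by GROUP BY.
SELECT state, avg(age) FROM sample GROUP BY state;

-- UDTF-style: explode turns one row holding an array into
-- one output row per array element.
CREATE TABLE student_marks (name STRING, marks ARRAY<INT>);

SELECT name, mark
FROM student_marks
LATERAL VIEW explode(marks) m AS mark;
```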
  • 40. Hive Limitations: Not suitable for unstructured data. Well suited to OLAP (analysis) systems. Expressing machine learning algorithms can be challenging. There is a performance trade-off against hand-written MR programs in various scenarios; the gap is narrowing from release to release.
  • 41. Important practical tips. Hive logs: /tmp/$USER/hive.log To list the available functions: SHOW FUNCTIONS; To get help on a specific function: DESCRIBE FUNCTION <function_name>; Config files (to explain): the ones in the /usr/lib/hive/conf folder (hive-site.xml, hive-default.xml), or specify a custom file using the -f option (?). Setting parameters in the Hive session.
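The tips above can be tried directly at the Hive prompt. The sketch below uses upper as an arbitrary example function and hive.cli.print.header as an arbitrary example parameter; both are standard, but any function or parameter works the same way.

```sql
-- List every function Hive knows about (built-in and registered).
SHOW FUNCTIONS;

-- Short and extended help for one function.
DESCRIBE FUNCTION upper;
DESCRIBE FUNCTION EXTENDED upper;

-- SET with only a name prints the current value;
-- SET name = value changes it for this session only.
SET hive.cli.print.header;
SET hive.cli.print.header = true;
```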
  • 42. References: Tom White, Hadoop: The Definitive Guide. Jason Venner (2009), Pro Hadoop. https://cwiki.apache.org/confluence/display/Hive/Home http://www.sfbayacm.org/wp/wp-content/uploads/2010/01/sig_2010_v21.pdf http://hortonworks.com/big-data-insights/how-facebook-uses-hadoop-and-hive/
  • 43. Q/A
  • 45. Backup slides
  • 46. Schema on Read (?) [To do: where to put this slide?] Explain what schema on read is. Explain what schema on write is. Advantage of schema on read: faster load time. Cost: it impacts query time.