Hive

Data Warehousing on Hadoop
HIVE

Hadoop is great for large-data processing!
But writing Java programs for everything is verbose and slow
Analysts don’t want to (or can’t) write Java
Solution: develop higher-level data processing languages
Hive: HQL is like SQL
Pig: Pig Latin is a bit like Perl
Need for High-Level Languages

Problem: Data, data and more data
200GB per day in March 2008
2+TB(compressed) raw data per day today
The Hadoop Experiment
Availability and scalability much superior to commercial DBs
Efficiency not that great; required more hardware
Partial availability/resilience/scale more important than ACID
Problem: Programmability and Metadata
Map-reduce hard to program (users know SQL/bash/Python)
Need to publish data in well-known schemas
Why Hive?

Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MR, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe
HIVE: Components

Tables
Typed columns (int, float, string, boolean)
Also list and map types (for JSON-like data)
Partitions
For example, range-partition tables by date
Command: PARTITIONED BY
Buckets
Hash partitions within ranges (useful for sampling, join optimization)
Command: CLUSTERED BY
Data Model
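The bucketing idea (hash-partition rows into a fixed number of buckets) can be sketched in Python. This is only an illustration: `bucket_for`, `bucketize`, and the use of CRC32 as the hash are assumptions made for the example, not Hive's actual hash function or API.

```python
import zlib
from collections import defaultdict

def bucket_for(key: str, num_buckets: int) -> int:
    """Map a key to a bucket index using a deterministic hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def bucketize(rows, key_index, num_buckets):
    """Group rows into buckets by hashing the column at key_index."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(str(row[key_index]), num_buckets)].append(row)
    return dict(buckets)

# Hypothetical rows: (userid, ds)
rows = [(111, "2012-02-24"), (222, "2012-02-24"), (111, "2012-02-25")]
buckets = bucketize(rows, key_index=0, num_buckets=4)
```

Because equal keys always hash to the same bucket, reading a single bucket gives a sample of the data, and two tables bucketed the same way on the join key can be joined bucket by bucket.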

Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and many other relational databases
Metastore

Warehouse directory in HDFS
E.g., /user/hive/warehouse
Tables stored in subdirectories of warehouse
Partitions form subdirectories of tables
Actual data stored in flat files
Control char-delimited text, or SequenceFiles
With custom SerDe, can use arbitrary format
Physical Layout
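The directory layout above can be sketched as a small path builder. `partition_path` is a hypothetical helper written for illustration; the `column=value` directory naming follows Hive's usual convention, and the warehouse location is the example path from the slide.

```python
import posixpath

WAREHOUSE_DIR = "/user/hive/warehouse"  # example location from the slide

def partition_path(table, partition_spec, warehouse_dir=WAREHOUSE_DIR):
    """Build the HDFS directory for a table partition (hypothetical helper).

    partition_spec is an ordered list of (column, value) pairs; each pair
    becomes one `column=value` subdirectory under the table directory.
    """
    parts = [f"{col}={val}" for col, val in partition_spec]
    return posixpath.join(warehouse_dir, table, *parts)

path = partition_path("sample", [("ds", "2012-02-24")])
# path is "/user/hive/warehouse/sample/ds=2012-02-24"
```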

[Architecture diagram. Components shown: Hive CLI (DDL, Queries, Browsing); Mgmt. Web UI; Thrift API; Hive QL (Parser, Planner, Execution); MetaStore; SerDe (Thrift, Jute, JSON, ...); Map Reduce; HDFS]
HIVE: Components

CREATE TABLE sample (foo INT, bar STRING)
PARTITIONED BY (ds STRING);
SHOW TABLES '.*s';
DESCRIBE sample;
ALTER TABLE sample ADD COLUMNS (new_col INT);
DROP TABLE sample;
Examples – DDL Operations

LOAD DATA LOCAL INPATH './sample.txt'
  OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
LOAD DATA INPATH '/user/falvariz/hive/sample.txt'
  OVERWRITE INTO TABLE sample PARTITION (ds='2012-02-24');
Examples – DML Operations

FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script'
    AS (dt, uid)
  CLUSTER BY (dt)) map
INSERT INTO TABLE pv_users_reduced
  SELECT TRANSFORM(map.dt, map.uid) USING 'reduce_script' AS (date, count);
Running Custom Map/Reduce Scripts
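What `map_script` and `reduce_script` might look like can be sketched in Python. The slide does not show the scripts, so the logic below (swap the columns, then count rows per `dt`) is an assumption; the sketch relies on TRANSFORM passing rows to the script as tab-separated text lines, and it uses in-memory lists in place of standard input and output.

```python
from itertools import groupby

def map_rows(lines):
    """Map step (assumed logic): turn (userid, date) rows into (dt, uid) rows."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_rows(lines):
    """Reduce step (assumed logic): count rows per dt.

    Expects rows with the same dt to be adjacent, which is what
    CLUSTER BY (dt) provides to the reduce script.
    """
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(parsed, key=lambda fields: fields[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"

# Hypothetical input rows: userid <TAB> date
raw = ["111\t2008-03-03\n", "222\t2008-03-03\n", "111\t2008-03-04\n"]
mapped = sorted(map_rows(raw))          # sorted() stands in for CLUSTER BY (dt)
reduced = list(reduce_rows(mapped))
```

A real script would loop over `sys.stdin` and print each output row instead of taking and returning lists.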

Stage           Machine 1                                Machine 2
Input           <k1, v1>, <k2, v2>, <k3, v3>             <k4, v4>, <k5, v5>, <k6, v6>
Local Map       <nk1, nv1>, <nk2, nv2>, <nk3, nv3>       <nk2, nv4>, <nk2, nv5>, <nk1, nv6>
Global Shuffle  <nk2, nv4>, <nk2, nv5>, <nk2, nv2>       <nk1, nv1>, <nk3, nv3>, <nk1, nv6>
Local Sort      <nk2, nv4>, <nk2, nv5>, <nk2, nv2>       <nk1, nv1>, <nk1, nv6>, <nk3, nv3>
Local Reduce    <nk2, 3>                                 <nk1, 2>, <nk3, 1>
(Simplified) Map Reduce Review
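The four stages in this review (local map, global shuffle, local sort, local reduce) can be simulated in a few lines of Python. `run_map_reduce` is a hypothetical in-memory helper, not a Hadoop API; the input records are simplified to the new keys from the slide, and the reduce function counts the values per key.

```python
from collections import defaultdict
from itertools import groupby

def run_map_reduce(splits, map_fn, reduce_fn, num_reducers=2):
    """Simulate map -> shuffle -> sort -> reduce over in-memory splits."""
    # Local map: each machine maps its own records to (new_key, new_value) pairs.
    mapped = [[pair for record in split for pair in map_fn(record)] for split in splits]

    # Global shuffle: every pair with the same key is routed to the same reducer.
    partitions = defaultdict(list)
    for machine_output in mapped:
        for key, value in machine_output:
            partitions[hash(key) % num_reducers].append((key, value))

    results = {}
    for pairs in partitions.values():
        # Local sort: order pairs by key so equal keys are adjacent.
        pairs.sort(key=lambda kv: kv[0])
        # Local reduce: one reduce call per distinct key.
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            results[key] = reduce_fn(key, [value for _, value in group])
    return results

# One split per machine; each record stands for the key its map output will carry.
splits = [["nk1", "nk2", "nk3"], ["nk2", "nk2", "nk1"]]
counts = run_map_reduce(
    splits,
    map_fn=lambda record: [(record, 1)],
    reduce_fn=lambda key, values: len(values),
)
```

Which reducer a key lands on may differ between runs, but all pairs for one key always meet on the same reducer, so the counts match the slide: 3 for `nk2`, 2 for `nk1`, 1 for `nk3`.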

• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
page_view
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
X
user
userid  age  gender
111     25   female
222     32   male
=
pv_users
pageid  age
1       25
2       25
1       32
Hive QL – Join

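Input: page_view
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
Input: user
userid  age  gender
111     25   female
222     32   male
Map (key = userid; value = <table tag, column value>, with tag 1 for page_view and tag 2 for user)
Map output for page_view
key  value
111  <1,1>
111  <1,2>
222  <1,1>
Map output for user
key  value
111  <2,25>
222  <2,32>
Shuffle and Sort
Reduce input for key 111
key  value
111  <1,1>
111  <1,2>
111  <2,25>
Reduce input for key 222
key  value
222  <1,1>
222  <2,32>
Reduce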
Hive QL – Join in Map Reduce
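The tagged-value join shown above can be traced in Python with the slide's own data. The function names are hypothetical; the sketch assumes the reducer pairs every tag-1 value (pageid) with every tag-2 value (age) that shares the same userid key.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def map_page_view(row):
    pageid, userid, _time = row
    return userid, (1, pageid)   # tag 1 marks a page_view row

def map_user(row):
    userid, age, _gender = row
    return userid, (2, age)      # tag 2 marks a user row

def reduce_join(values):
    """Pair every pageid with every age seen for the same userid."""
    pageids = [v for tag, v in values if tag == 1]
    ages = [v for tag, v in values if tag == 2]
    return [(pageid, age) for pageid in pageids for age in ages]

# Shuffle: collect all tagged values that share a userid key.
groups = defaultdict(list)
for key, value in [map_page_view(r) for r in page_view] + [map_user(r) for r in user]:
    groups[key].append(value)

# Reduce: process keys in sorted order, as after the sort phase.
pv_users = [pair for key in sorted(groups) for pair in reduce_join(groups[key])]
# pv_users is [(1, 25), (2, 25), (1, 32)]
```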

• Outer Joins
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM page_view pv FULL OUTER JOIN user u
ON (pv.userid = u.userid)
WHERE pv.date = '2008-03-03';
Joins

• Only equality joins with conjunctions supported
• Future
• Pruning of values sent from map to reduce on the basis of projections
• Make Cartesian product more memory efficient
• Map-side joins
Hash joins if one of the tables is very small
Exploit pre-sorted data by doing map-side merge join
Join To Map Reduce
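The map-side hash join listed under future work can be sketched as follows. This is a minimal illustration under the assumption that the small table fits in memory on every mapper; `map_side_hash_join` is a hypothetical helper, not part of Hive.

```python
def map_side_hash_join(big_rows, small_rows, big_key, small_key):
    """Join two tables inside the map phase, with no shuffle or reduce.

    The small table is loaded into an in-memory hash table; each row of
    the big table is then matched against it by key.
    """
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)
    for row in big_rows:
        for match in lookup.get(row[big_key], []):
            yield row + match

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

# userid is column 1 of page_view and column 0 of user.
joined = list(map_side_hash_join(page_view, user, big_key=1, small_key=0))
```

Each mapper can emit joined rows directly, which is why this approach only pays off when one side is very small.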

SQL:
FROM (a join b on a.key = b.key) join c on a.key = c.key
SELECT …
A
key  av
1    111
B
key  bv
1    222
Map Reduce (A join B) produces AB
key  av   bv
1    111  222
C
key  cv
1    333
Map Reduce (AB join C) produces ABC
key  av   bv   cv
1    111  222  333
Hive Optimizations – Merge Sequential Map Reduce Jobs
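Because both joins use the same key, the two jobs can in principle be collapsed into one shuffle. The sketch below illustrates that idea with the slide's data; `single_job_join` is a hypothetical helper and only approximates what the optimizer does.

```python
from collections import defaultdict
from itertools import product

def single_job_join(tables):
    """Inner-join several lists of (key, value) rows on the key with one shuffle."""
    groups = defaultdict(lambda: [[] for _ in tables])
    for tag, rows in enumerate(tables):       # map: tag each row with its table index
        for key, value in rows:
            groups[key][tag].append(value)    # shuffle: one group per join key
    result = []
    for key in sorted(groups):                # reduce: combine one value from each table
        for combo in product(*groups[key]):
            result.append((key, *combo))
    return result

a = [(1, 111)]
b = [(1, 222)]
c = [(1, 333)]
abc = single_job_join([a, b, c])
# abc is [(1, 111, 222, 333)]
```

A key missing from any one table yields no output rows, since the product over an empty list is empty, which matches inner join behavior.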

SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
pv_users
pageid  age
1       25
2       25
1       32
2       25
Result
pageid  age  count
1       25   1
2       25   2
1       32   1
Hive QL – Group By

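pv_users (split 1)
pageid  age
1       25
2       25
pv_users (split 2)
pageid  age
1       32
2       25
Map
Map output (split 1)
key     value
<1,25>  1
<2,25>  1
Map output (split 2)
key     value
<1,32>  1
<2,25>  1
Shuffle and Sort
Reduce input (first reducer)
key     value
<1,25>  1
<1,32>  1
Reduce input (second reducer)
key     value
<2,25>  1
<2,25>  1
Reduce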
Hive QL – Group By in Map Reduce
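The same flow can be traced in Python using the slide's two input splits. This is a simplified in-memory sketch: it groups by key in one dictionary instead of distributing keys across reducers.

```python
from collections import defaultdict

# The two input splits of pv_users from the slide: (pageid, age)
splits = [[(1, 25), (2, 25)], [(1, 32), (2, 25)]]

# Map: emit ((pageid, age), 1) for every row of every split.
mapped = [((pageid, age), 1) for split in splits for pageid, age in split]

# Shuffle and sort: all values for one key end up together.
groups = defaultdict(list)
for key, one in mapped:
    groups[key].append(one)

# Reduce: sum the ones for each (pageid, age) key.
result = [(pageid, age, sum(ones)) for (pageid, age), ones in sorted(groups.items())]
# result is [(1, 25, 1), (1, 32, 1), (2, 25, 2)]
```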

SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid;
page_view
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
2       111     9:08:20
Result
pageid  count_distinct_userid
1       2
2       1
Hive QL – Group By with Distinct

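page_view (split 1)
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
page_view (split 2)
pageid  userid  time
1       222     9:08:14
2       111     9:08:20
Shuffle and Sort
Reduce input (first reducer), key = <pageid, userid>
<1,111>
<2,111>
<2,111>
Reduce input (second reducer), key = <pageid, userid>
<1,222>
Reduce
Reduce output (first reducer)
pageid  count
1       1
2       1
Reduce output (second reducer)
pageid  count
1       1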
Hive QL – Group By with Distinct in Map Reduce
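The partial counts in this diagram can be reproduced in Python. The sketch assumes that rows are partitioned by userid, so duplicates of a `<pageid, userid>` key always meet on one reducer; the final step that sums the partial counts per pageid is also an assumption about how the reducer outputs are combined.

```python
from collections import defaultdict

splits = [
    [(1, 111, "9:08:01"), (2, 111, "9:08:13")],
    [(1, 222, "9:08:14"), (2, 111, "9:08:20")],
]
NUM_REDUCERS = 2

# Map and shuffle: the key is (pageid, userid); partitioning on userid keeps
# identical keys on the same reducer.
partitions = defaultdict(list)
for split in splits:
    for pageid, userid, _time in split:
        partitions[userid % NUM_REDUCERS].append((pageid, userid))

# Reduce: each reducer sorts its keys, drops duplicates, and counts the
# distinct users it saw for each pageid.
partial_counts = []
for keys in partitions.values():
    local = defaultdict(int)
    for pageid, _userid in sorted(set(keys)):
        local[pageid] += 1
    partial_counts.append(dict(local))

# Combine: sum the partial counts per pageid.
totals = defaultdict(int)
for partial in partial_counts:
    for pageid, count in partial.items():
        totals[pageid] += count
# dict(totals) is {1: 2, 2: 1}
```

The sums are safe because each user appears on exactly one reducer, so no user is counted twice for the same pageid.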

FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, count_distinct(pv_users.userid)
  GROUP BY (pv_users.gender)
INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY (pv_users.age)
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY 013
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY (pv_users.age);
Inserts into Files, Tables and Local Files
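The point of the multi-insert form is that one scan of `pv_users` feeds several aggregations. A rough Python analogue, using hypothetical sample rows that do not come from the slides:

```python
from collections import defaultdict

# Hypothetical pv_users rows: (userid, gender, age)
pv_users = [
    (111, "female", 25),
    (111, "female", 25),
    (222, "male", 32),
    (333, "female", 32),
]

users_by_gender = defaultdict(set)
users_by_age = defaultdict(set)

# Single pass over the source table; every row updates both aggregations.
for userid, gender, age in pv_users:
    users_by_gender[gender].add(userid)
    users_by_age[age].add(userid)

# Distinct user counts per group, one result per INSERT clause.
pv_gender_sum = {gender: len(users) for gender, users in users_by_gender.items()}
pv_age_sum = {age: len(users) for age, users in users_by_age.items()}
```

Here `pv_gender_sum` corresponds to the table target and `pv_age_sum` to the directory targets in the query above.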
