2. Motivation
Structured log and dimension data
– Well known schemas, different serialization formats (binary/text)
– Rich data structures – nesting/maps/lists
Query language over structured data
– SQL eases adoption by business analysts and reduces the learning
curve for everyone
– Developers love streaming and direct access to map-reduce
– Query Language brings together SQL and Streaming
Data Management
– Tables/Partitions for easy data addressability
– Abstractions allow optimizations:
Organize data for large joins/sampling
Add indices/manage compression/replication transparently
3. What is HIVE?
[Architecture diagram. Components shown: Hive CLI (Browsing, Queries, DDL); Mgmt. Web UI; Thrift API; Hive QL (Parser, Planner, Execution); SerDe (Thrift, Jute, JSON..); MetaStore; Map Reduce; HDFS]
4. Dealing with Structured Data
Type system
– Primitive types
– Recursively build up using Composition/Maps/Lists
Generic (De)Serialization Interface (SerDe)
– To recursively list schema
– To recursively access fields within a row object
Serialization families implement interface
– Thrift (Binary and Delimited Text), RecordIO, JSON/PADS(?)
XPath like field expressions
– profiles.network[@is_primary=1].id
Inbuilt DDL
– Define schema over delimited text files
– Leverages Thrift DDL
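The two recursive operations named on this slide (listing a schema, accessing fields within a row object) can be sketched in Python. This is only an illustration, not Hive's SerDe interface: the schema encoding (type-name strings, dicts for structs, one-element lists for lists), the function names list_schema and get_field, and the sample row are all assumptions. The predicate from profiles.network[@is_primary=1].id is applied by hand as a plain filter.

```python
from typing import Any, Dict, List, Union

# Hypothetical schema encoding (not Hive's): a primitive is a type-name
# string, a struct is a dict of field name -> schema, and a list type is
# a one-element Python list holding the element schema.
Schema = Union[str, Dict[str, Any], List[Any]]


def list_schema(schema: Schema, prefix: str = "") -> List[str]:
    """Recursively flatten a nested schema into 'path: type' strings."""
    if isinstance(schema, dict):
        lines: List[str] = []
        for name, sub in schema.items():
            path = f"{prefix}.{name}" if prefix else name
            lines.extend(list_schema(sub, path))
        return lines
    if isinstance(schema, list):
        return list_schema(schema[0], prefix + "[]")
    return [f"{prefix}: {schema}"]


def get_field(row: Any, path: List[str]) -> Any:
    """Recursively walk a row object; lists are mapped element-wise."""
    if not path:
        return row
    if isinstance(row, list):
        return [get_field(item, path) for item in row]
    return get_field(row[path[0]], path[1:])


# Illustrative data only.
profile_schema = {"uid": "long", "network": [{"id": "long", "is_primary": "int"}]}
row = {
    "uid": 7,
    "network": [{"id": 10, "is_primary": 0}, {"id": 20, "is_primary": 1}],
}

schema_lines = list_schema(profile_schema)
# Rough equivalent of network[@is_primary=1].id on this row.
primary = [n for n in get_field(row, ["network"]) if n["is_primary"] == 1]
primary_ids = get_field(primary, ["id"])
```

Because both functions recurse on the structure rather than on a fixed set of columns, the same code handles any depth of nesting.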
5. Data Model
[Data model diagram. Labels shown: Tables (clicks, views); Logical Partitioning; Hash Partitioning (#Partitions=32, Sort-key=uid); Schema Library; Dimensions (uid, IP, userId, AdId, …); HDFS; MetaStore]
HDFS paths shown in the diagram:
– /hive/clicks
– /hive/clicks/ds=2008-03-25
– /hive/clicks/ds=2008-03-25/0
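A small Python sketch of how a row could map to a path such as /hive/clicks/ds=2008-03-25/0 under this data model. The slide does not say which hash function is used, so plain modulo on uid is an assumption, as are the function names bucket_for and partition_path and the /hive root default.

```python
def bucket_for(uid: int, num_buckets: int = 32) -> int:
    """Pick a hash partition for a row.

    Assumption: plain modulo on the sort key; the real hash function
    is not given in the slides.
    """
    return uid % num_buckets


def partition_path(table: str, ds: str, uid: int,
                   num_buckets: int = 32, root: str = "/hive") -> str:
    """Build <root>/<table>/ds=<date>/<bucket> for one row."""
    return f"{root}/{table}/ds={ds}/{bucket_for(uid, num_buckets)}"


example_path = partition_path("clicks", "2008-03-25", uid=64)
```

The logical partition (ds) becomes a directory level, so a query restricted to one date only needs to read the files under that directory.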
6. MetaStore
Stores Table/Partition properties:
– Table schema and SerDe library
– Table Location on HDFS
– Logical Partitioning keys and types
– Sort column
– Mapping from columns to well known Dimensions
Thrift API
– Current clients in PHP (Web Interface), Python (CLI), Java (Query
Engine), Perl (Tests)
Stores all properties in text files
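To make the list of stored properties concrete, here is a hedged Python sketch of one table's record and a text rendering of it. The class name TableProperties, its field names, the key=value line format, and the SerDe label are all hypothetical; the slides only say that the properties are kept in text files, not how they are laid out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TableProperties:
    """Hypothetical container for the per-table properties listed above."""
    name: str
    schema: Dict[str, str]            # column name -> type name
    serde: str                        # SerDe library used to read rows
    location: str                     # table location on HDFS
    partition_keys: List[Tuple[str, str]] = field(default_factory=list)
    sort_column: str = ""
    dimensions: Dict[str, str] = field(default_factory=dict)  # column -> dimension

    def to_text(self) -> str:
        """Render as key=value lines (an assumed format, for illustration)."""
        lines = [
            f"name={self.name}",
            f"serde={self.serde}",
            f"location={self.location}",
            f"sort_column={self.sort_column}",
        ]
        lines += [f"column.{c}={t}" for c, t in self.schema.items()]
        lines += [f"partition_key.{k}={t}" for k, t in self.partition_keys]
        lines += [f"dimension.{c}={d}" for c, d in self.dimensions.items()]
        return "\n".join(lines)


clicks = TableProperties(
    name="clicks",
    schema={"uid": "long", "adid": "long"},
    serde="thrift_delimited_text",    # hypothetical SerDe label
    location="/hive/clicks",
    partition_keys=[("ds", "date")],
    sort_column="uid",
    dimensions={"uid": "userId", "adid": "AdId"},
)
clicks_text = clicks.to_text()
```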
7. Hive CLI
Implemented in Python
– uses MetaStore Thrift API
DDL:
– create table/drop table/rename table
– alter table add column etc.
Browsing:
– show tables
– describe table
– cat table
Loading Data
– load data inpath <path1, …> into table <tablename/partition-spec>
[bucketed <N> ways by <dimension>]
Queries
– Issue queries in Hive QL.
8. Hive Query Language
Philosophy
– SQL like constructs + Hadoop Streaming
Query Operators in initial version
– Projections
– Equijoins and Cogroups
– Group by
– Sampling
Output of these operators can be:
– passed to Streaming mappers/reducers
– stored in another Hive table
– written to HDFS files
9. Hive Query Language
Package these capabilities into a more formal SQL like query language
in next version
Introduce other important constructs:
– Views
– Multi table inserts
– Order bys
– Select distincts
– SQL like column expressions
– A bunch of other builtin functions
Still work in progress
10. Query Language - Examples
Multi table inserts
FROM ad_impressions_stg imps
INSERT INTO ad_legals/ds=2008-03-08 select imps.* where imps.legal = 1
INSERT INTO ad_non_legals/ds=2008-03-08 select imps.* where imps.legal = 0
Joins
FROM ad_impressions imps, ad_dimensions ads
INSERT INTO ad_legals_joined select imps.*, ads.campaignid
JOIN ON(imps.adid, ads.adid)
WHERE imps.legal = 1
11. Query Language - Examples
Group By
FROM ad_legals_joined imps
INSERT INTO hdfs://hadoop001:9000/user/ads/adid_uu_summary
select imps.adid, count_distinct(imps.uid)
group by(imps.adid)
INSERT INTO hdfs://hadoop001:9000/user/ads/campaignid_uu_summary
select imps.campaignid, count_distinct(imps.uid)
group by(imps.campaignid)
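What the two-output group by above computes can be sketched in plain Python: one scan over the joined rows feeds both unique-user summaries. This shows the semantics only, not how Hive plans the map-reduce jobs; the function name unique_user_summaries and the sample rows are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set, Tuple


def unique_user_summaries(
    rows: Iterable[Mapping[str, int]],
) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Count distinct uids per adid and per campaignid in a single pass."""
    by_ad: Dict[int, Set[int]] = defaultdict(set)
    by_campaign: Dict[int, Set[int]] = defaultdict(set)
    for row in rows:
        by_ad[row["adid"]].add(row["uid"])
        by_campaign[row["campaignid"]].add(row["uid"])
    return (
        {adid: len(uids) for adid, uids in by_ad.items()},
        {cid: len(uids) for cid, uids in by_campaign.items()},
    )


# Illustrative rows standing in for ad_legals_joined.
sample_rows = [
    {"adid": 1, "campaignid": 100, "uid": 7},
    {"adid": 1, "campaignid": 100, "uid": 7},   # repeat view, same user
    {"adid": 1, "campaignid": 100, "uid": 8},
    {"adid": 2, "campaignid": 100, "uid": 8},
]
adid_uu, campaignid_uu = unique_user_summaries(sample_rows)
```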
12. Query Language – Hadoop Streaming
APPLY ON TABLE
CREATE OPERATOR filter_legal using 'exec://filter_legal.py'
(ts date, adid long, uid long)
FROM (APPLY filter_legal ON TABLE ad_impressions)
INSERT INTO ad_legals where ts >= '2008-03-11' and ts < '2008-03-12'
APPLY can also be applied after JOIN as reducer script
FROM ad_impressions imps, ad_dimensions ads
INSERT INTO ad_legals_joined select imps.*, ads.campaignid
JOIN ON(imps.adid, ads.adid)
APPLY filter_legal BEFORE OUTPUT
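The slides do not show filter_legal.py itself. A minimal sketch of what such a streaming script might look like follows, assuming the usual Hadoop Streaming convention of one record per line on stdin and stdout. The tab delimiter, the four-column input layout (ts, adid, uid, legal) and the choice to skip malformed lines are all assumptions.

```python
import sys
from typing import Iterable, Iterator


def filter_legal(lines: Iterable[str]) -> Iterator[str]:
    """Keep rows whose legal flag is 1 and emit (ts, adid, uid).

    Assumed input layout: ts, adid, uid, legal, tab-separated.
    Lines with a different number of fields are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue
        ts, adid, uid, legal = fields
        if legal == "1":
            yield "\t".join((ts, adid, uid))


def main() -> None:
    """Entry point when run as a streaming mapper or reducer."""
    for out in filter_legal(sys.stdin):
        print(out)
```

A script used this way would call main() at the bottom of the file. The output columns match the (ts date, adid long, uid long) signature declared in CREATE OPERATOR.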
13. Query Language – Views
Used for expressing
– Union alls
– APPLY operators
Example
CREATE VIEW actions
SELECT photo_views.*
UNION ALL
SELECT video_views.*
UNION ALL
SELECT profile_views.* …
14. Hive Usage in Facebook
Applications:
– Summarization
E.g. daily/weekly aggregations of impression/click counts
– Ad hoc Analysis
E.g. number of group admins broken down by state/country
– Data Mining (Assembling training data)
E.g. user engagement as a function of user attributes
Usage statistics:
– Total Users: ~40 (about 25% of engineering!)
– Hive Data (compressed): 22 TB total, ~200GB incoming per day
– Jobs over last 7 days:
Total Jobs: 3514, Projections: 821, Joins: 152, Aggregates: 800,
Loaders: 600
* Aggregates biased because of multi-stage map-reduce
15. Conclusion
Release to Open Source in 3-4 months
People:
– Suresh Anthony (suresh@facebook.com)
– Jeff Hammerbacher (jeffh@)
– Joydeep Sarma (jssarma@)
– Ashish Thusoo (athusoo@)
– Pete Wyckoff (pwyckoff@)