5. 5
The Big Data Era Has Arrived
Structured
•Relational Database
•File in record format
Semi-structured
•XML
•Logs
•Click-stream
•Equipment / Device
•RFID tag
Unstructured
•Web Pages
•E-mail
•Multimedia
•Instant Messages
•More Binary Files
Mobile/Internet
Internet of Things
15. 15
What is Hadoop?
Framework for running distributed applications on
large clusters built of commodity hardware
Originally created by Doug Cutting
OSS implementation of Google's MapReduce and GFS
Hardware failures assumed in design
Fault-tolerant via replication
The name “Hadoop” has now evolved to cover a
family of software, but the core is essentially
MapReduce and a distributed file system
16. 16
Why Hadoop?
Need to process lots of data (up to Petabyte scale)
Need to parallelize processing across a multitude of CPUs
Achieve the above while keeping the software simple
Provide scalability with low-cost commodity hardware
Achieve linear scalability
17. 17
What is Hadoop used for ?
Searching
Log Processing
Recommendation Systems
Business Intelligence / Data Warehousing
Video and Image Analysis
Archiving
18. 18
Hadoop Is More than Hadoop
[Ecosystem diagram] The family supporting Big Data applications
includes HIVE, Pig, ZooKeeper, and more, layered over: a
distributed file system, a parallel computing framework, an
SQL-like database system (non-real-time), a distributed database
(real-time), a data processing language, a Data Mining library,
SQL data import, and raw/unstructured data import.
21. 21
HDFS Overview
Hadoop Distributed File System
Based on Google's GFS (Google File System)
Master/slave architecture
Write once read multiple times
Fault tolerant via replication
Optimized for larger files
Focus on streaming data (high-throughput > low
latency)
Rack-aware (reduces cross-rack network I/O)
22. 22
HDFS Client APIs
“Shell-like” commands ( hadoop dfs [cmd] )
Native Java API
Thrift API for other languages
C++, Java, Python, PHP, Ruby, C#
cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp,
du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal,
mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
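For example, a few typical invocations (the paths and file names here are made up for illustration):
$ hadoop dfs -mkdir /user/etu/input
$ hadoop dfs -put localfile.txt /user/etu/input
$ hadoop dfs -ls /user/etu/input
$ hadoop dfs -cat /user/etu/input/localfile.txt
$ hadoop dfs -get /user/etu/input/localfile.txt ./localcopy.txt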
23. 23
HDFS Architecture-Read
[Diagram] HDFS read path: the client sends a file name to the
Name Node, which returns the block IDs and Data Node locations
(name, replicas, block_id); the client then transfers data directly
from the Data Nodes' local disks. Data Nodes exchange block
operations (heartbeat, replication, re-balancing) with the Name Node.
28. 28
MapReduce Overview
Distributed programming paradigm, and the framework that is
the OSS implementation of Google's MapReduce
Modeled on the ideas behind the functional programming
map() and reduce() operations
Distributed across as many nodes as you would like
2-phase process:
map(): sub-divide & conquer
reduce(): combine & reduce
29. 29
MapReduce ABC’s
Essentially, it's…
A. Take a large problem and divide it into sub-problems
B. Perform the same function on all sub-problems
C. Combine the output from all sub-problems
M/R is excellent for problems where the “sub-
problems” are NOT interdependent
The output of one “mapper” should not depend on the
output of, or communication with, another “mapper”
The reduce phase doesn’t begin execution until all mappers
have finished
Failed map and reduce tasks are automatically restarted
Rack/HDFS aware (data locality)
31. 31
Each mapper processes one file block
Word Count
Input: I am a tiger, you are also a tiger
map: each mapper emits (word, 1) pairs for its block, e.g.
(I,1) (am,1) (a,1) (tiger,1) (you,1) (are,1) (also,1) (a,1) (tiger,1)
Shuffle & Sort: pairs are grouped by word across all mappers, e.g.
(a,1)(a,1) (also,1) (am,1) (are,1) (I,1) (tiger,1)(tiger,1) (you,1)
reduce: each reducer sums and counts per word
Output: a,2  also,1  am,1  are,1  I,1  tiger,2  you,1
32. 32
Data Locality
[Diagram] Data locality in M/R: TaskTrackers run on the same
machines as DataNodes; the legend distinguishes tasks of one job,
tasks of a different job, and idle slots, spread across one rack and
a different rack.
34. 34
Pig
Framework and language (Pig Latin) for
creating and submitting Hadoop
MapReduce jobs
Common data operations like join, group
by, filter, sort, select, etc. are provided
Don't need to know Java
Remove boilerplate aspect from M/R
200 lines in Java -> 15 lines in Pig
Feels like SQL
35. 35
Pig
Fact from Wiki: 40% of Yahoo's M/R jobs
are in Pig
An interactive shell [grunt] exists
User Defined Functions [UDF]
Allows you to specify Java code where the logic
is too complex for Pig Latin
UDFs can be part of almost every operation in Pig
Great for loading and storing custom formats as
well as transforming data
36. 36
Pig Relational Operations
COGROUP, CROSS, DISTINCT, FILTER, FOREACH, GROUP,
JOIN, LIMIT, LOAD, MAPREDUCE, ORDER BY, SAMPLE,
SPLIT, STORE, STREAM, UNION
39. 39
What HBase is
No-SQL (means Non-SQL, not SQL sucks)
Good at fast/streaming writes
Fault tolerant
Good at linear horizontal scalability
Very efficient at managing billions of rows and
millions of columns
Good at keeping row history
Good at auto-balancing
A complement to a SQL/DW
Great with non-normalized data
40. 40
What HBase is NOT
Made for table joins
Made for splitting into normalized tables
A replacement for RDBMS
Great for storing small amount of data
Great for large binary data (prefer < 1MB per
cell)
ACID compliant
41. 41
Data Model
The simple view is a map
Table: similar to a relational DB table
Row Key: rows are identified and sorted by key
Row Value
[Diagram] A table of rows (row1, row2, row3), each consisting of a key and a row value.
42. 42
Data Model (Columns)
Table:
Row: multiple columns in row’s value
[Diagram] The same table, with each row's value made up of multiple columns (Column1 … Column4).
43. 43
Data Model (Column Family)
Table:
Row:
Column Family: columns are grouped into families. A column
family must be predefined with a column family prefix, e.g.
“privateData”, when creating the schema
Column: a column is denoted using family+qualifier, e.g.
“privateData:mobilePhone”.
[Diagram] The same table, with each row's columns grouped under Column Family 1 and Column Family 2.
44. 44
Data Model (Sparse)
Table:
Row:
Column Family:
Column: columns can be added to an existing column family on the fly.
Rows can have widely varying numbers of columns.
[Diagram] Rows in the same table hold different columns: row1 has Column1 to Column4, row2 has Column1, Column5, and Column4, and row3 has Column3 and Column6.
45. 45
HBase Architecture
The master keeps track of the
metadata for Region Servers and the
Regions they serve, and stores it in
ZK (ZooKeeper)
The HBase client
communicates with ZK
only to get region info
All HBase data (HLog & HFile) is stored on HDFS
48. 48
Hive – How it works
[Architecture diagram] Clients (Web UI, CLI, JDBC/ODBC) submit
queries to the Driver (compiler, optimizer, executor), which consults
the metastore and creates M/R jobs that run on the Data Nodes of
the Hadoop cluster.
49. 49
Enterprise Applications of Hadoop
[Architecture diagram] Unstructured data (web, logs, crawlers,
sensors/devices) flows into the Etu Appliance and is accessed
through Hive QL; structured data from the RDBMS, fed by ERP,
CRM, and LOB apps, is accessed through SQL; connectors bridge
the two sides, and familiar end-user tools (SSRS, SSAS, Hyperion,
PowerView, Excel with PowerPivot, embedded BI, predictive
analytics) sit on top.
56. 56
Challenges for Enterprises Adopting a Hadoop Technology Stack
• Technology/talent gap
1. Enterprises are generally unfamiliar with the Hadoop architecture
2. Hadoop cluster planning, deployment, management,
and system tuning have a high technical barrier
• Professional services gap
1. Lack of local, professional, hands-on
Hadoop consulting services
2. Lack of vendors that can provide complete Big Data
solution design, implementation, and maintenance
The market is still in its early stage
Helping you cross the Big Data chasm
57. 57
Introducing – Etu Appliance 2.0
A purpose-built, high-performance appliance for big
data processing
• Automates cluster deployment
• Optimizes for the highest performance of big data processing
• Delivers availability and security
58. 58
Key Benefits to Hadoopers
• Fully automated deployment and configuration
Simplifies configuration and deployment
• Reliably deploy and run mission-critical big data applications
High availability made easy
• Process and control sensitive data with confidence
Enterprise-grade security
• Fully optimized operating system to boost your data processing
performance
Boost big data processing
• Adapts to your workload and grows with your business
Provides a scalable and extensible foundation
59. 59
What’s New in Etu Appliance 2.0
• New deployment feature – Auto deployment and
configuration for master node high availability
• New security features – LDAP integration and
Kerberos authentication
• New data source – Introduce Etu™ Dataflow data
collection service with built-in Syslog and FTP servers
for better integration with existing IT infrastructure
• New user experience – new Etu™ Management
Console with HDFS file browser and HBase table
management
60. 60
Etu Appliance 2.0 – Hardware Specification
Master Node – Etu 1000M
CPU: 2 x 6 Core
RAM: 48 GB ECC
HDD: 300GB/SAS 3.5”/15K RPM x 2 (RAID 1)
NIC: Dual Port/1 Gb Ethernet x 1
S/W: Etu™ OS/Etu™ Big Data Software Stack
Power: Redundant Power / 100V~240V
Worker Node – Etu 1000W
CPU: 2 x 6 Core
RAM: 48 GB ECC
HDD: 2TB/SATA 3.5”/7.2K RPM x 4
NIC: Dual Port/1 Gb Ethernet x 1
S/W: Etu™ OS/Etu™ Big Data
Software Stack
Power: Single Power /100V~240V
Worker Node – Etu 2000W
CPU: 2 x 6 Core
RAM: 48GB
HDD: 2TB/SATA 3.5”/7.2K RPM x 8
NIC: Dual Port/1 Gb Ethernet x 1
S/W: Etu™ OS/Etu™ Big Data
Software Stack
Power: Single Power / 100V~240V
61. 61
Sqoop : SQL to Hadoop
• What is Sqoop ?
• Sqoop import to HDFS
• Sqoop import to Hive
• Sqoop import to HBase
• Sqoop Incremental Imports
• Sqoop export
62. 62
What is Sqoop
• A tool designed to transfer data between Hadoop and
relational databases
• Uses MapReduce to import and export data
• Provides parallel operation
• Fault Tolerance, of course!
63. 63
How it works
[Diagram] Sqoop takes the SQL statement, creates Map tasks, and
each map task pulls its share of the data over JDBC in parallel and
writes it to HDFS/Hive/HBase.
66. 66
Using Options Files
$ sqoop import --connect jdbc:mysql://etu-master/db ...
Or
$ sqoop --options-file ./import.txt --table TEST
The options file contains:
import
--connect
jdbc:mysql://etu-master/db
--username
root
--password
etuadmin
67. 67
Sqoop Import
Command: sqoop import (generic-args) (import-args)
Arguments:
• --connect <jdbc-uri> Specify JDBC connect string
• --driver <class-name> Manually specify JDBC driver class to use
• --help Print usage instructions
• -P Read password from console
• --password <password> Set authentication password
• --username <username> Set authentication username
• --verbose Print more information while working
68. 68
Import Arguments
• --append Append data to an existing dataset in HDFS
• -m,--num-mappers <n> Use n map tasks to import in parallel
• -e,--query <statement> Import the results of statement.
• --table <table-name> Table to read
• --target-dir <dir> HDFS destination dir
• --where <where clause> WHERE clause to use during import
• -z,--compress Enable compression
69. 69
Let’s try it!
Please refer to L2 training note:
• Import nyse_daily to HDFS
• Import nyse_dividends to HDFS
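A rough sketch of what the first import might look like (the connection string reuses the earlier options-file example; the target directory and mapper count are illustrative, and the L2 training note is authoritative):
$ sqoop import --connect jdbc:mysql://etu-master/db --username root -P \
  --table nyse_daily --target-dir /user/etu/nyse_daily -m 4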
70. 70
Incremental Import
• Sqoop supports incremental imports (see the sketch below)
• Argument
--check-column : column to be examined for importing
--incremental : append or lastmodified
--last-value : max value from previous import
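For illustration only, reusing the options file from the earlier slide (the check column and last value are hypothetical):
$ sqoop --options-file ./import.txt --table TEST \
  --incremental append --check-column id --last-value 100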
71. 71
Import All Tables
• The sqoop import-all-tables tool supports importing a set of tables from an
RDBMS to HDFS (see the sketch below).
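A minimal sketch (the database URL and destination directory are assumptions):
$ sqoop import-all-tables --connect jdbc:mysql://etu-master/db \
  --username root -P --warehouse-dir /user/etu/db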
72. 72
Sqoop to Hive
• Sqoop supports Hive
• Add the following arguments when importing (example below)
--hive-import : make Hive the Sqoop target
--hive-table : specify the target table name in Hive
--hive-overwrite : overwrite if the table already exists
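For example (the table name comes from the earlier exercise; other values are illustrative):
$ sqoop import --connect jdbc:mysql://etu-master/db --username root -P \
  --table nyse_dividends --hive-import --hive-table nyse_dividends --hive-overwrite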
73. 73
Sqoop to HBase
--column-family <family> Sets the target column family for the import
--hbase-create-table If specified, create missing HBase tables
--hbase-row-key <col> Specifies which input column to use as the
row key
--hbase-table <table-name> Specifies an HBase table to use as the target
instead of HDFS
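A hedged example (the HBase table, column family, and row-key column are hypothetical):
$ sqoop import --connect jdbc:mysql://etu-master/db --username root -P \
  --table TEST --hbase-table test_table --column-family cf \
  --hbase-row-key id --hbase-create-table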
74. 74
Sqoop Export
• The target table must already exist
• The default operation is INSERT; you can specify
UPDATE instead (see the sketch after the argument list below)
• Syntax : sqoop export (generic args) (export-args)
75. 75
Export Arguments
--export-dir <dir> HDFS source path for export
--table <name> Table to populate
--update-key Anchor column to use for updates.
--update-mode updateonly or allowinsert
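Putting the export arguments together, a rough sketch (the table, directory, and key column are assumptions):
$ sqoop export --connect jdbc:mysql://etu-master/db --username root -P \
  --table TEST --export-dir /user/etu/TEST \
  --update-key id --update-mode allowinsert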
79. 79
Pig Programming
• Introduction to Pig
• Reading and Writing Data with Pig
• Pig Latin Basics
• Debugging Pig Scripts
• Pig Best Practices
• Pig and HBase
80. 80
Pig Introduction
• Pig was originally created at Yahoo! to answer a
need similar to the one Hive addresses
– Many developers did not have the Java and/or MapReduce
knowledge required to write standard MapReduce programs
– But still needed to query data
• Pig is a dataflow language
– Language is called Pig Latin
– Relatively simple syntax
– Under the covers, Pig Latin scripts are turned into MapReduce
jobs and executed on the cluster
81. 81
Pig Features
• Pig supports many features which allow developers to
perform sophisticated data analysis without having to
write Java MapReduce code
– Joining datasets
– Grouping data
– Referring to elements by position rather than name
• Useful for datasets with many elements
– Loading non-delimited data using a custom SerDe
– Creation of user-defined functions, written in Java
– And more
82. 82
Pig Word Count
Book = LOAD 'shakespeare/*' USING PigStorage() AS (lines:chararray);
Wordlist = FOREACH Book GENERATE FLATTEN(TOKENIZE(lines)) as word;
GroupWords = GROUP Wordlist BY word;
CountGroupWords = FOREACH GroupWords GENERATE group as word,
COUNT(Wordlist) as num_occurence;
WordCountSorted = ORDER CountGroupWords BY $1 DESC;
STORE WordCountSorted INTO 'wordcount' USING PigStorage(',');
83. 83
Pig Data Types
• Scalar Types
– int
– long
– float
– double
– chararray
– bytearray
• Complex Types
– tuple ex. (19,2,3)
– bag ex. {(19,2), (18,1)}
– map ex. [open#apache]
• NULL
84. 84
Pig Data Type Concepts
• In Pig, a single element of data is an atom
• A collection of atoms – such as a row, or a partial row
– is a tuple
• Tuples are collected together into bags
• Typically, a Pig Latin script starts by loading one or
more datasets into bags, and then creates new bags
by modifying those it already has
85. 85
Pig Schema
• Pig eats everything
– If schema is available, Pig will make use of it
– If schema is not available, Pig will make the best guesses it can
based on how the script treats the data
A = LOAD 'text.csv' as (field1:chararray, field2:int);
• In the example above, Pig will expect this data to have
2 fields with specified data types
– If there are more fields, they will be truncated
– If there are fewer fields, the missing ones will be NULL
86. 86
Pig Latin: Data Input
• The function is LOAD
sample = LOAD 'text.csv' as (field1:chararray, field2:int);
• In the example above
– sample is the name of the relation
– The file text.csv is loaded
– Pig will expect this data to have 2 fields with the specified data
types
• If there are more fields, they will be truncated
• If there are fewer fields, the missing ones will be NULL
87. 87
Pig Latin: Data Output
• STORE – Output a relation into a specified HDFS
folder
STORE sample_out INTO '/tmp/output';
• DUMP – Output a relation to screen
DUMP sample_out;
88. 88
Pig Latin: Relational Operations
• FOREACH
• FILTER
• GROUP
• ORDER BY
• DISTINCT
• JOIN
• LIMIT
89. 89
Pig Latin: FOREACH
• FOREACH takes a set of expressions and applies them
to every record in the data pipeline, and generates
new records to send down the pipeline to the next
operator.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address);
b = FOREACH a GENERATE id, name;
90. 90
Pig Latin: FILTER
• FILTER allows you to select which records will be
retained in your data pipeline.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address);
b = FILTER a BY id matches '100*';
91. 91
Pig Latin: GROUP
• GROUP statement collects together records with the
same key.
• It is different from the GROUP BY clause in SQL, as in
Pig Latin there is no direct connection between GROUP
and aggregate functions.
• GROUP collects all records with the key provided into
a bag and then you can pass this to an aggregate
function.
92. 92
Pig Latin: GROUP (cont)
Example:
A = LOAD 'text.csv' as (id, name, phone, zip, address);
B = GROUP A BY zip;
C = FOREACH B GENERATE group, COUNT(A.id);
STORE C INTO 'population_by_zipcode';
93. 93
Pig Latin: ORDER BY
• ORDER statement sorts your data for you by the field
specified.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address, fee);
b = ORDER a BY fee;
c = ORDER a BY fee DESC, name;
DUMP c;
94. 94
Pig Latin: DISTINCT
• DISTINCT statement removes duplicate records. Note
it works only on entire records, not on individual
fields.
• Example:
a = LOAD 'url.csv' as (userid, url, dl_bytes, ul_bytes);
b = FOREACH a GENERATE userid, url;
c = DISTINCT b;
95. 95
Pig Latin: JOIN
• JOIN selects records from one input to put together
with records from another input. This is done by
indicating keys from each input, and when those keys
are equal, the two rows are joined.
• Example:
call = LOAD 'call.csv' as (MSISDN, callee, duration);
user = LOAD 'user.csv' as (name, MSISDN, address);
call_bill = JOIN call BY MSISDN, user BY MSISDN;
bill = FOREACH call_bill GENERATE user::name, call::MSISDN, callee,
duration, address;
STORE bill INTO 'to_be_billed';
96. 96
Pig Latin: LIMIT
• LIMIT allows you to limit the number of results in the
output.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address, fee);
b = ORDER a BY fee DESC, name;
top100 = LIMIT b 100;
DUMP top100;
97. 97
Pig Latin: UDF
• UDFs (User Defined Functions) let users combine Pig
operators along with their own or others' code.
• UDFs can be written in Java and Python.
• UDFs have to be registered before use.
• Piggybank is useful
• Example:
register 'path_to_UDF/piggybank.jar';
a = LOAD 'text.csv' as (id, name, phone, zip, address, fee);
b = FOREACH a GENERATE id,
org.apache.pig.piggybank.evaluation.string.Reverse(name);
98. 98
Debugging Pig
• DESCRIBE
– Show the schema of a relation in your scripts
• EXPLAIN
– Show your script's execution plan in MapReduce terms
• ILLUSTRATE
– Run scripts with sampled data
• Pig Statistics
– A summary set of statistics on your script
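For instance, in the grunt shell (the relation and file reuse the earlier examples and are purely illustrative):
grunt> a = LOAD 'text.csv' AS (id, name, phone, zip, address);
grunt> DESCRIBE a;
grunt> ILLUSTRATE a;
grunt> EXPLAIN a;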
99. 99
More about Pig
• Visit Pig's Home Page http://pig.apache.org
• http://pig.apache.org/docs/r0.9.2/
101. 101
Hive Programming and Practice
• Hive Introduction
• Getting Data into Hive
• Manipulating Data with Hive
• Partitioning and Bucketing Data
• Hive Best Practices
• Hive and HBase
102. 102
Hive: Introduction
• Hive was originally developed at Facebook
– Provide a very SQL-like language
– Can be used by people who know SQL
– Under the covers, generates MapReduce jobs that run on the
Hadoop cluster
– Enabling Hive requires almost no extra work by the system
administrator
105. 105
The Hive Data Model
• Hive 'layers' table definitions on top of data in HDFS
• Databases
• Tables
– Typed columns (int, float, string, boolean, etc)
– Also complex types such as list and map (for JSON-like data)
• Partition
• Buckets
106. 106
Hive Datatypes : Primitive Types
• TINYINT (1 byte signed integer)
• SMALLINT (2 bytes signed integer)
• INT (4 bytes signed integer)
• BIGINT (8 bytes signed integer)
• BOOLEAN (TRUE or FALSE)
• FLOAT (single precision floating point)
• DOUBLE (Double precision floating point)
• STRING (Array of Char)
• BINARY (Array of Bytes)
• TIMESTAMP (integer, float or string)
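For illustration, a hypothetical table using several of these types:
CREATE TABLE user_profile (
id BIGINT,
name STRING,
score DOUBLE,
active BOOLEAN,
created TIMESTAMP
);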
108. 108
Text File Delimiters
• By default, Hive stores data as text files, BUT you can
choose other file formats.
• Hive's default record and field delimiters:
CREATE TABLE …
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
109. 109
The Hive Metastore
• Hive's Metastore is a database containing table
definitions and other metadata
– By default, stored locally on the client machine in a Derby
database
– If multiple people will be using Hive, the system administrator
should create a shared Metastore
• Usually in MySQL or some other relational database server
110. 110
Hive is Schema on Read
• Relational Database
– Schema on Write
– Gatekeeper
– Altering the schema is painful!
• Hive
– Schema on Read
– Requires less ETL efforts
111. 111
Hive Data: Physical Layout
• Hive tables are stored in Hive's 'warehouse' directory
in HDFS
– By default, /user/hive/warehouse
• Tables are stored in subdirectories of the warehouse
directory
– Partitions form subdirectories of tables
• Possible to create external tables if the data is already
in HDFS and should not be moved from its current
location
• The actual data is stored in flat files
– Control character-delimited text or SequenceFiles
– Can be arbitrary format with the use of a custom
Serializer/Deserializer(“SerDe”)
112. 112
Hive Limitations
• Not all “standard” SQL is supported
– No correlated subqueries, for example
• No support for UPDATE or DELETE
• No support for INSERT single rows
• Relatively limited number of built-in functions
113. 113
Starting The Hive Shell
• To launch the Hive shell, start a terminal and run
– $ hive
• Results in the Hive prompt:
– hive>
• Autocomplete – Tab
• Query Column Headers
– hive> set hive.cli.print.header=true;
– hive> set hive.cli.print.current.db=true;
114. 114
Hive’s Word Count
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY count DESC;
SELECT * FROM word_counts LIMIT 30;
116. 116
Data Definition
• Database
– CREATE/DROP
– ALTER (set DBPROPERTIES, name-value pair)
– SHOW/DESCRIBE
– USE
• Table
– CREATE/DROP
– ALTER
– SHOW/DESCRIBE
– CREATE EXTERNAL TABLE
117. 117
Creating Tables
• CREATE TABLE IF NOT EXISTS table name …
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
• CREATE EXTERNAL TABLE …
LOCATION '/user/mydata'
118. 118
Creating Tables
hive> SHOW TABLES;
hive> CREATE TABLE shakespeare
(freq INT, word STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY
'\t' STORED AS TEXTFILE;
hive> DESCRIBE shakespeare;
119. 119
Modify Tables
• ALTER TABLE … CHANGE
COLUMN old_name new_name type
AFTER column;
• ALTER TABLE … (ADD|REPLACE)
COLUMNS (column_name column type);
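For example, against the employees table used later in this section (the new column names are hypothetical):
ALTER TABLE employees CHANGE COLUMN name full_name STRING AFTER salary;
ALTER TABLE employees ADD COLUMNS (dept STRING);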
120. 120
Partition
• Help to organize data in a logical fashion, such as
hierarchically.
• CREATE TABLE …
PARTITIONED BY (column name datatype, …)
• CREATE TABLE employees (
name STRING,
salary FLOAT)
PARTITIONED BY (country STRING, state STRING)
• Physical Layout in Hive
…/employees/country=CA/state=AB
…/employees/country=CA/state=BC
…
122. 122
Loading Data into Hive
• LOAD DATA INPATH … INTO TABLE … PARTITION …
• Data is loaded into Hive with LOAD DATA INPATH
statement
– Assumes that the data is already in HDFS
LOAD DATA INPATH 'shakespeare_freq' INTO TABLE
shakespeare;
• If the data is on the local filesystem, use LOAD DATA
LOCAL INPATH
123. 123
Inserting Data into Table from
Queries
• INSERT OVERWRITE TABLE employees
PARTITION (country='US', state='OR')
SELECT * FROM staged_employees se
WHERE se.cnty='US' AND se.st='OR';
125. 125
Dynamic Partition Inserts
• What if I have so many partitions ?
• INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT …, se.cnty, se.st
FROM staged_employees se;
• You can mix static and dynamic partition, for example:
• INSERT OVERWRITE TABLE employees
PARTITION (country='US', state)
SELECT …, se.cnty, se.st
FROM staged_employees se
WHERE se.cnty = 'US';
126. 126
Create Table and Loading Data
• CREATE TABLE ca_employees
AS SELECT name, salary
FROM employees se
WHERE se.state='CA';
127. 127
Storing Output Results
• The SELECT statement on the previous slide would
write the data to console
• To store the result in HDFS, create a new table then
write, for example:
INSERT OVERWRITE TABLE newTable SELECT
s.word, s.freq, k.freq FROM shakespeare s JOIN
kjv k ON (s.word = k.word) WHERE s.freq >= 5;
• Results are stored in the table
• Results are just files within the newTable directory
– Data can be used in subsequent queries, or in MapReduce jobs
128. 128
Exporting Data
• If the data files are already formatted as you want,
just copy them.
• Or you can use INSERT … DIRECTORY …, for example
• INSERT OVERWRITE
LOCAL DIRECTORY './ca_employees'
SELECT name, salary, address
FROM employees se
WHERE se.state='CA';
130. 130
SELECT … FROM
• SELECT col_name or functions FROM tab_name;
hive> SELECT name FROM employees e;
• SELECT … FROM … [LIMIT N]
– * or Column alias
– Column Arithmetic Operators, Aggregation Function
– FROM
131. 131
Arithmetic Operators
Operator Types Description
A + B Numbers Add A and B
A - B Numbers Subtract B from A
A * B Numbers Multiply A and B
A / B Numbers Divide A with B
A % B Numbers The remainder of dividing A with B
A & B Numbers Bitwise AND
A | B Numbers Bitwise OR
A ^ B Numbers Bitwise XOR
~A Numbers Bitwise NOT of A
134. 134
When Hive Can Avoid Map Reduce
• SELECT * FROM employees;
• SELECT * FROM employees
WHERE country='us' AND state='CA'
LIMIT 100;
135. 135
WHERE
• >, <, =, >=, <=, !=
• IS NULL/IS NOT NULL
• OR AND NOT
• LIKE
– X% (prefix 'X')
– %X (suffix 'X')
– %X% (substring)
– _ (single character)
• RLIKE (Java Regular Expression)
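A small combined example (the predicate values are made up):
SELECT name, salary FROM employees
WHERE salary >= 50000
AND name LIKE 'A%'
AND address IS NOT NULL;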
136. 136
GROUP BY
• Often used in conjunction with aggregate functions,
avg, count, etc.
• HAVING
– constrains the groups produced by GROUP BY in a way that
could otherwise be expressed with a subquery.
SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange='NASDAQ' AND symbol='AAPL'
GROUP BY year(ymd)
HAVING avg(price_close) > 50.0;
137. 137
JOIN
• Inner JOIN
• LEFT OUTER JOIN
• RIGHT OUTER JOIN
• FULL OUTER JOIN
• LEFT SEMI-JOIN
• Map-side Joins
138. 138
Inner JOIN
• SELECT a.ymd, a.price_close, b.price_close
FROM stocks a JOIN stocks b ON a.ymd=b.ymd
WHERE a.symbol='AAPL' AND b.symbol='IBM'
• SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd=d.ymd
AND s.symbol=d.symbol
WHERE s.symbol='AAPL'
139. 139
LEFT OUTER JOIN
• All records from the left-hand table that match the WHERE
clause are returned; NULL is used when there is no match on the ON criteria
• SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s LEFT OUTER JOIN dividends d
ON s.ymd=d.ymd AND s.symbol = d.symbol
WHERE s.symbol='AAPL'
140. 140
RIGHT and FULL OUTER JOIN
• RIGHT OUTER JOIN
– All records from the right-hand table that match the WHERE clause
are returned; NULL is used when there is no match on the ON criteria
• FULL OUTER JOIN
– All records from both tables that match the WHERE clause are
returned; NULL is used when there is no match on the ON criteria
141. 141
LEFT SEMI-JOIN
• Returns records from lefthand table if records are
found in righthand table that satisfy the ON
predicates.
• SELECT s.ymd, s.symbol, s.price_close
FROM stocks s LEFT SEMI JOIN dividends d
ON s.ymd = d.ymd AND s.symbol = d.symbol
• RIGHT SEMI-JOIN is not supported
142. 142
Map-side Joins
• If one of the tables is small, the large table can be
streamed through the mappers while the small table is
cached in memory.
• SELECT /*+ MAPJOIN(d)*/ s.ymd, s.symbol,
s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd=d.ymd
AND s.symbol=d.symbol
WHERE s.symbol='AAPL'
143. 143
ORDER BY and SORT BY
• ORDER BY performs a total ordering of query result
set.
• All data is passed through a single reducer; use with
caution for larger data sets. For example:
– SELECT s.ymd, s.symbol, s.price_close FROM stocks s ORDER
BY s.ymd ASC, s.symbol DESC;
• SORT BY performs a local ordering, where each
reducer's output will be sorted.
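For contrast, a hedged SORT BY example (per-reducer ordering only; the columns follow the stocks example above):
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
SORT BY s.symbol ASC, s.ymd ASC;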
144. 144
DISTRIBUTE BY with SORT BY
• By default, MapReduce partitions mapper output by a
hash of the key, so rows with the same column value can
end up in different reducers.
• We can use DISTRIBUTE BY to ensure that records
with the same column value go to the same reducer, and use
SORT BY to order the data within each reducer.
• SELECT s.ymd, s.symbol, s.price_close FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC, s.ymd ASC
• When used together, DISTRIBUTE BY must appear before SORT BY
145. 145
CLUSTER BY
• Short-hand for DISTRIBUTE BY … SORT BY
• CLUSTER BY does not give a total ordering; it sorts within
each reducer, in ascending order only
• SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
CLUSTER BY s.symbol;
146. 146
Creating User-Defined Functions
• Hive supports manipulation of data via user-created
functions
• Example:
INSERT OVERWRITE TABLE u_data_new SELECT TRANSFORM
(userid, movieid, rating, unixtime) USING 'python
weekday_mapper.py' AS (userid, movieid, rating, weekday)
FROM u_data;
147. 147
Hive: Where to Learn More
• http://hive.apache.org/
• Programming Hive
148. 148
Choosing Between Pig and Hive
• Typically, organizations wanting an abstraction on top
of standard MapReduce will choose to use either Hive
or Pig
• Which one is chosen depends on the skillset of the
target users
– Those with an SQL background will naturally gravitate towards
Hive
– Those who do not know SQL will often choose Pig
• Each has strengths and weaknesses; it is worth
spending some time investigating each so you can
make an informed decision
• Some organizations are now choosing to use both
– Pig deals better with less-structured data, so Pig is used to
manipulate the data into a more structured form, then Hive is
used to query that structured data
149. www.etusolution.com
info@etusolution.com
Taipei, Taiwan
318, Rueiguang Rd., Taipei 114, Taiwan
T: +886 2 7720 1888
F: +886 2 8798 6069
Beijing, China
Room B-26, Landgent Center,
No. 24, East Third Ring Middle Rd.,
Beijing, China 100022
T: +86 10 8441 7988
F: +86 10 8441 7227
Contact