5. 5
The Big Data Era Has Arrived
Structured
•Relational Database
•File in record format
Semi-structured
•XML
•Logs
•Click-stream
•Equipment / Device
•RFID tag
Unstructured
•Web Pages
•E-mail
•Multimedia
•Instant Messages
•More Binary Files
Mobile/Internet
Internet of Things
15. 15
What is Hadoop?
Framework for running distributed applications on
large clusters built of commodity hardware
Originally created by Doug Cutting
OSS implementation of Google's MapReduce and GFS
Hardware failures assumed in design
Fault-tolerant via replication
The name “Hadoop” has now evolved to cover a
family of software, but the core is essentially
MapReduce and a distributed file system
16. 16
Why Hadoop?
Need to process lots of data (up to Petabyte scale)
Need to parallelize processing across a multitude of CPUs
Achieve the above while keeping the software simple
Provide scalability with low-cost commodity hardware
Achieve linear scalability
17. 17
What is Hadoop used for ?
Searching
Log Processing
Recommendation Systems
Business Intelligence / Data Warehousing
Video and Image Analysis
Archiving
18. 18
Hadoop Is More than Hadoop
[Ecosystem diagram] The family supporting Big Data applications
includes HIVE, Pig, ZooKeeper, and more, layered over: a
distributed file system, a parallel computing framework, an
SQL-like database system (non-real-time), a distributed database
(real-time), a data processing language, a Data Mining library,
SQL data import, and raw/unstructured data import.
21. 21
HDFS Overview
Hadoop Distributed File System
Based on Google's GFS (Google File System)
Master/slave architecture
Write once read multiple times
Fault tolerant via replication
Optimized for larger files
Focus on streaming data (high-throughput > low
latency)
Rack-aware (reduces cross-rack network I/O)
22. 22
HDFS Client APIs
“Shell-like” commands ( hadoop dfs [cmd] )
Native Java API
Thrift API for other languages
C++, Java, Python, PHP, Ruby, C#
cat, chgrp, chmod, chown, copyFromLocal, copyToLocal, cp,
du, dus, expunge, get, getmerge, ls, lsr, mkdir, moveFromLocal,
mv, put, rm, rmr, setrep, stat, tail, test, text, touchz
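For example, a few typical invocations (the paths and file names here are made up for illustration):
$ hadoop dfs -mkdir /user/etu/input
$ hadoop dfs -put localfile.txt /user/etu/input
$ hadoop dfs -ls /user/etu/input
$ hadoop dfs -cat /user/etu/input/localfile.txt
$ hadoop dfs -get /user/etu/input/localfile.txt ./localcopy.txt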
23. 23
HDFS Architecture-Read
[Diagram] HDFS read path: the client sends a file name to the
Name Node, which returns the block IDs and Data Node locations
(name, replicas, block_id); the client then transfers data directly
from the Data Nodes' local disks. Data Nodes exchange block
operations (heartbeat, replication, re-balancing) with the Name Node.
28. 28
MapReduce Overview
Distributed programming paradigm, and the framework that is
the OSS implementation of Google's MapReduce
Modeled on the ideas behind the functional programming
map() and reduce() operations
Distributed across as many nodes as you would like
2-phase process:
map(): sub-divide & conquer
reduce(): combine & reduce
29. 29
MapReduce ABC’s
Essentially, it's…
A. Take a large problem and divide it into sub-problems
B. Perform the same function on all sub-problems
C. Combine the output from all sub-problems
M/R is excellent for problems where the “sub-
problems” are NOT interdependent
The output of one “mapper” should not depend on the
output of, or communication with, another “mapper”
The reduce phase doesn’t begin execution until all mappers
have finished
Failed map and reduce tasks are automatically restarted
Rack/HDFS aware (data locality)
31. 31
Each mapper processes one file block
Word Count
Input: I am a tiger, you are also a tiger
map: each mapper emits (word, 1) pairs for its block, e.g.
(I,1) (am,1) (a,1) (tiger,1) (you,1) (are,1) (also,1) (a,1) (tiger,1)
Shuffle & Sort: pairs are grouped by word across all mappers, e.g.
(a,1)(a,1) (also,1) (am,1) (are,1) (I,1) (tiger,1)(tiger,1) (you,1)
reduce: each reducer sums and counts per word
Output: a,2  also,1  am,1  are,1  I,1  tiger,2  you,1
32. 32
Data Locality
[Diagram] Data locality in M/R: TaskTrackers run on the same
machines as DataNodes; the legend distinguishes tasks of one job,
tasks of a different job, and idle slots, spread across one rack and
a different rack.
34. 34
Pig
Framework and language (Pig Latin) for
creating and submitting Hadoop
MapReduce jobs
Common data operations like join, group
by, filter, sort, select, etc. are provided
Don't need to know Java
Remove boilerplate aspect from M/R
200 lines in Java -> 15 lines in Pig
Feels like SQL
35. 35
Pig
Fact from Wiki: 40% of Yahoo's M/R jobs
are in Pig
An interactive shell [grunt] exists
User Defined Functions [UDF]
Allows you to specify Java code where the logic
is too complex for Pig Latin
UDFs can be part of almost every operation in Pig
Great for loading and storing custom formats as
well as transforming data
36. 36
Pig Relational Operations
COGROUP, CROSS, DISTINCT, FILTER, FOREACH, GROUP,
JOIN, LIMIT, LOAD, MAPREDUCE, ORDER BY, SAMPLE,
SPLIT, STORE, STREAM, UNION
39. 39
What HBase is
No-SQL (means Non-SQL, not SQL sucks)
Good at fast/streaming writes
Fault tolerant
Good at linear horizontal scalability
Very efficient at managing billions of rows and
millions of columns
Good at keeping row history
Good at auto-balancing
A complement to a SQL/DW
Great with non-normalized data
40. 40
What HBase is NOT
Made for table joins
Made for splitting into normalized tables
A replacement for RDBMS
Great for storing small amount of data
Great for large binary data (prefer < 1MB per
cell)
ACID compliant
41. 41
Data Model
The simple view is a map
Table: similar to a relational DB table
Row Key: rows are identified and sorted by key
Row Value
[Diagram] A table of rows (row1, row2, row3), each consisting of a key and a row value.
42. 42
Data Model (Columns)
Table:
Row: multiple columns in row’s value
[Diagram] The same table, with each row's value made up of multiple columns (Column1 … Column4).
43. 43
Data Model (Column Family)
Table:
Row:
Column Family: columns are grouped into families. A column
family must be predefined with a column family prefix, e.g.
“privateData”, when creating the schema
Column: a column is denoted using family+qualifier, e.g.
“privateData:mobilePhone”.
[Diagram] The same table, with each row's columns grouped under Column Family 1 and Column Family 2.
44. 44
Data Model (Sparse)
Table:
Row:
Column Family:
Column: columns can be added to an existing column family on the fly.
Rows can have widely varying numbers of columns.
[Diagram] Rows in the same table hold different columns: row1 has Column1 to Column4, row2 has Column1, Column5, and Column4, and row3 has Column3 and Column6.
45. 45
HBase Architecture
The master keeps track of the
metadata for Region Servers and the
Regions they serve, and stores it in
ZK (ZooKeeper)
The HBase client
communicates with ZK
only to get region info
All HBase data (HLog & HFile) is stored on HDFS
48. 48
Hive – How it works
[Architecture diagram] Clients (Web UI, CLI, JDBC/ODBC) submit
queries to the Driver (compiler, optimizer, executor), which consults
the metastore and creates M/R jobs that run on the Data Nodes of
the Hadoop cluster.
49. 49
Enterprise Applications of Hadoop
[Architecture diagram] Unstructured data (web, logs, crawlers,
sensors/devices) flows into the Etu Appliance and is accessed
through Hive QL; structured data from the RDBMS, fed by ERP,
CRM, and LOB apps, is accessed through SQL; connectors bridge
the two sides, and familiar end-user tools (SSRS, SSAS, Hyperion,
PowerView, Excel with PowerPivot, embedded BI, predictive
analytics) sit on top.
56. 56
Challenges for Enterprises Adopting a Hadoop Technology Stack
• Technology/talent gap
1. Enterprises are generally unfamiliar with the Hadoop architecture
2. Hadoop cluster planning, deployment, management,
and system tuning have a high technical barrier
• Professional services gap
1. Lack of local, professional, hands-on
Hadoop consulting services
2. Lack of vendors that can provide complete Big Data
solution design, implementation, and maintenance
The market is still in its early stage
Helping you cross the Big Data chasm
57. 57
Introducing – Etu Appliance 2.0
A purpose-built, high-performance appliance for big
data processing
• Automates cluster deployment
• Optimizes for the highest performance of big data processing
• Delivers availability and security
58. 58
Key Benefits to Hadoopers
• Fully automated deployment and configuration
Simplifies configuration and deployment
• Reliably deploy and run mission-critical big data applications
High availability made easy
• Process and control sensitive data with confidence
Enterprise-grade security
• Fully optimized operating system to boost your data processing
performance
Boost big data processing
• Adapts to your workload and grows with your business
Provides a scalable and extensible foundation
59. 59
What’s New in Etu Appliance 2.0
• New deployment feature – Auto deployment and
configuration for master node high availability
• New security features – LDAP integration and
Kerberos authentication
• New data source – Introduce Etu™ Dataflow data
collection service with built-in Syslog and FTP servers
for better integration with existing IT infrastructure
• New user experience – new Etu™ Management
Console with HDFS file browser and HBase table
management
60. 60
Etu Appliance 2.0 – Hardware Specification
Master Node – Etu 1000M
CPU: 2 x 6 Core
RAM: 48 GB ECC
HDD: 300GB/SAS 3.5”/15K RPM x 2 (RAID 1)
NIC: Dual Port/1 Gb Ethernet x 1
S/W: Etu™ OS/Etu™ Big Data Software Stack
Power: Redundant Power / 100V~240V
Worker Node – Etu 1000W
CPU: 2 x 6 Core
RAM: 48 GB ECC
HDD: 2TB/SATA 3.5”/7.2K RPM x 4
NIC: Dual Port/1 Gb Ethernet x 1
S/W: Etu™ OS/Etu™ Big Data
Software Stack
Power: Single Power /100V~240V
Worker Node – Etu 2000W
CPU: 2 x 6 Core
RAM: 48GB
HDD: 2TB/SATA 3.5”/7.2K RPM x 8
NIC: Dual Port/1 Gb Ethernet x 1
S/W: Etu™ OS/Etu™ Big Data
Software Stack
Power: Single Power / 100V~240V
61. 61
Sqoop : SQL to Hadoop
• What is Sqoop ?
• Sqoop import to HDFS
• Sqoop import to Hive
• Sqoop import to HBase
• Sqoop Incremental Imports
• Sqoop export
62. 62
What is Sqoop
• A tool designed to transfer data between Hadoop and
relational databases
• Uses MapReduce to import and export data
• Provides parallel operation
• Fault Tolerance, of course!
63. 63
How it works
[Diagram] Sqoop takes the SQL statement, creates Map tasks, and
each map task pulls its share of the data over JDBC in parallel and
writes it to HDFS/Hive/HBase.
66. 66
Using Options Files
$ sqoop import --connect jdbc:mysql://etu-master/db ...
Or
$ sqoop --options-file ./import.txt --table TEST
The options file contains:
import
--connect
jdbc:mysql://etu-master/db
--username
root
--password
etuadmin
67. 67
Sqoop Import
Command: sqoop import (generic-args) (import-args)
Arguments:
• --connect <jdbc-uri> Specify JDBC connect string
• --driver <class-name> Manually specify JDBC driver class to use
• --help Print usage instructions
• -P Read password from console
• --password <password> Set authentication password
• --username <username> Set authentication username
• --verbose Print more information while working
68. 68
Import Arguments
• --append Append data to an existing dataset in HDFS
• -m,--num-mappers <n> Use n map tasks to import in parallel
• -e,--query <statement> Import the results of statement.
• --table <table-name> Table to read
• --target-dir <dir> HDFS destination dir
• --where <where clause> WHERE clause to use during import
• -z,--compress Enable compression
69. 69
Let’s try it!
Please refer to L2 training note:
• Import nyse_daily to HDFS
• Import nyse_dividends to HDFS
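A rough sketch of what the first import might look like (the connection string reuses the earlier options-file example; the target directory and mapper count are illustrative, and the L2 training note is authoritative):
$ sqoop import --connect jdbc:mysql://etu-master/db --username root -P \
  --table nyse_daily --target-dir /user/etu/nyse_daily -m 4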
70. 70
Incremental Import
• Sqoop supports incremental imports (see the sketch below)
• Argument
--check-column : column to be examined for importing
--incremental : append or lastmodified
--last-value : max value from previous import
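For illustration only, reusing the options file from the earlier slide (the check column and last value are hypothetical):
$ sqoop --options-file ./import.txt --table TEST \
  --incremental append --check-column id --last-value 100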
71. 71
Import All Tables
• The sqoop import-all-tables tool supports importing a set of tables from an
RDBMS to HDFS (see the sketch below).
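A minimal sketch (the database URL and destination directory are assumptions):
$ sqoop import-all-tables --connect jdbc:mysql://etu-master/db \
  --username root -P --warehouse-dir /user/etu/db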
72. 72
Sqoop to Hive
• Sqoop supports Hive
• Add the following arguments when importing (example below)
--hive-import : make Hive the Sqoop target
--hive-table : specify the target table name in Hive
--hive-overwrite : overwrite if the table already exists
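For example (the table name comes from the earlier exercise; other values are illustrative):
$ sqoop import --connect jdbc:mysql://etu-master/db --username root -P \
  --table nyse_dividends --hive-import --hive-table nyse_dividends --hive-overwrite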
73. 73
Sqoop to HBase
--column-family <family> Sets the target column family for the import
--hbase-create-table If specified, create missing HBase tables
--hbase-row-key <col> Specifies which input column to use as the
row key
--hbase-table <table-name> Specifies an HBase table to use as the target
instead of HDFS
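A hedged example (the HBase table, column family, and row-key column are hypothetical):
$ sqoop import --connect jdbc:mysql://etu-master/db --username root -P \
  --table TEST --hbase-table test_table --column-family cf \
  --hbase-row-key id --hbase-create-table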
74. 74
Sqoop Export
• The target table must already exist
• The default operation is INSERT; you can specify
UPDATE instead (see the sketch after the argument list below)
• Syntax : sqoop export (generic args) (export-args)
75. 75
Export Arguments
--export-dir <dir> HDFS source path for export
--table <name> Table to populate
--update-key Anchor column to use for updates.
--update-mode updateonly or allowinsert
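Putting the export arguments together, a rough sketch (the table, directory, and key column are assumptions):
$ sqoop export --connect jdbc:mysql://etu-master/db --username root -P \
  --table TEST --export-dir /user/etu/TEST \
  --update-key id --update-mode allowinsert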
79. 79
Pig Programming
• Introduction to Pig
• Reading and Writing Data with Pig
• Pig Latin Basics
• Debugging Pig Scripts
• Pig Best Practices
• Pig and HBase
80. 80
Pig Introduction
• Pig was originally created at Yahoo! to answer a
need similar to the one Hive addresses
– Many developers did not have the Java and/or MapReduce
knowledge required to write standard MapReduce programs
– But still needed to query data
• Pig is a dataflow language
– Language is called Pig Latin
– Relatively simple syntax
– Under the covers, Pig Latin scripts are turned into MapReduce
jobs and executed on the cluster
81. 81
Pig Features
• Pig supports many features which allow developers to
perform sophisticated data analysis without having to
write Java MapReduce code
– Joining datasets
– Grouping data
– Referring to elements by position rather than name
• Useful for datasets with many elements
– Loading non-delimited data using a custom SerDe
– Creation of user-defined functions, written in Java
– And more
82. 82
Pig Word Count
Book = LOAD 'shakespeare/*' USING PigStorage() AS (lines:chararray);
Wordlist = FOREACH Book GENERATE FLATTEN(TOKENIZE(lines)) as word;
GroupWords = GROUP Wordlist BY word;
CountGroupWords = FOREACH GroupWords GENERATE group as word,
COUNT(Wordlist) as num_occurence;
WordCountSorted = ORDER CountGroupWords BY $1 DESC;
STORE WordCountSorted INTO 'wordcount' USING PigStorage(',');
83. 83
Pig Data Types
• Scalar Types
– int
– long
– float
– double
– chararray
– bytearray
• Complex Types
– tuple ex. (19,2,3)
– bag ex. {(19,2), (18,1)}
– map ex. [open#apache]
• NULL
84. 84
Pig Data Type Concepts
• In Pig, a single element of data is an atom
• A collection of atoms – such as a row, or a partial row
– is a tuple
• Tuples are collected together into bags
• Typically, a Pig Latin script starts by loading one or
more datasets into bags, and then creates new bags
by modifying those it already has
85. 85
Pig Schema
• Pig eats everything
– If schema is available, Pig will make use of it
– If schema is not available, Pig will make the best guesses it can
based on how the script treats the data
A = LOAD 'text.csv' as (field1:chararray, field2:int);
• In the example above, Pig will expect this data to have
2 fields with specified data types
– If there are more fields, they will be truncated
– If there are fewer fields, the missing ones will be NULL
86. 86
Pig Latin: Data Input
• The function is LOAD
sample = LOAD 'text.csv' as (field1:chararray, field2:int);
• In the example above
– sample is the name of the relation
– The file text.csv is loaded
– Pig will expect this data to have 2 fields with the specified data
types
• If there are more fields, they will be truncated
• If there are fewer fields, the missing ones will be NULL
87. 87
Pig Latin: Data Output
• STORE – Output a relation into a specified HDFS
folder
STORE sample_out INTO '/tmp/output';
• DUMP – Output a relation to screen
DUMP sample_out;
88. 88
Pig Latin: Relational Operations
• FOREACH
• FILTER
• GROUP
• ORDER BY
• DISTINCT
• JOIN
• LIMIT
89. 89
Pig Latin: FOREACH
• FOREACH takes a set of expressions and applies them
to every record in the data pipeline, and generates
new records to send down the pipeline to the next
operator.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address);
b = FOREACH a GENERATE id, name;
90. 90
Pig Latin: FILTER
• FILTER allows you to select which records will be
retained in your data pipeline.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address);
b = FILTER a BY id matches '100*';
91. 91
Pig Latin: GROUP
• GROUP statement collects together records with the
same key.
• It is different from the GROUP BY clause in SQL, as in
Pig Latin there is no direct connection between GROUP
and aggregate functions.
• GROUP collects all records with the key provided into
a bag and then you can pass this to an aggregate
function.
92. 92
Pig Latin: GROUP (cont)
Example:
A = LOAD 'text.csv' as (id, name, phone, zip, address);
B = GROUP A BY zip;
C = FOREACH B GENERATE group, COUNT(A.id);
STORE C INTO 'population_by_zipcode';
93. 93
Pig Latin: ORDER BY
• ORDER statement sorts your data for you by the field
specified.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address, fee);
b = ORDER a BY fee;
c = ORDER a BY fee DESC, name;
DUMP c;
94. 94
Pig Latin: DISTINCT
• DISTINCT statement removes duplicate records. Note
it works only on entire records, not on individual
fields.
• Example:
a = LOAD 'url.csv' as (userid, url, dl_bytes, ul_bytes);
b = FOREACH a GENERATE userid, url;
c = DISTINCT b;
95. 95
Pig Latin: JOIN
• JOIN selects records from one input to put together
with records from another input. This is done by
indicating keys from each input, and when those keys
are equal, the two rows are joined.
• Example:
call = LOAD 'call.csv' as (MSISDN, callee, duration);
user = LOAD 'user.csv' as (name, MSISDN, address);
call_bill = JOIN call BY MSISDN, user BY MSISDN;
bill = FOREACH call_bill GENERATE user::name, call::MSISDN, callee,
duration, address;
STORE bill INTO 'to_be_billed';
96. 96
Pig Latin: LIMIT
• LIMIT allows you to limit the number of results in the
output.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address, fee);
b = ORDER a BY fee DESC, name;
top100 = LIMIT b 100;
DUMP top100;
97. 97
Pig Latin: UDF
• UDFs (User Defined Functions) let users combine Pig
operators along with their own or others' code.
• UDFs can be written in Java and Python.
• UDFs have to be registered before use.
• Piggybank is useful
• Example:
register 'path_to_UDF/piggybank.jar';
a = LOAD 'text.csv' as (id, name, phone, zip, address, fee);
b = FOREACH a GENERATE id,
org.apache.pig.piggybank.evaluation.string.Reverse(name);
98. 98
Debugging Pig
• DESCRIBE
– Show the schema of a relation in your scripts
• EXPLAIN
– Show your script's execution plan in MapReduce terms
• ILLUSTRATE
– Run scripts with sampled data
• Pig Statistics
– A summary set of statistics on your script
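For instance, in the grunt shell (the relation and file reuse the earlier examples and are purely illustrative):
grunt> a = LOAD 'text.csv' AS (id, name, phone, zip, address);
grunt> DESCRIBE a;
grunt> ILLUSTRATE a;
grunt> EXPLAIN a;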
99. 99
More about Pig
• Visit Pig's Home Page http://pig.apache.org
• http://pig.apache.org/docs/r0.9.2/
101. 101
Hive Programming and Practice
• Hive Introduction
• Getting Data into Hive
• Manipulating Data with Hive
• Partitioning and Bucketing Data
• Hive Best Practices
• Hive and HBase
102. 102
Hive: Introduction
• Hive was originally developed at Facebook
– Provide a very SQL-like language
– Can be used by people who know SQL
– Under the covers, generates MapReduce jobs that run on the
Hadoop cluster
– Enabling Hive requires almost no extra work by the system
administrator
105. 105
The Hive Data Model
• Hive 'layers' table definitions on top of data in HDFS
• Databases
• Tables
– Typed columns (int, float, string, boolean, etc)
– Also complex types such as list and map (for JSON-like data)
• Partition
• Buckets
106. 106
Hive Datatypes : Primitive Types
• TINYINT (1 byte signed integer)
• SMALLINT (2 bytes signed integer)
• INT (4 bytes signed integer)
• BIGINT (8 bytes signed integer)
• BOOLEAN (TRUE or FALSE)
• FLOAT (single precision floating point)
• DOUBLE (Double precision floating point)
• STRING (Array of Char)
• BINARY (Array of Bytes)
• TIMESTAMP (integer, float or string)
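For illustration, a hypothetical table using several of these types:
CREATE TABLE user_profile (
id BIGINT,
name STRING,
score DOUBLE,
active BOOLEAN,
created TIMESTAMP
);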
108. 108
Text File Delimiters
• By default, Hive stores data as text files, BUT you can
choose other file formats.
• Hive's default record and field delimiters:
CREATE TABLE …
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
109. 109
The Hive Metastore
• Hive's Metastore is a database containing table
definitions and other metadata
– By default, stored locally on the client machine in a Derby
database
– If multiple people will be using Hive, the system administrator
should create a shared Metastore
• Usually in MySQL or some other relational database server
110. 110
Hive is Schema on Read
• Relational Database
– Schema on Write
– Gatekeeper
– Altering the schema is painful!
• Hive
– Schema on Read
– Requires less ETL efforts
111. 111
Hive Data: Physical Layout
• Hive tables are stored in Hive's 'warehouse' directory
in HDFS
– By default, /user/hive/warehouse
• Tables are stored in subdirectories of the warehouse
directory
– Partitions form subdirectories of tables
• Possible to create external tables if the data is already
in HDFS and should not be moved from its current
location
• The actual data is stored in flat files
– Control character-delimited text or SequenceFiles
– Can be arbitrary format with the use of a custom
Serializer/Deserializer(“SerDe”)
112. 112
Hive Limitations
• Not all “standard” SQL is supported
– No correlated subqueries, for example
• No support for UPDATE or DELETE
• No support for INSERT single rows
• Relatively limited number of built-in functions
113. 113
Starting The Hive Shell
• To launch the Hive shell, start a terminal and run
– $ hive
• Results in the Hive prompt:
– hive>
• Autocomplete – Tab
• Query Column Headers
– hive> set hive.cli.print.header=true;
– hive> set hive.cli.print.current.db=true;
114. 114
Hive’s Word Count
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY count DESC;
SELECT * FROM word_counts LIMIT 30;
116. 116
Data Definition
• Database
– CREATE/DROP
– ALTER (set DBPROPERTIES, name-value pair)
– SHOW/DESCRIBE
– USE
• Table
– CREATE/DROP
– ALTER
– SHOW/DESCRIBE
– CREATE EXTERNAL TABLE
117. 117
Creating Tables
• CREATE TABLE IF NOT EXISTS table name …
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
• CREATE EXTERNAL TABLE …
LOCATION '/user/mydata'
118. 118
Creating Tables
hive> SHOW TABLES;
hive> CREATE TABLE shakespeare
(freq INT, word STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY
'\t' STORED AS TEXTFILE;
hive> DESCRIBE shakespeare;
119. 119
Modify Tables
• ALTER TABLE … CHANGE
COLUMN old_name new_name type
AFTER column;
• ALTER TABLE … (ADD|REPLACE)
COLUMNS (column_name column type);
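For example, against the employees table used later in this section (the new column names are hypothetical):
ALTER TABLE employees CHANGE COLUMN name full_name STRING AFTER salary;
ALTER TABLE employees ADD COLUMNS (dept STRING);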
120. 120
Partition
• Help to organize data in a logical fashion, such as
hierarchically.
• CREATE TABLE …
PARTITIONED BY (column name datatype, …)
• CREATE TABLE employees (
name STRING,
salary FLOAT)
PARTITIONED BY (country STRING, state STRING)
• Physical Layout in Hive
…/employees/country=CA/state=AB
…/employees/country=CA/state=BC
…
122. 122
Loading Data into Hive
• LOAD DATA INPATH … INTO TABLE … PARTITION …
• Data is loaded into Hive with LOAD DATA INPATH
statement
– Assumes that the data is already in HDFS
LOAD DATA INPATH 'shakespeare_freq' INTO TABLE
shakespeare;
• If the data is on the local filesystem, use LOAD DATA
LOCAL INPATH
123. 123
Inserting Data into Table from
Queries
• INSERT OVERWRITE TABLE employees
PARTITION (country='US', state='OR')
SELECT * FROM staged_employees se
WHERE se.cnty='US' AND se.st='OR';
125. 125
Dynamic Partition Inserts
• What if I have so many partitions ?
• INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT …, se.cnty, se.st
FROM staged_employees se;
• You can mix static and dynamic partition, for example:
• INSERT OVERWRITE TABLE employees
PARTITION (country='US', state)
SELECT …, se.cnty, se.st
FROM staged_employees se
WHERE se.cnty = 'US';
126. 126
Create Table and Loading Data
• CREATE TABLE ca_employees
AS SELECT name, salary
FROM employees se
WHERE se.state='CA';
127. 127
Storing Output Results
• The SELECT statement on the previous slide would
write the data to console
• To store the result in HDFS, create a new table then
write, for example:
INSERT OVERWRITE TABLE newTable SELECT
s.word, s.freq, k.freq FROM shakespeare s JOIN
kjv k ON (s.word = k.word) WHERE s.freq >= 5;
• Results are stored in the table
• Results are just files within the newTable directory
– Data can be used in subsequent queries, or in MapReduce jobs
128. 128
Exporting Data
• If the data files are already formatted as you want,
just copy them.
• Or you can use INSERT … DIRECTORY …, for example
• INSERT OVERWRITE
LOCAL DIRECTORY './ca_employees'
SELECT name, salary, address
FROM employees se
WHERE se.state='CA';
130. 130
SELECT … FROM
• SELECT col_name or functions FROM tab_name;
hive> SELECT name FROM employees e;
• SELECT … FROM … [LIMIT N]
– * or Column alias
– Column Arithmetic Operators, Aggregation Function
– FROM
131. 131
Arithmetic Operators
Operator Types Description
A + B Numbers Add A and B
A - B Numbers Subtract B from A
A * B Numbers Multiply A and B
A / B Numbers Divide A with B
A % B Numbers The remainder of dividing A with B
A & B Numbers Bitwise AND
A | B Numbers Bitwise OR
A ^ B Numbers Bitwise XOR
~A Numbers Bitwise NOT of A
134. 134
When Hive Can Avoid Map Reduce
• SELECT * FROM employees;
• SELECT * FROM employees
WHERE country='us' AND state='CA'
LIMIT 100;
135. 135
WHERE
• >, <, =, >=, <=, !=
• IS NULL/IS NOT NULL
• OR AND NOT
• LIKE
– X% (prefix 'X')
– %X (suffix 'X')
– %X% (substring)
– _ (single character)
• RLIKE (Java Regular Expression)
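A small combined example (the predicate values are made up):
SELECT name, salary FROM employees
WHERE salary >= 50000
AND name LIKE 'A%'
AND address IS NOT NULL;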
136. 136
GROUP BY
• Often used in conjunction with aggregate functions,
avg, count, etc.
• HAVING
– constrains the groups produced by GROUP BY in a way that
could otherwise be expressed with a subquery.
SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange='NASDAQ' AND symbol='AAPL'
GROUP BY year(ymd)
HAVING avg(price_close) > 50.0;
137. 137
JOIN
• Inner JOIN
• LEFT OUTER JOIN
• RIGHT OUTER JOIN
• FULL OUTER JOIN
• LEFT SEMI-JOIN
• Map-side Joins
138. 138
Inner JOIN
• SELECT a.ymd, a.price_close, b.price_close
FROM stocks a JOIN stocks b ON a.ymd=b.ymd
WHERE a.symbol='AAPL' AND b.symbol='IBM'
• SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd=d.ymd
AND s.symbol=d.symbol
WHERE s.symbol='AAPL'
139. 139
LEFT OUTER JOIN
• All records from the left-hand table that match the WHERE
clause are returned; NULL is used when there is no match on the ON criteria
• SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s LEFT OUTER JOIN dividends d
ON s.ymd=d.ymd AND s.symbol = d.symbol
WHERE s.symbol='AAPL'
140. 140
RIGHT and FULL OUTER JOIN
• RIGHT OUTER JOIN
– All records from the right-hand table that match the WHERE clause
are returned; NULL is used when there is no match on the ON criteria
• FULL OUTER JOIN
– All records from both tables that match the WHERE clause are
returned; NULL is used when there is no match on the ON criteria
141. 141
LEFT SEMI-JOIN
• Returns records from lefthand table if records are
found in righthand table that satisfy the ON
predicates.
• SELECT s.ymd, s.symbol, s.price_close
FROM stocks s LEFT SEMI JOIN dividends d
ON s.ymd = d.ymd AND s.symbol = d.symbol
• RIGHT SEMI-JOIN is not supported
142. 142
Map-side Joins
• If one of the tables is small, the large table can be
streamed through the mappers while the small table is
cached in memory.
• SELECT /*+ MAPJOIN(d)*/ s.ymd, s.symbol,
s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd=d.ymd
AND s.symbol=d.symbol
WHERE s.symbol='AAPL'
143. 143
ORDER BY and SORT BY
• ORDER BY performs a total ordering of query result
set.
• All data is passed through a single reducer; use with
caution for larger data sets. For example:
– SELECT s.ymd, s.symbol, s.price_close FROM stocks s ORDER
BY s.ymd ASC, s.symbol DESC;
• SORT BY performs a local ordering, where each
reducer's output will be sorted.
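For contrast, a hedged SORT BY example (per-reducer ordering only; the columns follow the stocks example above):
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
SORT BY s.symbol ASC, s.ymd ASC;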
144. 144
DISTRIBUTE BY with SORT BY
• By default, MapReduce partitions mapper output by a
hash of the key, so rows with the same column value can
end up in different reducers.
• We can use DISTRIBUTE BY to ensure that records
with the same column value go to the same reducer, and use
SORT BY to order the data within each reducer.
• SELECT s.ymd, s.symbol, s.price_close FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC, s.ymd ASC
• When used together, DISTRIBUTE BY must appear before SORT BY
145. 145
CLUSTER BY
• Short-hand for DISTRIBUTE BY … SORT BY
• CLUSTER BY does not give a total ordering; it sorts within
each reducer, in ascending order only
• SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
CLUSTER BY s.symbol;
146. 146
Creating User-Defined Functions
• Hive supports manipulation of data via user-created
functions
• Example:
INSERT OVERWRITE TABLE u_data_new SELECT TRANSFORM
(userid, movieid, rating, unixtime) USING 'python
weekday_mapper.py' AS (userid, movieid, rating, weekday)
FROM u_data;
147. 147
Hive: Where to Learn More
• http://hive.apache.org/
• Programming Hive
148. 148
Choosing Between Pig and Hive
• Typically, organizations wanting an abstraction on top
of standard MapReduce will choose to use either Hive
or Pig
• Which one is chosen depends on the skillset of the
target users
– Those with an SQL background will naturally gravitate towards
Hive
– Those who do not know SQL will often choose Pig
• Each has strengths and weaknesses; it is worth
spending some time investigating each so you can
make an informed decision
• Some organizations are now choosing to use both
– Pig deals better with less-structured data, so Pig is used to
manipulate the data into a more structured form, then Hive is
used to query that structured data
149. www.etusolution.com
info@etusolution.com
Taipei, Taiwan
318, Rueiguang Rd., Taipei 114, Taiwan
T: +886 2 7720 1888
F: +886 2 8798 6069
Beijing, China
Room B-26, Landgent Center,
No. 24, East Third Ring Middle Rd.,
Beijing, China 100022
T: +86 10 8441 7988
F: +86 10 8441 7227
Contact