Etu Big Data Advanced Hands-On
Enterprise Application Workshop
2
• Introduction to Hadoop and Big Data Processing
• Sqoop: Introduction and Hands-On
• Pig Programming and Hands-On
• Hive Programming and Hands-On
Etu Big Data Enterprise Application Workshop
3
Introduction to Hadoop and Big Data Processing
4
Big Data
Happening right now…
Structured vs. unstructured
Generated at high speed
Hard to process with existing technology
5
The Big Data era has arrived
Structured
•Relational Database
•File in record format
Semi-structured
•XML
•Logs
•Click-stream
•Equipment / Device
•RFID tag
Unstructured
•Web Pages
•E-mail
•Multimedia
•Instant Messages
•More Binary Files
Mobile / Internet
Internet of Things
6
RELEVANT
7
The key is…
Personalization
8
Mining gold from data
Data mining
Clustering and classification
Product recommendation
Targeted advertising
Algorithms
Behavior prediction
9
Lots of un-/semi-structured data,
to be processed within a bounded time,
and at a reasonable cost
The 30-character maxim
Volume Variety
Velocity
10
Data too big for traditional methods to handle
The 12-character maxim
11
Types of data
11
Social Media, Machine/Sensor, Doc/Media, Web Clickstream, Apps, Call Log/xDR, Log
12
Scale Up vs. Scale Out
Scale Up (up to TB): file system → ETL tools or scripts → relational database
Scale Out (TB to PB): distributed file systems → parallel computing → NoSQL
(layers: raw data → data processing → query / application)
13
Big Data & Hadoop
14
Hadoop and Big Data Processing
Relational databases & DW handle structured data (~15%)
A heterogeneous data processing platform handles unstructured data (~85%)
15
What is ?
 Framework for running distributed applications on
large clusters built of commodity hardware
 Originally created by Doug Cutting
 OSS implementation of Google's MapReduce and GFS
 Hardware failures assumed in design
 Fault-tolerant via replication
 The name of “Hadoop” has now evolved to cover a
family of software, but the core is essentially
MapReduce and a distributed file system
16
Why ?
 Need to process lots of data (up to petabyte scale)
 Need to parallelize processing across a multitude of CPUs
 Achieve the above while Keeping Software Simple
 Give scalability with low-cost commodity hardware
 Achieve linear scalability
17
What is Hadoop used for ?
17
 Searching
 Log Processing
 Recommendation Systems
 Business Intelligence / Data Warehousing
 Video and Image Analysis
 Archiving
18
Hadoop 不只是 Hadoop
[Diagram: Big Data Applications sit on top of the Hadoop stack – a distributed file system (HDFS), a parallel computing framework (MapReduce), a SQL-like data warehouse system for non-real-time queries (Hive), a distributed real-time database (HBase), a data processing language (Pig), a Data Mining library (Mahout), all coordinated by ZooKeeper; both raw/unstructured data and SQL data are imported into the platform.]
19
Hadoop is a whole ecosystem
 ZooKeeper – coordination and management service
 HBase – distributed real-time database
 HIVE – Hadoop's data warehouse system
 Pig – Hadoop's dataflow processing language
 Mahout – Hadoop's data mining library
 Sqoop – data transfer tool between Hadoop and relational databases
19
20
HDFS – Distributed File System
21
HDFS Overview
 Hadoop Distributed File System
 Based on Google's GFS (Google File System)
 Master/slave architecture
 Write once read multiple times
 Fault tolerant via replication
 Optimized for larger files
 Focus on streaming data (high-throughput > low
latency)
 Rack-aware (reduce inter-cluster network I/O)
21
22
HDFS Client API’s
 “Shell-like” commands ( hadoop dfs [cmd] )
22
 Native Java API
 Thrift API for other languages
 C++, Java, Python, PHP, Ruby, C#
cat chgrp chmod chown
copyFromLocal copyToLocal cp du,dus
expunge get getmerge ls,lsr
mkdir moveFromLocal mv put
rm,rmr setrep stat tail
test text touchz
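A few usage sketches of these shell-like commands; the paths and replication factor are illustrative, not from the slides:
$ hadoop dfs -mkdir /user/etu/demo                      # create an HDFS directory
$ hadoop dfs -copyFromLocal sales.csv /user/etu/demo/   # copy a local file into HDFS
$ hadoop dfs -ls /user/etu/demo                         # list the directory
$ hadoop dfs -cat /user/etu/demo/sales.csv              # print the file contents
$ hadoop dfs -setrep 3 /user/etu/demo/sales.csv         # change the replication factor to 3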
23
HDFS Architecture-Read
23
[Diagram: a Read client asks the NameNode for a file's metadata (name, replicas, block_id) and gets back block locations; it then transfers data directly from the DataNodes' local disks. The NameNode coordinates block operations (heartbeat, replication, re-balancing) with the DataNodes.]
24
HDFS Architecture-Write
24
[Diagram: a Write client sends the file name, block size and replication factor to the NameNode, which returns the DataNodes to write to; the client then writes data directly to those DataNodes' local disks.]
25
HDFS Fault Tolerance
26
Fault Tolerance
26
[Diagram: when a DataNode fails, the NameNode automatically re-replicates its blocks to the remaining DataNodes.]
27
MapReduce – Parallel Processing Framework
28
MapReduce Overview
28
 Distributed programming paradigm and the framework
that is an OSS implementation of Google's MapReduce
 Modeled on the ideas behind functional
programming's map() and reduce()
 Distributed across as many nodes as you would like
 2 phase process:
map() reduce()
sub divide
& conquer
combine &
reduce
29
MapReduce ABC’s
29
 Essentially, it's…
A. Take a large problem and divide it into sub-problems
B. Perform the same function on all sub-problems
C. Combine the output from all sub-problems
 M/R is excellent for problems where the "sub-
problems" are NOT interdependent
 The output of one "mapper" should not depend on the
output of, or communication with, another "mapper"
 The reduce phase doesn't begin execution until all mappers
have finished
 Failed map and reduce tasks get automatically restarted
 Rack/HDFS aware (data locality)
30
MapReduce Flow
[Diagram: the input file in HDFS is divided into splits (split 0–4); each map task emits keyed records (K1, K2); the framework sorts, copies and merges them by key; the reduce tasks write output partitions (part0, part1) back to HDFS.]
31
Each mapper processes one file block
31
Word Count
Input: "I am a tiger, you are also a tiger"
map: each mapper emits (word, 1) pairs for its split, e.g. (I,1) (am,1) (a,1) (tiger,1) (you,1) (are,1) (also,1) (a,1) (tiger,1)
Shuffle & Sort: pairs are grouped by word, e.g. (a,1)(a,1) … (tiger,1)(tiger,1)
reduce: sums the counts per word → a,2 also,1 am,1 are,1 I,1 tiger,2 you,1
32
Data Locality
[Diagram: M/R TaskTrackers run on the same machines as DataNodes; within a rack and across racks, a job's tasks are scheduled on the nodes that hold its data where possible, while other nodes run different jobs or sit idle.]
33
Do you have to know Java?
34
Pig
 Framework and language (Pig Latin) for
creating and submitting Hadoop
MapReduce jobs
 Common data operations like join, group
by, filter, sort, select, etc. are provided
 Don't need to know Java
 Removes the boilerplate aspects of M/R
 200 lines in Java -> 15 lines in Pig
 Feels like SQL
34
35
Pig
 Fact from Wiki: 40% of Yahoo's M/R jobs
are in Pig
 Interactive shell [grunt] exist
 User Defined Functions [UDF]
 Allows you to specify Java code where the logic
is too complex for Pig Latin
 UDFs can be part of almost every operation in Pig
 Great for loading and storing custom formats as
well as transforming data
35
36
COGROUP, CROSS, DISTINCT, FILTER, FOREACH, GROUP, JOIN, LIMIT, LOAD,
MAPREDUCE, ORDER BY, SAMPLE, SPLIT, STORE, STREAM, UNION
36
Pig Relational Operations
37
Example Pig Script
37
Taken from Pig Wiki
38
HBase – The Big Table
39
What HBase is
 No-SQL (means Non-SQL, not SQL sucks)
 Good at fast/streaming writes
 Fault tolerant
 Good at linear horizontal scalability
 Very efficient at managing billions of rows and
millions of columns
 Good at keeping row history
 Good at auto-balancing
 A complement to a SQL/DW
 Great with non-normalized data
39
40
What HBase is NOT
 Made for table joins
 Made for splitting into normalized tables
 A replacement for RDBMS
 Great for storing small amounts of data
 Great for large binary data (prefer < 1MB per
cell)
 ACID compliant
40
41
Data Model
 The simplest view is a map (key → row value)
 Table: similar to a relational db table
 Row Key: row is identified and sorted by key
 Row Value
Key Row Value
Key Row Value
Key Row Value
Table
row1
row2
row3
42
Data Model (Columns)
 Table:
 Row: multiple columns in row’s value
 Column
Key Column1 Column2
Key
Key
Table
row1
row2
row3
Column3 Column4
Column1 Column2 Column3 Column4
Column1 Column2 Column3 Column4
43
Data Model (Column Family)
 Table:
Row:
Column Family: columns are grouped into families. A column
family must be predefined with a column family prefix, e.g.
"privateData", when creating the schema (see the HBase shell
sketch after the diagram)
 Column: a column is denoted using family + qualifier, e.g.
"privateData:mobilePhone".
Key Column1 Column2
Key
Key
Table
row1
row2
row3
Column3 Column4
Column1 Column2 Column3 Column4
Column1 Column2 Column3 Column4
Column Family 1
Column Family 1
Column Family 1
Column Family 2
Column Family 2
Column Family 2
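A minimal HBase shell sketch of this model; the table, family and column names are illustrative only, not from the slides:
hbase> create 'users', 'privateData', 'publicData'              # table with two column families
hbase> put 'users', 'row1', 'privateData:mobilePhone', '0912345678'
hbase> put 'users', 'row1', 'publicData:name', 'Alice'          # columns are added on the fly
hbase> get 'users', 'row1'
hbase> scan 'users'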
44
Data Model (Sparse)
 Table:
 Row:
 Column Family:
 Column: can be added into existing column family on the fly.
Rows can have widely varying number of columns.
Key Column1 Column2
Key
Key
Table
row1
row2
row3
Column3 Column4
Column1 Column5 Column4
Column3 Column6
Column Family 1
Column Family 1
Column Family 1
Column Family 2
Column Family 2
Column Family 2
45
45
HBase Architecture
The master keeps track of the
metadata for Region Servers and
served Regions and stores it in
ZooKeeper
The HBase client
communicates with ZooKeeper
only to get region info.
All HBase data (HLog & HFile) is stored on HDFS
46
Hive – Hadoop's Data Warehouse
47
Hive Overview
• Originally developed by Facebook
• Middleware built on top of Hadoop, designed to manage structured data
• Uses MapReduce as its execution engine
• Data is stored on HDFS
• Metadata is stored in an RDBMS
• Hive design principles
• SQL-like syntax
• Extensibility – Types, Functions, Formats, Scripts
• Both performance and horizontal scalability
48
Hive – How it works
[Diagram: Web UI, CLI and JDBC/ODBC clients talk to the Driver (compiler, optimizer, executor), which consults the metastore and creates M/R jobs that run on the DataNodes of the Hadoop cluster.]
49
Enterprise Applications of Hadoop
[Diagram: unstructured data (web logs, crawlers, sensors and devices) flows into the Etu Appliance (Hadoop); structured data from RDBMS, ERP, CRM and LOB apps is exchanged through connectors and SQL; results are exposed through Hive QL to familiar end-user tools such as SSRS, SSAS, Hyperion, PowerView, Excel with PowerPivot, embedded BI and predictive analytics.]
50
Referencing tables in an RDBMS
RDBMS
Customers
WebLogs
Products
HDFS
51
Offline data analysis
RDBMS
Customers
Products
HDFS
Sales History
52
RDBMS
HDFS
Sales 2008
Sales 2009
Sales 2010
Sales 2008
ODBC/JDBC
Combining historical and online data
53
Using Hadoop for data summarization
RDBMS
WebLogs
HDFS
WebLog
Summary
54
Hadoop moves into the mainstream market
55
Crossing the chasm
56
Challenges for enterprises adopting a Hadoop stack
• Technology / talent gap
1. Enterprises are generally unfamiliar with the Hadoop architecture
2. High technical barriers to planning, deploying, managing,
and tuning Hadoop clusters
• Professional services gap
1. Lack of local, professional, hands-on experienced
Hadoop consulting services
2. Lack of vendors who can provide complete Big Data
solution design, deployment, and maintenance
Still an early market
Helping you cross the Big Data chasm
57
Introducing – Etu Appliance 2.0
A purpose-built, high-performance appliance for big
data processing
• Automates cluster deployment
• Optimized for the highest big data processing performance
• Delivers availability and security
58
Key Benefits to Hadoopers
• Fully automated deployment and configuration
Simplifies configuration and deployment
• Reliably deploy and run mission-critical big data applications
High availability made easy
• Process and control sensitive data with confidence
Enterprise-grade security
• Fully optimized operating system to boost your data processing
performance
Boost big data processing
• Adapts to your workload and grows with your business
Provides a scalable and extensible foundation
59
What’s New in Etu Appliance 2.0
• New deployment feature – Auto deployment and
configuration for master node high availability
• New security features – LDAP integration and
Kerberos authentication
• New data source – introduces the Etu™ Dataflow data
collection service with built-in Syslog and FTP servers
for better integration with existing IT infrastructure
• New user experience – new Etu™ Management
Console with HDFS file browser and HBase table
management
60
Etu Appliance 2.0 – Hardware Specification
Master Node – Etu 1000M
CPU: 2 x 6 Core
RAM: 48 GB ECC
HDD: 300GB/SAS 3.5”/15K RPM x 2 (RAID 1)
NIC: Dual Port/1 Gb Ethernet x 1
S/W: Etu™ OS/Etu™ Big Data Software Stack
Power: Redundant Power / 100V~240V
Worker Node – Etu 1000W
CPU: 2 x 6 Core
RAM: 48 GB ECC
HDD: 2TB/SATA 3.5”/7.2K RPM x 4
NIC: Dual Port/1 Gb Ethernet x 1
S/W: Etu™ OS/Etu™ Big Data
Software Stack
Power: Single Power /100V~240V
Worker Node – Etu 2000W
CPU: 2 x 6 Core
RAM: 48GB
HDD: 2TB/SATA 3.5”/7.2K RPM x 8
NIC: Dual Port/1 Gb Ethernet x 1
S/W: Etu™ OS/Etu™ Big Data
Software Stack
Power: Single Power / 100V~240V
61
Sqoop : SQL to Hadoop
• What is Sqoop ?
• Sqoop import to HDFS
• Sqoop import to Hive
• Sqoop import to HBase
• Sqoop Incremental Imports
• Sqoop export
62
What is Sqoop
• A tool designed to transfer data between Hadoop and
relational databases
• Uses MapReduce to import and export data
• Provides parallel operation
• Fault tolerant, of course!
63
How it works
[Diagram: a SQL statement is turned into several map tasks, each reading from the database over JDBC in parallel and writing into HDFS/Hive/HBase.]
64
Using sqoop
$ sqoop tool-name [tool-arguments]
Please try …
$ sqoop help
65
Sqoop Common Arguments
--connect <jdbc-uri>
--driver
--help
-P
--password <password>
--username <username>
--verbose
66
Using Options Files
$ sqoop import --connect jdbc:mysql://etu-master/db ...
Or
$ sqoop --options-file ./import.txt --table TEST
The options file contains:
import
--connect
jdbc:mysql://etu-master/db
--username
root
--password
etuadmin
67
Sqoop Import
Command: sqoop import (generic-args) (import-args)
Arguments:
• --connect <jdbc-uri> Specify JDBC connect string
• --driver <class-name> Manually specify JDBC driver class to use
• --help Print usage instructions
• -P Read password from console
• --password <password> Set authentication password
• --username <username> Set authentication username
• --verbose Print more information while working
68
Import Arguments
• --append Append data to an existing dataset in HDFS
• -m,--num-mappers <n> Use n map tasks to import in parallel
• -e,--query <statement> Import the results of statement.
• --table <table-name> Table to read
• --target-dir <dir> HDFS destination dir
• --where <where clause> WHERE clause to use during import
• -z,--compress Enable compression
69
Let’s try it!
Please refer to L2 training note:
• Import nyse_daily to HDFS
• Import nyse_dividends to HDFS
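A hedged sketch of those two imports, reusing the connection settings from the options-file example (host, database and credentials are assumptions):
$ sqoop import --connect jdbc:mysql://etu-master/db \
    --username root -P \
    --table nyse_daily --target-dir /user/etu/nyse_daily -m 4
$ sqoop import --connect jdbc:mysql://etu-master/db \
    --username root -P \
    --table nyse_dividends --target-dir /user/etu/nyse_dividends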
70
Incremental Import
• Sqoop supports incremental imports (see the sketch below)
• Arguments
--check-column : column to be examined when deciding which rows to import
--incremental : append or lastmodified
--last-value : maximum value from the previous import
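For example, an append-mode sketch (the check column and last value are illustrative):
$ sqoop import --connect jdbc:mysql://etu-master/db \
    --username root -P \
    --table nyse_daily \
    --incremental append --check-column id --last-value 10000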
71
Import All Tables
• sqoop import-all-tables imports a set of tables from an
RDBMS to HDFS, as sketched below.
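A minimal sketch, again assuming the same connection settings; --warehouse-dir sets the parent HDFS directory for the imported tables:
$ sqoop import-all-tables --connect jdbc:mysql://etu-master/db \
    --username root -P \
    --warehouse-dir /user/etu/db_dump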
72
Sqoop to Hive
• Sqoop supports Hive (see the sketch below)
• Add the following arguments when importing
--hive-import : make Hive the import target
--hive-table : target table name in Hive
--hive-overwrite : overwrite the table if it already exists
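A hedged sketch combining these with the import arguments shown earlier (connection and table names are assumptions):
$ sqoop import --connect jdbc:mysql://etu-master/db \
    --username root -P \
    --table nyse_dividends \
    --hive-import --hive-table nyse_dividends --hive-overwrite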
73
Sqoop to HBase
--column-family <family> Sets the target column family for the import
--hbase-create-table If specified, create missing HBase tables
--hbase-row-key <col> Specifies which input column to use as the
row key
--hbase-table <table-name> Specifies an HBase table to use as the target
instead of HDFS
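A hedged sketch of an import into HBase; the column family and row-key column are illustrative:
$ sqoop import --connect jdbc:mysql://etu-master/db \
    --username root -P \
    --table nyse_dividends \
    --hbase-table nyse_dividends --hbase-create-table \
    --column-family d --hbase-row-key stock_symbol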
74
Sqoop Export
• The target table must already exist
• The default operation is INSERT; you can specify
UPDATE instead
• Syntax : sqoop export (generic args) (export-args)
75
Export Arguments
--export-dir <dir> HDFS source path for export
--table <name> Table to populate
--update-key Anchor column to use for updates.
--update-mode updateonly or allowinsert
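A hedged export sketch; the MySQL table and HDFS source directory are assumptions:
$ sqoop export --connect jdbc:mysql://etu-master/db \
    --username root -P \
    --table weblog_summary \
    --export-dir /user/etu/weblog_summary \
    --update-key id --update-mode allowinsert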
76
Sqoop Job
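This slide refers to Sqoop saved jobs; as a sketch only, a job could be saved and re-run like this (the job name and arguments are illustrative):
$ sqoop job --create daily_import -- import \
    --connect jdbc:mysql://etu-master/db --username root -P \
    --table nyse_daily --incremental append --check-column id --last-value 0
$ sqoop job --list
$ sqoop job --exec daily_import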
77
More Sqoop Information
• Sqoop User Guide :
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.
html
78
Coffee Break!
79
Pig Programming
• Introduction to Pig
• Reading and Writing Data with Pig
• Pig Latin Basics
• Debugging Pig Scripts
• Pig Best Practices
• Pig and HBase
80
Pig Introduction
• Pig was originally created at Yahoo! to answer a
similar need to Hive
– Many developers did not have the Java and/or MapReduce
knowledge required to write standard MapReduce programs
– But still needed to query data
• Pig is a dataflow language
– Language is called Pig Latin
– Relatively simple syntax
– Under the covers, Pig Latin scripts are turned into MapReduce
jobs and executed on the cluster
81
Pig Features
• Pig supports many features which allow developers to
perform sophisticated data analysis without having to
write Java MapReduce code
– Joining datasets
– Grouping data
– Referring to elements by position rather than name
• Useful for datasets with many elements
– Loading non-delimited data using a custom SerDe
– Creation of user-defined functions, written in Java
– And more
82
Pig Word Count
Book = LOAD 'shakespeare/*' USING PigStorage() AS (lines:chararray);
Wordlist = FOREACH Book GENERATE FLATTEN(TOKENIZE(lines)) as word;
GroupWords = GROUP Wordlist BY word;
CountGroupWords = FOREACH GroupWords GENERATE group as word,
COUNT(Wordlist) as num_occurence;
WordCountSorted = ORDER CountGroupWords BY $1 DESC;
STORE WordCountSorted INTO 'wordcount' USING PigStorage(',');
83
Pig Data Types
• Scalar Types
– int
– long
– float
– double
– chararray
– bytearray
• Complex Types
– tuple ex. (19,2,3)
– bag ex. {(19,2), (18,1)}
– map ex. [open#apache]
• NULL
84
Pig Data Type Concepts
• In Pig, a single element of data is an atom
• A collection of atoms – such as a row, or a partial row
– is a tuple
• Tuples are collected together into bags
• Typically, a Pig Latin script starts by loading one or
more datasets into bags, and then creates new bags
by modifying those it already has
85
Pig Schema
• Pig eats everything
– If a schema is available, Pig will make use of it
– If a schema is not available, Pig will make the best guesses it can
based on how the script treats the data
A = LOAD 'text.csv' as (field1:chararray, field2:int);
• In the example above, Pig will expect this data to have
2 fields with the specified data types
– If there are more fields, they will be truncated
– If there are fewer fields, NULLs will be filled in
86
Pig Latin: Data Input
• The function is LOAD
sample = LOAD 'text.csv' as (field1:chararray, field2:int);
• In the example above
– sample is the name of the relation
– The file text.csv is loaded
– Pig will expect this data to have 2 fields with the specified data
types
• If there are more fields, they will be truncated
• If there are fewer fields, NULLs will be filled in
87
Pig Latin: Data Output
• STORE – Output a relation into a specified HDFS
folder
STORE sample_out INTO '/tmp/output';
• DUMP – Output a relation to screen
DUMP sample_out;
88
Pig Latin: Relational Operations
• FOREACH
• FILTER
• GROUP
• ORDER BY
• DISTINCT
• JOIN
• LIMIT
89
Pig Latin: FOREACH
• FOREACH takes a set of expressions and applies them
to every record in the data pipeline, and generates
new records to send down the pipeline to the next
operator.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address);
b = FOREACH a GENERATE id, name;
90
Pig Latin: FILTER
• FILTER allows you to select which records will be
retained in your data pipeline.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address);
b = FILTER a BY id matches '100*';
91
Pig Latin: GROUP
• GROUP statement collects together records with the
same key.
• It is different than the GROUP BY clause in SQL, as in
Pig Latin there is no direct connection between GROUP
and aggregate functions.
• GROUP collects all records with the key provided into
a bag and then you can pass this to an aggregate
function.
92
Pig Latin: GROUP (cont)
Example:
A = LOAD 'text.csv' as (id, name, phone, zip, address);
B = GROUP A BY zip;
C = FOREACH B GENERATE group, COUNT(A.id);
STORE C INTO 'population_by_zipcode';
93
Pig Latin: ORDER BY
• ORDER statement sorts your data for you by the field
specified.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address, fee);
b = ORDER a BY fee;
c = ORDER a BY fee DESC, name;
DUMP c;
94
Pig Latin: DISTINCT
• DISTINCT statement removes duplicate records. Note
it works only on entire records, not on individual
fields.
• Example:
a = LOAD 'url.csv' as (userid, url, dl_bytes, ul_bytes);
b = FOREACH a GENERATE userid, url;
c = DISTINCT b;
95
Pig Latin: JOIN
• JOIN selects records from one input to put together
with records from another input. This is done by
indicating keys from each input, and when those keys
are equal, the two rows are joined.
• Example:
call = LOAD 'call.csv' as (MSISDN, callee, duration);
user = LOAD 'user.csv' as (name, MSISDN, address);
call_bill = JOIN call BY MSISDN, user BY MSISDN;
bill = FOREACH call_bill GENERATE name, call::MSISDN, callee,
duration, address;
STORE bill INTO 'to_be_billed';
96
Pig Latin: LIMIT
• LIMIT allows you to limit the number of results in the
output.
• Example:
a = LOAD 'text.csv' as (id, name, phone, zip, address, fee);
b = ORDER a BY fee DESC, name;
top100 = LIMIT b 100;
DUMP top100;
97
Pig Latin: UDF
• UDFs (User Defined Functions) let users combine Pig
operators with their own or others' code.
• UDFs can be written in Java or Python.
• UDFs have to be registered before use.
• Piggybank is useful
• Example:
register 'path_to_UDF/piggybank.jar';
a = LOAD 'text.csv' as (id, name, phone, zip, address, fee);
b = FOREACH a GENERATE id,
org.apache.pig.piggybank.evaluation.string.Reverse(name);
98
Debugging Pig
• DESCRIBE
– Shows the schema of a relation in your script
• EXPLAIN
– Shows your script's execution plan in MapReduce terms
• ILLUSTRATE
– Runs the script on a sample of the data
• Pig Statistics
– A summary set of statistics on your script
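A minimal grunt session sketch, reusing relations from the word-count example earlier:
grunt> Book = LOAD 'shakespeare/*' USING PigStorage() AS (lines:chararray);
grunt> Wordlist = FOREACH Book GENERATE FLATTEN(TOKENIZE(lines)) AS word;
grunt> DESCRIBE Wordlist;    -- show the schema of the relation
grunt> ILLUSTRATE Wordlist;  -- run the pipeline on a small sample of the data
grunt> EXPLAIN Wordlist;     -- show the logical, physical and MapReduce plans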
99
More about Pig
• Visit Pig's home page http://pig.apache.org
• http://pig.apache.org/docs/r0.9.2/
100
Coffee Break!
101
Hive Programming and Hands-On
• Hive Introduction
• Getting Data into Hive
• Manipulating Data with Hive
• Partitioning and Bucketing Data
• Hive Best Practices
• Hive and HBase
102
Hive: Introduction
• Hive was originally developed at Facebook
– Provides a very SQL-like language
– Can be used by people who know SQL
– Under the covers, generates MapReduce jobs that run on the
Hadoop cluster
– Enabling Hive requires almost no extra work by the system
administrator
103
Hive: Architecture
• Driver
• Compiles HiveQL into MapReduce jobs, optimizes
them, and submits them to the JobTracker for execution
• CLI / Web UI
• Ad-hoc queries
• Schema browsing
• Management interface
• Metastore
• JDBC / ODBC
• Standard interfaces for connecting other database
tools and applications
[Diagram: Web UI / CLI / JDBC / ODBC → Driver (compiler, optimizer, executor) + metastore]
104
[Diagram: Web UI, CLI and JDBC/ODBC clients talk to the Driver (compiler, optimizer, executor), which consults the metastore and creates M/R jobs that run on the DataNodes of the Hadoop cluster.]
Hive: How it works
105
The Hive Data Model
• Hive 'layers' table definitions on top of data in HDFS
• Databases
• Tables
– Typed columns (int, float, string, boolean, etc.)
– Also list and map types (for JSON-like data)
• Partition
• Buckets
106
Hive Datatypes : Primitive Types
• TINYINT (1 byte signed integer)
• SMALLINT (2 bytes signed integer)
• INT (4 bytes signed integer)
• BIGINT (8 bytes signed integer)
• BOOLEAN (TRUE or FALSE)
• FLOAT (single precision floating point)
• DOUBLE (Double precision floating point)
• STRING (Array of Char)
• BINARY (Array of Bytes)
• TIMESTAMP (integer, float or string)
107
Hive Datatypes: Collection Types
• ARRAY <primitive-type>
• MAP <primitive-type, primitive-type>
• STRUCT <col_name : primitive-type, …>
108
Text File Delimiters
• By default, Hive stores data as text files, BUT you can
choose other file formats.
• Hive's default record and field delimiters:
CREATE TABLE …
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
109
The Hive Metastore
• Hive's Metastore is a database containing table
definitions and other metadata
– By default, stored locally on the client machine in a Derby
database
– If multiple people will be using Hive, the system administrator
should create a shared Metastore
• Usually in MySQL or some other relational database server
110
Hive is Schema on Read
• Relational Database
– Schema on Write
– Gatekeeper
– Alter schema is painful!
• Hive
– Schema on Read
– Requires less ETL efforts
111
Hive Data: Physical Layout
• Hive tables are stored in Hive's 'warehouse' directory
in HDFS
– By default, /user/hive/warehouse
• Tables are stored in subdirectories of the warehouse
directory
– Partitions form subdirectories of tables
• It is possible to create external tables if the data is already
in HDFS and should not be moved from its current
location
• Actual data is stored in flat files
– Control-character-delimited text, or SequenceFiles
– Can be in an arbitrary format with the use of a custom
Serializer/Deserializer ("SerDe")
112
Hive Limitations
• Not all “standard” SQL is supported
– No correlated subqueries, for example
• No support for UPDATE or DELETE
• No support for INSERT single rows
• Relatively limited number of built-in functions
113
Starting The Hive Shell
• To launch the Hive shell, start a terminal and run
– $ hive
• Results in the Hive prompt:
– hive>
• Autocomplete – Tab
• Query Column Headers
– hive> set hive.cli.print.header=true;
– hive> set hive.cli.print.current.db=true;
114
Hive’s Word Count
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) w
GROUP BY word
ORDER BY count DESC;
SELECT * FROM word_counts LIMIT 30;
115
HiveQL: Data Definition
116
Data Definition
• Database
– CREATE/DROP
– ALTER (set DBPROPERTIES, name-value pair)
– SHOW/DESCRIBE
– USE
• Table
– CREATE/DROP
– ALTER
– SHOW/DESCRIBE
– CREATE EXTERNAL TABLE
117
Creating Tables
• CREATE TABLE IF NOT EXISTS table_name …
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
• CREATE EXTERNAL TABLE …
LOCATION '/user/mydata'
118
Creating Tables
hive> SHOW TABLES;
hive> CREATE TABLE shakespeare
(freq INT, word STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY
'\t' STORED AS TEXTFILE;
hive> DESCRIBE shakespeare;
119
Modify Tables
• ALTER TABLE … CHANGE
COLUMN old_name new_name type
AFTER column;
• ALTER TABLE … (ADD|REPLACE)
COLUMNS (column_name column_type);
120
Partition
• Help to organize data in a logical fashion, such as
hierarchically.
• CREATE TABLE …
PARTITIONED BY (column name datatype, …)
• CREATE TABLE employees (
name STRING,
salary FLOAT)
PARTITIONED BY (country STRING, state STRING)
• Physical Layout in Hive
…/employees/country=CA/state=AB
…/employees/country=CA/state=BC
…
121
HiveQL: Data Manipulation
122
Loading Data into Hive
• LOAD DATA INPATH … INTO TABLE … PARTITION …
• Data is loaded into Hive with the LOAD DATA INPATH
statement
– Assumes that the data is already in HDFS
LOAD DATA INPATH 'shakespeare_freq' INTO TABLE
shakespeare;
• If the data is on the local filesystem, use LOAD DATA
LOCAL INPATH
123
Inserting Data into Table from
Queries
• INSERT OVERWRITE TABLE employees
PARTITION (country='US', state='OR')
SELECT * FROM staged_employees se
WHERE se.cnty='US' AND se.st='OR';
124
Dynamic Partitions Properties
• hive.exec.dynamic.partition=true
• hive.exec.dynamic.partition.mode=nonstrict
• hive.exec.max.dynamic.partitions.pernode=100
• hive.exec.max.dynamic.partitions=1000
• hive.exec.max.created.files=100000
125
Dynamics Partition Inserts
• What if I have so many partitions ?
• INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT …, se.cnty, se.st
FROM staged_employees se;
• You can mix static and dynamic partitions, for example:
• INSERT OVERWRITE TABLE employees
PARTITION (country='US', state)
SELECT …, se.cnty, se.st
FROM staged_employees se
WHERE se.cnty = 'US';
126
Create Table and Loading Data
• CREATE TABLE ca_employees
AS SELECT name, salary
FROM employees se
WHERE se.state='CA';
127
Storing Output Results
• The SELECT statement on the previous slide would
write the data to console
• To store the result in HDFS, create a new table then
write, for example:
INSERT OVERWRITE TABLE newTable SELECT
s.word, s.freq, k.freq FROM shakespeare s JOIN
kjv k ON (s.word = k.word) WHERE s.freq >= 5;
• Results are stored in the table
• Results are just files within the newTable directory
– Data can be used in subsequent queries, or in MapReduce jobs
128
Exporting Data
• If the data files are already formatted as you want,
just copy them.
• Or you can use INSERT … DIRECTORY …, for example:
• INSERT OVERWRITE
LOCAL DIRECTORY './ca_employees'
SELECT name, salary, address
FROM employees se
WHERE se.state='CA';
129
HiveQL: Queries
130
SELECT … FROM
• SELECT col_name or functions FROM tab_name;
hive> SELECT name FROM employees e;
• SELECT … FROM … [LIMIT N]
– * or Column alias
– Column Arithmetic Operators, Aggregation Function
– FROM
131
Arithmetic Operators
Operator Types Description
A + B Numbers Add A and B
A - B Numbers Subtract B from A
A * B Numbers Multiply A and B
A / B Numbers Divide A with B
A % B Numbers The remainder of dividing A with B
A & B Numbers Bitwise AND
A | B Numbers Bitwise OR
A ^ B Numbers Bitwise XOR
~A Numbers Bitwise NOT of A
132
Aggregate Functions
count(*) covar_pop(col)
count(expr) covar_samp(col)
sum(col) corr(col1,col2)
sum(DISTINCT col) percentile(int_expr,p)
avg(col) histogram_numeric
min(col) collect_set(col)
max(col) stddev_pop(col)
variance(col), var_pop(col) stddev_samp(col)
var_samp(col)
• Map-side aggregation for performance improvement
hive> SET hive.map.aggr=true;
133
Other Functions
• https://cwiki.apache.org/confluence/display/Hive/Lan
guageManual+UDF#LanguageManualUDF-
BuiltinFunctions
134
When Hive Can Avoid Map Reduce
• SELECT * FROM employees;
• SELECT * FROM employees
WHERE country='us' AND state='CA'
LIMIT 100;
135
WHERE
• >, <, =, >=, <=, !=
• IS NULL/IS NOT NULL
• OR AND NOT
• LIKE
– X% (prefix 'X')
– %X (suffix 'X')
– %X% (substring)
– _ (single character)
• RLIKE (Java Regular Expression)
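A short hedged example using the employees table from earlier slides (the filter values are illustrative):
SELECT name, address
FROM employees
WHERE address LIKE '%Taipei%'
  AND name RLIKE '^A.*';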
136
GROUP BY
• Often used in conjunction with aggregate functions,
avg, count, etc.
• HAVING
– constrains the groups produced by GROUP BY in a way that
could otherwise only be expressed with a subquery
SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange='NASDAQ' AND symbol='AAPL'
GROUP BY year(ymd)
HAVING avg(price_close) > 50.0;
137
JOIN
• Inner JOIN
• LEFT OUTER JOIN
• RIGHT OUTER JOIN
• FULL OUTER JOIN
• LEFT SEMI-JOIN
• Map-side Joins
138
Inner JOIN
• SELECT a.ymd, a.price_close, b.price_close
FROM stocks a JOIN stocks b ON a.ymd=b.ymd
WHERE a.symbol='AAPL' AND b.symbol='IBM'
• SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd=d.ymd
AND s.symbol=d.symbol
WHERE s.symbol='AAPL'
139
LEFT OUTER JOIN
• All records from the lefthand table that match the WHERE clause are
returned; NULL is returned where the ON criteria find no match
• SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s LEFT OUTER JOIN dividends d
ON s.ymd=d.ymd AND s.symbol = d.symbol
WHERE s.symbol='AAPL'
140
RIGHT and FULL OUTER JOIN
• RIGHT OUTER JOIN
– All records from the righthand table that match the WHERE clause are
returned; NULL is returned where the ON criteria find no match
• FULL OUTER JOIN
– All records from all tables that match the WHERE clause are returned;
NULL is returned where the ON criteria find no match
141
LEFT SEMI-JOIN
• Returns records from the lefthand table if records are
found in the righthand table that satisfy the ON
predicates.
• SELECT s.ymd, s.symbol, s.price_close
FROM stocks s LEFT SEMI JOIN dividends d
ON s.ymd = d.ymd AND s.symbol = d.symbol
• RIGHT SEMI-JOIN is not supported
142
Map-side Joins
• If one of the tables is small, the large table can be
streamed through the mappers while the small tables are
cached in memory.
• SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol,
s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd=d.ymd
AND s.symbol=d.symbol
WHERE s.symbol='AAPL'
143
ORDER BY and SORT BY
• ORDER BY performs a total ordering of the query result
set.
• All data passes through a single reducer; use caution with
larger data sets. For example:
– SELECT s.ymd, s.symbol, s.price_close FROM stocks s ORDER
BY s.ymd ASC, s.symbol DESC;
• SORT BY performs a local ordering, where each
reducer's output will be sorted.
144
DISTRIBUTE BY with SORT BY
• By default, MapReduce partitions mapper output by a
hash of the key; with SORT BY alone, the sorted ranges of
different reducers' output will therefore overlap.
• We can use DISTRIBUTE BY to ensure records
with the same column value go to the same reducer, and use
SORT BY to order the data within each reducer.
• SELECT s.ymd, s.symbol, s.price_close FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC, s.ymd ASC
• DISTRIBUTE BY must come before SORT BY
145
CLUSTER BY
• Shorthand for DISTRIBUTE BY … SORT BY on the same column
• CLUSTER BY sorts in ascending order only (no DESC)
• SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
CLUSTER BY s.symbol;
146
Creating User-Defined Functions
• Hive supports manipulation of data via user-created
functions
• Example:
INSERT OVERWRITE TABLE u_data_new SELECT TRANSFORM
(userid, movieid, rating, unixtime) USING 'python
weekday_mapper.py' AS (userid, movieid, rating, weekday)
FROM u_data;
147
Hive: Where to Learn More
• http://hive.apache.org/
• Programming Hive
148
Choosing Between Pig and Hive
• Typically, organizations wanting an abstraction on top
of standard MapReduce will choose to use either Hive
or Pig
• Which one is chosen depends on the skillset of the
target users
– Those with an SQL background will naturally gravitate towards
Hive
– Those who do not know SQL will often choose Pig
• Each has strengths and weaknesses; it is worth
spending some time investigating each so you can
make an informed decision
• Some organizations are now choosing to use both
– Pig deals better with less-structured data, so Pig is used to
manipulate the data into a more structured form, then Hive is
used to query that structured data
www.etusolution.com
info@etusolution.com
Taipei, Taiwan
318, Rueiguang Rd., Taipei 114, Taiwan
T: +886 2 7720 1888
F: +886 2 8798 6069
Beijing, China
Room B-26, Landgent Center,
No. 24, East Third Ring Middle Rd.,
Beijing, China 100022
T: +86 10 8441 7988
F: +86 10 8441 7227
Contact
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

Etu L2 Training - Hadoop 企業應用實作

  • 1. Etu Big Data 手作進階 企業應用實作
  • 2. 2 • Hadoop與海量資料處理概論 • Sqoop 介紹與實作 • Pig 程式設計與實作 • Hive程式設計與實作 Etu Big Data 企業應用實作
  • 4. 4 海量數據 現在進行式 ….. 結構 vs. 非結構 生成速度快 處理技術難
  • 5. 5 Big Data時代來臨 Structured (結構化) •Relational Database •File in record format Semi-structured (半結構化) •XML •Logs •Click-stream •Equipment / Device •RFID tag Unstructured (非結 構化) •Web Pages •E-mail •Multimedia •Instant Messages •More Binary Files 行動/網際網路 Mobile/Internet 物聯網 Internet of Things
  • 11. 11 數據的類型 11 Social Media Machine / SensorDOC / MediaWeb Clickstream AppsCall Log/xDR Log
  • 12. 12 Scale Up vs. Scale Out 檔案系統 ETL 工具 或 脚本 關連式 資料庫 分散式檔案 系統 分散式檔案 系統 分散式檔案 系統 平行運算 平行運算 平行運算 NoSQL NoSQL NoSQL Scale Out (TB to PB) ScaleUp(uptoTB) 原始數據 資料處理 查詢應用
  • 13. 13 Big Data & Hadoop
  • 15. 15 What is ?  Framework for running distributed applications on large cluster built of commodity hardware  Originally created by Doug Cutting  OSS implementation of Google‟s MapReduce and GFS  Hardware failures assumed in design  Fault-tolerant via replication  The name of “Hadoop” has now evolved to cover a family of software, but the core is essentially MapReduce and a distributed file system
  • 16. 16 Why ?  Need to process lots of data (up to Petabyte scale)  Need to parallelize processing across multitude of CPU  Achieve above while KeepIng Software Simple  Give scalability with low cost commodity hardware  Achieve linear scalability
  • 17. 17 What is Hadoop used for ? 17  Searching  Log Processing  Recommendation Systems  Business Intelligence / Data Warehousing  Video and Image Analysis  Archiving
  • 18. 18 Hadoop 不只是 Hadoop HIVE Big Data Applications Pig! Zoo Keeper SQL RAW 非結構化 資料匯入 SQL 資料匯入 分散式檔案系統 類SQL資料庫系統 (非即時性) 分散式資料庫 (即時性) 平行運算框架 資料處理語言Data Mining 程式庫
  • 19. 19 Hadoop 是一整個生態系統  ZooKeeper – 管理協調服務  HBase – 分散式即時資料庫  HIVE – Hadoop的資料倉儲系統  Pig – Hadoop的資料處理流程語言  Mahout – Hadoop的數據挖掘函式庫  Sqoop – Hadoop與關連式資料庫的轉換工具 19
  • 21. 21 HDFS Overview  Hadoop Distributed File System  Based on Google‟s GFS (Google File System)  Master/slave architecture  Write once read multiple times  Fault tolerant via replication  Optimized for larger files  Focus on streaming data (high-throughput > low latency)  Rack-aware (reduce inter-cluster network I/O) 21
  • 22. 22 HDFS Client API’s  “Shell-like” commands ( hadoop dfs [cmd] ) 22  Native Java API  Thrift API for other languages  C++, Java, Python, PHP, Ruby, C# cat chgrp chmod chown copyFromLocal copyToLocal cp du,dus expunge get getmerge ls,lsr mkdir movefromLocal mv put rm,rmr setrep stat tail test text touchz
  • 23. 23 HDFS Architecture-Read 23 Name Node Read Client Data Node local disk Data Node local disk Data Node local disk Data Node local disk block op (heartbeat, replication, re-balancing) name,replicas,block_id name block_id location Xfer Data
  • 24. 24 HDFS Architecture-Write 24 Name Node Write Client Data Node local disk Data Node local disk Data Node local disk Data Node local disk name block_size replication node to write Write Data
  • 26. 26 Fault Tolerance 26 Name Node Data Node local disk Data Node local disk Data Node local disk Data Node local disk Auto Replicate
  • 28. 28 MapReduce Overview 28  Distributed programming paradigm and the framework that is OSS implementation of Google’s MapReduce  Modeled using the ideas behind of functional programming map() and reduce ()  Distributed on as many node as you would like  2 phase process: map() reduce() sub divide & conquer combine & reduce
  • 29. 29 MapReduce ABC’s 29  Essentially, it’s… A. Take a large problem and divide it into sub-problems B. Perform the same function on all sub-problems C. Combine the output from all sub-problems  M/R is excellent for problems where the “sub-problems” are NOT interdependent  The output of one “mapper” should not depend on the output of, or communication with, another “mapper”  The reduce phase doesn’t begin execution until all mappers have finished  Failed map and reduce tasks are automatically restarted  Rack/HDFS aware (data locality)
  • 30. 30 MapReduce 流程 (diagram: input files in HDFS are divided into splits; each split is processed by a map task; intermediate keys are sorted, copied, and merged; reduce tasks write the output parts back to HDFS)
  • 31. 31 Each mapper processes a file block 31 Word Count (example: for the input “I am a tiger, you are also a tiger”, each map task emits (word, 1) pairs for its block, shuffle & sort groups identical words together, and the reduce phase sums the counts to produce a,2 also,1 am,1 are,1 I,1 tiger,2 you,1)
  • 32. 32 Data Locality (diagram: M/R TaskTrackers run on the same machines as DataNodes; tasks for a job are scheduled on the rack that holds the data where possible, while other slots run a different job or sit idle)
  • 34. 34 Pig  Framework and language (Pig Latin) for creating and submitting Hadoop MapReduce jobs  Common data operations like join, group by, filter, sort, select, etc. are provided  Don‟t need to know Java  Remove boilerplate aspect from M/R  200 lines in Java -> 15 lines in Pig  Feels like SQL 34
  • 35. 35 Pig  Fact from Wiki: 40% of Yahoo‟s M/R jobs are in Pig  Interactive shell [grunt] exist  User Defined Functions [UDF]  Allows you to specify Java code where the logic is too complex for Pig Latin  UDF‟s can be part of most every operation in Pig  Great for loading and storing custom formats as well as transforming data 35
  • 36. 36 COGROUP JOIN SPLIT CROSS LIMIT STORE DISTINCT LOAD STREAM FILTER MAPREDUCE UNION FOREACH ORDER BY GROUP SAMPLE 36 Pig Relational Operations
  • 38. 38 HBase – The Big Table
  • 39. 39 What HBase is  No-SQL (means Non-SQL, not SQL sucks)  Good at fast/streaming writes  Fault tolerant  Good at linear horizontal scalability  Very efficient at managing billions of rows and millions of columns  Good at keeping row history  Good at auto-balancing  A complement to a SQL/DW  Great with non-normalized data 39
  • 40. 40 What HBase is NOT  Made for table joins  Made for splitting into normalized tables  A replacement for RDBMS  Great for storing small amount of data  Great for large binary data (prefer < 1MB per cell)  ACID compliant 40
  • 41. 41 Data Model  Simple View is a map  Table: similar to relation db table  Row Key: row is identified and sorted by key  Row Value Key Row Value Key Row Value Key Row Value Table row1 row2 row3
  • 42. 42 Data Model (Columns)  Table:  Row: multiple columns in row’s value  Column Key Column1 Column2 Key Key Table row1 row2 row3 Column3 Column4 Column1 Column2 Column3 Column4 Column1 Column2 Column3 Column4
  • 43. 43 Data Model (Column Family)  Table: Row: Column Family: columns are grouped into family. Column family must be predefined with a column family prefix, e.g. “privateData”, when creating schema  Column: Column is denoted using family+qualifier, e.g “privateData:mobilePhone”. Key Column1 Column2 Key Key Table row1 row2 row3 Column3 Column4 Column1 Column2 Column3 Column4 Column1 Column2 Column3 Column4 Column Family 1 Column Family 1 Column Family 1 Column Family 2 Column Family 2 Column Family 2
  • 44. 44 Data Model (Sparse)  Table:  Row:  Column Family:  Column: can be added into existing column family on the fly. Rows can have widely varying number of columns. Key Column1 Column2 Key Key Table row1 row2 row3 Column3 Column4 Column1 Column5 Column4 Column3 Column6 Column Family 1 Column Family 1 Column Family 1 Column Family 2 Column Family 2 Column Family 2
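As a quick illustration of the data model above, a minimal HBase shell sketch; the table name 'users' and the cell values are made up for illustration, while the 'privateData' family and 'privateData:mobilePhone' column follow the earlier slide's example:
    create 'users', 'privateData'                                    # table with one predefined column family
    put 'users', 'row1', 'privateData:mobilePhone', '0912345678'     # column = family + qualifier, added on the fly
    put 'users', 'row1', 'privateData:email', 'user@example.com'     # rows may carry different columns (sparse)
    get 'users', 'row1'                                              # read back all columns of one row
    scan 'users'                                                     # rows come back sorted by row key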
  • 45. 45 45 HBase Architecture The master keeps track of the metadata for Region Servers and the Regions they serve, and stores it in ZK. The HBase client communicates with ZK only to get region info. All HBase data (HLog & HFile) are stored on HDFS
  • 47. 47 Hive 簡介 • 由 Facebook 開發 • 架構於 Hadoop 之上, 設計用來管理結構化資料的中介軟體 • 以 MapReduce 為執行環境 • 資料儲存於HDFS上 • Metadata 儲存於RDBMS中 • Hive的設計原則 • 採用類SQL語法 • 擴充性 – Types, Functions, Formats, Scripts • 性能與水平擴展能力兼具
  • 48. 48 Hive – How it works (diagram: Web UI / CLI / JDBC / ODBC clients submit queries to the Driver (compiler, optimizer, executor), which consults the metastore and creates M/R jobs that run on the Data Nodes of the Hadoop cluster)
  • 49. 49 Hadoop 的企業應用 RDBMS Sensors Devices Web Log Crawlers ERP CRM LOB APPs Connectors Unstructured Data SSRS SSAS Hyperion Familiar End User Tools PowerView Excel with PowerPivot Embedded BI Predictive Analytics Structured data Etu Appliance through Hive QL SQL SQL
  • 52. 52 RDBMS HDFS Sales 2008 Sales 2009 Sales 2010 Sales 2008 ODBC/JDBC 歷史數據與線上數據交互運用
  • 56. 56 企業採用 Hadoop 技術架構的挑戰 • 技術/人才缺口 1. 企業對 Hadoop 架構普遍陌生 2. Hadoop 叢集規劃、部署、管理 與系統調校的技術門檻高 • 專業服務資源缺口 1. 缺乏在地、專業、有實務經驗的 Hadoop 顧問服務 2. 缺乏能夠提供完整 Big Data 解 決方案設計、導入、與維護的專 業廠商 還處於市場早期 助您跨越 Big Data 鴻溝
  • 57. 57 Introducing – Etu Appliance 2.0 A purpose-built, high-performance appliance for big data processing • Automates cluster deployment • Optimized for the highest performance of big data processing • Delivers availability and security
  • 58. 58 Key Benefits to Hadoopers • Fully automated deployment and configuration Simplifies configuration and deployment • Reliably deploys and runs mission-critical big data applications High availability made easy • Process and control sensitive data with confidence Enterprise-grade security • Fully optimized operating system to boost your data processing performance Boosts big data processing • Adapts to your workload and grows with your business Provides a scalable and extensible foundation
  • 59. 59 What’s New in Etu Appliance 2.0 • New deployment feature – Auto deployment and configuration for master node high availability • New security features – LDAP integration and Kerberos authentication • New data source – Introduce Etu™ Dataflow data collection service with built in Syslog and FTP server for better integration with existing IT infrastructure • New user experience – new Etu™ Management Console with HDFS file browser and HBase table management
  • 60. 60 Etu Appliance 2.0 – Hardware Specification Master Node – Etu 1000M CPU: 2 x 6 Core RAM: 48 GB ECC HDD: 300GB/SAS 3.5”/15K RPM x 2 (RAID 1) NIC: Dual Port/1 Gb Ethernet x 1 S/W: Etu™ OS/Etu™ Big Data Software Stack Power: Redundant Power / 100V~240V Worker Node – Etu 1000W CPU: 2 x 6 Core RAM: 48 GB ECC HDD: 2TB/SATA 3.5”/7.2K RPM x 4 NIC: Dual Port/1 Gb Ethernet x 1 S/W: Etu™ OS/Etu™ Big Data Software Stack Power: Single Power /100V~240V Worker Node – Etu 2000W CPU: 2 x 6 Core RAM: 48GB HDD: 2TB/SATA 3.5”/7.2K RPM x 8 NIC: Dual Port/1 Gb Ethernet x 1 S/W: Etu™ OS/Etu™ Big Data Software Stack Power: Single Power / 100V~240V
  • 61. 61 Sqoop : SQL to Hadoop • What is Sqoop ? • Sqoop import to HDFS • Sqoop import to Hive • Sqoop import to HBase • Sqoop Incremental Imports • Sqoop export
  • 62. 62 What is Sqoop • A tool designed to transfer data between Hadoop and relational databases • Uses MapReduce to import and export data • Provides parallel operations • Fault tolerance, of course!
  • 63. 63 How it works (diagram: Sqoop turns the SQL statement into parallel map tasks, each of which reads from the database over JDBC and writes into HDFS/Hive/HBase)
  • 64. 64 Using sqoop $ sqoop tool-name [tool-arguments] Please try … $ sqoop help
  • 65. 65 Sqoop Common Arguments --connect <jdbc-uri> --driver --help -P --password <password> --username <username> --verbose
  • 66. 66 Using Options Files $ sqoop import --connect jdbc:mysql://etu-master/db ... Or $ sqoop --options-file ./import.txt --table TEST The options file contains: import --connect jdbc:mysql://etu-master/db --username root --password etuadmin
  • 67. 67 Sqoop Import Command: sqoop import (generic-args) (import-args) Arguments: • --connect <jdbc-uri> Specify JDBC connect string • --driver <class-name> Manually specify JDBC driver class to use • --help Print usage instructions • -P Read password from console • --password <password> Set authentication password • --username <username> Set authentication username • --verbose Print more information while working
  • 68. 68 Import Arguments • --append Append data to an existing dataset in HDFS • -m,--num-mappers <n> Use n map tasks to import in parallel • -e,--query <statement> Import the results of statement. • --table <table-name> Table to read • --target-dir <dir> HDFS destination dir • --where <where clause> WHERE clause to use during import • -z,--compress Enable compression
  • 69. 69 Let’s try it! Please refer to L2 training note: • Import nyse_daily to HDFS • Import nyse_dividends to HDFS
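For reference, a minimal import command for the first exercise might look like the following sketch; the MySQL host, database name, and credentials are assumptions, not values from the training note:
    $ sqoop import \
        --connect jdbc:mysql://etu-master/nyse \
        --username root -P \
        --table nyse_daily \
        --target-dir /user/etu/nyse_daily \
        -m 4
This runs four parallel map tasks (-m 4) and writes the table contents under the given HDFS directory.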
  • 70. 70 Incremental Import • Sqoop support incremental import • Argument --check-column : column to be examined for importing --incremental : append or lastmodified --last-value : max value from previous import
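As a sketch (assuming the table has an auto-increment column named id, which is not stated in the slides), an append-mode incremental import that only picks up rows added since the last run could look like:
    $ sqoop import \
        --connect jdbc:mysql://etu-master/nyse --username root -P \
        --table nyse_daily \
        --check-column id \
        --incremental append \
        --last-value 10000
Sqoop reports the new maximum value of the check column at the end of the job, which you would pass as --last-value on the next run.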
  • 71. 71 Import All Tables • sqoop-import-all-tables supports importing a set of tables from an RDBMS to HDFS.
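For example (host, database, and credentials assumed as before), every table in the database can be imported in a single command:
    $ sqoop import-all-tables \
        --connect jdbc:mysql://etu-master/nyse \
        --username root -P \
        --warehouse-dir /user/etu/nyse
Each table lands in its own subdirectory of the --warehouse-dir path.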
  • 72. 72 Sqoop to Hive • Sqoop supports Hive • Add the following arguments when importing --hive-import : import into Hive instead of plain HDFS --hive-table : specify the target table name in Hive --hive-overwrite : overwrite data if the table already exists
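A hedged sketch of a Hive-bound import (connection details assumed as in the earlier examples):
    $ sqoop import \
        --connect jdbc:mysql://etu-master/nyse --username root -P \
        --table nyse_dividends \
        --hive-import \
        --hive-table nyse_dividends \
        --hive-overwrite
Sqoop first stages the data in HDFS, then generates and runs the CREATE TABLE / LOAD DATA statements in Hive for you.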
  • 73. 73 Sqoop to HBase --column-family <family> Sets the target column family for the import --hbase-create-table If specified, create missing HBase tables --hbase-row-key <col> Specifies which input column to use as the row key --hbase-table <table-name> Specifies an HBase table to use as the target instead of HDFS
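A sketch of an HBase-bound import; the HBase table, column family, and row-key column names here are illustrative assumptions:
    $ sqoop import \
        --connect jdbc:mysql://etu-master/nyse --username root -P \
        --table nyse_dividends \
        --hbase-table nyse_dividends \
        --column-family cf \
        --hbase-row-key symbol \
        --hbase-create-table
Each imported row becomes an HBase row keyed by the chosen column, with the remaining columns stored as cf:<column-name> cells.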
  • 74. 74 Sqoop Export • The target table must already exist • The default operation is INSERT; you can specify UPDATE instead • Syntax : sqoop export (generic-args) (export-args)
  • 75. 75 Export Arguments --export-dir <dir> HDFS source path for export --table <name> Table to populate --update-key Anchor column to use for updates. --update-mode updateonly or allowinsert
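Putting the arguments together, a minimal export sketch (the target database and table are assumptions) that pushes an HDFS directory back into MySQL, updating existing rows by key and inserting new ones:
    $ sqoop export \
        --connect jdbc:mysql://etu-master/nyse --username root -P \
        --table nyse_daily_summary \
        --export-dir /user/etu/nyse_daily_summary \
        --update-key symbol \
        --update-mode allowinsert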
  • 77. 77 More Sqoop Information • Sqoop User Guide : http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
  • 79. 79 Pig 程式設計 • Introduction to Pig • Reading and Writing Data with Pig • Pig Latin Basics • Debugging Pig Scripts • Pig Best Practices • Pig and HBase
  • 80. 80 Pig Introduction • Pig was originally created at Yahoo! to answer a similar need to Hive – Many developers did not have the Java and/or MapReduce knowledge required to write standard MapReduce programs – But still needed to query data • Pig is a dataflow language – Language is called Pig Latin – Relatively simple syntax – Under the covers, Pig Latin scripts are turned into MapReduce jobs and executed on the cluster
  • 81. 81 Pig Features • Pig supports many features which allow developers to perform sophisticated data analysis without having to write Java MapReduce code – Joining datasets – Grouping data – Referring to elements by position rather than name • Useful for datasets with many elements – Loading non-delimited data using a custom SerDe – Creation of user-defined functions, written in Java – And more
  • 82. 82 Pig Word Count Book = LOAD 'shakespeare/*' USING PigStorage() AS (lines:chararray); Wordlist = FOREACH Book GENERATE FLATTEN(TOKENIZE(lines)) as word; GroupWords = GROUP Wordlist BY word; CountGroupWords = FOREACH GroupWords GENERATE group as word, COUNT(Wordlist) as num_occurence; WordCountSorted = ORDER CountGroupWords BY $1 DESC; STORE WordCountSorted INTO 'wordcount' USING PigStorage(',');
  • 83. 83 Pig Data Types • Scalar Types – int – long – float – double – chararray – bytearray • Complex Types – tuple ex. (19,2,3) – bag ex. {(19,2), (18,1)} – map ex. [open#apache] • NULL
  • 84. 84 Pig Data Type Concepts • In Pig, a single element of data is an atom • A collection of atoms – such as a row, or a partial row – is a tuple • Tuples are collected together into bags • Typically, a Pig Latin script starts by loading one or more datasets into bags, and then creates new bags by modifying those it already has
  • 85. 85 Pig Schema • Pig eats everything – If schema is available, Pig will make use of it – If schema is not available, Pig will make the best guesses it can based on how the script treats the data A = LOAD „text.csv‟ as (field1:chararray, field2:int); • In the example above, Pig will expect this data to have 2 fields with specified data types – If there are more fields they will be truncated – If there are less fields NULL will be filled
  • 86. 86 Pig Latin: Data Input • The function is LOAD sample = LOAD „text.csv‟ as (field1:chararray, field2:int); • In the example above – sample is the name of relation – The file text.csv is loaded. – Pig will expect this data to have 2 fields with specified data types • If there are more fields they will be truncated • If there are less fields NULL will be filled
  • 87. 87 Pig Latin: Data Output • STORE – Output a relation into a specified HDFS folder STORE sample_out into „/tmp/output‟; • DUMP – Output a relation to screen DUMP sample_out;
  • 88. 88 Pig Latin: Relational Operations • FOREACH • FILTER • GROUP • ORDER BY • DISTINCT • JOIN • LIMIT
  • 89. 89 Pig Latin: FOREACH • FOREACH takes a set of expressions and applies them to every record in the data pipeline, and generates new records to send down the pipeline to the next operator. • Example: a = LOAD „text.csv‟ as (id, name, phone, zip, address); b = FOREACH a GENERATE id, name;
  • 90. 90 Pig Latin: FILTER • FILTER allows you to select which records will be retained in your data pipeline. • Example: a = LOAD 'text.csv' as (id, name, phone, zip, address); b = FILTER a BY id matches '100.*';
  • 91. 91 Pig Latin: GROUP • GROUP statement collects together records with the same key. • It is different than the GROUP BY clause in SQL, as in Pig Latin there is no direct connection between GROUP and aggregate functions. • GROUP collects all records with the key provided into a bag and then you can pass this to an aggregate function.
  • 92. 92 Pig Latin: GROUP (cont) Example: A = LOAD „text.csv‟ as (id, name, phone, zip, address); B = GROUP A BY zip; C = FOREACH B GENERATE group, COUNT(id); STORE C INTO „population_by_zipcode‟;
  • 93. 93 Pig Latin: ORDER BY • ORDER statement sorts your data for you by the field specified. • Example: a = LOAD „text.csv‟ as (id, name, phone, zip, address, fee); b = ORDER a BY fee; c = ORDER a BY fee DESC, name; DUMP c;
  • 94. 94 Pig Latin: DISTINCT • DISTINCT statement removes duplicate records. Note it works only on entire records, not on individual fields. • Example: a = LOAD „url.csv‟ as (userid, url, dl_bytes, ul_bytes); b = FOREACH a GENERATE userid, url; c = DISTINCT b;
  • 95. 95 Pig Latin: JOIN • JOIN selects records from one input to put together with records from another input. This is done by indicating keys from each input, and when those keys are equal, the two rows are joined. • Example: call = LOAD „call.csv‟ as (MSISDN, callee, duration); user = LOAD „user.csv‟ as (name, MSISDN, address); call_bill = JOIN call by MSISDN, user by MSISDN; bill = FOREACH call_bill GENERATE name, MSISDN, callee, duration, address; STORE bill into „to_be_billed‟;
  • 96. 96 Pig Latin: LIMIT • LIMIT allows you to limit the number of results in the output. • Example: a = LOAD „text.csv‟ as (id, name, phone, zip, address, fee); b = ORDER a BY fee DESC, name; top100 = LIMIT b 100; DUMP top100;
  • 97. 97 Pig Latin: UDF • A UDF (User Defined Function) lets users combine Pig operators with their own or others' code. • UDFs can be written in Java and Python. • UDFs have to be registered before use. • Piggybank is useful • Example: register 'path_to_UDF/piggybank.jar'; a = LOAD 'text.csv' as (id, name, phone, zip, address, fee); b = FOREACH a GENERATE id, org.apache.pig.piggybank.evaluation.string.Reverse(name);
  • 98. 98 Debugging Pig • DESCRIBE – Show the schema of a relation in your scripts • EXPLAIN – Show your scripts‟ execution plan in MapReduce manner • ILLUSTRATE – Run scripts with a sampled data • Pig Statistics – A summary set of statistics on your script
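For instance, reusing the relations from the word-count script earlier, the debugging commands are issued like this in the grunt shell (the output shown in comments is indicative, not exact):
    grunt> DESCRIBE GroupWords;          -- prints the schema, e.g. GroupWords: {group: chararray, Wordlist: {(word: chararray)}}
    grunt> ILLUSTRATE CountGroupWords;   -- runs the pipeline on a small sampled subset and shows example rows at each step
    grunt> EXPLAIN WordCountSorted;      -- shows the logical, physical, and MapReduce plans for the relation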
  • 99. 99 More about Pig • Visit Pig‟s Home Page http://pig.apache.org • http://pig.apache.org/docs/r0.9.2/
  • 101. 101 Hive 程式設計與實作 • Hive Introduction • Getting Data into Hive • Manipulating Data with Hive • Partitioning and Bucketing Data • Hive Best Practices • Hive and HBase
  • 102. 102 Hive: Introduction • Hive was originally developed at Facebook – Provides a very SQL-like language – Can be used by people who know SQL – Under the covers, generates MapReduce jobs that run on the Hadoop cluster – Enabling Hive requires almost no extra work by the system administrator
  • 103. 103 Hive: Architecture • Driver • 將HiveQL語法編譯成MapReduce任務,進行最佳化,發送到Job Tracker執行 • CLI/Web UI • Ad-hoc查詢 • Schema查詢 • 管理介面 • Metastore • JDBC/ODBC • 標準介面與其他資料庫工具及應用程式介接 Driver (compiler, optimizer, executor) metastore Web UI CLI JDBC ODBC
  • 104. 104 Hive: How it works (diagram: Web UI / CLI / JDBC / ODBC clients talk to the Driver (compiler, optimizer, executor), which consults the metastore and creates M/R jobs that run on the Data Nodes of the Hadoop cluster)
  • 105. 105 The Hive Data Model • Hive „layers‟ table definitions on top of data in HDFS • Databases • Tables – Typed columns (int, float, string, boolean, etc) – Also, list: map (for JSON-like data) • Partition • Buckets
  • 106. 106 Hive Datatypes : Primitive Types • TINYINT (1 byte signed integer) • SMALLINT (2 bytes signed integer) • INT (4 bytes signed integer) • BIGINT (8 bytes signed integer) • BOOLEAN (TRUE or FALSE) • FLOAT (single precision floating point) • DOUBLE (double precision floating point) • STRING (Array of Char) • BINARY (Array of Bytes) • TIMESTAMP (integer, float or string)
  • 107. 107 Hive Datatypes: Collection Types • ARRAY <primitive-type> • MAP <primitive-type, primitive-type> • STRUCT <col_name : primitive-type, …>
  • 108. 108 Text File Delimiters • By default, Hive stores data as text files BUT you can choose other file formats. • Hive's default record and field delimiters: CREATE TABLE … ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
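For comparison, a hedged example of overriding the defaults for a plain comma-separated file (the table and column names are made up for illustration):
    CREATE TABLE stocks_csv (
      symbol STRING,
      ymd STRING,
      price_close FLOAT)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;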
  • 109. 109 The Hive Metastore • Hive‟s Metastore is a database containing table definitions and other metadata – By default, stored locally on the client machine in a Derby database – If multiple people will be using Hive, the system administrator should create a shared Metastore • Usually in MySQL or some other relational database server
  • 110. 110 Hive is Schema on Read • Relational Database – Schema on Write – Gatekeeper – Alter schema is painful! • Hive – Schema on Read – Requires less ETL efforts
  • 111. 111 Hive Data: Physical Layout • Hive tables are stored in Hive's 'warehouse' directory in HDFS – By default, /user/hive/warehouse • Tables are stored in subdirectories of the warehouse directory – Partitions form subdirectories of tables • Possible to create external tables if the data is already in HDFS and should not be moved from its current location • Actual data is stored in flat files – Control character-delimited text or SequenceFiles – Can be an arbitrary format with the use of a custom Serializer/Deserializer (“SerDe”)
  • 112. 112 Hive Limitations • Not all “standard” SQL is supported – No correlated subqueries, for example • No support for UPDATE or DELETE • No support for INSERT single rows • Relatively limited number of built-in functions
  • 113. 113 Starting The Hive Shell • To launch the Hive shell, start a terminal and run – $ hive • Results in the Hive prompt: – hive> • Autocomplete – Tab • Query Column Headers – Hive> set hive.cli.print.header=true; – Hive> set hive.cli.print.current.db=true;
  • 114. 114 Hive’s Word Count CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY count DESC; SELECT * FROM word_counts LIMIT 30;
  • 116. 116 Data Definition • Database – CREATE/DROP – ALTER (set DBPROPERTIES, name-value pair) – SHOW/DESCRIBE – USE • Table – CREATE/DROP – ALTER – SHOW/DESCRIBE – CREATE EXTERNAL TABLE
  • 117. 117 Creating Tables • CREATE TABLE IF NOT EXISTS table_name … ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' • CREATE EXTERNAL TABLE … LOCATION '/user/mydata'
  • 118. 118 Creating Tables hive> SHOW TABLES; hive> CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE; hive> DESCRIBE shakespeare;
  • 119. 119 Modify Tables • ALTER TABLE … CHANGE COLUMN old_name new_name type AFTER column; • ALTER TABLE … (ADD|REPLACE) COLUMNS (column_name column type);
  • 120. 120 Partition • Help to organize data in a logical fashion, such as hierarchically. • CREATE TABLE … PARTITIONED BY (column name datatype, …) • CREATE TABLE employee ( name STRING, salary FLOAT) PARTITIONED BY (country STRING, state STRING) • Physical Layout in Hive …/employees/country=CA/state=AB …/employees/country=CA/state=BC …
  • 122. 122 Loading Data into Hive • LOAD DATA INPATH … INTO TABLE … PARTITION … • Data is loaded into Hive with LOAD DATA INPATH statement – Assumes that the data in already in HDFS LOAD DATA INPATH “shakespeare_freq” INTO TABLE shakespeare; • If the data is on the local filesystem, use LOAD DATA LOCAL INPATH
  • 123. 123 Inserting Data into Table from Queries • INSERT OVERWRITE TABLE employees PARTITION (country=„US‟, state=„OR‟) SELECT * FROM staged_employees se WHERE se.cnty=„US‟ and se.st=„OR‟;
  • 124. 124 Dynamic Partitions Properties • hive.exec.dynamic.partition=true • hive.exec.dynamic.partition.mode=nonstrict • hive.exec.max.dynamic.partitions.pernode=100 • hive.exec.max.dynamic.partitions=1000 • hive.exec.max.created.files=100000
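These properties can be set per session in the Hive shell before running a dynamic-partition insert such as the one on the next slide, for example:
    hive> SET hive.exec.dynamic.partition=true;
    hive> SET hive.exec.dynamic.partition.mode=nonstrict;
    hive> SET hive.exec.max.dynamic.partitions.pernode=100;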
  • 125. 125 Dynamics Partition Inserts • What if I have so many partitions ? • INSERT OVERWRITE TABLE employees PARTITION (country, state) SELECT …, se.cnty, se.st FROM staged_employees se; • You can mix static and dynamic partition, for example: • INSERT OVERWRITE TABLE employees PARTITION (country=„US‟, state) SELECT …, se.cnty, se.st FROM staged_employees se WHERE se.cnty = „US‟;
  • 126. 126 Create Table and Loading Data • CREATE TABLE ca_employees AS SELECT name, salary FROM employees se WHERE se.state=„CA‟;
  • 127. 127 Storing Output Results • The SELECT statement on the previous slide would write the data to console • To store the result in HDFS, create a new table then write, for example: INSERT OVERWRITE TABLE newTable SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 5; • Results are stored in the table • Results are just files within the newTable directory – Data can be used in subsequent queries, or in MapReduce jobs
  • 128. 128 Exporting Data • If the data files are already formatted as you want, just copy them. • Or you can use INSERT … DIRECTORY …, for example • INSERT OVERWRITE LOCAL DIRECTORY './ca_employees' SELECT name, salary, address FROM employees se WHERE se.state='CA';
  • 130. 130 SELECT … FROM • SELECT col_name or functions FROM tab_name; hive> SELECT name FROM employees e; • SELECT … FROM … [LIMIT N] – * or Column alias – Column Arithmetic Operators, Aggregation Function – FROM
  • 131. 131 Arithmetic Operators Operator Types Description A + B Numbers Add A and B A - B Numbers Subtract B from A A * B Numbers Multiply A and B A / B Numbers Divide A with B A % B Numbers The remainder of dividing A with B A & B Numbers Bitwise AND A | B Numbers Bitwise OR A ^ B Numbers Bitwise XOR ~A Numbers Bitwise NOT of A
  • 132. 132 Aggregate Functions count(*) covar_pop(col) count(expr) covar_samp(col) sum(col) corr(col1,col2) sum(DISTINCT col) percentile(int_expr,p) avg(col) histogram_numeric min(col) collect_set(col) max(col) stddev_pop(col) variance(col), var_pop(col) stddev_samp(col) var_samp(col) • Map Side aggregation for performance improve hive> SET hive.map.aggr=true;
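As a small illustration using the stocks table from the later JOIN examples (column names follow those examples and are otherwise assumed), several aggregates can be combined in one query:
    hive> SET hive.map.aggr=true;
    hive> SELECT symbol, count(*), min(price_close), max(price_close), avg(price_close)
          FROM stocks
          GROUP BY symbol;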
  • 134. 134 When Hive Can Avoid Map Reduce • SELECT * FROM employees; • SELECT * FROM employees WHERE country=„us‟ AND state=„CA‟ LIMIT 100;
  • 135. 135 WHERE • >, <, =, >=, <=, != • IS NULL/IS NOT NULL • OR AND NOT • LIKE – X% (prefix „X‟) – %X (suffix „X‟) – %X% (substring) – _ (single character) • RLIKE (Java Regular Expression)
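A hedged example of the pattern operators against the employees table used elsewhere in this section (the address and salary values are invented):
    hive> SELECT name FROM employees WHERE address LIKE '%Ave.%';                   -- substring match
    hive> SELECT name FROM employees WHERE address RLIKE '.*(Chicago|Ontario).*';   -- Java regular expression
    hive> SELECT name FROM employees WHERE salary >= 70000 AND state IS NOT NULL;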
  • 136. 136 GROUP BY • Often used in conjunction with aggregate functions such as avg, count, etc. • HAVING – constrains the groups produced by GROUP BY in a way that would otherwise require a subquery. SELECT year(ymd), avg(price_close) FROM stocks WHERE exchange='NASDAQ' AND symbol='AAPL' GROUP BY year(ymd) HAVING avg(price_close) > 50.0;
  • 137. 137 JOIN • Inner JOIN • LEFT OUTER JOIN • RIGHT OUTER JOIN • FULL OUTER JOIN • LEFT SEMI-JOIN • Map-side Joins
  • 138. 138 Inner JOIN • SELECT a.ymd, a.price_close, b.price_close FROM stocks a JOIN stocks b ON a.ymd=b.ymd WHERE a.symbol=„AAPL‟ AND b.symbol=„IBM‟ • SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM stocks s JOIN dividends d ON s.ymd=d.ymd AND s.symbol=d.symbol WHERE s.symbol=„AAPL‟
  • 139. 139 LEFT OUTER JOIN • All records from the lefthand table that match the WHERE clause are returned; NULL is returned for righthand columns when no record matches the ON criteria • SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd=d.ymd AND s.symbol = d.symbol WHERE s.symbol='AAPL'
  • 140. 140 RIGHT and FULL OUTER JOIN • RIGHT OUTER JOIN – All records from the righthand table that match the WHERE clause are returned; NULL is returned for lefthand columns when no record matches the ON criteria • FULL OUTER JOIN – All records from both tables that match the WHERE clause are returned; NULL is returned for the other table's columns when no record matches the ON criteria
  • 141. 141 LEFT SEMI-JOIN • Returns records from the lefthand table if matching records are found in the righthand table that satisfy the ON predicates. • SELECT s.ymd, s.symbol, s.price_close FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol • RIGHT SEMI-JOIN is not supported
  • 142. 142 Map-side Joins • If one of the table is small, the largest table can be streamed through mappers and the small tables are cached in memory. • SELECT /*+ MAPJOIN(d)*/ s.ymd, s.symbol, s.price_close, d.dividend FROM stocks s JOIN dividends d ON s.ymd=d.ymd AND s.symbol=d.symbol WHERE s.symbol=„AAPL‟
  • 143. 143 ORDER BY and SORT BY • ORDER BY performs a total ordering of query result set. • All data passed through single reducer. Caution for larger data sets. For example: – SELECT s.ymd, s.symbol, s.price_close FROM stocks s ORDER BY s.ymd ASC, s.symbol DESC; • SORT BY performs a local ordering, where each reducer‟s output will be sorted.
  • 144. 144 DISTRIBUTE BY with SORT BY • By default, MapReduce partitions mapper output by a hash of the key, so rows with the same column value can end up in different reducers. • We can use DISTRIBUTE BY to ensure that records with the same column value go to the same reducer, and SORT BY to order the data within each reducer. • SELECT s.ymd, s.symbol, s.price_close FROM stocks s DISTRIBUTE BY s.symbol SORT BY s.symbol ASC, s.ymd ASC • DISTRIBUTE BY must appear before SORT BY when both are used
  • 145. 145 CLUSTER BY • Shorthand for DISTRIBUTE BY … SORT BY on the same columns • CLUSTER BY always sorts in ascending order; no custom sort order can be given • SELECT s.ymd, s.symbol, s.price_close FROM stocks s CLUSTER BY s.symbol;
  • 146. 146 Creating User-Defined Functions • Hive supports manipulation of data via user-created functions • Example: INSERT OVERWRITE TABLE u_data_new SELECT TRANSFORM (userid, movieid, rating, unixtime) USING 'python weekday_mapper.py' AS (userid, movieid, rating, weekday) FROM u_data;
  • 147. 147 Hive: Where to Learn More • http://hive.apache.org/ • Programming Hive
  • 148. 148 Choosing Between Pig and Hive • Typically, organizations wanting an abstraction on top of standard MapReduce will choose to use either Hive or Pig • Which one is chosen depends on the skillset of the target users – Those with an SQL background will naturally gravitate towards Hive – Those who do not know SQL will often choose Pig • Each has strengths and weaknesses, it is worth spending some time investigating each so you can make an informed decision • Some organizations are now choosing to use both – Pig deals better with less-structured data, so Pig is used to manipulate the data into a more structured form, then Hive is used to query that structured data
  • 149. www.etusolution.com info@etusolution.com Taipei, Taiwan 318, Rueiguang Rd., Taipei 114, Taiwan T: +886 2 7720 1888 F: +886 2 8798 6069 Beijing, China Room B-26, Landgent Center, No. 24, East Third Ring Middle Rd., Beijing, China 100022 T: +86 10 8441 7988 F: +86 10 8441 7227 Contact

Editor's notes

  1. In this join, all the records from the lefthand table that match the WHERE clause are returned. If the righthand table doesn't have a record that matches the ON criteria, NULL is used for each column selected from the righthand table.