Hive Query Language (HQL)
Example 
• Here’s one way of populating a single row table: 
% echo 'X' > /tmp/dummy.txt 
% hive -e "CREATE TABLE dummy (value STRING);  
LOAD DATA LOCAL INPATH '/tmp/dummy.txt'  
OVERWRITE INTO TABLE dummy" 
• Use 'hive' to start the Hive shell, then try the following: 
• SHOW TABLES; 
• SELECT * FROM dummy; 
• DROP TABLE dummy; 
• SHOW TABLES;
Example 
• Let’s see how to use Hive to run a query on the weather dataset 
• The first step is to load the data into Hive’s managed storage. 
• Here we’ll have Hive use the local filesystem for storage; later we’ll see how 
to store tables in HDFS. 
• Just like an RDBMS, Hive organizes its data into tables. 
• We create a table to hold the weather data using the CREATE TABLE 
statement: 
CREATE TABLE records (year STRING, temperature INT, quality INT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'; 
• The first line declares a records table with three columns: year, temperature, 
and quality. 
• The type of each column must be specified, too: here the year is a string, 
while the other two columns are integers. 
• So far, the SQL is familiar. The ROW FORMAT clause, however, is particular to 
HiveQL. 
• It is saying that each row in the data file is tab-delimited text.
Example 
• Next we can populate Hive with the data. This is just a small sample, for 
exploratory purposes: 
LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt' 
OVERWRITE INTO TABLE records; 
• Running this command tells Hive to put the specified local file in its 
warehouse directory. 
• This is a simple filesystem operation. 
• There is no attempt, for example, to parse the file and store it in an internal 
database format, since Hive does not mandate any particular file format. 
Files are stored verbatim: they are not modified by Hive. 
• Tables are stored as directories under Hive’s warehouse directory, which is 
controlled by the hive.metastore.warehouse.dir property and defaults to 
/user/hive/warehouse. 
• The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete 
any existing files in the directory for the table. If it is omitted, then the new 
files are simply added to the table’s directory (unless they have the same 
names, in which case they replace the old files).
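• For contrast, loading without OVERWRITE simply adds the file to the table’s 
directory; a sketch (sample2.txt is a hypothetical second extract): 
LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample2.txt' 
INTO TABLE records; 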
Example 
• Query: 
SELECT year, MAX(temperature) 
FROM records 
WHERE temperature != 9999 
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9) 
GROUP BY year; 
• Hive is configured using an XML configuration file like Hadoop’s. The file is 
called hive-site.xml and is located in Hive’s conf directory.
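• A minimal hive-site.xml might look like this sketch (the single property shown 
simply restates the default warehouse location): 
<?xml version="1.0"?> 
<configuration> 
  <property> 
    <name>hive.metastore.warehouse.dir</name> 
    <value>/user/hive/warehouse</value> 
  </property> 
</configuration> 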
Hive Services 
• cli 
– The command line interface to Hive (the shell). This is the default service. 
• hiveserver 
– Runs Hive as a server exposing a Thrift service, enabling access from a range of 
clients written in different languages. Applications using the Thrift, JDBC, and 
ODBC connectors need to run a Hive server to communicate with Hive. Set the 
HIVE_PORT environment variable to specify the port the server will listen on 
(defaults to 10000). 
• hwi 
– The Hive Web Interface. 
• jar 
– The Hive equivalent to hadoop jar, a convenient way to run Java applications that 
includes both Hadoop and Hive classes on the classpath. 
• metastore 
– By default, the metastore is run in the same process as the Hive service. Using 
this service, it is possible to run the metastore as a standalone (remote) process. 
Set the METASTORE_PORT environment variable to specify the port the server 
will listen on.
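• Each of these is launched with the --service option; a sketch (ports are set 
via the environment variables described above): 
% hive --service hiveserver 
% hive --service metastore 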
Hive architecture
Clients 
• Thrift Client 
– The Hive Thrift Client makes it easy to run Hive commands from a wide range of 
programming languages. Thrift bindings for Hive are available for C++, Java, PHP, 
Python, and Ruby. They can be found in the src/service/src subdirectory in the 
Hive distribution. 
• JDBC Driver 
– Hive provides a Type 4 (pure Java) JDBC driver, defined in the class 
org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of 
the form jdbc:hive://host:port/dbname, a Java application will connect to a Hive 
server running in a separate process at the given host and port. (The driver 
makes calls to an interface implemented by the Hive Thrift Client using the Java 
Thrift bindings.) You may alternatively choose to connect to Hive via JDBC in 
embedded mode using the URI jdbc:hive://. In this mode, Hive runs in the same 
JVM as the application invoking it, so there is no need to launch it as a 
standalone server since it does not use the Thrift service or the Hive Thrift Client. 
• ODBC Driver 
– The Hive ODBC Driver allows applications that support the ODBC protocol to 
connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to 
communicate with the Hive server.) The ODBC driver is still in development, so 
you should refer to the latest instructions on the Hive wiki for how to build and 
run it.
The Metastore (Embedded) 
• The metastore is the central repository of Hive metadata. The metastore is 
divided into two pieces: 
– a service and 
– the backing store for the data 
• By default, the metastore service runs in the same JVM as the Hive service 
and contains an embedded Derby database instance backed by the local 
disk. 
• This is called the embedded metastore configuration. 
• Using an embedded metastore is a simple way to get started with Hive; 
however, only one embedded Derby database can access the database files 
on disk at any one time, which means you can only have one Hive session 
open at a time that shares the same metastore.
Local Metastore 
• The solution to supporting multiple sessions (and therefore multiple users) is 
to use a standalone database. 
• This configuration is referred to as a local metastore, since the metastore 
service still runs in the same process as the Hive service, but connects to a 
database running in a separate process, either on the same machine or on a 
remote machine. 
• Any JDBC-compliant database may be used. 
• MySQL is a popular choice for the standalone metastore. 
• In this case, javax.jdo.option.ConnectionURL is set to 
jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and 
javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. (The 
user name and password should be set, too, of course.)
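• A sketch of the corresponding hive-site.xml entries (host, dbname, and the 
credentials are placeholders): 
<property> 
  <name>javax.jdo.option.ConnectionURL</name> 
  <value>jdbc:mysql://host/dbname?createDatabaseIfNotExist=true</value> 
</property> 
<property> 
  <name>javax.jdo.option.ConnectionDriverName</name> 
  <value>com.mysql.jdbc.Driver</value> 
</property> 
<property> 
  <name>javax.jdo.option.ConnectionUserName</name> 
  <value>hiveuser</value> 
</property> 
<property> 
  <name>javax.jdo.option.ConnectionPassword</name> 
  <value>password</value> 
</property> 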
Remote Metastore 
• Going a step further, there’s another metastore configuration called a 
remote metastore, where one or more metastore servers run in separate 
processes to the Hive service. 
• This brings better manageability and security, since the database tier can be 
completely firewalled off, and the clients no longer need the database 
credentials.
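• Clients locate a remote metastore through the hive.metastore.uris property; a 
sketch (the hostname is a placeholder; 9083 is the conventional metastore port): 
<property> 
  <name>hive.metastore.uris</name> 
  <value>thrift://metastorehost:9083</value> 
</property> 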
Updates, transactions, and indexes 
• Updates, transactions, and indexes are mainstays of traditional 
databases. 
• Yet, until recently, these features have not been considered a 
part of Hive’s feature set. 
• This is because Hive was built to operate over HDFS data using 
MapReduce, where full-table scans are the norm and a table 
update is achieved by transforming the data into a new table. 
• For a data warehousing application that runs over large portions 
of the dataset, this works well.
HiveQL or HQL 
• Hive’s SQL dialect, called HiveQL, does not support the full SQL-92 
specification.
Tables 
• A Hive table is logically made up of the data being stored and the associated 
metadata describing the layout of the data in the table. 
• The data typically resides in HDFS, although it may reside in any Hadoop 
filesystem, including the local filesystem or S3. 
• Hive stores this metadata in a relational database, called the metastore, 
rather than in HDFS. 
• Many relational databases have a facility for multiple namespaces, which 
allow users and applications to be segregated into different databases or 
schemas. 
• Hive supports the same facility, and provides commands such as CREATE 
DATABASE dbname, USE dbname, and DROP DATABASE dbname. 
• You can fully qualify a table by writing dbname.tablename. If no database is 
specified, tables belong to the default database.
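• A quick sketch of these commands (weather_db is a hypothetical database name): 
CREATE DATABASE weather_db; 
USE weather_db; 
SELECT * FROM weather_db.records; 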
Managed Tables and External Tables 
• When you create a table in Hive, by default Hive will manage the data, which 
means that Hive moves the data into its warehouse directory. Alternatively, 
you may create an external table, which tells Hive to refer to the data that is 
at an existing location outside the warehouse directory. 
• The difference between the two types of table is seen in the LOAD and DROP 
semantics. 
• When you load data into a managed table, it is moved into Hive’s warehouse 
directory. 
• CREATE EXTERNAL TABLE 
• With the EXTERNAL keyword, Hive knows that it is not managing the data, so 
it doesn’t move it to its warehouse directory. 
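• A sketch of an external table (the LOCATION path is an arbitrary example): 
CREATE EXTERNAL TABLE external_table (dummy STRING) 
LOCATION '/user/tom/external_table'; 
• Dropping an external table removes only the metadata; the data itself is left 
untouched. 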
• Hive organizes tables into partitions, a way of dividing a table into coarse-grained 
parts based on the value of a partition column, such as date. Using 
partitions can make it faster to do queries on slices of the data. 
• CREATE TABLE logs (ts BIGINT, line STRING) 
PARTITIONED BY (dt STRING, country STRING); 
• LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' 
INTO TABLE logs 
PARTITION (dt='2001-01-01', country='GB'); 
• SHOW PARTITIONS logs;
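• Because partition columns appear as directory names, a query that filters on 
them lets Hive skip the other partitions’ directories entirely, e.g.: 
SELECT ts, dt, line 
FROM logs 
WHERE country = 'GB'; 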
Buckets 
• There are two reasons why you might want to organize your 
tables (or partitions) into buckets. 
• The first is to enable more efficient queries. 
• Bucketing imposes extra structure on the table, which Hive can 
take advantage of when performing certain queries. 
• In particular, a join of two tables that are bucketed on the same 
columns—which include the join columns—can be efficiently 
implemented as a map-side join.
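• A hedged sketch of such a join, assuming sales and things are both bucketed 
on id; the MAPJOIN hint and the bucketmapjoin flag are this era’s way of 
requesting it: 
SET hive.optimize.bucketmapjoin=true; 
SELECT /*+ MAPJOIN(things) */ sales.*, things.* 
FROM sales JOIN things ON (sales.id = things.id); 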
Buckets 
• The second reason to bucket a table is to make sampling more 
efficient. 
• When working with large datasets, it is very convenient to try 
out queries on a fraction of your dataset while you are in the 
process of developing or refining them. 
• CREATE TABLE bucketed_users (id INT, name STRING) 
CLUSTERED BY (id) INTO 4 BUCKETS; 
• Any particular bucket will effectively have a random set of users 
in it. 
• The data within a bucket may additionally be sorted by one or 
more columns. This allows even more efficient map-side joins, 
since the join of each bucket becomes an efficient merge-sort. 
• CREATE TABLE bucketed_users (id INT, name STRING) 
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
Buckets 
• Physically, each bucket is just a file in the table (or partition) 
directory. 
• In fact, buckets correspond to MapReduce output file partitions: 
a job will produce as many buckets (output files) as reduce 
tasks. 
• We can do a sampling of the table using the TABLESAMPLE 
clause, which restricts the query to a fraction of the buckets in 
the table rather than the whole table: 
• SELECT * FROM bucketed_users 
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id); 
• Bucket numbering is 1-based, so this query retrieves all the 
users from the first of four buckets. 
• For a large, evenly distributed dataset, approximately one 
quarter of the table’s rows would be returned.
Storage Formats 
• There are two dimensions that 
govern table storage in Hive: the 
row format and the file format. 
• The row format dictates how 
rows, and the fields in a particular 
row, are stored. 
• In Hive parlance, the row format 
is defined by a SerDe. 
• When you create a table with no 
ROW FORMAT or STORED AS 
clauses, the default format is 
delimited text, with a row per 
line. 
• You can use sequence files in Hive 
by using the declaration STORED 
AS SEQUENCEFILE in the CREATE 
TABLE statement. 
• Hive provides another binary 
storage format called RCFile, short 
for Record Columnar File. RCFiles 
are similar to sequence files, 
except that they store data in a 
column-oriented fashion.
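• As a sketch, the records schema from earlier stored in each binary format 
(the table names are arbitrary): 
CREATE TABLE records_seq (year STRING, temperature INT, quality INT) 
STORED AS SEQUENCEFILE; 
CREATE TABLE records_rc (year STRING, temperature INT, quality INT) 
STORED AS RCFILE; 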
Let’s do some hands-on 
• Look at the data in /input/tsv/ and choose any one of the TSV files 
• Understand the column types and names 
• drop table X; 
drop table bucketed_X; 
• CREATE TABLE X(id STRING, name STRING) 
• ROW FORMAT DELIMITED 
• FIELDS TERMINATED BY '\t'; 
• LOAD DATA LOCAL INPATH '/input/tsv/X.tsv' 
OVERWRITE INTO TABLE X; 
• SELECT * FROM X; 
• Write a query of your own, for example (y, z, and ABC are placeholders for 
your columns and values): 
SELECT z, COUNT(*) 
FROM X 
WHERE y = 'ABC' 
GROUP BY z; 
• dfs -cat /user/hive/warehouse/X/X.tsv; 
• CREATE TABLE bucketed_X (id STRING, name STRING) 
CLUSTERED BY (id) INTO 4 BUCKETS; 
• Or, after dropping bucketed_X, create it again with sorted buckets: 
CREATE TABLE bucketed_X (id STRING, name STRING) 
CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS; 
• SELECT * FROM X;
Let’s do some hands-on 
• SET hive.enforce.bucketing=true; 
• INSERT OVERWRITE TABLE bucketed_X 
SELECT * FROM X; 
• dfs -ls /user/hive/warehouse/bucketed_X; 
• SELECT * FROM bucketed_X 
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id); 
• SELECT * FROM bucketed_X 
TABLESAMPLE(BUCKET 1 OUT OF 2 ON id); 
• SELECT * FROM X 
TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());
Querying Data 
• Sorting data in Hive can be achieved by use of a standard ORDER BY clause, 
but there is a catch. 
• ORDER BY produces a result that is totally sorted, as expected, but to do so it 
sets the number of reducers to one, making it very inefficient for large 
datasets. 
• When a globally sorted result is not required—and in many cases it isn’t— 
then you can use Hive’s nonstandard extension, SORT BY instead. 
• SORT BY produces a sorted file per reducer. 
• In some cases, you want to control which reducer a particular row goes to, 
typically so you can perform some subsequent aggregation. This is what 
Hive’s DISTRIBUTE BY clause does. 
• FROM records 
SELECT year, temperature 
DISTRIBUTE BY year 
SORT BY year ASC, temperature DESC;
MapReduce Scripts 
• Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and 
REDUCE clauses make it possible to invoke an external script or program 
from Hive. 
is_good_quality.py: 
#!/usr/bin/env python 
import re 
import sys 

for line in sys.stdin: 
    (year, temp, q) = line.strip().split() 
    # keep only readings with a real temperature and an acceptable quality code 
    if temp != "9999" and re.match("[01459]", q): 
        print "%s\t%s" % (year, temp) 

max_temperature_reduce.py: 
#!/usr/bin/env python 
import sys 

(last_key, max_val) = (None, -sys.maxint)  # start below any real temperature 
for line in sys.stdin: 
    (key, val) = line.strip().split("\t") 
    if last_key and last_key != key: 
        # year changed: emit the maximum for the previous year 
        print "%s\t%s" % (last_key, max_val) 
        (last_key, max_val) = (key, int(val)) 
    else: 
        (last_key, max_val) = (key, max(max_val, int(val))) 
if last_key: 
    print "%s\t%s" % (last_key, max_val) 
• ADD FILE Hadoop_training/code/hive/python/is_good_quality.py; 
• ADD FILE Hadoop_training/code/hive/python/max_temperature_reduce.py; 
Example 
CREATE TABLE records (year STRING, temperature INT, quality INT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'; 
LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt' 
OVERWRITE INTO TABLE records; 
FROM records 
SELECT TRANSFORM(year, temperature, quality) 
USING 'is_good_quality.py' 
AS year, temperature; 
FROM ( 
FROM records 
MAP year, temperature, quality 
USING 'is_good_quality.py' 
AS year, temperature) map_output 
REDUCE year, temperature 
USING 'max_temperature_reduce.py' 
AS year, temperature;
Joins 
• The simplest kind of join is the inner join, where each match in the input 
tables results in a row in the output. 
• SELECT sales.*, things.* 
FROM sales JOIN things ON (sales.id = things.id); 
• Outer Join: 
• SELECT sales.*, things.* 
FROM sales LEFT OUTER JOIN things ON (sales.id = things.id); 
• SELECT sales.*, things.* 
FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id); 
• SELECT sales.*, things.* 
FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Joins 
• SELECT * 
FROM things 
WHERE things.id IN (SELECT id FROM sales); 
• This is not possible in Hive, but it can be rewritten as: 
• SELECT * 
FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
End of session 
Day – 4: Hive Query Language (HQL)
