SlideShare una empresa de Scribd logo
1 de 29
Hive and HiveQL
What is Hive?
• Apache Hive is a data warehouse system for Hadoop.
• Hive is not a relational database, it only maintains metadata information about
your Big Data stored on HDFS.
• Hive allows to treat your Big Data as tables and perform SQL-like operations on
the data using a scripting language called HiveQL.
• Hive is not a database, but it uses a database (called the metastore) to store the
tables that you define. Hive uses Derby by default.
• A Hive table consists of a schema stored in the metastore and data stored on
HDFS.
• Hive converts HiveQL commands into MapReduce jobs.
Hive Architecture Contd..
Step 1: Issuing Commands
Using the Hive CLI, a Web interface, or a Hive JDBC/ODBC client, a Hive
query is submitted to the HiveServer.
Step 2: Hive Query Plan
The Hive query is compiled, optimized and planned as a MapReduce job.
Step 3: MapReduce Job Executes
The corresponding MapReduce job is executed on the Hadoop cluster.
Comparison with Traditional Database
Hive data types
Arithmetic Operators
Mathematical functions
Aggregate functions
Other built-in functions
Managed Tables
• When a table is created in Hive, by default Hive will manage the data, which
means that Hive moves the data into its warehouse directory.
• When data is loaded into a managed table, it is moved into Hive’s warehouse
directory.
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table;
• It will move the file hdfs://user/tom/data.txt into Hive’s warehouse directory for
the managed_table table, which is hdfs://user/hive/warehouse/managed_table
• If the table is later dropped, then the table, including its metadata and its data, is
deleted.
External Tables
• When a External table is created, it tells Hive to refer to the data that is at an
existing location outside the warehouse directory and it is not managed by Hive.
• The location of the external data is specified at table creation time
CREATE EXTERNAL TABLE external_table (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' LINES TERMINATED BY 'n'
LOCATION '/user/tom/external_table_location/file.txt';
• Creation and deletion of the data can be controlled.
• Hive tables can be organized into
buckets, which imposes extra
structure on the table and how the
underlying files are stored.
Bucketing has two key benefits:
• More efficient queries: especially
when performing joins on the
same bucketed columns.
• More efficient sampling: because
the data is already split up into
smaller pieces.
Storage Formats
• There are two dimensions that govern table storage in Hive
• Row format : The row format dictates how rows, and the fields in a particular row, are
stored. The row format is defined by a SerDe.
• File format : The file format dictates the container format for fields in a row.
The default storage format: Delimited text
• When a table is created with no ROW FORMAT or STORED AS clauses, the default format is
delimited text, with a row per line.
• The default row delimiter is not a tab character, but the Control-A character.
• The default collection item delimiter is a Control-B character, used to delimit items in an ARRAY
or STRUCT, or key-value pairs in a MAP.
• The default map key delimiter is a Control-C character, used to delimit the key and value in a
MAP.
• Rows in a table are delimited by a newline character.
Storage Formats Contd..
CREATE TABLE XYZ;
is identical to the more explicit:
CREATE TABLE ...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '001'
COLLECTION ITEMS TERMINATED BY '002'
MAP KEYS TERMINATED BY '003'
LINES TERMINATED BY 'n'
STORED AS TEXTFILE;
Importing Data
INSERT OVERWRITE TABLE
INSERT OVERWRITE TABLE target
SELECT col1, col2
FROM source;
• Forpartitionedtables,youcanspecifythepartitiontoinsertintobysupplyingaPARTITION
clause:
INSERT OVERWRITE TABLE target
PARTITION (dt='2010-01-01')
SELECT col1, col2
FROM source;
Importing Data Contd..
Multitable insert
FROM records2
INSERT OVERWRITE TABLE stations_by_year
SELECT year, COUNT(DISTINCT station)
GROUP BY year
INSERT OVERWRITE TABLE records_by_year
SELECT year, COUNT(1)
GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
SELECT year, COUNT(1)
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY year;
Importing Data Contd..
CREATE TABLE...AS SELECT
CREATE TABLE target
AS
SELECT col1, col2
FROM source;
• A CTAS operation is atomic, so if the SELECT query fails for some
reason, then the table is not created.
Altering Tables
• ALTER TABLE source RENAME TO target;
• ALTER TABLE target ADD COLUMNS (col3 STRING);
Dropping Tables
DROP TABLE table_name;
• The DROP TABLE statement deletes the data and metadata for a table. In the
case of external tables, only the metadata is deleted—the data is left
untouched.
Querying data(Sorting and Aggregating)
• Sorting data in Hive can be achieved by use of a standard ORDER BY clause.
ORDER BY produces a result that is totally sorted, so sets the number of reducers
to one.
• SORT BY produces a sorted file per reducer.
• DISTRIBUTE BY clause used to control which reducer a particular row goes to.
• Inner joins
Querying data(Joins)
• Left Outer Join
• Right Outer Join
• Full Outer Join
• Left Semi Join
Subqueries
• A subquery is a SELECT statement that is embedded in another SQL statement.
Hive has limited support for subqueries, only permitting a subquery in the FROM
clause of a SELECT statement.
SELECT station, year, AVG(max_temperature)
FROM (
SELECT station, year, MAX(temperature) AS max_temperature
FROM records2
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY station, year
) mt
GROUP BY station, year;
Views
• A view is a sort of “virtual table” that is defined by a SELECT statement.
CREATE VIEW max_temperatures (station, year, max_temperature) AS
SELECT station, year, MAX(temperature)
FROM valid_records
GROUP BY station, year;
• With the views in place, we can now use them by running a query:
SELECT station, year, AVG(max_temperature)
FROM max_temperatures
GROUP BY station, year;
Hive Join Strategies
Hive and HiveQL - Module6
Hive and HiveQL - Module6
Hive and HiveQL - Module6

Más contenido relacionado

La actualidad más candente

Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Takrim Ul Islam Laskar
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Query processing-and-optimization
Query processing-and-optimizationQuery processing-and-optimization
Query processing-and-optimizationWBUTTUTORIALS
 
Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetAnkit Beohar
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...Databricks
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitionerSubhas Kumar Ghosh
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangDatabricks
 

La actualidad más candente (20)

Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Aggregate fact tables
Aggregate fact tablesAggregate fact tables
Aggregate fact tables
 
Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)Introduction to Apache Hive(Big Data, Final Seminar)
Introduction to Apache Hive(Big Data, Final Seminar)
 
6.hive
6.hive6.hive
6.hive
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Query processing-and-optimization
Query processing-and-optimizationQuery processing-and-optimization
Query processing-and-optimization
 
Spark rdd vs data frame vs dataset
Spark rdd vs data frame vs datasetSpark rdd vs data frame vs dataset
Spark rdd vs data frame vs dataset
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Partitioning
PartitioningPartitioning
Partitioning
 
Unit 5-apache hive
Unit 5-apache hiveUnit 5-apache hive
Unit 5-apache hive
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitioner
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Sqoop
SqoopSqoop
Sqoop
 
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan ZhangExperiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
 
Data partitioning
Data partitioningData partitioning
Data partitioning
 

Similar a Hive and HiveQL - Module6

Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveWill Du
 
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICS
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICSHive_An Brief Introduction to HIVE_BIGDATAANALYTICS
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICSRUHULAMINHAZARIKA
 
Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Rohit Agrawal
 
Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingearnwithme2522
 
Clase 11 manejo tablas modificada
Clase 11 manejo tablas   modificadaClase 11 manejo tablas   modificada
Clase 11 manejo tablas modificadaTitiushko Jazz
 
Clase 11 manejo tablas modificada
Clase 11 manejo tablas   modificadaClase 11 manejo tablas   modificada
Clase 11 manejo tablas modificadaTitiushko Jazz
 
Hive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupHive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupRemus Rusanu
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityMapR Technologies
 
Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018Aman Sinha
 
Amazon Redshift: Performance Tuning and Optimization
Amazon Redshift: Performance Tuning and OptimizationAmazon Redshift: Performance Tuning and Optimization
Amazon Redshift: Performance Tuning and OptimizationAmazon Web Services
 
Юра Гуляев. Oracle tables
Юра Гуляев. Oracle tablesЮра Гуляев. Oracle tables
Юра Гуляев. Oracle tablesAleksandr Motsjonov
 
hive_slidesjhsdjhsasdfksnfjisnvosjnv-2.pptx
hive_slidesjhsdjhsasdfksnfjisnvosjnv-2.pptxhive_slidesjhsdjhsasdfksnfjisnvosjnv-2.pptx
hive_slidesjhsdjhsasdfksnfjisnvosjnv-2.pptxOmarBen27
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Michael Rys
 
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptxShshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx086ChintanPatel1
 

Similar a Hive and HiveQL - Module6 (20)

03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICS
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICSHive_An Brief Introduction to HIVE_BIGDATAANALYTICS
Hive_An Brief Introduction to HIVE_BIGDATAANALYTICS
 
Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7Advance Hive, NoSQL Database (HBase) - Module 7
Advance Hive, NoSQL Database (HBase) - Module 7
 
Apache hive
Apache hiveApache hive
Apache hive
 
Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketing
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Clase 11 manejo tablas modificada
Clase 11 manejo tablas   modificadaClase 11 manejo tablas   modificada
Clase 11 manejo tablas modificada
 
Clase 11 manejo tablas modificada
Clase 11 manejo tablas   modificadaClase 11 manejo tablas   modificada
Clase 11 manejo tablas modificada
 
Hive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupHive @ Bucharest Java User Group
Hive @ Bucharest Java User Group
 
Introduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and SecurityIntroduction to Apache HBase, MapR Tables and Security
Introduction to Apache HBase, MapR Tables and Security
 
01 hbase
01 hbase01 hbase
01 hbase
 
Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018
 
Amazon Redshift: Performance Tuning and Optimization
Amazon Redshift: Performance Tuning and OptimizationAmazon Redshift: Performance Tuning and Optimization
Amazon Redshift: Performance Tuning and Optimization
 
Юра Гуляев. Oracle tables
Юра Гуляев. Oracle tablesЮра Гуляев. Oracle tables
Юра Гуляев. Oracle tables
 
hive_slidesjhsdjhsasdfksnfjisnvosjnv-2.pptx
hive_slidesjhsdjhsasdfksnfjisnvosjnv-2.pptxhive_slidesjhsdjhsasdfksnfjisnvosjnv-2.pptx
hive_slidesjhsdjhsasdfksnfjisnvosjnv-2.pptx
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
 
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptxShshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx
Shshsjsjsjs-4 - Copdjsjjsjsjsjakakakaaky.pptx
 
Tunning overview
Tunning overviewTunning overview
Tunning overview
 

Más de Rohit Agrawal

Apache Oozie Workflow Scheduler - Module 10
Apache Oozie Workflow Scheduler - Module 10Apache Oozie Workflow Scheduler - Module 10
Apache Oozie Workflow Scheduler - Module 10Rohit Agrawal
 
Hadoop 2.0, MRv2 and YARN - Module 9
Hadoop 2.0, MRv2 and YARN - Module 9Hadoop 2.0, MRv2 and YARN - Module 9
Hadoop 2.0, MRv2 and YARN - Module 9Rohit Agrawal
 
Advance HBase and Zookeeper - Module 8
Advance HBase and Zookeeper - Module 8Advance HBase and Zookeeper - Module 8
Advance HBase and Zookeeper - Module 8Rohit Agrawal
 
Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Rohit Agrawal
 
Advance MapReduce Concepts - Module 4
Advance MapReduce Concepts - Module 4Advance MapReduce Concepts - Module 4
Advance MapReduce Concepts - Module 4Rohit Agrawal
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Rohit Agrawal
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Rohit Agrawal
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 

Más de Rohit Agrawal (8)

Apache Oozie Workflow Scheduler - Module 10
Apache Oozie Workflow Scheduler - Module 10Apache Oozie Workflow Scheduler - Module 10
Apache Oozie Workflow Scheduler - Module 10
 
Hadoop 2.0, MRv2 and YARN - Module 9
Hadoop 2.0, MRv2 and YARN - Module 9Hadoop 2.0, MRv2 and YARN - Module 9
Hadoop 2.0, MRv2 and YARN - Module 9
 
Advance HBase and Zookeeper - Module 8
Advance HBase and Zookeeper - Module 8Advance HBase and Zookeeper - Module 8
Advance HBase and Zookeeper - Module 8
 
Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5
 
Advance MapReduce Concepts - Module 4
Advance MapReduce Concepts - Module 4Advance MapReduce Concepts - Module 4
Advance MapReduce Concepts - Module 4
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2Hadoop Cluster Configuration and Data Loading - Module 2
Hadoop Cluster Configuration and Data Loading - Module 2
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 

Último

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 

Último (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Hive and HiveQL - Module6

  • 2. What is Hive? • Apache Hive is a data warehouse system for Hadoop. • Hive is not a relational database, it only maintains metadata information about your Big Data stored on HDFS. • Hive allows to treat your Big Data as tables and perform SQL-like operations on the data using a scripting language called HiveQL. • Hive is not a database, but it uses a database (called the metastore) to store the tables that you define. Hive uses Derby by default. • A Hive table consists of a schema stored in the metastore and data stored on HDFS. • Hive converts HiveQL commands into MapReduce jobs.
  • 3.
  • 4. Hive Architecture Contd.. Step 1: Issuing Commands Using the Hive CLI, a Web interface, or a Hive JDBC/ODBC client, a Hive query is submitted to the HiveServer. Step 2: Hive Query Plan The Hive query is compiled, optimized and planned as a MapReduce job. Step 3: MapReduce Job Executes The corresponding MapReduce job is executed on the Hadoop cluster.
  • 11. Managed Tables • When a table is created in Hive, by default Hive will manage the data, which means that Hive moves the data into its warehouse directory. • When data is loaded into a managed table, it is moved into Hive’s warehouse directory. CREATE TABLE managed_table (dummy STRING); LOAD DATA INPATH '/user/tom/data.txt' INTO table managed_table; • It will move the file hdfs://user/tom/data.txt into Hive’s warehouse directory for the managed_table table, which is hdfs://user/hive/warehouse/managed_table • If the table is later dropped, then the table, including its metadata and its data, is deleted.
  • 12. External Tables • When a External table is created, it tells Hive to refer to the data that is at an existing location outside the warehouse directory and it is not managed by Hive. • The location of the external data is specified at table creation time CREATE EXTERNAL TABLE external_table (dummy STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' LINES TERMINATED BY 'n' LOCATION '/user/tom/external_table_location/file.txt'; • Creation and deletion of the data can be controlled.
  • 13.
  • 14. • Hive tables can be organized into buckets, which imposes extra structure on the table and how the underlying files are stored. Bucketing has two key benefits: • More efficient queries: especially when performing joins on the same bucketed columns. • More efficient sampling: because the data is already split up into smaller pieces.
  • 15. Storage Formats • There are two dimensions that govern table storage in Hive • Row format : The row format dictates how rows, and the fields in a particular row, are stored. The row format is defined by a SerDe. • File format : The file format dictates the container format for fields in a row. The default storage format: Delimited text • When a table is created with no ROW FORMAT or STORED AS clauses, the default format is delimited text, with a row per line. • The default row delimiter is not a tab character, but the Control-A character. • The default collection item delimiter is a Control-B character, used to delimit items in an ARRAY or STRUCT, or key-value pairs in a MAP. • The default map key delimiter is a Control-C character, used to delimit the key and value in a MAP. • Rows in a table are delimited by a newline character.
  • 16. Storage Formats Contd.. CREATE TABLE XYZ; is identical to the more explicit: CREATE TABLE ... ROW FORMAT DELIMITED FIELDS TERMINATED BY '001' COLLECTION ITEMS TERMINATED BY '002' MAP KEYS TERMINATED BY '003' LINES TERMINATED BY 'n' STORED AS TEXTFILE;
  • 17. Importing Data INSERT OVERWRITE TABLE INSERT OVERWRITE TABLE target SELECT col1, col2 FROM source; • Forpartitionedtables,youcanspecifythepartitiontoinsertintobysupplyingaPARTITION clause: INSERT OVERWRITE TABLE target PARTITION (dt='2010-01-01') SELECT col1, col2 FROM source;
  • 18. Importing Data Contd.. Multitable insert FROM records2 INSERT OVERWRITE TABLE stations_by_year SELECT year, COUNT(DISTINCT station) GROUP BY year INSERT OVERWRITE TABLE records_by_year SELECT year, COUNT(1) GROUP BY year INSERT OVERWRITE TABLE good_records_by_year SELECT year, COUNT(1) WHERE temperature != 9999 AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9) GROUP BY year;
  • 19. Importing Data Contd.. CREATE TABLE...AS SELECT CREATE TABLE target AS SELECT col1, col2 FROM source; • A CTAS operation is atomic, so if the SELECT query fails for some reason, then the table is not created.
  • 20. Altering Tables • ALTER TABLE source RENAME TO target; • ALTER TABLE target ADD COLUMNS (col3 STRING); Dropping Tables DROP TABLE table_name; • The DROP TABLE statement deletes the data and metadata for a table. In the case of external tables, only the metadata is deleted—the data is left untouched.
  • 21. Querying data(Sorting and Aggregating) • Sorting data in Hive can be achieved by use of a standard ORDER BY clause. ORDER BY produces a result that is totally sorted, so sets the number of reducers to one. • SORT BY produces a sorted file per reducer. • DISTRIBUTE BY clause used to control which reducer a particular row goes to.
  • 23. • Left Outer Join • Right Outer Join • Full Outer Join • Left Semi Join
  • 24. Subqueries • A subquery is a SELECT statement that is embedded in another SQL statement. Hive has limited support for subqueries, only permitting a subquery in the FROM clause of a SELECT statement. SELECT station, year, AVG(max_temperature) FROM ( SELECT station, year, MAX(temperature) AS max_temperature FROM records2 WHERE temperature != 9999 AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9) GROUP BY station, year ) mt GROUP BY station, year;
  • 25. Views • A view is a sort of “virtual table” that is defined by a SELECT statement. CREATE VIEW max_temperatures (station, year, max_temperature) AS SELECT station, year, MAX(temperature) FROM valid_records GROUP BY station, year; • With the views in place, we can now use them by running a query: SELECT station, year, AVG(max_temperature) FROM max_temperatures GROUP BY station, year;