Indexed Hive
NikhilDeshpande
Accelerating Hive queries with indexes.
1.
Indexed Hive
A quick demonstration of Hive performance acceleration using indexes
By: Prafulla Tekawade, Nikhil Deshpande
www.persistentsys.com
2.
Summary
• This presentation describes a performance experiment that uses indexes to accelerate Hive query execution.
• The slides cover:
  • Indexes
  • A specific set of GROUP BY queries
  • The rewrite technique
  • The performance experiment and its results
© 2010 Persistent Systems Ltd www.persistentsys.com
3.
Hive usage
• HDFS spreads and scatters the data across different locations (data nodes).
• Data is dumped and loaded into HDFS "as is".
• There is only one view of the data: the original data structure and layout.
• Data is typically append-only.
• Processing times are dominated by full data scan times.
Can data access times be better?
4.
Hive usage
What can be done to speed up queries?
Cut down the data I/O: less data means faster processing.
Different ways to get performance:
• Columnar storage
• Data partitioning
• Indexing (a different view of the same data)
• …
5.
Hive Indexing
• Provides a key-based view of the data
• Key data is duplicated
• Storage layout favors search and lookup performance
• Provides better data access for certain operations
• A cheaper alternative to full data scans!
How cheap? An order of magnitude better in certain cases!
6.
What does the index look like?
An index is a table with 3 columns:
  hive> describe default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__;
  OK
  l_shipdate     string           (key)
  _bucketname    string           (reference to values)
  _offsets       array<string>    (reference to values)
Data in the index looks like:
  hive> select * from default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__ limit 2;
  OK
  1992-01-08  hdfs://hadoop1:54310/user/…/lineitem.tbl  ["662368"]
  1992-01-16  hdfs://hadoop1:54310/user/…/lineitem.tbl  ["143623","390763","637910"]
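The three-column layout above can be sketched in a few lines of Python. This is an illustration only, not Hive's index builder: the helper name `build_index`, the pipe-delimited rows, and the sample data are all assumptions. Each distinct key maps to the file it came from and to the byte offsets of the rows that hold it, mirroring `_bucketname` and `_offsets`.

```python
from collections import defaultdict

def build_index(lines, key_column, bucketname, delimiter="|"):
    """Map each key value to (bucketname, [byte offsets of rows with that key]).

    Hypothetical helper for illustration. `lines` are the rows of one file,
    each including its trailing newline.
    """
    offsets = defaultdict(list)
    position = 0  # byte offset of the current row within the file
    for line in lines:
        key = line.rstrip("\n").split(delimiter)[key_column]
        offsets[key].append(str(position))  # kept as strings, like array<string>
        position += len(line.encode("utf-8"))
    # One entry per distinct key value, ordered by key.
    return [(key, bucketname, offsets[key]) for key in sorted(offsets)]

# Made-up rows: id | ship date | flag
rows = [
    "1|1992-01-16|A\n",
    "2|1992-01-08|B\n",
    "3|1992-01-16|C\n",
]
index = build_index(rows, key_column=1, bucketname="lineitem.tbl")
for entry in index:
    print(entry)
```

The index holds one row per distinct key, so it is usually much smaller than the base table when keys repeat often.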
7.
Hive index in HQL
• SELECT (mapping, projection, association; given a key, fetch the value)
• WHERE (filters on keys)
• GROUP BY (grouping on keys)
• JOIN (join key as index key)
Indexes have high potential for accelerating a wide range of queries.
8.
Hive Index
• Index as Reference
• Index as Data
This demonstration uses the Index as Data technique to show an order-of-magnitude performance gain!
• Uses a query rewrite technique to transform queries on the base table into queries on the index table.
• Applicability is currently limited (e.g. the demo is based on GROUP BY), but the technique itself has wide potential.
• Also a very quick way to demonstrate the importance of indexes for performance (no deep optimizer or execution engine modifications).
9.
Indexes and Query Rewrites
Demo targeting:
• GROUP BY, aggregation
• Index as Data
• GROUP BY key = index key
• The query is rewritten to use indexes but is still a valid query (nothing special in it!)
10.
Query Rewrites: simple gb
Original query:
  SELECT DISTINCT l_shipdate
  FROM lineitem;
Rewritten to use the index:
  SELECT l_shipdate
  FROM __lineitem_shipdate_idx__;
11.
Query Rewrites: simple agg
Original query:
  SELECT l_shipdate, COUNT(1)
  FROM lineitem
  GROUP BY l_shipdate;
Rewritten to use the index:
  SELECT l_shipdate, size(`_offsets`)
  FROM __lineitem_shipdate_idx__;
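Why the simple rewrites are safe can be checked on a toy example. The sketch below uses made-up data and plain Python dictionaries in place of Hive tables; it assumes the index holds exactly one entry per distinct key and one offset per base-table row, which is what the index layout suggests.

```python
from collections import Counter

# Toy base table: one l_shipdate value per row (made-up data).
lineitem = ["1992-01-16", "1992-01-08", "1992-01-16", "1992-01-16"]

# Toy index: key -> list of row offsets (here simply row numbers).
index = {}
for offset, shipdate in enumerate(lineitem):
    index.setdefault(shipdate, []).append(offset)

# SELECT DISTINCT l_shipdate FROM lineitem
distinct_from_base = set(lineitem)
# SELECT l_shipdate FROM index
distinct_from_index = set(index)

# SELECT l_shipdate, COUNT(1) FROM lineitem GROUP BY l_shipdate
counts_from_base = dict(Counter(lineitem))
# SELECT l_shipdate, size(_offsets) FROM index
counts_from_index = {key: len(offsets) for key, offsets in index.items()}

print(distinct_from_base == distinct_from_index)
print(counts_from_base == counts_from_index)
```

The index side never touches the base rows: the distinct keys are the index keys, and each group's count is the length of its offset list.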
12.
Query Rewrites: gb + where
Original query:
  SELECT l_shipdate, COUNT(1)
  FROM lineitem
  WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996
  GROUP BY l_shipdate;
Rewritten to use the index:
  SELECT l_shipdate, size(`_offsets`)
  FROM __lineitem_shipdate_idx__
  WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996;
13.
Query Rewrites: gb on func(key)
Original query:
  SELECT YEAR(l_shipdate) AS Year, COUNT(1) AS Total
  FROM lineitem
  GROUP BY YEAR(l_shipdate);
Rewritten to use the index:
  SELECT Year, SUM(cnt) AS Total
  FROM (SELECT YEAR(l_shipdate) AS Year, size(`_offsets`) AS cnt
        FROM __lineitem_shipdate_idx__) AS t
  GROUP BY Year;
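When the grouping expression is a function of the key, several index entries can fall into the same group, so the per-key sizes have to be summed. A small Python sketch of that two-step shape follows; the index contents are partly made up and the `year` helper is a stand-in for Hive's `YEAR()`.

```python
from collections import defaultdict

# Toy index: l_shipdate -> offsets. The 1993 entry is made up.
index = {
    "1992-01-08": ["662368"],
    "1992-01-16": ["143623", "390763", "637910"],
    "1993-05-02": ["10", "20"],
}

def year(shipdate):
    """Stand-in for YEAR() on a 'YYYY-MM-DD' string."""
    return int(shipdate[:4])

# Inner query: SELECT YEAR(l_shipdate) AS Year, size(_offsets) AS cnt
inner = [(year(key), len(offsets)) for key, offsets in index.items()]

# Outer query: SELECT Year, SUM(cnt) AS Total ... GROUP BY Year
totals = defaultdict(int)
for yr, cnt in inner:
    totals[yr] += cnt

print(dict(totals))
```

A GROUP BY is still needed after the rewrite, but it runs over one row per distinct ship date instead of one row per line item.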
14.
Histogram Query
Original query:
  SELECT YEAR(l_shipdate) AS Year,
         MONTH(l_shipdate) AS Month,
         COUNT(1) AS Monthly_shipments
  FROM lineitem
  GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);
Rewritten to use the index:
  SELECT YEAR(l_shipdate) AS Year,
         MONTH(l_shipdate) AS Month,
         SUM(sz) AS Monthly_shipments
  FROM (SELECT l_shipdate, SIZE(`_offsets`) AS sz
        FROM __lineitem_shipdate_idx__) AS t
  GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);
15.
Year on Year Query
Original query:
  SELECT y1.Month AS Month,
         y1.shipments AS Y1_shipments,
         y2.shipments AS Y2_shipments,
         (y2.shipments - y1.shipments) / y1.shipments AS Delta
  FROM (SELECT YEAR(l_shipdate) AS Year,
               MONTH(l_shipdate) AS Month,
               COUNT(1) AS Shipments
        FROM lineitem
        WHERE YEAR(l_shipdate) = 1997
        GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
  JOIN (SELECT YEAR(l_shipdate) AS Year,
               MONTH(l_shipdate) AS Month,
               COUNT(1) AS Shipments
        FROM lineitem
        WHERE YEAR(l_shipdate) = 1998
        GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
  ON y1.Month = y2.Month;
16.
Year on Year Query
Rewritten to use the index:
  SELECT y1.Month AS Month,
         y1.shipments AS y1_shipments,
         y2.shipments AS y2_shipments,
         (y2.shipments - y1.shipments) / y1.shipments AS delta
  FROM (SELECT YEAR(l_shipdate) AS Year,
               MONTH(l_shipdate) AS Month,
               SUM(sz) AS shipments
        FROM (SELECT l_shipdate, size(`_offsets`) AS sz
              FROM __lineitem_shipdate_idx__) AS t1
        WHERE YEAR(l_shipdate) = 1997
        GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
  JOIN (SELECT YEAR(l_shipdate) AS Year,
               MONTH(l_shipdate) AS Month,
               SUM(sz) AS shipments
        FROM (SELECT l_shipdate, size(`_offsets`) AS sz
              FROM __lineitem_shipdate_idx__) AS t
        WHERE YEAR(l_shipdate) = 1998
        GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
  ON y1.Month = y2.Month;
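The year-on-year logic (monthly totals per year from the index, a join on month, then a relative delta) can be traced on a toy index. The data and the helper `monthly_shipments` below are made up for illustration; this is a Python sketch of the query's shape, not of how Hive executes it.

```python
from collections import defaultdict

# Toy index: l_shipdate -> offsets (made-up data).
index = {
    "1997-01-10": ["0", "40"],
    "1997-02-03": ["80"],
    "1998-01-05": ["120", "160", "200"],
    "1998-02-20": ["240", "280"],
}

def monthly_shipments(index, year):
    """SUM(size(_offsets)) grouped by month, filtered to a single year."""
    totals = defaultdict(int)
    for shipdate, offsets in index.items():
        y, m, _ = shipdate.split("-")
        if int(y) == year:
            totals[int(m)] += len(offsets)
    return dict(totals)

y1 = monthly_shipments(index, 1997)
y2 = monthly_shipments(index, 1998)

# JOIN ... ON y1.Month = y2.Month, then (y2 - y1) / y1 as the delta.
report = {
    month: (y1[month], y2[month], (y2[month] - y1[month]) / y1[month])
    for month in sorted(y1.keys() & y2.keys())
}
for month, row in report.items():
    print(month, row)
```

Both sides of the join read only the index, which is why the rewritten query scans the small index twice instead of the base table twice.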
17.
Performance tests
Hardware and software configuration:
• 2 server-class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID 5, 16 GB RAM)
• 2-node Hadoop cluster (0.20.2), untuned and unoptimized; data not partitioned or clustered; Hive tables stored in row-store format; HDFS replication factor: 2
• Hive development branch (~0.5)
• Sun JDK 1.6 (server-mode JVM, JVM_HEAP_SIZE: 4 GB RAM)
• Queries on TPC-H data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30 GB data: 21 GB lineitem, ~180 million tuples)
18.
Perf gain for Histogram Query
[Graphs omitted; not to scale. Execution times in seconds by data size.]
  (sec)       1M        1G        10G        30G
  q1_noidx    24.161    76.79     506.005    1551.555
  q1_idx      21.268    27.292    35.502     86.133
19.
Perf gain for Year on Year Query
[Graphs omitted; not to scale. Execution times in seconds by data size.]
  (sec)       1M        1G        10G        30G
  q1_noidx    73.66     130.587   764.619    2146.423
  q1_idx      69.393    75.493    92.867     190.619
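The speedup implied by the reported timings is simply the non-indexed time divided by the indexed time at each data size. The Python snippet below does that arithmetic on the figures from the two result slides; the variable and function names are illustrative.

```python
sizes = ["1M", "1G", "10G", "30G"]

# Execution times in seconds, copied from the two result tables.
histogram = {
    "noidx": [24.161, 76.79, 506.005, 1551.555],
    "idx": [21.268, 27.292, 35.502, 86.133],
}
year_on_year = {
    "noidx": [73.66, 130.587, 764.619, 2146.423],
    "idx": [69.393, 75.493, 92.867, 190.619],
}

def speedups(timings):
    """Non-indexed time divided by indexed time, rounded to one decimal."""
    return [round(n / i, 1) for n, i in zip(timings["noidx"], timings["idx"])]

histogram_speedup = speedups(histogram)
year_on_year_speedup = speedups(year_on_year)

for size, h, y in zip(sizes, histogram_speedup, year_on_year_speedup):
    print(f"{size}: histogram {h}x, year on year {y}x")
```

On these numbers the gain is negligible at 1M and grows with data size, reaching roughly 18x for the histogram query and roughly 11x for the year-on-year query at 30G.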
20.
Why does the index perform better?
Reducing data increases I/O efficiency
• If you need only X, separate X from the rest.
• Less data to process, better memory footprint, better locality of reference…
Exploiting storage layout optimization
• "Right tool for the job", e.g. two ways to do GROUP BY: sort + agg or hash & agg.
• The sort step is already done in the index!
Parallelization
• Process the index data in the same manner as the base table, distributing the processing across nodes.
• Scalable!
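The two ways of doing GROUP BY mentioned above can be contrasted in a short Python sketch. It is a generic illustration of hash aggregation versus aggregation over sorted input, with hypothetical function names; it does not reproduce Hive's operators.

```python
from collections import defaultdict
from itertools import groupby

def hash_aggregate(keys):
    """Hash & agg: works on any input order, keeps one counter per group."""
    counts = defaultdict(int)
    for key in keys:
        counts[key] += 1
    return dict(counts)

def sort_aggregate(sorted_keys):
    """Sort + agg: needs input sorted by key, handles one group at a time."""
    return {key: sum(1 for _ in group) for key, group in groupby(sorted_keys)}

unsorted_keys = ["b", "a", "b", "c", "a", "b"]

from_hash = hash_aggregate(unsorted_keys)
# The explicit sort here is the step that key-ordered data makes unnecessary.
from_sort = sort_aggregate(sorted(unsorted_keys))

print(from_hash == from_sort)
```

If the data already arrives ordered by key, the `sorted(...)` call disappears and the streaming aggregate needs memory for only the current group.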
21.
Near future
• More rewrites
• Partitioning index data per key
• Run-time operators for index usage (lookup, join, filter, etc., since rewrites are only a partial solution)
• Optimizer support for index operators
• Cost-based optimizer to choose between index and non-index plans
• …
22.
Index Design
[Architecture diagram. Components shown: Hive DDL Compiler, Index Builder, Hive Query Compiler, Query Rewrite Engine, Hive DDL Engine, Hive Query Engine, Hadoop MR, HDFS.]
23.
Hive Compiler
[Compiler diagram. Components shown: Parser / AST Generator, Semantic Analyzer, Optimizer / Operator Plan Generator, Query Rewrite Engine, Execution Plan Generator; output goes to Hadoop MR.]
24.
Query Rewrite Engine
[Diagram. A Rule Engine takes a Query Tree and produces a Rewritten Query Tree, using a Rewrite Rules Repository. Each Rewrite Rule consists of a Rewrite Trigger Condition and a Rewrite Action.]
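The rule structure in the diagram (a repository of rules, each pairing a trigger condition with a rewrite action) can be sketched as follows. Everything here is hypothetical and written in Python for illustration: the tiny `Query` class stands in for a real query tree, and `INDEXES`, `RULES`, and the function names are not taken from the Hive patch.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Query:
    """Deliberately tiny stand-in for a query tree (hypothetical)."""
    table: str
    select: tuple
    group_by: tuple = ()

@dataclass(frozen=True)
class RewriteRule:
    name: str
    trigger: Callable[[Query], bool]   # rewrite trigger condition
    action: Callable[[Query], Query]   # rewrite action

# Hypothetical catalog: (table, key column) -> index table name.
INDEXES = {("lineitem", "l_shipdate"): "__lineitem_shipdate_idx__"}

def count_group_by_on_index_key(q):
    """Fires for SELECT key, COUNT(1) ... GROUP BY key when key is indexed."""
    return (len(q.group_by) == 1
            and (q.table, q.group_by[0]) in INDEXES
            and q.select == (q.group_by[0], "COUNT(1)"))

def use_index_as_data(q):
    """Builds a new query on the index table; the input is left untouched."""
    key = q.group_by[0]
    return Query(table=INDEXES[(q.table, key)],
                 select=(key, "size(`_offsets`)"))

RULES = [RewriteRule("gb_to_idx", count_group_by_on_index_key, use_index_as_data)]

def rewrite(q, rules=RULES):
    """Apply the first rule whose trigger matches, else return q unchanged."""
    for rule in rules:
        if rule.trigger(q):
            return rule.action(q)
    return q

original = Query("lineitem", ("l_shipdate", "COUNT(1)"), ("l_shipdate",))
print(rewrite(original))
```

Keeping the condition and the action as separate callables means new rewrites can be added to the repository without changing the engine loop.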
25.
Learning Hive
• The Hive compiler is not driven by syntax-directed translation
  • It is tree-visitor based, with data structures separated from compiler logic
  • The tree is immutable (harder to change, harder to rewrite)
  • Query semantic information is maintained separately from the query lexical/parse tree, in different data structures. These are loosely bound in a Query Block data structure, which is itself loosely bound to the parse tree, and there is no larger data flow graph from which everything hangs. This makes it very difficult to rewrite queries.
• The optimizer is not yet mature
  • It doesn't handle many "obvious" opportunities (e.g. sort-based GROUP BY for cases other than base table scans)
  • It is rule-based, not cost-based; no stats are collected
  • Query tuning is a harder job (requires special knowledge of the optimizer's guts: what works and what doesn't)
• Setting up the development environment is tedious (the build system relies heavily on an internet connection, which is troublesome behind restrictive firewalls).
• Folks in the community are very active. Dependent JIRAs are a fast-moving target, and development-wise we need to keep up with them actively (e.g. if branching, we need to refresh from trunk frequently).
26.
How to get it?
• Needs a working Hadoop cluster (tested with 0.20.2)
• For Hive with indexing support:
  • The Hive Index DDL patch (JIRA 417) is now part of Hive trunk: https://issues.apache.org/jira/browse/HIVE-417
  • Get the Hive branch with the Index Query Rewrite patch applied from GitHub (a fork/branch of the Hive development tree, a snapshot of the Hive + Index DDL source tree; not the latest, but a single place to get everything): http://github.com/prafullat/hive
• Refer to the Hive documentation for building: http://wiki.apache.org/hadoop/Hive/GettingStarted#Downloading_and_building
• See the ql/src/test/queries/client/positive/ql_rewrite_gbtoidx.q test.
27.
Thank You!
prafulla_tekawade at persistent dot co dot in
nikhil_deshpande at persistent dot co dot in