Indexed Hive
NikhilDeshpande
Accelerating Hive queries with indexes.
1.
Indexed Hive
A quick demonstration of Hive performance acceleration using indexes
By: Prafulla Tekawade, Nikhil Deshpande
www.persistentsys.com
2.
Summary
• This presentation describes a performance experiment that uses indexes to accelerate Hive query execution.
• The slides cover:
  • Indexes
  • A specific set of GROUP BY queries
  • The rewrite technique
  • The performance experiment and its results
© 2010 Persistent Systems Ltd www.persistentsys.com
3.
Hive usage
• HDFS spreads and scatters the data across different locations (data nodes).
• Data is dumped and loaded into HDFS as is.
• There is only one view of the data: the original data structure and layout.
• Data is typically append-only.
• Processing times are dominated by full data scan times.
Can the data access times be better?
4.
Hive usage
What can be done to speed up queries? Cut down the data I/O: less data means faster processing.
Different ways to get performance:
• Columnar storage
• Data partitioning
• Indexing (a different view of the same data)
• …
5.
Hive Indexing
• Provides a key-based view of the data
• Key data is duplicated
• Storage layout favors search and lookup performance
• Provides better data access for certain operations
• A cheaper alternative to full data scans!
How cheap? An order of magnitude better in certain cases!
6.
What does the index look like?
An index is a table with 3 columns:
hive> describe default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__;
OK
l_shipdate     string           (key)
_bucketname    string           (reference to values)
_offsets       array<string>    (reference to values)
Data in the index looks like this:
hive> select * from default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__ limit 2;
OK
1992-01-08  hdfs://hadoop1:54310/user/…/lineitem.tbl  ["662368"]
1992-01-16  hdfs://hadoop1:54310/user/…/lineitem.tbl  ["143623","390763","637910"]
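To make the three-column layout concrete, here is a minimal Python sketch of how such an index could be built from a text file. It is an illustration only, not Hive's index builder: the function name `build_index`, the `|` separator, and the three-row sample are assumptions, and the output is sorted by key on the assumption that the index is stored in key order.

```python
from collections import defaultdict

def build_index(lines, bucketname, key_column=0, sep="|"):
    """Return index rows of (key, bucketname, offsets), one row per distinct key.

    Hypothetical helper: offsets are the byte positions at which rows holding
    the key start, kept as strings to mirror the array<string> column.
    """
    offsets_by_key = defaultdict(list)
    offset = 0
    for line in lines:
        key = line.rstrip("\n").split(sep)[key_column]
        offsets_by_key[key].append(str(offset))
        offset += len(line.encode("utf-8"))  # advance to the start of the next row
    # Assumption: index rows are kept sorted by key.
    return [(key, bucketname, offs) for key, offs in sorted(offsets_by_key.items())]

# Hypothetical three-row extract; the real lineitem table has many more columns.
rows = ["1992-01-16|a\n", "1992-01-08|b\n", "1992-01-16|c\n"]
index = build_index(rows, "lineitem.tbl")
for row in index:
    print(row)
```

Each distinct key appears once, and the length of its offsets list is the number of base-table rows carrying that key, which is what the rewrites rely on.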
7.
Hive index in HQL
• SELECT (mapping, projection, association: given a key, fetch the value)
• WHERE (filters on keys)
• GROUP BY (grouping on keys)
• JOIN (join key as index key)
Indexes have high potential for accelerating a wide range of queries.
8.
Hive Index
• Index as Reference
• Index as Data
This demonstration uses the Index as Data technique to show an order-of-magnitude performance gain!
• Uses a query rewrite technique to transform queries on the base table into queries on the index table.
• Limited applicability currently (e.g. the demo is based on GROUP BY), but the technique itself has wide potential.
• Also a very quick way to demonstrate the importance of indexes for performance (no deep optimizer/execution engine modifications).
9.
Indexes and Query Rewrites
Demo targeting:
• GROUP BY, aggregation
• Index as Data
• Group By key = index key
• Query rewritten to use indexes, but still a valid query (nothing special in it!)
10.
Query Rewrites: simple gb
Original query:
SELECT DISTINCT l_shipdate
FROM lineitem;
Rewritten query:
SELECT l_shipdate
FROM __lineitem_shipdate_idx__;
11.
Query Rewrites: simple agg
Original query:
SELECT l_shipdate, COUNT(1)
FROM lineitem
GROUP BY l_shipdate;
Rewritten query:
SELECT l_shipdate, size(`_offsets`)
FROM __lineitem_shipdate_idx__;
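The two simple rewrites work because the index holds exactly one row per distinct key, and that row's offsets list has one entry per matching base-table row. A small Python sketch, using hypothetical toy data in place of the real tables, checks both equivalences:

```python
from collections import Counter

# Hypothetical base-table column: one l_shipdate value per row.
lineitem = ["1992-01-16", "1992-01-08", "1992-01-16", "1992-01-16"]

# Hypothetical index rows (l_shipdate, _offsets): one row per distinct key,
# one offset per base-table row holding that key.
index = [("1992-01-08", ["11"]), ("1992-01-16", ["0", "22", "33"])]

# SELECT DISTINCT l_shipdate FROM lineitem
distinct_base = sorted(set(lineitem))
# SELECT l_shipdate FROM __lineitem_shipdate_idx__
distinct_idx = [key for key, _ in index]

# SELECT l_shipdate, COUNT(1) FROM lineitem GROUP BY l_shipdate
counts_base = dict(Counter(lineitem))
# SELECT l_shipdate, size(`_offsets`) FROM __lineitem_shipdate_idx__
counts_idx = {key: len(offsets) for key, offsets in index}

print(distinct_base == distinct_idx, counts_base == counts_idx)
```

The index-side expressions touch two rows instead of four; on real data the gap is the ratio of total rows to distinct keys.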
12.
Query Rewrites: gb + where
Original query:
SELECT l_shipdate, COUNT(1)
FROM lineitem
WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996
GROUP BY l_shipdate;
Rewritten query:
SELECT l_shipdate, size(`_offsets`)
FROM __lineitem_shipdate_idx__
WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996;
13.
Query Rewrites: gb on func(key)
Original query:
SELECT YEAR(l_shipdate) AS Year, COUNT(1) AS Total
FROM lineitem
GROUP BY YEAR(l_shipdate);
Rewritten query:
SELECT Year, SUM(cnt) AS Total
FROM (SELECT YEAR(l_shipdate) AS Year, size(`_offsets`) AS cnt
      FROM __lineitem_shipdate_idx__) AS t
GROUP BY Year;
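When the grouping is on a function of the key, several index rows can map to the same group, so the per-key sizes have to be summed. The following Python sketch mirrors the rewritten query on hypothetical index rows; the helper name `year` and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical index rows (l_shipdate, _offsets).
index = [
    ("1992-01-08", ["11"]),
    ("1992-01-16", ["0", "22", "33"]),
    ("1993-03-02", ["44", "55"]),
]

def year(shipdate):
    """Stand-in for YEAR() on a 'YYYY-MM-DD' string."""
    return int(shipdate[:4])

# Inner query: (YEAR(l_shipdate) AS Year, size(`_offsets`) AS cnt) per index row.
# Outer query: SUM(cnt) GROUP BY Year.
totals = defaultdict(int)
for shipdate, offsets in index:
    totals[year(shipdate)] += len(offsets)

print(dict(totals))
```

Both 1992 keys collapse into one group, which is why the outer query needs SUM rather than using size() directly.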
14.
Histogram Query
Original query:
SELECT YEAR(l_shipdate) AS Year,
       MONTH(l_shipdate) AS Month,
       COUNT(1) AS Monthly_shipments
FROM lineitem
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);
Rewritten query:
SELECT YEAR(l_shipdate) AS Year,
       MONTH(l_shipdate) AS Month,
       SUM(sz) AS Monthly_shipments
FROM (SELECT l_shipdate, SIZE(`_offsets`) AS sz
      FROM __lineitem_shipdate_idx__) AS t
GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);
15.
Year on Year Query
Original query:
SELECT y1.Month AS Month,
       y1.shipments AS Y1_shipments,
       y2.shipments AS Y2_shipments,
       (y2.shipments - y1.shipments) / y1.shipments AS Delta
FROM (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             COUNT(1) AS Shipments
      FROM lineitem
      WHERE YEAR(l_shipdate) = 1997
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
JOIN (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             COUNT(1) AS Shipments
      FROM lineitem
      WHERE YEAR(l_shipdate) = 1998
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
ON y1.Month = y2.Month;
16.
Year on Year Query
Rewritten query:
SELECT y1.Month AS Month,
       y1.shipments AS y1_shipments,
       y2.shipments AS y2_shipments,
       (y2.shipments - y1.shipments) / y1.shipments AS delta
FROM (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             SUM(sz) AS shipments
      FROM (SELECT l_shipdate, size(`_offsets`) AS sz
            FROM __lineitem_shipdate_idx__) AS t1
      WHERE YEAR(l_shipdate) = 1997
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
JOIN (SELECT YEAR(l_shipdate) AS Year,
             MONTH(l_shipdate) AS Month,
             SUM(sz) AS shipments
      FROM (SELECT l_shipdate, size(`_offsets`) AS sz
            FROM __lineitem_shipdate_idx__) AS t
      WHERE YEAR(l_shipdate) = 1998
      GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
ON y1.Month = y2.Month;
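The year-on-year query is two monthly aggregations over the index joined on month. A rough Python equivalent, run against hypothetical index rows, shows the three steps (filter by year, sum sizes per month, join and compute the delta). The function name `monthly_shipments` and the data are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical index rows (l_shipdate, _offsets) covering two years.
index = [
    ("1997-01-05", ["0", "10"]),
    ("1997-01-20", ["20", "30"]),
    ("1997-02-11", ["40"]),
    ("1998-01-07", ["50", "60", "70"]),
    ("1998-02-14", ["80", "90"]),
]

def monthly_shipments(index_rows, year):
    """SUM(size(_offsets)) grouped by month, restricted to one year."""
    totals = defaultdict(int)
    for shipdate, offsets in index_rows:
        row_year, row_month = int(shipdate[:4]), int(shipdate[5:7])
        if row_year == year:
            totals[row_month] += len(offsets)
    return dict(totals)

y1 = monthly_shipments(index, 1997)
y2 = monthly_shipments(index, 1998)

# JOIN ... ON y1.Month = y2.Month, then compute the relative change.
report = {
    month: (y1[month], y2[month], (y2[month] - y1[month]) / y1[month])
    for month in sorted(y1.keys() & y2.keys())
}
for month, row in report.items():
    print(month, row)
```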
17.
Performance tests
Hardware and software configuration:
• 2 server-class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM)
• 2-node Hadoop cluster (0.20.2), untuned and unoptimized; data neither partitioned nor clustered; Hive tables stored in row-store format; HDFS replication factor: 2
• Hive development branch (~0.5)
• Sun JDK 1.6 (server-mode JVM, JVM_HEAP_SIZE: 4GB RAM)
• Queries on TPC-H data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30GB data: 21GB lineitem, ~180 million tuples)
18.
Perf gain for Histogram Query
(Graphs not to scale)
(sec)       1M        1G        10G       30G
q1_noidx    24.161    76.79     506.005   1551.555
q1_idx      21.268    27.292    35.502    86.133
19.
Perf gain for Year on Year Query
(Graphs not to scale)
(sec)       1M        1G        10G       30G
q1_noidx    73.66     130.587   764.619   2146.423
q1_idx      69.393    75.493    92.867    190.619
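The speedup implied by the two timing tables is easier to read as a ratio. The short Python sketch below divides the non-indexed time by the indexed time for each data size; the timings are copied from the slides, while the dictionary layout and the `speedups` helper are just one convenient arrangement.

```python
# Timings in seconds, as reported on the two performance slides.
sizes = ["1M", "1G", "10G", "30G"]
timings = {
    "histogram": {
        "noidx": [24.161, 76.79, 506.005, 1551.555],
        "idx": [21.268, 27.292, 35.502, 86.133],
    },
    "year_on_year": {
        "noidx": [73.66, 130.587, 764.619, 2146.423],
        "idx": [69.393, 75.493, 92.867, 190.619],
    },
}

def speedups(noidx, idx):
    """Ratio of non-indexed to indexed run time for each data size."""
    return [n / i for n, i in zip(noidx, idx)]

for name, t in timings.items():
    for size, ratio in zip(sizes, speedups(t["noidx"], t["idx"])):
        print(f"{name:>12} {size:>4}: {ratio:5.1f}x")
```

At the smallest size the two plans take nearly the same time, while at 30G the indexed plan is roughly 18 times faster for the histogram query and roughly 11 times faster for the year-on-year query.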
20.
Why does the index perform better?
Reducing data increases I/O efficiency
• If you need only X, separate X from the rest
• Less data to process, better memory footprint, better locality of reference…
Exploiting storage layout optimization
• "Right tool for the job", e.g. two ways to do GROUP BY: sort + agg or hash & agg
• Sort step already done in index!
Parallelization
• Process the index data in the same manner as the base table; distribute the processing across nodes
• Scalable!
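The two GROUP BY strategies named above can be sketched in a few lines of Python. This is a toy illustration of the general techniques, not Hive's operators: hash aggregation accepts input in any order but keeps every distinct key in a table, while sort-based aggregation streams one group at a time and only works if the input is already sorted, which is the property a key-ordered index would provide.

```python
from collections import Counter
from itertools import groupby

def hash_aggregate(keys):
    """Hash & agg: any input order, one table entry per distinct key."""
    return dict(Counter(keys))

def sort_aggregate(sorted_keys):
    """Sort + agg: input must already be sorted; each run of equal keys is one group."""
    return {key: sum(1 for _ in group) for key, group in groupby(sorted_keys)}

unsorted_keys = ["1992-01-16", "1992-01-08", "1992-01-16"]

# On a base table the sort has to be paid for first...
from_sort = sort_aggregate(sorted(unsorted_keys))
# ...whereas hashing works directly on the unsorted input.
from_hash = hash_aggregate(unsorted_keys)

print(from_sort == from_hash)
```

If the keys arrive pre-sorted, `sort_aggregate` can be applied without the `sorted()` call, which is the saving the slide refers to.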
21.
Near-by future
• More rewrites
• Partitioning index data per key
• Run-time operators for index usage (lookup, join, filter etc., since rewrites are only a partial solution)
• Optimizer support for index operators
• Cost-based optimizer to choose between index and non-index plans
• …
22.
Index Design
[Architecture diagram. Components shown: Hive DDL Compiler, Index Builder, Hive Query Compiler, Query Rewrite Engine, Hive DDL Engine, Hive Query Engine, Hadoop MR, HDFS.]
23.
Hive Compiler
[Pipeline diagram. Components shown: Parser / AST Generator, Semantic Analyzer, Optimizer / Operator Plan Generator, Query Rewrite Engine, Execution Plan Generator, output to Hadoop MR.]
24.
Query Rewrite Engine
[Diagram. A Query Tree enters the Rule Engine, which produces a Rewritten Query Tree. The Rule Engine draws on a Rewrite Rules Repository; each Rewrite Rule consists of a Rewrite Trigger Condition and a Rewrite Action.]
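The rule structure in the diagram (a repository of rules, each pairing a trigger condition with an action) can be sketched as follows. This is a hypothetical Python illustration, not the code of the Hive patch: the `Query` record is a flat stand-in for a real query tree, and the names `RewriteRule`, `INDEXES`, `RULES`, and `rewrite` are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Query:
    """Flat stand-in for a query tree: SELECT [DISTINCT] <select> FROM <table>."""
    table: str
    select: tuple
    distinct: bool = False

@dataclass(frozen=True)
class RewriteRule:
    name: str
    trigger: Callable[[Query], bool]   # rewrite trigger condition
    action: Callable[[Query], Query]   # rewrite action

# Hypothetical catalog: (base table, key column) -> index table.
INDEXES = {("lineitem", "l_shipdate"): "__lineitem_shipdate_idx__"}

def distinct_on_index_key(q):
    """Fires for SELECT DISTINCT on a single column that has an index."""
    return q.distinct and len(q.select) == 1 and (q.table, q.select[0]) in INDEXES

def use_index_table(q):
    """Reads the keys from the index table; DISTINCT is no longer needed."""
    return Query(table=INDEXES[(q.table, q.select[0])], select=q.select, distinct=False)

# Rewrite rules repository.
RULES = [RewriteRule("distinct_to_index", distinct_on_index_key, use_index_table)]

def rewrite(q, rules=RULES):
    """Apply the first rule whose trigger fires; otherwise return the query unchanged."""
    for rule in rules:
        if rule.trigger(q):
            return rule.action(q)
    return q

original = Query("lineitem", ("l_shipdate",), distinct=True)
print(rewrite(original))
```

Keeping the trigger and the action as separate callables means a new rewrite can be added by appending to the repository, without touching the engine loop.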
25.
Learning Hive
• The Hive compiler is not driven by syntax-directed translation.
  • It is tree-visitor based, with data structures separated from compiler logic.
  • The tree is immutable (harder to change, harder to rewrite).
  • Query semantic information is maintained separately from the query lexical/parse tree, in different data structures. These are loosely bound in a Query Block data structure, which is itself loosely bound to the parse tree, yet there is no larger data-flow graph off which everything hangs. This makes it very difficult to rewrite queries.
• The optimizer is not yet mature.
  • It doesn't handle many 'obvious' opportunities (e.g. sort group by for cases other than base table scans).
  • The optimizer is rule-based, not cost-based; no stats are collected.
  • Query tuning is a harder job (it requires special knowledge of the optimizer's guts: what works and what doesn't).
• Setting up the development environment is tedious (the build system relies heavily on an internet connection, which is troublesome behind restrictive firewalls).
• Folks in the community are very active; dependent JIRAs are a fast-moving target, and development-wise we need to keep up with them actively (e.g. if branching, refresh from trunk frequently).
26.
How to get it?
• Needs a working Hadoop cluster (tested with 0.20.2)
• For Hive with indexing support:
  • The Hive Index DDL patch (JIRA 417) is now part of Hive trunk: https://issues.apache.org/jira/browse/HIVE-417
  • Get the Hive branch with the Index Query Rewrite patch applied from GitHub (a fork/branch of the Hive development tree: a snapshot of the Hive + Index DDL source tree, not the latest, but a single place to get everything): http://github.com/prafullat/hive
Refer to the Hive documentation for building: http://wiki.apache.org/hadoop/Hive/GettingStarted#Downloading_and_building
See the ql/src/test/queries/client/positive/ql_rewrite_gbtoidx.q test.
27.
Thank You!
prafulla_tekawade at persistent dot co dot in
nikhil_deshpande at persistent dot co dot in