Indexed Hive
A quick demonstration of Hive performance acceleration
using indexes
                                    By:
                                          Prafulla Tekawade
                                          Nikhil Deshpande




                                                     www.persistentsys.com
Summary

  • This presentation describes a performance experiment
  that uses indexes in Hive to accelerate query execution.
  • The slides include information on
     • Indexes
     • A specific set of Group By queries
     • Rewrite technique
     • Performance experiment and results




© 2010 Persistent Systems Ltd          www.persistentsys.com   2
Hive usage

  • HDFS spreads and scatters the data to different
    locations (data nodes).
  • Data is dumped & loaded into HDFS ‘as is’.
  • Only one view of the data: the original data structure &
    layout
  • Typically data is append-only
  • Processing times dominated by full data scan times
           Can the data access times be better?



Hive usage

  What can be done to speed-up queries?
  Cut down the data I/O. Less data means faster
    processing.

  Different ways to get performance
  • Columnar storage
  • Data partitioning
  • Indexing (different view of same data)
  • …

Hive Indexing

  •      Provides a key-based view of the data
  •      Key data is duplicated
  •      Storage layout favors search & lookup performance
  •      Provides better data access for certain operations
  •      A cheaper alternative to full data scans!
         How cheap?
         An order of magnitude better in certain cases!




What does the index look like?

An index is a table with 3 columns
         hive> describe
            default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__;
         OK
         l_shipdate       string
         _bucketname      string
         _offsets         array<string>

l_shipdate is the key; _bucketname and _offsets are references to the values.

Data in the index looks like
hive> select * from default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__ limit 2;
OK
1992-01-08                      hdfs://hadoop1:54310/user/…/lineitem.tbl   ["662368"]
1992-01-16                      hdfs://hadoop1:54310/user/…/lineitem.tbl   ["143623","390763","637910"]
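
To make the layout above concrete, here is a minimal Python sketch of how index rows of this shape could be derived from base-table rows. It is not Hive's index builder: the helper `build_index`, the short file name `lineitem.tbl`, and the in-memory grouping are all assumptions made for illustration. Only the three-column shape and the sample offsets come from the slide.

```python
from collections import defaultdict


def build_index(rows):
    """Group (key, bucketname, offset) tuples into index rows.

    Returns a list of (key, bucketname, [offsets as strings]) sorted by key,
    mirroring the l_shipdate / _bucketname / _offsets columns.
    """
    grouped = defaultdict(list)
    for key, bucketname, offset in rows:
        grouped[(key, bucketname)].append(offset)
    return [
        (key, bucketname, [str(o) for o in sorted(offsets)])
        for (key, bucketname), offsets in sorted(grouped.items())
    ]


# Hypothetical base-table rows: (l_shipdate, file name, byte offset of the row).
base_rows = [
    ("1992-01-16", "lineitem.tbl", 390763),
    ("1992-01-08", "lineitem.tbl", 662368),
    ("1992-01-16", "lineitem.tbl", 143623),
    ("1992-01-16", "lineitem.tbl", 637910),
]

index_rows = build_index(base_rows)
for row in index_rows:
    print(row)
```

Each distinct key ends up in one row per file, with the offsets of every base row that carries that key.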




Hive index in HQL

  • SELECT (mapping, projection, association, given key,
    fetch value)
  • WHERE (filters on keys)
  • GROUP BY (grouping on keys)
  • JOIN (join key as index key)

  Indexes have high potential for accelerating a wide range
    of queries.



Hive Index
• Index as Reference
• Index as Data

This demonstration uses the Index as Data technique to show an
  order-of-magnitude performance gain!
• Uses a Query Rewrite technique to transform queries on the base
  table into queries on the index table.
• Applicability is currently limited (e.g. the demo is based on
  GROUP BY), but the technique itself has wide potential.
• Also a very quick way to demonstrate the importance of indexes for
  performance (no deep optimizer/execution engine
  modifications).

Indexes and Query Rewrites

  Demo targeting:
  • GROUP BY, aggregation
  • Index as Data
           • Group By Key = Index Key
  • Query rewritten to use indexes, but still a valid query
    (nothing special in it!)




Query Rewrites: simple gb


SELECT DISTINCT l_shipdate
  FROM lineitem;

is rewritten to:

SELECT l_shipdate
  FROM __lineitem_shipdate_idx__;




Query Rewrites: simple agg


SELECT l_shipdate, COUNT(1)
  FROM lineitem
 GROUP BY l_shipdate;

is rewritten to:

SELECT l_shipdate, size(`_offsets`)
  FROM __lineitem_shipdate_idx__;
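
The rewrite works because each index row already lists the offsets of every base row with that key, so the length of `_offsets` equals `COUNT(1)` for the key. The Python sketch below checks that equivalence on a tiny, made-up data set. The sample values and function names are hypothetical. It also sums per key, on the assumption that a key spread over several files could have one index row per file.

```python
from collections import Counter

# Hypothetical base-table column values (one entry per lineitem row).
lineitem_shipdates = ["1992-01-16", "1992-01-08", "1992-01-16", "1992-01-16"]

# Hypothetical index rows: (l_shipdate, _bucketname, _offsets).
shipdate_idx = [
    ("1992-01-08", "lineitem.tbl", ["662368"]),
    ("1992-01-16", "lineitem.tbl", ["143623", "390763", "637910"]),
]


def count_by_scan(shipdates):
    """Equivalent of: SELECT l_shipdate, COUNT(1) ... GROUP BY l_shipdate."""
    return dict(Counter(shipdates))


def count_by_index(index_rows):
    """Equivalent of size(_offsets) per key, summed in case a key has
    more than one index row (assumed: one row per file)."""
    counts = Counter()
    for key, _bucketname, offsets in index_rows:
        counts[key] += len(offsets)
    return dict(counts)


scan_result = count_by_scan(lineitem_shipdates)
index_result = count_by_index(shipdate_idx)
print(scan_result == index_result)
```

The index-based version touches one row per key instead of one row per shipment, which is where the I/O saving comes from.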




Query Rewrites: gb + where

SELECT l_shipdate, COUNT(1)
  FROM lineitem
 WHERE YEAR(l_shipdate) >= 1992
       AND YEAR(l_shipdate) <= 1996
 GROUP BY l_shipdate;

is rewritten to:

SELECT l_shipdate, size(`_offsets`)
  FROM __lineitem_shipdate_idx__
 WHERE YEAR(l_shipdate) >= 1992
       AND YEAR(l_shipdate) <= 1996;

Query Rewrites: gb on func(key)

SELECT YEAR(l_shipdate) AS Year,
       COUNT(1)         AS Total
  FROM lineitem
  GROUP BY YEAR(l_shipdate);

is rewritten to:

SELECT Year, SUM(cnt) AS Total
  FROM (SELECT YEAR(l_shipdate) AS Year,
               size(`_offsets`) AS cnt
          FROM __lineitem_shipdate_idx__) AS t
  GROUP BY Year;
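
When the grouping is on a function of the key, the index gives per-key partial counts that still need to be rolled up. A minimal Python sketch of that two-step aggregation follows. The sample rows and the helper name are hypothetical, and `YEAR()` is approximated by slicing a `YYYY-MM-DD` string.

```python
from collections import defaultdict

# Hypothetical index rows: (l_shipdate, _bucketname, _offsets).
shipdate_idx = [
    ("1992-01-08", "lineitem.tbl", ["662368"]),
    ("1992-01-16", "lineitem.tbl", ["143623", "390763", "637910"]),
    ("1993-05-02", "lineitem.tbl", ["10", "20"]),
]


def shipments_per_year(index_rows):
    """Inner query: (YEAR(key), size(_offsets)); outer query: SUM per year."""
    totals = defaultdict(int)
    for shipdate, _bucketname, offsets in index_rows:
        year = int(shipdate[:4])      # stands in for YEAR(l_shipdate)
        totals[year] += len(offsets)  # partial count for this key
    return dict(totals)


per_year = shipments_per_year(shipdate_idx)
print(per_year)
```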

Histogram Query

SELECT YEAR(l_shipdate)  AS Year,
       MONTH(l_shipdate) AS Month,
       COUNT(1)          AS Monthly_shipments
  FROM lineitem
 GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);

is rewritten to:

SELECT YEAR(l_shipdate)  AS Year,
       MONTH(l_shipdate) AS Month,
       SUM(sz)           AS Monthly_shipments
  FROM (SELECT l_shipdate, SIZE(`_offsets`) AS sz
          FROM __lineitem_shipdate_idx__) AS t
 GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);

Year on Year Query

SELECT y1.Month AS Month,
       y1.Shipments AS Y1_shipments,
       y2.Shipments AS Y2_shipments,
       (y2.Shipments - y1.Shipments) / y1.Shipments AS Delta
  FROM (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
               COUNT(1) AS Shipments
          FROM lineitem
         WHERE YEAR(l_shipdate) = 1997
         GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
  JOIN (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
               COUNT(1) AS Shipments
          FROM lineitem
         WHERE YEAR(l_shipdate) = 1998
         GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
    ON y1.Month = y2.Month;




Year on Year Query

Rewritten to use the index:

SELECT y1.Month AS Month,
       y1.shipments AS y1_shipments,
       y2.shipments AS y2_shipments,
       (y2.shipments - y1.shipments) / y1.shipments AS delta
  FROM (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
               SUM(sz) AS shipments
          FROM (SELECT l_shipdate, size(`_offsets`) AS sz
                  FROM __lineitem_shipdate_idx__) AS t1
         WHERE YEAR(l_shipdate) = 1997
         GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
  JOIN (SELECT YEAR(l_shipdate) AS Year, MONTH(l_shipdate) AS Month,
               SUM(sz) AS shipments
          FROM (SELECT l_shipdate, size(`_offsets`) AS sz
                  FROM __lineitem_shipdate_idx__) AS t
         WHERE YEAR(l_shipdate) = 1998
         GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
    ON y1.Month = y2.Month;
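
The year-on-year query can be traced end to end on a toy index. The Python sketch below computes monthly totals for two years from index rows, inner-joins them on month, and derives the relative change. The sample rows and helper names are hypothetical, the `_bucketname` column is left out for brevity, and dates are assumed to be `YYYY-MM-DD` strings.

```python
from collections import defaultdict

# Hypothetical index rows: (l_shipdate, _offsets).
shipdate_idx = [
    ("1997-01-05", ["1", "2"]),
    ("1997-01-20", ["3", "4"]),
    ("1997-02-11", ["5", "6"]),
    ("1998-01-09", ["7", "8", "9", "10", "11"]),
    ("1998-02-03", ["12"]),
    ("1998-03-01", ["13"]),  # no March 1997 data, so the join drops it
]


def monthly_shipments(index_rows, year):
    """SUM(size(_offsets)) grouped by month, filtered to one year."""
    totals = defaultdict(int)
    for shipdate, offsets in index_rows:
        row_year, month = int(shipdate[:4]), int(shipdate[5:7])
        if row_year == year:
            totals[month] += len(offsets)
    return dict(totals)


def year_on_year(index_rows, year1, year2):
    """Inner join of the two yearly results on month, plus the delta."""
    y1 = monthly_shipments(index_rows, year1)
    y2 = monthly_shipments(index_rows, year2)
    return {
        month: (y1[month], y2[month], (y2[month] - y1[month]) / y1[month])
        for month in sorted(y1.keys() & y2.keys())
    }


comparison = year_on_year(shipdate_idx, 1997, 1998)
for month, (s1, s2, delta) in comparison.items():
    print(month, s1, s2, delta)
```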




Performance tests


Hardware and software configuration:
• 2 server-class machines (each box: CentOS 5.x Linux, 5 SAS disks in
       RAID5, 16GB RAM)
• 2-node Hadoop cluster (0.20.2), un-tuned and un-optimized;
  data neither partitioned nor clustered; Hive tables stored in
  row-store format; HDFS replication factor: 2
• Hive development branch (~0.5)
• Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE: 4GB RAM)
• Queries on TPC-H data (lineitem table: 70% of TPC-H data size, e.g.
       TPC-H 30GB data: 21GB lineitem, ~180 million tuples)



Perf gain for Histogram Query

(Chart omitted; graphs not to scale. Timings in seconds by data size.)

  (sec)        1M        1G        10G        30G
  q1_noidx   24.161    76.79    506.005   1551.555
  q1_idx     21.268    27.292    35.502     86.133

Perf gain for Year on Year Query

(Chart omitted; graphs not to scale. Timings in seconds by data size.)

  (sec)        1M        1G        10G        30G
  q1_noidx   73.66    130.587   764.619   2146.423
  q1_idx     69.393    75.493    92.867    190.619
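
The speed-up factors implied by the two timing tables can be worked out directly. The short Python sketch below divides the no-index time by the index time for each data size. The numbers are copied from the slides; the dictionary layout and names are just one convenient way to hold them.

```python
SIZES = ["1M", "1G", "10G", "30G"]

# Timings in seconds, copied from the two result tables.
timings = {
    "histogram": {
        "noidx": [24.161, 76.79, 506.005, 1551.555],
        "idx": [21.268, 27.292, 35.502, 86.133],
    },
    "year_on_year": {
        "noidx": [73.66, 130.587, 764.619, 2146.423],
        "idx": [69.393, 75.493, 92.867, 190.619],
    },
}


def speedups(noidx, idx):
    """Ratio of no-index time to index time for each data size."""
    return [n / i for n, i in zip(noidx, idx)]


results = {q: speedups(t["noidx"], t["idx"]) for q, t in timings.items()}
for query, ratios in results.items():
    print(query, ", ".join(f"{s}: {r:.1f}x" for s, r in zip(SIZES, ratios)))
```

At 30G this works out to roughly 18x for the histogram query and roughly 11x for the year-on-year query, while at 1M the two variants are nearly level.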

Why does the index perform better?

  Reducing data increases I/O efficiency
  • If you need only X, separate X from the rest
  • Less data to process, better memory footprint, better locality of
    reference…

  Exploiting storage layout optimization
  • “Right tool for the job”, e.g. two ways to do GROUP BY:
    sort + agg or hash & agg
  • Sort step already done in index!

  Parallelization
  • Process the index data in the same manner as the base table,
    distribute the processing across nodes
  • Scalable!
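
The two GROUP BY strategies named above can be contrasted in a few lines of Python. This is only an illustration of the general techniques, not Hive's operators. The function names and sample keys are hypothetical.

```python
from collections import Counter
from itertools import groupby


def hash_agg(keys):
    """Hash-based GROUP BY: accepts unsorted input and keeps a hash table
    with one counter per group."""
    return dict(Counter(keys))


def sort_agg(sorted_keys):
    """Sort-based GROUP BY: input must already be sorted by key, so each
    group is a contiguous run that is counted in a single streaming pass."""
    return {key: sum(1 for _ in run) for key, run in groupby(sorted_keys)}


unsorted_keys = ["1992-01-16", "1992-01-08", "1992-01-16", "1992-01-16"]

by_hash = hash_agg(unsorted_keys)
by_sort = sort_agg(sorted(unsorted_keys))  # the sort is the expensive step
print(by_hash == by_sort)
```

If the input arrives already ordered by key, as the slide says is the case for the index, the `sorted(...)` call is unnecessary and only the cheap streaming pass remains.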




Near future

  More rewrites
  Partitioning index data per key.
  Run-time operators for index usage (lookup, join, filter
    etc., since rewrites are only a partial solution).
  Optimizer support for index operators.
  Cost-based optimizer to choose between index and non-index
    plans.
  …



Index Design

(Architecture diagram. Components shown, top to bottom: Hive DDL Compiler,
Index Builder, Hive Query Compiler and Query Rewrite Engine; Hive DDL Engine
and Hive Query Engine; Hadoop MR; HDFS.)

Hive Compiler

(Pipeline diagram. Stages shown: Parser / AST Generator, Semantic Analyzer,
Query Rewrite Engine, Optimizer / Operator Plan Generator, Execution Plan
Generator, then on to Hadoop MR.)


Query Rewrite Engine

(Diagram. A Query Tree enters the Rule Engine and a Rewritten Query Tree
comes out. The Rule Engine draws on a Rewrite Rules Repository; each
Rewrite Rule pairs a Rewrite Trigger Condition with a Rewrite Action.)
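
The idea of rules made of a trigger condition plus an action can be sketched in Python as follows. This is a toy model under stated assumptions, not the engine's real data structures: the query tree is reduced to a plain dictionary, and `RewriteRule`, `apply_rules`, `INDEXES` and the single rule shown are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RewriteRule:
    name: str
    condition: Callable[[dict], bool]  # rewrite trigger condition
    action: Callable[[dict], dict]     # rewrite action, returns a new tree


def apply_rules(query, rules):
    """Rule engine: run each rule whose trigger condition matches."""
    for rule in rules:
        if rule.condition(query):
            query = rule.action(query)
    return query


# Hypothetical catalog: (base table, key column) -> index table.
INDEXES = {("lineitem", "l_shipdate"): "__lineitem_shipdate_idx__"}


def is_count_group_by_on_index_key(q):
    """Matches SELECT key, COUNT(1) FROM t GROUP BY key when key is indexed."""
    return (
        q.get("select") == [q.get("group_by"), "COUNT(1)"]
        and (q.get("from"), q.get("group_by")) in INDEXES
    )


def use_index_for_count(q):
    """Replace the base-table scan and GROUP BY with a scan of the index."""
    key = q["group_by"]
    return {
        "select": [key, "size(`_offsets`)"],
        "from": INDEXES[(q["from"], key)],
    }


rules = [RewriteRule("gb-to-idx", is_count_group_by_on_index_key, use_index_for_count)]

original = {"select": ["l_shipdate", "COUNT(1)"], "from": "lineitem", "group_by": "l_shipdate"}
rewritten = apply_rules(original, rules)
print(rewritten)

# A query with no matching rule passes through unchanged.
other = {"select": ["l_orderkey"], "from": "lineitem"}
untouched = apply_rules(other, rules)
```

Keeping the condition and the action as separate callables makes it possible to add new rewrites to the repository without touching the engine loop.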




Learning Hive

  • Hive compiler is not ‘Syntax Directed Translation’ driven
          • Tree-visitor based, with separation of data structures and compiler logic
          • Tree is immutable (harder to change, harder to rewrite)
          • Query semantic information is maintained separately from the query lexical/parse tree, in
            different data structures. These are loosely bound in a Query Block data structure, which is
            itself loosely bound to the parse tree. There is no larger data flow graph off which
            everything is hung, and this makes it very difficult to rewrite queries.
  • Optimizer is not yet mature
          • Doesn’t handle many ‘obvious’ opportunities (e.g. sort-based GROUP BY for cases other than
            base table scans)
          • Optimizer is rule-based, not cost-based; no stats are collected
          • Query tuning is a harder job (requires special knowledge of the optimizer guts: what works and
            what doesn’t)
  • Setting up a development environment is tedious (the build system relies heavily on an internet
    connection, which is troublesome behind restrictive firewalls).
  • Folks in the community are very active; dependent JIRAs are a fast-moving target, and
    development-wise we need to keep up with them actively (e.g. if branching, we need to
    refresh from trunk frequently).



How to get it?
• Needs a working Hadoop cluster (tested with 0.20.2)
• For Hive with indexing support:
   • The Hive Index DDL patch (JIRA 417) is now part of Hive trunk
      https://issues.apache.org/jira/browse/HIVE-417
   • Get the Hive branch with the Index Query Rewrite patch applied from
      GitHub (a fork/branch of the Hive development tree: a snapshot of the Hive +
      Index DDL source tree, not the latest, but a single place to get everything)
      http://github.com/prafullat/hive
   • Refer to the Hive documentation for building
      http://wiki.apache.org/hadoop/Hive/GettingStarted#Downloading_and_building
   • See the ql/src/test/queries/client/positive/ql_rewrite_gbtoidx.q test.




Thank You!
                                prafulla_tekawade at persistent dot co dot in
                                nikhil_deshpande at persistent dot co dot in




© 2010 Persistent Systems Ltd                                           www.persistentsys.com   27

Más contenido relacionado

La actualidad más candente

Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingHortonworks
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationEyad Garelnabi
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use CasesDATAVERSITY
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerSachin Aggarwal
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopDataWorks Summit
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Spark SQL Bucketing at Facebook
 Spark SQL Bucketing at Facebook Spark SQL Bucketing at Facebook
Spark SQL Bucketing at FacebookDatabricks
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheAlluxio, Inc.
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleDatabricks
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
ETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk LoadingETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk Loadingalex_araujo
 

La actualidad más candente (20)

Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Hive Data Modeling and Query Optimization
Hive Data Modeling and Query OptimizationHive Data Modeling and Query Optimization
Hive Data Modeling and Query Optimization
 
The PostgreSQL Query Planner
The PostgreSQL Query PlannerThe PostgreSQL Query Planner
The PostgreSQL Query Planner
 
Common MongoDB Use Cases
Common MongoDB Use CasesCommon MongoDB Use Cases
Common MongoDB Use Cases
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Spark tuning
Spark tuningSpark tuning
Spark tuning
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Spark SQL Bucketing at Facebook
 Spark SQL Bucketing at Facebook Spark SQL Bucketing at Facebook
Spark SQL Bucketing at Facebook
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
ETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk LoadingETL With Cassandra Streaming Bulk Loading
ETL With Cassandra Streaming Bulk Loading
 

Destacado

Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
 
Eliminating SAN Congestion Just Got Much Easier- webinar - Nov 2015
Eliminating SAN Congestion Just Got Much Easier-  webinar - Nov 2015 Eliminating SAN Congestion Just Got Much Easier-  webinar - Nov 2015
Eliminating SAN Congestion Just Got Much Easier- webinar - Nov 2015 Tony Antony
 
ORC: 2015 Faster, Better, Smaller
ORC: 2015 Faster, Better, SmallerORC: 2015 Faster, Better, Smaller
ORC: 2015 Faster, Better, SmallerDataWorks Summit
 
Introduction to scala for a c programmer
Introduction to scala for a c programmerIntroduction to scala for a c programmer
Introduction to scala for a c programmerGirish Kumar A L
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Denny Lee
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceDataWorks Summit
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Databricks
 
Scala - A Scalable Language
Scala - A Scalable LanguageScala - A Scalable Language
Scala - A Scalable LanguageMario Gleichmann
 

Destacado (20)

Hive tuning
Hive tuningHive tuning
Hive tuning
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Eliminating SAN Congestion Just Got Much Easier- webinar - Nov 2015
Eliminating SAN Congestion Just Got Much Easier-  webinar - Nov 2015 Eliminating SAN Congestion Just Got Much Easier-  webinar - Nov 2015
Eliminating SAN Congestion Just Got Much Easier- webinar - Nov 2015
 
ORC: 2015 Faster, Better, Smaller
ORC: 2015 Faster, Better, SmallerORC: 2015 Faster, Better, Smaller
ORC: 2015 Faster, Better, Smaller
 
Introduction to scala for a c programmer
Introduction to scala for a c programmerIntroduction to scala for a c programmer
Introduction to scala for a c programmer
 
Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012Zaharia spark-scala-days-2012
Zaharia spark-scala-days-2012
 
Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)Jump Start into Apache Spark (Seattle Spark Meetup)
Jump Start into Apache Spark (Seattle Spark Meetup)
 
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query PerformanceORC File & Vectorization - Improving Hive Data Storage and Query Performance
ORC File & Vectorization - Improving Hive Data Storage and Query Performance
 
Apache hive
Apache hiveApache hive
Apache hive
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
Advanced topics in hive
Advanced topics in hiveAdvanced topics in hive
Advanced topics in hive
 
Introduction to Hive
Introduction to HiveIntroduction to Hive
Introduction to Hive
 
Python to scala
Python to scalaPython to scala
Python to scala
 
Scala - A Scalable Language
Scala - A Scalable LanguageScala - A Scalable Language
Scala - A Scalable Language
 

Indexed Hive

  • 1. Indexed Hive
      A quick demonstration of Hive performance acceleration using indexes
      By: Prafulla Tekawade, Nikhil Deshpande
      www.persistentsys.com
  • 2. Summary
      • This presentation describes a performance experiment that uses indexes in Hive to accelerate query execution.
      • The slides include information on
          • Indexes
          • A specific set of Group By queries
          • Rewrite technique
          • Performance experiment and results
      © 2010 Persistent Systems Ltd www.persistentsys.com
  • 3. Hive usage
      • HDFS spreads and scatters the data across different locations (data nodes).
      • Data is dumped and loaded into HDFS 'as is'.
      • There is only one view of the data: the original data structure and layout.
      • Data is typically append-only.
      • Processing times are dominated by full data scan times.
      Can the data access times be better?
  • 4. Hive usage
      What can be done to speed up queries?
      Cut down the data I/O. Less data means faster processing.
      Different ways to get performance:
      • Columnar storage
      • Data partitioning
      • Indexing (different view of same data)
      • …
  • 5. Hive Indexing
      • Provides a key-based view of the data
      • Key data is duplicated
      • Storage layout favors search & lookup performance
      • Provides better data access for certain operations
      • A cheaper alternative to full data scans!
      How cheap? An order of magnitude better in certain cases!
  • 6. What does the index look like?
      An index is a table with 3 columns:
          hive> describe default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__;
          OK
          l_shipdate     string           (key)
          _bucketname    string           (reference to values)
          _offsets       array<string>    (reference to values)
      Data in the index looks like:
          hive> select * from default__tpch1m_lineitem_tpch1m_lineitem_shipdate_idx__ limit 2;
          OK
          1992-01-08    hdfs://hadoop1:54310/user/…/lineitem.tbl    ["662368"]
          1992-01-16    hdfs://hadoop1:54310/user/…/lineitem.tbl    ["143623","390763","637910"]
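The three-column layout shown on this slide can be mimicked in a few lines of Python. This is only a sketch of the idea, not Hive's index builder: the function name `build_compact_index` and the toy rows are assumptions, and the short file name `lineitem.tbl` stands in for the full HDFS path.

```python
from collections import defaultdict


def build_compact_index(rows):
    """Group (key, bucketname, offset) triples into index rows.

    Returns a list of (key, bucketname, [offsets]) tuples, one per
    distinct (key, bucketname) pair, sorted by key and then file name.
    Hypothetical helper for illustration only.
    """
    grouped = defaultdict(list)
    for key, bucketname, offset in rows:
        # Offsets are kept as strings to mirror the array<string> column.
        grouped[(key, bucketname)].append(str(offset))
    return [(key, bucket, offsets) for (key, bucket), offsets in sorted(grouped.items())]


# Toy base-table rows: (l_shipdate, file holding the row, byte offset of the row).
rows = [
    ("1992-01-16", "lineitem.tbl", 143623),
    ("1992-01-08", "lineitem.tbl", 662368),
    ("1992-01-16", "lineitem.tbl", 390763),
    ("1992-01-16", "lineitem.tbl", 637910),
]

index = build_compact_index(rows)
for row in index:
    print(row)
```

Each index row holds one key plus the places where that key occurs, so the index is much smaller than the base table whenever keys repeat.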
  • 7. Hive index in HQL
      • SELECT (mapping, projection, association, given key, fetch value)
      • WHERE (filters on keys)
      • GROUP BY (grouping on keys)
      • JOIN (join key as index key)
      Indexes have high potential for accelerating a wide range of queries.
  • 8. Hive Index
      • Index as Reference
      • Index as Data
      This demonstration uses the Index as Data technique to show an order-of-magnitude performance gain!
      • Uses a Query Rewrite technique to transform queries on the base table into queries on the index table.
      • Limited applicability currently (e.g. the demo is based on GROUP BY), but the technique itself has wide potential.
      • Also a very quick way to demonstrate the importance of indexes for performance (no deep optimizer/execution engine modifications).
  • 9. Indexes and Query Rewrites
      Demo targeting:
      • GROUP BY, aggregation
      • Index as Data
      • Group By Key = Index Key
      • Query rewritten to use indexes, but still a valid query (nothing special in it!)
  • 10. Query Rewrites: simple gb
      Original:
          SELECT DISTINCT l_shipdate
          FROM lineitem;
      Rewritten:
          SELECT l_shipdate
          FROM __lineitem_shipdate_idx__;
  • 11. Query Rewrites: simple agg
      Original:
          SELECT l_shipdate, COUNT(1)
          FROM lineitem
          GROUP BY l_shipdate;
      Rewritten:
          SELECT l_shipdate, size(`_offsets`)
          FROM __lineitem_shipdate_idx__;
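The rewrite on this slide replaces COUNT(1) with the length of the offsets array. A small Python sketch can show when that is equivalent. It assumes, as the `_bucketname` column suggests, that the index keeps one row per key and file, and that each base row contributes exactly one offset; the rows and file names `part-0` and `part-1` are made up.

```python
from collections import Counter

# Hypothetical index rows: (l_shipdate, _bucketname, _offsets).
index_rows = [
    ("1992-01-08", "part-0", ["662368"]),
    ("1992-01-16", "part-0", ["143623", "390763"]),
    ("1992-01-16", "part-1", ["637910"]),  # same key stored in a second file
]

# Rewrite as on the slide: one output row per index row, size(_offsets) as the count.
per_index_row = [(key, len(offsets)) for key, _bucket, offsets in index_rows]

# Variant that sums the sizes per key, i.e. SUM(size(_offsets)) grouped by key.
per_key = Counter()
for key, _bucket, offsets in index_rows:
    per_key[key] += len(offsets)

print(per_index_row)
print(dict(per_key))
```

With a single data file, as in the demo's `lineitem.tbl`, both forms agree. Under the assumption above, a table spread over several files would need the summed form to get one row per key.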
  • 12. Query Rewrites: gb + where
      Original:
          SELECT l_shipdate, COUNT(1)
          FROM lineitem
          WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996
          GROUP BY l_shipdate;
      Rewritten:
          SELECT l_shipdate, size(`_offsets`)
          FROM __lineitem_shipdate_idx__
          WHERE YEAR(l_shipdate) >= 1992 AND YEAR(l_shipdate) <= 1996;
  • 13. Query Rewrites: gb on func(key)
      Original:
          SELECT YEAR(l_shipdate) AS Year, COUNT(1) AS Total
          FROM lineitem
          GROUP BY YEAR(l_shipdate);
      Rewritten:
          SELECT Year, SUM(cnt) AS Total
          FROM (SELECT YEAR(l_shipdate) AS Year, size(`_offsets`) AS cnt
                FROM __lineitem_shipdate_idx__) AS t
          GROUP BY Year;
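When the grouping expression is a function of the index key, several index rows fall into the same group, so the per-key counts have to be summed. The sketch below checks that equivalence on made-up data; the in-memory `index` dictionary is a stand-in for the index table, not Hive's storage format.

```python
from collections import Counter, defaultdict

# Hypothetical l_shipdate column of a tiny base table.
shipdates = ["1992-01-08", "1992-01-16", "1992-01-16", "1993-03-02", "1993-07-19"]

# Toy index: key -> list of row positions, playing the role of _offsets.
index = defaultdict(list)
for offset, shipdate in enumerate(shipdates):
    index[shipdate].append(str(offset))


def year(shipdate):
    """Equivalent of YEAR(l_shipdate) for 'YYYY-MM-DD' strings."""
    return int(shipdate[:4])


# Original query: COUNT(1) grouped by YEAR(l_shipdate) over every base row.
direct = Counter(year(d) for d in shipdates)

# Rewritten query: the inner SELECT yields (Year, cnt) per index row,
# the outer SELECT sums cnt per Year.
via_index = Counter()
for shipdate, offsets in index.items():
    via_index[year(shipdate)] += len(offsets)

print(dict(direct), dict(via_index))
```

The rewritten form touches one entry per distinct ship date instead of one per base row, which is where the saving comes from.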
  • 14. Histogram Query
      Original:
          SELECT YEAR(l_shipdate) AS Year,
                 MONTH(l_shipdate) AS Month,
                 COUNT(1) AS Monthly_shipments
          FROM lineitem
          GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);
      Rewritten:
          SELECT YEAR(l_shipdate) AS Year,
                 MONTH(l_shipdate) AS Month,
                 SUM(sz) AS Monthly_shipments
          FROM (SELECT l_shipdate, SIZE(`_offsets`) AS sz
                FROM __lineitem_shipdate_idx__) AS t
          GROUP BY YEAR(l_shipdate), MONTH(l_shipdate);
  • 15. Year on Year Query
      Original:
          SELECT y1.Month AS Month,
                 y1.shipments AS Y1_shipments,
                 y2.shipments AS Y2_shipments,
                 (y2.shipments - y1.shipments) / y1.shipments AS Delta
          FROM (SELECT YEAR(l_shipdate) AS Year,
                       MONTH(l_shipdate) AS Month,
                       COUNT(1) AS Shipments
                FROM lineitem
                WHERE YEAR(l_shipdate) = 1997
                GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
          JOIN (SELECT YEAR(l_shipdate) AS Year,
                       MONTH(l_shipdate) AS Month,
                       COUNT(1) AS Shipments
                FROM lineitem
                WHERE YEAR(l_shipdate) = 1998
                GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
          ON y1.Month = y2.Month;
  • 16. Year on Year Query
      Rewritten:
          SELECT y1.Month AS Month,
                 y1.shipments AS y1_shipments,
                 y2.shipments AS y2_shipments,
                 (y2.shipments - y1.shipments) / y1.shipments AS delta
          FROM (SELECT YEAR(l_shipdate) AS Year,
                       MONTH(l_shipdate) AS Month,
                       SUM(sz) AS shipments
                FROM (SELECT l_shipdate, size(`_offsets`) AS sz
                      FROM __lineitem_shipdate_idx__) AS t1
                WHERE YEAR(l_shipdate) = 1997
                GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y1
          JOIN (SELECT YEAR(l_shipdate) AS Year,
                       MONTH(l_shipdate) AS Month,
                       SUM(sz) AS shipments
                FROM (SELECT l_shipdate, size(`_offsets`) AS sz
                      FROM __lineitem_shipdate_idx__) AS t
                WHERE YEAR(l_shipdate) = 1998
                GROUP BY YEAR(l_shipdate), MONTH(l_shipdate)) AS y2
          ON y1.Month = y2.Month;
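The year-on-year query is two monthly histograms, one per year, joined on month, with the relative change computed per month. Below is a minimal Python sketch of the index-based version, assuming a toy in-memory index; the dates, offsets, and the helper name `monthly_shipments` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical index contents: l_shipdate -> _offsets.
index = {
    "1997-01-05": ["10", "20"],
    "1997-01-20": ["30", "40"],
    "1997-02-11": ["50"],
    "1998-01-09": ["60", "70", "80", "90", "100"],
    "1998-02-03": ["110", "120"],
}


def monthly_shipments(index, year):
    """SUM(size(_offsets)) per month for one year, read from the index only."""
    totals = defaultdict(int)
    for shipdate, offsets in index.items():
        row_year, row_month = int(shipdate[:4]), int(shipdate[5:7])
        if row_year == year:
            totals[row_month] += len(offsets)
    return dict(totals)


y1 = monthly_shipments(index, 1997)
y2 = monthly_shipments(index, 1998)

# Inner join on month, then (y2 - y1) / y1 as in the Delta column.
delta = {month: (y2[month] - y1[month]) / y1[month] for month in y1.keys() & y2.keys()}
print(y1, y2, delta)
```

Only months present in both years appear in `delta`, which matches the inner JOIN in the query.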
  • 17. Performance tests
      Hardware and software configuration:
      • 2 server-class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM)
      • 2-node Hadoop cluster (0.20.2), un-tuned and un-optimized, data not partitioned and clustered, Hive tables stored in row-store format, HDFS replication factor: 2
      • Hive development branch (~0.5)
      • Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE: 4GB RAM)
      • Queries on TPC-H data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30GB data: 21GB lineitem, ~180 million tuples)
  • 18. Perf gain for Histogram Query
      Graphs not to scale. Times in seconds:
          Query       1M        1G        10G        30G
          q1_noidx    24.161    76.79     506.005    1551.555
          q1_idx      21.268    27.292    35.502     86.133
  • 19. Perf gain for Year on Year Query
      Graphs not to scale. Times in seconds:
          Query       1M        1G        10G        30G
          q1_noidx    73.66     130.587   764.619    2146.423
          q1_idx      69.393    75.493    92.867     190.619
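The speedup implied by the two timing tables is the non-indexed time divided by the indexed time at each data size. The short script below does that arithmetic using the numbers from the slides; the dictionary layout and the names `TIMINGS` and `speedups` are just for this sketch.

```python
SIZES = ["1M", "1G", "10G", "30G"]

# Timings in seconds, copied from the two result slides.
TIMINGS = {
    "histogram": {
        "noidx": [24.161, 76.79, 506.005, 1551.555],
        "idx": [21.268, 27.292, 35.502, 86.133],
    },
    "year_on_year": {
        "noidx": [73.66, 130.587, 764.619, 2146.423],
        "idx": [69.393, 75.493, 92.867, 190.619],
    },
}


def speedups(timing):
    """Ratio of non-indexed to indexed run time for each data size."""
    return [noidx / idx for noidx, idx in zip(timing["noidx"], timing["idx"])]


for name, timing in TIMINGS.items():
    row = ", ".join(f"{size}: {ratio:.1f}x" for size, ratio in zip(SIZES, speedups(timing)))
    print(f"{name}: {row}")
```

On these figures the gain grows with data size: both queries are close to 1x at 1M, and at 30G the histogram query is about 18x faster and the year-on-year query about 11x faster with the index.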
  • 20. Why does the index perform better?
      Reducing data increases I/O efficiency
      • If you need only X, separate X from the rest
      • Less data to process, better memory footprint, better locality of reference…
      Exploiting storage layout optimization
      • "Right tool for the job", e.g. two ways to do GROUP BY: sort + agg or hash & agg
      • Sort step already done in index!
      Parallelization
      • Process the index data in the same manner as the base table, distribute the processing across nodes
      • Scalable!
  • 21. Near future
      • More rewrites
      • Partitioning index data per key
      • Run-time operators for index usage (lookup, join, filter etc., since rewrites are only a partial solution)
      • Optimizer support for index operators
      • Cost-based optimizer to choose between index and non-index plans
      • …
  • 22. Index Design
      [Architecture diagram. Components: Hive DDL Compiler, Index Builder, Hive Query Compiler, Query Rewrite Engine, Hive DDL Engine, Hive Query Engine, Hadoop MR, HDFS]
  • 23. Hive Compiler
      [Pipeline diagram. Stages: Parser / AST Generator, Semantic Analyzer, Optimizer / Operator Plan Generator (with the Query Rewrite Engine alongside), Execution Plan Generator, then to Hadoop MR]
  • 24. Query Rewrite Engine
      [Diagram. A Rule Engine takes a Query Tree and produces a Rewritten Query Tree, drawing on a Rewrite Rules Repository. Each Rewrite Rule consists of a Rewrite Trigger Condition and a Rewrite Action.]
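The rule engine on this slide pairs a trigger condition with an action for each rewrite rule. One way to picture that structure is the Python sketch below. It is not the prototype's code: the dictionary-based query representation, the `INDEXES` catalog, and all function and class names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical, heavily simplified stand-in for a query tree.
Query = Dict[str, object]

# Hypothetical catalog: (base table, indexed column) -> index table name.
INDEXES = {("lineitem", "l_shipdate"): "__lineitem_shipdate_idx__"}


@dataclass
class RewriteRule:
    name: str
    trigger: Callable[[Query], bool]  # rewrite trigger condition
    action: Callable[[Query], Query]  # rewrite action


def gb_count_trigger(query: Query) -> bool:
    """Fires for COUNT(1) grouped by a single column that has an index."""
    group_by = query.get("group_by") or []
    return (
        query.get("aggregate") == "COUNT(1)"
        and len(group_by) == 1
        and (query["table"], group_by[0]) in INDEXES
    )


def gb_count_action(query: Query) -> Query:
    """Redirects the query to the index table and drops the GROUP BY."""
    key = query["group_by"][0]
    return {
        "table": INDEXES[(query["table"], key)],
        "select": [key, "size(`_offsets`)"],
        "group_by": [],
        "aggregate": None,
    }


def apply_rules(query: Query, rules: List[RewriteRule]) -> Query:
    """Applies each rule whose trigger condition holds, in repository order."""
    for rule in rules:
        if rule.trigger(query):
            query = rule.action(query)
    return query


RULES = [RewriteRule("gb_count_to_index", gb_count_trigger, gb_count_action)]

original = {
    "table": "lineitem",
    "select": ["l_shipdate", "COUNT(1)"],
    "group_by": ["l_shipdate"],
    "aggregate": "COUNT(1)",
}
rewritten = apply_rules(original, RULES)
print(rewritten)
```

Keeping the condition and the action separate means new rewrites can be added to the repository without touching the engine loop, and a query that matches no rule passes through unchanged.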
  • 25. Learning Hive
      • The Hive compiler is not 'Syntax Directed Translation' driven
          • Tree-visitor based, with separation of data structures and compiler logic
          • The tree is immutable (harder to change, harder to rewrite)
          • Query semantic information is maintained separately from the lexical/parse tree, in different data structures that are loosely bound in a Query Block data structure, which is itself loosely bound to the parse tree. There is no larger data-flow graph from which everything hangs. This makes queries very difficult to rewrite.
      • The optimizer is not yet mature
          • Doesn't handle many 'obvious' opportunities (e.g. sort group by for cases other than base table scans)
          • The optimizer is rule-based, not cost-based; no stats are collected
          • Query tuning is a harder job (requires special knowledge of the optimizer's guts, what works and what doesn't)
      • Setting up the development environment is tedious (the build system relies heavily on an internet connection, which is troublesome behind restrictive firewalls).
      • Folks in the community are very active, so dependent JIRAs are a fast-moving target and, development-wise, we need to keep up with them actively (e.g. if branching, refresh from trunk frequently).
  • 26. How to get it?
      • Needs a working Hadoop cluster (tested with 0.20.2)
      • For Hive with indexing support:
          • The Hive Index DDL patch (JIRA 417) is now part of Hive trunk: https://issues.apache.org/jira/browse/HIVE-417
          • Get the Hive branch with the Index Query Rewrite patch applied from GitHub (a fork/branch of the Hive development tree, a snapshot of the Hive + Index DDL source tree; not the latest, but a single place to get everything): http://github.com/prafullat/hive
      Refer to the Hive documentation for building: http://wiki.apache.org/hadoop/Hive/GettingStarted#Downloading_and_building
      See the ql/src/test/queries/client/positive/ql_rewrite_gbtoidx.q test.
  • 27. Thank You!
      prafulla_tekawade at persistent dot co dot in
      nikhil_deshpande at persistent dot co dot in