Hive + Tez: A Performance Deep Dive
Jitendra Pandey
Gopal Vijayaraghavan
© Hortonworks Inc. 2014.
Stinger Project
(announced February 2013)
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative
A broad, community-based effort to
drive the next generation of HIVE
Hive 0.13, April 2014
• Hive on Apache Tez
• Cost Based Optimizer (Optiq)
• Vectorized Processing
Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format
Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN
Speed
Improve Hive query performance by 100X to
allow for interactive query times (seconds)
Scale
The only SQL interface to Hadoop designed
for queries that scale from TB to PB
SQL
Support broadest range of SQL semantics for
analytic applications running against Hadoop
…all IN Hadoop
Goals:
© Hortonworks Inc. 2014.
SPEED: Increasing Hive Performance
Key Highlights
– Tez: New execution engine
– Vectorized Query Processing
– Startup time improvement
– Statistics to accelerate query execution
– Cost Based Optimizer: Optiq
Interactive Query Times across ALL use cases
• Simple and advanced queries in seconds
• Integrates seamlessly with existing tools
• Currently a >100x improvement in just nine months
Elements of Fast SQL Execution
• Query Planner/Cost Based
Optimizer w/ Statistics
• Query Startup
• Query Execution
• I/O Path
© Hortonworks Inc. 2014.
Statistics and Cost-based optimization
• Statistics:
– Hive has table and column level statistics
– Used to determine parallelism, join selection
• Optiq: Open source, Apache licensed query execution framework in Java
– Used by Apache Drill, Cascading, LucidDB
– Based on Volcano paper
– 20 man years dev, more than 50 optimization rules
• Goals for hive
– Ease of Use – no manual tuning for queries, make choices automatically based on cost
– View Chaining/Ad hoc queries involving multiple views
– Help enable BI Tools front-ending Hive
– Emphasis on latency reduction
• Cost computation will be used for
– Join ordering
– Join algorithm selection
– Tez vertex boundary selection
HIVE-5775
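As a minimal sketch (assuming the Hive 0.13-era configuration names and the TPC-DS store_sales table used later in this deck), gathering the statistics the optimizer needs and enabling the CBO looks roughly like this:

set hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
-- table/partition level and column level statistics
ANALYZE TABLE store_sales COMPUTE STATISTICS;            -- add a PARTITION(...) spec for partitioned tables
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;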
© Hortonworks Inc. 2014.
TPC-DS Query 17
select i_item_id
,i_item_desc
,s_state
,count(ss_quantity) as store_sales_quantitycount
,….
from store_sales ss, store_returns sr, catalog_sales cs, date_dim d1, date_dim d2, date_dim d3, store s, item i
where d1.d_quarter_name = '2000Q1' and d1.d_date_sk = ss.ss_sold_date_sk and i.i_item_sk = ss.ss_item_sk
and s.s_store_sk = ss.ss_store_sk and ss.ss_customer_sk = sr.sr_customer_sk and ss.ss_item_sk = sr.sr_item_sk
…
group by i_item_id, i_item_desc, s_state
order by i_item_id, i_item_desc, s_state
limit 100;
• Joins the Store Sales, Store Returns and Catalog Sales fact tables.
• Each of the fact tables is independently restricted by time.
• Analysis is at Item and Store grain, so these dimensions are also joined in.
• As specified, the query starts by joining the 3 fact tables.
© Hortonworks Inc. 2014.
TPC-DS Query 17
[Diagram: the specified join tree, the non-CBO plan, and the CBO plan]
© Hortonworks Inc. 2014.
TPC-DS Query 17
Run times:

            Run 1    Run 2
Non CBO     127.53   100.71
CBO          50.9     44.52

• Fact tables partitioned by Day, bucketed by Item
• Bucketing off
• Bucketing should help CBO plan.
• SR table much smaller. Better chance of Bucket Join in place of Shuffle Join.

Orderings considered by Planner (facts restricted to 3 months):

Join Ordering                                                                                        Cost Estimate
['item', [[[[[['d2', 'store_returns'], 'store_sales'], 'catalog_sales'], 'd1'], 'd3'], 'store']]     3547898.061
…
['store_returns', 'd2']                                                                              19224.71
['store_sales', 'store_returns']                                                                     23057497.991
['d1', 'store_sales']                                                                                26142.943
© Hortonworks Inc. 2014.
Apache Tez (“Speed”)
• Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
– Lower latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
YARN ApplicationMaster to run DAG of Tez Tasks
Task with pluggable Input, Processor and Output
Tez Task - <Input, Processor, Output>
© Hortonworks Inc. 2014.
Hive-on-MR vs. Hive-on-Tez
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x)
ORDER BY avg;
[Diagram: Hive-on-MR executes this as a chain of MapReduce jobs (GROUP a BY a.x, GROUP b BY b.x, JOIN (a,b), ORDER BY), writing each intermediate result to HDFS. Hive-on-Tez runs the same GROUP BY, JOIN and ORDER BY stages as a single DAG. Tez avoids unnecessary writes to HDFS.]
HIVE-4660
© Hortonworks Inc. 2014.
Shuffle Join
SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM inventory inv
JOIN store_sales ss
ON (inv.inv_item_sk = ss.ss_item_sk);
[Diagram: shuffle join plan in Hive-on-MR vs. Hive-on-Tez]
© Hortonworks Inc. 2014.
Broadcast Join
SELECT ss.ss_item_sk, ss.ss_quantity, avg_price, inv.inv_quantity_on_hand
FROM (select avg(ss_sold_price) as avg_price, ss_item_sk, ss_quantity from store_sales
group by ss_item_sk) ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);
[Diagram: Hive-on-MR materializes the Store Sales scan, group by and aggregation to HDFS, then scans Inventory together with the aggregated Store Sales output and performs a shuffle join. Hive-on-Tez does the Store Sales scan, group by and aggregation (reducing the size of this input) and sends it over a broadcast edge to the Inventory scan and join, with no intermediate HDFS write.]
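A sketch of the settings that let Hive rewrite such a query into a broadcast (map) join automatically; the size threshold is purely illustrative, not a recommendation:

set hive.execution.engine=tez;
set hive.auto.convert.join=true;
-- convert to a map/broadcast join without generating a conditional backup plan
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=100000000;   -- bytes; illustrative threshold for the small side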
© Hortonworks Inc. 2014.
1-1 Edge
• Typical star schema joins involve joins between a large number of tables
• Dimensions aren't always tiny (e.g. the Customer dimension)
• Might not be able to handle all dimensions in a single vertex as broadcast joins
• Tez allows streaming records from one processor to the next via a 1-1 Edge
– Transfer details (streaming, files, etc.) are handled transparently
– Scheduling/cluster capacity is worked out by Tez
• Allows Hive to build a pipeline of in-memory joins which we can stream records through
© Hortonworks Inc. 2014.
Dynamically Partitioned Hash Join
SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand
FROM store_sales ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);
[Diagram: Hive-on-MR runs the Inventory scan as a single local map task; the resulting hash table is read as a side file by the Store Sales scan-and-join mappers, with intermediate data on HDFS. Hive-on-Tez runs the Inventory scan on the cluster (potentially with more than one mapper); a custom edge routes the outputs of that stage to the correct mappers of the next stage, and a custom vertex reads both inputs for the Store Sales scan and join, with no side-file reads.]
© Hortonworks Inc. 2014.
Dynamically Partitioned Hash Join
Plans look very similar to a map join, but the way things work changes between MR and Tez.
Hive – MR (Bucket map-join):
• Not dynamically partitioned.
• Both tables need to be bucketed by the join key.
• Local task that generates the hash table writes n files corresponding to n buckets.
• Number of mappers for the join must be the same as the number of buckets.
• Each of these mappers reads the corresponding bucket file of the local task to perform the join.

Hive – Tez:
• Only one of the sides needs to be bucketed and the other side is dynamically bucketed.
• Also works if neither side is explicitly bucketed, but another operation forced bucketing in the pipeline (traits).
• No writing to HDFS.
• There can be more mappers than the number of buckets, and a bucket can be processed in parallel on multiple mappers.
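As a sketch, bucketing one side by the join key is ordinary DDL (table name and column list are abbreviated for illustration); hive.convert.join.bucket.mapjoin.tez is, as far as I know, the 0.13-era switch that lets Tez exploit the bucketing for this kind of join:

CREATE TABLE inventory_bucketed (
  inv_item_sk BIGINT,
  inv_quantity_on_hand INT
  -- remaining columns omitted
)
CLUSTERED BY (inv_item_sk) INTO 32 BUCKETS
STORED AS ORC;

set hive.convert.join.bucket.mapjoin.tez=true;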
© Hortonworks Inc. 2014.
Union all
SELECT count(*) FROM (
SELECT distinct ss_customer_sk from store_sales where ss_store_sk = 1
UNION ALL
SELECT distinct ss_customer_sk from store_sales where ss_store_sk = 2) as customers
[Diagram: Hive-on-MR needs two MR jobs to do the distincts and materializes both sub-queries onto HDFS, after which a single map reads both sides and aggregates. In Tez the sub-query output is pre-aggregated and sent directly to a common final node.]
© Hortonworks Inc. 2014.
Multi-insert queries
FROM (SELECT * FROM store_sales, date_dim WHERE ss_sold_date_sk = d_date_sk
and d_year = 2000)
INSERT INTO TABLE t1 SELECT distinct ss_item_sk
INSERT INTO TABLE t2 SELECT distinct ss_customer_sk;
[Diagram: Hive-on-MR map-joins date_dim with store_sales, materializes the join on HDFS, and then runs two MR jobs to do the distincts. Hive-on-Tez does a broadcast join (scan date_dim, join store_sales) and computes the distincts for customers and items within the same DAG.]
© Hortonworks Inc. 2014.
Execution
“A good plan violently executed now is better than a perfect plan executed next week.”
George S. Patton
© Hortonworks Inc. 2014.
Faster Query Setup
• AM per-session instead of per-query
– Reused across JDBC connections
• No more local tasks
– Except fetch aggregation
• Metastore fetches are much faster
– Metastore direct sql fast-path
– Partition filters pushed to metastore
• Use distributed cache efficiently for hive-exec.jar
– /home/$user/.hiveJars
• UDF Jars as well
– .jar.<sha1> identifier to avoid conflicts
– Multiple version compatibility easily
– YARN localizes the jars once per node (not per query)
• Kryo instead of XML to serialize operators
– Works better on jdk7
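A minimal sketch of the session-side setup implied above (the jar path is hypothetical):

set hive.execution.engine=tez;        -- one Tez AM per session, reused across queries
ADD JAR /tmp/my_udfs.jar;             -- localized once per node as .jar.<sha1>, not per query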
© Hortonworks Inc. 2014.
Faster Operator Pipeline
• Previously on hive
© Hortonworks Inc. 2014.
Operator Vectorization
• Avoid Writable objects & use primitive int/long
– Allows efficient JIT code for primitive types
• Generate per-type loops & avoid runtime type-checks
• The classes generated look like
– LongColEqualDoubleColumn
– LongColEqualLongColumn
– LongColEqualLongScalar
• Avoid duplicate operations on repeated values
– isRepeating & hasNulls
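A minimal sketch of turning vectorization on (in this Hive version it applies to ORC-backed tables; the filter value is arbitrary):

set hive.vectorized.execution.enabled=true;
-- the scan and filter below can then run in type-specialized, batch-at-a-time operators
SELECT count(*) FROM store_sales WHERE ss_item_sk = 1000;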
© Hortonworks Inc. 2014.
Optimized Row Columnar File
• ORC Vectorized Reader
• Logical Compression helps reader
– isRepeating
• Split per-stripe
• Row-group level indexes
• Stripe level indexes
• PPD avoids a lot of IO
– Column conditions are ANDed
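A sketch of declaring an ORC table so the vectorized reader, row-group/stripe indexes and predicate pushdown apply (column list abbreviated, properties illustrative):

CREATE TABLE store_sales_orc (
  ss_item_sk BIGINT,
  ss_customer_sk BIGINT,
  ss_quantity INT,
  ss_sold_price DECIMAL(7,2)
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');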
© Hortonworks Inc. 2014.
Faster Statistics
• ORC stripe footers aggregate stats per-column
– Min/Max/Sum/Count
• set hive.stats.autogather=true;
• ANALYZE TABLE <table> compute statistics partialscan;
– Reads only ORC footers
• Predicate computation without Tez/MR tasks
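A sketch of answering simple aggregates from statistics alone (assuming stats are current; hive.compute.query.using.stats is the relevant switch):

set hive.compute.query.using.stats=true;
-- served from metastore/ORC footer statistics, no Tez/MR tasks launched
SELECT count(*) FROM store_sales;
SELECT min(ss_quantity), max(ss_quantity) FROM store_sales;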
© Hortonworks Inc. 2014.
Faster Execution: Tez
• Multiple edge types
– Broadcast
– Shuffle
– One-to-One
• Multiple output types
– Sorted
– Unsorted
– Unsorted Partitioned
• Per-vertex configurations
– Instead of one configuration between M&R tasks
© Hortonworks Inc. 2014.
Tez I/O speed-ups
• Tez shuffle can use keep-alive over HTTP
• Shuffle scheduler can optimize connection count
– Can fetch all map outputs from one node via 1 connection
• Can skip fetching 0 sized partitions from a mapper
– Speeds up group-by queries with high locality
– Reducers finish shuffle faster
• Shuffle threads are re-used in container re-use
– Secure shuffle has crypto thread-local inits
© Hortonworks Inc. 2014.
Skewed Reducers: auto-parallelism
• Often queries are slow because of one slow reducer
• Skewed data is too common in real life queries
• This avoids running too many reducers with very little data
• Future
– This can be extended to group by input size
– This mechanism can actually speculate on stalling reducers better (split into 3)
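A sketch of the knobs behind reducer auto-parallelism (Hive-on-Tez configuration names; values illustrative):

set hive.tez.auto.reducer.parallelism=true;
-- start with an upper bound; Tez scales the actual reducer count down at runtime
set hive.tez.max.partition.factor=2.0;
set hive.tez.min.partition.factor=0.25;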
© Hortonworks Inc. 2014.
A Query in motion
• 4-way map join + map-reduce-reduce query
• Timeline is left to right; each lane represents one container
© Hortonworks Inc. 2014.
Defer/Skip tasks
• No more uploading hive-exec.jar/UDFs for every query
• No more spinning up an AM for each stage
• No more computation on hive client (local task)
© Hortonworks Inc. 2014.
Concurrency of small tasks
• Hive used to run several lightweight tasks in a local VM
• LocalTask was a bottleneck
– No locality
– No parallelism
– Small VM
• Tez Broadcast edges solve that problem
© Hortonworks Inc. 2014.
Concurrent Split Generation
• Tez input initializers are run in parallel
• No more spinning up an AM for each stage
• No more computation on hive client (local task)
© Hortonworks Inc. 2014.
Split Elimination
• ORC comes with Predicate Push Down in the reader
• Queries with SARGable where clauses
– http://en.wikipedia.org/wiki/Sargable
• Run the SARGs in the AM, using ORC footer data
– Eliminate splits before task spinups, avoid container costs
• Offers a soft cache for the ORC footers
• Zero splits offers an early exit for data validity checks (e.g. price < 0)
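A sketch of a SARGable filter benefiting from split elimination (hive.optimize.index.filter enables the ORC predicate pushdown; the date key value is arbitrary):

set hive.optimize.index.filter=true;
-- min/max in ORC footers let the AM drop stripes and splits that can never match
SELECT ss_item_sk, ss_quantity
FROM store_sales
WHERE ss_sold_date_sk = 2451911;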
© Hortonworks Inc. 2014.
Pipelining Split->Task
• The task only depends on its own input
• It starts talking to YARN immediately once its inputs are ready
• Faster generation of dimension tables
• Fact tables can optimize on this further
– Will break existing FileSplit mechanism
© Hortonworks Inc. 2014.
Filling up the pipeline
• Tez allows grouping splits dynamically
• Obsoletes CombineFileInputFormat
• Grouped according to locality
– 1.7x available containers (or any factor, actually)
• Allow query to use up 100% of queue capacity
– Without tuning mapred split size for each data-set
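A sketch of the grouping knobs involved (Tez configuration names; sizes illustrative):

set tez.grouping.split-waves=1.7;          -- target roughly 1.7x the available containers
set tez.grouping.min-size=16777216;        -- 16 MB, illustrative lower bound per grouped split
set tez.grouping.max-size=1073741824;      -- 1 GB, illustrative upper bound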
© Hortonworks Inc. 2014.
ORC Split extras
• RCFile had horrible split performance
– rcfile::sync() was slow to find a sync point
• ORC Reader allows exact splits for stripes
• ORC Writer can pad a stripe to an HDFS block
– 5%-7% overhead measured on table
– 100% locality of a stripe in a block
© Hortonworks Inc. 2014.
Container reuse
• Tez specific feature
• Run an entire DAG using the same containers
• Different vertices use same container
• Saves time talking to YARN for new containers
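A sketch of the reuse-related switches (tez.am.container.reuse.enabled is the Tez-side setting; the prewarm values are illustrative):

set tez.am.container.reuse.enabled=true;
-- optionally pre-warm a few containers for the session
set hive.prewarm.enabled=true;
set hive.prewarm.numcontainers=10;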
© Hortonworks Inc. 2014.
Container reuse (II)
• Tez provides an object registry within a vertex
• This can be used to cache map-join hash-tables
• JVM JIT kicks in and optimizes better on re-use
© Hortonworks Inc. 2014.
Container re-use (Session)
• Keep a container group alive between queries
• Fast query spin-up and skip YARN queue
• Even better JIT performance on >1 queries
© Hortonworks Inc. 2014.
HiveServer2 and Sessions
• HiveServer2 can keep sessions alive
–Between different JDBC queries
• New security model helps
–All secure queries run as “hive” user
• Ideal for short exploratory queries
• Uses same JARs (no download for task)
• Even better JIT performance on >1 queries
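A sketch of keeping Tez sessions warm behind HiveServer2 (hive-site.xml style settings from the 0.13-era HS2/Tez integration; queue name and session count are illustrative):

hive.server2.tez.initialize.default.sessions=true
hive.server2.tez.default.queues=default
hive.server2.tez.sessions.per.default.queue=2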
© Hortonworks Inc. 2014.
Supersize it!
• 78 vertices + 8374 tasks on 50 containers
© Hortonworks Inc. 2014.
Query overload #2
• 5000 hive query test-set
• Only 3.9k triggered compute tasks
• Rest was optimized away into fetch tasks or metadata tasks
• Gets progressively faster as the JVM JIT improves the native code
© Hortonworks Inc. 2014.
Big picture
[Chart: latency by configuration]
Text: 1501.895
Columnar: 1176.479
Partitioned: 631.027
Stinger: 4.872
© Hortonworks Inc. 2014.
Roadmap
• Expand uses for CBO
– Join Algorithm selection
– Tez checkpoint selection (recovery)
• Temp Tables
– Session life-time
– Sharing of intermediate results
• Materialized views
– Pre-compute common results/aggregations
– Transparently route via CBO
• Join/Grouping w/o sort
– Tez decouples algorithm from data transfer
• Sort-merge bucket in Tez
– Leverage vertex manager
– Co-locate partitions on HDFS
• Inline sampling/range partitioning with Tez
– Sample/create histogram dynamically for skew joins and total order sort
Page 41

Más contenido relacionado

La actualidad más candente

Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingHortonworks
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveDataWorks Summit
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizonThejas Nair
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Hadoop -ResourceManager HAの仕組み-
Hadoop -ResourceManager HAの仕組み-Hadoop -ResourceManager HAの仕組み-
Hadoop -ResourceManager HAの仕組み-Yuki Gonda
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 

La actualidad más candente (20)

Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Hadoop -ResourceManager HAの仕組み-
Hadoop -ResourceManager HAの仕組み-Hadoop -ResourceManager HAの仕組み-
Hadoop -ResourceManager HAの仕組み-
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 

Similar a Hive + Tez: A Performance Deep Dive

Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2t3rmin4t0r
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitData Con LA
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoopRommel Garcia
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013alanfgates
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013alanfgates
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveDataWorks Summit
 
Austin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at BazaarvoiceAustin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at Bazaarvoicebazaarvoice_engineering
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Stream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesStream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesVladimír Schreiner
 
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times FasterApril 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times FasterYahoo Developer Network
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 

Similar a Hive + Tez: A Performance Deep Dive (20)

Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixit
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoop
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
מיכאל
מיכאלמיכאל
מיכאל
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
February 2014 HUG : Hive On Tez
February 2014 HUG : Hive On TezFebruary 2014 HUG : Hive On Tez
February 2014 HUG : Hive On Tez
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in Hive
 
Austin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at BazaarvoiceAustin Scales- Clickstream Analytics at Bazaarvoice
Austin Scales- Clickstream Analytics at Bazaarvoice
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Stream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data PipelinesStream Processing and Real-Time Data Pipelines
Stream Processing and Real-Time Data Pipelines
 
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times FasterApril 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 

Más de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 

Último (20)

Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 

Hive + Tez: A Performance Deep Dive

  • 1. Hive+Tez: A Performance deep dive Jitendra Pandey Gopal Vijayaraghavan
  • 2. © Hortonworks Inc. 2014. Stinger Project (announced February 2013) Batch AND Interactive SQL-IN-Hadoop Stinger Initiative A broad, community-based effort to drive the next generation of HIVE Hive 0.13, April, 2013 • Hive on Apache Tez • Cost Based Optimizer (Optiq) • Vectorized Processing Hive 0.11, May 2013: • Base Optimizations • SQL Analytic Functions • ORCFile, Modern File Format Hive 0.12, October 2013: • VARCHAR, DATE Types • ORCFile predicate pushdown • Advanced Optimizations • Performance Boosts via YARN Speed Improve Hive query performance by 100X to allow for interactive query times (seconds) Scale The only SQL interface to Hadoop designed for queries that scale from TB to PB SQL Support broadest range of SQL semantics for analytic applications running against Hadoop …all IN Hadoop Goals:
  • 3. © Hortonworks Inc. 2014. SPEED: Increasing Hive Performance Key Highlights – Tez: New execution engine – Vectorized Query Processing – Startup time improvement – Statistics to accelerate query execution – Cost Based Optimizer: Optiq Interactive Query Times across ALL use cases • Simple and advanced queries in seconds • Integrates seamlessly with existing tools • Currently a >100x improvement in just nine months Elements of Fast SQL Execution • Query Planner/Cost Based Optimizer w/ Statistics • Query Startup • Query Execution • I/O Path
  • 4. © Hortonworks Inc. 2014. Statistics and Cost-based optimization • Statistics: – Hive has table and column level statistics – Used to determine parallelism, join selection • Optiq: Open source, Apache licensed query execution framework in Java – Used by Apache Drill, Apache Cascading, Lucene DB – Based on Volcano paper – 20 man years dev, more than 50 optimization rules • Goals for hive – Ease of Use – no manual tuning for queries, make choices automatically based on cost – View Chaining/Ad hoc queries involving multiple views – Help enable BI Tools front-ending Hive – Emphasis on latency reduction • Cost computation will be used for  Join ordering  Join algorithm selection  Tez vertex boundary selection Page 4 HIVE-5775
  • 5. © Hortonworks Inc. 2014. TPC-DS Query 17 select i_item_id ,i_item_desc ,s_state ,count(ss_quantity) as store_sales_quantitycount ,…. from store_sales ss ,store_returns sr, catalog_sales cs, date_dim d1, date_dim d2, date_dim d3, store s, item i where d1.d_quarter_name = '2000Q1’ and d1.d_date_sk = ss.ss_sold_date_sk and i.i_item_sk = ss.ss_item_sk and s.s_store_sk = ss.ss_store_sk and ss.ss_customer_sk = sr.sr_customer_sk and ss.ss_item_sk = sr.sr_item_sk … group by i_item_id ,i_item_desc, ,s_state order by i_item_id ,i_item_desc, s_state limit 100;  Joins Store Sales, Store Returns and Catalog Sales fact tables.  Each of the fact tables are independently restricted by time.  Analysis at Item and Store grain, so these dimensions are also joined in.  As specified Query starts by joining the 3 Fact tables.
  • 6. © Hortonworks Inc. 2014. TPC-DS Query 17 Specified Join Tree Non CBO Plan CBO Plan
  • 7. © Hortonworks Inc. 2014. TPC-DS Query 17 Run 1 Run 2 Non CBO 127.53 100.71 CBO 50.9 44.52  Fact tables  partitioned by Day,  bucketed by Item  Bucketing off  Bucketing should help CBO plan.  SR table much smaller. Better chance of Bucket Join in place of Shuffle Join. Join Ordering Cost Estimate ['item', [[[[[['d2', 'store_returns'], 'store_sales'], 'catalog_sales'], 'd1'], 'd3'], 'store']] 3547898.061 … ['store_returns', 'd2’] 19224.71 ['store_sales', 'store_returns’] 23057497.991 ['d1', 'store_sales'] 26142.943 Facts restricted to 3 months Orderings considered by Planner
  • 8. © Hortonworks Inc. 2014. Apache Tez (“Speed”) • Replaces MapReduce as primitive for Pig, Hive, Cascading etc. – Smaller latency for interactive queries – Higher throughput for batch queries – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft YARN ApplicationMaster to run DAG of Tez Tasks Task with pluggable Input, Processor and Output Tez Task - <Input, Processor, Output> Task ProcessorInput Output
  • 9. © Hortonworks Inc. 2014. Hive – MR Hive – Tez Hive-on-MR vs. Hive-on-Tez SELECT g1.x, g1.avg, g2.cnt FROM (SELECT a.x, AVERAGE(a.y) AS avg FROM a GROUP BY a.x) g1 JOIN (SELECT b.x, COUNT(b.y) AS avg FROM b GROUP BY b.x) g2 ON (g1.x = g2.x) ORDER BY avg; GROUP a BY a.x JOIN (a,b) GROUP b BY b.x ORDER BY M M M R R M M R M M R M R HDFS HDFS HDFS M M M R R R M M R GROUP BY a.x JOIN (a,b) ORDER BY GROUP BY x Tez avoids unnecessary writes to HDFS HIVE-4660
  • 10. © Hortonworks Inc. 2014. Shuffle Join SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand FROM inventory inv JOIN store_sales ss ON (inv.inv_item_sk = ss.ss_item_sk); Hive – MR Hive – Tez
  • 11. © Hortonworks Inc. 2014. Broadcast Join SELECT ss.ss_item_sk, ss.ss_quantity, avg_price, inv.inv_quantity_on_hand FROM (select avg(ss_sold_price) as avg_price, ss_item_sk, ss_quantity_sk from store_sales group by ss_item_sk) ss JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk); Hive – MR Hive – Tez M M M M M HDFS Store Sales scan. Group by and aggregation reduce size of this input. Inventory scan and Join Broadcast edge M M M HDFS Store Sales scan. Group by and aggregation. Inventory and Store Sales (aggr.) output scan and shuffle join. R R R R RR M MMM HDFS
  • 12. © Hortonworks Inc. 2014. 1-1 Edge • Typical star schema join involve join between large number of tables • Dimension aren’t always tiny (Customer dimension) • Might not be able to handle all dimensions in single vertex as broadcast joins • Tez allows streaming records from one processor to the next via a 1-1 Edge – Transfer details (streaming, files, etc) are handled transparently – Scheduling/cluster capacity is worked out by Tez • Allows hive to build a pipeline of in memory joins which we can stream records through
  • 13. © Hortonworks Inc. 2014. Dynamically Partitioned Hash Join SELECT ss.ss_item_sk, ss.ss_quantity, inv.inv_quantity_on_hand FROM store_sales ss JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk); Hive – MR Hive – Tez M MM M M HDFS Inventory scan (Runs on cluster potentially more than 1 mapper) Store Sales scan and Join (Custom vertex reads both inputs – no side file reads) Custom edge (routes outputs of previous stage to the correct Mappers of the next stage) M MM M HDFS Inventory scan (Runs as single local map task) Store Sales scan and Join (Inventory hash table read as side file) HDFS
  • 14. © Hortonworks Inc. 2014. Dynamically Partitioned Hash Join Plans look very similar to map join but the way things work change between MR and Tez. Hive – MR (Bucket map-join) Hive – Tez • Not dynamically partitioned. • Both tables need to be bucketed by the join key. • Local task that generates the hash table writes n files corresponding to n buckets. • Number of mappers for the join must be same as the number of buckets. • Each of these mappers reads the corresponding bucket file of the local task to perform the join. • Only one of the sides needs to be bucketed and the other side is dynamically bucketed. • Also works if neither side is explicitly bucketed, but another operation forced bucketing in the pipeline (traits) • No writing to HDFS. • There can be more mappers than number of buckets, and a bucket can be processed in parallel on multiple mappers.
  • 15. © Hortonworks Inc. 2014. Union all SELECT count(*) FROM ( SELECT distinct ss_customer_sk from store_sales where ss_store_sk = 1 UNION ALL SELECT distinct ss_customer_sk from store_sales where ss_store_sk = 2) as customers Hive – MR Hive – Tez M M M R M M M HDFS R M R HDFS M M M R M M M HDFS R R Two MR jobs to do the distinct Both sub-queries are materialized onto HDFS Single map reads both sides and aggregates In Tez the sub-query output is pre-aggregated and send directly to a common final node
  • 16. © Hortonworks Inc. 2014. Multi-insert queries FROM (SELECT * FROM store_sales, date_dim WHERE ss_sold_date_sk = d_date_sk and d_year = 2000) INSERT INTO TABLE t1 SELECT distinct ss_item_sk INSERT INTO TABLE t2 SELECT distinct ss_customer_sk; Hive – MR Hive – Tez M MM M HDFS Map join date_dim/store sales Two MR jobs to do the distinct M MM M M HDFS RR HDFS M M M R M M M R HDFS Broadcast Join (scan date_dim, join store sales) Distinct for customer + items Materialize join on HDFS
  • 17. © Hortonworks Inc. 2014. Execution “A good plan violently executed now is better than a perfect plan executed next week. George S. Patton
  • 18. © Hortonworks Inc. 2014. Faster Query Setup • AM per-session instead of per-query – Reused across JDBC connections • No more local tasks – Except fetch aggregation • Metastore fetches are much faster – Metastore direct SQL fast-path – Partition filters pushed to the metastore • Use the distributed cache efficiently for hive-exec.jar – /home/$user/.hiveJars • UDF jars as well – .jar.<sha1> identifier to avoid conflicts – Multiple versions can coexist easily – YARN localizes the jars once per node (not per query) • Kryo instead of XML to serialize operators – Works better on JDK 7 Page 18
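A hedged sketch of a few of the settings behind these items; the property names are standard Hive configuration, but note that the metastore direct-SQL switch is effectively a server-side setting and availability varies by version:
    set hive.execution.engine=tez;            -- the Tez AM is created per session and reused, instead of per query
    set hive.metastore.try.direct.sql=true;   -- metastore fast-path: direct SQL instead of ORM for partition lookups
    set hive.fetch.task.aggr=true;            -- simple no-group-by aggregations can be answered by a fetch task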
  • 19. © Hortonworks Inc. 2014. Faster Operator Pipeline • Previously on Hive
  • 20. © Hortonworks Inc. 2014. Operator Vectorization • Avoid Writable objects & use primitive int/long – Allows efficient JIT code for primitive types • Generate per-type loops & avoid runtime type-checks • The classes generated look like – LongColEqualDoubleColumn – LongColEqualLongColumn – LongColEqualLongScalar • Avoid duplicate operations on repeated values – isRepeating & hasNulls
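A hedged example of turning the vectorized pipeline on (the property is standard Hive configuration; it applies when the data is read from ORC):
    set hive.vectorized.execution.enabled=true;   -- operators process batches of rows using primitive arrays instead of per-row Writable objects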
  • 21. © Hortonworks Inc. 2014. Optimized Row Columnar File • ORC Vectorized Reader • Logical Compression helps reader – isRepeating • Split per-stripe • Row-group level indexes • Stripe level indexes • PPD avoids a lot of IO – Column conditions are ANDed
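A minimal sketch of creating an ORC table so the reader-side features above apply; the table name and property values are illustrative, not from the slides:
    CREATE TABLE store_sales_orc
    STORED AS ORC
    TBLPROPERTIES ("orc.compress"="ZLIB", "orc.create.index"="true")
    AS SELECT * FROM store_sales;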
  • 22. © Hortonworks Inc. 2014. Faster Statistics • ORC stripe footers aggregate stats per-column – Min/Max/Sum/Count • set hive.stats.autogather=true; • ANALYZE TABLE <table> compute statistics partialscan; – Reads only ORC footers • Predicate computation without Tez/MR tasks
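Hedged companion examples of the statistics statements (standard Hive syntax; the table and column names are placeholders):
    ANALYZE TABLE store_sales_orc COMPUTE STATISTICS PARTIALSCAN;                          -- reads only the ORC footers, no full scan
    ANALYZE TABLE store_sales_orc COMPUTE STATISTICS FOR COLUMNS ss_item_sk, ss_quantity;  -- column-level stats that feed the cost-based optimizer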
  • 23. © Hortonworks Inc. 2014. Faster Execution: Tez • Multiple edge types – Broadcast – Shuffle – One-to-One • Multiple output types – Sorted – Unsorted – Unsorted Partitioned • Per-vertex configurations – Instead of one configuration between M&R tasks
  • 24. © Hortonworks Inc. 2014. Tez I/O speed-ups • Tez shuffle can use keep-alive over HTTP • Shuffle scheduler can optimize connection count – Can fetch all map outputs from one node via 1 connection • Can skip fetching zero-sized partitions from a mapper – Speeds up group-by queries with high locality – Reducers finish shuffle faster • Shuffle threads are re-used when containers are re-used – Secure shuffle has crypto thread-local inits
  • 25. © Hortonworks Inc. 2014. Skewed Reducers: auto-parallelism • Often queries are slow because of one slow reducer • Skewed data is all too common in real-life queries • This avoids running too many reducers with very little data • Future – This can be extended to group by input size – This mechanism can speculate on stalling reducers better (split into 3)
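A hedged sketch of the auto-parallelism knobs involved (property names from Hive 0.13 on Tez; the factor values are illustrative):
    set hive.tez.auto.reducer.parallelism=true;   -- let Tez shrink the reducer count at runtime based on observed data sizes
    set hive.tez.max.partition.factor=2.0;        -- start with more partitions than estimated, then merge the small ones
    set hive.tez.min.partition.factor=0.25;       -- lower bound on how far the reducer count can shrink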
  • 26. © Hortonworks Inc. 2014. A Query in motion Page 26 • 4-way map join + map-reduce-reduce query • Timeline runs left to right; each lane represents one container
  • 27. © Hortonworks Inc. 2014. Defer/Skip tasks Page 27 • No more uploading hive-exec.jar/UDFs for every query • No more spinning up an AM for each stage • No more computation on the Hive client (local task)
  • 28. © Hortonworks Inc. 2014. Concurrency of small tasks Page 28 • Hive used to run several lightweight tasks in a local VM • LocalTask was a bottleneck – No locality – No parallelism – Small VM • Tez Broadcast edges solve that problem
  • 29. © Hortonworks Inc. 2014. Concurrent Split Generation Page 29 • Tez input initializers run in parallel • No more spinning up an AM for each stage • No more computation on the Hive client (local task)
  • 30. © Hortonworks Inc. 2014. Split Elimination Page 30 • ORC comes with Predicate Push Down in the reader • Queries with SARGable where clauses – http://en.wikipedia.org/wiki/Sargable • Run the SARGs in the AM, using ORC footer data – Eliminate splits before task spinups, avoid container costs • Offers a soft cache for the ORC footers • Zero splits offer an early exit for data validity checks (e.g. price < 0)
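A hedged illustration of a SARGable predicate and the switch that pushes it into the ORC reader; the property is standard Hive configuration and the query/table name are illustrative:
    set hive.optimize.index.filter=true;   -- turn the WHERE clause into a SearchArgument evaluated against ORC min/max stats
    SELECT count(*) FROM store_sales_orc WHERE ss_sold_price < 0;
    -- footer stats can prove that no stripe matches, so all splits are eliminated and no containers are needed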
  • 31. © Hortonworks Inc. 2014. Pipelining Split->Task Page 31 • The task only depends on its own input • It starts talking to YARN immediately once its inputs are ready • Faster generation of dimension tables • Fact tables can optimize on this further – Will break existing FileSplit mechanism
  • 32. © Hortonworks Inc. 2014. Filling up the pipeline Page 32 • Tez allows grouping splits dynamically • Obsoletes CombineFileInputFormat • Grouped according to locality – 1.7x the available containers (or any factor, actually) • Allows a query to use up 100% of queue capacity – Without tuning the mapred split size for each data set
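A hedged example of the Tez grouping knobs this refers to (tez.grouping.* are Tez configuration names; the byte sizes are illustrative, and whether a per-session set takes effect depends on the version):
    set tez.grouping.split-waves=1.7;       -- target roughly 1.7 waves of tasks per set of available containers
    set tez.grouping.min-size=52428800;     -- lower bound on a grouped split, in bytes (illustrative)
    set tez.grouping.max-size=1073741824;   -- upper bound on a grouped split, in bytes (illustrative)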
  • 33. © Hortonworks Inc. 2014. ORC Split extras • RCFile had horrible split performance – rcfile::sync() was slow to find a sync point • ORC Reader allows exact splits for stripes • ORC Writer can pad a stripe to an HDFS block – 5%-7% overhead measured on table – 100% locality of a stripe in a block
  • 34. © Hortonworks Inc. 2014. Container reuse • Tez specific feature • Run an entire DAG using the same containers • Different vertices use same container • Saves time talking to YARN for new containers
  • 35. © Hortonworks Inc. 2014. Container reuse (II) • Tez provides an object registry within a vertex • This can be used to cache map-join hash-tables • JVM JIT kicks in and optimizes better on re-use
  • 36. © Hortonworks Inc. 2014. Container re-use (Session) • Keep a container group alive between queries • Fast query spin-up and skip YARN queue • Even better JIT performance on >1 queries
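A hedged sketch of settings commonly associated with container re-use and warm sessions; tez.am.container.reuse.enabled is a Tez property (normally set in tez-site.xml), the hive.prewarm.* properties appear in Hive 0.13, and the container count is illustrative:
    tez.am.container.reuse.enabled=true        -- successive tasks/vertices of a DAG run in already-allocated containers
    set hive.prewarm.enabled=true;             -- ask the Tez session to spin up a pool of containers up front
    set hive.prewarm.numcontainers=10;         -- illustrative pool size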
  • 37. © Hortonworks Inc. 2014. HiveServer2 and Sessions • HiveServer2 can keep sessions alive –Between different JDBC queries • New security model helps –All secure queries run as “hive” user • Ideal for short exploratory queries • Uses same JARs (no download for task) • Even better JIT performance on >1 queries
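A hedged sketch of the HiveServer2-side properties that implement this (hive.server2.tez.* appear in Hive 0.13; the queue name and session count are placeholders, and these typically go in hive-site.xml rather than a client session):
    hive.server2.tez.initialize.default.sessions=true   -- start a pool of Tez sessions when HiveServer2 starts
    hive.server2.tez.default.queues=default             -- YARN queue(s) that hold the session pool
    hive.server2.tez.sessions.per.default.queue=2       -- sessions kept alive per queue
    hive.server2.enable.doAs=false                      -- run queries as the "hive" service user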
  • 38. © Hortonworks Inc. 2014. Supersize it! • 78 vertices + 8374 tasks on 50 containers Page 38
  • 39. © Hortonworks Inc. 2014. Query overload #2 • 5000 Hive query test-set • Only 3.9k triggered compute tasks • The rest were optimized away into fetch tasks or metadata tasks • Gets progressively faster as the JVM JIT improves the native code Page 39
  • 40. © Hortonworks Inc. 2014. Big picture [Chart: query latency in seconds – Text: 1501.895, Columnar: 1176.479, Partitioned: 631.027, Stinger: 4.872]
  • 41. © Hortonworks Inc. 2014. Roadmap • Expand uses for CBO – Join Algorithm selection – Tez checkpoint selection (recovery) • Temp Tables – Session life-time – Sharing of intermediate results • Materialized views – Pre-compute common results/aggregations – Transparently route via CBO • Join/Grouping w/o sort – Tez decouples algorithm from data transfer • Sort-merge bucket in Tez – Leverage vertex manager – Co-locate partitions on HDFS • Inline sampling/range partitioning with Tez – Sample/create histogram dynamically for skew joins and total order sort Page 41

Editor's notes

  1. Base optimizations: star join, MMR->MR, multiple map joins grouped into a single mapper. Which analytic functions? Windowing functions, OVER clause. Advanced optimizations: does predicate push down only eliminate the ORC stripes? Performance boosts via YARN: improvements in shuffle.
  2. Tools? BI tools, Tableau, MicroStrategy. Hive 0.13 is 100x faster. Startup time improvements: - Pre-launch the App Master, keep containers around; what are the elements of query startup? - Faster metastore lookup. Uses of statistics other than Optiq: - Metadata queries - Estimating the number of reducers - Map join conversion. Optiq: join reordering.
  3. What is Optiq? 50 optimization rules; examples - join reordering rules, filter push down, column pruning. Should we mention we generate an AST? Ad hoc queries involving multiple views: creating views is currently supported; a query on a view is executed by replacing the view with its subquery. What is a Tez vertex boundary?
  4. What is shuffle+map? Why is d1 not joined with ss before first shuffle?
  5. Why is Run 2 slower for non-CBO? What is bucketing off?
  6. Why higher throughput? How many contributors now?
  7. No unnecessary writes to HDFS. Number of processes reduced. The edges between M and R can be generalized.
  8. On MR, each mapper sorts partitions of both tables. In Tez, a mapper sorts only one table, and the operators don’t have to switch between data sources.
  9. Inventory is the bigger table in this case. Similar to a map-join without the need to build a hash table on the client. Will work with any level of sub-query nesting. Uses stats to determine if applicable. How it works: the broadcast result set is computed in parallel on the cluster; join processors are spun up in parallel; the broadcast set is streamed to the join processors; the join processors build hash tables; the other relation is joined with the hash table. Tez handles: best parallelism, best data transfer of the hashed relation, best scheduling to avoid latencies. Why is the broadcast join better than the map join? -- Multiple hashes can be generated in parallel -- the in-memory hash table can be more compact than the serialized one in the local task -- sub-queries were always on the streaming side and were joined with a shuffle join. Parallelism: splits of a dimension table are processed in parallel across mappers. Data transfer: no HDFS write in between. Scheduling: read from a rack-local replica of the dimension table.
  10. Comparing the bucketed map join in MR vs Tez. The inventory table is already bucketed. In MR, the hash map for each bucket is built in a single mapper in sequence, loaded into HDFS, then joined with store_sales where the hash table is read as a side file. In Tez, the inventory scan runs in parallel in multiple mappers that process buckets. ------ Kicks in when the large table is bucketed. Bucketing is dynamic, done as part of query processing. Uses a custom edge to match the partitioning on the smaller table. Allows a hash join in cases where a broadcast would be too large. Tez gives us the option of building custom edges and vertex managers: fine-grained control over how the data is replicated and partitioned; scheduling and the actual data transfer are handled by Tez.
  11. A common operation in decision support queries. Caused additional no-op stages in MR plans; the last stage spins up a multi-input mapper to write the result; intermediate unions have to be materialized before additional processing. Tez has a union that handles these cases transparently without any intermediate steps.
  12. Allows the same input to be split and written to different tables or partitions. Avoids duplicate scans/processing. Useful for ETL. Similar to “splits” in Pig. In MR a “split” in the operator pipeline has to be written to HDFS and processed by multiple additional MR jobs. Tez allows sending the multiple outputs directly to downstream processors.
  13. checkcast
  14. TPC-H query 1 and query 6. Before:
  15. 1 TB of TPC-H data compresses to 200 GB of ORC data. 30 TB of TPC-DS data compresses to approximately 6 TB of ORC data.