Enviar búsqueda
Cargar
Hive 3 a new horizon
•
2 recomendaciones
•
391 vistas
A
Abdelkrim Hadjidj
Seguir
Hive 3 - Paris Future of Data Meetup
Leer menos
Leer más
Tecnología
Denunciar
Compartir
Denunciar
Compartir
1 de 50
Recomendados
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
DataWorks Summit
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Artem Ervits
Hive 3 a new horizon
Hive 3 a new horizon
Artem Ervits
Ingesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache Orc
DataWorks Summit
What's new in Ambari
What's new in Ambari
DataWorks Summit
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
DataWorks Summit
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
Jan Pieter Posthuma
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
DataWorks Summit
Recomendados
Sharing metadata across the data lake and streams
Sharing metadata across the data lake and streams
DataWorks Summit
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Artem Ervits
Hive 3 a new horizon
Hive 3 a new horizon
Artem Ervits
Ingesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache Orc
DataWorks Summit
What's new in Ambari
What's new in Ambari
DataWorks Summit
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
DataWorks Summit
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
Jan Pieter Posthuma
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
DataWorks Summit
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
Julian Hyde
Analyzing the World's Largest Security Data Lake!
Analyzing the World's Largest Security Data Lake!
DataWorks Summit
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
DataWorks Summit
Advanced Analytics using Apache Hive
Advanced Analytics using Apache Hive
Murtaza Doctor
Loading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF Loft
Amazon Web Services
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management Challenges
DataWorks Summit
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
DataWorks Summit
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
DataWorks Summit/Hadoop Summit
SQL on Hadoop
SQL on Hadoop
nvvrajesh
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
DataWorks Summit
Streaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Julian Hyde
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
DataWorks Summit/Hadoop Summit
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and Parquet
DataWorks Summit
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
DataWorks Summit
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
Cloudera, Inc.
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
Cloudera, Inc.
Dealing with Changed Data in Hadoop
Dealing with Changed Data in Hadoop
DataWorks Summit
The Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-Service
BlueData, Inc.
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
DataWorks Summit
Big Data Analytics from Edge to Core
Big Data Analytics from Edge to Core
DataWorks Summit
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
DataWorks Summit
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
DataWorks Summit
Más contenido relacionado
La actualidad más candente
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
Julian Hyde
Analyzing the World's Largest Security Data Lake!
Analyzing the World's Largest Security Data Lake!
DataWorks Summit
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
DataWorks Summit
Advanced Analytics using Apache Hive
Advanced Analytics using Apache Hive
Murtaza Doctor
Loading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF Loft
Amazon Web Services
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management Challenges
DataWorks Summit
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
DataWorks Summit
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
DataWorks Summit/Hadoop Summit
SQL on Hadoop
SQL on Hadoop
nvvrajesh
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
DataWorks Summit
Streaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Julian Hyde
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
DataWorks Summit/Hadoop Summit
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and Parquet
DataWorks Summit
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
DataWorks Summit
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
Cloudera, Inc.
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
Cloudera, Inc.
Dealing with Changed Data in Hadoop
Dealing with Changed Data in Hadoop
DataWorks Summit
The Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-Service
BlueData, Inc.
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
DataWorks Summit
Big Data Analytics from Edge to Core
Big Data Analytics from Edge to Core
DataWorks Summit
La actualidad más candente
(20)
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
Analyzing the World's Largest Security Data Lake!
Analyzing the World's Largest Security Data Lake!
Protecting your Critical Hadoop Clusters Against Disasters
Protecting your Critical Hadoop Clusters Against Disasters
Advanced Analytics using Apache Hive
Advanced Analytics using Apache Hive
Loading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF Loft
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management Challenges
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
SQL on Hadoop
SQL on Hadoop
Dynamic DDL: Adding structure to streaming IoT data on the fly
Dynamic DDL: Adding structure to streaming IoT data on the fly
Streaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
Format Wars: from VHS and Beta to Avro and Parquet
Format Wars: from VHS and Beta to Avro and Parquet
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
Dealing with Changed Data in Hadoop
Dealing with Changed Data in Hadoop
The Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-Service
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Optimizing your SparkML pipelines using the latest features in Spark 2.3
Big Data Analytics from Edge to Core
Big Data Analytics from Edge to Core
Similar a Hive 3 a new horizon
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
DataWorks Summit
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
DataWorks Summit
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
DataWorks Summit
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
DataWorks Summit
What's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?
DataWorks Summit
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
alanfgates
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
alanfgates
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
DataWorks Summit
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
HortonworksJapan
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Hortonworks
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Timothy Spann
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
DataWorks Summit
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
DataStax
Hp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar Slides
Hortonworks
An Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
DataWorks Summit
Scaling db infra_pay_pal
Scaling db infra_pay_pal
pramod garre
Performance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
Similar a Hive 3 a new horizon
(20)
Fast SQL on Hadoop, really?
Fast SQL on Hadoop, really?
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
What's New in Apache Hive
What's New in Apache Hive
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Hive 3.0 - HDPの最新バージョンで実現する新機能とパフォーマンス改善
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Hp Converged Systems and Hortonworks - Webinar Slides
Hp Converged Systems and Hortonworks - Webinar Slides
An Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
Scaling db infra_pay_pal
Scaling db infra_pay_pal
Performance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Más de Abdelkrim Hadjidj
Disaster Recovery and High Availability with Kafka, SRM and MM2
Disaster Recovery and High Availability with Kafka, SRM and MM2
Abdelkrim Hadjidj
Paris FOD meetup - koordinator
Paris FOD meetup - koordinator
Abdelkrim Hadjidj
Paris FOD meetup - Streams Messaging Manager
Paris FOD meetup - Streams Messaging Manager
Abdelkrim Hadjidj
Paris FOD meetup - kafka security 101
Paris FOD meetup - kafka security 101
Abdelkrim Hadjidj
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
Abdelkrim Hadjidj
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
Abdelkrim Hadjidj
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant Presentation
Abdelkrim Hadjidj
Apache NiFi: latest developments for flow management at scale
Apache NiFi: latest developments for flow management at scale
Abdelkrim Hadjidj
Future of Data Meetup : Boontadata
Future of Data Meetup : Boontadata
Abdelkrim Hadjidj
Más de Abdelkrim Hadjidj
(9)
Disaster Recovery and High Availability with Kafka, SRM and MM2
Disaster Recovery and High Availability with Kafka, SRM and MM2
Paris FOD meetup - koordinator
Paris FOD meetup - koordinator
Paris FOD meetup - Streams Messaging Manager
Paris FOD meetup - Streams Messaging Manager
Paris FOD meetup - kafka security 101
Paris FOD meetup - kafka security 101
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant Presentation
Apache NiFi: latest developments for flow management at scale
Apache NiFi: latest developments for flow management at scale
Future of Data Meetup : Boontadata
Future of Data Meetup : Boontadata
Último
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Zilliz
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
UiPathCommunity
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
apidays
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Deepika Singh
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
apidays
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
Khushali Kathiriya
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
apidays
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
apidays
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
danishmna97
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Angeliki Cooney
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
Dropbox
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
Nanddeep Nachan
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Edi Saputra
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Orbitshub
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
Remote DBA Services
Último
(20)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
Hive 3 a new horizon
1.
1 © Hortonworks
Inc. 2011–2018. All rights reserved © Hortonworks Inc. 2011 – 2017 Apache Hive 3: A new horizon Gunther Hagleitner, Ashutosh Chauhan,Jesus Camacho Rodriguez, Gopal Vijayaraghavan, Thejas Nair,
2.
2 © Hortonworks
Inc. 2011–2018. All rights reserved 7000 analysts, 80ms average latency, 1PB data. 250k BI queries per hour On demand deep reporting in the cloud over 100Tb in minutes.
3.
© Hortonworks Inc.
2011- 2018. All rights reserved | 3 Agenda ● Data Analytics Studio ● Apache Hive 3 ● Hive-Spark interoperability ● Performance ● Look ahead
4.
© Hortonworks Inc.
2011- 2018. All rights reserved | 4 Data Analytics Studio
5.
© Hortonworks Inc.
2011- 2018. All rights reserved | 5 Self-service question #1: Why is my query slow? Noisy neighbors Poor schema Inefficient queries Unstable demand Smart query log search Storage Optimizations Query Optimizations Demand Shifting Hortonworks Data Analytics Studio
6.
6 © Hortonworks
Inc. 2011–2018. All rights reserved One of the Extensible DataPlane Services ⬢ DAS 1.0 available now for HDP 3.0! ⬢ Monthly release cadence ⬢ Replaces Hive & Tez Views ⬢ Separate install from stack Hortonworks Data Analytics Studio HORTONWORKS DATAPLANE SERVICE DATA SOURCE INTEGRATION DATA SERVICES CATALOG …DATA LIFECYCLE MANAGER DATA STEWARD STUDIO +OTHER (partner) SECURITY CONTROLS CORE CAPABILITIES MULTIPLE CLUSTERS AND SOURCES MULTIHYBRID *not yet available, coming soon EXTENSIBLE SERVICES IBM DSX* DATA ANALYTICS STUDIO
7.
© Hortonworks Inc.
2011- 2018. All rights reserved | 7 Apache Hive 3
8.
8 © Hortonworks
Inc. 2011–2018. All rights reserved Hive3: EDW analyst pipeline Tableau BI systems Materialized view Surrogate key Constraints Query Result Cache Workload management • Results return from HDFS/cache directly • Reduce load from repetitive queries • Allows more queries to be run in parallel • Reduce resource starvation in large clusters • Active/Passive HA • More “tools” for optimizer to use • More ”tools” for DBAs to tune/optimize • Invisible tuning of DB from users’ perspective • ACID v2 is as fast as regular tables • Hive 3 is optimized for S3/WASB/GCP • Support for JDBC/Kafka/Druid out of the box ACID v2 Cloud Storage Connectors
9.
© Hortonworks Inc.
2011- 2018. All rights reserved | 9 Connectors
10.
10 © Hortonworks
Inc. 2011–2018. All rights reserved Hive-1010: Information schema & sysdb Question: Find which tables have a column with ‘ssn’ as part of the column name? use information_schema; SELECT table_schema, table_name FROM information_schema.columns WHERE column_name LIKE '%ssn%'; Question: Find the biggest tables in the system. use sys; SELECT tbl_name, total_size FROM table_stats_view v, tbls t WHERE t.tbl_id = v.tbl_id ORDER BY cast(v.total_size as int) DESC LIMIT 3;
11.
11 © Hortonworks
Inc. 2011–2018. All rights reserved HIVE-1555: JDBC connector • How did we build the information_schema? • We mapped the metastore into Hive’s table space! • Uses Hive-JDBC connector • Read-only for now • Supports automatic pushdown of full subqueries • Cost-based optimizer decides part of query runs in RDBMS versus Hive • Joins, aggregates, filters, projections, etc
12.
12 © Hortonworks
Inc. 2011–2018. All rights reserved JDBC Table mapping example CREATE TABLE postgres_table ( id INT, name varchar(20) ); CREATE EXTERNAL TABLE hive_table ( id INT, name varchar(20) ) STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler' TBLPROPERTIES ( "hive.sql.database.type" = "POSTGRES", "hive.sql.jdbc.driver"="org.postgresql.Driver", "hive.sql.jdbc.url"="jdbc:postgresql://...", "hive.sql.dbcp.username"="jdbctest", "hive.sql.dbcp.password"="", "hive.sql.query"="select * from postgres_table", "hive.sql.column.mapping" = "id=ID, name=NAME" ); In Postgres In Hive
13.
13 © Hortonworks
Inc. 2011–2018. All rights reserved Druid Connector Realtime Node Realtime Node Realtime Node Broker HiveServer2 Instantly analyze kafka data with milliseconds latency
14.
14 © Hortonworks
Inc. 2011–2018. All rights reserved Druid Connector - Joins between Hive and realtime data in Druid Bloom filter pushdown greatly reduces data transfer Send promotional email to all customers from CA who purchased more than 1000$ worth of merchandise today. create external table sales(`__time` timestamp, quantity int, sales_price double,customer_id bigint, item_id int, store_id int) stored by 'org.apache.hadoop.hive.druid.DruidStorageHandler' tblproperties ( "kafka.bootstrap.servers" = "localhost:9092", "kafka.topic" = "sales-topic", "druid.kafka.ingestion.maxRowsInMemory" = "5"); create table customers (customer_id bigint, first_name string, last_name string, email string, state string); select email from customers join sales using customer_id where to_date(sales.__time) = date ‘2018-09-06’ and quantity * sales_price > 1000 and customers.state = ‘CA’;
15.
15 © Hortonworks
Inc. 2011–2018. All rights reserved Kafka Connector LLAP Node LLAP Node LLAP Node Query Coordinator HiveServer2 Ad-hoc / Ingest / Transform
16.
16 © Hortonworks
Inc. 2011–2018. All rights reserved Kafka connector Transformation over stream in real time I want to have moving average over sliding window in kafka from stock ticker kafka stream. create external table tickers (`__time` timestamp , stock_id bigint, stock_sym varchar(4), price decimal (10,2), exhange_id int) stored by 'org.apache.hadoop.hive.kafka.KafkaStorageHandler’ tblproperties ("kafka.topic" = "stock-topic", "kafka.bootstrap.servers"="localhost:9092", "kafka.serde.class"="org.apache.hadoop.hive.serde2.JsonSerDe"); create external table moving_avg (`__time` timestamp , stock_id bigint, avg_price decimal (10,2) stored by 'org.apache.hadoop.hive.kafka.KafkaStorageHandler' tblproperties ("kafka.topic" = "averages-topic", "kafka.bootstrap.servers"="localhost:9092", "kafka.serde.class"="org.apache.hadoop.hive.serde2.JsonSerDe"); Insert into table moving_avg select CURRENT_TIMESTAMP, stock_id, avg(price) group by stock_id, from tickers where __timestamp > to_unix_timestamp(CURRENT_TIMESTAMP - 5 minutes) * 1000
17.
© Hortonworks Inc.
2011- 2018. All rights reserved | 17 Table types
18.
18 © Hortonworks
Inc. 2011–2018. All rights reserved Managed and External Tables • Hive 3 cleans up semantics of managed and external tables • External: Outside control and management of data • Managed: Fully under Hive control, ACID only • Non-native tables are external • ACID: Full IUD on ORC, Insert-only on other formats • Defaults have changed • Managed: ORC + ACID • External: TextFile • Two tablespaces with different permissions & ownership
19.
19 © Hortonworks
Inc. 2011–2018. All rights reserved Differences between external and managed tables • Storage based auth (doAs=true) is supported for external tables • Ranger and SBA can co-exist in HDP 3 (Ranger is default) • Script to convert from file permissions to Ranger policies on tables Note: SBA in HDP 3 requires ACL in HDFS. ACL is turned on by default in HDP3 Hive managed table ACID on by default No SBA, Ranger auth only Statistics and other optimizations apply Spark access via HiveWarehouseConnector External tables No ACID, Text by default SBA possible Some optimizations unavailable Spark direct file access
20.
20 © Hortonworks
Inc. 2011–2018. All rights reserved ACID v2 V1: CREATE TABLE hello_acid (load_date date, key int, value int) CLUSTERED BY(key) INTO 3 BUCKETS STORED AS ORC TBLPROPERTIES ('transactional'='true'); V2: CREATE TABLE hello_acid_v2 (load_date date, key int, value int); • Performance just as good as non-ACID tables • No bucketing required • Fully compatible with native cloud storage
21.
© Hortonworks Inc.
2011- 2018. All rights reserved | 21 SQL Enhancements
22.
22 © Hortonworks
Inc. 2011–2018. All rights reserved Materialized view Optimizing workloads and queries without changing the SQL SELECT distinct dest,origin FROM flights; SELECT origin, count(*) FROM flights GROUP BY origin HAVING origin = ‘OAK’; CREATE MATERIALIZED VIEW flight_agg AS SELECT dest,origin,count(*) FROM flights GROUP BY dest,origin;
23.
23 © Hortonworks
Inc. 2011–2018. All rights reserved Materialized view - Maintenance • Partial table rewrites are supported • Typical: Denormalize last month of data only • Rewrite engine will produce union of latest and historical data • Updates to base tables • Invalidates views, but • Can choose to allow stale views (max staleness) for performance • Can partial match views and compute delta after updates • Incremental updates • Common classes of views allow for incremental updates • Others need full refresh
24.
24 © Hortonworks
Inc. 2011–2018. All rights reserved Constraints & defaults • Helps optimizer to produce better plans • BI tool integrations • Data Integrity • hive.constraint.notnull.enforce = true • SQL compatibility & offload scenarios Example: CREATE TABLE Persons ( ID Int NOT NULL, Name String NOT NULL, Age Int, Creator String DEFAULT CURRENT_USER(), CreateDate Date DEFAULT CURRENT_DATE(), PRIMARY KEY (ID) DISABLE NOVALIDATE ); CREATE TABLE BusinessUnit ( ID Int NOT NULL, Head Int NOT NULL, Creator String DEFAULT CURRENT_USER(), CreateDate Date DEFAULT CURRENT_DATE(), PRIMARY KEY (ID) DISABLE NOVALIDATE, CONSTRAINT fk FOREIGN KEY (Head) REFERENCES Persons(ID) DISABLE NOVALIDATE );
25.
25 © Hortonworks
Inc. 2011–2018. All rights reserved Default clause & Surrogate keys • Multiple approaches • Sequence number is dense & increasing, but: Bottleneck in distributed DBMS • UUID is easy & distributable, but large and slow • Surrogate key UDF is easy & distributable & fast, but: No sequence and has gaps CREATE TABLE AIRLINES_V2 (ID BIGINT DEFAULT SURROGATE_KEY(), CODE STRING, DESCRIPTION STRING, PRIMARY KEY (ID) DISABLE NOVALIDATE); INSERT INTO AIRLINES_V2 (CODE, DESCRIPTION) SELECT * FROM AIRLINES; ALTER TABLE FLIGHTS ADD COLUMNS (carrier_sk BIGINT); MERGE INTO FLIGHTS f USING AIRLINES_V2 a ON f.uniquecarrier = a.code WHEN MATCHED THEN UPDATE SET carrier_sk = a.id;
26.
26 © Hortonworks
Inc. 2011–2018. All rights reserved ⬢ Solution ● Query fails because of stats estimation error ● Runtime sends observed statistics back to coordinator ● Statistics overrides are created at session, server or global level ● Query is replanned and resubmitted Hive-17626: Optimizer is learning from planning mistakes ⬢ Symptoms ● Memory exhaustion due to under provisioning ● Excessive runtime (future) ● Excessive spilling (future)
27.
© Hortonworks Inc.
2011- 2018. All rights reserved | 27 Multitenancy
28.
28 © Hortonworks
Inc. 2011–2018. All rights reserved HIVE-17481: LLAP workload management ⬢ Effectively share LLAP cluster resources – Resource allocation per user policy; separate ETL and BI, etc. ⬢ Resources based guardrails – Protect against long running queries, high memory usage ⬢ Improved, query-aware scheduling – Scheduler is aware of query characteristics, types, etc. – Fragments easy to pre-empt compared to containers – Queries get guaranteed fractions of the cluster, but can use empty space
29.
29 © Hortonworks
Inc. 2011–2018. All rights reserved Common Triggers ● ELAPSED_TIME ● EXECUTION_TIME ● TOTAL_TASKS ● HDFS_BYTES_READ, HDFS_BYTES_WRITTEN ● CREATED FILES ● CREATED_DYNAMIC_PARTITIONS Example CREATE RESOURCE PLAN guardrail; CREATE TRIGGER guardrail.long_running WHEN EXECUTION_TIME > 2000 DO KILL; ALTER TRIGGER guardrail.long_running ADD TO UNMANAGED; ALTER RESOURCE PLAN guardrail ENABLE ACTIVATE; Guardrail Example
30.
30 © Hortonworks
Inc. 2011–2018. All rights reserved Resource plans example CREATE RESOURCE PLAN daytime; CREATE POOL daytime.bi WITH ALLOC_FRACTION=0.8, QUERY_PARALLELISM=5; CREATE POOL daytime.etl WITH ALLOC_FRACTION=0.2, QUERY_PARALLELISM=20; CREATE RULE downgrade IN daytime WHEN total_runtime > 3000 THEN MOVE etl; ADD RULE downgrade TO bi; CREATE APPLICATION MAPPING tableau in daytime TO bi; ALTER PLAN daytime SET default pool= etl; APPLY PLAN daytime; daytime bi: 80% etl: 20% Downgrade when total_runtime>3000
31.
© Hortonworks Inc.
2011- 2018. All rights reserved | 31 BI caching
32.
32 © Hortonworks
Inc. 2011–2018. All rights reserved HIVE-18513: Query result cache Returns results directly from storage (e.g. HDFS) without actually executing the query If the same query had ran before Important for dashboards, reports etc. where repetitive queries is common Without cache With cache
33.
33 © Hortonworks
Inc. 2011–2018. All rights reserved HIVE-18513: Query result cache details • hive.query.results.cache.enabled=true (on by default) • Works only on hive managed tables • If you JOIN an external table with Hive managed table, Hive will fall back to executing the full query. Because Hive can’t know if external table data has changed • Works with ACID • That means if Hive table has been updated, the query will be rerun automatically • Is different from LLAP cache • LLAP cache is a data cache. That means multiple queries can benefit by avoiding reading from disk. Speeds up the read path. • Result cache effectively bypasses execution of query • Stored at /tmp/hive/__resultcache__/, default space is 2GB, LRU eviction • Tunable setting hive.query.results.cache.max.size (bytes)
34.
34 © Hortonworks
Inc. 2011–2018. All rights reserved Metastore Cache • With query execution time being < 1 sec, compilation time starts to dominate • Metadata retrieval is often significant part of compilation time. Most of it is in RDBMS queries. • Cloud RDBMS As a Service is often slower, and frequent queries leads to throttling. • Metadata cache speeds compilation time by around 50% with onprem mysql. Significantly more improvement with cloud RDBMS. • Cache is consistent in single metastore setup, eventually consistent with HA setup. Consistent HA setup support is in the works.
35.
© Hortonworks Inc.
2011- 2018. All rights reserved | 35 Phew. That was a lot.
36.
36 © Hortonworks
Inc. 2011–2018. All rights reserved Hive 3 feature summary ⬢ EDW offload – Surrogate key and constraints – Information schema – Materialized views ⬢ Perf – Query result & metastore caches – LLAP workload management ⬢ Real-time capabilities with Kafka – Ingest in ACID tables – Instantly query using Druid ⬢ Unified SQL – JDBC connector – Druid connector – Spark-hive connector ⬢ Table types – ACID v2 and on by default – External v Managed ⬢ Cloud – AWS/GCP/Azure cloud storage natively supported now
37.
© Hortonworks Inc.
2011- 2018. All rights reserved | 37 Spark-Hive connect
38.
38 © Hortonworks
Inc. 2011–2018. All rights reserved Hive is broadening its reach SQL over Hadoop • External tables only • No ACID • No LLAP • SBA OK (doAs=True) • Some perf penalty Hive as EDW • ACID • LLAP • doAs=False • Managed tables • Column-level security • Stats, better perf etc. External table/ Direct Hive Warehouse Connector Spark
39.
39 © Hortonworks
Inc. 2011–2018. All rights reserved Driver MetaStore HiveServer2 LLAP DaemonsExecutors Spark Meta Hive Meta Executors LLAP Daemons Isolate Spark and Hive Catalogs/Tables Leverage connector for Spark <-> Hive Uses Apache Arrow for fast data transfer HWC HWC
40.
40 © Hortonworks
Inc. 2011–2018. All rights reserved Driver MetaStore HiveServer2 LLAP DaemonsExecutors Spark Meta Hive Meta HWC (Thrift JDBC) Executors LLAP Daemons a) hive.executeUpdate(“INSERT INTO s SELECT * FROM t”) 1. Driver submits update op to HiveServer2 2. Process update through Tez and/or LLAP 3. HWC returns true on success 1 2 3
41.
41 © Hortonworks
Inc. 2011–2018. All rights reserved Driver MetaStore HiveServer2 LLAP DaemonsExecutors Spark Meta Hive Meta Executors LLAP Daemons b) df.write.format(HIVE_WAREHOUSE_CONNECTOR).save() 1.Driver launches DataWriter tasks 2.Tasks write ORC files 3.On commit, Driver executes LOAD DATA INTO TABLE HDFS /tmp 1 2 3 ACID Tables
42.
42 © Hortonworks
Inc. 2011–2018. All rights reserved Driver MetaStore HiveServer+Tez Executors Spark Meta Hive Meta Executors c) df.write.format(STREAM_TO_STREAM).start() 1.Driver launches DataWriter tasks 2.Tasks open Txns 3.Write rows to ACID tables in Tx ACID Tables1 2 3
43.
© Hortonworks Inc.
2011- 2018. All rights reserved | 43 Performance
44.
44 © Hortonworks
Inc. 2011–2018. All rights reserved • Ran all 99 TPCDS queries • Total query runtime have improved multifold in each release! Benchmark journey TPCDS 10TB scale on 10 node cluster HDP 2.5 Hive1 HDP 2.5 LLAP HDP 2.6 LLAP 25x 3x 2x HDP 3.0 LLAP 2016 20182017 ACID tables
45.
45 © Hortonworks
Inc. 2011–2018. All rights reserved • Faster analytical queries with improved vectorization in HDP 3.0 • Vectorized execution of PTF, rollup and grouping sets. • Perf gain compared to HDP 2.6 • TPCDS query67 ~ 10x! • TPCDS query36 ~ 30x! • TPCDS query27 ~ 20x! OLAP Vectorization
46.
46 © Hortonworks
Inc. 2011–2018. All rights reserved SELECT * FROM ( SELECT AVG(ss_list_price) B1_LP, COUNT(ss_list_price) B1_CNT ,COUNT(DISTINCT ss_list_price) B1_CNTD FROM store_sales WHERE ss_quantity BETWEEN 0 AND 5 AND (ss_list_price BETWEEN 11 and 11+10 OR ss_coupon_amt BETWEEN 460 and 460+1000 OR ss_wholesale_cost BETWEEN 14 and 14+20)) B1, ( SELECT AVG(ss_list_price) B2_LP, COUNT(ss_list_price) B2_CNT ,COUNT(DISTINCT ss_list_price) B2_CNTD FROM store_sales WHERE ss_quantity BETWEEN 6 AND 10 AND (ss_list_price BETWEEN 91 and 91+10 OR ss_coupon_amt BETWEEN 1430 and 1430+1000 OR ss_wholesale_cost BETWEEN 32 and 32+20)) B2, . . . LIMIT 100; TPCDS SQL query 28 joins 6 instances of store_sales table Shared scan - 4x improvement! RS RS RS RS RS Scan store_sales Combined OR’ed B1-B6 Filters B1 Filter B2 Filter B3 Filter B4 Filter B5 Filter Join
47.
47 © Hortonworks
Inc. 2011–2018. All rights reserved • Dramatically improves performance of very selective joins • Builds a bloom filter from one side of join and filters rows from other side • Skips scan and further evaluation of rows that would not qualify the join Dynamic Semijoin Reduction - 7x improvement for q72 SELECT … FROM sales JOIN time ON sales.time_id = time.time_id WHERE time.year = 2014 AND time.quarter IN ('Q1', 'Q2’) Reduced scan on sales
48.
© Hortonworks Inc.
2011- 2018. All rights reserved | 48 We’ve made it! Questions?
49.
© Hortonworks Inc.
2011- 2018. All rights reserved | 49 Oh. One more thing.
50.
50 © Hortonworks
Inc. 2011–2018. All rights reserved ⬢ Hive on Kubernetes solves: – Hive/LLAP side install (to main cluster) – Multiple versions of Hive – Multiple warehouse & compute instances – Dynamic configuration and secrets management – Stateful and work preserving restarts (cache) – Rolling restart for upgrades. Fast rollback to previous good state. Hive on Kubernetes (WIP) Kubernetes Hosting Environments AWS GCP Data OS CPU / MEMORY / STORAGE OPENSHIFTAZURE CLOUD PROVIDERS ON- PREM/HYB RID DATA PLANE SERVICES Cluster Lifecycle Manager Data Analytics Studio (DAS) Organizational Services COMPUTE CLUSTER SHARED SERVICES Ranger Atlas Metastore Tiller API Server DAS Web Service Query Coordinators Query Executors Registry Blobstore Indexe r RDBMS Hive Server Long-running kubernetes cluster Inter-cluster communication Intra-cluster communication Ingress Controller or Load Balancer Internal Service Endpoint for ReplicaSet or StatefulSet Ephemeral kubernetes cluster