Impala Product Update 
Justin Erickson | Director, Product Management 
September 2014 
©2014 Cloudera, Inc. All Rights Reserved.
Agenda 
• Impala releases 
• Impala roadmap 
• Perf update 
Key Milestones and Features 
• Impala 1.0 
• ~SQL-92 (minus correlated sub-queries) 
• Native Hadoop file formats (Parquet, Avro, text, Sequence, …) 
• Enterprise-readiness (authentication, ODBC/JDBC drivers, etc) 
• Service-level resource isolation with other Hadoop frameworks 
• Impala 1.1 
• Fine-grained, role-based authorization via Apache Sentry 
• Auditing (Impala 1.1.1 and CM 4.7+) 
• Impala 1.2 
• Custom language extensibility (UDFs, UDAFs) 
• Cost-based join-order optimization 
• On-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility 
• Impala 1.3 / CDH 5.0 (also has version for CDH 4.x) 
• Resource management 
Just Released 
Impala 1.4 / CDH 5.1 (also with version for CDH 4.x) 
• Additional SQL: 
• DECIMAL data type 
• Additional built-in functions from EDW 
• ORDER BY without LIMIT 
• Continued performance gains: 
• HDFS caching support (CDH 5 only) 
• Faster selective joins 
• Faster COMPUTE STATS 
Impala near-term roadmap 
Targeted for Impala 2.0 (fall 2014): 
• Additional SQL: 
• Analytic/window functions 
• Subqueries in the WHERE clause 
• Additional data types (VARCHAR, CHAR) 
• Disk-based joins and aggregations 
• GRANT/REVOKE 
Considerations for Impala 2.x (priority and inclusion based on your feedback): 
• Nested/complex types (next highest priority) 
• Navigator Lineage 
• Updates via MERGE 
• Incremental stats 
• Additional SQL functions (GROUPING, ROLLUP, CUBE, MINUS, INTERSECT built-ins, etc) 
• UDTFs 
• Intra-node parallel joins and aggregations 
• Even faster performance 
• S3 integration 
SQL-on-Hadoop benchmark: 
Impala, Presto, Stinger, Spark SQL 
• Upcoming benchmarks on latest versions of: 
• Impala (1.4.0) 
• Presto (0.74) 
• Stinger (final) phase 3 => aka Hive 0.13.0 
• Spark SQL (1.1) 
• Published with smaller memory configuration (64 GB / node) 
• Demonstrates leadership is independent of memory size 
• Dropped Shark given its retirement in favor of Hive-on-Spark 
• As always, our public benchmarks are: 
• Based on industry standards (TPC) 
• Repeatable (https://github.com/cloudera/impala-tpcds-kit) 
• Methodical testing with multiple runs on same hardware 
• Help competing software put its best foot forward 
• SQL-92 join style for engines without CBO 
• JVM tuning for Presto 
• Run on optimal file formats for each 
Impala’s Multi-User over 10x faster: 
Gap widening compared to May’s update 
Faster = more work in less time: 
Impala enables over 8.7x throughput 
Performance Takeaways 
• Impala’s advantage expands from 5x single-user to >10x with just 10 users 
• Performance gap has widened since May 
• Single-user advantage over Presto went from 5x to 7.5x 
• Single-user advantage over Hive/Tez went from 5x to 9x 
• Mid-term trends will further favor Impala’s design approach 
• More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap) 
• CPU efficiency will increase in importance 
• Native code enables easy optimizations for CPU instruction sets (e.g. floating point operations, math operations, encrypt/decrypt) 
• The Intel joint roadmap helps support these opportunities 
Try It Out! 
• 100% Apache-licensed open source 
• Downloads on http://impala.io/: 
• Live online 
• VM 
• Installation 
• Questions/comments? 
• Community: http://impala.io/community 
• Email: impala-user@cloudera.org 
Real Time 
Audience 
Dashboard 
September 2014
Introduction 
Tubular Labs 
SAAS Platform for online Video 
Audience Development 
(e.g. Big Data for YouTube videos) 
David Koblas 
VP Engineering, Tubular Labs
Overview 
This presentation covers the work 
Tubular Labs has done to use Impala as 
one of the core components of our SAAS 
platform. We'll go through the pipeline 
for getting data into the system, how 
we've distributed responsibility across 
AWS instances, and other tips and tricks 
for getting real-time responses to our 
end-user queries over billions of data 
points.
User Story: Audience Also Watches 
For any YouTube video, can we figure out 
who the audience is and what other 
videos and channels they are watching? 
We also want the ability to slice the 
audience by demographic information. 
…and have it all run interactively from a 
web SAAS platform.
Tubular App 
Technology Options 
• Pre-compute (e.g. Map/Reduce) 
• MySQL or similar 
• Data Warehouse 
• Impala or Redshift 
• Homebrew
Impala 0.7 
Now we have a technology 
… 
Make it interactive 
… 
and make a bet on Cloudera
Now We Have A Technology 
Time To Make It Fast 
and Economical 
Source: Tubular Labs
Pipeline 
Loading 
• Sqoop 
- collect data from MySQL 
• Hive 
- preprocess data 
Query 
• Impala 
- interactive display 
• Python 
- REST endpoint
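
The last hop of this pipeline, a Python REST endpoint in front of Impala, might look roughly like the sketch below. It is not Tubular's code: the table `audience_watches`, its columns, the `/also-watches` path, and the function names are all assumptions made for illustration, and `run_query` stands in for whichever Impala client the service actually uses.

```python
import json
from urllib.parse import urlparse, parse_qs

# Hypothetical table and column names; the deck does not show the real schema.
QUERY = (
    "SELECT other_video_id, COUNT(*) AS viewers "
    "FROM audience_watches "
    "WHERE video_id = %(video_id)s "
    "GROUP BY other_video_id "
    "ORDER BY viewers DESC "
    "LIMIT %(limit)s"
)


def also_watches(path, run_query):
    """Turn a request path such as /also-watches?video_id=abc into (status, JSON body).

    run_query(sql, params) is an injected callable that must return an
    iterable of (other_video_id, viewers) rows.
    """
    url = urlparse(path)
    args = parse_qs(url.query)
    if url.path != "/also-watches" or "video_id" not in args:
        return 400, json.dumps({"error": "expected /also-watches?video_id=..."})
    params = {
        "video_id": args["video_id"][0],
        "limit": int(args.get("limit", ["10"])[0]),
    }
    rows = run_query(QUERY, params)
    body = {
        "video_id": params["video_id"],
        "also_watches": [{"video_id": v, "viewers": n} for v, n in rows],
    }
    return 200, json.dumps(body)


def fake_impala(sql, params):
    # Canned rows so the sketch runs without a cluster.
    return [("v2", 120), ("v3", 45)][: params["limit"]]
```

Keeping the query runner injectable means the handler can be exercised with `fake_impala` in tests and wired to a real connection in production.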
AWS EC2: Node types 
• m1.xlarge 
- 1.6TB of Instance Storage 
- slow IO 
• hi1.4xlarge 
- 2TB of SSD 
- expensive 
Note: this would be an i2.4xlarge instance today
Managing costs 
Problem 
• hi1.4xlarge - expensive 
• m1.xlarge - slow IO 
Solution – HDFS rack replication for separation 
• One copy of the data on each rack 
• Hive creates tables on m1.xlarge instances 
• Impala queries on hi1.4xlarge instances
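
One way to get this kind of separation is HDFS rack awareness: if the m1.xlarge nodes report one rack and the hi1.4xlarge nodes report another, the default block placement policy spreads replicas across more than one rack once the replication factor is at least 2. The deck does not say how the cluster was actually configured, so the following is only a sketch of a topology script of the kind Hadoop invokes through the `net.topology.script.file.name` setting. The addresses and rack names are made up.

```python
import sys

# Hypothetical host-to-rack table; a real cluster would generate this from its inventory.
RACKS = {
    "10.0.1.11": "/m1-xlarge",    # batch nodes used by Hive
    "10.0.1.12": "/m1-xlarge",
    "10.0.2.21": "/hi1-4xlarge",  # SSD nodes queried by Impala
    "10.0.2.22": "/hi1-4xlarge",
}
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the order given."""
    return [RACKS.get(h, DEFAULT_RACK) for h in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more addresses as arguments and reads rack paths from stdout.
    print(" ".join(resolve(sys.argv[1:])))
```

Grouping nodes by instance type rather than by physical location is the design choice here: the "racks" exist only to steer replica placement.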
Interactive Performance 
Problem 
• Large tables take time to scan 
• No indexes 
• Need to deliver results in < 1 second 
Solution – partitioning (duh!) 
• Partitions are targeted to be between 100 and 200 MB 
• The query log is your friend
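
The partition-size target can be turned into a quick sanity check. The sketch below assumes partition sizes are already known in bytes (for example from a file listing). The helper names and the 150 MB midpoint are illustrative choices, not something stated in the talk.

```python
import math

MB = 1024 * 1024
LOW, HIGH = 100 * MB, 200 * MB  # target range from the slide


def partition_count(total_bytes, target_bytes=150 * MB):
    """Number of roughly equal partitions that puts each one near target_bytes."""
    return max(1, math.ceil(total_bytes / target_bytes))


def out_of_range(partition_sizes):
    """Return the names of partitions whose size falls outside the target range."""
    return sorted(
        name for name, size in partition_sizes.items()
        if not LOW <= size <= HIGH
    )
```

For a 10 GB table, `partition_count(10 * 1024 * MB)` returns 69, which works out to a little under 150 MB per partition. `out_of_range` flags partitions that have drifted too small or too large and may need a different partition key or a re-bucketing pass.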
Tubular App 
Summary 
Impala can back your SAAS application 
• We’re now running version 1.3 
• We’re “spinning” 10TB of data 
• Delivering queries in < 2 seconds 
We’re hiring – but you already knew that.
Bay Area Impala User Group Meetup (Sept 16 2014)
