Cloudera Impala Performance Evaluation
(with Comparison to Hive)

Dec. 8, 2012
CELLANT Corp. R&D Strategy Division
Yukinori SUDA
@sudabon
About Cloudera Impala

•  Latest version is 0.3 beta (as of Dec. 2012)
•  Open-source implementation inspired by Google Dremel
   and F1
•  Developed by the well-known Hadoop distributor Cloudera
•  Brings real-time, ad-hoc query capability to Apache Hadoop
•  Queries data stored in HDFS or Apache HBase
•  Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
•  Supports TextFile and SequenceFile as Hive storage formats
•  Also supports SequenceFile compressed with Snappy, Gzip or
   Bzip2
•  Accesses the data directly through a specialized distributed
   query engine
Architecture

•  The State Store runs as the impala-state-store (statestored) daemon
•  The Query Planner, Query Coordinator and Query Exec Engine run together
   in the impalad daemon
System Environment

•  Installed via Cloudera Manager Free Edition
•  Master (1 server)
   o  HDFS: NameNode, SecondaryNameNode
   o  MapReduceV1: JobTracker
   o  Impala: impalad, impala-state-store (statestored)
•  Slaves (13 servers)
   o  HDFS: DataNode
   o  MapReduceV1: TaskTracker
   o  Impala: impalad
•  All servers are connected with 1 Gbps Ethernet through an L2 switch
Server Specification

•  CPU
   o  Intel Core 2 Duo 2.13 GHz with Hyper-Threading

•  Memory
   o  4GB

•  Disk
   o  7,200 rpm SATA mechanical Hard Disk Drive

•  OS
   o  CentOS 6.2
Benchmark

•  Used CDH 4.1 with Impala versions 0.2 and 0.3
•  Used the hivebench workload from the open-source benchmark tool
   “HiBench”
   o  https://github.com/hibench
•  Scaled the datasets down to 1/10
   o  The default configuration generates a table with 1 billion rows
•  Modified the query
   o  Removed “INSERT INTO TABLE …” to evaluate read-only performance
   o  Removed the “datediff” function (I mistakenly believed it was not supported)
•  Combined several Hive storage formats with several
   compression methods
   o  TextFile, SequenceFile, RCFile
   o  No compression, Gzip, Snappy
•  Compared query latency
   o  Average latency over 5 measurements
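The "average latency over 5 measurements" step can be sketched as below. This is only an illustration of the measurement idea, not the script used for this benchmark: `average_latency` and the `run_query` callable are hypothetical names, and the stand-in workload would have to be replaced by whatever submits the query to Hive or Impala.

```python
import statistics
import time


def average_latency(run_query, runs=5):
    """Return the mean wall-clock latency in seconds over `runs` executions.

    `run_query` is any zero-argument callable that blocks until the query
    has finished (hypothetical; not part of HiBench, Hive or Impala).
    """
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a cluster.
    print(average_latency(lambda: sum(range(100_000))))
```

Using a monotonic clock such as `time.perf_counter` keeps the measurement unaffected by system clock adjustments between runs.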
Modified Datasets

•  Uservisits table
   o  100 million rows
   o  Schema
        •  sourceIP     string
        •  destURL      string
        •  visitDate    string
        •  adRevenue    double
        •  userAgent    string
        •  countryCode  string
        •  languageCode string
        •  searchWord   string
        •  duration     int
•  Rankings table
   o  12 million rows
   o  Schema
        •  pageURL      string
        •  pageRank     int
        •  avgDuration  int
Modified Query

SELECT
  sourceIP,
  sum(adRevenue) as totalRevenue,
  avg(pageRank)
FROM
  rankings R
JOIN (
  SELECT
     sourceIP,
     destURL,
     adRevenue
  FROM
     uservisits UV
  WHERE
     UV.visitDate >= '1999-01-01'
     AND UV.visitDate <= '2001-01-01'
  ) NUV
ON
  (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
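To see what the query returns, here is a minimal sketch that runs it on a few hand-made rows. It uses Python's built-in SQLite as a stand-in, not Hive or Impala, so it says nothing about performance. The query text is read in logical clause order (JOIN subquery, then ON, GROUP BY, ORDER BY, LIMIT) with the date column spelled visitDate as in the schema. The toy rows, the example URLs and the function name `run_on_toy_data` are assumptions made for illustration.

```python
import sqlite3

QUERY = """
SELECT sourceIP,
       sum(adRevenue) AS totalRevenue,
       avg(pageRank)
FROM rankings R
JOIN (
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits UV
    WHERE UV.visitDate >= '1999-01-01'
      AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
"""


def run_on_toy_data():
    """Run the query on an in-memory SQLite database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER)"
    )
    # Only the uservisits columns that the query touches are created here.
    conn.execute(
        "CREATE TABLE uservisits "
        "(sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL)"
    )
    conn.executemany(
        "INSERT INTO rankings VALUES (?, ?, ?)",
        [("a.example", 10, 5), ("b.example", 20, 7)],
    )
    conn.executemany(
        "INSERT INTO uservisits VALUES (?, ?, ?, ?)",
        [
            ("10.0.0.1", "a.example", "2000-03-01", 1.5),
            ("10.0.0.1", "b.example", "2000-06-01", 2.0),
            ("10.0.0.2", "a.example", "2000-07-01", 3.0),
            ("10.0.0.2", "b.example", "2005-01-01", 9.0),  # outside the date range
        ],
    )
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    print(run_on_toy_data())
```

The query returns the single source IP with the highest total ad revenue among visits in the date range, together with the average page rank of the pages it visited.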
Benchmark Result (Hive)
Benchmark Result (Impala 0.2)
Benchmark Result (Impala 0.3)
Conclusion

•  Impala 0.3 is over 10 times faster than MR + Hive (about 11x; Impala 0.2 is about 8x)
   o  Impala 0.3
        •  SequenceFile compressed with Snappy: 14.337 seconds
   o  Impala 0.2
        •  SequenceFile compressed with Gzip: 19.733 seconds
   o  Hive
        •  RCFile compressed with Snappy: 164.161 seconds

•  Hope that Impala version 1.0, included in CDH5,
   will be faster
   o  Support for RCFile and the Trevni columnar format
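The speedup figures follow directly from the three latencies listed in the conclusion. A short worked calculation, assuming the Hive result is the baseline (the dictionary keys and the function name `speedup_over_hive` are labels chosen here for illustration):

```python
# Latencies in seconds, copied from the conclusion slide.
LATENCY_SECONDS = {
    "Impala 0.3 (SequenceFile + Snappy)": 14.337,
    "Impala 0.2 (SequenceFile + Gzip)": 19.733,
    "Hive (RCFile + Snappy)": 164.161,
}


def speedup_over_hive(latencies, baseline="Hive (RCFile + Snappy)"):
    """Return baseline latency divided by each other entry's latency."""
    base = latencies[baseline]
    return {
        name: base / seconds
        for name, seconds in latencies.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, ratio in speedup_over_hive(LATENCY_SECONDS).items():
        print(f"{name}: {ratio:.2f}x")
```

With these numbers, Impala 0.3 comes out at roughly 11.5 times faster than Hive and Impala 0.2 at roughly 8.3 times, so the "over 10 times" figure applies to version 0.3.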
Thank you
