Real-time Big Data Analytics Engine
using Impala
Jason Shih
Etu
28 Sept, HIT 2013
Outline
•  Motivation & Users’ perspective
•  Impala architecture and data analytics stack Overview
•  Performance benchmark
•  Use Cases (Demo)
Motivation & Users’ Perspective
•  Leverage existing Hadoop deployment
•  Reuse HIVE metadata, metastore, DDL & JDBC/ODBC drivers.
•  File formats widely supported in Hadoop
•  Read performance: disk awareness and short-circuit
•  MPP SQL query engine (over Hadoop)
•  billion to trillion records at interactive speeds
•  Both analytical & transactional
•  General purpose & ad-hoc
•  MR
•  High latency, unsuitable for interactive workloads
•  Disk-based intermediate outputs
•  Execution strategies (lack of optimization based on data statistics)
•  Task and scheduling overhead
•  Task launch delay of 5~10 sec (a pre-defined delay caused by the periodic heartbeat used to schedule new tasks).
Motivation & Users’ Perspective (cont’)
•  High performance
•  In memory query engine
•  C++ instead of JAVA
•  Runtime code generation
•  Completely new execution engine (cf. MR framework)
•  Data locality and short-circuit read
•  HDFS-2246: avoid HDFS API overhead
•  HDFS-347: Making Short-Circuit Local Reads Secure
•  Intermediate data never hits disk
•  Data stream to client
Motivation & Users’ Perspective (cont’)
•  MPP-RDB Paradigm
•  HDFS:
•  Scalability & Availability
•  Price Performance & Commodity
•  MPP DW appliance:
•  Exadata, Vertica, HANA, Aster (SQL-MapReduce), HAWQ (Pivotal HD) & Dremel etc.
•  Pros:
•  Very mature & highly optimized engines
•  Cons:
•  Generally not fault-tolerant
!  For long-running queries as the cluster scales up
!  Lack of rich analytics (machine learning)
•  Impala
•  Real-time queries in Apache Hadoop, sitting atop HDFS.
•  ~2010-2012, 7 FTE (Marcel Kornacker)
•  Completely open source, ASLv2
•  GA: connectors for BI and DW generally available
Google F1 - The Fault-Tolerant Distributed RDBMS, May 2012
Ref: http://www.wired.com/wiredenterprise/2012/10/cloudera-impala-hadoop/
Impala Overview: SQL Support
•  Functionality highlight:
•  SQL-92 features minus correlated subqueries
•  SELECT, INSERT INTO ... SELECT ..., INSERT INTO ... VALUES (...)
•  ORDER BY requires LIMIT
•  Flexible file format: RCFile
•  Unsupported/Limitation
•  WITH clause does not support recursive queries
•  Only hash join
•  Joined tables have to fit in the aggregate memory of all executing nodes
•  Nothing beyond SQL:
•  buckets, samples, transforms, arrays, structs, maps, xpath and json
•  UDF support
•  Impala 1.2: supports HIVE UDFs (existing jars without recompiling)
•  Impala native UDFs/UDAs, registered in the metadata catalog
Impala SQL: create table
Ref: SQL Language Element:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_sql.html
Architecture Overview
•  Two daemons:
•  impalad:
•  Run on all HDFS DNs
•  Functions as distributed query engine
•  Handle client and internal requests (query exec)
•  Designs the execution plan for queries and processes them on DNs
•  Thrift services for these two roles
•  statestored:
•  Cluster metadata, name service & metadata distribution
–  cf. HIVE metastore: RDB metadata
•  Metadata is updated when impalad processes are added or removed
•  Daemons cache metadata (refresh with INVALIDATE METADATA or REFRESH)
•  Exports a Thrift service
•  Sends periodic heartbeats, checks for live backends and pushes new data
•  A statestore failure won't affect query execution, apart from stale DN state
Architecture Overview: Impala daemons
•  Impalad:
•  Impala 1.1 integrates Sentry as a fine-grained authorization framework
•  Daemon startup args (default):
•  impalad -log_dir=/opt/impala/var/log/impala -state_store_port=24000 -state_store_host=impala-server -be_port=22000
•  Enabling security
•  Relies on the existing Kerberos subsystem as the authentication framework
•  -use_statestore -kerberos_reinit_interval=60 -principal=impala/impalad-server@TESTDOMAIN.COM -keytab_file=impala.keytab
•  Authorization:
•  -authorization_policy_file arg, fed with an .ini-format file
•  divided into [groups] & [roles] (optional: [databases] & [users])
•  [users] overrides the OS-level mapping of users to groups.
•  E.g.:
•  Statestored:
•  daemon startup:
•  statestored -log_dir=/opt/impala/var/log/impala -state_store_port=24000
•  Enable Kerberos:
•  -kerberos_reinit_interval=60 -principal=impala/statestored-server@TESTDOMAIN.COM -keytab_file=impala.keytab
•  Available flags:
•  http://statestored-server:25010/varz
Architecture Overview (cont’)
•  Query execution phases
•  Planner, coordinator, executor
•  Queries arrive via JDBC/ODBC, Thrift API/CLI, Hue/Beeswax
•  Planner turns request into collections of plan fragments
•  Coordinator initiates execution on impalad(s) local to data
Architecture Overview: Query Execution
•  Plan fragments are created upon a request from a JDBC/ODBC or Thrift client
•  The coordinator initiates execution on the impalad instances
•  Intermediate results are streamed between impalad instances
•  Results are streamed back to client
Architecture Overview: Query Plan
•  Plan node & operators:
•  Depth-first execution tree
•  Scan, HashJoin, HashAggr, Union, TopN, Exchange
•  Two-phase process
•  Single node plan (left-deep tree)
•  Plan fragments: Partitioning operator tree
•  Fragment: distributed atomic executable unit (plan nodes)
•  Distributed plans:
•  Query operators are fully distributed
•  Max. scan locality & min. data movement
•  Parallel joins:
•  Order: FROM clause
•  Broadcast join & partitioned join
•  Future roadmap: cost-based optimization based on column stats & cost of data transfers
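The two parallel join strategies listed above (broadcast join and partitioned join) can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions: the node count, the sample rows and the helper names (`hash_join`, `broadcast_join`, `partitioned_join`) are hypothetical and are not Impala APIs.

```python
from collections import defaultdict

def hash_join(left, right):
    """Local hash join on the first field: build on `right`, probe with `left`."""
    table = defaultdict(list)
    for key, value in right:
        table[key].append(value)
    return [(key, lval, rval) for key, lval in left for rval in table.get(key, [])]

def broadcast_join(left_parts, right):
    """Each node keeps its local slice of `left` and receives all of `right`."""
    out = []
    for part in left_parts:
        out.extend(hash_join(part, right))
    return out

def partitioned_join(left, right, num_nodes):
    """Both inputs are hash-partitioned on the join key, then joined per node."""
    lparts = [[] for _ in range(num_nodes)]
    rparts = [[] for _ in range(num_nodes)]
    for row in left:
        lparts[hash(row[0]) % num_nodes].append(row)
    for row in right:
        rparts[hash(row[0]) % num_nodes].append(row)
    out = []
    for lp, rp in zip(lparts, rparts):
        out.extend(hash_join(lp, rp))
    return out

# Hypothetical sample data: (join_key, payload)
facts = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
dims = [(1, "x"), (2, "y")]

# Simulate two nodes, each holding a slice of the larger table.
broadcast_result = broadcast_join([facts[:2], facts[2:]], dims)
partitioned_result = partitioned_join(facts, dims, num_nodes=2)
```

Both strategies return the same rows. In this model the broadcast variant ships the whole smaller input to every node, while the partitioned variant ships only a portion of each input to each node.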
Architecture Overview: Query Plan (cont’)
Logging and Profile
•  Impala logs:
•  Logging level controlled by
•  GLOG_v environment variable (glog verbosity)
–  Level 1 (default): connection logging and execution profile
–  Level 2: each RPC initiated and execution progress info
–  Level 3: everything above plus logging of every row read
•  -logbuflevel daemon startup flag.
•  Examine:
•  $IMPALA_HOME/var/log/impala/{impalad,statestore}.{INFO,WARNING,ERROR}
•  Consolidated: impala-server.log & impala-state-store.log
•  http://impalad-server:25000/logs
•  Content:
•  Startup options: CPU, available spindles, flags, version and machine info
•  Query profile: composition, degree of data locality, throughput statistics and response time.
•  Audit logging available since release 1.1.1
•  Extensive analytics data for query execution:
•  query profiles stored in zlib-compressed format:
•  $IMPALA_HOME/var/log/impala/profiles
•  http://impalad-server:25000/queries
Performance Tip
•  Partitioning
•  Large tables that are always or almost always queried with conditions on the partitioning columns
•  JOIN
•  Broadcast join by default.
•  Partitioned join
•  suitable for large tables of roughly equal size
•  subsets of rows can be processed in parallel by sending a portion of each table to different nodes
•  Join the biggest table first
•  Then join the table with the most selective filter
•  INSERT
•  not suitable for loading large quantities of data into HDFS-based tables, due to the lack of parallelized operations
•  Stage temporary files in an ETL pipeline, upload them to HDFS and REFRESH
•  Resource usage:
•  Impalad startup flag: “-mem_limit”
Troubleshooting Hint
•  Queries are slow?
•  Test: “select count(*) from table”
•  A non-zero “Total remote scan volume” in the impalad log indicates that either some DNs are not running impalad or an impalad instance fails to contact one or more other impalad instances.
•  Missing impalad instances on DNs
•  live backends: http://statestore:25010/metrics
•  Data locality and native checksumming (>= CDH 4.2)
•  Enable properties: “dfs.client.read.shortcircuit” & “dfs.client.read.shortcircuit.skip.checksum”
•  Rebuild/reinstall the Hadoop native lib “libhadoop.so” if needed.
•  Errors:
–  Unknown disk id. This will negatively affect performance. Check your hdfs settings to enable block location metadata
–  Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Troubleshooting Hint (cont’)
•  Queries getting slower?
•  Impalad starts paging once the memory limit is exceeded
•  E.g.: mem-limit.h:86] Query: 0:0Exceeded limit: limit=26996031488 consumption=26996148624
•  Incorrect results?
•  Invalid metadata (GA: REFRESH, post-GA: INVALIDATE METADATA)
•  Invalid query?
•  Cross check the query in HIVE
•  Useful debugging info from impala service logs.
•  Invalid/unsupported statements:
•  http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref.html#langref
•  Auth errors:
•  Server logging:
•  Minor code may provide more information (Cannot contact any KDC for realm or Kerberos:
•  GSSAPI Error: Unspecified GSS failure
•  Client: “Error connecting: <class 'thrift.transport.TTransport.TTransportException'>, TSocket read 0 bytes”
•  Ensure:
•  a valid Kerberos ticket lifetime at the client
•  the “-s” service principal and the “-k” flag are specified for a kerberized impalad connection.
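The memory-limit log line quoted above (under "Queries getting slower?") can be read more easily with a small parser. The sketch below assumes the `limit=... consumption=...` layout shown in that one example line; the function name `parse_mem_limit` is hypothetical and the regular expression may not match other Impala versions.

```python
import re

# Log line as quoted in the slide.
LOG_LINE = ("mem-limit.h:86] Query: 0:0Exceeded limit: "
            "limit=26996031488 consumption=26996148624")

def parse_mem_limit(line):
    """Return (limit_bytes, consumption_bytes), or None if the line does not match."""
    match = re.search(r"limit=(\d+)\s+consumption=(\d+)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

limit, consumption = parse_mem_limit(LOG_LINE)
overage = consumption - limit   # bytes above the configured limit
limit_gib = limit / 2**30       # configured limit expressed in GiB
```

For the quoted line, the query exceeded a limit of roughly 25 GiB by a little over 100 KB, which suggests the limit itself, not a runaway allocation, is the thing to revisit.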
Limitation and Wish List
•  Limitations:
•  Subqueries referenced in the SELECT
•  Optional WITH clause before the INSERT.
•  Recursive queries in the WITH clauses
•  Inconsistent VIEW
•  Parentheses in WHERE clauses
•  Wish list
•  SQL modeling tool
•  Fault-tolerant queries
•  Memory management (caching parquet table) & usage estimation
•  Aggregation group of columns (> 30 etc.)
Impala: Now & Future Roadmap
•  Now (1.1.x/1.0)
•  OS Support:
•  RHEL/CentOS 5.7, Ubuntu, Debian, SLES, and Oracle Linux
•  Connectors: JDBC/ODBC drivers
•  DDL support & SQL performance optimization
•  Fast & memory efficient: join & aggregation
•  File format: Parquet, Avro & LZO compressed
•  Future (1.2) – late 2013
•  UDF and extensibility
•  Automatic metadata refresh
•  In-memory HDFS caching
•  Cost-based join order optimization
•  Preview of YARN-integrated resource manager
•  2.0 Roadmap – first third of 2014
•  SQL 2003-compliant analytic window functions
•  Additional authentication mechanisms
•  UDTFs (user-defined table functions)
•  Intra-node parallelized aggregations and joins
•  Nested data
•  YARN-integrated resource manager
•  Additional data types – including Date and Decimal types
More Information & Related Works
•  “Dremel: Interactive Analysis of Web-Scale Datasets”, Sergey Melnik et al., Google
•  Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
•  “Impala unlocks Interactive BI on Hadoop with MicroStrategy”, Justin Erickson & Jochen Demuth, Cloudera
•  “Cloudera Impala Performance Evaluation”, Yukinori SUDA
•  “HANA vs Impala, on AWS Cloud”, Aron MacDonald
•  “Spark and Shark: High-speed In-memory Analytics over Hadoop Data”, Reynold Xin, AMPLab
•  Stinger Initiative http://hortonworks.com/blog/100x-faster-hive/
•  Apache Drill: http://incubator.apache.org/drill/
Performance Evaluation
[Figure: bar chart of top speed in km/h (axis 0 to 100) for Shark, Impala, PIG and Elephant]
Ref: Wiki & http://www.speedofanimals.com
Breakdown of DNS Anomaly Analytics
Two DN + Master
-  Dual DC E5620 2.40GHz
-  MEM 32GB ea.
-  4 spindles, 2T ea.
[Figure: chart with axes HDFS (GB) and Query Resp. (sec)]
Data Volume and Ingest
                   1D     1W     1M      2M
Data (Raw) (GB)    5.1    35     140     280
Data (HDFS) (GB)   3.8    25.9   103.6   207.2
Blocks (HDFS)      31     211    844     1598
MEvt               42     291    1,166   2,209
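The figures in the table above can be turned into two derived ratios: how much of the raw volume remains once stored in HDFS, and how many events each raw GB carries. This is a small arithmetic sketch; reading "MEvt" as millions of events is an assumption, and the variable names are illustrative.

```python
# Figures copied from the table; keys are the slide's column labels.
raw_gb = {"1D": 5.1, "1W": 35, "1M": 140, "2M": 280}
hdfs_gb = {"1D": 3.8, "1W": 25.9, "1M": 103.6, "2M": 207.2}
mevt = {"1D": 42, "1W": 291, "1M": 1166, "2M": 2209}  # assumed: millions of events

# Fraction of the raw size that remains once the data is stored in HDFS.
hdfs_ratio = {k: round(hdfs_gb[k] / raw_gb[k], 3) for k in raw_gb}

# Millions of events per raw GB, per period.
mevt_per_raw_gb = {k: round(mevt[k] / raw_gb[k], 2) for k in raw_gb}
```

The HDFS footprint stays at about 74% of the raw size across all four periods, and the event density stays around 8 million events per raw GB, consistent with a roughly linear growth in data volume.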
PIG vs. Impala
•  Domain level compute in preprocessing streaming.
•  DN sort throughput: ~120MB/s & SIP/Qry ~50MB/s.
•  Processing time scales linearly with data volume.
[Figure: query response time (sec), PIG vs. Impala. Impala: 71 s, 7 times faster.]
Observation & Estimation
•  Speed-up: 4.5~7 times
•  DL Calc.: 57~70% memory usage
•  Data ingest
!  Est. ~3TB takes ~55K sec.
•  Plus pre-processing time
!  Throughput constrained by the GbE link (inbound/outbound)
!  Avg. throughput ~80MB/s
•  non-encrypted file transfer
•  RTQ: ~15K sec to process 3TB
!  cf. 115K sec based on MR
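The estimates above can be cross-checked with a few lines of arithmetic. This is a sketch, not a measurement: decimal units (1 TB = 1,000,000 MB) are an assumption, and the constant names are illustrative.

```python
# Figures from the bullets above; decimal units are an assumption.
DATA_MB = 3 * 1_000_000    # ~3 TB expressed in MB
INGEST_SEC = 55_000        # estimated ingest time for ~3 TB
AVG_MB_PER_SEC = 80        # average transfer throughput
IMPALA_SEC = 15_000        # RTQ processing estimate for 3 TB
MR_SEC = 115_000           # MapReduce-based estimate

# Pure transfer time at the average throughput.
transfer_sec = DATA_MB / AVG_MB_PER_SEC

# Effective end-to-end rate implied by the 55K sec ingest estimate.
effective_mb_per_sec = DATA_MB / INGEST_SEC

# Ratio between the MR and RTQ processing estimates.
speedup = MR_SEC / IMPALA_SEC
```

At 80 MB/s the transfer alone takes 37,500 sec, the 55K sec estimate corresponds to an effective rate of about 54.5 MB/s, and the two processing estimates differ by a factor of about 7.7.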
Query Throughput & Latency
•  Queries
•  20 from TPC-DS
•  3 categories
•  Interactive: 1month
•  Reports: several months
•  Deep analytics: all data
•  Fact table:
•  1TB snappy-seq.-files/5Yr
•  Resource level:
•  20 nodes, 24cores/node.
•  Speed-up:
•  Interactive: 25~68
•  Reports: 6~56
•  Deep analytics: 6~55
Ref: “Impala: A Modern, Open-Source SQL Engine for Hadoop”, Marcel Kornacker, Cloudera
Impala vs. Stinger
•  Stinger
•  Optimized execution plan
•  Tez framework optimizes execution
•  Columnar file format
Ref: Cloudera Impala Overview, Scott Leberknight, Cloudera.
Impala Use Cases
Offloads DW for ad hoc query environment, ETL and archiving
Interactive BI/analytics on large volume of data
Real-time response for unstructured data analysis
Impala and HIVE
•  Impala:
•  Native MPP query engine for low runtime overhead & interactive SQL
•  No fault tolerance
•  GA: UDF supported
•  HIVE
•  MapReduce as an execution engine
•  Fault-tolerant leveraging MR framework
•  High runtime overhead (extensive layering)
•  UDF
•  Common for client:
•  SQL syntax
•  highly compatible with HiveQL
•  ODBC/JDBC drivers
•  Metadata (table definition)
•  HUE
Data Warehouse Offload
Ref: Hadoop and the Data Warehouse: When to Use Which, Teradata
Query Run Times
•  Table with 60M Records
Ref: HANA vs Impala, on AWS Cloud
TPC-H Query Run Times
•  Lineitem table 60M Rows
Ref: HANA vs Impala, on AWS Cloud
•  On-demand customer segmentation based on various demographic and mobile behavior attributes
•  On-demand customer profiling through fast screening & ranking of critical attributes
With the power of distributed in-memory computation on Hadoop, Impala enables market analysts to conduct various interactive analytics such as OLAP, statistical correlation, and data mining on big data.
“Target segment” associated-attribute analysis
[Figure: pie charts of segment, social network and app usage shares. Social network panels: Facebook 43%, Twitter 31%, Google+ 19%, LinkedIn 7%; and Facebook 44%, Twitter 30%, Google+ 17%, LinkedIn 9%.]
DEMO
•  CREATE TABLE, LOAD DATA from HDFS
DROP TABLE IF EXISTS demo;
CREATE EXTERNAL TABLE demo
(
a string,
b int,
c int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/etu/demo';
•  PIG & Impala:
•  SUM
•  SUM with GROUP BY
DEMO (cont’)
•  SUM in PIG:
a = load 'demo/demo_data.csv' using PigStorage(',') as (col1:chararray, col2:int, col3:int);
b = foreach a generate col1, col2, col3, 1 as col4;
d = group b by col4;
d1 = foreach d generate SUM(b.col4);
store d1 into 'demo/count2' using PigStorage(',');
•  SUM in Impala:
SELECT sum(demo.c)
FROM demo;
DEMO (cont’)
•  SUM with GROUP BY in PIG
a = load 'demo/demo_data.csv' using PigStorage(',') as (col1:chararray, col2:int, col3:int);
b = foreach a generate col1, col2, col3, 1 as col4;
c = group b by col1;
c1 = foreach c generate group, SUM(b.col2);
store c1 into 'demo/count1' using PigStorage(',');
•  SUM with GROUP BY in Impala
SELECT demo.a AS tag,
sum(demo.b) AS val
FROM demo
GROUP BY demo.a;
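The two Impala queries above can be tried locally without a cluster. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for Impala (an assumption made for illustration, not how the demo was run); the sample rows are hypothetical, and the table mirrors the `demo(a, b, c)` columns from the CREATE TABLE slide.

```python
import sqlite3

# Hypothetical sample rows for the demo(a, b, c) table.
rows = [("x", 1, 10), ("y", 2, 20), ("x", 3, 30)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (a TEXT, b INTEGER, c INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?, ?, ?)", rows)

# SUM over the whole table, as in the first Impala query.
total_c = conn.execute("SELECT sum(demo.c) FROM demo").fetchone()[0]

# SUM with GROUP BY, as in the second Impala query (ORDER BY added for stable output).
per_tag = conn.execute(
    "SELECT demo.a AS tag, sum(demo.b) AS val FROM demo GROUP BY demo.a ORDER BY tag"
).fetchall()
conn.close()
```

With these sample rows, `total_c` is 60 and `per_tag` holds one `(tag, val)` pair per distinct value of column `a`.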
DEMO (cont’)
•  Speed-up:
[Figure: query response time (sec), with annotated speed-ups of x60 and x18]
Two DNs, same spec as for the DNS log analytics.
Dual DC E5620, MEM 32GB ea.
~100 times faster when the cluster scales.
Questions?
jasonshih@etusolution.com
Slideshare
www.slideshare.net/hlshih/hit2013-impala-0925etu
Acknowledgement
Dr. CM Fan, MFactory, SYSTEX
www.etusolution.com
info@etusolution.com
Taipei, Taiwan
318, Rueiguang Rd., Taipei 114, Taiwan
T: +886 2 7720 1888
F: +886 2 8798 6069
Beijing, China
Room B-26, Landgent Center,
No. 24, East Third Ring Middle Rd.,
Beijing, China 100022
T: +86 10 8441 7988
F: +86 10 8441 7227
Contact

Más contenido relacionado

La actualidad más candente

Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impalamarkgrover
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentationhadooparchbook
 
Cloudera Impala technical deep dive
Cloudera Impala technical deep diveCloudera Impala technical deep dive
Cloudera Impala technical deep divehuguk
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Scott Leberknight
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopChicago Hadoop Users Group
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopCloudera, Inc.
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopCloudera, Inc.
 
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Cloudera, Inc.
 
HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017larsgeorge
 
Impala Resource Management - OUTDATED
Impala Resource Management - OUTDATEDImpala Resource Management - OUTDATED
Impala Resource Management - OUTDATEDMatthew Jacobs
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisFelicia Haggarty
 
Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016Wangda Tan
 
Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Yukinori Suda
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov
 
HBaseCon 2012 | HBase Filtering - Lars George, Cloudera
HBaseCon 2012 | HBase Filtering - Lars George, ClouderaHBaseCon 2012 | HBase Filtering - Lars George, Cloudera
HBaseCon 2012 | HBase Filtering - Lars George, ClouderaCloudera, Inc.
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and OracleTanel Poder
 
Admission Control in Impala
Admission Control in ImpalaAdmission Control in Impala
Admission Control in ImpalaCloudera, Inc.
 

La actualidad más candente (20)

Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Cloudera impala
Cloudera impalaCloudera impala
Cloudera impala
 
Cloudera Impala technical deep dive
Cloudera Impala technical deep diveCloudera Impala technical deep dive
Cloudera Impala technical deep dive
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
 
An Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache HadoopAn Introduction to Impala – Low Latency Queries for Apache Hadoop
An Introduction to Impala – Low Latency Queries for Apache Hadoop
 
Impala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for HadoopImpala 2.0 - The Best Analytic Database for Hadoop
Impala 2.0 - The Best Analytic Database for Hadoop
 
Impala: Real-time Queries in Hadoop
Impala: Real-time Queries in HadoopImpala: Real-time Queries in Hadoop
Impala: Real-time Queries in Hadoop
 
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
 
Cloudera Impala
Cloudera ImpalaCloudera Impala
Cloudera Impala
 
HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017HBase and Impala Notes - Munich HUG - 20131017
HBase and Impala Notes - Munich HUG - 20131017
 
Impala Resource Management - OUTDATED
Impala Resource Management - OUTDATEDImpala Resource Management - OUTDATED
Impala Resource Management - OUTDATED
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016Hadoop Summit - Scheduling policies in YARN - San Jose 2016
Hadoop Summit - Scheduling policies in YARN - San Jose 2016
 
Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)Performance evaluation of cloudera impala (with Comparison to Hive)
Performance evaluation of cloudera impala (with Comparison to Hive)
 
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLCompressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL
 
HBaseCon 2012 | HBase Filtering - Lars George, Cloudera
HBaseCon 2012 | HBase Filtering - Lars George, ClouderaHBaseCon 2012 | HBase Filtering - Lars George, Cloudera
HBaseCon 2012 | HBase Filtering - Lars George, Cloudera
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
 
Admission Control in Impala
Admission Control in ImpalaAdmission Control in Impala
Admission Control in Impala
 

Destacado

Hadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or questionHadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or questionDataWorks Summit
 
20150207 何故scalaを選んだのか
20150207 何故scalaを選んだのか20150207 何故scalaを選んだのか
20150207 何故scalaを選んだのかKatsunori Kanda
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSSN Masahiro
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTAmrit Chhetri
 
Impala 2.0 Update #impalajp
Impala 2.0 Update #impalajpImpala 2.0 Update #impalajp
Impala 2.0 Update #impalajpCloudera Japan
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoopjdcryans
 
SecPod: A Framework for Virtualization-based Security Systems
SecPod: A Framework for Virtualization-based Security SystemsSecPod: A Framework for Virtualization-based Security Systems
SecPod: A Framework for Virtualization-based Security SystemsYue Chen
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
[Azure Deep Dive] Spark と Azure HDInsight によるビッグ データ分析入門 (2017/03/27)
[Azure Deep Dive] Spark と Azure HDInsight によるビッグ データ分析入門 (2017/03/27)[Azure Deep Dive] Spark と Azure HDInsight によるビッグ データ分析入門 (2017/03/27)
[Azure Deep Dive] Spark と Azure HDInsight によるビッグ データ分析入門 (2017/03/27)Naoki (Neo) SATO
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...
Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...
Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...MapR Technologies Japan
 
Hadoopカンファレンス20140707
Hadoopカンファレンス20140707Hadoopカンファレンス20140707
Hadoopカンファレンス20140707Recruit Technologies
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteAmr Awadallah
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache KuduJeff Holoman
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataCloudera, Inc.
 

Destacado (20)

Hadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or questionHadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or question
 
20150207 何故scalaを選んだのか
20150207 何故scalaを選んだのか20150207 何故scalaを選んだのか
20150207 何故scalaを選んだのか
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSS
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Impala 2.0 Update #impalajp
Impala 2.0 Update #impalajpImpala 2.0 Update #impalajp
Impala 2.0 Update #impalajp
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
Using Apache Drill
Using Apache DrillUsing Apache Drill
Using Apache Drill
 
Apache drill
Apache drillApache drill
Apache drill
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
 
SecPod: A Framework for Virtualization-based Security Systems
SecPod: A Framework for Virtualization-based Security SystemsSecPod: A Framework for Virtualization-based Security Systems
SecPod: A Framework for Virtualization-based Security Systems
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
[Azure Deep Dive] Spark と Azure HDInsight によるビッグ データ分析入門 (2017/03/27)
[Azure Deep Dive] Spark と Azure HDInsight によるビッグ データ分析入門 (2017/03/27)[Azure Deep Dive] Spark と Azure HDInsight によるビッグ データ分析入門 (2017/03/27)
[Azure Deep Dive] Spark と Azure HDInsight によるビッグ データ分析入門 (2017/03/27)
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...
Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...
Hadoop最新情報 - YARN, Omni, Drill, Impala, Shark, Vertica - MapR CTO Meetup 2014...
 
Hadoopカンファレンス20140707
Hadoopカンファレンス20140707Hadoopカンファレンス20140707
Hadoopカンファレンス20140707
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 

Similar a Real-time Big Data Analytics Using Impala

impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfssusere05ec21
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Real-time Big Data Analytics Using Impala

  • 1. Real-time Big Data Analytics Engine using Impala Jason Shih Etu 28 Sept, HIT 2013
  • 2. Outline
      •  Motivation & users’ perspective
      •  Impala architecture and data analytics stack overview
      •  Performance benchmark
      •  Use cases (demo)
  • 3. Motivation & Users’ Perspective
      •  Leverage existing Hadoop deployment
      •  Reuse HIVE metadata, metastore, DDL & JDBC/ODBC drivers
      •  File formats widely supported in Hadoop
      •  Read performance: disk awareness and short-circuit reads
      •  MPP SQL query engine (over Hadoop)
      •  Billions to trillions of records at interactive speeds
      •  Both analytical & transactional
      •  General purpose & ad hoc
      •  MR
      •  High latency, unsuitable for interactive workloads
      •  Disk-based intermediate outputs
      •  Execution strategies (lack of optimization based on data statistics)
      •  Task and scheduling overhead
      •  Task launch delay of 5~10 sec (pre-defined delay due to the periodic heartbeat for newly scheduled tasks)
  • 4. Motivation & Users’ Perspective (cont’)
      •  High performance
      •  In-memory query engine
      •  C++ instead of JAVA
      •  Runtime code generation
      •  Completely new execution engine (cf. MR framework)
      •  Data locality and short-circuit read
      •  HDFS-2246: avoid HDFS API overhead
      •  HDFS-34: Making Short-Circuit Local Reads Secure
      •  Intermediate data never hits disk
      •  Data streamed to client
  • 5. Motivation & Users’ Perspective (cont’)
      •  MPP-RDB paradigm
      •  HDFS:
      •  Scalability & availability
      •  Price/performance & commodity hardware
      •  MPP DW appliance:
      •  Exadata, Vertica, HANA, Aster (SQL-MapReduce), HAWQ (Pivotal HD), Dremel, etc.
      •  Pros:
      •  Very mature & highly optimized engines
      •  Cons:
      •  Generally not fault-tolerant
          –  For long-running queries when the cluster scales up
          –  Lack of rich analytics (machine learning)
  • 6. Impala
      •  Real-time queries in Apache Hadoop, sitting atop HDFS
      •  ~2010-2012, 7 FTE (Marcel Kornacker)
      •  Completely open source, ASLv2
      •  GA: connectors for BI and DW generally available
      •  Google F1 - The Fault-Tolerant Distributed RDBMS, May 2012
      Ref: http://www.wired.com/wiredenterprise/2012/10/cloudera-impala-hadoop/
  • 7. Impala Overview: SQL Support
      •  Functionality highlights:
      •  SQL-92 features minus correlated subqueries
      •  SELECT, INSERT INTO … SELECT …, INSERT INTO … VALUES(…)
      •  ORDER BY requires LIMIT
      •  Flexible file format: RCFile
      •  Unsupported features and limitations:
      •  WITH clause does not support recursive queries
      •  Only hash join
      •  Joined tables have to fit in the aggregate memory of all executing nodes
      •  No beyond-SQL features:
      •  buckets, samples, transforms, arrays, structs, maps, XPath and JSON
      •  UDF support
      •  Impala 1.2: supports HIVE UDFs (existing jars without recompiling)
      •  Impala native UDF/UDA, with UDFs/UDAs registered in the metadata catalog
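The "ORDER BY requires LIMIT" restriction can be read as a bounded top-N computation: with a row limit, a sorted result needs memory for only `limit` candidate rows. The following is a minimal Python sketch of that idea, not Impala's implementation; the function name `order_by_limit` and the sample rows are hypothetical.

```python
import heapq
from operator import itemgetter

def order_by_limit(rows, key, limit):
    """Return the `limit` smallest rows by `key`.

    heapq.nsmallest consumes the iterable once and keeps at most
    `limit` candidates, so the full input is never sorted in memory.
    """
    return heapq.nsmallest(limit, rows, key=key)

# Hypothetical (name, score) rows, streamed from a generator.
rows = ((name, score) for name, score in [("a", 5), ("b", 1), ("c", 9), ("d", 3)])
top2 = order_by_limit(rows, key=itemgetter(1), limit=2)
print(top2)
```

Without a limit, the same query would have to materialize and sort every row, which is the case an in-memory engine tries to avoid.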
  • 8. Impala SQL: CREATE TABLE
      Ref: SQL Language Elements: http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref_sql.html
  • 9. Architecture Overview
      •  Two daemons:
      •  impalad:
      •  Runs on all HDFS DNs
      •  Functions as the distributed query engine
      •  Handles client and internal requests (query execution)
      •  Designs the execution plan for queries and processes queries on the DNs
      •  Thrift services for these two roles
      •  statestored:
      •  Cluster metadata, name service & metadata distribution
          –  cf. HIVE metastore: RDB metadata
      •  Metadata is updated when impalad processes are added or removed
      •  Daemons cache metadata (INVALIDATE METADATA or REFRESH)
      •  Exports a Thrift service
      •  Sends periodic heartbeats, checks for live backends and pushes new data
      •  Failure of the statestore won't affect query execution, except that the state of the DNs may become stale
  • 10. Architecture Overview: Impala daemons
      •  Impalad:
      •  Impala 1.1 integrates Sentry as a fine-grained authorization framework
      •  Daemon startup args (default):
      •  impalad -log_dir=/opt/impala/var/log/impala -state_store_port=24000 -state_store_host=impala-server -be_port=22000
      •  Enabling security:
      •  Relies on the existing Kerberos subsystem for authentication
      •  -use_statestore -kerberos_reinit_interval=60 -principal=impala/impalad-server@TESTDOMAIN.COM -keytab_file=impala.keytab
      •  Authorization:
      •  -authorization_policy_file arg, fed with a file in .ini format
      •  divided into [groups] & [roles] (optional: [databases] & [users])
      •  [users] overrides the OS-level mapping of users to groups
      •  E.g.:
      •  Statestored:
      •  Daemon startup:
      •  statestored -log_dir=/opt/impala/var/log/impala -state_store_port=24000
      •  Enabling Kerberos:
      •  -kerberos_reinit_interval=60 -principal=impala/statestored-server@TESTDOMAIN.COM -keytab_file=impala.keytab
      •  Available flags:
      •  http://statestored-server:25010/varz
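To illustrate the .ini-style policy file with [groups] and [roles] sections, here is a small Python sketch that parses such a file and resolves a group to its privileges. The group name `analysts`, the role name `read_only_role`, the server name `server1` and the privilege string are all hypothetical, and the layout is an assumption based on the section names on the slide, not a copy of the presenter's file.

```python
import configparser

# Hypothetical policy file: one group mapped to one role, one role mapped to one privilege.
POLICY = """
[groups]
analysts = read_only_role

[roles]
read_only_role = server=server1->db=default->table=*->action=select
"""

# Split only on the first "=" so that "=" inside the privilege string is kept.
parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
parser.optionxform = str  # keep names case-sensitive
parser.read_string(POLICY)

def privileges_for(group):
    """Map each role granted to `group` onto its privilege string."""
    roles = [r.strip() for r in parser["groups"][group].split(",")]
    return {role: parser["roles"][role] for role in roles}

print(privileges_for("analysts"))
```

A file of this shape would be passed to impalad through the `-authorization_policy_file` argument named on the slide.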
  • 11. Architecture Overview (cont’)
      •  Query execution phases
      •  Planner, coordinator, executor
      •  Queries arrive via JDBC/ODBC, Thrift API/CLI, Hue/Beeswax
      •  Planner turns the request into a collection of plan fragments
      •  Coordinator initiates execution on the impalad(s) local to the data
  • 12. Architecture Overview: Query Execution
      •  Plan fragments are created upon a request from a JDBC/ODBC or Thrift client
      •  The coordinator initiates execution on the impalads
      •  Intermediate results are streamed between impalads
      •  Results are streamed back to the client
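The planner/coordinator/executor flow can be pictured as scatter and gather: each node runs a fragment over its local rows and the coordinator merges the partial results. This is a simplified single-process Python sketch of that pattern; the function names, node names and rows are hypothetical, and real fragments run in parallel on separate impalad processes.

```python
from collections import Counter

def execute_fragment(local_rows):
    """Executor side: partial SUM per key over the rows stored on one node."""
    partial = Counter()
    for key, value in local_rows:
        partial[key] += value
    return partial

def coordinate(rows_by_node):
    """Coordinator side: run one fragment per node, then merge the partial sums."""
    merged = Counter()
    for rows in rows_by_node.values():
        merged.update(execute_fragment(rows))  # Counter.update adds the counts
    return dict(merged)

# Hypothetical data layout: two DataNodes, each holding part of the table.
rows_by_node = {
    "dn1": [("a", 1), ("b", 2)],
    "dn2": [("a", 3)],
}
result = coordinate(rows_by_node)
print(result)
```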
  • 13. Architecture Overview: Query Plan
      •  Plan nodes & operators:
      •  Depth-first execution tree
      •  Scan, HashJoin, HashAggr, Union, TopN, Exchange
      •  Two-phase process:
      •  Single-node plan (left-deep tree)
      •  Plan fragments: partitioning of the operator tree
      •  Fragment: distributed atomic executable unit (plan nodes)
      •  Distributed plans:
      •  Query operators are fully distributed
      •  Max. scan locality & min. data movement
      •  Parallel joins:
      •  Order: FROM clause
      •  Broadcast join & partitioned join
      •  Future roadmap: cost-based optimization based on column stats & cost of data transfers
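The two parallel join strategies differ in what is moved between nodes. The Python sketch below contrasts them on toy data: a broadcast join copies the whole right-hand table to every node, while a partitioned join hash-partitions both inputs on the join key so each node joins one bucket. All function names and rows are hypothetical; this is an illustration of the general technique, not Impala code.

```python
def hash_join(left, right):
    """Build a hash table on `right`, then probe it with `left`; rows are (key, payload)."""
    table = {}
    for key, payload in right:
        table.setdefault(key, []).append(payload)
    return [(k, lp, rp) for k, lp in left for rp in table.get(k, [])]

def broadcast_join(left_parts, right):
    """Every node keeps its slice of `left` and receives a full copy of `right`."""
    out = []
    for part in left_parts:
        out.extend(hash_join(part, right))
    return out

def partitioned_join(left, right, nodes):
    """Both inputs are hash-partitioned on the join key; node i joins bucket i only."""
    left_buckets = [[] for _ in range(nodes)]
    right_buckets = [[] for _ in range(nodes)]
    for row in left:
        left_buckets[hash(row[0]) % nodes].append(row)
    for row in right:
        right_buckets[hash(row[0]) % nodes].append(row)
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        out.extend(hash_join(lb, rb))
    return out

left = [(1, "a"), (2, "b"), (3, "c"), (2, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
via_broadcast = broadcast_join([left[:2], left[2:]], right)
via_partition = partitioned_join(left, right, nodes=2)
print(sorted(via_broadcast) == sorted(via_partition))
```

Both strategies return the same rows; the broadcast variant ships `right` once per node, which is cheap only when `right` is small, whereas the partitioned variant ships each row of both tables at most once.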
  • 14. Architecture Overview: Query Plan (cont’)
  • 15. Logging and Profile
      •  Impala logs:
      •  Logging level controlled by:
      •  GLOG_v env: “GLOG”
          –  Default level = 1: connection logging and execution profile
          –  Level 2: logs each RPC initiated and execution progress info
          –  Level 3: everything above plus logging of every row read
      •  -logbuflevel daemon startup flag
      •  Log files to examine:
      •  $IMPALA_HOME/var/log/impala/{impalad,statestore}.{INFO,WARNING,ERROR}
      •  Consolidated: impala-server.log & impala-state-store.log
      •  http://impalad-server:25000/logs
      •  Content:
      •  Startup options: CPU, available spindles, flags, version and machine info
      •  Query profile: composition, degree of data locality, throughput statistics and response time
      •  Audit logging featured in release 1.1.1
      •  Extensive analytics data for query execution:
      •  Query profiles stored in zlib-compressed format:
      •  $IMPALA_HOME/var/log/impala/profiles
      •  http://impalad-server:25000/queries
  • 16. Performance Tip
      •  Partitioning
      •  For large tables that are always or almost always queried with conditions on the partitioning columns
      •  JOIN
      •  Broadcast join by default
      •  Partitioned join
      •  suitable for large tables of roughly equal size
      •  subsets of rows can be processed in parallel by sending a portion of each table
      •  Join the biggest table first
      •  Join the table with the most selective filter
      •  INSERT
      •  not suitable for loading large quantities of data into HDFS-based tables, due to the lack of parallelized operations
      •  Stage temporary files in an ETL pipeline and upload them to HDFS (then refresh)
      •  Resource usage:
      •  impalad startup flag: “-mem_limit”
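The partitioning tip pays off because a condition on the partitioning column lets the engine skip whole partitions instead of filtering row by row. A minimal Python sketch of that pruning step, using a hypothetical day-partitioned table held in a dictionary:

```python
# Hypothetical table partitioned by day; each key stands for one partition directory.
partitions = {
    "2013-09-26": [("example.com", 3), ("example.org", 1)],
    "2013-09-27": [("example.com", 5)],
    "2013-09-28": [("example.net", 2)],
}

def scan(partitions, wanted_days):
    """Read only the partitions whose key matches the condition on the partitioning column."""
    touched = [day for day in partitions if day in wanted_days]
    rows = [row for day in touched for row in partitions[day]]
    return touched, rows

touched, rows = scan(partitions, {"2013-09-27"})
print(touched, rows)
```

A query without a condition on the partitioning column would touch all three partitions, which is why the tip is limited to tables that are almost always queried with such conditions.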
  • 17. Troubleshooting Hint
      •  Queries are slow?
      •  Test: “select count(*) from table”
      •  A non-zero “Total remote scan volume” in the impalad log indicates that either some DNs are not running impalad or an impalad instance fails to contact one or more other impalad instances
      •  Missing impalad instances on DNs
      •  Live backends: http://statestore:25010/metrics
      •  Data locality and native checksumming (>= CDH 4.2)
      •  Enable properties: “dfs.client.read.shortcircuit” & “dfs.client.read.shortcircuit.skip.checksum”
      •  Rebuild/reinstall the Hadoop native lib “libhadoop.so” if needed
      •  Errors:
          –  Unknown disk id. This will negatively affect performance. Check your hdfs settings to enable block location metadata
          –  Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
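When checking the two short-circuit properties, it can help to read them straight from the client configuration. The Python sketch below parses an hdfs-site.xml document and reports what is set for each property; the sample XML and the function name are hypothetical, and which values are appropriate depends on the deployment.

```python
import xml.etree.ElementTree as ET

# Property names as given on the slide.
PROPERTIES = (
    "dfs.client.read.shortcircuit",
    "dfs.client.read.shortcircuit.skip.checksum",
)

def shortcircuit_settings(hdfs_site_xml):
    """Return the configured value of each property, or None when it is absent."""
    root = ET.fromstring(hdfs_site_xml)
    configured = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
    return {name: configured.get(name) for name in PROPERTIES}

# Hypothetical hdfs-site.xml content with only one of the two properties set.
SAMPLE = """<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
</configuration>"""

settings = shortcircuit_settings(SAMPLE)
print(settings)
```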
  • 18. Troubleshooting Hint (cont’)
      •  Queries getting slower?
      •  impalad paging after the memory limit is exceeded
      •  E.g.: mem-limit.h:86] Query: 0:0Exceeded limit: limit=26996031488 consumption=26996148624
      •  Incorrect results?
      •  Invalid metadata (GA: REFRESH, post-GA: INVALIDATE METADATA)
      •  Invalid query?
      •  Cross-check the query in HIVE
      •  Useful debugging info in the Impala service logs
      •  Invalid/unsupported statement:
      •  http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_langref.html#langref
      •  Auth error:
      •  Server logging:
      •  Minor code may provide more information (Cannot contact any KDC for realm or Kerberos:
      •  GSSAPI Error: Unspecified GSS failure
      •  Client: “Error connecting: <class 'thrift.transport.TTransport.TTransportException'>, TSocket read 0 bytes”
      •  Ensure:
      •  a valid Kerberos ticket lifetime at the client
      •  Specify the “-s” service principal and the “-k” flag for a kerberized impalad connection
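The memory-limit message above is easy to scan for when a query slows down. This Python sketch extracts the two byte counts from the quoted log line and reports by how much the limit was exceeded; the regular expression and function name are hypothetical and match only the message shape shown on the slide.

```python
import re

# Log line quoted on the slide.
LINE = "mem-limit.h:86] Query: 0:0Exceeded limit: limit=26996031488 consumption=26996148624"
PATTERN = re.compile(r"Exceeded limit: limit=(\d+) consumption=(\d+)")

def parse_mem_limit(line):
    """Return limit, consumption and overshoot in bytes, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    limit, consumption = map(int, match.groups())
    return {"limit": limit, "consumption": consumption, "over_by": consumption - limit}

info = parse_mem_limit(LINE)
print(info["over_by"])  # bytes above the limit
```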
  • 19. Limitation and Wish List
      •  Limitations:
      •  Subquery referenced in the SELECT
      •  Optional WITH clause before the INSERT
      •  Recursive queries in the WITH clause
      •  Inconsistent VIEW
      •  Parentheses in WHERE clauses
      •  Wish list:
      •  SQL modeling tool
      •  Fault-tolerant queries
      •  Memory management (caching Parquet tables) & usage estimation
      •  Aggregation over groups of columns (> 30 etc.)
  • 20. Impala: Now & Future Roadmap
      •  Now (1.1.x/1.0)
      •  OS support:
      •  RHEL/CentOS 5.7, Ubuntu, Debian, SLES, and Oracle Linux
      •  Connectors: JDBC/ODBC drivers
      •  DDL support & SQL performance optimization
      •  Fast & memory-efficient joins & aggregations
      •  File formats: Parquet, Avro & LZO-compressed
      •  Future (1.2) – late 2013
      •  UDF and extensibility
      •  Automatic metadata refresh
      •  In-memory HDFS caching
      •  Cost-based join order optimization
      •  Preview of YARN-integrated resource manager
      •  2.0 roadmap – first third of 2014
      •  SQL 2003-compliant analytic window functions
      •  Additional authentication mechanisms
      •  UDTFs (user-defined table functions)
      •  Intra-node parallelized aggregations and joins
      •  Nested data
      •  YARN-integrated resource manager
      •  Additional data types – including Date and Decimal types
  • 21. More Information & Related Works
      •  “Dremel: Interactive Analysis of Web-Scale Datasets”, Sergey Melnik et al., Google
      •  Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
      •  “Impala unlocks Interactive BI on Hadoop with MicroStrategy”, Justin Erickson & Jochen Demuth, Cloudera
      •  “Cloudera impala Performance Evaluation”, Yukinori SUDA
      •  “HANA vs Impala, on AWS Cloud”, Aron MacDonald
      •  “Spark and Shark: High-speed In-memory Analytics over Hadoop Data”, Reynold Xin, AMPLab
      •  Stinger Initiative http://hortonworks.com/blog/100x-faster-hive/
      •  Apache Drill: http://incubator.apache.org/drill/
  • 22. Performance Evaluation
      [Bar chart: speed in km/h (axis 0 to 100) of Shark, Impala, PIG and Elephant]
      Ref: Wiki & http://www.speedofanimals.com
  • 23. Breakdown of DNS Anomaly Analytics
      •  Two DN + Master
          –  Dual DC E5620 2.40GHz
          –  MEM 32GB ea.
          –  4 spindles, 2T ea.
      [Chart: query response time (sec) against HDFS data volume (GB)]
  • 24. Data Volume and Ingest
                             1D      1W      1M      2M
      Data (Raw) (GB)        5.1     35      140     280
      Data (HDFS) (GB)       3.8     25.9    103.6   207.2
      Blocks (HDFS)          31      211     844     1598
      MEvt                   42      291     1,166   2,209
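The table allows two quick derived figures: the ratio of HDFS size to raw size, and the event density per raw GB. The Python sketch below recomputes both from the numbers on the slide; the variable names are hypothetical and no further meaning is read into the values.

```python
# Values copied from the "Data Volume and Ingest" slide.
windows = ["1D", "1W", "1M", "2M"]
raw_gb = [5.1, 35, 140, 280]
hdfs_gb = [3.8, 25.9, 103.6, 207.2]
mevents = [42, 291, 1166, 2209]

# Stored size relative to raw size, and millions of events per raw GB.
ratios = [h / r for h, r in zip(hdfs_gb, raw_gb)]
mevt_per_raw_gb = [m / r for m, r in zip(mevents, raw_gb)]

for window, ratio, rate in zip(windows, ratios, mevt_per_raw_gb):
    print(f"{window}: HDFS/raw = {ratio:.2f}, {rate:.1f} MEvt per raw GB")
```

Across all four windows the stored data is roughly three quarters of the raw volume, and the event density stays close to 8 million events per raw GB.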
  • 25. PIG vs. Impala
      •  Domain-level compute in preprocessing streaming
      •  DN sort throughput: ~120MB/s & SIP/Qry ~50MB/s
      •  Processing time scales linearly with data volume
      [Chart: query response time (sec)]
      Impala: 71s, 7 times faster.
  • 26. Observation & Estimation
      •  Speed-up: 4.5~7 times
      •  DL calc.: 57~70% memory usage
      •  Data ingest
          –  Est. ~3TB takes ~55K sec.
      •  Plus pre-processing time
          –  Throughput constrained by the GbE link (inbound/outbound)
          –  Avg. throughput ~80MB/s
      •  non-encrypted file transfer
      •  RTQ: ~15K sec to process 3TB
          –  cf. 115K based on MR
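The estimates on this slide can be cross-checked with simple arithmetic. The Python sketch below assumes decimal units (1 TB = 1,000,000 MB) and takes the slide's figures at face value; how the individual figures relate to each other is an interpretation, not something the slide states.

```python
MB_PER_TB = 1_000_000          # assumption: decimal units

data_mb = 3 * MB_PER_TB        # ~3TB to ingest and process
ingest_sec = 55_000            # estimated ingest time from the slide
avg_mb_per_sec = 80            # average throughput from the slide
impala_sec = 15_000            # RTQ processing time for 3TB
mr_sec = 115_000               # MR-based processing time for comparison

implied_ingest_rate = data_mb / ingest_sec   # MB/s implied by the 55K sec estimate
transfer_at_avg = data_mb / avg_mb_per_sec   # seconds if 80 MB/s were sustained
speedup = mr_sec / impala_sec                # MR time divided by Impala time

print(f"implied ingest rate: {implied_ingest_rate:.1f} MB/s")
print(f"transfer time at 80 MB/s: {transfer_at_avg:.0f} s")
print(f"MR / Impala: {speedup:.1f}x")
```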
  • 27. Query Throughput & Latency
      •  Queries
      •  20 from TPC-DS
      •  3 categories
      •  Interactive: 1 month
      •  Reports: several months
      •  Deep analytics: all data
      •  Fact table:
      •  1TB snappy-seq.-files/5Yr
      •  Resource level:
      •  20 nodes, 24 cores/node
      •  Speed-up:
      •  Interactive: 25~68
      •  Reports: 6~56
      •  Deep analytics: 6~55
      Ref: “Impala: A Modern, Open-Source SQL Engine for Hadoop”, Marcel Kornacker, Cloudera
  • 28. Impala vs. Stinger
      •  Stinger
      •  Optimized execution plan
      •  TEZ framework optimizes execution
      •  Columnar file format
      Ref: Cloudera Impala Overview, Scott Leberknight, Cloudera.
  • 29. Impala Use Cases
      •  Offloads DW for ad hoc query environment, ETL and archiving
      •  Interactive BI/analytics on large volumes of data
      •  Real-time response for unstructured data analysis
  • 30. Impala and HIVE
      •  Impala:
      •  Native MPP query engine for low runtime overhead & interactive SQL
      •  No fault tolerance
      •  GA: UDF supported
      •  HIVE
      •  MapReduce as an execution engine
      •  Fault-tolerant, leveraging the MR framework
      •  High runtime overhead (extensive layering)
      •  UDF
      •  Common for client:
      •  SQL syntax
      •  highly compatible with HiveQL
      •  ODBC/JDBC drivers
      •  Metadata (table definition)
      •  HUE
  • 31. Data Warehouse Offload
      Ref: Hadoop and the Data Warehouse: When to Use Which, Teradata
  • 32. Query Run Times
      •  Table with 60M records
      Ref: HANA vs Impala, on AWS Cloud
  • 33. TPC-H Query Run Times
      •  Lineitem table, 60M rows
      Ref: HANA vs Impala, on AWS Cloud
  • 34.
      •  On-demand customer segmentation based on various demographic and mobile behavior attributes
      •  On-demand customer profiling through fast screening & ranking of critical attributes
      With the power of distributed in-memory computation on Hadoop, Impala enables market analysts to conduct various interactive analytics such as OLAP, statistical correlation, and data mining on big data.
  • 35. Associated-attribute analysis of the target group
      [Charts: social network shares of Facebook 43%, Twitter 31%, Google+ 19%, LinkedIn 7% and of Facebook 44%, Twitter 30%, Google+ 17%, LinkedIn 9%, plus further unlabeled segment and app usage breakdowns]
  • 36. DEMO
      •  CREATE TABLE, LOAD DATA from HDFS
          DROP TABLE IF EXISTS demo;
          CREATE EXTERNAL TABLE demo (
            a string,
            b int,
            c int
          )
          ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
          LOCATION '/user/etu/demo';
      •  PIG & Impala:
      •  SUM
      •  SUM with GROUP BY
  • 37. DEMO (cont’)
      •  SUM in PIG:
          a = load 'demo/demo_data.csv' using PigStorage(',') as (col1:chararray, col2:int, col3:int);
          b = foreach a generate col1, col2, col3, 1 as col4;
          -- col4 is the constant 1, so grouping on it puts all rows in one group
          d = group b by col4;
          d1 = foreach d generate SUM(b.col4);
          store d1 into 'demo/count2' using PigStorage(',');
      •  SUM in Impala:
          SELECT sum(demo.c) FROM demo;
  • 38. DEMO (cont’)
      •  SUM with GROUP BY in PIG
          a = load 'demo/demo_data.csv' using PigStorage(',') as (col1:chararray, col2:int, col3:int);
          b = foreach a generate col1, col2, col3, 1 as col4;
          c = group b by col1;
          c1 = foreach c generate group, SUM(b.col2);
          store c1 into 'demo/count1' using PigStorage(',');
      •  SUM with GROUP BY in Impala
          SELECT demo.a AS tag, sum(demo.b) AS val FROM demo GROUP BY demo.a;
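The two Impala demo queries are plain SQL, so their behaviour can be tried without a cluster. The Python sketch below runs them against an in-memory SQLite table that stands in for the `demo` table; SQLite is a substitute for illustration only, the column types are approximated, and the three sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite table standing in for the Impala `demo` table (a string, b int, c int).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (a TEXT, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [("x", 1, 10), ("y", 2, 20), ("x", 3, 30)],  # hypothetical rows
)

# SUM, as in the demo.
total = conn.execute("SELECT sum(demo.c) FROM demo").fetchone()[0]

# SUM with GROUP BY, as in the demo.
grouped = dict(
    conn.execute("SELECT demo.a AS tag, sum(demo.b) AS val FROM demo GROUP BY demo.a")
)

print(total, grouped)
conn.close()
```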
  • 39. DEMO (cont’)
      •  Speed-up:
      [Chart: query response time (sec), with speed-ups of X 60 and X 18]
      Two DNs, same spec as for the DNS log analytics: Dual DC E5620, MEM 32GB ea.
      ~100 times faster when the cluster scales.
  • 41. Contact
      www.etusolution.com
      info@etusolution.com
      Taipei, Taiwan: 318, Rueiguang Rd., Taipei 114, Taiwan. T: +886 2 7720 1888, F: +886 2 8798 6069
      Beijing, China: Room B-26, Landgent Center, No. 24, East Third Ring Middle Rd., Beijing, China 100022. T: +86 10 8441 7988, F: +86 10 8441 7227