Hongbin Ma and Luke Han (Kyligence)
Apache Kylin is an open source distributed analytics engine that provides a SQL interface and multi-dimensional analysis on Hadoop, supporting extremely large datasets. In the forthcoming Kylin release, we optimized query performance by exploring the potential of parallel storage on top of HBase. This talk explains how that work was done.
Apache Kylin’s Performance Boost from Apache HBase
1. Apache Kylin’s Performance Boost from Apache HBase
Hongbin Ma, Luke Han
Kyligence Inc.
2. About us
Hongbin Ma | 马洪宾
PMC member of Apache Kylin
Technical partner of Kyligence Inc.
mahongbin@apache.org
Kyligence Inc.
Kyligence is a leading data intelligence company focusing on Big Data technologies and
innovation, offering an intelligent platform and products powered by Apache Kylin™ for
enterprise-ready business analytics solutions.
Luke Han | 韩卿
Co-creator & VP of Apache Kylin
ASF Member
Co-founder & CEO at Kyligence Inc.
lukehan@apache.org
4. What is Apache Kylin
Apache Kylin is an open source distributed analytics engine that
provides a SQL interface for multi-dimensional analysis on Hadoop
Works well with extremely large datasets
Provides REST API, ODBC and JDBC as user interface
Widely adopted by many companies like eBay, JD, Baidu, NetEase, VIP.com,
etc.
6. What is Apache Kylin
Apache Kylin pre-calculates OLAP cubes with a horizontally scalable
computation framework (MapReduce, Spark, etc.) and stores the cubes
in a reliable and scalable data store (HBase, Cassandra, etc.)
7. Architecture Design
[Figure: Kylin architecture diagram]
Offline data flow: star schema data in Hadoop (Hive) → Cube Builder (MapReduce, Spark, etc.) → key-value data stored as OLAP cubes (HBase)
Online analysis data flow: 3rd party apps (web app, mobile…) and SQL-based tools (BI tools: Tableau…) → REST API or JDBC/ODBC → REST Server → Query Engine → Routing → OLAP cubes (HBase), with low latency (seconds)
Other components: Metadata, DataSource Abstraction, Engine Abstraction, Storage Abstraction
Clients/users interact with Kylin via SQL
The OLAP cube is transparent to users
9. Cubes stored in HBase
Let’s take a look at
cuboid (D1,D3,D5)
where all dimensions are:
(D1,D2,D3,D4,D5)
This cuboid is denoted as “cuboid 00010101”
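The cuboid name above reads as a bitmask over the dimension list: one bit per dimension, set when the dimension is part of the cuboid. A minimal sketch of that encoding follows. The function name `cuboid_id`, the padding width of 8 bits, and the choice to map the first dimension to the highest bit are assumptions made for illustration, not taken from Kylin's code.

```python
def cuboid_id(all_dims, cuboid_dims, width=8):
    """Return the bitmask string of a cuboid.

    Assumption: the first dimension in all_dims maps to the highest of
    the len(all_dims) low-order bits; the result is zero-padded to `width`.
    """
    n = len(all_dims)
    mask = 0
    for pos, dim in enumerate(all_dims):
        if dim in cuboid_dims:
            mask |= 1 << (n - 1 - pos)
    return format(mask, "0{}b".format(width))


ALL_DIMS = ["D1", "D2", "D3", "D4", "D5"]
print(cuboid_id(ALL_DIMS, {"D1", "D3", "D5"}))  # 00010101
```

With five dimensions, (D1,D3,D5) sets bits 10101, which padded to eight bits gives the name on the slide.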
10. Why HBase as the first choice?
Well integrated with Hadoop
Block encoding to reduce storage footprint
Good at both seeking and scanning
Coprocessors to move computation to data
Scalable and flexible as a data store
11. How Kylin queries HBase
[Figure: the Kylin Query Server talks to the coprocessor of each region on a region server]
Row key layout: CuboidID | SellerID | Date | … | Country | Metrics
1. Filter/Aggregation push down
2. Scan with Fuzzy Key Filter
3. Half baked results
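The three steps above can be illustrated with a small simulation: each region filters and pre-aggregates its own rows, and the query server merges the partial ("half baked") results. This is only a sketch in plain Python. The function names and the sample rows are hypothetical, and the fuzzy key scan is replaced by a simple in-memory filter.

```python
from collections import defaultdict


def region_aggregate(rows, wanted_countries):
    """Region side (what a coprocessor would do): filter, then pre-aggregate."""
    partial = defaultdict(int)
    for seller_id, country, amount in rows:
        if country in wanted_countries:
            partial[seller_id] += amount
    return dict(partial)


def merge_partials(partials):
    """Query server side: finish the aggregation over the partial results."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


# Hypothetical rows of (SellerID, Country, metric) held by two regions.
region_a = [(1, "US", 10), (2, "CN", 5), (1, "US", 7)]
region_b = [(1, "US", 3), (3, "US", 4), (2, "US", 1)]

partials = [region_aggregate(r, {"US"}) for r in (region_a, region_b)]
result = merge_partials(partials)
print(result)
```

Seller 1 appears in both regions, so its partial sums still have to be combined on the query server. That merge step is what makes the regional results only half baked.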
12. May still be slow when
The cuboid is large because it contains a very large number of dimension-value combinations
The cuboid layout is not query-friendly, e.g. filtering on suffix dimensions while
grouping by prefix dimensions
The filter in the query is huge and complex
Regions return too many half-baked results
14. Novelty
Compared with “pure” MPP solutions
Cube data is more query-friendly because it is pre-aggregated and sorted.
Faster speed
Less CPU consumption
Less storage read
Able to leverage column storage and inverted index just like typical MPP
Compared with “pure” Cubing technologies
Overcome the bottleneck in cube size
Overcome the bottleneck in cube visiting speed
15. Problem
The sizes of different cuboids in the same cube may vary
Too much parallelism for small cuboids is harmful
An RPC is required for each shard, and we don’t want to waste network/CPU
resources
16. Solution: Shard Circle
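[Figure: shard circle, a ring of ten shards numbered 0 to 9]
Given the estimated size $S_i$ of each cuboid $i$
and the expected size $S_r$ of each region (specified by the modeler):
$\mathit{regionNum} = \frac{\sum_i S_i}{S_r}$
$\mathit{cuboidRegionNum} = \frac{S_i \cdot \mathit{factor}}{S_r}$
$\mathit{cuboidCircleStart} = \mathit{hash}(i) \bmod \mathit{regionNum}$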
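The formulas above can be turned into a short allocation sketch. Several details here are assumptions and not stated on the slide: results are rounded up, each cuboid gets at least one and at most `regionNum` shards, the hash is CRC32 of the cuboid name, and a cuboid occupies consecutive positions on the circle beginning at its start position. The function name `shard_circle` is hypothetical.

```python
import math
import zlib


def shard_circle(cuboid_sizes, region_size, factor=1.0):
    """Assign each cuboid a run of consecutive shards on a circle.

    cuboid_sizes: estimated size S_i per cuboid name.
    region_size:  expected size S_r of one region.
    factor:       scaling factor from the formula (default is an assumption).
    """
    total = sum(cuboid_sizes.values())
    region_num = max(1, math.ceil(total / region_size))
    plan = {}
    for cuboid, size in cuboid_sizes.items():
        count = math.ceil(size * factor / region_size)
        count = min(max(count, 1), region_num)  # assumed clamping
        start = zlib.crc32(str(cuboid).encode("utf-8")) % region_num
        plan[cuboid] = [(start + k) % region_num for k in range(count)]
    return region_num, plan


sizes = {"00011111": 80, "00010101": 15, "00000001": 5}
region_num, plan = shard_circle(sizes, region_size=10)
for cuboid, shards in plan.items():
    print(cuboid, shards)
```

With these sample sizes the cube is split over 10 regions. The largest cuboid spans 8 shards while the smallest lives in a single shard, so a query on a small cuboid needs only one RPC.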
17. Salted Cuboid Rows
ShardID at the beginning of row key
Configurable policies for computing ShardID
From the hash of the remaining row key: spreads rows randomly across shards
From specific dimension values: improves runtime performance
Row key layout: ShardID | CuboidID | SellerID | Date | … | Country | Metrics
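As an illustration of the first policy, the sketch below prefixes a row key with a ShardID derived from a hash of the rest of the key. The byte layout (2-byte ShardID, 8-byte CuboidID, "|"-separated dimension values), the CRC32 hash, and the name `salted_row_key` are assumptions chosen for the example and do not reflect Kylin's actual encoding.

```python
import zlib


def salted_row_key(cuboid_id, dim_values, shard_count):
    """Build ShardID + CuboidID + dimension values as bytes.

    The ShardID is the hash of the remaining row key modulo shard_count,
    so rows of one cuboid are spread evenly over its shards.
    """
    body = cuboid_id.to_bytes(8, "big") + b"|".join(
        value.encode("utf-8") for value in dim_values
    )
    shard_id = zlib.crc32(body) % shard_count
    return shard_id.to_bytes(2, "big") + body


key = salted_row_key(0b00010101, ["seller42", "2016-05-01", "US"], shard_count=4)
shard_id = int.from_bytes(key[:2], "big")
print(shard_id, key[2:])
```

Because the ShardID leads the key, rows of the same shard sort together and each shard can be scanned as one contiguous range.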
18. Compute ShardID from SellerID
For queries that group by SellerID
Each shard aggregates a disjoint subset of SellerIDs
No further aggregation on the merge side
For queries that filter by SellerID
The pushed-down SellerID filter can be trimmed to contain only the SellerIDs
of interest
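Both benefits can be sketched in a few lines. Assuming, for illustration only, that the ShardID is `SellerID % SHARD_COUNT` with integer SellerIDs, a filter on SellerID can be split so each shard receives only the IDs it can hold, and group-by results from different shards can be combined without re-aggregation. All names below are hypothetical.

```python
from collections import defaultdict

SHARD_COUNT = 4  # assumed number of shards for this cuboid


def shard_of(seller_id):
    """Assumed policy: derive the ShardID directly from the SellerID."""
    return seller_id % SHARD_COUNT


def trim_filter(seller_ids):
    """Split a SellerID filter into per-shard filters."""
    per_shard = defaultdict(set)
    for seller_id in seller_ids:
        per_shard[shard_of(seller_id)].add(seller_id)
    return dict(per_shard)


def merge_disjoint(partials):
    """Combine group-by results; keys never collide across shards."""
    merged = {}
    for partial in partials:
        assert merged.keys().isdisjoint(partial.keys())
        merged.update(partial)
    return merged


trimmed = trim_filter({3, 7, 10})
print(trimmed)  # {3: {3, 7}, 2: {10}} in some key order

merged = merge_disjoint([{4: 100, 8: 20}, {3: 5, 7: 9}, {10: 1}])
print(merged)
```

In this sketch, shards 0 and 1 receive no filter entry at all because none of the requested SellerIDs can live there.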
20. Small cuboids getting fewer shards
[Figure: bar chart over SQL 1 to SQL 5 with legend entries "13 regions" and "23 regions", y-axis from 0 to 1.2. Data labels: 1.005586592, 0.625, 0.625, 0.678571429, 0.794117647]
21. Q & A
To get more information about Apache Kylin:
Apache Kylin Website: http://kylin.apache.org
Kyligence Website: http://kyligence.io
Twitter: @ApacheKylin
Mail list: dev@kylin.apache.org