SlideShare una empresa de Scribd logo
1 de 42
Apache Kylin Introduction 
Dec 8, 2014 |@ApacheKylin 
http://kylin.io 
Luke Han Sr. Product Manager | lukhan@ebay.com | @lukehq 
Yang Li Architect & Tech Leader | yangli9@ebay.com
Agenda 
 What’s Apache Kylin? 
 Tech Highlights 
 Performance 
 Open Source 
 Q & A
Extreme OLAP Engine for Big Data 
Kylin is an open source Distributed Analytics Engine from eBay 
that provides SQL interface and multi-dimensional analysis 
(OLAP) on Hadoop supporting extremely large datasets 
http://kylin.io 
What’s Kylin 
kylin / ˈkiːˈlɪn / 麒麟 
--n. (in Chinese art) a mythical animal of composite form 
• Open Sourced on Oct 1st, 2014 
• Be accepted as Apache Incubator Project on Nov 25th, 2014
http://kylin.io 
Big Data Era 
 More and more data becoming available on Hadoop 
 Limitations in existing Business Intelligence (BI) Tools 
 Limited support for Hadoop 
 Data size growing exponentially 
 High latency of interactive queries 
 Scale-Up architecture 
 Challenges to adopt Hadoop as interactive analysis system 
 Majority of analyst groups are SQL savvy 
 No mature SQL interface on Hadoop 
 OLAP capability on Hadoop ecosystem not ready yet
http://kylin.io 
Business Needs for Big Data Analysis 
 Sub-second query latency on billions of rows 
 ANSI SQL for both analysts and engineers 
 Full OLAP capability to offer advanced functionality 
 Seamless Integration with BI Tools 
 Support of high cardinality and high dimensions 
 High concurrency – thousands of end users 
 Distributed and scale out architecture for large data volume 
 Open source solution
h6ttp://kylin.io 
Why not 
Build an engine from scratch?
Kylin is designed to accelerate 80+% analytics queries performance on Hadoop 
http://kylin.io 
Transaction 
Strategy 
Operation 
High Level 
Aggregation 
•Very High Level, e.g GMV by 
site by vertical by weeks 
Analysis 
Query 
•Middle level, e.g GMV by site by vertical, 
by category (level x) past 12 weeks 
Drill Down 
to Detail •Detail Level (Summary Table) 
Low Level 
Aggregation 
•First Level 
Aggragation 
Transaction 
Level 
•Transaction Data 
Analytics Query Taxonomy 
OLAP 
OLTP
http://kylin.io 
Technical Challenges 
 Huge volume data 
 Table scan 
 Big table joins 
 Data shuffling 
 Analysis on different granularity 
 Runtime aggregation expensive 
 Map Reduce job 
 Batch processing
OLAP Cube – Balance between Space and Time 
http://kylin.io 
9 
• Cuboid = one combination of dimensions 
• Cube = all combination of dimensions (all cuboids) 
time, item 
time item location supplier 
time, item, location 
item, location 
time, location, supplier 
time, item, location, supplier 
time, location 
Time, supplier 
item, supplier 
location, supplier 
time, item, supplier 
item, location, supplier 
0-D(apex) cuboid 
1-D cuboids 
2-D cuboids 
3-D cuboids 
4-D(base) cuboid 
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells 
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier> 
2. (9/15, milk, Urbana, *) - <time, item, location> 
3. (*, milk, Urbana, *) - <item, location> 
4. (*, milk, Chicago, *) - <item, location> 
5. (*, milk, *, *) - <item>
http://kylin.io 
From Relational to Key-Value
Data 
Cube 
OLAP 
Cube 
(HBase) 
http://kylin.io 
Kylin Architecture Overview 
11 
SQL-Based Tool 
(BI Tools: Tableau…) 
REST Server 
Query Engine 
SQL 
Cube Build Engine 
(MapReduce…) 
SQL 
Low Latency - 
Seconds 
Mid Latency - Minutes 
Routing 
3rd Party App 
(Web App, Mobile…) 
Metadata 
Hadoop 
Hive 
REST API JDBC/ODBC 
 Online Analysis Data Flow 
 Offline Data Flow 
 Clients/Users interactive with 
Kylin via SQL 
 OLAP Cube is transparent to 
users 
Star Schema Data Key Value Data
Features Highlights 
 Extremely Fast OLAP Engine at Scale 
Kylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data 
 ANSI SQL Interface on Hadoop 
Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions 
http://kylin.io 
 Seamless Integration with BI Tools 
Kylin currently offers integration capability with BI Tools like Tableau. 
 Interactive Query Capability 
Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive 
queries for the same dataset 
 MOLAP Cube 
User can define a data model and pre-build in Kylin with more than 10+ billions of raw 
data records
http://kylin.io 
Features Highlights Cons 
 Compression and Encoding Support 
 Incremental Refresh of Cubes 
 Approximate Query Capability for distinct Count (HyperLogLog) 
 Leverage HBase Coprocessor for query latency 
 Job Management and Monitoring 
 Easy Web interface to manage, build, monitor and query cubes 
 Security capability to set ACL at Cube/Project Level 
 Support LDAP Integration
How Does Kylin Utilize Hadoop Components? 
http://kylin.io 
 Hive 
 Input source 
 Pre-join star schema during cube building 
 MapReduce 
 Pre-aggregation metrics during cube building 
 HDFS 
 Store intermediated files during cube building. 
 HBase 
 Store data cube. 
 Serve query on data cube. 
 Coprocessor is used for query processing.
http://kylin.io 
Why Kylin is Fast? 
 Pre-built cube – query result already be calculated 
 Leveraging distributed computing infrastructure 
 No runtime Hive table scan and MapReduce job 
 Compression and encoding 
 Put “Computing” to “Data” 
 Cached
Agenda 
 What’s Kylin 
 Tech Highlights 
 Performance 
 Open Source 
 Q & A
How to Define Cube? 
End User Cube Modeler Admin 
Row Key Column 
Val 1 
Val 2 
Val 3 
http://kylin.io 
Data Modeling 
Cube: … 
Fact Table: … 
Dimensions: … 
Measures: … 
Dim 
Fact Storage(HBase): … 
Dim Dim 
Source 
Star Schema 
row A 
row B 
row C 
Column Family 
Target 
HBase Storage 
Mapping 
Cube Metadata
http://kylin.io 
• Dimension 
– Normal 
– Mandatory 
– Hierarchy 
– Derived 
• Measure 
– Sum 
– Count 
– Max 
– Min 
– Average 
– Distinct Count (based on HyperLogLog) 
How to Define Cube? 
Cube Metadata
http://kylin.io 
 Dimension that must present on cuboid 
 E.g. Date 
Normal 
A B C 
A B - 
- B C 
A - C 
A - - 
- B - 
- - C 
- - - 
A is mandatory 
A B C 
A B - 
A - C 
A - - 
How to Define Cube? 
Mandatory Dimension
 Dimensions that form a “contains” relationship where parent level is 
http://kylin.io 
required for child level to make sense. 
 E.g. Year -> Month -> Day; Country -> City 
Normal 
A B C 
A B - 
- B C 
A - C 
A - - 
- B - 
- - C 
- - - 
A -> B -> C is hierarchy 
A B C 
A B - 
A - - 
- - - 
How to Define Cube? 
Hierarchy Dimension
 Dimensions on lookup table that can be derived by PK 
http://kylin.io 
 E.g. User ID -> [Name, Age, Gender] 
Normal 
A B C 
A B - 
- B C 
A - C 
A - - 
- B - 
- - C 
- - - 
A, B, C is derived by ID 
ID 
- 
How to Define Cube? 
Derived Dimension
How to Build Cube? 
http://kylin.io 
Cube Build Job Flow
How to Build Cube? 
http://kylin.io 
Cube Build Result
http://kylin.io 
 Dynamic data management framework. 
 Formerly known as Optiq, Calcite is an Apache incubator project, used by 
Apache Drill and Apache Hive, among others. 
 http://optiq.incubator.apache.org 
How to Query Cube? 
Query Engine – Calcite
http://kylin.io 
• Metadata SPI 
– Provide table schema from kylin metadata 
• Optimize Rule 
– Translate the logic operator into kylin operator 
• Relational Operator 
– Find right cube 
– Translate SQL into storage engine api call 
– Generate physical execute plan by linq4j java implementation 
• Result Enumerator 
– Translate storage engine result into java implementation result. 
• SQL Function 
– Add HyperLogLog for distinct count 
– Implement date time related functions (i.e. Quarter) 
How to Query Cube? 
Calcite Plugins
How to Query Cube? 
Kylin Explain Plan 
SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, 
test_kylin_fact.lstg_format_name, test_sites.site_name, SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT 
FROM test_kylin_fact 
LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt 
LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id AND test_kylin_fact.lstg_site_id = 
test_category.site_id 
LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id 
WHERE test_kylin_fact.seller_id = 123456OR test_kylin_fact.lstg_format_name = ’New' 
GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, 
test_kylin_fact.lstg_format_name,test_sites.site_name 
OLAPToEnumerableConverter 
OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], 
LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8]) 
OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()]) 
OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], 
http://kylin.io 
LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0]) 
OLAPFilterRel(condition=[OR(=($3, 123456), =($5, ’New'))]) 
OLAPJoinRel(condition=[=($2, $25)], joinType=[left]) 
OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left]) 
OLAPJoinRel(condition=[=($4, $12)], joinType=[left]) 
OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) 
OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]]) 
OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]]) 
OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
How to Query Cube? 
http://kylin.io 
Storage Engine 
 Plugin-able query engine 
 Common iterator interface for storage engine 
 Isolate query engine from underline storage 
 Translate cube query into HBase table scan 
 Columns, Groups  Cuboid ID 
 Filters -> Scan Range (Row Key) 
 Aggregations -> Measure Columns (Row Values) 
 Scan HBase table and translate HBase result into cube result 
 HBase Result (key + value) -> Cube Result (dimensions + measures)
http://kylin.io 
 “Curse of dimensionality”: N dimension cube has 2N cuboid 
 Full Cube vs. Partial Cube 
 Hugh data volume 
 Dictionary Encoding 
 Incremental Building 
How to Optimize Cube? 
Cube Optimization
How to Optimize Cube? 
http://kylin.io 
Full Cube vs. Partial Cube 
 Full Cube 
 Pre-aggregate all dimension combinations 
 “Curse of dimensionality”: N dimension cube has 2N cuboid. 
 Partial Cube 
 To avoid dimension explosion, we divide the dimensions into 
different aggregation groups 
 2N+M+L  2N + 2M + 2L 
 For cube with 30 dimensions, if we divide these dimensions into 3 
group, the cuboid number will reduce from 1 Billion to 3 Thousands 
 230  210 + 210 + 210 
 Tradeoff between online aggregation and offline pre-aggregation
How to Optimize Cube? 
http://kylin.io 
Partial Cube
http://kylin.io 
 Data cube has lost of duplicated dimension values 
 Dictionary maps dimension values into IDs that will reduce the memory and storage 
footprint. 
 Dictionary is based on Trie 
How to Optimize Cube? 
Dictionary Encoding
How to Optimize Cube? 
http://kylin.io 
Incremental Build
http://kylin.io 
Inverted Index 
 Challenge 
 Has no raw data records 
 Slow table scan on high cardinality dimensions 
 Inverted Index Storage (an ongoing effort) 
 Persist the raw table 
 Bitmap inverted index 
 Time range partition 
 In-memory (block cache) 
 Parallel scan (endpoint coprocessor)
Agenda 
 What’s Apache Kylin? 
 Tech Highlights 
 Performance 
 Open Source 
 Q & A
http://kylin.io 
Kylin vs. Hive 
200 
150 
100 
50 
0 
# Query 
Type 
SQL #1 SQL #2 SQL #3 
Return Dataset Query 
On Kylin (s) 
Query 
On Hive (s) 
Comments 
1 High Level 
Aggregation 
4 0.129 157.437 1,217 times 
2 Analysis Query 22,669 1.615 109.206 68 times 
3 Drill Down to 
Detail 
325,029 12.058 113.123 9 times 
4 Drill Down to 
Detail 
524,780 22.42 6383.21 278 times 
5 Data Dump 972,002 49.054 N/A 
Hive 
Kylin 
High Level 
Aggregatio 
n 
Analysis 
Query 
Drill Down 
to Detail 
Low Level 
Aggregatio 
n 
Transactio 
n Level 
Based on 12+B records case
http://kylin.io 
Performance -- Concurrency 
Linear scale out with more nodes
http://kylin.io 
Performance - Query Latency 
90%tile queries <5s 
Green Line: 90%tile queries 
Gray Line: 95%tile queries
Agenda 
 What’s Apache Kylin? 
 Tech Highlights 
 Performance 
 Open Source 
 Q & A
http://kylin.io 
 Kylin Core 
 Fundamental framework of 
Kylin OLAP Engine 
 Extension 
 Plugins to support for 
additional functions and 
features 
 Integration 
 Lifecycle Management 
Support to integrate with 
other applications 
 Interface 
 Allows for third party users to 
build more features via user-interface 
atop Kylin core 
 Driver 
 ODBC and JDBC Drivers 
Kylin OLAP 
Core 
Extension 
 Security 
 Redis Storage 
 Spark Engine 
Interface 
 Web Console 
 Customized BI 
 Ambari/Hue Plugin 
Integration 
 ODBC Driver 
 ETL 
 Scheduling 
Kylin Ecosystem
http://kylin.io 
Kylin Evolution Roadmap 
2013 2014 2015 
Jan, 2014 
Initial 
Prototype 
for MOLAP 
• Basic end to end 
POC 
MOLAP 
• Incremental 
Refresh 
• ANSI SQL 
• ODBC Driver 
• Web GUI 
• ACL 
• Open Source 
HOLAP 
• InvertedIndex 
• JDBC Driver 
• Automation 
• Capacity 
Managment 
• Support Spark 
• … more 
Next Gen 
• In-Memory 
Analysis 
• Real Time & 
Streaming (TBD) 
• … more 
TBD 
Future… 
Sep, 2013 
Sep, 2014 
Q1, 2015
http://kylin.io 
Open Source 
 Kylin Site: 
 http://kylin.io 
 Twitter: 
 @ApacheKylin 
 Source Code Repo: 
 https://github.com/KylinOLAP 
 Google Group: 
 Kylin OLAP
http://kylin.io 
Thanks 
http://kylin.io

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouse
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache Kylin
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Apache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big dataApache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big data
 
Apache Kylin Introduction
Apache Kylin IntroductionApache Kylin Introduction
Apache Kylin Introduction
 
Adding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupAdding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark Meetup
 
Datacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheConDatacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheCon
 
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
 
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
 
Apache Kylin Open Source Journey for QCon2015 Beijing
Apache Kylin Open Source Journey for QCon2015 BeijingApache Kylin Open Source Journey for QCon2015 Beijing
Apache Kylin Open Source Journey for QCon2015 Beijing
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering Principles
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke Han
 
Apache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingApache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 Beijing
 
Big Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache KylinBig Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache Kylin
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015
 
Apache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and JapanApache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and Japan
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Apache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and TimeApache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and Time
 

Similar a Apache Kylin: Hadoop OLAP Engine, 2014 Dec

HBaseConAsia2018 Track2-2: Apache Kylin on HBase: Extreme OLAP for big data
HBaseConAsia2018  Track2-2: Apache Kylin on HBase: Extreme OLAP for big dataHBaseConAsia2018  Track2-2: Apache Kylin on HBase: Extreme OLAP for big data
HBaseConAsia2018 Track2-2: Apache Kylin on HBase: Extreme OLAP for big data
Michael Stack
 
Apache kylin boost your SQLs on extremely large dataset
Apache kylin boost your SQLs on extremely large datasetApache kylin boost your SQLs on extremely large dataset
Apache kylin boost your SQLs on extremely large dataset
Chun'en Ni
 

Similar a Apache Kylin: Hadoop OLAP Engine, 2014 Dec (20)

HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
 
ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015
 
Accelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinAccelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache Kylin
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
 
Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better One
 
Apache Kylin and Use Cases - 2018 Big Data Spain
Apache Kylin and Use Cases - 2018 Big Data SpainApache Kylin and Use Cases - 2018 Big Data Spain
Apache Kylin and Use Cases - 2018 Big Data Spain
 
HBaseConAsia2018 Track2-2: Apache Kylin on HBase: Extreme OLAP for big data
HBaseConAsia2018  Track2-2: Apache Kylin on HBase: Extreme OLAP for big dataHBaseConAsia2018  Track2-2: Apache Kylin on HBase: Extreme OLAP for big data
HBaseConAsia2018 Track2-2: Apache Kylin on HBase: Extreme OLAP for big data
 
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteA Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
 
Build 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on Azure
Build 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on AzureBuild 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on Azure
Build 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on Azure
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
Develop a Custom Data Solution Architecture with NorthBay
Develop a Custom Data Solution Architecture with NorthBayDevelop a Custom Data Solution Architecture with NorthBay
Develop a Custom Data Solution Architecture with NorthBay
 
How we evolved data pipeline at Celtra and what we learned along the way
How we evolved data pipeline at Celtra and what we learned along the wayHow we evolved data pipeline at Celtra and what we learned along the way
How we evolved data pipeline at Celtra and what we learned along the way
 
How we switched to columnar at SpendHQ
How we switched to columnar at SpendHQHow we switched to columnar at SpendHQ
How we switched to columnar at SpendHQ
 
When OLAP Meets Real-Time, What Happens in eBay?
When OLAP Meets Real-Time, What Happens in eBay?When OLAP Meets Real-Time, What Happens in eBay?
When OLAP Meets Real-Time, What Happens in eBay?
 
Apache kylin boost your sqls on extremely large dataset
Apache kylin boost your sqls on extremely large datasetApache kylin boost your sqls on extremely large dataset
Apache kylin boost your sqls on extremely large dataset
 
Apache kylin boost your SQLs on extremely large dataset
Apache kylin boost your SQLs on extremely large datasetApache kylin boost your SQLs on extremely large dataset
Apache kylin boost your SQLs on extremely large dataset
 
2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Último (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 

Apache Kylin: Hadoop OLAP Engine, 2014 Dec

  • 1. Apache Kylin Introduction Dec 8, 2014 |@ApacheKylin http://kylin.io Luke Han Sr. Product Manager | lukhan@ebay.com | @lukehq Yang Li Architect & Tech Leader | yangli9@ebay.com
  • 2. Agenda  What’s Apache Kylin?  Tech Highlights  Performance  Open Source  Q & A
  • 3. Extreme OLAP Engine for Big Data Kylin is an open source Distributed Analytics Engine from eBay that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets http://kylin.io What’s Kylin kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite form • Open Sourced on Oct 1st, 2014 • Be accepted as Apache Incubator Project on Nov 25th, 2014
  • 4. http://kylin.io Big Data Era  More and more data becoming available on Hadoop  Limitations in existing Business Intelligence (BI) Tools  Limited support for Hadoop  Data size growing exponentially  High latency of interactive queries  Scale-Up architecture  Challenges to adopt Hadoop as interactive analysis system  Majority of analyst groups are SQL savvy  No mature SQL interface on Hadoop  OLAP capability on Hadoop ecosystem not ready yet
  • 5. http://kylin.io Business Needs for Big Data Analysis  Sub-second query latency on billions of rows  ANSI SQL for both analysts and engineers  Full OLAP capability to offer advanced functionality  Seamless Integration with BI Tools  Support of high cardinality and high dimensions  High concurrency – thousands of end users  Distributed and scale out architecture for large data volume  Open source solution
  • 6. h6ttp://kylin.io Why not Build an engine from scratch?
  • 7. Kylin is designed to accelerate 80+% analytics queries performance on Hadoop http://kylin.io Transaction Strategy Operation High Level Aggregation •Very High Level, e.g GMV by site by vertical by weeks Analysis Query •Middle level, e.g GMV by site by vertical, by category (level x) past 12 weeks Drill Down to Detail •Detail Level (Summary Table) Low Level Aggregation •First Level Aggragation Transaction Level •Transaction Data Analytics Query Taxonomy OLAP OLTP
  • 8. http://kylin.io Technical Challenges  Huge volume data  Table scan  Big table joins  Data shuffling  Analysis on different granularity  Runtime aggregation expensive  Map Reduce job  Batch processing
  • 9. OLAP Cube – Balance between Space and Time http://kylin.io 9 • Cuboid = one combination of dimensions • Cube = all combination of dimensions (all cuboids) time, item time item location supplier time, item, location item, location time, location, supplier time, item, location, supplier time, location Time, supplier item, supplier location, supplier time, item, supplier item, location, supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid • Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells 1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier> 2. (9/15, milk, Urbana, *) - <time, item, location> 3. (*, milk, Urbana, *) - <item, location> 4. (*, milk, Chicago, *) - <item, location> 5. (*, milk, *, *) - <item>
  • 11. Data Cube OLAP Cube (HBase) http://kylin.io Kylin Architecture Overview 11 SQL-Based Tool (BI Tools: Tableau…) REST Server Query Engine SQL Cube Build Engine (MapReduce…) SQL Low Latency - Seconds Mid Latency - Minutes Routing 3rd Party App (Web App, Mobile…) Metadata Hadoop Hive REST API JDBC/ODBC  Online Analysis Data Flow  Offline Data Flow  Clients/Users interactive with Kylin via SQL  OLAP Cube is transparent to users Star Schema Data Key Value Data
  • 12. Features Highlights  Extremely Fast OLAP Engine at Scale Kylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data  ANSI SQL Interface on Hadoop Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions http://kylin.io  Seamless Integration with BI Tools Kylin currently offers integration capability with BI Tools like Tableau.  Interactive Query Capability Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset  MOLAP Cube User can define a data model and pre-build in Kylin with more than 10+ billions of raw data records
  • 13. http://kylin.io Features Highlights Cons  Compression and Encoding Support  Incremental Refresh of Cubes  Approximate Query Capability for distinct Count (HyperLogLog)  Leverage HBase Coprocessor for query latency  Job Management and Monitoring  Easy Web interface to manage, build, monitor and query cubes  Security capability to set ACL at Cube/Project Level  Support LDAP Integration
  • 14. How Does Kylin Utilize Hadoop Components? http://kylin.io  Hive  Input source  Pre-join star schema during cube building  MapReduce  Pre-aggregation metrics during cube building  HDFS  Store intermediated files during cube building.  HBase  Store data cube.  Serve query on data cube.  Coprocessor is used for query processing.
  • 15. http://kylin.io Why Kylin is Fast?  Pre-built cube – query result already be calculated  Leveraging distributed computing infrastructure  No runtime Hive table scan and MapReduce job  Compression and encoding  Put “Computing” to “Data”  Cached
  • 16. Agenda  What’s Kylin  Tech Highlights  Performance  Open Source  Q & A
  • 17. How to Define Cube? End User Cube Modeler Admin Row Key Column Val 1 Val 2 Val 3 http://kylin.io Data Modeling Cube: … Fact Table: … Dimensions: … Measures: … Dim Fact Storage(HBase): … Dim Dim Source Star Schema row A row B row C Column Family Target HBase Storage Mapping Cube Metadata
  • 18. http://kylin.io • Dimension – Normal – Mandatory – Hierarchy – Derived • Measure – Sum – Count – Max – Min – Average – Distinct Count (based on HyperLogLog) How to Define Cube? Cube Metadata
  • 19. http://kylin.io  Dimension that must present on cuboid  E.g. Date Normal A B C A B - - B C A - C A - - - B - - - C - - - A is mandatory A B C A B - A - C A - - How to Define Cube? Mandatory Dimension
  • 20.  Dimensions that form a “contains” relationship where parent level is http://kylin.io required for child level to make sense.  E.g. Year -> Month -> Day; Country -> City Normal A B C A B - - B C A - C A - - - B - - - C - - - A -> B -> C is hierarchy A B C A B - A - - - - - How to Define Cube? Hierarchy Dimension
  • 21.  Dimensions on lookup table that can be derived by PK http://kylin.io  E.g. User ID -> [Name, Age, Gender] Normal A B C A B - - B C A - C A - - - B - - - C - - - A, B, C is derived by ID ID - How to Define Cube? Derived Dimension
  • 22. How to Build Cube? http://kylin.io Cube Build Job Flow
  • 23. How to Build Cube? http://kylin.io Cube Build Result
  • 24. http://kylin.io  Dynamic data management framework.  Formerly known as Optiq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others.  http://optiq.incubator.apache.org How to Query Cube? Query Engine – Calcite
  • 25. http://kylin.io • Metadata SPI – Provide table schema from kylin metadata • Optimize Rule – Translate the logic operator into kylin operator • Relational Operator – Find right cube – Translate SQL into storage engine api call – Generate physical execute plan by linq4j java implementation • Result Enumerator – Translate storage engine result into java implementation result. • SQL Function – Add HyperLogLog for distinct count – Implement date time related functions (i.e. Quarter) How to Query Cube? Calcite Plugins
  • 26. How to Query Cube? Kylin Explain Plan SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name, test_sites.site_name, SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT FROM test_kylin_fact LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category.site_id LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id WHERE test_kylin_fact.seller_id = 123456OR test_kylin_fact.lstg_format_name = ’New' GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name,test_sites.site_name OLAPToEnumerableConverter OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8]) OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()]) OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], http://kylin.io LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0]) OLAPFilterRel(condition=[OR(=($3, 123456), =($5, ’New'))]) OLAPJoinRel(condition=[=($2, $25)], joinType=[left]) OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left]) OLAPJoinRel(condition=[=($4, $12)], joinType=[left]) OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]]) OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]]) OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
  • 27. How to Query Cube? http://kylin.io Storage Engine  Plugin-able query engine  Common iterator interface for storage engine  Isolate query engine from underline storage  Translate cube query into HBase table scan  Columns, Groups  Cuboid ID  Filters -> Scan Range (Row Key)  Aggregations -> Measure Columns (Row Values)  Scan HBase table and translate HBase result into cube result  HBase Result (key + value) -> Cube Result (dimensions + measures)
  • 28. http://kylin.io  “Curse of dimensionality”: N dimension cube has 2N cuboid  Full Cube vs. Partial Cube  Hugh data volume  Dictionary Encoding  Incremental Building How to Optimize Cube? Cube Optimization
  • 29. How to Optimize Cube? http://kylin.io Full Cube vs. Partial Cube  Full Cube  Pre-aggregate all dimension combinations  “Curse of dimensionality”: N dimension cube has 2N cuboid.  Partial Cube  To avoid dimension explosion, we divide the dimensions into different aggregation groups  2N+M+L  2N + 2M + 2L  For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid number will reduce from 1 Billion to 3 Thousands  230  210 + 210 + 210  Tradeoff between online aggregation and offline pre-aggregation
  • 30. How to Optimize Cube? http://kylin.io Partial Cube
  • 31. http://kylin.io  Data cube has lost of duplicated dimension values  Dictionary maps dimension values into IDs that will reduce the memory and storage footprint.  Dictionary is based on Trie How to Optimize Cube? Dictionary Encoding
  • 32. How to Optimize Cube? http://kylin.io Incremental Build
  • 33. http://kylin.io Inverted Index  Challenge  Has no raw data records  Slow table scan on high cardinality dimensions  Inverted Index Storage (an ongoing effort)  Persist the raw table  Bitmap inverted index  Time range partition  In-memory (block cache)  Parallel scan (endpoint coprocessor)
  • 34. Agenda  What’s Apache Kylin?  Tech Highlights  Performance  Open Source  Q & A
  • 35. http://kylin.io Kylin vs. Hive 200 150 100 50 0 # Query Type SQL #1 SQL #2 SQL #3 Return Dataset Query On Kylin (s) Query On Hive (s) Comments 1 High Level Aggregation 4 0.129 157.437 1,217 times 2 Analysis Query 22,669 1.615 109.206 68 times 3 Drill Down to Detail 325,029 12.058 113.123 9 times 4 Drill Down to Detail 524,780 22.42 6383.21 278 times 5 Data Dump 972,002 49.054 N/A Hive Kylin High Level Aggregatio n Analysis Query Drill Down to Detail Low Level Aggregatio n Transactio n Level Based on 12+B records case
  • 36. http://kylin.io Performance -- Concurrency Linear scale out with more nodes
  • 37. http://kylin.io Performance - Query Latency 90%tile queries <5s Green Line: 90%tile queries Gray Line: 95%tile queries
  • 38. Agenda  What’s Apache Kylin?  Tech Highlights  Performance  Open Source  Q & A
  • 39. http://kylin.io  Kylin Core  Fundamental framework of Kylin OLAP Engine  Extension  Plugins to support for additional functions and features  Integration  Lifecycle Management Support to integrate with other applications  Interface  Allows for third party users to build more features via user-interface atop Kylin core  Driver  ODBC and JDBC Drivers Kylin OLAP Core Extension  Security  Redis Storage  Spark Engine Interface  Web Console  Customized BI  Ambari/Hue Plugin Integration  ODBC Driver  ETL  Scheduling Kylin Ecosystem
  • 40. http://kylin.io Kylin Evolution Roadmap 2013 2014 2015 Jan, 2014 Initial Prototype for MOLAP • Basic end to end POC MOLAP • Incremental Refresh • ANSI SQL • ODBC Driver • Web GUI • ACL • Open Source HOLAP • InvertedIndex • JDBC Driver • Automation • Capacity Managment • Support Spark • … more Next Gen • In-Memory Analysis • Real Time & Streaming (TBD) • … more TBD Future… Sep, 2013 Sep, 2014 Q1, 2015
  • 41. http://kylin.io Open Source  Kylin Site:  http://kylin.io  Twitter:  @ApacheKylin  Source Code Repo:  https://github.com/KylinOLAP  Google Group:  Kylin OLAP