Design Cube in Kylin
dev@kylin.incubator.apache.org
Before You Start
• Kylin is a MOLAP engine on Hadoop.
• Understanding Kylin helps a lot with cube design.
– http://www.slideshare.net/YangLi43/apache-kylin-deep-dive-2014-dec
• This deck summarizes best practices and
patterns on how to design an efficient cube.
– For detailed steps to create a cube, check out
https://github.com/KylinOLAP/Kylin/wiki/Kylin-Cube-Creation-Tutorial
Overview
• Identify Star Schema
• Design Cube
– Dimensions
– Measures
– Incremental Build
– Advanced Options
• Build and Verify
Identify Star Schema
• Kylin creates a cube from a star schema of Hive
tables.
• One fact table that has ever-growing records, like
transactions.
• A few dimension tables that are relatively static,
like users and products.
• Hive tables must be synced into Kylin first.
Know Cardinalities of Columns
• Cardinality has a significant impact on cube size and query
latency.
– High Cardinality: > 1,000
– Ultra High Cardinality (UHC): > 1,000,000
• Avoid UHC columns as much as possible.
– If the column is used as an indicator, put the indicator in the cube.
– Try to categorize values or derive features from the UHC column rather
than putting the original value in the cube.
• To find column cardinalities:
– select count(distinct A) from T
– or search for dedicated profiling tools
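The cardinality check above can be sketched as a small script. This is a minimal illustration only: it uses an in-memory SQLite table as a stand-in for a Hive table, and the table `T`, its columns, and the label "normal" are assumptions made for the example. The two thresholds are the ones from the slide.

```python
import sqlite3

# Thresholds taken from the slide; the label names are illustrative.
HIGH = 1_000
ULTRA_HIGH = 1_000_000

def classify(cardinality: int) -> str:
    """Map a distinct-value count to the slide's cardinality classes."""
    if cardinality > ULTRA_HIGH:
        return "ultra high"
    if cardinality > HIGH:
        return "high"
    return "normal"

def column_cardinality(conn, table: str, column: str) -> int:
    """Run `select count(distinct column) from table`.

    Identifiers cannot be bound as SQL parameters, so only pass trusted names.
    """
    row = conn.execute(f"select count(distinct {column}) from {table}").fetchone()
    return row[0]

# Hypothetical sample data standing in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("create table T (A integer, B text)")
conn.executemany(
    "insert into T values (?, ?)",
    [(i, "x" if i % 2 else "y") for i in range(1500)],
)

card_a = column_cardinality(conn, "T", "A")
card_b = column_cardinality(conn, "T", "B")
print("A:", card_a, classify(card_a))
print("B:", card_b, classify(card_b))
```

On a real cluster the same query would run against Hive, where an exact distinct count over a large table can be expensive.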
Cube Concepts
Cube = all combinations of dimensions
Cuboid = one combination of dimensions
Curse of dimensionality: an N-dimension cube has 2^N cuboids
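To make the count concrete, the sketch below enumerates every cuboid of a small cube as a subset of its dimensions. The function name `cuboids` is illustrative and not part of Kylin.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every subset of the dimensions, i.e. every cuboid."""
    dims = list(dimensions)
    result = []
    # Walk from the full combination down to the empty one.
    for size in range(len(dims), -1, -1):
        result.extend(combinations(dims, size))
    return result

all_cuboids = cuboids(["A", "B", "C"])
for c in all_cuboids:
    print(c)
print(len(all_cuboids), "cuboids for 3 dimensions")
```

Each added dimension doubles the count, so 10 dimensions already give 1,024 cuboids.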
Design Dimensions
• 15 dimensions or fewer is ideal.
– More than that slows down the cube build and
increases query latency.
– Does the user really need a report with 15+ dimensions?
– You can define multiple cubes on one star schema to
fulfill different analysis scenarios.
• Control the total number of dimensions.
– Mandatory dimension
– Hierarchy dimension
– Derived dimension
Mandatory Dimension
• A dimension that is present in every query.
– like Date
• A mandatory dimension cuts the number of cuboid combinations in half.
Normal Dimensions
A B C
A B -
- B C
A - C
A - -
- B -
- - C
- - -
A is Mandatory
A B C
A B -
A - C
A - -
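The two lists above can be reproduced with a short sketch that keeps only the combinations containing the mandatory dimension. The helper name `cuboids_with_mandatory` is an assumption for illustration.

```python
from itertools import combinations

def cuboids_with_mandatory(dimensions, mandatory):
    """Enumerate cuboids that always contain every mandatory dimension."""
    others = [d for d in dimensions if d not in mandatory]
    result = []
    # Only the non-mandatory dimensions are free to appear or not.
    for size in range(len(others), -1, -1):
        for combo in combinations(others, size):
            result.append(tuple(mandatory) + combo)
    return result

with_a = cuboids_with_mandatory(["A", "B", "C"], ["A"])
for c in with_a:
    print(c)
```

With A mandatory, 3 dimensions yield 4 cuboids instead of 8.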
Hierarchy Dimension
• Dimensions that form a “contains” relationship where the
parent level is required for the child level to make sense.
– like Year -> Month -> Day; or Country -> City
• A hierarchy dimension reduces the combinations from 2^N to N+1.
Normal Dimensions
A B C
A B -
- B C
A - C
A - -
- B -
- - C
- - -
A->B->C is Hierarchy
A B C
A B -
A - -
- - -
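A hierarchy only allows prefixes of the level list, which is where N+1 comes from. A minimal sketch, with `hierarchy_cuboids` as an illustrative name:

```python
def hierarchy_cuboids(levels):
    """Return the valid cuboids of a hierarchy: each prefix of the levels."""
    # A child level never appears without all of its parents.
    return [tuple(levels[:n]) for n in range(len(levels), -1, -1)]

date_cuboids = hierarchy_cuboids(["Year", "Month", "Day"])
for c in date_cuboids:
    print(c)
```

Three levels give 4 cuboids (including the empty one) rather than 8.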
Derived Dimension
• Dimensions on a lookup table that can be derived from its primary key (PK).
– like User ID derives [Name, Age, Gender]
• A derived dimension reduces the combinations from 2^N to 2 at the
cost of extra runtime aggregation.
Normal Dimensions
A B C
A B -
- B C
A - C
A - -
- B -
- - C
- - -
A, B, C are Derived by ID
ID
-
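The "extra runtime aggregation" can be pictured as follows. This is a conceptual sketch, not Kylin's implementation: the lookup table `users`, the pre-aggregated values in `cuboid_by_id`, and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lookup table keyed by the primary key (User ID).
users = {1: {"gender": "F"}, 2: {"gender": "M"}, 3: {"gender": "F"}}

# Hypothetical pre-aggregated cuboid on ID: user id -> sum(amount).
cuboid_by_id = {1: 10.0, 2: 5.0, 3: 2.5}

def aggregate_by_derived(cuboid, lookup, attribute):
    """Roll a cuboid keyed by PK up to a derived attribute at query time."""
    totals = defaultdict(float)
    for key, value in cuboid.items():
        # Translate the PK to the derived attribute, then re-aggregate.
        totals[lookup[key][attribute]] += value
    return dict(totals)

by_gender = aggregate_by_derived(cuboid_by_id, users, "gender")
print(by_gender)
```

Only the ID cuboid is stored; a query grouped by Gender pays for the roll-up when it runs.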
The Order of Dimensions
• Finally, define dimensions in the following order.
– Mandatory dimensions
– Dimensions that are heavily used in filters
– High cardinality dimensions
– Low cardinality dimensions
• Putting filter dimensions first helps cut down query scan ranges.
• Putting high cardinality dimensions first helps calculate the cube
efficiently.
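One way to read the ordering rule is as a sort key. The sketch below assumes a hypothetical `Dimension` record with flags you would fill in from your own knowledge of the queries; the example dimensions and their cardinalities are made up.

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    cardinality: int
    mandatory: bool = False
    filter_heavy: bool = False

def order_dimensions(dims):
    """Mandatory first, then filter-heavy, then by descending cardinality."""
    return sorted(
        dims,
        key=lambda d: (not d.mandatory, not d.filter_heavy, -d.cardinality),
    )

# Hypothetical dimensions for illustration.
dims = [
    Dimension("gender", 2),
    Dimension("user_id", 500_000),
    Dimension("date", 3_650, mandatory=True),
    Dimension("country", 200, filter_heavy=True),
]
ordered = [d.name for d in order_dimensions(dims)]
print(ordered)
```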
Define Measures
• Kylin currently supports:
– Sum
– Count
– Max
– Min
– Average
– Distinct Count (based on HyperLogLog)
• Distinct Count is a very heavy data type.
– An error rate < 1.22% takes 64 KB per cell.
– Convince users to accept the largest tolerable error rate.
– Distinct Count is slower to build and query compared to other
measures.
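The size and error figures can be related with the standard HyperLogLog estimate, whose relative standard error is about 1.04 / sqrt(m) for m registers. The sketch below assumes one byte per register and a three-sigma bound; both are assumptions that happen to reproduce the slide's 64 KB and 1.22%, not a statement of how Kylin stores the measure.

```python
import math

def hll_registers(precision: int) -> int:
    """Number of registers for a given precision (m = 2^precision)."""
    return 2 ** precision

def hll_error_bound(precision: int, sigmas: float = 3.0) -> float:
    """Approximate error bound: sigmas * 1.04 / sqrt(m)."""
    return sigmas * 1.04 / math.sqrt(hll_registers(precision))

for p in (10, 12, 14, 16):
    size_kb = hll_registers(p) / 1024  # assuming one byte per register
    print(f"precision {p}: {size_kb:g} KB per cell, bound {hll_error_bound(p):.4f}")
```

Dropping two bits of precision shrinks each cell by 4x while doubling the error bound, which is why a looser tolerance pays off quickly.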
Incremental Build
• Kylin supports incremental build along a time dimension if enabled.
• Once a start time is set, cube segments can be built daily (or at any period),
processing only the incremental data.
• A segment can be refreshed relatively cheaply to reflect changes in the
Hive table.
• As the number of segments grows, queries slow down a
bit.
• Merge segments to keep the total number below 10 for best
performance.
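As a rough picture of merging, the sketch below models segments as half-open date ranges and merges the oldest adjacent pair until fewer than 10 remain. The merge policy and the function `merge_oldest` are illustrative assumptions, not Kylin's own merge strategy.

```python
from datetime import date, timedelta

def merge_oldest(segments, max_segments=9):
    """Merge the two oldest adjacent segments until at most max_segments remain.

    Each segment is a (start, end) pair covering [start, end).
    """
    segs = sorted(segments)
    while len(segs) > max_segments:
        (start, _), (_, end) = segs[0], segs[1]
        segs[:2] = [(start, end)]
    return segs

# Thirty hypothetical daily segments.
day0 = date(2015, 1, 1)
daily = [(day0 + timedelta(days=i), day0 + timedelta(days=i + 1)) for i in range(30)]

merged = merge_oldest(daily)
for start, end in merged:
    print(start, "->", end)
```

Old data ends up in one large segment while recent days stay separate and cheap to refresh.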
Advanced Options
• Leave advanced options as is if you are not sure what they mean.
• Aggregation groups give the finest control over which cuboids to build.
– Partial cube -- Only combinations within the same group are built.
– For a cube with 30 dimensions, dividing the dimensions into 3 groups reduces the
number of cuboids from about 1 billion to about 3 thousand.
• 2^30 => 2^10 + 2^10 + 2^10
– It is a tradeoff between online aggregation and offline pre-aggregation.
• A query is efficient when all the dimensions involved come from a single aggregation
group; otherwise runtime aggregation slows the query down.
– Capture query patterns with your aggregation groups.
– Keep fewer than 10 dimensions in one group, or the cube will be huge.
– A dimension can appear in multiple groups.
– Creating a second cube with a different aggregation group is also an option.
• Rowkeys are generated in the order of the dimensions. There is no need to change them.
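The aggregation group arithmetic is easy to check. The sketch below uses the same simple count as the slide (each group of k dimensions contributes 2^k combinations); the function names are illustrative.

```python
def full_cube_cuboids(n_dims: int) -> int:
    """Cuboids in a full cube of n_dims dimensions."""
    return 2 ** n_dims

def grouped_cuboids(group_sizes) -> int:
    """Simple count for a partial cube: sum of 2^size over the groups.

    This follows the slide's arithmetic and ignores any overlap between groups.
    """
    return sum(2 ** size for size in group_sizes)

full = full_cube_cuboids(30)
grouped = grouped_cuboids([10, 10, 10])
print(f"full cube: {full:,} cuboids")
print(f"3 groups of 10: {grouped:,} cuboids")
```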
Build and Verify
• Once the cube is created, build it; it is then ready to verify.
• Check the expansion rate of your cube.
– Under 10 times is ideal.
• Notes on SQL
– Write queries against the original Hive tables; cubes are
transparent at query time.
– Sanity check: select count(*) from fact
– Make sure the join relationships (inner or left) match the cube
definition exactly.
– Kylin works best with a group by clause.
– A date constant is written like date ‘1970-01-01’
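The expansion rate check can be written down as a ratio of cube size to source data size. This is a minimal sketch: the definition as a simple size ratio, the function names, and the sample sizes are assumptions for illustration; only the 10x target comes from the slide.

```python
IDEAL_MAX_EXPANSION = 10  # "under 10 times is ideal", from the slide

def expansion_rate(cube_size_bytes: int, source_size_bytes: int) -> float:
    """Ratio of built cube size to source data size."""
    if source_size_bytes <= 0:
        raise ValueError("source size must be positive")
    return cube_size_bytes / source_size_bytes

def is_ideal(rate: float) -> bool:
    return rate < IDEAL_MAX_EXPANSION

# Hypothetical sizes: a 2 GiB source table and a 9 GiB cube.
rate = expansion_rate(9 * 1024**3, 2 * 1024**3)
print(f"expansion rate: {rate:.1f}x, ideal: {is_ideal(rate)}")
```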
Q & A
Thanks!
