1. 12th
July 2017 BICOD'2017@London.United Kingdom 1
Taming Size and Cardinality of OLAP Data
Cubes over Big Data
Alfredo Cuzzocrea University of Trieste & ICAR
Rim Moussa LaTICE Lab. & University of Carthage
Achref Labidi LaTICE Lab. & University of Carthage
The 31st British International Conference on Databases
@ London, United Kingdom
12th of July, 2017
Outline
Data Warehouse Systems
DWS Architectures
OLAP cube
DSS Benchmarks
TPC-H*d: a Multi-dimensional Database Benchmark
TPC-H*d
AutoMDB
Application Scenarios of TPC-H*d
Benchmarking Data Servers
Benchmarking Multidimensional DB Schemas
Benchmarking Parallel OLAP Servers
Conclusion & Research Agenda
Data Warehouse Architectures: Lazy Data Integration
Query-driven Architecture
[Figure: relational data sources, each behind a WRAPPER, queried through a MEDIATOR]
Data Warehouse Architectures: Eager Data Integration
Warehouse System Architecture
[Figure: relational data sources feed the Data Warehouse through the integration workflows of the Integration System]
Data Warehousing
--OLAP Cube
Facts: the objects that represent the subject of the desired analyses.
»Examples: sales records, weather records, cab trips, …
»The fact table contains 3 types of attributes: measured attributes, foreign keys to dimension tables, and degenerate dimensions
Dimension(s):
»Levels make up a dimension's hierarchy; each level holds individual member values
»Examples
»Date dimension (Trimester, month, day)
»Time dimension (hour, min, sec)
»Geography dimension (Country, city, postal code)
Measure(s):
»Examples: revenue, lost revenue, sold quantities, expenses, …
»Use aggregate functions: min, max, count, distinct-count, sum, average, …
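As a minimal sketch of how facts, dimension levels and a measure fit together, the following aggregates a revenue measure at different levels of a Date and a Geography dimension. The rows, the level names and the `aggregate` helper are hypothetical and only serve as illustration; a real OLAP server would precompute and cache such cells.

```python
from collections import defaultdict

# Hypothetical fact rows: (trimester, month, country, city, revenue).
# The first four fields are dimension levels; revenue is the measure.
FACTS = [
    ("T1", "Jan", "UK", "London", 120.0),
    ("T1", "Feb", "UK", "London", 80.0),
    ("T1", "Feb", "IT", "Trieste", 50.0),
    ("T2", "Apr", "IT", "Trieste", 70.0),
]
# Position of each level inside a fact row (assumed layout).
LEVELS = {"trimester": 0, "month": 1, "country": 2, "city": 3}


def aggregate(facts, group_levels):
    """Sum the revenue measure for each combination of the given levels."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[LEVELS[name]] for name in group_levels)
        totals[key] += row[4]
    return dict(totals)


# Fine-grained cells, a roll-up to coarser levels, and the grand total.
by_month_city = aggregate(FACTS, ["month", "city"])
by_trimester_country = aggregate(FACTS, ["trimester", "country"])
grand_total = aggregate(FACTS, [])
```

Rolling up from (month, city) to (trimester, country) only changes the grouping key; the measure and its aggregate function (here sum) stay the same.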
Query Languages
Structured Query Language (SQL)
»Relational and static schema
»Data Definition, Data Manipulation, and Data Control Languages
»Analytic functions (window functions, OVER (PARTITION BY …))
»CUBE, ROLLUP and GROUPING SETS operators
MultiDimensional eXpressions (MDX)
»Invented by Microsoft in 1997
»For querying and manipulating the multidimensional data stored in OLAP cubes
»Static schema
Data flow programming languages
»Google Sawzall, Apache Pig Latin, IBM InfoSphere Streams
»Dynamic schema
»After data is loaded, multiple operators are applied to that data before the final output is stored.
[Figure: data flow pipeline: Load Data -> Apply Schema -> Apply Filter -> Group Data -> Apply Aggregate Function -> Sort Data -> Store Output]
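The data flow stages (load, apply schema, filter, group, aggregate, sort, store) can be sketched in plain Python. This is not Pig Latin or Sawzall; the input data, the field names and the filter threshold are hypothetical, and the input is only loosely in the pipe-delimited style of TPC-H .tbl files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw input, pipe-delimited.
RAW = "1|UK|120.0\n2|IT|50.0\n3|UK|80.0\n4|FR|5.0\n"

# 1. Load data: rows arrive as untyped strings.
rows = list(csv.reader(io.StringIO(RAW), delimiter="|"))

# 2. Apply schema: names and types are attached after loading (dynamic schema).
schema = [("order_id", int), ("country", str), ("revenue", float)]
records = [
    {name: cast(value) for (name, cast), value in zip(schema, row)}
    for row in rows
]

# 3. Apply filter: drop small orders (threshold chosen for illustration).
kept = [r for r in records if r["revenue"] >= 10.0]

# 4. Group data and 5. apply an aggregate function (sum per country).
totals = defaultdict(float)
for r in kept:
    totals[r["country"]] += r["revenue"]

# 6. Sort data by aggregated revenue, largest first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# 7. Store output (an in-memory buffer stands in for a file).
out = io.StringIO()
csv.writer(out, delimiter="|", lineterminator="\n").writerows(ranked)
result = out.getvalue()
```

The schema is applied after loading, which is what distinguishes this style from SQL and MDX, where the schema is fixed before any data is queried.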
Query Languages
MDX –Q16 of TPC-H Benchmark
WITH
  SET [Brands] AS 'Except({[Part Brand].Members}, {[Part Brand].[Brand#45 ]})'
  SET [Types] AS 'Filter({[Part Type].Members},
      (NOT ([Part Type].CurrentMember.Name MATCHES "(?i)MEDIUM POLISHED.*")))'
  SET [Sizes] AS 'Filter({[Part Size].Members},
      ([Part Size].CurrentMember IN {[Part Size].[3], [Part Size].[9], [Part Size].[14],
       [Part Size].[19], [Part Size].[23], [Part Size].[36], [Part Size].[45], [Part Size].[49]}))'
SELECT [Measures].[Supplier Count] ON COLUMNS,
  nonemptyCrossjoin(nonemptyCrossjoin([Brands], [Types]), [Sizes]) ON ROWS
FROM [Cube16]
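To make the three named sets and the non-empty cross join easier to read, here is a rough Python analogue of the statement. The members and the supplier counts are invented, and the regular expression is applied with full-match semantics, which is an assumption about how MATCHES behaves rather than something the slide states.

```python
import re
from itertools import product

# Hypothetical dimension members.
brands = ["Brand#12", "Brand#45", "Brand#31"]
types = ["MEDIUM POLISHED TIN", "ECONOMY ANODIZED STEEL"]
sizes = [3, 7, 9]

# Hypothetical cube cells: (brand, type, size) -> supplier count.
# A missing key stands for an empty cell.
supplier_count = {
    ("Brand#12", "ECONOMY ANODIZED STEEL", 3): 4,
    ("Brand#31", "ECONOMY ANODIZED STEEL", 9): 2,
    ("Brand#45", "ECONOMY ANODIZED STEEL", 3): 8,
    ("Brand#12", "MEDIUM POLISHED TIN", 9): 5,
    ("Brand#12", "ECONOMY ANODIZED STEEL", 7): 1,
}

# [Brands]: Except(all brands, {Brand#45}).
brand_set = [b for b in brands if b != "Brand#45"]

# [Types]: Filter(all types, NOT name MATCHES "(?i)MEDIUM POLISHED.*").
pattern = re.compile(r"(?i)MEDIUM POLISHED.*")
type_set = [t for t in types if not pattern.fullmatch(t)]

# [Sizes]: Filter(all sizes, member IN the listed sizes).
wanted_sizes = {3, 9, 14, 19, 23, 36, 45, 49}
size_set = [s for s in sizes if s in wanted_sizes]

# NonEmptyCrossJoin: keep only combinations that have a cell value.
rows = [
    (cell, supplier_count[cell])
    for cell in product(brand_set, type_set, size_set)
    if cell in supplier_count
]
```

Dropping empty combinations matters for cube size: the plain cross join of the three sets grows with the product of their cardinalities, while the non-empty variant is bounded by the number of populated cells.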
Decision Support Systems Benchmarks
Non-TPC Benchmarks
Real datasets
»Open data or proprietary data
»fixed size
»Devise a workload or trace the proprietary workload
APB-1: no scale factor
TPC Benchmarks
The Transaction Processing Performance Council (TPC) was founded in 1988 to define
benchmarks
In 2009, TPC-TC was set up as an International Technology
Conference Series on Performance Evaluation and Benchmarking
Examples of benchmarks relevant for benchmarking decision support
systems: TPC-H, TPC-DS and TPC-DI
Common characteristics of TPC benchmarks
»Synthetic data
»A scale factor allows the generation of different data volumes, from 1GB to 1PB
Decision Support Systems Benchmarks
TPC-H Benchmark Schema (1/2)
TPC-H Benchmark
22 ad-hoc SQL statements (star queries, nested queries, …) + refresh functions
Decision Support Systems Benchmarks
TPC-H Benchmark (2/2)
TPC-H Benchmark: 2 Metrics
»QphH@Size is the number of queries processed per hour that the system
under test (SUT) can handle for a fixed load
»$/QphH@Size represents the ratio of cost to performance, where the cost is
the cost of ownership of the SUT (hardware, software, maintenance).
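A small sketch of how the two metrics relate. In the TPC-H specification the composite QphH@Size is the geometric mean of the Power@Size and Throughput@Size results; the input figures below are made up for illustration and are not measurements from this work.

```python
import math


def qphh(power_at_size: float, throughput_at_size: float) -> float:
    """Composite TPC-H metric: geometric mean of power and throughput."""
    return math.sqrt(power_at_size * throughput_at_size)


def price_performance(total_cost: float, qphh_at_size: float) -> float:
    """$/QphH@Size: cost of ownership divided by the composite metric."""
    return total_cost / qphh_at_size


# Hypothetical inputs, for illustration only.
composite = qphh(power_at_size=4000.0, throughput_at_size=2500.0)
dollars_per_qphh = price_performance(total_cost=50_000.0, qphh_at_size=composite)
```

A lower $/QphH@Size is better: it means fewer dollars spent per unit of query throughput.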
Variants of TPC-H Benchmarks
TPC-H*d Benchmark [Cuzzocrea and Moussa, 2013]
»Turning TPC-H benchmark into a Multi-dimensional benchmark
»Few schema changes
»Same TPC-H workload
»2 MDX workloads: a query workload and a cube-then-query workload
SSB: Star Schema Benchmark [O’Neil et al., 2012]
»Turning TPC-H benchmark into star-schema
»Workload composed of 13 queries
TPC-H translated into Pig Latin (Apache Hadoop Ecosystem) [Moussa, 2012]
»22 Pig Latin scripts which load and process TPC-H raw data files (.tbl files)
Decision Support Systems Benchmarks
TPC-DS Benchmark (2/2)
TPC-DS Benchmark Workload
About a hundred queries (99 query templates)
OLAP, windowing functions, mining, and reporting queries
ACID and Concurrent data maintenance (not ACID in TPC-DS 2.x)
TPC-DS Benchmark Metrics
2 main Metrics
»QphDS@Size is the number of queries processed per hour that the
system under test (SUT) can handle for a fixed load.
»Data Maintenance and Load Time are calculated
»$/QphDS@Size represents the ratio of cost to performance, where the
cost is a 3-year cost of ownership of the SUT (hardware, software,
maintenance)
TPC-DS implementations
TPC-DS v2.0
»Extension for non-relational systems such as Hadoop/Spark big data
systems
MDB Design Problem
Given,
A relational Warehouse schema
A Workload - a set of OLAP business queries,
W = {Q1, Q2, …, Qn}
where Qi is a parameterized query
How to design the Multi-dimensional DB Schema?
How to define cubes?
Will there be a single cube or multiple cubes?
Are there any rules for merging cubes?
Which optimizations are suitable for performance tuning?
Derived data calculus & refresh? (materialized views, derived attributes, indexes, …)
Data partitioning & parallel cube building?
Idea
Map each business query to an OLAP cube
>> Obtain a multi-dimensional DB schema
Recommend & Test Optimizations
>> Derived Data
>> Data partitioning
>> Cube Merging
TPC-H*d
TPC-H*d OLAP Cube C8
Market Share for each supplier nation within a region of customers,
for each year and each part type
TPC-H*d
TPC-H*d OLAP Query Q8
Market share of RUSSIAN suppliers within the AMERICA region,
over the years 1995 and 1996, for part type ECONOMY ANODIZED STEEL
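As an illustration of the measure behind this query, the sketch below computes a per-year market share as the nation's volume divided by the total volume, which is how TPC-H Q8 defines it. The rows are hypothetical and assumed to be already restricted to AMERICA customers and to the chosen part type.

```python
from collections import defaultdict

# Hypothetical pre-joined rows: (order_year, supplier_nation, volume),
# where volume stands for extended price * (1 - discount).
ROWS = [
    (1995, "RUSSIA", 100.0),
    (1995, "CHINA", 300.0),
    (1996, "RUSSIA", 150.0),
    (1996, "CHINA", 350.0),
    (1996, "INDIA", 500.0),
]


def market_share(rows, nation):
    """Return {year: nation volume / total volume} for the given nation."""
    nation_volume = defaultdict(float)
    total_volume = defaultdict(float)
    for year, supplier_nation, volume in rows:
        total_volume[year] += volume
        if supplier_nation == nation:
            nation_volume[year] += volume
    return {
        year: nation_volume[year] / total_volume[year]
        for year in sorted(total_volume)
    }


shares = market_share(ROWS, "RUSSIA")
```

In cube terms, the query slices cube C8 on one supplier nation, one customer region and one part type, and keeps the year level on the rows.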
AutoMDB
Open source software implemented in Java
Parses MDB schema (.xml) files using the SAX library
Performs comparisons of OLAP cubes' characteristics.
»For each pair of OLAP cubes,
»show whether they have the same fact table or not
»compute the number of shared | different | coalescable dimensions
»Dimensions are coalescable if they are extracted from the same dimension table
and their hierarchies are coalescable
»compute the number of shared | different measures
»Run merge of OLAP cubes using different similarity functions
»Simple distance function: whether or not the cubes have the same fact table
»K-means clustering
»Distance function is computed with weights assigned to cube characteristics
»Propose Virtual Cubes
»Auto-generate a new MDB Schema (.xml)
»Create MDB Schema from TPC-DS SQL Workload
»On-going
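The pairwise cube comparison can be sketched as follows. This is not the AutoMDB code (which is in Java): the schema is a simplified, hypothetical fragment in the style of a Mondrian XML file, the weights are arbitrary, and a Jaccard distance is swapped in for the dimension and measure terms. Coalescable dimensions are left out for brevity.

```python
import xml.sax

# Simplified, hypothetical MDB schema with two cubes.
SCHEMA = """
<Schema name="demo">
  <Cube name="C_A">
    <Table name="lineitem"/>
    <Dimension name="Order Date"/>
    <Dimension name="Supplier Nation"/>
    <Measure name="Revenue" column="l_extendedprice" aggregator="sum"/>
  </Cube>
  <Cube name="C_B">
    <Table name="lineitem"/>
    <Dimension name="Order Date"/>
    <Dimension name="Part Type"/>
    <Measure name="Revenue" column="l_extendedprice" aggregator="sum"/>
  </Cube>
</Schema>
"""


class CubeHandler(xml.sax.ContentHandler):
    """Collect fact table, dimensions and measures of each cube."""

    def __init__(self):
        super().__init__()
        self.cubes = {}
        self._current = None

    def startElement(self, name, attrs):
        if name == "Cube":
            self._current = {"fact": None, "dimensions": set(), "measures": set()}
            self.cubes[attrs["name"]] = self._current
        elif self._current is None:
            return
        elif name == "Table" and self._current["fact"] is None:
            # Assumption: the first Table inside a Cube is its fact table.
            self._current["fact"] = attrs["name"]
        elif name == "Dimension":
            self._current["dimensions"].add(attrs["name"])
        elif name == "Measure":
            self._current["measures"].add(attrs["name"])

    def endElement(self, name):
        if name == "Cube":
            self._current = None


def compare(a, b):
    """Count shared and different characteristics of two cubes."""
    return {
        "same_fact": a["fact"] == b["fact"],
        "shared_dimensions": len(a["dimensions"] & b["dimensions"]),
        "different_dimensions": len(a["dimensions"] ^ b["dimensions"]),
        "shared_measures": len(a["measures"] & b["measures"]),
        "different_measures": len(a["measures"] ^ b["measures"]),
    }


def jaccard_distance(x, y):
    union = x | y
    return 1.0 - len(x & y) / len(union) if union else 0.0


def distance(a, b, w_fact=0.5, w_dim=0.3, w_meas=0.2):
    """Weighted distance in [0, 1]; the weights are illustrative only."""
    fact_term = 0.0 if a["fact"] == b["fact"] else 1.0
    return (
        w_fact * fact_term
        + w_dim * jaccard_distance(a["dimensions"], b["dimensions"])
        + w_meas * jaccard_distance(a["measures"], b["measures"])
    )


handler = CubeHandler()
xml.sax.parseString(SCHEMA.encode("utf-8"), handler)
report = compare(handler.cubes["C_A"], handler.cubes["C_B"])
d = distance(handler.cubes["C_A"], handler.cubes["C_B"])
```

A distance of this kind can feed a clustering step: cubes that land in the same cluster become candidates for merging into a single cube or for grouping under a virtual cube.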
Outline
Introduction
Part I: Data warehousing
Part II: Multidimensional DB Design
Part III: Application Scenarios
Benchmarking Data Servers
Benchmarking Multidimensional DB Schemas
Benchmarking Parallel OLAP servers
Conclusion and Research agenda
Benchmarking Data Servers
--Column-oriented storage systems vs row-oriented storage systems
Columnar Storage Systems
»High IO performance: less data moving from hard drives to memory
»Efficient Memory Management: load only required data into memory
»Reduced Storage: columns with low cardinality are compressed
»Efficient Schema Modifying Techniques: adding new columns will not
induce a file storage re-organization
Types
»Binary Association Tables
»Each column is stored in a separate (surrogate key, value) table
»RDBMS: MonetDB
»Family of columns
»Design techniques are based on measuring the affinity between
attributes through the count of their co-occurrence in the query
workload and clustering attributes
»Vertical partitioning for DB design
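A toy contrast between row and column layouts. The rows are hypothetical, and run-length encoding is used here as just one possible way to compress a low-cardinality column; the slides do not say which scheme a given system applies.

```python
from itertools import groupby

# Row-oriented layout: one tuple per record (hypothetical data).
ROWS = [
    (1, "UK", 120.0),
    (2, "UK", 80.0),
    (3, "UK", 60.0),
    (4, "IT", 50.0),
    (5, "IT", 70.0),
]

# Column-oriented layout: one sequence per attribute.
order_id, country, revenue = (list(column) for column in zip(*ROWS))

# Binary association table: (surrogate key, value) pairs for one column.
country_bat = list(enumerate(country))


def run_length_encode(values):
    """Compress consecutive repeats into (value, run length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# A low-cardinality column compresses well.
encoded_country = run_length_encode(country)

# An aggregate over one attribute reads only that column.
total_revenue = sum(revenue)
```

The aggregate touches the `revenue` column alone, which is the source of the I/O and memory savings listed above; a row store would have to read every attribute of every record.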
Benchmarking Data Servers
--Column-storage systems vs row-based storage systems
Cube   MySQL                    MonetDB
C1     2,778 sec                30 sec
C10    Java heap space Error    758 sec
C11    2,558 sec                2,536 sec
C3     Mondrian Error: Size of cross join exceeded limit
Benchmarking Middleware for Parallel Cube Processing
--OLAP & High Performance Computing
Systems which scale-out through Data Fragmentation and Load Balancing
achieve
»Parallel IO
»Parallel Processing
Technologies
»Parallel Cube processing OLAP servers
»Distributed Relational Data Warehouses + Mid-tier for parallel cube
processing
»Hadoop Systems
»SQL-on-Hadoop Systems
»e.g. Hive, Spark SQL, Drill, Impala, IBM BigInsights, …
Benchmarking Middleware for Parallel Cube Processing
--OLAP* framework Key Considerations for Data Fragmentation
Reduce the Size of Each Cube to be Built at Each Node
»big-cardinality dimensions' partitioning
Simplify Post-Processing of OLAP Cubes
»Cubes which have disjoint dimensions' members have simple post-processing
(union all operation), while the merge of all dimensions'
hierarchies is costly
Enhance Data Maintenance
»DW refresh processing
»Distributed Maintenance Transaction processing
Controlled Replication
»Replication has refresh and storage cost
»Replication optimizes join operations through dimension table
replication
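The fragmentation idea can be sketched as follows, assuming the fact data is partitioned on a high-cardinality dimension key so that each node builds a cube over disjoint members. The rows, the modulo partitioning and the helper names are hypothetical and do not reproduce the OLAP* middleware.

```python
from collections import defaultdict

# Hypothetical fact rows: (customer_key, year, revenue).
FACTS = [
    (101, 1995, 10.0),
    (102, 1995, 20.0),
    (101, 1996, 5.0),
    (103, 1996, 7.5),
    (104, 1995, 2.5),
    (101, 1995, 4.0),
]


def fragment(facts, n_nodes):
    """Partition facts on the customer key so members are disjoint per node."""
    fragments = [[] for _ in range(n_nodes)]
    for row in facts:
        fragments[row[0] % n_nodes].append(row)
    return fragments


def build_cube(facts):
    """Aggregate revenue per (customer_key, year) cell."""
    cells = defaultdict(float)
    for customer_key, year, revenue in facts:
        cells[(customer_key, year)] += revenue
    return dict(cells)


def union_all(partial_cubes):
    """Post-processing: concatenate partial cubes that share no cell."""
    merged = {}
    for cube in partial_cubes:
        assert merged.keys().isdisjoint(cube), "fragments must be disjoint"
        merged.update(cube)
    return merged


# Each node builds a smaller cube; the merge needs no re-aggregation.
partials = [build_cube(part) for part in fragment(FACTS, n_nodes=2)]
merged = union_all(partials)
```

Because every customer lands on exactly one node, no cell appears in two partial cubes and the merge is a plain union. Partitioning on a key that does not appear in the cube would instead require summing overlapping cells during post-processing.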
Benchmarking Middleware for Parallel Cube Processing
--Performance Measurements with MySQL as DB backend
Cube   MySQL                    4 MySQL instances DB
C1     2,778 sec                862 sec
C10    Java heap space Error    13,774 sec
Benchmarking MDB Schemas
MDB Design
»Simple approach: map each query to its required cube(s)
»Sophisticated approach
»Analyze OLAP workload
»Find out shared facts, dimensions and measures
»Define new cubes based on cubes clustering
»Re-write the workload
Benchmarking MDB Schemas
--TPC-H*d Example
_Same fact table
_2 shared dimension tables but different hierarchies
_1 different dimension
_Same measure
Conclusion and Future Work
Performance Leaks
Mondrian cannot build an OLAP cube having more than
2,147,483,647 cells
OLAP cube 20 has 200,052,100,026 cells
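The limit quoted above equals 2^31 - 1, the largest signed 32-bit integer. A quick way to check whether a candidate cube stays under it is to multiply the cardinalities of its dimensions. The cardinalities below are hypothetical; only the limit and the cell count of cube 20 come from the slide.

```python
import math

# Limit quoted above; equal to the largest signed 32-bit integer.
MAX_CELLS = 2**31 - 1


def cell_count(cardinalities):
    """Upper bound on cube cells: product of the dimension cardinalities."""
    return math.prod(cardinalities)


def fits(cardinalities):
    return cell_count(cardinalities) <= MAX_CELLS


# Hypothetical cubes, described by their dimension cardinalities.
small_cube = [25, 7, 150]
large_cube = [100_000, 25, 2_000]

# How far the reported cube 20 exceeds the limit.
cube20_ratio = 200_052_100_026 / MAX_CELLS
```

Partitioning a large-cardinality dimension across nodes divides the corresponding factor of this product, which is one way to bring each partial cube back under the limit.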
Experiments
TPC-H with SF=10
RDBMS: MonetDB and MySQL
Tuning: materialized views and derived attributes
Experiments were run on Suno nodes (Sophia site, Grid5000 HPC platform)
Each node has 32GB of RAM
Mondrian requires more RAM
The XML description of TPC-H and TPC-DS cubes allows us to sketch,
recommend and assess
vertical partitioning techniques for DB design (Family of columns)
materialized views
indexes
Future Work
Intelligent Recommenders for the selection of Indexes and
Materialized Views
Indexes and physical structures that can significantly accelerate
performance
XML description of each cube allows us to recommend
Recommenders for performance tuning
»AutoAdmin research project at Microsoft, which explores techniques
to make databases self-tuning [Agrawal et al., 2000]
»Alerter Approach [Hose et al., 2008]: support the aggregate
configuration of an OLAP server by (1) continuously monitoring
information about the workload and the benefit of aggregation tables
and (2) alerting the DBA if changes to the current configuration would
be beneficial
»Semi-Automatic Index Tuning: keeping DBAs in the loop [Schnaitter and
Polyzotis, 2012]: online workload analysis with decisions delegated to
the DBA. The solution takes into account index interactions
References (1/3)
M. Fricke, The Knowledge Pyramid: A Critique of the DIKW Hierarchy. Journal of Information
Science. 2009.
E.F. Codd, S.B. Codd and C.T. Salley, Providing OLAP to User Analysts: an IT mandate, 1993.
J. Widom, Integrating Heterogeneous databases: eager or lazy? ACM Computing Surveys (CSUR)
Vol.4, 1996
Y.R. Cho, Data Warehouse and OLAP Operations www.ecs.baylor.edu/faculty/cho/4352
TPC homepage http://www.tpc.org/
M. Poess, T. Rabl and B. Caufield: TPC-DI: The First Industry Benchmark for Data
Integration. PVLDB 7(13): 1367-1378 (2014)
http://www.vldb.org/pvldb/vol7/p1367-poess.pdf
X. Li, J. Han, H. Gonzalez: High-Dimensional OLAP: A Minimal Cubing Approach. VLDB 2004.
C. Imhoff, N. Galemmo, J. G. Geiger. Mastering Data Warehouse Design: Relational and
Dimensional Techniques. 2003.
R. Kimball, M. Ross, W. Thornthwaite, J. Mundy, B. Becker. The Data Warehouse
Lifecycle Toolkit. 2nd Edition.
R. Kimball, M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling. 2nd Edition.
H. G. Molia. Data Warehousing Overview: Issues, Terminology, Products.
www.cs.uh.edu/~ceick/6340/dw-olap.ppt (slides)
References (2/3)
Modeling Multidimensional Databases (non exhaustive list)
M. Gyssens and L. V.S. Lakshmanan. A Foundation for Multi-Dimensional Databases.
VLDB’1997.
R. Agrawal, A. Gupta and S. Sarawagi. Modeling Multidimensional Databases.
ICDE’1997.
J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Data Cube: A Relational Aggregation
Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. ICDE’1996.
P. Vassiliadis. Modeling Multidimensional Databases, Cubes and Cube Operations.
SSDBM’1998.
L. Cabibbo and R. Torlone. A Logical Approach to Multidimensional Databases.
EDBT’1998.
D. Cheung, B. Zhou, B. Kao, H. Lu, T. Lam and H. Ting. Requirement-based data cube
schema design. CIKM’1999.
T. Niemi, J. Nummenmaa and P. Thanisch. Constructing OLAP cubes based on Queries.
DOLAP’2001.
O. Teste. Towards Conceptual Multidimensional Design in Decision Support Systems.
DEXA’2010.
A. Cuzzocrea and R. Moussa. Multidimensional Database Design
via Schema Transformation: Turning TPC-H into the TPC-H*d
Multidimensional Benchmark. COMAD’2013.
References (3/3)
M. Fowler, Schemaless data structures. 2013 http://martinfowler.com/articles/schemaless/
N. Marz and J. Warren, Big Data: Principles and best practices of scalable realtime data
systems, 1st Edition
S. Agrawal, S. Chaudhuri and V. Narasayya Automated Selection of Materialized Views and
Indexes for SQL Databases. VLDB’2000
http://www.research.microsoft.com/dmx/AutoAdmin
K. Hose, D. Klan, M. Marx and K. Sattler. When is it Time to Rethink the Aggregate
Configuration of Your OLAP Server?. VLDB’2008
Karl Schnaitter and Neoklis Polyzotis. Semi-Automatic Index Tuning: Keeping DBAs in the
Loop. VLDB’2012
P. Zhao, X. Li, D. Xin and J. Han.
Graph cube: on warehousing and OLAP multidimensional networks. SIGMOD’2011
L. D. Lins, J. T. Klosowski and C. E. Scheidegger:
Nanocubes for Real-Time Exploration of Spatiotemporal Datasets. IEEE Trans. Vis. Comput.
Graph. 2013 https://github.com/laurolins/nanocube
Thank you for your Attention
Q & A
Taming Size and Cardinality of OLAP Data Cubes over Big
Data
Alfredo Cuzzocrea, Rim Moussa and Achref Labidi
12th of July, 2017
Decision Support Systems Benchmarks
TPC-DI Benchmark (1/3)
[Poess et al. 2014]
For benchmarking Data Integration technologies
Synthetic Data of a Fictitious Retail Brokerage Firm
»Internal Trading system data, Internal Human resources data, Internal
CRM System and External data
»Different data scales
»Data extracted from different sources:
»Structured (csv)
»Semi-structured data (xml)
»Multi record (nested data)
»Change Data Capture (CDC)
18 Complex Data Integration Tasks
Load large volumes of historical data
Load incremental updates
Execute complex transformations
Check and ensure consistency of data
TPC-H*d
Truly OLAP variant of TPC-H benchmark
TPC-H SQL workload translated into MDX (MultiDimensional
eXpressions)
The workload is composed of 23 MDX statements for OLAP
cubes and 23 MDX statements for OLAP business queries.
Each business question of TPC-H benchmark is mapped to an OLAP
cube