MODELLING DATA
FOR SCALABLE,
AD HOC ANALYSIS
ASSEN TOTIN
Senior Engineer R&D
MariaDB Corporation
WHY ARE WE HERE?
● If working with a transactional (row) storage is like driving a car (and almost as
ubiquitous)...
● … then working with an analytical storage is like driving with a trailer.
● Bottom line: change your driving attitude or you’re not going to make it even out
of the parking lot!
QUICK SUMMARY
● The analytical workload.
● MariaDB ColumnStore brief.
● ColumnStore data modelling: preparing data for loading, preparing appropriate
schema, optimizing the queries and finding your way around them.
● Moving data to ColumnStore: usage scenarios.
● Q & A
COLUMNSTORE AND
THE ANALYTICAL
WORKLOAD
THE ANALYTICAL WORKLOAD
● Relatively small set of functions needed, compared to general-purpose
scientific work.
● If needed, the business logic can be moved outside the data storage. Thus the
storage can be reduced to its most basic storing and retrieval functions.
● Data is mostly historic, hence time-sequenced, almost exclusively appended
and rarely – if at all – updated. Data is almost never deleted.
● Large sets of data are retrieved in batches, often full columns or contiguous parts of them.
MARIADB
COLUMNSTORE
COLUMNSTORE STORAGE
● Dedicated columnar
storage
● Data is organised in
a hierarchical
structure (unlike flat
row-based storages)
COLUMNSTORE STORAGE
● Each database is a directory,
each table is a directory inside it,
each column is a file inside it
● Columns are split into multiple
files (extents) of equal size (8M
cells)
● Optional compression, defined
per-table
COLUMNSTORE STORAGE
● Data can be loaded (written) directly into extents.
● Completely bypasses the SQL layer, leaving it free
to process queries.
● Once writing completes, we then notify the
processing engine that new data is available.
COLUMNSTORE STORAGE
● For each extent, some metadata is calculated and stored in memory (MIN and MAX values, etc.).
● Divide-and-conquer strategy for queries: eliminate all unnecessary extents and load only the ones needed.
COLUMNSTORE STORAGE
COLUMNSTORE CLUSTER
● In ColumnStore, a module is a set of running processes.
● Two types of modules (nodes), User Module (UM) and Performance Module
(PM).
● User Module: provides client connectivity (speaks SQL) and has local storage
engines (InnoDB, MyISAM...). More UM = more concurrent connections and
HA. UM can be replicated.
● Performance Module: stores actual data. More PM = more data stored.
● For dev purposes, one UM and one PM may live together in a single OS.
COLUMNSTORE TABLES
● ColumnStore is a storage engine in MariaDB.
● To create a ColumnStore table, use
CREATE TABLE… ENGINE=ColumnStore
● Just as with any other MariaDB server, you can mix-and-match different storage
engines in one database.
● Just as with any other MariaDB server, you can do a cross-engine JOIN
between ColumnStore tables and tables in local storage engines on the UM.
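A minimal sketch of both points, assuming a hypothetical bookstore-style schema (fact_sales and dim_customer are illustrative names, not from the deck):

-- A ColumnStore fact table: data will be distributed across the PM nodes
CREATE TABLE fact_sales (
    sale_date   DATE,
    customer_id INT,
    amount      DECIMAL(10,2)
) ENGINE=ColumnStore;

-- A local InnoDB dimension table living on the UM
CREATE TABLE dim_customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100)
) ENGINE=InnoDB;

-- Cross-engine JOIN between the two (resolved on the UM)
SELECT c.customer_name, SUM(s.amount) AS total
FROM fact_sales s
JOIN dim_customer c ON c.customer_id = s.customer_id
GROUP BY c.customer_name;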
COLUMNSTORE DATA DISTRIBUTION
● ColumnStore tables are always distributed (assuming more than one PM).
● ColumnStore distributes data across the PM nodes in round-robin fashion.
● When a new (empty) PM is added, it receives data until its size catches up with the other PMs.
● Manual control over data distribution is possible when side-loading via the Bulk
Load API: cpimport modes 2 & 3.
COLUMNSTORE DATA DISTRIBUTION
COLUMNSTORE
DATA MODELLING
NO INDICES, PLEASE!
● ColumnStore has no indices: with big data, indices do not fit into memory and become useless.
● This reduces I/O drastically; ColumnStore I/O requirements are significantly lower than for row storage (it works very well on spinning media).
● It also removes the CPU load previously spent on maintaining indices.
● The filesystem is always in-sync: file-level backup in real-time is again possible
and natural.
● Direct injection of data into the storage (bypassing SQL layer) is now possible.
● Instead of indices, ColumnStore uses divide-and-conquer to only load what’s
needed to serve a query.
PREPARING DATA FOR LOAD
● ColumnStore will append the data in the order we send it, so it is up to us to
order it.
● In order for the divide-and-conquer approach to work best, data has to be
arranged in sequential fashion (because then the most extents can be
eliminated before actual data read from disk begins).
● Examine your data and identify columns with incremental (or time-based)
ordering.
● Examine your queries and find which of these columns is most often used as a predicate.
● Order the data by this column prior to loading it.
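For example, if trans_date is the column most often used as a predicate, the batch can be pre-sorted on it while exporting from the OLTP side (the oltp_transactions table name and file path are illustrative):

-- Export the batch pre-sorted on the most commonly filtered time column
SELECT trans_date, customer_id, discount, discounted_price
FROM oltp_transactions
WHERE trans_date BETWEEN '2018-01-01' AND '2018-01-31'
ORDER BY trans_date
INTO OUTFILE '/tmp/transactions_batch.csv'
FIELDS TERMINATED BY ',';

The sorted file can then be fed to LOAD DATA or cpimport, as shown in the data-moving section at the end.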
CLUSTERING THE SCHEMA
● ColumnStore follows the map/reduce approach: each PM does the same work
on its part of the data (map), then all results are aggregated by a UM (reduce).
● To distribute a JOIN (push-down to all PM) one needs to ensure that either
– each node has one of the sides in full, or
– both sides are partitioned by the same key.
● With automated data distribution, ColumnStore finds the smaller side of the
JOIN and redistributes it on-the-fly to facilitate a distributed JOIN. If the smaller
side is bigger than a threshold, the JOIN is pushed up to the UM (which
requires more RAM).
CLUSTERING THE SCHEMA
● The optimal ColumnStore schema will thus consist of a small number of big tables and a larger number of smaller tables, so that a JOIN between a big and a small table can be distributed.
● This schema assumes a high degree of data normalisation, so that the big tables contain as many references as possible to small tables, from which the actual values are derived.
● This schema is usually referred to as a star schema: one big table (in the centre) linked to multiple small tables (around it).
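A minimal star-schema sketch, loosely based on the bookstore example used later in the deck (the customers table and all column lists are illustrative):

-- Dimension tables: small, nearly immutable
CREATE TABLE books (
    book_id INT,
    title   VARCHAR(200),
    author  VARCHAR(100)
) ENGINE=ColumnStore;

CREATE TABLE customers (
    customer_id   INT,
    customer_name VARCHAR(100),
    state         VARCHAR(50)
) ENGINE=ColumnStore;

-- Fact table: one row per sale, referencing the dimensions by key
CREATE TABLE transactions (
    trans_date       DATE,
    book_id          INT,
    customer_id      INT,
    discount         DECIMAL(5,2),
    discounted_price DECIMAL(10,2)
) ENGINE=ColumnStore;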
CLUSTERING THE SCHEMA: STAR
Source: Wikipedia
CLUSTERING THE SCHEMA
● The big table in the centre is called the fact table, because it contains data (rows) related to events (facts) that occurred at different moments in time. These facts are usually tied to the technical or business activity that the schema represents (e.g., each sale could be a fact registered in one row; or each reading of a sensor value in an IoT system, etc.).
● The fact table is amended in each new data load (new rows = new events).
● New rows are appended to the end of the fact table.
● Generally, older (time-wise) facts precede the newer ones.
CLUSTERING THE SCHEMA
● The small tables that are linked to the fact table are called dimension tables,
because they contain data that describes properties of the facts.
● Dimension tables consist of things like nomenclatures and other nearly-immutable data: e.g., the list of states and cities, the list of points of sale, etc.
● Dimension tables are rarely amended.
CLUSTERING THE SCHEMA
● Adding a second layer of links produces a more complex design, sometimes called a snowflake schema.
● In a multi-tier (snowflake) schema, a table may be a dimension to one level and
a fact to another, e.g. the list of telco subscribers may be a fact (linked to
dimensions like the subscription plan), but also a dimension (to which the list of
phone calls links).
CLUSTERING THE SCHEMA: SNOWFLAKE
Source: Wikipedia (diagram: PHONE CALLS → USERS → PLANS)
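The slide's telco example, sketched as DDL under assumed column names: plans is a dimension to users, and users is in turn a dimension to phone_calls.

-- Top-level dimension
CREATE TABLE plans (
    plan_id     INT,
    plan_name   VARCHAR(50),
    monthly_fee DECIMAL(8,2)
) ENGINE=ColumnStore;

-- Fact with respect to plans, dimension with respect to phone_calls
CREATE TABLE users (
    user_id     INT,
    plan_id     INT,
    signup_date DATE
) ENGINE=ColumnStore;

-- Fact table proper: one row per call
CREATE TABLE phone_calls (
    call_ts      DATETIME,
    user_id      INT,
    duration_sec INT
) ENGINE=ColumnStore;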
OPTIMIZING THE QUERIES
● An important prerequisite for properly designing a schema is to know how it is
going to be used.
● Ensure the queries and the star schema match each other.
● Always JOIN a fact table to a dimension table only. Never JOIN two fact tables!
● As each column is a separate set of files, the more columns are requested in the result set, the more data has to be read from disk; always specify the exact columns needed and nothing more; never do SELECT *.
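A sketch of the last point, reusing the query from the EXPLAIN example further down:

-- Bad: reads every column file of the fact table
SELECT * FROM transactions
WHERE trans_date BETWEEN '2018-01-01' AND '2018-01-31';

-- Good: touches only the column files actually referenced
SELECT customer_id, discount, discounted_price
FROM transactions
WHERE trans_date BETWEEN '2018-01-01' AND '2018-01-31';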
OPTIMIZING THE QUERIES
● Filter on sequential columns as much as possible.
● Filter on actual values, not on functions: functions prevent extent elimination and force a full column scan. Add extra derived columns if needed, e.g. a separate column year instead of YEAR(date) – see the sketch after this list.
● ORDER BY and LIMIT run last and always on the UM, so they can be expensive (depending on the amount of data).
● A JOIN with a table from a local storage engine (InnoDB, MyISAM...) is done by first fetching the local table from the UM. As this requires a loopback connection, it is often relatively slow – so consider its usage carefully.
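A sketch of the derived-column idea, assuming a hypothetical trans_year column populated at load time:

-- Prevents extent elimination: the function must be evaluated for every row
SELECT COUNT(*) FROM transactions WHERE YEAR(trans_date) = 2018;

-- Extent-elimination friendly: filters directly on a plain, sequentially loaded column
SELECT COUNT(*) FROM transactions WHERE trans_year = 2018;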
OPTIMIZING DIMENSIONS
● Keep dimensions small (up to 1M rows) as they will be redistributed on-the-fly
for each JOIN.
● Increase the distributed JOIN threshold for bigger dimensions (but carefully). This is a cluster-wide tunable in Columnstore.xml.
EXTENDING COLUMNSTORE ENGINE
● The ColumnStore engine might not always be the best choice (e.g., data type support, encoding support, etc.).
● Local storage engines on the UM may supplement the ColumnStore engine via a cross-engine JOIN.
● Usually multiple UMs are replicated, so tables in local storage engines are replicated as well… but in some special cases you may want to leave them unreplicated and effectively keep different content for the same local table on different UMs; in that case, make sure jobs are configured to run only on the UM they are connected to (with access to the ExeMgr process).
TRACING YOUR STEPS: EXPLAIN
● EXPLAIN works for ColumnStore, but is less useful (no indices)
SELECT t.customer_id, t.discount, t.discounted_price
FROM transactions t
JOIN books b ON b.book_id=t.book_id
WHERE t.trans_date BETWEEN '2018-01-01' AND '2018-01-31';
MariaDB [bookstore]> EXPLAIN SELECT t.customer_id, t.discount, t.discounted_price FROM transactions t JOIN books b ON
b.book_id=t.book_id WHERE t.trans_date BETWEEN '2018-01-01' AND '2018-01-31';
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
| 1 | SIMPLE | t | ALL | NULL | NULL | NULL | NULL | 2000 | Using where with pushed condition |
| 1 | SIMPLE | b | ALL | NULL | NULL | NULL | NULL | 2000 | Using where; |
| | | | | | | | | | Using join buffer (flat, BNL join) |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------------+
TRACING YOUR STEPS: STATS
● Use SELECT calGetStats(), which returns statistics about the resources used on the User Module (UM) node, the PM nodes and the network by the last query run.
● 582979 rows in set (3.373 sec)
MariaDB [bookstore]> SELECT calGetStats();
+---------------------------------+
| Query Stats: |
| MaxMemPct-3; |
| NumTempFiles-0; |
| TempFileSpace-0B; |
| ApproxPhyI/O-71674; |
| CacheI/O-47150; |
| BlocksTouched-47128; |
| PartitionBlocksEliminated-1413; |
| MsgBytesIn-37MB; |
| MsgBytesOut-63KB; |
| Mode-Distributed |
+---------------------------------+
TRACING YOUR STEPS: TRACE
● To trace a query, first enable tracing with SELECT calSetTrace(1), then run
the query, then get the trace with SELECT calGetTrace().
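For example, tracing the bookstore query from the EXPLAIN slide looks like this; the trace shown below is its output:

SELECT calSetTrace(1);   -- enable tracing for this session
SELECT t.customer_id, t.discount, t.discounted_price
FROM transactions t
JOIN books b ON b.book_id=t.book_id
WHERE t.trans_date BETWEEN '2018-01-01' AND '2018-01-31';
SELECT calGetTrace();    -- fetch the trace of the query that just ran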
MariaDB [bookstore]> SELECT calGetTrace();
+-------------------------------------------------------------------------------------------------------------------+
| Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows |
| BPS PM b 301760 (book_id) 0 7 0 0.002 5001 |
| BPS PM t 301805 (book_id,customer_id,discount,discounted_price,trans_date) 0 17280 1413 0.308 582979 |
| HJS PM t-b 301805 - - - - ----- - |
| TNS UM - - - - - - 2.476 582979 |
+-------------------------------------------------------------------------------------------------------------------+
COLUMNSTORE
DATA MOVING
MOVING DATA TO COLUMNSTORE
● Scenario A: Use the same schema as in the transactional (OLTP) storage.
● Only use ColumnStore as long-term cold storage for large amounts of data.
● No OLAP as schema does not match requirements.
● No OLTP as data is too big.
● Copy selected parts of the data back to OLTP engine for processing.
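A sketch of the last step, assuming a hypothetical InnoDB working table created on the UM:

-- Pull one month of archived facts back into a local InnoDB table for OLTP-style processing
CREATE TABLE work_transactions ENGINE=InnoDB AS
SELECT customer_id, book_id, trans_date, discounted_price
FROM transactions
WHERE trans_date BETWEEN '2018-01-01' AND '2018-01-31';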
MOVING DATA TO COLUMNSTORE
● Scenario B: Use dedicated star schema.
● Actively use ColumnStore as OLAP backend.
● Load the data from the OLTP storage in batches: ETL with either LOAD DATA or the Bulk Load API (preferred: cpimport, shared library/JAR) – see the sketch below.
● Use any preferred front-end tool to drive the analytics (Tableau, Pentaho Mondrian, Microsoft SSAS, Apache Zeppelin…).
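A minimal LOAD DATA sketch for the batch pre-sorted in the data-preparation section (file path is illustrative; cpimport is preferred for large batches):

LOAD DATA INFILE '/tmp/transactions_batch.csv'
INTO TABLE transactions
FIELDS TERMINATED BY ','
(trans_date, customer_id, discount, discounted_price);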
HAVE YOUR SAY!
Q&A
● Ask questions now…
● … or ask later. We are here for you!
THANK YOU!