USING THE RIGHT DATA MODEL IN A DATA MART

DAVID M WALKER
DATA MANAGEMENT & WAREHOUSING

INTRODUCTION

•  The concept of a Data Mart as the data access
interface layer for Business Intelligence has been
around for over 25 years
•  Kimball style Dimensional Modelling and Star
Schemas have become the de facto data
modelling technique for data marts
•  These have been and continue to be hugely
successful with relational databases and reporting
tools – but are they the right tool for today's
technologies?

March 2012 © 2012 Data Management & Warehousing 2

WHY IS A STAR SCHEMA SO
SUCCESSFUL?
•  There are three main reasons for creating star
schemas and for their wide acceptance as a technique:

•  Simpler for users to understand

•  Highly performant user queries

•  Optimal disk storage usage


WHAT IS A STAR SCHEMA?
•  A star schema consists of two parts
•  Facts: Measurable numeric and/or time data about an event
•  Dimensions: Descriptive attributes about the event that give the facts a context
•  Facts are stored at a uniform level of detail known as the grain of the data
•  A star schema consists of a fact table and a number of associated dimension tables

[Diagram: example star schema with one fact table and four dimension tables]

SALES FACTS
•  Date Surrogate Key
•  Store Surrogate Key
•  Customer Surrogate Key
•  Product Surrogate Key
•  Sale Time
•  Sale Quantity
•  Sale Unit Price

DATE DIMENSION
•  Date Surrogate Key
•  Date
•  Day
•  Month
•  Year
•  Public Holiday Flag

STORE DIMENSION
•  Store Surrogate Key
•  Store Name
•  Store Number
•  Store Postcode
•  Store Town
•  Store Region

CUSTOMER DIMENSION
•  Customer Surrogate Key
•  Customer Loyalty Number
•  Customer Gender
•  Customer Postcode
•  Customer Town
•  Customer Region

PRODUCT DIMENSION
•  Product Surrogate Key
•  Product SKU
•  Product Name
•  Product Category
•  Product Group
•  Temperature Group


STAR SCHEMAS:
SIMPLER FOR USERS TO UNDERSTAND
•  Intuitive grouping of information
•  e.g. All customer data in one dimension, all store data in another, etc.
•  Much easier queries than on a full relational schema
•  Consequently harder to get the wrong answer because of the wrong join
•  All data is at the same level of granularity
•  Consequently harder to get the wrong answer because of mismatched levels of data

Example query to get the total sales quantity in each product category for March 2012 by female customers in stores in the South West region:

    select P.PRODUCT_CATEGORY,
           sum(SALES_QUANTITY)
    from   SALES_FACTS F,
           DATE_DIMENSION D,
           STORE_DIMENSION S,
           CUSTOMER_DIMENSION C,
           PRODUCT_DIMENSION P
    -- filter on the descriptive attributes held in the dimensions
    where  MONTH = 'March'
    and    YEAR = '2012'
    and    CUSTOMER_GENDER = 'Female'
    and    STORE_REGION = 'South West'
    -- join each dimension to the fact table on its surrogate key
    and    F.DATE_SKEY = D.DATE_SKEY
    and    F.STORE_SKEY = S.STORE_SKEY
    and    F.CUSTOMER_SKEY = C.CUSTOMER_SKEY
    and    F.PRODUCT_SKEY = P.PRODUCT_SKEY
    -- one result row per product category
    group by P.PRODUCT_CATEGORY
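The example query can be tried end to end with a small sketch. This is an illustration only, not part of the original deck: it uses Python's built-in sqlite3 module, cut-down versions of the example tables, invented sample rows, and the region column is named STORE_REGION to match the Store Region attribute in the schema diagram. It also uses explicit JOIN syntax and adds the GROUP BY that an aggregate per category needs.

```python
import sqlite3

# In-memory database with a cut-down version of the example star schema.
# All rows below are invented sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DATE_DIMENSION (DATE_SKEY INTEGER PRIMARY KEY, MONTH TEXT, YEAR TEXT);
CREATE TABLE STORE_DIMENSION (STORE_SKEY INTEGER PRIMARY KEY, STORE_REGION TEXT);
CREATE TABLE CUSTOMER_DIMENSION (CUSTOMER_SKEY INTEGER PRIMARY KEY, CUSTOMER_GENDER TEXT);
CREATE TABLE PRODUCT_DIMENSION (PRODUCT_SKEY INTEGER PRIMARY KEY, PRODUCT_CATEGORY TEXT);
CREATE TABLE SALES_FACTS (
    DATE_SKEY INTEGER,
    STORE_SKEY INTEGER,
    CUSTOMER_SKEY INTEGER,
    PRODUCT_SKEY INTEGER,
    SALES_QUANTITY INTEGER
);
INSERT INTO DATE_DIMENSION VALUES (1, 'March', '2012'), (2, 'April', '2012');
INSERT INTO STORE_DIMENSION VALUES (1, 'South West'), (2, 'North East');
INSERT INTO CUSTOMER_DIMENSION VALUES (1, 'Female'), (2, 'Male');
INSERT INTO PRODUCT_DIMENSION VALUES (1, 'Dairy'), (2, 'Bakery');
INSERT INTO SALES_FACTS VALUES
    (1, 1, 1, 1, 3),   -- March, South West, Female, Dairy
    (1, 1, 1, 1, 2),   -- March, South West, Female, Dairy
    (1, 1, 1, 2, 4),   -- March, South West, Female, Bakery
    (1, 1, 2, 1, 7),   -- male customer: filtered out
    (2, 1, 1, 1, 5),   -- April: filtered out
    (1, 2, 1, 2, 9);   -- North East: filtered out
""")

# Filters apply to dimension attributes; joins are on surrogate keys only.
query = """
SELECT P.PRODUCT_CATEGORY, SUM(F.SALES_QUANTITY)
FROM SALES_FACTS F
JOIN DATE_DIMENSION D ON F.DATE_SKEY = D.DATE_SKEY
JOIN STORE_DIMENSION S ON F.STORE_SKEY = S.STORE_SKEY
JOIN CUSTOMER_DIMENSION C ON F.CUSTOMER_SKEY = C.CUSTOMER_SKEY
JOIN PRODUCT_DIMENSION P ON F.PRODUCT_SKEY = P.PRODUCT_SKEY
WHERE D.MONTH = 'March'
  AND D.YEAR = '2012'
  AND C.CUSTOMER_GENDER = 'Female'
  AND S.STORE_REGION = 'South West'
GROUP BY P.PRODUCT_CATEGORY
ORDER BY P.PRODUCT_CATEGORY
"""
result = conn.execute(query).fetchall()
print(result)  # [('Bakery', 4), ('Dairy', 5)]
```

Every join follows the same pattern (fact key equals dimension key), which is what makes the wrong join hard to write.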


STAR SCHEMAS:
HIGHLY PERFORMANT USER QUERIES
•  Dimensional data has an enforced one-to-many relationship with the fact table
•  Filtering occurs on the (smaller) dimensions
•  e.g. where YEAR = '2012'
•  Aggregation takes place only on the relevant subset of the facts
•  e.g. sum (SALES_QUANTITY)

[Diagram: the example star schema, with the DATE, STORE, CUSTOMER and PRODUCT dimensions joined to the SALES FACTS table]
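The two steps on this slide (filter the small dimension, then aggregate only the matching facts) can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not how any particular database engine implements it; the tiny tables and the names date_dimension, sales_facts and keys_2012 are hypothetical.

```python
# Hypothetical miniature tables: a date dimension and a sales fact table.
date_dimension = [
    {"date_skey": 1, "year": "2011"},
    {"date_skey": 2, "year": "2012"},
    {"date_skey": 3, "year": "2012"},
]
sales_facts = [
    {"date_skey": 1, "sales_quantity": 10},
    {"date_skey": 2, "sales_quantity": 4},
    {"date_skey": 2, "sales_quantity": 6},
    {"date_skey": 3, "sales_quantity": 5},
]

# Step 1: filter the (smaller) dimension to find the qualifying surrogate keys,
# the equivalent of "where YEAR = '2012'".
keys_2012 = {row["date_skey"] for row in date_dimension if row["year"] == "2012"}

# Step 2: aggregate only the facts whose key is in that set,
# the equivalent of "sum (SALES_QUANTITY)".
total_2012 = sum(
    fact["sales_quantity"]
    for fact in sales_facts
    if fact["date_skey"] in keys_2012
)

print(sorted(keys_2012), total_2012)  # [2, 3] 15
```

The text comparison on the year happens once per dimension row rather than once per fact row; the fact table is only touched through its integer keys.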


STAR SCHEMAS:
OPTIMAL DISK STORAGE USAGE
•  If STORE_REGION:
•  had 10 distinct values
•  was stored in the example SALES_FACTS table
•  was on average 10 bytes long
•  This one field alone would require an additional 1 TB of storage (at 10 bytes per row, that corresponds to a fact table of roughly 100 billion rows)
•  Not storing it in the fact table also improves query performance by reducing the disk I/O required to retrieve the information

[Diagram: the example star schema, with Store Region held in the STORE DIMENSION rather than in SALES FACTS]
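The arithmetic behind the storage figure can be checked with a short sketch. The slide gives the 10 byte average and the 1 TB total; the row count is not stated and is derived here by division, taking 1 TB as a decimal terabyte (an assumption). The store count of 1,000 is a purely hypothetical figure chosen for comparison.

```python
AVG_REGION_BYTES = 10          # average length of STORE_REGION, from the slide
EXTRA_STORAGE_BYTES = 10**12   # the 1 TB figure, read as a decimal terabyte (assumption)

# Fact row count implied by the slide's two numbers (not stated in the deck).
implied_fact_rows = EXTRA_STORAGE_BYTES // AVG_REGION_BYTES


def region_bytes_in_fact(fact_rows, avg_bytes=AVG_REGION_BYTES):
    """Bytes needed if the region text is repeated on every fact row."""
    return fact_rows * avg_bytes


def region_bytes_in_dimension(store_rows, avg_bytes=AVG_REGION_BYTES):
    """Bytes needed if the region text is held once per store in the dimension."""
    return store_rows * avg_bytes


HYPOTHETICAL_STORE_ROWS = 1_000  # invented store count, for comparison only

print(implied_fact_rows)                                   # 100000000000
print(region_bytes_in_fact(implied_fact_rows))             # 1000000000000
print(region_bytes_in_dimension(HYPOTHETICAL_STORE_ROWS))  # 10000
```

The fact table already carries the Store Surrogate Key, so moving the region into the dimension adds no column to the fact rows at all.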


SCHEMAS:
THE ALTERNATIVES
[Chart: Complexity, Speed and Space ratings for each of the four schema types]

•  RELATIONAL: Usually used for data warehouses rather than data marts. Favoured solution on MPP technologies due to their power
•  SNOWFLAKE: Favours saving some space in exchange for added user query complexity; usually a techie compromise
•  STAR: De facto standard data mart design based on traditional technologies. Also used as source for OLAP cubes
•  RESULT SET: Large single table with the entire result set; optimal in some circumstances


STAR SCHEMAS:
TECHNOLOGY ASSUMPTIONS
•  There are two major and often unspoken assumptions about
the technologies used to build this sort of environment:

•  Firstly: The database used is a row store database and not a
column store database
•  Secondly: That users will be running reporting tools and OLAP
cubes to access the data

•  Neither of these assumptions is necessarily true. The last 10
years have seen massive innovation in Business Intelligence
technologies that will have an impact on the chosen
architectural solution. Using alternate technologies means
that you should challenge existing designs and embrace
appropriate new designs in order to exploit the technology


UNDERSTAND THE DESIGN IMPACT
OF ALTERNATE TECHNOLOGIES
•  Column Store Databases:
•  What is a column store database?
•  Why are column store databases efficient?
•  How does this affect data mart design?

•  The use of alternate reporting mechanisms:
•  The user requirement gap
•  How users have filled the gap


WHAT IS A
COLUMN STORE DATABASE?
•  Traditionally databases are ‘row-based’, i.e. the fields
of data in a record are stored next to each other:

Forename   Surname   Gender
David      Walker    Male
Helen      Walker    Female
Sheila     Jones     Female

•  Column store databases store the values in columns
and then hold a mapping to form the record
•  This is transparent to the user, who queries a table
with SQL in exactly the same way as they would a
row-based database

COLUMN STORAGE EXAMPLE

Note: To the user this appears as a conventional row-based table that can be queried by standard SQL; it is only the underlying storage that is different.

First Name Value   F Token
David              PPP
Helen              QQQ
Sheila             RRR

Surname Value      S Token
Jones              XXX
Walker             YYY

Gender Value       G Token
Female             AAA
Male               BBB

Mapping that forms each record:

F Token   S Token   G Token
PPP       YYY       BBB
QQQ       YYY       AAA
RRR       XXX       AAA
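The token tables above can be reproduced with a minimal dictionary-encoding sketch. This is a simplified illustration, not any vendor's storage format: it uses small integers instead of the PPP/XXX/AAA tokens, and the helper names encode_column and decode_row are hypothetical.

```python
def encode_column(values):
    """Return (dictionary, tokens): the sorted distinct values, and each
    row's value replaced by its integer position in that dictionary."""
    dictionary = sorted(set(values))
    token_of = {value: token for token, value in enumerate(dictionary)}
    return dictionary, [token_of[value] for value in values]


def decode_row(columns, row_index):
    """Rebuild one record by looking up each column's token for that row."""
    return tuple(dictionary[tokens[row_index]] for dictionary, tokens in columns)


# The three-row example table from the slides, held column by column.
forenames = ["David", "Helen", "Sheila"]
surnames = ["Walker", "Walker", "Jones"]
genders = ["Male", "Female", "Female"]

columns = [encode_column(column) for column in (forenames, surnames, genders)]

surname_dictionary, surname_tokens = columns[1]
print(surname_dictionary)      # ['Jones', 'Walker']
print(surname_tokens)          # [1, 1, 0]
print(decode_row(columns, 0))  # ('David', 'Walker', 'Male')
```

Each distinct string is stored once per column; the rows themselves hold only tokens, which is why repeated values such as Walker or Female cost so little.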


EFFICIENCIES OF COLUMN STORE
DATABASES
•  Column store databases offer significant storage
optimisation opportunities because long strings are not
repeatedly stored
•  In addition it is possible to compress column-stored
data very efficiently
•  It is possible, in some column store implementations, that
the column storage holds additional metadata that can
be used to speed up specific queries (e.g. the number of
records associated with each value in a column)
•  Reducing the data volume stored means reduced I/O
when querying the database, which in turn improves
query performance
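The metadata point (the number of records associated with each value) can be illustrated with a small sketch. The CountedColumn class is hypothetical and only shows the idea: a count query can be answered from the per-value counts without scanning the rows. Real column stores differ in what metadata they keep and how.

```python
from collections import Counter


class CountedColumn:
    """Hypothetical column that keeps a per-value record count as metadata."""

    def __init__(self, values):
        self.values = list(values)
        # Metadata built once when the column is loaded.
        self.value_counts = Counter(self.values)

    def count_equal_scan(self, value):
        """Answer 'how many rows have this value' by reading every row."""
        return sum(1 for stored in self.values if stored == value)

    def count_equal_metadata(self, value):
        """Answer the same question from the metadata, with no row scan."""
        return self.value_counts.get(value, 0)


gender = CountedColumn(["Male", "Female", "Female"])

print(gender.count_equal_scan("Female"))       # 2
print(gender.count_equal_metadata("Female"))   # 2
print(gender.count_equal_metadata("Unknown"))  # 0
```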


COLUMN STORE DATABASES
AND DATA MART SCHEMAS
•  A column store database effectively internally
creates a star schema of every field in a result set
table.
•  This minimises the storage and maximises the query
speed in this type of database
•  Creating a star schema at the table level effectively
duplicates (in a less efficient manner) the
underlying structure that is automatically created
by the database engine
•  Consequently a single table result set is more
efficient in a column store database than a star
schema


SCHEMAS:
THE ALTERNATIVES
[Chart: Complexity, Speed and Space ratings for each schema type on a ROW DB and on a COLUMN DB]

STAR SCHEMA (ROW DB vs COLUMN DB)
•  Column Store Databases improve space usage and increase speed compared to Row Based Databases

RESULT SET SCHEMA (ROW DB vs COLUMN DB)
•  Column Store Databases will significantly improve space usage and speed when compared to Row Based Databases

WHO ARE THE COLUMN STORE
VENDORS?
•  Many of the major database vendors have bought into this
concept, mostly by acquisition

Vendor       Database            SQL Dialect
Actian       Vectorwise          Ingres
EMC          Greenplum           Postgres
HP           Vertica             Postgres
InfoBright   InfoBright          MySQL
ParAccel     ParAccel            Postgres
SAP          HANA (In Memory)
SAP          Sybase IQ           Sybase/TSQL
Teradata     AsterData           Postgres

•  There are multiple other players
•  For more information: Wikipedia & DBMS2


REPORTING TECHNOLOGIES

•  Historically:

•  Reporting tools were initially designed to provide a
‘simplified’ user interface for reporting against relational
schemas rather than writing SQL

•  Schemas were simplified into star schemas and specialist
tools evolved to query both star schemas and OLAP cubes
built on top of the star schemas

•  The focus of the tools was on the ability to report what had
happened from the data


THE USER REQUIREMENT GAP

•  What users had: Historical Reporting
•  Insight into what has happened
•  What users want: Predictive Analytics
•  Understanding what is likely to happen

HOW USERS HAVE FILLED THE GAP

•  Spreadsheets

•  Users love them even if IT hate the
associated data integrity issues
•  Users have adopted the idea of manipulating a worksheet of
data equivalent to a result set table.
•  Spreadsheets can connect to database sources to get data,
often using a ‘join all’ view over a star schema
•  Desktop based spreadsheets now support large data sets
(e.g. Excel supports 1M rows, 16K columns)
•  Emergence of equivalent web based technologies
(e.g. Google Docs)
•  Emergence of low cost, open source equivalents
•  In-built graphing and charting capabilities



•  Statistical Analysis Tools

•  Statistical analysis of data to identify future trends
•  Extracting large result sets to the tools for analysis
•  Connecting to result sets in the database for direct access
•  Emergence of low cost, open source equivalents (R)
•  Emergence of equivalent web based technologies (e.g.
Google Prediction, R Studio)
•  Predictive Model Standards (PMML)
•  In-built graphing and charting capabilities



•  Data Visualisation/Dashboarding Tools
•  Multiple maps, charts, graphs, gauges, sparklines, heat
maps and traffic lights displaying process critical information
•  Often sourced from a result set table that is drip fed
the latest data, automatically generated by
devices (machine generated data)
•  Emergence of agile/rapid development style tools
•  Tools depend on it being easy to load/update the data
to give near real-time information


SCHEMA TYPE SELECTION BASED ON
IMPLEMENTATION TECHNOLOGY
ROW STORE DATABASE
•  Spreadsheets, dashboards and statistical tools: Physical Star Schema with Single Table View
•  Traditional reporting and cubing tools: Physical Star Schema

COLUMN STORE DATABASE
•  Spreadsheets, dashboards and statistical tools: Physical Single Table
•  Traditional reporting and cubing tools: Physical Single Table with Star Schema Views
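The "Physical Star Schema with Single Table View" option can be sketched as follows. This is an illustration under assumptions, not a prescribed design: it uses Python's built-in sqlite3 module, a cut-down schema with invented rows, and the view name SALES_RESULT_SET is hypothetical.

```python
import sqlite3

# Physical star schema (cut down to two dimensions), invented sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STORE_DIMENSION (
    STORE_SKEY INTEGER PRIMARY KEY, STORE_NAME TEXT, STORE_REGION TEXT);
CREATE TABLE PRODUCT_DIMENSION (
    PRODUCT_SKEY INTEGER PRIMARY KEY, PRODUCT_NAME TEXT, PRODUCT_CATEGORY TEXT);
CREATE TABLE SALES_FACTS (
    STORE_SKEY INTEGER, PRODUCT_SKEY INTEGER, SALES_QUANTITY INTEGER);

INSERT INTO STORE_DIMENSION VALUES
    (1, 'Store A', 'South West'), (2, 'Store B', 'North East');
INSERT INTO PRODUCT_DIMENSION VALUES
    (1, 'Milk', 'Dairy'), (2, 'Bread', 'Bakery');
INSERT INTO SALES_FACTS VALUES (1, 1, 3), (1, 2, 4), (2, 1, 5);

-- 'Join all' view: presents the star schema as one wide result set table,
-- so spreadsheet or dashboard users never write a join themselves.
CREATE VIEW SALES_RESULT_SET AS
SELECT S.STORE_NAME, S.STORE_REGION,
       P.PRODUCT_NAME, P.PRODUCT_CATEGORY,
       F.SALES_QUANTITY
FROM SALES_FACTS F
JOIN STORE_DIMENSION S ON F.STORE_SKEY = S.STORE_SKEY
JOIN PRODUCT_DIMENSION P ON F.PRODUCT_SKEY = P.PRODUCT_SKEY;
""")

# The tool queries the view exactly as it would a single physical table.
rows = conn.execute(
    "SELECT STORE_REGION, SUM(SALES_QUANTITY) FROM SALES_RESULT_SET "
    "GROUP BY STORE_REGION ORDER BY STORE_REGION"
).fetchall()
print(rows)  # [('North East', 5), ('South West', 7)]
```

On a column store the arrangement is reversed: the wide table is the physical object and the star schema is exposed through views for tools that expect one.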

IN CONCLUSION …

•  When designing your solution architecture it is
important that you choose the Equivalent Alternate Design
best suited to the technology you are deploying

•  Star Schemas are still the best design pattern to use
when you are using row based databases
•  Result Set Single Tables are more efficient when
using column store databases
•  Consider the users and the tools that they will use
when choosing the schema design type


CONTACT US

•  Data Management & Warehousing
•  Website: http://www.datamgmt.com
•  Telephone: +44 (0) 118 321 5930
•  David Walker
•  E-Mail: davidw@datamgmt.com
•  Telephone: +44 (0) 7990 594 372
•  Skype: datamgmt
•  White Papers: http://scribd.com/davidmwalker


ABOUT US

Data Management & Warehousing is a UK based consultancy
that has been delivering successful business intelligence and
data warehousing solutions since 1995.

Our consultants have worked with major corporations around the
world including the US, Europe, Africa and the Middle East.

We have worked in many industry sectors such as telcos,
manufacturing, retail, financial and transport. We provide
governance and project management as well as expertise in the
leading technologies.

