Just when the world of “Data 1.0” showed signs of maturing, “outside-in” demands have already set off disruptive changes to the data landscape. Parallel growth in the volume, velocity and variety of data, coupled with the incessant push to find newer insights and value from data, has posed a big question: Is Your Data Warehouse Relevant?
In short, these changes happening in real time around us are the new “Data 2.0”. It is characterized by feeding ever-hungry minds with sharper insights, whether related to regulation, finance, corporate actions, risk management, or purely aimed at improving operational efficiency. The sources feeding this new “Data 2.0” world have to keep pace with the outside-in demands of customers, regulators, stakeholders and business users; hence, you need a high-“relformance” (relevance + performance) data warehouse that stays relevant to your business ecosystem and has the power to scale exponentially.
The webinar starts by giving the audience a sneak preview of what happened in the Data 1.0 world and which characteristics are shaping the new Data 2.0 world. It then delves into the challenges that growing data volumes have posed for data warehouse teams and presents some practical, proven methodologies for addressing these performance challenges. Finally, it highlights some thought-provoking ways to turbocharge your data warehouse initiatives by leveraging newer technologies such as Hadoop. Overall, the webinar shows the audience how to build a high performance, relevant data warehouse that can meet these newer demands while significantly driving down total cost of ownership.
Designing a High Performance Data Warehouse
1. Welcome to the webinar on Designing a High Performance Data Warehouse
Presented by
&
2. Contents
1. What happened in the Data 1.0 World
2. What is shaping the new Data 2.0 World
3. Designing a High Performance Data Warehouse
4. Q&A
3. What happened in the Data 1.0 World?
Before 2000: Do we need a DWH?
2000s: select success, top-down & bottom-up; advent of the ODS
Now: business led; we’ve got BI / DWH tools
Recurring themes along the way:
- Volume | Variety | Velocity | Value
- Performance vs. volume: a game changer
- Need insights from non-structured data as well
- Drill-down reporting from the DWH getting into the mainstream
- Analytics is a differentiator
- Data silos
- Metrics for success?
- OLAP = insights
- Painful implementations
- Show me the ROI
- Standardized KPIs
- Analytics as a differentiator?
- (Data) Big, real-time, in-memory: what to do with existing initiatives?
- Retaining skills and expertise
- Data 2.0: scale, performance, knowledge, relevance
4. Challenges in the current DW environment - Survey
- 42% say: can’t scale to big data volumes
- 27% say: inadequate data load speed
- 27% say: poor query response
- 25% say: existing DW modeled for reports & OLAP only
- Other top responses (24%, 24%, 23%, 19%, 18%, 18%, 15%, 15%, 9%): can’t score analytic models fast enough; cost of scaling up or out is too expensive; can’t support a high concurrent user count; inadequate support for in-memory processing; current platform needs great manual effort for performance; poorly suited to real-time workloads; can’t support in-database analytics; poor CPU speed and capacity; current platform is a legacy, we must phase it out
TDWI research based on 278 respondents - top responses
5. Data 2.0 World
- Data sources: social media data, text data, sensor data, syndicated data, numeric data
- Every 18 months, non-rich structured and unstructured enterprise data doubles.
- High Performance Data Warehouse characteristics: speed, concurrency enabled, able to handle complexity, ability to scale
- Business outcomes: true sentiment, faster compliance, faster reach, Big Data analytics, analytics = competitive advantage, efficiencies driving down costs, customer experience & service
- Business is now equipped to consume, identify and act upon this data for superior insights.
6. So what is a High Performance Data Warehouse?
Key Dimensions
8. Key dimensions of a High Performance Data Warehouse
- SPEED: streaming big data; event processing; real-time operation; operational BI; near-time analytics; dashboard refresh; fast queries
- CONCURRENCY: competing workloads (OLAP, analytics); intraday data loads; thousands of users; ad hoc queries
- SCALE: big data volumes; detailed source data; thousands of reports; scale out into cloud, clusters, grids, etc.
- COMPLEXITY: big data variety (unstructured, sensor, social media); many sources / targets; complex models and SQL; high availability
10. Industry recognized top techniques
- 45% say: creating summary tables
- 44% say: adding indexes
- 33% say: altering SQL statements or routines
- Other responses (24%, 24%, 21%, 20%, 16%, 16%, 16%, 15%, 10%, 6%): changing physical data models; using in-memory databases; upgrading hardware; choosing between column- and row-oriented data storage; restricting or throttling user queries; moving an application to a separate data mart; applying workload management controls; shifting some workloads to off-peak hours; adjusting system parameters; others
TDWI research based on 329 responses from 114 respondents
12. Summary table design process
COLLECT: a good sampling of queries. These may come from user interviews, testing / QA queries, production queries, reports, or any other means that provides a good representation of expected production queries.
ANALYZE: the dimension hierarchy levels, dimension attributes, and fact table measures that are required by each query or report.
IDENTIFY: the row counts associated with each dimension level represented.
BALANCE: the most commonly required dimension levels against the number of rows in the resulting summary tables. A goal should be to design summary tables that are roughly 1/100th the size of the source fact tables in terms of rows (or less).
MINIMIZE: the columns that are carried in the summary table, in favor of joining back to the dimension table. The larger the summary table, the less performance advantage it provides.
Some of the best candidates for aggregation will be those where the row counts decrease the most from one level in a hierarchy to the next.
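For illustration, a minimal sketch of such a summary table at the District / Calendar Month grain, in generic SQL. The fact and dimension tables (fact_sales, dim_store, dim_date) and their columns are hypothetical, not taken from the deck:

  -- Aggregate fact rows up to District x Calendar Month
  CREATE TABLE summary_sales_district_month AS
  SELECT st.district,
         d.calendar_year,
         d.calendar_month,
         SUM(f.sales_qty) AS sales_qty,
         SUM(f.sale_amt)  AS sale_amt,
         COUNT(*)         AS fact_row_count  -- a "count" column preserves some information about non-additive facts
  FROM   fact_sales f
  JOIN   dim_store  st ON st.store_key = f.store_key
  JOIN   dim_date   d  ON d.date_key   = f.date_key
  GROUP BY st.district, d.calendar_year, d.calendar_month;

Only the district name is carried from the dimension; other store attributes are obtained by joining back to dim_store, which keeps the summary table narrow (the MINIMIZE step above).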
13. Capturing requirements for Summary table
- Choosing aggregates to create: two basic pieces of information are required to select the appropriate aggregates:
  - Expected usage patterns of the data
  - Data volumes and distributions in the fact table
[The slide shows a worked example: Reports 1-11 are mapped to the dimension levels each one requires across the Store Geography hierarchy (Division, Region, District, Store), the Item hierarchy (Subject, Category, Department, Item) and the Date hierarchy (Calendar Year, Calendar Month, Fiscal Year, Fiscal Quarter, Fiscal Period, Fiscal Week), together with the measures Sales_Qty and Sale_Amt, alongside the number of populated members at each dimension level (for example 1 division, 3 regions, 50 districts and 3,980 stores).]
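As a sketch of gathering the second input (data volumes and distributions), a query like the following, against a hypothetical dim_store table, returns the member counts at each level of the store geography hierarchy; large drops from one level to the next (e.g. thousands of stores rolling up to a few dozen districts) mark the best aggregation candidates:

  -- Count distinct members at each level of the store geography hierarchy
  SELECT COUNT(DISTINCT division)  AS divisions,
         COUNT(DISTINCT region)    AS regions,
         COUNT(DISTINCT district)  AS districts,
         COUNT(DISTINCT store_key) AS stores
  FROM   dim_store;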
14. Summary table design considerations
Aggregate storage column selection
- Semi-additive and all non-additive fact data need not be stored in the summary table
- Add as many “pre-calculated” columns as possible
- “Count” columns could be added for non-additive facts to preserve a portion of the information
Recreating vs. updating aggregates
- It is efficient for aggregation programs to update the aggregate tables with the newly loaded data
- Regeneration is more appropriate if there is a lot of program logic to determine what data must be updated in the aggregate table
Storing aggregate rows
- A combined table containing base-level fact rows and aggregate rows
- A single aggregate table which holds all aggregate data for a single base fact table
- A separate table for each aggregate created (the most preferred option)
Storing aggregate dimension data (multiple hierarchies in a single dimension)
- Store all of the aggregate dimension records together in a single table
- Use a separate table for each level in the dimension
- Add dimension data to the aggregate fact table
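Where the platform supports MERGE, an incremental update of the summary table sketched earlier might look like the following; the load_batch_id filter is a hypothetical way of identifying the newly loaded fact rows:

  -- Fold only the newly loaded fact rows into the existing aggregate
  MERGE INTO summary_sales_district_month s
  USING (
      SELECT st.district, d.calendar_year, d.calendar_month,
             SUM(f.sales_qty) AS sales_qty,
             SUM(f.sale_amt)  AS sale_amt,
             COUNT(*)         AS fact_row_count
      FROM   fact_sales f
      JOIN   dim_store  st ON st.store_key = f.store_key
      JOIN   dim_date   d  ON d.date_key   = f.date_key
      WHERE  f.load_batch_id = 20240101     -- hypothetical identifier of the latest load
      GROUP BY st.district, d.calendar_year, d.calendar_month
  ) n
  ON (s.district = n.district AND s.calendar_year = n.calendar_year AND s.calendar_month = n.calendar_month)
  WHEN MATCHED THEN UPDATE SET
       s.sales_qty      = s.sales_qty      + n.sales_qty,
       s.sale_amt       = s.sale_amt       + n.sale_amt,
       s.fact_row_count = s.fact_row_count + n.fact_row_count
  WHEN NOT MATCHED THEN INSERT (district, calendar_year, calendar_month, sales_qty, sale_amt, fact_row_count)
       VALUES (n.district, n.calendar_year, n.calendar_month, n.sales_qty, n.sale_amt, n.fact_row_count);

Full regeneration (drop and recreate the summary table) is the simpler alternative when working out which rows changed costs more than rebuilding.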
16. Dimension table indexing
- Create a non-clustered primary key on the surrogate key of each dimension table
- A clustered index on the business key should be considered; it:
  - Enhances query response when the business key is used in the WHERE clause
  - Helps avoid lock escalation during the ETL process
- For large Type 2 SCDs, create a four-part non-clustered index: business key, record begin date, record end date and surrogate key
- Create non-clustered indexes on columns in the dimension that will be used for searching, sorting, or grouping
- If there is a hierarchy in a dimension, such as Category - Sub Category - Product ID, create an index on the hierarchy
Example (employee dimension):
- EmployeeKey: non-clustered
- EmployeeNationalIDAlternateKey: clustered
- EmployeeNationalIDAlternateKey, StartDate, EndDate, EmployeeKey: non-clustered
- FirstName, LastName, DepartmentName: non-clustered
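A minimal sketch of these guidelines as DDL; the CLUSTERED / NONCLUSTERED keywords are SQL Server style, and the DimEmployee table and index names are illustrative:

  -- Non-clustered primary key on the surrogate key
  ALTER TABLE DimEmployee
      ADD CONSTRAINT PK_DimEmployee PRIMARY KEY NONCLUSTERED (EmployeeKey);

  -- Clustered index on the business key
  CREATE CLUSTERED INDEX IX_DimEmployee_BusinessKey
      ON DimEmployee (EmployeeNationalIDAlternateKey);

  -- Four-part index for a large Type 2 SCD
  CREATE NONCLUSTERED INDEX IX_DimEmployee_SCD2
      ON DimEmployee (EmployeeNationalIDAlternateKey, StartDate, EndDate, EmployeeKey);

  -- Columns commonly used for searching, sorting or grouping
  CREATE NONCLUSTERED INDEX IX_DimEmployee_Name
      ON DimEmployee (LastName, FirstName, DepartmentName);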
17. Fact table indexing
- Create a clustered, composite index composed of each of the foreign keys in the fact table
- Keep the most commonly queried date column as the leftmost column in the index
- There can be more than one date in the fact table, but there is usually one date that is of most interest to business users. A clustered index on this column has the effect of quickly segmenting the amount of data that must be evaluated for a given query
Example (clustered index columns): OrderDateKey, ProductKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, DueDateKey
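The same example as DDL; the FactInternetSales table name is assumed, the key columns are those listed on the slide, and OrderDateKey leads because it is the most commonly queried date:

  CREATE CLUSTERED INDEX IX_FactInternetSales_Keys
      ON FactInternetSales (OrderDateKey, ProductKey, CustomerKey, PromotionKey,
                            CurrencyKey, SalesTerritoryKey, DueDateKey);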
19. Row Store and Column Store
Most queries do not process all the attributes of a particular relation.
- Row store: (+) easy to add/modify a record; (-) might read in unnecessary data
- Column store: (+) only needs to read in the relevant data; (-) tuple writes require multiple accesses
One can obtain the performance benefits of a column store using a row store by making some changes to the physical structure of the row store:
- Vertical partitioning
- Using index-only plans
- Using materialized views
20. Vertical Partitioning
Process:
- Fully vertically partition each relation: each column = 1 physical table
- This can be achieved by adding an integer position column to every table; adding an integer position is better than adding a primary key
- Join on position for multi-column fetches
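A sketch of this in generic SQL, using a hypothetical three-column sales relation; each column lives in its own physical table keyed by an integer position, and a multi-column fetch joins on that position:

  -- One physical table per column, each carrying the row position
  CREATE TABLE sales_district  (pos INTEGER, district  VARCHAR(30));
  CREATE TABLE sales_sales_qty (pos INTEGER, sales_qty INTEGER);
  CREATE TABLE sales_sale_amt  (pos INTEGER, sale_amt  DECIMAL(12,2));

  -- Multi-column fetch: join the single-column tables on position
  SELECT d.district, SUM(a.sale_amt) AS total_amt
  FROM   sales_sale_amt a
  JOIN   sales_district d ON d.pos = a.pos
  GROUP BY d.district;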
21. Index-only plans
Process:
- Add a B+Tree index for every table.column
- Plans never access the actual tuples on disk
- Headers are not stored, so per-tuple overhead is less
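A minimal sketch of the idea with a covering index in generic SQL (table and column names are hypothetical): because the index contains every column the query touches, the plan can be satisfied from the index alone without fetching the underlying rows:

  -- Index covers both the filter column and the aggregated column
  CREATE INDEX ix_fact_sales_date_amt ON fact_sales (date_key, sale_amt);

  -- Answerable entirely from the index (an index-only / covering scan)
  SELECT date_key, SUM(sale_amt) AS total_amt
  FROM   fact_sales
  WHERE  date_key BETWEEN 20230101 AND 20230131
  GROUP BY date_key;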
23. Hadoop ecosystem
An ecosystem of open source projects hosted by the Apache Foundation, built on concepts Google developed and shared:
- Distributed Storage (HDFS): a distributed file system that has the ability to scale out
- Distributed Processing (MapReduce)
- Metadata Management (HCatalog)
- Query (Hive)
- Scripting (Pig)
- Non-Relational Database (HBase)
- Data Extraction & Loading (HCatalog APIs, WebHDFS, Talend Open Studio for Big Data, Sqoop)
- Workflow & Scheduling (Oozie)
- Management & Monitoring (Ambari, ZooKeeper)
24. Promising uses of Hadoop in DW context
- Data staging: Hadoop allows organizations to deploy an extremely scalable and economical ETL environment
- Data archiving: Hadoop’s scalability and low cost enable organizations to keep all data forever in a readily accessible online environment
- Schema flexibility: Hadoop enables the growing practice of “late binding”; instead of transforming data as it is ingested, structure is applied at runtime
- Processing flexibility: Hadoop can quickly and easily ingest any data format
- Distributed DW architecture: off-load workloads for big data and advanced analytics to HDFS, discovery platforms and MapReduce
25. What led to a Data Warehouse at Facebook
- The problem: data, data and more data; 200 GB per day in March 2008, growing to 2+ TB (compressed) per day
- The Hadoop experiment: superior in availability, scalability and manageability compared to commercial databases; uses the Hadoop File System (HDFS)
- Challenges with Hadoop: programmability and metadata; MapReduce is hard to program; need to publish data in well-known schemas
HIVE
- What is Hive? A system for managing and querying structured data built on top of Hadoop; uses MapReduce for execution and HDFS for storage
- Key building principles: SQL on structured data as a familiar data warehousing tool; pluggable map/reduce scripts in the language of your choice; rich data types; performance
- Tables: each table has a corresponding directory in HDFS; a table can point to existing data directories in HDFS; data can be split based on the hash of a column, mainly for parallelism
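A minimal HiveQL sketch of these principles (the page_views table and its columns are illustrative, not from the Facebook deck): the table maps to a directory in HDFS, partitions and buckets split the data for parallelism, and queries are familiar SQL executed as MapReduce jobs:

  -- Each Hive table corresponds to a directory in HDFS
  CREATE TABLE page_views (
      user_id BIGINT,
      url     STRING,
      ts      TIMESTAMP
  )
  PARTITIONED BY (dt STRING)                  -- one sub-directory per partition
  CLUSTERED BY (user_id) INTO 32 BUCKETS      -- split on a hash of user_id for parallelism
  STORED AS ORC;

  -- Familiar SQL, compiled into MapReduce jobs over HDFS
  SELECT dt, COUNT(DISTINCT user_id) AS daily_users
  FROM   page_views
  WHERE  dt BETWEEN '2014-01-01' AND '2014-01-07'
  GROUP BY dt;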
27. Analytical platforms overview
Purpose-built database management systems designed explicitly for query processing and analysis that provide dramatically higher price/performance and availability compared to general-purpose solutions.
Vendors: 1010data, Aster Data (Teradata), Calpont, DATAllegro (Microsoft), Exasol, Greenplum (EMC), IBM Smart Analytics, Infobright, Kognitio, Netezza (IBM), Oracle Exadata, ParAccel, Pervasive, Sand Technology, SAP HANA, Sybase IQ (SAP), Teradata, Vertica (HP)
Deployment options:
- Software only (ParAccel, Vertica)
- Appliance (SAP, Exadata, Netezza)
- Hosted (1010data, Kognitio)
Examples:
- Kelley Blue Book: consolidates millions of auto transactions each week to calculate car valuations
- AT&T Mobility: tracks purchasing patterns for 80M customers daily to optimize targeted marketing
28. Which platform do you choose?
[The slide maps the platform options (Hadoop, analytic database, general-purpose RDBMS) against the type of data to be handled: structured, semi-structured and unstructured.]
29. Thank You
Please send your feedback and corporate training / consulting services requirements on BI to sameer@compulinkacademy.com