Just when the world of “Data 1.0” showed signs of maturing, “outside-in” demands have already set off disruptive changes to the data landscape. Parallel growth in the volume, velocity and variety of data, coupled with the incessant push to find newer insights and value from data, has posed a big question: Is Your Data Warehouse Relevant?
In short, these changes happening in real time around us are the new “Data 2.0”. It is characterized by feeding ever-hungry minds with sharper insights, whether related to regulation, finance, corporate actions, risk management, or purely aimed at improving operational efficiency. The sources feeding this new “Data 2.0” world have to keep pace with the outside-in demands of customers, regulators, stakeholders and business users; hence, you need a high-“relformance” (relevance + performance) data warehouse that stays relevant to your business ecosystem and has the power to scale exponentially.
The webinar starts by giving the audience a sneak preview of what happened in the Data 1.0 world and which characteristics are shaping the new Data 2.0 world. It then delves into the challenges that growing data volumes have posed for data warehouse teams and presents some practical, proven methodologies for addressing these performance challenges. Finally, it highlights some thought-provoking ways to turbocharge your data warehouse initiatives by leveraging newer technologies such as Hadoop. Overall, the webinar shows the audience how to build a high performance, relevant data warehouse that can meet these newer demands while significantly driving down total cost of ownership.
Designing a High Performance Data Warehouse
1. Welcome to the webinar on Designing a High Performance Data Warehouse
Presented by
&
2. Contents
1. What happened in the Data 1.0 World
2. What is shaping the new Data 2.0 World
3. Designing a High Performance Data Warehouse
4. Q&A
3. What happened in the Data 1.0 World?
Before 2000: Do we need a DWH?
2000s: select success, top-down & bottom-up; advent of the ODS
Now: business led; we’ve got BI / DWH tools
Recurring themes along the way:
- Volume | Variety | Velocity | Value
- Performance vs. volume: a game changer
- Need insights from non-structured data as well
- Drill-down reporting from the DWH getting into the mainstream
- Analytics is a differentiator
- Data silos
- Metrics for success?
- OLAP = insights
- Painful implementations
- Show me the ROI
- Standardized KPIs
- Analytics as a differentiator?
- (Data) Big, real-time, in-memory: what to do with existing initiatives?
- Retaining skills and expertise
- Data 2.0: scale, performance, knowledge, relevance
4. Challenges in the current DW environment - Survey
- 42% say: can’t scale to big data volumes
- 27% say: inadequate data load speed
- 27% say: poor query response
- 25% say: existing DW modeled for reports & OLAP only
- Other top responses (24%, 24%, 23%, 19%, 18%, 18%, 15%, 15%, 9%): can’t score analytic models fast enough; cost of scaling up or out is too expensive; can’t support a high concurrent user count; inadequate support for in-memory processing; current platform needs great manual effort for performance; poorly suited to real-time workloads; can’t support in-database analytics; poor CPU speed and capacity; current platform is a legacy, we must phase it out
TDWI research based on 278 respondents - top responses
5. Data 2.0 World
- Data sources: social media data, text data, sensor data, syndicated data, numeric data
- Every 18 months, non-rich structured and unstructured enterprise data doubles.
- High Performance Data Warehouse characteristics: speed, concurrency enabled, able to handle complexity, ability to scale
- Business outcomes: true sentiment, faster compliance, faster reach, Big Data analytics, analytics = competitive advantage, efficiencies driving down costs, customer experience & service
- Business is now equipped to consume, identify and act upon this data for superior insights.
6. So what is a High Performance Data Warehouse?
Key Dimensions
8. Key dimensions of a High Performance Data Warehouse
- SPEED: streaming big data; event processing; real-time operation; operational BI; near-time analytics; dashboard refresh; fast queries
- CONCURRENCY: competing workloads (OLAP, analytics); intraday data loads; thousands of users; ad hoc queries
- SCALE: big data volumes; detailed source data; thousands of reports; scale out into cloud, clusters, grids, etc.
- COMPLEXITY: big data variety (unstructured, sensor, social media); many sources / targets; complex models and SQL; high availability
10. Industry recognized top techniques
- 45% say: creating summary tables
- 44% say: adding indexes
- 33% say: altering SQL statements or routines
- Other responses (24%, 24%, 21%, 20%, 16%, 16%, 16%, 15%, 10%, 6%): changing physical data models; using in-memory databases; upgrading hardware; choosing between column- and row-oriented data storage; restricting or throttling user queries; moving an application to a separate data mart; applying workload management controls; shifting some workloads to off-peak hours; adjusting system parameters; others
TDWI research based on 329 responses from 114 respondents
12. Summary table design process
COLLECT: a good sampling of queries. These may come from user interviews, testing / QA queries, production queries, reports, or any other means that provides a good representation of expected production queries.
ANALYZE: the dimension hierarchy levels, dimension attributes, and fact table measures that are required by each query or report.
IDENTIFY: the row counts associated with each dimension level represented.
BALANCE: the most commonly required dimension levels against the number of rows in the resulting summary tables. A goal should be to design summary tables that are roughly 1/100th the size of the source fact tables in terms of rows (or less).
MINIMIZE: the columns that are carried in the summary table, in favor of joining back to the dimension table. The larger the summary table, the less performance advantage it provides.
Some of the best candidates for aggregation will be those where the row counts decrease the most from one level in a hierarchy to the next.
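For illustration, a minimal sketch of such a summary table at the District / Calendar Month grain, in generic SQL. The fact and dimension tables (fact_sales, dim_store, dim_date) and their columns are hypothetical, not taken from the deck:

  -- Aggregate fact rows up to District x Calendar Month
  CREATE TABLE summary_sales_district_month AS
  SELECT st.district,
         d.calendar_year,
         d.calendar_month,
         SUM(f.sales_qty) AS sales_qty,
         SUM(f.sale_amt)  AS sale_amt,
         COUNT(*)         AS fact_row_count  -- a "count" column preserves some information about non-additive facts
  FROM   fact_sales f
  JOIN   dim_store  st ON st.store_key = f.store_key
  JOIN   dim_date   d  ON d.date_key   = f.date_key
  GROUP BY st.district, d.calendar_year, d.calendar_month;

Only the district name is carried from the dimension; other store attributes are obtained by joining back to dim_store, which keeps the summary table narrow (the MINIMIZE step above).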
13. Capturing requirements for Summary table
- Choosing aggregates to create: two basic pieces of information are required to select the appropriate aggregates:
  - Expected usage patterns of the data
  - Data volumes and distributions in the fact table
[The slide shows a worked example: Reports 1-11 are mapped to the dimension levels each one requires across the Store Geography hierarchy (Division, Region, District, Store), the Item hierarchy (Subject, Category, Department, Item) and the Date hierarchy (Calendar Year, Calendar Month, Fiscal Year, Fiscal Quarter, Fiscal Period, Fiscal Week), together with the measures Sales_Qty and Sale_Amt, alongside the number of populated members at each dimension level (for example 1 division, 3 regions, 50 districts and 3,980 stores).]
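As a sketch of gathering the second input (data volumes and distributions), a query like the following, against a hypothetical dim_store table, returns the member counts at each level of the store geography hierarchy; large drops from one level to the next (e.g. thousands of stores rolling up to a few dozen districts) mark the best aggregation candidates:

  -- Count distinct members at each level of the store geography hierarchy
  SELECT COUNT(DISTINCT division)  AS divisions,
         COUNT(DISTINCT region)    AS regions,
         COUNT(DISTINCT district)  AS districts,
         COUNT(DISTINCT store_key) AS stores
  FROM   dim_store;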
14. Summary table design considerations
Aggregate storage column selection
- Semi-additive and all non-additive fact data need not be stored in the summary table
- Add as many “pre-calculated” columns as possible
- “Count” columns could be added for non-additive facts to preserve a portion of the information
Recreating vs. updating aggregates
- It is efficient for aggregation programs to update the aggregate tables with the newly loaded data
- Regeneration is more appropriate if there is a lot of program logic to determine what data must be updated in the aggregate table
Storing aggregate rows
- A combined table containing base-level fact rows and aggregate rows
- A single aggregate table which holds all aggregate data for a single base fact table
- A separate table for each aggregate created (the most preferred option)
Storing aggregate dimension data (multiple hierarchies in a single dimension)
- Store all of the aggregate dimension records together in a single table
- Use a separate table for each level in the dimension
- Add dimension data to the aggregate fact table
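Where the platform supports MERGE, an incremental update of the summary table sketched earlier might look like the following; the load_batch_id filter is a hypothetical way of identifying the newly loaded fact rows:

  -- Fold only the newly loaded fact rows into the existing aggregate
  MERGE INTO summary_sales_district_month s
  USING (
      SELECT st.district, d.calendar_year, d.calendar_month,
             SUM(f.sales_qty) AS sales_qty,
             SUM(f.sale_amt)  AS sale_amt,
             COUNT(*)         AS fact_row_count
      FROM   fact_sales f
      JOIN   dim_store  st ON st.store_key = f.store_key
      JOIN   dim_date   d  ON d.date_key   = f.date_key
      WHERE  f.load_batch_id = 20240101     -- hypothetical identifier of the latest load
      GROUP BY st.district, d.calendar_year, d.calendar_month
  ) n
  ON (s.district = n.district AND s.calendar_year = n.calendar_year AND s.calendar_month = n.calendar_month)
  WHEN MATCHED THEN UPDATE SET
       s.sales_qty      = s.sales_qty      + n.sales_qty,
       s.sale_amt       = s.sale_amt       + n.sale_amt,
       s.fact_row_count = s.fact_row_count + n.fact_row_count
  WHEN NOT MATCHED THEN INSERT (district, calendar_year, calendar_month, sales_qty, sale_amt, fact_row_count)
       VALUES (n.district, n.calendar_year, n.calendar_month, n.sales_qty, n.sale_amt, n.fact_row_count);

Full regeneration (drop and recreate the summary table) is the simpler alternative when working out which rows changed costs more than rebuilding.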
16. Dimension table indexing
- Create a non-clustered primary key on the surrogate key of each dimension table
- A clustered index on the business key should be considered; it:
  - Enhances query response when the business key is used in the WHERE clause
  - Helps avoid lock escalation during the ETL process
- For large Type 2 SCDs, create a four-part non-clustered index: business key, record begin date, record end date and surrogate key
- Create non-clustered indexes on columns in the dimension that will be used for searching, sorting, or grouping
- If there is a hierarchy in a dimension, such as Category - Sub Category - Product ID, create an index on the hierarchy
Example (employee dimension):
- EmployeeKey: non-clustered
- EmployeeNationalIDAlternateKey: clustered
- EmployeeNationalIDAlternateKey, StartDate, EndDate, EmployeeKey: non-clustered
- FirstName, LastName, DepartmentName: non-clustered
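A minimal sketch of these guidelines as DDL; the CLUSTERED / NONCLUSTERED keywords are SQL Server style, and the DimEmployee table and index names are illustrative:

  -- Non-clustered primary key on the surrogate key
  ALTER TABLE DimEmployee
      ADD CONSTRAINT PK_DimEmployee PRIMARY KEY NONCLUSTERED (EmployeeKey);

  -- Clustered index on the business key
  CREATE CLUSTERED INDEX IX_DimEmployee_BusinessKey
      ON DimEmployee (EmployeeNationalIDAlternateKey);

  -- Four-part index for a large Type 2 SCD
  CREATE NONCLUSTERED INDEX IX_DimEmployee_SCD2
      ON DimEmployee (EmployeeNationalIDAlternateKey, StartDate, EndDate, EmployeeKey);

  -- Columns commonly used for searching, sorting or grouping
  CREATE NONCLUSTERED INDEX IX_DimEmployee_Name
      ON DimEmployee (LastName, FirstName, DepartmentName);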
17. Fact table indexing
- Create a clustered, composite index composed of each of the foreign keys in the fact table
- Keep the most commonly queried date column as the leftmost column in the index
- There can be more than one date in the fact table, but there is usually one date that is of most interest to business users. A clustered index on this column has the effect of quickly segmenting the amount of data that must be evaluated for a given query
Example (clustered index columns): OrderDateKey, ProductKey, CustomerKey, PromotionKey, CurrencyKey, SalesTerritoryKey, DueDateKey
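The same example as DDL; the FactInternetSales table name is assumed, the key columns are those listed on the slide, and OrderDateKey leads because it is the most commonly queried date:

  CREATE CLUSTERED INDEX IX_FactInternetSales_Keys
      ON FactInternetSales (OrderDateKey, ProductKey, CustomerKey, PromotionKey,
                            CurrencyKey, SalesTerritoryKey, DueDateKey);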
19. Row Store and Column Store
Most queries do not process all the attributes of a particular relation.
- Row store: (+) easy to add/modify a record; (-) might read in unnecessary data
- Column store: (+) only needs to read in the relevant data; (-) tuple writes require multiple accesses
One can obtain the performance benefits of a column store using a row store by making some changes to the physical structure of the row store:
- Vertical partitioning
- Using index-only plans
- Using materialized views
20. Vertical Partitioning
Process:
- Fully vertically partition each relation: each column = 1 physical table
- This can be achieved by adding an integer position column to every table; adding an integer position is better than adding a primary key
- Join on position for multi-column fetches
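A sketch of this in generic SQL, using a hypothetical three-column sales relation; each column lives in its own physical table keyed by an integer position, and a multi-column fetch joins on that position:

  -- One physical table per column, each carrying the row position
  CREATE TABLE sales_district  (pos INTEGER, district  VARCHAR(30));
  CREATE TABLE sales_sales_qty (pos INTEGER, sales_qty INTEGER);
  CREATE TABLE sales_sale_amt  (pos INTEGER, sale_amt  DECIMAL(12,2));

  -- Multi-column fetch: join the single-column tables on position
  SELECT d.district, SUM(a.sale_amt) AS total_amt
  FROM   sales_sale_amt a
  JOIN   sales_district d ON d.pos = a.pos
  GROUP BY d.district;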
21. Index-only plans
Process:
- Add a B+Tree index for every table.column
- Plans never access the actual tuples on disk
- Headers are not stored, so per-tuple overhead is less
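A minimal sketch of the idea with a covering index in generic SQL (table and column names are hypothetical): because the index contains every column the query touches, the plan can be satisfied from the index alone without fetching the underlying rows:

  -- Index covers both the filter column and the aggregated column
  CREATE INDEX ix_fact_sales_date_amt ON fact_sales (date_key, sale_amt);

  -- Answerable entirely from the index (an index-only / covering scan)
  SELECT date_key, SUM(sale_amt) AS total_amt
  FROM   fact_sales
  WHERE  date_key BETWEEN 20230101 AND 20230131
  GROUP BY date_key;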
23. Hadoop ecosystem
An ecosystem of open source projects hosted by the Apache Foundation, built on concepts Google developed and shared:
- Distributed Storage (HDFS): a distributed file system that has the ability to scale out
- Distributed Processing (MapReduce)
- Metadata Management (HCatalog)
- Query (Hive)
- Scripting (Pig)
- Non-Relational Database (HBase)
- Data Extraction & Loading (HCatalog APIs, WebHDFS, Talend Open Studio for Big Data, Sqoop)
- Workflow & Scheduling (Oozie)
- Management & Monitoring (Ambari, ZooKeeper)
24. Promising uses of Hadoop in DW context
- Data staging: Hadoop allows organizations to deploy an extremely scalable and economical ETL environment
- Data archiving: Hadoop’s scalability and low cost enable organizations to keep all data forever in a readily accessible online environment
- Schema flexibility: Hadoop enables the growing practice of “late binding”; instead of transforming data as it is ingested, structure is applied at runtime
- Processing flexibility: Hadoop can quickly and easily ingest any data format
- Distributed DW architecture: off-load workloads for big data and advanced analytics to HDFS, discovery platforms and MapReduce
25. What led to a Data Warehouse at Facebook
- The problem: data, data and more data; 200 GB per day in March 2008, growing to 2+ TB (compressed) per day
- The Hadoop experiment: superior in availability, scalability and manageability compared to commercial databases; uses the Hadoop File System (HDFS)
- Challenges with Hadoop: programmability and metadata; MapReduce is hard to program; need to publish data in well-known schemas
HIVE
- What is Hive? A system for managing and querying structured data built on top of Hadoop; uses MapReduce for execution and HDFS for storage
- Key building principles: SQL on structured data as a familiar data warehousing tool; pluggable map/reduce scripts in the language of your choice; rich data types; performance
- Tables: each table has a corresponding directory in HDFS; a table can point to existing data directories in HDFS; data can be split based on the hash of a column, mainly for parallelism
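A minimal HiveQL sketch of these principles (the page_views table and its columns are illustrative, not from the Facebook deck): the table maps to a directory in HDFS, partitions and buckets split the data for parallelism, and queries are familiar SQL executed as MapReduce jobs:

  -- Each Hive table corresponds to a directory in HDFS
  CREATE TABLE page_views (
      user_id BIGINT,
      url     STRING,
      ts      TIMESTAMP
  )
  PARTITIONED BY (dt STRING)                  -- one sub-directory per partition
  CLUSTERED BY (user_id) INTO 32 BUCKETS      -- split on a hash of user_id for parallelism
  STORED AS ORC;

  -- Familiar SQL, compiled into MapReduce jobs over HDFS
  SELECT dt, COUNT(DISTINCT user_id) AS daily_users
  FROM   page_views
  WHERE  dt BETWEEN '2014-01-01' AND '2014-01-07'
  GROUP BY dt;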
27. Analytical platforms overview
Purpose-built database management systems designed explicitly for query processing and analysis that provide dramatically higher price/performance and availability compared to general-purpose solutions.
Vendors: 1010data, Aster Data (Teradata), Calpont, DATAllegro (Microsoft), Exasol, Greenplum (EMC), IBM Smart Analytics, Infobright, Kognitio, Netezza (IBM), Oracle Exadata, ParAccel, Pervasive, Sand Technology, SAP HANA, Sybase IQ (SAP), Teradata, Vertica (HP)
Deployment options:
- Software only (ParAccel, Vertica)
- Appliance (SAP, Exadata, Netezza)
- Hosted (1010data, Kognitio)
Examples:
- Kelley Blue Book: consolidates millions of auto transactions each week to calculate car valuations
- AT&T Mobility: tracks purchasing patterns for 80M customers daily to optimize targeted marketing
28. Which platform do you choose?
[The slide maps the platform options (Hadoop, analytic database, general-purpose RDBMS) against the type of data to be handled: structured, semi-structured and unstructured.]
29. Thank You
Please send your feedback and corporate training / consulting services requirements on BI to sameer@compulinkacademy.com