The Data Warehouse plays a central role in any BI solution: it's the back end upon which everything in the coming years will be created. It must be flexible enough to support the fast changes needed by today's business, but also have a well-known and well-defined structure in order to support the "engineerization" of its development process, making it cost effective. In this full-day session, we will discuss architectural design details and techniques, Agile Modeling, unit testing, automation, and software engineering applied to a Data Warehouse project.
The only way to do this is to have a clear idea of its architecture, understanding the concepts of measures and dimensions, and a proven engineered way to build it so that quality and stability can go hand-in-hand with cost reduction and scalability. This will allow you to start your BI project in the best way possible, avoiding errors, making implementation effective and efficient, building the groundwork for a winning Agile approach, and helping you to define the way in which your team should work so that your BI solution will stand the test of time.
2. Davide Mauri
• Microsoft SQL Server MVP
• Works with SQL Server since 6.5, on BI since 2003
• Specialized in Data Solution Architecture, Database Design, Performance
Tuning, High-Performance Data Warehousing, BI, Big Data
• President of UGISS (Italian SQL Server UG)
• Regular Speaker @ SQL Server events
• Consulting & Training, Mentor @ SolidQ
• E-mail: dmauri@solidq.com
• Twitter: @mauridb
• Blog: http://sqlblog.com/blogs/davide_mauri/default.aspx
3. Agenda
• Why a Data Warehouse?
• The Agile Approach
• Modeling the Data Warehouse
• Engineering the Solution
• Building the Data Warehouse
• Unit Testing Data
• The Complete Picture
• After the Data Warehouse
• Conclusions
4. Workshop Motivation
• Give you a solid background on why a DWH and an Agile approach are needed
• Convince your boss
• Convince your team
• Convince your co-workers
• Understand how important engineering and automation are to make it happen
• See in practice how a DWH can be built in an Agile way
7. Where does the data come from?
• OLTP: Online Transaction Processing
• OLTP databases are built to support
• single fast select/insert/update/delete operations
• high concurrency
• data consistency (normalization)
• the “current” version of data: usually there is no need to keep historical information
• Many OLTP databases exist within a company
• Data is scattered all around the company
• Not all of it in a relational format!
8. Accessing Data Directly – The Principle
(Diagram: several OLTP systems feed, through metadata and an integration layer, a “magic infinite scale-out database machine”.)
9. Accessing Data Directly – The Reality
(Diagram: the same picture as before, but in reality the integration layer must also move and crunch the data.)
10. Accessing Data Directly – Summing Up
• PROS
• Always up to date
• No copies
• Minimal Storage (3NF or above)
• Isolation/security
• CONS
• May change too fast
• Performance Impact
• Slow queries
• Complex Schema (if one exists!)
• Low or No Coherence
• Scattered Data
• Historical information may be missing
11. Is it only a technical detail?
• Can’t Big Data, In-Memory and all the new stuff just fix any performance problem?
• The answer would be “yes” if a simple “container” of data were enough
• (a simple technical artifact to speed up queries)
• But much more than this is needed.
12. What is a DWH, really?
In this new era, data is like water.
Who will ever drink from
• untested,
• untrusted,
• uncertified
data?
13. What is a DWH, really?
• Would a manager or a decision maker make a decision based on data whose source, integrity and correctness he doesn’t know?
14. What is a DWH, really?
• The Data Warehouse is the place where managers and decision makers will look for
• Correct
• Trusted
• Updated
• data in order to make an informed, “conscious” decision
15. What is a DWH, really? (Metaphysically)
• The answer is now easy:
16. What is a DWH, really? (Physically)
• A place to store consolidated data coming from the whole company
• A place where data is cleansed, verified and certified
• A place where historical data is stored
• A place that holds the single version of truth (if there is one!)
• Forms the core of a BI solution
• User-friendly data models, designed to make data analysis easier
17. Modern Data Environment
(Diagram: structured data flows from Master Data and the EDW into Data Marts, feeding the BI environment and the decision maker; Big Data and unstructured data feed the analytics environment and the data scientist. Together they produce the “Data Juice”.)
18. Forrester Research Says That:
• “Business intelligence (BI) is a set of methodologies, processes,
architectures, and technologies that transform raw data into meaningful
and useful information. It allows business users to make informed
business decisions with (real-time) data that can put a company ahead
of its competitors”
• “Data warehouses form the back-end infrastructure”
20. EDW: Reality Check
• EDW is the trusted container of all company data
• It cannot be created in “one day”
• It has to grow and evolve with business needs.
• (Likely) It will never be 100% complete
23. Adapt to Survive
“50% of requirements change in the first year of
a BI project”
Andreas Bitterer, Research VP, Gartner
24. A new approach is needed
• Reduce the risk of misunderstanding
• Increase the chances of delivering a useful DW/BI project
• Deliver Quickly
• Immediately create value and get user feedback
• Deliver Frequently
• Prioritize
• Set Quick-Win Objectives (again, create value)
• Fail Fast (and Recover Quickly)
25. Agile Manifesto
• Our highest priority is to satisfy the customer through early and
continuous delivery of valuable software.
• Welcome changing requirements, even late in development.
Agile processes harness change for the customer's competitive
advantage.
• Business people and developers must work together daily
throughout the project.
26. Agile Manifesto
• The most efficient and effective method of conveying
information to and within a development team is face-to-face
conversation.
• Simplicity - the art of maximizing the amount of work not done -
is essential.
• Source: http://agilemanifesto.org/principles.html
28. Agile Project Startup
• Identify the principal Business Unit
• Define a small scope
• Do some very small analysis and design
• JEDUF / JITD
• Create a Prototype
• Let the users “play” with data
• Redefine the requirements
• Grow: build the definitive project
29. A prototype is mandatory!
• Start with small data samples
• Helps to understand the data
• MDM anyone?
• Helps to better estimate efforts
• Low data quality is usually the problem here
• Creates a bridge between developers and users
• Helps to check that the analysis is correct and the project is feasible
30. Prototype Outcomes
• Users will change/refocus their minds when they see the actual data
• You have probably forgotten something
• Usually «implied» (for the user) requirements
• You may have misestimated data sizes
31. Agile Project Lifecycle - 2
• Iterative Approach
• The general scope is known
• Not the details
• Anything can (and will) change
• Even already deployed objects
• Only the certified data must stay stable
• Otherwise the solution will lose credibility
(Cycle: Analyze → Develop → Test → Deploy → Feedback → Evolve)
32. Agile Project Best Practices
• “JIT” Modeling: don’t try to model everything right from the
beginning, but engineer everything so that it will be easy to
make changes
• Prioritize Requirements
• Short iterations (weeks ideally)
• Automate as much as you can
• Follow a Test Driven Approach: release only after having tests
in place!
• «If it ain’t tested, it’s broken» (TDD motto)
33. Don’t Fear the Change!
• Ability to Embrace Changes is a key value for the DW
• DW and Users will grow and evolve together
• Agility is a mindset more than anything else
• There is NO “Agile Product”
• There is NO “Agile Model”
• Agility allows you to fail fast (and recover quickly)
34. Agile Challenges
• Deliver Quickly and Frequently
• Challenge: keep high quality, no matter who’s doing the work
• Embrace Changes
• Challenge: don’t introduce bugs. Change the smallest part possible.
Use automatic Testing to preserve and assure data quality.
35. Taking the Agile Challenge
• To be Agile, some engineering practices need to be included in
our work model
• Agility != Anarchy
• Engineering:
• Apply well-known models
• Define & Enforce rules
• Automate and/or Check rules application
• Measure
• Test
36. Information is like Water
• How can you be sure that changes won’t introduce unexpected errors?
• Data Quality Testing is Mandatory!
• Unit Tests
• Regression Tests
• “Gate” Tests
37. Agile Vocabulary
• Agile introduces a lot of specific words
• Here’s a very nice and complete summary:
• https://www.captechconsulting.com/blog/ben-harden/learning-the-agile-vocabulary
38. Lean BI?
• Has the same objective as Agile BI: support business decisions in an ever-changing world
• Limit the different types of waste that occur in BI projects (Lean
Manufacturing),
• Focus on the interdependencies of systems (Systems
Thinking),
• Develop based on values and principles in the agile manifesto
(Agile Software Development).
• http://www.maxmetrics.com/goingagile/agile-bi-vs-lean-bi/
• http://www.b-eye-network.com/view/10264
40. Data Warehouse is Undefined
• Data Warehousing is still a young discipline
• It lacks basic definitions
• Data Warehouse
• Data Marts
• Few “universal” rules:
• much depends on the business being modeled
41. Data Mart or Data Warehouse?
• No “standard” definition, but usually
• «Data Marts» contain departmental data
• the «Data Warehouse» contains all data
• The “role” played by DM/DW depends on the approach used
• Inmon
• Kimball
• Data Vault is on the rise
• The latest kid on the block is “Anchor Modeling”
44. DW – Still Two Philosophies
• KIMBALL: Star Schema, Specialized Models, Model Once (Mart), User-Friendly
• INMON: Normal Forms, One Model, Model Twice (EDW/Mart)
• But, we agree:
1. There IS a model
2. It is relational(ish)
45. Which way?
• Inmon or Kimball?
• Both have pros and cons
• Of course the difference between the two is not limited to the Data Warehouse definition!
• Why not both?
• Avoid religious wars and take the best of both worlds
46. Facts about Normalizing
• It is expensive to
• Join (especially between large tables)
• Maintain referential integrity
• Build query plans
• It is very hard to
• Get consistently good query plans
• Make users understand >=3NF data
• Write the right query
• This is why we are careful about normalizing warehouses!
47. DW – Choose your side. Or not?
• Why not have a hybrid solution?
• Take the best from both worlds
• An Inmon DW that generates Kimball DMs
• Solution will grow and evolve to its final design
• Agility is the key: it has to be engineered into the solution
• Emergent Design
• https://en.wikipedia.org/wiki/Emergent_Design
48. Kimball approach…with an accent
• The Kimball approach is the most widely used
• Easy to understand
• Easy to use
• Efficient
• Well supported by tools
• Well known
• But the Inmon idea of having one physical DWH is very good
• Again, the advice is not to be too rigid
• Be willing to mix things and move from one to the other
• Be «Adaptive»
• My «perfect solution» is one that evolves towards an Inmon Data Warehouse used to generate Kimball Data Marts
49. Data Vault?
• A modeling technique often associated with Agile BI.
• That’s a myth: agility is not in the model, remember?
• Introduces the concepts of “Hubs”, “Links” and “Satellites” to split keys from their dependent values
• Optimized to keep history, not for query performance
• At the end of the day, it will map to Dimensions and Facts
50. Is a model forever?
• Surely not!
• We’re going to use ANY model that fits our needs.
• We’ll start with the Kimball+Inmon mix
• But always present a Dimensional Model to the end user
• Behind the scenes we can make the model evolve into anything we need: Data Vault, 100% Inmon… whatever
51. Data Warehouse Performance
• A Data Warehouse may need specific hardware or software to work at its best
• Due to the huge amount of data
• Due to complex queries
• Why does this happen?
• Data is usually stored at the highest level of detail in order to allow any kind of analysis
• Users usually need aggregated data
• Several specific solutions (logical and physical)
• Using an RDBMS or a mixture of technologies
52. Data Warehouse Performance
• Solutions built to support
• very fast reading of huge amounts of data
• analyzing data from multiple perspectives
• easy querying & reporting
• pre-aggregated data
• Specific technology
• Online Analytical Processing (OLAP) Multi-dimensional database
• Different storage flavors (MOLAP, ROLAP, HOLAP)
• In-Memory Technology
• Column-Store Approach
54. Hardware is a game changer!
(Screenshot taken from a Fast Track DWH)
Cloud can offer good performance too (but not yet up to this…)
55. Dimensional Modeling
• Modeling a database schema using Fact and Dimension entities
• Proposed and documented by Kimball (mid-nineties)
• Applicable to both relational and multidimensional databases
• SQL Server
• Analysis Services
• Focus on the end user
56. Defining Facts
• A fact is something that happened
• A product has been sold
• A contract has been signed
• A payment has been made
• Facts contain measurable data
• Product final price
• Contract value
• Paid amount
• The measurable data is called a Measure
• Within the DWH, facts are stored in Fact Tables
57. Defining Measures
• Measures are usually Additive
• It makes sense to sum up measure values
• E.g.: money amounts, quantities, etc.
• Semi-Additive data also exists
• Data that cannot be summed up across every dimension
• E.g.: an account balance can be summed across customers, but not over time
• Tools may have specific support for semi-additive measures
58. Defining Dimensions
• Dimensions define how facts can be analyzed
• They provide a meaning to the fact
• They categorize and classify the fact
• E.g.: Customer, Date, Product, etc.
• Dimensions have Attributes
• Attributes are the building blocks of a dimension
• E.g.: Customer Name, Customer Surname, Product Color, etc.
• Within the DWH, dimensions are stored in Dimension Tables
• Dimension Members are the values stored in dimensions
59. Dimensional Modeling
• Dimensional Modeling comes in two flavors
• Star Schema
• Snowflake Schema
• Star Schema
• Dimensions have a direct relationship with fact tables
• Snowflake Schema
• Dimensions may have an indirect relationship with fact tables
62. Star Schema
• Pros
• Easy to understand and to query
• Offers very good performance
• Well supported by SQL engines (e.g.: Star-Join optimization)
• Cons
• May require a lot of space
• Makes dimension updates and maintenance harder
• Somewhat rigid
63. Snowflake Schema
• Pros
• Less duplicate data
• Easier dimension update
• Flexibility
• Cons
• (Much) More complex to understand
• (Much) More complex to query
• In turn this means: more resource-hungry, slower, expensive
64. Snowflake or Star Schema?
• Feel free to design the Data Warehouse as you prefer, but present a Star Schema to the OLAP engine or to the end user
• Views will protect end users from model complexity
• Views will guarantee you all the flexibility you need to properly model your data
• Views will allow you to make changes in the future (e.g.: moving from Star to Snowflake)
• If in doubt, start with the Star Schema
• It is usually the preferred solution
• So start with this one; you can always change your mind later
• Remember, we embrace changes
65. Understand fact granularity
• Before doing physical design
• Understand fact granularity
• Understand if and how historical data should be preserved
• Granularity is the level of detail
• Granularity has to be agreed with SMEs and Decision Makers
• Data should be stored at the highest granularity
• Aggregation will be done later
• It must be defined both for facts and dimensions
66. Deal with changes in dimension data
• Two options:
• Keep only the last value
• Keep all the values
• Kimball has defined specific terminology
• the “Slowly Changing Dimension”
• A kind of architectural pattern (well known, universally recognized)
• Three types of SCD
• 1, 2 and 3
• Mixes of them
67. SCD Type 1
• Update all data to the last value
• Use Cases
• Correct erroneous data
• Make the past look like the present situation
• E.g.: a Business Unit changed its name
68. SCD Type 2
• Preserve all past values
• Use Cases
• Keep the information known at the time the fact occurred
• Avoid inconsistent analysis
69. SCD Type 3
• Preserve only the last valid value before the current one (the “previous” value)
• Use Cases
• I’ve never seen it in use
70. Other well-known objects
• Junk Dimensions
• Generic attributes that do not belong to any specific dimension
• They are grouped into one dimension to avoid having too many dimensions, since this may “scare” the final user
• Degenerate Dimensions
• Dimensions generated from the fact table
• E.g.: Invoice Number
71. Fact Table Types
• Kimball has defined two main types
• Transactional
• Snapshot
• Again, a kind of architectural pattern (well known, universally recognized)
• We proposed a new fact table type at PASS Summit 2011
• Temporal Snapshot
• http://www.slideshare.net/davidemauri/temporal-snapshot-fact-tables
72. Transactional Fact Table
• Used to store «transactional data»
• Sales
• Invoices
• Quantities
• Each row represents an event that happened at a specific point in time
73. Snapshot Fact Table
• Useful when you need to store inventory/stock/quote data
• Data that is *not* additive
• Stores the entire situation at a precise point in time
• a «picture of the moment»
• Expensive in terms of data usage
• Usually snapshots are at week level or above (month / semester etc.)
• Though column-oriented storage can help a lot here
74. Temporal Snapshot Fact Table
• A new approach to storing snapshot data without taking snapshots
• Each row represents not a point in time but a time interval
• It seems easy, but it’s a completely new way to approach the problem
• Brings Temporal Database theory into Data Warehousing
• Free PDF book online: http://www.cs.arizona.edu/people/rts/tdbbook.pdf
75. Temporal Snapshot Fact Table
• Allows the user to have daily (or even hourly) snapshots of data
• Avoids data explosion
• Look in the PASS 2011 DVDs, the SQL Bits 11 website (shorter version), or SlideShare (shorter version)
76. Many-to-Many relationships
• How do you manage M:N relationships between dimensions?
• e.g.: Books and Authors
• An additional table is (still) needed
• The table will not hold facts (in the BI meaning)
• Hence it will be a “factless” table
• or, better, a Bridge table
• The OLAP engine must support such a modeling approach
77. Bridge / Factless Tables
• Bookstore sample:
• The bridge table (usually) doesn’t contain facts… so it’s a factless table. It’s only used to store the M:N relationship.
• In reality, a fact table can also act as a bridge/factless table
(Diagram: the Sales fact table relates to the Book dimension; a factless/bridge table links the Book dimension to the Author dimension.)
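A minimal T-SQL sketch of the bookstore sample (all table and column names are hypothetical, following this deck’s own naming conventions):

```sql
-- Hypothetical bridge table: one row per (book, author) pair.
CREATE TABLE dwh.factless_book_authors
(
    id_dim_books    int      NOT NULL,  -- surrogate key of the Book dimension
    id_dim_authors  int      NOT NULL,  -- surrogate key of the Author dimension
    insert_time     datetime NOT NULL DEFAULT (GETDATE())
);

-- Sales by author: the fact joins the Book dimension,
-- the bridge fans the join out to the Author dimension.
SELECT a.author_name,
       SUM(f.total_amount) AS sales_amount
FROM dwh.fact_sales            AS f
JOIN dwh.factless_book_authors AS ba ON ba.id_dim_books  = f.id_dim_books
JOIN dwh.dim_authors           AS a  ON a.id_dim_authors = ba.id_dim_authors
GROUP BY a.author_name;
```

Note that the fan-out counts a multi-author book’s sales once per author; if that is not wanted, the usual fix is a weighting-factor column on the bridge.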
78. Generic Modeling Best Practices
• Don’t create too many dimensions
• Keep It Super Simple
• If you have a lot of attributes in a dimension and some are SCD1 and some SCD2, it may make sense to split the dimension in two
• If a dimension becomes huge (>1M rows), it’s worth analyzing how to split it into two or more dimensions
• Keep security in mind right from the very first steps
• since it may require you to change the way you model your Data Warehouse
80. Architecture is well known
• We now have the «architectural» elements of a BI solution
• Inmon / Kimball / Other
• Star Schema / Snowflake Schema
• Facts & Dimensions
• In some specific cases we also have well-known «Design Patterns»
• Slowly Changing Dimensions
81. Implementation is problematic
• So, from an architectural point of view, we can be happy. But from the implementation standpoint, what can we say?
• Each time we have to start from scratch
• Every person has his or her own way to implement the architectural solutions adopted
• The quality of the implementation is directly proportional to the experience of the implementer
82. Time lost in low-value work
• You lose a lot of time implementing “technical” stuff, time that is subtracted from finding the optimal resolution to the business problem
• E.g.: loading an SCD type 2. How much time will you spend on its development?
• From 2 to 10 days, depending on the experience you have
• And a minimum of 2 days is always there
• Since there are no standard implementation rules, everyone applies their own
• That works, but everyone’s are different
83. Choices
• In the development of a BI solution you will need to make a lot of choices in terms of architecture and implementation
• Every choice we make brings pros and cons
• It will impact the future of the solution
• How do you choose? Who chooses? Why? Are all the people on the team able to make autonomous choices?
• How can you be sure that all those choices do not conflict with each other?
• Especially when made by different people?
84. Reaching the goal - 1
• This is the situation
• Everyone follows his own path
• It would be better to work in harmony…
• …with common rules
(Diagram: many divergent paths towards the same target)
85. DW is TeamWork
• Problems arise when the team is made of several people
• One person works well alone
• «Geniuses» (or geniuses-wannabe) work well together
• We need to do an “exceptional” job with “normal” people: smart and willing, but “normal”
• A minimum quality must be “guaranteed” regardless of who does the work
• It must be easy to “scale” the number of people at work
• It must be easy to replace a person
• It’s vital to allow people to do what they do best: give added value to the solution. The “monkey work” should be as small as possible.
86. Software Engineering for BI
• «Software Engineering is the application of a systematic,
disciplined, quantifiable approach to the development,
operation, and maintenance of software, and the study of these
approaches; that is, the application of engineering to software”
IEEE Computer Society
87. With clear and well-defined rules…
• We’d like to have this!
• So, we need to formally define our rules for work
(Diagram: aligned paths towards the target)
88. Objectives
• What are the objectives we want to set?
• It must be possible to “change our mind” during development (and thus be independent of the initial architectural choices)
• Each person must be able to solve the given problem in a personal way, but the implementation of the solution should follow a common path
• Careless mistakes and errors due to repetitive processes should be minimized
• It must be possible to parallelize and (when possible) automate the work
• The solution must be testable
• It must have rigidity and flexibility at the same time
• It should be “adaptive”!
89. Achieve a common goal
• Everything must be designed to achieve a common goal:
• Spend more time finding the best solution to the business problem
• Spend (much) less time implementing the solution
• making as few mistakes as possible
• preventing common mistakes
• In other words, take the best from each player on the field
• Humans -> Added value: Intelligence
• Machines -> Added value: Automation
90. Engineering The Solution
• A set of rules that defines
• Naming conventions
• Mandatory objects / attributes
• Standard implementations of solutions to common problems
• Dependencies between objects
• Best practices and development methodology
• Each and every rule has a purpose:
• Prevent errors
• Set a standard
• Assure maintainability
• Help the team scale out
• Let developers concentrate more on solving the business problem and less on the implementation
91. Engineering The Solution
• All the rules presented here are born from real-world experience
• Following the Agile principle of simplicity
• Metadata is embedded in the rules
• Sometimes this brings some ugly solutions…
• …if you want to avoid this, external files/documents MUST be maintained
93. Engineering The Solution
• A BI solution has three main layers
• Producers
• Coordinators
• Consumers
• Producers layer
• Contains all the data sources
• Coordinators layer
• Contains all the objects that process source data into the Data Warehouse
• Consumers layer
• Where Data Warehouse data is consumed
94. Engineering The Solution
• A BI solution can be thought of as made of 3 different layers
• Data flows from, and only from, lower levels to higher levels
• Higher levels don’t know how data is managed in lower levels
• (Information Hiding Principle)
(Diagram: Producers → Coordinators → Consumers, bottom to top)
96. Engineering The Solution
(Diagram: OLTP SYS 1 and OLTP SYS 2 (Producers) feed Helper 1 and Helper 2, then Staging and the Data Warehouse, supported by the Configuration database (Coordinators); Cubes and Reports consume the Data Warehouse (Consumers).)
97. Databases
• Helper
• Contains objects that permit access to the data in the OLTP databases.
98. Databases
• Staging
• Contains intermediate “volatile” data
• Contains ETL procedures and support objects (like error tables)
99. Databases
• Configuration
• Objects that add additional value to the data (e.g.: lookup tables)
• Objects that allow the BI solution to be configurable, e.g. for which company to load data
100. Databases
• Data Warehouse
• The final data store
101. Databases
• Metadata
• Contains all the information needed to automate the creation and the
loading of
• Staging
• Data Warehouse
• Log
• Guess?
102. Databases
• Naming convention:
• projectname_*
• * = CFG, LOG, STG, DWH, MD, HLP
• Database files
• STG & DWH databases MUST be created with 2 filegroups (at least)
• PRIMARY (system catalogs)
• SECONDARY (all other tables); this is the default filegroup
• Strongly recommended for the other databases too
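A sketch of the filegroup rule (database, logical and physical file names are all made up):

```sql
-- Hypothetical STG database: PRIMARY for system catalogs only,
-- SECONDARY for all user tables.
CREATE DATABASE myproject_STG
ON PRIMARY
    (NAME = 'myproject_stg_sys',  FILENAME = 'D:\Data\myproject_stg_sys.mdf'),
FILEGROUP [SECONDARY]
    (NAME = 'myproject_stg_data', FILENAME = 'D:\Data\myproject_stg_data.ndf')
LOG ON
    (NAME = 'myproject_stg_log',  FILENAME = 'L:\Log\myproject_stg_log.ldf');

-- Make SECONDARY the default, so new tables land there
-- unless a filegroup is specified explicitly.
ALTER DATABASE myproject_STG MODIFY FILEGROUP [SECONDARY] DEFAULT;
```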
103. Schemas
• Schemas help to
• create logical boundaries
• distinguish object scopes
• Several schemas are used to identify the different scopes
• stg, etl, cfg, dwh, tmp, bi, err, olap, rpt
• an optional “util” schema to store utility objects
• e.g.: fn_Nums, a function to generate numbers
• A schema (generally) cannot be used in more than one database
• Prevents careless mistakes
105. Views
• Views are the key to abstraction
• They shield higher levels from the complexity of the underlying levels
• Used throughout the entire solution to reduce “friction” between layers and objects
• They apply the “Information Hiding Principle” (helps teams work in parallel)
• They help to auto-document the solution
106. Views
• General rules
• Do basic data preparation in order to simplify SSIS package development
• Casts
• Column renames
• Basic data filtering
• Simple data normalization and cleansing
• Table joins
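A minimal example of such a preparation view (source table, schema and column names are hypothetical):

```sql
-- Hypothetical view over a helper-layer customers table:
-- casts, renames, light cleansing and filtering, so the SSIS
-- package can stay a simple source -> destination data flow.
CREATE VIEW etl.vw_customers
AS
SELECT  CAST(c.CustID AS varchar(20))     AS bk_customer_code,  -- rename + cast
        LTRIM(RTRIM(c.CustName))          AS customer_name,     -- simple cleansing
        UPPER(c.Segment)                  AS customer_segment,  -- normalization
        CAST(c.LastModified AS datetime)  AS last_update
FROM    hlp.Customers AS c
WHERE   c.IsDeleted = 0;                                        -- basic filtering
```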
107. Stored Procedures
• Their usage should be very, very limited
• The majority of the ETL logic is in SSIS
• Usage
• Incremental load/management
• SCD loading (MERGE)
• Dummy member management
• An additional abstraction that helps to avoid changing SSIS packages
• for debugging (import one specific fact table row)
• for optimizations (e.g.: query hints)
• for ordering data
108. Basic Concepts
• A dimension will gather data from one or more data sources
• A dimension will hold the key value of each source entity (if available)
• the “Business Key”
109. Basic Concepts
• The Business Key won’t be used to relate the dimension to the fact table
• A surrogate key will be created during the ETL phase
• The surrogate key will be used to create the relationship
• The surrogate key has several advantages
• It is meaningless
• It is small
• It is independent from the data source
• It helps to make the fact table smaller
110. Why Integer Keys are Better
• Smaller row sizes
• More rows/page = more compression
• Faster to join
• Faster in column stores
111. Dimensions – Example
• Data comes from three tables: Departments, SubDepartments and Working Areas (sample model from a logistics company)
(Table screenshot: surrogate key, business keys, «payload» columns)
112. Dimensions – Key points
• A dimension is (usually) created using data coming from master data or reference tables
• OLTP PK/AK -> Business Key
• The dimension PK will be artificial and surrogate
113. SCD Type 1
• Scope
• Update data to last value
• Implementation
• UPDATE
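As a sketch (all names are hypothetical; the checksum column follows the scd1_checksum convention used in this deck’s dimension rules):

```sql
-- SCD Type 1: overwrite the attributes in place whenever the
-- SCD1 checksum computed on the staged row differs.
UPDATE d
SET    d.customer_name = s.customer_name,
       d.scd1_checksum = s.scd1_checksum,
       d.last_update   = GETDATE()
FROM   dwh.dim_customers AS d
JOIN   etl.vw_customers  AS s
       ON s.bk_customer_code = d.bk_customer_code
WHERE  d.scd1_checksum <> s.scd1_checksum;  -- touch only changed rows
```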
114. SCD Type 2
• Scope
• Keep all the past values as well as the current ones
• Implementation
• Row valid time + UPDATE + INSERT
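A minimal sketch of the “row valid time + UPDATE + INSERT” idea (table, column and valid_from/valid_to names are hypothetical):

```sql
-- Step 1 (UPDATE): close the current version of members whose
-- SCD2 payload changed, by stamping the end of their validity.
UPDATE d
SET    d.valid_to = GETDATE()
FROM   dwh.dim_customers AS d
JOIN   etl.vw_customers  AS s
       ON s.bk_customer_code = d.bk_customer_code
WHERE  d.valid_to IS NULL                   -- current version
  AND  d.scd2_checksum <> s.scd2_checksum;  -- payload changed

-- Step 2 (INSERT): add a new open-ended version for every member
-- with no current row (brand-new members and the ones just closed).
INSERT INTO dwh.dim_customers
       (bk_customer_code, customer_segment, scd2_checksum, valid_from, valid_to)
SELECT s.bk_customer_code, s.customer_segment, s.scd2_checksum, GETDATE(), NULL
FROM   etl.vw_customers AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM dwh.dim_customers AS d
                   WHERE d.bk_customer_code = s.bk_customer_code
                     AND d.valid_to IS NULL);
```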
115. SCD Type 3
• Scope
• Keep the current value and the one before that only
• Implementation
• Specific Columns + UPDATE
116. SCD Key vs BK
• We defined the SCD Key as the key used to look up dimension data while loading the fact table
• It may not be made up of *ALL* the BKs
• It’s an ALTERNATE KEY (and thus is UNIQUE)
117. Hierarchies
• In our sample the dimension also holds a (natural) hierarchy
• Department > SubDepartment > Working Area
118. Things to keep in mind
• Huge dimensions (>1M members)
• Evaluate splitting them in two
• Dimensions with SCD1+SCD2 attributes
• Evaluate splitting them in two
• Security: keep it in mind from the beginning, since it may be a painful process if done afterwards
119. Dimension Rules
• Dimensions have to be created in
• Database: DWH - Schema: dwh
• Table rules
• Name: dim_<plural_dimension_name>
• Dimension key: id_<table_name>
• Surrogate / artificial key
• Business Key: prefixed by bk_
• Additional mandatory columns
• last_update (datetime) or log id (int)
• scd1_checksum / scd2_checksum
• only one or both, depending on SCD usage
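The rules above, applied to a hypothetical Customers dimension (column names and sizes are made up):

```sql
CREATE TABLE dwh.dim_customers
(
    id_dim_customers  int IDENTITY(1,1) NOT NULL      -- surrogate/artificial key
        CONSTRAINT pk_dim_customers PRIMARY KEY,
    bk_customer_code  varchar(20)   NULL,             -- business key (bk_ prefix)
    customer_name     nvarchar(100) NOT NULL,         -- SCD1-managed attribute
    customer_segment  nvarchar(50)  NOT NULL,         -- SCD2-managed attribute
    scd1_checksum     binary(20)    NULL,             -- change detection (SCD1)
    scd2_checksum     binary(20)    NULL,             -- change detection (SCD2)
    last_update       datetime      NOT NULL
        CONSTRAINT df_dim_customers_lu DEFAULT (GETDATE())
);
```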
120. Dimension Dummy Values
• Add at least one «dummy» value
• to represent “not available” data
• Dummy value rules
• Dimension key: negative number
• Business Key: NULL
• Fixed values for text and numeric data
• Text: “N/A” or “Not Available”
• Choose appropriate terms if more than one dummy exists
• Numeric: NULL
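Applying these rules to a hypothetical dwh.dim_customers table with an identity surrogate key:

```sql
-- Dummy member: negative surrogate key, NULL business key,
-- fixed “N/A” for text attributes (numeric attributes stay NULL).
SET IDENTITY_INSERT dwh.dim_customers ON;

INSERT INTO dwh.dim_customers
       (id_dim_customers, bk_customer_code, customer_name, customer_segment, last_update)
VALUES (-1, NULL, N'N/A', N'N/A', GETDATE());

SET IDENTITY_INSERT dwh.dim_customers OFF;
```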
121. Date Dimension
• The Date Dimension is an exception
• The key (id_dim_date) is not meaningless
• Integer data type
• Format: yyyymmdd
• This allows easier queries on the fact table and the use of negative dummy values for dummy members
• E.g.: Unknown Date, Erroneous Date, Invalid Date
• Doesn’t need the last_update and scd_checksum mandatory columns
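The smart key is just the date written as an integer, which T-SQL can produce directly (CONVERT style 112 is yyyymmdd):

```sql
-- yyyymmdd smart key for a given date.
DECLARE @d date = '2024-03-15';
SELECT CAST(CONVERT(char(8), @d, 112) AS int) AS id_dim_date;  -- 20240315
```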
122. Time Dimension
• The Time Dimension is also an exception
• The key (id_dim_time) is not meaningless
• Integer data type
• Format: hhmmss
• Doesn’t need the last_update and scd_checksum mandatory columns
• If drill-down between them is not mandatory, Date & Time should be two separate dimensions
123. Fact Tables
• More than one fact table may exist within the same DW solution
• Different granularity? Different fact table!
• It’s only important that they all use the same dimensions
• where applicable
• Example: Product Sales and Product Costs
• This allows coherent queries to be made
124. Transactional Fact Table
• «total_amount» can simply be summed up to get aggregated values for all possible combinations of dimension values
125. Snapshot Fact Table
• All data is stored for each snapshot taken
• The «Snapshot Date» is mandatory for almost all analyses
126. Temporal Snapshot Fact Table
• Each row represents an interval (at most one year wide)
• Underlying interval: 20090701->20090920
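One way to read such interval rows is to join them to the Date dimension on the interval, re-creating the daily snapshots only at query time (fact, dimension and column names are hypothetical):

```sql
-- Explode (product, interval) rows back into daily snapshot rows:
-- the yyyymmdd smart keys make the range predicate a plain BETWEEN.
SELECT d.id_dim_date,
       f.id_dim_products,
       f.stock_qty
FROM dwh.fact_stock AS f          -- one row per product per interval
JOIN dwh.dim_dates  AS d
     ON d.id_dim_date BETWEEN f.interval_from AND f.interval_to;
```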
127. Temporal Snapshot Fact Table
• Some real-world usage
• Using the Temporal Fact
• 148.380.542 rows that use 13 GB
• Without this technique we would have had
• 11.733.038.614 rows that would have used 1 TB of data
• And this is just for one month. So for one year we would have more than 10 TB of data.
128. Fact Tables
• Fact tables have to be created in
• Database: DWH - Schema: dwh
• Table rules
• Table: fact_<plural_fact_name>
• Fact key: id_[fact]_<table_name>
• Additional mandatory columns
• insert_time (datetime) or log id (int)
• Foreign keys to dimensions: not needed
• Put the business key columns of the source OLTP table into the fact table to ease debugging and error checking
• if the BKs are not too big
• Business Key: prefixed by bk_
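The rules above, applied to a hypothetical Sales fact table (columns and types are made up; note the yyyymmdd smart key for the date and the absence of FK constraints):

```sql
CREATE TABLE dwh.fact_sales
(
    id_fact_sales     bigint IDENTITY(1,1) NOT NULL,  -- fact key
    id_dim_date       int           NOT NULL,  -- smart date key (yyyymmdd)
    id_dim_customers  int           NOT NULL,  -- surrogate keys, no FK constraints
    id_dim_products   int           NOT NULL,
    bk_invoice_number varchar(20)   NULL,      -- source BK kept for debugging
    quantity          int           NOT NULL,
    total_amount      decimal(19,4) NOT NULL,
    insert_time       datetime      NOT NULL DEFAULT (GETDATE())
);
```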
129. Factless/Bridge Tables
• Factless/bridge tables have to be created in
• Database: DWH - Schema: dwh
• Table rules
• Table: factless_<plural_table_name>
• Factless key: not needed
• Foreign keys to dimensions: not needed
• Additional mandatory columns
• insert_time (datetime) or log id (int)
130. The DW Query Pattern
SELECT foo [..n], <aggregate>(something)
FROM dwh.fact F
JOIN dwh.dim_a A
ON F.id_a = A.id_a
JOIN dwh.dim_b B
ON F.id_b = B.id_b
WHERE <filter>
GROUP BY foo [..n]
131. The expected Relational Query Plan
• Fact table: ColumnStore Index (CSI) Scan
• Dimensions: Scan or Seek, feeding the hash-table (batch) builds
• Hash Joins between the fact and each dimension
• Partial Aggregate, plus a final Stream Aggregate
133. Loading the Data Warehouse
• Loading the DWH means doing ETL
• Extract data from data sources
• Databases, Files, Web Services, etc.
• Transform extracted data so that
• It can be cleansed and verified
• It can be enriched with additional data
• It can be placed into a star-schema
• Load data into the Data Warehouse
134. Loading the Data Warehouse
• ETL is usually the most complex and longest phase
• roughly 80% of the entire work is done here
• Integration Services is the engine we use to do ETL
• Very fast
• Completely in-memory
• 64-bit aware
• Very scalable
135. Loading the Data Warehouse
• SSIS does NOT substitute T-SQL
• T-SQL and set-based operations are still faster
• When possible avoid working on a per-row basis and favor «set-based» operations
• Just keep in mind that you have to deal with the t-log
• They are complementary and work together
• T-SQL: ideal for “simple” set-oriented data manipulation
• SSIS: ideal for complex, multi-stage data manipulation
• Advanced scripting through SSIS Expressions or .NET
136. Loading the Data Warehouse
• Integration Services and T-SQL play the major role here
• .NET help may be needed from time to time for complex transformations
• Our objective: create an ETL solution that is almost self-documenting
• It should be possible to understand what the ETL does just by «reading» the SSIS packages
• Following the KISS principle, avoid mixing ETL logic
• “Simple” ETL logic in views
• “Complex” ETL logic in SSIS packages
137. Loading the Data Warehouse
• SSIS will NEVER load data directly from a table
• ALWAYS go through a view
• Views decrease package complexity and keep packages loosely coupled to the database schema
• This will make SSIS development easier
• Simple filtering changes or joins can be changed here without having to touch SSIS
• SSIS packages are like applications!
• The only exception to this rule will be seen when loading fact and dimension tables
• The exception is made because there is a case where using a view does not decrease complexity
138. Divide et Impera
• To be Agile it is *vital* to keep the business and technical processes completely separated
• Business process: ETL logic that applies only to the specific solution you’re building
• Technical process: ETL logic that can be used with any Data Warehouse and that can be highly automated
139. Divide et Impera
• Follow the “Divide et Impera” principle
• Move data from OLTP to Staging
• Move data from Staging to Data Warehouse
• Create at least two different SSIS solutions
• One to load the Staging Database
• One to load the Data Warehouse Database
141. Loading the Data Warehouse – Step 1
• Diagram: OLTP and other data sources (files, web services, etc.) are extracted and loaded into STG, going through views in the HLP database
142. Loading the Data Warehouse – Step 1
• The first step is to load data into the staging database
• From the data sources
• NO “transformation” here, just load data as is
• In other words, create a copy of the OLTP data used in the BI solution
• Total, or partial in the case of incremental load
• This leaves us free to run complex ETL queries without interfering with production systems
• Only filter out data that by definition should not be handled by the BI solution
• Sample or test data
143. The “Helper” database
• Create views to expose the data that will be used to build the DWH
• Views are simple “SELECT columns FROM…”
• no data transformation allowed
• no casts, no column renaming, no data cleansing
• only filter out data that should never ever be imported into the DWH
• e.g.: customer id 999, which is the “test customer”
• Views have to be put in the bi schema
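A sketch of what such a view looks like (the table and column names are hypothetical): a bare SELECT with only the “never import” filter.

```sql
-- Hypothetical sketch of a helper view: no casts, no renaming,
-- only the filter for data that must never reach the DWH.
CREATE VIEW bi.vw_customers
AS
SELECT customer_id, customer_name, country_code
FROM dbo.customers
WHERE customer_id <> 999;  -- 999 is the "test customer"
```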
144. Loading the Data Warehouse – Step 2
• Diagram: within STG, ETL views and stored procedures transform the data through TMP and ERR tables, supported by the CFG database
145. Loading the Data Warehouse – Step 2
• The second step is to transform data so that it can be loaded into the Data Warehouse
• “Transform” can be a complex duty
• Transform = Cleanse, Check, De-Duplicate, Correct
• Data may have to go through several transformations in order to reach its final shape
• Intermediate values never leave the staging database
• Here is where you’ll spend most of your time
146. The “Configuration” database
• “Configuration” data
• Data not available elsewhere
• E.g.: lookup tables of “well-known” values
• E.g.: C1 -> Company 1, C2 -> Company 2
• Tables used to hold “configuration” data
• Use the cfg schema
147. The “Staging” Database
• Contains a copy of the OLTP data
• Only the needed data, of course
• Copying data is fast; this allows us to avoid using the OLTP database for too long
• Avoids concurrency problems
• All further work will be done on the BI server and won’t affect OLTP performance
• Data from the OLTP data source tables has to be copied into staging tables
• the tables must have the same schema as the OLTP tables
• staging tables have to be created in the staging schema
148. The “Staging” Database
• Contains intermediate tables used to transform the data
• Favor the usage of several intermediate tables (even if you’ll use more space) instead of doing everything in memory with SSIS
• This will make debugging/troubleshooting much easier!
• The correct balance to decide how many intermediate tables are needed has to be found on a per-project basis
149. The “Staging” Database
• Tables used to hold data coming from files
• E.g.: Excel, flat files
• Use the etl schema
• Tables used to hold intermediate data
• Use the tmp schema
• Objects used in the ETL phase
• Views, stored procedures, user-defined functions, etc.
• All these objects must be placed in the etl schema
150. The “Staging” Database
• Views prepare data to be further processed by SSIS
• SSIS reads data only from views
• Source view naming convention
• vw_<logical_name>
• E.g.: etl.vw_claims
• Destination table naming convention
• <logical_name>
• E.g.: tmp.claims
• If the ETL has to be done in more than one step
• append the «step_number» to the object name
• E.g.: etl.vw_claims_step_1, tmp.claims_step_1
151. The “Staging” Database
• Views take care of creating a “logical” view of dimension or fact data
• rename columns to give them a human-understandable meaning
• CAST data types in order to make them consistent with the ones used in the DWH
• perform basic data filtering and data re-organization
• e.g.: flatten hierarchies to “n” columns, trim white spaces
• perform basic ETL logic
• CASE statements, ROW_NUMBER, joins, etc.
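For example, a hypothetical etl.vw_claims could rename, CAST and apply basic logic in one place (source columns and decode values are made up for illustration):

```sql
-- Hypothetical sketch of an etl view: rename columns, CAST to the DWH
-- types, trim whitespace and decode coded values with a CASE.
CREATE VIEW etl.vw_claims
AS
SELECT
    CAST(c.clm_id AS int)            AS bk_claim_number,
    LTRIM(RTRIM(c.clm_desc))         AS claim_description,
    CAST(c.clm_amt AS decimal(19,4)) AS claim_amount,
    CASE c.clm_status
        WHEN 'O' THEN 'Open'
        WHEN 'C' THEN 'Closed'
        ELSE 'Unknown'
    END                              AS claim_status
FROM staging.claims AS c;
```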
152. The “Staging” Database
• ETL Stored procedures are used only to manage dimension
loading (SCD 1 or 2) and Dummy Members:
• Naming convention:
• etl.stp_merge_dim_<dimension target>
• etl.stp_add_dummy_dim_<dimension target>
153. The “Staging” Database
• The err schema contains tables that hold rows with errors that cannot be corrected or ignored (rows that cannot be processed)
• For example: you have a temporal database and for some rows you find that “Valid To” happens before “Valid From”
• This data can later be exposed to SMEs in order to fix it
• It is interesting to note that the BI solution becomes useful already in the middle of development
• It helps to increase data quality
154. Loading the Data Warehouse – Step 3
• Diagram: SSIS moves the transformed data from STG to DWH, reading through views and using stored procedures
155. Loading the Data Warehouse – Step 3
• The third step is the loading of the Data Warehouse
• Very simple: just take the transformed data from the staging database and put it into facts and dimensions
• Load all dimensions
• Generate dimension IDs
• Load fact tables
• “Just” convert business keys to dimension IDs
• Not so easy
• Must handle incremental loading
• Mandatory for dimensions (otherwise you may have problems if reloaded data gets different dimension IDs)
• Would be nice also for facts
• More complex when you have «early arriving facts»/«late arriving dimensions»
156. Handling Dimension Keys
• Mapping source dimension keys (the BKs) to the surrogate dimension ID may be more complex than expected. You may encounter several key «pathologies»
• Composite Keys, Zombie Keys, Multi Keys, Dolly Keys
• A good way to solve these problems is to add an additional abstraction layer, using mapping tables
• Thomas Kejser has some very good posts on that here
• http://blog.kejser.org/tag/keys/
157. The “Data Warehouse” database
• The DWH database must contain only
• tables related to the DWH facts, factless tables and dimensions
• all tables must be in the dwh schema
• views to allow access to the physical tables
• use specific schemas to expose data to other tools
• use the olap schema for views used by SSAS
• use the rpt schema for views used by SSRS
• Add your own schema depending on the technology you use
• Or even create a Data Mart out of the Data Warehouse!
158. The “Data Warehouse” database
• Stored procedures
• If needed for reporting purposes, they must be put into the reporting schema
• No other use is allowed
159. The “Data Warehouse” database
• Dimension loading
• Always incremental
• With all the rules in place there is only one way to load them
• Of course there may be differences on a per-dimension basis
• But it is just like building a house: no two houses are identical, yet all are built following the same rules
• This means that it can be completely automated!
160. The “Data Warehouse” database
• Fact table loading
• Incremental would be nice
• But it may not be an easy task
• SQL Server 2008 CDC in the source can help a lot
• Sometimes just dropping and re-loading the facts is the most effective solution
• Rarely for the entire table
• More common with time-partitioning
• FAST load of fact tables:
• Drop and re-create indexes
• Remove compression and add it back later
• Load partitions in parallel
• A tool to automate partitioned table management exists
• SQL CAT Partition Management Tool
161. Improving DW Querying Performance
• Use ColumnStore indexes to speed up queries against the DW (if you’re not using other additional solutions)
• Try to keep Factless/Bridge tables as small as possible. A whitepaper details how to implement a «proprietary» compression that works extremely well:
• http://www.microsoft.com/en-us/download/details.aspx?id=137
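As a sketch (the table name is illustrative), a clustered columnstore index turns the fact table into the CSI Scan expected by the DW query pattern:

```sql
-- Sketch: a clustered columnstore index on a fact table speeds up
-- the typical scan-join-aggregate DW query pattern.
CREATE CLUSTERED COLUMNSTORE INDEX ccsi_fact_orders
ON dwh.fact_orders;
```

Note that clustered columnstore indexes require SQL Server 2014+; on 2012 only nonclustered (read-only) columnstore indexes are available.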
162. Tools that help
• Use the Multiple Hash component to calculate hash values
• http://ssismhash.codeplex.com/
• When looking up an SCD2 dimension, try to avoid the default Lookup transformation since it does not support FULL cache in this scenario. Matt Masson has a very good post on how to implement «Range Lookups»
• http://bit.ly/SSISRangeLookup
163. Integration Services Rules
• Avoid usage of the OLE DB Command in the Data Flow
• It’s just too slow; prefer a set-based solution
• Try to do as much transformation / operations as possible here and NOT in SSAS or SSRS
• In other words: avoid spreading the ETL process all around
• Always read from views
• Use of OPTION(RECOMPILE) is encouraged so that we can get optimal plans
• Except for the dimension loading lookup component
• (It doesn’t help to lower complexity)
164. Integration Services Rules
• Package naming convention
• Use the “setup_” prefix for all packages that contain logic that must be run first in order to be able to load data
• Use the “load_” prefix for all packages that load data into “final” tables
• E.g.: staging tables, dwh tables
• Use the “prepare_” prefix for all packages that transform data in order to make it usable by another transformation phase
• E.g.: tmp tables
• Use a sequence number (###)
• To group all independent packages
• To quickly identify package dependencies
165. Integration Services Rules - Staging
• load_DFKKKO, load_DFKKOP, load_BUT000, load_<xxxxxxxx>
• These packages are independent from each other and can be run simultaneously
• prepare_010_orders, prepare_010_customers
• Independent from each other and can be run simultaneously, but work on data loaded by the “load_” packages
• prepare_020_invoices, prepare_020_orders
• Independent from each other and can be run simultaneously, but work on data loaded by the previous “prepare_” packages
166. Integration Services Rules - DWH
• load_dim_time, load_dim_customers, load_dim_products, load_dim_categories, load_dim_geography
• load_fact_orders, load_fact_invoices, load_fact_costs
• load_factless_products_categories
• First load all dimensions
• Then load all facts
• Then load all factless tables
167. Integration Services Rules
• One “action” per package!
• With SQL Server 2012+ use shared connections and the «Project» deployment model
• Use one or more “master packages” to execute packages with the correct sequence / parallelism
• With previous versions, try to make sure that all packages of the same layer (STG or DWH) use the same connection managers
• In this way you can have only one configuration file to configure connections when running packages
• Don’t bother too much about logging
• SQL Server 2012+ has native support
• http://ssis-dashboard.azurewebsites.net/
• If using SQL Server 2005 or 2008/R2 use DTLoggedExec
• http://dtloggedexec.codeplex.com/
168. Building a DWH in 2013
• It is still an (almost) manual process
• A *lot* of repetitive low-value work
• No (or very few) standard tools available
169. How it should be
• Semi-automatic process
• “develop by intent”
• Define the mapping logic from a semantic perspective
• Source to Dimensions / Measures
• (Metadata, anyone?)
• Design the model and let the tool build it for you
CREATE DIMENSION Customer
FROM SourceCustomerTable
MAP USING CustomerMetadata
ALTER DIMENSION Customers
ADD ATTRIBUTE LoyaltyLevel
AS TYPE 1
CREATE FACT Orders
FROM SourceOrdersTable
MAP USING OrdersMetadata
ALTER FACT Orders
ADD DIMENSION Customer
171. Invest in Automation?
• Faster development
• Reduced costs
• Embrace changes
• Fewer bugs
• Increased solution quality, consistent throughout the whole product
172. Automation Pre-Requisites
• Split the process into two separate types of process
• What can be automated
• What can NOT be automated
• Create and impose a set of rules that defines
• How to solve common technical problems
• How to implement the identified solutions
173. No Monkey Work!
Let the people think and
let the machines do the
«monkey» work.
174. Design Pattern
“A general reusable
solution to a commonly
occurring problem within
a given context”
176. Design Pattern
• Specific SQL Server Patterns
• Change Data Capture
• Change Tracking
• Partition Load
• SSIS Parallelism
177. Engineering the DWH
• “Software Engineering allows and requires the formalization of the software building and maintenance process.”
178. Sample Rules
• Always include a «last_update» column
• Always log inserted/updated/deleted rows to the log.load_info table
• Use FNV1a64 for checksums
• Use views to expose data
• Dimension & fact views MUST use the same column names for lookup columns
179. Engineering the DWH
There are two intrinsic processes hidden in the development of a BI solution that must be allowed (or forced) to emerge.
180. Business Process
• Data manipulation, transformation, enrichment & cleansing logic
• Specific for every customer. Almost not automatable
181. Technical Process
• Application of data extraction
and loading techniques
• Recurring (pattern) in any
solution
• Highly Automatable
183. ETL Phases
• «E» and «L» must be
• Simple, easy and straightforward
• Completely automated
• Completely reusable
• «E» and «L» have ZERO value in a BI solution
• They should be done in the most economical way
188. Source Differential Load
• SQL Server 2012 has features that can help with incremental/differential load
• Change Data Capture
• Natively supported in SSIS 2012
• http://www.mattmasson.com/2011/12/cdc-in-ssis-for-sql-server-2012-2/
• Change Tracking
• An underused feature in BI… not as rich as CDC but MUCH simpler and easier
189. SCD 1 & SCD 2
Loading flow:
• Look up the dimension ID and MD5 checksum from the business key
• Calculate the MD5 checksum of the non-SCD-key columns
• If the dimension ID is NULL → insert the new members into the DWH
• Otherwise, if the checksums are different → store the rows into a temp table
• Merge the data from the temp table into the DWH
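For the SCD Type 1 case, the flow above boils down to a single MERGE; this sketch uses hypothetical table and column names, with the checksum pre-computed in the source view:

```sql
-- Hypothetical SCD Type 1 sketch: insert unknown business keys,
-- update rows whose checksum changed.
MERGE dwh.dim_customers AS d
USING etl.vw_customers AS s
    ON d.bk_customer_id = s.bk_customer_id
WHEN NOT MATCHED BY TARGET THEN
    INSERT (bk_customer_id, customer_name, scd_checksum, last_update)
    VALUES (s.bk_customer_id, s.customer_name, s.scd_checksum, GETDATE())
WHEN MATCHED AND d.scd_checksum <> s.scd_checksum THEN
    UPDATE SET d.customer_name = s.customer_name,
               d.scd_checksum  = s.scd_checksum,
               d.last_update   = GETDATE();
```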
190. SCD 2 Special Note
• Merge => UPDATE the validity interval + INSERT the new row
193. Parallel Load
• Logically split the work into several steps
• E.g.: load/process one customer at a time
• Create a «queue» table that stores information for each step
• Step 1 -> Load customer «A»
• Step 2 -> Load customer «B»
• Create a package that
• Picks the first step not already picked up
• Does the work
• Goes back to the first step
• Call the package «n» times simultaneously
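A sketch of the queue pick-up step (the table and columns are hypothetical): READPAST lets each parallel caller skip rows another instance has already locked, so every step is handed out exactly once.

```sql
-- Hypothetical sketch: atomically pick the next unprocessed step.
-- READPAST skips rows locked by other parallel packages.
UPDATE TOP (1) etl.load_queue WITH (ROWLOCK, UPDLOCK, READPAST)
SET    picked_up = 1,
       picked_at = GETDATE()
OUTPUT inserted.step_id, inserted.customer_code
WHERE  picked_up = 0;
```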
194. Other SSIS Specific Patterns
• Range Lookup
• Not natively supported
• Matt Masson has the answer in his blog
• http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-
lookups.aspx
195. Metadata
• Provides context information
• Which columns are used to build/feed a dimension?
• Which columns are business keys?
• Which table is the fact table?
• How are facts and dimensions connected?
• Which columns are used?
196. How to manage Metadata?
• Naming convention
• Specific, ad hoc database or tables
• JSON
• Other (XML, files, etc.)
197. Naming Convention
• The easiest and cheapest
• No additional (hidden) costs
• No need to be maintained
• Never out-of-sync
• No documentation needed
• Actually, it IS PART of the documentation
• Imposes a standard
• Very limited in terms of flexibility and usage
198. Extended Properties
• Support most metadata needs
• No additional software needed
• Very verbose to use
• Developing a wrapper to make usage simpler is feasible and encouraged
199. Metadata Objects
• Dedicated ad-hoc database and tables
• As flexible as you need
• Maintenance overhead to keep metadata in sync with data
• Development of an automatic check procedure is needed
• DMVs can help a lot here
• Need a GUI to make them user-friendly
200. JSON
• Could be expensive to keep in sync
• A tool is needed, otherwise too much manual work
• User and developer friendly!
• VERY flexible
• If it grows too much, JSON.NET Schema may help
• Supported by Visual Studio
• And by SQL Server 2016
201. Automation Scenarios
• Run-time: «auto-configuring» packages
• Really hard to customize packages
• SSIS limitations must be managed
• E.g.: a Data Flow cannot be changed at runtime
• On-the-fly creation of packages may be needed
• Design-time: package generators / package templates
• Easy to customize the created packages
204. Useful Resources
• «STOCK» Tasks:
• http://msdn.microsoft.com/en-us/library/ms135956.aspx
• How to set Task properties at runtime:
• http://technet.microsoft.com/en-
us/library/microsoft.sqlserver.dts.runtime.executables.add.aspx
205. BIML – BI Markup Language
• Developed by Varigence
• http://www.varigence.com
• http://bimlscript.com/
• MIST: full-featured BIML IDE
• Free via BIDS Helper
• Support “limited” to SSIS package generation
• http://bidshelper.codeplex.com
207. Data Warehouse Unit Test
• Before releasing anything, the data in the DW must be tested
• A user has to validate a sample of the data
• (e.g.: total invoice amount of January 2012)
• That validated value will become the reference value
• Before each release, the same query will be executed again: if the result matches the expected reference value the test is green, otherwise the test fails
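As a sketch (the amount and the schema are made-up placeholders), the automated check is just the validated query compared against the stored reference value:

```sql
-- Hypothetical sketch: re-run the validated query and compare it
-- against the reference value approved by the user.
DECLARE @reference_value decimal(19, 4) = 1523476.18;  -- made-up placeholder

SELECT CASE
           WHEN SUM(f.total_amount) = @reference_value THEN 'green'
           ELSE 'failed'
       END AS test_result
FROM dwh.fact_invoices AS f
JOIN dwh.dim_date AS d
    ON f.id_dim_date = d.id_dim_date
WHERE d.[year] = 2012
  AND d.[month] = 1;
```

In practice, tools like NBi wrap exactly this kind of assertion so it can run as part of an automated test suite.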
208. Data Warehouse Unit Test
• Of course tests MUST be automated when possible
• Visual Studio
• BI.Quality (on CodePlex… now old)
• Based on NUnit
• NBi is the new way to go: http://www.nbi.io/ !
• Based on NUnit
• What to test?
• Structures
• Aggregated results
• Specific values of some «special» rule
• Fixed bugs/tickets
• Values in the various layers
211. Modern Data Environment - Details
• Diagram: data sources (files, web services, cloud/syndicated data, RDBMS, Master Data) are extracted into Staging and an Archive / Big Data store (kept for replay)
• The data is then standardised into facts and dimensions, transformed and aggregated
• Facts are copied and processed into Cubes, Marts and Virtual Marts (V-Mart), which are secured and exposed
214. What’s Next?
• Now that the DW is ready, any tool can be used to create a BI/Reporting solution on solid, simpler, user-friendly ground.
• Reporting
• Reporting Services / BusinessObjects / MicroStrategy / JasperReports
• Analysis
• Analysis Services, Cognos
• Power Pivot, QlikView, Tableau, Power BI
216. A Starting Point
• The presented content can be used as-is, or as a starting point to build your own framework
• Extend the content when it doesn’t fit your solution (for example: add additional databases, like «SYSCFG», if this helps you)
• Define your rules! Drive the tools, don’t be driven by them!
• Keep the layers separated and favor loose coupling (less «friction» to changes)
• Spread the idea of unit testing data, even if at the beginning it seems an expensive approach.
217. Real World Samples
• The presented content comes from on-the-field experience
• More than 40 (successful) projects using the proposed approach
• More than 2000 packages managed (biggest solution: 572 packages)
• Several teams involved (biggest team: 12 people)
• Several customers have grown their own standards starting from this
• Data coming from ANY source: SAP, Dynamics, DB2, text or Excel files
218. Some challenges faced
• Changed an entire accounting system, moving from one vendor to another
• The DWH and OLAP/Reporting solution were completely untouched; 2/3 of the budget saved
• Started with a full load only and added incremental load later
• Less than 5% of the Extract and Load logic changed (Transformations untouched)
• Created a solution in 3 months with a minimal set of features and evolved it into an enterprise data warehouse / BI solution
• Monthly delivery
• Never released bad data (helped to correct errors in the source systems)
• Helped an enterprise company reduce the time spent crunching data by 66%
219. Latest challenges faced
• Supported a *big* electronics retail company in creating their BI/DSS solution on their shiny new Dynamics CRM installation
• During the CRM development
• The first specification document for reporting was very “agile”…
• “What do you need?” — “Don’t know, but all of it”
DATA alone is not enough. It is like a raw material: it has to be processed in order to become INFORMATION, which drives the extraction and acquisition of KNOWLEDGE and ultimately allows people to make DECISIONS.
OLTP samples: ecommerce website, SAP, CRM, ERP, and so on
Usually OLTP databases are tied to a specific business purpose
Querying an OLTP database to analyze data and trends may not be a good idea:
The OLTP database is complex
Queries that analyze data are complex and will slow down your production system
The OLTP database schema may change unexpectedly
All the needed data may not be available in just one database
Data can be updated at any time, making «point-in-time» queries unreliable
“In a modern company, everyone is a Decision Maker.”
Data Juice
http://www.slideshare.net/davidemauri/data-juice
http://www.forrester.com/Topic+Overview+Business+Intelligence/fulltext/-/E-RES39218
A Data Warehouse is needed no matter which technology you’ll decide to use for your BI/DSS solution, since it is the spine of it!
Deliver Quickly: make BI a key asset for the company right from the beginning. The sooner people get data, the sooner they will learn more about their data. For example, it’s very easy to detect underestimated data quality or business process problems. BI can be a good help to start fixing and monitoring them, thus making the ROI tangible right from the start.
JEDUF: Just Enough Design Upfront
JITD: Just In Time Design
Unit Testing is a key topic in BI!
A little more detail on the sentence that states there is a lack of “Universal Rules”. The meaning is that it makes no sense to ask whether “this entity has been modeled correctly”. The answer is that the entity – let’s say, the Customer – has been modeled correctly if and only if it allows all the analysis that the business needs to do, in an efficient, fast and error-free way. It’s not possible to say that modeling the Customer with two or three tables is better than using just one table. It depends on the business needs, the amount of work required to implement that entity, the “friction” that such a model introduces (thus making changes harder), and so on.
On average the Kimball approach is the most used since it is:
Easy to understand
Easy to use
Efficient
Well supported by tools
Well known
But the idea of having one physical DWH is very good. Again, the advice is not to be too rigid: be willing to mix things or move from one approach to another… Be «Adaptive».
My «perfect solution» is an Inmon Data Warehouse used to generate Kimball Data Marts.
The solution will grow over time, so it may be created using one approach and then modified to another as time passes, in order to better serve business requirements. The idea of «change» is not something that has to be fought, but something that has to be «embraced». The BI solution must be able to accept changes.
“analyzing data from multiple perspectives”: this can also be rephrased as «analyzing data across all its possible categorizations»
“One solution is to move away from RDBMS for querying”: as usual this has pros and cons.
Pros:
An ad-hoc solution that gives the best performance
Very easy to use for the final user (a Data Analyst)
Cons:
It is another technology that people have to be trained on in order to use it effectively
More complex to use for the developer
Another solution is to stay with the RDBMS but optimize it for this purpose (Indexed Views, Parallel Data Warehouse, Column-Aligned Storage, …)
Focus on the end user: make life easier for whoever has to query the data for analytics purposes
Makes dimension updates and maintenance harder -> due to denormalization
Somewhat rigid -> again, due to denormalization it’s harder to update a dimension, since there is a lot of duplicate data that you have to deal with
SME = Subject Matter Experts
The fact table contains the Book dimension ID.
If a book is written by many authors we cannot create additional rows in the fact table, otherwise we would not correctly model reality and would get wrong results.
Sometimes the whole is not made of the sum of the single elements.
Keep security in mind right from the very first steps: we won’t go deep into security problems in this workshop but it’s very important to understand what kind of security requirements you have to follow
Underline that the mentioned point are exactly what’s needed to make a team working using an Agile approach
Information Hiding Principle: http://en.wikipedia.org/wiki/Information_hiding
Configuration:
Contains configuration objects
objects that add additional value to the data (e.g.: lookup tables)
objects that allow the BI solution to be configurable, like for which companies to load data
Staging:
Contains intermediate “volatile” data
Contains ETL procedures and support objects (like err tables)
Data Warehouse:
The final data store
Helper:
Contains objects that access the data from the OLTP database.
The dimension contains all the possible valid combinations of values in the three tables.
Type 3 is never used in reality.
“A hierarchy is a natural hierarchy when each attribute included in the user-defined hierarchy has a one-to-many relationship with the attribute immediately below it”
http://msdn.microsoft.com/en-us/library/ms174557.aspx
Don’t create too many dimensions (<20)
If you have a lot of attributes in a dimension and some are SCD1 and some SCD2, it may make sense to split the dimension in two
If a dimension becomes huge (>1M rows) it’s worth analyzing how to split it into two or more dimensions
Keep security in mind right from the very first steps
Since this may require you to change the way you model your Data Warehouse
Product Sales and Product Costs:
Shared dimensions: Product, Category
Non-shared dimensions: Customer
This allows, for example, calculating the gross margin
“Simple” means that you never need to use a temporary table to store intermediate data.
ALWAYS go through a view: this can also be read as “Views PREPARE data to be used by SSIS”
Other Data Sources => Excel, flat files, web services, etc.
Or even create a Data Mart out of the Data Warehouse: maybe you need specific aggregations, or to add specific data used by only one department