Agile Data Warehousing:
From Start to Finish
Davide Mauri
@mauridb
dmauri@solidq.com
Davide Mauri
• Microsoft SQL Server MVP
• Works with SQL Server from 6.5, on BI from 2003
• Specialized in Data Solution Architecture, Database Design, Performance
Tuning, High-Performance Data Warehousing, BI, Big Data
• President of UGISS (Italian SQL Server UG)
• Regular Speaker @ SQL Server events
• Consulting & Training, Mentor @ SolidQ
• E-mail: dmauri@solidq.com
• Twitter: @mauridb
• Blog: http://sqlblog.com/blogs/davide_mauri/default.aspx
Agenda
• Why a Data Warehouse?
• The Agile Approach
• Modeling the Data Warehouse
• Engineering the Solution
• Building the Data Warehouse
• Unit Testing Data
• The Complete Picture
• After the Data Warehouse
• Conclusions
Workshop Motivation
• Give you a solid background on why a DWH and an Agile
approach are needed
• Convince your boss
• Convince your team
• Convince your co-workers
• Understand how engineering and automation are important to
make it happen
• See in practice how a DWH can be built in an Agile way
Why a Data Warehouse?
The Data-Driven Age
Data → Information → Knowledge → Decision
“In a modern company, everyone is a Decision Maker.”
Where does the data come from?
• OLTP: Online Transaction Processing
• OLTP databases are built to support
• single fast select/insert/update/delete operations
• high concurrency
• data consistency (normalization)
• “current” version of data: usually there is no need to keep historical information
• Many OLTP databases exist within a company
• Data is scattered all around the company
• Not all in a relational format!
Accessing Data Directly – The Principle
[Diagram: the OLTP systems feed a «magic», infinitely scalable database machine through a metadata / integration layer]
Accessing Data Directly – The Reality
[Same diagram, in reality: data still has to be moved and crunched between the OLTP systems, the integration layer and the «magic» machine]
Accessing Data Directly – Summing Up
• PROS
• Always up to date
• No copies
• Minimal Storage (3NF or above)
• Isolation/security
• CONS
• May change too fast
• Performance Impact
• Slow queries
• Complex Schema (if one exists!)
• Low or No Coherence
• Scattered Data
• Historical information may be
missing
Is it only a technical detail?
• Can’t Big Data, In-Memory and all the new stuff just fix any
performance problem?
• The answer would be “yes”, if a simple “container” of data were
enough.
• (A simple technical artifact in order to speed up queries)
• But much more than this is needed.
What is a DWH, really?
In this new era, data is like water.
Who will ever drink from
• untested,
• untrusted,
• uncertified
data?
What is a DWH, really?
• Would a manager or a decision maker make a decision based
on data whose source, integrity and correctness they don’t
know?
What is a DWH, really?
• The Data Warehouse is the place where managers and
decision makers will look for
• Correct
• Trusted
• Updated
• data, in order to make an
• informed or
• “conscious” decision
What is a DWH, really? (Metaphysically)
• The answer is now easy:
What is a DWH, really? (Physically)
• A place to store consolidated data coming from the whole
company
• A place where data is cleansed, verified and certified
• A place where historic data is stored
• A place that holds the single version of truth (if there is one!)
• Forms the core of a BI solution
• User friendly Data models, designed to make data analysis
easier
Modern Data Environment
[Diagram: Structured Data, Unstructured Data and Big Data feed Master Data and the
EDW; Data Marts serve the BI Environment (Decision Maker), while the Analytics
Environment serves the Data Scientist]
• For more on this, see the «Data Juice» deck on SlideShare
Forrester Research Says That:
• “Business intelligence (BI) is a set of methodologies, processes,
architectures, and technologies that transform raw data into meaningful
and useful information. It allows business users to make informed
business decisions with (real-time) data that can put a company ahead
of its competitors”
• “Data warehouses form the back-end infrastructure”
The Agile Approach
EDW: Reality Check
• EDW is the trusted container of all company data
• It cannot be created in “one day”
• It has to grow and evolve with business needs.
• (Likely) It will never be 100% complete
Gather Requirements → Design → Develop → Delivery → Generate Value
• Too few stakeholders
• Too many technical people
• Too few iterations
• Too slow
• Too expensive
• Illusion of Control
Traditional Development Lifecycle
A well known picture
Adapt to Survive
“50% of requirements change in the first year of
a BI project”
Andreas Bitterer, Research VP, Gartner
A new approach is needed
• Reduce Risk of misunderstanding
• Increase chances to deliver a useful DW/BI project
• Deliver Quickly
• Immediately create value and get user feedback
• Deliver Frequently
• Prioritize
• Set Quick-Win Objectives (again, create value)
• Fail Fast (and Recover Quickly)
Agile Manifesto
• Our highest priority is to satisfy the customer through early and
continuous delivery of valuable software.
• Welcome changing requirements, even late in development.
Agile processes harness change for the customer's competitive
advantage.
• Business people and developers must work together daily
throughout the project.
Agile Manifesto
• The most efficient and effective method of conveying
information to and within a development team is face-to-face
conversation.
• Simplicity - the art of maximizing the amount of work not done -
is essential.
• Source: http://agilemanifesto.org/principles.html
Hi-Level Requirements → JIT Model → Implement → Test → Deliver → Generate Value
• Multi-Disciplinary Team
• Many Iterations
• Cost Effective
• Quick Delivery
• Iterative
• True Control
Agile Development Lifecycle - 1
Weeks or a few Months
Agile Project Startup
• Identify the principal Business Unit
• Define a small scope
• Do some very small analysis and design
• JEDUF / JITD
• Create a Prototype
• Let the users “play” with data
• Redefine the requirements
• Grow / build the definitive Project
Prototype is mandatory!
• Start with small data samples
• Help to understand data
• MDM anyone?
• Help to better estimate efforts
• Low data quality is the problem
• Create a bridge between developer and user
• Help to check that the analysis
is correct and project is feasible
Prototype Outcomes
• Users will change/refocus their
minds when they see the actual data
• You have probably forgotten something
• Usually «implied» (for the user)
requirements
• You may have misestimated data sizes
Agile Project Lifecycle - 2
• Iterative Approach
• The general scope is known
• Not the details
• Anything can (and will) change
• Even already deployed objects
• Only the certified data must stay stable
• Otherwise the solution will lose credibility
Analyze → Develop → Test → Deploy → Feedback → Evolve
Agile Project Best Practices
• “JIT” Modeling: don’t try to model everything right from the
beginning, but engineer everything so that it will be easy to
make changes
• Prioritize Requirements
• Short iterations (weeks ideally)
• Automate as much as you can
• Follow a Test Driven Approach: release only after having tests
in place!
• «If it ain’t tested, it’s broken» (TDD Motto)
Don’t Fear the Change!
• Ability to Embrace Changes is a key value for the DW
• DW and Users will grow and evolve together
• Agility is a mindset more than anything else
• There is NO “Agile Product”
• There is NO “Agile Model”
• Agility allows you to fail fast (and recover quickly)
Agile Challenges
• Deliver Quickly and Fast
• Challenge: keep high quality, no matter who’s doing the work
• Embrace Changes
• Challenge: don’t introduce bugs. Change the smallest part possible.
Use automatic Testing to preserve and assure data quality.
Taking the Agile Challenge
• To be Agile, some engineering practices need to be included in
our work model
• Agility != Anarchy
• Engineering:
• Apply well-known models
• Define & Enforce rules
• Automate and/or Check rules application
• Measure
• Test
Information is like Water
• How can you be sure that changes
won’t
introduce unexpected errors?
• Data Quality Testing is Mandatory!
• Unit Tests
• Regression Tests
• “Gate” Tests
Agile Vocabulary
• Agile introduces a lot of specific words
• Here’s a very nice and complete summary:
• https://www.captechconsulting.com/blog/ben-harden/learning-the-agile-
vocabulary
Lean BI?
• Has the same objective as Agile BI: support business decisions
in an ever-changing world
• Limit the different types of waste that occur in BI projects (Lean
Manufacturing),
• Focus on the interdependencies of systems (Systems
Thinking),
• Develop based on values and principles in the agile manifesto
(Agile Software Development).
• http://www.maxmetrics.com/goingagile/agile-bi-vs-lean-bi/
• http://www.b-eye-network.com/view/10264
Modeling the Data Warehouse
Data Warehouse is Undefined
• Data Warehousing is still a young discipline
• Lacks Basic definitions
• Data Warehouse
• Data Marts
• Few “universal” rules:
• Depends on modeled business
Data Mart or Data Warehouse ?
• No “Standard” definition, but usually
• «Data Marts» contain departmental data
• «Data Warehouse» contains all data
• The “role” played by DM/DW depends on the approach used
• Inmon
• Kimball
• Data Vault is on the rise
• The latest kid on the block is “Anchor Modeling”
Kimball Design
[Diagram: several Source Data systems feed multiple Marts, tied together by Conformed Dimensions]
Data Warehouse ::= the sum of all Data Marts and the conformed dimensions
Inmon Design
[Diagram: several Source Data systems feed one “Enterprise” Data Warehouse, which in turn feeds the Marts]
Data Warehouse ::= THE corporate-wide data model
Data Marts ::= subsets of the Data Warehouse
DW – Still Two Philosophies
KIMBALL
Star Schema
Specialized Models
Model Once (Mart)
User Friendly
INMON
Normal Forms
One Model
Model Twice (EDW/Mart)
But, we agree:
1. There IS a model
2. It is relational(ish)
Which way?
• Inmon or Kimball ?
• Both have pros and cons
• Of course the difference between the two is not only limited to the Data
Warehouse definition!
• Why not both? 
• Avoid religion wars and take the best of both worlds
Facts about Normalizing
• It is expensive to
• Join (especially between large tables)
• Maintain referential integrity
• Build query plans
• It is very hard to
• Get consistently good query plans
• Make users understand >=3NF data
• Write the right query
• This is why we are careful about normalizing warehouses!
DW – Choose your side. Or not?
• Why not have a hybrid solution?
• Take the best from both worlds
• Inmon DW that generates Kimball DMs
• Solution will grow and evolve to its final design
• Agility is the key: it has to be engineered into the solution
• Emergent Design
• https://en.wikipedia.org/wiki/Emergent_Design
Kimball approach…with an accent
• On average, the Kimball approach is the most used
• Easy to understand
• Easy to use
• Efficient
• Well supported by tools
• Well known
• But the idea of having one physical DWH is very good
• Again, the advice is not to be too rigid
• Be willing to mix things and move from one to the other
• Be «Adaptive» 
• My «Perfect Solution» is one that evolves towards an Inmon Data Warehouse
used to generate Kimball Data Marts
Data Vault?
• Modeling technique often associated with Agile BI.
• That’s a myth  Agility is not in the model, remember?
• Introduces the concepts of “Hubs”, “Links” and “Satellites” to split
keys from their dependent values
• Optimized to keep history, not for query performance
• At the end of the day, it will map to Dimensions and Facts
A model is forever?
• Surely not!
• We’re going to use ANY model that will fit our needs.
• We’ll start with the Kimball+Inmon mix
• But always present a Dimensional Model to the end user
• Behind the scenes we can make the model evolve to anything
we need. Data Vault, 100% Inmon….whatever 
Data Warehouse Performance
• Data Warehouse may need specific hardware or software to
work at best
• Due to huge amount of data
• Due to complex queries
• Why does this happen?
• Data is usually stored with the highest level of detail in order to allow
any kind of analysis
• Users usually need aggregated data
• Several specific solutions (logical and physical)
• Using RDBMS or a mixture of technologies
Data Warehouse Performance
• Solutions built to support
• very fast reading of huge amounts of data
• analyzing data from multiple perspectives
• easy querying & reporting
• pre-aggregate data
• Specific technology
• Online Analytical Processing (OLAP) Multi-dimensional database
• Different storage flavors (MOLAP, ROLAP, HOLAP)
• In-Memory Technology
• Column-Store Approach
Improving DW Performance
• Hardware Solutions
• Fast-Track
• Parallel Data Warehouse / APS
• Exadata
• Teradata
• Netezza
• Software Solutions
• Multi-Dimensional Databases (Analysis Services, Cognos)
• In-Memory Databases (Power Pivot, Qlikview…)
• Column-Store Systems (SQL Server 2012+, Vertica, Greenplum)
Hardware is a game changer!
Screenshot Taken from a Fast Track DWH
Cloud can offer good performance too (but not yet up to this…)
Dimensional Modeling
• Modeling a database schema using Facts and Dimension
entities
• Proposed and Documented by Kimball (mid-nineties)
• Applicable both to Relational and Multidimensional database
• SQL Server
• Analysis Services
• Focus on the end user
Defining Facts
• A fact is something that happened
• A product has been sold
• A contract has been signed
• A payment has been made
• Facts contain measurable data
• Product final price
• Contract value
• Paid Amount
• The measurable data is called a Measure
• Within the DWH, facts are stored in Fact Tables
Defining Measures
• Measures are usually Additive
• Make sense to sum up measure values
• E.g.: money amount, quantity, etc.
• Semi-Additive data exists
• Data that cannot be summed up across every dimension (typically not across time)
• E.g.: Account balance
• Tools may have specific support for semi-additive measures
Defining Dimensions
• Dimensions define how facts can be analyzed
• Provide a meaning to the fact
• Categorize and classify the fact
• E.g.: Customer, Date, Product, etc.
• Dimensions have Attributes
• Attributes are the building blocks of a Dimension
• E.g.: Customer Name, Customer Surname, Product Color, etc.
• Within the DWH, Dimensions are stored in Dimension Tables
• Dimension Members are the values stored in Dimensions
Dimensional Modeling
• Dimensional Modeling comes in two flavors
• Star Schema
• Snowflake Schema
• Star Schema
• Dimensions have direct relationship with fact tables
• Snowflake Schema
• Dimensions may have an indirect relationship with fact tables
Star Schema
Screenshot taken from Wikipedia
Snowflake Schema
Screenshot taken from Wikipedia
Star Schema
• Pros
• Easy to understand and to query
• Offers very good performance
• Well supported by SQL Engines (e.g.: Star-Join optimization)
• Cons
• May require a lot of space
• Makes dimension updates and maintenance harder
• Somewhat rigid
Snowflake Schema
• Pros
• Less duplicate data
• Easier dimension update
• Flexibility
• Cons
• (Much) More complex to understand
• (Much) More complex to query
• In turn this means: more resource-hungry, slower, expensive
Snowflake or Star schema?
• Feel free to design the Data Warehouse as you prefer, but
present a Star Schema to OLAP engine or to the End User
• Views will protect end-users from model complexity
• Views will guarantee that you can have all the flexibility you need to
properly model your data
• Views will allow you to make changes in the future (e.g.: moving from Star to
Snowflake)
• If in doubt, start with the Star Schema
• It is usually the preferred solution
• So start with this one, you can always change your mind later
• Remember, we embrace changes 
Understand fact granularity
• Before doing physical design
• Understand facts granularity
• Understand if and how historical data should be preserved
• Granularity is the level of detail
• Granularity has to be agreed with SME and Decision Makers
• Data should be stored at the highest granularity
• Aggregation will be done later
• Must be defined both for facts and dimensions
Deal with changes in dimension data
• Two options:
• Keep only the last value
• Keep all the values
• Kimball has defined specific terminology
• “Slowly Changing Dimension”
• Kind of Architectural Pattern (well known, universally recognized)
• Three types of SCD
• 1, 2 and 3 
• Mix of them
SCD Type 1
• Update all data to last value
• Use Cases
• Correct erroneous data
• Make the past look like the present situation
• E.g.: A Business Unit changed its name
SCD Type 2
• Preserve all past values
• Use Cases
• Keep the information known at the time the fact occurred
• Avoid inconsistent analysis
SCD Type 3
• Preserve only the last valid value before the current (“previous”
values)
• Use Cases
• I’ve never seen it in use 
Other well-known objects
• Junk Dimensions
• Generic attributes that do not belong to any specific dimension
• They are grouped into a single dimension in order to avoid having too many
dimensions, since this may “scare” the final user
• Degenerate Dimensions
• Dimension generated from the fact table
• E.g.: Invoice Number
Fact Table Types
• Kimball has defined two main types
• Transactional
• Snapshot
• Again, kind of Architectural Pattern (well known, universally
recognized)
• We proposed a new fact table type at PASS Summit 2011
• Temporal Snapshot
• http://www.slideshare.net/davidemauri/temporal-snapshot-fact-tables
Transactional Fact Table
• Used to store «Transactional Data»
• Sales
• Invoices
• Quantities
• Each row represents an event that happened at a specific point in
time
Snapshot Fact Table
• Useful when you need to store inventory/stock/quotes data
• Data that is *not* additive
• Store the entire situation at a precise point in time
• «Picture of the moment»
• Expensive in terms of data usage
• Usually snapshots are at week level or above (month / semester, etc.)
• Though Column-Oriented storage can help a lot here
Temporal Snapshot Fact Table
• New approach to store snapshot data without doing snapshots
• Each row doesn’t represent a point in time but a time interval
• It seems easy but it’s a completely new way to approach the problem
• Bring Temporal Database theory into Data Warehousing
• Free PDF Book online:
http://www.cs.arizona.edu/people/rts/tdbbook.pdf
Temporal Snapshot Fact Table
• Allows the user to have daily (or even hourly) snapshots of data
• Avoids data explosion
• Look in the
• PASS 2011 DVDs, SQL Bits 11 website (shorter version), SlideShare
(shorter version)
Many to Many relationships
• How to manage M:N relationships between dimensions?
• e.g.: Books and Authors
• An additional table is (still) needed
• The table will not hold facts (in the BI meaning)
• Hence it will be a “factless” table
• Or – better – a Bridge table
• The OLAP engine must support such modeling approach
Bridge / Factless Tables
• Bookstore sample:
• The bridge table (usually) doesn’t contain facts…so it’s a factless table. It’s only used
to store the M:N relationship.
• In reality it can happen that a fact table also acts as a bridge/factless table
[Diagram: a Sales Fact Table references the Book Dimension; Books and Authors are related through a Factless (Bridge) Table]
Generic Modeling Best Practices
• Don’t create too many dimensions
• Keep It Super Simple
• If you have a lot of attributes in a dimension and some are SCD1 and
some SCD2 it may make sense to split the dimension in two
• If a dimension becomes huge (>1M rows) it’s worth analyzing how to
split it into two or more dimensions
• Keep security in mind right from the very first steps
• Since this may require you to change the way you model your Data Warehouse
Engineering the Solution
Architecture is well known
• We now have «architectural» elements of a BI solution
• Inmon / Kimball / Other
• Star Schema / Snowflake Schema
• Facts & Dimensions
• In some specific cases we also have well-known «Design
Pattern»
• Slowly Changing Dimensions
Implementation is problematic
• So, from an architectural point of view, we can be happy. But
from the implementation standpoint, what can we say?
• Each time we have to start from scratch
• Every person has their own way to implement the architectural solutions
adopted
• The quality of the implementation is directly proportional to the
experience of the implementer
Time lost in low-value work
• You lose a lot of time implementing “technical” stuff. Time that
is subtracted from the identification of the optimal solution to
the business problem
• E.g.: loading an SCD type 2. How much time will you spend on its development?
• From 2 days to 10 days depending on the experience that you have
• And a minimum of 2 days is still there
• Since there are no standard implementation rules, each one
applies their own
• That works, but every implementation is different
Choices
• In the development of a BI solution you will need to make a lot of
choices in terms of architecture and implementation
• Every choice we make brings pros and cons
• It will impact the future of the solution
• How do you choose? Who chooses? Why? Are all the people on the team
able to make autonomous choices?
• How can you be sure that all those choices do not conflict with each
other?
• Especially when performed by different people?
Reaching the goal - 1
• This is the situation
• Everyone follows their own path
• It would be better to work in harmony…
• …with common rules
DW is a TeamWork
• Problems arise when the team is made of several people
• One person works well alone
• «Geniuses» (or geniuses-wannabe ) work well together
• We need to do an “exceptional” job with “normal” people. Smart and
willing but “normal”
• A minimum quality must be “guaranteed” regardless of who does the work
• It must be easy to "scale" the number of people at work
• It must be easy to replace a person
• It’s vital to allow people to do what they do best: to give added value to the
solution. The "monkey work" should be as small as possible.
Software Engineering for BI
• «Software Engineering is the application of a systematic,
disciplined, quantifiable approach to the development,
operation, and maintenance of software, and the study of these
approaches; that is, the application of engineering to software”
IEEE Computer Society
With clear and well defined rules…
• We’d like to have this!
• So, we need to formally define
our rules for work
Objectives
• What are the objectives we want to set?
• It must be possible to "change our mind" during development (and thus
being independent of the initial architectural choices)
• Each person must be able to solve the given problem in a personal
way, but the implementation of the solution should be made following a
common path
• Careless mistakes and errors due to repetitive processes should
be minimized
• It must be possible to parallelize and (when possible) to
automate the work
• The solution must be testable
• It must have rigidity and flexibility at the same time
• It should be “adaptive”!
Achieve a common goal
• Everything must be designed to achieve a common goal:
• Spend more time to find the best solution to the business problem
• Spend (much) less time to implement the solution
• making as few mistakes as possible
• preventing common mistakes
• In other words, take the best from each player on the field
• Men -> Added value: Intelligence
• Machine -> Added value: Automation
Engineering The Solution
• A set of rules that defines
• Naming Convention
• Mandatory Objects / Attributes
• Standard implementations of solutions to common problems
• Dependencies between objects
• Best practices and development methodology
• Each and every rule has the purpose to
• Prevent Errors
• Set a Standard
• Assure Maintainability
• Help Team Scale-Out
• Let developers concentrate more on solving the business problem and less on the
implementation
Engineering The Solution
• All rules presented here are born from real-world experience
• Following the Agile Principle of Simplicity
• Metadata are embedded in the rules
• Sometimes this leads to some ugly solutions…
• …if you want to avoid this, external files/documents MUST be
maintained
Building the Data Warehouse
Engineering The Solution
• A BI Solution has three main layers
• Producers
• Coordinators
• Consumers
• Producers Layer
• Contains all the data sources
• Coordinators Layer
• Contains all objects that process source data into a Data Warehouse
• Consumers Layer
• Where Data Warehouse data is consumed
Engineering The Solution
• A BI solution can be thought of as made of
3 different layers
• Data flows from and only from lower
levels to higher levels
• Higher levels don’t know how data is
managed in lower levels
• (Information Hiding Principle)
Producers
Coordinators
Consumers
Databases
• Core
• Configuration
• Staging
• Data Warehouse
• Optional (recommended)
• Helper
• Support
• Log
• Metadata
[Diagram: OLTP SYS 1 and OLTP SYS 2 feed, through Helper 1 and Helper 2, the Staging and then the Data Warehouse database, supported by the Configuration, Metadata and Log databases]
Engineering The Solution
[Same diagram mapped to the layers: the OLTP systems and the Helper databases are the Producers; Staging, Configuration and the Data Warehouse are the Coordinators; Cubes and Reports are the Consumers]
Databases
• Helper
• Contains objects that permit access
to the data in the OLTP
database.
Databases
• Staging
• Contains intermediate “volatile”
data
• Contains ETL procedures and
support objects (like err tables)
Databases
• Configuration
• objects that add additional value
to the data (e.g.: lookup tables)
• objects that allow the BI solution
to be configurable, e.g.: for which
company to load data
Databases
• Data Warehouse
• The final data store
Databases
• Metadata
• Contains all the information needed to automate the creation and the
loading of
• Staging
• Data Warehouse
• Log
• Guess? 
Databases
• Naming Convention:
• projectname_*
• * = CFG, LOG, STG, DWH, MD, HLP
• Databases Files
• STG & DWH databases MUST be created with 2 filegroups (at least)
• PRIMARY (system catalogs),
• SECONDARY (all other tables). This is the default filegroup
• Strongly recommended also for other databases
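As a minimal sketch of the filegroup rule above (database name, file names and paths are purely illustrative, not part of the naming convention):

-- Illustrative: STG database with PRIMARY + SECONDARY filegroups
CREATE DATABASE myproject_STG
ON PRIMARY
    (NAME = 'myproject_STG_sys',  FILENAME = 'D:\Data\myproject_STG_sys.mdf'),
FILEGROUP [SECONDARY]
    (NAME = 'myproject_STG_data', FILENAME = 'D:\Data\myproject_STG_data.ndf')
LOG ON
    (NAME = 'myproject_STG_log',  FILENAME = 'E:\Log\myproject_STG_log.ldf');

-- Make SECONDARY the default filegroup, so user tables land there automatically
ALTER DATABASE myproject_STG MODIFY FILEGROUP [SECONDARY] DEFAULT;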
Schemas
• Schemas help to
• create logical boundaries
• distinguish objects scopes
• Several Schemas used to identify the different scopes
• stg, etl, cfg, dwh, tmp, bi, err, olap, rpt
• optional “util” schema to store utility objects
• eg: fn_Nums, a function to generate numbers
• A schema (generally) cannot be used in more than one database
• Prevents careless mistakes
Schemas
[Diagram: schema-to-database mapping – the bi schema lives in the Helper database (views over the OLTP); stg, etl, tmp, err and util in Staging; dwh, olap and rpt in the DWH; cfg in Config; md in MetaData; log in Log]
Views
• Views are the key of abstraction
• Shields higher levels from the complexity of underlying levels
• Used throughout the entire solution to reduce “friction” between
layers and objects
• Apply the “Information Hiding Principle” (helps to have teams that work
in parallel)
• Helps to auto-document the solution
Views
• General Rules
• Do basic data preparation in order to simplify SSIS package
development
• Casts
• Column rename
• Basic Data Filtering
• Simple data normalization and cleansing
• Join tables
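A sketch of such a preparation view, with purely illustrative table and column names:

-- Illustrative “data preparation” view: casts, renames, basic filtering and a join
CREATE VIEW etl.vw_orders
AS
SELECT
    o.OrderID                            AS bk_order,       -- rename to the BK naming convention
    CAST(o.OrderDate AS date)            AS order_date,     -- cast to the DWH data type
    LTRIM(RTRIM(c.CustomerCode))         AS customer_code,  -- simple cleansing
    CAST(o.TotalAmount AS decimal(19,4)) AS total_amount
FROM stg.Orders AS o
JOIN stg.Customers AS c
    ON c.CustomerID = o.CustomerID
WHERE o.IsTest = 0;                                         -- basic data filtering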
Stored Procedures
• Their usage should be very very limited
• The majority of ETL logic is in SSIS
• Usage
• Incremental Load/Management
• SCD loading (MERGE)
• Dummy member management
• Additional abstraction that helps to avoid changing SSIS packages
• for debugging (import one specific fact table row)
• for optimizations (eg: query hints)
• for ordering data
Basic Concepts
• A dimension will gather data from one or more data sources
• The dimension will hold the key value of each source entity (if
available)
• The “Business Key”
Basic Concepts
• The Business Key won’t be used to relate the Dimension to the Fact table
• A surrogate key will be created during the ETL phase
• The surrogate key will be used to create the relationship
• The Surrogate key has several advantages
• Is meaningless
• Is small
• Is independent from the data source
• Helps to make the fact table smaller
Why Integer Keys are Better
• Smaller row sizes
• More rows/page = more compression
• Faster to join
• Faster in column stores
Dimensions – Example
• Data comes from three tables: Departments, SubDepartments
and Working Area (sample model from a Logistics company)
[Screenshot: dimension table showing the Surrogate Key, the Business Keys and the «Payload» columns]
Dimensions – Key points
• A dimension is (usually) created using data coming from master
data or reference tables
• OLTP PK/AK -> Business Key
• Dimension PK will be artificial and surrogate
SCD Type 1
• Scope
• Update data to last value
• Implementation
• UPDATE
SCD Type 2
• Scope
• Keep all the past values and the current ones
• Implementation
• Row Valid Time + UPDATE + INSERT
SCD Type 3
• Scope
• Keep the current value and the one before that only
• Implementation
• Specific Columns + UPDATE
SCD Key vs BK
• We defined the SCD Key as the key used to look up dimension
data while loading the fact table
• It may not be made of *ALL* BKs
• It’s an ALTERNATE KEY (and thus is UNIQUE)
Hierarchies
• In our sample the dimension also holds a (natural) hierarchy
• Department > Subdepartment > Working Area
Things to keep in mind
• Huge dimension (>1M members)
• Consider splitting it in two
• Dimension with SCD1+SCD2 attributes
• Consider splitting it in two
• Security: keep it in mind from the beginning since it may be a
painful process if done afterwards
Dimensions Rules
• Dimensions have to be created in
• Database: DWH - Schema: dwh
• Table rules
• Name: dim_<plural_dimension_name>
• Dimension key: id_<table_name>
• Surrogate / Artificial Key
• Business Key: prefixed by bk_
• Additional mandatory columns
• last_update (datetime) or log id (int)
• scd1_checksum / scd2_checksum
• only one or both, depending on scd usage
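Putting the rules above together, a dimension table could look like the following sketch (the payload columns are illustrative):

-- Illustrative dimension following the naming convention above
CREATE TABLE dwh.dim_customers
(
    id_dim_customers  int IDENTITY(1,1) NOT NULL
        CONSTRAINT pk_dim_customers PRIMARY KEY,  -- surrogate / artificial key
    bk_customer_code  varchar(20)   NULL,         -- business key from the source
    customer_name     nvarchar(100) NOT NULL,     -- “payload” attributes
    customer_country  nvarchar(50)  NOT NULL,
    scd1_checksum     bigint        NOT NULL,     -- change-detection checksum
    last_update       datetime      NOT NULL
) ON [SECONDARY];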
Dimensions Dummy Values
• Add at least one «dummy» value
• To represent a “not available” data
• Dummy value rules
• Dimension key: negative number
• Business Key: NULL
• Fixed values for text and numeric data
• Text: “N/A” or “Not Available”
• Choose appropriate terms if more than one dummy exists
• Numeric: NULL
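Following these rules, the dummy member of the illustrative dimension above could be inserted like this:

-- Illustrative “not available” dummy member
SET IDENTITY_INSERT dwh.dim_customers ON;

INSERT INTO dwh.dim_customers
    (id_dim_customers, bk_customer_code, customer_name, customer_country,
     scd1_checksum, last_update)
VALUES
    (-1, NULL, N'N/A', N'N/A', 0, GETDATE());  -- negative key, NULL business key

SET IDENTITY_INSERT dwh.dim_customers OFF;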
Date Dimension
• Date Dimension is an exception
• Key (id_dim_date) is not
meaningless
• Integer Data Type
• Format: yyyymmdd
• This allows easier queries on the fact table and usage of negative
dummy values for dummy members
• Eg: Unknown Date, Erroneous Date, Invalid Date
• Don’t need last_update and scd_checksum mandatory columns
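A sketch of how the yyyymmdd smart key can be derived from a real date during the load (the source view is the illustrative one used before):

-- yyyymmdd smart key for the Date dimension (e.g. 2013-02-15 -> 20130215)
SELECT
    CAST(CONVERT(char(8), o.order_date, 112) AS int) AS id_dim_date,
    o.order_date
FROM etl.vw_orders AS o;

-- Dummy members keep negative keys, e.g. -1 = Unknown Date, -2 = Invalid Date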
Time Dimension
• Time Dimension is also an exception
• Key (id_dim_time) is not
meaningless
• Integer Data Type
• Format: hhmmss
• Don’t need last_update
and scd_checksum
mandatory columns
• If Drill-Down from Date to Time is not mandatory, Date & Time should be two separate
Dimensions
Fact Tables
• More than one fact table may exist within the same DW solution
• Different Granularity? Different Fact Table!
• It’s only important that they all use the same dimensions
• where applicable
• Example: Product Sales and Product Costs
• This allows coherent queries
Transactional Fact Table
• «total_amount» can just be summed up to get aggregated
values for all possible combinations of dimension values
Snapshot Fact Table
• All data is stored for each snapshot taken.
• «Snapshot Date» is mandatory for almost all analyses
Temporal Snapshot Fact Table
• Each row represents an interval (max one year wide)
[Diagram: a fact row covering the underlying interval 20090701 -> 20090920]
Temporal Snapshot Fact Table
• Some real-world usage
• Using the Temporal Snapshot Fact Table
• 148,380,542 rows using 13 GB
• Without this technique we would have had
• 11,733,038,614 rows using about 1 TB of data
• And this is just for one month. So for one year we would have more than
10 TB of data.
Fact Tables
• Fact Tables have to be created in
• Database: DWH - Schema: dwh
• Table rules
• Table: fact_<plural_fact_name>
• Fact key: id_[fact]_<table_name>
• Additional mandatory columns
• insert_time (datetime) or log id (int)
• Foreign Key to Dimensions: not needed
• Put the business key columns of the source OLTP table into the fact table to ease
debugging and error checking
• If the BKs are not too big 
• Business Key: prefixed by bk_
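Applying these rules, a transactional fact table could look like the following sketch (measures and dimension references are illustrative):

-- Illustrative fact table following the naming convention above
CREATE TABLE dwh.fact_orders
(
    id_fact_orders    bigint IDENTITY(1,1) NOT NULL,
    id_dim_date       int NOT NULL,       -- yyyymmdd smart key
    id_dim_customers  int NOT NULL,       -- surrogate keys, no FK constraints
    id_dim_products   int NOT NULL,
    bk_order          varchar(20) NULL,   -- source business key, kept for debugging
    quantity          int NOT NULL,
    total_amount      decimal(19,4) NOT NULL,
    insert_time       datetime NOT NULL
) ON [SECONDARY];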
Factless/Bridge Tables
• Factless/Bridge Tables have to be created in
• Database: DWH - Schema: dwh
• Table rules
• Table: factless_<plural_table_name>
• Factless key: not needed
• Foreign Key to Dimensions: not needed
• Additional mandatory columns
• insert_time (datetime) or log id (int)
The DW Query Pattern
SELECT foo [..n], <aggregate>(something)
FROM dwh.fact F
JOIN dwh.dim_a A
ON F.id_a = A.id_a
JOIN dwh.dim_b B
ON F.id_b = B.id_b
WHERE <filter>
GROUP BY foo [..n]
The expected Relational Query Plan
[Plan diagram: Fact CSI Scan plus Dimension Scans/Seeks feeding batch-mode hash table Builds and Hash Joins, followed by a Partial (Hash) Aggregate and a final Stream Aggregate]
Loading the Data Warehouse?
Loading the Data Warehouse
• Loading the DWH means doing ETL
• Extract data from data sources
• Databases, Files, Web Services, etc.
• Transform extracted data so that
• It can be cleansed and verified
• It can be enriched with additional data
• It can be placed into a star-schema
• Load data into the Data Warehouse
Loading the Data Warehouse
• ETL is usually the most complex and long phase
• roughly 80% of the entire work is done here
• Integration Services is the engine we use to do ETL
• Very very fast
• Completely In-Memory
• 64 bits aware
• Very scalable
Loading the Data Warehouse
• SSIS does NOT substitute T-SQL
• T-SQL and set based-operations are still faster
• When possible avoid working on per-row basis but favor «set-based»
operations
• Just keep in mind that you have to deal with the t-log
• They are complementary and work together
• T-SQL: ideal for “simple” set-oriented data manipulation
• SSIS: ideal for complex, multi-stage, data manipulation
• Advanced scripting through SSIS Expression or .NET
Loading the Data Warehouse
• Integration Services and T-SQL plays the major role here
• .NET help may be needed from time to time for complex transformations
• Our objective: create an ETL solution in such a way that it is almost self-
documented
• It should be possible to understand what the ETL does, just by «reading» the SSIS
Packages
• Following the KISS principle, avoid mixing ETL logic
• “Simple” ETL logic in views
• “Complex” ETL logic in SSIS Packages
Loading the Data Warehouse
• SSIS will NEVER load data directly from a table
• ALWAYS go through a view
• Views will decrease the complexity of packages and make them loosely coupled with
the database schema
• This will make SSIS development easier
• Simple filtering changes or joins can be changed here without having to touch
SSIS
• SSIS Package are like applications!
• Only one exception to this rule will be seen in loading Fact and
Dimension tables
• Exception is made since there is a case where using a view will not decrease
complexity
Divide et Impera
• To be Agile it is *vital* to keep business and technical
processes completely separated
• Business Process: ETL logic that can be applied only to the
specific solution you’re building
• Technical Process: ETL logic that can be used with any Data
Warehouse and that can be highly automated
Divide et Impera
• Follow the “Divide et Impera” principle
• Move data from OLTP to Staging
• Move data from Staging to Data Warehouse
• Create at least two different SSIS solutions
• One to load the Staging Database
• One to load the Data Warehouse Database
Divide et Impera
[Diagram: OLTP → ETL → STG → ETL → DWH; the extraction into staging and the final load are the Technical Process, the transformations in between are the Business Process]
Loading the Data Warehouse – Step 1
[Diagram: the OLTP systems and Other Data Sources are extracted & loaded into STG through the views in the HLP database]
Loading the Data Warehouse – Step 1
• First step is to load data into staging database
• From Data Sources
• NO “Transformation” here, just load data as is
• In other words, create a copy of OLTP data used in the BI solution
• Total or Partial in case of Incremental Load
• This will make us free to do complex ETL queries without interfering with
production systems
• Only filter data that by definition should not be handled by BI solution
• Sample or Test data
The “Helper” database
• Create views to expose data that will be used to create DWH
• Views are simple “SELECT columns FROM…”
• no data transformation allowed
• no casts, no column renaming, no data cleansing
• only filter data that should never ever be imported into DWH
• eg: customer id 999 which is the “test customer”
• Views have to be put in the bi schema
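A helper view is therefore nothing more than a plain projection plus that one filter, roughly like this (table and column names are illustrative):

-- Illustrative helper view: straight SELECT, no transformation at all
CREATE VIEW bi.vw_customers
AS
SELECT
    CustomerID,
    CustomerCode,
    CustomerName,
    Country
FROM dbo.Customers
WHERE CustomerID <> 999;   -- never import the “test customer”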
Loading the Data Warehouse – Step 2
[Diagram: inside STG, the ETL transformations run through Views and Stored Procedures, using the TMP and ERR schemas and the CFG database]
Loading the Data Warehouse – Step 2
• Second step is to transform data so that it can be loaded into
the Data Warehouse
• “Transform” can be a complex duty
• Transform = Cleanse, Check, De-Duplicate, Correct
• Data may have to go through several transformations in order to reach
the final shape
• All intermediate values will never leave the staging database
• Here is where you’ll spend most of your time
The “Configuration” database
• “Configuration” data
• Data not available elsewhere
• E.g.: lookup tables of “Well-Known” values
• E.g.: C1 -> Company 1, C2 -> Company2
• Tables used to hold “configuration” data
• Use the cfg schema
The “Staging” Database
• Contains a copy of OLTP data
• Only the needed data, of course 
• Copying data is fast. This allows us to avoid using the OLTP database for
too long
• Avoid concurrency problems
• All further work will be done on the BI server and won’t affect OLTP performance
• Data from the tables of the OLTP data sources has to be copied
into staging tables
• tables must have the same schema as the OLTP tables
• staging tables have to be created in the stg schema
The “Staging” Database
• Contains intermediate tables used to transform the data
• Favor usage of several intermediate tables (even if you’ll use more
space) instead of doing everything in memory with SSIS
• This will make debugging/troubleshooting much easier!
• The correct balance to decide how many intermediate tables are needed has to
be found on a per-project basis
The “Staging” Database
• Tables used to hold data coming from files
• E.g.: Excel, Flat Files
• Use the etl schema
• Tables used to hold intermediate data
• Use the tmp schema
• Objects used in the ETL phase
• Views, Stored Procedures, User-Defined Functions, etc.
• All these objects must be placed in the etl schema
The “Staging” Database
• Views prepare data to be further processed by SSIS
• SSIS read data only from views
• Source view naming convention
• vw_<logical_name>
• E.g.: etl.vw_claims
• Destination table naming convention
• <logical_name>
• E.g.: tmp.claims
• If ETL has to be done in more than one step
• append the «step_number» to the object name
• E.g.: etl.vw_claims_step_1, tmp.claims_step_1
The “Staging” Database
• Views take care of creating a “logical” view of dimension or fact
data
• rename columns to give human understandable meaning
• CAST data types in order to make them consistent with the one used in
DWH
• perform basic data filtering and data re-organization
• e.g.: flatten hierarchies to “n” columns, trim white spaces
• perform basic ETL logic
• CASE statements, ROW_NUMBER, joins, etc.
The “Staging” Database
• ETL Stored procedures are used only to manage dimension
loading (SCD 1 or 2) and Dummy Members:
• Naming convention:
• etl.stp_merge_dim_<dimension target>
• etl.stp_add_dummy_dim_<dimension target>
The “Staging” Database
• The err schema contains tables that hold rows with errors that
cannot be corrected or ignored (rows that cannot be processed)
• For example: you have a temporal database and for some rows you
find that “Valid To” happens before “Valid From”
• This data can be later exposed to SMEs in order to fix it
• It is interesting to note that already in the middle of development the BI
solution becomes useful
• Helps to increase data quality
Loading the Data Warehouse – Step 3
[Diagram: SSIS moves data from STG to DWH, reading through Views and using Stored Procedures]
Loading the Data Warehouse – Step 3
• Third step is the loading of Data Warehouse
• Very simple: just take the transformed data from staging database and put it
into Facts and Dimensions
• Load all dimensions
• Generate dimension IDs
• Load fact tables
• “Just” convert business keys to dimension IDs
• Not so easy 
• Must handle incremental loading
• Mandatory for dimensions (otherwise reloaded data may end up with
different dimension IDs)
• Would be nice also for facts
• More complex when you have «early arriving facts»/«late arriving
dimensions»
Handling Dimension Keys
• Mapping Source Dimension Keys (the BK) to the surrogate
Dimension ID may be more complex than expected. You may
encounter several key «pathologies»
• Composite Keys, Zombie Keys, Multi Keys, Dolly Keys
• A good way to solve the problems is to add an additional abstraction
layer, using mapping tables
• Thomas Kejser has some very good posts on that here
• http://blog.kejser.org/tag/keys/
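The mapping-table idea can be sketched as a small table that translates every source key (or key combination) into the surrogate key, along these lines (names are illustrative):

-- Illustrative key-mapping table between source keys and surrogate keys
CREATE TABLE etl.map_customers
(
    source_system     varchar(10) NOT NULL,   -- handles keys coming from multiple sources
    bk_customer_code  varchar(20) NOT NULL,
    id_dim_customers  int         NOT NULL,
    CONSTRAINT pk_map_customers PRIMARY KEY (source_system, bk_customer_code)
);

-- The fact load then resolves surrogate keys through the map, not through the raw BK
SELECT f.bk_order, m.id_dim_customers, f.total_amount
FROM tmp.orders AS f
JOIN etl.map_customers AS m
  ON m.source_system    = f.source_system
 AND m.bk_customer_code = f.bk_customer_code;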
The “Data Warehouse” database
• DWH database must contain only
• tables related to the dwh fact, factless and dimensions
• all tables must be in the dwh schema
• Views to allow access to physical tables
• use specific schemas to expose data to other tools
• use olap schema for views used by SSAS
• use rpt schema for views used by SSRS
• Add your own schema depending on the technology you use
• Or even create a Data Mart out of the Data Warehouse!
The “Data Warehouse” database
• Stored Procedures
• If needed for reporting purposes they must be put into the rpt schema
• No other use allowed
The “Data Warehouse” database
• Dimension loading
• Always incremental
• With all the rules in place there is only one way to load them 
• Of course there may be differences on a per-dimension basis
• But it is just like building a house. No two houses are identical, yet all are built following
the same rules
• This means that it can be completely automated!
The “Data Warehouse” database
• Fact tables loading
• Incremental would be nice
• But it may not be an easy task
• SQL Server 2008 CDC in the source can help a lot
• Sometimes just dropping and re-loading the facts is the most effective solution
• Rarely for the entire table
• More common with time-partitioning
• FAST load of fact tables:
• Drop and re-create indexes
• Remove Compression and add it later
• Load Partitions in Parallel
• A tool to automate partitioned table management exists 
• SQL CAT Partition Management Tool
Improving DW Querying Performance
• Use ColumnStore Indexes to speed up queries against the DW
(if you’re not using other additional solutions)
• Try to keep Factless/Bridge table as small as possible. A
Whitepaper details how to implement a «proprietary»
compression that works extremely well:
• http://www.microsoft.com/en-us/download/details.aspx?id=137
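On SQL Server this can be as simple as adding a columnstore index on the fact table; a sketch on the illustrative fact table used earlier (on 2012 the nonclustered columnstore makes the table read-only, so it is typically dropped before loading and recreated afterwards; 2014+ also offers a clustered, updatable one):

-- Nonclustered columnstore index on the fact table (SQL Server 2012+)
CREATE NONCLUSTERED COLUMNSTORE INDEX csi_fact_orders
ON dwh.fact_orders
    (id_dim_date, id_dim_customers, id_dim_products, quantity, total_amount);

-- From SQL Server 2014 a clustered columnstore can replace the row store entirely
-- CREATE CLUSTERED COLUMNSTORE INDEX ccsi_fact_orders ON dwh.fact_orders;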
Tools that helps
• Use Multiple Hash Component to calculate hash values
• http://ssismhash.codeplex.com/
• When looking up an SCD2 dimension, try to avoid the default
Lookup transformation since it does not support FULL cache in
this scenario. Matt Masson has a very good post on how to
implement «Range Lookups»
• http://bit.ly/SSISRangeLookup
Integration Services Rules
• Avoid usage of OLEDB Command in DataFlow
• It’s just too slow, prefer a set-based solution
• Try to do as much of the transformation / operations here as possible and NOT in
SSAS or SSRS
• In other words: avoid spreading the ETL process all around
• Always read from views
• Use of OPTION(RECOMPILE) is encouraged so that we can have optimum
plans
• Except for Dimension loading lookup component
• (Doesn’t help to lower complexity)
Integration Services Rules
• Package Naming Convention
• Use “setup_” prefix for all packages that contain logic that must be run
first in order to be able to load data
• Use “load_” prefix for all packages that loads data into “final” tables
• E.g.: staging tables, dwh tables
• Use “prepare_” prefix for all packages that transform data in order to
make it usable by another transformation phase
• E.g.: tmp tables
• Use a sequence number (###)
• To group all independent packages
• To quickly identify package dependencies
Integration Services Rules - Staging
• load_DFKKKO, load_DFKKOP, load_BUT000, load_<xxxxxxxx>
• All these packages are independent from each other and can be run simultaneously
• prepare_010_orders, prepare_010_customers
• Independent from each other and can be run simultaneously, but they work on
data loaded by the “load_” packages
• prepare_020_invoices, prepare_020_orders
• Independent from each other and can be run simultaneously, but they work on
data loaded by the previous “prepare_” packages
Integration Services Rules - DWH
load_dim_time
load_dim_customers
load_dim_products
load_dim_categories
load_dim_geography
load_fact_orders
load_fact_invoices
load_fact_costs
load_factless_products_categories
First load all Dimensions
Then load all Facts
Then load all Factless tables
Integration Services Rules
• One “action” per package!
• With SQL Server 2012+ use Shared Connections and the «Project»
deployment model
• Use one or more “Master Package” to execute packages in the correct sequence /
parallelism
• With previous versions, try to make sure that all packages of the same
layer (STG or DWH) use the same connection managers
• In this way you can have only one configuration file to configure connections when
running packages
• Don’t bother too much about logging
• SQL Server 2012+ has native support
• http://ssis-dashboard.azurewebsites.net/
• If using SQL Server 2005 or 2008/R2 use DTLoggedExec
• http://dtloggedexec.codeplex.com/
Building a DWH in 2013
• It is still an (almost) manual process
• A *lot* of repetitive low-value work
• No (or very few) standard tools available
How it should be
• Semi-automatic process
• “develop by intent”
• Define the mapping logic from a
semantic perspective
• Source to Dimensions / Measures
• (Metadata anyone?)
• Design the model and let the
tool build it for you
CREATE DIMENSION Customer
FROM SourceCustomerTable
MAP USING CustomerMetadata
ALTER DIMENSION Customers
ADD ATTRIBUTE LoyaltyLevel
AS TYPE 1
CREATE FACT Orders
FROM SourceOrdersTable
MAP USING OrdersMetadata
ALTER FACT Orders
ADD DIMENSION Customer
The perfect BI process & architecture
Iterative!
Invest on Automation?
• Faster development
• Reduce Costs
• Embrace Changes
• Less bugs
• Increase solution quality and
make it consistent throughout
the whole product
Automation Pre-Requisites
• Split the process to have two separate types of processes
• What can be automated
• What can NOT be automated
• Create and impose a set of rules that defines
• How to solve common technical problems
• How to implement such identified solutions
No Monkey Work!
Let the people think and
let the machines do the
«monkey» work.
Design Pattern
“A general reusable
solution to a commonly
occurring problem within
a given context”
Design Pattern
• Generic ETL Pattern
• Partition Load
• Incremental/Differential Load
• Generic BI Design Pattern
• Slowly Changing Dimension
• SCD1, SCD2, etc.
• Fact Table
• Transactional, Snapshot, Temporal Snapshot
Design Pattern
• Specific SQL Server Patterns
• Change Data Capture
• Change Tracking
• Partition Load
• SSIS Parallelism
Engineering the DWH
• “Software Engineering allows and requires the formalization of
the software building and maintenance process.”
Sample Rules
• Always put «last_update» column
• Always log Inserted/Updated/Deleted rows to log.load_info table
• Use FNV1a64 for checksums
• Use views to expose data
• Dimension & Fact views MUST use the same column names for lookup
columns
Engineering the DWH
There are two intrinsic
processes hidden in the
development of a BI
solution that must be
allowed (or forced) to
emerge.
Business Process
• Data manipulation,
transformation, enrichment &
cleansing logic
• Specific for every customer.
Almost not automatable
Technical Process
• Application of data extraction
and loading techniques
• Recurring (pattern) in any
solution
• Highly Automatable
Hi-Level Vision
[Diagram: OLTP → ETL → STG → ETL → DWH; the extraction into staging and the final load are the Technical Process, the transformations in between are the Business Process]
ETL Phases
• «E» and «L» must be
• Simple, Easy and Straightforward
• Completely Automated
• Completely Reusable
• «E» and «L» have ZERO value in a BI Solution
• Should be done in the most economic way
Source Full Load
E
Source Incremental Load
E
In this scenario,
“ID” is an IDENTITY/SEQUENCE.
Probably a PK.
Source Differential Load/1
E
In this scenario the source table
doesn’t offer any specific way to
understand what’s changed
Source Differential Load/2
E
In this scenario the source table
has a TimeStamp-Like column
Source Differential Load
• SQL Server 2012 has features that can help with incremental/differential load
• Change Data Capture
• Natively supported in SSIS 2012
• http://www.mattmasson.com/2011/12/cdc-in-ssis-for-sql-server-2012-2/
• Change Tracking
• Underused feature in BI…not as rich as CDC but MUCH simpler and easier
E
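As a sketch, a Change Tracking based differential extraction could look like this (the source table and the handling of the last synchronized version are illustrative; the CHANGETABLE pattern is the standard one):

-- Differential extract using Change Tracking (CT must be enabled on dbo.Orders)
DECLARE @last_sync_version bigint = 42;   -- illustrative: persisted by the ETL between runs

SELECT
    ct.SYS_CHANGE_OPERATION,              -- I / U / D
    ct.OrderID,
    s.OrderDate,
    s.TotalAmount
FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct
LEFT JOIN dbo.Orders AS s
    ON s.OrderID = ct.OrderID;            -- LEFT JOIN: deleted rows have no match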
SCD 1 & SCD 2
L
[Flowchart: Start → look up the Dimension Id and MD5 checksum from the Business Key →
calculate the MD5 checksum of the non-SCD-key columns →
Is the Dimension Id NULL? Yes: insert the new member into the DWH;
No: are the checksums different? Yes: store the row into a temp table →
merge the data from the temp table into the DWH → End]
SCD 2 Special Note
• Merge => UPDATE Interval + INSERT New Row
L
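A common T-SQL sketch of this pattern, assuming an SCD2 dimension with valid_from/valid_to columns and a scd2_checksum (all names are illustrative): the MERGE closes the current interval and its OUTPUT feeds the INSERT of the new row versions.

-- SCD2 load sketch: close the old interval, then re-insert changed members as new rows
INSERT INTO dwh.dim_customers
    (bk_customer_code, customer_name, scd2_checksum, valid_from, valid_to, last_update)
SELECT
    bk_customer_code, customer_name, scd2_checksum, GETDATE(), NULL, GETDATE()
FROM (
    MERGE dwh.dim_customers AS tgt
    USING tmp.customers     AS src
       ON tgt.bk_customer_code = src.bk_customer_code
      AND tgt.valid_to IS NULL                          -- match only the current version
    WHEN NOT MATCHED BY TARGET THEN                     -- brand new member
        INSERT (bk_customer_code, customer_name, scd2_checksum, valid_from, valid_to, last_update)
        VALUES (src.bk_customer_code, src.customer_name, src.scd2_checksum, GETDATE(), NULL, GETDATE())
    WHEN MATCHED AND tgt.scd2_checksum <> src.scd2_checksum THEN
        UPDATE SET tgt.valid_to = GETDATE()             -- close the old interval
    OUTPUT $action AS merge_action,
           src.bk_customer_code, src.customer_name, src.scd2_checksum
) AS changed
WHERE changed.merge_action = 'UPDATE';                  -- changed members become new rows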
FACT TABLE LOAD
L
Partition Load
EL
Parallel Load
• Logically split the work in several steps
• E.g.: Load/Process one customer at a time
• Create a «queue» table that stores information for each step (see the sketch below)
• Step 1 -> Load Customer «A»
• Step 2 -> Load Customer «B»
• Create a Package that
• Picks the first item not already picked up
• Does the work
• Goes back to pick the next item
• Call the Package «n» times simultaneously
EL
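The «queue» table and the «pick the next item» step can be sketched in T-SQL as follows; the UPDATE with READPAST/UPDLOCK is what lets «n» package instances pick items safely in parallel (all names are illustrative):

-- Illustrative work-queue table
CREATE TABLE cfg.load_queue
(
    queue_id      int IDENTITY(1,1) PRIMARY KEY,
    customer_code varchar(20) NOT NULL,    -- the unit of work, e.g. one customer
    picked_up     datetime NULL,
    completed     datetime NULL
);

-- Each package instance atomically picks the next free item
UPDATE TOP (1) q
   SET q.picked_up = GETDATE()
OUTPUT inserted.queue_id, inserted.customer_code
FROM cfg.load_queue AS q WITH (READPAST, UPDLOCK, ROWLOCK)
WHERE q.picked_up IS NULL;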
Other SSIS Specific Patterns
• Range Lookup
• Not natively supported
• Matt Masson has the answer in his blog 
• http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-
lookups.aspx
Metadata
• Provide context information
• Which columns are used to build/feed a Dimension?
• Which columns are Business Keys?
• Which table is the Fact Table?
• How are Fact and Dimension connected?
• Which columns are used?
How to manage Metadata?
• Naming Convention
• Specific, Ad Hoc Database or Tables
• JSON
• Other (XML, File, etc.)
Naming Convention
• The easiest and cheapest
• No additional (hidden) costs
• No need to be maintained
• Never out-of-sync
• No documentation need
• Actually, it IS PART of the documentation
• Imposes a Standard
• Very limited in terms of flexibility and usage
Extended Properties
• Supports most metadata needs
• No additional software needed
• Very verbose usage
• Development of a wrapper to make usage simpler is feasible and
encouraged
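Verbose indeed: tagging a single column as a Business Key with the built-in procedure looks roughly like this, which is why a thin wrapper is worth writing (the property name and value are illustrative):

-- Mark a column as a Business Key via an extended property
EXEC sys.sp_addextendedproperty
     @name       = N'BIMetadata.Role',
     @value      = N'BusinessKey',
     @level0type = N'SCHEMA', @level0name = N'dwh',
     @level1type = N'TABLE',  @level1name = N'dim_customers',
     @level2type = N'COLUMN', @level2name = N'bk_customer_code';

-- Reading it back
SELECT objname, value
FROM sys.fn_listextendedproperty(N'BIMetadata.Role',
     N'SCHEMA', N'dwh', N'TABLE', N'dim_customers', N'COLUMN', N'bk_customer_code');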
Metadata Objects
• Dedicated Ad-Hoc Database and Tables
• As Flexible as you need
• Maintenance Overhead to keep metadata in-sync with data
• Development of automatic check procedures is needed
• DMVs can help a lot here
• Need a GUI to make them user-friendly
JSON
• Could be expensive to keep them in-sync
• A tool is needed, otherwise too much manual work
• User and Developer Friendly!
• VERY flexible
• If it grows too much, JSON.Net Schema may help
• Supported by Visual Studio
• And by SQL Server 2016
Automation Scenarios
• Run-Time: «Auto-Configuring» Packages
• Really hard to customize packages
• SSIS limitations must be managed
• E.g.: a Data Flow cannot be changed at runtime
• On-the-fly creation of packages may be needed
• Design-Time: Package Generators / Package Templates
• Easy to customize created packages
Automation Solutions
• Specific Tool/frameworks
• BIML / MIST
• SQL Server Platform
• SQL, PowerShell, .NET
• SMO, AMO
Package Generators
• Required Assemblies
• Microsoft.SqlServer.ManagedDTS
• Microsoft.SqlServer.DTSRuntimeWrap
• Microsoft.SqlServer.DTSPipelineWrap
• Path:
• C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies
Useful Resources
• «STOCK» Tasks:
• http://msdn.microsoft.com/en-us/library/ms135956.aspx
• How to set Task properties at runtime:
• http://technet.microsoft.com/en-
us/library/microsoft.sqlserver.dts.runtime.executables.add.aspx
BIML – BI Markup Language
• Developed by Varigence
• http://www.varigence.com
• http://bimlscript.com/
• MIST: BIML Full-Featured IDE
• Free via BIDS Helper
• Support “limited” to SSIS package generation
• http://bidshelper.codeplex.com
Testing the Data Warehouse
Data Warehouse Unit Test
• Before releasing anything, data in the DW must be tested.
• User has to validate a sample of data
• (e.g.:total invoice amount of January 2012)
• That validated value will become the reference value
• Before release, the same query will be executed again. If the data matches
the expected reference value then the test is green, otherwise the test
fails
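In its simplest form such a test is just a query that compares the aggregate in the DW with the agreed reference value, e.g. (values and names are illustrative; tools like NBi simply wrap this kind of check in a test runner):

-- Reference value validated by the user: total invoice amount of January 2012
DECLARE @expected decimal(19,4) = 1234567.89;

SELECT CASE
         WHEN SUM(f.total_amount) = @expected THEN 'PASS'
         ELSE 'FAIL'
       END AS test_result
FROM dwh.fact_invoices AS f
WHERE f.id_dim_date BETWEEN 20120101 AND 20120131;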
Data Warehouse Unit Test
• Of course tests MUST be automated when possible
• Visual Studio
• BI.Quality (on CodePlex…now old)
• Based on Nunit
• NBI is the new way to go http://www.nbi.io/ !
• Based on Nunit
• What to test?
• Structures
• Aggregated results
• Specific values of some «special» rule
• Fixed bugs/tickets
• Values in the various layers
The Complete Picture
Modern Data Environment
[Diagram, repeated from earlier: Structured Data, Unstructured Data and Big Data feed Master Data and the EDW; Data Marts serve the BI Environment (Decision Maker), while the Analytics Environment serves the Data Scientist]
Modern Data Environment - Details
[Detailed diagram: Files, Web Services, Cloud/Syndicated data, RDBMS and Master Data are Extracted into Staging, with an Archive/Replay area on Big Data; data is Standardised and Transformed into Dimensions and Facts, then Aggregated, Processed and Copied into Cubes, V-Marts and Marts, which Secure and Expose it]
Inside The Data Warehouse
[Diagram: SSIS reads the source tables through the bi.* views and loads the stg.* tables; SSIS, driven by the etl.* objects and config.* tables, transforms data through the etl.* and tmp.* tables into the dwh.* tables, which are exposed via olap.* views (Analysis) and report.* views (Reporting)]
After the Data Warehouse
What’s Next?
• Now that the DW is ready, any tool can be used to create a
BI/Reporting solution on solid, simpler, user-friendly
ground.
• Reporting
• Reporting Services / Business Object / Microstrategy / JasperReports
• Analysis
• Analysis Services, Cognos
• Power Pivot, QlikView, Tableau, Power BI
Conclusion
A Starting Point
• The presented content can be used as is or as a starting point to build your
own framework
• Extend the content when it doesn’t fit your solution (for example: add
additional databases, like «SYSCFG», if this helps you)
• Define your rules! Drive the tools and don’t be driven by them!
• Keep the layers separated and favor loose coupling (less «friction» to
changes)
• Spread the idea of Unit Testing Data even if at the beginning it seems an
expensive approach.
Real World Samples
• The presented content comes from on-the field experience
• More than 40 (successful) project using the proposed approach
• More than 2000 packages managed (biggest solution: 572 packages)
• Several teams involved (biggest team: 12 people)
• Several customers grew their own standards starting from this
• Data coming from ANY source: SAP, Dynamics, DB2, Text or Excel Files
Some challenges faced
• Changed an entire accounting system, moving from one vendor to another
• DWH and OLAP/Reporting solution completely untouched. 2/3 of budget saved
• Started with a full load only and then added incremental load later
• Less than 5% of Extract and Load logic changed (Transformations untouched)
• Created a solution in 3 months with a minimal set of features that evolved and
grew into an enterprise data warehouse / BI solution.
• Monthly Delivery.
• Never released bad data (helped to correct errors in the source systems)
• Helped an enterprise company reduce time spent on crunching data by 66%.
Latest challenges faced
• Supported a *big* electronics retail company in creating their
BI/DSS solution on their shiny new Dynamics CRM installation.
• During CRM Development.
• The first specification document for reporting was very “agile”…
• “What do you need?”: “Don’t know, but all”
Thanks!
  • 18. Forrester Research Says That: • “Business intelligence (BI) is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information. It allows business users to make informed business decisions with (real-time) data that can put a company ahead of its competitors” • “Data warehouses form the back-end infrastructure”
  • 20. EDW: Reality Check • EDW is the trusted container of all company data • It cannot be created in “one day” • It has to grow and evolve with business needs. • (Likely) It will never be 100% complete
  • 21. Gather Requirement s Design Develop Delivery Generate Value • Too few stakeholders • Too many technical people • Too few iterations • Too slow • Too expensive • Illusion of Control Traditional Development Lifecycle
  • 22. A well known picture
  • 23. Adapt to Survive “50% of requirements change in the first year of a BI project” Andreas Bitterer, Research VP, Gartner
  • 24. A new approach is needed • Reduce Risk of misunderstanding • Increase chances to deliver a useful DW/BI project • Delivery Quickly • Immediately create value and get user feedback • Deliver Frequently • Prioritize • Set Quick-Win Objectives (again, create value) • Fail Fast (and Recover Quickly)
  • 25. Agile Manifesto • Our highest priority is to satisfy the customer through early and continuous delivery of valuable software. • Welcome changing requirements, even late in development. Agile processes harness change for the customer's competitive advantage. • Business people and developers must work together daily throughout the project.
  • 26. Agile Manifesto • The most efficient and effective method of conveying information to and within a development team is face-to-face conversation. • Simplicity - the art of maximizing the amount of work not done - is essential. • Source: http://agilemanifesto.org/principles.html
  • 27. Hi-Level Requirements JIT Model Implement Test Deliver Generate Value • Multi-Disciplinary Team • Many Iterations • Cost Effective • Quick Delivery • Iterative • True Control Agile Development Lifecycle - 1 Weeks or few Months
  • 28. Agile Project Startup • Identify the principal Business Unit • Define a small scope • Do some very small analysis and design • JEDUF / JITD • Create a Prototype • Let the users “play” with data • Redefine the requirements • Grow Build the definitive Project
  • 29. Prototype is a mandatory! • Start with small data samples • Help to understand data • MDM anyone? • Help to better estimate efforts • Low data quality is the problem • Create a bridge between developer and user • Help to check that the analysis is correct and project is feasible
  • 30. Prototype Outcomes • User will change/refocus their mind when they see the actual data • You have probably forgotten something • Usually «implied» (for the user) requirements • You may have misestimated data sizes
  • 31. Agile Project Lifecycle - 2 • Iterative Approach • The general scope is known • Not the details • Anything can (and will) change • Even already deployed objects • Only the certified data must stay stable • Otherwise solution will lose credibility Analyze Develop DeployTest Feedback Evolve
  • 32. Agile Project Best Practices • “JIT” Modeling: don’t try to model everything right from the beginning, but engineer everything so that it will be easy to make changes • Prioritize Requirements • Short iterations (weeks ideally) • Automate as much as you can • Follow a Test Driven Approach: release only after having tests in place! • «If ain’t tested it’s broken» (TDD Motto)
  • 33. Don’t Fear the Change! • Ability to Embrace Changes is a key value for the DW • DW and Users will grow and evolve together • Agility is a mindset more than anything else • There is NO “Agile Product” • There is NO “Agile Model” • Agility allows to fail fast (an recover quickly)
  • 34. Agile Challenges • Delivery Quickly and Fast • Challenge: keep high quality, no matter who’s doing the work • Embrace Changes • Challenge: don’t introduce bugs. Change the smallest part possible. Use automatic Testing to preserve and assure data quality.
  • 35. Taking the Agile Challenge • To be Agile, some engineering practices needs to be included in our work model • Agility != Anarchy • Engineering: • Apply well-known models • Define & Enforce rules • Automate and/or Check rules application • Measure • Test
  • 36. Information is like Water • How can you be sure that changes won’t introduce unexpected errors? • Data Quality Testing is Mandatory! • Unit Tests • Regression Tests • “Gate” Tests
  • 37. Agile Vocabulary • Agile introduces a lot of specific words • Here’s a very nice and complete summary: • https://www.captechconsulting.com/blog/ben-harden/learning-the-agile- vocabulary
  • 38. Lean BI? • Has the same objective of Agile BI: Support Business Decision in a ever-changing world • Limit the different types of waste that occur in BI projects (Lean Manufacturing), • Focus on the interdependencies of systems (Systems Thinking), • Develop based on values and principles in the agile manifesto (Agile Software Development). • http://www.maxmetrics.com/goingagile/agile-bi-vs-lean-bi/ • http://www.b-eye-network.com/view/10264
  • 39. Modeling the Data Warehouse
  • 40. Data Warehouse is Undefined • Data Warehousing is still a young discipline • Lacks Basic definitions • Data Warehouse • Data Marts • Few “universal” rules: • Depends on modeled business
  • 41. Data Mart or Data Warehouse ? • No “Standard” definition, but usually • «Data Marts» contains departmental data • «Data Warehouse» contains all data • The “role” played by DM/DW depends of the approach used • Inmon • Kimball • Data Vault is on the rise • Latest kid on the block is the “Anchor Modeling”
  • 42. Kimball Design Source Data Source Data Source Data Source Data Mart Mart Mart Data Warehouse ::= is the sum of all Data Marts and the conformed dimensions Conformed Dimensions
  • 43. Inmon Design Source Data Source Data Source Data Source Data “Enterprise” Data Warehouse Data Warehouse ::= THE corporate wide data model Datamarts ::= Subsets of the Data Warehouse Mart Mart
  • 44. DW – Still Two Philosophies KIMBALL Star Schema Specialized Models Model Once (Mart) User Friendly INMON Normal Forms One Model Model Twice (EDW/Mart) But, we agree: 1. There IS a model 2. It is relational(ish)
  • 45. Which way? • Inmon or Kimball ? • Both have pro and cons • Of course the difference between the two is not only limited to the Data Warehouse definition! • Why not both?  • Avoid religion wars and take the best of both worlds
  • 46. Facts about Normalizing • It is expensive to • Join (especially between large tables) • Maintain referential integrity • Build query plans • It is very hard to • Get consistently good query plans • Make users understand >=3NF data • Write the right query • This is why we are careful about normalizing warehouses!
  • 47. DW – Choose your side. Or not? • Why not have an hybrid solution? • Take the best from both world • Inmon DW that generates Kimball DMs • Solution will grow and evolve to its final design • Agility is the key: it has to be engineered into the solution • Emergent Design • https://en.wikipedia.org/wiki/Emergent_Design
  • 48. Kimball approach…with an accent • On average, Kimball approach is the most used • Easy to understand • Easy to use • Efficient • Well supported by tools • Well known • But the idea of having one physical DWH is very good • Again, the advice is not to be too rigid • Be willing to mix the things and move from one to another • Be «Adaptive»  • My «Perfect Solution» is one that evolves towards an Inmon Data Warehouse used to generate Kimball Data Marts
  • 49. Data Vault? • Modeling technique often associated to Agile BI. • That’s a myth  Agility is not in the model, remember? • Introduces the concepts of “Hubs”, “Link” and “Satellites” to split keys from their dependent values • Optimized to keep history, not for query performances • At the end of the day, it will maps to Dimensions and Facts
  • 50. A model is forever? • Surely not! • We’re going to use ANY model that will fit our needs. • We’ll start with the Kimball+Inmon mix • But always present a Dimensional Model to the end user • Behind the scenes we can make the model evolve to anything we need. Data Vault, 100% Inmon….whatever 
  • 51. Data Warehouse Performances • Data Warehouse may need specific hardware or software to work at best • Due to huge amount of data • Due to complex queries • Why this happens? • Data is usually stored with the highest level of detail in order to allow any kind of analysis • User usually needs aggregated data • Several specific solutions (logical and physical) • Using RDBMS or a mixture of technologies
  • 52. Data Warehouse Performances • Solutions built to support • very fast reading of huge amount of data • analyzing data from multiple perspectives • easy querying & reporting • pre-aggregate data • Specific technology • Online Analytical Processing (OLAP) Multi-dimensional database • Different storage flavors (MOLAP, ROLAP, HOLAP) • In-Memory Technology • Column-Store Approach
  • 53. Improving DW Performances • Hardware Solutions • Fast-Track • Parallel Data Warehouse / APS • Exadata • Teratadata • Netezza • Software Solutions • Multi-Dimensional Databases (Analysis Services, Cognos) • In-Memory Databases (Power Pivot, Qlikview…) • Column-Store Systems (SQL Server 2012+, Vertica, Greenplum)
  • 54. Hardware is a game changer! Screenshot Taken from a Fast Track DWH Cloud can offer good performance too (but not yet up to this…)
  • 55. Dimensional Modeling • Modeling a database schema using Facts and Dimension entities • Proposed and Documented by Kimball (mid-nineties) • Applicable both to Relational and Multidimensional database • SQL Server • Analysis Services • Focus on the end user
  • 56. Defining Facts • A fact is something happened • A product has been sold • A contract has been signed • A payment has been made • Facts contains measurable data • Product final price • Contract value • Paid Amount • The measurable data is called a Measure • Within the DWH, facts are stored in Fact Tables
  • 57. Defining Measures • Measures are usually Additive • Make sense to sum up measure values • E.g.: money amount, quantity, etc. • Semi-Additive data exists • Data that cannot be summed up • E.g.: Account balance • Tools may have specific support for semi-additive measures
  • 58. Defining Dimensions • Dimensions define how facts can be analyzed • Provide a meaning to the fact • Categorize and classify the fact • E.g.: Customer, Date, Product, etc. • Dimensions have Attributes • Attributes are the building block of a Dimensions • E.g.: Customer Name, Customer Surname, Product Color, etc. • Within the DWH, Dimensions are stored in Dimension Tables • Dimension Members are the values stored in Dimensions
  • 59. Dimensional Modeling • Dimension Modeling come in two flavors • Star Schema • Snowflake Schema • Star Schema • Dimensions have direct relationship with fact tables • Snowflake Schema • Dimension may have an indirect relationship with fact fables
  • 62. Star Schema • Pros • Easy to understand and to query • Offers very good performances • Well supported by SQL Engines (e.g.: Star-Join optimization) • Cons • May require a lot of space • Make dimension update and maintenance harder • Somehow rigid
  • 63. Snowflake Schema • Pros • Less duplicate data • Easier dimension update • Flexibility • Cons • (Much) More complex to understand • (Much) More complex to query • In turn this means: more resource-hungry, slower, expensive
  • 64. Snowflake or Star schema? • Feel free to design the Data Warehouse as you prefer, but present a Star Schema to OLAP engine or to the End User • Views will protect end-users from model complexity • Views will guarantee that you can have all the flexibility you need to properly model your data • Views will allow to make changes in future (e.g.: moving from Star to Snowflake) • If in doubt, start with the Star Schema • Is usually the preferred solution • So start with this one, you can always change your mind later • Remember, we embrace changes 
  • 65. Understand fact granularity • Before doing physical design • Understand facts granularity • Understand if and how historical data should be preserved • Granularity is the level of detail • Granularity has to be agreed with SME and Decision Makers • Data should be stored at the highest granularity • Aggregation will be done later • Must be defined both for facts and dimensions
  • 66. Deal with changes in dimension data • Two options: • Keep only the last value • Keep all the values • Kimball has defined specific terminology • “Slowly Changing Dimension” • Kind of Architectural Pattern (well known, universally recognized) • Three type of SCD • 1, 2 and 3  • Mix of them
  • 67. SCD Type 1 • Update all data to last value • Use Cases • Correct erroneous data • Make the past look like the present situation • E.g.: A Business Unit changed its name
  • 68. SCD Type 2 • Preserve all past values • Use Cases • Keep the information known at the time the fact occurred • Avoid inconsistent analysis
  • 69. SCD Type 3 • Preserve only the last valid value before the current (“previous” values) • Use Cases • I’ve never seen it in use 
  • 70. Other well-known objects • Junk Dimensions • Generic Attributes that do no belong to any specific dimension • They are grouped in only one dimension in order to avoid to have too many dimensions, since this may “scare” final user • Degenerate Dimensions • Dimension generated from the fact table • E.g.: Invoice Number
  • 71. Fact Table Types • Kimball has defined two main types • Transactional • Snapshot • Again, kind of Architectural Pattern (well known, universally recognized) • We proposed a new fact table type at PASS Summit 2011 • Temporal Snapshot • http://www.slideshare.net/davidemauri/temporal-snapshot-fact-tables
  • 72. Transactional Fact Table • Used to store «Transactional Data» • Sales • Invoices • Quantities • Each row represent an event happened in a specific point in time
  • 73. Snapshot Fact Table • Useful when you need to store inventory/stock/quotes data • Data that is *not* additive • Store the entire situation of a precise point in time • «Picture of the moment» • Expensive in terms of data usage • Usually snapshot are at week level or above (months / semester etc.) • Thought Column-Oriented storage can help a lot here
  • 74. Temporal Snapshot Fact Table • New approach to store snapshot data without doing snapshots • Each rows doesn’t represent a point in time but a time interval • It seems easy but it’s a completely new way to approach the problem • Bring Temporal Database theory into Data Warehousing • Free PDF Book online: http://www.cs.arizona.edu/people/rts/tdbbook.pdf
  • 75. Temporal Snapshot Fact Table • Allows the user to have daily (or even hourly) snapshot of data • Avoids data explosion • Look in the • PASS 2011 DVDs, SQL Bits 11 website (shorter version), SlideShare (shorter version)
  • 76. Many to Many relationships • How to manage M:N relationships between dimensions? • e.g.: Books and Authors • An additional table is (still) needed • The table will not hold facts (in the BI meaning) • Hence it will be a “factless” table • Or – better – a Bridge table • The OLAP engine must support such modeling approach
  • 77. Bridge / Factless Tables • Bookstore sample: • The bridge table (usually) doesn’t contain facts…so it’s a factless table. It’s only used to store M:N relationship. • In really it could happen that a fact table also act as a bridge/factless table Sales Fact Table Book Dimension Author Dimension Sales Factless (Bridge) Table
  • 78. Generic Modeling Best Practices • Don’t create too many dimensions • Keep It Super Simple • If you have a lot of attributes in a dimension and some are SCD1 and some SCD2 it may make sense to split the dimension in two • If a dimension become huge (>1M rows) its worth to analyze how to split it into two or more dimensions • Keep security in mind right from the very first steps • Since this may require you to change the way you model your Data Warehouse
  • 80. Architecture is well known • We now have «architectural» elements of a BI solution • Inmon / Kimball / Other • Star Schema / Snowflake Schema • Facts & Dimensions • In some specific cases we also have well-known «Design Pattern» • Slowly Changing Dimensions
  • 81. Implementation is problematic • So, from an architectural point of view, we can be happy. But from the implementation standpoint, what we can say? • Each time we have to start from scratch • Every person has its own way to implement the architectural solutions adopted • The quality of the implementation is directly proportional to the experience of the implementer
  • 82. Time lost in low-value work • You lose a lot of time in implementing “technical” stuff. Time that is subtracted from the identification of the optimal resolution to the business problem • Ex: load an SCD type 2. How much you’ll spend on its development? • From 2 days to 10 days depending on the experience that you have • An a minimum of 2 days is still there • Since there are no standard implementation rules, each one applies its own • That works, but everyone is different
  • 83. Choices • In the development of a BI solution you will need to make a lot choices in terms of architecture and implementation • Every choice we make brings pros and cons • It will impact the future of the solution • How do you choose? Who chooses? Why? All the people in the team are able to make autonomous choices? • How can you be sure that all those choices do not conflict with each other? • Especially when performed by different people?
  • 84. Reaching the goal - 1 • This is the situation • Everyone follows his own path • It will be better to work in harmony … • …with common rules Target
  • 85. DW is a TeamWork • Problems arises when the team is made of several people • One work well alone • «Geniuses» (or geniuses-wannabe ) work well together • We need to do a “exceptional” job with “normal” people. Smart and willing but “normal” • Must be "guaranteed" a minimum quality regardless of who does the work • It must be easy to "scale" the number of people at work • It must be easy to replace a person • It’s vital to allow people to do what they do best: to give added value to the solution. The "monkey work" should be as small as possible.
  • 86. Software Engineering for BI • «Software Engineering is the application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software, and the study of these approaches; that is, the application of engineering to software” IEEE Computer Society
  • 87. With clear and well defined rules… • We’d like to have this! • So, we need to formally define our rules for work Target
  • 88. Objectives • What are the objectives we want to set? • It must be possible to "change our mind" during development (and thus being independent of the initial architectural choices) • Each person must be able to solve the given problem in a personal way, but the implementation of the solution should be made following a common path • Careless mistakes and errors due repetitive processes should be minimized • It must be possible to parallelize and (when possible) to automate the work • The solution must be testable • It must have rigidity and flexibility at the same time • It should be “adaptive”!
  • 89. Achieve a common goal • Everything must be designed to achieve a common goal: • Spend more time to find the best solution to the business problem • Spend (much) less time to implement the solution • making as few mistakes as possible • preventing common mistakes • In other words, take the best from each player on the field • Men -> Added value: Intelligence • Machine -> Added value: Automation
  • 90. Engineering The Solution • A set of rules that defines • Naming Convention • Mandatory Objects / Attributes • Standard implementations of solutions to common problems • Dependencies between objects • Best practices and development methodology • Each and every rules has purpose to • Prevent Errors • Set a Standard • Assure Maintainability • Help Team Scale-Out • Let developer concentrate more on solving the business problem and less on the implementation
  • 91. Engineering The Solution • All rules presented here are born from real-world experience • Following the Agile Principle of Simplicity • Metadata are embedded in the rules • Sometimes this bring to some ugly solutions… • …if you want to avoid this, external files/documents MUST be maintained
  • 92. Building the Data Warehouse
  • 93. Engineering The Solution • A BI Solution has three main layers • Producers • Coordinators • Consumers • Producers Layer • Contains all the data sources • Coordinators Layer • Contains all objects that process source data into a Data Warehouse • Consumers Layers • Where Data Warehouse data is consumed
  • 94. Engineering The Solution • A BI solution can be thought as made of 3 different layers • Data flows from and only from lower levels to higher levels • Higher levels doesn’t know how data is managed in lower levels • (Information Hiding Principle) Producers Coordinators Consumers
  • 95. Databases • Core • Configuration • Staging • Data Warehouse • Optional (recommended) • Helper • Support • Log • Metadata OLTP SYS 1 OLTP SYS 2 Helper 1 Helper 2 Staging Data Warehouse Configuration MetadataLog
  • 96. Engineering The Solution OLTP SYS 1 OLTP SYS 2 Helper 1 Helper 2 Staging Data Warehouse Configuration Cub e Repor ts Producer Coordinators Consumers
  • 97. Databases • Helper • Contains object that permits to access the data from the OLTP database. OLTP SYS 1 OLTP SYS 2 Helper 1 Helper 2 Staging Data Warehouse Configuration
  • 98. Databases • Staging • Contains intermediate “volatile” data • Contains ETL procedures and support objects (like err tables) OLTP SYS 1 OLTP SYS 2 Helper 1 Helper 2 Staging Data Warehouse Configuration
  • 99. Databases • Configuration • objects that add additional value to the data (e.g.: lookup tables) • objects that allows the BI solution to be configurable, like, for which company load data OLTP SYS 1 OLTP SYS 2 Helper 1 Helper 2 Staging Data Warehouse Configuration
  • 100. Databases • Data Warehouse • The final data store OLTP SYS 1 OLTP SYS 2 Helper 1 Helper 2 Staging Data Warehouse Configuration
  • 101. Databases • Metadata • Contains all the information needed to automate the creation and the loading of • Staging • Data Warehouse • Log • Guess? 
  • 102. Databases • Naming Convention: • projectname_* • * = CFG, LOG, STG, DWH, MD, HLP • Databases Files • STG & DWH databases MUST be created with 2 filegroups (at least) • PRIMARY (system catalogs), • SECONDARY (all other table). This is the default filegroup • Strongly recommended also for other databases
  • 103. Schemas • Schemas helps to • create logical boundaries • distinguish objects scopes • Several Schemas used to identify the different scopes • stg, etl, cfg, dwh, tmp, bi, err, olap, rpt • optional “util” schema to store utility objects • eg: fn_Nums, a function to generate numbers • A schema (generally) cannot be used in more than one database • Prevents careless mistakes
  • 105. Views • Views are the key of abstraction • Shields higher levels from the complexity of underlying levels • Used throughout the entire solution to reduce “friction” between layers and objects • Apply the “Information Hiding Principle” (helps to have teams that work in parallel) • Helps to auto-document the solution
  • 106. Views • General Rules • Do basic data preparation in order to simplify SSIS package development • Casts • Column rename • Basic Data Filtering • Simple data normalization and cleansing • Join tables
  • 107. Stored Procedures • Their usage should be very very limited • The majority of ETL logic is in SSIS • Usage • Incremental Load/Management • SCD loading (MERGE) • Dummy member management • Additional abstraction that helps to avoid to change SSIS packages • for debugging (import one specific fact table row) • for optimizations (eg: query hints) • for ordering data
  • 108. Basic Concepts • Dimension will gather data from one or more data source • Dimension will holds key value of each source entity (if available) • The “Business Key”
  • 109. Basic Concepts • Business Key won’t be used to relate Dimension to Fact table • A surrogate key will be created during ETL phase • The surrogate key with be used to create the relationship • The Surrogate key has several advantages • Is meaningless • Is small • Is independent from the data source • Helps to make the fact table smaller
  • 110. Why Integer Keys are Better • Smaller row sizes • More rows/page = more compression • Faster to join • Faster in column stores
  • 111. Dimensions – Example • Data comes from three tables: Departments, SubDepartmens and Working Area (sample model from a Logistic company) Business Keys «Payload»Surrogate Key
  • 112. Dimensions – Key points • A dimension is (usually) created using data coming from master data or reference tables • OLTP PK/AK -> Business Key • Dimension PK will be artificial and surrogate
  • 113. SCD Type 1 • Scope • Update data to last value • Implementation • UPDATE
  • 114. SCD Type 2 • Scope • Keep the all the past values and the current ones • Implementation • Row Valid Time + UPDATE + INSERT
  • 115. SCD Type 3 • Scope • Keep the current value and the one before that only • Implementation • Specific Columns + UPDATE
  • 116. SCD Key vs BK • We defined the SCD Key as the key used to lookup dimension data while loading the fact table • It may be not made by *ALL* BK • It’s an ALTERNATE KEY (and thus is UNIQUE)
  • 117. Hierarchies • In our sample the dimension also holds a (natural) hiearchy • Department > Subdepartment > Working Area
  • 118. Things to keep in mind • Huge dimension (>1M members) • Evaluate to split it in two • Dimension with SCD1+SCD2 attributes • Evaluate to split it in two • Security: keep it in mind from the beginning since it may be a painful process if done after
  • 119. Dimensions Rules • Dimensions has to be created in • Database: DWH - Schema: dwh • Table rules • Name: dim_<plural_dimension_name> • Dimension key: id_<table_name> • Surrogate / Artificial Key • Business Key: prefixed by bk_ • Additional mandatory columns • last_update (datetime) or log id (int) • scd1_checksum / scd2_checksum • only one or both, depending on scd usage
  • 120. Dimensions Dummy Values • Add at least one «dummy» value • To represent a “not available” data • Dummy value rules • Dimension key: negative number • Business Key: NULL • Fixed values for text and numeric data • Text: “N/A” or “Not Available” • Choose appropriate terms if more than on dummy exists • Numeric: NULL
  • 121. Date Dimension • Date Dimension is an exception • Key (id_dim_date) is not meaningless • Integer Data Type • Format: yyyymmdd • This allows easier queries on the fact table and usage of negative dummy values for dummy members • Eg: Unknown Date, Erroneous Date, Invalid Date • Don’t need last_update and scd_checksum mandatory columns
  • 122. Time Dimension • Time Dimension is also exception • Key (id_dim_time) is not meaningless • Integer Data Type • Format: hhmmss • Don’t need last_update and scd_checksum mandatory columns • If not mandatory Drill-Down, Date & Time should be two separate Dimensions
  • 123. Fact Tables • More than one table may exists within the same DW solution • Different Granularity? Different Fact Table! • It’s only important that they all use the same dimensions • where applicable • Example: Product Sales and Product Costs • This allows to make coherent queries
  • 124. Transactional Fact Table • «total_amount» can just be summed up to get aggregated values for all possible combination of dimension values
  • 125. Snapshot Fact Table • All data is stored for each snapshot taken. • «Snapshot Date» Mandatory for almost all analysis
  • 126. Temporal Snapshot Fact Table • Each row represent an interval (max one year wide) 12 6 Underlying interval: 20090701->20090920
  • 127. Temporal Snapshot Fact Table • Some real-world usage • Using Temporal Fact • 148.380.542 Rows that uses 13 GB • Without this technique we would have had • 11.733.038.614 Rows that would have used 1TB of data • This just for one month. So for one year we would have more than 10TB of data.
  • 128. Fact Tables • Fact Tables has to be created in • Database: DWH - Schema: dwh • Table rules • Table: fact_<plural_fact_name> • Fact key: id_[fact]_<table_name> • Additional mandatory columns • insert_time (datetime) or log id (int) • Foreign Key to Dimensions: not needed • Put into fact table the business key columns of the source OLTP table to ease debugging and error checking • If BK are not too big  • Business Key: prefixed by bk_
  • 129. Factless/Bridge Tables • Factless/Bridge Tables has to be created in • Database: DWH - Schema: dwh • Table rules • Table: factless_<plural_table_name> • Factless key: not needed • Foreign Key to Dimensions: not needed • Additional mandatory columns • insert_time (datetime) or log id (int)
  • 130. The DW Query Pattern SELECT foo [..n], <aggregate>(something) FROM dwh.fact F JOIN dwh.dim_a A ON F.id_a = A.id_a JOIN dwh.dim_b B ON F.id_b = B.id_b WHERE <filter> GROUP BY foo [..n]
  • 131. The expected Relational Query Plan Partial Aggregate Fact CSI Scan Dim Scan Dim Seek Batch Build Batch Build Hash Join Hash Join Has h Stream Aggregate
  • 132. Loading the Data Warehouse?
  • 133. Loading the Data Warehouse • Loading the DWH means doing ETL • Extract data from data sources • Databases, Files, Web Services, etc. • Transform extracted data so that • It can be cleansed and verified • It can be enriched with additional data • It can be placed into a star-schema • Load data into the Data Warehouse
  • 134. Loading the Data Warehouse • ETL is usually the most complex and long phase • roughly 80% of the entire work is done here • Integration Services is the engine we use to do ETL • Very very fast • Completely In-Memory • 64 bits aware • Very scalable
  • 135. Loading the Data Warehouse • SSIS does NOT substitute T-SQL • T-SQL and set based-operations are still faster • When possible avoid working on per-row basis but favor «set-based» operations • Just keep in mind that you have to deal with the t-log • They are complementary work together • T-SQL: ideal for “simple” set-oriented data manipulation • SSIS: ideal for complex, multi-stage, data manipulation • Advanced scripting through SSIS Expression or .NET
  • 136. Loading the Data Warehouse • Integration Services and T-SQL plays the major role here • .NET help may be needed from time to time for complex transformations • Our objective: create an ETL solution such in a way is almost auto- documented • It should be possible to understand what ETL do, just «reading» the SSIS Packages • Following the KISS principle, avoid to mix ETL logic • “Simple” ETL logic in views • “Complex” ETL logic in SSIS Packages
  • 137. Loading the Data Warehouse • SSIS will NEVER load data directly from a table • ALWAYS go through a view • View will decrease complexity of package and make it loosely coupled with the database schema • This will make SSIS development easier • Simple filtering changes or joins can be changed here without having to touch SSIS • SSIS Package are like applications! • Only one exception to this rule will be seen in loading Fact and Dimension tables • Exception is made since there is a case where using a view will not decrease complexity
  • 138. Divide et Impera • To be able to be Agile is *vital* to keep business and technical process completely separated • Business Process: ETL logic that can be applied only to the specific solution you’re building • Technical Process: ETL logic that can be used with any Data Warehouse and that can be highly automated
  • 139. Divide et Impera • Follow the “Divide et Impera” principle • Move data from OLTP to Staging • Move data from Staging to Data Warehouse • Create at least two different SSIS solutions • One to load the Staging Database • One to load the Data Warehouse Database
  • 140. Divide et Impera STG ETLETL OLTP DWH ETL Technical Process Business Process Technical Process
  • 141. Loading the Data Warehouse – Step 1 OLTP STGExtract & Load Views HLP Other Data Sources
  • 142. Loading the Data Warehouse – Step 1 • First step is to load data into staging database • From Data Sources • NO “Transformation” here, just load data as is • In other words, create a copy of OLTP data used in the BI solution • Total or Partial in case of Incremental Load • This will make us free to do complex ETL queries without interfering with production systems • Only filter data that by definition should not be handled by BI solution • Sample or Test data
  • 143. The “Helper” database • Create views to expose the data that will be used to build the DWH • Views are simple “SELECT columns FROM…” statements • no data transformation allowed • no casts, no column renaming, no data cleansing • only filter out data that should never, ever be imported into the DWH • e.g.: customer id 999, which is the “test customer” • Views have to be put in the bi schema (see the sketch below)
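A minimal sketch of such a helper view, assuming a hypothetical dbo.Orders source table and the “test customer” 999 mentioned above:

-- Helper view in the bi schema: a plain SELECT, no casts, no renaming,
-- only filters out data that must never reach the DWH (the test customer)
CREATE VIEW bi.orders
AS
SELECT OrderID, CustomerID, OrderDate, Amount
FROM dbo.Orders
WHERE CustomerID <> 999;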
  • 144. Loading the Data Warehouse – Step 2 (diagram): within the STG database, SSIS uses ETL Views and Stored Procedures together with the TMP, ERR and CFG areas to transform the data.
  • 145. Loading the Data Warehouse – Step 2 • The second step is to transform the data so that it can be loaded into the Data Warehouse • “Transform” can be a complex duty • Transform = Cleanse, Check, De-Duplicate, Correct • Data may have to go through several transformations in order to reach its final shape • Intermediate values never leave the staging database • This is where you’ll spend most of your time
  • 146. The “Configuration” database • “Configuration” data • Data not available elsewhere • E.g.: lookup tables of “Well-Known” values • E.g.: C1 -> Company 1, C2 -> Company 2 • Tables used to hold “configuration” data • Use the cfg schema
  • 147. The “Staging” Database • Contains a copy of the OLTP data • Only the needed data, of course  • Copying data is fast. This allows us to avoid using the OLTP database for too long • Avoids concurrency problems • All further work is done on the BI server and won’t affect OLTP performance • Data from the OLTP data sources has to be copied into staging tables • tables must have the same schema as the OLTP tables • staging tables have to be created in the staging schema
  • 148. The “Staging” Database • Contains intermediate tables used to transform the data • Favor the usage of several intermediate tables (even if you’ll use more space) instead of doing everything in memory with SSIS • This will make debugging/troubleshooting much easier! • The right balance for how many intermediate tables are needed has to be found on a per-project basis
  • 149. The “Staging” Database • Tables used to hold data coming from files • E.g.: Excel, Flat Files • Use the etl schema • Tables used to hold intermediate data • Use the tmp schema • Objects used in the ETL phase • Views, Stored Procedures, User-Defined Functions, etc. • All these objects must be placed in the etl schema
  • 150. The “Staging” Database • Views prepare the data to be further processed by SSIS • SSIS reads data only from views • Source view naming convention • vw_<logical_name> • E.g.: etl.vw_claims • Destination table naming convention • <logical_name> • E.g.: tmp.claims • If the ETL has to be done in more than one step • append the «step_number» to the object name • E.g.: etl.vw_claims_step_1, tmp.claims_step_1
  • 151. The “Staging” Database • Views take care of creating a “logical” view of the dimension or fact data • rename columns to give them a human-understandable meaning • CAST data types to make them consistent with the ones used in the DWH • perform basic data filtering and data re-organization • e.g.: flatten hierarchies to “n” columns, trim white spaces • perform basic ETL logic • CASE statements, ROW_NUMBER, Joins, etc. (see the sketch below)
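A minimal sketch of an etl view of this kind (source table and column names are hypothetical):

-- etl view: rename, CAST, trim and apply basic ETL logic on staged data
CREATE VIEW etl.vw_claims
AS
SELECT
    CAST(c.ClaimNo AS varchar(20))            AS claim_bk,
    LTRIM(RTRIM(c.CustName))                  AS customer_name,
    CAST(c.ClaimDate AS date)                 AS claim_date,
    CASE c.StatusCode WHEN 'O' THEN 'Open'
                      WHEN 'C' THEN 'Closed'
                      ELSE 'Unknown' END      AS claim_status,
    ROW_NUMBER() OVER (PARTITION BY c.ClaimNo
                       ORDER BY c.LastUpdate DESC) AS rn   -- keep only the latest version downstream
FROM stg.Claims AS c;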
  • 152. The “Staging” Database • ETL Stored procedures are used only to manage dimension loading (SCD 1 or 2) and Dummy Members: • Naming convention: • etl.stp_merge_dim_<dimension target> • etl.stp_add_dummy_dim_<dimension target>
  • 153. The “Staging” Database • The err schema contains tables that hold rows with errors that cannot be corrected or ignored (rows that cannot be processed) • For example: you have a temporal database and for some rows you find that “Valid To” comes before “Valid From” • This data can later be exposed to SMEs so they can fix it • It is interesting to note that already in the middle of development the BI solution becomes useful • It helps to increase data quality
  • 154. Loading the Data Warehouse – Step 3 (diagram): SSIS loads the DWH database from the STG database, reading through Views and Stored Procedures.
  • 155. Loading the Data Warehouse – Step 3 • The third step is the loading of the Data Warehouse • Very simple: just take the transformed data from the staging database and put it into Facts and Dimensions • Load all dimensions • Generate dimension IDs • Load fact tables • “Just” convert business keys to dimension IDs (see the sketch below) • Not so easy  • Must handle incremental loading • Mandatory for dimensions (otherwise reloaded data may end up with different dimension IDs) • Would be nice for facts too • More complex when you have «early arriving facts»/«late arriving dimensions»
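A minimal sketch of the “convert business keys to dimension IDs” step. All names are hypothetical, a dummy member with id = -1 is assumed for unmatched business keys (early arriving facts), and for brevity the SCD2 dimension is joined on its current row rather than on the validity interval of the fact date:

-- Look up surrogate keys via the business keys; unmatched rows fall back to the dummy member (-1)
INSERT INTO dwh.fact_orders (id_customer, id_product, id_date, order_amount)
SELECT
    ISNULL(dc.id_customer, -1),
    ISNULL(dp.id_product, -1),
    ISNULL(dd.id_date, -1),
    o.order_amount
FROM tmp.orders AS o
LEFT JOIN dwh.dim_customer AS dc ON dc.customer_bk = o.customer_bk AND dc.is_current = 1
LEFT JOIN dwh.dim_product  AS dp ON dp.product_bk  = o.product_bk
LEFT JOIN dwh.dim_date     AS dd ON dd.date_bk     = o.order_date;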
  • 156. Handling Dimension Keys • Mapping the Source Dimension Keys (the BK) to the surrogate Dimension ID may be more complex than expected. You may encounter several key «pathologies» • Composite Keys, Zombie Keys, Multi Keys, Dolly Keys • A good way to solve the problem is to add an additional abstraction layer, using mapping tables • Thomas Kejser has some very good posts on that here • http://blog.kejser.org/tag/keys/
  • 157. The “Data Warehouse” database • DWH database must contain only • tables related to the dwh fact, factless and dimensions • all tables must be in the dwh schema • Views to allow access to physical tables • use specific schemas to expose data to other tools • use olap schema for views used by SSAS • use rpt schema for views used by SSRS • Add your own schema depending on the technology you use • Or even create a Data Mart out of the Data Warehouse!
  • 158. The “Data Warehouse” database • Stored Procedures • If needed for reporting purposes, they must be put into the reporting schema • No other use is allowed
  • 159. The “Data Warehouse” database • Dimension loading • Always incremental • With all the rules in place there is only one way to load them  • Of course there may be differences on a per-dimension basis • But it is just like building a house. No two houses are identical, yet all are built following the same rules • This means that it can be completely automated!
  • 160. The “Data Warehouse” database • Fact table loading • Incremental would be nice • But it may not be an easy task • SQL Server 2008 CDC on the source can help a lot • Sometimes just dropping and re-loading the facts is the most effective solution • Rarely for the entire table • More common with time-partitioning • FAST load of fact tables (see the sketch below): • Drop and re-create indexes • Remove Compression and add it back later • Load Partitions in Parallel • A tool to automate partitioned table management exists  • SQL CAT Partition Management Tool
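A minimal sketch of the drop/re-create index approach around a bulk load, assuming a hypothetical dwh.fact_orders table with a SQL Server 2012-style (read-only) nonclustered columnstore index; on a real partitioned table you would typically load a staging table and switch the partition in instead:

-- 1. Drop the columnstore index before the bulk load
DROP INDEX ncci_fact_orders ON dwh.fact_orders;

-- 2. Bulk load the transformed rows; TABLOCK enables minimal logging when the prerequisites are met
INSERT INTO dwh.fact_orders WITH (TABLOCK)
    (id_customer, id_product, id_date, order_amount)
SELECT id_customer, id_product, id_date, order_amount
FROM etl.vw_fact_orders;

-- 3. Re-create the columnstore index (and, if used, compression) after the load
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_fact_orders
    ON dwh.fact_orders (id_customer, id_product, id_date, order_amount);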
  • 161. Improving DW Querying Performance • Use ColumnStore Indexes to speed up queries against the DW (if you’re not using other additional solutions) • Try to keep Factless/Bridge table as small as possible. A Whitepaper details how to implement a «proprietary» compression that works extremely well: • http://www.microsoft.com/en-us/download/details.aspx?id=137
  • 162. Tools that help • Use the Multiple Hash component to calculate hash values • http://ssismhash.codeplex.com/ • When looking up a SCD2 dimension, try to avoid the default Lookup transformation since it does not support FULL cache in this scenario. Matt Masson has a very good post on how to implement «Range Lookups» • http://bit.ly/SSISRangeLookup
  • 163. Integration Services Rules • Avoid the usage of the OLEDB Command in the Data Flow • It’s just too slow, prefer a set-based solution • Try to do as many transformations/operations here as possible, NOT in SSAS or SSRS • In other words: avoid spreading the ETL process all around • Always read from views • Use of OPTION(RECOMPILE) is encouraged so that we can get optimal plans • Except for the Dimension-loading Lookup component • (It doesn’t help to lower complexity)
  • 164. Integration Services Rules • Package Naming Convention • Use the “setup_” prefix for all packages that contain logic that must be run first in order to be able to load data • Use the “load_” prefix for all packages that load data into “final” tables • E.g.: staging tables, dwh tables • Use the “prepare_” prefix for all packages that transform data in order to make it usable by another transformation phase • E.g.: tmp tables • Use a sequence number (###) • To group all independent packages • To quickly identify package dependencies
  • 165. Integration Services Rules - Staging (example) • load_DFKKKO, load_DFKKOP, load_BUT000, load_<xxxxxxxx>: independent from each other, can be run simultaneously • prepare_010_orders, prepare_010_customers: independent from each other, can be run simultaneously, but work on data loaded by the “load_” packages • prepare_020_invoices, prepare_020_orders: independent from each other, can be run simultaneously, but work on data produced by the previous “prepare_” packages
  • 166. Integration Services Rules - DWH • load_dim_time, load_dim_customers, load_dim_products, load_dim_categories, load_dim_geography • load_fact_orders, load_fact_invoices, load_fact_costs • load_factless_products_categories • First load all Dimensions, then load all Facts, then load all Factless tables
  • 167. Integration Services Rules • One “action” per package! • With SQL Server 2012+ use Shared Connections and the «Project» deployment model • Use one or more “Master Packages” to execute packages in the correct sequence / with the right parallelism • With previous versions, try to make sure that all packages of the same layer (STG or DWH) use the same connection managers • In this way you can have a single configuration file for connections when running packages • Don’t worry too much about logging • SQL Server 2012+ has native support • http://ssis-dashboard.azurewebsites.net/ • If using SQL Server 2005 or 2008/R2 use DTLoggedExec • http://dtloggedexec.codeplex.com/
  • 168. Building a DWH in 2013 • Is still an (almost) manual process • A *lot* of repetitive, low-value work • No (or very few) standard tools available
  • 169. How it should be • A semi-automatic process • “develop by intent” • Define the mapping logic from a semantic perspective • Source to Dimensions / Measures • (Metadata anyone?) • Design the model and let the tool build it for you:
CREATE DIMENSION Customer FROM SourceCustomerTable MAP USING CustomerMetadata
ALTER DIMENSION Customers ADD ATTRIBUTE LoyaltyLevel AS TYPE 1
CREATE FACT Orders FROM SourceOrdersTable MAP USING OrdersMetadata
ALTER FACT Orders ADD DIMENSION Customer
  • 170. The perfect BI process & architecture Iterative!
  • 171. Invest in Automation? • Faster development • Reduced Costs • Embrace Changes • Fewer bugs • Increased solution quality, consistent throughout the whole product
  • 172. Automation Pre-Requisites • Split the process into two separate types of processes • What can be automated • What can NOT be automated • Create and impose a set of rules that defines • How to solve common technical problems • How to implement the identified solutions
  • 173. No Monkey Work! Let the people think and let the machines do the «monkey» work.
  • 174. Design Pattern “A general reusable solution to a commonly occurring problem within a given context”
  • 175. Design Pattern • Generic ETL Patterns • Partition Load • Incremental/Differential Load • Generic BI Design Patterns • Slowly Changing Dimension • SCD1, SCD2, etc. • Fact Table • Transactional, Snapshot, Temporal Snapshot
  • 176. Design Pattern • Specific SQL Server Patterns • Change Data Capture • Change Tracking • Partition Load • SSIS Parallelism
  • 177. Engineering the DWH • “Software Engineering allows and requires the formalization of the software building and maintenance process.”
  • 178. Sample Rules • Always add a «last_update» column • Always log Inserted/Updated/Deleted rows to the log.load_info table • Use FNV1a64 for checksums • Use views to expose data • Dimension & Fact views MUST use the same column names for lookup columns • (a sketch of the logging rule follows)
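A minimal sketch of the “log Inserted/Updated/Deleted rows” rule applied to a SCD1-style dimension load. The log.load_info table, the dimension columns and the pre-computed row_checksum column (e.g. produced by the FNV1a64 rule above or by the SSIS Multiple Hash component) are all assumptions:

DECLARE @actions TABLE (action_taken nvarchar(10));

-- SCD1 merge of the prepared dimension rows into the DWH
MERGE dwh.dim_customer AS tgt
USING etl.vw_dim_customer AS src
    ON tgt.customer_bk = src.customer_bk
WHEN MATCHED AND tgt.row_checksum <> src.row_checksum THEN
    UPDATE SET tgt.customer_name = src.customer_name,
               tgt.row_checksum  = src.row_checksum,
               tgt.last_update   = SYSUTCDATETIME()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (customer_bk, customer_name, row_checksum, last_update)
    VALUES (src.customer_bk, src.customer_name, src.row_checksum, SYSUTCDATETIME())
OUTPUT $action INTO @actions;

-- Log how many rows were inserted/updated/deleted by this load
INSERT INTO log.load_info (table_name, inserted_rows, updated_rows, deleted_rows, load_date)
SELECT 'dwh.dim_customer',
       SUM(CASE action_taken WHEN 'INSERT' THEN 1 ELSE 0 END),
       SUM(CASE action_taken WHEN 'UPDATE' THEN 1 ELSE 0 END),
       SUM(CASE action_taken WHEN 'DELETE' THEN 1 ELSE 0 END),
       SYSUTCDATETIME()
FROM @actions;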
  • 179. Engineering the DWH There are two intrinsic processes hidden in the development of a BI solution that must be allowed (or forced) to emerge.
  • 180. Business Process • Data manipulation, transformation, enrichment & cleansing logic • Specific to every customer. Almost impossible to automate
  • 181. Technical Process • Application of data extraction and loading techniques • Recurring (pattern) in any solution • Highly Automatable
  • 182. Hi-Level Vision (diagram): OLTP → ETL (Technical Process: Extract) → STG → ETL (Business Process: Transform) → ETL (Technical Process: Load) → DWH
  • 183. ETL Phases • «E» and «L» must be • Simple, Easy and Straightforward • Completely Automated • Completely Reusable • «E» and «L» have ZERO value in a BI Solution • Should be done in the most economic way
  • 185. Source Incremental Load E In this scenario, “ID” is an IDENTITY/SEQUENCE, probably a PK (see the sketch below).
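A minimal sketch of this incremental extract, assuming a hypothetical dbo.Orders source table with an ever-increasing IDENTITY key and a staging table that already holds the previously loaded rows; in SSIS the SELECT part would be the source query of the data flow:

-- Extract only the rows whose ID is greater than the highest one already staged
DECLARE @last_id int = (SELECT ISNULL(MAX(OrderID), 0) FROM stg.Orders);

INSERT INTO stg.Orders (OrderID, CustomerID, OrderDate, Amount)
SELECT OrderID, CustomerID, OrderDate, Amount
FROM dbo.Orders
WHERE OrderID > @last_id;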
  • 186. Source Differential Load/1 E In this scenario the source table doesn’t offer any specific way to understand what has changed.
  • 187. Source Differential Load/2 E In this scenario the source table has a TimeStamp-like column.
  • 188. Source Differential Load • SQL Server 2012 has features that can help with incremental/differential load • Change Data Capture • Natively supported in SSIS 2012 • http://www.mattmasson.com/2011/12/cdc-in-ssis-for-sql-server-2012-2/ • Change Tracking • An underused feature in BI… not as rich as CDC, but MUCH simpler and easier (see the sketch below) E
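A minimal sketch of a Change Tracking based differential extract, assuming Change Tracking is enabled on the database and on a hypothetical dbo.Orders table, and that the last synchronized version is persisted in a hypothetical cfg.extract_watermark table between loads:

-- Read only the rows changed since the last extraction
DECLARE @last_sync_version bigint =
    (SELECT sync_version FROM cfg.extract_watermark WHERE table_name = 'dbo.Orders');

SELECT o.OrderID, o.CustomerID, o.OrderDate, o.Amount,
       ct.SYS_CHANGE_OPERATION          -- I / U / D
FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct
LEFT JOIN dbo.Orders AS o ON o.OrderID = ct.OrderID
ORDER BY ct.SYS_CHANGE_VERSION;

-- Persist the new watermark for the next run
UPDATE cfg.extract_watermark
SET sync_version = CHANGE_TRACKING_CURRENT_VERSION()
WHERE table_name = 'dbo.Orders';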
  • 189. SCD 1 & SCD 2 L (flowchart): Start → Lookup the Dimension Id and MD5 Checksum from the Business Key → Calculate the MD5 Checksum of the non-SCD-key columns → Is the Dimension Id NULL? Yes: insert the new members into the DWH; No: are the checksums different? Yes: store into a temp table → Merge the data from the temp table into the DWH → End
  • 190. SCD 2 Special Note • Merge => UPDATE the validity interval of the current row + INSERT a New Row (see the sketch below) L
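A minimal sketch of this “insert over merge” pattern with hypothetical names: the inner MERGE closes the interval of the changed current rows (and inserts brand new members), while its OUTPUT clause feeds the outer INSERT that creates the new version of each changed member:

INSERT INTO dwh.dim_customer (customer_bk, customer_name, valid_from, valid_to, is_current)
SELECT customer_bk, customer_name, load_date, '9999-12-31', 1
FROM (
    MERGE dwh.dim_customer AS tgt
    USING tmp.dim_customer AS src
        ON tgt.customer_bk = src.customer_bk AND tgt.is_current = 1
    WHEN MATCHED AND tgt.scd2_checksum <> src.scd2_checksum THEN
        UPDATE SET tgt.valid_to = src.load_date, tgt.is_current = 0    -- close the old interval
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (customer_bk, customer_name, valid_from, valid_to, is_current)
        VALUES (src.customer_bk, src.customer_name, src.load_date, '9999-12-31', 1)
    OUTPUT $action, src.customer_bk, src.customer_name, src.load_date
) AS changes (action_taken, customer_bk, customer_name, load_date)
WHERE action_taken = 'UPDATE';    -- only the updated members need a brand new row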
  • 193. Parallel Load • Logically split the work into several steps • E.g.: Load/Process one customer at a time • Create a «queue» table that stores the information for each step • Step 1 -> Load Customer «A» • Step 2 -> Load Customer «B» • Create a Package that • Picks the first step not already picked up • Does the work • Goes back to pick the next step • Call the Package «n» times simultaneously (see the sketch below) EL
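A minimal sketch of the queue-pick step, assuming a hypothetical etl.load_queue table; the READPAST hint lets the «n» concurrent package instances skip rows already locked by the others:

-- Atomically pick the next pending work item; concurrent callers skip locked rows
UPDATE TOP (1) q
SET    q.status    = 'RUNNING',
       q.picked_at = SYSUTCDATETIME()
OUTPUT inserted.queue_id, inserted.customer_code
FROM   etl.load_queue AS q WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE  q.status = 'PENDING';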
  • 194. Other SSIS Specific Patterns • Range Lookup • Not natively supported • Matt Masson has the answer in his blog  • http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range- lookups.aspx
  • 195. Metadata • Provides context information • Which columns are used to build/feed a Dimension? • Which columns are Business Keys? • Which table is the Fact Table? • How are Facts and Dimensions connected? • Which columns are used?
  • 196. How to manage Metadata? • Naming Convention • Specific, Ad Hoc Database or Tables • JSON • Other (XML, Files, etc.)
  • 197. Naming Convention • The easiest and cheapest • No additional (hidden) costs • No need to be maintained • Never out-of-sync • No documentation need • Actually, it IS PART of the documentation • Imposes a Standard • Very limited in terms of flexibility and usage
  • 198. Extended Properties • Support most metadata needs • No additional software needed • Very verbose to use • Developing a wrapper to make usage simpler is feasible and encouraged (see the example below)
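For instance, a business-key column could be tagged and read back like this (the property name, values and stg.Customers table are just an illustrative convention to define yourself):

-- Tag a column as the business key of the Customer dimension
EXEC sys.sp_addextendedproperty
     @name = N'BI_Role', @value = N'BusinessKey:DimCustomer',
     @level0type = N'SCHEMA', @level0name = N'stg',
     @level1type = N'TABLE',  @level1name = N'Customers',
     @level2type = N'COLUMN', @level2name = N'CustomerCode';

-- Read the metadata back (e.g. from a package generator)
SELECT objname, value
FROM sys.fn_listextendedproperty(N'BI_Role', N'SCHEMA', N'stg', N'TABLE', N'Customers', N'COLUMN', NULL);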
  • 199. Metadata Objects • Dedicated Ad-Hoc Database and Tables • As flexible as you need • Maintenance overhead to keep metadata in sync with the data • The development of an automatic check procedure is needed • DMVs can help a lot here • Need a GUI to make them user-friendly
  • 200. JSON • Could be expensive to keep in sync • A tool is needed, otherwise there is too much manual work • User and Developer Friendly! • VERY flexible • If it gets too complex, JSON.NET Schema may help • Supported by Visual Studio • And by SQL Server 2016 (see the sketch below)
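With SQL Server 2016, JSON metadata can be parsed directly in T-SQL via OPENJSON; a minimal sketch with a made-up mapping document (the attribute names and roles are purely illustrative):

DECLARE @mapping nvarchar(max) = N'[
  {"source":"stg.Customers", "column":"CustomerCode", "role":"BusinessKey",    "target":"dwh.dim_customer"},
  {"source":"stg.Customers", "column":"CustomerName", "role":"SCD2Attribute",  "target":"dwh.dim_customer"}
]';

-- Shred the JSON mapping into a relational shape usable by a package generator
SELECT source_table, source_column, bi_role, target_table
FROM OPENJSON(@mapping)
WITH (
    source_table  nvarchar(128) '$.source',
    source_column nvarchar(128) '$.column',
    bi_role       nvarchar(50)  '$.role',
    target_table  nvarchar(128) '$.target'
);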
  • 201. Automation Scenarios • Run-Time: «Auto-Configuring» Packages • Really hard to customize packages • SSIS limitations must be managed • E.g.: a Data Flow cannot be changed at runtime • On-the-fly creation of packages may be needed • Design-Time: Package Generators / Package Templates • Easy to customize the generated packages
  • 202. Automation Solutions • Specific Tool/frameworks • BIML / MIST • SQL Server Platform • SQL, PowerShell, .NET • SMO, AMO
  • 203. Package Generators • Required Assemblies • Microsoft.SqlServer.ManagedDTS • Microsoft.SqlServer.DTSRuntimeWrap • Microsoft.SqlServer.DTSPipelineWrap • Path: • C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies
  • 204. Useful Resources • «STOCK» Tasks: • http://msdn.microsoft.com/en-us/library/ms135956.aspx • How to set Task properties at runtime: • http://technet.microsoft.com/en- us/library/microsoft.sqlserver.dts.runtime.executables.add.aspx
  • 205. BIML – BI Markup Language • Developed by Varigence • http://www.varigence.com • http://bimlscript.com/ • MIST: BIML Full-Featured IDE • Free via BIDS Helper • Support “limited” to SSIS package generation • http://bidshelper.codeplex.com
  • 206. Testing the Data Warehouse
  • 207. Data Warehouse Unit Test • Before releasing anything, the data in the DW must be tested. • The user has to validate a sample of the data • (e.g.: the total invoice amount of January 2012) • That validated value becomes the reference value • Before each release, the same query is executed again. If the result matches the expected reference value the test is green, otherwise the test fails
  • 208. Data Warehouse Unit Test • Of course tests MUST be automated when possible • Visual Studio • BI.Quality (on CodePlex… now old) • Based on NUnit • NBi is the new way to go: http://www.nbi.io/ ! • Based on NUnit • What to test? • Structures • Aggregated results • Specific values of some «special» rules • Fixed bugs/tickets • Values in the various layers (see the sketch below)
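A minimal sketch of such a check in plain T-SQL (in practice the same assertion would typically be expressed as an NBi test), assuming a hypothetical test.reference_values table that stores the values validated by the users and the fact/dimension names used earlier:

-- Compare the current aggregate with the user-validated reference value
DECLARE @expected money =
    (SELECT reference_value
     FROM test.reference_values
     WHERE test_name = 'Total invoice amount - January 2012');

DECLARE @actual money =
    (SELECT SUM(f.invoice_amount)
     FROM dwh.fact_invoices AS f
     JOIN dwh.dim_date AS d ON f.id_date = d.id_date
     WHERE d.calendar_year = 2012 AND d.calendar_month = 1);

IF @actual IS NULL OR @actual <> @expected
    THROW 50000, N'Data unit test failed: the total invoice amount for January 2012 does not match the reference value.', 1;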
  • 210. Modern Data Environment (diagram): Structured and Unstructured Data feed Big Data and Master Data stores, the EDW and Data Marts of the BI Environment, and the Analytics Environment used by Data Scientists and Decision Makers.
  • 211. Modern Data Environment - Details (diagram): sources (Files, Web Services, Cloud / Syndicated feeds, RDBMS, Master Data) are Extracted into an Archive / Big Data area and a Staging area (with Archive & Replay); Facts and Dimensions are Standardised, Transformed, Aggregated and Copied into Cubes, a V-Mart and Marts, which are then Secured / Exposed.
  • 212. Inside The Data Warehouse (diagram): SSIS reads the source tables through bi.* views and loads the stg.* tables; the etl.* objects, config.* tables and tmp.* tables feed a second SSIS layer that loads the dwh.* tables, which are exposed through olap.* views (Analysis) and report.* views (Reporting).
  • 213. After the Data Warehouse
  • 214. What’s Next? • Now that the DW is ready, any tool can be used to create a BI/Reporting solution on solid, simpler, user-friendly ground. • Reporting • Reporting Services / Business Objects / MicroStrategy / JasperReports • Analysis • Analysis Services, Cognos • Power Pivot, QlikView, Tableau, Power BI
  • 216. A Starting Point • The presented content can be used as-is, or as a starting point to build your own framework • Extend the content when it doesn’t fit your solution (for example: add additional databases, like «SYSCFG», if that helps you) • Define your rules! Drive the tools and don’t be driven by them! • Keep the layers separated and favor loose coupling (less «friction» to changes) • Spread the idea of Unit Testing Data even if at the beginning it seems an expensive approach.
  • 217. Real World Samples • The presented content comes from on-the-field experience • More than 40 (successful) projects using the proposed approach • More than 2000 packages managed (biggest solution: 572 packages) • Several teams involved (biggest team: 12 people) • Several customers grew their own standards starting from this • Data coming from ANY source: SAP, Dynamics, DB2, Text or Excel Files
  • 218. Some challenges faced • Changed an entire accounting system, moving from one vendor to another • The DWH and the OLAP/Reporting solution were left completely untouched; 2/3 of the budget saved • Started with a full load only and added incremental load later • Less than 5% of the Extract and Load logic changed (Transformations untouched) • Created a solution in 3 months with a minimal set of features, which then evolved and grew into an enterprise data warehouse / BI solution. • Monthly Delivery. • Never released bad data (helped to correct errors in the source systems) • Helped an enterprise company reduce the time spent crunching data by 66%.
  • 219. Latest challenges faced • Supported a *big* electronics retail company in creating their BI/DSS solution on top of their shiny new Dynamics CRM installation. • During the CRM development. • The first specification document for reporting was very “agile”… • “What do you need?”: “Don’t know, but all of it”

Editor's notes

  1. 10.00-12.00        1st slot 2h (finish at slide 78) 12.00-13.00        lunch break 13.00-15.00        2nd slot 2h 15.00-15.30        coffee break 15.30-17.30        3rd slot 2h (Demo)
  2. DATA alone is not enough. It’s like a raw material. It has to be processed in order to become INFORMATION, that will drive to extract and acquire KNOWLEDGE and ultimately allows people to take DECISIONS.
  3. OLTP samples: ecommerce website, SAP, CRM, ERP, and so on Usually OLTP database are tied to a specific business purpose
  4. Querying an OLTP database to analyze data and trends may not be a good idea OLTP database is complex Queries that analyzes data are complex and will slow down your production system OLTP database schema may change unexpectedly All needed data may not be available in only one database Data can be updated at any time, making «point-in-time» queries unreliable
  5. “In a modern company, everyone is a Decision Maker.” Data Juice http://www.slideshare.net/davidemauri/data-juice
  6. http://www.forrester.com/Topic+Overview+Business+Intelligence/fulltext/-/E-RES39218 A Data Warehouse is needed no matter which technology you’ll decide to use for you BI/DSS solution, since it is the spine of it!
  7. Deliver Quickly: make BI a key asset for the company right from the beginning. The sooner people get data, the sooner they will learn more about their data. For example, it’s very easy to detect underestimated data quality or business process problems. BI can be a good help to start fixing and monitoring them, thus making the ROI tangible right from the start.
  8. JUDEF: Just Enough Design Upfront JITD: Just In Time Design
  9. Unit Testing is a key topic in BI!
  10. A little more detail on the sentence that states there’s a lack of “Universal Rules”. The meaning is that it makes no sense to ask whether “this entity has been modeled correctly”. The answer is that the entity – let’s say, the Customer – has been modeled correctly if and only if it allows all the analysis that the business needs to do, in an efficient, fast and error-free way. It’s not possible to say that modeling the Customer with two or three tables is better than using just one table. It depends on the business needs, the amount of work required to implement that entity, the “friction” that such a model introduces (and thus how much harder it makes changes), and so on.
  11. Easy to understand Easy to use Efficient Well supported by tools Well known
  12. On average the Kimball approach is the most used since it is: Easy to understand, Easy to use, Efficient, Well supported by tools, Well known. But the idea of having one physical DWH is very good. Again, the advice is not to be too rigid: be willing to mix things or move from one approach to another… Be «Adaptive»  My «Perfect Solution» is an Inmon Data Warehouse used to generate Kimball Data Marts. The solution will grow over time, so it may be created using one approach and then be modified to another as time passes, in order to better serve business requirements. The idea of «change» is not something that has to be fought, but something that has to be «embraced». The BI Solution must be able to accept changes.
  13. “analyzing data from multiple perspectives”: this also can be rephrased as «analyze data among all its possible categorizations»
  14. “One solution is to move away from the RDBMS for querying”: as usual this has Pros and Cons. Pros: an ad-hoc solution that gives the best performance; very easy to use for the final user (a Data Analyst). Cons: it is another technology for which people have to be trained in order to use it effectively; more complex to use for the developer. Another solution is to stay with the RDBMS but optimize it for this purpose (Indexed Views, Parallel Data Warehouse, Column Aligned Storage, …).
  15. Focus on the end user: make life easier for who has to query the data for analytics purposes
  16. Make dimension update and maintenance harder -> Due to denormalization. Somehow rigid -> Again, due to denormalization, it’s harder to update a dimension since there’s a lot of duplicate data that you have to deal with.
  17. SME = Subject Matter Experts
  18. The fact table contains the Book dimension Id If a book is written by many authors we cannot create additional rows in the fact table Otherwise we would not correctly model reality, and have wrong results
  19. Sometimes the whole is not made of the sum of the single elements.
  20. Keep security in mind right from the very first steps: we won’t go deep into security problems in this workshop but it’s very important to understand what kind of security requirements you have to follow
  21. Underline that the mentioned points are exactly what’s needed to make a team work using an Agile approach
  22. Information Hiding Principle: http://en.wikipedia.org/wiki/Information_hiding
  23. Configuration: contains configuration objects – objects that add additional value to the data (e.g.: lookup tables) and objects that allow the BI solution to be configurable, like for which company to load data. Staging: contains intermediate “volatile” data, plus ETL procedures and support objects (like err tables). Data Warehouse: the final data store. Helper: contains the objects that access the data from the OLTP database.
  24. The dimension contains all the possible valid combinations of values in the three tables.
  25. Type 3 is never used in reality.
  26. “A hierarchy is a natural hierarchy when each attribute included in the user-defined hierarchy has a one to many relationship with the attribute immediately below it” http://msdn.microsoft.com/en-us/library/ms174557.aspx
  27. Don’t create too many dimensions (<20). If you have a lot of attributes in a dimension and some are SCD1 and some SCD2 it may make sense to split the dimension in two. If a dimension becomes huge (>1M rows) it’s worth analyzing how to split it into two or more dimensions. Keep security in mind right from the very first steps, since this may require you to change the way you model your Data Warehouse: we won’t go deep into security problems in this workshop, but it’s very important to understand what kind of security requirements you have to follow.
  28. Product Sales and Product Costs: Shared dimensions: Product, Category. Non-shared dimensions: Customer. This allows, for example, calculating the gross margin.
  29. “Simple” means that you never need to use a temporary table to store intermediate data.
  30. ALWAYS go through a view: this can be read also as “Views PREPARE data to be used by SSIS”
  31. Other Data Source => Excel, Flat Files, Web Services, ecc…
  32. Or even create a Data Mart out of the Data Warehouse: Maybe you need to have specific aggregations or add specific data used only by one department
  33. http://chartporn.org/2012/05/10/repetitive-tasks/
  34. http://en.wikipedia.org/wiki/Software_design_pattern
  35. http://en.wikipedia.org/wiki/Software_design_pattern
  36. http://en.wikipedia.org/wiki/Software_design_pattern
  37. Matt Masson Blog: http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx