Scalable Machine Learning: The Role of Stratified Data Sharding
Data Warehouse and SSAS Terms
1. About Presenter
Karan Gulati is a SQL Server Analysis Services Maestro (MCM) who has worked
as a Support Escalation Engineer at Microsoft for the last five years. He
currently focuses on SQL BI and SQL PDW. He is a very active blogger and has
contributed to multiple whitepapers published on MSDN or TechNet. He has
also written tools that are available on CodePlex.
0 Karan Gulati (SSAS Maestro)
2. Data Warehousing Concepts
Overview of Data Warehousing and Analysis Services terms
3. What are we covering
Understanding the terms used in the SSAS and data warehousing world:
• What is a Data Warehouse
• OLAP
• Cube
• Measures
• Dimensions
• Schema
• Star
• Snow-Flake
• Surrogate Keys
• Slowly Changing Dimensions
• SCD1
• SCD2
• SCD3
4. Data Warehousing
A data warehouse is a general structure for storing the data needed for good
BI (Business Intelligence).
Data in a warehouse is of little use until it is converted into the information
that decision makers need.
The large relational databases, typical of data warehouses, need additional
help to convert the data into information.
5. Why Use OLAP?
Provides fast and interactive access to aggregated data and the ability to drill
down to detail.
Lets users view and interrogate large volumes of data (often millions of rows)
by pre-aggregating the information.
Puts the data needed to make strategic decisions directly into the hands of the
decision makers, not only through pre-defined queries and reports but also by
giving end users the ability to perform their own ad hoc queries, minimizing
users' dependence on database developers.
6. OLAP Secret
It leverages existing data from a relational schema or data warehouse (data
source) by placing key performance indicators (measures) into context
(dimensions).
Once processed into a multidimensional database (cube), all of the measures
are pre-aggregated, which makes data retrieval significantly faster.
The processed cube can then be made available to business users who can
browse the data using a variety of tools, making ad hoc analysis an interactive
and analytical process rather than a development effort.
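The effect of pre-aggregation can be sketched in a few lines of Python. This is only an illustration of the idea, not how Analysis Services stores a cube; the rows, the dimension names (route, year) and the helper process_cube are made up for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (route, year, packages).
fact_rows = [
    ("air", 2009, 10),
    ("air", 2010, 5),
    ("sea", 2009, 7),
    ("sea", 2010, 3),
    ("air", 2010, 2),
]

def process_cube(rows):
    """Pre-aggregate the measure for every combination of dimension levels."""
    cube = defaultdict(int)
    for route, year, packages in rows:
        # "ALL" stands for the rolled-up (total) level of a dimension.
        for r in (route, "ALL"):
            for y in (year, "ALL"):
                cube[(r, y)] += packages
    return dict(cube)

cube = process_cube(fact_rows)

# After processing, a query is a lookup rather than a scan of the fact rows.
print(cube[("air", "ALL")])   # packages shipped by air, all years
print(cube[("ALL", 2010)])    # packages shipped in 2010, all routes
```

The cost of summing is paid once, at processing time, which is why browsing the processed cube feels interactive.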
SQL Server 2005's BI Workbench substantially improves upon SQL Server
2000's BI capability.
7. SQL BI Tools
The SQL Server BI Workbench suite consists of five basic tools:
• SQL Server Relational Database: used to create relational databases
• Analysis Services: used to create multidimensional models
(measures, dimensions and schema)
• Integration Services (formerly Data Transformation Services, DTS): used to
extract, transform and load data from the source(s) into the data warehouse
or schema
• Reporting Services: used to build and manage enterprise reporting over
relational or multidimensional sources
• Data Mining: used to extract information based on predetermined
algorithms
9. What is a Cube?
A collection of one or more related measure groups and
their associated dimensions
10. Cube Example
Consider the following Imports cube. It contains:
Two measures:
Packages
Last
Three related dimensions:
Route
Source
Time
11. Elements of Cubes
Measures
Dimensions
Schema
Star
Snowflake
12. Measures
Measures are the key performance indicators that you want to evaluate.
To determine which of the numbers in the data might be measures, here is a
rule of thumb:
If a number makes sense when it is aggregated, then it is a measure.
13. Dimensions
Dimensions are the categories of data analysis.
Here is the rule of thumb:
When a report is requested "by" something, that something is
usually a dimension.
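The two rules of thumb can be tried out on a small example: the number that is summed is the measure, and the column a report is requested "by" is the dimension. The rows and the helper total_by below are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical rows: region and product are dimensions, amount is a measure.
sales = [
    {"region": "East", "product": "Widget", "amount": 100.0},
    {"region": "East", "product": "Gadget", "amount": 50.0},
    {"region": "West", "product": "Widget", "amount": 70.0},
]

def total_by(rows, dimension, measure):
    """Aggregate a measure 'by' a dimension, e.g. amount by region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)

amount_by_region = total_by(sales, "region", "amount")
amount_by_product = total_by(sales, "product", "amount")
print(amount_by_region)
print(amount_by_product)
```

Summing amount gives a meaningful number, so it passes the measure test. Summing region makes no sense, but a report "by region" does, so region is a dimension.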
14. Schema
The methodology for arranging your fact and master (dimension) tables:
Star Schema
Snow-Flake Schema
15. Star Schema
The figure shows a basic star schema, with the dimension tables arranged
around a central fact table that contains the measures. A fact table contains a
column for each measure as well as a column for each dimension. Each
dimension column has a foreign-key relationship to the related dimension
table, and the dimension columns taken together are the key to the fact table.
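A minimal sketch of such a star schema, using Python's built-in sqlite3 module as a stand-in for the relational source. The table names, column names and rows are assumptions loosely modelled on the Imports cube example, not an actual sample database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per "point" of the star.
CREATE TABLE dim_route  (route_key  INTEGER PRIMARY KEY, route_name  TEXT);
CREATE TABLE dim_source (source_key INTEGER PRIMARY KEY, source_name TEXT);
CREATE TABLE dim_time   (time_key   INTEGER PRIMARY KEY, year        INTEGER);

-- Fact table: a foreign-key column per dimension plus a column per measure.
-- The dimension columns taken together form the key of the fact table.
CREATE TABLE fact_imports (
    route_key  INTEGER REFERENCES dim_route(route_key),
    source_key INTEGER REFERENCES dim_source(source_key),
    time_key   INTEGER REFERENCES dim_time(time_key),
    packages   INTEGER,
    PRIMARY KEY (route_key, source_key, time_key)
);

INSERT INTO dim_route  VALUES (1, 'air'), (2, 'sea');
INSERT INTO dim_source VALUES (1, 'Africa'), (2, 'Asia');
INSERT INTO dim_time   VALUES (1, 2009), (2, 2010);
INSERT INTO fact_imports VALUES
    (1, 1, 1, 10), (1, 2, 2, 5), (2, 1, 1, 7), (2, 2, 2, 3);
""")

# One join per dimension is enough to report a measure "by" that dimension.
packages_by_route = conn.execute("""
    SELECT r.route_name, SUM(f.packages)
    FROM fact_imports AS f
    JOIN dim_route AS r ON r.route_key = f.route_key
    GROUP BY r.route_name
    ORDER BY r.route_name
""").fetchall()
print(packages_by_route)
```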
16. Snowflake
Normalizing each of the dimension tables so that there are many joins for
each dimension results in a Snowflake Schema.
It is called a Snowflake Schema because the “points” of the star get broken
up into little branches that look like a snowflake.
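As a rough sketch of what the extra normalization costs, here is a hypothetical product dimension split into product and category tables, again using sqlite3. All names and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized: category lives in its own table.
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);

INSERT INTO dim_category VALUES (1, 'Bikes'), (2, 'Accessories');
INSERT INTO dim_product  VALUES
    (1, 'Road Bike', 1), (2, 'Helmet', 2), (3, 'Mountain Bike', 1);
INSERT INTO fact_sales VALUES (1, 500.0), (2, 40.0), (3, 700.0), (2, 35.0);
""")

# Reporting by category now needs two joins: fact -> product -> category.
amount_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_key  = f.product_key
    JOIN dim_category AS c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(amount_by_category)
```

In a star schema the category name would simply be a column of the product table, so the same report would need only one join.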
17. Which Schema works for you?
Good question:
It all depends on your requirements. A star schema is simpler to understand and
manage than a snowflake schema, but in the real world you can't always fit
everything into one table, so some normalization is needed.
18. Surrogate Keys
Also known as:
Meaningless keys
Substitute keys
Non-natural keys
Artificial keys
A surrogate key is a unique value, usually an integer,
assigned to each row in the dimension. This surrogate key
becomes the primary key of the dimension table and is used
to join the dimension to the associated foreign key field in
the fact table.
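One way to picture surrogate key assignment is the small sketch below. The Dimension class, the customer codes and the names are hypothetical; a real warehouse would typically use an identity or sequence column instead.

```python
from itertools import count

class Dimension:
    """Toy dimension table that hands out integer surrogate keys."""

    def __init__(self):
        self._next_key = count(1)    # 1, 2, 3, ...
        self._key_by_natural = {}    # natural (business) key -> surrogate key
        self.rows = {}               # surrogate key -> row attributes

    def get_or_add(self, natural_key, **attributes):
        """Return the surrogate key for a natural key, adding a row if needed."""
        if natural_key not in self._key_by_natural:
            surrogate = next(self._next_key)
            self._key_by_natural[natural_key] = surrogate
            self.rows[surrogate] = {"natural_key": natural_key, **attributes}
        return self._key_by_natural[natural_key]

customers = Dimension()
k1 = customers.get_or_add("CUST-0042", name="Contoso")
k2 = customers.get_or_add("CUST-0007", name="Fabrikam")
k3 = customers.get_or_add("CUST-0042", name="Contoso")  # already known

# Fact rows reference the dimension only through the surrogate key.
fact_sales = [
    {"customer_key": k1, "amount": 250.0},
    {"customer_key": k2, "amount": 90.0},
]
print(k1, k2, k3)
```

The surrogate key carries no business meaning, so the natural key can change without touching the fact table.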
19. What's the Benefit of Surrogate Keys?
A surrogate key is a unique value, usually an integer, assigned
to each row in the dimension. This surrogate key becomes
the primary key of the dimension table and is used to join
the dimension to the associated foreign key field in the fact
table.
Surrogate keys help in maintaining history in the case of Slowly
Changing Dimensions.
20. Slowly Changing Dimensions
There are three types of SCD:
SCD 1
The Type 1 methodology overwrites old data with new data, and therefore
does not track historical data at all. This is most appropriate when correcting
certain types of data errors, such as the spelling of a name. (Assuming you
won't ever need to know how it used to be misspelled in the past)
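A Type 1 change amounts to an in-place overwrite. The sketch below assumes a hypothetical supplier dimension keyed by surrogate key; the supplier data and the helper apply_type1 are illustrative only.

```python
# Hypothetical supplier dimension: surrogate key -> attributes.
supplier_dim = {
    123: {"supplier_code": "ABC", "name": "Acme Suply Co", "state": "CA"},
}

def apply_type1(dim, surrogate_key, **changes):
    """SCD Type 1: overwrite the attributes in place; no history is kept."""
    dim[surrogate_key].update(changes)

# Correct the misspelled name. The old spelling is gone afterwards.
apply_type1(supplier_dim, 123, name="Acme Supply Co")
print(supplier_dim[123])
```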
21. So, What's the Disadvantage of SCD1?
The obvious disadvantage to this method of managing SCDs is that there is no
historical record kept in the data warehouse. You can't tell if your suppliers are
tending to move to the Midwest, for example. The advantage is that Type 1
dimensions are very easy to maintain.
22. SCD 2
The Type 2 method tracks historical data by creating multiple records in the
dimensional tables with separate keys. With Type 2, we have unlimited history
preservation as a new record is inserted each time a change is made.
In the same example, if the supplier moves to Illinois, the table would look like
this:
Another popular method for tuple versioning is to add effective date columns.
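A Type 2 change can be sketched as follows, combining a new surrogate key with effective date columns. The supplier row, the dates, the column names and the helpers apply_type2 and state_on are all assumptions made for the example.

```python
from datetime import date

# Hypothetical supplier dimension; end_date None marks the current version.
supplier_dim = [
    {"supplier_key": 123, "supplier_code": "ABC", "state": "CA",
     "start_date": date(2000, 1, 1), "end_date": None},
]

def apply_type2(dim, supplier_code, new_state, change_date):
    """SCD Type 2: close the current row and insert a new one with a new key."""
    current = next(r for r in dim
                   if r["supplier_code"] == supplier_code and r["end_date"] is None)
    current["end_date"] = change_date
    new_key = max(r["supplier_key"] for r in dim) + 1
    dim.append({"supplier_key": new_key, "supplier_code": supplier_code,
                "state": new_state, "start_date": change_date, "end_date": None})
    return new_key

def state_on(dim, supplier_code, day):
    """Look up the state that was valid for a supplier on a given day."""
    for r in dim:
        if (r["supplier_code"] == supplier_code and r["start_date"] <= day
                and (r["end_date"] is None or day < r["end_date"])):
            return r["state"]
    return None

new_key = apply_type2(supplier_dim, "ABC", "IL", date(2004, 12, 22))
print(new_key, state_on(supplier_dim, "ABC", date(2003, 1, 1)),
      state_on(supplier_dim, "ABC", date(2005, 1, 1)))
```

Facts loaded before the move keep pointing at the old surrogate key, so history stays attached to the old state.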
23. SCD 3
The Type 3 method tracks changes using separate columns. Whereas Type 2 had
unlimited history preservation, Type 3 has limited history preservation, as it's
limited to the number of columns we designate for storing historical data. Where
the original table structure in Type 1 and Type 2 was very similar, Type 3 will add
additional columns to the tables:
Note: Type 3 keeps separate columns for both the old and new
attribute values—sometimes called “alternate realities.” In our
experience, Type 3 is less common because it involves changing the
physical tables and is not very scalable.
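A Type 3 change can be pictured as below, with one extra column holding the previous value. The row, the column names (previous_state, current_state, effective_date) and the helper apply_type3 are hypothetical.

```python
from datetime import date

# Hypothetical supplier row with one extra column for the previous state.
supplier = {
    "supplier_key": 123,
    "supplier_code": "ABC",
    "previous_state": None,
    "current_state": "CA",
    "effective_date": date(2000, 1, 1),
}

def apply_type3(row, new_state, change_date):
    """SCD Type 3: shift the current value into the 'previous' column."""
    row["previous_state"] = row["current_state"]
    row["current_state"] = new_state
    row["effective_date"] = change_date

apply_type3(supplier, "IL", date(2004, 12, 22))
apply_type3(supplier, "NY", date(2008, 2, 4))

# Only one previous value fits: the original "CA" is no longer recorded.
print(supplier["previous_state"], supplier["current_state"])
```

Keeping more history would mean adding more columns, which is the scalability limit the note describes.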
24. Slowly Changing Dimension
You can use SSIS or T-SQL to implement SCDs in a data warehouse.
Here is a reference:
http://blogs.msdn.com/b/karang/archive/2010/09/29/slowly-changing-dimension-using-ssis.aspx
It is useful because the natural primary key (e.g. Customer Number in the Customer table) can change, and this makes updates more difficult.
Example: On Jan 1 2010, Emp A belongs to Dept1, and whatever sales this employee makes are added to Dept1. On June 1 2010, Emp A moves to Dept2. All his new sales should be added to Dept2 from that day onwards, and the old ones should still belong to Dept1.
If in this case we had used the business key (the primary key as stated in the RDBMS) within the data warehouse, everything would be allocated to Dept2, even what actually belongs to Dept1.
If you use surrogate keys, you could create a new record for Employee 'A' in your Employee dimension on June 1, with a new surrogate key. This way, the old data in your fact table (before June) has the SID of the employee 'E1' + 'Dept1', and all new data (after June) takes the SID of the employee 'E1' + 'Dept2'.
Key Points: