2. Mountains of Data
Organizations have lots of data
◦ ERP, CRM, Portal…
Data is not in a form that is useful to decision-makers
◦ Not easy to review
◦ Neither informative nor insightful
BASQUANG@HOTMAIL.COM 2
4. Data Consolidation Solution
Transaction Layer: MRP, CRM, SCM, Finance
Shared Data Layer: Data Warehouse (Customers, Sales, Procurement, Suppliers, Operations, Finance)
Shared Reporting
5. Business Intelligence System
A BI system is a solution for gathering data from multiple sources, transforming that data so
that it is consistent and stored in a single location, and presenting the information to you for
analysis and decision making.
6. Business Intelligence Process
Information Gathering
Data Sources
Data Processing
Data Integration
Analysis & Production
Report Creation
Directing & Planning
Analytic Groups
Consumers' Requirements
Dashboards, Reports, Charts…
7. Data Sources
Staging Area
Manual Cleansing
Data Marts
Data Warehouse
Client Access
1: Clients need access to data
2: Clients may access data sources directly
3: Data sources can be mirrored/replicated to reduce contention
4: The data warehouse manages data for analyzing and reporting
5: Data warehouse is periodically populated from data sources
6: Staging areas may simplify the data warehouse population
7: Manual cleansing may be required to cleanse dirty data
8: Clients use various tools to query the data warehouse
9: Delivering BI enables a process of continuous business improvement
8. SQL Server BI Structure
Data Source Layer
◦ Text, MS Excel, MS Access, MS SQL, Oracle, … (external sources)
Data Transformation Layer
◦ 1. Extract the data from the multiple sources
◦ 2. Modify the data to make it consistent
◦ 3. Load the data into the data storage system
Data Storage and Retrieval Layer
◦ Data Warehouse in RDBMS
Analytical Layer
◦ Turn data into information (analysis)
◦ Multidimensional OLAP Database
Presentation Layer
◦ Reporting and Visualization Tools (Dashboard, KPI, Scorecard, …)
9. Microsoft Business Intelligence Platform
Data Warehouse, Data Marts, Operational Data (SQL Server 2008 R2/Oracle/DB2, Sybase…)
Integrate (SQL Integration Services)
Analyze (SQL Analysis Services)
Report (SQL Reporting Services)
Portal (SharePoint)
Scorecards, Analytics, Planning (PerformancePoint Service)
Report Builder, SSRS
End-user Analysis (Excel)
Office
SQL Infrastructure
Platform
Data Delivery
Analytic
Applications
13. Data Integration in Real World
1. Extract data from sources
2. Cleanse & Transform
3. Load data into data warehouse
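The three steps can be sketched in a few lines of Python. This is only an illustration of the extract, cleanse/transform, and load sequence, not how SSIS implements it. The source rows, the function names, and the use of an in-memory SQLite database as the "warehouse" are all assumptions made for the example.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from different source systems.
# Labels and formats are inconsistent on purpose.
RAW_ROWS = [
    {"state": "or", "product": "Hitch Rack", "units": "1", "dollars": "$120.00"},
    {"state": "OR ", "product": " Hitch Rack", "units": "2", "dollars": "240"},
    {"state": "wa", "product": "Hitch Rack", "units": "4", "dollars": "$480.00"},
    {"state": "wa", "product": "Hitch Rack", "units": "", "dollars": "n/a"},  # dirty row
]


def extract():
    """Extract: read rows from the sources (here, an in-memory list)."""
    return list(RAW_ROWS)


def transform(rows):
    """Cleanse and transform: normalize labels, convert types, skip bad rows."""
    clean = []
    for row in rows:
        try:
            units = int(row["units"])
            dollars = float(row["dollars"].replace("$", ""))
        except ValueError:
            continue  # a real process might route this row to manual cleansing
        clean.append((row["state"].strip().upper(), row["product"].strip(), units, dollars))
    return clean


def load(conn, rows):
    """Load: write the consistent rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS FactSales "
        "(State TEXT, Product TEXT, UnitsSold INTEGER, SalesDollars REAL)"
    )
    conn.executemany("INSERT INTO FactSales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute(
    "SELECT State, SUM(UnitsSold) FROM FactSales GROUP BY State ORDER BY State"
).fetchall()
print(loaded)
```

The dirty row is dropped during the transform step, so only three rows reach the warehouse table.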
14. SSIS Architecture
SQL Server Integration Services (SSIS) service
SSIS object model
Two distinct engines:
◦ Runtime engine (control flow)
◦ Data flow engine
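To make the control flow idea concrete, the sketch below orders a set of tasks by precedence constraints. It does not use the SSIS object model. The task names and the `precedence` mapping are hypothetical, and Python's standard `graphlib` module stands in for the runtime engine's scheduling.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical package: each task maps to the set of tasks that must finish first.
precedence = {
    "Extract sales": set(),
    "Extract products": set(),
    "Cleanse and transform": {"Extract sales", "Extract products"},
    "Load warehouse": {"Cleanse and transform"},
}

executed = []


def run_task(name):
    """Stand-in for executing a task; it only records the execution order."""
    executed.append(name)


# static_order() yields every task after all of its predecessors.
for task in TopologicalSorter(precedence).static_order():
    run_task(task)

print(executed)
```

The two extract tasks have no constraint between them, so either may run first. The transform and load tasks always follow in that order.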
15. SSIS Architecture
SSIS Designer
◦ Graphical tool to create and maintain Integration Services packages.
Integration Services Runtime
◦ Saves the layout of packages, runs packages, and provides support
for logging, breakpoints, configuration, connections, and
transactions.
Tasks and other executables:
◦ The Integration Services run-time executables are the package,
containers, tasks, and event handlers
17. SSIS Architecture
Object Model
◦ Allows you to create custom components for use in packages
Integration Services Service
◦ Lets you monitor running Integration Services packages and
manage the storage of packages.
23. Measure and Metadata
Measure: A summarizable numerical value
◦ Sales Dollars, Shipment Units,...
Metadata: Data about data
◦ Label, Order by,...
Example (Adventure Works Sales): the value 7070 is the measure; its label, Units Sold, is metadata
24. Units Sold by Product and Month report
Product Jan 2011 Feb 2011 Mar 2011 Apr 2011
Mountain-500 Black, 40 1 3 1 2
Mountain-500 Black, 44 2 1
Mountain-500 Black, 48 1 2 1
Mountain-500 Silver, 40 1 2 1
Mountain-500 Silver, 44 1 1 1
Mountain-500 Silver, 48 2
Road-750 Black, 44 10 7
Road-750 Black, 48 5 9
Hitch Rack 1 6 6 3
25. Grouping-Aggregating, Attribute-Member
Grouping and aggregating is the way humans deal with too much detail
◦ Ex: group Products by model, subcategory, and category
Attributes: Product (key), Model, Color, Size
Members
◦ Model: Mountain-500, Road-750…
◦ Color: Black, Silver
◦ Size: 40, 44, 48
Product                   Model         Color   Size
Mountain-500 Black, 40    Mountain-500  Black   40
Mountain-500 Black, 44    Mountain-500  Black   44
Mountain-500 Black, 48    Mountain-500  Black   48
Mountain-500 Silver, 40   Mountain-500  Silver  40
Mountain-500 Silver, 44   Mountain-500  Silver  44
Mountain-500 Silver, 48   Mountain-500  Silver  48
Road-750 Black, 44        Road-750      Black   44
Road-750 Black, 48        Road-750      Black   48
Hitch Rack                Hitch Rack
Products with model name, color, and size attributes
26. Hierarchy: Model -> Product
Jan 2011 Feb 2011 Mar 2011 Apr 2011
Mountain-500 3 8 6 6
Mountain-500 Black, 40 1 3 1 2
Mountain-500 Black, 44 2 1
Mountain-500 Black, 48 1 2 1
Mountain-500 Silver, 40 1 2 1
Mountain-500 Silver, 44 1 1 1
Mountain-500 Silver, 48 2
Road-750 15 16
Road-750 Black, 44 10 7
Road-750 Black, 48 5 9
Hitch Rack 1 6 6 3
Hitch Rack 1 6 6 3
Group Units Sold by Model, Product and Month
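A minimal Python sketch of this roll-up: product-level values are summed into their parent model. The flattened table does not preserve which months the two Road-750 values fall in, so placing them in Mar 2011 and Apr 2011 is an assumption. The names `MODEL_OF`, `UNITS`, and `roll_up` are made up for the example.

```python
from collections import defaultdict

# Product -> Model: the attribute relationship behind the hierarchy.
MODEL_OF = {
    "Road-750 Black, 44": "Road-750",
    "Road-750 Black, 48": "Road-750",
    "Hitch Rack": "Hitch Rack",
}

# (product, month) -> units sold. Month placement for Road-750 is assumed.
UNITS = {
    ("Road-750 Black, 44", "Mar 2011"): 10,
    ("Road-750 Black, 44", "Apr 2011"): 7,
    ("Road-750 Black, 48", "Mar 2011"): 5,
    ("Road-750 Black, 48", "Apr 2011"): 9,
    ("Hitch Rack", "Mar 2011"): 6,
    ("Hitch Rack", "Apr 2011"): 3,
}


def roll_up(units, parent_of):
    """Sum each member's values into its parent member, month by month."""
    totals = defaultdict(int)
    for (member, month), value in units.items():
        totals[(parent_of[member], month)] += value
    return dict(totals)


model_totals = roll_up(UNITS, MODEL_OF)
for (model, month), value in sorted(model_totals.items()):
    print(model, month, value)
```

The Road-750 totals come out as 15 and 16, matching the model row on the slide.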
27. Hierarchy
A hierarchy is created by arranging related attributes into levels
Number of levels: 2, 3, …, n
Hierarchy types:
◦ Balanced (e.g., Date)
◦ Unbalanced (e.g., Organization)
28. Dimensions
Jan 2011 Feb 2011 Mar 2011 Apr 2011
Mountain-500 3 8 6 6
Road-750 15 16
Hitch Rack 1 6 6 3
Units Sold by Model and Month
• Attribute:
– Model (3)
– Month (4)
• Potential number of values: 12 = 3x4
29. Dimensions
Jan 2011 Feb 2011 Mar 2011 Apr 2011
Units $ Units $ Units $ Units $
WA Hitch Rack 4 $480 3 $360 2 $240
Mountain-500 2 $1,105 6 $3,256 5 $2,775 5 $2,750
Road-750 9 $4,860 10 $5,400
OR Hitch Rack 2 $240 3 $360 1 $120
Mountain-500 1 $120 2 $1,105 1 $540 1 $540
Road-750 1 $565 6 $3,240 6 $3,240
• Attribute:
– State (2), Model (3), Month (4), Measure (2: Units sold, Sales dollars)
• Potential number of values: 2x3x4x2 = 48
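The count of potential values is the product of the member counts of each attribute. A short Python check, using the member lists from the report above (the dictionary layout is just for illustration):

```python
from itertools import product
from math import prod

# Members of each attribute, taken from the report above.
attributes = {
    "State": ["WA", "OR"],
    "Model": ["Hitch Rack", "Mountain-500", "Road-750"],
    "Month": ["Jan 2011", "Feb 2011", "Mar 2011", "Apr 2011"],
    "Measure": ["Units Sold", "Sales Dollars"],
}

# Multiply the member counts: 2 x 3 x 4 x 2.
potential = prod(len(members) for members in attributes.values())

# Enumerate every potential cell as one combination of members.
cells = list(product(*attributes.values()))

print(potential, len(cells))
print(cells[0])
```

Not every potential cell holds a value. The report has empty cells wherever a model was not sold in a given state and month.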
30. Dimensions
Examples:
◦ State attribute belongs to the Geography dimension
◦ Model attribute belongs to the Product dimension
◦ Month attribute belongs to the Date dimension
◦ Units Sold and Sales Dollars belong to the Measure dimension
31. Dimensions
Independent attributes and their hierarchies make up a dimension
A dimension may contain more than one attribute
◦ Ex: the Product dimension contains the Color and Size attributes
Dimensions also contain hierarchies
◦ Ex: the Product by Model hierarchy is composed of attributes contained in the Product dimension, so the
hierarchy also belongs in the Product dimension
The Measure dimension is displayed on columns
34. Dimensional Data Warehouse
A dimensional data warehouse is the data storage and retrieval layer of a BI system
In a dimensional data warehouse:
◦ Dimensions are stored in dimension tables
◦ Measures are called facts and are stored in fact tables
35. Fact Table
State  Product                  Month     UnitsSold  SalesDollars
OR     Hitch Rack               Jan 2011  1          $120.00
OR     Mountain-500 Silver, 40  Jan 2011  1          $565.00
OR     Mountain-500 Silver, 48  Jan 2011  1          $552.50
WA     Mountain-500 Silver, 48  Jan 2011  1          $552.50
OR     Hitch Rack               Feb 2011  2          $240.00
WA     Hitch Rack               Feb 2011  4          $480.00
• Fact table:
– table that stores the detailed values for measures
• Key Column:
– State, Product, Month
• Fact Column:
– UnitsSold, SalesDollars
FactSales table
36. Fact Table
The values in the key columns relate the facts in a fact table row to a row in each dimension
table
A fact table may have other types of columns for reference purposes
A fact table contains one or more measure columns
37. Fact Table
The level of detail stored in a fact table is called its granularity
The set of dimensions that a fact table is related to is called the dimensionality of the fact table
Facts that have different granularity or different dimensionality must be stored in separate fact
tables
38. Fact table: Dimension key
In practice, a fact table almost always uses an integer, called a dimension key, for each dimension member
There must be a dimension table for each dimension key in a fact table
State  Product  Month   UnitsSold  SalesDollars
1      483      201101  1          120.00
1      591      201101  1          565.00
1      594      201101  1          552.50
2      594      201101  1          552.50
1      483      201102  2          240.00
2      483      201102  4          480.00
FactSales table using Dimension key
39. Dimension Table
A dimension table contains one row for each member of
the key attribute of the dimension
The key attribute has two columns:
◦ Integer dimension key (PK)
◦ Attribute label
A dimension table may contain other columns for other
attributes of the dimension
ProductKey Product
596 Mountain-500 Black, 40
598 Mountain-500 Black, 44
599 Mountain-500 Black, 48
591 Mountain-500 Silver, 40
593 Mountain-500 Silver, 44
594 Mountain-500 Silver, 48
604 Road-750 Black, 44
605 Road-750 Black, 48
483 Hitch Rack
DimProduct Dimension Table
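The way a dimension key ties a fact row to a dimension row can be tried out with Python's built-in `sqlite3` module. The rows below are the keyed FactSales rows and the matching DimProduct rows from the slides. Using SQLite instead of SQL Server, and the exact column types, are assumptions made to keep the sketch self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, Product TEXT);
CREATE TABLE FactSales (
    State INTEGER,
    Product INTEGER REFERENCES DimProduct(ProductKey),
    Month INTEGER,
    UnitsSold INTEGER,
    SalesDollars REAL
);
INSERT INTO DimProduct VALUES
    (591, 'Mountain-500 Silver, 40'),
    (594, 'Mountain-500 Silver, 48'),
    (483, 'Hitch Rack');
INSERT INTO FactSales VALUES
    (1, 483, 201101, 1, 120.00),
    (1, 591, 201101, 1, 565.00),
    (1, 594, 201101, 1, 552.50),
    (2, 594, 201101, 1, 552.50),
    (1, 483, 201102, 2, 240.00),
    (2, 483, 201102, 4, 480.00);
""")

# Join each fact row to its dimension row through the integer key,
# then summarize the facts per product label.
rows = conn.execute("""
    SELECT d.Product, SUM(f.UnitsSold), SUM(f.SalesDollars)
    FROM FactSales AS f
    JOIN DimProduct AS d ON d.ProductKey = f.Product
    GROUP BY d.Product
    ORDER BY d.Product
""").fetchall()

for row in rows:
    print(row)
```

The fact table stores only integers for its keys. The readable label comes from the dimension table at query time.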
41. Aggregatable and Aggregate
Aggregatable: attributes that can be used to create groups
Non-aggregatable attributes are referred to as member properties
◦ Ex: List Price, Telephone Number, Street Address…
Aggregate: a summary value calculated for each group of an aggregatable attribute
Example:
◦ Aggregatable: Category, Color…
◦ Aggregate: Number of Units Sold for each Category
42. Table structure
OLTP: normalization makes sure that a value is stored in only one place
- Consistency
- More tables with more relationships
Normalizing each of the dimension tables so that each dimension has
several tables results in a snowflake schema.
43. Table structure
OLAP: denormalization stores redundant values in a single table
- Redundancy
- Fast queries
Creating a single denormalized table for each dimension
results in a star schema.
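The difference between the two layouts can be seen in a small `sqlite3` sketch. The snowflake tables `Product`, `Subcategory`, and `Category`, their key values, and the trimmed-down `FactSales` table are hypothetical. The star table `DimProduct` is built by denormalizing them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake: the Product dimension is normalized into three tables.
CREATE TABLE Category (CategoryKey INTEGER PRIMARY KEY, Category TEXT);
CREATE TABLE Subcategory (SubcategoryKey INTEGER PRIMARY KEY, Subcategory TEXT, CategoryKey INTEGER);
CREATE TABLE Product (ProductKey INTEGER PRIMARY KEY, Product TEXT, SubcategoryKey INTEGER);
INSERT INTO Category VALUES (1, 'Bikes'), (2, 'Accessories');
INSERT INTO Subcategory VALUES (10, 'Mountain Bikes', 1), (20, 'Bike Racks', 2);
INSERT INTO Product VALUES (594, 'Mountain-500 Silver, 48', 10), (483, 'Hitch Rack', 20);

CREATE TABLE FactSales (ProductKey INTEGER, UnitsSold INTEGER);
INSERT INTO FactSales VALUES (483, 1), (594, 1), (594, 1), (483, 2), (483, 4);

-- Star: one denormalized table per dimension; Category repeats on every row.
CREATE TABLE DimProduct AS
SELECT p.ProductKey, p.Product, s.Subcategory, c.Category
FROM Product AS p
JOIN Subcategory AS s ON s.SubcategoryKey = p.SubcategoryKey
JOIN Category AS c ON c.CategoryKey = s.CategoryKey;
""")

# Snowflake query: three joins to reach the Category label.
SNOWFLAKE_QUERY = """
    SELECT c.Category, SUM(f.UnitsSold)
    FROM FactSales AS f
    JOIN Product AS p ON p.ProductKey = f.ProductKey
    JOIN Subcategory AS s ON s.SubcategoryKey = p.SubcategoryKey
    JOIN Category AS c ON c.CategoryKey = s.CategoryKey
    GROUP BY c.Category ORDER BY c.Category
"""

# Star query: a single join gives the same answer.
STAR_QUERY = """
    SELECT d.Category, SUM(f.UnitsSold)
    FROM FactSales AS f
    JOIN DimProduct AS d ON d.ProductKey = f.ProductKey
    GROUP BY d.Category ORDER BY d.Category
"""

snowflake = conn.execute(SNOWFLAKE_QUERY).fetchall()
star = conn.execute(STAR_QUERY).fetchall()
print(snowflake)
print(star)
```

Both queries return the same totals. The star layout trades repeated Category and Subcategory values for fewer joins per query.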
46. Multidimensional OLAP
The multidimensional OLAP database resides between the data storage and retrieval layer and the
presentation layer
It converts the relational data warehouse data into a fully implemented dimensional model for
creating analytical reports and data visualizations
47. Measure Group and Cube
A measure group corresponds to a single fact table
A measure group may contain data at a single level of detail plus aggregated data for
all higher levels of detail
Cube: a combination of several related measure groups and a set of dimensions
State  Product     Date  Units Sold  Sales Amount
All    All         All   70          31,305
WA     All         All   46          21,235
WA     Bikes       All   37          20,115
WA     Road Bikes  All   19          10,260
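The "All" rows are aggregates stored next to the detail values. A rough Python sketch of the idea follows; it is not how Analysis Services stores a measure group. It uses the six FactSales rows from the earlier fact table slide (so its totals differ from the table above) and leaves out the Date dimension for brevity. `build_cube` is a made-up helper.

```python
from collections import defaultdict
from itertools import product as cartesian

# (State, Product, UnitsSold, SalesDollars) rows from the FactSales slide.
FACT_ROWS = [
    ("OR", "Hitch Rack", 1, 120.00),
    ("OR", "Mountain-500 Silver, 40", 1, 565.00),
    ("OR", "Mountain-500 Silver, 48", 1, 552.50),
    ("WA", "Mountain-500 Silver, 48", 1, 552.50),
    ("OR", "Hitch Rack", 2, 240.00),
    ("WA", "Hitch Rack", 4, 480.00),
]


def build_cube(rows):
    """Add each fact to its detail cell and to every 'All' cell above it."""
    cube = defaultdict(lambda: [0, 0.0])
    for state, prod_label, units, dollars in rows:
        for key in cartesian((state, "All"), (prod_label, "All")):
            cube[key][0] += units
            cube[key][1] += dollars
    return {key: tuple(values) for key, values in cube.items()}


cube = build_cube(FACT_ROWS)
print(cube[("All", "All")])
print(cube[("WA", "All")])
print(cube[("All", "Hitch Rack")])
```

Once the cube is built, a query for any level of detail is a single lookup rather than a scan of the fact rows.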
49. What is OLAP
1985: OLTP (Online Transaction Processing)
1993: OLAP (Online Analytical Processing)
• Benefits
– Consistently fast response
– Metadata-based queries
– Spreadsheet-style formulas
50. Consistently Fast Response
Aggregate values and the results of formulas are calculated and stored when a cube is loaded
(calculation in advance)
Aggregate tables can be created to provide fast query results
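An aggregate table can be sketched with `sqlite3`: the summary is computed once at load time, and reports then read the small table instead of the detail rows. The table name `AggSalesByStateMonth` is hypothetical. The detail rows are the FactSales rows from the earlier slides, with state labels and integer month keys combined for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FactSales (
    State TEXT, Product TEXT, Month INTEGER, UnitsSold INTEGER, SalesDollars REAL
);
INSERT INTO FactSales VALUES
    ('OR', 'Hitch Rack',              201101, 1, 120.00),
    ('OR', 'Mountain-500 Silver, 40', 201101, 1, 565.00),
    ('OR', 'Mountain-500 Silver, 48', 201101, 1, 552.50),
    ('WA', 'Mountain-500 Silver, 48', 201101, 1, 552.50),
    ('OR', 'Hitch Rack',              201102, 2, 240.00),
    ('WA', 'Hitch Rack',              201102, 4, 480.00);

-- Computed once, when the data is loaded.
CREATE TABLE AggSalesByStateMonth AS
SELECT State, Month,
       SUM(UnitsSold) AS UnitsSold,
       SUM(SalesDollars) AS SalesDollars
FROM FactSales
GROUP BY State, Month;
""")

# A report by state and month now reads the precomputed rows only.
agg_rows = conn.execute("""
    SELECT State, Month, UnitsSold, SalesDollars
    FROM AggSalesByStateMonth
    ORDER BY State, Month
""").fetchall()

for row in agg_rows:
    print(row)
```

The aggregate table can never hold more rows than the number of state and month combinations (here 2 x 2 = 4), however many detail rows the fact table grows to.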
51. Metadata-Based Queries
SQL suits transaction systems, not reporting applications
Query language for OLAP data sources
◦ Multidimensional Expressions (MDX)
MDX Query
    SELECT
      [Store].[Store Country].[Canada].[Vancouver]
      ON COLUMNS,
      [Product].[All Products].[Clothing].[Mittens]
      ON ROWS
    FROM [Sales]
    WHERE ([Measures].[Unit Sales],
      [Date].[2010].[February])
SQL Query
    SELECT SUM(Sales.[Unit Sales])
    FROM (Sales INNER JOIN Stores
      ON Sales.StoreID = Stores.StoreID)
      INNER JOIN Products
      ON Sales.ProductID = Products.ProductID
    WHERE Stores.StoreCity = 'Vancouver'
      AND Products.ProductName = 'Mittens'
      AND Sales.SaleDate BETWEEN '2010-02-01' AND '2010-02-28'
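The SQL version of the query can be run against a toy database with Python's `sqlite3` module. The three tables and every row in them are invented for illustration. Dates are stored as ISO `YYYY-MM-DD` text so that `BETWEEN` compares them correctly, and the parentheses around the first join are dropped, which does not change the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores (StoreID INTEGER PRIMARY KEY, StoreCity TEXT);
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE Sales (StoreID INTEGER, ProductID INTEGER, SaleDate TEXT, [Unit Sales] INTEGER);

INSERT INTO Stores VALUES (1, 'Vancouver'), (2, 'Toronto');
INSERT INTO Products VALUES (1, 'Mittens'), (2, 'Scarf');
INSERT INTO Sales VALUES
    (1, 1, '2010-02-03', 5),
    (1, 1, '2010-02-20', 2),
    (1, 1, '2010-03-01', 9),   -- outside February
    (2, 1, '2010-02-10', 4),   -- not Vancouver
    (1, 2, '2010-02-11', 7);   -- not Mittens
""")

QUERY = """
    SELECT SUM(Sales.[Unit Sales])
    FROM Sales
    INNER JOIN Stores ON Sales.StoreID = Stores.StoreID
    INNER JOIN Products ON Sales.ProductID = Products.ProductID
    WHERE Stores.StoreCity = 'Vancouver'
      AND Products.ProductName = 'Mittens'
      AND Sales.SaleDate BETWEEN '2010-02-01' AND '2010-02-28'
"""

total = conn.execute(QUERY).fetchone()[0]
print(total)
```

The SQL query has to spell out the joins and filters against physical tables. The MDX query names members of the dimensional model and leaves the relationships to the cube's metadata.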
Spend time building up this slide. Note that the main points on this slide will be covered in the slides that follow.
Build 1: Introduces source systems and client access. Mention a common requirement for information workers to analyze and report on this data.
Build 2: Should the information workers connect directly to these systems? Remind students of the points on the slide about common information problems: performance impact, availability, cleanliness, historical context preservation, and end user skills and tools.
Build 3: Focuses on source system mirroring. Mention that database mirroring (an availability feature introduced with SQL Server 2005) could make a read-only copy of the database available to reduce the impact on the source database.
Build 4: Introduces the data warehouse, which consists of data marts, a multidimensional database, data mining models and data feeds. The data warehouse system can overcome many of the issues raised in Build 2, but it implies that the data must be copied from the source systems…
Build 5: Highlights the ETL process. Mention that the data from the source systems needs to be periodically extracted and loaded into the data marts. These data marts commonly have a particular schema design optimized for querying, so the data will need to be transformed. Introduce the term ETL: extract, transform, and load.
Build 6: Introduces the staging systems. Performing the ETL in one process may be difficult to achieve because of the complexity of transformations or the need to cleanse the data. Mention that staging systems are optional and that the technologies introduced in this course (e.g., SSIS) may challenge this traditional need. Note that staging is still an important design consideration because it provides convenient restartability of the ETL process without the need to disturb the source systems.
Build 7: Manual cleansing may be required to fix problematic data. This is expensive in terms of human resources and time. Mention that the technologies introduced in this course (e.g., SSIS) may be able to address this problem.
Build 8: Client access can take many forms, for example via browsing tools, reports, spreadsheets, dashboards, and so on. Stress that, ideally, clients extract their data from the “one version of the truth.” Discuss the different types of users (power users, analysts) and their different needs.
Build 9: Emphasize that this is a continuous process of monitoring, analyzing and planning.
Source data can be stored in a variety of data stores and in different formats. It is usually not optimized for analytic and reporting needs. A data warehouse can deliver a unified data store that presents cleansed, conformed data for optimized analytics and reporting.
Adventure Works Cycles manufactures and sells bicycles, bicycle components, clothing, and accessories for North American, European, and Asian markets. When detailed data from the data warehouse is loaded into a multidimensional OLAP database, summarized values are precalculated.
To populate the data warehouse, data needs to move periodically from the source system(s) to the data warehouse. This is often referred to as an ETL process (Extract, Transform and Load). SQL Server Integration Services was designed specifically to perform ETL processes.
Control flow governs the order and precedence of how tasks are executed.
Data flows can be developed to extract data from multiple sources, and then integrate and transform that data into a format that is useful for reporting and analytics.
Cubes deliver a conceptual model of measures and dimensions. They are best developed on top of data warehouse structures, in particular those designed as star schemas. Cubes deliver rapid ad hoc query responses, and can enrich the model with hierarchies, properties, calculations, KPIs, actions, perspectives and translations. End-users commonly connect directly to cubes and use graphical designers to construct their queries.
Numbers without context may be data, but they are not information. When data is loaded into a multidimensional OLAP database, metadata is added to the data. Metadata is data about the data. The metadata in an OLAP database includes information about relationships and hierarchies in the data, how the data should be sorted and summarized, and how it should be formatted for presentation. The metadata in the OLAP database is what turns data into information.
BI practitioners just call each list an attribute. Because the labels in each list are related to each other and belong in the same attribute, the labels are called members. The Product attribute is the key attribute.
The value of Units Sold for each Model is the sum, or aggregation, of the value of Units Sold of the related Products. The Model and Product attribute members are arranged in a hierarchy.
Each column in a fact table is typically either a key column or a fact column, but it is also possible to have other columns for reference purposes, for example purchase order numbers or invoice numbers. A fact table contains a column for each measure.
Many dimension attributes can be used to create groups of dimension records, and then the related facts can be summarized for each group. For example, Product dimension records can be grouped into Bikes and Accessories categories, and then the number of Units Sold for each category can be calculated.
In an operational database, it is critical for data to be consistent across the entire application: if you change a customer’s address in one part of the system, you want the changed address to be immediately visible in all parts of the system. Because of this need for consistency, operational databases tend to be broken up into many tables so that any value is stored only once in a single table. Any time the value is needed, a join to the table containing the value can be created. Ensuring that a value is stored in only one place is one element of a process called normalization, and it is very important in operational database systems.
If you execute a report using data from the data warehouse, however, many joins can make the query slow. For example, suppose you want to see Sales Amount for the Bikes category for the year 2011. To aggregate by Bikes, you have to join each row in the fact table to the Product table, and then to the Subcategory table, and then to the Category table. To aggregate by the year 2011, you also have to join the fact table to the Month table, to the Quarter table, and finally to the Year table. And you have to do all those joins for all the rows in the fact table, discard the rows that are not related to the Bikes category and the year 2011, and then sum Sales Amount in the remaining rows. Joining all of the Product dimension and Date dimension tables to the fact table makes the query for this report much slower than if all the Product attributes were in a single table and all the Date attributes were in a single table.
The columns containing numerical data in a fact table correspond to measures in a dimensional model, so each fact table is a group of measures. Analysis Services organizes information in a logical construct called a measure group that corresponds to a single fact table and its related dimensions.
A report of sales by product subcategory by quarter may require several minutes to run, even if you have only 50 subcategories and 20 quarters. But if you pre-summarize the data into an aggregate table that includes only subcategories and quarters, the aggregate table will have at most 1,000 rows (50 subcategories times 20 quarters gives a maximum of 1,000 possible rows), and a report requesting totals by subcategory and by quarter will not take very long to execute.
Conceptually, each measure group contains all the detail values stored in the fact table, but that doesn’t mean that the measure group must physically copy and store all of that data. If you choose, you can make the measure group dynamically retrieve values as needed from the fact table. In this case, you’re using the measure group only to define metadata. This is called relational OLAP, or ROLAP. For faster query performance, you can have Analysis Services load the detail values into its own proprietary storage structure and precalculate aggregate values. This will provide improved query performance. This is called multidimensional OLAP, or MOLAP.
Analysis Services allows you, the cube designer, to decide to use MOLAP or ROLAP. Aside from performance differences, where the detail values are physically stored is completely invisible to a user of a cube. Whether you use MOLAP or ROLAP, when you execute a query the results are stored in memory, on a space-available basis, to make subsequent queries faster. You can think of MOLAP storage as a disk-based cache that allows the Analysis Server to load the memory cache much faster than if it had to retrieve data from a relational data warehouse.