SmithGroup JJR | Technical Analysis of BI Environment
Version 1.0
© AIM Report Writing, 2017
Executive Summary
At the beginning of the engagement with the SmithGroup JJR, AIM Report Writing was asked to provide an initial, high
level Technical Analysis of their Business Intelligence (BI) Environment. This document is what resulted from the weeks
of investigation that followed. The following areas were agreed upon for this delivery item:
• Executive Summary
• Recommendations
• Data Warehouse Architecture
• SQL Server Best Practices
• Case for Hadoop: Indoor Positioning Study (POE)
• SQL Server / Database Discovery
• Data Warehouse and Data Marts
• Extract, Transform, and Load
• Appendix A | Microsoft Data Warehouse On-Premises Architecture
• Appendix B | Design Questions to Review
The current business intelligence and data warehouse environment at SmithGroup JJR includes 3 primary components: a
data warehouse, an ETL process, and a data mart. One additional storage location exists on separate SQL Server
resources. Two main data sources are stored in the cloud, with the intent of expanding cloud use in the future.
This business intelligence and data warehouse environment was analyzed against SQL Server best practices, the data
warehouse (data mart) design, and the ETL process. In addition to this analysis, a database discovery and a case for
Hadoop were completed.
In order to implement the following recommendations, it is suggested to follow an Agile approach. This document’s
technical analysis has identified and defined groups of work, called Epics in Agile. Breaking the findings of this
analysis into Epics provides a starting point for identifying scope and vision, user stories, and a backlog. These
Agile artifacts provide the framework for defining the effort, sprints, and delivery to production.
During on-site meetings regarding the topics in this document, it was informally agreed that the initial two Epics for
development focus on the two areas listed below:
• SSIS Error Flows to replace TSQL Functions
• Dimensional Modeling and Star Schema
The team at SmithGroup JJR has started gathering use cases, which will be used during the dimensional modeling (star
schema) development. These use cases serve 4 purposes:
• Identify Entities, Relationships, and Attributes for Star Schema Conceptual Model
• Used to Develop the Dimensional Model, Star Schema Conceptual Model
• After Draft Conceptual Model is Complete They are Used to Verify that the Model can Source the Use Cases
• Used to Develop the Front End Requirement such as a Report, or Dashboard
Future Epics need to address validation, scalability, transaction processing, and load metadata.
Recommendations
• SSIS Error Flows to replace TSQL Functions
o All Existing Source to Stage functions in ETL Database
o This needs to be investigated as to where and how this change would be implemented
o Alternative: T-SQL Error Control can produce the current failing row, but not a batch of failed rows like SSIS
• Dimensional Modeling and Star Schema
o Continue to Gather and Collect Use Cases
o Identify Entities (Dimensions), Relationships, and Attributes for Star Schema Conceptual Model
▪ Used to Obtain Stakeholder Consensus
o Create Flat Table Definitions of Entities (Dimensions) using Excel
o Model the Star Schema Conceptual Model using ERWin
o Continued Learning and Training on ERWin, possible Pluralsight, or Webinars (http://erwin.com/videos/)
• Architecture | Option 1, Shown Below in the Following Section, Data Warehouse Architecture
o Current Use does not require an integrated cloud environment
▪ Users do not experience performance issues using a gateway with on-premises data
▪ Especially considering that the on-premises environment should receive the maximum effort
o Cloud Analysis and Analytics using Power BI with a Gateway and On-Premises Data
o Tableau Users have Access to On-Premises Data for Analytics
o This approach allows a scalable future roadmap to Integrate the On-Premises and Cloud environments
• Hadoop | Reserve for Future Roadmap
o Low Volume | Carl Estimated 1TB of Data
▪ As an unwritten rule shared by experts, Hadoop needs at least 5TB to justify and achieve performance
o High to Moderate Investment for On-Premises, or Cloud based Hadoop
o Although Volume, Velocity, Variety, and Veracity are all considerations, Volume is required for federation.
• Naming of Business Resources with Industry-Standard Naming
o “Data Lake” is an industry term used with Hadoop; rename this resource to something else
o “Data Vault” is actually a data warehouse (not as big a concern as the naming conflict above)
• 2 Summary Tables from the Analysis Sections towards the end of this document.
o 1 | Data Warehouse and Data Marts
o 2 | Extract, Transform, and Load
1 | Data Warehouse and Data Marts
Security | SQL Server security appears to be adequate and follows industry standards.
Partitioning | After review of development data only, we do not see a need for partitioning. No performance issues were reported.
Alerting | Combining try-catch, database mail, and SQL Agent is highly recommended for SQL Server alerting of issues.
Indexing | Index discovery and creating an enterprise index strategy are recommended for production servers.
Star Schema | There is currently no star schema; one is highly recommended.
Conformed Dimensions | Since there is no star schema, there are no conformed dimensions.
Scalability | Scalability appears to be a concern. A future-use plan for instances, files, and file groups is suggested.
Exception Handling | Try-catch is not being used in functions and stored procedures. It is highly recommended that it be added.
Transaction Processing | The environment does not use transaction processing. Transaction processing is recommended for future phases.
SQL Views (Business Views) | A star schema is suggested to reduce complexity in creating and managing SQL Views for the business.
Surrogate Keys | Surrogate keys with an integer data type are suggested for the star schema.
Delta Loads | TSQL Merge is adequate; however, the additional use of checksums should be considered. Load metadata is needed.
2 | Extract, Transform, and Load
Load Meta Data | Currently, there is no tracking of load metadata; implementing it is highly recommended.
Environments and Environment Variables | Currently, environments and environment variables are being used with success.
Parameters | Currently, parameters are being used with success.
Logging | Logging with the SSIS framework is working; however, load metadata logging is suggested.
Validation | Since little to no validation is currently employed, adding validation is highly recommended.
Transaction Processing | It is recommended not to use transaction processing in SSIS, but rather at the SQL Server level in functions and stored procedures.
Package Sequencing | There are no reported errors or issues with the current package sequencing.
Connection Managers | Currently, connection managers are being used with success.
Alerting | Alerting is not enabled in the solutions evaluated. Alerting is highly recommended.
Exception Handling | There is no exception handling in the currently evaluated packages; it is highly recommended.
Checkpoints | It is recommended that some restartability be designed and implemented.
Naming Conventions | Naming conventions are recommended.
Data Warehouse Architecture
Below, we provide diagrams representing the current architecture and 2 options for the next stage of the BI / Data
Warehouse Architecture. An “all-features” Microsoft On-Premises Architecture diagram can be found in Appendix A.
These diagrams are intended to help the decision makers compare their current architecture with possible phases. In
order to help clarify these phases, this section also includes information about the Current Error Control Design and
information about Integrating On-Premises and the Cloud.
The Current Architecture stages both enterprise and project / application specific data sources from both internal and
external locations. Data sources intended for the data warehouse are staged first and then loaded.
Figure 1 | Current Architecture — diagram of the current environment: on-premises enterprise sources (UltiPro, Active Directory, Vision, NewForma) flow through an ETL stage into the data warehouse (3NF Data Vault with UltiPro and Vision schemas) and on to the data mart; RevIt, NSF, and IPEDs data reside on separate SQL Server storage; Indoor Positioning (Blue Vision IPS) and Marquette data flow through Azure Event Hub, Streaming Analytics, Table Storage, and Azure SQL Database; Tableau and Power BI serve end users.
The data stored in the data warehouse becomes the source for the data mart. Data source examples include Ultipro, Vision, and Active
Directory. Other data sources such as NSF, IPEDs, and RevIt are stored on other dedicated SQL Server storage. In the
cloud, data sources from Indoor Positioning and Marquette indicate the slow adoption of integrating on-premises data
with cloud data. With this identified, the options discussed include an all on-premises option and an integrated on-
premises and cloud option.
Phase 1, an All On-Premises Data Warehouse design, dictates that all of the data structures and data be stored on
internal company resources (no cloud). In this option, the star schema and cubes exist and remain on internal company
resources; however, these resources and their content connect to cloud apps such as Power BI. This connection is
facilitated by a gateway, and the gateway is required to make multiple trips when sourcing data. Even so, this
design gives Power BI and Tableau end users a flexible and feasible option.
Figure 2 | Option 1 — All On-Premises with Gateway: the data warehouse (3NF Data Vault), data mart, star schema, and cube analytics remain on internal resources; Power BI reaches the on-premises data through a gateway over HTTP (no VPN), while Tableau connects directly; the Azure components (Event Hub, Streaming Analytics, Table Storage, Azure SQL Database) continue to serve the Indoor Positioning and Marquette data.
Phase 2, an Integrated On-Premises and Cloud Data Warehouse, seeks to design a hybrid data warehouse providing the
best of both the on-premises and cloud data warehouses. In option 2, SQL Server Integration Services loads data from a
data mart star schema into Azure SQL Database, or directly into Azure Analysis Services. This option differs from
option 1 in that on-premises data is copied into the cloud to be consumed by applications such as Power BI. For a
data scientist or a business analyst using Power BI, having the on-premises data in the cloud provides fast analysis
alongside external data already in the cloud. In the first option, the gateway is required to make multiple trips when
sourcing data; in this option, the data already exists in the cloud, so gateway use is minimized.
Figure 3 | Option 2 — Integrated On-Premises and Cloud: the on-premises data warehouse, data mart, and star schema are loaded into Azure SQL Database and Azure SSAS over a VPN; Power BI consumes the cloud copies while Tableau continues to use on-premises data through the gateway; the Azure streaming components (Event Hub, Streaming Analytics, Table Storage) remain as in the current architecture.
In the next two diagrams, the flow of data is separated into five phases. These phases are Enterprise Source Systems,
Staging, Data Warehouse, Data Mart, and Star Schema. However, for this analysis, we are focusing on the first two
rectangles titled Enterprise Source Systems and Staging. The first diagram displays the current use of functions to extract
data from the source systems. The second diagram displays a possible use of SSIS Error Flows.
Functions: the diagram below represents the current flow of data, in which data is extracted from the source systems
using functions. These functions are intended to exist on the actual source system in a database named ETL, but in the
case of Ultipro (a backup / restore process) the functions exist on the ETL database used by the data warehouse. These
functions are used to load the staging tables used in the downstream merge. The advantage of this design is that
changes to the architecture can be implemented without affecting downstream objects such as SSIS. The concern with
this design is that during a load failure, the specific rows that failed are not easily identifiable, so a detailed
alert containing the failed rows cannot be generated.
Figure 4 | Current Architecture Using Functions — diagram of the flow across Enterprise Source Systems, Staging, Data Warehouse, Data Mart, and (future) Star Schema. SQL functions stored on the source systems (or, for the UltiPro backup / restore process, on the stage) load the staging tables in the ETL database; a merge statement loads inserts and updates into the data warehouse; stored procedures on the data mart execute functions on the data warehouse to load the data mart; the star schema is to be designed and developed in future phases. Warning: using functions to pull the source data prevents the use of SSIS Data Flow Tasks, so there is no error flow that stores failed rows for evaluation and fixing.
Error Flow: the diagram below represents the proposed flow of data using SSIS error flows. The use of SSIS is intended
to replace the existing functions that load the staging tables. As shown in the diagram, SSIS has the ability to
create an error flow to capture rows that fail the load process. This allows the details and cause of the failure to
be emailed to alert the appropriate stakeholders. Once SSIS loads the staging tables and stores any failed rows, the
rest of the data flow remains the same as in the current diagram.
Figure 5 | Proposed Architecture Using SSIS Error Flows — the same flow across Enterprise Source Systems, Staging, Data Warehouse, Data Mart, and (future) Star Schema, but with an SSIS Data Flow Task extracting data from the source systems into the staging tables. Notice: using the Data Flow Task allows the use of error flows, meaning failed rows can be stored for evaluation and fixing; the downstream merge, data mart stored procedures, and future star schema remain unchanged.
The options to integrate on-premises and cloud are diagrammed below. The full overview shows Site-to-Site and Point-
to-Site VPNs as well as an HTTP connection. All three of these options provide different levels of security and IPsec
standards. An additional option for the Site-to-Site VPN is ExpressRoute
(https://azure.microsoft.com/en-us/services/expressroute/).
ExpressRoute is a Microsoft Azure service that provides advanced scalability, increased reliability and speed, lower
latency, and WAN integration. It is a paid, pay-per-use service.
Figure 6 | Integrate On-Premises and Cloud — full overview diagram showing a Site-to-Site VPN (optionally ExpressRoute, which is secure, controlled, and offers better connectivity quality), a Point-to-Site VPN, and an HTTP connection between on-premises SQL Server and workstations and the cloud gateways.
SQL Server Best Practices
SQL Server best practices were discussed and explained during a meeting with the SmithGroup JJR DBA and
Infrastructure teams. Demonstrations during the meeting were performed on development servers so that production SLAs
were not affected. All decisions regarding whether and when to implement these best practices were left up to
SmithGroup JJR.
NTFS Allocation Unit (AU) | Block Size = 64k, Alignment = 1024k. Default is 4k; use /L with Format on Windows 2012 and above.
Max Degree of Parallelism (MAXDOP) | Set to the number of cores in a single CPU socket.
DB Auto Growth | Set very high for performance (100MB to xxx GB).
Cost Threshold for Parallelism | For OLTP, where we seek to minimize parallelism and offer more concurrency, use 15-20 (up to 50 with modern CPUs). For DSS, OLAP, data warehouse, and test environments, consider leaving at the default and managing parallelism with MAXDOP if concurrency is a problem.
TempDB | 1:2 or 1:4 ratio of TempDB data files to cores; 1:1 ratio for large systems. Pre SQL Server 2016: use trace flags T1117 and T1118 to enable consistent auto growth. On flash arrays, enable the SORT_IN_TEMPDB index build option to prevent index rebuilds.
Separate Data / Log Volumes | Tier 1; test to determine for Tier 2 flash arrays. Multiple volumes per file group to reduce latch contention, 4-8 files per file group. 3 volumes (TempDB, data / log files, and backups) for fast flash (under 1ms response times).
Max Server Memory | 90% of available server memory.
Enable Instant File Initialization | Windows Server setting: Perform Volume Maintenance Tasks needs to be set under Local Policies and User Rights Assignments.
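To make the server-level settings above concrete, the following is a minimal T-SQL sketch using sp_configure; the values shown (MAXDOP 8, cost threshold 50, max server memory 230000 MB) are placeholders to be replaced with values derived from the actual core count and memory of each server.

-- Hedged example only: replace the placeholder values with ones derived from the actual hardware.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- MAXDOP: number of cores in a single CPU socket (8 is a placeholder).
EXEC sp_configure 'max degree of parallelism', 8;

-- Cost threshold for parallelism: 15-20 for OLTP, up to 50 with modern CPUs.
EXEC sp_configure 'cost threshold for parallelism', 50;

-- Max server memory: roughly 90% of available server memory, in MB (placeholder value).
EXEC sp_configure 'max server memory (MB)', 230000;

RECONFIGURE;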
Case for Hadoop: Indoor Positioning Study (POE)
During our initial meetings regarding the data sources at SmithGroup JJR, we identified one possible use case for
Hadoop: the Indoor Positioning Study (POE). During our conversations, multiple questions were asked about Hadoop, such
as what the minimum data size is and how to handle aggregates on unstructured data.
Hadoop does not perform well on 5TB or less. It is also worth noting that small files do not work well with Hadoop and
should be combined into larger files. As for aggregates in Hadoop, if SmithGroup JJR were to use Azure Data Lake Store
(ADLS), they could use HDInsight and Hive; if they use SQL Data Warehouse or SQL Server 2016, they could use PolyBase.
Another option is Azure Data Lake Analytics / U-SQL to aggregate Hadoop data.
Below are questions taken from the Indoor Positioning Study (POE) documentation that describe what SmithGroup JJR
would like to answer with this data source. The questions below are broad topics, each with more specific questions.
• How do people utilize space?
o What is the average dwell time by space?
o How does the number of people within a space vary over time?
o What are the most frequently used paths between spaces?
• How do people interact and collaborate?
o How much time do people spend in spaces occupied by other people?
o What is the average number of people in a collaborative space?
o How does job/organizational role impact collaboration?
• Person movement
o How often do people move between spaces?
o What is the average duration of rest (motion)?
Additional Questions:
• Exact location of a user within space
• Actual paths traveled between spaces
• Relationship between workspace and study subject (employee/organizational) measures, such as happiness or
productivity (what is an abstract term that captures these types of things?)
• Comparison of varied workspace configurations / designs / arrangements such as office /open / free assignment
• Integration with other technologies and data sources, such as space scheduling software, communication
software, galvanic skin response, implanted telemetry chips, health and dental records, etc.
Peter estimated that the size of the Indoor Positioning Study (POE) data at SmithGroup JJR is at most one terabyte.
Since this is much less than the five terabyte minimum for Hadoop clusters, it is not suggested to implement a Hadoop
cluster for this use case.
SQL Server / Database Discovery
In order to complete a data discovery, we were provided 3 databases:
• DataVault
• DataMart
• ETL
We performed the data discovery using 2 different methods. The first method was to create a web-based document
of each database using Redgate’s SQL Doc. The second method was to use SQL Server DMVs and TSQL to create an Excel-
based data dictionary. The files are included in the SharePoint folder along with this document. Also note that
Shabhana provided the WBS Migration Changes to Datawarehouse Systems.pdf, where much of this type of information can
be found as well.
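For reference, a query along the following lines (a sketch using standard catalog views, not the exact script used for the deliverable) produces the table and column inventory behind such a data dictionary:

-- Sketch of a data dictionary query using system catalog views.
SELECT  s.name  AS SchemaName,
        t.name  AS TableName,
        c.name  AS ColumnName,
        ty.name AS DataType,
        c.max_length,
        c.is_nullable
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
JOIN sys.types  ty ON ty.user_type_id = c.user_type_id
ORDER BY s.name, t.name, c.column_id;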
Finally, we collected information about the various data sources (both internal and external). The list of data sources is
as follows:
Internal
Enterprise Data Source
• Vision | Enterprise Resource Planning Software
• UltiPro | Human Resources
• SharePoint | Document Management and Collaboration
• Active Directory (AD) | Network / Domain Information
• NewForma | Project Metadata and RFIs
Project / Application Specific
• Revit Data Collector | Building Information Modeling (Model Statistics)
• CER
• WorkSim Space Planning | Space Planning
• Indoor Positioning Study (POE) | Azure SQL for People Movement in Workspace
• Campus Project Data (Marquette) | Campus Planning & Space
External
• IPEDS | Public University Data
• National Science Foundation | Public Data for Funded Projects
• Bureau of Labor | Government Labor Statistics
• GIS | Topographical Data, Surveys of Land
Data Warehouse and Data Marts
In order to analyze SmithGroup JJR’s Data Warehouse / Data Mart environment, we were provided 3 databases:
• DataVault
• DataMart
• ETL
Overview
The data warehouse (DataVault) and the data mart (DataMart) are the 2 databases that make up the SmithGroup JJR BI
environment. The DataVault is a 3NF database. The DataMart is de-normalized and currently contains employee and
project data. At this time, there is no star schema; however, there are plans to build out a star schema in the
future. The DataVault stores source data by the corresponding source system name, using schemas of the same name such
as UltiPro and Vision. In order to complete the analysis below, a server and database discovery was completed as well.
Analysis
The areas for analysis for these 2 databases include the following topics:
• Security
• Scalability
• Partitioning
• Exception Handling
• Alerting
• Transaction Processing
• Indexing
• SQL Views (Business Views)
• Star Schema
• Surrogate Keys
• Conformed Dimensions
• Delta Loads (Merge SCD 1 and SCD 2, Checksums)
Security should always be the first concern in planning and deploying any data warehouse / data mart environment. In
reviewing which roles were defined, we found the following server roles: bulkadmin, dbcreator, diskadmin,
processadmin, public, securityadmin, serveradmin, setupadmin, and sysadmin. There were no user-defined SQL Server
roles. The sa account was enabled but not being used. There was no implementation of Row Level Security or Role-Based
Security. SQL schemas such as ultipro, vision, ad, and admin were used to scale and organize the various SQL Server
objects.
Scalability is a very high priority for companies that want to deliver solutions that last five or more years after
the initial deployment. Many of the server and database DMVs listed above help us determine scalability. For instance,
using instances, partitioning, files and file groups, and synonyms can help make a system more scalable. Instances
allow better resource management between different processes on the same server. They also allow us to separate load
layers such as stage, consolidation, transformation, 3NF, star, and analytics. Since we only had access to development
servers, we
did not see any examples of instances, but we highly recommend them in production. We also looked at partitioning,
which is discussed below. As for files and file groups, we have provided an Excel spreadsheet identifying the files
and file groups and their current sizes. We also provided size information for all of the tables in the 3 databases we
were asked to analyze. File and table sizes are important indicators of scalability and of where to set the auto
growth for your tables. The data and log files were on the same volume and had the following sizes:
FileName FileSizeMB SpaceUsedMB AvailableSpaceMB %FreeSpace
DataVault 3004 1636.69 1367.31 45.52
DataVault_log 36828.31 1256.45 35571.87 96.59
When looking at development, we were okay with these settings; however, the auto growth was not what we would
recommend in production. Finally, synonyms are an easy way to manage server-to-server (physical, or instance)
connections without taking the risk of using linked servers. We did notice 3 linked servers (FINANCIALDATA, SGJJR-
SQL2ASCCM2012, and VISIONDEVDB). These linked servers were not part of the scope provided by SmithGroup JJR;
however, we would warn against relying too heavily on linked servers.
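As an illustration of the synonym approach, the sketch below creates a synonym over the DataVault table mentioned later in this section; the synonym name itself is hypothetical. Downstream code references the stable local name, so only the synonym definition changes if the underlying object moves to another database.

-- Hypothetical synonym pointing at an object in another database on the same instance.
CREATE SYNONYM dbo.ProjectFinancialsByPeriod
    FOR DataVault.vision.ProjectFinancialsByPeriod;

-- Downstream queries use the stable local name.
SELECT COUNT(*) AS RowCountCheck
FROM dbo.ProjectFinancialsByPeriod;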
Partitioning is a great way to manage reporting performance in a data warehouse / data mart environment. Currently,
SmithGroup JJR is not using any partitioning strategy. Since we only had access to development data, it is hard to
tell whether the sizes in our analysis represent real production sizes; however, with the database sizes we
encountered in the scope of this analysis, we do not recommend partitioning at this time. For the DataVault, there
were a total of 2,548,891 rows; the table with the most rows was [vision].[ProjectFinancialsByPeriod] with 730,655
rows. As shown in the section above, the data size for DataVault is 3004 MB.
Exception Handling in SQL Server (TSQL) is accomplished by using TRY...CATCH clauses. We did our due diligence in
verifying that there is no exception handling at the SQL Server level, and we confirmed this with the different teams
at SmithGroup JJR. Exception handling is recommended for future phases of development.
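A minimal sketch of the TRY...CATCH pattern described here; the procedure, schema, and table names are hypothetical and only illustrate where the construct would sit in an existing load procedure.

-- Hypothetical load procedure showing the basic TRY...CATCH pattern.
CREATE PROCEDURE etl.LoadEmployeeStage
AS
BEGIN
    BEGIN TRY
        INSERT INTO stage.Employee (EmployeeBK, EmployeeName)
        SELECT EmployeeBK, EmployeeName
        FROM   ultipro.Employee;
    END TRY
    BEGIN CATCH
        -- Capture the error details so they can be logged or re-raised to the caller.
        DECLARE @msg nvarchar(2048) = ERROR_MESSAGE();
        THROW 50001, @msg, 1;
    END CATCH
END;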
Alerting in SQL Server is a combination of TRY...CATCH clauses, database mail, and SQL Server Agent. SQL Server Agent
allows us to define operators and alerts. Alerts can then be defined on performance conditions, or on SQL Server
events based on an error number raised from a TRY...CATCH clause. Alerting is recommended for future phases of development.
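A hedged sketch of how the pieces can fit together: the CATCH block sends a Database Mail message (the profile name and recipient below are placeholders and require Database Mail to be configured), and the re-raised error lets SQL Server Agent job history and alerts record the failure as well.

BEGIN TRY
    EXEC etl.LoadEmployeeStage;   -- hypothetical load step
END TRY
BEGIN CATCH
    DECLARE @body nvarchar(max) = ERROR_MESSAGE();

    -- Placeholder Database Mail profile and recipient.
    EXEC msdb.dbo.sp_send_dbmail
         @profile_name = 'DW Alerts',
         @recipients   = 'bi-team@example.com',
         @subject      = 'Data warehouse load failure',
         @body         = @body;

    THROW;   -- re-raise so SQL Agent job history and alerts also see the failure
END CATCH;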
Transaction Processing is the process of ensuring that data is written to disk before we commit a transaction and move
on to the next step in the process. Transaction processing also provides a mechanism to roll back any data that has
been written if the transaction fails before a commit can take place. Transaction processing is a critical part of any
design. Currently, the environment here does not use transaction processing. Transaction processing is recommended for
future phases of development.
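A minimal sketch of wrapping a load step in an explicit transaction with rollback on failure; the procedure name is hypothetical.

BEGIN TRY
    BEGIN TRANSACTION;

    -- Hypothetical load step: merge staged rows into the warehouse table.
    EXEC etl.MergeEmployee;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Roll back anything written before the failure, then re-raise the error.
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;
END CATCH;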
Indexing has a huge impact on server and query performance. DMV queries to identify unused indexes and indexes that
need to be reorganized or rebuilt should be run on a regular basis. Index discovery and creating an enterprise
index strategy are recommended.
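As an example of the kind of DMV query meant here (thresholds are arbitrary and should be tuned, and the query should be run against production because usage statistics reset at each restart), the following lists nonclustered indexes that are written to but never read:

-- Candidate unused indexes: writes with no reads since the last SQL Server restart.
SELECT  OBJECT_NAME(s.object_id) AS TableName,
        i.name                   AS IndexName,
        s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
  AND i.type_desc = 'NONCLUSTERED'
  AND (s.user_seeks + s.user_scans + s.user_lookups) = 0
ORDER BY s.user_updates DESC;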
SQL Views (Business Views) can be used to denormalize and simplify data structures in the 3NF for reporting purposes.
Currently, both the DataVault and DataMart use SQL Views, however, the ETL database does not. There are 32 SQL
Views grouped into 5 different schemas (admin, api, dbo, lookup, and vision) in the DataVault database. The DataMart
database has 10 SQL Views all in the dbo schema. A star schema is suggested to reduce complexity in creating and
managing SQL Views for the business.
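As a simple illustration of the kind of business view involved (all table and column names below are hypothetical), a view that flattens 3NF employee and project data for reporting:

-- Hypothetical business view flattening 3NF tables for reporting.
CREATE VIEW dbo.vw_EmployeeProjectHours
AS
SELECT  e.EmployeeName,
        p.ProjectName,
        f.PeriodEndDate,
        f.HoursWorked
FROM dbo.ProjectFinancials AS f
JOIN dbo.Employee          AS e ON e.EmployeeId = f.EmployeeId
JOIN dbo.Project           AS p ON p.ProjectId  = f.ProjectId;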
Star Schema is not used and is not being developed. It is highly recommended for future phases.
Surrogate Keys are used to provide referential integrity in a data warehouse / data mart that sources data from
numerous data sources, each of which has different keys defined for the same entity / attribute, such as Person /
Social Security Number. Surrogate keys are employed in the SmithGroup JJR DataVault and DataMart; however, GUIDs have
been used. This design poses no issues for the data warehouse, but the reporting star schema should use integers
for load and processing performance.
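A minimal sketch of an integer surrogate key on a star schema dimension (the table is hypothetical); the GUID business key from the warehouse is retained for lookups while fact tables join on the narrow integer key.

-- Hypothetical dimension table: integer surrogate key, GUID business key retained from the DataVault.
CREATE TABLE dbo.DimEmployee
(
    EmployeeSK     int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key used by fact tables
    EmployeeBK     uniqueidentifier  NOT NULL,              -- business key from the warehouse
    EmployeeName   nvarchar(200)     NOT NULL,
    EffectiveDate  date              NOT NULL,
    CurrentFlag    bit               NOT NULL DEFAULT (1)
);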
Conformed Dimensions at the database level entail ensuring that the data warehouse star schema has only 1 dimension
for a specific entity such as employee or region. Any data mart use of the entity employee or region needs to be
sourced from the data warehouse and not reloaded with different logic and processes. Since there is no star schema
for DataVault or DataMart, there are no dimensions to conform. We suggest a robust star schema for both the data
warehouse and the data marts.
Delta Loads are both a performance issue and a management issue. Loading only the data that has changed since the last
load can be implemented and managed in many ways. We can use TSQL merge, checksums, and last-load-date tables to
determine whether a row has changed since the last time the table was loaded. SmithGroup JJR uses TSQL merge, but not
checksums or a last-load-date table. At today's data sizes, checksums and storing a last load date are not strictly
necessary, but they are recommended for performance and scalability. These same processes can also be used to
implement slowly changing dimensions once a star schema is developed. There was an initial plan at SmithGroup JJR to
use logging, error logging, and number tables to manage load metadata, but it was not implemented.
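A hedged sketch of the checksum idea layered onto the existing MERGE pattern (tables and columns are hypothetical): the WHEN MATCHED branch updates only when the computed checksum differs, which avoids unnecessary writes and records a last load date.

-- Hypothetical delta load: MERGE with a checksum to detect changed rows.
MERGE dbo.Employee AS tgt
USING stage.Employee AS src
    ON tgt.EmployeeBK = src.EmployeeBK
WHEN MATCHED AND CHECKSUM(src.EmployeeName, src.Department, src.Title)
              <> CHECKSUM(tgt.EmployeeName, tgt.Department, tgt.Title) THEN
    UPDATE SET tgt.EmployeeName = src.EmployeeName,
               tgt.Department   = src.Department,
               tgt.Title        = src.Title,
               tgt.LastLoadDate = SYSDATETIME()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (EmployeeBK, EmployeeName, Department, Title, LastLoadDate)
    VALUES (src.EmployeeBK, src.EmployeeName, src.Department, src.Title, SYSDATETIME());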
Extract, Transform, and Load
In order to analyze SmithGroup JJR’s SSIS environment, we were provided 4 SSIS Solutions for the following areas:
• Active Directory | 12 Packages
• Deltek Vision | 63 Packages
• Ultipro | 19 Packages
• NewForma | 21 Packages with 11 Disabled
Overview
The Active Directory load process uses the KingswaySoft Directory Services Integration toolkit to provide access to Active
Directory data. Using this tool, this solution extracts Active Directory data for the following areas: computer, group,
group member, and user. The master ActiveDirectory.dtsx package calls 3 sub-packages named: Extract, Transform, and
Load. As the names of these packages indicate, the Extracts package takes data from the source system and temporarily
stores this data in the ETL database. Unlike the other more complex ETL solutions, this solution does not have any tasks,
or data flow transformations in the Transformation package. This Transform.dtsx could possibly be disabled. The Loads
package calls stored procedures located in the ETL database then loads the data warehouse.
The Active Directory design pattern uses the KingswaySoft Directory Services Integration toolkit to extract the data from
the source system and place the extracted data into stage tables located on the Data Warehouse in the ETL database on
that server. Once the data is staged in the ETL database on the Data Warehouse, ETL stored procedures in the ETL
database load the staged data into the DataVault tables using the merge statement.
The Deltek Vision solution also uses a similar process by calling separate sub-packages for the Extract, Transform, and
Load phases of the data load. However, this solution also has a PreProcessing and a PostProcessing package. The
PreProcessing package truncates the TPH tables. The PostProcessing package is empty and could possibly be disabled.
The Extracts package takes data from the source system and temporarily stores this data in the ETL database for tables
like client, vendor, and employee. The Transform package transforms data for vendor / client, contact / employee,
project / opportunity, and project dependents. The Loads package calls stored procedures located in the ETL database and then
loads the data warehouse for these same areas such as client, vendor, and employee.
The Vision design pattern includes an ETL database that is stored on the transactional server. This ETL database stores
functions that are used to extract the data from the source system and place the extracted data into stage tables located
on the Data Warehouse in the ETL database on that server. Once the data is staged in the ETL database on the Data
Warehouse, ETL stored procedures in the ETL database load the staged data into the DataVault tables using the merge
statement.
The Ultipro solution uses a different process by organizing the Extract, Transform, and Load phases of the data load into
separate containers. The Extracts package stores data in the ETL database. The Transform package transforms data for
organization, employment, and employmentHistory. The Load package loads tables using the TSQL Merge statement
from the ETL database to the data warehouse. Please review Shobhana’s WBS Migration Changes to Datawarehouse
Systems.pdf for an ETL Dataflow Diagram and other useful package information.
Since Ultipro is a backup and restore process, the Ultipro design pattern does not include ETL database functions that
are stored on the transactional server. Instead, these extract functions are stored in the ETL database on the Data
Warehouse server. This ETL database stores functions that are used to extract the data from the source system and
place the extracted data into stage tables located on the Data Warehouse in the ETL database on that server. Once the
data is staged in the ETL database on the Data Warehouse, ETL stored procedures in the ETL database load the staged
data into the DataVault tables using the merge statement.
The NewForma (oblivion) load process is a non-standard load process that needs to be updated to the new design
pattern described with the Vision load process above. There is a monthly load that calls a weekly load, which calls an
hourly load that is not currently being used. The weekly load package also has an archival process. Besides the
monthly load, there is a daily load that calls the hourly load. Both the weekly and daily load packages call the same
hourly package.
The current state of the NewForma load process executes two packages in parallel. The first package is Execute
etlOrgChart and the second package is Execute etlNewformaProjects. Execute etlNewformaProjects has two child
packages named Execute etlProjectRFIs and Execute etlProjectMilestones. The packages etlOrgChart and
etlProjectMilestones both use the KingswaySoft SharePoint Integration toolkit to extract and load data to and from
SharePoint lists. This process needs to be updated and implemented to production using the standard process.
Analysis
The areas for analysis for these SSIS solutions include the following topics:
• Load Meta Data
• Package Sequencing (Master and Child Packages, SQL Jobs, Conformed Dimensions, Dimensions, Facts, Data Marts)
• Environments and Environment Variables
• Connection Managers (Package and Project)
• Parameters (Package and Project)
• Alerting
• Logging
• Exception Handling
• Validation
• Checkpoints
• Transaction Processing (requires MSDTC)
• Naming Conventions
Load Meta Data is important because it can track load start and end times by package, table, and even cube
processing; it can track load row counts for inserts and updates; it can provide restartability that is more robust
than SSIS checkpoints; and it can provide rollback information during a failure. Currently, there is no tracking of
load metadata, and implementing it is highly recommended.
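A minimal sketch of the load metadata idea (the table and procedure are hypothetical, and the matching "start" procedure that inserts a row and returns its LoadLogId is omitted for brevity): SSIS packages or stored procedures call the logging procedures at the start and end of each package or table load.

-- Hypothetical load metadata table and an end-of-load logging procedure.
CREATE TABLE dbo.LoadLog
(
    LoadLogId    int IDENTITY(1,1) PRIMARY KEY,
    PackageName  nvarchar(200) NOT NULL,
    TableName    nvarchar(200) NULL,
    StartTime    datetime2     NOT NULL DEFAULT (SYSDATETIME()),
    EndTime      datetime2     NULL,
    RowsInserted int           NULL,
    RowsUpdated  int           NULL,
    LoadStatus   nvarchar(20)  NOT NULL DEFAULT ('Running')
);
GO
CREATE PROCEDURE dbo.LoadLog_End
    @LoadLogId int, @RowsInserted int, @RowsUpdated int, @LoadStatus nvarchar(20)
AS
UPDATE dbo.LoadLog
SET    EndTime      = SYSDATETIME(),
       RowsInserted = @RowsInserted,
       RowsUpdated  = @RowsUpdated,
       LoadStatus   = @LoadStatus
WHERE  LoadLogId = @LoadLogId;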
Package Sequencing controls the order of how the packages load the tables. In terms of packages, we have a master
package and then child packages. The master package may call child packages such as a conformed dimension package,
a dimension package, and a fact package. Child packages can also call data mart packages that duplicate data warehouse
dimensions and facts to be used as data marts. Finally, SQL Jobs can be used to schedule different load patterns and
times such as daily, every hour, and even weekly, or monthly. Since package sequencing is already working and not
causing issues at this time, this is not a high priority for redesign.
Environments and Environment Variables are used to provide a mechanism for changing project data connections and
variables during a change control migration from one environment to another, such as development to test, or test to
production. Currently, environments and environment variables are being used with success.
Connection Managers can be either project, or package level. In most cases, connections that need to change from
environment to environment, or will be used in many packages will be project connections. Any connections that are
just required by 1 package and will not change between environments can be package connections.
Parameters can be either project, or package level. In most cases, parameters that need to change from environment to
environment, or will be used in many packages will be project parameters. Any parameters that are just required by 1
package and will not change between environments can be package parameters.
Alerting in SSIS is provided by using a SMTP connection. This connection can then be used in a task flow, data flow, or
even as an event handler such as OnError. Alerting is not enabled in the solutions evaluated. Alerting is highly
recommended.
Logging can be heavily customized by using a custom logging schema and then tying that custom logging to the logging
built into SQL Server 2012 and newer. Since a robust version of logging, including verbose logging for
troubleshooting, is provided in newer versions of SQL Server, it is not recommended to make any changes to logging. An
example custom logging diagram that can bridge to the data logged by newer versions of SQL Server has been provided
for your reference.
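For reference, the built-in catalog logging referred to here can be queried directly in the SSISDB database; a hedged example pulling error messages from executions in the last week:

-- Query the SSIS catalog (SSISDB) for error messages from recent package executions.
SELECT  e.execution_id,
        e.package_name,
        e.start_time,
        m.message_time,
        m.message
FROM SSISDB.catalog.executions     AS e
JOIN SSISDB.catalog.event_messages AS m
  ON m.operation_id = e.execution_id
WHERE m.message_type = 120            -- 120 = Error
  AND e.start_time > DATEADD(DAY, -7, SYSDATETIME())
ORDER BY m.message_time DESC;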
Exception Handling in SSIS can be addressed with multiple methods. One method is exception data flows. These data
flows can load exception data into flat files such as text files, or into a table that stores exception data in an XML
format. Another example of exception handling is rolling back data when an error occurs; this can be accomplished at
the end of an error flow, or in an event handler such as OnError. Finally, exception handling can be combined with
alerting to let the appropriate technical and business users know of an issue or delay. Since there is no exception
handling in the currently evaluated packages, adding it is highly recommended.
Validation has many solutions. The most common include row- and column-based validation combined with what are
called sanity checks. In a fact table, column validation can sum a column and compare that to an aggregate value in
the consolidation layer of the load process. For dimension validation, we can verify that the surrogate key for a
specific user ties back to multiple source systems through the stored business keys. Sanity checks tend to focus on a
known business rule and verify that a calculated business rule matches across multiple systems, such as a source
system, data warehouse, data mart, the cloud, and reporting tools. This validation can use load metadata as well as
SQL Tasks to gather and validate complex scenarios (data sources) when necessary. Since little to no validation is
currently employed, adding validation is highly recommended.
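A hedged sketch of a column-level check of this kind (tables, columns, and the tolerance are hypothetical): sum a measure in the warehouse table, compare it with the same aggregate computed from the staged source, and raise an error if they differ.

-- Hypothetical validation step: compare a summed measure between stage and warehouse.
DECLARE @stageTotal decimal(19,4) = (SELECT SUM(AmountBilled) FROM stage.ProjectFinancials);
DECLARE @dwTotal    decimal(19,4) = (SELECT SUM(AmountBilled) FROM dbo.ProjectFinancials);

IF ABS(ISNULL(@stageTotal, 0) - ISNULL(@dwTotal, 0)) > 0.01
BEGIN
    DECLARE @msg nvarchar(400) =
        CONCAT('Validation failed: stage total ', @stageTotal, ' <> warehouse total ', @dwTotal);
    THROW 50002, @msg, 1;
END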
Checkpoints are SSIS’s built-in method for providing package restartability. They are configured by providing a
checkpoint location in the package-level properties; the property name is CheckPointFileName. Two other properties,
CheckPointUsage and SaveCheckPoints, also need to be configured. These properties are defined in the solutions we
evaluated. It is recommended that some restartability be designed and implemented.
Transaction Processing is a feature in SSIS, however it requires that Microsoft Distributed Transaction Coordinator
(MSDTC) be enabled. This coordinator comes with overhead and is not always well received by the DBA team. Since the
SmithGroup JJR already uses SQL Tasks to call stored procedures that utilize the TSQL merge statement, we recommend
not using transaction processing in SSIS, but rather at the SQL Server level.
Naming Conventions in SSIS may seem elementary, but good naming conventions in SSIS can help with readability and
maintenance, especially when introducing new developers to the ETL environment. A sample SSIS naming convention
document has been provided.
Appendix A | Microsoft Data Warehouse On-Premises Architecture
Below is a diagram that illustrates an ideal business intelligence (BI) architecture. This example is intended to show the
many pieces available in the Microsoft BI stack. This diagram includes source systems (structured data) that flow into a
SQL Server data repository. Data repositories offload reporting workloads from production transactional servers onto non-production servers.
The ELT process will stage the data required for that load only. There are three (3) paths for the data once it is staged.
The first path is a quickly cleaned path intended for daily business reporting needs called an Operational Data Store.
Moving down the diagram, the second path is the fact pipeline. The fact pipeline will begin to denormalize and prepare
the fact data for transformation. The third path is the dimension pipeline. The dimension pipeline goes through two
other tools offered in the Microsoft BI Stack.
The first tool is Data Quality Services. This tool is used to cleanse the data. The second tool is Master Data Management.
Master Data Management provides what the industry calls “Golden Record Management.” Golden record management
gives you access to the most pure, validated, and complete picture of your individual records in your domain. Products
like Profisee (https://profisee.com/grm) offer functionality beyond the built in tools offered out of the box with SQL
Server. This extra “Golden Record” functionality includes matching, de-duplication, mastering, and record harmonization.
Profisee also offers graphical user interfaces, scorecards, and reports.
Figure 7 | Microsoft Data Warehouse On-Premises Architecture — diagram of enterprise systems (sales, customer service / CRM, marketing, accounting, human resources, supply chain, and flat file data) flowing through a data repository and stage into an Operational Data Store for real-time operational reporting, a fact pipeline, and a dimension pipeline (Data Quality Services and Master Data Management supported by data stewards), then into the 3NF eDW (sales, CRM, and marketing schemas), data marts and star schemas, a cube farm (sales, CRM, and marketing cubes), and front-end tools such as SharePoint, Excel, SSRS, SSAS, Power Query, and Power BI, with configuration, logging, and audit layers. The diagram also lists supporting task groups: Business Tasks (identify the business question, define staffing roles, data discovery, establish data stewards, agree on business rules, determine master data lists), BI Team Tasks (plan a prototype around the question, product licensing, examine existing infrastructure, determine eDW infrastructure, plan security / Kerberos, develop the eDW architecture), and Data Steward Tasks (define business rules, manage master data, act as subject matter expert and liaison between the business and the BI team).
Appendix B | Design Questions to Review
How is data from multiple sources consolidated? For example, when we model DataVault we currently see 3 person
tables: vision.Person, ultipro.Person, and dbo.Person. According to the DBAs, dbo.Person is a consolidated version of
Person. This raises another question: what logic is used to consolidate the 2 source versions of Person? Is there a
reference or lookup table?
What were the reasons and domain knowledge behind using GUIDs rather than INTs for surrogate keys?
The nomenclature for the DataVault, DataMart, and DataLake databases is confusing. Consider rethinking these names so
they do not conflict with the more general, industry-accepted meanings.
 
SSRS RLS Prototype | Vision and Scope Document
SSRS RLS Prototype | Vision and Scope Document  SSRS RLS Prototype | Vision and Scope Document
SSRS RLS Prototype | Vision and Scope Document Ryan Casey
 

Más de Ryan Casey (7)

Invoicing Bus Matrix
Invoicing Bus MatrixInvoicing Bus Matrix
Invoicing Bus Matrix
 
First Steps Snapshot vs Transaction Grain Statements
First Steps Snapshot vs Transaction Grain StatementsFirst Steps Snapshot vs Transaction Grain Statements
First Steps Snapshot vs Transaction Grain Statements
 
First Steps to Define Grain
First Steps to Define GrainFirst Steps to Define Grain
First Steps to Define Grain
 
RLS Prototype ETL | Vision and Scope Document
RLS Prototype ETL | Vision and Scope DocumentRLS Prototype ETL | Vision and Scope Document
RLS Prototype ETL | Vision and Scope Document
 
Dynamic CSV String Business Rules and Pseudo Logic
Dynamic CSV String Business Rules and Pseudo LogicDynamic CSV String Business Rules and Pseudo Logic
Dynamic CSV String Business Rules and Pseudo Logic
 
Defining the Grain | Source system: Dynamics 365
Defining the Grain | Source system: Dynamics 365Defining the Grain | Source system: Dynamics 365
Defining the Grain | Source system: Dynamics 365
 
SSRS RLS Prototype | Vision and Scope Document
SSRS RLS Prototype | Vision and Scope Document  SSRS RLS Prototype | Vision and Scope Document
SSRS RLS Prototype | Vision and Scope Document
 

Último

➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...amitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 

Último (20)

➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Nandini Layout ☎ 7737669865 🥵 Book Your One night Stand
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 

BI Environment Technical Analysis

o Continued Learning and Training on ERWin, possibly via Pluralsight or Webinars (http://erwin.com/videos/)
• Architecture | Option 1, Shown Below in the Following Section, Data Warehouse Architecture
o Current use does not require an integrated cloud environment
▪ Users do not experience performance issues using a gateway with on-premises data
▪ Especially considering that the on-premises environment should receive the maximum effort
o Cloud analysis and analytics using Power BI, a gateway, and on-premises data
o Tableau users have access to on-premises data for analytics
o This approach allows a scalable, future roadmap to integrate the on-premises and cloud environments
• Hadoop | Reserve for Future Roadmap
o Low Volume | Carl estimated 1 TB of data
▪ As an unwritten rule shared by experts, Hadoop needs at least 5 TB to justify the investment and achieve performance
o High to moderate investment for on-premises or cloud-based Hadoop
o Although Volume, Velocity, Variety, and Veracity are all considerations, volume is required for federation
• Naming of Business Resources with Industry-Standard Naming
o Data Lake is used with Hadoop; rename the current database to something else
o Data Vault is actually a Data Warehouse (not as big a concern as the naming conflict above)
• 2 Summary Tables from the Analysis Sections toward the end of this document
o 1 | Data Warehouse and Data Marts
o 2 | Extract, Transform, and Load

1 | Data Warehouse and Data Marts
• Security | The SQL Server security appears to be adequate and meets industry standards.
• Partitioning | After review of dev data only, we do not see a data need for partitioning. No performance issues reported.
• Alerting | Combining TRY...CATCH, Database Mail, and SQL Agent is highly recommended for SQL Server alerting of issues (see the sketch following these tables).
• Indexing | Index discovery and creating an enterprise index strategy is recommended for production servers.
• Star Schema | There is currently no star schema and one is highly recommended.
• Conformed Dimensions | Since there is no star schema, there are no conformed dimensions.
• Scalability | Scalability appears to be a concern. A future-use plan for instances, files, and file groups is suggested.
• Exception Handling | TRY...CATCH is not being used in functions and stored procedures. It is highly recommended that it be added.
• Transaction Processing | The environment does not use transaction processing. Transaction processing is recommended for future phases.
• SQL Views (Business Views) | A star schema is suggested to reduce complexity in creating and managing SQL Views for the business.
• Surrogate Keys | Surrogate keys with a data type of integer are suggested for the star schema.
• Delta Loads | TSQL Merge is adequate; however, the additional use of checksums should be considered. Load meta data is needed.

2 | Extract, Transform, and Load
• Load Meta Data | Currently, there is no tracking of load meta data and it is highly recommended.
• Environments and Environment Variables | Currently, environments and environment variables are being used with success.
• Parameters | Currently, parameters are being used with success.
• Logging | Logging with the SSIS framework is working; however, load meta data logging is suggested.
• Validation | Since little to no validation is currently employed, it is highly recommended.
• Transaction Processing | It is recommended not to use transaction processing in SSIS, but rather at the SQL Server level in functions and stored procedures.
• Package Sequencing | There are no reported errors or issues with the current package sequencing.
• Connection Managers | Currently, connection managers are being used with success.
• Alerting | Alerting is not enabled in the solutions evaluated. Alerting is highly recommended.
• Exception Handling | There is no exception handling in the currently evaluated packages; it is highly recommended.
• Checkpoints | It is recommended that some restartability be designed and implemented.
• Naming Conventions | Naming conventions are recommended.
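To illustrate the Alerting and Exception Handling recommendations above, the following is a minimal sketch that combines TRY...CATCH, a transaction, and Database Mail. It assumes a mail profile named 'ETL Alerts' and hypothetical etl, stage, and dw schemas with an Employee table; none of these names come from the SmithGroup JJR databases, and the sketch should be adapted before use.

    -- Minimal sketch: wrap a load step in TRY...CATCH, roll back on failure,
    -- and alert through Database Mail; a SQL Agent job step picks up the re-raised error.
    CREATE PROCEDURE etl.LoadEmployee
    AS
    BEGIN
        SET NOCOUNT ON;
        BEGIN TRY
            BEGIN TRANSACTION;

            -- Hypothetical stage-to-warehouse load step
            INSERT INTO dw.Employee (EmployeeId, FirstName, LastName)
            SELECT EmployeeId, FirstName, LastName
            FROM stage.Employee;

            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            IF @@TRANCOUNT > 0
                ROLLBACK TRANSACTION;

            DECLARE @msg nvarchar(2048) =
                CONCAT('LoadEmployee failed. Error ', ERROR_NUMBER(), ': ', ERROR_MESSAGE());

            EXEC msdb.dbo.sp_send_dbmail
                 @profile_name = 'ETL Alerts',           -- assumed mail profile
                 @recipients   = 'bi-team@example.com',  -- placeholder address
                 @subject      = 'ETL load failure',
                 @body         = @msg;

            THROW;  -- re-raise so the SQL Agent job step reports failure
        END CATCH
    END;

As the recommendation above notes, this pattern surfaces only the row that caused the failure; capturing a full batch of failed rows is where SSIS error flows add value.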
Data Warehouse Architecture

Below, we provide diagrams representing the current architecture and two options for the next stage of the BI / Data Warehouse architecture. An "all-features" Microsoft On-Premises Architecture diagram can be found in Appendix A. These diagrams are intended to help the decision makers compare their current architecture with possible phases. To help clarify these phases, this section also includes information about the Current Error Control Design and about Integrating On-Premises and the Cloud.

The Current Architecture stages both enterprise and project / application specific data sources from both internal and external locations. Data sources intended for the data warehouse are staged first and then loaded. The data stored in the data warehouse becomes the source for the data mart. Data source examples include UltiPro, Vision, and Active Directory. Other data sources such as NSF, IPEDs, and Revit are stored on other dedicated SQL Server storage. In the cloud, data sources from Indoor Positioning and Marquette indicate the slow adoption of integrating on-premises data with cloud data. With this identified, the options discussed include an all on-premises option and an integrated on-premises and cloud option.

Figure 1 | Current Architecture (diagram: on-premises data warehouse, data mart, and Data Vault with UltiPro, Vision, Active Directory, and NewForma sources; separate SQL Server storage for Revit, NSF, and IPEDs; Azure Event Hub, Streaming Analytics, Table Storage, and Azure SQL Database for the Indoor Positioning and Marquette data; Tableau and Power BI end users)
Phase 1, an All On-Premises Data Warehouse design, dictates that all of the data structures and data be stored on internal company resources (no cloud). In this option, the star schema and cubes exist and remain on internal company resources; however, these resources and their content connect to cloud apps such as Power BI. This connection is facilitated by a gateway. In this option, the gateway is required to make multiple trips when sourcing data. However, this design gives end users of Power BI or Tableau a flexible and feasible option.

Figure 2 | Option 1 (diagram: all on-premises data warehouse, data mart, star schema, and cube analytics connected to Power BI through a gateway over HTTP with no VPN; Tableau users remain on-premises; the Azure Event Hub / Streaming Analytics / Azure SQL Database path for Indoor Positioning and Marquette is unchanged)
Phase 2, an Integrated On-Premises and Cloud Data Warehouse, seeks to design a hybrid data warehouse providing the best of both the on-premises and cloud data warehouses. In Option 2, SQL Server Integration Services loads data from the data mart star schema into Azure SQL Database, or directly into Azure SQL Server Analysis Services. This option differs from Option 1 in that the cloud is used to store on-premises data so it can be consumed by applications such as Power BI. When using Power BI as a data scientist or a business analyst, having the on-premises data in the cloud provides fast analysis alongside external data already in the cloud. In the first option, the gateway is required to make multiple trips when sourcing data. In this option, the data exists in the cloud, so gateway use is minimized.

Figure 3 | Option 2 (diagram: on-premises data warehouse and data mart star schema loaded into Azure SQL Database and Azure SSAS over a VPN; Power BI consumes the cloud copies; Tableau and the existing Azure Indoor Positioning / Marquette path are unchanged)
In the next two diagrams, the flow of data is separated into five phases: Enterprise Source Systems, Staging, Data Warehouse, Data Mart, and Star Schema. For this analysis, we are focusing on the first two, Enterprise Source Systems and Staging. The first diagram displays the current use of functions to extract data from the source systems. The second diagram displays a possible use of SSIS Error Flows.

Functions: the diagram below represents the current flow of data, where SQL functions extract data from the source systems. These functions are intended to exist on the actual source system in a database named ETL, but in the case of UltiPro (a backup / restore process) the functions exist in the ETL database used by the data warehouse. These functions are used to load the staging tables used in the downstream merge. The advantage of this design is that changes to the architecture can be implemented without affecting downstream objects such as SSIS. The concern with this design is that during a load failure, the specific rows that failed are not easily identifiable, so a detailed alert containing the failed rows cannot be generated.

Figure 4 | Current Architecture Using Functions (diagram: SQL functions in the ETL database load stage tables from UltiPro, Active Directory, and Vision; a merge statement loads inserts and updates into the denormalized data warehouse; stored procedures on the data mart execute functions on the data warehouse; the star schema is to be designed and developed in future phases. Warning: using functions to pull the source data prevents using SSIS Data Flow Tasks, so there is no error flow that stores failed rows for evaluation and fixing.)
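For reference, a minimal sketch of the source-to-stage pattern described above, assuming a hypothetical inline table-valued function in the ETL database and a stored procedure that truncates and reloads a stage table. The object, schema, and column names are illustrative only and do not reflect the actual SmithGroup JJR objects.

    -- Hypothetical extract function stored in the ETL database
    CREATE FUNCTION etl.fnSourceEmployee ()
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT EmployeeId, FirstName, LastName, ModifiedDate
        FROM   Vision.dbo.Employee   -- placeholder three-part name for the source table
    );
    GO

    -- Stage load procedure executed ahead of the downstream MERGE
    CREATE PROCEDURE etl.LoadStageEmployee
    AS
    BEGIN
        SET NOCOUNT ON;
        TRUNCATE TABLE etl.StageEmployee;

        INSERT INTO etl.StageEmployee (EmployeeId, FirstName, LastName, ModifiedDate)
        SELECT EmployeeId, FirstName, LastName, ModifiedDate
        FROM   etl.fnSourceEmployee();
    END;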
Error Flow: the diagram below represents the proposed flow of data using SSIS error flows. The use of SSIS is intended to replace the existing functions that load the staging tables. As shown in the diagram, SSIS has the ability to create an error flow to capture rows that fail the load process. This ability allows the details and cause of the failure to be emailed to alert the appropriate stakeholders. Once SSIS loads the staging tables and stores any row failures, the rest of the data flow remains the same as in the current diagram.

Figure 5 | Current Architecture Using SSIS Error Flows (diagram: an SSIS Data Flow Task extracts data from UltiPro, Active Directory, and Vision into the ETL database stage tables; the merge statement, data warehouse, data mart, and future star schema are unchanged. Notice: using the Data Flow Task allows the use of error flows, so failed rows can be stored for evaluation and fixing.)
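If the error-flow approach is adopted, the redirected rows need somewhere to land. Below is a hedged sketch of a hypothetical error table; SSIS error outputs supply the ErrorCode and ErrorColumn values, while the remaining columns mirror whatever source columns are redirected. Table and column names are placeholders.

    -- Hypothetical landing table for rows redirected by an SSIS error output
    CREATE TABLE etl.StageEmployee_ErrorRows
    (
        ErrorRowId   int IDENTITY(1,1) PRIMARY KEY,
        LoadDateTime datetime2(0)  NOT NULL DEFAULT SYSDATETIME(),
        PackageName  nvarchar(260) NULL,  -- can be mapped from the SSIS PackageName system variable
        ErrorCode    int           NULL,  -- supplied by the SSIS error output
        ErrorColumn  int           NULL,  -- lineage id of the failing column
        EmployeeId   nvarchar(50)  NULL,  -- source columns kept as strings so bad values still land
        FirstName    nvarchar(255) NULL,
        LastName     nvarchar(255) NULL
    );

The contents of this table can then feed the detailed failure alert described above.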
The options to integrate on-premises and cloud are diagrammed below. The full overview shows Site-to-Site and Point-to-Site VPNs as well as an HTTP connection. All three of these options provide different levels of security and IPsec standards. An additional option for the Site-to-Site VPN is ExpressRoute (https://azure.microsoft.com/en-us/services/expressroute/). ExpressRoute is a Microsoft Azure service that provides advanced scalability, increased reliability and speed, lower latency, and WAN integration. It is a fee-based, pay-for-use service.

Figure 6 | Integrate On-Premises and Cloud (diagram: a full overview combining a Site-to-Site secure VPN or ExpressRoute between the on-premises SQL Server and Azure, Point-to-Site VPNs for workstations, and an HTTP gateway path; the Site-to-Site / ExpressRoute option is noted as secure, controlled, and offering better connectivity quality)
SQL Server Best Practices

SQL Server best practices were discussed and explained during a meeting with the SmithGroup JJR DBA and Infrastructure teams. Demonstrations during the meeting were performed on development servers so that production SLAs would not be affected. All decisions regarding whether and when to implement these best practices were left up to the SmithGroup JJR.

• NTFS Allocation Unit (AU) | Block size = 64 KB, alignment = 1024 KB. The default is 4 KB; use /L with Format on Windows 2012 and above.
• Max Degree of Parallelism (MAXDOP) | Set to the number of cores in a single CPU socket.
• DB Auto Growth | Set high for performance (100 MB up to several GB).
• Cost Threshold for Parallelism | For OLTP, where we seek to minimize parallelism and offer more concurrency, use 15-20 (up to 50 with modern CPUs). For DSS, OLAP, data warehouse, and test environments, consider leaving the default and managing parallelism with MAXDOP if concurrency is a problem.
• TempDB | 1:2 or 1:4 ratio of TempDB data files to cores; 1:1 ratio for large systems. Pre SQL Server 2016: use trace flags T1117 and T1118 to enable consistent autogrowth. On flash arrays, enable the SORT_IN_TEMPDB index build option for index rebuilds.
• Separate Data / Log Volumes | Tier 1; test to determine for Tier 2 flash arrays. Multiple volumes per file group to reduce latch contention; 4-8 files per file group. Three volumes (TempDB, Data / Log files, and Backups) for fast flash (under 1 ms response times).
• Max Server Memory | 90% of available server memory.
• Enable Instant File Initialization | The Windows Server setting Perform Volume Maintenance Tasks needs to be granted under Local Policies and User Rights Assignments.
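Several of the instance-level settings above can be reviewed or applied with sp_configure. The sketch below is illustrative only; it assumes an 8-core socket and roughly 128 GB of RAM and should be adjusted to the actual hardware and tested before being applied to production.

    -- Illustrative values; adjust to the actual hardware before production use
    EXEC sys.sp_configure 'show advanced options', 1;
    RECONFIGURE;

    EXEC sys.sp_configure 'max degree of parallelism', 8;    -- cores in a single CPU socket
    EXEC sys.sp_configure 'max server memory (MB)', 115000;  -- roughly 90% of a 128 GB server
    -- For OLTP-style concurrency the guidance above suggests a cost threshold of 15-20;
    -- for the warehouse workload it can remain at the default and be managed with MAXDOP.
    RECONFIGURE;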
Case for Hadoop: Indoor Positioning Study (POE)

During our initial meetings regarding the data sources at the SmithGroup JJR, we identified one possible use case for Hadoop: the Indoor Positioning Study (POE). During our conversations, multiple questions were asked about Hadoop, such as what the minimum size is and how to handle aggregates on unstructured data. Hadoop does not perform well on 5 TB or less. It is also worth noting that small files do not work well with Hadoop and should be combined into larger files. As for aggregates in Hadoop, if SmithGroup JJR were to use Azure Data Lake Store (ADLS), they could use HDInsight and Hive; if they use SQL Data Warehouse or SQL Server 2016, they could use PolyBase. Another option is Azure Data Lake Analytics / U-SQL to aggregate Hadoop data.

Below are some questions from the Indoor Positioning Study (POE) documentation that describe what the SmithGroup JJR would like to answer with this data source. These are broad topics, each with more specific questions.

• How do people utilize space?
o What is the average dwell time by space?
o How does the number of people within a space vary over time?
o What are the most frequently used paths between spaces?
• How do people interact and collaborate?
o How much time do people spend in spaces occupied by other people?
o What is the average number of people in a collaborative space?
o How does job / organizational role impact collaboration?
• Person movement
o How often do people move between spaces?
o What is the average duration of rest (motion)?

Additional questions:
• Exact location of a user within a space
• Actual paths traveled between spaces
• Relationship between workspace and study subject (employee / organizational) measures, such as happiness or productivity (what is an abstract term that captures these types of things?)
• Comparison of varied workspace configurations / designs / arrangements such as office / open / free assignment
• Integration with other technologies and data sources, such as space scheduling software, communication software, galvanic skin response, implanted telemetry chips, health and dental records, etc.

After talking with Peter, he estimated that the size of the Indoor Positioning Study (POE) data at the SmithGroup JJR was at most one terabyte. Since this is much less than the five-terabyte minimum for Hadoop clusters, it is not suggested to implement a Hadoop cluster for this use case.

SQL Server / Database Discovery

In order to complete a data discovery, we were provided 3 databases:
• DataVault
• DataMart
• ETL

We provided a data discovery by employing 2 different methods. The first method was to create a web-based document of each database using Redgate's SQL Doc. The second method was to use SQL Server DMVs and TSQL to create an Excel-based data dictionary. The files are included in the SharePoint folder along with this document. Also note that Shobhana provided the WBS Migration Changes to Datawarehouse Systems.pdf, where much of this type of information can be found as well.
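A data dictionary of the kind described above can be produced with a catalog-view query along the following lines. This is a generic sketch, not the exact query used for the delivered spreadsheet.

    -- One row per column, suitable for pasting into an Excel data dictionary
    SELECT  s.name        AS SchemaName,
            t.name        AS TableName,
            c.name        AS ColumnName,
            ty.name       AS DataType,
            c.max_length  AS MaxLengthBytes,
            c.is_nullable AS IsNullable,
            p.rows        AS ApproxRowCount   -- one row per partition; a single row when unpartitioned
    FROM    sys.tables t
    JOIN    sys.schemas s    ON s.schema_id     = t.schema_id
    JOIN    sys.columns c    ON c.object_id     = t.object_id
    JOIN    sys.types   ty   ON ty.user_type_id = c.user_type_id
    JOIN    sys.partitions p ON p.object_id     = t.object_id AND p.index_id IN (0, 1)
    ORDER BY s.name, t.name, c.column_id;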
Finally, we collected information about the various data sources (both internal and external). The list of data sources is as follows:

Internal | Enterprise Data Sources
• Vision | Enterprise Resource Planning software
• UltiPro | Human Resources
• SharePoint | Document management and collaboration
• Active Directory (AD) | Network / domain information
• NewForma | Project meta data and RFIs

Internal | Project / Application Specific
• Revit Data Collector | Building Information Modeling (model statistics)
• CER
• WorkSim Space Planning | Space planning
• Indoor Positioning Study (POE) | Azure SQL for people movement in workspaces
• Campus Project Data (Marquette) | Campus planning and space

External
• IPEDS | Public university data
• National Science Foundation | Public data for funded projects
• Bureau of Labor | Government labor statistics
• GIS | Topographical data, surveys of land

Data Warehouse and Data Marts

In order to analyze SmithGroup JJR's Data Warehouse / Data Mart environment, we were provided 3 databases:
• DataVault
• DataMart
• ETL

Overview
The data warehouse (DataVault) and the data mart of the same name (DataMart) are the 2 databases that make up the SmithGroup JJR BI environment. The DataVault is a 3NF database. The DataMart is de-normalized and currently contains employee and project data. At this time, there is no star schema; however, there are plans to build out a star schema in the future. The DataVault stores source data by the corresponding source system name, using schemas of the same name such as UltiPro and Vision. In order to complete the analysis below, a server and database discovery was completed as well.

Analysis
The areas of analysis for these 2 databases include the following topics:
• Security
• Scalability
• Partitioning
• Exception Handling
• Alerting
• Transaction Processing
• Indexing
• SQL Views (Business Views)
• Star Schema
• Surrogate Keys
• Conformed Dimensions
• Delta Loads (Merge SCD 1 and SCD 2, Checksums)

Security should always be the first concern in planning and deploying any data warehouse / data mart environment. In reviewing which roles were defined, we found the following server roles: bulkadmin, dbcreator, diskadmin, processadmin, public, securityadmin, serveradmin, setupadmin, and sysadmin. There were no user-defined SQL Server roles. The sa account was enabled but not being used. There was no implementation of Row Level Security or Role Based Security. SQL schemas, such as ultipro, vision, ad, and admin, were used to scale and organize the various SQL Server objects.

Scalability is a very high priority for companies that want to deliver solutions that last 5 or more years after the initial deployment. Many of the server and database DMVs listed above help us determine scalability. For instance, using instances, partitioning, files and file groups, and synonyms can help make a system more scalable. Instances allow better resource management between different processes on the same server. They also allow us to separate load layers such as stage, consolidation, transformation, 3NF, star, and analytics. Since we only had access to development servers, we did not see any examples of instances, but we highly recommend them in production. We also looked at partitioning, which is discussed below. As for files and file groups, we have provided an Excel spreadsheet identifying the files and file groups and their current sizes. We also provided size information for all of the tables in the 3 databases we were asked to analyze. File and table sizes are important indicators for scalability and for where to set the auto growth for your tables. The data and log files were on the same volume and had the following sizes:

FileName       FileSizeMB  SpaceUsedMB  AvailableSpaceMB  %FreeSpace
DataVault      3004        1636.69      1367.31           45.52
DataVault_log  36828.31    1256.45      35571.87          96.59

When looking at development, we were okay with these settings; however, the auto growth was not what we recommend in production. Finally, synonyms are an easy way to manage server-to-server (physical, or instance) connections without taking the risk of using linked servers. We did notice 3 linked servers (FINANCIALDATA, SGJJR-SQL2ASCCM2012, and VISIONDEVDB). These linked servers were not part of the scope provided by the SmithGroup JJR; however, we would warn against relying too heavily on linked servers.

Partitioning is a great way to manage reporting performance in a data warehouse / data mart environment. Currently, the SmithGroup JJR is not using any partitioning strategy. Since we only had access to development data, it is hard to tell whether the sizes in our analysis represent real production sizes; however, with the database sizes we encountered in the scope of this analysis, we do not recommend partitioning at this time. For the DataVault, there were a total of 2,548,891 rows. The table with the most rows was [vision].[ProjectFinancialsByPeriod] with 730,655 rows. We can see in the section above that the data size for DataVault is 3004 MB. At this time, partitioning is not recommended.

Exception Handling in SQL Server (TSQL) is accomplished by using TRY...CATCH clauses. We did our due diligence in verifying that there is no exception handling at the SQL Server level, and we confirmed this with the different teams at SmithGroup JJR. Exception handling is recommended for future phases of development.

Alerting in SQL Server is a combination of TRY...CATCH clauses, Database Mail, and SQL Server Agent. SQL Server Agent allows us to define operators and alerts. Alerts can then be defined on performance conditions, or on SQL Server events based on an error number raised from a TRY...CATCH clause. Alerting is recommended for future phases of development.

Transaction Processing is the process of ensuring that data is written to disk before we commit a transaction and move on to the next step in the process. Transaction processing also provides a mechanism to roll back any data that has been written if the transaction fails before a commit can take place. Transaction processing is a critical part of any design. Currently, the environment does not use transaction processing. Transaction processing is recommended for future phases of development.

Indexing has a huge impact on server and query performance. DMV queries to identify unused indexes and indexes that need to be reorganized or rebuilt should be run on a regular basis. Index discovery and creating an enterprise index strategy is recommended.
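A hedged example of the kind of index-usage query mentioned above is shown below. Note that sys.dm_db_index_usage_stats resets when the instance restarts, so results should be interpreted over a representative uptime window.

    -- Nonclustered indexes with few reads and many writes are candidates for review
    SELECT  OBJECT_SCHEMA_NAME(i.object_id)                            AS SchemaName,
            OBJECT_NAME(i.object_id)                                   AS TableName,
            i.name                                                     AS IndexName,
            ISNULL(us.user_seeks + us.user_scans + us.user_lookups, 0) AS Reads,
            ISNULL(us.user_updates, 0)                                 AS Writes
    FROM    sys.indexes i
    LEFT JOIN sys.dm_db_index_usage_stats us
           ON us.object_id   = i.object_id
          AND us.index_id    = i.index_id
          AND us.database_id = DB_ID()
    WHERE   i.type_desc = 'NONCLUSTERED'
      AND   OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    ORDER BY Writes DESC, Reads ASC;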
SQL Views (Business Views) can be used to denormalize and simplify data structures in the 3NF for reporting purposes. Currently, both the DataVault and DataMart use SQL Views; the ETL database does not. There are 32 SQL Views grouped into 5 different schemas (admin, api, dbo, lookup, and vision) in the DataVault database. The DataMart database has 10 SQL Views, all in the dbo schema. A star schema is suggested to reduce complexity in creating and managing SQL Views for the business.

Star Schema is not used and is not being developed. It is highly recommended for future phases.

Surrogate Keys are used to provide referential integrity in a data warehouse / data mart that sources data from numerous data sources that all have different keys defined for the same entity / attribute, such as Person / Social Security Number. Surrogate keys are employed in the SmithGroup JJR DataVault and DataMart; however, GUIDs have been used. This design poses no issues for the data warehouse; however, the reporting star schema should use integers for load and processing performance.

Conformed Dimensions at the database level entail ensuring that the data warehouse star schema has only 1 dimension for a specific entity, such as employee or region. Any data mart use of the entity employee or region needs to be sourced from the data warehouse and not reloaded with different logic and processes. Since there is not a star schema for DataVault or DataMart, there are no dimensions to conform. We suggest a robust star schema for both the data warehouse and the data marts.

Delta Loads are both a performance issue and a management issue. Loading only the data that has changed since the last load can be implemented and managed in many ways. We can use TSQL MERGE, checksums, and last-load-date tables to determine whether a row has changed since the last time the table was loaded. SmithGroup JJR uses TSQL MERGE, but not checksums or a last-load-date table. At the size of the data today, checksums and storing a last load date are not necessary, but they are recommended for performance and scalability. These same processes can also be used to implement slowly changing dimensions once a star schema is developed. There was an initial plan at the SmithGroup JJR to use logging, error logging, and number tables to manage load meta data, but it was not implemented.
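To illustrate the checksum suggestion for delta loads, the sketch below extends a MERGE with HASHBYTES so unchanged rows are skipped, and shows an INT IDENTITY surrogate key in line with the recommendation above. The dimension and stage tables are hypothetical, and the hash covers only two columns for brevity.

    -- Hypothetical dimension with an integer surrogate key
    CREATE TABLE dbo.DimEmployee
    (
        EmployeeKey  int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key (INT, per the recommendation)
        EmployeeId   int           NOT NULL,         -- business key from the source
        FirstName    nvarchar(100) NULL,
        LastName     nvarchar(100) NULL,
        RowHash      varbinary(32) NULL,             -- SHA2_256 of the tracked columns
        LastLoadDate datetime2(0)  NULL
    );
    GO

    -- Type 1 delta load: insert new business keys, update only when the hash changes
    MERGE dbo.DimEmployee AS tgt
    USING
    (
        SELECT EmployeeId, FirstName, LastName,
               HASHBYTES('SHA2_256', CONCAT(FirstName, '|', LastName)) AS RowHash
        FROM   etl.StageEmployee
    ) AS src
       ON tgt.EmployeeId = src.EmployeeId
    WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
        UPDATE SET FirstName    = src.FirstName,
                   LastName     = src.LastName,
                   RowHash      = src.RowHash,
                   LastLoadDate = SYSDATETIME()
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (EmployeeId, FirstName, LastName, RowHash, LastLoadDate)
        VALUES (src.EmployeeId, src.FirstName, src.LastName, src.RowHash, SYSDATETIME());

The same comparison column can later drive SCD Type 2 handling once the star schema is in place.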
Extract, Transform, and Load

In order to analyze SmithGroup JJR's SSIS environment, we were provided 4 SSIS solutions for the following areas:
• Active Directory | 12 packages
• Deltek Vision | 63 packages
• UltiPro | 19 packages
• NewForma | 21 packages with 11 disabled

Overview
The Active Directory load process uses the KingswaySoft Directory Services Integration toolkit to provide access to Active Directory data. Using this tool, the solution extracts Active Directory data for the following areas: computer, group, group member, and user. The master ActiveDirectory.dtsx package calls 3 sub-packages named Extract, Transform, and Load. As the names of these packages indicate, the Extract package takes data from the source system and temporarily stores it in the ETL database. Unlike the other, more complex ETL solutions, this solution does not have any tasks or data flow transformations in the Transform package; Transform.dtsx could possibly be disabled. The Load package calls stored procedures located in the ETL database and then loads the data warehouse.

The Active Directory design pattern uses the KingswaySoft Directory Services Integration toolkit to extract the data from the source system and place the extracted data into stage tables located in the ETL database on the data warehouse server. Once the data is staged, ETL stored procedures in the ETL database load the staged data into the DataVault tables using the MERGE statement.

The Deltek Vision solution uses a similar process by calling separate sub-packages for the Extract, Transform, and Load phases of the data load. However, this solution also has a PreProcessing and a PostProcessing package. The PreProcessing package truncates the TPH tables. The PostProcessing package is empty and could possibly be disabled.
The Extract package takes data from the source system and temporarily stores it in the ETL database for tables such as client, vendor, and employee. The Transform package transforms data for vendor / client, contact / employee, project / opportunity, and project dependents. The Load package calls stored procedures located in the ETL database and then loads the data warehouse for these same areas, such as client, vendor, and employee.

The Vision design pattern includes an ETL database that is stored on the transactional server. This ETL database stores functions that are used to extract the data from the source system and place the extracted data into stage tables located in the ETL database on the data warehouse server. Once the data is staged, ETL stored procedures in the ETL database load the staged data into the DataVault tables using the MERGE statement.

The UltiPro solution uses a different process, organizing the Extract, Transform, and Load phases of the data load into separate containers. The Extract phase stores data in the ETL database. The Transform phase transforms data for organization, employment, and employmentHistory. The Load phase loads tables using the TSQL MERGE statement from the ETL database to the data warehouse. Please review Shobhana's WBS Migration Changes to Datawarehouse Systems.pdf for an ETL dataflow diagram and other useful package information.
Since UltiPro is a backup and restore process, the UltiPro design pattern does not include ETL database functions stored on the transactional server. Instead, these extract functions are stored in the ETL database on the data warehouse server. This ETL database stores functions that are used to extract the data from the source system and place the extracted data into stage tables in the ETL database on that server. Once the data is staged, ETL stored procedures in the ETL database load the staged data into the DataVault tables using the MERGE statement.

The NewForma (oblivion) load process is a non-standard load process that needs to be updated to the new design pattern described with the Vision load process above. There is a monthly load that calls a weekly load that calls an hourly load that is not currently being used. The weekly load package also has an archival process. Besides the monthly load, there is a daily load that calls the hourly load. Both the weekly and daily load packages call the same hourly package.

The current state of the NewForma load process executes two packages in parallel. The first package is Execute etlOrgChart and the second package is Execute etlNewformaProjects. Execute etlNewformaProjects has two child packages named Execute etlProjectRFIs and Execute etlProjectMilestones. The packages etlOrgChart and etlProjectMilestones both use the KingswaySoft SharePoint Integration toolkit to extract and load data to and from SharePoint lists. This process needs to be updated and implemented to production using the standard process.

Analysis
The areas of analysis for these 4 solutions include the following topics:
• Load Meta Data
• Package Sequencing (Master and Child Packages, SQL Jobs, Conformed Dimensions, Dimensions, Facts, Data Marts)
• Environments and Environment Variables
• Connection Managers (Package and Project)
• Parameters (Package and Project)
• Alerting
• Logging
• Exception Handling
• Validation
• Checkpoints
• Transaction Processing (requires MSDTC)
• Naming Conventions

Load Meta Data is important since it can help us track load start and end times by package, table, and even cube processing. It can also track load row counts for inserts and updates, provide restartability that is more robust than SSIS checkpoints, and provide rollback information during a failure. Currently, there is no tracking of load meta data, and it is highly recommended (a minimal load-log sketch follows the Alerting topic below).

Package Sequencing controls the order in which the packages load the tables. In terms of packages, we have a master package and then child packages. The master package may call child packages such as a conformed dimension package, a dimension package, and a fact package. Child packages can also call data mart packages that duplicate data warehouse dimensions and facts to be used as data marts. Finally, SQL Jobs can be used to schedule different load patterns and times such as daily, hourly, weekly, or monthly. Since package sequencing is already working and not causing issues at this time, this is not a high priority for redesign.

Environments and Environment Variables are used to provide a mechanism for changing project data connections and variables during a change-control migration from one environment to another, such as development to test, or test to production. Currently, environments and environment variables are being used with success.

Connection Managers can be either project or package level. In most cases, connections that need to change from environment to environment, or that will be used in many packages, should be project connections. Any connections that are required by only 1 package and will not change between environments can be package connections.

Parameters can be either project or package level. In most cases, parameters that need to change from environment to environment, or that will be used in many packages, should be project parameters. Any parameters that are required by only 1 package and will not change between environments can be package parameters.

Alerting in SSIS is provided by using an SMTP connection. This connection can then be used in a task flow, a data flow, or even an event handler such as OnError. Alerting is not enabled in the solutions evaluated. Alerting is highly recommended.
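Returning to the Load Meta Data recommendation above, the following is a minimal sketch of a hypothetical load-log table plus start and finish procedures that an SSIS master package could call from Execute SQL Tasks. Object names and columns are placeholders and would be fitted to the existing ETL database.

    -- Hypothetical load meta data table
    CREATE TABLE etl.LoadLog
    (
        LoadLogId    int IDENTITY(1,1) PRIMARY KEY,
        PackageName  nvarchar(260) NOT NULL,
        TargetTable  nvarchar(260) NULL,
        StartTime    datetime2(0)  NOT NULL DEFAULT SYSDATETIME(),
        EndTime      datetime2(0)  NULL,
        RowsInserted int           NULL,
        RowsUpdated  int           NULL,
        LoadStatus   varchar(20)   NOT NULL DEFAULT 'Running'   -- Running / Succeeded / Failed
    );
    GO

    CREATE PROCEDURE etl.LogLoadStart
        @PackageName nvarchar(260),
        @TargetTable nvarchar(260),
        @LoadLogId   int OUTPUT
    AS
    BEGIN
        INSERT INTO etl.LoadLog (PackageName, TargetTable)
        VALUES (@PackageName, @TargetTable);
        SET @LoadLogId = SCOPE_IDENTITY();
    END;
    GO

    CREATE PROCEDURE etl.LogLoadEnd
        @LoadLogId    int,
        @RowsInserted int,
        @RowsUpdated  int,
        @LoadStatus   varchar(20)
    AS
    BEGIN
        UPDATE etl.LoadLog
        SET    EndTime      = SYSDATETIME(),
               RowsInserted = @RowsInserted,
               RowsUpdated  = @RowsUpdated,
               LoadStatus   = @LoadStatus
        WHERE  LoadLogId = @LoadLogId;
    END;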
Logging can be customized by using a custom logging schema and then tying that custom logging to the logging built into SQL Server 2012 and newer. Since robust logging, including verbose logging for troubleshooting, is provided in newer versions of SQL Server, it is not recommended to make any changes to logging. An example custom logging diagram that can bridge to the data logged by newer versions of SQL Server has been provided for your reference.

Exception Handling in SSIS can be addressed with multiple methods. One method is exception data flows. These data flows can load exception data into flat files such as text files, or into a table that stores exception data in an XML format. Another example of exception handling is rolling back data when an error occurs; this can be accomplished at the end of an error flow, or in an event handler such as OnError. Finally, exception handling can be combined with alerting to let the appropriate technical and business users know of an issue or delay. Since there is no exception handling in the currently evaluated packages, it is highly recommended.

Validation has many solutions. The most common include row- and column-based validation combined with what are called sanity checks. In a fact table, column validation can sum a column and compare that to an aggregate value in the consolidation layer of the load process. For dimension validation, we can verify that the surrogate key for a specific user ties back to multiple source systems through the stored business keys. Sanity checks tend to focus on a known business rule and verify that a calculated business rule matches across multiple systems such as a source system, the data warehouse, data mart, the cloud, and reporting tools. This validation can use load meta data as well as SQL Tasks to gather and validate complex scenarios (data sources) when necessary. Since little to no validation is currently employed, it is highly recommended (a sketch follows the Naming Conventions topic below).

Checkpoints are SSIS's built-in method for providing package restartability. They are configured by providing a checkpoint location in the package-level properties; the property name is CheckpointFileName. Two other properties, CheckpointUsage and SaveCheckpoints, also need to be configured. These properties are defined in the solutions we evaluated. It is recommended that some restartability be designed and implemented.

Transaction Processing is a feature in SSIS; however, it requires that the Microsoft Distributed Transaction Coordinator (MSDTC) be enabled. This coordinator comes with overhead and is not always well received by the DBA team. Since the SmithGroup JJR already uses SQL Tasks to call stored procedures that utilize the TSQL MERGE statement, we recommend not using transaction processing in SSIS, but rather at the SQL Server level.

Naming Conventions in SSIS may seem elementary, but good naming conventions can help with readability and maintenance, especially when introducing new developers to the ETL environment. A sample SSIS naming convention document has been provided.
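As an example of the column-level validation and sanity checks described above, below is a hedged sketch that compares a summed measure in a hypothetical fact table against the staged source and raises an error on mismatch; the tolerance, table, and column names are placeholders, and such a check could be executed from an SSIS SQL Task.

    -- Compare a summed measure between stage and the fact table; fail the job step on mismatch
    DECLARE @StageTotal decimal(18,2) = (SELECT SUM(InvoiceAmount) FROM etl.StageInvoice);
    DECLARE @FactTotal  decimal(18,2) = (SELECT SUM(InvoiceAmount) FROM dbo.FactInvoice);

    IF ABS(ISNULL(@StageTotal, 0) - ISNULL(@FactTotal, 0)) > 0.01
    BEGIN
        DECLARE @msg nvarchar(400) =
            CONCAT('Validation failed: stage total ', @StageTotal, ' vs fact total ', @FactTotal);
        THROW 50001, @msg, 1;
    END;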
Appendix A | Microsoft Data Warehouse On-Premises Architecture

Below is a diagram that illustrates an ideal business intelligence (BI) architecture. This example is intended to show the many pieces available in the Microsoft BI stack. The diagram includes source systems (structured data) that flow into a SQL Server data repository. Data repositories offload reporting workloads from production transactional servers. The ETL process stages only the data required for the current load.

There are three (3) paths for the data once it is staged. The first path is a quickly cleaned path, called an Operational Data Store (ODS), intended for daily business reporting needs. Moving down the diagram, the second path is the fact pipeline, which begins to denormalize and prepare the fact data for transformation. The third path is the dimension pipeline, which passes through two other tools offered in the Microsoft BI stack. The first tool is Data Quality Services (DQS), which is used to cleanse the data. The second tool is Master Data Services (MDS), Microsoft's master data management (MDM) offering. MDM provides what the industry calls "Golden Record Management." Golden record management gives you access to the most pure, validated, and complete picture of the individual records in your domain. Products like Profisee (https://profisee.com/grm) offer functionality beyond the tools included out of the box with SQL Server. This extra golden record functionality includes matching, de-duplication, mastering, and record harmonization. Profisee also offers graphical user interfaces, scorecards, and reports.

[Figure 7 | Microsoft Data Warehouse On-Premises Architecture — diagram showing structured and flat-file source data (sales, CRM, accounting, HR, supply chain) flowing into a data repository and staging area; an ODS path for real-time operational reporting; fact and dimension pipelines through DQS and MDM into the enterprise data warehouse (3NF) with sales, CRM, and marketing schemas; data marts (star schema), a cube farm (SSAS), and reporting through SSRS, Excel, Power Query, Power BI, and a SharePoint portal; with accompanying data steward, business, and BI team task lists.]
Appendix B | Design Questions to Review

How is data from multiple sources consolidated? For example, when we model the DataVault database we currently see three Person tables: vision.Person, ultipro.Person, and dbo.Person. According to the DBAs, dbo.Person is a consolidated version of Person. This raises a further question: what logic is used to consolidate the two source versions of Person? Is there a reference or lookup table? (A hypothetical sketch of such a lookup table is included below.)

What reasoning and domain knowledge led to using GUIDs rather than INTs for surrogate keys?

The DataVault, DataMart, and DataLake database names are confusing; they may be worth rethinking so that they do not collide with the more general, industry-accepted meanings of those terms.
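The following hypothetical sketch is included only to make the first two questions concrete; it shows one common shape for a reference/lookup table that maps source business keys to a single INT IDENTITY surrogate key. dbo.PersonKeyMap and its columns are assumptions and do not describe how dbo.Person is actually consolidated today.

-- Hypothetical only: one way a consolidated dbo.Person could be keyed,
-- mapping business keys from vision.Person and ultipro.Person to a single
-- INT surrogate key instead of a GUID. Not an existing object.
CREATE TABLE dbo.PersonKeyMap
(
    PersonSK          INT IDENTITY(1,1) PRIMARY KEY,  -- consolidated surrogate key
    SourceSystem      VARCHAR(20)  NOT NULL,          -- e.g. 'vision' or 'ultipro'
    SourceBusinessKey NVARCHAR(50) NOT NULL,          -- key in the source Person table
    CONSTRAINT UQ_PersonKeyMap UNIQUE (SourceSystem, SourceBusinessKey)
);

Whether the existing consolidation already uses something like this, and why GUIDs were preferred over INTs, are exactly the questions to review with the DBAs.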