Implementing Change Data Capture for a Slowly Changing Dimension in SSIS 2005: a research presentation for the SetFocus Business Intelligence Honors program
2. Employee Rates Data Flow: Sample multi-purpose data flow for both inserts and updates. The process must execute a Lookup against the target table for every incoming record to distinguish inserts from updates. In addition, without separate change-tracking data, the number of incoming records equals the size of the source table.
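The lookup-based approach can be sketched in plain Python to show where the cost comes from. This is only an illustration of the logic, not the SSIS Lookup transform itself; the `employee_id` key and the row layout are hypothetical.

```python
def classify_by_lookup(source_rows, target_rows, key="employee_id"):
    """Classify every source row as an insert or an update via a target lookup.

    Hypothetical sketch: rows are plain dicts and `key` is an assumed
    business key. Note that all source rows are examined on every run.
    """
    target_by_key = {row[key]: row for row in target_rows}
    inserts, updates = [], []
    for row in source_rows:
        existing = target_by_key.get(row[key])
        if existing is None:
            inserts.append(row)   # key not found in target: new record
        elif existing != row:
            updates.append(row)   # key found but values differ: changed record
    return inserts, updates
```

Even when nothing has changed, the loop still visits every source row, which is the inefficiency that change tracking is meant to remove.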
3. Change Data Capture (image from Microsoft Books Online, 2008): Change Data Capture (CDC) is an automated operation that records transactional activity (inserts, updates, and deletes) in the source table. This streamlines the ETL procedure because there is no need to compare incoming data against everything in the target table to identify changes. It also increases efficiency by limiting the source pool to changes that have already been identified. SQL Server 2008 has full CDC support and implements the capture process by writing transaction log activity into a set of specialized CDC tables. This feature is new in SQL Server 2008 and does not exist in SQL Server 2005. Even without automated transaction log tracking, a capture process can be developed by other means. This demonstration uses triggers to load the changes into a CDC change table that is similar in design to the 2008 version.
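A minimal sketch of trigger-based capture follows. It uses SQLite through Python's standard `sqlite3` module as a stand-in for SQL Server 2005, so the trigger syntax differs from T-SQL. The table names, column names, and operation codes (1 = delete, 2 = insert, 3 = value before update, 4 = value after update, the numbering used by SQL Server 2008 CDC) are assumptions, not the presentation's actual schema.

```python
import sqlite3

# Hypothetical schema; SQLite stands in for SQL Server in this sketch.
SCHEMA = """
CREATE TABLE employee_rates (
    employee_id INTEGER PRIMARY KEY,
    hourly_rate REAL NOT NULL
);

CREATE TABLE employee_rates_cdc (
    change_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    cdc_operation INTEGER NOT NULL,  -- assumed codes: 1, 2, 3, 4 (see lead-in)
    employee_id   INTEGER NOT NULL,
    hourly_rate   REAL
);

CREATE TRIGGER trg_rates_insert AFTER INSERT ON employee_rates
BEGIN
    INSERT INTO employee_rates_cdc (cdc_operation, employee_id, hourly_rate)
    VALUES (2, NEW.employee_id, NEW.hourly_rate);
END;

CREATE TRIGGER trg_rates_update AFTER UPDATE ON employee_rates
BEGIN
    INSERT INTO employee_rates_cdc (cdc_operation, employee_id, hourly_rate)
    VALUES (3, OLD.employee_id, OLD.hourly_rate);
    INSERT INTO employee_rates_cdc (cdc_operation, employee_id, hourly_rate)
    VALUES (4, NEW.employee_id, NEW.hourly_rate);
END;

CREATE TRIGGER trg_rates_delete AFTER DELETE ON employee_rates
BEGIN
    INSERT INTO employee_rates_cdc (cdc_operation, employee_id, hourly_rate)
    VALUES (1, OLD.employee_id, OLD.hourly_rate);
END;
"""


def run_demo():
    """Apply one insert, one update, and one delete; return the captured rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO employee_rates VALUES (1, 20.0)")
    conn.execute("UPDATE employee_rates SET hourly_rate = 25.0 WHERE employee_id = 1")
    conn.execute("DELETE FROM employee_rates WHERE employee_id = 1")
    rows = conn.execute(
        "SELECT cdc_operation, employee_id, hourly_rate "
        "FROM employee_rates_cdc ORDER BY change_id"
    ).fetchall()
    conn.close()
    return rows
```

In this sketch the single update produces two change rows, one holding the old value and one holding the new value.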
4.
5. CDC Test Inserts and Updates: Test script with inserts, updates, and deletes, and the result set in the CDC table tracking the changes. Note that each update creates two records.
6. SCD Data Flow: CDC for Slowly Changing Dimension. The SCD transform determines whether each row is an insert or an update without the need for a Lookup transform. The conditional split is based on the CDC_$operation column. Note that the source table for this data flow is the CDC table.
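The routing performed by a conditional split on the operation column might look roughly like the sketch below. The operation codes are borrowed from SQL Server 2008 CDC (1 = delete, 2 = insert, 3 = value before update, 4 = value after update) and are an assumption, as are the output names; the presentation's trigger-based table may use different values and a different split.

```python
# Assumed operation codes (SQL Server 2008 CDC numbering); hypothetical here.
OP_DELETE, OP_INSERT, OP_UPDATE_BEFORE, OP_UPDATE_AFTER = 1, 2, 3, 4


def conditional_split(cdc_rows, operation_column="CDC_$operation"):
    """Route CDC rows to named outputs according to their operation code."""
    outputs = {"to_scd": [], "deletes": [], "ignored": []}
    for row in cdc_rows:
        op = row[operation_column]
        if op in (OP_INSERT, OP_UPDATE_AFTER):
            # New rows and post-update values go on to the SCD step.
            outputs["to_scd"].append(row)
        elif op == OP_DELETE:
            outputs["deletes"].append(row)
        else:
            # Pre-update values (and any unknown codes) are not loaded.
            outputs["ignored"].append(row)
    return outputs
```

Because the input is the change table rather than the full source table, only rows that actually changed pass through the split.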
7. Near Real-Time Changes: Reduce Source-Target Latency. Running the SSIS package as a recurring background job can reduce the latency interval to the execution time of the complete CDC process. Because this demonstration has a single data flow, a For Loop container can serve a similar purpose: the data flow executes multiple times within the loop and picks up any new changes in the CDC table.
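The looping idea can be sketched as a simple polling function. This is not SSIS; `fetch_changes`, `apply_changes`, and the use of a last-seen change id to avoid reprocessing rows are all hypothetical choices made for the illustration.

```python
import time


def poll_changes(fetch_changes, apply_changes, iterations, interval_seconds=0.0):
    """Repeatedly pull new change rows and apply them to the target.

    fetch_changes(last_seen) is assumed to return (change_id, payload) tuples
    with change_id greater than last_seen; apply_changes(batch) loads them.
    """
    last_seen = 0
    for _ in range(iterations):
        batch = fetch_changes(last_seen)
        if batch:
            apply_changes(batch)
            # Remember the highest change id so the next pass skips these rows.
            last_seen = max(change_id for change_id, _ in batch)
        time.sleep(interval_seconds)
    return last_seen
```

With a short interval, the delay between a source change and its arrival in the target approaches the time of one pass through the loop.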
8. Final Results: A second set of inserts and updates, with the corresponding changes appearing in the CDC and target tables mere seconds later.