Presentation on the methods applied in the ETL process, with a closer look at the CDC methods used in ETL for real-time data warehouses.
Adriano Patrick Cunha
Concepts
Data Warehouse (DW)
“is a prominent approach to materialized data integration.
Data of interest, scattered across multiple heterogeneous
sources is integrated into a central database system.” (Jörg and
Dessloch)
“provides information for analytical processing, decision
making and data mining tools. A DW collects data from
multiple heterogeneous operational source systems OLTP
and stores summarized integrated business data in a central
repository used by analytical applications OLAP” (Bernardino and
Santos)
ETL – Extraction, Transformation and Loading
“Is a process [that] extract[s] the data from source system, transforms
the data according to business rule, and loads results into the
target data warehouse.” (Kakish and Kraft)
Actions:
1) The identification of relevant information at the source
side.
2) The extraction of this information.
3) The customization and integration of the information
coming from multiple sources into a common format.
4) The cleaning of the resulting data set on the basis of
database and business rules.
5) The propagation of the data to the DW and the data marts (DM).
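The five actions above can be sketched end to end as a minimal, illustrative pipeline. Everything in it is an assumption made for the example: the two source shapes (`crm_rows`, `billing_rows`), the field names, the cleaning rules, and the `dim_customer` table in an in-memory SQLite database standing in for the DW.

```python
import sqlite3

# Hypothetical extracts from two source systems with different shapes.
crm_rows = [
    {"id": 1, "name": " Ana ", "country": "BR", "notes": "vip"},
    {"id": 2, "name": "Bruno", "country": "br", "notes": ""},
]
billing_rows = [
    {"cust_id": 2, "full_name": "Bruno", "ctry": "BR"},
    {"cust_id": 3, "full_name": None, "ctry": "PT"},
]

def extract(rows, fields):
    # Actions 1 and 2: keep only the fields identified as relevant.
    return [{f: r[f] for f in fields} for r in rows]

def integrate(crm, billing):
    # Action 3: bring both sources into one common format.
    common = [{"customer_id": r["id"], "name": r["name"], "country": r["country"]}
              for r in crm]
    common += [{"customer_id": r["cust_id"], "name": r["full_name"], "country": r["ctry"]}
               for r in billing]
    return common

def clean(rows):
    # Action 4: apply simple (assumed) database and business rules.
    seen, result = set(), []
    for r in rows:
        if not r["name"]:                # rule: name is mandatory
            continue
        if r["customer_id"] in seen:     # rule: one row per customer
            continue
        seen.add(r["customer_id"])
        result.append({"customer_id": r["customer_id"],
                       "name": r["name"].strip(),
                       "country": r["country"].upper()})
    return result

def load(conn, rows):
    # Action 5: propagate the cleaned rows to the warehouse table.
    conn.execute("CREATE TABLE IF NOT EXISTS dim_customer "
                 "(customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
    conn.executemany(
        "INSERT INTO dim_customer VALUES (:customer_id, :name, :country)", rows)
    conn.commit()

dw = sqlite3.connect(":memory:")
staged = integrate(extract(crm_rows, ["id", "name", "country"]),
                   extract(billing_rows, ["cust_id", "full_name", "ctry"]))
load(dw, clean(staged))
loaded = dw.execute(
    "SELECT customer_id, name, country FROM dim_customer ORDER BY customer_id"
).fetchall()
print(loaded)
```

Keeping each action in its own function makes it easy to swap the cleaning rules or the target without touching the extraction code.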
Concepts
Data Warehouse (DW) – Data Quality Dimensions
Completeness
Conformity
Consistency
Accuracy
Duplication
Integrity
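As a rough illustration, four of these dimensions can be measured with simple checks. The definitions used here are common working interpretations, not taken from the slides: completeness as the share of non-missing values, conformity as the share of values matching an expected format, duplication as the number of repeated keys, and integrity as references to values absent from a reference set. The sample rows, `valid_countries`, and `email_pattern` are hypothetical.

```python
import re

# Hypothetical customer rows with deliberate quality problems.
rows = [
    {"id": 1, "email": "ana@example.com", "country": "BR"},
    {"id": 2, "email": None, "country": "BR"},
    {"id": 2, "email": "bruno@example", "country": "XX"},
]
valid_countries = {"BR", "PT"}  # hypothetical reference table
email_pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def completeness(rows, field):
    # Share of rows where the field is present.
    return sum(r[field] is not None for r in rows) / len(rows)

def conformity(rows, field, pattern):
    # Share of present values that match the expected format.
    present = [r[field] for r in rows if r[field] is not None]
    return sum(bool(pattern.match(v)) for v in present) / len(present)

def duplication(rows, key):
    # Number of rows whose key repeats an earlier row.
    keys = [r[key] for r in rows]
    return len(keys) - len(set(keys))

def integrity_violations(rows, field, reference):
    # Rows that point to a value missing from the reference set.
    return [r for r in rows if r[field] not in reference]

print(completeness(rows, "email"))
print(conformity(rows, "email", email_pattern))
print(duplication(rows, "id"))
print(integrity_violations(rows, "country", valid_countries))
```

Consistency and accuracy usually need a second source or a trusted reference to compare against, so they are left out of this sketch.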
ETL Process
Extract
“Taking out the data from a variety of disparate source
system correctly is often the most challenging aspect of ETL
...”
“The goal of the extraction phase is to convert the data into
a single format which is appropriate for transformation
process...”
Typical sources: relational databases, flat files, IMS, VSAM, ISAM, etc.
“Most of the time the data in source system is very complex,
thus determining which data is relevant is very difficult...”
(Kakish and Kraft)
ETL Process
Extract
Logical Methods for extraction:
Full extraction
No need to keep track of changes
Incremental extraction
Relies on a CDC mechanism to identify changed data
Staging Area
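The difference between the two logical methods can be sketched as follows. This is a minimal example under stated assumptions: the `orders` table, its `updated_at` column, and the watermark approach are hypothetical, and the timestamp column is only one possible CDC mechanism. A Python list stands in for the staging area.

```python
import sqlite3

# Hypothetical source system with an audit timestamp on each row.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-01T08:00:00"),
    (2, 25.0, "2024-01-01T09:30:00"),
    (3, 40.0, "2024-01-02T11:15:00"),
])

def full_extraction(conn):
    # Reads everything; no change tracking is needed at the source.
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders ORDER BY id").fetchall()

def incremental_extraction(conn, last_watermark):
    # Reads only rows changed after the previous run's watermark.
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at", (last_watermark,)).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Initial load: full extraction into the staging area.
staging = full_extraction(src)
watermark = max(r[2] for r in staging)

# The source changes between runs.
src.execute("UPDATE orders SET amount = 30.0, "
            "updated_at = '2024-01-03T10:00:00' WHERE id = 2")
src.execute("INSERT INTO orders VALUES (4, 5.0, '2024-01-03T12:00:00')")

# Next run: only the changed rows are extracted.
changes, watermark = incremental_extraction(src, watermark)
print(changes)
```

ISO 8601 timestamps compare correctly as strings, which is why the watermark can be stored as text. A timestamp watermark does not see deleted rows, so deletions need a different mechanism.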
ETL Process
Extract
Physical Methods for extraction:
Online extraction
Connects directly to the source system and extracts the data in a preconfigured format.
Offline extraction
The extracted data is staged outside the source system
ETL Process
Transform
Types of Transformation
1. Selecting only certain columns to load;
2. Translating coded values (e.g., the source stores 1 for male and 2 for female, but the DW stores M and F);
3. Encoding free-form values (mapping “Male” to “1”);
4. Deriving a new calculated value;
5. Sorting;
6. Joining data from multiple sources and removing duplicate data;
7. Aggregation;
8. Generating surrogate-key values;
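Several of the transformations listed above can be combined in one small sketch: column selection, code translation, a derived value, duplicate removal, sorting, and surrogate-key generation. The row layout, the `GENDER_CODES` mapping, and the counter-based key generator are assumptions made for the illustration.

```python
from itertools import count

# Hypothetical source rows; the third one duplicates the first.
source_rows = [
    {"cust": "A17", "gender": 1, "qty": 2, "unit_price": 9.5, "internal_note": "x"},
    {"cust": "B03", "gender": 2, "qty": 1, "unit_price": 20.0, "internal_note": "y"},
    {"cust": "A17", "gender": 1, "qty": 2, "unit_price": 9.5, "internal_note": "x"},
]

GENDER_CODES = {1: "M", 2: "F"}  # source code -> DW code
surrogate_keys = {}              # natural key -> surrogate key
next_key = count(1)

def surrogate_key_for(natural_key):
    # Assign a new integer key the first time a natural key is seen.
    if natural_key not in surrogate_keys:
        surrogate_keys[natural_key] = next(next_key)
    return surrogate_keys[natural_key]

def transform(rows):
    seen, out = set(), []
    for r in rows:
        # Only the selected columns take part; internal_note is dropped.
        fingerprint = (r["cust"], r["gender"], r["qty"], r["unit_price"])
        if fingerprint in seen:  # remove duplicates
            continue
        seen.add(fingerprint)
        out.append({
            "customer_sk": surrogate_key_for(r["cust"]),  # surrogate key
            "gender": GENDER_CODES[r["gender"]],          # translated code
            "total": r["qty"] * r["unit_price"],          # derived value
        })
    return sorted(out, key=lambda r: r["total"])          # sorting

transformed = transform(source_rows)
print(transformed)
```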
ETL Process
Transform
Types of Transformation (continued)
1. Transposing or pivoting (turning multiple columns into multiple rows or
vice versa);
2. Splitting a column into multiple columns;
3. Disaggregation of repeating columns into a separate detail table;
4. Looking up and validating the relevant data from tables or reference files for
slowly changing dimensions; and
5. Applying any form of simple or complex data validation.
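Three of these structural transformations can be illustrated briefly: turning repeating columns into rows of a detail table, splitting one column into two, and a simple validation. The store rows, the column names, and the "Last, First" manager format are hypothetical.

```python
# Hypothetical wide rows with repeating quarterly columns.
wide_rows = [
    {"store": "S1", "q1_sales": 100, "q2_sales": 120, "manager": "Silva, Ana"},
    {"store": "S2", "q1_sales": 80, "q2_sales": 95, "manager": "Costa, Rui"},
]

def unpivot(rows, id_field, value_fields):
    # Turn repeating columns into one detail row per (id, period).
    return [{id_field: r[id_field], "period": f, "sales": r[f]}
            for r in rows for f in value_fields]

def split_manager(row):
    # Split "Last, First" into two separate columns.
    last, first = [part.strip() for part in row["manager"].split(",", 1)]
    return {"store": row["store"], "manager_last": last, "manager_first": first}

def validate(detail_rows):
    # Simple validation rule: sales must be a non-negative number.
    return [r for r in detail_rows
            if isinstance(r["sales"], (int, float)) and r["sales"] >= 0]

detail = validate(unpivot(wide_rows, "store", ["q1_sales", "q2_sales"]))
managers = [split_manager(r) for r in wide_rows]
print(detail)
print(managers)
```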
ETL Process
Load
Mechanisms to load include:
1. SQL*Loader: used to load flat files into the DW;
2. External tables: expose data held outside the database as a virtual table that can be queried and joined;
3. Oracle Call Interface (OCI): an API used when the transformation
process is done outside the database;
4. Export/Import
CDC - Change Data Capture
Snapshot sources - the source is extracted to a file, which is then
compared with the previous version of the file to find the changes
Logged sources - changes are recorded in change logs, usually
written by triggers; the logs may also be written by the business
logic of the applications, or obtained with DBMS-specific
utilities, such as database log scraping or log sniffing, which
read the logged transactions
Timestamped sources - the tables have audit attributes, which
indicate when a row was created or changed
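The snapshot approach can be sketched as a comparison of two extracts keyed by primary key. This is only an illustration: the CSV contents, the `id` key column, and the helper names `read_snapshot` and `diff_snapshots` are assumptions, and in-memory strings stand in for the snapshot files.

```python
import csv
import io

# Hypothetical snapshot files from two consecutive extraction runs.
previous_file = "id,name,city\n1,Ana,Fortaleza\n2,Bruno,Recife\n3,Carla,Natal\n"
current_file = "id,name,city\n1,Ana,Fortaleza\n2,Bruno,Salvador\n4,Davi,Belem\n"

def read_snapshot(text, key="id"):
    # Index every row of the snapshot by its primary key.
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}

def diff_snapshots(previous, current):
    # Classify the changes between two keyed snapshots.
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

inserts, updates, deletes = diff_snapshots(
    read_snapshot(previous_file), read_snapshot(current_file))
print(sorted(inserts), sorted(updates), sorted(deletes))
```

Unlike a timestamp comparison, the snapshot comparison also detects deleted rows, at the cost of reading both snapshots in full.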
Bibliography
Near real-time data warehousing using state-of-the-art ETL tools
Thomas Jörg, Stefan Dessloch (2010)
Lecture Notes in Business Information Processing 41 LNBIP
Real-time data warehouse loading methodology
Ricardo Jorge Santos, Jorge Bernardino (2008)
Proceedings of the 2008 international symposium on Database engineering & applications - IDEAS '08
http://portal.acm.org/citation.cfm?doid=1451940.1451949
Near real-time data warehousing with multi-stage trickle and flip
Janis Zuters (2011)
Lecture Notes in Business Information Processing 90 LNBIP
A Triggering and scheduling approach for ETL in a real-time data warehouse
Jie Song, Yubin Bao, Jingang Shi (2010)
Proceedings - 10th IEEE International Conference on Computer and Information Technology,
CIT-2010, 7th IEEE International Conference on Embedded Software and Systems, ICESS-2010,
ScalCom-2010
Creating a Real Time Data Warehouse
Joseph Guerra, David A Andrews (2011)
Andrews Consulting Group
ETL Evolution for Real-Time Data Warehousing
Kamal Kakish, Theresa A Kraft (2012)
Proceedings of the Conference on Information Systems Applied Research p. 1-12
www.aitp-edsig.org
All text and image content in this document is licensed under the Creative Commons Attribution-Share Alike 3.0 License
(unless otherwise specified). "LibreOffice" and "The Document Foundation" are registered trademarks. Their respective logos
and icons are subject to international copyright laws. The use of these therefore is subject to the trademark policy.
Thank you …
adriano.patrick@unifor.br
adrianopatrickc