Presentation by Ivan Schotsmans (DV Community) at the Data Vault Modelling and Data Governance conference on Oct. 17, 2019: Integrate Information Quality in your Data Warehouse Architecture
The start of GDPR implementations in Europe was, for most organizations, also the start of rethinking their data warehouse strategy. Experience from past implementations gave a better view of the do's and don'ts. One of the important lessons learned was how to handle information quality: it is not something you add on top of your data warehouse. To be successful, information quality must go hand in hand with your data warehouse implementation.
2. DWA-Day
February 13. Belgium
About Us
DV-Community: a meeting place for Data Warehouse Automation addicts to get information, share resources and solutions, increase networking and expand DWA expertise.
Data Warehouse Automation Special Interest Group
» Information Hub for Data Vault
» DWA – events
» Training
» Webinars
» Software / Application information
3. Ivan Schotsmans
» Data Evangelist with 30+ years of experience
» (Co-)Founder of the local chapters of TDWI, DAMA, BI-Community, DV-Community, IAIDQ
» Data Warehouse – Business Intelligence – Data Governance
» NOW: Master Data Officer
4. Agenda
» Business Case
» Data Challenges
» Data Strategy
» Data Quality
» Data Architecture
5. Customer Case
6. Scope: Don’t boil the ocean
» Start with critical applications
» Parameters
• Criticality
• Impacts
• Depreciation
7. Business Requirements
» Data Quality Audit starts from a MASTER application (reference table)
• Starting point: Reference Table
• Compare against Reference Table
[Diagram: the Master and the applications APPL01, APPL20, APPL21, …; Customer 1 appears with the values AAA, Product XXX, YYY, ZZZ and NNN]
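The audit idea on this slide, comparing every application against one master reference table, can be sketched as follows. This is a minimal illustration only: the customer keys, the values and the `audit` helper are hypothetical and not taken from the presentation.

```python
# Hypothetical master reference table: customer key -> expected value.
master = {"Customer 1": "AAA", "Customer 2": "BBB"}

# Hypothetical application extracts (application names follow the slide,
# the contents are made up for illustration).
applications = {
    "APPL01": {"Customer 1": "AAA", "Customer 2": "BBB"},
    "APPL20": {"Customer 1": "YYY"},
    "APPL21": {"Customer 1": "AAA", "Customer 3": "CCC"},
}


def audit(master, app_records):
    """Return a list of (key, issue) pairs for one application."""
    issues = []
    # The master is the starting point: every master key must be present
    # in the application with the same value.
    for key, expected in master.items():
        if key not in app_records:
            issues.append((key, "missing in application"))
        elif app_records[key] != expected:
            issues.append(
                (key, f"value {app_records[key]!r} differs from master {expected!r}")
            )
    # Records the application holds but the master does not know about.
    for key in app_records:
        if key not in master:
            issues.append((key, "not in master"))
    return issues


report = {name: audit(master, records) for name, records in applications.items()}
for name, issues in report.items():
    print(name, issues)
```

Driving the loop from the master rather than from the application keeps the reference table as the single point of comparison, which is the approach the slide describes.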
8. Data Driven Business Rules
Root Product  Product Type  Key value  Application 1  Application 2  Condition  Old Product number
Product 1     Access        Value 1    PTXGI          FFTH           AND        123812
Product 1     Access        Value 1    PTXGI          GTFR           AND        89103
Product 1     Access        Value 1    PTXGI          DHFD           NA         180153
Product 2     Cable         Value 1    PTXGI          PFDR           OR         115976
Product 2     Cable         Value 1    PTXGI          WSHN           OR         100153
Product 2     Cable         Value 1    PTXGI          AZFD           NA         100152
9. Data Quality Checks
Phases: Prepare, Execute, Report
[Diagram: Master Reference Table and Support Mapping Table; applications APPL01, APPL02, APPL03, APPL04, APPL…; steps Read, Mapping, Join; Error Flags; XLS; Reporting]
• Mapping process
• Error Checking
• Flag Setting
• Outcome in one big XLS File
• Source for different dashboards
• One outcome table per application
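The prepare, execute and report phases could look roughly like the sketch below. It assumes a mapping table keyed by (application, local code) and writes CSV text instead of the XLS file mentioned on the slide; all table contents, flag names and function names are hypothetical.

```python
import csv
import io

# Prepare: hypothetical master reference table and support mapping table.
master_products = {"P1", "P2"}
mapping = {
    ("APPL01", "A-1"): "P1",
    ("APPL01", "A-2"): "P2",
    ("APPL02", "X9"): "P3",  # maps to a code the master does not contain
}

# Hypothetical application extracts: application -> local product codes.
extracts = {"APPL01": ["A-1", "A-2", "A-3"], "APPL02": ["X9"]}


def execute(app, local_codes):
    """Read, map, join and flag: one outcome row per source record."""
    rows = []
    for code in local_codes:
        master_code = mapping.get((app, code))
        rows.append({
            "application": app,
            "local_code": code,
            "master_code": master_code or "",
            # Flag 1: no entry in the support mapping table.
            "f_unmapped": int(master_code is None),
            # Flag 2: mapped, but the target is absent from the master.
            "f_not_in_master": int(
                master_code is not None and master_code not in master_products
            ),
        })
    return rows


def report(rows):
    """Serialize one outcome table as CSV text (stand-in for the XLS file)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# One outcome table per application, as on the slide.
outcomes = {app: execute(app, codes) for app, codes in extracts.items()}
for app, rows in outcomes.items():
    print(report(rows))
```

Keeping the flags as 0/1 columns makes each outcome table directly usable as a dashboard source, since a flag column can be summed per application.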
10. Cleanup Status
Total Products 79.730
Sales 7.696
Customers 4.642 Customers 72.034
Products Maintenance Fee 1.908 Product Maintenance Fee 0
Active Products 1.649 Active Products 0
Suspended 257 Suspended 0
New 0 New 0
Out of Service 2 Out of Service 0
Unknown 0 Unknown 0
Products without Maintenance 3.054 Products without Maintenance 72.034
Active Products 2.237 Active Products 29.255
Suspended 323 Suspended 6.843
New 9 New 740
Out of Service 485 Out of Service 35.196
Unknown 0 Unknown 0
11. Raw Data Quality Analysis
Columns: Product Number, SAP Code, Latest Version Date, F_Clean_OK, Begin_Date, Last_Usage_Date, Total_Revenue, Nbr_custs, Appl_01, Appl_02, Appl_---, Last_Invoice Date
65 20041128 0 19960104 19981020 0 Zero 0 0 0
66 680039 20041128 0 19963112 20011017 0 Zero 0 1 0
67 680013 20041128 0 2000101 20010131 0 Zero 0 0 0
68 680044 20060315 0 19960101 20050514 0 Zero 0 0 0
69 680034 20060315 0 19971020 20050514 1.250 LT10 4 3 6
70 20060315 0 20050701 20070514 0 Zero 0 0 1 20070531
71 70310 20060315 0 20050514 20060909 0 Zero 0 0 0
72 896401 20060315 1 20050701 20060101 0 Zero 0 2 0 20060201
73 20060315 0 20050514 20070112 0 Zero 0 0 0
12. Data Challenges
13. Our data status was a “Disparate Data Cycle”, …
The Disparate Data Cycle (Michael Brackett):
• People Create Their Own Data
• Can’t Find, Don’t Trust, Can’t Access Data
• Data Not Integrated or Documented
• People Come Looking for Data
• People Uncertain About the Data
• People Come With Own Data
14. …but we needed to transform to a Comparate Data Cycle.
The Comparate Data Cycle (Michael Brackett):
• New Data Created When Necessary
• People Find, Trust and Access Data
• New Data Integrated and Documented
• People Come Looking for Data
• Existing Data Resource Readily Shared
• People With New Data Check First
15. A challenging data strategy will ensure that our organization is better placed to meet its challenges in a fast-changing environment.
DG VISION
Improve efficiency, increase punctuality and optimize decision making by ensuring that the highest quality data is delivered.
FOCUS AREAS
One central Data Governance Team
One version of the truth
Process Harmonization
Focus, Specialization, Simplification, People, Data = Asset
• Reference and Master Data
• Enterprise Data Model
• Clear responsibilities
• Data Scientists
• Data Stewards
• Data Curators
• One function, one tool
• IT Landscape
• Deduplication
• The right person in the right place at the right time
• Timely and relevant training
• Awareness Raising
• Data quality
• Customer Satisfaction
CHALLENGES
» Missing key elements (taxonomies, data dictionaries, data quality metrics)
» Data Duplication, Overlaps
» Time to Market
» Liberalization
» Legal requirements (GDPR)
» Shadow IT
» Complexity
VALUES
• Professionalism • Teamwork • Respect • Entrepreneurship
16. Data Strategy
17. We defined a data strategy covering people, processes, data and technology.
People: Embedding a culture of transparency and diversity, identifying the capabilities we need for the future, and developing better and clearer career paths for our employees.
Processes: Simplifying processes and applying customer-centric design and Lean principles where appropriate. Leveraging automation to reduce manual processes and End User Computing.
Data: Better understanding of our data to enable value-added analysis and support strategic decision making.
Technology & Tools: Making strategic investments to simplify the technology environment and ensure that it enables our desired capabilities.
18. We introduced a team of data specialists with specific roles and responsibilities, …
• Data Owner: working within the business, accountable for content and quality of an enterprise data asset.
• Data Steward: working within the business, responsible for the quality of an information asset on a day-to-day basis.
• Data Analysts: working within the business and relying on IT to provide access to data from different applications and systems.
• Data Scientists: working within the business and relying on IT to provide access to data from different applications and systems.
• Data Engineers: working within IT and having a deep understanding of the systems and infrastructure that generate and store the business data.
• Data Curators: working within IT and curating data for different analytical tasks: allocating resources to accelerate data analysis, adding semantic meaning to data catalogs or repositories, and blending and organizing data sets.
[Diagram: the Data Asset surrounded by Data Owner, Data Steward, Information Worker, Data Analyst, Data Scientist, Data Engineer and Data Curator, grouped as Data Consumers, Data Custodians and Data Owners]
19. …to emphasize the importance of business commitment.
Data Management Office
• DM IT Team: Data Engineers, Data Curator
• DM Business Team: Data Scientist, Data Analyst
• Business Domain: Data Owner, Data Steward, Information Worker
• Business Domain: Data Owner, Data Steward, User, Data Curator
20. We covered the business demand for scalability and flexibility with the use of data vault.
Data Vault Characteristics
• Agile
• Set of Best Practices
• Historization
• Logging
• Unique IDs (hash-keys)
• Reconciliation
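The hash-keys bullet can be illustrated with a short sketch. A common Data Vault convention is to derive a key by hashing the normalized business key. The normalization rules (trim, upper-case), the `||` delimiter and the choice of SHA-256 below are assumptions for illustration; the presentation does not say which algorithm or rules were used, and MD5 or SHA-1 are also seen in practice.

```python
import hashlib


def hash_key(*business_keys, delimiter="||"):
    """Hash trimmed, upper-cased business key parts into a hex digest.

    Normalizing first means the same business key always yields the same
    hash key, regardless of casing or padding in the source system.
    """
    normalized = delimiter.join(str(part).strip().upper() for part in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical hub key for a customer, from two differently formatted sources.
hub_customer_key = hash_key(" customer 1 ")
same_key = hash_key("CUSTOMER 1")

# Hypothetical link key combining a customer and a product number.
link_key = hash_key("Customer 1", "98105")

print(hub_customer_key == same_key)
print(link_key)
```

Because the key is computed from the data itself, each source can be loaded independently and in parallel without a lookup against a central key generator.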
21. Due to its flexibility, data vault not only guarantees an agile approach but also a faster time to market.
• Proven enterprise data warehouse framework
• Single version of the facts
• Business rule neutral
• Source system neutral
• Agility (case study granularity change)
• Data ingestion performance: massive parallel processing
• Auditability: full historization
• Adaptability:
  • Business rules can change
  • Master data management maturity can evolve
  • Source system landscape can change
22. Data Quality
23. Our first challenge was improving information quality and data processes…
What is the best way to save the fish?
Filter the stream to clean the water?
or
Find and eliminate the sources of pollution?
25. Data quality has three dimensions: definition, content and presentation.
» Data Definition Quality
• The extent to which the data definition accurately describes the data of the real-world entity type or fact type the data represents and meets the needs of all information users (Larry English 1999);
• Clear, precise and complete definition and business rules;
• Data definition quality is measured using metadata.
» Data Content Quality
• A measure of the quality of the data stored in systems;
• The correctness of data values: conformance to the defined and approved business rules, and the accuracy of data;
• Data content quality is measured using validation and verification checks that are developed using the business rules and other criteria specified in the data dictionary.
» Data Presentation Quality
• A way of explaining the available data;
• Transforming the data material into a useful information product that is accessible when needed.
26. Data profiling is an important tool organizations can use to improve data quality.
» More Complete information
» More Accurate information
» More Consistent information
» More Timely information
» More Useful information
» More Standardized information
27. We measure completeness, accuracy, consistency…
Data Completeness
• Degree to which values are present in the attributes that require them.
• Metric: Percent of data fields having values entered in them
Data Accuracy
• A qualitative assessment of freedom from error
• Metric: Percent of values that are correct when compared to the actual value
Data Consistency
• Measures the degree to which a set of data satisfies a set of constraints regardless of the number of times it is replicated across files or tables
• Metric: Percent of matching values across tables and files
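The three metrics on this slide translate directly into percentages. The sketch below is one possible reading, assuming small in-memory records; the sample data, the "actual" values and the function names are hypothetical.

```python
# Hypothetical records from one table.
records = [
    {"id": 1, "name": "Alice", "country": "BE"},
    {"id": 2, "name": "", "country": "BE"},
    {"id": 3, "name": "Carol", "country": None},
    {"id": 4, "name": "Dave", "country": "NL"},
]
# Hypothetical trusted "actual" values, used for the accuracy check.
actual_country = {1: "BE", 2: "BE", 3: "FR", 4: "BE"}
# Hypothetical copy of the same attribute in a second table.
other_table_country = {1: "BE", 2: "BE", 3: None, 4: "DE"}


def percent(part, whole):
    """Percentage helper that tolerates an empty input set."""
    return 100.0 * part / whole if whole else 0.0


def completeness(rows, fields):
    """Percent of data fields having values entered in them."""
    values = [row[f] for row in rows for f in fields]
    filled = [v for v in values if v not in (None, "")]
    return percent(len(filled), len(values))


def accuracy(rows, field, truth):
    """Percent of values that are correct compared to the actual value."""
    correct = sum(1 for row in rows if row[field] == truth[row["id"]])
    return percent(correct, len(rows))


def consistency(rows, field, other):
    """Percent of matching values across two tables."""
    matching = sum(1 for row in rows if row[field] == other[row["id"]])
    return percent(matching, len(rows))


print(completeness(records, ["name", "country"]))
print(accuracy(records, "country", actual_country))
print(consistency(records, "country", other_table_country))
```

Note that consistency only says the two copies agree: record 3 is missing in both tables, so it counts as consistent while still hurting completeness and accuracy.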
28. Timeliness, uniqueness and standardization to guide our data cleaning process.
Data Timeliness
• Measures the degree to which data values are up-to-date. Also measures the effectiveness of data provisioning relative to its need.
• Metric: Percent of data available within a specified threshold timeframe
Data Uniqueness
• The state of being the only one of its kind.
• Metric: Percent of records having a unique key
Data Standardization
• Measures the degree to which formats are consistent for data items sharing common characteristics, such as date fields.
• Metric: Percent of fields with like characteristics utilizing a common format
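These three metrics can be sketched in the same way. The rows, the seven-day threshold and the eight-digit date pattern below are assumptions chosen for illustration, not values from the presentation.

```python
import re
from collections import Counter
from datetime import date, timedelta

# Hypothetical rows: a key, the day the row was delivered, and a date field.
rows = [
    {"key": "A", "loaded": date(2020, 2, 12), "begin_date": "20050701"},
    {"key": "B", "loaded": date(2020, 2, 1), "begin_date": "2005-05-14"},
    {"key": "B", "loaded": date(2020, 2, 13), "begin_date": "20050514"},
    {"key": "C", "loaded": date(2020, 1, 2), "begin_date": "2000101"},
]


def percent(part, whole):
    """Percentage helper that tolerates an empty input set."""
    return 100.0 * part / whole if whole else 0.0


def timeliness(rows, as_of, max_age):
    """Percent of rows delivered within the threshold timeframe."""
    fresh = sum(1 for r in rows if as_of - r["loaded"] <= max_age)
    return percent(fresh, len(rows))


def uniqueness(rows):
    """Percent of records whose key occurs exactly once."""
    counts = Counter(r["key"] for r in rows)
    unique = sum(1 for r in rows if counts[r["key"]] == 1)
    return percent(unique, len(rows))


def standardization(rows, field, pattern):
    """Percent of values in a field that follow the common format."""
    ok = sum(1 for r in rows if re.fullmatch(pattern, r[field]))
    return percent(ok, len(rows))


print(timeliness(rows, as_of=date(2020, 2, 13), max_age=timedelta(days=7)))
print(uniqueness(rows))
print(standardization(rows, "begin_date", r"\d{8}"))
```

The format check is deliberately syntactic: an eight-digit string passes even if it is not a real calendar date, so a separate validity check is still needed during cleansing.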
29. The Data Quality dashboard acts as an instrument for the data steward to fulfil his/her role.
• Stewards should be considered data subject-matter experts for their respective business functions and processes.
• Stewards are responsible for guiding the effort, not necessarily executing it themselves.
• Their role as stewards should be to guide and influence others in implementing the changes necessary to improve data quality. They should be viewed as the leaders of the data quality improvement effort, not necessarily the "doers."
• Stewards should define and monitor quality measures to justify the program but also must have specific goals for data quality improvement.
• Stewards must be accountable.
• Stewardship should be based on manageable subsets of data.
30. Data Quality improvement is only successful if you can optimize the link between people, process, data, technology and tools.
People: The data steward (business) and data curator (IT) are responsible for delivering trusted data to the information users.
Process: We support data handling in the different projects but also an overall program to streamline all data activities.
Data: Data Glossary and Data Dictionary are still important, but the end goal must be a data catalog. It informs information users about available data, metadata and context.
Technology & Tools: Ideally you have a typical metadata tool to support your data strategy. You need to find a tool which fits in your overall architecture and approach.
31. Data Architecture
32. Our Data Strategy fits in the designed Architecture for Data Warehousing, …
[Architecture diagram]
Data Sources: OLTP; semi-structured and unstructured data; APIs; Rules Engine; Queue / ESB
Loading: Batch, Near Real Time
Data Warehouse (RDBMS, Hadoop / NoSQL, Other), with Master Data Management:
• Staging: Staging / Loading Area
• Integration: Raw Data Vault, Business Data Vault
• Presentation: Raw Data Mart, Information Mart
Rule labels on the flows: Hard Rules, Soft Rules
Use Cases: BI, analytics, cubes, reports; services, APIs; labs, exploration; analytics, data science
33. …and data quality checks were executed in the integration layer.
[Same architecture diagram as on the previous slide: Data Sources, Staging, Integration (Raw Data Vault, Business Data Vault), Presentation, Use Cases, with Master Data Management]
34. Data Lake
[Architecture diagram]
Layers:
• Staging Area (CBG Ingestion Layer)
• Raw Data Vault (CBG Logic Layer)
• Business Data Vault (CBG Storage Layer)
• Information marts (CBG Reporting Layer)
Other components: Internal Source Systems, External Source Systems, SAP, SAP BW/4HANA, (Semi-)Unstructured Data, Gateways, API Management, Data Labs, Data Catalog
35. A data driven approach is the end goal in our automated data quality process, …
[Diagram: Data Base with rules; Rules Engine (generic program); Program; Result (simple dashboard)]
Rules table:
Rulenr  Database  Field     Rule    Combine
1200    Customer  Custmr            NA
1201    Product   Prodnr    98105   AND
1201    Product   Prodtype  Direct  AND
Generated query:
Select &Field&
From &Database&
Where Prodnr = '98105'
And Prodtype = 'Direct'
Result:
Product  R1200  R1201  R9999
98105    0      0      0
124195   0      1      0
98105    0      0      0
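A rules engine of this kind, where the checks live in a table and a generic program turns them into queries, might be sketched as below. This is an illustration under several assumptions: the rule rows mirror rule 1201 from the slide, SQLite stands in for the real database, a flag of 1 is taken to mean "row does not satisfy the rule" (the slide does not define the flag semantics), and `Prodnr` is assumed to be the key column.

```python
import sqlite3

# Hypothetical rules table modelled on the slide:
# (rule number, table, field, expected value, combine operator).
RULES = [
    (1201, "Product", "Prodnr", "98105", "AND"),
    (1201, "Product", "Prodtype", "Direct", "AND"),
]


def build_check(rulenr, rules):
    """Generate one SQL statement that flags rows not satisfying a rule."""
    parts = [r for r in rules if r[0] == rulenr]
    table, combine = parts[0][1], parts[0][4]
    if combine not in ("AND", "OR"):
        raise ValueError(f"unsupported combine operator: {combine}")
    # Table and field names are interpolated, so only plain identifiers pass.
    for _, tbl, field, _, _ in parts:
        if not (tbl.isidentifier() and field.isidentifier()):
            raise ValueError("table and field names must be plain identifiers")
    condition = f" {combine} ".join(f"{field} = ?" for _, _, field, _, _ in parts)
    sql = (
        f"SELECT Prodnr, CASE WHEN {condition} THEN 0 ELSE 1 END "
        f"FROM {table} ORDER BY Prodnr"
    )
    # Rule values are passed as bound parameters, never spliced into the SQL.
    params = [value for _, _, _, value, _ in parts]
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Product (Prodnr TEXT, Prodtype TEXT)")
conn.executemany(
    "INSERT INTO Product VALUES (?, ?)",
    [("98105", "Direct"), ("124195", "Direct")],
)

sql, params = build_check(1201, RULES)
flags = conn.execute(sql, params).fetchall()
print(sql)
print(flags)
```

Adding a new check then means inserting rows into the rules table rather than changing the program, which is the point of the data-driven approach.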
36. …and is essential to minimize (or eliminate) scrap and rework.
» Data Cleansing is part of a technical process, and ensures that the data integrated into the data warehouse undergoes transformations to improve the quality:
• Reduce data overlap and data redundancy
• Complete records
• Correct inaccurate data fields
• Adjust data formatting
• Complete empty data
• Enforce referential integrity
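Several of these cleansing steps can be combined in one pass, as in the sketch below. The rows, the default status "Unknown", the accepted date formats and the set of known products are all hypothetical; a real pipeline would take them from the business rules and the reference data.

```python
from datetime import datetime

# Hypothetical raw rows with typical defects.
raw = [
    {"product": "65", "begin_date": "19960104", "status": "Active"},
    {"product": "65", "begin_date": "19960104", "status": "Active"},    # duplicate
    {"product": "66", "begin_date": "1996-12-31", "status": ""},        # other format, empty status
    {"product": "67", "begin_date": "2000101", "status": "Suspended"},  # invalid date
]


def normalize_date(text):
    """Return an ISO date string, or None when the value is not a valid date."""
    digits = text.replace("-", "")
    # Require exactly eight digits so short values are not silently reparsed.
    if len(digits) != 8 or not digits.isdigit():
        return None
    try:
        return datetime.strptime(digits, "%Y%m%d").strftime("%Y-%m-%d")
    except ValueError:
        return None


def cleanse(rows, known_products):
    """Deduplicate, fill, reformat and reference-check rows."""
    seen = set()
    clean, rejected = [], []
    for row in rows:
        key = (row["product"], row["begin_date"], row["status"])
        if key in seen:  # reduce redundancy: drop exact duplicates
            continue
        seen.add(key)
        fixed = dict(row)
        fixed["status"] = row["status"] or "Unknown"  # complete empty data
        fixed["begin_date"] = normalize_date(row["begin_date"])  # adjust formatting
        # Reject rows with an invalid date or a broken product reference.
        if fixed["begin_date"] is None or row["product"] not in known_products:
            rejected.append(row)
        else:
            clean.append(fixed)
    return clean, rejected


clean, rejected = cleanse(raw, known_products={"65", "66", "67"})
print(clean)
print(rejected)
```

Rejected rows are kept rather than dropped so they can be reported back to the source owner, in line with finding and eliminating the source of the problem instead of only filtering it.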
37. Finally, data quality is embedded into data governance and needs a cyclical process
[Cycle diagram: Rules, Action Plans, Data Validation, Communication, Governance]
• Embed Data Quality in your daily work
• Do it right the first time
• Assess and analyse Root Causes
• Improve Data Quality
• Communicate and gain trust
• Involve & Train
38. Thank You
Data Warehouse Automation
Free membership
DV-Community.org
Ivan Schotsmans
+32 495 55 1907
ischotsm@dv-community.org
https://www.dv-community.org/
https://www.bi-community.org/
DWA – Day
Thursday 13 Feb 2020
Belgium