Good data is like good water: best served fresh, and ideally well-filtered. Data Management strategies can produce tremendous procedural improvements and increased profit margins across the board, but only if the data being managed is of a high quality. Determining how Data Quality should be engineered provides a useful framework for utilizing Data Quality management effectively in support of business strategy, which in turn allows for speedy identification of business problems, delineation between structural and practice-oriented defects in Data Management, and proactive prevention of future issues.
Over the course of this webinar, we will:
Help you understand foundational Data Quality concepts based on “The DAMA Guide to the Data Management Body of Knowledge” (DAMA DMBOK), as well as guiding principles, best practices, and steps for improving Data Quality at your organization
Demonstrate how chronic business challenges for organizations are often rooted in poor Data Quality
Share case studies illustrating the hallmarks and benefits of Data Quality success
Data Quality Strategies
1. Data Quality Strategies
From Data Duckling to Successful Swan
Peter Aiken, Ph.D.
• DAMA International President 2009-2013 / 2018
• DAMA International Achievement Award 2001
(with Dr. E. F. "Ted" Codd)
• DAMA International Community Award 2005
Copyright 2018 by Data Blueprint Slide # !2
Peter Aiken, Ph.D.
• I've been doing this a long time
• My work is recognized as useful
• Associate Professor of IS (vcu.edu)
• Founder, Data Blueprint (datablueprint.com)
• DAMA International (dama.org)
• 10 books and dozens of articles
• Experienced w/ 500+ data
management practices worldwide
• Multi-year immersions
– US DoD (DISA/Army/Marines/DLA)
– Nokia
– Deutsche Bank
– Wells Fargo
– Walmart
– …
PETER AIKEN WITH JUANITA BILLINGS
FOREWORD BY JOHN BOTTEGA
MONETIZING
DATA MANAGEMENT
Unlocking the Value in Your Organization’s
Most Important Asset.
1. Data Quality in Context of Data Management
2. DQE Definition
3. DQE Cycle & Contextual Complications
4. DQ Causes and Dimensions
5. Quality and the Data Life Cycle
6. DQE Tool Sets
7. Takeaways and Q&A
Data Quality Strategies
Our barn had to pass a foundation inspection
• Before further construction could proceed
• No IT equivalent
3. Maslow's Hierarchy of Needs
Data Management Practices Hierarchy
You can accomplish Advanced Data Practices without becoming proficient in the Foundational Data Practices; however, this will:
• Take longer
• Cost more
• Deliver less
• Present greater risk
(with thanks to Tom DeMarco)
Advanced Data Practices (Technologies)
• MDM
• Mining
• Big Data
• Analytics
• Warehousing
• SOA
Foundational Data Practices (Capabilities)
• Data Platform/Architecture
• Data Governance
• Data Quality
• Data Operations
• Data Management Strategy
4. DMM℠ Structure of 5 Integrated DM Practice Areas
[Figure: five integrated practice areas (Data Management Strategy, Data Governance, Data Quality, Data Operations, Platform Architecture) plus Supporting Processes. Labels on the figure: Manage data coherently; Manage data assets professionally; Data life cycle management; Data architecture implementation; Maintain fit-for-purpose data, efficiently and effectively; Organizational support.]
6. Data Quality and Data Governance in Context
[Figure: relationships among Organizational Strategy, Data Strategy, Data Governance, and Data Quality. Labels on the figure:
• Data asset support for organizational strategy
• What the data assets do to support strategy (business goals)
• How well the data strategy is working (metadata)
• Governance of quality aspects of data assets
• Evolutionary feedback about the current focus]
7. A Model Specifying Relationships Among Important Terms
[Built on definition by Dan Appleton 1983]
[Figure: diagram relating Fact, Meaning, Data, Request, Information, Use, and Intelligence]
1. Each FACT combines with one or more MEANINGS.
2. Each specific FACT and MEANING combination is referred to as a DATUM.
3. An INFORMATION is one or more DATA that are returned in response to a specific REQUEST.
4. INFORMATION REUSE is enabled when one FACT is combined with more than one MEANING.
5. INTELLIGENCE is INFORMATION associated with its USES.
Wisdom & knowledge are often used synonymously
Definitions
• Quality Data
– Fit for purpose: meets the requirements of its authors, users,
and administrators (adapted from Martin Eppler)
– Synonymous with information quality, since poor data quality
results in inaccurate information and poor business performance
• Data Quality Management
– Planning, implementation and control activities that apply quality
management techniques to measure, assess, improve, and
ensure data quality
– Entails the "establishment and deployment of roles, responsibilities
concerning the acquisition, maintenance, dissemination, and
disposition of data" http://www2.sas.com/proceedings/sugi29/098-29.pdf
✓ Critical supporting process from change management
✓ Continuous process for defining acceptable levels of data quality to meet
business needs and for ensuring that data quality meets these levels
• Data Quality Engineering
– Recognition that data quality solutions cannot be managed but must be engineered
– Engineering is the application of scientific, economic, social, and practical knowledge
in order to design, build, and maintain solutions to data quality challenges
– Engineering concepts are generally not known and understood within IT or business!
Spinach/Popeye story from http://it.toolbox.com/blogs/infosphere/spinach-how-a-data-quality-mistake-created-a-myth-and-a-cartoon-character-10166
8. Improving Data Quality during System Migration
• Challenge
– Millions of NSN/SKUs
maintained in a catalog
– Key and other data stored in
clear text/comment fields
– Original suggestion was manual
approach to text extraction
– Left the data structuring problem unsolved
• Solution
– Proprietary, improvable text extraction process
– Converted non-tabular data into tabular data
– Saved a minimum of $5 million
– Literally person-centuries of work
Week #   Unmatched Items (% Total)   Ignorable Items (% Total)   Items Matched (% Total)
1        31.47%                      1.34%                       N/A
2        21.22%                      6.97%                       N/A
3        20.66%                      7.49%                       N/A
4        32.48%                      11.99%                      55.53%
…        …                           …                           …
14       9.02%                       22.62%                      68.36%
15       9.06%                       22.62%                      68.33%
16       9.53%                       22.62%                      67.85%
17       9.5%                        22.62%                      67.88%
18       7.46%                       22.62%                      69.92%
Determining Diminishing Returns
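One way to read the table for diminishing returns is to look at the average weekly gain in matched items between the weeks that are reported. The sketch below uses only the matched percentages shown on the slide (weeks 5 to 13 are elided in the source and are not filled in). The 0.5-point threshold and the names `weekly_gain` and `THRESHOLD` are illustrative assumptions, not part of the original project.

```python
# Matched-item percentages by week, copied from the slide's table.
# Weeks 5-13 are elided in the source, so they are left out here.
matched_pct = {4: 55.53, 14: 68.36, 15: 68.33, 16: 67.85, 17: 67.88, 18: 69.92}


def weekly_gain(series):
    """Average gain, in percentage points per week, between consecutive observations."""
    weeks = sorted(series)
    gains = {}
    for prev, cur in zip(weeks, weeks[1:]):
        gains[cur] = (series[cur] - series[prev]) / (cur - prev)
    return gains


gains = weekly_gain(matched_pct)

# Hypothetical stopping rule: flag weeks whose gain is under 0.5 points per week.
THRESHOLD = 0.5
flat_weeks = [week for week, gain in gains.items() if gain < THRESHOLD]

for week, gain in gains.items():
    print(f"week {week}: {gain:+.2f} points/week")
print("weeks below threshold:", flat_weeks)
```

A run of consecutive low-gain weeks is the kind of signal that suggests further automated matching effort may no longer pay for itself.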
9. Time needed to review all NSNs once over the life of the project:
NSNs 2,000,000
Average time to review & cleanse (in minutes) 5
Total Time (in minutes) 10,000,000
Time available per resource over a one year period of time:
Work weeks in a year 48
Work days in a week 5
Work hours in a day 7.5
Work minutes in a day 450
Total Work minutes/year 108,000
Person years required to cleanse each NSN once prior to migration:
Minutes needed 10,000,000
Minutes available person/year 108,000
Total Person-Years 92.6
Resource Cost to cleanse NSN's prior to migration:
Avg Salary for SME year (not including overhead) $60,000.00
Projected Years Required to Cleanse/Total DLA Person Year Saved 93
Total Cost to Cleanse/Total DLA Savings to Cleanse NSN's: $5.5 million
Quantitative Benefits
Time needed to review all NSNs once over the life of the project:
NSNs 150,000
Average time to review & cleanse (in minutes) 5
Total Time (in minutes) 750,000
Time available per resource over a one year period of time:
Work weeks in a year 48
Work days in a week 5
Work hours in a day 7.5
Work minutes in a day 450
Total Work minutes/year 108,000
Person years required to cleanse each NSN once prior to migration:
Minutes needed 750,000
Minutes available person/year 108,000
Total Person-Years 7
Resource Cost to cleanse NSN's prior to migration:
Avg Salary for SME year (not including overhead) $60,000.00
Projected Years Required to Cleanse/Total DLA Person Year Saved 7
Total Cost to Cleanse/Total DLA Savings to Cleanse NSN's: $420,000
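The arithmetic behind these estimates can be reproduced with a short calculation. This is a minimal sketch using the figures from the slides; the function name `cleansing_estimate` and its parameter names are assumptions made for illustration. Note that the slides round person-years to a whole number before multiplying by salary, so the unrounded cost here differs slightly from the quoted totals.

```python
def cleansing_estimate(nsn_count, minutes_per_nsn=5, weeks_per_year=48,
                       days_per_week=5, hours_per_day=7.5, salary_per_year=60_000):
    """Return (total minutes, minutes per person-year, person-years, unrounded cost)."""
    total_minutes = nsn_count * minutes_per_nsn
    minutes_per_person_year = weeks_per_year * days_per_week * hours_per_day * 60
    person_years = total_minutes / minutes_per_person_year
    cost = person_years * salary_per_year
    return total_minutes, minutes_per_person_year, person_years, cost


full = cleansing_estimate(2_000_000)    # manual review of every NSN
reduced = cleansing_estimate(150_000)   # manual review of the reduced set

for label, (minutes, per_year, years, cost) in (("2,000,000 NSNs", full),
                                                ("150,000 NSNs", reduced)):
    print(f"{label}: {minutes:,} min / {per_year:,.0f} min per person-year "
          f"= {years:.1f} person-years, about ${cost:,.0f}")
```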
Data Quality Misconceptions
• You can fix the data
• Data quality is an IT problem
• The problem is in the data sources or data entry
• The data warehouse will provide a single version of the truth
• The new system will provide a single version of the truth
• Standardization will eliminate the problem of different "truths"
represented in the reports or analysis
Source: Business Intelligence solutions, Athena Systems
11. • It was six men of Indostan, To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind.
• The First approached the Elephant,
And happening to fall
Against his broad and sturdy side,
At once began to bawl:
"God bless me! but the Elephant
Is very like a wall!"
• The Second, feeling of the tusk
Cried, "Ho! what have we here,
So very round and smooth and sharp? To me 'tis mighty clear
This wonder of an Elephant
Is very like a spear!"
• The Third approached the animal,
And happening to take
The squirming trunk within his hands, Thus boldly up he spake:
"I see," quoth he, "the Elephant
Is very like a snake!"
• The Fourth reached out an eager hand, And felt about the knee:
"What most this wondrous beast is like Is mighty plain," quoth he;
"'Tis clear enough the Elephant
Is very like a tree!"
• The Fifth, who chanced to touch the ear, Said: "E'en
the blindest man
Can tell what this resembles most;
Deny the fact who can,
This marvel of an Elephant
Is very like a fan!"
• The Sixth no sooner had begun
About the beast to grope,
Than, seizing on the swinging tail
That fell within his scope.
"I see," quoth he, "the Elephant
Is very like a rope!"
• And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong!
The Blind Men and the Elephant
(Source: John Godfrey Saxe's ( 1816-1887) version of the famous Indian legend )
No universal conception of data quality exists; instead, many differing perspectives compete
• Problem:
– Most organizations approach
data quality problems in the same way
that the blind men approached the elephant - people tend to see only the data
that is in front of them
– Little cooperation across boundaries, just as the blind men were unable to
convey their impressions about the elephant to recognize the entire entity.
– Leads to confusion, disputes and narrow views
• Solution:
– Data quality engineering can help achieve a more complete picture and facilitate
cross boundary communications
12. Quality Data is ...
Fit For Purpose
Famous Words?
• Question:
– Why haven't organizations taken a
more proactive approach to data quality?
• Answer:
– Fixing data quality problems is not easy
– It is dangerous -- they'll come after you
– Your efforts are likely to be misunderstood
– You could make things worse
– Now you get to fix it
• A single data quality
issue can grow
into a significant,
unexpected
investment
14. Four ways to make your data sparkle!
1. Prioritize the task
– Cleaning data is costly and time-consuming
– Identify mission critical/non-mission critical data
2. Involve the data owners
– Seek input of business units on what constitutes "dirty" data
3. Keep future data clean
– Incorporate processes and technologies that check every zip code and area code
4. Align your staff with business
– Align IT staff with business units
(Source: CIO JULY 1 2004)
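As an illustration of the third point, a check on zip codes and area codes can run at the moment a record is captured. The sketch below assumes US formats (a 5-digit ZIP with an optional +4 suffix, and a 3-digit area code whose first digit is 2 to 9). The field names `zip` and `area_code` and the function `check_record` are hypothetical. It validates format only; it does not confirm that a code is actually assigned.

```python
import re

# Assumed US formats: 5-digit ZIP with optional +4 suffix, and a
# 3-digit area code whose first digit is 2-9.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")
AREA_CODE_RE = re.compile(r"[2-9]\d{2}")


def check_record(record):
    """Return the names of the fields that fail their format check."""
    problems = []
    if not ZIP_RE.fullmatch(record.get("zip", "")):
        problems.append("zip")
    if not AREA_CODE_RE.fullmatch(record.get("area_code", "")):
        problems.append("area_code")
    return problems


print(check_record({"zip": "23060", "area_code": "804"}))
print(check_record({"zip": "2306", "area_code": "104"}))
```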
Structured Data Quality Engineering
1. Allow the form of the problem to guide the form of the solution
2. Provide a means of
decomposing the problem
3. Feature a variety of tools
simplifying system understanding
4. Offer a set of strategies for evolving a design solution
5. Provide criteria for evaluating the quality of the various solutions
6. Facilitate development of a framework for developing
organizational knowledge.
15. The DQE Cycle
• Deming cycle
• "Plan-do-study-act" or
"plan-do-check-act"
1. Identifying data issues that are
critical to the achievement of
business objectives
2. Defining business requirements for
data quality
3. Identifying key data quality
dimensions
4. Defining business rules critical to
ensuring high quality data
The DQE Cycle: (1) Plan
• Plan for the assessment of the
current state and identification
of key metrics for measuring
quality
• The data quality engineering
team assesses the scope of
known issues
– Determining cost and impact
– Evaluating alternatives for
addressing them
16. The DQE Cycle: (2) Deploy
• Deploy processes for measuring
and improving the quality of
data:
• Data profiling
– Institute inspections and monitors to
identify data issues when they occur
– Fix flawed processes that are the root
cause of data errors or correct errors
downstream
– When it is not possible to correct
errors at their source, correct them at
their earliest point in the data flow
The DQE Cycle: (3) Monitor
• Monitor the quality of data as
measured against the defined
business rules
• If data quality meets defined
thresholds for acceptability,
the processes are in control
and the level of data quality
meets the business
requirements
• If data quality falls below
acceptability thresholds,
notify data stewards so they
can take action during the
next stage
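The monitoring step can be sketched as a comparison of measured conformance against each rule's acceptability threshold, with a notification hook for data stewards. The rule names, the sample numbers, and the `notify` callback below are all hypothetical; a real deployment would route the notification to whatever channel the stewards use.

```python
def monitor(measurements, thresholds, notify):
    """Compare each measured conformance rate with its acceptability threshold.

    Returns the rules that fell below threshold and calls notify() for each.
    """
    out_of_control = []
    for rule, rate in measurements.items():
        if rate < thresholds[rule]:
            out_of_control.append(rule)
            notify(rule, rate, thresholds[rule])
    return out_of_control


# Hypothetical measurements and thresholds for two business rules.
alerts = []
result = monitor(
    {"zip_format": 0.97, "customer_id_present": 0.999},
    {"zip_format": 0.99, "customer_id_present": 0.995},
    notify=lambda rule, rate, limit: alerts.append((rule, rate, limit)),
)
print(result)
print(alerts)
```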
17. The DQE Cycle: (4) Act
• Act to resolve any identified
issues to improve data
quality and better meet
business expectations
• New cycles begin as new
data sets come under
investigation or as new data
quality requirements are
identified for existing data
sets
DQE Context & Engineering Concepts
• Can rules be implemented stating that no data can be corrected
unless the source of the error has been discovered and
addressed?
• All data must be 100% perfect?
• Pareto
– 80/20 rule
– Not all data is of equal importance
• Scientific, economic, social, and practical knowledge
Two Distinct Activities Support Quality Data
• Data quality best practices depend on both
– Practice-oriented activities
– Structure-oriented activities
• Practice-oriented activities focus on the capture and manipulation of data
• Structure-oriented activities focus on the data implementation
• Both support quality data
19. Practice-Oriented Activities
• Stem from a failure to apply rigor when capturing/manipulating data, using safeguards such as:
– Edit masking
– Range checking of input data
– CRC-checking of transmitted data
• Affect the Data Value Quality and Data Representation Quality
• Examples of improper practice-oriented activities:
– Allowing imprecise or incorrect data to be collected when requirements specify
otherwise
– Presenting data out of sequence
• Typically diagnosed in bottom-up manner: find and fix the resulting
problem
• Addressed by imposing
more rigorous
data-handling/governance
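Two of the safeguards named above, range checking of input data and CRC-checking of transmitted data, can be sketched in a few lines. This is an illustrative example only: the function names are assumptions, the payload is made up, and CRC-32 from Python's standard `zlib` module stands in for whatever checksum a given transmission protocol actually uses. Edit masking is not shown.

```python
import zlib


def in_range(value, low, high):
    """Range check: accept a value only if it lies within [low, high]."""
    return low <= value <= high


def with_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(message: bytes) -> bool:
    """Recompute the CRC-32 of the payload and compare it with the trailer."""
    payload, received = message[:-4], message[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


message = with_crc(b"qty=42")
corrupted = b"qty=43" + message[-4:]  # payload altered in transit, old checksum kept

print(in_range(42, 0, 100), in_range(420, 0, 100))
print(crc_ok(message), crc_ok(corrupted))
```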
Practice-oriented activities
Quality of Data
Values
Quality of Data
Representation
Knee Surgery
20. Structure-Oriented Activities
• Occur because of data and metadata that have been arranged
imperfectly. For example:
– When the data is in the system but we just can't access it;
– When a correct data value is provided as the wrong response to a query; or
– When data is not provided because it is unavailable or inaccessible
• Developer focus within system boundaries instead of within
organization boundaries
• Affect the Data Model Quality and Data Architecture Quality
• Examples of improper structure-oriented activities:
– Providing a correct response but incomplete data to a query because the user
did not comprehend the system data structure
– Costly maintenance of inconsistent data used by redundant systems
• Typically diagnosed in
top-down manner: root
cause fixes
• Addressed through
fundamental data structure
governance
Quality of
Data Models
Quality of
Data Architecture
Structure-oriented activities
New York Turns to Data to
Solve Big Tree Problem
• NYC
– 2,500,000 trees
• 11-months from 2009 to 2010
– 4 people were killed or seriously injured by falling tree limbs in
Central Park alone
• Belief
– Arborists believe that pruning and otherwise maintaining trees can keep them
healthier and make them more likely to withstand a storm, decreasing the
likelihood of property damage, injuries and deaths
• Until recently
– No research or data to back it up
http://www.computerworld.com/s/article/9239793/New_York_Turns_to_Big_Data_to_Solve_Big_Tree_Problem?source=CTWNLE_nlt_datamgmt_2013-06-05
21. NYC's Big Tree Problem
• Question
– Does pruning trees in one year reduce the
number of hazardous tree conditions in the
following year?
• Lots of data but granularity challenges
– Pruning data recorded block by block
– Cleanup data recorded at the address level
– Trees have no unique identifiers
• After downloading, cleaning, merging, analyzing and intensive
modeling
– Pruning trees for certain types of hazards caused a 22 percent reduction in the
number of times the department had to send a crew for emergency cleanups
• The best data analysis
– Generates further questions
• NYC cannot prune each block every year
– Building block risk profiles: number of trees, types of trees, whether the block is in
a flood zone or storm zone
Quality Dimensions
22. 4 Dimensions of Data Quality
An organization’s overall data quality is a function of four
distinct components, each with its own attributes:
• Data Value: the quality of data as stored & maintained in
the system
• Data Representation – the quality of representation for
stored values; perfect data values stored in a system that
are inappropriately represented can be harmful
• Data Model – the quality of data logically representing
user requirements related to data entities, associated
attributes, and their relationships; essential for effective
communication among data suppliers and consumers
• Data Architecture – the coordination of data
management activities in cross-functional system
development and operations
(Data Value and Data Representation are practice-oriented; Data Model and Data Architecture are structure-oriented.)
Effective Data Quality Engineering
• Data quality engineering has been focused on operational problem
correction
– Directing attention to practice-oriented data imperfections
• Data quality engineering is more effective when also focused on
structure-oriented causes
– Ensuring the quality of shared data across system boundaries
• Data Representation Quality: as presented to the user
• Data Value Quality: as maintained in the system
• Data Model Quality: as understood by developers
• Data Architecture Quality: as an organizational asset
(ordered from closer to the user to closer to the architect)
23. Full Set of Data Quality Attributes
Difficult to obtain leverage at the bottom of the falls
24. Frozen Falls
25. Traditional Quality Life Cycle
[Figure: data acquisition activities, data storage, data usage activities]
Data Life Cycle Model Products
[Figure: a cycle linking Metadata Creation, Metadata Structuring, Data Creation, Data Storage, Data Utilization, Data Manipulation, Data Assessment, Data Refinement, and Metadata Refinement. Products passed between phases: data architecture & models; populated data models and storage locations; data values; value defects; structure defects; architecture refinements; model refinements; restored data.]
26. Data Life Cycle Model: Quality Focus
[Figure: the same cycle labeled with the quality concern at each phase: architecture quality, model quality, architecture & model quality, value quality, representation quality.]
Extended data life cycle model with metadata sources and uses
Metadata Refinement
• Correct Structural Defects
• Update Implementation
Metadata Creation
• Define Data Architecture
• Define Data Model Structures
Metadata Structuring
• Implement Data Model Views
• Populate Data Model Views
Data Refinement
• Correct Data Value Defects
• Re-store Data Values
Data Manipulation
• Manipulate Data
• Update Data
Data Utilization
• Inspect Data
• Present Data
Data Creation
• Create Data
• Verify Data Values
Data Assessment
• Assess Data Values
• Assess Metadata
[Figure labels: Metadata & Data Storage; starting point for new system development; starting point for existing systems; flows of data performance metadata, data architecture, data architecture and data models, shared data, updated data, corrected data, architecture refinements, facts & meanings.]
Profile, Analyze and Assess DQ
• Data assessment using 2 different approaches:
– Bottom-up
– Top-down
• Bottom-up assessment:
– Inspection and evaluation of the data sets
– Highlight potential issues based on the
results of automated processes
• Top-down assessment:
– Engage business users to document
their business processes and the
corresponding critical data dependencies
– Understand how their processes
consume data and which data elements
are critical to the success of the business
applications
28. Define DQ Measures
• Measures development occurs as part of the strategy/design/plan
step
• Process for defining data quality measures:
1. Select one of the identified critical business impacts
2. Evaluate the dependent data elements and the create and update processes associated
with that business impact
3. List any associated data requirements
4. Specify the associated dimension of data quality and one or more business rules
to use to determine conformance of the data to expectations
5. Describe the process for measuring conformance
6. Specify an acceptability threshold
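The six steps above map naturally onto a small record that ties a business impact to a data element, a quality dimension, a business rule, and a threshold. The sketch below is one possible shape, not a prescribed one: the class `QualityMeasure`, its field names, the example impact "Returned shipments", and the 95% threshold are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QualityMeasure:
    business_impact: str           # step 1: the critical business impact
    data_element: str              # steps 2-3: the dependent element and requirement
    dimension: str                 # step 4: the associated quality dimension
    rule: Callable[[dict], bool]   # step 4: business rule applied to one record
    threshold: float               # step 6: acceptability threshold (0..1)

    def conformance(self, records):
        """Step 5: share of records that satisfy the business rule."""
        if not records:
            return 1.0
        passed = sum(1 for record in records if self.rule(record))
        return passed / len(records)

    def acceptable(self, records):
        return self.conformance(records) >= self.threshold


# Hypothetical measure: ZIP codes must be five digits in at least 95% of records.
zip_measure = QualityMeasure(
    business_impact="Returned shipments",
    data_element="zip",
    dimension="Data Value",
    rule=lambda r: len(r.get("zip", "")) == 5 and r["zip"].isdigit(),
    threshold=0.95,
)

sample = [{"zip": "23060"}, {"zip": "2306"}, {"zip": "10001"}, {"zip": "94105"}]
print(zip_measure.conformance(sample), zip_measure.acceptable(sample))
```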
Set and Evaluate DQ Service Levels
• Data quality inspection and
monitoring are used to
measure and monitor
compliance with defined
data quality rules
• Data quality SLAs specify
the organization’s expectations for response and remediation
• Operational data quality control defined in data quality SLAs
includes:
– Data elements covered by the agreement
– Business impacts associated with data flaws
– Data quality dimensions associated with each data element
– Quality expectations for each data element for each of the identified dimensions in
each application or system in the value chain
– Methods for measuring against those expectations
– (…)
29. Measure, Monitor & Manage DQ
• DQM procedures depend on
available data quality measuring
and monitoring services
• 2 contexts for control/measurement
of conformance to data quality
business rules exist:
– In-stream: collect in-stream measurements while creating data
– In batch: perform batch activities on collections of data instances assembled in a
data set
• Apply measurements at 3 levels of granularity:
– Data element value
– Data instance or record
– Data set
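The three levels of granularity can be illustrated with one check per level, each built on the one below it. This is a minimal sketch that uses a simple "value is present" rule; the function names and the sample rows are assumptions made for the example, and real rules would be richer.

```python
def element_ok(value):
    """Element level: a single value is present and not blank."""
    return value is not None and str(value).strip() != ""


def record_ok(record, required):
    """Record level: every required element in one record passes."""
    return all(element_ok(record.get(name)) for name in required)


def dataset_score(records, required):
    """Data set level: share of records that pass the record-level check."""
    if not records:
        return 1.0
    return sum(record_ok(r, required) for r in records) / len(records)


# Hypothetical rows: one complete, one with a blank name, one with a missing id.
rows = [
    {"id": 1, "name": "A"},
    {"id": 2, "name": " "},
    {"id": None, "name": "C"},
]
print(dataset_score(rows, ["id", "name"]))
```

The same functions could run in-stream (calling `record_ok` as each record is created) or in batch (calling `dataset_score` on an assembled data set).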
Overview: Data Quality Tools
• 4 categories of activities:
– Analysis
– Cleansing
– Enhancement
– Monitoring
• Principal tools:
– Data Profiling
– Parsing and Standardization
– Data Transformation
– Identity Resolution and Matching
– Enhancement
– Reporting
30. DQ Tool Set #1: Data Profiling
• Data profiling is the assessment of
value distribution and clustering of
values into domains
• Need to be able to distinguish
between good and bad data before
making any improvements
• Data profiling is a set of algorithms
for 2 purposes:
– Statistical analysis and assessment of the quality of data values within a data set
– Exploring relationships that exist between value collections within and across
data sets
• At its most advanced, data profiling takes a series of prescribed
rules from data quality engines. It then assesses the data,
annotates and tracks violations to determine if they comprise new
or inferred data quality rules
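At its simplest, the statistical side of profiling amounts to counting: how many values are missing, how many distinct values exist, and how the values are distributed. The sketch below profiles a single column with the standard library. The function name `profile_column` and the sample values are hypothetical, and commercial profiling tools go much further (cross-column relationships, rule inference).

```python
from collections import Counter


def profile_column(values):
    """Basic profile of one column: size, missing values, and value distribution."""
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(3),
    }


# Hypothetical street-type column with a null and an empty string.
profile = profile_column(["ST", "ST", None, "AVE", ""])
print(profile)
```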
DQ Tool Set #1: Data Profiling, cont’d
• Data profiling vs. data quality-business context and semantic/
logical layers
– Data quality is concerned with proscriptive rules
– Data profiling looks for patterns when rules are adhered to and when rules are
violated; able to provide input into the business context layer
• Incumbent that data profiling services notify all concerned parties
of whatever is discovered
• Profiling can be used to…
– …notify the help desk that valid
changes in the data are about to
cause an avalanche of “skeptical
user” calls
– …notify business analysts of
precisely where they should be
working today in terms of shifts
in the data
31. Courtesy GlobalID.com
DQ Tool Set #2: Parsing & Standardization
• Data parsing tools enable the definition
of patterns that feed into a rules engine
used to distinguish between valid
and invalid data values
• Actions are triggered upon matching
a specific pattern
• When an invalid pattern is recognized,
the application may attempt to
transform the invalid value into one that meets expectations
• Data standardization is the process of conforming to a set of
business rules and formats that are set up by data stewards and
administrators
• Data standardization example:
– Bringing all the different formats of “street” into a single format, e.g. “STR”, “ST.”,
“STRT”, “STREET”, etc.
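The street example can be sketched as a lookup that maps each recognized variant to one target form. The variant list comes from the slide; the choice of "ST" as the target and the function name `standardize_street` are assumptions for illustration. A production rule set would also need to handle ambiguity, since "ST" can mean "Saint" as well as "Street".

```python
# Variants listed on the slide, mapped to one assumed target form.
STREET_VARIANTS = {"STR", "ST", "STRT", "STREET"}
TARGET = "ST"


def standardize_street(address):
    """Upper-case the address and replace any street variant with the target form."""
    cleaned = []
    for token in address.upper().split():
        bare = token.rstrip(".,")  # ignore trailing punctuation when matching
        cleaned.append(TARGET if bare in STREET_VARIANTS else token)
    return " ".join(cleaned)


print(standardize_street("10124 W. Broad Street"))
print(standardize_street("1 Main Str."))
```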
32. DQ Tool Set #3: Data Transformation
• Upon identification of data
errors, trigger data rules to
transform the flawed data
• Perform standardization
and guide rule-based
transformations by
mapping data values in
their original formats and
patterns into a target
representation
• Parsed components of a
pattern are subjected to
rearrangement,
corrections, or any
changes as directed by the
rules in the knowledge
base
DQ Tool Set #4: Identity Resolution & Matching
• Data matching enables analysts to identify relationships between records for
de-duplication or group-based processing
• Matching is central to maintaining data consistency and integrity throughout
the enterprise
• The matching process should be used in
the initial migration of data into a
single repository
• 2 basic approaches to matching:
• Deterministic
– Relies on defined patterns/rules for assigning
weights and scores to determine similarity
– Predictable
– Dependent on what the rules developers anticipated
• Probabilistic
– Relies on statistical techniques for assessing the probability that any pair of records represents
the same entity
– Not reliant on rules
– Probabilities can be refined based on experience -> matchers can improve precision as more
data is analyzed
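The contrast between the two approaches can be hinted at in code. The first function below is a deterministic score: fixed weights are added for each field on which two records agree exactly. The second is only a stand-in for the probabilistic side: it averages string similarity from Python's `difflib`, which tolerates typos but is not a trained statistical model of match probability. The records, weights, and function names are all hypothetical.

```python
from difflib import SequenceMatcher


def deterministic_score(a, b, weights):
    """Sum the weights of the fields on which two records agree exactly."""
    return sum(w for field, w in weights.items() if a.get(field) == b.get(field))


def similarity_score(a, b, fields):
    """Average string similarity (0..1) across fields.

    A simple stand-in for fuzzy comparison, not a true probabilistic matcher.
    """
    ratios = [
        SequenceMatcher(None, str(a.get(f, "")), str(b.get(f, ""))).ratio()
        for f in fields
    ]
    return sum(ratios) / len(ratios)


# Hypothetical pair of records that probably describe the same person.
r1 = {"name": "Jon Smith", "zip": "23060"}
r2 = {"name": "John Smith", "zip": "23060"}

exact = deterministic_score(r1, r2, {"name": 0.6, "zip": 0.4})
fuzzy = similarity_score(r1, r2, ["name", "zip"])
print(exact, round(fuzzy, 3))
```

The exact-match rule credits only the ZIP code because the names differ by one letter, while the similarity score stays high. That gap is the reason rule-based matching depends on what its authors anticipated.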
33. DQ Tool Set #5: Enhancement
• Definition:
– A method for adding value to information by accumulating additional information
about a base set of entities and then merging all the sets of information to
provide a focused view. Improves master data.
• Benefits:
– Enables use of third party data sources
– Allows you to take advantage of the information and
research carried out by external data vendors to
make data more meaningful and useful
• Examples of data enhancements:
– Time/date stamps
– Auditing information
– Contextual information
– Geographic information
– Demographic information
– Psychographic information
DQ Tool Set #6: Reporting
• Good reporting supports:
– Inspection and monitoring of conformance to data quality expectations
– Monitoring performance of data stewards conforming to data quality SLAs
– Workflow processing for data quality incidents
– Manual oversight of data cleansing and correction
• Data quality tools provide dynamic reporting and monitoring
capabilities
• Enable analysts and data stewards to support and drive the
methodology for ongoing DQM and improvement with a single,
easy-to-use solution
• Associate report results with:
– Data quality measurement
– Metrics
– Activity
Guiding Principles
• Manage data as a core organizational asset.
• Identify a gold record for all data elements
• All data elements will have a standardized data
definition, data type, and acceptable value domain
• Leverage data governance for the control and performance of DQM
• Use industry and international data standards whenever possible
• Downstream data consumers specify data quality expectations
• Define business rules to assert conformance to data quality expectations
• Validate data instances and data sets against defined business rules
• Business process owners will agree to and abide by data quality SLAs
• Apply data corrections at the original source if possible
• If it is not possible to correct data at the source, forward data corrections
to the owner of the original source. Influence on data brokers to conform
to local requirements may be limited
• Report measured levels of data quality to appropriate data stewards,
business process owners, and SLA managers
35. Goals and Principles
• To measurably improve the quality of
data in relation to defined business
expectations
• To define requirements and
specifications for integrating data
quality control into the system
development life cycle
• To provide defined processes for
measuring, monitoring, and reporting
conformance to acceptable levels of
data quality
Summary: Data Quality Engineering
36. November Webinar:
Data Architecture v Data Modeling
November 12, 2018 @ 2:00 PM ET
December Webinar:
Exorcising The Seven Deadly Data Sins
December 11, 2018 @ 2:00 PM ET
EDW2019 - Boston
How I Learned to Stop Worrying and Love My Data Warehouse
March 18, 2019 @ 1:30 PM ET
Sign up for webinars at: www.datablueprint.com/webinar-schedule or at www.dataversity.net
Upcoming Events
39. Data Architecture Quality
Questions?
It’s your turn!
Use the chat feature or Twitter (#dataed) to submit
your questions to Peter now.
40. 10124 W. Broad Street, Suite C
Glen Allen, Virginia 23060
804.521.4056