A discussion of the five types of data and information quality defects and the ways in which they can arise. First given at a BCS meeting at Solent University 200803
Bcs 20080228 Ku
1. The quality of information
and data is strained
International Association for Information and Data Quality
Keith Underdown
Convenor, British Community of Practice
International Association for Information and Data Quality
2. Shameless Plug
International Association for
Information & Data Quality
www.iaidq.org
◦ Student Membership—$25
◦ Personal Membership—$85
◦ Corporate Membership Available
◦ Extensive Conference Discounts
www.justgiving.com/keithunderdown
◦ My fundraising page
◦ Reward me if you enjoy my
presentation
3. Data
“Everybody knows what data is”!
◦ “Define:data” in a Google search gives
41 results
◦ Mix of
“data processing” biased
Philosophical
Irrelevant (Data is an android in Star Trek: TNG)
My Preference:
A collection of facts held in a formalized manner suitable
for processing by automatic or human means.
4. Fundamental Data Quality
The facts in the case can be:
◦ Inaccurate
◦ Incomplete
◦ Inconsistent
◦ Invalid
◦ Incomprehensible
5. The Five “I’s”
Incomplete Data
◦ mandatory fields with null, empty string, etc…
Invalid Data
◦ values outside the allowed value set, or values that fail
tests against rules
Inconsistent Data
◦ intra-record inconsistency
◦ inter-record inconsistency
◦ inter-datastore inconsistency
Inaccurate Data:
◦ Statistical outliers & other “sore thumbs”
E.g. Price 10 times higher than similar models
Incomprehensible Data
◦ without full and accurate context
6. Incomplete Data
Facts essential to business process are
missing
◦ Implies that data validation is incorrect
◦ Often arises during bulk import of data
Data not immediately available so validation
relaxed
Follow-up not completed
Database field cannot be made mandatory
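Where the database cannot enforce a mandatory field, the check can be run in code instead. The following is a minimal sketch, assuming records are plain dictionaries; the field names and the `missing_fields` helper are hypothetical and not from the talk.

```python
# Hypothetical mandatory fields; a real list would come from the business process.
MANDATORY_FIELDS = ("name", "date_of_birth", "ssn")


def missing_fields(record, mandatory=MANDATORY_FIELDS):
    """Return the mandatory fields that are absent, None, or blank."""
    missing = []
    for field in mandatory:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


# Invented record with an empty string and a null in mandatory fields.
customer = {"name": "A. Customer", "date_of_birth": "", "ssn": None}
print(missing_fields(customer))  # ['date_of_birth', 'ssn']
```

Running such a check as a scheduled report is one way to make sure that relaxed validation is followed up.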
7. Example
Change in Law made knowledge of
Social Security number mandatory
◦ Too expensive to go to customers
◦ Populate at need
◦ Telephone agents used their own numbers
Customer failed to fill in DoB field
◦ Data entry clerk guessed!
◦ Customer has high value transaction
turned down
◦ Lots of adverse publicity
8. How can we avoid these?
Plan for their absence
◦ When creating new databases plan to
populate fields
◦ When bulk updates are required, bite the
bullet
◦ Ensure agents have time and
understand the need to collect data
Check for likely “cheats”
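One possible way to look for "cheats" is to flag values that turn up far more often than chance would suggest, such as a guessed date of birth or an agent's own number entered for many customers. This is a sketch only; the `suspicious_values` helper, its thresholds, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def suspicious_values(values, max_share=0.05, min_count=3):
    """Return values seen at least min_count times and in more than max_share of rows."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {
        value: n
        for value, n in counts.items()
        if n >= min_count and n / total > max_share
    }


# Invented sample: one date of birth is entered far too often.
dobs = ["1970-01-01"] * 6 + ["1982-05-17", "1964-11-30", "1991-03-08", "1975-09-22"]
print(suspicious_values(dobs))  # {'1970-01-01': 6}
```

The thresholds need tuning per field: a repeated surname is normal, a repeated supposedly unique identifier is not.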
9. Invalid Data
Data that fails genuine business rules
Or
Fails unstated real world validation
◦ Company name info spills over into
address fields
10. Examples
01222 535681 looks like a valid phone
no.
◦ But Cardiff is an exception
029 2053 5681
Human being might work it out
Power dialler won’t
02/03/08
◦ US=3rd February 08
◦ UK= 2nd March 08
◦ Which century?
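The ambiguity of 02/03/08 can be reproduced with Python's standard `datetime.strptime`. This is an illustrative sketch, not part of the original talk: the same string parses cleanly under both conventions, so syntax checking alone cannot tell which one was meant.

```python
from datetime import datetime

raw = "02/03/08"

us_reading = datetime.strptime(raw, "%m/%d/%y").date()  # month first
uk_reading = datetime.strptime(raw, "%d/%m/%y").date()  # day first

print(us_reading.isoformat())  # 2008-02-03
print(uk_reading.isoformat())  # 2008-03-02
```

The century is also a guess: Python's `%y` maps 00 to 68 onto 2000 to 2068 and 69 to 99 onto 1969 to 1999, which is a library convention rather than a fact about the data.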
11. How do we avoid these?
Make field syntax as tight as possible
◦ E.g. Always use date-stamp fields for
dates
◦ Use external validation systems
E.g. Postal Address File
◦ Use masks to validate input patterns
Use carefully, still allows cheating
◦ Use drop-down lists from reference
tables
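Masks and reference tables can be sketched as follows. The mask pattern and helper names are assumptions made for illustration; the status values are taken from the Relationship Status slide. The example also shows why masks need care: an obsolete number and an obvious dummy both pass.

```python
import re

# Assumed mask: a 5-digit area code starting with 0, a space, then 6 digits.
PHONE_MASK = re.compile(r"0\d{4} \d{6}")

# Hypothetical reference table backing a drop-down list.
RELATIONSHIP_STATUSES = {
    "Legally Married",
    "In Civil Partnership",
    "Unmarried",
    "Divorced",
}


def matches_mask(phone):
    """True if the whole string fits the input mask."""
    return PHONE_MASK.fullmatch(phone) is not None


def in_reference_table(value, table=RELATIONSHIP_STATUSES):
    """True if the value is one of the allowed reference values."""
    return value in table


print(matches_mask("01222 535681"))  # True, although the code is obsolete
print(matches_mask("00000 000000"))  # True, a "cheat" also passes
print(in_reference_table("Wife"))    # False
```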
12. Inconsistent Data
Intra-record inconsistency:
◦ Gender=“m”, Marital-Status=“Wife”;
Inter-record inconsistency:
◦ R1: VIN=VF7N1KFXF36772582;
Registration Mark=T87BRB
◦ R2: VIN=VF7N1KFXF36772582;
Registration Mark=CC04PNL
Inter-datastore inconsistency
◦ E.g. Customer data in many data
stores
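The first two kinds of inconsistency can be detected mechanically. The sketch below assumes records are dictionaries with hypothetical field names; it applies one intra-record rule and groups records by VIN to find inter-record conflicts, using the values from the slide.

```python
from collections import defaultdict


def intra_record_issues(record):
    """Check one record against a rule that spans two of its fields."""
    issues = []
    if record.get("gender") == "m" and record.get("marital_status") == "Wife":
        issues.append("gender 'm' conflicts with marital status 'Wife'")
    return issues


def conflicting_registrations(records):
    """Return VINs that appear with more than one registration mark."""
    marks_by_vin = defaultdict(set)
    for rec in records:
        marks_by_vin[rec["vin"]].add(rec["registration_mark"])
    return {
        vin: sorted(marks)
        for vin, marks in marks_by_vin.items()
        if len(marks) > 1
    }


vehicles = [
    {"vin": "VF7N1KFXF36772582", "registration_mark": "T87BRB"},
    {"vin": "VF7N1KFXF36772582", "registration_mark": "CC04PNL"},
]
print(intra_record_issues({"gender": "m", "marital_status": "Wife"}))
print(conflicting_registrations(vehicles))
```

Neither check says which value is correct; it only shows where somebody needs to look.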
13. How do we avoid these?
“Common sense validation”
◦ Men cannot be wives
But: what is correct value?
So: don’t over-specify
◦ Marital status?
◦ Better: Relationship Status
Legally Married
In Civil Partnership
Unmarried
Divorced
14. Careful of surrogate keys
Entities can often be identified in
different ways
◦ NI Number
◦ NHS Number
These are surrogate keys
All key fields should be unique
VIN example could not have arisen if
field were required to be unique
Nor would the SSN example earlier
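A uniqueness requirement can be enforced by the database itself. This is a minimal sketch using Python's built-in `sqlite3` module and a hypothetical `vehicle` table: the second insert with the same VIN is rejected rather than stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vehicle ("
    "vin TEXT NOT NULL UNIQUE, "
    "registration_mark TEXT NOT NULL)"
)
conn.execute("INSERT INTO vehicle VALUES (?, ?)", ("VF7N1KFXF36772582", "T87BRB"))

try:
    # Same VIN, different registration mark: violates the UNIQUE constraint.
    conn.execute("INSERT INTO vehicle VALUES (?, ?)", ("VF7N1KFXF36772582", "CC04PNL"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM vehicle").fetchone()[0]
print(duplicate_rejected, row_count)  # True 1
```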
15. Root Cause
Often historically poor data quality
◦ NI numbers poorly administered
Many to many relationships!
Keys not unique in practice
Allows for new errors in data entry
16. An Aside—Checksums
Checksums are an ancient technique to
validate input data
◦ Additional digit attached to key
◦ Derived from key bytes
◦ Mis-keying almost always generates a mismatch
Not part of key so store separately if
at all
Better to generate key automatically
validate against existing
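As one concrete example of a check digit, the sketch below uses the Luhn algorithm. The talk does not name a specific scheme, so this choice is an assumption made for illustration. Luhn catches every single-digit error and most, though not all, adjacent transpositions (it misses 09 swapped with 90).

```python
def luhn_check_digit(payload):
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk from the rightmost payload digit, doubling every second digit.
    for index, char in enumerate(reversed(payload)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def is_valid(key_with_check):
    """True if the last digit is the Luhn check digit of the rest."""
    payload, check = key_with_check[:-1], key_with_check[-1]
    return luhn_check_digit(payload) == int(check)


print(luhn_check_digit("7992739871"))  # 3
print(is_valid("79927398713"))         # True
print(is_valid("79927398703"))         # False: one digit mis-keyed
```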
17. Inaccurate Data
Statistical outliers & other “sore thumbs”
◦ E.g. Price 10 times higher than similar
model
◦ River Temperature > 100 °C
◦ Gas Bill orders of magnitude too high
Transposed Digits
◦ Accountancy packages have lots of
tricks to find these
Spurious Accuracy
◦ Wall length in mm
◦ Averages computed to too many places
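A simple "sore thumb" test compares each value with the median of its peers. This is a sketch under stated assumptions: the `sore_thumbs` helper, the factor of 5, and the sample prices are all invented for illustration, and the check assumes positive values.

```python
from statistics import median


def sore_thumbs(prices, factor=5.0):
    """Return prices more than `factor` times above or below the median."""
    mid = median(prices)
    return [p for p in prices if p > mid * factor or p < mid / factor]


# Invented prices for similar models; one looks ten times too high.
similar_models = [199.0, 205.0, 189.0, 2100.0, 210.0]
print(sore_thumbs(similar_models))  # [2100.0]
```

The median is used rather than the mean because a single extreme value drags the mean towards itself and can hide the very outlier being sought.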
18. Incomprehensible Data
The facts could meet all the previous
strictures but still be useless
They must be put in context
19. Data in Context
3.142 is a fact
Gertie 3.142 2005-02-02
is data
Name    Height  Measurement Date
Gertie  3.142   2005-02-02
is becoming “Data in Context”
Still need
◦ units for Height (metres)
◦ Date rules (ISO 8601)
◦…
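One way to carry context with the data is to put it into the type. The sketch below is illustrative only; the `HeightMeasurement` class and `parse_row` helper are hypothetical. The unit is recorded in the field name and the date must parse as an ISO 8601 calendar date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class HeightMeasurement:
    name: str
    height_m: float    # unit carried in the field name: metres
    measured_on: date  # parsed from an ISO 8601 calendar date


def parse_row(name, height, measured):
    """Build a measurement, rejecting dates that are not YYYY-MM-DD."""
    return HeightMeasurement(name, float(height), date.fromisoformat(measured))


row = parse_row("Gertie", "3.142", "2005-02-02")
print(row.name, row.height_m, "m on", row.measured_on.isoformat())
```

A string such as "02/03/08" raises `ValueError` in `date.fromisoformat`, so the ambiguous form never reaches the data store.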
20. No Context => Expensive errors
Mars Climate Orbiter
◦ Discrepancies observed in approach but
not formally noted
◦ Spacecraft vanished during insertion
into orbit
◦ Engineers specified forces to be applied in
pounds-force (lbf), not newtons
◦ Factor of 4.45 difference!
◦ Mars Polar Lander was lost later the same year, though most likely from a different fault (premature engine shutdown)
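The factor of 4.45 is the pound-force to newton conversion. The sketch below is illustrative and not a reconstruction of any flight software; the `Force` class is hypothetical. It shows how a value tagged with its unit at the point of entry avoids the silent misreading.

```python
from dataclasses import dataclass

NEWTONS_PER_LBF = 4.4482216152605  # exact, from the definition of the pound-force


@dataclass(frozen=True)
class Force:
    newtons: float  # always stored in SI units

    @classmethod
    def from_newtons(cls, value):
        return cls(value)

    @classmethod
    def from_lbf(cls, value):
        return cls(value * NEWTONS_PER_LBF)


intended = Force.from_lbf(10.0)     # the producer meant 10 lbf
misread = Force.from_newtons(10.0)  # the consumer assumed newtons
print(round(intended.newtons / misread.newtons, 2))  # 4.45
```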
21. More examples
Redefining field usage on the fly
◦ 2-byte field in database but highest
value <256
◦ Project team seeks to avoid cost of
inserting new field
◦ Redefines field in code to be two 1-byte
fields
◦ Existing reports start giving odd results
but nobody notices
◦ Wrong business decisions made
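The effect on existing reports can be shown with Python's `struct` module. This is a sketch with invented values and an assumed big-endian layout: new code stores two 1-byte values in the old field, while an unchanged report still reads it as one 16-bit number.

```python
import struct

# New code writes two independent 1-byte values into the old 2-byte field.
stored = struct.pack("BB", 7, 3)

# An existing report still reads the field as one 16-bit big-endian integer.
(legacy_value,) = struct.unpack(">H", stored)

print(stored.hex())  # 0703
print(legacy_value)  # 1795
```

Nothing fails and no error is raised; the report simply shows 1795 where the business expects a value below 256.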
22. Information
Information is
◦ What sentient beings use to:
Facilitate decision-making
Communicate
Only humans are sentient (so far)
◦ Information only exists when humans are
in the value chain
Machine-machine communication
◦ Data in context
23. What is Quality Information
Conveys the right “impression”
◦ Trespassing on Conrad’s territory
◦ We’ll look at some graphical examples
Takes into account cultural differences
◦ “Wait while the red light flashes”
24. Phone Number example again
01222 331988
I see that and “know” that it is wrong
I could programme the rule to convert
an erroneously converted number
01222 => 029
Prefix subscriber number with 20
But 029 is officially the code for Wales
and other prefixes will appear, 21
already in use.
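The conversion rule described here can be sketched as below. The `convert_cardiff` helper and its accepted input format are assumptions made for illustration. As the slide warns, the rule only repairs numbers that predate the change: the "20" prefix is hard-coded, so it says nothing about newer ranges such as 21.

```python
import re

# Assumed input format: the old 01222 code, an optional space, 6 digits.
OLD_CARDIFF = re.compile(r"01222 ?(\d{6})")


def convert_cardiff(number):
    """Rewrite an old 01222 number as 029 20xx xxxx; return other input unchanged."""
    match = OLD_CARDIFF.fullmatch(number.strip())
    if match is None:
        return number
    subscriber = "20" + match.group(1)
    return f"029 {subscriber[:4]} {subscriber[4:]}"


print(convert_cardiff("01222 331988"))  # 029 2033 1988
print(convert_cardiff("01222 535681"))  # 029 2053 5681
```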
25. Information Presentation
Which of these companies would you
rather buy into?
[Charts not reproduced; only the labels 1 to 8 survive]
26. Illegality
US accounting rules now outlaw chart
manipulations
Money Laundering rules
Managers could go to prison
Basel II and Sarbanes-Oxley
Directors could go to prison
27. Data Quality is Free
Poor Data Quality costs 10-30% of
Turnover routinely
Particular issues can be catastrophic
◦ Regulator can fine companies
◦ People can sue
◦ Officers and directors could go to jail
Data Quality is better than Free
But needs to be worked at
28. No IQ without DQ
Cannot have good Information Quality
◦ Without good quality data
Information Quality is a business issue
◦ Needs complete commitment
◦ Very strong management process
Information is the Third Asset
◦ It is not a cost centre
◦ It is not reflected on the bottom line
◦ Yet