The quality of information and data is strained


International Association for Information and Data Quality
   Keith Underdown
   Convenor, British Community of Practice




Shameless Plug
    International Association for Information & Data Quality
    www.iaidq.org
        ◦ Student Membership: $25
        ◦ Personal Membership: $85
        ◦ Corporate Membership Available
        ◦ Extensive Conference Discounts
    www.justgiving.com/keithunderdown
        ◦ My fundraising page
        ◦ Reward me if you enjoy my presentation

Data
    “Everybody knows what data is”!
        ◦ “Define:data” in a Google search gives 41 results
        ◦ Mix of
            “data processing” biased
            Philosophical
            Irrelevant (Data is an android in Star Trek: TNG)
    My Preference:
        A collection of facts held in a formalized manner suitable
        for processing by automatic or human means.

Fundamental Data Quality
    The facts in the case can be:
        ◦ Inaccurate
        ◦ Incomplete
        ◦ Inconsistent
        ◦ Invalid
        ◦ Incomprehensible

The Five “I’s”
    Incomplete Data
        ◦ mandatory fields with null, empty string, etc.
    Invalid Data
        ◦ values outside the allowed value set, or values that fail tests against rules
    Inconsistent Data
        ◦ intra-record inconsistency
        ◦ inter-record inconsistency
        ◦ inter-datastore inconsistency
    Inaccurate Data
        ◦ Statistical outliers & other “sore thumbs”
            E.g. price 10 times higher than similar models
    Incomprehensible Data
        ◦ without full and accurate context

Incomplete Data
    Facts essential to the business process are missing
        ◦ Implies that data validation is incorrect
        ◦ Often arises during bulk import of data
            Data not immediately available, so validation relaxed
            Follow-up not completed
            Database field cannot be made mandatory

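One way to follow up after a relaxed bulk import is to list, per record, which mandatory fields are still null or blank. The sketch below is illustrative only: the field names and the sample record are assumptions, not taken from any real system.

```python
from typing import Any, Iterable, Mapping

# Hypothetical mandatory fields; a real system would derive these from its data model.
MANDATORY_FIELDS = ("name", "date_of_birth", "social_security_number")


def missing_fields(
    record: Mapping[str, Any],
    mandatory: Iterable[str] = MANDATORY_FIELDS,
) -> list[str]:
    """Return the mandatory fields that are absent, None, or blank in the record."""
    missing = []
    for field in mandatory:
        value = record.get(field)
        if value is None or (isinstance(value, str) and value.strip() == ""):
            missing.append(field)
    return missing


# Made-up record as it might look after an import with validation relaxed.
imported = {"name": "A. Customer", "date_of_birth": "", "social_security_number": None}
print(missing_fields(imported))
```

A report like this makes the "follow-up not completed" case visible instead of silent.
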
Example
    Change in law made knowledge of Social Security number mandatory
        ◦ Too expensive to go to customers
        ◦ Populate at need
        ◦ Telephone agents entered their own number
    Customer failed to fill in DoB field
        ◦ Data entry clerk guessed!
        ◦ Customer had a high-value transaction turned down
        ◦ Lots of adverse publicity

How can we avoid these?
    Plan for their absence
        ◦ When creating new databases, plan how to populate fields
        ◦ When bulk updates are required, bite the bullet
        ◦ Ensure agents have time and understand the need to collect data
            Check for likely “cheats”

Invalid Data
    Data that fails genuine business rules
    Or
    Fails unstated real-world validation
        ◦ Company name info spills over into address fields

Examples
    01222 535681 looks like a valid phone no.
        ◦ But Cardiff is an exception
            029 2053 5681
            Human being might work it out
            Power dialler won’t
    02/03/08
        ◦ US = 3rd February 08
        ◦ UK = 2nd March 08
        ◦ Which century?

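The date ambiguity is easy to reproduce. This small sketch parses the same string under the US and UK conventions with Python's standard `datetime.strptime`. The helper name `interpretations` is made up for illustration, and the century comes from the library's two-digit-year rule, not from the data.

```python
from datetime import date, datetime


def interpretations(raw: str) -> dict[str, date]:
    """Parse a two-digit-year slash date under the US and the UK reading."""
    return {
        "US": datetime.strptime(raw, "%m/%d/%y").date(),  # month first
        "UK": datetime.strptime(raw, "%d/%m/%y").date(),  # day first
    }


readings = interpretations("02/03/08")
print(readings["US"].isoformat())  # 2008-02-03
print(readings["UK"].isoformat())  # 2008-03-02
```

Both parses succeed, so no validation rule on the string alone can tell which reading was meant.
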
How do we avoid these
    Make field syntax as tight as possible
        ◦ E.g. always use date-stamp fields for dates
        ◦ Use external validation systems
            E.g. Postal Address File
        ◦ Use masks to validate input patterns
            Use carefully: masks still allow cheating
        ◦ Use drop-down lists from reference tables

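A rough sketch of the last two techniques, under stated assumptions. The mask (a leading 0 followed by ten digits) and the reference values are illustrative choices, not a real numbering-plan rule or a real reference table.

```python
import re

# Assumed input mask: a leading 0 followed by ten more digits, spaces ignored.
PHONE_MASK = re.compile(r"0\d{10}")

# Stand-in for a reference table that would back a drop-down list.
REFERENCE_VALUES = {"Legally Married", "In Civil Partnership", "Unmarried", "Divorced"}


def matches_mask(value: str) -> bool:
    """True if the value fits the mask once spaces are removed."""
    return PHONE_MASK.fullmatch(value.replace(" ", "")) is not None


def in_reference_table(value: str) -> bool:
    """True only for values present in the reference set."""
    return value in REFERENCE_VALUES


print(matches_mask("029 2053 5681"))   # fits the mask
print(matches_mask("00000 000000"))    # also fits: the mask cannot spot this cheat
print(in_reference_table("Wife"))      # not in the reference set
```

The second call is the point of the warning: a mask checks shape, not truth.
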
Inconsistent Data
    Intra-record inconsistency
        ◦ Gender=“m”, Marital-Status=“Wife”
    Inter-record inconsistency
        ◦ R1: VIN=VF7N1KFXF36772582; Registration Mark=T87BRB
        ◦ R2: VIN=VF7N1KFXF36772582; Registration Mark=CC04PNL
    Inter-datastore inconsistency
        ◦ E.g. customer data in many data stores

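Inter-record inconsistency of the VIN kind can be found by grouping records on the key and looking for more than one distinct value. A minimal sketch follows: the first two records are the ones from the slide, the third is made up, and the dictionary layout is an assumption.

```python
from collections import defaultdict

records = [
    {"vin": "VF7N1KFXF36772582", "registration_mark": "T87BRB"},
    {"vin": "VF7N1KFXF36772582", "registration_mark": "CC04PNL"},
    {"vin": "HYPOTHETICALVIN01", "registration_mark": "AB12CDE"},  # made-up record
]


def conflicting_vins(rows):
    """Map each VIN that appears with more than one registration mark to those marks."""
    marks_by_vin = defaultdict(set)
    for row in rows:
        marks_by_vin[row["vin"]].add(row["registration_mark"])
    return {vin: sorted(marks) for vin, marks in marks_by_vin.items() if len(marks) > 1}


print(conflicting_vins(records))
```

A hit is a prompt for investigation, not proof of error: deciding which mark is correct still needs a person or an authoritative source.
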
How do we avoid these?
    “Common sense validation”
        ◦ Men cannot be wives
    But: what is the correct value?
    So: don’t over-specify
        ◦ Marital status?
        ◦ Better: Relationship Status
            Legally Married
            In Civil Partnership
            Unmarried
            Divorced

Careful of surrogate keys
    Entities can often be identified in different ways
        ◦ NI Number
        ◦ NHS Number
    These are surrogate keys
    All key fields should be unique
    VIN example could not have arisen if the field were required to be unique
    Nor could the SSN example earlier

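As a sketch of what "required to be unique" looks like in practice, the example below uses Python's built-in `sqlite3` module with a `UNIQUE` constraint. The table and column names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the VIN column is declared UNIQUE so duplicates are rejected.
conn.execute(
    "CREATE TABLE vehicle (vin TEXT NOT NULL UNIQUE, registration_mark TEXT NOT NULL)"
)
conn.execute("INSERT INTO vehicle VALUES (?, ?)", ("VF7N1KFXF36772582", "T87BRB"))
conn.commit()


def try_insert(vin: str, mark: str) -> bool:
    """Insert a row; return False if the database rejects it as a duplicate key."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO vehicle VALUES (?, ?)", (vin, mark))
        return True
    except sqlite3.IntegrityError:
        return False


# The second record from the VIN example is refused by the constraint.
rejected = not try_insert("VF7N1KFXF36772582", "CC04PNL")
print(rejected)
```

The constraint stops the duplicate at entry, but someone still has to work out which registration mark is the right one.
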
Root Cause
    Often historically poor data quality
        ◦ NI numbers poorly administered
            Many-to-many relationships!
    Keys not unique in practice
    Allows for new errors in data entry

An Aside—Checksums
    Checksums are an ancient technique to validate input data
        ◦ Additional digit attached to key
        ◦ Derived from key bytes
        ◦ Mis-keying almost always generates a mismatch
    Not part of key, so store separately if at all
    Better to generate key automatically, validate against existing keys

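To make the idea concrete, here is a sketch of one well-known check-digit scheme, the Luhn algorithm. The slides do not name a scheme, so this choice is an assumption. Luhn catches every single-digit slip, but not every transposition: swapping an adjacent 0 and 9 goes unnoticed.

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return (10 - total % 10) % 10


def is_valid(number: str) -> bool:
    """True if the last digit of `number` is the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])


print(is_valid("79927398713"))  # correctly keyed
print(is_valid("79927398703"))  # one digit mis-keyed
```
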
Inaccurate Data
    Statistical outliers & other “sore thumbs”
        ◦ E.g. price 10 times higher than similar model
        ◦ River temperature >100 °C
        ◦ Gas bill orders of magnitude too high
    Transposed Digits
        ◦ Accountancy packages have lots of tricks to find these
    Spurious Accuracy
        ◦ Wall length in mm
        ◦ Averages computed to too many places

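A simple way to look for "sore thumbs" is to compare each value with the median of its peers. The sketch below flags prices that are a factor of ten away from the median. The threshold, the function name and the sample prices are all assumptions made for illustration.

```python
from statistics import median


def sore_thumbs(prices: dict[str, float], factor: float = 10.0) -> list[str]:
    """Flag items priced at least `factor` times above or below the median price."""
    mid = median(prices.values())
    return [
        name
        for name, price in prices.items()
        if price >= mid * factor or price <= mid / factor
    ]


# Made-up prices for four similar models; one has a likely misplaced decimal point.
prices = {"model_a": 199.0, "model_b": 210.0, "model_c": 189.0, "model_d": 2050.0}
print(sore_thumbs(prices))
```

The median is used rather than the mean so that the outlier itself does not drag the comparison point towards it.
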
Incomprehensible Data
    The facts could meet all the previous strictures but still be useless
    They must be put in context

Data in Context
    3.142 is a fact

        Gertie    3.142     2005-02-02

    is data

        Name      Height    Measurement Date
        Gertie    3.142     2005-02-02

    is becoming “Data in Context”
    Still need
        ◦ units for Height (metres)
        ◦ Date rules (ISO 8601)
        ◦ …

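One way to carry context along with the values is to put it into the record type itself. In this sketch the unit is part of the field name and the date must parse as ISO 8601. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class HeightMeasurement:
    name: str
    height_m: float    # unit stated in the field name: metres
    measured_on: date  # parsed from an ISO 8601 date string

    @classmethod
    def from_row(cls, name: str, height_m: str, measured_on: str) -> "HeightMeasurement":
        """Build a record from raw strings; raises ValueError on a non-ISO date."""
        return cls(name, float(height_m), date.fromisoformat(measured_on))


row = HeightMeasurement.from_row("Gertie", "3.142", "2005-02-02")
print(row)
```

A date such as "02/03/08" is rejected at this boundary rather than silently guessed.
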
No Context => Expensive errors
    Mars Climate Orbiter
        ◦ Discrepancies observed on approach but not formally noted
        ◦ Spacecraft vanished during insertion into orbit
        ◦ Engineers specified forces to be applied in pound-force (lbf), not newtons
        ◦ Factor of 4.45 difference!
        ◦ Mars Polar Lander was also lost weeks later, though that failure was attributed to a different cause

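The factor of 4.45 is the size of one pound-force in newtons. A sketch of one defensive habit: never pass a bare number, and always construct a force through a function that names its unit. The `Force` class here is hypothetical and is not a description of the flight software.

```python
from dataclasses import dataclass

# One pound-force in newtons: pound mass in kg times standard gravity in m/s^2.
LBF_TO_NEWTONS = 0.45359237 * 9.80665


@dataclass(frozen=True)
class Force:
    newtons: float  # always stored in SI units

    @classmethod
    def from_newtons(cls, value: float) -> "Force":
        return cls(float(value))

    @classmethod
    def from_pound_force(cls, value: float) -> "Force":
        return cls(float(value) * LBF_TO_NEWTONS)


# The same number means very different things depending on the unit.
print(Force.from_newtons(1.0).newtons)
print(Force.from_pound_force(1.0).newtons)
```
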
More examples
    Redefining field usage on the fly
        ◦ 2-byte field in database but highest value <256
        ◦ Project team seeks to avoid cost of inserting new field
        ◦ Redefines field in code to be two 1-byte fields
        ◦ Existing reports start giving odd results but nobody notices
        ◦ Wrong business decisions made

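The effect can be reproduced with Python's `struct` module. In this sketch the new code writes two 1-byte values into the old 2-byte field, and an unchanged report still reads the same bytes as a single number. The values and the big-endian byte order are assumptions.

```python
import struct

# New code: two unsigned 1-byte values share the old 2-byte field (made-up values).
stored = struct.pack(">BB", 3, 7)

# Old report: still reads the same two bytes as one unsigned 2-byte integer.
(old_view,) = struct.unpack(">H", stored)
new_view = struct.unpack(">BB", stored)

print(new_view)  # (3, 7)
print(old_view)  # 775, i.e. 3 * 256 + 7
```

Nothing fails and no error is raised: the old report simply shows 775 where it used to show a small number.
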
Information
    Information is
        ◦ What sentient beings use to:
            Facilitate decision-making
            Communicate
    Only humans are sentient (so far)
        ◦ Information only exists when humans are in the value chain
    Machine-machine communication
        ◦ Data in context

What is Quality Information
    Conveys the right “impression”
        ◦ Trespassing on Conrad’s territory
        ◦ We’ll look at some graphical examples
    Takes into account cultural differences
        ◦ “Wait while the red light flashes”

Phone Number example again
    01222 331988
    I see that and “know” that it is wrong
    I could programme the rule to convert an erroneously converted number
        ◦ 01222 => 029
        ◦ Prefix subscriber number with 20
    But 029 is officially the code for Wales
    and other prefixes will appear, 21 already in use.

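The rule as stated can be sketched in a few lines. The function name and the output spacing are assumptions. The hard-coded "20" is exactly the piece of knowledge that may go stale as other prefixes come into use.

```python
def convert_old_cardiff(number: str) -> str:
    """Rewrite an old 01222 number as 029 20xx xxxx; return other input unchanged."""
    digits = number.replace(" ", "")
    if digits.startswith("01222") and len(digits) == 11:
        subscriber = "20" + digits[5:]  # assumes every old number maps to the 20 prefix
        return f"029 {subscriber[:4]} {subscriber[4:]}"
    return number


print(convert_old_cardiff("01222 331988"))
print(convert_old_cardiff("01222 535681"))
```

The code captures what the human "knows" today. Keeping that knowledge current is a matter of information quality, not just data quality.
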
Information Presentation
    Which of these companies would you rather buy into?

    [Figure: charts not recovered; horizontal axis labelled 1 to 8]

Illegality
    US accounting rules now outlaw chart manipulations
    Money Laundering rules
        Managers could go to prison
    Basel II and Sarbanes-Oxley
        Directors could go to prison

Data Quality is Free
    Poor Data Quality routinely costs 10-30% of turnover
    Particular issues can be catastrophic
        ◦ Regulator can fine companies
        ◦ People can sue
        ◦ Officers and directors could go to jail
    Data Quality is better than Free
    But needs to be worked at

No IQ without DQ
    Cannot have good Information Quality
        ◦ Without good quality data
    Information Quality is a business issue
        ◦ Needs complete commitment
        ◦ Very strong management process
    Information is the Third Asset
        ◦ It is not a cost centre
        ◦ It is not reflected on the bottom line
        ◦ Yet

Any Questions?




          keith.underdown@iaidq.org

Bcs 20080228 Ku

  • 1. The quality of information and data is strained International Association for Information and Data Quality Keith Underdown Convenor, British Community of Practice International Association for Information and Data Quality
  • 2. Shameless Plug International Association for  Information & Data Quality www.iaidq.org ◦ Student Membership—$25 ◦ Personal Membership—$85 International Association for Information and Data Quality ◦ Corporate Membership Available ◦ Extensive Conference Discounts www.justgiving.com/keithunderdown  ◦ My fundraising page ◦ Reward me if you enjoy my presentation International Association for Information and Data Quality
  • 3. Data “Everybody knows what data is”!  ◦ “Define:data” in a Google search gives 41 results ◦ Mix of International Association for Information and Data Quality  “data processing” biased  Philosophical  Irrelevant (Data is an android in Startrek TNG) My Preference:  A collection of facts held in a formalized manner suitable for processing by automatic or human means. International Association for Information and Data Quality
  • 4. Fundamental Data Quality The facts in the case can be:  ◦ Inaccurate ◦ Incomplete ◦ Inconsistent International Association for Information and Data Quality ◦ Invalid ◦ Incomprehensible International Association for Information and Data Quality
  • 5. The Five “I’s”  Incomplete Data ◦ mandatory fields with null, empty string, etc…  Invalid Data ◦ values outside the allowed value set or fails tests against rules  Inconsistent Data International Association inconsistencyand Data Quality ◦ intra-record for Information ◦ inter-record inconsistency ◦ Inter-datastore Consistency  Inaccurate Data: ◦ Statistical outliers & other “sore thumbs”  E.g. Price 10 times higher than similar models Incomprehensible Data  ◦ without full and accurate context International Association for Information and Data Quality
  • 6. Incomplete Data Facts essential to business process are  missing ◦ Implies that data validation incorrect ◦ Often arises during bulk import of data International Association for Information and Data Quality  Data not immediately available so validation relaxed  Follow-up not completed  Database field cannot be made mandatory International Association for Information and Data Quality
  • 7. Example Change in Law made knowledge of  Social Security number mandatory ◦ Too expensive to go to customers ◦ Populate at need International Association for Information and Data Quality ◦ Telephone agents used their own Customer failed to fill in DoB field  ◦ Data entry clerk guessed! ◦ Customer has high value transaction turned down ◦ Lots of adverse publicity International Association for Information and Data Quality
  • 8. How can we avoid these? Plan for their absence  ◦ When creating new databases plan to populate fields ◦ When bulk updates required bite the bullet International Association for Information and Data Quality ◦ Ensure agents have time and understand the need to collect data  Check for likely “cheats” International Association for Information and Data Quality
  • 9. Invalid Data Data that fails genuine business rules  Or  Fails unstated real world validation  ◦ Company name info spills over into International Association for Information and Data Quality address fields International Association for Information and Data Quality
  • 10. Examples 01222 535681 looks like a valid phone  no. ◦ But Cardiff is an exception  029 2053 5681 International Association for might work it out Quality  Human being Information and Data  Power dialler won’t 02/03/08  ◦ US=3rd February 08 ◦ UK= 2nd March 08 ◦ Which century? International Association for Information and Data Quality
  • 11. How do we avoid these Make field syntax as tight as possible  ◦ E.g. Always use date-stamp fields for dates ◦ Use external validation systems International Association Address File and Data Quality  E.g. Postal for Information ◦ Use masks to validate input patterns  Use carefully, still allows cheating ◦ Use drop-down lists from reference tables International Association for Information and Data Quality
  • 12. Inconsistent Data
      Intra-record inconsistency:
        ◦ Gender=“m”, Marital-Status=“Wife”
      Inter-record inconsistency:
        ◦ R1: VIN=VF7N1KFXF36772582; Registration Mark=T87BRB
        ◦ R2: VIN=VF7N1KFXF36772582; Registration Mark=CC04PNL
      Inter-datastore inconsistency:
        ◦ E.g. Customer data in many data stores
  • 13. How do we avoid these?
      “Common sense validation”
        ◦ Men cannot be wives
      But: what is correct value?
      So: don’t over-specify
        ◦ Marital status?
        ◦ Better: Relationship Status
            Legally Married
            In Civil Partnership
            Unmarried
            Divorced
  • 14. Careful of surrogate keys
      Entities can often be identified in different ways
        ◦ NI Number
        ◦ NHS Number
      These are surrogate keys
      All key fields should be unique
      VIN example could not have arisen if field required to be unique
      Nor would the SSN example earlier
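The uniqueness point can be illustrated with an in-memory SQLite database. This is a sketch only: the table and column names are hypothetical, and the two rows are the R1 and R2 records from the inconsistency slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the VIN is declared UNIQUE so the database enforces it.
conn.execute(
    "CREATE TABLE vehicle ("
    "  vin TEXT NOT NULL UNIQUE,"
    "  registration_mark TEXT NOT NULL)"
)
conn.execute("INSERT INTO vehicle VALUES (?, ?)", ("VF7N1KFXF36772582", "T87BRB"))

try:
    # Second record with the same VIN but a different registration mark.
    conn.execute("INSERT INTO vehicle VALUES (?, ?)", ("VF7N1KFXF36772582", "CC04PNL"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM vehicle").fetchone()[0]
print(duplicate_rejected, row_count)  # True 1
```

The constraint does not say which registration mark is right. It only forces that question to be answered at entry time instead of being discovered later.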
  • 15. Root Cause
      Often historically poor data quality
        ◦ NI numbers poorly administered
            Many to many relationships!
      Keys not unique in practice
            Allows for new errors in data entry
  • 16. An Aside: Checksums
      Checksums are an ancient technique to validate input data
        ◦ Additional digit attached to key
        ◦ Derived from key bytes
        ◦ Mis-keying almost always generates a mismatch
      Not part of key so store separately if at all
      Better to generate key automatically
            Validate against existing
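As an illustration of a check digit, here is a minimal sketch using the Luhn algorithm. Luhn is a substitution chosen for familiarity; the slide does not say which checksum scheme the speaker had in mind.

```python
def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk from the rightmost payload digit, doubling every other digit.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """True if the last digit is the correct check digit for the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


print(luhn_check_digit("7992739871"))  # 3
print(luhn_valid("79927398713"))       # True
print(luhn_valid("79927398710"))       # False: one digit mis-keyed
print(luhn_valid("1099"), luhn_valid("1909"))  # True True: 09/90 swap slips through
```

The last line shows that a single check digit catches every single-digit error but not every transposition, which is one reason to prefer generated keys that are validated against existing records.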
  • 17. Inaccurate Data
      Statistical outliers & other “sore thumbs”
        ◦ E.g. Price 10 times higher than similar model
        ◦ River Temperature >100 °C
        ◦ Gas Bill orders of magnitude too high
      Transposed Digits
        ◦ Accountancy packages have lots of tricks to find these
      Spurious Accuracy
        ◦ Wall length in mm
        ◦ Averages computed to too many places
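Two of these checks are simple enough to sketch. The first is a well-known bookkeeping trick: swapping two digits of a number changes it by a multiple of 9, so a discrepancy divisible by 9 hints at a transposition. The second flags prices at least `factor` times the median. Function names and the factor of 10 are illustrative assumptions, not taken from any particular accountancy package.

```python
from statistics import median


def may_be_transposition(expected_total: int, actual_total: int) -> bool:
    """A non-zero difference divisible by 9 is consistent with swapped digits.

    Amounts are integers (e.g. pence) to avoid floating-point noise.
    """
    diff = abs(expected_total - actual_total)
    return diff != 0 and diff % 9 == 0


def price_outliers(prices, factor=10):
    """Return prices that are at least `factor` times the median price."""
    med = median(prices)
    return [p for p in prices if p >= factor * med]


print(may_be_transposition(5400, 4500))  # True  (54 keyed as 45)
print(may_be_transposition(1234, 1235))  # False (difference of 1)
print(price_outliers([100, 110, 95, 1200, 105]))  # [1200]
```

Both tests only raise suspicion. A difference of 9 can also be a genuine 9, and a price ten times the median can be a genuinely expensive model.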
  • 18. Incomprehensible Data
      The facts could meet all the previous strictures but still be useless
      They must be put in context
  • 19. Data in Context
      3.142 is a fact
      Gertie 3.142 2005-02-02 is data
      Name    Height  Measurement Date
      Gertie  3.142   2005-02-02
      is becoming “Data in Context”
      Still need
        ◦ units for Height (metres)
        ◦ Date rules (ISO 8601)
        ◦ …
  • 20. No Context => Expensive errors
      Mars Climate Orbiter
        ◦ Discrepancies observed in approach but not formally noted
        ◦ Spacecraft vanished during insertion into orbit
        ◦ Engineers specified forces to be applied in pounds-force, not newtons
        ◦ Factor of 4.45 difference!
        ◦ Mars Polar Lander was lost a few months later too
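One way to keep the context attached to the number is to make the unit part of the value. The sketch below is illustrative only: the `Force` class is hypothetical and says nothing about how the actual flight software was written.

```python
from dataclasses import dataclass

# 1 lbf in newtons (0.45359237 kg * 9.80665 m/s^2); roughly the slide's 4.45.
LBF_TO_NEWTON = 4.4482216152605


@dataclass(frozen=True)
class Force:
    """A force value that always travels with its unit."""
    value: float
    unit: str  # "N" or "lbf"

    def to_newtons(self) -> float:
        if self.unit == "N":
            return self.value
        if self.unit == "lbf":
            return self.value * LBF_TO_NEWTON
        # Refuse to guess: an unknown unit is an error, not a default.
        raise ValueError(f"unknown force unit: {self.unit!r}")


thrust = Force(100.0, "lbf")
print(round(thrust.to_newtons(), 1))  # 444.8
```

A bare `100.0` passed between two teams can be read either way. A `Force(100.0, "lbf")` cannot, and an unrecognised unit fails loudly instead of being silently treated as newtons.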
  • 21. More examples
      Redefining field usage on the fly
        ◦ 2-byte field in database but highest value <256
        ◦ Project team seeks to avoid cost of inserting new field
        ◦ Redefines field in code to be two 1-byte fields
        ◦ Existing reports start giving odd results but nobody notices
        ◦ Wrong business decisions made
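The effect on existing reports can be reproduced with Python's `struct` module. This is a sketch under assumptions: big-endian byte order and the sample values 3 and 7 are invented for illustration.

```python
import struct

# New code treats the two bytes as two separate 1-byte fields.
new_record = struct.pack(">BB", 3, 7)

# An existing report still reads the same bytes as one unsigned 2-byte field.
(old_view,) = struct.unpack(">H", new_record)
print(old_view)  # 775, i.e. 3 * 256 + 7

# Going the other way: a record written before the change with the value 200.
old_record = struct.pack(">H", 200)
first, second = struct.unpack(">BB", old_record)
print(first, second)  # 0 200
```

Neither read raises an error, which is why the odd results can go unnoticed: every value is still a perfectly valid number, just not the one that was meant.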
  • 22. Information
      Information is
        ◦ What sentient beings use to:
            Facilitate decision-making
            Communicate
            Only humans are sentient (so far)
        ◦ Information only exists when humans in value chain
      Machine-machine communication
        ◦ Data in context
  • 23. What is Quality Information?
      Conveys the right “impression”
        ◦ Trespassing on Conrad’s territory
        ◦ We’ll look at some graphical examples
      Takes into account cultural differences
        ◦ “Wait while the red light flashes”
  • 24. Phone Number example again
      01222 331988
      I see that and “know” that it is wrong
      I could programme the rule to convert an erroneously converted number
            01222 => 029
            Prefix subscriber number with 20
      But 029 is officially the code for Wales and other prefixes will appear, 21 already in use.
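The conversion rule described on this slide can be sketched in a few lines. The function name and output spacing are assumptions. The rule itself (replace 01222 with 029 and prefix the subscriber number with 20) is the one stated above.

```python
def convert_cardiff(number: str) -> str:
    """Rewrite an old-style 01222 number as 029 20xx xxxx.

    Anything that is not an 11-digit 01222 number is returned unchanged.
    """
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) == 11 and digits.startswith("01222"):
        subscriber = digits[5:]  # the original six-digit local number
        return f"029 20{subscriber[:2]} {subscriber[2:]}"
    return number


print(convert_cardiff("01222 331988"))   # 029 2033 1988
print(convert_cardiff("01222 535681"))   # 029 2053 5681
print(convert_cardiff("029 2153 5681"))  # unchanged
```

The rule only ever produces numbers starting 029 20. It can repair legacy numbers, but it cannot tell whether a newer number under another prefix such as 029 21 is correct.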
  • 25. Information Presentation
      Which of these companies would you rather buy into?
      [Charts; labels 1 to 8]
  • 26. Illegality
      US accounting rules now outlaw chart manipulations
      Money Laundering rules
        ◦ Managers could go to prison
      Basel II and Sarbanes-Oxley
        ◦ Directors could go to prison
  • 27. Data Quality is Free
      Poor Data Quality costs 10-30% of Turnover routinely
      Particular issues can be catastrophic
        ◦ Regulator can fine companies
        ◦ People can sue
        ◦ Officers and directors could go to jail
      Data Quality is better than Free
      But needs to be worked at
  • 28. No IQ without DQ
      Cannot have good Information Quality
        ◦ Without good quality data
      Information Quality is a business issue
        ◦ Needs complete commitment
        ◦ Very strong management process
      Information is the Third Asset
        ◦ It is not a cost centre
        ◦ It is not reflected on the bottom line
        ◦ Yet
  • 29. Any Questions? keith.underdown@iaidq.org