Towards a Vocabulary for Data Quality Management (DQM) in Semantic Web Architectures
(Research in Progress)

Christian Fürber and Martin Hepp
christian@fuerber.com, mhepp@computer.org

Presentation @ 1st International Workshop on Linked Web Data Management,
March 25th, 2011, Uppsala, Sweden

Part 1:
                      What's the Problem?



C. Fürber, M. Hepp:                         2
Towards a Vocabulary for DQM
In SemWeb Architectures
Various Data Quality Problems
• Inconsistent duplicates
• Approximate duplicates
• Invalid characters
• Invalid substrings
• Character alignment violation
• Word transpositions
• Mistyping / misspelling errors
• Missing classification
• Incorrect classification
• Incorrect reference
• Referential integrity violation
• Cardinality violation
• Unique value violation
• Functional dependency violation
• Missing values
• Misfielded values
• False values
• Out of range values
• Imprecise values
• Meaningless values
• Outdated values
• Untyped literals
• Existence of homonyms
• Existence of synonyms
• Contradictory relationships
• Outdated conceptual elements

Reference: Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/


The Problem
• Negative population
• Weird population values
• Invalid URLs

Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql



Part 2:
        What are high quality data?



What is Data Quality?
• Data's "fitness for use by data consumers" (Wang, Strong 1996)
• "Conformance to specification" (Kahn et al. 2002)
• "Data are of high quality if they are fit for their intended
  uses in operations, decision making, and planning. Data
  are fit for use if they are free of defects and possess
  desired features." (Redman 2001)

• Requirements as "benchmark"
Perspective-Neutral Data Quality

Data quality is the degree to which data fulfill quality requirements,
no matter who makes the quality requirements.



Quality Requirements
• Population cannot be negative (problem: negative population)
• Population is indicated by numeric values (problem: weird population values)
• URLs usually start with http://, https://, etc. (problem: invalid URLs)

Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
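
The three requirements on this slide can be written down as executable checks. The following is only a minimal Python sketch under stated assumptions: the record layout, the field names `population` and `homepage`, and the sample values are made up for illustration and are not the data retrieved from the endpoint. The URL check also narrows "http://, https://, etc." to exactly these two schemes.

```python
import re
from urllib.parse import urlparse


def population_is_numeric(value: str) -> bool:
    """Requirement: population is indicated by numeric values (integers here)."""
    return re.fullmatch(r"[+-]?\d+", value.strip()) is not None


def population_is_non_negative(value: str) -> bool:
    """Requirement: population cannot be negative."""
    return population_is_numeric(value) and int(value) >= 0


def url_has_web_scheme(value: str) -> bool:
    """Requirement (narrowed, assumption): URLs start with http:// or https://."""
    parsed = urlparse(value.strip())
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def find_violations(records):
    """Return (record name, violated requirement) pairs."""
    violations = []
    for record in records:
        if not population_is_numeric(record["population"]):
            violations.append((record["name"], "non-numeric population"))
        elif not population_is_non_negative(record["population"]):
            violations.append((record["name"], "negative population"))
        if not url_has_web_scheme(record["homepage"]):
            violations.append((record["name"], "invalid URL"))
    return violations


# Hypothetical sample records, one per problem type on the slide.
sample = [
    {"name": "A", "population": "1200", "homepage": "http://example.org/a"},
    {"name": "B", "population": "-5", "homepage": "https://example.org/b"},
    {"name": "C", "population": "approx. 300", "homepage": "www.example.org/c"},
]

violations = find_violations(sample)
for name, problem in violations:
    print(name, problem)
```

Hard-coding each requirement like this is exactly the effort the deck wants to avoid; the sketch only makes the requirements concrete before they are expressed in a shared vocabulary.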



Satisfying Quality Requirements
• Problem 1: Expressing quality requirements (the desired states of individuals, groups, standards, etc.)
• Problem 2: Harmonizing requirements
• Problem 3: Satisfying requirements (status quo = desired state)
Part 3:
                               Research Goal



Major Research Goal
 • Represent Quality-Relevant information for
   automated…
                       – Data Quality Monitoring
                       – Data Quality Assessment
                       – Data Cleansing
                       – Filtering of High Quality Data

                                 …in a standardized vocabulary.


Motives for DQM-Vocabulary
• Help people explicitly express data quality
  requirements in the "same language" at Web scale
• Support the creation of consensual agreements
  on quality requirements
• Reduce effort for DQM activities
• Raise transparency about assumed quality
  requirements
• Enable consistency checks among quality
  requirements
Part 4:
                               Our Approach



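Basic Architecture
[Figure: layered architecture, listed from top to bottom]
• Outputs: Problem Classification, Assessment Scores, HQ Data Retrieval, Cleansed Data
• Processing: SPARQL-Query-Engine, DQM-Vocabulary
• Storage and input: Knowledgebase, RDB A, RDB B, Data Acquisition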

Main Concepts of DQM-Vocabulary
• Classify quality problems
• Express requirements
• Annotate quality scores
• Express cleansing tasks
• Account for task-dependent requirements
Data Quality Problem Types:
          Source for Potential Requirements
• Inconsistent duplicates
• Approximate duplicates
• Invalid characters
• Invalid substrings
• Character alignment violation
• Word transpositions
• Mistyping / misspelling errors
• Missing classification
• Incorrect classification
• Incorrect reference
• Referential integrity violation
• Cardinality violation
• Unique value violation
• Functional dependency violation
• Missing values
• Misfielded values
• False values
• Out of range values
• Imprecise values
• Meaningless values
• Outdated values
• Existence of homonyms
• Existence of synonyms
• Contradictory relationships
• Outdated conceptual elements
Data Quality Requirements
                                      Syntactical Rules
                                      Semantic Rules
                                     Redundancy Rules
                                    Completeness Rules
                                      Timeliness Rules
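
To make three of these rule categories more tangible, here is a rough Python sketch. Which check belongs to which category is an assumption made for illustration (completeness as "no missing values", redundancy as "no repeated unique value", timeliness as "not outdated"); the slide does not define the categories. All function names, field names, and records are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def completeness_violations(records, required_field):
    """Ids of records whose required field is missing or empty."""
    return [r["id"] for r in records if not r.get(required_field)]


def redundancy_violations(records, unique_field):
    """Values of a supposedly unique field that occur more than once."""
    counts = Counter(r[unique_field] for r in records if r.get(unique_field))
    return sorted(value for value, n in counts.items() if n > 1)


def timeliness_violations(records, date_field, max_age, today):
    """Ids of records last updated longer ago than max_age."""
    return [r["id"] for r in records if today - r[date_field] > max_age]


# Hypothetical records with one violation per rule category.
records = [
    {"id": 1, "name": "Uppsala", "code": "UPP", "updated": date(2011, 3, 1)},
    {"id": 2, "name": "", "code": "NEU", "updated": date(2011, 2, 20)},
    {"id": 3, "name": "Neubiberg", "code": "NEU", "updated": date(2009, 1, 15)},
]

missing_names = completeness_violations(records, "name")
duplicate_codes = redundancy_violations(records, "code")
outdated = timeliness_violations(
    records, "updated", timedelta(days=365), today=date(2011, 3, 25)
)
print(missing_names, duplicate_codes, outdated)
```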




Quality-Influencing Artifacts

[Figure: quality-influencing artifacts. Current focus of the DQM-Vocabulary: data.]




Design Alternatives:
   Statements about Classes & Properties


(1) Using classes and properties as subjects

(2) Using datatype properties with xsd:anyURI

(3) Mapping class and property URIs to new URIs


Part 5:
                    Application Examples



Example 1: Legal Value Rule (1/3)


               Which instances have illegal values
                 for property foo:country?




Example 1: Legal Value Rule (2/3)
[Figure: the instance foo:LegalValueRule_1 of the class dqm:LegalValueRule, linked to the literal values "tref:Countries", "tref:countryName", "foo:Countries", and "foo:countryName". Legend: class, instance, literal value.]
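
The idea behind a legal value rule can be sketched in plain Python: values of a tested property are compared against the values of a property in a trusted reference. This is an illustrative sketch, not the authors' SPARQL query. The identifiers `tref:Countries`, `tref:countryName`, `foo:Countries`, and `foo:countryName` come from the slide; the dictionary keys of `rule`, the instance identifiers, and the country values are assumptions.

```python
# Triples as (subject, predicate, object) tuples; all instances are made up.
triples = [
    ("tref:c1", "rdf:type", "tref:Countries"),
    ("tref:c1", "tref:countryName", "Germany"),
    ("tref:c2", "rdf:type", "tref:Countries"),
    ("tref:c2", "tref:countryName", "Sweden"),
    ("foo:i1", "rdf:type", "foo:Countries"),
    ("foo:i1", "foo:countryName", "Germany"),
    ("foo:i2", "rdf:type", "foo:Countries"),
    ("foo:i2", "foo:countryName", "Sweeden"),
    ("foo:i3", "rdf:type", "foo:Countries"),
    ("foo:i3", "foo:countryName", "Atlantis"),
]

# Hypothetical stand-in for foo:LegalValueRule_1; the key names are assumptions.
rule = {
    "tested_class": "foo:Countries",
    "tested_property": "foo:countryName",
    "reference_class": "tref:Countries",
    "reference_property": "tref:countryName",
}


def instances_of(triples, cls):
    """Subjects typed as the given class."""
    return {s for s, p, o in triples if p == "rdf:type" and o == cls}


def legal_values(triples, rule):
    """Values the trusted reference holds for the reference property."""
    members = instances_of(triples, rule["reference_class"])
    return {
        o for s, p, o in triples
        if s in members and p == rule["reference_property"]
    }


def illegal_instances(triples, rule):
    """(instance, value) pairs whose tested value is not a legal value."""
    allowed = legal_values(triples, rule)
    tested = instances_of(triples, rule["tested_class"])
    return sorted(
        (s, o) for s, p, o in triples
        if s in tested and p == rule["tested_property"] and o not in allowed
    )


result = illegal_instances(triples, rule)
print(result)
```

Because the rule is itself data, the same two functions can evaluate any number of rules without new code, which is the point of expressing requirements in a vocabulary.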



Example 1: Legal Value Rule (3/3)




Example 2: DQ-Assessment (1/2)


               How syntactically accurate are all
                 properties that are subject to
                      LegalValueRules?
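
One way to answer this question is to score each property that has a legal value rule by the share of its values that are legal. The sketch below assumes this ratio as the metric; the slide's actual query is not part of this extract. The property `foo:currency` and all values are hypothetical.

```python
def syntactic_accuracy(values, allowed):
    """Share of values that appear in the set of legal values.

    Returns None when there are no values to assess.
    """
    if not values:
        return None
    legal = sum(1 for value in values if value in allowed)
    return legal / len(values)


# Legal values per property, standing in for evaluated legal value rules.
rules = {
    "foo:countryName": {"Germany", "Sweden"},
    "foo:currency": {"EUR", "SEK"},
}

# Observed values per property (made-up sample data).
observed = {
    "foo:countryName": ["Germany", "Sweeden", "Sweden", "Atlantis"],
    "foo:currency": ["EUR", "SEK", "EUR"],
}

scores = {
    prop: syntactic_accuracy(observed.get(prop, []), allowed)
    for prop, allowed in rules.items()
}
for prop, score in scores.items():
    print(prop, score)
```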




Example 2: DQ-Assessment (2/2)




Part 6:
                               Conclusions &
                               Planned Work


Advantages of DQM-Vocabulary

• Minimizes human effort for DQM
• Web-Scale sharing/reuse of data quality
  requirements
• Consistency checks among data quality
  requirements
• Transparency about applied data quality
  rules
Limitations
• Representation of complex functional
  dependency rules and derivation rules
• Limited experience with real-world data sets
• Currently no dedicated concepts for classes and
  properties
• Research still in progress


Future Work
• Evaluation of design alternatives
• Development of processing framework
• Representation of more complex
  functional dependency rules / derivation
  rules
• Extension of DQM-Vocabulary
• Evaluation on real-world data sets
• Publication at http://semwebquality.org
Christian Fürber
   Researcher
   E-Business & Web Science Research Group

                 Werner-Heisenberg-Weg 39
                 85577 Neubiberg
                 Germany

                 skype            c.fuerber
                 email            christian@fuerber.com
                 web              http://www.unibw.de/ebusiness
                 homepage         http://www.fuerber.com
                 twitter          http://www.twitter.com/cfuerber




Paper available at http://bit.ly/gYEDdQ

Towards a Vocabulary for Data Quality Management in Semantic Web Architectures

  • 1. Towards a Vocabulary for DQM in Semantic Web Architectures (Research in Progress) Christian Fürber and Martin Hepp christian@fuerber.com, mhepp@computer.org Presentation @ 1st International Workshop on Linked Web Data Management, March 25th, 2011, Uppsala, Sweden
  • 2. Part 1: What‘s the Problem? C. Fürber, M. Hepp: 2 Towards a Vocabulary for DQM In SemWeb Architectures
  • 3. Various Data Quality Problems Inconsistent duplicates Invalid characters Missing classification Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ Incorrect reference Approximate duplicates Reference: Linking Open Data cloud diagram, by Character alignment violation Word transpositions Invalid substrings Mistyping / Misspelling errors Cardinality violation Missing values Referential integrity violation Misfielded values Unique value violation False values Functional Dependency Out of range values Violation Imprecise values Existence of Homonyms Meaningless values Incorrect classification Existence of Synonyms Contradictory relationships Outdated conceptual elements Untyped literals Outdated values C. Fürber, M. Hepp: 3 Towards a Vocabulary for DQM in SemWeb Architectures
  • 4. The Problem: Negative population; Weird population values; Invalid URLs. Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
  • 5. Part 2: What are high quality data?
  • 6. What is Data Quality? • Data's "fitness for use by data consumers" (Wang, Strong 1996) • "Conformance to specification" (Kahn et al. 2002) • "Data are of high quality if they are fit for their intended uses in operations, decision making, and planning. Data are fit for use if they are free of defects and possess desired features." (Redman 2001) • Requirements as "benchmark"
  • 7. Perspective-Neutral Data Quality: Data quality is the degree to which data fulfills quality requirements, no matter who makes the quality requirements.
  • 8. The Problem and the corresponding Quality Requirements: Negative population: population cannot be negative; Weird population values: population is indicated by numeric values; Invalid URLs: URLs usually start with http://, https://, etc. Data retrieved on 2011-03-12 from http://loc.openlinksw.com/sparql
  • 9. Satisfying Quality Requirements. Problem 1: Expressing Quality Requirements; Problem 2: Harmonizing Requirements; Problem 3: Satisfying Requirements. Diagram labels: Desired State (Individuals); Desired State (Groups); Desired State (Standards, etc.); Status Quo = Desired State
  • 10. Part 3: Research Goal
  • 11. Major Research Goal: Represent quality-relevant information for automated data quality monitoring, data quality assessment, data cleansing, and filtering of high quality data in a standardized vocabulary.
  • 12. Motives for DQM-Vocabulary • Help people explicitly express data quality requirements in the "same language" at Web scale • Support the creation of consensual agreements on quality requirements • Reduce effort for DQM activities • Raise transparency about assumed quality requirements • Enable consistency checks among quality requirements
  • 13. Part 4: Our Approach
  • 14. Basic Architecture (diagram labels): Problem Classification; Assessment Scores; HQ Data Retrieval; Cleansed Data; SPARQL-Query-Engine; DQM-Vocabulary; Knowledgebase; RDB A; RDB B; Data Acquisition
  • 15. Main Concepts of DQM-Vocabulary: Classify Quality Problems; Express Requirements; Annotate Quality Scores; Express Cleansing Tasks; Account for Task-Dependent Requirements
  • 16. Data Quality Problem Types: Source for Potential Requirements: Inconsistent duplicates; Invalid characters; Missing classification; Incorrect reference; Character alignment violation; Approximate duplicates; Word transpositions; Invalid substrings; Mistyping / Misspelling errors; Cardinality violation; Missing values; Referential integrity violation; Misfielded values; Unique value violation; False values; Functional dependency violation; Out of range values; Imprecise values; Existence of homonyms; Meaningless values; Incorrect classification; Existence of synonyms; Contradictory relationships; Outdated conceptual elements; Outdated values
  • 17. Data Quality Requirements: Syntactical Rules; Semantic Rules; Redundancy Rules; Completeness Rules; Timeliness Rules
  • 18. Quality-Influencing Artifacts. Current focus of DQM-Vocabulary: Data
  • 19. Design Alternatives: Statements about Classes & Properties: (1) Using classes and properties as subjects; (2) Using datatype properties with xsd:anyURI; (3) Mapping class and property URIs to new URIs
  • 20. Part 5: Application Examples
  • 21. Example 1: Legal Value Rule (1/3): Which instances have illegal values for property foo:country?
  • 22. Example 1: Legal Value Rule (2/3) (diagram): class dqm:LegalValueRule; instance foo:LegalValueRule_1; literal values "tref:Countries", "foo:Countries", "tref:countryName", "foo:countryName"
  • 23. Example 1: Legal Value Rule (3/3)
  • 24. Example 2: DQ-Assessment (1/2): How syntactically accurate are all properties that are subject to LegalValueRules?
  • 25. Example 2: DQ-Assessment (2/2)
  • 26. Part 6: Conclusions & Planned Work
  • 27. Advantages of DQM-Vocabulary • Minimizes human effort for DQM • Web-scale sharing/reuse of data quality requirements • Consistency checks among data quality requirements • Transparency about applied data quality rules
  • 28. Limitations • Representation of complex functional dependency rules and derivation rules • Limited experience with real-world data sets • Currently no concepts of its own for classes and properties • Research still in progress
  • 29. Future Work • Evaluation of design alternatives • Development of processing framework • Representation of more complex functional dependency rules / derivation rules • Extension of DQM-Vocabulary • Evaluation on real-world data sets • Publication at http://semwebquality.org
  • 30. Christian Fürber, Researcher, E-Business & Web Science Research Group, Werner-Heisenberg-Weg 39, 85577 Neubiberg, Germany. Skype: c.fuerber; email: christian@fuerber.com; web: http://www.unibw.de/ebusiness; homepage: http://www.fuerber.com; twitter: http://www.twitter.com/cfuerber. Paper available at http://bit.ly/gYEDdQ
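
The quality requirements on slide 8 and the legal value rule and assessment question on slides 21 to 25 could be sketched roughly as follows. This is only an illustrative stand-in in plain Python, not the authors' SPARQL-based DQM-Vocabulary implementation: the function names, the sample records, and the set of legal country names are all assumptions made up for the example.

```python
from urllib.parse import urlparse


def is_numeric(value):
    """Requirement (slide 8): population is indicated by numeric values."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def is_non_negative(value):
    """Requirement (slide 8): population cannot be negative."""
    return is_numeric(value) and float(value) >= 0


def has_web_scheme(url):
    """Requirement (slide 8): URLs usually start with http://, https://, etc.

    The accepted scheme set is an assumption chosen for this sketch.
    """
    return urlparse(str(url)).scheme in {"http", "https", "ftp"}


def illegal_values(records, prop, legal):
    """Legal value rule: records whose value for `prop` is not a legal value."""
    return [r for r in records if r.get(prop) not in legal]


def syntactic_accuracy(records, prop, legal):
    """Share of records whose value for `prop` is legal (None if no records)."""
    if not records:
        return None
    return sum(r.get(prop) in legal for r in records) / len(records)


# Hypothetical trusted reference values, standing in for tref:countryName.
legal_countries = {"Germany", "Sweden", "France"}

# Hypothetical data under test, standing in for instances with foo:country.
records = [
    {"id": "a", "country": "Germany", "population": "82000000",
     "url": "http://example.org/a"},
    {"id": "b", "country": "Sweeden", "population": "-5",
     "url": "www.example.org/b"},
    {"id": "c", "country": "Sweden", "population": "many",
     "url": "https://example.org/c"},
    {"id": "d", "country": "France", "population": "65000000",
     "url": "http://example.org/d"},
]

# Which instances violate the legal value rule, and how accurate is the property?
violations = [r["id"] for r in illegal_values(records, "country", legal_countries)]
score = syntactic_accuracy(records, "country", legal_countries)
print(violations, score)

# Which instances violate the population and URL requirements?
bad_population = [r["id"] for r in records if not is_non_negative(r["population"])]
bad_urls = [r["id"] for r in records if not has_web_scheme(r["url"])]
print(bad_population, bad_urls)
```

In the approach described on the slides, the rules themselves are stored as data in the knowledgebase and evaluated by a SPARQL query engine; the sketch hard-codes them in functions only to make the checks concrete.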