This presentation is supplementary material for the following article: Nikiforova, A. (2019). Analysis of open health data quality using data object-driven approach to data quality evaluation: insights from a Latvian context. In IADIS International Conference e-Health (pp. 119-126).
This research focuses on the analysis of the quality of open health data that are freely available and can be used by everyone for their own purposes. The quality of open data is crucial, as poor quality can lead to unreliable decision-making and financial losses; for open health data, quality plays an even more critical role. Despite its importance, this topic is rarely discussed. Therefore, the previously proposed data object-driven approach to data quality evaluation is applied to open health data in Latvia in order to (a) evaluate their quality, highlighting common quality issues that should be considered by both users and data publishers, and (b) demonstrate that the approach is suitable for this purpose, as it is simple enough and allows users without IT and data quality knowledge (domain experts) to take part in data quality analysis, examining data for their own purposes. The proposed solution appears useful for establishing communication between data users and publishers, improving the overall quality of data.
Analysis of open health data quality using data object-driven approach to data quality evaluation: insights from a Latvian context
1. ANALYSIS OF OPEN HEALTH DATA QUALITY USING
DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION:
INSIGHTS FROM A LATVIAN CONTEXT
13th Multi Conference on Computer Science and Information Systems
11th International Conference on e-Health
17 – 19 July 2019, Porto, Portugal
Anastasija Nikiforova
Faculty of Computing, University of Latvia
Anastasija.Nikiforova@lu.lv
2. OPEN DATA
Def. I: «Open data» are data that anyone can access, use and share.
(The New York Times, The Economist, WIRED)
The popularity of open data is continuously increasing. For instance, the European Data Portal collects more than 800 thousand data sets.
The aggregate economic impact of applications based on open data across the EU27 economy is estimated at €140 billion annually.
The McKinsey Global Institute report estimated that open data could add over $3 trillion annually in total value to the global economy.
Open Government Data (OGD):
• impacting economic growth,
• improving government services,
• reducing fraud,
• reducing waste.
3. A number of studies indicate the existence of data quality problems in open data:
Ferney et al., 2017;
Kerr et al., 2007;
Kuk and Davies, 2011;
Martin, 2014;
Nikiforova, 2018a, 2018b;
Nikiforova and Bicevskis, 2019;
Vetrò et al., 2016,
etc.
8 PRINCIPLES OF OPEN DATA
OGD: the quality aspect takes only 4th place in popularity after policy, benefit and risk, although quality can affect these aspects (Klein et al., 2018).
Data quality appears to be one of the most problematic dimensions for open data portals.
Def. II: «Quality» is a desirable goal to be achieved through management of the production process.
Def. III: «Data quality» is a relative concept, largely dependent on specific requirements resulting from the data use.
(SunlightFoundation, 2007), (European Data Portal, 2018)
Open data must be:
1. complete
2. timely
3. primary
4. accessible
5. non-discriminatory
6. licence-free
7. machine-processable
8. non-proprietary
And what about data quality*???
4. THE STATE OF OPEN DATA IN LATVIA
Latvia:
• is one of 70 countries participating in the Open Government Partnership, an international platform for domestic reformers committed to making their governments more open, accountable, and responsive to citizens;
• is a fast-tracker (among beginners, followers, fast-trackers, trend-setters) (Open Data Maturity report);
• has the highest rate of open data maturity in comparison with its neighbours from the Baltic States and Scandinavian countries.
Open data maturity of the Latvian open data portal:
• in 2016 - 31st,
• in 2017 - 20th,
• in 2018 - 12th.
As for the quality aspect: 11th place with just 370 out of 520 points.
Quality is the weakest aspect for Latvia among impact, policy, portal, and quality: only 62%, while the average rate for all analysed countries is 71%.
In 2017 the Latvian Ministry of Environmental Protection and Regional Development launched the new Latvian Open Data Portal:
• at the moment of its launch: 33 data sets from 13 data publishers;
• in July 2018: 139 data sets from 41 publishers;
• in June 2019: 228 data sets from 62 publishers.
5. OPEN HEALTH(CARE) DATA I
The aims and possible uses of open health(care) data can be very different, since health data and information are characterized by a large number of possible applications, uses and users.
The volume of health(care) data is continuously increasing, and it is expected to grow dramatically in the years ahead.
Open health(care) data is one of the most popular categories of open data.
(Cabitza and Batini, 2016)
Health and healthcare data are very broad concepts*; this research focuses on one subdomain: open health data.
*Def. IV: «Health care data» are items of knowledge
about an individual patient or a group of patients.
*Def. V: «Health data» are any representation of facts
related to the health of single individuals or entire
populations and that is suitable for communication,
interpretation or processing by manual or electronic
means;
(World Health Organization, 2003), (Abdelhak M, Grostick S, Hanken MA, 2012)
Healthcare is characterized by highly complex labor- and skill-intensive services where the actors involved still rely primarily on paper tools, their own cognition (competencies and memory), and other traditional methods.
(Cabitza & Batini, 2016) HUMAN FACTOR!!!
6. OPEN HEALTH(CARE) DATA II
Between 56% and 79% of Internet users seek
health information online:
- 35%,
- 42%,
with the lowest proportion in the Southern
countries:
- 30%,
- 23%.
(Andreassen et al., 2007)
Open health(care) data must be of high quality, as they:
are needed for health(care) planning and administrative purposes:
• form the basis for the health and medicines authority’s hospital statistics or health economic calculations,
• provide authorities with data to support hospital planning,
• monitor the frequency of various diseases and treatments,
• provide a sampling frame for medical research,
• facilitate quality assurance of the health(care) services,
• etc.
can be useful when searching for data on medications, their dose, contraindications and other information available to a wide audience.
A number of studies discussing the quality of health(care) data in many countries come to the same conclusion: health(care) data have data quality problems.
7. OPEN [HEALTH] DATA QUALITY
Assumption: as the level of detail of “open” data might be lower in comparison with “closed” data stored in databases, quality checks can be simpler.
• open data are usually used by a wide audience that may not have deep knowledge of IT or data quality;
• a solution should be simple enough, giving individual users the possibility to take part in the analysis of «third-party» open data for their own purposes.
Solution: previously proposed user-oriented data object-driven approach
(Bicevskis, Bicevska, Nikiforova, Oditis, 2018), (Nikiforova, 2019)
!!! The same data may be of sufficient quality in one case BUT completely useless under other circumstances.
8. RELATED RESEARCH
General studies on data and information quality define different dimensions of quality and their groupings.
✘ The key data quality dimensions are not universally agreed*;
✘ There is no agreement on their meanings and usability**;
✘ Each dimension can be supplied with one or more metrics that vary from one solution to another;
✘ The number of different data quality dimensions, their definitions and groupings are often useful only for a particular solution.
Question: How to relate a particular dimension (and which one?) to a particular use-case???
Problem: the necessity to involve data quality experts at every stage of the data quality analysis process.
Solution: data object-driven approach to data quality evaluation.
(Bicevskis, Bicevska, Nikiforova, Oditis, 2018), (Nikiforova, Bicevskis, 2019)
* «… This state of affairs has led to much confusion within the data
quality community and is even more bewildering for those who are
new to the discipline and more importantly to business
stakeholders…» (DAMA UK, 2018)
** In different proposals, dimensions of the same name can have
different semantics and vice versa. (Batini, 2016)
Example I: (Kerr, et al., 2007):
New Zealand’s healthcare data:
6 data quality dimensions,
24 characteristics
69 data quality criteria.
Example II: (Dahbi et al., 2018; Weiskopf et al.,
2013):
2 data quality dimensions:
accuracy and completeness
9. MAIN PRINCIPLES OF THE
PROPOSED SOLUTION
TDQM data quality lifecycle: data quality definition → data quality measuring → data quality analysis → data quality improvement.
Each specific application can have its own specific DQ checks;
DQ requirements can be formulated on several levels: from informal text in natural language to an automatically executable model, SQL statements or program code;
DQ can be checked at various stages of the data processing;
The DQ definition language is a graphical DSL:
• the diagrams are easy to read, create, understand and edit even by non-IT and non-DQ experts;
• syntax and semantics can be easily applied to any new IS.
10. DATA QUALITY MODEL
instead of dimensions
1. DATA OBJECT (DO) - the set of values of the parameters that characterize a real-life object
• primary data object - the initial DO whose quality is analysed;
• secondary data object - a DO that determines the context for analysis of the primary DO.
* Many objects of the same structure form a class of data objects.
2. DATA QUALITY REQUIREMENTS - conditions that must be met for a data object to be considered of high quality.
** May contain informal or formalized implementation-independent descriptions of conditions.
3. DATA QUALITY MEASURING PROCESS - procedures that should be performed to evaluate the data object’s quality.
!!! All three components are defined by using a graphical domain-specific language (DSL)**
** Three DSL families were developed as graphic languages based on the capabilities of the modelling platform DIMOD.
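The three components of the model can be illustrated in code. The following is only a minimal Python sketch, not the graphical DSL used in the presentation: the class names `Requirement` and `QualityModel`, the `measure` method and the sample record are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Component 1: a data object (DO) is the set of parameter values of one real-life object.
DataObject = Dict[str, Optional[str]]


@dataclass
class Requirement:
    """Component 2: one named condition a data object must satisfy (hypothetical class)."""
    name: str
    condition: Callable[[DataObject], bool]


@dataclass
class QualityModel:
    """Hypothetical container that ties a primary DO to its requirements."""
    primary: DataObject
    requirements: List[Requirement] = field(default_factory=list)

    def measure(self) -> Dict[str, str]:
        """Component 3: run every requirement and return a test protocol (OK / NO)."""
        return {
            r.name: "OK" if r.condition(self.primary) else "NO"
            for r in self.requirements
        }


# Hypothetical sample record; field names follow the slides, values are made up.
product: DataObject = {"product_id": "P-001", "original_name": None}

model = QualityModel(
    primary=product,
    requirements=[
        Requirement("product_id exists", lambda do: do.get("product_id") is not None),
        Requirement("original_name exists", lambda do: do.get("original_name") is not None),
    ],
)
protocol = model.measure()
print(protocol)  # {'product_id exists': 'OK', 'original_name exists': 'NO'}
```

Keeping requirements as data (a list of named conditions) rather than hard-coded checks mirrors the idea that each use case can attach its own requirements to the same data object.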
11. DATA QUALITY ANALYSIS OF
OPEN HEALTH(CARE) DATA
15 data sets from 7 different data publishers;
15 primary data objects and 11 secondary data objects were involved in the data quality analysis; the secondary data objects were applied to 35 parameters of the primary data objects;
The most frequently occurring data quality issues:
✘ contextual data quality issues;
✘ empty values (completeness);
✘ multiple notations for the same object within one data object and even one parameter;
✘ issues in interrelated parameters.
✘ only 6 out of 15 data sets are updated as frequently as promised;
✘ only 8 out of 15 data sets are supplied with an explanation of parameters;
✔ almost all available data sets are provided in a machine-readable format: the most popular open data format is .xlsx (53.3%), while 26.7% are in .zip, including data sets in .xlsx and .csv format;
✘ 1 data set cannot be considered open data.
13. DATA QUALITY SPECIFICATION
Quality conditions are defined only for the primary data object.
DQ requirements are defined by using logical expressions.
The names of DO attributes/fields serve as operands in the logical expressions.
Both syntactic and semantic data quality can be analysed according to unified principles.
[Diagram: data quality specification of the primary DO, with the secondary DOs and the links between primary and secondary DOs (informal rules); SendMessage steps and OK/NO branches accompany the checks.]
Assess Field "product_id": checkValueExists(product_id)
Assess Field "original_name": checkValueExists(original_name)
Assess Field "pharmaceutical_form": checkValueExists(pharmaceutical_form)
Assess Field "marketing_authorisation_holder": checkValueExists(marketing_authorisation_holder)
Assess Field "exp_country_en": checkValueExists(exp_country_en)
Assess Field "exp_country_lv": checkValueExists(exp_country_lv)
Assess Field "atc_code": checkValueExists(atc_code)
Assess Field "authorisation_procedure": checkValueExists(authorisation_procedure), checkValueEnumerable(authorisation_procedure)
Assess Field "summary_of_product_characteristics": checkValueExists(summary_of_product_characteristics), checkValueSummary_of_product_characteristics(Summary_of_product_characteristics, 'https://www.zva.gov.lv/zalu-registrs/attachments/pdf.php?id=%'+'&src=description')
Secondary DO attributes shown in the diagram: ISO3, ISO2, OfficialName, ShortName, ShortName_LV, OfficialName_LV, Code (ISO-3166-1), ATC_code.
Links between primary and secondary DOs (informal rules):
checkMarketing_authorisation_holderName(Country, marketing_authorisation_holder)
checkExp_country_enName(Country, exp_country_en)
checkExp_country_lvName(Country_LV, exp_country_lv)
checkAtc_codeName(ATC, atc_code)
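The checks named in the specification (checkValueExists, checkValueEnumerable and the URL pattern check) can be written as small predicates. This is a hedged Python sketch: the function names are Python-style stand-ins for the names in the diagram, the list of permitted procedures contains only the three values visible in the slides (the original list is longer), and treating blank strings as missing is an assumption.

```python
from typing import Iterable, Optional

# Only the three values visible in the slides; the full list is elided in the source.
KNOWN_PROCEDURES = {
    "Eiropas centralizētā reģistrācijas procedūra",
    "Nacionālā reģistrācijas procedūra",
    "Decentralizētā reģistrācijas procedūra",
}
URL_PREFIX = "https://www.zva.gov.lv/zalu-registrs/attachments/pdf.php?id="
URL_SUFFIX = "&src=description"


def check_value_exists(value: Optional[str]) -> bool:
    """True when a value is present (assumption: blank strings count as missing)."""
    return value is not None and value.strip() != ""


def check_value_enumerable(value: Optional[str], allowed: Iterable[str]) -> bool:
    """True when the value is present and belongs to the list of permitted values."""
    return check_value_exists(value) and value in set(allowed)


def check_value_pattern(value: Optional[str], prefix: str, suffix: str) -> bool:
    """True when the value matches 'prefix%suffix' in the sense of SQL LIKE."""
    return (
        check_value_exists(value)
        and len(value) >= len(prefix) + len(suffix)
        and value.startswith(prefix)
        and value.endswith(suffix)
    )


# Hypothetical record; field names follow the slides, values are made up.
record = {
    "atc_code": None,
    "authorisation_procedure": "Nacionālā reģistrācijas procedūra",
    "summary_of_product_characteristics": URL_PREFIX + "123" + URL_SUFFIX,
}
results = {
    "atc_code": check_value_exists(record["atc_code"]),
    "authorisation_procedure": check_value_enumerable(
        record["authorisation_procedure"], KNOWN_PROCEDURES
    ),
    "summary_of_product_characteristics": check_value_pattern(
        record["summary_of_product_characteristics"], URL_PREFIX, URL_SUFFIX
    ),
}
print(results)
```

Each predicate corresponds to one logical expression over a single field, which is what lets syntactic checks (existence, pattern) and semantic checks (permitted values) share one form.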
14. DATA QUALITY MEASURING
PROCESS
The activities to be taken to select data object values from data sources.
One or more steps to evaluate the quality of the data, each of which describes one test for the compliance of the data object with a specific quality specification.
+ For contextual checks: gather values of the secondary DOs from the data sources if the parameter indicating the secondary DO’s value in the scope of the defined quality condition is true:
1. read/write operations from the data source into the database,
2. connection of primary and secondary data objects via appropriate parameters.
The steps to improve data quality, automatically or manually triggering changes in the data source.
The language describing the quality evaluation process involves verification activities for a particular DO that can be defined:
• informally as a natural language text,
• using UML activity diagrams,
• in its own DSL.
Additionally, processing instances of DO classes may require looping constructions, similar to the iterators used in C#.
15. DATA QUALITY MEASURING
PROCESS
A concrete DO or a class of DOs is used as an input for a quality verification process.
The quality verification process creates a test protocol.
In case of SQL:
• the SELECT statement specifies the target DO,
• the WHERE clause specifies quality requirements,
+
• the JOIN clause links primary and secondary DOs.
[Diagram: data quality measuring process for the primary DO "Medicinal_Product"; SendMessage steps and OK/NO branches accompany the checks.]
Read data from data sources and write into DB "Medicinal_Product"
Read data from data sources and write into DB "Country"
Read data from data sources and write into DB "Country_LV"
Read data from data sources and write into DB "ATC"

Assess Field "product_id"
SELECT * FROM [dbo].[Medicinal_product] WHERE [product_id] IS NULL

Assess Field "original_name"
SELECT * FROM [dbo].[Medicinal_product] WHERE [original_name] IS NULL

Assess Field "pharmaceutical_form"
SELECT * FROM [dbo].[Medicinal_product] WHERE [pharmaceutical_form] IS NULL

Assess Field "marketing_authorisation_holder"
SELECT * FROM [dbo].[Medicinal_product]
LEFT JOIN [dbo].[country] ON
  [dbo].[country].[Short name] = RIGHT(marketing_authorisation_holder, CHARINDEX(',', REVERSE(marketing_authorisation_holder)) - 2) OR
  [dbo].[country].[Official name] = RIGHT(marketing_authorisation_holder, CHARINDEX(',', REVERSE(marketing_authorisation_holder)) - 2) OR
  [dbo].[country].[ISO3] = RIGHT(marketing_authorisation_holder, CHARINDEX(',', REVERSE(marketing_authorisation_holder)) - 2)
WHERE [dbo].[country].[Short name] IS NULL AND [dbo].[country].[Official name] IS NULL AND [dbo].[country].[ISO3] IS NULL

Assess Field "exp_country_en"
SELECT * FROM [dbo].[Medicinal_product]
LEFT JOIN [dbo].[country] ON
  [dbo].[country].[Short name] = exp_country_en OR
  [dbo].[country].[Official name] = exp_country_en OR
  [dbo].[country].[ISO3] = exp_country_en
WHERE [dbo].[country].[Short name] IS NULL AND [dbo].[country].[Official name] IS NULL AND [dbo].[country].[ISO3] IS NULL

Assess Field "exp_country_lv"
SELECT * FROM [dbo].[Medicinal_product]
LEFT JOIN [dbo].[country_lv] ON
  [dbo].[country_lv].[Code (ISO-3166-1)] = exp_country_lv OR
  [dbo].[country_lv].[ShortName_LV] = exp_country_lv OR
  [dbo].[country_lv].[LongName_LV] = exp_country_lv
WHERE [dbo].[country_lv].[Code (ISO-3166-1)] IS NULL AND [dbo].[country_lv].[ShortName_LV] IS NULL AND [dbo].[country_lv].[LongName_LV] IS NULL

Assess Field "atc_code"
SELECT product_id,
  REPLACE(SUBSTRING(atc_code, CHARINDEX(';', atc_code), LEN(atc_code)), ';', '') AS atc1,
  LEFT(atc_code, CHARINDEX(';', atc_code) - 1) AS atc2
INTO #atc_divided
FROM [dbo].[Medicinal_product]
WHERE LEFT(atc_code, CHARINDEX(';', atc_code) - 0) NOT LIKE '';
SELECT product_id FROM [dbo].[Medicinal_product]
LEFT JOIN [dbo].[ATC] ON [dbo].[ATC].[ATC_code] = [dbo].[Medicinal_product].[atc_code]
WHERE [dbo].[ATC].[ATC_code] IS NULL
EXCEPT SELECT product_id FROM #atc_divided

Assess Field "authorisation_procedure"
SELECT * FROM [dbo].[Medicinal_product]
WHERE authorisation_procedure IS NULL OR (
  authorisation_procedure NOT LIKE 'Eiropas centralizētā reģistrācijas procedūra' AND
  authorisation_procedure NOT LIKE 'Nacionālā reģistrācijas procedūra' AND ... AND
  authorisation_procedure NOT LIKE 'Decentralizētā reģistrācijas procedūra')

Assess Field "summary_of_product_characteristics"
SELECT * FROM [dbo].[Medicinal_product]
WHERE summary_of_product_characteristics IS NULL OR
  summary_of_product_characteristics NOT LIKE 'https://www.zva.gov.lv/zalu-registrs/attachments/pdf.php?id=%' + '&src=description'
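The SQL pattern of this slide (WHERE ... IS NULL for existence checks, LEFT JOIN plus IS NULL for contextual checks against a secondary DO) can be tried on a small scale. The sketch below uses Python's standard sqlite3 module with an in-memory database instead of the SQL Server dialect of the slides; the simplified table layouts, column names and rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Primary DO (simplified, hypothetical columns and rows)
    CREATE TABLE medicinal_product (product_id TEXT, original_name TEXT, exp_country_en TEXT);
    -- Secondary DO that provides the context for the country check
    CREATE TABLE country (short_name TEXT, official_name TEXT, iso3 TEXT);

    INSERT INTO medicinal_product VALUES
        ('P1', 'Alpha', 'Latvia'),
        ('P2', NULL,    'LVA'),
        ('P3', 'Gamma', 'Latvija');
    INSERT INTO country VALUES ('Latvia', 'Republic of Latvia', 'LVA');
    """
)

# Existence check: SELECT names the target DO, WHERE states the requirement.
missing_name = conn.execute(
    "SELECT product_id FROM medicinal_product WHERE original_name IS NULL"
).fetchall()

# Contextual check: LEFT JOIN links primary and secondary DOs; rows for which
# no country notation matched come back with NULLs on the secondary side.
unknown_country = conn.execute(
    """
    SELECT p.product_id
    FROM medicinal_product AS p
    LEFT JOIN country AS c
      ON c.short_name = p.exp_country_en
      OR c.official_name = p.exp_country_en
      OR c.iso3 = p.exp_country_en
    WHERE c.short_name IS NULL AND c.official_name IS NULL AND c.iso3 IS NULL
    ORDER BY p.product_id
    """
).fetchall()

print(missing_name)     # [('P2',)]
print(unknown_country)  # [('P3',)]
conn.close()
```

Both queries return the records that violate a requirement, so an empty result corresponds to an "OK" branch and a non-empty one to a "NO" branch.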
16. APPROBATION. RESULTS
Publisher | Dataset | Context issues/context total | Empty/Total | Multiple notation/Total | Clean/Total
Centre for Disease Prevention and Control | Incidence of 2nd type diabetes in Latvia | - | 0/6 | 0/6 (0) | 6/6
Ministry of Welfare | Distribution of persons receiving tech aid by AT | 2/2 (100%) | 3/7 (43%) | 0/7 (0) | 2/7
Ministry of Welfare | Number of social service providers | 2/2 (100%) | 22/27 (82%) | 10/27 (37%) | 4/27
Ministry of Welfare | Persons with disabilities by the severity of the disability and AT | 2/2 (100%) | 0/23 (0) | 0/23 (0) | 20/23
Ministry of Welfare | Number of children with disabilities by AT | 2/2 (100%) | 0/10 (0) | 0/10 (0) | 8/10
State labour inspectorate | Accidents at work | (0-1/1) (0-100%) | 1/10 (10%) | 0/10 (0) | 8/10
State labour inspectorate | Occupational diseases confirmed | 4/5 (80%) | 2/11 (18%) | 1/11 (9%) | 9/11
National Blood Donor Centre | National Blood Donor Center Statistics | - | 0/4 (0) | 0/4 (0) | 4/4
State Agency of medicines | Register of licensed pharmaceutical companies | 1/2 (50%) | 17/38 (45%) | 0/38 (0) | 19/38
State Agency of medicines | Medicines consumption statistics | 3/3 (100%) | 5/8 (63%) | 2/8 (25%) | 0/8
State Agency of medicines | Medicinal Product Register of Latvia | 4/9 (44%) | 21/41 (51%) | 1/41 (2%) | 14/41
Food and veterinary service | Food supplements register | 2/2 (100%) | 30/35 (86%) | 4/35 (11%) | 5/35
Food and veterinary service | Dietary foodstuffs register | 2/2 (100%) | 19/22 (87%) | 4/22 (18%) | 3/22
17. DATA QUALITY ANALYSIS OF OPEN
HEALTH(CARE) DATA: CONTEXTUAL ISSUES
Only 1 data set out of 12 (8.3%) did not have any data quality issues (“Accidents at work”); however, some manipulations were needed in order to achieve this result.
In total, 25 out of 35 parameters (71.4%) had at least a few data quality issues.
Example I: data set “Accidents at work”, value «88.3332-03» = data set «Work codes», value I: “8332” AND value II: “03”.
Example II: 4 data sets published by the Ministry of Welfare:
the [ATTU code] and [City, county] parameters are supposed to store the code of the administrative territory and the city, which must correspond to the secondary data object “Classification of Administrative Territories and Territorial Units”;
✘ 3 values are invalid, as they aren’t available in the secondary data set: “Total”, “Abroad” and “Address isn’t specified”.
Possibly, the data publisher is aware of this, as these values make sense;
BUT!!! This data quality problem can easily go unnoticed and can lead to inaccurate data analysis results.
18. DATA QUALITY ANALYSIS OF OPEN
HEALTH(CARE) DATA: CONTEXTUAL ISSUES
Example I: “Number of social service providers” data set, 3 parameters: [Service with accommodation], [Service without accommodation] and [Service with and without accommodation].
From the users’ viewpoint:
[Service with and without accommodation] = [Service with accommodation] + [Service without accommodation]
BUT!!! For 95 records this assumption does not hold.
Example II: “Number of children with disabilities by administrative territory” data set:
[1# group] = [18-29 years 1# group] + [30-44 years 1# group] + … + [>=65 years 1# group];
[2# group] = [18-29 years 2# group] + [30-44 years 2# group] + … + [>=65 years 2# group];
[3# group] = [18-29 years 3# group] + [30-44 years 3# group] + … + [>=65 years 3# group]
For 121 records this assumption does not hold.
At least two possible explanations (which of these options???):
1) there are data quality problems;
2) these fields aren’t interconnected, and the sum of the values of the first two parameters does not necessarily have to equal the value of the 3rd parameter.
!!! Data publishers must provide a brief explanation of the parameters and of how the numerical data were obtained.
Another problem for 4 out of 15 data sets (26.7%): a different number of interrelated values that may appear in different ways:
(a) values in different languages,
(b) ID number and name,
(c) name and supplementary data such as type, country, phone number of representatives.
Dataset | Context issues/context total
Incidence of 2nd type diabetes in Latvia | 0/0
Distribution of persons receiving tech aid by AT | 2/2 (100%)
Number of social service providers | 2/2 (100%)
Persons with disabilities by the severity of the disability … | 2/2 (100%)
Number of children with disabilities by AT | 2/2 (100%)
Accidents at work | (0-1/1) (0-100%)
Occupational diseases confirmed | 4/5 (80%)
National Blood Donor Center Statistics | 0/0
Register of licensed pharmaceutical companies | 1/2 (50%)
Medicines consumption statistics | 3/3 (100%)
Medicinal Product Register of Latvia | 4/9 (44%)
Food supplements register | 2/2 (100%)
Dietary foodstuffs register | 2/2 (100%)
Veterinary medicinal product register | 1/3 (33%)
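The user-side assumption that a total parameter equals the sum of its parts can be checked mechanically. The following Python sketch is illustrative only: the function name `breaks_sum_rule` and the sample records are hypothetical, and skipping records with empty values is an assumption (they would be reported by a completeness check instead).

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[int]]

# Parameter names as given in the slide for "Number of social service providers".
TOTAL = "Service with and without accommodation"
PARTS = ["Service with accommodation", "Service without accommodation"]


def breaks_sum_rule(record: Record, total: str, parts: List[str]) -> bool:
    """True when all values are present and the total differs from the sum of the parts."""
    values = [record.get(name) for name in [total] + parts]
    if any(v is None for v in values):
        # Assumption: empty values are handled by a separate completeness check.
        return False
    return values[0] != sum(values[1:])


# Hypothetical records; the real data set is not reproduced here.
records: List[Record] = [
    {PARTS[0]: 3, PARTS[1]: 2, TOTAL: 5},     # consistent
    {PARTS[0]: 1, PARTS[1]: 4, TOTAL: 7},     # total does not match the sum
    {PARTS[0]: None, PARTS[1]: 4, TOTAL: 4},  # incomplete, skipped here
]

violations = [i for i, r in enumerate(records) if breaks_sum_rule(r, TOTAL, PARTS)]
print(violations)  # [1]
```

A flagged record is not necessarily wrong: as the slide notes, the parameters may simply not be interconnected, which is why an explanation from the publisher is needed.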
19. DATA QUALITY ANALYSIS OF OPEN
HEALTH(CARE) DATA: COMPLETENESS
For 136 out of 167 (81.4%) analysed parameters at least one value was empty.
The number of empty values per parameter varies from 1 to all values of a certain parameter.
The total share of empty values in the analysed data sets is 15%.
The problem of empty values appears even for the primary data of the data sets:
Example: “Dietary foodstuffs register” data set:
✘ 4 records don’t have [Name] and [ProducerName].
This issue is almost “traditional” in many sectors and countries.
However, some studies demonstrate that a high level of data completeness can be achieved.
(Schmidt et al., 2015), (Oliveira, 2016), (Wanner et al., 2018), (Tomic, 2015), (Yi, 2019), (Sigurdardottir, 2012), (Larsen, 2009)
Dataset | Empty/Total
Incidence of 2nd type diabetes in Latvia | 0/6
Distribution of persons receiving tech aid by AT | 3/7 (43%)
Number of social service providers | 22/27 (82%)
Persons with disabilities by the severity of the disability … | 0/23 (0)
Number of children with disabilities by AT | 0/10 (0)
Accidents at work | 1/10 (10%)
Occupational diseases confirmed | 2/11 (18%)
National Blood Donor Center Statistics | 0/4 (0)
Register of licensed pharmaceutical companies | 17/38 (45%)
Medicines consumption statistics | 5/8 (63%)
Medicinal Product Register of Latvia | 21/41 (51%)
Food supplements register | 30/35 (86%)
Dietary foodstuffs register | 19/22 (87%)
Veterinary medicinal product register | 16/26 (62%)
NOTE: 28 of the 136 detected cases of empty values may not be quality issues; however, as there are no notes from the data publisher regarding their nullability, there is no certainty that they are unproblematic, since empty values may have different interpretations.
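The two completeness figures used on this slide (the number of parameters with at least one empty value, and the share of empty values overall) can be computed as follows. This is a minimal Python sketch on made-up rows: the helper names, the sample data and the decision to treat blank strings as empty are assumptions, not the procedure used in the study.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def is_empty(value: Optional[str]) -> bool:
    """Assumption: None and blank strings both count as empty values."""
    return value is None or value.strip() == ""


def completeness_summary(rows: List[Row]) -> Dict[str, float]:
    """Count parameters with at least one empty value and the share of empty cells."""
    params = sorted({name for row in rows for name in row})
    empty_per_param = {p: sum(is_empty(row.get(p)) for row in rows) for p in params}
    total_cells = len(rows) * len(params)
    empty_cells = sum(empty_per_param.values())
    return {
        "parameters_total": len(params),
        "parameters_with_empty": sum(1 for n in empty_per_param.values() if n > 0),
        "empty_share": empty_cells / total_cells if total_cells else 0.0,
    }


# Hypothetical rows; [Name] and [ProducerName] are parameter names from the slide.
rows: List[Row] = [
    {"Id": "1", "Name": "A", "ProducerName": "X", "Notes": ""},
    {"Id": "2", "Name": None, "ProducerName": None, "Notes": "n/a"},
    {"Id": "3", "Name": "C", "ProducerName": "Z", "Notes": "ok"},
    {"Id": "4", "Name": "D", "ProducerName": "W", "Notes": "ok"},
]

summary = completeness_summary(rows)
print(summary)

# The percentage reported on the slide follows the same arithmetic:
print(round(136 / 167 * 100, 1))  # 81.4
```

Whether a blank string, NULL or a placeholder such as "n/a" counts as empty is exactly the kind of decision that depends on the use case and on notes from the publisher.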
20. DATA QUALITY ANALYSIS OF OPEN
HEALTH(CARE) DATA: MULTIPLE NOTATIONS
FOR A SINGLE OBJECT
Multiple notations for a single object within a single data set and even a single parameter:
✘ in 6 out of 15 data sets (40%), in 22 out of 167 parameters (13.2%).
May appear in different ways, such as:
• a different name for one country, for instance, (a) USA vs. United States vs. United States of America; (b) Northern Ireland vs. Republic of Ireland vs. Ireland; (c) Scotland vs. Scotland UK, etc.;
• a different name for the type of preparation, ingredient or unit size, for instance, (a) singular, (b) plural, (c) shortened form, (d) with a spelling mistake, etc.;
• different patterns for one value, for instance, phone or registration number: with or without (1) code or (2) delimiter; type of delimiter, etc.;
• different notations indicating the absence of a value: NULL and ‘0’**.
Do both NULL and ‘0’ values have the same meaning??? ‘0’ can indicate a value that is equal to zero, while NULL can mean that the value isn’t known.
** often called “heterogeneity”
This problem is also widespread in many sectors and even countries, e.g. OGD of the UK (Kuk and Davies, 2011).
In 5 out of 8 cases it could be solved by introducing mechanisms that control the list of permissible values.
Dataset | Multiple notation/Total
Incidence of 2nd type diabetes in Latvia | 0/6 (0)
Distribution of persons receiving tech aid by AT | 0/7 (0)
Number of social service providers | 10/27 (37%)
Persons with disabilities by the severity of the disability … | 0/23 (0)
Number of children with disabilities by AT | 0/10 (0)
Accidents at work | 0/10 (0)
Occupational diseases confirmed | 1/11 (9%)
National Blood Donor Center Statistics | 0/4 (0)
Register of licensed pharmaceutical companies | 0/38 (0)
Medicines consumption statistics | 2/8 (25%)
Medicinal Product Register of Latvia | 1/41 (2%)
Food supplements register | 4/35 (11%)
Dietary foodstuffs register | 4/22 (18%)
Veterinary medicinal product register | 0/26 (0)
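A list of permissible values, as suggested above, also makes multiple notations detectable. The Python sketch below groups the notations found in one parameter by the canonical value they map to; the alias table, the `normalise` helper and the sample values are hypothetical, and in practice the alias table would be derived from a reference classifier.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional

# Hypothetical alias table mapping normalised notations to one canonical value.
COUNTRY_ALIASES = {
    "usa": "United States of America",
    "united states": "United States of America",
    "united states of america": "United States of America",
}


def normalise(value: str) -> str:
    """Lower-case the value and collapse whitespace before the lookup."""
    return " ".join(value.strip().lower().split())


def find_multiple_notations(
    values: Iterable[Optional[str]], aliases: Dict[str, str]
) -> Dict[str, List[str]]:
    """Return canonical values that appear under more than one notation."""
    seen = defaultdict(set)
    for value in values:
        if value is None:
            continue  # absence of a value is a completeness matter, not a notation
        canonical = aliases.get(normalise(value))
        if canonical is not None:
            seen[canonical].add(value)
    return {c: sorted(n) for c, n in seen.items() if len(n) > 1}


# Hypothetical parameter values.
column = ["USA", "United States", "Latvia", None, "USA", "United States of America"]
duplicates = find_multiple_notations(column, COUNTRY_ALIASES)
print(duplicates)
# {'United States of America': ['USA', 'United States', 'United States of America']}
```

Note that NULL is deliberately not mapped to ‘0’ here: as the slide points out, the two may carry different meanings, so merging them would need confirmation from the data publisher.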
21. RESULTS I
Despite the importance of data quality, the quality of open data is not always one of the main areas of analysis and evaluation of open data.
Open health(care) data have a large number of different data quality problems; however, data publishers (who provide data used in their IS) are probably not even aware of them. The most frequently occurring are:
✘ contextual data quality issues;
✘ empty values even for primary data;
✘ multiple notations for the same object within one data object and even one parameter;
✘ issues in interrelated parameters.
22. RESULTS II
Such an analysis and the use of a data object-driven approach to data quality evaluation can be applied not only to open health(care) data but also to other structured and semi-structured data: this solution is effective in many domains.
The advantages of the approach used:
• it can be applied to “third-party” data sets without any information on how the data were accrued and processed, as it is an external mechanism with a higher level of abstraction,
• it can be used even by users without IT and DQ knowledge.
The use of open data brings significant benefits to data providers: because of the large number of possible use-cases, data users address various challenges that can rarely be solved by data providers alone.
This can improve data quality not only at the national level, but also at the international level.
23. THANK YOU!
For more information, see ResearchGate
See also anastasijanikiforova.com
For questions or any other queries, contact me via email - Anastasija.Nikiforova@lu.lv
Article: Nikiforova, A. (2019). Analysis of open health data quality using data object-driven approach to data
quality evaluation: insights from a Latvian context. In IADIS International Conference e-Health (pp. 119-126).