TDWG 2013 talk on data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions.
Authors : Christian Gendreau, David P. Shorthouse, Peter Desmet
Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions
1. Data quality challenges in the Canadensys network of occurrence records: examples, tools, and solutions
Christian Gendreau, David Shorthouse & Peter Desmet
2. Game plan
• Introduction to Canadensys
• Data quality @ Canadensys
• Canadensys processing solutions
• Numbers from Canadensys
• Hopes and expectations
9. During data entry
• Help to avoid typographical errors
• Help to convert verbatim data
Actor: data entry person
10. Before publication
• Detect file character encoding issues
• Detect duplicate or missing IDs
Actor: data publisher
Previous Activity: Data entry
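The two pre-publication checks on slide 10 could be sketched as follows. This is a minimal illustration, not the Canadensys tooling: the function names `check_ids` and `is_valid_utf8` are hypothetical, and UTF-8 is assumed as the expected file encoding.

```python
from collections import Counter


def check_ids(ids):
    """Return (row indexes with a missing ID, sorted list of duplicated IDs)."""
    # An ID counts as missing when it is None or blank after trimming.
    missing = [i for i, v in enumerate(ids) if v is None or not str(v).strip()]
    counts = Counter(str(v).strip() for v in ids if v is not None and str(v).strip())
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return missing, duplicates


def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8 (the assumed target encoding)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A file saved as Latin-1 that contains "Montréal" fails the UTF-8 check, which is the kind of issue a publisher would want to catch before the data reaches an aggregator.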
11. During aggregation
• Process data: validation, cleaning
• Produce structured reports: quality control
Actor: data aggregator
Previous Activity: Before publication
12. After aggregation
• Allow and facilitate community feedback
• Help data publishers integrate corrections
Actor: users and community
Previous Activity: Aggregation
14. Why do we process data?
• Enrich our Explorer, http://data.canadensys.net
• Provide structured reports to data providers
• Help identify records that need re-examination
• Help to improve data entry procedures
17. The narwhal-processor approach
● Single field processing to allow complex processing (combined fields)
● Processors with common interface ease integration and usage
● Collaboration
https://github.com/Canadensys/narwhal-processor
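The idea of single-field processors sharing one interface could look roughly like the sketch below. It is written in Python for illustration only and does not reproduce the narwhal-processor API (a Java library); the names `Processor`, `Result`, `CountryProcessor`, `LatitudeProcessor`, and `process_record` are all hypothetical, as is the tiny country lookup table.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    raw: str                        # value as provided
    value: Optional[object] = None  # processed value, if any
    error: Optional[str] = None     # reason processing failed, if any


class Processor(ABC):
    """Common interface: every processor handles exactly one field."""

    @abstractmethod
    def process(self, raw: str) -> Result: ...


class CountryProcessor(Processor):
    LOOKUP = {"usa": "United States", "canada": "Canada"}  # illustrative only

    def process(self, raw: str) -> Result:
        value = self.LOOKUP.get(raw.strip().lower())
        if value is None:
            return Result(raw, error="unrecognized country")
        return Result(raw, value=value)


class LatitudeProcessor(Processor):
    def process(self, raw: str) -> Result:
        try:
            lat = float(raw)
        except ValueError:
            return Result(raw, error="not a number")
        if not -90.0 <= lat <= 90.0:
            return Result(raw, error="out of range")
        return Result(raw, value=lat)


def process_record(record, processors):
    """Apply each field's processor; fields without a processor are skipped."""
    return {f: p.process(record[f]) for f, p in processors.items() if f in record}
```

Because every processor returns the same `Result` shape, a caller can combine several single-field results into a multi-field check without knowing how each field was parsed.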
18. Data usability before processing
[Bar chart: % of non-null clean verbatim data for country text, state/province text, coordinates, and dates. Bar labels: 96%, 92%, 60%, 44%]
19. Data usability after processing
• 7% of provided country text
USA → ISO 3166-2:US, United States
20. Data usability after processing
• 7% of provided country text
• 16% of provided state/province text
Qué → ISO 3166-2 CA-QC, Quebec
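One way to resolve a verbatim value such as "Qué" is to reduce it to an accent-free, punctuation-free key before looking it up. The sketch below assumes a small hand-written table; `STATE_LOOKUP`, `normalize_key`, and `lookup_state` are hypothetical names, and the slides do not say how the actual lookup is implemented.

```python
import unicodedata

# Illustrative table only: normalized key -> (ISO 3166-2 code, name)
STATE_LOOKUP = {
    "que": ("CA-QC", "Quebec"),
    "qc": ("CA-QC", "Quebec"),
    "quebec": ("CA-QC", "Quebec"),
}


def normalize_key(text: str) -> str:
    """Lowercase, strip accents, and drop anything that is not a letter or digit."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in no_marks.lower() if ch.isalnum())


def lookup_state(text: str):
    """Return (code, name) for a verbatim state/province string, or None."""
    return STATE_LOOKUP.get(normalize_key(text))
```

With this normalization, "Qué", "Qc." and "Québec" all resolve to the same entry, while unknown strings return None and can be reported rather than silently changed.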
21. Data usability after processing
• 7% of provided country text
• 16% of provided state/province text
• 4% of provided coordinates
45° 32' 25" N, 129° 40' 31" W → 45.5402778, -129.6752778
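The coordinate example follows the usual degrees-minutes-seconds arithmetic: decimal = degrees + minutes/60 + seconds/3600, negated for the S and W hemispheres. A minimal sketch, assuming input shaped like the slide's example (the names `DMS`, `dms_to_decimal`, and `parse_pair` are hypothetical, and other verbatim formats are not handled):

```python
import re

# Matches e.g. 45° 32' 25" N (straight or typographic prime marks).
DMS = re.compile(r'''(\d+)\s*°\s*(\d+)\s*['′]\s*(\d+(?:\.\d+)?)\s*["″]\s*([NSEW])''')


def dms_to_decimal(text: str):
    """Convert one degrees-minutes-seconds value to decimal degrees, or None."""
    m = DMS.fullmatch(text.strip())
    if m is None:
        return None
    degrees, minutes, seconds, hemisphere = m.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    if hemisphere in "SW":  # south and west are negative
        value = -value
    return round(value, 7)


def parse_pair(text: str):
    """Split 'lat, lon' on the comma and convert both halves."""
    lat_text, lon_text = text.split(",")
    return dms_to_decimal(lat_text), dms_to_decimal(lon_text)
```

For the slide's input, 45 + 32/60 + 25/3600 rounds to 45.5402778 and 129 + 40/60 + 31/3600 rounds to 129.6752778, negated because the longitude is west.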
22. Data usability after processing
• 7% of provided country text
• 16% of provided state/province text
• 4% of provided coordinates
• 42% of provided dates
2008 VI 13 → 2008-06-13
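The date example uses a Roman numeral for the month. A minimal sketch of that one conversion, assuming the order is always year, Roman month, day (the function name `parse_roman_month_date` is hypothetical, and the slides do not list which other date formats are handled):

```python
import re
from datetime import date

ROMAN_MONTHS = {
    "I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
    "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12,
}


def parse_roman_month_date(text: str):
    """Turn 'YYYY <roman month> D' into an ISO 8601 date string, or None."""
    m = re.fullmatch(r"(\d{4})\s+([IVX]+)\s+(\d{1,2})", text.strip().upper())
    if m is None:
        return None
    year, roman, day = m.groups()
    month = ROMAN_MONTHS.get(roman)
    if month is None:
        return None
    try:
        # date() rejects impossible days such as 30 February.
        return date(int(year), month, int(day)).isoformat()
    except ValueError:
        return None
```

Returning None for anything that does not parse keeps the verbatim value untouched, so an unparseable date can be flagged in a report instead of being guessed.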
23. Data usability including processed data
[Stacked bar chart: % of non-null provided data for country text, state/province text, coordinates, and dates. Clean verbatim labels: 96%, 92%, 60%, 44%; processed labels: 4%, 7%, 16%, 42%]
24. Projects With Data Quality Tools
• Atlas of Living Australia
• GBIF Norway, GBIF Spain, National Biodiversity Network, BioVeL …
• GBIF libraries
• Most nodes have their own data quality routine
26. We do not want to
• Maintain taxonomic authority files
• Maintain country, province and city lists
27. We prefer to
• Efficiently use specialized resources/services
• Provide reports and quality indices
28. Help from Semantic Web
• Data in other languages (French, Spanish, …) should not be flagged as errors
• Misspellings should be shared as a common resource (e.g. SKOS)
• Understand historical data (e.g. collected in USSR in 1980)
29. Reporting and log
• DarwinCore annotations for processed data
• Shared vocabulary for structured reports and quality indices
30. Summary
• Tools available for sharing
• Use, review, contribute
• Opportunity for broad coordination and increased efficiencies
31. Thanks
Anne Bruneau, Institut de recherche en biologie végétale and
Département de Sciences Biologiques, Université de Montréal
33. Multi-field processing
DwC Field           Raw data        Processed data
verbatimLatitude    45°30′N         45.5
verbatimLongitude   73°34′W         -73.5666667
country             Canada          Canada
stateProvince       QC              Quebec
municipality        Montreal City   Montreal
34. Multi-field processing
  1. Get information on coordinates 45.5,-73.5666667
  2. Compare with processed data
  3. Assert that these coordinates are in Montréal
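The three steps above can be sketched as a consistency check between the processed coordinates and the processed place names. The slides do not say how information about the coordinates is obtained, so this sketch swaps in a local bounding-box table; the box values for Montreal are rough, illustrative assumptions, and `BoundingBox`, `PLACES`, and `coordinates_match_place` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Rough illustrative extent only; a real check would query a gazetteer or service.
PLACES = {
    ("Canada", "Quebec", "Montreal"): BoundingBox(45.40, 45.71, -73.98, -73.47),
}


def coordinates_match_place(lat, lon, country, state_province, municipality):
    """True/False if the place is known, None if there is nothing to compare with."""
    box = PLACES.get((country, state_province, municipality))
    if box is None:
        return None
    return box.contains(lat, lon)
```

The check only works on processed values: the raw "QC" and "Montreal City" would not match the table keys, which is why single-field processing has to run before any multi-field assertion.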