1) Traditional approaches to ensuring data quality such as quality assurance and curation face challenges from big data's volume, velocity, and variety characteristics.
2) It is difficult to determine general thresholds for when data quality issues can be ignored as the importance varies between different analytics algorithms.
3) The ReComp decision support system aims to use metadata about past analytics tasks to determine when knowledge needs to be refreshed due to changes in big data or models.
Big Data Quality Panel: Diachron Workshop @EDBT
1. P. Missier, 2016
Diachron workshop panel
Panta Rhei (Heraclitus, through Plato)
Paolo Missier
Newcastle University, UK
Bordeaux, March 2016
(*) Painting by Johannes Moreelse
2.
The “curse” of Data and Information Quality
• Quality requirements are often specific to the application that makes use of the data (“fitness for purpose”)
• Quality Assurance (the actions required to meet those requirements) is specific to the data types
A few generic quality techniques exist (record linkage, blocking, …), but solutions are mostly ad hoc
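One of the generic techniques mentioned, blocking, can be sketched as follows: group records by a cheap key so that expensive pairwise linkage only runs within blocks. The blocking key used here (the first three letters of a normalised surname) and the record schema are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "Missier"},
    {"id": 2, "surname": "Missier "},   # near-duplicate with trailing whitespace
    {"id": 3, "surname": "Stefanidis"},
]

# Block by a cheap key: first three letters of the normalised surname
blocks = defaultdict(list)
for r in records:
    blocks[r["surname"].strip().lower()[:3]].append(r)

# Candidate pairs for linkage come only from within a block,
# avoiding the full n*(n-1)/2 comparison space
candidates = [(a["id"], b["id"])
              for block in blocks.values()
              for a, b in combinations(block, 2)]
print(candidates)  # → [(1, 2)]
```

Real entity-resolution pipelines refine this with multiple or learned keys; the point is only that blocking reduces the comparison space before linkage runs.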
3.
V for “Veracity”?
Q3. To what extent are traditional approaches to diagnosis, prevention,
and curation challenged by the Volume, Variety, and Velocity
characteristics of Big Data?
V             | Issues                                                   | Example
High Volume   | Scalability: what kinds of QC steps can be parallelised? | Parallel meta-blocking
              | Human curation not feasible                              |
High Velocity | Statistics-based diagnosis is data-type specific         | Reliability of sensor readings
              | Human curation not feasible                              |
High Variety  | Heterogeneity is not a new issue!                        | Data fusion for decision making
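The scalability question in the table can be made concrete: record-local QC checks parallelise trivially in a map/reduce style, whereas cross-record checks such as entity resolution need smarter schemes (e.g. the parallel meta-blocking work cited below). A minimal sketch, with an assumed sensor-record schema and plausibility range:

```python
from multiprocessing import Pool

def profile_partition(records):
    """Map step: count missing and implausible values in one partition."""
    missing = sum(1 for r in records if r.get("temp_c") is None)
    implausible = sum(1 for r in records
                      if r.get("temp_c") is not None
                      and not -90 <= r["temp_c"] <= 60)
    return {"n": len(records), "missing": missing, "implausible": implausible}

def merge(profiles):
    """Reduce step: aggregate the per-partition quality profiles."""
    return {k: sum(p[k] for p in profiles) for k in ("n", "missing", "implausible")}

if __name__ == "__main__":
    partitions = [
        [{"temp_c": 21.5}, {"temp_c": None}],
        [{"temp_c": 150.0}, {"temp_c": 19.0}],
    ]
    with Pool(2) as pool:
        print(merge(pool.map(profile_partition, partitions)))
        # → {'n': 4, 'missing': 1, 'implausible': 1}
```

Because each partition is profiled independently, this kind of QC step scales out directly; human curation of individual records does not.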
Recent contributions on Quality & Big Data (IEEE Big Data 2015):
• Chung-Yi Li et al., "Recommending missing sensor values"
• Yang Wang and Kwan-Liu Ma, "Revealing the fog-of-war: A visualization-directed, uncertainty-aware approach for exploring high-dimensional data"
• S. Bonner et al., "Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain"
• V. Efthymiou, K. Stefanidis and V. Christophides, "Big data entity resolution: From highly to somehow similar entity descriptions in the Web"
• V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis and T. Palpanas, "Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data"
4.
Can we ignore quality issues?
Q4: How difficult is it to evaluate the threshold below which data
quality issues can be ignored?
• Some analytics algorithms may be tolerant to {outliers, missing
values, implausible values} in the input
• But this “meta-knowledge” is specific to each algorithm; it is hard
to derive general models
• i.e., how important or dangerous false positives / false negatives
are depends on the task
A possible incremental learning approach:
Build a database of past analytics tasks:
H = {<In, P, Out>}
Then try to learn (In, Out) correlations over the growing collection H
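The incremental approach above might be sketched as follows: each run of an analytics task appends a record profiling the input quality, naming the process, and scoring the outcome, so that (In, Out) correlations can be learned as H grows. The schema, the process name, the quality feature, and the synthetic runs are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from statistics import mean

def _pearson(xs, ys):
    """Plain Pearson correlation (no external dependencies)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

@dataclass
class History:
    """The growing collection H = {<In, P, Out>}."""
    records: list = field(default_factory=list)

    def add(self, in_profile: dict, process: str, out_quality: float):
        self.records.append((in_profile, process, out_quality))

    def sensitivity(self, process: str, feature: str) -> float:
        """Correlate one input-quality feature with output quality for a process."""
        runs = [(r[0][feature], r[2]) for r in self.records if r[1] == process]
        xs, ys = zip(*runs)
        return _pearson(xs, ys)

H = History()
# Synthetic runs: a higher missing-value rate degrades this algorithm's output
for miss, acc in [(0.0, 0.95), (0.1, 0.90), (0.2, 0.80), (0.4, 0.60)]:
    H.add({"missing_rate": miss}, "clustering_v1", acc)

# A strongly negative value suggests this algorithm is NOT tolerant
# to missing values, i.e. the quality issue cannot be ignored for it
print(round(H.sensitivity("clustering_v1", "missing_rate"), 2))
```

A near-zero correlation for some other algorithm would indicate tolerance, which is exactly the per-algorithm threshold knowledge the slide says is hard to state in general.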
7.
The ReComp decision support system
• Observe change: in big data; in meta-knowledge
• Assess and measure: knowledge decay
• Estimate: costs and benefits of a refresh
• Enact: reproduce the (analytics) processes
Currency of data and of meta-knowledge:
- What knowledge should be refreshed?
- When, and how?
- At what cost / benefit?
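The observe/assess/estimate/enact loop might be sketched as a decision function. Everything here is a hypothetical placeholder: the toy decay model, the threshold, and the cost/benefit fields are assumptions for illustration, not the actual ReComp implementation.

```python
DECAY_THRESHOLD = 0.5  # assumed: consider a refresh only above this decay level

def assess_decay(change, meta_k):
    """Toy decay model: fraction of tracked dependencies touched by the change."""
    deps = meta_k["dependencies"]
    touched = [d for d in deps if d in change["affected"]]
    return len(touched) / len(deps)

def estimate_refresh(meta_k):
    """Toy cost/benefit: cost from past execution logs, benefit from outcome value."""
    return meta_k["avg_runtime_h"], meta_k["outcome_value"]

def recomp_decision(change, meta_k):
    """Return True if the process is worth re-computing after this change."""
    decay = assess_decay(change, meta_k)
    cost, benefit = estimate_refresh(meta_k)
    return decay > DECAY_THRESHOLD and benefit > cost

meta_k = {"dependencies": ["ref_db_v2", "model_v1", "pipeline.cwl"],
          "avg_runtime_h": 4.0, "outcome_value": 10.0}
change = {"affected": ["ref_db_v2", "model_v1"]}
print(recomp_decision(change, meta_k))  # a change to 2 of 3 dependencies triggers refresh
```

The point of the sketch is the shape of the loop: decay and cost/benefit are both estimated from collected metadata rather than by blindly re-running everything.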
9.
Metadata + Analytics
The knowledge is in the metadata!
Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items, which describe details of past computations.
[Diagram: Change Events and Meta-K (logs, provenance, dependencies) feed a loop that identifies re-computation candidates, estimates change impact (via a Change Impact Model) and reproducibility cost/effort (via a Cost Model), and enacts large-scale re-computation; model updates flow back into both models.]
Editor's notes
The times they are a’changin
Problem: this is “blind” and expensive. Can we do better?
These items are partly collected automatically and partly added as manual annotations. They include:
- Logs of past executions, automatically collected, to be used for post hoc performance analysis and estimation of future resource requirements and thus costs (S1);
- Runtime provenance traces and prospective provenance. The former are automatically collected graphs of data dependencies, captured from the computation [11]. The latter are formal descriptions of the analytics process, obtained from the workflow specification or, more generally, by manually annotating a script. Both are instrumental to understanding how the knowledge outcomes have changed and why (S5), as well as to estimating future re-computation effects;
- External data and system dependencies, process and data versions, and system requirements associated with the analytics process, which are used to understand whether it will be practically possible to re-compute the process.
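One record in such a metadata collection might combine the three kinds of items described above. The schema, field names, and values below are assumptions for illustration, not the actual Meta-K data model.

```python
# Hypothetical sketch of one Meta-K record: execution log, provenance,
# and dependency information for a single past analytics run.
meta_k_record = {
    "execution_log": {               # collected automatically, supports cost estimation (S1)
        "task": "variant_calling_v3",
        "started": "2016-02-01T09:00:00Z",
        "runtime_h": 6.5,
        "peak_mem_gb": 32,
    },
    "provenance": {                  # explains how and why outcomes changed (S5)
        "retrospective": [("out.vcf", "derivedFrom", "sample.bam")],
        "prospective": "workflow.cwl, manually annotated",
    },
    "dependencies": {                # determine whether re-computation is feasible
        "reference_data": {"ClinVar": "2016-01"},
        "software": {"bwa": "0.7.12"},
    },
}

# e.g. a new ClinVar release would make this run a re-computation candidate
print("ClinVar" in meta_k_record["dependencies"]["reference_data"])
```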