1) Traditional approaches to ensuring data quality such as quality assurance and curation face challenges from big data's volume, velocity, and variety characteristics.
2) It is difficult to determine general thresholds for when data quality issues can be ignored as the importance varies between different analytics algorithms.
3) The ReComp decision support system aims to use metadata about past analytics tasks to determine when knowledge needs to be refreshed due to changes in big data or models.
Big Data Quality Panel: Diachron Workshop @EDBT
1. P. Missier, 2016
Diachron workshop panel
Panta Rhei (Heraclitus, through Plato)
Paolo Missier
Newcastle University, UK
Bordeaux, March 2016
(*) Painting by Johannes Moreelse
2.
The “curse” of Data and Information Quality
• Quality requirements are often specific to the application that makes use of the data (“fitness for purpose”)
• Quality Assurance (the actions required to meet those requirements) is specific to the data types
A few generic quality techniques exist (record linkage, blocking, …), but solutions are mostly ad hoc
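One of the generic techniques mentioned, blocking, can be sketched as follows: group records by a cheap key so that expensive pairwise linkage only runs within blocks. The blocking key used here (the first three letters of a normalised surname) and the record schema are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "surname": "Missier"},
    {"id": 2, "surname": "Missier "},   # near-duplicate with trailing whitespace
    {"id": 3, "surname": "Stefanidis"},
]

# Block by a cheap key: first three letters of the normalised surname
blocks = defaultdict(list)
for r in records:
    blocks[r["surname"].strip().lower()[:3]].append(r)

# Candidate pairs for linkage come only from within a block,
# avoiding the full n*(n-1)/2 comparison space
candidates = [(a["id"], b["id"])
              for block in blocks.values()
              for a, b in combinations(block, 2)]
print(candidates)  # → [(1, 2)]
```

Real entity-resolution pipelines refine this with multiple or learned keys; the point is only that blocking reduces the comparison space before linkage runs.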
3.
V for “Veracity”?
Q3. To what extent are traditional approaches to diagnosis, prevention,
and curation challenged by the Volume, Variety, and Velocity
characteristics of Big Data?
V             | Issues                                                   | Example
High Volume   | Scalability: what kinds of QC steps can be parallelised? | Parallel meta-blocking
              | Human curation not feasible                              |
High Velocity | Statistics-based diagnosis is data-type specific         | Reliability of sensor readings
              | Human curation not feasible                              |
High Variety  | Heterogeneity is not a new issue!                        | Data fusion for decision making
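The scalability question in the table can be made concrete: record-local QC checks parallelise trivially in a map/reduce style, whereas cross-record checks such as entity resolution need smarter schemes (e.g. the parallel meta-blocking work cited below). A minimal sketch, with an assumed sensor-record schema and plausibility range:

```python
from multiprocessing import Pool

def profile_partition(records):
    """Map step: count missing and implausible values in one partition."""
    missing = sum(1 for r in records if r.get("temp_c") is None)
    implausible = sum(1 for r in records
                      if r.get("temp_c") is not None
                      and not -90 <= r["temp_c"] <= 60)
    return {"n": len(records), "missing": missing, "implausible": implausible}

def merge(profiles):
    """Reduce step: aggregate the per-partition quality profiles."""
    return {k: sum(p[k] for p in profiles) for k in ("n", "missing", "implausible")}

if __name__ == "__main__":
    partitions = [
        [{"temp_c": 21.5}, {"temp_c": None}],
        [{"temp_c": 150.0}, {"temp_c": 19.0}],
    ]
    with Pool(2) as pool:
        print(merge(pool.map(profile_partition, partitions)))
        # → {'n': 4, 'missing': 1, 'implausible': 1}
```

Because each partition is profiled independently, this kind of QC step scales out directly; human curation of individual records does not.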
Recent contributions on Quality & Big Data (IEEE Big Data 2015):
• Chung-Yi Li et al., "Recommending missing sensor values"
• Yang Wang and Kwan-Liu Ma, "Revealing the fog-of-war: A visualization-directed, uncertainty-aware approach for exploring high-dimensional data"
• S. Bonner et al., "Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain"
• V. Efthymiou, K. Stefanidis and V. Christophides, "Big data entity resolution: From highly to somehow similar entity descriptions in the Web"
• V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis and T. Palpanas, "Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data"
4.
Can we ignore quality issues?
Q4: How difficult is it to evaluate the threshold below which data
quality issues can be ignored?
• Some analytics algorithms may be tolerant to {outliers, missing
values, implausible values} in the input
• But this “meta-knowledge” is specific to each algorithm; it is hard
to derive general models
• i.e., how important or dangerous false positives / false negatives
are depends on the task
A possible incremental learning approach:
Build a database of past analytics tasks:
H = {<In, P, Out>}
Then try to learn (In, Out) correlations over the growing collection H
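The incremental approach above might be sketched as follows: each run of an analytics task appends a record profiling the input quality, naming the process, and scoring the outcome, so that (In, Out) correlations can be learned as H grows. The schema, the process name, the quality feature, and the synthetic runs are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from statistics import mean

def _pearson(xs, ys):
    """Plain Pearson correlation (no external dependencies)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

@dataclass
class History:
    """The growing collection H = {<In, P, Out>}."""
    records: list = field(default_factory=list)

    def add(self, in_profile: dict, process: str, out_quality: float):
        self.records.append((in_profile, process, out_quality))

    def sensitivity(self, process: str, feature: str) -> float:
        """Correlate one input-quality feature with output quality for a process."""
        runs = [(r[0][feature], r[2]) for r in self.records if r[1] == process]
        xs, ys = zip(*runs)
        return _pearson(xs, ys)

H = History()
# Synthetic runs: a higher missing-value rate degrades this algorithm's output
for miss, acc in [(0.0, 0.95), (0.1, 0.90), (0.2, 0.80), (0.4, 0.60)]:
    H.add({"missing_rate": miss}, "clustering_v1", acc)

# A strongly negative value suggests this algorithm is NOT tolerant
# to missing values, i.e. the quality issue cannot be ignored for it
print(round(H.sensitivity("clustering_v1", "missing_rate"), 2))
```

A near-zero correlation for some other algorithm would indicate tolerance, which is exactly the per-algorithm threshold knowledge the slide says is hard to state in general.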
7.
The ReComp decision support system
• Observe change: in big data; in meta-knowledge
• Assess and measure: knowledge decay
• Estimate: costs and benefits of a refresh
• Enact: reproduce the (analytics) processes
Currency of data and of meta-knowledge:
- What knowledge should be refreshed?
- When, and how?
- At what cost / benefit?
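The observe/assess/estimate/enact loop might be sketched as a decision function. Everything here is a hypothetical placeholder: the toy decay model, the threshold, and the cost/benefit fields are assumptions for illustration, not the actual ReComp implementation.

```python
DECAY_THRESHOLD = 0.5  # assumed: consider a refresh only above this decay level

def assess_decay(change, meta_k):
    """Toy decay model: fraction of tracked dependencies touched by the change."""
    deps = meta_k["dependencies"]
    touched = [d for d in deps if d in change["affected"]]
    return len(touched) / len(deps)

def estimate_refresh(meta_k):
    """Toy cost/benefit: cost from past execution logs, benefit from outcome value."""
    return meta_k["avg_runtime_h"], meta_k["outcome_value"]

def recomp_decision(change, meta_k):
    """Return True if the process is worth re-computing after this change."""
    decay = assess_decay(change, meta_k)
    cost, benefit = estimate_refresh(meta_k)
    return decay > DECAY_THRESHOLD and benefit > cost

meta_k = {"dependencies": ["ref_db_v2", "model_v1", "pipeline.cwl"],
          "avg_runtime_h": 4.0, "outcome_value": 10.0}
change = {"affected": ["ref_db_v2", "model_v1"]}
print(recomp_decision(change, meta_k))  # a change to 2 of 3 dependencies triggers refresh
```

The point of the sketch is the shape of the loop: decay and cost/benefit are both estimated from collected metadata rather than by blindly re-running everything.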
9.
Metadata + Analytics
The knowledge is in the metadata!
Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items, which describe details of past computations.
[Diagram: Change Events and Meta-K (logs, provenance, dependencies) feed a loop that identifies re-computation candidates, estimates change impact (via a Change Impact Model) and reproducibility cost/effort (via a Cost Model), and enacts large-scale re-computation; model updates flow back into both models.]
Editor's notes
The times they are a’changin
Problem: this is “blind” and expensive. Can we do better?
These items are partly collected automatically and partly added as manual annotations. They include:
- Logs of past executions, automatically collected, to be used for post hoc performance analysis and estimation of future resource requirements and thus costs (S1);
- Runtime provenance traces and prospective provenance. The former are automatically collected graphs of data dependencies, captured from the computation [11]. The latter are formal descriptions of the analytics process, obtained from the workflow specification or, more generally, by manually annotating a script. Both are instrumental to understanding how the knowledge outcomes have changed and why (S5), as well as to estimating future re-computation effects;
- External data and system dependencies, process and data versions, and system requirements associated with the analytics process, which are used to understand whether it will be practically possible to re-compute the process.
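One record in such a metadata collection might combine the three kinds of items described above. The schema, field names, and values below are assumptions for illustration, not the actual Meta-K data model.

```python
# Hypothetical sketch of one Meta-K record: execution log, provenance,
# and dependency information for a single past analytics run.
meta_k_record = {
    "execution_log": {               # collected automatically, supports cost estimation (S1)
        "task": "variant_calling_v3",
        "started": "2016-02-01T09:00:00Z",
        "runtime_h": 6.5,
        "peak_mem_gb": 32,
    },
    "provenance": {                  # explains how and why outcomes changed (S5)
        "retrospective": [("out.vcf", "derivedFrom", "sample.bam")],
        "prospective": "workflow.cwl, manually annotated",
    },
    "dependencies": {                # determine whether re-computation is feasible
        "reference_data": {"ClinVar": "2016-01"},
        "software": {"bwa": "0.7.12"},
    },
}

# e.g. a new ClinVar release would make this run a re-computation candidate
print("ClinVar" in meta_k_record["dependencies"]["reference_data"])
```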