Data Quality for Machine Learning Tasks
Nitin Gupta, Shashank Mujumdar, Satoshi Masuda, Hima Patel
IBM Research India and IBM Research Japan
Need for Data Quality for Machine Learning
Let us start with a story..
Data Scientist
Yay!! I am so
excited!!
Data Preparation is a time-consuming activity in the data science lifecycle
“Data collection and preparation are typically
the most time-consuming activities in developing
an AI-based application, much more so than
selecting and tuning a model.” – MIT Sloan Survey
https://sloanreview.mit.edu/projects/reshaping-business-with-artificial-intelligence/
“Data preparation accounts for about 80% of the work of data scientists” – Forbes
https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#70d9599b6f63
Challenges with Data Preparation
Data issues are not known at the start of the project; they are discovered via iterative data debugging, which is cumbersome and time-consuming.
Data Quality Analysis can help..
 Know the issues in the data beforehand, e.g., noise in labels, overlap between classes
 Objective measurement of how good or bad the data is, and how each operation affects the data
 Make informed choices for data preprocessing and model selection
 Reduce turnaround time for data science projects
Data Quality 2.0
Why is automated data quality analysis important?
 A lot of progress in the last several years on improving ML algorithms and building automated machine learning toolkits (AutoML)
 Several commercially ready pipelines are available
 AutoAI with IBM Watson Studio
 CloudAutoML from Google
 …
 Open source pipelines
 Autosklearn
 Autokeras
 …
 However, quality of model is upper bounded by quality of input data
GAP: No systematic efforts to measure the quality of data for machine learning (ML)
“Garbage in, garbage out” – George Fuechsel, IBM 305 RAMAC technician
How is it different from traditional data quality?
• Data quality is a well established area in database
community and the capabilities to measure the quality of
data in databases and data lakes exist in several products
• There is a need to re-look at this approach through the lens of building machine learning models, as new metrics and dimensions need to be defined.
• Hence, the need for data quality 2.0
To put it all together
Data Assessment and Readiness Module
 Need for algorithms and tools that can assess training datasets
 Need for algorithms and tools that can remediate datasets
Bring in automation, standardization, and more democratization of the data science process.
To summarize:
“Garbage in, garbage out” – George Fuechsel, IBM 305 RAMAC technician
A lot of progress has been made in the last several years on improving ML algorithms, including building automated machine learning toolkits (AutoML).
However, the quality of an ML model is directly proportional to the quality of the data.
Hence, there is a need for a systematic study of measuring the quality of data with respect to machine learning tasks.
Broad Research Challenges
How to systematically measure the quality of data for ML?
How to best remediate the data? Does the sequence of operations matter for data
remediation?
How to systematically capture all the data changes via automated documentation?
How to address different modalities of data?
Can we build multimodal solutions?
Data Quality Metrics: Desired Qualities
 Capabilities to assess different dimensions of the data that can affect model performance, e.g., bias in the data, class imbalance, etc. We will explore this in detail in the tutorial today.
 Standardization of the output from the different metrics, for a high-level view of the health of the dataset.
 Capability to explain to the user why the data is “bad” for a given dimension.
 Customized recommendations based on the severity of data issues, data size and other data attributes.
Challenge of different modalities
Structured Datasets
 Tabular
 Timeseries
 Spatio-temporal
Unstructured Datasets
 Social Media Data: tweets, posts, documents, chat messages, etc.
 IT Operations Data: tickets, logs, GitHub pull requests, alerts, JSON, XML, etc.
 More generic data forms: documents, web pages, etc.
Image Datasets
Speech Datasets
Multimodal Datasets
Example quality issues to detect across these modalities include outliers, noisy labels, and overlapping data points.
Getting started in this area: Existing tools and libraries
Open Source Libraries:
Deequ: https://github.com/awslabs/deequ
Tensorflow Data Validation: https://www.tensorflow.org/tfx/guide/tfdv
Pandas Profiler: https://github.com/pandas-profiling/pandas-profiling
Beta/Trial Versions:
Data Quality For AI: https://developer.ibm.com/apis/catalog/dataquality4ai--data-quality-for-ai/Introduction/
Know Your Data: https://knowyourdata.withgoogle.com/
In this tutorial:
Part 1: Overview of data quality for ML (covered)
Part 2: Techniques for data quality measurements for structured datasets
Part 3: Techniques for data quality measurements for spatio-temporal datasets
Part 4: Techniques for data quality measurements for unstructured datasets
Data Quality Metrics for Structured Data
Data Quality Metrics
We covered the following topics in the KDD 2020 tutorial:
Classification specific metrics:
 Data Cleaning taxonomy
 Class Imbalance
 Data Valuation
 Data Homogeneity
 Data Transformation
Data Quality Metrics
Today, we will cover the following topics:
Classification specific metrics
 Label Noise
 Class Overlap
 Outlier Detection
Regression specific metrics
 Metrics Overview
 Outlier
 Class Imbalance
Metric Sequencing
Label Noise
Given Label – Iris-setosa
Correct Label – Iris-virginica (based on attribute analysis)
 Most large datasets that are generated or annotated contain some noisy labels.
 In this metric we discuss how one can identify these label errors and correct them to better model the data.
There are at least 100,000 label issues in ImageNet!
Source: https://l7.curtisnorthcutt.com/confident-learning
Effects of Label Noise
Possible Sources of Label Noise:
 Insufficient information provided to the labeler
 Errors in the labelling itself
 Subjectivity of the labelling task
 Communication/encoding problems
Label noise can have several effects:
 Decrease in classification performance
 Pose a threat to tasks like feature selection
 In online settings, new labelled data may
contradict the original labelled data
Label Noise Techniques
‘
Label Noise
Algorithm Level
Approaches
Data Level Approaches
 Designing robust algorithms that are
insensitive to noise
 Not directly extensible to other learning
algorithms
 Requires to change an existing
method, which neither is always
possible nor easy to develop
 Filtering out noise before passing to
underlying ML task
 Independent of the classification
algorithm
 Helps in improving classification
accuracy and reduced model
complexity.
KDD Tutorial / © 2021 IBM Corporation
Label Noise Techniques
‘
Label Noise
Algorithm Level
Approaches
Data Level Approaches
 Learning with Noisy Labels (NIPS-2014)
 Robust Loss Functions under Label Noise
for Deep Neural Networks (AAAI-2017)
 Probabilistic End-To-End Noise Correction
for Learning With Noisy Labels (CVPR-2019)
 Can Gradient Clipping Mitigate Label Noise?
(ICLR-2020)
 Identifying mislabelled training data (Journal
of artificial intelligence research 1999)
 On the labeling correctness in computer
vision datasets (IAL 2018)
 Finding label noise examples in large scale
datasets (SMC 2017)
 Confident Learning: Estimating Uncertainty
in Dataset Label (Arxiv -2019)
Filter Based Approaches
 On the labeling correctness in
computer vision datasets
[ARK18]
 Published in Interactive
Adaptive Learning 2018
 Train CNN ensembles to
detect incorrect labels
 Voting schemes used:
 Majority Voting
 Identifying mislabelled training
data [BF99]
 Published in Journal of
artificial intelligence research,
1999
 Train filters on parts of the
training data in order to
identify mislabeled examples
in the remaining training data.
 Voting schemes used:
 Majority Voting
 Consensus Voting
 Finding label noise examples in large
scale datasets [EGH17]
 Published in International
Conference on Systems, Man, and
Cybernetics, 2017
 Two-level approach
 Level 1 – Train SVM on the
unfiltered data. All support
vectors can be potential
candidates for noisy points.
 Level 2 – Train classifier on the
original data without the support
vectors. Samples for which the
original and the predicted label
does not match, are marked as
label noise.
Source: Brodley and Friedl, 1999, Identifying Mislabeled Training Data
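A minimal sketch of the filter-based idea from [BF99], assuming scikit-learn: train several filter models via cross-validation and flag samples whose given label disagrees with the out-of-fold predictions, using either majority or consensus voting. The helper name and choice of filter models are illustrative.

```python
# Sketch: voting-based label-noise filter in the spirit of [BF99].
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def flag_mislabeled(X, y, scheme="majority"):
    """Flag samples whose given label disagrees with out-of-fold predictions."""
    filters = [DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB()]
    # Each filter predicts labels only for data it was not trained on (5-fold CV).
    preds = np.stack([cross_val_predict(f, X, y, cv=5) for f in filters])
    disagree = (preds != y).sum(axis=0)   # how many filters disagree per sample
    if scheme == "majority":              # more than half of the filters disagree
        return disagree > len(filters) / 2
    return disagree == len(filters)       # consensus: all filters disagree
```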
Confident Learning: Estimating Uncertainty in Dataset Labels – 2019
The Principles of Confident Learning
 Prune to search for label errors
 Count to train on clean data
 Rank which examples to use during training
Source: Northcutt et al., 2019. Confident Learning: Estimating Uncertainty in Dataset Labels
[NJC19]
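A small sketch of applying confident learning in practice, assuming the open-source cleanlab package (v2-style API) that implements [NJC19]: compute out-of-sample predicted probabilities, then ask cleanlab which given labels look wrong.

```python
# Sketch: find likely label errors with confident learning (cleanlab assumed installed).
from cleanlab.filter import find_label_issues
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def likely_label_errors(X, y):
    # Out-of-sample class probabilities, as required by confident learning.
    pred_probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
    )
    # Indices of samples whose given label is likely wrong, worst first.
    return find_label_issues(
        labels=y,
        pred_probs=pred_probs,
        return_indices_ranked_by="self_confidence",
    )
```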
Data Quality for AI: Label Purity
 The label purity algorithm is built on top of the confident learning-based approach (CleanLab).
 One limitation of the CleanLab approach is that it can tag some correct samples as noisy if they lie in the overlap region, thereby generating false positives.
 An overlap region is a region where samples of multiple classes start sharing features for reasons such as the presence of weak features, fine-grained classes, etc. This work improves on the label purity algorithm to address this problem in two ways:
 An algorithm to detect overlap regions, which helps in removing the false positives
 An effective neighborhood-based pruning mechanism to remove identified noisy candidate samples
[DQT21]
Data Quality for AI: Label Purity
Comparison of (a) Precision and (b) Recall of Label Purity Algorithm with CleanLab Algorithm on 35 Datasets
Takeaways:
 This algorithm outperforms CleanLab in terms of precision (5−15% improvement).
 For recall, both the algorithms have similar performance.
 On average over the 35 datasets, (a) the precision of this algorithm is .92 while CleanLab's is .76, and (b) the recall of this algorithm is .75 while CleanLab's is .78
 Overall, 5−16% improvement in precision at the cost of 2% drop in the recall
[DQT21]
Class Overlap : Data Quality for AI
Analyse the dataset to find samples that reside in
the overlapping region of the data space.
 Identify data points which are close to each other but belong to different classes
 Identify data points which lie close to, or on the other side of, the class boundary
Why is it useful for the ML pipeline?
 Overlapping regions are hard to detect and can cause ML classifiers to misclassify points in that region.
 If the amount of overlap is high, we need a better feature representation or a more complex model.
[DQT21]
Data Quality for AI - Class Overlap
 Precision and Recall of Class Overlap Algorithm on 20 datasets (after inducing 30% overlap points)
[DQT21]
Class Overlap
[XHJ10]
 The objective of Support Vector Data Description (SVDD) is to find a sphere or domain with minimum volume containing all or most of the data.
 Data points that fall inside both class spheres can be regarded as overlapping data, i.e., data that is close to or overlaps with the other class.
Outlier Detection
 According to Hawkins, "An outlier is an
observation that deviates so much from other
observations as to arouse suspicion that it
was generated by a different mechanism".
Why is it useful for ML pipeline?
 Machine learning algorithms are sensitive to
the range and distribution of attribute values.
 Data outliers can spoil and mislead the
training process resulting in longer training
times, less accurate models and ultimately
poorer results.
Outlier Detection
Source: https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561
Most common causes of outliers on a data set:
 Data entry errors (human errors)
 Measurement errors (instrument errors)
 Experimental errors (data extraction or experiment
planning/executing errors)
 Intentional (dummy outliers made to test detection methods)
 Data processing errors (data manipulation or data set unintended
mutations)
 Sampling errors (extracting or mixing data from wrong or various
sources)
 Natural (not an error, novelties in data)
Outlier Detection
From a precision point of view:
CBLOF (.466) > IFOREST (.416) > COPOD (.404) > HBOS (.376) > LODA (.337)
From a recall point of view:
CBLOF (.388) > LODA (.382) > HBOS (.322) > IFOREST (.316) > COPOD (.274)
[OT21]
Outlier Detection
From the RF classifier point of view:
LODA (8/14) > HBOS (7/14) > COPOD (5/14) > CBLOF (3/14) = IFOREST (3/14)
From the DT classifier point of view:
CBLOF (8/14) > LODA (7/14) = HBOS (7/14) > COPOD (5/14) = IFOREST (5/14)
From a runtime point of view:
HBOS (14 sec) < LODA (70 sec) < COPOD (312 sec) < CBLOF (2076 sec) < IFOREST (3209 sec)
[OT21]
Outlier Detection (HBOS vs LODA)
 Histogram-Based Outlier Score (HBOS) is a fast, unsupervised, statistical and non-parametric method.
 Assumption: all the features are independent.
 In the case of categorical data, simple counting is used, while for numerical values static or dynamic bins are made.
 The height of each bin represents the density estimation. To ensure an equal weight for each feature, the histograms are normalized to [0-1].
 In HBOS, an outlier score is calculated for each single feature of the dataset. These calculated values are inverted such that outliers have a high HBOS and inliers have a low score.
 Lightweight Online Detector of Anomalies (LODA) is particularly useful when huge data is processed in real time.
 It is not only fast and accurate but also able to operate on, and update itself with, data that has missing variables.
 It can identify the features in which a given sample deviates from the majority, which essentially reveals the cause of the anomaly.
 It constructs an ensemble of T one-dimensional histogram density estimators.
 LODA shows that a collection of weak detectors can result in a strong detector.
[OT21]
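A small sketch of running these two detectors in practice, assuming the open-source PyOD library used in [OT21]; the synthetic data and the contamination value are illustrative.

```python
# Sketch: scoring a dataset with HBOS and LODA via PyOD (assumed installed).
import numpy as np
from pyod.models.hbos import HBOS
from pyod.models.loda import LODA

X = np.random.randn(1000, 8)                # stand-in for a real feature matrix

for Detector in (HBOS, LODA):
    det = Detector(contamination=0.05)      # expected fraction of outliers
    det.fit(X)
    # labels_: 0 = inlier, 1 = outlier; decision_scores_: higher = more anomalous
    print(Detector.__name__, int(det.labels_.sum()), "flagged outliers")
```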
Regression Task – Overview of Metrics
[DQR18]
Data quality issues for the regression task:
 Missing values: when a variable or attribute does not contain a value. Missing values occur when the source of the data has a problem, e.g., sensor faults, faulty measurements, data transfer problems or incomplete surveys.
 Outliers: an observation, univariate or multivariate, is called an outlier when it deviates markedly from the other observations, in other words, when it appears to be inconsistent with the remainder of the observations.
 High dimensionality: when a dataset contains a large number of features. In this case, the regression model tends to overfit, decreasing its performance.
 Redundancy: duplicate instances in the dataset, which might detrimentally affect the performance of classifiers.
 Noise: irrelevant or meaningless data. Noisy data reduces the predictive ability of a regression model.
[DQR18]
Regression Task – Overview of Metrics
Data cleaning tasks for the regression task:
 Imputation: replaces missing data with substituted values. Four relevant approaches to imputing missing values:
 Deletion: excludes instances if any value is missing.
 Hot deck: missing items are replaced using values from the same dataset.
 Imputation based on the missing attribute: assigns a representative value to a missing one based on measures of central tendency (e.g., mean, median, mode, trimmed mean).
 Imputation based on non-missing attributes: missing attributes are treated as dependent variables, and a regression or classification model is used to impute the missing values.
 Outlier detection: identifies candidate outliers through approaches based on clustering (e.g., DBSCAN: density-based spatial clustering of applications with noise) or distance (e.g., LOF: Local Outlier Factor). A sketch of both steps follows below.
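A brief sketch of two of these cleaning steps, assuming scikit-learn: model-based imputation (each missing attribute regressed on the non-missing attributes) followed by LOF-based outlier flagging. The toy data and parameters are illustrative.

```python
# Sketch: impute missing values, then flag outlier candidates with LOF.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.1], [2.9, 6.0], [50.0, 1.0]])

# Imputation based on non-missing attributes: each feature with missing
# values is modeled as a function of the other features.
X_clean = IterativeImputer(random_state=0).fit_transform(X)

# Distance-based outlier detection (LOF): -1 marks outlier candidates.
labels = LocalOutlierFactor(n_neighbors=3).fit_predict(X_clean)
print(np.where(labels == -1)[0])   # indices of candidate outliers
```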
[DQR18]
Regression Task – Overview of Metrics
Data cleaning tasks for the regression task (continued):
 Dimensionality reduction: reduces the number of attributes by finding useful features to represent the dataset. A subset of features is selected for the learning process of the regression model. The best subset of relevant features is the one with the least number of dimensions that contributes most to learning accuracy. Dimensionality reduction can take four approaches (see the sketch below):
 Filter: selects features based on discriminating criteria that are relatively independent of the regression (e.g., correlation coefficients).
 Wrapper: features are kept or discarded in each iteration based on the performance of regression models (e.g., error measures).
 Embedded: features are selected while the regression model is trained. Embedded methods try to reduce the computation time of wrapper methods.
 Projection: looks for a projection of the original space onto a space with orthogonal dimensions (e.g., principal component analysis).
 Remove duplicate instances: identifies and removes duplicate instances.
[DQR18]
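A compact sketch of the four dimensionality-reduction flavors listed above, assuming scikit-learn; the dataset and parameter choices are illustrative.

```python
# Sketch: filter, wrapper, embedded, and projection feature reduction.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

X, y = make_regression(n_samples=200, n_features=20, noise=0.1, random_state=0)

X_filter = SelectKBest(f_regression, k=5).fit_transform(X, y)    # filter
X_wrap = RFE(LinearRegression(),
             n_features_to_select=5).fit_transform(X, y)         # wrapper
lasso = Lasso(alpha=0.5).fit(X, y)          # embedded: zeroed-out coefficients
X_embed = X[:, lasso.coef_ != 0]            # keep features Lasso retained
X_proj = PCA(n_components=5).fit_transform(X)  # projection (orthogonal dims)
```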
Outliers - Regression Task
1. There is one outlier far from the other points, though it only appears to slightly influence the line.
2. There is one outlier on the right, though it is quite close to the least squares line, which suggests it wasn't very influential.
3. There is one point far away from the cloud, and this outlier appears to pull the least squares line up on the right; notice how the line around the primary cloud doesn't appear to fit very well.
4. There is a primary cloud and then a small secondary cloud of four outliers. The secondary cloud appears to be influencing the line somewhat strongly, making the least squares line fit poorly almost everywhere. There might be an interesting explanation for the dual clouds, which is something that could be investigated.
5. There is no obvious trend in the main cloud of points, and the outlier on the right appears to largely control the slope of the least squares line.
6. There is one outlier far from the cloud; however, it falls quite close to the least squares line and does not appear to be very influential.
Class Imbalance
Unequal distribution of classes within a dataset
Source: https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59
Class Imbalance
Accuracy Paradox: with a 90% majority class and a 10% minority class, a learning algorithm that always predicts the majority class achieves 90% accuracy. Reasonable?
Better-suited measures: F1 score, recall, AUC-PR, AUC-ROC, G-mean, …
Source: https://en.wikipedia.org/wiki/Accuracy_paradox
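A tiny illustration of the paradox, assuming scikit-learn: a majority-class predictor scores 90% accuracy yet zero recall and zero F1 on the minority class.

```python
# Sketch: the accuracy paradox on a 90/10 imbalanced dataset.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

y = np.array([0] * 900 + [1] * 100)        # 90% majority, 10% minority
X = np.zeros((1000, 1))                    # features are irrelevant here

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = clf.predict(X)
print(accuracy_score(y, pred))             # 0.90 -- looks great
print(recall_score(y, pred), f1_score(y, pred))  # 0.0, 0.0 for the minority
```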
Class Imbalance
Imbalance Ratio: as the imbalance ratio increases,
 the learning algorithm becomes increasingly biased towards the majority class,
 minority samples are more likely to be treated as noise, and
 in the confusion region, priority is given to the majority class.
Is the imbalance ratio the only cause of performance degradation in learning from imbalanced data? NO
Source: https://towardsdatascience.com/sampling-techniques-for-extremely-imbalanced-data-281cc01da0a8
Class Imbalance – for Regression
[SM13]
 Several real-world prediction problems involve forecasting rare values of a target variable.
 When this variable is nominal, we have a problem of class imbalance that was already studied
thoroughly within machine learning (classification task).
 For regression tasks, where the target variable is continuous, few works exist addressing this type of
problem. Still, important application areas involve forecasting rare extreme values of a continuous target
variable.
Problem Statement
 Predicting rare extreme values of a continuous variable is a particular class of regression problems.
 In this context, given a training sample of the problem, $D = \{\langle x_i, y_i \rangle\}_{i=1}^{N}$, our goal is to obtain a model that approximates the unknown regression function $y = f(x)$.
 The particularity of our target tasks is that the goal is predictive accuracy on a particular subset of the domain of the target variable $Y$: the rare and extreme values.
Class Imbalance – for Regression
[SM13]
 Smote is a sampling method to address classification problems with an imbalanced class distribution.
 The key feature of this method is that it combines under-sampling of the frequent classes with over-sampling of the minority class.
 A variant of Smote for addressing regression tasks where the key goal is to accurately predict rare extreme values is proposed, named SmoteR.
 There are three key components of the Smote algorithm that need to be addressed in order to adapt it to our target regression tasks:
 how to define which are the relevant observations and the “normal” cases
 how to create new synthetic examples (i.e., over-sampling)
 how to decide the target variable value of these new synthetic examples
Class Imbalance – for Regression
[SM13]
 How to define which are the relevant observations and the “normal” cases:
 The original algorithm is based on information provided by the user concerning which class value is the target/rare class (usually known as the minority or positive class).
 In a regression problem, an infinite number of values of the target variable are possible.
 Solution: a relevance function with a user-specified threshold on the relevance values, which leads to the definition of the set Dr. The algorithm over-samples the observations in Dr and under-samples the remaining cases (Di), leading to a new training set with a more balanced distribution of values.
Class Imbalance – for Regression
[SM13]
 How to create new synthetic examples (i.e., over-sampling):
 New cases are generated with the same approach as in the original SMOTE algorithm, with some small modifications to handle both numeric and nominal attributes.
Class Imbalance – for Regression
[SM13]
 How to decide the target variable value of these new synthetic examples:
 In the original algorithm this is a trivial question: because all rare cases have the same class (the target minority class), the same holds for the examples generated from this set.
 In our case the answer is not so trivial. The cases that are to be over-sampled do not have the same target variable value, although they do have a high relevance score (φ(y)). This means that when a pair of examples is used to generate a new synthetic case, they will not have the same target variable value.
 The proposal is to use a weighted average of the target variable values of the two seed examples. The weights are calculated as an inverse function of the distance of the generated case to each of the two seed examples.
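A minimal sketch of this last step, assuming numeric attributes only: interpolate a synthetic case between two seeds and set its target as the inverse-distance weighted average, as described in [SM13]. The function name is illustrative.

```python
# Sketch: generate one SmoteR-style synthetic example from two seed cases.
import numpy as np

def smoter_synthetic(x1, y1, x2, y2, rng=np.random.default_rng()):
    """Interpolate features, then weight targets by inverse distance."""
    x_new = x1 + rng.uniform(0, 1) * (x2 - x1)   # SMOTE-style interpolation
    d1 = np.linalg.norm(x_new - x1)
    d2 = np.linalg.norm(x_new - x2)
    if d1 + d2 == 0:                             # degenerate: identical seeds
        return x_new, (y1 + y2) / 2
    w1, w2 = d2 / (d1 + d2), d1 / (d1 + d2)      # closer seed gets more weight
    return x_new, w1 * y1 + w2 * y2
```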
Ordering/Sequence of data quality metrics to achieve better
performance
While there are several data quality issues that need to be addressed, fixing them in an arbitrary order is shown to yield sub-optimal results.
As the search space of all possible sequences is combinatorially large, an important challenge is to find the best sequence with respect to the underlying task.
For example, the authors in [FF12] suggest that correcting the data for missing values via missing value imputation can affect the outliers in the dataset.
This raises some interesting questions:
 What should be the guiding factors by which such a sequence is chosen?
 How can we quantitatively measure these factors?
 Is it possible to find an optimal sequence that allows us to maximize the ML classifier performance?
Ordering/Sequence of data quality metrics to achieve better
performance
 Learn2Clean is a method based on Q-Learning, a model-free reinforcement learning technique.
 For a given dataset, a selected ML model, and a quality performance metric, it finds the optimal sequence of tasks for pre-processing the data such that the quality of the ML model result is maximized.
 More intuitively, the problem that this work addresses is the following: given a dataset D as input, a ML pipeline θ to apply to the input dataset, a quality performance metric q, and the space of all possible data preparation and cleaning strategies Φ(D), find the dataset D′ in Φ(D) that maximizes the quality metric q. A toy sketch of the Q-learning formulation follows below.
[FF12]
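A toy sketch of the Q-learning formulation: states are the sets of cleaning ops applied so far, actions are the remaining ops, and the reward is the downstream quality metric. Everything here (the two ops, the hard-coded reward, the learning rate) is illustrative and is not the Learn2Clean implementation; in practice the reward would be the validation-set score of the model trained on the cleaned data.

```python
# Toy sketch: Q-learning over orderings of two cleaning ops (illustrative).
import random
from collections import defaultdict

OPS = ["impute", "remove_outliers"]

def sequence_reward(applied):
    # Stand-in reward; in Learn2Clean this would be the quality metric q
    # of the ML pipeline applied to the cleaned dataset.
    return 1.0 if applied == ("impute", "remove_outliers") else 0.3

Q = defaultdict(float)
for _ in range(500):                                  # Q-learning episodes
    state, applied = (), []
    while len(applied) < len(OPS):
        remaining = [o for o in OPS if o not in applied]
        if random.random() < 0.2:                     # epsilon-greedy exploration
            action = random.choice(remaining)
        else:
            action = max(remaining, key=lambda o: Q[(state, o)])
        applied.append(action)
        done = len(applied) == len(OPS)
        r = sequence_reward(tuple(applied)) if done else 0.0
        next_state = tuple(applied)
        future = max((Q[(next_state, o)] for o in OPS if o not in applied),
                     default=0.0)
        Q[(state, action)] += 0.5 * (r + future - Q[(state, action)])
        state = next_state

print(max(OPS, key=lambda o: Q[((), o)]))  # learned first op: "impute"
```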
Summary
 Addressing data quality issues before it enters an ML pipeline allows
taking remedial actions to reduce model building efforts and turn-around
times.
 Measuring the data quality in a systematic and objective manner,
through standardised metrics for example, can improve its reliability and
permit informed reuse.
 There are distinct metrics to quantify the data issues and anomalies
based on whether the data is structured or unstructured.
 Various data workers or personas are involved in the data quality
assessment pipeline depending on the level and type of human
intervention required.
 Ordering of the metrics can serve as a powerful framework to integrate
and optimize the data quality assessment process.
Data Quality Metrics for Spatio-Temporal Data
Data Quality for Spatio-temporal
Outline
 Spatio-temporal data in data science
 Data types of spatio-temporal data
 Challenges for quality of spatio-temporal data
 Quality analysis for spatio-temporal data
 Detecting spatio-temporal outliers
Source: http://zed.uchicago.edu/crime.html
Domains that use spatio-temporal data
 Geostatistics
 Statistics focusing on spatial or spatiotemporal datasets of
geospatial information.
 E.g. petroleum geology, hydrology, meteorology
 Spatial econometrics
 Field where spatial analysis and econometrics intersect.
 Model examples: spatial auto-correlation, neighbourhood
effects
Sources:
http://geologylearn.blogspot.com/p/petroleum.html
http://uncglobalhydrology.org/
https://waterprogramming.wordpress.com/2020/07/07/spatial-statistic-part-1-spatial-autocorrelation/
Spatio-temporal data in data science
Applications of Spatio-temporal(ST) data:
 Retail sales analysis
 Climate science
 Transportation
 Criminology
 etc.
Source https://github.com/danielLinke/CitiBike_NYC
Example: bike sharing analysis, where a heat map shows the utilization of each bike station
Spatio-temporal data in data science
 About 70% of the tasks on ST data in data science address prediction and detection with deep learning models. [WCY20]
 Example tasks:
 Demand prediction
 Flood forecast
[Figure: distributions of the STDM problems addressed by deep learning models (Fig. 14 in the survey)]
Source S. Wang, J. Cao, and P. Yu, “Deep learning for spatio-temporal data mining: A survey,“ IEEE
Transactions on Knowledge and Data Engineering,2020.
Spatio-temporal data types
1. Event data
– Discrete events occurring at point locations and times. E.g., incidences of crime
events in the city
2. Trajectory data
– Paths traced by bodies moving in space over time. E.g., the patrol route of a
police surveillance car.
3. Point reference data
– Measurements of a continuous ST field such as temperature, vegetation, or
population over a set of moving reference points in space and time.
4. Raster data
– Measurements of a continuous or discrete ST field that are recorded at fixed locations in space and at fixed time points. E.g., population density in a geographic information system.
[SJA15, WCY20]
Challenges for spatio-temporal data quality
 The quality of the ST data determines much of the accuracy of prediction and detection.
 Aspects of ST data quality include the spatial dimension, the temporal dimension, and their combination.
 A key challenge for ST data quality is to detect outliers.
Spatio-temporal data distribution
Source: Ferreira, L.N., Vega-Oliveros, D.A., Cotacallapa, M. et al. Spatiotemporal data analysis with
chronological networks. Nat Commun 11, 4036 (2020).
Quality analysis for spatio-temporal data
- Detecting spatio-temporal outliers
Detecting spatio-temporal outliers:
 A typical framework of ST outlier detection consists of finding spatial outliers in a first step and verifying temporal outliers in the next step.
ST outlier detection framework
Source: M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, “Outlier detection for temporal data: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 9, pp. 2250-2267, 2014.
Classical spatial outlier detection
- Local Moran’s I [Anselin95]
 Spatial autocorrelation
 Identify influential observations (e.g. hot spot, cold spot)
and outliers among the areas.
$$I = \frac{N}{W} \, \frac{\sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}$$
where $N$ is the number of spatial units indexed by $i$ and $j$; $x$ is the variable of interest; $\bar{x}$ is the mean of $x$; $w_{ij}$ is a matrix of spatial weights with zeroes on the diagonal (i.e., $w_{ii} = 0$); and $W$ is the sum of all $w_{ij}$.
Example: a 3×3 grid of spatial values
3 2 0
5 8 1
2 4 1
with cells indexed (1)–(9), so i, j = (1), (2), (3), …; x_1 = 3, x_2 = 2, x_3 = 0, …; and adjacency weights w_12 = 1, w_13 = 0, … The resulting Local Moran's quadrant map over the grid is
1 2 3
1 1 2
2 1 3
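A small sketch of computing (global) Moran's I with NumPy directly from the formula above, using the 3×3 example grid; the binary rook-adjacency weight matrix is an illustrative choice.

```python
# Sketch: Moran's I for the 3x3 example grid, using rook-adjacency weights.
import numpy as np

vals = np.array([[3, 2, 0], [5, 8, 1], [2, 4, 1]], dtype=float)
n_rows, n_cols = vals.shape
x = vals.ravel()
N = x.size

# Binary rook-adjacency weights: w_ij = 1 if cells share an edge.
W = np.zeros((N, N))
for r in range(n_rows):
    for c in range(n_cols):
        i = r * n_cols + c
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n_rows and 0 <= cc < n_cols:
                W[i, rr * n_cols + cc] = 1

z = x - x.mean()
I = (N / W.sum()) * (z @ W @ z) / (z @ z)   # Moran's I from the formula above
print(round(I, 3))
```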
Classical temporal data outlier detection based on the ARMA model
 Autoregressive moving average (ARMA) model
 A model for time series data that combines the AR and MA models:
$$X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$$
 Outlier detection based on an ARMA model
 Create an ARMA model from the real data
 Determine outliers from the difference between the ARMA model's predictions and the real data
 ARIMA: Integrated ARMA
 SARIMA: Seasonal ARIMA
Source: https://www.kumilog.net/entry/sarima-pv
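A short sketch of this residual-based scheme, assuming the statsmodels package: fit an ARIMA model (ARMA is ARIMA with d = 0) and flag points whose residuals are several standard deviations from zero. The 3-sigma threshold and the injected spike are illustrative.

```python
# Sketch: ARMA-based outlier detection via large model residuals.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.sin(np.arange(200) / 10) + rng.normal(0, 0.1, 200)
series[120] += 3.0                       # inject an obvious outlier

res = ARIMA(series, order=(2, 0, 1)).fit()   # ARMA(2,1) == ARIMA(2,0,1)
resid = res.resid
outliers = np.where(np.abs(resid) > 3 * resid.std())[0]  # 3-sigma rule
print(outliers)                          # should include index 120
```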
Spatio-temporal Outlier Detection
 “Spatio-temporal outlier detection is an extension of
spatial outlier detection.” [WLC10]
 A spatio-temporal object is represented by a set of instances $\langle o\_id, s_i, t_i \rangle$, where the spacestamp $s_i$ is the location of object $o\_id$ at timestamp $t_i$.
 Exact-Grid Top-K [WLC10] is a spatio-temporal outlier
detection algorithm that is based on:
 Spatial Scan Statistic [Kulldorff97]
 Exact-Grid [AMP06]
A moving region
Source:
[WLC10] E. Wu, W. Liu, and S. Chawla, “Spatio-temporal outlier detection in precipitation data,"
in Knowledge Discovery from Sensor Data, Springer Berlin Heidelberg, pp.115-133, 2010
Spatio-temporal Outlier Detection
- Spatial Scan Statistics [Kulldorff97]
 Identifies clusters of randomly positioned points
 Determines hotspots in spatial data
 Steps:
 Define the study area and the observed events.
 Place a scanning point on the grid.
 Record the observed and expected numbers of events.
 Drawback: huge amount of calculation.
Source: edited from https://rr-asia.oie.int/wp-content/uploads/2020/03/lecture-4_cluster-detection-using-the-spatial-scan-statistic-satscan_20180920-min_vink.pdf
Spatio-temporal Outlier Detection
- Exact-Grid [AMP06]
 A simple exact algorithm for finding the largest discrepancy region in a domain.
 Reduces the running time to $O(n^4)$ from $O(n^5)$.
 An approximation algorithm for a large class of discrepancy functions (including the Kulldorff scan statistic).
[Figure 1: Example of maximal discrepancy range on a data set; Xs are measured data and Os are baseline data. Figure 2: sweep lines, contours, and arcs used by Algorithm Exact.]
[AMP06] Deepak Agarwal, Andrew McGregor, Jeff M Philipps, et al. "Spatial scan statistics: approximations and performance
study." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006.
Spatio-temporal Outlier Detection
- Exact-Grid Top-k [WLC10]
 Extends the Exact-Grid and Approx-Grid algorithms to handle the overlapping-regions problem: find the top-k outliers in a spatial grid for each time period.
 Finding the top-k outliers:
 Consider every possible region size and shape in the grid.
 Compute each region's discrepancy value to determine which is a more significant outlier.
 Keep track of the top-k regions rather than just the top-1.
[Figures: the overlap problem and the top-k finding algorithm]
Source: https://slideplayer.com/slide/4806555/
[WLC10] E. Wu, W. Liu, and S. Chawla, “Spatio-temporal outlier detection in precipitation data,"
in Knowledge Discovery from Sensor Data, Springer Berlin Heidelberg, pp.115-133, 2010
Summary of Data Quality for Spatio-temporal
 Spatio-temporal data in data science
 Retail sales analysis, transportation
 Data types of spatio-temporal data
 Event, trajectory, point reference, and raster data
 Challenges for quality of spatio-temporal data
 Outliers in ST data
 Quality analysis for spatio-temporal data
 Spatial Scan Statistics [Kulldorff97], Exact-Grid [AMP06, WLC10]
Data Quality Metrics for Unstructured Text Data
Data Quality Metrics
We covered the following topics in the KDD 2020 tutorial:
 Metrics for Generic Text Quality
 Metrics for Corpus Filtering
 Metrics for Text Classification
Outline
 Motivation
 What is Unstructured Text?
 How to Assess Text Quality?
 Metrics for Text Quality
 Text Quality Metrics for Dataset Valuation
 Text Quality Metrics for Outlier Detection
 Text Quality Metrics for Class Imbalance
 Text Quality Metrics for Dataset Complexity
 Future Directions
 Next Steps
What is Unstructured Text?
Examples: HTML/XML documents, tweets, ticket data, emails, product reviews
How to Assess Quality of Text?
Different properties of text can be measured to quantify its quality!
 Lexical Properties
- Vocabulary, Misspellings etc.
 Syntactic Properties
- Grammatical Relations, Syntax Violations etc.
 Semantic Properties
- Text Meaningfulness, Text Complexity etc.
 Discourse Properties
- Readability, Text Coherence etc.
 Distributional Properties
- Outliers, Topics, Embeddings
How to Assess Quality of Text?
Different approaches can be further broadly classified into generic text quality approaches and task-specific approaches.
Generic Text Quality Approaches:
 Text Readability
 Text Coherence
 Text Formality
 Text Ambiguity
 Text Outliers
 Lexical Diversity
 Text Cleanliness
 Text Appropriateness
 Text Complexity
 Text Bias
Task/Dataset Specific Approaches:
 Text Classification
 Text Generation
 Semantic Similarity
 Dialog Systems
 Label Noise
 Dataset Valuation
 Dataset Complexity
Dataset Cartography:
Mapping and Diagnosing Datasets with Training Dynamics
Objective:
Leverage the behavior of the ML model on individual instances during training to generate a data map
that illustrates easy-to-learn, hard-to-learn and ambiguous samples in the dataset.
Approach:
1. Measure the confidence, correctness and variability of the model during training epochs.
2. Model confidence is measured as the mean probability of the true label across training epochs.
3. Model correctness is measured as the fraction of times the model correctly predicts the label.
4. Model variability is measured as the standard deviation of the predicted probabilities of the true
label across training epochs.
Task:
Compare the performance of various baselines generated by selecting subsets of the dataset and training the RoBERTa-large model.
Source : Swayamdipta, Swabha, et al. 2020, Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
[SSL+20]
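A minimal sketch of computing the three training-dynamics statistics, assuming the model's probability for the true label has been saved at each epoch; the array name, the random stand-in data, the 0.5 correctness proxy, and the region thresholds are all illustrative.

```python
# Sketch: confidence, variability, and correctness from per-epoch probabilities.
import numpy as np

# true_label_probs[e, i] = P(true label of example i) after epoch e (assumed saved)
true_label_probs = np.random.rand(10, 500)        # stand-in for real training logs

confidence = true_label_probs.mean(axis=0)        # mean true-label prob across epochs
variability = true_label_probs.std(axis=0)        # std dev across epochs
correctness = (true_label_probs > 0.5).mean(axis=0)  # fraction of epochs "correct"
                                                     # (binary proxy for argmax check)

easy = (confidence > 0.8) & (variability < 0.2)   # easy-to-learn region (top-left)
hard = (confidence < 0.2) & (variability < 0.2)   # hard-to-learn region (bottom-left)
ambiguous = variability >= 0.2                    # ambiguous region (right)
```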
Source : Swayamdipta, Swabha, et al. 2020, Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Dataset Cartography:
Mapping and Diagnosing Datasets with Training Dynamics
Data Map and Insights:
 Data follows a bell-shaped curve with respect to confidence and variability
 Correctness determines discrete regions in the map
 Instances in the high-confidence, low-variability region of the map (top-left) are easy-to-learn
 Instances with low variability and low confidence (bottom-left) are hard-to-learn
 Instances with high variability (right) are ambiguous
 Hard-to-learn and ambiguous instances are the most informative for learning
 Some amount of easy-to-learn instances is also necessary for successful optimization
[SSL+20]
Data Valuation Using Reinforcement Learning
Objective:
Determine a reward for each sample by quantifying the performance of the predictor model on a small validation set, and use it as a reinforcement signal to learn the likelihood of each sample being used in training the predictor model.
Approach:
1. Perform end-to-end training of (i) target task predictor model and (ii) data value estimator model
2. Data value estimator generates selection probabilities for a mini-batch of training samples
3. Predictor model trains on the mini-batch and loss is computed against the validation set
4. Predictor model parameters are updated through back-propagation
5. Data value estimator parameters are updated using the Reinforce approach
Task:
1. Compare DVRL framework against standard baselines on standard datasets from different domains
Source : Yoon, Jinsung et. al 2020, Data valuation using reinforcement learning
[YAP20]
Data Valuation Using Reinforcement Learning
Insights:
 Using only 60%-70% of the training set (the highest valued samples), DVRL can obtain a similar performance compared to training
on the entire dataset.
 The framework also outperforms baselines in the presence of noisy labels and is able to detect noisy labels in the dataset by
assigning them low scores.
 The computational complexity of the DVRL framework is not exponential in the dataset size; the overhead is only about twice that of conventional training.
Source : Yoon, Jinsung et. al 2020, Data valuation using reinforcement learning
[YAP20]
Outlier Detection
An outlier in tabular or timeseries data can be interpreted as:
 A deviating point
 Noise
 A rarely occurring point
What can be an outlier in text data?
 Topically diverse sample?
 Gibberish text?
 Meaningless sentences?
 Samples from other language?
Outlier Examples in Text Data
 Repetitive: greatest show ever mad full stop greatest show ever mad full stop greatest show ever mad
full stop greatest show ever mad full stop
 Incomprehensible: lived let idea heck bear walk never heard whole years really funny beginning went
hill quickly
 Incomplete: Suspenseful subtle much much disturbing
Outlier detection in text data spans classical techniques, such as matrix factorization, and DL techniques, such as self-attention-based text representations.
Outlier Detection for Text Data
 Feature generation: a simple bag-of-words approach
 Apply matrix factorization: decompose the given term–document matrix D into a low-rank matrix L and an outlier matrix Z (D ≈ L + Z)
 Further, L itself can be expressed as a product of low-rank factors
 The l2-norm score of a particular column z_x of Z serves as the outlier score for that document
Source : Kannan et al, 2017. Outlier detection for text data
[KWAP17]
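A loose sketch of this idea using scikit-learn: TF-IDF bag-of-words features, a truncated SVD playing the role of the low-rank part L, and the per-document residual norm as the outlier score. This approximates the spirit of [KWAP17] rather than its exact block-coordinate-descent solver; the toy documents are illustrative.

```python
# Sketch: low-rank residual norms as text outlier scores (approximation).
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the movie was great and the acting was great",
    "great acting and a great story in this movie",
    "the plot and the acting made the movie great",
    "asdkjh qwerty zxcvb lorem gibberish tokens here",  # likely outlier
]

D = TfidfVectorizer().fit_transform(docs)          # term-document features
svd = TruncatedSVD(n_components=2, random_state=0)
L = svd.inverse_transform(svd.fit_transform(D))    # low-rank reconstruction
Z = D.toarray() - L                                # residual ("outlier") part
scores = np.linalg.norm(Z, axis=1)                 # l2 norm per document
print(scores.argmax())                             # index of the top outlier
```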
Outlier Detection for Text Data
 For experiments, outliers are sampled from a unique class from standard datasets
 Receiver operator characteristics are studied to assess the performance of the
proposed approach.
 Approach is useful at identifying outliers even from regular classes.
 Patterns such as unusually short/long documents, unique words, repetitive vocabulary
etc. were observed in detected outliers.
Source : Kannan et al, 2017. Outlier detection for text data
[KWAP17]
Unsupervised Anomaly Detection on Text
Multi-Head Self-Attention:
 Proposes a novel one-class classification method which leverages pretrained word embeddings to perform anomaly detection on text
 Given a word embedding matrix H, multi-head self-attention is used to map sentences of variable length to a collection of fixed-size representations, each representing the sentence with respect to a different context
Source: Ruff et al., 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text
[RZV+19]
Unsupervised Anomaly Detection on Text
 These sentence representations are trained jointly with a collection of context vectors (the CVDD objective) such that the context vectors and representations are similar, while an orthogonality constraint keeps the context vectors diverse
 Outlier scoring: the greater the distance of m_k(H) to context vector c_k, the more anomalous the sample is with respect to context k
Source : Ruff et al 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text
[RZV+19]
EDA: Easy Data Augmentation Techniques for Boosting
Performance on Text Classification Tasks
Objective:
Utilize simple text operators to perform data augmentation and boost performance of ML models on text
classification tasks specifically for small datasets.
Approach:
1. Four specific text operators are discussed – (i) synonym replacement (ii) random insertion (iii) random
swap and (iv) random deletion
2. For a given sentence in the training set, one of the operations is performed at random.
3. The number of words changed, n, is based on the sentence length l with the formula n=αl.
4. For each original sentence, n_aug augmented sentences are generated.
Task:
1. Compare EDA on five NLP tasks with CNNs and RNNs
Wei, Jason, and Kai Zou, 2019, Eda: Easy data augmentation techniques for boosting performance on text classification tasks.
[WZ19]
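A compact sketch of two of the four EDA operators (random swap and random deletion); synonym replacement and random insertion need a thesaurus such as WordNet and are omitted here. The example sentence and probabilities are illustrative.

```python
# Sketch: EDA-style random swap and random deletion (two of the four operators).
import random

def random_swap(words, n):
    """Swap two random word positions, n times."""
    words = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    """Delete each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept or [random.choice(words)]

sentence = "data quality matters for machine learning".split()
alpha = 0.1
n = max(1, int(alpha * len(sentence)))   # n = alpha * l, as on the slide
print(" ".join(random_swap(sentence, n)))
print(" ".join(random_deletion(sentence)))
```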
EDA: Easy Data Augmentation Techniques for Boosting
Performance on Text Classification Tasks
Insights:
 An average improvement of 0.8% for full datasets and 3.0% for N_train = 500 is observed
 For the training set fractions {1, 5, 10, 20, 30, 40} there is consistent and significant improvement observed across all datasets and tasks
 It is empirically shown that EDA conserves the labels of the original sentences, by analyzing t-SNE plots of augmented samples
 Some limitations are (i) the performance gain can be marginal when data is sufficient and (ii) EDA might not yield substantial improvements when using pre-trained models
Wei, Jason, and Kai Zou, 2019, Eda: Easy data augmentation techniques for boosting performance on text classification tasks.
[WZ19]
Do not Have Enough Data? Deep Learning to the
Rescue!
Objective:
Given a small labelled dataset, perform data augmentation by generating synthetic samples using a pre-trained
language model fine-tuned on the dataset.
Approach:
1. Use a pre-trained language model (GPT-2) to fine-tune on the available labelled dataset and use it to
synthesize new labelled sentences.
2. Independently, train a classifier on the original dataset and use it to filter the synthesized data corpus, removing synthesized samples with a low classifier confidence score.
Task:
1. Compare the LAMBADA framework against state-of-the-art baselines on standard datasets.
Source : Anaby-Tavor, Ateret, et al., 2020, Do not have enough data? Deep learning to the rescue!.
[ATCG+20]
Do not Have Enough Data? Deep Learning to the
Rescue!
Insights:
 Proposed framework is compared against various
classifier models against different baselines on 3
standard datasets.
 It is empirically shown that the LAMBADA approach improves performance with up to 50 samples per class, and that it is classifier agnostic.
 When compared with SOA baselines, LAMBADA
consistently performs better on all the datasets with
various classifiers.
 Experiments are also done to show that LAMBADA can
also serve as an alternative to semi-supervised
techniques when unlabelled data does not exist.
Source : Anaby-Tavor, Ateret, et al., 2020, Do not have enough data? Deep learning to the rescue!.
[ATCG+20]
Evolutionary Data Measures: Understanding the Difficulty
of Text Classification Tasks
Proposes an approach to design a data quality metric which explains the complexity of the given data for a classification task.
 Considers various data characteristics to generate a 48-dim feature vector for each dataset.
 Data characteristics include
 Class Diversity: count-based probability distribution of classes in the dataset
 Class Imbalance: $\sum_{c=1}^{C} \left| \frac{1}{C} - \frac{n_c}{T_{data}} \right|$
 Class Interference: similarities among samples belonging to different classes
 Data Complexity: linguistic properties of data samples
Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
The feature vector for a given dataset covers quality properties such as
 Class Diversity (2-Dim)
 Shannon Class Diversity
 Shannon Class Equitability
 Class Imbalance (1-Dim)
 Class Interference (24-Dim)
 Hellinger Similarity
 Top N-Gram Interference
 Mutual Information
 Data Complexity (21-Dim)
 Distinct n-gram : Total n-gram
 Inverse Flesch Reading Ease
 N-Gram and Character diversity
[CRZ18]
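A tiny sketch of the class imbalance component above: the summed absolute deviation of the observed class proportions from a perfectly uniform distribution.

```python
# Sketch: class imbalance = sum over classes c of |1/C - n_c / T_data|.
from collections import Counter

def class_imbalance(labels):
    counts = Counter(labels)
    C, T = len(counts), len(labels)
    return sum(abs(1 / C - n / T) for n in counts.values())

print(class_imbalance(["a"] * 90 + ["b"] * 10))  # 0.8 for a 90/10 split
```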
Understanding the Difficulty of Text Classification Tasks
 The authors propose using genetic algorithms to intelligently explore the 2^48 possible feature combinations
 The fitness function for the genetic algorithm was the Pearson correlation between the difficulty score and model performance on the test set
 89 datasets were considered for evaluation, with 12 different types of models on each dataset
 The effectiveness of a given combination of metrics is measured using its correlation with the performance of various models on various datasets
 The stronger the negative correlation of a metric with model performance, the better the metric explains data complexity
Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
[CRZ18]
Understanding the Difficulty of Text Classification Tasks
Difficulty Measure D2 = Distinct Unigrams : Total Unigrams + Class Imbalance + Class Diversity + Maximum Unigram Hellinger Similarity + Unigram Mutual Info.
Correlation = −0.8814
Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
[CRZ18]
Next Steps
KDD Tutorial / © 2021 IBM Corporation
How to assess overall text quality?
 Framework that allows users to assess text quality across various dimensions
 Standardized set of quality metrics, each outputting a score that indicates low/high quality
 Provide insights into the specific samples in the dataset that contribute to a low/high score
 Specific recommendations for addressing poor text quality and evidence of model
performance improvement
We invite you to join us on this agenda towards a data-centric approach to analyzing data quality.
Contact Hima Patel with your ideas and enquiries.
KDD Tutorial / © 2021 IBM Corporation
References
KDD Tutorial / © 2021 IBM Corporation
[BF99] Carla E Brodley and Mark A Friedl. Identifying mislabeled training data. Journal of artificial intelligence research, 11:131–167, 1999.
[ARK18] Mohammed Al-Rawi and Dimosthenis Karatzas. On the labeling correctness in computer vision datasets. In IAL@PKDD/ECML, 2018.
[NJC19] Curtis G Northcutt, Lu Jiang, and Isaac L Chuang. Confident learning: Estimating uncertainty in dataset labels. arXiv preprint
arXiv:1911.00068, 2019.
[EGH17] Rajmadhan Ekambaram, Dmitry B Goldgof, and Lawrence O Hall. Finding label noise examples in large scale datasets. In 2017 IEEE
International Conference on Systems, Man, and Cybernetics, pages 2420–2424, 2017.
[XHJ10] Xiong, Haitao, Junjie Wu, and Lu Liu. "Classification with class overlapping: A systematic study." The 2010 International Conference on
E-Business Intelligence. 2010
[DQT21] Nitin Gupta, Hima Patel, Shazia Afzal, Naveen Panwar, Ruhi Sharma Mittal, Shanmukha Guttula, Abhinav Jain, Lokesh Nagalapatti,
Sameep Mehta, Sandeep Hans, Pranay Lohia, Aniya Aggarwal, Diptikalyan Saha. Data Quality Toolkit: Automatic assessment of data quality
and remediation for machine learning datasets. arXiv, 2021, https://arxiv.org/pdf/2108.05935.pdf
[FF12] W. Fan and F. Geerts, “Foundations of data quality management,” Synthesis Lectures on Data Management, vol. 4, no. 5, pp. 1–217,
2012.
[DQR18] Corrales, David Camilo, Juan Carlos Corrales, and Agapito Ledezma. "How to address the data quality issues in regression models: a
guided process for data cleaning." Symmetry 10.4 (2018): 99.
[OT21] Agarwal, Amulya, and Nitin Gupta. "Comparison of Outlier Detection Techniques for Structured Data." arXiv preprint
arXiv:2106.08779 (2021).
[SM13] Torgo, Luís, et al. "SMOTE for regression." Portuguese Conference on Artificial Intelligence. Springer, Berlin, Heidelberg, 2013.
[WCY20] S. Wang, J. Cao, and P. Yu, “Deep learning for spatio-temporal data mining: A survey,” IEEE Transactions on Knowledge and Data
Engineering, 2020.
[SJA15] S. Shekhar, Z. Jiang, R. Y. Ali, et al., “Spatiotemporal data mining: A computational perspective,” ISPRS International Journal of
Geo-Information, vol. 4, no. 4, pp. 2306–2338, 2015.
[WLC10] E. Wu, W. Liu, and S. Chawla, “Spatio-temporal outlier detection in precipitation data,” in Knowledge Discovery from Sensor Data,
Springer Berlin Heidelberg, pp. 115–133, 2010.
[GGA14] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, “Outlier detection for temporal data: A survey," IEEE Transactions on Knowledge and
Data Engineering, vol. 26, no. 9, pp. 2250-2267, 2014.
[KN98] E. M. Knorr and R. T. Ng, “Algorithms for mining distance-based outliers in large datasets,” in Proceedings of the 24th International
Conference on Very Large Data Bases, ser. VLDB '98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 392–403.
[SLZ01] S. Shekhar, C.-T. Lu, and P. Zhang, “Detecting graph-based spatial outliers: Algorithms and applications (a summary of results),” in
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '01. New York, NY,
USA: Association for Computing Machinery, 2001, pp. 371–376.
[CL04] T. Cheng and Z. Li, “A hybrid approach to detect spatio-temporal outliers,” in Proceedings of the 12th International Conference on
Geoinformatics, 2004, pp. 173–178.
[Anselin95] Anselin, Luc. "Local indicators of spatial association—LISA." Geographical analysis 27.2 (1995): 93-115.
[AMP06] Deepak Agarwal, Andrew McGregor, Jeff M. Phillips, et al. "Spatial scan statistics: approximations and performance
study." Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2006.
[SSL+20] Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A Smith, and Yejin Choi. Dataset
cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 9275–9293, 2020.
[WZ19] Jason Wei and Kai Zou. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of
the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 6382–6388, 2019.
[ATCG+20] Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama
Zwerdling. Do not have enough data? Deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, volume
34, pages 7383–7390, 2020.
[CRZ18] Edward Collins, Nikolai Rozanov, and Bingbing Zhang. Evolutionary data measures: Understanding the difficulty of text classification
tasks. arXiv preprint arXiv:1811.01910, 2018.
[KWAP17] Ramakrishnan Kannan, Hyenkyun Woo, Charu C Aggarwal, and Haesun Park. Outlier detection for text data. In Proceedings of the
2017 SIAM International Conference on Data Mining, pages 489–497. SIAM, 2017.
[RWGS20] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of NLP models
with CheckList. arXiv preprint arXiv:2005.04118, 2020.
[RZV+19] Lukas Ruff, Yury Zemlyanskiy, Robert Vandermeulen, Thomas Schnake, and Marius Kloft. Self-attentive, multi-context one-class
classification for unsupervised anomaly detection on text. In Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, pages 4061–4071, 2019.
[YAP20] Jinsung Yoon, Sercan Arik, and Tomas Pfister. Data valuation using reinforcement learning. In International Conference on Machine
Learning, pages 10842–10851. PMLR, 2020.
  • 1. Data Quality for Machine Learning Tasks KDD Tutorial / © 2021 IBM Corporation Nitin Gupta Shashank Mujumdar Satoshi Masuda Hima Patel IBM Research India and IBM Research Japan
  • 2. Need for Data Quality for Machine Learning KDD Tutorial / © 2021 IBM Corporation
  • 3. Let us start with a story.. KDD Tutorial / © 2021 IBM Corporation Data Scientist Picture Courtesy: This Photo by Unknown Author is licensed under CC BY-SA-NC Yay!! I am so excited!!
  • 4. Data Preparation is a time-consuming activity in data science lifecycle KDD Tutorial / © 2021 IBM Corporation “Data collection and preparation are typically the most time-consuming activities in developing an AI-based application, much more so than selecting and tuning a model.” – MIT Sloan Survey https://sloanreview.mit.edu/projects/reshaping-business-with-artificial- intelligence/ Data preparation accounts for about 80% of the work of data scientists” - Forbes https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation- most-time-consuming-least-enjoyable-data-science-task-survey- says/#70d9599b6f63
  • 5. Challenges with Data Preparation KDD Tutorial / © 2021 IBM Corporation Data issues are not known at the start of the project. Discovered via iterative data debugging which is cumbersome and time consuming
  • 6. Data Quality Analysis can help.. KDD Tutorial / © 2021 IBM Corporation  Know the issues in the data beforehand, example noise in labels, overlap between classes.  Objective measurement of how good or bad the data is, and how does each operation affect the data  Make informed choices for data preprocessing and model selection  Reduce turn around time for data science projects.
  • 7. Data Quality 2.0 KDD Tutorial / © 2021 IBM Corporation Why is automated data quality analysis important?  Lot of progress in last several years on improving ML algorithms and building automated machine learning toolkits (AutoML) [*]  Several commercially ready pipelines are available  AutoAI with IBM Watson Studio  CloudAutoML from Google  …  Open source pipelines  Autosklearn  Autokeras  …  However, quality of model is upper bounded by quality of input data GAP: No systematic efforts to measure the quality of data for machine learning (ML) George Fuechsel, IBM 305 RAMAC technician
  • 8. How is it different from traditional data quality? KDD Tutorial / © 2021 IBM Corporation • Data quality is a well established area in database community and the capabilities to measure the quality of data in databases and data lakes exist in several products • There is a need to re-look at this approach from the lens of building machine learning models, as new metrics and dimensions need to be defined. • Hence, need for data quality 2.0 Image Courtesy: https://fractalenlightenment.com/33427/ life/changing-perspectives-choose-to-view-life-through-a-different- lens
  • 9. To put it all together KDD Tutorial / © 2021 IBM Corporation Data Assessment and Readiness Module  Need for algorithms and tools that can assess training datasets  Need for algorithms and tools that can remediate datasets Bring in automation, standardization and more democratization of data science process.
  • 10. To summarize: KDD Tutorial / © 2021 IBM Corporation George Fuechsel, IBM 305 RAMAC technician Lot of progress in last several years on improving ML algorithms including building automated machine learning toolkits (AutoML) However, Quality of a ML model is directly proportional to Quality of Data Hence, there is a need for systematic study of measuring quality of data with respect to machine learning tasks.
  • 11. Broad Research Challenges How to systematically measure the quality of data for ML? How to best remediate the data? Does the sequence of operations matter for data remediation? How to systematically capture all the data changes via automated documentation? How to address different modalities of data? Can we build multimodal solutions? KDD Tutorial / © 2021 IBM Corporation
  • 12. Data Quality Metrics: Desired Qualities  Capabilities to assess different dimensions of the data that can affect a model performance, example bias in the data, class imbalance etc. We will explore this in detail in the tutorial today.  Standardization of output from the different metrics, for high level view of the health of the dataset.  Capability to explain to the user why the data is “bad” for a given dimension  Customized recommendations based on the severity of data issues, data size and other data attributes KDD Tutorial / © 2021 IBM Corporation
  • 13. Challenge of different modalities Structured Datasets  Tabular  Spatio-temporal Unstructured Datasets  Social Media Data: Tweets, posts, documents, chat msgs etc  IT Operations Data : Tickets, Logs, Github pull requests, alerts, JSON, XML etc  More generic data forms: documents, web pages etc Image Datasets Speech Datasets Multimodal Datasets KDD Tutorial / © 2021 IBM Corporation Timeseries
  • 14. Challenge of different modalities Structured Datasets  Tabular  Spatio-temporal Unstructured Datasets  Social Media Data: Tweets, posts, documents, chat msgs etc  IT Operations Data : Tickets, Logs, Github pull requests, alerts, JSON, XML etc  More generic data forms: documents, web pages etc Image Datasets Speech Datasets Multimodal Datasets KDD Tutorial / © 2021 IBM Corporation Timeseries Detection of outliers Detection of Noisy Labels Overlapping data points
  • 15. Getting started in this area: Existing tools and libraries KDD Tutorial / © 2021 IBM Corporation Open Source Libraries: Deequ: https://github.com/awslabs/deequ Tensorflow Data Validation: https://www.tensorflow.org/tfx/guide/tfdv Pandas Profiler: https://github.com/pandas-profiling/pandas-profiling Beta/Trial Versions: Data Quality For AI : https://developer.ibm.com/apis/catalog/dataquality4ai--data-quality-for- ai/Introduction/ Know Your Data: https://knowyourdata.withgoogle.com/
  • 16. In this tutorial: Part 1: Overview of data quality for ML (covered) Part 2: Techniques for data quality measurements for structured datasets Part 3: Techniques for data quality measurements for spatio-temporal datasets Part 4: Techniques for data quality measurements for unstructured datasets KDD Tutorial / © 2021 IBM Corporation
  • 17. Data Quality Metrics for Structured Data KDD Tutorial / © 2021 IBM Corporation
  • 18. Data Quality Metrics We had covered the following topics in KDD 2020 tutorial: Classification specific metrics:  Data Cleaning taxonomy  Class Imbalance  Data Valuation  Data Homogeneity  Data Transformation KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 19. Data Quality Metrics Today, we will cover the following topics: Classification specific metrics  Label Noise  Class Overlap  Outlier Detection Regression specific metrics  Metrics Overview  Outlier  Class Imbalance Metric Sequencing KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 20. Data Quality Metrics Today, we will cover the following topics: Classification specific metrics  Label Noise  Class Overlap  Outlier Detection Regression specific metrics  Metrics Overview  Outlier  Class Imbalance Metric Sequencing KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 21. Label Noise KDD Tutorial / © 2021 IBM Corporation Given Label – Iris-setosa Correct Label –Iris-virginica (based on attributes analysis)  Most of the large data generated or annotated have some noisy labels.  In this metric we discuss “How one can identify these label errors and correct them to model data better?”. There are atleast 100,000 label issues is ImageNet! Source: https://l7.curtisnorthcutt.com/confident-learning
  • 22. Effects of Label Noise Possible Sources of Label Noise:  Insufficient information provided to the labeler  Errors in the labelling itself  Subjectivity of the labelling task  Communication/encoding problems Label noise can have several effects:  Decrease in classification performance  Pose a threat to tasks like feature selection  In online settings, new labelled data may contradict the original labelled data KDD Tutorial / © 2021 IBM Corporation
  • 23. Label Noise Techniques ‘ Label Noise Algorithm Level Approaches Data Level Approaches  Designing robust algorithms that are insensitive to noise  Not directly extensible to other learning algorithms  Requires to change an existing method, which neither is always possible nor easy to develop  Filtering out noise before passing to underlying ML task  Independent of the classification algorithm  Helps in improving classification accuracy and reduced model complexity. KDD Tutorial / © 2021 IBM Corporation
  • 24. Label Noise Techniques ‘ Label Noise Algorithm Level Approaches Data Level Approaches  Learning with Noisy Labels (NIPS-2014)  Robust Loss Functions under Label Noise for Deep Neural Networks (AAAI-2017)  Probabilistic End-To-End Noise Correction for Learning With Noisy Labels (CVPR-2019)  Can Gradient Clipping Mitigate Label Noise? (ICLR-2020)  Identifying mislabelled training data (Journal of artificial intelligence research 1999)  On the labeling correctness in computer vision datasets (IAL 2018)  Finding label noise examples in large scale datasets (SMC 2017)  Confident Learning: Estimating Uncertainty in Dataset Label (Arxiv -2019) KDD Tutorial / © 2021 IBM Corporation
  • 25. Filter Based Approaches  On the labeling correctness in computer vision datasets [ARK18]  Published in Interactive Adaptive Learning 2018  Train CNN ensembles to detect incorrect labels  Voting schemes used:  Majority Voting KDD Tutorial / © 2021 IBM Corporation  Identifying mislabelled training data [BF99]  Published in Journal of artificial intelligence research, 1999  Train filters on parts of the training data in order to identify mislabeled examples in the remaining training data.  Voting schemes used:  Majority Voting  Consensus Voting  Finding label noise examples in large scale datasets [EGH17]  Published in International Conference on Systems, Man, and Cybernetics, 2017  Two-level approach  Level 1 – Train SVM on the unfiltered data. All support vectors can be potential candidates for noisy points.  Level 2 – Train classifier on the original data without the support vectors. Samples for which the original and the predicted label does not match, are marked as label noise. Source: Broadley et al, 1999, Identifying mislabeled training Data
  • 26. Confident Learning: Estimating Uncertainty in Dataset Label -2019 The Principles of Confident Learning  Prune to search for label errors  Count to train on clean data  Rank which examples to use during training KDD Tutorial / © 2021 IBM Corporation Source: Northcutt et al, 2019. Confident Learning: Estimating Uncertainty in Dataset Label [NJC19]
  • 27. KDD Tutorial / © 2021 IBM Corporation Confident Learning: Estimating Uncertainty in Dataset Label -2019 Source: Northcutt et al, 2019. Confident Learning: Estimating Uncertainty in Dataset Label [NJC19]
  • 28. Data Quality for AI: Label Purity KDD Tutorial / © 2021 IBM Corporation  Label purity algorithm is built on top of the confident learning-based approach (CleanLab).  One limitation of CleanLab approach is that it can tag some correct samples as noisy if they lie in the overlap region thereby generating false positives.  An overlap region is a region where samples of multiple classes start sharing features because of multiple reasons such as the presence of weak features, fine-grained classes, etc. This paper improvise the label purity algorithm to address this problem in two ways:  Algorithm to detect overlap regions, which helps in removing the false positives  Provide an effective neighborhood based pruning mechanism to remove identified noisy candidate samples [DQT21]
  • 29. Data Quality for AI: Label Purity KDD Tutorial / © 2021 IBM Corporation Comparison of (a) Precision and (b) Recall of Label Purity Algorithm with CleanLab Algorithm on 35 Datasets Takeaways:  This algorithm outperforms CleanLab in terms of precision (5−15% improvement).  For recall, both the algorithms have similar performance.  On an average over 35 datasets, (a) the precision of this algorithm is .92 and CleanLab is .76, (b) the recall of this algorithm is .75 and CleanLab is .78  Overall, 5−16% improvement in precision at the cost of 2% drop in the recall [DQT21]
  • 30. Data Quality Metrics Today, we will cover the following topics: Classification specific metrics  Label Noise  Class Overlap  Outlier Detection Regression specific metrics  Metrics Overview  Outlier  Class Imbalance Metric Sequencing KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 31. Class Overlap : Data Quality for AI KDD Tutorial / © 2021 IBM Corporation Analyse the dataset to find samples that reside in the overlapping region of the data space.  Identify data points which are close to each other but belongs to different classes  Identify data points which lies closer to or other side of the class boundary Why is it useful for ML pipeline?  Overlapping regions are hard to detect and can cause ML classifiers to misclassify points in that region.  If amount of overlap is high, we need good feature representation or more complex model Example 2 Example 1 [DQT21]
  • 32. Data Quality for AI - Class Overlap KDD Tutorial / © 2021 IBM Corporation  Precision and Recall of Class Overlap Algorithm on 20 datasets (after inducing 30% overlap points) [DQT21]
  • 33. Class Overlap : Data Quality for AI KDD Tutorial / © 2021 IBM Corporation [DQT21]
  • 34. Class Overlap KDD Tutorial / © 2021 IBM Corporation [XHJ10]  The objective of Support Vector Data Description (SVDD) is to find a sphere or domain with minimum volume containing all or most of the data.  The data dropped in both two spheres can be thought as the overlapping data which is close to or overlaps with each other.
  • 35. Data Quality Metrics Today, we will cover the following topics: Classification specific metrics  Label Noise  Class Overlap  Outlier Detection Regression specific metrics  Metrics Overview  Outlier  Class Imbalance Metric Sequencing KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 36. Outlier Detection KDD Tutorial / © 2021 IBM Corporation Source: All images are taken from google images,  According to Hawkins, "An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism". Why is it useful for ML pipeline?  Machine learning algorithms are sensitive to the range and distribution of attribute values.  Data outliers can spoil and mislead the training process resulting in longer training times, less accurate models and ultimately poorer results. Outliers – All the circles with + sign Other examples
  • 37. Outlier Detection KDD Tutorial / © 2021 IBM Corporation Source: https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561 Most common causes of outliers on a data set:  Data entry errors (human errors)  Measurement errors (instrument errors)  Experimental errors (data extraction or experiment planning/executing errors)  Intentional (dummy outliers made to test detection methods)  Data processing errors (data manipulation or data set unintended mutations)  Sampling errors (extracting or mixing data from wrong or various sources)  Natural (not an error, novelties in data)
  • 38. Outlier Detection KDD Tutorial / © 2021 IBM Corporation [OT21]
  • 39. Outlier Detection KDD Tutorial / © 2021 IBM Corporation [OT21]
  • 40. Outlier Detection KDD Tutorial / © 2021 IBM Corporation From Precision Point of View- CBLOF (.466)> IFOREST (.416)> COPOD (.404)> HBOS (.376) > LODA (.337) From Recall Point of View- LODA (.382) > CBLOF (.388)> HBOS (.322)> IFOREST (.316)> COPOD (.274) [OT21]
  • 41. Outlier Detection KDD Tutorial / © 2021 IBM Corporation From RF Classifier Point LODA (8/14) > HBOS (7/14) > COPOD (5/14) > CBLOF (3/14)= IFOREST (3/14) From DT Classifier Point CBLOF (8/14) > LODA (7/14) = HBOS (7/14) > COPOD (5/14) = IFOREST (5/14) From Time Point of View - HBOS (14 sec) < LODA (70 sec) < COPOD (312 sec) < CBLOF (2076) < IFOREST (3209) [OT21]
  • 42. Outlier Detection (HBOS VS LODA) KDD Tutorial / © 2021 IBM Corporation  Histogram Based Outlier Score (HBOS) is a fast unsupervised, statistical and non- parametric method.  Assumption - all the features are independent.  In case of categorical data, simple counting is used while for numerical values, static or dynamic bins are made.  The height of each bin represents the density estimation. To ensure an equal weight of each feature, the histograms are normalized [0-1].  In HBOS, outlier score is calculated for each single feature of the dataset. These calculated values are inverted such that outliers have a high HBOS and inliers have a low score.  Lightweight On- line Detector of Anomalies (LODA) is particularly useful when huge data is processed in real time.  It is not only fast and accurate but also able to operate and update itself on data with missing variables.  It can identify features in which the given sample deviates from the majority, which basically finds out the cause of anomaly.  It constructs an ensemble of T one- dimensional histogram density estimators.  LODA is a collection of weak classifiers can result in a strong classifier. [OT21]
  • 43. Data Quality Metrics Today, we will cover the following topics: Classification specific metrics  Label Noise  Class Overlap  Outlier Detection Regression specific metrics  Metrics Overview  Outlier  Class Imbalance Metric Sequencing KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 44. Regression Task – Overview of Metrics KDD Tutorial / © 2021 IBM Corporation [DQR18] Data quality issues from regression task –  Missing values: refers when one variable or attribute does not contain any value. The missing values occur when the source of data has a problem, e.g., sensor faults, faulty measurements, data transfer problems or incomplete surveys.  Outlier: can be an observation univariate or multivariate. An observation is denominated an outlier when it deviates markedly from other observations, in other words, when the observation appears to be inconsistent respect to the remainder of observations.  High dimensionality: is referred to when dataset contains a large number of features. In this case, the regression model tends to overfit, decreasing its performance.  Redundancy: represents duplicate instances in data sets which might detrimentally affect the performance of classifiers.  Noise: defined as irrelevant or meaningless data. The data noisy reduce the predictive ability in a regression model.
  • 45. KDD Tutorial / © 2021 IBM Corporation [DQR18] Regression Task – Overview of Metrics Data cleaning task from regression task –  Imputation: replaces missing data with substituted values. Four relevant approaches to imputing missing values:  Deletion: excludes instances if any value is missing.  Hot deck: missing items are replaced by using values from the same dataset.  Imputation based on missing attribute: assigns a representative value to a missing one based on measures of central tendency (e.g., mean, median, mode, trimmed mean).  Imputation based on non-missing attributes: missing attributes are treated as dependent variables, and a regression or classification model is performed to impute missing values.  Outlier detection: identifies candidate outliers through approaches based on Clustering (e.g., DBSCAN: Density-based spatial clustering of applications with noise) or Distance (e.g., LOF: Local Outlier Factor).
  • 46. KDD Tutorial / © 2021 IBM Corporation [DQR18] Regression Task – Overview of Metrics Data cleaning task from regression task –  Dimensionality reduction: reduces the number of attributes finding useful features to represent the dataset. A subset of features is selected for the learning process of the regression model. The best subset of relevant features is the one with least number of dimensions that most contribute to learning accuracy. Dimensionality reduction can take on four approaches:  Filter: selects features based on discriminating criteria that are relatively independent of the regression (e.g., correlation coefficients).  Wrapper: based on the performance of regression models (e.g., error measures) are maintained or discarded features in each iteration.  Embedded: the features are selected when the regression model is trained. The embedded methods try to reduce the computation time of the wrapper methods.  Projection: looks for a projection of the original space to space with orthogonal dimensions (e.g., principal component analysis).  Remove duplicate instances: identifies and removes duplicate instances.
  • 47. [DQR18] Regression Task – Overview of Metrics
  • 48. Data Quality Metrics Today, we will cover the following topics: Classification specific metrics  Label Noise  Class Overlap  Outlier Detection Regression specific metrics  Metrics Overview  Outlier  Class Imbalance Metric Sequencing KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 49. Source Outliers - Regression Task 1.There is one outlier far from the other points, though it only appears to slightly influence the line. 2.There is one outlier on the right, though it is quite close to the least squares line, which suggests it wasn’t very influential. 3.There is one point far away from the cloud, and this outlier appears to pull the least squares line up on the right; examine how the line around the primary cloud doesn’t appear to fit very well. 4.There is a primary cloud and then a small secondary cloud of four outliers. The secondary cloud appears to be influencing the line somewhat strongly, making the least square line fit poorly almost everywhere. There might be an interesting explanation for the dual clouds, which is something that could be investigated. 5.There is no obvious trend in the main cloud of points and the outlier on the right appears to largely control the slope of the least squares line. 6.There is one outlier far from the cloud, however, it falls quite close to the least squares line and does not appear to be very influential.
  • 50. Data Quality Metrics Today, we will cover the following topics: Classification specific metrics  Label Noise  Class Overlap  Outlier Detection Regression specific metrics  Metrics Overview  Outlier  Class Imbalance Metric Sequencing KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 51. Class Imbalance KDD Tutorial / © 2021 IBM Corporation Unequal distribution of classes within a dataset Source: https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59
  • 52. Class Imbalance KDD Tutorial / © 2021 IBM Corporation Accuracy Paradox 90% Majority 10% Minority Learning Algorithm Always Predict Accuracy = 90% Reasonable ? F1 Score, Recall, AUC PR, AUC ROC, G-Mean…. Source: https://en.wikipedia.org/wiki/Accuracy_paradox
  • 53. Class Imbalance KDD Tutorial / © 2021 IBM Corporation Imbalance Ratio Learning algorithm biased towards majority class Minority sample considered as the noise In confusion region, priorities given to majority Increase Increase   Is the imbalance ratio only cause of performance degradation in learning from imbalanced data? NO  Source: . https://towardsdatascience.com/sampling-techniques-for-extremely-imbalanced-data-281cc01da0a8
  • 54. Class Imbalance – for Regression KDD Tutorial / © 2021 IBM Corporation [SM13]  Several real-world prediction problems involve forecasting rare values of a target variable.  When this variable is nominal, we have a problem of class imbalance that was already studied thoroughly within machine learning (classification task).  For regression tasks, where the target variable is continuous, few works exist addressing this type of problem. Still, important application areas involve forecasting rare extreme values of a continuous target variable. Problem Statement  Predicting rare extreme values of a continuous variable is a particular class of regression problems.  In this context, given a training sample of the problem, D = {hx, yi}N i=1, our goal is to obtain a model that approximates the unknown regression function y = f(x).  The particularity of our target tasks is that the goal is the predictive accuracy on a particular subset of the domain of the target variable Y - the rare and extreme values.
  • 55. Class Imbalance – for Regression KDD Tutorial / © 2021 IBM Corporation [SM13]  Smote is a sampling method to address classification problems with imbalanced class distribution.  The key feature of this method is that it combines under-sampling of the frequent classes with over- sampling of the minority class.  A variant of Smote for addressing regression tasks where the key goal is to accurately predict rare extreme values, which we will name SmoteR.  There are three key components of the Smote algorithm that need to be address in order to adapt it for our target regression tasks:  how to define which are the relevant observations and the ”normal” cases  how to create new synthetic examples (i.e. over-sampling)  how to decide the target variable value of these new synthetic examples
  • 56. Class Imbalance – for Regression KDD Tutorial / © 2021 IBM Corporation [SM13]  There are three key components of the Smote algorithm that need to be address in order to adapt it for our target regression tasks:  how to define which are the relevant observations and the ”normal” cases  original algorithm is based on the information provided by the user concerning which class value is the target/rare class (usually known as the minority or positive class).  in regression problem, infinite number of values of the target variable are possible.  Solution - relevance function on a user-specified threshold on the relevance values, that leads to the definition of the set Dr. Algorithm will over-sample the observations in Dr and under-sample the remaining cases (Di), thus leading to a new training set with a more balanced distribution of the values.
  • 57. Class Imbalance – for Regression KDD Tutorial / © 2021 IBM Corporation [SM13]  how to create new synthetic examples (i.e. over-sampling)  Regards the second key component, the generation of new cases, same approach as in the original SMOTE algorithm with some small modifications for being able to handle both numeric and nominal attributes.
  • 58. Class Imbalance – for Regression KDD Tutorial / © 2021 IBM Corporation [SM13]  how to decide the target variable value of these new synthetic examples  In the original algorithm this is a trivial question, because as all rare cases have the same class (the target minority class), the same will happen to the examples generated from this set.  In our case the answer is not so trivial. The cases that are to be over-sampled do not have the same target variable value, although they do have a high relevance score (φ(y)). This means that when a pair of examples is used to generate a new synthetic case, they will not have the same target variable value.  Proposed is to use a weighed average of the target variable values of the two seed examples. The weights are calculated as an inverse function of the distance of the generated case to each of the two seed examples.
  • 59. Data Quality Metrics Today, we will cover the following topics: Classification specific metrics  Label Noise  Class Overlap  Outlier Detection Regression specific metrics  Metrics Overview  Outlier  Class Imbalance Metric Sequencing KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 60. Ordering/Sequence of data quality metrics to achieve better performance KDD Tutorial / © 2021 IBM Corporation While there are several data quality issues that need to be addressed, fixing these in an arbitrary order is shown to yield sub-optimal results. as the search space of all possible sequences is combinatorically large, an important challenge is to find the best sequence with respect to underlying task. For example, authors in [FF12] suggest that correcting the data for missing values via missing value imputation can affect outliers in the dataset. This raises some interesting questions: what should be the guiding factors by which such a sequence can be chosen, how can we quantitatively measure these factors is it possible to find an optimal sequence that allows us to maximize the ML classifier performance
  • 61. Ordering/Sequence of data quality metrics to achieve better performance KDD Tutorial / © 2021 IBM Corporation  Learn2Clean, a method based on Q- Learning, a model-free reinforcement learning technique  For a given dataset, it selects a ML model, and a quality performance metric, the optimal sequence of tasks for pre-processing the data such that the quality of the ML model result is maximized.  More intuitively, the problem that this work address is the following: Given a dataset as input D, a ML pipeline θ to apply to the input dataset, a quality performance metric q, and the space of all possible data preparation and cleaning strategies Φ(D): Find the dataset D ′ in Φ(D) that maximizes the quality metric q [FF12]
  • 62. Ordering/Sequence of data quality metrics to achieve better performance KDD Tutorial / © 2021 IBM Corporation [FF12]
  • 63. Summary  Addressing data quality issues before it enters an ML pipeline allows taking remedial actions to reduce model building efforts and turn-around times.  Measuring the data quality in a systematic and objective manner, through standardised metrics for example, can improve its reliability and permit informed reuse.  There are distinct metrics to quantify the data issues and anomalies based on whether the data is structured or unstructured.  Various data workers or personas are involved in the data quality assessment pipeline depending on the level and type of human intervention required.  Ordering of the metrics can serve as a powerful framework to integrate and optimize the data quality assessment process. KDD Tutorial / © 2021 IBM Corporation
  • 64. Data Quality Metrics for Spatio-Temporal Data KDD Tutorial / © 2021 IBM Corporation
  • 65. Data Quality for Spatio-temporal Outline  Spatio-temporal data in data science  Data types of spatio-temopral  Challenges for quality of spatio-temporal data  Quality analysis for spatio-temporal data  Detecting spatio-temporal outliers KDD Tutorial / © 2021 IBM Corporation Source: http://zed.uchicago.edu/crime.html
  • 66. Domains that use spatio-temporal data  Geostatistics  Statistics focusing on spatial or spatiotemporal datasets of geospatial information.  E.g. petroleum geology, hydrology, meteorology  Spatial econometrics  Field where spatial analysis and econometrics intersect.  Model examples: spatial auto-correlation, neighbourhood effects KDD Tutorial / © 2021 IBM Corporation Sources: http://geologylearn.blogspot.com/p/petroleum.html http://uncglobalhydrology.org/ https://waterprogramming.wordpress.com/2020/07/07/spatial-statistic-part-1-spatial-autocorrelation/
  • 67. Spatio-temporal data in data science Applications of spatio-temporal (ST) data:  Retail sales analysis  Climate science  Transportation  Criminology  etc. KDD Tutorial / © 2021 IBM Corporation Source: https://github.com/danielLinke/CitiBike_NYC Example: Bike-sharing analysis; the heat map shows the utilization of each bike station
  • 68. Spatio-temporal data in data science  About 70% of ST-data tasks in data science address prediction and detection with deep learning models. [WCY20]  Example tasks:  Demand prediction  Flood forecasting KDD Tutorial / © 2021 IBM Corporation Fig. 14. Distributions of the STDM problems addressed by deep learning models. Source: S. Wang, J. Cao, and P. Yu, “Deep learning for spatio-temporal data mining: A survey,“ IEEE Transactions on Knowledge and Data Engineering, 2020.
  • 69. Spatio-temporal data types 1. Event data – Discrete events occurring at point locations and times. E.g., incidences of crime events in a city. 2. Trajectory data – Paths traced by bodies moving in space over time. E.g., the patrol route of a police surveillance car. 3. Point reference data – Measurements of a continuous ST field such as temperature, vegetation, or population over a set of moving reference points in space and time. 4. Raster data – Measurements of a continuous or discrete ST field recorded at fixed locations in space and at fixed time points. E.g., population density in a geographic information system. KDD Tutorial / © 2021 IBM Corporation [SJA15, WCY20]
  • 70. Challenges for spatio-temporal data quality  The quality of ST data largely determines the accuracy of prediction and detection.  Quality aspects of ST data include the spatial dimension, the temporal dimension, and their combination.  A key challenge for ST data quality is outlier detection. KDD Tutorial / © 2021 IBM Corporation Spatio-temporal data distribution Source: Ferreira, L.N., Vega-Oliveros, D.A., Cotacallapa, M. et al. Spatiotemporal data analysis with chronological networks. Nat Commun 11, 4036 (2020).
  • 71. Quality analysis for spatio-temporal data - Detecting spatio-temporal outliers  A typical framework for ST outlier detection first finds spatial outliers and then verifies them as temporal outliers in a second step. KDD Tutorial / © 2021 IBM Corporation ST outlier detection framework Source: M. Gupta, J. Gao, C. C. Aggarwal, and J. Han, “Outlier detection for temporal data: A survey," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 9, pp. 2250-2267, 2014.
  • 72. Classical spatial outlier detection - Local Moran’s I [Anselin95]  Spatial autocorrelation  Identifies influential observations (e.g. hot spots, cold spots) and outliers among the areas.  Moran’s I (shown here in its global form; the local statistic Iᵢ decomposes it per spatial unit): I = (N/W) · [Σᵢ Σⱼ wᵢⱼ (xᵢ − x̄)(xⱼ − x̄)] / [Σᵢ (xᵢ − x̄)²] where N is the number of spatial units indexed by i and j; x is the variable of interest; x̄ is the mean of x; wᵢⱼ is a matrix of spatial weights with zeroes on the diagonal (i.e., wᵢᵢ = 0); and W is the sum of all wᵢⱼ. KDD Tutorial / © 2021 IBM Corporation Example: a 3×3 grid of spatial values (3 2 0 / 5 8 1 / 2 4 1), units indexed (1) to (9), binary contiguity weights (w₁₂ = 1, w₁₃ = 0, ...), and the corresponding local Moran quadrants.
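A small numpy sketch of the computation on the 3×3 example grid above, using rook-contiguity binary weights; production analyses would typically use a tested library such as PySAL/esda rather than this hand-rolled version.

```python
import numpy as np

grid = np.array([[3, 2, 0],
                 [5, 8, 1],
                 [2, 4, 1]], dtype=float)
x = grid.ravel()
N = x.size

# Rook contiguity: cells sharing an edge are neighbours (w_ij = 1, w_ii = 0).
W = np.zeros((N, N))
for i in range(3):
    for j in range(3):
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            ni, nj = i + di, j + dj
            if 0 <= ni < 3 and 0 <= nj < 3:
                W[i * 3 + j, ni * 3 + nj] = 1.0

z = x - x.mean()
m2 = (z ** 2).sum() / N
I_local = (z / m2) * (W @ z)                             # one local statistic per cell
I_global = (N / W.sum()) * (z @ W @ z) / (z ** 2).sum()  # global form from the slide

print("local Moran's I per cell:", I_local.round(2))
print("global Moran's I:", round(float(I_global), 3))
# Strongly negative local values flag spatial outliers: cells (like the 8)
# that differ sharply from their neighbours.
```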
  • 73. Classical temporal outlier detection based on the ARMA model  Autoregressive moving average (ARMA) model  A model for time series data that combines the AR and MA models: Xₜ = c + εₜ + Σ_{i=1..p} φᵢ Xₜ₋ᵢ + Σ_{i=1..q} θᵢ εₜ₋ᵢ  Outlier detection based on an ARMA model  Fit an ARMA model to the real data  Flag as outliers the points where the model’s prediction and the real data differ the most  ARIMA: Integrated ARMA  SARIMA: Seasonal ARIMA KDD Tutorial / © 2021 IBM Corporation Source: https://www.kumilog.net/entry/sarima-pv
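A minimal residual-based sketch using statsmodels: fit an ARMA(1,1) model (expressed as ARIMA(1,0,1)) to a simulated AR(1) series with injected spikes, then flag points whose residual exceeds three standard deviations. The 3-sigma threshold and the simulated series are illustrative choices.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
n = 200
ts = np.zeros(n)
for t in range(1, n):                        # simulate an AR(1) process
    ts[t] = 0.7 * ts[t - 1] + rng.normal()
ts[[50, 120]] += 8.0                         # inject two additive outliers

result = ARIMA(ts, order=(1, 0, 1)).fit()    # ARMA(1,1) as ARIMA(1,0,1)
resid = result.resid
outliers = np.where(np.abs(resid) > 3 * resid.std())[0]
print("flagged indices:", outliers)          # expected to include 50 and 120
```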
  • 74. Spatio-temporal Outlier Detection  “Spatio-temporal outlier detection is an extension of spatial outlier detection.” [WLC10]  A spatio-temporal object is represented by a set of instances (o_id, sᵢ, tᵢ), where the spacestamp sᵢ is the location of object o_id at timestamp tᵢ.  Exact-Grid Top-K [WLC10] is a spatio-temporal outlier detection algorithm that is based on:  Spatial Scan Statistic [Kulldorff97]  Exact-Grid [AMP06] KDD Tutorial / © 2021 IBM Corporation A moving region Source: [WLC10] E. Wu, W. Liu, and S. Chawla, “Spatio-temporal outlier detection in precipitation data," in Knowledge Discovery from Sensor Data, Springer Berlin Heidelberg, pp. 115-133, 2010
  • 75. Spatio-temporal Outlier Detection - Spatial Scan Statistics [Kulldorff97]  Identifies clusters of randomly positioned points  Determines hotspots in spatial data  Steps:  Scan the study area with windows of varying position and size  For each window, record the observed and expected numbers of events  Report windows where the observed count significantly exceeds expectation  Drawback: requires a huge amount of computation. KDD Tutorial / © 2021 IBM Corporation Source: Editing from https://rr-asia.oie.int/wp-content/uploads/2020/03/lecture-4_cluster- detection-using-the-spatial-scan-statistic-satscan_20180920-min_vink.pdf
  • 76. Spatio-temporal Outlier Detection - Exact-Grid [AMP06]  A simple exact algorithm for finding the largest discrepancy region in a domain.  Reduces the running time from O(n⁵) to O(n⁴).  Also gives approximation algorithms for a large class of discrepancy functions (including the Kulldorff scan statistic). KDD Tutorial / © 2021 IBM Corporation Figure 1: Example of maximal discrepancy range on a data set (Xs are measured data and Os are baseline data). Figure 2: Sweep lines, contours, and arcs. [AMP06] Deepak Agarwal, Andrew McGregor, Jeff M. Phillips, et al. "Spatial scan statistics: approximations and performance study." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006.
  • 77. Spatio-temporal Outlier Detection - Exact-Grid Top-k [WLC10]  Extends the Exact-Grid and Approx-Grid algorithms to handle the overlap problem, finding the top-k outliers in a spatial grid for each time period.  Finding the top-k outliers:  Enumerate every possible region size and shape in the grid  Compute each region’s discrepancy value to determine which is the more significant outlier  Keep track of the top-k regions rather than just the top-1 KDD Tutorial / © 2021 IBM Corporation Overlap problem (left/right/top/bottom); finding Top-k algorithm Source: https://slideplayer.com/slide/4806555/ [WLC10] E. Wu, W. Liu, and S. Chawla, “Spatio-temporal outlier detection in precipitation data," in Knowledge Discovery from Sensor Data, Springer Berlin Heidelberg, pp. 115-133, 2010
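The sketch below illustrates the core of the grid-based scan on a toy 8×8 grid: enumerate axis-aligned rectangles, score each with a Kulldorff-style Poisson likelihood-ratio discrepancy, and keep the k best with a heap. It is a brute-force stand-in; the real Exact-Grid algorithm uses sweep lines to reach O(n⁴), and Exact-Grid Top-K additionally handles overlap between reported regions, which this sketch omits.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
g = 8
b = np.full((g, g), 5.0)                       # baseline (expected) counts
c = rng.poisson(5.0, size=(g, g)).astype(float)
c[2:4, 5:7] += 15                              # inject a hot-spot region
C, B = c.sum(), b.sum()

def discrepancy(cr, br):
    """Kulldorff-style Poisson likelihood ratio for an over-dense region."""
    E = C * br / B                             # expected count under the null
    if cr <= E:
        return 0.0
    return cr * np.log(cr / E) + (C - cr) * np.log((C - cr) / (C - E))

k, top = 5, []                                 # min-heap of the k best regions
for i1 in range(g):
    for i2 in range(i1, g):
        for j1 in range(g):
            for j2 in range(j1, g):
                cr = c[i1:i2 + 1, j1:j2 + 1].sum()
                br = b[i1:i2 + 1, j1:j2 + 1].sum()
                heapq.heappush(top, (discrepancy(cr, br), (i1, i2, j1, j2)))
                if len(top) > k:
                    heapq.heappop(top)         # evict the current weakest region

for d, (i1, i2, j1, j2) in sorted(top, reverse=True):
    print(f"discrepancy={d:7.2f}  rows {i1}-{i2}, cols {j1}-{j2}")
```

The injected hot-spot rectangle (rows 2-3, columns 5-6) and regions overlapping it should dominate the printed top-k list.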
  • 78. Summary of Data Quality for Spatio-temporal Data  Spatio-temporal data in data science  Retail sales analysis, transportation  Data types of spatio-temporal data  Event, trajectory, point reference, and raster data  Challenges for quality of spatio-temporal data  Outliers in ST data  Quality analysis for spatio-temporal data  Spatial Scan Statistics [Kulldorff97], Exact-Grid [AMP06, WLC10] KDD Tutorial / © 2021 IBM Corporation
  • 79. Data Quality Metrics for Unstructured Text Data KDD Tutorial / © 2021 IBM Corporation
  • 80. Data Quality Metrics KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/ We covered the following topics in the KDD 2020 tutorial:  Metrics for Generic Text Quality  Metrics for Corpus Filtering  Metrics for Text Classification
  • 81. Outline KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/  Motivation  What is Unstructured Text?  How to Assess Text Quality?  Metrics for Text Quality  Text Quality Metrics for Dataset Valuation  Text Quality Metrics for Outlier Detection  Text Quality Metrics for Class Imbalance  Text Quality Metrics for Dataset Complexity  Future Directions  Next Steps
  • 82. What is Unstructured Text? KDD Tutorial / © 2021 IBM Corporation Html/Xml Documents Tweets Ticket Data Email Product Reviews
  • 83. How to Assess Quality of Text? Different properties of text can be measured to quantify its quality!  Lexical Properties - Vocabulary, Misspellings etc.  Syntactic Properties - Grammatical Relations, Syntax Violations etc.  Semantic Properties - Text Meaningfulness, Text Complexity etc.  Discourse Properties - Readability, Text Coherence etc.  Distributional Properties - Outliers, Topics, Embeddings KDD Tutorial / © 2021 IBM Corporation
  • 84. How to Assess Quality of Text? Different Approaches can be further broadly classified as  Generic Text quality  Task specific KDD Tutorial / © 2021 IBM Corporation Generic Text Quality Approaches:  Text Readability  Text Coherence  Text Formality  Text Ambiguity  Text Outliers Task/Dataset Specific Approaches:  Text Classification  Text Generation  Semantic Similarity  Dialog Systems  Label Noise  Lexical Diversity  Text Cleanliness  Text Appropriateness  Text Complexity  Text Bias  Dataset Valuation  Dataset Complexity
  • 85. Outline KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/  Motivation  What is Unstructured Text?  How to Assess Text Quality?  Metrics for Text Quality  Text Quality Metrics for Dataset Valuation  Text Quality Metrics for Outlier Detection  Text Quality Metrics for Class Imbalance  Text Quality Metrics for Dataset Complexity  Future Directions  Next Steps
  • 86. Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics Objective: Leverage the behavior of the ML model on individual instances during training to generate a data map that illustrates easy-to-learn, hard-to-learn and ambiguous samples in the dataset. Approach: 1. Measure the confidence, correctness and variability of the model during training epochs. 2. Model confidence is measured as the mean probability of the true label across training epochs. 3. Model correctness is measured as the fraction of times the model correctly predicts the label. 4. Model variability is measured as the standard deviation of the predicted probabilities of the true label across training epochs. Task: Compare the performance of various baselines generated by selecting subsets of the dataset and training the RoBERTa-large model. KDD Tutorial / © 2021 IBM Corporation Source: Swayamdipta, Swabha, et al. 2020, Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics [SSL+20]
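The three statistics are simple to compute once per-epoch predictions are logged. A sketch, assuming `probs` holds the model's probability of the true label for each sample at each epoch (here filled with random values just to exercise the code); the 0.5/0.2 region thresholds are illustrative, not the paper's exact boundaries.

```python
import numpy as np

rng = np.random.default_rng(0)
n_epochs, n_samples = 6, 5
# p_model(true label | x) per epoch and sample (random stand-in values)
probs = rng.uniform(0.05, 0.95, size=(n_epochs, n_samples))

confidence  = probs.mean(axis=0)          # mean true-label probability across epochs
variability = probs.std(axis=0)           # std of true-label probability across epochs
# Binary-case stand-in for "predicted label == true label" at each epoch:
correctness = (probs > 0.5).mean(axis=0)  # fraction of epochs predicted correctly

for i in range(n_samples):
    region = ("easy-to-learn" if confidence[i] > 0.5 and variability[i] < 0.2 else
              "hard-to-learn" if confidence[i] <= 0.5 and variability[i] < 0.2 else
              "ambiguous")
    print(f"sample {i}: confidence={confidence[i]:.2f} "
          f"variability={variability[i]:.2f} correctness={correctness[i]:.2f} -> {region}")
```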
  • 87. KDD Tutorial / © 2021 IBM Corporation Source: Swayamdipta, Swabha, et al. 2020, Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics Data Map and Insights:  The data follows a bell-shaped curve with respect to confidence and variability; correctness determines discrete regions  Instances in the high-confidence, low-variability region of the map (top-left) are easy-to-learn  Instances with low variability and low confidence (bottom-left) are hard-to-learn  Instances with high variability (right) are ambiguous  Hard-to-learn and ambiguous instances are the most informative for learning  Some amount of easy-to-learn instances is also necessary for successful optimization [SSL+20]
  • 88. Data Valuation Using Reinforcement Learning Objective: Determine a reward for each sample by quantifying the performance of the predictor model on a small validation set, and use it as a reinforcement signal to learn the likelihood of the sample being used in training the predictor model. Approach: 1. Perform end-to-end training of (i) the target task predictor model and (ii) the data value estimator model 2. The data value estimator generates selection probabilities for a mini-batch of training samples 3. The predictor model trains on the mini-batch and the loss is computed against the validation set 4. Predictor model parameters are updated through back-propagation 5. Data value estimator parameters are updated using the REINFORCE approach Task: 1. Compare the DVRL framework against standard baselines on standard datasets from different domains KDD Tutorial / © 2021 IBM Corporation Source: Yoon, Jinsung et. al 2020, Data valuation using reinforcement learning [YAP20]
  • 89. Data Valuation Using Reinforcement Learning KDD Tutorial / © 2021 IBM Corporation Insights:  Using only 60%-70% of the training set (the highest-valued samples), DVRL can obtain performance similar to training on the entire dataset.  The framework also outperforms baselines in the presence of noisy labels and is able to detect noisy labels in the dataset by assigning them low scores.  Instead of growing exponentially with the dataset size, the computational overhead of the DVRL framework is only about twice that of conventional training. Source: Yoon, Jinsung et. al 2020, Data valuation using reinforcement learning [YAP20]
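A heavily simplified, single-parameter-vector sketch of the DVRL loop in numpy: a logistic value estimator scores (features, label) pairs, a mini-batch is sampled according to those scores, a predictor is fit on it, and clean-validation accuracy drives a REINFORCE update with a moving-average baseline. The real DVRL uses neural networks for both models and trains them jointly; the label-signed interaction features below are an assumption of this sketch, added so a linear estimator can notice label/feature mismatch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
y_noisy = y.copy()
flip = rng.choice(600, 60, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]                     # inject 10% label noise

X_tr, X_val, y_tr, y_val, yn_tr, yn_val = train_test_split(
    X, y, y_noisy, test_size=0.3, random_state=0)

# Value-estimator features: raw x plus label-signed x (sketch assumption).
feat = np.column_stack([X_tr, (2 * yn_tr - 1)[:, None] * X_tr])
w, baseline = np.zeros(feat.shape[1]), 0.5

for step in range(150):
    p = 1 / (1 + np.exp(-(feat @ w)))                 # selection probabilities
    s = rng.random(len(p)) < p                        # sampled selection mask
    if s.sum() < 20 or len(np.unique(yn_tr[s])) < 2:
        continue
    clf = LogisticRegression(max_iter=300).fit(X_tr[s], yn_tr[s])
    acc = clf.score(X_val, y_val)                     # reward: clean validation accuracy
    grad = ((s - p)[:, None] * feat).mean(axis=0)     # grad of log selection prob
    w += 2.0 * (acc - baseline) * grad                # REINFORCE update
    baseline = 0.9 * baseline + 0.1 * acc             # moving-average baseline

values = 1 / (1 + np.exp(-(feat @ w)))
print("mean value, clean-label samples:", values[yn_tr == y_tr].mean().round(3))
print("mean value, noisy-label samples:", values[yn_tr != y_tr].mean().round(3))
```

Noisy-label samples should tend to receive lower estimated values, mirroring the paper's noisy-label detection result, though this toy version makes no guarantee.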
  • 90. Outline KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/  Motivation  What is Unstructured Text?  How to Assess Text Quality?  Metrics for Text Quality  Text Quality Metrics for Dataset Valuation  Text Quality Metrics for Outlier Detection  Text Quality Metrics for Class Imbalance  Text Quality Metrics for Dataset Complexity  Future Directions  Next Steps
  • 91. Outlier Detection An outlier in tabular or timeseries data can be interpreted as a  Deviating point  Noise  Rarely occurring point What can be an outlier in text data?  A topically diverse sample?  Gibberish text?  Meaningless sentences?  Samples from another language? KDD Tutorial / © 2021 IBM Corporation Image Credits: Google Images Figure 1 Figure 2 Figure 3
  • 92. Outlier Examples in Text Data  Repetitive: greatest show ever mad full stop greatest show ever mad full stop greatest show ever mad full stop greatest show ever mad full stop  Incomprehensible: lived let idea heck bear walk never heard whole years really funny beginning went hill quickly  Incomplete: Suspenseful subtle much much disturbing KDD Tutorial / © 2021 IBM Corporation Outlier detection in text data: classical techniques (matrix factorization) and DL techniques (self-attention for text representation)
  • 93. Outlier Detection for Text Data  Feature Generation – Simple Bag of Words approach: represent the corpus as a term-document matrix A (terms × documents)  Apply Matrix Factorization – Decompose the given term matrix into a low-rank matrix L and an outlier matrix Z: A = L + Z  Further, L can be expressed as L = WH, where W is a term-topic matrix and H a topic-document matrix, so every document is approximated as a linear combination of r topics; documents that cannot be represented this way are captured by non-zero entries in Z  The l2 norm score of a particular column zₓ of Z serves as the outlier score for document x KDD Tutorial / © 2021 IBM Corporation Source: Kannan et al, 2017. Outlier detection for text data [KWAP17]
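A hedged stand-in for this pipeline using scikit-learn: approximate the term-document matrix with a plain NMF and treat each document's reconstruction residual as its column of Z, scoring documents by the residual l2 norm. The paper instead optimizes W, H and Z jointly with an L1,2 penalty via block coordinate descent, so this is only an approximation of the idea.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the movie was great and the acting was great",
    "a wonderful film with wonderful performances",
    "great direction great story great cast",
    "qzx blorp vlem snark gibberish tokens here",   # likely outlier
]
A = CountVectorizer().fit_transform(docs).T.toarray()  # terms x documents
nmf = NMF(n_components=2, random_state=0, max_iter=500)
W = nmf.fit_transform(A)                               # terms x topics
H = nmf.components_                                    # topics x documents
Z = A - W @ H                                          # residual stands in for the outlier matrix
scores = np.linalg.norm(Z, axis=0)                     # l2 norm per document column
print({i: round(s, 2) for i, s in enumerate(scores)})  # highest score flags the gibberish doc
```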
  • 94. Outlier Detection for Text Data  For the experiments, outliers are sampled from a unique class of standard datasets  Receiver operating characteristic (ROC) curves are studied to assess the performance of the proposed approach  The approach is useful at identifying outliers even from regular classes  Patterns such as unusually short/long documents, unique words, repetitive vocabulary etc. were observed in the detected outliers KDD Tutorial / © 2021 IBM Corporation Source: Kannan et al, 2017. Outlier detection for text data [KWAP17]
  • 95. Unsupervised Anomaly Detection on Text  Proposes a novel one-class classification method which leverages pretrained word embeddings to perform anomaly detection on text  Multi-Head Self-Attention: given a word embedding matrix H of a sentence, an attention matrix is computed from H (Eq. 1 in the paper) and multiplied with H (Eq. 2) to map sentences of variable length to a collection of fixed-size sentence representations, one per attention head, representing the sentence with respect to multiple contexts KDD Tutorial / © 2021 IBM Corporation Source: Ruff et al 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text [RZV+19]
  • 96. Unsupervised Anomaly Detection on Text KDD Tutorial / © 2021 IBM Corporation  The sentence representations are trained jointly with a collection of context vectors (the CVDD objective), such that the context vectors and representations are similar while the context vectors themselves are kept diverse via an orthogonality constraint  Outlier scoring: the greater the distance of mₖ(H) from cₖ, the more anomalous the sample is with respect to context k; the per-context distances are aggregated into a single score Source: Ruff et al 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text [RZV+19]
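The scoring step reduces to a few lines once the representations are available. A sketch with random stand-ins for the context-wise sentence representations m_k(H) and the learned context vectors c_k; the paper's exact distance may include an extra scaling factor, and a weighted aggregation over contexts is also possible.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 8
m = rng.normal(size=(K, d))          # stand-ins for sentence representations m_k(H)
c = rng.normal(size=(K, d))          # stand-ins for learned context vectors c_k

def cos_dist(a, b):
    """Cosine distance between two vectors."""
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

per_context = np.array([cos_dist(m[k], c[k]) for k in range(K)])
score = per_context.mean()           # uniform weights over the K contexts
print("anomaly score s(H):", round(float(score), 3))
```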
  • 97. Outline KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/  Motivation  What is Unstructured Text?  How to Assess Text Quality?  Metrics for Text Quality  Text Quality Metrics for Dataset Valuation  Text Quality Metrics for Outlier Detection  Text Quality Metrics for Class Imbalance  Text Quality Metrics for Dataset Complexity  Future Directions  Next Steps
  • 98. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks Objective: Utilize simple text operators to perform data augmentation and boost the performance of ML models on text classification tasks, specifically for small datasets. Approach: 1. Four specific text operators are discussed: (i) synonym replacement, (ii) random insertion, (iii) random swap, and (iv) random deletion 2. For a given sentence in the training set, one of the operations is performed at random. 3. The number of words changed, n, is based on the sentence length l with the formula n = αl. 4. For each original sentence, n_aug augmented sentences are generated. Task: 1. Compare EDA on five NLP tasks with CNNs and RNNs KDD Tutorial / © 2021 IBM Corporation Wei, Jason, and Kai Zou, 2019, EDA: Easy data augmentation techniques for boosting performance on text classification tasks. [WZ19]
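A compact sketch of the four operators, applying one randomly chosen operation per call; the tiny synonym table is a stand-in for WordNet, which EDA actually uses for synonym replacement and insertion.

```python
import random

SYNONYMS = {"great": ["excellent", "terrific"], "movie": ["film"]}  # toy lexicon

def eda(sentence, alpha=0.1):
    words = sentence.split()
    n = max(1, int(alpha * len(words)))      # n = alpha * l, at least 1
    op = random.choice(["sr", "ri", "rs", "rd"])
    if op == "sr":                           # synonym replacement
        for _ in range(n):
            idx = [i for i, w in enumerate(words) if w in SYNONYMS]
            if idx:
                i = random.choice(idx)
                words[i] = random.choice(SYNONYMS[words[i]])
    elif op == "ri":                         # random insertion of a synonym
        for _ in range(n):
            pool = [s for w in words for s in SYNONYMS.get(w, [])]
            if pool:
                words.insert(random.randrange(len(words) + 1), random.choice(pool))
    elif op == "rs":                         # random swap of two positions
        for _ in range(n):
            i, j = random.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
    else:                                    # random deletion, keep word w.p. 1 - alpha
        words = [w for w in words if random.random() > alpha] or [random.choice(words)]
    return " ".join(words)

random.seed(0)
print([eda("this great movie has a great story") for _ in range(4)])  # n_aug = 4
```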
  • 99. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks KDD Tutorial / © 2021 IBM Corporation Insights:  An average improvement of 0.8% for full datasets and 3.0% for N_train = 500 is observed  For training set fractions of {1, 5, 10, 20, 30, 40}%, there is consistent and significant improvement across all datasets and tasks  It is empirically shown that EDA conserves the labels of the original sentences, by analyzing t-SNE plots of augmented samples  Some limitations are: (i) the performance gain can be marginal when data is sufficient, and (ii) EDA might not yield substantial improvements when using pre-trained models Wei, Jason, and Kai Zou, 2019, EDA: Easy data augmentation techniques for boosting performance on text classification tasks. [WZ19]
  • 100. Do not Have Enough Data? Deep Learning to the Rescue! Objective: Given a small labelled dataset, perform data augmentation by generating synthetic samples using a pre-trained language model fine-tuned on the dataset. Approach: 1. Fine-tune a pre-trained language model (GPT-2) on the available labelled dataset and use it to synthesize new labelled sentences. 2. Independently, train a classifier on the original dataset and use it to filter the synthesized corpus, discarding synthesized samples with a low classifier confidence score. Task: 1. Compare the LAMBADA framework against state-of-the-art baselines on standard datasets. KDD Tutorial / © 2021 IBM Corporation Source: Anaby-Tavor, Ateret, et al., 2020, Do not have enough data? Deep learning to the rescue!. [ATCG+20]
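A hedged sketch of the generate-then-filter step using Hugging Face transformers and scikit-learn. The label-prefixed prompt format, the tiny training set, and the 0.6 confidence threshold are illustrative assumptions, and the GPT-2 here is not fine-tuned on the labelled data as the paper requires; the sketch only shows where the classifier-confidence filter sits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from transformers import pipeline

# Tiny labelled set (illustrative); the paper fine-tunes GPT-2 on such pairs first.
train_texts  = ["loved every minute of it", "fantastic film",
                "terrible plot", "a waste of time"]
train_labels = ["pos", "pos", "neg", "neg"]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(train_texts, train_labels)

generator = pipeline("text-generation", model="gpt2")
outputs = generator("pos: the movie was", max_new_tokens=15,
                    num_return_sequences=5, do_sample=True)
candidates = [o["generated_text"].split(":", 1)[1].strip() for o in outputs]

# LAMBADA-style filter: keep candidates the classifier assigns the intended
# label with high confidence (threshold is an illustrative choice).
kept = [c for c in candidates
        if clf.predict([c])[0] == "pos" and clf.predict_proba([c]).max() > 0.6]
print(f"kept {len(kept)} of {len(candidates)} synthesized candidates")
```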
  • 101. Do not Have Enough Data? Deep Learning to the Rescue! KDD Tutorial / © 2021 IBM Corporation Insights:  The proposed framework is compared with various classifier models against different baselines on 3 standard datasets.  It is empirically shown that the LAMBADA approach improves performance with up to 50 samples per class, and that it is classifier-agnostic.  Compared with state-of-the-art baselines, LAMBADA consistently performs better on all the datasets with various classifiers.  Experiments also show that LAMBADA can serve as an alternative to semi-supervised techniques when unlabelled data does not exist. Source: Anaby-Tavor, Ateret, et al., 2020, Do not have enough data? Deep learning to the rescue!. [ATCG+20]
  • 102. Outline KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/  Motivation  What is Unstructured Text?  How to Assess Text Quality?  Metrics for Text Quality  Text Quality Metrics for Dataset Valuation  Text Quality Metrics for Outlier Detection  Text Quality Metrics for Class Imbalance  Text Quality Metrics for Dataset Complexity  Future Directions  Next Steps
  • 103. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks Proposes an approach to design a data quality metric which explains the complexity of the given data for a classification task  Considers various data characteristics to generate a 48-dim feature vector for each dataset.  Data characteristics include  Class Diversity: count-based probability distribution of classes in the dataset  Class Imbalance: Σ_{c=1..C} | 1/C − n_c/T_data |  Class Interference: similarities among samples belonging to different classes  Data Complexity: linguistic properties of data samples KDD Tutorial / © 2021 IBM Corporation Source: Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks The feature vector for a given dataset covers quality properties such as  Class Diversity (2-Dim)  Shannon Class Diversity  Shannon Class Equitability  Class Imbalance (1-Dim)  Class Interference (24-Dim)  Hellinger Similarity  Top N-Gram Interference  Mutual Information  Data Complexity (21-Dim)  Distinct n-gram : Total n-gram  Inverse Flesch Reading Ease  N-Gram and Character diversity [CRZ18]
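Three of these features can be computed directly from the definitions above. A sketch on a toy labelled corpus, computing Shannon class diversity, the class imbalance formula from the slide, and the distinct:total unigram ratio.

```python
import math
from collections import Counter

texts  = ["good movie", "great movie", "bad film", "awful bad film", "good good"]
labels = ["pos", "pos", "neg", "neg", "pos"]

counts = Counter(labels)
T, C = len(labels), len(counts)

# Shannon class diversity over the count-based class distribution
shannon_diversity = -sum((n / T) * math.log(n / T) for n in counts.values())

# Class imbalance: sum over classes of |1/C - n_c / T_data|
class_imbalance = sum(abs(1 / C - n / T) for n in counts.values())

# Data complexity: distinct unigrams divided by total unigrams
unigrams = [w for t in texts for w in t.split()]
distinct_ratio = len(set(unigrams)) / len(unigrams)

print(f"diversity={shannon_diversity:.3f} imbalance={class_imbalance:.3f} "
      f"distinct:total={distinct_ratio:.3f}")
```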
  • 104. Understanding the Difficulty of Text Classification Tasks  The authors propose the use of genetic algorithms to intelligently explore the 2⁴⁸ possible feature combinations  The fitness function for the genetic algorithm is the Pearson correlation between the difficulty score and model performance on the test set  89 datasets were considered for evaluation, with 12 different types of models on each dataset  The effectiveness of a given combination of metrics is measured via its correlation with the performance of various models on various datasets  The stronger the negative correlation of a metric with model performance, the better the metric explains data complexity KDD Tutorial / © 2021 IBM Corporation Source: Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks [CRZ18]
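A toy sketch of the search itself: individuals are binary masks over the 48 features, and fitness rewards a strong negative Pearson correlation between the masked difficulty score and model performance. The feature matrix and performance vector are random stand-ins for the paper's 89-dataset, 12-model study, and this minimal GA uses selection and mutation only, omitting crossover.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.uniform(size=(89, 48))                     # difficulty features per dataset (stand-in)
perf = 1 - F[:, :5].mean(axis=1) + 0.05 * rng.normal(size=89)  # perf driven by 5 features

def fitness(mask):
    if mask.sum() == 0:
        return -1.0
    score = F[:, mask.astype(bool)].sum(axis=1)    # candidate difficulty measure
    return -np.corrcoef(score, perf)[0, 1]         # reward strong negative correlation

pop = rng.integers(0, 2, size=(40, 48))            # population of binary masks
for generation in range(30):
    fit = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(fit)[-20:]]           # keep the fitter half
    children = parents[rng.integers(0, 20, size=20)].copy()
    mutate = rng.random(children.shape) < 0.05
    children[mutate] ^= 1                          # bit-flip mutation
    pop = np.vstack([parents, children])

best = max(pop, key=fitness)
print("selected feature indices:", np.flatnonzero(best))
print("negative-correlation strength:", round(fitness(best), 3))
```

The search should converge on masks dominated by the five features that (by construction here) drive performance, mirroring how the paper's GA isolates the feature combination with the strongest negative correlation.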
  • 105. Understanding the Difficulty of Text Classification Tasks KDD Tutorial / © 2021 IBM Corporation Difficulty Measure D2 = Distinct Unigrams : Total Unigrams + Class Imbalance + Class Diversity + Maximum Unigram Hellinger Similarity + Unigram Mutual Info. Correlation = −0.8814 Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks [CRZ18]
  • 106. Outline KDD Tutorial / © 2021 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/  Motivation  What is Unstructured Text?  How to Assess Text Quality?  Metrics for Text Quality  Text Quality Metrics for Dataset Valuation  Text Quality Metrics for Outlier Detection  Text Quality Metrics for Class Imbalance  Text Quality Metrics for Dataset Complexity  Future Directions  Next Steps
  • 107. Next Steps KDD Tutorial / © 2021 IBM Corporation How to assess overall text quality?  A framework that allows users to assess text quality across various dimensions  A standardized set of quality metrics that output a score indicating low/high quality  Insights into the specific samples in the dataset which contribute to a low/high score  Specific recommendations for addressing poor text quality, and evidence of model performance improvement
  • 108. We invite you to join us on this agenda towards a data-centric approach to analyzing data quality. Contact Hima Patel with your ideas and enquiries. KDD Tutorial / © 2021 IBM Corporation
  • 109. References KDD Tutorial / © 2021 IBM Corporation [BF99] Carla E. Brodley and Mark A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999. [ARK18] Mohammed Al-Rawi and Dimosthenis Karatzas. On the labeling correctness in computer vision datasets. In IAL@PKDD/ECML, 2018. [NJC19] Curtis G. Northcutt, Lu Jiang, and Isaac L. Chuang. Confident learning: Estimating uncertainty in dataset labels. arXiv preprint arXiv:1911.00068, 2019. [EGH17] Rajmadhan Ekambaram, Dmitry B. Goldgof, and Lawrence O. Hall. Finding label noise examples in large scale datasets. In 2017 IEEE International Conference on Systems, Man, and Cybernetics, pages 2420–2424, 2017. [XHJ10] Haitao Xiong, Junjie Wu, and Lu Liu. Classification with class overlapping: A systematic study. In The 2010 International Conference on E-Business Intelligence, 2010. [DQT21] Nitin Gupta, Hima Patel, Shazia Afzal, Naveen Panwar, Ruhi Sharma Mittal, Shanmukha Guttula, Abhinav Jain, Lokesh Nagalapatti, Sameep Mehta, Sandeep Hans, Pranay Lohia, Aniya Aggarwal, and Diptikalyan Saha. Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasets. arXiv, 2021, https://arxiv.org/pdf/2108.05935.pdf [FF12] W. Fan and F. Geerts. Foundations of data quality management. Synthesis Lectures on Data Management, vol. 4, no. 5, pp. 1–217, 2012. [DQR18] David Camilo Corrales, Juan Carlos Corrales, and Agapito Ledezma. How to address the data quality issues in regression models: a guided process for data cleaning. Symmetry 10.4 (2018): 99.
  • 110. References KDD Tutorial / © 2021 IBM Corporation [OT21] Amulya Agarwal and Nitin Gupta. Comparison of outlier detection techniques for structured data. arXiv preprint arXiv:2106.08779, 2021. [SM13] Luís Torgo et al. SMOTE for regression. In Portuguese Conference on Artificial Intelligence. Springer, Berlin, Heidelberg, 2013. [WCY20] S. Wang, J. Cao, and P. Yu. Deep learning for spatio-temporal data mining: A survey. IEEE Transactions on Knowledge and Data Engineering, 2020. [SJA15] S. Shekhar, Z. Jiang, R. Y. Ali, et al. Spatiotemporal data mining: A computational perspective. ISPRS International Journal of Geo-Information, vol. 4, no. 4, pp. 2306–2338, 2015. [WLC10] E. Wu, W. Liu, and S. Chawla. Spatio-temporal outlier detection in precipitation data. In Knowledge Discovery from Sensor Data, Springer Berlin Heidelberg, pp. 115–133, 2010. [GGA14] M. Gupta, J. Gao, C. C. Aggarwal, and J. Han. Outlier detection for temporal data: A survey. IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 9, pp. 2250–2267, 2014. [KN98] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, VLDB '98, San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 392–403.
  • 111. References KDD Tutorial / © 2021 IBM Corporation [SLZ01] S. Shekhar, C.-T. Lu, and P. Zhang. Detecting graph-based spatial outliers: Algorithms and applications (a summary of results). In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, New York, NY, USA: Association for Computing Machinery, 2001, pp. 371–376. [CL04] T. Cheng and Z. Li. A hybrid approach to detect spatio-temporal outliers. In Proceedings of the 12th International Conference on Geoinformatics, 2004, pp. 173–178. [Anselin95] Luc Anselin. Local indicators of spatial association—LISA. Geographical Analysis 27.2 (1995): 93–115. [AMP06] Deepak Agarwal, Andrew McGregor, Jeff M. Phillips, et al. Spatial scan statistics: approximations and performance study. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006. [SSL+20] Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9275–9293, 2020. [WZ19] Jason Wei and Kai Zou. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388, 2019.
  • 112. References KDD Tutorial / © 2021 IBM Corporation [ATCG+20] Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. Do not have enough data? Deep learning to the rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7383–7390, 2020. [CRZ18] Edward Collins, Nikolai Rozanov, and Bingbing Zhang. Evolutionary data measures: Understanding the difficulty of text classification tasks. arXiv preprint arXiv:1811.01910, 2018. [KWAP17] Ramakrishnan Kannan, Hyenkyun Woo, Charu C. Aggarwal, and Haesun Park. Outlier detection for text data. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 489–497. SIAM, 2017. [RWGS20] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118, 2020. [RZV+19] Lukas Ruff, Yury Zemlyanskiy, Robert Vandermeulen, Thomas Schnake, and Marius Kloft. Self-attentive, multi-context one-class classification for unsupervised anomaly detection on text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4061–4071, 2019. [YAP20] Jinsung Yoon, Sercan Arik, and Tomas Pfister. Data valuation using reinforcement learning. In International Conference on Machine Learning, pages 10842–10851. PMLR, 2020.

Editor's notes

  14. Academic domains that use spatio-temporal data
  15. Spatio-temporal data in data science. Applications of ST data. This is like a business point of view.
  16. Introduce several outlier detection techniques for spatio-temporal data
  17. Outlier detection and removal is an important preprocessing step prior to building a machine learning model. Usually outliers are those points in the data which do not follow the general trends and stand out when compared to other points. Such points, if not removed from the dataset, might hamper the ability of a machine learning model to capture the data properties in a generalized way.

It is relatively easy to understand the concept of an outlier in the case of tabular or timeseries data. For example, by simply plotting a given tabular dataset, as shown in Fig 1, one can observe that one point lies away from the general trend of the data and can be treated as an outlier. One can also plot the statistical nature of the data, such as using box plots as in Fig 2, and identify points lying far away from the usual data distribution. Even in the case of timeseries data, a simple plot as shown in Fig 3 can help one discover outliers present in the data.

The same is not true for text data: given a text classification dataset, it is not clear what an outlier can be or how to identify one. There are multiple ways to interpret an outlier in text data. Given a dataset, an outlier can be a data point which is topically diverse from the other data points (e.g., a sports news article in a political news corpus), some gibberish text (as can be found in product reviews, tweets, etc.), an incomplete sentence which does not convey any meaning, or data points in a foreign language compared to the majority (French data points in an English corpus).
  18. Here are a few samples from the popular IMDB sentiment analysis dataset. This dataset has movie reviews scraped from IMDB and a label associated with each review to indicate whether the review is positive or negative. A data sample can be anomalous for various reasons: there is a lot of repetitive content in it, as shown in the first example; the sample is difficult to comprehend and label even for a human, as shown in the second example; or the sample is incomplete, as shown in the third example.

From these examples, it can be observed that, leave alone the model, it sometimes becomes difficult even for a human to understand a sample and assign a label to it. Having shared some examples, I would like to discuss two approaches for anomaly or outlier detection in text. The first is a very classical approach based on the matrix factorization technique adapted to function on text data. The second relies more on newer DL-based techniques such as pretrained word embeddings and self-attention to identify anomalies in the data.
  19. Outlier Detection for Text Data was published in SIAM 2017 and proposes matrix factorization techniques to detect outliers in text data. Matrix factorization is predominantly used in recommender systems to decompose an interaction matrix into two lower-dimensional matrices; the same technique is applied here to identify outliers in the given text data.

Firstly, a numeric representation of the data is required, and a simple way of representing a document is in terms of the words present in it, which is called a Bag of Words approach. Given a set of documents, we represent them as a term matrix A (m×n), where m is the number of unique words in the given set of documents and n is the number of documents.

Now, we explore matrix factorization techniques to decompose the term matrix A into a low-rank matrix (L) and an outlier matrix (Z). Further, the low-rank matrix L can be represented as a product of two matrices W, H. Intuitively, this corresponds to the case that every document a_i is represented as a linear combination of the r topics. In cases where this is not true, the document is an outlier, and those unrepresentable sections of the matrix are captured by the non-zero entries of the Z matrix.

In order to get the matrix Z, we solve the optimization problem where we find values for the matrices W, H, Z that closely approximate A. The L1,2-norm penalty on Z defines the sum of the L2-norm outlier scores over all the documents. Therefore, the optimization problem essentially tries to find the best model, an important component of which is to minimize the sum of the outlier scores over all documents.

Once the equation is solved, each entry in Z corresponds to a term in a document. Since we are interested in the outlier behaviour of the entire document, the aggregate outlier behaviour of document x can be modelled with the L2-norm score of the corresponding column z_x.
  20. For high-dimensional data such as text, sparse coefficients are required for obtaining an interpretable low-rank matrix WH; hence an additional l1 penalty is imposed on the matrix H. Due to the L1,2-norm, this optimization corresponds to the two-block non-smooth BCD framework, where the problem is solved in two steps: in step 1, we freeze W, H and solve for Z, and in step 2, we freeze Z and solve for W, H according to the problem formulation in equation 2.

Additionally, we partition the matrix Z into vector blocks z_i and construct Z as a set of vectors z_i. This way, we impose a semantic constraint that outliers in one document do not affect other documents; also, when all other blocks w_1, ..., w_r, h_1, ..., h_r are fixed, every vector z_i ∈ Z can be solved to optimality in parallel.

As mentioned earlier, once the optimization problem converges, we can identify outlier documents by aggregating the outlier scores present in each column of Z. The higher the aggregate score of a column, the higher the chance of that document being an outlier.

Having seen how a traditional technique such as matrix factorization can be used to detect outliers in a dataset, we now move on to newer approaches which rely on DL and pretrained word embeddings to detect outliers in the data.
  21. This is one of the recent papers published in a prominent NLP conference, ACL, in the year 2019. Unlike the previous approach, this paper relies on recent techniques such as pretrained word embeddings and deep learning architectures to identify outliers present in the data. The paper proposes a one-class classification method which takes word embeddings as input and identifies whether a document is an anomaly or not.

Similar to the previous technique, the first step is to represent text in a numerical format. For this, we rely on pretrained word embeddings. For those of you who are unaware, a word embedding is a numerical representation of a word which is learnt over a huge text corpus such as news, Wikipedia, etc. These word embeddings come in various sizes such as 50D, 100D, 300D, and have interesting properties: words having similar meanings occur nearby in the hyperspace, and they capture interesting relationships among words such as countries and capitals, or countries and currencies. Put simply, a word embedding is a lookup table or dictionary which, when queried with a word, provides a numeric representation of it.

Embeddings are obtained for each word in a sentence, and the next task is to represent all sentences in a fixed-size representation irrespective of the number of words in them. For this, the authors rely on a technique called multi-head self-attention, which maps variable-length sentences to a fixed-length representation and, additionally, gives multiple numeric representations for the same sentence considering multiple contexts. For example, the word MARCH can be a month or a march past; hence, based on the context, a different representation is required for a given sentence. In the self-attention, given the word embedding matrix of the data, we compute the attention matrix as shown in equation 1. Once the attention matrix is computed, as shown in equation 2, we multiply it with the word embedding matrix to get sentence representations.
  22. Once the sentence-level embeddings are obtained, the authors propose the determination of context vectors. These context vectors are expected to behave similarly to the sentence representations, with an additional constraint that different context vectors for the same sentence capture diverse contexts. The context vectors are determined using equation 1, which minimizes the cosine distance between the sentence embeddings M and the context vectors C; as mentioned, an orthogonality constraint is imposed on the context vectors to capture diverse contexts.

Once converged, most of the context vectors have a representation similar to that of the sentence embedding matrix M, and some do not. We can quantify the similarity between these two representations using a cosine distance function: for a given context, the cosine distance is computed between the respective representations of the sentence and the context vectors. In order to get a single score, the scores for all the contexts need to be aggregated; they can either be given the same weights and averaged, as shown in the equation, or be assigned different weights for a weighted average. The higher the score s(H) for a given sentence, the greater the chance it is an outlier, since its context vectors lie far from the sentence embedding.
  23. Now we have seen various quality metrics for quantifying the quality of given text data; how do we know the right combination of metrics to understand the complexity of a dataset for a classification task? This paper attempts to solve exactly this problem. It proposes an approach to design a data quality metric which explains the complexity of the given data for a classification task by considering various properties of the data. The authors consider four characteristics of the data, namely class diversity, class imbalance, class interference, and data complexity.

The class diversity characteristic captures the count-based probability distribution of classes in the dataset, via Shannon Class Diversity and Shannon Class Equitability, which consider the class distribution and measure the diversity of the dataset.

The class imbalance characteristic measures the amount of imbalance present in the data and is computed using the formula shown.

The class interference characteristic measures similarity among samples belonging to different classes. Hellinger Similarity measures the similarity between two probability distributions. Top N-gram interference is the average Jaccard similarity between the sets of the top 10 most frequent n-grams from each class. Mutual information is the average mutual information score between the sets of the top 10 most frequent n-grams from each class.

The data complexity characteristic measures the complexity of data based on linguistic properties. Distinct n-gram : Total n-gram is the count of distinct n-grams in a dataset divided by the total number of n-grams. Inverse Flesch Reading Ease is the reciprocal of the Flesch score, which grades text from 100 (most readable) to 0 (difficult to read). N-gram and character diversity use the Shannon Index and Equitability to calculate the diversity and equitability of n-grams and characters.

The authors combine these properties of each data characteristic into a 48-dimensional feature vector representing the complexity of the given data. On the right side you can see the different characteristics of the data and the properties within each; the number in parentheses denotes the number of dimensions assigned to each data characteristic.
  24. Once the 48-dimensional feature vector is constructed for each dataset, there are 2^48 possible combinations of metrics which can be designed from this feature representation. In order to intelligently traverse this search space and find the best metric, the authors propose the use of genetic algorithms. These algorithms rank a combination based on a fitness function; in the current setting, the authors use the Pearson correlation between the difficulty score obtained from a given combination and the accuracies of different models obtained on the dataset. The stronger the negative correlation between the metric and model accuracy, the better the metric.

To identify the best metric, the authors use a huge database consisting of 89 datasets and also consider 12 different models for each dataset. Based on the experiments, the authors showcase the best metric they identified to describe data complexity; on the right you can see the metric, which has a strong negative correlation of -0.88 with model accuracy.

For a qualitative analysis, look at the provided plot: on the X-axis is the difficulty of a dataset measured using the given metric D2, and on the Y-axis is the F1 score of the models. It can be observed that as we move from left to right on the X-axis, model performance drops. For a low difficulty measure such as 2, model performances lie above 0.9, but for a higher difficulty measure such as 5, model performances lie in the range of 0.2-0.4, illustrating the effectiveness of the identified metric.