2. What is validation?
Validation is how we select and evaluate our models.
The two most common strategies:
• train-validation-test split (holdout validation)
• k-fold cross-validation + test holdout
Source: https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
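Both strategies can be sketched with scikit-learn; the model and dataset below are illustrative placeholders, not from the slides:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Holdout validation: carve out a test set, then a validation set, once.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_score = model.score(X_val, y_val)

# k-fold cross-validation on everything except the test holdout.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000),
                            np.vstack([X_train, X_val]),
                            np.concatenate([y_train, y_val]),
                            cv=KFold(n_splits=5, shuffle=True, random_state=0))
```

The test set stays untouched in both cases until the final evaluation.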
4. What do we expect?
Validation:
• To compare and select models
Test:
• To evaluate the model’s performance
13. Adversarial validation
1. Merge train and test into a single dataset
2. Label train samples as 0 and test samples as 1
3. Train a classifier to distinguish the two
4. Train samples with the highest error are the most similar to the test distribution
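The steps above can be sketched with scikit-learn; the toy data and choice of classifier are illustrative, not from the slides:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Toy train/test frames with one shifted feature to imitate distribution drift.
train = pd.DataFrame({"f1": rng.normal(0, 1, 500), "f2": rng.normal(0, 1, 500)})
test = pd.DataFrame({"f1": rng.normal(1, 1, 500), "f2": rng.normal(0, 1, 500)})

# Steps 1-2: merge and label train -> 0, test -> 1.
merged = pd.concat([train, test], ignore_index=True)
is_test = np.array([0] * len(train) + [1] * len(test))

# Step 3: train a classifier to tell the two sets apart
# (out-of-fold predictions, so scores are not overfit).
proba = cross_val_predict(RandomForestClassifier(random_state=0),
                          merged, is_test, cv=5, method="predict_proba")[:, 1]

# Step 4: train samples scored closest to 1 look the most "test-like".
train_scores = proba[: len(train)]
most_test_like = np.argsort(-train_scores)[:10]
```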
14. Adversarial validation: usage
• To detect a discrepancy between the train and test distributions (ROC-AUC noticeably above 0.5)
• To make train or validation closer to test (by removing drifting features or by sampling the most test-like items)
• To make test closer to production (if an unlabeled set of real-world data is available)
Examples:
https://www.linkedin.com/pulse/winning-13th-place-kaggles-magic-competition-corey-levinson/
https://www.kaggle.com/c/home-credit-default-risk/discussion/64722
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77251
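One way to apply the first two points: measure the ROC-AUC of the adversarial classifier and inspect which features it relies on most. This is a sketch on synthetic data; the drift pattern and threshold are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Feature 0 drifts between train and test; feature 1 does not.
X = np.column_stack([np.r_[rng.normal(0, 1, 500), rng.normal(2, 1, 500)],
                     rng.normal(0, 1, 1000)])
is_test = np.r_[np.zeros(500), np.ones(500)]

# Out-of-fold ROC-AUC well above 0.5 => the distributions differ.
proba = cross_val_predict(RandomForestClassifier(random_state=0), X, is_test,
                          cv=5, method="predict_proba")[:, 1]
auc = roc_auc_score(is_test, proba)

# The most "adversarially important" feature is a candidate for removal.
clf = RandomForestClassifier(random_state=0).fit(X, is_test)
drifting = int(np.argmax(clf.feature_importances_))
```

After dropping the drifting feature, one would re-run the check and expect the AUC to fall toward 0.5.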
22. Case #1. Mercedes-Benz competition
Competitors worked with a dataset describing features of Mercedes-Benz cars, predicting the time it takes a car to pass testing.
Metric: R².
Only 4k rows.
Extreme outliers in the target.
Source: https://habr.com/ru/company/ods/blog/336168/
23. Case #1. Mercedes-Benz competition
Gold medal solution:
1. Multiple runs of k-fold (10×5 folds) to collect more per-fold statistics
2. Paired (dependent) Student’s t-test to compare two models:

T(X₁ⁿ, X₂ⁿ) = (E[X₁] − E[X₂]) / (S / √n)

where n is the number of folds, X₁ⁿ and X₂ⁿ are the per-fold metric values for models #1 and #2, and S is the standard deviation of the elementwise differences.
Source: https://habr.com/ru/company/ods/blog/336168/ Author: https://www.kaggle.com/daniel89
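The same comparison is implemented by `scipy.stats.ttest_rel`, the paired t-test over per-fold scores. The score arrays below are synthetic stand-ins for real fold metrics:

```python
import numpy as np
from scipy import stats

# Per-fold metric values for two models over the same 50 folds (10x5).
rng = np.random.default_rng(0)
folds = 50
scores_a = rng.normal(0.56, 0.02, folds)               # model #1
scores_b = scores_a + rng.normal(0.01, 0.005, folds)   # model #2, slightly better

# Paired t-test: T = (E[X1] - E[X2]) / (S / sqrt(n)),
# where S is the std of the elementwise differences.
t, p = stats.ttest_rel(scores_a, scores_b)

# The manual computation matches the library result.
diff = scores_a - scores_b
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(folds))
```

A small p-value here says the per-fold difference is consistent across folds, not just noise in the averages.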
24. Case #2. ML BootCamp VI
1. 19M rows of logs
2. Adversarial validation gives 0.9+ ROC-AUC
3. Extremely unstable CV: unclear how to stratify
Author: https://www.kaggle.com/sergeifironov/
25. Case #2. ML BootCamp VI
First place solution:
1. Train a model on stratified k-folds
2. Compute the out-of-fold error for each sample
3. Stratify the dataset by that error
4. Optionally: go back to step #1
Author: https://www.kaggle.com/sergeifironov/
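A minimal sketch of this loop, assuming a generic scikit-learn classifier; the dataset, binning, and fold counts are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=2000, random_state=0)

# Steps 1-2: train on stratified folds, collect out-of-fold predictions.
oof = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                        cv=StratifiedKFold(5, shuffle=True, random_state=0),
                        method="predict_proba")[:, 1]
error = np.abs(y - oof)   # per-sample out-of-fold error

# Step 3: stratify the next round of folds by binned error instead of by target.
error_bins = np.digitize(error, np.quantile(error, [0.25, 0.5, 0.75]))
cv = StratifiedKFold(5, shuffle=True, random_state=0)
folds = list(cv.split(X, error_bins))   # step 4: optionally repeat from step 1
```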
26. Data leakage
Data leakage is the contamination of the training data by additional information that will not be available at actual prediction time.
Source: https://www.kaggle.com/alexisbcook/data-leakage
27. Case #1. HPA Classification Challenge
Multiple shots from a single experiment are available.
If one shot is placed into train and another into validation, you have leakage.
28. Case #1. HPA Classification Challenge
Solution: if you have data from several groups that share the target, always place the whole group into a single set!
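scikit-learn’s `GroupKFold` enforces exactly this rule: every sample with the same group id ends up on one side of each split. The groups below are synthetic, standing in for "shots from one experiment":

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, 100)
groups = np.repeat(np.arange(25), 4)   # 25 experiments, 4 shots each

for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # No experiment contributes shots to both train and validation.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```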
29. Case #2. Telecom Data Cup
Where is the leakage?
Client uses mobile provider services
Client answers an engagement survey
Survey result is written into the DB
All previous history is aggregated into a row in the dataset
30. Case #2. Telecom Data Cup
The engagement survey call itself is recorded in the call history. A short call means everything was fine; a long conversation means the client was complaining.
LEAKAGE!
31. Case #2. Telecom Data Cup
Solution: only use data that was available at the point when the prediction should have been made.
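In code, the fix is a point-in-time cut: keep only events strictly before the moment the prediction would have been made. The table layout and field names here are hypothetical, not from the competition data:

```python
import pandas as pd

# Hypothetical call log and per-client survey timestamps.
calls = pd.DataFrame({
    "client_id": [1, 1, 1, 2],
    "call_time": pd.to_datetime(["2019-01-05", "2019-02-01", "2019-02-10", "2019-01-20"]),
    "duration_s": [60, 45, 900, 30],   # the 900 s call is the survey itself
})
surveys = pd.DataFrame({
    "client_id": [1, 2],
    "survey_time": pd.to_datetime(["2019-02-10", "2019-03-01"]),
})

# Keep only history that existed before each client's survey call,
# so the survey conversation never leaks into the features.
merged = calls.merge(surveys, on="client_id")
history = merged[merged["call_time"] < merged["survey_time"]]
features = history.groupby("client_id")["duration_s"].mean()
```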
32. Case #3. APTOS Blindness Detection
Different classes were probably collected separately and artificially mixed into a single dataset, so aspect ratio, image size and crop type vary across classes.
33. Case #3. APTOS Blindness Detection
This leads the network to learn arbitrary meta-features of the images instead of the actual symptoms.
Source: https://www.kaggle.com/dimitreoliveira/diabetic-retinopathy-shap-model-explainability
34. Case #3. APTOS Blindness Detection
Solution: remove meta-features that are not related to the task and thoroughly investigate all suspicious “data properties”.
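A quick diagnostic for such a leak: check whether trivial metadata alone predicts the class. The metadata below is synthetic; for the real images one would extract width, height, and aspect ratio:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Synthetic metadata: class-1 images happen to be wider (a leak).
n = 1000
label = rng.integers(0, 2, n)
aspect_ratio = np.where(label == 1, rng.normal(1.5, 0.1, n), rng.normal(1.0, 0.1, n))
meta = aspect_ratio.reshape(-1, 1)

proba = cross_val_predict(LogisticRegression(max_iter=1000), meta, label,
                          cv=5, method="predict_proba")[:, 1]
auc = roc_auc_score(label, proba)
# AUC far above 0.5 from metadata alone means the network
# can cheat on "data properties" instead of symptoms.
```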
35. Case #4. Airbus Ship Detection Challenge
The original dataset consists of high-resolution images. They were cropped, augmented, and only after that divided into train and test.
36. Case #4. Airbus Ship Detection Challenge
Solution: first split the data into train and test, and only then apply all preprocessing.
If preprocessing is data-driven (e.g. target encoding), fit it on the train data only.
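A minimal sketch of the right ordering, with a hand-rolled target encoding fit on train only; the column names and data are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({"city": rng.choice(["A", "B", "C"], 1000),
                   "target": rng.integers(0, 2, 1000)})

# 1. Split FIRST, before any data-driven preprocessing.
train, test = train_test_split(df, test_size=0.2, random_state=0)

# 2. Fit the target encoding on train only...
encoding = train.groupby("city")["target"].mean()
global_mean = train["target"].mean()

# 3. ...and apply the frozen mapping to both sets
#    (unseen categories fall back to the train global mean).
train_enc = train["city"].map(encoding)
test_enc = test["city"].map(encoding).fillna(global_mean)
```

Fitting the encoding before the split would let test-set targets leak into the training features.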
37. Summary
• Always ensure that your validation is representative
• Check that your validation scenario corresponds to the real-world prediction scenario
• Good luck!