2. What is validation?
Validation is how we select and evaluate our models.
The two most common strategies:
• train-validation-test split (holdout validation)
• k-fold cross-validation + test holdout
Source: https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9
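Both strategies can be sketched with scikit-learn; the model and dataset below are illustrative placeholders, not from the slides:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Holdout validation: carve out a test set, then a validation set, once.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_score = model.score(X_val, y_val)

# k-fold cross-validation on everything except the test holdout.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000),
                            np.vstack([X_train, X_val]),
                            np.concatenate([y_train, y_val]),
                            cv=KFold(n_splits=5, shuffle=True, random_state=0))
```

The test set stays untouched in both cases until the final evaluation.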
4. What do we expect?
Validation:
• To compare and select models
Test:
• To evaluate the model’s performance
13. Adversarial validation
1. Merge train and test into a single dataset
2. Label train samples as 0 and test samples as 1
3. Train a classifier to distinguish the two
4. Train samples with the highest error are the most similar to the test distribution
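The steps above can be sketched with scikit-learn; the toy data and choice of classifier are illustrative, not from the slides:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Toy train/test frames with one shifted feature to imitate distribution drift.
train = pd.DataFrame({"f1": rng.normal(0, 1, 500), "f2": rng.normal(0, 1, 500)})
test = pd.DataFrame({"f1": rng.normal(1, 1, 500), "f2": rng.normal(0, 1, 500)})

# Steps 1-2: merge and label train -> 0, test -> 1.
merged = pd.concat([train, test], ignore_index=True)
is_test = np.array([0] * len(train) + [1] * len(test))

# Step 3: train a classifier to tell the two sets apart
# (out-of-fold predictions, so scores are not overfit).
proba = cross_val_predict(RandomForestClassifier(random_state=0),
                          merged, is_test, cv=5, method="predict_proba")[:, 1]

# Step 4: train samples scored closest to 1 look the most "test-like".
train_scores = proba[: len(train)]
most_test_like = np.argsort(-train_scores)[:10]
```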
14. Adversarial validation: usage
• To detect a discrepancy between the train and test distributions (ROC-AUC noticeably above 0.5)
• To make train or validation closer to test (by removing drifting features or by sampling the most test-like items)
• To make test closer to production (if an unlabeled set of real-world data is available)
Examples:
https://www.linkedin.com/pulse/winning-13th-place-kaggles-magic-competition-corey-levinson/
https://www.kaggle.com/c/home-credit-default-risk/discussion/64722
https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77251
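One way to apply the first two points: measure the ROC-AUC of the adversarial classifier and inspect which features it relies on most. This is a sketch on synthetic data; the drift pattern and threshold are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Feature 0 drifts between train and test; feature 1 does not.
X = np.column_stack([np.r_[rng.normal(0, 1, 500), rng.normal(2, 1, 500)],
                     rng.normal(0, 1, 1000)])
is_test = np.r_[np.zeros(500), np.ones(500)]

# Out-of-fold ROC-AUC well above 0.5 => the distributions differ.
proba = cross_val_predict(RandomForestClassifier(random_state=0), X, is_test,
                          cv=5, method="predict_proba")[:, 1]
auc = roc_auc_score(is_test, proba)

# The most "adversarially important" feature is a candidate for removal.
clf = RandomForestClassifier(random_state=0).fit(X, is_test)
drifting = int(np.argmax(clf.feature_importances_))
```

After dropping the drifting feature, one would re-run the check and expect the AUC to fall toward 0.5.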
22. Case #1. Mercedes-Benz competition
Competitors worked with a dataset describing features of Mercedes-Benz cars, predicting the time it takes a car to pass testing.
Metric: R².
Only 4k rows.
Extreme outliers in the target.
Source: https://habr.com/ru/company/ods/blog/336168/
23. Case #1. Mercedes-Benz competition
Gold medal solution:
1. Multiple runs of k-fold (10×5 folds) to collect more per-fold statistics
2. Paired (dependent) Student’s t-test to compare two models:

T(X₁ⁿ, X₂ⁿ) = (E[X₁] − E[X₂]) / (S / √n)

where n is the number of folds, X₁ⁿ and X₂ⁿ are the per-fold metric values for models #1 and #2, and S is the standard deviation of the elementwise differences.
Source: https://habr.com/ru/company/ods/blog/336168/ Author: https://www.kaggle.com/daniel89
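The same comparison is implemented by `scipy.stats.ttest_rel`, the paired t-test over per-fold scores. The score arrays below are synthetic stand-ins for real fold metrics:

```python
import numpy as np
from scipy import stats

# Per-fold metric values for two models over the same 50 folds (10x5).
rng = np.random.default_rng(0)
folds = 50
scores_a = rng.normal(0.56, 0.02, folds)               # model #1
scores_b = scores_a + rng.normal(0.01, 0.005, folds)   # model #2, slightly better

# Paired t-test: T = (E[X1] - E[X2]) / (S / sqrt(n)),
# where S is the std of the elementwise differences.
t, p = stats.ttest_rel(scores_a, scores_b)

# The manual computation matches the library result.
diff = scores_a - scores_b
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(folds))
```

A small p-value here says the per-fold difference is consistent across folds, not just noise in the averages.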
24. Case #2. ML BootCamp VI
1. 19M rows of logs
2. Adversarial validation gives 0.9+ ROC-AUC
3. Extremely unstable CV: unclear how to stratify
Author: https://www.kaggle.com/sergeifironov/
25. Case #2. ML BootCamp VI
First place solution:
1. Train a model on stratified k-folds
2. Compute the out-of-fold error for each sample
3. Stratify the dataset by that error
4. Optionally: go back to step #1
Author: https://www.kaggle.com/sergeifironov/
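A minimal sketch of this loop, assuming a generic scikit-learn classifier; the dataset, binning, and fold counts are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=2000, random_state=0)

# Steps 1-2: train on stratified folds, collect out-of-fold predictions.
oof = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                        cv=StratifiedKFold(5, shuffle=True, random_state=0),
                        method="predict_proba")[:, 1]
error = np.abs(y - oof)   # per-sample out-of-fold error

# Step 3: stratify the next round of folds by binned error instead of by target.
error_bins = np.digitize(error, np.quantile(error, [0.25, 0.5, 0.75]))
cv = StratifiedKFold(5, shuffle=True, random_state=0)
folds = list(cv.split(X, error_bins))   # step 4: optionally repeat from step 1
```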
26. Data leakage
Data leakage is the contamination of the training data by additional information that will not be available at actual prediction time.
Source: https://www.kaggle.com/alexisbcook/data-leakage
27. Case #1. HPA Classification Challenge
Multiple shots from a single experiment are available.
If one shot is placed into train and another into validation, you have leakage.
28. Case #1. HPA Classification Challenge
Solution: if you have data from several groups that share the target, always place the whole group into a single set!
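scikit-learn’s `GroupKFold` enforces exactly this rule: every sample with the same group id ends up on one side of each split. The groups below are synthetic, standing in for "shots from one experiment":

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, 100)
groups = np.repeat(np.arange(25), 4)   # 25 experiments, 4 shots each

for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # No experiment contributes shots to both train and validation.
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```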
29. Case #2. Telecom Data Cup
Where is the leakage?
Client uses mobile provider services
Client answers an engagement survey
Survey result is written into the DB
All previous history is aggregated into a row in the dataset
30. Case #2. Telecom Data Cup
The engagement survey call itself is recorded in the call history. A short call means everything was fine; a long conversation means the client was complaining.
LEAKAGE!
31. Case #2. Telecom Data Cup
Solution: only use data that was available at the point when the prediction should have been made.
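In code, the fix is a point-in-time cut: keep only events strictly before the moment the prediction would have been made. The table layout and field names here are hypothetical, not from the competition data:

```python
import pandas as pd

# Hypothetical call log and per-client survey timestamps.
calls = pd.DataFrame({
    "client_id": [1, 1, 1, 2],
    "call_time": pd.to_datetime(["2019-01-05", "2019-02-01", "2019-02-10", "2019-01-20"]),
    "duration_s": [60, 45, 900, 30],   # the 900 s call is the survey itself
})
surveys = pd.DataFrame({
    "client_id": [1, 2],
    "survey_time": pd.to_datetime(["2019-02-10", "2019-03-01"]),
})

# Keep only history that existed before each client's survey call,
# so the survey conversation never leaks into the features.
merged = calls.merge(surveys, on="client_id")
history = merged[merged["call_time"] < merged["survey_time"]]
features = history.groupby("client_id")["duration_s"].mean()
```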
32. Case #3. APTOS Blindness Detection
Different classes were probably collected separately and artificially mixed into a single dataset, so aspect ratio, image size and crop type vary across classes.
33. Case #3. APTOS Blindness Detection
This leads the network to learn arbitrary meta-features of the images instead of the actual symptoms.
Source: https://www.kaggle.com/dimitreoliveira/diabetic-retinopathy-shap-model-explainability
34. Case #3. APTOS Blindness Detection
Solution: remove meta-features that are not related to the task and thoroughly investigate all suspicious “data properties”.
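A quick diagnostic for such a leak: check whether trivial metadata alone predicts the class. The metadata below is synthetic; for the real images one would extract width, height, and aspect ratio:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
# Synthetic metadata: class-1 images happen to be wider (a leak).
n = 1000
label = rng.integers(0, 2, n)
aspect_ratio = np.where(label == 1, rng.normal(1.5, 0.1, n), rng.normal(1.0, 0.1, n))
meta = aspect_ratio.reshape(-1, 1)

proba = cross_val_predict(LogisticRegression(max_iter=1000), meta, label,
                          cv=5, method="predict_proba")[:, 1]
auc = roc_auc_score(label, proba)
# AUC far above 0.5 from metadata alone means the network
# can cheat on "data properties" instead of symptoms.
```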
35. Case #4. Airbus Ship Detection Challenge
The original dataset consists of high-resolution images. They were cropped, augmented, and only after that divided into train and test.
36. Case #4. Airbus Ship Detection Challenge
Solution: first split the data into train and test, and only then apply all preprocessing.
If preprocessing is data-driven (e.g. target encoding), fit it on the train data only.
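A minimal sketch of the right ordering, with a hand-rolled target encoding fit on train only; the column names and data are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({"city": rng.choice(["A", "B", "C"], 1000),
                   "target": rng.integers(0, 2, 1000)})

# 1. Split FIRST, before any data-driven preprocessing.
train, test = train_test_split(df, test_size=0.2, random_state=0)

# 2. Fit the target encoding on train only...
encoding = train.groupby("city")["target"].mean()
global_mean = train["target"].mean()

# 3. ...and apply the frozen mapping to both sets
#    (unseen categories fall back to the train global mean).
train_enc = train["city"].map(encoding)
test_enc = test["city"].map(encoding).fillna(global_mean)
```

Fitting the encoding before the split would let test-set targets leak into the training features.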
37. Summary
• Always ensure that your validation is representative
• Check that your validation scenario corresponds to the real-world prediction scenario
• Good luck!