Robust Inference via Generative Classifiers
for Handling Noisy Labels
1 Korea Advanced Institute of Science and Technology (KAIST)
2 University of Michigan
3 University of Illinois at Urbana-Champaign
4 Google Brain
5 AItrics
ICML 2019
Kimin Lee1 Sukmin Yun1 Kibok Lee2 Honglak Lee4,2 Bo Li3 Jinwoo Shin1,5
• Large-scale datasets collect class labels from
• Data mining on social media and web data
Introduction: Noisy Labels
1
Web-crawled dataset [Krause’ 16]
[Mahajan’ 18] Exploring the Limits of Weakly Supervised Pretraining. In ECCV, 2018.
[Krause’ 16] The unreasonable effectiveness of noisy data for fine-grained recognition. In ECCV 2016.
[Kuznetsova’ 18] The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint 2018.
Open Images dataset [Kuznetsova’ 18]
Facebook’s Instagram dataset
[Mahajan’ 18]
• Large-scale datasets collect class labels from
• Data mining on social media and web data
• Large-scale datasets may contain noisy (incorrect) labels
Introduction: Noisy Labels
1
Facebook’s Instagram dataset
[Mahajan’ 18]
[Example images labeled “Cat”: four show cats, while two show humans, i.e., carry noisy labels]
• Large-scale datasets collect class labels from
• Data mining on social media and web data
• Large-scale datasets may contain noisy (incorrect) labels
• DNNs do not generalize well from such noisy datasets
Introduction: Noisy Labels
1
• Large-scale datasets collect class labels from
• Data mining on social media and web data
• Large-scale datasets may contain noisy (incorrect) labels
• DNNs do not generalize well from such noisy datasets
Introduction: Noisy Labels
[Figure: test set accuracy (%) of DenseNet-100 vs. noise fraction (0–0.6)]
[Huang’ 17] Densely connected convolutional networks. In CVPR, 2017.
[Krizhevsky’ 09] Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
1
[Uniform noise illustration: Deer, Dog, Frog, …]
• Dataset: CIFAR-10 [Krizhevsky’ 09]
• Model: DenseNet-100 [Huang’ 17]
• Large-scale datasets collect class labels from
• Data mining on social media and web data
• Large-scale datasets may contain noisy (incorrect) labels
• DNNs do not generalize well from such noisy datasets
• Several training strategies have also been investigated
Introduction: Noisy Labels
1
• Utilizing an estimated/corrected label
• Bootstrapping [Reed’ 14; Ma’ 18]
• Loss correction [Patrini’ 17; Hendrycks’ 18]
• Training on selected (cleaner) samples
• Ensemble [Malach’ 17; Han’ 18]
• Meta-learning [Jiang’ 18]
[Reed’ 14] Training deep neural networks on noisy labels with bootstrapping. arXiv preprint 2014.
[Hendrycks’ 18] Using trusted data to train deep networks on labels corrupted by severe noise.
In NeurIPS, 2018
[Ma’ 18] Dimensionality-driven learning with noisy labels. In ICML, 2018
[Patrini’ 17] Making deep neural networks robust to label noise: A loss correction approach.
In CVPR, 2017
[Han’ 18] Co-teaching: Robust training of deep neural networks with extremely noisy labels.
In NeurIPS, 2018.
[Jiang’ 18] Mentornet: Regularizing very deep neural networks on corrupted labels.
In ICML, 2018.
[Malach’ 17] Decoupling “when to update” from “how to update”. In NeurIPS, 2017.
• We propose a new inference method which can be applied to any pre-trained DNNs
Our Contributions
2
• We propose a new inference method which can be applied to any pre-trained DNNs
Our Contributions
2
[Figure: test set accuracy (%) vs. noise fraction (0–0.6) — softmax classifier vs. generative classifier (sample mean on noisy labels)]
• Inducing a “generative classifier”
Our Contributions
2
[Figure: test set accuracy (%) vs. noise fraction (0–0.6) — softmax, generative (sample mean on noisy labels), and generative (MCD on noisy labels)]
• We propose a new inference method which can be applied to any pre-trained DNNs
• Inducing a “generative classifier”
• Applying a “robust inference” to estimate
parameters of generative classifier
• Breakdown points
• Generalization bounds
Our Contributions
2
• Inducing a “generative classifier”
• Applying a “robust inference” to estimate
parameters of generative classifier
• Breakdown points
• Generalization bounds
• Introducing “ensemble of generative
classifiers”
[Figure: test set accuracy (%) vs. noise fraction (0–0.6) — softmax, generative (sample mean), generative (MCD), and generative (MCD) + ensemble]
• We propose a new inference method which can be applied to any pre-trained DNNs
Outline
• Our method: Robust Inference via Generative Classifiers
• Generative classifier
• Minimum covariance determinant estimator
• Ensemble of generative classifiers
• Experiments
• Experimental results on synthetic noisy labels
• Experimental results on semantic and open-set noisy labels
• Conclusion
• t-SNE embedding of DenseNet-100 trained on CIFAR-10 with uniform noisy labels
Motivation: Why Generative Classifier?
3
Training samples with clean labels
Training samples with noisy labels
• t-SNE embedding of DenseNet-100 trained on CIFAR-10 with uniform noisy labels
• Features from training samples with noisy labels (red stars) are distributed like outliers
Motivation: Why Generative Classifier?
3
Training samples with clean labels
Training samples with noisy labels
• t-SNE embedding of DenseNet-100 trained on CIFAR-10 with uniform noisy labels
• Features from training samples with noisy labels (red stars) are distributed like outliers
• Features from training samples with clean labels (black dots) are still clustered!!
Motivation: Why Generative Classifier?
3
Training samples with clean labels
Training samples with noisy labels
• t-SNE embedding of DenseNet-100 trained on CIFAR-10 with uniform noisy labels
• Features from training samples with noisy labels (red stars) are distributed like outliers
• Features from training samples with clean labels (black dots) are still clustered!!
• If we remove the outliers and induce decision boundaries, they can be more robust
Motivation: Why Generative Classifier?
3
Training samples with clean labels
Training samples with noisy labels
Training samples with clean labels
• t-SNE embedding of DenseNet-100 trained on CIFAR-10 with uniform noisy labels
• Features from training samples with noisy labels (red stars) are distributed like outliers
• Features from training samples with clean labels (black dots) are still clustered!!
• If we remove the outliers and induce decision boundaries, they can be more robust
• Generative classifier: model the class-conditional data distribution P(x|y) instead of the decision boundary P(y|x)
Motivation: Why Generative Classifier?
3
Training samples with clean labels
• Given pre-trained softmax classifier with DNNs
• Inducing a generative classifier on the hidden feature space
Robust Inference via Generative Classifier
4
Penultimate layer Gaussian distribution
• Given pre-trained softmax classifier with DNNs
• Inducing a generative classifier on the hidden feature space
• Why Gaussian?
• Features of a softmax classifier can be well fitted by class-conditional Gaussian distributions [Lee’ 18]
Robust Inference via Generative Classifier
4
Penultimate layer Gaussian distribution
[Lee’ 18] A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, 2018.
• Given pre-trained softmax classifier with DNNs
• Inducing a generative classifier on the hidden feature space
Robust Inference via Generative Classifier
4
Penultimate layer
Bayes’ rule
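For reference, the posterior that Bayes’ rule yields under this class-conditional Gaussian model can be written out; this is the standard LDA form, and the symbols below (feature map f, class prior \beta_c) are notational assumptions since the slide equations were not preserved:

P(y = c \mid \mathbf{x}) = \frac{\beta_c \,\mathcal{N}\big(f(\mathbf{x});\, \mu_c, \Sigma\big)}{\sum_{c'} \beta_{c'} \,\mathcal{N}\big(f(\mathbf{x});\, \mu_{c'}, \Sigma\big)},
\qquad \hat{y}(\mathbf{x}) = \arg\max_c P(y = c \mid \mathbf{x}),

where f(x) is the penultimate-layer feature, \mu_c is the class mean, \Sigma is the (tied) covariance, and \beta_c is the class prior.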
• Given pre-trained softmax classifier with DNNs
• Inducing a generative classifier on the hidden feature space
• How to estimate the parameters of the generative classifier?
Robust Inference via Generative Classifier
4
Penultimate layer
Bayes’ rule
• Given pre-trained softmax classifier with DNNs
• Inducing a generative classifier on the hidden feature space
• How to estimate the parameters of the generative classifier?
• With training data
Robust Inference via Generative Classifier
4
Penultimate layer
Bayes’ rule
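To make the construction concrete, here is a minimal NumPy sketch of inducing an LDA-style generative classifier on precomputed penultimate-layer features and predicting via Bayes’ rule; all names are illustrative rather than the paper’s code, and the plain sample estimator is used here (the robust MCD variant follows).

import numpy as np

def fit_generative_classifier(features, labels, num_classes):
    # features: (N, D) penultimate-layer activations f(x); labels: (N,) possibly noisy labels
    dim = features.shape[1]
    means = np.zeros((num_classes, dim))
    priors = np.zeros(num_classes)
    cov = np.zeros((dim, dim))
    for c in range(num_classes):
        fc = features[labels == c]
        means[c] = fc.mean(axis=0)                      # class mean (naive sample estimate here)
        priors[c] = len(fc) / len(features)             # class prior
        cov += (fc - means[c]).T @ (fc - means[c])
    cov = cov / len(features) + 1e-6 * np.eye(dim)      # tied (shared) covariance, as in LDA
    return means, cov, priors

def predict(features, means, cov, priors):
    # Bayes' rule: P(y=c | x) is proportional to N(f(x); mu_c, Sigma) * P(y=c);
    # the argmax can ignore the normalizer shared across classes
    precision = np.linalg.inv(cov)
    scores = []
    for c in range(len(means)):
        diff = features - means[c]
        log_likelihood = -0.5 * np.einsum('nd,dk,nk->n', diff, precision, diff)
        scores.append(log_likelihood + np.log(priors[c]))
    return np.argmax(np.stack(scores, axis=1), axis=1)

Only the parameter estimation changes in the robust version described next; the prediction rule stays the same.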
• Naïve sample estimator (green circle) can be affected by outliers (i.e., noisy labels)
Minimum Covariance Determinant (MCD)
5
Naïve sample estimator
• Naïve sample estimator (green circle) can be affected by outliers (i.e., noisy labels)
• Minimum Covariance Determinant (MCD) estimator (blue circle)
Minimum Covariance Determinant (MCD)
5
MCD estimator
Naïve sample estimator
• Naïve sample estimator (green circle) can be affected by outliers (i.e., noisy labels)
• Minimum Covariance Determinant (MCD) estimator (blue circle)
• For each class c, find a subset for which the determinant of the sample covariance
matrix is minimum
Minimum Covariance Determinant (MCD)
5
“Variance ∝ determinant”
MCD estimator
Motivation of MCD
- Outliers (e.g., samples with
noisy labels) are scattered in
the feature space
• Naïve sample estimator (green circle) can be affected by outliers (i.e., noisy labels)
• Minimum Covariance Determinant (MCD) estimator (blue circle)
• For each class c, find a subset for which the determinant of the sample covariance
matrix is minimum
• Compute the mean and covariance matrix only using selected samples
Minimum Covariance Determinant (MCD)
5
Motivation of MCD
- Outliers (e.g., samples with
noisy labels) are scattered in
the feature space
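As a sketch, the per-class MCD estimate can be obtained with scikit-learn’s MinCovDet; the support_fraction value below (the fraction of samples per class assumed clean) is an illustrative hyperparameter, not a value taken from the paper.

import numpy as np
from sklearn.covariance import MinCovDet

def fit_mcd_parameters(features, labels, num_classes, support_fraction=0.75):
    # assumes each class has more samples than feature dimensions, as MCD requires
    dim = features.shape[1]
    means, cov, n_selected = [], np.zeros((dim, dim)), 0
    for c in range(num_classes):
        fc = features[labels == c]
        mcd = MinCovDet(support_fraction=support_fraction, random_state=0).fit(fc)
        means.append(mcd.location_)                 # robust class mean
        selected = fc[mcd.support_]                 # samples kept by MCD (likely clean labels)
        cov += (selected - mcd.location_).T @ (selected - mcd.location_)
        n_selected += len(selected)
    cov = cov / n_selected + 1e-6 * np.eye(dim)     # tied covariance from the selected samples only
    return np.stack(means), cov

The robust means and tied covariance plug directly into the Bayes’-rule prediction sketched earlier.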
• 1. Breakdown points
• The smallest fraction of outliers that can carry the estimate beyond all bounds.
• High breakdown points = robust to outliers
Advantages of MCD estimators
6
• 1. Breakdown points
• The smallest fraction of outliers that can carry the estimate beyond all bounds.
• High breakdown points = robust to outliers
• Theorem 1 (Lopuhaa et al., 1991)
Advantages of MCD estimators
6
Under some mild assumptions, the MCD estimator has a
near-optimal breakdown point, i.e., almost 50%
[Lopuhaa et al., 1991] Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. The Annals of Statistics, 1991.
Note: the naïve sample estimator has a 0% breakdown point
• 2. Tighter generalization error bound under noisy labels
• Theorem 2 (Lee et al., 2019)
Advantages of MCD estimators
7
[Durrant et al., 2010] Compressed Fisher linear discriminant analysis: Classification of randomly projected data. In ACM SIGKDD, 2010.
Under some mild assumptions, the parameters estimated by MCD are closer
to the true parameters than those from the sample estimator and yield a larger
inter-class distance
• 2. Tighter generalization error bound under noisy labels
• Theorem 2 (Lee et al., 2019)
Advantages of MCD estimators
7
[Durrant et al., 2010] Compressed Fisher linear discriminant analysis: Classification of randomly projected data. In ACM SIGKDD, 2010.
Theorem 3 (Durrant et al., 2010)
The generalization error of the generative classifier is bounded by a term that
decreases with the inter-class distance and increases with the distance between
the true and estimated parameters
Under some mild assumptions, the parameters estimated by MCD are closer
to the true parameters than those from the sample estimator and yield a larger
inter-class distance
• Finding the optimal solution of MCD is computationally intractable
• [Bernholt’ 06] showed that the NP-hard “Vertex Cover” problem reduces to MCD
How to Solve MCD?
8
[Bernholt’ 06] Robust estimators are hard to compute. Technical report, Universität Dortmund, SFB 475: Komplexitätsreduktion in multivariaten Datenstrukturen, 2006.
How to Solve MCD?
8
Two-step approach [Hubert’ 04]
[Hubert’ 04] Fast and robust discriminant analysis.
Computational Statistics & Data Analysis, 2004.
How to Solve MCD?
8
[Hubert’ 04] Fast and robust discriminant analysis.
Computational Statistics & Data Analysis, 2004.
Two-step approach [Hubert’ 04]
• Step 1. For each class, find a subset as follows:
• A. Uniformly sample an initial subset
& compute sample mean and covariance matrix
How to Solve MCD?
8
[Hubert’ 04] Fast and robust discriminant analysis.
Computational Statistics & Data Analysis, 2004.
• Step 1. For each class, find a subset as follows:
• A. Uniformly sample an initial subset
& compute sample mean and covariance matrix
• B. Compute the Mahalanobis distance of each sample,
d(x) = sqrt( (f(x) − μ)ᵀ Σ⁻¹ (f(x) − μ) ), w.r.t. the current subset mean μ and covariance Σ
Two-step approach [Hubert’ 04]
How to Solve MCD?
8
[Hubert’ 04] Fast and robust discriminant analysis.
Computational Statistics & Data Analysis, 2004.
• Step 1. For each class, find a subset as follows:
• A. Uniformly sample an initial subset
& compute sample mean and covariance matrix
• B. Compute the Mahalanobis distance of each sample,
d(x) = sqrt( (f(x) − μ)ᵀ Σ⁻¹ (f(x) − μ) ), w.r.t. the current subset mean μ and covariance Σ
• C. Construct a new subset which contains samples with
smaller distances
Two-step approach [Hubert’ 04]
How to Solve MCD?
8
[Hubert’ 04] Fast and robust discriminant analysis.
Computational Statistics & Data Analysis, 2004.
• Step 1. For each class, find a subset as follows:
• A. Uniformly sample an initial subset
& compute sample mean and covariance matrix
• B. Compute the Mahalanobis distance of each sample,
d(x) = sqrt( (f(x) − μ)ᵀ Σ⁻¹ (f(x) − μ) ), w.r.t. the current subset mean μ and covariance Σ
• C. Construct a new subset which contains samples with
smaller distances
• D. Update the sample mean and covariance matrix
Two-step approach [Hubert’ 04]
How to Solve MCD?
8
• Step 1. For each class, find a subset as follows:
• A. Uniformly sample an initial subset
& compute sample mean and covariance matrix
• B. Compute the Mahalanobis distance of each sample,
d(x) = sqrt( (f(x) − μ)ᵀ Σ⁻¹ (f(x) − μ) ), w.r.t. the current subset mean μ and covariance Σ
• C. Construct a new subset which contains samples with
smaller distances
• D. Update the sample mean and covariance matrix
• Repeat Step B ~ D until the determinant of covariance is
not decreasing
Two-step approach [Hubert’ 04]
[Hubert’ 04] Fast and robust discriminant analysis.
Computational Statistics & Data Analysis, 2004.
How to Solve MCD?
8
• Step 1. For each class, find a subset as follows:
• A. Uniformly sample an initial subset
& compute sample mean and covariance matrix
• B. Compute the Mahalanobis distance of each sample,
d(x) = sqrt( (f(x) − μ)ᵀ Σ⁻¹ (f(x) − μ) ), w.r.t. the current subset mean μ and covariance Σ
• C. Construct a new subset which contains samples with
smaller distances
• D. Update the sample mean and covariance matrix
• Repeat Step B ~ D until the determinant of covariance is
not decreasing
• Step 2. Compute the mean and covariance only using
selected samples
Two-step approach [Hubert’ 04]
[Hubert’ 04] Fast and robust discriminant analysis.
Computational Statistics & Data Analysis, 2004.
How to Solve MCD?
8
• Step 1. For each class, find a subset as follows:
• A. Uniformly sample an initial subset
& compute sample mean and covariance matrix
• B. Compute the Mahalanobis distance of each sample,
d(x) = sqrt( (f(x) − μ)ᵀ Σ⁻¹ (f(x) − μ) ), w.r.t. the current subset mean μ and covariance Σ
• C. Construct a new subset which contains samples with
smaller distances
• D. Update the sample mean and covariance matrix
• Repeat Step B ~ D until the determinant of covariance is
not decreasing
• Step 2. Compute the mean and covariance only using
selected samples
Two-step approach [Hubert’ 04]
[Hubert’ 04] Fast and robust discriminant analysis.
Computational Statistics & Data Analysis, 2004.
Each iteration monotonically decreases
the objective of the MCD
estimator [Hubert’ 04]!
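A minimal NumPy sketch of Step 1 for a single class (the iterative subset selection, often called C-steps); the subset size h and the iteration cap are hyperparameters, and the log-determinant replaces the raw determinant purely for numerical stability.

import numpy as np

def find_mcd_subset(features, h, seed=0, max_iter=100):
    # features: (N, D) features of one class; h < N is the number of samples to keep
    rng = np.random.default_rng(seed)
    n, dim = features.shape
    subset = rng.choice(n, size=h, replace=False)                 # A. random initial subset
    best_logdet, best_subset = np.inf, subset
    for _ in range(max_iter):
        mean = features[subset].mean(axis=0)                      # sample mean of the current subset
        cov = np.cov(features[subset], rowvar=False) + 1e-6 * np.eye(dim)
        logdet = np.linalg.slogdet(cov)[1]
        if logdet >= best_logdet:                                 # repeat B–D until the determinant stops decreasing
            break
        best_logdet, best_subset = logdet, subset
        precision = np.linalg.inv(cov)
        diff = features - mean
        dist = np.einsum('nd,dk,nk->n', diff, precision, diff)    # B. squared Mahalanobis distances
        subset = np.argsort(dist)[:h]                             # C./D. keep the h closest samples and update
    return best_subset                                            # Step 2 recomputes mean/covariance on this subset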
Summary
9
• Given pre-trained softmax classifier with DNNs
• Inducing a generative classifier on the hidden feature space
Penultimate layer
MCD estimator
Is it possible to utilize the low-level features?
Ensemble of Generative Classifiers
9
• Boosting the performance: utilizing low-level features
• Inducing the generative classifiers with respect to low-level features
• Why?
• Low-level features can often be more robust to noisy labels [Morcos’ 18, Arpit’ 17]
• They capture similar patterns from the input data
[Arpit’ 17] A closer look at memorization in deep networks. In ICML, 2017.
[Morcos’ 18] Insights on representational similarity in neural networks with canonical correlation. In NeurIPS, 2018.
Ensemble of Generative Classifiers
9
• Boosting the performance: utilizing low-level features
• Inducing the generative classifiers with respect to low-level features
• Ensemble of generative classifiers
Posterior distribution from ℓ-th layer
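A sketch of the ensemble step: each chosen layer ℓ gets its own MCD-based generative classifier, and their per-layer posteriors are combined. Uniform weighting is used here for illustration; the paper’s exact layer weighting may differ.

import numpy as np

def ensemble_predict(per_layer_log_posteriors, weights=None):
    # per_layer_log_posteriors: list of (N, C) arrays, one per selected layer
    num_layers = len(per_layer_log_posteriors)
    if weights is None:
        weights = np.full(num_layers, 1.0 / num_layers)   # illustrative uniform weights
    stacked = np.stack(per_layer_log_posteriors)          # (L, N, C)
    combined = np.tensordot(weights, stacked, axes=1)     # weighted sum over layers -> (N, C)
    return np.argmax(combined, axis=1)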
Outline
• Our method: Robust Inference via Generative Classifiers
• Generative classifier
• Minimum covariance determinant estimator
• Ensemble of generative classifiers
• Experiments
• Experimental results on synthetic noisy labels
• Experimental results on semantic and open-set noisy labels
• Conclusion
• Model: DenseNet-100 [Huang’ 17] and ResNet-34 [He’ 16]
• Image classification on CIFAR-10, CIFAR-100 [Krizhevsky’ 09] and SVHN [Netzer’ 11]
Experiments: Setup
10
[Huang’ 17] Densely connected convolutional networks. In CVPR, 2017.
[Krizhevsky’ 09] Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[Netzer’ 11] Reading digits in natural images with unsupervised feature learning. In NeurIPS workshop, 2011.
[He’ 16] Deep residual learning for image recognition. In CVPR, 2016.
CIFAR-10 / CIFAR-100
• 32 × 32 RGB
• 10 / 100 classes
• 50,000 training images
• 10,000 test images
SVHN
• 32 × 32 RGB
• 10 classes
• 73,257 training images
• 26,032 test images
• Model: DenseNet-100 [Huang’ 17] and ResNet-34 [He’ 16]
• Image classification on CIFAR-10, CIFAR-100 [Krizhevsky’ 09] and SVHN [Netzer’ 11]
• NLP tasks on Twitter [Gimpel’ 11] and Reuters [Lewis’ 04]
Experiments: Setup
10
Part-of-speech tagging on Twitter
• 19 classes
• 14,468 tweets for training
• 7,082 tweets for test
[Gimpel’ 11] Part-of-speech tagging for twitter: Annotation, features, and experiments. In ACL, 2011
[Lewis’ 04] Rcv1: A new benchmark collection for text categorization research. JMLR, 2004
Text categorization on Reuters
• 7 classes
• 5,444 training data
• 2,179 test data
• Model: DenseNet-100 [Huang’ 17] and ResNet-34 [He’ 16]
• Image classification on CIFAR-10, CIFAR-100 [Krizhevsky’ 09] and SVHN [Netzer’ 11]
• NLP tasks on Twitter [Gimpel’ 11] and Reuters [Lewis’ 04]
• Noise type (a small sketch of the injection procedure follows this slide)
• Uniform: corrupting a label to another class chosen uniformly at random
• Flip: corrupting a label only to one specific class
Experiments: Setup
10
[Illustrations of uniform noise and flip noise with Deer / Dog / Frog examples]
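As referenced above, a small sketch of how the synthetic uniform and flip noise could be injected; the flip target (y + 1 mod C) is an illustrative fixed pairing rather than the exact mapping used in the paper.

import numpy as np

def corrupt_labels(labels, num_classes, noise_fraction, noise_type="uniform", seed=0):
    rng = np.random.default_rng(seed)
    noisy = np.array(labels).copy()
    n_corrupt = int(noise_fraction * len(noisy))
    corrupt_idx = rng.choice(len(noisy), size=n_corrupt, replace=False)
    for i in corrupt_idx:
        if noise_type == "uniform":
            # replace with another class chosen uniformly at random
            candidates = [c for c in range(num_classes) if c != noisy[i]]
            noisy[i] = rng.choice(candidates)
        else:
            # flip: always map to one specific target class (illustrative pairing)
            noisy[i] = (noisy[i] + 1) % num_classes
    return noisy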
• Test set accuracy of ResNet-34 trained on CIFAR-10 with 60% uniform noise
Experiments: Empirical Analysis
11
• Test set accuracy of ResNet-34 trained on CIFAR-10 with 60% uniform noise
• MCD estimator improves the performance by removing outliers
Experiments: Empirical Analysis
11
• Test set accuracy of ResNet-34 trained on CIFAR-10 with 60% uniform noise
• MCD estimator improves the performance by removing outliers
• MCD estimator can select clean data samples more effectively than sample estimator!
Experiments: Empirical Analysis
11
[The fraction of clean samples]
• Test set accuracy of ResNet-34 trained on CIFAR-10 with 60% uniform noise
• MCD estimator improves the performance by removing outliers
• Ensemble method significantly improves the classification accuracy
Experiments: Empirical Analysis
11
• Test set accuracy of ResNet-34 trained on CIFAR-10 with 60% uniform noise
• MCD estimator improves the performance by removing outliers
• Ensemble method significantly improves the classification accuracy
• Generative classifiers from lower layers are more robust to noisy labels
• Generative classifier does not decrease the classification accuracy on clean dataset
Experiments: Empirical Analysis
11
[Figure: layer-wise test set accuracy (%) vs. basic block index (2–14) under clean labels and 20% / 40% / 60% uniform noise]
• Test set accuracy of ResNet-44 trained on CIFAR-10 with 60% uniform noise
Comparison with Prior Training Methods
12
[Reed’ 14] Training deep neural networks on noisy labels with bootstrapping. arXiv preprint 2014.
[Ma’ 18] Dimensionality-driven learning with noisy labels. In ICML, 2018
[Patrini’ 17] Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017
• Utilizing an estimated/corrected label
• Bootstrap [Reed’ 14]
• Forward [Patrini’ 17]
• D2L [Ma’ 18]
[Bar chart: test set accuracy (%) of the softmax classifier vs. Generative + MCD (ours) under cross entropy, Bootstrap, Forward, and D2L training]
• Training methods with an ensemble of
classifiers or meta-learning model
• Model: 9-layer CNNs
• Dataset: CIFAR-100
• Noise: 45% Flip noise
Comparison with Prior Training Methods
12
[Bar chart: test set accuracy (%) — cross entropy, MentorNet, Co-teaching, Co-teaching + ours]
• Training methods utilizing clean labels
on NLP datasets
• Model: 2-layer FCNs
• Dataset: Twitter
• Noise: 60% uniform noise
[Bar chart: test set accuracy (%) — cross entropy, Forward (gold), GLC, GLC + ours]
• Semantic noisy labels from a weak machine labeler
Experiments: Machine Noisy Labels
13
[Diagram: a classifier trained on a small number of labeled data]
• Semantic noisy labels from a weak machine labeler
• Confusion graph from ResNet-34 trained on 5% of
CIFAR-10 labels
Experiments: Machine Noisy Labels
13
[Diagram: a pre-trained (weak) classifier assigns pseudo-labels to the unlabeled data]
* Node: class, Edge: its most confusing class
[Bar chart: test set accuracy (%) of softmax vs. Generative + MCD (ours) under cross entropy, Bootstrap, Forward, and D2L with machine-generated noisy labels]
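A sketch of how such machine-generated (semantic) noisy labels could be produced: a weak classifier is first trained on the small labeled subset and then used to pseudo-label the remaining data. Here weak_model and unlabeled_loader are placeholders for a PyTorch model and DataLoader, not objects from the paper’s code.

import torch

@torch.no_grad()
def pseudo_label(weak_model, unlabeled_loader, device="cuda"):
    # Label the unlabeled pool with a classifier trained on ~5% of the true labels;
    # the resulting labels are the "semantic" noisy labels used in this experiment.
    weak_model.eval()
    pseudo = []
    for images in unlabeled_loader:
        logits = weak_model(images.to(device))
        pseudo.append(logits.argmax(dim=1).cpu())
    return torch.cat(pseudo)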
• What are open-set noisy labels?
• Noisy samples from out-of-distribution [Wang’ 18]
• E.g., “Cat” in CIFAR-10
Experiments: Open-set Noisy Labels
[Wang’ 18] Iterative learning with open-set noisy labels. In CVPR, 2018.
14
• What are open-set noisy labels?
• Noisy samples from out-of-distribution [Wang’ 18]
• E.g., “Cat” in CIFAR-10 (which does not contain “apple” and “chair”)
Experiments: Open-set Noisy Labels
[Wang’ 18] Iterative learning with open-set noisy labels. In CVPR, 2018.
14
Open-set noisy labels
• What are open-set noisy labels?
• Noisy samples from out-of-distribution [Wang’ 18]
• E.g., “Cat” in CIFAR-10 (which does not contain “apple” and “chair”)
Experiments: Open-set Noisy Labels
[Wang’ 18] Iterative learning with open-set noisy labels. In CVPR, 2018.
14
• Experimental setup
• In-distribution: CIFAR-10
• 60% of noise samples from
ImageNet and CIFAR-100
• Model: DenseNet-100
Open-set noisy labels
[Test accuracy (%) of DenseNet-100 on CIFAR-10 with open-set noisy labels]
• To handle noisy labels, we proposed a new inference method with three components (summarized below)
• We believe that our results can be useful for many machine learning problems:
• Defense against adversarial attacks
• Detecting out-of-distribution samples
• Poster session: Pacific Ballroom #16
Conclusion
15
New inference method:
- Generative classifier: LDA-based generative classifier
- Robust inference: MCD estimator, generalization error analysis
- Ensemble method: generative classifiers from multiple layers
Thank you for your attention!
Más contenido relacionado

Similar a Robust inference via generative classifiers for handling noisy labels

Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit vmalathieswaran29
 
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...Daniel Roggen
 
Talk@rmit 09112017
Talk@rmit 09112017Talk@rmit 09112017
Talk@rmit 09112017Shuai Zhang
 
Revisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingRevisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingLionel Briand
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)Pravinkumar Landge
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learningAkshay Kanchan
 
To bag, or to boost? A question of balance
To bag, or to boost? A question of balanceTo bag, or to boost? A question of balance
To bag, or to boost? A question of balanceAlex Henderson
 
Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Dalei Li
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptxJK970901
 
MLSEV Virtual. Searching for Anomalies
MLSEV Virtual. Searching for AnomaliesMLSEV Virtual. Searching for Anomalies
MLSEV Virtual. Searching for AnomaliesBigML, Inc
 
How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?Tuan Yang
 
intership summary
intership summaryintership summary
intership summaryJunting Ma
 
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique Sujeet Suryawanshi
 
Survey of Recommendation Systems
Survey of Recommendation SystemsSurvey of Recommendation Systems
Survey of Recommendation Systemsyoualab
 
Barga Data Science lecture 5
Barga Data Science lecture 5Barga Data Science lecture 5
Barga Data Science lecture 5Roger Barga
 
Jillian ms defense-4-14-14-ja
Jillian ms defense-4-14-14-jaJillian ms defense-4-14-14-ja
Jillian ms defense-4-14-14-jaJillian Aurisano
 
Types of Machine Learnig Algorithms(CART, ID3)
Types of Machine Learnig Algorithms(CART, ID3)Types of Machine Learnig Algorithms(CART, ID3)
Types of Machine Learnig Algorithms(CART, ID3)Fatimakhan325
 
Probability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional ExpertsProbability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional ExpertsChirag Gupta
 
[CVPR2022, LongVersion] Online Continual Learning on a Contaminated Data Stre...
[CVPR2022, LongVersion] Online Continual Learning on a Contaminated Data Stre...[CVPR2022, LongVersion] Online Continual Learning on a Contaminated Data Stre...
[CVPR2022, LongVersion] Online Continual Learning on a Contaminated Data Stre...Jihwan Bang
 

Similar a Robust inference via generative classifiers for handling noisy labels (20)

Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit v
 
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
Wearable Computing - Part IV: Ensemble classifiers & Insight into ongoing res...
 
Talk@rmit 09112017
Talk@rmit 09112017Talk@rmit 09112017
Talk@rmit 09112017
 
Revisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software TestingRevisiting the Notion of Diversity in Software Testing
Revisiting the Notion of Diversity in Software Testing
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
To bag, or to boost? A question of balance
To bag, or to boost? A question of balanceTo bag, or to boost? A question of balance
To bag, or to boost? A question of balance
 
Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...Two strategies for large-scale multi-label classification on the YouTube-8M d...
Two strategies for large-scale multi-label classification on the YouTube-8M d...
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptx
 
MLSEV Virtual. Searching for Anomalies
MLSEV Virtual. Searching for AnomaliesMLSEV Virtual. Searching for Anomalies
MLSEV Virtual. Searching for Anomalies
 
How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?How Machine Learning Helps Organizations to Work More Efficiently?
How Machine Learning Helps Organizations to Work More Efficiently?
 
DM_clustering.ppt
DM_clustering.pptDM_clustering.ppt
DM_clustering.ppt
 
intership summary
intership summaryintership summary
intership summary
 
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
NSL KDD Cup 99 dataset Anomaly Detection using Machine Learning Technique
 
Survey of Recommendation Systems
Survey of Recommendation SystemsSurvey of Recommendation Systems
Survey of Recommendation Systems
 
Barga Data Science lecture 5
Barga Data Science lecture 5Barga Data Science lecture 5
Barga Data Science lecture 5
 
Jillian ms defense-4-14-14-ja
Jillian ms defense-4-14-14-jaJillian ms defense-4-14-14-ja
Jillian ms defense-4-14-14-ja
 
Types of Machine Learnig Algorithms(CART, ID3)
Types of Machine Learnig Algorithms(CART, ID3)Types of Machine Learnig Algorithms(CART, ID3)
Types of Machine Learnig Algorithms(CART, ID3)
 
Probability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional ExpertsProbability density estimation using Product of Conditional Experts
Probability density estimation using Product of Conditional Experts
 
[CVPR2022, LongVersion] Online Continual Learning on a Contaminated Data Stre...
[CVPR2022, LongVersion] Online Continual Learning on a Contaminated Data Stre...[CVPR2022, LongVersion] Online Continual Learning on a Contaminated Data Stre...
[CVPR2022, LongVersion] Online Continual Learning on a Contaminated Data Stre...
 

Último

VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 

Último (20)

VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 

Robust inference via generative classifiers for handling noisy labels

  • 1. Robust Inference via Generative Classifiers for Handling Noisy Labels 1 Korea Advanced Institute of Science and Technology (KAIST) 2 University of Michigan 3University of Illinois at Urbana Champaign 4Google Brain 5AItrics ICML 2019 Kimin Lee1 Sukmin Yun1 Kibok Lee2 Honglak Lee4,2 Bo Li3 Jinwoo Shin1,5
  • 2. • Large-scale datasets collect class labels from • Data mining on social media and web data Introduction: Noisy Labels 1 Web-crawled dataset [Krause’ 16] [Mahajan’ 18] Exploring the Limits of Weakly Supervised Pretraining. In ECCV, 2018. [Krause’ 16] The unreasonable effectiveness of noisy data for fine-grained recognition. In ECCV 2016. [Kuznetsova’ 18] The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv preprint 2018. Open Images dataset [Kuznetsova’ 18] FaceBook’s Instagram dataset [Mahajan’ 18]
  • 3. • Large-scale datasets collect class labels from • Data mining on social media and web data • Large-scale datasets may contain noisy (incorrect) labels Introduction: Noisy Labels 1 FaceBook’s Instagram dataset [Mahajan’ 18] Cat Human (noisy label) Human (noisy label) Cat Cat Cat
  • 4. • Large-scale datasets collect class labels from • Data mining on social media and web data • Large-scale datasets may contain noisy (incorrect) labels • DNNs do not generalize well from such noisy datasets Introduction: Noisy Labels 1
  • 5. • Large-scale datasets collect class labels from • Data mining on social media and web data • Large-scale datasets may contain noisy (incorrect) labels • DNNs do not generalize well from such noisy datasets Introduction: Noisy Labels [Test set accuracy] DenseNet-100 Testsetaccuracy(%) 50 60 70 80 90 100 Noise fraction 0 0.2 0.4 0.6 [Huang’ 17] Densely connected convolutional networks. In CVPR, 2017. [Krizhevsky’ 09] Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. 1 DeerDog Frog [Uniform noise] …• Dataset: CIFAR-10 [Krizhevsky’ 09] • Model: DenseNet-100 [Huang’ 17]
  • 6. • Large-scale datasets collect class labels from • Data mining on social media and web data • Large-scale datasets may contain noisy (incorrect) labels • DNNs do not generalize well from such noisy datasets • Several training strategies have also been investigated Introduction: Noisy Labels 1 • Utilizing an estimated/corrected label • Bootstrapping [Reed’ 14; Ma’ 18] • Loss correction [Patrini’ 17; Hendrycks’ 18] • Training on selected (cleaner) samples • Ensemble [Malach’ 17; Han’ 18] • Meta-learning [Jiang’ 18] [Reed’ 14] Training deep neural networks on noisy labels with bootstrapping. arXiv preprint 2014. [Hendrycks’ 18] Using trusted data to train deep networks on labels corrupted by severe noise. In NeurIPS, 2018 [Ma’ 18] Dimensionality-driven learning with noisy labels. In ICML, 2018 [Partrini’ 17] Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017 [Han’ 18] Co-teaching: robust training deep neural networks with extremely noisy labels. In NeurIPS, 2018. [Jiang’ 18] Mentornet: Regularizing very deep neural networks on corrupted labels. In ICML, 2018. [Malach ‘ 17] Decoupling” when to update” from” how to update”. In NeurIPS, 2017.
  • 7. • We propose a new inference method which can be applied to any pre-trained DNNs Our Contributions 2
  • 8. • We propose a new inference method which can be applied to any pre-trained DNNs Our Contributions 2 Softmax Generative (sample mean on noisy labels) Testsetaccuracy(%) 40 50 60 70 80 90 100 Noise fraction 0 0.2 0.4 0.6 • Inducing a “generative classifier”
  • 9. Our Contributions 2 Softmax Generative (sample mean on noisy labels) Generative (MCD on noisy labels) Testsetaccuracy(%) 40 50 60 70 80 90 100 Noise fraction 0 0.2 0.4 0.6 • We propose a new inference method which can be applied to any pre-trained DNNs • Inducing a “generative classifier” • Applying a “robust inference” to estimate parameters of generative classifier • Breakdown points • Generalization bounds
  • 10. Our Contributions 2 • Inducing a “generative classifier” • Applying a “robust inference” to estimate parameters of generative classifier • Breakdown points • Generalization bounds • Introducing “ensemble of generative classifiers” Softmax Generative (sample mean on noisy labels) Generative (MCD on noisy labels) Generative (MCD on noisy labels) + ensemble Testsetaccuracy(%) 40 50 60 70 80 90 100 Noise fraction 0 0.2 0.4 0.6 • We propose a new inference method which can be applied to any pre-trained DNNs
  • 11. Outline • Our method: Robust Inference via Generative Classifiers • Generative classifier • Minimum covariance determinant estimator • Ensemble of generative classifiers • Experiments • Experimental results on synthetic noisy labels • Experimental results on semantic and open-set noisy labels • Conclusion
  • 12. Outline • Our method: Robust Inference via Generative Classifiers • Generative classifier • Minimum covariance determinant estimator • Ensemble of generative classifiers • Experiments • Experimental results on synthetic noisy labels • Experimental results on semantic and open-set noisy labels • Conclusion
  • 13. • t-SNE embedding of DenseNet-100 trained on CIFAR-10 with uniform noisy labels Motivation: Why Generative Classifier? 3 Training samples with clean labels Training samples with noisy labels
  • 14. • t-SNE embedding of DenseNet-100 trained on CIFAR-10 with uniform noisy labels • Features from training samples with noisy labels (red stars) are distributed like outliers Motivation: Why Generative Classifier? 3 Training samples with clean labels Training samples with noisy labels
  • 15. • t-SNE embedding of DenseNet-100 trained on CIFAR-10 with uniform noisy labels • Features from training samples with noisy labels (red stars) are distributed like outliers • Features from training samples with clean labels (black dots) are still clustered!! Motivation: Why Generative Classifier? 3 Training samples with clean labels Training samples with noisy labels
  • 16. • t-SNE embedding of DenseNet-100 trained on CIFAR-10 with uniform noisy labels • Features from training samples with noisy labels (red stars) are distributed like outliers • Features from training samples with clean labels (black dots) are still clustered!! • If we remove the outliers and induce decision boundaries, they can be more robust Motivation: Why Generative Classifier? 3 Training samples with clean labels Training samples with noisy labels Training samples with clean labels
  • 17. • t-SNE embedding of DenseNet-100 trained on CIFAR-10 with uniform noisy labels • Features from training samples with noisy labels (red stars) are distributed like outliers • Features from training samples with clean labels (black dots) are still clustered!! • If we remove the outliers and induce decision boundaries, they can be more robust • Generative classifier: model of a data distribution instead of Motivation: Why Generative Classifier? 3 Training samples with clean labels
  • 18. • Given pre-trained softmax classifier with DNNs • Inducing a generative classifier on the hidden feature space Robust Inference via Generative Classifier 4 Penultimate layer Gaussian distribution
  • 19. • Given pre-trained softmax classifier with DNNs • Inducing a generative classifier on the hidden feature space • Why Gaussian? • Features from softmax classifier could be well-fitted in Gaussian distributions [Lee’ 18] Robust Inference via Generative Classifier 4 Penultimate layer Gaussian distribution [Lee’ 18] A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, 2018.
  • 20. • Given pre-trained softmax classifier with DNNs • Inducing a generative classifier on the hidden feature space Robust Inference via Generative Classifier 4 Penultimate layer Bayes’ rule
  • 21. • Given pre-trained softmax classifier with DNNs • Inducing a generative classifier on the hidden feature space • How to estimate the parameters of the generative classifier? Robust Inference via Generative Classifier 4 Penultimate layer Bayes’ rule
  • 22. • Given pre-trained softmax classifier with DNNs • Inducing a generative classifier on the hidden feature space • How to estimate the parameters of the generative classifier? • With training data Robust Inference via Generative Classifier 4 Penultimate layer Bayes’ rule
  • 23. • Naïve sample estimator (green circle) can be affected by outliers (i.e., noisy labels) Minimum Covariance Determinant (MCD) 5 Naïve sample estimator
  • 24. • Naïve sample estimator (green circle) can be affected by outliers (i.e., noisy labels) • Minimum Covariance Determinant (MCD) estimator (blue circle) Minimum Covariance Determinant (MCD) 5 MCD estimator Naïve sample estimator
  • 25. • Naïve sample estimator (green circle) can be affected by outliers (i.e., noisy labels) • Minimum Covariance Determinant (MCD) estimator (blue circle) • For each class c, find a subset for which the determinant of the sample covariance matrix is minimum Minimum Covariance Determinant (MCD) 5 “Variance ∝ determinant” MCD estimator Motivation of MCD - Outliers (e.g., sample with noisy labels) are scattered in the sample spaces
  • 26. • Naïve sample estimator (green circle) can be affected by outliers (i.e., noisy labels) • Minimum Covariance Determinant (MCD) estimator (blue circle) • For each class c, find a subset for which the determinant of the sample covariance matrix is minimum • Compute the mean and covariance matrix only using selected samples Minimum Covariance Determinant (MCD) 5 Motivation of MCD - Outliers (e.g., sample with noisy labels) are scattered in the sample spaces
  • 27. • 1. Breakdown points • The smallest fraction of outliers to carry the estimate beyond all bounds. • High breakdown points = robust to outliers Advantages of MCD estimators 6
  • 28. • 1. Breakdown points • The smallest fraction of outliers to carry the estimate beyond all bounds. • High breakdown points = robust to outliers • Theorem 1 (Lopuhaa et al., 1991) Advantages of MCD estimators 6 Under some mild assumptions, MCD estimator has near-optimal breakdown points, i.e., almost 50 % [Lopuhaa et al., 1991] Breakdown points of affine equivariant estimators of multivariate location and covariance matrices. The Annals of Statistics, 1991. Note: Naïve sample estimator has 0% breakdown points
  • 29. • 2. Tighter generalization errors under noisy labels • Theorem 2 (Lee et al., 2019) • Where Advantages of MCD estimators 7 [Durrant et al., 2010] A. Compressed fisher linear discriminant analysis: Classification of randomly projected data. In ACM SIGKDD, 2010. Under some mild assumptions, parameters from MCD estimator are more closer to true parameters than parameters from sample estimator and has larger inter- class distance
  • 30. • 2. Tighter generalization errors under noisy labels • Theorem 2 (Lee et al., 2019) Advantages of MCD estimators 7 [Durrant et al., 2010] A. Compressed fisher linear discriminant analysis: Classification of randomly projected data. In ACM SIGKDD, 2010. Theorem 3 (Durrant et al., 2010) Generalization error of generative classifier is bounded by negative of inter-class distance and distance between true and estimated parameters Under some mild assumptions, parameters from MCD estimator are more closer to true parameters than parameters from sample estimator and has larger inter- class distance
  • 31. • Finding the optimal solution of MCD is computationally intractable • [Bernholt’ 06] showed “Vertex Cover” (NP-hard) reduces to MCD How to Solve MCD? 8 [Bernholt’ 06] Robust estimators are hard to compute. Technical report, Technical Report/Universit at Dortmund, SFB 475 Komplexitatsreduktion in Multivariaten Datenstrukturen, 2006.
  • 32. How to Solve MCD? 8 Two-step approach [Hubert’ 04] [Hubert’ 04] Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 2004.
  • 33. How to Solve MCD? 8 [Hubert’ 04] Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 2004. Two-step approach [Hubert’ 04] • Step 1. For each class, find a subset as follows: • A. Uniformly sample an initial subset & compute sample mean and covariance matrix
  • 34. How to Solve MCD? 8 [Hubert’ 04] Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 2004. • Step 1. For each class, find a subset as follows: • A. Uniformly sample an initial subset & compute sample mean and covariance matrix • B. Compute the Mahalanobis distance • ‘ Two-step approach [Hubert’ 04]
  • 35. How to Solve MCD? 8 [Hubert’ 04] Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 2004. • Step 1. For each class, find a subset as follows: • A. Uniformly sample an initial subset & compute sample mean and covariance matrix • B. Compute the Mahalanobis distance • ‘ • C. Construct a new subset which contains samples with smaller distances Two-step approach [Hubert’ 04]
  • 36. How to Solve MCD? 8 [Hubert’ 04] Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 2004. • Step 1. For each class, find a subset as follows: • A. Uniformly sample an initial subset & compute sample mean and covariance matrix • B. Compute the Mahalanobis distance • ‘ • C. Construct a new subset which contains samples with smaller distances • D. Update the sample mean and covariance matrix Two-step approach [Hubert’ 04]
  • 37. How to Solve MCD? 8 • Step 1. For each class, find a subset as follows: • A. Uniformly sample an initial subset & compute sample mean and covariance matrix • B. Compute the Mahalanobis distance • ‘ • C. Construct a new subset which contains samples with smaller distances • D. Update the sample mean and covariance matrix • Repeat Step B ~ D until the determinant of covariance is not decreasing Two-step approach [Hubert’ 04] [Hubert’ 04] Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 2004.
  • 38. How to Solve MCD? 8 • Step 1. For each class, find a subset as follows: • A. Uniformly sample an initial subset & compute sample mean and covariance matrix • B. Compute the Mahalanobis distance • ‘ • C. Construct a new subset which contains samples with smaller distances • D. Update the sample mean and covariance matrix • Repeat Step B ~ D until the determinant of covariance is not decreasing • Step 2. Compute the mean and covariance only using selected samples Two-step approach [Hubert’ 04] [Hubert’ 04] Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 2004.
  • 39. How to Solve MCD? 8 • Step 1. For each class, find a subset as follows: • A. Uniformly sample an initial subset & compute sample mean and covariance matrix • B. Compute the Mahalanobis distance • ‘ • C. Construct a new subset which contains samples with smaller distances • D. Update the sample mean and covariance matrix • Repeat Step B ~ D until the determinant of covariance is not decreasing • Step 2. Compute the mean and covariance only using selected samples Two-step approach [Hubert’ 04] [Hubert’ 04] Fast and robust discriminant analysis. Computational Statistics & Data Analysis, 2004. Monotonically decreasing a objective of MCD estimator [Hubert’ 04] !
  • 40. Summary 9 • Given pre-trained softmax classifier with DNNs • Inducing a generative classifier on the hidden feature space Penultimate layer MCD estimator Is it possible to utilize the low-level features
  • 41. Ensemble of Generative Classifiers 9 • Boosting the performance: utilizing low-level features • Inducing the generative classifiers with respect to low-level features • Why? • Low-level features can be often more robust to noisy labels [Morcos’ 18, Arpit’ 17] • They can capture the similar patterns from the input data [Arpit’ 17] A closer look at memorization in deep networks. In ICML, 2017. [Morcos’ 18] Insights on representational similarity in neural networks with canonical correlation. In NeurIPS, 2018.
  • 42. Ensemble of Generative Classifiers 9 • Boosting the performance: utilizing low-level features • Inducing the generative classifiers with respect to low-level features • Ensemble of generative classifiers Posterior distribution from ℓ-th layer
  • 43. Outline • Our method: Robust Inference via Generative Classifiers • Generative classifier • Minimum covariance determinant estimator • Ensemble of generative classifiers • Experiments • Experimental results on synthetic noisy labels • Experimental results on semantic and open-set noisy labels • Conclusion
  • 44. • Model: DenseNet-100 [Huang’ 17] and ResNet-34 [He’ 16] • Image classification on CIFAR-10, CIFAR-100 [Krizhevsky’ 09] and SVHN [Netzer’ 11] Experiments: Setup 10 [Huang’ 17] Densely connected convolutional networks. In CVPR, 2017. [Krizhevsky’ 09] Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. [Netzer’ 11] Reading digits in natural images with unsupervised feature learning. In NeurIPS workshop, 2011. [He’ 16] Deep residual learning for image recognition. In CVPR, 2016. • 32 × 32 RGB • 10/ 100 classes • 50,000 training set • 10,000 test set • 32 × 32 RGB • 10 classes • 73,257 training set • 26,032 test set
  • 45. • Model: DenseNet-100 [Huang’ 17] and ResNet-34 [He’ 16] • Image classification on CIFAR-10, CIFAR-100 [Krizhevsky’ 09] and SVHN [Netzer’ 11] • NLP tasks on Tweeter [Gimpel’ 11] and Reuters [Lewis’ 04] Experiments: Setup 10 Parts-of-Speech Tagging using Tweets • 19 classes • 14,468 tweets for training • 7,082 tweets for test [Gimpel’ 11] Part-of-speech tagging for twitter: Annotation, features, and experiments. In ACL, 2011 [Lewis’ 04] Rcv1: A new benchmark collection for text categorization research. JMLR, 2004 Text categorization on Reuters • 7 classes • 5,444 training data • 2,179 test data
  • 46. • Model: DenseNet-100 [Huang’ 17] and ResNet-34 [He’ 16] • Image classification on CIFAR-10, CIFAR-100 [Krizhevsky’ 09] and SVHN [Netzer’ 11] • NLP tasks on Tweeter [Gimpel’ 11] and Reuters [Lewis’ 04] • Noise type • Uniform: corrupting a label to other class uniformly at random • Flip: corrupting a label only to a specific class Experiments: Setup 10 Deer Dog Frog [Uniform noise] … Deer Dog Frog [Flip noise] …
  • 47. • Test set accuracy of ResNet-34 trained on CIFAR-10 with 60% uniform noise Experiments: Empirical Analysis 11
  • 48. • Test set accuracy of ResNet-34 trained on CIFAR-10 with 60% uniform noise • MCD estimator improves the performance by removing outliers Experiments: Empirical Analysis 11
  • 49. • Test set accuracy of ResNet-34 trained on CIFAR-10 with 60% uniform noise • MCD estimator improves the performance by removing outliers • MCD estimator can select clean data samples more effectively than the sample estimator! Experiments: Empirical Analysis 11 [The fraction of clean samples]
  • 50. • Test set accuracy of ResNet-34 trained on CIFAR-10 with 60% uniform noise • MCD estimator improves the performance by removing outliers • Ensemble method significantly improves the classification accuracy Experiments: Empirical Analysis 11
  • 51. • Test set accuracy of ResNet-34 trained on CIFAR-10 with 60% uniform noise • MCD estimator improves the performance by removing outliers • Ensemble method significantly improves the classification accuracy • Generative classifiers from lower layers are more robust to noisy labels • The generative classifier does not decrease the classification accuracy on the clean dataset Experiments: Empirical Analysis 11 [Layer-wise accuracy: test set accuracy (%) vs. index of basic block, for clean data and 20/40/60% uniform noise]
  • 52. Comparison with Prior Training Methods 12 • Test set accuracy of ResNet-44 trained on CIFAR-10 with 60% uniform noise • Utilizing an estimated/corrected label • Bootstrap [Reed’ 14] • Forward [Patrini’ 17] • D2L [Ma’ 18] [Bar chart: test set accuracy (%), range 60–80, comparing Cross entropy, Bootstrap, Forward, and D2L (softmax) with Generative + MCD (ours)] [Reed’ 14] Training deep neural networks on noisy labels with bootstrapping. arXiv preprint 2014. [Ma’ 18] Dimensionality-driven learning with noisy labels. In ICML, 2018. [Patrini’ 17] Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
  • 53. Comparison with Prior Training Methods 12 • Training methods with an ensemble of classifiers or a meta-learning model • Model: 9-layer CNNs • Dataset: CIFAR-100 • Noise: 45% flip noise [Bar chart: test set accuracy (%), range 25–50, comparing Cross entropy, MentorNet, Co-teaching, and Co-teaching + ours] • Training methods utilizing clean labels on NLP datasets • Model: 2-layer FCNs • Dataset: Twitter • Noise: 60% uniform noise [Bar chart: test set accuracy (%), range 45–75, comparing Cross entropy, Forward (gold), GLC, and GLC + ours]
  • 54. Experiments: Machine Noisy Labels 13 • Semantic noisy labels from a weak machine labeler [Diagram: a classifier is trained on a small amount of labeled data]
  • 55. Experiments: Machine Noisy Labels 13 • Semantic noisy labels from a weak machine labeler: a pre-trained (weak) classifier assigns pseudo-labels to the unlabeled data • Confusion graph from ResNet-34 trained on 5% of CIFAR-10 labels (* node: class, edge: its most confusing class) [Bar chart: test set accuracy (%), range 66.0–69.0, comparing Cross entropy, Bootstrap, Forward, and D2L (softmax) with Generative + MCD (ours)]
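A minimal sketch of this weak-labeler setting follows; the names (`weak_model`, `small_x`, `small_y`, `unlabeled_x`) are placeholders, and `weak_model` is assumed to expose scikit-learn-style `fit`/`predict` methods.

```python
# A sketch of generating semantic (machine) noisy labels with a weak labeler.
# A classifier trained on a small labeled subset (e.g. 5% of CIFAR-10 labels)
# pseudo-labels the remaining data, producing semantically structured label noise.
def build_machine_labeled_set(weak_model, small_x, small_y, unlabeled_x):
    weak_model.fit(small_x, small_y)                 # train the weak classifier
    pseudo_labels = weak_model.predict(unlabeled_x)  # its mistakes become the noisy labels
    return unlabeled_x, pseudo_labels
```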
  • 56. Experiments: Open-set Noisy Labels 14 • What are open-set noisy labels? • Noisy samples from out-of-distribution data [Wang’ 18] • E.g., “Cat” in CIFAR-10 [Wang’ 18] Iterative learning with open-set noisy labels. In CVPR, 2018.
  • 57. Experiments: Open-set Noisy Labels 14 • What are open-set noisy labels? • Noisy samples from out-of-distribution data [Wang’ 18] • E.g., “Cat” in CIFAR-10 (which does not contain “apple” and “chair”) [Wang’ 18] Iterative learning with open-set noisy labels. In CVPR, 2018. [Illustration: open-set noisy labels]
  • 58. Experiments: Open-set Noisy Labels 14 • What are open-set noisy labels? • Noisy samples from out-of-distribution data [Wang’ 18] • E.g., “Cat” in CIFAR-10 (which does not contain “apple” and “chair”) • Experimental setup • In-distribution: CIFAR-10 • 60% open-set noise, with noisy samples drawn from ImageNet and CIFAR-100 • Model: DenseNet-100 [Wang’ 18] Iterative learning with open-set noisy labels. In CVPR, 2018. [Table: test accuracy (%) of DenseNet on CIFAR-10]
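Following this setup description, here is a sketch of how such an open-set noisy training set can be constructed; the array names are placeholders, and sampling without replacement assumes `ood_x` holds enough out-of-distribution images.

```python
# A sketch of building an open-set noisy training set: replace a fraction of the
# in-distribution images with out-of-distribution images (e.g. from CIFAR-100 or
# ImageNet) while keeping the labels and the number of images per class unchanged.
import numpy as np

def make_open_set_noise(in_dist_x, in_dist_y, ood_x, noise_fraction=0.6, seed=0):
    rng = np.random.default_rng(seed)
    x = in_dist_x.copy()
    n_replace = int(noise_fraction * len(x))
    replace_idx = rng.choice(len(x), size=n_replace, replace=False)
    ood_idx = rng.choice(len(ood_x), size=n_replace, replace=False)
    x[replace_idx] = ood_x[ood_idx]   # these images are now out-of-distribution
    return x, in_dist_y               # labels are kept unchanged
```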
  • 59. Conclusion 15 • To handle noisy labels, we propose: • Generative classifier: a new inference method based on an LDA-based generative classifier • Robust inference: the MCD estimator, with a generalization error analysis • Ensemble method: generative classifiers from multiple layers • We believe that our results can be useful for many machine learning problems: • Defense against adversarial attacks • Detecting out-of-distribution samples • Poster session: Pacific Ballroom #16 Thank you for your attention

Editor's Notes

  1. Hi, I’m Kimin. In this presentation, I’m gonna introduce a new inference method based on a generative classifier to handle noisy labels in training data. This is joint work with Sukmin Yun, Kibok Lee, Honglak Lee, Bo Li and Jinwoo Shin.
  2. Ok, let me start the presentation. Large-scale datasets collect class labels using data mining techniques because it is hard to obtain class labels from human domain experts. For example, Facebook’s Instagram dataset collects class labels using hashtags from users, and the Open Images dataset uses machine-generated labels.
  3. However, if we collect the labels using data mining techniques, datasets can contain incorrect (noisy) labels by mistake, as shown in this example.
  4. And it is well known that DNNs cannot generalize well from such noisy datasets.
  5. For example, if we corrupt the labels of CIFAR-10 data to other classes uniformly at random, the test set accuracy of DenseNet decreases as shown in this figure. This result shows that handling noisy labels is a really important problem in classification tasks.
  6. There are many kinds of training methods, like bootstrapping, loss correction, ensemble methods and meta-learning methods. However, most of them might increase the complexity of training, such as training time and hyper-parameter tuning.
  7. So, we propose an inference method which can be applied to any pre-trained DNN.
  8. Our main idea is post-processing a “generative classifier” which models the data distribution on hidden feature spaces.
  9. And by applying a “robust inference” method to estimate the parameters of the generative classifier, we induce a more robust decision boundary, as shown in this figure. Here I wanna remark that, in this paper, we analyze the advantages of such a robust inference method with respect to breakdown points and generalization bounds.
  10. Finally, we introduce an “ensemble of generative classifiers” to further improve the performance of the proposed method.
  11. And I will talk about more details of our method and experimental results in the next slides.
  12. First, let me explain our motivation using the t-SNE embedding of a DenseNet trained on CIFAR-10 with uniform noise.
  13. As shown in this figure, we first found that hidden features from training samples with noisy labels are scattered in the hidden space like outliers.
  14. On the other hand, features from training samples with clean labels are still clustered even though we train deep neural networks on noisy labels. This implies that deep representations have some clustering properties even when trained with noisy labels.
  15. So, if we can induce a posterior distribution from hidden features by removing outliers, it can be robust to noisy labels. Motivated by this, we utilize a generative classifier which models the data distribution P(x|y), where x is the data and y is the label.
  16. So, if we can induce a posterior distribution from hidden features by removing outliers, it can be robust to noisy labels. Motivated by this, we utilize a generative classifier which models the data distribution P(x|y), where x is the data and y is the label.
  17. Specifically, given a pre-trained softmax classifier, we post-process a generative classifier by defining class-conditional Gaussian distributions of the hidden features.
  18. Here, the reason why we assume Gaussian distributions is that hidden features from a softmax classifier can be well fitted by Gaussian distributions: our approach is based on a theoretical connection that the posterior distribution of a generative classifier with a tied covariance is equivalent to the softmax classifier.
  19. With this Gaussian assumption, a new posterior distribution can be obtained from the generative classifier based on Bayes’ rule.
  20. However, to get the new posterior distribution, we have to estimate the parameters of the generative classifier.
  21. One naïve solution is to compute the sample estimator using features from the training data.
  22. However, the naïve sample estimator can be highly influenced by outliers.
  23. To minimize the negative effects from outliers, we utilize the Minimum Covariance Determinant (MCD) estimator in this paper.
  24. The main idea of MCD is finding a subset which has the smallest covariance determinant. One can imagine that outliers are usually scattered in the sample space. Therefore, if we find a subset with the smallest variance, outliers are likely not included in that subset. Motivated by this, the MCD estimator first selects a subset with minimum covariance determinant for each class,
  25. and then estimates the parameters of the generative classifier using only the selected samples.
  26. Actually, such an MCD estimator has theoretical advantages compared to the naïve estimator. The first one is the breakdown point, which corresponds to the smallest fraction of outliers that can make the distance between the true and estimated parameters arbitrarily large. Here, I wanna remark that a high breakdown point implies that the estimator is robust to outliers.
  27. As shown in this theorem, it is well known that the MCD estimator has a near-optimal breakdown point, while the sample estimator has a 0% breakdown point.
  28. In addition, we also prove that the parameters from the MCD estimator are closer to the true parameters than those from the sample estimator, as shown in this theorem.
  29. This result implies that the generative classifier from the MCD estimator has a tighter generalization error bound than that from the sample estimator, because the generalization error of the generative classifier is bounded by the distance between the true and estimated parameters.
  30. Even though the MCD estimator has several advantages, finding the optimal solution of MCD is computationally intractable.
  31. To handle this complexity issue, we utilize the two-step approach proposed by Hubert.
  32. First, given the training samples, we choose a random subset and compute its mean and covariance matrix.
  33. Then, for all training samples, we compute the Mahalanobis distance, which corresponds to the distance between a sample and the Gaussian distribution.
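For reference, a standard form of this distance (notation ours, not from the slides): the squared Mahalanobis distance of a feature f to the Gaussian with estimated mean and covariance is

$$ d_M^2(f) = (f - \hat{\mu})^\top \, \hat{\Sigma}^{-1} \, (f - \hat{\mu}) $$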
  34. Next, we construct a new subset which contains the samples with the smallest distances.
  35. And update the mean and covariance matrix again.
  36. We repeat these steps until the determinant of the covariance stops decreasing.
  37. And compute the final solution using the selected subsets from each class.
  38. Here, I remark that this greedy method can give us a locally optimal solution because finding a subset using Mahalanobis distances monotonically decreases the determinant of the covariance matrix.
  39. In summary, we propose a new inference method based on a generative classifier and utilize the MCD estimator to improve the robustness of the classifier. One natural question here is: is it possible to also utilize the low-level features?
  40. To solve this problem, we utilize the two-step approach proposed by Hubert. The idea is simple. First, for each class, we compute the sample mean and covariance from a random initial subset and compute the Mahalanobis distance for every sample.
  41. In order to improve the performance, we can also post-process generative classifiers from low-level features, as shown in this figure. Then, one can induce additional posterior distributions from the low-level features.
  42. So, up to now, I have explained the main idea of our method. Next, let’s move on to the experiments.
  43. To evaluate our method, we consider image classification tasks
  44. and NLP tasks like part-of-speech tagging and text categorization.
  45. To make noisy labels, we utilize uniform and flip noise, defined as follows.
  46. In order to verify the stepwise improvement of each component, we incrementally apply our method using a ResNet trained on the CIFAR-10 dataset.
  47. First, as shown in this table, the MCD estimator improves the performance of the generative classifier over the sample estimator by removing outliers.
  48. When we measure the fraction of clean samples selected by each estimator, we indeed find that MCD selects clean samples more effectively than the sample estimator, as shown in this graph.
  49. Next, I want to remark that the ensemble method significantly improves the classification accuracy.
  50. To analyze the effects of the ensemble method, we measure the layer-wise accuracy of the generative classifier, where the x-axis corresponds to the index of the layer and the y-axis is the classification accuracy.
  51. In this case, some training images are often from the open world and not relevant to the targeted classification task at all, i.e., out-of-distribution. However, they are still labeled as certain classes. For example, as shown in Figure 3(b), noisy samples like “chair” from CIFAR-100 and “door” from (downsampled) ImageNet (Chrabaszcz et al., 2017) can be labeled as “bird” for training, even though their true labels are not contained within the set of training classes in the CIFAR-10 dataset.
  52. In our experiments, open-set noisy datasets are built by replacing some training samples in CIFAR-10 with out-of-distribution samples, while keeping the labels and the number of images per class unchanged. We train DenseNet-100 on CIFAR-10 with 60% open-set noise. As shown in Table 7, our method achieves comparable or significantly better test accuracy than the softmax classifier.
  53. Let me conclude this presentation.