Learning Loss
for Active Learning
Donggeun Yoo, In So Kweon
CVPR 2019 (Oral presentation)
Lunit / KAIST
Introduction
•Data is very important for deep learning
•It is unquestionable that
more data still improves network performance
[Mahajan et al., ECCV’18] (10 million to 1 billion images)
Introduction
•Problem: Limited budget for annotation
[Figure: annotation cost grows from a simple class label (“Horse=1”: $) to richer annotations ($$, $$$)]
•Disease-level annotations for medical images: super-expensive ($$$$$)
Active Learning
•The cycle: train on the labeled set → run inference on the unlabeled pool → if uncertain, send the data to labeling → add it to the labeled set → train again
[Figure: a loop between the unlabeled pool and the labeled set via Inference, Labeling, and Training]
•The key to active learning is how to measure the uncertainty (a minimal loop is sketched below)
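Below is a minimal, hypothetical sketch of this pool-based cycle in Python; train(), uncertainty(), and oracle() are placeholders for the target-task trainer, the uncertainty measure, and the human annotator, none of which are specified by the slides.

```python
# A minimal sketch of the pool-based active learning cycle above.
# train(), uncertainty(), and oracle() are hypothetical placeholders.

def active_learning_cycle(model, labeled, unlabeled, oracle, cycles=10, K=1000):
    for _ in range(cycles):
        train(model, labeled)                        # train on the labeled set
        ranked = sorted(unlabeled, key=lambda x: uncertainty(model, x), reverse=True)
        picked, unlabeled = ranked[:K], ranked[K:]   # the K most uncertain points
        labeled += [(x, oracle(x)) for x in picked]  # humans label them
    return model
```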
Active Learning: Limitations
• Heuristic approach
• Highest entropy [Joshi et al., CVPR’09]
• Distance to decision boundaries [Tong & Koller, JMLR’01]
(−) Task-specific design
• Ensemble approach [Freund et al., ML’97], [Beluch et al., CVPR’18]
(−) Does not scale to large CNNs and data
• Bayesian approach
• Expected error [Roy & McCallum, ICML’01] / model [Kapoor et al., ICCV’07]
• Bayesian inference by dropout [Gal & Ghahramani, ICML’17]
(−) Does not scale to large data and CNNs [Sener & Savarese, ICLR’18]
• Distribution approach
• Density-based [Liu & Ferrari, ICCV’17], diversity-based [Sener & Savarese, ICLR’18]
(−) Task-specific design; does not consider hard examples
*Entropy
• An information-theoretic measure of the amount of information needed to “encode” a distribution.
• The use of entropy in active learning (see the sketch below)
• Dense prediction (0.33, 0.33, 0.33) → maximum entropy
• Sparse prediction (1.00, 0.00, 0.00) → minimum entropy
(+) Very simple but works well (also in deep networks)
(−) Specific to the classification problem
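As a concrete illustration, a minimal PyTorch sketch of entropy-based uncertainty, assuming classification logits of shape (batch, classes); this is not from the slides:

```python
import torch
import torch.nn.functional as F

def entropy_uncertainty(logits: torch.Tensor) -> torch.Tensor:
    # Shannon entropy of the softmax distribution, per sample.
    # Dense predictions like (0.33, 0.33, 0.33) score near the maximum;
    # sparse ones like (1.00, 0.00, 0.00) score near zero.
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-12)).sum(dim=1)  # shape: (batch,)
```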
*Bayesian Inference
• Training
• A dropout layer is inserted at every convolution layer
(−) Super slow convergence → impractical for current deep nets
• Inference
• N feed-forwards → N predictions
• Uncertainty = variance between the predictions (see the sketch below)
(−) Computationally expensive
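A minimal sketch of this MC-dropout style inference in PyTorch, assuming the model contains dropout layers; a simplification, not the slides' code:

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, x, n: int = 25) -> torch.Tensor:
    # Keep dropout active at inference and run N stochastic feed-forwards;
    # the variance between the N predictions serves as the uncertainty.
    model.train()   # enables dropout (caution: also switches BatchNorm mode)
    preds = torch.stack([model(x).softmax(dim=1) for _ in range(n)])  # (n, B, C)
    model.eval()
    return preds.var(dim=0).sum(dim=1)  # per-sample uncertainty, shape (B,)
```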
*Diversity: Core-set
• Select a subset of the unlabeled pool that is a δ-cover of the whole distribution: every unlabeled point lies within distance δ of some selected point
• Optimization problem: choose the subset {𝑥} that minimizes the cover radius δ (a greedy approximation is sketched below)
(+) Can be task-agnostic, as it only depends on the feature space
(−) Does not consider “hard” examples near the decision boundaries
(−) Expensive optimization for a large pool
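The greedy approximation mentioned above, sketched in PyTorch; this is the standard k-center greedy heuristic, shown for illustration rather than the paper's exact solver:

```python
import torch

def greedy_k_center(pool_feats: torch.Tensor, labeled_feats: torch.Tensor, k: int):
    # Greedily approximate the core-set (delta-cover) objective: repeatedly
    # pick the pool point farthest from the current cover, shrinking the
    # cover radius delta step by step.
    dist = torch.cdist(pool_feats, labeled_feats).min(dim=1).values
    picked = []
    for _ in range(k):
        i = int(dist.argmax())                       # farthest uncovered point
        picked.append(i)
        d_new = torch.cdist(pool_feats, pool_feats[i : i + 1]).squeeze(1)
        dist = torch.minimum(dist, d_new)            # distances to the new cover
    return picked
```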
Active Learning: Our approach
• Active learning by learning loss
• Attach a “loss prediction module” to a target network
• Learn the module to predict the target loss
[Figure: the module outputs predicted losses over the unlabeled pool; human oracles annotate the top-K data points, which join the labeled training set (a selection sketch follows)]
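The selection step sketched in PyTorch; feature_fn is a hypothetical hook returning the target model's intermediate features for a batch, since the slides do not fix this interface:

```python
import torch

@torch.no_grad()
def select_top_k(loss_module, feature_fn, pool_loader, K: int):
    # Score every unlabeled point with the loss prediction module and
    # return the indices of the top-K points to send to the human oracles.
    scores = [loss_module(feature_fn(x)).squeeze(1) for x in pool_loader]
    return torch.cat(scores).topk(K).indices
```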
Active Learning: Our approach
• Requirements
• Task-agnostic method
• Learning-based, not heuristic
• Scalable to state-of-the-art networks and large data
Active Learning by Learning Loss
[Figure: the input x feeds the model (target prediction ŷ) and the attached loss prediction module (loss prediction l̂); the target GT y yields the target loss l, which supervises l̂ through the loss-prediction loss L_loss(l̂, l)]
• The target loss and the loss-prediction loss are trained jointly: multi-task learning
(+) Applicable to
• any network and data
• any task
(+) Nearly zero extra cost
Active Learning by Learning Loss
•The loss for loss prediction, L_loss(l̂, l)
•Mean squared error? L_loss(l̂, l) = (l̂ − l)²
→ The target task loss l is reduced as training progresses, so its scale keeps changing and the regression target keeps moving
Active Learning by Learning Loss
•The loss for loss prediction, L_loss(l̂, l)
•To ignore the scale changes of l, we use a ranking loss:
L_loss(l̂_i, l̂_j, l_i, l_j) = max(0, −𝟙(l_i, l_j) · (l̂_i − l̂_j) + ξ)
where 𝟙(l_i, l_j) = +1 if l_i > l_j, −1 otherwise
• (l̂_i, l̂_j): a pair of predicted losses; (l_i, l_j): a pair of real losses; ξ: margin (= 1)
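A PyTorch sketch of this ranking loss, pairing consecutive samples of the mini-batch (one plausible pairing scheme; an illustration of the formula above):

```python
import torch

def loss_prediction_loss(pred: torch.Tensor, target: torch.Tensor,
                         margin: float = 1.0) -> torch.Tensor:
    # Split the mini-batch into pairs (i, j). If the real losses say
    # l_i > l_j, the predicted losses must satisfy l̂_i − l̂_j > margin
    # (and symmetrically otherwise); violations are penalized hinge-style.
    p_i, p_j = pred[0::2], pred[1::2]        # pair of predicted losses
    t_i, t_j = target[0::2], target[1::2]    # pair of real losses
    sign = torch.where(t_i > t_j, torch.ones_like(t_i), -torch.ones_like(t_i))
    return torch.clamp(-sign * (p_i - p_j) + margin, min=0).mean()
```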
Active Learning by Learning Loss
•Given a mini-batch B, the total loss is defined as
(1/|B|) Σ_{(x,y)∈B} L_task(ŷ, y) + λ · (1/|B|) Σ_{(x_i,y_i,x_j,y_j)∈B} L_loss(l̂_i, l̂_j, l_i, l_j),
where l_i = L_task(ŷ_i, y_i)
• First term: target task; second term: loss prediction over pairs (i, j) within the mini-batch B
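One training step under this total loss, as a sketch: it assumes a backbone exposing a hypothetical return_features interface, a task criterion created with reduction='none', and the loss_prediction_loss sketched above.

```python
def training_step(backbone, loss_module, task_criterion, x, y, lam: float = 1.0):
    # One mini-batch step of the multi-task objective above.
    y_hat, feats = backbone(x, return_features=True)  # hypothetical interface
    per_sample = task_criterion(y_hat, y)             # l_i = L_task(ŷ_i, y_i)
    pred = loss_module(feats).squeeze(1)              # predicted losses l̂_i
    # detach(): the ranking term trains the module without altering the
    # target loss values it regresses toward.
    return per_sample.mean() + lam * loss_prediction_loss(pred, per_sample.detach())
```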
Active Learning by Learning Loss
•MSE loss vs. ranking loss
[Plot: active learning performance with MSE vs. ranking loss prediction, ResNet-18 on CIFAR-10]
Active Learning by Learning Loss
•Loss prediction module
[Figure: the target model's mid-blocks and out-block produce the target prediction; features tapped from the blocks are concatenated, and an FC layer outputs the loss prediction]
Active Learning by Learning Loss
•Loss prediction module: enough convolutions
• The tapped features are already convolved, and backpropagation from the loss-prediction loss reaches those convolutions
• So the convolutions are learned by the loss-prediction loss as well as the target loss
• The features already have a sufficiently large receptive field size
→ We don't need more convolutions; we just focus on merging the multiple features
Active Learning by Learning Loss
•Loss prediction module
[Figure: each tapped feature map goes through GAP → FC → ReLU; the resulting vectors are concatenated, and a final FC outputs the loss prediction]
(+) Very efficient, as GAP reduces the feature dimension (a module sketch follows)
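A self-contained PyTorch sketch of this module; the channel sizes default to the ResNet-18 taps shown later in the deck, and both they and the 128-d branch width are assumptions made configurable:

```python
import torch
import torch.nn as nn

class LossPredictionModule(nn.Module):
    # GAP → FC → ReLU per tapped feature map, concatenate, then one FC
    # that outputs a single predicted loss per sample.
    def __init__(self, channels=(64, 128, 256, 512), width: int = 128):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(c, width), nn.ReLU())
            for c in channels)
        self.fc = nn.Linear(width * len(channels), 1)

    def forward(self, feats):   # feats: list of mid-/out-block feature maps
        h = torch.cat([b(f) for b, f in zip(self.branches, feats)], dim=1)
        return self.fc(h)
```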
Active Learning by Learning Loss
•Loss prediction module: a heavier alternative
[Figure: each tap adds Conv → BN → ReLU (added layers) before GAP → FC → ReLU; the branches are concatenated, and a final FC outputs the loss prediction]
Active Learning by Learning Loss
•Loss prediction module: more convolutions vs. just FC
[Plot: comparison of the two module designs, ResNet-18 on CIFAR-10; the simpler GAP + FC module is kept]
Experiments (1)
•To validate “task-agnostic” + “state-of-the-art architectures”

     | Classification                 | Classification + regression | Regression
Task | Image classification           | Object detection            | Human pose estimation
Data | CIFAR-10                       | PASCAL VOC 2007+2012        | MPII
Net  | ResNet-18 [He et al., CVPR’16] | SSD [Liu et al., ECCV’16]   | Stacked Hourglass Networks [Newell et al., ECCV’16]
Results
•Image classification over CIFAR-10
[Figure: ResNet-18 [He et al., CVPR’16] with the loss prediction module; feature maps of 64×32×32, 128×16×16, 256×8×8, and 512×4×4 each pass through GAP → FC → ReLU to 128-d, are concatenated to 512-d, and a final FC outputs the loss prediction]
Results
•Image classification over CIFAR-10
[Plot: loss prediction performance]
Results
•Image classification over CIFAR-10 (mean of 5 trials)
[Plot: accuracy vs. labeled set size, compared with entropy [Joshi, CVPR’09] and core-set [Sener et al., ICLR’18]; our gain: +3.37%]
•Data selection vs. architecture
• Data selection by active learning → +3.37%
• DenseNet-121 [Huang et al.] − ResNet-18 → +2.02%
Results
•Object detection: SSD (ImageNet pre-trained) [Liu et al., ECCV’16]
[Figure: SSD with the loss prediction module; feature maps of 512×38×38, 1024×19×19, 512×10×10, 256×5×5, 256×3×3, and 256×1×1 each pass through GAP → FC → ReLU to 128-d, are concatenated to 768-d, and a final FC outputs the loss prediction]
Results
•Object detection over PASCAL VOC 07+12
[Plot: loss prediction performance]
Results
•Object detection on PASCAL VOC 07+12 (mean of 3 trials)
[Plot: performance vs. labeled set size, compared with entropy [Joshi, CVPR’09] and core-set [Sener et al., ICLR’18]; our gain: +2.21%]
•Data selection vs. architecture
• Data selection by active learning → +2.21%
• YOLOv2 [Redmon et al.] − SSD → +1.80%
Results
•Human pose estimation over the MPII dataset: Stacked Hourglass Network [Newell et al., ECCV’16]
[Figure: hourglass feature maps of 256×64×64 pass through GAP → FC → ReLU branches of 128-d, are concatenated (1024-d per the slide), and a final FC outputs the loss prediction]
Results
•Human pose estimation over the MPII dataset
[Plot: loss prediction performance]
Results
•Human pose estimation over the MPII dataset (mean of 3 trials)
[Plot: performance vs. labeled set size, compared with entropy [Joshi, CVPR’09] and core-set [Sener et al., ICLR’18]; our gain: +1.84%]
•Data selection vs. number of stacks
• Data selection by active learning → +1.84%
• 8-stacked − 2-stacked → +0.25%
Results
•Entropy vs. predicted loss over the MPII dataset
[Plot: entropy and predicted loss, each against the true MSE loss]
Experiments (2)
•To validate “active domain adaptation”

              | Dataset            | Data stats               | Active learning
Source domain | MNIST              | #train: 60k, #test: 10k  | Use the 60k as the initial labeled pool
Target domain | MNIST + background | #train: 12k, #test: 50k  | Add 1k for each cycle
Results
•Image classification over MNIST
[Figure: the PyTorch MNIST model* (Conv → ReLU → Conv → ReLU → FC → ReLU → FC) with the loss prediction module; features of 10×12×12, 20×4×4, and 50-d pass through (GAP →) FC → ReLU branches of 64-d, are concatenated to 192-d, and a final FC outputs the loss prediction]
*https://github.com/pytorch/examples/tree/master/mnist
Results
•Domain adaptation from MNIST to MNIST+background
•Loss prediction performance [Plot]
Results
•Domain adaptation from MNIST to MNIST+background
•Target domain performance
[Plot: compared with entropy [Joshi, CVPR’09] and core-set [Sener et al., ICLR’18]; annotation near the core-set curve: feature space overfitted to the source domain; our gain: +1.20%]
•Data selection vs. architecture
• Data selection by active learning → +1.20%
• WideResNet-14 − PyTorch MNIST model (4 layers) → +2.85%
Conclusion
•Introduced a novel active learning method that
• works well with current deep networks
• is task-agnostic
•Verified with
• three major visual recognition tasks
• three popular network architectures
“Pick more important data, and get better performance!”