The document discusses relational knowledge distillation (RKD), a technique for transferring knowledge from a teacher model to a student model. It begins by providing background on knowledge distillation and recent approaches. It then introduces RKD, which transfers relational information between examples in the teacher's embedding space, such as distances and angles, rather than just individual example outputs. The document describes experiments applying RKD to metric learning, image classification, and few-shot learning, finding it improves student model performance over other distillation methods. It concludes RKD effectively leverages relational information to transfer knowledge between models.
3. What is Knowledge Distillation?
(Figure: Transfer Learning vs. Knowledge Distillation (Transfer). Transfer learning: Model A trained on Domain A is transferred to Model B on Domain B. Knowledge distillation: a big & deep teacher model trained on Domain A educates (transfers to) a small & shallow student model on the same domain.)
• For model compression
• To improve performance of the student over the teacher
• When data is not sufficient
• When labels for the target problem are not available
• E.g., a model pretrained on ImageNet
4. Model Compression using Knowledge Distillation
(Figure: an example is fed to Models 1-4; the outputs of the individual models are combined into the output of the ensemble.)
• An ensemble is an easy way to improve the performance of a neural network.
• However, it requires large computing resources.
5. Model Compression using Knowledge Distillation
(Figure: the ensemble of Models 1-4 acts as the teacher; its output on each example is transferred to a single student model.)
• By educating the student model to mimic the output of the teacher model, the student can achieve comparable performance.
6. Model Compression using Knowledge Distillation
(Figure: the knowledge of the teacher ensemble (Models 1-4) is distilled into a single student model.)
7. • Distilling the Knowledge in a Neural Network
Hinton et al. In NIPS Workshop, 2015.
Recent Approaches: Transfer Class Probability
(Figure: an image xᵢ is fed to the teacher classifier f_T and the student classifier f_S; the logits of both pass through a softmax softened by temperature τ, and the teacher's class probabilities are transferred to the student.)
Objective: ℒ = Σᵢ KL( softmax(f_T(xᵢ)/τ) ∥ softmax(f_S(xᵢ)/τ) )
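Below is a minimal PyTorch sketch of this softened-softmax transfer objective; it is illustrative rather than the authors' code, and the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_target_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student class probabilities."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # The tau^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```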
8. • FitNets: Hints for Thin Deep Nets
Romero et al. In ICLR, 2015.
Recent Approaches: Transfer Hidden Activation
(Figure: an image xᵢ is fed to the teacher f_T and the student f_S; the teacher's hidden activation with C′ channels is mapped by a random linear transformation β to the student's C channels, where C′ > C, and transferred.)
Objective: ℒ = Σᵢ ‖ β f_T(xᵢ) − f_S(xᵢ) ‖₂²
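A minimal sketch of a hint-style loss in the spirit of this slide, assuming the hidden activations have already been pooled into C′- and C-dimensional vectors; the class name is hypothetical, and keeping β as an untrained random projection follows the slide's "random linear transformation" rather than the original FitNets regressor.

```python
import torch
import torch.nn as nn

class HintLoss(nn.Module):
    """L2 gap between the student's activation and the teacher's activation
    projected from C' channels down to C channels by a fixed random map beta."""
    def __init__(self, teacher_channels, student_channels):
        super().__init__()
        self.beta = nn.Linear(teacher_channels, student_channels, bias=False)
        for p in self.beta.parameters():
            p.requires_grad = False  # random projection, kept fixed during training

    def forward(self, teacher_feat, student_feat):
        # teacher_feat: (N, C'), student_feat: (N, C)
        return ((self.beta(teacher_feat) - student_feat) ** 2).sum(dim=1).mean()
```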
9. • Paying More Attention to Attention: Improving the Performance of
Convolutional Neural Networks via Attention Transfer
Zagoruyko & Komodakis. In ICLR, 2017.
Recent Approaches: Transfer Attention
(Figure: an image xᵢ is fed to the teacher f_T and the student f_S; each network's activation (H×W×C′ for one, H×W×C for the other) is averaged over channels to form attention maps Q_T and Q_S of size H×W, and the teacher's map is transferred to the student.)
Objective: ℒ = Σᵢ ‖ Q_S(xᵢ)/‖Q_S(xᵢ)‖₂ − Q_T(xᵢ)/‖Q_T(xᵢ)‖₂ ‖₂²
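A sketch of an attention-transfer loss along these lines; using the channel-wise mean of squared activations as the attention map and L2-normalizing the maps before matching is one common variant, not necessarily the exact form on the slide.

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    """Channel-averaged spatial attention map, flattened and L2-normalized.
    feat: (N, C, H, W) -> (N, H*W)"""
    a = feat.pow(2).mean(dim=1)             # average over channels
    return F.normalize(a.flatten(1), dim=1)

def attention_transfer_loss(teacher_feat, student_feat):
    diff = attention_map(teacher_feat) - attention_map(student_feat)
    return diff.pow(2).sum(dim=1).mean()
```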
10. • Born-Again Neural Networks (Furlanello et al. In ICML, 2018.)
• Label Refinery: Improving ImageNet Classification through Label Progression (Bagherinezhad et al. In arXiv, 2018.)
Recent Approaches: Student Over Teacher
(Figure: an image xᵢ is fed to the teacher classifier f_T; its class probabilities are used as the ground truth for training the student classifier f_S.)
Surprisingly, the student is significantly better than the teacher.
The student architecture is identical to the teacher's.
11. • Previous works can be expressed in the form ℒ_IKD = Σᵢ l(tᵢ, sᵢ).
• f_T: teacher, f_S: student, l: loss, tᵢ = f_T(xᵢ), sᵢ = f_S(xᵢ).
• IKD transfers the outputs of individual examples from the teacher to the student.
Individual Knowledge Distillation: Generalization
12. Q. What constitutes the knowledge in a learned model?
What is the Knowledge of a Model?
13. Q. What constitutes the knowledge in a learned model?
A. (IKD) Output of individual examples represented by the teacher.
What is the Knowledge of a Model?
14. Q. What constitutes the knowledge in a learned model?
A. (IKD) Output of individual examples represented by the teacher.
A. (RKD) Relations among examples represented by the teacher.
What is the Knowledge of a Model?
15. • Relational knowledge distillation can be expressed in the form ℒ_RKD = Σ_(x₁,…,xₙ) l( ψ(t₁,…,tₙ), ψ(s₁,…,sₙ) ).
• ψ: a relational potential function that extracts relations among the examples.
• RKD transfers relations among examples represented by the teacher to the student.
Relational Knowledge Distillation: Generalization
16. • IKD transfers the outputs of individual examples represented by the teacher to the student.
• RKD transfers relations among examples represented by the teacher to the student.
IKD versus RKD
17. • Among many possible relations, we transfer the “structure” of the embedding space.
• Distance-wise loss (pair)
• Angle-wise loss (triplet)
Relational Knowledge Distillation: Structure to Structure
(Figure: Individual KD matches teacher points t₁, t₂, t₃ to student points s₁, s₂, s₃ point to point, whereas Relational KD matches the structure formed by the points, structure to structure.)
18. • Distance-wise loss (RKD-D)
• RKD-D transfers the relative distances between points in the embedding space.
Relational Knowledge Distillation: Distance-wise Loss
ℒ_RKD-D = Σ_(xᵢ,xⱼ) l_δ( ψ_D(tᵢ, tⱼ), ψ_D(sᵢ, sⱼ) ), with ψ_D(tᵢ, tⱼ) = (1/μ)·‖tᵢ − tⱼ‖₂, where μ is the mean pairwise distance and l_δ is the Huber loss: l_δ(x, y) = ½(x − y)² if |x − y| ≤ 1, and |x − y| − ½ otherwise.
(Figure: pairwise distances among teacher points t₁, t₂, t₃ and among student points s₁, s₂, s₃ in the embedding space.)
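A compact PyTorch sketch of this distance-wise loss (not the authors' code): pairwise distances are normalized by their mean within each embedding space, and smooth_l1_loss plays the role of the Huber penalty l_δ.

```python
import torch
import torch.nn.functional as F

def rkd_distance_loss(teacher_emb, student_emb):
    """Match mean-normalized pairwise distances of student embeddings to the teacher's.
    teacher_emb, student_emb: (N, D) batches of embeddings."""
    with torch.no_grad():
        t_dist = torch.cdist(teacher_emb, teacher_emb, p=2)
        t_dist = t_dist / t_dist[t_dist > 0].mean()   # normalize by mean distance
    s_dist = torch.cdist(student_emb, student_emb, p=2)
    s_dist = s_dist / s_dist[s_dist > 0].mean()
    return F.smooth_l1_loss(s_dist, t_dist)           # Huber penalty
```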
19. • Angle-wise loss (RKD-A)
• RKD-A transfers the angles formed by triplets of points in the embedding space:
ℒ_RKD-A = Σ_(xᵢ,xⱼ,xₖ) l_δ( ψ_A(tᵢ, tⱼ, tₖ), ψ_A(sᵢ, sⱼ, sₖ) ), with ψ_A(tᵢ, tⱼ, tₖ) = cos ∠tᵢtⱼtₖ.
Relational Knowledge Distillation: Angle-wise Loss
(Figure: the angles θ₁, θ₂, θ₃ formed by teacher points t₁, t₂, t₃ and by student points s₁, s₂, s₃ in the embedding space.)
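A matching PyTorch sketch of the angle-wise loss, again illustrative: for every vertex j it compares the cosines of angles toward all pairs (i, k) in the teacher and student embedding spaces.

```python
import torch
import torch.nn.functional as F

def angle_potentials(emb):
    """Cosines of the angles at each vertex j toward every pair (i, k).
    emb: (N, D) -> (N, N, N) tensor of cos angle(e_i, e_j, e_k)."""
    diff = emb.unsqueeze(0) - emb.unsqueeze(1)      # diff[j, i] = e_i - e_j
    unit = F.normalize(diff, p=2, dim=2)
    return torch.bmm(unit, unit.transpose(1, 2))    # inner products of unit directions

def rkd_angle_loss(teacher_emb, student_emb):
    with torch.no_grad():
        t_angle = angle_potentials(teacher_emb)
    s_angle = angle_potentials(student_emb)
    return F.smooth_l1_loss(s_angle, t_angle)       # Huber penalty
```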
20. • Where to apply RKD?
• On any hidden layer or embedding layer.
• Not on layers where individual output values are crucial, because RKD does not transfer the output values of individual examples.
E.g., the softmax layer for classification.
• How to use RKD during training?
• The RKD loss can be combined with a task-specific loss, ℒ_task + λ ⋅ ℒ_RKD (a training-step sketch follows below).
• The RKD loss can also be used on its own to train an embedding network, ℒ_RKD.
Relational Knowledge Distillation: How to use RKD?
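A minimal training-step sketch of the combined objective ℒ_task + λ ⋅ ℒ_RKD, reusing rkd_distance_loss and rkd_angle_loss from the sketches above; the weights lambda_d and lambda_a and the assumption that the task criterion takes (embeddings, labels) are illustrative, not values from the slides.

```python
import torch

def train_step(student, teacher, images, labels, task_criterion, optimizer,
               lambda_d=1.0, lambda_a=2.0):
    """One optimization step: task loss on the student plus distance- and angle-wise RKD."""
    teacher.eval()
    with torch.no_grad():
        t_emb = teacher(images)                      # teacher embeddings, no gradients
    s_emb = student(images)
    loss = (task_criterion(s_emb, labels)
            + lambda_d * rkd_distance_loss(t_emb, s_emb)
            + lambda_a * rkd_angle_loss(t_emb, s_emb))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```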
22. Metric learning
• It aims to train an embedding model.
• In the embedding space, distances between projected examples correspond to their semantic similarity (a triplet-loss example follows below).
Experiment: What is Metric Learning?
(Figure: a DNN f(x; W) maps images x₁, x₂, x₃ to a d-dimensional embedding space, pulling positives together and pushing negatives apart. Right: t-SNE of the embedding space on the Cars 196 dataset (Wang et al., 2017).)
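As a small illustration of this idea, a triplet margin loss pulls an anchor toward a positive of the same class and pushes it away from a negative of a different class; the tensors and the margin value below are made up for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings: 16 anchors, positives, and negatives of dimension 128.
anchor, positive, negative = torch.randn(3, 16, 128)
loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.2)
```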
23. • Evaluation
• Image retrieval, recall@k
• Dataset
• Cars 196 (Krause et al. In 3dRR, 2013.)
• CUB-200-2011 (Wah et al. In CNS-TR, 2011.)
• Stanford Online Products (Song et al. In CVPR, 2016.)
• Architecture
• Teacher: ResNet50 (backbone) + 512-d fc layer (embedding layer) + L2 normalization
• Student: ResNet18 + fc embedding layer of various dimensions + L2 normalization (optional) (a model sketch follows after this slide)
• Target layer for RKD
• Final embedding outputs of teacher and student
• Training Objective
• Teacher: Triplet loss & Distance-weighted sampling (Wu et al. In ICCV, 2017.)
• Student: Triplet loss, RKD-D, RKD-A, RKD-DA, DarkRank (Chen et al. In AAAI, 2018.)
Experiment: Metric Learning
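A sketch of such teacher and student embedding networks using torchvision backbones; the 128-d student embedding is just one of the "various dimensions", and dropping L2 normalization for the student reflects the "(optional)" note. This is an assumption-laden illustration, not the authors' implementation.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50, resnet18

class EmbeddingNet(nn.Module):
    """Backbone + fc embedding layer + optional L2 normalization."""
    def __init__(self, backbone, embedding_dim=512, l2_normalize=True):
        super().__init__()
        in_features = backbone.fc.in_features
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
        self.embedding = nn.Linear(in_features, embedding_dim)
        self.l2_normalize = l2_normalize

    def forward(self, x):
        h = self.backbone(x).flatten(1)
        e = self.embedding(h)
        return F.normalize(e, dim=1) if self.l2_normalize else e

teacher = EmbeddingNet(resnet50(), embedding_dim=512, l2_normalize=True)
student = EmbeddingNet(resnet18(), embedding_dim=128, l2_normalize=False)
```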
24. Experiment: Metric Learning
(a) Recall@1 on CUB-200-2011
(b) Recall@1 on Cars 196
Distillation to a small network
• Model-d refers to a model with a d-dimensional embedding.
25. Self-Distillation
• Teacher: ResNet50 + 512-d fc + L2 normalization
• Trained using triplet loss
• Student: ResNet50 + 512-d fc
• Trained using RKD-DA
Experiment: Metric Learning
(a) Recall@1 of Self-Distillation
26. Comparison with state-of-the-art methods
Experiment: Metric Learning
• On CUB-200-2011, we achieve state-of-the-art performance regardless of the backbone network.
• On Cars 196 & Stanford Online Products, we achieve the second-best performance.
Note that ABE8 (Kim et al. In ECCV, 2018) requires additional attention modules for 8 branches.
(a) Recall@K comparison with state-of-the-art methods.
27. Experiment: Metric Learning
Qualitative Results
• Where the teacher (Triplet) fails, the student (RKD-DA) succeeds at top-1.
(a) Retrieval results on CUB-200-2011. (b) Retrieval results on Cars 196.
28. Experiment: Image Classification
Image Classification
• Datasets: CIFAR-10, CIFAR-100
• Architecture
• Teacher: ResNet50
• Student: VGG-11 with BatchNorm
• Target layer for RKD (a feature-hook sketch follows after this slide)
• Teacher: output of avgpool layer
• Student: output of pool5 layer
• Training Objective
• Teacher: cross-entropy
• Student: cross-entropy + one of Hinton et al., RKD-D, or RKD-DA
(a) Accuracy (%) on CIFAR-10 and CIFAR-100.
(Figure: the pooled CNN features of the ResNet50 teacher are transferred to the VGG11-with-BN student before each network's fc classifier.)
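One way to grab the features named above is with forward hooks; the snippet below is a hedged sketch in which the hook targets and num_classes are assumptions based on this slide, and the RKD losses from the earlier sketches would then be applied to the captured features alongside cross-entropy.

```python
import torch
from torchvision.models import resnet50, vgg11_bn

features = {}

def save_to(name):
    def hook(module, inputs, output):
        features[name] = output.flatten(1)   # store the pooled activation as a matrix
    return hook

teacher = resnet50(num_classes=100)
student = vgg11_bn(num_classes=100)
teacher.avgpool.register_forward_hook(save_to("teacher"))
student.features[-1].register_forward_hook(save_to("student"))  # last max-pool ("pool5")

images = torch.randn(8, 3, 32, 32)            # dummy CIFAR-sized batch
_ = teacher(images), student(images)          # forward passes fill the features dict
# rkd_distance_loss(features["teacher"], features["student"]) etc. can now be computed.
```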
29. Experiment: What is Few-Shot Learning?
Few-shot learning
• A classifier learns to generalize to new, unseen classes given only a few examples of each new class.
• Shot: the number of examples given for each new class
• Way: the number of new classes
• E.g., Prototypical Networks (Snell et al. In NIPS, 2017)
• An embedding network in which classification is performed based on distances to the given examples of the new classes (a sketch of this rule follows after this slide).
Prototypical Networks for few-shot learning.
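A small sketch of the prototype-based decision rule (classify a query by its distance to the mean support embedding of each class); the names and shapes here are illustrative, not taken from the slides.

```python
import torch

def prototypical_logits(support_emb, support_labels, query_emb, n_way):
    """Logits = negative squared distances from each query to each class prototype.
    support_emb: (S, D), support_labels: (S,), query_emb: (Q, D)."""
    prototypes = torch.stack([support_emb[support_labels == c].mean(dim=0)
                              for c in range(n_way)])       # (n_way, D)
    return -torch.cdist(query_emb, prototypes, p=2) ** 2    # (Q, n_way)
```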
30. Experiment: Few-Shot Learning
Few-shot learning
• Datasets
• Omniglot (Lake et al. In Science, 2015.)
• miniImageNet (Vinyals et al. In NIPS, 2016.)
• Architecture
• Teacher: 4 convolutional layers
• Student: same as the teacher
• Target layer for RKD
• Final embedding outputs of the teacher and student
• Training Objective
• Teacher: Snell et al. (prototypical networks)
• Student: Snell et al. + (RKD-D or RKD-DA)
(a) Accuracy (%) on Omniglot.
(b) Accuracy (%) on miniImageNet.
31. Discussion: Effective Adaptation on Source Domain
• Both Cars 196 & CUB-200-2011 are fine-grained classification datasets.
• They require adaptation to the specific characteristics of the domain,
e.g., finding local patches that distinguish an object from others.
• ‘Triplet’ is the teacher network used to educate the ‘RKD-DA’ model.
(a) Recall@1 curves on the train/evaluation sets while training the teacher (Triplet) and the student (RKD-DA) on Cars 196.
(b) Recall@1 on various domains; both ‘Triplet’ and ‘RKD-DA’ are the models trained on Cars 196.
32. • We have introduced Relational KD, which effectively transfers knowledge using relations among data examples represented by the teacher.
• Experiments conducted on different tasks and benchmarks show that Relational KD improves the performance of the educated student networks by a significant margin.
Conclusion
34. [1] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS workshop, 2015.
[2] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
[3] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer.
In ICLR, 2017.
[4] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar. Born-again neural networks. In ICML, 2018.
[5] H. Bagherinezhad, M. Horton, M. Rastegari, and A. Farhadi. Label refinery: Improving imagenet classification through label progression. In arXiv, 2018.
[6] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In 3dRR, 2013.
[7] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. In CNS-TR, 2011.
[8] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
[9] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl. Sampling matters in deep embedding learning. In ICCV, 2017.
[10] Y. Chen, N. Wang, and Z. Zhang. Darkrank: Accelerating deep metric learning via cross sample similarities transfer. In AAAI, 2018.
[11] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon. Attention based ensemble for deep metric learning. In ECCV, 2018.
[12] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
[13] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. In Science, 2015.
[14] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
References