The document discusses relational knowledge distillation (RKD), a technique for transferring knowledge from a teacher model to a student model. It begins by providing background on knowledge distillation and recent approaches. It then introduces RKD, which transfers relational information between examples in the teacher's embedding space, such as distances and angles, rather than just individual example outputs. The document describes experiments applying RKD to metric learning, image classification, and few-shot learning, finding it improves student model performance over other distillation methods. It concludes RKD effectively leverages relational information to transfer knowledge between models.
3. What is Knowledge Distillation?
(Figure: Transfer Learning vs. Knowledge Distillation (Transfer). Transfer learning: Model A trained on Domain A is transferred to Model B on Domain B. Knowledge distillation: a big & deep teacher model trained on Domain A educates (transfers to) a small & shallow student model on the same domain.)
• For model compression
• To improve performance of the student over the teacher
• When data is not sufficient
• When labels for the target problem are not available
• E.g., a model pretrained on ImageNet
4. Model Compression using Knowledge Distillation
(Figure: an example is fed to Models 1-4; the outputs of the individual models are combined into the output of the ensemble.)
• An ensemble is an easy way to improve the performance of a neural network.
• However, it requires large computing resources.
5. Model Compression using Knowledge Distillation
(Figure: the ensemble of Models 1-4 acts as the teacher; its output on each example is transferred to a single student model.)
• By educating the student model to mimic the output of the teacher model, the student can achieve comparable performance.
6. Model Compression using Knowledge Distillation
(Figure: the knowledge of the teacher ensemble (Models 1-4) is distilled into a single student model.)
7. • Distilling the Knowledge in a Neural Network
Hinton et al. In NIPS Workshop, 2015.
Recent Approaches: Transfer Class Probability
(Figure: an image xᵢ is fed to the teacher classifier f_T and the student classifier f_S; the logits of both pass through a softmax softened by temperature τ, and the teacher's class probabilities are transferred to the student.)
Objective: ℒ = Σᵢ KL( softmax(f_T(xᵢ)/τ) ∥ softmax(f_S(xᵢ)/τ) )
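Below is a minimal PyTorch sketch of this softened-softmax transfer objective; it is illustrative rather than the authors' code, and the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_target_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student class probabilities."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # The tau^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```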
8. • FitNets: Hints for Thin Deep Nets
Romero et al. In ICLR, 2015.
Recent Approaches: Transfer Hidden Activation
(Figure: an image xᵢ is fed to the teacher f_T and the student f_S; the teacher's hidden activation with C′ channels is mapped by a random linear transformation β to the student's C channels, where C′ > C, and transferred.)
Objective: ℒ = Σᵢ ‖ β f_T(xᵢ) − f_S(xᵢ) ‖₂²
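A minimal sketch of a hint-style loss in the spirit of this slide, assuming the hidden activations have already been pooled into C′- and C-dimensional vectors; the class name is hypothetical, and keeping β as an untrained random projection follows the slide's "random linear transformation" rather than the original FitNets regressor.

```python
import torch
import torch.nn as nn

class HintLoss(nn.Module):
    """L2 gap between the student's activation and the teacher's activation
    projected from C' channels down to C channels by a fixed random map beta."""
    def __init__(self, teacher_channels, student_channels):
        super().__init__()
        self.beta = nn.Linear(teacher_channels, student_channels, bias=False)
        for p in self.beta.parameters():
            p.requires_grad = False  # random projection, kept fixed during training

    def forward(self, teacher_feat, student_feat):
        # teacher_feat: (N, C'), student_feat: (N, C)
        return ((self.beta(teacher_feat) - student_feat) ** 2).sum(dim=1).mean()
```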
9. • Paying More Attention to Attention: Improving the Performance of
Convolutional Neural Networks via Attention Transfer
Zagoruyko & Komodakis. In ICLR, 2017.
Recent Approaches: Transfer Attention
(Figure: an image xᵢ is fed to the teacher f_T and the student f_S; each network's activation (H×W×C′ for one, H×W×C for the other) is averaged over channels to form attention maps Q_T and Q_S of size H×W, and the teacher's map is transferred to the student.)
Objective: ℒ = Σᵢ ‖ Q_S(xᵢ)/‖Q_S(xᵢ)‖₂ − Q_T(xᵢ)/‖Q_T(xᵢ)‖₂ ‖₂²
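A sketch of an attention-transfer loss along these lines; using the channel-wise mean of squared activations as the attention map and L2-normalizing the maps before matching is one common variant, not necessarily the exact form on the slide.

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    """Channel-averaged spatial attention map, flattened and L2-normalized.
    feat: (N, C, H, W) -> (N, H*W)"""
    a = feat.pow(2).mean(dim=1)             # average over channels
    return F.normalize(a.flatten(1), dim=1)

def attention_transfer_loss(teacher_feat, student_feat):
    diff = attention_map(teacher_feat) - attention_map(student_feat)
    return diff.pow(2).sum(dim=1).mean()
```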
10. • Born-Again Neural Networks (Furlanello et al. In ICML, 2018.)
• Label Refinery: Improving ImageNet Classification through Label Progression (Bagherinezhad et al. In arXiv, 2018.)
Recent Approaches: Student Over Teacher
(Figure: an image xᵢ is fed to the teacher classifier f_T; its class probabilities are used as the ground truth for training the student classifier f_S.)
Surprisingly, the student is significantly better than the teacher.
The student architecture is identical to the teacher's.
11. • Previous works can be expressed in the form ℒ_IKD = Σᵢ l(tᵢ, sᵢ).
• f_T: teacher, f_S: student, l: loss, tᵢ = f_T(xᵢ), sᵢ = f_S(xᵢ).
• IKD transfers the outputs of individual examples from the teacher to the student.
Individual Knowledge Distillation: Generalization
12. Q. What constitutes the knowledge in a learned model?
What is the Knowledge of a Model?
13. Q. What constitutes the knowledge in a learned model?
A. (IKD) Output of individual examples represented by the teacher.
What is the Knowledge of a Model?
14. Q. What constitutes the knowledge in a learned model?
A. (IKD) Output of individual examples represented by the teacher.
A. (RKD) Relations among examples represented by the teacher.
What is the Knowledge of a Model?
15. • Relational knowledge distillation can be expressed in the form ℒ_RKD = Σ_(x₁,…,xₙ) l( ψ(t₁,…,tₙ), ψ(s₁,…,sₙ) ).
• ψ: a relational potential function that extracts relations among the examples.
• RKD transfers relations among examples represented by the teacher to the student.
Relational Knowledge Distillation: Generalization
16. • IKD transfers the outputs of individual examples represented by the teacher to the student.
• RKD transfers relations among examples represented by the teacher to the student.
IKD versus RKD
17. • Among many possible relations, we transfer the “structure” of the embedding space.
• Distance-wise loss (pair)
• Angle-wise loss (triplet)
Relational Knowledge Distillation: Structure to Structure
(Figure: Individual KD matches teacher points t₁, t₂, t₃ to student points s₁, s₂, s₃ point to point, whereas Relational KD matches the structure formed by the points, structure to structure.)
18. • Distance-wise loss (RKD-D)
• RKD-D transfers the relative distances between points in the embedding space.
Relational Knowledge Distillation: Distance-wise Loss
ℒ_RKD-D = Σ_(xᵢ,xⱼ) l_δ( ψ_D(tᵢ, tⱼ), ψ_D(sᵢ, sⱼ) ), with ψ_D(tᵢ, tⱼ) = (1/μ)·‖tᵢ − tⱼ‖₂, where μ is the mean pairwise distance and l_δ is the Huber loss: l_δ(x, y) = ½(x − y)² if |x − y| ≤ 1, and |x − y| − ½ otherwise.
(Figure: pairwise distances among teacher points t₁, t₂, t₃ and among student points s₁, s₂, s₃ in the embedding space.)
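A compact PyTorch sketch of this distance-wise loss (not the authors' code): pairwise distances are normalized by their mean within each embedding space, and smooth_l1_loss plays the role of the Huber penalty l_δ.

```python
import torch
import torch.nn.functional as F

def rkd_distance_loss(teacher_emb, student_emb):
    """Match mean-normalized pairwise distances of student embeddings to the teacher's.
    teacher_emb, student_emb: (N, D) batches of embeddings."""
    with torch.no_grad():
        t_dist = torch.cdist(teacher_emb, teacher_emb, p=2)
        t_dist = t_dist / t_dist[t_dist > 0].mean()   # normalize by mean distance
    s_dist = torch.cdist(student_emb, student_emb, p=2)
    s_dist = s_dist / s_dist[s_dist > 0].mean()
    return F.smooth_l1_loss(s_dist, t_dist)           # Huber penalty
```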
19. • Angle-wise loss (RKD-A)
• RKD-A transfers the angles formed by triplets of points in the embedding space:
ℒ_RKD-A = Σ_(xᵢ,xⱼ,xₖ) l_δ( ψ_A(tᵢ, tⱼ, tₖ), ψ_A(sᵢ, sⱼ, sₖ) ), with ψ_A(tᵢ, tⱼ, tₖ) = cos ∠tᵢtⱼtₖ.
Relational Knowledge Distillation: Angle-wise Loss
(Figure: the angles θ₁, θ₂, θ₃ formed by teacher points t₁, t₂, t₃ and by student points s₁, s₂, s₃ in the embedding space.)
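A matching PyTorch sketch of the angle-wise loss, again illustrative: for every vertex j it compares the cosines of angles toward all pairs (i, k) in the teacher and student embedding spaces.

```python
import torch
import torch.nn.functional as F

def angle_potentials(emb):
    """Cosines of the angles at each vertex j toward every pair (i, k).
    emb: (N, D) -> (N, N, N) tensor of cos angle(e_i, e_j, e_k)."""
    diff = emb.unsqueeze(0) - emb.unsqueeze(1)      # diff[j, i] = e_i - e_j
    unit = F.normalize(diff, p=2, dim=2)
    return torch.bmm(unit, unit.transpose(1, 2))    # inner products of unit directions

def rkd_angle_loss(teacher_emb, student_emb):
    with torch.no_grad():
        t_angle = angle_potentials(teacher_emb)
    s_angle = angle_potentials(student_emb)
    return F.smooth_l1_loss(s_angle, t_angle)       # Huber penalty
```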
20. • Where to apply RKD?
• On any hidden layer or embedding layer.
• Not on layers where individual output values are crucial, because RKD does not transfer the output values of individual examples.
E.g., the softmax layer for classification.
• How to use RKD during training?
• The RKD loss can be combined with a task-specific loss, ℒ_task + λ ⋅ ℒ_RKD (a training-step sketch follows below).
• The RKD loss can also be used on its own to train an embedding network, ℒ_RKD.
Relational Knowledge Distillation: How to use RKD?
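A minimal training-step sketch of the combined objective ℒ_task + λ ⋅ ℒ_RKD, reusing rkd_distance_loss and rkd_angle_loss from the sketches above; the weights lambda_d and lambda_a and the assumption that the task criterion takes (embeddings, labels) are illustrative, not values from the slides.

```python
import torch

def train_step(student, teacher, images, labels, task_criterion, optimizer,
               lambda_d=1.0, lambda_a=2.0):
    """One optimization step: task loss on the student plus distance- and angle-wise RKD."""
    teacher.eval()
    with torch.no_grad():
        t_emb = teacher(images)                      # teacher embeddings, no gradients
    s_emb = student(images)
    loss = (task_criterion(s_emb, labels)
            + lambda_d * rkd_distance_loss(t_emb, s_emb)
            + lambda_a * rkd_angle_loss(t_emb, s_emb))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```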
22. Metric learning
• It aims to train an embedding model.
• In the embedding space, distances between projected examples correspond to their semantic similarity (a triplet-loss example follows below).
Experiment: What is Metric Learning?
(Figure: a DNN f(x; W) maps images x₁, x₂, x₃ to a d-dimensional embedding space, pulling positives together and pushing negatives apart. Right: t-SNE of the embedding space on the Cars 196 dataset (Wang et al., 2017).)
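As a small illustration of this idea, a triplet margin loss pulls an anchor toward a positive of the same class and pushes it away from a negative of a different class; the tensors and the margin value below are made up for the example.

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings: 16 anchors, positives, and negatives of dimension 128.
anchor, positive, negative = torch.randn(3, 16, 128)
loss = F.triplet_margin_loss(anchor, positive, negative, margin=0.2)
```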
23. • Evaluation
• Image retrieval, recall@k
• Dataset
• Cars 196 (Krause et al. In 3dRR, 2013.)
• CUB-200-2011 (Wah et al. In CNS-TR, 2011.)
• Stanford Online Products (Song et al. In CVPR, 2016.)
• Architecture
• Teacher: ResNet50 (backbone) + 512-d fc layer (embedding layer) + L2 normalization
• Student: ResNet18 + fc embedding layer of various dimensions + L2 normalization (optional) (a model sketch follows after this slide)
• Target layer for RKD
• Final embedding outputs of teacher and student
• Training Objective
• Teacher: Triplet loss & Distance-weighted sampling (Wu et al. In ICCV, 2017.)
• Student: Triplet loss, RKD-D, RKD-A, RKD-DA, DarkRank (Chen et al. In AAAI, 2018.)
Experiment: Metric Learning
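A sketch of such teacher and student embedding networks using torchvision backbones; the 128-d student embedding is just one of the "various dimensions", and dropping L2 normalization for the student reflects the "(optional)" note. This is an assumption-laden illustration, not the authors' implementation.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50, resnet18

class EmbeddingNet(nn.Module):
    """Backbone + fc embedding layer + optional L2 normalization."""
    def __init__(self, backbone, embedding_dim=512, l2_normalize=True):
        super().__init__()
        in_features = backbone.fc.in_features
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier
        self.embedding = nn.Linear(in_features, embedding_dim)
        self.l2_normalize = l2_normalize

    def forward(self, x):
        h = self.backbone(x).flatten(1)
        e = self.embedding(h)
        return F.normalize(e, dim=1) if self.l2_normalize else e

teacher = EmbeddingNet(resnet50(), embedding_dim=512, l2_normalize=True)
student = EmbeddingNet(resnet18(), embedding_dim=128, l2_normalize=False)
```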
24. Experiment: Metric Learning
(a) Recall@1 on CUB-200-2011
(b) Recall@1 on Cars 196
Distillation to a small network
• Model-d refers to a model with a d-dimensional embedding.
25. Self-Distillation
• Teacher: ResNet50 + 512-d fc + L2 normalization
• Trained using triplet loss
• Student: ResNet50 + 512-d fc
• Trained using RKD-DA
Experiment: Metric Learning
(a) Recall@1 of Self-Distillation
26. Comparison with state-of-the-art methods
Experiment: Metric Learning
• On CUB-200-2011, we achieve state-of-the-art performance regardless of the backbone network.
• On Cars 196 & Stanford Online Products, we achieve the second-best performance.
Note that ABE8 (Kim et al. In ECCV, 2018) requires additional attention modules for 8 branches.
(a) Recall@K comparison with state-of-the-art methods.
27. Experiment: Metric Learning
Qualitative Results
• Where the teacher (Triplet) fails, the student (RKD-DA) succeeds at top-1.
(a) Retrieval results on CUB-200-2011. (b) Retrieval results on Cars 196.
28. Experiment: Image Classification
Image Classification
• Datasets: CIFAR-10, CIFAR-100
• Architecture
• Teacher: ResNet50
• Student: VGG-11 with BatchNorm
• Target layer for RKD (a feature-hook sketch follows after this slide)
• Teacher: output of avgpool layer
• Student: output of pool5 layer
• Training Objective
• Teacher: cross-entropy
• Student: cross-entropy + one of Hinton et al., RKD-D, or RKD-DA
(a) Accuracy (%) on CIFAR-10 and CIFAR-100.
(Figure: the pooled CNN features of the ResNet50 teacher are transferred to the VGG11-with-BN student before each network's fc classifier.)
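One way to grab the features named above is with forward hooks; the snippet below is a hedged sketch in which the hook targets and num_classes are assumptions based on this slide, and the RKD losses from the earlier sketches would then be applied to the captured features alongside cross-entropy.

```python
import torch
from torchvision.models import resnet50, vgg11_bn

features = {}

def save_to(name):
    def hook(module, inputs, output):
        features[name] = output.flatten(1)   # store the pooled activation as a matrix
    return hook

teacher = resnet50(num_classes=100)
student = vgg11_bn(num_classes=100)
teacher.avgpool.register_forward_hook(save_to("teacher"))
student.features[-1].register_forward_hook(save_to("student"))  # last max-pool ("pool5")

images = torch.randn(8, 3, 32, 32)            # dummy CIFAR-sized batch
_ = teacher(images), student(images)          # forward passes fill the features dict
# rkd_distance_loss(features["teacher"], features["student"]) etc. can now be computed.
```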
29. Experiment: What is Few-Shot Learning?
Few-shot learning
• A classifier learns to generalize to new, unseen classes given only a few examples of each new class.
• Shot: the number of examples given for each new class
• Way: the number of new classes
• E.g., Prototypical Networks (Snell et al. In NIPS, 2017)
• An embedding network in which classification is performed based on distances to the given examples of the new classes (a sketch of this rule follows after this slide).
Prototypical Networks for few-shot learning.
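A small sketch of the prototype-based decision rule (classify a query by its distance to the mean support embedding of each class); the names and shapes here are illustrative, not taken from the slides.

```python
import torch

def prototypical_logits(support_emb, support_labels, query_emb, n_way):
    """Logits = negative squared distances from each query to each class prototype.
    support_emb: (S, D), support_labels: (S,), query_emb: (Q, D)."""
    prototypes = torch.stack([support_emb[support_labels == c].mean(dim=0)
                              for c in range(n_way)])       # (n_way, D)
    return -torch.cdist(query_emb, prototypes, p=2) ** 2    # (Q, n_way)
```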
30. Experiment: Few-Shot Learning
Few-shot learning
• Datasets
• Omniglot (Lake et al. In Science, 2015.)
• miniImageNet (Vinyals et al. In NIPS, 2016.)
• Architecture
• Teacher: 4 convolutional layers
• Student: same as the teacher
• Target layer for RKD
• Final embedding outputs of the teacher and student
• Training Objective
• Teacher: Snell et al. (prototypical networks)
• Student: Snell et al. + (RKD-D or RKD-DA)
(a) Accuracy (%) on Omniglot.
(b) Accuracy (%) on miniImageNet.
31. Discussion: Effective Adaptation on Source Domain
• Both Cars 196 & CUB-200-2011 are fine-grained classification datasets.
• They require adaptation to the specific characteristics of the domain,
e.g., finding local patches that distinguish an object from others.
• ‘Triplet’ is the teacher network used to educate the ‘RKD-DA’ model.
(a) Recall@1 curves on the train/evaluation sets while training the teacher (Triplet) and the student (RKD-DA) on Cars 196.
(b) Recall@1 on various domains; both ‘Triplet’ and ‘RKD-DA’ are the models trained on Cars 196.
32. • We have introduced Relational KD, which effectively transfers knowledge using relations among data examples represented by the teacher.
• Experiments conducted on different tasks and benchmarks show that Relational KD improves the performance of the educated student networks by a significant margin.
Conclusion
34. [1] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS workshop, 2015.
[2] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
[3] S. Zagoruyko and N. Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer.
In ICLR, 2017.
[4] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar. Born-again neural networks. In ICML, 2018.
[5] H. Bagherinezhad, M. Horton, M. Rastegari, and A. Farhadi. Label refinery: Improving imagenet classification through label progression. In arXiv, 2018.
[6] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In 3dRR, 2013.
[7] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. In CNS-TR, 2011.
[8] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
[9] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl. Sampling matters in deep embedding learning. In ICCV, 2017.
[10] Y. Chen, N. Wang, and Z. Zhang. Darkrank: Accelerating deep metric learning via cross sample similarities transfer. In AAAI, 2018.
[11] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon. Attention based ensemble for deep metric learning. In ECCV, 2018.
[12] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
[13] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. In Science, 2015.
[14] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
References