Presenter: Jaehong Yoon (Ph.D. student, KAIST)
Date: July 2018
We propose a novel deep network architecture for lifelong learning, which we refer to as the Dynamically Expandable Network (DEN), that can dynamically decide its network capacity as it trains on a sequence of tasks, to learn a compact, overlapping knowledge-sharing structure among tasks. DEN is efficiently trained in an online manner by performing selective retraining, dynamically expands network capacity upon arrival of each task with only the necessary number of units, and effectively prevents semantic drift by splitting/duplicating units and timestamping them. We validate DEN on multiple public datasets under lifelong learning scenarios, on which it not only significantly outperforms existing lifelong learning methods for deep networks, but also achieves the same level of performance as the batch counterparts with substantially fewer parameters. Further, the network fine-tuned on all tasks obtained significantly better performance than the batch models, which shows that DEN can be used to estimate the optimal network structure even when all tasks are available from the start.
2. Introduction
Jaehong Yoon
- Education
Korea Advanced Institute of Science and Technology (KAIST)
• Ph.D. in School of Computing (Aug. 2018 – )
• Advisor: Prof. Sung Ju Hwang
Ulsan National Institute of Science and Technology (UNIST)
• M.S. in Computer Engineering (Aug. 2016 – Feb. 2018)
• Advisor: Prof. Sung Ju Hwang
• B.S. in Computer Science Engineering (Mar. 2012 – Aug. 2016)
• Biological Science Minor
3. Introduction
- Publications
Juho Lee, S. Kim, J. Yoon, H. B. Lee, E. Yang, and S. J. Hwang, "Adaptive Network Sparsification via Dependent Variational Beta-Bernoulli Dropout", arXiv preprint arXiv:1805.10896 (2018).
Jaehong Yoon, E. Yang, J. Lee, and S. J. Hwang, "Lifelong Learning with Dynamically Expandable Networks", International Conference on Learning Representations (ICLR), 2018.
Jaehong Yoon and S. J. Hwang, "Combined Group and Exclusive Sparsity for Deep Neural Networks", International Conference on Machine Learning (ICML), 2017.
- Experience
Korea Advanced Institute of Science and Technology (KAIST)
• Contract Research Scientist (Feb. 2018 – Aug. 2018)
AItrics
• Research Intern (Mar. 2018 – May 2018)
4. Challenge: Incomplete, Growing Dataset
In many large-scale learning scenarios, not all training data might be available when we want to begin training the network.
[Figure: a class hierarchy from ImageNet (22,000 classes), e.g. Car → Convertible, Sports car, Sedan, Roadster]
5. Challenge: Incomplete, Growing Dataset
In many large-scale learning scenarios, not all training data might be available when we want to begin training the network.
[Figure: the hierarchy grows to 1M classes, e.g. Car → Convertible, Sports car, Sedan, Roadster → BMW Z4, Ferrari 458 Spider, Ferrari 458 Italia, Porsche 911 Turbo, Hyundai Sonata, BMW 3 Series]
6. Challenge: Incomplete, Growing Dataset
Even worse, the set of tasks may dynamically grow as new tasks are introduced.
[Figure: new leaf classes such as the 2015 Mustang Convertible and the Tesla Model S arrive over time, on top of the existing 1M-class hierarchy]
7. Solution: Lifelong Learning
Humans learn forever throughout their lives - couldn't we build a similar system that basically learns forever while becoming increasingly smarter over time?
We integrate our model into a lifelong learning framework that continuously learns by actively discovering new categories and learning them in the context of known ones:
1) Tasks are received in a sequential order (t-2, t-1, t, t+1, ...).
2) Knowledge is transferred from previously learned tasks.
3) New knowledge is stored for future use.
4) Existing knowledge is refined.
8. Lifelong Learning of a Deep Neural Network
However, if the classes we had in the early stages of learning differ significantly from the new class, utilizing prior knowledge may degrade performance.
[Figure: a two-layer network with weights W¹, W² trained on tasks t-2, t-1, t is extended with a new class at t+1]
9. Semantic Drift
Introduction of new units can also result in semantic drift, or catastrophic forgetting, where the original meaning of the features changes as they fit to later tasks.
[Figure: adding a new class on top of W¹, W² shifts the meaning of the existing features]
10. Network Expansion
To learn new tasks that are relatively different from those in the early stages of learning, the model may need to expand its network capacity.
[Figure: at task t+1, k new hidden units (a fixed number) are appended to each layer of W¹, W² for the new class]
11. Dynamically Expandable Network (DEN)
To prevent this, we propose a novel deep network that can selectively utilize prior knowledge for each task while dynamically expanding its capacity when necessary.
[Figure: at task t+1, DEN adds only as many hidden units to W¹, W² as the new class requires]
12. Dynamically Expandable Network (DEN)
Existing models simply retrain the network for the new task, or expand the network with a fixed number of neurons without retraining.
[Figure: comparison of Elastic Weight Consolidation [Kirkpatrick et al. 16], Progressive Network [Rusu et al. 16], and the Dynamically Expandable Network [Ours]]
Our Dynamically Expandable Network, on the other hand, partially retrains the existing network and adds in only the necessary number of neurons.
13. Incremental Training of a DEN
We further prevent semantic drift by splitting/duplicating units whose meanings have changed significantly after learning each task 𝑡, and by timestamping units.
[Figure: the three steps - selective retraining, dynamic network expansion, and network split/duplication (applied for every hidden unit 𝑖)]
We first identify and retrain only the parameters relevant to task 𝑡. If the loss is still high, we expand each layer by 𝑘 neurons, with group sparsity to drop the unnecessary ones.
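Putting the three steps together, here is a minimal control-flow sketch of training on one task. Every function name and the threshold value are illustrative stand-ins rather than the authors' code; the individual steps are sketched concretely under the following slides.

```python
# A control-flow sketch of DEN's incremental training on task t.
# The step functions are stubs here (sketched under slides 14-18);
# TAU is a hypothetical expansion threshold, not a value from the paper.
TAU = 0.05

def selective_retrain(net, data): ...        # slide 14: find + retrain subnetwork
def expand_and_prune(net, data, k): ...      # slide 15: add k units, group-lasso prune
def split_duplicate(net, data): ...          # slide 17: copy drifted units
def timestamp_new_units(net, stage): ...     # slide 18: record when units were added
def task_loss(net, data): return 0.0         # stand-in for the task-t loss

def train_on_task(net, data, t, k=10):
    selective_retrain(net, data)             # 1. retrain only relevant parameters
    if task_loss(net, data) > TAU:           # 2. expand only if still underfitting
        expand_and_prune(net, data, k)
    split_duplicate(net, data)               # 3. undo semantic drift by copying
    timestamp_new_units(net, stage=t)        # used to mask units at inference
```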
14. Incremental Training of a DEN
1. Selective Retraining
When the model learns a new task, the network finds the relevant neurons and retrains only them:
• Initially, train the network with ℓ₁-regularization to promote sparsity in the weights.
• Fit a sparse linear model to predict task 𝑡 using the topmost hidden units of the neural network:
  $\min_{\mathbf{W}^{t}_{L,t}} \; \mathcal{L}\big(\mathbf{W}^{t}_{L,t};\, \mathbf{W}^{t-1}_{1:L-1}, \mathcal{D}_t\big) + \mu \big\|\mathbf{W}^{t}_{L,t}\big\|_1$
• Perform breadth-first search on the network starting from the selected nodes, then retrain only the selected subnetwork 𝑆:
  $\min_{\mathbf{W}^{t}_{S}} \; \mathcal{L}\big(\mathbf{W}^{t}_{S};\, \mathbf{W}^{t-1}_{S^{c}}, \mathcal{D}_t\big) + \mu \big\|\mathbf{W}^{t}_{S}\big\|_2$
[Figure: at task 𝑡, BFS from the task head selects a sparse subnetwork over inputs 𝒙₁, 𝒙₂, ..., 𝒙ᵢ]
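As a concrete illustration, here is a minimal numpy sketch of the selection step, assuming the network is stored as a plain list of weight matrices; the function name and data layout are my own, not the paper's code.

```python
import numpy as np

def select_subnetwork(weights, top_units):
    """Backward breadth-first search for selective retraining.

    weights[l] has shape (n_l, n_{l-1}) and maps layer l-1 to layer l;
    top_units are the topmost hidden units with nonzero weights in the
    l1-sparse task head. Returns, per layer, the units to retrain.
    """
    selected = {len(weights): np.asarray(top_units)}
    for l in range(len(weights), 0, -1):
        rows = weights[l - 1][selected[l]]   # incoming weights of selected units
        # a unit in layer l-1 is relevant if any selected unit in layer l
        # is connected to it by a nonzero weight
        selected[l - 1] = np.nonzero(np.any(rows != 0, axis=0))[0]
    return selected

# toy usage: two sparse layers, task head picked units 0 and 2 at the top
rng = np.random.default_rng(0)
W = [rng.normal(size=(4, 6)) * (rng.random((4, 6)) < 0.3),
     rng.normal(size=(3, 4)) * (rng.random((3, 4)) < 0.3)]
print(select_subnetwork(W, top_units=[0, 2]))
```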
15. Incremental Training of a DEN
2. Dynamic Network Expansion
When the loss is higher than a threshold 𝜏, expand each layer by a constant 𝑘 neurons, and remove the useless ones among them:
• Perform group sparsity regularization on the added parameters:
  $\min_{\mathbf{W}^{\mathcal{N}}_{l}} \; \mathcal{L}\big(\mathbf{W}^{\mathcal{N}}_{l};\, \mathbf{W}^{t-1}_{l}, \mathcal{D}_t\big) + \lambda \sum_{g} \big\|\mathbf{W}^{\mathcal{N}}_{l,g}\big\|_2$
  where 𝑔 ∈ 𝐺 is a group defined on the incoming weights for each neuron.
• The model captures new features that were not previously represented by $\mathbf{W}^{t-1}_{l}$.
[Figure: at task 𝑡, 𝑘 units are added per layer and the unused ones are pruned away]
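To make the group penalty tangible, below is a small numpy sketch of how group lasso drives whole added units to zero and how the zeroed units can then be dropped. The proximal-update form and the tolerance are standard group-lasso machinery chosen here for illustration, not the paper's exact optimizer.

```python
import numpy as np

def group_lasso_prox(W_new, lam):
    """Proximal step for lam * sum_g ||W_g||_2, one group per row
    (the incoming weights of one added unit): rows with norm <= lam
    collapse to exactly zero, shrinking useless units away."""
    norms = np.linalg.norm(W_new, axis=1, keepdims=True)
    scale = np.clip(1.0 - lam / np.maximum(norms, 1e-12), 0.0, None)
    return W_new * scale

def prune_added_units(W_new, tol=1e-8):
    """Drop added units whose entire incoming-weight group was zeroed."""
    keep = np.linalg.norm(W_new, axis=1) > tol
    return W_new[keep], keep

# toy usage: 3 freshly added units; group lasso eliminates the weak one
W_new = np.array([[0.9, -0.4, 0.2],
                  [0.05, 0.02, -0.01],   # weak unit: its whole group dies
                  [-0.6, 0.3, 0.7]])
W_kept, keep = prune_added_units(group_lasso_prox(W_new, lam=0.1))
print(keep)   # [ True False  True]
```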
16. Group Sparsity Regularization
$\Omega(\mathbf{W}^{l}) = \sum_{g} \big\|\mathbf{W}^{l}_{g}\big\|_2$
[Figure: group sparsity on layer 𝑙: the incoming weights of each unit in layer 𝑙 form one group over layer 𝑙−1]
[Wen16] Wen, Wei, et al. "Learning Structured Sparsity in Deep Neural Networks." Advances in Neural Information Processing Systems, 2016.
The (2,1)-norm, which is the ℓ1-norm over per-group ℓ2-norms, promotes feature sharing and completely eliminates the features that are not shared.
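For concreteness, here is a tiny numpy example of the (2,1)-norm with one group per unit's incoming weights; the matrix values are made up.

```python
import numpy as np

# Omega(W) = sum_g ||W_g||_2 with each group g = one row (a unit's
# incoming weights). An all-zero row is a fully eliminated feature.
W = np.array([[0.0, 0.0, 0.0],     # eliminated: contributes 0 to the norm
              [0.3, -0.4, 0.0],    # ||.||_2 = 0.5
              [1.0, 0.0, 0.0]])    # ||.||_2 = 1.0
group_norms = np.linalg.norm(W, axis=1)
print(group_norms)                 # [0.  0.5 1. ]
print(group_norms.sum())           # (2,1)-norm = 1.5
```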
17. Incremental Training of a DEN
3. Network Split / Duplication
After step 2, we retrain the whole network with ℓ2-regularization toward the previous-step weights:
$\min_{\mathbf{W}^{t}} \; \mathcal{L}\big(\mathbf{W}^{t};\, \mathcal{D}_t\big) + \lambda \big\|\mathbf{W}^{t} - \mathbf{W}^{t-1}\big\|_2^2$
If a neuron has drifted too far from its previous-step version, we split and duplicate it, restoring the original to its previous-step value:
• Measure the amount of semantic drift 𝜌ᵢᵗ for each hidden unit 𝑖; if 𝜌ᵢᵗ > 𝜎, copy it.
• After the duplication, retrain the network, since the split changes the overall structure.
[Figure: at task 𝑡, drifted units are copied; the originals are restored to their 𝑡−1 values]
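Below is a minimal numpy sketch of this step, under the assumption (my reading of the slide) that 𝜌ᵢ is the ℓ2 distance between a unit's incoming weights before and after learning task 𝑡; the names and the value of 𝜎 are illustrative.

```python
import numpy as np

def split_duplicate(W_prev, W_curr, sigma=0.02):
    """Copy units whose incoming weights drifted more than sigma.

    The original unit is restored to its previous-task weights (keeping
    its old meaning); the appended copy keeps the drifted, task-t-adapted
    weights. The network is then retrained, since the structure changed.
    """
    drift = np.linalg.norm(W_curr - W_prev, axis=1)   # rho_i per unit
    to_split = np.nonzero(drift > sigma)[0]
    W_out = W_curr.copy()
    W_out[to_split] = W_prev[to_split]                # restore originals
    W_out = np.vstack([W_out, W_curr[to_split]])      # append duplicates
    return W_out, to_split
```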
18. Incremental Training of a DEN
We timestamp each newly added unit to record the stage 𝑡 at which it was added to the network, to further prevent drift caused by the introduction of new hidden units.
[Figure: units carry timestamps t−2, t−1, t; at inference for task 𝑡, units newer than 𝑡 are excluded]
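Here is a sketch of how timestamped inference might look, assuming each layer stores the stage at which every unit was added; the data layout and names are my own.

```python
import numpy as np

def masked_forward(x, weights, stamps, t):
    """Forward pass for a task-t query: units introduced after stage t
    are zeroed out, so later expansions cannot alter task t's features."""
    h = x
    for W, stamp in zip(weights, stamps):
        h = np.maximum(W @ h, 0.0)     # plain ReLU layer
        h = h * (stamp <= t)           # keep only units with timestamp <= t
    return h

# toy usage: a layer of 3 units where the last one was added at stage 2
W1 = np.ones((3, 4))
stamps = [np.array([0, 1, 2])]
print(masked_forward(np.ones(4), [W1], stamps, t=1))   # third unit masked
```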
19. Datasets and Networks
We validate our method on four public datasets for classification, with various
networks.
• CIFAR-100: 100 animal and vehicle classes; used a modified version of AlexNet.
• MNIST-Variation: modified MNIST dataset including perturbations; used LeNet-4 (2 conv. and 2 fc. layers).
• Permuted-MNIST: different random permutations of the input pixels; used LeNet-4.
• AwA: 50 animal classes; used a feedforward network.
20. Baselines
We compare our networks against relevant baselines.
• DNN-STL: a separate network trained for each task.
• DNN-MTL: a single network trained on all tasks at once.
• DNN-L2: retrained for each task with ℓ2-regularization toward the previous weights:
  $\min_{\mathbf{W}^{t}} \; \mathcal{L}\big(\mathbf{W}^{t};\, \mathcal{D}_t\big) + \lambda \big\|\mathbf{W}^{t} - \mathbf{W}^{t-1}\big\|_2^2$
• DNN-EWC: Elastic Weight Consolidation.
• DNN-Progressive: Progressive Networks.
• DEN: our Dynamically Expandable Network.
Kirkpatrick, James, et al. "Overcoming Catastrophic Forgetting in Neural Networks." Proceedings of the National Academy of Sciences 114.13 (2017): 3521-3526.
Rusu, Andrei A., et al. "Progressive Neural Networks." arXiv preprint arXiv:1606.04671 (2016).
21. Results
Incremental training with DEN yields a much smaller network that performs almost the same as the networks trained in batch.
Further fine-tuning of DEN on all tasks obtains the best performance, which shows that DEN is also useful for network capacity estimation.
22. Results
DEN maintains the performance obtained on the previous tasks and achieves larger performance gains on later tasks.
Also, timestamped inference is highly effective in preventing semantic drift.
23. Results
Selective retraining takes significantly less time than full retraining of the network, while achieving much higher AUROC.
DNN-Selective mostly selects a smaller portion of the upper-layer units, which are more task-specific, while selecting a larger portion of the more generic lower-layer units.
24. Results
We also compare against variants of our model that perform selective retraining and layer expansion, but without network split, on the MNIST-Variation dataset.
DEN-Dynamic even outperforms DEN-Constant with similar capacity, since the model can dynamically adjust the number of neurons at each layer.
25. Results
On Permuted-MNIST, our DEN outperforms all lifelong learning baselines while using only 1.39 times the base network capacity.
Further, DEN-Finetune achieves the best AUROC among all models, including DNN-STL
and DNN-MTL.
26. Conclusion
• We proposed a novel deep neural network for lifelong learning, Dynamically
Expandable Network (DEN).
• DEN performs partial retraining of the network trained on old tasks while increasing its capacity when necessary.
• DEN significantly outperforms the existing lifelong learning methods,
achieving almost the same performance as the network trained in batch.
• Further fine-tuning the model on all tasks yields models that outperform the batch models, which shows that DEN is useful for network structure estimation as well.