This is the 243rd paper review of TensorFlow Korea's PR12 paper reading group.
This time the paper is Designing Network Design Spaces from Facebook AI Research, better known as RegNet.
When designing a CNN, are bottleneck layers really a good idea? Does adding more layers always yield higher performance? When the width and height of the activation map are halved (stride 2 or pooling), the number of channels is doubled, but is that the best choice? Might it be better to have no bottleneck layer at all? Is there a magic number of layers that gives the best performance? And when the activations are halved, might tripling the channels instead of doubling them work better?
Rather than designing one good neural network, this paper is about designing a good design space: a space populated by good neural networks, which techniques like AutoML can then search. The authors propose a human-in-the-loop method that starts from a nearly unconstrained design space and progressively narrows it down to a good one. In the video below you can see which design space gave birth to RegNet, which outperforms EfficientNet, and which of the design choices we have been taking for granted turn out to be wrong along the way.
Video link: https://youtu.be/bnbKQRae_u4
Paper link: https://arxiv.org/abs/2003.13678
3. Introduction
• Over the past several years better architectures have resulted in
considerable progress in a wide range of visual recognition tasks.
e.g. VGG, ResNet, MobileNet, EfficientNet, etc.
• While manual network design has led to large advances, finding well-
optimized networks manually can be challenging, especially as the
number of design choices increases.
• A popular approach to address this limitation is neural architecture
search (NAS).
• However, it does not enable discovery of network design principles
that deepen our understanding and allow us to generalize to new
settings.
4. Introduction
• In this work, the authors present a new network design paradigm
that combines the advantages of manual design and NAS.
• Instead of focusing on designing individual network instances, they
design design spaces that parametrize populations of networks.
5. Exploring Randomly Wired Neural Networks for Image Recognition (PR-155)
• Design a network generator, not an individual network!
6. Introduction
• The authors start with a relatively unconstrained design space they call
AnyNet and apply a human-in-the-loop methodology to arrive at a
low-dimensional design space consisting of simple "regular"
networks: RegNet.
• The RegNet design space generalizes to various compute regimes,
schedule lengths, and network block types.
• They analyze the RegNet design space and arrive at interesting
findings that do not match the current practice of network design.
7. Tools for Design Space Design
• Rather than designing or searching for a single best model under
specific settings, the authors study the behavior of populations of
models.
• They rely on the concept of network design spaces introduced by
Radosavovic et al., "On Network Design Spaces for Visual
Recognition," ICCV 2019.
• The core idea of that work is that the quality of a design space can be
quantified by sampling a set of models from it and characterizing
the resulting model error distribution.
8. Tools for Design Space Design
• To obtain a distribution of models, sample and train n models from a
design space.
• A primary tool for analyzing design space quality is the error
empirical distribution function (EDF). The error EDF of n models with
errors 𝑒𝑖 is given by:
𝐹(𝑒) = (1/𝑛) Σᵢ₌₁ⁿ 1[𝑒ᵢ < 𝑒]
• 𝐹(𝑒) gives the fraction of models with error less than 𝑒 (a minimal
sketch follows below).
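A minimal NumPy sketch of the error EDF; the error values here are synthetic, purely for illustration:

```python
import numpy as np

def error_edf(errors, e_grid):
    """Error EDF: F(e) = (1/n) * sum_i 1[e_i < e], i.e. the fraction
    of sampled models whose error falls below each threshold e."""
    errors = np.asarray(errors)
    return np.array([(errors < e).mean() for e in e_grid])

# Hypothetical population: 500 models with synthetic top-1 errors (%).
rng = np.random.default_rng(0)
errors = rng.normal(loc=45.0, scale=3.0, size=500)
grid = np.linspace(errors.min(), errors.max(), 100)
edf = error_edf(errors, grid)  # plot grid vs. edf; higher curves = better space
```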
9. Tools for Design Space Design
• Given a population of trained models, we can plot and analyze
various network properties versus network error.
• For these plots, an empirical bootstrap is applied to estimate the
likely range in which the best models fall.
The blue shaded regions are ranges containing the best models with 95% confidence, and the black vertical line
marks the most likely best value (the procedure is sketched below).
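One plausible reading of this procedure in NumPy; the resampling scheme is an assumption based on the standard empirical bootstrap, not the authors' exact code:

```python
import numpy as np

def best_value_band(x, err, n_boot=5000, alpha=0.05, seed=0):
    """Empirical bootstrap for the likely range of the best value of a
    network property x (e.g. depth) with respect to error: resample the
    (x, error) pairs, record the x of the minimum-error model in each
    resample, and take percentiles of those records."""
    rng = np.random.default_rng(seed)
    x, err = np.asarray(x), np.asarray(err)
    best = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))   # resample with replacement
        best[b] = x[idx][np.argmin(err[idx])]        # property of the best model
    lo, hi = np.percentile(best, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi, np.median(best)  # shaded band and most likely best value
```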
10. Tools for Design Space Design
• To summarize:
1. generate distributions of models obtained by sampling and
training n models from a design space.
2. compute and plot error EDFs to summarize design space quality.
3. visualize various properties of a design space and use an
empirical bootstrap to gain insight.
4. use these insights to refine the design space.
11. The AnyNet Design Space
• Given an input image, a network consists of a simple stem, followed by the
network body that performs the bulk of the computation, and a final network
head that predicts the output classes.
• Keep the stem and head fixed and as simple as possible, and instead focus on
the structure of the network body.
• The network body consists of 4 stages operating at progressively reduced
resolution; each stage consists of a sequence of identical blocks (a skeleton
sketch follows below).
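A hypothetical PyTorch skeleton of this stem/body/head layout; the class name, default widths, and depths are illustrative assumptions, not the paper's reference code, and `block_fn` stands for whatever block the design space uses:

```python
import torch.nn as nn

class SimpleNet(nn.Module):
    """Stem -> body (4 stages at progressively reduced resolution) -> head."""
    def __init__(self, block_fn, d=(1, 1, 2, 2), w=(32, 64, 128, 256),
                 num_classes=1000):
        super().__init__()
        # Fixed, simple stem: one strided 3x3 conv.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True))
        # Body: stage i has d_i identical blocks of width w_i; the first
        # block of each stage halves the resolution via stride 2.
        stages, w_in = [], 32
        for d_i, w_i in zip(d, w):
            blocks = [block_fn(w_in if j == 0 else w_i, w_i,
                               stride=2 if j == 0 else 1) for j in range(d_i)]
            stages.append(nn.Sequential(*blocks))
            w_in = w_i
        self.body = nn.Sequential(*stages)
        # Head: global average pool + linear classifier.
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(w[-1], num_classes))

    def forward(self, x):
        return self.head(self.body(self.stem(x)))
```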
12. AnyNetX
• Most of their experiments use the standard residual bottleneck block
with group convolution. They refer to this as the X block, and to the
AnyNet design space built on it as AnyNetX (a sketch of the block
follows below).
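A sketch of such a block in PyTorch, written from the description above rather than the reference pycls implementation; the argument names are assumptions:

```python
import torch.nn as nn

class XBlock(nn.Module):
    """Residual bottleneck block with group convolution ('X block'):
    1x1 conv down to width w_out/b, 3x3 group conv with group width g,
    1x1 conv back up, plus a projection shortcut when shapes change."""
    def __init__(self, w_in, w_out, stride=1, b=1, g=8):
        super().__init__()
        w_b = w_out // b                    # bottleneck width
        self.f = nn.Sequential(
            nn.Conv2d(w_in, w_b, 1, bias=False),
            nn.BatchNorm2d(w_b), nn.ReLU(inplace=True),
            nn.Conv2d(w_b, w_b, 3, stride=stride, padding=1,
                      groups=w_b // g, bias=False),  # group width g
            nn.BatchNorm2d(w_b), nn.ReLU(inplace=True),
            nn.Conv2d(w_b, w_out, 1, bias=False),
            nn.BatchNorm2d(w_out))
        self.proj = None
        if stride != 1 or w_in != w_out:    # projection shortcut
            self.proj = nn.Sequential(
                nn.Conv2d(w_in, w_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(w_out))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        shortcut = x if self.proj is None else self.proj(x)
        return self.relu(self.f(x) + shortcut)
```

With the skeleton from the previous slide, `SimpleNet(XBlock)` then assembles a complete AnyNet-style model.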
13. AnyNetX
• The AnyNetX design space has 16 degrees of freedom as each
network consists of 4 stages and each stage 𝑖 has 4 parameters: the
number of blocks 𝑑𝑖, block width 𝑤𝑖, bottleneck ratio 𝑏𝑖, and group
width 𝑔𝑖.
• Resolution 𝑟 = 224 (fixed)
• To obtain valid models, log-uniform sampling is performed with 𝑑𝑖 ≤ 16,
𝑤𝑖 ≤ 1024 and divisible by 8, 𝑏𝑖 ∈ {1, 2, 4}, and 𝑔𝑖 ∈ {1, 2, 4, 8, 16, 32}
(a rough sampler is sketched below).
• There are (16 · 128 · 3 · 6)⁴ ≈ 10¹⁸ possible model configurations in
the AnyNetX design space.
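A rough sampler for these 16 degrees of freedom. Drawing exponents uniformly to approximate log-uniform sampling is an assumption; a real sampler would also reject configurations where g does not divide the bottleneck width:

```python
import random

def sample_anynetx(seed=None):
    """Sample one AnyNetX configuration: (d_i, w_i, b_i, g_i) per stage."""
    rng = random.Random(seed)
    stages = []
    for _ in range(4):
        d = max(1, round(2 ** rng.uniform(0, 4)))      # d_i <= 16, log-uniform
        w = 8 * max(1, round(2 ** rng.uniform(0, 7)))  # w_i <= 1024, divisible by 8
        b = rng.choice([1, 2, 4])                      # bottleneck ratio b_i
        g = rng.choice([1, 2, 4, 8, 16, 32])           # group width g_i
        stages.append({"d": d, "w": w, "b": b, "g": g})
    return stages

print(sample_anynetx(seed=0))
```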
14. Design Space Design Aims
1. To simplify the structure of the design.
2. To improve the interpretability of the design space.
3. To improve or maintain the design space quality.
4. To maintain model diversity in the design space.
15. AnyNetX(A, B, C)
• The unconstrained AnyNet design space is referred to as AnyNetXA.
• Sharing the bottleneck ratio (𝑏𝑖 = 𝑏 for all stages 𝑖) turns AnyNetXA into AnyNetXB.
• Sharing the group width (𝑔𝑖 = 𝑔 for all stages 𝑖) turns AnyNetXB into AnyNetXC.
16. AnyNetX(D, E)
• AnyNetXD results from examining typical network structures of both good
and bad networks from AnyNetXC.
A pattern emerges: good networks have increasing widths.
• AnyNetXD constraint: AnyNetXC & 𝑤𝑖+1 ≥ 𝑤𝑖.
• In addition to stage widths 𝑤𝑖 increasing with 𝑖, the stage depths 𝑑𝑖
likewise tend to increase for the best models.
• AnyNetXE constraint: AnyNetXD & 𝑑𝑖+1 ≥ 𝑑𝑖.
• Finally, the constraints on 𝑤𝑖 and 𝑑𝑖 each reduce the design space by a
factor of 4! (= 24), for a cumulative reduction of O(10⁷) from AnyNetXA
(in sampler terms these constraints are a simple rejection filter; see the
sketch below).
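A sketch of that filter, reusing the configuration format from the sampler above:

```python
def satisfies_anynetx_e(stages):
    """AnyNetXD: widths non-decreasing; AnyNetXE: depths non-decreasing."""
    ws = [s["w"] for s in stages]
    ds = [s["d"] for s in stages]
    return all(a <= b for a, b in zip(ws, ws[1:])) and \
           all(a <= b for a, b in zip(ds, ds[1:]))
```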
18. Linear Fits
• To gain further insight into the model structure, the best 20 models
from AnyNetXE are shown in a single plot.
• While there is significant variance in the individual models (gray
curves), in the aggregate a pattern emerges.
• In particular, the line 𝑤𝑗 = 48 · (𝑗 + 1) for 0 ≤ 𝑗 ≤ 20 is shown in
the same plot.
19. Linear Fits
• Inspired by AnyNetXD and AnyNetXE, a linear parameterization of
block widths is introduced:
𝑢𝑗 = 𝑤0 + 𝑤𝑎 · 𝑗   for 0 ≤ 𝑗 < 𝑑, with 𝑤0 > 0 and 𝑤𝑎 > 0
• To quantize 𝑢𝑗, an additional parameter 𝑤𝑚 is introduced, and 𝑠𝑗 is
computed such that:
𝑢𝑗 = 𝑤0 · 𝑤𝑚^𝑠𝑗
• Then 𝑠𝑗 is simply rounded (denoted ⌊𝑠𝑗⌉) and the quantized per-block
widths 𝑤𝑗 are computed via:
𝑤𝑗 = 𝑤0 · 𝑤𝑚^⌊𝑠𝑗⌉
• Converting the per-block 𝑤𝑗 to the per-stage format gives the stage
widths and depths (a NumPy sketch follows below):
𝑤𝑖 = 𝑤0 · 𝑤𝑚^𝑖,   𝑑𝑖 = Σ𝑗 1[⌊𝑠𝑗⌉ = 𝑖]
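A NumPy sketch of this quantization; rounding widths to multiples of 8 follows the earlier width constraint, and the example parameter values are illustrative, only roughly matching a small RegNetX model:

```python
import numpy as np

def generate_widths(w_0, w_a, w_m, d):
    """Turn the linear rule u_j = w_0 + w_a * j into quantized
    per-stage widths and depths via the multiplier w_m."""
    u = w_0 + w_a * np.arange(d)                  # continuous widths u_j
    s = np.round(np.log(u / w_0) / np.log(w_m))   # solve u_j = w_0 * w_m**s_j, round
    w = w_0 * np.power(w_m, s)                    # quantized per-block widths
    w = (np.round(w / 8) * 8).astype(int)         # keep widths divisible by 8
    ws, ds = np.unique(w, return_counts=True)     # group equal-width blocks into stages
    return list(ws), list(ds)

# Illustrative parameters in the ballpark of a ~200MF RegNetX model:
print(generate_widths(w_0=24, w_a=36, w_m=2.5, d=13))
```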
21. The RegNet Design Space
• The design space of RegNet contains only simple, "regular" models:
𝑑 < 64, 𝑤0, 𝑤𝑎 < 256, 1.5 ≤ 𝑤𝑚 ≤ 3,
with the 𝑏 and 𝑔 ranges the same as in AnyNet.
• Setting 𝑤𝑚 = 2 and 𝑤0 = 𝑤𝑎 gives good performance, but to maintain
model diversity these two constraints are not imposed on the RegNet
design space.
25. Common Design Patterns
• The deeper the model, the better the performance.
• Double the number of channels whenever the spatial activation size
is reduced.
• Skip connections are good.
• Bottlenecks are good.
• Depthwise separable convolution is popular in the low-compute regime.
• The inverted bottleneck is also good.
26. RegNet Trends
• The depth of the best models is stable across regimes, with an optimal
depth of ~20 blocks (60 layers).
• This is in contrast to the common practice of using deeper models for
higher flop regimes.
27. RegNet Trends
• The best models use a bottleneck ratio 𝑏 of 1.0, which effectively
removes the bottleneck.
• The width multiplier 𝑤𝑚 of good models is ~2.5, similar but not
identical to the popular recipe of doubling widths across stages.
29. Complexity Analysis
• While not a common measure of network complexity, activations can
heavily affect runtime on memory-bound hardware accelerators.
• Across the best models, activations increase with the square root of
flops, while parameters increase linearly (a worked example follows below).
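A small worked example of why this happens under width scaling; the accounting below counts multiply-adds and output elements, a common but not unique convention:

```python
def conv_complexity(h, w, c_in, c_out, k=3, stride=1, groups=1):
    """Flops (multiply-adds), parameters, and activations (output
    elements) of a single convolution layer."""
    h_o, w_o = h // stride, w // stride
    params = k * k * c_in * c_out // groups
    flops = params * h_o * w_o
    acts = c_out * h_o * w_o
    return flops, params, acts

# Doubling the channel count quadruples flops and params but only
# doubles activations, hence activations grow ~ sqrt(flops).
print(conv_complexity(56, 56, 64, 64))    # (115605504, 36864, 200704)
print(conv_complexity(56, 56, 128, 128))  # 4x flops/params, 2x activations
```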
30. RegNetX Constrained
• Using these findings, the RegNetX design space is refined into a
constrained version:
𝑏 = 1, 𝑑 ≤ 40, and 𝑤𝑚 ≥ 2;
parameters and activations are limited following the complexity analysis;
and depth is further restricted to 12 ≤ 𝑑 ≤ 28.
31. Alternate Design Choices
• The inverted bottleneck (𝑏 < 1) degrades the EDF slightly, and depthwise
conv performs even worse relative to 𝑏 = 1 and 𝑔 ≥ 1.
• For RegNetX, a fixed resolution of 224×224 is best, even at higher flops.
• The Squeeze-and-Excitation (SE) op yields good gains; RegNetX with SE
is called RegNetY.
36. Comparison to Existing Networks
• The higher-flop models have a large number of blocks in the third
stage and a small number of blocks in the last stage.
• The group width 𝑔 increases with complexity, but depth 𝑑 saturates
for large models.
39. EfficientNet Comparison
At low flops, EfficientNet outperforms RegNetY. At intermediate flops,
RegNetY outperforms EfficientNet, and at higher flops both RegNetX and
RegNetY perform better.
41. Additional Ablations
• Fixed Depth
Surprisingly, fixed-depth networks can match the performance of variable-depth networks
across all flop regimes.
• Fewer Stages
Top RegNet models at high flops have few blocks in the fourth stage, but three-stage
networks perform considerably worse.
• Inverted Bottleneck
In the high-compute regime, 𝑏 < 1 degrades results even further.
42. Additional Ablations
• Swish vs ReLU
Swish outperforms ReLU at low flops, but ReLU is better at high flops.
Interestingly, if 𝑔 is restricted to 1 (depthwise conv), Swish performs much
better than ReLU (see the definition below).
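For reference, Swish is simply x · sigmoid(x), available in PyTorch as nn.SiLU; a one-liner:

```python
import torch

def swish(x: torch.Tensor) -> torch.Tensor:
    """Swish activation: x * sigmoid(x) (a.k.a. SiLU)."""
    return x * torch.sigmoid(x)
```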