6. FROM DEEP LEARNING TO NAS
Neural Architecture Search (NAS)?
Design an Individual Network → Design a Network Generator
Automation of Feature Engineering → Automation of Architecture Engineering
10. • Network generator 𝑔
𝑔 : Θ → 𝒩
where Θ is a parameter space and 𝒩 is a family of related networks
• e.g., in a ResNet generator, 𝒩 is the family of ResNets, and 𝜃 ∈ Θ specifies the
number of stages, the number of residual blocks per stage, depth/width/filter
sizes, activation types, etc.
• The mapping is deterministic: a fixed 𝜃 always produces the same network.
• Stochastic network generator 𝑔
𝑔 : Θ × 𝑆 → 𝒩
where Θ is a parameter space, 𝑆 is the seed space of a pseudo-random number
generator, and 𝒩 is a family of related networks
• e.g., NAS: 𝜃 is the weight matrices of the controller LSTM, the output of each
LSTM time step is a probability distribution conditioned on 𝜃, and the seed 𝑠
determines what is sampled from that distribution
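As a minimal sketch of this abstraction (the function name, the dict-based network description, and the simple wiring rule below are mine, chosen only for illustration), a stochastic generator maps (𝜃, 𝑠) to one concrete member of a network family:

```python
import random

def stochastic_generator(theta, seed):
    """Toy g(theta, s): theta fixes the wiring rule, the seed fixes the draws."""
    rng = random.Random(seed)
    n, p = theta["nodes"], theta["edge_prob"]
    # Keep only edges i -> j with i < j so the sampled graph is a DAG.
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < p]
    return {"nodes": n, "edges": edges}

# The same (theta, seed) pair always yields the same network; a new seed
# draws a different member of the family defined by theta.
net_a = stochastic_generator({"nodes": 8, "edge_prob": 0.3}, seed=0)
net_b = stochastic_generator({"nodes": 8, "edge_prob": 0.3}, seed=1)
```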
STOCHASTIC NETWORK GENERATOR
Randomly Wired Neural Networks
11. • Turing's unorganized machines, one of the earliest forms of randomly
connected neural networks
• The infant human cortex exhibits small-world properties
• Random graph modeling has been used as a tool to study the neural
networks of human brains
• Random graph models are an effective tool for modeling and
analyzing real-world graphs, e.g., social networks, the World Wide Web,
and citation networks
MOTIVATION
Randomly Wired Neural Networks
12. 1. Generating general graphs (DAGs)
2. Mapping from a general graph to neural network operations
• Edge operations
- Data flow
• Node operations (a sketch follows after this list)
- Aggregation: the input data is combined via a weighted sum; the weights are
learnable and positive
- Transformation: a ReLU-convolution-BN triplet
- Distribution: the same copy of the transformed data is sent out along every
output edge
3. Attaching Input and Output nodes
4. Stages
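A minimal PyTorch sketch of one node's aggregation and transformation (module and parameter names are mine; the 3×3 separable convolution is assumed from the paper's description, and distribution to successor nodes is left to the surrounding graph module):

```python
import torch
import torch.nn as nn

class RandWireNode(nn.Module):
    """One node of a randomly wired graph (a sketch, not the reference code)."""
    def __init__(self, in_degree: int, channels: int):
        super().__init__()
        # Aggregation: one learnable weight per input edge, kept positive via sigmoid.
        self.edge_weights = nn.Parameter(torch.zeros(in_degree))
        # Transformation: ReLU -> 3x3 separable conv (depthwise + pointwise) -> BN.
        self.transform = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise
            nn.Conv2d(channels, channels, 1),                              # pointwise
            nn.BatchNorm2d(channels),
        )

    def forward(self, inputs):  # inputs: list of tensors, one per in-edge
        w = torch.sigmoid(self.edge_weights)
        x = sum(wi * xi for wi, xi in zip(w, inputs))
        return self.transform(x)
```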
METHODOLOGY
Randomly Wired Neural Networks
13. • Additive aggregation maintains the same number of output channels
as input channels.
• Transformed data can be combined with the data from any other
nodes.
• Fixing the channel count keeps the FLOPs and parameter count
unchanged for each node, regardless of its input and output degrees.
• The overall FLOPs and parameter count of a graph are roughly
proportional to the number of nodes and nearly independent of the
number of edges (a rough arithmetic check follows after this list).
• This enables the comparison of different graphs without
inflating/deflating model complexity. Differences in task
performance are therefore reflective of the properties of the
wiring pattern.
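As a rough check of the claim above (my own arithmetic, assuming a 3×3 depthwise plus 1×1 pointwise convolution per node and ignoring biases and BN parameters), the per-node parameter count barely depends on the node's degree:

```python
def node_params(channels: int, in_degree: int) -> int:
    """Per-node parameters with C channels: 9*C for the 3x3 depthwise conv,
    C*C for the 1x1 pointwise conv, plus one aggregation weight per in-edge."""
    return 9 * channels + channels * channels + in_degree

print(node_params(channels=79, in_degree=1))   # 6953
print(node_params(channels=79, in_degree=10))  # 6962 -- almost unchanged
```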
NICE PROPERTIES OF NODE OPERATION
Randomly Wired Neural Networks
14. • Extra input node
• Sends the same copy of the data flow to every original input node
• Extra output node
• Takes the unweighted average of all original output nodes
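The conversion of a general graph into a DAG and the attachment of the extra nodes can be sketched as follows (helper names are mine, networkx is used only for illustration; orienting edges from lower to higher node index is a simple heuristic consistent with the paper's description):

```python
import networkx as nx

def to_dag_with_io(undirected_graph):
    """Orient edges from lower to higher index, then attach I/O nodes."""
    dag = nx.DiGraph()
    dag.add_nodes_from(undirected_graph.nodes)
    dag.add_edges_from((min(u, v), max(u, v)) for u, v in undirected_graph.edges)

    sources = [n for n in dag.nodes if dag.in_degree(n) == 0]  # original input nodes
    sinks = [n for n in dag.nodes if dag.out_degree(n) == 0]   # original output nodes

    dag.add_node("input")    # sends the same copy of the data to all sources
    dag.add_node("output")   # unweighted average over all sinks
    dag.add_edges_from(("input", s) for s in sources)
    dag.add_edges_from((t, "output") for t in sinks)
    return dag

stage = to_dag_with_io(nx.watts_strogatz_graph(n=32, k=4, p=0.75, seed=0))
```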
ATTACHING INPUT AND OUTPUT NODES
Randomly Wired Neural Networks
(Figure: a random graph with an extra input node and an extra output node attached)
15. • An entire network consists of multiple stages.
• One random graph represents one stage
• For all nodes that are directly connected to the input node, their
transformations are modified to have a stride of 2.
• The channel count in a random graph is doubled (2×) when going from
one stage to the next.
STAGES
Randomly Wired Neural Networks
RandWire Architecture
17. • Erdős–Rényi (ER), 1959.
• ER(N, P)
• Has N nodes.
• Each pair of nodes is connected by an edge independently with probability P.
• The ER generation model has only a single parameter P and is denoted as ER(P).
• Any graph with N nodes has a non-zero probability of being generated by
the ER model.
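A quick way to play with the model (networkx is used here purely for illustration; it is not the authors' generator code):

```python
import networkx as nx

# ER(N, P): each of the N*(N-1)/2 possible edges is included independently
# with probability P.
g = nx.erdos_renyi_graph(n=32, p=0.2, seed=0)
print(g.number_of_edges())   # varies from seed to seed
print(nx.is_connected(g))    # likely True once P > ln(N)/N (ln(32)/32 ≈ 0.11)
```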
GENERATING GENERAL GRAPHS
Randomly Wired Neural Networks
18. • Barabási–Albert (BA), 1999.
• BA(N, M)
• 1 ≤ M < N
    initialize the graph G as M nodes without any edges
    repeat:
        add a new node v_t:
            for a node v in G, connect v and v_t with
            P(v_t and v are connected) ∝ degree(v)
        until v_t has M edges (without duplicate edges)
    until G has N nodes
• The generated graph has exactly M(N − M) edges → a subset of all graphs
with N nodes.
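A corresponding sketch with networkx (for illustration only):

```python
import networkx as nx

# BA(N, M): grow the graph one node at a time; each new node attaches M edges,
# preferring nodes that already have high degree ("preferential attachment").
g = nx.barabasi_albert_graph(n=32, m=5, seed=0)
print(g.number_of_edges())                     # M * (N - M) = 135
print(sorted(d for _, d in g.degree())[-3:])   # a few high-degree "hub" nodes
```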
GENERATING GENERAL GRAPHS
Randomly Wired Neural Networks
19. • Watts–Strogatz (WS), 1998.
• WS(N, K, P)
• The "small world" model: high clustering, small diameter
0. Place the N nodes on a ring
1. Connect each node to its K/2 nearest neighbours on each side
2. Going clockwise, rewire each edge with probability P (to a uniformly
chosen node)
• Has N·K/2 edges → a smaller subset of all N-node graphs
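The same kind of illustrative sketch for WS:

```python
import networkx as nx

# WS(N, K, P): ring lattice with K/2 neighbours on each side, then each edge
# is rewired with probability P; the edge count stays at N*K/2.
g = nx.watts_strogatz_graph(n=32, k=4, p=0.25, seed=0)
print(g.number_of_edges())    # 32 * 4 / 2 = 64
if nx.is_connected(g):
    print(nx.average_clustering(g), nx.diameter(g))  # high clustering, small diameter
```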
GENERATING GENERAL GRAPHS
Randomly Wired Neural Networks
20. • A stochastic network generator 𝑔(𝜃, 𝑠).
• The random graph parameters, P, M, and (K, P) in ER, BA, and WS
respectively, are part of the parameters 𝜃.
• The "optimization" over such a 1- or 2-parameter space is essentially
done by trial and error by human designers, i.e., line/grid search.
• The accuracy variation across different seeds 𝑠 is small, so the authors
perform no random search over seeds and report the mean accuracy of
multiple random network instances.
DESIGN AND OPTIMIZATION
Randomly Wired Neural Networks
22. • ImageNet classification
• A small computation regime – MobileNet & ShuffleNet level
• A regular computation regime – ResNet-50/101 level
• N nodes and C channels determine the network complexity.
• N = 32, C = 79 for the small regime.
• N = 32, C = 109 or 154 for the regular regime.
• Random seeds
• Randomly sample 5 network instances and train them from scratch.
• Report the classification accuracy as "mean ± std" over all 5 network instances.
• Implementation details (a sketch follows below)
• Train for 100 epochs
• Half-period-cosine learning rate decay with an initial learning rate of 0.1
• Weight decay of 5e-5
• Momentum of 0.9
• Label smoothing regularization with a coefficient of 0.1
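The reported recipe maps onto a standard PyTorch training setup roughly as follows (a sketch under my own naming, not the authors' code; `label_smoothing` requires PyTorch ≥ 1.10, and the placeholder model stands in for a RandWire network):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for a RandWire network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-5)
epochs = 100
# Half-period-cosine decay: the LR follows half a cosine from 0.1 down to ~0.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing coeff. 0.1

for epoch in range(epochs):
    # ... one pass over ImageNet using `criterion` and `optimizer` ...
    scheduler.step()
```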
ARCHITECTURE DETAILS
Experiments
23. • All sampled networks train successfully.
• ER, BA, and WS all reach a mean accuracy above 73% under suitable settings.
• The variance in accuracy is small (std: 0.2–0.4%).
• The mean accuracy differs between the random generators.
IMAGENET CLASSIFICATION
Experiments
24. • Node removal
• WS
The mean degradation in accuracy is larger when the output degree of
the removed node is higher:
"hub" nodes in WS that send information to many nodes are influential.
GRAPH DAMAGE
Experiments
25. • Edge removal
• If the input degree of an edge's target node is smaller, removing that
edge tends to change a larger portion of the target node's inputs.
• ER is less sensitive to edge removal, possibly because in ER's definition
the wiring of every edge is independent.
GRAPH DAMAGE
Experiments
26. • Use the same alternative convolution in all nodes
• Adjust the channel factor C to keep the complexity of all alternative
networks comparable
• The Pearson correlation between any two series in the figure is
0.91–0.98
NODE OPERATIONS
Experiments
27. • Small computation regime
COMPARISONS
Experiments
*250 epochs for fair comparisons
28. • Regular computation regime
• Use a regularization method inspired by the edge-removal analysis:
with probability 0.1, randomly remove one edge whose target node has
an input degree > 1 (a sketch follows below).
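A sketch of how such an edge-dropping step could look (my reading of the slide, not the authors' implementation; `edges` is a list of `(src, dst)` pairs):

```python
import random

def drop_one_edge(edges, p=0.1, rng=random):
    """With probability p, drop one random edge whose target node has input
    degree > 1, so that no node ever loses all of its inputs."""
    if rng.random() >= p:
        return edges
    in_degree = {}
    for _, dst in edges:
        in_degree[dst] = in_degree.get(dst, 0) + 1
    candidates = [e for e in edges if in_degree[e[1]] > 1]
    if not candidates:
        return edges
    dropped = rng.choice(candidates)
    return [e for e in edges if e != dropped]
```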
COMPARISONS
Experiments
29. • Larger computation
• Increase the test image size to 320 × 320 without retraining
COMPARISONS
Experiments
30. • Object detection
• The features learned by randomly wired networks can also transfer.
COMPARISONS
Experiments
32. • The mean accuracy of these models is competitive with hand-designed
networks and with networks optimized by NAS (NASNet).
• The authors hope that future work exploring new generator designs
may yield new, powerful network designs.
• Contributions
• Defines the search space well by focusing on the wiring pattern rather than on layer types
• Shows that finding a good search space alone can already produce strong results
• Introduces the concept of a (stochastic) network generator
CONCLUSION
34. • Search space
• Ideas for finding an even better search space
• Search methods
• Prior knowledge: there is no design intent or interpretation behind the sampled wirings
• Research on the properties of well-performing networks would be valuable (a
general problem of AutoML)
DISCUSSION
35. [1] Xie, S., Kirillov, A., Girshick, R., & He, K. (2019). Exploring Randomly
Wired Neural Networks for Image Recognition. arXiv preprint
arXiv:1904.01569.
[2] Elsken, T., Metzen, J. H., & Hutter, F. (2019). Neural Architecture
Search: A Survey. Journal of Machine Learning Research, 20(55), 1-21.
[3] Zoph, B., & Le, Q. V. (2017). Neural architecture search with
reinforcement learning. ICLR 2017.
[4] Zoph, B., Vasudevan, V., Shlens, J., & Le, Q. V. (2018). Learning
transferable architectures for scalable image recognition. In
Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 8697-8710).
[5] Jinwon, L. PR-155: Exploring Randomly Wired Neural Networks for
Image Recognition. https://www.youtube.com/watch?v=qnGm1h365tc
REFERENCES
Deep learning contributed to automating feature engineering.
However, this soon turned into architecture engineering, where the network architecture itself is designed by hand.
Many network architectures have been developed, but doing so is time-consuming and error-prone.
The search space can be restricted, e.g., to DNNs with 7 layers. Incorporating prior knowledge (convolutions work well, 3×3 convolutions work well, BN helps, etc.) can shrink the search space and make the search more efficient, but it can also become a human bias that hinders the discovery of novel architectures. A good example: learning a cell/block and repeating it makes the result transferable to other data while reducing the space and maintaining strong performance.
RL, random sampling, evolutionary methods, etc. can be used as search strategies. The search space is usually exponentially large or unbounded, so it must be searched carefully; there is an exploration-exploitation trade-off.
Usually each candidate simply goes through training and validation, but there have been recent attempts to make this performance estimation step more efficient.
“Connectionist” Approach
Such searches have also found activation functions like Swish and augmentation policies like AutoAugment.
As a result, only one form of convolution and one set of layer sizes end up being used.
Work so far has mostly controlled the search space at the layer level or focused on the search strategy; the authors are interested in the effect of the wiring.
(From here on, NAS refers to Neural Architecture Search with Reinforcement Learning.)
If ReLU came last, the aggregation weights are positive, so positive values would keep being added and the activations would keep growing; placing BN last keeps this under control.
Issue: graphs with a special structure have low probability, so on average ER likely produces similar-looking graphs.
If P > ln(N)/N, the graph is a single connected component (with high probability).
This gives one example of how an underlying prior can be introduced by the graph generator in spite of the randomness.
New nodes are more likely to connect to nodes that already have many connections (preferential attachment).
"Rewiring" is defined as uniformly choosing a random node that is not v and that does not create a duplicate edge.
(Figure caption: we randomly remove one node (top) or one edge (bottom) from a graph after the network is trained, and evaluate the loss (Δ) in accuracy on ImageNet. Red circle: mean; gray bar: median; orange box: interquartile range; blue dot: an individual damaged instance.)
It would also be nice to have an experiment comparing the effect of the regularization, but it is too hard for me to reproduce.