- POSTECH EECE695J, "딥러닝 기초 및 철강공정에의 활용" (Deep Learning Fundamentals and Applications to Steel Processes), Week 5
- Contents: Restricted Boltzmann Machine (RBM), various activation functions, data preprocessing, regularization methods, training of a neural network
- Video: https://youtu.be/v4rGPl-8wdo
1. Neural Networks II
Sang Jun Lee
Ph.D. candidate, POSTECH
Email: lsj4u0208@postech.ac.kr
EECE695J 전자전기공학특론J(딥러닝기초및철강공정에의활용) – LECTURE 5 (2017. 9. 28)
2. 1-page Review
▣ Lecture 4: Neural Network I
- Perceptron → Multilayer perceptron (MLP): input layer, hidden layer(s), output layer
- Backpropagation → Vanishing gradient: parameter gradients are computed as products of local gradients, so in a deep neural network they become ≅ 0
7. Contents: Training of a Neural Network
- Activation functions
- Data preprocessing
- Regularization
- Tips for training a neural network
8. Activation Functions: Sigmoid function
- Saturated neurons "kill" the gradient (gradient ≅ 0 for very small or very large inputs x)
- Sigmoid outputs are always positive
- Local gradient: dσ(x)/dx = (1 − σ(x))·σ(x) ≤ 1
  • As the local gradients of successive layers are multiplied together, the gradient with respect to the parameters shrinks
  • → the input data produce almost no learning effect
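The shrinking gradient can be checked numerically. Below is a minimal NumPy sketch (not part of the lecture slides; the depth and input values are illustrative) that evaluates the sigmoid local gradient and multiplies it across several stacked layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return (1.0 - s) * s  # local gradient, at most 0.25 (reached at x = 0)

# Saturation: large-magnitude inputs give near-zero local gradients
for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x = {x:6.1f}  sigmoid'(x) = {sigmoid_grad(x):.6f}")

# Vanishing gradient: the backpropagated gradient is a product of
# local gradients, one per layer, each <= 0.25
x = 0.5        # illustrative pre-activation value
depth = 10     # illustrative number of stacked sigmoid layers
grad = 1.0
for _ in range(depth):
    grad *= sigmoid_grad(x)
print(f"product of {depth} local gradients: {grad:.2e}")  # ≈ 0 for deep networks
```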
11. Activation Functions: ReLU
- Computationally efficient
- Does not saturate (in the + region)
- Output is always positive
- A dead neuron will never activate (and is never updated)
  (slightly positive initial biases are commonly used)
[Figure: a neuron applying ReLU to a weighted sum of inputs x₀, x₁, …, x_d]
12. Activation Functions: Leaky ReLU
- f(x) = max(αx, x)
- Depending on the sign of x, a local gradient of 1 or α is propagated during backpropagation
[Figure: image-classification accuracy on CIFAR-10 for different activation functions (* VLReLU: Very Leaky ReLU, Mishkin et al. 2015)]
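For reference, a minimal NumPy sketch of ReLU and Leaky ReLU together with the local gradients described above; the value α = 0.01 is an assumed default, not something the slide specifies:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # local gradient: 1 for x > 0, 0 otherwise (the "dead" region)
    return (x > 0).astype(x.dtype)

def leaky_relu(x, alpha=0.01):       # alpha = 0.01 is an assumed default
    return np.maximum(alpha * x, x)  # f(x) = max(alpha*x, x), as on the slide

def leaky_relu_grad(x, alpha=0.01):
    # local gradient: 1 for x > 0, alpha otherwise (never exactly zero)
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x), relu_grad(x))
print(leaky_relu(x), leaky_relu_grad(x))
```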
13. Data Preprocessing: Mean subtraction
- If the data are all positive, the components of the parameter gradient vector all share the same sign (all + or all −)
- Zero-centered data: X̂ = X − μ_X
- Caution: compute μ_X from the training data only, and reuse the same μ_X when preprocessing the validation and test data
14. Data Preprocessing: Normalization
- X̂ = (X − μ_X) / σ_X
- X̂ = 2(X − X_min) / (X_max − X_min) − 1 ∈ [−1, +1]
- Note: for image data, zero-centering is generally the only preprocessing applied
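A minimal sketch of these preprocessing schemes, following the caution from the previous slide that the statistics come from the training data only and are reused for validation/test data (array shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(1000, 16))  # illustrative training data
X_test  = rng.normal(5.0, 2.0, size=(200, 16))   # illustrative test data

# Statistics are computed on the training data only
mu    = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Mean subtraction (zero-centering)
X_train_centered = X_train - mu
X_test_centered  = X_test - mu           # reuse the training-set mean

# Normalization
X_train_norm = (X_train - mu) / sigma
X_test_norm  = (X_test - mu) / sigma     # reuse the training-set statistics

# Min-max scaling to [-1, +1], again with training-set min/max
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
X_train_scaled = 2 * (X_train - x_min) / (x_max - x_min) - 1
X_test_scaled  = 2 * (X_test - x_min) / (x_max - x_min) - 1
```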
15. Weight Initialization: RBM (Restricted Boltzmann Machine)
- A bipartite graph with no connections within a layer
16. Weight Initialization: DBN (Deep Belief Network)
- Unsupervised learning on two adjacent layers at a time as a pre-training step (weight initialization)
22. Weight Initialization: DBN (Deep Belief Network)
- Minimize the KL divergence between the input and the recreated input
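The slides state only the objective (reconstruct the input, layer by layer). As a reference, here is a minimal NumPy sketch of one contrastive-divergence (CD-1) update for a single RBM, the usual approximation used for this layer-wise pre-training; the layer sizes, learning rate, and dummy batch are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, lr = 784, 256, 0.01       # illustrative sizes / learning rate
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
b = np.zeros(n_visible)                        # visible bias
c = np.zeros(n_hidden)                         # hidden bias

def cd1_update(v0):
    """One CD-1 step on a batch of visible vectors v0 (shape: batch x n_visible)."""
    global W, b, c
    # Up: hidden probabilities and a sample, given the data
    h0_prob = sigmoid(v0 @ W + c)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Down: reconstruct the visible layer, then go up once more
    v1_prob = sigmoid(h0 @ W.T + b)
    h1_prob = sigmoid(v1_prob @ W + c)
    # Approximate log-likelihood gradient: positive phase minus negative phase
    batch = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    b += lr * (v0 - v1_prob).mean(axis=0)
    c += lr * (h0_prob - h1_prob).mean(axis=0)

# Usage: pre-train this layer, then stack and repeat on the hidden activations
v0 = (rng.random((32, n_visible)) < 0.5).astype(float)  # dummy binary batch
cd1_update(v0)
```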
27. Weight Initialization: Simple methods
- No need to use a complicated RBM for weight initialization
- Make sure the weights are 'just right' (not too small and not too big)
- Small random numbers (e.g. a Gaussian with zero mean and 10⁻² standard deviation): W ~ N(0, σ²)
- Xavier initialization: W ~ N(0, σ²)/√n
  (X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," International Conference on Artificial Intelligence and Statistics, 2010)
- He's initialization: W ~ N(0, σ²)/√(n/2)
  (K. He, X. Zhang, S. Ren, and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," 2015)
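A minimal NumPy sketch of the three schemes for one fully connected layer, reading the slide's n as the fan-in of the layer; the layer sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 256  # illustrative layer sizes (n = fan-in)

# Small random numbers: zero-mean Gaussian with standard deviation 1e-2
W_small = 0.01 * rng.standard_normal((n_in, n_out))

# Xavier initialization (Glorot & Bengio, 2010): scale by 1/sqrt(n)
W_xavier = rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)

# He's initialization (He et al., 2015), suited to ReLU: scale by sqrt(2/n)
W_he = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

for name, W in [("small", W_small), ("xavier", W_xavier), ("he", W_he)]:
    print(f"{name:6s} std = {W.std():.4f}")
```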
30. Optimization: Stochastic gradient descent (SGD)
- What if the loss changes quickly in one direction and slowly in another?
  → Very slow progress along the shallow dimension, jitter along the steep direction
- Local minima or saddle points → zero gradient
31. Optimization: SGD with momentum
- Build up "velocity" as a running mean of gradients
- ρ gives "friction" (typically ρ = 0.9 or 0.99)
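A minimal sketch of the momentum update; the toy quadratic loss, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

def sgd_momentum(x, grad_fn, lr=1e-2, rho=0.9, steps=100):
    """x: parameter vector; grad_fn(x): gradient of the loss at x (assumed given)."""
    v = np.zeros_like(x)      # "velocity", a running mean of gradients
    for _ in range(steps):
        g = grad_fn(x)
        v = rho * v + g       # rho acts as friction (typically 0.9 or 0.99)
        x = x - lr * v
    return x

# Usage on a toy ill-conditioned quadratic: steep in one direction, shallow in the other
A = np.diag([100.0, 1.0])
print(sgd_momentum(np.array([1.0, 1.0]), grad_fn=lambda x: A @ x))
```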
32. Optimization: AdaGrad
- Adaptive gradient algorithm: a modified stochastic gradient descent with a per-parameter learning rate
- Element-wise scaling of the gradient based on the historical sum of squares in each dimension
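The corresponding AdaGrad sketch; ε, the learning rate, and the toy loss are again illustrative:

```python
import numpy as np

def adagrad(x, grad_fn, lr=1e-1, eps=1e-8, steps=100):
    """x: parameter vector; grad_fn(x): gradient of the loss at x (assumed given)."""
    cache = np.zeros_like(x)                       # historical sum of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        cache += g * g                             # element-wise, one entry per parameter
        x = x - lr * g / (np.sqrt(cache) + eps)    # element-wise scaling of the step
    return x

# Usage on the same toy quadratic: steep and shallow directions now progress more evenly
A = np.diag([100.0, 1.0])
print(adagrad(np.array([1.0, 1.0]), grad_fn=lambda x: A @ x))
```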
36. Regularization: The problem of overfitting
Basic idea:
- Add randomness (during training)
- Marginalize over the noise (at test time)
[Figure: training accuracy vs. test accuracy curves illustrating overfitting]
37. Regularization: Model ensemble
- Train multiple independent models
- Average their results at test time
Reference: http://www.slideshare.net/sasasiapacific/ipb-improving-the-models-predictive-power-with-ensemble-approaches
38. Regularization: Dropout
- During training, randomly set some neurons to zero (hyper-parameter: drop probability)
- Acts as a kind of model ensemble
39. Regularization: Dropout (test time)
Consider a single neuron.
- In a standard neural network: a = w₁x + w₂y
- We want to obtain the expectation over the dropout noise z: f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz
- At test time (no dropout) we have: E[a] = w₁x + w₂y
- During training, applying dropout with a drop probability of 0.5:
  E[a] = ¼(w₁x + w₂y) + ¼(w₁x + 0·y) + ¼(0·x + w₂y) + ¼(0·x + 0·y) = ½(w₁x + w₂y)
- Therefore, at test time, multiply the activations by the keep probability (1 − drop probability, here 0.5)
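A minimal sketch of this convention (mask with the drop probability during training, scale by the keep probability at test time); the layer size and the drop probability are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5   # drop probability (hyper-parameter)

def dropout_train(a, p_drop):
    """Training: randomly zero out each activation with probability p_drop."""
    mask = (rng.random(a.shape) >= p_drop).astype(a.dtype)
    return a * mask

def dropout_test(a, p_drop):
    """Test: no masking; scale by the keep probability so E[a] matches training."""
    return a * (1.0 - p_drop)

a = rng.standard_normal(10)   # illustrative activations of one layer
print(dropout_train(a, p_drop))
print(dropout_test(a, p_drop))
```

In practice many implementations use "inverted dropout", dividing by the keep probability during training instead, so that the test-time forward pass needs no change.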
50. Practical Tips for Training a Neural Network
Weight initialization
- ReLU
- Leaky ReLU
Optimization
- Adam optimizer
- ...
Regularization
- Dropout or batch normalization is generally sufficient
51. Summary
- Activation functions: sigmoid, tanh, ReLU, Leaky ReLU
- Data preprocessing: mean subtraction, normalization
- Regularization: model ensemble, dropout, data augmentation, ...
- Tips for training a neural network: learning rate, transfer learning
52. Preview (Lecture 6): Computer Vision
- Understanding image data
- Convolutional Neural Network (CNN)
- Why CNNs are effective for images