7. Gradient descent and stochastic gradient descent
- Gradient descent
  - Compute the gradient of L(θ) with respect to θ, denoted g(θ), and update θ using g(θ) as
        θ_{t+1} := θ_t − α_t g(θ_t)
    where α_t > 0 is the learning rate
- Stochastic gradient descent
  - Since exact computation of the gradient is expensive, we instead use an approximate gradient computed on a sampled subset of the data (a mini-batch B)
        g'(θ_t) = (1/|B|) Σ_{i∈B} ∇ℓ(θ_t, x_i, y_i)
[Figure: contour plot of L(θ) over (θ_1, θ_2); each update moves θ by −α g]
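A minimal sketch of the mini-batch update above, assuming numpy arrays for the parameters and data and a user-supplied per-example gradient function grad_loss (a hypothetical helper, not from the slides):

    import numpy as np

    def sgd(theta, X, y, grad_loss, lr=0.1, batch_size=32, steps=1000, seed=0):
        """Mini-batch SGD: theta <- theta - lr * mean_i grad_loss(theta, x_i, y_i)."""
        rng = np.random.default_rng(seed)
        n = len(X)
        for t in range(steps):
            batch = rng.choice(n, size=batch_size, replace=False)  # sample mini-batch B
            # approximate gradient g'(theta) = (1/|B|) sum_{i in B} grad l(theta, x_i, y_i)
            g = np.mean([grad_loss(theta, X[i], y[i]) for i in batch], axis=0)
            theta = theta - lr * g  # theta_{t+1} := theta_t - alpha_t * g(theta_t)
        return theta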
17. Deep and wide NNs also create no bad local minima
[Nguyen+ 2017]
- If the following conditions hold:
  - (1) the activation function σ is analytic on R and strictly monotonically increasing,
  - (2) σ is bounded*,
  - (3) the loss function ℓ(a) is twice differentiable, and ℓ'(a) = 0 if a is a global minimum,
  - (4) the training samples are linearly independent,
  then every critical point at which the weight matrices have full column rank is a global minimum
  - We can satisfy these conditions by using sigmoid, tanh, or softplus for σ and the squared loss for ℓ
  - → Solved for non-linear NNs under these conditions
20. Random labeling experiment [Zhang+ 17]
- Classically, model capacity should be restricted to achieve generalization
  - Cf. Rademacher complexity, VC dimension, uniform stability
- Conduct an experiment on a copy of the training data in which the true labels are replaced by random labels (sketched below)
  → The NN model easily fits even the random labels
- Compare the result with and without regularization techniques
  → No significant difference
- Therefore the NN model has enough capacity to fit random labels, yet it generalizes well even without regularization
  - For random labels the NN memorizes the samples, but for true labels it learns patterns that generalize [Arpit+ 17]
- … WHY?
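A minimal sketch of the random-labeling experiment under simplified assumptions (a small MLP on synthetic data rather than the image classifiers of [Zhang+ 17]); it only illustrates that an over-parameterized network can drive training error close to zero even when the labels carry no information:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    n, d, k = 512, 100, 10
    X = torch.randn(n, d)                      # random inputs
    y = torch.randint(0, k, (n,))              # random labels: no true input-label relation

    model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, k))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(2000):                   # full-batch gradient steps, no explicit regularization
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()

    acc = (model(X).argmax(dim=1) == y).float().mean().item()
    print(f"training loss {loss.item():.4f}, training accuracy {acc:.2f}")  # typically close to 1.0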
21. SGD plays a significant role in generalization
- SGD performs approximate Bayesian inference [Mandt+ 17]
  - Bayesian inference provides samples following θ ~ P(θ|D)
- SGD's noise removes information about the input that is unnecessary for estimating the output [Shwartz-Ziv+ 17]
  - During training, the mutual information between the input and the network decreases, while that between the network and the output is kept
- Sharpness and the norms of the weights also relate to generalization
  - Flat minima are associated with generalization, but sharpness alone depends on the scale of the weights (see the sketch below)
  - If we find a flat minimum with a small weight norm, then it generalizes [Neyshabur+ 17]
[Figure: illustration of a flat minimum vs. a sharp minimum]
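A minimal sketch of why sharpness alone depends on the weight scale, assuming a one-hidden-layer ReLU network: rescaling W1 by c and W2 by 1/c leaves the function (and hence the loss) unchanged, yet it changes how much the output moves under a fixed-size weight perturbation. The perturbation-based sharpness proxy below is an illustrative choice, not the measure used in [Neyshabur+ 17].

    import numpy as np

    rng = np.random.default_rng(0)
    W1, W2 = rng.standard_normal((32, 10)), rng.standard_normal((1, 32))
    X = rng.standard_normal((100, 10))

    def f(W1, W2, X):
        return np.maximum(W1 @ X.T, 0).T @ W2.T       # Relu(W1 x) fed into W2

    def output_change(W1, W2, eps=1e-2):
        """Output change under a small fixed-size perturbation of W1 (a crude sharpness proxy)."""
        base = f(W1, W2, X)
        pert = f(W1 + eps * np.sign(W1), W2, X)
        return np.mean((pert - base) ** 2)

    c = 10.0
    # the rescaled network (c*W1, W2/c) computes exactly the same function ...
    assert np.allclose(f(W1, W2, X), f(c * W1, W2 / c, X))
    # ... but it looks much "flatter" under the same perturbation
    print(output_change(W1, W2), output_change(c * W1, W2 / c))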
27. Why do we consider generative models?
- For more accurate recognition and inference
  - If we know the generative process, we can improve recognition and inference
    - "What I cannot create, I do not understand" (Richard Feynman)
    - "Computer vision is inverse computer graphics" (Geoffrey Hinton)
  - By inverting the generative process, we obtain a recognition process
- For transfer learning
  - By changing covariates, we can transfer the learned model to other environments
- For sampling examples to compute statistics and for validation
29. Representation learning is more powerful than the nearest neighbor method and manifold learning
- We can significantly reduce the number of required training samples by using representation learning [Arora+ 2017]
- Using a distance metric or a neighborhood notion defined on the original space may not work (see the sketch below)
[Figure: "man with glasses" example. Ideally, nearby samples would help determine the label; in reality, samples with the same label can lie in very different places in the original space, and their regions may not even be connected there]
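A minimal sketch of this point on synthetic XOR-like data, where each class occupies two disconnected regions of the original space: 1-NN in the raw space struggles with few labels, while 1-NN in a better representation works well. The 1-D feature map phi below is a hand-chosen stand-in for a learned representation, not the construction of [Arora+ 2017].

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)

    def make_data(n):
        X = rng.uniform(-1, 1, size=(n, 2))
        y = (X[:, 0] * X[:, 1] > 0).astype(int)   # each class covers two disconnected quadrants
        return X, y

    Xtr, ytr = make_data(40)                       # few labeled samples
    Xte, yte = make_data(2000)

    raw = KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr)
    print("1-NN in the original space:", raw.score(Xte, yte))

    phi = lambda X: (X[:, 0] * X[:, 1]).reshape(-1, 1)   # stand-in 1-D representation
    rep = KNeighborsClassifier(n_neighbors=1).fit(phi(Xtr), ytr)
    print("1-NN in the representation:", rep.score(phi(Xte), yte))  # typically much higher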
54. ICA: Independent component analysis
Reference: [Hyvärinen 01]
- Find components z that generate the data x
      x = f(z)
  where f is an unknown function called the mixing function and the components are mutually independent: p(z) = Π_i p(z_i)
- When f is linear and the p(z_i) are non-Gaussian, we can identify f and z correctly (see the sketch below)
- However, when f is nonlinear, we cannot identify f and z in general
  - There are infinitely many possible pairs of f and z
- → When the data is a time series x(1), x(2), …, x(n) generated from sources z that are (1) non-stationary or (2) stationary but temporally dependent, we can identify a non-linear f and z
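A minimal sketch of the identifiable linear case using FastICA from scikit-learn: two non-Gaussian sources are linearly mixed, and ICA recovers them up to permutation, scaling, and sign (the mixing matrix A below is an arbitrary choice for illustration):

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    n = 5000
    # independent non-Gaussian sources z (Laplace and uniform)
    z = np.column_stack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
    A = np.array([[1.0, 0.5], [0.3, 1.0]])        # unknown linear mixing function f(z) = A z
    x = z @ A.T

    ica = FastICA(n_components=2, random_state=0)
    z_hat = ica.fit_transform(x)                  # recovered sources, up to permutation/scale/sign

    # cross-correlation between true and recovered sources: one entry per row/column is near 1
    corr = np.corrcoef(z.T, z_hat.T)[:2, 2:]
    print(np.round(np.abs(corr), 2))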
64. Two-layer NN update rule interpretation
[Okanohara unpublished]
- The update rules of a two-layer feedforward network
      h = Relu(W1 x)
      a = W2 h
  are
      dh  = W2^T da
      dW2 = da h^T
      dW1 = (dh ⊙ Relu'(W1 x)) x^T = (W2^T da ⊙ Relu'(W1 x)) x^T
- These update rules correspond to a memory network that stores the error (da) as a value and the input (x) as a key (see the sketch below)
  - Only the active memories (where Relu'(W1 x) = 1) are updated
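A minimal numpy sketch of these update rules, written to make the key/value reading visible: each row of dW1 is a scaled copy of the input x (the key) and is nonzero only where the unit is active, while dW2 accumulates the error da (the value) against the activations:

    import numpy as np

    def two_layer_grads(W1, W2, x, da):
        """Gradients of a two-layer net h = Relu(W1 x), a = W2 h, given the upstream error da."""
        pre = W1 @ x                      # pre-activations W1 x
        h = np.maximum(pre, 0)            # h = Relu(W1 x)
        dh = W2.T @ da                    # dh  = W2^T da
        dW2 = np.outer(da, h)             # dW2 = da h^T
        mask = (pre > 0).astype(float)    # Relu'(W1 x): which "memory slots" are active
        dW1 = np.outer(dh * mask, x)      # dW1 = (W2^T da ⊙ Relu'(W1 x)) x^T
        return dW1, dW2

    # tiny usage example
    rng = np.random.default_rng(0)
    W1, W2 = rng.standard_normal((5, 3)), rng.standard_normal((2, 5))
    x, da = rng.standard_normal(3), rng.standard_normal(2)
    dW1, dW2 = two_layer_grads(W1, W2, x, da)
    print(dW1.shape, dW2.shape)   # (5, 3) (2, 5): rows of dW1 for inactive units are exactly zero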
65. ResNet is a memory-augmented network
[Okanohara unpublished]
- Since a ResNet layer has the form
      h := h + F(h)
  where the residual branch F consists of two layers, we can interpret it as recalling a memory and adding it to the current vector (a sketch follows below)
  - The squeeze operation corresponds to limiting the number of memory cells
- A ResNet looks up the memory iteratively
  - A large number of steps = a large number of memory lookups
- This interpretation differs from the shortcut view [He+ 15] and from unrolled iterative estimation [Greff+ 16]
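A minimal sketch of this reading, with the two-layer residual branch written explicitly as a key/value memory lookup; the names W_k (keys that address memory cells from the current state) and W_v (stored value vectors) are illustrative choices for this interpretation, not notation from [He+ 15]:

    import numpy as np

    def res_block(h, W_k, W_v):
        """One residual step h <- h + F(h), with the two-layer branch F read as a memory lookup."""
        addr = np.maximum(W_k @ h, 0)     # "addressing": which memory cells respond to the state
        recall = W_v @ addr               # "recall": weighted sum of the stored value vectors
        return h + recall                 # add the recalled memory to the current vector

    rng = np.random.default_rng(0)
    d, m = 16, 64                         # state size, number of memory cells
    h = rng.standard_normal(d)
    blocks = [(rng.standard_normal((m, d)) * 0.1,
               rng.standard_normal((d, m)) * 0.1) for _ in range(8)]

    for W_k, W_v in blocks:               # a deeper network performs more memory lookups
        h = res_block(h, W_k, W_v)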
67. Conclusion
- There are still many unsolved problems in DNNs
  - Why can DNNs learn in general settings?
  - How should real-world information be represented?
- There are still many unsolved problems in AI
  - Disentanglement of information
  - One-shot learning using attention and memory mechanisms
    - Avoiding catastrophic forgetting and interference
  - Stable, data-efficient reinforcement learning
  - How to abstract information
    - Grounding (language), strong noise (e.g. dropout), extracting hidden factors by using (non-)stationarity or commonality among tasks
68. References
- [Choromanska+ 2015] "The Loss Surfaces of Multilayer Networks", A. Choromanska et al., AISTATS 2015
- [Lu+ 2017] "Depth Creates No Bad Local Minima", H. Lu et al., arXiv:1702.08580
- [Nguyen+ 2017] "The Loss Surface of Deep and Wide Neural Networks", Q. Nguyen et al., arXiv:1704.08045
- [Zhang+ 2017] "Understanding Deep Learning Requires Rethinking Generalization", C. Zhang et al., ICLR 2017
- [Arpit+ 2017] "A Closer Look at Memorization in Deep Networks", D. Arpit et al., ICML 2017
- [Mandt+ 2017] "Stochastic Gradient Descent as Approximate Bayesian Inference", S. Mandt et al., arXiv:1704.04289
- [Shwartz-Ziv+ 2017] "Opening the Black Box of Deep Neural Networks via Information", R. Shwartz-Ziv et al., arXiv:1703.00810
70. References (continued)
- [Goodfellow+ 14] "Generative Adversarial Nets", I. Goodfellow et al., NIPS 2014
- [Goodfellow 16] "NIPS 2016 Tutorial: Generative Adversarial Networks", I. Goodfellow, arXiv:1701.00160
- [Oord+ 16a] "Conditional Image Generation with PixelCNN Decoders", A. Oord et al., NIPS 2016
- [Oord+ 16b] "WaveNet: A Generative Model for Raw Audio", A. Oord et al., arXiv:1609.03499
- [Reed+ 17] "Parallel Multiscale Autoregressive Density Estimation", S. Reed et al., arXiv:1703.03664
- [Zhao+ 17] "Energy-based Generative Adversarial Network", J. Zhao et al., arXiv:1609.03126
- [Dai+ 17] "Calibrating Energy-based Generative Adversarial Networks", Z. Dai et al., ICLR 2017