The Reversible Residual Network
1. The Reversible Residual Network:
Backpropagation Without Storing Activations
Aidan N. Gomez, Mengye Ren, Raquel Urtasun, Roger B. Grosse
presentation by Jiaqi Yang
LAMDA Group
2. Idea
Deep residual networks (ResNets) are the state-of-the-art
architecture across multiple computer vision tasks. The key
architectural innovation behind ResNets was the residual block.
Memory consumption is a bottleneck of deep neural networks, as
one needs to store the activations in order to calculate gradients
using backpropagation.
If we can reconstruct the activations from the layer outputs, then
backpropagation can be as memory-efficient as the forward pass.
4. Related Work
Trade memory for computation.
Checkpointing: divide the network into O(√n) blocks, reducing memory to O(√n).
Exploit the idea of checkpointing recursively:
g(n) = k + g(n/(k + 1)) =⇒ g(n) = k · log_{k+1}(n).
k = 1 =⇒ g(n) = log_2(n).
Computational complexity: O(n log n).
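A minimal sketch of the recursion above, assuming g(n) counts the activations
held in memory for an n-layer network and that k checkpoints split the network
into k + 1 equal sub-blocks (the function names are ours, for illustration):

    import math

    def g(n, k):
        # Memory for backprop through n layers: keep k checkpoints,
        # then recurse into one sub-block of n/(k + 1) layers at a time.
        if n <= 1:
            return 0
        return k + g(n / (k + 1), k)

    print(g(2 ** 10, 1), math.log2(2 ** 10))      # 10 10.0  (k = 1)
    print(g(3 ** 6, 2), 2 * math.log(3 ** 6, 3))  # 12 12.0  (k = 2)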
5. ResNet
One of the main difficulties in training very deep networks is the
problem of exploding and vanishing gradients.
Residual block:
y = x + f(x)
The basic and bottleneck residual blocks:
a(x) = ReLU(BN(x))
c_k(x) = Conv_{k×k}(a(x))
Basic(x) = c_3(c_3(x))
Bottleneck(x) = c_1(c_3(c_1(x)))
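A hedged PyTorch sketch of these pre-activation blocks, assuming stride 1 and
equal input/output channels; the class and function names are ours, not the
paper's code:

    import torch
    import torch.nn as nn

    def conv_unit(in_ch, out_ch, k):
        # c_k(x) = Conv_{k×k}(a(x)), with a(x) = ReLU(BN(x))
        return nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
        )

    class BasicBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.f = nn.Sequential(conv_unit(ch, ch, 3), conv_unit(ch, ch, 3))

        def forward(self, x):
            return x + self.f(x)  # y = x + f(x)

    class BottleneckBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            mid = ch // 4  # our choice of bottleneck width
            self.f = nn.Sequential(
                conv_unit(ch, mid, 1),   # c_1
                conv_unit(mid, mid, 3),  # c_3
                conv_unit(mid, ch, 1),   # c_1
            )

        def forward(self, x):
            return x + self.f(x)

    x = torch.randn(2, 64, 8, 8)
    print(BasicBlock(64)(x).shape)       # torch.Size([2, 64, 8, 8])
    print(BottleneckBlock(64)(x).shape)  # torch.Size([2, 64, 8, 8])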
6. Reversible Residual Blocks
Partition the units in each layer into two groups, denoted x1 and x2
(in practice, partition the channels).
Each reversible block takes inputs (x1, x2) and produces outputs
(y1, y2).
y1 = x1 + f(x2)
y2 = x2 + g(y1)
Each layer’s activations can be reconstructed from the next layer’s
activations:
x2 = y2 − g(y1)
x1 = y1 − f(x2)
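A minimal NumPy check of this coupling; f and g here are arbitrary fixed maps
(our choice), since reversibility does not depend on their form:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 4
    Wf, Wg = rng.standard_normal((d, d)), rng.standard_normal((d, d))
    f = lambda x: np.tanh(x @ Wf)
    g = lambda x: np.tanh(x @ Wg)

    x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

    # forward pass of one reversible block
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)

    # inverse: reconstruct the inputs from the outputs alone
    x2_rec = y2 - g(y1)
    x1_rec = y1 - f(x2_rec)

    # True True (exact up to floating-point round-off)
    print(np.allclose(x1, x1_rec), np.allclose(x2, x2_rec))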
8. Extend to RNN
Reversible Recurrent Neural Networks (NIPS 2018).
Trouble: the forget gate.
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ g_t
The forget gate makes it hard to use the same idea directly.
Drop the forget gate?
h_t = h_{t−1} + (1 − z_t) ⊙ g_t
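One way to see why dropping the forget gate helps: with a RevNet-style split
of the hidden state into halves (h1, h2), where each half's gates depend only
on the other half and the input, the additive update becomes invertible. This
is a hedged illustration of the coupling idea, not the paper's exact RevGRU;
all names are ours:

    import numpy as np

    rng = np.random.default_rng(1)
    d = 4  # size of each half of the hidden state
    Wz1, Wg1, Wz2, Wg2 = (rng.standard_normal((2 * d, d)) for _ in range(4))
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    def gates(h_other, x, Wz, Wg):
        inp = np.concatenate([h_other, x])
        return sigmoid(inp @ Wz), np.tanh(inp @ Wg)

    def step(h1, h2, x):
        z1, g1 = gates(h2, x, Wz1, Wg1)
        h1 = h1 + (1 - z1) * g1      # h1_t = h1_{t-1} + (1 - z1) ⊙ g1
        z2, g2 = gates(h1, x, Wz2, Wg2)
        h2 = h2 + (1 - z2) * g2
        return h1, h2

    def inverse_step(h1, h2, x):
        z2, g2 = gates(h1, x, Wz2, Wg2)  # recomputable from the new h1
        h2 = h2 - (1 - z2) * g2
        z1, g1 = gates(h2, x, Wz1, Wg1)  # recomputable from the old h2
        h1 = h1 - (1 - z1) * g1
        return h1, h2

    h1, h2, x = (rng.standard_normal(d) for _ in range(3))
    n1, n2 = step(h1, h2, x)
    r1, r2 = inverse_step(n1, n2, x)
    print(np.allclose(h1, r1), np.allclose(h2, r2))  # True True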
9. Extend to RNN
Simply dropping the forget gate harms performance (they call this the
impossibility of no forgetting), demonstrated with a repeat task.
Deal with fixed-point arithmetic explicitly (still need to tolerate some
information loss) =⇒ Gradient-based Hyperparameter Optimization through
Reversible Learning (ICML 2015).
Attention mechanism: store only a fraction of the hidden state to attend over.
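A toy sketch of the fixed-point idea, assuming the forget gate is a rational
z = p/q and the hidden value is stored as a nonnegative integer; saving the
division remainder in a side buffer makes the multiplicative step exactly
invertible. This simplification and all names are ours:

    def forget_forward(h, p, q, buffer):
        # h_t = (p/q) * h_{t-1}, computed in integer (fixed-point) arithmetic
        buffer.append((h * p) % q)  # the bits that would otherwise be lost
        return (h * p) // q

    def forget_inverse(h, p, q, buffer):
        r = buffer.pop()
        return (h * q + r) // p     # exact: h*q + r == h_prev * p

    buf = []
    h0 = 123457
    h1 = forget_forward(h0, 3, 4, buf)          # multiply by z = 3/4
    print(forget_inverse(h1, 3, 4, buf) == h0)  # True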