7. Deep RL for Pong:
▪ Three possible actions: {UP, STILL, DOWN}.
▪ Policy Gradient algorithm: maintains probabilities for each action.
▪ Probabilities: softmax from raw pixels with a NN. Weights: random init.
▪ At each step, decide the action by sampling the softmax probabilities.
▪ A good final outcome (win) increases the probabilities of ALL the actions
chosen; a loss decreases all of them. Updated with gradient descent (or RMSProp).
Links: https://karpathy.github.io/2016/05/31/rl/
https://gist.github.com/greydanus/5036f784eec2036252e1990da21eda18
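The sampling step and the policy-gradient signal above can be sketched in NumPy. This is a minimal illustration, not Karpathy's full script: the network sizes (80×80 input, 200 hidden units) and the ReLU hidden layer are assumptions for the sketch, and only the output-layer gradient is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

ACTIONS = ["UP", "STILL", "DOWN"]
D = 80 * 80          # flattened, preprocessed frame (assumed size)
H = 200              # hidden units (assumed size)

# Weights: random init, as in the slide
W1 = rng.standard_normal((H, D)) / np.sqrt(D)
W2 = rng.standard_normal((len(ACTIONS), H)) / np.sqrt(H)

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_forward(x):
    """Raw pixels -> softmax probabilities over the three actions."""
    h = np.maximum(0, W1 @ x)          # ReLU hidden layer (assumption)
    return softmax(W2 @ h), h

# One step: sample an action from a (fake) frame
x = rng.random(D)
p, h = policy_forward(x)
a = rng.choice(len(ACTIONS), p=p)      # sample the softmax probabilities

# After the episode ends with reward r (+1 win, -1 loss), the gradient of
# log p(a) scaled by r pushes ALL chosen actions' probabilities up on a win
# and down on a loss.
r = +1.0
dlogits = -p
dlogits[a] += 1.0                      # d log-softmax / d logits
dW2 = r * np.outer(dlogits, h)         # output-layer policy gradient
```

A real training loop would accumulate `dW2` (and the hidden-layer gradients) over many episodes and apply them with RMSProp, as the links above do.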
9. Deep RL for Neural Nets:
▪ Controller: two-layer LSTM with 35 hidden units each.
▪ Child: multi-layer convolutional neural network (CNN).
▪ Possible actions: filters in {24, 36, 48, 64}, filter height in {1, 3, 5, 7}, etc.
▪ As in the Pong example, the actions are decided sequentially by sampling the
softmax probabilities (à la np.random.choice) for each feature and moving
to the next. This determines the CNN child architecture.
▪ Training: 45,000 CIFAR images. Accuracy R: 5,000 validation images.
▪ Policy Gradient algorithm: REINFORCE (but other choices are possible).
▪ Obtained results on CIFAR-10 are state of the art, with a 3.65% error rate.
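The sequential sampling that builds the child architecture can be sketched as below. The search space follows the slide's examples; the `filter_width` choices and the uniform stand-in logits are assumptions (in the real controller, each step's logits come from the LSTM state).

```python
import numpy as np

rng = np.random.default_rng(0)

# Search space per layer, following the slide's examples
# (filter_width is an assumed extra feature for illustration)
SEARCH_SPACE = {
    "num_filters":   [24, 36, 48, 64],
    "filter_height": [1, 3, 5, 7],
    "filter_width":  [1, 3, 5, 7],
}

def sample_layer(space, logits_fn):
    """Sample one layer's hyperparameters, one feature at a time."""
    layer = {}
    for name, choices in space.items():
        logits = logits_fn(name, len(choices))
        p = np.exp(logits - logits.max())
        p /= p.sum()                                   # softmax
        layer[name] = choices[rng.choice(len(choices), p=p)]
    return layer

# Stand-in for the LSTM controller: uniform logits at every step
uniform = lambda name, n: np.zeros(n)

# Sampling 3 layers in sequence determines one child CNN architecture
child = [sample_layer(SEARCH_SPACE, uniform) for _ in range(3)]
```

REINFORCE then treats the child's validation accuracy R as the reward and nudges the controller's probabilities toward architecture choices that scored well, exactly as the win/loss signal did in the Pong example.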
16. InstaDeep’s platform
▪ Sees neural networks as a graph.
▪ Optimizers, too, are a graph.
▪ For example, the graph on the right
describes a 4-layer neural net of
respectively 256, 256, 128 and 10 units,
with a non-linear ReLU function on the
first layer.
▪ The graph on the right describes
an Adam optimizer with parameters
beta1 = 0.5 and beta2 = 0.9 respectively.
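The networks-and-optimizers-as-graphs idea can be sketched with plain Python dictionaries. This is an illustrative representation only, assuming a nodes/edges layout; it is not InstaDeep's actual format, but it encodes the same two graphs the slide describes.

```python
# Neural net as a graph: the 4 layers from the slide, with ReLU on the first
net_graph = {
    "nodes": [
        {"id": "dense1", "type": "Dense", "units": 256, "activation": "relu"},
        {"id": "dense2", "type": "Dense", "units": 256},
        {"id": "dense3", "type": "Dense", "units": 128},
        {"id": "out",    "type": "Dense", "units": 10},
    ],
    "edges": [("dense1", "dense2"), ("dense2", "dense3"), ("dense3", "out")],
}

# The optimizer is a graph too: here a single Adam node with the slide's betas
opt_graph = {
    "nodes": [{"id": "adam", "type": "Adam", "beta1": 0.5, "beta2": 0.9}],
    "edges": [],
}

def layer_sizes(graph):
    """Read the layer widths back out of the graph representation."""
    return [n["units"] for n in graph["nodes"]]
```

Treating both the model and the optimizer as graphs means an architecture-search controller (like the one in item 9) can manipulate either one with the same sampling machinery.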