5. Solutions in general
Samples: $x_1, x_2, x_3, x_4, \ldots, x_i, \ldots$, with $x_j \in X$
Responses: $y_1, y_2, \ldots, y_k, \ldots$, with $y_j \in Y$
$F: X \to Y$
Classification (one-hot targets; the subscript is the index of the sample in the dataset):
$y_1 = (1, 0, 0)$ (sample of class “0”)
$y_2 = (0, 0, 1)$ (sample of class “2”)
$y_3 = (0, 1, 0)$ (sample of class “1”)
$y_4 = (0, 1, 0)$ (sample of class “1”)
Regression
𝑦1 = 0.3
𝑦2 = 0.2
𝑦3 = 1.0
𝑦4 = 0.65
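As an illustration (not from the slides), a minimal NumPy sketch of how such targets are commonly encoded: one-hot vectors for classification, plain scalars for regression. The variable names are my own.

```python
import numpy as np

# Classification: class indices 0, 2, 1, 1 encoded as one-hot rows.
class_ids = np.array([0, 2, 1, 1])
num_classes = 3
Y_cls = np.eye(num_classes)[class_ids]
# Y_cls -> [[1,0,0], [0,0,1], [0,1,0], [0,1,0]]

# Regression: targets are plain real numbers.
Y_reg = np.array([0.3, 0.2, 1.0, 0.65])
```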
6. What are Artificial Neural Networks?
Is it biology?
Simulating biological neural networks (synapses, axons,
chains, layers, etc.) is a good abstraction for understanding
the topology.
Biological NNs are only an inspiration and an illustration. Nothing more!
7. What are Artificial Neural Networks?
Let’s imagine a black box!
[Diagram: a black box F with inputs and params going in, and outputs coming out]
General form:
$outputs = F(inputs, params)$
Steps:
1) choose “form” of F
2) find params
8. What are Artificial Neural Networks?
It’s simple math!
Output of the i-th neuron, where the weights $w_{ij}$ and the bias $b_i$ are the free parameters and $f$ is the activation function:
$s_i = \sum_{j=1}^{n} w_{ij} x_j + b_i$
$y_i = f(s_i)$
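A minimal NumPy sketch of this single-neuron computation (the concrete numbers are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# One neuron with n = 3 inputs: weighted sum plus bias, then activation.
x = np.array([0.5, -1.2, 0.7])   # inputs x_1..x_n
w = np.array([0.1, 0.4, -0.3])   # weights w_i1..w_in (free parameters)
b = 0.2                          # bias b_i (free parameter)

s = np.dot(w, x) + b             # s_i = sum_j w_ij * x_j + b_i
y = sigmoid(s)                   # y_i = f(s_i)
```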
9. What are Artificial Neural Networks?
It’s simple math!
activation: $y = f(wx + b) = \mathrm{sigmoid}(wx + b)$
14. What are Artificial Neural Networks?
It’s simple math!
Consider a hidden layer with n inputs and m neurons.
Output of the i-th neuron:
$s_i = \sum_{j=1}^{n} w_{ij} x_j + b_i$, $\quad y_i = f(s_i)$
Output of the k-th layer:
1) $S^{(k)} = W^{(k)} X^{(k)} + B^{(k)} =
\begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \cdots & \cdots & \cdots & \cdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{pmatrix}^{(k)}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}^{(k)} +
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}^{(k)}$
2) $Y^{(k)} = f_k(S^{(k)})$, applied element-wise
Form of F: a Kolmogorov & Arnold superposition of functions.
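The same layer computation in matrix form, as a small NumPy sketch (shapes and random values are chosen only for illustration):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

n, m = 4, 3                       # n inputs, m neurons in the layer
rng = np.random.default_rng(0)

W = rng.normal(size=(m, n))       # W^(k): m x n weight matrix
B = rng.normal(size=(m, 1))       # B^(k): m x 1 bias vector
X = rng.normal(size=(n, 1))       # X^(k): n x 1 layer input

S = W @ X + B                     # 1) S^(k) = W^(k) X^(k) + B^(k)
Y = sigmoid(S)                    # 2) Y^(k) = f_k(S^(k)), element-wise
```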
15. Neural networks. Overview
• Common principles
– Structure
– Learning
• Shallow and Deep NN
• Additional methods
– Conventional
– Voodoo
16. How to find the parameters W and B?
Supervised learning:
Training set (pairs of variables and responses): $(X; Y)_i,\ i = 1..N$
Find: $W^*, B^* = \arg\min_{W,B} L(F(X), Y)$
Cost function (loss, error):
logloss: $L(F(X), Y) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log f_{i,j}$,
where $y_{i,j}$ is “1” if the i-th sample belongs to class j, else “0”, and the outputs are previously scaled: $f_{i,j} = f_{i,j} / \sum_j f_{i,j}$
rmse: $L(F(X), Y) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(F(X_i) - Y_i\right)^2}$
These are just examples. The cost function depends on the problem (classification, regression) and on domain knowledge.
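A small NumPy sketch of these two example cost functions (the scaling of the outputs is done inside logloss here; the sample values are made up):

```python
import numpy as np

def logloss(F, Y, eps=1e-12):
    """Cross-entropy: -1/N * sum_i sum_j y_ij * log(f_ij)."""
    F = F / F.sum(axis=1, keepdims=True)          # scale: f_ij / sum_j f_ij
    return -np.mean(np.sum(Y * np.log(F + eps), axis=1))

def rmse(F, Y):
    """Root mean squared error: sqrt(1/N * sum_i (F(X_i) - Y_i)^2)."""
    return np.sqrt(np.mean((F - Y) ** 2))

Y_cls = np.array([[1, 0, 0], [0, 0, 1]])              # one-hot targets
F_cls = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])  # network outputs
print(logloss(F_cls, Y_cls))
print(rmse(np.array([0.3, 0.9]), np.array([0.3, 1.0])))
```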
17. Training or optimization algorithm
So, we have the model cost $L$ (the error of the prediction),
and we want to update the weights in order to minimize $L$:
$w^* = w + \alpha \Delta w$
According to gradient descent: $\Delta w = -\nabla L$
This is clear for a network with only one layer (we have the
predicted outputs and the targets, so we can evaluate $L$).
But how do we find $\Delta w$ for the hidden layers?
18. Meet “Error Back Propagation”
Find $\Delta w$ for each layer, from the last to the first,
as the influence of the weights on the cost:
$\Delta w_{i,j} = \dfrac{\partial L}{\partial w_{i,j}}$
and:
$\dfrac{\partial L}{\partial w_{i,j}} = \dfrac{\partial L}{\partial f_j} \cdot \dfrac{\partial f_j}{\partial s_j} \cdot \dfrac{\partial s_j}{\partial w_{i,j}}$
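A tiny numeric sketch of this chain rule for a single sigmoid neuron, assuming a squared-error cost (the cost and the concrete values are my own choice for illustration):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

x, w, b, target = 0.5, 0.8, 0.1, 1.0

s = w * x + b                  # s_j = w * x + b
f = sigmoid(s)                 # f_j = sigmoid(s_j)
L = 0.5 * (f - target) ** 2    # example squared-error cost

dL_df = f - target             # dL/df_j
df_ds = f * (1.0 - f)          # df_j/ds_j (derivative of the sigmoid)
ds_dw = x                      # ds_j/dw_ij

dL_dw = dL_df * df_ds * ds_dw  # chain rule: dL/dw_ij
```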
20. Gradient Descent
in real life
Recall gradient descent:
$w^* = w + \alpha \Delta w$
$\alpha$ is a “step” coefficient; in ML terms, the learning rate. Typical: $\alpha = 0.01..0.1$
Recall the cost function:
$L = \frac{1}{N} \sum_{i=1}^{N} \ldots$
It sums over all samples. And what if $N = 10^6$ or more?
GD modification: update $w$ for each sample.
21. Gradient Descent
Stochastic & Minibatch
“Batch” GD ($L$ over the full set): needs a lot of memory.
Stochastic GD ($L$ for each sample): fast, but fluctuates.
Minibatch GD ($L$ for subsets): less memory and fewer fluctuations.
The size of the minibatch depends on the hardware. Typical: minibatch = 32..256
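A hedged sketch of a minibatch GD loop; the model here is a plain linear least-squares stand-in, just so the gradient is concrete. Setting batch_size to N gives batch GD, setting it to 1 gives stochastic GD:

```python
import numpy as np

def grad(W, X_batch, Y_batch):
    """Gradient of MSE for a linear stand-in model Y ~ X @ W."""
    return 2.0 * X_batch.T @ (X_batch @ W - Y_batch) / len(X_batch)

def minibatch_gd(W, X, Y, alpha=0.05, batch_size=64, epochs=50):
    N = len(X)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(N)                     # reshuffle each epoch
        for start in range(0, N, batch_size):
            sel = idx[start:start + batch_size]      # one minibatch
            W = W - alpha * grad(W, X[sel], Y[sel])  # update per minibatch
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
Y = X @ np.arange(5.0)                               # true weights: 0..4
print(minibatch_gd(np.zeros(5), X, Y))               # -> close to [0, 1, 2, 3, 4]
```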
22. Termination criteria
By epoch count:
a maximum number of iterations over the whole data set.
By the value of the gradient:
a gradient equal to 0 means a minimum, but a small gradient => very
slow learning.
When the cost did not change during several epochs:
if the error does not change, the training procedure is not converging any further.
Early stopping:
stop when the “validation” score starts to increase,
even while the “train” score continues to decrease.
Typical: epochs=50…200
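The slide states early stopping as “stop when the validation score starts to increase”; a common patience-based variant of that rule, as a hedged sketch (the patience parameter and the simulated error values are my own):

```python
import numpy as np

def early_stopping_epoch(val_errors, patience=3):
    """Epoch of the best validation error, declared once the error has not
    improved for `patience` consecutive epochs."""
    best, best_epoch = np.inf, 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch            # stop: validation error keeps growing
    return len(val_errors) - 1

# Simulated validation errors: they decrease, then start to increase (overfitting).
val = [0.9, 0.6, 0.45, 0.40, 0.39, 0.41, 0.44, 0.48, 0.53]
print(early_stopping_epoch(val))         # -> 4
```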
23. Neural networks. Overview
• Common principles
– Structure
– Learning
• Shallow and Deep NN
• Additional methods
– Conventional
– Voodoo
24. What about “form” of F?
Network topology
“Shallow” networks have 1-2 hidden layers => not
enough parameters => poor separation abilities.
“Deep” networks are NNs with 2..10 layers.
“Very deep” networks are NNs with >10 layers.
25. Deep learning. Problems
• Big networks => too much separating ability => overfitting
• Vanishing gradient problem during training
• Complex error surface => local minima
• Curse of dimensionality => memory & computations:
$\dim W^{(i)} = m^{(i-1)} \cdot m^{(i)}$, where $m^{(i)}$ is the number of neurons in layer $i$
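To make that last point concrete, a tiny sketch that counts the free parameters of a fully connected network; the layer sizes are made up for illustration:

```python
# dim W^(i) = m^(i-1) * m^(i): one weight matrix between consecutive layers.
layer_sizes = [784, 2048, 2048, 1024, 10]   # illustrative sizes, not from the slides

weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
biases = sum(layer_sizes[1:])
print(weights + biases)   # the parameter count grows fast with width and depth
```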
26. Neural networks. Overview
• Common principles
– Structure
– Learning
• Shallow and Deep NN
• Additional methods
– Conventional
– Voodoo
27. Additional methods
Conventional
• Momentum (damps the oscillations on the error surface):
$\Delta w^{(t)} = -\alpha \nabla L(w^{(t)}) + \beta \Delta w^{(t-1)}$, where $\beta \Delta w^{(t-1)}$ is the momentum term. Typical: $\beta = 0.9$
• LR decay (make smaller steps near the optimum):
$\alpha^{(t)} = k \alpha^{(t-1)},\ 0 < k < 1$. Typical: apply LR decay ($k = 0.1$) every 10..100 epochs
• Weight decay (prevents the weights from growing, and smooths F):
$L^* = L + \lambda \lVert w^{(t)} \rVert$; L1 or L2 regularization is often used. Typical: L2 with $\lambda = 0.0005$
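A hedged sketch that combines the three tricks in one update rule, using the typical constants from this slide; the toy quadratic cost (and hence its gradient $2w$) is my own choice so the loop is runnable:

```python
import numpy as np

def sgd_step(w, grad_L, prev_dw, alpha, beta=0.9, weight_decay=0.0005):
    """dw = -alpha * (grad_L + lambda * w) + beta * dw_prev  (momentum + L2)."""
    dw = -alpha * (grad_L + weight_decay * w) + beta * prev_dw
    return w + dw, dw

w, dw, alpha = np.array([1.0, -2.0]), np.zeros(2), 0.1
for epoch in range(100):
    if epoch > 0 and epoch % 30 == 0:
        alpha *= 0.1                     # LR decay every 30 epochs (k = 0.1)
    w, dw = sgd_step(w, 2.0 * w, dw, alpha)
print(w)                                 # -> close to the optimum at 0
```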
28. Neural networks. Overview
• Common principles
– Structure
– Learning
• Shallow and Deep NN
• Additional methods
– Conventional
– Voodoo