1. Deep Learning in Action
Jose María Alvarez | Assoc. Prof. UC3M | josemaria.alvarez@uc3m.es
2. Cátedra RTVE-UC3M
Deep Learning in Action
Agenda
01 Deep Learning overview: summary of architecture and configuration
02 Technology environment: Keras
03 Examples and use cases: solving the examples
6. Overview
Initial questions
What is a Deep Learning system?
What types of problems does it address?
What are layers?
What criteria guide the choice of the number of layers?
How many nodes per layer?
How does a Deep Learning system work in general?
What is an activation function?
Examples?
Selection criteria?
What is a loss function?
Examples?
Selection criteria?
How is the performance of a Deep Learning system measured?
Which metrics?
12. Overview
Basic elements
Connection weights
“Weights on connections in a neural network are coefficients that scale (amplify or minimize) the input signal to a given neuron in the network. In common representations of neural networks, these are the lines/arrows going from one point to”
Activation functions
“The functions that govern the artificial neuron’s behavior are called activation functions. The transmission of that input is known as forward propagation. Activation functions transform the combination of inputs, weights, and biases.”
Biases
“Biases are scalar values added to the input to ensure that at least a few nodes per layer are activated regardless of signal strength. Biases allow learning to happen by giving the network action in the event of low signal. They allow the network to try new interpretations or behaviors. Biases are generally notated b, and, like weights, biases are modified throughout the learning process.”
Loss functions
“Loss functions quantify how close a given neural network is to the ideal toward which it is training. The idea is simple. We calculate a metric based on the error we observe in the network’s predictions”
13. Overview
Some use cases…
• Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)
• Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)
• Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)
• Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)
• Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)
• English-to-French translation (Sutskever et al., Google, NIPS 2014)
• Audio onset detection (Marchi et al., ICASSP 2014)
• Social signal classification (Brueckner & Schuller, ICASSP 2014)
• Arabic handwriting recognition (Bluche et al., DAS 2014)
• TIMIT phoneme recognition (Graves et al., ICASSP 2013)
• Optical character recognition (Breuel et al., ICDAR 2013)
• Image caption generation (Vinyals et al., Google, 2014)
• Video-to-textual description (Donahue et al., 2014)
• Syntactic parsing for natural language processing (Vinyals et al., Google, 2014)
• Photo-real talking heads (Soong and Wang, Microsoft, 2014)
• Automated image sharpening
• Automated image upscaling
• WaveNet: generating human speech that can imitate anyone’s voice
• WaveNet: generating believable classical music
• Speech reconstruction from silent video
• Generating fonts
• Image autofill for missing regions
• Automated image captioning (see also: https://github.com/karpathy/neuraltalk2)
• Turning hand-drawn doodles into stylized artwork
Source: Deep Learning by Adam Gibson and Josh Patterson. Published by O'Reilly Media, Inc., 2017
17. Architecture and configuration
Advances
Layer types: DBNs, CNNs, RBMs, etc.
Hybrid architectures: data and images
Neuron types: RNN, LSTM, GRU, etc.
Technology: AI as a service
20. Architecture and configuration
Cost/loss functions
Mean absolute error loss (L1)
Mean squared log error loss
Mean squared error loss (L2)
Mean absolute percentage error
Hinge loss
Logistic loss
Reference: https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
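The six losses listed on this slide can be sketched in plain Python as follows. This is a minimal illustration rather than the Keras implementation: the function names are hypothetical, hinge loss is assumed to use labels in {-1, +1}, and "logistic loss" is assumed to mean binary cross-entropy over labels in {0, 1}.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error (L1)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error (L2)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def msle(y_true, y_pred):
    """Mean squared log error; log1p(x) = ln(1 + x), so values must be > -1."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; y_true must be non-zero."""
    return 100.0 * sum(abs((t - p) / t)
                       for t, p in zip(y_true, y_pred)) / len(y_true)

def hinge(y_true, scores):
    """Hinge loss; assumes labels in {-1, +1} and real-valued scores."""
    return sum(max(0.0, 1.0 - t * s) for t, s in zip(y_true, scores)) / len(y_true)

def logistic(y_true, probs):
    """Logistic loss as binary cross-entropy; labels in {0, 1}, probs in (0, 1)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, probs)) / len(y_true)
```

The first four are regression losses; hinge and logistic loss are used for classification.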
21. Architecture and configuration
Architecture of a Deep Learning system
• Name: AlexNet
• Authors:
• Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton
• 2012
• Application:
• Specific tasks (computer vision)
• Characteristics:
• Convolutional and pooling layers
• Fully connected layers
• Reference:
• https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
• Code in Keras:
• https://gist.github.com/JBed/c2fb3ce8ed299f197eff
22. Architecture and configuration
Architecture of a Deep Learning system
• Name: VGG Net
• Author:
• Visual Geometry Group (Oxford)
• 2014
• Application:
• Specific tasks (computer vision)
• Characteristics:
• Convolutional and pooling layers (19)
• Slow training
• Reference:
• https://arxiv.org/abs/1409.1556
• Code in Keras:
• https://github.com/keras-team/keras/blob/master/keras/applications/vgg16.py
23. Architecture and configuration
Architecture of a Deep Learning system
• Name: GoogLeNet (Inception Network)
• Author:
• Google
• 2014
• Application:
• Specific tasks (computer vision)
• Characteristics:
• Convolutional and pooling layers (22)
• Non-sequential: performance
• Faster training than VGG
• Reference:
• https://arxiv.org/abs/1512.00567
• Code in Keras:
• https://github.com/keras-team/keras/blob/master/keras/applications/inception_v3.py
24. Architecture and configuration
Architecture of a Deep Learning system
• Name: ResNet
• Residual Networks
• Authors:
• Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (Microsoft Research)
• 2015
• Application:
• General tasks (computer vision)
• Characteristics:
• Batch processing of the input
• Reference:
• https://arxiv.org/abs/1512.03385
• Code in Keras:
• https://github.com/keras-team/keras/blob/master/keras/applications/resnet50.py
25. Architecture and configuration
Architecture of a Deep Learning system
• Name: ResNeXt
• Authors:
• Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He (UC San Diego and Facebook AI Research)
• 2016
• Application:
• Specific tasks (computer vision)
• Characteristics:
• Batch processing of the input
• Reference:
• https://arxiv.org/pdf/1611.05431.pdf
• Code in Keras:
• https://github.com/titu1994/Keras-ResNeXt
27. Architecture and configuration
Other improvements
Source: https://www.datasciencecentral.com/profiles/blogs/24-neural-network-adjustements
29. Technology environment
Available technology
Source: https://towardsdatascience.com/deep-learning-framework-power-scores-2018-23607ddf297a
Another comparison: https://www.kdnuggets.com/2018/03/deep-learning-frameworks.html
30. Technology environment
Deep Learning as a Service
Source: https://www.ibm.com/blogs/research/2018/03/deep-learning-advances/
38. Technology environment
Design principles
01 Usability: an API for humans
02 Extensible: easy to create new functions
03 Multiple backends: TensorFlow, Theano, CNTK
04 Modularity: sequence or graph; orthogonal elements (layers, functions, etc.)
05 Works with Python: reuse of and integration with existing libraries (pandas, scikit-learn, matplotlib, numpy, etc.)
39. Technology environment
API elements
Models
• Sequential
• Functional API
Layers
Preprocessing
Metrics
Activation functions
Loss functions
Optimizers
Callbacks
Utils
Datasets
Visualization
43. Examples and use cases
Working methodology
Methodology
Tasks and references: discovering functionality
Python notebooks: online with Google Colab, offline with Jupyter
Problem definition: understand the problem and the available data.
Step 1: Data management. Prepare the data for training and testing.
Step 2: Architecture and implementation. Configure the network: layers, activation functions, loss, etc.
Step 4: Training. Learning.
Step 5: Test and prediction. Model validation, prediction …
Step 6: Persistence. Save the model and other operations.
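As a rough illustration of this workflow, the sketch below walks through the listed steps for a linear-regression toy problem in plain Python. It is not the course notebook: the target function y = 2x + 1, the learning rate, the epoch count, and the JSON persistence format are all assumptions chosen to keep the example self-contained (in Keras, the equivalent steps would use the library's own model, fit, evaluate, and save facilities).

```python
import json
import random

random.seed(0)

# Step 1: data management. Synthetic samples of y = 2x + 1 (assumed target),
# shuffled and split into training and test sets.
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(50)]
random.shuffle(data)
train, test = data[:40], data[40:]

# Step 2: architecture. A single linear neuron y = w * x + b with MSE loss.
w, b = 0.0, 0.0

def predict(x, w, b):
    return w * x + b

def mse(samples, w, b):
    return sum((predict(x, w, b) - y) ** 2 for x, y in samples) / len(samples)

# Step 4: training. Full-batch gradient descent on the MSE.
learning_rate = 0.05
for epoch in range(2000):
    grad_w = sum(2 * (predict(x, w, b) - y) * x for x, y in train) / len(train)
    grad_b = sum(2 * (predict(x, w, b) - y) for x, y in train) / len(train)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Step 5: test and prediction. Evaluate on data not seen during training.
test_loss = mse(test, w, b)

# Step 6: persistence. Serialize the learned parameters and load them back.
saved = json.dumps({"w": w, "b": b})
restored = json.loads(saved)
```

After training, `w` and `b` should be close to the assumed values 2 and 1, and `test_loss` close to zero.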
44. Examples and use cases
List of examples and use cases
1. Example 1: Approximating a function with linear regression
2. Example 2: The MNIST dataset, character recognition
3. Example 3: The IRIS dataset, flower classification
4. Example 4: Predicting the investment in a car
5. Example 5: Breast cancer prediction
6. Example 6: Spam prediction
7. Example 7: House price prediction
8. Example 8: Introduction to RNNs
9. Example 9: Time series with Keras
10. Example 10: Predicting household energy consumption
11. Example 11: Predictive maintenance: classification
12. Example 12: Predictive maintenance: regression
45. Deep Learning in Action
Relevant links
1. http://d2l.ai/ Dive into Deep Learning (book)
2. https://blog.slavv.com/37-reasons-why-your-neural-network-is-not-working-4020854bd607
3. https://github.com/fchollet/keras-resources
Editor's notes
The behavior of a neural network is shaped by its architecture. A network’s architecture can be defined (in part) by the following:
• Number of neurons
• Number of layers
• Types of connections between layers
The most well-known and simplest-to-understand neural network is the feed-forward multilayer neural network. It has an input layer, one or many hidden layers, and a single output layer. Each layer can have a different number of neurons and each layer is fully connected to the adjacent layer. The connections between the neurons in the layers form an acyclic graph, as illustrated in Figure 2-1.
A feed-forward multilayer neural network can approximate any continuous function to arbitrary accuracy, given enough artificial neuron units. It is generally trained by a learning algorithm called backpropagation. Backpropagation uses gradient descent (see Chapter 1) on the weights of the connections in a neural network to minimize the error on the output of the network.
The perceptron is a linear-model binary classifier with a simple input–output relationship, as depicted in Figure 2-3: we sum n inputs multiplied by their associated weights and then send this “net input” to a step function with a defined threshold. Typically with perceptrons, this is a Heaviside step function with a threshold value of 0.5. This function outputs a single binary value (0 or 1), depending on the input.
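The perceptron just described can be sketched in a few lines of Python. The function names and the AND-gate weights are hypothetical, and treating a net input exactly equal to the threshold as "on" is an assumed convention.

```python
def heaviside(net_input, threshold=0.5):
    """Step function: 1 if the net input reaches the threshold, else 0."""
    return 1 if net_input >= threshold else 0

def perceptron(inputs, weights, threshold=0.5):
    """Weighted sum of the inputs followed by a Heaviside step."""
    net_input = sum(x * w for x, w in zip(inputs, weights))
    return heaviside(net_input, threshold)

# Hypothetical weights that make the perceptron behave like a logical AND:
# only when both inputs are 1 does the net input (0.6) reach the threshold.
and_weights = [0.3, 0.3]
outputs = [perceptron([a, b], and_weights) for a in (0, 1) for b in (0, 1)]
```

Here `outputs` covers the inputs (0, 0), (0, 1), (1, 0), and (1, 1) in that order.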
The artificial neuron of the multilayer perceptron is similar to its predecessor, the perceptron, but it adds flexibility in the type of activation layer we can use. Figure 2-5 shows an updated diagram for the artificial neuron that is based on the perceptron.
The artificial neuron (see Figure 2-6) takes input that, based on the weights on the connections, can be ignored (by a 0.0 weight on an input connection) or passed on to the activation function. The activation function also has the ability to filter out data if it does not provide a non-zero activation value as output.
We express the net input to a neuron as the weights on connections multiplied by activation incoming on connection, as shown in Figure 2-6. For the input layer, we’re just taking the feature at that specific index, and the activation function is linear (it passes on the feature value). For hidden layers, the input is the activation from other neurons. Mathematically, we can express the net input (total weighted input) of the artificial neuron as
$\text{input\_sum}_i = W_i \cdot A_i$
where $W_i$ is the vector of all weights leading into neuron $i$ and $A_i$ is the vector of activation values for the inputs to neuron $i$. Let’s build on this equation by accounting for the bias term $b$ that is added per layer (explained further below):
$\text{input\_sum}_i = W_i \cdot A_i + b$
To produce output from the neuron, we’d then wrap this net input with an activation function $g$, as demonstrated in the following equation:
$a_i = g(\text{input\_sum}_i)$
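A minimal sketch of this computation in Python, assuming a sigmoid for the activation function g (the notes do not fix a particular g here; the function names are illustrative):

```python
import math

def net_input(weights, activations, bias):
    """Dot product of the weight and activation vectors, plus the bias."""
    return sum(w * a for w, a in zip(weights, activations)) + bias

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, activations, bias, g=sigmoid):
    """Wrap the net input with the activation function g."""
    return g(net_input(weights, activations, bias))

# Two incoming connections: 0.5 * 2.0 + (-1.0) * 1.0 + 0.25 = 0.25
example_sum = net_input([0.5, -1.0], [2.0, 1.0], 0.25)
example_out = neuron_output([0.5, -1.0], [2.0, 1.0], 0.25)
```

A weight of 0.0 on a connection removes that input from the sum, which is how the neuron can ignore an input.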
Layers became more varied with the different types of architectures. Deep Belief Networks (DBNs) demonstrated success with using Restricted Boltzmann Machines (RBMs) as layers in pretraining to build features. CNNs used new and different types of activation functions in layers and changed how we connected layers (from fully connected to locally connected patches). Recurrent Neural Networks explored the use of connections that better modeled the time domain in time-series data.
Recurrent Neural Networks specifically created advancements in the types of neurons (or units) applied in the work around LSTM networks. They introduced new units specific to Recurrent Neural Networks such as the LSTM Memory Cell and Gated Recurrent Units (GRUs).
Continuing the theme of matching input data to architecture type, we have seen hybrid architectures emerge for types of data that have both a time domain and image data involved. For instance, classifying objects in video has been successfully demonstrated by combining layers from both CNNs and Recurrent Neural Networks into a single hybrid network. Hybrid neural network architectures can allow us to take advantage of the best of both worlds in some cases.
A linear transform (see Figure 2-11) is basically the identity function, f(x) = Wx, where the dependent variable has a direct, proportional relationship with the independent variable. In practical terms, it means the function passes the signal through unchanged.
We see this activation function used in the input layer of neural networks.
Like all logistic transforms, sigmoids can reduce extreme values or outliers in data without removing them. The vertical line in Figure 2-12 is the decision boundary.
A sigmoid function is a machine that converts independent variables of near infinite range into simple probabilities between 0 and 1, and most of its output will be very close to 0 or 1.
Pronounced “tanch,” tanh is a hyperbolic trigonometric function (see Figure 2-13). Just as the tangent represents a ratio between the opposite and adjacent sides of a right triangle, tanh represents the ratio of the hyperbolic sine to the hyperbolic cosine: tanh(x) = sinh(x) / cosh(x). Unlike the Sigmoid function, the normalized range of tanh is –1 to 1. The advantage of tanh is that it can deal more easily with negative numbers.
Softmax is a generalization of logistic regression inasmuch as it can be applied to continuous data (rather than classifying binary) and can contain multiple decision boundaries. It handles multinomial labeling systems. Softmax is the function you will often find at the output layer of a classifier.
Rectified linear is a more interesting transform that activates a node only if the input is above a certain quantity. While the input is below zero, the output is zero, but when the input rises above that threshold, the output has a linear relationship with the input: f(x) = max(0, x), as demonstrated in Figure 2-14.
Rectified linear units (ReLU) are the current state of the art because they have proven to work in many different situations. Because the gradient of a ReLU is either zero or a constant, it is possible to rein in the vanishing and exploding gradient issues. ReLU activation functions have been shown to train better in practice than sigmoid activation functions.
The softplus activation function is considered to be the “smooth version of the ReLU,” as is illustrated in Figure 2-15. Compare this plot to the ReLU in Figure 2-14.
Figure 2-15 shows that the softplus activation function (f(x) = ln[ 1 + exp(x) ]) has a similar shape to the ReLU. We also notice the differentiability and nonzero derivative of the softplus everywhere on the graph, in contrast to the ReLU.
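The activation functions discussed in these notes can be written out in plain Python as a quick reference. This is a sketch using only the formulas given above; subtracting the maximum inside softmax is an added numerical-stability convention, not something the notes specify.

```python
import math

def linear(x):
    """Identity: passes the signal through unchanged."""
    return x

def sigmoid(x):
    """Maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Ratio of hyperbolic sine to hyperbolic cosine; range (-1, 1)."""
    return math.sinh(x) / math.cosh(x)

def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)

def softplus(x):
    """f(x) = ln(1 + exp(x)), a smooth approximation of the ReLU."""
    return math.log(1.0 + math.exp(x))

def softmax(xs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(xs)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

For large positive inputs softplus approaches the ReLU, while for negative inputs it stays slightly above zero, which is where its nonzero derivative comes from.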
Loss functions quantify how close a given neural network is to the ideal toward which it is training. The idea is simple. We calculate a metric based on the error we observe in the network’s predictions. We then aggregate these errors over the entire dataset and average them and now we have a single number representative of how close the neural network is to its ideal.
Looking for this ideal state is equivalent to finding the parameters (weights and biases) that will minimize the “loss” incurred from the errors. In this way, loss functions help reframe training neural networks as an optimization problem. In most cases, these parameters cannot be solved for analytically, but, more often than not, they can be approximated well with iterative optimization algorithms like gradient descent. The following section provides an overview on commonly seen loss functions, linking them back to their origins in machine learning, as necessary.
REGRESSION LOSS FUNCTION DISCUSSION
These are all valid choices, and there certainly is no single loss function that will outperform all other loss functions for every scenario. The MSE is very widely used and is a safe bet in most cases. So is the MAE. The MSLE and the MAPE are worth taking into consideration if our network is predicting outputs that vary widely in range. Suppose that a network is to predict two output variables: one in the range of [0, 10] and the other in the range of [0, 100]. In this case, the MAE and the MSE will penalize the error in the second output more significantly than the first. The MAPE makes it a relative error and therefore doesn’t discriminate based on the range. The MSLE squishes the range of all the outputs down, much as 10 and 100 translate to 1 and 2 (in log base 10).
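A small worked example makes the range effect concrete. The numbers are assumptions picked for illustration: one output with true value 5 (range [0, 10]) and one with true value 50 (range [0, 100]), each predicted 10% too high.

```python
import math

true_small, pred_small = 5.0, 5.5     # output in the range [0, 10]
true_large, pred_large = 50.0, 55.0   # output in the range [0, 100]

# Absolute and squared errors grow with the scale of the output.
abs_err = (abs(pred_small - true_small), abs(pred_large - true_large))
sq_err = ((pred_small - true_small) ** 2, (pred_large - true_large) ** 2)

# The percentage error is relative, so both outputs are penalized equally.
pct_err = (100.0 * abs(pred_small - true_small) / true_small,
           100.0 * abs(pred_large - true_large) / true_large)

# The squared log error compresses the ranges, so the two penalties
# end up on a similar scale.
log_sq_err = ((math.log1p(pred_small) - math.log1p(true_small)) ** 2,
              (math.log1p(pred_large) - math.log1p(true_large)) ** 2)
```

The squared error of the larger output is 100 times that of the smaller one, while the percentage errors are identical and the squared log errors differ by well under a factor of two.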
COMMON PRACTICE FOR REGRESSION IN NEURAL NETWORKS
Although MSLE and MAPE are approaches to handling large ranges, common practice with neural networks is to normalize inputs to a suitable range and use the MSE or MAE to optimize for either the mean or the median.
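The remark about optimizing for the mean or the median can be checked numerically: the constant prediction that minimizes the MSE over a set of values is their mean, and the one that minimizes the MAE is their median. The sketch below uses a hypothetical sample with one outlier and a simple grid search over candidate constants.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 100.0]             # hypothetical sample with an outlier
candidates = [c / 10.0 for c in range(0, 1001)]  # constants 0.0, 0.1, ..., 100.0

def mse_const(c):
    """MSE when every value is predicted by the same constant c."""
    return sum((v - c) ** 2 for v in values) / len(values)

def mae_const(c):
    """MAE when every value is predicted by the same constant c."""
    return sum(abs(v - c) for v in values) / len(values)

best_mse = min(candidates, key=mse_const)  # lands on the mean
best_mae = min(candidates, key=mae_const)  # lands on the median
```

The outlier pulls the MSE-optimal constant far from most of the data, while the MAE-optimal constant stays at the median, which is one practical reason to choose between the two.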
https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
AlexNet was one of the first deep architectures, introduced by Geoffrey Hinton, one of the pioneers of deep learning, and his colleagues. It is a simple yet powerful network architecture, which helped pave the way for groundbreaking research in Deep Learning as it is now. Here is a representation of the architecture as proposed by the authors.
When broken down, AlexNet seems like a simple architecture with convolutional and pooling layers one on top of the other, followed by fully connected layers at the top. This kind of architecture was conceptualised back in the 1980s. What sets this model apart is the scale at which it performs the task and the use of GPUs for training. In the 1980s, CPUs were used to train neural networks, whereas AlexNet speeds up training by a factor of 10 just by using GPUs.
Although a bit outdated at the moment, AlexNet is still used as a starting point for applying deep neural networks to many tasks, whether in computer vision or speech recognition.