Deep Learning with TensorFlow 2 and Keras: Regression, ConvNets, GANs, RNNs, NLP, and more with TensorFlow 2 and the Keras API

Product type Paperback
Published in Dec 2019
Publisher Packt
ISBN-13 9781838823412
Length 646 pages
Edition 2nd Edition
Authors (3):
Dr. Amita Kapoor
Sujit Pal
Antonio Gulli

Regularization

In this section, we will review a few best practices for improving the training phase. In particular, regularization and batch normalization will be discussed.

Adopting regularization to avoid overfitting

Intuitively, a good machine learning model should achieve a low error rate on training data. Mathematically this is equivalent to minimizing the loss function on the training data given the model:

min: {loss(Training Data | Model)}

However, this might not be enough. A model can become excessively complex in order to capture all the relations inherently expressed by the training data. This increase in complexity might have two negative consequences. First, a complex model might require a significant amount of time to execute. Second, a complex model might achieve very good performance on training data, but perform quite badly on validation data. This is because the model is able to contrive relationships between many parameters in the specific training context, but these relationships do not in fact exist within a more generalized context. Causing a model to lose its ability to generalize in this manner is termed "overfitting." Again, learning is more about generalization than memorization:

Figure 31: Loss function and overfitting

As a rule of thumb, if during training we see that the validation loss starts to increase after an initial decrease, while the training loss keeps falling, then the model is too complex and is overfitting the training data.
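The rule of thumb above can be checked mechanically on a recorded validation-loss history. The following is a minimal sketch, not code from the book: the helper name `first_overfitting_epoch` and its `patience` parameter are hypothetical, and the loss values are made up for illustration.

```python
def first_overfitting_epoch(val_losses, patience=2):
    """Return the index of the best (lowest-loss) epoch once the validation
    loss has failed to improve for `patience` epochs after it, else None."""
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < val_losses[best_epoch]:
            # New minimum: validation loss is still improving.
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: likely overfitting.
            return best_epoch
    return None

# Made-up validation losses: an initial decrease followed by an increase.
history = [0.90, 0.60, 0.45, 0.40, 0.43, 0.47, 0.52]
print(first_overfitting_epoch(history))
```

In Keras, the `tf.keras.callbacks.EarlyStopping` callback (monitoring `val_loss`) applies a similar patience-based check during training.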

In order to solve the overfitting problem, we need a way to capture the complexity of a model, that is, how complex a model can be. What could the solution be? Well, a model is nothing more than a vector of weights. Each weight affects the output, except for those which are zero, or very close to it. Therefore, the complexity of a model can be conveniently represented as the number of non-zero weights. In other words, if we have two models M1 and M2 achieving pretty much the same performance in terms of loss function, then we should choose the simplest model, the one which has the minimum number of non-zero weights.
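To make this notion of complexity concrete, here is a small sketch that counts the non-zero weights of two models and picks the simpler one. The function name, the `tolerance` threshold for "very close to zero", and the weight values are all assumptions made for the example.

```python
def count_nonzero_weights(weights, tolerance=1e-6):
    """Count the weights whose magnitude exceeds `tolerance`."""
    return sum(1 for w in weights if abs(w) > tolerance)

# Two hypothetical models assumed to reach roughly the same loss.
m1 = [0.8, 0.0, -1.2, 0.0, 0.0000001]
m2 = [0.7, 0.3, -1.1, 0.2, -0.4]

complexity_m1 = count_nonzero_weights(m1)
complexity_m2 = count_nonzero_weights(m2)

# Prefer the model with fewer non-zero weights.
simpler = "M1" if complexity_m1 <= complexity_m2 else "M2"
print(complexity_m1, complexity_m2, simpler)
```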

We can use a hyperparameter λ >= 0 for controlling the importance of having a simple model, as in this formula:

min: {loss(Training Data | Model)} + λ * complexity(Model)

There are three different types of regularization used in machine learning:

  • L1 regularization (also known as LASSO): The complexity of the model is expressed as the sum of the absolute values of the weights.
  • L2 regularization (also known as Ridge): The complexity of the model is expressed as the sum of the squares of the weights.
  • Elastic Net regularization: The complexity of the model is captured by a combination of the preceding two techniques.
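The three penalties listed above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a library implementation: the function names, the `l1_ratio` mixing parameter, and the `lam` argument (standing for the hyperparameter that weights the complexity term) are all hypothetical.

```python
def l1_penalty(weights):
    """L1 (LASSO): sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """L2 (Ridge): sum of the squares of the weights."""
    return sum(w * w for w in weights)

def elastic_net_penalty(weights, l1_ratio=0.5):
    """Elastic Net: a weighted combination of the L1 and L2 penalties."""
    return l1_ratio * l1_penalty(weights) + (1.0 - l1_ratio) * l2_penalty(weights)

def regularized_loss(data_loss, weights, lam, penalty=l2_penalty):
    """Training loss plus the complexity term scaled by `lam`."""
    return data_loss + lam * penalty(weights)

weights = [0.5, -2.0, 0.0, 1.0]
print(l1_penalty(weights))
print(l2_penalty(weights))
print(elastic_net_penalty(weights))
print(regularized_loss(1.0, weights, lam=0.01))
```

Setting `lam` to zero recovers the unregularized loss, while larger values push the optimizer toward smaller weights.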

Note that playing with regularization can be a good way to increase the performance of a network, particularly when there is an evident situation of overfitting. This set of experiments is left as an exercise for the interested reader.

Also note that TensorFlow supports L1, L2, and ElasticNet regularization. Adding regularization is easy:

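from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense
# kernel_regularizer penalizes the layer weights; activity_regularizer penalizes the layer output
model.add(Dense(64, input_dim=64, kernel_regularizer=regularizers.l2(0.01),
                activity_regularizer=regularizers.l2(0.01)))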

A complete list of regularizers can be found at https://www.tensorflow.org/api_docs/python/tf/keras/regularizers.

Understanding BatchNormalization

BatchNormalization is another form of regularization and one of the most effective improvements proposed during the last few years. It enables us to accelerate training, in some cases by halving the number of training epochs, while also offering some regularization. Let's see what the intuition is behind it.

During training, weights in early layers naturally change and therefore the inputs of later layers can significantly change. In other words, each layer must continuously re-adjust its weights to the different distribution for every batch. This may slow down the model's training greatly. The key idea is to make layer inputs more similar in distribution, batch after batch and epoch after epoch.

Another issue is that the sigmoid activation function works very well close to zero, but tends to "get stuck" when values get sufficiently far away from zero, because its gradient becomes vanishingly small there. If, occasionally, neuron outputs fluctuate far away from zero, then the neuron becomes practically unable to update its own weights.
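The "getting stuck" effect can be seen by evaluating the derivative of the sigmoid, which is s(x) * (1 - s(x)), at increasing distances from zero. The helper names below are chosen for this sketch only.

```python
import math

def sigmoid(x):
    """Logistic sigmoid; fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_gradient(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at zero and shrinks quickly as |x| grows.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_gradient(x))
```

Since weight updates are proportional to this gradient, a neuron whose input sits far from zero receives almost no learning signal.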

The other key idea is therefore to transform the layer outputs so that they follow an approximately unit Gaussian distribution centered close to zero. In this way, layers will have significantly less variation from batch to batch. Mathematically, the formula is very simple. The activation input x is centered around zero by subtracting the batch mean μ from it. Then, the result is divided by sqrt(σ² + ε), the square root of the sum of the batch variance σ² and a small number ε, to prevent division by zero. Then, we use a linear transformation y = γ * x̂ + β, where x̂ is the normalized activation, so that the network can still scale and shift the normalized values as needed during training.

In this way, γ and β are parameters that get optimized during the training phase in a similar way to the weights of any other layer. BatchNormalization has proven to be a very effective way to increase both the speed of training and accuracy, because it helps to prevent activations from becoming either too small and vanishing or too big and exploding.
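The normalization step can be sketched for a single feature across one batch. This assumes the standard formulation: subtract the batch mean, divide by the square root of the batch variance plus a small constant, then scale and shift with two learnable parameters. The names `gamma`, `beta`, and `eps` are the usual conventions, adopted here as assumptions; the input values are made up.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance of the batch.
    variance = sum((x - mean) ** 2 for x in batch) / n
    # eps keeps the denominator away from zero.
    normalized = [(x - mean) / math.sqrt(variance + eps) for x in batch]
    # gamma and beta would be learned during training; here they are fixed.
    return [gamma * x_hat + beta for x_hat in normalized]

activations = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(activations)
print(out)
```

With `gamma=1.0` and `beta=0.0` the output has mean close to zero and variance close to one. This sketch covers the training-time computation only; the Keras `BatchNormalization` layer additionally tracks moving averages of the mean and variance for use at inference time.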
