Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Practical Convolutional Neural Networks

You're reading from   Practical Convolutional Neural Networks Implement advanced deep learning models using Python

Arrow left icon
Product type Paperback
Published in Feb 2018
Publisher Packt
ISBN-13 9781788392303
Length 218 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Authors (3):
Arrow left icon
Mohit Sewak Mohit Sewak
Author Profile Icon Mohit Sewak
Mohit Sewak
Md. Rezaul Karim Md. Rezaul Karim
Author Profile Icon Md. Rezaul Karim
Md. Rezaul Karim
Pradeep Pujari Pradeep Pujari
Author Profile Icon Pradeep Pujari
Pradeep Pujari
Arrow right icon
View More author details
Toc

Table of Contents (11) Chapters Close

Preface 1. Deep Neural Networks – Overview 2. Introduction to Convolutional Neural Networks FREE CHAPTER 3. Build Your First CNN and Performance Optimization 4. Popular CNN Model Architectures 5. Transfer Learning 6. Autoencoders for CNN 7. Object Detection and Instance Segmentation with CNN 8. GAN: Generating New Images with CNN 9. Attention Mechanism for CNN and Visual Models 10. Other Books You May Enjoy

Understanding backpropagation 

In this section, we will understand an intuition about backpropagation. This is a way of computing gradients using the chain rule. Understanding this process and its subtleties is critical for you to be able to understand and effectively develop, design, and debug neural networks.

In general, given a function f(x), where x is a vector of inputs, we want to compute the gradient of f at x denoted by ∇(f(x)). This is because in the case of neural networks, the function f is basically a loss function (L) and the input x is the combination of weights and training data. The symbol ∇  is pronounced as nabla:

(xi, yi ) i = 1......N

Why do we take the gradient on weight parameters?

It is given that the training data is usually fixed and the parameters are variables that we have control over. We usually compute the gradient of the parameters so that we can use it for parameter updates. The gradient ∇f is the vector of partial derivatives, that is:

∇f = [ df/dx, df/dy] = [y,x]

In a nutshell, backpropagation will consist of:

  • Doing a feed-forward operation
  • Comparing the output of the model with the desired output
  • Calculating the error
  • Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights
  • Using this to update the weights, and get a better model
  • Continuing this until we have a model that is good

We will be building a neural network that recognizes digits from 0 to 9. This kind of network application is used for sorting postal mail by zip code, recognizing phone numbers and house numbers from images, extracting package quantities from image of the package and so on.

In most cases, backpropagation is implemented in a framework, such as TensorFlow. However, it is not always true that by simply adding an arbitrary number of hidden layers, backpropagation will magically work on the dataset. The fact is if the weight initialization is sloppy, these non linearity functions can saturate and stop learning. That means training loss will be flat and refuse to go down. This is known as the vanishing gradient problem. 

If your weight matrix W is initialized too large, the output of the matrix multiply too could probably have a very large range, which in turn will make all the outputs in the vector z almost binary: either 1 or 0. However, if this is the case, then, z*(1-z), which is the local gradient of the sigmoid non-linearity, will become zero (vanish) in both cases, which will make the gradient for both x and W also zero. The rest of the backward pass will also come out all zero from this point onward on account of the multiplication in the chain rule.

Another nonlinear activation function is ReLU, which thresholds neurons at zero shown as follows. The forward and backward pass for a fully connected layer that uses ReLU would at the core include:

z = np.maximum(0, np.dot(W, x)) #Representing forward pass
dW = np.outer(z > 0, x) #Representing backward pass: local gradient for W

If you observe this for a while, you'll see that should a neuron get clamped to zero in the forward pass (that is, z = 0, it doesn't fire), then its weights will get a zero gradient. This can lead to what is called the dead ReLU problem. This means if a ReLU neuron is unfortunately initialized in such a way that it never fires, or if a neuron's weights ever get knocked off with a large update during training into this regime, in such cases this neuron will remain permanently dead. It is similar to permanent, irrecoverable brain damage. Sometimes, you can even forward the entire training set through a trained network and finally realize that a large fraction (about 40%) of your neurons were zero the entire time.

In calculus, the chain rule is used for computing the derivative of the composition of two or more functions. That is, if we have two functions as f and g, then the chain rule represents the derivative of their composition f ∘ g. The function that maps x to f(g(x))) in terms of the derivatives of f and g and the product of functions is expressed as follows:

There is a more explicit way to represent this in terms of the variable. Let F = f ∘ g, or equivalently, F(x) = f(g(x)) for all x. Then one can also write:

F'(x)=f'(g(x))g'(x).

The chain rule can be written with the help of Leibniz's notation in the following way. If a variable z is dependent on a variable y, which in turn is dependent on a variable x (such that y and z are dependent variables), then z depends on x as well via the intermediate y. The chain rule then states:

z = 1/(1 + np.exp(-np.dot(W, x))) # forward pass
dx = np.dot(W.T, z*(1-z)) # backward pass: local gradient for x
dW = np.outer(z*(1-z), x) # backward pass: local gradient for W

The forward pass on the left in the following figure calculates z as a function f(x,y) using the input variables x and y. The right side of the figures represents the backward pass. Receiving dL/dz, the gradient of the loss function with respect to z, the gradients of x and y on the loss function can be calculated by applying the chain rule, as shown in the following figure:

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image