Computing gradients via automatic differentiation
As you already know, optimizing NNs requires computing the gradients of the loss with respect to the NN weights. These gradients are needed by optimization algorithms such as stochastic gradient descent (SGD). In addition, gradients have other uses, such as diagnosing a network to find out why it makes a particular prediction for a given test example. Therefore, in this section, we will cover how to compute the gradients of a computation with respect to its input variables.
Computing the gradients of the loss with respect to trainable variables
PyTorch supports automatic differentiation, which can be thought of as an implementation of the chain rule for computing gradients of nested functions. Note that for the sake of simplicity, we will use the term gradient to refer to both partial derivatives and gradients.
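To make this concrete, the following is a minimal sketch of how autograd applies the chain rule (the tensor values here are purely illustrative). We create trainable tensors with requires_grad=True, build a small computation ending in a scalar loss, and call backward(); PyTorch then traverses the recorded computation graph and stores the resulting gradients in each trainable tensor's .grad attribute:

import torch

# Trainable parameters: PyTorch will track operations on these tensors
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

# Input and target (illustrative values)
x = torch.tensor([1.4])
y = torch.tensor([2.1])

# Forward pass: z = w*x + b, followed by a squared-error loss
z = torch.add(torch.mul(w, x), b)
loss = (y - z).pow(2).sum()

# Backward pass: autograd applies the chain rule through the graph
loss.backward()

print('dL/dw:', w.grad)
print('dL/db:', b.grad)

After backward() returns, w.grad and b.grad hold the partial derivatives of the loss with respect to w and b, which an optimizer such as SGD can then use to update the parameters.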
Partial derivatives and gradients
A partial derivative can be understood as the rate of change...