In the previous section, we learned to perform the following steps on the input data to arrive at the error value during forward propagation (the code file is available as Neural_network_working_details.ipynb on GitHub):
- Initialize weights randomly
- Calculate the hidden layer unit values by multiplying input values with weights
- Perform activation on the hidden layer values
- Connect the hidden layer values to the output layer
- Calculate the squared error loss
A function to calculate the squared error loss values across all data points is as follows:
import numpy as np
def feed_forward(inputs, outputs, weights):
    # Hidden layer values: dot product of inputs and weights, plus the bias
    pre_hidden = np.dot(inputs, weights[0]) + weights[1]
    # Sigmoid activation on top of the hidden layer values
    hidden = 1/(1 + np.exp(-pre_hidden))
    # Output layer values: dot product of hidden values and weights, plus the bias
    pred_out = np.dot(hidden, weights[2]) + weights[3]
    # Squared error between the predicted and actual outputs
    squared_error = np.square(pred_out - outputs)
    return squared_error
In the preceding function, we take the input variable values, weights (randomly initialized if this is the first iteration), and the actual output in the provided dataset as the input to the feed-forward function.
We calculate the hidden layer values by performing the matrix multiplication (dot product) of the input and weights. Additionally, we add the bias values in the hidden layer, as follows:
pre_hidden = np.dot(inputs, weights[0]) + weights[1]
The preceding step is valid when weights[0] is the weight matrix and weights[1] is the bias vector that connect the input layer to the hidden layer.
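For instance, if weights[0] is a (3, 3) weight matrix and weights[1] is a bias vector of length 3 (these shapes are an assumption made purely for illustration; the actual shapes depend on the dataset and architecture), the hidden layer computation for a single input row could look like the following minimal sketch:
import numpy as np

np.random.seed(10)
inputs = np.array([[1.0, 2.0, 3.0]])   # one row with three input features (assumed)
weights = [np.random.randn(3, 3),      # weights[0]: input-to-hidden weight matrix
           np.random.randn(3)]         # weights[1]: hidden layer bias vector
pre_hidden = np.dot(inputs, weights[0]) + weights[1]
print(pre_hidden.shape)                # (1, 3): one value per hidden unit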
Once we calculate the hidden layer values, we perform activation on top of the hidden layer values, as follows:
hidden = 1/(1+np.exp(-pre_hidden))
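For reference, the sigmoid used here can also be written as a standalone helper, in the same style as the activation functions covered later in this section (a minimal sketch):
def sigmoid(x):
    # Squashes any real value into the (0, 1) range
    return 1/(1 + np.exp(-x))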
We now calculate the value at the output layer by multiplying the output of the hidden layer with the weights that connect the hidden layer to the output layer, and then adding the bias term at the output, as follows:
pred_out = np.dot(hidden, weights[2]) + weights[3]
Once the output is calculated, we calculate the squared error loss at each row, as follows:
squared_error = np.square(pred_out - outputs)
In the preceding code, pred_out is the predicted output and outputs is the actual output.
We are then in a position to obtain the loss value as we forward-pass through the network.
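To make this concrete, the following is a minimal sketch of calling feed_forward on a toy dataset; the input values, output values, and weight shapes here are assumptions made purely for illustration:
np.random.seed(10)
inputs = np.array([[1.0, 2.0, 3.0],
                   [2.0, 3.0, 4.0]])   # two rows, three input features (assumed)
outputs = np.array([[0.0], [1.0]])     # actual output values (assumed)
weights = [np.random.randn(3, 3), np.random.randn(3),
           np.random.randn(3, 1), np.random.randn(1)]
print(feed_forward(inputs, outputs, weights))  # squared error for each row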
While we considered the sigmoid activation on top of the hidden layer values in the preceding code, let's examine other activation functions that are commonly used.
Tanh
The tanh activation of a value (the hidden layer unit value) is calculated as follows:
def tanh(x):
    return (np.exp(x) - np.exp(-x))/(np.exp(x) + np.exp(-x))
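As a quick check, the result matches NumPy's built-in np.tanh; the sample values below are chosen only for illustration:
x = np.array([-1.0, 0.0, 1.0])
print(tanh(x))      # approximately [-0.7616, 0., 0.7616]
print(np.tanh(x))   # same values from NumPy's built-in implementation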
ReLU
The Rectified Linear Unit (ReLU) of a value (the hidden layer unit value) is calculated as follows:
def relu(x):
    return np.where(x > 0, x, 0)
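For example, applying it to a small array (values chosen only for illustration) zeroes out the negative entries and leaves the positive ones untouched:
x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))   # approximately [0., 0., 0., 1.5]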
Linear
The linear activation of a value is the value itself.
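In code, this is simply an identity function; a minimal sketch:
def linear(x):
    # Returns the input value unchanged
    return x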
Softmax
Typically, softmax is performed on a vector of values. This is generally done to determine the probability of an input belonging to one of the n possible output classes in a given scenario. Let's say we are trying to classify an image of a digit into one of 10 possible classes (the numbers 0 to 9). In this case, there are 10 output values, where each output value represents the probability of the input image belonging to one of the 10 classes.
The softmax activation is used to provide a probability value for each class in the output and is calculated as follows:
def softmax(x):
    return np.exp(x)/np.sum(np.exp(x))
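For example, applying softmax to a vector of 10 raw output values (one per digit class; the values below are made up purely for illustration) yields 10 probabilities that sum to 1:
logits = np.array([1.0, 2.0, 0.5, 3.0, 0.1, 0.2, 1.5, 0.3, 0.4, 2.5])  # assumed raw outputs
probs = softmax(logits)
print(probs.sum())      # ~1.0: the outputs form a probability distribution
print(probs.argmax())   # 3: the class with the highest raw value gets the highest probability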
Apart from the preceding activation functions, the loss functions that are generally used while building a neural network are as follows.
Mean squared error
The error is the difference between the actual and predicted values of the output. We take the square of the error, as the error can be positive or negative (when the predicted value is greater than the actual value, and vice versa). Squaring ensures that positive and negative errors do not offset each other. We then take the mean of the squared errors so that the loss over two datasets is comparable even when the datasets are not the same size.
The mean squared error between predicted values (p) and actual values (y) is calculated as follows:
def mse(p, y):
    return np.mean(np.square(p - y))
The mean squared error is typically used when trying to predict a value that is continuous in nature.
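A quick sketch of using it on made-up continuous predictions:
p = np.array([2.5, 0.0, 2.1])   # predicted values (assumed)
y = np.array([3.0, -0.5, 2.0])  # actual values (assumed)
print(mse(p, y))                # mean of the squared differences, approximately 0.17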
Mean absolute error
The mean absolute error works in a manner that is very similar to the mean squared error. The mean absolute error ensures that positive and negative errors do not offset each other by taking an average of the absolute difference between the actual and predicted values across all data points.
The mean absolute error between the predicted values (p) and actual values (y) is implemented as follows:
def mae(p, y):
    return np.mean(np.abs(p - y))
Similar to the mean squared error, the mean absolute error is generally employed on continuous variables.
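A quick sketch using the same made-up values as in the mean squared error example above:
p = np.array([2.5, 0.0, 2.1])   # predicted values (assumed)
y = np.array([3.0, -0.5, 2.0])  # actual values (assumed)
print(mae(p, y))                # mean of the absolute differences, approximately 0.37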
Categorical cross-entropy
Cross-entropy is a measure of the difference between two different distributions: actual and predicted. It is applied to categorical output data, unlike the previous two loss functions that we discussed.
Cross-entropy between two distributions is calculated as follows:
-(y * log2(p) + (1 - y) * log2(1 - p)), summed across all data points
Here, y is the actual outcome of the event and p is the predicted outcome of the event.
Categorical cross-entropy between the predicted values (p) and actual values (y) is implemented as follows:
def cat_cross_entropy(p, y):
    return -np.sum(y*np.log2(p) + (1 - y)*np.log2(1 - p))
Note that categorical cross-entropy loss has a high value when the predicted value is far away from the actual value and a low value when the values are close.
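A small sketch illustrates this behavior for a single binary outcome, with made-up predicted probabilities:
y = np.array([1.0])                            # actual outcome (assumed)
print(cat_cross_entropy(np.array([0.9]), y))   # approximately 0.15: prediction close to actual, low loss
print(cat_cross_entropy(np.array([0.1]), y))   # approximately 3.32: prediction far from actual, high loss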