There are tons of documents out there to explain how neural network works in different angles. So I guess people don’t mind me adding one more from my perspective. Hopefully, someone may get inspired from this post.
In short, the neural network is a kind of system that transforms the input data into its corresponding output (or labels), strictly speaking, for a supervised learning. This system is consisting of three types of layers, i.e. the input layer, the hidden layer and the output layer. Normally, the hidden layer may have multiple layers according to one’s design. Among layers, the data from the upper layer are transformed to the data in the lower layer via a linear combination with an initialized weight matrix. After that, a nonlinear transformation that is generally called the activation function is applied to the data that would be converted to a form for the next round until reaching the output layer. To assure the neural network feedback the correct output, people need to train the neural network by adjusting the weight matrix. They are usually adjusted via minimizing the error between the output from the final layer and the known output from the data used for training.
Without loss of generality, here I consider a simplest neural network which only possesses fully connected layers, meaning that every neuron in each layer connects all neurons of the upper and lower layers. The mathematical logic behind is analogous to other structures of neural network. Firstly, I used a series of graphs to illustrate how the data flows to the final layer and introduce the notations used in the derivation. Then, a general formulation is present subsequently.
A 6-layer neural network
The following four graphs illustrate how the data evolves along the neural network and the notations used in the derivation.
1 Forward flow of the neural network
Consider a neural network with \(k+1\) layers including the input layer and the output layer. The input data is consisting of \(n\) samples with \(L_{1}\) features
The output label is given accordingly
where the label on the top left denotes the index of the sample. Too abstract? Imagine that sample \(^{1}\boldsymbol{x}\) corresponds to model 1 while sample \(^{2}\boldsymbol{x}\) corresponds to model 2. Then the output can be either an array of two vectors indicating the probabilities of model 1 and model 2, for example, \({(1,0),(0,1)}\), or a vector of two elements \({0,1}\) where 0 indicates model 1 and 1 indicates model 2 or whatever you label them. The purpose is to use a huge amount of data sets to train the neural network, more precisely to compute the weight matrix among the adjacent pair of layers, to fit the output layer to the output vectors. Will see it later on.
Now, let’s feed one sample (the first one \(^{1}\boldsymbol{x}\)) to the neural network to see how to compute the output layer. Later, we will see how to train the network with multiple samples. Assume the second layer has \(L_{2}\) neurons
Note that the first layer is the input layer in which we feed one sample with \(L_{1}\) feature to the neural network at first. Then the weight matrix from the first layer to the second layer is defined as
Thus, multiplying the weight matrix Eq.4 with the neurons of the upper layer Eq.3 yields
More generally, a bias term (constant term) is incorporated as follows
which can also be written in the form if we absorb the constant vector \(\boldsymbol{b}_{1}\) into the weight matrix
After the transformation, an activation function is applied to \(\boldsymbol{z}^{(2)}\) elementwisely. For a model classification problem, the sigmoid function and the hyperbolic tangent function are widely used. Here we use the sigmoid function for instance
Till now, the transformation from the first layer (the input layer) to the second layer (the first hidden layer) is done. This process can be generalized as .
where \(\boldsymbol{a}’\) is the vector \(\boldsymbol{a}\) absorbing 1 at the end as what I did in Eq.8. Note that \(\boldsymbol{a}^{(1)}=^{1}\boldsymbol{x}\).
Finally, the neural network will return \(\boldsymbol{a}^{(k+1)}\) with \(L_{k+1}\) elements from the output layer. Given the corresponding output label \(\boldsymbol{y}^{(1)}\), we can compute the error between the feedback \(\boldsymbol{a}^{(k+1)}\) and the output label \(\boldsymbol{y}^{(1)}\). This is normally called the loss of the result to the output. There are several candidate loss functions for model classification like the least square, the cross-entropy function. We take the least square function as the example
So far, we have computed out the loss of the neural network with a bunch of randomly initialized weight matrix. No doubt, the loss would be huge. Our aim is to minimized the loss \(J\) by tuning the weight matrix. How? Remember your advanced calculus in the high school or the university? \(J\) can be envisaged as a function of every entry in the weight matrix that we want to adjust. Thus, the derivative of \(J\) with respect to each entries of the weight matrix would tell us how to tune the weight matrix to minimize the loss. This method is called gradient descend. And the loss can also propagate backwards to determine the gradient of the entries of the weight matrix at each layer. The full process to tune the weight matrix is also called Backward Propagation.
2 Backward propagation
From the \(k+1\)th layer to the \(k\)th layer
Let us start from the final layer (the \(k+1\)th layer) to the previous one (the \(k\)th layer). Note that the variable of our concern is the entry of the weight matrix from the \(k\)th layer to the \(k+1\)th layer, \(\theta_{L_{k+1}\times L_{k}}^{(k)}\). Here I only consider tuning \(\theta\) instead of \(\theta’\), meaning that the bias terms are not tuned. Updating the bias terms is similar and you can practice it afterward. The derivative of the loss function with respect to the matrix is defined as
At the mean time, from the loss function Eq.10, we obtain the derivative according to the chain rule
where
and
where \(\otimes\) is the outer product. Simplifying the gradient Eq.12 yields
If we define
The gradient Eq.17 can be further simplified as
Note that we define the outer product of two column vectors is the element-wise produce of the corresponding elements
Now, as \(\boldsymbol{z}^{(k+1)}\), \(\boldsymbol{a}^{(k+1)}\) and \(^{1}\boldsymbol{y}\) are known, we can update the weight matrix \(\theta^{(k)}\) by
where \(\lambda\) is a constant called the learning rate given in advance.
From the \(k\)th layer to the \(k-1\)th layer
Let’s do the calculation one more time from the \(k\)th layer to the \(k-1\)th layer. At the end of this section, we will get a general formula to update all weight matrix via which we can develop the algorithm for a deep neural network.
Now we consider one more previous weight matrix \(\theta^{(k-1)}\). The derivative of the loss function with respect to that matrix yields
The first two terms on the right hand side are the same as Eq.12. The third term \(\frac{\partial\boldsymbol{z}^{(k+1)}}{\partial\boldsymbol{a}^{(k)}}\) produces the weight matrix from the \(k+1\)th layer to the \(k\)th layer
The last two terms on the right hand side are similar to what we have done above. Finally, we get
Comparing Eq.19 and Eq.23 tells us that we can update the error
for each transition among layers. This is how the error propagates backwards along the neural network and where the name comes from. Note that this expression is a slightly different from the formula in Chapter 9.2 of the machine learning course by Andrew Ng of Standford. Do you see why is that?
3 Training multiple samples
We have derived mathematically how to train one sample on a neural network. How about multiple samples? Easy. Because all the weight matrix of the neural network are shared for all samples, we can update the weight matrix by a fraction of error of each sample. Normally this fraction is \(\frac{1}{\text{sample size}}\) where in our example the sample size is \(n\)
Then, the weight matrix is tuned to minimize the error of samples.
4 Program a neural network
Till now, we have derived a general formula Eq.23 to allow us to update all weight matrix. After updating, the loss function is applied again to examine if the error is sufficiently small. If not, update the weight matrix again till meeting our criterion. Understanding the math behind is a huge step towards the expert level but not the final one. Whether you can equip it via code is essential. Here I attached my code in Python from my perspective as a reference. The user can add any number of hidden layers and deploy any number of neurons there. It is a bit like a minimalistic version of tensorflow.
1 | import numpy as np |
You can also clone it on my Github
The end
My code is obviously not the most efficient one as a lot of loops are used. But in another sense it is easy to read for a starter. As you see in the derivation and in the code, there are a huge amount of independent computing that can be parallelized to speed up. So the future plan is to parallelize the code on GPU which may substantially improve the efficiency of the neural network. It would also be a good practice for you to step into the machine learning field.
Reference
An introduction to the math used in Machine Learning: The Matrix Calculus You Need For Deep Learning .
A concrete derivation of backward propagation for a two-layer neural network in Chinese here.