In the first part of this series, we began with a brief review of essential mathematical concepts to help us build a foundation for understanding backpropagation using matrix derivatives.

In this post we will express our neural network model in matrix form and symbolically derive its backpropagation equations using gradient descent.

A simple neuron model

In the first part, we observed that our neural model is represented as follows:

$\displaystyle \hat{y} = f(w*x+b)$

$\displaystyle L = MSE(y, \hat{y})$

where:

  1. $\displaystyle w$ and $\displaystyle b$ are the unknown parameters which we want to “estimate”. Here, $\displaystyle w$ is referred to as the weight, while $\displaystyle b$ is the bias.
  2. $\displaystyle x$ represents the input to the model.
  3. $\displaystyle f$, is the model activation function.
  4. $\displaystyle \hat{y}$ is the predicted output.
  5. $\displaystyle y$ is the expected (actual) output.
  6. $\displaystyle MSE=\frac{(y-\hat{y})^2}{2}$ is the mean squared error, which measures how far the predicted output is from the actual output.
  7. $\displaystyle L$ is the cost function of our model.

The goal of the model is to minimize the cost function $\displaystyle L$. To achieve this, we must find the values of $\displaystyle w$ and $\displaystyle b$ that minimize the difference between the predicted output and the actual output.

Minimizing the cost function, L.

In our simple single-input (scalar) model, we can easily find the parameters $\displaystyle w$ and $\displaystyle b$ that minimize the cost function by applying differentiation.

$$ \frac{\partial L}{\partial \hat{y}}=-(y-\hat{y})=(\hat{y}-y) $$$$ \frac{\partial \hat{y}}{\partial w}=\frac{\partial f(w*x+b)}{\partial w}=f^{'}(w*x+b)*x $$$$ \frac{\partial \hat{y}}{\partial b}=\frac{\partial f(w*x+b)}{\partial b}=f^{'}(w*x+b) $$$$ \frac{\partial L}{\partial w}={\nabla}_{w}L=(\hat{y}-y)*f^{'}(w*x+b)*x $$$$ \frac{\partial L}{\partial b}={\nabla}_{b}L=(\hat{y} - y)*f^{'}(w*x+b) $$

Applying gradient descent to $\displaystyle w$ and $\displaystyle b$, we would update their values as follows:

$$ w_{new} = w - \eta * {\nabla}_{w}L $$$$ b_{new} = b - \eta * {\nabla}_{b}L $$
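
To make the scalar case concrete, here is a minimal sketch of these two update rules in NumPy. It assumes a sigmoid activation and a single, arbitrary training pair; the names `sigmoid`, `eta` and the data values are illustrative, not part of the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# illustrative data and initial parameters (assumed values)
x, y = 1.5, 0.2      # single input and its expected output
w, b = 0.1, 0.0      # initial weight and bias
eta = 0.5            # learning rate

for step in range(100):
    z = w * x + b
    y_hat = sigmoid(z)                            # y_hat = f(w*x + b)
    grad_w = (y_hat - y) * sigmoid_prime(z) * x   # dL/dw
    grad_b = (y_hat - y) * sigmoid_prime(z)       # dL/db
    w = w - eta * grad_w                          # w_new = w - eta * grad_w
    b = b - eta * grad_b                          # b_new = b - eta * grad_b

print(0.5 * (y - sigmoid(w * x + b)) ** 2)        # the loss decreases towards 0
```

The analytic gradients used inside the loop are exactly the two expressions derived above.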

Matrix Relations and Derivatives

Let's state some important relations and derivatives that we will require; for further details, I refer you to [3]. A few of these relations are checked numerically in a short sketch after this list:

  1. Given the values/variables $\displaystyle x_{1}, x_{2}, x_{3}, ... , x_{n}$, they can be represented as a matrix in column form:
$$ X=\begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \\ {\vdots} \\ x_{n} \end{bmatrix} $$
  2. Given the matrix product $\displaystyle W*X$, its derivative with respect to $\displaystyle W$ and $\displaystyle X$ would be:
$$ \frac{\partial W*X}{\partial X}=W^{T} $$$$ \frac{\partial W*X}{\partial W}=X^{T} $$
  3. Given the matrix product $\displaystyle {\bar 1}^{T}*X$

where

$$ \bar{1}^{T}=\begin{bmatrix} 1 & 1 & {\cdots} & 1 \end{bmatrix} $$

its derivative with respect to $\displaystyle X$ is:

$$ \frac{\partial({\bar 1}^{T}*X)}{\partial X}={\bar 1}^{T} $$
  4. Given the matrix product $\displaystyle {\bar 1}^{T}*diag(X)$, where
$$ diag(X)=\begin{bmatrix} x_{1} & 0 & {\cdots} & 0 \\ 0 & x_{2} & {\cdots} & 0 \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ 0 & 0 & {\cdots} & x_{n} \end{bmatrix} $$

then

$$ {\bar 1}^{T}*diag(X) = X^{T} $$
  5. Given the matrix $\displaystyle diag(X)$, it can also be written as the matrix product:
$$ diag(X) = I*X $$
  6. Given the sum $\displaystyle y = \sum{x_{i}} = x_{1} + x_{2} + {\cdots} + x_{n}$, it can be written in matrix form as follows:
$$ y=\bar{1}^{T}X $$
  7. Given the product $\displaystyle diag(X)*diag(Y)$, then
$$ diag(X)*diag(Y) = I*X{\otimes}Y $$

with the Hadamard product, $\displaystyle \otimes$, taking precedence.

  8. Given the matrix function $\displaystyle Y=H(X)$, it is written in matrix form as:
$$\begin{bmatrix} y_{1} \\ y_{2}\\ {\vdots} \\ y_{n} \end{bmatrix}=\begin{bmatrix} h_{1}(x_{1}) \\ h_{2}(x_{2}) \\ {\vdots} \\ h_{n}(x_{n}) \end{bmatrix}$$

and its derivative is the Jacobian

$$\begin{bmatrix} \frac{\partial y_{1}}{\partial X} \\ \frac{\partial y_{2}}{\partial X}\\ {\vdots} \\ \frac{\partial y_{n}}{\partial X} \end{bmatrix}=\begin{bmatrix} \frac{\partial h_{1}(x_{1})}{\partial x_{1}} & \frac{\partial h_{1}(x_{1})}{\partial x_{2}} & {\cdots} & \frac{\partial h_{1}(x_{1})}{\partial x_{n}} \\ \frac{\partial h_{2}(x_{2})}{\partial x_{1}} & \frac{\partial h_{2}(x_{2})}{\partial x_{2}} & {\cdots} & \frac{\partial h_{2}(x_{2})}{\partial x_{n}} \\ {\vdots} & {\vdots} & {\cdots} & {\vdots} \\ \frac{\partial h_{n}(x_{n})}{\partial x_{1}} & \frac{\partial h_{n}(x_{n})}{\partial x_{2}} & ... & \frac{\partial h_{n}(x_{n})}{\partial x_{n}} \end{bmatrix}$$

This is the Jacobian in numerator-layout form. The Jacobian can also be written as:

$$ \begin{bmatrix} \frac{\partial y_{1}}{\partial X} \\ \frac{\partial y_{2}}{\partial X}\\ {\vdots} \\ \frac{\partial y_{n}}{\partial X} \end{bmatrix} = \begin{bmatrix} \nabla h_{1}(x_{1}) \\ \nabla h_{2}(x_{2}) \\ {\vdots} \\ \nabla h_{n}(x_{n}) \end{bmatrix} $$

or as

$$ \frac{\partial Y}{\partial X}=\frac{\partial H(X)}{\partial X} $$

which we can simplify to

$$ \nabla_{X} Y=\nabla_{X} H(X) $$
  9. The function $\displaystyle y=h(x_{i})$, with the component-wise representation
$$ y=\begin{bmatrix} h(x_{1}) \\ h(x_{2}) \\ \vdots \\ h(x_{n}) \end{bmatrix} = h( \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}) = h(X) $$

has the Jacobian

$$ \nabla y = \begin{bmatrix} \nabla h(x_{1}) \\ \nabla h(x_{2}) \\ \vdots \\ \nabla h(x_{n}) \end{bmatrix} = \begin{bmatrix} \frac{\partial h(x_{1})}{\partial x_{1}} & 0 & \cdots & 0 \\ 0 & \frac{\partial h(x_{2})}{\partial x_{2}} & \cdots & 0 \\ \vdots & \vdots & \cdots & \vdots \\ 0 & 0 & \cdots & \frac{\partial h(x_{n})}{\partial x_{n}} \end{bmatrix} = diag( \frac{\partial h(x_{i})}{\partial x_{i}} ) $$
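
Several of these relations are easy to sanity-check numerically. The sketch below (NumPy) checks relations 4, 7 and 9 for an arbitrary vector, using the element-wise function $h(x)=x^{2}$ purely as an illustrative example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
ones = np.ones_like(x)

# Relation 4: 1^T * diag(X) = X^T
print(np.allclose(ones @ np.diag(x), x))                      # True

# Relation 7: diag(X) * diag(Y) = diag(X hadamard Y)
print(np.allclose(np.diag(x) @ np.diag(y), np.diag(x * y)))   # True

# Relation 9: the Jacobian of an element-wise function is diagonal.
# Illustrative choice: h(x) = x**2, so the Jacobian should be diag(2x).
def h(v):
    return v ** 2

eps = 1e-6
jac = np.zeros((x.size, x.size))
for j in range(x.size):
    e = np.zeros_like(x)
    e[j] = eps
    jac[:, j] = (h(x + e) - h(x - e)) / (2 * eps)   # central finite difference

print(np.allclose(jac, np.diag(2 * x)))                       # True
```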

1-Layer Network as a Matrix

In practice, our network would probably have more than one input. For example, consider a fully connected model with two inputs and two outputs:

$$\hat{y}_{1} = f(w_{11}*x_{1}+w_{12}*x_{2}+b_{1})$$$$\hat{y}_{2} = f(w_{21}*x_{1}+w_{22}*x_{2}+b_{2})$$$$L = MSE(y, (\hat{y}_{1}+\hat{y}_{2}))$$

Simplifying the model:

$$z_{1}=w_{11}*x_{1}+w_{12}*x_{2}+b_{1}$$$$z_{2}=w_{21}*x_{1}+w_{22}*x_{2}+b_{2}$$$$\hat{y}_{1} = f(z_{1})$$$$\hat{y}_{2} = f(z_{2})$$$$L = MSE(y, (\hat{y}_{1}+\hat{y}_{2}))$$

We can represent our model in matrix form as follows:

$$\begin{bmatrix} z_{1} \\ z_{2} \end{bmatrix}=\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22}\end{bmatrix}\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix}+\begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix}$$$$\begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \end{bmatrix}=\begin{bmatrix} f(z_{1}) \\ f(z_{2}) \end{bmatrix}=f(\begin{bmatrix} z_{1} \\ z_{2} \end{bmatrix})$$$$L=MSE(\begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \end{bmatrix})$$

Simplifying further by letting

$$ Z=\begin{bmatrix} z_{1} \\ z_{2} \end{bmatrix}, W=\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22}\end{bmatrix}, B=\begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix} $$

and

$$ \hat{Y}=\begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \end{bmatrix} $$

finally, our model in matrix form is represented as follows:

$$Z=W*X+B$$$$\hat{Y}=f(Z)$$$$L=MSE(\hat{Y})$$

Since we would like to be able to use different cost functions, we replace $\displaystyle MSE(\hat Y)$ with a general loss function $\displaystyle g(\hat{Y})$; therefore:

$$L=g(\hat{Y})$$
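
Before deriving the gradients, it helps to see that the forward pass of this matrix model is only a few lines of code. The sketch below assumes a sigmoid activation for $f$ and a squared-error loss for $g$, with arbitrary illustrative values for $W$, $B$, $X$ and $Y$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative 2-input, 2-output model
W = np.array([[0.1, 0.2],
              [0.3, 0.4]])          # weight matrix
B = np.array([[0.01], [0.02]])      # bias column vector
X = np.array([[1.0], [2.0]])        # input column vector
Y = np.array([[0.0], [1.0]])        # expected output

Z = W @ X + B                        # Z = W*X + B
Y_hat = sigmoid(Z)                   # Y_hat = f(Z)
L = 0.5 * np.sum((Y - Y_hat) ** 2)   # L = g(Y_hat), here a squared-error loss
print(Z.shape, Y_hat.shape, L)
```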

Gradient Descent of a 1-Layer Network

We have represented our neural model as:

$$Z=W*X+B$$$$\hat{Y}=f(Z)$$$$L=g(\hat{Y})$$

The model derivatives, starting off with its cost function, are:

$$ \nabla_{\hat{Y}} L=diag(g^{'}(\hat{Y}))=I*g^{'}(\hat{Y}) $$$$ \nabla_{Z} \hat{Y}=diag(f^{'}(Z))=I*f^{'}(Z) $$$$ \nabla_{X} Z=W^{T} $$$$ \nabla_{W} Z=X^{T} $$$$ \nabla_{B} Z=I $$

therefore:

$$ \nabla_{W} L = {\nabla_{\hat{Y}} L}*{\nabla_{Z} \hat{Y}}*\nabla_{W} Z=I*g^{'}(\hat{Y}){\otimes}f^{'}(Z)*X^{T}=g^{'}(\hat{Y}){\otimes}f^{'}(Z)*X^{T} $$$$ \nabla_{X} L = {\nabla_{\hat{Y}} L}*{\nabla_{Z} \hat{Y}}*\nabla_{X} Z=I*g^{'}(\hat{Y}){\otimes}f^{'}(Z)*W^{T}=W^{T}*g^{'}(\hat{Y}){\otimes}f^{'}(Z) $$$$ \nabla_{B} L = {\nabla_{\hat{Y}} L}*{\nabla_{Z} \hat{Y}}*\nabla_{B} Z=I*g^{'}(\hat{Y}){\otimes}f^{'}(Z)*I=g^{'}(\hat{Y}){\otimes}f^{'}(Z) $$

and we would update our $\displaystyle W$ and $\displaystyle B$ as follows:

$$ W_{next}=W - {\eta}*\nabla_{W}L=W - {\eta}*g^{'}(\hat{Y}){\otimes}f^{'}(Z)*X^{T} $$$$ B_{next}=B - {\eta}*\nabla_{B}L=B - {\eta}*g^{'}(\hat{Y}){\otimes}f^{'}(Z) $$

where

$$ g^{'}(\hat{Y})=\begin{bmatrix} \frac{\partial g}{\partial \hat{y}_{1}} \\ \frac{\partial g}{\partial \hat{y}_{2}} \\ {\vdots} \\ \frac{\partial g}{\partial \hat{y}_{n}} \end{bmatrix} $$$$ f^{'}(Z)=\begin{bmatrix} \frac{\partial f}{\partial z_{1}} \\ \frac{\partial f}{\partial z_{2}} \\ {\vdots} \\ \frac{\partial f}{\partial z_{n}} \end{bmatrix} $$$$ X^{T}=\begin{bmatrix} x_{1} & x_{2} & {\cdots} & x_{n} \end{bmatrix} $$
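
These expressions translate almost line by line into code. Below is a minimal sketch of one gradient descent step for the 1-layer network, again assuming a sigmoid activation and the squared-error loss $g(\hat{Y})=\frac{1}{2}\lVert Y-\hat{Y}\rVert^{2}$ (so that $g^{'}(\hat{Y})=\hat{Y}-Y$); the values and the name `eta` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# same illustrative model as in the forward-pass sketch
W = np.array([[0.1, 0.2],
              [0.3, 0.4]])
B = np.array([[0.01], [0.02]])
X = np.array([[1.0], [2.0]])
Y = np.array([[0.0], [1.0]])
eta = 0.1

# forward pass
Z = W @ X + B
Y_hat = sigmoid(Z)

# backward pass; with g(Y_hat) = 0.5*||Y - Y_hat||^2 we have g'(Y_hat) = Y_hat - Y
delta = (Y_hat - Y) * sigmoid_prime(Z)   # g'(Y_hat) hadamard f'(Z)
grad_W = delta @ X.T                     # nabla_W L = g'(Y_hat) (x) f'(Z) * X^T
grad_B = delta                           # nabla_B L = g'(Y_hat) (x) f'(Z)
grad_X = W.T @ delta                     # nabla_X L = W^T * g'(Y_hat) (x) f'(Z)

# gradient descent update
W_next = W - eta * grad_W
B_next = B - eta * grad_B
print(grad_W.shape, grad_B.shape, grad_X.shape)
```

Note how the Hadamard product $g^{'}(\hat{Y}){\otimes}f^{'}(Z)$ becomes an ordinary element-wise `*` between NumPy arrays.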

Example:

  1. Suppose our neural network has 2 inputs and 1 output. We can model it with the following functions $$ \hat{y}=f(w_{1}x_{1}+w_{2}x_{2}+b) $$ $$ L=\frac{(y-\hat{y})^2}{2} $$ Our loss function is the mean squared error, while the activation function $\displaystyle f$ is any appropriate function, e.g. sigmoid, tanh, etc.; here we assume that the function is differentiable. Derive its backpropagation weight and bias relations.

Solution:

There are two ways to solve for the relations:

  1. The direct method, where we carry out the differentiation by hand. This is possible because the model is simple.

  2. Using our gradient descent relations obtained above. The key to using this method is being able to represent our model in matrix form.

We shall use both methods; however, the direct method will be used to verify the results we obtain using our gradient descent relations.

Our model in matrix form would be:

$$ \begin{bmatrix} z_{1} \\ z_{2} \end{bmatrix}=\begin{bmatrix} w_{1} & 0 \\ 0 & w_{2} \end{bmatrix}\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix} + \begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix} $$$$ \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \end{bmatrix}=\begin{bmatrix} f(z_{1}) \\ f(z_{2}) \end{bmatrix} $$$$ \begin{bmatrix} l_{1} \\ l_{2} \end{bmatrix} = \begin{bmatrix} g_{1}(\hat{y_{1}}) \\ g_{2}(\hat{y_{2}}) \end{bmatrix} = \begin{bmatrix} \frac{(y-(\hat{y_{1}}+\hat{y_{2}}))^2}{2} \\ \frac{(y-(\hat{y_{1}}+\hat{y_{2}}))^2}{2} \end{bmatrix} $$

Therefore:

$$ \nabla_{W} L = \begin{bmatrix} (\hat{y}_{1} - y) \\ (\hat{y}_{2} - y) \end{bmatrix} {\otimes} \begin{bmatrix} f^{'}(z_{1}) \\ f^{'}(z_{2}) \end{bmatrix} \begin{bmatrix} x_{1} & x_{2} \end{bmatrix} = \begin{bmatrix} (\hat{y}_{1} - y)f^{'}(z_{1})x_{1} & 0 \\ 0 & (\hat{y}_{2} - y)f^{'}(z_{2})x_{2} \end{bmatrix} $$

where we ignore the off-diagonal terms, as they are not required for updating the weights, and

$$ \nabla_{B} L = \begin{bmatrix} (\hat{y}_{1} - y) \\ (\hat{y}_{2} - y) \end{bmatrix} {\otimes} \begin{bmatrix} f^{'}(z_{1}) \\ f^{'}(z_{2}) \end{bmatrix} $$

Finally, our weights would be updated as follows:

$$ \begin{bmatrix} w_{1}^{next} & 0 \\ 0 & w_{2}^{next} \end{bmatrix} = \begin{bmatrix} w_{1} & 0 \\ 0 & w_{2} \end{bmatrix} - \begin{bmatrix} {\eta}(\hat{y}_{1} - y)f^{'}(z_{1})x_{1} & 0 \\ 0 & {\eta}(\hat{y}_{2} - y)f^{'}(z_{2})x_{2} \end{bmatrix} $$

while our biases would be updated as follows:

$$ \begin{bmatrix} b_{1}^{next} \\ b_{2}^{next} \end{bmatrix} = \begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix} - \begin{bmatrix} {\eta}(\hat{y}_{1} - y)f^{'}(z_{1}) \\ {\eta}(\hat{y}_{2} - y)f^{'}(z_{2}) \end{bmatrix} $$

We can then confirm our results by differentiating our relations by hand.
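
One convenient way to do that confirmation is numerically rather than symbolically: compute the direct-method gradients $\frac{\partial L}{\partial w_{i}}=(\hat{y}-y)f^{'}(z)x_{i}$ and $\frac{\partial L}{\partial b}=(\hat{y}-y)f^{'}(z)$ from the original model and compare them with finite-difference estimates. The sketch below assumes a sigmoid activation and arbitrary illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# illustrative values
x1, x2, y = 0.5, -1.0, 0.3
w1, w2, b = 0.2, -0.4, 0.1

def loss(w1, w2, b):
    y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
    return 0.5 * (y - y_hat) ** 2

# analytic gradients from the direct method
z = w1 * x1 + w2 * x2 + b
y_hat = sigmoid(z)
grad_w1 = (y_hat - y) * sigmoid_prime(z) * x1
grad_w2 = (y_hat - y) * sigmoid_prime(z) * x2
grad_b = (y_hat - y) * sigmoid_prime(z)

# finite-difference estimates of the same gradients
eps = 1e-6
fd_w1 = (loss(w1 + eps, w2, b) - loss(w1 - eps, w2, b)) / (2 * eps)
fd_w2 = (loss(w1, w2 + eps, b) - loss(w1, w2 - eps, b)) / (2 * eps)
fd_b = (loss(w1, w2, b + eps) - loss(w1, w2, b - eps)) / (2 * eps)

print(np.allclose([grad_w1, grad_w2, grad_b], [fd_w1, fd_w2, fd_b]))  # True
```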

Gradient Descent of a 2-Layer Network

We saw that the single-layer neural network is composed of three functions:

  1. A linear combination function
$$Z=W*X+B$$
  2. An activation function
$$\hat{Y}=f(Z)$$
  3. A loss function $$L=g(\hat{Y})$$

Together these functions define the operations of a single neuron.

A 2-Layer network is composed of the following layers:

  1. An input layer.

    This layer is composed of a linear combination function and an activation function.

  2. An output layer.

    This layer is composed of a linear combination function, an activation function and a loss function.

The computations in the network can be represented with the following expressions:

$$Z^{[1]}=W^{[1]}*X^{[1]}+B^{[1]}$$$$\hat{Y}^{[1]}=f(Z^{[1]})$$$$X^{[2]}=\hat{Y}^{[1]}$$$$Z^{[2]}=W^{[2]}*X^{[2]}+B^{[2]}$$$$\hat{Y}^{[2]}=f(Z^{[2]})$$$$L=g(\hat{Y}^{[2]})$$

where the superscript $\displaystyle [i]$ indicates the $\displaystyle i$-th layer of the network.
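
As with the single layer, these forward expressions map directly to code. Here is a minimal NumPy sketch of the forward pass, assuming a sigmoid activation on both layers and illustrative layer sizes (2 inputs, 3 hidden units, 2 outputs).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# layer 1 (input layer): 2 -> 3, layer 2 (output layer): 3 -> 2
W1, B1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, B2 = rng.normal(size=(2, 3)), np.zeros((2, 1))
X1 = np.array([[1.0], [2.0]])

Z1 = W1 @ X1 + B1    # Z[1] = W[1]*X[1] + B[1]
Y1 = sigmoid(Z1)     # Y_hat[1] = f(Z[1])
X2 = Y1              # X[2] = Y_hat[1]
Z2 = W2 @ X2 + B2    # Z[2] = W[2]*X[2] + B[2]
Y2 = sigmoid(Z2)     # Y_hat[2] = f(Z[2])
print(Z1.shape, Y1.shape, Z2.shape, Y2.shape)
```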

Starting off with the output layer, the backpropagation expressions are:

$$ \nabla_{W^{[2]}} L = g^{'}(\hat{Y}^{[2]}){\otimes}f^{'}(Z^{[2]})*X^{[2]T} $$$$ \nabla_{X^{[2]}} L = W^{[2]T}*g^{'}(\hat{Y}^{[2]}){\otimes}f^{'}(Z^{[2]}) $$$$ \nabla_{B^{[2]}} L = g^{'}(\hat{Y}^{[2]}){\otimes}f^{'}(Z^{[2]}) $$

while the overall input layer expressions are:

$$ \nabla_{W^{[1]}} L = \nabla_{W^{[1]}} \hat{Y}^{[1]}* \nabla_{X^{[2]}} L $$$$ \nabla_{X^{[1]}} L = \nabla_{X^{[1]}} \hat{Y}^{[1]} * \nabla_{X^{[2]}} L $$$$ \nabla_{B^{[1]}} L = \nabla_{B^{[1]}} \hat{Y}^{[1]} * \nabla_{X^{[2]}} L $$

Since our input layer does not have a loss function $\displaystyle g(X)$, the expressions $\displaystyle \nabla_{W^{[1]}} \hat{Y}^{[1]}$, $\displaystyle \nabla_{X^{[1]}} \hat{Y}^{[1]}$ and $\displaystyle \nabla_{B^{[1]}} \hat{Y}^{[1]}$ simplify to:

$$ \nabla_{W^{[1]}} \hat{Y}^{[1]} = f^{'}(Z^{[1]})*X^{[1]T} $$$$ \nabla_{X^{[1]}} \hat{Y}^{[1]} = W^{[1]T} $$$$ \nabla_{B^{[1]}} \hat{Y}^{[1]} = f^{'}(Z^{[1]}) $$

and the complete input layer backpropagation expressions are:

$$ \nabla_{W^{[1]}} L = [\ f^{'}(Z^{[1]})*X^{[1]T} \ ]* [\ W^{[2]T}*g^{'}(\hat{Y}^{[2]}){\otimes}f^{'}(Z^{[2]}) \ ] $$$$ \nabla_{B^{[1]}} L = [\ W^{[2]T}*g^{'}(\hat{Y}^{[2]}){\otimes}f^{'}(Z^{[2]}) \ ] * [\ f^{'}(Z^{[1]}) \ ] $$

The terms in the expression $\displaystyle \nabla_{B^{[1]}} L$ have been reordered to ensure proper alignment of the matrix dimensions.

Finally, our gradient descent expressions for the two layers are as follows (a complete training-step sketch is given after the list):

  1. output layer:
$$ {W^{[2]}}_{next}=W^{[2]} - {\eta}*\nabla_{W^{[2]}}L $$$$ {B^{[2]}}_{next}=B^{[2]} - {\eta}*\nabla_{B^{[2]}}L $$
  2. input layer:
$$ {W^{[1]}}_{next}=W^{[1]} - {\eta}*\nabla_{W^{[1]}}L $$$$ {B^{[1]}}_{next}=B^{[1]} - {\eta}*\nabla_{B^{[1]}}L $$
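
Putting the forward pass and the gradient expressions together, one full training step of the 2-layer network can be sketched as below. The code uses the dimension-aligned, element-wise form of the expressions above, i.e. $\delta^{[2]}=g^{'}(\hat{Y}^{[2]}){\otimes}f^{'}(Z^{[2]})$ and $\delta^{[1]}=(W^{[2]T}*\delta^{[2]}){\otimes}f^{'}(Z^{[1]})$; the sigmoid activation, squared-error loss, learning rate and layer sizes are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W1, B1 = rng.normal(size=(3, 2)), np.zeros((3, 1))   # input layer: 2 -> 3
W2, B2 = rng.normal(size=(2, 3)), np.zeros((2, 1))   # output layer: 3 -> 2
X1 = np.array([[1.0], [2.0]])
Y = np.array([[0.0], [1.0]])
eta = 0.1

# forward pass
Z1 = W1 @ X1 + B1
Y1 = sigmoid(Z1)
X2 = Y1
Z2 = W2 @ X2 + B2
Y2 = sigmoid(Z2)

# backward pass, output layer (squared-error loss: g'(Y2) = Y2 - Y)
delta2 = (Y2 - Y) * sigmoid_prime(Z2)            # g'(Y_hat[2]) (x) f'(Z[2])
grad_W2 = delta2 @ X2.T                          # nabla_{W[2]} L
grad_B2 = delta2                                 # nabla_{B[2]} L

# backward pass, input layer: propagate nabla_{X[2]} L through layer 1
delta1 = (W2.T @ delta2) * sigmoid_prime(Z1)
grad_W1 = delta1 @ X1.T                          # nabla_{W[1]} L
grad_B1 = delta1                                 # nabla_{B[1]} L

# gradient descent updates
W2, B2 = W2 - eta * grad_W2, B2 - eta * grad_B2
W1, B1 = W1 - eta * grad_W1, B1 - eta * grad_B1
```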

Generalization

From the single-layer and 2-layer networks, we have observed that the loss function appears only on the output layer.

If our network has $\displaystyle N$ layers, then starting from the output layer $\displaystyle N$, our backpropagation expressions would be:

$$ \nabla_{W^{[N]}} L = g^{'}(\hat{Y}^{[N]}){\otimes}f^{'}(Z^{[N]})*X^{[N]T} $$$$ \nabla_{X^{[N]}} L = W^{[N]T}*g^{'}(\hat{Y}^{[N]}){\otimes}f^{'}(Z^{[N]}) $$$$ \nabla_{B^{[N]}} L = g^{'}(\hat{Y}^{[N]}){\otimes}f^{'}(Z^{[N]}) $$

and the layer preceding the output layer would have the following expressions:

$$ \nabla_{W^{[N-1]}} L = \nabla_{W^{[N-1]}} \hat{Y}^{[N-1]}* \nabla_{X^{[N]}} L $$$$ \nabla_{X^{[N-1]}} L = \nabla_{X^{[N-1]}} \hat{Y} ^{[N-1]}* \nabla_{X^{[N]}} L $$$$ \nabla_{B^{[N-1]}} L = \nabla_{B^{[N-1]}} \hat{Y}^{[N-1]} * \nabla_{X^{[N]}} L $$

where:

$$ \nabla_{W^{[N-1]}} \hat{Y}^{[N-1]} = f^{'}(Z^{[N-1]})*X^{[N-1]T} $$$$ \nabla_{X^{[N-1]}} \hat{Y} ^{[N-1]} = W^{[N-1]T} * f^{'}(Z^{[N-1]}) $$$$ \nabla_{B^{[N-1]}} \hat{Y}^{[N-1]} = f^{'}(Z^{[N-1]}) $$

We would recurse with these expressions up to our input layer.

The terms in the expression $\displaystyle \nabla_{B^{[N-1]}} L$ would need to be reordered as follows

$\displaystyle \nabla_{B^{[N-1]}} L = \nabla_{X^{[N]}} L * \nabla_{B^{[N-1]}} \hat{Y}^{[N-1]} $

to ensure proper alignment of the matrix dimensions.
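
In code, this generalization becomes a single backward loop over the layers. Below is a minimal sketch of one training step for an $N$-layer network, again assuming a sigmoid activation on every layer and a squared-error loss; the list names `Ws` and `Bs` and the layer sizes are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def train_step(Ws, Bs, X, Y, eta=0.1):
    """One gradient descent step for an N-layer network (squared-error loss assumed)."""
    # forward pass, remembering each layer's input X[i] and pre-activation Z[i]
    Xs, Zs = [X], []
    for W, B in zip(Ws, Bs):
        Zs.append(W @ Xs[-1] + B)
        Xs.append(sigmoid(Zs[-1]))

    # output layer: delta = g'(Y_hat[N]) (x) f'(Z[N])
    delta = (Xs[-1] - Y) * sigmoid_prime(Zs[-1])

    # recurse backwards from the output layer to the input layer
    for i in reversed(range(len(Ws))):
        grad_W = delta @ Xs[i].T                 # nabla_{W[i]} L
        grad_B = delta                           # nabla_{B[i]} L
        if i > 0:
            # nabla_{X[i]} L folded with f'(Z[i-1]) for the preceding layer
            delta = (Ws[i].T @ delta) * sigmoid_prime(Zs[i - 1])
        Ws[i] = Ws[i] - eta * grad_W
        Bs[i] = Bs[i] - eta * grad_B
    return Ws, Bs

# illustrative usage: a 3-layer network with sizes 2 -> 4 -> 3 -> 1
rng = np.random.default_rng(0)
sizes = [2, 4, 3, 1]
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
Bs = [np.zeros((m, 1)) for m in sizes[1:]]
train_step(Ws, Bs, X=np.array([[1.0], [2.0]]), Y=np.array([[0.5]]))
```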

References and further reading:

  1. Charu C. Aggarwal, Neural Networks and Deep Learning.
  2. Marc Peter Deisenroth et al., Mathematics for Machine Learning.
  3. Terence Parr and Jeremy Howard, The Matrix Calculus You Need For Deep Learning.