A simple neuron with a single input can be represented mathematically as:

$\displaystyle \hat{y} = f(w*x +b)$

$\displaystyle L = MSE(y, \hat{y})$

Here, $\displaystyle f$ is an activation function that transforms its input into some output, such as a class label. MSE stands for Mean Squared Error; it is a function that measures the difference between the predicted output $\displaystyle \hat{y}$ and the actual output $\displaystyle y$. The goal of this series of posts is to express a neural model using matrix notation and apply matrix calculus to derive the backpropagation algorithm using gradient descent.

To lay the groundwork, we’ll begin with a brief review of essential mathematical concepts.

Brief Math review

1. Linear functions

Functions such as $\displaystyle z=f(x)=c * x $, where $\displaystyle c$ is a constant value, are referred to as linear functions. Given an input $\displaystyle u+v$, the function output is $\displaystyle z=f(u+v)=c*(u+v)=c*u+c*v=f(u)+f(v)$, which is the sum of the outputs from the separate inputs. This property gives us the first condition of what is considered a linear function.

Another property is that if you multiply the input by some constant value, $\displaystyle k$,

$\displaystyle z=f(kx)=c*kx=kf(x)$

The result is just our original function output (before the input was scaled) multiplied by the same constant value. This property gives us the second condition of what is considered a linear function.

Therefore, a function is considered linear if it has the following properties:

  1. $\displaystyle f(u+v)=f(u)+f(v)$
  2. $\displaystyle f(ku)=kf(u)$

Another form of linear function is a function of two inputs that is linear in each of them, $\displaystyle z=f(x, y)=f(x)+f(y)$; these are referred to as bi-linear functions. Our neural network function $\displaystyle z=w_{1}*x_{1}+w_{2}*x_{2}$ is a bi-linear function; that is,

$\displaystyle z=f(w_{1}*x_{1}, w_{2}*x_{2})=f(w_{1}*x_{1})+f(w_{2}*x_{2})=w_{1}*f(x_{1})+w_{2}*f(x_{2})$
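To make the two conditions concrete, here is a minimal Python sketch (assuming NumPy is available; the function and test values are arbitrary choices for illustration) that checks additivity and homogeneity numerically:

```python
import numpy as np

c = 3.0

def f(x):
    # an example linear function, z = c * x
    return c * x

u, v, k = 2.0, 5.0, 7.0

# condition 1 (additivity): f(u + v) == f(u) + f(v)
print(np.isclose(f(u + v), f(u) + f(v)))  # True

# condition 2 (homogeneity): f(k * u) == k * f(u)
print(np.isclose(f(k * u), k * f(u)))     # True
```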

2. Matrices

Matrices are a large topic, and it would not be possible to review them in their entirety. Here, I will review what is needed to understand matrix calculus.

A matrix is a “table” of rows and columns whose entries are numbers. In maths, matrices are written using a special bracket notation. Examples are:

  1. a matrix with 2 rows and 1 column $$ A= \begin{bmatrix} 9 \\ 11 \end{bmatrix} $$

The dimensions of a matrix are usually written out as (rows x columns); in this case, this would be a (2x1) matrix. The symbol “x” is read as “by”.

  2. a matrix with 1 row and 2 columns
$$ A= \begin{bmatrix} 3 & 4 \end{bmatrix} $$
  3. a matrix with 2 rows and 2 columns
$$ A= \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} $$

General presentation of a matrix

  1. In general, matrices are denoted by upper-case letters such as $\displaystyle A$, $\displaystyle B$ and so on, while their elements are denoted by lower-case letters such as $\displaystyle a$, $\displaystyle b$ along with the element's row and column position; for example, $\displaystyle a_{ij}$ is the element at row $\displaystyle i$ and column $\displaystyle j$ of matrix $\displaystyle A$.

For example:

$$ A= \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} $$

$$ B= \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} $$
  2. Given some data, for example, X = (1, 3, 5), we can represent the data in matrix form in 2 ways:
$$ X= \begin{bmatrix} 1 & 3 & 5 \end{bmatrix} $$

or

$$ X= \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} $$

The convention is to present the data in a column matrix.

Each column should represent a single feature.

Matrix operations

1. Multiplication

Let’s review multiplying two matrices; consider the matrices $\displaystyle A$ and $\displaystyle B$:

$$ A= \begin{bmatrix} a \\ b \end{bmatrix} $$

$$ B = \begin{bmatrix} c & d \end{bmatrix} $$

The rules for multiplying the 2 matrices are:

  1. the number of columns of the first matrix must match the number of rows of the second matrix;
  2. the size of the resulting matrix is determined by the number of rows of the first matrix and the number of columns of the second matrix.

That is;

$$ B * A = \begin{bmatrix} c & d \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} c * a + d * b \end{bmatrix} $$

In the case of a (2x2) matrix multiplied by a (2x1) matrix, and a (2x2) matrix multiplied by another (2x2) matrix:

$$ A*B= \begin{bmatrix} a & c \\ b & d \end{bmatrix} \begin{bmatrix} e \\ f \end{bmatrix} = \begin{bmatrix} a*e + c*f \\ b*e + d*f \end{bmatrix} $$

$$ A*B= \begin{bmatrix} a & c \\ b & d \end{bmatrix} \begin{bmatrix} e & g \\ f & h \end{bmatrix} = \begin{bmatrix} a*e + c*f & a*g + c*h \\ b*e + d * f & b*g+d*h \end{bmatrix} $$
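As a quick sanity check of these shape rules, here is a minimal NumPy sketch (the matrix values are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # a (2x2) matrix
B = np.array([[5.0],
              [6.0]])        # a (2x1) matrix

# the inner dimensions match (2 and 2), so the product is defined
C = A @ B                    # (2x2) * (2x1) -> (2x1)
print(C.shape)               # (2, 1)
print(C)                     # [[1*5 + 2*6], [3*5 + 4*6]] = [[17.], [39.]]
```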

2. Addition

Matrix addition is carried out by element-wise addition of the matrix elements. However, the matrices being added must have the same dimensions; if they do not, you can pad the smaller matrix with 0s to make the dimensions match.

Example:

$$ A+B= \begin{bmatrix} 98 & 24 & 42 \\ 34 & 15 & 22 \end{bmatrix} + \begin{bmatrix} 55 & 19 & 44 \\ 43 & 53 & 38 \end{bmatrix} = \begin{bmatrix} 98+55 & 24+19 & 42+44 \\ 34+43 & 15+53 & 22+38 \end{bmatrix} $$

3. Transpose

This involves switching columns to rows or rows to columns. The operation is denoted by the superscript letter $\displaystyle T$.

$$ A= \begin{bmatrix} 98 & 24 & 42 \\ 34 & 15 & 22 \end{bmatrix} $$

$$ A^T= \begin{bmatrix} 98 & 34 \\ 24 & 15 \\ 42 & 22 \end{bmatrix} $$

4. Hadamard Product

This operation involves the element-wise multiplication of the matrix elements, and it is commonly denoted by $\displaystyle \odot$.

$$ A{\odot}B= \begin{bmatrix} 98 & 24 & 42 \\ 34 & 15 & 22 \end{bmatrix} \odot \begin{bmatrix} 55 & 19 & 44 \\ 43 & 53 & 38 \end{bmatrix} = \begin{bmatrix} 98*55 & 24*19 & 42*44 \\ 34*43 & 15*53 & 22*38 \end{bmatrix} $$
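Addition, transpose, and the Hadamard product all map directly onto NumPy operations; here is a minimal sketch using the matrices from the examples above:

```python
import numpy as np

A = np.array([[98, 24, 42],
              [34, 15, 22]])
B = np.array([[55, 19, 44],
              [43, 53, 38]])

print(A + B)   # element-wise addition; shapes must match
print(A.T)     # transpose: rows become columns, shape (3, 2)
print(A * B)   # Hadamard (element-wise) product
```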

Matrices as linear maps

We observed that a linear function/map is a function that satisfies the following 2 conditions:

  1. The output of a sum of inputs is the sum of the outputs of the individual inputs, $\displaystyle f(u+v)=f(u)+f(v)$.
  2. If we scale the input by a constant, the output is scaled by the same constant, $\displaystyle f(ku)=kf(u)$.

Matrices satisfy these same properties under matrix multiplication, $\displaystyle A(u+v)=Au+Av$ and $\displaystyle A(ku)=kAu$, and hence are referred to as linear maps.

3. Linear functions as Matrices

We have observed that matrices are linear maps; therefore, we can represent any linear function as a matrix. For example, the system of equations

$$\displaystyle y_{1}=a*x_{1} + b * x_{2}$$

$$\displaystyle y_{2}=c*x_{1} + d * x_{2}$$

can be represented in matrix form as follows:

  1. the inputs $\displaystyle x_{1}$ and $\displaystyle x_{2}$ are represented by the matrix $$ X= \begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix} $$

or as

$$ X= \begin{bmatrix} x_{1} & x_{2} \end{bmatrix} $$
  2. the coefficients by $$ A= \begin{bmatrix} a & b \\ c & d \end{bmatrix} $$
  3. the outputs by $$ Y= \begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix} $$

so that

$$ \begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix} = \begin{bmatrix} a*x_{1} + b*x_{2} \\ c*x_{1} + d*x_{2} \end{bmatrix} $$

$$ Y= A*X=AX $$
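As a quick numerical check that the matrix form reproduces the two scalar equations (the coefficient and input values below are arbitrary):

```python
import numpy as np

a, b, c, d = 1.0, 2.0, 3.0, 4.0
x1, x2 = 5.0, 6.0

A = np.array([[a, b],
              [c, d]])
X = np.array([[x1],
              [x2]])   # the inputs as a column matrix

Y = A @ X              # Y = A * X
print(Y.flatten())     # [a*x1 + b*x2, c*x1 + d*x2] = [17., 39.]
```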

Question

We know that the function $\displaystyle y=a*x_{1} + b*x_{2}$ is a linear function; how would we represent it in matrix form?

4. Derivatives

Derivatives describe how a function changes when there are small changes in its inputs. This rate of change is referred to as the gradient of the function; it is the ratio of how much the output changes to how much the input changes.

For a linear function $\displaystyle z=c*x$, you would probably correctly guess that this ratio is simply the constant $\displaystyle c$: the output changes by $\displaystyle c$ times whatever the change in input is.

For a single variable function $\displaystyle f(x)$, its derivative with respect to $\displaystyle x$ is denoted by $\displaystyle f^{'}$

$$ \frac{df(x)}{dx} = f^{'} $$

The two important techniques for our purposes are;

1. Chain rule

Given a function $\displaystyle y=f(x)$ and $\displaystyle x=h(u)$, combining the two functions we get $\displaystyle y=f(h(u))$.

The derivative of the function, with respect to $\displaystyle u$, would be;

$$ \frac{dy}{du}=\frac{df(h(u))}{dh}*\frac{dh(u)}{du}=\frac{df(h)}{dh}*\frac{dh(u)}{du} $$

A short-cut method would be to instead differentiate the original separate functions;

$$ \frac{dy}{dx}=\frac{df(x)}{dx} $$

$$ \frac{dx}{du}=\frac{dh(u)}{du} $$

and, since we want to differentiate $\displaystyle y$ with respect to $\displaystyle u$,

$$ \frac{dy}{du}=\frac{dy}{dx}*\frac{dx}{du}=\frac{df(x)}{dx}*\frac{dh(u)}{du} $$
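Here is a minimal SymPy sketch (the particular choices of $\displaystyle f$ and $\displaystyle h$ are arbitrary, for illustration only) confirming that the chain rule gives the same result as differentiating the composition directly:

```python
import sympy as sp

u, x = sp.symbols('u x')

h = u**2 + 1        # x = h(u)
f = sp.sin(x)       # y = f(x)

# chain rule: dy/du = (df/dx evaluated at x = h(u)) * dh/du
dy_du_chain = sp.diff(f, x).subs(x, h) * sp.diff(h, u)

# direct differentiation of the composition y = f(h(u))
dy_du_direct = sp.diff(f.subs(x, h), u)

print(sp.simplify(dy_du_chain - dy_du_direct))  # 0, the two agree
```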

2. Partial derivatives

Sometimes a function is made up of 2 or more input variables, for example $\displaystyle z=f(x, y)$.

In the case where $\displaystyle z=f(x, y)$, we suppose that any small change in $\displaystyle z$ is the result of a combination of small changes in $\displaystyle x$ and small changes in $\displaystyle y$.

That is, we make some small changes in $\displaystyle x$ while holding $\displaystyle y$ fixed, then observe how much $\displaystyle z$ changes. Then make small changes in $\displaystyle y$ while holding $\displaystyle x$ fixed and observe how much $\displaystyle z$ changes. The total change in $\displaystyle z$ would be proportionate to the sum of these observed changes.

If we let the rate of change in $\displaystyle z$ due to a change in $\displaystyle x$ be $\displaystyle m_{x}$, and the rate of change in $\displaystyle z$ due to a change in $\displaystyle y$ be $\displaystyle m_{y}$, then the total change in $\displaystyle z$, denoted by $\displaystyle \Delta z$, is:

$$ \Delta z =m_{x}{\Delta x} + m_{y}{\Delta y} $$

$$ m_{x}=\frac{\partial z}{\partial x}=\frac{\partial f(x,y)}{\partial x}{|}_{y=constant}=\frac{\partial f}{\partial x}=f^{'}_{x} $$

$$ m_{y}=\frac{\partial z}{\partial y}=\frac{\partial f(x,y)}{\partial y}{|}_{x=constant}=\frac{\partial f}{\partial y}=f^{'}_{y} $$

The derivatives $\displaystyle m_{x}$ and $\displaystyle m_{y}$ are referred to as partial derivatives and are calculated similarly to normal derivatives.

$$ \frac{\partial f(x,y)}{\partial x}{|}_{y=constant}:=\frac{df(x,y)}{dx} $$

$$ \frac{\partial f(x,y)}{\partial y}{|}_{x=constant}:=\frac{df(x,y)}{dy} $$
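A minimal SymPy sketch of computing partial derivatives by holding the other variable constant (the example function is an arbitrary choice):

```python
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 * y + 3*x + y**2   # an arbitrary two-variable function

m_x = sp.diff(f, x)         # partial derivative w.r.t. x, with y held constant
m_y = sp.diff(f, y)         # partial derivative w.r.t. y, with x held constant
print(m_x)                  # 2*x*y + 3
print(m_y)                  # x**2 + 2*y
```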

3. Partial derivatives with change of variables

Suppose the variables $\displaystyle x$ and $\displaystyle y$ are themselves functions of new variables $\displaystyle u$ and $\displaystyle v$, with $\displaystyle x=g(u,v)$ and $\displaystyle y=h(u,v)$. Then

$$ z=f(x, y)=f(g(u,v), h(u, v)) $$

$$ \Delta z = m_{u}\Delta u + m_{v}\Delta v $$

To compute the derivative of $\displaystyle z$ with respect to $\displaystyle u$ and $\displaystyle v$ we apply the chain-rule;

$$ \frac{\partial z}{\partial u}=\frac{\partial f}{\partial x} * \frac{\partial x}{\partial u} + \frac{\partial f}{\partial y} * \frac{\partial y}{\partial u} $$

$$ \frac{\partial z}{\partial v}=\frac{\partial f}{\partial x} * \frac{\partial x}{\partial v} + \frac{\partial f}{\partial y} * \frac{\partial y}{\partial v} $$

We can simplify these equations by writing them in matrix form;

$$ A= \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} $$

$$ J= \begin{bmatrix} \frac{\partial x}{\partial u} & \frac{\partial y}{\partial u}\\ \frac{\partial x}{\partial v} & \frac{\partial y}{\partial v} \end{bmatrix} $$

$$ Z= \begin{bmatrix} \frac{\partial z}{\partial u} \\ \frac{\partial z}{\partial v} \end{bmatrix} $$

$$ Z=JA $$

The matrix $\displaystyle J$ is special and is referred to as a Jacobian.

You can consider it as a differential transformation matrix; that is, if you want to change (transform) the differentials $\displaystyle \frac{\partial z}{\partial x}$ and $\displaystyle \frac{\partial z}{\partial y}$ to be with respect to $\displaystyle u$ and $\displaystyle v$, you apply $\displaystyle J$ to them.

Note:

There are 2 ways to lay out the Jacobian matrix.

  1. The Denominator Layout $$J= \begin{bmatrix} \frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{1}}\\ \frac{\partial f_{1}}{\partial x_{2}} & \frac{\partial f_{2}}{\partial x_{2}} \end{bmatrix} $$
  2. The Numerator Layout $$J= \begin{bmatrix} \frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{1}}{\partial x_{2}}\\ \frac{\partial f_{2}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{2}} \end{bmatrix} $$

Example:

Suppose that $\displaystyle z=x+y$ and we would like to make a change of variables to $\displaystyle u$ and $\displaystyle v$ using the following relations: $\displaystyle x=2u+3v$ and $\displaystyle y=2u-5v$. What differential transformation matrix would we need to apply to change the differentials with respect to $\displaystyle x$ and $\displaystyle y$ into differentials with respect to $\displaystyle u$ and $\displaystyle v$?

Solution:

$$ \frac{\partial z}{\partial x} = 1 \qquad \frac{\partial z}{\partial y} = 1 $$

To change to $\displaystyle \frac{\partial z}{\partial u}$ and $\displaystyle \frac{\partial z}{\partial v}$ we would apply the following Jacobian;

$$ J= \begin{bmatrix} \frac{\partial x}{\partial u} & \frac{\partial y}{\partial u}\\ \frac{\partial x}{\partial v} & \frac{\partial y}{\partial v} \end{bmatrix} = \begin{bmatrix} 2 & 2\\ 3 & -5 \end{bmatrix} $$

$$ \begin{bmatrix} \frac{\partial z}{\partial u} \\ \frac{\partial z}{\partial v} \end{bmatrix} = \begin{bmatrix} 2 & 2\\ 3 & -5 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ -2 \end{bmatrix} $$

To verify that this is true, let's do the substitutions and compute the differentials directly:

$$ z=x+y=(2u+3v)+(2u-5v)=4u-2v$$

$$ \frac{\partial z}{\partial u} = 4 \qquad \frac{\partial z}{\partial v}= -2$$
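The same verification can be done numerically; here is a minimal NumPy sketch applying the Jacobian from the worked example:

```python
import numpy as np

# rows hold d/du and d/dv of (x, y), where x = 2u + 3v and y = 2u - 5v
J = np.array([[2.0,  2.0],
              [3.0, -5.0]])

# partial derivatives of z = x + y with respect to x and y
A = np.array([[1.0],
              [1.0]])

print(J @ A)   # [[ 4.], [-2.]] -> dz/du = 4, dz/dv = -2
```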

5. Gradient, $\displaystyle \nabla$

$$ J= \begin{bmatrix} \frac{\partial x}{\partial u} & \frac{\partial y}{\partial u}\\ \frac{\partial x}{\partial v} & \frac{\partial y}{\partial v} \end{bmatrix} $$

Looking back at the Jacobian above, each of its columns is obtained by applying the column of differential operators

$$ \begin{bmatrix} \frac{\partial }{\partial u}\\ \frac{\partial }{\partial v} \end{bmatrix} $$

to one of the functions $\displaystyle x(u,v)$ and $\displaystyle y(u,v)$; that is, each column is the gradient of that function. Writing the operator in row form,

$$ \nabla = \begin{bmatrix} \frac{\partial }{\partial u} & \frac{\partial }{\partial v} \end{bmatrix} $$

it is referred to as the Gradient and is denoted by the symbol “nabla”, $\displaystyle \nabla$.

Note:

The $\displaystyle \nabla$ is an operator; it takes a function and converts it into a matrix of partial derivatives.

Example:

Suppose our function is defined as $\displaystyle f(x_{1}, x_{2})=x^2_{1}x_{2}+x_{1}x^3_{2}$. Calculate its gradient.

Solution:

In this case our function variables are $\displaystyle x_{1}$ and $\displaystyle x_{2}$. Therefore;

$$ \nabla = \begin{bmatrix} \frac{\partial }{\partial x_{1}} & \frac{\partial }{\partial x_{2}} \end{bmatrix} $$

Applying the operator to our function $\displaystyle f$, we get

$$ \nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_{1}} & \frac{\partial f}{\partial x_{2}} \end{bmatrix} = \begin{bmatrix} 2x_{1}x_{2}+x^3_{2} & x^2_{1}+3x_{1}x^2_{2} \end{bmatrix} $$

If we want the result in matrix column form, a vector, we transpose the matrix

$$ \begin{bmatrix} \nabla f \end{bmatrix} ^ T = \begin{bmatrix} \frac{\partial f}{\partial x_{1}} \\ \frac{\partial f}{\partial x_{2}} \end{bmatrix} = \begin{bmatrix} 2x_{1}x_{2}+x^3_{2} \\ x^2_{1}+3x_{1}x^2_{2} \end{bmatrix} $$
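The same gradient can be computed with SymPy; a minimal sketch:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1**2 * x2 + x1 * x2**3

grad_f = [sp.diff(f, x1), sp.diff(f, x2)]   # nabla f in row form
print(grad_f)   # [2*x1*x2 + x2**3, x1**2 + 3*x1*x2**2]
```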

6. Optimization

Sometimes, given a model $\displaystyle f(x)$, where $\displaystyle x$ is some input to the model, we may be interested in the point at which the function is at either a maximum or a minimum. For example, a business may be interested in how much product to sell to maximize its profits, or a manufacturing plant may be interested in the point at which it has the minimum production costs.

These types of problems are referred to as optimization problems and are solved using various analytical and numerical techniques.

For our purposes we will cover,

  1. Derivatives, and
  2. Gradient Descent.

1. Derivatives

This is an analytical method which uses derivatives to find the point at which the model is at a maximum or minimum. At a point where a function is at a minimum or maximum, its gradient is equal to 0.

Example:

Suppose we have the following manufacturing cost model, $\displaystyle c(x)=x^3-15*x^2+1000$, where $\displaystyle x$ is the number of units produced. We would be interested in the following questions:

  1. how many units the plant should produce to have the lowest costs and

  2. what is the lowest possible production cost?

Solution:

  1. Differentiate the model;
$$ \frac{dc(x)}{dx}=3*x^2-30*x $$
  2. Set the derivative to 0 and solve;
$$3*x^2-30*x = 3*x*(x-10) = 0$$

$$x=0 \quad \text{or} \quad x=10$$

Ten units is the number of units the plant can produce at the minimum cost (we rule out $\displaystyle x=0$ in the next step), and the lowest possible production cost is $\displaystyle c(10)=10^3-15*10^2+1000=500$.

  3. To confirm that $\displaystyle x=10$ is the minimum point, take the second derivative;

    $$ \frac{d(3*x^2-30*x)}{dx}=6*x-30$$

    $$ 6*(10) - 30 = 30$$

    Since the value is positive, the point is a minimum (at $\displaystyle x=0$ the second derivative is $\displaystyle -30$, so that point is a maximum). A short SymPy sketch of this calculation follows below.
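Here is that sketch; it assumes SymPy is available and reproduces the steps above:

```python
import sympy as sp

x = sp.symbols('x')
c = x**3 - 15*x**2 + 1000            # the manufacturing cost model

dc = sp.diff(c, x)                   # first derivative: 3*x**2 - 30*x
critical_points = sp.solve(dc, x)    # [0, 10]

for p in critical_points:
    second = sp.diff(dc, x).subs(x, p)    # second derivative at the point
    cost = c.subs(x, p)
    print(p, second, cost)                # 0 -30 1000 (maximum), 10 30 500 (minimum)
```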

2. Gradient Descent

Sometimes a cost function may not have a simple closed-form solution for its minimum; in such cases we have to resort to numerical methods.

Gradient descent is a numerical method that attempts to find the parameters that minimize the cost function. It uses the function's gradient to move in the direction that decreases the cost function the fastest;

$$ -1 * {\nabla f} $$

where $\displaystyle x_{1}, ... , x_{n}$ are the variables of the cost function and $\displaystyle {\nabla f}$ is its gradient with respect to those variables.

This algorithm is derived from the calculus concept of directional derivatives.

For a one-variable (scalar) cost function $\displaystyle y=f(x)$ this simplifies to

$$ -1 * \frac{df}{dx}=-1*f^{'} $$

This states that if we repeatedly move $\displaystyle x$ by a change of $\displaystyle -1*f^{'}$, we will converge (assuming a minimum exists) to a point at which the cost function is at a minimum.

The gradient descent algorithm adjusts this by replacing the $\displaystyle -1$ with a parameter $\displaystyle -\eta$. This parameter is referred to as the learning rate, and it sets the step size.

The choice of the learning rate value is important: if it is too small, the algorithm takes a long time to converge; if it is too large, we may overshoot the minimum point.

The full update formula is therefore;

$$ \displaystyle \hat{x} = x - \eta * {\nabla f}=x-\eta * f^{'} $$

where $\displaystyle \hat{x}$ is the updated value of $\displaystyle x$ that we feed back into the cost function.
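To make the update rule concrete, here is a minimal gradient-descent sketch in plain Python, applied to the cost model from the derivatives example; the learning rate, starting point, and iteration count are arbitrary choices:

```python
def cost(x):
    return x**3 - 15*x**2 + 1000

def grad(x):
    # derivative of the cost model: dc/dx = 3*x^2 - 30*x
    return 3*x**2 - 30*x

eta = 0.01    # learning rate (step size)
x = 2.0       # arbitrary starting point

for _ in range(1000):
    x = x - eta * grad(x)   # x_hat = x - eta * f'(x)

print(x, cost(x))   # converges near x = 10, cost = 500
```

With this small learning rate the iterates move steadily from the starting point toward the minimum at $\displaystyle x = 10$; a much larger learning rate would make the updates overshoot, as noted above.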

References and further reading:

  1. Charu C. Aggarwal, Neural Networks and Deep Learning.
  2. Marc Peter Deisenroth et al., Mathematics for Machine Learning.
  3. Terence Parr and Jeremy Howard, The Matrix Calculus You Need For Deep Learning.