Backward propagation, or more precisely Error Backward Propagation, is a technique used to optimize a neural network. An output error is propagated backward within the network, and as the error propagates, the network parameters are adjusted with the aim of reducing the resulting output error.

Our Model:

Consider the following graph of our model of the neuron;

  graph LR
  A@{shape: circ, label: "w"} --> B@{shape: rect, label: "g(.)"} --> C@{shape: circ, label: "y"} --> D@{shape: rect, label: "h(.)"} --> E@{shape: circ, label: "f"}
  F@{shape: circ, label: "b"} --> B

In this model;

  1. The output ‘f’ is a single value.

  2. ‘w’ and ‘b’ are the inputs into our model; we consider them inputs because they are of interest to us.

We can break the graph into 2 sub-graphs; I call this the black-box approach.

  graph LR
  C@{shape: circ, label: "y"} --> D@{shape: rect, label: "."} --> E@{shape: circ, label: "f"}
  graph LR
  A@{shape: circ, label: "w"} --> B@{shape: rect, label: "."} --> C@{shape: circ, label: "y"}
  F@{shape: circ, label: "b"} --> B

From sub-graph 1, we can calculate the change of output ‘f’ with respect to its input ‘y’ as;

$$\frac{\partial f}{\partial y}=E$$

We denote the result of the computation as ‘E’ since, at the moment, we are not concerned with its exact value.

On sub-graph 2;

We can calculate the change of output ‘y’ with respect to inputs ‘w’ and ‘b’ as;

$$\frac{\partial y}{\partial w}$$$$\frac{\partial y}{\partial b}$$

If we are interested in the overall change, for example the change in output ‘f’ with respect to input ‘w’, it is a matter of multiplication; we have already seen this in part II of the article.

$$\frac{\partial f}{\partial w}=E * \frac{\partial y}{\partial w}$$

Notice, that we are moving backwards from the output to the input, hence the name backward propagation.
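The two-box idea can be made concrete with a small numerical sketch. The choices g(w, b) = w·b and h(y) = y² below are illustrative assumptions, not part of the model above; the point is only that ∂f/∂w falls out of multiplying the local derivatives of the two sub-graphs.

```python
# Two black boxes chained together: y = g(w, b), f = h(y).
# g and h are illustrative choices, not prescribed by the model.

def g(w, b):          # sub-graph 2: inputs w, b -> output y
    return w * b

def h(y):             # sub-graph 1: input y -> output f
    return y ** 2

w, b = 3.0, 2.0
y = g(w, b)           # forward pass through sub-graph 2
f = h(y)              # forward pass through sub-graph 1

E = 2 * y             # sub-graph 1: df/dy = 2y; this is 'E'
dy_dw = b             # sub-graph 2: dy/dw = b
df_dw = E * dy_dw     # chain rule: df/dw = E * dy/dw

# finite-difference check of df/dw
eps = 1e-6
approx = (h(g(w + eps, b)) - h(g(w - eps, b))) / (2 * eps)
print(df_dw, approx)  # both should be close to 24.0
```

Working backwards, E is computed once at the output and reused for every input of the earlier sub-graph, which is exactly what makes the backward pass cheap.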

A neuron

Let's consider a neuron with 2 inputs; its mathematical expression is;

$$F=L(y, \hat{y})$$$$\hat{y}=\sigma(w_{1}x_{1}+w_{2}x_{2}+b)$$

Our computational graph would be;

  graph LR
  A@{shape: circ, label: "x1"} --> B@{shape: rect, label: "×"} --> C@{shape: circ, label: "A"}
  D@{shape: circ, label: "w1"} --> B
  E@{shape: circ, label: "x2"} --> F@{shape: rect, label: "×"} --> G@{shape: circ, label: "B"}
  H@{shape: circ, label: "w2"} --> F
  C --> I@{shape: rect, label: "+"} --> J@{shape: circ, label: "C"}--> K@{shape: rect, label: "+"}
  K --> L@{shape: circ, label: "D"} --> M@{shape: rect, label: "sigma(.)"} --> N@{shape: circ, label: "y^hat"} --> P@{shape: rect, label: "L(., .)"} --> R@{shape: circ, label: "F"}
  G --> I
  T@{shape: circ, label: "b"} --> K
  V@{shape: circ, label: "y"} --> P

We can use the black-box approach and break the graph into 2; the split mirrors the two expressions above.

  graph LR
  N@{shape: circ, label: "y^hat"} --> P@{shape: rect, label: "L(., .)"} --> R@{shape: circ, label: "F"}
  V@{shape: circ, label: "y"} --> P

We will call this graph the loss-graph.

  graph LR
  A@{shape: circ, label: "x1"} --> B@{shape: rect, label: "×"} --> C@{shape: circ, label: "A"}
  D@{shape: circ, label: "w1"} --> B
  E@{shape: circ, label: "x2"} --> F@{shape: rect, label: "×"} --> G@{shape: circ, label: "B"}
  H@{shape: circ, label: "w2"} --> F
  C --> I@{shape: rect, label: "+"} --> J@{shape: circ, label: "C"}--> K@{shape: rect, label: "+"}
  K --> L@{shape: circ, label: "D"} --> M@{shape: rect, label: "sigma(.)"} --> N@{shape: circ, label: "y^hat"}
  G --> I
  T@{shape: circ, label: "b"} --> K

and call this graph the input graph.

During a training run (epoch), a neuron will ingest some values through the variables x1 and x2 and emit some output $\displaystyle \hat{y}$. This output is compared to some expected value; the difference is called the error. The aim of training is to minimize this error.

There are many ways to calculate the error, and since we are generalizing, we use the symbol $\displaystyle L$ to denote any of those methods.
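As one concrete instance of $\displaystyle L$ (purely for illustration), we can take the squared error; its derivative with respect to $\displaystyle \hat{y}$ plays the role of E below. A minimal sketch:

```python
# Squared error as one concrete choice of L; any differentiable loss works.
def loss(y, y_hat):
    return (y - y_hat) ** 2

def loss_grad(y, y_hat):
    # E = dL/dy_hat = -2 * (y - y_hat)
    return -2.0 * (y - y_hat)

y, y_hat = 1.0, 0.7
E = loss_grad(y, y_hat)
print(loss(y, y_hat), E)  # roughly 0.09 and -0.6
```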

The parameters we will use to optimize the neuron are $\displaystyle w_{1}$, $\displaystyle w_{2}$, $\displaystyle b$. However, in order to also propagate the error backwards, the derivatives with respect to the inputs $\displaystyle x_{1}$ and $\displaystyle x_{2}$ are also required.

In our single neuron, the differentiation with respect to $\displaystyle x_{1}$ and $\displaystyle x_{2}$ is presented, but it's not required.

  1. The loss-graph expressions;
$$F=L(y, \hat{y})$$$$\frac{\partial F}{\partial \hat{y}}=\frac{\partial L(y, \hat{y})}{\partial \hat{y}}=E$$

We differentiate only with respect to $\displaystyle \hat{y}$ since it's the only input influenced by our optimizing parameters.

  2. The input-graph expressions;
$$\hat{y}=\sigma(D)$$$$D=C+b$$$$C=A+B$$$$B=w_{2}x_{2}$$$$A=w_{1}x_{1}$$

writing out our partial differentiation of the expressions;

$$\frac{\partial \hat{y}}{\partial D}=\dot{\sigma}(D)$$$$\frac{\partial D}{\partial C}=1 \qquad \frac{\partial D}{\partial b}=1$$$$\frac{\partial C}{\partial A}=1 \qquad \frac{\partial C}{\partial B}=1$$$$\frac{\partial B}{\partial w_{2}}=x_{2} \qquad \frac{\partial B}{\partial x_{2}}=w_{2}$$$$\frac{\partial A}{\partial w_{1}}=x_{1} \qquad \frac{\partial A}{\partial x_{1}}=w_{1}$$

Therefore, differentiating $\displaystyle \hat{y}$ with respect to our optimizing parameters, we get;

$$\frac{\partial \hat{y}}{\partial b}=\dot{\sigma}(D)*1$$$$\frac{\partial \hat{y}}{\partial w_{2}}=\dot{\sigma}(D)*x_{2}$$$$\frac{\partial \hat{y}}{\partial w_{1}}=\dot{\sigma}(D)*x_{1}$$$$\frac{\partial \hat{y}}{\partial x_{2}}=\dot{\sigma}(D)*w_{2}$$$$\frac{\partial \hat{y}}{\partial x_{1}}=\dot{\sigma}(D)*w_{1}$$

Finally, the overall differentiation is;

$$\frac{\partial F}{\partial b}=E*\dot{\sigma}(D)*1$$$$\frac{\partial F}{\partial w_{2}}=E*\dot{\sigma}(D)*x_{2}$$$$\frac{\partial F}{\partial w_{1}}=E*\dot{\sigma}(D)*x_{1}$$$$\frac{\partial F}{\partial x_{2}}=E*\dot{\sigma}(D)*w_{2}$$$$\frac{\partial F}{\partial x_{1}}=E*\dot{\sigma}(D)*w_{1}$$

These values are used to optimize the parameters on the next training run.
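The forward and backward passes above can be sketched numerically. Squared error stands in for the generic loss $\displaystyle L$, and the input and parameter values are arbitrary illustrative choices; a finite-difference check confirms the hand-derived gradient.

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigma_dot(z):            # derivative of the sigmoid
    s = sigma(z)
    return s * (1.0 - s)

# Illustrative values; squared error stands in for the generic loss L.
x1, x2 = 0.5, -1.0
w1, w2, b = 0.8, 0.2, 0.1
y = 1.0                      # expected value

# forward pass, following the sub-expressions A, B, C, D
A = w1 * x1
B = w2 * x2
C = A + B
D = C + b
y_hat = sigma(D)
F = (y - y_hat) ** 2

# backward pass
E = -2.0 * (y - y_hat)       # dF/dy_hat for squared error
dF_db  = E * sigma_dot(D) * 1.0
dF_dw1 = E * sigma_dot(D) * x1
dF_dw2 = E * sigma_dot(D) * x2

# finite-difference check on w1
eps = 1e-6
def forward(w1_):
    return (y - sigma(w1_ * x1 + w2 * x2 + b)) ** 2
approx = (forward(w1 + eps) - forward(w1 - eps)) / (2 * eps)
print(dF_dw1, approx)        # the two should agree closely
```

Note that $\displaystyle \dot{\sigma}(D)$ and E are each computed once and reused across all five gradients, which is the practical payoff of propagating backwards.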

2 layer neural network

A 2-layer network has an input layer and an output layer. The network is made up of 3 neurons joined together: 2 neurons on the input layer and 1 on the output layer.

An important point to note about the input layer neurons is that they do not have a loss function; that is, using the graph analogy, they do not have the loss-graph.

Skipping the drawing of the computational graphs, the sub-expressions of the output and input layers would be;

  1. Output layer
$$H=L(y, M)$$$$M=\sigma(R)$$$$R=K+b_{3}$$$$K=I+J$$$$I=w_{31}*Z_{1}$$$$J=w_{32}*Z_{2}$$

$\displaystyle Z_{1},\ Z_{2}$ are its inputs from the previous stage and $\displaystyle L$ is our error/loss function.

  2. Input Layer

It is made up of 2 neurons; the sub-expressions of the first neuron are;

$$Z_{1}=\sigma(F)$$$$F=P + b_{1}$$$$P=A + B$$$$B=w_{12}*y$$$$A=w_{11}*x$$

while those of the second are;

$$Z_{2}=\sigma(S)$$$$S=G + b_{2}$$$$G=C + D$$$$C=w_{21}*x$$$$D=w_{22}*y$$

Here we use $\displaystyle S$ rather than $\displaystyle H$ for the pre-activation, since $\displaystyle H$ already denotes the output of the loss.

At this point it would be easy to verify;

  1. Output Layer
$$if \quad \frac{\partial H}{\partial M}=E, \quad then$$$$\frac{\partial H}{\partial w_{31}}=E*\dot{\sigma}(R)*Z_{1}$$$$\frac{\partial H}{\partial w_{32}}=E*\dot{\sigma}(R)*Z_{2}$$$$\frac{\partial H}{\partial Z_{1}}=E*\dot{\sigma}(R)*w_{31}$$$$\frac{\partial H}{\partial Z_{2}}=E*\dot{\sigma}(R)*w_{32}$$

We also differentiate with respect to its inputs, $\displaystyle Z_{1}$ and $\displaystyle Z_{2}$, so that we are able to propagate the error backwards to our input layer.

  2. Input Layer 1
$$\frac{\partial Z_{1}}{\partial b_{1}}=\dot{\sigma}(F)*1$$$$\frac{\partial Z_{1}}{\partial w_{12}}=\dot{\sigma}(F)*y$$$$\frac{\partial Z_{1}}{\partial w_{11}}=\dot{\sigma}(F)*x$$
  3. Input Layer 2
$$\frac{\partial Z_{2}}{\partial b_{2}}=\dot{\sigma}(S)*1$$$$\frac{\partial Z_{2}}{\partial w_{21}}=\dot{\sigma}(S)*x$$$$\frac{\partial Z_{2}}{\partial w_{22}}=\dot{\sigma}(S)*y$$

Finally, the backward propagation, which involves differentiating the final output with respect to parameters on the input layer;

For example;

The partial of $\displaystyle H$ with respect to input layer 1 parameters $\displaystyle w_{11}$ and $\displaystyle w_{12}$ would be;

$$\frac{\partial H}{\partial w_{11}}=\frac{\partial H}{\partial Z_{1}} * \frac{\partial Z_{1}}{\partial w_{11}}$$$$\frac{\partial H}{\partial w_{11}}=E*\dot{\sigma}(R)*w_{31}*\dot{\sigma}(F)*x$$

and;

$$\frac{\partial H}{\partial w_{12}}=\frac{\partial H}{\partial Z_{1}} * \frac{\partial Z_{1}}{\partial w_{12}}$$$$\frac{\partial H}{\partial w_{12}}=E*\dot{\sigma}(R)*w_{31}*\dot{\sigma}(F)*y$$

These equations show us how to back-propagate the ‘error’ from the output layer to input layer 1.

Looking at the 2 equations, we can observe some patterns, suggesting these can be generalized even further.
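The back-propagated expression for $\displaystyle \partial H/\partial w_{11}$ can be checked numerically. This sketch assumes squared error for $\displaystyle L$ and arbitrary illustrative values for the inputs and parameters; the analytic gradient is compared against a finite-difference estimate.

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigma_dot(z):
    s = sigma(z)
    return s * (1.0 - s)

# Illustrative values; squared error stands in for the generic loss L.
x, y_in = 0.5, -0.3           # network inputs (x and y in the text)
w11, w12, b1 = 0.1, 0.2, 0.05 # input-layer neuron 1
w21, w22, b2 = 0.3, 0.4, -0.05# input-layer neuron 2
w31, w32, b3 = 0.5, 0.6, 0.1  # output-layer neuron
target = 1.0                  # expected output (y in L(y, M))

def forward(w11_):
    F = w11_ * x + w12 * y_in + b1   # input neuron 1 pre-activation
    Z1 = sigma(F)
    pre2 = w21 * x + w22 * y_in + b2 # input neuron 2 pre-activation
    Z2 = sigma(pre2)
    R = w31 * Z1 + w32 * Z2 + b3     # output neuron pre-activation
    M = sigma(R)
    return (target - M) ** 2, F, R, M

H, F, R, M = forward(w11)

# backward pass for dH/dw11, following the equation above
E = -2.0 * (target - M)              # dH/dM for squared error
dH_dw11 = E * sigma_dot(R) * w31 * sigma_dot(F) * x

# finite-difference check
eps = 1e-6
approx = (forward(w11 + eps)[0] - forward(w11 - eps)[0]) / (2 * eps)
print(dH_dw11, approx)               # should agree closely
```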

Conclusion

We have seen how to apply our computational graph to a neural network, and how easy it is to generate optimizing expressions. For networks with many more layers, the use of a computational graph becomes impractical and other techniques, such as matrices and matrix calculus, have to be used. However, the general concepts remain the same.

References and further reading;

  1. Marc Peter Deisenroth et al., Mathematics for Machine Learning.

  2. Ian Goodfellow et al., Deep Learning.

  3. Ronald T. Kneusel, Math for Deep Learning.

  4. Baydin et al., Automatic Differentiation in Machine Learning.