Wikipedia defines Automatic Differentiation (AD) as a technique to evaluate partial derivative of a function. I will not repeat the definition, but I will quote an important point that’s relevant to what we are doing;
… By applying the chain rule repeatedly to these operations, partial derivatives of arbitrary order can be computed automatically …
The take away here is that with AD;
we breaks down a long expression into small sub-expressions.
we repeatedly apply of differentiation to those sub-expressions to get the total differentiation.
In part I, of the article we learned how we can use a computational graph to transform a mathematical expression into a diagram showing how inputs will processed to some eventual output using a sequence of small operations. We also learned how we can use the graph to generate sub-expressions of the much longer expression.
In this article we will learn how we can use the results of a computational graph to perform repeated differentiation. The only pre-requisite will be that you can perform partial differentiation, and you can use a technique of differentiation referred to as the chain rule.
Repeated Differentiation
Lets start off with a simple example;
Example 1:
Given the expression $\displaystyle f=\sin(x^2)$, we can turn it into a computational graph by following our steps in article I;
Identify the inputs:
$\displaystyle x$
Identify the outputs;
$\displaystyle f$
Identify the operators;
$\displaystyle (.)^2$, $\displaystyle \sin(.)$
Applying our rules, we obtain the graph;
graph LR A@{shape: circ, label: "x"} --> B@{shape: rect, label: "(.)^2"} --> C@{shape: circ, label: "A"} --> D@{shape: rect, label: "sin(.)"} --> E@{shape: circ, label: "F"}
From the graph we obtain the following sub-expressions;
$$ F=\sin(A)$$$$ A=x^2$$Differentiation
We can differentiate the sub-expressions, obtaining;
$$\frac{\partial F}{\partial A}=\cos(A)$$$$\frac{\partial A}{\partial x}=2x$$The full differentiation of the expression, that is, differentiating the output, $\displaystyle F$, with respect to its input, $\displaystyle x$, we obtain;
$$\frac{\partial F}{\partial x}=\frac{\partial F}{\partial A}*\frac{\partial A}{\partial x}=\cos(A)*2x$$We can compare this result to the results we obtain using the chain rule;
$$f=\sin(x^2)$$$$\frac{\partial f}{\partial x}=\frac{\partial \sin(x^2)}{\partial x}$$$$let \quad u=x^2 $$$$\frac{\partial f}{\partial x}=\frac{\partial \sin(u)}{\partial u}*\frac{\partial u}{\partial x}$$$$\frac{\partial f}{\partial x}=\cos(u)*2x$$
which is the same, however, using a computational graph the process is much simpler.
Generalization
Consider the following graph;
graph LR A@{shape: circ, label: "A"} --> B@{shape: rect, label: ""} --> C@{shape: circ, label: "B"} --> D@{shape: rect, label: ""} --> E@{shape: circ, label: "F"}
Differentiation is the ratio of change in output with respect to the change in input. From the graph the output nodes are: B, F, while the input nodes are A, B.
Note, from part I of the article, input/output are depicted as circular nodes.
Starting off from the right;
- change in output F with respect to input B is; $$ \frac{\partial F}{\partial B}=b$$
The results denoted as ‘b’.
- change in output B with respect to input A would be; $$ \frac{\partial B}{\partial A}=a$$
The results denoted as ‘a’.
- and finally the change in the entire system/process would be; $$ \frac{\partial F}{\partial A}=R$$
The change in the entire system can also be derived by from the small changes in the system, that is;
$$\frac{\partial F}{\partial A}=\frac{\partial F}{\partial B}*\frac{\partial B}{\partial A}$$$$\frac{\partial F}{\partial A}=b*a$$$$R=b*a$$And that’s it to Automatic Differentiation.
Lets consider another example before we can generalize;
graph LR A@{shape: circ, label: "A"} --> B@{shape: rect, label: "."} --> C@{shape: circ, label: "B"} C --> D@{shape: rect, label: "."} --> E@{shape: circ, label: "C"} --> F@{shape: rect, label: "."} --> G@{shape: circ, label: "F"}
Starting off from the right;
- Change in output F with respect to input C;
- Change in output C with respect to input B;
- Change output B with respect to input A;
- finally, change in the entire system;
In general;
graph LR A@{shape: circ, label: "A"} --> B@{shape: rect, label: ""} --> C@{shape: circ, label: "B"} C --> D@{shape: rect, label: ""} --> E@{shape: circ, label: "C"} --> F@{shape: rect, label: "Black box of nodes and edges"} --> G@{shape: circ, label: "F"}
Lets assume at node C, we know the change in output F with respect to input C is E, that is;
$$ \frac{\partial F}{\partial C}=E$$Can we calculate the change in the entire system, $\displaystyle \frac{\partial F}{\partial A}$ ?
The answer is YES.
All we do, is calculate the changes in our sub-system, that is A to C then multiply with the changes of the sub-system between C and F.
For our graph above, if we know that;
$$\frac{\partial C}{\partial B}=b$$and
$$\frac{\partial B}{\partial A}=a$$then
the changes of final output F with respect to A would be;
$$\frac{\partial F}{\partial A}=\frac{\partial F}{\partial C}*\frac{\partial C}{\partial B}*\frac{\partial B}{\partial A}$$$$\frac{\partial F}{\partial A}=E*b*a$$Example 2:
This is an AD practise example:
Given the expression $\displaystyle f=\ln(x_{1})+x_{1}x_{2}-\sin(x_{2})$ turn it into a computational graph and differentiate with respect to $\displaystyle x_{1}$ and $\displaystyle x_{2}$;
solution:
Using our steps to convert the expression into a computational graph;
Identify the inputs:
$\displaystyle x_{1},\quad x_{2}, -1$
Identify the outputs;
$\displaystyle f$
Identify the operators;
$\displaystyle +$, $\displaystyle \ln(.)$, $\displaystyle *$, $\displaystyle \sin(.)$
Applying our rules, we obtain the graph;
graph LR A@{shape: circ, label: "x1"} --> B@{shape: rect, label: "ln(.)"} --> C@{shape: circ, label: "A"} D@{shape: circ, label: "x2"} --> E@{shape: rect, label: "×"} --> F@{shape: circ, label: "B"} A --> E D --> G@{shape: rect, label: "sin(.)"} --> H@{shape: circ, label: "C"} C --> J@{shape: rect, label: "+"} --> K@{shape: circ, label: "E"} F --> J H --> L@{shape: rect, label: "×"} --> M@{shape: circ, label: "D"} N@{shape: circ, label: "-1"} --> L M --> P@{shape: rect, label: "+"} --> Q@{shape: circ, label: "F"} K --> P
- Compute our sub-expressions starting off from the right of the graph;
- Differentiating the sub-expressions;
The total change with respect to $\displaystyle x_{1}$ is;
- Paths: F-E-A-x1 and F-E-B-x1;
The total change with respect to $\displaystyle x_{2}$ is;
- Paths: F-E-B-x2 and F-D-C-x2
- Plugging in the values we obtain;
Conclusion
In this article we have learned how to use a computational graph to perform auto-differentiation and observed the simplicity of the technique. In the original paper presenting Auto-Differentiation, they laid out 2 ways to perform auto-differentiation;
- Forward Mode
- Reverse Mode
What we have learned is the reverse mode technique.
In the final article, we will apply the techniques learned so far to an important algorithm in machine learning referred to as the Backward Propagation.