Algorithmic Differentiation
This section introduces the mathematics behind AD. Even if you have worked with AD before, we recommend reading this section in order to acclimatise yourself to the perspective that Mooncake.jl takes on the subject.
Derivatives
A foundation on which all of AD is built is the derivative – we require a fairly general definition of it, which we build up to here.
Scalar-to-Scalar Functions
Consider first $f : \RR \to \RR$, which we require to be differentiable at $x \in \RR$. Its derivative at $x$ is usually thought of as the scalar $\alpha \in \RR$ such that
\[\text{d}f = \alpha \, \text{d}x .\]
Loosely speaking, by this notation we mean that for arbitrarily small changes $\text{d} x$ in the input to $f$, the change in the output $\text{d} f$ is $\alpha \, \text{d}x$. We refer readers to the first few minutes of the first lecture mentioned before for a more careful explanation.
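To make this concrete, here is a minimal numerical sketch (our own illustration; the choice of $f = \sin$, the step size, and the tolerance are arbitrary):

```julia
# A sketch: for f = sin, the derivative at x is α = cos(x), and df ≈ α dx for small dx.
f(x) = sin(x)
x = 1.3
α = cos(x)                 # derivative of sin at x, known in closed form
dx = 1e-6                  # a small change in the input
df = f(x + dx) - f(x)      # the resulting change in the output
println(isapprox(df, α * dx; rtol=1e-4))  # true
```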
Vector-to-Vector Functions
The generalisation of this to Euclidean space should be familiar: if $f : \RR^P \to \RR^Q$ is differentiable at a point $x \in \RR^P$, then the derivative of $f$ at $x$ is given by the Jacobian matrix at $x$, denoted $J[x] \in \RR^{Q \times P}$, such that
\[\text{d}f = J[x] \, \text{d}x .\]
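The same check works in the vector case. The following sketch uses an arbitrary example function and its hand-derived Jacobian (both our own, purely for illustration):

```julia
# A sketch: check that df ≈ J[x] dx for a function f : ℝ² → ℝ².
f(x) = [x[1]^2 + x[2], 3x[1]*x[2]]
J(x) = [2x[1]  1.0;
        3x[2]  3x[1]]          # Jacobian of f at x, derived by hand
x = [1.0, 2.0]
dx = 1e-6 .* [0.3, -0.5]       # a small change in the input
df = f(x .+ dx) .- f(x)
println(isapprox(df, J(x) * dx; rtol=1e-4))  # true
```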
It is possible to stop here, as all the functions we shall need to consider can in principle be written as functions on some subset of $\RR^P$.
However, when we consider differentiating computer programmes, we have to deal with complicated nested data structures, e.g. structs inside Tuples inside Vectors, etc. While all of these data structures can be mapped onto a flat vector in order to make sense of the Jacobian of a computer programme, this becomes very inconvenient very quickly. To see the problem, consider the Julia function whose input is of type Tuple{Tuple{Float64, Vector{Float64}}, Vector{Float64}, Float64} and whose output is of type Tuple{Vector{Float64}, Float64}. What kind of object might be used to represent the derivative of a function mapping between these two spaces? We could certainly treat these as structured "views" into "flat" Vector{Float64}s, and then define a Jacobian, but actually finding this mapping is a tedious exercise, even if it quite obviously exists.
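To see the inconvenience concretely, here is a hypothetical function g with exactly the signature above, together with the sort of manual bookkeeping that "flattening" its input requires (both g and flatten_input are our own illustrations, not part of Mooncake.jl):

```julia
# Hypothetical function with input type Tuple{Tuple{Float64, Vector{Float64}}, Vector{Float64}, Float64}
# and output type Tuple{Vector{Float64}, Float64}.
g(((a, v), w, b)) = (a .* v .+ w, a + b)

# Mapping the nested input onto a flat Vector{Float64} requires manual bookkeeping,
# and one must remember where each component lives in order to interpret a Jacobian.
flatten_input(((a, v), w, b)) = vcat(a, v, w, b)

x = ((2.0, [1.0, 2.0]), [3.0, 4.0], 5.0)
println(g(x))              # ([5.0, 8.0], 7.0)
println(flatten_input(x))  # [2.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```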
In fact, a more general formulation of the derivative is used all the time in the context of AD – the matrix calculus discussed by [1] and [2] (to name a couple) makes use of a generalised form of the derivative in order to work with functions which map to and from matrices (albeit with slight differences in naming conventions from text to text), without needing to "flatten" them into vectors in order to make sense of them.
In general, it will be much easier to avoid "flattening" operations wherever possible. In order to do so, we now introduce a generalised notion of the derivative.
Functions Between More General Spaces
In order to avoid the difficulties described above, we consider functions $f : \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are finite-dimensional real Hilbert spaces (read: finite-dimensional vector spaces with an inner product and real-valued scalars). This definition includes functions to / from $\RR$ and $\RR^D$, but also real-valued matrices, and any other "container" for collections of real numbers. Furthermore, we shall see later how we can model all sorts of structured representations of data directly as such spaces.
For such spaces, the derivative of $f$ at $x \in \mathcal{X}$ is the linear operator (read: linear function) $D f [x] : \mathcal{X} \to \mathcal{Y}$ satisfying
\[\text{d}f = D f [x] \, (\text{d} x)\]
The purpose of this linear operator is to provide a linear approximation to $f$ which is accurate for arguments which are very close to $x$.
Please note that $D f [x]$ is a single mathematical object, despite the fact that 3 separate symbols are used to denote it – $D f [x] (\dot{x})$ denotes the application of the function $D f [x]$ to argument $\dot{x}$. Furthermore, the dot-notation ($\dot{x}$) does not have anything to do with time-derivatives, it is simply common notation used in the AD literature to denote the arguments of derivatives.
So, instead of thinking of the derivative as a number or a matrix, we think about it as a function. We can express the previous notions of the derivative in this language.
In the scalar case, rather than thinking of the derivative as being $\alpha$, we think of it as the linear operator $D f [x] (\dot{x}) := \alpha \dot{x}$. Put differently, rather than thinking of the derivative as the slope of the tangent to $f$ at $x$, think of it as the function describing the tangent itself. Observe that up until now we had only considered inputs to $D f [x]$ which were small ($\text{d} x$) – here we extend it to the entire space $\mathcal{X}$ and denote inputs in this space $\dot{x}$. Inputs $\dot{x}$ should be thought of as "directions", in the directional derivative sense (why this is true will be discussed later).
Similarly, if $\mathcal{X} = \RR^P$ and $\mathcal{Y} = \RR^Q$ then this operator can be specified in terms of the Jacobian matrix: $D f [x] (\dot{x}) := J[x] \dot{x}$ – brackets are used to emphasise that $D f [x]$ is a function, and is being applied to $\dot{x}$.[note_for_geometers]
To reiterate, for the rest of this document, we define the derivative to be "multiply by $\alpha$" or "multiply by $J[x]$", rather than to be $\alpha$ or $J[x]$. So whenever you see the word "derivative", you should think "linear function".
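In code, this "derivative as a function" perspective amounts to representing $D f [x]$ as a closure. A sketch (the example functions are our own, with hand-derived derivatives):

```julia
# Scalar case: D f [x] is "multiply by α", where α = cos(x) for f = sin.
D_sin(x) = ẋ -> cos(x) * ẋ

# Euclidean case: D f [x] is "multiply by J[x]", here for f(x) = [x₁x₂, x₁ + x₂].
f(x) = [x[1]*x[2], x[1] + x[2]]
D_f(x) = ẋ -> [x[2] x[1]; 1.0 1.0] * ẋ

println(D_sin(1.0)(0.5))              # cos(1.0) * 0.5
println(D_f([1.0, 2.0])([1.0, 0.0]))  # first column of the Jacobian: [2.0, 1.0]
```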
The Chain Rule
The chain rule is the result which makes AD work. Fortunately, it applies to this version of the derivative:
\[f = g \circ h \implies D f [x] = (D g [h(x)]) \circ (D h [x])\]
By induction, this extends to a collection of $N$ functions $f_1, \dots, f_N$:
\[f := f_N \circ \dots \circ f_1 \implies D f [x] = (D f_N [x_N]) \circ \dots \circ (D f_1 [x_1]),\]
where $x_{n+1} := f_n(x_n)$, and $x_1 := x$.
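With derivatives represented as functions, the chain rule says that they simply compose. A sketch with hand-written derivatives (example functions our own):

```julia
# f = g ∘ h, so D f [x] = (D g [h(x)]) ∘ (D h [x]).
h(x) = x^2
g(y) = sin(y)
D_h(x) = ẋ -> 2x * ẋ           # derivative of h at x
D_g(y) = ẏ -> cos(y) * ẏ       # derivative of g at y
D_f(x) = D_g(h(x)) ∘ D_h(x)    # chain rule: compose the linear operators

x, ẋ = 1.2, 1.0
println(D_f(x)(ẋ))        # cos(x^2) * 2x
println(cos(x^2) * 2x)    # the same thing, computed by hand
```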
An aside: the definition of the Frechet Derivative
This definition of the derivative has a name: the Frechet derivative. It is a generalisation of the Total Derivative. Formally, we say that a function $f : \mathcal{X} \to \mathcal{Y}$ is differentiable at a point $x \in \mathcal{X}$ if there exists a linear operator $D f [x] : \mathcal{X} \to \mathcal{Y}$ (the derivative) satisfying
\[\lim_{\text{d} h \to 0} \frac{\| f(x + \text{d} h) - f(x) - D f [x] (\text{d} h) \|_\mathcal{Y}}{\| \text{d}h \|_\mathcal{X}} = 0,\]
where $\| \cdot \|_\mathcal{X}$ and $\| \cdot \|_\mathcal{Y}$ are the norms associated to Hilbert spaces $\mathcal{X}$ and $\mathcal{Y}$ respectively. It is a good idea to consider what this looks like when $\mathcal{X} = \mathcal{Y} = \RR$ and when $\mathcal{X} = \mathcal{Y} = \RR^D$. It is sometimes helpful to refer to this definition to e.g. verify the correctness of the derivative of a function – as with single-variable calculus, however, this is rare.
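It can be instructive to watch this limit numerically. A sketch, using an arbitrary example function and a hand-derived candidate for $D f [x]$ (both our own):

```julia
using LinearAlgebra  # for norm

f(x) = [x[1]^2, x[1]*x[2]]
Df(x) = dh -> [2x[1] 0.0; x[2] x[1]] * dh   # candidate derivative: Jacobian times dh

x = [1.0, 2.0]
for ε in (1e-1, 1e-2, 1e-3, 1e-4)
    dh = ε .* [1.0, -1.0]
    ratio = norm(f(x .+ dh) .- f(x) .- Df(x)(dh)) / norm(dh)
    println(ratio)  # shrinks roughly linearly in ε, consistent with the limit being 0
end
```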
Another aside: what does Forwards-Mode AD compute?
At this point we have enough machinery to discuss forwards-mode AD. Expressed in the language of linear operators and Hilbert spaces, the goal of forwards-mode AD is the following: given a function $f$ which is differentiable at a point $x$, compute $D f [x] (\dot{x})$ for a given vector $\dot{x}$. If $f : \RR^P \to \RR^Q$, this is equivalent to computing $J[x] \dot{x}$, where $J[x]$ is the Jacobian of $f$ at $x$. For the interested reader we provide a high-level explanation of how forwards-mode AD does this in How does Forwards-Mode AD work?.
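One classical way to propagate $\dot{x}$ alongside $x$ is with dual numbers. The following is a minimal sketch of that idea, not Mooncake's actual implementation:

```julia
# A dual number carries a primal value and the tangent (directional derivative) alongside it.
struct Dual
    x::Float64   # primal value
    ẋ::Float64   # tangent
end
Base.:+(a::Dual, b::Dual) = Dual(a.x + b.x, a.ẋ + b.ẋ)
Base.:*(a::Dual, b::Dual) = Dual(a.x * b.x, a.ẋ * b.x + a.x * b.ẋ)  # product rule
Base.sin(a::Dual) = Dual(sin(a.x), cos(a.x) * a.ẋ)                  # chain rule

f(x) = sin(x * x)
d = f(Dual(1.2, 1.0))   # propagate the tangent ẋ = 1.0 through f
println(d.ẋ)            # cos(1.2^2) * 2 * 1.2, i.e. D f [x] (ẋ)
```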
Another aside: notation
You may have noticed that we typically denote the argument to a derivative with a "dot" over it, e.g. $\dot{x}$. This is something that we will do consistently, and we will use the same notation for the outputs of derivatives. Wherever you see a symbol with a "dot" over it, expect it to be an input or output of a derivative / forwards-mode AD.
Reverse-Mode AD: what does it do?
In order to explain what reverse-mode AD does, we first consider the "vector-Jacobian product" definition in Euclidean space which will be familiar to many readers. We then generalise.
Reverse-Mode AD: what does it do in Euclidean space?
In this setting, the goal of reverse-mode AD is the following: given a function $f : \RR^P \to \RR^Q$ which is differentiable at $x \in \RR^P$ with Jacobian $J[x]$ at $x$, compute $J[x]^\top \bar{y}$ for any $\bar{y} \in \RR^Q$. This is useful because when $Q = 1$, setting $\bar{y} = 1$ makes $J[x]^\top \bar{y}$ the gradient of $f$ at $x$.
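A sketch of this in code (the example function and hand-derived Jacobian are our own):

```julia
# For f : ℝ² → ℝ (so Q = 1), reverse-mode's target quantity is J[x]ᵀ ȳ,
# and setting ȳ = 1 recovers the gradient.
f(x) = x[1]^2 * x[2]
J(x) = [2x[1]*x[2]  x[1]^2]   # 1×2 Jacobian (a row), derived by hand

x = [3.0, 4.0]
ȳ = [1.0]
println(J(x)' * ȳ)   # [24.0, 9.0] — the gradient of f at x
```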
Adjoint Operators
In order to generalise this algorithm to work with linear operators, we must first generalise the idea of multiplying a vector by the transpose of the Jacobian. The relevant concept here is that of the adjoint operator. Specifically, the adjoint $A^\ast$ of linear operator $A$ is the linear operator satisfying
\[\langle A^\ast \bar{y}, \dot{x} \rangle = \langle \bar{y}, A \dot{x} \rangle,\]
where $\langle \cdot, \cdot \rangle$ denotes the inner product. The relationship between the adjoint and matrix transpose is: if $A (x) := J x$ for some matrix $J$, then $A^\ast (y) := J^\top y$.
Moreover, just as $(A B)^\top = B^\top A^\top$ when $A$ and $B$ are matrices, $(A B)^\ast = B^\ast A^\ast$ when $A$ and $B$ are linear operators. This result follows in short order from the definition of the adjoint operator (deriving it is a good exercise!).
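The defining property is easy to check numerically when $A$ is "multiply by $J$". A sketch (the matrix and vectors are arbitrary):

```julia
using LinearAlgebra  # for dot

# Check ⟨A*(ȳ), ẋ⟩ = ⟨ȳ, A(ẋ)⟩ when A is "multiply by J" and A* is "multiply by Jᵀ".
J = [1.0 2.0 3.0; 4.0 5.0 6.0]
A(ẋ) = J * ẋ
Aadj(ȳ) = J' * ȳ

ẋ = [0.1, 0.2, 0.3]
ȳ = [1.0, -1.0]
println(dot(Aadj(ȳ), ẋ) ≈ dot(ȳ, A(ẋ)))  # true
```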
Reverse-Mode AD: what does it do in general?
Equipped with adjoints, we can express reverse-mode AD only in terms of linear operators, dispensing with the need to express everything in terms of Jacobians. The goal of reverse-mode AD is as follows: given a differentiable function $f : \mathcal{X} \to \mathcal{Y}$, compute $D f [x]^\ast (\bar{y})$ for some $\bar{y}$.
Notation: $D f [x]^\ast$ denotes the single mathematical object which is the adjoint of $D f [x]$. It is a linear function from $\mathcal{Y}$ to $\mathcal{X}$. We may occasionally write it as $(D f [x])^\ast$ if there is some risk of confusion.
We will explain how reverse-mode AD goes about computing this after some worked examples.
Aside: Notation
You will have noticed that arguments to adjoints have thus far always had a "bar" over them, e.g. $\bar{y}$. This notation is common in the AD literature and will be used throughout. Additionally, this "bar" notation will be used for the outputs of adjoints of derivatives. So wherever you see a symbol with a "bar" over it, think "input or output of adjoint of derivative".
Some Worked Examples
We now present some worked examples in order to prime intuition, and to introduce the important classes of problems that will be encountered when doing AD in the Julia language. We will put all of these problems in a single general framework later on.
An Example with Matrix Calculus
We have introduced some mathematical abstraction in order to simplify the calculations involved in AD. To this end, we consider differentiating $f(X) := X^\top X$. Results for this and similar operations are given by [1]. A similar operation, but which maps from matrices to $\RR$, is discussed in Lecture 4 part 2 of the MIT course mentioned previously. Both [1] and Lecture 4 part 2 provide approaches to obtaining the derivative of this function.
Following either resource will yield the derivative:
\[D f [X] (\dot{X}) = \dot{X}^\top X + X^\top \dot{X}\]
Observe that this is indeed a linear operator (i.e. it is linear in its argument, $\dot{X}$). (You can always plug it into the definition of the Frechet derivative to confirm that it is indeed the derivative.)
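As a sanity check, one can compare this derivative against a directional finite difference (a sketch; the sizes, step, and tolerance are arbitrary choices):

```julia
using LinearAlgebra

# Check D f [X](Ẋ) = Ẋᵀ X + Xᵀ Ẋ against a finite difference, for f(X) = Xᵀ X.
f(X) = X' * X
Df(X, Ẋ) = Ẋ' * X + X' * Ẋ

X = randn(3, 2)
Ẋ = randn(3, 2)
ε = 1e-6
fd = (f(X .+ ε .* Ẋ) .- f(X)) ./ ε   # directional finite difference along Ẋ
println(isapprox(fd, Df(X, Ẋ); rtol=1e-4))  # true
```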
In order to perform reverse-mode AD, we need to find the adjoint operator. Using the usual definition of the inner product between matrices,
\[\langle X, Y \rangle := \textrm{tr} (X^\top Y)\]
we can rearrange the inner product as follows:
\[\begin{align}
\langle \bar{Y}, D f [X] (\dot{X}) \rangle &= \langle \bar{Y}, \dot{X}^\top X + X^\top \dot{X} \rangle \nonumber \\
&= \textrm{tr} (\bar{Y}^\top \dot{X}^\top X) + \textrm{tr}(\bar{Y}^\top X^\top \dot{X}) \nonumber \\
&= \textrm{tr} ( [X \bar{Y}^\top]^\top \dot{X}) + \textrm{tr}( [X \bar{Y}]^\top \dot{X}) \nonumber \\
&= \langle X \bar{Y}^\top + X \bar{Y}, \dot{X} \rangle, \nonumber
\end{align}\]
from which we conclude that the adjoint of the derivative is
\[D f [X]^\ast (\bar{Y}) = X \bar{Y}^\top + X \bar{Y} .\]
Reverse-Mode AD: why does it compute the gradient?
To see why the quantity computed by reverse-mode AD relates to the gradient, consider a differentiable function $g : \mathcal{X} \to \mathcal{Y}$ and the linear function $\mathcal{l}(y) := \langle \bar{y}, y \rangle$ for some fixed $\bar{y} \in \mathcal{Y}$. Since $\mathcal{l}$ is linear, its derivative is $D \mathcal{l} [y] (\dot{y}) = \langle \bar{y}, \dot{y} \rangle$. Recall also that the gradient $\nabla f (x)$ of a scalar-valued function $f$ is the unique vector satisfying $D f [x] (\dot{x}) = \langle \nabla f (x), \dot{x} \rangle$. Applying the chain rule and the definition of the adjoint to the composition $f := \mathcal{l} \circ g$ yields
\[\begin{align}
D f [x] (\dot{x}) &= [(D \mathcal{l} [g(x)]) \circ (D g [x])](\dot{x}) \nonumber \\
&= \langle \bar{y}, D g [x] (\dot{x}) \rangle \nonumber \\
&= \langle D g [x]^\ast (\bar{y}), \dot{x} \rangle, \nonumber
\end{align}\]
from which we conclude that $D g [x]^\ast (\bar{y})$ is the gradient of the composition $\mathcal{l} \circ g$ at $x$.
The consequence is that we can always view the computation performed by reverse-mode AD as computing the gradient of the composition of the function in question and an inner product with the argument to the adjoint.
The above shows that if $\mathcal{Y} = \RR$ and $g$ is the function we wish to compute the gradient of, we can simply set $\bar{y} = 1$ and compute $D g [x]^\ast (\bar{y})$ to obtain the gradient of $g$ at $x$.
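The adjoint derived in the matrix example above can likewise be checked numerically against the defining property of the adjoint. A sketch (sizes are arbitrary):

```julia
using LinearAlgebra  # for tr

# Verify D f [X]*(Ȳ) = X Ȳᵀ + X Ȳ against ⟨D f [X]*(Ȳ), Ẋ⟩ = ⟨Ȳ, D f [X](Ẋ)⟩,
# using the matrix inner product ⟨A, B⟩ = tr(Aᵀ B).
Df(X, Ẋ) = Ẋ' * X + X' * Ẋ
Dfadj(X, Ȳ) = X * Ȳ' + X * Ȳ

X, Ẋ = randn(3, 2), randn(3, 2)
Ȳ = randn(2, 2)   # f(X) = Xᵀ X is 2×2, so Ȳ lives in the output space
println(tr(Dfadj(X, Ȳ)' * Ẋ) ≈ tr(Ȳ' * Df(X, Ẋ)))  # true
```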
Summary
This document explains the core mathematical foundations of AD. It explains separately what it does, and how it goes about it. Some basic examples are given which show how these mathematical foundations can be applied to differentiate functions of matrices, and Julia functions.
Subsequent sections will build on these foundations, to provide a more general explanation of what AD looks like for a Julia programme.
Asides
How does Forwards-Mode AD work?
Forwards-mode AD achieves this by breaking down $f$ into the composition $f = f_N \circ \dots \circ f_1$, where each $f_n$ is a simple function whose derivative (function) $D f_n [x_n]$ we know for any given $x_n$. By the chain rule, we have that
\[D f [x] (\dot{x}) = D f_N [x_N] \circ \dots \circ D f_1 [x_1] (\dot{x})\]
which suggests the following algorithm:
1. let $x_1 = x$, $\dot{x}_1 = \dot{x}$, and $n = 1$
2. let $\dot{x}_{n+1} = D f_n [x_n] (\dot{x}_n)$
3. let $x_{n+1} = f_n(x_n)$
4. let $n = n + 1$
5. if $n = N+1$ then return $\dot{x}_{N+1}$, otherwise go to 2.
When each function $f_n$ maps between Euclidean spaces, the applications of derivatives $D f_n [x_n] (\dot{x}_n)$ are given by $J_n \dot{x}_n$ where $J_n$ is the Jacobian of $f_n$ at $x_n$.
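The algorithm above is short enough to state directly in code. A sketch, where fs and Dfs (hand-written derivative operators, with Dfs[n](xₙ) returning the linear map $\dot{x} \mapsto D f_n [x_n](\dot{x})$) are our own illustrative choices:

```julia
# Forwards-mode AD for a chain f = fs[end] ∘ ... ∘ fs[1] of simple functions.
fs  = [x -> x^2, x -> sin(x), x -> 3x]
Dfs = [x -> (ẋ -> 2x * ẋ), x -> (ẋ -> cos(x) * ẋ), x -> (ẋ -> 3ẋ)]

function forwards_mode(fs, Dfs, x, ẋ)
    for n in 1:length(fs)
        ẋ = Dfs[n](x)(ẋ)   # push the tangent forwards: ẋₙ₊₁ = D fₙ [xₙ] (ẋₙ)
        x = fs[n](x)       # advance the primal:        xₙ₊₁ = fₙ(xₙ)
    end
    return ẋ
end

x, ẋ = 0.7, 1.0
println(forwards_mode(fs, Dfs, x, ẋ))  # 3 * cos(0.7^2) * 2 * 0.7
```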
- [1]
- M. Giles. An extended collection of matrix derivative results for forward and reverse mode automatic differentiation. Unpublished (2008).
- [2]
- T. P. Minka. Old and new matrix algebra useful for statistics. See www.stat.cmu.edu/minka/papers/matrix.html (2000).
- note_for_geometers: In AD we only really need to discuss differentiable functions between vector spaces that are isomorphic to Euclidean space. Consequently, a variety of considerations which are usually required in differential geometry are not required here. Notably, the tangent space is assumed to be the same everywhere, and to be the same as the domain of the function. Avoiding these additional considerations helps keep the mathematics as simple as possible.