## Disclaimer on these notes

$\newcommand{\R}{\mathbb{R}}$ $\newcommand{\C}{\mathbb{C}}$ $\newcommand{\N}{\mathbb{N}}$ $\newcommand{\Z}{\mathbb{Z}}$

## Analysis

Analysis is the careful building up of the concepts needed to talk about infinitesimal change. Calculus can often be used without diving into the analytic details, but knowing them is frequently useful.

### Sequence

A function $S : \N\to B$ for some set $B$, such as $\R$ or $\C^n$.

### Series

A sequence $T$ of the form $T(n) = \sum_{i=1}^{n}S(i)$ for some sequence $S$. Intuitively: the $n$th term of $T$ is the sum of the first $n$ terms of $S$.
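As a small illustration of the definition, here is a sketch that builds the partial sums $T$ from a finite prefix of a sequence $S$. The choice $S(i)=1/2^i$ and the helper name `partial_sums` are assumptions made for the example only.

```python
from itertools import accumulate

def partial_sums(terms):
    """Return [T(1), T(2), ...] where T(n) is the sum of the first n terms."""
    return list(accumulate(terms))

# Illustrative sequence: S(i) = 1 / 2^i for i = 1..10
S = [1 / 2**i for i in range(1, 11)]
T = partial_sums(S)

# The n-th partial sum of this geometric sequence is 1 - 2^(-n)
print(T[0], T[1], T[-1])
```

The partial sums approach $1$, which is the value the series converges to.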

### Cauchy Sequences, Convergent Sequences, and Completeness

A sequence $S$ in a metric space with metric $d$ is *Cauchy* if $\forall \epsilon>0:\exists k: \forall n,m > k: d(S(n),S(m))<\epsilon$.

A sequence $S$ converges to a point $p$ if $\forall\epsilon>0:\exists k: \forall n>k: d(p,S(n))<\epsilon$.

A space is complete if every Cauchy sequence in it converges to a point of the space. The reals, for example, are complete, while the rationals are not. The spaces in question can be more abstract: for example, the space of Lebesgue integrable functions is complete with respect to the metric $d(f,g)=\int|f-g|$ (identifying functions that agree almost everywhere).

### Continuity:

A function $f: A\to B$ is continuous at a point $x\in A$ if:

$$\forall \epsilon >0: \exists \delta>0: \forall y\in A: (d(x,y)<\delta)\to (d(f(x),f(y))<\epsilon)$$

**Equivalently** (when $x$ is not an isolated point of $A$): a function $f$ is continuous at a point $x$ if $\lim_{y\to x}f(y)$ exists and equals $f(x)$.

A function is uniformly continuous if the following holds. This is stronger than being continuous at every point, because a single $\delta$ must work for all $x$ at once:

$$\forall \epsilon >0: \exists \delta>0: \forall x,y\in A: (d(x,y)<\delta)\to (d(f(x),f(y))<\epsilon)$$
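To make the difference from pointwise continuity concrete, here is a minimal sketch using $f(x)=x^2$ on $\R$, which is continuous everywhere but not uniformly continuous. The function name `gap` and the sample points are assumptions chosen for illustration.

```python
def gap(f, x, delta):
    """|f(y) - f(x)| for the point y = x + delta/2, which satisfies d(x, y) < delta."""
    return abs(f(x + delta / 2) - f(x))

def square(x):
    return x * x

delta = 0.01
# Same delta, points further and further out: the gap keeps growing.
gaps = [gap(square, x, delta) for x in (1, 100, 10000)]
print(gaps)
```

For a fixed $\delta$ the gap grows roughly like $x\delta$, so no single $\delta$ can keep it below a given $\epsilon$ for all $x$.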

### Lipschitz Functions

Uniformly Lipschitz:

$\exists k > 0 : \forall x\forall y : d(f(x),f(y)) \leq k\cdot d(x,y) $

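This implies uniform continuity (take $\delta=\epsilon/k$).

A locally Lipschitz function satisfies the Lipschitz condition on every compact set, possibly with a different constant $k$ for each set. That is, first choose any compact set $K$, and then require the inequality above only for $x$ and $y$ in $K$.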

A continuously differentiable function $f:U\to\R^n$ is locally Lipschitz. To prove this, we want to show that around any point $p\in U$, there is a compact neighborhood on which $f$ is uniformly Lipschitz.

Strategy: let $S\subset U$ be a closed ball around $p$. Since $Df$ is continuous (by assumption) and continuous functions on compact sets are bounded, $Df$ is bounded on $S$, i.e. $\sup_{x\in S}\|Df(x)\|_{op} = \sup_{x\in S}\sup_{h\neq 0} \frac{\|Df(x)(h)\|}{\|h\|} \leq K$. Then we define a function that interpolates between two points $x$ and $y$ in $S$, namely $z(t) = x + t(y-x)$. Note that $z(0)=x$ and $z(1)=y$. Finally, a ball is convex, so $z(t)\in S$ for all $t\in[0,1]$.

Here’s the calculation that yields the result:

$$\|f(y) - f(x)\| = \left\|\int_0^1 \frac{d}{ds}f(z(s))\,ds\right\| = \left\|\int_0^1 Df(z(s))(y-x)\,ds\right\| \leq \int_0^1 \|Df(z(s))\|_{op}\,\|y-x\|\,ds \leq K\|y-x\|$$

The first step uses the fundamental theorem of calculus, and the second uses the chain rule (with $z'(s)=y-x$). The third uses that the norm of an integral is at most the integral of the norm, together with the operator norm property $\|Av\| \leq \|A\|_{op}\|v\|$. The last step uses the bound $K$ on the operator norm of $Df$ over $S$.
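The bound can be checked numerically in a simple one-dimensional case. The sketch below assumes $f(x)=x^2$ on the compact interval $[0,2]$, where $\sup|f'(x)|=4$ plays the role of $K$; the grid and the helper name `lipschitz_ratio` are illustrative choices.

```python
def lipschitz_ratio(f, points):
    """Largest observed |f(y) - f(x)| / |y - x| over all distinct pairs of points."""
    return max(
        abs(f(y) - f(x)) / abs(y - x)
        for x in points
        for y in points
        if x != y
    )

# Grid on the compact interval [0, 2]
points = [i / 50 for i in range(101)]
ratio = lipschitz_ratio(lambda x: x * x, points)

K = 4.0  # sup of |f'(x)| = |2x| on [0, 2]
print(ratio, K)
```

The observed ratio stays below $K$, as the calculation predicts, and gets close to it for pairs of points near the right end of the interval.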

### Contraction Mapping Principle

One of a class of fixed point theorems. It states that if a function $f:D\to D$ is a contraction on a nonempty complete metric space $D$, in the sense that for some $0 < q < 1$ and all $x,y\in D$:
$$
d(f(x),f(y)) \leq q\cdot d(x,y)
$$
then $f$ has a unique fixed point, i.e. $\exists x: f(x)=x$. **Note**: the use of $q$ is crucial. This is **not** equivalent to the property
$$
d(f(x),f(y)) < d(x,y)
$$
because then the ratio $d(f(x),f(y))/d(x,y)$ could get arbitrarily close to $1$, and a fixed point may fail to exist. For example, $f(x)=x+1/x$ on $[1,\infty)$ satisfies the weaker property but has no fixed point.

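The proof is nice and constructive. We simply consider the sequence $x_n=f(x_{n-1})$, with $x_0$ chosen arbitrarily. If $\lim_{n\to\infty}x_n$ exists, it must be a fixed point, because $f$ is continuous. The limit exists if the sequence is Cauchy (by completeness), and the sequence is Cauchy because $d(x_{n+1},x_n)\leq q^n d(x_1,x_0)$, so the distances between terms are bounded by the tail of a convergent geometric series.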
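The constructive proof translates directly into an iteration. As a minimal sketch, assume $f=\cos$ on $[0,1]$: it maps the interval into itself and $|f'(x)|\leq\sin(1)<1$ there, so it is a contraction. The helper name `fixed_point`, the tolerance and the starting point are assumptions for the example.

```python
import math

def fixed_point(f, x0, tol=1e-12, max_iter=1000):
    """Iterate x_n = f(x_{n-1}) until successive terms differ by less than tol."""
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("iteration did not converge")

x_star = fixed_point(math.cos, 0.5)
print(x_star, math.cos(x_star))
```

Any starting point in $[0,1]$ leads to the same limit, which reflects the uniqueness of the fixed point.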

### Arzela-Ascoli:

Good example of fiddly analysis result.

**Statement**: every equicontinuous, uniformly bounded sequence of functions $x_k: J\to \R^n$, for $J$ compact, has a uniformly convergent subsequence.

**Proof**: Let $A$ be a countable dense subset of $J$, enumerated as $t_1, t_2, \ldots$. Since the values $x_k(t_1)$ are bounded, by Bolzano-Weierstrass we can find a subsequence $x_{k_{(l,1)}}$ (varying $l$) that converges at $t_1$. Similarly, we can find a subsequence of *that* sequence, $x_{k_{(l,2)}}$, that also converges at $t_2$, and so on. Diagonalizing, $x_{k_{(l,l)}}$ must converge at every $t\in A$.

Now we want to show that $x_{k_{(l,l)}}$ converges uniformly on $J$. Fix $\epsilon>0$ and let $\delta>0$ be given by equicontinuity, so that $|x_k(s)-x_k(y)|<\epsilon$ for all $k$ whenever $|s-y|<\delta$. We first consider a finite set $E$ of points in $A$ such that the open intervals of radius $\delta$ around them have a union containing $J$. We can do this because $J$ is compact and $A$ is dense.

Since $E$ is finite, and $x_{k_{(l,l)}}$ at any $s\in E$ is convergent hence Cauchy, we can pick $L$ such that for all $l,l' > L$: $\max_{s\in E} |x_{k_{(l,l)}}(s)-x_{k_{(l',l')}}(s)| < \epsilon$.

Then, for any $y\in J$, pick $s\in E$ with $|s-y|<\delta$, so that $|x_{k_{(l,l)}}(y)-x_{k_{(l',l')}}(y)| \leq |x_{k_{(l,l)}}(y)-x_{k_{(l,l)}}(s)| + |x_{k_{(l,l)}}(s)-x_{k_{(l',l')}}(s)| + |x_{k_{(l',l')}}(s)-x_{k_{(l',l')}}(y)|$. Each term on the right is less than $\epsilon$: the first and last by equicontinuity, and the middle one by the choice of $L$. So the diagonal sequence is uniformly Cauchy, and hence converges uniformly.

## Calculus

Calculus is the tool used to understand continuous change. Since the world is full of things which are either continuous, or approximately continuous, it is pretty much the central tool of applied maths.

### Derivatives

### Little o-notation

The statement $f(x)=o(g(x))$ as $x\to a$ is a shorthand for: $\lim_{x\to a}\frac{f(x)}{g(x)}=0$. Here $a$ can be $\infty$, but for discussing derivatives the relevant case is $a=0$, where this is very handy notation.

For example, consider $f(x)=x^2$, and

$$\frac{d}{dx}f(x)=\lim_{h\to0}\frac{f(x+h)-f(x)}{h}=\lim_{h\to0}\frac{(x+h)^2-x^2}{h}=\lim_{h\to0}\frac{2xh+h^2}{h}$$

$$=\lim_{h\to0}\frac{2xh}{h}+\lim_{h\to0}\frac{h^2}{h}=\lim_{h\to0}2x+\lim_{h\to0}h = 2x+0=2x$$

Note that $h^2$ in the numerator is not a linear function of $h$. In particular, using little-o notation, we can say that $h^2$ is $o(h)$ as $h\to0$ (as is $h^n$ for any $n\geq2$), and as such it drops out.

This pattern crops up all the time: to find the derivative of a function, you can drop all terms that are $o(h)$. In particular, it’s used in the context of Taylor series expansions, e.g. you could write $e^x=1+x+o(x)$ as $x\to0$.
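The way the higher-order term drops out can also be seen numerically. In this sketch (the helper name `forward_difference` and the point $x=3$ are assumptions for illustration), the difference quotient of $x^2$ differs from $2x$ by exactly the leftover $h^2/h=h$.

```python
def forward_difference(f, x, h):
    """Difference quotient (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

steps = (1e-1, 1e-2, 1e-3)
# For f(x) = x^2 at x = 3 the exact derivative is 6; the error should equal h.
errors = [forward_difference(lambda x: x * x, 3.0, h) - 6.0 for h in steps]
print(errors)
```

The error shrinks in proportion to $h$, which is what it means for the $h^2$ term to vanish in the limit.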

### Taylor Series

A power series representation of $f(x)$ takes the form $\sum_{i=0}^{\infty} c_i(x-a)^i$. What’s cool is that if such a series exists for an infinitely differentiable function $f$, we can recover the “basis coefficients” $c_i$ by noting that: (1) $f(a) = c_0$ and (2) $f'(x) = \sum_{i=1}^{\infty} ic_i(x-a)^{i-1}$, so that $c_1 = f'(a)$. Continuing in this vein with $i$th derivatives, we find that $c_i = \frac{f^{(i)}(a)}{i!}$. Substituting these values for $c_i$ into the series gives us the Taylor series.

$$
T : (\R\to\R)\to\R\to(\R\to\R)
$$
$$
T(f)(t_0)(t) = \sum_{i=0}^{\infty}\frac{f^{(i)}(t_0)}{i!}(t-t_0)^i
$$
The intuition is that for a point $t_0$, we’re approximating $f$ locally by a sum (which when infinite may converge to $f$ exactly) built from the derivatives of $f$ at that point. This makes sense: the derivatives give you local context: first derivative is what the function is doing nearby, second derivative is what the first derivative is doing nearby, etc.

A natural extension to $\C$, $\R^n$ and $\C^n$ exists. The Taylor series up to term $n$ is the $n$th order approximation of $f$, which is a polynomial. Write it as $T_n(f)(t_0)$.

It is often useful to ask about the *remainder* $R_n(f)(t_0)(t)=f(t)-T_n(f)(t_0)(t)$.

Taylor’s theorem states that $R_n(f)(t_0)(t)=o(|t-t_0|^n)$ as $t\to t_0$. In other words, for $t$ sufficiently close to $t_0$, the error shrinks faster than $|t-t_0|^n$.

We can also give explicit formulas for the remainder (assuming $f$ is $n+1$ times continuously differentiable), such as the integral form:

$$ R_n(f)(t_0)(t)=\int_{t_0}^t\frac{f^{(n+1)}(s)}{n!}(t-s)^{n}ds $$

From this, the mean value theorem for integrals gives us that:

$$ \exists t': R_n(f)(t_0)(t) = \frac{f^{(n+1)}(t')}{(n+1)!}(t-t_0)^{n+1} $$

Here $t'$ lies between $t_0$ and $t$.
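As a sanity check on the remainder, here is a minimal sketch for $f=\exp$, chosen because every derivative of $\exp$ is $\exp$ itself. The helper name `taylor_exp` and the values of $t_0$, $t$ and $n$ are assumptions for the example. The mean value form of the remainder gives the bound $|R_n|\leq e^{t}\,(t-t_0)^{n+1}/(n+1)!$ when $t>t_0$.

```python
import math

def taylor_exp(t, t0, n):
    """n-th order Taylor polynomial of exp around t0, evaluated at t."""
    # Every derivative of exp at t0 equals exp(t0).
    return sum(math.exp(t0) * (t - t0) ** i / math.factorial(i) for i in range(n + 1))

t0, t, n = 0.0, 0.5, 4
remainder = math.exp(t) - taylor_exp(t, t0, n)

# Bound from the mean value form: |f^(n+1)(t')| <= exp(t) for t' in [t0, t].
bound = math.exp(t) * (t - t0) ** (n + 1) / math.factorial(n + 1)
print(remainder, bound)
```

The actual remainder is positive and sits below the bound, consistent with the formula.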

### Integrals

An integral is like a continuous analog of a sum, where the summands (the things you sum together) are infinitesimally small and infinitely many. In fact, this is pretty much how the Riemann integral is defined. There are also other ways of defining integrals which are more general, like the Lebesgue integral.

### Fundamental Theorem of Calculus and Leibniz’ Rule

$$f(x)=\frac{d}{dx}\int_a^xf(s)ds$$

In other words, for continuous $f$, integration, as a way of measuring the area under a curve, is the inverse of differentiation.

Now suppose that we want to take the derivative of an integral with respect to a variable inside the integrand. For example, suppose $\phi(y)=\int_a^bf(s,y)ds$. Then, provided $f$ and $\frac{\partial f}{\partial y}$ are continuous, $\frac{d}{dy}\phi(y)=\frac{d}{dy}\int_a^bf(s,y)ds=\int_a^b\frac{\partial}{\partial y}f(s,y)ds$.

But now suppose that $a$ and $b$ are functions of $y$. Write $\Phi(y,a,b)=\int_{a}^{b}f(s,y)ds$, so that $\phi(y)=\Phi(y,a(y),b(y))$. Then, using the multivariable chain rule, we have that $\frac{d}{dy}\phi(y)=\frac{\partial \Phi}{\partial y}+\frac{\partial \Phi}{\partial a}\frac{da(y)}{dy}+\frac{\partial \Phi}{\partial b}\frac{db(y)}{dy}$, with the partial derivatives evaluated at $(y,a(y),b(y))$.

Using the fundamental theorem of calculus, as above, and flipping the limits of the integral to differentiate with respect to the lower limit (which produces the minus sign), we get $\frac{d}{dy}\phi(y)=\int_{a(y)}^{b(y)}\frac{\partial}{\partial y}f(s,y)ds-f(a(y),y)a'(y)+f(b(y),y)b'(y)$. This is Leibniz’ rule.
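Leibniz’ rule can be checked numerically. The sketch below assumes the illustrative choices $f(s,y)=sy$, $a(y)=y$ and $b(y)=y^2$, for which $\phi(y)=(y^5-y^3)/2$ and so $\phi'(2)=34$. The midpoint-rule helper `integrate` is also an assumption of the example, not part of the notes.

```python
def integrate(g, lo, hi, n=2000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    width = (hi - lo) / n
    return width * sum(g(lo + (i + 0.5) * width) for i in range(n))

def f(s, y):          # integrand f(s, y) = s * y
    return s * y

def df_dy(s, y):      # partial derivative of f with respect to y
    return s

def a(y): return y            # lower limit, a'(y) = 1
def da(y): return 1.0
def b(y): return y * y        # upper limit, b'(y) = 2y
def db(y): return 2 * y

def phi(y):
    return integrate(lambda s: f(s, y), a(y), b(y))

def leibniz(y):
    """Right-hand side of Leibniz' rule."""
    return (integrate(lambda s: df_dy(s, y), a(y), b(y))
            - f(a(y), y) * da(y) + f(b(y), y) * db(y))

y0, h = 2.0, 1e-5
numeric = (phi(y0 + h) - phi(y0 - h)) / (2 * h)   # central difference of phi
print(numeric, leibniz(y0))
```

Both values agree with the closed-form derivative $34$ up to numerical error.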

## Multivariate Calculus

These are results at the intersection of linear algebra and calculus.

One of the really important ideas is that the differential $(Df)(x)$ of a function $f:\R^n\to\R^n$ at a point $x$ is *a linear map*. This is one of the reasons why linear algebra is so important.

The derivative of $f:\R^n\to\R^n$ at a point $p$ is the limiting linear approximation of $f$, generalizing the slope of the secant line between $p$ and $q$ as $q$ nears $p$. In the general multivariate case, the differential is a map $Df:\R^n\to L(\R^n\to\R^n)$, and its value $Df(a)$ at a point $a$, if it exists, is the linear map characterized by:

$$ \lim_{x\to a}\frac{\|f(x)-f(a)-Df(a)(x-a)\|}{\|x-a\|} = 0 $$

or equivalently

$$ \lim_{x_0\to 0}\frac{\|f(a+x_0)-f(a)-Df(a)(x_0)\|}{\|x_0\|} = 0 $$

So basically the idea is: the total derivative $Df(a)$ is the limiting linearization of $f$ around $a$.

The following is also true (using index notation):

$$ Df(x)_{ij} = \frac{\partial f(x)_i}{\partial x_j}$$

## Chain rule

$$D_x(f(g(x)))_{ij} = \sum_k \frac{\partial f(g(x))_i}{\partial g(x)_k}\frac{\partial g(x)_k}{\partial x_j} $$

Or without index notation:

$$D : (\R^n\to\R^n)\to\R^n\to L(\R^n\to\R^n)$$ $$D(f\circ g)(x)=[(Df)(g(x))][(Dg)(x)]$$
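The matrix form of the chain rule can be verified with finite differences. This is a sketch under assumptions: the maps `f` and `g` below are arbitrary illustrative choices on $\R^2$, and `jacobian` is a hypothetical central-difference helper rather than an exact derivative.

```python
import math

def jacobian(func, x, h=1e-6):
    """Central-difference Jacobian: J[i][j] approximates d func_i / d x_j."""
    columns = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = func(x_plus), func(x_minus)
        columns.append([(p - m) / (2 * h) for p, m in zip(f_plus, f_minus)])
    # columns[j][i] holds d func_i / d x_j, so transpose into rows.
    return [[columns[j][i] for j in range(len(x))] for i in range(len(columns[0]))]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def g(x):
    return [x[0] * x[1], x[0] + x[1]]

def f(u):
    return [math.sin(u[0]), u[0] * u[1]]

x = [0.3, 0.7]
lhs = jacobian(lambda v: f(g(v)), x)                 # D(f o g)(x)
rhs = matmul(jacobian(f, g(x)), jacobian(g, x))      # [Df(g(x))] [Dg(x)]
max_diff = max(abs(l - r) for lr, rr in zip(lhs, rhs) for l, r in zip(lr, rr))
print(max_diff)
```

The two Jacobians agree up to the finite-difference error, with the order of the matrix product matching the order of composition.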

## Gradient $\nabla$ and Hessian $H$

$$ \nabla_x{f(x)}_i = \dfrac{\partial f(x)}{\partial x_i}$$

$$ H_xf(x)_{ij} = \dfrac{\partial^2 f(x)}{\partial x_i\partial x_j}$$

## Examples:

$$ (D_x Ax)_{ij} = \frac{\partial (Ax)_{i}}{\partial x_j} = \frac{\partial (A_{ik}x_k)}{\partial x_j} = A_{ij} $$

$$ (D_Ax^TAx)_{ij} = \frac{\partial (x^TAx)}{\partial A_{ij}} = \frac{\partial (x_aA_{ab}x_b)}{\partial A_{ij}} = x_ix_j \Rightarrow D_Ax^TAx = xx^T $$

$$ \nabla_x{f(Ax)} = A^T\nabla_{Ax}(f(Ax))$$ $$ \nabla_{A^{-1}x}f(x) = \nabla_k{f(Ak)} = A^T\nabla_{Ak}f(Ak) = A^T\nabla_{x}f(x)$$

$$ H_xf(Ax) = A^T(H_{Ax}f(Ax))A $$ $$ H_{A^{-1}x}f(x) = H_{k}f(Ak) = A^TH_{Ak}f(Ak)A = A^TH_xf(x)A $$
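The gradient identity $\nabla_x f(Ax) = A^T\nabla_{Ax}f(Ax)$ can be checked on a small case. The sketch assumes the illustrative function $f(u)=u_0^2+3u_0u_1$ with its hand-computed gradient, an arbitrary $2\times2$ matrix `A`, and a hypothetical finite-difference helper `numeric_gradient`.

```python
def matvec(A, x):
    """Matrix-vector product with A stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def numeric_gradient(F, x, h=1e-6):
    """Central-difference gradient of a scalar function F at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((F(x_plus) - F(x_minus)) / (2 * h))
    return grad

def f(u):
    return u[0] ** 2 + 3 * u[0] * u[1]

def grad_f(u):        # gradient of f with respect to its own argument
    return [2 * u[0] + 3 * u[1], 3 * u[0]]

A = [[1.0, 2.0], [3.0, 4.0]]
x = [0.5, -1.0]

lhs = numeric_gradient(lambda v: f(matvec(A, v)), x)   # gradient of x -> f(Ax)
rhs = matvec(transpose(A), grad_f(matvec(A, x)))       # A^T (grad f)(Ax)
print(lhs, rhs)
```

Here $Ax=(-1.5,-2.5)$, so the right-hand side works out to $(-24,-39)$, and the numerical gradient matches it.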

## Change of variables

OK, the idea is that instead of integrating $f$ with respect to its input $x$ (i.e. calculating $\int_Af(x)dx$), we can view $x$ as a function of $u$ (i.e. $x=g(u)$, with $g$ invertible and continuously differentiable) and then pull back to an integral over $u$ (i.e. $\int_{g^{-1}(A)}f(g(u))du$). But this integral generally has a different value. So the new integral requires a factor to offset the change, which rather nicely happens to be $|\det(D_ug(u))|$. The intuition is that we account for the change in area of the differential (as it approaches the limit). The absolute value is there because the sign of the determinant only records whether $g$ preserves or reverses orientation, and we don’t care about this here. The equation looks as follows:

$$\int_{A} f(x)\,dx = \int_{g^{-1}(A)} (f\circ g)(u)\cdot |\det Dg(u)|\,du $$ where $\det Dg(u)$ is the determinant of the matrix of partial derivatives of $g(u)$ wrt. $u$.
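Polar coordinates give a familiar instance of this formula: with $g(r,\theta)=(r\cos\theta, r\sin\theta)$ the determinant factor is $r$. The sketch below integrates over the unit disc using a midpoint rule; the helper name `polar_integral` and the grid sizes are assumptions for illustration.

```python
import math

def polar_integral(f, n_r=200, n_theta=200):
    """Integrate f(x, y) over the unit disc via x = r cos(theta), y = r sin(theta).

    The factor r is |det Dg| for the polar-coordinate map g.
    """
    dr = 1.0 / n_r
    dtheta = 2 * math.pi / n_theta
    total = 0.0
    for i in range(n_r):
        r = (i + 0.5) * dr
        for j in range(n_theta):
            theta = (j + 0.5) * dtheta
            total += f(r * math.cos(theta), r * math.sin(theta)) * r * dr * dtheta
    return total

area = polar_integral(lambda x, y: 1.0)                    # exact value: pi
second_moment = polar_integral(lambda x, y: x * x + y * y)  # exact value: pi / 2
print(area, second_moment)
```

Without the factor $r$, the first integral would come out as $2\pi$ (the area of the rectangle $[0,1]\times[0,2\pi]$) rather than the area of the disc.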

### Derivative of the determinant

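The derivative of $\det$ at the identity $I$ in the direction $A$, where $\lambda_i$ are the eigenvalues of $A$:

$$ D\det(I)(A) = \lim_{h\to0}\frac{\det(I+hA)-\det(I)}{h} = \lim_{h\to0}\frac{(\prod_i(1+h\lambda_i))-1}{h}$$

$$ = \lim_{h\to0}\frac{(1+h\sum_i\lambda_i + o(h))-1}{h}=\sum_i\lambda_i=\operatorname{tr}(A)$$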
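A quick numerical check of this limit for a $2\times2$ matrix, where $\det(I+hA)=1+h\operatorname{tr}(A)+h^2\det(A)$ exactly. The matrix `A` and the helper names are assumptions chosen for the example.

```python
def det2(M):
    """Determinant of a 2x2 matrix stored as a list of rows."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

A = [[2.0, -1.0], [0.5, 3.0]]

def directional(h):
    """Difference quotient (det(I + hA) - det(I)) / h."""
    M = [[1 + h * A[0][0], h * A[0][1]],
         [h * A[1][0], 1 + h * A[1][1]]]
    return (det2(M) - 1.0) / h

trace = A[0][0] + A[1][1]
estimates = [directional(h) for h in (1e-2, 1e-4, 1e-6)]
print(estimates, trace)
```

The quotient equals $\operatorname{tr}(A)+h\det(A)$, so it approaches the trace as $h\to0$.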

### Existence and Differentiability of the Matrix Exponential

$e^{t\psi}=\sum_{i=0}^{\infty}\frac{t^i\psi^i}{i!}$. The sequence of partial sums is Cauchy, since for $n>m$, $\|\sum_{i=0}^{n}\frac{t^i\psi^i}{i!}-\sum_{i=0}^{m}\frac{t^i\psi^i}{i!}\|=\|\sum_{i=m+1}^{n}\frac{t^{i}\psi^{i}}{i!}\|\leq \sum_{i=m+1}^{n}\frac{|t|^{i}}{i!}\|\psi\|^{i}$, which is a tail of the convergent series for $e^{|t|\,\|\psi\|}$ and so is dominated by the factorial term. Thus the series converges (the space of linear operators being complete).

Differentiability of $f(t)= e^{t\psi}$:

$\frac{d}{dt}f(t)=\lim_{h\to0}\frac{1}{h}(e^{(t+h)\psi}-e^{t\psi})=\lim_{h\to0}\frac{1}{h}(e^{h\psi}-I)e^{t\psi}$, using $e^{(t+h)\psi}=e^{h\psi}e^{t\psi}$ (which holds because $h\psi$ and $t\psi$ commute).

So all that is needed is to show that $\lim_{h\to0}\frac{1}{h}(e^{h\psi}-I)=\psi$, which gives $f'(t)=\psi e^{t\psi}$.

$$ e^{t\psi}=\sum_{i=0}^{\infty}\frac{t^i\psi^i}{i!}=I+t\sum_{i=1}^{\infty}\frac{t^{i-1}\psi^i}{i!}\Rightarrow \frac{1}{t}(e^{t\psi}-I)=\sum_{i=1}^{\infty}\frac{t^{i-1}\psi^i}{i!} $$

We can then observe that this final sum goes to $\psi$ as we take $t$ to $0$, because the term of the sum with $i=1$ is $\frac{t^0\psi}{1!}=\psi$, while every term with $i\geq2$ carries a positive power of $t$ and vanishes in the limit.
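Both the series and its derivative can be checked numerically. This sketch assumes the illustrative generator $\psi=\begin{pmatrix}0&1\\-1&0\end{pmatrix}$, for which $e^{t\psi}$ is a rotation matrix with entries $\cos t$ and $\pm\sin t$. The helpers `mat_mul` and `mat_exp` and the truncation at 30 terms are assumptions of the example.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_exp(psi, t, terms=30):
    """Partial sum of the series sum_i t^i psi^i / i! using `terms` terms."""
    n = len(psi)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in total]   # current term, starting at the identity
    for i in range(1, terms):
        # term_i = term_{i-1} * psi * (t / i) = t^i psi^i / i!
        term = [[v * t / i for v in row] for row in mat_mul(term, psi)]
        total = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(total, term)]
    return total

psi = [[0.0, 1.0], [-1.0, 0.0]]
t, h = 0.7, 1e-5

E = mat_exp(psi, t)
E_plus, E_minus = mat_exp(psi, t + h), mat_exp(psi, t - h)
# Central difference approximation of d/dt e^{t psi}
derivative = [[(p - m) / (2 * h) for p, m in zip(rp, rm)]
              for rp, rm in zip(E_plus, E_minus)]
expected = mat_mul(psi, E)   # psi e^{t psi}
print(E, derivative, expected)
```

The numerical derivative agrees with $\psi e^{t\psi}$ up to finite-difference error.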