# Probability


Highly recommend Machine Learning: A Probabilistic Perspective by Kevin Murphy as a no-nonsense reference for a whole lot of probabilistic models and inference strategies.

## PDF of function of random variable

Suppose Y = g(X) and g is increasing. Then $F_Y(y) = P(Y\leq y) = P(g(X)\leq y) = P(X\leq g^{-1}(y)) = F_X(g^{-1}(y))$.

Then $f_Y(y) = F’_Y(y) = \frac{d}{dy}F_X(g^{-1}(y)) = f_X(g^{-1}(y))\frac{d}{dy}g^{-1}(y) = f_X(g^{-1}(y))\frac{dg^{-1}(y)}{dy}$

Or:

$$F_Y(A) = P(Y\in A) = P(g(X)\in A) = P(X\in g^{-1}(A)) = \int_{g^{-1}(A)}f_X(x)dx$$

By the general form of the substitution rule, we then have that

$$P(Y\in A) = \int_{A}(f_X\circ g^{-1})(y)\cdot|J_{g^{-1}}(y)|dy$$

But $J_{g^{-1}}(y) = det D(g^{-1}(y)) = det (Dg(g^{-1}(y)))^{-1} = J_g(g^{-1}(y))^{-1}$ so

$$P(Y\in A) = \int_{A}(f_X\circ g^{-1})(y)\cdot|J_g(g^{-1}(y))^{-1}|dy = \int_{A}\frac{(f_X\circ g^{-1})(y)}{|J_g(g^{-1}(y))^{-1}|}dy$$

So $\frac{d}{dy}P(Y\in A) = f_Y(y) = \frac{(f_X\circ g^{-1})(y)}{|J_g(g^{-1}(y))|}$

So $(f_X\circ g^{-1})(y) = f_Y(y)\cdot|J_g(g^{-1}(y))^{-1}|$

## Sum Random Variables by Convolving PDFs

$$F_{X+Y}(z) = \int_{-\infty}^{\infty}\int_{-\infty}^{z-x} f_{(X,Y)}(x,y) dydx$$

$$= \int_{-\infty}^{\infty}\int_{-\infty}^{z} f_{(X,Y)}(x,v-x) dvdx = \int_{-\infty}^{z}\int_{-\infty}^{\infty} f_{(X,Y)}(x,v-x) dvdx$$. Then, taking the derivative with respect to $x$, using the fundamental theorem of calculus, and assuming independence of $X$ and $Y$, we have that $f_{X+Y}(z) = \int_{-\infty}^{\infty} f_X(x)\cdot f_Y(z-x) dx$.

## Characteristic Function

Characteristic function of a random variable is the (inverse) Fourier transform of its density.

Characteristic function of a Gaussian is a Gaussian (see notes on Fourier analysis)

## Some distributions

### Gaussian Distribution

Its weird looking pdf can be understood by first looking at the standard normal N(0,1), with pdf $\frac{1}{\sqrt{2\pi)}}e^{-\frac{1}{2}x^2}$ and then doing change of variables, noting that if $f$ is the pdf of $X$, $\frac{1}{\sigma}f(\frac{x-\mu}{\sigma})$ is the pdf of $\mu + \sigma X$, by change of variables. Similarly for the multivariate Gaussian, observe that a change from $X$ to $\mu + \Sigma X$ justifies its form.

So really, the hard part is finding the value of the Gaussian integral $\int_{-\infty}^{\infty}e^{-x^2}=\sqrt{\pi}$.

Smoothing over a bunch of analytic difficulties, here’s the basic idea:

$$\int_{-\infty}^{\infty}e^{-x^2}=\sqrt{(\int_{-\infty}^{\infty}e^{-x^2}dx)^2}$$ $$= \sqrt{\int_{-\infty}^{\infty}e^{-x^2}dx\int_{-\infty}^{\infty}e^{-y^2}dy}$$ $$= \sqrt{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}e^{-(x^2+y^2)}dxdy}$$ $$= \sqrt{\int_{-\infty}^{0}e^{-r^2}2\pi rdr}$$ $$=\sqrt{\pi}$$

### Exponential Distribution

$$PDF(x;\lambda) = \int \lambda e^{-\lambda x}$$

$$E[X;\lambda] = E[\int \lambda e^{-\lambda x}] = \int x \lambda e^{-\lambda x} = [x \cdot -e^{-\lambda x}]_0^{\inf} + \int e^{-\lambda x}$$

$$= 0 + [-\frac{1}{\lambda}\cdot e^{-\lambda x}]_0^{inf} = \frac{1}{\lambda}$$

### Markov Random Field

$$P(x|\theta) = \frac{e^{\theta^T\cdot f(x)}}{\sum_{x’}e^{\theta^T\cdot f(x’)}}$$

where $f(x)$ is an indicator function and $\theta$ is a vector of parameters. This parametrization shows that an MRF is an exponential family distribution.

### Empirical Distribution

The delta distribution which places full weight on the data (viewed as a single data point).

## Central Limit Theorem

Various increasingly powerful versions, but simplest is:

For a sequence ${X_i}$ of independent and identically distributed (iid) random variables with mean $\mu$ and variance $\sigma^2$, with $X_n = \left(\sum_i^nX_i\right)-n\mu$, the random variable $Z_n=\frac{X_n-n\mu}{\sqrt{n\sigma^2}}$ converges in distribution to $N(0,1)$, i.e. to a standard normal.

The proof uses Fourier methods. The PDF of a sum of iid random variables $X_i$ is a convolution of $n$ PDFs of $X_i$, the Fourier transform of a convolution is a raising of the Fourier transform of those PDFs to the $n$th power, and Taylor expanding inside the Fourier integral basically gives us the result (see e.g. page 130 of this).

## Exponential Family

Not to be confused with the exponential distribution. Really it’s a family of families. For fixed functions $b$ and $T$ but with $\eta$ varying, we have a family:

$$P(x;\eta) = b(x)e^{\eta^TT(x)-a(\eta)}$$

$T(X)$: sufficient statistics, $\eta$ : natural parameter, $a(\eta)$: log-normalizer, $b$ : base measure.

Canonical Parameters: $\mu,\sigma…$ : $\Omega$ (the sample space)

Natural Parameters: $\eta : \R$

Link function: $L : \Omega \to \R$

Response function: $R : \R \to \Omega$

So $\eta$ parametrizes a family. For instance, we can choose $T$ and $b$ to make $\eta$ range over all Gaussians, or Bernoulli, Poisson, Exponential, Von-Mises, Gamma etc distributions. $a$ is determined by $T$ and $b$ and is the log of the normalizing constant.

The reason this is useful is that we can prove a bunch of super useful things about distributions in this form. In particular:

• $\frac{da(\eta)}{d\eta} = E[T(x);\eta]$
• $H_{\eta}a(\eta) = Cov[T(x);\eta]$
• $a$ is convex in $\theta$
• Each exponential family is conjugate to an exponential family
• Maximum Entropy: exponential families are the most entropic distributions given that the expectation is equal to some $\alpha$, for a fixed base measure

To find the maximum likelihood $\theta$ for an observation $D$, one wants to take the gradient of log $P(x|\theta)$, since log-likelihood in exponential families is convex. Taking the gradient of the log of the numerator is easy. But the gradient of the log of the denominator is as follows:

$$\frac{d}{d\theta} log \sum_{x’}e^{\theta^T\cdot f(x’)} = \frac{1}{\sum_{x’}e^{\theta^T\cdot f(x’)}} \cdot \sum_{x’} \frac{d}{d\theta} e^{\theta^T\cdot f(x’)}$$

$$= \frac{1}{Z(\theta)} \cdot \sum_{x’}e^{\theta^T\cdot f(x’)} \cdot f(x’) = \sum_{x’} P(x’|\theta) \cdot f(x’) = E_{x\sim p(x|\theta)}[f(x)]$$

The Hessian wrt. $\theta$ is the covariance of f(x) wrt the MRF. Covariance matrices are positive semi-definite, proving that the log-likelihood is convex.

Example of how to put Bernoulli distribution in exponential family form:

$T=id$, $b(x)=1, \eta=\sigma^{-1}(\mu)$.

Crucial point: how do we convert from the mean parametrized form of the Bernoulli distribution, namely $P(x|\mu) = x\mu\cdot x^{1-\mu}$ to the naturally parametrized form?

Easy: take the log odds: $\eta=log(\frac{\mu}{1-\mu})$. And the inverse is the sigmoid function.

## Useful approximations

### Stirling’s Approximation

$N!$ rapidly becomes intractable to analytically or numerically compute, which is bad for statistical mechanics, because it appears a lot. But for large $N$ we can estimate it. First a rough estimate, and then a better one:

$$\log N! = \log \prod_{i=1}^N i = \sum_{i=1}^N \log i \approx \int_1^N \log x dx$$ $$= [x\log x]_1^N-\int_1^N\frac{x}{x}=N\log N - N + 1$$

A second, better approximation is attained as follows, recalling that $\Gamma(n)=n!$ for $n\in \Z$, where

$$N!=\Gamma(N)=\int_0^{\infty} e^{-x}x^Ndx$$

We then do something crazy, which is to approximate the integrand by a Gaussian. The reason this works well is that the integrand $f(x)=e^{-x}x^N$ is peaky. The gaussian function is:

$$g(x)=Ae^{-\frac{1}{2}(\frac{x-x_0}{\sigma})^2}$$

We set the mean $x_0$ at the maximum, which is found by setting $0=\frac{d}{dx}\log e^{-x}x^N=\frac{d}{dx}N\log x-x=\frac{N}{x}-1$, which implies that $x_0=N$. Then $A=g(x_0)=g(N)\approx e^{-N}N^N$.

Finally, we note that $\frac{d^2}{d x^2}\log g(x)=-\frac{1}{\sigma^2}$ and equating this to $\frac{d^2}{d x^2}\log f(x)=-\frac{N}{x^2}$ implies that $\sigma$ should be set to $\frac{x^2}{N}$

This gives:

$$N! = \Gamma(N)=\int_0^{\infty} e^{-x}N^Ndx \approx \int_0^{\infty} e^{-N}N^Ne^{-\frac{1}{2}(\frac{x-N}{\sqrt{N}})^2}=e^{-N}N^N\sqrt{2\pi N}$$

## Monte Carlo Methods

Often you find yourself in a situation where you have specified a distribution P, which you know up to a normalization constant. You might find yourself in this position in statistical physics (e.g. you don’t know the partition function) or in Bayesian inference (you know the posterior distribution up to the denominator, which has a hard integral). A common thing you might want is the expectation of some function $f$ under P (i.e. $\int_{support(P)} f(x)P(x)dx)$, where I’m using $P$ to also mean the pdf of the distribution $P$). You can approximate this, and other such quantities, if you have a good means of taking representative samples from $P$.

Monte Carlo methods are methods for taking samples from the typical set of a distribution and using those to calculate approximate quantities of the distribution, like the expectation or other moments.

In particular, you want samples from the typical set, i.e. the part of the space of the support of $P$ which has both large volume and high probability density. It turns out that in high dimensions, almost all of the mass is concentrated in such a set. (Note: the notion of a typical set comes from information theory - as I’m using it here, it’s not a precise notion; I’m not saying, for example, that it has an exact boundary).

### Markov Chain Monte Carlo

A type of Monte Carlo method. You choose a translation kernel T (in the finite case, representable as a stochastic matrix from states to states) and then take a random Markovian walk with this kernel. Given some assumptions (ergodicity of your kernel), you can prove that the distribution over states induced by the random walk will eventually converge to the stationary distribution (unit eigenvector) of T. Now

$$P(x)T(x’|x) = P(x’)T(x|x’) \Rightarrow \sum_xP(x)T(x’|x) = \sum_xP(x’)T(x|x’) = P(x’)\sum_xT(x|x’) = P(x’)$$

which is to say that $P$ is just such a stationary distribution. The condition denoted by the first equation above is known as “detailed balance”.

In other words, the states in your Markov chain are assignments to all the variables of the distribution (it could be a joint distribution) and so the stationary distribution of the walk is a distribution over the relevant variable(s). If detailed balance is satisfied, then $P$ is the stationary distribution, and so you can sample from $P$ by taking a nice long walk along the chain, since $\lim_nT^n(x) = P$ for any starting value $x$. Magic!

Metropolis-Hasting is one method for obtaining T. The idea is that you have a kernel Q, and let $T = A(x’|x)Q(x’|x)$, where $A(x’|x)=min(1,\frac{P(x’)Q(x|x’)}{P(x)Q(x’|x)})$. Note that you don’t have to know $P$, but rather $P$ only up to normalization, since it appears in a ratio. Clever. By design, some simple maths shows that with $T$ as defined, $(P,T)$ satisfies detailed balance, so you’re in business.

## Models

### State space models

This is the kind of territory where Kalman filters, particle filters, etc, come up.

Chapter 18 of Machine Learning: A Probabilistic Perspective is great.

It introduces state space models (SSMs) by analogy to Hidden Markov Models, but with continuous hidden states. Basically, you have a hidden state that transitions with time (probabilistically), and at each time point, produces an observation (probabilistically). The goal is to infer the current, or future states of the hidden variable from previous states. Inference methods take advantage of this recurrency inherent in the model.

A linear dynamical system is one where the temporal transitions and production of observations are by linear, with additive Gaussian noise. They are solvable in closed form by a Kalman filter.

In the case where the transition and observation models are not linear, one option (the Extended Kalman Filter) is to take the Jacobian in order to obtain a linear approximation, and basically do a Kalman filter with that. The unscented Kalman filter is similar, but instead first passes the distribution (or sample points) through the non-linear function, and then fits a Gaussian.