# Information theory


# Information theory

Suppose you have a distribution over some set $A$, $p : D(A)$, from which you will be given a sample. Your goal is to come up with a bijection $f : A \to List({0,1})$ (i.e. from $A$ to arbitrary length bitstrings) such that $E_{x\sim p}[len(f(x))]$ is minimized.

Intuition: you are trying to minimize the expected length of messages which losslessly communicate.

The upshot is that one can talk of the information contained in a distribution, measured in the expected number of bits (Booleans) needed to encode a draw from the distribution.

The source coding theorem states that the minimum expected length is $H(p)$, known as the entropy of $p$, which is, up to a factor, the KL distance from a uniform dsitribution.

$$- \sum_ip_ilog(p_i)$$

Intuition: if the entropy is low, this means the distribution puts a lot of mass on certain elements of the support, and accordingly, these can be assigned short messages, minimizing the expected message length.

## Properties of entropy

As the source coding theorem makes clear, information, as measured by entropy, is an additive quantity, which is why a logarithm appears in the definition.

It has various intuitive properties, not least that your uncertainty on a distribution over the cells of a partition of the $p_i$, added to your uncertainty for each distribution over the cell weighted by the cell probability, is equal to the entropy of the unpartitioned set.