# The Asymptotic Equipartition Theorem

In the following analysis, we develop an intuition for the Asymptotic Equipartition Theorem, a cornerstone of information theory, before formally proving that it describes the compressibility of sequences emitted by a stochastic source.

Aidan Rocke https://github.com/AidanRocke
03-23-2024

### Entropy as the typical number of bits required to describe a random variable:

Let’s suppose we toss a coin that comes up heads with probability $$p = \frac{k}{n}$$. How many bits would we typically need to encode $$k$$ heads in a sequence of $$n$$ coin flips? Using Stirling’s formula for the factorial, and keeping only the terms that are exponential in $$n$$, we find:

$$${n \choose k} = \frac{n!}{k!(n-k)!} \sim \frac{(\frac{n}{e})^n}{(\frac{k}{e})^{k} \cdot (\frac{n-k}{e})^{n-k}} = \frac{n^n}{k^k \cdot (n-k)^{n-k}} \tag{1}$$$

which simplifies to:

$$${n \choose k} \sim \big(\frac{k}{n}\big)^{-k} \cdot \big(\frac{n-k}{n}\big)^{-(n-k)} = p^{-p\cdot n} \cdot (1-p)^{-(1-p) \cdot n} = \big( 2^{-p \log_2 p} \cdot 2^{-(1-p) \cdot \log_2 (1-p)} \big)^n = 2^{n \cdot H(p)} \tag{2}$$$

It follows that if we model a sequence of coin flips where the coin is observed to come up heads with probability $$p$$, then we require about $$\log_2 {n \choose k} \approx n \cdot H(p)$$ bits to specify which sequence with $$k$$ heads out of $$n$$ coin flips actually occurred.
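As a quick numerical sanity check (a Python sketch, not part of the original derivation), we can compare $$\log_2 {n \choose k}$$ against $$n \cdot H(p)$$:

```python
import math

def H2(p):
    """Binary entropy of a Bernoulli(p) source, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, k = 1000, 300
p = k / n

exact = math.log2(math.comb(n, k))  # log2 of the exact binomial coefficient
approx = n * H2(p)                  # Stirling-based estimate n * H(p)

print(f"log2 C(n, k) = {exact:.1f}, n * H(p) = {approx:.1f}")
```

The gap between the two values is only $$O(\log n)$$, coming from the square-root factors that Stirling's formula contributes below the exponential terms, so it is negligible relative to the leading $$n \cdot H(p)$$ term.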

### Formulating a stochastic source:

There are four key notions:

1. Memoryless source, $$U$$: the source symbols $$U_1, U_2, \ldots, U_n$$ are drawn independently from the distribution of $$U$$.

2. Alphabet: $$\mathcal{U} = \{\alpha_i\}_{i=1}^r$$ specifies the possible values that each symbol $$U_i$$ can take, where the size of $$\mathcal{U}$$ is given by $$\lvert \mathcal{U} \rvert = r$$.

3. Source sequence: $$U^n = (U_1,...,U_n)$$ denotes the n-tuple that specifies a sequence of $$n$$ source symbols whereas $$\mathcal{U}^n$$ indicates the set of all possible source sequences of length $$n$$.

4. Probability: since the source is memoryless, the probability assigned to a source sequence $$U^n$$ is given by $$P(U^n) = \prod_{i=1}^n P(U_i)$$.
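To make these notions concrete, here is a minimal Python sketch of a hypothetical memoryless source over a three-letter alphabet (the alphabet and distribution below are illustrative assumptions, not part of the formal development):

```python
import math

# hypothetical alphabet U = {a, b, c} with source distribution P(U)
P = {"a": 0.5, "b": 0.25, "c": 0.25}

# a source sequence U^n of length n = 5
u_n = ("a", "b", "a", "c", "a")

# memoryless source: P(U^n) is the product of per-symbol probabilities
prob = math.prod(P[u] for u in u_n)

# entropy of the source, H(U) = E[-log2 P(U)]
H = -sum(p * math.log2(p) for p in P.values())

print(prob, H)
```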

## The $$\epsilon$$-typical set, $$\mathcal{A}_{\epsilon}^n$$:

For some $$\epsilon > 0$$, the source sequence $$U^n$$ is $$\epsilon$$-typical if:

$$$\big\lvert -\frac{1}{n} \log_2 P(U^n) - H(U) \big\rvert \leq \epsilon \tag{3}$$$

If we let $$\mathcal{A}_{\epsilon}^n$$ denote the $$\epsilon$$-typical set then we may derive the equivalent definition:

$$$\big\lvert -\frac{1}{n} \log_2 P(U^n) - H(U) \big\rvert \leq \epsilon \iff H(U) - \epsilon \leq -\frac{1}{n} \log_2 P(U^n) \leq H(U) + \epsilon \tag{4}$$$

which simplifies to:

$$$2^{-n \cdot (H(U) + \epsilon)} \leq P(U^n) \leq 2^{-n \cdot (H(U) -\epsilon)} \tag{5}$$$
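For a Bernoulli source the probability of a sequence depends only on its number of heads, so we can verify the equivalence of definition (3) and the bounds in (5) for every typical head-count $$k$$ directly in log-space (a Python sketch; the parameters are illustrative choices):

```python
import math

p, n, eps = 0.3, 200, 0.05
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # H(U) for Bernoulli(p)

typical_counts = []
for k in range(n + 1):
    # log2 P(u^n) for any sequence with k heads: log2(p^k * (1-p)^(n-k))
    logP = k * math.log2(p) + (n - k) * math.log2(1 - p)
    if abs(-logP / n - H) <= eps:  # epsilon-typicality, definition (3)
        typical_counts.append(k)
        # the equivalent bounds (5), checked in log-space to avoid underflow
        assert -n * (H + eps) <= logP <= -n * (H - eps)

print(typical_counts)
```

Since $$-\frac{1}{n} \log_2 P(U^n)$$ is linear in the fraction of heads, the typical head-counts form a contiguous interval around $$k = p \cdot n$$.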

## The Asymptotic Equipartition Theorem:

$$$\forall \epsilon > 0, \lim_{n \to \infty} P(U^n \in \mathcal{A}_{\epsilon}^n) = 1 \tag{*}$$$

Proof: We may observe that:

$\begin{equation*} P(U^n \in \mathcal{A}_{\epsilon}^n) = P(\big\lvert -\frac{1}{n} \log_2 P(U^n) - H(U) \big\rvert \leq \epsilon) = P(\big\lvert \frac{1}{n} \sum_{i=1}^n -\log_2 P(U_i) - H(U) \big\rvert \leq \epsilon) \end{equation*}$

and the result follows from the Law of Large Numbers since:

$\begin{equation*} H(U) = \mathbb{E}[-\log_2 P(U)] \end{equation*}$

QED.
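The convergence in (*) can be illustrated with a small Monte Carlo experiment (a Python sketch; the sample sizes and parameters are arbitrary choices made for illustration):

```python
import math
import random

random.seed(0)
p, eps = 0.3, 0.05
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fraction_typical(n, trials=2000):
    """Estimate P(U^n in A_eps^n) for a Bernoulli(p) source."""
    hits = 0
    for _ in range(trials):
        k = sum(random.random() < p for _ in range(n))  # number of heads
        logP = k * math.log2(p) + (n - k) * math.log2(1 - p)
        hits += abs(-logP / n - H) <= eps
    return hits / trials

for n in (20, 200, 2000):
    print(n, fraction_typical(n))
```

As the Law of Large Numbers suggests, the estimated probability of landing in the typical set climbs towards 1 as $$n$$ grows.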

## Theorem: The AEP inequality

$$\forall \epsilon > 0$$, and sufficiently large $$n$$, $$(1-\epsilon) \cdot 2^{n \cdot (H(U) -\epsilon)} \leq \lvert \mathcal{A}_{\epsilon}^n \rvert \leq 2^{n \cdot (H(U) +\epsilon)}$$

Proof of the upper bound:

$\begin{equation*} 1 \geq P(U^n \in \mathcal{A}_{\epsilon}^n) = \sum_{u^n \in \mathcal{A}_{\epsilon}^n} P(u^n) \geq \sum_{u^n \in \mathcal{A}_{\epsilon}^n} 2^{-n \cdot (H(U) +\epsilon)} = 2^{-n \cdot (H(U) +\epsilon)} \cdot \lvert \mathcal{A}_{\epsilon}^n \rvert \end{equation*}$

Hence,

$\begin{equation*} \lvert \mathcal{A}_{\epsilon}^n \rvert \leq 2^{n \cdot (H(U) +\epsilon)} \end{equation*}$

As for the lower bound, the AEP guarantees that for sufficiently large $$n$$ we have:

$\begin{equation*} 1- \epsilon \leq P(U^n \in \mathcal{A}_{\epsilon}^n) = \sum_{u^n \in \mathcal{A}_{\epsilon}^n} P(u^n) \leq \sum_{u^n \in \mathcal{A}_{\epsilon}^n} 2^{-n \cdot (H(U) -\epsilon)} = 2^{-n \cdot (H(U) -\epsilon)} \cdot \lvert \mathcal{A}_{\epsilon}^n \rvert \end{equation*}$

which yields:

$\begin{equation*} \lvert \mathcal{A}_{\epsilon}^n \rvert \geq (1-\epsilon) \cdot 2^{n \cdot (H(U) -\epsilon)} \end{equation*}$

QED.
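We can also check the AEP inequality exactly for a Bernoulli source, since $$\lvert \mathcal{A}_{\epsilon}^n \rvert$$ is a sum of binomial coefficients over the typical head-counts (a Python sketch with illustrative parameters):

```python
import math

p, eps, n = 0.3, 0.05, 2000
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

size = 0  # |A_eps^n|, accumulated as an exact big integer
for k in range(n + 1):
    logP = k * math.log2(p) + (n - k) * math.log2(1 - p)
    if abs(-logP / n - H) <= eps:
        size += math.comb(n, k)  # all C(n, k) sequences with k heads are typical

log2_size = size.bit_length() - 1  # log2 |A|, accurate to within one bit

# AEP inequality: (1 - eps) * 2^(n(H - eps)) <= |A| <= 2^(n(H + eps))
assert math.log2(1 - eps) + n * (H - eps) <= log2_size <= n * (H + eps)
print(log2_size, n * (H - eps), n * (H + eps))
```

Working with `bit_length` keeps the computation exact even though $$\lvert \mathcal{A}_{\epsilon}^n \rvert$$ far exceeds floating-point range.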

## Maximum Entropy inference via the AEP:

Finally, we may note that $$\mathcal{A}_{\epsilon}^n$$ is exponentially smaller than $$\mathcal{U}^n$$ except when $$P(U)$$ is uniform, in which case $$H(U) = \log_2 r$$:

$\begin{equation*} \frac{\lvert \mathcal{A}_{\epsilon}^n \rvert}{\lvert \mathcal{U}^n \rvert} \sim \frac{2^{n \cdot H(U)}}{r^n} = 2^{-n (\log_2 r - H(U))} \end{equation*}$

Hence, Maximum Entropy inference works within the setting of Machine Learning and Data Compression precisely because of the Asymptotic Equipartition Property.
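The exponential gap above is easy to tabulate (a Python sketch; $$r = 2$$ with a biased versus a uniform source, chosen for illustration):

```python
import math

def H2(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

r = 2  # binary alphabet
for p in (0.3, 0.5):
    gap = math.log2(r) - H2(p)  # the exponent log2(r) - H(U) in the ratio above
    for n in (10, 100, 1000):
        ratio = 2 ** (-n * gap)  # |A_eps^n| / |U^n|, to leading order
        print(f"p = {p}, n = {n}: ratio ~ {ratio:.3g}")
```

A biased source occupies an exponentially vanishing fraction of sequence space, while the uniform (maximum-entropy) source is the only one whose typical set fills it.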

### Citation

Rocke (2024, March 23). Kepler Lounge: The Asymptotic Equipartition Theorem. Retrieved from keplerlounge.com
@misc{rocke2024the,
  author = {Rocke, Aidan},
  title = {Kepler Lounge: The Asymptotic Equipartition Theorem},
  url = {keplerlounge.com},
  year = {2024}
}