In the following analysis, we develop an intuition for the Asymptotic Equipartition Theorem, a fundamental cornerstone of information theory, before formally proving that it describes the compressibility of sequences emitted by a stochastic source.
Suppose we toss a coin that comes up heads with probability \(p = \frac{k}{n}\). How many bits would we typically need to encode \(k\) heads in a sequence of \(n\) coin flips? Using Stirling’s formula for the factorial and keeping only the exponentially dominant terms, we find:
\[\begin{equation} {n \choose k} = \frac{n!}{k!(n-k)!} \sim \frac{(\frac{n}{e})^n}{(\frac{k}{e})^{k} \cdot (\frac{n-k}{e})^{n-k}} = \frac{n^n}{k^k \cdot (n-k)^{n-k}} \tag{1} \end{equation}\]
which simplifies to:
\[\begin{equation} {n \choose k} \sim \big(\frac{k}{n}\big)^{-k} \cdot \big(\frac{n-k}{n}\big)^{-(n-k)} = p^{-p\cdot n} \cdot (1-p)^{-(1-p) \cdot n} = \big( 2^{-p \log_2 p} \cdot 2^{-(1-p) \cdot \log_2 (1-p)} \big)^n = 2^{n \cdot H(p)} \tag{2} \end{equation}\]
It follows that if we model a sequence of coin flips where the coin is observed to come up heads with probability \(p\), then we require about \(\log_2 {n \choose k} \approx n \cdot H(p)\) bits to specify which of the length-\(n\) sequences containing exactly \(k\) heads occurred.
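As a quick numerical sanity check (not part of the original derivation), the short Python sketch below compares \(\log_2 {n \choose k}\) with \(n \cdot H(p)\); the helper name binary_entropy and the chosen values of \(n\) and \(p\) are purely illustrative.

```python
# Illustrative check that log2 C(n, k) is well approximated by n * H(k/n) for large n.
from math import comb, log2

def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with the convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for n in (100, 1_000, 10_000):
    k = n // 4                          # heads observed with frequency p = 1/4
    exact = log2(comb(n, k))            # bits needed to index a k-heads sequence
    approx = n * binary_entropy(k / n)  # the entropy approximation
    print(f"n = {n:>6}: log2 C(n,k) = {exact:9.2f}, n*H(p) = {approx:9.2f}")
```

The gap between the two columns grows only like \(O(\log n)\), which is negligible relative to \(n \cdot H(p)\).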
There are four key notions:
Memoryless source, \(U\): the source symbols \(U_1, U_2, \dots, U_n\) are drawn i.i.d. from \(U\).
Alphabet: \(\mathcal{U} = \{\alpha_i\}_{i=1}^r\) specifies the possible values that each symbol \(U_i\) can take, where the size of \(\mathcal{U}\) is given by \(\lvert \mathcal{U} \rvert = r\).
Source sequence: \(U^n = (U_1,...,U_n)\) denotes the n-tuple that specifies a sequence of \(n\) source symbols whereas \(\mathcal{U}^n\) indicates the set of all possible source sequences of length \(n\).
Probability: the probability assigned to a source sequence \(U^n\) is given by \(P(U^n) = \prod_{i=1}^n P(U_i)\).
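To make these notions concrete, here is a minimal sketch of a memoryless source in Python; the alphabet, the distribution probs, and the helper names are assumptions chosen purely for illustration.

```python
# A toy memoryless source over the alphabet {a, b, c} (illustrative values).
import random
from math import prod

probs = {"a": 0.5, "b": 0.3, "c": 0.2}   # P(U = alpha_i) for each alpha_i in U

def sample_sequence(n: int) -> list[str]:
    """Draw a source sequence U^n = (U_1, ..., U_n) i.i.d. from U."""
    return random.choices(list(probs), weights=list(probs.values()), k=n)

def sequence_probability(u_n: list[str]) -> float:
    """P(U^n) = prod_i P(U_i), by memorylessness."""
    return prod(probs[u] for u in u_n)

u_n = sample_sequence(10)
print(u_n, sequence_probability(u_n))
```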
For a fixed \(\epsilon > 0\), the source sequence \(U^n\) is said to be \(\epsilon\)-typical if:
\[\begin{equation} \big\lvert -\frac{1}{n} \log_2 P(U^n) - H(U) \big\rvert \leq \epsilon \tag{3} \end{equation}\]
If we let \(\mathcal{A}_{\epsilon}^n\) denote the \(\epsilon\)-typical set, then we may derive the equivalent definition:
\[\begin{equation} \big\lvert -\frac{1}{n} \log_2 P(U^n) - H(U) \big\rvert \leq \epsilon \iff H(U) - \epsilon \leq -\frac{1}{n} \log_2 P(U^n) \leq H(U) + \epsilon \tag{4} \end{equation}\]
which simplifies to:
\[\begin{equation} 2^{-n \cdot (H(U) + \epsilon)} \leq P(U^n) \leq 2^{-n \cdot (H(U) -\epsilon)} \tag{5} \end{equation}\]
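Continuing the toy source above, the following sketch implements the \(\epsilon\)-typicality test in (3); the distribution and the value of \(\epsilon\) are again illustrative choices.

```python
# Test whether a sampled sequence is epsilon-typical, as in (3).
import random
from math import log2

probs = {"a": 0.5, "b": 0.3, "c": 0.2}          # same toy source as above
H = -sum(q * log2(q) for q in probs.values())   # H(U) = E[-log2 P(U)]

def is_typical(u_n: list[str], eps: float) -> bool:
    """True iff |-(1/n) log2 P(u^n) - H(U)| <= eps."""
    neg_log_rate = -sum(log2(probs[u]) for u in u_n) / len(u_n)
    return abs(neg_log_rate - H) <= eps

u_n = random.choices(list(probs), weights=list(probs.values()), k=1_000)
print(is_typical(u_n, eps=0.05))
```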
The Asymptotic Equipartition Property then states that source sequences are \(\epsilon\)-typical with probability approaching one:
\[\begin{equation} \forall \epsilon > 0, \lim_{n \to \infty} P(U^n \in \mathcal{A}_{\epsilon}^n) = 1 \tag{*} \end{equation}\]
We may observe that:
\[\begin{equation*} P(U^n \in \mathcal{A}_{\epsilon}^n) = P(\big\lvert -\frac{1}{n} \log_2 P(U^n) - H(U) \big\rvert \leq \epsilon) = P(\big\lvert \frac{1}{n} \sum_{i=1}^n -\log_2 P(U_i) - H(U) \big\rvert \leq \epsilon) \end{equation*}\]
and the result follows from the Law of Large Numbers since:
\[\begin{equation*} H(U) = \mathbb{E}[-\log_2 P(U)] \end{equation*}\]
QED.
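A Monte Carlo experiment makes (*) tangible: the fraction of sampled sequences that land in \(\mathcal{A}_{\epsilon}^n\) approaches one as \(n\) grows. The source, \(\epsilon\), and the number of trials below are illustrative choices, not prescribed by the theorem.

```python
# Estimate P(U^n in A_eps^n) by sampling; it should tend to 1 as n grows.
import random
from math import log2

probs = {"a": 0.5, "b": 0.3, "c": 0.2}
H = -sum(q * log2(q) for q in probs.values())
eps, trials = 0.05, 500

for n in (10, 100, 1_000, 5_000):
    hits = 0
    for _ in range(trials):
        u_n = random.choices(list(probs), weights=list(probs.values()), k=n)
        rate = -sum(log2(probs[u]) for u in u_n) / n
        hits += abs(rate - H) <= eps
    print(f"n = {n:>5}: estimated P(U^n in A) ~ {hits / trials:.3f}")
```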
We may also bound the size of the typical set: \(\forall \epsilon > 0\) and for sufficiently large \(n\), \((1-\epsilon) \cdot 2^{n \cdot (H(U) -\epsilon)} \leq \lvert \mathcal{A}_{\epsilon}^n \rvert \leq 2^{n \cdot (H(U) +\epsilon)}\)
Proof of the upper bound:
\[\begin{equation*} 1 \geq P(U^n \in \mathcal{A}_{\epsilon}^n) = \sum_{u^n \in \mathcal{A}_{\epsilon}^n} P(u^n) \geq \sum_{u^n \in \mathcal{A}_{\epsilon}^n} 2^{-n \cdot (H(U) +\epsilon)} = 2^{-n \cdot (H(U) +\epsilon)} \cdot \lvert \mathcal{A}_{\epsilon}^n \rvert \end{equation*}\]
Hence,
\[\begin{equation*} \lvert \mathcal{A}_{\epsilon}^n \rvert \leq 2^{n \cdot (H(U) +\epsilon)} \end{equation*}\]
As for the lower bound, the AEP gives \(P(U^n \in \mathcal{A}_{\epsilon}^n) \geq 1 - \epsilon\) for sufficiently large \(n\), so that:
\[\begin{equation*} 1- \epsilon \leq P(U^n \in \mathcal{A}_{\epsilon}^n) = \sum_{u^n \in \mathcal{A}_{\epsilon}^n} P(u^n) \leq \sum_{u^n \in \mathcal{A}_{\epsilon}^n} 2^{-n \cdot (H(U) -\epsilon)} = 2^{-n \cdot (H(U) -\epsilon)} \cdot \lvert \mathcal{A}_{\epsilon}^n \rvert \end{equation*}\]
which yields:
\[\begin{equation*} \lvert \mathcal{A}_{\epsilon}^n \rvert \geq (1-\epsilon) \cdot 2^{n \cdot (H(U) -\epsilon)} \end{equation*}\]
QED.
Finally, we may note that \(\mathcal{A}_{\epsilon}^n\) is exponentially smaller than \(\mathcal{U}^n\) except when \(P(U)\) is uniform:
\[\begin{equation*} \frac{\lvert \mathcal{A}_{\epsilon}^n \rvert}{\lvert \mathcal{U}^n \rvert} \sim \frac{2^{n \cdot H(U)}}{r^n} = 2^{-n (\log_2 r - H(U))} \end{equation*}\]
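For a small binary source, these bounds and the ratio above can be checked by brute-force enumeration. The parameters below are illustrative; note that the lower bound is only guaranteed for sufficiently large \(n\), although it already holds at \(n = 16\) with these particular values.

```python
# Enumerate all 2^n binary sequences and count the epsilon-typical ones.
from itertools import product
from math import log2

p = {0: 0.8, 1: 0.2}                          # a biased binary source
H = -sum(q * log2(q) for q in p.values())     # H(U) is about 0.72 bits
n, eps = 16, 0.1

typical = sum(
    abs(-sum(log2(p[u]) for u in u_n) / n - H) <= eps
    for u_n in product(p, repeat=n)           # all r^n = 2^16 sequences
)

print(f"|A_eps^n|                        = {typical}")
print(f"upper bound 2^(n(H+eps))         = {2 ** (n * (H + eps)):.0f}")
print(f"lower bound (1-eps) 2^(n(H-eps)) = {(1 - eps) * 2 ** (n * (H - eps)):.0f}")
print(f"|A_eps^n| / |U^n|                = {typical / 2 ** n:.4f}")
```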
Hence, the reason Maximum Entropy inference works in the setting of Machine Learning and Data Compression is precisely the validity of the Asymptotic Equipartition Property.