In the following analysis, we develop an intuition for the Asymptotic Equipartition Theorem, a fundamental cornerstone of information theory, before formally proving that it describes the compressibility of sequences emitted by a stochastic source.
Suppose we toss a coin that comes up heads with probability \(p = \frac{k}{n}\). How many bits would we typically need to encode \(k\) heads in a sequence of \(n\) coin flips? Using Stirling’s formula for the factorial and keeping only the exponentially dominant terms, we find:
\[\begin{equation} {n \choose k} = \frac{n!}{k!(n-k)!} \sim \frac{(\frac{n}{e})^n}{(\frac{k}{e})^{k} \cdot (\frac{n-k}{e})^{n-k}} = \frac{n^n}{k^k \cdot (n-k)^{n-k}} \tag{1} \end{equation}\]
which simplifies to:
\[\begin{equation} {n \choose k} \sim \big(\frac{k}{n}\big)^{-k} \cdot \big(\frac{n-k}{n}\big)^{-(n-k)} = p^{-p\cdot n} \cdot (1-p)^{-(1-p) \cdot n} = \big( 2^{-p \log_2 p} \cdot 2^{-(1-p) \cdot \log_2 (1-p)} \big)^n = 2^{n \cdot H(p)} \tag{2} \end{equation}\]
It follows that if we model a sequence of coin flips where the coin is observed to come up heads with probability \(p\), then we require about \(\log_2 {n \choose k} \approx n \cdot H(p)\) bits to specify which of the length-\(n\) sequences containing exactly \(k\) heads occurred.
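As a quick numerical sanity check (not part of the original derivation), the short Python sketch below compares \(\log_2 {n \choose k}\) with \(n \cdot H(p)\); the helper name binary_entropy and the chosen values of \(n\) and \(p\) are purely illustrative.

```python
# Illustrative check that log2 C(n, k) is well approximated by n * H(k/n) for large n.
from math import comb, log2

def binary_entropy(p: float) -> float:
    """Binary entropy H(p) in bits, with the convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for n in (100, 1_000, 10_000):
    k = n // 4                          # heads observed with frequency p = 1/4
    exact = log2(comb(n, k))            # bits needed to index a k-heads sequence
    approx = n * binary_entropy(k / n)  # the entropy approximation
    print(f"n = {n:>6}: log2 C(n,k) = {exact:9.2f}, n*H(p) = {approx:9.2f}")
```

The gap between the two columns grows only like \(O(\log n)\), which is negligible relative to \(n \cdot H(p)\).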
There are four key notions:
Memoryless source, \(U\): the source symbols \(U_1, U_2, \dots, U_n\) are drawn i.i.d. from \(U\).
Alphabet: \(\mathcal{U} = \{\alpha_i\}_{i=1}^r\) specifies the possible values that each symbol \(U_i\) can take, where the size of \(\mathcal{U}\) is given by \(\lvert \mathcal{U} \rvert = r\).
Source sequence: \(U^n = (U_1,...,U_n)\) denotes the n-tuple that specifies a sequence of \(n\) source symbols whereas \(\mathcal{U}^n\) indicates the set of all possible source sequences of length \(n\).
Probability: the probability assigned to a source sequence \(U^n\) is given by \(P(U^n) = \prod_{i=1}^n P(U_i)\).
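To make these notions concrete, here is a minimal sketch of a memoryless source in Python; the alphabet, the distribution probs, and the helper names are assumptions chosen purely for illustration.

```python
# A toy memoryless source over the alphabet {a, b, c} (illustrative values).
import random
from math import prod

probs = {"a": 0.5, "b": 0.3, "c": 0.2}   # P(U = alpha_i) for each alpha_i in U

def sample_sequence(n: int) -> list[str]:
    """Draw a source sequence U^n = (U_1, ..., U_n) i.i.d. from U."""
    return random.choices(list(probs), weights=list(probs.values()), k=n)

def sequence_probability(u_n: list[str]) -> float:
    """P(U^n) = prod_i P(U_i), by memorylessness."""
    return prod(probs[u] for u in u_n)

u_n = sample_sequence(10)
print(u_n, sequence_probability(u_n))
```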
For a fixed \(\epsilon > 0\), the source sequence \(U^n\) is said to be \(\epsilon\)-typical if:
\[\begin{equation} \big\lvert -\frac{1}{n} \log_2 P(U^n) - H(U) \big\rvert \leq \epsilon \tag{3} \end{equation}\]
If we let \(\mathcal{A}_{\epsilon}^n\) denote the \(\epsilon\)-typical set, then we may derive the equivalent definition:
\[\begin{equation} \big\lvert -\frac{1}{n} \log_2 P(U^n) - H(U) \big\rvert \leq \epsilon \iff H(U) - \epsilon \leq -\frac{1}{n} \log_2 P(U^n) \leq H(U) + \epsilon \tag{4} \end{equation}\]
which simplifies to:
\[\begin{equation} 2^{-n \cdot (H(U) + \epsilon)} \leq P(U^n) \leq 2^{-n \cdot (H(U) -\epsilon)} \tag{5} \end{equation}\]
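Continuing the toy source above, the following sketch implements the \(\epsilon\)-typicality test in (3); the distribution and the value of \(\epsilon\) are again illustrative choices.

```python
# Test whether a sampled sequence is epsilon-typical, as in (3).
import random
from math import log2

probs = {"a": 0.5, "b": 0.3, "c": 0.2}          # same toy source as above
H = -sum(q * log2(q) for q in probs.values())   # H(U) = E[-log2 P(U)]

def is_typical(u_n: list[str], eps: float) -> bool:
    """True iff |-(1/n) log2 P(u^n) - H(U)| <= eps."""
    neg_log_rate = -sum(log2(probs[u]) for u in u_n) / len(u_n)
    return abs(neg_log_rate - H) <= eps

u_n = random.choices(list(probs), weights=list(probs.values()), k=1_000)
print(is_typical(u_n, eps=0.05))
```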
The Asymptotic Equipartition Property then states that source sequences are \(\epsilon\)-typical with probability approaching one:
\[\begin{equation} \forall \epsilon > 0, \lim_{n \to \infty} P(U^n \in \mathcal{A}_{\epsilon}^n) = 1 \tag{*} \end{equation}\]
We may observe that:
\[\begin{equation*} P(U^n \in \mathcal{A}_{\epsilon}^n) = P(\big\lvert -\frac{1}{n} \log_2 P(U^n) - H(U) \big\rvert \leq \epsilon) = P(\big\lvert \frac{1}{n} \sum_{i=1}^n -\log_2 P(U_i) - H(U) \big\rvert \leq \epsilon) \end{equation*}\]
and the result follows from the Law of Large Numbers since:
\[\begin{equation*} H(U) = \mathbb{E}[-\log_2 P(U)] \end{equation*}\]
QED.
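A Monte Carlo experiment makes (*) tangible: the fraction of sampled sequences that land in \(\mathcal{A}_{\epsilon}^n\) approaches one as \(n\) grows. The source, \(\epsilon\), and the number of trials below are illustrative choices, not prescribed by the theorem.

```python
# Estimate P(U^n in A_eps^n) by sampling; it should tend to 1 as n grows.
import random
from math import log2

probs = {"a": 0.5, "b": 0.3, "c": 0.2}
H = -sum(q * log2(q) for q in probs.values())
eps, trials = 0.05, 500

for n in (10, 100, 1_000, 5_000):
    hits = 0
    for _ in range(trials):
        u_n = random.choices(list(probs), weights=list(probs.values()), k=n)
        rate = -sum(log2(probs[u]) for u in u_n) / n
        hits += abs(rate - H) <= eps
    print(f"n = {n:>5}: estimated P(U^n in A) ~ {hits / trials:.3f}")
```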
We may also bound the size of the typical set: \(\forall \epsilon > 0\) and for sufficiently large \(n\), \((1-\epsilon) \cdot 2^{n \cdot (H(U) -\epsilon)} \leq \lvert \mathcal{A}_{\epsilon}^n \rvert \leq 2^{n \cdot (H(U) +\epsilon)}\)
Proof of the upper bound:
\[\begin{equation*} 1 \geq P(U^n \in \mathcal{A}_{\epsilon}^n) = \sum_{u^n \in \mathcal{A}_{\epsilon}^n} P(u^n) \geq \sum_{u^n \in \mathcal{A}_{\epsilon}^n} 2^{-n \cdot (H(U) +\epsilon)} = 2^{-n \cdot (H(U) +\epsilon)} \cdot \lvert \mathcal{A}_{\epsilon}^n \rvert \end{equation*}\]
Hence,
\[\begin{equation*} \lvert \mathcal{A}_{\epsilon}^n \rvert \leq 2^{n \cdot (H(U) +\epsilon)} \end{equation*}\]
As for the lower bound, the AEP gives \(P(U^n \in \mathcal{A}_{\epsilon}^n) \geq 1 - \epsilon\) for sufficiently large \(n\), so that:
\[\begin{equation*} 1- \epsilon \leq P(U^n \in \mathcal{A}_{\epsilon}^n) = \sum_{u^n \in \mathcal{A}_{\epsilon}^n} P(u^n) \leq \sum_{u^n \in \mathcal{A}_{\epsilon}^n} 2^{-n \cdot (H(U) -\epsilon)} = 2^{-n \cdot (H(U) -\epsilon)} \cdot \lvert \mathcal{A}_{\epsilon}^n \rvert \end{equation*}\]
which yields:
\[\begin{equation*} \lvert \mathcal{A}_{\epsilon}^n \rvert \geq (1-\epsilon) \cdot 2^{n \cdot (H(U) -\epsilon)} \end{equation*}\]
QED.
Finally, we may note that \(\mathcal{A}_{\epsilon}^n\) is exponentially smaller than \(\mathcal{U}^n\) except when \(P(U)\) is uniform:
\[\begin{equation*} \frac{\lvert \mathcal{A}_{\epsilon}^n \rvert}{\lvert \mathcal{U}^n \rvert} \sim \frac{2^{n \cdot H(U)}}{r^n} = 2^{-n (\log_2 r - H(U))} \end{equation*}\]
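For a small binary source, these bounds and the ratio above can be checked by brute-force enumeration. The parameters below are illustrative; note that the lower bound is only guaranteed for sufficiently large \(n\), although it already holds at \(n = 16\) with these particular values.

```python
# Enumerate all 2^n binary sequences and count the epsilon-typical ones.
from itertools import product
from math import log2

p = {0: 0.8, 1: 0.2}                          # a biased binary source
H = -sum(q * log2(q) for q in p.values())     # H(U) is about 0.72 bits
n, eps = 16, 0.1

typical = sum(
    abs(-sum(log2(p[u]) for u in u_n) / n - H) <= eps
    for u_n in product(p, repeat=n)           # all r^n = 2^16 sequences
)

print(f"|A_eps^n|                        = {typical}")
print(f"upper bound 2^(n(H+eps))         = {2 ** (n * (H + eps)):.0f}")
print(f"lower bound (1-eps) 2^(n(H-eps)) = {(1 - eps) * 2 ** (n * (H - eps)):.0f}")
print(f"|A_eps^n| / |U^n|                = {typical / 2 ** n:.4f}")
```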
Hence, the reason Maximum Entropy inference works in the setting of Machine Learning and Data Compression is precisely the validity of the Asymptotic Equipartition Property.