Statement of Shannon’s coding theorem:

Let \(\Sigma_1, \Sigma_2\) denote two finite alphabets and let \(\Sigma_1^*\) and \(\Sigma_2^*\) denote the sets of all finite words over these alphabets. Now, suppose that \(X\) is a random variable taking values in \(\Sigma_1\) and let \(f\) be a uniquely decodable code from \(\Sigma_1^*\) to \(\Sigma_2^*\):

\begin{equation} f: \Sigma_1^* \rightarrow \Sigma_2^* \end{equation}

where \(\Sigma_1 \subset \Sigma_1^*\) and \(\lvert \Sigma_2 \rvert = a\).

If we define \(S\) as an auxiliary random variable measuring the length of the codeword \(f(X)\):

\begin{equation} S := l(f(X)) \end{equation}

then the minimal expected word length for \(X\) over all uniquely decodable codes satisfies:

\begin{equation} \frac{H(X)}{\log_2 a} \leq \mathbb{E}[S] < \frac{H(X)}{\log_2 a} + 1 \end{equation}
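As a quick numeric illustration of the bound, the following Python sketch computes \(H(X)/\log_2 a\) for a small example distribution (the three-symbol distribution here is an assumption chosen for illustration, not taken from the text):

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative source distribution over Sigma_1 (assumed for this example).
probs = [0.5, 0.25, 0.25]
a = 2  # size of the code alphabet Sigma_2

lower = entropy_bits(probs) / math.log2(a)
print(lower)      # lower bound on E[S]  -> 1.5
print(lower + 1)  # strict upper bound on E[S]  -> 2.5
```

For a binary code alphabet the denominator is \(\log_2 2 = 1\), so the bound is just the entropy itself.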

Proof of Shannon’s coding theorem:

Assuming that \(|\Sigma_1| = n\), let \(p_i = P(X = x_i)\) and let \(s_i\) denote the length of the codeword assigned to each \(x_i \in \Sigma_1\). In particular, let’s choose \(s_i\) so that:

\begin{equation} s_i := \lceil -\log_a p_i \rceil \end{equation}

which implies that:

\begin{equation} -\log_a p_i \leq s_i < -\log_a p_i + 1 \end{equation}

and so, exponentiating the left-hand inequality \(s_i \geq -\log_a p_i\), we have:

\begin{equation} a^{-s_i} \leq p_i \end{equation}

which implies that:

\begin{equation} \sum_{i=1}^n a^{-s_i} \leq \sum_{i=1}^n p_i = 1 \end{equation}

and by Kraft’s inequality there exists a prefix-free code with these code lengths.
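The construction above can be checked numerically. This sketch computes the lengths \(s_i = \lceil -\log_a p_i \rceil\) for \(a = 2\) and verifies that their Kraft sum does not exceed 1 (the four-symbol distribution is an assumption for illustration):

```python
import math

def shannon_lengths(probs, a=2):
    """Code lengths s_i = ceil(-log_a p_i) from the construction above."""
    return [math.ceil(-math.log(p, a)) for p in probs]

# Illustrative distribution (assumed for this example).
probs = [0.4, 0.3, 0.2, 0.1]
s = shannon_lengths(probs)
kraft = sum(2 ** -si for si in s)
print(s)      # code lengths -> [2, 2, 3, 4]
print(kraft)  # Kraft sum, guaranteed <= 1 -> 0.6875
```

Since the Kraft sum is at most 1, a prefix-free code with exactly these lengths exists.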

The expected word length of this code then satisfies:

\begin{equation} \mathbb{E}[S] = \sum_{i} p_i \cdot s_i < \sum_i p_i (-\log_a p_i + 1) = - \sum_i p_i \frac{\log_2 p_i}{\log_2 a} + 1 \end{equation}

which gives us the right hand side of the inequality:

\begin{equation} \mathbb{E}[S] < \frac{H(X)}{\log_2 a} + 1 \end{equation}

and, for any uniquely decodable code, Kraft’s inequality combined with Gibbs’ inequality gives the lower bound:

\begin{equation} \mathbb{E}[S] \geq - \sum_i p_i \frac{\log_2 p_i}{\log_2 a} \end{equation}

from which we may deduce (3):

\begin{equation} \frac{H(X)}{\log_2 a} \leq \mathbb{E}[S] < \frac{H(X)}{\log_2 a} + 1 \end{equation}

which simplifies to:

\begin{equation} H(X) \leq \mathbb{E}[S] < H(X) + 1 \end{equation}

when \(a=2\).
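Putting the two bounds together, this sketch verifies \(H(X) \leq \mathbb{E}[S] < H(X) + 1\) for the binary Shannon code on an example distribution (again an assumed distribution, chosen for illustration):

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def shannon_expected_length(probs):
    """E[S] for the binary (a=2) Shannon code with s_i = ceil(-log2 p_i)."""
    return sum(p * math.ceil(-math.log2(p)) for p in probs)

# Illustrative distribution (assumed for this example).
probs = [0.4, 0.3, 0.2, 0.1]
H = entropy_bits(probs)
ES = shannon_expected_length(probs)
print(H <= ES < H + 1)  # -> True
```

Here \(\mathbb{E}[S] = 2.4\) bits per symbol, which lands between \(H(X) \approx 1.846\) and \(H(X) + 1\), as the theorem guarantees.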


  1. C. E. Shannon, “A Mathematical Theory of Communication”, Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, July, October 1948.

  2. D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge: Cambridge University Press, 2003.