## Statement of Shannon’s coding theorem:

Let $$\Sigma_1, \Sigma_2$$ denote two finite alphabets and let $$\Sigma_1^*$$ and $$\Sigma_2^*$$ denote the sets of all finite words over these alphabets. Now, suppose that $$X$$ is a random variable taking values in $$\Sigma_1$$ and let $$f$$ be a uniquely decodable code from $$\Sigma_1^*$$ to $$\Sigma_2^*$$:

\begin{equation} f: \Sigma_1^* \rightarrow \Sigma_2^* \end{equation}

where $$\Sigma_1 \subset \Sigma_1^*$$ and $$\lvert \Sigma_2 \rvert = a$$.

If we define $$S$$ as an auxiliary random variable measuring the length of the codeword $$f(X)$$:

\begin{equation} S = \ell(f(X)) \end{equation}

where $$\ell(w)$$ denotes the length of a word $$w \in \Sigma_2^*$$, then the minimal expected word length over all uniquely decodable codes for $$X$$ satisfies:

\begin{equation} \frac{H(X)}{\log_2 a} \leq \mathbb{E}[S] < \frac{H(X)}{\log_2 a} + 1 \end{equation}
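To make the statement concrete, here is a minimal Python sketch that computes the two bounds for a binary code alphabet, i.e. $$a = 2$$; the four-symbol distribution is an assumption chosen purely for illustration:

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical source distribution over four symbols (assumption for illustration).
probs = [0.5, 0.25, 0.125, 0.125]
a = 2  # size of the code alphabet Sigma_2

lower = entropy_bits(probs) / math.log2(a)  # H(X) / log2(a)
upper = lower + 1
print(f"{lower:.3f} <= E[S] < {upper:.3f}")  # prints "1.750 <= E[S] < 2.750"
```

Any optimal uniquely decodable binary code for this source therefore has expected codeword length between 1.75 and 2.75 symbols.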

## Proof of Shannon’s coding theorem:

Assuming that $$|\Sigma_1| = n$$, write $$p_i := \mathbb{P}(X = x_i)$$ and let $$s_i$$ denote the codeword length $$\ell(f(x_i))$$ for each $$x_i \in \Sigma_1$$. In particular, let’s choose $$s_i$$ so that:

\begin{equation} s_i := \lceil -\log_a p_i \rceil \end{equation}

which implies that:

\begin{equation} -\log_a p_i \leq s_i < -\log_a p_i + 1 \end{equation}

and so we have $$p_i \approx a^{-s_i}$$; in particular, the left-hand inequality gives:

\begin{equation} a^{-s_i} \leq p_i \end{equation}

which implies that:

\begin{equation} \sum_{i=1}^n a^{-s_i} \leq \sum_{i=1}^n p_i = 1 \end{equation}

and by Kraft’s inequality there exists a prefix-free code with these code lengths.
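The length assignment and the Kraft check can be sketched in Python. The distribution below is an assumption for illustration, and `shannon_code` follows the classical constructive proof behind Kraft's inequality: sort the probabilities in decreasing order and take the first $$s_i$$ base-$$a$$ digits of each cumulative probability:

```python
import math

def shannon_lengths(probs, a=2):
    """Code lengths s_i = ceil(-log_a p_i)."""
    return [math.ceil(-math.log(p, a)) for p in probs]

def kraft_sum(lengths, a=2):
    """Left-hand side of Kraft's inequality: sum_i a^(-s_i)."""
    return sum(a ** -s for s in lengths)

def shannon_code(probs, a=2):
    """Prefix-free code via the classical construction: sort probabilities
    in decreasing order, then encode symbol i by the first s_i base-a
    digits of the cumulative probability F of all more-likely symbols."""
    lengths = shannon_lengths(probs, a)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    code, F = {}, 0.0
    for i in order:
        digits, frac = [], F
        for _ in range(lengths[i]):      # truncated base-a expansion of F
            frac *= a
            digits.append(str(int(frac)))
            frac -= int(frac)
        code[i] = "".join(digits)
        F += probs[i]
    return code, lengths

probs = [0.4, 0.3, 0.2, 0.1]  # hypothetical source distribution
code, lengths = shannon_code(probs)
print(lengths)             # [2, 2, 3, 4]
print(kraft_sum(lengths))  # 0.6875 <= 1, so a prefix-free code exists
print(code)                # {0: '00', 1: '01', 2: '101', 3: '1110'}
```

Note that none of the resulting codewords is a prefix of another, as the Kraft argument guarantees.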

It follows that, with this choice of code lengths, the expected word length satisfies:

\begin{equation} \mathbb{E}[S] = \sum_{i} p_i \cdot s_i < \sum_i p_i (-\log_a p_i + 1) = - \sum_i p_i \frac{\log_2 p_i}{\log_2 a} + 1 \end{equation}

which gives us the right hand side of the inequality:

\begin{equation} \mathbb{E}[S] < \frac{H(X)}{\log_2 a} + 1 \end{equation}

Conversely, Kraft’s inequality together with Gibbs’ inequality shows that every uniquely decodable code satisfies:

\begin{equation} \mathbb{E}[S] \geq - \sum_i p_i \frac{\log_2 p_i}{\log_2 a} \end{equation}

from which we may deduce (3):

\begin{equation} \frac{H(X)}{\log_2 a} \leq \mathbb{E}[S] < \frac{H(X)}{\log_2 a} + 1 \end{equation}

which simplifies to:

\begin{equation} H(X) \leq \mathbb{E}[S] < H(X) + 1 \end{equation}

when $$a=2$$.
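As a final sanity check, the sketch below (again with a hypothetical distribution and $$a = 2$$) verifies numerically that the expected length of the code built from $$s_i = \lceil -\log_a p_i \rceil$$ falls inside the interval of equation (3):

```python
import math

probs = [0.4, 0.3, 0.2, 0.1]  # hypothetical source distribution
a = 2                          # binary code alphabet
lengths = [math.ceil(-math.log(p, a)) for p in probs]  # s_i = ceil(-log_a p_i)

H = -sum(p * math.log2(p) for p in probs)         # entropy H(X) in bits
E_S = sum(p * s for p, s in zip(probs, lengths))  # expected word length E[S]

lower = H / math.log2(a)
assert lower <= E_S < lower + 1  # the sandwich bound of the theorem holds
print(f"H(X) = {H:.4f} bits, E[S] = {E_S:.2f}")
```

Here $$H(X) \approx 1.85$$ bits while $$\mathbb{E}[S] = 2.4$$, comfortably inside $$[H(X), H(X) + 1)$$.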

1. C. E. Shannon, “A Mathematical Theory of Communication”, Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, July, October 1948.

2. David J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge: Cambridge University Press, 2003.