Suppose we have an organism \(\mathcal{O}\) that can measure an observable \(X\), modelled by a discrete random variable generated by a system \(\hat{X}\) with possible values \(X_1, \dots, X_N\). Assuming that \(\mathcal{O}\) has a finite number of behavioural and sensory states and that the organism has so far observed \(O = \{O_i\}_{i=1}^M \in \hat{X}^M\), a plausible model of its world is given by:

\begin{equation} P(O_{M+1}|\{O_i\}_{i=1}^M) \end{equation}

\begin{equation} \sum_{i=1}^N P(O_{M+1} = X_i|\{O_i\}_{i=1}^M) = 1 \end{equation}

assuming that this world is one-step Markov.

Now, under ergodic assumptions, the event that occurs with frequency:

\begin{equation} p_i = P(O_{M+1} = X_i | \{O_i\}_{i=1}^M) \end{equation}

would happen \(T \cdot p_i\) times where:

\begin{equation} T = \max_i \frac{1}{p_i} \end{equation}

is the recurrence time of \(\hat{X}\).
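As a minimal sketch of the recurrence-time argument, the snippet below computes \(T = \max_i \frac{1}{p_i}\) and the expected counts \(T \cdot p_i\). The function names and the three-state distribution are hypothetical, chosen only for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def recurrence_time(probs):
    """Return T = max_i 1/p_i, the waiting time set by the rarest event."""
    return max(1 / p for p in probs)


def expected_counts(probs):
    """Return T * p_i for each event: expected occurrences in a window of length T."""
    T = recurrence_time(probs)
    return [T * p for p in probs]


# Hypothetical three-state distribution, chosen only for illustration.
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
T = recurrence_time(probs)
counts = expected_counts(probs)
print(T, counts)
```

For this distribution the window has length 4: the two rare events are each expected once and the common event twice.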

Given that an event that occurs with frequency \(p_i\) generally requires modelling a sequence of length \(\frac{1}{p_i}\), in order to encode the structure of such a rare event, the organism would generally need a number of bits proportional to:

\begin{equation} \ln \big(\frac{1}{p_i}\big) = - \ln p_i \end{equation}

and we note that:

\begin{equation} p_i \to 0 \implies \ln \big(\frac{1}{p_i}\big) \to \infty \end{equation}

since modelling rare events generally requires a number of observations that is inversely proportional to their frequency.
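The growth of \(-\ln p_i\) as \(p_i \to 0\) can be checked numerically. This is only a sketch: `surprisal` is a hypothetical helper name, and because the natural logarithm is used the result is in nats (dividing by \(\ln 2\) would give bits).

```python
import math


def surprisal(p):
    """Return -ln(p), the code length (in nats) associated with an event of probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p)


# Hypothetical frequencies: each event is ten times rarer than the previous one.
for p in (0.1, 0.01, 0.001):
    print(p, surprisal(p))
```

Each tenfold drop in frequency adds the same constant, \(\ln 10\), to the code length, so the cost grows without bound but only logarithmically in \(\frac{1}{p_i}\).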

However, given that the process is assumed to be ergodic and the memory of an organism is finite, an asymptotically optimal encoding would use the expected number of bits:

\begin{equation} -p_i \ln p_i \end{equation}

in order to encode an event that occurs with frequency \(p_i\).

Given (7), we may deduce that, from the perspective of \(\mathcal{O}\), the memory requirements associated with encoding the probabilistic structure of the system \(\hat{X}\) are proportional to:

\begin{equation} H(\hat{X}) := - \sum_{i=1}^N P(O_{M+1} = X_i | O) \ln P(O_{M+1} = X_i | O) = - \sum_{i=1}^N p_i \ln p_i \end{equation}

which is what most scientists know as Shannon entropy.
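A minimal sketch of this quantity in code, under one loud assumption: the predictive probabilities \(p_i\) are replaced by the empirical frequencies of a hypothetical observation sequence, which is only one possible way for an organism to estimate them. The helper names are illustrative.

```python
import math
from collections import Counter


def empirical_distribution(observations):
    """Estimate p_i as the relative frequency of each observed state."""
    counts = Counter(observations)
    total = len(observations)
    return {state: count / total for state, count in counts.items()}


def shannon_entropy(probs):
    """Return H = -sum_i p_i ln p_i in nats, skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical observation history O_1, ..., O_M with M = 8.
observations = ["a", "b", "a", "c", "a", "b", "a", "a"]
dist = empirical_distribution(observations)
H = shannon_entropy(dist.values())
print(dist, H)
```

Terms with \(p_i = 0\) are skipped, following the usual convention that \(0 \ln 0 = 0\).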

I would like to add that even if an organism \(\mathcal{O}\) lived in a deterministic universe, a probabilistic model of its environment would be not only practical but also a useful representation of its epistemic uncertainty, given realistic assumptions of bounded information-processing resources, large and complex state and action spaces, and partial observability.

In this sense, the Shannon entropy is a robust measure of an agent’s epistemic uncertainty relative to what is really going on. This is clear when you consider that \(H(\cdot)\) is maximised by the uniform distribution:

\begin{equation} H(\hat{X}) \leq \ln N \end{equation}

with equality when \(p_i = \frac{1}{N}\) for all \(i\), which corresponds to a situation where the organism has no idea what is going on.
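The bound can be sanity-checked numerically. The sketch below compares the entropy of the uniform distribution with that of randomly generated distributions; it is an illustration rather than a proof, and the value \(N = 6\), the seed, and the helper names are arbitrary assumptions.

```python
import math
import random


def shannon_entropy(probs):
    """Return H = -sum_i p_i ln p_i in nats, skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def random_distribution(n, rng):
    """Draw n positive weights and normalise them so they sum to one."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


N = 6                    # hypothetical number of states
rng = random.Random(0)   # fixed seed so the run is repeatable
bound = math.log(N)
uniform = [1 / N] * N

samples = [random_distribution(N, rng) for _ in range(1000)]
max_sampled = max(shannon_entropy(p) for p in samples)
print(shannon_entropy(uniform), max_sampled, bound)
```

The uniform distribution attains \(\ln N\) up to floating-point error, while every sampled distribution stays at or below it.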

