My greatest concern was what to call it. I thought of calling it ‘information’, but the word was overly used, so I decided to call it ‘uncertainty’. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.’ – Claude Shannon

Introduction:

In my conversations with other scientists, it appears that most can do sophisticated calculations with entropy. However, I think a deep conceptual understanding is more fundamental, since it is what allows one to recognise the role entropy plays in different situations.

Here, I will begin by introducing the Boltzmann entropy with a concrete example before moving to the more nuanced notions of entropy due to Shannon and John Wheeler.

Boltzmann’s notion of entropy as a measure of the number of possible arrangements:

Let’s suppose we have a cubic box of fluid with sides of length 1 cm, and that \(k=100\) nano-sized particles happen to be suspended in this fluid. Since each 1 cm side corresponds to roughly \(10^7\) nanometre-sized positions, there are about:

\begin{equation} N = (10^7)^3 = 10^{21} \end{equation}

distinct addresses for each particle.

Now, if there are 5 different kinds of particles with equal numbers of each kind (i.e. 20 particles per kind), we may use the multinomial formula to count the number of distinct arrangements:

\begin{equation} W = \frac{10^{21}!}{(10^{21}-100)!(20!)^5} \end{equation}

and we note that \(\log_2 W\) bits suffice to encode any particular arrangement. Moreover, since \(\log_2(\cdot)\) is strictly increasing, this measure is consistent with Boltzmann’s notion of entropy as a measure of the number of possible arrangements.
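For a sense of scale, here is a minimal Python sketch of this computation (the variable names and the term-by-term summation are my own; the physical set-up is exactly the one above). The factorials are far too large to evaluate directly, so we accumulate \(\log_2 W\) instead:

```python
import math

N = 10**21                  # (10^7)^3 distinct nanometre-scale addresses in the box
k = 100                     # total number of suspended particles
species = 5                 # number of particle kinds
per_species = k // species  # 20 particles of each kind

# log2( N! / (N - k)! ) = log2(N) + log2(N - 1) + ... + log2(N - k + 1)
log2_placements = sum(math.log2(N - j) for j in range(k))

# divide out the permutations within each species: (20!)^5
log2_W = log2_placements - species * math.log2(math.factorial(per_species))

print(f"log2 W ~ {log2_W:.0f} bits")  # roughly 6.7 thousand bits
```

So a few thousand bits are enough to specify an arrangement exactly, even though \(W\) itself is astronomically large.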

Shannon’s notion of entropy as the average amount of information gained from observing a random variable:

If we push Boltzmann’s notion of entropy a bit further, we may recover the Shannon entropy most scientists are familiar with.

Let’s consider the number of arrangements associated with a system \(X\) consisting of \(k\) particle species, where there are \(N_i\) particles of species \(i\) and:

\begin{equation} N = \sum_{i=1}^k N_i \end{equation}

particles in total.

If we assume that there are also \(N\) distinct particle locations then we have:

\begin{equation} W = \frac{N!}{\prod_{i=1}^k N_i!} \end{equation}

possible arrangements.
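As a quick sanity check with small numbers of my own choosing (not part of the set-up above): with \(k=2\) species and \(N_1 = N_2 = 2\), the formula gives

\begin{equation} W = \frac{4!}{2!\,2!} = 6 \end{equation}

which matches a direct enumeration of the six ways to place two particles of one kind and two of the other over four locations.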

This motivates us to introduce the Shannon entropy as the average amount of information gained from observing \(X\):

\begin{equation} H(X) = \frac{\log_2 W}{N} = \frac{1}{N} \log_2 \big(\frac{N!}{\prod_{i=1}^k N_i!}\big) \end{equation}

and using Stirling’s log-factorial approximation, \(\ln n! \approx n \ln n - n\), together with the fact that \(\sum_{i=1}^k N_i = N\) (so the linear terms cancel), we find:

\begin{equation} H(X) \approx -\sum_{i=1}^k \frac{N_i}{N} \log_2 \frac{N_i}{N} \end{equation}

and if we define the frequencies \(p_i = \frac{N_i}{N}\), we recover the usual Shannon entropy \(H(\cdot)\):

\begin{equation} H(X) \approx -\sum_{i=1}^k p_i \log_2 (p_i) \end{equation}

so we now have a measure of our statistical uncertainty associated with the actual state of \(X\).
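A minimal numerical sketch, using toy species counts of my own choosing, shows how close the exact per-particle entropy and its Stirling approximation already are for a modest number of particles:

```python
import math

counts = [500, 300, 150, 50]  # hypothetical N_i for k = 4 species
N = sum(counts)

def log2_factorial(n):
    # log2(n!) via the log-gamma function, so large factorials never materialise
    return math.lgamma(n + 1) / math.log(2)

# exact per-particle entropy: (1/N) * log2( N! / prod_i N_i! )
exact = (log2_factorial(N) - sum(log2_factorial(n) for n in counts)) / N

# Shannon entropy of the frequencies p_i = N_i / N (the Stirling approximation)
shannon = -sum((n / N) * math.log2(n / N) for n in counts)

print(f"exact per-particle entropy : {exact:.4f} bits")
print(f"Shannon entropy (Stirling) : {shannon:.4f} bits")
```

With these counts the two values differ only in the second decimal place, and the gap shrinks as the \(N_i\) grow.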

The ‘it from bit’ approach to doing science:

Now, we may move on to a view of entropy due to Wheeler which is compatible with that of Shannon and Boltzmann but a bit more developed from an epistemological standpoint. In this view, the world does not exist independently of our scientific inquiries. It emerges from a sequence of yes/no questions, or what scientists would call experiments.

For concreteness, let’s suppose a scientist would like to test a number of machine learning models \(\{M_i\}_{i=1}^N\) in a sequential manner. If each model \(M_i\) requires a finite number of experiments to be validated, we may further assume that each experiment has a probability \(0<p<1\) of success, so we have:

\begin{equation} S_N = \sum_{i=1}^{N-1} \sum_{n=M_i}^{M_{i+1}} P(\text{experiment n}|M_i) \end{equation}

where it is implicitly assumed that each \(M_i\) is identified with the unique integer indexing its first experiment, so that \(S_N\) may be read as the expected number of successful experiments across the whole sequence.

In practice, the probabilities \(P(\text{experiment n}|M_i)\) are difficult to quantify, and each experiment will either succeed or fail from the perspective of the experimentalist. If we view each experiment as a yes/no question in the manner of John Wheeler, then each set of experiments has \(2^{M_{i+1}-M_i}\) possible outcomes. Furthermore, if the sets of experiments are uncorrelated, then the total number of possible outcomes is:

\begin{equation} W = 2^{\sum_{i=1}^{N-1} (M_{i+1}-M_i)} \end{equation}

where the average amount of information gained from validating each model is:

\begin{equation} \frac{\log_2 W}{N} \end{equation}

so if \(\sum_{i=1}^{N-1} (M_{i+1}-M_i) = \lambda N\), then \(\frac{\log_2 W}{N} = \lambda\), where \(\lambda\) is the average number of experiments per model. This is consistent with Wheeler’s view of entropy as a measure of the average number of bits (i.e. yes/no questions) associated with each model.
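The identity is nearly immediate, but a short sketch with hypothetical experiment counts (toy values of my own choosing) makes the bookkeeping explicit:

```python
import random

# hypothetical experiment counts n_i = M_{i+1} - M_i for N models
random.seed(0)
N = 50
experiments_per_model = [random.randint(1, 10) for _ in range(N)]

# each experiment is a yes/no question, so there are W = 2^(sum of n_i) possible outcome strings
total_experiments = sum(experiments_per_model)
log2_W = total_experiments            # log2( 2^(sum n_i) ) = sum n_i
bits_per_model = log2_W / N           # average information gained per validated model
lam = total_experiments / N           # lambda: average number of experiments per model

print(bits_per_model == lam)          # True: the two quantities coincide, as claimed
```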

Put differently, the statistical uncertainty associated with the performance of a particular machine learning model is proportional to the number of experiments required to validate it, so entropy is also a good proxy measure of model complexity.

Conclusion:

What we have gained from this analysis is the understanding that there are several complementary views of entropy. We may, like Boltzmann, use entropy as a tool for counting arrangements. Shannon then pushed this notion further by developing entropy as a measure of statistical uncertainty. Finally, Wheeler showed that entropy may be used to give us a holistic understanding of how scientific knowledge emerges, so we may view entropy as a measure of our epistemic uncertainty.

The power of Wheeler’s formulation may be appreciated from the insight that there is a direct correspondence between phenomena that appear random from the vantage point of a particular scientific theory and the epistemic limits of that theory. When we say that a process is random we are ultimately referring to epistemic uncertainty.

References:

  1. Dénes Petz. Entropy, von Neumann and the von Neumann entropy. 2001.
  2. Olivier Rioul. This is IT: A Primer on Shannon’s Entropy and Information. Séminaire Poincaré. 2018.
  3. John A. Wheeler. Information, physics, quantum: The search for links. In W. Zurek (ed.), Complexity, Entropy, and the Physics of Information. Redwood City, CA: Addison-Wesley. 1990.