Introducing the method of level sets:

The notion of level sets was developed in order to address two unresolved problems. First, we would like a necessary and sufficient mathematical abstraction for comparing the asymptotic rates of entropy production of distinct sequences of random variables used to model rare events (e.g. the event that a number is prime). Second, we would like to use the same abstraction to deduce the typical probability that a rare event occurs in a given interval. From a frequentist perspective, we may think of these as a priori probabilities that a rare event is observed.

As a measure of the efficacy of this method, it is applied to Cramér’s conjecture, an open problem in probabilistic number theory.

Classifying rare events using level sets:

Broadly speaking, applied mathematicians are interested in rare events that are recurrent, of variable frequency, and unpredictable. By unpredictable, I mean that the associated binary sequences are not finite-state compressible. But how might we classify rare events that satisfy these criteria?

If we identify rare events using the indicator function \(1_X: \mathbb{N} \rightarrow \{0,1\}\), we may analyse the rate of entropy production of the computable binary sequence \(X_n = \{x_i\}_{i=1}^n \in \{0,1\}^n\). In order to qualify as a rare event, a randomly sampled element \(x_z\) of the sequence \(X_n\) must generally pass \(\pi_x(n)\) tests. These tests \(\{A_i\}_{i=1}^{\pi_x(n)}\) are pairwise de-correlated, as rare events are assumed to be independent of each other:

\begin{equation} \forall z \sim U([1,n]), P(z \in A_i \land z \in A_{j \neq i}) = P(z \in A_i ) \cdot P(z \in A_{j \neq i}) \end{equation}

where the \(P(z \in A_i)\) have a natural frequentist interpretation. We generally assume that the tests are consistent, so \(\{A_i\}_{i=1}^{\pi_x(n)} \subset \{A_i\}_{i=1}^{\pi_x(n+1)}\), where \(\pi_{x}(n)\) counts the number of rare events in the time interval \([1,n]\) and is therefore monotonically increasing.

Given (1), we may define the average probability density using the independence of the tests:

\begin{equation} \forall z \sim U([1,n]), P(x_z =1) = P(z \in \bigcap_{k \leq \pi_x(n)} A_k) = \prod_{k=1}^{\pi_x(n)} P(z \in A_k) \end{equation}

as well as the entropy:

\begin{equation} H(X_n) = -\sum_{z =1}^n \frac{1}{n} \cdot \ln P(x_z =1) = -\ln \prod_{k=1}^{\pi_x(n)} P(z \in A_k) \end{equation}

which is a measure of expected surprise, or the expected information gained from observing a rare event.
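As a toy illustration of the definitions above, the following sketch computes the typical probability and the entropy for a set of hypothetical independent tests (the pass probabilities below are illustrative choices, not drawn from any real model):

```python
import math

# Hypothetical independent tests: test A_i is passed with probability 1 - 1/p_i.
# (The values of p_i are illustrative, not taken from any real dataset.)
test_probs = [1 - 1 / p for p in (3, 5, 7, 11)]

# By independence, the typical probability that a uniformly sampled z
# passes every test is the product of the individual probabilities.
p_rare = math.prod(test_probs)

# The entropy is the negative log of that product: the expected
# surprise (in nats) of observing a rare event.
entropy = -math.log(p_rare)
print(p_rare, entropy)
```

Note that the entropy decomposes additively over the tests, which is what makes the level-set comparison of distinct sequences tractable.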

Having defined the entropy (3), we may compare the entropies of distinct sequences using the typical probabilities \(q_n \in (0,1)\), which allow us to define an equivalence relation over entropies:

\begin{equation} \mathcal{L}_{q_n} = \{X_n \in \{0,1\}^{\infty}: H(X_n) \sim \ln \big(\frac{1}{q_n}\big) \} \end{equation}

Finally, (4) implies that:

\begin{equation} \forall X_n \in \mathcal{L_{q_n}}, \lim_{n \to \infty} -\frac{\ln q_n}{H(X_n)} = 1 \end{equation}

so the entropy rates of distinct sequences \(X_n, X'_n \in \mathcal{L_{q_n}}\) are asymptotically equivalent. From a data compression perspective, this tells us that in the asymptotic limit we need \(\sim \ln \big(\frac{1}{q_n}\big)\) nats (i.e. bits measured in base \(e\)) in order to compress a particular class of rare events.

An interesting consequence is that the rare event counting function, as defined here, generally satisfies:

\begin{equation} \pi_x(N) \sim N \cdot q_N \end{equation}

for large \(N \in \mathbb{N}\).

Statistical generalisations of the Prime Number Theorem:

From statistical considerations, Gauss and Legendre inferred that the density of primes at \(n \in \mathbb{N}\) is on the order of \(\frac{1}{\ln n}\). This led them to conjecture that the number of primes less than \(N\) is on the order of:

\begin{equation} \pi(N) \sim \int_{2}^N \frac{1}{\ln x} dx \sim N \cdot \frac{1}{\ln N} \end{equation}

which is equivalent to the Prime Number Theorem.
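This estimate is easy to check numerically. The sketch below (a simple sieve plus a midpoint-rule integral; all function names are my own) compares \(\pi(N)\) with both \(\int_2^N \frac{dx}{\ln x}\) and \(N/\ln N\):

```python
import math

def prime_count(n):
    """pi(n): count primes <= n via a sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(n) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return sum(sieve)

def li(n, steps=10_000):
    """Midpoint-rule approximation of the integral of 1/ln(x) from 2 to n."""
    h = (n - 2) / steps
    return sum(h / math.log(2 + (i + 0.5) * h) for i in range(steps))

n = 10**6
print(prime_count(n), li(n), n / math.log(n))
# pi(10^6) = 78498; the logarithmic integral is the closer of the two estimates.
```

The logarithmic integral tracks \(\pi(N)\) to within a fraction of a percent here, while \(N/\ln N\) is only asymptotically correct.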

However, given our assumptions we are more generally interested in the family of density functions that satisfy:

\begin{equation} 0 < c \ll N, \int_{c}^N \frac{1}{f(x)} dx \sim N \cdot \frac{1}{f(N)} \end{equation}

where \(f\) is analytic and \(\lim_{N \to \infty} \frac{1}{f(N)}=0\). Using integration by parts, this is equivalent to the criterion:

\begin{equation} \int_{c}^N g(x) dx = N \cdot g(N) - c \cdot g(c) - \int_{c}^N x \cdot g'(x) dx \sim N \cdot g(N) \end{equation}

where \(g(x)=\frac{1}{f(x)}\), and \(g(x)\) dominates \(x \cdot g'(x)\); indeed, for \(g(x) = (\ln x)^{-a}\) we have \(x \cdot g'(x) = -a \cdot (\ln x)^{-a-1} = -\frac{a}{\ln x} \cdot g(x)\). So this family naturally includes density functions of the form:

\begin{equation} \exists a > 0 \forall x \in \mathbb{R}_+, g(x) = (\ln x)^{-a} \end{equation}

which we may call the log-zeta distribution in the sense that the un-normalised zeta distribution is given by:

\begin{equation} \forall s >1, \zeta(s) = \zeta_s \circ \mathbb{N} = \sum_{n \in \mathbb{N}} \frac{1}{n^s} \end{equation}

and if we log-transform the state-space \(\mathbb{N}\) we have:

\begin{equation} \forall s >1, \zeta_s \circ \ln \big(\mathbb{N}\big) = \sum_{n \in \mathbb{N}} \frac{1}{(\ln n)^s} \end{equation}

where the logarithm transports us from the space of integers to the space of bits measured in base \(e\), as the information content of integers is asymptotically:

\begin{equation} \forall n \in \mathbb{N}, K_U(n) \sim \log_2(n) \end{equation}

as demonstrated by Chaitin [7].
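As a quick sanity check on this asymptotic, we may use the binary length of an integer as a stand-in for the length of a shortest description (an upper bound on its Kolmogorov complexity):

```python
import math

# The number of bits needed to write n in binary grows like log2(n),
# matching the asymptotic information content K(n) ~ log2(n) of
# typical (incompressible) integers.
for n in (10, 10**3, 10**6, 10**9):
    print(n, n.bit_length(), round(math.log2(n), 2))
```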

The advantage of using the exponential of entropy:

It is worth noting that the entropy rate \(\frac{1}{q_n}\) corresponds to the exponential of entropy:

\begin{equation} \frac{1}{q_n} \approx e^{H(X_n)} \end{equation}

which has the advantage of being parametrisation invariant, i.e. it does not depend on the choice of base for the logarithm. This generally allows us to simplify calculations. Moreover, it is a robust measure of the expected number of observations that we need in order to collapse the entropy \(H(X_n)\). Equivalently, it is a robust measure of the average distance between rare events.

On the other hand, the typical probability \(q_n\) corresponds to the density of rare events at the instant \(n \in \mathbb{N}\). It may also be understood as the frequency with which we collapse the entropy \(H(X_n)\) at the instant \(n \in \mathbb{N}\). We may also observe that the typical probability \(q_n\) and entropy rate \(\frac{1}{q_n}\) are both parametrisation invariant as they are multiplicative inverses of each other. Together they form the basis for the method of level sets which provides us with the necessary and sufficient apparatus for classifying sequences of rare events in terms of their rate of entropy production.
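A small simulation illustrates the point that \(e^{H} = \frac{1}{q}\) measures the average distance between rare events (the value of \(q\) below is an arbitrary choice, and the run is seeded for reproducibility):

```python
import math
import random

random.seed(0)
q = 0.01                       # typical probability of a rare event (arbitrary)
n = 1_000_000
events = [i for i in range(n) if random.random() < q]
gaps = [b - a for a, b in zip(events, events[1:])]
mean_gap = sum(gaps) / len(gaps)

H = math.log(1 / q)            # entropy in nats
print(mean_gap, math.exp(H))   # both should be close to 100
```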

A remark on applications of the level set method:

In practice, the empirical data will consist of a binary sequence of rare-event observations, and the actual tests are generally unknown. Assuming that the events are unpredictable, we may use machine learning to evaluate the hypothesis that the events are causally independent, i.e. that any test inferred from the data has no generalisation power. However, a rare-event modeller may still be able to infer density functions like (9) by performing a regression analysis on the rare-event data.

Such a density function may be used to predict the occurrence of future events as well as calculating upper and lower-bounds on the waiting time between events. As a concrete demonstration, we shall consider the application of the level set method to an open problem in probabilistic number theory.

Application to Cramér’s random model, Part I:

Using the method of level sets, we may demonstrate that the typical probability that an integer in the interval \([1,n] \subset \mathbb{N}\) is prime is given by:

\begin{equation} \forall z \sim U([1,n]), P(z \in \mathbb{P}) \sim \frac{1}{\ln n} \end{equation}

If \(X_n = \{x_i\}_{i=1}^n \in \{0,1\}^n\) defines a prime encoding, i.e. a sequence where \(x_k = 1\) if \(k \in \mathbb{P}\) and \(x_k = 0\) otherwise, then we may define \(\pi(\sqrt{n})\) primality tests, since any composite integer in \([1,n]\) must have a prime factor that is at most \(\sqrt{n}\):

\begin{equation} \forall A_p \in \{A_{p_k}\}_{k=1}^{\pi(\sqrt{n})}, A_p = \{z \in [1,n]: \gcd(p,z) = 1\} \cup \{p\} \end{equation}

and we’ll note that \(\forall z \sim U([1,n]), P(z \in A_{p \geq \sqrt{n}}) = 1- \frac{1}{n} + \frac{1}{n}=1\) so we have:

\begin{equation} \forall z \sim U([1,n]), H(X_n) = -\ln \prod_{k=1}^{\pi(n)} P(z \in A_{p_k}) = -\ln \prod_{k=1}^{\pi(\sqrt{n})} P(z \in A_{p_k}) \end{equation}

In order to define \(P(z \in A_{p_k})\) we shall implicitly use the approximation that for \(p \leq \sqrt{n}, \frac{1}{p}-\frac{1}{n} \approx \frac{1}{p}\) so we have:

\begin{equation} \forall z \sim U([1,n]), P(z \in A_{p_k}) = \big(1-\frac{1}{p_k}\big) + \frac{1}{n} \approx \big(1-\frac{1}{p_k}\big) \end{equation}

which allows us to make (10) precise:

\begin{equation} H(X_n) = -\ln \prod_{p \leq \sqrt{n}} \big(1-\frac{1}{p}\big) \end{equation}

Using Mertens’ third theorem we have:

\begin{equation} \prod_{p \leq \sqrt{n}} \big(1-\frac{1}{p}\big) \approx \frac{e^{-\gamma}}{\frac{1}{2} \cdot \ln n} = \frac{2e^{-\gamma}}{\ln n} \approx \frac{1.12}{\ln n} \end{equation}

so we may infer that:

\begin{equation} H(X_n) \sim \ln \ln n \end{equation}

and therefore \(X_n \in \mathcal{L}_{q_n}\) where:

\begin{equation} \mathcal{L}_{q_n} = \{X_n \in \{0,1\}^{\infty}: H(X_n) \sim \ln \ln n\} \end{equation}
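Before moving on, the Mertens approximation and the resulting entropy estimate can be checked numerically (the sieve helper and all names below are my own; \(2e^{-\gamma} \approx 1.12\)):

```python
import math

def primes_upto(m):
    """Primes <= m via a sieve of Eratosthenes."""
    sieve = bytearray([1]) * (m + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(m) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, m + 1, p)))
    return [i for i, b in enumerate(sieve) if b]

n = 10**8
prod = math.prod(1 - 1 / p for p in primes_upto(math.isqrt(n)))

gamma = 0.5772156649015329                    # Euler-Mascheroni constant
mertens = 2 * math.exp(-gamma) / math.log(n)  # e^-gamma / ln(sqrt(n))
print(prod, mertens)                          # both ~ 0.061 for n = 10^8

entropy = -math.log(prod)
print(entropy, math.log(math.log(n)))         # ~ 2.8 vs ~ 2.9
```

Even at \(n = 10^8\) the agreement between \(H(X_n)\) and \(\ln \ln n\) is only a few percent, consistent with the slow convergence of Mertens’ product.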

From (15), we may deduce (8):

\begin{equation} \forall z \sim U([1,n]), q_n = P(z \in \mathbb{P}) \sim \frac{1}{\ln n} \end{equation}

which means that the density of the primes at \(x \in \mathbb{N}\) is on the order of \(\frac{1}{\ln x}\) and therefore the expected number of primes less than \(n\) is given by:

\begin{equation} \pi(n) \sim \int_{2}^n \frac{1}{\ln x} dx \sim \frac{n}{\ln n} \end{equation}

in complete agreement with the Prime Number Theorem.

Application to Cramér’s random model, Part II:

Given the prime encoding \(X_n\), let’s define the \(n\)th prime gap \(G_n = p_{n+1} - p_n\) so we may consider the cumulative probability:

\begin{equation} \sum_{k=1}^{G_n} P(x_{p_n + k} = 1 \land p_{n+1}-p_n = k) = 1 \end{equation}

where \(G_n < p_n\) due to Bertrand’s postulate and \(P(p_{n+1}-p_n=k) = 1\) if and only if we simultaneously know the values of both \(p_n\) and \(p_{n+1}\).

As the subsequence \(\{x_{p_n+k}\}_{k=1}^{G_n}\) halts at the value \(k \in [1,G_n]\) where \(x_{p_n+k}=1\), there are at most \(G_n\) possible ways for this sequence to halt. Furthermore, in order to bound the halting probability \(P(x_{p_n + k} = 1 \land p_{n+1}-p_n = k)\) we may use the density formula (15) to consider its components:

\begin{equation} \forall k \sim U([1,G_n]), P(p_{n+1} = p_n + k) = P(p_n = p_{n+1} -k) \sim \frac{1}{\ln p_n} \end{equation}

\begin{equation} \forall k \sim U([1,G_n]),P(x_{p_n + k} = 1) \sim \frac{1}{\ln p_n} \end{equation}


\begin{equation} P(x_{p_n + k} = 1 \land p_{n+1}-p_n = k) \geq P(x_{p_n + k} = 1) \cdot P(p_{n+1} = p_n + k) \end{equation}

Using (19),(20) and (21) we may derive the lower bound:

\begin{equation} \forall k \sim U([1,G_n]),P(x_{p_n + k} = 1 \land p_{n+1}-p_n = k) \geq \frac{1}{(\ln p_n)^2} \end{equation}

which implies:

\begin{equation} \frac{1}{(\ln p_n)^2} \leq \frac{1}{G_n} \sum_{k=1}^{G_n} P(x_{p_n + k} = 1 \land p_{n+1}-p_n = k) = \frac{1}{G_n} \end{equation}

and since \(G_n = p_{n+1}-p_n\), we may conclude:

\begin{equation} p_{n+1}-p_n = \mathcal{O}((\ln p_n)^2) \end{equation}

as conjectured by Harald Cramér in 1936 [1].
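As a numerical illustration of the conjectured bound (the sieve helper is repeated for self-containment; the cutoff \(p \geq 100\) is my own choice, since the ratio is spuriously large for the smallest primes):

```python
import math

def primes_upto(m):
    """Primes <= m via a sieve of Eratosthenes."""
    sieve = bytearray([1]) * (m + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(m) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, m + 1, p)))
    return [i for i, b in enumerate(sieve) if b]

# Cramér's conjecture predicts that (p_{n+1} - p_n) / (ln p_n)^2 stays bounded.
ps = primes_upto(10**6)
ratios = [
    (q - p) / math.log(p) ** 2
    for p, q in zip(ps, ps[1:])
    if p >= 100
]
print(max(ratios))  # stays below 1 in this range
```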

Details of the \(\frac{1}{p} - \frac{1}{n} \approx \frac{1}{p}\) approximation in Cramér’s model:

If we define,

\begin{equation} \Lambda_k = \text{the set of all } k\text{-element subsets of } \{p_i\}_{i=1}^{\pi(\sqrt{n})} \end{equation}

then \(\lvert \Lambda_k \rvert = {\pi(\sqrt{n}) \choose k}\) and we may derive the absolute error:

\begin{equation} \Big\lvert \prod_{p \leq \sqrt{n}} \big(1-\frac{1}{p} + \frac{1}{n}\big) - \prod_{p \leq \sqrt{n}} \big(1-\frac{1}{p}\big) \Big\rvert \leq \sum_{k=1}^{\pi(\sqrt{n})-1} \frac{1}{n^k} \sum_{\lambda \in \Lambda_k} \prod_{p \in \lambda} \big(1-\frac{1}{p}\big) + \frac{1}{n^{\pi(\sqrt{n})}} \end{equation}

and given that:

\begin{equation} \pi(\sqrt{n}) < \sqrt{n} \implies \frac{1}{n^k} \cdot {\pi(\sqrt{n}) \choose k} \leq \frac{1}{n^{k/2}} \end{equation}

the absolute error satisfies the following inequality:

\begin{equation} \Delta_n = \Big\lvert \prod_{p \leq \sqrt{n}} \big(1- \frac{1}{p}+\frac{1}{n}\big) - \prod_{p \leq \sqrt{n}} \big(1- \frac{1}{p}\big) \Big\rvert \leq \frac{1}{\sqrt{n}} + \frac{1}{n} \end{equation}

and we find that the relative error converges to zero:

\begin{equation} \lim_{n \to \infty} \frac{\Delta_n}{\prod_{p \leq \sqrt{n}} \big(1- \frac{1}{p}\big)} \leq \lim_{n \to \infty} \big(\frac{\ln n}{\sqrt{n}} + \frac{\ln n}{n}\big) = 0 \end{equation}

which provides us with the necessary justification for our application to Cramér’s random model, specifically formulas (12), (13) and (14).
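The decay of the relative error can also be checked numerically (the sieve helper is repeated for self-containment):

```python
import math

def primes_upto(m):
    """Primes <= m via a sieve of Eratosthenes."""
    sieve = bytearray([1]) * (m + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(m) + 1):
        if sieve[p]:
            sieve[p * p :: p] = bytearray(len(range(p * p, m + 1, p)))
    return [i for i, b in enumerate(sieve) if b]

# Compare the shifted and exact products; the relative error should
# shrink roughly like ln(n)/sqrt(n).
rels = []
for n in (10**4, 10**6, 10**8):
    ps = primes_upto(math.isqrt(n))
    exact = math.prod(1 - 1 / p for p in ps)
    shifted = math.prod(1 - 1 / p + 1 / n for p in ps)
    rels.append(abs(shifted - exact) / exact)
    print(n, rels[-1])
```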


In this article, it is demonstrated that Cramér’s conjectured upper bound on prime gaps is essentially correct. However, Harald Cramér also conjectured that \((\ln p_n)^2\) defines the least upper bound on prime gaps, so we have:

\begin{equation} \limsup_{n \to \infty} \frac{p_{n+1}-p_n}{(\ln p_n)^2} = 1 \end{equation}

This problem of least upper-bounds is beyond the scope of the present article, but I believe information-theoretic methods may be used to probe this matter further.


  1. Cramér, H. (1936), “On the Order of Magnitude of the Difference Between Consecutive Prime Numbers”, Acta Arithmetica 2, 23–46.

  2. Hardy, G. H.; Ramanujan, S. (1917), “The normal number of prime factors of a number n”, Quarterly Journal of Mathematics.

  3. Turán, P. (1934), “On a theorem of Hardy and Ramanujan”, Journal of the London Mathematical Society.

  4. Zhao, Y. (2019), The Probabilistic Method in Combinatorics.

  5. Mertens, F. (1874), Journal für die reine und angewandte Mathematik 78.

  6. Legendre, A.-M. (1808), Essai sur la théorie des nombres (seconde édition), Courcier, Paris.

  7. Fortnow, L. (2000), Kolmogorov Complexity.