Earlier today I was talking to a researcher about how well a normal distribution can approximate a uniform distribution over an interval \([a,b] \subset \mathbb{R}\). I gave a few arguments for why I thought a normal distribution wouldn’t be a good fit, but I didn’t have the exact answer off the top of my head, so I decided to work it out. Although the following analysis involves nothing fancy, I consider it useful: it’s easily generalised to higher dimensions (i.e. multivariate uniform distributions), and we arrive at a result which I wouldn’t consider intuitive.

For those who appreciate numerical experiments, I wrote a small TensorFlow script to accompany this blog post.

Statement of the problem:

We would like to minimise the KL-Divergence:

\begin{equation} \mathcal{D}_{KL}(P \| Q) = \int_{-\infty}^\infty p(x) \ln \frac{p(x)}{q(x)}\,dx \end{equation}

where \(P\) is the target uniform distribution and \(Q\) is the approximating Gaussian:

\begin{equation} p(x)= \frac{1}{b-a} \mathbb{1}_{[a,b]}(x) \implies p(x) = 0 \ \text{for} \ x \notin [a,b] \end{equation}


\begin{equation} q(x)= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}} \end{equation}

Now, using the convention \(\lim_{x \to 0^+} x\ln(x) = 0\) (so the integrand vanishes outside \([a,b]\), where \(p(x) = 0\)) and treating \((a,b)\) as fixed, our loss may be expressed in terms of \(\mu\) and \(\sigma\):

\begin{equation} \begin{split} \mathcal{L}(\mu,\sigma) & = \int_{a}^b p(x) \ln \frac{p(x)}{q(x)}\,dx \\
& = -\ln(b-a) + \frac{1}{2}\ln(2\pi\sigma^2) + \frac{\frac{1}{3}(b^3-a^3)-\mu(b^2-a^2)+\mu^2(b-a)}{2\sigma^2(b-a)} \end{split} \end{equation}
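
As a sanity check on the algebra, the closed form can be compared against direct numerical integration. Here is a minimal sketch using NumPy and SciPy (my own addition, not part of the accompanying script; the test values are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

# Arbitrary test values, purely illustrative.
a, b, mu, sigma = 0.0, 1.0, 0.3, 0.4

def integrand(x):
    # p(x) * ln(p(x) / q(x)) on [a, b], where p is U(a, b) and q is N(mu, sigma^2)
    p = 1.0 / (b - a)
    q = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return p * np.log(p / q)

numeric, _ = quad(integrand, a, b)
closed = (-np.log(b - a) + 0.5 * np.log(2 * np.pi * sigma ** 2)
          + ((b ** 3 - a ** 3) / 3 - mu * (b ** 2 - a ** 2) + mu ** 2 * (b - a))
          / (2 * sigma ** 2 * (b - a)))
print(numeric, closed)  # the two values should agree to numerical precision
```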

Minimising with respect to \(\mu\) and \(\sigma\):

We can easily show that the mean and variance of the Gaussian which minimises \(\mathcal{L}(\mu,\sigma)\) correspond to the mean and variance of a uniform distribution over \([a,b]\):

\begin{equation} \frac{\partial}{\partial \mu} \mathcal{L}(\mu,\sigma) = \frac{2\mu}{2\sigma^2} - \frac{(a+b)}{2\sigma^2} = 0 \implies \mu = \frac{a+b}{2} \end{equation}

\begin{equation} \frac{\partial}{\partial \sigma} \mathcal{L}(\mu,\sigma) \bigg|_{\mu = \frac{a+b}{2}} = \frac{1}{\sigma}-\frac{\frac{1}{3}(b^2+ab+a^2)-\frac{1}{4}(a+b)^2}{\sigma^3} = 0 \implies \sigma^2 = \frac{(b-a)^2}{12} \end{equation}
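
For readers who would rather not check the calculus by hand, the stationarity of this point can be verified symbolically. A small sketch using SymPy (again my own addition, not from the original post):

```python
import sympy as sp

a, b, mu, sigma = sp.symbols('a b mu sigma', positive=True)
L = (-sp.log(b - a) + sp.Rational(1, 2) * sp.log(2 * sp.pi * sigma ** 2)
     + ((b ** 3 - a ** 3) / 3 - mu * (b ** 2 - a ** 2) + mu ** 2 * (b - a))
     / (2 * sigma ** 2 * (b - a)))

# Substitute the claimed optimum and check that both partial derivatives vanish.
opt = {mu: (a + b) / 2, sigma: (b - a) / (2 * sp.sqrt(3))}
print(sp.simplify(sp.diff(L, mu).subs(opt)))     # -> 0
print(sp.simplify(sp.diff(L, sigma).subs(opt)))  # -> 0
```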

Although I wouldn’t have guessed this result in advance, it is simply moment matching: the optimal Gaussian has the same mean and variance as \(\mathcal{U}(a,b)\), and the careful reader will notice that it readily generalises to higher dimensions.
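
To spell out the general statement I have in mind (a standard moment-matching fact, stated here as an aside): for any target \(P\) with finite second moments, the Gaussian minimising the forward KL-Divergence matches the first two moments of \(P\),

\begin{equation} \mu^* = \mathbb{E}_P[X], \qquad \Sigma^* = \mathrm{Cov}_P[X] \end{equation}

since \(\mathcal{D}_{KL}(P \| Q) = -H(P) - \mathbb{E}_P[\ln q(X)]\) and only the cross-entropy term depends on \((\mu, \Sigma)\). For the uniform distribution on \([a,b]\) these moments are exactly \(\frac{a+b}{2}\) and \(\frac{(b-a)^2}{12}\).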

Analysing the loss at the optimal Gaussian:

After substituting the optimal values of \(\mu\) and \(\sigma\) into \(\mathcal{L}(\mu,\sigma)\), the \(\ln(b-a)\) terms cancel and we are left with the residual loss:

\begin{equation} \mathcal{L}^* = -\ln(b-a) + \frac{1}{2}\ln\Big(\frac{2\pi(b-a)^2}{12}\Big) + \frac{1}{2} = \frac{1}{2}\Big(\ln \big(\frac{\pi}{6}\big)+1\Big) \approx .18 \end{equation}

I find this result surprising because I didn’t expect the dependence on the interval width \(\Delta = b-a\) to vanish. That said, my current intuition for this result is that if we invert the optimal solution and express \([a,b]\) in terms of \(\mu\) and \(\sigma\) we obtain:

\begin{equation} [a,b] = [\mu - \sqrt{3}\sigma, \mu + \sqrt{3}\sigma] \end{equation}

so this minimisation problem corresponds to a linear re-scaling of the uniform parameters in terms of \(\mu\) and \(\sigma\): the interval width is absorbed entirely into \(\sigma\), leaving nothing for the residual loss to depend on.
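
To make this scale-invariance concrete, the residual can be evaluated on intervals of very different widths. A quick sketch (the helper `residual_kl` is my own naming, not from the accompanying script):

```python
import numpy as np

def residual_kl(a, b):
    # KL(U(a,b) || N(mu*, sigma*^2)) evaluated at the moment-matched optimum
    mu, sigma2 = (a + b) / 2, (b - a) ** 2 / 12
    moment = ((b ** 3 - a ** 3) / 3 - mu * (b ** 2 - a ** 2)
              + mu ** 2 * (b - a)) / (b - a)
    return -np.log(b - a) + 0.5 * np.log(2 * np.pi * sigma2) + moment / (2 * sigma2)

for a, b in [(0.0, 1.0), (-5.0, 5.0), (2.0, 100.0)]:
    print(residual_kl(a, b))  # each ~0.177 = (ln(pi/6) + 1)/2, independent of b - a
```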


The reader may experiment with the following TensorFlow function, which outputs the approximating mean and variance of a Gaussian given a uniform distribution on the interval \([a,b]\).
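
The accompanying script itself isn’t reproduced here, but below is a minimal sketch of what such a function could look like (assuming TensorFlow 2.x; the function name, initialisation and hyper-parameters are my own choices). Rather than reading off the analytic solution, it minimises the closed-form loss derived above by gradient descent:

```python
import math
import tensorflow as tf

def fit_gaussian_to_uniform(a, b, steps=2000, lr=1e-2):
    """Minimise KL(U(a, b) || N(mu, sigma^2)) over mu and sigma by gradient descent.

    A sketch only; the original accompanying script may differ.
    Returns the fitted mean and variance.
    """
    a, b = float(a), float(b)
    mu = tf.Variable(a)           # deliberately poor initial guess
    log_sigma = tf.Variable(0.0)  # parametrise sigma > 0 through its logarithm
    opt = tf.keras.optimizers.Adam(learning_rate=lr)

    for _ in range(steps):
        with tf.GradientTape() as tape:
            sigma2 = tf.exp(2.0 * log_sigma)
            # Closed-form E_p[(x - mu)^2] over [a, b], as derived above.
            moment = ((b ** 3 - a ** 3) / 3.0 - mu * (b ** 2 - a ** 2)
                      + mu ** 2 * (b - a)) / (b - a)
            loss = (-math.log(b - a) + 0.5 * tf.math.log(2.0 * math.pi * sigma2)
                    + moment / (2.0 * sigma2))
        grads = tape.gradient(loss, [mu, log_sigma])
        opt.apply_gradients(zip(grads, [mu, log_sigma]))

    return mu.numpy(), tf.exp(2.0 * log_sigma).numpy()

# Expected output: mu -> (a + b)/2 = 0.5 and sigma^2 -> (b - a)^2 / 12 ~ 0.083
print(fit_gaussian_to_uniform(0.0, 1.0))
```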