Motivation:

Earlier today I was talking to a researcher about how well a normal distribution could approximate a uniform distribution over an interval $[a,b]$. I gave a few arguments for why I thought a normal distribution wouldn't be a good approximation, but I didn't have the exact answer off the top of my head so I decided to find out. Although the following analysis involves nothing fancy, I consider it useful as it's easily generalised to higher dimensions (i.e. multivariate uniform distributions) and we arrive at a result which I wouldn't consider intuitive.

For those who appreciate numerical experiments, I wrote a small TensorFlow script to accompany this blog post.

Statement of the problem:

We would like to minimise the KL-Divergence:

\begin{equation} \mathcal{D}_{KL}(P \| Q) = \int_{-\infty}^\infty p(x) \ln \frac{p(x)}{q(x)}\,dx \end{equation}

where $p$ is the density of the target uniform distribution $P$ and $q$ is the density of the approximating Gaussian $Q$:

\begin{equation} p(x)= \frac{1}{b-a} \mathbb{1}_{[a,b]}(x) \implies p(x \notin [a,b]) = 0 \end{equation}

and

\begin{equation} q(x)= \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}} \end{equation}

Now, given that $p(x)$ vanishes outside $[a,b]$, if we assume the interval $[a,b]$ is fixed, our loss may be expressed in terms of $\mu$ and $\sigma$:

\begin{equation} \begin{split} \mathcal{L}(\mu,\sigma) & = \int_{a}^b p(x) \ln \frac{p(x)}{q(x)}dx \\
& = -\ln(b-a) + \frac{1}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2(b-a)}\int_a^b (x-\mu)^2 dx \\
& = -\ln(b-a) + \frac{1}{2}\ln(2\pi\sigma^2) + \frac{\frac{1}{3}(b^3-a^3)-\mu(b^2-a^2)+\mu^2(b-a)}{2\sigma^2(b-a)} \end{split} \end{equation}
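As a quick sanity check, the closed-form expression above can be compared against a brute-force numerical estimate of $\mathcal{D}_{KL}(P \| Q)$. Here is a minimal NumPy sketch with arbitrary example values for $a$, $b$, $\mu$ and $\sigma$ (these particular numbers are just for illustration):

```python
# Minimal sanity check (illustrative values): compare the closed-form loss
# with a numerical estimate of KL(P || Q) for P = Uniform[a, b], Q = N(mu, sigma^2).
import numpy as np

a, b = -1.0, 2.0      # example interval
mu, sigma = 0.3, 1.1  # example Gaussian parameters

# Closed-form loss derived above
closed_form = (-np.log(b - a)
               + 0.5 * np.log(2 * np.pi * sigma**2)
               + ((b**3 - a**3) / 3 - mu * (b**2 - a**2) + mu**2 * (b - a))
               / (2 * sigma**2 * (b - a)))

# E_p[ln p(x) - ln q(x)] approximated on a dense grid over [a, b]
x = np.linspace(a, b, 200_001)
log_p = -np.log(b - a)
log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
numerical = np.mean(log_p - log_q)

print(closed_form, numerical)  # the two values should agree to several decimals
```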

Minimising $\mathcal{L}$ with respect to $\mu$ and $\sigma$:

We can easily show that the mean and variance of the Gaussian which minimises $\mathcal{L}$ correspond to the mean and variance of a uniform distribution over $[a,b]$:

\begin{equation} \frac{\partial}{\partial \mu} \mathcal{L}(\mu,\sigma) = \frac{2\mu}{2\sigma^2} - \frac{(b+a)}{2\sigma^2} = 0 \implies \mu = \frac{a+b}{2} \end{equation}

\begin{equation} \frac{\partial}{\partial \sigma} \mathcal{L}(\mu,\sigma) = \frac{1}{\sigma}-\frac{\frac{1}{3}(b^2+a^2+ab)-\frac{1}{4}(b+a)^2}{\sigma^3} = 0 \implies \sigma^2 = \frac{(b-a)^2}{12} \end{equation}
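Writing out the simplification behind the last implication:

\begin{equation} \frac{1}{3}(b^2+a^2+ab)-\frac{1}{4}(b+a)^2 = \frac{4(a^2+ab+b^2)-3(a+b)^2}{12} = \frac{a^2-2ab+b^2}{12} = \frac{(b-a)^2}{12} \end{equation}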

Although I wouldn’t have guessed this result, the careful reader will notice that it readily generalises to higher dimensions.

Analysing the loss at the optimal Gaussian:

After substituting the optimal values of $\mu$ and $\sigma$ into $\mathcal{L}$ and simplifying the resulting expression, we have the following residual loss:

\begin{equation} \mathcal{L}^* = \frac{1}{2}\Big(\ln \big(\frac{\pi}{6}\big)+1\Big) \approx 0.18 \end{equation}
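To see where this number comes from, substitute $\mu = \frac{a+b}{2}$ and $\sigma^2 = \frac{(b-a)^2}{12}$ back into $\mathcal{L}$: the third term reduces to $\frac{\sigma^2}{2\sigma^2} = \frac{1}{2}$, and the $\ln(b-a)$ contributions cancel:

\begin{equation} \mathcal{L}^* = -\ln(b-a) + \frac{1}{2}\ln\Big(\frac{2\pi(b-a)^2}{12}\Big) + \frac{1}{2} = \frac{1}{2}\ln\Big(\frac{\pi}{6}\Big) + \frac{1}{2} \end{equation}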

I find this result surprising because I didn’t expect the dependence on $a$ and $b$ to vanish. That said, my current intuition for this result is that if we express the interval $[a,b]$ in terms of the optimal $\mu$ and $\sigma$ we obtain:

\begin{equation} [a,b] = [\mu - \sqrt{3}\sigma, \mu + \sqrt{3}\sigma] \end{equation}

so this minimisation problem corresponds to a linear re-scaling of the uniform parameters in terms of $\mu$ and $\sigma$.
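Concretely, the re-scaling is obtained by inverting the optimality conditions:

\begin{equation} \mu = \frac{a+b}{2}, \qquad \sigma = \frac{b-a}{2\sqrt{3}} \implies a = \mu - \sqrt{3}\sigma, \qquad b = \mu + \sqrt{3}\sigma \end{equation}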

Remark:

The reader may experiment with the following TensorFlow function, which outputs the approximating mean and variance of a Gaussian given a uniform distribution on the interval $[a,b]$.
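The sketch below is an illustration of what such a function might look like (assuming TensorFlow 2.x and gradient descent on the closed-form loss derived above), rather than the original script accompanying the post:

```python
# A minimal sketch: fit a Gaussian to Uniform[a, b] by gradient descent on the
# closed-form KL loss. Function name and hyperparameters are illustrative.
import math
import tensorflow as tf

def fit_gaussian_to_uniform(a, b, steps=2000, lr=0.05):
    mu = tf.Variable((a + b) / 2 + 1.0)   # deliberately offset initial guess
    log_sigma = tf.Variable(0.0)          # parametrise sigma > 0 via its log
    opt = tf.keras.optimizers.Adam(lr)

    for _ in range(steps):
        with tf.GradientTape() as tape:
            sigma2 = tf.exp(2.0 * log_sigma)
            # KL(P || Q) for P = Uniform[a, b], Q = N(mu, sigma^2), as derived above
            loss = (-tf.math.log(b - a)
                    + 0.5 * tf.math.log(2.0 * math.pi * sigma2)
                    + ((b**3 - a**3) / 3.0 - mu * (b**2 - a**2) + mu**2 * (b - a))
                    / (2.0 * sigma2 * (b - a)))
        grads = tape.gradient(loss, [mu, log_sigma])
        opt.apply_gradients(zip(grads, [mu, log_sigma]))

    return mu.numpy(), tf.exp(2.0 * log_sigma).numpy()

# Expected output: mean ~ (a+b)/2 = 1.0, variance ~ (b-a)^2 / 12 = 0.75
print(fit_gaussian_to_uniform(a=-0.5, b=2.5))
```

Running the optimisation should recover the analytical values $\mu = \frac{a+b}{2}$ and $\sigma^2 = \frac{(b-a)^2}{12}$ up to numerical precision.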