Kepler Lounge
The math journal of Aidan Rocke
<h1>Probability in High Dimension Part II (2019-04-21)</h1>
<h2 id="motivation">Motivation:</h2>
<p>A couple weeks ago I was working on a problem that involved the expected value of a ratio of two random variables:</p>
<p>\begin{equation}
\mathbb{E}\big[\frac{X_n}{Z_n}\big] \approx \frac{\mu_{X_n}}{\mu_{Z_n}} - \frac{\mathrm{Cov}(X_n,Z_n)}{\mu_{Z_n}^2} + \frac{\mathrm{Var(Z_n)}\mu_{X_n}}{\mu_{Z_n}^3}
\end{equation}</p>
<p>where <script type="math/tex">Z_n</script> was defined as a sum of <script type="math/tex">n</script> i.i.d. random variables with a symmetric distribution centred at zero.</p>
<p>Everything about this approximation worked fine in computer simulations where <script type="math/tex">n</script> was large but mathematically there appeared to be a problem since:</p>
<p>\begin{equation}
\mathbb{E}\big[Z_n\big] = 0
\end{equation}</p>
<p>Given that (2) didn’t appear to be an issue in simulation, I went through the code several times to check whether there was an error
but found none. After thinking about the problem for a bit longer it occurred to me formalise the problem and analyse:</p>
<p>\begin{equation}
P(\sum_{n=1}^N a_n = 0)
\end{equation}</p>
<p>where <script type="math/tex">a_n</script> are i.i.d. random variables with a uniform distribution centred at zero so <script type="math/tex">\mathbb{E}[a_i]=0</script>. We may think of this as a measure-theoretic phenomenon in high-dimensional spaces where <script type="math/tex">N \in \mathbb{N}</script> is our dimension and <script type="math/tex">\vec{a} \in \mathbb{R}^N</script> is a random vector.</p>
<p>Now, while in a <a href="https://keplerlounge.com/probability/2019/04/20/high-dimension-prob-1.html">previous article</a> I analysed this problem as an infinite series for the special case of <script type="math/tex">a_i \sim \mathcal{U}(\{-1,1\})</script>, for the more general case of
<script type="math/tex">a_i \sim \mathcal{U}([-N,N])</script> where <script type="math/tex">[-N,N] \subset \mathbb{Z}</script> it occurred to me that modelling this problem as a random walk on <script type="math/tex">\mathbb{Z}</script> might be an effective
approach.</p>
<h2 id="a-random-walk-on-mathbbz">A random walk on <script type="math/tex">\mathbb{Z}</script>:</h2>
<p>Let’s suppose <script type="math/tex">a_i \sim \mathcal{U}([-N,N])</script> where <script type="math/tex">[-N,N] \subset \mathbb{Z}</script>. We may then define:</p>
<p>\begin{equation}
S_n = \sum_{i=1}^n a_i
\end{equation}</p>
<p>Due to the i.i.d. assumption we have:</p>
<p>\begin{equation}
\mathbb{E}\big[S_n\big]= n \cdot \mathbb{E}\big[a_i\big]=0
\end{equation}</p>
<p>We may now define:</p>
<p>\begin{equation}
u_n = P(S_n=0)
\end{equation}</p>
<p>and ask whether <script type="math/tex">u_n</script> is decreasing. In other words, what is the probability that we observe the expected value as <script type="math/tex">n</script> becomes large?</p>
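<p>Before analysing <script type="math/tex">u_n</script> on paper, it’s easy to tabulate it exactly for small <script type="math/tex">N</script> by convolving the step distribution with itself. The sketch below is my own code (not part of the original analysis) and uses rational arithmetic so no floating-point error enters:</p>

```python
from fractions import Fraction

def pmf_S(n: int, N: int) -> dict:
    """Exact pmf of S_n = a_1 + ... + a_n, a_i ~ U({-N, ..., N}),
    computed by repeated convolution (dynamic programming)."""
    step = Fraction(1, 2 * N + 1)
    dist = {0: Fraction(1)}  # S_0 = 0 with probability 1
    for _ in range(n):
        new = {}
        for s, p in dist.items():
            for a in range(-N, N + 1):
                new[s + a] = new.get(s + a, Fraction(0)) + p * step
        dist = new
    return dist

def u(n: int, N: int) -> Fraction:
    """u_n = P(S_n = 0)."""
    return pmf_S(n, N).get(0, Fraction(0))

values = [u(n, 3) for n in range(1, 7)]
# u_1 = u_2 = 1/(2N+1); thereafter u_n strictly decreases.
assert values[0] == values[1] == Fraction(1, 7)
assert all(values[i] > values[i + 1] for i in range(1, len(values) - 1))
```

<p>Note that <script type="math/tex">u_n</script> is only non-increasing at the very start: <script type="math/tex">u_1 = u_2 = \frac{1}{2N+1}</script>, and it decreases strictly from then on.</p>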
<h2 id="small-and-large-deviations">Small and Large deviations:</h2>
<p>It’s useful to observe the following nested structure:</p>
<p>\begin{equation}
\forall k \in [0,N], \{\lvert S_n \rvert \leq k\} \subset \{\lvert S_n \rvert \leq k+1 \}
\end{equation}</p>
<p>From (7), we may deduce that:</p>
<p>\begin{equation}
P(\lvert S_n \rvert \leq N) + P(\lvert S_n \rvert > N) = 1
\end{equation}</p>
<p>So we are now ready to define the probability of a ‘small’ deviation:</p>
<p>\begin{equation}
\alpha_n = P(\lvert S_n \rvert \leq N)
\end{equation}</p>
<p>as well as the probability of ‘large’ deviations:</p>
<p>\begin{equation}
\beta_n = P(\lvert S_n \rvert > N)
\end{equation}</p>
<p>Additional motivation for analysing <script type="math/tex">\alpha_n</script> and <script type="math/tex">\beta_n</script> arises from:</p>
<p>\begin{equation}
P(S_{n+1}=0 \mid \lvert S_n \rvert > N) = 0
\end{equation}</p>
<p>\begin{equation}
P(S_{n+1}=0 \mid \lvert S_n \rvert \leq N) = \frac{1}{2N+1}
\end{equation}</p>
<p>Furthermore, by the law of total probability we have:</p>
<p>\begin{equation}
\begin{split}
P(S_{n+1}=0) & = P(S_{n+1}=0 \mid \lvert S_n \rvert \leq N) \cdot P(\lvert S_n \rvert \leq N) + P(S_{n+1}=0 \mid \lvert S_n \rvert > N) \cdot P(\lvert S_n \rvert > N) \\
& = P(S_{n+1}=0 \mid \lvert S_n \rvert \leq N) \cdot P(\lvert S_n \rvert \leq N) \\
& = \frac{P(\lvert S_n \rvert \leq N)}{2N+1}
\end{split}
\end{equation}</p>
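<p>This recursion can be checked exactly. The sketch below (my own code, with function names of my choosing) compares <script type="math/tex">P(S_{n+1}=0)</script> with <script type="math/tex">P(\lvert S_n \rvert \leq N)/(2N+1)</script> in exact rational arithmetic:</p>

```python
from fractions import Fraction

def pmf_S(n: int, N: int) -> dict:
    """Exact pmf of S_n for i.i.d. steps uniform on {-N, ..., N}."""
    step = Fraction(1, 2 * N + 1)
    dist = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for s, p in dist.items():
            for a in range(-N, N + 1):
                new[s + a] = new.get(s + a, Fraction(0)) + p * step
        dist = new
    return dist

def check_recursion(n: int, N: int) -> bool:
    """Does P(S_{n+1} = 0) == P(|S_n| <= N) / (2N+1) hold exactly?"""
    dist_n = pmf_S(n, N)
    small = sum(p for s, p in dist_n.items() if abs(s) <= N)
    return pmf_S(n + 1, N).get(0, Fraction(0)) == small / (2 * N + 1)

assert all(check_recursion(n, N) for n in range(0, 5) for N in (1, 2, 3))
```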
<h2 id="a-remark-on-symmetry">A remark on symmetry:</h2>
<p>It’s useful to note the following alternative definitions of <script type="math/tex">\alpha_n</script> and <script type="math/tex">\beta_n</script> that emerge
due to symmetries intrinsic to the problem:</p>
<p>\begin{equation}
\beta_n = P(\lvert S_n \rvert > N) = 2 \cdot P(S_n > N) = 2 \cdot P(S_n < -N)
\end{equation}</p>
<p>\begin{equation}
\alpha_n = P(\lvert S_n \rvert \leq N) = 1-2 \cdot P(S_n > N)=1-2 \cdot P(S_n < -N)
\end{equation}</p>
<h2 id="the-case-of-n1-and-n2">The case of <script type="math/tex">n=1</script> and <script type="math/tex">n=2</script>:</h2>
<p>Given that <script type="math/tex">S_0=0</script>:</p>
<p>\begin{equation}
P(S_1=0)=\frac{P(\lvert S_0 \rvert \leq N)}{2N+1}= \frac{1}{2N+1}
\end{equation}</p>
<p>As for the case of <script type="math/tex">n=2</script>:</p>
<p>\begin{equation}
P(\lvert S_1 \rvert \leq N) =1 \implies P(S_2=0) = \frac{1}{2N+1}
\end{equation}</p>
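<p>Both of these values can be confirmed by exhaustive enumeration over all <script type="math/tex">(2N+1)^n</script> walks; the helper below is a minimal sketch of my own:</p>

```python
from itertools import product
from fractions import Fraction

def p_zero(n: int, N: int) -> Fraction:
    """P(S_n = 0) by exhaustive enumeration over all (2N+1)^n walks."""
    steps = range(-N, N + 1)
    hits = sum(1 for walk in product(steps, repeat=n) if sum(walk) == 0)
    return Fraction(hits, (2 * N + 1) ** n)

# P(S_1 = 0) = P(S_2 = 0) = 1/(2N+1), e.g. for N = 4:
assert p_zero(1, 4) == Fraction(1, 9)
assert p_zero(2, 4) == Fraction(1, 9)
```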
<h2 id="the-case-of-n3">The case of <script type="math/tex">n=3</script>:</h2>
<p>The case of <script type="math/tex">n=3</script> requires that we calculate:</p>
<p>\begin{equation}
P(S_3=0)=\frac{P(\lvert S_2 \rvert \leq N)}{2N+1}
\end{equation}</p>
<p>To evaluate <script type="math/tex">P(\lvert S_2 \rvert \leq N)</script> we first compute <script type="math/tex">P(S_2 > N)</script>:</p>
<p>\begin{equation}
\begin{split}
P(S_{2} > N) & = \sum_{i=1}^N P(S_{2} > N \mid S_1 = i) \cdot P( S_1 = i) \\
& = \sum_{i=1}^N \frac{i}{2N+1} \cdot \frac{1}{2N+1} \\
& = \frac{N \cdot (N+1)}{2 \cdot (2N+1)^2}
\end{split}
\end{equation}</p>
<p>and using the symmetry relation (15) together with (19) we may derive <script type="math/tex">P(\lvert S_{2} \rvert \leq N)</script>:</p>
<p>\begin{equation}
\begin{split}
P(\lvert S_{2} \rvert \leq N) & = 1 - 2 \cdot P(S_{2} > N) \\
& = 1- \frac{N \cdot (N+1)}{(2N+1)^2} \\
& = \frac{3N^2+3N+1}{(2N+1)^2} \sim \frac{3}{4}
\end{split}
\end{equation}</p>
<p>and so for <script type="math/tex">n=3</script> we have:</p>
<p>\begin{equation}
\begin{split}
P(S_{3} = 0) & = P(S_{3} = 0 | \lvert S_2 \rvert \leq N) \cdot P(\lvert S_2 \rvert \leq N) \\
& = \frac{3N^2+3N+1}{(2N+1)^3} \sim \frac{3}{8N}
\end{split}
\end{equation}</p>
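<p>A brute-force enumeration (my own sketch) confirms the closed form <script type="math/tex">P(S_3=0) = \frac{3N^2+3N+1}{(2N+1)^3}</script> for small <script type="math/tex">N</script>:</p>

```python
from itertools import product
from fractions import Fraction

def p3_zero(N: int) -> Fraction:
    """P(S_3 = 0) by brute force over all (2N+1)^3 triples of steps."""
    steps = range(-N, N + 1)
    hits = sum(1 for w in product(steps, repeat=3) if sum(w) == 0)
    return Fraction(hits, (2 * N + 1) ** 3)

for N in (1, 2, 3, 5, 8):
    assert p3_zero(N) == Fraction(3 * N**2 + 3 * N + 1, (2 * N + 1) ** 3)
```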
<h2 id="average-drift-or-why-ps_nk--ps_n--k1">Average drift or why <script type="math/tex">P(S_n=k) > P(S_n = k+1)</script>:</h2>
<p>It’s useful to note that we may decompose <script type="math/tex">S_n</script> into:</p>
<p>\begin{equation}
S_n = S_n^+ + S_n^{-}
\end{equation}</p>
<p>where <script type="math/tex">S_n^+</script> defines the sum of the positive terms and <script type="math/tex">S_n^{-}</script> defines the sum
of the negative terms.</p>
<p>By grouping the terms in the manner of (22) we may observe that when
<script type="math/tex">n</script> is large the average step length is given by:</p>
<p>\begin{equation}
\Delta = \frac{N+1}{2}
\end{equation}</p>
<p>so that if <script type="math/tex">k</script> positive steps and <script type="math/tex">n-k</script> negative steps are taken:</p>
<p>\begin{equation}
S_n^+ \approx k \cdot \Delta
\end{equation}</p>
<p>\begin{equation}
S_n^{-} \approx (n-k) \cdot (-\Delta)
\end{equation}</p>
<p>\begin{equation}
S_n = S_n^+ + S_n^{-} \approx k \cdot \Delta + (n-k) \cdot (-\Delta) = \Delta \cdot (2k-n)
\end{equation}</p>
<p>Now, we may note that due to symmetry:</p>
<p>\begin{equation}
P(\lvert S_n \rvert =k) > P(\lvert S_n \rvert =k+1) \iff P(S_n =k) > P(S_n =k+1)
\end{equation}</p>
<p>so it suffices to demonstrate <script type="math/tex">P(S_n =k) > P(S_n =k+1)</script>.</p>
<p>In order to proceed with our demonstration we set <script type="math/tex">\tau = 2k - n</script>, as in (26), and choose <script type="math/tex">k \in [\lfloor \frac{n}{2} \rfloor + 1,n-1]</script> so:</p>
<p>\begin{equation}
P(S_{n} = \lfloor \Delta \cdot \tau \rfloor) \propto {n \choose k} \frac{1}{2^{n}}
\end{equation}</p>
<p>where <script type="math/tex">k \geq \lfloor \frac{n}{2} \rfloor + 1</script> implies that:</p>
<p>\begin{equation}
\frac{P(S_n=k)}{P(S_n=k+1)} \sim \frac{(k+1)!(n-k-1)!}{k!(n-k)!} = \frac{k+1}{n-k} \geq \frac{\lfloor \frac{n}{2} \rfloor+2}{\lceil \frac{n}{2} \rceil-1} > 1
\end{equation}</p>
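<p>This monotonicity claim can be confirmed exactly for small <script type="math/tex">n</script> and <script type="math/tex">N</script>. The sketch below (my own code) checks strict unimodality of the exact pmf; note that for <script type="math/tex">n=1</script> the distribution is flat, so the strict inequality only applies for <script type="math/tex">n \geq 2</script>:</p>

```python
from fractions import Fraction

def pmf_S(n: int, N: int) -> dict:
    """Exact pmf of S_n via repeated convolution of the uniform step."""
    step = Fraction(1, 2 * N + 1)
    dist = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for s, p in dist.items():
            for a in range(-N, N + 1):
                new[s + a] = new.get(s + a, Fraction(0)) + p * step
        dist = new
    return dist

def strictly_unimodal(n: int, N: int) -> bool:
    """Check P(S_n = k) > P(S_n = k+1) for all 0 <= k < n*N."""
    dist = pmf_S(n, N)
    return all(dist[k] > dist[k + 1] for k in range(0, n * N))

# Strict for n >= 2; for n = 1 the pmf is flat on [-N, N].
assert all(strictly_unimodal(n, N) for n in (2, 3, 4) for N in (2, 3))
assert not strictly_unimodal(1, 3)
```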
<h2 id="proof-that-u_n-is-decreasing">Proof that <script type="math/tex">u_n</script> is decreasing:</h2>
<p>Given (13) we may derive the following ratio:</p>
<p>\begin{equation}
\frac{u_{n+1}}{u_n} = \frac{P(\lvert S_n \rvert \leq N)}{(2N+1) \cdot P(S_n = 0)}
\end{equation}</p>
<p>So in order to prove that <script type="math/tex">u_n</script> is decreasing we must show that:</p>
<p>\begin{equation}
P(\lvert S_n \rvert \leq N) < (2N+1) \cdot P(S_n=0)
\end{equation}</p>
<p>Meanwhile, if we define:</p>
<p>\begin{equation}
W_k^N = \sum_{j=k}^{nN-1} \big[ P(S_n=j)-P(S_n=j+1) \big] = P(S_n=k) - P(S_n=nN) = P(S_n=k) - \frac{1}{(2N+1)^n} \approx P(S_n=k)
\end{equation}</p>
<p>We may observe that <script type="math/tex">W_k^N</script> is a sum of positive terms and that:</p>
<p>\begin{equation}
\forall k \in \mathbb{N}, W_k^N \leq W_0^N \approx P(S_n=0)
\end{equation}</p>
<p>And given (32) we have:</p>
<p>\begin{equation}
W_{2k}^N < \frac{W_k^N}{2} \implies \forall k \in L \subset [N], P(S_n=k) < \frac{P(S_n=0)}{2^k}
\end{equation}</p>
<p>and we can show that <script type="math/tex">\lvert L \rvert \geq \log_2 N</script> so <script type="math/tex">% <![CDATA[
\sum_{k \in L} P(S_n=k) < P(S_n=0) %]]></script>.</p>
<p>Using the last few relations we may derive the following inequality:</p>
<p>\begin{equation}
\sum_{k=0}^{N} W_k^N \approx \sum_{k=0}^{N} P(S_n=k) = \frac{P(\lvert S_n \rvert \leq N) + P(S_n=0)}{2} < (N-\log_2 N)P(S_n=0)
\end{equation}</p>
<p>and so we may deduce that (31) is in fact true, via the slightly stronger bound:</p>
<p>\begin{equation}
P(\lvert S_n \rvert \leq N) < (2N-1) \cdot P(S_n=0)
\end{equation}</p>
<h2 id="proof-that-lim_n-to-infty-u_n--lim_n-to-infty-alpha_n--0">Proof that <script type="math/tex">\lim_{n \to \infty} u_n = \lim_{n \to \infty} \alpha_n = 0</script>:</h2>
<p>An immediate consequence of (36) is that:</p>
<p>\begin{equation}
\lim_{n \to \infty} \frac{P(S_{n+1}=0)}{P(S_1=0)} = \prod_{n=1}^\infty \frac{P(S_{n+1}=0)}{P(S_n=0)} = \prod_{n=1}^\infty \frac{P(\lvert S_n \rvert \leq N)}{(2N+1)P(S_n=0)} < \prod_{n=1}^\infty \frac{2N-1}{2N+1} = 0
\end{equation}</p>
<p>so we may deduce that <script type="math/tex">u_n</script> decreases exponentially fast and that:</p>
<p>\begin{equation}
\lim_{n \to \infty} u_n = \lim_{n \to \infty} P(S_{n+1}=0) = \frac{0}{2N+1}=0
\end{equation}</p>
<p>Likewise, given that:</p>
<p>\begin{equation}
\alpha_n = P(\lvert S_n \rvert \leq N) = (2N+1) \cdot P(S_{n+1}=0)
\end{equation}</p>
<p>we may conclude that large deviations dominate as <script type="math/tex">n</script> becomes large:</p>
<p>\begin{equation}
\lim_{n \to \infty} \alpha_n = \lim_{n \to \infty} (2N+1) \cdot P(S_{n+1}=0) = 0
\end{equation}</p>
<p>\begin{equation}
\lim_{n \to \infty} \beta_n = \lim_{n \to \infty} P(\lvert S_n \rvert > N) = \lim_{n \to \infty} 1 - \alpha_n = 1
\end{equation}</p>
<p>One interpretation of the last two limits is that the mass of the discrete hypercube moves away from the centre and towards the corners
which is a concentration-of-measure phenomenon.</p>
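<p>To illustrate this migration of mass toward the corners, the sketch below (my own code) tabulates <script type="math/tex">\alpha_n</script> exactly for <script type="math/tex">N=2</script> and checks that it decreases monotonically, so that <script type="math/tex">\beta_n = 1 - \alpha_n</script> increases towards 1:</p>

```python
from fractions import Fraction

def pmf_S(n: int, N: int) -> dict:
    """Exact pmf of S_n for i.i.d. steps uniform on {-N, ..., N}."""
    step = Fraction(1, 2 * N + 1)
    dist = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for s, p in dist.items():
            for a in range(-N, N + 1):
                new[s + a] = new.get(s + a, Fraction(0)) + p * step
        dist = new
    return dist

def alpha(n: int, N: int) -> Fraction:
    """alpha_n = P(|S_n| <= N), the probability of a 'small' deviation."""
    return sum(p for s, p in pmf_S(n, N).items() if abs(s) <= N)

N = 2
alphas = [alpha(n, N) for n in range(1, 13)]
assert alphas[0] == 1                                   # |S_1| <= N always
assert all(a > b for a, b in zip(alphas, alphas[1:]))   # strictly decreasing
assert alphas[-1] < Fraction(1, 2)                      # most mass is far out
```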
<h2 id="discussion">Discussion:</h2>
<p>I find it quite surprising that random structures, in this case a random walk, are useful for analysing high-dimensional systems.
Indeed, I have to say that for such a general result thirty four equations isn’t much. But, what about the case of uniform distributions
on closed intervals of the form <script type="math/tex">[-N,N] \subset \mathbb{R}</script>?</p>
<p>It’s useful to note that <script type="math/tex">[-N,N]^n \subset \mathbb{R}^n</script> defines the hypercube with volume <script type="math/tex">(2N)^n</script> and I suspect that in the continuous
setting, hypercube geometry and convex analysis might be particularly insightful.</p>

<h1>Probability in High Dimension Part I (2019-04-20)</h1>
<h2 id="motivation">Motivation:</h2>
<p>A couple weeks ago I was working on a problem that involved the expected value of a ratio of two random variables:</p>
<p>\begin{equation}
\mathbb{E}\big[\frac{X_n}{Z_n}\big] \approx \frac{\mu_{X_n}}{\mu_{Z_n}} - \frac{\mathrm{Cov}(X_n,Z_n)}{\mu_{Z_n}^2} + \frac{\mathrm{Var(Z_n)}\mu_{X_n}}{\mu_{Z_n}^3}
\end{equation}</p>
<p>where <script type="math/tex">Z_n</script> was a sum of <script type="math/tex">n</script> i.i.d. random variables with a symmetric distribution centred at zero.</p>
<p>Everything about this approximation worked fine in computer simulations where <script type="math/tex">n</script> was large but mathematically there appeared to be a problem since:</p>
<p>\begin{equation}
\mathbb{E}\big[Z_n\big] = 0
\end{equation}</p>
<p>Given that (2) didn’t appear to be an issue in simulation, I went through the code several times to check whether there was an error
but found none. After thinking about the problem for a bit longer it occurred to me to formalise the problem and analyse:</p>
<p>\begin{equation}
P(\sum_{n=1}^N a_n = 0)
\end{equation}</p>
<p>where <script type="math/tex">a_n</script> are i.i.d. random variables with a uniform distribution centred at zero so <script type="math/tex">\mathbb{E}[a_i]=0</script>. My intuition suggested that under
relatively weak assumptions:</p>
<p>\begin{equation}
\lim_{N \to \infty} P(\sum_{n=1}^N a_n = 0) = 0
\end{equation}</p>
<p>We may think of this as a measure-theoretic phenomenon in high-dimensional spaces where <script type="math/tex">N \in \mathbb{N}</script> is our dimension and <script type="math/tex">\vec{a} \in \mathbb{R}^N</script> is a random vector.</p>
<h2 id="analysis-of-a-special-case">Analysis of a special case:</h2>
<p>Given that (3) is a very general problem, I decided to start by analysing the special case of <script type="math/tex">a_i \sim \mathcal{U}(\{-1,1\})</script> where:</p>
<p>\begin{equation}
\forall n \in \mathbb{N}, P(a_n=1)=P(a_n=-1)=\frac{1}{2}
\end{equation}</p>
<p>\begin{equation}
S_0 = \{ (a_n)_{n=1}^N \in \{-1,1\}^N : \sum_{n=1}^N a_n = 0\}
\end{equation}</p>
<p>Knowing that <script type="math/tex">S_0</script> is non-empty only if we have parity of positive and negative terms, we may deduce that:</p>
<p>\begin{equation}
S_0 \neq \emptyset \iff N \in 2\mathbb{N}
\end{equation}</p>
<p>For the above reason, I focused my analysis on the following sequence:</p>
<p>\begin{equation}
u_N = P(\sum_{n=1}^{2N} a_n = 0)= \frac{\binom{2N}{N}}{2^{2N}} = \frac{(2N)!}{2^{2N}(N!)^2}
\end{equation}</p>
<h2 id="proof-that-u_n-is-decreasing">Proof that <script type="math/tex">u_N</script> is decreasing:</h2>
<p>We can demonstrate that <script type="math/tex">u_N</script> is strictly decreasing by considering the ratio:</p>
<p>\begin{equation}
\frac{u_{N+1}}{u_N}=\frac{\frac{(2N+2)!}{2^{2N+2}((N+1)!)^2}}{\frac{(2N)!}{2^{2N}(N!)^2}}=\frac{(2N+2)(2N+1)}{4(N+1)^2}=\frac{2N+1}{2N+2} < 1
\end{equation}</p>
<p>Now, with (9) we have what is necessary to show that:</p>
<p>\begin{equation}
\lim_{N \to \infty} u_N = 0
\end{equation}</p>
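<p>Both the closed form (8) and the ratio (9) can be verified with exact integer arithmetic; a minimal sketch of my own:</p>

```python
from math import comb
from fractions import Fraction

def u(N: int) -> Fraction:
    """u_N = C(2N, N) / 2^(2N): probability a ±1 walk of length 2N ends at 0."""
    return Fraction(comb(2 * N, N), 2 ** (2 * N))

# The ratio u_{N+1}/u_N equals (2N+1)/(2N+2) < 1, so u_N strictly decreases.
for N in range(1, 30):
    assert u(N + 1) / u(N) == Fraction(2 * N + 1, 2 * N + 2)
```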
<h2 id="analysis-of-the-limit-lim_n-to-infty-u_n">Analysis of the limit <script type="math/tex">\lim_{N \to \infty} u_N</script>:</h2>
<p>Using (9) we may derive a recursive definition of <script type="math/tex">u_N</script>:</p>
<p>\begin{equation}
u_{N+1}=\frac{2N+1}{2N+2} \cdot u_N
\end{equation}</p>
<p>and given that <script type="math/tex">u_0=1</script> we have:</p>
<p>\begin{equation}
u_{N}=\prod_{n=0}^{N-1} \frac{2n+1}{2n+2}= \frac{1}{2} \cdot \frac{3}{4} \cdot \frac{5}{6} \cdots
\end{equation}</p>
<p>At this point we can make the useful observation:</p>
<p>\begin{equation}
\lim_{N \to \infty} u_N = 0 \implies \lim_{N \to \infty} - \ln u_N = \infty
\end{equation}</p>
<h2 id="proof-that-lim_n-to-infty-u_n0">Proof that <script type="math/tex">\lim_{N \to \infty} u_N=0</script>:</h2>
<p>By combining (12) and (13) we find that:</p>
<p>\begin{equation}
-\ln u_N = -\ln \prod_{n=0}^{N-1} \frac{2n+1}{2n+2}= \sum_{n=0}^{N-1} \ln \frac{2n+2}{2n+1}= \sum_{n=0}^{N-1} \ln \big(1+\frac{1}{2n+1}\big)
\end{equation}</p>
<p>We note that when <script type="math/tex">n\in \mathbb{N}</script> is large:</p>
<p>\begin{equation}
\ln \big(1+\frac{1}{n}\big) \approx \frac{1}{n}
\end{equation}</p>
<p>Now, from (15) it follows that:</p>
<p>\begin{equation}
\sum_{n=1}^\infty \frac{1}{2n+1} = \infty \implies \sum_{n=0}^{\infty} \ln \big(1+\frac{1}{2n+1}\big) = \infty
\end{equation}</p>
<p>As a result, we may conclude that (10) is indeed true. In some sense, when <script type="math/tex">n</script> is large we can expect to observe the expected value
with vanishing probability.</p>
<h2 id="discussion">Discussion:</h2>
<p>A natural question that follows is whether the above method may be used to handle other cases. Let’s consider <script type="math/tex">a_i \sim \mathcal{U}(\{-1,0,1\})</script> where:</p>
<p>\begin{equation}
\forall n \in \mathbb{N}, P(a_n=1)=P(a_n=0)=P(a_n=-1)=\frac{1}{3}
\end{equation}</p>
<p>so we may define:</p>
<p>\begin{equation}
u_N = P(\sum_{n=1}^{4N} a_n = 0)= \frac{(4N)!}{3^{4N}} \sum_{m=0}^{2N} \frac{1}{(m!)^2 \, (4N-2m)!}
\end{equation}</p>
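<p>This count can be cross-checked by brute force. The sketch below (my own code) sums the multinomial terms over <script type="math/tex">m</script>, the common number of <script type="math/tex">+1</script> and <script type="math/tex">-1</script> steps, and compares against exhaustive enumeration:</p>

```python
from itertools import product
from math import factorial
from fractions import Fraction

def u_brute(N: int) -> Fraction:
    """P(sum of 4N i.i.d. U({-1,0,1}) steps = 0), by enumeration."""
    hits = sum(1 for w in product((-1, 0, 1), repeat=4 * N) if sum(w) == 0)
    return Fraction(hits, 3 ** (4 * N))

def u_formula(N: int) -> Fraction:
    """Multinomial formula: m = #(+1 steps) = #(-1 steps), 4N-2m zeros."""
    total = sum(
        Fraction(factorial(4 * N), factorial(m) ** 2 * factorial(4 * N - 2 * m))
        for m in range(0, 2 * N + 1)
    )
    return total / 3 ** (4 * N)

assert u_brute(1) == u_formula(1) == Fraction(19, 81)
assert u_brute(2) == u_formula(2) == Fraction(1107, 3 ** 8)
```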
<p>I actually tried to analyse the combinatorics of this sequence but quickly realised that even if I managed to show that this sequence converged to zero,
it wasn’t clear how this method would manage to handle the most general setting, the case of all integer dimensions <script type="math/tex">N \in \mathbb{N}</script>, and it didn’t
appear to be very effective in terms of the number of calculations per case.</p>
<p>In order to make progress, I decided to model this problem from a <a href="https://keplerlounge.com/probability/2019/04/21/high-dimension-prob-2.html">different perspective</a>.</p>

<h1>Kinematics of a random walk on the special linear group (2019-04-02)</h1>
<center><img src="https://raw.githubusercontent.com/Kepler-Lounge/Kepler-Lounge.github.io/master/_images/sequence_1.jpg" width="50%" height="50%" align="middle" /></center>
<center>A stochastic sequence with symmetry hiding in plain sight</center>
<h2 id="introduction">Introduction:</h2>
<p>A few days ago I decided to analyse the symmetries of the two-thirds power law [1] and this analysis naturally led to the following kinematic sequence:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}\begin{bmatrix}\ddot{x}_{n+1}\\\dot{x}_{n+1}\end{bmatrix} = M_n \cdot \begin{bmatrix}\ddot{x}_{n}\\\dot{x}_{n}\end{bmatrix}= \begin{bmatrix}a & b\\c & d\end{bmatrix} \cdot \begin{bmatrix}\ddot{x}_{n}\\\dot{x}_{n}\end{bmatrix} = \begin{bmatrix}a\ddot{x}_n + b\dot{x}_n\\c\ddot{x}_n + d\dot{x}_n\end{bmatrix}\end{equation} %]]></script>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}\begin{bmatrix}\ddot{y}_{n+1}\\\dot{y}_{n+1}\end{bmatrix} = M_n \cdot \begin{bmatrix}\ddot{y}_{n}\\\dot{y}_{n}\end{bmatrix}= \begin{bmatrix}a & b\\c & d\end{bmatrix} \cdot \begin{bmatrix}\ddot{y}_{n}\\\dot{y}_{n}\end{bmatrix} = \begin{bmatrix}a\ddot{y}_n + b\dot{y}_n\\c\ddot{y}_n + d\dot{y}_n\end{bmatrix}\end{equation} %]]></script>
<p>where <script type="math/tex">M_n \in SL(2, \mathbb{R})</script> is a volume-preserving transformation and the position is updated using:</p>
<p>\begin{equation}
x_{n+1} = x_n + \dot{x}_n\cdot \Delta t + \frac{1}{2} \ddot{x}_n \cdot \Delta t^2
\end{equation}</p>
<p>Now, in order to make sure that <script type="math/tex">ad-bc=1</script> I decided to use the trigonometric identity:</p>
<p>\begin{equation}
\cos^2(\theta) + \sin^2(\theta) = 1
\end{equation}</p>
<p>so that we only need to sample three random numbers <script type="math/tex">\alpha, \beta,\theta \in \mathbb{R}</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
M_n = \begin{bmatrix}\frac{\cos(2 \pi \theta)}{\alpha} & \beta \cdot \sin(2 \pi \theta) \\ \frac{-\sin(2 \pi \theta)}{\beta}& \alpha \cdot \cos(2 \pi \theta)\end{bmatrix}
\end{equation} %]]></script>
<p>For the rest of the discussion we shall assume that <script type="math/tex">\alpha,\beta \sim (-1)^{\operatorname{Bern}(0.5)} \cdot U(0.1,10)</script> and <script type="math/tex">\theta \sim U(0,1)</script>.</p>
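<p>As a sanity check on (5), the following sketch (the sampling code is mine, reconstructed from the description above) draws random matrices and confirms <script type="math/tex">\det M_n = 1</script> up to floating-point error:</p>

```python
import math
import random

def sample_M(rng: random.Random):
    """Sample M_n as in (5): alpha, beta ~ ±U(0.1, 10), theta ~ U(0, 1)."""
    alpha = rng.choice((-1, 1)) * rng.uniform(0.1, 10)
    beta = rng.choice((-1, 1)) * rng.uniform(0.1, 10)
    theta = rng.uniform(0.0, 1.0)
    c, s = math.cos(2 * math.pi * theta), math.sin(2 * math.pi * theta)
    return ((c / alpha, beta * s),
            (-s / beta, alpha * c))

rng = random.Random(0)
for _ in range(1000):
    (a, b), (c, d) = sample_M(rng)
    # det = (c/alpha)(alpha*c) - (beta*s)(-s/beta) = cos^2 + sin^2 = 1
    assert abs(a * d - b * c - 1.0) < 1e-9
```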
<p>Now, the key question I have is whether:</p>
<p>\begin{equation}
\mathbb{E}\big[\frac{\Delta y_n}{\Delta x_n}\big] = \text{Cst}
\end{equation}</p>
<p>i.e. whether the expected value of the rate of change is constant.</p>
<h2 id="using-a-symmetry-to-simplify-calculations">Using a symmetry to simplify calculations:</h2>
<h3 id="a-tale-of-two-branching-processes">A tale of two branching processes:</h3>
<p>The following diagram, derived from the first figure, is a particularly useful method for visualising the trajectory of our stochastic sequence:</p>
<center><img src="https://raw.githubusercontent.com/Kepler-Lounge/Kepler-Lounge.github.io/master/_images/sequence_2.jpg" width="75%" height="75%" align="middle" /></center>
<center>A tale of two branching processes</center>
<p>If we use <script type="math/tex">\Sigma_{1}^n</script> and <script type="math/tex">\Sigma_{2}^n</script> to denote random variables associated
with the first and second kinds of branching processes, we may simplify (1) so we have:</p>
<p>\begin{equation}
\ddot{x_n} = \ddot{x_0} \cdot c_0 \cdot \Sigma_{2}^n + \dot{x_0} \cdot d_0 \cdot \Sigma_{2}^n = q_1 \Sigma_{2}^n
\end{equation}</p>
<p>\begin{equation}
\dot{x_n} = \ddot{x_0} \cdot a_0 \cdot \Sigma_{1}^n + \dot{x_0} \cdot b_0 \cdot \Sigma_{1}^n = q_2 \Sigma_{1}^n
\end{equation}</p>
<p>Similarly, we find that for <script type="math/tex">\ddot{y}_n</script> and <script type="math/tex">\dot{y}_n</script> we have:</p>
<p>\begin{equation}
\ddot{y_n} = \ddot{y_0} \cdot c_0 \cdot \Sigma_{2}^n + \dot{y_0} \cdot d_0 \cdot \Sigma_{2}^n = q_3 \Sigma_{2}^n
\end{equation}</p>
<p>\begin{equation}
\dot{y_n} = \ddot{y_0} \cdot a_0 \cdot \Sigma_{1}^n + \dot{y_0} \cdot b_0 \cdot \Sigma_{1}^n = q_4 \Sigma_{1}^n
\end{equation}</p>
<h3 id="analysis-of-the-rate-of-change">Analysis of the rate of change:</h3>
<p>Given equation (3) we may deduce that:</p>
<p>\begin{equation}
\frac{\Delta y_n}{\Delta x_n} = \frac{y_{n+1}-y_n}{x_{n+1}-x_n} = \frac{\dot{y_n} \Delta t + \frac{1}{2} \ddot{y_n} \Delta t^2}{\dot{x_n} \Delta t + \frac{1}{2} \ddot{x_n} \Delta t^2} = \frac{\dot{y_n} + h\ddot{y_n}}{\dot{x_n} + h \ddot{x_n}}
\end{equation}</p>
<p>where <script type="math/tex">h = \frac{\Delta t}{2}</script>.</p>
<p>Now, using equations (7), (8), (9) and (10) we find that:</p>
<p>\begin{equation}
\frac{\Delta y_n}{\Delta x_n} = \frac{\dot{y_n} + h\ddot{y_n}}{\dot{x_n} + h \ddot{x_n}} = \frac{q_4 \Sigma_{1}^n + h \cdot q_3 \Sigma_{2}^n}{q_2 \Sigma_{1}^n + h \cdot q_1 \Sigma_{2}^n}
\end{equation}</p>
<h2 id="an-experimental-observation">An experimental observation:</h2>
<h3 id="expected-values-of-sigma_1n-and-sigma_2n">Expected values of <script type="math/tex">\Sigma_{1}^n</script> and <script type="math/tex">\Sigma_{2}^n</script>:</h3>
<p>It’s useful to note that given that the matries <script type="math/tex">M_n</script> are independent and:</p>
<p>\begin{equation}
\forall n \in \mathbb{N}, \mathbb{E}[M_n] = 0
\end{equation}</p>
<p>we may deduce that:</p>
<p>\begin{equation}
\mathbb{E}[\Sigma_{1}^n] =\mathbb{E}[\Sigma_{2}^n]= 0
\end{equation}</p>
<h3 id="numerical-experiments-with-fracdelta-y_ndelta-x_n">Numerical experiments with <script type="math/tex">\frac{\Delta y_n}{\Delta x_n}</script>:</h3>
<p>My intuition told me from the beginning that (12) might be useful for analysing the expected value of <script type="math/tex">\frac{\Delta y_n}{\Delta x_n}</script>. In fact,
numerical experiments suggest:</p>
<p>\begin{equation}
\frac{\Delta y_n}{\Delta x_n} \approx \frac{q_4}{q_2}
\end{equation}</p>
<p>To be precise, <a href="https://gist.github.com/AidanRocke/33c1d5268d8f8c3b395cc81ba6397f47">numerical experiments</a> show that 100% of the time the sign of <script type="math/tex">\frac{q_4}{q_2}</script> agrees with the sign of <script type="math/tex">\frac{\Delta y_n}{\Delta x_n}</script>, and more than 70% of the time the two numbers agree to within a factor of 1.5.</p>
<h2 id="analysis">Analysis:</h2>
<p>If we take the limit as <script type="math/tex">h \rightarrow 0</script>:</p>
<p>\begin{equation}
\lim_{h \to 0} \frac{\Delta y_n}{\Delta x_n} = \lim_{h \to 0} \frac{q_4 \Sigma_1^n+h \cdot q_3 \cdot \Sigma_2^n}{q_2 \Sigma_1^n+h \cdot q_1 \cdot \Sigma_2^n} = \frac{q_4}{q_2}
\end{equation}</p>
<p>so it appears that what I observed numerically depends on <script type="math/tex">h</script> and it’s still not clear to me how to calculate <script type="math/tex">\mathbb{E}\big[\frac{\Delta y_n}{\Delta x_n}\big]</script> directly, which was my original question.</p>
<h2 id="conjecture">Conjecture:</h2>
<p>While I’m still looking for a closed form expression for <script type="math/tex">\mathbb{E}\big[\frac{\Delta y_n}{\Delta x_n}\big]</script> my previous analysis leads me to conclude that, for any random matrices <script type="math/tex">M_i</script> sampled i.i.d., as <script type="math/tex">h \rightarrow 0</script>:</p>
<p>\begin{equation}
\lim_{h \to 0} \frac{\Delta y_n}{\Delta x_n} = \frac{q_4}{q_2}
\end{equation}</p>
<p>which is a general result I didn’t expect in advance.</p>
<p>Now, given that there is strong numerical evidence for (6) regardless of the magnitude of <script type="math/tex">\Delta t</script>, I wonder whether we can show:</p>
<p>\begin{equation}
\lim_{h \to 0} \frac{\Delta y_n}{\Delta x_n} = \mathbb{E}\big[\frac{\Delta y_n}{\Delta x_n} \big]
\end{equation}</p>
<h1 id="references">References:</h1>
<ol>
<li>D. Huh & T. Sejnowski. Spectrum of power laws for curved hand movements. 2015.</li>
</ol>

<h1>The true cost of AlphaGo Zero (2019-03-24)</h1>
<h1 id="motivation">Motivation:</h1>
<p>Rich Sutton’s <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">Bitter Lesson</a> for AI essentially argues that we
should focus on meta-methods that scale well with compute instead of trying to understand the structure and function of biological minds.
The latter, according to him, are <em>endlessly complex</em> and therefore unlikely to scale. Furthermore, Rich Sutton (who works at Deep Mind) considers Deep Mind’s work on AlphaGo
an exemplary model of AI research.</p>
<p>After reading <a href="https://twitter.com/shimon8282/status/1106534178676506624">Shimon Whiteson’s detailed rebuttal</a> as well as <a href="https://twitter.com/SussilloDavid/status/1106643708626137089">David Sussillo’s reflection on loss functions</a> I think it’s time to re-evaluate
the scientific value of AlphaGo Zero research and the carbon footprint of a win-at-any-cost research culture. In particular, I’d like to address the following questions:</p>
<ol>
<li>What kinds of real problems can be solved with AlphaGo Zero algorithms?</li>
<li>What is the true cost of AlphaGo Zero? (i.e. the carbon footprint)</li>
<li>Does Google’s carbon offsetting scheme accomplish more than virtue signalling?</li>
<li>Should AI researchers be noting their carbon footprint in their publications?</li>
<li>Finally, might energetic constraints present an opportunity for better AI research?</li>
</ol>
<p>I haven’t seen these questions addressed in one manuscript but I believe they are related and timely, hence this article. Now, I’d like to add that we can’t seriously
entertain notions of <em>safe</em> AI without carefully developing <em>environmentally friendly</em> AI, especially when the only thing that we know for certain is that we will do exponentially more FLOPs (i.e. computations) in the future.</p>
<p><strong>Note:</strong> This post builds on the analyses of <a href="https://medium.com/@karpathy/alphago-in-context-c47718cb95a5">Andrej Karpathy</a> and <a href="https://www.yuzeh.com/data/agz-cost.html">Dan Huang</a>.</p>
<h2 id="alphago-zeros-contribution-to-humanity">AlphaGo Zero’s contribution to humanity:</h2>
<center><img src="https://raw.githubusercontent.com/Kepler-Lounge/Kepler-Lounge.github.io/master/_images/netflix_alphago.jpeg" width="75%" height="75%" align="middle" /></center>
<center>Deep Mind's AlphaGo movie on Netflix</center>
<p>Before measuring the carbon footprint of AlphaGo Zero it’s a good idea to remind ourselves of the types of environments this ‘meta-method’ can handle:</p>
<ol>
<li>The environments must have <em>deterministic</em> dynamics which simplifies planning considerably.</li>
<li>The environment must be <em>fully-observable</em> which rules out large and complex environments.</li>
<li>A <em>perfect simulator</em> must be available to the agent which rules out any biologically-plausible environments.</li>
<li>Evaluation is <em>simple</em> and <em>objective</em>: win/lose. For biological organisms all rewards are <em>internal</em> and <em>subjective</em>.</li>
<li>Static state-spaces and action-spaces: so we can’t generalise…not even to <script type="math/tex">N \times N</script> Go where <script type="math/tex">N \neq 19</script>.</li>
</ol>
<p>These constraints effectively rule out the application of AlphaGo Zero’s algorithms to any practical problem in robotics because perfect simulators are non-existent. But it may be used to solve two-player, deterministic, perfect-information board games, which is a historic achievement and a great publicity stunt for Google, assuming that the carbon footprint of this
project is reasonable. This consideration is doubly important when you take into account the influence of Deep Mind on modern AI research culture.</p>
<p>However, before estimating the metric tonnes of CO2 blasted into the atmosphere by Deep Mind let’s consider a related question. How much would it cost an entity outside of Google to replicate this type of research?</p>
<h2 id="the-cost-of-alphago-zero-in-us-dollars">The cost of AlphaGo Zero in US dollars:</h2>
<p>In ‘Mastering the game of Go without human knowledge’ [2] they had both a three day experiment as well as
a forty day experiment. Let’s start with the three day experiment.</p>
<h3 id="the-three-day-experiment">The three day experiment:</h3>
<ol>
<li>Over 72 hours, ~ 5 million games were played.</li>
<li>Each move of self-play used ~ 0.4 seconds of computer time and each self-play machine consisted of 4 TPUs.</li>
<li>How many self-play machines <script type="math/tex">N_{SP}</script> were used?</li>
<li>
<p>If the average game has 200 moves we have:</p>
<p>\begin{equation}
\frac{72\cdot 60 \cdot 60 \cdot N_{SP}}{200 \cdot 5 \cdot 10^6} \approx 0.4 \implies N_{SP} \approx 1500
\end{equation}</p>
</li>
<li>
<p>Given that each self-play machine contained 4 TPUs we have:</p>
<p>\begin{equation}
N_{TPU} = 4 \cdot N_{SP} \approx 6000
\end{equation}</p>
</li>
</ol>
<p>If we use <a href="https://cloud.google.com/tpu/docs/pricing">Google’s TPU pricing</a> as of March 2019, the cost for an organisation outside of
Google to replicate this experiment is therefore:</p>
<p>\begin{equation}
\text{Cost} > 6000 \cdot 72 \cdot 4.5 \approx 2 \cdot 10^6 \quad \text{US dollars}
\end{equation}</p>
<h3 id="the-forty-day-experiment">The forty day experiment:</h3>
<p>For the forty day experiment one thing that’s different is that the policy network has twice as many layers so,
as Dan Huang pointed out in <a href="https://www.yuzeh.com/data/agz-cost.html">his article</a>, it’s reasonable to infer
that twice the amount of time was used per move. So ~ 0.8 seconds rather than ~ 0.4 seconds.</p>
<ol>
<li>Over 40 days, ~ 29 million games were played.</li>
<li>Each move of self-play used ~ 0.8 seconds of computer time where each self-play machine consisted of 4 TPUs.</li>
<li>How many self-play machines <script type="math/tex">N_{SP}</script> were used?</li>
<li>
<p>If the average game has 200 moves we have:</p>
<p>\begin{equation}
\frac{40 \cdot 24 \cdot 3600 \cdot N_{SP}}{200 \cdot 29 \cdot 10^6} \approx 0.8 \implies N_{SP} \approx 1300
\end{equation}</p>
</li>
<li>
<p>Given that each self-play machine contained 4 TPUs we have:</p>
<p>\begin{equation}
N_{TPU} = 4 \cdot N_{SP} \approx 5000
\end{equation}</p>
</li>
</ol>
<p>If we use the <a href="https://cloud.google.com/tpu/docs/pricing">Google’s TPU pricing</a> as of March 2019, the cost for an organisation outside of
Google to replicate this experiment is therefore:</p>
<p>\begin{equation}
\text{Cost} > 5000 \cdot 960 \cdot 4.5 \approx 2 \cdot 10^7 \quad \text{US dollars}
\end{equation}</p>
<p>It goes without saying that this is well outside the budget of any AI lab in academia.</p>
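<p>Both estimates can be collected into a short sketch. The inputs are the stated assumptions (roughly 200 moves per game, 4 TPUs per self-play machine, and roughly 4.5 US dollars per TPU-hour); the function name is my own:</p>

```python
# Back-of-the-envelope replication cost, using the assumptions above:
# ~200 moves per game, 4 TPUs per self-play machine, and roughly
# $4.50 per TPU-hour (Google's on-demand pricing as of March 2019).

def replication_cost(hours, games, secs_per_move, moves_per_game=200,
                     tpus_per_machine=4, usd_per_tpu_hour=4.5):
    total_moves = games * moves_per_game
    # number of machines needed to fit all moves in the wall-clock budget
    machines = total_moves * secs_per_move / (hours * 3600)
    tpus = machines * tpus_per_machine
    cost = tpus * hours * usd_per_tpu_hour
    return machines, tpus, cost

three_day = replication_cost(hours=72, games=5e6, secs_per_move=0.4)
forty_day = replication_cost(hours=960, games=29e6, secs_per_move=0.8)
```

This reproduces the ~1500 machines / ~6000 TPUs / ~2 million dollars of the three day experiment and the ~1300 machines / ~5000 TPUs / ~20 million dollars of the forty day experiment.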
<h2 id="the-true-cost-of-googles-40-day-experiment">The true cost of Google’s 40 day experiment:</h2>
<center><img src="https://raw.githubusercontent.com/Kepler-Lounge/Kepler-Lounge.github.io/master/_images/climate_cartoon.jpeg" width="75%" height="75%" align="middle" /></center>
<center>Climate agreements since 1990: progress or the illusion thereof</center>
<p>This, in my opinion, is the more important calculation. While it’s not at all clear that AI research will ‘save the world’ in the long term,
in the short term what is certain is that compute-intensive AI experiments have a non-trivial carbon footprint. So I think it would be wise
to use our energy budget carefully and, realistically, the only way to do this is to calculate the carbon footprint of every AI research project and place it on
the front page of the research paper. Meanwhile, let’s proceed with the calculation.</p>
<p>The nature of this calculation involves first converting TPU hours into kilowatt-hours (kWh) and then converting this value to metric tonnes of CO2:</p>
<ol>
<li>~5000 TPUs were used for 960 hours.</li>
<li>~40 Watts per TPU according to [6].</li>
<li>
<p>This means that we have:</p>
<p>\begin{equation}
\text{kWh} = \frac{5000 \cdot 960 \cdot 40}{10^3} \approx 1.9 \cdot 10^5
\end{equation}</p>
</li>
<li>
<p>This is approximately 23 American homes’ electricity for a year according to the <a href="https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator">EPA</a>.</p>
</li>
<li>In the USA, where Google Cloud TPUs are located, we have ~ 0.5 kg of CO2/kWh, so AlphaGo Zero was responsible for emitting approximately 96 tonnes of CO2 into the atmosphere.</li>
</ol>
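<p>The steps above amount to a couple of lines (a sketch using the stated assumptions: ~5000 TPUs for 960 hours, ~40 Watts per TPU, ~0.5 kg of CO2 per kWh):</p>

```python
# The energy and CO2 arithmetic above, with the stated assumptions:
# ~5000 TPUs for 960 hours, ~40 Watts per TPU, ~0.5 kg CO2 per kWh.
tpus, hours, watts_per_tpu = 5000, 960, 40
kwh = tpus * hours * watts_per_tpu / 1000    # watt-hours -> kilowatt-hours
tonnes_co2 = kwh * 0.5 / 1000                # kg of CO2 -> metric tonnes
```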
<p>To appreciate the significance of 96 tonnes of CO2 over 40 days: this is approximately equivalent to 1000 hours of air travel, and also approximately the carbon footprint of
23 American homes for a <em>year</em>. Relatively speaking, this is a large footprint for a board game ‘experiment’ that lasts 40 days.</p>
<p>Is this reasonable? At this point a Googler might start talking to me about Google’s carbon offsetting scheme.</p>
<h2 id="googles-carbon-offsetting-scheme">Google’s carbon offsetting scheme:</h2>
<center><img src="https://raw.githubusercontent.com/Kepler-Lounge/Kepler-Lounge.github.io/master/_images/charts.png" width="100%" height="100%" align="middle" /></center>
<center>Google's carbon offsetting strategy in 2018: ~92% wind and ~8% solar</center>
<p>I don’t have much time for this section because Google’s carbon offsetting scheme is basically a joke but let’s break it down anyway:</p>
<ol>
<li>
<p><a href="https://cloud.google.com/renewable-energy/">According to Google</a>, the Google Cloud is supposedly 100% sustainable because Google purchases an equal amount of renewable energy for the total amount of energy used by their Cloud infrastructure.</p>
</li>
<li>
<p>If you check <a href="https://www.blog.google/outreach-initiatives/environment/meeting-our-match-buying-100-percent-renewable-energy/">the charts of Urs Hölzle</a>, the Senior VP of technical infrastructure at Google, this means that they buy a lot of wind (~92%) and some solar (~8%).</p>
</li>
<li>
<p>Let’s suppose we can take these points at face value. Does this carbon offsetting scheme actually work out?</p>
</li>
</ol>
<p>David J.C. MacKay, a giant of 20th century machine learning, would probably be rolling in his grave right now because he spent the last part of his life carefully assessing
the potential contribution of wind and solar to humanity’s energy budget [8]. He was in fact <em>Scientific Advisor to the Department of Energy and Climate Change</em>
and his essential contribution was to explain that the fundamental limits to wind and solar energy aren’t technological; we are talking about hard physical limits. I will refer the reader to ‘Sustainable Energy - Without the Hot Air’ by David J.C. MacKay, which is <a href="https://www.withouthotair.com/">freely available online</a>, rather
than repeat his thorough calculations here.</p>
<p>Unfortunately, no combination of wind and solar energy can provide energy security for a country with the USA’s energy requirements. In the best case scenario, Google’s carbon offsetting scheme is thinly veiled virtue signalling. What then are the serious clean energy solutions?</p>
<p>Past the year 2050 it’s possible to make a strong case for nuclear fusion as being necessary for human civilisation to
continue. Between now and the day we figure out how to engineer reliable nuclear fusion reactors we should use our energy budget wisely.</p>
<h2 id="boltzmanns-razor">Boltzmann’s razor:</h2>
<center><img src="https://raw.githubusercontent.com/Kepler-Lounge/Kepler-Lounge.github.io/master/_images/boltzmann.jpg" width="75%" height="75%" align="middle" /></center>
<center>Boltzmann's theory complements Darwin's theory in many ways</center>
<p>According to various sources the human brain uses ~20 Watts, which is incredibly efficient compared to the 200 kilowatts used by 5000 TPUs. In other words, AlphaGo Zero was ten thousand times less energy efficient than a human being for a comparable result. I don’t see how this is a strong argument for <em>scalability</em> at all.</p>
<p>The human brain isn’t an outlier. All biological organisms are energy efficient because they must first survive the second law of thermodynamics, which acts as a minimum-energy constraint. Now, there are two ways organisms perform computations in an economical manner that I am aware of:</p>
<ol>
<li>
<p>Morphological computation:</p>
<p>a. If you check the work of Tad McGeer [9] you will realise that it’s possible to build a walking robot without any electronics that simply exploits the laws of classical mechanics. It does computations by virtue of having a body. Some researchers might say that this is an instance of <em>embodied cognition</em> [12].</p>
<p>b. Romain Brette and his collaborators have been working on a <a href="http://romainbrette.fr/neuroscience-of-a-swimming-neuron/">project that involves a <em>swimming neuron</em></a>. The Paramecium is a single-celled organism, yet it is capable of navigation, hunting, and procreation in very complex environments. How does the Paramecium do this? What is the reward function? Is it doing reinforcement learning?</p>
</li>
<li>
<p>The role of development:</p>
<p>a. If you consider any growing organism you will realise that its <em>state space</em> and <em>action space</em> are rapidly changing. This should make learning very hard. Yet, development is in some sense a form of curriculum learning and makes learning simpler.</p>
<p>b. I must add that during development the brain of the organism is rapidly changing. Shouldn’t this make learning impossible?</p>
</li>
</ol>
<p>Morphospaces and developmental trajectories are fundamentally physical considerations. In some fundamental way organisms succeed in reorganizing physics locally. Termites in the desert construct mounds whose physical behavior is consistent with, but not reducible to, the physics of sand. Birds build nests whose physics isn’t reducible to that of their constituent parts. The resulting systems do <em>computations</em> in an economical manner by taking <em>thermodynamics</em> into
account.</p>
<p>This is why energy efficiency is both a challenge and an opportunity. It will force researchers to recognize the importance of understanding the biophysics of organisms at every scale where such biophysics contributes to survival. If I may distill this into a single principle I would call it <em>Boltzmann’s razor</em>:</p>
<p><em>Given two comparably effective intelligent systems focus on the research and development of those systems which consume less energy.</em></p>
<p>Naturally, the more economical system would be capable of accomplishing more tasks given the same amount of energy.</p>
<h2 id="discussion">Discussion:</h2>
<p>Of the AI researchers I have discussed the above issues with I noted a bimodal distribution. Roughly 30% agreed with me and roughly 70% pushed back really
hard. Among the counter-arguments of the second group I remember the following:</p>
<ol>
<li>If you force AI researchers to reduce their carbon footprint you will <em>kill</em> AI research.</li>
<li>Why do you care about what Google does? It’s their own money and they can do whatever they want with it.</li>
<li>You’re not a real AI researcher anyway. Why do you care about things outside your field?</li>
</ol>
<p>I think these are all terrible arguments. Regarding the ad hominem, like many master’s students I’m 1.5 years away from starting a PhD. I have already met a potential PhD supervisor that I have been in touch with since 2017. I will add that last year I worked as a consultant on an object detection project where I engineered a state-of-the-art object detection system inspired by Polygon RNN for a Central European computer vision company using only one NVIDIA GTX 1080 Ti [10]. Part of this system is <a href="https://github.com/AidanRocke/vertex_prediction">on Github</a>.</p>
<p>So not only do I know what I’m talking about but I have experience building reliable systems in a resourceful manner. In fact, resourcefulness is a direct implication of <em>Boltzmann’s razor</em>.</p>
<h1 id="references">References:</h1>
<ol>
<li>R. Sutton. The Bitter Lesson. 2019.</li>
<li>D. Silver et al. Mastering the game of Go without human knowledge. 2017.</li>
<li>A. Karpathy. AlphaGo, in context. 2017.</li>
<li>D. Huang. How much did AlphaGo Zero cost? 2018.</li>
<li>The Twitter Thread of Shimon Whiteson: https://twitter.com/shimon8282/status/1106534178676506624</li>
<li>This Tweet by David Sussillo: https://twitter.com/SussilloDavid/status/1106643708626137089</li>
<li>N. Jouppi et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. 2017.</li>
<li>D. J.C. MacKay. Sustainable Energy - Without the Hot Air. 2008.</li>
<li>T. McGeer. Passive Dynamic Walking. 1990.</li>
<li>L. Castrejon et al. Annotating Object Instances with a Polygon-RNN. 2017.</li>
<li>D. B. Chklovskii & C. F. Stevens. Wiring optimization in the brain. 2000.</li>
<li>G. Montufar et al. A Theory of Cheap Control in Embodied Systems. 2014.</li>
</ol>
<p>Aidan Rocke</p>
<h1>Understanding the two-thirds power law</h1>
<p>2019-03-23</p>
<center><img src="https://raw.githubusercontent.com/Kepler-Lounge/Kepler-Lounge.github.io/master/_images/galoisnotes.jpg" width="75%" height="75%" align="middle" /></center>
<h2 id="introduction">Introduction:</h2>
<p>If you consider the above scribbles of Évariste Galois, who developed Galois theory, you will note that some of the scribbles appear random. Yet, upon closer inspection none of the scribbles are completely random. Many of the scribbles are rather smooth which would be improbable if the trajectories were generated
by some kind of Brownian-type motion.</p>
<p>This isn’t really surprising if you consider the biomechanical constraints on handwritten text. In fact, some scientists have attempted to distill this observation into
a physical law known as the two-thirds power law which I analyse here. Briefly speaking, here’s a breakdown of my analysis:</p>
<ol>
<li>I provide a mathematical description of the law and describe how it may be used as a discriminative model.</li>
<li>We may also use this equation as a generative model if we consider symmetries of the equation. <a href="https://gist.github.com/AidanRocke/33c1d5268d8f8c3b395cc81ba6397f47">Here is the code.</a></li>
<li>The limitations of the ‘law’ are considered and arguments are given to shift focus on plausible generative models.</li>
</ol>
<p>In spite of its limitations I think that the <script type="math/tex">2/3</script> power law is a very good starting point for understanding biomechanical constraints
on realistic drawing tasks.</p>
<h2 id="description-of-the-law">Description of the law:</h2>
<h3 id="brief-description">Brief description:</h3>
<p>The <script type="math/tex">2/3</script> power law for the motion of the endpoint of the human upper-limb during drawing motion may be formulated as follows:</p>
<p>\begin{equation}
v(t) = K \cdot k(t)^\beta
\end{equation}</p>
<p>where <script type="math/tex">k(t)</script> is the instantaneous curvature of the path and the <script type="math/tex">2/3</script> law is satisfied when <script type="math/tex">\beta \approx -\frac{1}{3}</script>. (The name comes from the equivalent formulation in terms of angular velocity, which scales as curvature to the power <script type="math/tex">2/3</script>.) By taking logarithms
of both sides of the equation we have:</p>
<p>\begin{equation}
\ln v(t) = \ln K - \frac{1}{3} \ln k(t)
\end{equation}</p>
<h3 id="frenet-serret-formulas">Frenet-Serret formulas:</h3>
<p>To clarify what we mean by instantaneous curvature <script type="math/tex">k(t)</script> in (2) it’s necessary to use a moving reference frame, aka Frenet-Serret frame, where
in two dimensions our reference frame is described by the unit vector tangent to the curve and a unit vector normal to the curve.</p>
<p>With this moving frame we may define the curvature of regular curves (i.e. curves whose derivatives never vanish) parametrized by time as follows:</p>
<p>\begin{equation}
k(t) = \frac{\lvert \ddot{x}\dot{y} - \ddot{y}\dot{x} \rvert}{(\dot{x}^2 + \dot{y}^2)^{3/2}} = \frac{\lvert \ddot{x}\dot{y} - \ddot{y}\dot{x} \rvert}{v^3(t)}
\end{equation}</p>
<p>Now, if we denote:</p>
<p>\begin{equation}
\alpha(t) = \lvert \ddot{x}\dot{y} - \ddot{y}\dot{x} \rvert
\end{equation}</p>
<p>we have:</p>
<p>\begin{equation}
\ln v(t) = \frac{1}{3} \ln \alpha(t) - \frac{1}{3} \ln k(t)
\end{equation}</p>
<p>and we note that our law is satisfied precisely when <script type="math/tex">\alpha(t)</script> is constant in time. Given that this is a linear equation we may use it as a discriminative model by
performing a linear regression analysis on drawing data.</p>
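<p>Here is a minimal sketch of that regression on synthetic data that satisfies the law exactly (the constant and the curvature samples are made up for illustration; real drawing data would replace them). The fitted slope should recover <script type="math/tex">\beta \approx -\frac{1}{3}</script>:</p>

```python
import math
import random

# Regress ln v on ln k and check that the fitted slope is close to
# -1/3. Synthetic data: K and the curvature samples are illustrative.
random.seed(0)
K = 2.0
ks = [random.uniform(0.1, 10.0) for _ in range(500)]   # curvatures
vs = [K * k ** (-1.0 / 3.0) for k in ks]               # exact law

xs = [math.log(k) for k in ks]
ys = [math.log(v) for v in vs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# ordinary least squares for ln v = intercept + slope * ln k
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx    # should recover ln K
```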
<h3 id="parallelograms">Parallelograms:</h3>
<p>If we focus on <script type="math/tex">(4)</script> we may note that this value corresponds to the determinant of a particular matrix:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
H = \begin{bmatrix}\ddot{x} & \ddot{y}\\\dot{x} & \dot{y}\end{bmatrix}
\end{equation} %]]></script>
<p>Furthermore, we may note that this determinant may be identified with the area <script type="math/tex">K</script> of a parallelogram with the following vertices:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
O & = (0,0) \\
A & = (\ddot{x}, \dot{x})\\
B & = (\ddot{y}, \dot{y}) \\
C & = A + B
\end{split}
\end{equation} %]]></script>
<p>This formulation is useful as invariants of <script type="math/tex">\lvert \ddot{x}\dot{y} - \ddot{y}\dot{x} \rvert=K</script> now correspond to volume-preserving transformations applied to the above parallelogram.</p>
<h2 id="generative-modelling-via-invariants">Generative modelling via Invariants:</h2>
<h3 id="invariance-via-volume-preserving-transforms">Invariance via volume-preserving transforms:</h3>
<p>Let’s first note that if we always have:</p>
<p>\begin{equation}
\lvert \ddot{x}\dot{y} - \ddot{y}\dot{x} \rvert=K
\end{equation}</p>
<p>for some <script type="math/tex">K \in \mathbb{R}</script> then we must have:</p>
<p>\begin{equation}
\lvert \ddot{x}(0)\dot{y}(0) - \ddot{y}(0)\dot{x}(0) \rvert=K
\end{equation}</p>
<p>Now, given that</p>
<p>\begin{equation}
\mathcal{M} = \{ M \in \mathbb{R}^{2 \times 2}: \det(M)=1 \}
\end{equation}</p>
<p>are volume-preserving transformations, we may use <script type="math/tex">M \in \mathcal{M}</script> to simulate arbitrary trajectories that satisfy <script type="math/tex">(2)</script>. We may
think of this as the Jacobian of a linear, hence differentiable, transformation.</p>
<h3 id="computer-simulation">Computer simulation:</h3>
<p>In order to simulate these trajectories, we note that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}\begin{bmatrix}\ddot{x}_{n+1}\\\dot{x}_{n+1}\end{bmatrix} = M \cdot \begin{bmatrix}\ddot{x}_{n}\\\dot{x}_{n}\end{bmatrix}= \begin{bmatrix}a & b\\c & d\end{bmatrix} \cdot \begin{bmatrix}\ddot{x}_{n}\\\dot{x}_{n}\end{bmatrix} = \begin{bmatrix}a\ddot{x}_n + b\dot{x}_n\\c\ddot{x}_n + d\dot{x}_n\end{bmatrix}\end{equation} %]]></script>
<p>where the position is updated using:</p>
<p>\begin{equation}
x_{n+1} = x_n + \dot{x}_n\cdot \Delta t + \frac{1}{2} \ddot{x}_n \cdot \Delta t^2
\end{equation}</p>
<p>and in order to make sure that <script type="math/tex">ad-bc=1</script> we may use the trigonometric identity:</p>
<p>\begin{equation}
\cos^2(\theta) + \sin^2(\theta) = 1
\end{equation}</p>
<p>so we have:</p>
<p>\begin{equation}
ad = \cos^2(\theta)
\end{equation}</p>
<p>\begin{equation}
bc = -\sin^2(\theta)
\end{equation}</p>
<p>and as a result we have a generative variant of the 2/3 power law. Ok, but are these ‘scribbles’ ecologically plausible? I don’t think so, which is why I call <a href="https://gist.github.com/AidanRocke/33c1d5268d8f8c3b395cc81ba6397f47">the main Julia function I used to simulate these trajectories ‘crazy paths’</a>.</p>
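<p>A minimal sketch of this simulation (Python rather than the original Julia; the initial derivatives and the angle are arbitrary). Choosing <script type="math/tex">a = d = \cos\theta</script>, <script type="math/tex">b = \sin\theta</script>, <script type="math/tex">c = -\sin\theta</script> gives <script type="math/tex">ad = \cos^2\theta</script> and <script type="math/tex">bc = -\sin^2\theta</script>, and we can check that the invariant of equation (4) is preserved:</p>

```python
import math

# Sketch of the volume-preserving update above (not the author's Julia
# 'crazy paths' code). With a = d = cos(theta), b = sin(theta),
# c = -sin(theta) we get ad - bc = cos^2 + sin^2 = 1, so det(M) = 1.
theta, dt = 0.3, 0.01
a, b = math.cos(theta), math.sin(theta)
c, d = -math.sin(theta), math.cos(theta)

xdd, xd, x = 1.0, 0.0, 0.0   # x'', x', x (illustrative initial values)
ydd, yd, y = 0.5, 1.0, 0.0   # y'', y', y

def alpha(xdd, xd, ydd, yd):
    """The invariant |x''y' - y''x'| from equation (4)."""
    return abs(xdd * yd - ydd * xd)

K0 = alpha(xdd, xd, ydd, yd)
for _ in range(1000):
    # position update, then the volume-preserving map M applied to
    # both columns of the derivative matrix
    x += xd * dt + 0.5 * xdd * dt ** 2
    y += yd * dt + 0.5 * ydd * dt ** 2
    xdd, xd = a * xdd + b * xd, c * xdd + d * xd
    ydd, yd = a * ydd + b * yd, c * ydd + d * yd

K_final = alpha(xdd, xd, ydd, yd)
```

Since <script type="math/tex">\det(M)=1</script>, the determinant of the derivative matrix, and hence the invariant, is unchanged at every step (up to floating-point error).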
<h2 id="criticism">Criticism:</h2>
<ol>
<li>The <script type="math/tex">2/3</script> law is a pretty weak discriminative model because as shown by [2] the exponent varies with the viscosity of the drawing medium and as shown by [1] the exponent also depends on the complexity of the shape drawn.</li>
<li>The <script type="math/tex">2/3</script> law is an even weaker generative model as it completely ignores environmental cues. The output ‘scribbles’ aren’t the result of any plausible interaction of an agent with an ecologically realistic environment.</li>
</ol>
<p>This point becomes even clearer when you consider the underlying minimum-jerk theory that is supposed to justify this ‘law’. A verbatim interpretation of jerk minimisation would
imply that humans should mainly draw straight lines. However, there’s certainly a tradeoff between energy minimisation and the expressiveness of the figure drawn, since drawing is an activity that involves <em>communicating</em> a particular message.</p>
<h1 id="references">References:</h1>
<ol>
<li>D. Huh & T. Sejnowski. Spectrum of power laws for curved hand movements. 2015.</li>
<li>M. Zago et al. The speed-curvature power law of movements: a reappraisal. 2017.</li>
<li>U. Maoz et al. Noise and the two-thirds power law. 2006.</li>
<li>M. Richardson & T. Flash. Comparing Smooth Arm Movements with the Two-Thirds Power Law and the Related Segmented-Control Hypothesis. 2002.</li>
</ol>
<p>Aidan Rocke</p>
<h1>What is the exact value of culture?</h1>
<p>2019-03-20</p>
<p>As someone who thinks about the origins of intelligence every day, the importance of culture has grown on me over time. I can’t overstate its importance.</p>
<p>Some people have asked me what the exact value of culture is, expecting the question to stump me. But that’s an absurd question: without culture it would be impossible to ask such questions. Without culture there’s no language, art, political systems, science or technology. An essential question, therefore, is: what are the minimal conditions
for culture to emerge in a particular species? Can complex cultures develop in animals without the capacity for language?</p>
<p>I wouldn’t say that there can be no intelligent behaviour without culture, but I can confidently say, based on empirical evidence and analysis, that without culture
there would be a strict upper-bound on the kinds of intelligent systems that are possible.</p>
<h3 id="note-i-would-normally-have-a-list-of-references-here-but-this-time-i-would-advise-the-curious-reader-to-gather-their-own-data-and-try-various-thought-experiments">Note: I would normally have a list of references here but this time I would advise the curious reader to gather their own data and try various thought experiments.</h3>
<p>Aidan Rocke</p>
<h1>Parsing Elon Musk’s solar energy calculation</h1>
<p>2019-03-17</p>
<h2 id="introduction">Introduction:</h2>
<p>Back in 2015, Elon Musk gave an engaging presentation on Tesla’s new home battery products as part of his vision of a solar energy future. A key part of his
presentation hinged on a ‘blue square’, a <script type="math/tex">10^4 km^2</script> section of Texas that is supposedly sufficient to cover USA’s electricity consumption. To give the reader some idea of the size of <script type="math/tex">10^4 km^2</script>, you can fit around ten New York cities in that ‘blue square’.</p>
<p>Thinking about Elon’s solar energy calculation led me to consider the following:</p>
<ol>
<li>Can Elon’s argument be justified by a back-of-the-envelope calculation?</li>
<li>Is solar energy a serious option if we take into account the expected rate of globalisation and technological progress?</li>
</ol>
<p>Although I’m not American, the USA is a very important case as it represents approximately 20% of
the Globe’s total energy consumption while its population represents barely 5% of the world population. If we take into account the rapid
Americanisation of all countries then the second question emerges naturally.</p>
<p>Disclaimer: I take it for granted that using fossil fuels isn’t reasonable by any measure.</p>
<h2 id="elon-musks-blue-square">Elon Musk’s blue square:</h2>
<p>Let’s first note that <a href="https://www.eia.gov/electricity/annual/html/epa_01_02.html">according to the EIA</a> US electricity consumption was about 3758 TeraWatt Hours in 2015 which we may then convert to watts as follows:</p>
<p>\begin{equation}
\frac{3758 \quad \text{TeraWatt hours}}{365 \cdot 24} \approx 429 \quad \text{GigaWatts}
\end{equation}</p>
<p>Now, it’s useful to note that according to the US Energy Information Administration total energy consumption in 2015 was around <script type="math/tex">9.5 \cdot 10^{16}</script> British Thermal Units, which converts to roughly 27,800 TeraWatt hours, or about seven times the total electricity consumption in the USA. So even if Elon is right that <script type="math/tex">10^4 km^2</script> may be sufficient for <em>electricity consumption</em>, it’s a good idea to note that at least five blue squares (given the yield per square computed below) will be needed for <em>total USA energy consumption</em>. That’s more than fifty times the surface area of New York city!</p>
<p>But, is Elon right? Let’s use the following formula for calculating solar energy yield:</p>
<p>\begin{equation}
Energy = A \cdot r \cdot H \cdot PR
\end{equation}</p>
<p>where <script type="math/tex">A</script> is the total solar panel area in square meters, <script type="math/tex">r</script> is the solar panel efficiency, <script type="math/tex">H</script> is the annual average solar radiation(which varies between regions)
and <script type="math/tex">PR</script> is the performance ratio(usually between 0.5 and 0.9). If we assume Saudi Arabian solar radiation levels,that the performance ratio is nearly 1.0 and about
21% efficiency we have:</p>
<p>\begin{equation}
Energy = 10^4 \cdot .21 \cdot 2600 \cdot 1.0 \approx 5.460 \quad \text{PetaWatt Hours}
\end{equation}</p>
<p>which I can convert to Watts as follows:</p>
<p>\begin{equation}
Wattage = \frac{Energy}{365 \cdot 24} = \frac{5460 \quad \text{TeraWatt Hours}}{365 \cdot 24} \approx 623 \quad \text{GigaWatts}
\end{equation}</p>
<p>which more than satisfies the first equation so Elon is right or at least he isn’t wrong in a manner that is obvious.</p>
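<p>The back-of-the-envelope yield can be checked in a few lines (a sketch with the assumed inputs: 21% efficiency, 2600 kWh/m² per year of insolation, performance ratio 1.0):</p>

```python
# Yield of the 'blue square' under the stated assumptions: 10^4 km^2,
# 21% efficiency, 2600 kWh/m^2/year insolation, performance ratio 1.0.
area_m2 = 1e4 * 1e6                          # 10^4 km^2 in m^2
energy_kwh = area_m2 * 0.21 * 2600 * 1.0     # annual yield in kWh
avg_gw = energy_kwh / (365 * 24) / 1e6       # average power in GW

demand_gw = 3758e9 / (365 * 24) / 1e6        # 3758 TWh of US electricity
```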
<h2 id="globalisation-and-technological-progress">Globalisation and Technological progress:</h2>
<h3 id="upper-bound-on-potential-solar-power-on-earth-due-to-the-sun">Upper-bound on potential solar power on Earth due to the sun:</h3>
<p>In order to estimate the potential solar power on Earth due to the sun we may use the Stefan-Boltzmann equation for luminosity:</p>
<p>\begin{equation}
L_o= \sigma A T^4
\end{equation}</p>
<p>which depends on <script type="math/tex">T</script> the effective temperature of the sun, <script type="math/tex">A</script> its surface area and <script type="math/tex">\sigma</script> the Stefan-Boltzmann constant.</p>
<p>Now if we define 1 <script type="math/tex">AU</script> to be the average distance between the Earth and the sun, the maximum solar power available to Earth is given by:</p>
<p>\begin{equation}
P = L_o \frac{\pi R_{Earth}^2}{4 \pi (1 AU)^2} = \sigma T^4 \big(\frac{R_s}{1 AU}\big)^2 \pi R_{Earth}^2 \approx 174 \quad \text{PetaWatts}
\end{equation}</p>
<p>where <script type="math/tex">\pi R_{Earth}^2</script> is the Earth’s cross-sectional area facing the sun.</p>
<p>and if we take into account that about 30% of sunlight is reflected back into outer space (so roughly 70% is absorbed), only about a third of the Globe is
terrestrial and the solar energy conversion efficiency is around 20%, we have:</p>
<p>\begin{equation}
\bar{P}= 0.7 \cdot 0.3 \cdot 0.2 \cdot P \approx 7300 \quad \text{TeraWatts}
\end{equation}</p>
<p>which is about four hundred times current use. This might sound like a large margin until you realise that energy consumption in developed countries
has been growing exponentially during the last two hundred years.</p>
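<p>Both estimates above can be reproduced with standard astronomical constants (a sketch; the constants are textbook values, not from the original post):</p>

```python
import math

# The Stefan-Boltzmann bound, with standard constants (T = 5778 K,
# R_sun = 6.957e8 m, 1 AU = 1.496e11 m, R_earth = 6.371e6 m).
sigma = 5.670e-8                     # Stefan-Boltzmann constant, W m^-2 K^-4
T, R_sun, AU, R_earth = 5778.0, 6.957e8, 1.496e11, 6.371e6

flux_at_earth = sigma * T ** 4 * (R_sun / AU) ** 2   # solar constant, W/m^2
P = flux_at_earth * math.pi * R_earth ** 2           # intercepted power, W
P_bar = 0.7 * 0.3 * 0.2 * P                          # usable fraction, W
```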
<h3 id="the-rate-of-globalisation">The rate of globalisation:</h3>
<p>Given that new energy infrastructure takes time to build, it’s essential to realise that you build it for the foreseeable future, and by that I mean at least the next couple of decades. Is it possible that within a few decades we might start getting
dangerously close to <script type="math/tex">\bar{P}</script>? I would argue that we are already in trouble because we are rapidly becoming American in our energy consumption
patterns.</p>
<p>As I mentioned earlier the USA is responsible for approximately 20% of Earth’s total energy consumption while its population
represents barely 5% of the world population. So convergence in American-style economic development implies that eventually the entire world will be consuming around four times more energy per annum. Let’s denote this global energy consumption pattern by <script type="math/tex">\hat{P}</script> and suppose that this event happens within a decade, then:</p>
<p>\begin{equation}
\frac{\bar{P}}{\hat{P}} \approx 100
\end{equation}</p>
<p>so we’re only a factor of a hundred away from doomsday and we haven’t even started building solar panels seriously yet. Can things get worse?</p>
<h3 id="the-rate-of-technological-progress">The rate of technological progress:</h3>
<p>As pointed out by Tom Murphy, a professor of Physics at UC San Diego, <a href="https://dothemath.ucsd.edu/2011/07/galactic-scale-energy/">the rate of energy consumption in the USA has been increasing exponentially</a>: around 2.3% per year, which might not sound like much until you think about the effect of compounding. Let’s suppose that by 2030 all countries have similar energy consumption patterns. In order to figure out how much time human civilisation has left on Earth we must calculate:</p>
<p>\begin{equation}
x = \frac{\ln 100}{\ln 1.023} \approx 200 \quad \text{years}
\end{equation}</p>
<p>but that’s the result of a simple extrapolation. I’m actually much less optimistic.</p>
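<p>For completeness, the extrapolation is a single line (the remaining factor of ~100 and the 2.3% growth rate are the assumptions stated above):</p>

```python
import math

# Years for energy consumption to grow by the remaining factor of ~100
# at 2.3% per year, as in the extrapolation above.
years = math.log(100) / math.log(1.023)   # roughly two hundred years
```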
<p>At a time when we have two emerging superpowers, China & India, and a declining USA that is willing to do everything it can to maintain its hegemony, I
believe that we’ll witness an acceleration.</p>
<h2 id="discussion">Discussion:</h2>
<p>If we do go all the way with solar energy, another important factor to consider is the volume of lithium-ion battery production that would be necessary in
order to supply electricity at night around the Globe. I have yet to do detailed calculations on this but it makes me even less confident that the future
can be 100% solar. Yet, if not solar energy then what? I think we can actually make a case for nuclear fusion, especially if we start thinking about multi-planetary
civilisations. But, I will save this analysis for another day.</p>
<h1 id="references">References:</h1>
<ol>
<li>T. Murphy. Galactic-Scale Energy. 2011.</li>
<li>D. J.C. MacKay. Sustainable Energy - Without the Hot Air. 2008.</li>
<li>EIA. Annual Energy Outlook. 2019.</li>
</ol>
<p>Aidan Rocke</p>
<h1>An introduction to Intrinsic Physics</h1>
<p>2019-03-15</p>
<center><img src="https://i.stack.imgur.com/f4ned.png" width="75%" height="75%" align="middle" /></center>
<h2 id="introduction">Introduction:</h2>
<p>Around November 2018 I went through an existential crisis regarding general theories of intelligence. The conversation I had with Alex Gomez-Marin, a behavioural neuroscientist, back in April 2018 rattled in my head as I struggled to find ways to justify their existence. The main issue was that, from the perspective
of an organism, all the theories of general intelligence I had surveyed, whether Friston’s Free Energy Principle [3], Polani’s information-theoretic Empowerment theory,
Wissner-Gross’ statistical mechanical Causal Entropic Forces theory [4] or Hutter’s compression-based AIXI theory, were simultaneously computationally intractable and epistemologically unsound.</p>
<p>I realised that it’s also possible to spend decades working on toy models of these theories to demonstrate a ‘proof of concept’ but these implementations on toy problems couldn’t possibly scale for reasons that were probably known to the authors beforehand. Alternatively, certain authors might use ‘approximations’ of these theories but none of these researchers ever tried to quantify the quality of their approximation. I also realised that I had no desire to become either kind of researcher.</p>
<p>After a couple weeks I decided to step back from dreams of a general theory and focus my attention on essential questions for intelligent behaviour in organisms.</p>
<h2 id="subjective-physics">Subjective Physics?:</h2>
<p>A few months after my conversation with Gomez-Marin I went over ‘Neuroscience Needs Behavior: Correcting a Reductionist Bias’[1] which made a strong case for the role of behavioural neuroscience in designing neural interventions. As with ‘Could a Neuroscientist Understand a Microprocessor?’[4] by Jonas and Kording they pointed out that an astronomical amount of neural data wouldn’t be enough to isolate the causal neural mechanisms that determine the behaviour of an organism. In order for neural interventions to have any meaning you must close the loop by designing ecologically realistic behavioural experiments. The purpose of the brain after all is to generate meaningful behaviour.</p>
<p>That said, this opened the door to many approaches. Which one should I choose? By some chance occurrence I was going through the blog of Romain Brette,
an excellent neuroscientist, and stumbled upon an article of his on <a href="http://romainbrette.fr/subjective-physics/">Subjective Physics</a>. His 43 page text was motivated by the
following thought experiment:</p>
<blockquote>
<p>Imagine a naive organism who does not know anything about the world. It can capture signals through its sensors and it can make actions. What kind of knowledge about the world is accessible to the organism?</p>
</blockquote>
<p>I thought Romain Brette’s text was wonderful. It made sense from both a behavioural science and a cybernetics perspective. On the one hand, all theories of physics and complex systems have a compositional structure where powerful methods emerge from simpler principles. On the other hand, by the Good Regulator Theorem intelligent organisms must try to learn an internal model of their environment in order to simulate the evolution of key physical variables. What statistical models might be useful for learning such physical models with re-usable components and how might such models be learned in an intrinsically motivated manner?</p>
<h2 id="a-behaviorist-account">A behaviorist account:</h2>
<p>Before continuing, I’d like to start by introducing a behaviorist framework that is widely accepted among AI researchers. Specifically, let’s suppose that all intelligent organisms do reinforcement learning:</p>
<center><img src="https://raw.githubusercontent.com/Kepler-Lounge/Kepler-Lounge.github.io/master/_images/RL.png" width="75%" height="75%" align="middle" /></center>
<p>There are several outstanding problems:</p>
<ol>
<li>Where do the rewards come from?</li>
<li>What is the relationship between embodiment and learning?</li>
<li>Are observations objective?</li>
</ol>
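To make the loop in the diagram concrete, here is a minimal sketch of an agent-environment interaction. The dynamics, reward and policy below are hypothetical choices made purely for illustration; nothing here comes from a specific reference:

```python
import random

random.seed(0)

def environment_step(state, action):
    """Toy dynamics: the action nudges the state; reward peaks at the origin."""
    next_state = state + action + random.gauss(0.0, 0.1)
    reward = -abs(next_state)
    return next_state, reward

def policy(state):
    """A hand-coded policy that drives the state towards the origin."""
    return -0.5 * state

state, total_reward = 5.0, 0.0
for t in range(100):
    action = policy(state)                            # agent acts
    state, reward = environment_step(state, action)   # environment responds
    total_reward += reward                            # rewards accumulate
```

Each of the three outstanding problems lives in one line of this loop: the reward function is imposed from outside, the dynamics stand in for embodiment, and the state the agent sees is simply assumed to be objective.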
<p>Is the organism trying to reliably and perceptibly control its environment? If so, the Good Regulator Theorem implies an internal model of the environment, and I’d argue
that a consistent internal model implies the existence of physical laws. But, what do I mean by <em>internal model</em>, <em>consistent</em> and <em>physical law</em>?</p>
<h2 id="the-consistency-criterion-for-intrinsic-physics">The consistency criterion for Intrinsic Physics:</h2>
<h3 id="what-is-the-consistency-criterion">What is the consistency criterion?</h3>
<p>By <em>physical law</em> I mean a forward model which takes as input initial conditions and may be used to simulate future states of an environment. Let’s take Newton’s laws of motion in <script type="math/tex">\mathbb{R}^3</script> for a concrete example:</p>
<p>\begin{equation}
\sum F = 0 \iff \frac{dv}{dt} = 0
\end{equation}</p>
<p>\begin{equation}
F = \frac{d \vec{p}}{dt}
\end{equation}</p>
<p>\begin{equation}
F_{AB} = -F_{BA}
\end{equation}</p>
<p>It can be demonstrated that none of these laws contradicts the others. In that sense they are <em>consistent</em> with each other. You can also show that they are necessary and
sufficient for simulating any conservative mechanical system (whose kinematics may be described using Newton’s calculus). But, that’s a different matter.</p>
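The consistency of the three laws can also be checked numerically: integrating two interacting particles with F_AB = −F_BA and updating momenta via the second law, the total momentum of the closed system stays constant, exactly as the first law demands. A minimal sketch, where the spring coupling and unit masses are arbitrary choices:

```python
import numpy as np

# Two unit-mass particles in R^3 coupled by a spring force, with F_AB = -F_BA
x_a, x_b = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
v_a, v_b = np.array([0.0, 1.0, 0.0]), np.array([0.0, -1.0, 0.0])
dt, k = 1e-3, 2.0

p_initial = v_a + v_b                 # total momentum at t = 0
for _ in range(10_000):
    f_ab = -k * (x_a - x_b)           # force on A due to B
    f_ba = -f_ab                      # third law
    v_a = v_a + dt * f_ab             # second law: dp/dt = F
    v_b = v_b + dt * f_ba
    x_a, x_b = x_a + dt * v_a, x_b + dt * v_b
p_final = v_a + v_b
```

The forward model takes initial conditions as input and simulates future states, and the three laws never disagree along the trajectory.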
<p>So this is what I mean by consistency. Now, what do I mean by internal model and how might the <em>consistency criterion</em> apply to an internal model?</p>
<h3 id="what-is-an-internal-model">What is an internal model?</h3>
<center><img src="https://raw.githubusercontent.com/Kepler-Lounge/Kepler-Lounge.github.io/master/_images/internal_model.png" width="75%" height="75%" align="middle" /></center>
<p>Within the Probabilistic Graphical Model framework, an internal model is a directed graph whose nodes are variables and edges are probabilistic relations between the nodes.
The main function of an internal model is to simulate the future of an agent’s environment as explained in ‘World Models’ [8].</p>
<p>For concreteness let’s consider the deep autoregressive model for the decoder distribution used in ‘Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning’ [9]. This model takes as input the initial position of a flat organism <script type="math/tex">x \in \mathbb{R}^2</script> as well as a future state <script type="math/tex">x' \in \mathbb{R}^2</script> and
outputs a probability distribution over trajectories which may be used for sampling paths.</p>
<p>Here’s the explicit mathematical description:</p>
<p>\begin{equation}
q_{\xi}(\vec{a}|x,x') = q(a_1|x,x')\prod_{k=2}^K q(a_k|f_{\xi}(a_{k-1}|x,x'))
\end{equation}</p>
<p>where the per-action distribution is defined as follows:</p>
<p>\begin{equation}
q(a_k|f_{\xi}(a_{k-1}|x,x')) = \mathcal{N}(a_k|\mu_{\xi}(a_{k-1},x,x'),\sigma_{\xi}^2(a_{k-1},x,x'))
\end{equation}</p>
<p>\begin{equation}
\mu_{\xi}(a_{k-1},x,x') = g(W_\mu \eta + b)
\end{equation}</p>
<p>\begin{equation}
\log \sigma_{\xi}(a_{k-1},x,x') = g(W_\sigma \eta + b)
\end{equation}</p>
<p>\begin{equation}
\eta = l(W_2g(W_1 x + b_1)+b_2)
\end{equation}</p>
<p>where <script type="math/tex">g(\cdot)</script> refers to the ReLU activation.</p>
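To make the notation concrete, here is a rough numpy sketch of sampling a trajectory of actions from such a decoder. The layer sizes and the small random stand-in weights for ξ are illustrative, and here η is computed from the concatenated input (a_{k-1}, x, x'); this is a sketch under those assumptions, not a port of the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_in, d_h = 5, 5, 16   # K actions; input is (a_{k-1}, x, x') flattened

# Small random stand-ins for the learned parameters xi (kept small so the
# exp(log_sigma) term stays numerically tame)
W1, b1 = rng.normal(0, 0.1, size=(d_h, d_in)), np.zeros(d_h)
W2, b2 = rng.normal(0, 0.1, size=(d_h, d_h)), np.zeros(d_h)
W_mu, W_sigma, b = rng.normal(0, 0.1, size=(1, d_h)), rng.normal(0, 0.1, size=(1, d_h)), 0.0

g = lambda z: np.maximum(z, 0.0)   # the ReLU activation in the equations above

def per_action_params(a_prev, x, x_next):
    """Return mu_xi and sigma_xi for q(a_k | f_xi(a_{k-1} | x, x'))."""
    inp = np.concatenate([[a_prev], x, x_next])
    eta = W2 @ g(W1 @ inp + b1) + b2           # shared hidden code eta
    mu = g(W_mu @ eta + b)[0]                  # mu_xi  = g(W_mu eta + b)
    log_sigma = g(W_sigma @ eta + b)[0]        # log sigma_xi = g(W_sigma eta + b)
    return mu, np.exp(log_sigma)

x, x_next = np.array([0.0, 0.0]), np.array([1.0, 1.0])
actions = [rng.normal()]                       # a_1 ~ q(a_1 | x, x')
for k in range(2, K + 1):
    mu, sigma = per_action_params(actions[-1], x, x_next)
    actions.append(rng.normal(mu, sigma))      # sample a_k given a_{k-1}
```

Sampling many such action sequences gives the distribution over trajectories between x and x' that the decoder represents.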
<p>Incidentally, I spent a few months <a href="https://github.com/AidanRocke/variational_empowerment">trying to implement</a> this particular model with Ildefons Magrans without too much success. The main difficulty was that the learning algorithm wasn’t very stable.</p>
<h3 id="consistent-internal-model">Consistent Internal Model:</h3>
<p>Now, I posit that if an internal model is consistent and has a graphical representation then it must be in some sense fully connected. To be a bit more precise, I think a necessary and sufficient requirement for the internal model to be consistent is that there exists a sequence of directed edges between every pair of nodes. These edges represent probabilistic relations that are modelled by a parametric model whose parameters may be modified subject to the agent’s experience.</p>
<p>The main idea behind this requirement is that if there is a single probabilistic logic being used for communication between nodes in the internal model then the agent can’t simultaneously pursue two contradictory policies. Therefore its ontology, i.e. knowledge representation, will be consistent in an intuitive sense.</p>
<h2 id="motivating-questions-and-further-discussion">Motivating questions and further discussion:</h2>
<p>I’d like to emphasise that my Intrinsic Physics framework is pretty light on formalism. In fact, I think this is essential in the early stages of any scientific theory. Indeed, if the history of science is any guide, new theories which were incredibly complicated at their inception tended to have a harder time recovering from their flaws. Moreover, in order for the framework to stay on the right track I propose the following motivating questions:</p>
<ol>
<li>Epistemology: What does my RL agent really know vs. what kinds of physics can be known to an organism?</li>
<li>Learning: What kinds of statistical models might allow an organism to gain such knowledge in a Popperian sense?</li>
<li>Morphogenesis: What is the relationship between structure and function (e.g. learning)? How does development complement the learning paradigm?</li>
<li>Uncertainty: How might the organism represent and compute uncertainty?</li>
<li>Information flows: What is the causal influence of an agent on its environment, as opposed to the causal influence of the environment on the agent (i.e. control as inference)?</li>
</ol>
<p>Finally, there is an unwritten rule: always work on the simplest systems you don’t understand. You might also work on more complex systems, but keep the simpler systems
like microprocessors and C. elegans in mind. These will keep your theory honest and make sure that the most fundamental intellectual constructs on which your theory
is founded aren’t merely diversions. For this reason I decided to start by applying the Intrinsic Physics framework to neural networks as far as <a href="https://keplerlounge.com/deep/learning/2019/03/06/extrinsic-geometry.html">structure and function</a>
are concerned.</p>
<p>One more thing: feedback is definitely welcome, especially on the consistency criterion for intrinsic physics. You can reach me via email: aidanrocke@gmail.com</p>
<h1 id="references">References:</h1>
<ol>
<li>J. Krakauer, A. Ghazanfar, A. Gomez-Marin, M. MacIver, and D. Poeppel. Neuroscience Needs Behavior: Correcting a Reductionist Bias. 2017.</li>
<li>R. Brette. Subjective physics. 2013.</li>
<li>E. Jonas & K. Kording. Could a Neuroscientist Understand a Microprocessor? 2017.</li>
<li>K. Friston. The free-energy principle: a rough guide to the brain? 2009.</li>
<li>C. Salge, C. Glackin, D. Polani. Empowerment – an Introduction. 2013.</li>
<li>A. D. Wissner-Gross & C. E. Freer. Causal Entropic Forces. 2013.</li>
<li>M. Hutter. Universal Algorithmic Intelligence: A Mathematical Top-Down Approach. 2007.</li>
<li>D. Ha & J. Schmidhuber. World Models. 2018.</li>
<li>D. Rezende & S. Mohamed. Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning. 2015.</li>
</ol>Aidan RockeExperiments with Products of Random Matrices2019-03-11T00:00:00+00:002019-03-11T00:00:00+00:00/random/matrices/2019/03/11/random_matrix_expt<p>Let’s suppose that we have a random vector <script type="math/tex">X_0 \sim U(-1,1)^3</script> which we transform using a sequence of random
matrices <script type="math/tex">M_i \sim U(-1,1)^{3 \times 3}</script>. We may then define the following:</p>
<p>\begin{equation}
\pi_{n} = \prod_{i=1}^n M_i
\end{equation}</p>
<p>\begin{equation}
X_n = \pi_n X_0
\end{equation}</p>
<p>Now, based on <a href="https://gist.github.com/AidanRocke/1c154b42a52902015a83603a5efb640e">numerical experiments</a> I conjecture that <script type="math/tex">\{X_i\}_{i=1}^n</script> behaves like a Brownian-type motion and:</p>
<p>\begin{equation}
\forall C \in [0, \infty) , \lim_{N \to \infty} P(\lVert X_N -X_0 \rVert \geq C \quad i.o.) = 1
\end{equation}</p>
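The original numerical experiments were in Julia; the following numpy sketch sets up the same quantities so that the distribution of ‖X_N − X_0‖ can be explored directly (the step and trial counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n_steps, n_trials = 50, 200

norms = np.empty(n_trials)
for t in range(n_trials):
    x0 = rng.uniform(-1, 1, size=3)            # X_0 ~ U(-1,1)^3
    x = x0.copy()
    for _ in range(n_steps):
        M = rng.uniform(-1, 1, size=(3, 3))    # M_i ~ U(-1,1)^{3x3}
        x = M @ x                              # X_n = M_n X_{n-1} = pi_n X_0
    norms[t] = np.linalg.norm(x - x0)          # ||X_N - X_0||
```

Histogramming `norms` for increasing `n_steps` gives a first empirical look at whether the walk recurrently exceeds any fixed threshold C.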
<p>In order to proceed with a mathematical analysis of this problem it’s useful to note that the product
of two uniformly distributed random variables <script type="math/tex">x_1, x_2 \sim U(0,1)</script> isn’t uniform. In fact, it can
easily be shown that:</p>
<p>\begin{equation}
P(0 \leq x_1 \cdot x_2 \leq z) = z - z\ln z
\end{equation}</p>
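This closed form is easy to verify by Monte Carlo (the sample size and test point below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(size=1_000_000)
x2 = rng.uniform(size=1_000_000)

z = 0.3
empirical = np.mean(x1 * x2 <= z)     # P(0 <= x1 * x2 <= z) by simulation
theoretical = z - z * np.log(z)       # the closed form z - z ln z
```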
<p>I shall definitely return to this problem next week. One motivation for it is that I think <script type="math/tex">\{X_i\}_{i=1}^n</script> may
be used to approximate any continuous function with compact support in <script type="math/tex">\mathbb{R}^3</script> as <script type="math/tex">n \to \infty</script>.
Assuming this is true, I’m certain it generalises readily to <script type="math/tex">\mathbb{R}^n</script>.</p>
<p>Below are references which I have yet to consult that may be useful for further investigations.</p>
<h1 id="references">References:</h1>
<ol>
<li>J.R. Ipsen. Products of Independent Gaussian Random Matrices. 2015.</li>
<li>Vladislav Kargin. Products of Random Matrices: Dimension and Growth in Norm. 2010.</li>
<li>Carl P. Dettmann & Orestis Georgiou. Product of n independent Uniform Random Variables. 2009.</li>
</ol>Aidan RockeLet’s suppose that we have a random vector which we transform using a sequence of random matrices . We may then define the following:The extrinsic geometry of deep rectifier networks2019-03-06T00:00:00+00:002019-03-06T00:00:00+00:00/deep/learning/2019/03/06/extrinsic-geometry<center><img src="https://home.cern/sites/home.web.cern.ch/files/2018-06/bubble-chamber-bebc.jpg" width="75%" height="75%" align="middle" /></center>
<center>An artist's rendition of trajectories in latent space (source: CERN)</center>
<h1 id="introduction">Introduction:</h1>
<p>Let’s suppose that one day you could somehow visualize what was going on inside deep neural networks (with ReLU activation, of course) while
they were being fed data at an insane rate by GPUs. What would you see?</p>
<p>If you embed the neural network calculations in some high-dimensional affine space, i.e. lacking an origin, then you would probably see the equivalent of high-dimensional particles hurtling through space: information collapsing into dot products at the nodes of one layer before fanning out again in unpredictable directions due to the affine transformations at the next layer. Unpredictable to you because you are an external observer who has no idea what’s going on…but with this extrinsic perspective it’s still possible to infer a lot, even if your choice of ambient space might be a nonsensical parametrization with respect to the neural network.</p>
<p>As a result of careful analysis you might realise the following:</p>
<ol>
<li>
<p>The latent space of a deep rectifier network <script type="math/tex">F_\theta</script> is an Orthogonal Function space and functions by de-correlating input signals <script type="math/tex">x \sim X</script>. This became clear to me
<a href="https://keplerlounge.com/deep/learning/2019/02/12/deep-rectifiers.html">a few weeks ago</a>.</p>
</li>
<li>
<p>Besides the algebraic structure you might also notice how lower levels of the network may be identified with re-usable geometric transformations that are used exponentially more often than expressions at the higher levels of the network. So the deep rectifier network may be identified with a special kind of <a href="https://keplerlounge.com/deep/learning/2019/03/03/self-organised-origami.html">geometric decision tree</a>.</p>
</li>
<li>
<p>In higher-dimensional spaces geometric transformations correspond to Jacobians if we can justify the transition from discrete-time to continuous time-and-space geometric
decision trees.</p>
</li>
<li>
<p>From the perspective of function approximation I show that this transition is justified and also that function approximation implies manifold learning. I also explain that information concerning the manifold must be encoded by sequences of affine transformations which are trajectories of information in latent space.</p>
</li>
<li>
<p>Finally, I show that the trajectory formalism leads to a natural statistical relation between linear interpolations in parameter space <script type="math/tex">\theta</script> and the non-convexity of <script type="math/tex">F_\theta</script>.</p>
</li>
</ol>
<p>Given that many readers of this article may not have read my two previous articles on deep rectifier networks, my objective is to first summarize the Function Space and Geometric Decision Tree perspectives before motivating the trajectory-in-latent-space perspective. I then show that the structure of trajectories in latent space leads to a natural statistical relation between linear interpolations in parameter space <script type="math/tex">\theta</script> and the non-convexity of <script type="math/tex">F_\theta</script>. This explanation is accompanied by <a href="https://gist.github.com/AidanRocke/a0d8b16a3008ca0d7555500e8d597f43">calculations using
sequences of random matrices in Julia Lang</a>.</p>
<h1 id="the-orthogonal-function-space-structure-of-the-latent-space">The Orthogonal Function Space structure of the latent space:</h1>
<p>Mathematically, we may introduce deep rectifier networks as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
F_\theta: \mathbb{R}^{n_i} & \rightarrow \mathbb{R}^{n_o}\\
F(x;\theta) & = f \circ h_L \circ ... \circ h_1(x) = f \circ \phi(x) \\
h_l & = relu \circ f_l \\
f_l(x_{l-1}) & = W_lx_{l-1} + b_l
\end{split}
\end{equation} %]]></script>
<p>where the parameter space <script type="math/tex">\theta</script> is defined as follows:</p>
<p>\begin{equation}
\theta = \big\{\ W_l \in \mathbb{R}^{ n_l \times n_{l-1}},b_l \in \mathbb{R}^{ n_l}: l \in [L] \big\}
\end{equation}</p>
<p>so if there are <script type="math/tex">L</script> layers and <script type="math/tex">N = \sum_{i=1}^L n_i</script> nodes you may observe that:</p>
<ol>
<li>The ReLU activation serves as a gating mechanism for a deep network with <script type="math/tex">N</script> nodes.</li>
<li>This gating mechanism decomposes the latent space of a deep rectifier network into <script type="math/tex">m \leq 2^N</script> linear feature maps <script type="math/tex">\phi_i</script>.</li>
<li>Each of these feature maps <script type="math/tex">\phi_i</script> have compact support and their domains <script type="math/tex">X_i</script> are pair-wise disjoint.</li>
<li>It follows that the latent space of <script type="math/tex">F_\theta</script> forms an orthogonal function space.</li>
<li>The continuity and compact support of the <script type="math/tex">\phi_i</script> implies that the <script type="math/tex">\phi_i</script> are square integrable so the latent space of <script type="math/tex">F_\theta</script> forms a Hilbert space.</li>
</ol>
<p>This is interesting because many natural signals, ex. sounds, inhabit Hilbert spaces.</p>
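Points (1)–(3) can be checked on a small random network: within a region where the ReLU gating pattern is fixed, the network acts as a single linear feature map φ_i, so finite-difference slopes taken at different step sizes coincide. A numpy sketch with an arbitrary two-layer architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 8)), rng.normal(size=8)

def forward(x):
    """Two ReLU layers; also return the binary gating pattern of all nodes."""
    z1 = W1 @ x + b1
    z2 = W2 @ np.maximum(z1, 0) + b2
    return np.maximum(z2, 0), np.concatenate([z1 > 0, z2 > 0])

x = rng.normal(size=2)
d = rng.normal(size=2)
y0, p0 = forward(x)
y1, p1 = forward(x + 1e-6 * d)   # nearby points generically share
y2, p2 = forward(x + 2e-6 * d)   # the gating pattern of x

slope1 = (y1 - y0) / 1e-6        # equal directional slopes => the network is
slope2 = (y2 - y0) / 2e-6        # a fixed linear feature map on this region
```

The gating pattern indexes which of the (at most 2^N) feature maps φ_i is active, and the matching slopes confirm that each φ_i is linear on its domain X_i.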
<h1 id="deep-rectifier-networks-as-geometric-decision-trees">Deep Rectifier Networks as Geometric Decision trees:</h1>
<p>The Hilbert Space perspective is insightful as it shows that deep rectifier networks function by de-correlating input signals <script type="math/tex">x \sim X</script> but
this algebraic perspective doesn’t explain why depth or width of a rectifier network may be important. For this we need to venture beyond algebraic
structure and think in terms of geometry.</p>
<p>A colleague suggested that I have a look at ‘Tropical Geometry of Deep Neural Networks’ [2] which supposedly explains why deeper networks are exponentially
more expressive but as I have little familiarity with tropical geometry I decided to try and follow my own reasoning first. Here’s what I found:</p>
<ol>
<li>
<p>A deep rectifier network may be identified with an <script type="math/tex">N</script> level decision tree for solving geometric problems in a sequential manner with lower layers
of the network forming re-usable sub-expressions.</p>
</li>
<li>
<p>Each expression is an affine transformation and all geometric transformations in Euclidean space are affine. It follows that deep rectifier networks may be viewed as geometric programs where each sub-expression is a geometric operation.</p>
</li>
<li>
<p>At each layer of the network the candidate expressions are a subset of the power set of distinct nodes at that layer. Hence the importance of network width for
versatile geometric reasoning.</p>
</li>
<li>
<p>Furthermore, each sub-expression at layer <script type="math/tex">i</script> may be re-used at most <script type="math/tex">2^{N-\sum_{k=1}^i n_k}</script> times by sub-expressions at higher levels of the decision tree. Montufar [1] gives a similar argument but he identifies expressions with ‘folds’ which is incorrect in my opinion.</p>
</li>
<li>
<p>The importance of the last statement is clear when we think in terms of geometric transformations. Deep rectifier networks permit an exponential number of possible sequences
of geometric transformations of length <script type="math/tex">N</script> and therefore the complexity of the set of possible transformations of the latent space is proportional to network depth.</p>
</li>
</ol>
<p>In summary, depth gives us geometric complexity and width gives us versatility. But how should we think of geometric transformations in higher-dimensional space?</p>
<h1 id="geometric-transformations-in-higher-dimensional-space">Geometric transformations in higher-dimensional space:</h1>
<p>The best way to understand spatial deformations in higher-dimensional affine spaces is to go to the continuous space limit where we may think in
terms of the Jacobian and the determinant of the Hessian, which tells us something about local curvature.</p>
<p>Given an affine transformation <script type="math/tex">T</script> from <script type="math/tex">Aff(\mathbb{R}^3)</script>:</p>
<p>\begin{equation}
x \overset{T}\mapsto Ax + b
\end{equation}</p>
<p>the Jacobian of <script type="math/tex">T</script> is simply <script type="math/tex">J(T)=A</script>.</p>
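A quick finite-difference check that the Jacobian of an affine map really is its linear part A (numpy; the dimension and entries are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A, b = rng.normal(size=(3, 3)), rng.normal(size=3)
T = lambda x: A @ x + b            # an element of Aff(R^3)

x0, eps = rng.normal(size=3), 1e-6
# Column j of the Jacobian is the finite difference of T along basis vector e_j
J = np.column_stack([(T(x0 + eps * e) - T(x0)) / eps for e in np.eye(3)])
```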
<p>The reason why this continuous-space approximation is valid is that as network depth increases we may think in terms of smooth trajectories in latent space.
The geometry of these trajectories is essential for both function approximation and manifold learning.</p>
<h1 id="from-discrete-time-to-continuous-time-and-space-geometric-decision-trees">From discrete-time to continuous-time-and-space geometric decision trees:</h1>
<p>Sequences of affine transformations, which channel trajectories in latent space, may not be reduced to a single affine transformation because the former tells
you how space was traversed whereas the latter tells you only the origin and destination of the <em>sequence</em> of affine transformations. We can make this argument
more precise by quantifying the information gained at each level of the geometric decision tree.</p>
<p>Given that the output at level <script type="math/tex">i</script> tells you everything you need to know about the <script type="math/tex">N- \sum_{k=1}^i n_k</script> remaining nodes, the information gained at the ith
level of the geometric decision tree is exactly:</p>
<p>\begin{equation}
\log_2(2^{N- \sum_{k=1}^i n_k})= N- \sum_{k=1}^i n_k
\end{equation}</p>
<p>For these reasons we may argue that trajectories capture important relative information and in the next section I explain that they encode information concerning
the manifold of <script type="math/tex">X</script>.</p>
<h1 id="function-approximation-as-manifold-learning-or-how-to-see-the-forest-from-the-trees">Function approximation as manifold learning, or how to see the forest from the trees:</h1>
<p>Thinking in terms of functions and function spaces is generally more useful than reasoning about networks with particular parametrizations. One reason for this
is that if we assume that a fully-connected network has <script type="math/tex">N</script> layers and <script type="math/tex">n_i</script> nodes per layer, there are:</p>
<p>\begin{equation}
\prod_{i=1}^N n_i !
\end{equation}</p>
<p>layer-wise permutations that result in functions equivalent to <script type="math/tex">F_\theta</script>. This is clear when you think about how dot products encode no information about summation order.</p>
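The count ∏ n_i! follows because permuting the hidden units of a layer, together with the matching rows of (W_1, b_1) and columns of W_2, leaves the computed function unchanged. A one-hidden-layer numpy check (shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
relu = lambda z: np.maximum(z, 0)

f = lambda x: W2 @ relu(W1 @ x + b1) + b2

perm = rng.permutation(5)               # reorder the 5 hidden units
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]
f_perm = lambda x: W2p @ relu(W1p @ x + b1p) + b2

x = rng.normal(size=3)
y, y_perm = f(x), f_perm(x)
```

The two parametrizations are distinct points in parameter space but identical as functions, which is exactly why reasoning about functions rather than parametrizations pays off.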
<p>Another reason why the function approximation perspective is insightful is that it allows us to reason about the deformations of the latent space as we increase network depth.
These intricate deformations in latent space are responsible for the usefulness of the Hilbert Space structure of the latent space.</p>
<p>In this context, let <script type="math/tex">T_n</script> denote an affine transformation and <script type="math/tex">X_0 \sim X</script> denote a signal randomly sampled from <script type="math/tex">X</script>. Then:</p>
<p>\begin{equation}
X_n := T_n(X_{n-1})
\end{equation}</p>
<p>denotes a point in the affine space.</p>
<p>Using (6) we may define a sequence of affine transformations in latent space:</p>
<p>\begin{equation}
T^N := T_N \circ T_{N-1} \circ … \circ T_1
\end{equation}</p>
<p>If we assume that function approximation capacity improves with network depth these sequences converge point-wise:</p>
<p>\begin{equation}
\lim_{N \to \infty} T^N(X_0) = F(X_0)=Y_0
\end{equation}</p>
<p>where <script type="math/tex">F</script> is the function to be approximated.</p>
<p>Now, from a statistical perspective function approximation requires minimising the conditional entropy:</p>
<p>\begin{equation}
H(Y|X) = H(Y,X) - H(X) \geq 0
\end{equation}</p>
<p>and this means that learning the joint distribution <script type="math/tex">P(Y,X)</script> requires learning everything there is to know about the structure of <script type="math/tex">X</script>.
Intuitively this makes sense as exploiting the structure of spatio-temporal signals makes it easier to de-correlate the signals <script type="math/tex">x \sim X</script>
in a useful way. It follows that discovering informative trajectories in latent space is equivalent to discovering a useful but not necessarily
unique Hilbert Space structure for the latent space. This is how I see the forest from the trees.</p>
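The identity H(Y|X) = H(Y, X) − H(X) used above is easy to verify on any discrete joint distribution (numpy; the 2×3 joint below is arbitrary):

```python
import numpy as np

# An arbitrary joint distribution P(X, Y): rows index X, columns index Y
p_xy = np.array([[0.10, 0.20, 0.05],
                 [0.25, 0.15, 0.25]])

H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))   # Shannon entropy, bits

H_joint = H(p_xy)                                     # H(Y, X)
H_x = H(p_xy.sum(axis=1))                             # H(X)

# H(Y|X) computed directly as sum_x P(x) H(Y | X = x)
p_x = p_xy.sum(axis=1, keepdims=True)
H_cond = float(np.sum(p_x[:, 0] * np.array([H(row) for row in p_xy / p_x])))
```

The non-negativity of H_cond is what makes minimising it a sensible objective for function approximation.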
<p>Having motivated the importance of trajectories in latent space, I’d like to show how they explain why deep networks with <script type="math/tex">N > 1</script> layers
have increasingly non-convex learning behaviour as <script type="math/tex">N</script> becomes large.</p>
<h1 id="the-statistical-relation-between-linear-variations-in-parameter-space-theta-and-the-non-convexity-of-f_theta">The statistical relation between linear variations in parameter space <script type="math/tex">\theta</script> and the non-convexity of <script type="math/tex">F_\theta</script>:</h1>
<p>Let’s suppose <script type="math/tex">\phi_i</script> denotes a particular feature map in an <script type="math/tex">N</script> layer deep rectifier network:</p>
<p>\begin{equation}
\phi_i := T_{N} \circ T_{N-1} \circ … \circ T_1
\end{equation}</p>
<p>and let’s suppose <script type="math/tex">\hat{T}</script> denotes an affine mapping:</p>
<p>\begin{equation}
x \overset{\hat{T}}\mapsto Ax + b
\end{equation}</p>
<p>If we consider the scenario where <script type="math/tex">\forall x_0 \in \bar{X} \subset X, \phi_i(x_0) = \hat{T}(x_0)</script>, can we conclude that:</p>
<p>\begin{equation}
\phi_i \rvert_{x_0 \in \bar{X}} \equiv \hat{T} \rvert_{x_0 \in \bar{X}}
\end{equation}</p>
<p>Well, the answer is no, for two important reasons:</p>
<ol>
<li>First, we lose essential information about the trajectory of <script type="math/tex">x_0</script> in latent space due to the sequence of affine transformations in <script type="math/tex">\phi_i</script>.</li>
<li>Second, if we perturb the parameters of <script type="math/tex">\hat{T}</script> we will get a linear effect with probability 1.0, so <script type="math/tex">\hat{T}</script> is convex w.r.t. its parameters, whereas a slight perturbation of the parameters of <script type="math/tex">\phi_i</script> using backprop for example leads to a non-linear effect with probability tending to 1.0 as <script type="math/tex">N</script> becomes large.</li>
</ol>
<p>The second point implies that <script type="math/tex">\phi_i</script> is unlikely to be convex with respect to linear interpolations in parameter space <script type="math/tex">\theta</script> as <script type="math/tex">N</script> becomes large. Using Julia Lang,
I found this to be the case by constructing chains of Xavier-initialised 3x3 matrices and checking whether a weak version of the usual convexity inequality was satisfied:</p>
<p>\begin{equation}
\forall \theta_1, \theta_2 \in \theta, \forall t \in [0,1], \lVert T_{t \theta_1 + (1-t)\theta_2}(n) \rvert_{x \in X} \rVert_{\infty} \leq \lVert tT_{\theta_1}(n)\rvert_{x \in X} + (1-t)T_{\theta_2}(n)\rvert_{x \in X} \rVert_{\infty}
\end{equation}</p>
<p>where <script type="math/tex">n \in \mathbb{N}</script> and <script type="math/tex">T_\theta</script> is identified with a sequence of Xavier-initialised, i.e. random, affine transformations where the bias term was set to a constant vector of 0.1:</p>
<p>\begin{equation}
T_\theta := X_{n} \circ X_{n-1} \circ … \circ X_1
\end{equation}</p>
<p>and as expected the probability that this inequality was satisfied decreased quickly as the length of the sequence <script type="math/tex">n</script> was allowed to increase from <script type="math/tex">10</script> to <script type="math/tex">200</script>.</p>
<p><strong>Note:</strong> <a href="https://gist.github.com/AidanRocke/a0d8b16a3008ca0d7555500e8d597f43">The code is available as a gist on Github</a>.</p>
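The gist is in Julia; here is a rough Python re-creation of the experiment under stated assumptions (Xavier-like scale for 3×3 matrices, bias fixed to 0.1, interpolation at t = 1/2, small sample sizes), not a faithful port of the original code:

```python
import numpy as np

rng = np.random.default_rng(0)

def chain(matrices, x, bias=0.1):
    """Apply a sequence of affine maps x -> A x + 0.1, i.e. T_theta."""
    for A in matrices:
        x = A @ x + bias
    return x

def convexity_holds(n, t=0.5, n_inputs=20):
    """Weak convexity check in the sup norm for chains of length n."""
    scale = np.sqrt(2.0 / (3 + 3))                       # Xavier-style std
    th1 = [rng.normal(0, scale, (3, 3)) for _ in range(n)]
    th2 = [rng.normal(0, scale, (3, 3)) for _ in range(n)]
    mix = [t * A + (1 - t) * B for A, B in zip(th1, th2)]
    for _ in range(n_inputs):
        x = rng.uniform(-1, 1, 3)
        lhs = np.abs(chain(mix, x)).max()
        rhs = np.abs(t * chain(th1, x) + (1 - t) * chain(th2, x)).max()
        if lhs > rhs + 1e-12:
            return False
    return True

# Fraction of trials in which the inequality survives, per chain length
rates = {n: np.mean([convexity_holds(n) for _ in range(50)]) for n in (1, 10, 50)}
```

For n = 1 the chain is a single affine map, so the inequality holds identically; as n grows the empirical rate should fall, mirroring the behaviour reported in the Julia experiment.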
<h1 id="discussion">Discussion:</h1>
<p>I think that there is a false axiom embedded in this whole convex vs. non-convex discussion. We are somehow assuming that there exists a canonical parametrization of
neural networks and that this parametrization is linear. Yet, this extrinsic view merely reflects the observer’s Euclidean bias.</p>
<p>It makes more sense to take an intrinsic view and understand how the neural network re-parametrizes the latent space in order to make it linear with respect to the intrinsic geometry that is appropriate for the manifold on which <script type="math/tex">X</script> happens to reside. The learning process essentially involves approximating the structure of spatio-temporal signals.</p>
<p>In fact, this line of reflection leads me to the following conjecture concerning deep learning:</p>
<ol>
<li>Convexification: The early stages of learning involves exploring and selecting useful parametrizations from a large number of possible parametrizations of the space.</li>
<li>Optimisation: The later stages of learning involve fine-tuning a chosen parametrization.</li>
</ol>
<p>Furthermore, I suspect that these two learning regimes have different dynamics that can be analysed, and that such analysis is key to
developing a powerful theory of intrinsic geometry that can be applied to a variety of spatio-temporal signals. This will certainly lead to more powerful learning
algorithms and statistical models whose learning and inferential mechanisms we shall actually understand, since we would have a sequence of geometric
transformations that may lead to arbitrarily complex trajectories in latent space.</p>
<p>I’d like to end this discussion by noting that physics has historically made considerable progress largely by identifying suitable parametrizations
for general classes of natural signals. In the case of macroscopic motions through space Galilean reference frames were sufficiently general. For
General Relativity on the other hand, Riemannian geometry proved to be essential. But, that only became clear because Einstein, Lorentz and others
made an enormous effort to understand what was going on.</p>
<h1 id="references">References:</h1>
<ol>
<li>Montufar, G. et al. On the Number of linear Regions of Deep Neural Networks. 2014.</li>
<li>L. Zhang, G. Naitzat & Lek-Heng Lim. Tropical Geometry of Deep Neural Networks. 2018.</li>
</ol>Aidan RockeAn artist's rendition of trajectories in latent space(source: CERN)