Jekyll2020-07-01T12:35:01+00:00/feed.xmlKepler Lounge
The math journal of Aidan Rocke
An elementary derivation of the Singular Value Decomposition2020-06-30T00:00:00+00:002020-06-30T00:00:00+00:00/real/analysis/2020/06/30/SVD<h2 id="motivation">Motivation:</h2>
<p>The physical sciences have often made progress by identifying useful coordinate transformations. Whether we are using Euclidean
geometry for Newton’s calculus or Riemannian geometry for General Relativity, the objective is to find a useful and parsimonious
parametrization that allows us to disentangle the variables responsible for the data-generating process. In the past this required human ingenuity, but today, thanks to the invention of computers, we can run algorithms that allow us to discover useful
parametrizations for a given dataset.</p>
<p>If our data happens to be organised into rows and columns, like a matrix, one very powerful algorithm is the Singular Value
Decomposition. It is in some sense a data-driven Fourier Transform and the objective of this article is to provide an existence
proof that such decompositions are possible for rectangular matrices.</p>
<h2 id="derivation-of-the-singular-value-decomposition">Derivation of the Singular Value Decomposition:</h2>
<p>Given a matrix <script type="math/tex">X \in \mathbb{R}^{m \times n}</script> we may form the correlation matrices <script type="math/tex">X \cdot X^T \in \mathbb{R}^{m \times m}</script> and <script type="math/tex">X^T \cdot X \in \mathbb{R}^{n \times n}</script>. Now, since these correlation matrices are both square they both have
eigendecompositions and we may show that they have the same eigenvalues.</p>
<p>Indeed, let’s suppose:</p>
<p>\begin{equation}
\exists \vec{u} \in \mathbb{R}^{m},X \cdot X^T \cdot \vec{u} = \lambda \vec{u}
\end{equation}</p>
<p>then we have:</p>
<p>\begin{equation}
\vec{v} = X^T \vec{u} \in \mathbb{R}^n
\end{equation}</p>
<p>\begin{equation}
X^T \cdot X \cdot \vec{v} = X^T \cdot (X \cdot X^T \cdot \vec{u}) = \lambda X^T \cdot \vec{u} = \lambda \vec{v}
\end{equation}</p>
<p>so the eigenvectors of <script type="math/tex">X \cdot X^T</script> and <script type="math/tex">X^T \cdot X</script> are mapped onto each other by <script type="math/tex">X^T</script> and <script type="math/tex">X</script>, and we may deduce that
they have the following eigendecompositions (assuming <script type="math/tex">m \leq n</script>, so that all <script type="math/tex">m</script> eigenvalues are shared):</p>
<p>\begin{equation}
\exists V \in \mathbb{R}^{n \times m}, X^T \cdot X \cdot V = V \cdot \text{diag}(\vec{\lambda})
\end{equation}</p>
<p>\begin{equation}
\exists U \in \mathbb{R}^{m \times m}, X \cdot X^T \cdot U = U \cdot \text{diag}(\vec{\lambda})
\end{equation}</p>
<p>so <script type="math/tex">\text{diag}(\vec{\lambda}) \in \mathbb{R}^{m \times m}</script>.</p>
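<p>As a quick sanity check (not part of the derivation), the eigenvalue-sharing claim can be verified numerically; a minimal sketch in Python with NumPy, using an arbitrary random matrix:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
X = rng.standard_normal((m, n))

# eigenvalues of the two correlation matrices
eig_small = np.linalg.eigvalsh(X @ X.T)   # m of them
eig_large = np.linalg.eigvalsh(X.T @ X)   # n of them

# the m largest eigenvalues of X^T X match those of X X^T;
# the remaining n - m eigenvalues are zero
assert np.allclose(np.sort(eig_small), np.sort(eig_large)[-m:])
assert np.allclose(np.sort(eig_large)[: n - m], 0.0)
```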
<p>Furthermore, we may show that all the eigenvalues <script type="math/tex">\lambda_i</script> are non-negative since <script type="math/tex">X^T \cdot X</script> is positive-semidefinite:</p>
<p>\begin{equation}
\forall z \in \mathbb{R}^n, z^T (X^T \cdot X) z = (Xz)^T (Xz) = \lVert Xz \rVert^2 \geq 0
\end{equation}</p>
<p>and from this it follows that <script type="math/tex">\text{diag}(\vec{\lambda})= \Sigma^2</script> for some diagonal matrix <script type="math/tex">\Sigma \in \mathbb{R_{+}}^{m \times m}</script> with entries <script type="math/tex">\sigma_i = \sqrt{\lambda_i}</script>.</p>
<p>Now, we may note that <script type="math/tex">X \cdot X^T</script> is real and symmetric so we have:</p>
<p>\begin{equation}
X \cdot X^T = U \Sigma^2 U^{-1} = (U \Sigma^2 U^{-1})^T \implies U^T = U^{-1}
\end{equation}</p>
<p>and so we may deduce that <script type="math/tex">U</script> is an orthogonal matrix; applying the same argument to <script type="math/tex">X^T \cdot X</script>, the columns of <script type="math/tex">V</script> are orthonormal.</p>
<p>Combining these results, we have:</p>
<p>\begin{equation}
X \cdot X^T = U \Sigma^2 U^T = (U \Sigma V^T) \cdot (V \Sigma U^T) = (U \Sigma V^T) \cdot (U \Sigma V^T)^T
\end{equation}</p>
<p>Finally, we claim that <script type="math/tex">X = U \Sigma V^T</script> since by (1) and (2), taking <script type="math/tex">\vec{u_i}</script> and <script type="math/tex">\vec{v_i}</script> to be unit eigenvectors, we have:</p>
<p>\begin{equation}
X \cdot \vec{v_i} = \sigma_i \vec{u_i} \implies \lVert X \vec{v_i} \rVert^2 = \sigma_i^2 = \lambda_i
\end{equation}</p>
<p>which implies:</p>
<p>\begin{equation}
\vec{u_i}^T \cdot X \cdot \vec{v_i} = \sigma_i
\end{equation}</p>
<p>so in matrix form we have:</p>
<p>\begin{equation}
U^T \cdot X \cdot V = \Sigma \implies X = U \Sigma V^T
\end{equation}</p>
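<p>The construction above can be sanity-checked numerically; a minimal sketch in Python (NumPy), building <script type="math/tex">U</script>, <script type="math/tex">\Sigma</script> and <script type="math/tex">V</script> from the eigendecomposition of <script type="math/tex">X \cdot X^T</script> exactly as in the derivation, assuming <script type="math/tex">m \leq n</script>:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5                               # m <= n, as assumed above
X = rng.standard_normal((m, n))

# eigendecomposition of X X^T gives U and the eigenvalues lambda_i
lam, U = np.linalg.eigh(X @ X.T)
order = np.argsort(lam)[::-1]             # sort eigenvalues in decreasing order
lam, U = lam[order], U[:, order]
sigma = np.sqrt(lam)                      # singular values

# v_i = X^T u_i / sigma_i are unit eigenvectors of X^T X
V = (X.T @ U) / sigma

assert np.allclose(U @ np.diag(sigma) @ V.T, X)   # X = U Sigma V^T
assert np.allclose(V.T @ V, np.eye(m))            # orthonormal columns
```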
<p>and this concludes our existence proof that a singular value decomposition exists for rectangular matrices.</p>Aidan RockeModelling covid-19 infection risk for rational covid-19 test allocation2020-06-13T00:00:00+00:002020-06-13T00:00:00+00:00/epidemics/2020/06/13/infection-risk<h2 id="motivation">Motivation:</h2>
<p>To understand the potential effectiveness of a peer-to-peer not-a-tracing(NAT) system against an unknown pathogen,
we may consider how a data-driven approach to defining an infection risk function <script type="math/tex">\mathcal{R}</script> at each node
of a graph may allow us to allocate limited testing capacity in a rational manner. Key challenges shall become clear after we have analysed the structure of our problem.</p>
<p>Initially, it appears that the main challenge involves finding a reliable definition of <script type="math/tex">\mathcal{R}</script> when we are in the low-data regime.</p>
<h2 id="graph-structure-and-state-space">Graph structure and state space:</h2>
<p>We shall assume that our graph is a small world network sampled from an epidemiological state space:</p>
<p>a. Each node <script type="math/tex">v_i \in V</script> represents an individual whose set of neighbours(i.e. physical contacts) may be represented by <script type="math/tex">\mathcal{N}(v_i) \subset V</script>.</p>
<p>b. We shall assume that at any instant a particular node is either susceptible or infected, so for a graph with <script type="math/tex">N</script> nodes
there are <script type="math/tex">2^N</script> possible states.</p>
<h2 id="data-collection">Data collection:</h2>
<p>We shall assume that individuals in this social network use a NAT app where:</p>
<p>a. They log clinically relevant phenotypic data(aka phenotypic space): age, biological sex, pre-existing medical conditions</p>
<p>b. Symptoms(aka symptom space): cough, sore throat, temperature, sense of smell</p>
<p>We may assume that symptoms shall be logged on a daily basis.</p>
<h2 id="modelling-infection-risk-using-machine-learning">Modelling infection risk using machine learning:</h2>
<p>Given that test kits are limited we need a method for prioritising test allocation. This may be done using a model of infection
risk. Mathematically, in order to determine whether a particular individual should take a test we use a parametric risk function:</p>
<p>\begin{equation}
\mathcal{R}(\theta): \mathbb{R}^l \times \mathbb{R}^{d} \rightarrow [0,1]
\end{equation}</p>
<p>where <script type="math/tex">d</script> is the dimension of the symptom space, and <script type="math/tex">l</script> is the dimension
of the phenotypic space. Furthermore, for each vertex <script type="math/tex">v_i \in V</script> there is a feature map <script type="math/tex">F</script> such that <script type="math/tex">F(v_i) \in \mathbb{R}^l \times \mathbb{R}^{d}</script>.</p>
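<p>As an illustration, and purely a hypothetical sketch (the post does not commit to a model class), <script type="math/tex">\mathcal{R}(\theta)</script> could be as simple as a logistic model on the concatenated phenotypic and symptom features:</p>

```python
import numpy as np

def risk(theta, phenotype, symptoms):
    """Hypothetical parametric risk function R(theta): R^l x R^d -> [0, 1],
    sketched as a logistic model on the concatenated feature map F(v_i)."""
    features = np.concatenate([phenotype, symptoms])
    logit = theta[:-1] @ features + theta[-1]   # weights plus a bias term
    return 1.0 / (1.0 + np.exp(-logit))

# l = 3 phenotypic features, d = 4 symptom features (arbitrary choices)
theta = np.zeros(8)                             # 7 weights + 1 bias
r = risk(theta, np.ones(3), np.ones(4))
```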
<h2 id="mathcalrtheta-in-the-low-data-regime"><script type="math/tex">\mathcal{R}(\theta)</script> in the low-data regime:</h2>
<p>In the low-data regime, learning is unstable and there are no convergence guarantees for <script type="math/tex">\theta</script>. However, we still need
a reliable definition of <script type="math/tex">\mathcal{R}(\theta)</script>. In such a regime it may be possible to have <script type="math/tex">\mathcal{R}(\theta)</script> either hard-coded by a team of experts or we may use a function <script type="math/tex">\hat{\mathcal{R}}(\theta)</script> that is pre-trained on data with similar properties.</p>
<p>I must add that in this regime we don’t do any learning, just function evaluations, or what some in the machine learning community would call inference.</p>
<h2 id="discussion">Discussion:</h2>
<p>In the large-data regime, we can use some form of privacy-preserving machine learning. However, I think it makes sense to
first focus on the low-data regime problem. I suspect that the authors of [3] might have a reasonable candidate for <script type="math/tex">\hat{\mathcal{R}}(\theta)</script>.</p>
<p>Finally, I’d like to add that if <script type="math/tex">\hat{\mathcal{R}}(\theta)</script> is good enough then not only is machine learning unnecessary, but
<script type="math/tex">\hat{\mathcal{R}}(\theta)</script> may also serve as a proxy measure for test outcomes.</p>
<h2 id="references">References:</h2>
<ol>
<li>
<p>Alexander A. Alemi, Matthew Bierbaum, Christopher R. Myers, James P. Sethna. You Can Run, You Can Hide: The Epidemiology and Statistical Mechanics of Zombies. Arxiv. 2015.</p>
</li>
<li>
<p>Jussi Taipale, Paul Romer, Sten Linnarsson. Population-scale testing can suppress the spread of covid-19. medrxiv. 2020.</p>
</li>
<li>
<p>Hagai Rossman, Ayya Keshet, Smadar Shilo, Amir Gavrieli, Tal Bauman, Ori Cohen, Esti Shelly, Ran Balicer, Benjamin Geiger, Yuval Dor & Eran Segal. A framework for identifying regional outbreak and spread of COVID-19 from one-minute population-wide surveys. Nature. 2020.</p>
</li>
</ol>Aidan RockeLocal deformations of small world networks during a zombie outbreak2020-05-31T00:00:00+00:002020-05-31T00:00:00+00:00/epidemics/2020/05/31/zombie-pandemic<h2 id="motivation">Motivation:</h2>
<p>To understand the potential effectiveness of a peer-to-peer not-a-tracing system against an unknown pathogen,
in this case zombies, we may consider its effect on scientists residing at a technology park. I believe
this is a potentially useful thought experiment for sub-populations that can choose who they interact with.
This would include people that can work from home.</p>
<p>We shall assume that:</p>
<ol>
<li>There is a population of N=500 people.</li>
<li>1% of the population is infected i.e. 5 scientists.</li>
<li>The reproduction number, <script type="math/tex">R_0</script>, is greater than one.</li>
<li>The incubation period is somewhere between one to two days.</li>
<li>The social graph is assumed to be a small-world network so given two distinct vertices <script type="math/tex">v_i</script> and <script type="math/tex">v_j</script>:</li>
</ol>
<p>\begin{equation}
\mathbb{E}[d(v_i,v_j)] \leq \log_{10} N = \log_{10} 500 \approx 2.7
\end{equation}</p>
<p>Furthermore, we shall assume that each scientist has an ‘R U a zombie’ app on their smartphone that is powered
by Google AI and collects data on symptoms such as anosmia on a daily basis.</p>
<h2 id="the-social-network-graph-approximates-the-transmission-network">The social network graph approximates the transmission network:</h2>
<p>When ‘R U a zombie’, with its advanced machine learning software, identifies that a particular scientist is at risk
of being infected, they and their friends are encouraged to self-isolate. The system operates entirely through daily self-reporting, and information is transferred in a peer-to-peer manner without a centralised database. This should work in principle because the social network approximates the transmission network.</p>
<p>It’s worth adding that during a zombie pandemic individuals naturally minimise their risk surface area so they mainly
socialise with friends, if at all.</p>
<h2 id="invertible-surgical-operations-on-small-world-networks">Invertible surgical operations on small-world networks:</h2>
<p>Given that <script type="math/tex">R_0 > 1</script> and the average path length between nodes is less than 3.0, the health consequences of this outbreak
would quickly explode if not for preventive measures that anticipate future zombie infections.</p>
<p>If <script type="math/tex">v_i</script> is infected, a simple and effective mechanism for reducing the spread of infection would be to modify the network <script type="math/tex">G=(V,E)</script> by removing vertices from the vertex set <script type="math/tex">V</script> as follows:</p>
<p>\begin{equation}
S \subset \{v_i\} \cup N(v_i) \cup \bigcup_{v \in N(v_i)} N(v)
\end{equation}</p>
<p>\begin{equation}
V \mapsto V \setminus S
\end{equation}</p>
<p>\begin{equation}
V^* := V \setminus S
\end{equation}</p>
<p>and we note that the residual graph <script type="math/tex">G^* =(V^*,E^*)</script> is still a small-world network since the operations are local.</p>
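<p>These local surgical operations can be sketched on an adjacency-list representation of <script type="math/tex">G</script>; the following is a minimal hypothetical model in Python, where the two-hop choice of <script type="math/tex">S</script> follows the equations above:</p>

```python
def quarantine_set(graph, v):
    """S = {v} union N(v) union the two-hop neighbourhood, as above."""
    one_hop = set(graph[v])
    two_hop = set()
    for u in one_hop:
        two_hop |= set(graph[u])
    return {v} | one_hop | two_hop

def remove_vertices(graph, S):
    """Residual graph G* = (V minus S, E*): drop S and all incident edges."""
    return {u: [w for w in nbrs if w not in S]
            for u, nbrs in graph.items() if u not in S}

# toy path graph 0 - 1 - 2 - 3 - 4
G = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
S = quarantine_set(G, 1)          # two hops from vertex 1
G_star = remove_vertices(G, S)
```

Restoration amounts to merging the removed adjacency lists back in once members of <script type="math/tex">S</script> recover, which makes the operation invertible as noted below.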
<p>We also note that when members of <script type="math/tex">S</script> recover from infection the operation is reversible so we have:</p>
<p>\begin{equation}
V^* \mapsto V^* \cup S
\end{equation}</p>
<p>\begin{equation}
V := V^* \cup S
\end{equation}</p>
<p>and these operations should be sufficient to halt outbreaks provided that the false negative rate of ‘R U a zombie’ is close
to zero and <script type="math/tex">S</script> is carefully chosen.</p>
<p>A simple and robust algorithm for choosing <script type="math/tex">S</script> is as follows:</p>
<ol>
<li>
<p>If ‘R U a zombie’ determines that <script type="math/tex">v_i</script> is at risk of being zombie-positive given their symptoms, a message is broadcast
to <script type="math/tex">N(v_i)</script> to self-isolate for a week.</p>
</li>
<li>
<p><script type="math/tex">\{v_i\} \cup N(v_i)</script> are then tested for infection on every day of self-isolation.</p>
</li>
<li>
<p>If any of <script type="math/tex">\{v_i\} \cup N(v_i)</script> are zombie-positive then they are to be tested every two days until they are zombie-negative.
Now, <script type="math/tex">\bigcup_{v \in N(v_i)} N(v)</script> are also at risk of being zombie-positive so they may proceed with step (1).</p>
</li>
</ol>
<h2 id="asymptotic-analysis-of-susceptible-infected-and-recovered-populations">Asymptotic analysis of susceptible, infected and recovered populations:</h2>
<p>The rationale for the previous section is quite simple. Let’s assume that at time <script type="math/tex">t</script> we have:</p>
<p>\begin{equation}
S(t) = \text{number of susceptible scientists}
\end{equation}</p>
<p>\begin{equation}
I(t) = \text{number of infected scientists}
\end{equation}</p>
<p>\begin{equation}
R(t) = \text{number of recovered scientists}
\end{equation}</p>
<p>Assuming that the zombie virus is non-lethal:</p>
<p>\begin{equation}
\forall t, N = S(t) + I(t) + R(t)
\end{equation}</p>
<p>and the effect of ‘R U a zombie’ is to guarantee that <script type="math/tex">I(t)</script> is a non-increasing function of time so:</p>
<p>\begin{equation}
\frac{dI}{dt} \leq 0
\end{equation}</p>
<p>and in principle <script type="math/tex">\lim_{t \rightarrow \infty} I(t) = 0</script>. So if the outbreak is identified early enough
its spread may be halted relatively quickly at minimal cost.</p>
<h2 id="discussion">Discussion:</h2>
<p>It is worth noting that in order to incentivise adoption of an app like ‘R U a zombie’ we need to assure the population
that the system will work. In practice, this requires three things:</p>
<ol>
<li>
<p>Effective daily monitoring of symptoms using a machine learning system such as the online survey proposed in [3].</p>
</li>
<li>
<p>A system for large-scale testing with results within 48 hours.</p>
</li>
<li>
<p>Paid sick leave during self-isolation.</p>
</li>
</ol>
<p>It is worth noting that in most European countries we don’t even have two out of three. Small and medium-sized businesses
such as restaurants don’t have guaranteed incomes at all. As it is, occidental democracies aren’t even able to cater to the population that can afford to work from home, which represents the most privileged sub-population.</p>
<p>Having said this, each of those three points I have noted corresponds to an engineering
problem that may be solved so there is hope.</p>
<h2 id="references">References:</h2>
<ol>
<li>
<p>Alexander A. Alemi, Matthew Bierbaum, Christopher R. Myers, James P. Sethna. You Can Run, You Can Hide: The Epidemiology and Statistical Mechanics of Zombies. Arxiv. 2015.</p>
</li>
<li>
<p>Jussi Taipale, Paul Romer, Sten Linnarsson. Population-scale testing can suppress the spread of covid-19. medrxiv. 2020.</p>
</li>
<li>
<p>Hagai Rossman, Ayya Keshet, Smadar Shilo, Amir Gavrieli, Tal Bauman, Ori Cohen, Esti Shelly, Ran Balicer, Benjamin Geiger, Yuval Dor & Eran Segal. A framework for identifying regional outbreak and spread of COVID-19 from one-minute population-wide surveys. Nature. 2020.</p>
</li>
</ol>Aidan RockeThe parallel postulate, constructive mathematics and the relativity of mathematical truth2020-05-21T00:00:00+00:002020-05-21T00:00:00+00:00/mathematics/2020/05/21/parallel-postulate<h2 id="motivation">Motivation:</h2>
<p>Many scientists, including a large number of mathematicians, hold the mathematical sciences in high regard due to
our ability to derive profound truths from a small number of postulates. This is consistent with the belief that
mathematical insights, such as those in Newton’s Principia, are discovered rather than made.</p>
<p>However, I think we should revisit this common assumption that mathematical axiom systems
are not free constructions of the human mind.</p>
<h2 id="euclidean-geometry-and-euclids-parallel-postulate">Euclidean geometry and Euclid’s parallel postulate:</h2>
<p>Euclidean geometry, being the geometry most scientists are familiar with, is often considered by many to be above scrutiny.
But, as this geometry is reducible to <a href="https://mathworld.wolfram.com/EuclidsPostulates.html">five axioms</a> it may be worth a closer analysis:</p>
<ol>
<li>A straight line segment may be drawn joining two points.</li>
<li>Any straight line segment may be extended indefinitely in a straight line.</li>
<li>Given any straight line segment, a circle can be drawn having the segment as radius and one endpoint as center.</li>
<li>All right angles are congruent.</li>
<li>Parallel postulate: In a plane, through a point not on a given straight line, at most one line can be drawn that
never meets the given line.</li>
</ol>
<p>However, in the early 1800s Bolyai and Lobachevsky independently discovered non-Euclidean geometry, where the parallel
postulate does not hold. Other mathematicians, such as Riemann, built upon these findings, which paved the way for
Einstein’s theory of relativity.</p>
<h2 id="the-relativity-of-mathematical-truth">The relativity of mathematical truth:</h2>
<p>As the reader reflects on the significance of the parallel postulate, it is worth noting that, in principle, any science may be
distilled to a finite set of axioms that are necessary and sufficient for deriving its theorems. So we may argue that the axiomatic method is
inevitable and that any scientific truth is relative to a particular choice of axioms that have withstood the test
of numerous experiments. Now, mathematics being one science among others, does the relativity of mathematical truth
mean that it is arbitrary?</p>
<p>Not in the least. Any axiom system is ultimately chosen for reasons of convenience.</p>
<h2 id="the-case-for-constructive-mathematics">The case for constructive mathematics:</h2>
<p>We have reached the stage where we may assert that the mathematical sciences and their corresponding axiom systems
are in some sense free constructions of the human mind. But, we may also assert that they are convenient. How
can we make these ideas precise?</p>
<p>A set of axioms is in some sense the minimum description length of a scientific theory. If <script type="math/tex">\mathcal{A}</script> is our
axiom system and <script type="math/tex">T</script> is our theory then we have:</p>
<p>\begin{equation}
K(T):= \min_{\mathcal{A}} \{l(\mathcal{A}):U(\mathcal{A})=T\}
\end{equation}</p>
<p>where <script type="math/tex">U</script> is a universal Turing machine and <script type="math/tex">K</script> denotes Kolmogorov complexity. It follows that choosing a set of
axioms amounts to a problem in model selection.</p>
<h2 id="discussion">Discussion:</h2>
<p>At this point the reader may ascertain that constructing axiom systems is a form of scientific model building and therefore a creative enterprise. The reader may also be inclined to believe that axiom systems are the ultimate representation of a scientific theory and therefore the main activity of theoretical science is reducible to the construction of useful axiom systems. However, this would be a step too far.</p>
<p>With a scientific axiom system we can construct all possible theorems that are true relative to that set of axioms but how many of these theorems are actually interesting? It is worth noting that scientific intuition, the human process for sampling from the space of possible theorems, escapes the deductive process. I think this is a strong indication that any axiomatisation is necessarily incomplete.</p>
<p>But, what then is scientific intuition? That perhaps is a subject for another day.</p>
<h2 id="references">References:</h2>
<ol>
<li>
<p>Poincaré, Henri. La Science et l’Hypothèse. Champs Sciences. 1902.</p>
</li>
<li>
<p>Weisstein, Eric. “Euclid’s Postulates.” From MathWorld–A Wolfram Web Resource. https://mathworld.wolfram.com/EuclidsPostulates.html</p>
</li>
<li>
<p>Marcus Hutter (2007) Algorithmic information theory. Scholarpedia, 2(3):2519.</p>
</li>
</ol>Aidan RockeA note on the covid-19 spike protein2020-04-29T00:00:00+00:002020-04-29T00:00:00+00:00/biology/2020/04/29/spike-protein-I<p><strong>Disclaimer:</strong> I am not a protein physicist, and what follows is simply a collection of facts on
the role of the spike(S) protein in covid-19 infection driven by my curiosity about how to approach
modelling and controlling covid-19 infection.</p>
<h2 id="a-few-useful-pieces-of-information">A few useful pieces of information:</h2>
<ol>
<li>
<p>Coronaviruses derive their name from the spike(S) protein which gives them their spiky structure.</p>
</li>
<li>
<p>The spike serves two purposes:</p>
<p>a. <em>Attachment</em> at host cell surface.</p>
<p>b. <em>Entry</em> at the host cell membrane, which allows infection to begin.</p>
</li>
<li>
<p>The spike is also the main target of antibodies so it is a key focus of vaccine design.</p>
</li>
<li>
<p>Scientists have found structural evidence that covid-19 S protein binds ACE2(an enzyme) with high affinity
where the presence of ACE2 normally helps regulate blood pressure.</p>
</li>
<li>
<p>Analysis of epithelial cells in the respiratory system reveals that nasal epithelial cells, specifically
goblet and ciliated cells, display the highest ACE2 expression of all epithelial cells analysed. So nasal
epithelial cells are a key point of entry.</p>
</li>
</ol>
<p>Besides nasal epithelial cells, ACE2 is also found in cells in the cornea of the eye and intestine linings [1].
So these represent other routes of infection, and this motivates the creation of a <a href="https://www.covid19cellatlas.org/">cell atlas for covid-19</a>.</p>
<h2 id="discussion">Discussion:</h2>
<p>If we focus on the local interaction of the covid-19 spike protein with ACE2, then we may try modelling the
dynamic behaviour of the covid-19 spike protein. Some researchers at Max Planck are <a href="https://www.mpg.de/14657720/corona-spike-protein">actively working on this</a> using biophysical models running on supercomputers.</p>
<p>However, if we want to model the interactions of covid-19 with the human body i.e. the progression of infection
…I think this is a much more complex task. For this we may need machine learning models as pursued in [4].</p>
<h2 id="references">References:</h2>
<ol>
<li>
<p>Sungnak, W., Huang, N., Bécavin, C. et al. SARS-CoV-2 entry factors are highly expressed in nasal epithelial cells together with innate immune genes. Nat Med (2020). https://doi.org/10.1038/s41591-020-0868-6</p>
</li>
<li>
<p>Alexandra C.Walls, Young-Jun Park, Alejandra Tortorici, Abigail Wall, Andrew T.McGuire, David Veesler. Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein. Cell. 2020.</p>
</li>
<li>
<p>Ratul Chowdhury & Costas D. Maranas. Biophysical characterization of the SARS-CoV-2 spike protein binding with the ACE2
receptor explains increased COVID-19 pathogenesis. biorxiv. 2020.</p>
</li>
<li>
<p>Wanglong Gou, Yuanqing Fu, Liang Yue, Geng-dong Chen, Xue Cai, Menglei Shuai, Fengzhe Xu, Xiao Yi, Hao Chen, Yi Judy Zhu, Mian-li Xiao, Zengliang Jiang, Zelei Miao, Congmei Xiao, Bo Shen, Xiaomai Wu, Haihong Zhao, Wenhua Ling, Jun Wang, Yu-ming Chen, Tiannan Guo, Ju-Sheng Zheng. Gut microbiota may underlie the predisposition of healthy individuals to COVID-19. 2020.</p>
</li>
</ol>Aidan RockeDerivation of the complex-step method from the Cauchy Integral Formula for derivatives2020-02-28T00:00:00+00:002020-02-28T00:00:00+00:00/applied-math/2020/02/28/CIF-autodiff<h2 id="introduction">Introduction:</h2>
<p>Less than two months ago I had an exchange with <a href="http://www.bcl.hamilton.ie/~barak/">Barak Pearlmutter</a>, a neuroscientist who has made seminal contributions to the field of automatic differentiation, on the relation between Cauchy’s Integral Formula for derivatives and forward mode automatic differentiation. In particular, he mentioned that if we collapse the contour to a point we obtain the complex-step method which is in some sense equivalent to forward-mode algorithmic differentiation [3].</p>
<p>Therefore the objective of this article is to derive the complex-step method from the Cauchy Integral Formula for derivatives and thereby establish a correspondence between the latter and forward-mode algorithmic differentiation.</p>
<h2 id="problem">Problem:</h2>
<p>The complex derivative for a holomorphic function in one complex variable <script type="math/tex">f: A \rightarrow \mathbb{C}</script> is given by:</p>
<p>\begin{equation}
\forall z_0 \in A,f’(z_0)=\lim_{z \to z_0} \frac{f(z)-f(z_0)}{z-z_0}
\end{equation}</p>
<p>and using Cauchy’s Integral Formula we also have:</p>
<p>\begin{equation}
\forall z_0 \in A,f’(z_0)= \frac{1}{2\pi i} \int_{\gamma} \frac{f(z)}{(z-z_0)^2} dz
\end{equation}</p>
<p>where <script type="math/tex">\gamma</script> is a simple closed piecewise smooth and positively oriented curve in <script type="math/tex">A</script> where <script type="math/tex">z_0 \in \text{Int}({\gamma})</script>.</p>
<p>It is possible to derive (2) from (1) using Cauchy’s Integral Formula without too much difficulty and we can also derive the complex-step derivative approximation from (1) as follows:</p>
<p>Let’s suppose <script type="math/tex">z_0:=x \in \mathbb{R}</script>. Since the limit in (1) may be taken along any path in the complex plane, we may approach <script type="math/tex">x</script> along the imaginary direction:</p>
<p>\begin{equation}
f’(x) = \lim_{z \to x} \frac{f(z)-f(x)}{z-x} = \lim_{h \to 0} \frac{f(x+ih)-f(x)}{(x+ih)-x} = \lim_{h \to 0} \frac{f(x+ih)-f(x)}{ih}
\end{equation}</p>
<p>Now, if we compute the Taylor series of <script type="math/tex">f(x+ih)</script> we have:</p>
<p>\begin{equation}
f(x+ih)=f(x) + f’(x)(ih)-\frac{h^2 f^{(2)}(x)}{2} + …
\end{equation}</p>
<p>and so (3) simplifies to:</p>
<p>\begin{equation}
f’(x) = \lim_{h \to 0} \frac{\text{Im}(f(x+ih))}{h}
\end{equation}</p>
<p>which is the complex-step method as introduced in [3]. Now, the question is how to reach (5) from (2) by ‘collapsing’ the contour <script type="math/tex">\gamma</script> to a point.</p>
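<p>In practice the complex-step method in (5) is a one-liner; a sketch in Python, where the step size <script type="math/tex">h</script> can be taken far below machine epsilon because, unlike finite differences, no subtractive cancellation occurs:</p>

```python
import cmath

def complex_step(f, x, h=1e-20):
    """Complex-step derivative approximation: f'(x) ~ Im(f(x + ih)) / h.
    f must accept complex arguments."""
    return f(complex(x, h)).imag / h

# example: d/dx [exp(x) * sin(x)] = exp(x) * (sin(x) + cos(x))
d = complex_step(lambda z: cmath.exp(z) * cmath.sin(z), 1.0)
```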
<h2 id="solution">Solution:</h2>
<p>By defining the contour in (2) as a circle of radius <script type="math/tex">r > 0</script> centred at <script type="math/tex">x \in \mathbb{R}</script>:</p>
<p>\begin{equation}
\gamma = x+ U_r = \{x+r\cdot e^{i \theta}: \theta \in [0,2\pi]\} \subset A
\end{equation}</p>
<p>the Cauchy Integral Formula for derivatives, i.e. equation (2), simplifies to:</p>
<p>\begin{equation}
f’(x)= \frac{1}{2\pi}\int_{0}^{2\pi} \frac{f(x+r\cdot e^{i\theta})}{r\cdot e^{i\theta}} d\theta
\end{equation}</p>
<p>Due to the symmetries of <script type="math/tex">U_r</script>, <script type="math/tex">z \in U_r \iff -z \in U_r</script>, so if we take the limit as <script type="math/tex">r \rightarrow 0</script> and use the Taylor series expansion of <script type="math/tex">f</script> at <script type="math/tex">x</script>, (7) becomes:</p>
<p>\begin{equation}
f’(x)= \lim_{r \to 0} \frac{1}{2\pi}\int_{0}^{2\pi} \frac{f(x) + f’(x) \cdot (r\cdot e^{i\theta})+…}{r\cdot e^{i\theta}} d\theta = \lim_{r \to 0} \frac{\text{Im}(f(x+ir))}{r}
\end{equation}</p>
<p>where to obtain the RHS we used the fact that the <script type="math/tex">\frac{f(x)}{r\cdot e^{i\theta}}</script> term integrates to zero, since <script type="math/tex">\int_{0}^{2\pi} e^{-i\theta} d\theta = 0</script>, and path-independence to choose <script type="math/tex">\theta = \frac{\pi}{2}</script> so <script type="math/tex">r\cdot e^{i\frac{\pi}{2}}=ir</script>.</p>
<h2 id="references">References:</h2>
<ol>
<li>J.N. Lyness & C.B. Moler. NUMERICAL DIFFERENTIATION OF ANALYTIC FUNCTIONS. SIAM Journal of Numerical Analysis. 1967.</li>
<li>William Squire & George Trapp. USING COMPLEX VARIABLES TO ESTIMATE DERIVATIVES OF REAL FUNCTIONS. SIAM review. 1998.</li>
<li>Joaquim Martins, Peter Sturdza, and Juan J. Alonso. THE CONNECTION BETWEEN THE COMPLEX-STEP DERIVATIVE APPROXIMATION AND ALGORITHMIC DIFFERENTIATION. American Institute of Aeronautics and Astronautics. 2001.</li>
</ol>Aidan RockeDifferentiable approximations to the min and max operators2020-02-13T00:00:00+00:002020-02-13T00:00:00+00:00/applied-math/2020/02/13/analytic-min-max<h2 id="motivation">Motivation:</h2>
<p>Within the context of optimisation, differentiable approximations of the <script type="math/tex">\min</script> and <script type="math/tex">\max</script> operators on <script type="math/tex">\mathbb{R}^n</script>
are very useful. In particular, I am interested in analytical approximations <script type="math/tex">f_N,g_N \in C^{\infty}</script>:</p>
<p>\begin{equation}
f_N: \mathbb{R}^n \rightarrow \mathbb{R}
\end{equation}</p>
<p>\begin{equation}
g_N: \mathbb{R}^n \rightarrow \mathbb{R}
\end{equation}</p>
<p>where <script type="math/tex">\forall X \in \mathbb{R}^n \forall \epsilon > 0 \exists N \in \mathbb{N}</script>:</p>
<p>\begin{equation}
\forall m > N, \max(\lvert f_m(X)-\min(X) \rvert, \lvert g_m(X)-\max(X) \rvert) < \epsilon
\end{equation}</p>
<p>I found a few proposed solutions to a <a href="https://mathoverflow.net/questions/35191/a-differentiable-approximation-to-the-minimum-function?noredirect=1&lq=1">related question on MathOverflow</a> but when I tested these methods I found
that none are numerically stable with respect to the relative error <script type="math/tex">\frac{\lvert \delta x \rvert}{\lvert x \rvert}</script>.</p>
<p>This motivated me to come up with a numerically stable solution inspired by properties of the infinity norm:</p>
<p>\begin{equation}
\forall X \in \mathbb{R_{+}}^n, \lVert X \rVert_{\infty} = \max(X) = \bar{x} \in \mathbb{R}
\end{equation}</p>
<p>where the reason for (4) is that <script type="math/tex">\forall X \in \mathbb{R_{+}}^n</script> where <script type="math/tex">\max(X) \neq 0</script>:</p>
<p>\begin{equation}
\lVert X \rVert_{\infty} = \lim_{N \to \infty} \Big(\sum_{i=1}^n x_i^N\Big)^{\frac{1}{N}} = \lim_{N \to \infty} \bar{x} \cdot \Big(\sum_{i=1}^n \big(\frac{x_i}{\bar{x}}\big)^N\Big)^{\frac{1}{N}} = \bar{x}
\end{equation}</p>
<p>which may be used to define analytic approximations to both the min and max operators on <script type="math/tex">\mathbb{R}^n</script> provided that we use an order-preserving transformation
from <script type="math/tex">\mathbb{R}^n</script> to <script type="math/tex">\mathbb{R_{+}}^n</script>.</p>
<h2 id="analysis-of-several-proposed-solutions">Analysis of several proposed solutions:</h2>
<p>Without loss of generality, let’s focus on approximations to the <script type="math/tex">\max</script> operator. In total, I analysed three different analytic approximations to this operator.
Two which were taken from the MathOverflow [2] and the <em>smooth max</em>, which has its own Wikipedia page [3].</p>
<p>I actually encourage the interested reader to experiment with the following methods which are all correct in principle but all susceptible to overflow errors:</p>
<h3 id="the-generalised-mean">The generalised mean:</h3>
<p>\begin{equation}
\forall X \in \mathbb{R_+}^n \forall N \in \mathbb{N}, GM(X,N)= \Big(\frac{1}{n}\sum_{i=1}^n x_i^N\Big)^{\frac{1}{N}}
\end{equation}</p>
<p>converges to <script type="math/tex">\max(X), \forall X \in \mathbb{R_+}^n</script> since:</p>
<p>\begin{equation}
\lim_{N \to \infty} \Big(\frac{1}{n} \sum_{i=1}^n x_i^N \Big)^{\frac{1}{N}} = \lim_{N \to \infty} \bar{x} \cdot \Big(\frac{1}{n} \sum_{i=1}^n \big(\frac{x_i}{\bar{x}}\big)^N \Big)^{\frac{1}{N}}
\end{equation}</p>
<p>where</p>
<p>\begin{equation}
n^{-\frac{1}{N}} \leq \Big(\frac{1}{n} \sum_{i=1}^n \big(\frac{x_i}{\bar{x}}\big)^N\Big)^{\frac{1}{N}} \leq 1
\end{equation}</p>
<p>In Julia this may be implemented as follows:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> GM</span><span class="x">(</span><span class="n">X</span><span class="o">::</span><span class="kt">Array</span><span class="x">{</span><span class="kt">Float64</span><span class="x">,</span> <span class="mi">1</span><span class="x">},</span><span class="n">N</span><span class="o">::</span><span class="kt">Int64</span><span class="x">)</span>
<span class="c">## generalised mean: https://en.wikipedia.org/wiki/Generalized_mean</span>
<span class="c">## this method throws a DomainError unless all elements of X are positive:</span>
<span class="c">## https://math.stackexchange.com/questions/317528/how-do-you-compute-negative-numbers-to-fractional-powers/317546#317546</span>
<span class="k">return</span> <span class="n">mean</span><span class="x">(</span><span class="n">X</span><span class="o">.^</span><span class="n">N</span><span class="x">)</span><span class="o">^</span><span class="x">(</span><span class="mi">1</span><span class="o">/</span><span class="n">N</span><span class="x">)</span>
<span class="k">end</span>
</code></pre></div></div>
<p>It is worth noting that this particular method throws a DomainError for odd exponents unless all elements of <script type="math/tex">X</script> are positive, since a negative number cannot be raised to a fractional power in floating-point arithmetic.</p>
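<p>The susceptibility to overflow is also easy to reproduce; the following sketch uses a condensed one-line version of the GM function above:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using Statistics
GM(X, N) = mean(X.^N)^(1/N) ## one-line version of GM above
## 1000.0^200 overflows Float64, so the result is Inf rather than 1000.0:
GM(fill(1000.0, 100), 200)
</code></pre></div></div>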
<h3 id="exponential-generalised-mean">Exponential generalised mean:</h3>
<p>\begin{equation}
\forall X \in \mathbb{R}^n \forall N \in \mathbb{N}, EM(X,N)= \frac{1}{N} \cdot \log \big(\frac{1}{n}\sum_{i=1}^n e^{N \cdot x_i}\big)
\end{equation}</p>
<p>converges to <script type="math/tex">\max(X), \forall X \in \mathbb{R}^n</script> since:</p>
<p>\begin{equation}
\frac{e^{N \bar{x}}}{n} \leq \frac{1}{n} \sum_{i=1}^n e^{N x_i} \leq e^{N \bar{x}}
\end{equation}</p>
<p>so we have:</p>
<p>\begin{equation}
\bar{x} - \frac{\log n}{N} \leq \frac{1}{N} \cdot \log \big(\frac{1}{n}\sum_{i=1}^n e^{N \cdot x_i}\big) \leq \bar{x}
\end{equation}</p>
<p>In Julia this may be implemented as follows:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> exp_GM</span><span class="x">(</span><span class="n">X</span><span class="o">::</span><span class="kt">Array</span><span class="x">{</span><span class="kt">Float64</span><span class="x">,</span> <span class="mi">1</span><span class="x">},</span><span class="n">N</span><span class="o">::</span><span class="kt">Int64</span><span class="x">)</span>
<span class="c">### This method is terrible. Overflow errors everywhere. </span>
<span class="n">exp_</span> <span class="o">=</span> <span class="n">mean</span><span class="x">(</span><span class="n">exp</span><span class="o">.</span><span class="x">(</span><span class="n">N</span><span class="o">*</span><span class="n">X</span><span class="x">))</span>
<span class="k">return</span> <span class="x">(</span><span class="mi">1</span><span class="o">/</span><span class="n">N</span><span class="x">)</span><span class="o">*</span><span class="n">log</span><span class="x">(</span><span class="n">exp_</span><span class="x">)</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Unlike the previous method, this one is defined <script type="math/tex">\forall X \in \mathbb{R}^n</script> but it too is numerically unstable.</p>
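<p>The instability can be reproduced in a couple of lines; this sketch uses a condensed version of exp_GM above:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using Statistics
exp_GM(X, N) = log(mean(exp.(N .* X))) / N ## condensed version of exp_GM above
## exp(1600) overflows Float64, so the result is Inf rather than roughly 800:
exp_GM([800.0, 1.0], 2)
</code></pre></div></div>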
<h3 id="the-smooth-max">The smooth max:</h3>
<p>\begin{equation}
\forall X \in \mathbb{R}^n \forall N \in \mathbb{N}, SM(X,N)= \frac{\sum_{i=1}^n x_i \cdot e^{N \cdot x_i}}{\sum_{i=1}^n e^{N \cdot x_i}}
\end{equation}</p>
<p>converges to <script type="math/tex">\max(X), \forall X \in \mathbb{R}^n</script> since <script type="math/tex">x_i - \bar{x} \leq 0</script> so we may define the sets <script type="math/tex">\{x_i = \bar{x}\}</script> and <script type="math/tex">% <![CDATA[
\{x_i < \bar{x}\} %]]></script>
where <script type="math/tex">\lvert \{x_i = \bar{x}\} \rvert = k</script> and <script type="math/tex">% <![CDATA[
\lvert \{x_i < \bar{x}\} \rvert = n-k %]]></script> so we have:</p>
<p>\begin{equation}
\frac{\sum_{i=1}^n x_i \cdot e^{N \cdot x_i}}{\sum_{i=1}^n e^{N \cdot x_i}} = \frac{\sum_{i=1}^n x_i \cdot e^{N \cdot (x_i-\bar{x})}}{\sum_{i=1}^n e^{N \cdot (x_i-\bar{x})}} = \frac{k\bar{x} + \sum_{x_i < \bar{x}} x_i \cdot e^{N \cdot (x_i-\bar{x})}}{k + \sum_{x_i < \bar{x}} e^{N \cdot (x_i-\bar{x})}}
\end{equation}</p>
<p>and we note that:</p>
<p>\begin{equation}
\lim_{N \to \infty} \max \big(\sum_{x_i < \bar{x}} x_i \cdot e^{N \cdot (x_i-\bar{x})},\sum_{x_i < \bar{x}} e^{N \cdot (x_i-\bar{x})}\big) = 0
\end{equation}</p>
<p>In Julia this may be implemented as follows:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span><span class="nf"> smooth_max</span><span class="x">(</span><span class="n">X</span><span class="o">::</span><span class="kt">Array</span><span class="x">{</span><span class="kt">Float64</span><span class="x">,</span> <span class="mi">1</span><span class="x">},</span><span class="n">N</span><span class="o">::</span><span class="kt">Int64</span><span class="x">)</span>
<span class="c">## implementation of the smooth maximum: </span>
<span class="c">## https://en.wikipedia.org/wiki/Smooth_maximum</span>
<span class="n">exp_</span> <span class="o">=</span> <span class="n">exp</span><span class="o">.</span><span class="x">(</span><span class="n">X</span><span class="o">*</span><span class="n">N</span><span class="x">)</span>
<span class="k">return</span> <span class="n">sum</span><span class="x">(</span><span class="n">X</span><span class="o">.*</span><span class="n">exp_</span><span class="x">)</span><span class="o">/</span><span class="n">sum</span><span class="x">(</span><span class="n">exp_</span><span class="x">)</span>
<span class="k">end</span>
</code></pre></div></div>
<p>This method appears to be the industry standard but like the other methods it is vulnerable to overflow errors as it fails to normalise the input vector <script type="math/tex">X</script>. In fact, the reader might want to explore an <a href="https://github.com/AidanRocke/analytic_min-max_operators/blob/master/analytic_max_operator.ipynb">IJulia notebook</a> where I analysed each method.</p>
<h2 id="my-proposed-solution">My proposed solution:</h2>
<p>This analysis motivated me to come up with my own solution inspired by the properties of the <a href="https://en.wikipedia.org/wiki/Lp_space#The_p-norm_in_finite_dimensions">infinity norm</a> where I first rescale the vectors so they have zero mean and unit variance:</p>
<p>\begin{equation}
\forall X \in \mathbb{R}^n \forall N \in \mathbb{N}, AM(\hat{X},N) = \sigma_X \cdot \big(\frac{1}{N} \log \big(\sum_{i=1}^n e^{N\cdot \hat{x_i}} \big)\big) + \mu_{X}
\end{equation}</p>
<p>where <script type="math/tex">\hat{X}</script> is defined as follows:</p>
<p>\begin{equation}
\hat{X} = \frac{X - 1_n\cdot \mu_X}{\sigma_X}
\end{equation}</p>
<p>and the partial derivative of <script type="math/tex">AM(\hat{X},N)</script> with respect to <script type="math/tex">\hat{x_i}</script> is, up to the constant factor <script type="math/tex">\sigma_X</script>, simply the softmax:</p>
<p>\begin{equation}
\frac{\partial}{\partial \hat{x_i}} AM(\hat{X},N) = \sigma_X \cdot \frac{e^{N \cdot \hat{x_i}}}{\sum_{j=1}^n e^{N\cdot \hat{x_j}}}
\end{equation}</p>
<p>This method may be readily generalised to approximate both the <script type="math/tex">\min</script> and <script type="math/tex">\max</script> operators on <script type="math/tex">\mathbb{R}^n</script> in the Julia programming language:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">Statistics</span>
<span class="k">function</span><span class="nf"> analytic_min_max</span><span class="x">(</span><span class="n">X</span><span class="o">::</span><span class="kt">Array</span><span class="x">{</span><span class="kt">Float64</span><span class="x">,</span> <span class="mi">1</span><span class="x">},</span><span class="n">N</span><span class="o">::</span><span class="kt">Int64</span><span class="x">,</span><span class="n">case</span><span class="o">::</span><span class="kt">Int64</span><span class="x">)</span>
<span class="s">"""
An analytic approximation to the min and max operators
Inputs:
X: a vector from R^n where n is unknown
N: an integer such that the approximation of max(X)
improves with increasing N.
case: If case == 1 apply analytic_min(), otherwise
apply analytic_max() if case == 2
Output:
An approximation to min(X) if case == 1, and max(X) if
case == 2
"""</span>
<span class="k">if</span> <span class="x">(</span><span class="n">case</span> <span class="o">!=</span> <span class="mi">1</span><span class="x">)</span><span class="o">*</span><span class="x">(</span><span class="n">case</span> <span class="o">!=</span> <span class="mi">2</span><span class="x">)</span>
<span class="k">return</span> <span class="n">print</span><span class="x">(</span><span class="s">"Error: case isn't well defined"</span><span class="x">)</span>
<span class="k">else</span>
<span class="c">## q is the degree of the approximation: </span>
<span class="n">q</span> <span class="o">=</span> <span class="n">N</span><span class="o">*</span><span class="x">(</span><span class="o">-</span><span class="mi">1</span><span class="x">)</span><span class="o">^</span><span class="n">case</span>
<span class="n">mu</span><span class="x">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="n">mean</span><span class="x">(</span><span class="n">X</span><span class="x">),</span> <span class="n">std</span><span class="x">(</span><span class="n">X</span><span class="x">)</span>
<span class="c">## rescale vector so it has zero mean and unit variance:</span>
<span class="n">Z_score</span> <span class="o">=</span> <span class="x">(</span><span class="n">X</span><span class="o">.-</span><span class="n">mu</span><span class="x">)</span><span class="o">./</span><span class="n">sigma</span>
<span class="n">exp_sum</span> <span class="o">=</span> <span class="n">sum</span><span class="x">(</span><span class="n">exp</span><span class="o">.</span><span class="x">(</span><span class="n">Z_score</span><span class="o">*</span><span class="n">q</span><span class="x">))</span>
<span class="n">log_</span> <span class="o">=</span> <span class="n">log</span><span class="x">(</span><span class="n">exp_sum</span><span class="x">)</span><span class="o">/</span><span class="n">q</span>
<span class="k">return</span> <span class="x">(</span><span class="n">log_</span><span class="o">*</span><span class="n">sigma</span><span class="x">)</span><span class="o">+</span><span class="n">mu</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>and the reader may check that it passes the following numerical stability test:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function numerical_stability(method,type::Int64)
"""
A simple test for numerical stability with respect to the relative error.
Input:
method: the approximation used
type: 1 for min() and 2 for max()
Output:
Check that the average relative error is less than 10%.
"""
## test will be run 100 times
relative_errors = zeros(100)
for i = 1:100
## a vector sampled uniformly from [-1000,1000]^100
X = (2*rand(100).-1.0)*1000
## the test for min operators
if type == 1
min_ = minimum(X)
relative_errors[i] = abs(min_-method(X,i))/abs(min_)
## the test for max operators
else
max_ = maximum(X)
relative_errors[i] = abs(max_-method(X,i))/abs(max_)
end
end
return mean(relative_errors) < 0.1
end
</code></pre></div></div>
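<p>Since numerical_stability calls method(X, i) with two arguments, analytic_min_max must be wrapped before it can be tested. A possible invocation, assuming both listings above are in scope:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## wrap analytic_min_max to match the two-argument signature of the test:
approx_min(X, N) = analytic_min_max(X, N, 1)
approx_max(X, N) = analytic_min_max(X, N, 2)
numerical_stability(approx_min, 1), numerical_stability(approx_max, 2)
</code></pre></div></div>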
<h2 id="discussion">Discussion:</h2>
<p>It took me about an hour to come up with my solution so I doubt this method is either original or state-of-the-art. In fact, I wouldn’t be surprised at
all if numerical analysts have a better solution than the one I propose here.</p>
<p>If you know of a more robust method, feel free to join the discussion on the <a href="https://mathoverflow.net/questions/352548/analytic-approximations-of-the-min-and-max-operators">MathOverflow</a>.</p>
<h2 id="references">References:</h2>
<ol>
<li>
<p>Aidan Rocke (https://mathoverflow.net/users/56328/aidan-rocke), analytic approximations of the min and max operators, URL (version: 2020-02-13): https://mathoverflow.net/q/352548</p>
</li>
<li>
<p>eakbas (https://mathoverflow.net/users/5223/eakbas), A differentiable approximation to the minimum function, URL (version: 2016-08-12): https://mathoverflow.net/q/35191</p>
</li>
<li>
<p>Wikipedia contributors. Smooth maximum. Wikipedia, The Free Encyclopedia. March 25, 2019, 21:07 UTC. Available at: https://en.wikipedia.org/w/index.php?title=Smooth_maximum&oldid=889462421. Accessed February 12, 2020.</p>
</li>
<li>
<p>Wikipedia contributors. (2019, December 3). Generalized mean. In Wikipedia, The Free Encyclopedia. Retrieved 21:37, February 13, 2020, from https://en.wikipedia.org/w/index.php?title=Generalized_mean&oldid=929065968</p>
</li>
<li>
<p>alwayscurious (https://stats.stackexchange.com/users/194748/alwayscurious), What is the reasoning behind standardization (dividing by standard deviation)?, URL (version: 2019-03-18): https://stats.stackexchange.com/q/398116</p>
</li>
<li>
<p>Sergey Ioffe & Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 2015.</p>
</li>
<li>
<p>J. Cook. Basic properties of the soft maximum. Working Paper Series 70, UT MD Anderson Cancer Center Department of Biostatistics, 2011. https://www.johndcook.com/soft_maximum.pdf</p>
</li>
<li>
<p>M. Lange, D. Zühlke, O. Holz, and T. Villmann, “Applications of lp-norms and their smooth approximations for gradient based learning vector quantization,” in Proc. ESANN, Apr. 2014, pp. 271-276.</p>
</li>
<li>
<p>Aidan Rocke. analytic_min-max_operators (2020). GitHub repository, https://github.com/AidanRocke/analytic_min-max_operators</p>
</li>
</ol>Aidan RockeMotivation:Almost all random matrices are nonsingular2020-02-03T00:00:00+00:002020-02-03T00:00:00+00:00/applied-math/2020/02/03/all-random-matrices<h2 id="introduction">Introduction:</h2>
<p>A classical result in the theory of random matrices states that any random matrix sampled from a continuous distribution is nonsingular
with probability one. This has many important consequences.</p>
<p>One fundamental consequence is that almost all linear models with square Jacobian matrices are invertible.</p>
<h2 id="the-spectral-norm-and-the-largest-eigenvalue">The spectral norm and the largest eigenvalue:</h2>
<p>For random matrices <script type="math/tex">A \sim \mathcal{U}([-N,N])^{n \times n}</script>, if <script type="math/tex">\lambda</script> is the largest eigenvalue of <script type="math/tex">A</script> and <script type="math/tex">\upsilon</script> the corresponding unit eigenvector:</p>
<p>\begin{equation}
\lVert A \rVert_{2} \geq \lVert A \upsilon \rVert_{2} = \lVert \lambda \upsilon \rVert_{2} = | \lambda | \lVert \upsilon \rVert_{2} = | \lambda |
\end{equation}</p>
<p>where</p>
<p>\begin{equation}
\lVert A \rVert_{2} = \sqrt{\lambda_{max}(A^T A)} = \sigma_{max}(A)
\end{equation}</p>
<p>so the modulus of the largest eigenvalue of <script type="math/tex">A</script> is at most its spectral norm.</p>
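<p>This bound is easy to check numerically with the LinearAlgebra standard library; the matrix size below is an arbitrary choice:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using LinearAlgebra
## entries sampled uniformly from [-N, N] with N = 5:
A = 10 .* rand(6, 6) .- 5
## the spectral radius is bounded by the spectral norm:
maximum(abs.(eigvals(A))) <= opnorm(A, 2) ## true
</code></pre></div></div>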
<h2 id="hölders-inequality-and-the-determinant">Hölder’s inequality and the determinant:</h2>
<p>Now, by Hölder’s inequality we have:</p>
<p>\begin{equation}
\lVert A \rVert_{2} \leq \sqrt{\lVert A \rVert_{1} \lVert A \rVert_{\infty}}
\end{equation}</p>
<p>where <script type="math/tex">\lVert A \rVert_{1}</script> is the maximum column sum and <script type="math/tex">\lVert A \rVert_{\infty}</script> is the maximum row sum:</p>
<p>\begin{equation}
\lVert A \rVert_{1} = \max_{1 \leq j \leq n} \sum_{i=1}^m |a_{ij}|
\end{equation}</p>
<p>\begin{equation}
\lVert A \rVert_{\infty} = \max_{1 \leq i \leq m} \sum_{j=1}^n |a_{ij}|
\end{equation}</p>
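<p>The same style of numerical check works for Hölder's inequality, since opnorm computes all three operator norms; a brief sketch:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using LinearAlgebra
A = 10 .* rand(6, 6) .- 5
## opnorm(A, 1) is the maximum column sum, opnorm(A, Inf) the maximum row sum:
opnorm(A, 2) <= sqrt(opnorm(A, 1) * opnorm(A, Inf)) ## true
</code></pre></div></div>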
<p>Given that the determinant of a matrix is the product of its eigenvalues we may use (1) and (3):</p>
<p>\begin{equation}
\lvert \det(A) \rvert = \prod_{i=1}^n \lvert \lambda_i \rvert \leq \lVert A \rVert_{2}^n \leq \sqrt{\lVert A \rVert_{1} \lVert A \rVert_{\infty}}^n \leq (nN)^n
\end{equation}</p>
<p>and using (1) and (3) every eigenvalue of <script type="math/tex">A</script> is similarly bounded:</p>
<p>\begin{equation}
\forall A \in [-N,N]^{n \times n}, \det(A-\lambda I_n) = 0 \implies \lvert \lambda \rvert \leq nN
\end{equation}</p>
<p>so the determinant mapping is defined:</p>
<p>\begin{equation}
\det: [-N,N]^{n \times n} \longrightarrow [-(nN)^n,(nN)^n]
\end{equation}</p>
<p>and we can also show that this mapping is analytic.</p>
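<p>The practical consequence of the result below is easy to observe: matrices sampled from a continuous distribution essentially never have a determinant of exactly zero. A small numerical illustration:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using LinearAlgebra
## ten thousand random draws, none of which turns out to be singular:
all(abs(det(rand(5, 5))) > 0 for _ in 1:10_000) ## true
</code></pre></div></div>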
<h2 id="the-determinant-maps-sets-of-positive-measure-to-sets-of-positive-measure">The determinant maps sets of positive measure to sets of positive measure:</h2>
<p>The determinant is a polynomial in the coordinates of <script type="math/tex">A \in [-N,N]^{n \times n}</script>:</p>
<p>\begin{equation}
\det(a_{ij}) = \sum_{\sigma \in S_n} \big( \text{sgn}(\sigma) \prod_{i=1}^n a_{i,\sigma_i} \big)
\end{equation}</p>
<p>so if we define the set of singular matrices:</p>
<p>\begin{equation}
S = \{A \in [-N,N]^{n \times n}: \det(A)=0\}
\end{equation}</p>
<p>we note that the determinant is a non-trivial polynomial in the entries of <script type="math/tex">A</script>, and since the zero set of a non-trivial polynomial has Lebesgue measure zero, <script type="math/tex">S</script> must be a set of
measure zero.</p>Aidan RockeIntroduction:Almost all random vectors are orthogonal2020-02-03T00:00:00+00:002020-02-03T00:00:00+00:00/applied-math/2020/02/03/orthogonal-vectors<h2 id="motivation">Motivation:</h2>
<p>To develop algorithms for problems in machine learning or statistical physics, it is useful to develop an understanding of high-dimensional Euclidean spaces.
One interesting property is that almost all high-dimensional random vectors are orthogonal with respect to the cosine distance.</p>
<p>I shall start with a simple uniformly distributed random variable that gives insight into more complex cases.</p>
<h2 id="an-illustrative-problem">An illustrative problem:</h2>
<p>We note that for any <script type="math/tex">X,Y \sim \mathcal{U}(\{-1,1\})^{2n}</script>:</p>
<p>\begin{equation}
\lVert X \rVert = \sqrt{2n}
\end{equation}</p>
<p>\begin{equation}
X \cdot Y = \sum_{i=1}^{2n} x_i \cdot y_i
\end{equation}</p>
<p>where <script type="math/tex">\forall i, x_i \cdot y_i</script> equals <script type="math/tex">-1</script> or <script type="math/tex">+1</script> with equal probability.</p>
<p>As a result, if we define the cosine distance:</p>
<p>\begin{equation}
\text{COS}(X,Y) = \frac{X \cdot Y}{\lVert X \rVert \lVert Y \rVert}
\end{equation}</p>
<p>we find that this expression simplifies to:</p>
<p>\begin{equation}
S_n = \frac{\sum_{i=1}^{2n} x_i \cdot y_i}{2n} \approx \mathbb{E}[x_i \cdot y_i] = 0
\end{equation}</p>
<p>and by applying the weak law of large numbers to <script type="math/tex">S_n</script> we find that:</p>
<p>\begin{equation}
\forall \epsilon > 0, \lim_{n \to \infty} P(|S_n - \mathbb{E}[x_i \cdot y_i]| > \epsilon) = \lim_{n \to \infty} P(|S_n| > \epsilon) = 0
\end{equation}</p>
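<p>This concentration can be observed directly; the dimension below is an arbitrary choice:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using LinearAlgebra
## two independent random sign vectors with 2n = 10^6 coordinates:
X = rand((-1, 1), 10^6)
Y = rand((-1, 1), 10^6)
## their cosine similarity is of order 1/sqrt(2n):
abs(dot(X, Y) / (norm(X) * norm(Y)))
</code></pre></div></div>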
<h2 id="the-case-of-isotropic-gaussian-vectors">The case of isotropic Gaussian vectors:</h2>
<p>For the case of <script type="math/tex">X,Y \sim \mathcal{N}(0, \sigma^2)^{n}</script> we may proceed in a similar manner. It’s particularly useful to start by
analysing the denominator of the cosine formula:</p>
<p>\begin{equation}
\lVert X \rVert^2 = \sum_{x_i > 0} x_i^2 + \sum_{x_i < 0} x_i^2 \approx Cn
\end{equation}</p>
<p>where the constant <script type="math/tex">C</script> is given by:</p>
<p>\begin{equation}
C = \int_{0}^\infty x^2 \cdot \frac{f(x)}{P(x \geq 0)} dx = \int_{0}^\infty 2 x^2 \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{x^2}{2 \sigma^2}} dx = \sigma^2
\end{equation}</p>
<p>As a result, if we compute the cosine distance of <script type="math/tex">X</script> and <script type="math/tex">Y</script> independently sampled from <script type="math/tex">\mathcal{N}(0, \sigma^2)^{n}</script> we find that:</p>
<p>\begin{equation}
\text{COS}(X,Y) = \frac{X \cdot Y}{\lVert X \rVert \lVert Y \rVert} \approx \frac{X \cdot Y}{\lVert X \rVert^2} \approx \frac{S_n}{C}
\end{equation}</p>
<p>where <script type="math/tex">S_n</script> is given by:</p>
<p>\begin{equation}
S_n = \frac{\sum_{i=1}^{n} x_i \cdot y_i}{n} \approx \mathbb{E}[x_i \cdot y_i]
\end{equation}</p>
<p>and since <script type="math/tex">\mathbb{E}[x_i \cdot y_i] = 0</script>, by the weak law of large numbers we have:</p>
<p>\begin{equation}
\forall \epsilon > 0, \lim_{n \to \infty} P(|S_n| > \epsilon) = 0
\end{equation}</p>Aidan RockeMotivation:Are partial derivatives the computational primitives of deep neural networks?2020-01-26T00:00:00+00:002020-01-26T00:00:00+00:00/neural-computation/2020/01/26/partial-derivatives<h2 id="introduction">Introduction:</h2>
<p>The typical deep neural network tutorial introduces deep networks as compositions of nonlinearities and affine transforms. In fact, a deep network with relu activation simplifies to a linear combination of affine transformations with compact support. But, why would affine transformations be useful for nonlinear regression?</p>
<p>After some reflection it occurred to me that the reason why they work is that they are actually first-order Taylor approximations and by this logic partial derivatives, i.e. Jacobians, are computational primitives for both inference and learning.</p>
<h2 id="approximating-continuous-functions-with-piece-wise-linear-functions">Approximating continuous functions with piece-wise linear functions:</h2>
<p>A piece-wise linear function <script type="math/tex">f:\mathbb{R} \rightarrow \mathbb{R}</script> is defined as follows:</p>
<p>\begin{equation}
f(x) = \sum_i \lambda_i(x) \cdot 1_{X_i}(x)
\end{equation}</p>
<p>where <script type="math/tex">1_{X_i}</script> is the indicator function and the <script type="math/tex">\lambda_i</script> are affine functions. Clearly, every polynomial <script type="math/tex">p: \mathbb{R} \rightarrow \mathbb{R}</script> may be approximated using piecewise linear functions since they are locally Lipschitz.</p>
<p>Now, let’s consider a nonlinear continuous function with compact domain and compact co-domain <script type="math/tex">X,Y \subset \mathbb{R}^3</script>:</p>
<p>\begin{equation}
F: X \rightarrow Y
\end{equation}</p>
<p>Since the polynomials are dense in the space of continuous functions, due to Stone-Weierstrass, we may substitute <script type="math/tex">F</script> with a polynomial of degree <script type="math/tex">n</script>, <script type="math/tex">F_n \in P_n</script> such that if we define the functional:</p>
<p>\begin{equation}
\mathcal{L}[F_n] = \int_X \left\lVert F(x)-F_n(x) \right\rVert^2 dx
\end{equation}</p>
<p>the sequence <script type="math/tex">\{\hat{F}_n\}_{n=1}^\infty</script> converges uniformly to <script type="math/tex">F</script>:</p>
<p>\begin{equation}
\hat{F}_n = \underset{F_n \in P_n}{\arg\min} \mathcal{L}[F_n]
\end{equation}</p>
<p>\begin{equation}
\lim_{n \to \infty} \mathcal{L}[\hat{F}_n] = 0
\end{equation}</p>
<p>Moreover, due to the Lipschitz condition (3) also allows us to define the maximum local error with respect to a ball <script type="math/tex">B(x_0,r)</script>:</p>
<p>\begin{equation}
\exists C > 0, E_n(r)= \underset{x_0 \in X}{\max} \int_{B(x_0,r)} \left\lVert F(x)-\hat{F}_n(x) \right\rVert^2 dx \leq C \frac{4}{3}\pi r^5
\end{equation}</p>
<p>which then allows us to introduce the first-order Taylor approximation:</p>
<p>\begin{equation}
\forall x \in B(x_0,r), \phi(x) = \hat{F}_n(x_0) + \frac{\partial \hat{F}_n}{\partial x} \Big\rvert _{x=x_0}(x-x_0)
\end{equation}</p>
<p>where <script type="math/tex">\frac{\partial \hat{F}_n}{\partial x}</script> is a Jacobian. This will allow us to construct a linear basis of <script type="math/tex">\phi_i</script> for <script type="math/tex">F</script>.</p>
<h2 id="constructing-a-linear-basis-for-continuous-functions">Constructing a linear basis for continuous functions:</h2>
<p>We may construct a linear basis of <script type="math/tex">\phi_i</script> for <script type="math/tex">F</script> using (7), where the number of piece-wise linear <script type="math/tex">\phi_i</script> needed to approximate <script type="math/tex">F</script> is given by:</p>
<p>\begin{equation}
N(r) = \frac{0.74 \cdot \text{Vol}(X)}{\frac{4}{3}\pi r^3} \approx 0.18 \cdot \text{Vol}(X) \cdot r^{-3}
\end{equation}</p>
<p>where <script type="math/tex">0.74</script> is the density of closely-packed spheres in <script type="math/tex">\mathbb{R}^3</script>. Furthermore, the <script type="math/tex">\phi_i</script> have compact support and pairwise disjoint domains by construction so we have:</p>
<p>\begin{equation}
\forall i,j\neq i, \langle \phi_i,\phi_j \rangle = \int_X \phi_i(x) \cdot \phi_j(x) dx = 0
\end{equation}</p>
<p>and therefore we may approximate <script type="math/tex">F</script> as follows:</p>
<p>\begin{equation}
F(x) \approx \hat{F_r}(x) = \sum_{i=1}^{N(r)} \phi_i(x)
\end{equation}</p>
<p>We also find the following useful upper-bound using (6) and (8):</p>
<p>\begin{equation}
\mathcal{L}[\hat{F_r}] = \int_X \left\lVert F(x)-\hat{F_r}(x) \right\rVert^2 dx \leq C \cdot \text{Vol}(X) \cdot r^2
\end{equation}</p>
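<p>A one-dimensional sketch of this construction (with f = sin as an arbitrary choice of target function) shows the total squared error shrinking as the radius r decreases:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>## cover [0, 2π] with intervals of radius r and use the tangent line at each
## centre as the local first-order Taylor approximation:
f(x) = sin(x)
df(x) = cos(x)
function taylor_error(r)
    err = 0.0
    for x0 in r:2r:2π
        for x in range(x0 - r, x0 + r, length=50)
            φ = f(x0) + df(x0) * (x - x0) ## local Taylor piece
            err += (f(x) - φ)^2 * (2r / 50) ## Riemann sum of the squared error
        end
    end
    return err
end
taylor_error(0.05) < taylor_error(0.1) ## true: finer partitions reduce the error
</code></pre></div></div>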
<p>Now, the challenge is essentially to find the first-order Taylor approximations and I will try to show that this is basically what
is being done by a deep neural network.</p>
<h2 id="fully-connected-deep-rectifier-networks">Fully-connected deep rectifier networks:</h2>
<p>Let’s consider a fully-connected deep network with fixed width, relu activation and compact domain <script type="math/tex">X \subset \mathbb{R}^n</script> and co-domain <script type="math/tex">Y \subset \mathbb{R}^n</script>:</p>
<p>\begin{equation}
F_\theta: X \rightarrow Y
\end{equation}</p>
<p>\begin{equation}
F(x;\theta) = f \circ h_L \circ … \circ h_1(x) = f \circ \phi(x)
\end{equation}</p>
<p>where <script type="math/tex">relu(x)=\max(0,x)</script> and the parameters of <script type="math/tex">F_\theta</script> are given by:</p>
<p>\begin{equation}
\theta = \{W_l \in \mathbb{R}^{n_l \times n_{l-1}}, b_l \in \mathbb{R}^{n_l}: l \in [L]\}
\end{equation}</p>
<p>Now, let’s note that if <script type="math/tex">F_\theta</script> has <script type="math/tex">N</script> layers and <script type="math/tex">n</script> nodes per layer there are:</p>
<p>\begin{equation}
(n!)^N
\end{equation}</p>
<p>layer-wise permutations that result in functions equivalent to <script type="math/tex">F_\theta</script> so it’s more useful to think in terms
of function spaces than particular parameterisations of a network.</p>
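<p>This layer-wise permutation symmetry can be verified directly on a small random network; the layer sizes below are arbitrary:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using Random
relu(x) = max.(x, 0)
W1, b1 = randn(4, 3), randn(4)
W2, b2 = randn(2, 4), randn(2)
F(x) = W2 * relu(W1 * x .+ b1) .+ b2
## permute the hidden units together with their incoming and outgoing weights:
p = randperm(4)
G(x) = W2[:, p] * relu(W1[p, :] * x .+ b1[p]) .+ b2
x = randn(3)
F(x) ≈ G(x) ## true: the permuted network computes the same function
</code></pre></div></div>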
<p>We may also observe that similar to (11) <script type="math/tex">F_\theta</script> is a linear combination of functions <script type="math/tex">\phi_i</script> that are affine on <script type="math/tex">X_i</script>
and zero on <script type="math/tex">X \setminus X_i</script>. Since <script type="math/tex">\phi_i</script> is characterised by the activation patterns of <script type="math/tex">F_\theta</script> and <script type="math/tex">\phi_i</script> is continuous
on <script type="math/tex">X_i</script>, <script type="math/tex">X_i</script> must be compact and <script type="math/tex">X_i</script> and <script type="math/tex">X_j</script> must be pair-wise disjoint so:</p>
<p>\begin{equation}
\forall i,j\neq i, X_i \cap X_{j \neq i} = \emptyset
\end{equation}</p>
<p>\begin{equation}
\forall i,j\neq i, \langle \phi_i,\phi_j \rangle = \int_X \phi_i(x) \cdot \phi_j(x) dx = 0
\end{equation}</p>
<p>\begin{equation}
\exists A \in \mathbb{R}^{n \times n} B \in \mathbb{R}^{n}, \phi_i(x) = Ax+B
\end{equation}</p>
<p>Now, if the reader is wondering about (16) let’s suppose that there exists <script type="math/tex">x \in X</script> such that <script type="math/tex">x \in X_i</script> and <script type="math/tex">x \in X_j</script>. Since <script type="math/tex">\phi_i</script> and <script type="math/tex">\phi_j</script> correspond to different activation patterns in <script type="math/tex">F_\theta</script> there must be a node in <script type="math/tex">F_\theta(x)</script>
with value <script type="math/tex">\alpha \in \mathbb{R}</script> such that <script type="math/tex">\alpha > 0</script> and <script type="math/tex">\alpha \leq 0</script>, which is impossible, so the <script type="math/tex">X_i</script> are indeed pairwise disjoint.</p>
<h2 id="function-approximation-with-neural-networks">Function Approximation with Neural Networks:</h2>
<p>If we are trying to approximate a continuous function <script type="math/tex">F:X \rightarrow Y</script> by means of <script type="math/tex">F_\theta</script> we may use the results developed earlier
and define:</p>
<p>\begin{equation}
\mathcal{L}[F_{\theta}] = \int_X \left\lVert F(x)-F_{\theta}(x) \right\rVert^2 dx
\end{equation}</p>
<p>and given a suitable vector-valued polynomial approximation <script type="math/tex">\tilde{F}</script> such that:</p>
<p>\begin{equation}
\mathcal{L}[\tilde{F}] \approx 0
\end{equation}</p>
<p>it is sufficient to choose the <script type="math/tex">\phi_i</script> in <script type="math/tex">F_\theta</script> such that:</p>
<p>\begin{equation}
\forall x \in B(x_0,r), \phi_i(x) = \tilde{F}(x_0) + \frac{\partial \tilde{F}}{\partial x} \Big\rvert _{x=x_0}(x-x_0)
\end{equation}</p>
<p>We may also note that for a deep rectifier network with <script type="math/tex">N</script> nodes, each node is either on or off so it has at most <script type="math/tex">2^N</script>
possible activation patterns, which correspond to distinct <script type="math/tex">\phi_i</script> and a partition of <script type="math/tex">X</script> into at most <script type="math/tex">2^N</script> pair-wise disjoint
and compact subsets. In principle, this means that for networks <script type="math/tex">F_\theta</script> with <script type="math/tex">N</script> nodes and at least one hidden
layer we should empirically observe:</p>
<p>\begin{equation}
\mathcal{L}[F_\theta] \sim \frac{1}{2^N}
\end{equation}</p>
<p>If there are any doubts concerning the existence of suitable polynomials, for finitely many points <script type="math/tex">(p,q) \in \mathbb{R}^n \times \mathbb{R}^m</script> there are infinitely many interpolating polynomials. The interesting question is which polynomial generalises best but that is beyond the scope of the present article.</p>
<h2 id="discussion">Discussion:</h2>
<p>At this point this idea still represents a rough sketch that must be developed further. But, I think it also provides some insight into
why neural networks are so good at function approximation. They are approximating the intrinsic geometry of a mapping using piece-wise linear
approximations, which works because we can find a suitable polynomial approximation, and all polynomials are locally Lipschitz.</p>
<p>Thoughts and comments are welcome!</p>
<h2 id="references">References:</h2>
<ol>
<li>Andrey Kolmogorov & S.V. Fomin. Elements of the Theory of Functions and Functional Analysis: Metric and normed spaces. Dover. 1954.</li>
<li>W. Rudin. Real and complex analysis. McGraw-Hill. 3rd ed.1986.</li>
<li>Ian Goodfellow and Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press. 2016.</li>
</ol>Aidan RockeIntroduction: