# Are 20 amino acids necessary and sufficient?

## Motivation:

There is a persistent belief among evolutionary biologists that chance played a significant role in biological evolution. This would imply that life and its building blocks could have been significantly different. In particular, some protein scientists suspect that evolution could have used as few as 10 amino acids to build life rather than the original 20. We may arrive at such a conclusion by considering the minimum number of letters required to fold a protein [1].

However, there are a couple of limitations to such analyses. First, they ignore second-order, third-order and fourth-order constraints on the structure of amino acids, which quickly lead to a combinatorial explosion in the number of constraints facing any minimal set of amino acids. Second, they ignore the time and space constraints on the search problem of finding the last universal common ancestor containing the set of 20 amino acids that we know.

By taking these two points into consideration we find that there was actually very little room for chance, leading us to a biological equivalent of the anthropic principle.

## The argument for sufficiency:

There isn’t much to be said about sufficiency except that the diversity of multicellular life forms is an existence proof that 20 amino acids are effectively sufficient. Establishing necessity is a lot harder.

## The compositional structure of life forms:

It’s worth noting that biological systems have the following compositional structure:

\begin{equation} \text{amino acids}(\mathcal{A}) \rightarrow \text{proteins}(\mathcal{P}) \rightarrow \text{cells}(\mathcal{C}) \rightarrow \text{eukarya}(\mathcal{E}) \end{equation}

where each level of abstraction is consistent with but not reducible to the levels of abstraction below it. These may also be represented as a hierarchy of nonlinear transformations:

\begin{equation} \mathcal{A} \overset{f_1}{\rightarrow} \mathcal{P} \overset{f_2}{\rightarrow} \mathcal{C} \overset{f_3}{\rightarrow} \mathcal{E} \end{equation}

\begin{equation} \mathcal{E} = f_3 \circ \mathcal{C} \end{equation}

\begin{equation} \mathcal{E} = f_3 \circ f_2 \circ \mathcal{P} \end{equation}

\begin{equation} \mathcal{E} = f_3 \circ f_2 \circ f_1 \circ \mathcal{A} \end{equation}

where the nonlinearity of the $f_i$ implies that eukaryotes are more sensitive to variations in their basic set of amino acids than to variations in cell type. In other words, if $f_1$ and $f_2$ are nonlinear mappings then $f_2 \circ f_1$ is most probably even more nonlinear in its behaviour.
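One way to make this sensitivity claim concrete is the chain rule, under the simplifying assumption (not made in the text) that the $f_i$ are differentiable maps of a single real variable. The scalars $a$, $p = f_1(a)$ and $c = f_2(p)$ are hypothetical stand-ins for a point at the amino-acid, protein and cell levels:

\begin{equation} \frac{d}{da} (f_3 \circ f_2 \circ f_1)(a) = f_3'(c) \cdot f_2'(p) \cdot f_1'(a) \end{equation}

The response of the top level to a perturbation at the cell level is $f_3'(c)$ alone, while the response to a perturbation at the amino-acid level carries the extra factor $f_2'(p) \cdot f_1'(a)$. Whenever the magnitude of that factor exceeds one, a variation in $a$ is amplified relative to the same variation in $c$. Whether it does is an assumption about the $f_i$, not something the chain rule guarantees.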

We may proceed in this manner with a stability analysis of functions on trees [2]. However, given that the functional form of the $f_i$ isn’t exactly known, a combinatorial analysis may be a better approach. Specifically, we may try to infer the number of independent constraints on $\mathcal{A}$ given the number of independent constraints on $\mathcal{E}$.

## Complexity and Robustness:

We may make the reasonable assumption that common life forms persist because they are robust. From an algorithmic perspective, one way for a pattern to be robust is if it is maximally informative. In other words, given a biological pattern $L$, the Kolmogorov Complexity of $L$ is given by:

\begin{equation} 3 \leq K(L) \approx \lvert L \rvert \end{equation}

where $\lvert L \rvert$ is the length of $L$ and $3$ is a reasonable lower bound.
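As a rough illustration of what $K(L) \approx \lvert L \rvert$ means, the sketch below uses compressed length as a computable stand-in for Kolmogorov complexity. This is an assumption for illustration only: $K$ itself is uncomputable, and a general-purpose compressor such as zlib only gives an upper bound on it. The helper name `compression_ratio` and the two test patterns are hypothetical.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed length over original length: a crude proxy for K(L) / |L|."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 10_000

# A highly regular pattern: "repeat ACGT 2500 times" is a short description.
repetitive = b"ACGT" * (n // 4)

# Pseudo-random bytes stand in for a pattern with no regularity that the
# compressor can exploit (zlib cannot see through the generator).
incompressible = bytes(rng.getrandbits(8) for _ in range(n))

ratio_repetitive = compression_ratio(repetitive)
ratio_random = compression_ratio(incompressible)

print(f"repetitive pattern ratio: {ratio_repetitive:.3f}")
print(f"random pattern ratio:     {ratio_random:.3f}")
```

A ratio close to one corresponds to the maximally informative case $K(L) \approx \lvert L \rvert$, while a ratio close to zero marks a pattern with a much shorter description than its length.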

In order to apply this method to our analysis of compositional structures we must choose a representative from each set: $E \in \mathcal{E}$, $C \in \mathcal{C}$ and $P \subset \mathcal{P}$.

Using the principle that mundane life forms must be robust and the fact that biological systems have a compositional structure, we define $E$, $C$, $P$ and $\mathcal{A}$ as follows:

\begin{equation} E := \text{most common eukaryote} \end{equation}

\begin{equation} C := \text{most common cell type in } E \end{equation}

\begin{equation} P := \text{set of proteins found in } C \end{equation}

\begin{equation} \mathcal{A} := \text{set of 20 amino acids} \end{equation}

Now, let’s suppose that the number of independent constraints on the reproduction of $E$ is approximately given by:

\begin{equation} K(E) = \text{Kolmogorov Complexity of genome of } E \end{equation}

\begin{equation} K(E) \geq 3 \end{equation}

If each independent constraint on $E$ may be expressed in terms of $C$, each independent constraint on $C$ may be expressed in terms of $P$ … the number of constraints on $\mathcal{A}$ must be on the order of:

\begin{equation} \mathcal{L} = K(\mathcal{A})^{K(P)^{K(C)^{K(E)}}} \geq e^{e^{e^{e}}} > 10^{1000} \end{equation}

assuming that:

\begin{equation} \min(K(\mathcal{A}),K(P),K(C),K(E)) \geq 3 \end{equation}
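The size of the iterated exponential can be checked with a few lines of arithmetic. The sketch below works with base-10 logarithms so that nothing overflows, and also evaluates the tower with every entry set to the assumed minimum of 3. The variable names are illustrative, not part of the argument.

```python
import math

# log10 of e^(e^(e^e)), evaluated from the top of the tower down.
top = math.e ** math.e                         # e^e, about 15.15
middle = math.exp(top)                         # e^(e^e), about 3.8 million
log10_tower_e = middle * math.log10(math.e)    # log10(e^(e^(e^e)))

# The same tower with every entry equal to 3, the assumed minimum complexity.
log10_tower_3 = (3 ** (3 ** 3)) * math.log10(3)  # log10(3^(3^(3^3)))

print(f"log10 of e-tower: {log10_tower_e:.3e}")
print(f"log10 of 3-tower: {log10_tower_3:.3e}")
```

Both logarithms are far above 1000, so the stated bound $> 10^{1000}$ is very loose: the tower of $e$'s already has more than a million decimal digits.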

At present I can’t fully explain the origin of the iterated exponential but I think it arises naturally in complex systems with emergent behaviour. The basic idea is that you can think of unfolding higher-level abstractions into more fundamental lower-level abstractions in higher-dimensional spaces using a sequence of multipartite graphs.

Edges that define interactions in one space become points in another.

## The number of independent constraints on the fundamental set of amino acids:

It follows that finding a fundamental set of amino acids that simultaneously satisfies all constraints is a multi-objective discrete optimisation problem in a search space that is at least as large as:

\begin{equation} 2^{\mathcal{L}} > 2^{10^{1000}} \end{equation}

since for each independent constraint there are at least two options. The probability that this set of amino acids was discovered by chance is probably small… but exactly how small?

## Time and Space constraints on biological evolution:

I’d like to make a compelling argument that the time-bounded search space for the 20 amino acids we know is much smaller than $2^{\mathcal{L}}$.

Consider that the smallest organism weighs on the order of one picogram, i.e. $10^{-15}$ kilograms. If there are less than $10^{20}$ kilograms of living organisms at any moment on any Earth-like planet, bacteria reproduce every 20 minutes and the universe is on the order of $10^{10}$ years old, then the effective search space for a universal common ancestor containing all 20 amino acids is less than:

\begin{equation} S = \text{time} \times \text{space} \leq 10^{20} \cdot 10^{10} \cdot 10^{15} \cdot 365 \cdot 24 \cdot 3 \leq 10^{50} \end{equation}

so the probability that the 20 amino acids were discovered by chance has to be much smaller than:

\begin{equation} \frac{S}{2^\mathcal{L}} < \frac{1}{10^{1000}} \end{equation}

which suggests that either a highly effective search procedure was responsible for their discovery or that fundamental laws of physics impose regularities on the search space for amino acids and thus considerably simplify the optimisation problem. Both of these ideas lead us to the conclusion that there must be a biological equivalent to the anthropic principle in cosmology.
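The time-and-space bound can be checked numerically. In the sketch below, the reading of each factor is an assumption: $10^{15}$ organisms per kilogram follows from one picogram per organism, while assigning $10^{20}$ to the biomass in kilograms and $10^{10}$ to the age of the universe in years is an interpretation of the factors in the product. All variable names are hypothetical.

```python
import math
from decimal import Decimal

# Assumed readings of the factors in the bound on S.
biomass_kg = 10**20                   # upper bound on living matter, in kilograms
organisms_per_kg = 10**15             # one picogram per organism = 1e-15 kg
universe_age_years = 10**10           # order-of-magnitude age of the universe
generations_per_year = 365 * 24 * 3   # one division every 20 minutes

# Total number of organism-generations available to the search.
S = biomass_kg * organisms_per_kg * universe_age_years * generations_per_year
log10_S = math.log10(S)

# log10 of the search space 2^(10^1000); Decimal copes with the huge exponent.
log10_search_space = Decimal(10) ** 1000 * Decimal(2).log10()

# log10 of the ratio S / 2^(10^1000).
log10_ratio = Decimal(log10_S) - log10_search_space

print(f"log10(S) = {log10_S:.2f}")
print(f"log10(S / 2^(10^1000)) = {log10_ratio:.3E}")
```

With these readings the product of the factors is about $2.6 \times 10^{49}$, and the logarithm of the ratio is on the order of $-3 \times 10^{999}$, so the probability bound of $10^{-1000}$ holds with an enormous margin.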

## Discussion:

The above calculations started as back-of-the-envelope calculations and I think they can be refined. One approach would be to think of this as a problem in programming language design. What if we allowed programming languages for dynamic control problems to evolve in such a way that higher-level languages emerge? Would we find that the more primitive instruction set is more sensitive to random variation?

Alternatively, it’s a good idea to take a closer look at the problem from the perspective of protein folding. How many amino acids are required to cover the most fundamental types of protein interactions?

## References:

1. Ke Fan and Wei Wang. What is the minimum number of letters required to fold a protein? Journal of Molecular Biology. 2003.
2. Roozbeh Farhoodi, Khashayar Filom, Ilenna Simone Jones, Konrad Paul Kording. On functions computed on trees. arXiv. 2019.
3. Yi Lu and Stephen Freeland. On the evolution of the standard amino-acid alphabet. Genome Biology. 2006.
4. H. James Cleaves. The origin of the biologically coded amino acids. Journal of Theoretical Biology. 2009.
5. Matthias Granold et al. Modern diversification of the amino acid repertoire driven by oxygen. PNAS. 2017.