## Motivation:

To understand the potential effectiveness of a peer-to-peer not-a-tracing (NAT) system against an unknown pathogen, we may consider how a data-driven definition of an infection risk function $\mathcal{R}$ at each node of a contact graph would allow us to allocate limited testing capacity in a rational manner. The key challenges become clear once we have analysed the structure of the problem.

Initially, it appears that the main challenge is finding a reliable definition of $\mathcal{R}$ in the low-data regime.

## Graph structure and state space:

We shall assume that our graph is a small-world network, with each configuration of node states drawn from an epidemiological state space:

a. Each node $v_i \in V$ represents an individual whose set of neighbours (i.e. physical contacts) may be represented by $\mathcal{N}(v_i) \subset V$.

b. We shall assume that at any instant a particular node is either susceptible or infected, so for a graph with $N$ nodes there are $2^N$ possible states.
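
A minimal sketch of this setup, assuming a Watts-Strogatz-style construction (the text does not say which small-world model is intended). The function name `small_world_graph` and the parameters `n`, `k`, `p` and `seed` are illustrative choices, not part of the original description.

```python
import random


def small_world_graph(n, k, p, seed=0):
    """Return {node: set of neighbours} for a Watts-Strogatz-style graph.

    Assumption: k is even and much smaller than n. Each node starts with
    its k nearest ring neighbours; each lattice edge is then rewired to a
    random non-neighbour with probability p.
    """
    rng = random.Random(seed)
    neighbours = {i: set() for i in range(n)}

    # Step 1: ring lattice, k/2 neighbours on each side.
    for i in range(n):
        for j in range(1, k // 2 + 1):
            other = (i + j) % n
            neighbours[i].add(other)
            neighbours[other].add(i)

    # Step 2: rewire each lattice edge with probability p.
    for i in range(n):
        for j in range(1, k // 2 + 1):
            old = (i + j) % n
            if old in neighbours[i] and rng.random() < p:
                candidates = [v for v in range(n)
                              if v != i and v not in neighbours[i]]
                if candidates:
                    new = rng.choice(candidates)
                    neighbours[i].discard(old)
                    neighbours[old].discard(i)
                    neighbours[i].add(new)
                    neighbours[new].add(i)
    return neighbours


graph = small_world_graph(n=20, k=4, p=0.1)
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2

# One global state: 0 = susceptible, 1 = infected, one entry per node.
state = tuple(1 if v == 0 else 0 for v in graph)
num_states = 2 ** len(graph)  # size of the state space, 2^N
```

Rewiring preserves the number of edges, so only the neighbourhoods $\mathcal{N}(v_i)$ change, while the state space grows as $2^N$ regardless of the wiring.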

## Data collection:

We shall assume that individuals in this social network use a NAT app in which they log:

a. Clinically relevant phenotypic data (the phenotypic space): age, biological sex, pre-existing medical conditions

b. Symptoms (the symptom space): cough, sore throat, temperature, sense of smell

We may assume that symptoms shall be logged on a daily basis.
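
As a rough illustration of what such a log might look like, here is a sketch using Python dataclasses. The class and field names (`Phenotype`, `SymptomReport`, `UserLog`, `temperature_c`, and so on) are hypothetical; the text only specifies which quantities are recorded, not how they are stored.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Phenotype:
    """Phenotypic data, entered once (hypothetical schema)."""
    age: int
    sex: str
    conditions: tuple = ()


@dataclass(frozen=True)
class SymptomReport:
    """One day's symptoms (hypothetical schema)."""
    cough: bool
    sore_throat: bool
    temperature_c: float
    loss_of_smell: bool


@dataclass
class UserLog:
    phenotype: Phenotype
    daily: dict = field(default_factory=dict)  # date -> SymptomReport

    def record(self, day, report):
        # A second report on the same day overwrites the first.
        self.daily[day] = report

    def latest(self):
        # Most recent report, or None if nothing has been logged yet.
        if not self.daily:
            return None
        return self.daily[max(self.daily)]


user = UserLog(Phenotype(age=42, sex="F", conditions=("asthma",)))
user.record(date(2020, 4, 1), SymptomReport(False, False, 36.8, False))
user.record(date(2020, 4, 2), SymptomReport(True, False, 37.9, True))
latest_report = user.latest()
```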

## Modelling infection risk using machine learning:

Given that test kits are limited, we need a method for prioritising test allocation. This may be done using a model of infection risk. Mathematically, to determine whether a particular individual should take a test, we use a parametric risk function:

\begin{equation} \mathcal{R}(\theta): \mathbb{R}^l \times \mathbb{R}^{d} \rightarrow [0,1] \end{equation}

where $d$ is the dimension of the symptom space, and $l$ is the dimension of the phenotypic space. Furthermore, for each vertex $v_i \in V$ there is a feature map $F$ such that $F(v_i) \in \mathbb{R}^l \times \mathbb{R}^{d}$, so the estimated infection risk at $v_i$ is obtained by evaluating $\mathcal{R}(\theta)$ at $F(v_i)$.
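
One possible concrete form, shown purely as a sketch: a logistic model over the concatenated phenotypic and symptom features. The logistic form, the names `feature_vector`, `sigmoid` and `risk`, and the example feature encoding are all assumptions made for illustration; the text only requires a parametric map into $[0,1]$.

```python
import math


def feature_vector(phenotype, symptoms):
    """Concatenate l phenotypic and d symptom features into one vector."""
    return list(phenotype) + list(symptoms)


def sigmoid(z):
    """Numerically stable logistic function with values in [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def risk(theta, phenotype, symptoms):
    """Assumed logistic risk: theta holds l + d weights plus a final bias."""
    x = feature_vector(phenotype, symptoms)
    if len(theta) != len(x) + 1:
        raise ValueError("theta must have l + d + 1 entries")
    z = theta[-1] + sum(w * xi for w, xi in zip(theta, x))
    return sigmoid(z)


# Hypothetical encoding with l = 3 and d = 4.
phenotype = [0.45, 1.0, 0.0]      # scaled age, sex indicator, condition flag
symptoms = [1.0, 0.0, 0.38, 1.0]  # cough, sore throat, scaled temp, smell loss

uninformed = risk([0.0] * 8, phenotype, symptoms)  # all-zero theta
weighted = risk([0.5, 0.0, 1.0, 0.8, 0.3, 1.5, 2.0, -2.0], phenotype, symptoms)
```

With an all-zero $\theta$ the model is uninformative and returns a risk of one half for every individual, which is one way to see why the choice of $\theta$ matters when data is scarce.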

## $\mathcal{R}(\theta)$ in the low-data regime:

In the low-data regime, learning is unstable and there are no convergence guarantees for $\theta$. However, we still need a reliable definition of $\mathcal{R}(\theta)$. In such a regime, $\mathcal{R}(\theta)$ may either be hard-coded by a team of experts or replaced by a function $\hat{\mathcal{R}}(\theta)$ that is pre-trained on data with similar properties.

I must add that in this regime we don’t do any learning, only function evaluations, or what some in the machine learning community would call inference.
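
To make the "function evaluations only" point concrete, here is a sketch of a hand-coded score used to rank individuals for a fixed number of tests. The point values in `POINTS` are invented for illustration and are not clinically validated; `expert_risk` and `allocate_tests` are hypothetical names.

```python
# Hypothetical expert-assigned points per symptom (illustrative only).
POINTS = {"cough": 2, "sore_throat": 1, "fever": 3, "loss_of_smell": 4}
MAX_POINTS = sum(POINTS.values())


def expert_risk(symptoms):
    """Hand-coded risk in [0, 1]: fraction of the maximum points reported."""
    score = sum(pts for name, pts in POINTS.items() if symptoms.get(name))
    return score / MAX_POINTS


def allocate_tests(reports, num_tests):
    """Return the num_tests node ids with the highest risk.

    Ties are broken by node id so the ranking is deterministic.
    """
    ranked = sorted(reports, key=lambda v: (-expert_risk(reports[v]), v))
    return ranked[:num_tests]


reports = {
    "a": {"cough": True},
    "b": {"fever": True, "loss_of_smell": True},
    "c": {},
    "d": {"sore_throat": True, "cough": True},
}
selected = allocate_tests(reports, num_tests=2)
```

Nothing here is fitted to data: the allocation depends only on evaluating a fixed function at each node and sorting the results.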

## Discussion:

In the large-data regime, we can use some form of privacy-preserving machine learning. However, I think it makes sense to first focus on the low-data regime problem. I suspect that the authors of  might have a reasonable candidate for $\hat{\mathcal{R}}(\theta)$.

Finally, I’d like to add that if $\hat{\mathcal{R}}(\theta)$ is good enough, then not only is machine learning unnecessary, but $\hat{\mathcal{R}}(\theta)$ may also be used as a proxy measure for test outcomes.

1. Alexander A. Alemi, Matthew Bierbaum, Christopher R. Myers, James P. Sethna. You Can Run, You Can Hide: The Epidemiology and Statistical Mechanics of Zombies. arXiv. 2015.

2. Jussi Taipale, Paul Romer, Sten Linnarsson. Population-scale testing can suppress the spread of COVID-19. medRxiv. 2020.

3. Hagai Rossman, Ayya Keshet, Smadar Shilo, Amir Gavrieli, Tal Bauman, Ori Cohen, Esti Shelly, Ran Balicer, Benjamin Geiger, Yuval Dor, Eran Segal. A framework for identifying regional outbreak and spread of COVID-19 from one-minute population-wide surveys. Nature Medicine. 2020.