To understand the potential effectiveness of a peer-to-peer not-a-tracing (NAT) system against an unknown pathogen, we may consider how a data-driven approach to defining an infection risk function at each node of a graph allows us to allocate limited testing capacity in a rational manner. The key challenges shall become clear once we have analysed the structure of our problem.

At first glance, the main challenge is finding a reliable definition of when we are in the low-data regime.

Graph structure and state space:

We shall assume that our graph is a small-world network sampled from an epidemiological state space:

a. Each node represents an individual $v$ whose set of neighbours (i.e. physical contacts) may be represented by $N(v)$.

b. We shall assume that at any instant a particular node is either susceptible or infected, so for a graph with $n$ nodes there are $2^n$ possible states.
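The graph and state-space assumptions above can be sketched in a few lines of Python. The toy adjacency list and the `neighbours` helper are illustrative stand-ins for a sampled small-world contact network, not part of any specified implementation:

```python
from itertools import product

# Toy contact graph: adjacency list for n = 4 individuals,
# a stand-in for a sampled small-world network.
graph = {
    0: [1, 3],
    1: [0, 2],
    2: [1, 3],
    3: [2, 0],
}

def neighbours(v):
    """N(v): the set of physical contacts of node v."""
    return set(graph[v])

# Each node is either susceptible ('S') or infected ('I'),
# so the full state space has 2**n configurations.
n = len(graph)
states = list(product("SI", repeat=n))
assert len(states) == 2 ** n  # 16 states for n = 4
```

The exponential size of the state space is precisely why we reason about per-node risk rather than the full joint configuration.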

Data collection:

We shall assume that individuals in this social network use a NAT app where:

a. They log clinically relevant phenotypic data (a.k.a. the phenotypic space): age, biological sex, pre-existing medical conditions

b. Symptoms (a.k.a. the symptom space): cough, sore throat, temperature, sense of smell

We shall assume that symptoms are logged on a daily basis.
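A minimal sketch of the two record types an NAT app might store, assuming the fields listed above; the class and field names are hypothetical, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PhenotypicRecord:
    """Logged once: clinically relevant phenotypic data."""
    age: int
    biological_sex: str
    preexisting_conditions: list

@dataclass
class SymptomLog:
    """Logged daily: the four symptom-space variables."""
    day: date
    cough: bool
    sore_throat: bool
    temperature: float    # degrees Celsius
    sense_of_smell: bool  # True if unimpaired

user = PhenotypicRecord(age=34, biological_sex="F",
                        preexisting_conditions=["asthma"])
today = SymptomLog(day=date(2020, 5, 1), cough=True,
                   sore_throat=False, temperature=37.9,
                   sense_of_smell=False)
```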

Modelling infection risk using machine learning:

Given that test kits are limited, we need a method for prioritising test allocation. This may be done using a model of infection risk. Mathematically, in order to determine whether a particular individual should take a test, we use a parametric risk function:

\begin{equation} \mathcal{R}(\theta): \mathbb{R}^l \times \mathbb{R}^{d} \rightarrow [0,1] \end{equation}

where $l$ is the dimension of the symptom space and $d$ is the dimension of the phenotypic space. Furthermore, for each vertex $v$ there is a feature map $\phi$ such that $\phi(v) \in \mathbb{R}^l \times \mathbb{R}^{d}$.
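One concrete instance of such a parametric risk function is a logistic model over the concatenated symptom and phenotypic features. This is only a sketch of the mathematical shape $\mathcal{R}(\theta): \mathbb{R}^l \times \mathbb{R}^{d} \rightarrow [0,1]$; the weights and feature encodings are placeholders:

```python
import math

l, d = 4, 3  # symptom-space and phenotypic-space dimensions

def feature_map(symptoms, phenotype):
    """phi(v): concatenate the symptom and phenotypic vectors."""
    assert len(symptoms) == l and len(phenotype) == d
    return list(symptoms) + list(phenotype)

def risk(theta, features):
    """R(theta): a logistic model mapping R^l x R^d into [0, 1]."""
    z = sum(t * x for t, x in zip(theta, features))
    return 1.0 / (1.0 + math.exp(-z))

theta = [0.8, 0.5, 0.9, 0.7, 0.02, 0.3, 0.4]  # l + d illustrative weights
x = feature_map([1, 0, 1, 1], [0.34, 1, 1])   # encoded symptoms + phenotype
r = risk(theta, x)
assert 0.0 <= r <= 1.0
```

The logistic form guarantees the codomain $[0,1]$ regardless of $\theta$, which is the property the definition above demands.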

$\mathcal{R}(\theta)$ in the low-data regime:

In the low-data regime, learning is unstable and there are no convergence guarantees for $\mathcal{R}(\theta)$. However, we still need a reliable definition of $\mathcal{R}(\theta)$. In such a regime it may be possible to have $\mathcal{R}(\theta)$ either hard-coded by a team of experts, or we may use a function that is pre-trained on data with similar properties.

I must add that in this regime we don't do any learning, just function evaluations, or what some in the machine learning community would call inference.
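A hard-coded expert variant of the risk function might look as follows. The thresholds and weights here are purely illustrative assumptions, not clinically validated rules; the point is only that evaluation requires no training data:

```python
def expert_risk(symptoms, phenotype):
    """A hard-coded R for the low-data regime: no learning, just a
    function evaluation of expert-chosen (here, made-up) thresholds."""
    cough, sore_throat, temperature, anosmia = symptoms
    age, _sex, n_conditions = phenotype
    score = 0.0
    score += 0.3 if cough else 0.0
    score += 0.1 if sore_throat else 0.0
    score += 0.3 if temperature >= 38.0 else 0.0  # fever threshold
    score += 0.2 if anosmia else 0.0              # loss of smell
    score += 0.1 if age >= 65 or n_conditions > 0 else 0.0
    return min(score, 1.0)  # clamp into [0, 1]

# An elderly individual with cough, fever, and anosmia scores high.
high = expert_risk((True, False, 38.5, True), (70, "M", 1))
low = expert_risk((False, False, 36.5, False), (30, "F", 0))
```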


In the large-data regime, we can use some form of privacy-preserving machine learning. However, I think it makes sense to first focus on the low-data regime problem. I suspect that the authors of [3] might have a reasonable candidate for $\mathcal{R}(\theta)$.

Finally, I’d like to add that if $\mathcal{R}(\theta)$ is good enough, not only is machine learning unnecessary, but $\mathcal{R}(\theta)$ may also serve as a proxy measure for test outcomes.
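The prioritisation step itself is simple once risk scores exist: rank nodes by risk and spend the limited kits on the top of the ranking. A minimal sketch, assuming precomputed scores:

```python
def allocate_tests(risk_scores, kits):
    """Allocate the limited test kits to the highest-risk nodes.

    risk_scores: dict mapping node id -> risk in [0, 1]
    kits: number of available test kits
    """
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    return set(ranked[:kits])

scores = {"a": 0.9, "b": 0.2, "c": 0.6, "d": 0.4}
assert allocate_tests(scores, 2) == {"a", "c"}
```

Any risk function with a calibrated $[0,1]$ output plugs into this step unchanged, which is why the low-data and large-data regimes differ only in how the scores are produced.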


  1. Alexander A. Alemi, Matthew Bierbaum, Christopher R. Myers, James P. Sethna. You Can Run, You Can Hide: The Epidemiology and Statistical Mechanics of Zombies. arXiv, 2015.

  2. Jussi Taipale, Paul Romer, Sten Linnarsson. Population-scale testing can suppress the spread of COVID-19. medRxiv, 2020.

  3. Hagai Rossman, Ayya Keshet, Smadar Shilo, Amir Gavrieli, Tal Bauman, Ori Cohen, Esti Shelly, Ran Balicer, Benjamin Geiger, Yuval Dor, Eran Segal. A framework for identifying regional outbreak and spread of COVID-19 from one-minute population-wide surveys. Nature, 2020.