Introduction
Suppose we have a population of objects with different lifespans, created at different times. Given a sample taken at a specific point in time, we should expect “most” of the objects we see to have longer-than-typical lifespans. Let’s briefly look into this phenomenon.
All these results are well known in the literature, under “renewal theory”, “Palm theory”, and “length-biased sampling”.
Lifespans
Assume objects are created over time with a constant-rate process.
Let \(A_{\varepsilon}\) be the event “the object is alive anywhere in \([t_0, t_0 + \varepsilon]\)”, where \(t_0\) is the sampling time, and call the “lifespan” of an object \(L\).1
We are interested in the following distribution:
\[ p_{L|A_{\varepsilon}}(\ell) \]
By Bayes’ rule,
\[ p_{L|A_{\varepsilon}}(\ell) = \frac{P(A_{\varepsilon} \mid L = \ell)\, p_{L}(\ell)}{P(A_{\varepsilon})} \]
If the object’s lifespan is the interval \([S, S + L]\), then
\[ P(A_{\varepsilon} | L = \ell) = P(t_0 - \ell \leq S \leq t_0 + \varepsilon) \]
Here \(S\) is the object’s birth time. By the constant-rate creation assumption, \(S\) has (roughly) constant density, so this probability is proportional to the length of the interval \([t_0 - \ell,\, t_0 + \varepsilon]\). Then
\[ P(A_{\varepsilon} | L = \ell) \propto \ell + \varepsilon \]
Here \(p_L(\ell)\) denotes the marginal density of \(L\). We know:
\[ P(A_{\varepsilon}) = \int P(A_{\varepsilon} | L = \ell)p_L(\ell)d\ell \]
so
\[ P(A_{\varepsilon}) = \int c(\ell + \varepsilon)p_{L}(\ell)d\ell \]
for some constant \(c\), and splitting this up we get
\[ P(A_{\varepsilon}) = c(\mathbb{E}[L] + \varepsilon) \]
Plugging this in, we get
\[ p_{L|A_{\varepsilon}}(\ell) = \frac{(\ell + \varepsilon)\, p_L(\ell)}{\mathbb{E}[L] + \varepsilon} \]
Taking \(\varepsilon \to 0\), we have
\[ p_{L|A}(\ell) = \frac{\ell\, p_L(\ell)}{\mathbb{E}[L]} \]
This is ultimately dependent on a choice of distribution over \(L\), and the constant birth rate.
This also implies that
\[ \mathbb{E}[L | A] = \frac{\mathbb{E}[L^2]}{\mathbb{E}[L]} \]
(which is strictly larger than \(\mathbb{E}[L]\) for non-degenerate \(L\)).
Also, since \(\text{Var}(L) = \mathbb{E}[L^2] - \mathbb{E}[L]^2\),
\[ \mathbb{E}[L | A] - \mathbb{E}[L] = \frac{\text{Var}(L)}{\mathbb{E}[L]} \]
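As a quick sanity check (not part of the derivation), here is a minimal simulation sketch: birth times are drawn uniformly on a long window to mimic the constant-rate creation process, and the \(\text{Uniform}(0, 1)\) lifespans are an arbitrary example choice, for which \(\mathbb{E}[L] = 1/2\), \(\mathbb{E}[L^2] = 1/3\), and hence \(\mathbb{E}[L \mid A] = 2/3\).

```python
# Monte Carlo sketch (assumed setup, not from the text): objects are born
# uniformly on [0, T] and we look at those alive at t0 = T/2.
import numpy as np

rng = np.random.default_rng(0)

n, T = 2_000_000, 100.0   # number of objects, length of the creation window
t0 = T / 2                # observation time, far from the window edges

births = rng.uniform(0.0, T, size=n)        # constant-rate creation
lifespans = rng.uniform(0.0, 1.0, size=n)   # example: L ~ Uniform(0, 1)

alive = (births <= t0) & (t0 <= births + lifespans)   # the event A (epsilon = 0)

print("E[L]        :", lifespans.mean())                       # ~1/2
print("E[L | A]    :", lifespans[alive].mean())                # ~E[L^2]/E[L] = 2/3
print("Var(L)/E[L] :", lifespans.var() / lifespans.mean())     # ~1/6, the predicted bias
```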
Exponential
Let’s try exponential lifespans (which correspond to “random death”, i.e. a constant hazard rate):
\[ L \sim \text{Exp}(\lambda) \] \[ p_{L}(\ell) = \lambda e^{-\lambda \ell} \]
Then we have
\[ \mathbb{E}[L] = \frac{1}{\lambda} \]
The conditional density is thus
\[ p_{L | A}(\ell) = \lambda^2 \ell e^{-\lambda \ell} \]
This is a \(\text{Gamma}(2, \lambda)\) distribution.
Even though exponential lifetimes are “memoryless”, the population is not. Memorylessness is not preserved under selection-by-survival.
Also: if the variation in the population is large, the bias can be large.
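A sketch of the same kind of simulation with exponential lifespans (the parameter values below are arbitrary): the lifespans of the objects found alive should have mean \(2/\lambda\) and standard deviation \(\sqrt{2}/\lambda\), matching \(\text{Gamma}(2, \lambda)\) rather than \(\text{Exp}(\lambda)\).

```python
# Sketch: sample objects alive at a fixed time when lifespans are Exp(lambda);
# their lifespans should look Gamma(2, lambda). Setup mirrors the sketch above.
import numpy as np

rng = np.random.default_rng(1)
lam, n, T = 2.0, 2_000_000, 100.0
t0 = T / 2

births = rng.uniform(0.0, T, size=n)
lifespans = rng.exponential(scale=1.0 / lam, size=n)
observed = lifespans[(births <= t0) & (t0 <= births + lifespans)]

print("observed mean:", observed.mean(), "vs 2/lambda =", 2 / lam)
print("observed std :", observed.std(), "vs sqrt(2)/lambda =", np.sqrt(2) / lam)
```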
Gamma
Let’s do the gamma distribution.
\[ L \sim \text{Gamma}(k, \lambda) \]
\[ p_L(\ell) = \frac{\lambda^k}{\Gamma(k)}\ell^{k-1}e^{-\lambda \ell} \]
\[ \mathbb{E}[L] = \frac{k}{\lambda} \]
By the length-bias formula:
\[ p_{L | A}(\ell) = \frac{\lambda^{k + 1}}{\Gamma(k + 1)}\ell^ke^{-\lambda \ell} \]
which is also a gamma distribution:
\[ L \mid A \sim \text{Gamma}(k + 1, \lambda) \]
with expected value \(\mathbb{E}[L | A] = \frac{k + 1}{\lambda} = \mathbb{E}[L] + \frac{1}{\lambda}\).
The Gamma shape parameter measures how many “chances to die” have already been survived. Observing an object at a random time guarantees at least one “survival”. So we increase the shape by one.
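A numeric spot check of the density identity (the parameter values below are arbitrary): multiplying the \(\text{Gamma}(k, \lambda)\) density by \(\ell / \mathbb{E}[L]\) should reproduce the \(\text{Gamma}(k + 1, \lambda)\) density pointwise.

```python
# Check that ell * Gamma(k, lambda).pdf(ell) / (k / lambda) equals Gamma(k+1, lambda).pdf(ell).
import numpy as np
from scipy.stats import gamma

k, lam = 2.5, 1.7                    # arbitrary example parameters
ell = np.linspace(0.01, 20.0, 500)

biased = ell * gamma.pdf(ell, a=k, scale=1.0 / lam) / (k / lam)
target = gamma.pdf(ell, a=k + 1, scale=1.0 / lam)

print(np.max(np.abs(biased - target)))   # ~0, up to floating-point error
```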
Log Uniform
This example is just to show how strong the effect can be.
Let’s say \(L\) is log-uniform over 10 orders of magnitude, say on \([a, b] = [1, 10^{10}]\):
\[ p_L(\ell) = \frac{1}{\ell \ln(b/a)}, \qquad \ell \in [a, b] \]
Multiplying by \(\ell\) gives a constant, so the length-biased distribution is uniform on \([a, b]\)!
Let’s look at the top decade \([10^9, 10^{10}]\), which is also the top decile of the original distribution:
\[ P(L \in [10^9, 10^{10}]) = 0.1 \] \[ P(L \in [10^9, 10^{10}] \mid A) = \frac{10^{10} - 10^9}{10^{10} - 1} \approx 0.9 \]
So now most of the mass lives in the top decile!
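For completeness, the two numbers computed directly (with \(a = 1\), \(b = 10^{10}\) as above):

```python
# Top-decade mass before and after length-biasing, for the log-uniform example.
import numpy as np

a, b = 1.0, 1e10
lo, hi = 1e9, 1e10

print(np.log(hi / lo) / np.log(b / a))   # original (log-uniform) mass: 0.1
print((hi - lo) / (b - a))               # length-biased (uniform) mass: ~0.9
```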
Population Traits
Let’s now connect the lifespan to a set of “traits”, so \(L = f(\theta_1, \theta_2, \ldots, \theta_n)\). To simplify further, assume \(f(\theta) = a + \sum_i b_i \theta_i = a + b^{\top}\theta\), i.e. an affine model in the traits.
What happens to the traits in our sample? We should expect trait values that push the lifespan up (positive \(b_i\)) to be overrepresented in the sample, and those that push it down to be underrepresented.
In fact
\[ \mathbb{E}[L | A] - \mathbb{E}[L] = \frac{\text{Var}(L)}{\mathbb{E}[L]} = \frac{\text{Var}(a + b^{\top}\theta)}{a + b^{\top}\mu_{\theta}} = \frac{b^{\top}\text{Var}(\theta)b}{a + b^{\top}\mu_{\theta}} \]
where \(\mu_{\theta}\) is \(\mathbb{E}[\theta]\). Since \(\theta\) is a vector, \(\text{Var}(\theta)\) is actually a matrix: the covariance of \(\theta\) with itself.
This expression shows that the component of variability aligned with \(b\) is what drives the sample bias. If all the variability is “orthogonal” to \(b\), there will be little selection bias, but if the variability is “in the direction of” \(b\), there will be substantial selection bias. This is interesting as it grants us a “direction” based purely on the persistence of objects, which we can tie to geometry.
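Here is a simulation sketch of the affine trait model; the intercept, weights, and trait covariance are made-up example values (nothing here is pinned down by the setup above). It checks the bias formula and prints the shift in the observed trait means, which under this model works out to \(\text{Cov}(\theta)\,b / \mathbb{E}[L]\).

```python
# Simulation sketch of L = a + b^T theta with Gaussian traits (example values only).
import numpy as np

rng = np.random.default_rng(2)

n, T = 2_000_000, 200.0
t0 = T / 2
a = 1.0                                  # intercept (example value)
b = np.array([0.8, -0.3])                # trait weights (example values)
mu = np.array([1.0, 1.0])                # trait means (example values)
Sigma = np.diag([0.20, 0.10])            # trait covariance (example values)

theta = rng.multivariate_normal(mu, Sigma, size=n)
lifespans = np.clip(a + theta @ b, 0.0, None)   # clip: the affine model can go slightly negative
births = rng.uniform(0.0, T, size=n)
alive = (births <= t0) & (t0 <= births + lifespans)

predicted = b @ Sigma @ b / (a + b @ mu)        # b^T Var(theta) b / (a + b^T mu)
print("observed  E[L|A] - E[L]:", lifespans[alive].mean() - lifespans.mean())
print("predicted Var(L)/E[L]  :", predicted)
print("trait shift E[theta|A] - E[theta]:", theta[alive].mean(axis=0) - theta.mean(axis=0))
print("Cov(theta) b / E[L]              :", Sigma @ b / lifespans.mean())
```

In this example the first trait (positive weight) is over-represented among the observed objects and the second (negative weight) is under-represented, matching the intuition above.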
Conclusion
In the typical statistical story, we are interested in information about the population, and we observe a sample obtained through a random process in order to infer that information. I’m interested in two related statistical questions:
1. We know the population and the sampling process, and we are interested in the properties of the sample (this example).
2. We know the population and the sample, and we are interested in what process was used to obtain the sample.
This example is interesting because we managed to derive a “direction” purely from conditioning on persistence.
Footnotes
1. The \(\varepsilon\) window avoids any issues with measure zero that I’m too lazy to think through.
