This week’s Riddler Classic is a simple-sounding problem about statistics. But it’s not so simple! Here is a paraphrased version of the problem:
There are $n$ tanks, labeled $\{1,2,\dots,n\}$. We are presented with a sample of $k$ tanks, chosen uniformly at random, and we observe that the smallest label among the sample is $22$ and the largest label is $114$. Both $n$ and $k$ are unknown. What is our best estimate of $n$?
Here is a short introduction on parameter estimation, for the curious:
[Show Solution]
This problem falls under the purview of parametric statistics, and it’s worth doing a short introduction for those of you not familiar, because this comes up a lot in statistics. The basic setup is that there is a “population” of values that follows some assumed probability distribution that depends on a set of fixed parameters. The goal is to estimate the parameters of this distribution by observing a random “sample” from the population.
For example, our population could be the heights of all people living in the US. Let’s say this population follows a normal distribution, but we don’t know the mean or variance parameters of the distribution. We’d like to estimate the mean, because we want to know the average height of all people in the US. It’s impractical to measure everybody in the US, so instead we survey 100 people at random and obtain their heights. We might then compute the mean of our sample and use this as an estimate for the mean of the population. This is an example of an estimator: a function that maps samples to a parameter estimate.
What are important properties of an estimator? What makes a “good” estimator? If we fix the size of the sample, but take lots of different random samples, we will obtain a set of sample means. They might all be different due to the variability in our samples. Two important concepts are:
- The bias of an estimator: this is the difference between the true population mean and the mean of all our sample means. Ideally, we want bias to be as small as possible. For our example, the sample mean estimator has zero bias, so we say that it is unbiased.
- The variance of an estimator: this is the sample variance of our set of sample means. Roughly, this is how much variability we can expect our parameter estimates to have. Clearly, we would also like the variance to be as small as possible.
The notions of bias and variance are analogous to the notions of accuracy and precision. Would you rather be accurate or precise? It turns out it’s generally impossible to have both. In parametric statistics, this is referred to as the bias-variance trade-off. We can’t make both bias and variance zero. We can only make one small at the expense of the other. Continuing our average height example, we could have used a much simpler estimator: always estimate six feet as the average height! This estimator has no variance at all because it always estimates the same thing, but it is surely biased, because the true population mean likely isn’t six feet.
Although this was a rather simple example, I hope it illustrates that it doesn’t really make sense to talk about the “best estimator” unless we are specific about our notion of “best”. Sometimes, variance can be reduced dramatically and it only costs us a little bit of bias. This is one of the main benefits of using regularization in machine learning. It’s always a trade-off, and the notion of “best” depends on the context and the application.
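To make the trade-off concrete, here is a small numerical illustration in Python (entirely made-up numbers: I take the true mean height to be 66 inches and compare the sample-mean estimator against the “always six feet” estimator over many repeated samples):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up population: heights in inches, normal with mean 66 and std 4.
true_mean = 66.0
n_trials, sample_size = 10_000, 100

# Estimator 1: the mean of a random sample of 100 heights.
sample_means = rng.normal(true_mean, 4.0, size=(n_trials, sample_size)).mean(axis=1)

# Estimator 2: always guess six feet (72 inches), ignoring the data entirely.
constant_estimates = np.full(n_trials, 72.0)

for name, est in [("sample mean", sample_means), ("always 72 in", constant_estimates)]:
    print(f"{name:>12}: bias = {est.mean() - true_mean:+.3f}, variance = {est.var():.4f}")
```

The sample mean has (essentially) zero bias but nonzero variance, while the constant estimator has zero variance but a large bias.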
With that out of the way, let’s get back to counting tanks!
Here is my solution to the problem:
[Show Solution]
The parameters of our distribution are $n$ (the number of tanks) and $k$ (the size of the sample). The random variables we get to observe are the minimum and maximum labels, which I’ll call $x$ and $y$, respectively. The probability mass function is
\[
P_{n,k}(x,y) = \frac{\binom{y-x-1}{k-2}}{\binom{n}{k}}
\]because there are $\binom{y-x-1}{k-2}$ ways of picking a set of $k$ distinct integers whose minimum and maximum values are $x$ and $y$ respectively, and there are $\binom{n}{k}$ total ways of picking $k$ distinct integers out of a set of $n$. We can verify that this is a legitimate probability mass function by checking that
\[
\sum_{x=1}^{n-k+1} \sum_{y=x+k-1}^{n} P_{n,k}(x,y) = 1
\]The reason for the strange bounds in the sums is to account for the fact that not all minimum and maximum values are admissible. For example, if we pick $k=3$ different numbers among $\{1,2,3,4,5\}$, the only possible values for the minimum $x$ are $\{1,2,3\}$.
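As a quick sanity check (this snippet is mine, not part of the original derivation), the sum can be verified numerically for a few values of $n$ and $k$:

```python
from math import comb

def pmf(x, y, n, k):
    """Probability that a size-k sample from {1,...,n} has minimum x and maximum y."""
    return comb(y - x - 1, k - 2) / comb(n, k)

# Sum over all admissible (x, y) pairs for a few choices of (n, k).
for n, k in [(5, 3), (20, 4), (100, 15)]:
    total = sum(pmf(x, y, n, k)
                for x in range(1, n - k + 2)
                for y in range(x + k - 1, n + 1))
    print(f"n={n}, k={k}: total probability = {total:.12f}")
```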
For this problem, our “sample” is the single pair $(x,y)$ that we observe, and our task is to estimate $n$. Unlike our population height example from above, it’s no longer clear what sort of estimator we should use here. I’ll present some options.
Maximum likelihood estimator
One possible way to estimate $n$ is to use the maximum likelihood estimator (MLE). The likelihood is what you get when you fix $(x,y)$ to be what you observed and view $P_{n,k}(x,y)$ as a function of $(n,k)$. It tells you which parameter values were more likely to have produced the data you observed. For any fixed $k,x,y$, it’s clear that we can maximize $P_{n,k}(x,y)$ by making $n$ as small as possible, and the smallest admissible value is $n=y$, so the MLE is $\hat n = y$. What can we say about bias and variance for the MLE? We can directly compute these quantities using the $P_{n,k}(x,y)$ above:
\[
\text{MLE:}\,(\hat n = y)\, \left\{
\begin{aligned}
\text{Bias} &= -\frac{n-k}{k+1} \\
\text{Variance} &= \frac{k (n+1) (n-k)}{(k+1)^2 (k+2)}
\end{aligned}
\right.
\]As we can see, the estimator is biased. This makes sense: our estimate is the largest label we saw, which can never exceed (and will typically fall short of) the true value of $n$.
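These expressions can be verified numerically by summing directly over the pmf. Here is a short Python check (my own, for $n=100$ and $k=15$) comparing the exact mean and variance of $\hat n = y$ against the closed-form formulas:

```python
from math import comb

def pmf(x, y, n, k):
    # P_{n,k}(x, y) from above.
    return comb(y - x - 1, k - 2) / comb(n, k)

n, k = 100, 15
pairs = [(x, y) for x in range(1, n - k + 2) for y in range(x + k - 1, n + 1)]

# Exact first and second moments of the MLE (n_hat = y) under the pmf.
Ey  = sum(y * pmf(x, y, n, k) for x, y in pairs)
Ey2 = sum(y**2 * pmf(x, y, n, k) for x, y in pairs)

print("bias:    ", Ey - n,      "vs", -(n - k) / (k + 1))
print("variance:", Ey2 - Ey**2, "vs", k * (n + 1) * (n - k) / ((k + 1)**2 * (k + 2)))
```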
Minimum-variance unbiased estimator
Another way to estimate $n$ is to restrict our attention to unbiased estimators. Since we only get to observe $x$ and $y$, a reasonable thing to try is to compute the expectations $\mathbb{E} x$ and $\mathbb{E} y$. Doing so, we obtain:
\[
\mathbb{E} x = \frac{n+1}{k+1},\qquad
\mathbb{E} y = \frac{k(n+1)}{k+1}
\]We can eliminate $k$ by adding these together! Indeed, $\mathbb{E} x + \mathbb{E} y = \frac{(1+k)(n+1)}{k+1} = n+1$, so we find that:
\[
n = \mathbb{E}( x+y-1 )
\]This tells us that if we use $\hat n = x+y-1$ as our estimator, it will be unbiased! We can compute bias and variance as before, and we obtain:
\[
\text{MVUE:}\,(\hat n = x+y-1)\, \left\{
\begin{aligned}
\text{Bias} &= 0 \\
\text{Variance} &= \frac{2 (n+1) (n-k)}{(k+1) (k+2)}
\end{aligned}
\right.
\]Using some more advanced statistics, it’s actually possible to prove that among all unbiased estimators, this is the one with the smallest variance. Not bad! But note that this estimator still has about twice the variance of the MLE (the ratio is exactly $\tfrac{2(k+1)}{k}$). Another manifestation of the bias-variance trade-off.
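The same kind of exact check works for the MVUE (again, my own snippet, using the same pmf as before):

```python
from math import comb

def pmf(x, y, n, k):
    return comb(y - x - 1, k - 2) / comb(n, k)

n, k = 100, 15
pairs = [(x, y) for x in range(1, n - k + 2) for y in range(x + k - 1, n + 1)]

# Exact first and second moments of the MVUE (n_hat = x + y - 1) under the pmf.
Em  = sum((x + y - 1) * pmf(x, y, n, k) for x, y in pairs)
Em2 = sum((x + y - 1)**2 * pmf(x, y, n, k) for x, y in pairs)

print("bias:    ", Em - n)   # exactly zero (up to floating-point error)
print("variance:", Em2 - Em**2, "vs", 2 * (n + 1) * (n - k) / ((k + 1) * (k + 2)))
```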
Solution
The notion of “best estimate” is an ambiguous one, so we have to be clear about what we mean by “best”. The maximum-likelihood estimator (MLE) is to estimate $\hat n = y$, the largest label observed. For the example in the problem, we would pick $\hat n = 114$. Although this is the likeliest $n$ given the data, it is a biased estimate: it can never over-estimate the true $n$.
Another option is to look for an estimator that is unbiased. That is, if we had a large number of independent samples of groups of tanks and we averaged the estimates across groups, we would on average obtain the true $n$. The minimum-variance unbiased estimator (MVUE) is to pick $\hat n = x+y-1$ (the sum of the largest and smallest labels observed, minus one). For the example in the problem, this yields $\hat n = 135$. While the MVUE is unbiased, it has about twice the variance of the MLE, which means its estimates are more variable.
This higher variability is unavoidable; if we want to have an unbiased estimator, we must be willing to give up some variance! This is analogous to the notion of precision vs accuracy. The MVUE above is actually the estimator with the smallest variance among all unbiased estimators, so if you’re looking for an unbiased estimator, this is the “best” one!
Simulation
To demonstrate that the MVUE estimator works as intended, I created the following simulation: I set $n=100$ and $k=15$, and then picked $k$ tanks at random among the $n$, recorded the minimum and maximum labels $x$ and $y$, and computed the MVUE estimator for $n$, which is $\hat n = x+y-1$. I repeated the process $10^6$ times, and recorded all $\hat n$ values. Based on the calculations above, the mean and standard deviation of this set of $\hat n$ estimates should be:
\[
\mathbb{E}(\hat n) = n = 100,\quad\text{and}\quad
\sqrt{\mathrm{Var}(\hat n)} = \sqrt{\tfrac{2(n+1)(n-k)}{(k+1)(k+2)}} \approx 7.9451
\]The values I obtained were $100.0015$ and $7.9279$, respectively. Looks like a match! Here is a histogram of the results of the simulation:
So it looks like the MVUE estimator performs as expected. Note that even if $k$ is completely unknown to us or if every random sample of tanks we observe has a different size $k$, the mean will still be $n$. Changing $k$ only affects our confidence in our estimates. A larger $k$ means we can be more confident (lower variance).
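For anyone who wants to reproduce this experiment, here is a minimal Python sketch of the procedure just described (not necessarily the exact code I ran):

```python
import random
import statistics

random.seed(2020)
n, k, trials = 100, 15, 10**6

estimates = []
for _ in range(trials):
    tanks = random.sample(range(1, n + 1), k)       # k distinct labels out of n
    estimates.append(min(tanks) + max(tanks) - 1)   # MVUE: x + y - 1

print(statistics.mean(estimates))    # should be close to n = 100
print(statistics.stdev(estimates))   # should be close to 7.9451
```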
If we assume a uniform normalizable distribution of N = [114, some_cutoff], is it legit to integrate P(22,114,k,N)*N over N = [114, some_cutoff] and k = [2,93] to get an expectation of N?
Yes, that works. Mathematically, we are given $P(z\mid\theta)$, where $z=(x,y)$ is our data (the min and max tank numbers observed), and $\theta=(n,k)$ are the unknown parameters. This is the distribution of the min and max values assuming we know the parameters. If we imagine that our data $z$ is fixed, then $P(z\mid\theta)$ (viewed as a function of $\theta$) is called the “likelihood”.
The maximum likelihood estimator (MLE) simply maximizes the likelihood, i.e. it picks the value of the parameter $\theta$ that maximizes $P(z\mid\theta)$.
If you have a “prior” distribution on $\theta$, e.g. if you know that $n$ is uniformly distributed on [114,some_cutoff], as you suggested, then we can call this $P(\theta)$. Then, we can compute the “posterior distribution”, which is the distribution of your parameters conditioned on your data. Using Bayes’ theorem, we’re saying that the posterior satisfies:
\[
P(\theta\mid z) \propto P(z\mid\theta)P(\theta)
\]The maximum a posteriori estimator (MAP) is to choose the value of $\theta$ that maximizes this posterior distribution.
What you’re suggesting, which is to take the expectation of the posterior distribution (instead of finding its maximum), also has a name: it’s called the minimum mean squared error (MMSE) estimator. This is a specific example of a more general class of estimators called Bayes estimators. Again, there is no wrong answer here, just different potential choices of estimator.
Note: In your formula, you’re not only assuming $n$ is uniformly distributed, but that $k$ is as well! You have to remember to normalize the probability distributions for $n$ and $k$, so that would mean dividing your integral (which is actually a sum) by $(\text{cutoff}-114)(93-2)$.
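To make this concrete, here is a rough Python sketch of the calculation you’re describing (the cutoff of 1000 is an arbitrary choice for illustration; with flat priors on $n$ and $k$, the constant prior factors cancel once you normalize by the total):

```python
from math import comb

x, y = 22, 114      # observed min and max labels
cutoff = 1000       # arbitrary upper limit for the uniform prior on n

def likelihood(n, k):
    return comb(y - x - 1, k - 2) / comb(n, k)

# Flat priors on n in [y, cutoff] and k in [2, y - x + 1]; the constant
# prior factors cancel when normalizing, so the posterior mean of n is
# a ratio of two sums over the likelihood.
num = den = 0.0
for n in range(y, cutoff + 1):
    for k in range(2, y - x + 2):
        L = likelihood(n, k)
        num += n * L
        den += L

print(num / den)    # MMSE estimate of n (posterior mean)
```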
Great post, thank you for doing this!
Interesting: this is a problem that allows for an MVUE.
Missing: the Bayesian section!
Ah, but German tanks go into combat and some percentage of them are destroyed by the Allies. Since it takes days to build a tank, not all tanks seen on any particular day have been on the battlefield for the same number of days. The earlier-built tanks thus see more days of battle and are more likely to be destroyed. For example, tanks numbered 1 through 22 are more likely to have been destroyed (which means you cannot observe their serial numbers) than tanks 125 through 135.
But all is not lost. Because of the above, the estimate of 135 is more likely too high than too low, which means the Allies would most likely have fewer German tanks to battle.
Oh, but you say these are not German tanks? It is still more likely that the earlier-built tanks are out of commission, having either broken down or being in need of routine maintenance.