The balls are numbered 1 through N. There is also a group of N cups, labeled 1 through N, each of which can hold an unlimited number of ping-pong balls. The game is played in rounds. A round is composed of two phases: throwing and pruning.
During the throwing phase, the player takes balls randomly, one at a time, from the infinite supply and tosses them at the cups. Every ball drawn has a uniformly random number, every ball lands in a uniformly random cup, and every throw lands in some cup. The throwing phase is over when every cup contains at least one ping-pong ball. Next comes the pruning phase. During this phase the player goes through all the balls in each cup and removes any ball whose number does not match the containing cup. The game is over when, after a round is completed, there are no empty cups.
How many rounds would you expect to need to play to finish this game? How many balls would you expect to need to draw and throw to finish this game?
We’ll address the second question first: How many balls would you expect to need to draw and throw to finish the game? For this question, the rounds don’t matter at all. The game will end once there is at least one correct ball in each cup, and since the pruning phase at the end of each round only removes incorrect balls, we can effectively neglect it.
Each time a ball is thrown, there is a $1/N$ chance it will land in the correct cup. This is a classical coupon collector problem (see my write-up on the card collection completion problem for more background). The solution rests on the fact that if some trial is successful with probability $p$, then we must perform $1/p$ trials on average before we obtain our first success. The probability of landing our first correct ball is $1/N$ so it will take an average of $N$ tosses to do so. The probability of landing our next new correct ball (in a different cup) is $\tfrac{1}{N}\cdot \tfrac{N-1}{N}$. This takes on average $\tfrac{N^2}{N-1}$ more tosses. Continuing in this fashion and summing up all of the intervals until each cup is filled, the total expected number of tosses is:
\[
T_\text{tot} = N^2 \sum_{k=1}^N \frac{1}{k} \approx N^2(\log N + \gamma)
\]
where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant.
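As a quick sanity check (not part of the derivation above), here is a minimal Python sketch that simulates the throwing process directly and compares the empirical average with $N^2 H_N$; for $N=3$, for instance, the formula gives $9\left(1+\tfrac12+\tfrac13\right) = 16.5$ expected throws.

```python
import math
import random

def throws_to_finish(n):
    """Throw random balls into random cups until every cup has received its matching ball."""
    has_correct = [False] * n
    remaining, throws = n, 0
    while remaining:
        throws += 1
        ball = random.randrange(n)   # number written on the ball
        cup = random.randrange(n)    # cup the ball lands in
        if ball == cup and not has_correct[cup]:
            has_correct[cup] = True
            remaining -= 1
    return throws

N, trials = 10, 20000
simulated = sum(throws_to_finish(N) for _ in range(trials)) / trials
exact = N * N * sum(1.0 / k for k in range(1, N + 1))     # N^2 * H_N
approx = N * N * (math.log(N) + 0.5772156649)             # N^2 (log N + gamma)
print(f"simulated {simulated:.1f}, exact {exact:.1f}, approx {approx:.1f}")
```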
Now let’s turn our attention to the first question: How many rounds would you expect to play to finish this game? This is a much more complicated question, so to illustrate the approach I’ll solve it for the case $N=3$. We’ll begin by breaking each round into a number of stages that track the progress of the cups as they are filled: each stage is labeled by a pair $(k, j)$, where $k$ is the number of occupied cups and $j \le k$ is the number of cups that already contain their correct ball. In the diagram below, each cup is either empty (no ball), red (holds an incorrect ball), or green (holds its correct ball). We can also figure out all the transition probabilities and write them on the arrows.
In this diagram, self-loops have whatever probability is required so that the outgoing arrows sum to $1$. This is an example of a Markov chain. We start on the left (all cups empty) and work our way to the right (all cups full). Here is what the transition matrix looks like for $N=3$ when we order the states as listed beside the matrix (each column holds the outgoing probabilities of one state):
\[
A =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
2/3 & 2/9 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/9 & 1/3 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 4/9 & 0 & 4/9 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 2/9 & 4/9 & 2/9 & 5/9 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 2/9 & 0 & 1/9 & 2/3 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 2/9 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1/9 & 2/9 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1/9 & 2/9 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1/9 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{array}{c}
(k=0,j=0)\\
(k=1,j=0)\\
(k=1,j=1)\\
(k=2,j=0)\\
(k=2,j=1)\\
(k=2,j=2)\\
(k=3,j=0)\\
(k=3,j=1)\\
(k=3,j=2)\\
(k=3,j=3)\\
\end{array}
\]
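Concretely, from stage $(k,j)$ a single throw moves to $(k+1,j+1)$ with probability $\tfrac{N-k}{N^2}$ (the ball lands in an empty cup and happens to carry that cup’s number), to $(k+1,j)$ with probability $\tfrac{(N-k)(N-1)}{N^2}$ (empty cup, wrong number), to $(k,j+1)$ with probability $\tfrac{k-j}{N^2}$ (an occupied cup that is still missing its correct ball receives it), and otherwise stays put. For anyone who wants to reproduce the matrix, here is a small Python sketch (an illustration, not the code behind the post; the name one_throw_matrix is arbitrary) that builds it for general $N$ with exact fractions:

```python
from fractions import Fraction

def one_throw_matrix(N):
    """Column-stochastic matrix for a single throw.
    States are pairs (k, j): k occupied cups, j of which hold their correct ball."""
    states = [(k, j) for k in range(N + 1) for j in range(k + 1)]
    idx = {s: i for i, s in enumerate(states)}
    size = len(states)                                # (N+1)(N+2)/2 states
    A = [[Fraction(0)] * size for _ in range(size)]   # A[destination][source]
    for (k, j), col in idx.items():
        if k == N:                                    # all cups full: throwing phase is over
            A[col][col] = Fraction(1)
            continue
        p_new_correct = Fraction(N - k, N * N)            # empty cup gets its matching ball
        p_new_wrong = Fraction((N - k) * (N - 1), N * N)  # empty cup gets a wrong ball
        p_fix = Fraction(k - j, N * N)                    # occupied-but-wrong cup gets its ball
        A[idx[(k + 1, j + 1)]][col] = p_new_correct
        A[idx[(k + 1, j)]][col] = p_new_wrong
        if k > j:
            A[idx[(k, j + 1)]][col] = p_fix
        A[col][col] = 1 - p_new_correct - p_new_wrong - p_fix
    return states, A

states, A = one_throw_matrix(3)
for row in A:
    print("  ".join(str(x) for x in row))             # reproduces the 10x10 matrix above
```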
Note that the columns sum to $1$, since all outgoing probabilities must sum to $1$. The question is: what is the probability that we’ll end up in each of the possible ending points (the last four states)? This is called the limiting distribution. In our case, we can compute it easily. First, we can eliminate the self-loops (except on the final, absorbing states) by renormalizing each column. This works because we don’t actually care how much time is spent at each stage. See below for an illustration of this normalization.
Here is what our transition matrix looks like after normalization:
\[
A_\text{norm} =
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
2/3 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/7 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 4/7 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 2/7 & 2/3 & 2/5 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1/3 & 0 & 1/4 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 2/5 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1/5 & 1/2 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1/4 & 2/3 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1/3 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{array}{c}
(k=0,j=0)\\
(k=1,j=0)\\
(k=1,j=1)\\
(k=2,j=0)\\
(k=2,j=1)\\
(k=2,j=2)\\
(k=3,j=0)\\
(k=3,j=1)\\
(k=3,j=2)\\
(k=3,j=3)\\
\end{array}
\]
Once again, the columns sum to $1$, but this time only the absorbing states have self-loops. Finding the limiting distribution simply amounts to computing the limit $\lim_{t\to\infty} A_\text{norm}^t$. Since the transient part of $A_\text{norm}$ is strictly lower triangular, it is nilpotent, and the powers of the matrix eventually become constant. In fact, every non-self-loop transition increases $k+j$ by at least one, and $k+j$ can be at most $2N$, so it suffices to evaluate $A_{\text{norm}}^{2N}$. Extracting the last $N+1$ rows (the states in which every cup is occupied) and the columns corresponding to the states $\{1,3,6,\dots,\tfrac{(N+1)(N+2)}{2}\}$ (the states $(k{=}j,\,j)$ in which every occupied cup holds its correct ball, i.e. the possible starting points of a round, for $j=0,1,\dots,N$), we obtain the transition matrix for one round of the game. Here is the result we obtain in the case $N=3$:
\[
P = \begin{bmatrix}
16/105 & 0 & 0 & 0 \\
41/105 & 1/3 & 0 & 0 \\
5/14 & 1/2 & 2/3 & 0 \\
1/10 & 1/6 & 1/3 & 1
\end{bmatrix}
\begin{array}{c}
(k=0)\\
(k=1)\\
(k=2)\\
(k=3)\\
\end{array}
\]
In other words, $P_{ij}$ is the probability that if we currently have $j$ correct balls in cups, we will have $i$ after the next round is over. From here, our next task is to compute the expected number of rounds it will take to travel from $k=0$ to $k=N$ correct balls. This can be computed as follows: if we call $E_k$ the expected number of rounds starting from $k$ correct balls, then we have:
\begin{align}
E_0 &= 1 + P_{00} E_0 + P_{10}E_1 + \cdots + P_{N0}E_N \\
E_1 &= 1 + P_{01} E_0 + P_{11}E_1 + \cdots + P_{N1}E_N \\
&\vdots \\
E_{N-1} &= 1 + P_{0,N-1} E_0 + P_{1,N-1}E_1 + \cdots + P_{N,N-1} E_N \\
E_N &= 0
\end{align}
Or, in matrix-vector notation: $E = \mathbb{1} - e_N + P^\mathsf{T} E$. This can be solved via simple matrix inversion: since $E_N = 0$ and the last equation is trivially satisfied, it suffices to invert the $N\times N$ system $(I - Q^\mathsf{T})E' = \mathbb{1}$, where $Q$ is $P$ with its last row and column removed and $E' = (E_0,\dots,E_{N-1})$. We can then extract $E_0$ from the solution, which is the expected number of rounds starting with all empty cups. As far as I can tell, there does not appear to be an easy way to extract an analytic formula, but all the steps above are very easy to compute numerically. Here are plots of the results for small values of $N$:
I also overlaid a result obtained via Monte Carlo simulation using $1000$ simulated games. This shows that the analytical results agree well with simulation. Here are further numerical results, this time extended up to $N=100$:
It appears that the number of rounds required grows approximately linearly with $N$, while the number of balls required grows roughly quadratically (up to a logarithmic factor). This latter result is to be expected, since we know the formula in this case is approximately $N^2 \log N$.
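For completeness, here is a minimal Python sketch of the full procedure described above (again a reconstruction for illustration, not the exact code used to produce the plots). It reuses the one_throw_matrix helper from the earlier sketch, renormalizes away the transient self-loops, raises the matrix to the $2N$-th power, extracts $P$, and solves the linear system for $E_0$ with exact fractions:

```python
from fractions import Fraction

def expected_rounds(N):
    """Expected number of rounds starting from all-empty cups, as an exact Fraction."""
    states, A = one_throw_matrix(N)              # helper from the earlier sketch
    idx = {s: i for i, s in enumerate(states)}
    size = len(states)

    # Eliminate self-loops on the transient states and renormalize each column.
    B = [row[:] for row in A]
    for (k, j), c in idx.items():
        if k < N:
            rest = 1 - B[c][c]
            B[c][c] = Fraction(0)
            for r in range(size):
                B[r][c] /= rest

    # B^(2N) gives the limiting distribution (each non-self transition raises k + j).
    def matmul(X, Y):
        return [[sum(X[r][m] * Y[m][c] for m in range(size)) for c in range(size)]
                for r in range(size)]
    M = B
    for _ in range(2 * N - 1):
        M = matmul(M, B)

    # One-round matrix: P[i][j] = P(end the round with i correct | start with j correct).
    P = [[M[idx[(N, i)]][idx[(j, j)]] for j in range(N + 1)] for i in range(N + 1)]

    # Solve E_j = 1 + sum_i P[i][j] E_i for j < N, with E_N = 0, by Gauss-Jordan elimination.
    n = N
    aug = [[(Fraction(1) if r == c else Fraction(0)) - P[c][r] for c in range(n)]
           + [Fraction(1)] for r in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        pivot = aug[col][col]
        aug[col] = [x / pivot for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return aug[0][n]                              # E_0

for N in range(1, 6):
    e0 = expected_rounds(N)
    print(N, e0, float(e0))
```

For larger $N$ one would switch to floating point and repeated squaring, but the exact fractions are convenient for checking the small cases against the rational values quoted in the comments below.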
Hi Laurent:
I did this a bit differently: I calculated the expected number of throws in each turn to fill the cups using $N H_n$, with $n$ remaining cups in that turn, then the expected number of cups exhausted in each turn, reducing the number of cups, and proceeded that way, forcing the total cups exhausted to be $N$ and the total throws to be $N^2 H_N$. These were my results; curious how close they are to yours:
[1.0, 2.5, 4.166666666666667, 5.916666666666667, 8.3000000000000007, 10.333333333333334, 12.357142857142858, 14.825000000000001, 16.505555555555556, 18.992857142857144]
My results were similar, but different. Here are the numerical results I found:
[1.0, 2.4, 4.17135, 6.09981, 8.12472, 10.2184, 12.3648, 14.5536, 16.7777, 19.0318].
I actually computed exact fractions as well:
\[
\{1, \tfrac{12}{5}, \tfrac{1485}{356}, \tfrac{317440}{52041}, \tfrac{7182252125}{883999816}, \tfrac{3139918735968}{307281361145}, \tfrac{2953154676073308334607}{238835679474876862230}, \tfrac{1936604564184224758145986176}{133066780411872510357957803}, \dots\}
\]
Didn’t include this in the write-up because it wasn’t particularly instructive, but it might be useful to compare actual rational answers. The first major difference is with the 2.4 vs 2.5; that would be a good place to start.
Tks – it’s not expected to be exact, but it looks pretty close……