This Riddler puzzle is a classic probability problem: how long can one expect to wait until the entire set of cards is collected?
My son recently started collecting Riddler League football cards and informed me that he planned on acquiring every card in the set. It made me wonder, naturally, how much of his allowance he would have to spend in order to achieve his goal. His favorite set of cards is Riddler Silver, a set consisting of 100 cards numbered 1 to 100. The cards are only sold in packs containing 10 random cards, without duplicates, with every card number having an equal chance of being in a pack.
Each pack can be purchased for \$1. If his allowance is \$10 a week, how long would we expect it to take before he has the entire set?
What if he decides to collect the more expansive Riddler Gold set, which has 300 different cards?
Here is my solution:
This problem is a twist on the classical Coupon Collector Problem. In the classical version, there are $n$ different coupons in a basket. For one dollar, we get to pick a coupon at random from the basket. We then return the coupon to the basket. We keep doing this until we've seen each of the $n$ different coupons at least once. How many dollars will we spend on average?
Simpler version: packs of one
Let's start by solving the standard coupon collector problem. First, we'll need the following fact: if a coin comes up Heads with probability $p$, then we have to flip it on average $1/p$ times before we obtain our first Heads. To see why, let $x$ be the expected number of flips. If the first flip is Heads (probability $p$), then we have performed 1 flip and we stop. If the first flip is Tails (probability $1-p$), then we have performed 1 flip but will need to perform $x$ more on average. Mathematically, this amounts to:
\[
x = p\cdot 1 + (1-p)\cdot (1+x)
\]Solving for $x$ yields $x=1/p$. You can also solve this using recursion, as in my solution to the Monsters’ gems puzzle.
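As a quick sanity check, here is a small simulation sketch (in Julia; the function name and the value $p=0.3$ are just for illustration) confirming that the average number of flips is $1/p$:
# Estimate the expected number of flips until the first Heads by simulation.
function mean_flips_to_heads(p; trials = 10^6)
    total = 0
    for _ in 1:trials
        flips = 1
        while rand() >= p          # Tails (probability 1-p), so flip again
            flips += 1
        end
        total += flips
    end
    return total / trials
end

println(mean_flips_to_heads(0.3))  # should be close to 1/0.3 ≈ 3.33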
We can think of the coupon collection problem by asking how many draws are required before we pick our next new coupon. Our first coupon is always new (probability 1) and will take 1 draw. The probability of picking our next new coupon is $\tfrac{n-1}{n}$ since any of the $n-1$ remaining coupons will do, and this will take on average $\tfrac{n}{n-1}$ draws. Next, our probability is $\tfrac{n-2}{n}$, which takes on average $\tfrac{n}{n-2}$ draws, and so on. Therefore, the expected number of draws $C_n$ in order to pick all $n$ coupons is:
\begin{align}
C_n
&= \frac{n}{n} + \frac{n}{n-1} + \frac{n}{n-2} + \cdots + \frac{n}{1} \\
&= n \left( 1 + \frac{1}{2} + \cdots + \frac{1}{n} \right) \\
&\approx n (\log n + \gamma)
\end{align}where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant; the relative error of the approximation vanishes as $n\to\infty$. This is a harmonic sum, and it has surfaced in several past problems (bears, camels, dwarfs).
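As a quick numerical check (a sketch in Julia; nothing here is specific to the card problem yet), we can compare the exact harmonic sum with the approximation:
# Exact expected draws n*H_n for the one-card-at-a-time collector, versus n*(log(n) + γ).
coupon_collector(n) = n * sum(1/k for k in 1:n)     # exact: n*(1 + 1/2 + ... + 1/n)
harmonic_approx(n)  = n * (log(n) + 0.5772156649)   # Euler-Mascheroni constant, hard-coded

for n in (100, 300)
    println(n, " coupons: exact = ", coupon_collector(n), ", approx = ", harmonic_approx(n))
end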
Full version: multi-packs
Let’s now assume that we draw $m$ coupons at a time from the basket, which contains $n$ distinct coupons. This is precisely the original problem (with $m=10$ and $n=100$). Let’s call $X_k$ the expected number of additional draws required to obtain all $n$ coupons given that we have already collected $k$ coupons. We could get lucky and obtain several new coupons in our draw, or we could strike out and receive all duplicates. In general, there are $\binom{k}{m-i}\binom{n-k}{i}$ ways to obtain $i$ new coupons (out of $\binom{n}{m}$ total possible draws), since we can choose $i$ out of the $n-k$ possible new coupons remaining and $m-i$ out of the $k$ already-seen coupons. Upon drawing our $i$ new coupons, we will have to make $X_{k+i}$ further draws. We therefore have the recursion:
\[
X_k = 1 + \sum_{i=0}^m \frac{\binom{k}{m-i}\binom{n-k}{i}}{\binom{n}{m}} X_{k+i}
\]Noting that $X_k$ occurs on both sides, we can simplify and obtain:
\[
X_k = \frac{1}{\binom{n}{m}-\binom{k}{m}}\left( \binom{n}{m} + \sum_{i=1}^m \binom{k}{m-i}\binom{n-k}{i} X_{k+i} \right)
\]Now each $X_k$ depends on $X_{k+1},X_{k+2},\dots$, and we have the terminal condition $X_n = 0$. So we can work backwards, first evaluating $X_{n-1}$, then $X_{n-2}$, and so on until we arrive at $X_0$, which is the final answer we seek. As a side note, the fact that the probabilities above sum to $1$ is a consequence of Vandermonde’s Identity, which I discussed in my post on double counting.
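To see the recursion in action on a tiny example (not part of the original puzzle), suppose $n=3$ cards are sold in packs of $m=2$. Starting from $X_3=0$ and working backwards, and dropping terms that multiply $X_3=0$ or that contain a vanishing binomial such as $\binom{1}{2}=\binom{0}{2}=0$:
\begin{align}
X_2 &= \frac{\binom{3}{2}}{\binom{3}{2}-\binom{2}{2}} = \tfrac{3}{2}, \\
X_1 &= \frac{\binom{3}{2} + \binom{1}{1}\binom{2}{1} X_2}{\binom{3}{2}-\binom{1}{2}} = 2, \\
X_0 &= \frac{\binom{3}{2} + \binom{0}{0}\binom{3}{2} X_2}{\binom{3}{2}-\binom{0}{2}} = \tfrac{5}{2}.
\end{align}This matches the direct argument: the first pack always yields two new cards, and each subsequent pack contains the one missing card with probability $\binom{2}{1}/\binom{3}{2} = 2/3$, so we expect $1 + \tfrac{3}{2} = \tfrac{5}{2}$ packs in total.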
It doesn't appear that there is a nice closed-form expression for $X_0$; however, it's straightforward to compute numerically, as long as care is taken to avoid integer overflow in the binomial coefficients when $n$ and $m$ get large. We can also approximate the solution decently well: the $m=1$ case (one card per pack) requires approximately $n(\log(n)+\gamma)$ packs to collect all cards, so one might expect that with $m$ cards per pack, one could collect all cards roughly $m$ times faster. This turns out to be a good approximation!
In the plot below, I computed the exact expectation and also showed the approximation $\tfrac{n}{m}(\log(n)+\gamma)$.
Numerical solutions
The original problem statement asked about the case of $m=10$ cards per pack, with either $n=100$ or $n=300$ total different cards. Here is Julia code that evaluates the solution rather quickly:
# Julia 0.6.4
# define binomial for large values of n (use arbitrary precision)
binom(n,m) = binomial(big(n),m)

function expected_draws(n,m)
    X = zeros(n+m+1)
    for k = n-1:-1:0
        q0 = binom(n,m)/(binom(n,m)-binom(k,m))
        q = [ binom(k,m-i)*binom(n-k,i)/(binom(n,m)-binom(k,m)) for i in 1:m ]
        X[k+1] = q0 + sum(q .* X[k+2:k+1+m])
    end
    X[1]
end

println("100 cards, packs of 10: Number of packs = ", expected_draws(100,10))
println("300 cards, packs of 10: Number of packs = ", expected_draws(300,10))
and the output of the above code is:
100 cards, packs of 10: Number of packs = 49.94456605666412
300 cards, packs of 10: Number of packs = 186.0851712198894
Therefore, it requires about 50 packs (5 weeks allowance) on average to collect all cards in the 100-set, and about 186 packs (18.6 weeks allowance) for the 300-card set. If we use the approximation $\tfrac{n}{m}(\log(n)+\gamma)$, we actually come pretty close to the exact answer:
100/10 * (log(100) + γ) = 51.8238585088962
300/10 * (log(300) + γ) = 188.429944186732
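As an independent sanity check (a simulation sketch, not part of the exact calculation above), brute-force simulation of the pack-buying process gives values consistent with the computation:
using Random                       # for randperm

# Simulate buying packs of m distinct cards until all n cards have been collected.
function simulate_packs(n, m; trials = 10_000)
    total = 0
    for _ in 1:trials
        seen = falses(n)
        packs = 0
        while !all(seen)
            packs += 1
            for c in randperm(n)[1:m]     # one pack: m distinct cards chosen uniformly
                seen[c] = true
            end
        end
        total += packs
    end
    return total / trials
end

println(simulate_packs(100, 10))   # should be close to 49.94
println(simulate_packs(300, 10))   # should be close to 186.1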
Laurent, did you have a chance to attempt the Riddler Express – competitive coin flipping puzzle – from last week? The solution provided yesterday said “according to the simulations of many solvers… the Red Team wins this game about 59 percent of the time.” Seems to me this problem can be solved exactly, but my answer was 5/8=62.5%. Anyway, just wondering if anyone else got the same result or if there may be a flaw in my analysis. See –
https://www.reddit.com/r/puzzles/comments/9bt0yr/the_new_national_pastime_competitive_coin_flipping/
I got the answer 5/8 also. I believe that you are exactly correct.
John, appreciate the confirmation. The closed-form solution didn’t seem very complicated. Surprising that wasn’t the solution on 538 Riddler site. 3.5% difference between 59% answer provided and 62.5% isn’t negligible.
I didn't look at that problem, but it does seem like there should be a closed-form solution. Your approach looks correct to me.
Laurent, I got exactly the same solutions using what is essentially the same methodology as you employed, but I used Mathematica. I then went on to find the probability generating function for the Riddler Silver 100-card case. The mean is, of course, 49.94457; the exact value being a rational number whose numerator in lowest terms contains 569 digits and whose denominator contains 567 digits. The standard deviation is 11.9989. The mode is 44 and the median is 47. The upper 90%, 95%, 99% and 99.9% points of the cumulative probability distribution are, respectively, 66, 72, 88 and 110. I would have enclosed a plot of the probability density function, but I can't figure out how to attach an image file to this message.
Cool! I didn’t think to compute the other moments of the distribution, but I’d love to see how you went about it! Any chance you could post a write-up?
If you can send me some kind of link I’ll send you back a PDF with a simple example of how I did it.
you can email me at laurent.lessard at gmail dot com.
This is how I did it, except that since the same binomials were being computed over and over, and never with a "bottom" term greater than 10, I just generated all of them first. This sped things up quite a bit.
NCards = 100

# make B, a 2D array holding B[n][k] for all n between 0 and NCards
# and for all k between 0 and 10 (Pascal's triangle, truncated at k = 10)
B = [[0]*11 for i in range(0, NCards+1)]
for n in range(0, NCards+1):
    for k in range(0, 11):
        if k == 0 or k == n:
            B[n][k] = 1
        elif k > n:
            B[n][k] = 0
        else:
            B[n][k] = B[n-1][k] + B[n-1][k-1]

# now C(n,k) can make use of B very quickly
def C(n, k):
    if k < 0 or k > n:
        return 0
    return B[n][k]
Nice, that’s a great idea!
Like John Snyder, I came up with 47 for 100 cards, and 179 for 300 cards. Those are the numbers of packs at which the chance of having found all 100 (or all 300) cards first reaches 50%. It's pretty simple math…here's the 100 one that I did. The MAX 62815650955529472000 is 100 * 99 * 98 * 97 * 96 * 95 * 94 * 93 * 92 * 91. That's the number of ordered ways to draw 10 *different* cards. I grant that the ten lines of code are a bit longer than usual 🙂
#include <stdio.h>
#define TOTAL 100
#define MAX 62815650955529472000.
int
main(int ac, char **av)
{
	double n_cards[TOTAL + 11], new_cards[TOTAL + 11];	/* +11 so new_cards[i + 10] never overruns */
	int i, n;

	for(i = 0; i <= TOTAL + 10; i++)
		n_cards[i] = 0;
	n_cards[10] = 1.;	/* after the first pack we hold exactly 10 distinct cards */
	for(n = 0; n_cards[TOTAL] < .5; n++) {
		printf("n = %d\n", n);
		for(i = 0; i <= TOTAL + 10; i++)
			new_cards[i] = 0;
		for(i = 0; i <= TOTAL; i++) {
			if(n_cards[i] != 0) {
				new_cards[i + 10] += n_cards[i] * (TOTAL - i - 0)*(TOTAL - i - 1)*(TOTAL - i - 2)*(TOTAL - i - 3)*(TOTAL - i - 4)*(TOTAL - i - 5)*(TOTAL - i - 6)*(TOTAL - i - 7)*(TOTAL - i - 8)*(TOTAL - i - 9) / MAX;
				new_cards[i + 9] += n_cards[i] * (TOTAL - i - 0)*(TOTAL - i - 1)*(TOTAL - i - 2)*(TOTAL - i - 3)*(TOTAL - i - 4)*(TOTAL - i - 5)*(TOTAL - i - 6)*(TOTAL - i - 7)*(TOTAL - i - 8)*i*(10) / MAX;
				new_cards[i + 8] += n_cards[i] * (TOTAL - i - 0)*(TOTAL - i - 1)*(TOTAL - i - 2)*(TOTAL - i - 3)*(TOTAL - i - 4)*(TOTAL - i - 5)*(TOTAL - i - 6)*(TOTAL - i - 7)*i*(i - 1)*(10 * 9 / 2.) / MAX;
				new_cards[i + 7] += n_cards[i] * (TOTAL - i - 0)*(TOTAL - i - 1)*(TOTAL - i - 2)*(TOTAL - i - 3)*(TOTAL - i - 4)*(TOTAL - i - 5)*(TOTAL - i - 6)*i*(i - 1)*(i - 2)*(10 * 9 * 8 / 2. / 3.) / MAX;
				new_cards[i + 6] += n_cards[i] * (TOTAL - i - 0)*(TOTAL - i - 1)*(TOTAL - i - 2)*(TOTAL - i - 3)*(TOTAL - i - 4)*(TOTAL - i - 5)*i*(i - 1)*(i - 2)*(i - 3)*(10 * 9 * 8 * 7 / 2. / 3. / 4.) / MAX;
				new_cards[i + 5] += n_cards[i] * (TOTAL - i - 0)*(TOTAL - i - 1)*(TOTAL - i - 2)*(TOTAL - i - 3)*(TOTAL - i - 4)*i*(i - 1)*(i - 2)*(i - 3)*(i - 4)*(10 * 9 * 8 * 7 * 6 / 2. / 3. / 4. / 5.) / MAX;
				new_cards[i + 4] += n_cards[i] * (TOTAL - i - 0)*(TOTAL - i - 1)*(TOTAL - i - 2)*(TOTAL - i - 3)*i*(i - 1)*(i - 2)*(i - 3)*(i - 4)*(i - 5)*(10 * 9 * 8 * 7 * 6 * 5 / 2. / 3. / 4. / 5. / 6.) / MAX;
				new_cards[i + 3] += n_cards[i] * (TOTAL - i - 0)*(TOTAL - i - 1)*(TOTAL - i - 2)*i*(i - 1)*(i - 2)*(i - 3)*(i - 4)*(i - 5)*(i - 6)*(10 * 9 * 8 * 7 * 6 * 5 * 4 / 2. / 3. / 4. / 5. / 6. / 7.) / MAX;
				new_cards[i + 2] += n_cards[i] * (TOTAL - i - 0)*(TOTAL - i - 1)*i*(i - 1)*(i - 2)*(i - 3)*(i - 4)*(i - 5)*(i - 6)*(i - 7)*(10 * 9 * 8 * 7 * 6 * 5 * 4 * 3 / 2. / 3. / 4. / 5. / 6. / 7. / 8.) / MAX;
				new_cards[i + 1] += n_cards[i] * (TOTAL - i - 0)*i*(i - 1)*(i - 2)*(i - 3)*(i - 4)*(i - 5)*(i - 6)*(i - 7)*(i - 8)*(10 * 9 * 8 * 7 * 6 * 5 * 4 * 3 * 2 / 2. / 3. / 4. / 5. / 6. / 7. / 8. / 9.) / MAX;
				new_cards[i + 0] += n_cards[i] * i*(i - 1)*(i - 2)*(i - 3)*(i - 4)*(i - 5)*(i - 6)*(i - 7)*(i - 8)*(i - 9)*(10 * 9 * 8 * 7 * 6 * 5 * 4 * 3 * 2 * 1 / 2. / 3. / 4. / 5. / 6. / 7. / 8. / 9. /10.) / MAX;
			}
		}
		for(i = 0; i <= TOTAL; i++) {
			n_cards[i] = new_cards[i];	/* move to the distribution after this pack */
			if(n_cards[i] > .000001)
				printf(" %3d: %12.10lf\n", i, n_cards[i]);
		}
	}
	printf("n = %d\n", n);
}
I like your solution and explanation! I solved it a similar way (and used Julia as well). I was a little disappointed that most of the solutions I saw simply simulated the process over and over and averaged the results.
But – I don’t think this answer is *quite* right! I know it was the accepted solution, and the calculated expected packs are definitely correct, but the question asked for the expected number of weeks, which wouldn’t just be packs/10, since the packs are not distributed evenly throughout the week (at least that was not my interpretation – it seems much more reasonable that he is buying all ten at once one time per week, rather than one every 16.8 hours). For example if the packs happened to be distributed so that 25% of the time it took 49, 50% of the time it took 50, and 25% of the time it took 51, the expected number of packs would be 50, but the expected number of weeks would be .25*5 + .5*5 + .25*6 = 5.25, since packs 49 and 50 occur at week 5, but pack 51 occurs at week 6.
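Here is a minimal sketch (in Julia; the helper names are my own, not code from above) of how one could quantify that: build the full distribution of the number of packs needed from the same hypergeometric transition probabilities, then compare the expected packs divided by 10 against the expected number of whole weeks.
# Sketch: distribution of the number of packs needed (state = distinct cards held),
# then expected packs vs. expected whole weeks at 10 packs per week.
binom(n, k) = binomial(big(n), k)

function pack_distribution(n, m; tol = 1e-12)
    p = zeros(n + 1); p[1] = 1.0            # p[k+1] = P(currently hold k distinct cards)
    dist = Float64[]                        # dist[t] = P(set completed on pack t exactly)
    done = 0.0
    while done < 1 - tol
        q = zeros(n + 1)
        for k in 0:n
            p[k+1] == 0 && continue
            for i in max(0, m - k):min(m, n - k)    # i new cards in this pack
                q[k+i+1] += p[k+1] * Float64(binom(n - k, i) * binom(k, m - i) // binom(n, m))
            end
        end
        push!(dist, q[n+1] - done)
        done = q[n+1]
        p = q
    end
    return dist
end

dist  = pack_distribution(100, 10)
packs = sum(t * dist[t] for t in eachindex(dist))            # ≈ 49.94
weeks = sum(ceil(t / 10) * dist[t] for t in eachindex(dist)) # E[number of whole weeks]
println("packs/10 = ", packs / 10, "   expected weeks = ", weeks)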
By the way, to complement your final analysis, it occurs to me that some basic error bounds are useful.
The underlying idea relates to stopping trials (or overshoots in random walks).
Suppose we model this as drawing one card at a time, even though we in fact get $m$ in a pack. Then for any sample path, once we've recovered the entire collection, we stop collecting more cards, and we have at a minimum an overshoot of 0 and at a maximum an overshoot of $m-1$ cards (i.e. all the cards in the final pack are unused except the 'first' one).
So in terms of numbers of cards,
$\text{single card solution} \leq \text{expected cards for actual problem} \leq \text{single card solution} + (m-1)\leq \text{single card solution} +m$
and dividing by $m$ cards per pack we can change the units to get
======================
$\text{(single card solution divided by m)} \leq \text{expected actual solution in terms of packs} \leq \text{(single card solution divided by m)} + 1$
======================
edit: Hah! I had this idea a week or two after looking at the problem and I forgot that there are no duplicates in the packs… in effect dealing $m$ cards at a time for a given draw / pack without replacement. The math of course is easier with replacement, and there are some standard hypergeometric bounds we could use, but anyway the above idea is a bit incomplete I guess.