This week’s Riddler Classic is a problem familiar to many…
I have $n$ pairs of socks in a drawer. Each pair is distinct from another and consists of two matching socks. Alas, I’m negligent when it comes to folding my laundry, and so the socks are not folded into pairs. This morning, fumbling around in the dark, I pull the socks out of the drawer, randomly and one at a time, until I have a matching pair of socks among the ones I’ve removed from the drawer.
On average, how many socks will I pull out of the drawer in order to get my first matching pair?
Here is my solution:
[Show Solution]
Computing the expectation
Define $T$ to be number of turns that the game lasts (this is a random variable). Define $q_k = \textbf{P}(T > k)$ to be the probability that the game has still not ended after $k$ draws. We can compute this explicitly. There are $\binom{2n}{k}$ possible (equally likely) ways of choosing $k$ socks among the $2n$ socks in the drawer. If there is no match, we must have drawn $k$ socks of different colors. There are $\binom{n}{k}$ ways to choose the $k$ colors and $2^k$ ways to pick the socks once the colors are chosen. Putting everything together and performing some algebraic manipulations, we obtain:
\[
q_k = \frac{\binom{n}{k} 2^k}{\binom{2n}{k}}
= \frac{n! \cdot (2n-k)! \cdot 2^k}{(2n)! \cdot (n-k)!}
= \frac{ \binom{2n-k}{n} 2^k }{\binom{2n}{n}}
\]Let’s also define $p_k = \textbf{P}(T=k)$ to be the probability that the game ends on precisely the $k^\text{th}$ draw. Note that $q_{k} = p_{k+1}+p_{k+2}+\cdots+p_{n+1}$, or alternatively, $p_k = q_{k-1}-q_k$. From the definition of expected value:
\[
\textbf{E}(T) = \sum_{k=1}^{n+1} k\,p_k = \sum_{k=1}^{n+1}k(q_{k-1}-q_k) = \sum_{k=0}^n q_k
\]This is a manifestation of the general fact that $\textbf{E}(T) = \sum_{k\ge 0}\textbf{P}(T > k)$. Make the change $k\mapsto n-k$ in the expectation:
\[
\textbf{E}(T) = \frac{\sum_{k=0}^n \binom{2n-k}{n} 2^k}{\binom{2n}{n}}
= \frac{2^n \sum_{k=0}^n \binom{n+k}{k} 2^{-k}}{\binom{2n}{n}}
\]Consider the sum in the numerator. Define:
\[
S_n = \sum_{k=0}^n \binom{n+k}{k} 2^{-k}
\]We can evaluate this sum recursively:
\begin{align}
S_{n+1} &= \sum_{k=0}^{n+1} \binom{n+k+1}{k} 2^{-k} \\
&= \sum_{k=0}^{n+1}\left[ \binom{n+k}{k} + \binom{n+k}{k-1} \right] 2^{-k} \\
&= \sum_{k=0}^{n+1} \binom{n+k}{k}2^{-k} + \sum_{k=0}^{n+1} \binom{n+k}{k-1} 2^{-k} \\
&= S_n + \binom{2n+1}{n+1}2^{-n-1} + \sum_{k=0}^{n} \binom{n+k+1}{k} 2^{-k-1} \\
&= S_n + \tfrac{1}{2}S_{n+1}
\end{align}Solving, we obtain $S_{n+1} = 2S_n$, and therefore conclude that $S_n = 2^n$. Substituting this result back into the formula we had above, we obtain:
$\displaystyle
\textbf{E}(T) = \frac{4^n}{\binom{2n}{n}}
$
To get a feel for how this expression varies with $n$, we can use Stirling’s approximation, which leads us to the approximation:
\[
\textbf{E}(T) \approx \sqrt{\pi n}
\]So the expected number of socks we’ll have to draw grows as the square root of the number of pairs. For example, when we have $n=10$ pairs of socks, $\textbf{E}(T) = \frac{262144}{46189}\approx 5.67546$. Meanwhile, the approximation yields $\sqrt{10\pi} \approx 5.60499$. The approximation gets better as $n$ gets larger. Here is a plot showing the true expected value and the approximation:
Computing the distribution
We have the expectation $\textbf{E}(T) = \frac{4^n}{\binom{2n}{n}} \approx \sqrt{\pi n}$. But what about the distribution of $T$ itself? Does it approach anything interesting as $n$ gets large? First, we need to compute the exact probability mass function, which is nothing other than $p_k$. We can do this from the recursion we derived earlier: $p_k = q_{k-1}-q_k$. This simplifies to:
\[
p_k = \frac{k-1}{n-k+1} \frac{\binom{2n-k}{n} 2^{k-1} }{\binom{2n}{n}}\qquad\text{for }k=1,2,\dots,n+1
\]We can plot these distributions along with their means to see what it looks like. Here are the distributions of $T$ for $n=\{10,20,50,100\}$.
To understand what this distribution looks like as $n$ gets very large, we have to do a bit of asymptotic analysis. This part is a little rough around the edges (read: not particularly rigorous), but it seems to give a reasonable answer. My reasoning was that as $n\to\infty$, $k$ is small compared to $n$ because the mean grows like $\sqrt{n}$. Therefore, we will assume $k \sim \sqrt{n}$.
Using Mathematica, we can look at how $\frac{p_k}{k-1}$ varies for small $k$. Using a series approximation about $k=0$ and $n=\infty$, we find that:
\[
\log\left( \frac{p_k}{k-1} \right) \approx \log(\tfrac{1}{2n}) + O(\tfrac{1}{n}) + \tfrac{5}{4n} k-\tfrac{1}{4n}k^2+O(k^3)
\]Therefore, we conclude that:
\[
p_k \approx \frac{k}{2n} e^{-\tfrac{k^2}{4n}}
\]This means that as $n\to\infty$, the distribution of stopping time $T$ approaches a Rayleigh distribution with parameter $\sigma = \sqrt{2n}$. This also gives the correct mean for large $n$, as the mean of the Rayleigh distribution is $\sigma\sqrt{\frac{\pi}{2}} = \sqrt{\pi n}$. For a comparison, here is the plot for $n=500$ along with the Rayleigh approximation: