The birthday problem (and indicator random variables)



This link contains two examples of using indicator random variables. The second example is what concerns us here. The author calculates the number of people that need to be in a room so that we expect to get one birthday match. Of course, if there are more people in the room, we should expect more matches. The calculation using indicator random variables gives an answer of about 28 (people in a room, to expect one match).

Note that saying "we expect to get one birthday match" has nothing to do with guaranteeing a match. It should be interpreted as the average outcome of an experiment. In other words, consider setting up a huge number of rooms, say N. Each of these N rooms will have 28 people. Count the total number of birthday matches found in each separate room. When we divide this by N, we get the expected number of matches per room. Note that using linearity of expectation, we can get one from the other.

The number 28 found here is different from the number 23 that we derived in class when we first considered the birthday problem. That is because we derived something different in class; the number of people that should be in a room in order to have equal odds of finding a match or not. This is not an average number of matches; it is a median. By no means is this equivalent to saying that we expect 0.5 matches. If we had N rooms, each with 23 people, we would expect to get N/2 rooms with no match, and N/2 rooms with at least one match. (OK, I'm ignoring the fact that for 50-50 odds, we derived a non-integer number of people. Let's just call it 23). When we derived the number 23 in class, all we were doing was summing up the number of times that we won't find a match vs the number of times that we will; we didn't care about how many matches occur in a room if at least one match is there. That is the difference between averages and medians.

Partition the N rooms, each with 23 people, based on which ones contain no match vs at least one match, and look at how these rooms would contribute to the total number of birthday matches. The "no-match" rooms would contribute zero, and we said half of the rooms will be of this type. But the "match" rooms could contribute any number of matches from 1 to 23-choose-2. These rooms would have to average exactly 2 matches, for the overall average to be 1. This is what the link above is concerned with, and it turns out that, on average, the N/2 "match" rooms will actually contribute fewer than 2 matches each (but at least one each). Combined with the zero matches in the other N/2 rooms this results in an overall average of less than 1. That is why we actually need more people (28) in each room, to expect 1 match per room.

Finally, here is another version of the problem. Suppose we have a a set of more than 365 people, so that we know we will find a match (by pigeonhole). If we ask for their birthdays, one at a time, until we find a match, we expect to need to ask about 24 people. So, we have yet another number as an answer to a birthday problem! That is because it is different to be forced to use k people in your experiment every single time, than to allow a count from 1 to n. If you repeat this third version N times, sometimes you'll get a match with 2 people, and in theory sometimes you'll actually have to wait for pigeonhole (although we've seen that that won't happen in the lifetime of our universe). It turns out that on average you'll have to ask about 24 people to get the first match. If I'm not mistaken, the math involved for this 3rd version is slightly trickier than the other two.

So, depending on how you formulate your question about birthdays, you get different answers. Think about this before you go making any bets!



Back to my notes on IRVs