Strings of digits, Mersenne primes

Dave Radcliffe observed that The digits of M(74207281) contain 8929592 distinct seven-digit substrings, and 20021565 distinct eight-digit substrings.. (So if you pick an arbitrary 7-digit string, you have an 89% chance of finding it in this number; if you pick an arbitrary 8-digit string, you have a 20% chance.)

Note that M(74207281) is, as the BBC put it, the largest known prime number, discovered in Missouri; it’s 2^74207281 – 1. In base 2 it has a very nice expansion as a sequence of 1s – but in base 10, there’s no reason it should.

Does this seem right? In particular, what would we expect for a random string of digits of this length? The number has $\lceil 74207281 \times \log_{10}(2) \rceil = 22338618$ digits – call this $d$ , so I don’t have to write it out over and over again. Therefore it contains $d-6$ seven-digit strings.

The probability that it contains some given seven-digit string is approximately $1 - \exp (-d/10^7)$ . Any “slot” of seven digits has probability $10^{-7}$ of holding any particular string. If all the slots were independent, then the distribution of the number of different seven-digit strings which appear would be $Binomial(d-6, 10^{-7})$ , which is essentially $Poisson(n/10^7)$ . But they are not quite independent – some overlap. The extent of the overlap is very small, though, so we needn’t worry, and we proceed with the approximation. This works out to about 0.8929, exactly the proportion that Radcliffe reports. Similarly for eight-digit strings we get a probability of 0.2002.

Both of these are right to four digits – just what we expect from this sort of situation! Think of 8929592, above, as being a realization of $Binomial(n, p)$ with $n = 10^7, p = 0.8929$ – we flip a “biased coin” with probability 0.8929 of success to determine if each string, from 0000000 to 9999999, appears in our random string. Then this has mean $np$ and standard deviation $\sqrt{np(1-p)} \approx 978$ – so we should expect the last three or four digits of any realization to be different from those of the mean. In general, $Binomial(n, p)$ has a standard deviation somewhat less than $\sqrt{n}$ – so in realizations of it we expect half the digits to be “right” (i. e. the same as in $np$ ) and half to be wrong. (I swear I’ve read before about this rule of thumb of half the digits being right, but this is hard to Google.)