Thue-Morse and fair sharing

Matt Parker on “the fairest sharing sequence”, the Thue-Morse sequence. The sequence is

0110 1001 1001 0110 1001 0110 0110 1001…

which is generated as follows:

  • at the first step, start with 0
  • invert the sequence, replacing all 0s with 1s and vice versa, and concatenate this to the original sequence

So this gives, in sequential steps: 0, 01, 0110, 01101001, …

It’s an interesting sequence, but why the title?

This is a follow-up to a video in which he proposes a puzzle: partition the numbers 0 through 2^{k+1}-1 into two groups such that the sum of the 1st, 2nd, …, kth powers of each group are the same. For example, for k = 3 we have 0 + 3 + 5 + 6 = 1 + 2 + 4 + 7 and 0^2 + 3^2 + 5^2 + 6^2 = 1^2 + 2^2 + 4^2 + 7^2.

The solution, as you may have guessed, is that the numbers on one side of the equality should be the positions of 0s among the first 2^{k+1} elements of the Thue-Morse sequence, and the numbers on the other side should be the positions of the 1s, where indexing starts at 0. This fact is proven in Allouche and Shallit’s paper “The ubiquitous Prouhet-􏰀Thue-􏰀Morse sequence” – see section 5.1.

So why is this about fair sharing? Let’s consider the k = 2 case again. We have 0^k + 3^k + 5^k + 6^k = 1^k + 2^k + 4^k + 7^k for k = 0, 1, 2. That means, then, that we have

f(0) + f(3) + f(5) + f(6) = f(1) + f(2) + f(4) + f(7)

for any quadratic polynomial f. So say that we’re divvying up some goods and I get the 0th, 3rd, 5th, and 6th most valuable item while you get the 1st, 2nd, 4th, and 7th. (We are computer scientists so we start counting from zero.) Then if we can write the value of the nth item as a quadratic in n, then we end up with the same value.

(How does this generalize to the case where f is not polynomial? That seems relevant – for example what if f(x) = e^{-x} or 1/(x+1)? In a paper which independently rediscovered the Thue-Morse sequence, Robert Richman did some experiments which suggests that this still holds as long as f isn’t “too far” from polynomial. This got written up in the Guardian as how to pour the perfect cup of coffee. Thue was Norwegian, and I hear Norwegians like coffee, so I think he’d approve.

(Via Metafilter.)

Strings of digits, Mersenne primes

Dave Radcliffe observed that The digits of M(74207281) contain 8929592 distinct seven-digit substrings, and 20021565 distinct eight-digit substrings.. (So if you pick an arbitrary 7-digit string, you have an 89% chance of finding it in this number; if you pick an arbitrary 8-digit string, you have a 20% chance.)

Note that M(74207281) is, as the BBC put it, the largest known prime number, discovered in Missouri; it’s 274207281 – 1. In base 2 it has a very nice expansion as a sequence of 1s – but in base 10, there’s no reason it should.

Does this seem right? In particular, what would we expect for a random string of digits of this length? The number has \lceil 74207281 \times \log_{10}(2) \rceil = 22338618 digits – call this d, so I don’t have to write it out over and over again. Therefore it contains d-6 seven-digit strings.

The probability that it contains some given seven-digit string is approximately 1 - \exp (-d/10^7). Any “slot” of seven digits has probability 10^{-7} of holding any particular string. If all the slots were independent, then the distribution of the number of different seven-digit strings which appear would be Binomial(d-6, 10^{-7}), which is essentially Poisson(n/10^7). But they are not quite independent – some overlap. The extent of the overlap is very small, though, so we needn’t worry, and we proceed with the approximation. This works out to about 0.8929, exactly the proportion that Radcliffe reports. Similarly for eight-digit strings we get a probability of 0.2002.

Both of these are right to four digits – just what we expect from this sort of situation! Think of 8929592, above, as being a realization of Binomial(n, p) with n = 10^7, p = 0.8929 – we flip a “biased coin” with probability 0.8929 of success to determine if each string, from 0000000 to 9999999, appears in our random string. Then this has mean $np$ and standard deviation \sqrt{np(1-p)} \approx 978 – so we should expect the last three or four digits of any realization to be different from those of the mean. In general, Binomial(n, p) has a standard deviation somewhat less than \sqrt{n} – so in realizations of it we expect half the digits to be “right” (i. e. the same as in np) and half to be wrong. (I swear I’ve read before about this rule of thumb of half the digits being right, but this is hard to Google.)

Open Syllabus Project

The Open Syllabus Project has collected syllabuses1 from about a million college classes. From this they’ve created the Open Syllabus Explorer.

It’s generally focused on listing which texts occur in syllabi. This ends up meaning that it’s mostly useful for humanities classes, since they tend to assign a larger number of texts. I remember my grad school oral exams, where there were perhaps four books I was studying from; there were friends of mine in the humanities that would have, say, 50 books.

That being said, it’s still interesting to know that people who assign Stanley’s Enumerative Combinatorics also assign Wilf’s generatingfunctionology. And, somewhat more confusingly, Rudin’s Real and Complex Analysis – which makes me suspect that perhaps some lists of canonical math texts in a variety of subjects got caught in the dragnet.

An analogue of this for more technical subjects, where there are many books covering similar material and where there are fewer texts per course, would examine the portions of the books which are covered. Which subjects are seen as essential, and which ones are not? Many textbooks include a bit in the introduction that can be paraphrased as “my book is too long for one course, I know, so here are some one-course subsets of it” – but which of those susbets actually get used? This would require parsing the syllabi, and is perhaps feasible (and maybe the folks at the OSP are thinking about it). This is very much the sort of thing I would have liked to see when I was planning new courses, and on a hard drive that no longer works I’m pretty sure I have some files with lists of “professor X, in semester Y, covered chapters A, B, and C of the textbook”.

(Via NYT.)

1. My wife, a classics major, informs me that that’s the right plural, because “syllabus” is not of the declension that pluralizes by taking “-us” to “-i”. Interestingly, the NYT article uses “syllabuses” but the OSP’s page uses “syllabi”.

Snow day?

The National Weather Service’s Weather Prediction Center now has probabilistic winter precipitation guidance.  You can get maps showing

  • the nth percentile of the predicted distribution of accumulation for n = 5, 10, 25, 50, 75, 90, 95;
  • dually to that, contours on which the 10th, 20th, …, 90th percentile of the predicted distribution is a selected constant

for the next 72 hours, or for 24- or 48-hour subsets of that time period.  Here’s a more in-depth description.

Looks like the mid-Atlantic is in for a rough ride. Stay home. Have some French toast.  If you do go outside, ignore the wind chill.

We’re number one!

Glassdoor has one of those lists of the “best jobs in America”. “Data scientist” is #1.

Seeing this got my data scientist spidey sense tingling. It’s kind of ironic, seeing how a data scientist could poke so many holes in this…

Glassdoor describes their methodology as follows:

Methodology: Glassdoor’s 25 Best Jobs in America report identifies specific jobs with the highest overall Glassdoor Job Score. The Glassdoor Job Score is determined by weighting three factors equally: earning potential (median annual base salary), career opportunities rating, and number of job openings. Results represent job titles that rate highly among all three categories. The Glassdoor Job Score is based on a 5-point scale (5.0=best job, 1.0=bad job). For a job title to be considered, it must receive at least 75 salary reports and at least 75 career opportunities ratings shared by U.S.-based employees over the past year (1/8/15-1/7/16). The number of job openings per job title represents active job listings on Glassdoor as of 1/8/16. This report takes into account job title normalization that groups similar job titles.

The “career opportunities rating” appears to be aggregated from ratings left in Glassdoor reviews, and would seem to correspond more to the companies that data scientists work for than data science as a career. Earning potential I can agree with as an important variable (who doesn’t like money?) Looking at “number of job openings” seems a bit off to me – I’m sure there are a lot of job openings for hotel housekeepers – but perhaps they’re doing some normalization and this is actually some measure of number of openings relative to number of qualified candidates. (“Software engineer” has 49,270 openings and nothing else on the list breaks five figures, but it’s not at the top of the list.)

Data scientist has gone up since last year’s list, when it was #9, driven by an increase in salary and in career opportunities rating. Back when I would have called myself a mathematician, that was the #1 job (in a different study, but who cares?) Clearly this is all my fault.

Links for January 18

Siobhan Roberts interviews John Conway: Genius Behind the Numbers

Aaron Fisher, G. Brooke Anderson, Roger Peng, Jeff Leek​
on applying randomized trials to figure out if students in MOOCs can learn what statistical significance looks like.

How FiveThirtyEight is forecasting presidential primaries.

Secrets of the MIT Poker Course by Nick Greene; the course referred to is at MIT OpenCourseWare.

Richard V. Reeves and Nathan Joo on How much social mobility do people really want? – it turns out that they want upward mobility without the corresponding amount of downward mobility that it implies.

Ana Swanson at the Washington Post’s Wonkblog on the mistake you’re making with your Fitbit. (The mistake, it turns out, is not wearing it – but there’s some interesting work on sampling to determine how accurate the Fitbit and other wearables are.)

Jordan Eilenberg was messing around with word2vec and found some interesting things.