# Thue-Morse and fair sharing

Matt Parker on “the fairest sharing sequence”, the Thue-Morse sequence. The sequence is

0110 1001 1001 0110 1001 0110 0110 1001…

which is generated as follows: start with a single 0, then repeatedly

• invert the sequence so far, replacing all 0s with 1s and vice versa, and append this inverted copy to the end.

So this gives, in sequential steps: 0, 01, 0110, 01101001, …
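A minimal Python sketch of this invert-and-append construction (the function name is my own):

```python
def thue_morse(doublings):
    """Build a prefix of the Thue-Morse sequence by repeated invert-and-append."""
    seq = [0]
    for _ in range(doublings):
        seq += [1 - b for b in seq]  # append the bitwise complement of everything so far
    return seq

print(thue_morse(3))  # [0, 1, 1, 0, 1, 0, 0, 1]
```

Each pass doubles the length, reproducing the steps 0, 01, 0110, 01101001, … above.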

It’s an interesting sequence, but why the title?

This is a follow-up to a video in which he proposes a puzzle: partition the numbers 0 through $2^{k+1}-1$ into two groups such that the sums of the 1st, 2nd, …, kth powers of each group are the same. For example, for $k = 2$ we have $0 + 3 + 5 + 6 = 1 + 2 + 4 + 7$ and $0^2 + 3^2 + 5^2 + 6^2 = 1^2 + 2^2 + 4^2 + 7^2$.

The solution, as you may have guessed, is that the numbers on one side of the equality should be the positions of 0s among the first $2^{k+1}$ elements of the Thue-Morse sequence, and the numbers on the other side should be the positions of the 1s, where indexing starts at 0. This fact is proven in Allouche and Shallit’s paper “The ubiquitous Prouhet-Thue-Morse sequence” – see section 5.1.

So why is this about fair sharing? Let’s consider the $k = 2$ case again. We have $0^k + 3^k + 5^k + 6^k = 1^k + 2^k + 4^k + 7^k$ for $k = 0, 1, 2$. That means, then, that we have

$f(0) + f(3) + f(5) + f(6) = f(1) + f(2) + f(4) + f(7)$

for any quadratic polynomial $f$. So say that we’re divvying up some goods and I get the 0th, 3rd, 5th, and 6th most valuable item while you get the 1st, 2nd, 4th, and 7th. (We are computer scientists so we start counting from zero.) Then if we can write the value of the $n$th item as a quadratic in $n$, then we end up with the same value.
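We can spot-check this for the $k = 2$ case, using the standard characterization of the $n$th Thue-Morse bit as the parity of the number of 1s in $n$’s binary expansion (the particular quadratic below is just an arbitrary example):

```python
# the nth Thue-Morse bit is the parity of the number of 1s in n's binary expansion
tm = [bin(n).count("1") % 2 for n in range(8)]
zeros = [n for n in range(8) if tm[n] == 0]  # positions of 0s: [0, 3, 5, 6]
ones = [n for n in range(8) if tm[n] == 1]   # positions of 1s: [1, 2, 4, 7]

for k in range(3):  # 0th, 1st, and 2nd power sums all agree
    assert sum(n**k for n in zeros) == sum(n**k for n in ones)

f = lambda n: 3 * n**2 - 2 * n + 5  # an arbitrary quadratic
print(sum(f(n) for n in zeros) == sum(f(n) for n in ones))  # True
```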

(How does this generalize to the case where $f$ is not polynomial? That seems relevant – for example, what if $f(x) = e^{-x}$ or $1/(x+1)$? In a paper which independently rediscovered the Thue-Morse sequence, Robert Richman did some experiments which suggest that this still holds as long as $f$ isn’t “too far” from polynomial. This got written up in the Guardian as how to pour the perfect cup of coffee. Thue was Norwegian, and I hear Norwegians like coffee, so I think he’d approve.)

(Via Metafilter.)

# Strings of digits, Mersenne primes

Dave Radcliffe observed that the digits of M(74207281) contain 8929592 distinct seven-digit substrings and 20021565 distinct eight-digit substrings. (So if you pick an arbitrary 7-digit string, you have an 89% chance of finding it in this number; if you pick an arbitrary 8-digit string, you have a 20% chance.)

Note that M(74207281) is, as the BBC put it, the largest known prime number, discovered in Missouri; it’s $2^{74207281} - 1$. In base 2 it has a very nice expansion as a string of 1s – but in base 10, there’s no reason it should look like anything special.

Does this seem right? In particular, what would we expect for a random string of digits of this length? The number has $\lceil 74207281 \times \log_{10}(2) \rceil = 22338618$ digits – call this $d$, so I don’t have to write it out over and over again. Therefore it contains $d-6$ seven-digit strings.

The probability that it contains some given seven-digit string is approximately $1 - \exp(-d/10^7)$. Here’s why: any “slot” of seven digits has probability $10^{-7}$ of holding any particular string. If all the slots were independent, then the number of occurrences of a given seven-digit string would be $Binomial(d-6, 10^{-7})$, which is essentially $Poisson(d/10^7)$ – so the probability of at least one occurrence is $1 - \exp(-d/10^7)$. The slots are not quite independent – some overlap – but the extent of the overlap is very small, so we needn’t worry, and we proceed with the approximation. This works out to about 0.8929, exactly the proportion that Radcliffe reports. Similarly, for eight-digit strings we get a probability of about 0.2002.
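A quick check of those two probabilities (variable names are my own):

```python
import math

d = math.ceil(74207281 * math.log10(2))  # number of decimal digits: 22338618

for length in (7, 8):
    # Poisson approximation: P(a given string of this length appears at least once)
    p = 1 - math.exp(-d / 10**length)
    print(length, round(p, 4))  # 7 -> 0.8929, 8 -> 0.2002
```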

Both of these are right to four digits – just what we expect from this sort of situation! Think of 8929592, above, as being a realization of $Binomial(n, p)$ with $n = 10^7, p = 0.8929$ – we flip a “biased coin” with probability 0.8929 of success to determine whether each string, from 0000000 to 9999999, appears in our random string. Then this has mean $np$ and standard deviation $\sqrt{np(1-p)} \approx 978$ – so we should expect the last three or four digits of any realization to differ from those of the mean. In general, $Binomial(n, p)$ has a standard deviation somewhat less than $\sqrt{n}$ – so in realizations of it we expect half the digits to be “right” (i.e. the same as in $np$) and half to be wrong. (I swear I’ve read before about this rule of thumb of half the digits being right, but this is hard to Google.)
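The mean and standard deviation in that back-of-the-envelope argument:

```python
import math

n, p = 10**7, 0.8929            # one "biased coin" per possible seven-digit string
mean = n * p                     # expected number of distinct strings: 8929000
sd = math.sqrt(n * p * (1 - p))  # about 978
print(round(mean), round(sd))
```

The observed count of 8929592 sits well within one standard deviation of that mean.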

# Open Syllabus Project

The Open Syllabus Project has collected syllabuses¹ from about a million college classes. From this they’ve created the Open Syllabus Explorer.

It’s generally focused on listing which texts occur in syllabi. This ends up meaning that it’s mostly useful for humanities classes, since they tend to assign a larger number of texts. I remember my grad school oral exams, where there were perhaps four books I was studying from; friends of mine in the humanities would have had, say, 50 books.

That being said, it’s still interesting to know that people who assign Stanley’s Enumerative Combinatorics also assign Wilf’s generatingfunctionology. And, somewhat more confusingly, Rudin’s Real and Complex Analysis – which makes me suspect that perhaps some lists of canonical math texts in a variety of subjects got caught in the dragnet.

An analogue of this for more technical subjects, where there are many books covering similar material and where there are fewer texts per course, would examine the portions of the books which are covered. Which subjects are seen as essential, and which ones are not? Many textbooks include a bit in the introduction that can be paraphrased as “my book is too long for one course, I know, so here are some one-course subsets of it” – but which of those subsets actually get used? This would require parsing the syllabi, and is perhaps feasible (and maybe the folks at the OSP are thinking about it). This is very much the sort of thing I would have liked to see when I was planning new courses, and on a hard drive that no longer works I’m pretty sure I have some files with lists of “professor X, in semester Y, covered chapters A, B, and C of the textbook”.

(Via NYT.)

1. My wife, a classics major, informs me that that’s the right plural, because “syllabus” is not of the declension that pluralizes by taking “-us” to “-i”. Interestingly, the NYT article uses “syllabuses” but the OSP’s page uses “syllabi”.

# Snow day?

The National Weather Service’s Weather Prediction Center now has probabilistic winter precipitation guidance.  You can get maps showing

• the $n$th percentile of the predicted distribution of accumulation for $n = 5, 10, 25, 50, 75, 90, 95$;
• dually to that, contours on which the 10th, 20th, …, 90th percentile of the predicted distribution is a selected constant

for the next 72 hours, or for 24- or 48-hour subsets of that time period.  Here’s a more in-depth description.

Looks like the mid-Atlantic is in for a rough ride. Stay home. Have some French toast.  If you do go outside, ignore the wind chill.

# We’re number one!

Glassdoor has one of those lists of the “best jobs in America”. “Data scientist” is #1.

Seeing this got my data scientist spidey sense tingling. It’s kind of ironic how many holes a data scientist could poke in this…

Glassdoor describes their methodology as follows:

> Methodology: Glassdoor’s 25 Best Jobs in America report identifies specific jobs with the highest overall Glassdoor Job Score. The Glassdoor Job Score is determined by weighting three factors equally: earning potential (median annual base salary), career opportunities rating, and number of job openings. Results represent job titles that rate highly among all three categories. The Glassdoor Job Score is based on a 5-point scale (5.0=best job, 1.0=bad job). For a job title to be considered, it must receive at least 75 salary reports and at least 75 career opportunities ratings shared by U.S.-based employees over the past year (1/8/15-1/7/16). The number of job openings per job title represents active job listings on Glassdoor as of 1/8/16. This report takes into account job title normalization that groups similar job titles.

The “career opportunities rating” appears to be aggregated from ratings left in Glassdoor reviews, and would seem to correspond more to the companies that data scientists work for than to data science as a career. Earning potential I can agree with as an important variable (who doesn’t like money?). Looking at “number of job openings” seems a bit off to me – I’m sure there are a lot of job openings for hotel housekeepers – but perhaps they’re doing some normalization and this is actually some measure of the number of openings relative to the number of qualified candidates. (“Software engineer” has 49,270 openings and nothing else on the list breaks five figures, but it’s not at the top of the list.)

Data scientist has gone up since last year’s list, when it was #9, driven by an increase in salary and in career opportunities rating. Back when I would have called myself a mathematician, that was the #1 job (in a different study, but who cares?) Clearly this is all my fault.

Siobhan Roberts interviews John Conway: Genius Behind the Numbers

Aaron Fisher, G. Brooke Anderson, Roger Peng, and Jeff Leek on applying randomized trials to figure out if students in MOOCs can learn what statistical significance looks like.

Secrets of the MIT Poker Course by Nick Greene; the course referred to is at MIT OpenCourseWare.

Richard V. Reeves and Nathan Joo on How much social mobility do people really want? – it turns out that they want upward mobility without the corresponding amount of downward mobility that it implies.

Ana Swanson at the Washington Post’s Wonkblog on the mistake you’re making with your Fitbit. (The mistake, it turns out, is not wearing it – but there’s some interesting work on sampling to determine how accurate the Fitbit and other wearables are.)

Jordan Eilenberg was messing around with word2vec and found some interesting things.

A lot of links this week for some reason – to be honest, some of this is clearing out an old backlog, but it may reflect the beginning of the new year as well.

Andy T. Woods, Charles Michel, and Charles Spence give a scientific study of the ‘rules’ of plating. This follows an earlier “plating manifesto” in two parts by Charles Spence, Betina Piqueras-Fiszman, Charles Michel, and Ophelia Deroy: The plating manifesto (I): from decoration to creation and The plating manifesto (II): the art and science of plating.

Alan J. Bishop, Western mathematics: the secret weapon of cultural imperialism (1990; via Hacker News).

Machine Learning, Uncertain Information, and the Inevitability of Negative “Probabilities” (video lecture, David Lowe, 2007)

John Horgan on Bayes’ Theorem: What’s the big deal? at Scientific American.

Bill Gosper on continued fraction arithmetic (1972). At some point I want to sit down and digest this. See also Mark Jason Dominus’ talk on the material.

Aaron Clauset is teaching a course on The History and Future of Computing which has an interesting reading list.

Nick Berry on the Koch snowflake.

Trevor Hastie, Robert Tibshirani, and Martin Wainwright have a new book, Statistical Learning with Sparsity: The Lasso and Generalizations, which you can download.

Christie Aschwanden at FiveThirtyEight wrote You Can’t Trust What You Read About Nutrition – because collecting data about nutrition is hard and also because there are so many studies that data mining is easy.

Udemy and Priceonomics on How William Cleveland Turned Data Visualization Into a Science.

Nicky Case is simulating the world (in emoji).

Cathy O’Neil gave a talk on “Weapons of Math Destruction” and she’s finishing up a book of the same title.

Frank Wilczek (whose classes I slept through freshman year of college) on people’s preferences in recreations showing that they like math and don’t realize it.

# Logarithmic approximations for Collatz

A question from Republic of Math:

The fit in this plot looked a bit off to me – it seems like it should be a log, not a square root. (But of course a log is just a zeroth power…) For those who aren’t familiar, the Collatz conjecture is as follows: given an arbitrary positive integer, if it’s even, divide it by 2, and if it’s odd, multiply by 3 and add 1. Repeat. As far as we know, you always end up at 1. For example, say we start with 7; then we get

7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1

So why should this be true? Consider only the odd numbers in the sequence. In the previous sequence those are

7, 11, 17, 13, 5, 1

where we get each number by taking three times the previous one, plus one, and then dividing by 2 as many times as we can. We can figure that half the time we’ll divide by 2 once, a quarter of the time we’ll divide by 2 twice, and so on; on average we’ll divide by 2 twice. So each number in this sequence should be, on average, about three-quarters of the previous one, and each step in this sequence corresponds to three steps in the previous sequence (one tripling and two halvings). Thus each step on average multiplies by $(3/4)^{1/3}$, and it should take about $(3 \log n)/\log(4/3)$ steps to get from $n$ down to 1. Call this $C \log n$ where $C = 3/\log(4/3) \approx 10.42$.
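The constant works out as follows:

```python
import math

# each odd step roughly triples, and is followed on average by two halvings,
# shrinking n by a factor of 3/4 over three Collatz steps
C = 3 / math.log(4 / 3)
print(round(C, 3))  # 10.428
```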

This heuristic argument is in, for example, the review paper of Lagarias; this online HTML version of the same paper gives the additional remark that the stopping time of the sequence is conjectured to be asymptotically normally distributed with mean and variance both on the order of $\log n$. (The constant is different there; that’s because Lagarias goes directly from $x$ to $(3x+1)/2$ if $x$ is odd.)

Let’s define a function $t(n)$ to be the number of iterations in this process needed to get to 1. For example, look at the trajectory for 7 above; it takes 16 iterations to get to 1, so $t(7) = 16$. Similarly, $t(4) = 2$ (the trajectory is 4, 2, 1) and $t(6) = 8$ (the trajectory is 6, 3, 10, 5, 16, 8, 4, 2, 1). This is the sequence A006577 in the OEIS. Then the original question is about the function $m(n) = {1 \over n} (t(1) + t(2) + \cdots + t(n))$. But since “most” numbers less than $n$ have their logarithms near $\log n$, that sum can be approximated as just $(1/n) (n \times C \log n) = C \log n$.
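Computing $t$ and the average $m$ directly (the function names follow the text):

```python
def t(n):
    """Number of Collatz iterations needed to reach 1 (OEIS A006577)."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

def m(n):
    """Mean of t over 1, ..., n."""
    return sum(t(k) for k in range(1, n + 1)) / n

print(t(7), t(4), t(6))  # 16 2 8
```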

And indeed a logarithmic model is the better fit (red in the plot below; blue is a power law – in both cases I’ve just fit to the range $100 \le n \le 5000$).

Sadly, we don’t recover that constant 10.42… the red curve is $m = -16.87 + 11.07 \log n$ and the blue curve is $m = 16.31 \times n^{0.185}$.  But it’s not like you’d expect asymptotic behavior to kick in by 5000 anyway.

# Powerball

A pre-Powerball roundup of links, in advance of tonight’s \$900 million drawing:
Is it rational for an economist to play Powerball? Probably not – the expected value of a Powerball ticket tonight might still be negative, since so many people buy tickets that the jackpot is likely to be split. Sales apparently go up faster than the jackpot. And Britain’s national lottery had a big jackpot today as well.

Here’s some advice on how to win and what to do if you win. I’d steer clear of thinking that certain numbers are luckier than others, although the standard advice – try to play numbers that other people aren’t playing, so you don’t have to share the jackpot if you win – definitely applies.