Olympic mathematics

More Olympic mathematics: from the arXiv blog, statisticians predict the number of Olympic records that will fall at London 2012. The paper is A survival analysis of the duration of Olympic records, by Elliott Hollifield, Victoria Treviño, and Adam Zarn.

Jordan Ellenberg explains the genius of the new (in 2008) gymnastics scoring system. If I may summarize: gymnastics doesn’t naturally have some right wall of performance, so why act like it does?

I’m actually watching the women’s all-around gymnastics final right now. There are eight teams in it, and four pieces of apparatus, so obviously turns have to be taken; therefore at any given time each team will have performed the same number of routines, but not necessarily on the same apparatus. And since different teams are strong on different apparatus, this means that saying “X is ahead of Y” is a bit meaningless. If gymnastics had a bigger following someone would have worked out how to correct for this and tell me who’s really in the lead, given their remaining events and how they’ve performed historically in them. (I had the same thought during the individual medley in swimming yesterday; at some point one person was well behind but the commentators said that they’d probably come back on the last leg because the last leg was their best stroke, and if I recall correctly that turned out to be true.)

Why am I sitting here at nine in the morning watching the Olympics? Because I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Weekly links for July 30

Nate Silver asks which Olympic records get shattered.

mages forecasts the winning time in the men’s 100m in the Olympics (9.68 sec).

Dan Graettinger predicts the Olympic medal counts.

Andrew Hacker’s opinion piece Is Algebra Necessary? has been making the rounds. No comment.

Why isn’t iTunes shuffle random?, from AskDifferent (the StackExchange site for Apple products).

A four-time lottery winner is a Stanford statistics PhD.

A couple series of posts from John Baez: Symmetry and the Fourth Dimension, parts one, two, three, and four, and the eight-part series The Mathematics of Biodiversity.

Boolean buddhas.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Number of Fibonacci primes

A “Fibonacci prime” is nothing more than a number which is both prime and a Fibonacci number. The indices of the first 33 Fibonacci primes, and the next 13 “probable” ones (in the probable prime sense), are A001605 in the On-Line Encyclopedia of Integer Sequences. The sequence of such indices begins 3, 4, 5, 7, 11, 13, …; the corresponding Fibonacci numbers are 2, 3, 5, 13, 89, 233, … . As the note at the OEIS points out, since F_mn is divisible by F_n, every entry in that sequence after the 4 must itself be prime. The converse is not true; the first exception is F_19 = 4181 = 37 × 113.
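These claims are easy to check by brute force. Here's a sketch in Python (the post's code is in R; the helper names fib and is_prime are mine):

```python
def fib(n):
    # nth Fibonacci number, with F_1 = F_2 = 1
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def is_prime(m):
    # trial division; fine for numbers this small
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

# indices n <= 30 for which F_n is prime
print([n for n in range(1, 31) if is_prime(fib(n))])  # [3, 4, 5, 7, 11, 13, 17, 23, 29]

# the converse fails first at n = 19
print(fib(19), is_prime(fib(19)))  # 4181 False
```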

Are Fibonacci primes rare? To answer this question we need to define “rare”; specifically, if you know a number is a Fibonacci number, does that make it more or less likely to be prime than it would otherwise be? The nth Fibonacci number is roughly φ^n, so by the prime number theorem the “probability” that the nth Fibonacci number is prime is 1/(n log φ). Summing over n, there ought to be about (log n)/(log φ) primes among the first n Fibonacci numbers; inverting, the xth Fibonacci prime should be somewhere around index φ^x.

We can plot the points (1, log 3), (2, log 4), (3, log 5), (4, log 7), …, (46, log 1636007), where the y-coordinate of the xth point is the log of the index of the xth Fibonacci prime, and see if they fall roughly on a straight line. The picture is below. If the heuristic above is correct, they should fall on a line of slope log φ, or about 0.48. They actually fall on a line of slope about 0.31. For example, by the heuristic in the previous paragraph there should be about (log 1636007)/(log φ) ≈ 30 Fibonacci primes with index at most 1636007; there are actually 46. There are too many Fibonacci primes.
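The heuristic count is a one-liner to reproduce; a sketch in Python:

```python
import math

phi = (1 + 5 ** 0.5) / 2  # the golden ratio

# heuristic: about (log n) / (log phi) Fibonacci primes with index up to n
predicted = math.log(1636007) / math.log(phi)
print(round(predicted))  # 30, versus the 46 actually known
```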

More generally, do Fibonacci numbers have fewer factors than are typical among numbers of their size? Here are factorizations of Fibonacci numbers at mersennus.net. I don’t know the answer to this question.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

How much waiting time does sub-optimal train spacing on BART cost?

Service on Bay Area Rapid Transit (BART) runs every fifteen minutes on each route; every twenty minutes on evenings and Sundays. Here’s a map of the system. (The lines in red and green don’t run all the time, but that’s not relevant for what I’m about to say.)

The San Francisco city segment (Daly City to Embarcadero) is served by four of the five routes, and so has service by sixteen trains an hour when all those lines are running, between about 6 AM and 7 PM. Here’s a schedule. If you’re standing at, say, 16th Street (which I was a couple hours ago), northbound (Embarcadero-bound) trains depart at

6:11 am (red, i. e. Richmond-bound)

6:17 am (blue, i. e. Dublin-bound)

6:19 am (yellow, i. e. Pittsburg-bound)

6:24 am (green, i. e. Fremont-bound)

and then that pattern repeats every fifteen minutes through 4:11 pm. (Things are slightly complicated by extra trains in the Pittsburg direction in the evening rush.) That is, in a typical hour:

Dublin trains depart at 2, 17, 32, and 47 after the hour;

Pittsburg trains depart at 4, 19, 34, and 49 after the hour, that is, 2 minutes after the Dublin trains;

Fremont trains depart at 9, 24, 39, and 54 after the hour, that is, 5 minutes after the Pittsburg trains;

Richmond trains depart at 11, 26, 41, and 56 after the hour, that is, 2 minutes after the Fremont trains

(and of course Dublin trains depart 6 minutes after the Richmond trains).

You’ll notice that these trains are not evenly spaced.
So say I show up at 16th Street at a time chosen uniformly at random between, say, 2 PM and 3 PM, and I want to go to Embarcadero; what is my expected waiting time?

With probability 6/15, the next train heading in that direction is a Dublin train. In these cases, on average I will have to wait 6/2 = 3 minutes.

With probability 2/15, the next train heading in that direction is a Pittsburg train. In these cases, on average I will have to wait 2/2 = 1 minute.

With probability 5/15, the next train heading in that direction is a Fremont train. In these cases, on average I will have to wait 5/2 = 2.5 minutes.

With probability 2/15, the next train heading in that direction is a Richmond train. In these cases, on average I will have to wait 2/2 = 1 minute.

So my average waiting time is (6/15)(3) + (2/15)(1) + (5/15)(2.5) + (2/15)(1) = 2.3 minutes. You could also write this as

{6^2 + 2^2 + 5^2 + 2^2 \over 2 (6+2+5+2)}

and this is an example of a general formula; the numerator is the sum of the squares of the times between different trains, the denominator is twice the period on which the system runs. On average I’ll have to wait 2 minutes and 18 seconds for a train if I just show up without looking at the schedule.

How much worse is this than the optimal spacing? The denominator of that fraction has to be 30; the numerator is the sum of the squares of four numbers that add up to 15. To minimize that sum of squares we want the numbers in the numerator to be as close together as possible. (This is roughly Jensen’s inequality.) Assuming that the inter-train spacings have to be integer numbers of minutes, we’ll go with 3, 4, 4, and 4, and the average wait time ends up being

{3^2 + 4^2 + 4^2 + 4^2 \over 2 \times 15} = 1.9

and if we were allowed to not have the spacings be integer numbers of minutes, we could get this down to 1.875 minutes by having trains exactly three minutes and 45 seconds apart.
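The general formula is easy to encode; here's a sketch in Python (mean_wait is my name for it, not anything official):

```python
def mean_wait(gaps):
    # expected wait for a uniformly random arrival:
    # sum of squared gaps over twice the total period
    return sum(g * g for g in gaps) / (2 * sum(gaps))

print(mean_wait([6, 2, 5, 2]))   # actual BART spacing: 2.3
print(mean_wait([3, 4, 4, 4]))   # best integer spacing: 1.9
print(mean_wait([3.75] * 4))     # perfectly even spacing: 1.875
```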

So BART is costing people who commute within San Francisco 0.425 minutes — or twenty-five and a half seconds — every morning and evening, with this non-optimal train spacing! Or fifty-one whole seconds a day! I suppose this isn’t really worth worrying about, especially since I spent a good half-hour writing this post…

(I haven’t thought hard enough to figure out if there’s some constraint that makes it impossible to space out trains more evenly; in particular that might interact poorly with the spacing on some other interlined segment of the network. It is also possible that this is deliberately done by BART so that people who just show up without looking at a schedule, intending to take the next train to a point within the city, are more likely to end up on Dublin or Fremont trains, as those are the lines that are less used in their East Bay segments; you can see that from this chart of number of people exiting at each station on weekdays and some arithmetic. If so, well done, BART people! If not, well, pretend you meant to do it that way.)

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Weekly links for July 22

Happy Pi approximation day, a holiday for people who are GOOD ENOUGH, just not transcendental.

What do top Kaggle competitors focus on?

Staying dry in rain: run or walk? (That goes to the BBC; here’s the original paper, which is actually free despite being behind what looks like a paywall.)

Jean-Baptiste Michel’s TED talk, The mathematics of history.

An excerpt from Steven Strogatz’ The Joy of x: A Guided Tour of Math, from One to Infinity

David MacKay has an excellent book Information Theory, Inference and Learning Algorithms (also available as a free download from the author); he teaches a course based on the book which has online video lectures.

Brian Whitman asks How well does music predict your politics?

The data genetics blog on the egg breaking puzzle, via Hacker News.

From the Chronicle of Higher Education, Using big data to redesign education by steering students towards courses where they’ll do well.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Fibonacci subset fun

A week ago I wrote a post on bitwise set trickery in which I asked a question from James Tanton: how many subsets S of {1, 2, …, 10} have the property that S and S+2 cover all the numbers from 1 to 12? Solving it is a one-liner in R. Slightly generalizing, we can replace 10 by n:

library(bitops) # for bitOr
g = function(n){sum(bitOr(0:(2^n-1), 4*(0:(2^n-1))) == (2^(n+2)-1))}
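If you don't have the bitops package handy, the same computation works as a sketch in Python, with integer bitmasks and a left shift by two playing the role of multiplying by 4:

```python
def g(n):
    # count subsets S of {1, ..., n} with S ∪ (S+2) = {1, ..., n+2};
    # a subset is a bitmask, and s << 2 adds 2 to every element
    target = (1 << (n + 2)) - 1
    return sum(1 for s in range(1 << n) if (s | (s << 2)) == target)

print([g(n) for n in range(1, 11)])  # [0, 1, 1, 1, 2, 4, 6, 9, 15, 25]
```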

Then it’s easy to compute g(n) for n = 1, 2, …, 20:

n 1 2 3 4 5 6 7 8 9 10
g(n) 0 1 1 1 2 4 6 9 15 25
n 11 12 13 14 15 16 17 18 19 20
g(n) 40 64 104 169 273 441 714 1156 1870 3025

and if positive integers are at least your casual acquaintances you’ll recognize a lot of squares here, and in particular a lot of squares of Fibonacci numbers:

\cdots 9 = 3^2, 25 = 5^2, 64 = 8^2, 169 = 13^2, 441 = 21^2, \cdots

The numbers in between the squares are a little trickier, but once we’re primed to think Fibonacci it’s not hard to see

\cdots 15 = 3 \times 5, 40 = 5 \times 8, 104 = 8 \times 13, 273 = 13 \times 21, \cdots

So this leads to the conjecture that

g(2n) = F_n^2, g(2n+1) = F_n F_{n+1}

for positive integers n. If you’re allergic to cases you can write this as

g(n) = F_{\lfloor n/2 \rfloor} F_{\lceil n/2 \rceil}.
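Before proving the formula, it can at least be checked numerically against the brute-force count; a sketch in Python (not the post's R code):

```python
def fib(n):
    # nth Fibonacci number, with F_1 = F_2 = 1
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def g(n):
    # brute-force count of valid subsets, as in the table above
    target = (1 << (n + 2)) - 1
    return sum(1 for s in range(1 << n) if (s | (s << 2)) == target)

# the conjecture: g(n) = F_{floor(n/2)} * F_{ceil(n/2)}
print(all(g(n) == fib(n // 2) * fib((n + 1) // 2) for n in range(1, 17)))  # True
```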

So how to prove these formulas? We can explicitly list, say, the g(8) = 9 sets that we need.

sets = function(n) {
  indices = which(bitOr(0:(2^n-1), 4*(0:(2^n-1))) == (2^(n+2)-1)) - 1 # like the bit trickery above
  for (i in 1:length(indices)) {
    print(which(intToBits(indices[i]) == 01)) # convert from integer to the positions of its 1 bits
  }
}

(The “-1” is because vectors in R are 1-indexed.) Then, for example, sets(9) outputs the fifteen sets

1 2 4 5 8 9
1 2 3 4 5 8 9
1 2 5 6 8 9
1 2 3 5 6 8 9
1 2 4 5 6 8 9
1 2 3 4 5 6 8 9
1 2 3 4 7 8 9
1 2 4 5 7 8 9
1 2 3 4 5 7 8 9
1 2 3 6 7 8 9
1 2 3 4 6 7 8 9
1 2 5 6 7 8 9
1 2 3 5 6 7 8 9
1 2 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9

and now we examine the entrails. Each row consists of a subset S of {1, … 9} such that, when we take its union with S + 2, we get all the integers from 1 up to 11. Now when we add 2 to every element of S, we don’t change parity, so it makes sense to look at even and odd numbers separately. If we extract just the even numbers from each of the sets, we get {2, 4, 8}, {2, 6, 8}, or {2, 4, 6, 8}, each of which occurs five times; if we extract just the odd numbers we get {1, 5, 9}, {1, 3, 5, 9}, {1, 3, 7, 9}, {1, 5, 7, 9} or {1, 3, 5, 7, 9}, each of which occurs three times. Every possible combination of one of the even subsets with one of the odd subsets occurs exactly once!
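That observation (each even part pairs with each odd part exactly once) can be checked mechanically; a sketch in Python:

```python
from itertools import combinations

n = 9
full = set(range(1, n + 3))  # {1, ..., 11}

# all subsets S of {1, ..., 9} with S ∪ (S+2) = {1, ..., 11}
covering = [set(c)
            for k in range(n + 1)
            for c in combinations(range(1, n + 1), k)
            if set(c) | {x + 2 for x in c} == full]

even_parts = {frozenset(x for x in s if x % 2 == 0) for s in covering}
odd_parts = {frozenset(x for x in s if x % 2 == 1) for s in covering}

# 15 covering sets, 3 distinct even parts, 5 distinct odd parts, 3 * 5 = 15
print(len(covering), len(even_parts), len(odd_parts))
```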

What to make of this? Well, in order for S \subset \{1, 2, \ldots, 9\} to satisfy \{ 2, 4, 6, 8, 10 \} \subset (S \cup (S+2)), we must have that S contains 2; at least one of 2 or 4; at least one of 4 or 6; at least one of 6 or 8; and 8. Similarly, looking at just the odd numbers, S must contain 1; at least one of 1 or 3; at least one of 3 or 5; at least one of 5 or 7; at least one of 7 or 9; and 9. And S will satisfy the overall condition if and only if its even part and its odd part do what they need to do; there’s no interaction between them.

So now what? Consider the set S = {1, 2, 4, 5, 7, 8, 9} from the list above. Its even part is {2, 4, 8} and its odd part is {1, 5, 7, 9}. We can take the even part and divide everything by 2 to obtain a set T = {1, 2, 4}; we can take the odd part, divide through by 2, and round up to get U = {1, 3, 4, 5}. The conditions transform similarly; in order for S \subset \{ 1, 2, \ldots, 9 \} to satisfy \{ 1, \ldots, 11 \} = (S \cup (S+2)), we have to have that:

  • the set T, obtained from S by taking its even subset and dividing through by 2, contains 1; at least one of 1 or 2; at least one of 2 or 3; at least one of 3 or 4; and 4;
  • the set U, obtained from S by taking its odd subset and dividing through by 2, contains 1; at least one of 1 or 2; at least one of 2 or 3; at least one of 3 or 4; at least one of 4 or 5; and 5.

Of course it’s easier to say that T is a subset of {1, 2, 3, 4} containing 1, 4, and at least one of every two consecutive elements, and U is a similar subset of {1, 2, 3, 4, 5}. So how many subsets of {1, 2, 3, …, n} contain 1, n, and at least one of every two consecutive elements?

This is finally where the Fibonacci numbers come in. It’s easier to count possible complements of T. If T satisfies the condition given above, then its complement with respect to {1, 2, …, n} is a subset of {2, 3, …, n-1} containing no two consecutive elements. Call the number of such sets f(n). For n = 3 there are two such subsets of {2}, namely the empty set and {2} itself; for n = 4 there are three such subsets of {2, 3}, namely the empty set, {2}, and {3}. So f(3) = 2, f(4) = 3. Now to find f(n) if n > 4. This is the number of subsets of {2, 3, …, n-1} which contain no two consecutive elements. These can be divided into two types: those which contain n-1 and those which don’t. Those which contain n-1 can’t contain n-2, and so are just subsets of {2, 3, …, n-3} which contain no two consecutive elements, with n-1 added. There are f(n-2) of these. Those which don’t contain n-1 are just subsets of {2, 3, …, n-2} with no two consecutive elements; there are f(n-1) of these. So f(n) = f(n-1) + f(n-2); combined with the initial conditions we see that f(n) is just the nth Fibonacci number.
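The recursion can be confirmed by direct enumeration; a sketch in Python (f as defined above):

```python
from itertools import combinations

def f(n):
    # count subsets of {2, ..., n-1} containing no two consecutive elements
    elems = list(range(2, n))
    return sum(1
               for k in range(len(elems) + 1)
               for c in combinations(elems, k)
               if all(b - a > 1 for a, b in zip(c, c[1:])))

print([f(n) for n in range(3, 10)])  # [2, 3, 5, 8, 13, 21, 34], Fibonacci
```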

So we can finally compute g(n). For a set S to satisfy the condition defining g(n) — that is, S \subset \{1, 2, \ldots, n\} and S \cup (S+2) = \{1, 2, \ldots, n+2\} — the corresponding T must be a subset of \{1, 2, \ldots, \lfloor n/2 \rfloor\} containing the first element, the last element, and at least one of every two consecutive elements, and likewise U with \lfloor n/2 \rfloor replaced by \lceil n/2 \rceil. By the previous paragraph there are F_{\lfloor n/2 \rfloor} choices for T and F_{\lceil n/2 \rceil} choices for U, and they can be made independently, so g(n) = F_{\lfloor n/2 \rfloor} F_{\lceil n/2 \rceil}, which is what we wanted.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Neighborhood boundaries

Crowdsourced neighborhood boundaries in Boston, from Bostonography, based on an ongoing survey that asks people to draw the outlines of neighborhoods on a map. One nice feature of these maps is their fuzzy edges; shading indicates points that fewer than 25%, 25 to 50%, 50 to 75%, or more than 75% of people think are in a given neighborhood. I think this is interesting even if you don’t know Boston; it’s fascinating how some neighborhoods have well-defined edges (major streets seem to play this role) and some just radiate out from a center.

This reminds me of an older project in San Francisco. Data here comes from Craigslist housing posts and an on-site form. The methodology is somewhat different in that it just asks you to give an address and say which neighborhood it’s in. I would imagine this tends to give boundaries that are less often straight lines. I could imagine someone in San Francisco thinking, for example, that Cesar Chavez is the boundary between the Mission and Bernal Heights (I actually had this conversation yesterday, because I’m moving into a place near there), but the same person might not use this as their decision rule when asked about a particular address in a neighborhood with which they were familiar.

What bothers me about using Craigslist housing posts as a data source is that “more desirable” neighborhoods tend to grow on Craigslist, as people with places for rent who are anywhere near a neighborhood border try to sort into the more desirable neighborhoods. In clicking through Craigslist I saw stretches of the truth, some by only a few hundred feet (see above) and some by a couple miles (43rd Avenue is not the inner Sunset, it’s half a mile from the ocean!). The “good” neighborhoods tend to expand, and the inner Sunset has cachet that the Sunset as a whole doesn’t right now. For two years I’ve lived a couple hundred feet south (i. e. Oakland-side) of the Oakland-Berkeley border, but my landlord lists the place as being in Berkeley. And I have to give a hat tip to my old neighborhood in Philadelphia: this is West Philly; “University City” is a marketing scheme.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Weekly links for July 15

Felix Salmon on how economists get tripped up by statistics, a summary of The illusion of predictability: how regression statistics mislead experts by Emre Soyer and Robin Hogarth. Here’s Andrew Gelman’s response.

Jeffrey Stanton of Syracuse has written an introductory textbook on data science, available free online in PDF format. (There’s also an interactive version for the iPad, which I can’t vouch for because I don’t have an iPad.)

Graham Wheeler writes for Significance magazine on the probability that your home HIV test is a false alarm.

Derek Thompson at The Atlantic tells us eleven ways in which consumers are bad at math.

How high would you have to launch something in New York for it to be visible in Los Angeles?

A guide to mining Twitter with R.

Norm Matloff’s From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science is a text for a course in mathematical probability and statistics for CS students.

What are the chances of finding a golden ticket in a Caramilk bar? Jeffrey Rosenthal knows. (Incidentally, Jeffrey Rosenthal is my academic uncle. I also have a biological uncle named Jeffrey. What are the chances of that?)

At New York’s Lincoln Center, this week, there’s an opera based on the life of Émilie du Châtelet.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

People like to move on weekends, but also the first of the month…

A thought: lots of leases on apartments are timed to calendar months. Are people even ever-so-slightly more likely to move when the first of the month falls on a weekend, as opposed to midweek? Obviously this wouldn’t be a big effect, but it’s at least slightly easier to move on the weekend, and I could imagine someone with a month-to-month tenancy wanting to move on either August 1 or September 1 this year, noticing that August 1 is a Wednesday and September 1 is a Saturday, and deciding September would be better. (This might be more true in markets less crazy than San Francisco.)

This is, as it turns out, not hypothetical. I’m looking to move for August 1. (The link goes to my Craigslist “housing wanted” ad. Short version: in SF, non-ridiculous commute to downtown, would like to pay around $1000. Not included in that link: you get to live with someone who’s at least very mildly Internet-famous, being a math blogger and all.) I visited a place where the old roommate was moving out the weekend before because it’s the weekend; I’ve also seen ads for places that are available August 5, presumably for this reason.

Also, still: I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Bitwise set trickery

James Tanton asked last week on Twitter: How many subsets S of {1, 2, …, 10} have the property that S and S+2 cover all the numbers from 1 to 12?

(I’ve taken the liberty of expanding some abbreviations here because my blog, unlike Twitter, doesn’t have a 140-character limit.  Also, Tanton gave a hint which I’m omitting; you can click through to the link if you want it.)

In case you don’t know the notation, S + 2 means the set that we get from S by adding 2 to each element. So, for example, if S = {3, 5, 9} then S + 2 = {5, 7, 11}.

This is, with a clever trick, a one-liner in R:

sum(bitOr(0:1023, 4*(0:1023)) == 4095)

which outputs 25, the answer to the question.

(bitOr comes from the bitops package.)

Why does this work? We can represent each subset of {1, 2, …, 10} as an integer in 0, 1, …, 1023 = 2^10 − 1. Namely, we represent the set S by
\sum_{s \in S} 2^{s-1}
so, for example, the set {3, 5, 9} is represented by the integer 2^2 + 2^4 + 2^8 = 276, or 100010100 in binary. (Read from right to left, it’ll make more sense.) Multiplying an integer by 4 = 2^2 then corresponds to adding 2 to each element — so 1104 = 4 × 276 = 2^4 + 2^6 + 2^10 represents the set {5, 7, 11}. 1104 is 10001010000 in binary; we just add two zeroes to the end.
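This encoding is a couple of lines to write down; a sketch in Python (set_to_int is my name for the map, not from the post):

```python
def set_to_int(s):
    # represent the set S by the sum of 2^(x-1) over its elements x
    return sum(1 << (x - 1) for x in s)

print(set_to_int({3, 5, 9}))       # 276
print(bin(set_to_int({3, 5, 9})))  # 0b100010100
# multiplying by 4 shifts two bits left, i.e. adds 2 to every element
print(4 * set_to_int({3, 5, 9}) == set_to_int({5, 7, 11}))  # True
```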

The union of two sets then just becomes bitwise OR. We can line up our two integers (right-aligned, padding the shorter one with zeroes on the left) and then, in a third row, write a 1 in any column that has a 1 in either position above it:

00100010100   (276, the set {3, 5, 9})
10001010000   (1104, the set {5, 7, 11})
-----------
10101010100   (the bitwise OR)

The result 10101010100 (binary) corresponds to the set {3, 5, 7, 9, 11}, the union of the two original sets. The bitOr command does this for each possible set; 4095 = 2^12 − 1, or 111111111111 in binary, corresponds to the set of all the numbers from 1 to 12. So we’re just summing up a vector of TRUEs and FALSEs, with a TRUE in positions corresponding to sets S for which S and S+2 cover {1, …, 12} and FALSE otherwise; in this context R treats TRUE as 1 and FALSE as 0, so the sum is the number of TRUEs.
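The whole computation, OR and count, can be sketched in Python as well (this is an analogue of the R one-liner, not the post's code):

```python
a = 276   # represents {3, 5, 9}
b = 1104  # represents {5, 7, 11}, i.e. 4 * 276
print(bin(a | b))  # 0b10101010100, the union {3, 5, 7, 9, 11}

# the analogue of the R one-liner: 4095 = 2^12 - 1 is the full set {1, ..., 12}
print(sum(1 for s in range(1024) if (s | (s << 2)) == 4095))  # 25
```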

But why is the answer 25? Can we find a general formula? Think about it; I’ll post the solution tomorrow.