Weekly links for July 22

Happy Pi approximation day, a holiday for people who are GOOD ENOUGH, just not transcendental.

What do top Kaggle competitors focus on?

Staying dry in rain: run or walk? (That goes to the BBC; here’s the original paper, which is actually free despite being behind what looks like a paywall.)

Jean-Baptiste Michel’s TED talk, The mathematics of history.

An excerpt from Steven Strogatz’ The Joy of x: A Guided Tour of Math, from One to Infinity.

David MacKay has an excellent book Information Theory, Inference and Learning Algorithms (also available as a free download from the author); he teaches a course based on the book which has online video lectures.

Brian Whitman asks How well does music predict your politics?

The Data Genetics blog on the egg breaking puzzle, via Hacker News.

From the Chronicle of Higher Education, Using big data to redesign education by steering students towards courses where they’ll do well.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Fibonacci subset fun

A week ago I wrote a post on bitwise set trickery in which I asked a question from James Tanton: how many subsets S of {1, 2, …, 10} have the property that S and S+2 cover all the numbers from 1 to 12? Solving this is a one-liner in R. Slightly generalizing, we can replace 10 by n:


library(bitops)
# encode each subset of {1, ..., n} as an integer: S+2 becomes 4*S, and union becomes bitwise OR
g = function(n){sum(bitOr(0:(2^n-1), 4*(0:(2^n-1))) == (2^(n+2)-1))}

Then it’s easy to compute g(n) for n = 1, 2, …, 20:

n      1   2   3   4   5   6   7   8    9   10
g(n)   0   1   1   1   2   4   6   9   15   25

n     11  12   13   14   15   16   17    18    19    20
g(n)  40  64  104  169  273  441  714  1156  1870  3025

and if positive integers are at least your casual acquaintances you’ll recognize a lot of squares here, and in particular a lot of squares of Fibonacci numbers:

\cdots 9 = 3^2, 25 = 5^2, 64 = 8^2, 169 = 13^2, 441 = 21^2, \cdots

The numbers in between the squares are a little trickier, but once we’re primed to think Fibonacci it’s not hard to see

\cdots 15 = 3 \times 5, 40 = 5 \times 8, 104 = 8 \times 13, 273 = 13 \times 21, \cdots

So this leads to the conjecture that

g(2n) = F_n^2, g(2n+1) = F_n F_{n+1}

for positive integers $n$. If you’re allergic to cases you can write this as

g(n) = F_{\lfloor n/2 \rfloor} F_{\lceil n/2 \rceil}.
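
Before proving this, here’s a quick numerical check (my own sketch, not from the original post): the helper functions fib and formula_g below are mine, and the check reuses the g defined above.


fib = function(k) {
  # F_k, with F_0 = 0 and F_1 = F_2 = 1
  if (k == 0) return(0)
  f = c(1, 1)
  while (length(f) < k) f = c(f, sum(tail(f, 2)))
  f[k]
}
formula_g = function(n) fib(floor(n/2)) * fib(ceiling(n/2))
all(sapply(1:20, g) == sapply(1:20, formula_g)) # should print TRUE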

So how to prove these formulas? We can explicitly list, say, the g(9) = 15 sets that we need.


sets = function(n){
  # same bit trickery as above: find the integers that encode valid sets S
  indices = which(bitOr(0:(2^n-1), 4*(0:(2^n-1))) == (2^(n+2)-1)) - 1
  for (i in 1:length(indices)) {
    # convert each integer back to the positions of its 1 bits, i.e. the elements of S
    print(which(intToBits(indices[i]) == 1))
  }
}

(The “-1” is because lists in R are 1-indexed.) Then, for example, sets(9) outputs the fifteen sets


1 2 4 5 8 9
1 2 3 4 5 8 9
1 2 5 6 8 9
1 2 3 5 6 8 9
1 2 4 5 6 8 9
1 2 3 4 5 6 8 9
1 2 3 4 7 8 9
1 2 4 5 7 8 9
1 2 3 4 5 7 8 9
1 2 3 6 7 8 9
1 2 3 4 6 7 8 9
1 2 5 6 7 8 9
1 2 3 5 6 7 8 9
1 2 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9

and now we examine the entrails. Each row consists of a subset S of {1, … 9} such that, when we take its union with S + 2, we get all the integers from 1 up to 11. Now when we add 2 to every element of S, we don’t change parity, so it makes sense to look at even and odd numbers separately. If we extract just the even numbers from each of the sets, we get {2, 4, 8}, {2, 6, 8}, or {2, 4, 6, 8}, each of which occurs five times; if we extract just the odd numbers we get {1, 5, 9}, {1, 3, 5, 9}, {1, 3, 7, 9}, {1, 5, 7, 9} or {1, 3, 5, 7, 9}, each of which occurs three times. Every possible combination of one of the even subsets with one of the odd subsets occurs exactly once!
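
If you’d rather not count by hand, here’s a quick way (my own sketch, not from the post) to verify that claim: recompute the valid subsets of {1, …, 9} and tabulate their even and odd parts. The names even_parts and odd_parts are mine; bitOr is again from the bitops package loaded above.


n = 9
indices = which(bitOr(0:(2^n-1), 4*(0:(2^n-1))) == (2^(n+2)-1)) - 1
subsets = lapply(indices, function(k) which(intToBits(k) == 1))
even_parts = sapply(subsets, function(S) paste(S[S %% 2 == 0], collapse = " "))
odd_parts = sapply(subsets, function(S) paste(S[S %% 2 == 1], collapse = " "))
table(even_parts) # three distinct even parts, each appearing five times
table(odd_parts)  # five distinct odd parts, each appearing three times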

What to make of this? Well, in order for S \subset \{1, 2, \ldots, 9\} to satisfy \{ 2, 4, 6, 8, 10 \} \subset (S \cup (S+2)), we must have that S contains 2; at least one of 2 or 4; at least one of 4 or 6; at least one of 6 or 8; and 8. Similarly, looking at just the odd numbers, S must contain 1; at least one of 1 or 3; at least one of 3 or 5; at least one of 5 or 7; at least one of 7 or 9; and 9. And S will satisfy the overall condition if and only if its even part and its odd part do what they need to do; there’s no interaction between them.

So now what? Consider the set {1, 2, 4, 5, 7, 8, 9} from the list above; call it S. Its even part is {2, 4, 8} and its odd part is {1, 5, 7, 9}. We can take the even part and divide everything by 2 to obtain a set T = {1, 2, 4}; we can take the odd part, divide through by 2, and round up to get U = {1, 3, 4, 5}. The conditions transform similarly; in order for S \subset \{1, 2, \ldots, 9\} to satisfy \{1, \ldots, 11\} = (S \cup (S+2)), we have to have that:

  • the set T, obtained from S by taking its even subset and dividing through by 2, contains 1; at least one of 1 or 2; at least one of 2 or 3; at least one of 3 or 4; and 4;
  • the set U, obtained from S by taking its odd subset and dividing through by 2, contains 1; at least one of 1 or 2; at least one of 2 or 3; at least one of 3 or 4; at least one of 4 or 5; and 5.

Of course it’s easier to say that T is a subset of {1, 2, 3, 4} containing 1, 4, and at least one of every two consecutive elements, and U is a similar subset of {1, 2, 3, 4, 5}. So how many subsets of {1, 2, 3, …, n} contain 1, n, and at least one of every two consecutive elements?

This is finally where the Fibonacci numbers come in. It’s easier to count possible complements of T. If T satisfies the condition given above, then its complement with respect to {1, 2, …, n} is a subset of {2, 3, …, n-1} containing no two consecutive elements. Call the number of such sets f(n). For n = 3 there are two such subsets of {2}, namely the empty set and {2} itself; for n = 4 there are three such subsets of {2, 3}, namely the empty set, {2}, and {3}. So f(3) = 2, f(4) = 3. Now to find f(n) if n > 4. This is the number of subsets of {2, 3, …, n-1} which contain no two consecutive elements. These can be divided into two types: those which contain n-1 and those which don’t. Those which contain n-1 can’t contain n-2, and so are just subsets of {2, 3, …, n-3} which contain no two consecutive elements, with n-1 added. There are f(n-2) of these. Those which don’t contain n-1 are just subsets of {2, 3, …, n-2} with no two consecutive elements; there are f(n-1) of these. So f(n) = f(n-1) + f(n-2); combined with the initial conditions we see that f(n) is just the nth Fibonacci number.
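
Here’s a brute-force check of that recursion (again my own sketch, not from the post; the function name f_brute is mine): count the subsets of {2, 3, …, n-1} with no two consecutive elements directly and watch the Fibonacci numbers appear.


f_brute = function(n) {
  elements = 2:(n-1)
  count = 0
  for (k in 0:(2^length(elements) - 1)) {
    subset = elements[which(intToBits(k) == 1)]
    if (all(diff(subset) > 1)) count = count + 1 # empty and one-element subsets pass vacuously
  }
  count
}
sapply(3:10, f_brute) # 2 3 5 8 13 21 34 55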

So we can finally compute g(n). For a set S to satisfy the condition defining g(n) — that is, to have S \subset \{1, 2, \ldots, n\} and S \cup (S+2) = \{1, 2, \ldots, n+2\} — we need the corresponding T to be a subset of \{1, 2, \ldots, \lfloor n/2 \rfloor\} containing 1, \lfloor n/2 \rfloor, and at least one of every two consecutive elements, and the corresponding U to be a subset of \{1, 2, \ldots, \lceil n/2 \rceil\} satisfying the same kind of condition; equivalently, T^c and U^c must contain no two consecutive elements. The choices of T and U are independent, so the number of ways to do this is exactly F_{\lfloor n/2 \rfloor} F_{\lceil n/2 \rceil}, which is what we wanted.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Neighborhood boundaries

Crowdsourced neighborhood boundaries in Boston, from Bostonography, based on an ongoing survey that asks people to draw the outlines of neighborhoods on a map. One nice feature of these maps is the fuzzy edges; shading indicates points that less than 25%, 25 to 50%, 50 to 75%, or more than 75% of people think are in a given neighborhood. I think this is interesting even if you don’t know Boston; it’s fascinating how some neighborhoods have well-defined edges (major streets seem to play this role) and some just radiate out from a center.

This reminds me of an older project in San Francisco. Data here comes from Craigslist housing posts and an on-site form. The methodology is somewhat different in that it just asks you to give an address and say what neighborhood it’s in. I would imagine this tends to give boundaries that hew less closely to straight lines. I could imagine someone in San Francisco thinking, for example, that Cesar Chavez is the boundary between the Mission and Bernal Heights (I actually had this conversation yesterday, because I’m moving into a place near there), but the same person might not use this as their decision rule when asked about a particular address in a neighborhood with which they were familiar.

What bothers me about using Craigslist housing posts as a data source is that “more desirable” neighborhoods tend to grow on Craigslist, as people with places for rent who are anywhere near a neighborhood border try to sort into the more desirable neighborhoods. In clicking through Craigslist I saw stretches of the truth, some by only a few hundred feet (see above) and some by a couple of miles (43rd Avenue is not the Inner Sunset, it’s half a mile from the ocean!). The “good” neighborhoods tend to expand, and the Inner Sunset has cachet that the Sunset as a whole doesn’t right now. For two years I’ve lived a couple hundred feet south (i.e., Oakland-side) of the Oakland-Berkeley border, but my landlord lists the place as being in Berkeley. And I have to give a hat tip to my old neighborhood in Philadelphia: this is West Philly, “University City” is a marketing scheme.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Weekly links for July 15

Felix Salmon on how economists get tripped up by statistics, a summary of The illusion of predictability: how regression statistics mislead experts by Emre Soyer and Robin Hogarth. Here’s Andrew Gelman’s response.

Jeffrey Stanton of Syracuse has written an introductory textbook on data science, available free online in PDF format. (There’s also an interactive version for the iPad, which I can’t vouch for because I don’t have an iPad.)

Graham Wheeler writes for Significance magazine on the probability that your home HIV test is a false alarm.

Derek Thompson at The Atlantic tells us eleven ways in which consumers are bad at math.

How high would you have to launch something in New York for it to be visible in Los Angeles?

A guide to mining Twitter with R.

Norm Matloff’s From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science is a text for a course in mathematical probability and statistics for CS students.

What are the chances of finding a golden ticket in a Caramilk bar? Jeffrey Rosenthal knows. (Incidentally, Jeffrey Rosenthal is my academic uncle. I also have a biological uncle named Jeffrey. What are the chances of that?)

At New York’s Lincoln Center, this week, there’s an opera based on the life of Émilie du Châtelet.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

People like to move on weekends, but also the first of the month…

A thought: lots of leases on apartments are timed to calendar months. Are people even ever-so-slightly more likely to move when the first of the month falls on a weekend, as opposed to midweek? Obviously this wouldn’t be a big effect, but it’s at least slightly easier to move on the weekend, and I could imagine someone with a month-to-month tenancy wanting to move on either August 1 or September 1 this year, noticing that August 1 is a Wednesday and September 1 is a Saturday, and deciding September would be better. (This might be more true in markets less crazy than San Francisco.)

This is, as it turns out, not hypothetical. I’m looking to move for August 1. (The link goes to my Craigslist “housing wanted” ad. Short version: in SF, non-ridiculous commute to downtown, would like to pay around $1000. Not included in that link: you get to live with someone who’s at least very mildly Internet-famous, being a math blogger and all.) I visited a place where the old roommate was moving out the weekend before August 1, because it’s a weekend; I’ve also seen ads for places that are available August 5, presumably for this reason.

Also, still: I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Bitwise set trickery

James Tanton asked last week on Twitter: How many subsets S of {1, 2, …, 10} have the property that S and S+2 cover all the numbers from 1 to 12?

(I’ve taken the liberty of expanding some abbreviations here because my blog, unlike Twitter, doesn’t have a 140-character limit.  Also, Tanton gave a hint which I’m omitting; you can click through to the link if you want it.)

In case you don’t know the notation, S + 2 means the set that we get from S by adding 2 to each element. So, for example, if S = {3, 5, 9} then S + 2 = {5, 7, 11}.

This is, with a clever trick, a one-liner in R:

sum(bitOr(0:1023, 4*(0:1023)) == 4095)

which outputs 25, the answer to the question.

(bitOr comes from the bitops package.)

Why does this work? We can represent each subset of {1, 2, …, 10} as an integer in 0, 1, …, 1023 = 2^10 – 1. Namely, we represent the set S by
\sum_{s \in S} 2^{s-1}
so, for example, the set {3, 5, 9} is represented by the integer 2^2 + 2^4 + 2^8 = 276, or 100010100 in binary. (Read from right to left, it’ll make more sense.) Multiplying an integer by 4 = 2^2 then corresponds to adding 2 to each element — so 1104 = 4 × 276 = 2^4 + 2^6 + 2^10 represents the set {5, 7, 11}. 1104 is 10001010000 in binary; we just add two zeroes to the end.
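
In code the encoding is a one-liner; here’s a tiny helper (mine, not part of the original post) applied to the running example:


set_to_int = function(S) sum(2^(S - 1))
set_to_int(c(3, 5, 9))     # 276
4 * set_to_int(c(3, 5, 9)) # 1104, which encodes {5, 7, 11}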

The union of two sets then just becomes bitwise OR. We can line up our two integers (right-aligned, padding the shorter one with zeroes on the left) and then, in a third row, write a 1 in any column that has a 1 in either position above it:

00100010100
10001010000
-----------
10101010100

The result 10101010100 (binary) corresponds to the set {3, 5, 7, 9, 11}, the union of the two original sets. The bitOr command does this for each possible set; 4095 is 2^12 – 1, or 111111111111 in binary, and corresponds to the set of all the numbers from 1 to 12. So we’re just summing up a vector of TRUEs and FALSEs, with a TRUE in positions corresponding to sets S for which S and S+2 cover all of 1 through 12 and FALSEs otherwise; in this context R treats TRUE as 1 and FALSE as 0, so we get the number of TRUEs.
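
Putting the pieces together for the running example, using the set_to_int helper sketched above (bitOr is again from the bitops package):


library(bitops)
s = set_to_int(c(3, 5, 9))
bitOr(s, 4 * s)             # 1364, which encodes {3, 5, 7, 9, 11}
bitOr(s, 4 * s) == 2^12 - 1 # FALSE: this particular S does not cover 1 through 12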

But why is the answer 25? Can we find a general formula? Think about it; I’ll post the solution tomorrow.

The rain in Philadelphia falls mainly… when?

Tony Wood, who writes about weather for my hometown paper, the Philadelphia Inquirer, observes that at the current rate, “2012 precipitation in Philadelphia would finish at 26.87, which would make it the driest year on record”. My native Philadelphia has had very little rain so far this year, only 14.06 inches through yesterday; this is the fourth-lowest amount of rain to have fallen through July 10, behind only 1992, 1995, and 1963. (Wood gives 1922 instead of 1992.)

Wood’s projection is essentially right, although the calculation is actually a bit off because of leap year. Not only would 26.87 inches be an all-time minimum, but it would be far below the old minimum of 29.48 inches in 1922. (1922 actually had 20.81 inches of rain by July 9, which is about average, but it had the second-driest second half in the 127 years of records I have.) (Disclaimer: I’m calculating this by adding up the Franklin Institute’s daily data, and it may differ from what you see tabulated elsewhere.)

So if Philly is only at fourth-lowest rain year-to-date right now, why would keeping up at the same pace lead to the all-time lowest amount of rain?

First, rain is seasonal. It turns out that this actually isn’t a problem, though, in Philadelphia’s climate; between 1873 and 1999, in an average year 51.7 percent of the rain fell in the 191 days, or 52.3 percent of the year, up to July 9. (That’s in common years; it’s a bit different in leap years.)

More importantly, though, there’s regression to the mean. One might naively assume that if it rains more in the first half of the year, we should expect it to rain more in the second half of the year as well. Still, the rainiest first half is likely not to come in the same year as the rainiest second half, and the driest first half is likely not to come in the same year as the driest second half, since the correlation is imperfect.

Actually, the correlation is very imperfect. The coefficient of correlation between the amount of rain in the first half of the year and in the second half of the year is about -0.05. That’s right, it’s negative! (But it’s not significantly different from zero.) The amount of rain in the first half of the year tells us basically nothing about the second half. The regression line for predicting amount of rain in the second half of the year from the first half is

(second half rain) = (21.29 inches) – 0.05779 (first half rain)

but the slope of the line has standard error 0.1066. See the plot below:

We should expect this year to be drier than average in Philadelphia, overall, but only because the first half was so dry. The regression line for predicting total year-end rain from first-half rain is

(year rain) = (21.29 inches) + 0.9422 (first half rain)

which you could have guessed; just add first half rain to the first equation. A scatterplot is below:

For this year, the first-half rain is 14.06 inches; the predicted second-half rain is 20.47 inches, for an overall total of 34.54 inches. This is drier than all but 21 years in the 127-year sample, or about one out of six. 2012 as a whole will likely be dry, but not historically dry.
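
For what it’s worth, here’s a sketch of how that calculation might look in R. It assumes a data frame rain with one row per year and columns first_half and second_half, in inches; those names, like the wrangling of the Franklin Institute’s daily data into that form, are my assumption and not from the original post.


fit = lm(second_half ~ first_half, data = rain)
coef(fit)                              # roughly 21.29 and -0.058
summary(fit)$coefficients              # includes the standard error on the slope
cor(rain$first_half, rain$second_half) # about -0.05
14.06 + predict(fit, newdata = data.frame(first_half = 14.06)) # predicted 2012 total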

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Visualizing commute-mode shares

From the department of semi-useless plots: Wikipedia’s Major U. S. City Commute Patterns, 2006 plots the share of people commuting to work by public transit against the share of people commuting to work by car for major American cities. Most of the points in this plot fall pretty close to a straight, downward-sloping line, as you’d expect, because these are in most places the two most common ways to get to work and they should come close to adding up to 100%. In actuality, though, they generally add up to less than 100%, because there are also pedestrian commuters and bicycle commuters.

I say this plot is semi-useless because there are two dimensions and really only one is used. I suppose it could be called “semi-useful”, if I were more optimistic.

Can we do anything else with the same data? I don’t have the original data from which the plot was generated, but the good people at carfree census have some similar data. The link there gives the proportion of commuters who bike, walk, and take public transit in each of the top 50 cities. (Unfortunately it’s from the 2000 census; a lot more people are biking now.) The Wikipedia chart includes only 31 cities; I’ll admit that I expanded to 50 so that I could include Oakland, where I live.

Here’s one example which does reasonably well at spreading out the points into two dimensions and also has good ways of interpreting the x-axis and the y-axis.

On the x-axis plot the logarithm of the proportion of commuters who get to work by means other than driving.

It turns out that the proportion of self-propelled commuters (walkers and bikers), divided by the square root of the proportion of public transit commuters, is typically about .17, and is close to being uncorrelated with the proportion of public transit commuters. For example, in Oakland (18.18% public transit, 5.16% self-propelled) this is (0.0516)/sqrt(0.1818) or about 0.121, a relatively low value. That quotient might be a good measure of the relative friendliness of the city to the self-propelled and to transit; high values mean the city is self-propelled-friendly (or transit-hostile), and low values mean the city is transit-friendly (or self-propelled-hostile). Plot this on the y-axis.  The plot below is what you get.
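
As a concrete example of that y-axis quantity, here’s the Oakland calculation in R; the little data frame is just my own illustration, not the carfree census format.


shares = data.frame(city = "Oakland", transit = 0.1818, self_propelled = 0.0516)
with(shares, self_propelled / sqrt(transit)) # about 0.121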

Note that, strictly speaking, it could be constructed from the Wikipedia plot by suitable stretching. (Since the two are actually generated from slightly different data this might not be apparent from looking at them.) I’m not sure what to make of it. In particular interpreting the y-coordinate seems tricky. In the upper left we have Virginia Beach, Colorado Springs, Mesa, Albuquerque, and Tucson as cities with low public transit and high rates of self-propelledness for their transit rate; in the upper right we have Boston, Washington, San Francisco, and Philadelphia as cities with high public transit and high self-propelledness for their transit rate. But what do these two groups of cities really have in common, and how do they differ from cities at the bottom, with low self-propelledness  rates compared to their transit rates, like Charlotte, Dallas, Detroit, and Atlanta? I leave answering this question to the city planners.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Optimal coinage-system design

“If a currency system could only have two coins, what should the values of those coins be?” – from Numberplay. Implicit in the question (the way it’s stated there) is that there are 100 cents in a dollar; I’m going to generalize to a dollar consisting of k cents.

Let’s say that we’re requiring that you be able to make change for any number of cents from 0 to k-1, and that we’d like to be able to do this with the smallest possible number of coins, on average. In order to make change for one cent we will need a one-cent coin. So there’s really only one parameter to play with — we’ll have two coins, of values 1 and n — and we want to choose n to minimize the average number of coins needed. We’ll assume that every possible amount of change from 0 to k-1 is equally likely. (This is probably not true, but the way in which it’s not true should evolve in tandem with the currency system. For example, in the US we have a 25-cent coin and so lots of prices are multiples of 25 cents. In the eurozone the comparable coin is a 20-cent coin; are prices which are multiples of 20 cents common?)

So we can break this down into, on average, how many 1-cent coins we’ll need and how many n-cent coins we’ll need when making change. On average we’ll need (n-1)/2 1-cent coins, if n happens to be a factor of k. For example if n is 5 (that is, the other coin is a nickel) then on average we need 2 pennies, since we’re equally likely to need 0, 1, 2, 3, or 4 pennies. Let’s ignore the 1 and call this n/2. And to make change for m cents we’ll need about m/n n-cent coins. On average we’re making change for k/2 cents, so on average we need k/(2n) n-cent coins.

So we want to minimize k/(2n) + n/2 as a function of n. Differentiating with respect to n gives -k/(2n^2) + 1/2; this is zero when n = \sqrt{k}. So if you only have two coins, you want a one-cent coin and a (√k)-cent coin. Then on average you’d need (√k)/2 pennies and (√k)/2 n-cent coins, or a total of √k coins, when making change.

In the US case this means you want a penny and a dime. As it turns out an 11-cent coin would work just as well. The R function

# for each amount m from 0 to k-1, count floor(m/n) n-cent coins plus m %% n one-cent coins
f = function(k,n){sum(((0:(k-1))%%n) + floor((0:(k-1))/n))}

will give you the total number of coins needed to make each of 0, 1, …, k-1 cents out of 1-cent and n-cent coins. Interestingly, f(100, 10) and f(100, 11) both return 900 — that is, we’d need 900/100 = 9 coins, on average, if we had only pennies and dimes, or if we had pennies and 11-cent coins. For practical purposes, of course, we also want coin values that make mental arithmetic easy.

It seems reasonable to guess that if you have three coins you’d want coins worth roughly 1, k^{1/3}, and k^{2/3} cents, and in general denominations should be roughly evenly spaced on a logarithmic scale; this seems to be the principle that, say, euro coins/notes are based on. These are valued at 1 cent, 2 cents, 5 cents, and powers of ten times these. It’s also the principle that US currency would be based on, except for the historical factors that have led to people not using half-dollar coins and $2 bills. But it takes a bit more thought to figure out the optimal ways to make change in that situation, and it’s more than a one-liner to do the computation… any thoughts?
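
One way to do that computation, sketched below (my approach, not necessarily the intended one; min_coins is my own function), is dynamic programming over amounts: for an arbitrary set of denominations, find the fewest coins needed for each amount from 0 to k-1 and then average.


min_coins = function(k, denoms) {
  best = c(0, rep(Inf, k - 1)) # best[m + 1] = fewest coins needed to make m cents
  for (m in 1:(k - 1)) {
    for (d in denoms[denoms <= m]) {
      best[m + 1] = min(best[m + 1], best[m - d + 1] + 1)
    }
  }
  best
}
mean(min_coins(100, c(1, 10)))    # 9, matching f(100, 10)/100 above
mean(min_coins(100, c(1, 5, 25))) # 5.5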

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Another linkdump

I’ve been on vacation this week, hence the lack of posts while I enjoy the ocean and seeing my family, but here are some links to things I’ve read.

From Mark Liberman at Language Log, Macroscopic bosons among us; apparently graduate course enrollments at UPenn follow Bose-Einstein statistics.

Dionysis Zindros has written a Gentle Introduction to Algorithm Complexity Analysis.

A network theory analysis of football strategies, by Javier López Peña and Hugo Touchette. (Exercise for the reader: given the names of the authors, what kind of football are we talking about?)

How fivethirtyeight incorporates the economy into its political forecasting models.

Data analysis recipes: probability calculus for inference and Data analysis recipes: fitting a model to data, by David Hogg, via John D. Cook and Andrew Gelman. Described as “chapters from a non-existent book.”

Hilary Mason, bit.ly’s chief scientist, gave a 33-minute talk “Machine Learning for Hackers”.

Carnival of Mathematics 88

Algorithms with finite expected running time and infinite variance, from CS Theory stackexchange.

Laura McLay discusses the optimal false alarm rate for tornado warnings.

From Math Goes Pop!, Ranking baseball teams and Mathematical analysis of the half-your-age-plus-seven rule. (I’d like to see some data for the latter.)

Behindness in National Novel Writing Month, by Andrew Taylor.

How long does it take to get pregnant? by Richie Cotton. (This is self-interested data analysis, as Cotton’s girlfriend has a biological clock.)

Uber asks What San Francisco neighborhood is most like New York?, among other neighborhood-comparison questions.

How to convert rugby scores to football/soccer scores.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.