Weekly links for July 15

Felix Salmon on how economists get tripped up by statistics , a summary of The illustion of predictability: how regression statistics mislead experts by Emre Soyer and Robin Hogarth. Here’s Andrew Gelman’s response.

Jeffrey Stanton of Syracuse has written an introductory textbook on data science, available free online in PDF format. (There’s also an interactive version for the iPad, which I can’t vouch for because I don’t have an iPad.)

Graham Wheeler writes for Significance magazine on the probability that your home HIV test is a false alarm.

Derek Thompson at The Atlantic tells us eleven ways in which consumers are bad at math

how high would you have to launch something in New York for it to be visible in Los Angeles?

A guide to mining Twitter with R.

Norm Matloff’s text From Algorithms to Z-Score: Probabilistic and Statistical Modeling in Computer Science is a text for a course in mathematical probability and statistics for CS students.

What are the chances of finding a golden ticket in a Caramilk bar? Jeffrey Rosenthal knows. (Incidentally, Jeffrey Rosenthal is my academic uncle. I also have a biological uncle named Jeffrey. What are the chances of that?)

At New York’s Lincoln Center, this week, there’s an opera based on the life of Émilie du Châtelet.

I’m looking for a job, in the SF Bay Area. See my my LinkedIn profile

People like to move on weekends, but also the first of the month…

A thought: lots of leases on apartments are timed to calendar months. Are people even ever-so-slightly more likely to move when the first of the month falls on a weekend, as opposed to midweek. Obviously this wouldn’t be a big effect, but it’s at least slightly easier to move on the weekend, and I could imagine someone with a month-to-month tenancy wanting to move on either August 1 or September 1 this year, noticing that August 1 is a Wednesday and September 1 is a Saturday, and deciding September would be better. (This might be more true in markets less crazy than San Francisco.)

This is, as it turns out, not hypothetical. I’m looking to move for August 1. (The links goes to my Craigslist “housing wanted” ad. Short version: in SF, non-ridiculous commute to downtown, would like to pay around $1000. Not included in that link: you get to live with someone who’s at least very mildly Internet-famous, being a math blogger and all.) I visited a place where the old roommate was moving out the weekend before because it’s the weekend; I’ve also seen ads for places that are available August 5, presumably for this reason.

Also, still: I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Bitwise set trickery

James Tanton asked last week on Twitter: How many subsets S of {1, 2, …, 10} have the property that S and S+2 cover all the numbers from 1 to 12?

(I’ve taken the liberty of expanding some abbreviations here because my blog, unlike Twitter, doesn’t have a 140-character limit.  Also, Tanton gave a hint which I’m omitting; you can click through to the link if you want it.)

In case you don’t know the notation, S + 2 means the set that we get from S by adding 2 to each element. So, for example, if S = {3, 5, 9} then S + 2 = {5, 7, 11}.

This is, with a clever trick, a one-liner in R:

sum(bitOr(0:1023, 4*(0:1023)) == 4095)

which outputs 25, the answer to the question.

(bitOr comes from the bitops package.)

Why does this work? We can represent each subset of {1, 2, …, 10} as an integer in 0, 1, …, 1023 = 210 – 1. Namely, we represent the set S by
\sum_{s \in S} 2^{s-1}
so, for example, the set {3, 5, 9} is represented by the integer 22 + 24 + 28 = 276, or 100010100 in binary. (Read from right to left, it’ll make more sense.) Multiplying an integer by 4 = 22 then corresponds to adding 2 to each element — so 1104 = 4 × 276 = 24 + 26 + 210 represents the set {5, 7, 11}. 1104 is 10001010000 in binary; we just add two zeroes to the end.

The union of two sets then just becomes bitwise OR. We can line up our two integers (right-aligned, padding the shorter one with zeroes on the left) and then, in a third row, write a 1 in any column that has a 1 in either position above it:


The result 10101010100 (binary) corresponds to the set {3, 5, 7, 9, 11}, the union of the two original sets. The bitOr command does this for each possible set; 4095 is 212-1 in binary and corresponds to the set of all the numbers from 1 to 12. So we’re just summing up a vector of TRUEs and FALSEs, with a TRUE in positions corresponding to sets S for which S and S+2 cover and FALSEs otherwise; in this context R treats TRUE as 1 and FALSE as 0, so we get the number of TRUEs.

But why is the answer 25? Can we find a general formula? Think about it; I’ll post the solution tomorrow.

The rain in Philadelphia falls mainly… when?

Tony Wood, who writes about weather for my hometown paper, the Philadelphia Inquirer, observes that at the current rate, “2012 precipitation in Philadelphia would finish at 26.87, which would make it the driest year on record”. My native Philadelphia has had very low rain so far this year, only 14.06 inches through yesterday; this is the fourth-lowest amount of rain to have occurred through July 10, behind only 1992, 1995, and 1963. (Wood gives 1922 instead of 1992.)

This is essentially true, although the calculation is actually a bit off because of leap year.) Not only is this an all-time minimum, but it’s far below the old minimum of 29.48 inches in 1922. (1922 actually had 20.81 inches of rain by July 9, which is about average, but it had the second-driest second half in the 127 years of records I have. (Disclaimer: I’m calculating this by adding up the Franklin Institute’s daily data, and it may differ from what you see tabulated elsewhere.)

So if Philly is only at fourth-lowest rain year-to-date right now, why would keeping up at the same pace lead to the all-time lowest amount of rain?

First, rain is seasonal. It turns out that this actually isn’t a problem, though, in Philadelphia’s climate; between 1873 and 1999, in an average year 51.7 percent of the rain fell in the 191 days, or 52.3 percent of the year, up to July 9. (That’s in common years; it’s a bit different in leap years.)

More importantly, though, there’s regression to the mean. One might naively assume that if it rains more in the first half of the year, we should expect it to rain more in the second half of the year as well. Still, the rainiest first half is likely not to come in the same year as the rainiest second half, and the driest first half is likely not to come in the same year as the driest second half, since the correlation is imperfect.

Actually, the correlation is very imperfect. The coefficient of correlation between the amount of rain in the first half of the year and in the second half of the year is about -0.05. That’s right, it’s negative! (But it’s not significantly different from zero.) The amount of rain in the first half of the year tells us basically nothing about the second half. The regression line for predicting amount of rain in the second half of the year from the first half is

(second half rain) = (21.29 inches) – 0.05779 (first half rain)

but the slope of the line has standard error 0.1066. See the plot below:

We should expect this year to be drier than average in Philadelphia, overall, but only because the first half was so dry. The regression line for predicting total year-end rain from first-half rain is

(year rain) = (21.29 inches) + 0.9422 (first half rain)

which you could have guessed; just add first half rain to the first equation. A scatterplot is below:

For this year, the first-half rain is 14.06 inches; the predicted second-half rain is 20.47 inches, for an overall total of 34.54 inches. This is drier than all but 21 years in the 127-year sample, or about one out of six. 2012 as a whole will likely be dry, but not historically dry.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Visualizing commute-mode shares

From the department of semi-useless plots: Wikipedia’s Major U. S. City Commute Patterns, 2006 plots the share of people commuting to work by public transit against the share of people commuting to work by car for major American cities. But most of the points in this plot fall pretty close to a straight, downward-sloping line, as you’d expect, because these are in most places the two most common ways to get to work and they should come close to adding up to 100%. But in actuality they generally add up to less than 100% because there are pedestrian commuters and bicycle commuters.

I say this plot is semi-useless because there are two dimensions and really only one is used. I suppose it could be called “semi-useful”, if I were more optimistic.

Can we do anything else with the same data? I don’t have the original data from which the plot was generated, but the good people at carfree census have some similar data. The link there gives the proportion of commuters who bike, walk, and take public transit in each of the top 50 cities. (Unfortunately it’s from the 2000 census; a lot more people are biking now, The Wikipedia chart includes only 31 cities; I’ll admit that I expanded to 50 so that I could include Oakland, where I live.)

Here’s one example which does reasonably well at spreading out the points into two dimensions and also has good ways of interpreting the x-axis and the y-axis.

On the x-axis plot the logarithm of the proportion of commuters who get to work by means other than driving.

It turns out that the proportion of self-propelled commuters (walkers and bikers), divided by the square of the proportion of public transit commuters, is typically about .17, and is close to being uncorrelated with the proportion of public transit commuters. For example, in Oakland (18.18% public transit, 5.16% self-propelled) this is (0.0516)/sqrt(0.1818) or about 0.121, a relatively low value. That quotient might be a good measure of the relatively friendliness of the city to the self-propelled and to transit; high values mean the city is self-propelled-friendly (or transit-hostile), and low values mean the city is transit-friendly (or self-propelled-hostile). Plot this on the y-axis.  The plot below is what you get.

Note that, strictly speaking, it could be constructed from the Wikipedia plot by suitable stretching. (Since the two are actually generated from slightly different data this might not be apparent from looking at them.) I’m not sure what to make of it. In particular interpreting the y-coordinate seems tricky. In the upper left we have Virginia Beach, Colorado Springs, Mesa, Albuquerque, and Tucson as cities with low public transit and high rates of self-propelledness for their transit rate; in the upper right we have Boston, Washington, San Francisco, and Philadelphia as cities with high public transit and high self-propelledness for their transit rate. But what do these two groups of cities really have in common, and how do they differ from cities at the bottom, with low self-propelledness  rates compared to their transit rates, like Charlotte, Dallas, Detroit, and Atlanta? I leave answering this question to the city planners.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Optimal coinage-system design

“If a currency system could only have two coins, what should the values of those coins be?” – from Numberplay. Implicit in the question (the way it’s stated there) is that there are 100 cents in a dollar; I’m going to generalize to a dollar consisting of k cents.

Let’s say that we’re requiring that you be able to make change for any number of cents from 0 to k-1. and that we’d like to be able to do this with the smallest possible number of coins, on average. In order to make change for one cent we will need a one-cent coin. So there’s really only one parameter to play with — we’ll have two coins, of values 1 and n — and we want to choose n to minimize the average number of coins needed. We’ll assume that every possible amount of change from 0 to k-1 is equally likely. (This is probably not true, but the way in which it’s not true should evolve in tandem with the currency system. For example, in the US we have a 25-cent coin and so lots of prices are multiples of 25 cents. In the eurozone the comparable coin is a 20-cent coin; are prices which are multiples of 20 cents common?)

So we can break this down into, on average, how many 1-cent we’ll need and how many n-cent coins we’ll need when making change. On average we’ll need (n-1)/2 1-cent coins, if n happens to be a factor of 100. For example if n is 5 (that is, the other coin is a nickel) then on average we need 2 pennies, since we’re equally likely to need 0, 1, 2, 3, or 4 pennies. Let’s ignore the 1 and call this n/2. And to make change for m cents we’ll need m/n n-cent coins. On average we’re making change for k/2 cents, so on average we need k/2n n-cent coins.

So we want to minimize k/2n + n/2 as a function of n. Differentiating with respect to n gives -k/(2n^2) + 1/2; this is zero when n = \sqrt{k}. So if you only have two coins, you want a one-cent coin and a (√ k)-cent coin. Then on average you’d need (√ k)/2 pennies and (√ k)/2 n-cent coins, or a total of √k coins, when making change.

In the US case this means you want a penny and a dime. As it turns out an 11-cent coin would work just as well. The R function

f = function(k,n){sum(((0:(k-1))%%n) + floor((0:(k-1))/n))}

will give you the total number of coins needed to make each of 0, 1, …, k-1 cents out of 1-cent and n-cent coins. Interestingly, f(100, 10) and f(100, 11) both return 900 — that is, we’d need 900/100 = 9 coins, on average, if we had only pennies and dimes, or if we had pennies and 11-cent coins. For practical purposes, of course, we also want coins that are easily positioned for mental arithmetic.

It seems reasonable to guess that if you have three coins you’d want coins worth roughly 1, k1/3, and k2/3 cents, and in general denominations should be evenly spaced; this seems to be the principle that, say, euro coins/notes are based on. These are valued at 1 cent, 2 cents, 5 cents, and powers of ten times these. It’s also the principle that US currency would be based on except for the historical factors that have led to people not using half-dollar coins and $2 bills.) But it takes a bit more thought to figure out the optimal ways to make change in that situation, and it’s more than a one-liner to do the computation… any thoughts?

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Another linkdump

I’ve been on vacation this week, hence the lack of posts while I enjoy the ocean and seeing my family, but here are some links to things I’ve read.

From Mark Liberman at Language Log, Macroscopic bosons among us; apparently graduate course enrollments at UPenn follow Bose-Einstein statistics.

Dionysis Zindros has written a Gentle Introduciton to Algorithm Complexity Analysis.

A network theory analysis of football strategies, by Javier López Peña and Hugo Touchette. (Exercise for the reader: given the names of the authors, what kind of football are we talking about?)

How fivethirtyeight incorporates the economy into its political forecasting models.

Data analysis recipes: probability calculus for inference and Data analysis recipes: fitting a model to data, by David Hogg via John D. Cook and Andrew Gelman. Described as “chapters from a non-existent book.

Hilary Mason, bit.ly’s chief scientist, gave a 33-minute talk “Machine Learning for Hackers”.

Carnival of Mathematics 88

Algorithms with finite expected running time and infinite variance, from CS Theory stackexchange.

Laura McLay discusses the optimal false alarm rate for tornado warnings.

From Math Goes Pop!, Ranking baseball teams and Mathematical analysis of the half-your-age-plus-seven rule. (I’d like to see some data for the later.)

Behindness in National Novel Writing Month by Andrew Taylor

How long does it take to get pregnant? by Richie Cotton. (This is self-interested data analysis, as Cotton’s girlfriend has a biological clock.)

Uber asks What San Francisco neighborhood is most like New York?, among other neighborhood-comparison questions.

How to convert rugby scores to football/soccer scores.

I’m looking for a job, in the SF Bay Area. See my linkedin profile.