The rain in Philadelphia falls mainly… when?

Tony Wood, who writes about weather for my hometown paper, the Philadelphia Inquirer, observes that at the current rate, “2012 precipitation in Philadelphia would finish at 26.87, which would make it the driest year on record”. My native Philadelphia has had very low rain so far this year, only 14.06 inches through yesterday; this is the fourth-lowest amount of rain to have occurred through July 9, behind only 1992, 1995, and 1963. (Wood gives 1922 instead of 1992.)

Wood’s projection is essentially right, although the calculation is actually a bit off because of leap year. Not only would this be an all-time minimum, but it’s far below the old minimum of 29.48 inches in 1922. (1922 actually had 20.81 inches of rain by July 9, which is about average, but it had the second-driest second half in the 127 years of records I have.) (Disclaimer: I’m calculating this by adding up the Franklin Institute’s daily data, and it may differ from what you see tabulated elsewhere.)
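The leap-year remark can be checked with a few lines of arithmetic. The sketch below assumes the 14.06-inch total covers January 1 through July 9, 2012 (191 days); the function name is made up for illustration. Under that assumption, scaling to a 365-day year reproduces the quoted 26.87 inches, while scaling to the 366 days that 2012 actually has gives a slightly larger figure.

```python
from datetime import date

# Year-to-date rainfall from the post, in inches.
rain_to_date = 14.06

# Assumption: the total runs from January 1 through July 9, 2012, inclusive.
days_elapsed = (date(2012, 7, 9) - date(2012, 1, 1)).days + 1   # 191 days
days_in_2012 = (date(2013, 1, 1) - date(2012, 1, 1)).days       # 366 days


def projected_total(rain, elapsed, year_length):
    """Scale rainfall so far up to a full year at the same daily rate."""
    return rain * year_length / elapsed


# Treating 2012 as a 365-day year versus the 366-day year it really is.
common_year_projection = projected_total(rain_to_date, days_elapsed, 365)
leap_year_projection = projected_total(rain_to_date, days_elapsed, days_in_2012)

print(f"365-day year: {common_year_projection:.2f} inches")
print(f"366-day year: {leap_year_projection:.2f} inches")
```

Either way the projection stays well under the old minimum of 29.48 inches, so the leap-year correction does not change the conclusion.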

So if Philly is only at fourth-lowest rain year-to-date right now, why would keeping up at the same pace lead to the all-time lowest amount of rain?

First, rain is seasonal. It turns out that this actually isn’t a problem, though, in Philadelphia’s climate; between 1873 and 1999, in an average year 51.7 percent of the rain fell in the 191 days, or 52.3 percent of the year, up to July 9. (That’s in common years; it’s a bit different in leap years.)

More importantly, though, there’s regression to the mean. One might naively assume that if it rains more in the first half of the year, we should expect it to rain more in the second half of the year as well. But even if the two halves were positively correlated, the rainiest first half is likely not to come in the same year as the rainiest second half, and the driest first half is likely not to come in the same year as the driest second half, since the correlation is imperfect.

Actually, the correlation is very imperfect. The coefficient of correlation between the amount of rain in the first half of the year and in the second half of the year is about -0.05. That’s right, it’s negative! (But it’s not significantly different from zero.) The amount of rain in the first half of the year tells us basically nothing about the second half. The regression line for predicting amount of rain in the second half of the year from the first half is

(second half rain) = (21.29 inches) – 0.05779 (first half rain)

but the slope of the line has standard error 0.1066. See the plot below:

We should expect this year to be drier than average in Philadelphia, overall, but only because the first half was so dry. The regression line for predicting total year-end rain from first-half rain is

(year rain) = (21.29 inches) + 0.9422 (first half rain)

which you could have guessed; just add first half rain to the first equation. A scatterplot is below:

For this year, the first-half rain is 14.06 inches; the predicted second-half rain is 20.47 inches, for an overall total of 34.54 inches. This is drier than all but 21 years in the 127-year sample, or about one out of six. 2012 as a whole will likely be dry, but not historically dry.
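The prediction above can be reproduced from the two regression lines. This is a minimal sketch using the coefficients exactly as printed in the post; because those are rounded, the results can differ from the quoted figures by about 0.01 inch. The function names are invented for illustration.

```python
# Coefficients as printed in the post (already rounded there).
INTERCEPT = 21.29             # inches
SECOND_HALF_SLOPE = -0.05779  # slope for predicting second-half rain
FULL_YEAR_SLOPE = 0.9422      # roughly 1 + SECOND_HALF_SLOPE


def predict_second_half(first_half_rain):
    """Predicted second-half rain (inches) from first-half rain."""
    return INTERCEPT + SECOND_HALF_SLOPE * first_half_rain


def predict_full_year(first_half_rain):
    """Predicted full-year rain (inches) from first-half rain."""
    return INTERCEPT + FULL_YEAR_SLOPE * first_half_rain


first_half_2012 = 14.06
second_half = predict_second_half(first_half_2012)
full_year = predict_full_year(first_half_2012)

print(f"predicted second half: {second_half:.2f} inches")
print(f"predicted full year:   {full_year:.2f} inches")

# Share of the 127-year sample that was drier than the prediction.
print(f"drier years: 21 of 127, or {21 / 127:.1%}")
```

Adding the first-half rain to the second-half prediction gives the same total as the full-year line, which is the "just add first half rain" observation in numbers.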

I’m looking for a job, in the SF Bay Area. See my linkedin profile.

Visualizing commute-mode shares

From the department of semi-useless plots: Wikipedia’s Major U.S. City Commute Patterns, 2006 plots the share of people commuting to work by public transit against the share of people commuting to work by car for major American cities. Most of the points in this plot fall pretty close to a straight, downward-sloping line, as you’d expect, because in most places these are the two most common ways to get to work and they should come close to adding up to 100%. (In actuality they generally add up to less than 100%, because there are pedestrian and bicycle commuters.)

I say this plot is semi-useless because there are two dimensions and really only one is used. I suppose it could be called “semi-useful”, if I were more optimistic.

Can we do anything else with the same data? I don’t have the original data from which the plot was generated, but the good people at carfree census have some similar data. The link there gives the proportion of commuters who bike, walk, and take public transit in each of the top 50 cities. (Unfortunately it’s from the 2000 census; a lot more people are biking now. The Wikipedia chart includes only 31 cities; I’ll admit that I expanded to 50 so that I could include Oakland, where I live.)

Here’s one example which does reasonably well at spreading out the points into two dimensions and also has good ways of interpreting the x-axis and the y-axis.

On the x-axis plot the logarithm of the proportion of commuters who get to work by public transit.

It turns out that the proportion of self-propelled commuters (walkers and bikers), divided by the square root of the proportion of public transit commuters, is typically about .17, and is close to being uncorrelated with the proportion of public transit commuters. For example, in Oakland (18.18% public transit, 5.16% self-propelled) this is (0.0516)/sqrt(0.1818) or about 0.121, a relatively low value. That quotient might be a good measure of the relative friendliness of the city to the self-propelled and to transit; high values mean the city is self-propelled-friendly (or transit-hostile), and low values mean the city is transit-friendly (or self-propelled-hostile). Plot this on the y-axis. The plot below is what you get.
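As a quick check of the Oakland arithmetic, here is a small sketch of the y-axis quotient, dividing by the square root of the transit share as in the worked example. The function name and the comparison against the typical value of 0.17 are illustrative choices, not part of the original analysis.

```python
import math

# "Typical" value of the quotient quoted in the text.
TYPICAL_QUOTIENT = 0.17


def self_propelled_quotient(self_propelled_share, transit_share):
    """Self-propelled share divided by the square root of the transit share."""
    return self_propelled_share / math.sqrt(transit_share)


# Oakland's shares as quoted in the text (2000 census figures).
oakland_transit = 0.1818
oakland_self_propelled = 0.0516

oakland_quotient = self_propelled_quotient(oakland_self_propelled, oakland_transit)
leans = "transit-friendly" if oakland_quotient < TYPICAL_QUOTIENT else "self-propelled-friendly"

print(f"Oakland quotient: {oakland_quotient:.3f} ({leans} relative to {TYPICAL_QUOTIENT})")
```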

Note that, strictly speaking, it could be constructed from the Wikipedia plot by suitable stretching. (Since the two are actually generated from slightly different data this might not be apparent from looking at them.) I’m not sure what to make of it. In particular interpreting the y-coordinate seems tricky. In the upper left we have Virginia Beach, Colorado Springs, Mesa, Albuquerque, and Tucson as cities with low public transit and high rates of self-propelledness for their transit rate; in the upper right we have Boston, Washington, San Francisco, and Philadelphia as cities with high public transit and high self-propelledness for their transit rate. But what do these two groups of cities really have in common, and how do they differ from cities at the bottom, with low self-propelledness rates compared to their transit rates, like Charlotte, Dallas, Detroit, and Atlanta? I leave answering this question to the city planners.


Optimal coinage-system design

“If a currency system could only have two coins, what should the values of those coins be?” – from Numberplay. Implicit in the question (the way it’s stated there) is that there are 100 cents in a dollar; I’m going to generalize to a dollar consisting of k cents.

Let’s say that we’re requiring that you be able to make change for any number of cents from 0 to k-1, and that we’d like to be able to do this with the smallest possible number of coins, on average. In order to make change for one cent we will need a one-cent coin. So there’s really only one parameter to play with (we’ll have two coins, of values 1 and n) and we want to choose n to minimize the average number of coins needed. We’ll assume that every possible amount of change from 0 to k-1 is equally likely. (This is probably not true, but the way in which it’s not true should evolve in tandem with the currency system. For example, in the US we have a 25-cent coin and so lots of prices are multiples of 25 cents. In the eurozone the comparable coin is a 20-cent coin; are prices which are multiples of 20 cents common?)

So we can break this down into, on average, how many 1-cent coins and how many n-cent coins we’ll need when making change. On average we’ll need (n-1)/2 1-cent coins, if n happens to be a factor of k. For example if n is 5 (that is, the other coin is a nickel) then on average we need 2 pennies, since we’re equally likely to need 0, 1, 2, 3, or 4 pennies. Let’s ignore the 1 and call this n/2. And to make change for m cents we’ll need about m/n n-cent coins. On average we’re making change for about k/2 cents, so on average we need about k/(2n) n-cent coins.

So we want to minimize $k/(2n) + n/2$ as a function of n. Differentiating with respect to n gives $-k/(2n^2) + 1/2$; this is zero when $n = \sqrt{k}$. So if you only have two coins, you want a one-cent coin and a (√k)-cent coin. Then on average you’d need (√k)/2 pennies and (√k)/2 n-cent coins, or a total of √k coins, when making change.

In the US case this means you want a penny and a dime. As it turns out an 11-cent coin would work just as well. The R function

f = function(k,n){sum(((0:(k-1))%%n) + floor((0:(k-1))/n))}

will give you the total number of coins needed to make each of 0, 1, …, k-1 cents out of 1-cent and n-cent coins. Interestingly, f(100, 10) and f(100, 11) both return 900; that is, we’d need 900/100 = 9 coins, on average, if we had only pennies and dimes, or if we had pennies and 11-cent coins. For practical purposes, of course, we also want coin values that are easy to do mental arithmetic with.
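For readers without R, here is a Python sketch of the same count, together with a brute-force scan over every possible second coin. The names `total_coins` and `best_second_coins` are made up for this illustration; the formula itself is the one in the R function above.

```python
def total_coins(k, n):
    """Total coins needed to make each of 0, 1, ..., k-1 cents.

    Uses only 1-cent and n-cent coins: m % n pennies plus m // n
    coins of value n for each amount m, as in the R one-liner.
    """
    return sum(m % n + m // n for m in range(k))


def best_second_coins(k):
    """Return every n in 2..k-1 that minimizes total_coins(k, n), plus that minimum."""
    totals = {n: total_coins(k, n) for n in range(2, k)}
    best = min(totals.values())
    winners = [n for n, total in totals.items() if total == best]
    return winners, best


winners, best_total = best_second_coins(100)
print("best second coins:", winners)
print("average coins needed:", best_total / 100)
```

For k = 100 the scan confirms the tie between the dime and the 11-cent coin, with 9 coins on average, a bit under the continuous estimate of √100 = 10 because that estimate ignored the "minus 1" terms.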

It seems reasonable to guess that if you have three coins you’d want coins worth roughly 1, k^(1/3), and k^(2/3) cents, and in general that denominations should be evenly spaced on a logarithmic scale; this seems to be the principle that, say, euro coins/notes are based on. These are valued at 1 cent, 2 cents, 5 cents, and powers of ten times these. (It’s also the principle that US currency would be based on except for the historical factors that have led to people not using half-dollar coins and $2 bills.) But it takes a bit more thought to figure out the optimal ways to make change in that situation, and it’s more than a one-liner to do the computation… any thoughts?
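One possible way to do the three-coin computation, offered as a sketch rather than a definitive answer: with arbitrary denominations the greedy rule (take the biggest coin that fits) is not always optimal, so the code below uses a small dynamic program to find the fewest coins for each amount, then brute-forces every system {1, a, b}. All function names here are hypothetical, and the uniform-amounts assumption from the two-coin case is kept.

```python
from itertools import combinations


def min_coins_table(k, denominations):
    """Fewest coins for each amount 0..k-1, by dynamic programming.

    Assumes 1 is among the denominations, so every amount is reachable.
    """
    fewest = [0] * k
    for amount in range(1, k):
        # Best way to make `amount` is one coin d plus the best way
        # to make the remainder amount - d.
        fewest[amount] = 1 + min(
            fewest[amount - d] for d in denominations if d <= amount
        )
    return fewest


def total_coins_general(k, denominations):
    """Total coins needed across all amounts 0..k-1."""
    return sum(min_coins_table(k, denominations))


def best_three_coin_systems(k):
    """Brute-force search over systems (1, a, b) with 1 < a < b < k."""
    best_total, best_systems = None, []
    for a, b in combinations(range(2, k), 2):
        total = total_coins_general(k, (1, a, b))
        if best_total is None or total < best_total:
            best_total, best_systems = total, [(1, a, b)]
        elif total == best_total:
            best_systems.append((1, a, b))
    return best_systems, best_total


if __name__ == "__main__":
    systems, total = best_three_coin_systems(100)
    print("best three-coin systems:", systems)
    print("average coins needed:", total / 100)
```

Comparing the winners against 100^(1/3) ≈ 4.6 and 100^(2/3) ≈ 21.5 is a direct way to test the guess about evenly spaced denominations.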


Another linkdump

I’ve been on vacation this week, hence the lack of posts while I enjoy the ocean and seeing my family, but here are some links to things I’ve read.

From Mark Liberman at Language Log, Macroscopic bosons among us; apparently graduate course enrollments at UPenn follow Bose-Einstein statistics.

Dionysis Zindros has written a Gentle Introduction to Algorithm Complexity Analysis.

A network theory analysis of football strategies, by Javier López Peña and Hugo Touchette. (Exercise for the reader: given the names of the authors, what kind of football are we talking about?)

How fivethirtyeight incorporates the economy into its political forecasting models.

Data analysis recipes: probability calculus for inference and Data analysis recipes: fitting a model to data, by David Hogg via John D. Cook and Andrew Gelman. Described as “chapters from a non-existent book.”

Hilary Mason, bit.ly’s chief scientist, gave a 33-minute talk “Machine Learning for Hackers”.

Carnival of Mathematics 88

Algorithms with finite expected running time and infinite variance, from CS Theory stackexchange.

Laura McLay discusses the optimal false alarm rate for tornado warnings.

From Math Goes Pop!, Ranking baseball teams and Mathematical analysis of the half-your-age-plus-seven rule. (I’d like to see some data for the latter.)

Behindness in National Novel Writing Month by Andrew Taylor

How long does it take to get pregnant? by Richie Cotton. (This is self-interested data analysis, as Cotton’s girlfriend has a biological clock.)

Uber asks What San Francisco neighborhood is most like New York?, among other neighborhood-comparison questions.

How to convert rugby scores to football/soccer scores.


Weekly links for July 1

Inverse Fizzbuzz.

More on Minesweeper: where should you click first? (It depends on what you’re trying to do: win this game at any cost? Or set a fast time, at the cost of throwing away a game when it becomes clear it won’t be a fast one?)

Grand slam statistics: tennis is a numbers game

Pixar wants you to take more math classes, an interview with Tony DeRose, one of their senior scientists. (I saw Brave. I’m convinced that movie grew out of someone at Pixar realizing that they could make realistic-looking curly hair.)

Could the periodic table have been done using group theory?, a question at physics stackexchange.

Allen Downey has a series of posts on secularization in America: Part one, two, three, four.

Men’s Health magazine gives us 5 ways math can improve your life. (These were suggested by Steve Strogatz, who has a new book coming out in October, The Joy of x: A Guided Tour of Math, from One to Infinity.)

Apparently from May, but I missed it then: Patents aren’t only for engineers, on the actuary who patented statistical sampling. (To his credit, it looks like he thinks the idea that this can be patented might be a little silly.)

Tom Mitchell is working on a possible second edition of Machine Learning; he has a chapter on naive Bayesian classifiers and logistic regression available for free download.
