100 years of crosswords

It’s the hundredth anniversary of the publication of the first crossword – check out today’s Google Doodle.

On a related note, crosswords are possible in English (or other natural languages) because a large enough proportion of the possible strings of letters are actual words. I learned this from chapter 18 of Information Theory, Inference, and Learning Algorithms by David MacKay (which you can read online). (Chapter 19 is about why to have sex, from an information-theoretic point of view.) And Dr. Fill is a crossword-solving program by Matthew Ginsberg which did not win the 2012 American Crossword Puzzle Tournament.

This made the rounds last week: Substantiating Fears of Grade Inflation, Dean Says Median Grade at Harvard College Is A-, Most Common Grade Is A, from the Harvard Crimson.

Now, I agree that a median grade of A-minus is probably too high. (Although Jordan Ellenberg says we shouldn’t worry about grade inflation.)

But does it really matter that the most common grade is an A? Consider, say, a situation where there is a “triangular” distribution of grades: 5 A, 4 B, 3 C, 2 D, and 1 F. The most common grade is an A, but the median is a B (and the mean is 2.67 on a 4.0 scale, a B-minus). If there are more grade categories the same thing happens. In the continuous limit of a triangular distribution such as this, with density $2x$ on $[0, 1]$, the median grade is $1/\sqrt{2} \approx 0.71$ of the way up the scale, about midway between a B-minus and a B on the 4.0 scale usual in the US. The mean grade would be $2/3 \approx 0.67$ of the way up the scale. More generally, say grades are in the interval $[0, 1]$. If grades are beta-distributed with parameters $\beta > 1$ and 1 (my triangular idea is just the Beta(2, 1) distribution), then the modal grade will be 1 but the mean and median will be a good bit lower, $\beta/(\beta+1)$ and $2^{-1/\beta}$ respectively.

(I’m not claiming that grades are beta-distributed, but that’s not a bad model for something that’s often thought of as being roughly normally distributed but has to be contained within an interval.)
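The arithmetic in the grade example is easy to check. Here is a minimal Python sketch that recomputes the mode, median, and mean for the 5-4-3-2-1 example and evaluates the closed forms for a Beta($\beta$, 1) distribution, whose CDF on $[0, 1]$ is $x^\beta$. The names `POINTS`, `beta_mean`, and `beta_median` are just illustrative choices, and the 4.0-scale grade points are the usual US convention rather than anything from the Crimson article.

```python
from statistics import mean, median

# Grade points on the usual US 4.0 scale (an assumed convention).
POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

# The discrete "triangular" example from the text: 5 A, 4 B, 3 C, 2 D, 1 F.
counts = {"A": 5, "B": 4, "C": 3, "D": 2, "F": 1}

# Expand the counts into one grade-point value per student.
values = [POINTS[g] for g, n in counts.items() for _ in range(n)]

mode_grade = max(counts, key=counts.get)  # most common letter grade
median_points = median(values)            # middle (8th) of the 15 values
mean_points = mean(values)                # 40 / 15


def beta_mean(b):
    """Mean of Beta(b, 1), whose density on [0, 1] is b * x**(b - 1)."""
    return b / (b + 1)


def beta_median(b):
    """Median of Beta(b, 1): solve the CDF equation x**b = 1/2 for x."""
    return 2 ** (-1 / b)


if __name__ == "__main__":
    print(mode_grade, median_points, round(mean_points, 2))
    for b in (2, 3, 5):
        print(b, round(beta_mean(b), 3), round(beta_median(b), 3))
```

With `b = 2` the two functions reproduce the $2/3$ and $1/\sqrt{2}$ figures for the triangular case.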

Basically, modes don’t tell you much.

This week’s best statistics joke

This week’s best statistics joke: median rent.

State-to-state migration in the US

Here’s an interactive visualization showing state-by-state migrations within the US, by Chris Walker.

It’s not possible to reconstruct all migrations between states from this chart. The data are available in a spreadsheet that the American Community Survey (part of the Census Bureau) puts out.

In case you’re wondering, the (ordered) pair of states with the most movement is California to Texas. Tyler Cowen would have forecast that, but it’s worth pointing out that this is hardly surprising, as California and Texas are the two most populous states. Relative to the population of the target state, Californians are most likely to move to Nevada, Washington, Arizona, and Oregon; Texans are most likely to move to Oklahoma, New Mexico, Louisiana, and Arkansas. For non-American readers, I just said “people are most likely to move to nearby states”, which is easy to lose track of when, like me, you live in San Francisco and are generally surrounded by transplants from far away.

If I could spare the time I’d try to visualize this – which pairs of states have greater flows between them than would be expected from their populations and the distance between them? The prototype here would probably be the flow from the northeastern states to Florida.
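One possible starting point, sketched here under loud assumptions: treat "expected from their populations and the distance between them" as a simple gravity model, in which the expected flow from one state to another is proportional to the product of the two populations divided by the distance between them. That model, the function name `gravity_ratios`, and the input layout (plain dictionaries) are all assumptions for illustration, not anything taken from the ACS spreadsheet or from Chris Walker's visualization.

```python
def gravity_ratios(flows, population, distance):
    """Compare observed flows to a simple gravity-model baseline.

    flows:      dict mapping (origin, destination) -> observed movers
    population: dict mapping state -> population
    distance:   dict mapping (origin, destination) -> distance between them

    Returns a dict mapping each pair to observed / expected, where
    expected = k * pop[origin] * pop[destination] / distance and k is
    chosen so that total expected flow equals total observed flow.
    Ratios above 1 mark pairs with more movement than the baseline predicts.
    """
    # Unscaled gravity term for every pair that has an observed flow.
    raw = {
        (o, d): population[o] * population[d] / distance[(o, d)]
        for (o, d) in flows
    }
    # Scale factor matching the overall volume of observed migration.
    k = sum(flows.values()) / sum(raw.values())
    return {pair: flows[pair] / (k * raw[pair]) for pair in flows}
```

Sorting the result by ratio would put pairs like the northeastern states to Florida near the top, if the guess about that prototype is right. A more careful version would fit the exponents on population and distance instead of fixing them at 1.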