On airline hubs

I am writing this post from somewhere over the south-central United States, an hour and a half from landing on a San Francisco to Charlotte flight.  As you might suspect, my final destination is not Charlotte. Nothing against Charlotte – it’s just a hub airport, and so I’m flying through on my way to Pittsburgh.

As I was getting on the plane, a couple of passengers whose final destination was Charlotte – and who seemed a bit nervous about flying – asked where I was headed. Pittsburgh, I answered. So why go via Charlotte, they wondered? I explained that I’d booked this flight fairly late, that there aren’t many SFO-PIT flights, and that it could be worse: I had the option of going via Philadelphia, my hometown. On that flight I’d have wanted to parachute out of the plane on the way down.

So why have hubs? I’ll spend seven and a quarter hours today between takeoff at SFO and landing at PIT; a nonstop flight would probably be a hair under five. (These are of course not door-to-door times, but the extra time that has to be factored in on either end for ground transportation, security, and so on doesn’t depend on the amount of time I spend in the air.) So the fact that I have to fly via Charlotte costs me, today, somewhere above two hours.

On the other hand, the airline I’m flying (US Airways, if you haven’t guessed from my choices of hub) offers something like a dozen possible routings each day between these two cities. They have no nonstops, but there are a couple possible itineraries each day via each of Charlotte, Philadelphia, and Phoenix, as well as some more complicated ones. So if I called US Airways and said “I’d like to fly from San Francisco to Pittsburgh at [insert time here]”, they could probably accommodate that preference to within an hour or so.

Now suppose they didn’t have hubs. Then perhaps US Airways could fill, I don’t know, two planes a day on this route, and I might have to wait around half a day for my plane! With hubs I spend longer in transit, but less time waiting.

Jarrett Walker, a transit consultant, has written about transfers in designing bus networks. Essentially, if you want to provide a bus from every point to every other point, you can’t provide them very frequently, so people have to wait a long time for their one-seat ride. If you force them to transfer, you can offer more frequent service to some hubs from all the outlying points, so even though people have a two-seat ride the total time spent waiting is less. The same principle applies to air travel.
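
Here’s a back-of-envelope version of that trade-off in R, with invented schedule numbers (only the five-hour nonstop estimate, the seven-and-a-quarter-hour hub routing, and the dozen daily itineraries come from this post). If departures are spread evenly through the day and you want to leave at a random time, your expected wait for the next departure is roughly half the gap between departures.

nonstop_per_day <- 2    # hypothetical point-to-point schedule
via_hub_per_day <- 12   # roughly US Airways' daily SFO-PIT itineraries
wait_nonstop <- (24 / nonstop_per_day) / 2  # 6 hours expected wait
wait_hub     <- (24 / via_hub_per_day) / 2  # 1 hour expected wait
(wait_nonstop + 5) - (wait_hub + 7.25)      # 2.75: the hub routing still wins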

Finally, my trip today involves a surfeit of coincidences. Had all gone as planned this morning, I would have left my house in Oakland, California, and boarded a BART train that originated in Pittsburg (sic), California to get to the San Francisco airport. After all the flying, I’ll land at the Pittsburgh airport and take a SuperShuttle to my final destination in Oakland, a neighborhood of Pittsburgh.

What are the chances of that?

And what are the chances that on one of the very few times I actually have to cross San Francisco Bay at a certain time, BART trains aren’t running through the Transbay Tube? There was a fire overnight.

Don’t answer those questions.

I’m looking for a job. Despite the peregrinations described in this post, I’m looking in the SF Bay Area. See my LinkedIn profile.

Jesse Kelly’s P-value extravaganza

P-value extravaganza, a comic (!) video on the basics of hypothesis testing. Statistics applied to beer, but not how you think. I think that everybody who’s taught (frequentist) statistics has wanted to do what Fisher does to the guy who says that the p-value is the probability that the null hypothesis is true. And how had I not heard the mnemonic that “the null hypothesis is the dull hypothesis”? I tend to call it the “boring hypothesis” or the “uninteresting hypothesis”, which has much the same meaning but doesn’t rhyme.
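
A toy example of the distinction (mine, not the video’s): suppose a taster claims she can tell two beers apart and gets 9 of 10 blind trials right. In R:

binom.test(9, 10, p = 0.5, alternative = "greater")$p.value
# about 0.0107: the probability of doing at least this well if she
# were just guessing; not the probability that she was just guessing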

(I drink beer out of bottles, so I hadn’t heard of the example on which this video is based.)

By Jesse Kelly, apparently a grad student at the University of Texas. Via Christian Perfect, who also pointed out YouTube’s automatically generated math feed.

I’m looking for a job, in the SF Bay Area. See my LinkedIn profile.

How many states have above-average unemployment rates?

Quick: in April 2012, how many US states had an unemployment rate at or above the national average? (Hint: I’m counting 51 states or state-equivalents, since I include DC.)

Here’s the data by state; the national rate is 8.1 percent, and seventeen states are at or above it. Only one-third of the states have rates at or above average.
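
Here’s a quick check in R, using the same unemployment.csv that the plotting code below reads:

unemployment <- read.csv('unemployment.csv')  # edit the path as needed
sum(unemployment$unemp >= 8.1)                # should return 17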

This actually isn’t that surprising once you look at the data: states with larger populations tend to have higher unemployment rates. (I’m not an economist; why should this be?) In the plot below we have unemployment on the y-axis and log (base 10) of population on the x-axis. The dotted line represents the national unemployment rate of 8.1 percent; you can see that most states fall below it. The upward slope indicates that low-population states have lower unemployment rates than high-population states. The solid line is the least-squares regression line for predicting unemployment from log population.

R code for generating the plot from the file “unemployment.csv” (the data are in the first comment to this post; edit the path in the first line to wherever you store the file):


# load the data; edit this path to wherever you store the file
unemployment <- read.csv('c:/users/michael/desktop/blog/unemployment.csv')
x <- log10(unemployment$pop) + 3  # population is in thousands; +3 gives log10 of actual population
y <- unemployment$unemp
plot(x, y, xlab = "log_10 population", ylab = "unemployment",
     main = "Population of state vs. unemployment rate, April 2012")
text(x, y, labels = unemployment$state, cex = 0.5, adj = c(0, -1))
abline(h = 8.1, lty = 2)  # dotted line: national rate
abline(lm(y ~ x))         # solid line: least-squares fit

The equation of the line is y = 1.66x - 3.62. To interpret the slope: if we multiply a state’s population by 10, we expect the unemployment rate to be 1.66 percentage points higher. It’s probably easier to think in terms of doublings: we expect a state with twice the population to have an unemployment rate 1.66 \log_{10} 2 percentage points, or about 0.50 points, higher.
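
To check these numbers (a continuation of the R session above):

fit <- lm(y ~ x)         # the same regression the plotting code draws
coef(fit)                # intercept and slope; about -3.62 and 1.66
coef(fit)[2] * log10(2)  # effect of doubling population: about 0.50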

So does half the population live in states with a higher-than-average unemployment rate? Pretty much.


sum(unemployment$pop[unemployment$unemp >= 8.1])

gives the sum of the populations in those 17 states; it returns 159398, for 159.398 million. (My population data are in thousands.) Total population at that time was 308.746 million; so 51.6% of the population lived in a state with unemployment at or above average. (If you throw out Washington state, which had unemployment equal to the national average of 8.1 percent, you get 49.4%.)

Perhaps somewhat ironically given the content of this post, I’m looking for a job, in the SF Bay Area. See my LinkedIn profile.

Weekly links for June 10

Day 1 problems, Day 2 problems, and solutions to both days of the 2012 USA Mathematical Olympiad.

Alejandro at Knewton briefly explains item response theory, a method for scoring exams. (Say two students get 9 out of 10 on an exam; one misses the easiest question, one the hardest. Which one is better at what the exam tests?)
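
Here’s a minimal sketch of how an item response model can answer that question, using the two-parameter logistic model with made-up item parameters (generic IRT, not necessarily Knewton’s approach). Under the one-parameter Rasch model the two students would get the same ability estimate, since the raw score is a sufficient statistic; letting discrimination vary across items breaks the tie.

# 2PL IRT with invented item parameters: both students score 9/10,
# one missing the easiest item and one the hardest
b <- seq(-2, 2.5, length.out = 10)   # item difficulties, easy to hard
a <- seq(0.6, 1.8, length.out = 10)  # item discriminations (invented)
p_correct <- function(theta) plogis(a * (theta - b))
loglik <- function(theta, x) {
  p <- p_correct(theta)
  sum(x * log(p) + (1 - x) * log(1 - p))
}
miss_easiest <- c(0, rep(1, 9))
miss_hardest <- c(rep(1, 9), 0)
optimize(loglik, c(-4, 6), x = miss_easiest, maximum = TRUE)$maximum
optimize(loglik, c(-4, 6), x = miss_hardest, maximum = TRUE)$maximum
# the two maximum-likelihood ability estimates differ even though
# the raw scores are identical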

Nate Silver, of FiveThirtyEight, has launched his 2012 presidential election forecast.

Tim Gowers asks How should mathematics be taught to non-mathematicians? The post is motivated by certain proposed changes to secondary education in the UK, to introduce courses in “Uses of Mathematics”, but most of the post is devoted to suggesting the sort of questions that students in such courses would be able to answer, and you don’t need to know anything about the UK education system to appreciate these.

A graph in a glass: a machine that turns the distributions of fruits mentioned on Twitter into smoothies. (I’d prefer a pie chart made of actual pie.)

Distributional footprints of deceptive product reviews. Some companies soliciting people to write fake reviews of their products get too greedy, and this can be detected.

High school kids are assholes. (Not the actual title, which is “friendship networks and social status”.) In brief: “In every network, without exception, we find that there exists a ranking of participants, from low to high, such that almost all unreciprocated friendships consist of a lower-ranked individual claiming friendship with a higher-ranked one.” Perhaps I’d have more to say if the subject of this paper were emotionally neutral, but I’m not in the mood to dredge up painful memories.

Jordan Ellenberg’s review of Alexander Masters’ Simon: The Genius in My Basement. I mentioned this book back in March in a weekly links post (in which I also mentioned Jordan!).

I’m looking for a job, in the SF Bay Area. See my LinkedIn profile.

How do departments get their names?

Jean Joseph, at the AMS Grad Math Blog, asks why some departments are called “Department of Mathematics” and others are called “Department of Mathematical Sciences”. The obvious explanation is that the “Mathematical Sciences” ones are more applied, but that doesn’t necessarily hold.

“Department of Mathematics” is much more common; I get 9,090,000 Google hits for it, compared to 531,000 for “Department of Mathematical Sciences”, for a 17 to 1 ratio.

In my Googling, the first ten hits for “Department of Mathematics” are the departmental web pages of Berkeley, Stanford, Washington, Purdue, Penn State, Florida State, Chicago, Wisconsin, MIT, and UCLA.

The first ten hits for “Department of Mathematical Sciences” are the departmental web pages of Carnegie Mellon, Clemson, Montana, Delaware, Michigan Tech, New Jersey Institute of Technology, Cincinnati, Florida Atlantic, Montana (again), and Central Connecticut State.

I don’t know how to interpret this data; obviously the “Department of Mathematical Sciences” schools are less notable, but that makes sense simply because there are fewer of them. (Besides, I don’t want to be on record as insulting Carnegie Mellon, because someone I love is in Pittsburgh.)

Now, statistics departments have historically tended to be more applied in their outlook than mathematics departments, so if Joseph’s idea is right, then perhaps we’d expect “Statistical Sciences” to be more common, relatively speaking.

For “Department of Statistics” I get 4,640,000 hits; the first ten are Berkeley, Stanford, Washington, Penn State, Texas A&M, Oxford, UCLA, Chicago, Purdue, and Michigan. For “Department of Statistical Sciences” I get 63,400 hits, for a 73 to 1 ratio. The hits here start with Cornell, University College London, Duke, Cape Town, Padua, Virginia Commonwealth (which is actually “Statistical Sciences and Operations Research”), VCU again (this time a listing of their faculty), VCU again (some sort of “handbook”), VCU again (the page of Paul Brooks), and a flyer about Padua’s department. Interestingly, Cornell can’t make up its mind what to call its department; the HTML title of their page is apparently “Department of Statistics” but the banner at the top of the page identifies them as “Department of Statistical Science”.

So if anything, math departments are more likely to add “science” to their name than stats departments. Why?

I’m looking for a job, in the SF Bay Area. See my LinkedIn profile.

Santana’s no-hitter redux

Slate’s sports podcast, “Hang Up and Listen”, talked about Johan Santana’s June 1 no-hitter in their most recent episode; I mentioned it back on June 2. Starting at 47:47, they talk briefly about this post by tangotiger, who argues that of the 27 outs in the game, all but six were “routine” outs; he figures that given the distribution of batted balls, Santana should have given up about two hits. If, like me, you didn’t see the game, you can see video of all 27 outs at mlb.com. (The “blown call” that’s been mentioned in a lot of places came in the top of the sixth, with Carlos Beltran batting: a line drive down the third base line that was ruled foul.)

But every no-hitter has some degree of luck. Consider the following model: the batter hits the ball. Depending on where he hits it, that sets the probability of heads of a certain (imaginary) coin, i.e. a Bernoulli random variable. Take this probability to be 0 for a strikeout, 1 for a home run, and somewhere in between for balls in play. (Of course you could go back a step and start with the pitcher pitching.) If that coin comes up heads, the ball is a hit; if not, it’s an out. For each inning, record the number of hits until getting three outs; nine innings make a ball game.

Then for each team in every ball game you get two numbers: the sum of those probabilities of heads, which you could call the “expected” number of hits, and the actual number of hits. On average they’ll be the same, and of course they’re highly correlated. But conditional on the actual number of hits being 0, which is well below the average, the sum of the probabilities of heads — the “expected” number of hits — will always be greater than 0. (Unless we’re talking about a 27-strikeout game, which has happened once in the minors, in 1952, and never in the majors.) This is just regression to the mean.
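
Here’s a toy simulation of that model in R, with made-up strikeout and hit probabilities (not fitted to real batted-ball data), just to illustrate the effect:

set.seed(1)
sim_game <- function() {
  expected <- 0; hits <- 0; outs <- 0
  while (outs < 27) {
    # strikeout (p = 0) with probability 0.2; otherwise a ball in
    # play with some invented chance of falling in for a hit
    p <- if (runif(1) < 0.2) 0 else runif(1, 0.02, 0.45)
    expected <- expected + p
    if (runif(1) < p) hits <- hits + 1 else outs <- outs + 1
  }
  c(expected = expected, actual = hits)
}
games <- t(replicate(20000, sim_game()))
mean(games[, "expected"])        # unconditional mean "expected" hits
no_hit <- games[, "actual"] == 0
mean(games[no_hit, "expected"])  # smaller, but still above 0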

With the right data set you could empirically determine the probability that any given batted ball goes for a hit, and for recent no-hitters (where that data is presumably available somewhere) compute what the “average” amount of luck is. I don’t have that data, though. But some pitchers of no-hitters benefited more from luck than others, and this wouldn’t be a horrible way to quantify that.

I’m looking for a job, in the SF Bay Area. See my LinkedIn profile.

Spelling and prime factorization

Ben Zimmer writes a column for the New York Times, “On Language”. His June 25, 2010 column was entitled Ghoti. It’s not about beards. That’s not a misspelling of “goatee”. Rather, it’s a misspelling of “fish” (the “gh” of “enough”, the “o” of “women”, and the “ti” of “action”) that’s traditionally attributed to George Bernard Shaw.

In this column we learn about the absurd respellings that Alexander Ellis, a mid-nineteenth-century spelling reformer, came up with. He also did some calculations. He thought “scissors” should be spelled “sizerz” (okay, that’s not bad, although how would you spell “sizers”, as in “people who size”?), but at least it’s not spelled “schiesourrhce” (“combining parts of SCHism, sIEve, aS, honOUr, myRRH and sacrifiCE”).

And Ellis gave three different numbers for the number of possible spellings of “scissors”: 1745226, 58366440, and 81997920. In the interest of trying to guess where these came from, the first thing that comes to mind is finding the prime factorizations. Why? Well, say someone told us “there are twelve ways to spell cat“. We’d logically think that they’d come up with, say, three ways to spell the first sound of that word (say, “c”, “k”, and “ck”), two ways to spell the second sound (“a” and “ah”), and two ways to spell the third sound (“t” and “tt”), for a total of 3 \times 2 \times 2 = 12 spellings:

cat, catt, caht, cahtt, kat, katt, kaht, kahtt, ckat, ckatt, ckaht, ckahtt

Of course English doesn’t work that way — you can spell the first sound of “cat” as “ck”, but not at the beginning of a word! Zimmer tells us that Ellis acknowledged this. Still, if you assume the calculation was done this way, then twelve is an easy number to get, while eleven and thirteen, being primes, are less likely. The numbers obtained in this way should be products of relatively small numbers, and therefore shouldn’t have large prime factors. And indeed we get

1745226 = 2 \times 3^8 \times 7 \times 19, 58366440 = 2^3 \times 3^3 \times 5 \times 11 \times 17^3, 81997920 = 2^5 \times 3^6 \times 5 \times 19 \times 37

and these could conceivably be products of six relatively small numbers. For example:

1745226 = 9 \times 193914 = 9 \times 9 \times 21546 = 9 \times 9 \times 14 \times 1539
= 9 \times 9 \times 14 \times 9 \times 171 = 9 \times 9 \times 14 \times 9 \times 9 \times 19

58366440 = 20 \times 2918322 = 20 \times 18 \times 162129 = 20 \times 18 \times 17 \times 9537
= 20 \times 18 \times 17 \times 17 \times 561 = 20 \times 18 \times 17 \times 17 \times 17 \times 33

81997920 = 20 \times 4099896 = 20 \times 19 \times 215784 = 20 \times 19 \times 24 \times 8991
= 20 \times 19 \times 24 \times 27 \times 333 = 20 \times 19 \times 24 \times 27 \times 9 \times 37

Where did I get these from? Consider how I went from 20 \times 18 \times 162129 to 20 \times 18 \times 17 \times 9537 in my decomposition of 58366440. Having already written 58366440 = 20 \times 18 \times 162129, I know I’m going to have to write 162129 as a product of four numbers, so each of them should be near 162129^{1/4} \approx 20.07. It turns out that 162129/17 is an integer, namely 9537, and no factor of 162129 is closer to its fourth root than 17 is. (That is, 18, 19, 20, 21, 22, and 23 are not factors of 162129.)

This is a greedy algorithm, and the resulting decompositions aren’t optimal in the sense of having the smallest sum. For example, in the last one I could replace 24 and 9, which multiply to 216, with 18 and 12, which have the same product but a smaller sum. But there’s no reason to expect that Ellis’ products had this property anyway; some sounds can be spelled in more ways than others. In particular, the last decomposition is unlikely to be what Ellis came up with, because the word “scissors” has two of the same sound, so I’d expect two of the factors to be the same. But what do you want from a greedy algorithm?
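
Here’s one way to code up that greedy search in R (my own reconstruction of the procedure described above, not Ellis’ or anyone else’s code):

# peel off, at each step, the divisor closest to the k-th root of
# what remains
divisors <- function(n) {
  s <- seq_len(floor(sqrt(n)))
  small <- s[n %% s == 0]
  sort(unique(c(small, n %/% small)))
}
greedy_factors <- function(n, k) {
  out <- integer(0)
  while (k > 1) {
    d <- divisors(n)
    d <- d[d > 1 & d < n]  # proper divisors only
    f <- d[which.min(abs(d - n^(1 / k)))]
    out <- c(out, f)
    n <- n %/% f
    k <- k - 1
  }
  c(out, n)
}
greedy_factors(58366440, 6)  # 20 18 17 17 17 33, matching the text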

By the way, it’s not terribly hard to write down rules for going from spelling to pronunciation that work reasonably well. It seems like the same should be true of the reverse.

I’m looking for a job! See my LinkedIn profile.