# Flightstats statistical mumbo-jumbo

From FlightStats, describing a flight that is the first leg of a two-leg itinerary I’m flying in the near future – obviously this is the sort of flight where one is interested in knowing whether it tends to be on time, because one does not like being stuck in Charlotte:

This flight has an on-time performance of 84%. Statistically, when controlling for sample size, standard deviation, and mean, this flight is on-time more often than 95% of other flights.

I didn’t realize one could control for standard deviation and mean.

(Presumably controlling for “sample size” could mean some Bayesian approach, where if there is a small amount of data for a flight they tend to give moderate predictions. This is probably not too influential as

# Weekly links for February 25

Matthew Barsalou does a Bayesian analysis of the deaths of redshirts in Star Trek.

Charles Radin asks why can you stand on ice but not on water? in the Notices of the AMS.

Igor Pak has a blog; his most recent post is on the history of Catalan numbers.

Brian MacDonald has a paper on realignment in the four major sports leagues, with a view towards minimizing the total amount of travel required for teams. At Hockey Prospectus he’s written a three-part series on this paper (emphasizing hockey, of course): part one, part two, part three.

From It’s Okay to be Smart, Drake’s equation applied to finding love.

Rick Durrett writes Cancer modeling: a personal perspective for the Notices of the AMS.

Disease spreads like ripples on a pond, but only if you have the right metric.

# Oscars edition

Nate Silver fivethirtyeights the Oscars. (Yes, that’s a verb.) That is, he predicts who’s going to win Academy Awards tonight by looking at who’s won (or been nominated for) awards previously in this awards season, weighting the results in proportion to how well those results have predicted Oscar results in the past. See also his 2009 and 2011 (behind NYT paywall) attempts at the same, which try to take some other variables into account; Silver seems to believe that he may have overfit, hence the simplification.

Meanwhile, John Lopez of Vanity Fair reports on a 2008 paper by Jonas Krauss, Stefan Nann, Daniel Simon, Kai Fischbach, and Peter Gloor, “Predicting Movie Success and Academy Awards Through Sentiment and Social Network Analysis”; at least at the time, the IMDB comments section gave lots of useful information. But there was no Twitter at the time of the paper (which was based on data from 2006); the folks at Topsy have an Oscars Index.

(I will refrain from predicting, because unlike Nate Silver I don’t have minions to clean the data for me.)

# (Bi-)weekly links for February 18

Larry Wasserman: statistics declares war on machine learning.

Natalie Wolchover at Wired: In Mysterious Pattern, Math and Nature Converge, on random matrix theory.

A draft book by John Hopcroft and Ravi Kannan, CS theory for the information age (large PDF). Used in this CMU course by Venkatesan Guruswami and Ravi Kannan on modern mathematics for computer science, emphasizing high-dimensional geometry, probability, and other non-discrete mathematics.

2^57885161 − 1 is prime, says GIMPS. Liz Landau blogged about it and people at Metafilter talked about it.

Daniel Navarro of the University of Adelaide has a free e-book, Learning statistics with R: A tutorial for psychology students and other beginners.

sarah-marie belcastro writes Adventures in Mathematical Knitting for American Scientist.

# Simpson’s paradox in the wild

Found on Wikipedia by Kate Owens: a chart of income by education and race. At each level of education, white Americans outearn Asian Americans. But overall, Asian Americans outearn white Americans. How does this happen?

The answer, of course, is that Asian Americans have a higher level of education overall. If the two groups had the same overall level of education, white Americans would outearn Asian Americans. It’s an example of Simpson’s paradox in the wild. (Note: one example of Simpson’s paradox at the Wikipedia article involves characters called “Lisa” and “Bart”.)
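
To see the mechanism with actual arithmetic, here is a minimal sketch in Python. The head counts and incomes are invented purely for illustration (they are not the 2003 figures from the Wikipedia chart), and the names `group_a`, `group_b`, and `overall_mean` are just labels I made up. Group A earns more at every education level, but group B has far more people at the higher level.

```python
# Hypothetical figures, chosen only to illustrate the effect; NOT real data.
# group -> education level -> (number of people, mean income)
data = {
    "group_a": {"no_degree": (80, 30_000), "degree": (20, 60_000)},
    "group_b": {"no_degree": (30, 28_000), "degree": (70, 58_000)},
}

def overall_mean(levels):
    """Mean income across all education levels, weighted by head count."""
    total_people = sum(n for n, _ in levels.values())
    total_income = sum(n * mean for n, mean in levels.values())
    return total_income / total_people

# Within each education level, group A has the higher mean income...
a_wins_each_level = all(
    data["group_a"][level][1] > data["group_b"][level][1]
    for level in data["group_a"]
)

# ...but the overall means come out the other way round.
overall = {group: overall_mean(levels) for group, levels in data.items()}

print(a_wins_each_level)
print(overall)
```

The reversal comes entirely from the weights: each overall mean is a weighted average of the same two kinds of numbers, and the two groups weight them very differently.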

The Wikipedia chart is based on 2003 data. I would like to be able to reconstruct this with present data, but unfortunately more recent data seems to not break out Asian Americans separately.

# Fractal broccoli

Did you know that broccoli is fractal in nature? It’s self-similar – little bits of broccoli look like big bits of broccoli.

To illustrate this, here’s a big piece of broccoli from tonight’s dinner at God Plays Dice headquarters:

And here’s a small piece of broccoli, against a backdrop of a smaller pattern:

They look quite similar!

I’m not the first to notice this: see Fractal Broccoli for the Gardening Geek and Fractal Broccoli with a Macro Lens, which features better photography. But what do you want before dinner?
My art department has a variety of fabric backdrops, mostly from recent quilting pursuits. More about that, perhaps, in a future post.

# Super Bowl edition

From Freakonomics this morning: Just how bad are football pundits at picking winners? Not bad – right about half of the time, against the spread. I’m not surprised that individuals picking can’t beat the Vegas line consistently – my understanding is that those individuals who consistently make money on sports betting are doing it by taking advantage of those rare occasions when Vegas misses something, and are only betting on some very small minority of games.

But what kind of success could someone expect picking not against the spread, but just trying to pick a winner? Sean J. Taylor wrote something about this back in November in which he observed that “These rankings only are only about 70-75% accurate, while optimal ranking almost always breaks 80%.” By “optimal ranking” he means an ordering of the teams done retrospectively, at the end of the season; “these rankings” are various methods which attempt to assign a rating to each team based on its statistics and then pick the team with the higher rating to win the game. The disparity here is because the “optimal ranking” model is inherently overfitting: it is fit to the very results it is then scored on.

As for rating systems, the simple rating system is an example, where the rating works out to be the margin by which a team would beat an average opponent on a neutral field. Interestingly, that rating system, as implemented at pro-football-reference.com, has the 49ers at 10.2 points better than an average team and the Ravens at 2.9 points better. But the 49ers are only a 4-point favorite today. I don’t really care about football so I’m not going to comment.
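
For the curious, here is a rough sketch of how a rating of this kind can be computed. It uses the usual textbook description of a simple rating system (a team's rating is its average point margin plus the average rating of its opponents) and solves that by repeated substitution. The four teams and six results are made up, `simple_ratings` is a name I invented, and I'm not claiming this matches the pro-football-reference.com implementation in its details.

```python
from statistics import mean

# Hypothetical results, made up for illustration:
# (team, opponent, team's points, opponent's points)
GAMES = [
    ("A", "B", 27, 20),
    ("B", "C", 24, 10),
    ("C", "A", 17, 21),
    ("A", "D", 30, 13),
    ("D", "B", 14, 28),
    ("C", "D", 20, 17),
]

def simple_ratings(games, iterations=200):
    """Iterate rating = average margin + average opponent rating."""
    margins, opponents = {}, {}
    for team, opp, pts, opp_pts in games:
        margins.setdefault(team, []).append(pts - opp_pts)
        margins.setdefault(opp, []).append(opp_pts - pts)
        opponents.setdefault(team, []).append(opp)
        opponents.setdefault(opp, []).append(team)
    # Start everyone at zero and substitute repeatedly until it settles.
    ratings = {team: 0.0 for team in margins}
    for _ in range(iterations):
        ratings = {
            team: mean(margins[team]) + mean(ratings[o] for o in opponents[team])
            for team in ratings
        }
    return ratings

ratings = simple_ratings(GAMES)
# Predicted margin on a neutral field is the difference in ratings.
a_over_b = ratings["A"] - ratings["B"]
# The same subtraction, using the two ratings quoted above.
implied_49ers_margin = round(10.2 - 2.9, 1)

print(ratings, a_over_b, implied_49ers_margin)
```

Under this reading, the quoted ratings imply a neutral-field margin of 7.3 points, which is what makes the 4-point line look interesting. The repeated substitution settles quickly for a small round-robin like this one; a real schedule might be better handled by solving the linear system directly.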

Also, I recently came across a series of articles from 2009 at Math Goes Pop! on the “Super Bowl Squares” betting pool: one, two, three. One interesting variant is to use the score mod 9 instead of the score mod 10 as the thing being bet on – due to football scores typically coming in sevens and threes, and the arithmetic fact 7+3 = 10, some last digits are far more common in football scores than others, but this effect goes away if you work mod 9.
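
Here is a toy check of the mod 9 versus mod 10 claim in Python. The model is my own crude assumption, not real score data: a team scores between 0 and 5 touchdowns (7 points each) and between 0 and 4 field goals (3 points each), and every combination is counted exactly once.

```python
from collections import Counter

# Toy assumption: 0-5 touchdowns (7 points each) and 0-4 field goals
# (3 points each), with every combination counted once. Real scores are
# not distributed uniformly like this.
scores = [7 * td + 3 * fg for td in range(6) for fg in range(5)]

# How often each residue shows up under the two schemes.
mod10 = Counter(score % 10 for score in scores)
mod9 = Counter(score % 9 for score in scores)

for name, counts in (("mod 10", mod10), ("mod 9", mod9)):
    print(name, sorted(counts.items()))
```

In this toy model the last digits 0 and 7 each turn up five times out of thirty while 2 and 5 turn up once each, whereas every residue mod 9 turns up three or four times. The reason is that 7 is congruent to −3 mod 10, so a score of 7t + 3f is congruent to 3(f − t) and depends only on the difference between field goals and touchdowns; mod 9 has no such collapse.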

And you guys know about Facebook’s football map (also by Sean J. Taylor!), right?