Links for October 19

John Cook spoke at KeenCon on Bayesian statistics as a way to integrate intuition and data.

A fantasy sports wizard’s winning formula, from Brad Reagan at the Wall Street Journal. Via Hacker News.

From Dan Egan of Betterment, It’s About Time in the Market, Not Market Timing, an analysis of the distribution of returns based on investment period.

Sarah Fallon at Wired: This Man’s Simple System Could Transform American Medicine. The man is David Newman, professor of emergency medicine at Mount Sinai Hospital, and the system is based on the number needed to treat (i.e., how many patients do you have to treat to have one positive outcome?).
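For the curious, here is a minimal sketch of how the number needed to treat is usually computed: it is the reciprocal of the absolute risk reduction. The function name and the event rates below are made up for illustration and do not come from Newman's work.

```python
def number_needed_to_treat(control_event_rate: float, treated_event_rate: float) -> float:
    """Return the NNT: 1 / (absolute risk reduction).

    Both rates are the fraction of patients who have the bad outcome,
    without and with the treatment respectively.
    """
    absolute_risk_reduction = control_event_rate - treated_event_rate
    if absolute_risk_reduction <= 0:
        raise ValueError("no benefit from treatment, so the NNT is undefined")
    return 1.0 / absolute_risk_reduction


# Hypothetical example: 10% of untreated patients and 8% of treated
# patients have the bad outcome, so treatment helps 2 patients in 100.
nnt = number_needed_to_treat(0.10, 0.08)
print(f"Treat about {nnt:.0f} patients to prevent one bad outcome")
```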

A talk from Jake Porway of DataKind on using data for good, and Defending microfinance with data science by Everett Wechtler at Bayes Impact.

Jordan Ellenberg, author of How Not to Be Wrong, speaks to students in Education as Self-Fashioning at Stanford University on 10-10-14.

Tom Siegfried on the top ten unsung geniuses of science and math, at Nautilus.

Michael Byrne at Motherboard on a model that has predicted the Ebola outbreak, following An IDEA for Short Term Outbreak Projection: Nearcasting Using the Basic Reproduction Number by David N. Fisman, Tanya S. Hauck, Ashleigh R. Tuite, and Amy L. Greer. Also in Ebola news, let’s do some math on Ebola before we start quarantining people, on the old question of false positive rates in medical tests.
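The false positive question is just Bayes' rule. Here is a small sketch; the prevalence, sensitivity, and specificity are invented numbers chosen to make the point, not figures for any real Ebola test.

```python
def positive_predictive_value(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Probability that a person who tests positive actually has the disease."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical numbers: 1 person in 10,000 is infected, and the test is
# right 99% of the time for both infected and uninfected people.
ppv = positive_predictive_value(prevalence=1e-4, sensitivity=0.99, specificity=0.99)
print(f"Chance a positive result is a true positive: {ppv:.2%}")
```

With a rare disease, nearly all positives are false positives even when the test itself is quite accurate.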

Tyler L Hobbs on probability distributions for algorithmic artists and randomness in the composition of artwork.

The MAA has released some long-lost Martin Gardner footage.

State population and area

Ork Posters makes maps of cities with their neighborhoods. I have three: Boston, Philadelphia, San Francisco. I was looking at them recently and noticed one thing these three cities have in common: the more centrally located neighborhoods are physically smaller than the outlying neighborhoods.

Is this a general trend? It’s hard to tell because neighborhoods don’t “officially” exist. Perhaps we can do something similar looking at states.

The first question is: what’s the “central” location to use? Since states have historically been formed outward from the capital, i.e. Washington, DC, I plotted land area against the distance from the state’s largest city to DC. (This was the driving distance as given on Google Maps, except for Hawaii where I used flight distance from


This confirms what you see on a map – states get bigger as you move away from the capital. There are basically two clusters:

– the “main” cluster stretching from Pennsylvania to California;
– the “northeastern” cluster, where states are smaller than you’d expect from their distance to DC.

The existence of this “northeastern” cluster suggests that it might have made more sense to use a point further north – Philadelphia or even New York – for these earlier states. These states were formed as colonies, before the United States had a capital or was even a thing – but New York and Philadelphia both had their turns as capital before Washington, DC existed.

As you probably could have guessed beforehand if you know anything about the United States, Texas, Alaska, and Hawaii are outliers.
More distant states turn out also to have lower population. (Not just lower population density.)


Incidentally, at the county level, Ed Stephan observed that some states have variation in county size and some don’t, and that much of this seems to be explainable by population density changes. So it is very possible that the observation that more remote states tend to be larger may hold because remoteness from DC, in the United States, is correlated with lower population density. That’s a fancy way of saying that as you go west there are fewer people, which if you ignore the western coast is true.

As for my original question: how would you answer it, given that there aren’t official definitions of neighborhoods?

Election forecasting and big data

Matt Yglesias writes on what he calls the hazy metaphysics of probability as it applies to election forecasting. For example, as of right now, Nate Silver’s forecasts have a 42.2% chance of the Democrats (plus independents) getting the majority in the Senate, while Sam Wang’s forecasts have that same number at 39%. As Yglesias points out, we’ll just never get enough data points to know which of these models is closer to the truth. (By the time we do, the political system will change underneath us.)
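To put a rough number on "never enough data points", here is a back-of-the-envelope sketch. It pretends, unrealistically, that we could watch many independent elections that all have the same true probability, and asks how many we would need to tell 42.2% from 39% using the standard one-sample sample-size formula for a proportion. The function name, the 5% significance level, and the 80% power are my assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def elections_needed(p_null: float, p_alt: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Independent trials needed to distinguish p_alt from p_null.

    Uses the normal-approximation formula for a two-sided one-sample
    test of a proportion.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    sd_null = sqrt(p_null * (1 - p_null))
    sd_alt = sqrt(p_alt * (1 - p_alt))
    n = ((z_alpha * sd_null + z_beta * sd_alt) / (p_alt - p_null)) ** 2
    return ceil(n)


n = elections_needed(0.422, 0.39)
print(f"Roughly {n} comparable elections needed")
```

Under these assumptions the answer comes out in the high hundreds to low thousands of elections, which is the point: nobody gets that many comparable Senate cycles.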

Yglesias’ Vox has a roundup of various forecasts as well. (Silver is the fox logo; Wang is the orange-and-black Princeton shield.) One thing that jumps out, for me, is that the Washington Post tends to give much more extreme win probabilities – that is, closer to 0 or 100 percent – than the other models. I suspect that the models with more extreme win probabilities generally have the same point estimate – in all these models, the point estimate is basically the average of poll results, although of course you can quibble about what polls to average and how to weight them and so on. The secret sauce of any of these models is going to be how accurately the results today – nearly a month before the election – predict the results on Election Day. Silver claims Wang has historically gotten this wrong in his explanation of the models he uses, and critiqued Wang at Political Wire; here’s Wang’s response.

This is one of those places where the end of theory folks are just wrong. In some sense election forecasting is “big data” – at least, Silver, Wang, and the like are trying to predict a big data result (millions of votes), although the samples aren’t so large. We don’t get to run the elections over and over again – or, to go bigger, we get one shot at getting this climate change thing right. See for example Big Data: the end of theory in healthcare?, by Ben Wanamaker and Devin Bean. And there’s a reason that in my big data job I work with a bunch of people who were academic scientists in their former lives.

Links for October 12

An exact fishy test, a Shiny app by Macartan Humphreys, via Andrew Gelman.

Beautiful Chemistry is a project from Tsinghua University Press and China’s University of Science and Technology, with beautiful close-up footage of chemical reactions.

Laura McLay in defense of model complexity, a counterpoint to her post in defense of model simplicity.

Pledge something to Relatively Prime: Series 2, Samuel Hansen’s newest series of podcasts. (You listened to Series 1, right?)

Michael Spivak has lecture notes on Elementary mechanics from a mathematician’s viewpoint (via metafilter).

Randomness: The ghost in the machine from Yohan J. John at 3 Quarks Daily.

Telegraph Research has done a quantitative analysis of pooled ridesharing.

DataGenetics on optimizing rope swings for distance traveled.

From Alex’s Adventures in Numberland, The man who loved only integer sequences. (Is this title a callout to the Erdős biography The man who loved only numbers?)

The UK has a National Numeracy charity. Among other things, they would like to shame celebrities who “boast of being no good at maths”.

Can Apple predict how long a file transfer takes?, from Rhett Allain at Wired, via Hacker News.

From colah, Visualizing MNIST: An exploration of dimensionality reduction. (That’s the MNIST database of handwritten digits.)

Thoughts on ranking employers

Via Alex Tabarrok at Marginal Revolution – OK, who am I fooling, via LinkedIn when it said that schools I’d attended were highly ranked – LinkedIn has put out university rankings based on career outcomes. They’ve done this in eight fields: accounting professionals, designers, finance professionals, investment bankers, marketers, media professionals, software developers, and software developers at startups. Here’s the blog post by Navneet Kapur that explains the rankings. Essentially, employer A is assumed to be more attractive than employer B if more people leave jobs at B to go to A than vice versa, and employers with less employee turnover are assumed to be more attractive. A school performs better on these rankings if its students work for better employers.
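Here is a toy sketch of the pairwise part of that idea. It is not LinkedIn's actual algorithm, which isn't public in detail; the employer names and job moves are invented, and I ignore the turnover component entirely. Employer A "beats" employer B if more people move from B to A than from A to B, and employers are ranked by how many head-to-head comparisons they win.

```python
from collections import Counter

# Hypothetical job moves, each recorded as (employer left, employer joined).
moves = [("B", "A"), ("B", "A"), ("A", "B"), ("C", "A"), ("C", "B"), ("B", "C")]


def rank_employers(moves):
    """Rank employers by number of pairwise wins (ties broken alphabetically)."""
    flows = Counter(moves)
    employers = sorted({employer for move in moves for employer in move})
    wins = {employer: 0 for employer in employers}
    for i, a in enumerate(employers):
        for b in employers[i + 1:]:
            into_a = flows[(b, a)]  # people who left b for a
            into_b = flows[(a, b)]  # people who left a for b
            if into_a > into_b:
                wins[a] += 1
            elif into_b > into_a:
                wins[b] += 1
    return sorted(employers, key=lambda e: (-wins[e], e))


ranking = rank_employers(moves)
print(ranking)
```

Counting wins is the crudest possible aggregation; a real ranking would presumably weight by the size of the flows and by the strength of the opponent.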

This is basically revealed preference ranking for employers, although LinkedIn hasn’t made public the rankings of employers, only the rankings of schools. Some years ago there was a revealed preference ranking for schools [1] that was based on admissions data – school X is better than school Y if people who were admitted to both choose one over the other. It would be interesting to see a comparable ranking for employers, although that would be a much harder statistical problem. If I get into school X and school Y, the offers are roughly comparable; if I get job offers from employer A and employer B, the offers may not be. In addition, since general employment doesn’t have some centralized calendar, multiple offers are less likely – I would guess that most people in job searches in most fields take the first “good enough” offer that comes.

One idea that I’ve thought about on and off for a while — although it hasn’t crossed my mind nearly as frequently since leaving academia — is if one could construct a ranking of graduate departments in a field by exploiting the fact that such departments are both educators and employers. A good academic department is one where the students end up working in good departments. Of course this is deeply academia-centric, and one would want to take into account where students who leave academia after their PhD end up as well. But if we ignore that and only look at the tip of the iceberg, it wouldn’t be too hard to put together the data set, at least in a field like mathematics – university web sites have lists of faculty, and the mathematics genealogy project can give where they graduated from.

1. Christopher N. Avery & Mark E. Glickman & Caroline M. Hoxby & Andrew Metrick, 2013. “A Revealed Preference Ranking of U.S. Colleges and Universities,” The Quarterly Journal of Economics, Oxford University Press, vol. 128(1), pages 425-467.

Don’t let your babies grow up to be culinary arts majors

From Christopher Ingraham at the Washington Post’s Wonkblog: Want to do what you love and get paid for it? Choose one of these majors, a post based on data from

Although I don’t have the raw data, there’s a nice scatterplot showing for each major the mid-career salary and the percentage of people saying their work is meaningful, and there’s a negative correlation between the percentage saying work is meaningful and the pay. This would seem to imply that people take some proportion of their compensation in meaningfulness. If I had to eyeball the regression line, I’d say it has a slope of about negative 500 dollars per percent meaningfulness – that is, for every extra 1% that your job is meaningful, you take a $500 hit in annual pay. Note the ecological fallacy here, though.
I’d be interested to see, within each major, how people saying their work is meaningful is correlated with pay. That would shed some more light.

A fun fact: statistics majors make more than mathematics majors ($103K vs $93K) but think their work is less meaningful (34% vs 46%). Given that this is based on undergraduate majors, I wonder if this has to do with the choice of jobs that they take – in particular, math majors may be more likely to teach high school than statistics majors. This is based on small samples of my students, though – and the math majors I taught were at a different university than the statistics students – so take it with a grain of salt.
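As a quick sanity check on the eyeballed slope, here is the tradeoff implied by just the two majors quoted above. The salary and meaningfulness figures are the ones in the post; the variable names are mine, and a two-point "slope" is of course only an illustration, not a regression.

```python
# Figures quoted above: (mid-career salary in dollars, percent saying work is meaningful)
statistics_major = (103_000, 34)
mathematics_major = (93_000, 46)

salary_gap = statistics_major[0] - mathematics_major[0]   # dollars
meaning_gap = statistics_major[1] - mathematics_major[1]  # percentage points

# Dollars of salary per percentage point of meaningfulness.
implied_slope = salary_gap / meaning_gap
print(f"Implied slope: {implied_slope:.0f} dollars per percentage point")
```

That comes out somewhat steeper than my eyeballed figure of negative 500, but with the same sign.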

Brookings has an interesting similar report, and Felix Salmon has a tool for exploring the PayScale data. If you listen to Slate Money you’ve heard him mention it.

Crystallographers and statisticians

If you like podcasts, and you like science, you should listen to Jim Al-Khalili’s Life Scientific, on BBC Radio 4. (I suppose if you’re in Britain you could listen on an actual radio.) Al-Khalili, a nuclear physicist, interviews prominent (mostly British) scientists about their life and work.

The most recent program was with Elspeth Garman, Professor of Molecular Biophysics at Oxford University. Garman’s work is in X-ray crystallography of proteins, which, as the BBC website puts it, is “…a technique that’s led to 28 Nobel Prizes in the last century.” Perhaps the one that comes to mind most easily is the prize that Watson, Crick, and Wilkins won for the structure of DNA. How many people remember Rosalind Franklin? (Fortunately for the Nobel committee, they could exclude her on the basis that she was dead by the time the prize was awarded…) There’s some interesting discussion in the interview about how crystallographers are generally working to improve methods that other scientists will use – so they are rarely the public face of the research.

Much the same is true of statisticians, at least before the current surge of interest in “data science”. Tukey put it as “the best thing about being a statistician is that you get to play in everyone’s backyard” – but they don’t necessarily let you into the house, and you’re certainly not going to be on the Christmas card.