Links for October 12

An exact fishy test, a Shiny app by Macartan Humphreys, via Andrew Gelman.

Beautiful Chemistry is a project from Tsinghua University Press and the University of Science and Technology of China, with beautiful close-up footage of chemical reactions.

Laura McLay in defense of model complexity, a counterpoint to her post in defense of model simplicity.

Pledge something to Relatively Prime: Series 2, Samuel Hansen’s newest series of podcasts. (You listened to Series 1, right?)

Michael Spivak has lecture notes on Elementary mechanics from a mathematician’s viewpoint (via MetaFilter).

Randomness: The ghost in the machine from Yohan J. John at 3 Quarks Daily.

Telegraph Research has done a quantitative analysis of pooled ridesharing.

DataGenetics on optimizing rope swings for distance traveled.

From Alex’s Adventures in Numberland, The man who loved only integer sequences. (Is this title a nod to the Erdős biography The man who loved only numbers?)

The UK has a National Numeracy charity. Among other things, they would like to shame celebrities who “boast of being no good at maths”.

Can Apple predict how long a file transfer takes?, from Rhett Allain at Wired, via Hacker News.

From colah, Visualizing MNIST: An exploration of dimensionality reduction. (That’s the MNIST database of handwritten digits.)

Thoughts on ranking employers

Via Alex Tabarrok at Marginal Revolution – OK, who am I fooling, via LinkedIn when it said that schools I’d attended were highly ranked – LinkedIn has put out university rankings based on career outcomes. They’ve done this in eight fields: accounting professionals, designers, finance professionals, investment bankers, marketers, media professionals, software developers, and software developers at startups. Here’s the blog post by Navneet Kapur that explains the rankings. Essentially, employer A is assumed to be more attractive than employer B if more people leave jobs at B to go to A than vice versa, and employers with less employee turnover are assumed to be more attractive. A school performs better on these rankings if its students work for better employers.
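As a toy version of that net-flow idea (LinkedIn’s actual methodology is surely more elaborate, and the data is theirs), here is a sketch in R, assuming a hypothetical data frame `moves` with one row per observed job change:

```r
# moves: hypothetical data frame with columns `from` and `to`,
# one row per observed job change between employers
emps  <- union(moves$from, moves$to)
flows <- table(factor(moves$from, levels = emps),
               factor(moves$to,   levels = emps))   # flows[a, b] = moves a -> b
net_inflow <- colSums(flows) - rowSums(flows)       # arrivals minus departures
sort(net_inflow, decreasing = TRUE)                 # crude attractiveness ranking
```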

This is basically revealed preference ranking for employers, although LinkedIn hasn’t made public the rankings of employers, only the rankings of schools. Some years ago there was a revealed preference ranking for schools[1] that was based on admissions data – school X is better than school Y if people who were admitted to both choose one over the other. It would be interesting to see a comparable ranking for employers, although that would be a much harder statistical problem. If I get into school X and school Y, the offers are roughly comparable; if I get job offers from employer A and employer B, the offers may not be. In addition, since general employment doesn’t have some centralized calendar, multiple offers are less likely – I would guess that most people in job searches in most fields take the first “good enough” offer that comes.

One idea that I’ve thought about on and off for a while – although it hasn’t crossed my mind nearly as frequently since leaving academia – is whether one could construct a ranking of graduate departments in a field by exploiting the fact that such departments are both educators and employers. A good academic department is one whose students end up working in good departments. Of course this is deeply academia-centric, and one would want to take into account where students who leave academia after their PhD end up as well. But if we ignore that and only look at the tip of the iceberg, it wouldn’t be too hard to put together the data set, at least in a field like mathematics – university web sites have lists of faculty, and the Mathematics Genealogy Project can tell you where they got their degrees.
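To make that concrete, here’s a minimal sketch in R. It assumes a hypothetical matrix `placements`, where `placements[i, j]` counts department i’s PhDs now on the faculty of department j; the recursive notion “a good department is one whose students end up at good departments” is just a leading-eigenvector computation, PageRank-style:

```r
# placements[i, j]: number of department i's PhDs now on the faculty of
# department j (hypothetical; faculty lists plus the Mathematics Genealogy
# Project could populate it)
P <- placements / rowSums(placements)  # row-normalize: where do i's students go?
# score_i = sum_j P[i, j] * score_j, i.e. score is the leading eigenvector of P
score <- abs(Re(eigen(P)$vectors[, 1]))
rank(-score)                           # 1 = best department under this score
```

In practice you’d want to handle departments whose graduates all leave academia (rows of zeros here) and probably damp the recursion, as PageRank does.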

1. Christopher N. Avery, Mark E. Glickman, Caroline M. Hoxby, and Andrew Metrick (2013). “A Revealed Preference Ranking of U.S. Colleges and Universities,” The Quarterly Journal of Economics 128(1), 425–467.

Don’t let your babies grow up to be culinary arts majors

From Christopher Ingraham at the Washington Post’s Wonkblog: Want to do what you love and get paid for it? Choose one of these majors, a post based on data from Payscale.com.

Although I don’t have the raw data, there’s a nice scatterplot showing for each major the mid-career salary and the percentage of people saying their work is meaningful, and there’s a negative correlation between the percentage saying work is meaningful and the pay. This would seem to imply that people take some proportion of their compensation in meaningfulness. If I had to eyeball the regression line, I’d say it has a slope of about negative 500 dollars per percent meaningfulness – that is, for every extra 1% that your job is meaningful, you take a $500 hit in annual pay. Note the ecological fallacy lurking here, though: a correlation across majors need not hold across individuals.
I’d be interested to see, within each major, how individuals’ reported meaningfulness correlates with pay; that would shed some more light.
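In regression terms, that’s the difference between the pooled slope and the within-major slope. With hypothetical individual-level data – columns `pay`, `meaningful`, and `major` – the comparison is one line each:

```r
# people: hypothetical individual-level data with columns pay, meaningful, major
lm(pay ~ meaningful, data = people)           # pooled slope, across majors
lm(pay ~ meaningful + major, data = people)   # within-major slope (major fixed effects)
```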

A fun fact: statistics majors make more than mathematics majors ($103K vs $93K) but think their work is less meaningful (34% vs 46%). Given that this is based on undergraduate majors, I wonder if this has to do with the choice of jobs that they take – in particular, math majors may be more likely to teach high school than statistics majors. This is based on small samples of my students, though – and the math majors I taught were at a different university than the statistics students – so take it with a grain of salt.

Brookings has a similar, interesting report, and Felix Salmon has a tool for exploring the PayScale data. If you listen to Slate Money you’ve heard him mention it.

Crystallographers and statisticians

If you like podcasts, and you like science, you should listen to Jim Al-Khalili‘s The Life Scientific, on BBC Radio 4. (I suppose if you’re in Britain you could listen on an actual radio.) Al-Khalili, a nuclear physicist, interviews prominent (mostly British) scientists about their life and work.

The most recent program was with Elspeth Garman, Professor of Molecular Biophysics at Oxford University. Garman’s work is in X-ray crystallography of proteins, which, as the BBC website puts it, is “a technique that’s led to 28 Nobel Prizes in the last century.” Perhaps the one that comes to mind most easily is the prize that Watson, Crick, and Wilkins won for the structure of DNA. How many people remember Rosalind Franklin? (Fortunately for the Nobel committee, they could exclude her on the basis that she was dead by the time the prize was awarded…) There’s some interesting discussion in the interview about how crystallographers are generally working to improve methods that other scientists will use – so they are rarely the public face of the research.

Much the same is true of statisticians, at least before the current surge of interest in “data science”. Tukey put it as “the best thing about being a statistician is that you get to play in everyone’s backyard” – but they don’t necessarily let you into the house, and you’re certainly not going to be on the Christmas card.

How rare are eighteen-inning games, really?

What’s More Improbable: An 18-Inning Playoff Game Or A 13-Inch Penis?, from Deadspin’s Regressing blog, on sports statistics. Ross Benes points out that postseason baseball games are on average 9.22 innings with a standard deviation of 0.79, so the 18-inning Nationals-Giants game is about eleven standard deviations from the mean, or as rare as the title phallus.

But this should raise a red flag – an eleven-sigma event, in a normally distributed population, should essentially never happen. (Doing the arithmetic in my head, roughly one time in 10^27, and there haven’t been more than a couple thousand playoff games.) Of course the culprit is that game lengths are not normally distributed. As Darren Glass and Philip Lowry have written, game lengths are actually modeled well by a quasi-geometric distribution. They claim that the probability that a game is still tied after n innings (for n ≥ 9) is Tk^{n-9}, where T is the probability of a game being tied after nine innings (about 0.103) and k is the probability of both teams scoring the same number of runs in a given extra inning (about 0.556). The basic idea is that once you finish nine innings, whether each extra inning leaves the game tied is an independent Bernoulli trial. (Think “weighted coin flip”, except that weighted coins don’t exist.) Under this model, the probability of a game being tied after seventeen innings (and therefore going at least eighteen) is 0.103 × 0.556^8, or about 0.00094 – just under one in a thousand. There have been perhaps fifteen hundred postseason games in history, so the fact that it’s taken this long for a one-in-a-thousand event to occur is not all that surprising.
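The arithmetic, in R, using Glass and Lowry’s estimates:

```r
t9 <- 0.103    # T: P(game still tied after nine innings), per Glass and Lowry
k  <- 0.556    # k: P(an extra inning leaves the game still tied)
t9 * k^8       # P(still tied after 17 innings, i.e. at least 18): ~0.00094
pnorm(-11)     # the naive eleven-sigma normal tail, for comparison: ~2e-28
```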

KDD panel: a data scientist’s guide to startups

A data scientist’s guide to startups, a panel whose participants included Oren Etzioni and Usama Fayyad, among others.

(It may not be obvious from the panelists’ current institutional affiliations, but many of them have worked for start-ups in the past or have advised them.)

There’s video of this panel as well, but I’ve never been a big fan of watching video of talking heads when there’s a transcript available.

A couple interesting points:

      • Oren Etzioni and Usama Fayyad point out that going into data science at a startup (presumably, for people like those in the audience at KDD 2013) is not risky. I’d tend to agree, and this was basically what I figured when I left academic life – although I’d want to see it backed up with examples of people who have gone back from industry to academia. There’s always the question of whether that is a one-way or two-way door.
      • There’s an interesting discussion of risk vs. reward – it’s possible that moving to a startup is a good move in expected value, but comes with more risk. Of course that’s a personal decision.

Note that this is all about venture-scale startups. Are there what folks involved in the VC ecosystem sometimes call “lifestyle businesses” – companies that aren’t aiming for meteoric growth but support their founders and some employees, and that people outside that ecosystem would just call “businesses” – that are dependent on data science?

Links for October 5

Allen Downey on when we will see a two-hour marathon. (Basically, extrapolate the current world record progression linearly, but there are good theoretical reasons to expect this to make sense.)
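The back-of-the-envelope version of that extrapolation, in R, assuming a hypothetical data frame `records` of world-record times by year (Downey’s analysis is more careful than this):

```r
# records: hypothetical data frame with columns year and seconds,
# the marathon world-record progression
fit <- lm(seconds ~ year, data = records)
unname((7200 - coef(fit)[1]) / coef(fit)[2])   # year the fit crosses 2:00:00
```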

Robert Smallshire (of the Norway-based Sixty North) on Predictive Models of Development Teams and the Systems They Build.

Todd Schneider on How Many Paths are Possible in an 18 Hole Round of Match Play Golf?. (Match play golf is played on a hole-by-hole basis, with the winner being the one who wins the most holes.) Schneider is kicking ass lately; he put together a traveling salesman app in Shiny.

Adam Piore at Nautilus on why we keep playing the lottery.

Evelyn Lamb at Scientific American on Leslie Lamport’s ideas on how to write proofs. (From the abstract to Lamport’s paper: “A method of writing proofs is described that makes it harder to prove things that are not true.” This is in response to a lecture Lamport gave at the Heidelberg Laureate Forum; see also the talks there by Martin Hairer (“Taming infinities”, i.e. renormalization) and Wendelin Werner (“Randomness, continuum, and complex analysis”).)

Jeremy Kun on making hybrid images (which look like one thing from close up and another thing from far away) using Fourier transforms.
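The core trick is a low-pass filter on one image plus a high-pass filter on the other. Here’s a minimal sketch in R, assuming `img1` and `img2` are same-sized greyscale matrices and treating the cutoff as a knob to fiddle with; Kun’s post works through the details properly:

```r
# Keep only frequencies within `cutoff` of zero: a crude low-pass filter.
lowpass <- function(img, cutoff) {
  ft <- fft(img)
  nr <- nrow(img); nc <- ncol(img)
  fr <- pmin(0:(nr - 1), nr - 0:(nr - 1))   # frequency distance, with wraparound
  fc <- pmin(0:(nc - 1), nc - 0:(nc - 1))
  keep <- outer(fr, fc, function(a, b) sqrt(a^2 + b^2)) <= cutoff
  Re(fft(ft * keep, inverse = TRUE)) / (nr * nc)   # R's inverse fft is unnormalized
}
# Low frequencies of img1 (what you see from afar) plus high frequencies of
# img2 (what you see up close).
hybrid <- lowpass(img1, 10) + (img2 - lowpass(img2, 10))
```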

Vi Hart at EleVR on camera balls for mono spherical video: how do you put a bunch of cameras that sense rectangular images together to make a “camera ball” which sees in all directions?

Jim Albert at Chance explores streakiness in home run hitting. (Summary: there’s evidence that streakiness exists, because there are more streaks than you’d expect if it didn’t, but that evidence is too weak to pick out individual streaky players.)

Susan Marshall and Donald Smith wrote an article for Mathematics Magazine, Feedback, control, and distribution of prime numbers; it won an expository award from the MAA.

A few links to recent stories on Bayesian statistics

A few links to recent stories on Bayesian statistics:

(hat tip to Tamara Broderick for the Butterworth article.)

Clustering of bigram frequencies

Rick Wicklin at the SAS blog writes on the frequency of bigrams in an English corpus. In English: how often does a pair of letters, such as “TH” or “QZ”, appear in English text? This is a follow-up to a previous post on the frequency of letters in an English corpus and builds on an analysis by Peter Norvig of letter frequencies in the Google Books corpus.

The post on bigrams ends with a heatmap of the distribution of bigrams. It’s what Howard Wainer calls an “Alabama first” graphic – the letters are in alphabetical order. Now, if there’s ever a time that calls for alphabetical order, it’s when ordering the alphabet. But what can we learn by putting the letters in another order?

The “heatmap” command in R does exactly this, hierarchically clustering both the rows and columns. (The command “hclust”, which does the clustering under the hood, does complete-linkage clustering by default.) If we cluster on log counts, we get the image below:

[Figure: hierarchically clustered heatmap of log bigram counts]

where first letters correspond to columns (listed along the bottom) and second letters to rows (listed along the right). More intense blue corresponds to more frequent bigrams, so this roughly replicates Wicklin’s heat map. Not surprisingly, the vowels end up together.
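For the record, this takes only a line or two of R, assuming a hypothetical 26-by-26 matrix `counts` of bigram counts with dimnames both equal to `LETTERS` (rows indexed by second letter, columns by first, as in the figure):

```r
# counts: hypothetical 26x26 matrix of bigram counts, dimnames = LETTERS.
# Add 1 before taking logs so bigrams that never occur don't give log(0).
heatmap(log(counts + 1), scale = "none",
        col = colorRampPalette(c("white", "blue"))(64))
```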

Alternatively, we can associate with each bigram a number which is its frequency divided by its expected frequency if adjacent letters were independent of each other. This number is greater than 1 (represented by a blue color below) if the letters like to be arranged that way, and less than 1 (red) if the letters don’t like this.

[Figure: hierarchically clustered heatmap of bigram observed-to-expected ratios]

There are some clear color patterns, which could be useful in, e.g., cryptography. For example, vowels tend to be preceded and followed by consonants (the blue patches in the upper left and lower center, respectively). Certain consonants (the ones at the right: H, F, V, J, W, Q) really don’t like to be followed by other consonants, but other consonants (S, L, R, N) don’t mind. What else do you see?
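The observed-over-expected version is only slightly more work (same hypothetical `counts` matrix as above; getting the palette centered exactly at a ratio of 1 takes more fiddling than this sketch bothers with):

```r
# Expected counts if adjacent letters were independent: product of marginals.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
ratio    <- counts / expected
# log2(ratio) is 0 for "indifferent" pairs; clamp zeros to avoid -Inf.
heatmap(log2(pmax(ratio, 2^-6)), scale = "none",
        col = colorRampPalette(c("red", "white", "blue"))(64))
```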

Side note: the bonus round in Wheel of Fortune spots you the letters R, S, T, L, N, and E. But the most common five consonants are (in order) T, N, S, R, H; L is sixth. Why the discrepancy? This is brought up by Wicklin’s commenter Quentin. I’d add that under the current rules you’re allowed to then guess three more consonants and a vowel; for a while, if I remember correctly, the usual choices were C, D, M, and A. C, D, and M are the seventh-, eighth-, and ninth-most-common consonants, still omitting H. Ben Blatt, at Slate, suggests that B, G, H, O is better, in the sense that it will reveal more letters in a typical puzzle.