Today in p-values

From Nature, by Jeffrey T. Leek and Roger D. Peng: p-values are just the tip of the iceberg (that is, of the ways statistics are misused). Here’s Leek’s post at Simply Statistics giving some examples of subcultures of data analysis.

Andrew Gelman also posted today on good, mediocre, and bad p-values, quoting an article he wrote in 2012, P-values and statistical practice.

Stark and Freishtat on course evaluations

Philip Stark (UC Berkeley statistics) and Richard Freishtat (UC Berkeley Center for Teaching and Learning, which supports undergraduate education) have written An evaluation of course evaluations.

Stark and Freishtat observe, among other things:

that nonresponse bias is a serious problem;
that averaging ordinal variables doesn’t make sense (are a 3 and a 7 on a seven-point scale the same as two 5s?);
that students can more effectively comment on some aspects of pedagogy than others;
that student evaluations are influenced by student grade expectations, and by instructor gender, age, ethnicity, and attractiveness…

If anybody should know how hard measurement is, it’s statisticians.

With this in mind, I’m amused to remember that when I was there, teaching evaluation averages by instructor and course were actually posted on a bulletin board, with other information relevant to students, outside the departmental office on the third floor of Evans Hall. Apparently the department has a more “holistic” procedure in place now for evaluating teaching; I was not at Berkeley long enough to comment on the old process. (Two academic years, as a lecturer.)

To be honest, I often found student comments more useful than grades – but it is difficult to read those comments. The format of the evaluations and the fact that they’re usually given at the end of a class period seem designed to discourage thorough comments (and Stark and Freishtat point out that comments in evaluations of technical courses tend to be less discursive). And the most critical comments tend to stick in one’s craw, which is only human nature.

Coo, shiver my sceptre!

A couple weeks ago James Grime linked on Twitter to a puzzle in the January 8, 1981 issue of New Scientist, which runs as follows:

“Beauty? Courage? Generosity? Patience? Wisdom? Which do you wish for the new-born Princess?” asked the Good Fairy.

“Beauty and Wisdom will do nicely, thank you,” replied the King, not wanting to seem greedy.

“Wait! For each gift you name, I shall bestow on her two of the other gifts instead. Each name of a gift triggers a different pair. Each gift is triggered by two of the names. But if you mention both names, they cancel out and she will not get that gift at all.”

“Coo, shiver my sceptre!” exclaimed His Majesty.

“Quite simple! For instance if you ask for Beauty and Courage, she will receive Generosity and Patience. If you ask for Beauty, Generosity and Patience, she will receive those three and Wisdom too. You in fact wished for Beauty and Wisdom. She shall have them, provided you ask for them in the simplest way.”

What gift or gifts should His Majesty ask for?

So we can associate each of the five gifts — let’s denote them by $B, C, G, P, W$ — with two of the others. Let’s denote this by $f(B) = X + Y$ , for example, where $X$ and $Y$ are the two gifts triggered by $B$ . So what we know is

$f(B) + f(C) = G + P$

and

$f(B) + f(G) + f(P) = B + G + P + W.$

Formally, $B, C, G, P, W$ are generators of $(\mathbb{Z}/2\mathbb{Z})^5$; informally they’re symbols that cancel out when added to themselves. Now, this function $f$ is a homomorphism from $(\mathbb{Z}/2\mathbb{Z})^5$ to itself – that is, $f(x+y) = f(x) + f(y)$. (This means we can do the cancelling either before or after translating from the language of what was wished for to what actually happens.) So we know

$f(B+C) = G+P, f(B+G+P) = B+G+P+W$ .

Adding these together we get

$f(C+G+P) = B+W$

which gives a way to get both beauty and wisdom — namely, asking for courage, generosity, and patience.

But what’s $f(B+C+G+P+W)$ – that is, what do you get if you ask for everything? We have that “each name of a gift triggers a different pair, and each gift is triggered by two of the names”. So we have

$f(B+C+G+P+W) = 2B+2C+2G+2P+2W$

since each gift is triggered twice. And the right-hand side there is just zero. So

$f(B+C+G+P+W) = 0$

and we can rewrite this:

$f((C+G+P) + (B+W)) = 0.$

But since $f$ is a homomorphism that’s just

$f(C+G+P) + f(B+W) = 0$

which is another way of saying $f(C+G+P) = f(B+W)$. So in fact $f(B+W) = B+W$ – that is, to get beauty and wisdom, just ask for them.

To see that this is the simplest possible solution, we need to show that no single gift triggers both beauty and wisdom. I can’t come up with a “clean” way to do this, but we can go brute force. Since $f(B) + f(W) = B + W$ and no gift triggers itself, beauty must trigger wisdom, wisdom must trigger beauty, and beauty and wisdom must both trigger the same other gift. This gift can’t be courage – if it were, then we’d have $f(B) = W+C$, but we know $f(B+C) = G+P$, so adding these we’d have $f(C) = W+C+G+P$, which is impossible because each name triggers exactly two gifts. So the “other” gift is either generosity or patience. Repeatedly applying the constraint that every gift is triggered exactly twice, together with the facts we already know, leads to the solutions

$f(B) = W+G, f(C) = W+P , f(G) = C+P, f(P) = B+C, f(W) = B+G$

and

$f(B) = W+P, f(C) = W+G, f(G) = B+C, f(P) = C+G, f(W) = B+P$

respectively, which differ by exchanging $G$ and $P$ wherever they appear. In no case is there a single gift $X$ with $f(X) = B+W$ , so the solution $f(B+W) = B+W$ is indeed the simplest one.
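The brute-force step can also be checked by machine. Here is a minimal Python sketch, not part of the original puzzle or its published solution. It encodes each set of gifts as a 5-bit mask, so that addition in $(\mathbb{Z}/2\mathbb{Z})^5$ becomes XOR, enumerates every assignment of pairs that satisfies the fairy's rules and her two worked examples, and then searches for the shortest wish that yields Beauty and Wisdom. The helper names (`mask`, `apply_wish`, `solve`, `shortest_wishes`) are my own, chosen for illustration.

```python
from itertools import combinations, product

GIFTS = "BCGPW"
BIT = {g: 1 << i for i, g in enumerate(GIFTS)}


def mask(names):
    """Encode a string of gift letters as a bitmask (sum mod 2 = XOR)."""
    m = 0
    for g in names:
        m ^= BIT[g]
    return m


def names(m):
    """Decode a bitmask back into gift letters, in B, C, G, P, W order."""
    return "".join(g for g in GIFTS if m & BIT[g])


def candidate_pairs(g):
    """All pairs of *other* gifts that naming g could trigger."""
    others = [h for h in GIFTS if h != g]
    return [mask(pair) for pair in combinations(others, 2)]


def apply_wish(f, wish):
    """Extend f linearly: XOR together the pairs of every named gift."""
    out = 0
    for g in GIFTS:
        if wish & BIT[g]:
            out ^= f[g]
    return out


def solve():
    """Return every f consistent with the rules and the two examples."""
    solutions = []
    for choice in product(*(candidate_pairs(g) for g in GIFTS)):
        if len(set(choice)) != len(GIFTS):
            continue  # each name must trigger a different pair
        if any(sum(1 for p in choice if p & BIT[g]) != 2 for g in GIFTS):
            continue  # each gift must be triggered by exactly two names
        f = dict(zip(GIFTS, choice))
        if apply_wish(f, mask("BC")) != mask("GP"):
            continue  # first example from the Good Fairy
        if apply_wish(f, mask("BGP")) != mask("BGPW"):
            continue  # second example from the Good Fairy
        solutions.append(f)
    return solutions


def shortest_wishes(f, target):
    """All wishes of minimal length whose outcome is exactly target."""
    hits = [w for w in range(1, 1 << len(GIFTS)) if apply_wish(f, w) == target]
    best = min(bin(w).count("1") for w in hits)
    return sorted(names(w) for w in hits if bin(w).count("1") == best)


if __name__ == "__main__":
    for f in solve():
        print({g: names(p) for g, p in f.items()},
              shortest_wishes(f, mask("BW")))
```

Enumerating all 6⁵ = 7776 assignments is cheap, so there is no need to prune by hand. The search never assumes $f(B+W) = B+W$, so it is an independent check on the argument above.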

Note: Jim Randell has been doing programmatic solutions to these puzzles for quite a while now.

Help with social science about science

Perhaps of interest to many of my readers: Melanie Sinche, a career counselor and consultant, and research associate at Harvard Law’s Labor and Worklife Program, is working on a study of the career paths of PhD recipients in the sciences. If you:

– Earned a PhD in any of the physical, computational, social, life sciences or engineering between 2004 and 2014

– Have ever worked, trained, or studied in the U.S.

then you should fill out this survey and help to paint a more accurate picture of what PhD recipients actually do.

Edited, April 16: turns out I forgot to link to the actual survey.

(H/T Tamara Broderick)

Various computations of the odds of a perfect March Madness bracket

Joseph Nebus has written a series of posts on the entropy in basketball results: for a single team, for both teams, in the win-loss results for a 64-team tournament like March Madness. For the final question he gets an answer of about 48 bits. That is, the probability of guessing the winners and losers of a tournament correctly is on the order of 1 in 2⁴⁸.

From blind guessing one gets 1 in 2⁶³, quoted for example in this USA Today story, with the caveat that:

Even so, allowing for some knowledge of college basketball and taking it account the norms of the NCAA tournament, the odds of a perfect bracket are still about 1 in 128 billion, according to DePaul math professor Jay Bergen.

This refers to Jeff Bergen’s video, “where does 1 in 128 billion come from”. Note 128 billion is roughly 2³⁷. Bergen’s strategy is to assume that the top seeds always win, since this is the most likely outcome.

The fact that two reasonable people gave two such different answers is an example of just how hard it is to estimate small probabilities. But both of these models gave up on using empirical data after the first round. Yet matchups between the “favorite” seeds should happen fairly often, and there will be data! Let’s look at win probabilities by seed as compiled at mcubed.net. In a tournament where all the favorites win, we’ll have:

in the first round, four matches each of 1 vs. 16, 2 vs. 15, …, 8 vs. 9, one in each region
in the second round, four matches each of 1 vs. 8, 2 vs. 7, 3 vs. 6, 4 vs. 5, one in each region
in the third round, four matches each of 1 vs. 4, 2 vs. 3, one in each region
in the fourth round, four matches of 1 vs. 2, one in each region
in the Final Four, three matches of 1 vs. 1, from different regions, of course

Historically, 1 seeds are 124-0 against 16 seeds, 2 seeds are 117-7 against 15 seeds, and so on until 8 seeds are 79-69 against 9 seeds. So the probability of picking all eight first-round games in one region perfectly is

$(124/124)(117/124)(104/124)(99/123)(97/144)(95/144)(90/148)(79/148) = 0.09188$

and the probability of getting all 32 first-round games right is the fourth power of this, about $7.12 \times 10^{-5}$ or one in 14,000. (The different denominators correspond to different numbers of times each matchup has occurred, presumably due to changes in the tournament structure; the 64-team field only dates back to 1985.) Oddly enough, the Washington Post reports that nobody ever seems to pick a perfect first round. This isn’t a contradiction – nobody is boring enough to pick the strategy with the highest expected value for that bet, when the bet most people are interested in is trying to win their pool.

The probability of picking a perfect second round in any given region is $(65/81)(64/88)(46/84)(49/88) = 0.17796$ ; for all four regions it’s the fourth power of this, about $1.00 \times 10^{-3}$ .

The third round in each region consists of a 1-vs-4 game and a 2-vs-3 game, where the favorites win with probability 46/68 and 36/59 respectively; the probability of picking all eight third round games correctly is $((46/68)(36/59))^4 = 0.0290$ .

The fourth round in each region is a 1-vs-2 game, where the 1 seed has historically won with probability 38/69; the probability of picking all four correctly is $(38/69)^4 = 0.0919$ .

Finally, the probability of picking all three Final Four games correctly is 1/8 – the model knows nothing beyond seeding.

Multiplying this all out, I get that the probability of picking all 63 games correctly is

$(7.12 \times 10^{-5}) (1.00 \times 10^{-3}) (0.0290) (0.0919) (1/8) = 2.38 \times 10^{-11}$

or about one in 42 billion, in a generic tournament. For what it’s worth, FiveThirtyEight gave 1 in 1.6 billion this year and 1 in 7.4 billion last year, using a model that actually knew something about basketball.
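The arithmetic above is easy to reproduce. The following Python sketch multiplies out the favorites' records exactly as quoted in this post, using exact fractions until the final conversion to a float. The win-loss numbers are copied from the text, not fetched from mcubed.net, and the names (`ROUND1`, `region_round`, `perfect_bracket`) are mine.

```python
from fractions import Fraction
from math import log2

# Favorite's record as (wins, games played), in the order listed in the text.
ROUND1 = [(124, 124), (117, 124), (104, 124), (99, 123),
          (97, 144), (95, 144), (90, 148), (79, 148)]
ROUND2 = [(65, 81), (64, 88), (46, 84), (49, 88)]
ROUND3 = [(46, 68), (36, 59)]
ROUND4 = [(38, 69)]


def region_round(records):
    """Probability that every favorite wins one region's games in a round."""
    p = Fraction(1)
    for wins, games in records:
        p *= Fraction(wins, games)
    return p


def perfect_bracket():
    """Probability of an all-favorites bracket being right in all 63 games."""
    per_region = Fraction(1)
    for records in (ROUND1, ROUND2, ROUND3, ROUND4):
        per_region *= region_round(records)
    # Four independent regions, then three Final Four coin flips (1/8).
    return per_region ** 4 * Fraction(1, 8)


if __name__ == "__main__":
    p = float(perfect_bracket())
    print(f"one region, first round: {float(region_round(ROUND1)):.5f}")
    print(f"perfect bracket: {p:.3e}, about 1 in {1 / p / 1e9:.0f} billion")
    print(f"equivalently about {-log2(p):.1f} bits")
```

The last line converts the probability to bits, which makes it directly comparable with the 48-bit and 2³⁷ figures quoted earlier.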

Distribution of sick time

Carl Erickson, the president of a small software company, writes that sick time follows a logarithmic distribution. His terminology is a bit nonstandard, but here’s what he’s saying. Take all the people who have worked for his company and list the amount of sick time they’ve taken, in hours. Sort that list in descending order. Then the xth entry on the list will be about -52 ln x + 236. There were a total of 86 employees.

Does this translate into a more standard statement about the distribution of the amount of sick time taken? Let F(z) be the probability that someone took at least z hours of sick time. Then we have F(236 – 52 ln x) = x/86. Let z = 236 – 52 ln x and solve for x in terms of z. This gives x = exp((236 – z)/52). So we get

F(z) = exp((236 – z)/52) / 86

and as commenters at Hacker News pointed out, exp(236/52) is close to 86 (it is about 93.5), so we very roughly have

F(z) = exp(-z/52)

which also forces F(0) = 1, which is necessary because amounts of sick time must be nonnegative. This is exactly the exponential distribution – which is memoryless. So sick time is exponential with mean 52 hours.
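To see how rough the approximation is, here is a small Python sketch of the two survival functions. The constants 52, 236, and 86 come from Erickson's fit as described above; the function names are mine, and no actual sick-time data is used.

```python
import math

SLOPE = 52.0       # hours per unit of ln(rank), from the fitted line
INTERCEPT = 236.0  # fitted hours for the employee ranked first
EMPLOYEES = 86     # number of employees in the list


def hours_at_rank(x):
    """Fitted sick time (hours) of the x-th entry in the descending list."""
    return INTERCEPT - SLOPE * math.log(x)


def survival_fitted(z):
    """F(z) implied by the fit: fraction of employees with at least z hours."""
    return math.exp((INTERCEPT - z) / SLOPE) / EMPLOYEES


def survival_exponential(z):
    """Survival function of an exponential distribution with mean 52 hours."""
    return math.exp(-z / SLOPE)


if __name__ == "__main__":
    print(f"exp(236/52) = {math.exp(INTERCEPT / SLOPE):.1f}")
    for z in (0, 52, 104, 236):
        print(z, round(survival_fitted(z), 3), round(survival_exponential(z), 3))
```

The two functions differ by the constant factor exp(236/52)/86, which is a little above 1, so the fitted curve slightly exceeds 1 at z = 0. Replacing that factor by 1 is what turns the fit into a proper exponential distribution.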

Is there a good theoretical reason that this should happen, though? The exponential distribution is memoryless but I don’t see why sickness times should be, especially since we’re talking about the total amount of time that people spend sick, not the time they spend dealing with any single illness. Or is this just an example of everything looking linear (with the right variable transformation) if you try hard enough?